One of our Java applications deployed on VMware Tanzu (formerly known as Pivotal Cloud Foundry, or PCF) started crashing every few hours. The application handles nearly 30 million calls per day with sub-second response times.
There had been no recent deployment, so the frequent crashes came as a surprise to the team.
The Tanzu metrics screen showed that memory usage was consistently reaching around 90% of its allocation before the application started slowing down. After a few minutes of degraded performance, the app would crash.
This confirmed a memory management issue in the application. The next step was to determine whether it was caused by faulty application code or by the third-party APIs in use.
Around 20 applications within the organization consume the services exposed by the troubled application. Hence, the team needed to find and resolve the root cause quickly to avoid downtime in some critical sales and CRM applications.
Once the team confirmed the issue was memory-related, the next step was to plug into the PCF nodes and determine what was causing the memory leak. It took a couple of hours to trace the root cause: an increase in resource utilization driven by an exponential rise in call volume. The team quickly updated the code to better handle the volume and deployed the fix. The application started working like a charm, even with the higher load.
Below, I detail the tools used and the steps taken to identify the root cause of the memory leak. They can help you with quick root cause analysis (RCA) and resolution if you face a similar memory usage issue in an application deployed on VMware Tanzu (PCF).
PCF Metrics
VMware Tanzu provides a well-designed tool for tracking the metrics of your application. The tool tracks the following metrics:
Average Request Latency
Requests per minute
Request errors per minute
CPU usage
Disk
Memory
Here, we are interested in the ‘Memory’ metric. It gives a clear indication of how much memory is being used. We can choose to see the metric as the average, maximum, minimum, or total usage compared to the total memory allocated to the application.
Below is a screenshot of the PCF metrics dashboard showing the average usage view of the resources.
In the following diagram, you can see the same metrics for each instance running for the app. The charts give a clear idea of whether all the nodes are behaving the same way or whether just a few outliers are causing issues.
In our case, all the nodes behaved in more or less the same pattern, i.e., they were all hitting the 90% memory usage level within a few hours of their last restart.
Looking at the above metrics dashboard, the team was confident that there was something wrong with the memory utilization.
Enable JMX to Troubleshoot Memory Leak
Generally, memory leak issues arise because of incorrect handling of objects in the code. When application code creates objects to hold data but does not release the references appropriately, the Java garbage collector cannot consider them for garbage collection. These objects hang around in memory long after the application is done with them. With time, or under heavy volume, the problem grows until it starts causing memory issues.
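As an illustration, here is a minimal, hypothetical sketch of the leak pattern described above: a static cache that only grows. The class and field names are made up for this example; the point is that every entry stays strongly referenced, so the garbage collector can never reclaim it.

import java.util.HashMap;
import java.util.Map;

public class RequestCache {

    // Entries are added on every call but never removed, so they stay
    // strongly referenced and are never eligible for garbage collection.
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    public static byte[] handleRequest(String requestId) {
        byte[] payload = new byte[1024];
        CACHE.put(requestId, payload);
        // ...process the payload, but never call CACHE.remove(requestId)...
        return payload;
    }
}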
There are several ways to discover what is causing the memory issue and which objects are hogging the memory. One way is to take a heap dump of the instances to find out which objects are alive at the point in time when the dump is taken.
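For reference, a heap dump can also be triggered programmatically from inside the JVM through the HotSpot diagnostic MXBean. This is only a minimal sketch, and the output path is a made-up example.

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        // Look up the HotSpot diagnostic MXBean of the running JVM.
        HotSpotDiagnosticMXBean diagnostic =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);

        // Dump only live (reachable) objects to the given file; the path is hypothetical.
        diagnostic.dumpHeap("/tmp/app-heap.hprof", true);
    }
}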
The other way is to enable JMX and connect directly to the JVMs. I prefer this method because it provides a live view of how objects build up inside the application. It also provides a way to capture snapshots and compare object counts and memory usage between different points in time.
Here are the steps to enable JMX for nodes on the Tanzu platform.
Through the admin console, set the following variable in the ‘User-Provided Environment Variables’ under the ‘Settings’ menu of the application.
key = JBP_CONFIG_JMX | value = {enabled: true}
You can do the same through the cf CLI using the command below.
cf set-env rspusage JBP_CONFIG_JMX "{enabled: true}"
Once you save the settings, you need to ‘restage’ the application using the following command. Simply restarting the application will not activate the new setting.
cf restage <add your application name here>
Now the application will allow you to create an SSH tunnel into it. Run the following command in the cf CLI from your local machine. The prompt will move to the next line and appear to do nothing. Wait a few seconds; if the prompt does not throw any error, you are successfully connected.
cf ssh -N -T -L 5000:localhost:5000 <add your application name here>
With the above steps, your application deployed on the Tanzu platform is now enabled for JMX. The next step is to connect to the local port using a tool that supports JMX.
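If you prefer scripting a quick check rather than using a GUI, you can also connect to the tunnelled port with the standard JMX client API. The sketch below assumes the buildpack exposes the default RMI connector at /jmxrmi on port 5000; the class name and output format are made up for this example.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class HeapUsageProbe {
    public static void main(String[] args) throws Exception {
        // Assumption: the default RMI connector is reachable through the
        // cf ssh tunnel opened earlier on localhost:5000.
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:5000/jmxrmi");

        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection connection = connector.getMBeanServerConnection();

            // Proxy the remote JVM's MemoryMXBean and read its current heap usage.
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                connection, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);

            MemoryUsage heap = memory.getHeapMemoryUsage();
            System.out.printf("Heap used: %d MB of %d MB committed%n",
                heap.getUsed() / (1024 * 1024), heap.getCommitted() / (1024 * 1024));
        } finally {
            connector.close();
        }
    }
}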
Connect to SSH Port to Read JVM Stats
Many graphical monitoring tools can help you monitor the JVM and Java applications from your local system, e.g., JConsole, VisualVM, etc.
While working on the memory issue, we used VisualVM to monitor the JVM in our Tanzu nodes. Connecting to instances using VisualVM is pretty easy, and it offers a simple UI to view various metrics. You can download the latest VisualVM software from the link given in the references section of this article.
Once you have downloaded the software, unzip it and double-click the application file inside the bin directory. It will open the VisualVM app on your local machine.
In the VisualVM application, click the File > Add JMX Connection submenu. It will open a dialog box for entering the JMX connection details.
In the ‘Add JMX Connection’ dialog, enter ‘localhost:5000’ in the ‘Connection’ field. Remember, you used port 5000 in the previous steps to open the SSH tunnel to the PCF instance.
You may also change the ‘Display name’ field to something that quickly tells you which application instance you are connecting to in your VisualVM app. This will be helpful when you take multiple snapshots later.
You don’t need to enter any other details. Click on the ‘OK’ button. The VisualVM app will try to add the JMX connection by connecting to port 5000. It generally takes a few seconds to add the connection. Once successfully connected, it will show the new connection under the ‘Local’ menu in the ‘Applications’ tab.
If the app cannot connect, it will throw an error like the one below.
Go back to the command prompt where you opened the SSH connection and verify whether the connection has terminated. If so, restart the SSH connection using the ‘cf ssh …’ command you used the first time.
Now that you are all set with the VisualVM setup, you can open the connection you just added and start monitoring. Right-click on the connection name and select the ‘Open’ option.
VisualVM will take a few seconds to open the connection and display the following screen.
The ‘Sampler’ tab helps you perform CPU and memory sampling. Once you click the tab, it takes a few seconds to analyze and then shows a message indicating whether sampling is available for the required resource.
Next, click the ‘Memory’ button to start collecting memory data. You will now be able to see object utilization live in action: it shows the live bytes and the count of live objects.
The easiest way to monitor which objects are accumulating in memory without getting garbage collected is to use the ‘Delta’ option. The Delta view shows the bytes and object counts added to the JVM memory after you started the connection.
If particular objects keep increasing in bytes and counts and never reset over time, that is a point of concern. Especially if the objects at the top are your custom objects, check the code where you use them to find out whether they are released successfully after use.
The ‘Monitor’ tab is also handy for checking resource utilization. If your heap size (the orange graph) continuously increases over time and does not come down even after multiple garbage collection cycles (as shown in the screenshot below), that is a problem area to investigate for the memory leak.
Once you analyze the memory utilization using the above steps, you will be able to pinpoint which objects are causing the memory leak. Go back to the relevant code and refactor it, making sure you release all unused objects appropriately. Garbage collection cannot reclaim memory that is still referenced by your code.
Hence, it is always good practice to explicitly release references to objects that are no longer required during processing, as in the sketch below.
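Continuing the hypothetical cache example from earlier, the fix is simply to drop the reference once the request has been processed so the garbage collector can reclaim the payload. Names and sizes are illustrative only.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RequestCache {

    private static final Map<String, byte[]> CACHE = new ConcurrentHashMap<>();

    public static void handleRequest(String requestId) {
        byte[] payload = new byte[1024];
        CACHE.put(requestId, payload);
        try {
            process(payload);
        } finally {
            // Remove the entry once processing is done so the garbage
            // collector can reclaim the payload.
            CACHE.remove(requestId);
        }
    }

    private static void process(byte[] payload) {
        // Business logic would go here.
    }
}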
When you are on a pay-per-use model, ineffective utilization of cloud resources wastes money without adding any business value to your company. Even on a fixed-price model, it is good practice to optimize your cloud resource utilization to maximize the return on your investment.
Go ahead and play around with the SSH connection and the VisualVM tool to understand your application's JVM performance and resource utilization. It will help you quickly troubleshoot memory leaks and other resource utilization issues.
Thank you for reading the article. I hope you find it helpful.
References:
Link to download VisualVM — https://visualvm.github.io/download.html
Link to Tanzu site — https://tanzu.vmware.com/application-modernization-recipes/observability/enabling-jmx-for-pivotal-platform-java-apps