Sun’s Troubleshooting Guide for Java
Sun’s Troubleshooting Guide for Java provides a nice overview of the tools available to analyze and monitor Java processes. I would guess that a lot of Java developers don’t know them.
I recently had to identify a performance problem on one of our production servers. The application appeared to hang every few seconds and then continue normal operation. Using jstat I quickly discovered that there were a lot of objects created just to be garbage collected right after. This was the output of jstat:
S0 S1 E O P YGC YGCT FGC FGCT GCT LGCC GCC
0,00 0,00 100,00 100,00 89,75 10546 726,162 1192 12922,944 13649,106 Allocation Failure unknown GCCause
0,00 0,00 100,00 100,00 89,44 10546 726,162 1192 12922,944 13649,106 Allocation Failure unknown GCCause
0,00 0,00 21,93 100,00 89,47 10546 726,162 1192 12934,808 13660,970 unknown GCCause No GC
0,00 0,00 32,41 100,00 89,48 10546 726,162 1192 12934,808 13660,970 unknown GCCause No GC
0,00 0,00 42,44 100,00 89,48 10546 726,162 1192 12934,808 13660,970 unknown GCCause No GC
S0 S1 E O P YGC YGCT FGC FGCT GCT LGCC GCC
0,00 0,00 56,79 100,00 89,48 10546 726,162 1192 12934,808 13660,970 unknown GCCause No GC
0,00 0,00 68,61 100,00 89,48 10546 726,162 1192 12934,808 13660,970 unknown GCCause No GC
0,00 0,00 78,02 100,00 89,48 10546 726,162 1192 12934,808 13660,970 unknown GCCause No GC
0,00 0,00 88,74 100,00 89,48 10546 726,162 1192 12934,808 13660,970 unknown GCCause No GC
0,00 0,00 100,00 100,00 89,48 10546 726,162 1193 12934,808 13660,970 Allocation Failure unknown GCCause
S0 S1 E O P YGC YGCT FGC FGCT GCT LGCC GCC
0,00 0,00 100,00 100,00 89,48 10546 726,162 1193 12934,808 13660,970 Allocation Failure unknown GCCause
0,00 0,00 100,00 100,00 89,48 10546 726,162 1193 12934,808 13660,970 Allocation Failure unknown GCCause
0,00 0,00 100,00 100,00 89,48 10546 726,162 1193 12934,808 13660,970 Allocation Failure unknown GCCause
0,00 0,00 100,00 100,00 89,48 10546 726,162 1193 12934,808 13660,970 Allocation Failure unknown GCCause
0,00 0,00 100,00 100,00 89,48 10546 726,162 1193 12934,808 13660,970 Allocation Failure unknown GCCause
S0 S1 E O P YGC YGCT FGC FGCT GCT LGCC GCC
0,00 0,00 100,00 100,00 89,48 10546 726,162 1193 12934,808 13660,970 Allocation Failure unknown GCCause
0,00 0,00 100,00 100,00 89,48 10546 726,162 1193 12934,808 13660,970 Allocation Failure unknown GCCause
0,00 0,00 100,00 100,00 89,48 10546 726,162 1193 12934,808 13660,970 Allocation Failure unknown GCCause
0,00 0,00 100,00 100,00 89,48 10546 726,162 1193 12934,808 13660,970 Allocation Failure unknown GCCause
0,00 0,00 100,00 100,00 89,48 10546 726,162 1193 12934,808 13660,970 Allocation Failure unknown GCCause
S0 S1 E O P YGC YGCT FGC FGCT GCT LGCC GCC
0,00 0,00 100,00 100,00 89,48 10546 726,162 1193 12934,808 13660,970 Allocation Failure unknown GCCause
0,00 0,00 24,36 100,00 89,51 10546 726,162 1193 12946,347 13672,509 unknown GCCause No GC
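The “E” column shows the percentage of Eden space in use. While this value stays at 100%, the application doesn’t respond until the garbage collection has finished. The “O” column (old generation) is at 100% as well, and the FGC and FGCT columns (full GC count and accumulated full GC time) go up after each stall, so these pauses are full collections.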
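To put a number on those pauses, the increase in the FGCT column between consecutive rows can be computed. The following is only a sketch: the class and method names are made up for illustration, and it assumes the column layout and the decimal comma shown in the output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class JstatFullGcPauses {

    /** Parses a jstat number that uses a decimal comma, e.g. "12922,944". */
    static double parse(String field) {
        return Double.parseDouble(field.replace(',', '.'));
    }

    /**
     * Returns every increase of the FGCT column (index 8, accumulated full GC
     * time in seconds) between consecutive data rows. Header rows are skipped.
     */
    static List<Double> fullGcTimeIncreases(List<String> lines) {
        List<Double> increases = new ArrayList<>();
        double previous = Double.NaN;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("S0")) {
                continue; // blank line or repeated header
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length < 10) {
                continue; // not a complete data row
            }
            double fgct = parse(fields[8]);
            if (!Double.isNaN(previous) && fgct > previous) {
                increases.add(fgct - previous);
            }
            previous = fgct;
        }
        return increases;
    }

    public static void main(String[] args) {
        // Four rows copied from the output above.
        List<String> sample = Arrays.asList(
            "S0 S1 E O P YGC YGCT FGC FGCT GCT LGCC GCC",
            "0,00 0,00 100,00 100,00 89,44 10546 726,162 1192 12922,944 13649,106 Allocation Failure unknown GCCause",
            "0,00 0,00 21,93 100,00 89,47 10546 726,162 1192 12934,808 13660,970 unknown GCCause No GC",
            "0,00 0,00 100,00 100,00 89,48 10546 726,162 1193 12934,808 13660,970 Allocation Failure unknown GCCause",
            "0,00 0,00 24,36 100,00 89,51 10546 726,162 1193 12946,347 13672,509 unknown GCCause No GC");
        for (double seconds : fullGcTimeIncreases(sample)) {
            System.out.println(String.format(Locale.ROOT, "full GC time grew by %.3f s", seconds));
        }
    }
}
```

For the rows above this gives two increases of roughly 11.9 and 11.5 seconds, which matches the stalls seen in the application.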
Using jstack I was able to find out what the server was doing while inflating the heap.
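jstack attaches to the process from the outside. If the application can expose a diagnostic hook of its own, a comparable snapshot can be taken from inside the JVM with the standard ThreadMXBean. This is not what I did on the server, just a minimal sketch of the same idea; the class name is made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class InProcessThreadDump {

    /** Builds a text dump of all live threads with their stack frames. */
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: leave out monitor and synchronizer details,
        // which not every JVM supports.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Taking several dumps a few seconds apart and comparing which frames keep showing up is usually enough to see where the allocation is coming from.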
All this was done on a server running in production without using any additional profilers!
Another nice feature is the -XX:+HeapDumpOnOutOfMemoryError option mentioned in section 3.3.3.4 of the guide. I always set this option for production servers. This way you always have a memory dump available to analyze, even if the support team only restarted the application without taking a heap dump.
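Whether the option is actually active on a running JVM can be checked from code. The sketch below assumes a HotSpot-based JVM, because it relies on com.sun.management.HotSpotDiagnosticMXBean, which other JVMs may not provide; the class name is made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class HeapDumpFlagCheck {

    private static final String FLAG = "HeapDumpOnOutOfMemoryError";

    /** Returns true if the JVM was started with -XX:+HeapDumpOnOutOfMemoryError. */
    static boolean heapDumpOnOomEnabled() {
        // Assumption: HotSpot JVM; the bean is not part of the Java SE standard API.
        HotSpotDiagnosticMXBean hotspot =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return Boolean.parseBoolean(hotspot.getVMOption(FLAG).getValue());
    }

    /** Returns a one-line description such as "HeapDumpOnOutOfMemoryError=false". */
    static String describe() {
        return FLAG + "=" + heapDumpOnOomEnabled();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Logging this line at startup makes it easy to spot a production server that was launched without the option.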
Or take jmap for example. You can not only get some heap statistics but also generate a heap dump while the application is running.
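A heap dump can also be triggered from inside the application instead of with jmap. Again this is only a sketch under the assumption of a HotSpot-based JVM; the class name and the temporary file location are made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

final class InProcessHeapDump {

    /**
     * Writes a heap dump of the current JVM to the given file.
     * The file must not exist yet; recent JDKs also expect a ".hprof" suffix.
     */
    static void dumpTo(Path file, boolean liveObjectsOnly) throws IOException {
        // Assumption: HotSpot JVM; the bean is not part of the Java SE standard API.
        HotSpotDiagnosticMXBean hotspot =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        hotspot.dumpHeap(file.toAbsolutePath().toString(), liveObjectsOnly);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("heapdump-sketch");
        Path file = dir.resolve("sketch.hprof");
        dumpTo(file, true); // true: dump only objects that are still reachable
        System.out.println("wrote " + file + " (" + Files.size(file) + " bytes)");
    }
}
```

Keep in mind that the application is paused while the dump is written, and that the file can grow to roughly the size of the used heap.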
So make sure you know about the tools available, as they might become useful if your application ever gets into trouble.
September 17th, 2007 at 8:53 pm
“All this was done on a server running in production without using any additional profilers!”
I think you will find that most respectable companies do not use profilers on production servers and instead use Java performance management solutions that are optimized to automate most of the performance monitoring and problem diagnostics activities, including ad hoc investigations such as the one above.
What is important to bear in mind is that in any reasonably sized installation developers do not have access to servers, and that operations and application management support staff are looking to build a knowledge management solution around diagnostic images whilst handling incidents, to ensure that problem management is actually performed, and performed efficiently.
regards,
William
September 17th, 2007 at 9:56 pm
Well, maybe the installations I’m developing for are too small. Do you have any recommendations for products that do automatic performance monitoring and problem diagnostics? Sounds very interesting.
September 18th, 2007 at 4:15 am
thanks, quite useful and good timing of this blog
September 21st, 2007 at 2:55 am
Try Wily (www.wilytech.com) for monitoring production JVMs. They helped write JSR-163, the spec for byte-code instrumentation (BCI). Their stuff collects a ton of metrics in production with very low overhead.