When we decided to put up an open source tool that helps measure runtime platform [un]responsiveness, we found that the hardest thing to explain about such a tool is just why it is actually needed.
Most people think they already know how the underlying platform behaves, and expect the results to be uninteresting. The basic assumption we seem to have about the platforms we run on is that the platform itself is fairly consistent in its key responsiveness behavior.
Servers have lots of CPU, and unless you completely saturate and thrash a server, people expect to have their software up and running on a CPU within milliseconds of “wanting to”. Sure, we know that the CPU we run on is being time sliced, and some other load may be using it some of the time, but at 20% CPU utilization, how big of an issue can that really be? We don’t expect the rare few milliseconds of delay every once in a while to really show up in application responsiveness stats. Some people also know that other factors (like competing loads, hardware power saving, and things like internal runtime bookkeeping work) can add to the noise levels, but they think of it as just that – “noise”.
But what happens when that “noise” grows to levels that are larger than the processing you actually want to do? When the platform’s waits, stalls, pauses, execution interruptions, or whatever other name they might go by, come in chunks big enough to dominate the application response time? What happens most of the time is that we ignore the issue, chalk it up as an “outlier”, and continue to think of the server we run on as a smooth, continually operating machine.
A very effective way to ignore the issue seems to be to collect and present results in terms of means and standard deviations. The reality of platform noise is anything but “normal” in distribution – it tends to be multi-modal – mostly good, and then very, very bad, with very little in between. Like drowning in a lake with an average depth of 2 inches, a 30 second stall in a system with an average response time of 0.2 seconds and a standard deviation of 0.3 seconds can make for a really bad day.
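To make the lake analogy concrete, here is a small, hypothetical Java example (the class name and the data set are invented for illustration, and have nothing to do with jHiccup itself). Ten thousand 0.2-second responses plus a single 30-second stall really do produce the healthy-looking mean and standard deviation quoted above; only the maximum gives the game away.

    import java.util.Arrays;

    // Hypothetical data set: ten thousand 0.2-second responses and one 30-second stall.
    // The mean and standard deviation look perfectly reasonable; the maximum does not.
    public class AverageDepthOfTwoInches {
        public static void main(String[] args) {
            double[] responseTimes = new double[10_001];
            Arrays.fill(responseTimes, 0.2);   // the "mostly good" mode, in seconds
            responseTimes[10_000] = 30.0;      // the "very, very bad" mode: one 30-second stall

            double mean = Arrays.stream(responseTimes).average().orElse(0);
            double variance = Arrays.stream(responseTimes)
                    .map(t -> (t - mean) * (t - mean)).average().orElse(0);
            double stdDev = Math.sqrt(variance);
            double max = Arrays.stream(responseTimes).max().orElse(0);

            // Prints roughly: mean = 0.203 s, std dev = 0.298 s, max = 30.0 s
            System.out.printf("mean = %.3f s, std dev = %.3f s, max = %.1f s%n", mean, stdDev, max);
        }
    }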
So what can we do to make people more aware of the need to actually look at their runtime platform behavior, and see if it really is as smooth as they thought it was?
We can name the problem.
We chose to use a name that would get some attention, even if it sounds a bit silly at first. A name that would make you think of anything but a normal, smooth distribution. We decided to name that thing where you see your system stalling every once in a while a “Hiccup”.
I then built a tool to measure and chart your runtime hiccups, and called it… jHiccup.
Your system probably gets the hiccups all the time. Especially when it’s running your application under load. How big each hiccup is, and how often the hiccups happen, varies. A lot. But almost all systems that do anything other than sitting idle will exhibit some level of hiccups, and looking at the hiccups of even an idle application turns out to be educational.
jHiccup is so simple that most people’s reaction to seeing what it actually does is “duh!”. The reaction to the plotted results is another thing though. Those usually evoke more of a “hmmm…. that’s interesting.”
jHiccup uses a trivial mechanism to measure runtime hiccups while your application is actually running: It measures how long it takes a separate application thread to do absolutely nothing. Doing nothing should be pretty quick, usually, and if doing nothing took an otherwise idle application thread a long time, then it experienced a runtime hiccup. What caused the observed hiccup doesn’t really matter. It’s a pretty safe bet that other application threads – the ones that actually do something – would experience the same hiccup levels, with the hiccup time adding to their overall time to perform whatever work it is they were trying to complete.
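Here is a minimal sketch of that idea in Java. It is only an illustration of the principle, not jHiccup’s actual implementation; the class name, the 1 msec “do nothing” interval, and the 10 msec reporting threshold are all made up for this example.

    // Minimal sketch: measure how long it takes an otherwise idle thread to do nothing.
    // Illustration only; not jHiccup's actual implementation.
    public class HiccupSketch {
        public static void main(String[] args) throws InterruptedException {
            final long resolutionNanos = 1_000_000L;       // ask to do "nothing" for ~1 msec at a time
            final long reportThresholdNanos = 10_000_000L; // only report hiccups longer than 10 msec

            while (true) {
                long start = System.nanoTime();
                Thread.sleep(1);                           // do absolutely nothing for ~1 msec
                long elapsed = System.nanoTime() - start;

                // Anything beyond the requested interval is time the runtime platform
                // (the JVM, the OS, the hypervisor, the hardware...) kept this idle
                // thread from running: a hiccup.
                long hiccupNanos = Math.max(0, elapsed - resolutionNanos);
                if (hiccupNanos > reportThresholdNanos) {
                    System.out.printf("hiccup: %.1f msec%n", hiccupNanos / 1_000_000.0);
                }
            }
        }
    }

A real tool would record every observed hiccup into a histogram rather than just printing the big ones, but the core loop really is about this simple, which is where the “duh!” reaction comes from.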
Simple measurements with jHiccup showing what happens to an idle application running on a dedicated, idle system are unsurprisingly boring. However, looking at what jHiccup observes as “the time to do nothing” when an actual Java application load is running on the same runtime can teach you a lot about what runtime hiccups look like for your specific application.
The most striking thing about “Hiccup Charts” (the way we plot jHiccup results) is that for Java runtimes carrying actual workloads, they tend to show regular patterns of pretty big hiccups, into the hundreds of milliseconds, and sometimes into the seconds. Those patterns are clearly not “noise”, and as the Hiccup Chart percentile distributions show, they often have a significant effect on your application’s behavior in the higher percentiles. Most importantly, they are not caused by your application’s code. They are caused by the runtime platform (the JVM and everything under it, including the OS, the hardware, etc.) stopping to do something, and stalling all work while that thing is done.
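This post doesn’t go into how the Hiccup Charts themselves are produced, but as an illustration of where a percentile distribution comes from, recorded hiccup times can be accumulated into a histogram and then read back by percentile. The sketch below uses the open source HdrHistogram library to do that; its use here is an assumption made for the example, not a description of jHiccup’s internals, and the sample values are invented.

    import org.HdrHistogram.Histogram;

    // Illustration only: accumulate observed hiccup times (in microseconds) into a
    // histogram and report them by percentile. The sample values are invented.
    public class HiccupPercentiles {
        public static void main(String[] args) {
            // Track values from 1 microsecond up to 1 hour, with 3 significant digits.
            Histogram histogram = new Histogram(3600L * 1000 * 1000, 3);

            long[] observedHiccupMicros = {120, 150, 180, 200, 250, 300, 45_000, 220_000, 1_800_000};
            for (long hiccup : observedHiccupMicros) {
                histogram.recordValue(hiccup);
            }

            System.out.printf("50th percentile: %d usec%n", histogram.getValueAtPercentile(50.0));
            System.out.printf("99th percentile: %d usec%n", histogram.getValueAtPercentile(99.0));
            System.out.printf("max:             %d usec%n", histogram.getMaxValue());

            // Or dump the full percentile distribution, scaled to milliseconds:
            histogram.outputPercentileDistribution(System.out, 1000.0);
        }
    }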
What the various causes of the hiccups are, and what we can do about them is something for another post. For now it’s enough that we know they are there, that we have a name to call them by, and that we now have ways to make pretty(?) pictures that show them.
So that’s how Java got the Hiccups. Now, if only someone could figure out a way to cure them…..
jHiccup can be found on GitHub at https://github.com/giltene/jHiccup. For more details on jHiccup, how it works, how to use it, and for some “pretty” pictures, see http://www.azul.com/jhiccup