Tuesday, November 10, 2015

How Java Got The Hiccups

[This is a recycled post from an older blog location. Originally posted in late 2011, when I put up the first version of jHiccup. I was recently reminded of its existence, and figured I'd revive it here...]

When we decided to put up an open source tool that helps measure runtime platform [un]responsiveness, we found that the hardest thing to explain about such a tool is just why it is actually needed.
Most people think they already know how the underlying platform behaves, and expect the results to be uninteresting. The basic assumption we seem to have about the platforms we run on is that the platform itself is fairly consistent in its key responsiveness behavior.
Servers have lots of CPU, and unless you completely saturate and thrash a server, people expect to have their software up and running on a CPU within milliseconds of “wanting to”. Sure, we know that the CPU we run on is being time sliced, and some other load may be using it some of the time, but at 20% CPU utilization, how big of an issue can that really be? We don’t expect the rare few milliseconds of delay every once in a while to really show up in application responsiveness stats. Some people also know that other factors (like competing loads, hardware power saving, and things like internal runtime bookkeeping work) can add to the noise levels, but they think of it as just that – “noise”.
But what happens when that “noise” grows to levels that are larger than the processing you actually want to do? When the platform’s waits, stalls, pauses, execution interruptions, or whatever other name they might go by, come in chunks big enough to dominate the application response time? What happens most of the time is that we ignore the issue, chalk it off as an “outlier”, and continue to think of the server we run on as a smooth, continually operating machine.
A very effective way to ignore the issue seems to be to collect and present results in terms of means and standard deviations. The reality of platform noise is anything but “normal” in distribution – it tends to be multi-modal – mostly good, and then very, very bad, with very little in between. Like drowning in a lake with an average depth of 2 inches, a 30 second stall in a system with an average response time of 0.2 seconds and a standard deviation of 0.3 seconds can make for a really bad day.
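To put numbers on that, here is a minimal Java sketch. The data set is made up for illustration (10,000 responses of 0.2 seconds plus a single 30 second stall), and the class and method names are hypothetical. With these assumed samples the mean and standard deviation come out at roughly 0.2 and 0.3 seconds, and neither gives any hint that one request waited 30 seconds.

```java
import java.util.Arrays;

// Illustration only: hypothetical response times, not measured data.
final class MeanHidesStalls {

    // 10,000 responses of 0.2 s, plus one 30 s stall (assumed values).
    static double[] buildSamples() {
        double[] samples = new double[10_001];
        Arrays.fill(samples, 0.2);
        samples[samples.length - 1] = 30.0;
        return samples;
    }

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Population standard deviation.
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double sumSq = 0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / xs.length);
    }

    static double max(double[] xs) {
        double best = Double.NEGATIVE_INFINITY;
        for (double x : xs) {
            best = Math.max(best, x);
        }
        return best;
    }

    public static void main(String[] args) {
        double[] s = buildSamples();
        System.out.printf("mean=%.3f s, stddev=%.3f s, max=%.1f s%n",
                mean(s), stdDev(s), max(s));
    }
}
```

The summary statistics look perfectly healthy; only the maximum (or a high percentile) exposes the stall.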
So what can we do to make people more aware of the need to actually look at their runtime platform behavior, and see if it really is as smooth as they thought it was?
We can name the problem.
We chose to use a name that would get some attention, even if it sounds a bit silly at first. A name that would make you think of anything but a normal, smooth distribution. We decided to name that thing where you see your system stalling every once in a while a “Hiccup”.
I then built a tool to measure and chart your runtime hiccups, and we called it… jHiccup.
Your system probably gets the hiccups all the time. Especially when it’s running your application under load. How big each hiccup is, and how often they happen, varies. A lot. But almost all systems that do anything other than sitting idle will exhibit some level of hiccups, and looking at the hiccups of even an idle application turns out to be educational.
jHiccup is so simple that most people’s reaction to seeing what it actually does is “duh!”. The reaction to the plotted results is another thing though. Those usually evoke more of a “hmmm…. that’s interesting.”
jHiccup uses a trivial mechanism to measure runtime hiccups while your application is actually running: it measures how long it takes a separate application thread to do absolutely nothing. Doing nothing should be pretty quick, usually, and if doing nothing took an otherwise idle application thread a long time, then it experienced a runtime hiccup. What caused the observed hiccup doesn’t really matter. It’s a pretty safe bet that other application threads (the ones that actually do something) would experience the same hiccup levels, with the hiccup time adding to their overall time to perform whatever work it is they were trying to complete.
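The idea of "timing a thread that does nothing" can be sketched in a few lines of Java. This is not jHiccup's actual code, just a minimal illustration under some assumptions: the class name, method name, 1 millisecond sleep interval, and sample count are all hypothetical, and only the largest overshoot is kept rather than a full distribution. In a real setup the loop would run on its own thread alongside the application.

```java
import java.util.concurrent.TimeUnit;

// Illustration only: a toy version of "measure how long doing nothing takes".
final class HiccupSketch {

    // Sleeps for intervalNanos, `iterations` times, and returns the largest
    // amount by which a sleep overshot its requested interval, in nanoseconds.
    static long maxHiccupNanos(long intervalNanos, int iterations) {
        long maxHiccup = 0L;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            try {
                TimeUnit.NANOSECONDS.sleep(intervalNanos); // "do nothing"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                break;
            }
            long elapsed = System.nanoTime() - start;
            // Anything beyond the requested interval is time the platform
            // kept this thread from running: a hiccup.
            long hiccup = Math.max(0L, elapsed - intervalNanos);
            maxHiccup = Math.max(maxHiccup, hiccup);
        }
        return maxHiccup;
    }

    public static void main(String[] args) {
        long max = maxHiccupNanos(TimeUnit.MILLISECONDS.toNanos(1), 1000);
        System.out.printf("max hiccup over 1000 samples: %.3f ms%n", max / 1e6);
    }
}
```

The sketch does not care why a sleep overshot; it only records that the thread could not run when it wanted to, which is the point of the measurement.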
Simple measurements with jHiccup showing what happens to an idle application running on a dedicated, idle system are unsurprisingly boring. However, looking at what jHiccup observes as “the time to do nothing” when an actual Java application load is running on the same runtime can teach you a lot about what runtime hiccups look like for your specific application.
The most striking thing about “Hiccup Charts” (the way we plot jHiccup results) is that for Java runtimes carrying actual workloads, they tend to show regular patterns of pretty big hiccups, into the 100s of msec, and into the seconds sometimes. Those patterns are clearly not “noise”, and as the Hiccup Chart percentile distributions show, they often have a significant effect on your application’s behavior in the higher percentiles. Most importantly, they are not caused by your application’s code. They are caused by the runtime platform (the JVM and everything under it, including the OS, the hardware, etc.) stopping to do something, and stalling all work while that thing is done.
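A small Java sketch shows why the higher percentiles are where hiccups become visible. It uses a simple nearest-rank percentile over a sorted array, which is an assumption made for illustration and not necessarily how jHiccup computes its charts. The hiccup samples are invented: 980 of 1 msec, 19 of 150 msec, and one of 2000 msec.

```java
import java.util.Arrays;

// Illustration only: hypothetical hiccup samples and a nearest-rank percentile.
final class HiccupPercentiles {

    // Nearest-rank percentile over an array sorted in ascending order.
    static long percentile(long[] sortedMillis, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sortedMillis.length);
        int index = Math.min(sortedMillis.length - 1, Math.max(0, rank - 1));
        return sortedMillis[index];
    }

    // 1000 assumed samples: mostly 1 msec, some 150 msec, one 2000 msec.
    static long[] buildSamples() {
        long[] samples = new long[1000];
        Arrays.fill(samples, 0, 980, 1L);
        Arrays.fill(samples, 980, 999, 150L);
        samples[999] = 2000L;
        Arrays.sort(samples);
        return samples;
    }

    public static void main(String[] args) {
        long[] s = buildSamples();
        System.out.println("p50  = " + percentile(s, 50) + " msec");
        System.out.println("p99  = " + percentile(s, 99) + " msec");
        System.out.println("p100 = " + percentile(s, 100) + " msec");
    }
}
```

With these assumed samples the median is 1 msec, while the 99th percentile is 150 msec and the maximum is 2000 msec: mostly good, then very bad, with little in between.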
What the various causes of the hiccups are, and what we can do about them is something for another post. For now it’s enough that we know they are there, that we have a name to call them by, and that we now have ways to make pretty(?) pictures that show them.
So that’s how Java got the Hiccups. Now, if only someone could figure out a way to cure them…..
jHiccup can be found on GitHub at https://github.com/giltene/jHiccup. For more details on jHiccup, how it works, how to use it, and for some “pretty” pictures, see http://www.azul.com/jhiccup
