Tuesday, May 2, 2017

Zing hits the trifecta

Three winners combine to make Java go fast, start fast, and stay fast.


By now you should be able to find lots of material
online about Azul's new Falcon compiler technology, which was just released as the default JIT optimizer in Zing, our already-very-cool JVM. Falcons and JIT compilers are obviously all about speed. But with Falcon, Zing doesn't just bring more speed to Java. It brings that speed faster. It brings that speed sooner. It brings that speed all the time.


Falcon produces faster code. That's the point. And by bringing an LLVM backend optimizer to the JVM, Falcon lets Zing leverage the optimization work of literally hundreds of other developers who have been (and will continue to be) busy adding more and more optimizations and making sure the valuable features of each new processor generation actually get used by optimized code. A great example of this benefit in play is vectorization. LLVM's (and hence Falcon's) vectorization engine will now match normal Java loops to modern vector instructions, making code like this:

int sumIfEven(int[] a) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
        if ((a[i] & 0x1) == 0) {
            sum += a[i];
        }
    }
    return sum;
}

run faster (as in up to 8x faster) on modern servers than it will on the current HotSpot JVM (see details of the jmh microbenchmark here, and try it for yourself with the trial version of Zing). This loop has a predicated operation (add only when the number is even), which makes it hard to match with the vector instructions (SSE, AVX, etc.) that have been around for quite a while. But when the same, unmodified classes are executed on a newer server that has AVX2 instructions (which include some cool new vector masking capabilities), code like this will get fully vectorized, with the speed benefits of the more sophisticated instructions exposed. The cool part is not just the fact that such code gets to be vectorized, or that it is fast. It's that Falcon gets to do this sort of thing without Azul engineers putting in tons of engineering effort to optimize for and keep up with new processors. Others (e.g. Intel) have spent the last few years contributing optimizations to LLVM, and we (and your Java code) now get to benefit from that work.
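If you want to poke at this yourself, a minimal JMH benchmark along the following lines is enough to compare the same unmodified bytecode across JVMs on the same hardware (a sketch: the class name, array size, and seed here are illustrative, not the exact benchmark linked above):

import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class SumIfEvenBench {
    int[] a;

    @Setup
    public void setup() {
        a = new int[100_000];
        Random random = new Random(42);
        for (int i = 0; i < a.length; i++) {
            a[i] = random.nextInt();
        }
    }

    @Benchmark
    public int sumIfEven() {
        // Same predicated loop as above; returning the sum keeps it from being optimized away.
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            if ((a[i] & 0x1) == 0) {
                sum += a[i];
            }
        }
        return sum;
    }
}

The point of the comparison is that the classes (and the bytecode) stay identical; only the JIT underneath changes.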

Could other JIT compilers do this? Of course they could. Eventually. With enough work. But they haven't yet. Falcon is ahead in optimization adoption because it gets to leverage an ongoing stream of new optimization contributions made by others. And we expect it to stay ahead that way. With Falcon and LLVM, Zing gets to be fast sooner, and on an ongoing basis.

Of course Falcon is not just about leveraging other people's work and contributions. We like that, but we had to sink in a bunch of our own work and contributions to make it all possible. The Falcon project at Azul has made significant improvements to LLVM's ability to optimize code in managed runtime environments, which involve things like JITs, speculative optimization, deoptimization, and Garbage Collection. When we started, LLVM was able to deal with these concepts to various degrees, but having any of them around in actual code ended up disabling or defeating most optimizations. Over the past three years, Azul's LLVM team has improved that situation dramatically, and has successfully and repeatedly landed those improvements upstream so that they can benefit others in the LLVM community (in other runtimes, not just in Java).

With Falcon, we also had to build a host of runtime and Java-specific optimizations that are typical of optimizing JIT compilers, but not typical in static compilation environments: implicit null checks, speculative devirtualization, and guarded inlining are just a few examples. Happily, we found that the tooling available with LLVM and its maturity as an optimization-creation platform have made our new optimization development velocity dwarf the speed at which we were able to create new optimizations in the past.

Great. But how does all this "now we have faster code" stuff make a trifecta? Falcon completes a picture that we've been working towards at Azul for a while now. It has to do with speed, with how speed behaves, and the ways by which speed can be made to behave better.


Albatross takeoff run
Since the late 1990s, when JIT compilers were added to the JVM, Java has had a reputation for being "eventually pretty fast". But that speed has been flaky compared to traditional static environments. We have come to expect the JVM, like an albatross, to take time (and many temporary slowdowns) to get up to speed and airborne. And due to the common pauses associated with typical runtime behaviors (like Stop-The-World GC pauses), Java's eventual speed couldn't even be called "eventually consistent". It has been predictably unpredictable. Consistently inconsistent.


For a quick overview of the various aspects of "speed" that JVM designs have been challenged with, let's start by looking at what JIT compilers typically do. Most JIT environments will load code and start executing it in a slow, interpreted form. As "hotter" parts of the code are identified, they are compiled using a relatively cheap Tier 1 compiler that focuses on producing "better than interpreted" code, which is used to record the detailed profiling needed for eventual optimization. Finally, after a piece of code has labored through enough slow interpreter and Tier 1 (profiling) execution, higher-tier optimizations are applied. This typically happens only after 10,000+ slower executions have passed.
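You can get a crude feel for this staged behavior with nothing fancier than a loop and a clock. The toy program below is illustrative only (it is not a proper benchmark; use something like JMH for real measurements), but watching its printed per-call times drop, or running it with HotSpot's -XX:+PrintCompilation flag, will show the early, slow calls giving way to much faster ones as the tiers kick in:

public class WarmupDemo {
    static long sink; // keep results alive so the work isn't optimized away

    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (i * 31) ^ (sum >>> 7);
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int call = 0; call <= 20_000; call++) {
            long start = System.nanoTime();
            sink += work(10_000);
            long elapsed = System.nanoTime() - start;
            if (call % 2_000 == 0) {
                System.out.println("call " + call + ": " + elapsed + " nsec");
            }
        }
    }
}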


The chart to the right depicts this evolution of code execution over time, showing the relative portions of interpreted, Tier1 (profiling), and optimized code that is dynamically executed. Since many valuable JIT optimizations rely on speculative assumptions, reversions to interpreted code can (and will) happen as the code learns which speculations actually "stick". Those spikes are referred to as "de-optimizations".



Interpreted and Tier 1 compiled code tend to be significantly slower than optimized code. The amount of time it takes an operation to complete (e.g. the response time to a service request) will obviously be dramatically affected by this mix. We can see how response time evolves over time, as the portion of code that is actually optimized grows and eventually stabilizes.

(In this same depiction, we can see the impact of Stop-The-World GC pauses as well).

We can translate this operation completion time behavior to a depiction of speed over time, showing the speed (as opposed to time) contribution that different optimization levels have towards overall speed:

This is how Java "speed" has behaved until now. You can see the albatross takeoff run as it takes time to work up speed. You can see the eventual speed it gains as it stabilizes on running optimized code. And you can also see the all-too-frequent dips to (literally) zero speed that occur when the entire runtime stalls and performs a Stop-The-World Garbage Collection pause.

Zing, with Falcon and friends, now improves speed in three different ways:

  • Falcon literally raises the bar by applying new optimizations and better leveraging the latest hardware features available in servers, improving the overall speed of optimized code.

  • ReadyNow and its profile-playback capabilities remove the albatross takeoff run, replacing it with an immediate rise to full optimized speed at the very beginning of execution. 

  • The C4 Garbage Collector eliminates the persistent down-spikes in speed associated with Stop-The-World GC pauses. 



These three key features play off each other to improve the overall speed picture, and finally give Java on servers a speed profile and speed-behavior-over-time similar to what people have always expected from C/C++ applications. With Zing, Java is now not only faster, it is consistently faster. From the start.



Albatrosses arguing about the future
Falcon is real. Fast-from-the-start Java is here with ReadyNow. And Stop-The-World GC stalls are a thing of the past with C4. We are kicking ass and taking names. Sign up, or just take Zing for a spin with a free trial.







Or, you could just keep taking those albatross takeoff runs every time, ignore the regular dips to zero speed and altitude, and just keep listening to people who explain how that's all ok, and how some day someone will eventually find (and ship?) some other holy graal...

Tuesday, November 10, 2015

How Java Got The Hiccups

[This is a recycled post from an older blog location. Originally posted in late 2011, when I put up the first version of jHiccup. I was recently reminded of its existence, and figured I'd revive it here...]

When we decided to put up an open source tool that helps measure runtime platform [un]responsiveness, we found that the hardest thing to explain about such a tool is just why it is actually needed.
Most people think they already know how the underlying platform behaves, and expect the results to be uninteresting. The basic assumption we seem to have about the platforms we run on is that the platform itself is fairly consistent in its key responsiveness behavior.
Servers have lots of CPU, and unless you completely saturate and thrash a server, people expect to have their software up and running on a CPU within milliseconds of “wanting to”. Sure, we know that the CPU we run on is being time sliced, and some other load may be using it some of the time, but at 20% CPU utilization, how big of an issue can that really be? We don’t expect the rare few milliseconds of delay every once in a while to really show up in application responsiveness stats. Some people also know that other factors (like competing loads, hardware power saving, and things like internal runtime bookkeeping work) can add to the noise levels, but they think of it as just that – “noise”.
But what happens when that “noise” grows to levels that are larger than the processing you actually want to do? When the platform’s waits, stalls, pauses, execution interruptions, or whatever other name they might go by, come in chunks big enough to dominate the application response time? What happens most of the time is that we ignore the issue, chalk it off as an “outlier”, and continue to think of the server we run on as a smooth, continually operating machine.
A very effective way to ignore the issue seems to be to collect and present results in terms of means and standard deviations. The reality of platform noise is anything but “normal” in distribution – it tends to be multi-modal – mostly good, and then very, very bad, with very little in between. Like drowning in a lake with an average depth of 2 inches, a 30 second stall in a system with an average response time of 0.2 seconds and a standard deviation of 0.3 seconds can make for a really bad day.
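To put some (purely illustrative) numbers behind that lake analogy: ten thousand “good” 0.2 second responses plus a single 30 second stall produce summary statistics that look perfectly tame, as the little calculation below shows.

import java.util.Arrays;

public class TameLookingStats {
    public static void main(String[] args) {
        double[] responseTimes = new double[10_001];
        Arrays.fill(responseTimes, 0.2);   // 10,000 responses at 0.2 seconds...
        responseTimes[10_000] = 30.0;      // ...and one 30 second stall

        double mean = Arrays.stream(responseTimes).average().getAsDouble();
        double variance = Arrays.stream(responseTimes)
                .map(x -> (x - mean) * (x - mean)).average().getAsDouble();

        System.out.printf("mean = %.3f sec, stddev = %.3f sec, max = %.1f sec%n",
                mean, Math.sqrt(variance), 30.0);
        // Prints roughly: mean = 0.203 sec, stddev = 0.298 sec, max = 30.0 sec
        // Nothing in the mean or the standard deviation hints at the 30 second drowning.
    }
}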
So what can we do to make people more aware of the need to actually look at their runtime platform behavior, and see if it really is as smooth as they thought it was?
We can name the problem.
We chose to use a name that would get some attention, even if it sounds a bit silly at first. A name that would make you think of anything but a normal, smooth distribution. We decided to name that thing where you see your system stalling every once in a while a “Hiccup”.
I then built a tool to measure and chart your runtime hiccups, and we called it… jHiccup.
Your system probably gets the hiccups all the time. Especially when it’s running your application under load. How big each hiccup is, and how often they happen, varies. A lot. But almost all systems that do anything other than sitting idle will exhibit some level of hiccups, and looking at the hiccups of even an idle application turns out to be educational.
jHiccup is so simple that most people’s reaction to seeing what it actually does is “duh!”. The reaction to the plotted results is another thing though. Those usually evoke more of a “hmmm…. that’s interesting.”
jHiccup uses a trivial mechanism to measure runtime hiccups while your application is actually running: It measures how long it takes a separate application thread to do absolutely nothing. Doing nothing should be pretty quick, usually, and if doing nothing took an otherwise idle application thread a long time, then it experienced a runtime hiccup. What caused the observed hiccup doesn’t really matter. It’s a pretty safe bet that other application threads – the ones that actually do something – would experience the same hiccup levels, with the hiccup time adding to their overall time to perform whatever work it is they were trying to complete.
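The core of that mechanism fits in a few lines. The sketch below is not jHiccup's actual code (the real tool runs this measurement in a thread alongside your application, and corrects for things this toy version ignores); it is just an illustration of the "measure how long it takes to do nothing" idea, using an HdrHistogram to keep the percentile data:

import org.HdrHistogram.Histogram;

public class HiccupSketch {
    public static void main(String[] args) throws InterruptedException {
        // Track hiccups of up to an hour, with 3 significant decimal digits of precision:
        Histogram histogram = new Histogram(3600L * 1000 * 1000 * 1000, 3);
        final long sleepNsec = 1_000_000;                   // ask to "do nothing" for 1 msec at a time
        long endTime = System.currentTimeMillis() + 60_000; // observe for a minute

        while (System.currentTimeMillis() < endTime) {
            long start = System.nanoTime();
            Thread.sleep(1);                                // doing nothing should take ~1 msec...
            long hiccup = (System.nanoTime() - start) - sleepNsec;
            histogram.recordValue(Math.max(0, hiccup));     // ...anything extra is a hiccup
        }
        // Report the observed hiccup percentiles, scaled to milliseconds:
        histogram.outputPercentileDistribution(System.out, 1_000_000.0);
    }
}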
Simple measurements with jHiccup showing what happens to an idle application running on a dedicated, idle system, are unsurprisingly boring. However, looking at what jHiccup observes as “the time to do nothing” when an actual java application load is running on the same runtime can teach you a lot about what runtime hiccups look like for your specific application.
The most striking thing about “Hiccup Charts” (the way we plot jHiccup results) is that for Java runtimes carrying actual workloads, they tend to show regular patterns of pretty big hiccups, into the 100s of msec, and into the seconds sometimes. Those patterns are clearly not “noise”, and as the Hiccup Chart percentile distributions show, they often have a significant effect on your application’s behavior in the higher percentiles. Most importantly, they are not caused by your application’s code. They are caused by the runtime platform (the JVM and everything under it, including the OS, the hardware, etc.) stopping to do something, and stalling all work while that thing is done.
What the various causes of the hiccups are, and what we can do about them is something for another post. For now it’s enough that we know they are there, that we have a name to call them by, and that we now have ways to make pretty(?) pictures that show them.
So that’s how Java got the Hiccups. Now, if only someone could figure out a way to cure them…..
jHiccup can be found on github at https://github.com/giltene/jHiccup. For more details on jHiccup, how it works, how to use it, and for some “pretty” pictures see http://www.azul.com/jhiccup

Sunday, November 16, 2014

WriterReaderPhaser: A story about a new (?) synchronization primitive

I recently added a synchronization primitive mechanism in my HdrHistogram and LatencyUtils code, which I think has generic use for some very common operations. Specifically, when wait-free writers are updating stuff that background analyzers or loggers need to look at. I've isolated it in what I now call a WriterReaderPhaser. The name is very intentional, and we'll get to that in a moment. And to the code (all 66 actual lines of it, 200 with elaborate comments). But first, I'll stray into some "how did this come about" storytelling.

WriterReaderPhaser is a new (I think) synchronization primitive: It provides a straightforward interface and API to coordinate wait-free writing to a shared data structure with blocking reading operations of the same data. Readers view a stable (i.e. non-changing, coherent) data set while writers continue to modify data without waiting. And readers are guaranteed forward progress, and will only block for other readers and for writers that may have been "in flight" at the time the reader establishes a stable view of the data.

How did this come about?


This sometimes happens when I build stuff: I find myself in need of some behavior that I thought would be common, but for which I can't find an existing implementation, or a name, or a description. This can obviously be ascribed to my weak Google-fu skills, but after a while I give up and just build the thing, because "it's not that complicated". So I build a one-off implementation into whatever I am doing at the time, and move on with life. At some later point, I find myself needing the same thing again. And since I had already solved that problem once, I go back to my old code and (let's be honest) copy-and-paste my first implementation into whatever new thing I'm working on. Sometimes the little guy on my right shoulder wins over the other guy, and I come back and refactor the behavior into a separate class and build an API for more generic use, at which point the "does this deserve its own library? Its own repo?" thinking starts, coupled with much Yak Shaving [1]. Sometimes the guy on the left shoulder wins, and I actually get on with the real work I was supposed to be doing. I'll leave it to you to decide which little guy is red and which is white.

Sometimes (usually much later) I realize that what I built was actually new. That even though I thought it was a common use case, and built my version simply out of impatience or frustration at not finding something I could use as-is, I may actually be the first person to solve it. Most of those times, this realization is quickly followed by someone showing me a paper or a piece of code that is 30 years old that makes me go "oh... right.". But sometimes that doesn't happen. Sometimes it really is new.

HdrHistogram itself started this way. It was nothing more than about 100 lines of code in a one-off "JitterMeter" tool I was playing with, which needed to record latencies very quickly and report accurate percentiles with many nines in them. Then I found myself building all sorts of variations on jitter meters and sharing them (jHiccup is an evolved version with a better name). And then I found that people (myself included) were taking the code and ripping out just the histogram trick inside, because they needed a histogram that was actually useful for talking about latencies. Recognizing that a fast histogram with good precision and accurate and fine grained quantile reporting capability is actually a very common use case, I decided to build a Yak shaving co-op on github and called it HdrHistogram. The first Yak hair I produced was Java-colored but others have recently added other colors and breeds.
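For anyone who hasn't bumped into it, the "histogram trick" people were ripping out boils down to an API roughly like this (a quick sketch against the Java HdrHistogram API; the recorded values below are made up):

import org.HdrHistogram.Histogram;

public class HistogramSketch {
    public static void main(String[] args) {
        // Latency values in nanoseconds, up to an hour, 3 significant decimal digits:
        Histogram histogram = new Histogram(3600L * 1000 * 1000 * 1000, 3);

        for (int i = 0; i < 1_000_000; i++) {
            histogram.recordValue(50_000 + (i % 7) * 1_000); // stand-in for measured latencies
        }

        System.out.println("median:     " + histogram.getValueAtPercentile(50.0) + " nsec");
        System.out.println("99.99%'ile: " + histogram.getValueAtPercentile(99.99) + " nsec");
        System.out.println("max:        " + histogram.getMaxValue() + " nsec");
    }
}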

HdrHistogram is a [presumably] successful example of this process going the distance. More often than not, it doesn't. That's probably what my stale repos on github with 2 stars and no forks represent.

WriterReaderPhaser is currently about halfway through this precarious process, but at this point I'm pretty sure it's not going to die. It's a class on its own, but not yet its own library. Certainly not its own repo yet. It will need to find a home, but org.giltene.stuff is probably not where it needs to end up. Since it's so short, this blog entry is as good a home as any for now.

Most importantly, it looks like it may actually be a new and generically useful synchronization primitive. More accurately: nobody has shown me that "oh... right." link or paper yet, and I'm done holding my breath for now.

So what is WriterReaderPhaser about? 


Have you ever had a need for logging or analyzing data that is actively being updated? Have you ever wanted to do that without stalling the writers (recorders) in any way? If so, then WriterReaderPhaser is for you.

 I'm not talking about logging messages or text lines here. I'm talking about data. Data larger than one word of memory. Data that holds actual interesting state. Data that keeps being updated, but needs to be viewed in a stable and coherent way for analysis or logging. Data like frame buffers. Data like histograms. Data like usage counts. Data that changes.

Existing solutions


Sure, you can use channels, queues or magic rings to move data updates and safely process them in background copies of the data. You can use persistent data structures and all sorts of immutable trickery. But those are expensive. As in orders of magnitude more expensive than updating in-cache state in place. When this data thing you want to look at could be updated millions of times per second, you invariably end up with some sort of double-buffered (or multi buffered) scheme: Updates are done to an active copy, and analysis is done "in the background" on stable, inactive copies.


Double buffered schemes usually involve some sort of "phase flipping". At some point the notion of which copy is active changes. Writers update the "new" active copy, and readers access a stable and coherent copy that used to be active, but now isn't. It's this phase flipping that usually gets in the way of keeping writers from blocking.

There are all sorts of variations on how to do this flipping. We can obviously use some form of mutual exclusion lock to protect the writes and the flip. But then writers will block each other, and be blocked by the flipping operation. We can use ReaderWriter locks backwards, where the state being protected by the ReaderWriter lock is the notion of which data set is the "active" one (the one writers write to). In this scheme writers take the read lock for the duration of their active state modification operations, while readers take the write lock to flip the roles of the active and inactive data sets. This can be [much] better than complete mutual exclusion when multiple writers are involved, since writers no longer block other writers, but readers still block writers during a flip. Also, when you start asking yourself "what does 'read' mean again in this context?" that is a good sign you have a problem. Most people write buggier code when standing on their head and juggling. I'm sure there are a whole bunch of other schemes people use, but in my looking around thus far, I didn't find any examples that were non-blocking for the writers.
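For concreteness, here is roughly what that "ReaderWriter lock used backwards" scheme looks like in Java (a sketch; note which side takes which lock). Writers no longer block each other, but they still block behind a reader performing a flip:

import java.util.Arrays;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class BackwardsRWLockCounters {
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    private long[] activeCounts = new long[1024];
    private long[] inactiveCounts = new long[1024];

    // A "writer" of data takes the *read* lock, so writers don't block each other:
    public void increment(int n) {
        rwLock.readLock().lock();
        try {
            activeCounts[n]++; // needs an atomic increment if there are multiple writers
        } finally {
            rwLock.readLock().unlock();
        }
    }

    // A "reader" of data takes the *write* lock to flip the active/inactive roles,
    // blocking all writers for the duration of the flip:
    public long[] flipAndGetStableCounts() {
        rwLock.writeLock().lock();
        try {
            long[] stable = activeCounts;              // the copy writers have been updating
            Arrays.fill(inactiveCounts, 0);            // fresh, zeroed copy for the next interval
            activeCounts = inactiveCounts;             // writers move on to the fresh copy
            inactiveCounts = stable;
            return Arrays.copyOf(stable, stable.length);
        } finally {
            rwLock.writeLock().unlock();
        }
    }
}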

Why did I care?


The thing I actually wanted to double-buffer was a histogram. And not just any histogram. A fixed-footprint histogram that supports lossless recording of experienced latencies, such that later computation of precise percentiles will be possible, all the way to the as-many-9s-as-there-are-in-the-data level. The very purpose of such a histogram is often to capture and analyze latency outlier behavior. The recording operation cannot be allowed to be a cause of the very outliers it is trying to measure. For the latency recording mechanism to have any susceptibility to blocking or locking would be unacceptable.

These latency histograms are basically non-blocking data structures with tens (or hundreds) of kilobytes of state that is rapidly being mutated by critical path "writer" code. But I wanted to log their contents over intervals that are short enough to be interesting for monitoring purposes, and for later time-based analysis. In order to log the latency information being captured, I needed a logging "reader" to somehow gain access to a stable, coherent "snapshot" of the latency data that was recorded during some prior interval. To do this, I needed a way for the reader to flip the roles of the active and inactive histograms, but I needed to do that without ever blocking the writers. This is a classic case of an asymmetric synchronization need. I'm fine with blocking, delaying, and pausing the reader. I just can't afford for the writers to ever block or otherwise delay the execution of the thread they are recording in.

In comes WriterReaderPhaser. And the best starting point for understanding what it does is to dissect the name:

The Phaser part is there because its main function is to coordinate phase shifts between the writers and the readers. Besides, I couldn't bring myself to call this thing a lock. It's not a lock. Not in its most important function, which is phase shift coordination. Writers remain lock-free in all cases (they actually remain wait-free on architectures that support atomic increment operations). They never block or lock. Calling WriterReaderPhaser a lock would be like calling an AtomicLong an "add lock" because someone could also construct a spin-lock around it....

The WriterReader part is a reversal of the commonly used ReaderWriter (or ReadWrite) term. ReaderWriter locks are asymmetric, but in the reverse direction of what I needed: they enable [relatively] smooth reader operation while causing the writers to block. The really cool wait-free Left-Right technique, which Martin Thompson had pointed me to, achieves perfectly smooth reader operation, but that's still not what I needed. WriterReaderPhaser works for the exactly reversed need: writers remain non-blocking and perfectly smooth, while only readers suffer.

The desired behaviors I was looking for in a WriterReaderPhaser were:

1. Writers remaining lock-free at all times. Ideally they will remain wait-free at all times.

2. A Reader can coordinate a phase flip and access to the inactive data such that:

2.1 Other readers will not flip a phase while this reader is still interested in the inactive data.

2.2 No writer modification will be made to the inactive data after the phase flip operation is complete, and for as long as the reader is interested in the inactive data.

2.3 Readers are guaranteed forward progress (even in the presence of heavy and continuous writer activity, and even when there is no writer activity at all).

Defining WriterReaderPhaser:


With these high level desired behaviors stated, let's clearly define the qualities and guarantees that a well implemented WriterReaderPhaser primitive would provide to users, and the relevant rules that users must adhere to in order to maintain those qualities and guarantees:

A WriterReaderPhaser instance provides the following 5 operations:
  • writerCriticalSectionEnter
  • writerCriticalSectionExit
  • readerLock
  • readerUnlock
  • flipPhase
When a WriterReaderPhaser instance is used to protect an actively updated data structure [or set of data structures] involving [potentially multiple] writers and [potentially multiple] readers, the assumptions on how readers and writers act are:
  • There are two sets of data structures (an "active" set and an "inactive" set)
  • Writing is done to the perceived active version (as perceived by the writer), and only within critical sections delineated by writerCriticalSectionEnter and writerCriticalSectionExit operations.
  • Only readers switch the perceived roles of the active and inactive data structures. They do so only while holding the readerLock, and the switch is only done before executing a flipPhase.
  • Readers do not hold onto readerLock indefinitely. 
  • Only readers perform readerLock and readerUnlock.
  • Writers do not remain in their critical sections indefinitely. 
  • Only writers perform writerCriticalSectionEnter and writerCriticalSectionExit.
  • Only readers perform flipPhase operations, and only while holding the readerLock.

When the above assumptions are met, WriterReaderPhaser guarantees that the inactive data structures are not being modified by any writers while they are being read under readerLock protection after a flipPhase operation.

The following progress guarantees are provided to writers and readers that adhere to the above stated assumptions:
  • Writer operations (writerCriticalSectionEnter and writerCriticalSectionExit) are wait-free (on architectures that support wait-free atomic increment operations).
  • flipPhase operations are guaranteed to make forward progress, and will only be blocked by writers whose critical sections were entered prior to the start of the reader's flipPhase operation, and have not yet exited their critical sections.
  • readerLock only blocks for other readers that are holding the readerLock.

Example use


Imagine a simple use case where a large set of rapidly updated counters is being modified by writers, and a reader needs to gain access to stable interval samples of those counters for reporting and other analysis purposes. 

The counters are represented in a volatile array of values (it is the array reference that is volatile, not the value cells within it):

volatile long counts[];
...

A writer updates a specific count (n) in the set of counters:

writerCriticalSectionEnter
   counts[n]++; // should use atomic increment if multi-writer
writerCriticalSectionExit

A reader gains access to a stable set of counts collected during an interval, reports on it, and accumulates it:

long interval_counts[];
long accumulated_counts[];

...
readerLock
   reset(interval_counts);
   long tmp[] = counts;
   counts = interval_counts;
   interval_counts = tmp;
flipPhase
   // At this point, interval_counts content is stable  
   report_interval_counts(interval_counts);
   accumulated_counts.add(interval_counts);
readerUnlock
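
Concretely, the same example looks roughly like this with the WriterReaderPhaser class that ended up in HdrHistogram (a sketch: it assumes the API shape where writerCriticalSectionEnter returns a value that is handed back to writerCriticalSectionExit; the wrapper class and array size are mine, purely for illustration):

import java.util.Arrays;
import org.HdrHistogram.WriterReaderPhaser;

public class IntervalCounters {
    private final WriterReaderPhaser phaser = new WriterReaderPhaser();
    private volatile long[] activeCounts = new long[1024];
    private long[] inactiveCounts = new long[1024];

    // Wait-free writer path:
    public void increment(int n) {
        long criticalValueAtEnter = phaser.writerCriticalSectionEnter();
        try {
            activeCounts[n]++; // use an atomic increment here if there are multiple writers
        } finally {
            phaser.writerCriticalSectionExit(criticalValueAtEnter);
        }
    }

    // Reader path: swap the active and inactive sets, flip the phase, then read in peace.
    public long[] getStableIntervalCounts() {
        phaser.readerLock();
        try {
            Arrays.fill(inactiveCounts, 0);      // fresh, zeroed set for the next interval
            long[] tmp = inactiveCounts;
            inactiveCounts = activeCounts;
            activeCounts = tmp;                  // new writers now land in the fresh set
            phaser.flipPhase();                  // wait out writers still in flight on the old set
            return Arrays.copyOf(inactiveCounts, inactiveCounts.length); // now stable and coherent
        } finally {
            phaser.readerUnlock();
        }
    }
}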

A working implementation


Under the hood, my WriterReaderPhaser implementation achieves these qualities in a fairly straightforward way, by using a dual set of epoch counters (an "odd" set and an "even" set) to coordinate the phase flip operations, coupled with a read lock that is used purely to protect readers from each other in multi-reader situations: i.e. to prevent one reader from flipping a phase or changing the notion of active or inactive data while another reader is still operating on it. Many other implementation mechanisms are possible, but this one is certainly sufficient for the job at hand.

Rather than describe the logic in text, it is easiest to list it as code at this point. Below is the entire WriterReaderPhaser class as implemented in my current HdrHistogram repository, spelled out in Java code (most of which is detailed comments). The mechanism can obviously be ported to any language and environment that can provide support for atomic increment and atomic swap operations. It's the API and documentation (in this case, the details in the JavaDoc comments) that are more important. A simple example of how this is used in practice can be found in HdrHistogram's various interval histogram recorders, like the original (and probably simplest example) in IntervalHistogramRecorder.java, or its more recent replacements in DoubleRecorder.java and Recorder.java, which add some unrelated and more complicated logic that deals with safely avoiding some copy costs on getIntervalHistogram() variants.

And yes, it is now all in the public domain.

Enjoy.


[1] For an apparent etymology of the term "Yak Shaving", read the example story attributed here.

Friday, October 31, 2014

What sort of allocation rates can servers handle?

First, a side note: This blog entry is a [nearly verbatim] copy of a posting I made on the Mechanical Sympathy Google Group. I'm lazy. But I recycle. So I think of that as a net positive.

The discussion in question actually started from a question about what a good choice of hardware for running a low latency application may look like these days, but then evolved into other subjects (as many of the best discussions on the group do), one of which was allocation rates.

Several smart and experienced people on the group chimed in and shared their hard earned wisdom, a lot of which came down to recommendations like "keep your allocation rates low", and "reducing allocation rates is one of the best tools to improve application behavior/performance". Specific numbers were cited (e.g. "My current threshold ... is ~300-400MB/sec").

This made that big "Java applications work hard to use less than 5% of today's toy server capabilities" chip I carry on my shoulder itch. I decided to scratch the itch by pointing out that one thing (and one thing only) is making people work hard to keep their apps within those silly limits: It's all about GC Pauses.

To support my claim, I went on a trivial math spree to show that even today's "toy" commodity servers can easily accommodate a rate of allocation 50x higher than the levels people try to contain their applications within, and that the only bad thing about a higher allocation rate is higher pause artifacts. In the poor systems that have those pause artifacts, of course....

The rest of the below is the actual posting:

...
These "keep allocation rates down to 640KB/sec" (oh, right, you said 300MB/sec) guidelines are purely driven by GC pausing behavior. Nothing else.

Kirk (and others) are absolutely right to look for such limits when pausing GCs are used. But the *only* thing that makes allocation rates a challenge in today's Java/.NET (and other GC based) systems is GC pauses. All else (various resources spent or wasted) falls away with simple mechanical sympathy math. Moore's law is alive and well (for now). And hardware-related sustainable allocation rate follows it nicely. 20+GB/sec is a very practical level on current systems when pauses are not an issue. And yes, that's 50x the level that people seem to "tune" for by crippling their code or their engineers...

Here is some basic mechanical sympathy driven math about sustainable allocation rates (based mostly on Xeons):

1. From a speed and system resources spent perspective, sustainable allocation rate has roughly followed Moore's law over the past 5 Xeon CPU generations.

  1.1 From a CPU speed perspective:

  • The rate of sustainable allocation of a single core (at a given frequency) is growing very slowly over time (not @ Moore's law rates, but still creeping up with better speed at similar frequency, e.g. Haswell vs. Nehalem).
  • The number of cores per socket is growing nicely, and with it the overall CPU power per socket (@ roughly Moore's law). (e.g. from 4 cores per socket in late 2009 to 18 cores per socket in late 2014).
  • The overall CPU power available to sustain allocation rate per socket (and per 2 socket system, for example) is therefore growing at roughly Moore's law rates.
  1.2 From a cache perspective:
  • L1 and L2 cache size per core have been fixed for the past 6 years in the Xeon world.
  • The L3 cache size per core is growing fairly slowly (not at Moore's law rates), but the L3 cache per socket has been growing slightly faster than number of cores per socket. (e.g. from 8MB/4_core_socket in 2009 to 45MB/18_core_socket in late 2014).
  • The cache size per socket has been growing steadily at Moore's law rates.
  • With the cache space per core growing slightly over time, the cache available for allocation work per core remains fixed or better.
  1.3 From a memory bandwidth point of view:
  • The memory bandwidth per socket has been steadily growing, but at a rate slower than Moore's law. E.g. A late 2014 E5-2690 V3 has a max bandwidth of 68GB/sec. per socket. A late 2009 E5590 had 32GB/sec of max memory bandwidth per socket. That's a 2x increase over a period of time during which CPU capacity grew by more than 4x.
  • However, the memory bandwidth available (assuming sustainable memory bandwidth is 1/3 or 1/2 of max) is still WAY up there, at 1.5GB-3GB/sec/core (that's out of a max of about 4-8GB/sec per core, depending on cores/socket chosen).
  • So while there is a looming bandwidth cap that may hit us in the future (bandwidth growing slower than CPU power), it's not until we reach allocation levels of ~1GB/sec/core that we'll start challenging memory bandwidth in current commodity server architectures. 
  • From a memory bandwidth point of view, this translates to >20GB/sec of comfortably sustainable allocation rate on current commodity systems.
  1.4 From a GC *work* perspective:
  • From a GC perspective, work per allocation unit is a constant that the user controls (with the ratio of empty to live memory).
  • On Copying or Mark/Compact collectors, the work spent to collect a heap is linear to the live set size (NOT the heap size).
  • The frequency at which a collector has to do this work (a small worked example with illustrative numbers follows this list) roughly follows: 
                         allocation_rate / (heap_size - live_set_size)
  • The overall work per unit of time therefore follows the allocation rate (for a given live_set_size and heap_size).
  • And the overall work per allocation unit is therefore a constant (for a given live_set_size and heap_size)
  • The constant is under the user's control. E.g. user can arbitrarily grow heap size to decrease work per unit, and arbitrarily shrink memory to go the other way (e.g. if they want to spend CPU power to save memory).
  • This math holds for all current newgen collectors, which tend to dominate the amount of work spent in GC (so not just in Zing, where it holds for both newgen and oldgen).
  • But using this math does require a willingness to grow the heap size with Moore's law, which people have refused to do for over a decade. [driven by the unwillingness to deal with the pausing effects that would grow with it]
  • [BTW, we find it to be common practice, on current applications and on current systems, to deal with 1-5GB/sec of allocation rate, and to comfortably do so while spending no more than 2-5% of overall system CPU cycles on GC work. This level seems to be the point where most people stop caring enough to spend more memory on reducing CPU consumption.]
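To put the frequency formula above into concrete (purely illustrative) numbers: with allocation_rate = 2GB/sec, heap_size = 40GB, and live_set_size = 8GB, a collection cycle is needed roughly once every (40GB - 8GB) / 2GB/sec = 16 seconds, and each cycle does work proportional to the 8GB live set. Grow the heap to 72GB and the same collector runs half as often, once every (72GB - 8GB) / 2GB/sec = 32 seconds, for the same work per cycle, halving the GC work per unit of time. That's the "spend memory to save CPU" knob described above, in action.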

2. From a GC pause perspective:
  • This is the big bugaboo. The one that keeps people from applying all the nice math above. The one that keeps Java heaps and allocation rates today at the same levels they were 10 years ago. The one that seems to keep people doing "creative things" in order to keep leveraging Moore's law and having programs that are aware of more than 640MB of state.
  • GC pauses don't have to grow with Moore's law. They don't even have to exist. But as long as they do, and as long as their magnitude grows with the attempt to linearly grow state and allocation rates, pauses will continue to dominate people's tuning and coding decisions and motivations. [and by magnitude, we're not talking about averages. We're talking about the worst thing people will accept during a day.]
  • GC pauses seem to be limiting both allocation rates and live set sizes.
  • The live set size part is semi-obvious: If your [eventual, inevitable, large] GC pause grows with the size of your live set or heap size, you'll cap your heap size at whatever size causes the largest pause you are willing to bear. Period.
  • The allocation rate part requires some more math, and this differs for different collector parts:
  2.2 For the newgen parts of the collector: 
  • By definition, a higher allocation rate requires a linearly larger newgen size to maintain the same "objects die in newgen" properties. [e.g. if you put 4x as many cores to work doing the same transactions, with the same object lifetime profiles, you need 4x as much newgen to avoid promoting more things with larger pauses].
  • While "typical" newgen pauses may stay just as small, a larger newgen linearly grows the worst-case amount of stuff that a newgen *might* promote in a single GC pause, and with it grows the actual newgen pause experienced when promotion spikes occur. 
  • Unfortunately, real applications have those spikes every time you read in a bunch of long-lasting data in one shot (like updating a cache or a directory, or reading in a new index, or replicating state on a failover).
  • Latency sensitive apps tend to cap their newgen size to cap their newgen pause times, in turn capping their allocation rate.

  2.3 For oldgen collectors:
  • Oldgen collectors that pause for *everything* (like ParallelGC) actually don't get worse with allocation rate. They are just so terrible to begin with (pausing for ~1 second per live GB) that outside of batch processing, nobody would consider using them for live sets larger than a couple of GB (unless they find regularly pausing for more than a couple of seconds acceptable).
  • Most Oldgen collectors that *try* to not pause "most" of the time (like CMS) are highly susceptible to allocation rate and mutation rate (and mutation rate tends to track allocation rate linearly in most apps). E.g. the mostly-concurrent-marking algorithms used in CMS and G1 must revisit (CMS) or otherwise process (G1's SATB) all references mutated in the heap before they finish. The rate of mutation increases the GC cycle time, while at the same time the rate of allocation reduces the time the GC has in order to complete its work. At a high enough allocation rate + mutation rate level, the collector can't finish its work fast enough and a promotion failure or a concurrent mode failure occurs. And when that occurs, you get that terrible pause you were trying to avoid. 
  • As a result, even for apps that don't try to maintain "low latency" and only go for "keep the humans happy" levels, most current mostly-concurrently collectors only remain mostly-concurrent within a limited allocation rate. Which is why I suspect these 640KB/sec (oh, right, 300MB/sec) guidelines exist.

Bottom line:

When pauses are not there to worry about, sustaining many GB/sec of allocation is a walk in the park on today's cheap, commodity servers. It's pauses, and only pauses, that make people work so hard to fit their applications in a budget that is 50x smaller than what the hardware can accommodate. People that do this do it for good reason. But it's too bad they have to shoulder the responsibility for badly behaving garbage collectors. When they can choose (and there is a choice) to use collectors that don't pause, the pressure to keep allocation rates down changes, moving the "this is too much" lines up by more than an order of magnitude.

With less pauses comes less responsibility.

[ I need to go do real work now... ]

Saturday, March 29, 2014

A Pauseless HashMap

HashMaps are great. And fast. Well, fast most of the time. If you keep growing them, you'll get elevator music every once in a while. Then they go fast again. For a while.

Wouldn't it be nice if HashMaps didn't stall your code even when they were resizing?

Some background: As those of you who have read my various past rants may have noticed, I spend a lot of my time thinking about the behavior of {latency, response-time, reaction-time}. In addition to trying to better understand or teach about the behavior (with monitoring and measurement tools like HdrHistogram, LatencyUtils, and jHiccup), I actually work on things that try to improve bad behavior. For some definitions of "improve" and "bad". Eliminating pausing behavior in GC was the lowest hanging fruit, but more recent work has focused on eliminating pauses due to other things that stand out once those pesky GC blips are gone. Things like at-market-open deoptimizations. Things like lock deflation, lock de-biasing, class unloading, weak reference processing, and all sorts of TTSP (time to safepoint) issues. I've also learned a lot about how to bring down Linux's contribution to latency spikes.

But the JVM and the OS are not the only things that cause latency spikes. Sometimes it's your code, and the code is doing something "spiky". In my day job, I keep running into actual, real-world low latency system code that is typically super-fast, but occasionally spikes in actual work latency due to some rare but huge piece of work that needs to be done. This is most often associated with some state accumulation. Once we eliminate GC pauses (which tend to dominate latency spikes, but also tend to simply disappear when Zing is applied), we get to see the things that were hiding in the GC noise. We often run into "nice" patterns of growing latency spikes at growing intervals, with a near-perfect doubling in both magnitude and interval between the spikes. This happens so often that we've studied the common causes, and (by far) the most common culprits seem to be HashMaps. The kind used to accumulate something during the day, and which resize in powers-of-two steps as a result.
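That power-of-two pattern is easy to reproduce. The toy program below (illustrative only; give it a few GB of heap, e.g. -Xmx4g, and expect some GC noise mixed in) just keeps putting entries into a plain java.util.HashMap and logs any put that stalls for a millisecond or more. The resize-driven spikes show up at roughly doubling intervals, with roughly doubling magnitudes:

import java.util.HashMap;

public class HashMapResizeSpikes {
    public static void main(String[] args) {
        HashMap<Integer, Integer> map = new HashMap<>();
        for (int i = 0; i < 20_000_000; i++) {
            long start = System.nanoTime();
            map.put(i, i);
            long elapsedMsec = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMsec >= 1) { // log any put that took a millisecond or more
                System.out.println("put #" + i + " took " + elapsedMsec + " msec");
            }
        }
    }
}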

I've had "build a Pauseless HashMap" on my weekend project list for over a year now, but finally got around to actually building it (at the request of a friend on the mechanical sympathy group). There are probably at least 17 ways to skin a HashMap so it won't stall puts and gets when it resizes, but this is my simple take on it: 


Keep in mind that (so far) this is a "probably-working draft" that's gone through some bench testing, but is not yet battle hardened (scrutiny is welcome).

I intentionally based this version on Apache Harmony's version of HashMap, and not on OpenJDK's, in order to make it available without GPLv2 license restrictions (for those who may want to include it in non-GPL products). The background resizing concept itself is simple, and can be applied just as easily to the OpenJDK version (e.g. if some future Java SE version wants to use it). You can use (https://svn.apache.org/repos/asf/harmony/enhanced/java/trunk/classlib/modules/luni/src/main/java/java/util/HashMap.java) as a baseline comparison for the code I started with.

This is also a classic example of how GC makes this sort of concurrent programming thing both fast and simple. This is a classic case of an asymmetric speed need between two concurrent actors that share mutable state. I worked hard to make the fast path get() and put() cheap, and managed (I think) to not even use volatiles in the fast path. In doing this, I shifted all the cost I could think of to the background work, where latency doesn't matter nearly as much. This sort of trick would be much harder (or slower) to do if GC wasn't there to safely pick up the junk behind us, as it would (invariably, I think) add a need for additional synchronizing work in the fast path.