Friday, October 31, 2014

What sort of allocation rates can servers handle?

First, a side note: This blog entry is a [nearly verbatim] copy of a posting I made on the Mechanical Sympathy Google Group. I'm lazy. But I recycle. So I think of that as a net positive.

The discussion in question actually started from a question about what a good choice of hardware for running a low latency application may looks like these days, but then evolved into other subjects (as many of the best discussions on the group do), one of which was allocation rates.

Several smart and experienced people on the group chimed in and shared their hard earned wisdom, a lot of which came down to recommendations like "keep your allocation rates low", and "reducing allocation rates is one of the best tools to improve application behavior/performance". Specific numbers were cited (e.g. "My current threshold ... is ~300-400MB/sec").

This made that big "Java applications work hard to use less than 5% of today's toy server capabilities" chip I carry on my shoulder itch. I decided to scratch the itch by pointing out that one thing (and one thing only) is making people work hard to keep their apps within those silly limits: It's all about GC Pauses.

To support my claim, I went on a trivial math spree to show that even today's "toy" commodity servers can easily accommodate a rate of allocation 50x higher than the levels people try and contain their applications within, and that the only bad thing about a higher allocation rate is higher pause artifacts. In the poor systems that have those pause artifacts, of course....

The rest of the below is the actual posting:

These "keep allocation rates down to 640KB/sec" (oh, right, you said 300MB/sec) guidelines are purely driven by GC pausing behavior. Nothing else.

Kirk (and others) are absolutely right to look for such limits when pausing GCs are used. But the *only* thing that makes allocation rates a challenge in todays Java/.NET (and other GC based) systems is GC pauses. All else (various resources spent or wasted) falls away with simple mechanical sympathy math. Moore's law is alive and well (for now). And hardware-related sustainable allocation rate follows it nicely. 20+GB/sec is a very practical level on current systems when pauses are not an issue. And yes, that's 50x the level at which people seem to "tune" for by crippling their code or their engineers...

Here is some basic mechanical sympathy driven math about sustainable allocation rates (based mostly on Xeons):

1. From a speed and system resources spent perspective, sustainable allocation rate roughly follows Moore's law for the past 5 Xeon CPU generations.

  1.1 From a CPU speed perspective:

  • The rate of sustainable allocation of a single core (at a given frequency) is growing very slowly over time (not @ Moore's law rates, but still creeping up with better speed at similar frequency, e.g. Haswell vs. Nehalem).
  • The number of cores per socket is growing nicely, and with it the overall overall CPU power per socket (@ roughly Moore's law). (e.g. from 4 cores per socket in late 2009 to 18 cores per socket in late 2014).
  • The overall CPU power available to sustain allocation rate per socket (and per 2 socket system, for example) is therefore growing at roughly Moore's law rates.
  1.2 From a cache perspective:
  • L1 and L2 cache size per core have been fixed for the past 6 years in the Xeon world.
  • The L3 cache size per core is growing fairly slowly (not at Moore's law rates), but the L3 cache per socket has been growing slightly faster than number of cores per socket. (e.g. from 8MB/4_core_socket in 2009 to 45MB/18_core_socket in late 2014).
  • The cache size per socket has been growing steadily at Moore's law rates.
  • With the cache space per core growing slightly over time, the cache available for allocation work per core remains fixed or better.
  1.3 From a memory bandwidth point of view:
  • The memory bandwidth per socket has been steadily growing, but at a rate slower than Moore's law. E.g. A late 2014 E5-2690 V3 has a max bandwidth of 68GB/sec. per socket. A late 2009 E5590 had 32GB/sec of max memory bandwidth per socket. That's a 2x increase over a period of time during which CPU capacity grew by more than 4x.
  • However, the memory bandwidth available (assume sustainable memory bandwidth is 1/3 or 1/2 of max), is still WAY up there, at 1.5GB-3GB/sec/core (that's out of a max of about 4-8GB/sec per core, depending on cores/socket chosen).
  • So while there is a looming bandwidth cap that may hit us in the future (bandwidth growing slower than CPU power), It's not until we reach allocation levels of ~1GB/sec/core that we'll start challenging memory bandwidth in current commodity server architectures. 
  • From a memory bandwidth point of view, this translates to >20GB/sec of comfortably sustainable allocation rate on current commodity systems..
  1.4 From a GC *work* perspective:
  • From a GC perspective, work per allocation unit is a constant that the user controls (with ratio or empty to live memory).
  • On Copying or Mark/Compact collectors, the work spent to collect a heap is linear to the live set size (NOT the heap size).
  • The frequency at which a collector has to do this work roughly follows: 
                         allocation_rate / (heap_size - live_set_size)
  • The overall work per time unit is therefore follows allocation rate (for a given live_set_size and heap_size).
  • And the overall work per allocation unit is therefore a constant (for a given live_set_size and heap_size)
  • The constant is under the user's control. E.g. user can arbitrarily grow heap size to decrease work per unit, and arbitrarily shrink memory to go the other way (e.g. if they want to spend CPU power to save memory).
  • This math holds for all current newgen collectors, which tend to dominate the amount of work spent in GC (so not just in Zing, where it holds for both newgen and olden).
  • But using this math does require a willingness to grow the heap size with Moore's law, which people have refused to do for over a decade. [driven by the unwillingness to deal with the pausing effects that would grow with it]
  • [BTW, we find it to be common practice, on current applications and on current systems, to deal with 1-5GB/sec of allocation rate, and to confortably do so while spend no more than 2-5% of overall system CPU cycles on GC work. This level seems to be the point where most people stop caring enough to spend more memory on reducing CPU consumption.]

2. From a GC pause perspective:
  • This is the big bugaboo. The one that keeps people from applying all the nice math above. The one that keeps Java heaps and allocation rates today at the same levels they were 10 years ago. The one that seems to keep people doing "creative things" in order to keep leveraging Moore's law and having programs that are aware of more than 640MB of state.
  • GC pauses don't have to grow with Moore's law. They don't even have to exist. But as long as they do, and as long as their magnitude grows with the attempt to linearly grow state and allocation rates. Pauses will continue to dominate people's tuning and coding decisions and motivations. [and by magnitude, we're not talking about averages. We're talking about the worst thing people will accept during a day.]
  • GC pauses seem to be limiting both allocation rates and live set sizes.
  • The live set size part is semi-obvious: If your [eventual, inevitable, large] GC pause grows with the size of your live set or heap size, you'll cap your heap size at whatever size causes the largest pause you are willing to bear. Period.
  • The allocation rate part requires some more math, and this differs for different collector parts:
  2.2 For the newgen parts of collector: 
  • By definition, a higher allocation rate requires a linearly larger newgen sizes to maintain the same "objects die in newgen" properties. [e.g. if you put 4x as many cores to work doing the same transactions, with the same object lifetime profiles, you need 4x as much newgen to avoid promoting more things with larger pauses].
  • While "typical" newgen pauses may stay just as small, a larger newgen linearly grows the worst-case amount of stuff that a newgen *might* promote in a single GC pauses, and with it grows the actual newgen pause experienced when promotion spikes occur. 
  • Unfortunately, real applications have those spikes every time you read in a bunch of long-lasting data in one shot (like updating a cache or a directory, or reading in a new index, or replicating state on a failover), 
  • Latency sensitive apps tend to cap their newgen size to cap their newgen pause times, in turn capping their allocation rate.

  2.3 For oldgen collectors:
  • Oldgen collectors that pause for *everything* (like ParallelGC) actually don't get worse with allocation rate. They are just so terrible to begin with (pausing for ~1 second per live GB) that outside of batch processing, nobody would consider using them for live sets larger than a couple of GB (unless they find regularly pausing for more than a couple of seconds acceptable).
  • Most Oldgen collectors that *try* to not pause "most" of the time (like CMS) are highly susceptible to allocation rate and mutation rate (and mutation rate tends to track allocation rate linearly in most apps). E.g. the mostly-concurrent-marking algorithms used in CMS and G1 must revisit (CMS) or otherwise process (G1's SATB) all references mutated in the heap before it finishes. The rate of mutation increases the GC cycle time, while at the same time the rate of allocation reduces the time the GC has in order to complete it's work. At a high enough allocation rate + mutation rate level, the collector can't finish it's work fast enough and a promotion failure or a concurrent mode failure occurs. And when that occurs, you get that terrible pause you were trying to avoid. 
  • As a result, even for apps that don't try to maintain "low latency" and only go for "keep the humans happy" levels, most current mostly-concurrently collectors only remain mostly-concurrent within a limited allocation rate. Which is why I suspect these 640KB/sec (oh, right, 300MB/sec) guidelines exist.

Bottom line:

When pauses are not there to worry about, sustaining many GB/sec of allocation is a walk in the park on today's cheap, commodity servers. It's pauses, and only pauses, that make people work so hard to fit their applications in a budget that is 50x smaller than what the hardware can accommodate. People that do this do it for good reason. But it's too bad they have to shoulder the responsibility for badly behaving garbage collectors. When they can choose (and there is a choice) to use collectors that don't pause, the pressure to keep allocation rates down changes, moving the "this is too much" lines up by more than an order of magnitude.

With less pauses comes less responsibility.

[ I need to go do real work now... ]


  1. This comment has been removed by a blog administrator.

  2. This comment has been removed by a blog administrator.

  3. Hi, just got here after listening to Your brilliant "Understanding Java GC" presentation from 2012 (thank You a TON for compiling and sharing it!!)

    As much as I'd like us to try Zing out - it'd take lots of time before it reaches production datacenter,
    and for short-term - I don't think the miracle of plugglable-GCs exists in JVMs (to use C4 instead of CMS).

    Since our throughput had grown - we're facing >100sec GC pauses, but RAM isn't over-utilized at all (live data is less than 5GB).
    To mitigate the "length of GC pause exceeds human/software timeouts" - I'm planning to try a bit counter-intuitive approach:
    - REDUCE the heap_size 10x times
    - deploy 10x tomcat instances
    - reduce live_set_size "almost 10x" times (balancer routes calls per user-hash, so we can reduce user-obj caches 10x, though static/config caches will remain same)

    If understood Your math corrcetly - this will cause each instance to have:
    - almost-same frequency of fullGCs (allocation_rate/10, heap_size/10, live_set_size/almost_10)
    - 10x shorter duration of each fullGC (heap_size/10)
    And overall CPU use should remain same (same load spread into smaller buckets).

    I understand the approach is not that efficient in comparison to C4, but still effective up to some NNN-times scale (100x?), with following limitations:
    1) at some scale the need to keep NNN copies of static/config caches (live set) should skew the "almost-NNNx" math into impractical
    2) non-uniformness of user-data (large users) will start to suffer from walls of too-small-buckets

    Just curious if any mainstream JVM is capable of releasing unused heap (above -Xms level) back to OS? (this would've addressed the #2 above)

    1. The approach of slicing larger processes into smaller ones in order to reduce the max pause time seen in each is certainly common. It's what I call the "640KB" design style (which is more like 640MB these days, but same concept). The approach certainly works at some levels, with enough "duct tape" to keep things working. It's a good example of wasted engineering effort and inefficient design driven by the need to work around around a single problem: pause time that are too large to be acceptable.

      As to what scale the approach is efficient to, this varies dramatically. E.g. in the second tier of distributed cache systems, you can probably see N go to the low tens (within a physical system). But in most actual applications (you mention tomcat as a container), this approach usually caps out with N in the single digits (on a single system) before getting into problems. Hosting tens of JVMs that are mostly idle, or tens of JVMs that all carry nearly-identical work patterns seems to work on a single System. But going to tens of JVMs that are all active and all have disparate timing and working sets tends to lead to thrashing. A single active JVM is a hungry thing (especially when it is busy doing a multi-second GC). And tens of these hungry processes don't tend to make good neighbors.

      When any form of caching is involved, the inefficiency of splitting and replicating the cached data comes into play pretty early, too. You often end up either with a smaller cache per instance (most common) which results in higher miss rates in the instance-local cache and higher actual work that needs to be done as a result. Or you end up replicating the cache in each process (whether it's kept coherent or not), which leads to dramatically increased GC work in the system as a whole and still keeps GC pauses in each instance high.

      To the question about mainstream JVMs that are capable of releasing unused heap back to the OS. There is one that does this very well: Zing. It is completely elastic, with all pages (above a dedicated level) released back to the OS immediately as they are collected. Other HotSpot variants will also dynamically adjust their heap (down to Xms) if there is no pressure on it. But this much more slowly adjusting behavior is delayed, and only occurs when the JVM load or working set drops on the individual JVM dramatically for a long period of time (multiple oldgen cycles). Slowly releasing memory when idle doesn't do much for you when all those N JVMs are actually active at the same time, which would be an inherent behavior if they are just split-up portions of what would otherwise be a single active instance.

      As to the notion that it will take a long time to get Zing into your production data center: I'm often surprised at the sort of re-engineering people do aim at their production datacenter deployments (like splitting their processes up into lots of small pieces, with all the disruptive changes that entails) when the alternative is much simpler and easier to get through even the most rigorous testing and re-qualification processes. Lots of people use Zing in production datacenters, either in place of HotSpot or side by side (some apps use Zing, other use HotSpot). Shifting to using Zing is invariably easier to convince your datacenter folks to do than a redesign and re-architecture of the deployment of your nice working (except for pauses) application that would increase process counts by 10x for the same workload (and the rigorous testing under varying load conditions that would be needed to study the edges of the new load-driven failure modes and load-bearing behaviors that sort of fine-grain splitting creates).