Sunday, November 16, 2014

WriterReaderPhaser: A story about a new (?) synchronization primitive

I recently added a synchronization primitive mechanism to my HdrHistogram and LatencyUtils code, which I think has generic use for some very common operations. Specifically, for when wait-free writers are updating stuff that background analyzers or loggers need to look at. I've isolated it in what I now call a WriterReaderPhaser. The name is very intentional, and we'll get to that in a moment. And to the code (all 66 actual lines of it, 200 with elaborate comments). But first, I'll stray into some "how did this come about" storytelling.

WriterReaderPhaser is a new (I think) synchronization primitive: It provides a straightforward interface and API for coordinating wait-free writes to a shared data structure with blocking read operations on the same data. Readers see a stable (i.e. non-changing, coherent) data set while writers continue to modify data without waiting. Readers are guaranteed forward progress, and will only block for other readers and for writers that may have been "in flight" at the time the reader establishes a stable view of the data.

How did this come about?


This sometimes happens when I build stuff: I find myself in need of some behavior that I thought would be common, but for which I can't find an existing implementation, or a name, or a description. This can obviously be ascribed to my weak Google-fu skills, but after a while I give up and just build the thing, because "it's not that complicated". So I build a one-off implementation into whatever I am doing at the time, and move on with life. At some later point, I find myself needing the same thing again. And since I had already solved that problem once, I go back to my old code and (let's be honest) copy-and-paste my first implementation into whatever new thing I'm working on. Sometimes the little guy on my right shoulder wins over the other guy, and I come back and refactor the behavior into a separate class and build an API for more generic use, at which point the "does this deserve its own library? Its own repo?" thinking starts, coupled with much Yak Shaving [1]. Sometimes the guy on the left shoulder wins, and I actually get on with the real work I was supposed to be doing. I'll leave it to you to decide which little guy is red and which is white.

Sometimes (usually much later) I realize that what I built was actually new. That even though I thought it was a common use case, and built my version simply out of impatience or frustration at not finding something I could use as-is, I may actually be the first person to solve it. Most of those times, this realization is quickly followed by someone showing me a paper or a piece of code that is 30 years old that makes me go "oh... right.". But sometimes that doesn't happen. Sometimes it really is new.

HdrHistogram itself started this way. It was nothing more than about 100 lines of code in a one-off "JitterMeter" tool I was playing with, which needed to record latencies very quickly and report accurate percentiles with many nines in them. Then I found myself building all sorts of variations on jitter meters and sharing them (jHiccup is an evolved version with a better name). And then I found that people (myself included) were taking the code and ripping out just the histogram trick inside, because they needed a histogram that was actually useful for talking about latencies. Recognizing that a fast histogram with good precision and accurate and fine grained quantile reporting capability is actually a very common use case, I decided to build a Yak shaving co-op on github and called it HdrHistogram. The first Yak hair I produced was Java-colored but others have recently added other colors and breeds.

HdrHistogram is a [presumably] successful example of this process going the distance. More often than not, it doesn't. That's probably what my stale repos on github with 2 stars and no forks represent.

WriterReaderPhaser is currently about halfway through this precarious process, but at this point I'm pretty sure it's not going to die. It's a class on its own, but not yet its own library. Certainly not its own repo yet. It will need to find a home, but org.giltene.stuff is probably not where it needs to end up. Since it's so short, this blog entry is as good a home as any for now.

Most importantly, it looks like it may actually be a new and generically useful synchronization primitive. More accurately: nobody has shown me that "oh... right." link or paper yet, and I'm done holding my breath for now.

So what is WriterReaderPhaser about? 


Have you ever had a need for logging or analyzing data that is actively being updated? Have you ever wanted to do that without stalling the writers (recorders) in any way? If so, then WriterReaderPhaser is for you.

 I'm not talking about logging messages or text lines here. I'm talking about data. Data larger than one word of memory. Data that holds actual interesting state. Data that keeps being updated, but needs to be viewed in a stable and coherent way for analysis or logging. Data like frame buffers. Data like histograms. Data like usage counts. Data that changes.

Existing solutions


Sure, you can use channels, queues, or magic rings to move data updates and safely process them in background copies of the data. You can use persistent data structures and all sorts of immutable trickery. But those are expensive. As in orders of magnitude more expensive than updating in-cache state in place. When this data thing you want to look at could be updated millions of times per second, you invariably end up with some sort of double-buffered (or multi-buffered) scheme: Updates are done to an active copy, and analysis is done "in the background" on stable, inactive copies.


Double buffered schemes usually involve some sort of "phase flipping". At some point the notion of which copy is active changes. Writers update the "new" active copy, and readers access a stable and coherent copy that used to be active, but now isn't. It's this phase flipping that usually gets in the way of keeping writers from blocking.

There are all sorts of variations on how to do this flipping. We can obviously use some form of mutual exclusion lock to protect the writes and the flip. But then writers will block each other, and be blocked by the flipping operation. We can use ReaderWriter locks backwards: where the state being protected by the ReaderWriter lock would be the notion of which data set is the "active" one (the one writers write to). In this scheme writers take the read lock for the duration of their active state modification operations, while readers take the write lock to flip the roles of active and inactive data sets. This can be [much] better than complete mutual exclusion when multiple writers are involved, since writers no longer block other writers, but readers still block writers during a flip. Also, when you start asking yourself "what does 'read' mean again in this context?" that is a good sign you have a problem. Most people write buggier code when standing on their head and juggling. I'm sure there are a whole bunch of other schemes people use, but in my looking around thus far, I didn't find any examples that were non-blocking for the writers.
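
To make the "backwards" ReaderWriter lock idea concrete, here is a minimal Java sketch of that scheme. The class and field names are made up for illustration (they are not from any of the code discussed here), and it still has the problem described above: writers are blocked whenever a flip is in progress.

import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// The ReadWriteLock protects the notion of which buffer is "active":
// writers take the *read* lock while recording, and a reader takes the
// *write* lock to flip the active/inactive roles.
class BackwardsRwLockRecorder {
    private final ReadWriteLock flipLock = new ReentrantReadWriteLock();
    private volatile AtomicLongArray activeCounts = new AtomicLongArray(1024);
    private AtomicLongArray inactiveCounts = new AtomicLongArray(1024);

    void recordCount(int index) {
        flipLock.readLock().lock();   // shared: writers do not block each other...
        try {
            activeCounts.incrementAndGet(index);
        } finally {
            flipLock.readLock().unlock();
        }
    }

    AtomicLongArray flipAndGetStableCounts() {
        flipLock.writeLock().lock();  // ...but they are blocked while a flip is in progress
        try {
            AtomicLongArray tmp = inactiveCounts;
            inactiveCounts = activeCounts;
            activeCounts = tmp;
            return inactiveCounts;    // quiescent until the next flip
        } finally {
            flipLock.writeLock().unlock();
        }
    }
}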

Why did I care?


The thing I actually wanted to double-buffer was a histogram. And not just any histogram. A fixed-footprint histogram that supports lossless recording of experienced latencies, such that later computation of precise percentiles will be possible, all the way to the as-many-9s-as-there-are-in-the-data level. The very purpose of such a histogram is often to capture and analyze latency outlier behavior. The recording operation cannot be allowed to be a cause of the very outliers it is trying to measure. For the latency recording mechanism to have any susceptibility to blocking or locking would be unacceptable.

These latency histograms are basically non-blocking data structures with tens (or hundreds) of kilobytes of state that is rapidly being mutated by critical path "writer" code. But I wanted to log their contents over intervals that are short enough to be interesting for monitoring purposes, and for later time based analysis. In order to log the latency information being captured, I needed a logging "reader" to somehow gain access to a stable, coherent "snapshot" of the latency data that was recorded during some prior interval. To do this, I needed a way for the reader to flip the roles of the active and inactive histograms, but I needed to do that without ever blocking the writers. This is a classic case of an asymmetric synchronization need. I'm fine blocking, delaying and pausing the reader. I just can't afford for the writers to ever block or otherwise delay the execution of the thread they are recording in.

In comes WriterReaderPhaser. And the best starting point for understanding what it does is to dissect the name:

The Phaser part is there because its main function is to coordinate phase shifts between the writers and the readers. Besides, I couldn't bring myself to call this thing a lock. It's not a lock. Not in its most important function, which is phase shift coordination. Writers remain lock-free in all cases (they actually remain wait-free on architectures that support atomic increment operations). They never block or lock. Calling WriterReaderPhaser a lock would be like calling an AtomicLong an "add lock" because someone could also construct a spin-lock around it....

The WriterReader part is a reversal of the commonly used ReaderWriter (or ReadWrite) term. ReaderWriter locks are asymmetric, but in the reverse direction of what I needed: they enable [relatively] smooth reader operation while causing the writers to block. The really cool wait-free Left-Right construct, which Martin Thompson had pointed me to, achieves perfectly smooth reader operation, but that's still not what I needed. WriterReaderPhaser works for exactly the reversed need: Writers remain non-blocking and perfectly smooth, while only readers suffer.

The desired behaviors I was looking for in a WriterReaderPhaser were:

1. Writers remaining lock-free at all times. Ideally they will remain wait-free at all times.

2. A Reader can coordinate a phase flip and access to the inactive data such that:

2.1 Other readers will not flip a phase while this reader is still interested in the inactive data.

2.2 No writer modification will be made to the inactive data after the phase flip operation is complete, and for as long as the reader is interested in the inactive data.

2.3 Readers are guaranteed forward progress (even in the presence of heavy and continuous writer activity, and even when there is no writer activity at all).

Defining WriterReaderPhaser:


With these high level desired behaviors stated, let's clearly define the qualities and guarantees that a well implemented WriterReaderPhaser primitive would provide to users, and the relevant rules that users must adhere to in order to maintain those qualities and guarantees:

A WriterReaderPhaser instance provides the following 5 operations:
  • writerCriticalSectionEnter
  • writerCriticalSectionExit
  • readerLock
  • readerUnlock
  • flipPhase
When a WriterReaderPhaser instance is used to protect an actively updated data structure [or set of data structures] involving [potentially multiple] writers and [potentially multiple] readers, the assumptions on how readers and writers act are:
  • There are two sets of data structures (an "active" set and an "inactive" set)
  • Writing is done to the perceived active version (as perceived by the writer), and only within critical sections delineated by writerCriticalSectionEnter and writerCriticalSectionExit operations.
  • Only readers switch the perceived roles of the active and inactive data structures. They do so only while holding the readerLock, and the switch is only done before executing a flipPhase.
  • Readers do not hold onto readerLock indefinitely. 
  • Only readers perform readerLock and readerUnlock.
  • Writers do not remain in their critical sections indefinitely. 
  • Only writers perform writerCriticalSectionEnter and writerCriticalSectionExit.
  • Only readers perform flipPhase operations, and only while holding the readerLock.

When the above assumptions are met, WriterReaderPhaser guarantees that the inactive data structures are not being modified by any writers while being read under readerLock protection after a flipPhase operation.

The following progress guarantees are provided to writers and readers that adhere to the above stated assumptions:
  • Writer operations (writerCriticalSectionEnter and writerCriticalSectionExit) are wait-free (on architectures that support wait-free atomic increment operations).
  • flipPhase operations are guaranteed to make forward progress, and will only be blocked by writers whose critical sections were entered prior to the start of the reader's flipPhase operation, and have not yet exited their critical sections.
  • readerLock only blocks for other readers that are holding the readerLock.

Example use


Imagine a simple use case where a large set of rapidly updated counters is being modified by writers, and a reader needs to gain access to stable interval samples of those counters for reporting and other analysis purposes. 

The counters are represented in a volatile array of values (it is the array reference that is volatile, not the value cells within it):

volatile long counts[];
...

A writer updates a specific count (n) in the set of counters:

writerCriticalSectionEnter
   counts[n]++; // should use atomic increment if multi-writer
writerCriticalSectionExit

A reader gains access to a stable set of counts collected during an interval, reports on it, and accumulates it:

long interval_counts[];
long accumulated_counts[];

...
readerLock
   reset(interval_counts);
   long tmp[] = counts;
   counts = interval_counts;
   interval_counts = tmp;
flipPhase
   // At this point, interval_counts content is stable  
   report_interval_counts(interval_counts);
   accumulated_counts.add(interval_counts);
readerUnlock
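
For a more concrete (and hedged) version of the same example, here is roughly what it might look like in Java against the WriterReaderPhaser class in HdrHistogram. Treat it as a sketch rather than canonical usage: the package name and exact signatures are assumed here, including the form in which writerCriticalSectionEnter() returns a value that is handed back to writerCriticalSectionExit().

import java.util.concurrent.atomic.AtomicLongArray;
import org.HdrHistogram.WriterReaderPhaser;

// Sketch only: the interval counter example above, using WriterReaderPhaser.
class IntervalCounters {
    private final WriterReaderPhaser phaser = new WriterReaderPhaser();
    private volatile AtomicLongArray activeCounts = new AtomicLongArray(1024);
    private AtomicLongArray inactiveCounts = new AtomicLongArray(1024);

    // Writer path: wait-free; never blocks on readers or on a flip.
    void recordCount(int n) {
        long criticalValueAtEnter = phaser.writerCriticalSectionEnter();
        try {
            activeCounts.incrementAndGet(n);   // atomic increment: multi-writer safe
        } finally {
            phaser.writerCriticalSectionExit(criticalValueAtEnter);
        }
    }

    // Reader path: swap the active/inactive roles, flip the phase, then read
    // the now-stable inactive counts. May block for other readers, and for
    // writers that were in flight when the flip started.
    long[] getIntervalCounts() {
        phaser.readerLock();
        try {
            AtomicLongArray newActive = inactiveCounts;
            for (int i = 0; i < newActive.length(); i++) {
                newActive.set(i, 0);           // reset before making it active
            }
            inactiveCounts = activeCounts;
            activeCounts = newActive;          // new writes start landing here...
            phaser.flipPhase();                // ...and once this returns, inactiveCounts is stable
            long[] snapshot = new long[inactiveCounts.length()];
            for (int i = 0; i < snapshot.length; i++) {
                snapshot[i] = inactiveCounts.get(i);
            }
            return snapshot;
        } finally {
            phaser.readerUnlock();
        }
    }
}

The readerLock here is what keeps two readers from resetting and swapping the same buffers concurrently; the flipPhase is what guarantees that no in-flight writer is still touching the newly inactive buffer when the reader starts reading it.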

A working implementation


Under the hood, my WriterReaderPhaser implementation achieves these qualities in a fairly straightforward way, by using a dual set of epoch counters (an "odd" set and an "even" set) to coordinate the phase flip operations, coupled with a reader lock that is used purely to protect readers from each other in multi-reader situations: i.e. to prevent one reader from flipping a phase or changing the notion of active or inactive data while another reader is still operating on it. Many other implementation mechanisms are possible, but this one is certainly sufficient for the job at hand.

Rather than describe the logic in text, it is easiest to list it as code at this point. Below is the entire WriterReaderPhaser class as implemented in my current HdrHistogram repository, spelled out in Java code (most of which is detailed comments). The mechanism can obviously be ported to any language and environment that supports atomic increment and atomic swap operations. It's the API and documentation (in this case the details in the JavaDoc comments) that are more important. A simple example of how this is used in practice can be found in HdrHistogram's various interval histogram recorders, like the original (and probably simplest example) in IntervalHistogramRecorder.java, or its more recent replacements in DoubleRecorder.java and Recorder.java, which add some unrelated and more complicated logic that deals with safely avoiding some copy costs on getIntervalHistogram() variants.
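
For readers who just want the shape of the mechanism, here is a condensed sketch following the description above: odd/even end-epoch counters, with the phase encoded in the sign of the start epoch. The JavaDoc, the yield/sleep controls, and other details are stripped, and while the field and method names are intended to mirror the HdrHistogram class, treat this as a sketch rather than the authoritative listing.

import java.util.concurrent.atomic.AtomicLongFieldUpdater;
import java.util.concurrent.locks.ReentrantLock;

public class WriterReaderPhaser {
    private volatile long startEpoch = 0;
    private volatile long evenEndEpoch = 0;
    private volatile long oddEndEpoch = Long.MIN_VALUE;

    private final ReentrantLock readerLock = new ReentrantLock();

    private static final AtomicLongFieldUpdater<WriterReaderPhaser> startEpochUpdater =
            AtomicLongFieldUpdater.newUpdater(WriterReaderPhaser.class, "startEpoch");
    private static final AtomicLongFieldUpdater<WriterReaderPhaser> evenEndEpochUpdater =
            AtomicLongFieldUpdater.newUpdater(WriterReaderPhaser.class, "evenEndEpoch");
    private static final AtomicLongFieldUpdater<WriterReaderPhaser> oddEndEpochUpdater =
            AtomicLongFieldUpdater.newUpdater(WriterReaderPhaser.class, "oddEndEpoch");

    public long writerCriticalSectionEnter() {
        // The sign of the returned value identifies the phase the writer entered in:
        return startEpochUpdater.getAndIncrement(this);
    }

    public void writerCriticalSectionExit(long criticalValueAtEnter) {
        // Mark completion against the end epoch of the phase the writer entered in:
        if (criticalValueAtEnter < 0) {
            oddEndEpochUpdater.getAndIncrement(this);
        } else {
            evenEndEpochUpdater.getAndIncrement(this);
        }
    }

    public void readerLock() {
        readerLock.lock();
    }

    public void readerUnlock() {
        readerLock.unlock();
    }

    public void flipPhase() {
        if (!readerLock.isHeldByCurrentThread()) {
            throw new IllegalStateException("flipPhase() can only be called while holding the readerLock");
        }
        boolean nextPhaseIsEven = (startEpoch < 0); // i.e. the current phase is odd

        // Clear the end-epoch counter of the phase we are about to start:
        long initialStartValue = nextPhaseIsEven ? 0 : Long.MIN_VALUE;
        if (nextPhaseIsEven) {
            evenEndEpochUpdater.lazySet(this, initialStartValue);
        } else {
            oddEndEpochUpdater.lazySet(this, initialStartValue);
        }

        // Flip the phase: writers entering from here on count against the new epoch:
        long startValueAtFlip = startEpochUpdater.getAndSet(this, initialStartValue);

        // Wait for all writers that entered during the prior phase to exit:
        while ((nextPhaseIsEven ? oddEndEpoch : evenEndEpoch) != startValueAtFlip) {
            Thread.yield();
        }
    }
}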

And yes, it is now all in the public domain.

Enjoy.


[1] For an apparent etymology of the term "Yak Shaving", read the example story attributed here.

24 comments:

  1. The write-side atomic increments look like they implement a very fine grained epoch system for SMR. Classical epoch uses per-thread counters and gets away with only stores and barriers (one fetch-and-set and regular stores on TSO).

    Implementing a reader-writer lock with minimal read-side overhead on top of SMR (or RCU) is a classic exercise; in the case of your phaser, writers acquire the lock for reading and flippers acquire it for writing. If an application uses a lot of SMR-style constructs, it might make sense to centralise and amortise the epoch overhead in a full epoch reclamation scheme.

    BTW, do you know how the phaser compares to sequence locks, if one doesn't need concurrent writes?

    ReplyDelete
  2. There are certainly similarities to RCU, Left-Right, and Epoch based SMR (such as Epoch Reclamation), but the motivations are different, and with the differences in motivation come subtle (and sometimes not so subtle) important semantic differences in what is achieved.

    Epoch Reclamation (and, I think, other epoch based SMRs) are concerned with safe reclamation of memory, but not at all with maintaining a consistent state of the memory being reclaimed. They therefore don't (normally?) provide readers with the sort of guarantee that WriterReaderPhaser is meant to provide as its primary function: that the inactive state is stable, coherent, and is not being modified by any writers.

    Similarly, in schemes that would use a reader-writer lock to protect the notion of which data set is active (basically readers acquire as "writers" to flip the choice of active data set, writers acquire as "readers" to determine which data set is active), the readers are not guaranteed a stable & unchanging inactive data set, because there is no means for ensuring against some in-flight writer writing according to a stale notion of what is active.

    To your question about comparing with sequence locks: AFAICT, even in a single-reader and single-writer situation, where there are no concurrent writes, sequence locks do not protect against reader starvation. In contrast, WriterReaderPhaser guarantees forward progress for readers as long as writers do not stall in the middle of a critical section. Readers only block for writers that were in flight when a flipPhase is requested, and do not require any new writer activity to occur in order to successfully flip a phase.

    I've also played with single-reader-single-writer schemes (which are a common use case) where the reader requests a flip, but the writer is the one doing the flip. These schemes are generally cheaper on the critical path writer side (a single volatile read in the fast path as opposed to two atomic ops in WriterReaderPhaser), but have the significant downside of having the reader block until an actual write operation occurs. This is not a problem in systems that write continuously. E.g. I used this scheme in jHiccup (before switching to an IntervalHistogramRecorder that uses WriterReaderPhaser). Since the jHiccup writer ticks 1K times per second there was no problem of stalling the reader. However, in many systems that need regular interval recording (e.g. things that use LatencyUtils to record latencies as they occur), the need to record "idle" intervals where no writing has occurred is very real, and these writer-does-the-flip-on-request schemes fundamentally fail there.

    ReplyDelete
  3. Hi Gil,

    I reviewed your code and found some issues.

    First thing I don't understand is in IntervalHistogramRecorder::performIntervalSample():
    private void performIntervalSample() {
        inactiveHistogram.reset();
        try {
            recordingPhaser.readerLock();
            (...)
    How can you guarantee that there is no other "Reader" thread calling inactiveHistogram.reset() at the same time as the current "Reader" has acquired recordingPhaser.readLock() and is/has currently swapped inactiveHistogram with activeHistogram?

    The method getIntervalHistogram() is synchronized but it's on the IntervalHistogramRecorder instance. Shouldn't it be on recordingPhase.readLock() as well?
    public synchronized Histogram getIntervalHistogram() {
        Histogram intervalHistogram = new Histogram(inactiveHistogram);
        getIntervalHistogramInto(intervalHistogram);
        return intervalHistogram;
    }

    Please, don't get demotivated, but flipPhase() looks wrong as well.
    I think you should read the Left-Right paper more carefully, but hey, I'm biased ;)

    Cheers,
    Pedro

    ReplyDelete
    Replies
    1. > How can you guarantee that there is no other "Reader" thread calling
      > inactiveHistogram.reset() at the same time as the current "Reader" has
      > acquired recordingPhaser.readLock() and is/has currently swapped
      > inactiveHistogram with activeHistogram?

      In IntervalHistogramRecorder (not necessarily in the general WriterReaderPhaser case) the recordingPhaser is internal and private, and all non-recording methods are synchronized. There can't be another reader resetting the inactiveHistogram because of this synchronization.

      This would be a common case where the double-buffering nature remains an internal concern and the only things exposed to callers are stable and self-consistent copies. The readLock() only needs to be exposed to a caller of such things if they have a need to operate directly on an inactive data set, and not on a copy.

      > The method getIntervalHistogram() is synchronized but it's on the
      > IntervalHistogramRecorder instance. Shouldn't it be on
      > recordingPhase.readLock() as well?

      It's an either-or choice. We don't need both, as they both provide the same protection. I chose to use the readLock only where it was necessary to do so (in performIntervalSample(), to surround the flip, because the flip API requires it), and use the idiomatic "synchronized" for the public API calls.

      > Please, don't get demotivated, but flipPhase() looks wrong as well.
      > I think you should read the Left-Right paper more carefully, but
      > hey, I'm biased ;)

      No harm in being biased.

      And no worries, it's hard to get me demotivated. Even if you point to something that is actually incorrect or wrong.

      But given that the two points above aren't "wrong" (hopefully the explanation makes it clear that the implementation is correct), can you point out what you think is wrong with flipPhase()?

      As for Left-Right: I think Left-Right is a really cool alternative to ReaderWriter locks. But do you think there is some way to do the wait-free (for writers) WriterReader coordination I'm looking for (and that WriterReaderPhaser was built for) using Left-Right as a primitive?

      Delete
    2. Ahhh yes, I understand it now, the 'synchronized' is what matters here. Thanks for the explanation!

      After a more careful analysis of the code I realize now that it is correct.
      In fact, I believe this is a (smart) variant of Left-Right.

      Please notice the following when comparing with Left-Right:
      - The versionIndex has been encoded as the sign of startEpoch;
      - Because the versionIndex has been encoded as the sign of startEpoch, we can change both in one getAndSet() atomic operation, thus bypassing the need to do the second wait for arrive()/depart() on the previous counter after changing the versionIndex, just like the Left-Right variants "No Version" or "Reader's Version" described in the paper. Just like those two variants, the WriterReaderPhaser variant does not need to wait for the "previous versionIndex" before changing the versionIndex (the sign of startEpoch);
      - Although we could implement evenEndEpoch and oddEndEpoch as distributed counters (see LongAdder or DistributedCacheLineCounter), due to the getAndSet() and getAndIncrement() the startEpoch must always be a single atomic counter, which can cause a bottleneck on arrive() (named writerCriticalSectionEnter() in this variant);
      - The leftRight variable is encoded directly in the instance that is returned by activeHistogram, which may not be as cache-friendly as branching on leftRight;

      The main innovation in this variant of Left-Right (WriterReaderPhaser) is the combination of the new ReadIndicator with the versionIndex. The main advantage of WriterReaderPhaser is that the versionIndex can be re-used without having to verify it before re-using it, while at the same time keeping the starvation-freedom progress condition.

      Anyways, very cool stuff!

      Delete
    3. Pedro,

      :-) I think that both WriterReaderPhaser and Left-Right are (smart) variants on multi epoch games.

      The key difference I see between WriterReaderPhaser and Left-Right is in functionality: they aim at (and provide) opposite behaviors, which drives the significant difference in implementation:

      - Left-Right provides for wait-free readers and blocking writers. VERY useful in read mostly data structures (e.g. caches) where maintaining wait-free concurrency for readers is the important behavior.

      - WriterReaderPhaser provides for wait-free writers with blocking readers. Useful for write-mostly data structures (like histograms, frame-buffers, count accumulators) where maintaining wait-free operation on the recording path is the important behavior.

      The reader and writer roles (with respect to the data being shared) are very important in both cases. Neither WriterReaderPhaser nor Left-Right will protect writers from readers (readers are not expected to modify the active data writers are writing to).

      I don't really know if single-atomic-word variants of multi-epochs (like I use in WriterReaderPhaser) would work for a wait-free reader semantic in a variant of Left-Right (providing the same functionality with less state per your description, but with potential contention effects that your per-thread state may do much better on for readers). I certainly can't lay claim to the single atomic word trick itself, as it's been done so many times before. But I'm pretty sure WriterReaderPhaser (as it stands) doesn't cover the needed semantics for mimicking Left-Right's wait-free reader capability, just as Left-Right (as it stands) doesn't cover the needed semantics to mimic WriterReaderPhaser's wait-free writer capability. In this sense, I think both primitives may well be separately new and novel.

      But I may be missing something in how Left-Right can be used beyond the wait-free readers case. If you think it can also be used for wait-free writers in double buffered schemes, can you jot down an example of how one would use it? (E.g. an equivalent external API functionality to IntervalHistogramRecorder that uses Left-Right internally instead of WriterReaderPhaser)

      Delete
    4. Hi Gil,
      I've added an example as a pull request on github of how to use one of the other variants of Left-Right as a replacement for WriterReaderPhaser... and yes, the WriterReaderPhaser is a variant of Left-Right, that combines the versionIndex with the ReadIndicator.
      Discovering a new variant of Left-Right is not an easy feat at all, so kudos to you!

      You talked about it on your post so I'm sure you agree that a Reader-Writer Lock used "backwards" is still a Reader-Writer Lock, just like a Left-Right Pattern used "backwards" is still a Left-Right Pattern ;)

      If you have further questions feel free to drop me a line over email.

      Cheers,
      Pedro

      Delete
    5. Ahh, I think I see where the confusion is. Your implementation doesn't guarantee forward progress for readers (my readers). This may be a fundamental difference in requirements perception that made you think Left-Right variants can do the job. WriterReaderPhaser fundamentally guarantees forward progress of readers. It is one of the base requirements, as without this quality, a continuous recording stream (which is very common) could prevent readers from sampling the buffer indefinitely. This behavior of WriterReaderPhaser is spelled out in the JavaDoc [... "readers" block for other "readers", and "readers" are only blocked by "writers" whose critical sections were entered before the reader's flipPhase() attempt]. For clarity, I added this point as requirement 2.3 in the blog entry above (added at the time of this comment, so you may not have seen it in prior reading, leading to our confusion).

      AFAICT (correct me if I'm wrong), the Left-Right implementation in your pull request example (at https://github.com/pramalhe/HdrHistogram/commit/08f65b5221ae97b173da6e63b3f70807a66150dc) does not provide a forward progress guarantee for readers. This would result in an inability to read data when recording is fast enough, which is a very real problem for any write-heavy traffic with concurrent writers, and it can occur quite easily at a high enough concurrency level and/or with a short enough time between critical sections in the writer code.

      My work on WriterReaderPhaser started from/arrived at the basic requirement of using a single atomic word for this basic starvation reason (I don't think it can be done without a single atomic update for both phase and epoch count, but I'd be glad to be proven wrong). Without the non-starvation requirement (and as implemented in LatencyUtils before it), the writer-reader flip (as opposed to the reader-writer one, perhaps) can be done with a simple multi-epoch barrier (one for entry to and one for exit from the critical section), and this worked just fine as long as reader starvation was acceptable (or as long as I ignored it as a possibility). The multi-epoch pattern has been used for decades as far as I know (as a simple line-in-the-sand non-blocking checkpointing mechanism). The very reason I started work on WriterReaderPhaser was that I found out (the hard way) that the multi-epoch system had a starvation problem...

      BTW, I was not aware that (in its wait-free reader form) Left-Right allows for writer starvation. Is that really the case, or is my reading of the code wrong?

      If it really is the case, then I can suggest a WriterReader variant of Left-Right ( ;-) ) that would use the single-atomic word design requirement, providing guaranteed forward progress for writers even under continuous and heavy read activity. This (forward progress for writers) tends to be a basic requirement for "correct" reader-writer lock patterns, and I've run into serious bugs in systems that don't provide this guarantee (hint: readerWriter spin lock implementations in some kernel versions).

      Delete
    6. Hmm... I think my previous reply is wrong. And that your code DOES guarantee forward progress without the single atomic word thing.

      I was reading the isEmpty() loops "backwards", thinking they can go on indefinitely under pressure. The reason that won't happen is that you only do isEmpty() on the ingress/egress counter version that is not currently active, and therefore only currently in-flight readers (your readers) can be holding the loop back. Since no newly active readers would be counting with that version, you are guaranteed to complete.

      So yes, I think this Left-Right variant works correctly for the same use case.

      I think that the fact that you were re-reading the ingress value in the loops (e.g. return (readersEgressv0.sum() == readersIngressv0.sum()); in isEmpty()) is what made me think the loops are unbounded. You could loop waiting for the egress value to reach the seen-before-loop-starts ingress value instead, but since the ingress value being watched is not changing, the logic is equivalent (and nobody really cares about the cost of the extra read in this loop).

      A suggestion I'd have is to add a check that the writeLock is actually owned by the caller in toggleVersionAndScan() (and throw an exception if it is not). This will make it impossible for external callers to mess with your internal state. They can still use things wrong, but they can't mess with your internal version behavior. Also, you can just use a ReentrantLock for writeLock (and writeLock.getOwner() == Thread.currentThread() for the check), as there is no need for a stampedLock...

      Delete
    7. I'm going to assume that by "forward progress" you actually mean "starvation freedom". If this is not the case, then please explain in more detail what you mean by "forward progress", since Progress Conditions have their own precise definitions:
      http://concurrencyfreaks.com/2013/05/lock-free-and-wait-free-definition-and.html

      You said
      (...)Your implementation doesn't guarantee forward progress for readers (my readers).(...)
      but I'm afraid that is incorrect.
      In the paper, there are three variants of Left-Right: "Classical", "Reader's Version", and "No Version". Both the "Classical" and "Reader's Version" provide starvation freedom from Writers to Readers, from Readers to Writers, from Readers to Readers (if a wait-free ReadIndicator is used), and from Writers to Writers (if a starvation-free writersMutex is used). The pull request is using the Classical variant, and is therefore starvation-free from Readers to Writers, and from Writers to Readers.

      I'm pretty sure you have much better things to do with your time than reading semi-obscure papers such as the Left-Right one, but until you devote some time to it, I'm afraid any further discussion will be somewhat fruitless.

      Just a rhetorical question: Are there any _correct_ use-cases where two threads have acquired 'readerLock' and then call flipPhase() concurrently?

      Delete
    8. > Just a rhetorical question: Are there any _correct_ use-cases
      > where two threads have acquired 'readerLock' and then call
      > flipPhase() concurrently?

      The typical roles of writer and reader in WriterReaderPhaser are reversed from those of a classic ReaderWriter lock. It's the writers that get to run concurrently, while the readers are serialized. Hence the name.

      readerLock (in WriterReaderPhaser) is therefore a mutual exclusion lock (not a read write lock or stamped lock), so two threads can't acquire it and do anything concurrently. It is required for performing a flipPhase (which will throw an exception if the lock is not held by the caller), and its purpose is to protect readers from each other, such that any pre-flip work (like swapping the active and inactive roles of data structures) and any post-flip work (like reading the inactive contents) would be safe and stable when done under a readerLock().

      This combination provides an API that is natural for write-heavy double buffer flipping to use and reason about (with the reader and writer roles matching what the API user actually does). While it is possible for Left-Right with a Classical variant to be used to provide the equivalent concurrency protection, the named roles (and the lock types) will be reversed when doing so, requiring a write lock to be held by the readers, and for actual readers to play the role of "writer" and actual writers to play the role of "readers", as in your example. A de-obfuscating wrapper around Left-Right that would hide this API confusion and use the terms natural to the caller (in a WriterReaderPhaser use case) would help avoid questions like "what happens when two readers hold a readerLock()?", but more importantly, help avoid bug traps that would come when a "reader code author" naturally assumes that using the reader API calls (and playing the reader roles expected) in Left-Right would produce desirable or correct behavior.


      Delete
    9. Dope... for some reason I misread the type of the variable 'readerLock' as being a readLock() of ReaderWriterLock, totally my mistake, sorry for the added confusion. Thank you for taking the time to explain that!

      This means that 'readerLock' is the mutual exclusion lock named 'writersMutex' in the Left-Right paper :)

      The naming of 'readers' and 'writers' is certainly deceiving, we should have called them something like 'arrivers' and 'flippers'.

      Delete
    10. No harm. You misread my code a couple of times, and I misread your code a couple of times. It’s all code under the bridge.

      Having spent the time to read the Left-Right paper in more detail now, here are my observations:

      1. I'm still pretty sure WriterReaderPhaser is new. As in "a wait-free writers with blocking readers synchronization primitive" is new. I could call it "A Concurrency Control Technique with Wait-Free Population Oblivious Writes" to mirror Left-Right, and it would still be new. Left-Right (which is truly novel, I think) discusses wait-free readers with blocking writers. It doesn't discuss the possibility of wait-free writers with blocking readers. I haven't found (so far) anything else that does either.

      2. It is certainly true that a part of Left-Right's internal mechanism can be used under the hood in a WriterReaderPhaser implementation. We both seem to be using phased (or "versioned") variants on epochs to track entry/exit to critical sections on the "things" that we want to keep wait-free (readers in the case of Left-Right, writers in the case of WriterReaderPhaser). Not sure what that mechanism would be called, but it's not new in either of our implementations (checkpointing mechanisms have used it forever). It's the application of this mechanism to create a wait-free readers with blocking writers synchronization mechanism that makes Left-Right novel. I think that the independent application of a similar checkpointing mechanism to create a generically useful wait-free writers with blocking readers mechanism makes WriterReaderPhaser new as well. The duality is probably a natural artifact of the problems that we were independently working to solve.

      3. If you stand on your head and squint hard, you can correctly state that one *could* use a part of Left-Right as a mechanism that protects read access to the reference word that denotes the "active" write-to data structure in a double-buffered mechanism that wishes to maintain wait-free writers with blocking but guaranteed-forward-progress readers. But to do that all roles in Left-Right APIs must be reversed to their exact opposite meaning (hence the upside-down squinting). Readers mean writers. Writers mean readers. But you need to be careful not to make writers that mean readers mean writers again, and you need the readers to write and the writers to read. And if readers read or writers write things will go very wrong. Such use (while correct) is clearly counter-intention, and makes every comment and rule in Left-Right look like a "this is not what this thing promises" sign... Yes, it works, but only if/because you actually understand what the underlying implementation logic does. To an API user, that's not a wait-free writers API...

      4. The single-atomic-word, single wait-loop draining technique I use in WriterReaderPhaser is not the only way to protect the critical section. As you've demonstrated, it is possible to have the phase (or version) reside in a separate word if you use two draining wait-loops (before and after the phase shift). Both techniques remain wait-free for the "important" parties (readers for Left-Right, writers for WriterReaderPhaser), and both guarantee forward progress of the "less-important" parties (writers for Left-Right, readers for WriterReaderPhaser).

      5. The wider variants (more than just two atomic words) of Left-Right's flipping mechanism may be useful in WriterReaderPhaser variants that want to spread cache-line contention through striping. It's an interesting tradeoff to look at. My current implementation is probably faster for low contention situations (no dereferencing to get to the atomic lines), while your variants are probably faster in heavier contention situations (where the dereferencing gets you to separate parts of LongAdders).

      BTW Pedro, what is your twitter handle? I wanted to refer to you in my "I was wrong on the internet" tweet that linked to this comment thread...

      Delete
    11. I think you should show in your code where your "new" synchronization primitive is. And forget about inverting the roles of Reader with Writer, that doesn't make it "new".

      Delete
    12. Andreia,

      It's the primitive itself that is new. The thing that demonstrates the new primitive is not the specific example implementation (that's just a combination of atomic operations and volatiles). It's the API (captured in JavaDoc in this case) and the code that uses the primitive (e.g. IntervalHistogramRecorder) that show the new primitive.

      The new primitive is a WriterReaderPhaser: a primitive that provides a direct API for writers that write into a shared data structure to coordinate with readers that need to read a stable view of the data structure, and provides wait-free behavior for the writers while maintaining blocking but guaranteed-forward-progress behavior for the readers.

      Plenty of other primitives that provide for coordination of writers and readers exist, but I am not aware of a single one that provides these qualities. They are highly desirable for the common use case of write-mostly structures with occasional reads for e.g. logging or analysis.

      Implementing something that provides these qualities out of a combination of other previously available primitives doesn't make the primitive itself less "new". That's the whole point: Extracting the primitive and describing its common use, qualities, and guarantees. I'm sure I'm not the first to make writers and readers behave like this. I'm sure people have custom-solved this using their own custom and cobbled-together concurrent thing that they had to make sure worked for the purpose each time (I have done that before more than once). But I *think* I'm the first to describe the primitive itself (with an example working implementation) along with an API that others can use without building their own custom concurrent synchronization mechanism for the task at hand (and having to convince themselves that it actually does what they want, and does it safely).

      If a previously un-described primitive that directly serves a common use case like this does not qualify as new, neither would any other primitive above an atomic single-word operation (and maybe not even those). A regular read-write lock can be built from a counting semaphore or a CAS. A semaphore, or a condition variable, or a spin lock wouldn't be different from a CAS either. And Left-Right wouldn't be a new thing. After all, they all run on Turing machines, and we can create hard-to-reason-about direct implementations of each using basic instructions embedded in the code that needs the behavior at hand.

      It’s [only] the specified qualities and the guarantees that the primitives provide under prescribed use rules that make them useful and different from each other.

      What makes Left-Right new (and different from, say, RCU and epoch based SMRs that preceded it) is the use case and API, not the implementation. Left-Right provides a clean way for readers and writers on a common data structure to coordinate with better behavior (for the use case at hand) than was previously available in a directly useable primitive. [IMO] The generally useful Left-Right use case is your emulation of a ReadWriteLock with better qualities. Left-Right provides a clean API that requires nothing special of readers (just do your reads on the data structure you get from Left-Right) and only a slight modification needed in how writers normally use such a primitive (need to write twice, on both sides of a flip).

      What makes WriterReaderPhaser new is the use case and API. WriterReaderPhaser provides a clean way for writers and readers to coordinate with better behavior (for the common use case at hand) than was previously available in a directly useable primitive. It does so with a clean API that requires nothing special from the writers (just protect your writes with the WriterReaderPhaser) and only a slight modification from how a mutex or ReadWriteLock-used-backwards would be used: readers switch active/inactive data sets before a flip, and use the inactive set after the flip. [No Left-Right style double writes are needed, because it is a fundamentally different use case.]

      Delete
  4. Seems just like what I need! My use case is that multiple writers add work to a synchronized list and a single threaded pool picks up the entire batch and gives the worker a new list. Now if you happen to have an unbounded and resizable list as well that concurrent writers can use lock-free, that would be wonderful. (Can't use fixed buffer because of the dynamic and non-blocking environment: can't tax small activities with several kilobytes of buffer every time.)

    I've looked at flipPhase and I believe oddEndEpoch and evenEndEpoch can be written lazily; the following startEpochUpdater.getAndSet is a full barrier anyway.

    ReplyDelete
    Replies
    1. Dávid,

      You are correct about being able to use a lazySet instead of a volatile write in the flip when resetting the [currently inactive] end epochs ahead of flipping that start phase. I made the change to both the gist above and the github code. Thx for pointing out the optimization.

      For the resizable list question above, I'm not quite sure if this will address your need for resizable lists, but it may: I had recently built resizing capability into ConcurrentHistogram, and I use the WriterReaderPhaser for that too (managing two internal counts arrays). HdrHistogram's Recorder (what used to be IntervalHistogramRecorder) now uses ConcurrentHistogram under the hood when resizing support is needed (you can find both the Recorder and ConcurrentHistogram classes in https://github.com/HdrHistogram/HdrHistogram/tree/master/src/main/java/org/HdrHistogram).

      That use case (for resizing) is a bit more complicated than a pure writer-reader-flip though, and probably falls right between what WriterReaderPhaser and LeftRight are each intended to do (by their expected abstraction and use protocol or contract), so I'm not sure what I'd call the way I use the primitive there (it can be thought of both as "WriterReaderPhaser with extra assumptions" and "LeftRight with extra assumptions"). I believe my resizing logic is safe to do using my *current implementation* of WriterReaderPhaser, but I'm not entirely sure it is safe using the contract I state for its use (i.e. if some future implementation of the WriterReaderPhaser contract will be just as safe to use). I'll need to think about that [or if you want to spend the time to reason about it and follow up here, that would be cool].

      Delete
    2. Thanks for the answer Gil. I'll do some experiments with WRP and simple 2D atomic reference array with resize logic similar to Cliff Click's concurrent hashmap.

      Delete
  5. I hope I'm not too OT: wouldn't it be nice to implement an off-heap single writer histogram that uses an off-heap phaser to allow an IPC version of the same pattern used in jHiccup to record interval histograms?
    If you say it seems feasible, I'm very favourable to start working on it and pushing a PR on HdrHistogram (but I'm not sure it can be done without using Unsafe or similar) :)

    ReplyDelete
    Replies
    1. If the recording is done in Java, I'd have an in-process polling service that other processes call into, and which copies the stable histogram out to a place where others can see it (and has its own cross-reader coordination semantics). HdrHistograms are generally cheap enough to copy that copying one (a few tens of KB) each time you poll for a stable interval read is not a big deal. As for placing data in some cross-process (and polyglot) readable form, I'd use the compressed histogram format that the various HdrHistogram implementations (Java, C, .NET, Python, JavaScript, Rust, etc.) all know how to parse, and place the data in a mapped file or shared memory object.

      Delete
    2. That's a good point, but IMO it doesn't allow complete decoupling between the monitored and monitoring process.

      For example, the Aeron counters are always "on", without any specific configuration of the monitored applications (no polling rate configuration either), but on demand a monitoring process can be attached and collect the samples with a preferred configuration.
      It seems similar but from the perspective of the monitored application isn't actually the same.
      Anyway it was just something funny to work on, I'm already more than grateful and happy about what the lib is already providing :)

      Also, re the WR Phaser: did you explore alternative implementations (a single counter for the end phase?) too?

      Delete
    3. Let me try to share some hard-earned wisdom here: I predict that the inherent coupling between the writer and readers will lead you into a long (and probably fun) Yak shaving exercise here:

      IMO it is “hard” (as in probably impossible but I can’t prove that) to completely decouple a monitored and monitoring process and have speed that is close to what we are talking about here.

      If we built such a shared-memory WRP for coordinating such shared memory data, and even if we only had a single writer for the shared data structure it coordinates (e.g. a single-writer histogram recorder), this would create a very strong coupling of the readers to the writer, as "strange things" that the writer may do (like crashing and not coming back) can cause "terrible things" in the readers (like hanging indefinitely). E.g. imagine something killing or crashing the single writer while it is in the critical section. Not a problem for in-process stuff (everything crashes together), but a new and annoyingly detailed thing to solve in a cross-process situation.

      The way such things tend to get resolved in cross-process IPC is usually to add some system-level thing that ensures that the "important" processes that everyone else is strongly coupled to don't disappear forever, and the important processes have to deal with state-resetting things that in-process synchronization never needs to think about. E.g. in this case, "something" must make sure a new writer starts if the writer disappears, but must also make sure only a single writer ever exists. (Sidetrack here to figure out how to make the magic "something" itself HA enough so it doesn't disappear, and reinvent something like inetd.) And now the new writer needs to reset the critical section indicators to unblock any potentially hung readers, etc. And we haven't even started on the cross-reader coordination stuff that now needs to cover the possibility of a crashed "non-important" reader (so your reader locks now need something like safe expiry and ownership refresh mechanisms, or a lease mechanism). You also need to deal with lifecycle and namespace issues on the shared resource (e.g. who creates the shared resource? And since it will outlive a crashed writer and must be reused by a replacing writer, who tears it down?). This will probably also pull you into a fun discovery-questions sidetrack, and into namespace convention stuff for some additional re-invention activities.

      Are we having fun yet? (Actually, probably yes).

      This is all doable and solvable, but the solutions tend to create a strong coupling between the participants, and that coupling tends to be around new (and often fragile) protocols that all future versions of all participants must adhere to and follow. You then find yourself trying to come up with a robust and future-proof cross process protocol, and often re-invent something like Unix sockets or named pipes. When you try to extend the same concept outside of a single image (e.g. to distributed shared memory, or cross-container shared memory) you end up reinventing something like TCP.

      The complexity and fragility involved in all this usually makes it worth "doing right" only for high-value things. E.g. empirically, a shared memory buffer for very high speed cross-process IPC may or may not count as high enough value, because most solutions for that abstraction are not "quite as right" in the longevity or robustness dimensions as e.g. HTTP sockets in a TCP/IP world with DNS namespaces are. (E.g. Aeron is possibly on the way to providing a longevity/robustness addressing of that concept, but it takes years to get there.) For all other situations, putting down the Yak-shaving shears and retreating back to in-process synchronization fronted by well understood and established protocols (e.g. exposing a service that others call into using an established way to do things between processes and in distributed systems) is probably the way to go.

      Delete
  6. Thanks Gil,

    You've given me very good points for understanding how complex and difficult it could be to make such counters "right" by just porting the in-process pattern into the IPC world.
    It totally makes sense, and providing it natively without taking care of all the logical states that could happen would just risk providing a broken/dangerous tool to the end users.

    > Are we having fun yet? (Actually, probably yes).

    Never for fun, but enough if for a daily job :D

    I'm starting to think that https://github.com/JCTools/JCTools/tree/master/jctools-channels would need to provide at least the minimum required APIs to allow liveness recognition and unblock operations, to help a user not fall into the "bad things won't happen" trap while using such tools in IPC space.

    Re the WR Phaser, I've noticed on the mechanical sympathy group that there is an old post related to it... and it covers most of the questions I could have on it, but one.

    Do you feel you were influenced by any concurrency patterns while designing its logic?
    I'm just trying to understand whether I can gain the most by writing my own stuff or by reading some good book/code to become more confident while dealing with concurrent programming :P

    ReplyDelete
    Replies
    1. > Do you feel to have been influenced by any concurrent patterns while designing its logic?

      I'm sure I've been influenced by many. But there isn't some specific pattern I can point to that I modeled WRP after. I wanted it to be generically useable, in library form, by user-space programs. That ruled out a whole bunch of patterns that can be used to achieve similar benefits but would only be useable in much more restricted circumstances.

      I've read many papers (and not nearly as many books) over the years, but I mostly just read a lot of code, often trying to "empathise" with the writer in trying to understand the motivations, the tradeoffs being made, and accepting that most code isn't (and doesn't need to be) generically useful, and represents an interesting point-solution that something may be learned from. Some of this stuff soaks into my brain, and over time I recognize patterns and families of "how things are done" and "what qualities are sought", which I don't necessarily have names for. This obviously includes huge families of "mistakes people make" and "pitfalls people run into", which to my thinking are the best teachings of all.

      Over years of this sort of exposure I guess you end up with interesting classifications. E.g. there is this huge group of what I think of as "epoch style" things that allow for some great concurrency qualities as long as something is known to be reliably ticking in the background, taking care of some crucial book-keeping (like garbage collecting or expiring/releasing old things) and guaranteeing forward progress even when the players go idle (RCU is a good example of that pattern). This sort of pattern can be used in systems that can centralize this sort of background book-keeper thing across all uses of it (e.g. a Linux kernel gets to do this), but it becomes less practical in many user-mode library situations, where things get annoyingly complicated when looking to provide useful primitives that can coordinate work across a few players within a larger system (which can have hundreds of other unrelated players that want a similar thing for themselves and their own small groups of functionality). There is a lot to learn from these "epoch style" patterns (e.g. about what great things you can wish for), but also a lot to learn about what makes them hard to use in certain spaces (e.g. "don't rely on any magic background threads" is a good rule of thumb for building a widely reuseable user-space primitive that would show up in a library, like e.g. a mutex, or semaphore, or RWlock, without requiring the using process to dedicate some executors for making the stuff work). The opposite-side takeaway from this can also be: "If [in your problem domain] you can afford to dedicate some common background workers for some purpose, you can probably build/use/achieve some really cool concurrency properties that cannot be had without such things."

      E.g. GC makes a whole class of concurrency mechanisms possible and practical that would be "very hard" to do without a garbage collector.

      Delete