Sunday, November 16, 2014

WriterReaderPhaser: A story about a new (?) synchronization primitive

I recently added a synchronization primitive mechanism to my HdrHistogram and LatencyUtils code, which I think has generic use for some very common operations. Specifically, for when wait-free writers are updating stuff that background analyzers or loggers need to look at. I've isolated it in what I now call a WriterReaderPhaser. The name is very intentional, and we'll get to that in a moment. And to the code (all 66 actual lines of it, 200 with elaborate comments). But first, I'll stray into some "how did this come about" storytelling.

WriterReaderPhaser is a new (I think) synchronization primitive: It provides a straightforward interface and API for coordinating wait-free writes to a shared data structure with blocking read operations on the same data. Readers see a stable (i.e. non-changing, coherent) data set while writers continue to modify data without waiting. Readers are guaranteed forward progress, and will only block for other readers and for writers that may have been "in flight" at the time the reader establishes a stable view of the data.

How did this come about?


This sometimes happens when I build stuff: I find myself in need of some behavior that I thought would be common, but for which I can't find an existing implementation, or a name, or a description. This can obviously be ascribed to my weak Google-fu skills, but after a while I give up and just build the thing, because "it's not that complicated". So I build a one-off implementation into whatever I am doing at the time, and move on with life. At some later point, I find myself needing the same thing again. And since I had already solved that problem once, I go back to my old code and (let's be honest) copy-and-paste my first implementation into whatever new thing I'm working on. Sometimes the little guy on my right shoulder wins over the other guy, and I come back and refactor the behavior into a separate class and build an API for more generic use, at which point the "does this deserve its own library? Its own repo?" thinking starts, coupled with much Yak Shaving [1]. Sometimes the guy on the left shoulder wins, and I actually get on with the real work I was supposed to be doing. I'll leave it to you to decide which little guy is red and which is white.

Sometimes (usually much later) I realize that what I built was actually new. That even though I thought it was a common use case, and built my version simply out of impatience or frustration at not finding something I could use as-is, I may actually be the first person to solve it. Most of those times, this realization is quickly followed by someone showing me a paper or a piece of code that is 30 years old that makes me go "oh... right.". But sometimes that doesn't happen. Sometimes it really is new.

HdrHistogram itself started this way. It was nothing more than about 100 lines of code in a one-off "JitterMeter" tool I was playing with, which needed to record latencies very quickly and report accurate percentiles with many nines in them. Then I found myself building all sorts of variations on jitter meters and sharing them (jHiccup is an evolved version with a better name). And then I found that people (myself included) were taking the code and ripping out just the histogram trick inside, because they needed a histogram that was actually useful for talking about latencies. Recognizing that a fast histogram with good precision and accurate and fine grained quantile reporting capability is actually a very common use case, I decided to build a Yak shaving co-op on github and called it HdrHistogram. The first Yak hair I produced was Java-colored but others have recently added other colors and breeds.

HdrHistogram is a [presumably] successful example of this process going the distance. More often than not, it doesn't. That's probably what my stale repos on github with 2 stars and no forks represent.

WriterReaderPhaser is currently about halfway through this precarious process, but at this point I'm pretty sure it's not going to die. It's a class on its own, but not yet its own library. Certainly not its own repo yet. It will need to find a home, but org.giltene.stuff is probably not where it needs to end up. Since it's so short, this blog entry is as good a home as any for now.

Most importantly, it looks like it may actually be a new and generically useful synchronization primitive. More accurately: nobody has shown me that "oh... right." link or paper yet, and I'm done holding my breath for now.

So what is WriterReaderPhaser about? 


Have you ever had a need for logging or analyzing data that is actively being updated? Have you ever wanted to do that without stalling the writers (recorders) in any way? If so, then WriterReaderPhaser is for you.

 I'm not talking about logging messages or text lines here. I'm talking about data. Data larger than one word of memory. Data that holds actual interesting state. Data that keeps being updated, but needs to be viewed in a stable and coherent way for analysis or logging. Data like frame buffers. Data like histograms. Data like usage counts. Data that changes.

Existing solutions


Sure, you can use channels, queues, or magic rings to move data updates and safely process them in background copies of the data. You can use persistent data structures and all sorts of immutable trickery. But those are expensive. As in orders of magnitude more expensive than updating in-cache state in place. When this data thing you want to look at could be updated millions of times per second, you invariably end up with some sort of double-buffered (or multi-buffered) scheme: Updates are done to an active copy, and analysis is done "in the background" on stable, inactive copies.


Double buffered schemes usually involve some sort of "phase flipping". At some point the notion of which copy is active changes. Writers update the "new" active copy, and readers access a stable and coherent copy that used to be active, but now isn't. It's this phase flipping that usually gets in the way of keeping writers from blocking.

There are all sorts of variations on how to do this flipping. We can obviously use some form of mutual exclusion lock to protect the writes and the flip. But then writers will block each other, and be blocked by the flipping operation. We can use ReaderWriter locks backwards: where the state being protected by the ReaderWriter lock would be the notion of which data set is the "active" one (the one writers write to). In this scheme writers take the read lock for the duration of their active state modification operations, while readers take the write lock to flip the roles of active and inactive data sets. This can be [much] better than complete mutual exclusion when multiple writers are involved, since writers no longer block other writers, but readers still block writers during a flip. Also, when you start asking yourself "what does 'read' mean again in this context?" that is a good sign you have a problem. Most people write buggier code when standing on their head and juggling. I'm sure there are a whole bunch of other schemes people use, but in my looking around thus far, I didn't find any examples that were non-blocking for the writers.
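
To make the "backwards" ReaderWriter lock idea concrete, here is a minimal Java sketch of that scheme. The class and field names are made up for illustration (they are not from any of the code discussed here), and it still has the problem described above: writers are blocked whenever a flip is in progress.

import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// The ReadWriteLock protects the notion of which buffer is "active":
// writers take the *read* lock while recording, and a reader takes the
// *write* lock to flip the active/inactive roles.
class BackwardsRwLockRecorder {
    private final ReadWriteLock flipLock = new ReentrantReadWriteLock();
    private volatile AtomicLongArray activeCounts = new AtomicLongArray(1024);
    private AtomicLongArray inactiveCounts = new AtomicLongArray(1024);

    void recordCount(int index) {
        flipLock.readLock().lock();   // shared: writers do not block each other...
        try {
            activeCounts.incrementAndGet(index);
        } finally {
            flipLock.readLock().unlock();
        }
    }

    AtomicLongArray flipAndGetStableCounts() {
        flipLock.writeLock().lock();  // ...but they are blocked while a flip is in progress
        try {
            AtomicLongArray tmp = inactiveCounts;
            inactiveCounts = activeCounts;
            activeCounts = tmp;
            return inactiveCounts;    // quiescent until the next flip
        } finally {
            flipLock.writeLock().unlock();
        }
    }
}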

Why did I care?


The thing I actually wanted to double-buffer was a histogram. And not just any histogram. A fixed-footprint histogram that supports lossless recording of experienced latencies, such that later computation of precise percentiles will be possible, all the way to the as-many-9s-as-there-are-in-the-data level. The very purpose of such a histogram is often to capture and analyze latency outlier behavior. The recording operation cannot be allowed to be a cause of the very outliers it is trying to measure. For the latency recording mechanism to have any susceptibility to blocking or locking would be unacceptable.

These latency histograms are basically non-blocking data structures with tens (or hundreds) of kilobytes of state that is rapidly being mutated by critical path "writer" code. But I wanted to log their contents over intervals that are short enough to be interesting for monitoring purposes, and for later time based analysis. In order to log the latency information being captured, I needed a logging "reader" to somehow gain access to a stable, coherent "snapshot" of the latency data that was recorded during some prior interval. To do this, I needed a way for the reader to flip the roles of the active and inactive histograms, but I needed to do that without ever blocking the writers. This is a classic case of an asymmetric synchronization need. I'm fine blocking, delaying and pausing the reader. I just can't afford for the writers to ever block or otherwise delay the execution of the thread they are recording in.

In comes WriterReaderPhaser. And the best starting point for understanding what it does is to dissect the name:

The Phaser part is there because its main function is to coordinate phase shifts between the writers and the readers. Besides, I couldn't bring myself to call this thing a lock. It's not a lock. Not in its most important function, which is phase shift coordination. Writers remain lock-free in all cases (they actually remain wait-free on architectures that support atomic increment operations). They never block or lock. Calling WriterReaderPhaser a lock would be like calling an AtomicLong an "add lock" because someone could also construct a spin-lock around it....

The WriterReader part is a reversal of the commonly used ReaderWriter (or ReadWrite) term. ReaderWriter locks are asymmetric, but in the reverse direction of what I needed: they enable [relatively] smooth reader operation while causing the writers to block. The really cool wait-free Left-Right construct, which Martin Thompson had pointed me to, achieves perfectly smooth reader operation, but that's still not what I needed. WriterReaderPhaser works for exactly the reversed need: Writers remain non-blocking and perfectly smooth, while only readers suffer.

The desired behaviors I was looking for in a WriterReaderPhaser were:

1. Writers remaining lock-free at all times. Ideally they will remain wait-free at all times.

2. A Reader can coordinate a phase flip and access to the inactive data such that:

2.1 Other readers will not flip a phase while this reader is still interested in the inactive data.

2.2 No writer modification will be made to the inactive data after the phase flip operation is complete, and for as long as the reader is interested in the inactive data.

2.3 Readers are guaranteed forward progress (even in the presence of heavy and continuous writer activity, and even when there is no writer activity at all).

Defining WriterReaderPhaser:


With these high level desired behaviors stated, let's clearly define the qualities and guarantees that a well implemented WriterReaderPhaser primitive would provide to users, and the relevant rules that users must adhere to in order to maintain those qualities and guarantees:

A WriterReaderPhaser instance provides the following 5 operations:
  • writerCriticalSectionEnter
  • writerCriticalSectionExit
  • readerLock
  • readerUnlock
  • flipPhase
When a WriterReaderPhaser instance is used to protect an actively updated data structure [or set of data structures] involving [potentially multiple] writers and [potentially multiple] readers, the assumptions on how readers and writers act are:
  • There are two sets of data structures (an "active" set and an "inactive" set)
  • Writing is done to the perceived active version (as perceived by the writer), and only within critical sections delineated by writerCriticalSectionEnter and writerCriticalSectionExit operations.
  • Only readers switch the perceived roles of the active and inactive data structures. They do so only while holding the readerLock, and the switch is only done before executing a flipPhase.
  • Readers do not hold onto readerLock indefinitely. 
  • Only readers perform readerLock and readerUnlock.
  • Writers do not remain in their critical sections indefinitely. 
  • Only writers perform writerCriticalSectionEnter and writerCriticalSectionExit.
  • Only readers perform flipPhase operations, and only while holding the readerLock.

When the above assumptions are met, WriterReaderPhaser guarantees that the inactive data structures are not being modified by any writers while being read under readerLock protection after a flipPhase operation.

The following progress guarantees are provided to writers and readers that adhere to the above stated assumptions:
  • Writer operations (writerCriticalSectionEnter and writerCriticalSectionExit) are wait-free (on architectures that support wait-free atomic increment operations).
  • flipPhase operations are guaranteed to make forward progress, and will only be blocked by writers whose critical sections were entered prior to the start of the reader's flipPhase operation, and have not yet exited their critical sections.
  • readerLock only blocks for other readers that are holding the readerLock.

Example use


Imagine a simple use case where a large set of rapidly updated counters is being modified by writers, and a reader needs to gain access to stable interval samples of those counters for reporting and other analysis purposes. 

The counters are represented in a volatile array of values (it is the array reference that is volatile, not the value cells within it):

volatile long counts[];
...

A writer updates a specific count (n) in the set of counters:

writerCriticalSectionEnter
   counts[n]++; // should use atomic increment if multi-writer
writerCriticalSectionExit

A reader gains access to a stable set of counts collected during an interval, reports on it, and accumulates it:

long interval_counts[];
long accumulated_counts[];

...
readerLock
   reset(interval_counts);
   long tmp[] = counts;
   counts = interval_counts;
   interval_counts = tmp;
flipPhase
   // At this point, interval_counts content is stable  
   report_interval_counts(interval_counts);
   accumulated_counts.add(interval_counts);
readerUnlock
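
For a more concrete (and hedged) version of the same example, here is roughly what it might look like in Java against the WriterReaderPhaser class in HdrHistogram. Treat it as a sketch rather than canonical usage: the package name and exact signatures are assumed here, including the form in which writerCriticalSectionEnter() returns a value that is handed back to writerCriticalSectionExit().

import java.util.concurrent.atomic.AtomicLongArray;
import org.HdrHistogram.WriterReaderPhaser;

// Sketch only: the interval counter example above, using WriterReaderPhaser.
class IntervalCounters {
    private final WriterReaderPhaser phaser = new WriterReaderPhaser();
    private volatile AtomicLongArray activeCounts = new AtomicLongArray(1024);
    private AtomicLongArray inactiveCounts = new AtomicLongArray(1024);

    // Writer path: wait-free; never blocks on readers or on a flip.
    void recordCount(int n) {
        long criticalValueAtEnter = phaser.writerCriticalSectionEnter();
        try {
            activeCounts.incrementAndGet(n);   // atomic increment: multi-writer safe
        } finally {
            phaser.writerCriticalSectionExit(criticalValueAtEnter);
        }
    }

    // Reader path: swap the active/inactive roles, flip the phase, then read
    // the now-stable inactive counts. May block for other readers, and for
    // writers that were in flight when the flip started.
    long[] getIntervalCounts() {
        phaser.readerLock();
        try {
            AtomicLongArray newActive = inactiveCounts;
            for (int i = 0; i < newActive.length(); i++) {
                newActive.set(i, 0);           // reset before making it active
            }
            inactiveCounts = activeCounts;
            activeCounts = newActive;          // new writes start landing here...
            phaser.flipPhase();                // ...and once this returns, inactiveCounts is stable
            long[] snapshot = new long[inactiveCounts.length()];
            for (int i = 0; i < snapshot.length; i++) {
                snapshot[i] = inactiveCounts.get(i);
            }
            return snapshot;
        } finally {
            phaser.readerUnlock();
        }
    }
}

The readerLock here is what keeps two readers from resetting and swapping the same buffers concurrently; the flipPhase is what guarantees that no in-flight writer is still touching the newly inactive buffer when the reader starts reading it.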

A working implementation


Under the hood, my WriterReaderPhaser implementation achieves these qualities in a fairly straightforward way, by using a dual set of epoch counters (an "odd" set and an "even" set) to coordinate the phase flip operations, coupled with a reader lock that is used purely to protect readers from each other in multi-reader situations: i.e. to prevent one reader from flipping a phase or changing the notion of active or inactive data while another reader is still operating on it. Many other implementation mechanisms are possible, but this one is certainly sufficient for the job at hand.

Rather than describe the logic in text, it is easiest to list it as code at this point. Below is the entire WriterReaderPhaser class as implemented in my current HdrHistogram repository, spelled out in Java code (most of which is detailed comments). The mechanism can obviously be ported to any language and environment that supports atomic increment and atomic swap operations. It's the API and documentation (in this case the details in the JavaDoc comments) that are more important. A simple example of how this is used in practice can be found in HdrHistogram's various interval histogram recorders, like the original (and probably simplest example) in IntervalHistogramRecorder.java, or its more recent replacements in DoubleRecorder.java and Recorder.java, which add some unrelated and more complicated logic that deals with safely avoiding some copy costs on getIntervalHistogram() variants.
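
For readers who just want the shape of the mechanism, here is a condensed sketch following the description above: odd/even end-epoch counters, with the phase encoded in the sign of the start epoch. The JavaDoc, the yield/sleep controls, and other details are stripped, and while the field and method names are intended to mirror the HdrHistogram class, treat this as a sketch rather than the authoritative listing.

import java.util.concurrent.atomic.AtomicLongFieldUpdater;
import java.util.concurrent.locks.ReentrantLock;

public class WriterReaderPhaser {
    private volatile long startEpoch = 0;
    private volatile long evenEndEpoch = 0;
    private volatile long oddEndEpoch = Long.MIN_VALUE;

    private final ReentrantLock readerLock = new ReentrantLock();

    private static final AtomicLongFieldUpdater<WriterReaderPhaser> startEpochUpdater =
            AtomicLongFieldUpdater.newUpdater(WriterReaderPhaser.class, "startEpoch");
    private static final AtomicLongFieldUpdater<WriterReaderPhaser> evenEndEpochUpdater =
            AtomicLongFieldUpdater.newUpdater(WriterReaderPhaser.class, "evenEndEpoch");
    private static final AtomicLongFieldUpdater<WriterReaderPhaser> oddEndEpochUpdater =
            AtomicLongFieldUpdater.newUpdater(WriterReaderPhaser.class, "oddEndEpoch");

    public long writerCriticalSectionEnter() {
        // The sign of the returned value identifies the phase the writer entered in:
        return startEpochUpdater.getAndIncrement(this);
    }

    public void writerCriticalSectionExit(long criticalValueAtEnter) {
        // Mark completion against the end epoch of the phase the writer entered in:
        if (criticalValueAtEnter < 0) {
            oddEndEpochUpdater.getAndIncrement(this);
        } else {
            evenEndEpochUpdater.getAndIncrement(this);
        }
    }

    public void readerLock() {
        readerLock.lock();
    }

    public void readerUnlock() {
        readerLock.unlock();
    }

    public void flipPhase() {
        if (!readerLock.isHeldByCurrentThread()) {
            throw new IllegalStateException("flipPhase() can only be called while holding the readerLock");
        }
        boolean nextPhaseIsEven = (startEpoch < 0); // i.e. the current phase is odd

        // Clear the end-epoch counter of the phase we are about to start:
        long initialStartValue = nextPhaseIsEven ? 0 : Long.MIN_VALUE;
        if (nextPhaseIsEven) {
            evenEndEpochUpdater.lazySet(this, initialStartValue);
        } else {
            oddEndEpochUpdater.lazySet(this, initialStartValue);
        }

        // Flip the phase: writers entering from here on count against the new epoch:
        long startValueAtFlip = startEpochUpdater.getAndSet(this, initialStartValue);

        // Wait for all writers that entered during the prior phase to exit:
        while ((nextPhaseIsEven ? oddEndEpoch : evenEndEpoch) != startValueAtFlip) {
            Thread.yield();
        }
    }
}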

And yes, it is now all in the public domain.

Enjoy.


[1] For an apparent etymology of the term "Yak Shaving", read the example story attributed here.

24 comments:

  1. The write-side atomic increments look like they implement a very fine grained epoch system for SMR. Classical epoch uses per-thread counters and gets away with only stores and barriers (one fetch-and-set and regular stores on TSO).

    Implementing a reader-writer lock with minimal read-side overhead on top of SMR (or RCU) is a classic exercise; in the case of your phaser, writers acquire the lock for reading and flippers acquire it for writing. If an application uses a lot of SMR-style constructs, it might make sense to centralise and amortise the epoch overhead in a full epoch reclamation scheme.

    BTW, do you know how the phaser compares to sequence locks, if one doesn't need concurrent writes?

    ReplyDelete
  2. There are certainly similarities to RCU, Left-Right, and Epoch based SMR (such as Epoch Reclamation), but the motivations are different, and with the differences in motivation come subtle (and sometimes not so subtle) important semantic differences in what is achieved.

    Epoch Reclamation (and, I think, other epoch based SMRs) are concerned with safe reclamation of memory, but not at all with maintaining a consistent state of the memory being reclaimed. They therefore don't (normally?) provide readers with the sort of guarantee that WriterReaderPhaser is meant to provide as its primary function: that the inactive state is stable, coherent, and is not being modified by any writers.

    Similarly, in schemes that would use a reader-writer lock to protect the notion of which data set is active (basically readers acquire as "writers" to flip the choice of active data set, writers acquire as "readers" to determine which data set is active), the readers are not guaranteed a stable & unchanging inactive data set, because there is no means for ensuring against some in-flight writer writing according to a stale notion of what is active.

    To your question about comparing with sequence locks: AFAICT, even in a single-reader and single-writer situation, where there are no concurrent writes, sequence locks do not protect against reader starvation. In contrast, WriterReaderPhaser guarantees forward progress for readers as long as writers do not stall in the middle of a critical section. Readers only block for writers that were in flight when a flipPhase is requested, and do not require any new writer activity to occur in order to successfully flip a phase.

    I've also played with single-reader-single-writer schemes (which are a common use case) where the reader requests a flip, but the writer is the one doing the flip. These schemes are generally cheaper on the critical path writer side (a single volatile read in the fast path as opposed to two atomic ops in WriterReaderPhaser), but have the significant downside of having the reader block until an actual write operation occurs. This is not a problem in systems that write continuously. E.g. I used this scheme in jHiccup (before switching to an IntervalHistogramRecorder that uses WriterReaderPhaser). Since the jHiccup writer ticks 1K times per second there was no problem of stalling the reader. However, in many systems that need regular interval recording (e.g. things that use LatencyUtils to record latencies as they occur), the need to record "idle" intervals where no writing has occurred is very real, and these writer-does-the-flip-on-request schemes fundamentally fail there.

    ReplyDelete
  3. Hi Gil,

    I reviewed your code and found some issues.

    First thing I don't understand is in IntervalHistogramRecorder::performIntervalSample():
    private void performIntervalSample() {
        inactiveHistogram.reset();
        try {
            recordingPhaser.readerLock();
            (...)
    How can you guarantee that there is no other "Reader" thread calling inactiveHistogram.reset() at the same time as the current "Reader" has acquired recordingPhaser.readLock() and is/has currently swapped inactiveHistogram with activeHistogram?

    The method getIntervalHistogram() is synchronized but it's on the IntervalHistogramRecorder instance. Shouldn't it be on recordingPhase.readLock() as well?
    public synchronized Histogram getIntervalHistogram() {
        Histogram intervalHistogram = new Histogram(inactiveHistogram);
        getIntervalHistogramInto(intervalHistogram);
        return intervalHistogram;
    }

    Please, don't get demotivated, but flipPhase() looks wrong as well.
    I think you should read the Left-Right paper more carefully, but hey, I'm biased ;)

    Cheers,
    Pedro

    ReplyDelete
    Replies
    1. > How can you guarantee that there is no other "Reader" thread calling
      > inactiveHistogram.reset() at the same time as the current "Reader" has
      > acquired recordingPhaser.readLock() and is/has currently swapped
      > inactiveHistogram with activeHistogram?

      In IntervalHistogramRecorder (not necessarily in the general WriterReaderPhaser case) the recordingPhaser is internal and private, and all non-recording methods are synchronized. There can't be another reader resetting the inactiveHistogram because of this synchronization.

      This would be a common case where the double-buffering nature remains an internal concern and the only things exposed to callers are stable and self-consistent copies. The readLock() only needs to be exposed to a caller of such things if they have a need to operate directly on an inactive data set, and not on a copy.

      > The method getIntervalHistogram() is synchronized but it's on the
      > IntervalHistogramRecorder instance. Shouldn't it be on
      > recordingPhase.readLock() as well?

      It's an either-or choice. We don't need both, as they both provide the same protection. I chose to use the readLock only where it was necessary to do so (in performIntervalSample(), to surround the flip, because the flip API requires it), and use the idiomatic "synchronized" for the public API calls.

      > Please, don't get demotivated, but flipPhase() looks wrong as well.
      > I think you should read the Left-Right paper more carefully, but
      > hey, I'm biased ;)

      No harm in being biased.

      And no worries, it's hard to get me demotivated. Even if you point to something that is actually incorrect or wrong.

      But given that the two points above aren't "wrong" (hopefully the explanation makes it clear that the implementation is correct), can you point out what you think is wrong with flipPhase()?

      As for Left-Right: I think Left-Right is a really cool alternative to ReaderWriter locks. But do you think there is some way to do the wait-free (for writers) WriterReader coordination I'm looking for (and that WriterReaderPhaser was built for) using Left-Right as a primitive?

      Delete
    2. Ahhh yes, I understand it now, the 'synchronized' is what matters here. Thanks for the explanation!

      After a more careful analysis of the code I realize now that it is correct.
      In fact, I believe this is a (smart) variant of Left-Right.

      Please notice the following when comparing with Left-Right:
      - The versionIndex has been encoded as the sign of startEpoch;
      - Because the versionIndex has been encoded as the sign of startEpoch, we can change both in one getAndSet() atomic operation, thus bypassing the need to do the second wait for arrive()/depart() on the previous counter after changing the versionIndex, just like the Left-Right variants "No Version" or "Reader's Version" described in the paper. Just like those two variants, the WriterReaderPhaser variant does not need to wait for the "previous versionIndex" before changing the versionIndex (the sign of startEpoch);
      - Although we could implement evenEndEpoch and oddEndEpoch as distributed counters (see LongAdder or DistributedCacheLineCounter), due to the getAndSet() and getAndIncrement() the startEpoch must always be a single atomic counter, which can cause a bottleneck on arrive() (named writerCriticalSectionEnter() in this variant);
      - The leftRight variable is encoded directly in the instance that is returned by activeHistogram, which may not be as cache-friendly as branching on leftRight;

      The main innovation in this variant of Left-Right (WriterReaderPhaser) is the combination of the new ReadIndicator with the versionIndex. The main advantage of WriterReaderPhaser is that the versionIndex can be re-used without having to verify it before re-using it, while at the same time keeping the starvation-freedom progress condition.

      Anyways, very cool stuff!

      Delete
    3. Pedro,

      :-) I think that both WriterReaderPhaser and Left-Right are (smart) variants on multi epoch games.

      The key difference I see between WriterReaderPhaser and Left-Right is in functionality: they aim at (and provide) opposite behaviors, which drives the significant difference in implementation:

      - Left-Right provides for wait-free readers and blocking writers. VERY useful in read mostly data structures (e.g. caches) where maintaining wait-free concurrency for readers is the important behavior.

      - WriterReaderPhaser provides for wait-free writers with blocking readers. Useful for write-mostly data structures (like histograms, frame-buffers, count accumulators) where maintaining wait-free operation on the recording path is the important behavior.

      The reader and writer roles (with respect to the data being shared) are very important in both cases. Neither WriterReaderPhaser nor Left-Right will protect writers from readers (readers are not expected to modify the active data writers are writing to).

      I don't really know if single-atomic-word variants of multi-epochs (like I use in WriterReaderPhaser) would work for a wait-free reader semantic in a variant of Left-Right (providing the same functionality with less state per your description, but with potential contention effects that your per-thread state may do much better on for readers). I certainly can't lay claim to the single atomic word trick itself, as it's been done so many times before. But I'm pretty sure WriterReaderPhaser (as it stands) doesn't cover the needed semantics for mimicking Left-Right's wait-free reader capability, just as Left-Right (as it stands) doesn't cover the needed semantics to mimic WriterReaderPhaser's wait-free writer capability. In this sense, I think both primitives may well be separately new and novel.

      But I may be missing something in how Left-Right can be used beyond the wait-free readers case. If you think it can also be used for wait-free writers in double buffered schemes, can you jot down an example of how one would use it? (E.g. an equivalent external API functionality to IntervalHistogramRecorder that uses Left-Right internally instead of WriterReaderPhaser)

      Delete
    4. Hi Gil,
      I've added an example as a pull request on github of how to use one of the other variants of Left-Right as a replacement for WriterReaderPhaser... and yes, the WriterReaderPhaser is a variant of Left-Right, that combines the versionIndex with the ReadIndicator.
      Discovering a new variant of Left-Right is not an easy feat at all, so kudos to you!

      You talked about it on your post so I'm sure you agree that a Reader-Writer Lock used "backwards" is still a Reader-Writer Lock, just like a Left-Right Pattern used "backwards" is still a Left-Right Pattern ;)

      If you have further questions feel free to drop me a line over email.

      Cheers,
      Pedro

      Delete
    5. Ahh, I think I see where the confusion is. Your implementation doesn't guarantee forward progress for readers (my readers). This may be a fundamental difference in requirements perception that made you think Left-Right variants can do the job. WriterReaderPhaser fundamentally guarantees forward progress of readers. It is one of the base requirements, as without this quality, a continuous recording stream (which is very common) could prevent readers from sampling the buffer indefinitely. This behavior of WriterReaderPhaser is spelled out in the JavaDoc [... "readers" block for other "readers", and "readers" are only blocked by "writers" whose critical sections were entered before the reader's flipPhase() attempt]. For clarity, I added this point as requirement 2.3 in the blog entry above (added at the time of this comment, so you may not have seen it in prior reading, leading to our confusion).

      AFAICT (correct me if I'm wrong), the Left-Right implementation in your pull request example (at https://github.com/pramalhe/HdrHistogram/commit/08f65b5221ae97b173da6e63b3f70807a66150dc) does not provide a forward progress guarantee for readers. This would result in an inability to read data when recording is fast enough, which is a very real problem for any write-heavy traffic with concurrent writers, and it can occur quite easily at a high enough concurrency level and/or with a short enough time between critical sections in the writer code.

      My work on WriterReaderPhaser started from/arrived at the basic requirement of using a single atomic word for this basic starvation reason (I don't think it can be done without a single atomic update for both phase and epoch count, but I'd be glad to be proven wrong). Without the non-starvation requirement (and as implemented in LatencyUtils before it), the writer-reader flip (as opposed to the reader-writer one, perhaps) can be done with a simple multi-epoch barrier (one for entry to and one for exit from the critical section), and this worked just fine as long as reader starvation was acceptable (or as long as I ignored it as a possibility). The multi-epoch pattern has been used for decades as far as I know (as a simple line-in-the-sand non-blocking checkpointing mechanism). The very reason I started work on WriterReaderPhaser was that I found out (the hard way) that the multi-epoch system had a starvation problem...

      BTW, I was not aware that (in its wait-free reader form) Left-Right allows for writer starvation. Is that really the case, or is my reading of the code wrong?

      If it really is the case, then I can suggest a WriterReader variant of Left-Right ( ;-) ) that would use the single-atomic word design requirement, providing guaranteed forward progress for writers even under continuous and heavy read activity. This (forward progress for writers) tends to be a basic requirement for "correct" reader-writer lock patterns, and I've run into serious bugs in systems that don't provide this guarantee (hint: readerWriter spin lock implementations in some kernel versions).

      Delete
    6. Hmm... I think my previous reply is wrong. And that your code DOES guarantee forward progress without the single atomic word thing.

      I was reading the isEmpty() loops "backwards", thinking they can go on indefinitely under pressure. The reason that won't happen is that you only do isEmpty() on the ingress/egress counter version that is not currently active, and therefore only currently in-flight readers (your readers) can be holding the loop back. Since no newly active readers would be counting with that version, you are guaranteed to complete.

      So yes, I think this Left-Right variant works correctly for the same use case.

      I think that the fact that you were re-reading the ingress value in the loops (e.g. return (readersEgressv0.sum() == readersIngressv0.sum()); in isEmpty()) is what made me think the loops are unbounded. You could loop waiting for the egress value to reach the seen-before-loop-starts ingress value instead, but since the ingress value being watched is not changing, the logic is equivalent (and nobody really cares about the cost of the extra read in this loop).

      A suggestion I'd have is to add a check that the writeLock is actually owned by the caller in toggleVersionAndScan() (and throw an exception if it is not). This will make it impossible for external callers to mess with your internal state. They can still use things wrong, but they can't mess with your internal version behavior. Also, you can just use a ReentrantLock for writeLock (and writeLock.getOwner() == Thread.currentThread() for the check), as there is no need for a stampedLock...

      Delete
    7. I'm going to assume that by "forward progress" you actually mean "starvation freedom". If this is not the case, then please explain in more detail what you mean by "forward progress", since Progress Conditions have their own precise definitions:
      http://concurrencyfreaks.com/2013/05/lock-free-and-wait-free-definition-and.html

      You said
      (...)Your implementation doesn't guarantee forward progress for readers (my readers).(...)
      but I'm afraid that is incorrect.
      In the paper, there are three variants of Left-Right: "Classical", "Reader's Version", and "No Version". Both the "Classical" and "Reader's Version" provide starvation freedom from Writers to Readers, from Readers to Writers, from Readers to Readers (if a wait-free ReadIndicator is used), and from Writers to Writers (if a starvation-free writersMutex is used). The pull request is using the Classical variant, and is therefore starvation-free from Readers to Writers, and from Writers to Readers.

      I'm pretty sure you have much better things to do with your time than reading semi-obscure papers such as the Left-Right one, but until you devote some time to it, I'm afraid any further discussion will be somewhat fruitless.

      Just a rhetorical question: Are there any _correct_ use-cases where two threads have acquired 'readerLock' and then call flipPhase() concurrently?

      Delete
    8. > Just a rhetorical question: Are there any _correct_ use-cases
      > where two threads have acquired 'readerLock' and then call
      > flipPhase() concurrently?

      The typical roles of writer and reader in WriterReaderPhaser are reversed from those of a classic ReaderWriter lock. It's the writers that get to run concurrently, while the readers are serialized. Hence the name.

      readerLock (in WriterReaderPhaser) is therefore a mutual exclusion lock (not a read write lock or stamped lock), so two threads can't acquire it and do anything concurrently. It is required for performing a flipPhase (which will throw an exception if the lock is not held by the caller), and its purpose is to protect readers from each other, such that any pre-flip work (like swapping the active and inactive roles of data structures) and any post-flip work (like reading the inactive contents) would be safe and stable when done under a readerLock().

      This combination provides an API that is natural for write-heavy double buffer flipping to use and reason about (with the reader and writer roles matching what the API user actually does). While it is possible for Left-Right with a Classical variant to be used to provide the equivalent concurrency protection, the named roles (and the lock types) will be reversed when doing so, requiring a write lock to be held by the readers, and for actual readers to play the role of "writer" and actual writers to play the role of "readers", as in your example. A de-obfuscating wrapper around Left-Right that would hide this API confusion and use the terms natural to the caller (in a WriterReaderPhaser use case) would help avoid questions like "what happens when two readers hold a readerLock()?", but more importantly, help avoid bug traps that would come when a "reader code author" naturally assumes that using the reader API calls (and playing the reader roles expected) in Left-Right would produce desirable or correct behavior.


      Delete
    9. Dope... for some reason I misread the type of the variable 'readerLock' as being a readLock() of ReaderWriterLock, totally my mistake, sorry for the added confusion. Thank you for taking the time to explain that!

      This means that 'readerLock' is the mutual exclusion lock named 'writersMutex' in the Left-Right paper :)

      The naming of 'readers' and 'writers' is certainly deceiving, we should have called them something like 'arrivers' and 'flippers'.

      Delete
    10. No harm. You misread my code a couple of times, and I misread your code a couple of times. It’s all code under the bridge.

      Having spent the time to read the Left-Right paper in more detail now, here are my observations:

      1. I'm still pretty sure WriterReaderPhaser is new. As in "a wait-free writers with blocking readers synchronization primitive" is new. I could call it "A Concurrency Control Technique with Wait-Free Population Oblivious Writes" to mirror Left-Right, and it would still be new. Left-Right (which is truly novel, I think) discusses wait-free readers with blocking writers. It doesn't discuss the possibility of wait-free writers with blocking readers. I haven't found (so far) anything else that does either.

      2. It is certainly true that a part of Left-Right's internal mechanism can be used under the hood in a WriterReaderPhaser implementation. We both seem to be using phased (or "versioned") variants on epochs to track entry/exit to critical sections on the "things" that we want to keep wait-free (readers in the case of Left-Right, writers in the case of WriterReaderPhaser). Not sure what that mechanism would be called, but it's not new in either of our implementations (checkpointing mechanisms have used it forever). It's the application of this mechanism to create a wait-free readers with blocking writers synchronization mechanism that makes Left-Right novel. I think that the independent application of a similar checkpointing mechanism to create a generically useful wait-free writers with blocking readers mechanism makes WriterReaderPhaser new as well. The duality is probably a natural artifact of the problems that we were independently working to solve.

      3. If you stand on your head and squint hard, you can correctly state that one *could* use a part of Left-Right as a mechanism that protects read access to the reference word that denotes the "active" write-to data structure in a double-buffered mechanism that wishes to maintain wait-free writers with blocking but guaranteed-forward-progress readers. But to do that all roles in Left-Right APIs must be reversed to their exact opposite meaning (hence the upside-down squinting). Readers mean writers. Writers mean readers. But you need to be careful not to make writers that mean readers mean writers again, and you need the readers to write and the writers to read. And if readers read or writers write things will go very wrong. Such use (while correct) is clearly counter-intention, and makes every comment and rule in Left-Right look like a "this is not what this thing promises" sign... Yes, it works, but only if/because you actually understand what the underlying implementation logic does. To an API user, that's not a wait-free writers API...

      4. The single-atomic-word, single wait-loop draining technique I use in WriterReaderPhaser is not the only way to protect the critical section. As you've demonstrated, it is possible to have the phase (or version) reside in a separate word if you use two draining wait-loops (before and after the phase shift). Both techniques remain wait-free for the "important" parties (readers for Left-Right, writers for WriterReaderPhaser), and both guarantee forward progress of the "less-important" parties (writers for Left-Right, readers for WriterReaderPhaser).

      5. The wider variants (more than just two atomic words) of Left-Right's flipping mechanism may be useful in WriterReaderPhaser variants that want to spread cache-line contention through striping. It's an interesting tradeoff to look at. My current implementation is probably faster for low contention situations (no dereferencing to get to the atomic lines), while your variants are probably faster in heavier contention situations (where the dereferencing gets you to separate parts of LongAdders).

      BTW Pedro, what is your twitter handle? I wanted to refer to you in my "I was wrong on the internet" tweet that linked to this comment thread...

      Delete
    11. I think you should show in your code where your "new" synchronization primitive is. And forget about inverting the roles of Reader with Writer, that doesn't make it "new".

      Delete
    12. Andreia,

      It's the primitive itself that is new. The thing that demonstrates the new primitive is not the specific example implementation (that's just a combination of atomic operations and volatiles). It's the API (captured in JavaDoc in this case) and the code that uses the primitive (e.g. IntervalHistogramRecorder) that show the new primitive.

      The new primitive is a WriterReaderPhaser: a primitive that provides a direct API for writers that write into a shared data structure to coordinate with readers that need to read a stable view of the data structure, and provides wait-free behavior for the writers while maintaining blocking but guaranteed-forward-progress behavior for the readers.

      Plenty of other primitives that provide for coordination of writers and readers exist, but I am not aware of a single one that provides these qualities. They are highly desirable for the common use case of write-mostly structures with occasional reads for e.g. logging or analysis.

      Implementing something that provides these qualities out of a combination of other previously available primitives doesn't make the primitive itself less "new". That's the whole point: Extracting the primitive and describing its common use, qualities, and guarantees. I'm sure I'm not the first to make writers and readers behave like this. I'm sure people have custom-solved this using their own custom and cobbled-together concurrent thing that they had to make sure worked for the purpose each time (I have done that before more than once). But I *think* I'm the first to describe the primitive itself (with an example working implementation) along with an API that others can use without building their own custom concurrent synchronization mechanism for the task at hand (and having to convince themselves that it actually does what they want, and does it safely).

      If a previously un-described primitive that directly serves a common use case like this does not qualify as new, neither would any other primitive above an atomic single-word operation (and maybe not even those). A regular read-write lock can be built from a counting semaphore or a CAS. A semaphore, or a condition variable, or a spin lock wouldn't be different from a CAS either. And Left-Right wouldn't be a new thing. After all, they all run on Turing machines, and we can create hard-to-reason-about direct implementations of each using basic instructions embedded in the code that needs the behavior at hand.

      It’s [only] the specified qualities and the guarantees that the primitives provide under prescribed use rules that make them useful and different from each other.

      What makes Left-Right new (and different from, say, RCU and epoch based SMRs that preceded it) is the use case and API, not the implementation. Left-Right provides a clean way for readers and writers on a common data structure to coordinate with better behavior (for the use case at hand) than was previously available in a directly useable primitive. [IMO] The generally useful Left-Right use case is your emulation of a ReadWriteLock with better qualities. Left-Right provides a clean API that requires nothing special of readers (just do your reads on the data structure you get from Left-Right) and only a slight modification needed in how writers normally use such a primitive (need to write twice, on both sides of a flip).

      What makes WriterReaderPhaser new is the use case and API. WriterReaderPhaser provides a clean way for writers and readers to coordinate with better behavior (for the common use case at hand) than was previously available in a directly useable primitive. It does so with a clean API that requires nothing special from the writers (just protect your writes with the WriterReaderPhaser) and only a slight modification from how a mutex or ReadWriteLock-used-backwards would be used: readers switch active/inactive data sets before a flip, and use the inactive set after the flip. [No Left-Right style double writes are needed, because it is a fundamentally different use case.]

      Delete
  4. Seems just like what I need! My use case is that multiple writers add work to a synchronized list and a single threaded pool picks up the entire batch and gives the worker a new list. Now if you happen to have an unbounded and resizable list as well that concurrent writers can use lock-free, that would be wonderful. (Can't use fixed buffer because of the dynamic and non-blocking environment: can't tax small activities with several kilobytes of buffer every time.)

    I've looked at flipPhase and I believe oddEndEpoch and evenEndEpoch can be written lazily; the following startEpochUpdater.getAndSet is a full barrier anyway.

    ReplyDelete
    Replies
    1. Dávid,

      You are correct about being able to use a lazySet instead of a volatile write in the flip when resetting the [currently inactive] end epochs ahead of flipping that start phase. I made the change to both the gist above and the github code. Thx for pointing out the optimization.

      For the resizable list question above, I'm not quite sure if this will address your need for resizable lists, but it may: I had recently built resizing capability into ConcurrentHistogram, and I use the WriterReaderPhaser for that too (managing two internal counts arrays). HdrHistogram's Recorder (what used to be IntervalHistogramRecorder) now uses ConcurrentHistogram under the hood when resizing support is needed (you can find both the Recorder and ConcurrentHistogram classes in https://github.com/HdrHistogram/HdrHistogram/tree/master/src/main/java/org/HdrHistogram).

      That use case (for resizing) is a bit more complicated than a pure writer-reader-flip though, and probably falls right between what WriterReaderPhaser and LeftRight are each intended to do (by their expected abstraction and use protocol or contract), so I'm not sure what I'd call the way I use the primitive there (it can be thought of both as "WriterReaderPhaser with extra assumptions" and "LeftRight with extra assumptions"). I believe my resizing logic is safe to do using my *current implementation* of WriterReaderPhaser, but I'm not entirely sure it is safe using the contract I state for its use (i.e. if some future implementation of the WriterReaderPhaser contract will be just as safe to use). I'll need to think about that [or if you want to spend the time to reason about it and follow up here, that would be cool].

      Delete
    2. Thanks for the answer Gil. I'll do some experiments with WRP and simple 2D atomic reference array with resize logic similar to Cliff Click's concurrent hashmap.

      Delete
  5. I hope I'm not too OT: wouldn't it be nice to implement an off-heap single writer histogram that uses an off-heap phaser to allow an IPC version of the same pattern used in jHiccup to record interval histograms?
    If you say it seems feasible, I'm very favourable to start working on it and pushing a PR on HdrHistogram (but I'm not sure it can be done without using Unsafe or similar) :)

    ReplyDelete
    Replies
    1. If the recording is done in Java, I'd have an in-process polling service that other processes call into, and which copies the stable histogram out to a place where others can see it (and has its own cross-reader coordination semantics). HdrHistograms are generally cheap enough to copy that copying one (a few tens of KB) each time you poll for a stable interval read is not a big deal. As for placing data in some cross-process (and polyglot) readable form, I'd use the compressed histogram format that the various HdrHistogram implementations (Java, C, .NET, Python, JavaScript, Rust, etc.) all know how to parse, and place the data in a mapped file or shared memory object.

      Delete
    2. That's a good point, but IMO it doesn't allow complete decoupling between the monitored and monitoring process.

      For example, the Aeron counters are always "on", without any specific configuration of the monitored applications (no polling rate configuration either), but on demand a monitoring process can be attached and collect the samples with a preferred configuration.
      It seems similar but from the perspective of the monitored application isn't actually the same.
      Anyway it was just something funny to work on, I'm already more than grateful and happy about what the lib is already providing :)

      Also, re the WR Phaser: did you explore alternative implementations (a single counter for the end phase?) too?

      Delete
    3. Let me try to share some hard-earned wisdom here: I predict that the inherent coupling between the writer and readers will lead you into a long (and probably fun) Yak shaving exercise here:

      IMO it is “hard” (as in probably impossible but I can’t prove that) to completely decouple a monitored and monitoring process and have speed that is close to what we are talking about here.

      If we built such a shared-memory WRP for coordinating such shared memory data, and even if we only had a single writer for the shared data structure it coordinates (e.g. a single-writer histogram recorder), this would create a very strong coupling of the readers to the writer, as "strange things" that the writer may do (like crashing and not coming back) can cause "terrible things" in the readers (like hanging indefinitely). E.g. imagine something killing or crashing the single writer while it is in the critical section. Not a problem for in-process stuff (everything crashes together), but a new and annoyingly detailed thing to solve in a cross-process situation.

      The way such things tend to get resolved in cross-process IPC is usually to add some system-level thing that ensures that the "important" processes that everyone else is strongly coupled to don't disappear forever, and the important processes have to deal with state-resetting things that in-process synchronization never needs to think about. E.g. in this case, "something" must make sure a new writer starts if the writer disappears, but must also make sure only a single writer ever exists. (Sidetrack here to figure out how to make the magic "something" itself HA enough so it doesn't disappear, and reinvent something like inetd.) And now the new writer needs to reset the critical section indicators to unblock any potentially hung readers, etc. And we haven't even started on the cross-reader coordination stuff that now needs to cover the possibility of a crashed "non-important" reader (so your reader locks now need something like safe expiry and ownership refresh mechanisms, or a lease mechanism). You also need to deal with lifecycle and namespace issues on the shared resource (e.g. who creates the shared resource? And since it will outlive a crashed writer and must be reused by a replacing writer, who tears it down?). This will probably also pull you into a fun discovery-questions sidetrack, and into namespace convention stuff for some additional re-invention activities.

      Are we having fun yet? (Actually, probably yes).

      This is all doable and solvable, but the solutions tend to create a strong coupling between the participants, and that coupling tends to be around new (and often fragile) protocols that all future versions of all participants must adhere to and follow. You then find yourself trying to come up with a robust and future-proof cross process protocol, and often re-invent something like Unix sockets or named pipes. When you try to extend the same concept outside of a single image (e.g. to distributed shared memory, or cross-container shared memory) you end up reinventing something like TCP.

      The complexity and fragility involved in all this usually makes it worth "doing right" only for high-value things. E.g. empirically, a shared memory buffer for very high speed cross-process IPC may or may not count as high enough value, because most solutions for that abstraction are not "quite as right" in the longevity or robustness dimensions as e.g. HTTP sockets in a TCP/IP world with DNS namespaces are. (E.g. Aeron is possibly on the way to providing a longevity/robustness addressing of that concept, but it takes years to get there.) For all other situations, putting down the Yak-shaving shears and retreating back to in-process synchronization fronted by well understood and established protocols (e.g. exposing a service that others call into using an established way to do things between processes and in distributed systems) is probably the way to go.

      Delete
  6. Thanks Gil,

    You've given me very good points for understanding how complex and difficult it could be to make such counters "right" by just porting the in-process pattern into the IPC world.
    It totally makes sense, and providing it natively without taking care of all the logical states that could happen would just risk providing a broken/dangerous tool to the end users.

    > Are we having fun yet? (Actually, probably yes).

    Never for fun, but enough if for a daily job :D

    I'm starting to think that https://github.com/JCTools/JCTools/tree/master/jctools-channels would need to provide at least the minimum required APIs to allow liveness recognition and unblock operations, to help a user not fall into the "bad things won't happen" trap while using such tools in IPC space.

    Re the WR Phaser, I've noticed on the mechanical sympathy group that there is an old post related to it... and it covers most of the questions I could have on it, but one.

    Do you feel you were influenced by any concurrency patterns while designing its logic?
    I'm just trying to understand whether I can gain the most by writing my own stuff or by reading some good book/code to become more confident while dealing with concurrent programming :P

    ReplyDelete
    Replies
    1. > Do you feel to have been influenced by any concurrent patterns while designing its logic?

      I'm sure I've been influenced by many. But there isn't some specific pattern I can point to that I modeled WRP after. I wanted it to be generically useable, in library form, by user-space programs. That ruled out a whole bunch of patterns that can be used to achieve similar benefits but would only be useable in much more restricted circumstances.

      I've read many papers (and not nearly as many books) over the years, but I mostly just read a lot of code, often trying to "empathise" with the writer in trying to understand the motivations, the tradeoffs being made, and accepting that most code isn't (and doesn't need to be) generically useful, and represents an interesting point-solution that something may be learned from. Some of this stuff soaks into my brain, and over time I recognize patterns and families of "how things are done" and "what qualities are sought", which I don't necessarily have names for. This obviously includes huge families of "mistakes people make" and "pitfalls people run into", which to my thinking are the best teachings of all.

      Over years of this sort of exposure I guess you end up with interesting classifications. E.g. there is this huge group of what I think of as "epoch style" things that allow for some great concurrency qualities as long as something is known to be reliably ticking in the background, taking care of some crucial book-keeping (like garbage collecting or expiring/releasing old things) and guaranteeing forward progress even when the players go idle (RCU is a good example of that pattern). This sort of pattern can be used in systems that can centralize this sort of background book-keeper thing across all uses of it (e.g. a Linux kernel gets to do this), but it becomes less practical in many user-mode library situations, where things get annoyingly complicated when looking to provide useful primitives that can coordinate work across a few players within a larger system (which can have hundreds of other unrelated players that want a similar thing for themselves and their own small groups of functionality). There is a lot to learn from these "epoch style" patterns (e.g. about what great things you can wish for), but also a lot to learn about what makes them hard to use in certain spaces (e.g. "don't rely on any magic background threads" is a good rule of thumb for building a widely reuseable user-space primitive that would show up in a library, like e.g. a mutex, or semaphore, or RWlock, without requiring the using process to dedicate some executors for making the stuff work). The opposite-side takeaway from this can also be: "If [in your problem domain] you can afford to dedicate some common background workers for some purpose, you can probably build/use/achieve some really cool concurrency properties that cannot be had without such things."

      E.g. GC makes a whole class of concurrency mechanisms possible and practical that would be "very hard" to do without a garbage collector.

      Delete