Random thoughts on concurrency, databases and distributed systems

Seqlock-Based Atomic Memory Snapshots

Andrei Pechkurov — Sun, 06 Aug 2023 11:10:02 GMT

Last time we discussed the k-word CAS algorithm and came to the conclusion that seqlock-based atomic snapshots may be used as an alternative when your writes are infrequent and a single-writer limitation is acceptable, but you want readers to be able to read multiple values atomically. Such use cases can be found in the Linux kernel and elsewhere. For instance, QuestDB uses a similar approach to read the metadata of the latest transaction atomically. So, today we'll consider an implementation in Java and try to measure how efficient it is from the reader's perspective.

To keep things simple, we'll write an atomic tuple class holding 3 long values. The structure of the class looks like the following:

    public class AtomicLongTuple {

        // Version is used for the seqlock.
        long version;
        // 3 long fields:
        long x;
        long y;
        long z;

        // Var handles for the fields:
        private static final VarHandle VH_VERSION;
        private static final VarHandle VH_X;
        private static final VarHandle VH_Y;
        private static final VarHandle VH_Z;

        // Holder class used by readers and writers.
        public static class TupleHolder {
            public long x;
            public long y;
            public long z;
        }

        // Readers provide a holder to avoid an allocation.
        public void read(TupleHolder holder) {
            // Implementation goes here...
        }

        // Writers provide a function to modify the latest state.
        public void write(Consumer<TupleHolder> writer) {
            // Implementation goes here...
        }
    }

Here we have the tuple, i.e. 3 long fields, plus one more long field called version. As we'll see shortly, the version is needed to implement the seqlock. Of course, the above code doesn't include boring details such as padding or VarHandle initialization, but you may find the full source code here.

To learn about the seqlock, let's start with the write() method:

    public void write(Consumer<TupleHolder> writer) {
        for (;;) {
            final long version = (long) VH_VERSION.getAcquire(this);
            if ((version & 1) == 1) {
                // Another write is in progress. Back off and keep spinning.
                LockSupport.parkNanos(1);
                continue;
            }
            // Try to update the version to an odd value (write intent).
            // We don't use compareAndExchangeRelease here to avoid
            // an additional full fence following this operation.
            final long currentVersion = (long) VH_VERSION.compareAndExchange(this, version, version + 1);
            if (currentVersion != version) {
                // Someone else started writing. Back off and try again.
                LockSupport.parkNanos(10);
                continue;
            }
            // Apply the write.
            writerHolder.x = (long) VH_X.getOpaque(this);
            writerHolder.y = (long) VH_Y.getOpaque(this);
            writerHolder.z = (long) VH_Z.getOpaque(this);
            writer.accept(writerHolder);
            VH_X.setOpaque(this, writerHolder.x);
            VH_Y.setOpaque(this, writerHolder.y);
            VH_Z.setOpaque(this, writerHolder.z);
            // Update the version to an even value (write finished).
            VH_VERSION.setRelease(this, version + 2);
            return;
        }
    }

The above code uses acquire/release memory semantics, so if you're not familiar with them, refer to this post. Other than that, the code is quite simple: each writer spins trying to increment the version field to an odd value to prevent concurrent writes (think of a spinlock) and, when it succeeds, applies the given function to the current tuple state. Finally, it increments the version once more, so that it has an even value telling readers and other writers that there is no ongoing write.

Notice that we're using weak, opaque-mode loads and stores (getOpaque/setOpaque) to fetch and modify the tuple in the "critical section". These operations don't need to be atomic, all thanks to the atomicity (and ordering guarantees) of the surrounding operations and memory fences.

Now, let's check what readers do:

    public void read(TupleHolder holder) {
        for (;;) {
            final long version = (long) VH_VERSION.getAcquire(this);
            if ((version & 1) == 1) {
                // A write is in progress. Back off and keep spinning.
                LockSupport.parkNanos(1);
                continue;
            }
            // Read the tuple.
            holder.x = (long) VH_X.getOpaque(this);
            holder.y = (long) VH_Y.getOpaque(this);
            holder.z = (long) VH_Z.getOpaque(this);
            // We don't want the below load to bubble up, hence the fence.
            VarHandle.loadLoadFence();
            final long currentVersion = (long) VH_VERSION.getAcquire(this);
            if (currentVersion == version) {
                // The version didn't change, so the atomic snapshot succeeded.
                return;
            }
        }
    }

The reader's code is even simpler than the writer's one. Each reader spins trying to read an even version, then reads the tuple state. Finally, it reads the version once again and compares it with the previously read value. If the value hasn't changed, the reader was able to read the tuple atomically, so the operation finishes. Again, the tuple state operations are non-atomic, but thanks to the surrounding operations and fences that's not needed.
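For comparison, here is a minimal Go sketch of the same protocol. It is not a port of the Java class: `Tuple3`, `Write` and `Read` are hypothetical names, the back-off is a plain `runtime.Gosched()`, and every field is a `sync/atomic` value. Go's atomics are sequentially consistent, so the sketch needs no explicit fences, at the price of stronger (and potentially slower) operations than the acquire/release and opaque accesses used above.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// Tuple3 is a hypothetical seqlock-protected tuple of three int64 values.
type Tuple3 struct {
	version atomic.Uint64 // odd means "write in progress"
	x, y, z atomic.Int64
}

// Write applies fn to the current state while holding the seqlock.
func (s *Tuple3) Write(fn func(x, y, z *int64)) {
	for {
		v := s.version.Load()
		if v&1 == 1 || !s.version.CompareAndSwap(v, v+1) {
			runtime.Gosched() // another write is in progress
			continue
		}
		x, y, z := s.x.Load(), s.y.Load(), s.z.Load()
		fn(&x, &y, &z)
		s.x.Store(x)
		s.y.Store(y)
		s.z.Store(z)
		s.version.Store(v + 2) // even again: write finished
		return
	}
}

// Read returns a snapshot in which all three values belong to the same write.
func (s *Tuple3) Read() (x, y, z int64) {
	for {
		v := s.version.Load()
		if v&1 == 1 {
			runtime.Gosched() // a write is in progress
			continue
		}
		x, y, z = s.x.Load(), s.y.Load(), s.z.Load()
		if s.version.Load() == v {
			return x, y, z // the version didn't change: snapshot is consistent
		}
	}
}

func main() {
	var tuple Tuple3
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 10000; i++ {
			tuple.Write(func(x, y, z *int64) { *x++; *y++; *z++ })
		}
	}()
	for i := 0; i < 10000; i++ {
		if x, y, z := tuple.Read(); x != y || y != z {
			panic("torn read")
		}
	}
	wg.Wait()
	fmt.Println(tuple.Read())
}
```

The writer always bumps all three values together, so any snapshot with unequal values would be a torn read.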

A "classical" seqlock assumes a separate sequence number and a mutex. Using a mutex over a hand-made spinlock is a wise choice since mutexes in modern runtimes and OSes have come a long way and they're not as expensive as they used to be. In our case, the version field combines the sequence number and the mutex (to be more precise, a spinlock). For educational purposes, that's not really important since the algorithm is the same. Thanks to seqlock, we were able to implement atomic memory snapshots with a few lines of code.

As you may have noticed, the algorithm is not lock-free since if a writer thread gets blocked after it has incremented the version to an odd value, no readers and writers would be able to make progress. Considering this, you may be asking yourself how practical this approach is. It's certainly not versatile and only makes sense in use cases when the writes are infrequent and the total amount of memory to be read stays within a few cache lines (ideally, within a single cache line, i.e. 64 bytes on most modern machines). In other scenarios, it's better to use a good old exclusive or shared mutex.

If you're curious about how many wasteful spins a reader has to make before making a successful snapshot, there is a test for that. It starts a few reader threads that spin over the read() method and a single writer thread that calls write() and then emulates a bit of work by calling Blackhole.consumeCPU(50) on the Blackhole class from JMH. When the test finishes, each reader thread reports the wasteful_spins / total_spins ratio which, on my Linux machine, is around 0.7-0.9. So, in the presence of relatively frequent writes, no reader has to do more than a couple of spins on average.

That's it for today. Good luck and see you next time.

A Few Thoughts on K-Word CAS

Andrei Pechkurov — Thu, 06 Jul 2023 17:46:52 GMT

A few months ago I went through the Efficient Multi-word Compare and Swap paper, so here are a few thoughts on the algorithm. Long story short, I have mixed feelings about this k-word CAS algorithm. It focuses on nice properties of the CAS operation, such as lock-freedom and linearizability, while overlooking atomic k-word reads.

The core idea of the algorithm is that writers do a loop of CAS operations on each stored word. They try to swap the value (or a pointer) with a pointer to the so-called operation descriptor structure. The structure includes old and new values, as well as the operation status. Once a writer successfully completes all CASes, it does one more CAS on the status field marking the operation as complete. So, the algorithm requires k+1 single-word CAS operations per k-word CAS.

Indeed, this k-word CAS algorithm is lock-free and linearizable, but if you also want to be able to read all k words atomically, the algorithm won't be of any help. Of course, you can do a no-op k-word CAS to do an atomic read, but such a read may be costly. One more downside is that being able to swap a primitive value with a pointer means that the algorithm is not meant to be used in any language with a GC. In theory, it can still be used in languages with a non-moving GC, such as Golang, but only with unsafe things like pointer tagging. Also, if you're fine with a less plain memory layout, then, say, in Java the algorithm may be modified to use an array of AtomicReference instead of an array of a primitive type (usually, long[]).

If lock-freedom is not a must, a seqlock-like approach might do just fine. The sequence field could be used to preserve exclusive writer access, as well as to let readers determine whether they got an atomic snapshot of all k words. If lock-freedom is important, then in languages with GC there is an even simpler option, which is to use an immutable data structure and let the writers do a single-word CAS swapping the pointer to the old values with a pointer to the new ones. Finally, a good old lock could be used, optionally a much more scalable reader-writer one.
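The immutable-structure option can be sketched in a few lines of Go. This is only an illustration under my own assumptions: `Cell` and `snapshot` are hypothetical names and k is fixed at 3. Readers get all k words with a single pointer load, and a writer publishes a freshly allocated snapshot with one CAS on the pointer.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// snapshot is an immutable set of k words (k = 3 here).
// It is never modified after being published.
type snapshot struct {
	words [3]int64
}

// Cell holds a pointer to the current snapshot.
type Cell struct {
	cur atomic.Pointer[snapshot]
}

// NewCell creates a cell with the given initial words.
func NewCell(words [3]int64) *Cell {
	c := &Cell{}
	c.cur.Store(&snapshot{words: words})
	return c
}

// Read returns all k words atomically via a single pointer load.
func (c *Cell) Read() [3]int64 {
	return c.cur.Load().words
}

// CompareAndSwap replaces the words with next only if they currently equal old.
func (c *Cell) CompareAndSwap(old, next [3]int64) bool {
	for {
		p := c.cur.Load()
		if p.words != old {
			return false
		}
		if c.cur.CompareAndSwap(p, &snapshot{words: next}) {
			return true
		}
		// The pointer changed under us, so another writer made progress.
		// Re-check the values and try again.
	}
}

func main() {
	c := NewCell([3]int64{1, 2, 3})
	fmt.Println(c.CompareAndSwap([3]int64{1, 2, 3}, [3]int64{4, 5, 6})) // true
	fmt.Println(c.CompareAndSwap([3]int64{1, 2, 3}, [3]int64{7, 8, 9})) // false
	fmt.Println(c.Read())                                                // [4 5 6]
}
```

The trade-off is one allocation per successful write, which is exactly what the GC is there to clean up.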

The Secret Life of fsync

Andrei Pechkurov — Fri, 31 Mar 2023 18:49:29 GMT

Several times I've heard opinions that many mass-market SSDs and HDDs don't provide sufficient durability guarantees and Linux can do nothing about that. Namely, after an fsync() call, recently modified data can still sit in the drive's volatile write cache and, thus, it may be lost in case of a power failure. If you want any meaningful durability, you should go for enterprise-grade drives that have a battery/capacitor so that they can flush the data to persistent storage on power loss. Is it really so? Let's find out.

First, let's check what the POSIX.1-2017 specification says about fsync:

The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined.

The above description is rather vague. If the OS issues operations to write the data to the disk's volatile cache, that's a "transfer", so formally such OS would be POSIX-compliant. The informative section of the spec sheds more light on what a proper fsync implementation should do:

The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk. Since the concepts of "buffer cache", "system crash", "physical write", and "non-volatile storage" are not defined here, the wording has to be more abstract.

OK, that's much more specific. Once an fsync call is made, the data should become durable in the face of a system crash, e.g. due to a power loss. But is it really something Linux does on an fsync?

As with many other FS-related system calls, most (if not all) file systems have their own implementation of fsync. To keep things simpler, we're going to check the ext4 implementation in the recent 6.x kernel code base. We should be looking at the `ext4_sync_file()` function which is invoked on an fsync() call. It involves the following steps:

First, it writes all dirty pages belonging to the file that corresponds to the input file descriptor to the disk. That's done by the `file_write_and_wait_range()` function. As a result, the data may be sitting in the disk volatile cache, so that's not what we're looking for.
Next, it writes the inode's metadata to the disk. Depending on whether journaling is enabled on the FS or not, it's done via a specific function, e.g. `ext4_fsync_journal()`. Again, not something we're in search of.
Finally, if the needs_barrier variable is true, it calls the blkdev_issue_flush() function. That's probably what we need, isn't it?

Let's leave the needs_barrier variable out of the equation for now and check what blkdev_issue_flush() does. This function queues a flush operation to the block device and waits until it's finished. The operation has the REQ_PREFLUSH bit set among the flags. If we open kernel docs, we'll find some information on this flag (and not only):

In addition the REQ_PREFLUSH flag can be set on an otherwise empty bio structure, which causes only an explicit cache flush without any dependent I/O. It is recommend to use the blkdev_issue_flush() helper for a pure cache flush.

As we expected, the REQ_PREFLUSH flag (as well as the REQ_FUA flag) tells the block device that it should flush its volatile cache to the persistent storage. Drivers for any well-behaved disk with a volatile write cache should handle this flag properly. Obviously, disks without such cache don't need to bother with these operations and flags.

Now, what's the buzz with needs_barrier? In both journaled and non-journaled ext4 code paths, it appears to be nothing more than an optimization to avoid sending flush requests to the disk multiple times. Note that you can also configure ext4 not to issue flush operations. For example, in non-journaled mode it's done via the EXT4_DEFM_NOBARRIER mount option.

Other file systems have their own specifics, but the overall logic should be close enough to ext4's. So unlike macOS, Linux does its best to transfer the data to persistent storage on fsync. Of course, this doesn't protect you from a flawed driver implementation written for a cheap no-name drive, but it also means that if you have a decent SSD from a well-known brand, you may be fine without an enterprise-grade disk.
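From the application side, all of the above hides behind a single call. Here is a minimal Go sketch, with `writeDurably` and `demo` being hypothetical helper names; on Linux, `(*os.File).Sync` boils down to fsync(2), so on ext4 it ends up in the code path discussed above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeDurably writes data to path and calls fsync on the file
// before reporting success.
func writeDurably(path string, data []byte) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	// Sync returns only after the kernel has finished the fsync call.
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// demo writes a small file into a temporary directory and reads it back.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "fsync-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "data.log")
	if err := writeDurably(path, []byte("hello\n")); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("read back %q after fsync\n", got)
}
```

Checking the error returned by `Sync` matters: ignoring it means you don't know whether the data made it to the device.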

Multithreaded Scatter-Gather Execution Model for Analytical Queries

Andrei Pechkurov — Sat, 14 Jan 2023 11:03:43 GMT

Today we'll be discussing an approach used in some analytical databases to speed up the execution of queries at the cost of additional HW resources, namely CPU cores and memory. The query execution model is usually referred to as "scatter-gather", yet it's hard to find an article with a good amount of detail for this model (at least, I failed to do that), so I decided to write a brief post on the topic.

To be slightly more concrete, here is a simple example of a query that can benefit from scatter-gather execution:

    SELECT sensor_id, max(temp), min(temp), avg(temp)
    FROM temperature
    WHERE sensor_id IN (402, 1202, 3983)
    GROUP BY sensor_id;

To be able to use the scatter-gather model, the storage format used in the database needs to support parallelism, i.e. it should be possible to divide the on-disk data into individual parts that can be independently scanned. In practice, this usually means log-structured append-only storage format (which is in many cases also columnar) but not necessarily.

Now, let's consider the execution of the above query.

Scatter

For the sake of simplicity, let's assume that there is no index on the `sensor_id` column and table columns are stored in append-only log files. In this case, it's trivial to split the log files to be scanned into N chunks based on offsets. That's basically what the original thread does when scattering the query execution. Each of the chunks is serialized into a task object/struct along with the query execution plan details, such as selected columns, aggregate functions and filter, and written into an in-memory queue to be picked up by worker threads.

Let's refer to the original thread as the orchestrator thread. If the original thread belongs to the same thread pool, it must participate as a worker, i.e. it should poll tasks from the queue and execute them. That's to avoid starvation and deadlocks in the situation when all threads try to orchestrate their own queries.

When a worker thread picks up a task, it starts executing it. It scans through the data, applies the filter (WHERE clause) and calculates the intermediate result for each aggregate function (min, max and avg in our example). A convenient way to store the result is a hash table holding int-to-tuple key-value pairs. Here, we use the int type for the key, assuming that sensor_id is an integer column, and each tuple stands for the intermediate results of our three aggregate functions.

Once a task gets executed, the worker needs to write the intermediate result (hash table) into an in-memory queue to be consumed and gathered by the orchestrator thread.
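The scatter phase can be sketched in Go as follows. Everything here is an assumption made for illustration: the table is an in-memory slice instead of log files, Go channels play the role of the in-memory queues, and `row`, `partial`, `task`, `scan` and `scatter` are hypothetical names. The orchestrator-as-worker rule is left out for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// row is a stand-in for one record of the temperature table.
type row struct {
	sensorID int
	temp     float64
}

// partial is the intermediate state for max, min and avg
// (avg is kept as sum and count).
type partial struct {
	max, min, sum float64
	count         int64
}

// task describes one chunk as a half-open offset range [lo, hi).
type task struct{ lo, hi int }

// scan filters one chunk and builds its intermediate hash table.
func scan(rows []row, tk task, wanted map[int]bool) map[int]partial {
	out := make(map[int]partial)
	for _, r := range rows[tk.lo:tk.hi] {
		if !wanted[r.sensorID] {
			continue // WHERE sensor_id IN (...)
		}
		p, ok := out[r.sensorID]
		if !ok {
			p = partial{max: r.temp, min: r.temp}
		}
		if r.temp > p.max {
			p.max = r.temp
		}
		if r.temp < p.min {
			p.min = r.temp
		}
		p.sum += r.temp
		p.count++
		out[r.sensorID] = p
	}
	return out
}

// scatter splits rows into n chunks (n > 0), lets the workers scan them
// and returns the intermediate results in no particular order.
func scatter(rows []row, n, workers int, wanted map[int]bool) []map[int]partial {
	tasks := make(chan task, n)
	results := make(chan map[int]partial, n)
	for i := 0; i < n; i++ {
		tasks <- task{lo: i * len(rows) / n, hi: (i + 1) * len(rows) / n}
	}
	close(tasks)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for tk := range tasks {
				results <- scan(rows, tk, wanted)
			}
		}()
	}
	wg.Wait()
	close(results)
	var out []map[int]partial
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []row{{402, 21.5}, {7, 30}, {1202, 19}, {402, 22.5}, {3983, 18}, {402, 20}}
	wanted := map[int]bool{402: true, 1202: true, 3983: true}
	for _, p := range scatter(rows, 3, 2, wanted) {
		fmt.Println(p) // one intermediate hash table per chunk
	}
}
```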

Gather (and merge)

The orchestrator thread needs a hash table to store query results. Initially, it's empty, but as soon as the thread consumes (gathers) a result from one of the workers, it needs to merge two tables. The merge is simple thanks to the natural properties of the aggregate functions:

    min(A ∪ B) = min(min(A), min(B))
    max(A ∪ B) = max(max(A), max(B))
    count(A ∪ B) = count(A) + count(B)
    sum(A ∪ B) = sum(A) + sum(B)
    avg(A ∪ B) = sum(A ∪ B) / count(A ∪ B)

It's not a complete list, but most scalar aggregate functions assume little data to be stored as their state, so they fit into scatter-gather nicely.

As soon as the orchestrator gathers and merges the last task result, it has the query result to be returned to the client.
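The merge step itself is small. Below is a hedged Go sketch with hypothetical names (`aggState`, `merge`); it keeps sum and count as the state for avg and derives the average only once everything is merged.

```go
package main

import "fmt"

// aggState is the per-group state for min, max and avg.
type aggState struct {
	min, max, sum float64
	count         int64
}

// merge folds the intermediate table src into the result table dst.
func merge(dst, src map[int]aggState) {
	for k, s := range src {
		d, ok := dst[k]
		if !ok {
			dst[k] = s // first time we see this group
			continue
		}
		if s.min < d.min {
			d.min = s.min
		}
		if s.max > d.max {
			d.max = s.max
		}
		d.sum += s.sum
		d.count += s.count
		dst[k] = d
	}
}

// avg is computed from the merged state at the very end.
func (a aggState) avg() float64 { return a.sum / float64(a.count) }

func main() {
	result := map[int]aggState{}
	merge(result, map[int]aggState{402: {min: 10, max: 20, sum: 30, count: 2}})
	merge(result, map[int]aggState{402: {min: 5, max: 15, sum: 20, count: 2}})
	r := result[402]
	fmt.Println(r.min, r.max, r.avg()) // 5 20 12.5
}
```

Merging averages directly would be wrong, which is why the state carries sum and count instead.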

Conclusion

Scatter-gather(-merge) is a simple single-stage execution model. It works nicely for relatively trivial GROUP BY queries with an optional WHERE clause while more complex queries involving JOINs require a more complex multi-stage parallel execution.

Of course, Amdahl's law applies to the scatter-gather model, so if the serial part of the total work is significant, the speedup from parallelism will be modest. But if the database uses blocking disk I/O, there might be a benefit even in the degenerate case when each group has a single row to be scanned.

One more advantage of this model is that it applies to distributed databases naturally. We can easily swap "thread" with "node" in the above text with no other changes except for the storage requirement. The data has to be sharded across cluster nodes and the orchestrator node has to be aware of the sharding scheme so that it knows where the data is located when it distributes the work.

I'm interested in learning more about analytical query execution models and not only, so if you have anything to share, don't hesitate to write a comment. Have fun coding and see you next time.

BuzzwordBusters: What Does Lock-Free, Wait-Free Really Mean?

Andrei Pechkurov — Sun, 18 Dec 2022 08:15:33 GMT

It seems to be a common belief that code which uses mutexes/locks/synchronized methods is "slow" and that, as soon as you replace them with atomics, your code becomes fast and lock-free. In reality, atomic operations alone don't make your code wait-free, lock-free, or even obstruction-free. This tiny blog post is dedicated to these definitions.

Wait-freedom means that any thread can make progress in a finite number of steps regardless of external factors, such as other threads blocking. A trivial example of a wait-free data structure is an atomic counter (in x86 it would use a LOCK XADD instruction), e.g. Java's j.u.c.AtomicInteger.

    import java.util.concurrent.atomic.AtomicInteger;

    public class Counter {
        private final AtomicInteger cnt = new AtomicInteger();

        public int add(int delta) {
            return cnt.addAndGet(delta);
        }

        public int get() {
            return cnt.get();
        }
    }

Lock-freedom means that the application as a whole can make progress regardless of anything. So, while individual threads may be blocked, at least one of them would be making progress. A trivial example would be the same atomic counter based on a loop with a CAS operation.

    import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

    public class Counter {
        private static final AtomicIntegerFieldUpdater<Counter> updater =
            AtomicIntegerFieldUpdater.newUpdater(Counter.class, "cnt");

        private volatile int cnt;

        public int add(int delta) {
            int cur;
            do {
                cur = cnt;
            } while (!updater.compareAndSet(this, cur, cur + delta));
            return cur + delta;
        }

        public int get() {
            return cnt;
        }
    }

Obstruction-freedom means that a thread is guaranteed to make progress only if there is no contention from other threads. This guarantee is the weakest one on the list. I'm not aware of a trivial example of this one, but you may refer to this paper.

Is it fair to say that wait-free data structures and algorithms are faster than lock-free ones and that lock-freedom means something better than obstruction-freedom and blocking code? Not really. You may limit yourself to wait-freedom if you want to bound the maximum latency of an individual operation, e.g. if you're building a real-time OS. But in most cases, you should consider all possible algorithms and their combinations. For instance, your data structure may be quite fast while it implements wait-free or lock-free reads and blocking writes based on a striped lock (wink-wink Java's ConcurrentHashMap and xsync's Map/MapOf).

If you want to learn more about multithreaded programming and scalable concurrent data structures, I highly recommend Dmitry Vyukov's old blog. Just go through all posts starting with the intro one.

Have fun coding and see you next time.

Concurrent Map in Go vs Java: Yet Another Meaningless Benchmark

Andrei Pechkurov — Fri, 04 Nov 2022 19:07:26 GMT

Today we're comparing Java's j.u.c.ConcurrentHashMap and Go's xsync.MapOf in a totally non-scientific, unfair benchmark. While most of such language performance comparisons are generally useless and harmful, the purpose of this exercise is a comparison of the algorithms behind both data structures. I'm driven by curiosity here, so don't take this post seriously. The results may be completely different on another HW and in different scenarios, so don't forget that any benchmark has to be taken with a grain of salt.

On the one hand we have ConcurrentHashMap, also known as CHM. It's a brilliant concurrent hash table: no writes to shared memory in read-only operations, hybrid linked list / tree used for buckets depending on the number of stored nodes, volatile value fields in nodes allowing to update the value without having to allocate a new node, a striped counter for the current size, and so on. CHM has been improved across many versions of Java, for many years, so reaching its level of performance isn't an easy task.

On the other hand, MapOf borrows ideas from multiple sources: buckets are organized in unrolled linked lists of cache line size thanks to Cache-Line Hash Table (CLHT), hash codes stored in the buckets to avoid extra hash function calls and pointer chasing, immutable entries to avoid extra synchronization on reads and also reduce GC pressure, also a striped counter for the current size.

Both maps have a bunch of handy atomic operations available to the developer. But enough of this boring stuff, let's do the benchmarking.

Our test stand is nothing more than a laptop with an i7-1185G7 CPU (8 HT cores) and Ubuntu 22.04 running 64-bit builds of Go 1.19.3 and OpenJDK 17.0.4.

The CHM benchmark can be found here, while the Go benchmark is in the xsync repo.

The benchmarks start with a pre-warmed map holding 1,000 64-bit integer key-value pairs. The "99% reads" scenario assumes that each thread spends 99% of its calls on get (Load) operations and 0.5% on both put (Store) and remove (Delete) operations. Each operation is called for a randomly selected key, so there are no hot spots. The "75% reads" scenario is more write-heavy as write operations get 12.5% of total number of operations each.
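To make the workload concrete, here is a rough Go sketch of how such an operation mix can be generated. It is not the code of the linked benchmarks: `opFor` is a hypothetical helper, `sync.Map` is only a stand-in for the map under test, and the read share is expressed in per mille so that 99% reads becomes 990.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

type op int

const (
	opLoad op = iota
	opStore
	opDelete
)

// opFor maps n in [0, 1000) to an operation: the first readPermille
// values are loads, the rest is split evenly between stores and deletes.
func opFor(n, readPermille int) op {
	writes := (1000 - readPermille) / 2
	switch {
	case n < readPermille:
		return opLoad
	case n < readPermille+writes:
		return opStore
	default:
		return opDelete
	}
}

func main() {
	const numKeys = 1000
	var m sync.Map // stand-in for the map under test
	for k := 0; k < numKeys; k++ {
		m.Store(k, k) // pre-warm the map
	}
	r := rand.New(rand.NewSource(1))
	var counts [3]int
	for i := 0; i < 1_000_000; i++ {
		k := r.Intn(numKeys) // uniformly random key, so no hot spots
		o := opFor(r.Intn(1000), 990)
		switch o {
		case opLoad:
			m.Load(k)
		case opStore:
			m.Store(k, i)
		case opDelete:
			m.Delete(k)
		}
		counts[o]++
	}
	fmt.Println(counts) // roughly 99% loads, 0.5% stores, 0.5% deletes
}
```

Passing 750 instead of 990 gives the "75% reads" mix with 12.5% stores and 12.5% deletes.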

The benchmarks were run with different numbers of threads/cores, starting with 1 core and ending with all 8 available cores. That's to see how well both maps scale on multi-core machines. Finally, the below results are based on average values collected after 10 runs of each benchmark.

Let's start with the write-heavy scenario. CHM has an advantage here since it allows updating values in-place, so it might show better results.

As expected, CHM is slightly better on all core counts except for 4 cores. Both maps scale well enough.

Let's see results for the read-heavy scenario which might be more common in many real world applications.

Again, both maps scale well, with a slight advantage for MapOf. I'm pretty happy with the results. There are certainly rough edges, but MapOf has proven its worth.

Have fun coding and see you next time.

Thread-Local State in Go, Huh?

Andrei Pechkurov — Sat, 29 Oct 2022 09:41:00 GMT

We all know that there is no such thing as thread-local state in Go. Yet, there is a trick that would help you to retain the thread identity at least on the hot path. This trick would be helpful if you're trying to implement a striped counter (wink-wink, j.u.c.a.LongAdder from Java), or a BRAVO lock, or any kind of a data structure with striped state.

This brief post is based on the talk I gave a while ago. I assume that you're familiar with the concept of state striping in concurrent counters since it's important for understanding the end application.

First of all, while Golang assigns identifiers to goroutines, it doesn't expose them. That's by design:

Goroutines do not have names; they are just anonymous workers. They expose no unique identifier, name, or data structure to the programmer. Some people are surprised by this, expecting the go statement to return some item that can be used to access and control the goroutine later.
The fundamental reason goroutines are anonymous is so that the full Go language is available when programming concurrent code. By contrast, the usage patterns that develop when threads and goroutines are named can restrict what a library using them can do.

The same applies to the worker threads used by the Golang scheduler to run your goroutines. But the whole idea of a striped counter depends on being able to identify the current thread, so that subsequent calls are (most of the time) run on the same CPU core and, hence, avoid contention.

There are two straightforward approaches to the problem:

CPUID x86 instruction - not portable to other architectures and also requires some assembly or FFI calls to be used.
gettid(2) Linux-only system call - not portable to other OSes.

Both of these options are non-versatile and, ideally, we want to have a Go-native and cross-platform solution. Luckily there is one.

I'm talking about sync.Pool. If you're familiar with its source code, you already know that it uses thread-local pools under the hood. If we allocate a struct and place it in the pool, the next time we request one on the same thread (but not necessarily the same goroutine) we should get the same struct.
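The behavior is easy to observe in isolation. The sketch below uses hypothetical names (`token`, `acquire`, `release`) and relies on nothing but the public `sync.Pool` API. Keep in mind that the pool is free to drop items at any time (on GC, for instance), so getting the same object back is likely, never guaranteed.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// token is a piece of (not-so) thread-local state.
type token struct{ id int64 }

var (
	pool   sync.Pool
	nextID atomic.Int64
)

// acquire returns a pooled token if one is available on the current P,
// otherwise it allocates a fresh one.
func acquire() *token {
	if tok, ok := pool.Get().(*token); ok {
		return tok
	}
	return &token{id: nextID.Add(1)}
}

// release hands the token back, so that the next caller running
// on the same P is likely to get it.
func release(tok *token) { pool.Put(tok) }

func main() {
	first := acquire()
	release(first)
	second := acquire()
	// Usually true when nothing else ran in between, but it's only
	// a best-effort property of the pool.
	fmt.Println(first == second)
}
```

This is why any code built on the trick has to treat the token as a hint and fall back to allocation when the pool comes up empty.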

Let's take a look at a fragment of the xsync.Counter's code:

    // pool for P tokens
    var ptokenPool sync.Pool

    // ptoken is used to point at the current OS thread (P)
    // on which the goroutine is run; exact identity of the thread,
    // as well as P migration tolerance, is not important since
    // it's used as a best effort mechanism for assigning
    // concurrent operations (goroutines) to different stripes of
    // the counter.
    type ptoken struct {
        idx uint32
    }

    // Counter is a striped int64 counter.
    type Counter struct {
        stripes []cstripe
        mask    uint32
    }

    type cstripe struct {
        c int64
        // The padding prevents false sharing.
        pad [cacheLineSize - 8]byte
    }

    func NewCounter() *Counter {
        // Consider the number of CPU cores
        // when deciding on the number of stripes.
        nstripes := nextPowOf2(parallelism())
        c := Counter{
            stripes: make([]cstripe, nstripes),
            mask:    nstripes - 1,
        }
        return &c
    }

    // Value returns the current counter value.
    func (c *Counter) Value() int64 {
        v := int64(0)
        for i := 0; i < len(c.stripes); i++ {
            stripe := &c.stripes[i]
            v += atomic.LoadInt64(&stripe.c)
        }
        return v
    }

    // Add adds the delta to the counter.
    func (c *Counter) Add(delta int64) {
        // Pick up a token from the pool. If Add was called recently
        // on the same thread, we'll probably get the same ptoken.
        t, ok := ptokenPool.Get().(*ptoken)
        if !ok {
            // Allocate a new token and pick up a random stripe index.
            t = new(ptoken)
            t.idx = fastrand() & c.mask
        }
        for {
            stripe := &c.stripes[t.idx]
            cnt := atomic.LoadInt64(&stripe.c)
            if atomic.CompareAndSwapInt64(&stripe.c, cnt, cnt+delta) {
                // We were able to update the stripe, so all done.
                break
            }
            // CAS failed, so there is some contention over the stripe.
            // Give a try with another randomly selected stripe.
            t.idx = fastrand() & c.mask
        }
        // Return ptoken back to the pool, so that another goroutine
        // running on the same thread can use it.
        ptokenPool.Put(t)
    }

Here, in the Add method, we're using the ptoken structs to hold (not-so) thread-local state. Once we obtain a ptoken, we try to change the corresponding stripe in a CAS-based loop. This allows the goroutines to self-organize: they detect contention via a failed CAS operation and then change the stripe. The goal is to avoid contention due to unlucky thread-to-stripe distribution.

You may ask if piggybacking on a sync.Pool's implementation detail is worth the hassle. My answer would be "no, unless you really know what you're doing". Say, single-threaded performance of a primitive atomic int64 would be better. There is also an overhead in the Value() method since it needs to read values from all stripes. So, this trick is certainly from the "don't try this at home" category. But if you aim for scalability of your write operations, it's certainly worth it:

    $ go test -benchmem -run=^$ -bench "Counter|Atomic"
    goos: linux
    goarch: amd64
    pkg: github.com/puzpuzpuz/xsync/v2
    cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
    BenchmarkCounter-8           409773502             2.909 ns/op           0 B/op           0 allocs/op
    BenchmarkAtomicInt64-8       92472007            14.09 ns/op           0 B/op           0 allocs/op
    PASS
    ok      github.com/puzpuzpuz/xsync/v2    3.024s

If you run the same benchmark on a machine with more cores, the int64's result would only get worse due to contention.

Both Map and MapOf, the concurrent hash maps from the xsync library, use a variation of a striped counter internally to track the current map size. Naturally, they do a counter increment or decrement on each write operation, but read the counter value only rarely, when a resize happens.

One more application of this trick is RBMutex, a reader-biased reader/writer mutual exclusion lock that implements the BRAVO algorithm. I'm leaving the internals of this one to the curious reader.

As promised, the post is a short one, so that's it for today. Have fun coding and see you next time.

So long, sync.Map

Andrei Pechkurov — Sat, 22 Oct 2022 17:50:36 GMT

While the title is certainly a clickbait, I definitely don't see any strong reason to keep dealing with sync.Map if you're a Go generics user. Instead, you should consider xsync.MapOf:

    type point struct {
        x int
        y int
    }

    // create a MapOf with keys and values of point type
    m := NewTypedMapOf[point, point](func(p point) uint64 {
        // hash function to be used by the map
        return uint64(31*p.x + p.y)
    })

    // load the existing value or compute it lazily, if it's absent
    v, loaded := m.LoadOrCompute(point{42, 42}, func() point {
        return point{0, 0}
    })

The above wasn't possible until xsync v1.5.0. That's because the generic version of the concurrent map available in the library supported only string keys. But now you can use any comparable type as a key, so xsync.MapOf became a real, more scalable alternative to the good old sync.Map. Today we're going to discuss the recent changes in the data structure that allowed using arbitrary key types and also do some (as always, non-scientific) microbenchmarking.

The challenges

The original intent behind xsync library was to provide faster alternatives for the built-in concurrent data structures available in the Golang standard library. Most of the alternatives, like RBMutex or MPMCQueue, aren't suitable for general purpose, i.e. they're tailored for niche use cases.

But xsync.Map is different as it was aimed to replace sync.Map in the same scenarios and beyond. It's based on a (noticeably) modified version of Cache-Line Hash Table (CLHT). Read-only operations, such as Load or Range, are obstruction-free while write operations and rehashing use fine-grained locking (lock sharding). Refer to this blog post to learn more on the algorithm. The only significant limitation was that keys were limited to the string type.

The original version of the map was non-generic, so you had to deal with unpleasant interface{} rituals in your code. When Go got a stable version of generics, Viacheslav Poturaev contributed a generics-friendly version of the map, xsync.MapOf. This was a nice step forward, but it still supported string keys only.

A few months ago, the library repo received a pull request from Rob Mason. The goal was to allow arbitrary comparable types for keys while maintaining backwards compatibility. Since Golang doesn't expose the built-in hash functions except for what's in the hash/maphash package, the hash function has to be provided by the user, which isn't a big deal. The only problem was the layout of the data structure.

Each bucket in the underlying hash table had the following structure:

Here we have 128 bytes (two cache lines on most modern CPUs) holding a sync.Mutex, 7 key/value pairs, and a uint64 with cached hash codes for the keys present in the bucket. The mutex is used while writing to the bucket. The key/value pairs hold pointers to the map entries. Finally, the hash code cache holds the most significant byte of each key's hash code.

So, each bucket was able to hold up to 7 key/value pairs and, if the bucket was full on an insert, a rehash had to be performed. Such a design is sufficient if you have a high-quality hash function, such as the built-in one for strings. But if the function is user-provided, the chances of having 7+ perfect hash code collisions in the same map increase dramatically.

To fix that, mutex and hashes fields were merged into a single uint64-based data structure with the following layout:

| key 0's top hash | ... | key 6's top hash | bitmap for keys |
|      1 byte      | ... |      1 byte      |     1 byte      |

Here, we have the 8 most significant bits (MSBs) of the hash code for each key stored in the bucket. They're used to avoid many expensive hash code computations on each lookup. Before calculating the hash code for a key/value pair, we compare the MSBs with the MSBs of the hash code of the user-provided key and, if they don't match, we move on to the next pair in the bucket. Next, the least significant bit in the "bitmap for keys" byte is used to implement a mutex (a TTAS spinlock, to be more precise).
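To make the packing more tangible, here is a minimal Java sketch of such a word. It's not the xsync code: the class and method names are hypothetical, and the exact bit positions (seven one-byte top hashes in the upper bytes, occupancy flags in bits 1-7 of the low byte, the lock flag in bit 0) are an assumption made for illustration.

```java
final class TopHashWord {
    // Assumed layout (for illustration only):
    // bits 63..8 hold seven 1-byte top hashes, slot 0 in the highest byte;
    // bits 7..1 mark occupied slots; bit 0 is the lock flag.
    static final long LOCK_BIT = 1L;

    // Top hash = the 8 most significant bits of a 64-bit hash code.
    static long topHash(long hash) {
        return hash >>> 56;
    }

    private static int shiftFor(int slot) {
        return 56 - 8 * slot; // slot must be in 0..6
    }

    // Returns a copy of the word with the slot's top hash stored
    // and the slot marked as occupied.
    static long put(long word, int slot, long hash) {
        int shift = shiftFor(slot);
        long cleared = word & ~(0xFFL << shift);
        return cleared | (topHash(hash) << shift) | (1L << (slot + 1));
    }

    // Cheap pre-check: false means the slot cannot hold the key,
    // true means the full key comparison is still required.
    static boolean mayContain(long word, int slot, long hash) {
        boolean occupied = (word & (1L << (slot + 1))) != 0;
        long stored = (word >>> shiftFor(slot)) & 0xFFL;
        return occupied && stored == topHash(hash);
    }

    static boolean isLocked(long word) {
        return (word & LOCK_BIT) != 0;
    }
}
```

A lookup would call mayContain() for each slot and fall back to the real key comparison only on a match, so most non-matching entries are skipped without touching the entry pointers.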

Merging the mutex and the hash code MSBs together saved a whole 8 bytes of memory in each bucket. The saved bytes hold a pointer to the next bucket in the chain, so, as in the very first xsync version, hash table buckets are now organized in unrolled linked lists. Hence, in the face of perfect hash code collisions, the linked list simply grows until the map load factor is met.

Needless to say, once all of the above was done, supporting any comparable key type finally became possible.

At this point you may be asking yourself whether using a non-standard map is worth it. Let's see.

The promised benchmarks

The following results were obtained on i7-1185G7, Ubuntu 22.04 x86-64 and Go 1.19.2. The benchmark source code itself can be found in the xsync repo.

We start with a benchmark that uses a pre-warmed map with 1,000 key/value pairs. The keys are strings while the values are ints. The benchmark runs 8 goroutines to load all 8 HT cores available on the machine. Each goroutine selects a key randomly and executes either a Load, Store or Delete operation on it.

Here, the x axis stands for the percentage of Load operations while the two remaining operations have equal chances to be called, e.g. for the 90% value we have 90% of Load calls, 5% of Stores and 5% of Deletes. The y axis stands for the average time in nanoseconds spent on an operation.

Things get even more interesting with 1M key/value pairs.

As usual, take the above results with a grain of salt. There might be some scenarios when the standard map is faster, so make sure to test things close to what your application does. Nevertheless, due to its design, the standard map is quite limited in terms of scalability in the presence of even a small fraction of concurrent writes. After all, a data structure guarded with a single lock will scale worse than the one with sharded locks. When it comes to reads, which are the strongest point of sync.Map, both data structures are on par.

Lessons learned

Hopefully, this post has convinced you to try xsync.MapOf in action and maybe to contribute to xsync. This story is more evidence of the power of open source communities. Without so many contributors, xsync.Map would still be limited in its capabilities. I'm pretty sure that it's not the end of the story and the library will continue evolving in the future. Have fun coding and see you next time.

Testing Concurrent Code for Fun and Profit

Andrei Pechkurov — Sun, 09 Oct 2022 10:03:05 GMT

Everyone knows that multi-threaded code is not a piece of cake. There are lots of publications on how to write concurrent code properly and also lots of well-known algorithms and data structures to choose from. Yet, authors often ignore another important topic and pretend as if it's not worth the discussion. The topic is how to test your concurrent code and that's what we're going to consider today. I'm by no means an expert on this matter (or any other matter), so everything below is an attempt to share an opinionated approach that appears to work well for me and helps to find the vast majority of concurrency bugs.

To narrow down the topic, we're going to use Java and try to cover the SPSC queue we built recently with a minimal set of tests. While the observed tests are minimal, you should be able to write tests for your own concurrent code after reading this blog post. As for the programming language, everything we talk about should be applicable to any language with threads or green threads, so C/C++, Rust, Golang, Zig and many others apply.

The interface of our queue is very simple and consists of two methods:

public class SpscBoundedQueue<E> {
    // Some boring stuff, like fields and constructor.

    /**
     * Publishes an item to the tail of the queue, if it's not full.
     */
    public boolean offer(E e) { /* some code goes here */ }

    /**
     * Removes and returns the head of the queue, if it's not empty.
     */
    public E poll() { /* some code goes here */ }
}

So, where to start when testing it?

Where to start?

The best thing to do as the first step is to write good old single-threaded tests. Many aspects of your code can (and should) be covered with such tests. Those are input validation, boundary checks, basic invariants of your data structure, methods that aren't thread-safe and, hence, will always be called from a single thread - all of these should be covered with "cheap" (in terms of the execution time) tests. Keep in mind that the aforementioned list is not complete. You should always try to cover as much as possible with single-threaded tests.

In our case, we should cover the following things:

Input validation - our original code was minimal, so it lacked things like positive size validation in the constructor. Such validation is a perfect candidate for a single-threaded test.
Boundary checks - our queue has a limited size, so we expect offer() to return false when the queue is full, as well as poll() to return a null when the queue is empty.
Basic invariants - we should test the "First in, first out" (FIFO) property of our data structure.
Non thread-safe methods - again, we omitted many other methods that are handy, such as the clear() method. Due to the queue design, those methods have to be called from a single thread in the absence of any other queue mutation calls. Single-threaded tests to the rescue.

Here is a test that illustrates items 2 and 3 from the above list:

@Test
public void testSerial() {
    SpscBoundedQueue<Integer> queue = new SpscBoundedQueue<>(10);
    // An empty queue returns null.
    Assert.assertNull(queue.poll());
    for (int i = 0; i < 10; i++) {
        Assert.assertTrue(queue.offer(i));
    }
    // A full queue rejects new items.
    Assert.assertFalse(queue.offer(42));
    // Items come out in the insertion order (FIFO).
    for (int i = 0; i < 10; i++) {
        Assert.assertEquals((Integer) i, queue.poll());
    }
    Assert.assertNull(queue.poll());
}

It's a bit dense and could be split into multiple, more focused tests, but it's not a big deal considering that it's an illustration of the concept. The test verifies both boundary checks, as well as the FIFO property.

Enough silly serial tests, let's run things in parallel!

How to break things?

Our main goal is to find any thread-safety violations. But what does it mean in practice? Such violations may be very infrequent and hard to reproduce. Sometimes race conditions, data races and other unpleasant things may even remain unnoticed until you hit a certain edge case. The sad truth is that testing concurrent code is hard and you can never be sure that your test suite is good enough. That only means that you should do your best at writing concurrent tests to eliminate most, if not all, thread-safety bugs.

Each concurrent test scenario has to be thought through separately. It should reproduce a use case and involve a set of related methods of your data structure(s). In our example, things are simple: we need to test the offer() and poll() methods running on two different threads (remember, we deal with a Single Producer Single Consumer queue). But is it enough to call these methods like crazy from separate threads? Not really. Just like with single-threaded tests, we have to think of the invariants we have in the thread-safe part of the code.

To keep things practical, let's start with the skeleton of the test:

@Test
public void testHammer() throws InterruptedException {
    final int iterations = 1_000_000;
    // Prepare the data structure.
    SpscBoundedQueue<Integer> queue = new SpscBoundedQueue<>(10);
    // Prepare helper data structures (test infra).
    CyclicBarrier barrier = new CyclicBarrier(2);
    CountDownLatch latch = new CountDownLatch(2);
    AtomicInteger anomalies = new AtomicInteger();
    // Prepare and start the threads.
    ConsumerThread consumer = new ConsumerThread(queue, barrier, latch, anomalies, iterations);
    consumer.start();
    ProducerThread producer = new ProducerThread(queue, barrier, latch, anomalies, iterations);
    producer.start();
    // Wait for the threads to finish.
    latch.await();
    // Verify that there were no thread-safety violations.
    Assert.assertEquals(0, anomalies.get());
}

The above code is quite straightforward. The test runs two threads and involves a queue, as well as a number of helper synchronization primitives. Once the threads are done, it checks the anomalies counter to verify that there were no thread-safety violations. As the test name suggests, it's a hammer style test, i.e. it aims to "bash" the queue from multiple threads until it breaks (or not). Such tests are sometimes called stress tests for concurrent code.

Now, let's see what our threads actually do. We start with the producer thread:

private static class ProducerThread extends Thread {
    // Boring stuff such as fields and constructor goes here...

    @Override
    public void run() {
        try {
            // Await for the consumer thread, so we start simultaneously.
            barrier.await();
            // Start publishing incrementing numbers to the queue.
            for (int i = 0; i < iterations; i++) {
                while (!queue.offer(i)) {
                    // Yes, we want to busy spin.
                }
            }
        } catch (Exception e) {
            // Any exception we get when producing is an anomaly.
            e.printStackTrace();
            anomalies.incrementAndGet();
        } finally {
            // Notify the main thread that we're done.
            latch.countDown();
        }
    }
}

Producer's code is simple and illustrative. Notice that we're publishing incrementing numbers to the queue. As we're going to see in the consumer's code, that's to be able to verify our main invariant - the FIFO property. One more important thing here is that we don't have any kind of back-off calls in the while loop. Instead, we prefer to busy spin. That's because calls like Thread#sleep() or LockSupport.parkNanos() or anything similar involve synchronization that might fix your otherwise broken code. Also, if you need to emulate some local work as the back-off or on successful operation, prefer using Blackhole#consumeCPU() from JMH or similar methods of your choice. Finally, due to the same consideration, it's definitely a bad idea to call System.out.println() or log anything in the main loop.

The consumer thread's code is also pretty simple:

private static class ConsumerThread extends Thread {
    // Boring stuff such as fields and constructor goes here...

    @Override
    public void run() {
        try {
            barrier.await();
            // Consume all items from the queue.
            int prev = -1;
            while (prev != iterations - 1) {
                Integer element = queue.poll();
                if (element == null) {
                    // Again, we busy spin.
                    continue;
                }
                // Check that we received the incremented number.
                if (element != prev + 1) {
                    anomalies.incrementAndGet();
                }
                prev = element;
            }
            // We expect the queue to be empty now.
            if (queue.poll() != null) {
                anomalies.incrementAndGet();
            }
        } catch (Exception e) {
            e.printStackTrace();
            anomalies.incrementAndGet();
        } finally {
            latch.countDown();
        }
    }
}

The above code completes the picture: our test aims to verify the FIFO property of the queue and nothing more than that.

Variations of the concurrent tests are important. If we were testing an MPMC queue, it would be a good idea to have multiple tests with different numbers of producer and consumer threads: single producer - single consumer, single producer - multiple consumers, multiple producers - single consumer, multiple producers - multiple consumers. If we have some kind of local work emulation in the tests, it would be nice to test it with different CPU time too. The same applies to data structure capacity and any other things that may affect the flow of your code. Thread-safety violations are a question of unlucky (or lucky, if you want to find bugs) ordering and visibility, so the more variations of the scenario you run, the higher the chances of finding a violation.
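As a rough sketch of such variations, the scenario below is parameterized by the number of producer and consumer threads. It uses j.u.c.ConcurrentLinkedQueue purely as a stand-in for the queue under test, and the MpmcHammer class with its run() method are hypothetical names, not something from the test suite discussed here. The consumers read a "producers are done" flag only after an empty poll, which keeps extra synchronization away from the successful path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class MpmcHammer {
    // Runs one scenario and returns the number of detected anomalies.
    static int run(int producers, int consumers, int perProducer) {
        // Stand-in for the queue under test (an assumption of this sketch).
        Queue<int[]> queue = new ConcurrentLinkedQueue<>();
        CyclicBarrier barrier = new CyclicBarrier(producers + consumers);
        AtomicBoolean producersDone = new AtomicBoolean();
        AtomicInteger anomalies = new AtomicInteger();
        AtomicInteger consumed = new AtomicInteger();
        List<Thread> producerThreads = new ArrayList<>();
        List<Thread> consumerThreads = new ArrayList<>();

        for (int p = 0; p < producers; p++) {
            final int id = p;
            producerThreads.add(new Thread(() -> {
                try {
                    barrier.await();
                    // Each item carries its producer id and a sequence number.
                    for (int seq = 0; seq < perProducer; seq++) {
                        while (!queue.offer(new int[]{id, seq})) {
                            // Busy spin, no back-off.
                        }
                    }
                } catch (Exception e) {
                    anomalies.incrementAndGet();
                }
            }));
        }
        for (int c = 0; c < consumers; c++) {
            consumerThreads.add(new Thread(() -> {
                int[] lastSeq = new int[producers];
                Arrays.fill(lastSeq, -1);
                int count = 0;
                boolean doneSeen = false;
                try {
                    barrier.await();
                    for (;;) {
                        int[] item = queue.poll();
                        if (item == null) {
                            // Empty after all producers finished: nothing more will arrive.
                            if (doneSeen) {
                                break;
                            }
                            doneSeen = producersDone.get();
                            continue;
                        }
                        // Invariant: a consumer sees each producer's items in increasing order.
                        if (item[1] <= lastSeq[item[0]]) {
                            anomalies.incrementAndGet();
                        }
                        lastSeq[item[0]] = item[1];
                        count++;
                    }
                } catch (Exception e) {
                    anomalies.incrementAndGet();
                }
                consumed.addAndGet(count);
            }));
        }
        try {
            for (Thread t : producerThreads) t.start();
            for (Thread t : consumerThreads) t.start();
            for (Thread t : producerThreads) t.join();
            producersDone.set(true);
            for (Thread t : consumerThreads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
        // Invariant: the number of consumed items matches the number of produced ones.
        if (consumed.get() != producers * perProducer) {
            anomalies.incrementAndGet();
        }
        return anomalies.get();
    }
}
```

A test suite could then call run(1, 1, n), run(1, 4, n), run(4, 1, n) and run(4, 4, n) as four separate tests and assert that each one returns zero.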

The complete test source code may be found here. If you're proficient in Golang and fancy seeing a more complex application of the above principles, see the tests of the xsync library. The library consists of a number of concurrent data structures that are certainly more complex than our SPSC queue.

How to run the tests?

Before we wrap up, let's discuss a few tips to squeeze everything from your multi-threaded tests.

First of all, it is a good habit to run the newly written concurrent tests on your dev machine for a few minutes. This might show failures early, without involving many CI runs.

Next, if your code runs on different CPU architectures, make sure to run tests on those. For instance, ARM CPUs have a weaker hardware memory model when compared with x86 ones.

Some language ecosystems have race detector tools, like ThreadSanitizer or Golang's Data Race Detector. If applicable, make sure to configure your CI to run the tests with the race detector enabled. It's also worth mentioning the jcstress and Lincheck frameworks available in the JVM ecosystem. Unlike the aforementioned race detectors, these frameworks require writing dedicated tests, so, in case of concurrent data structure testing, they can be seen as an alternative to the hand-written tests we're discussing today.

Finally, if some of your concurrent tests appear to be flaky, i.e. infrequently fail due to an unknown reason, that may be an indication of an actual bug. Make sure to do your best to reproduce the failure, analyze it and fix the cause.

Let's recap?

Writing thread-safe concurrent code is hard. Writing sufficient tests for such code may be even harder. Here is the summary of what we discussed today:

Always try to cover as much as possible with single-threaded tests.
Write your concurrent tests to stress your code and verify a set of invariants.
Avoid calls that involve additional synchronization in the main loops of your tests.
Variations of the concurrent tests are important.
Test on various CPU architectures (wink-wink ARM).
If applicable, configure your CI server to run the concurrent tests with a race detector.
Flaky tests are your friends. Always do your best to reproduce and analyze them.

I hope you've learned something new today. Good luck with your concurrent tests and see you next time.

Fast and Simple SPSC Queue

Andrei Pechkurov — Sat, 01 Oct 2022 18:44:22 GMT

Single producer single consumer (SPSC) queues form the simplest type of concurrent queues. We have a single thread producing the items, as well as a single thread consuming them concurrently - what can be simpler than that? Nevertheless, such queues may be met in complex software projects, such as the Linux kernel. Use cases include sending network packets between NICs and OS drivers and receiving I/O completion events in io_uring, the newest asynchronous I/O API available in Linux. An SPSC queue may be unbounded, meaning that the total number of items that can be pushed into the queue is unlimited, or bounded, which in practice means that it's built on top of a ring buffer. Today, we're discussing a bounded SPSC queue implemented in Java. The beauty of this data structure is its simplicity combined with a good level of performance on modern hardware.

We start with the skeleton of the data structure, i.e. its fields and interface:

public class SpscBoundedQueue<E> {
    private final Object[] data;
    private final PaddedAtomicInteger producerIdx = new PaddedAtomicInteger();
    private final PaddedAtomicInteger producerCachedIdx = new PaddedAtomicInteger();
    private final PaddedAtomicInteger consumerIdx = new PaddedAtomicInteger();
    private final PaddedAtomicInteger consumerCachedIdx = new PaddedAtomicInteger();

    public SpscBoundedQueue(int size) {
        this.data = new Object[size + 1];
    }

    public boolean offer(E e) {
        // The code will follow...
    }

    public E poll() {
        // The code will follow...
    }

    static class PaddedAtomicInteger extends AtomicInteger {
        @SuppressWarnings("unused")
        private int i1, i2, i3, i4, i5, i6, i7, i8,
                i9, i10, i11, i12, i13, i14, i15;
    }
}

Here, we have an array of queue elements plus a number of index fields where consumer and producer each get a pair of PaddedAtomicInteger. The PaddedAtomicInteger is basically the standard j.u.c.a.AtomicInteger class with some padding added to prevent false sharing. Alternatively, we could keep the memory layout flat with all indexes declared as primitive fields right in the SpscBoundedQueue class, but this would make the code much less readable.

You may also notice that only offer() and poll() methods are implemented. Again, that's to keep the code compact and readable. Adding other useful methods, like the batch flavor ones, is simple enough and left as an exercise for curious readers.

The array of queue items is used as a ring buffer of arbitrary size, i.e. there is no power of two restriction for the size like in some ring buffer implementations. The producerIdx and consumerIdx fields are used to synchronize producer's and consumer's accesses to the array. Both producer and consumer check each other's index to understand if they can insert or read the next item and, if the check succeeds, perform the action and update their own index. Two other fields are used to cache the index seen during the latest check. We'll discuss why such caching improves the end performance in a moment.

Let's see how it all works for the producer:

public boolean offer(E e) {
    // Read producer's own index.
    final int idx = producerIdx.getOpaque();
    int nextIdx = idx + 1;
    if (nextIdx == data.length) {
        nextIdx = 0;
    }
    // Read the last seen consumer's index.
    int cachedIdx = consumerCachedIdx.getPlain();
    if (nextIdx == cachedIdx) {
        // If we have reached the known index, we need to read the current value.
        cachedIdx = consumerIdx.getAcquire();
        // Make sure to update the cached value.
        consumerCachedIdx.setPlain(cachedIdx);
        if (nextIdx == cachedIdx) {
            // The queue is full.
            return false;
        }
    }
    // There is an empty slot, so we can insert the item.
    data[idx] = e;
    // Make sure to update our own index.
    // We use release semantics while the consumer has an acquire edge.
    producerIdx.setRelease(nextIdx);
    return true;
}

The above code uses acquire/release semantics to keep the emitted instructions as lightweight as possible from the memory barriers perspective. Other than that, the code does pretty much what we discussed before.

As it was already mentioned, the manipulations with the consumerCachedIdx field are important for the end performance. All reads and writes on this field are thread-local, i.e. only the producer thread accesses this field, so we don't need to use costly atomic operations. This reduces cache coherency traffic dramatically and lets the CPU core work on its own non-shared data in those cases when multiple empty slots are available in the queue.

Consumer's part of the picture may be seen in the full source code available here.
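For a quick impression of the consumer side, here is a minimal sketch of what poll() may look like when written as a mirror of offer(). It's a reconstruction under assumptions rather than the code from the repository: the class name SpscQueueSketch is hypothetical, plain AtomicInteger is used instead of the padded version to keep things short, and nulling out the consumed slot is a choice made for this sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SpscQueueSketch<E> {
    private final Object[] data;
    // Unpadded for brevity; the padded variant avoids false sharing.
    private final AtomicInteger producerIdx = new AtomicInteger();
    private final AtomicInteger producerCachedIdx = new AtomicInteger();
    private final AtomicInteger consumerIdx = new AtomicInteger();
    private final AtomicInteger consumerCachedIdx = new AtomicInteger();

    SpscQueueSketch(int size) {
        // One extra slot lets us tell a full queue from an empty one.
        this.data = new Object[size + 1];
    }

    boolean offer(E e) {
        final int idx = producerIdx.getOpaque();
        int nextIdx = idx + 1;
        if (nextIdx == data.length) {
            nextIdx = 0;
        }
        int cachedIdx = consumerCachedIdx.getPlain();
        if (nextIdx == cachedIdx) {
            cachedIdx = consumerIdx.getAcquire();
            consumerCachedIdx.setPlain(cachedIdx);
            if (nextIdx == cachedIdx) {
                return false; // The queue is full.
            }
        }
        data[idx] = e;
        producerIdx.setRelease(nextIdx);
        return true;
    }

    @SuppressWarnings("unchecked")
    E poll() {
        // Read consumer's own index.
        final int idx = consumerIdx.getOpaque();
        // Read the last seen producer's index (touched by the consumer only).
        int cachedIdx = producerCachedIdx.getPlain();
        if (idx == cachedIdx) {
            // We have caught up with the known index, so re-read the current value.
            cachedIdx = producerIdx.getAcquire();
            producerCachedIdx.setPlain(cachedIdx);
            if (idx == cachedIdx) {
                return null; // The queue is empty.
            }
        }
        // The acquire read above pairs with the producer's release write,
        // so the item stored before the index update is visible here.
        E element = (E) data[idx];
        // Drop the reference so that the item can be garbage collected.
        data[idx] = null;
        int nextIdx = idx + 1;
        if (nextIdx == data.length) {
            nextIdx = 0;
        }
        // Hand the slot back to the producer.
        consumerIdx.setRelease(nextIdx);
        return element;
    }
}
```

Note the symmetry: the queue is empty when the consumer's index equals the producer's one, and full when the producer's next index would hit the consumer's one.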

Finally, we're going to compare our queue with the good old j.u.c.ArrayBlockingQueue and a couple of SPSC queue implementations from the JCTools library. If you're not familiar with JCTools and have never used it, I advise you to put it on your radar.

The benchmark we'll be running is available here. When run, it starts a couple of threads to play a ping-pong game. Each operation, a.k.a. a ping-pong round, assumes sending/receiving a single item over the SPSC queue combined with a bit of work done for each successful attempt.

Here is a reduced JMH benchmark output on my laptop running Ubuntu 20.04 and OpenJDK 17.0.4 64-bit:

Benchmark                               (type)   Mode  Cnt          Score          Error  Units
SpscQueueBenchmark.group            SPSC_QUEUE  thrpt    3  107503612.612 ± 16230253.288  ops/s
SpscQueueBenchmark.group  ARRAY_BLOCKING_QUEUE  thrpt    3    7158948.722 ±  8635350.468  ops/s
SpscQueueBenchmark.group         JCTOOLS_QUEUE  thrpt    3  120533694.168 ±  4686758.722  ops/s
SpscQueueBenchmark.group  JCTOOLS_ATOMIC_QUEUE  thrpt    3  101704017.278 ± 18252611.281  ops/s

As expected, JCTools' queues and our own one are significantly faster than the ArrayBlockingQueue. Also, surprisingly, our SPSC queue keeps on par with the JCTools' queues, which is not something I was expecting, to be honest. Does it mean that you should go for an in-house implementation instead of JCTools? Not really. If you can afford 3rd-party dependencies, go for JCTools. JCTools' data structures are certainly more efficient, as well as much better tested and benchmarked than our toy queue. So, you'd have to spend quite some time reaching the same level of stability for a DIY queue.

Needless to say, this algorithm is not something new. You may see it in this great blog post by Erik Rigtorp, as well as recognize it in the SPSequence and SCSequence classes in QuestDB's source code. Yet, I hope that this data structure will be a nice addition to your engineering toolkit. See you next time.

Using Acquire/Release Semantics in Java Atomics for Fun and Profit

Andrei Pechkurov — Fri, 21 Jan 2022 15:52:26 GMT

In case you've missed it, recent JDK versions include new memory semantics for atomic operations available in VarHandle and Atomic* classes. These new semantics are equivalent to C/C++'s std::memory_order. The only confusing naming convention is that *Opaque methods in Java map to the memory_order_relaxed memory order in C/C++. Other than that, the idea is the same - these semantics allow developers to use a weaker memory model than the default sequential consistency model, i.e. full memory barrier which is used in the old atomic methods. This can potentially improve the performance at the cost of more complex and, thus, less maintainable code.
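As a quick illustration of what acquire/release buys you, here is a minimal message-passing sketch. The ReleaseAcquireDemo class and its members are hypothetical names made up for this example. The writer stores a plain field and then publishes a flag with setRelease, while the reader spins on getAcquire and only then reads the plain field.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class ReleaseAcquireDemo {
    private int payload; // plain field, published through the flag below
    private int ready;   // accessed only through the READY var handle

    private static final VarHandle READY;

    static {
        try {
            READY = MethodHandles.lookup()
                    .findVarHandle(ReleaseAcquireDemo.class, "ready", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void publish(int value) {
        payload = value;
        // Release: the plain write above cannot be reordered after this store.
        READY.setRelease(this, 1);
    }

    int awaitPayload() {
        // Acquire: once we observe 1, the payload write is visible too.
        while ((int) READY.getAcquire(this) == 0) {
            Thread.onSpinWait();
        }
        return payload;
    }

    // Publishes 42 from another thread and returns what the caller observes.
    static int run() {
        ReleaseAcquireDemo demo = new ReleaseAcquireDemo();
        Thread writer = new Thread(() -> demo.publish(42));
        writer.start();
        return demo.awaitPayload();
    }
}
```

With plain reads and writes of the flag there would be no such guarantee: the reader could spin forever or observe the flag without the payload.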

Anyhow, we're not going to go through the basics of memory semantics. If you're not familiar with them, I'd recommend watching this talk by Fedor Pikus where he does a great job at explaining C++'s std::atomic. As usual, today we'll be doing a weird and questionable experiment. We'll use acquire/release semantics to build a lossy (a.k.a. not-so-atomic) counter on top of j.u.c.a.AtomicLong.

Imagine that you need a rough order of magnitude counter in your application. Say, you want to measure the total number of operations performed mostly on a single thread, and in the case of concurrent execution, you're fine with losing some of the concurrent updates as long as the counter is incremented by at least one of the threads. Apart from the good old AtomicLong#addAndGet() method which would keep the counter truly atomic at the cost of a performance penalty under contention, there are some other well-known ways to achieve what we want here. To name a few, one way is the j.u.c.a.LongAdder class which implements a sharded atomic counter. Its downsides are the higher read cost and the memory footprint. Another approach might be to accumulate the number of operations in a thread-local counter and periodically flush them via the AtomicLong#addAndGet() call. That's certainly a viable way to build an eventually consistent atomic counter, but today we'll consider a simpler approach that comes at the cost of losing some concurrent increments.
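To make the thread-local alternative more concrete, here is a minimal sketch of it. The BatchedCounter class, its method names and the flush threshold of 1024 are assumptions made up for this example.

```java
import java.util.concurrent.atomic.AtomicLong;

final class BatchedCounter {
    // Hypothetical batch size: larger values mean fewer atomic operations,
    // but a staler shared value.
    private static final int FLUSH_THRESHOLD = 1024;

    private final AtomicLong total = new AtomicLong();
    // A one-element array serves as a cheap mutable per-thread cell.
    private final ThreadLocal<long[]> pending = ThreadLocal.withInitial(() -> new long[1]);

    void increment() {
        long[] local = pending.get();
        if (++local[0] >= FLUSH_THRESHOLD) {
            // Pay for the atomic operation once per batch.
            total.addAndGet(local[0]);
            local[0] = 0;
        }
    }

    // Flushes the increments accumulated by the calling thread.
    void flush() {
        long[] local = pending.get();
        total.addAndGet(local[0]);
        local[0] = 0;
    }

    // Returns the flushed total; pending increments of any thread are not included.
    long get() {
        return total.get();
    }
}
```

Unlike the lossy counter, this one never loses increments, but readers may lag behind by up to one batch per thread until flush() is called.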

You could say that the above example sounds artificial and you would be not far away from being absolutely correct. Nevertheless, the use case is good enough for today's experiment.

So, if we use acquire/release operations to build a lossy counter, we should get something like the following:

public class LossyCounter extends AtomicLong {
    public long addAndGetLossy(long delta) {
        long value = getAcquire();
        long newValue = value + delta;
        setRelease(newValue);
        return newValue;
    }
}

We're going to benchmark this counter against other approaches, including atomic increments. While this wouldn't be an apples-to-apples comparison in terms of the counter operation guarantees, our goal is to get some understanding of the performance implications for different semantics and types of atomic operations.

The JMH benchmark we're going to use may be found here. Our test stand is a laptop with i7-1185G7 x86-64 CPU with 4/8 cores running Ubuntu 20.04 and OpenJDK 17.0.1.

Let's first run the benchmark on a single thread:

Benchmark                                      Mode  Cnt   Score   Error  Units
LossyCounterBenchmark.testAtomicCas            avgt   10  11.740 ± 0.040  ns/op
LossyCounterBenchmark.testAtomicIncrement      avgt   10   6.509 ± 0.027  ns/op
LossyCounterBenchmark.testBaseline             avgt   10   3.454 ± 0.004  ns/op
LossyCounterBenchmark.testLossyAcquireRelease  avgt   10   3.777 ± 0.158  ns/op
LossyCounterBenchmark.testLossyDefault         avgt   10   8.788 ± 0.016  ns/op

The testAtomicCas result here stands for a compareAndSet loop which is an awful way to do atomic increments on a counter. Not a big surprise that it showed the worst result. Then, testAtomicIncrement stands for the addAndGet operation, the default way to build an atomic counter. The baseline is nothing more than random number generation which is done as a part of all other benchmarks. Finally, testLossyAcquireRelease is our lossy counter while testLossyDefault stands for the same counter, but with the default operation semantics.
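For reference, the two less obvious flavors may look roughly like the sketch below. This is my guess at what such code does based on the descriptions above, not a copy of the benchmark source; the CounterFlavors class and its method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CounterFlavors extends AtomicLong {
    // CAS loop: retries until no other thread changed the value in between.
    long addAndGetCas(long delta) {
        for (;;) {
            long current = get();
            long next = current + delta;
            if (compareAndSet(current, next)) {
                return next;
            }
        }
    }

    // Lossy counter with the default (volatile) semantics:
    // a concurrent update between get() and set() is silently overwritten.
    long addAndGetLossyDefault(long delta) {
        long newValue = get() + delta;
        set(newValue);
        return newValue;
    }

    // Hammers addAndGetCas from several threads and returns the final value.
    static long hammerCas(int threads, int perThread) {
        CounterFlavors counter = new CounterFlavors();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.addAndGetCas(1);
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter.get();
    }
}
```

The CAS loop never loses an update, which is why hammerCas(4, 10_000) always yields 40,000, but every failed compareAndSet means a wasted round trip under contention.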

You may notice that our lossy counter adds almost nothing on top of the baseline and that's expected. The thing is that acquire/release semantics are no-op on x86 when it comes to ordinary loads (get) and stores (set). Read this blog post from Russ Cox if you want to learn more about HW memory models.

Let's run the benchmark on 8 threads now:

Benchmark                                      Mode  Cnt     Score    Error  Units
LossyCounterBenchmark.testAtomicCas            avgt   10  1163.104 ± 62.158  ns/op
LossyCounterBenchmark.testAtomicIncrement      avgt   10   139.210 ±  0.571  ns/op
LossyCounterBenchmark.testBaseline             avgt   10     6.549 ±  0.029  ns/op
LossyCounterBenchmark.testLossyAcquireRelease  avgt   10    20.058 ±  0.186  ns/op
LossyCounterBenchmark.testLossyDefault         avgt   10   253.887 ±  3.668  ns/op

As expected, the CAS-based counter is a terrible idea. The addAndGet (LOCK XADD on x86) atomic counter does a much better job. Of course, a LongAdder, being used to build an atomic counter, would do even better under contention, but we're not interested in atomic counters now.

Interestingly, the testLossyDefault counter is almost 2x slower than the atomic one. That should be explained by the price of two full memory barriers executed on each increment operation in that lossy counter. Finally, the acquire/release lossy counter is the doubtless winner of our unfair competition.

The above benchmark and the lossy counter approach should be taken with a grain of salt. My only intention was to demonstrate that weaker memory semantics may yield better performance of your code, at least in a niche use case. However, the performance advantage may be insignificant in your concrete application, yet it will certainly come at the cost of more complex and, thus, less maintainable code. So, be mindful when using the new memory semantics.

Next time we're going to build atomic memory snapshots based on a seqlock and discuss whether it's a good idea to do so. See you!

Benchmarking Non-shared Locks in Java

Andrei Pechkurov — Sun, 09 Jan 2022 13:36:19 GMT

Last time we discussed the scalability of j.u.c.l.ReentrantReadWriteLock and some alternatives. Some of the alternatives used a simple CAS (compare-and-swap) based spinlock as the internal writer lock. So, I was curious whether such a custom spinlock makes sense against what we have in the standard library. This brief post is dedicated to benchmarking the ReentrantLock class against a number of other non-shared (exclusive) locks.

Before we go any further, I have to warn readers that the considered alternative lock implementations are not production-ready in any sense, so use them at your own risk. The below results were obtained on concrete HW and SW and may change a lot in a different scenario or environment. Needless to say, using a single lock on the hot path is usually a bad idea. I advise going with the standard library as the default choice. So, consider this post to be an unfair, non-scientific experiment done out of curiosity.

The full code of the benchmark and the custom locks is available in this repo.

Competitors

Our first competitor is the j.u.c.l.ReentrantLock class, in both its unfair and fair flavors. The common wisdom claims that fair mode comes at a high cost. But what does it mean in practice? In theory, fair locks prevent thread starvation, so if the cost is reasonable, say, 2-3x, it might be a good idea to use it in certain use cases. That's why we're considering fair ReentrantLock.

The first custom lock implementation we're going to use is a primitive CAS-based spinlock:

public class CasSpinLock implements Lock {
    private final AtomicBoolean lock = new AtomicBoolean();

    @Override
    public void lock() {
        while (!lock.compareAndSet(false, true)) {}
    }

    @Override
    public void unlock() {
        lock.set(false);
    }

    // ...
}

This spinlock is as primitive as it could be. Apart from this basic version, we're also including a simple backoff version of it. To lower the contention for the boolean flag, the backoff version does LockSupport.parkNanos(10) in the loop body. This flavor of the CAS lock is exactly what we were using in the previous blog post.
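
The backoff flavor is not listed above, so here is a minimal sketch of what it might look like, assuming the same AtomicBoolean flag. The class name BackoffCasSpinLock is hypothetical and the rest of the Lock interface is left out; the actual class in the repo may differ.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Hypothetical name; a sketch of the constant-backoff flavor, not the repo's code.
final class BackoffCasSpinLock {
    private final AtomicBoolean lock = new AtomicBoolean();

    public void lock() {
        while (!lock.compareAndSet(false, true)) {
            // CAS failed, so the lock is taken: back off briefly before retrying
            LockSupport.parkNanos(10);
        }
    }

    public void unlock() {
        lock.set(false);
    }
}
```

The only difference from the basic version is the parkNanos call in the loop body, which keeps waiting threads from hammering the flag's cache line with CAS attempts.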

The next spinlock is a test and test-and-set (TTAS) one. The main difference is in the lock() method:

@Override
public void lock() {
    long delay = MIN_DELAY;
    for (;;) {
        // busy spin until the lock is available
        while (lock.get()) {}
        // try to acquire the lock
        if (!lock.getAndSet(true)) {
            return;
        }
        // back off
        LockSupport.parkNanos(delay);
        if (delay < MAX_DELAY) {
            delay *= 2;
        }
    }
}

This spinlock should do a better job than the CAS one in terms of cache coherence traffic: most of the time it reads the lock flag from the local core's cache line and only does an atomic test-and-set operation when it sees that the lock has just been released. The lock also uses an exponential backoff technique to reduce the contention. The exponential backoff should do a better job than the constant backoff used in the previous lock. So, in theory, this lock has the chance to demonstrate better throughput than the CAS spinlock.

The next spinlock is the well-known ticket lock. Again, in theory, it has a number of advantages over the previously listed spinlocks. The ticket lock is built on top of two atomic counters. The first one stands for ticket numbers: any thread attempting to acquire the lock increments the counter to get its ticket number. The second counter holds the currently served ticket: the thread whose ticket matches this value is considered to be the lock owner. When the lock owner releases the lock, it increments the currently served counter. Due to this design, the ticket lock provides fairness guarantees, i.e. it guarantees FIFO ordering of lock acquisition. Another advantage is that the hot path consists of only a single counter increment and a busy-wait read on the second counter. We're not going to focus on the source code, but you may find it here. It's worth mentioning that ticket locks may be found in the Linux kernel.
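
The two-counter scheme described above can be sketched roughly as follows. This is not the code from the repo: the class and field names are illustrative, and the use of Thread.onSpinWait() in the wait loop is an assumption of this sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;

// A sketch of the two-counter ticket lock; names are illustrative.
final class TicketSpinLock {
    // next ticket number to hand out
    private final AtomicInteger nextTicket = new AtomicInteger();
    // ticket number that currently owns the lock
    private final AtomicInteger nowServing = new AtomicInteger();

    public void lock() {
        // a single atomic increment gives this thread its place in the queue
        final int ticket = nextTicket.getAndIncrement();
        // busy-wait on reads of the second counter until our ticket is served
        while (nowServing.get() != ticket) {
            Thread.onSpinWait();
        }
    }

    public void unlock() {
        // hand the lock over to the holder of the next ticket
        nowServing.incrementAndGet();
    }
}
```

Since tickets are handed out in increasing order and served one by one, threads acquire the lock in the order they arrived, which is where the FIFO guarantee comes from.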

Last but not least on our list is the MCS spinlock. The algorithm is named after its authors. Just like the ticket spinlock, the MCS lock is a fair lock. The main idea behind it is to organize the waiter queue in a singly linked list where each waiting thread spins on its own node. When the current owner releases the lock, it updates a flag on the node belonging to the waiter at the head of the queue. Hence, the MCS spinlock is very efficient in terms of cache coherence traffic: the number of messages exchanged by the cores on each lock acquisition is O(1), unlike O(N) in the ticket lock. We're going to test two flavors of the MCS lock. The first one is not a spinlock since it uses the LockSupport.park()/unpark() facility to suspend and resume threads in the queue. This flavor should perform close to the fair ReentrantLock. That's because the standard class uses the CLH algorithm to implement fair lock mode. The CLH lock uses an implicit linked list, but otherwise, it's close enough to the MCS lock. The second MCS lock flavor is a proper spinlock. Again, you may find a variation of the MCS spinlock in the Linux kernel.
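
To make the queue idea more concrete, here is a minimal sketch of the spinning flavor. It is not the benchmarked implementation: the names are illustrative, and keeping one reusable node per thread in a ThreadLocal is a choice made for this sketch.

```java
import java.util.concurrent.atomic.AtomicReference;

// A sketch of an MCS spinlock; each waiter spins on a flag in its own node.
final class McsSpinLock {
    private static final class Node {
        volatile boolean locked;
        volatile Node next;
    }

    // tail of the waiter queue; null means the lock is free
    private final AtomicReference<Node> tail = new AtomicReference<>();
    // assumption of this sketch: one reusable node per thread
    private final ThreadLocal<Node> myNode = ThreadLocal.withInitial(Node::new);

    public void lock() {
        final Node node = myNode.get();
        node.next = null;
        node.locked = true;
        // enqueue ourselves as the new tail
        final Node pred = tail.getAndSet(node);
        if (pred != null) {
            // link behind the predecessor and spin on our own flag
            pred.next = node;
            while (node.locked) {
                Thread.onSpinWait();
            }
        }
    }

    public void unlock() {
        final Node node = myNode.get();
        Node succ = node.next;
        if (succ == null) {
            // no visible successor: try to mark the lock as free
            if (tail.compareAndSet(node, null)) {
                return;
            }
            // a waiter has swapped the tail but has not linked itself yet
            while ((succ = node.next) == null) {
                Thread.onSpinWait();
            }
        }
        // pass the lock to the next waiter by flipping its flag
        succ.locked = false;
    }
}
```

Note that the releasing thread writes only to its successor's node, so each handover touches a cache line that a single waiter is spinning on.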

Since our test stand is not a NUMA machine, we're not considering NUMA-aware locks, such as hierarchical locks or lock cohorting.

Benchmark

The benchmark we're going to use focuses on the average execution time per lock -> some work in the critical section -> unlock chain of calls. Thus, we're interested in the throughput rather than latency distribution or power efficiency of the locks under test. Since we want to understand lock scalability properties, the work done in the critical section is kept short enough.

Here is the benchmark itself:

@Benchmark
public void testLock(BenchmarkState state, Blackhole bh) {
    final ThreadLocalRandom rnd = ThreadLocalRandom.current();
    state.lock.lock();
    // emulate some work
    for (int i = 0; i < NUM_WORK_SPINS; i++) {
        // access a counter (shared memory)
        state.sum += rnd.nextInt();
    }
    bh.consume(state.sum);
    state.lock.unlock();
}

We're going to run this benchmark varying the number of threads from a single thread (no contention) to the number of available CPU cores (highest contention).

Results

The results below were obtained on a GCP e2-highcpu-32 VM with 32 vCPUs (Intel Haswell) and 32 GB of memory, running Ubuntu 20.04 and OpenJDK 17.0.1. The following chart represents all results. A text version of the results is also available here.

Notice that the vertical axis has a log scale and stands for the average operation latency in nanoseconds.

The first thing to notice is that the baseline (think, work done in the critical section) result at the very bottom of the chart is constant for any number of threads except for 32 threads, where it gets 2x slower. Most likely that's because of the hyper-threading cores available on the VM. Hyper-threading sibling cores share arithmetic logic units (ALU) and since we're doing some number crunching in the critical section, that becomes the bottleneck when all cores are in use.

Next, the fair mode of ReentrantLock comes at a very high cost. In the 32 threads scenario, the difference in latency is 218x, hence two orders of magnitude. Anyone who uses fair ReentrantLock should be aware of potential performance implications. As we expected, the non-spinlock flavor of the MCS lock comes close to the fair ReentrantLock.

There are a number of outsiders among our hand-crafted spinlocks. The first one is the CAS spinlock without a backoff. It heavily suffers from contention caused by CAS operations on a single atomic flag (think, a cache line). Surprisingly, ticket and MCS spinlocks, which were very promising in theory, follow the basic CAS spinlock closely in terms of the average latency. Although the difference between the CAS spinlock and the MCS spinlock is 3x, the MCS spinlock is still far away from the group of winners.

Let's remove the outsiders from the chart and get rid of the log scale. This should help us when analyzing the winning group.

The first thing to note here is that a CAS spinlock with a primitive, constant time backoff implementation performs better than the TTAS spinlock. The latter has slightly more complex code for the spinlock itself and definitely a more complex exponential backoff mechanism. Surprisingly, the CAS spinlock provides lower average latency on all thread counts, so it makes no sense to deal with the TTAS spinlock, at least in the considered benchmark scenario.

The second observation is that unfair ReentrantLock does a really great job overall. Our custom spinlocks show a better result, with 5x lower average latency, only when the benchmark is run on 2 threads, while the standard lock wins on 8 threads and beyond. In the highest contention scenario, ReentrantLock's latency is 26% lower than that of the CAS spinlock.

Lessons learned

Hopefully, this toy benchmark can serve as another argument to use the unfair ReentrantLock as the default choice in any Java application. The standard lock provides solid performance and can be beaten by a custom lock only in a concrete scenario. On the other hand, the fair ReentrantLock mode has to be used cautiously when you're certain that the fairness guarantee outweighs the performance impact.

As for the custom spinlock classes, it's not a one size fits all story. Depending on the concrete hardware and usage scenario a simpler lock may outperform more complex locks while in theory, it should be the other way around.

Scalable Readers-Writer Lock

Andrei Pechkurov — Sun, 02 Jan 2022 11:30:39 GMT

Locks, or mutexes (mutual exclusions), are one of the most basic concurrency primitives. It's hard to find a developer who won't be able to explain a mutex, at least on the fundamental level. Yet, mutexes are more than that. A mutex may:

be OS-level (think, a pthread mutex) or user-land (think, a spinlock),
expose a pessimistic (blocking) or optimistic (non-blocking) locking API,
provide fairness in lock acquisition or keep things unfair,
support reentrant calls or be non-reentrant,
have a notion of asymmetry in locking (say, with a shared lock available for readers) or stick to symmetric, exclusive locking,
strictly require unlocking on the same thread (a pthread mutex, once again) or not bother with the unlocker's identity (sync.Mutex in Golang),
support time-based cancellation for locking attempts or have non-abortable calls only.

Today we focus on asymmetric, readers-writer locks which are familiar to most Java developers. Such locks allow concurrent readers to proceed with executing their critical section, while writers are guaranteed to acquire exclusive ownership of the lock. These locks are used in scenarios where the vast majority of calls come from readers and writers acquire the lock rather infrequently.

Our ultimate goal is to come up with a lock implementation that would scale reader operations linearly in terms of the CPU core count and compare the result with alternatives such as ReentrantReadWriteLock class from the standard library.

Prior art

Readers-writer (R/W) locks are not something new. A widespread R/W lock implementation uses an atomic counter for the reader part and looks something like the following class from the QuestDB code base.

Locking for a reader in this class looks like this:

@Override
public void lock() {
    // start a lock attempt
    while (nReaders.incrementAndGet() >= MAX_READERS) {
        // there is a writer owning the lock, so clean up, sleep and go for another spin
        nReaders.decrementAndGet();
        LockSupport.parkNanos(10);
    }
}

Note. If you run your code on Windows, you may face latency issues with LockSupport.parkNanos(). So, make sure to do some benchmarking before using any of the locks we cover today.

Here nReaders is an AtomicInteger used as a medium between readers and the writer. Each reader increments the counter atomically and checks the result value. If it's smaller than the threshold, a reader lock is acquired successfully. If not, the reader has to busy spin (in fact, it could sleep and wait for a notification from the writer, but that would slightly increase the latency). The writer, on the other hand, does the following to acquire a lock:

@Override
public void lock() {
    // trimmed code that acquires the internal writer lock:
    // ...
    // increment the readers counter by the threshold
    int n = nReaders.addAndGet(MAX_READERS);
    // wait until there are no readers holding a lock
    while (n != MAX_READERS) {
        n = nReaders.get();
    }
}

It is important to stress that this lock is non-reentrant. As mentioned above, this readers-writer lock design is quite popular and you may find it in, say, the Go standard library's sync.RWMutex struct. The main problem with this approach is that its reader part doesn't scale. This means that if the time spent in the critical section by each reader is rather low, adding more threads (and cores) to the program may not lead to improved performance.
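
The snippets above show only the acquire paths. Here is a minimal sketch of how the release paths could complete the picture, assuming a plain CAS flag as the internal writer lock and an arbitrary MAX_READERS value. The class name, the method names and the threshold are illustrative, not QuestDB's code.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// A sketch of the single atomic counter R/W lock, with release paths added.
final class SimpleRwLockSketch {
    // assumed threshold; must exceed any realistic number of readers
    private static final int MAX_READERS = 1 << 30;
    private final AtomicInteger nReaders = new AtomicInteger();
    // assumed internal writer lock: a CAS flag with a constant backoff
    private final AtomicBoolean wLock = new AtomicBoolean();

    public void readLock() {
        while (nReaders.incrementAndGet() >= MAX_READERS) {
            // a writer owns the lock: undo the increment, sleep and retry
            nReaders.decrementAndGet();
            LockSupport.parkNanos(10);
        }
    }

    public void readUnlock() {
        nReaders.decrementAndGet();
    }

    public void writeLock() {
        while (!wLock.compareAndSet(false, true)) {
            LockSupport.parkNanos(10);
        }
        // announce the writer, then wait for active readers to leave
        int n = nReaders.addAndGet(MAX_READERS);
        while (n != MAX_READERS) {
            n = nReaders.get();
        }
    }

    public void writeUnlock() {
        // remove the writer's share of the counter, then release the writer lock
        nReaders.addAndGet(-MAX_READERS);
        wLock.set(false);
    }
}
```

Every readLock() and readUnlock() call performs an atomic read-modify-write on the same nReaders field, which is exactly the shared hot spot that limits reader scalability.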

Let's demonstrate this. Our test stand is a laptop with an i7-1185G7 CPU (4 cores, 8 hardware threads) running Ubuntu 20.04 and OpenJDK 17.0.1. The JMH microbenchmark we're going to use may be found here.

Benchmark                            (type)  Mode  Cnt    Score   Error  Units
ReadWriteLockBenchmark.testBaseline     N/A  avgt   10   21.506 ± 0.784  ns/op
ReadWriteLockBenchmark.testLock2     SIMPLE  avgt   10  101.476 ± 2.403  ns/op
ReadWriteLockBenchmark.testLock4     SIMPLE  avgt   10  205.656 ± 1.970  ns/op
ReadWriteLockBenchmark.testLock8     SIMPLE  avgt   10  296.461 ± 0.735  ns/op

Here the testBaseline benchmark stands for the baseline, i.e. the work done in the critical section in the main benchmark. As for testLockN, it means the reader lock benchmark run on N threads.

You may have already noticed almost linear degradation in the average operation time when we increase the number of threads. That's because the hot path in the microbenchmark boils down to an atomic increment instruction available in modern CPUs, e.g. LOCK XADD on x86. The aforementioned instruction implies exclusive access to the counter (the corresponding cache line, to be more precise) acquired by the CPU core executing the caller thread. Hence, the increment is still restricted to a single core at a time and, on top of that, in the face of contention the synchronization cost paid by each core increases significantly. Refer to this comprehensive blog post by Travis Downs to learn more about the cost of concurrency primitives on modern HW.

If we add LinuxPerfProfiler JMH profiler and re-run the benchmark to get perf stat output for the benchmark run on 2 threads, we'll see the following:

Perf stats:
--------------------------------------------------
        199010,83 msec task-clock                #    1,521 CPUs utilized
             3887      context-switches          #    0,020 K/sec
              223      cpu-migrations            #    0,001 K/sec
              255      page-faults               #    0,001 K/sec
     614120444917      cycles                    #    3,086 GHz                      (41,68%)
     418345242445      instructions              #    0,68  insn per cycle           (50,02%)
      21283383788      branches                  #  106,946 M/sec                    (58,35%)
          2027813      branch-misses             #    0,01% of all branches          (66,69%)
      57388541249      L1-dcache-loads           #  288,369 M/sec                    (66,69%)
       2080486452      L1-dcache-load-misses     #    3,63% of all L1-dcache accesses  (66,68%)
          2462215      LLC-loads                 #    0,012 M/sec                    (66,68%)
           312884      LLC-load-misses           #   12,71% of all LL-cache accesses  (66,69%)
  <not supported>      L1-icache-loads
         20972526      L1-icache-load-misses                                         (33,32%)
      57430817101      dTLB-loads                #  288,581 M/sec                    (33,33%)
            63230      dTLB-load-misses          #    0,00% of all dTLB cache accesses  (33,34%)
  <not supported>      iTLB-loads
           116372      iTLB-load-misses                                              (33,33%)
  <not supported>      L1-dcache-prefetches
  <not supported>      L1-dcache-prefetch-misses
    130,817291871 seconds time elapsed
    260,986586000 seconds user
      0,210040000 seconds sys

Now, if we run the benchmark on 8 threads, the result would be:

Perf stats:
--------------------------------------------------
        784673,96 msec task-clock                #    5,996 CPUs utilized
           340290      context-switches          #    0,434 K/sec
              212      cpu-migrations            #    0,000 K/sec
              253      page-faults               #    0,000 K/sec
    2418749827832      cycles                    #    3,082 GHz                      (41,66%)
     394531732276      instructions              #    0,16  insn per cycle           (50,01%)
      20804674516      branches                  #   26,514 M/sec                    (58,35%)
         11107908      branch-misses             #    0,05% of all branches          (66,67%)
      54747118838      L1-dcache-loads           #   69,771 M/sec                    (66,68%)
       2413557669      L1-dcache-load-misses     #    4,41% of all L1-dcache accesses  (66,68%)
        582485788      LLC-loads                 #    0,742 M/sec                    (66,67%)
           484122      LLC-load-misses           #    0,08% of all LL-cache accesses  (66,66%)
  <not supported>      L1-icache-loads
        157690988      L1-icache-load-misses                                         (33,32%)
      54789336429      dTLB-loads                #   69,824 M/sec                    (33,33%)
           504440      dTLB-load-misses          #    0,00% of all dTLB cache accesses  (33,33%)
  <not supported>      iTLB-loads
          3304265      iTLB-load-misses                                              (33,34%)
  <not supported>      L1-dcache-prefetches
  <not supported>      L1-dcache-prefetch-misses
    130,864386377 seconds time elapsed
   1025,676317000 seconds user
      2,552276000 seconds sys

The previously mentioned problem with contention over a single atomic counter primarily manifests itself in significant degradation of the instructions per cycle (IPC) metric: it goes from 0,68 insn/cycle for 2 threads, which is already quite low, to a pitiful 0,16 insn/cycle for 8 threads. Other metrics also degrade proportionally. That's because each CPU core's backend spends a lot of time synchronizing the counter's cache line with other cores.

To be precise, low IPC doesn't necessarily mean contention caused by atomic instructions. It may be caused by other reasons, such as random memory accesses on a data structure that doesn't fit into CPU caches. If you'd like to detect this particular scenario, you should profile for specific PMU (Performance Monitoring Unit) events as Travis Downs pointed out.
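
The IPC figures can be double-checked from the raw counters in the two perf outputs: IPC is simply instructions divided by cycles. A tiny sketch of that arithmetic, where the IpcMath helper is made up for illustration:

```java
// Illustrative helper; the inputs are the raw counters printed by perf stat.
final class IpcMath {
    static double ipc(long instructions, long cycles) {
        return (double) instructions / (double) cycles;
    }
}
```

With the numbers above, ipc(418345242445L, 614120444917L) is roughly 0.68 for 2 threads and ipc(394531732276L, 2418749827832L) is roughly 0.16 for 8 threads. In other words, the 8-thread run retires a similar number of instructions but burns about 3.9x more cycles doing so.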

To nail down the single atomic counter-based lock topic, let's try to compare it with the standard j.u.c.l.ReentrantReadWriteLock class. We're going to use the same benchmark running on 8 threads:

Benchmark                        (type)  Mode  Cnt     Score    Error  Units
ReadWriteLockBenchmark.testLock     JUC  avgt   10  2081.664 ± 15.117  ns/op
ReadWriteLockBenchmark.testLock  SIMPLE  avgt   10   289.155 ±  3.339  ns/op

Here the JUC type stands for the ReentrantReadWriteLock class and SIMPLE means the single atomic counter lock. The difference is significant, about 7x. Moreover, if we activated the GC profiler in the benchmark, we would see that the standard class allocates around 200 MB/sec (not the end of the world, but could be avoided), while the atomic counter lock does not allocate at all. So, if you have a concrete use case and you really know what you're doing, using a custom lock might be a good idea.

Long story short, a single atomic counter-based lock might do a better job than the standard Java library, but it has an important flaw. The thing is that it assumes contention between readers while it's not necessary at all. Readers need to share memory (think, synchronize) with the writer only, not with each other. And, of course, there are algorithms that try to address the reader scalability problem. Let's quickly discuss the alternatives.

The first alternative I'm aware of is Dmitry Vyukov's distributed reader-writer mutex. It's written in C++, but it should be straightforward to port it to Java. The main idea is to shard the reader counter into an array of atomic counters. The size of the array is set at runtime to match the number of available cores. Each reader is associated with a slot in the array based on the id of the core running the reader thread. The id is obtained via the getcpu() system call available on Linux, so porting it to other OSes is problematic. Another problem is that threads may migrate between cores unless you go with thread affinity. To address this problem, D.Vyukov's lock provides the core id as the return value of the reader's lock() method. This value has to be provided later when the reader calls unlock(). So, while this lock has potential, it's certainly not general-purpose.
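
To illustrate the shape of such an API, here is a simplified Java sketch of the sharded-counter idea. It is not a port of D.Vyukov's C++ code: it reuses the single-counter scheme from above per slot, picks the slot from the thread id as a stand-in for the core id (plain Java has no getcpu()), and assumes 64-byte cache lines for the slot spacing. All names are illustrative.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.concurrent.locks.LockSupport;

// Simplified sketch: one reader counter per slot, writers claim every slot.
final class ShardedRwLockSketch {
    private static final int MAX_READERS = 1 << 30;
    // 16 ints = 64 bytes between slots, assuming 64-byte cache lines
    private static final int STRIDE = 16;

    private final int slots;
    private final AtomicIntegerArray counters;
    private final AtomicBoolean wLock = new AtomicBoolean();

    ShardedRwLockSketch(int slots) {
        this.slots = slots;
        this.counters = new AtomicIntegerArray(slots * STRIDE);
    }

    // Returns the slot token that has to be passed to readUnlock().
    public int readLock() {
        // assumption: the thread id stands in for the core id
        final int slot = (int) (Thread.currentThread().getId() % slots);
        final int idx = slot * STRIDE;
        while (counters.incrementAndGet(idx) >= MAX_READERS) {
            // a writer owns this slot: undo, sleep and retry
            counters.decrementAndGet(idx);
            LockSupport.parkNanos(10);
        }
        return slot;
    }

    public void readUnlock(int slot) {
        counters.decrementAndGet(slot * STRIDE);
    }

    public void writeLock() {
        while (!wLock.compareAndSet(false, true)) {
            LockSupport.parkNanos(10);
        }
        // claim every slot, then wait for its readers to drain
        for (int idx = 0; idx < slots * STRIDE; idx += STRIDE) {
            counters.addAndGet(idx, MAX_READERS);
            while (counters.get(idx) != MAX_READERS) {
                LockSupport.parkNanos(10);
            }
        }
    }

    public void writeUnlock() {
        for (int idx = 0; idx < slots * STRIDE; idx += STRIDE) {
            counters.addAndGet(idx, -MAX_READERS);
        }
        wLock.set(false);
    }
}
```

A reader would call `int slot = lock.readLock();` and later `lock.readUnlock(slot);`. Returning the slot makes the unlock hit the same counter even if the slot choice would be different at unlock time, which is the concern with core ids and thread migration.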

Another worthwhile scalable readers-writer lock I know of is called BRAVO lock. The idea is quite close to D.Vyukov's class, yet the sharded counters array is fixed-size, and the reader's slot is determined based on the thread id. The second part is not a hard requirement and it's possible to implement the BRAVO algorithm in, say, Golang which doesn't expose goroutine or thread id by design. Implementation in Java is available here. We're going to benchmark it against the single atomic counter lock now. If you're interested in learning more about BRAVO lock, refer to the original paper.

Here is how BRAVO lock does in the benchmark run on 8 threads:

Benchmark                          (type)  Mode  Cnt    Score   Error  Units
ReadWriteLockBenchmark.testLock    SIMPLE  avgt   10  296.300 ± 1.824  ns/op
ReadWriteLockBenchmark.testLock    BIASED  avgt   10   53.811 ± 1.548  ns/op

The BIASED type from the above JMH output is the BRAVO lock while the SIMPLE type is the same atomic counter lock we previously discussed. As expected, BRAVO lock solves the scalability issue for readers, but it also has some problematic parts. First, the size of the array has to be large enough to avoid reader contention. The Java class we benchmarked sets the array size to 4096, which means 16KB of memory. The array size should be ideally based on the hardware that runs your code. Second, even if the array is properly sized, readers may still go to the same slot. In fact, adjacent slots are enough as long as they occupy the same cache line. That's because the slot assignment applies a hash function to the thread id and, due to that, does not guarantee the absence of hash code collisions.

So, both alternatives have certain cons and leave enough space for improvements. That's exactly what we came for.

Meet TLBiasedReadWriteLock

Yes, naming is not my strong side. TLBiasedReadWriteLock stands for thread-local reader biased lock. As the name suggests, the class uses a thread-local counter for each reader, while the writer lock is based on a spinlock.

Simplified reader's lock method looks like the following:

@Override
public void lock() {
    // initialize or fetch a thread-local counter for the reader
    PaddedAtomicLong readerCounter = tlReaderCounter.get();
    for (;;) {
        // check if the writer lock is available and we can start the attempt
        if (!wLock.get()) {
            // increment the reader counter
            readerCounter.incrementAndGet();
            // check that no writer acquired the lock
            if (!wLock.get()) {
                // attempt is successful, we're done
                break;
            }
            // attempt failed, go for another spin
            readerCounter.decrementAndGet();
        }
        LockSupport.parkNanos(10);
    }
}

In this code, PaddedAtomicLong is nothing more than, well, a padded AtomicLong used as a reader counter. The padding is added to prevent false sharing, i.e. the situation when different readers are unfortunate to have their counters adjacent on the heap memory so that they end up in the same CPU cache line. When a reader accesses the thread-local counter for the first time, the counter gets added to an array of weak references. These weak references are checked by writers when they want to acquire the lock.

You may wonder about scalability of the above code in terms of readers. Yes, there are no writes to shared memory, but we read the writer lock value multiple times. Will this code scale? The answer is yes, reads (loads) of shared memory scale and it's fine to use them anywhere in your code.

Next, writer's lock method looks like this:

@Override
public void lock() {
    // first, acquire writer lock
    lock0();
    // next, wait for the readers
    Iterator<WeakReference<PaddedAtomicLong>> iterator = readerCounters.iterator();
    while (iterator.hasNext()) {
        // fetch reader counter's weak reference
        WeakReference<PaddedAtomicLong> ref = iterator.next();
        PaddedAtomicLong counter = ref.get();
        if (counter == null) {
            // clean up the counter since the reader thread stopped
            iterator.remove();
            continue;
        }
        while (counter.get() != 0) {
            // the reader still holds the lock, so sleep and go for another spin
            LockSupport.parkNanos(10);
        }
    }
}

As promised, this lock guarantees zero contention for readers since their counters are thread-local. The counters are allocated dynamically, so in scenarios where only a few threads access the lock, its memory footprint should be lower than that of the BRAVO lock. This lock should do its best when accessed from a fixed-size thread pool.
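
The release paths and the counter registration are not shown above. Here is a compact sketch that puts both sides together. The ConcurrentLinkedQueue used as the registry, the padding layout, the CAS-flag writer lock and the method names are assumptions of this sketch and may differ from the real TLBiasedReadWriteLock.

```java
import java.lang.ref.WeakReference;
import java.util.Iterator;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

// A sketch of a thread-local reader biased lock; not the benchmarked class.
final class TlBiasedRwLockSketch {
    // AtomicLong followed by padding fields (the exact layout is up to the JVM)
    static final class PaddedAtomicLong extends AtomicLong {
        long p1, p2, p3, p4, p5, p6, p7;
    }

    private final AtomicBoolean wLock = new AtomicBoolean();
    // assumed registry type; the writer scans it to find all reader counters
    private final Queue<WeakReference<PaddedAtomicLong>> readerCounters =
            new ConcurrentLinkedQueue<>();
    private final ThreadLocal<PaddedAtomicLong> tlReaderCounter =
            ThreadLocal.withInitial(() -> {
                // first access on this thread: create and register the counter
                PaddedAtomicLong counter = new PaddedAtomicLong();
                readerCounters.add(new WeakReference<>(counter));
                return counter;
            });

    public void readLock() {
        PaddedAtomicLong readerCounter = tlReaderCounter.get();
        for (;;) {
            if (!wLock.get()) {
                readerCounter.incrementAndGet();
                if (!wLock.get()) {
                    return;
                }
                // a writer sneaked in: undo and retry
                readerCounter.decrementAndGet();
            }
            LockSupport.parkNanos(10);
        }
    }

    public void readUnlock() {
        // must be called on the thread that called readLock()
        tlReaderCounter.get().decrementAndGet();
    }

    public void writeLock() {
        while (!wLock.compareAndSet(false, true)) {
            LockSupport.parkNanos(10);
        }
        Iterator<WeakReference<PaddedAtomicLong>> it = readerCounters.iterator();
        while (it.hasNext()) {
            PaddedAtomicLong counter = it.next().get();
            if (counter == null) {
                // the reader thread is gone and its counter was collected
                it.remove();
                continue;
            }
            while (counter.get() != 0) {
                LockSupport.parkNanos(10);
            }
        }
    }

    public void writeUnlock() {
        wLock.set(false);
    }
}
```

Note that a reader's unlock only touches its own counter, so the read side never writes to memory shared with other readers. The price is paid by the writer, which has to walk over all registered counters.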

If we benchmark this lock (TLBIASED type) against BRAVO (BIASED type) on 8 threads, we get this:

Benchmark                          (type)  Mode  Cnt    Score   Error  Units
ReadWriteLockBenchmark.testLock    BIASED  avgt   10   53.811 ± 1.548  ns/op
ReadWriteLockBenchmark.testLock  TLBIASED  avgt   10   54.682 ± 1.204  ns/op

As you would expect, the two locks are on par in terms of reader lock scalability and end performance. Expectedly, if we sized the BRAVO lock's array less aggressively or ran the benchmark on a lot more cores/threads, the BRAVO lock would perform worse.

Final battle

Before we wrap up, let's compare all observed lock implementations in a slightly more realistic benchmark. To make it happen we increase the time spent in the critical section by 4x so that it takes around 140 nanoseconds instead of 22 nanoseconds used in the reader-only benchmark. Next, we change the benchmark so that the writer lock is occasionally acquired. The ratio between reader and writer lock calls we're going to use will be 1,000:1, 10,000:1, or 100,000:1. The benchmark code may be found here.

That's what we get when the benchmark is run on 8 threads:

Benchmark                                 (readWriteRatio)    (type)  Mode  Cnt     Score    Error  Units
ReadWriteLockBenchmark.testReadWriteLock              1000       JUC  avgt   10  1966.910 ± 25.643  ns/op
ReadWriteLockBenchmark.testReadWriteLock              1000    SIMPLE  avgt   10   668.782 ±  0.879  ns/op
ReadWriteLockBenchmark.testReadWriteLock              1000    BIASED  avgt   10   458.470 ±  3.169  ns/op
ReadWriteLockBenchmark.testReadWriteLock              1000  TLBIASED  avgt   10   598.463 ±  2.135  ns/op
ReadWriteLockBenchmark.testReadWriteLock             10000       JUC  avgt   10  1828.119 ± 41.570  ns/op
ReadWriteLockBenchmark.testReadWriteLock             10000    SIMPLE  avgt   10   593.159 ±  1.913  ns/op
ReadWriteLockBenchmark.testReadWriteLock             10000    BIASED  avgt   10   219.809 ±  3.359  ns/op
ReadWriteLockBenchmark.testReadWriteLock             10000  TLBIASED  avgt   10   226.159 ±  0.808  ns/op
ReadWriteLockBenchmark.testReadWriteLock            100000       JUC  avgt   10  1973.585 ± 14.310  ns/op
ReadWriteLockBenchmark.testReadWriteLock            100000    SIMPLE  avgt   10   584.412 ±  5.078  ns/op
ReadWriteLockBenchmark.testReadWriteLock            100000    BIASED  avgt   10   173.194 ±  1.360  ns/op
ReadWriteLockBenchmark.testReadWriteLock            100000  TLBIASED  avgt   10   174.324 ±  3.236  ns/op

Good old ReentrantReadWriteLock (JUC) is equally slow regardless of the read/write lock call ratio. The single atomic counter lock (SIMPLE) also doesn't improve much when there are fewer writer calls. For the 1,000:1 ratio, the single counter lock's performance is very close to the performance of BRAVO lock (BIASED) and thread-local reader biased lock (TLBIASED).

What's interesting is that the BRAVO lock (BIASED) is a bit cheaper than the thread-local biased lock for the 1,000:1 read-write ratio, but when the number of reader calls grows the difference disappears. That's because of a more expensive write lock call in the thread-local biased lock. Again, the strong point of the thread-local lock is the guaranteed absence of contention for readers. Since the BRAVO lock is put into the best possible conditions in terms of the internal array sizing and the number of threads and CPU cores, we didn't observe any scalability issues with this lock in the conducted benchmarks.

What's in it for me?

Before we go any further, I'd suggest using a custom lock implementation only in niche use cases and, what's even more important, only if you know what you're doing. In all other situations, the standard library should be the default way to go.

If you're certain that there is a majority of reader locks in your code, they have rather short critical sections, and the ReentrantReadWriteLock class shows up as the bottleneck in profiler reports, any of the observed alternatives have the chance to improve your application's performance.

Hopefully, you've learned something new today. Good luck and see you next time.

My previous blog

Andrei Pechkurov — Sat, 25 Dec 2021 16:16:19 GMT

My previous blog on Medium is located here. It's mostly focused on Node.js.

You may also find my blog post in Gopher Advent 2021.