Scalable Readers-Writer Lock

Locks, or mutexes (short for mutual exclusion), are one of the most basic concurrency primitives. It's hard to find a developer who can't explain a mutex, at least at a fundamental level. Yet there is more to mutexes than that. They may:

  • be OS-level (think, a pthread mutex) or user-land (think, a spinlock),
  • expose a pessimistic (blocking) or an optimistic (non-blocking) locking API,
  • provide fairness in lock acquisition or keep things unfair,
  • support reentrant calls, or prefer to be non-reentrant,
  • have a notion of asymmetry in locking (say, with a shared lock available for readers) or stick to symmetric, exclusive locking,
  • strictly require unlocking on the same thread (a pthread mutex, once again) or not bother with the unlocker's identity (sync.Mutex in Golang),
  • support time-based cancellation of locking attempts or have non-abortable calls only.

Today we focus on asymmetric, readers-writer locks, which should be familiar to most Java developers. Such locks allow concurrent readers to proceed with executing their critical sections, while writers are guaranteed to acquire exclusive ownership of the lock. These locks are used in scenarios where the vast majority of calls come from readers and writers acquire the lock rather infrequently.
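As a quick refresher, here is what typical readers-writer lock usage looks like with the standard ReentrantReadWriteLock class (a minimal illustration, not code from the benchmarks below):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class SettingsCache {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, String> settings = new HashMap<>();

    // any number of readers may hold the read lock at the same time
    String get(String key) {
        lock.readLock().lock();
        try {
            return settings.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    // a writer gets exclusive ownership: no readers or other writers run concurrently
    void put(String key, String value) {
        lock.writeLock().lock();
        try {
            settings.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}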

Our ultimate goal is to come up with a lock implementation that scales reader operations linearly with the CPU core count and to compare the result with alternatives such as the ReentrantReadWriteLock class from the standard library.

Prior art

Readers-writer (R/W) locks are not something new. A widespread R/W lock implementation uses an atomic counter for the reader part and looks something like the following class from the QuestDB code base.

Locking for a reader in this class looks like this:

@Override
public void lock() {
    // start a lock attempt
    while (nReaders.incrementAndGet() >= MAX_READERS) {
        // there is a writer owning the lock, so clean up, sleep and go for another spin
        nReaders.decrementAndGet();
        LockSupport.parkNanos(10);
    }
}

Note. If you run your code on Windows, you may face latency issues with LockSupport.parkNanos(). So, make sure to do some benchmarking before using any of the locks we cover today.

Here nReaders is an AtomicInteger used as a medium between readers and the writer. Each reader increments the counter atomically and checks the result value. If it's smaller than the threshold, a reader lock is acquired successfully. If not, the reader has to busy spin (in fact, it could sleep and wait for a notification from the writer, but that would slightly increase the latency). The writer, on the other hand, does the following to acquire a lock:

@Override
public void lock() {
    // trimmed code that acquires the internal writer lock:
    // ...

    // increment the readers counter by the threshold
    int n = nReaders.addAndGet(MAX_READERS);
    // wait until there are no readers holding a lock
    while (n != MAX_READERS) {
        n = nReaders.get();
    }
}
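For completeness, the matching unlock paths in such a design are trivial: the reader decrements the counter, and the writer subtracts the threshold back and releases its internal lock. A rough sketch (on the read-lock and write-lock views respectively; not the exact QuestDB code):

// reader unlock: undo the increment made in lock()
@Override
public void unlock() {
    nReaders.decrementAndGet();
}

// writer unlock: drop the counter back below the threshold
@Override
public void unlock() {
    nReaders.addAndGet(-MAX_READERS);
    // trimmed code that releases the internal writer lock:
    // ...
}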

It is important to stress that this lock is non-reentrant. As previously mentioned, such a readers-writer lock design is quite popular, and you may find it in, say, the Go standard library's sync.RWMutex struct. The main problem with this approach is that its reader part doesn't scale: if the time spent in the critical section by each reader is rather low, adding more threads (and cores) to the program may not lead to improved performance.

Let's demonstrate this. Our test machine is a laptop with an i7-1185G7 CPU (4 cores / 8 hardware threads) running Ubuntu 20.04 and OpenJDK 17.0.1. The JMH microbenchmark we're going to use may be found here.
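The benchmark essentially measures the cost of a reader lock/unlock pair around a tiny critical section. A simplified sketch of what such a benchmark method may look like (names and details here are illustrative, not the actual benchmark code):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ReaderLockSketch {

    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final long[] data = new long[16];

    @Benchmark
    @Threads(8)
    public long readLocked() {
        lock.readLock().lock();
        try {
            // tiny critical section: scan a small shared array
            long sum = 0;
            for (long v : data) {
                sum += v;
            }
            return sum;
        } finally {
            lock.readLock().unlock();
        }
    }
}

Here are the results for the baseline and the atomic counter lock on 2, 4, and 8 threads: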

Benchmark                            (type)  Mode  Cnt    Score   Error  Units
ReadWriteLockBenchmark.testBaseline     N/A  avgt   10   21.506 ± 0.784  ns/op
ReadWriteLockBenchmark.testLock2     SIMPLE  avgt   10  101.476 ± 2.403  ns/op
ReadWriteLockBenchmark.testLock4     SIMPLE  avgt   10  205.656 ± 1.970  ns/op
ReadWriteLockBenchmark.testLock8     SIMPLE  avgt   10  296.461 ± 0.735  ns/op

Here the testBaseline benchmark stands for the baseline, i.e. just the work done in the critical section of the main benchmark. As for testLockN, it stands for the reader lock benchmark run on N threads.

You may have already noticed the almost linear degradation in the average operation time as we increase the number of threads. That's because the hot path in the microbenchmark boils down to an atomic increment instruction available in modern CPUs, e.g. LOCK XADD on x86. This instruction implies exclusive access to the counter (to the corresponding cache line, to be more precise) by the CPU core executing the caller thread. Hence, the increment is still restricted to a single core at a time and, on top of that, under contention the synchronization cost paid by each core increases significantly. Refer to this comprehensive blog post by Travis Downs to learn more about the cost of concurrency primitives on modern hardware.

If we add the LinuxPerfProfiler JMH profiler and re-run the benchmark to get perf stat output for the run on 2 threads, we'll see the following:

Perf stats:
--------------------------------------------------

        199010,83 msec task-clock                #    1,521 CPUs utilized          
             3887      context-switches          #    0,020 K/sec                  
               223      cpu-migrations            #    0,001 K/sec                  
               255      page-faults               #    0,001 K/sec                  
   614120444917      cycles                    #    3,086 GHz                      (41,68%)
   418345242445      instructions              #    0,68  insn per cycle           (50,02%)
    21283383788      branches                  #  106,946 M/sec                    (58,35%)
         2027813      branch-misses             #    0,01% of all branches          (66,69%)
    57388541249      L1-dcache-loads           #  288,369 M/sec                    (66,69%)
     2080486452      L1-dcache-load-misses     #    3,63% of all L1-dcache accesses  (66,68%)
         2462215      LLC-loads                 #    0,012 M/sec                    (66,68%)
           312884      LLC-load-misses           #   12,71% of all LL-cache accesses  (66,69%)
   <not supported>      L1-icache-loads                                             
        20972526      L1-icache-load-misses                                         (33,32%)
    57430817101      dTLB-loads                #  288,581 M/sec                    (33,33%)
            63230      dTLB-load-misses          #    0,00% of all dTLB cache accesses  (33,34%)
   <not supported>      iTLB-loads                                                  
           116372      iTLB-load-misses                                              (33,33%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

     130,817291871 seconds time elapsed

     260,986586000 seconds user
       0,210040000 seconds sys

Now, if we run the benchmark on 8 threads, the result would be:

Perf stats:
--------------------------------------------------

        784673,96 msec task-clock                #    5,996 CPUs utilized          
           340290      context-switches          #    0,434 K/sec                  
               212      cpu-migrations            #    0,000 K/sec                  
               253      page-faults               #    0,000 K/sec                  
 2418749827832      cycles                    #    3,082 GHz                      (41,66%)
   394531732276      instructions              #    0,16  insn per cycle           (50,01%)
    20804674516      branches                  #   26,514 M/sec                    (58,35%)
        11107908      branch-misses             #    0,05% of all branches          (66,67%)
    54747118838      L1-dcache-loads           #   69,771 M/sec                    (66,68%)
     2413557669      L1-dcache-load-misses     #    4,41% of all L1-dcache accesses  (66,68%)
       582485788      LLC-loads                 #    0,742 M/sec                    (66,67%)
           484122      LLC-load-misses           #    0,08% of all LL-cache accesses  (66,66%)
   <not supported>      L1-icache-loads                                             
       157690988      L1-icache-load-misses                                         (33,32%)
    54789336429      dTLB-loads                #   69,824 M/sec                    (33,33%)
           504440      dTLB-load-misses          #    0,00% of all dTLB cache accesses  (33,33%)
   <not supported>      iTLB-loads                                                  
         3304265      iTLB-load-misses                                              (33,34%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

     130,864386377 seconds time elapsed

    1025,676317000 seconds user
       2,552276000 seconds sys

The previously mentioned contention over the single atomic counter primarily manifests itself in a significant degradation of the instructions per cycle (IPC) metric: it goes from 0.68 insn/cycle on 2 threads, which is already quite low, to a pitiful 0.16 insn/cycle on 8 threads. Other metrics degrade proportionally. That's because each CPU core's backend spends a lot of time synchronizing the counter's cache line with the other cores.

To be precise, low IPC doesn't necessarily mean contention caused by atomic instructions. It may have other causes, such as random memory accesses over a data structure that doesn't fit into the CPU caches. If you'd like to detect this particular scenario, you should profile for specific PMU (Performance Monitoring Unit) events, as Travis Downs pointed out.

To wrap up the single atomic counter lock topic, let's compare it with the standard j.u.c.l.ReentrantReadWriteLock class. We're going to use the same benchmark running on 8 threads:

Benchmark                        (type)  Mode  Cnt     Score    Error  Units
ReadWriteLockBenchmark.testLock     JUC  avgt   10  2081.664 ± 15.117  ns/op
ReadWriteLockBenchmark.testLock  SIMPLE  avgt   10   289.155 ±  3.339  ns/op

Here the JUC type stands for the ReentrantReadWriteLock class, and SIMPLE means the single atomic counter lock. The difference is significant, almost 7x. Moreover, if we activate the GC profiler in the benchmark, we see that the standard class allocates around 200 MB/sec (not the end of the world, but it could be avoided), while the atomic counter lock does not allocate at all. So, if you have a concrete use case and you really know what you're doing, using a custom lock might be a good idea.

Long story short, a single atomic counter-based lock might do a better job than the standard Java library, but it has an important flaw: it imposes contention between readers even though that's not necessary at all. Readers only need to share memory with (think, synchronize with) the writer, not with each other. And, of course, there are algorithms that try to address the reader scalability problem. Let's quickly discuss the alternatives.

The first alternative I'm aware of is Dmitry Vyukov's distributed reader-writer mutex. It's written in C++, but it should be straightforward to port it to Java. The main idea is to shard the reader counter into an array of atomic counters. The size of the array is set at runtime to match the number of available cores. Each reader is associated with a slot in the array based on the id of the core running the reader thread. The id is obtained via the getcpu() system call available on Linux, so porting the lock to other OSes is problematic. Another problem is that threads may migrate between cores unless you pin them with thread affinity. To address this, D. Vyukov's lock returns the core id from the reader's lock() method, and this value has to be passed back when the reader calls unlock(). So, while this lock has potential, it's certainly not general-purpose.

Another worthwhile scalable readers-writer lock I know of is the BRAVO lock. The idea is quite close to D. Vyukov's, yet the sharded counter array is fixed-size, and the reader's slot is determined based on the thread id. The latter is not a hard requirement; it's possible to implement the BRAVO algorithm in, say, Go, which doesn't expose goroutine or thread ids by design. A Java implementation is available here. We're going to benchmark it against the single atomic counter lock now. If you're interested in learning more about the BRAVO lock, refer to the original paper.
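To illustrate the slot assignment, here is a sketch of how a reader could be mapped onto the fixed-size counter array (the mixing function and the table size are illustrative; the actual Java implementation linked above may differ):

// must be a power of two so that the mask below works
static final int TABLE_SIZE = 4096;

static int readerSlot() {
    long id = Thread.currentThread().getId();
    // cheap multiplicative hash to spread consecutive thread ids across the table
    long h = id * 0x9E3779B97F4A7C15L;
    return (int) ((h ^ (h >>> 32)) & (TABLE_SIZE - 1));
}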

Here is how the BRAVO lock does in the benchmark run on 8 threads:

Benchmark                          (type)  Mode  Cnt    Score   Error  Units
ReadWriteLockBenchmark.testLock    SIMPLE  avgt   10  296.300 ± 1.824  ns/op
ReadWriteLockBenchmark.testLock    BIASED  avgt   10   53.811 ± 1.548  ns/op

The BIASED type in the above JMH output is the BRAVO lock, while the SIMPLE type is the same atomic counter lock we previously discussed. As expected, the BRAVO lock solves the scalability issue for readers, but it also has some problematic parts. First, the array has to be large enough to avoid reader contention. The Java class we benchmarked sets the array size to 4096 slots, which means 16KB of memory; ideally, the array size should be chosen based on the hardware that runs your code. Second, even if the array is properly sized, readers may still land on the same slot. In fact, adjacent slots are enough to cause contention as long as they occupy the same cache line. That's because the slot assignment applies a hash function to the thread id and, hence, doesn't guarantee the absence of collisions.

So, both alternatives have certain cons and leave enough space for improvements. That's exactly what we came for.

Meet TLBiasedReadWriteLock

Yes, naming is not my strong suit. TLBiasedReadWriteLock stands for a thread-local reader biased lock. As the name suggests, the class uses a thread-local counter for each reader, while the writer lock is based on a spinlock.

The simplified reader's lock() method looks like the following:

@Override
public void lock() {
    // initialize or fetch a thread-local counter for the reader
    PaddedAtomicLong readerCounter = tlReaderCounter.get();
    for (;;) {
        // check if the writer lock is available and we can start the attempt
        if (!wLock.get()) {
            // increment the reader counter
            readerCounter.incrementAndGet();
            // check that no writer acquired the lock
            if (!wLock.get()) {
                // attempt is successful, we're done
                break;
            }
            // attempt failed, go for another spin
            readerCounter.decrementAndGet();
        }
        LockSupport.parkNanos(10);
    }
}

In this code, PaddedAtomicLong is nothing more than, well, a padded AtomicLong used as a reader counter. The padding is there to prevent false sharing, i.e. the situation when different readers are unlucky enough to have their counters laid out next to each other on the heap so that they end up in the same CPU cache line. When a reader accesses the thread-local counter for the first time, the counter gets added to a list of weak references. These weak references are checked by writers when they want to acquire the lock.
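A minimal sketch of what such a padded counter might look like (the actual class may differ; on modern JDKs the @Contended annotation is an alternative way to achieve the same effect):

import java.util.concurrent.atomic.AtomicLong;

// The unused long fields keep neighbouring counters roughly a cache line
// (typically 64 bytes) apart, so readers don't end up falsely sharing a line.
class PaddedAtomicLong extends AtomicLong {
    public long p1, p2, p3, p4, p5, p6, p7;
}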

You may wonder about the scalability of the above code in terms of readers. Yes, there are no writes to shared memory, but we read the writer lock value multiple times. Will this code scale? The answer is yes: concurrent reads (loads) of shared memory scale well, since each core keeps its own copy of the cache line and no ownership has to be transferred between cores as long as nobody writes to it.

Next, the writer's lock() method looks like this:

@Override
public void lock() {
    // first, acquire writer lock
    lock0();
    // next, wait for the readers
    Iterator<WeakReference<PaddedAtomicLong>> iterator = readerCounters.iterator();
    while (iterator.hasNext()) {
        // fetch reader counter's weak reference
        WeakReference<PaddedAtomicLong> ref = iterator.next();
        PaddedAtomicLong counter = ref.get();
        if (counter == null) {
            // clean up the counter since the reader thread stopped
            iterator.remove();
            continue;
        }
        while (counter.get() != 0) {
            // the reader still holds the lock, so sleep and go for another spin
            LockSupport.parkNanos(10);
        }
    }
}

As promised, this lock guarantees zero contention for readers since their counters are thread-local. The counters are allocated on demand, so in scenarios where only a few threads access the lock, its memory footprint should be lower than the BRAVO lock's. This lock should do its best when accessed from a fixed-size thread pool.
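For completeness, here is a sketch of the matching unlock paths under the same assumptions, i.e. that tlReaderCounter is the reader's thread-local counter and wLock is an AtomicBoolean-based spinlock (shown for the read-lock and write-lock views respectively):

// reader unlock: undo the increment made in lock()
@Override
public void unlock() {
    tlReaderCounter.get().decrementAndGet();
}

// writer unlock: release the spinlock so that spinning readers and writers can proceed
@Override
public void unlock() {
    wLock.set(false);
}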

If we benchmark this lock (TLBIASED type) against BRAVO (BIASED type) on 8 threads, we get this:

Benchmark                          (type)  Mode  Cnt    Score   Error  Units
ReadWriteLockBenchmark.testLock    BIASED  avgt   10   53.811 ± 1.548  ns/op
ReadWriteLockBenchmark.testLock  TLBIASED  avgt   10   54.682 ± 1.204  ns/op

As you would expect, the two locks are on par in terms of reader lock scalability and end performance. That said, if we sized the BRAVO lock's array less aggressively or ran the benchmark on many more cores/threads, the BRAVO lock would perform worse.

Final battle

Before we wrap up, let's compare all of the observed lock implementations in a slightly more realistic benchmark. To make it happen, we increase the work done in the critical section by 4x, so that it takes around 140 nanoseconds instead of the 22 nanoseconds used in the reader-only benchmark. Next, we change the benchmark so that the writer lock is occasionally acquired. The ratio between reader and writer lock calls will be 1,000:1, 10,000:1, or 100,000:1. The benchmark code may be found here.
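A sketch of how such a mixed benchmark method could pick between the reader and the writer path (illustrative only; doWork() stands for the critical section, and the actual benchmark linked above may be structured differently):

// per-thread call counter; roughly one call out of readWriteRatio takes the writer path
private long calls;

public long readWriteLocked(ReadWriteLock lock, int readWriteRatio) {
    if (++calls % readWriteRatio == 0) {
        lock.writeLock().lock();
        try {
            return doWork(); // the ~140 ns critical section
        } finally {
            lock.writeLock().unlock();
        }
    }
    lock.readLock().lock();
    try {
        return doWork();
    } finally {
        lock.readLock().unlock();
    }
}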

That's what we get when the benchmark is run on 8 threads:

Benchmark                                 (readWriteRatio)    (type)  Mode  Cnt     Score    Error  Units
ReadWriteLockBenchmark.testReadWriteLock              1000       JUC  avgt   10  1966.910 ± 25.643  ns/op
ReadWriteLockBenchmark.testReadWriteLock              1000    SIMPLE  avgt   10   668.782 ±  0.879  ns/op
ReadWriteLockBenchmark.testReadWriteLock              1000    BIASED  avgt   10   458.470 ±  3.169  ns/op
ReadWriteLockBenchmark.testReadWriteLock              1000  TLBIASED  avgt   10   598.463 ±  2.135  ns/op
ReadWriteLockBenchmark.testReadWriteLock             10000       JUC  avgt   10  1828.119 ± 41.570  ns/op
ReadWriteLockBenchmark.testReadWriteLock             10000    SIMPLE  avgt   10   593.159 ±  1.913  ns/op
ReadWriteLockBenchmark.testReadWriteLock             10000    BIASED  avgt   10   219.809 ±  3.359  ns/op
ReadWriteLockBenchmark.testReadWriteLock             10000  TLBIASED  avgt   10   226.159 ±  0.808  ns/op
ReadWriteLockBenchmark.testReadWriteLock            100000       JUC  avgt   10  1973.585 ± 14.310  ns/op
ReadWriteLockBenchmark.testReadWriteLock            100000    SIMPLE  avgt   10   584.412 ±  5.078  ns/op
ReadWriteLockBenchmark.testReadWriteLock            100000    BIASED  avgt   10   173.194 ±  1.360  ns/op
ReadWriteLockBenchmark.testReadWriteLock            100000  TLBIASED  avgt   10   174.324 ±  3.236  ns/op

Good old ReentrantReadWriteLock (JUC) is equally slow regardless of the read/write lock call ratio. The single atomic counter lock (SIMPLE) also doesn't improve much when there are fewer writer calls. For the 1,000:1 ratio, the single counter lock's performance is not that far from the performance of the BRAVO lock (BIASED) and the thread-local reader biased lock (TLBIASED).

What's interesting is that the BRAVO lock (BIASED) is a bit cheaper than the thread-local biased lock for the 1,000:1 read-write ratio, but as the share of reader calls grows, the difference disappears. That's because of the more expensive write lock call in the thread-local biased lock. Again, the strong point of the thread-local lock is the guaranteed absence of contention for readers. Since the BRAVO lock was put into the best possible conditions in terms of internal array sizing and the number of threads and CPU cores, we didn't observe any scalability issues with it in the conducted benchmarks.

What's in it for me?

First and foremost, I'd suggest using a custom lock implementation only in niche use cases and, what's even more important, only if you know what you're doing. In all other situations, the standard library should be the default way to go.

If you're certain that reader lock acquisitions dominate in your code, the critical sections are rather short, and the ReentrantReadWriteLock class shows up as a bottleneck in profiler reports, any of the observed alternatives has a chance to improve your application's performance.

Hopefully, you've learned something new today. Good luck and see you next time.