<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Random thoughts on concurrency, databases and distributed systems]]></title><description><![CDATA[Core database engineer at QuestDB. Distributed systems gazer. Node.js contributor. Occasional tech blogger and speaker.]]></description><link>https://puzpuzpuz.dev</link><generator>RSS for Node</generator><lastBuildDate>Sun, 10 May 2026 23:31:40 GMT</lastBuildDate><atom:link href="https://puzpuzpuz.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Multithreaded Execution Model for Queries With ORDER BY and LIMIT Clauses]]></title><description><![CDATA[This small post continues the previous post dedicated to multithreaded query execution in databases with column-oriented storage format. This time we'll consider queries like the following:
SELECT SearchPhrase
FROM hits
WHERE SearchPhrase IS NOT NULL...]]></description><link>https://puzpuzpuz.dev/multithreaded-execution-model-for-queries-with-order-by-and-limit-clauses</link><guid isPermaLink="true">https://puzpuzpuz.dev/multithreaded-execution-model-for-queries-with-order-by-and-limit-clauses</guid><category><![CDATA[Databases]]></category><category><![CDATA[multithreading]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Sun, 09 Jun 2024 09:18:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/oJ2CpJ50ptE/upload/ed535b37e5b7658d2358860febb12358.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This small post continues the <a target="_blank" href="https://puzpuzpuz.dev/multithreaded-scatter-gather-execution-model-for-analytical-queries">previous post</a> dedicated to multithreaded query execution in databases with column-oriented storage format. This time we'll consider queries like the following:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> SearchPhrase
<span class="hljs-keyword">FROM</span> hits
<span class="hljs-keyword">WHERE</span> SearchPhrase <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> SearchPhrase
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>;
</code></pre>
<p>It returns the top 10 rows sorted by non-NULL <code>SearchPhrase</code> column values in ascending order. The query is taken from the <a target="_blank" href="https://github.com/ClickHouse/ClickBench">ClickBench</a> benchmark.</p>
<p>In the case of large datasets, such queries can be efficiently executed on multiple threads. This requires several stages:</p>
<ol>
<li><p>The "query owner" thread divides the dataset into small enough frames. The maximum frame size varies in the range of 10K-1M rows in popular analytical databases. For instance, DuckDB uses 10K-row frames, while in QuestDB they're up to 1M rows in size. Then the thread publishes tasks containing the frames to a single-producer multi-consumer (SPMC) queue. The tasks should also contain the data required to execute the filter (<code>WHERE SearchPhrase IS NOT NULL</code> in our case).</p>
</li>
<li><p>Worker threads pick up the tasks from the queue and process them. First, they apply the filter. This step is optional as there may be no filter in the query. Then they try adding the filtered rows to a sorted data structure, such as a min/max heap or a red-black tree. The data structure may contain row IDs, in case the storage format supports random access, or materialized columns returned by the query, in case random access isn't possible due to data compression. After each task, the data structure contains at most the top 10 rows seen so far. Each worker thread should have its own instance of the data structure and use it when handling all tasks belonging to the given query.</p>
</li>
<li><p>Once all tasks are processed, the "query owner" thread needs to merge the worker "top 10 rows" data structures into a single one, so that it can return the result set back to the client. This can be done eagerly, e.g. the "query owner" thread may iterate through all worker data structures and put their rows into its data structure, or lazily via the <a target="_blank" href="https://en.wikipedia.org/wiki/K-way_merge_algorithm">k-way merge algorithm</a>.</p>
</li>
</ol>
<p>The 1st and 3rd stages above are serial, but they involve very little work, while the bulk of the work is done in the 2nd stage. Thanks to this, the execution model scales very nicely with the number of worker threads and CPU cores.</p>
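<p>The three stages could be sketched in Java roughly as follows. This is a minimal illustration under a few assumptions, not the way any particular database implements it: the column is a plain in-memory list, a shared <code>AtomicInteger</code> frame counter stands in for the SPMC queue, the merge is the eager variant, and the names <code>ParallelTopK</code>, <code>TopK</code> and <code>topK</code> are made up for the example.</p>
<pre><code class="lang-java">import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelTopK {

    // Bounded max-heap that keeps the `limit` smallest values seen so far.
    // Assumes limit &gt; 0.
    static final class TopK {
        final int limit;
        final PriorityQueue&lt;String&gt; heap = new PriorityQueue&lt;&gt;(Collections.reverseOrder());

        TopK(int limit) {
            this.limit = limit;
        }

        void add(String value) {
            if (heap.size() &lt; limit) {
                heap.add(value);
            } else if (value.compareTo(heap.peek()) &lt; 0) {
                // The new value beats the current worst one, so replace it.
                heap.poll();
                heap.add(value);
            }
        }
    }

    static List&lt;String&gt; topK(List&lt;String&gt; column, int limit, int frameSize, int workerCount)
            throws InterruptedException {
        // Stage 1: frames are identified by their index; the shared counter
        // is a stand-in for the SPMC task queue.
        AtomicInteger nextFrame = new AtomicInteger();
        TopK[] partials = new TopK[workerCount];
        Thread[] workers = new Thread[workerCount];
        for (int w = 0; w &lt; workerCount; w++) {
            // Each worker owns one structure and reuses it for all of its frames.
            TopK partial = partials[w] = new TopK(limit);
            workers[w] = new Thread(() -&gt; {
                // Stage 2: claim frames until none are left.
                for (;;) {
                    int from = nextFrame.getAndIncrement() * frameSize;
                    if (from &gt;= column.size()) {
                        return;
                    }
                    int to = Math.min(from + frameSize, column.size());
                    for (int i = from; i &lt; to; i++) {
                        String value = column.get(i);
                        if (value != null) { // WHERE SearchPhrase IS NOT NULL
                            partial.add(value);
                        }
                    }
                }
            });
            workers[w].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }

        // Stage 3: eager merge of the per-worker structures by the query owner.
        TopK merged = new TopK(limit);
        for (TopK partial : partials) {
            for (String value : partial.heap) {
                merged.add(value);
            }
        }
        List&lt;String&gt; result = new ArrayList&lt;&gt;(merged.heap);
        Collections.sort(result);
        return result;
    }
}
</code></pre>
<p>The merge touches at most <code>workerCount * limit</code> values, which is why the serial part stays negligible compared to the scan.</p>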
<p>I'm interested in learning more about analytical query execution models and related topics, so if you have anything to share, don't hesitate to write a comment. Have fun coding and see you next time.</p>
]]></content:encoded></item><item><title><![CDATA[An mmap-based hash table optimization]]></title><description><![CDATA[mmap is a controversial topic among the database community. The common belief is that it's no good for a database, but in my experience, it does a decent job for an analytical database, when you want to scan columns stored in a column-oriented format...]]></description><link>https://puzpuzpuz.dev/an-mmap-based-hash-table-optimization</link><guid isPermaLink="true">https://puzpuzpuz.dev/an-mmap-based-hash-table-optimization</guid><category><![CDATA[Databases]]></category><category><![CDATA[data structures]]></category><category><![CDATA[performance]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Sat, 01 Jun 2024 09:35:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/NIJuEQw0RKg/upload/b057886190582173a306c4057f3e9e4e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>mmap is a <a target="_blank" href="https://db.cs.cmu.edu/mmap-cidr2022/">controversial</a> <a target="_blank" href="https://www.symas.com/post/are-you-sure-you-want-to-use-mmap-in-your-dbms">topic</a> <a target="_blank" href="https://ravendb.net/articles/re-are-you-sure-you-want-to-use-mmap-in-your-database-management-system">among</a> the database community. The common belief is that it's no good for a database, but in my experience, it does a decent job for an analytical database, when you want to scan columns stored in a column-oriented format. Moreover, it unlocks some optimizations, unavailable with other file I/O APIs. This brief note describes one such optimization recently <a target="_blank" href="https://github.com/questdb/questdb/pull/4435">added</a> to QuestDB by <a target="_blank" href="https://x.com/jerrinot">Jaromir Hamala</a>.</p>
<p>As you may already know, hash tables play a <a target="_blank" href="https://puzpuzpuz.dev/multithreaded-scatter-gather-execution-model-for-analytical-queries">crucial role</a> not only in hash joins, but also in parallel GROUP BY execution. There is no "silver bullet" design for a hash table, so having data structures specialized for a given scenario is the key to efficient query execution. QuestDB's SQL engine currently uses 6 hash tables written in Java and one written in C++. These numbers are very likely to grow in the future.</p>
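<p>The hash table we're interested in is aimed at GROUP BY queries over a single varchar key, like the following one (QuestDB doesn't require an explicit <code>GROUP BY URL</code> clause here; the grouping is inferred from the aggregate function):</p>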
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">URL</span>, <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> c
<span class="hljs-keyword">FROM</span> hits
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> c <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>;
</code></pre>
<p>It's a hash table with linear probing. The data is stored in a contiguous chunk of native memory and has the following layout for key/value pairs:</p>
<pre><code class="lang-markdown">| Hash code 32 LSBs |    Size    | Varchar pointer | Value columns 0..V |
+-------------------+------------+-----------------+--------------------+
|       4 bytes     |  4 bytes   |     8 bytes     |         -          |
+-------------------+------------+-----------------+--------------------+
</code></pre>
<p>The trick is that the 3rd field here contains pointers to mmapped memory. So, the hash table itself doesn't hold copies of varchar byte arrays but instead points at external (stable) memory. This way, we don't need to allocate additional memory and do a varchar copy. Another nice side effect is a lower memory footprint, all thanks to page cache memory being elastic, i.e. the OS evicts pages on memory pressure and reloads them from the disk on future access.</p>
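<p>To make the idea more concrete, here is a small Java sketch of such a table. It's an illustration under stated assumptions, not QuestDB's implementation: keys are newline-delimited byte sequences in a file mapped via <code>FileChannel.map()</code>, the "pointer" is an offset into that mapping, the only value column is an 8-byte counter, the capacity is fixed (no resizing, so the table must never fill up), and all names (<code>MmapKeyCountMap</code>, <code>probe</code>, <code>countLines</code>) are hypothetical.</p>
<pre><code class="lang-java">import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapKeyCountMap {
    // Entry layout: hash (4 bytes) | size (4 bytes) | key offset (8 bytes) | count (8 bytes).
    private static final int ENTRY_SIZE = 24;

    private final ByteBuffer data;  // mmapped key bytes; they are never copied
    private final ByteBuffer table; // contiguous chunk of zeroed native memory
    private final int capacity;     // must be a power of two

    MmapKeyCountMap(ByteBuffer data, int capacity) {
        this.data = data;
        this.capacity = capacity;
        this.table = ByteBuffer.allocateDirect(capacity * ENTRY_SIZE);
    }

    // FNV-1a style hash over the key bytes that live in the mapping.
    private int hash(long offset, int size) {
        int h = 0x811c9dc5;
        for (int i = 0; i &lt; size; i++) {
            h = (h ^ data.get((int) offset + i)) * 0x01000193;
        }
        return h;
    }

    private boolean sameBytes(long a, long b, int size) {
        for (int i = 0; i &lt; size; i++) {
            if (data.get((int) a + i) != data.get((int) b + i)) {
                return false;
            }
        }
        return true;
    }

    // Linear probing: returns the position of the key's entry or of the
    // empty slot where the key belongs. A zero count marks an empty slot.
    private int probe(int hash, long offset, int size) {
        for (int slot = hash &amp; (capacity - 1); ; slot = (slot + 1) &amp; (capacity - 1)) {
            int base = slot * ENTRY_SIZE;
            if (table.getLong(base + 16) == 0) {
                return base;
            }
            if (table.getInt(base) == hash &amp;&amp; table.getInt(base + 4) == size
                    &amp;&amp; sameBytes(table.getLong(base + 8), offset, size)) {
                return base;
            }
        }
    }

    // Counts one more occurrence of the key stored at data[offset, offset + size).
    void increment(long offset, int size) {
        int hash = hash(offset, size);
        int base = probe(hash, offset, size);
        long count = table.getLong(base + 16);
        if (count == 0) {
            // New key: remember where its bytes are instead of copying them.
            table.putInt(base, hash);
            table.putInt(base + 4, size);
            table.putLong(base + 8, offset);
        }
        table.putLong(base + 16, count + 1);
    }

    long count(long offset, int size) {
        return table.getLong(probe(hash(offset, size), offset, size) + 16);
    }

    static MmapKeyCountMap countLines(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            MappedByteBuffer data = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            MmapKeyCountMap map = new MmapKeyCountMap(data, 1024);
            int start = 0;
            for (int i = 0; i &lt; data.limit(); i++) {
                if (data.get(i) == '\n') {
                    map.increment(start, i - start);
                    start = i + 1;
                }
            }
            if (start &lt; data.limit()) {
                map.increment(start, data.limit() - start);
            }
            return map;
        }
    }
}
</code></pre>
<p>Inserting a key writes 24 bytes into the table regardless of the key length, which is where the saved allocation and copy come from.</p>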
<p>Based on our benchmarks, this hash table is 10-30% faster than our "default" <a target="_blank" href="https://questdb.io/blog/building-faster-hash-table-high-performance-sql-joins/">hash table</a>. While that's not a lot, in the case of slow queries the execution time is reduced by several seconds, so it's worth it.</p>
<p>Have fun coding and see you next time.</p>
]]></content:encoded></item><item><title><![CDATA[Seqlock-Based Atomic Memory Snapshots]]></title><description><![CDATA[Last time we discussed k-word CAS algorithm and came to the conclusion that seqlock-based atomic snapshots may be used as an alternative in situations when your writes are infrequent and single-writer limitation is acceptable, but you want the reader...]]></description><link>https://puzpuzpuz.dev/seqlock-based-atomic-memory-snapshots</link><guid isPermaLink="true">https://puzpuzpuz.dev/seqlock-based-atomic-memory-snapshots</guid><category><![CDATA[concurrency]]></category><category><![CDATA[multithreading]]></category><category><![CDATA[algorithms]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Sun, 06 Aug 2023 11:10:02 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/mdgMbYfFlSA/upload/dca4ca8fa58a9751367c20813364de08.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last time we <a target="_blank" href="https://puzpuzpuz.dev/a-few-thoughts-on-k-word-cas">discussed</a> k-word CAS algorithm and came to the conclusion that seqlock-based atomic snapshots may be used as an alternative in situations when your writes are infrequent and single-writer limitation is acceptable, but you want the readers to be able to read multiple values atomically. Such use cases may be <a target="_blank" href="https://en.wikipedia.org/wiki/Seqlock">met</a> in the Linux kernel, but not only. For instance, QuestDB uses a <a target="_blank" href="https://github.com/questdb/questdb/blob/0fd8581f6a3ee98a0117cf620d48fa55d7d16c76/core/src/main/java/io/questdb/cairo/TxReader.java#L392-L428">similar approach</a> to read the metadata of the latest transaction atomically. So, today we'll consider an implementation in Java and try to measure how efficient it is from the reader's perspective.</p>
<p>To keep things simple, we'll write an atomic tuple class holding 3 <code>long</code> values. The structure of the class looks like the following:</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">AtomicLongTuple</span> </span>{

    <span class="hljs-comment">// Version is used for the seqlock.</span>
    <span class="hljs-keyword">long</span> version;
    <span class="hljs-comment">// 3 long fields:</span>
    <span class="hljs-keyword">long</span> x;
    <span class="hljs-keyword">long</span> y;
    <span class="hljs-keyword">long</span> z;

    <span class="hljs-comment">// Var handles for the fields:</span>
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">final</span> VarHandle VH_VERSION;
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">final</span> VarHandle VH_X;
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">final</span> VarHandle VH_Y;
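    <span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">final</span> VarHandle VH_Z;

    <span class="hljs-comment">// Holder reused by writers; it's only touched while the seqlock is held.</span>
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> TupleHolder writerHolder = <span class="hljs-keyword">new</span> TupleHolder();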

    <span class="hljs-comment">// Holder class used by readers and writers.</span>
    <span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">TupleHolder</span> </span>{
        <span class="hljs-keyword">public</span> <span class="hljs-keyword">long</span> x;
        <span class="hljs-keyword">public</span> <span class="hljs-keyword">long</span> y;
        <span class="hljs-keyword">public</span> <span class="hljs-keyword">long</span> z;
    }

    <span class="hljs-comment">// Readers provide a holder to avoid an allocation.</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">read</span><span class="hljs-params">(TupleHolder holder)</span> </span>{
        <span class="hljs-comment">// Implementation goes here...</span>
    }

    <span class="hljs-comment">// Writers provide a function to modify the latest state.</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">write</span><span class="hljs-params">(Consumer&lt;TupleHolder&gt; writer)</span> </span>{
        <span class="hljs-comment">// Implementation goes here...</span>
    }
}
</code></pre>
<p>Here we have the tuple, i.e. 3 <code>long</code> fields, plus one more <code>long</code> field called <code>version</code>. As we'll see shortly, the version is needed to implement the seqlock. Of course, the above code doesn't include boring details such as padding or <code>VarHandle</code> initialization, but you may find the full source code <a target="_blank" href="https://github.com/puzpuzpuz/java-concurrency-samples/blob/5bdfc514a859867e90ab36217f952a32835f94ea/src/main/java/io/puzpuzpuz/atomic/AtomicLongTuple.java">here</a>.</p>
<p>To learn about the seqlock, let's start with the <code>write()</code> method:</p>
<pre><code class="lang-java"><span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">write</span><span class="hljs-params">(Consumer&lt;TupleHolder&gt; writer)</span> </span>{
    <span class="hljs-keyword">for</span> (;;) {
        <span class="hljs-keyword">final</span> <span class="hljs-keyword">long</span> version = (<span class="hljs-keyword">long</span>) VH_VERSION.getAcquire(<span class="hljs-keyword">this</span>);
        <span class="hljs-keyword">if</span> ((version &amp; <span class="hljs-number">1</span>) == <span class="hljs-number">1</span>) {
            <span class="hljs-comment">// Another write is in progress. Back off and keep spinning.</span>
            LockSupport.parkNanos(<span class="hljs-number">1</span>);
            <span class="hljs-keyword">continue</span>;
        }

        <span class="hljs-comment">// Try to update the version to an odd value (write intent).</span>
        <span class="hljs-comment">// We don't use compareAndExchangeRelease here to avoid</span>
        <span class="hljs-comment">// an additional full fence following this operation.</span>
        <span class="hljs-keyword">final</span> <span class="hljs-keyword">long</span> currentVersion = (<span class="hljs-keyword">long</span>) VH_VERSION.compareAndExchange(<span class="hljs-keyword">this</span>, version, version + <span class="hljs-number">1</span>);
        <span class="hljs-keyword">if</span> (currentVersion != version) {
            <span class="hljs-comment">// Someone else started writing. Back off and try again.</span>
            LockSupport.parkNanos(<span class="hljs-number">10</span>);
            <span class="hljs-keyword">continue</span>;
        }

        <span class="hljs-comment">// Apply the write.</span>
        writerHolder.x = (<span class="hljs-keyword">long</span>) VH_X.getOpaque(<span class="hljs-keyword">this</span>);
        writerHolder.y = (<span class="hljs-keyword">long</span>) VH_Y.getOpaque(<span class="hljs-keyword">this</span>);
        writerHolder.z = (<span class="hljs-keyword">long</span>) VH_Z.getOpaque(<span class="hljs-keyword">this</span>);
        writer.accept(writerHolder);
        VH_X.setOpaque(<span class="hljs-keyword">this</span>, writerHolder.x);
        VH_Y.setOpaque(<span class="hljs-keyword">this</span>, writerHolder.y);
        VH_Z.setOpaque(<span class="hljs-keyword">this</span>, writerHolder.z);

        <span class="hljs-comment">// Update the version to an even value (write finished).</span>
        VH_VERSION.setRelease(<span class="hljs-keyword">this</span>, version + <span class="hljs-number">2</span>);
        <span class="hljs-keyword">return</span>;
    }
}
</code></pre>
<p>The above code uses acquire/release memory semantics, so if you're not familiar with it, refer to <a target="_blank" href="https://puzpuzpuz.dev/using-acquirerelease-semantics-in-java-atomics-for-fun-and-profit">this post</a>. Other than that, the code is quite simple: each writer spins trying to increment the <code>version</code> field to an odd value to prevent concurrent writes (think of a <a target="_blank" href="https://puzpuzpuz.dev/benchmarking-non-shared-locks-in-java">spinlock</a>) and, when it succeeds, applies the given function to the current tuple state. Finally, it increments the version once more, so that it has an even value telling readers and other writers that there is no ongoing write.</p>
<p>Notice that we're using relaxed (opaque) loads and stores to fetch and modify the tuple in the "critical section". They don't impose any ordering on their own and the tuple as a whole isn't accessed atomically, but that's fine, all thanks to the atomicity (and ordering guarantees) of the surrounding operations and memory fences.</p>
<p>Now, let's check what readers do:</p>
<pre><code class="lang-java"><span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">read</span><span class="hljs-params">(TupleHolder holder)</span> </span>{
    <span class="hljs-keyword">for</span> (;;) {
        <span class="hljs-keyword">final</span> <span class="hljs-keyword">long</span> version = (<span class="hljs-keyword">long</span>) VH_VERSION.getAcquire(<span class="hljs-keyword">this</span>);
        <span class="hljs-keyword">if</span> ((version &amp; <span class="hljs-number">1</span>) == <span class="hljs-number">1</span>) {
            <span class="hljs-comment">// A write is in progress. Back off and keep spinning.</span>
            LockSupport.parkNanos(<span class="hljs-number">1</span>);
            <span class="hljs-keyword">continue</span>;
        }

        <span class="hljs-comment">// Read the tuple.</span>
        holder.x = (<span class="hljs-keyword">long</span>) VH_X.getOpaque(<span class="hljs-keyword">this</span>);
        holder.y = (<span class="hljs-keyword">long</span>) VH_Y.getOpaque(<span class="hljs-keyword">this</span>);
        holder.z = (<span class="hljs-keyword">long</span>) VH_Z.getOpaque(<span class="hljs-keyword">this</span>);

        <span class="hljs-comment">// We don't want the below load to bubble up, hence the fence.</span>
        VarHandle.loadLoadFence();

        <span class="hljs-keyword">final</span> <span class="hljs-keyword">long</span> currentVersion = (<span class="hljs-keyword">long</span>) VH_VERSION.getAcquire(<span class="hljs-keyword">this</span>);
        <span class="hljs-keyword">if</span> (currentVersion == version) {
            <span class="hljs-comment">// The version didn't change, so the atomic snapshot succeeded.</span>
            <span class="hljs-keyword">return</span>;
        }
    }
}
</code></pre>
<p>The reader's code is even simpler than the writer's. Each reader spins trying to read an even version, then reads the tuple state. Finally, it reads the version once again and compares it with the previously read value. If the value hasn't changed, the reader was able to read the tuple atomically, so the operation finishes. Again, the tuple as a whole isn't read in a single atomic operation, but thanks to the surrounding operations and fences that's not needed.</p>
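<p>For those who want to run the protocol end to end, below is a compact, self-contained sketch that follows the same steps for a pair of <code>long</code> values and includes the <code>VarHandle</code> initialization. It's a simplified illustration rather than the code from the linked repository: the class name <code>SeqlockPair</code>, the <code>demo()</code> method, the array-based holder and the use of <code>Thread.onSpinWait()</code> for backoff are all assumptions made for the example, and padding is omitted.</p>
<pre><code class="lang-java">import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class SeqlockPair {
    private long version;
    private long x;
    private long y;

    private static final VarHandle VH_VERSION;
    private static final VarHandle VH_X;
    private static final VarHandle VH_Y;

    static {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            VH_VERSION = lookup.findVarHandle(SeqlockPair.class, "version", long.class);
            VH_X = lookup.findVarHandle(SeqlockPair.class, "x", long.class);
            VH_Y = lookup.findVarHandle(SeqlockPair.class, "y", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public void write(long newX, long newY) {
        for (;;) {
            long v = (long) VH_VERSION.getAcquire(this);
            // Claim the lock by moving the version from even to odd.
            if ((v &amp; 1) == 0 &amp;&amp; (long) VH_VERSION.compareAndExchange(this, v, v + 1) == v) {
                VH_X.setOpaque(this, newX);
                VH_Y.setOpaque(this, newY);
                // Publish the write by making the version even again.
                VH_VERSION.setRelease(this, v + 2);
                return;
            }
            Thread.onSpinWait();
        }
    }

    // Fills out[0] and out[1] with a consistent snapshot of (x, y).
    public void read(long[] out) {
        for (;;) {
            long v = (long) VH_VERSION.getAcquire(this);
            if ((v &amp; 1) == 0) {
                out[0] = (long) VH_X.getOpaque(this);
                out[1] = (long) VH_Y.getOpaque(this);
                // Keep the version re-read below the data loads.
                VarHandle.loadLoadFence();
                if ((long) VH_VERSION.getAcquire(this) == v) {
                    return;
                }
            }
            Thread.onSpinWait();
        }
    }

    // Runs one writer that maintains x + y == 0 while the calling thread
    // keeps reading. Returns the number of snapshots violating the invariant.
    static long demo(int writes) throws InterruptedException {
        SeqlockPair pair = new SeqlockPair();
        Thread writer = new Thread(() -&gt; {
            for (long i = 1; i &lt;= writes; i++) {
                pair.write(i, -i);
            }
        });
        writer.start();
        long torn = 0;
        long[] snapshot = new long[2];
        while (writer.isAlive()) {
            pair.read(snapshot);
            if (snapshot[0] + snapshot[1] != 0) {
                torn++;
            }
        }
        writer.join();
        return torn;
    }
}
</code></pre>
<p>If the version checks were removed from <code>read()</code>, the reader could observe <code>x</code> from one write and <code>y</code> from another, and the invariant check in <code>demo()</code> would be expected to fail sooner or later.</p>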
<p>A "classical" seqlock assumes a separate sequence number and a mutex. Using a mutex over a hand-made spinlock is a wise choice since mutexes in modern runtimes and OSes have come a long way and they're not as expensive as they used to be. In our case, the <code>version</code> field combines the sequence number and the mutex (to be more precise, a spinlock). For educational purposes, that's not really important since the algorithm is the same. Thanks to the seqlock, we were able to implement atomic memory snapshots with a few lines of code.</p>
<p>As you may have noticed, the algorithm is not lock-free: if a writer thread gets blocked after it has incremented the version to an odd value, no reader or writer is able to make progress. Considering this, you may be asking yourself how practical this approach is. It's certainly not versatile and only makes sense in use cases where the writes are infrequent and the total amount of memory to be read stays within a few cache lines (ideally, within a single cache line, i.e. 64 bytes on most modern machines). In other scenarios, it's better to use a good old exclusive or <a target="_blank" href="https://puzpuzpuz.dev/scalable-readers-writer-lock">shared</a> mutex.</p>
<p>If you're curious about how many wasteful spins a reader has to make before making a successful snapshot, there is a <a target="_blank" href="https://github.com/puzpuzpuz/java-concurrency-samples/blob/e180af67323b324e2d69f8cce422e1d9b618b2c6/src/test/java/io/puzpuzpuz/atomic/AtomicLongTupleTest.java#L36">test</a> for that. It starts a few reader threads that spin over the <code>read()</code> method and a single writer thread that calls <code>write()</code> and then emulates a bit of work by calling <code>Blackhole.consumeCPU(50)</code> on the <code>Blackhole</code> class from JMH. When the test finishes, each reader thread reports the <code>wasteful_spins / total_spins</code> ratio which, on my Linux machine, is around 0.7-0.9. So, in the presence of relatively frequent writes, no reader has to do more than a couple of spins on average.</p>
<p>That's it for today. Good luck and see you next time.</p>
]]></content:encoded></item><item><title><![CDATA[A Few Thoughts on K-Word CAS]]></title><description><![CDATA[A few months ago I went through Efficient Multi-word Compare and Swap paper, so here are a few thoughts on the algorithm. Long story short, I have mixed feelings about this k-word CAS algorithm. It focuses on nice properties of the CAS operation, suc...]]></description><link>https://puzpuzpuz.dev/a-few-thoughts-on-k-word-cas</link><guid isPermaLink="true">https://puzpuzpuz.dev/a-few-thoughts-on-k-word-cas</guid><category><![CDATA[concurrency]]></category><category><![CDATA[multithreading]]></category><category><![CDATA[algorithms]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Thu, 06 Jul 2023 17:46:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/pxVOztBa6mY/upload/55fbf363651dcae3f3b9ec78734e4f2e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few months ago I went through <a target="_blank" href="https://arxiv.org/abs/2008.02527">Efficient Multi-word Compare and Swap</a> paper, so here are a few thoughts on the algorithm. Long story short, I have mixed feelings about this k-word CAS algorithm. It focuses on nice properties of the <a target="_blank" href="https://en.wikipedia.org/wiki/Compare-and-swap">CAS operation</a>, such as lock-freedom and linearizability, while overlooking atomic k-word reads.</p>
<p>The core idea of the algorithm is that writers do a loop of CAS operations on each stored word. They try to swap the value (or a pointer) with a pointer to the so-called operation descriptor structure. The structure includes old and new values, as well as the operation status. Once a writer successfully completes all CASes, it does one more CAS on the status field marking the operation as complete. So, the algorithm requires k+1 single-word CAS operations per k-word CAS.</p>
<p>Indeed, this k-word CAS algorithm is lock-free and linearizable, but if you also want to be able to read all k words atomically, the algorithm won't be of any help. Of course, you can do a no-op k-word CAS to do an atomic read, but such a read may be costly. One more downside is that being able to swap a primitive value with a pointer means that the algorithm is not meant to be used in a language with a GC. In theory, it can still be used in languages with a non-moving GC, such as Go, but only with unsafe things like pointer tagging. Also, if you're fine with a less compact memory layout, then, say, in Java the algorithm may be modified to use an array of <code>AtomicReference&lt;Object&gt;</code> instead of an array of a primitive type (usually, <code>long[]</code>).</p>
<p>If lock-freedom is not a must, a <a target="_blank" href="https://en.wikipedia.org/wiki/Seqlock">seqlock-like</a> approach might do just fine. The sequence field could be used to preserve exclusive writer access, as well as to let readers determine whether they got an atomic snapshot of all k words. If lock-freedom is important, then in languages with a GC there is an even simpler option, which is to use an immutable data structure and let the writers do a single-word CAS swapping the pointer to the old values with a pointer to the new ones. Finally, a good old lock could be used, optionally a much more scalable reader-writer one.</p>
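<p>The immutable data structure option is short enough to show in full. The following Java sketch assumes k = 3 and uses made-up names (<code>ImmutableTuple</code>, <code>Snapshot</code>, <code>update</code>); it trades an allocation per write for lock-free writes and single-load atomic reads.</p>
<pre><code class="lang-java">import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

public class ImmutableTuple {

    // Immutable snapshot of all k = 3 words.
    public static final class Snapshot {
        public final long x;
        public final long y;
        public final long z;

        public Snapshot(long x, long y, long z) {
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    private final AtomicReference&lt;Snapshot&gt; ref = new AtomicReference&lt;&gt;(new Snapshot(0, 0, 0));

    // Readers get all k words atomically with a single volatile load.
    public Snapshot read() {
        return ref.get();
    }

    // Writers build a new snapshot and publish it with a single-word CAS,
    // retrying if another writer got there first.
    public Snapshot update(UnaryOperator&lt;Snapshot&gt; fn) {
        for (;;) {
            Snapshot current = ref.get();
            Snapshot next = fn.apply(current);
            if (ref.compareAndSet(current, next)) {
                return next;
            }
        }
    }
}
</code></pre>
<p>A call like <code>tuple.update(s -&gt; new ImmutableTuple.Snapshot(s.x + 1, s.y, s.z))</code> behaves like a 3-word CAS loop from the caller's point of view, while old snapshots are reclaimed by the GC.</p>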
]]></content:encoded></item><item><title><![CDATA[The Secret Life of fsync]]></title><description><![CDATA[Several times I've heard opinions that many mass-market SSDs and HDDs don't provide sufficient durability guarantees and Linux can do nothing with that. Namely, after an fsync() call recently modified data can still sit in the drive's volatile write ...]]></description><link>https://puzpuzpuz.dev/the-secret-life-of-fsync</link><guid isPermaLink="true">https://puzpuzpuz.dev/the-secret-life-of-fsync</guid><category><![CDATA[Databases]]></category><category><![CDATA[Linux]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Fri, 31 Mar 2023 18:49:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/OQxJd-eGuhg/upload/00acb4791d49fbed3c3bdc3a6c9b5207.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Several times I've heard opinions that many mass-market SSDs and HDDs don't provide sufficient durability guarantees and Linux can do nothing with that. Namely, after an <a target="_blank" href="https://man7.org/linux/man-pages/man2/fsync.2.html"><code>fsync</code></a><code>()</code> call recently modified data can still sit in the drive's volatile write cache and, thus, it may be lost in case of a power failure. If you want any meaningful durability, you should go for enterprise-grade drives that have a battery/capacitor so that they can flush the data to persistent storage on power loss. Is it really so? Let's find out.</p>
<p>First, let's check what the POSIX.1-2017 <a target="_blank" href="https://pubs.opengroup.org/onlinepubs/9699919799/">specification</a> says about <code>fsync</code>:</p>
<blockquote>
<p>The <em>fsync</em>() function shall request that all data for the open file descriptor named by <em>fildes</em> is to be transferred to the storage device associated with the file described by <em>fildes</em>. The nature of the transfer is implementation-defined.</p>
</blockquote>
<p>The above description is rather vague. If the OS issues operations to write the data to the disk's volatile cache, that's a "transfer", so formally such an OS would be POSIX-compliant. The informative section of the spec sheds more light on what a proper <code>fsync</code> implementation should do:</p>
<blockquote>
<p>The <em>fsync</em>() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the <em>fsync</em>() call is recorded on the disk. Since the concepts of "buffer cache", "system crash", "physical write", and "non-volatile storage" are not defined here, the wording has to be more abstract.</p>
</blockquote>
<p>OK, that's much more specific. Once an <code>fsync</code> call is made, the data should become durable in the face of a system crash, e.g. due to a power loss. But is it really something Linux does on an <code>fsync</code>?</p>
<p>As with many other FS-related system calls, most (if not all) file systems have their own implementation of <code>fsync</code>. To keep things simpler, we're going to check the ext4 implementation in a recent 6.x kernel code base. We should be looking at the <a target="_blank" href="https://github.com/torvalds/linux/blob/62bad54b26db8bc98e28749cd76b2d890edb4258/fs/ext4/fsync.c#L129-L187"><code>ext4_sync_file()</code></a> function which is invoked on an <code>fsync()</code> call. It involves the following steps:</p>
<ol>
<li><p>First, it writes all dirty pages belonging to the file that corresponds to the input file descriptor to the disk. That's done by the <code>file_write_and_wait_range()</code> function. As a result, the data may be sitting in the disk's volatile cache, so that's not what we're looking for.</p>
</li>
<li><p>Next, it writes the inode's metadata to the disk. Depending on whether journaling is enabled on the FS or not, it's done via a specific function, e.g. <code>ext4_fsync_journal()</code>. Again, not something we're in search of.</p>
</li>
<li><p>Finally, if the <code>needs_barrier</code> variable is true, it calls the <code>blkdev_issue_flush()</code> function. That's probably what we need, isn't it?</p>
</li>
</ol>
<p>Let's leave the <code>needs_barrier</code> variable out of the equation for now and check what <code>blkdev_issue_flush()</code> does. This <a target="_blank" href="https://github.com/torvalds/linux/blob/62bad54b26db8bc98e28749cd76b2d890edb4258/block/blk-flush.c#L462-L468">function</a> queues a flush operation to the block device and waits until it's finished. The operation has the <code>REQ_PREFLUSH</code> bit set among the flags. If we open the kernel docs, we'll find <a target="_blank" href="https://docs.kernel.org/block/writeback_cache_control.html#explicit-cache-flushes">some information</a> on this flag (and on related ones):</p>
<blockquote>
<p>In addition the REQ_PREFLUSH flag can be set on an otherwise empty bio structure, which causes only an explicit cache flush without any dependent I/O. It is recommend to use the blkdev_issue_flush() helper for a pure cache flush.</p>
</blockquote>
<p>As we expected, the <code>REQ_PREFLUSH</code> flag tells the block device that it should flush its volatile cache to the persistent storage. The related <code>REQ_FUA</code> flag asks that the completion of a particular write is signaled only after its data has been committed to non-volatile storage. Drivers for any well-behaved disk with a volatile write cache <a target="_blank" href="https://docs.kernel.org/block/writeback_cache_control.html#implementation-details-for-request-fn-based-block-drivers">should handle</a> these flags properly. Obviously, disks without such a cache don't need to bother with these operations and flags.</p>
<p>Now, what's the deal with <code>needs_barrier</code>? In both <a target="_blank" href="https://github.com/torvalds/linux/blob/62bad54b26db8bc98e28749cd76b2d890edb4258/fs/jbd2/journal.c#L638-L670">journaled</a> and <a target="_blank" href="https://github.com/torvalds/linux/blob/62bad54b26db8bc98e28749cd76b2d890edb4258/fs/ext4/fsync.c#L98-L99">non-journaled</a> ext4 code paths, it appears to be nothing more than an optimization to avoid sending flush requests to the disk multiple times. Note that you can also configure ext4 not to issue flush operations. For example, in non-journaled mode it's done via the <code>EXT4_DEFM_NOBARRIER</code> mount option.</p>
<p>Other file systems have their own specifics, but the overall logic should be close enough to the ext4's one. So unlike <a target="_blank" href="https://news.ycombinator.com/item?id=30370551">macOS</a>, Linux does its best to transfer the data to persistent storage on <code>fsync</code>. Of course, this doesn't protect you from a flawed driver implementation written for a cheap no-name drive, but it also means that if you have a decent SSD from a well-known brand, you may be fine without an enterprise-grade disk.</p>
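<p>From the application side, all of the above is triggered by a single call. Here is a minimal Go sketch, assuming a Linux machine where <code>File.Sync</code> ends up in <code>fsync(2)</code>; the <code>writeDurably</code> helper and the file name are made up for illustration:</p>
<pre><code class="lang-go">package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeDurably is a hypothetical helper: it writes data to path and
// asks the kernel to flush it to the storage device before returning.
func writeDurably(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	// Without this call the data may still sit in the page cache
	// (and in the disk's volatile cache) when we return.
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

func main() {
	path := filepath.Join(os.TempDir(), "fsync-sketch.dat")
	if err := writeDurably(path, []byte("hello")); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println("data written and synced")
}
</code></pre>
<p>Whether the bytes really reach persistent media still depends on the file system configuration and the drive, as discussed above.</p>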
]]></content:encoded></item><item><title><![CDATA[Multithreaded Scatter-Gather Execution Model for Analytical Queries]]></title><description><![CDATA[Today we'll discuss an approach used in some analytical databases to speed up the execution of queries at the cost of additional HW resources, namely CPU cores and memory. The query execution model is usually referred to as "scatter-gather", yet it's...]]></description><link>https://puzpuzpuz.dev/multithreaded-scatter-gather-execution-model-for-analytical-queries</link><guid isPermaLink="true">https://puzpuzpuz.dev/multithreaded-scatter-gather-execution-model-for-analytical-queries</guid><category><![CDATA[Databases]]></category><category><![CDATA[multithreading]]></category><category><![CDATA[distributed system]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Sat, 14 Jan 2023 11:03:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/fqoq39Jj5us/upload/02a80c024138f526960cd112ada0fb6e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today we'll discuss an approach used in some analytical databases to speed up the execution of queries at the cost of additional HW resources, namely CPU cores and memory. The query execution model is usually referred to as "scatter-gather", yet it's hard to find an article with a good amount of detail for this model (at least, I failed to do that), so I decided to write a brief post on the topic.</p>
<p>To be slightly more concrete, here is a simple example of a query that can benefit from scatter-gather execution:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> sensor_id, <span class="hljs-keyword">max</span>(temp), <span class="hljs-keyword">min</span>(temp), <span class="hljs-keyword">avg</span>(temp)
<span class="hljs-keyword">FROM</span> temperature
<span class="hljs-keyword">WHERE</span> sensor_id <span class="hljs-keyword">IN</span> (<span class="hljs-number">402</span>, <span class="hljs-number">1202</span>, <span class="hljs-number">3983</span>)
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> sensor_id;
</code></pre>
<p>To be able to use the scatter-gather model, the storage format used in the database needs to support parallelism, i.e. it should be possible to divide the on-disk data into individual parts that can be independently scanned. In practice, this usually means log-structured append-only storage format (which is in many cases also columnar) but <a target="_blank" href="https://www.postgresql.org/docs/15/parallel-plans.html#PARALLEL-AGGREGATION">not necessarily</a>.</p>
<p>Now, let's consider the execution of the above query.</p>
<h3 id="heading-scatter">Scatter</h3>
<p>For the sake of simplicity, let's assume that there is no index on the <code>sensor_id</code> column, and table columns are stored in append-only log files. In this case, it's trivial to split the log files to be scanned into N chunks based on offsets. That's basically what the original thread does when scattering the query execution. Each of the chunks is serialized into a task object/struct along with the query execution plan details, such as selected columns, aggregate functions and filter, and written into an in-memory queue to be picked up by worker threads.</p>
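<p>To make the splitting step more tangible, here is a rough Go sketch (Go 1.21+ for the <code>min</code>/<code>max</code> built-ins). The <code>task</code> struct and the <code>scatter</code> function are hypothetical names, and the sketch works with row offsets only, leaving out the query plan details a real task would carry:</p>
<pre><code class="lang-go">package main

import "fmt"

// task is a hypothetical unit of work: a half-open row range [lo, hi).
type task struct {
	lo, hi int64
}

// scatter splits rowCount rows (assumed non-negative) into at most n
// tasks of near-equal size.
func scatter(rowCount int64, n int) []task {
	n = max(n, 1)
	tasks := make([]task, 0, n)
	// Ceiling division, so that the last chunk absorbs the remainder.
	chunk := (rowCount + int64(n) - 1) / int64(n)
	lo := int64(0)
	for lo != rowCount {
		hi := min(lo+chunk, rowCount)
		tasks = append(tasks, task{lo: lo, hi: hi})
		lo = hi
	}
	return tasks
}

func main() {
	// In a real engine each task would be pushed to a queue
	// polled by the worker threads.
	for _, t := range scatter(10, 4) {
		fmt.Printf("scan rows [%d, %d)\n", t.lo, t.hi)
	}
}
</code></pre>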
<p>Let's refer to the original thread as the orchestrator thread. If the original thread belongs to the same thread pool, it must participate as a worker, i.e. it should poll tasks from the queue and execute them. That's to avoid starvation and deadlocks in the situation when all threads try to orchestrate their own queries.</p>
<p>When a worker thread picks up a task, it starts executing it. It scans through the data, applies the filter (<code>WHERE</code> clause) and calculates the intermediate result for each aggregate function (<code>min</code>, <code>max</code> and <code>avg</code> in our example). A convenient way to store the result is a hash table holding <code>&lt;int, &lt;int, int, float&gt;&gt;</code> key-value pairs. Here, we use <code>int</code> type for the key assuming that <code>sensor_id</code> is an integer column and each <code>&lt;int, int, float&gt;</code> tuple stands for the intermediate results of our three aggregate functions.</p>
<p>Once a task is executed, the worker must write the intermediate result (hash table) into an in-memory queue to be consumed and gathered by the orchestrator thread.</p>
<h3 id="heading-gather-and-merge">Gather (and merge)</h3>
<p>The orchestrator thread needs a hash table to store query results. Initially, it's empty, but as soon as the thread consumes (gathers) a result from one of the workers, it needs to merge two tables. The merge is simple thanks to the natural properties of the aggregate functions:</p>
<pre><code class="lang-java">min(A ∪ B) = min(min(A), min(B))
max(A ∪ B) = max(max(A), max(B))
count(A ∪ B) = count(A) + count(B)
sum(A ∪ B) = sum(A) + sum(B)
avg(A ∪ B) = sum(A ∪ B) / count(A ∪ B)
</code></pre>
<p>It's not a complete list, but most scalar aggregate functions assume little data to be stored as their state, so they fit into scatter-gather nicely.</p>
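<p>The merge rules translate into code almost literally. Below is a minimal Go sketch (Go 1.21+), assuming <code>float64</code> values and made-up names (<code>aggState</code>, <code>mergeInto</code>). Note that <code>avg</code> is carried as a sum and a count, since two averages alone can't be merged:</p>
<pre><code class="lang-go">package main

import "fmt"

// aggState is a hypothetical mergeable state for min, max and avg
// over a single group (sensor_id).
type aggState struct {
	minV, maxV float64
	sum        float64
	count      int64
}

func newState(v float64) aggState {
	return aggState{minV: v, maxV: v, sum: v, count: 1}
}

// merge combines the states of two disjoint sets of rows.
func (a aggState) merge(b aggState) aggState {
	return aggState{
		minV:  min(a.minV, b.minV),
		maxV:  max(a.maxV, b.maxV),
		sum:   a.sum + b.sum,
		count: a.count + b.count,
	}
}

func (a aggState) avg() float64 { return a.sum / float64(a.count) }

// mergeInto folds a worker's partial result into the orchestrator's table.
func mergeInto(dst, src map[int]aggState) {
	for id, s := range src {
		if cur, ok := dst[id]; ok {
			dst[id] = cur.merge(s)
		} else {
			dst[id] = s
		}
	}
}

func main() {
	result := map[int]aggState{}
	worker1 := map[int]aggState{402: newState(20).merge(newState(22))}
	worker2 := map[int]aggState{402: newState(18), 1202: newState(30)}
	mergeInto(result, worker1)
	mergeInto(result, worker2)
	s := result[402]
	fmt.Println(s.minV, s.maxV, s.avg())
}
</code></pre>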
<p>As soon as the orchestrator gathers and merges the last task result, it has the query result to be returned to the client.</p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>Scatter-gather(-merge) is a simple single-stage execution model. It works nicely for relatively trivial GROUP BY queries with an optional WHERE clause while more complex queries involving JOINs require a more complex multi-stage parallel execution.</p>
<p>Of course, <a target="_blank" href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's law</a> applies to the scatter-gather model, so if the serial part of the total work is significant, the speedup from parallelism will be modest. This may be fixed by <a target="_blank" href="https://duckdb.org/2022/03/07/aggregate-hashtable.html">an approach</a> similar to radix-partitioning. The idea is to split each worker's hash table into a fixed number of hash tables based on a few of the highest bytes of the hash code. Then, at a later stage, the merge can be done in parallel for each set of these hash tables. As a nice side effect, it enables parallelism for later stages like ORDER BY + LIMIT.</p>
<p>Another advantage of this model is that it naturally applies to distributed databases. We can easily swap "thread" with "node" in the above text with no other changes except for the storage requirement. The data has to be sharded across cluster nodes and the orchestrator node has to be aware of the sharding scheme so that it knows where the data is located when it distributes the work.</p>
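<p>A toy Go sketch of the partitioned merge may help here. Everything in it is an assumption made for illustration: the tables hold plain per-key counts, the partition is taken from the two highest bits of the hash code, and <code>hashKey</code> is a stand-in multiplicative hash rather than what any particular database uses:</p>
<pre><code class="lang-go">package main

import (
	"fmt"
	"sync"
)

const (
	partitionBits = 2
	numPartitions = 4 // 2^partitionBits
)

type partitions [numPartitions]map[int]int64

// hashKey is a stand-in hash function (an assumption of this sketch).
func hashKey(key int) uint64 { return uint64(key) * 0x9E3779B97F4A7C15 }

// partitionOf picks a partition from the highest bits of the hash code.
func partitionOf(hash uint64) int { return int(hash >> (64 - partitionBits)) }

// split breaks one worker's table into per-partition tables.
func split(table map[int]int64) partitions {
	var parts partitions
	for p := range parts {
		parts[p] = make(map[int]int64)
	}
	for k, v := range table {
		parts[partitionOf(hashKey(k))][k] = v
	}
	return parts
}

// mergeParallel merges the p-th tables of all workers, one goroutine
// per partition; partitions never share keys, so no locking is needed.
func mergeParallel(workers []partitions) partitions {
	var merged partitions
	var wg sync.WaitGroup
	for p := range merged {
		wg.Add(1)
		go func(p int) {
			defer wg.Done()
			out := make(map[int]int64)
			for _, w := range workers {
				for k, v := range w[p] {
					out[k] += v
				}
			}
			merged[p] = out
		}(p)
	}
	wg.Wait()
	return merged
}

func lookup(m partitions, key int) int64 {
	return m[partitionOf(hashKey(key))][key]
}

func main() {
	w1 := split(map[int]int64{402: 2, 1202: 1})
	w2 := split(map[int]int64{402: 3, 3983: 5})
	merged := mergeParallel([]partitions{w1, w2})
	fmt.Println(lookup(merged, 402), lookup(merged, 1202), lookup(merged, 3983))
}
</code></pre>
<p>Since a given key always lands in the same partition, each goroutine owns a disjoint slice of the final result.</p>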
<p>I'm interested in learning more about analytical query execution models and not only, so if you have anything to share, don't hesitate to write a comment. Have fun coding and see you next time.</p>
]]></content:encoded></item><item><title><![CDATA[BuzzwordBusters: What Does Lock-Free, Wait-Free Really Mean?]]></title><description><![CDATA[It seems to be a common belief that code which uses mutexes/locks/synchronized methods is "slow" and, as soon as you replace them with atomics, your code becomes fast and lock-free. Atomic operations don't make your code wait-free, lock-free, or even...]]></description><link>https://puzpuzpuz.dev/buzzwordbusters-what-does-lock-free-wait-free-really-mean</link><guid isPermaLink="true">https://puzpuzpuz.dev/buzzwordbusters-what-does-lock-free-wait-free-really-mean</guid><category><![CDATA[concurrency]]></category><category><![CDATA[multithreading]]></category><category><![CDATA[Java]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Sun, 18 Dec 2022 08:15:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1671351231898/Azpw-vPiw.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It seems to be a common belief that code which uses mutexes/locks/synchronized methods is "slow" and, as soon as you replace them with atomics, your code becomes fast and lock-free. Atomic operations don't make your code wait-free, lock-free, or even obstruction-free. This tiny blog post is dedicated to the above definitions.</p>
<p>Wait-freedom means that any thread can make progress in a finite number of steps regardless of external factors, such as other threads blocking. A trivial example of a wait-free data structure is an atomic counter (in x86 it would use a <code>LOCK XADD</code> instruction), e.g. Java's <code>j.u.c.AtomicInteger</code>.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Counter</span> </span>{

    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> AtomicInteger cnt = <span class="hljs-keyword">new</span> AtomicInteger();

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">int</span> <span class="hljs-title">add</span><span class="hljs-params">(<span class="hljs-keyword">int</span> delta)</span> </span>{
        <span class="hljs-keyword">return</span> cnt.addAndGet(delta);
    }    

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">int</span> <span class="hljs-title">get</span><span class="hljs-params">()</span> </span>{
        <span class="hljs-keyword">return</span> cnt.get();
    }
}
</code></pre>
<p>Lock-freedom means that the application as a whole can make progress regardless of anything. So, while individual threads may be blocked, at least one of them would be making progress. A trivial example would be the same atomic counter based on a loop with a CAS operation.</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Counter</span> </span>{

    <span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">final</span> AtomicIntegerFieldUpdater&lt;Counter&gt; updater =
        AtomicIntegerFieldUpdater.newUpdater(Counter.class, "cnt");
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">volatile</span> <span class="hljs-keyword">int</span> cnt;

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">int</span> <span class="hljs-title">add</span><span class="hljs-params">(<span class="hljs-keyword">int</span> delta)</span> </span>{
        <span class="hljs-keyword">int</span> cur;
        <span class="hljs-keyword">do</span> {
            cur = cnt;
        } <span class="hljs-keyword">while</span> (!updater.compareAndSet(<span class="hljs-keyword">this</span>, cur, cur + delta));
        <span class="hljs-keyword">return</span> cur + delta;
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">int</span> <span class="hljs-title">get</span><span class="hljs-params">()</span> </span>{
        <span class="hljs-keyword">return</span> cnt;
    }
}
</code></pre>
<p>Obstruction-freedom means that a thread is guaranteed to make progress only when it runs without contention from other threads. This guarantee is the weakest one on the list. I'm not aware of a trivial example for this one, but you may refer to <a target="_blank" href="https://ieeexplore.ieee.org/document/1203503">this paper</a>.</p>
<p>Is it fair enough to say that wait-free data structures and algorithms are faster than lock-free ones and that lock-freedom means something better than obstruction-freedom and blocking code? Not really. You may restrict yourself to wait-freedom if you want to bound the maximum latency of an individual operation, e.g. if you're building a real-time OS. But in most cases, you should consider all possible algorithms and their combinations. For instance, your data structure may be quite fast while it implements wait-free or lock-free reads and blocking writes based on a striped lock (wink-wink Java's <code>ConcurrentHashMap</code> and xsync's <code>Map</code>/<code>MapOf</code>).</p>
<p>If you want to learn more about multithreaded programming and scalable concurrent data structures, I highly recommend Dmitry Vyukov's <a target="_blank" href="https://www.1024cores.net/">old blog</a>. Just go through all posts starting with the <a target="_blank" href="https://www.1024cores.net/home/lock-free-algorithms/introduction">intro one</a>.</p>
<p>Have fun coding and see you next time.</p>
]]></content:encoded></item><item><title><![CDATA[Concurrent Map in Go vs Java: Yet Another Meaningless Benchmark]]></title><description><![CDATA[Today we're comparing Java's j.u.c.ConcurrentHashMap and Go's xsync.MapOf in a totally non-scientific, unfair benchmark. While most of such language performance comparisons are generally useless and harmful, the purpose of this exercise is a comparis...]]></description><link>https://puzpuzpuz.dev/concurrent-map-in-go-vs-java-yet-another-meaningless-benchmark</link><guid isPermaLink="true">https://puzpuzpuz.dev/concurrent-map-in-go-vs-java-yet-another-meaningless-benchmark</guid><category><![CDATA[Java]]></category><category><![CDATA[Go Language]]></category><category><![CDATA[concurrency]]></category><category><![CDATA[Benchmark]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Fri, 04 Nov 2022 19:07:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/Iq9SaJezkOE/upload/v1667585143972/txbdMjnbJ.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today we're comparing Java's <a target="_blank" href="https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/ConcurrentHashMap.html"><code>j.u.c.ConcurrentHashMap</code></a> and Go's <a target="_blank" href="https://github.com/puzpuzpuz/xsync#map"><code>xsync.MapOf</code></a> in a totally non-scientific, unfair benchmark. While most of such language performance comparisons are generally useless and harmful, the purpose of this exercise is a comparison of the algorithms behind both data structures. I'm driven by curiosity here, so don't take this post seriously. The results may be completely different on another HW and in different scenarios, so don't forget that any benchmark has to be taken with a grain of salt.</p>
<p>On the one hand we have <code>ConcurrentHashMap</code>, also known as CHM. It's a brilliant concurrent hash table: no writes to shared memory in read-only operations, hybrid linked list / tree used for buckets depending on the number of stored nodes, volatile value fields in nodes allowing to update the value without having to allocate a new node, a striped counter for the current size, and so on. CHM has been improved across many versions of Java, for many years, so reaching its level of performance isn't an easy task.</p>
<p>On the other hand, <code>MapOf</code> borrows ideas from multiple sources: buckets are organized in unrolled linked lists of cache line size thanks to <a target="_blank" href="https://github.com/LPD-EPFL/CLHT">Cache-Line Hash Table</a> (CLHT), hash codes stored in the buckets to avoid extra hash function calls and pointer chasing, immutable entries to avoid extra synchronization on reads and also reduce GC pressure, also a striped counter for the current size.</p>
<p>Both maps have a bunch of handy atomic operations available to the developer. But enough of this boring stuff, let's do the benchmarking.</p>
<p>Our test stand is nothing more than a laptop with an i7-1185G7 CPU (8 HT cores) and Ubuntu 22.04 running 64-bit builds of Go 1.19.3 and OpenJDK 17.0.4.</p>
<p>The CHM benchmark can be found <a target="_blank" href="https://github.com/puzpuzpuz/java-concurrency-samples/blob/8fbd5032327551fb339ed6d7d208df55436a1952/src/test/java/io/puzpuzpuz/map/ConcurrentMapBenchmark.java">here</a>, while the Go benchmark is in the <a target="_blank" href="https://github.com/puzpuzpuz/xsync/blob/a140d88f8cdc4ebfddf75d89428079d8d1f3ad6f/mapof_test.go#L1045-L1061">xsync repo</a>.</p>
<p>The benchmarks start with a pre-warmed map holding 1,000 64-bit integer key-value pairs. The "99% reads" scenario assumes that each thread spends 99% of its calls on <code>get</code> (<code>Load</code>) operations and 0.5% on each of the <code>put</code> (<code>Store</code>) and <code>remove</code> (<code>Delete</code>) operations. Each operation is called for a randomly selected key, so there are no hot spots. The "75% reads" scenario is more write-heavy as write operations get 12.5% of the total number of operations each.</p>
<p>The benchmarks were run with different numbers of threads/cores, starting with 1 core and ending with all 8 available cores. That's to see how well both maps scale on multi-core machines. Finally, the results below are average values collected over 10 runs of each benchmark.</p>
<p>Let's start with the write-heavy scenario. CHM has an advantage here since it allows updating values in-place, so it might show better results.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1667588120561/-fPxV0rj8.png" alt="Results for 75% reads scenario" /></p>
<p>As expected, CHM is slightly better on all core counts except for 4 cores. Both maps scale well enough.</p>
<p>Let's see results for the read-heavy scenario which might be more common in many real world applications.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1667588318604/rtOSa0JJh.png" alt="Results for 99% reads scenario" /></p>
<p>Again, both maps scale well, with a slight advantage for <code>MapOf</code>. I'm pretty happy with the results. There are certainly rough edges, but <code>MapOf</code> has proven its worthiness.</p>
<p>Have fun coding and see you next time.</p>
]]></content:encoded></item><item><title><![CDATA[Thread-Local State in Go, Huh?]]></title><description><![CDATA[We all know that there is no such thing as thread-local state in Go. Yet, there is a trick that would help you to retain the thread identity at least on the hot path. This trick would be helpful if you're trying to implement a striped counter (wink-w...]]></description><link>https://puzpuzpuz.dev/thread-local-state-in-go-huh</link><guid isPermaLink="true">https://puzpuzpuz.dev/thread-local-state-in-go-huh</guid><category><![CDATA[golang]]></category><category><![CDATA[concurrency]]></category><category><![CDATA[performance]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Sat, 29 Oct 2022 09:41:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/qvWnGmoTbik/upload/v1667036370356/BF0eWC7lbD.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We all know that there is no such thing as thread-local state in Go. Yet, there is a trick that would help you to retain the thread identity at least on the hot path. This trick would be helpful if you're trying to implement a <a target="_blank" href="https://github.com/puzpuzpuz/xsync#counter">striped counter</a> (wink-wink, <code>j.u.c.a.LongAdder</code> from Java), or a <a target="_blank" href="https://github.com/puzpuzpuz/xsync#rbmutex">BRAVO lock</a>, or any kind of a data structure with striped state.</p>
<p>This brief post is based on <a target="_blank" href="https://github.com/puzpuzpuz/talks/blob/c1839354447cf9092d23f90986fd128f9f3f6563/2021-ru-is-time-to-resync/slides.pdf">the talk</a> I gave a while ago. I assume that you're familiar with the concept of <a target="_blank" href="https://www.baeldung.com/java-longadder-and-longaccumulator#dynamic-striping">state striping</a> in concurrent counters since it's important for understanding the end application.</p>
<p>First of all, while Golang assigns identifiers to goroutines, it doesn't expose them. That's <a target="_blank" href="https://golang.org/doc/faq#no_goroutine_id">by design</a>:</p>
<blockquote>
<p>Goroutines do not have names; they are just anonymous
workers. They expose no unique identifier, name, or data
structure to the programmer. Some people are surprised by
this, expecting the go statement to return some item that can
be used to access and control the goroutine later.</p>
<p>The fundamental reason goroutines are anonymous is so that
the full Go language is available when programming
concurrent code. By contrast, the usage patterns that develop
when threads and goroutines are named can restrict what a
library using them can do.</p>
</blockquote>
<p>The same applies to the worker threads used by the Golang scheduler to run your goroutines. But the whole idea of a striped counter depends on being able to identify the current thread, so that subsequent calls are (most of the time) run on the same CPU core and, hence, avoid contention.</p>
<p>There are two straightforward approaches to the problem:</p>
<ol>
<li>CPUID x86 instruction - not portable to other architectures and also requires some assembly or FFI calls to be used.</li>
<li>gettid(2) Linux-only system call - not portable to other OSes.</li>
</ol>
<p>Both of these options are non-versatile and, ideally, we want to have a Go-native and cross-platform solution. Luckily there is one.</p>
<p>I'm talking about <a target="_blank" href="https://pkg.go.dev/sync#Pool"><code>sync.Pool</code></a>. If you're familiar with its <a target="_blank" href="https://github.com/golang/go/blob/e09bbaec69a8ff960110e13eabb3bef5331ecb0c/src/sync/pool.go">source code</a>, you already know that it uses thread-local pools under the hood. If we allocate a struct and place it in the pool, the next time we request it on the same thread (but not necessarily the same goroutine) we should get the same struct.</p>
<p>Let's take a look at a fragment of the <code>xsync.Counter</code>'s code:</p>
<pre><code class="lang-go"><span class="hljs-comment">// pool for P tokens</span>
<span class="hljs-keyword">var</span> ptokenPool sync.Pool

<span class="hljs-comment">// ptoken is used to point at the current OS thread (P)</span>
<span class="hljs-comment">// on which the goroutine is run; exact identity of the thread,</span>
<span class="hljs-comment">// as well as P migration tolerance, is not important since</span>
<span class="hljs-comment">// it's used as a best-effort mechanism for assigning</span>
<span class="hljs-comment">// concurrent operations (goroutines) to different stripes of</span>
<span class="hljs-comment">// the counter.</span>
<span class="hljs-keyword">type</span> ptoken <span class="hljs-keyword">struct</span> {
    idx <span class="hljs-keyword">uint32</span>
}

<span class="hljs-comment">// Counter is a striped int64 counter.</span>
<span class="hljs-keyword">type</span> Counter <span class="hljs-keyword">struct</span> {
    stripes []cstripe
    mask    <span class="hljs-keyword">uint32</span>
}

<span class="hljs-keyword">type</span> cstripe <span class="hljs-keyword">struct</span> {
    c <span class="hljs-keyword">int64</span>
    <span class="hljs-comment">// The padding prevents false sharing.</span>
    pad [cacheLineSize - <span class="hljs-number">8</span>]<span class="hljs-keyword">byte</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewCounter</span><span class="hljs-params">()</span> *<span class="hljs-title">Counter</span></span> {
    <span class="hljs-comment">// Consider the number of CPU cores</span>
    <span class="hljs-comment">// when deciding on the number of stripes.</span>
    nstripes := nextPowOf2(parallelism())
    c := Counter{
        stripes: <span class="hljs-built_in">make</span>([]cstripe, nstripes),
        mask:    nstripes - <span class="hljs-number">1</span>,
    }
    <span class="hljs-keyword">return</span> &amp;c
}

<span class="hljs-comment">// Value returns the current counter value.</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(c *Counter)</span> <span class="hljs-title">Value</span><span class="hljs-params">()</span> <span class="hljs-title">int64</span></span> {
    v := <span class="hljs-keyword">int64</span>(<span class="hljs-number">0</span>)
    <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; <span class="hljs-built_in">len</span>(c.stripes); i++ {
        stripe := &amp;c.stripes[i]
        v += atomic.LoadInt64(&amp;stripe.c)
    }
    <span class="hljs-keyword">return</span> v
}

<span class="hljs-comment">// Add adds the delta to the counter.</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(c *Counter)</span> <span class="hljs-title">Add</span><span class="hljs-params">(delta <span class="hljs-keyword">int64</span>)</span></span> {
    <span class="hljs-comment">// Pick up a token from the pool. If Add was called recently</span>
    <span class="hljs-comment">// on the same thread, we'll probably get the same ptoken.</span>
    t, ok := ptokenPool.Get().(*ptoken)
    <span class="hljs-keyword">if</span> !ok {
        <span class="hljs-comment">// Allocate a new token and pick up a random stripe index.</span>
        t = <span class="hljs-built_in">new</span>(ptoken)
        t.idx = fastrand() &amp; c.mask
    }
    <span class="hljs-keyword">for</span> {
        stripe := &amp;c.stripes[t.idx]
        cnt := atomic.LoadInt64(&amp;stripe.c)
        <span class="hljs-keyword">if</span> atomic.CompareAndSwapInt64(&amp;stripe.c, cnt, cnt+delta) {
            <span class="hljs-comment">// We were able to update the stripe, so all done.</span>
            <span class="hljs-keyword">break</span>
        }
        <span class="hljs-comment">// CAS failed, so there is some contention over the stripe.</span>
        <span class="hljs-comment">// Give a try with another randomly selected stripe.</span>
        t.idx = fastrand() &amp; c.mask
    }
    <span class="hljs-comment">// Return ptoken back to the pool, so that another goroutine</span>
    <span class="hljs-comment">// running on the same thread can use it.</span>
    ptokenPool.Put(t)
}
</code></pre>
<p>Here, in the <code>Add</code> method, we're using the <code>ptoken</code> structs to hold (not-so) thread-local state. Once we obtain a <code>ptoken</code>, we try to change the corresponding stripe in a CAS-based loop. This allows the goroutines to self-organize: they detect contention via a failed CAS operation and then change the stripe. The goal is to avoid contention due to unlucky thread-to-stripe distribution.</p>
<p>You may ask if piggybacking on a <code>sync.Pool</code>'s implementation detail is worth the hassle. My answer would be "no, unless you really know what you're doing". Say, single-threaded performance of a primitive atomic <code>int64</code> would be better. There is also an overhead in the <code>Value()</code> method since it needs to read values from all stripes. So, this trick is certainly from the "don't try that at home" category. But if you aim for scalability of your write operations, it's certainly worth it:</p>
<pre><code class="lang-bash">$ go <span class="hljs-built_in">test</span> -benchmem -run=^$ -bench <span class="hljs-string">"Counter|Atomic"</span>
goos: linux
goarch: amd64
pkg: github.com/puzpuzpuz/xsync/v2
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
BenchmarkCounter-8           409773502             2.909 ns/op           0 B/op           0 allocs/op
BenchmarkAtomicInt64-8       92472007            14.09 ns/op           0 B/op           0 allocs/op
PASS
ok      github.com/puzpuzpuz/xsync/v2    3.024s
</code></pre>
<p>If you run the same benchmark on a machine with more cores, the <code>int64</code>'s result would only get worse due to contention.</p>
<p>Both <code>Map</code> and <code>MapOf</code>, concurrent hash maps from <a target="_blank" href="https://github.com/puzpuzpuz/xsync"><code>xsync</code> library</a>, use a variation of a striped counter internally to track the current map size. Naturally, they do a counter increment or decrement on each write operation, but read the counter value only rarely, when a resize happens.</p>
<p>One more example of application of this trick is <a target="_blank" href="https://github.com/puzpuzpuz/xsync#rbmutex"><code>RBMutex</code></a>, a reader-biased reader/writer mutual exclusion lock that implements the BRAVO algorithm. I'm leaving learning the internals of this one to the curious reader.</p>
<p>As promised, the post is a short one, so that's it for today. Have fun coding and see you next time.</p>
]]></content:encoded></item><item><title><![CDATA[So long, sync.Map]]></title><description><![CDATA[While the title is certainly a clickbait, I definitely don't see any strong reason to keep dealing with sync.Map if you're a Go generics user. Instead, you should consider xsync.MapOf:
type point struct {
    x int
    y int
}
// create a MapOf for w...]]></description><link>https://puzpuzpuz.dev/so-long-syncmap</link><guid isPermaLink="true">https://puzpuzpuz.dev/so-long-syncmap</guid><category><![CDATA[Go Language]]></category><category><![CDATA[concurrency]]></category><category><![CDATA[algorithms]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Sat, 22 Oct 2022 17:50:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/Sot0f3hQQ4Y/upload/v1666452261072/_NYvZ2s-s.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While the title is certainly a clickbait, I definitely don't see any strong reason to keep dealing with <a target="_blank" href="https://pkg.go.dev/sync#Map">sync.Map</a> if you're a Go generics user. Instead, you should consider <a target="_blank" href="https://pkg.go.dev/github.com/puzpuzpuz/xsync#MapOf">xsync.MapOf</a>:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> point <span class="hljs-keyword">struct</span> {
    x <span class="hljs-keyword">int</span>
    y <span class="hljs-keyword">int</span>
}
<span class="hljs-comment">// create a MapOf with keys and values of point type</span>
m := NewTypedMapOf[point, point](<span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">(p point)</span> <span class="hljs-title">uint64</span></span> {
    <span class="hljs-comment">// hash function to be used by the map</span>
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">uint64</span>(<span class="hljs-number">31</span>*p.x + p.y)
})
<span class="hljs-comment">// load the existing value or compute it lazily, if it's absent</span>
v, loaded := m.LoadOrCompute(point{<span class="hljs-number">42</span>, <span class="hljs-number">42</span>}, <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span> <span class="hljs-title">point</span></span> {
    <span class="hljs-keyword">return</span> point{<span class="hljs-number">0</span>, <span class="hljs-number">0</span>}
})
</code></pre>
<p>The above wasn't possible until xsync v1.5.0. That's because the generic version of the concurrent map available in the library supported only <code>string</code> keys. But now you can use any <code>comparable</code> type as a key, so <code>xsync.MapOf</code> became a real, more scalable alternative to the good old <code>sync.Map</code>. Today we're going to discuss the recent changes in the data structure that allowed using arbitrary key types and also do some (as always, non-scientific) microbenchmarking.</p>
<h3 id="heading-the-challenges">The challenges</h3>
<p>The original intent behind the xsync library was to provide faster alternatives for the built-in concurrent data structures available in the Golang standard library. Most of the alternatives, like <code>RBMutex</code> or <code>MPMCQueue</code>, aren't general-purpose, i.e. they're tailored for niche use cases.</p>
<p>But <code>xsync.Map</code> is different as it was aimed to replace <code>sync.Map</code> in the same scenarios and beyond. It's based on a (noticeably) modified version of <a target="_blank" href="https://github.com/LPD-EPFL/CLHT">Cache-Line Hash Table</a> (CLHT). Read-only operations, such as <code>Load</code> or <code>Range</code>, are obstruction-free while write operations and rehashing use fine-grained locking (lock sharding). Refer to this <a target="_blank" href="https://gopheradvent.com/calendar/2021/journey-to-a-faster-concurrent-map/">blog post</a> to learn more about the algorithm. The only significant limitation was that keys were limited to the <code>string</code> type.</p>
<p>The original version of the map was non-generic, so you had to deal with unpleasant <code>interface{}</code> rituals in your code. When Go got a stable version of generics, <a target="_blank" href="https://github.com/vearutop">Viacheslav Poturaev</a> <a target="_blank" href="https://github.com/puzpuzpuz/xsync/pull/34">contributed</a> a generics-friendly version of the map, <code>xsync.MapOf</code>. This was a nice step forward, but it still supported <code>string</code> keys only.</p>
<p>A few months ago, the library repo received a <a target="_blank" href="https://github.com/puzpuzpuz/xsync/pull/46">pull request</a> from <a target="_blank" href="https://github.com/iamcalledrob">Rob Mason</a>. The goal was to allow arbitrary <code>comparable</code> types for keys while maintaining backwards compatibility. Since Golang doesn't expose the built-in hash functions except for what's in the <code>hash/maphash</code> package, the hash function has to be provided by the user, which isn't a big deal. The only problem was the layout of the data structure.</p>
<p>Each bucket in the underlying hash table had the following structure:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1666450053716/B_6Tn9JAP.png" alt="bucket.png" /></p>
<p>Here we have 128 bytes (two cache lines on most modern CPUs) holding a <code>sync.Mutex</code>, 7 key/value pairs, and a <code>uint64</code> with cached hash codes for the keys present in the bucket. The mutex is used while writing to the bucket. The key/value pairs hold pointers to the map entries. Finally, the hash code cache holds the most significant byte of each key’s hash code.</p>
<p>So, each bucket was able to hold up to 7 key/value pairs and, if the bucket was full on an insert, a rehash had to be performed. Such a design is sufficient if you have a high-quality hash function, such as the built-in one for <code>string</code>s. But if the function is user-provided, the chances of having 7+ perfect hash code collisions in the same map increase dramatically.</p>
<p>To fix that, <code>mutex</code> and <code>hashes</code> fields were <a target="_blank" href="https://github.com/puzpuzpuz/xsync/pull/48">merged</a> into a single <code>uint64</code>-based data structure with the following layout:</p>
<pre><code>| key <span class="hljs-number">0</span><span class="hljs-string">'s top hash | ... | key 6'</span>s top hash | bitmap <span class="hljs-keyword">for</span> keys |
|      <span class="hljs-number">1</span> byte      | ... |      <span class="hljs-number">1</span> byte      |     <span class="hljs-number">1</span> byte      |
</code></pre><p>Here, we have the 8 most significant bits (MSBs) of the hash code for each key stored in the bucket. They're used to avoid many expensive hash code computations on each lookup. Before calculating the hash code for a key/value pair, we compare the MSBs with the MSBs of the hash code of the user-provided key and, if they don't match, we move on to the next pair in the bucket. Next, the least significant bit in the "bitmap for keys" byte is used to implement a mutex (a <a target="_blank" href="https://puzpuzpuz.dev/benchmarking-non-shared-locks-in-java">TTAS spinlock</a>, to be more precise).</p>
<p>Merging the mutex and the hash code MSBs together saved a gorgeous 8 bytes of memory in each bucket. The saved bytes hold a pointer to the next bucket in the chain, so as in the very first xsync version, hash table buckets are now organized in <a target="_blank" href="https://en.wikipedia.org/wiki/Unrolled_linked_list">unrolled linked lists</a>. Hence, in the face of perfect hash code collisions, the linked list simply grows until the map load factor is met.</p>
<p>Needless to say, once all of the above was done, supporting any <code>comparable</code> key type finally became possible.</p>
<p>At this point you may be asking yourself if using a non-standard map is worth it. Let's see.</p>
<h3 id="heading-the-promised-benchmarks">The promised benchmarks</h3>
<p>The following results were obtained on i7-1185G7, Ubuntu 22.04 x86-64 and Go 1.19.2. The benchmark source code itself can be found in the <a target="_blank" href="https://github.com/puzpuzpuz/xsync">xsync repo</a>.</p>
<p>We start with a benchmark that uses a pre-warmed map with 1,000 key/value pairs. The keys are <code>string</code>s while the values are <code>int</code>s. The benchmark uses 8 goroutines to load all 8 HT cores available on the machine. Each goroutine selects a key randomly and executes either a <code>Load</code>, <code>Store</code> or <code>Delete</code> operation on it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1666460218809/kPTAb9-LF.png" alt="map-8c-1k.png" /></p>
<p>Here, the x axis stands for the percentage of <code>Load</code> operations while the two remaining operations have equal chances to be called, e.g. for the 90% value we have 90% of <code>Load</code> calls, 5% of <code>Store</code>s and 5% of <code>Delete</code>s. The y axis stands for the average time in nanoseconds spent on an operation.</p>
<p>Things get even more interesting with 1M key/value pairs.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1666460231264/Ag006imdD.png" alt="map-8c-1m.png" /></p>
<p>As usual, take the above results with a grain of salt. There might be some scenarios where the standard map is faster, so make sure to test things close to what your application does. Nevertheless, due to its design, the standard map is quite limited in terms of scalability in the presence of even a small fraction of concurrent writes. After all, a data structure guarded with a single lock will scale worse than one with sharded locks. When it comes to reads, which are the strongest point of <code>sync.Map</code>, both data structures are on par.</p>
<h3 id="heading-lessons-learned">Lessons learned</h3>
<p>Hopefully, this post has convinced you to try <code>xsync.MapOf</code> in action and maybe to contribute to xsync. This story is further evidence of the power of open source communities. Without so many contributors, <code>xsync.Map</code> would still be limited in its capabilities. I'm pretty sure that it's not the end of the story and the library will continue evolving in the future. Have fun coding and see you next time.</p>
]]></content:encoded></item><item><title><![CDATA[Testing Concurrent Code for Fun and Profit]]></title><description><![CDATA[Everyone knows that multi-threaded code is not a piece of cake. There are lots of publications on how to write concurrent code properly and also lots of well-known algorithms and data structures to choose from. Yet, authors often ignore another impor...]]></description><link>https://puzpuzpuz.dev/testing-concurrent-code-for-fun-and-profit</link><guid isPermaLink="true">https://puzpuzpuz.dev/testing-concurrent-code-for-fun-and-profit</guid><category><![CDATA[Java]]></category><category><![CDATA[concurrency]]></category><category><![CDATA[Testing]]></category><category><![CDATA[multithreading]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Sun, 09 Oct 2022 10:03:05 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/XZuqMUiSdgc/upload/v1665297274428/V1KYwe0RS.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Everyone knows that multi-threaded code is not a piece of cake. There are lots of publications on how to write concurrent code properly and also lots of well-known algorithms and data structures to choose from. Yet, authors often ignore another important topic and pretend that it's not worth discussing. The topic is how to test your concurrent code and that's what we're going to consider today. I'm by no means an expert on this matter (or any other matter), so everything below is an attempt to share an opinionated approach that appears to work well for me and helps to find the vast majority of concurrency bugs.</p>
<p>To narrow down the topic, we're going to use Java and try to cover the <a target="_blank" href="https://puzpuzpuz.dev/fast-and-simple-spsc-queue">SPSC queue</a> we built recently with a minimal set of tests. While the observed tests are minimal, you should be able to write tests for your own concurrent code after reading this blog post. As for the programming language, everything we talk about should be applicable to any language with threads or green threads, so C/C++, Rust, Golang, Zig and many others apply.</p>
<p>The interface of our queue is very simple and consists of two methods:</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SpscBoundedQueue</span>&lt;<span class="hljs-title">E</span>&gt; </span>{

    <span class="hljs-comment">// Some boring stuff, like fields and constructor.</span>

    <span class="hljs-comment">/**
     * Publishes an item to the tail of the queue, if it's not full.
     */</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">boolean</span> <span class="hljs-title">offer</span><span class="hljs-params">(E e)</span> </span>{ <span class="hljs-comment">/* some code goes here */</span> }

    <span class="hljs-comment">/**
     * Removes and returns the head of the queue, if it's not empty.
     */</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> E <span class="hljs-title">poll</span><span class="hljs-params">()</span> </span>{ <span class="hljs-comment">/* some code goes here */</span> }
}
</code></pre>
<p>So, where to start when testing it?</p>
<h3 id="heading-where-to-start">Where to start?</h3>
<p>The best thing to do as the first step is to write good old single-threaded tests. Many aspects of your code can (and should) be covered with such tests. Those are input validation, boundary checks, basic invariants of your data structure, methods that aren't thread-safe and, hence, will always be called from a single thread - all of these should be covered with "cheap" (in terms of the execution time) tests. Keep in mind that the aforementioned list is not complete. You should always try to cover as much as possible with single-threaded tests.</p>
<p>In our case, we should cover the following things:</p>
<ol>
<li>Input validation - our original code was minimal, so it lacked things like positive size validation in the constructor. Such validation is a perfect candidate for a single-threaded test.</li>
<li>Boundary checks - our queue has a limited size, so we expect <code>offer()</code> to return <code>false</code> when the queue is full, as well as <code>poll()</code> to return <code>null</code> when the queue is empty.</li>
<li>Basic invariants - we should test the "First in, first out" (FIFO) property of our data structure.</li>
<li>Non thread-safe methods - again, we omitted many other methods that are handy, such as the <code>clear()</code> method. Due to the queue design, those methods have to be called from a single thread in the absence of any other queue mutation calls. Single-threaded tests to the rescue.</li>
</ol>
<p>Here is a test that illustrates items 2 and 3 from the above list:</p>
<pre><code class="lang-java"><span class="hljs-meta">@Test</span>
<span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">testSerial</span><span class="hljs-params">()</span> </span>{
    SpscBoundedQueue&lt;Integer&gt; queue = <span class="hljs-keyword">new</span> SpscBoundedQueue&lt;&gt;(<span class="hljs-number">10</span>);

    Assert.assertNull(queue.poll());

    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">10</span>; i++) {
        Assert.assertTrue(queue.offer(i));
    }
    Assert.assertFalse(queue.offer(<span class="hljs-number">42</span>));

    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">10</span>; i++) {
        Assert.assertEquals((Integer) i, queue.poll());
    }
    Assert.assertNull(queue.poll());
}
</code></pre>
<p>It's a bit dense and could be split into multiple, more focused tests, but it's not a big deal considering that it's an illustration of the concept. The test verifies both boundary checks, as well as the FIFO property.</p>
<p>Enough silly serial tests, let's run things in parallel!</p>
<h3 id="heading-how-to-break-things">How to break things?</h3>
<p>Our main goal is to find any thread-safety violations. But what does it mean in practice? Such violations may be very infrequent and hard to reproduce. Sometimes race conditions, data races and other unpleasant things may even remain unnoticed until you hit a certain edge case. The sad truth is that testing concurrent code is hard and you can never be sure that your test suite is good enough. But that means that you should do your best at writing concurrent tests to eliminate most, if not all, thread-safety bugs.</p>
<p>Each concurrent test scenario has to be thought through separately. It should reproduce a use case and involve a set of related methods of your data structure(s). In our example, things are simple: we need to test the <code>offer()</code> and <code>poll()</code> methods running on two different threads (remember, we deal with a Single Producer Single Consumer queue). But is it enough to call these methods like crazy from separate threads? Not really. Just like with single-threaded tests, we have to think of the invariants we have in the thread-safe part of the code.</p>
<p>To keep things practical, let's start with the skeleton of the test:</p>
<pre><code class="lang-java"><span class="hljs-meta">@Test</span>
<span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">testHammer</span><span class="hljs-params">()</span> <span class="hljs-keyword">throws</span> InterruptedException </span>{
    <span class="hljs-keyword">final</span> <span class="hljs-keyword">int</span> iterations = <span class="hljs-number">1_000_000</span>;

    <span class="hljs-comment">// Prepare the data structure.</span>
    SpscBoundedQueue&lt;Integer&gt; queue = <span class="hljs-keyword">new</span> SpscBoundedQueue&lt;&gt;(<span class="hljs-number">10</span>);

    <span class="hljs-comment">// Prepare helper data structures (test infra).</span>
    CyclicBarrier barrier = <span class="hljs-keyword">new</span> CyclicBarrier(<span class="hljs-number">2</span>);
    CountDownLatch latch = <span class="hljs-keyword">new</span> CountDownLatch(<span class="hljs-number">2</span>);
    AtomicInteger anomalies = <span class="hljs-keyword">new</span> AtomicInteger();

    <span class="hljs-comment">// Prepare and start the threads.</span>
    ConsumerThread consumer = <span class="hljs-keyword">new</span> ConsumerThread(queue, barrier, latch, anomalies, iterations);
    consumer.start();
    ProducerThread producer = <span class="hljs-keyword">new</span> ProducerThread(queue, barrier, latch, anomalies, iterations);
    producer.start();

    <span class="hljs-comment">// Wait for the threads to finish.</span>
    latch.await();

    <span class="hljs-comment">// Verify that there were no thread-safety violations.</span>
    Assert.assertEquals(<span class="hljs-number">0</span>, anomalies.get());
}
</code></pre>
<p>The above code is quite straightforward. The test runs two threads and involves a queue, as well as a number of helper synchronization primitives. Once the threads are done, it checks the anomalies counter to verify that there were no thread-safety violations. As the test name suggests, it's a hammer style test, i.e. it aims to "bash" the queue from multiple threads until it breaks (or not). Such tests are sometimes called stress tests for concurrent code.</p>
<p>Now, let's see what our threads actually do. We start with the producer thread:</p>
<pre><code class="lang-java"><span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ProducerThread</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">Thread</span> </span>{

    <span class="hljs-comment">// Boring stuff such as fields and constructor goes here...</span>

    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">run</span><span class="hljs-params">()</span> </span>{
        <span class="hljs-keyword">try</span> {
            <span class="hljs-comment">// Await for the consumer thread, so we start simultaneously.</span>
            barrier.await();
            <span class="hljs-comment">// Start publishing incrementing numbers to the queue.</span>
            <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; iterations; i++) {
                <span class="hljs-keyword">while</span> (!queue.offer(i)) {
                    <span class="hljs-comment">// Yes, we want to busy spin.</span>
                }
            }
        } <span class="hljs-keyword">catch</span> (Exception e) {
            <span class="hljs-comment">// Any exception we get when producing is an anomaly.</span>
            e.printStackTrace();
            anomalies.incrementAndGet();
        } <span class="hljs-keyword">finally</span> {
            <span class="hljs-comment">// Notify the main thread that we're done.</span>
            latch.countDown();
        }
    }
}
</code></pre>
<p>The producer's code is simple and illustrative. Notice that we're publishing incrementing numbers to the queue. As we're going to see in the consumer's code, that's to be able to verify our main invariant - the FIFO property. One more important thing here is that we don't have any kind of back-off calls in the <code>while</code> loop. Instead, we prefer to busy spin. That's because calls like <code>Thread#sleep()</code> or <code>LockSupport.parkNanos()</code> or anything similar involve synchronization that might hide the bugs in your otherwise broken code. Also, if you need to emulate some local work as the back-off or on successful operation, prefer using <code>Blackhole#consumeCpu()</code> from JMH or similar methods of your choice. Finally, due to the same consideration, it's definitely a bad idea to call <code>System.out.println()</code> or log anything in the main loop.</p>
<p>The consumer thread's code is also pretty simple:</p>
<pre><code class="lang-java"><span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ConsumerThread</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">Thread</span> </span>{

    <span class="hljs-comment">// Boring stuff such as fields and constructor goes here...</span>

    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">run</span><span class="hljs-params">()</span> </span>{
        <span class="hljs-keyword">try</span> {
            barrier.await();
            <span class="hljs-comment">// Consume all items from the queue.</span>
            <span class="hljs-keyword">int</span> prev = -<span class="hljs-number">1</span>;
            <span class="hljs-keyword">while</span> (prev != iterations - <span class="hljs-number">1</span>) {
                Integer element = queue.poll();
                <span class="hljs-keyword">if</span> (element == <span class="hljs-keyword">null</span>) {
                    <span class="hljs-comment">// Again, we busy spin.</span>
                    <span class="hljs-keyword">continue</span>;
                }
                <span class="hljs-comment">// Check that we received the incremented number.</span>
                <span class="hljs-keyword">if</span> (element != prev + <span class="hljs-number">1</span>) {
                    anomalies.incrementAndGet();
                }
                prev = element;
            }
            <span class="hljs-comment">// We expect the queue to be empty now.</span>
            <span class="hljs-keyword">if</span> (queue.poll() != <span class="hljs-keyword">null</span>) {
                anomalies.incrementAndGet();
            }
        } <span class="hljs-keyword">catch</span> (Exception e) {
            e.printStackTrace();
            anomalies.incrementAndGet();
        } <span class="hljs-keyword">finally</span> {
            latch.countDown();
        }
    }
}
</code></pre>
<p>The above code completes the picture: our test aims to verify the FIFO property of the queue and nothing more than that.</p>
<p>Variations of the concurrent tests are important. If we were testing an MPMC queue, it would be a good idea to have multiple tests with different numbers of producer and consumer threads: single producer - single consumer, single producer - multiple consumers, multiple producers - single consumer, multiple producers - multiple consumers. If we have some kind of local work emulation in the tests, it would be nice to test it with different CPU time too. The same applies to data structure capacity and any other things that may affect the flow of your code. Thread-safety violations are a question of unlucky (or lucky, if you want to find bugs) ordering and visibility, so the more variations of the scenario you run, the higher the chances of finding a violation.</p>
<p>The complete test source code may be found <a target="_blank" href="https://github.com/puzpuzpuz/java-concurrency-samples/blob/c9ac37591deef8b16308ee45bc8d14675c7d66d3/src/test/java/io/puzpuzpuz/queue/SpscBoundedQueueTest.java">here</a>. If you're proficient in Golang and fancy seeing a more complex application of the above principles, see the tests of the <a target="_blank" href="https://github.com/puzpuzpuz/xsync">xsync</a> library. The library consists of a number of concurrent data structures that are certainly more complex than our SPSC queue.</p>
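<p>To make the idea of variations a bit more concrete, here is a minimal sketch that runs the same hammer scenario against several queue capacities. Keep in mind the assumptions: the <code>HammerVariations</code> class and its <code>hammer()</code> helper are made up for this illustration, and the lock-based <code>ArrayBlockingQueue</code> from the JDK is only a stand-in so that the snippet compiles on its own. In a real test you would plug in the queue under test instead.</p>

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class HammerVariations {

    // Runs one producer and one consumer against a queue of the given capacity
    // and returns the number of detected anomalies.
    static int hammer(int capacity, int iterations) {
        // Stand-in queue (an assumption of this sketch); use the queue under test here.
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        CyclicBarrier barrier = new CyclicBarrier(2);
        CountDownLatch latch = new CountDownLatch(2);
        AtomicInteger anomalies = new AtomicInteger();

        Thread producer = new Thread(() -> {
            try {
                barrier.await();
                for (int i = 0; i < iterations; i++) {
                    while (!queue.offer(i)) {
                        // Busy spin until there is a free slot.
                    }
                }
            } catch (Exception e) {
                anomalies.incrementAndGet();
            } finally {
                latch.countDown();
            }
        });
        Thread consumer = new Thread(() -> {
            try {
                barrier.await();
                int prev = -1;
                while (prev != iterations - 1) {
                    Integer element = queue.poll();
                    if (element == null) {
                        continue; // Busy spin until an item arrives.
                    }
                    if (element != prev + 1) {
                        anomalies.incrementAndGet(); // FIFO violation.
                    }
                    prev = element;
                }
            } catch (Exception e) {
                anomalies.incrementAndGet();
            } finally {
                latch.countDown();
            }
        });
        consumer.start();
        producer.start();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the threads", e);
        }
        return anomalies.get();
    }

    public static void main(String[] args) {
        // The same scenario, different capacities: tiny queues maximize contention.
        for (int capacity : new int[]{1, 2, 10, 1024}) {
            int anomalies = hammer(capacity, 100_000);
            System.out.println("capacity=" + capacity + " anomalies=" + anomalies);
        }
    }
}
```

<p>The list of capacities is arbitrary. The point is only that a capacity of 1 and a capacity of 1024 exercise very different interleavings of the full and empty paths, so both are worth running.</p>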
<h3 id="heading-how-to-run-the-tests">How to run the tests?</h3>
<p>Before we wrap up, let's discuss a few tips to squeeze everything from your multi-threaded tests.</p>
<p>First of all, it is a good habit to run the newly written concurrent tests on your dev machine for a few minutes. This might show failures early, without involving many CI runs.</p>
<p>Next, if your code runs on different CPU architectures, make sure to run tests on those. For instance, ARM CPUs have a weaker <a target="_blank" href="https://research.swtch.com/hwmm">hardware memory model</a> when compared with x86 ones.</p>
<p>Some language ecosystems have race detector tools, like <a target="_blank" href="https://clang.llvm.org/docs/ThreadSanitizer.html">ThreadSanitizer</a> or Golang's <a target="_blank" href="https://go.dev/doc/articles/race_detector">Data Race Detector</a>. If applicable, make sure to configure your CI to run the tests with the race detector enabled. It's also worth mentioning the <a target="_blank" href="https://github.com/openjdk/jcstress">jcstress</a> and <a target="_blank" href="https://github.com/Kotlin/kotlinx-lincheck">Lincheck</a> frameworks available in the JVM ecosystem. Unlike the aforementioned race detectors, these frameworks require writing dedicated tests, so, in case of concurrent data structure testing, they can be seen as an alternative to the hand-written tests we're discussing today.</p>
<p>Finally, if some of your concurrent tests appear to be flaky, i.e. infrequently fail due to an unknown reason, that may be an indication of an actual bug. Make sure to do your best to reproduce the failure, analyze it and fix the cause.</p>
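<p>One low-tech way to chase a flaky failure is to run the suspicious test body many times in a loop and count the failures. The snippet below is only a sketch of that idea: the <code>FlakyRepro</code> class and the <code>countFailures()</code> helper are hypothetical names, and the lambda in <code>main()</code> is a placeholder for your own test.</p>

```java
public class FlakyRepro {

    // Runs the given test body the given number of times and returns how many runs failed.
    static int countFailures(int attempts, Runnable testBody) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                testBody.run();
            } catch (Throwable t) {
                // Catch Throwable since assertion failures are Errors, not Exceptions.
                failures++;
                System.err.println("attempt " + i + " failed: " + t);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int failures = countFailures(1_000, () -> {
            // Placeholder: call the flaky test method here.
        });
        System.out.println("failures: " + failures);
    }
}
```

<p>Logging happens outside of the test body on purpose, so it doesn't add synchronization to the main loops of the test itself.</p>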
<h3 id="heading-lets-recap">Let's recap?</h3>
<p>Writing thread-safe concurrent code is hard. Writing sufficient tests for such code may be even harder. Here is the summary of what we discussed today:</p>
<ul>
<li>Always try to cover as much as possible with single-threaded tests.</li>
<li>Write your concurrent tests to stress your code and verify a set of invariants.</li>
<li>Avoid calls that involve additional synchronization in the main loops of your tests.</li>
<li>Variations of the concurrent tests are important.</li>
<li>Test on various CPU architectures (wink-wink ARM).</li>
<li>If applicable, configure your CI server to run the concurrent tests with a race detector.</li>
<li>Flaky tests are your friends. Always do your best to reproduce and analyze them.</li>
</ul>
<p>I hope you've learned something new today. Good luck with your concurrent tests and see you next time.</p>
]]></content:encoded></item><item><title><![CDATA[Fast and Simple SPSC Queue]]></title><description><![CDATA[Single producer single consumer (SPSC) queues form the simplest type of concurrent queues. We have a single thread producing the items, as well as a single thread consuming them concurrently - what can be simpler than that? Nevertheless, such queues ...]]></description><link>https://puzpuzpuz.dev/fast-and-simple-spsc-queue</link><guid isPermaLink="true">https://puzpuzpuz.dev/fast-and-simple-spsc-queue</guid><category><![CDATA[Java]]></category><category><![CDATA[concurrency]]></category><category><![CDATA[algorithms]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Sat, 01 Oct 2022 18:44:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/CrHG_ZYn1Dw/upload/v1664649661591/Sotx-ezX6.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Single producer single consumer (SPSC) queues form the simplest type of concurrent queues. We have a single thread producing the items, as well as a single thread consuming them concurrently - what can be simpler than that? Nevertheless, such queues can be found in complex software projects, such as the Linux kernel. Use cases include sending network packets between NICs and OS drivers and receiving I/O completion events in <a target="_blank" href="https://kernel.dk/io_uring.pdf">io_uring</a>, the newest asynchronous I/O API available in Linux. An SPSC queue may be unbounded, meaning that the total number of items that can be pushed into the queue is unlimited, or bounded, which in practice means that it's built on top of a <a target="_blank" href="https://en.wikipedia.org/wiki/Circular_buffer">ring buffer</a>. Today, we're discussing a bounded SPSC queue implemented in Java. The beauty of this data structure is its simplicity combined with a good level of performance on modern hardware.</p>
<p>We start with the skeleton of the data structure, i.e. its fields and interface:</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SpscBoundedQueue</span>&lt;<span class="hljs-title">E</span>&gt; </span>{

    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> Object[] data;
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> PaddedAtomicInteger producerIdx = <span class="hljs-keyword">new</span> PaddedAtomicInteger();
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> PaddedAtomicInteger producerCachedIdx = <span class="hljs-keyword">new</span> PaddedAtomicInteger();
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> PaddedAtomicInteger consumerIdx = <span class="hljs-keyword">new</span> PaddedAtomicInteger();
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> PaddedAtomicInteger consumerCachedIdx = <span class="hljs-keyword">new</span> PaddedAtomicInteger();

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">SpscBoundedQueue</span><span class="hljs-params">(<span class="hljs-keyword">int</span> size)</span> </span>{
        <span class="hljs-keyword">this</span>.data = <span class="hljs-keyword">new</span> Object[size + <span class="hljs-number">1</span>];
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">boolean</span> <span class="hljs-title">offer</span><span class="hljs-params">(E e)</span> </span>{
        <span class="hljs-comment">// The code will follow...</span>
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> E <span class="hljs-title">poll</span><span class="hljs-params">()</span> </span>{
        <span class="hljs-comment">// The code will follow...</span>
    }

    <span class="hljs-keyword">static</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">PaddedAtomicInteger</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">AtomicInteger</span> </span>{
        <span class="hljs-meta">@SuppressWarnings("unused")</span>
        <span class="hljs-keyword">private</span> <span class="hljs-keyword">int</span> i1, i2, i3, i4, i5, i6, i7, i8,
                i9, i10, i11, i12, i13, i14, i15;
    }
}
</code></pre>
<p>Here, we have an array of queue elements plus a number of index fields where the consumer and the producer each get a pair of <code>PaddedAtomicInteger</code>s. The <code>PaddedAtomicInteger</code> is basically the standard <code>j.u.c.a.AtomicInteger</code> class with some padding added to prevent <a target="_blank" href="https://en.wikipedia.org/wiki/False_sharing">false sharing</a>. Alternatively, we could keep the memory layout flat with all indexes declared as primitive fields right in the <code>SpscBoundedQueue</code> class, but this would make the code much less readable.</p>
<p>You may also notice that only <code>offer()</code> and <code>poll()</code> methods are implemented. Again, that's to keep the code compact and readable. Adding other useful methods, like the batch flavor ones, is simple enough and left as an exercise for curious readers.</p>
<p>The array of queue items is used as a ring buffer of arbitrary size, i.e. there is no power of two restriction for the size like in some ring buffer implementations. The <code>producerIdx</code> and <code>consumerIdx</code> fields are used to synchronize producer's and consumer's accesses to the array. Both producer and consumer check each other's index to understand if they can insert or read the next item and, if the check succeeds, perform the action and update their own index. Two other fields are used to cache the index seen during the latest check. We'll discuss why such caching improves the end performance in a moment.</p>
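<p>Before adding concurrency to the picture, it may help to look at the index arithmetic alone. The following single-threaded sketch (the <code>RingIndexSketch</code> class is made up for this illustration and is not thread-safe) shows how the wrap-around works without a power of two size and why the array has one extra slot: it lets us tell a full queue from an empty one. The emptiness check used in <code>poll()</code> is an assumption that mirrors the fullness check on the producer side.</p>

```java
public class RingIndexSketch {

    private final Object[] data;
    private int producerIdx; // The next slot to write to.
    private int consumerIdx; // The next slot to read from.

    public RingIndexSketch(int size) {
        // One slot always stays unused, so "full" and "empty" look different.
        this.data = new Object[size + 1];
    }

    // Advances an index by one slot, wrapping around at the end of the array.
    private int next(int idx) {
        int nextIdx = idx + 1;
        return nextIdx == data.length ? 0 : nextIdx;
    }

    public boolean offer(Object e) {
        int nextIdx = next(producerIdx);
        if (nextIdx == consumerIdx) {
            return false; // The queue is full.
        }
        data[producerIdx] = e;
        producerIdx = nextIdx;
        return true;
    }

    public Object poll() {
        if (consumerIdx == producerIdx) {
            return null; // The queue is empty.
        }
        Object e = data[consumerIdx];
        data[consumerIdx] = null; // Don't keep a stale reference in the array.
        consumerIdx = next(consumerIdx);
        return e;
    }

    public static void main(String[] args) {
        RingIndexSketch queue = new RingIndexSketch(3);
        for (int i = 0; i != 5; i++) {
            System.out.println("offer(" + i + ") = " + queue.offer(i));
        }
        Object item;
        while ((item = queue.poll()) != null) {
            System.out.println("poll() = " + item);
        }
    }
}
```

<p>With equal indexes meaning "empty" and the producer being one slot behind the consumer meaning "full", a queue created with size N accepts exactly N items before <code>offer()</code> starts returning <code>false</code>.</p>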
<p>Let's see how it all works for the producer:</p>
<pre><code class="lang-java"><span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">boolean</span> <span class="hljs-title">offer</span><span class="hljs-params">(E e)</span> </span>{
    <span class="hljs-comment">// Read producer's own index.</span>
    <span class="hljs-keyword">final</span> <span class="hljs-keyword">int</span> idx = producerIdx.getOpaque();
    <span class="hljs-keyword">int</span> nextIdx = idx + <span class="hljs-number">1</span>;
    <span class="hljs-keyword">if</span> (nextIdx == data.length) {
        nextIdx = <span class="hljs-number">0</span>;
    }
    <span class="hljs-comment">// Read the last seen consumer's index.</span>
    <span class="hljs-keyword">int</span> cachedIdx = consumerCachedIdx.getPlain();
    <span class="hljs-keyword">if</span> (nextIdx == cachedIdx) {
        <span class="hljs-comment">// If we have reached the known index, we need to read the current value.</span>
        cachedIdx = consumerIdx.getAcquire();
        <span class="hljs-comment">// Make sure to update the cached value.</span>
        consumerCachedIdx.setPlain(cachedIdx);
        <span class="hljs-keyword">if</span> (nextIdx == cachedIdx) {
            <span class="hljs-comment">// The queue is full.</span>
            <span class="hljs-keyword">return</span> <span class="hljs-keyword">false</span>;
        }
    }
    <span class="hljs-comment">// There is an empty slot, so we can insert the item.</span>
    data[idx] = e;
    <span class="hljs-comment">// Make sure to update our own index.</span>
    <span class="hljs-comment">// We use release semantics while the consumer has an acquire edge.</span>
    producerIdx.setRelease(nextIdx);
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">true</span>;
}
</code></pre>
<p>The above code uses <a target="_blank" href="https://puzpuzpuz.dev/using-acquirerelease-semantics-in-java-atomics-for-fun-and-profit">acquire/release semantics</a> to keep the emitted instructions as lightweight as possible from the memory barriers perspective. Other than that, the code does pretty much what we discussed before.</p>
<p>As already mentioned, the manipulations with the <code>consumerCachedIdx</code> field are important for the end performance. All reads and writes on this field are thread-local, i.e. only the producer thread accesses this field, so we don't need to use costly atomic operations. This reduces cache coherency traffic dramatically and lets the CPU core work on its own non-shared data in those cases when multiple empty slots are available in the queue.</p>
<p>Consumer's part of the picture may be seen in the full source code available <a target="_blank" href="https://github.com/puzpuzpuz/java-concurrency-samples/blob/6eb2c14e5cc7476a268606c94abb722c2e6f1e81/src/main/java/io/puzpuzpuz/queue/SpscBoundedQueue.java">here</a>.</p>
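<p>For a rough idea of how the consumer side may mirror <code>offer()</code>, here is a compact, unpadded sketch. It's not a copy of the linked class: the <code>SpscSketch</code> name, the <code>Object</code> element type, and the choice to allocate one extra slot are assumptions made to keep the example self-contained.</p>
<pre><code class="lang-java">import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; the linked SpscBoundedQueue remains the reference.
final class SpscSketch {
    private final Object[] data;
    private final AtomicInteger producerIdx = new AtomicInteger();
    private final AtomicInteger producerCachedIdx = new AtomicInteger();
    private final AtomicInteger consumerIdx = new AtomicInteger();
    private final AtomicInteger consumerCachedIdx = new AtomicInteger();

    SpscSketch(int size) {
        // Assumption: one slot always stays empty to tell "full" from "empty".
        data = new Object[size + 1];
    }

    // Same logic as the offer() method shown above.
    boolean offer(Object e) {
        final int idx = producerIdx.getOpaque();
        int nextIdx = idx + 1;
        if (nextIdx == data.length) {
            nextIdx = 0;
        }
        int cachedIdx = consumerCachedIdx.getPlain();
        if (nextIdx == cachedIdx) {
            cachedIdx = consumerIdx.getAcquire();
            consumerCachedIdx.setPlain(cachedIdx);
            if (nextIdx == cachedIdx) {
                return false; // The queue is full.
            }
        }
        data[idx] = e;
        producerIdx.setRelease(nextIdx);
        return true;
    }

    Object poll() {
        // Read consumer's own index.
        final int idx = consumerIdx.getOpaque();
        // Read the last seen producer's index.
        int cachedIdx = producerCachedIdx.getPlain();
        if (idx == cachedIdx) {
            // We have caught up with the known index, so refresh it.
            cachedIdx = producerIdx.getAcquire();
            producerCachedIdx.setPlain(cachedIdx);
            if (idx == cachedIdx) {
                return null; // The queue is empty.
            }
        }
        final Object e = data[idx];
        data[idx] = null; // Drop the reference so it can be collected.
        int nextIdx = idx + 1;
        if (nextIdx == data.length) {
            nextIdx = 0;
        }
        // Release pairs with the acquire read in offer().
        consumerIdx.setRelease(nextIdx);
        return e;
    }
}
</code></pre>
<p>Just like in <code>offer()</code>, the consumer touches the shared <code>producerIdx</code> field only when its cached copy says the queue looks empty.</p>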
<p>Finally, we're going to compare our queue with the good old <code>j.u.c.ArrayBlockingQueue</code> and a couple of SPSC queue implementations from the <a target="_blank" href="https://github.com/JCTools/JCTools">JCTools</a> library. If you're not familiar with JCTools and have never used it, I advise you to put it on your radar.</p>
<p>The benchmark we'll be running is available <a target="_blank" href="https://github.com/puzpuzpuz/java-concurrency-samples/blob/6eb2c14e5cc7476a268606c94abb722c2e6f1e81/src/test/java/io/puzpuzpuz/queue/SpscQueueBenchmark.java">here</a>. When run, it starts a couple of threads to play a ping-pong game. Each operation, a.k.a. a ping-pong round, assumes sending/receiving a single item over the SPSC queue combined with a bit of work done for each successful attempt.</p>
<p>Here is a reduced JMH benchmark output on my laptop running Ubuntu 20.04 and OpenJDK 17.0.4 64-bit:</p>
<pre><code>Benchmark                                            (type)   Mode  Cnt          Score          <span class="hljs-built_in">Error</span>  Units
SpscQueueBenchmark.group                         SPSC_QUEUE  thrpt    <span class="hljs-number">3</span>  <span class="hljs-number">107503612.612</span> ± <span class="hljs-number">16230253.288</span>  ops/s
SpscQueueBenchmark.group               ARRAY_BLOCKING_QUEUE  thrpt    <span class="hljs-number">3</span>    <span class="hljs-number">7158948.722</span> ±  <span class="hljs-number">8635350.468</span>  ops/s
SpscQueueBenchmark.group                      JCTOOLS_QUEUE  thrpt    <span class="hljs-number">3</span>  <span class="hljs-number">120533694.168</span> ±  <span class="hljs-number">4686758.722</span>  ops/s
SpscQueueBenchmark.group               JCTOOLS_ATOMIC_QUEUE  thrpt    <span class="hljs-number">3</span>  <span class="hljs-number">101704017.278</span> ± <span class="hljs-number">18252611.281</span>  ops/s
</code></pre><p>As expected, JCTools' queues and our own one are significantly faster than <code>ArrayBlockingQueue</code>. Also, surprisingly, our SPSC queue is on par with JCTools' queues, which is not something I was expecting, to be honest. Does it mean that you should go for an in-house implementation instead of JCTools? Not really. If you can afford 3rd-party dependencies, go for JCTools. JCTools' data structures are certainly more efficient, as well as much better tested and benchmarked than our toy queue. So, you'd have to spend quite some time reaching the same level of stability for a DIY queue.</p>
<p>Needless to say, this algorithm is nothing new. You may see it in this great <a target="_blank" href="https://rigtorp.se/ringbuffer/">blog post</a> by Erik Rigtorp, as well as recognize it in the <code>SPSequence</code> and <code>SCSequence</code> classes in QuestDB's <a target="_blank" href="https://github.com/questdb/questdb">source code</a>. Yet, I hope that this data structure will be a nice addition to your engineering toolkit. See you next time.</p>
]]></content:encoded></item><item><title><![CDATA[Using Acquire/Release Semantics in Java Atomics for Fun and Profit]]></title><description><![CDATA[In case you've missed it, recent JDK versions include new memory semantics for atomic operations available in VarHandle and Atomic* classes. These new semantics are equivalent to C/C++'s std::memory_order. The only confusing naming convention is that...]]></description><link>https://puzpuzpuz.dev/using-acquirerelease-semantics-in-java-atomics-for-fun-and-profit</link><guid isPermaLink="true">https://puzpuzpuz.dev/using-acquirerelease-semantics-in-java-atomics-for-fun-and-profit</guid><category><![CDATA[Java]]></category><category><![CDATA[concurrency]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Fri, 21 Jan 2022 15:52:26 GMT</pubDate><content:encoded><![CDATA[<p>In case you've missed it, recent JDK versions include new memory semantics for atomic operations available in <a target="_blank" href="https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.html"><code>VarHandle</code></a> and <code>Atomic*</code> classes. These new semantics are equivalent to C/C++'s <a target="_blank" href="https://en.cppreference.com/w/cpp/atomic/memory_order"><code>std::memory_order</code></a>. The only confusing naming convention is that <code>*Opaque</code> methods in Java map to the <code>memory_order_relaxed</code> memory order in C/C++. Other than that, the idea is the same - these semantics allow developers to use a weaker memory model than the default sequential consistency model, i.e. full memory barrier which is used in the old atomic methods. This can potentially improve the performance at the cost of more complex and, thus, less maintainable code.</p>
<p>Anyhow, we're not going to go through the basics of memory semantics. If you're not familiar with them, I'd recommend watching <a target="_blank" href="https://youtu.be/ZQFzMfHIxng">this talk</a> by Fedor Pikus where he does a great job at explaining C++'s <code>std::atomic</code>. As usual, today we'll be doing a weird and questionable experiment. We'll use acquire/release semantics to build a lossy (a.k.a. not-so-atomic) counter on top of <code>j.u.c.a.AtomicLong</code>.</p>
<p>Imagine that you need a rough order of magnitude counter in your application. Say, you want to measure the total number of operations performed mostly on a single thread, and in the case of concurrent execution, you're fine with losing some of the concurrent updates as long as the counter is incremented by at least one of the threads. Apart from the good old <code>AtomicLong#addAndGet()</code> method which would keep the counter truly atomic at the cost of a performance penalty under contention, there are some other well-known ways to achieve what we want here. To name a few, one way is the <code>j.u.c.a.LongAdder</code> class which implements a sharded atomic counter. Its downsides are the higher read cost and the memory footprint. Another approach might be to accumulate the number of operations in a thread-local counter and periodically flush them via the <code>AtomicLong#addAndGet()</code> call. That's certainly a viable way to build an eventually consistent atomic counter, but today we'll consider a simpler approach that comes at the cost of losing some concurrent increments.</p>
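<p>To make the "accumulate locally, flush periodically" alternative a bit more concrete, here is a minimal sketch. The <code>BatchedCounter</code> and <code>Local</code> names, as well as the flush threshold, are made up for illustration, and the sketch assumes that each thread owns exactly one <code>Local</code> instance.</p>
<pre><code class="lang-java">import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of an eventually consistent counter.
final class BatchedCounter {
    // Arbitrary value; a real application would tune it.
    private static final long FLUSH_THRESHOLD = 1_000;

    private final AtomicLong total = new AtomicLong();

    // Must be used by a single thread only: the pending field is not thread-safe.
    final class Local {
        private long pending;

        void increment() {
            if (++pending == FLUSH_THRESHOLD) {
                flush();
            }
        }

        // Publishes the locally accumulated increments with one atomic operation.
        void flush() {
            total.addAndGet(pending);
            pending = 0;
        }
    }

    Local newLocal() {
        return new Local();
    }

    // May lag behind by the increments that are not flushed yet.
    long get() {
        return total.get();
    }
}
</code></pre>
<p>Unlike the lossy counter we're about to build, this one doesn't lose increments; readers just observe them with a delay.</p>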
<p>You could say that the above example sounds artificial and you would be not far away from being absolutely correct. Nevertheless, the use case is good enough for today's experiment.</p>
<p>So, if we use acquire/release operations to build a lossy counter, we should get something like the following:</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">LossyCounter</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">AtomicLong</span> </span>{

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">long</span> <span class="hljs-title">addAndGetLossy</span><span class="hljs-params">(<span class="hljs-keyword">long</span> delta)</span> </span>{
        <span class="hljs-keyword">long</span> value = getAcquire();
        <span class="hljs-keyword">long</span> newValue = value + delta;
        setRelease(newValue);
        <span class="hljs-keyword">return</span> newValue;
    }
}
</code></pre>
<p>We're going to benchmark this counter against other approaches, including atomic increments. While this wouldn't be an apples-to-apples comparison in terms of the counter operation guarantees, our goal is to get some understanding of the performance implications for different semantics and types of atomic operations.</p>
<p>The JMH benchmark we're going to use may be found <a target="_blank" href="https://github.com/puzpuzpuz/java-concurrency-samples/blob/f5aaf0898408927918a16b649ddc8df54879957e/src/test/java/io/puzpuzpuz/atomic/LossyCounterBenchmark.java">here</a>. Our test stand is a laptop with i7-1185G7 x86-64 CPU with 4/8 cores running Ubuntu 20.04 and OpenJDK 17.0.1.</p>
<p>Let's first run the benchmark on a single thread:</p>
<pre><code><span class="hljs-attribute">Benchmark</span>                                      Mode  Cnt   Score   Error  Units
<span class="hljs-attribute">LossyCounterBenchmark</span>.testAtomicCas            avgt   <span class="hljs-number">10</span>  <span class="hljs-number">11</span>.<span class="hljs-number">740</span> ± <span class="hljs-number">0</span>.<span class="hljs-number">040</span>  ns/op
<span class="hljs-attribute">LossyCounterBenchmark</span>.testAtomicIncrement      avgt   <span class="hljs-number">10</span>   <span class="hljs-number">6</span>.<span class="hljs-number">509</span> ± <span class="hljs-number">0</span>.<span class="hljs-number">027</span>  ns/op
<span class="hljs-attribute">LossyCounterBenchmark</span>.testBaseline             avgt   <span class="hljs-number">10</span>   <span class="hljs-number">3</span>.<span class="hljs-number">454</span> ± <span class="hljs-number">0</span>.<span class="hljs-number">004</span>  ns/op
<span class="hljs-attribute">LossyCounterBenchmark</span>.testLossyAcquireRelease  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">3</span>.<span class="hljs-number">777</span> ± <span class="hljs-number">0</span>.<span class="hljs-number">158</span>  ns/op
<span class="hljs-attribute">LossyCounterBenchmark</span>.testLossyDefault         avgt   <span class="hljs-number">10</span>   <span class="hljs-number">8</span>.<span class="hljs-number">788</span> ± <span class="hljs-number">0</span>.<span class="hljs-number">016</span>  ns/op
</code></pre><p>The testAtomicCas result here stands for a <code>compareAndSet</code> loop which is an awful way to do atomic increments on a counter. Not a big surprise that it showed the worst result. Then, testAtomicIncrement stands for the <code>addAndGet</code> operation, the default way to build an atomic counter. The baseline is nothing more than random number generation which is done as a part of all other benchmarks. Finally, testLossyAcquireRelease is our lossy counter while testLossyDefault stands for the same counter, but with the default operation semantics.</p>
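<p>For reference, the counter flavors compared above may be sketched as follows. This is an illustration only: the <code>CounterVariants</code> class and its method names are hypothetical, and the actual benchmark code lives in the linked repository.</p>
<pre><code class="lang-java">import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper class; not the benchmark's code.
final class CounterVariants {

    // CAS loop: retries until no other thread changed the value in between.
    static long casAdd(AtomicLong counter, long delta) {
        for (;;) {
            long value = counter.get();
            long newValue = value + delta;
            if (counter.compareAndSet(value, newValue)) {
                return newValue;
            }
        }
    }

    // A single atomic read-modify-write operation.
    static long atomicAdd(AtomicLong counter, long delta) {
        return counter.addAndGet(delta);
    }

    // Lossy counter with the default (volatile) semantics for the load and the store.
    static long lossyDefaultAdd(AtomicLong counter, long delta) {
        long newValue = counter.get() + delta;
        counter.set(newValue);
        return newValue;
    }
}
</code></pre>
<p>On a single thread all three flavors produce the same values. They differ only in what happens when several threads update the counter at once.</p>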
<p>You may notice that our lossy counter adds almost nothing on top of the baseline and that's expected. The thing is that acquire/release semantics are a no-op on x86 when it comes to ordinary loads (<code>get</code>) and stores (<code>set</code>). Read this <a target="_blank" href="https://research.swtch.com/hwmm">blog post</a> from Russ Cox if you want to learn more about HW memory models.</p>
<p>Let's run the benchmark on 8 threads now:</p>
<pre><code><span class="hljs-attribute">Benchmark</span>                                      Mode  Cnt     Score    Error  Units
<span class="hljs-attribute">LossyCounterBenchmark</span>.testAtomicCas            avgt   <span class="hljs-number">10</span>  <span class="hljs-number">1163</span>.<span class="hljs-number">104</span> ± <span class="hljs-number">62</span>.<span class="hljs-number">158</span>  ns/op
<span class="hljs-attribute">LossyCounterBenchmark</span>.testAtomicIncrement      avgt   <span class="hljs-number">10</span>   <span class="hljs-number">139</span>.<span class="hljs-number">210</span> ±  <span class="hljs-number">0</span>.<span class="hljs-number">571</span>  ns/op
<span class="hljs-attribute">LossyCounterBenchmark</span>.testBaseline             avgt   <span class="hljs-number">10</span>     <span class="hljs-number">6</span>.<span class="hljs-number">549</span> ±  <span class="hljs-number">0</span>.<span class="hljs-number">029</span>  ns/op
<span class="hljs-attribute">LossyCounterBenchmark</span>.testLossyAcquireRelease  avgt   <span class="hljs-number">10</span>    <span class="hljs-number">20</span>.<span class="hljs-number">058</span> ±  <span class="hljs-number">0</span>.<span class="hljs-number">186</span>  ns/op
<span class="hljs-attribute">LossyCounterBenchmark</span>.testLossyDefault         avgt   <span class="hljs-number">10</span>   <span class="hljs-number">253</span>.<span class="hljs-number">887</span> ±  <span class="hljs-number">3</span>.<span class="hljs-number">668</span>  ns/op
</code></pre><p>As expected, the CAS-based counter is a terrible idea. The <code>addAndGet</code> (<code>LOCK XADD</code> on x86) atomic counter does a much better job. Of course, a <code>LongAdder</code>, being used to build an atomic counter, would do even better under contention, but we're not interested in atomic counters now.</p>
<p>Interestingly, the testLossyDefault counter is almost 2x slower than the atomic one. That should be explained by the price of two full memory barriers executed on each increment operation in that lossy counter. Finally, the acquire/release lossy counter is the undisputed winner of our unfair competition.</p>
<p>The above benchmark and the lossy counter approach should be taken with a grain of salt. My only intention was to demonstrate that weaker memory semantics may yield better performance of your code, at least in a niche use case. However, the performance advantage may be insignificant in your concrete application, yet it will certainly come at the cost of more complex and, thus, less maintainable code. So, be mindful when using the new memory semantics.</p>
<p>Next time we're going to build atomic memory snapshots based on a seqlock and discuss whether it's a good idea to do so. See you!</p>
]]></content:encoded></item><item><title><![CDATA[Benchmarking Non-shared Locks in Java]]></title><description><![CDATA[Last time we discussed scalability of j.u.c.l.ReentrantReadWriteLock and some alternatives. Some of the alternatives used a simple CAS (compare-and-swap) based spinlock as the internal writer lock. So, I was curious whether such custom spinlock makes...]]></description><link>https://puzpuzpuz.dev/benchmarking-non-shared-locks-in-java</link><guid isPermaLink="true">https://puzpuzpuz.dev/benchmarking-non-shared-locks-in-java</guid><category><![CDATA[Java]]></category><category><![CDATA[concurrency]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Sun, 09 Jan 2022 13:36:19 GMT</pubDate><content:encoded><![CDATA[<p><a target="_blank" href="https://puzpuzpuz.io/scalable-readers-writer-lock">Last time</a> we discussed scalability of <code>j.u.c.l.ReentrantReadWriteLock</code> and some alternatives. Some of the alternatives used a simple CAS (compare-and-swap) based spinlock as the internal writer lock. So, I was curious whether such custom spinlock makes sense against what we have in the standard library. This brief post is dedicated to benchmarking the <code>ReentrantLock</code> class against a number of other non-shared (exclusive) locks.</p>
<p>Before we go any further, I have to warn readers that the considered alternative lock implementations are not production-ready in any sense, so use them at your own risk. The below results were obtained on concrete HW and SW and may change a lot in a different environment or scenario. Needless to say, using a single lock on the hot path is usually a bad idea. I advise going with the standard library as the default choice. So, consider this post to be an unfair, non-scientific experiment done out of curiosity.</p>
<p>The full code of the benchmark and the custom locks is available in this <a target="_blank" href="https://github.com/puzpuzpuz/java-concurrency-samples">repo</a>.</p>
<h2 id="heading-competitors">Competitors</h2>
<p>Our first competitor is the <code>j.u.c.l.ReentrantLock</code> class, in both its unfair and fair flavors. The common wisdom claims that fair mode comes at a high cost. But what does it mean in practice? In theory, fair locks prevent thread starvation, so if the cost is reasonable, say, 2-3x, it might be a good idea to use it in certain use cases. That's why we're considering fair <code>ReentrantLock</code>.</p>
<p>The first custom lock implementation we're going to use is a primitive CAS-based spinlock:</p>
<pre><code class="lang-java"><span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">CasSpinLock</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">Lock</span> </span>{

    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> AtomicBoolean lock = <span class="hljs-keyword">new</span> AtomicBoolean();

    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">lock</span><span class="hljs-params">()</span> </span>{
        <span class="hljs-keyword">while</span> (!lock.compareAndSet(<span class="hljs-keyword">false</span>, <span class="hljs-keyword">true</span>)) {}
    }

    <span class="hljs-meta">@Override</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">unlock</span><span class="hljs-params">()</span> </span>{
        lock.set(<span class="hljs-keyword">false</span>);
    }

    <span class="hljs-comment">// ...</span>
}
</code></pre>
<p>This spinlock is as primitive as it could be. Apart from this basic version, we're also including a simple backoff version of it. To lower the contention for the boolean flag, the backoff version does <code>LockSupport.parkNanos(10)</code> in the loop body. This flavor of the CAS lock is exactly what we were using in the previous blog post.</p>
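<p>The backoff flavor may look roughly like the sketch below. The <code>BackoffCasSpinLock</code> name and the <code>demo()</code> helper are made up for this example, and the class doesn't implement the full <code>Lock</code> interface to stay short; see the repository for the real code.</p>
<pre><code class="lang-java">import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch of a CAS spinlock with a constant backoff.
final class BackoffCasSpinLock {
    private final AtomicBoolean lock = new AtomicBoolean();

    void lock() {
        while (!lock.compareAndSet(false, true)) {
            // Constant backoff to lower the contention for the flag.
            LockSupport.parkNanos(10);
        }
    }

    void unlock() {
        lock.set(false);
    }

    // Increments a plain counter from several threads under the lock.
    static long demo(int threads, int iterations) {
        final BackoffCasSpinLock spinLock = new BackoffCasSpinLock();
        final long[] counter = new long[1];
        final Thread[] workers = new Thread[threads];
        for (int t = 0; t != threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i != iterations; i++) {
                    spinLock.lock();
                    try {
                        counter[0]++;
                    } finally {
                        spinLock.unlock();
                    }
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter[0];
    }
}
</code></pre>
<p>Since every increment happens under the lock, <code>BackoffCasSpinLock.demo(4, 10_000)</code> should return exactly 40,000.</p>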
<p>The next spinlock is a test and test-and-set one. The main difference is in the <code>lock()</code> method:</p>
<pre><code class="lang-java"><span class="hljs-meta">@Override</span>
<span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">lock</span><span class="hljs-params">()</span> </span>{
    <span class="hljs-keyword">long</span> delay = MIN_DELAY;
    <span class="hljs-keyword">for</span> (;;) {
        <span class="hljs-comment">// busy spin until the lock is available</span>
        <span class="hljs-keyword">while</span> (lock.get()) {}
        <span class="hljs-comment">// try to acquire the lock</span>
        <span class="hljs-keyword">if</span> (!lock.getAndSet(<span class="hljs-keyword">true</span>)) {
            <span class="hljs-keyword">return</span>;
        }
        <span class="hljs-comment">// back off</span>
        LockSupport.parkNanos(delay);
        <span class="hljs-keyword">if</span> (delay &lt; MAX_DELAY) {
            delay *= <span class="hljs-number">2</span>;
        }
    }
}
</code></pre>
<p>This spinlock should do a better job than the CAS one in terms of cache coherence traffic - most of the time it reads the lock flag from the local core's cache line and only does an atomic test-and-set operation when it sees that the lock was just released. The lock also uses an exponential backoff technique to reduce the contention. The exponential backoff should do a better job than the constant backoff used in the previous lock. So, in theory, this lock has the chance to demonstrate better throughput than the CAS spinlock.</p>
<p>The next spinlock is the well-known <a target="_blank" href="https://en.wikipedia.org/wiki/Ticket_lock">ticket lock</a>. Again, in theory, it has a number of advantages over the previously listed spinlocks. The ticket lock is built on top of two atomic counters. The first one stands for ticket numbers: any thread attempting to acquire the lock increments the counter to get its ticket number. The second counter holds the currently served ticket: the thread with this value is considered to be the lock owner. When the lock owner releases the lock, it increments the currently served counter. Due to this design, the ticket lock provides fairness guarantees, i.e. it guarantees FIFO ordering of lock acquisition. Another advantage is that there is only a single counter increment and a busy-wait read on the second counter on the hot path. We're not going to focus on the source code, but you may find it <a target="_blank" href="https://github.com/puzpuzpuz/java-concurrency-samples/blob/6f6ff4311e3fb17fd7a8037f080dd351db9befc7/src/main/java/io/puzpuzpuz/lock/TicketSpinLock.java">here</a>. It's worth mentioning that ticket locks may be found in the <a target="_blank" href="https://lwn.net/Articles/267968/">Linux kernel</a>.</p>
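<p>For those who prefer code to prose, the two-counter scheme can be sketched in a few lines. This is a simplified illustration and not the linked <code>TicketSpinLock</code> class: the names are hypothetical, and the default (sequentially consistent) atomic operations are used for simplicity.</p>
<pre><code class="lang-java">import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a ticket spinlock.
final class TicketLockSketch {
    // The next ticket number to hand out.
    private final AtomicInteger nextTicket = new AtomicInteger();
    // The ticket number that currently owns the lock.
    private final AtomicInteger nowServing = new AtomicInteger();

    void lock() {
        // Take a ticket; tickets are served in FIFO order.
        final int ticket = nextTicket.getAndIncrement();
        // Busy-wait until our ticket is served.
        while (nowServing.get() != ticket) {
            Thread.onSpinWait();
        }
    }

    void unlock() {
        // Hand the lock over to the owner of the next ticket.
        nowServing.incrementAndGet();
    }
}
</code></pre>
<p>Notice that all waiters spin on the same <code>nowServing</code> counter, so each unlock invalidates that cache line on every waiting core.</p>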
<p>Last, but not least on our list is the <a target="_blank" href="http://web.mit.edu/6.173/www/currentsemester/readings/R06-scalable-synchronization-1991.pdf">MCS spinlock</a>. The algorithm is named after the authors. Just like the ticket spinlock, the MCS lock is a fair lock. The main idea behind it is to organize the waiter queue in a singly linked list where each waiting thread spins on its own node. When the current owner releases the lock, it updates a flag on the node belonging to the waiter at the head of the queue. Hence, the MCS spinlock is very efficient in terms of cache coherence traffic: the number of messages exchanged by the cores on each lock acquisition is O(1), unlike O(N) in the ticket lock. We're going to test two flavors of the MCS lock. The <a target="_blank" href="https://github.com/puzpuzpuz/java-concurrency-samples/blob/6f6ff4311e3fb17fd7a8037f080dd351db9befc7/src/main/java/io/puzpuzpuz/lock/McsSpinLock.java">first one</a> is not a spinlock since it uses the <code>LockSupport.park()</code>/<code>unpark()</code> facility to suspend and resume threads in the queue. This flavor should perform close to the fair <code>ReentrantLock</code>. That's because the standard class <a target="_blank" href="https://github.com/openjdk/jdk/blob/b3dbfc645283cb315016ec531ec41570ab3f75f1/src/java.base/share/classes/java/util/concurrent/locks/AbstractQueuedSynchronizer.java#L319">uses</a> the CLH algorithm to implement fair lock mode. The CLH lock uses an implicit linked list, but otherwise, it's close enough to the MCS lock. The second MCS lock flavor is a proper <a target="_blank" href="https://github.com/puzpuzpuz/java-concurrency-samples/blob/6f6ff4311e3fb17fd7a8037f080dd351db9befc7/src/main/java/io/puzpuzpuz/lock/McsSpinLock.java">spinlock</a>. Again, you may find a variation of the MCS spinlock in the <a target="_blank" href="https://lwn.net/Articles/590243/">Linux kernel</a>.</p>
<p>Since our test stand is not a NUMA machine, we're not considering NUMA-aware locks, such as hierarchical locks or lock cohorting.</p>
<h2 id="heading-benchmark">Benchmark</h2>
<p>The benchmark we're going to use focuses on the average execution time per lock -&gt; some work in the critical section -&gt; unlock chain of calls. Thus, we're interested in the throughput rather than latency distribution or power efficiency of the locks under test. Since we want to understand lock scalability properties, the work done in the critical section is kept short enough.</p>
<p>Here is the benchmark itself:</p>
<pre><code class="lang-java"><span class="hljs-meta">@Benchmark</span>
<span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">testLock</span><span class="hljs-params">(BenchmarkState state, Blackhole bh)</span> </span>{
    <span class="hljs-keyword">final</span> ThreadLocalRandom rnd = ThreadLocalRandom.current();
    state.lock.lock();
    <span class="hljs-comment">// emulate some work</span>
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; NUM_WORK_SPINS; i++) {
        <span class="hljs-comment">// access a counter (shared memory)</span>
        state.sum += rnd.nextInt();
    }
    bh.consume(state.sum);
    state.lock.unlock();
}
</code></pre>
<p>We're going to run this benchmark varying the number of threads from a single thread (no contention) to the number of available CPU cores (highest contention).</p>
<h2 id="heading-results">Results</h2>
<p>The below results were obtained on a GCP e2-highcpu-32 VM with 32 vCPUs (Intel Haswell) and 32 GB memory, running Ubuntu 20.04 and OpenJDK 17.0.1. The following chart represents all results. A text version of the results is also available <a target="_blank" href="https://gist.github.com/puzpuzpuz/5d47c42ec6f4bcbcf2372941baf0b37a">here</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1641735307727/Tx0X0PDHg.png" alt="Lock benchmark results" /></p>
<p>Notice that the vertical axis has a log scale and stands for the average operation latency in nanoseconds.</p>
<p>The first thing to notice is that the baseline (think, work done in the critical section) result at the very bottom of the chart is constant for any number of threads except for 32 threads where it gets 2x slower. Most likely that's because of the hyper-threading cores available on the VM. Hyper-threading sibling cores share arithmetic logic units (ALU) and since we're doing some number crunching in the critical section, that becomes the bottleneck when all cores are in use.</p>
<p>Next, the fair mode of <code>ReentrantLock</code> comes at a very high cost. In the 32 threads scenario, the difference in latency is 218x, hence two orders of magnitude. Anyone who uses fair <code>ReentrantLock</code> should be aware of potential performance implications. As we expected, the non-spinlock flavor of the MCS lock comes close to the fair <code>ReentrantLock</code>.</p>
<p>There are a number of outsiders among our hand-crafted spinlocks. The first one is the CAS spinlock without a backoff. It suffers heavily from contention between CAS operations on a single atomic flag (think, a cache line). Surprisingly, ticket and MCS spinlocks, which were very promising in theory, follow the basic CAS spinlock closely in terms of the average latency. Although the difference between the CAS spinlock and the MCS spinlock is 3x, the MCS spinlock is still far away from the group of winners.</p>
<p>Let's remove the outsiders from the chart and get rid of the log scale. This should help us when analyzing the winning group.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1641735328891/-JaLIQvM8.png" alt="Results for the winning group only" /></p>
<p>The first thing to note here is that a CAS spinlock with a primitive, constant time backoff implementation performs better than the TTAS spinlock. The latter has a slightly more complex code for the spinlock itself and definitely a more complex exponential backoff mechanism. Surprisingly, the CAS spinlock provides lower average latency on all thread counts, so it makes no sense to deal with the TTAS spinlock, at least in the considered benchmark scenario.</p>
<p>The second observation is that unfair <code>ReentrantLock</code> does a really great job overall. Our custom spinlocks show a better result with 5x lower average latency only when the benchmark is run on 2 threads while the standard lock wins on 8 threads and beyond. In the highest contention scenario, <code>ReentrantLock</code>'s latency is 26% lower than the CAS spinlock's one.</p>
<h2 id="heading-lessons-learned">Lessons learned</h2>
<p>Hopefully, this toy benchmark can serve as another argument to use the unfair <code>ReentrantLock</code> as the default choice in any Java application. The standard lock provides solid performance and can be beaten by a custom lock only in a concrete scenario. On the other hand, the fair <code>ReentrantLock</code> mode has to be used cautiously and only when you're certain that the fairness guarantee outweighs the performance impact.</p>
<p>As for the custom spinlock classes, it's not a one size fits all story. Depending on the concrete hardware and usage scenario a simpler lock may outperform more complex locks while in theory, it should be the other way around.</p>
]]></content:encoded></item><item><title><![CDATA[Scalable Readers-Writer Lock]]></title><description><![CDATA[Locks, or mutexes (mutual exclusions), are one of the most basic concurrency primitives. It's hard to find a developer who won't be able to explain a mutex, at least on the fundamental level. Yet, mutexes are more than that. They may be:

OS-level (t...]]></description><link>https://puzpuzpuz.dev/scalable-readers-writer-lock</link><guid isPermaLink="true">https://puzpuzpuz.dev/scalable-readers-writer-lock</guid><category><![CDATA[Java]]></category><category><![CDATA[concurrency]]></category><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Sun, 02 Jan 2022 11:30:39 GMT</pubDate><content:encoded><![CDATA[<p>Locks, or mutexes (mutual exclusions), are one of the most basic concurrency primitives. It's hard to find a developer who won't be able to explain a mutex, at least on the fundamental level. Yet, mutexes are more than that. They may be:</p>
<ul>
<li>OS-level (think, a pthread mutex) or user-land (think, a spinlock),</li>
<li>expose pessimistic (blocking) or optimistic (non-blocking) locking API,</li>
<li>provide fairness in lock acquisition or keep things unfair,</li>
<li>support reentrant calls, or prefer to be non-reentrant,</li>
<li>have a notion of asymmetry in locking (say, with a shared lock available for readers) or stick to symmetric, exclusive locking,</li>
<li>strictly require unlocking on the same thread (a pthread mutex, once again) or prefer not to bother with unlocker's identity (<code>sync.Mutex</code> in Golang),</li>
<li>support time-based cancellation for locking attempts or have non-abortable calls only.</li>
</ul>
<p>Today we focus on asymmetric, readers-writer locks which are <a target="_blank" href="https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReadWriteLock.html">familiar</a> to most Java developers. Such locks allow concurrent readers to proceed with executing their critical section, while writers are guaranteed to acquire exclusive ownership of the lock. These locks are used in scenarios where the vast majority of calls come from readers and writers acquire the lock rather infrequently.</p>
<p>Our ultimate goal is to come up with a lock implementation that scales reader operations linearly with the CPU core count, and to compare the result with alternatives such as the <a target="_blank" href="https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReentrantReadWriteLock.html"><code>ReentrantReadWriteLock</code></a> class from the standard library.</p>
<h2 id="heading-prior-art">Prior art</h2>
<p>Readers-writer (R/W) locks are not something new. A widespread R/W lock implementation uses an atomic counter for the reader part and looks something like the <a target="_blank" href="https://github.com/puzpuzpuz/questdb/blob/f817ed19b205be383d9556b64c8e1ac96a5f377d/core/src/main/java/io/questdb/std/SimpleReadWriteLock.java">following class</a> from the QuestDB code base.</p>
<p>Locking for a reader in this class looks like this:</p>
<pre><code class="lang-java"><span class="hljs-meta">@Override</span>
<span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">lock</span><span class="hljs-params">()</span> </span>{
    <span class="hljs-comment">// start a lock attempt</span>
    <span class="hljs-keyword">while</span> (nReaders.incrementAndGet() &gt;= MAX_READERS) {
        <span class="hljs-comment">// there is a writer owning the lock, so clean up, sleep and go for another spin</span>
        nReaders.decrementAndGet();
        LockSupport.parkNanos(<span class="hljs-number">10</span>);
    }
}
</code></pre>
<p><em>Note.</em> If you run your code on Windows, you may <a target="_blank" href="https://hazelcast.com/blog/locksupport-parknanos-under-the-hood-and-the-curious-case-of-parking-part-ii-windows/">face</a> latency issues with <code>LockSupport.parkNanos()</code>. So, make sure to do some benchmarking before using any of the locks we cover today.</p>
<p>Here <code>nReaders</code> is an <code>AtomicInteger</code> used as a medium between readers and the writer. Each reader increments the counter atomically and checks the resulting value. If it's smaller than the threshold, the reader lock is acquired successfully. If not, the reader has to undo its increment, back off for a moment, and retry (in fact, it could sleep and wait for a notification from the writer, but that would slightly increase the latency). The writer, on the other hand, does the following to acquire the lock:</p>
<pre><code class="lang-java"><span class="hljs-meta">@Override</span>
<span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">lock</span><span class="hljs-params">()</span> </span>{
    <span class="hljs-comment">// trimmed code that acquires the internal writer lock:</span>
    <span class="hljs-comment">// ...</span>

    <span class="hljs-comment">// increment the readers counter by the threshold</span>
    <span class="hljs-keyword">int</span> n = nReaders.addAndGet(MAX_READERS);
    <span class="hljs-comment">// wait until there are no readers holding a lock</span>
    <span class="hljs-keyword">while</span> (n != MAX_READERS) {
        n = nReaders.get();
    }
}
</code></pre>
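<p>To see both halves of this design side by side, here is a minimal, self-contained sketch of a single-counter lock. It is not the QuestDB class: the class name, the <code>MAX_READERS</code> value, the CAS-based internal writer lock, both unlock methods and the two diagnostic accessors are assumptions made for illustration.</p>

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Illustrative sketch only; names and constants are assumptions, not QuestDB code.
final class CounterRwLockSketch {
    // Assumed threshold: any counter value at or above it means "a writer is present".
    private static final int MAX_READERS = 1 << 30;

    private final AtomicInteger nReaders = new AtomicInteger();
    // Assumed internal writer lock: a plain CAS flag that serializes writers.
    private final AtomicBoolean writerLock = new AtomicBoolean();

    void readLock() {
        while (nReaders.incrementAndGet() >= MAX_READERS) {
            // A writer is present: undo the increment, back off, retry.
            nReaders.decrementAndGet();
            LockSupport.parkNanos(10);
        }
    }

    void readUnlock() {
        nReaders.decrementAndGet();
    }

    void writeLock() {
        // Only one writer at a time may bump the counter by the threshold.
        while (!writerLock.compareAndSet(false, true)) {
            LockSupport.parkNanos(10);
        }
        int n = nReaders.addAndGet(MAX_READERS);
        // Wait until all readers that got in earlier have left.
        while (n != MAX_READERS) {
            Thread.onSpinWait();
            n = nReaders.get();
        }
    }

    void writeUnlock() {
        nReaders.addAndGet(-MAX_READERS);
        writerLock.set(false);
    }

    // Hypothetical diagnostic accessors, not part of the original class.
    int activeReaders() {
        return nReaders.get() % MAX_READERS;
    }

    boolean writerPresent() {
        return nReaders.get() >= MAX_READERS;
    }
}
```

<p>A reader wraps its critical section in <code>readLock()</code>/<code>readUnlock()</code>, preferably with <code>try</code>/<code>finally</code>. Note that every reader, on every call, writes to the same <code>nReaders</code> field.</p>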
<p>It is important to stress that this lock is non-reentrant. As previously mentioned, this readers-writer lock design is quite popular and you may find it in, say, the Go standard library's <a target="_blank" href="https://github.com/golang/go/blob/b357b05b70d2b8c4988ac2a27f2af176e7a09e1b/src/sync/rwmutex.go#L56-L69">sync.RWMutex</a> struct. The main problem with this approach is that its reader part doesn't scale. This means that if the time spent in the critical section by each reader is rather low, adding more threads (and cores) to the program may not lead to improved performance.</p>
<p>Let's demonstrate this. Our test stand is a laptop with an i7-1185G7 CPU (4 cores, 8 hardware threads) running Ubuntu 20.04 and OpenJDK 17.0.1. The JMH microbenchmark we're going to use may be found <a target="_blank" href="https://github.com/puzpuzpuz/questdb/blob/f817ed19b205be383d9556b64c8e1ac96a5f377d/benchmarks/src/main/java/org/questdb/ReadWriteLockBenchmark.java">here</a>.</p>
<pre><code><span class="hljs-attribute">Benchmark</span>                            (type)  Mode  Cnt    Score   Error  Units
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testBaseline     N/A  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">21</span>.<span class="hljs-number">506</span> ± <span class="hljs-number">0</span>.<span class="hljs-number">784</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testLock<span class="hljs-number">2</span>     SIMPLE  avgt   <span class="hljs-number">10</span>  <span class="hljs-number">101</span>.<span class="hljs-number">476</span> ± <span class="hljs-number">2</span>.<span class="hljs-number">403</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testLock<span class="hljs-number">4</span>     SIMPLE  avgt   <span class="hljs-number">10</span>  <span class="hljs-number">205</span>.<span class="hljs-number">656</span> ± <span class="hljs-number">1</span>.<span class="hljs-number">970</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testLock<span class="hljs-number">8</span>     SIMPLE  avgt   <span class="hljs-number">10</span>  <span class="hljs-number">296</span>.<span class="hljs-number">461</span> ± <span class="hljs-number">0</span>.<span class="hljs-number">735</span>  ns/op
</code></pre><p>Here the <code>testBaseline</code> benchmark stands for the baseline, i.e. the work done in the critical section in the main benchmark. As for <code>testLockN</code>, it is the reader lock benchmark run on N threads.</p>
<p>You may have already noticed the almost linear degradation in the average operation time when we increase the number of threads. That's because the hot path in the microbenchmark boils down to an atomic increment instruction available in modern CPUs, e.g. <code>LOCK XADD</code> on x86. This instruction requires exclusive access to the counter (the corresponding cache line, to be more precise), acquired by the CPU core executing the caller thread. Hence, the increment is still restricted to a single core at a time and, on top of that, the synchronization cost paid by each core increases significantly under contention. Refer to this comprehensive <a target="_blank" href="https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html">blog post</a> by Travis Downs to learn more about the cost of concurrency primitives on modern hardware.</p>
<p>If we add the <code>LinuxPerfProfiler</code> JMH profiler and re-run the benchmark to get <a target="_blank" href="https://man7.org/linux/man-pages/man1/perf-stat.1.html"><code>perf stat</code></a> output for the benchmark run on 2 threads, we'll see the following:</p>
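<p>The effect is easy to observe outside of JMH as well. The sketch below is not the benchmark from this post: it is a rough timing loop (no warm-up, no statistics) in which several threads increment one shared <code>AtomicLong</code>. For contrast, it runs the same loop against the JDK's <code>LongAdder</code>, which spreads increments over several internal cells. The class and method names are made up for illustration, and the absolute numbers will differ from machine to machine.</p>

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch only; a crude timing loop, not a proper benchmark.
final class CounterContentionSketch {

    // Starts `threads` workers, each calling `increment` `iterations` times,
    // and returns the elapsed wall-clock time in nanoseconds.
    static long run(int threads, int iterations, Runnable increment) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < iterations; j++) {
                    increment.run();
                }
            });
        }
        long start = System.nanoTime();
        for (Thread worker : workers) {
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return System.nanoTime() - start;
    }

    // Prints a rough per-operation time for 1, 2, 4 and 8 threads.
    static void report(int iterations) {
        for (int threads = 1; threads <= 8; threads *= 2) {
            AtomicLong shared = new AtomicLong();   // one cache line for everyone
            LongAdder striped = new LongAdder();    // increments spread over cells
            long sharedNs = run(threads, iterations, shared::incrementAndGet);
            long stripedNs = run(threads, iterations, striped::increment);
            System.out.printf("%d thread(s): AtomicLong %.1f ns/op, LongAdder %.1f ns/op%n",
                    threads, (double) sharedNs / iterations, (double) stripedNs / iterations);
        }
    }
}
```

<p>Calling <code>CounterContentionSketch.report(10_000_000)</code> prints one line per thread count. Treat the output as a rough indication only; JMH remains the right tool for real measurements.</p>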
<pre><code><span class="hljs-attr">Perf stats:</span>
<span class="hljs-string">--------------------------------------------------</span>

        <span class="hljs-number">199</span> <span class="hljs-number">010</span><span class="hljs-string">,83</span> <span class="hljs-string">msec</span> <span class="hljs-string">task-clock</span>                <span class="hljs-comment">#    1,521 CPUs utilized          </span>
             <span class="hljs-number">3</span> <span class="hljs-number">887</span>      <span class="hljs-string">context-switches</span>          <span class="hljs-comment">#    0,020 K/sec                  </span>
               <span class="hljs-number">223</span>      <span class="hljs-string">cpu-migrations</span>            <span class="hljs-comment">#    0,001 K/sec                  </span>
               <span class="hljs-number">255</span>      <span class="hljs-string">page-faults</span>               <span class="hljs-comment">#    0,001 K/sec                  </span>
   <span class="hljs-number">614</span> <span class="hljs-number">120</span> <span class="hljs-number">444</span> <span class="hljs-number">917</span>      <span class="hljs-string">cycles</span>                    <span class="hljs-comment">#    3,086 GHz                      (41,68%)</span>
   <span class="hljs-number">418</span> <span class="hljs-number">345</span> <span class="hljs-number">242</span> <span class="hljs-number">445</span>      <span class="hljs-string">instructions</span>              <span class="hljs-comment">#    0,68  insn per cycle           (50,02%)</span>
    <span class="hljs-number">21</span> <span class="hljs-number">283</span> <span class="hljs-number">383</span> <span class="hljs-number">788</span>      <span class="hljs-string">branches</span>                  <span class="hljs-comment">#  106,946 M/sec                    (58,35%)</span>
         <span class="hljs-number">2</span> <span class="hljs-number">027</span> <span class="hljs-number">813</span>      <span class="hljs-string">branch-misses</span>             <span class="hljs-comment">#    0,01% of all branches          (66,69%)</span>
    <span class="hljs-number">57</span> <span class="hljs-number">388</span> <span class="hljs-number">541</span> <span class="hljs-number">249</span>      <span class="hljs-string">L1-dcache-loads</span>           <span class="hljs-comment">#  288,369 M/sec                    (66,69%)</span>
     <span class="hljs-number">2</span> <span class="hljs-number">080</span> <span class="hljs-number">486</span> <span class="hljs-number">452</span>      <span class="hljs-string">L1-dcache-load-misses</span>     <span class="hljs-comment">#    3,63% of all L1-dcache accesses  (66,68%)</span>
         <span class="hljs-number">2</span> <span class="hljs-number">462</span> <span class="hljs-number">215</span>      <span class="hljs-string">LLC-loads</span>                 <span class="hljs-comment">#    0,012 M/sec                    (66,68%)</span>
           <span class="hljs-number">312</span> <span class="hljs-number">884</span>      <span class="hljs-string">LLC-load-misses</span>           <span class="hljs-comment">#   12,71% of all LL-cache accesses  (66,69%)</span>
   <span class="hljs-string">&lt;not</span> <span class="hljs-string">supported&gt;</span>      <span class="hljs-string">L1-icache-loads</span>                                             
        <span class="hljs-number">20</span> <span class="hljs-number">972</span> <span class="hljs-number">526</span>      <span class="hljs-string">L1-icache-load-misses</span>                                         <span class="hljs-string">(33,32%)</span>
    <span class="hljs-number">57</span> <span class="hljs-number">430</span> <span class="hljs-number">817</span> <span class="hljs-number">101</span>      <span class="hljs-string">dTLB-loads</span>                <span class="hljs-comment">#  288,581 M/sec                    (33,33%)</span>
            <span class="hljs-number">63</span> <span class="hljs-number">230</span>      <span class="hljs-string">dTLB-load-misses</span>          <span class="hljs-comment">#    0,00% of all dTLB cache accesses  (33,34%)</span>
   <span class="hljs-string">&lt;not</span> <span class="hljs-string">supported&gt;</span>      <span class="hljs-string">iTLB-loads</span>                                                  
           <span class="hljs-number">116</span> <span class="hljs-number">372</span>      <span class="hljs-string">iTLB-load-misses</span>                                              <span class="hljs-string">(33,33%)</span>
   <span class="hljs-string">&lt;not</span> <span class="hljs-string">supported&gt;</span>      <span class="hljs-string">L1-dcache-prefetches</span>                                        
   <span class="hljs-string">&lt;not</span> <span class="hljs-string">supported&gt;</span>      <span class="hljs-string">L1-dcache-prefetch-misses</span>                                   

     <span class="hljs-number">130</span><span class="hljs-string">,817291871</span> <span class="hljs-string">seconds</span> <span class="hljs-string">time</span> <span class="hljs-string">elapsed</span>

     <span class="hljs-number">260</span><span class="hljs-string">,986586000</span> <span class="hljs-string">seconds</span> <span class="hljs-string">user</span>
       <span class="hljs-number">0</span><span class="hljs-string">,210040000</span> <span class="hljs-string">seconds</span> <span class="hljs-string">sys</span>
</code></pre><p>Now, if we run the benchmark on 8 threads, the result would be:</p>
<pre><code><span class="hljs-attr">Perf stats:</span>
<span class="hljs-string">--------------------------------------------------</span>

        <span class="hljs-number">784</span> <span class="hljs-number">673</span><span class="hljs-string">,96</span> <span class="hljs-string">msec</span> <span class="hljs-string">task-clock</span>                <span class="hljs-comment">#    5,996 CPUs utilized          </span>
           <span class="hljs-number">340</span> <span class="hljs-number">290</span>      <span class="hljs-string">context-switches</span>          <span class="hljs-comment">#    0,434 K/sec                  </span>
               <span class="hljs-number">212</span>      <span class="hljs-string">cpu-migrations</span>            <span class="hljs-comment">#    0,000 K/sec                  </span>
               <span class="hljs-number">253</span>      <span class="hljs-string">page-faults</span>               <span class="hljs-comment">#    0,000 K/sec                  </span>
 <span class="hljs-number">2</span> <span class="hljs-number">418</span> <span class="hljs-number">749</span> <span class="hljs-number">827</span> <span class="hljs-number">832</span>      <span class="hljs-string">cycles</span>                    <span class="hljs-comment">#    3,082 GHz                      (41,66%)</span>
   <span class="hljs-number">394</span> <span class="hljs-number">531</span> <span class="hljs-number">732</span> <span class="hljs-number">276</span>      <span class="hljs-string">instructions</span>              <span class="hljs-comment">#    0,16  insn per cycle           (50,01%)</span>
    <span class="hljs-number">20</span> <span class="hljs-number">804</span> <span class="hljs-number">674</span> <span class="hljs-number">516</span>      <span class="hljs-string">branches</span>                  <span class="hljs-comment">#   26,514 M/sec                    (58,35%)</span>
        <span class="hljs-number">11</span> <span class="hljs-number">107</span> <span class="hljs-number">908</span>      <span class="hljs-string">branch-misses</span>             <span class="hljs-comment">#    0,05% of all branches          (66,67%)</span>
    <span class="hljs-number">54</span> <span class="hljs-number">747</span> <span class="hljs-number">118</span> <span class="hljs-number">838</span>      <span class="hljs-string">L1-dcache-loads</span>           <span class="hljs-comment">#   69,771 M/sec                    (66,68%)</span>
     <span class="hljs-number">2</span> <span class="hljs-number">413</span> <span class="hljs-number">557</span> <span class="hljs-number">669</span>      <span class="hljs-string">L1-dcache-load-misses</span>     <span class="hljs-comment">#    4,41% of all L1-dcache accesses  (66,68%)</span>
       <span class="hljs-number">582</span> <span class="hljs-number">485</span> <span class="hljs-number">788</span>      <span class="hljs-string">LLC-loads</span>                 <span class="hljs-comment">#    0,742 M/sec                    (66,67%)</span>
           <span class="hljs-number">484</span> <span class="hljs-number">122</span>      <span class="hljs-string">LLC-load-misses</span>           <span class="hljs-comment">#    0,08% of all LL-cache accesses  (66,66%)</span>
   <span class="hljs-string">&lt;not</span> <span class="hljs-string">supported&gt;</span>      <span class="hljs-string">L1-icache-loads</span>                                             
       <span class="hljs-number">157</span> <span class="hljs-number">690</span> <span class="hljs-number">988</span>      <span class="hljs-string">L1-icache-load-misses</span>                                         <span class="hljs-string">(33,32%)</span>
    <span class="hljs-number">54</span> <span class="hljs-number">789</span> <span class="hljs-number">336</span> <span class="hljs-number">429</span>      <span class="hljs-string">dTLB-loads</span>                <span class="hljs-comment">#   69,824 M/sec                    (33,33%)</span>
           <span class="hljs-number">504</span> <span class="hljs-number">440</span>      <span class="hljs-string">dTLB-load-misses</span>          <span class="hljs-comment">#    0,00% of all dTLB cache accesses  (33,33%)</span>
   <span class="hljs-string">&lt;not</span> <span class="hljs-string">supported&gt;</span>      <span class="hljs-string">iTLB-loads</span>                                                  
         <span class="hljs-number">3</span> <span class="hljs-number">304</span> <span class="hljs-number">265</span>      <span class="hljs-string">iTLB-load-misses</span>                                              <span class="hljs-string">(33,34%)</span>
   <span class="hljs-string">&lt;not</span> <span class="hljs-string">supported&gt;</span>      <span class="hljs-string">L1-dcache-prefetches</span>                                        
   <span class="hljs-string">&lt;not</span> <span class="hljs-string">supported&gt;</span>      <span class="hljs-string">L1-dcache-prefetch-misses</span>                                   

     <span class="hljs-number">130</span><span class="hljs-string">,864386377</span> <span class="hljs-string">seconds</span> <span class="hljs-string">time</span> <span class="hljs-string">elapsed</span>

    <span class="hljs-number">1025</span><span class="hljs-string">,676317000</span> <span class="hljs-string">seconds</span> <span class="hljs-string">user</span>
       <span class="hljs-number">2</span><span class="hljs-string">,552276000</span> <span class="hljs-string">seconds</span> <span class="hljs-string">sys</span>
</code></pre><p>The previously mentioned problem with contention over a single atomic counter primarily manifests itself in a significant degradation of the instructions per cycle (IPC) metric: it goes from 0,68 insn/cycle for 2 threads, which is already quite low, to a pitiful 0,16 insn/cycle for 8 threads. Other metrics also degrade proportionally. That's because each CPU core's backend spends a lot of time synchronizing the counter's cache line with other cores.</p>
<p>To be precise, low IPC doesn't necessarily mean contention caused by atomic instructions. It may have other causes, such as random memory accesses on a data structure that doesn't fit into the CPU caches. If you'd like to detect contention on atomics specifically, you should profile for specific PMU (Performance Monitoring Unit) events as Travis Downs <a target="_blank" href="https://twitter.com/trav_downs/status/1466472385834590211?s=20">pointed out</a>.</p>
<p>To nail down the single atomic counter-based lock topic, let's try to compare it with the standard <code>j.u.c.l.ReentrantReadWriteLock</code> class. We're going to use the same benchmark running on 8 threads:</p>
<pre><code><span class="hljs-attribute">Benchmark</span>                        (type)  Mode  Cnt     Score    Error  Units
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testLock     JUC  avgt   <span class="hljs-number">10</span>  <span class="hljs-number">2081</span>.<span class="hljs-number">664</span> ± <span class="hljs-number">15</span>.<span class="hljs-number">117</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testLock  SIMPLE  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">289</span>.<span class="hljs-number">155</span> ±  <span class="hljs-number">3</span>.<span class="hljs-number">339</span>  ns/op
</code></pre><p>Here the <code>JUC</code> type stands for the <code>ReentrantReadWriteLock</code> class and <code>SIMPLE</code> means the single atomic counter lock. The difference is significant, around 7x. Moreover, if we activated the GC profiler in the benchmark, we would see that the standard class allocates around 200 MB/sec (not the end of the world, but could be avoided), while the atomic counter lock does not allocate at all. So, if you have a concrete use case and you really know what you're doing, using a custom lock might be a good idea.</p>
<p>Long story short, a single atomic counter-based lock might do a better job than the standard Java library, but it has an important flaw. The thing is that it assumes contention between readers while it's not necessary at all. Readers need to share memory (think, synchronize) with the writer only, not with each other. And, of course, there are algorithms that try to address the reader scalability problem. Let's quickly discuss the alternatives.</p>
<p>The first alternative I'm aware of is Dmitry Vyukov's distributed reader-writer <a target="_blank" href="https://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/distributed-reader-writer-mutex">mutex</a>. It's written in C++, but it should be straightforward to port it to Java. The main idea is to shard the reader counter into an array of atomic counters. The size of the array is set at runtime to match the number of available cores. Each reader is associated with a slot in the array based on the id of the core running the reader thread. The id is obtained via the <a target="_blank" href="https://man7.org/linux/man-pages/man2/getcpu.2.html"><code>getcpu()</code></a> system call available on Linux, so porting it to other OSes is problematic. Another problem is that threads may migrate between cores unless you go with thread affinity. To address this problem, D. Vyukov's lock provides the core id as the return value of the reader's <code>lock()</code> method. This value has to be provided later when the reader calls <code>unlock()</code>. So, while this lock has potential, it's certainly not general-purpose.</p>
<p>Another worthwhile scalable readers-writer lock I know of is the BRAVO lock. The idea is quite close to D. Vyukov's class, yet the sharded counters array is fixed-size, and the reader's slot is determined based on the thread id. The second part is not a hard requirement and it's possible to implement the BRAVO algorithm in, say, <a target="_blank" href="https://github.com/puzpuzpuz/xsync/blob/efd8a81aa9261ce9d19dc9c02a9a87e8f34d8e93/rbmutex.go">Golang</a> which doesn't expose goroutine or thread id by design. An implementation in Java is available <a target="_blank" href="https://github.com/puzpuzpuz/questdb/blob/f817ed19b205be383d9556b64c8e1ac96a5f377d/core/src/main/java/io/questdb/std/BiasedReadWriteLock.java">here</a>. We're going to benchmark it against the single atomic counter lock now. If you're interested in learning more about the BRAVO lock, refer to the <a target="_blank" href="https://arxiv.org/pdf/1810.01553.pdf">original paper</a>.</p>
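<p>The "return the slot from <code>lock()</code>, pass it back to <code>unlock()</code>" idea can be sketched in Java as follows. This is not a port of D. Vyukov's code. Since plain Java has no <code>getcpu()</code>, the slot is derived from the thread id as a stand-in. The class name, the CAS-based writer flag and the accessors are assumptions, and the slots are not padded, so neighbouring slots may share a cache line.</p>

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.concurrent.locks.LockSupport;

// Illustrative sketch only; not D. Vyukov's algorithm or code.
final class ShardedRwLockSketch {
    // One reader counter per available core, sized at construction time.
    private final AtomicIntegerArray readers =
            new AtomicIntegerArray(Runtime.getRuntime().availableProcessors());
    // Assumed writer flag; readers only read it.
    private final AtomicBoolean writer = new AtomicBoolean();

    // Returns the slot used, which the caller must hand back to readUnlock().
    int readLock() {
        // Stand-in for the core id: the thread id modulo the number of slots.
        int slot = (int) (Thread.currentThread().getId() % readers.length());
        for (;;) {
            if (!writer.get()) {
                readers.incrementAndGet(slot);
                if (!writer.get()) {
                    return slot;
                }
                // A writer showed up in between: undo and retry.
                readers.decrementAndGet(slot);
            }
            LockSupport.parkNanos(10);
        }
    }

    void readUnlock(int slot) {
        readers.decrementAndGet(slot);
    }

    void writeLock() {
        while (!writer.compareAndSet(false, true)) {
            LockSupport.parkNanos(10);
        }
        // Wait for the readers in every slot to drain.
        for (int i = 0; i < readers.length(); i++) {
            while (readers.get(i) != 0) {
                LockSupport.parkNanos(10);
            }
        }
    }

    void writeUnlock() {
        writer.set(false);
    }

    // Hypothetical diagnostic accessors.
    int slots() {
        return readers.length();
    }

    int readersAt(int slot) {
        return readers.get(slot);
    }
}
```

<p>Because the slot travels with the caller, the unlock always decrements the counter that the lock incremented, even if the thread has moved to another core in the meantime.</p>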
<p>Here is how BRAVO lock does in the benchmark run on 8 threads:</p>
<pre><code><span class="hljs-attribute">Benchmark</span>                          (type)  Mode  Cnt    Score   Error  Units
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testLock    SIMPLE  avgt   <span class="hljs-number">10</span>  <span class="hljs-number">296</span>.<span class="hljs-number">300</span> ± <span class="hljs-number">1</span>.<span class="hljs-number">824</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testLock    BIASED  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">53</span>.<span class="hljs-number">811</span> ± <span class="hljs-number">1</span>.<span class="hljs-number">548</span>  ns/op
</code></pre><p>The BIASED type from the above JMH output is the BRAVO lock while the SIMPLE type is the same atomic counter lock we previously discussed. As expected, the BRAVO lock solves the scalability issue for readers, but it also has some problematic parts. First, the size of the array has to be large enough to avoid reader contention. The Java class we benchmarked sets the array size to 4096, which means 16KB of memory. Ideally, the array size should be based on the hardware that runs your code. Second, even if the array is properly sized, readers may still end up in the same slot. In fact, landing in adjacent slots is enough to cause contention, as long as those slots occupy the same cache line. That's because the slot assignment applies a hash function to the thread id and, due to that, does not guarantee the absence of hash code collisions.</p>
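<p>The arithmetic behind the cache line remark can be spelled out in a few lines. The sketch below assumes 4-byte slots (which is what 4096 slots in 16KB implies) and a 64-byte cache line. The mixing hash is a made-up example and not the one used by the benchmarked class.</p>

```java
// Illustrative sketch only; the hash and the cache line size are assumptions.
final class SlotMathSketch {
    static final int SLOTS = 4096;           // array size quoted in the post
    static final int SLOT_BYTES = 4;         // assumption: 4096 slots * 4 bytes = 16KB
    static final int CACHE_LINE_BYTES = 64;  // assumption: common x86 cache line size

    // Hypothetical slot assignment: multiply by a large odd constant
    // and keep the top 12 bits, which gives a value in [0, 4095].
    static int slotFor(long threadId) {
        long mixed = threadId * 0x9E3779B97F4A7C15L;
        return (int) (mixed >>> 52);
    }

    // Index of the cache line that holds the given slot.
    static int cacheLineOf(int slot) {
        return slot * SLOT_BYTES / CACHE_LINE_BYTES;
    }

    // Two readers interfere if their slots share a cache line,
    // even when the slots themselves are different.
    static boolean shareCacheLine(long threadIdA, long threadIdB) {
        return cacheLineOf(slotFor(threadIdA)) == cacheLineOf(slotFor(threadIdB));
    }
}
```

<p>Under these assumptions, 16 slots fit into one cache line, so the 4096 slots cover only 256 distinct lines. That is the number that matters when estimating how likely two reader threads are to interfere.</p>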
<p>So, both alternatives have certain cons and leave enough space for improvements. That's exactly what we came for.</p>
<h2 id="heading-meet-tlbiasedreadwritelock">Meet TLBiasedReadWriteLock</h2>
<p>Yes, naming is not my strong suit. <code>TLBiasedReadWriteLock</code> stands for thread-local reader-biased lock. As the name suggests, <a target="_blank" href="https://github.com/puzpuzpuz/questdb/blob/f817ed19b205be383d9556b64c8e1ac96a5f377d/core/src/main/java/io/questdb/std/TLBiasedReadWriteLock.java">the class</a> uses a thread-local counter for each reader, while the writer lock is based on a spinlock.</p>
<p>Simplified reader's lock method looks like the following:</p>
<pre><code class="lang-java"><span class="hljs-meta">@Override</span>
<span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">lock</span><span class="hljs-params">()</span> </span>{
    <span class="hljs-comment">// initialize or fetch a thread-local counter for the reader</span>
    PaddedAtomicLong readerCounter = tlReaderCounter.get();
    <span class="hljs-keyword">for</span> (;;) {
        <span class="hljs-comment">// check if the writer lock is available and we can start the attempt</span>
        <span class="hljs-keyword">if</span> (!wLock.get()) {
            <span class="hljs-comment">// increment the reader counter</span>
            readerCounter.incrementAndGet();
            <span class="hljs-comment">// check that no writer acquired the lock</span>
            <span class="hljs-keyword">if</span> (!wLock.get()) {
                <span class="hljs-comment">// attempt is successful, we're done</span>
                <span class="hljs-keyword">break</span>;
            }
            <span class="hljs-comment">// attempt failed, go for another spin</span>
            readerCounter.decrementAndGet();
        }
        LockSupport.parkNanos(<span class="hljs-number">10</span>);
    }
}
</code></pre>
<p>In this code, <code>PaddedAtomicLong</code> is nothing more than, well, a padded <code>AtomicLong</code> used as a reader counter. The padding is added to prevent false sharing, i.e. the situation when different readers are unlucky enough to have their counters adjacent in heap memory so that they end up in the same CPU cache line. When a reader accesses the thread-local counter for the first time, the counter gets added to an array of weak references. These weak references are checked by writers when they want to acquire the lock.</p>
<p>You may wonder about scalability of the above code in terms of readers. Yes, there are no writes to shared memory, but we read the writer lock value multiple times. Will this code scale? The answer is yes, reads (loads) of shared memory <a target="_blank" href="https://www.1024cores.net/home/lock-free-algorithms/first-things-first">scale</a> and it's fine to use them anywhere in your code.</p>
<p>Next, writer's lock method looks like this:</p>
<pre><code class="lang-java"><span class="hljs-meta">@Override</span>
<span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">lock</span><span class="hljs-params">()</span> </span>{
    <span class="hljs-comment">// first, acquire writer lock</span>
    lock0();
    <span class="hljs-comment">// next, wait for the readers</span>
    Iterator&lt;WeakReference&lt;PaddedAtomicLong&gt;&gt; iterator = readerCounters.iterator();
    <span class="hljs-keyword">while</span> (iterator.hasNext()) {
        <span class="hljs-comment">// fetch reader counter's weak reference</span>
        WeakReference&lt;PaddedAtomicLong&gt; ref = iterator.next();
        PaddedAtomicLong counter = ref.get();
        <span class="hljs-keyword">if</span> (counter == <span class="hljs-keyword">null</span>) {
            <span class="hljs-comment">// clean up the counter since the reader thread stopped</span>
            iterator.remove();
            <span class="hljs-keyword">continue</span>;
        }
        <span class="hljs-keyword">while</span> (counter.get() != <span class="hljs-number">0</span>) {
            <span class="hljs-comment">// the reader still holds the lock, so sleep and go for another spin</span>
            LockSupport.parkNanos(<span class="hljs-number">10</span>);
        }
    }
}
</code></pre>
<p>As promised, this lock guarantees zero contention between readers since their counters are thread-local. The counters are allocated dynamically, so in scenarios where only a few threads access the lock, its memory footprint should be lower than the BRAVO lock's. This lock should do its best when accessed from a fixed-size thread pool.</p>
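<p>For readers who want to experiment, here is a compact, self-contained sketch that puts the two simplified methods together and fills in the parts they leave out. It is not the QuestDB class. The registry type (<code>ConcurrentLinkedQueue</code>), the CAS-based <code>lock0()</code>, both unlock methods and the accessors are assumptions, and a plain <code>AtomicLong</code> is used instead of a padded one to keep the code short.</p>

```java
import java.lang.ref.WeakReference;
import java.util.Iterator;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

// Illustrative sketch only; not the QuestDB TLBiasedReadWriteLock class.
final class ThreadLocalRwLockSketch {
    private final AtomicBoolean wLock = new AtomicBoolean();
    // Registry of all reader counters, scanned by writers (assumed type).
    private final ConcurrentLinkedQueue<WeakReference<AtomicLong>> readerCounters =
            new ConcurrentLinkedQueue<>();
    // Each thread registers its own counter on first use. No padding here,
    // so false sharing between counters is possible in this sketch.
    private final ThreadLocal<AtomicLong> tlReaderCounter = ThreadLocal.withInitial(() -> {
        AtomicLong counter = new AtomicLong();
        readerCounters.add(new WeakReference<>(counter));
        return counter;
    });

    void readLock() {
        AtomicLong readerCounter = tlReaderCounter.get();
        for (;;) {
            if (!wLock.get()) {
                readerCounter.incrementAndGet();
                if (!wLock.get()) {
                    return;
                }
                // A writer arrived in between: undo and retry.
                readerCounter.decrementAndGet();
            }
            LockSupport.parkNanos(10);
        }
    }

    void readUnlock() {
        tlReaderCounter.get().decrementAndGet();
    }

    void writeLock() {
        // Assumed lock0(): a simple CAS spinlock with a short park.
        while (!wLock.compareAndSet(false, true)) {
            LockSupport.parkNanos(10);
        }
        Iterator<WeakReference<AtomicLong>> iterator = readerCounters.iterator();
        while (iterator.hasNext()) {
            AtomicLong counter = iterator.next().get();
            if (counter == null) {
                // The owning thread is gone and its counter was collected.
                iterator.remove();
                continue;
            }
            while (counter.get() != 0) {
                LockSupport.parkNanos(10);
            }
        }
    }

    void writeUnlock() {
        wLock.set(false);
    }

    // Hypothetical diagnostic accessors.
    boolean isWriteLocked() {
        return wLock.get();
    }

    long localReaderCount() {
        return tlReaderCounter.get().get();
    }
}
```

<p>A reader that registers its counter after the writer has started scanning does no harm: by that time <code>wLock</code> is already set, so the reader fails its first check and waits.</p>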
<p>If we benchmark this lock (TLBIASED type) against BRAVO (BIASED type) on 8 threads, we get this:</p>
<pre><code><span class="hljs-attribute">Benchmark</span>                          (type)  Mode  Cnt    Score   Error  Units
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testLock    BIASED  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">53</span>.<span class="hljs-number">811</span> ± <span class="hljs-number">1</span>.<span class="hljs-number">548</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testLock  TLBIASED  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">54</span>.<span class="hljs-number">682</span> ± <span class="hljs-number">1</span>.<span class="hljs-number">204</span>  ns/op
</code></pre><p>As you would expect, the two locks are on par in terms of reader lock scalability and end performance. Also as expected, if we sized the BRAVO lock's array less aggressively or ran the benchmark on a lot more cores/threads, the BRAVO lock would perform worse.</p>
<h2 id="heading-final-battle">Final battle</h2>
<p>Before we wrap up, let's compare all of the lock implementations we've looked at in a slightly more realistic benchmark. To make it happen we increase the time spent in the critical section by 4x so that it takes around 140 nanoseconds instead of the 22 nanoseconds used in the reader-only benchmark. Next, we change the benchmark so that the writer lock is occasionally acquired. The ratio between reader and writer lock calls will be 1,000:1, 10,000:1, or 100,000:1. The benchmark code may be found <a target="_blank" href="https://github.com/puzpuzpuz/questdb/blob/faaf0eb5deb948bc98f95172a9177f2f9386ff51/benchmarks/src/main/java/org/questdb/ReadWriteLockBenchmark.java#L85-L96">here</a>.</p>
<p>Here's what we get when the benchmark is run on 8 threads:</p>
<pre><code><span class="hljs-attribute">Benchmark</span>                                 (readWriteRatio)    (type)  Mode  Cnt     Score    Error  Units
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testReadWriteLock              <span class="hljs-number">1000</span>       JUC  avgt   <span class="hljs-number">10</span>  <span class="hljs-number">1966</span>.<span class="hljs-number">910</span> ± <span class="hljs-number">25</span>.<span class="hljs-number">643</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testReadWriteLock              <span class="hljs-number">1000</span>    SIMPLE  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">668</span>.<span class="hljs-number">782</span> ±  <span class="hljs-number">0</span>.<span class="hljs-number">879</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testReadWriteLock              <span class="hljs-number">1000</span>    BIASED  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">458</span>.<span class="hljs-number">470</span> ±  <span class="hljs-number">3</span>.<span class="hljs-number">169</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testReadWriteLock              <span class="hljs-number">1000</span>  TLBIASED  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">598</span>.<span class="hljs-number">463</span> ±  <span class="hljs-number">2</span>.<span class="hljs-number">135</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testReadWriteLock             <span class="hljs-number">10000</span>       JUC  avgt   <span class="hljs-number">10</span>  <span class="hljs-number">1828</span>.<span class="hljs-number">119</span> ± <span class="hljs-number">41</span>.<span class="hljs-number">570</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testReadWriteLock             <span class="hljs-number">10000</span>    SIMPLE  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">593</span>.<span class="hljs-number">159</span> ±  <span class="hljs-number">1</span>.<span class="hljs-number">913</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testReadWriteLock             <span class="hljs-number">10000</span>    BIASED  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">219</span>.<span class="hljs-number">809</span> ±  <span class="hljs-number">3</span>.<span class="hljs-number">359</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testReadWriteLock             <span class="hljs-number">10000</span>  TLBIASED  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">226</span>.<span class="hljs-number">159</span> ±  <span class="hljs-number">0</span>.<span class="hljs-number">808</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testReadWriteLock            <span class="hljs-number">100000</span>       JUC  avgt   <span class="hljs-number">10</span>  <span class="hljs-number">1973</span>.<span class="hljs-number">585</span> ± <span class="hljs-number">14</span>.<span class="hljs-number">310</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testReadWriteLock            <span class="hljs-number">100000</span>    SIMPLE  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">584</span>.<span class="hljs-number">412</span> ±  <span class="hljs-number">5</span>.<span class="hljs-number">078</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testReadWriteLock            <span class="hljs-number">100000</span>    BIASED  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">173</span>.<span class="hljs-number">194</span> ±  <span class="hljs-number">1</span>.<span class="hljs-number">360</span>  ns/op
<span class="hljs-attribute">ReadWriteLockBenchmark</span>.testReadWriteLock            <span class="hljs-number">100000</span>  TLBIASED  avgt   <span class="hljs-number">10</span>   <span class="hljs-number">174</span>.<span class="hljs-number">324</span> ±  <span class="hljs-number">3</span>.<span class="hljs-number">236</span>  ns/op
</code></pre><p>Good old <code>ReentrantReadWriteLock</code> (JUC) is equally slow regardless of the read/write lock call ratio, at roughly 1.8 to 2.0 microseconds per operation. The single atomic counter lock (SIMPLE) also doesn't improve much when there are fewer writer calls. For the 1,000:1 ratio, the single counter lock's performance is not far from that of the thread-local reader-biased lock (TLBIASED) and, to a lesser extent, the BRAVO lock (BIASED), but both biased locks pull far ahead as writer calls become rarer.</p>
<p>What's interesting is that the BRAVO lock (BIASED) is cheaper than the thread-local biased lock (TLBIASED) for the 1,000:1 read-write ratio (roughly 458 vs. 598 ns/op), but as the share of reader calls grows, the difference disappears. That's because the write lock call is more expensive in the thread-local biased lock. Again, the strong point of the thread-local lock is the guaranteed absence of contention for readers. Since the BRAVO lock was put into the best possible conditions in terms of the internal array sizing and the number of threads and CPU cores, we didn't observe any scalability issues with this lock in the conducted benchmarks.</p>
<h2 id="heading-whats-in-it-for-me">What's in it for me?</h2>
<p>Before we go any further, I'd suggest using a custom lock implementation only in niche use cases and, what's even more important, only if you know what you're doing. In all other situations, the standard library should be the default way to go.</p>
<p>If you're certain that reader lock acquisitions dominate in your code, that their critical sections are rather short, and that the <code>ReentrantReadWriteLock</code> class shows up as a bottleneck in profiler reports, any of the alternatives discussed here has a chance of improving your application's performance.</p>
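<p>One way to keep that decision cheap to revisit is to write the calling code against the standard <code>ReadWriteLock</code> interface, so that the lock implementation can be swapped once the profiler confirms it's worth it. Below is a minimal sketch; the <code>GuardedCounter</code> class is hypothetical, and it assumes that the custom lock you pick implements <code>java.util.concurrent.locks.ReadWriteLock</code>, which may not hold for every implementation.</p>
<pre><code>import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical example class: a counter guarded by whichever
// ReadWriteLock implementation is passed to the constructor.
public class GuardedCounter {
    private final ReadWriteLock lock;
    private long value;

    public GuardedCounter(ReadWriteLock lock) {
        this.lock = lock;
    }

    // Reader path: short critical section under the reader lock.
    public long get() {
        lock.readLock().lock();
        try {
            return value;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Writer path: expected to be called rarely.
    public void increment() {
        lock.writeLock().lock();
        try {
            value++;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        // Start with the standard library lock; swap it only if profiling says so.
        GuardedCounter counter = new GuardedCounter(new ReentrantReadWriteLock());
        counter.increment();
        counter.increment();
        System.out.println(counter.get()); // prints 2
    }
}
</code></pre>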
<p>Hopefully, you've learned something new today. Good luck and see you next time.</p>
]]></content:encoded></item><item><title><![CDATA[My previous blog]]></title><description><![CDATA[My previous blog on Medium is located here. It's mostly focused on Node.js.
You may also find my blog post  in Gopher Advent 2021.]]></description><link>https://puzpuzpuz.dev/my-previous-blog</link><guid isPermaLink="true">https://puzpuzpuz.dev/my-previous-blog</guid><dc:creator><![CDATA[Andrei Pechkurov]]></dc:creator><pubDate>Sat, 25 Dec 2021 16:16:19 GMT</pubDate><content:encoded><![CDATA[<p>My previous blog on Medium is located <a target="_blank" href="https://apechkurov.medium.com/">here</a>. It's mostly focused on Node.js.</p>
<p>You may also find my <a target="_blank" href="https://gopheradvent.com/calendar/2021/journey-to-a-faster-concurrent-map/">blog post</a> in Gopher Advent 2021.</p>
]]></content:encoded></item></channel></rss>