The Secret Life of fsync

Several times I've heard the opinion that many mass-market SSDs and HDDs don't provide sufficient durability guarantees and that Linux can do nothing about it. Namely, after an fsync() call, recently modified data can still sit in the drive's volatile write cache and may therefore be lost on a power failure. According to this view, if you want any meaningful durability, you should go for enterprise-grade drives with a battery or capacitor that lets them flush the data to persistent storage on power loss. Is that really so? Let's find out.

First, let's check what the POSIX.1-2017 specification says about fsync:

The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined.

The above description is rather vague. If the OS merely writes the data to the disk's volatile cache, that still counts as a "transfer", so formally such an OS would be POSIX-compliant. The informative section of the spec sheds more light on what a proper fsync implementation should do:

The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk. Since the concepts of "buffer cache", "system crash", "physical write", and "non-volatile storage" are not defined here, the wording has to be more abstract.

OK, that's much more specific. Once an fsync call returns successfully, the data should be durable in the face of a system crash, e.g. one caused by a power loss. But is that really what Linux does on an fsync?
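To make the contract concrete, here is a minimal sketch of how an application would rely on it. The helper name `durable_write` is made up for this example. The extra fsync on the parent directory is not discussed in this post; it's there because, on Linux, a newly created file's directory entry is persisted separately from the file's contents.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and ask the kernel to make it durable (hypothetical helper)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                       # os.write() may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # returns only after the kernel reports completion
    finally:
        os.close(fd)

    # Persist the directory entry of a possibly new file as well.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


# Usage: write a small file into a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    durable_write(os.path.join(tmp, "example.bin"), b"hello")
```

Whether "durable" here really survives a power loss is exactly the question the rest of the post tries to answer.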

As with many other FS-related system calls, most (if not all) file systems have their own implementation of fsync. To keep things simple, we're going to check the ext4 implementation in a recent 6.x kernel code base. The function to look at is `ext4_sync_file()`, which is invoked on an fsync() call. It involves the following steps:

First, it writes all dirty pages of the file behind the given file descriptor to the disk. That's done by the `file_write_and_wait_range()` function. After this step the data may still be sitting only in the disk's volatile cache, so that's not what we're looking for.
Next, it writes the inode's metadata to the disk. Depending on whether journaling is enabled on the FS or not, it's done via a specific function, e.g. `ext4_fsync_journal()`. Again, not what we're after.
Finally, if the `needs_barrier` variable is true, it calls the `blkdev_issue_flush()` function. That's probably what we need, isn't it?
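The three steps above can be modeled with a toy simulation. This is not kernel code: `ToyDisk` and `toy_fsync` are invented for illustration, and the inode metadata is reduced to a single block. The sketch only shows why the first two steps alone don't survive a power loss while the final flush does.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDisk:
    """Toy drive: a volatile write cache in front of persistent media."""
    cache: dict = field(default_factory=dict)
    media: dict = field(default_factory=dict)

    def write(self, block: str, data: bytes) -> None:
        self.cache[block] = data          # a completed write may only reach the cache

    def flush(self) -> None:
        self.media.update(self.cache)     # what a cache flush request asks for
        self.cache.clear()

    def power_loss(self) -> None:
        self.cache.clear()                # volatile contents are gone


def toy_fsync(dirty_pages: dict, disk: ToyDisk, needs_barrier: bool = True) -> None:
    # Step 1: write the file's dirty pages to the disk.
    for block, data in dirty_pages.items():
        disk.write(block, data)
    dirty_pages.clear()
    # Step 2: write the inode metadata (a single made-up block here).
    disk.write("inode", b"metadata")
    # Step 3: flush the drive's volatile cache if required.
    if needs_barrier:
        disk.flush()


# Without the flush, a power loss wipes everything that was "written".
disk = ToyDisk()
toy_fsync({"data0": b"hello"}, disk, needs_barrier=False)
disk.power_loss()
print(disk.media)   # {}

# With the flush, the data is on persistent media before fsync returns.
disk = ToyDisk()
toy_fsync({"data0": b"hello"}, disk, needs_barrier=True)
disk.power_loss()
print(disk.media)   # {'data0': b'hello', 'inode': b'metadata'}
```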

Let's leave the needs_barrier variable out of the equation for now and check what blkdev_issue_flush() does. This function queues a flush operation to the block device and waits until it's finished. The operation has the REQ_PREFLUSH bit set among its flags. The kernel docs describe this flag, among others:

In addition the REQ_PREFLUSH flag can be set on an otherwise empty bio structure, which causes only an explicit cache flush without any dependent I/O. It is recommend to use the blkdev_issue_flush() helper for a pure cache flush.

As we expected, the REQ_PREFLUSH flag tells the block device to flush its volatile cache to persistent storage. The related REQ_FUA flag applies to a single write and makes sure that its completion is signaled only after the data has reached non-volatile storage. Drivers for any well-behaved disk with a volatile write cache should handle these flags properly. Obviously, disks without such a cache don't need to bother with these operations and flags.
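If you're curious how the kernel classifies a particular drive, the block layer exposes this via the `queue/write_cache` attribute in sysfs, which reads "write back" for a device treated as having a volatile write cache and "write through" otherwise. Below is a small sketch that reads it; the function names are made up, and the device name "sda" is just a placeholder.

```python
from pathlib import Path
from typing import Optional


def parse_write_cache(text: str) -> Optional[bool]:
    """Return True for "write back", False for "write through", None otherwise."""
    value = text.strip()
    if value == "write back":
        return True
    if value == "write through":
        return False
    return None


def has_volatile_write_cache(device: str) -> Optional[bool]:
    """device is a block device name such as "sda" or "nvme0n1" (placeholders)."""
    path = Path("/sys/block") / device / "queue" / "write_cache"
    try:
        return parse_write_cache(path.read_text())
    except OSError:
        return None    # attribute missing or unreadable on this system


print(has_volatile_write_cache("sda"))
```

Note that this reports what the kernel believes about the device, not what the drive's firmware actually does with a flush request.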

Now, what's the deal with needs_barrier? In both the journaled and non-journaled ext4 code paths, it appears to be nothing more than an optimization to avoid sending flush requests to the disk multiple times. Note that you can also configure ext4 not to issue flush operations at all. For example, the non-journaled path requests a flush only when barriers are enabled, so mounting with nobarrier (barrier=0) turns it off; EXT4_DEFM_NOBARRIER is the matching default mount option flag stored in the superblock.
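Since disabling barriers silently removes the flush step, it may be worth checking how your file systems are actually mounted. The sketch below parses text in the format of /proc/mounts and reports ext4 mount points with barriers turned off. The function name is made up, and the exact spelling of the option in /proc/mounts ("nobarrier" or "barrier=0") is an assumption, so the code accepts both.

```python
from typing import List


def ext4_mounts_without_barriers(mounts_text: str) -> List[str]:
    """Return mount points of ext4 file systems mounted with barriers disabled."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "ext4":
            continue
        options = fields[3].split(",")
        # Assumption: the option shows up as "nobarrier" or "barrier=0".
        if "nobarrier" in options or "barrier=0" in options:
            result.append(fields[1])
    return result


# Made-up sample in /proc/mounts format.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /data ext4 rw,relatime,nobarrier 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""
print(ext4_mounts_without_barriers(SAMPLE))   # ['/data']
```

On a real system you would pass in the contents of /proc/mounts instead of the sample string.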

Other file systems have their own specifics, but the overall logic should be close to ext4's. So, unlike macOS, where a plain fsync() doesn't ask the drive to flush its cache and F_FULLFSYNC is needed for that, Linux does its best to transfer the data to persistent storage on fsync. Of course, this doesn't protect you from a flawed driver implementation written for a cheap no-name drive, but it also means that if you have a decent SSD from a well-known brand, you may be fine without an enterprise-grade disk.
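For code that has to run on both systems, the difference can be hidden behind a small helper. This is only a sketch: `full_sync` is a made-up name, and the fallback to a plain fsync when F_FULLFSYNC fails is my assumption about reasonable behavior, not something taken from the kernel sources discussed above.

```python
import fcntl
import os
import tempfile


def full_sync(fd: int) -> None:
    """Flush a file as far down the stack as the platform allows (hypothetical helper)."""
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)   # exposed on macOS only
    if full_fsync is not None:
        try:
            fcntl.fcntl(fd, full_fsync)    # also asks the drive to flush its cache
            return
        except OSError:
            pass                           # assumed fallback: plain fsync below
    # On Linux with ext4 and barriers enabled, this includes the cache flush.
    os.fsync(fd)


# Usage: sync a scratch file.
with tempfile.TemporaryFile() as f:
    f.write(b"hello")
    f.flush()                  # push Python's own buffer to the kernel first
    full_sync(f.fileno())
```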
