Shipping a write-ahead log in three evenings
I had a very specific bug. The kind that only shows up when you yank the power cable out of a server, and only one time in twenty. Everything else worked. The tests passed. The CI was green. The graphs were boring.
But once in a while, after a hard reboot, a row would just be — gone. Not corrupted. Not in some half-state. Gone, like it had never been written. And the WAL replay ran clean, every single time.
The naive version
The first cut was forty lines. Open file, append record, fsync, return. It looked like every textbook ever written. And it shipped, and it worked, and it ran for six weeks before anyone noticed anything wrong.
    fn append(&mut self, record: &[u8]) -> Result<()> {
        // Hand the bytes to the kernel.
        self.file.write_all(record)?;
        // Ask the kernel to flush this file's data to stable storage.
        self.file.sync_data()?;
        Ok(())
    }
What I did not know on the first evening is that this function has a hole in it. I figured it out on the third, somewhere around 2am, after reading the same paragraph of the Linux man page nine times in a row.
What actually happens during a power cut
The kernel buffers writes. The disk buffers writes. The disk controller buffers writes. Every layer between you and the magnetic platter is allowed to lie to you, in the name of throughput, until you explicitly tell it not to.
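You tell it not to with fsync. But fsync only flushes the file you call it on. The directory entry that points to the file lives in the directory, which is a separate object with separate buffers, and the directory needs its own fsync. Skip that after creating a file, and a power cut can leave the data safely on disk with no name pointing at it.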
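A minimal sketch of that pairing, assuming a Unix-like system where a directory can be opened read-only and fsynced through its file handle. The name `append_durably` is hypothetical and this is not the actual WAL code, only the shape of the fix:

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Append `record` to the file at `path`, then flush both the file and
/// the directory that holds its name. Hypothetical helper for illustration.
fn append_durably(path: &Path, record: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(record)?;
    // Flush the file's contents and metadata.
    file.sync_all()?;

    // Flush the parent directory so the entry naming the file is durable too.
    // Assumption: Unix-like OS. Opening a directory this way fails on Windows.
    let parent = match path.parent() {
        Some(p) if !p.as_os_str().is_empty() => p,
        _ => Path::new("."),
    };
    File::open(parent)?.sync_all()?;
    Ok(())
}
```

Strictly, the directory flush only matters when the entry changes (create, rename, delete), so a real log would probably do it once per new segment rather than on every append. The sketch does it every time to keep the pairing obvious.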
The version that survives
Six hundred lines later, here is the shape: every write is a CRC-tagged record. Every batch of records ends with a checksum trailer. Every fsync is paired with a directory fsync. Every replay verifies every checksum and stops at the first one that fails.
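To make the record and replay halves concrete, here is a rough sketch under assumptions of my choosing. The frame layout (4-byte little-endian length, 4-byte CRC-32 of the payload, then the payload) and the names `crc32`, `encode_record`, and `replay` are all illustrative, not the real on-disk format. It leaves out the batch trailer and operates on an in-memory byte slice.

```rust
/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
/// Slow but dependency-free; a real log would likely use a table or a crate.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones if the low bit is set, all zeros otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Frame a payload as [len: u32 LE][crc: u32 LE][payload] (assumed layout).
fn encode_record(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Read a little-endian u32 starting at `at`. Caller guarantees the bounds.
fn read_u32(bytes: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([bytes[at], bytes[at + 1], bytes[at + 2], bytes[at + 3]])
}

/// Return every payload up to the first truncated or corrupt record.
fn replay(log: &[u8]) -> Vec<Vec<u8>> {
    let mut records = Vec::new();
    let mut pos = 0;
    while log.len() - pos >= 8 {
        let len = read_u32(log, pos) as usize;
        let crc = read_u32(log, pos + 4);
        let start = pos + 8;
        // Torn tail: the header promises more bytes than the log holds.
        if log.len() - start < len {
            break;
        }
        let payload = &log[start..start + len];
        // Checksum mismatch: stop here and trust nothing after it.
        if crc32(payload) != crc {
            break;
        }
        records.push(payload.to_vec());
        pos = start + len;
    }
    records
}
```

Stopping at the first bad record, rather than skipping it, is the conservative choice: everything after a failed checksum may have been written out of order or not at all, so nothing past that point is treated as committed.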
It is not fast. It is not pretty. But it has not lost a row in four months, and I have stopped waking up at 3am to check the dashboard.