author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000 |
commit | e6918187568dbd01842d8d1d2c808ce16a894239 (patch) | |
tree | 64f88b554b444a49f656b6c656111a145cbbaa28 /src/rocksdb/docs/_posts/2022-10-05-lost-buffered-write-recovery.markdown | |
parent | Initial commit. (diff) | |
Adding upstream version 18.2.2. (ref: upstream/18.2.2)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/rocksdb/docs/_posts/2022-10-05-lost-buffered-write-recovery.markdown')
-rw-r--r-- | src/rocksdb/docs/_posts/2022-10-05-lost-buffered-write-recovery.markdown | 123 |
1 files changed, 123 insertions, 0 deletions
diff --git a/src/rocksdb/docs/_posts/2022-10-05-lost-buffered-write-recovery.markdown b/src/rocksdb/docs/_posts/2022-10-05-lost-buffered-write-recovery.markdown
new file mode 100644
index 000000000..fca3ea739
--- /dev/null
+++ b/src/rocksdb/docs/_posts/2022-10-05-lost-buffered-write-recovery.markdown
@@ -0,0 +1,123 @@
+---
+title: "Verifying crash-recovery with lost buffered writes"
+layout: post
+author:
+- ajkr
+category: blog
+---
+
+## Introduction
+
+Writes to a RocksDB instance go through multiple layers before they are fully persisted.
+Those layers may buffer writes, delaying their persistence.
+Depending on the layer, buffered writes may be lost in a process or system crash.
+A process crash loses writes buffered in process memory only.
+A system crash additionally loses writes buffered in OS memory.
+
+The new test coverage introduced in this post verifies there is no hole in the recovered data in either type of crash.
+A hole would exist if any recovered write were newer than any lost write, as illustrated below.
+This guarantee is important for many applications, such as those that use the newest recovered write to determine the starting point for replication.
+
+![](/static/images/lost-buffered-write-recovery/happy-cat.png)
+{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"}
+
+*Valid (no hole) recovery: all recovered writes (1 and 2) are older than all lost writes (3 and 4)*
+{: style="text-align: center"}
+
+![](/static/images/lost-buffered-write-recovery/angry-cat.png)
+{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"}
+
+*Invalid (hole) recovery: a recovered write (4) is newer than a lost write (3)*
+{: style="text-align: center"}
+
+The new test coverage assumes all writes use the same options related to buffering/persistence.
+For example, we do not cover the case of alternating writes with WAL disabled and WAL enabled (`WriteOptions::disableWAL`).
+It also assumes the crash does not have any unexpected consequences like corrupting persisted data.
+
+Testing for holes in the recovery is challenging because there are many valid recovery outcomes.
+Our solution involves tracing all the writes and then verifying the recovery matches a prefix of the trace.
+This proves there are no holes in the recovery.
+See the "Extensions for lost buffered writes" subsection below for more details.
+
+Testing actual system crashes would be operationally difficult.
+Our solution simulates a system crash by buffering written but unsynced data in process memory such that it is lost in a process crash.
+See the "Simulating system crash" subsection below for more details.
+
+## Scenarios covered
+
+We began testing that recovery has no hole in the following new scenarios.
+This coverage is included in our internal CI that periodically runs against the latest commit on the main branch.
+
+1. **Process crash with WAL disabled** (`WriteOptions::disableWAL=1`), which loses writes since the last memtable flush.
+2. **System crash with WAL enabled** (`WriteOptions::disableWAL=0`), which loses writes since the last memtable flush or WAL sync (`WriteOptions::sync=1`, `SyncWAL()`, or `FlushWAL(true /* sync */)`).
+3. **Process crash with manual WAL flush** (`DBOptions::manual_wal_flush=1`), which loses writes since the last memtable flush or manual WAL flush (`FlushWAL()`).
+4. **System crash with manual WAL flush** (`DBOptions::manual_wal_flush=1`), which loses writes since the last memtable flush or synced manual WAL flush (`FlushWAL(true /* sync */)`, or `FlushWAL(false /* sync */)` followed by WAL sync).
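The scenarios above differ mainly in which buffering layer a crash wipes out. The following is a toy Python model of that layering, not RocksDB code: `LayeredLog`, its method names, and its three lists are hypothetical stand-ins for process memory, the OS page cache, and durable storage, and memtable flushes are not modeled at all. It sketches why a crash that only drops the tail of buffered writes leaves a prefix of the write order, which is the no-hole property the tests check.

```python
class LayeredLog:
    """Toy model of a log whose writes pass through two buffering layers.

    All names here are hypothetical; this illustrates the layering only and
    does not mirror RocksDB's implementation.
    """

    def __init__(self):
        self.process_buffer = []  # lost in a process crash or a system crash
        self.os_cache = []        # lost in a system crash only
        self.durable = []         # survives both kinds of crash

    def write(self, seqno):
        """Buffer a write in process memory."""
        self.process_buffer.append(seqno)

    def flush(self):
        """Move process-buffered writes into the OS cache."""
        self.os_cache.extend(self.process_buffer)
        self.process_buffer.clear()

    def sync(self):
        """Flush, then move everything in the OS cache to durable storage."""
        self.flush()
        self.durable.extend(self.os_cache)
        self.os_cache.clear()

    def recover_after_process_crash(self):
        """The OS cache outlives the process, so only process memory is lost."""
        return self.durable + self.os_cache

    def recover_after_system_crash(self):
        """Only synced writes survive a system crash."""
        return list(self.durable)


def has_hole(recovered, written):
    """A recovery has a hole unless it is a prefix of the write order."""
    return recovered != written[:len(recovered)]


# Write 1 and sync it, flush 2 and 3 without syncing, leave 4 buffered.
log = LayeredLog()
log.write(1)
log.sync()
log.write(2)
log.write(3)
log.flush()
log.write(4)

after_process_crash = log.recover_after_process_crash()
after_system_crash = log.recover_after_system_crash()
```

In this model a process crash keeps writes 1 through 3 and a system crash keeps only write 1. Both are prefixes of the write order, whereas a recovery such as `[1, 2, 4]` would be reported by `has_hole` as invalid.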
+
+## Issues found
+
+* [False detection of corruption after system crash due to race condition with WAL sync and `track_and_verify_wals_in_manifest`](https://github.com/facebook/rocksdb/pull/10185)
+* [Undetected hole in recovery after system crash due to race condition in WAL sync](https://github.com/facebook/rocksdb/pull/10560)
+* [Recovery failure after system crash due to missing directory sync for critical metadata file](https://github.com/facebook/rocksdb/pull/10573)
+
+## Solution details
+
+### Basic setup
+
+![](/static/images/lost-buffered-write-recovery/basic-setup.png)
+{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"}
+
+Our correctness testing framework consists of a stress test program (`db_stress`) and a wrapper script (`db_crashtest.py`).
+`db_crashtest.py` manages instances of `db_stress`, starting them and injecting crashes.
+`db_stress` operates a DB and test oracle ("Latest values file").
+
+At startup, `db_stress` verifies the DB using the test oracle, skipping keys that had pending writes when the last crash happened.
+`db_stress` then stresses the DB with random operations, keeping the test oracle up-to-date.
+
+As the name "Latest values file" implies, this test oracle only tracks the latest value for each key.
+As a result, this setup is unable to verify recoveries involving lost buffered writes, where recovering older values is tolerated as long as there is no hole.
+
+### Extensions for lost buffered writes
+
+To accommodate lost buffered writes, we extended the test oracle to include two new files: "`verifiedSeqno`.state" and "`verifiedSeqno`.trace".
+`verifiedSeqno` is the sequence number of the last successful verification.
+"`verifiedSeqno`.state" is the expected values file at that sequence number, and "`verifiedSeqno`.trace" is the trace file of all operations that happened after that sequence number.
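As a rough illustration of how a state snapshot plus a trace can stand in for many possible recovery points, the Python sketch below replays a chosen number of traced operations on top of a snapshot. The dictionary and tuple formats and the `replay` helper are assumptions made for this example, as is the simplification that each traced operation consumes exactly one sequence number; the actual `db_stress` file formats are not shown here.

```python
def replay(state, trace, num_ops):
    """Return the expected values after applying the first num_ops traced ops.

    state: dict mapping key -> value at the snapshot's sequence number.
    trace: list of ("put", key, value) or ("delete", key) tuples, in the
           order the operations were issued after the snapshot.
    Both formats are hypothetical and chosen only for this illustration.
    """
    if not 0 <= num_ops <= len(trace):
        raise ValueError("recovery point lies outside the traced range")
    expected = dict(state)  # copy so the snapshot itself stays untouched
    for op in trace[:num_ops]:
        if op[0] == "put":
            _, key, value = op
            expected[key] = value
        elif op[0] == "delete":
            expected.pop(op[1], None)
        else:
            raise ValueError(f"unknown traced operation: {op[0]!r}")
    return expected


# Snapshot taken at sequence number 10, followed by three traced operations.
snapshot_seqno = 10
snapshot = {"a": 1, "b": 2}
trace = [("put", "a", 5), ("delete", "b"), ("put", "c", 7)]

# If the DB recovers to sequence number 12, two traced operations survived.
recovered_seqno = 12
expected = replay(snapshot, trace, recovered_seqno - snapshot_seqno)
```

Because any prefix of the trace yields a valid expected state, a single snapshot and trace pair covers every recovery point between the snapshot and the crash.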
+
+![](/static/images/lost-buffered-write-recovery/replay-extension.png)
+{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"}
+
+When buffered writes may have been lost by the previous `db_stress` instance, the current `db_stress` instance must reconstruct the latest values file before startup verification.
+M is the recovery sequence number of the current `db_stress` instance and N is the recovery sequence number of the previous `db_stress` instance.
+M is learned from the DB, while N is learned from the filesystem by parsing the "*.{trace,state}" filenames.
+Then, the latest values file ("LATEST.state") can be reconstructed by replaying the first M-N traced operations (in "N.trace") on top of the last instance's starting point ("N.state").
+
+![](/static/images/lost-buffered-write-recovery/trace-extension.png)
+{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"}
+
+When buffered writes may be lost by the current `db_stress` instance, we save the current expected values into "M.state" and begin tracing newer operations in "M.trace".
+
+### Simulating system crash
+
+When simulating a system crash, we send file writes to a `TestFSWritableFile`, which buffers unsynced writes in process memory.
+That way, the existing `db_stress` process crash mechanism will lose unsynced writes.
+
+![](/static/images/lost-buffered-write-recovery/test-fs-writable-file.png)
+{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"}
+
+`TestFSWritableFile` is implemented as follows.
+
+* `Append()` buffers the write in a local `std::string` rather than calling `write()`.
+* `Sync()` transfers the local `std::string`'s content to `PosixWritableFile::Append()`, which will then `write()` it to the OS page cache.
+
+## Next steps
+An untested guarantee is that RocksDB recovers all writes that the user explicitly flushed out of the buffers lost in the crash.
+We may recover more writes than these due to internal flushing of buffers, but never fewer.
+Our test oracle needs to be further extended to track the lower bound on the sequence number that is expected to survive a crash.
+
+We would also like to make our system crash simulation more realistic.
+Currently we only drop unsynced regular file data, but we should drop unsynced directory entries as well.
+
+## Acknowledgements
+
+Hui Xiao added the manual WAL flush coverage and compatibility with `TransactionDB`.
+Zhichao Cao added the system crash simulation.
+Several RocksDB team members contributed to this feature's dependencies.