diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000 |
commit | e6918187568dbd01842d8d1d2c808ce16a894239 (patch) | |
tree | 64f88b554b444a49f656b6c656111a145cbbaa28 /src/rocksdb/docs/_posts | |
parent | Initial commit. (diff) | |
download | ceph-e6918187568dbd01842d8d1d2c808ce16a894239.tar.xz ceph-e6918187568dbd01842d8d1d2c808ce16a894239.zip |
Adding upstream version 18.2.2.upstream/18.2.2
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/rocksdb/docs/_posts')
65 files changed, 4446 insertions, 0 deletions
diff --git a/src/rocksdb/docs/_posts/2014-03-27-how-to-backup-rocksdb.markdown b/src/rocksdb/docs/_posts/2014-03-27-how-to-backup-rocksdb.markdown new file mode 100644 index 000000000..f9e4a5444 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-03-27-how-to-backup-rocksdb.markdown @@ -0,0 +1,135 @@ +--- +title: How to backup RocksDB? +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/191/how-to-backup-rocksdb/ +--- + +In RocksDB, we have implemented an easy way to backup your DB. Here is a simple example: + + + + #include "rocksdb/db.h" + #include "utilities/backupable_db.h" + using namespace rocksdb; + + DB* db; + DB::Open(Options(), "/tmp/rocksdb", &db); + BackupableDB* backupable_db = new BackupableDB(db, BackupableDBOptions("/tmp/rocksdb_backup")); + backupable_db->Put(...); // do your thing + backupable_db->CreateNewBackup(); + delete backupable_db; // no need to also delete db + +<!--truncate--> + + +This simple example will create a backup of your DB in "/tmp/rocksdb_backup". Creating new BackupableDB consumes DB* and you should be calling all the DB methods on object `backupable_db` going forward. + +Restoring is also easy: + + + + RestoreBackupableDB* restore = new RestoreBackupableDB(Env::Default(), BackupableDBOptions("/tmp/rocksdb_backup")); + restore->RestoreDBFromLatestBackup("/tmp/rocksdb", "/tmp/rocksdb"); + delete restore; + + + + +This code will restore the backup back to "/tmp/rocksdb". The second parameter is the location of log files (In some DBs they are different from DB directory, but usually they are the same. See Options::wal_dir for more info). + +An alternative API for backups is to use BackupEngine directly: + + + + #include "rocksdb/db.h" + #include "utilities/backupable_db.h" + using namespace rocksdb; + + DB* db; + DB::Open(Options(), "/tmp/rocksdb", &db); + db->Put(...); // do your thing + BackupEngine* backup_engine = BackupEngine::NewBackupEngine(Env::Default(), BackupableDBOptions("/tmp/rocksdb_backup")); + backup_engine->CreateNewBackup(db); + delete db; + delete backup_engine; + + + + +Restoring with BackupEngine is similar to RestoreBackupableDB: + + + + BackupEngine* backup_engine = BackupEngine::NewBackupEngine(Env::Default(), BackupableDBOptions("/tmp/rocksdb_backup")); + backup_engine->RestoreDBFromLatestBackup("/tmp/rocksdb", "/tmp/rocksdb"); + delete backup_engine; + + + + +Backups are incremental. You can create a new backup with `CreateNewBackup()` and only the new data will be copied to backup directory (for more details on what gets copied, see "Under the hood"). Checksum is always calculated for any backuped file (including sst, log, and etc). It is used to make sure files are kept sound in the file system. Checksum is also verified for files from the previous backups even though they do not need to be copied. A checksum mismatch aborts the current backup (see "Under the hood" for more details). Once you have more backups saved, you can issue `GetBackupInfo()` call to get a list of all backups together with information on timestamp of the backup and the size (please note that sum of all backups' sizes is bigger than the actual size of the backup directory because some data is shared by multiple backups). Backups are identified by their always-increasing IDs. `GetBackupInfo()` is available both in `BackupableDB` and `RestoreBackupableDB`. + +You probably want to keep around only small number of backups. To delete old backups, just call `PurgeOldBackups(N)`, where N is how many backups you'd like to keep. All backups except the N newest ones will be deleted. You can also choose to delete arbitrary backup with call `DeleteBackup(id)`. + +`RestoreDBFromLatestBackup()` will restore the DB from the latest consistent backup. An alternative is `RestoreDBFromBackup()` which takes a backup ID and restores that particular backup. Checksum is calculated for any restored file and compared against the one stored during the backup time. If a checksum mismatch is detected, the restore process is aborted and `Status::Corruption` is returned. Very important thing to note here: Let's say you have backups 1, 2, 3, 4. If you restore from backup 2 and start writing more data to your database, newly created backup will delete old backups 3 and 4 and create new backup 3 on top of 2. + + + +## Advanced usage + + +Let's say you want to backup your DB to HDFS. There is an option in `BackupableDBOptions` to set `backup_env`, which will be used for all file I/O related to backup dir (writes when backuping, reads when restoring). If you set it to HDFS Env, all the backups will be stored in HDFS. + +`BackupableDBOptions::info_log` is a Logger object that is used to print out LOG messages if not-nullptr. + +If `BackupableDBOptions::sync` is true, we will sync data to disk after every file write, guaranteeing that backups will be consistent after a reboot or if machine crashes. Setting it to false will speed things up a bit, but some (newer) backups might be inconsistent. In most cases, everything should be fine, though. + +If you set `BackupableDBOptions::destroy_old_data` to true, creating new `BackupableDB` will delete all the old backups in the backup directory. + +`BackupableDB::CreateNewBackup()` method takes a parameter `flush_before_backup`, which is false by default. When `flush_before_backup` is true, `BackupableDB` will first issue a memtable flush and only then copy the DB files to the backup directory. Doing so will prevent log files from being copied to the backup directory (since flush will delete them). If `flush_before_backup` is false, backup will not issue flush before starting the backup. In that case, the backup will also include log files corresponding to live memtables. Backup will be consistent with current state of the database regardless of `flush_before_backup` parameter. + + + +## Under the hood + + +`BackupableDB` implements `DB` interface and adds four methods to it: `CreateNewBackup()`, `GetBackupInfo()`, `PurgeOldBackups()`, `DeleteBackup()`. Any `DB` interface calls will get forwarded to underlying `DB` object. + +When you call `BackupableDB::CreateNewBackup()`, it does the following: + + + + + + 1. Disable file deletions + + + + 2. Get live files (this includes table files, current and manifest file). + + + + 3. Copy live files to the backup directory. Since table files are immutable and filenames unique, we don't copy a table file that is already present in the backup directory. For example, if there is a file `00050.sst` already backed up and `GetLiveFiles()` returns `00050.sst`, we will not copy that file to the backup directory. However, checksum is calculated for all files regardless if a file needs to be copied or not. If a file is already present, the calculated checksum is compared against previously calculated checksum to make sure nothing crazy happened between backups. If a mismatch is detected, backup is aborted and the system is restored back to the state before `BackupableDB::CreateNewBackup()` is called. One thing to note is that a backup abortion could mean a corruption from a file in backup directory or the corresponding live file in current DB. Both manifest and current files are copied, since they are not immutable. + + + + 4. If `flush_before_backup` was set to false, we also need to copy log files to the backup directory. We call `GetSortedWalFiles()` and copy all live files to the backup directory. + + + + 5. Enable file deletions + + + + +Backup IDs are always increasing and we have a file `LATEST_BACKUP` that contains the ID of the latest backup. If we crash in middle of backing up, on a restart we will detect that there are newer backup files than `LATEST_BACKUP` claims there are. In that case, we will delete any backup newer than `LATEST_BACKUP` and clean up all the files since some of the table files might be corrupted. Having corrupted table files in the backup directory is dangerous because of our deduplication strategy. + + + +## Further reading + + +For the API details, see `include/utilities/backupable_db.h`. For the implementation, see `utilities/backupable/backupable_db.cc`. diff --git a/src/rocksdb/docs/_posts/2014-03-27-how-to-persist-in-memory-rocksdb-database.markdown b/src/rocksdb/docs/_posts/2014-03-27-how-to-persist-in-memory-rocksdb-database.markdown new file mode 100644 index 000000000..89ffb2d97 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-03-27-how-to-persist-in-memory-rocksdb-database.markdown @@ -0,0 +1,54 @@ +--- +title: How to persist in-memory RocksDB database? +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/245/how-to-persist-in-memory-rocksdb-database/ +--- + +In recent months, we have focused on optimizing RocksDB for in-memory workloads. With growing RAM sizes and strict low-latency requirements, lots of applications decide to keep their entire data in memory. Running in-memory database with RocksDB is easy -- just mount your RocksDB directory on tmpfs or ramfs [1]. Even if the process crashes, RocksDB can recover all of your data from in-memory filesystem. However, what happens if the machine reboots? + +<!--truncate--> + +In this article we will explain how you can recover your in-memory RocksDB database even after a machine reboot. + +Every update to RocksDB is written to two places - one is an in-memory data structure called memtable and second is write-ahead log. Write-ahead log can be used to completely recover the data in memtable. By default, when we flush the memtable to table file, we also delete the current log, since we don't need it anymore for recovery (the data from the log is "persisted" in the table file -- we say that the log file is obsolete). However, if your table file is stored in in-memory file system, you may need the obsolete write-ahead log to recover the data after the machine reboots. Here's how you can do that. + +Options::wal_dir is the directory where RocksDB stores write-ahead log files. If you configure this directory to be on flash or disk, you will not lose current log file on machine reboot. +Options::WAL_ttl_seconds is the timeout when we delete the archived log files. If the timeout is non-zero, obsolete log files will be moved to `archive/` directory under Options::wal_dir. Those archived log files will only be deleted after the specified timeout. + +Let's assume Options::wal_dir is a directory on persistent storage and Options::WAL_ttl_seconds is set to one day. To fully recover the DB, we also need to backup the current snapshot of the database (containing table and metadata files) with a frequency of less than one day. RocksDB provides an utility that enables you to easily backup the snapshot of your database. You can learn more about it here: [How to backup RocksDB?](https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F) + +You should configure the backup process to avoid backing up log files, since they are already stored in persistent storage. To do that, set BackupableDBOptions::backup_log_files to false. + +Restore process by default cleans up entire DB and WAL directory. Since we didn't include log files in the backup, we need to make sure that restoring the database doesn't delete log files in WAL directory. When restoring, configure RestoreOptions::keep_log_file to true. That option will also move any archived log files back to WAL directory, enabling RocksDB to replay all archived log files and rebuild the in-memory database state. + +To reiterate, here's what you have to do: + + + + + * Set DB directory to tmpfs or ramfs mounted drive + + + + * Set Options::wal_log to a directory on persistent storage + + + + * Set Options::WAL_ttl_seconds to T seconds + + + + * Backup RocksDB every T/2 seconds, with BackupableDBOptions::backup_log_files = false + + + + * When you lose data, restore from backup with RestoreOptions::keep_log_file = true + + + + + +[1] You might also want to consider using [PlainTable format](https://github.com/facebook/rocksdb/wiki/PlainTable-Format) for table files diff --git a/src/rocksdb/docs/_posts/2014-04-02-the-1st-rocksdb-local-meetup-held-on-march-27-2014.markdown b/src/rocksdb/docs/_posts/2014-04-02-the-1st-rocksdb-local-meetup-held-on-march-27-2014.markdown new file mode 100644 index 000000000..7ccbdbaad --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-04-02-the-1st-rocksdb-local-meetup-held-on-march-27-2014.markdown @@ -0,0 +1,53 @@ +--- +title: The 1st RocksDB Local Meetup Held on March 27, 2014 +layout: post +author: xjin +category: blog +redirect_from: + - /blog/323/the-1st-rocksdb-local-meetup-held-on-march-27-2014/ +--- + +On Mar 27, 2014, RocksDB team @ Facebook held the 1st RocksDB local meetup in FB HQ (Menlo Park, California). We invited around 80 guests from 20+ local companies, including LinkedIn, Twitter, Dropbox, Square, Pinterest, MapR, Microsoft and IBM. Finally around 50 guests showed up, totaling around 60% show-up rate. + +<!--truncate--> + +[![Resize of 20140327_200754](/static/images/Resize-of-20140327_200754-300x225.jpg)](/static/images/Resize-of-20140327_200754-300x225.jpg) + +RocksDB team @ Facebook gave four talks about the latest progress and experience on RocksDB: + + + + + * [Supporting a 1PB In-Memory Workload](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Haobo-RocksDB-In-Memory.pdf) + + + + + * [Column Families in RocksDB](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Igor-Column-Families.pdf) + + + + + * ["Lockless" Get() in RocksDB?](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf) + + + + + * [Prefix Hashing in RocksDB](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Siying-Prefix-Hash.pdf) + + +A very interesting question asked by a massive number of guests is: does RocksDB plan to provide replication functionality? Obviously, many applications need a resilient and distributed storage solution, not just single-node storage. We are considering how to approach this issue. + +When will be the next meetup? We haven't decided yet. We will see whether the community is interested in it and how it can help RocksDB grow. + +If you have any questions or feedback for the meetup or RocksDB, please let us know in [our Facebook group](https://www.facebook.com/groups/rocksdb.dev/). + +### Comments + +**[Rajiv](geetasen@gmail.com)** + +Have any of these talks been recorded and if so will they be published? + +**[Igor Canadi](icanadi@fb.com)** + +Yes, I think we plan to publish them soon. diff --git a/src/rocksdb/docs/_posts/2014-04-07-rocksdb-2-8-release.markdown b/src/rocksdb/docs/_posts/2014-04-07-rocksdb-2-8-release.markdown new file mode 100644 index 000000000..7be7842a5 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-04-07-rocksdb-2-8-release.markdown @@ -0,0 +1,40 @@ +--- +title: RocksDB 2.8 release +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/371/rocksdb-2-8-release/ +--- + +Check out the new RocksDB 2.8 release on [Github](https://github.com/facebook/rocksdb/releases/tag/2.8.fb). + +RocksDB 2.8. is mostly focused on improving performance for in-memory workloads. We are seeing read QPS as high as 5M (we will write a separate blog post on this). + +<!--truncate--> + +Here is the summary of new features: + + * Added a new table format called PlainTable, which is optimized for RAM storage (ramfs or tmpfs). You can read more details about it on [our wiki](https://github.com/facebook/rocksdb/wiki/PlainTable-Format). + + + * New prefixed memtable format HashLinkedList, which is optimized for cases where there are only a few keys for each prefix. + + + * Merge operator supports a new function PartialMergeMulti() that allows users to do partial merges against multiple operands. This function enables big speedups for workloads that use merge operators. + + + * Added a V2 compaction filter interface. It buffers the kv-pairs sharing the same key prefix, process them in batches, and return the batched results back to DB. + + + * Geo-spatial support for locations and radial-search. + + + * Improved read performance using thread local cache for frequently accessed data. + + + * Stability improvements -- we're now ignoring partially written tailing record to MANIFEST or WAL files. + + + +We have also introduced small incompatible API changes (mostly for advanced users). You can see full release notes in our [HISTORY.my](https://github.com/facebook/rocksdb/blob/2.8.fb/HISTORY.md) file. diff --git a/src/rocksdb/docs/_posts/2014-04-21-indexing-sst-files-for-better-lookup-performance.markdown b/src/rocksdb/docs/_posts/2014-04-21-indexing-sst-files-for-better-lookup-performance.markdown new file mode 100644 index 000000000..368055d2c --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-04-21-indexing-sst-files-for-better-lookup-performance.markdown @@ -0,0 +1,28 @@ +--- +title: Indexing SST Files for Better Lookup Performance +layout: post +author: leijin +category: blog +redirect_from: + - /blog/431/indexing-sst-files-for-better-lookup-performance/ +--- + +For a `Get()` request, RocksDB goes through mutable memtable, list of immutable memtables, and SST files to look up the target key. SST files are organized in levels. + +On level 0, files are sorted based on the time they are flushed. Their key range (as defined by FileMetaData.smallest and FileMetaData.largest) are mostly overlapped with each other. So it needs to look up every L0 file. + +<!--truncate--> + +Compaction is scheduled periodically to pick up files from an upper level and merges them with files from lower level. As a result, key/values are moved from L0 down the LSM tree gradually. Compaction sorts key/values and split them into files. From level 1 and below, SST files are sorted based on key. Their key range are mutually exclusive. Instead of scanning through each SST file and checking if a key falls into its range, RocksDB performs a binary search based on FileMetaData.largest to locate a candidate file that can potentially contain the target key. This reduces complexity from O(N) to O(log(N)). However, log(N) can still be large for bottom levels. For a fan-out ratio of 10, level 3 can have 1000 files. That requires 10 comparisons to locate a candidate file. This is a significant cost for an in-memory database when you can do [several million gets per second](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks). + +One observation to this problem is that: after the LSM tree is built, an SST file's position in its level is fixed. Furthermore, its order relative to files from the next level is also fixed. Based on this idea, we can perform [fractional cascading](http://en.wikipedia.org/wiki/Fractional_cascading) kind of optimization to narrow down the binary search range. Here is an example: + +[![tree_example](/static/images/tree_example1.png)](/static/images/tree_example1.png) + +Level 1 has 2 files and level 2 has 8 files. Now, we want to look up key 80. A binary search based FileMetaData.largest tells you file 1 is the candidate. Then key 80 is compared with its FileMetaData.smallest and FileMetaData.largest to decide if it falls into the range. The comparison shows 80 is less than FileMetaData.smallest (100), so file 1 does not possibly contain key 80. We to proceed to check level 2. Usually, we need to do binary search among all 8 files on level 2. But since we already know target key 80 is less than 100 and only file 1 to file 3 can contain key less than 100, we can safely exclude other files from the search. As a result we cut down the search space from 8 files to 3 files. + +Let's look at another example. We want to get key 230. A binary search on level 1 locates to file 2 (this also implies key 230 is larger than file 1's FileMetaData.largest 200). A comparison with file 2's range shows the target key is smaller than file 2's FileMetaData.smallest 300. Even though, we couldn't find key on level 1, we have derived hints that target key is in range between 200 and 300. Any files on level 2 that cannot overlap with [200, 300] can be safely excluded. As a result, we only need to look at file 5 and file 6 on level 2. + +Inspired by this concept, we pre-build pointers at compaction time on level 1 files that point to a range of files on level 2. For example, file 1 on level 1 points to file 3 (on level 2) on the left and file 4 on the right. File 2 will point to level 2 files 6 and 7. At query time, these pointers are used to determine the actual binary search range based on comparison result. + +Our benchmark shows that this optimization improves lookup QPS by ~5% for similar setup mentioned [here](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks). diff --git a/src/rocksdb/docs/_posts/2014-05-14-lock.markdown b/src/rocksdb/docs/_posts/2014-05-14-lock.markdown new file mode 100644 index 000000000..12009cc88 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-05-14-lock.markdown @@ -0,0 +1,88 @@ +--- +title: Reducing Lock Contention in RocksDB +layout: post +author: sdong +category: blog +redirect_from: + - /blog/521/lock/ +--- + +In this post, we briefly introduce the recent improvements we did to RocksDB to improve the issue of lock contention costs. + +RocksDB has a simple thread synchronization mechanism (See [RocksDB Architecture Guide](https://github.com/facebook/rocksdb/wiki/Rocksdb-Architecture-Guide) to understand terms used below, like SST tables or mem tables). SST tables are immutable after being written and mem tables are lock-free data structures supporting single writer and multiple readers. There is only one single major lock, the DB mutex (DBImpl.mutex_) protecting all the meta operations, including: + +<!--truncate--> + + * Increase or decrease reference counters of mem tables and SST tables + + + * Change and check meta data structures, before and after finishing compactions, flushes and new mem table creations + + + * Coordinating writers + + +This DB mutex used to be scalability bottleneck preventing us from scaling to more than 16 threads. To address the issue, we improved RocksDB in several ways. + +1. Consolidate reference counters and introduce "super version". For every read operation, mutex was acquired, and reference counters for each mem table and each SST table were increased. One such operation is not expensive but if you are building a high throughput server with lots of reads, the lock contention will become the bottleneck. This is especially true if you store all your data in RAM. + +To solve this problem, we created a meta-meta data structure called “[super version](https://reviews.facebook.net/rROCKSDB1fdb3f7dc60e96394e3e5b69a46ede5d67fb976c)”, which holds reference counters to all those mem table and SST tables, so that readers only need to increase the reference counters for this single data structure. In RocksDB, list of live mem tables and SST tables only changes infrequently, which would happen when new mem tables are created or flush/compaction happens. Now, at those times, a new super version is created with their reference counters increased. A super version lists live mem tables and SST tables so a reader only needs acquire the lock in order to find the latest super version and increase its reference counter. From the super version, the reader can find all the mem and SST tables which are safety accessible as long as the reader holds the reference count for the super version. + +2. We replace some reference counters to stc::atomic objects, so that decreasing reference count of an object usually doesn’t need to be inside the mutex any more. + +3. Make fetching super version and reference counting lock-free in read queries. After consolidating reference counting to one single super version and removing the locking for decreasing reference counts, in read case, we only acquire mutex for one thing: fetch the latest super version and increase the reference count for that (dereference the counter is done in an atomic decrease). We designed and implemented a (mostly) lock-free approach to do it. See [details](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf). We will write a separate blog post for that. + +4. Avoid disk I/O inside the mutex. As we know, each disk I/O to hard drives takes several milliseconds. It can be even longer if file system journal is involved or I/Os are queued. Even occasional disk I/O within mutex can cause huge performance outliers. +We identified in two situations, we might do disk I/O inside mutex and we removed them: +(1) Opening and closing transactional log files. We moved those operations out of the mutex. +(2) Information logging. In multiple places we write to logs within mutex. There is a chance that file write will wait for disk I/O to finish before finishing, even if fsync() is not issued, especially in EXT systems. We occasionally see 100+ milliseconds write() latency on EXT. Instead of removing those logging, we came up with a solution of delay logging. When inside mutex, instead of directly writing to the log file, we write to a log buffer, with the timing information. As soon as mutex is released, we flush the log buffer to log files. + +5. Reduce object creation inside the mutex. +Object creation can be slow because it involves malloc (in our case). Malloc sometimes is slow because it needs to lock some shared data structures. Allocating can also be slow because we sometimes do expensive operations in some of our classes' constructors. For these reasons, we try to reduce object creations inside the mutex. Here are two examples: + +(1) std::vector uses malloc inside. We introduced “[autovector](https://reviews.facebook.net/rROCKSDBc01676e46d3be08c3c140361ef1f5884f47d3b3c)” data structure, in which memory for first a few elements are pre-allocated as members of the autovector class. When an autovector is used as a stack variable, no malloc will be needed unless the pre-allocated buffer is used up. This autovector is quite useful for manipulating those meta data structures. Those meta operations are often locked inside DB mutex. + +(2) When building an iterator, we used to creating iterator of every live men table and SST table within the mutex and a merging iterator on top of them. Besides malloc, some of those iterators can be quite expensive to create, like sorting. Now, instead of doing that, we simply increase the reference counters of them, and release the mutex before creating any iterator. + +6. Deal with mutexes in LRU caches. +When I said there was only one single major lock, I was lying. In RocksDB, all LRU caches had exclusive mutexes within to protect writes to the LRU lists, which are done in both of read and write operations. LRU caches are used in block cache and table cache. Both of them are accessed more frequently than DB data structures. Lock contention of these two locks are as intense as the DB mutex. Even if LRU cache is sharded into ShardedLRUCache, we can still see lock contentions, especially table caches. We further address this issue in two way: +(1) Bypassing table caches. A table cache maintains list of SST table’s read handlers. Those handlers contain SST files’ descriptors, table metadata, and possibly data indexes, as well as bloom filters. When the table handler needs to be evicted based on LRU, those information is cleared. When the SST table needs to be read and its table handler is not in LRU cache, the table is opened and those metadata is loaded. In some cases, users want to tune the system in a way that table handler evictions should never happen. It is common for high-throughput, low-latency servers. We introduce a mode where table cache is bypassed in read queries. In this mode, all table handlers are cached and accessed directly, so there is no need to query and adjust table caches for reading the database. It is the users’ responsibility to reserve enough resource for it. This mode can be turned on by setting options.max_open_files=-1. + +(2) [New PlainTable format](//github.com/facebook/rocksdb/wiki/PlainTable-Format) (optimized for SST in ramfs/tmpfs) does not organize data by blocks. Data are located by memory addresses so no block cache is needed. + +With all of those improvements, lock contention is not a bottleneck anymore, which is shown in our [memory-only benchmark](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks) . Furthermore, lock contentions are not causing some huge (50 milliseconds+) latency outliers they used to cause. + +### Comments + +**[Lee Hounshell](lee@apsalar.com)** + +Please post an example of reading the same rocksdb concurrently. + +We are using the latest 3.0 rocksdb; however, when two separate processes +try and open the same rocksdb for reading, only one of the open requests +succeed. The other open always fails with “db/LOCK: Resource temporarily unavailable” So far we have not found an option that allows sharing the rocksdb for reads. An example would be most appreciated. + +**[Siying Dong](siying.d@fb.com)** + +Sorry for the delay. We don’t have feature support for this scenario yet. Here is an example you can work around this problem. You can build a snapshot of the DB by doing this: + +1. create a separate directory on the same host for a snapshot of the DB. +1. call `DB::DisableFileDeletions()` +1. call `DB::GetLiveFiles()` to get a full list of the files. +1. for all the files except manifest, add a hardlink file in your new directory pointing to the original file +1. copy the manifest file and truncate the size (you can read the comments of `DB::GetLiveFiles()` for more information) +1. call `DB::EnableFileDeletions()` +1. now you can open the snapshot directory in another process to access those files. Please remember to delete the directory after reading the data to allow those files to be recycled. + +By the way, the best way to ask those questions is in our [facebook group](https://www.facebook.com/groups/rocksdb.dev/). Let us know if you need any further help. + +**[Darshan](darshan.ghumare@gmail.com)** + +Will this consistency problem of RocksDB all occurs in case of single put/write? +What all ACID properties is supported by RocksDB, only durability irrespective of single or batch write? + +**[Siying Dong](siying.d@fb.com)** + +We recently [introduced optimistic transaction](https://reviews.facebook.net/D33435) which can help you ensure all of ACID. + +This blog post is mainly about optimizations in implementation. The RocksDB consistency semantic is not changed. diff --git a/src/rocksdb/docs/_posts/2014-05-19-rocksdb-3-0-release.markdown b/src/rocksdb/docs/_posts/2014-05-19-rocksdb-3-0-release.markdown new file mode 100644 index 000000000..61c90dc93 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-05-19-rocksdb-3-0-release.markdown @@ -0,0 +1,24 @@ +--- +title: RocksDB 3.0 release +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/557/rocksdb-3-0-release/ +--- + +Check out new RocksDB release on [Github](https://github.com/facebook/rocksdb/releases/tag/3.0.fb)! + +New features in RocksDB 3.0: + + * [Column Family support](https://github.com/facebook/rocksdb/wiki/Column-Families) + + + * [Ability to chose different checksum function](https://github.com/facebook/rocksdb/commit/0afc8bc29a5800e3212388c327c750d32e31f3d6) + + + * Deprecated ReadOptions::prefix_seek and ReadOptions::prefix + +<!--truncate--> + +Check out the full [change log](https://github.com/facebook/rocksdb/blob/3.0.fb/HISTORY.md). diff --git a/src/rocksdb/docs/_posts/2014-05-22-rocksdb-3-1-release.markdown b/src/rocksdb/docs/_posts/2014-05-22-rocksdb-3-1-release.markdown new file mode 100644 index 000000000..30156742b --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-05-22-rocksdb-3-1-release.markdown @@ -0,0 +1,20 @@ +--- +title: RocksDB 3.1 release +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/575/rocksdb-3-1-release/ +--- + +Check out the new release on [Github](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.1)! + +New features in RocksDB 3.1: + + * [Materialized hash index](https://github.com/facebook/rocksdb/commit/0b3d03d026a7248e438341264b4c6df339edc1d7) + + + * [FIFO compaction style](https://github.com/facebook/rocksdb/wiki/FIFO-compaction-style) + + +We released 3.1 so fast after 3.0 because one of our internal customers needed materialized hash index. diff --git a/src/rocksdb/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown b/src/rocksdb/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown new file mode 100644 index 000000000..6a641f233 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown @@ -0,0 +1,47 @@ +--- +title: PlainTable — A New File Format +layout: post +author: sdong +category: blog +redirect_from: + - /blog/599/plaintable-a-new-file-format/ +--- + +In this post, we are introducing "PlainTable" -- a file format we designed for RocksDB, initially to satisfy a production use case at Facebook. + +Design goals: + +1. All data stored in memory, in files stored in tmpfs/ramfs. Support DBs larger than 100GB (may be sharded across multiple RocksDB instance). +1. Optimize for [prefix hashing](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Siying-Prefix-Hash.pdf) +1. Less than or around 1 micro-second average latency for single Get() or Seek(). +1. Minimize memory consumption. +1. Queries efficiently return empty results + +<!--truncate--> + +Notice that our priority was not to maximize query performance, but to strike a balance between query performance and memory consumption. PlainTable query performance is not as good as you would see with a nicely-designed hash table, but they are of the same order of magnitude, while keeping memory overhead to a minimum. + +Since we are targeting micro-second latency, it is on the level of the number of CPU cache misses (if they cannot be parallellized, which are usually the case for index look-ups). On our target hardware with Intel CPUs of multiple sockets with NUMA, we can only allow 4-5 CPU cache misses (including costs of data TLB). + +To meet our requirements, given that only hash prefix iterating is needed, we made two decisions: + +1. to use a hash index, which is +1. directly addressed to rows, with no block structure. + +Having addressed our latency goal, the next task was to design a very compact hash index to minimize memory consumption. Some tricks we used to meet this goal: + +1. We only use 32-bit integers for data and index offsets.The first bit serves as a flag, so we can avoid using 8-byte pointers. +1. We never copy keys or parts of keys to index search structures. We store only offsets from which keys can be retrieved, to make comparisons with search keys. +1. Since our file is immutable, we can accurately estimate the number of hash buckets needed. + +To make sure the format works efficiently with empty queries, we added a bloom filter check before the query. This adds only one cache miss for non-empty cases [1], but avoids multiple cache misses for most empty results queries. This is a good trade-off for use cases with a large percentage of empty results. + +These are the design goals and basic ideas of PlainTable file format. For detailed information, see [this wiki page](https://github.com/facebook/rocksdb/wiki/PlainTable-Format). + +[1] Bloom filter checks typically require multiple memory access. However, because they are independent, they usually do not make the CPU pipeline stale. In any case, we improved the bloom filter to improve data locality - we may cover this further in a future blog post. + +### Comments + +**[Siying Dong](siying.d@fb.com)** + +Does [http://rocksdb.org/feed/](http://rocksdb.org/feed/) work? diff --git a/src/rocksdb/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown b/src/rocksdb/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown new file mode 100644 index 000000000..4411c7ae3 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown @@ -0,0 +1,89 @@ +--- +title: Avoid Expensive Locks in Get() +layout: post +author: leijin +category: blog +redirect_from: + - /blog/677/avoid-expensive-locks-in-get/ +--- + +As promised in the previous [blog post](blog/2014/05/14/lock.html)! + +RocksDB employs a multiversion concurrency control strategy. Before reading data, it needs to grab the current version, which is encapsulated in a data structure called [SuperVersion](https://reviews.facebook.net/rROCKSDB1fdb3f7dc60e96394e3e5b69a46ede5d67fb976c). + +<!--truncate--> + +At the beginning of `GetImpl()`, it used to do this: + + + <span class="zw-portion">mutex_.Lock(); + </span>auto* s = super_version_->Ref(); + mutex_.Unlock(); + + +The lock is necessary because pointer super_version_ may be updated, the corresponding SuperVersion may be deleted while Ref() is in progress. + + +`Ref()` simply increases the reference counter and returns “this” pointer. However, this simple operation posed big challenges for in-memory workload and stopped RocksDB from scaling read throughput beyond 8 cores. Running 32 read threads on a 32-core CPU leads to [70% system CPU usage](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf). This is outrageous! + + + + +Luckily, we found a way to circumvent this problem by using [thread local storage](http://en.wikipedia.org/wiki/Thread-local_storage). Version change is a rare event comparable to millions of read requests. On the very first Get() request, each thread pays the mutex cost to acquire a reference to the new super version. Instead of releasing the reference after use, the reference is cached in thread’s local storage. An atomic variable is used to track global super version number. Subsequent reads simply compare the local super version number against the global super version number. If they are the same, the cached super version reference may be used directly, at no cost. If a version change is detected, mutex must be acquired to update the reference. The cost of mutex lock is amortized among millions of reads and becomes negligible. + + + + +The code looks something like this: + + + + + + SuperVersion* s = thread_local_->Get(); + if (s->version_number != super_version_number_.load()) { + // slow path, cleanup of current super version is omitted + mutex_.Lock(); + s = super_version_->Ref(); + mutex_.Unlock(); + } + + + + +The result is quite amazing. RocksDB can nicely [scale to 32 cores](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf) and most CPU time is spent in user land. + + + + +Daryl Grove gives a pretty good [comparison between mutex and atomic](https://blogs.oracle.com/d/entry/the_cost_of_mutexes). However, the real cost difference lies beyond what is shown in the assembly code. Mutex can keep threads spinning on CPU or even trigger thread context switches in which all readers compete to access the critical area. Our approach prevents mutual competition by directing threads to check against a global version which does not change at high frequency, and is therefore much more cache-friendly. + + + + +The new approach entails one issue: a thread can visit GetImpl() once but can never come back again. SuperVersion is referenced and cached in its thread local storage. All resources (e.g., memtables, files) which belong to that version are frozen. A “supervisor” is required to visit each thread’s local storage and free its resources without incurring a lock. We designed a lockless sweep using CAS (compare and switch instruction). Here is how it works: + + + + +(1) A reader thread uses CAS to acquire SuperVersion from its local storage and to put in a special flag (SuperVersion::kSVInUse). + + + + +(2) Upon completion of GetImpl(), the reader thread tries to return SuperVersion to local storage by CAS, expecting the special flag (SuperVersion::kSVInUse) in its local storage. If it does not see SuperVersion::kSVInUse, that means a “sweep” was done and the reader thread is responsible for cleanup (this is expensive, but does not happen often on the hot path). + + + + +(3) After any flush/compaction, the background thread performs a sweep (CAS) across all threads’ local storage and frees encountered SuperVersion. A reader thread must re-acquire a new SuperVersion reference on its next visit. + +### Comments + +**[David Barbour](dmbarbour@gmail.com)** + +Please post an example of reading the same rocksdb concurrently. + +We are using the latest 3.0 rocksdb; however, when two separate processes +try and open the same rocksdb for reading, only one of the open requests +succeed. The other open always fails with “db/LOCK: Resource temporarily unavailable” So far we have not found an option that allows sharing the rocksdb for reads. An example would be most appreciated. diff --git a/src/rocksdb/docs/_posts/2014-06-27-rocksdb-3-2-release.markdown b/src/rocksdb/docs/_posts/2014-06-27-rocksdb-3-2-release.markdown new file mode 100644 index 000000000..e4eba6af4 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-06-27-rocksdb-3-2-release.markdown @@ -0,0 +1,30 @@ +--- +title: RocksDB 3.2 release +layout: post +author: leijin +category: blog +redirect_from: + - /blog/647/rocksdb-3-2-release/ +--- + +Check out new RocksDB release on [GitHub](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.2)! + +New Features in RocksDB 3.2: + + * PlainTable now supports a new key encoding: for keys of the same prefix, the prefix is only written once. It can be enabled through encoding_type paramter of NewPlainTableFactory() + + + * Add AdaptiveTableFactory, which is used to convert from a DB of PlainTable to BlockBasedTabe, or vise versa. It can be created using NewAdaptiveTableFactory() + +<!--truncate--> + +Public API changes: + + + * We removed seek compaction as a concept from RocksDB + + + * Add two paramters to NewHashLinkListRepFactory() for logging on too many entries in a hash bucket when flushing + + + * Added new option BlockBasedTableOptions::hash_index_allow_collision. When enabled, prefix hash index for block-based table will not store prefix and allow hash collision, reducing memory consumption diff --git a/src/rocksdb/docs/_posts/2014-07-29-rocksdb-3-3-release.markdown b/src/rocksdb/docs/_posts/2014-07-29-rocksdb-3-3-release.markdown new file mode 100644 index 000000000..d858e4faf --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-07-29-rocksdb-3-3-release.markdown @@ -0,0 +1,34 @@ +--- +title: RocksDB 3.3 Release +layout: post +author: yhciang +category: blog +redirect_from: + - /blog/1301/rocksdb-3-3-release/ +--- + +Check out new RocksDB release on [GitHub](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.3)! + +New Features in RocksDB 3.3: + + * **JSON API prototype**. + + + * **Performance improvement on HashLinkList**: We addressed performance outlier of HashLinkList caused by skewed bucket by switching data in the bucket from linked list to skip list. Add parameter threshold_use_skiplist in NewHashLinkListRepFactory(). + +<!--truncate--> + + * **More effective on storage space reclaim**: RocksDB is now able to reclaim storage space more effectively during the compaction process. This is done by compensating the size of each deletion entry by the 2X average value size, which makes compaction to be triggerred by deletion entries more easily. + + + * **TimeOut API to write**: Now WriteOptions have a variable called timeout_hint_us. With timeout_hint_us set to non-zero, any write associated with this timeout_hint_us may be aborted when it runs longer than the specified timeout_hint_us, and it is guaranteed that any write completes earlier than the specified time-out will not be aborted due to the time-out condition. + + + * **rate_limiter option**: We added an option that controls total throughput of flush and compaction. The throughput is specified in bytes/sec. Flush always has precedence over compaction when available bandwidth is constrained. + + + +Public API changes: + + + * Removed NewTotalOrderPlainTableFactory because it is not used and implemented semantically incorrect. diff --git a/src/rocksdb/docs/_posts/2014-09-12-cuckoo.markdown b/src/rocksdb/docs/_posts/2014-09-12-cuckoo.markdown new file mode 100644 index 000000000..22178f7ca --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-09-12-cuckoo.markdown @@ -0,0 +1,74 @@ +--- +title: Cuckoo Hashing Table Format +layout: post +author: radheshyam +category: blog +redirect_from: + - /blog/1427/new-bloom-filter-format/ +--- + +## Introduction + +We recently introduced a new [Cuckoo Hashing](http://en.wikipedia.org/wiki/Cuckoo_hashing) based SST file format which is optimized for fast point lookups. The new format was built for applications which require very high point lookup rates (~4Mqps) in read only mode but do not use operations like range scan, merge operator, etc. But, the existing RocksDB file formats were built to support range scan and other operations and the current best point lookup in RocksDB is 1.2 Mqps given by [PlainTable](https://github.com/facebook/rocksdb/wiki/PlainTable-Format)[ format](https://github.com/facebook/rocksdb/wiki/PlainTable-Format). This prompted a hashing based file format, which we present here. The new table format uses a cache friendly version of Cuckoo Hashing algorithm with only 1 or 2 memory accesses per lookup. + +<!--truncate--> + +Goals: + + * Reduce memory accesses per lookup to 1 or 2 + + + * Get an end to end point lookup rate of at least 4 Mqps + + + * Minimize database size + + +Assumptions: + + * Key length and value length are fixed + + + * The database is operated in read only mode + + +Non-goals: + + + * While optimizing the performance of Get() operation was our primary goal, compaction and build times were secondary. We may work on improving them in future. + + +Details for setting up the table format can be found in [GitHub](https://github.com/facebook/rocksdb/wiki/CuckooTable-Format). + + +## Cuckoo Hashing Algorithm + +In order to achieve high lookup speeds, we did multiple optimizations, including a cache friendly cuckoo hash algorithm. Cuckoo Hashing uses multiple hash functions, _h1, ..., __hn._ + +### Original Cuckoo Hashing + +To insert any new key _k_, we compute hashes of the key _h1(k), ..., __hn__(k)_. We insert the key in the first hash location that is free. If all the locations are blocked, we try to move one of the colliding keys to a different location by trying to re-insert it. + +Finding smallest set of keys to displace in order to accommodate the new key is naturally a shortest path problem in a directed graph where nodes are buckets of hash table and there is an edge from bucket _A_ to bucket _B_ if the element stored in bucket _A_ can be accommodated in bucket _B_ using one of the hash functions. The source nodes are the possible hash locations for the given key _k_ and destination is any one of the empty buckets. We use this algorithm to handle collision. + +To retrieve a key _k_, we compute hashes, _h1(k), ..., __hn__(k)_ and the key must be present in one of these locations. + +Our goal is to minimize average (and maximum) number of hash functions required and hence the number of memory accesses. In our experiments, with a hash utilization of 90%, we found that the average number of lookups is 1.8 and maximum is 3. Around 44% of keys are accommodated in first hash location and 33% in second location. + + +### Cache Friendly Cuckoo Hashing + +We noticed the following two sub-optimal properties in original Cuckoo implementation: + + + * If the key is not present in first hash location, we jump to second hash location which may not be in cache. This results in many cache misses. + + + * Because only 44% of keys are located in first cuckoo block, we couldn't have an optimal prefetching strategy - prefetching all hash locations for a key is wasteful. But prefetching only the first hash location helps only 44% of cases. + + + +The solution is to insert more keys near first location. In case of collision in the first hash location - _h1(k)_, we try to insert it in next few buckets, _h1(k)+1, _h1(k)+2, _..., h1(k)+t-1_. If all of these _t_ locations are occupied, we skip over to next hash function _h2_ and repeat the process. We call the set of _t_ buckets as a _Cuckoo Block_. We chose _t_ such that size of a block is not bigger than a cache line and we prefetch the first cuckoo block. + + +With the new algorithm, for 90% hash utilization, we found that 85% of keys are accommodated in first Cuckoo Block. Prefetching the first cuckoo block yields best results. For a database of 100 million keys with key length 8 and value length 4, the hash algorithm alone can achieve 9.6 Mqps and we are working on improving it further. End to end RocksDB performance results can be found [here](https://github.com/facebook/rocksdb/wiki/CuckooTable-Format). diff --git a/src/rocksdb/docs/_posts/2014-09-12-new-bloom-filter-format.markdown b/src/rocksdb/docs/_posts/2014-09-12-new-bloom-filter-format.markdown new file mode 100644 index 000000000..96fa50a40 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-09-12-new-bloom-filter-format.markdown @@ -0,0 +1,52 @@ +--- +title: New Bloom Filter Format +layout: post +author: zagfox +category: blog +redirect_from: + - /blog/1367/cuckoo/ +--- + +## Introduction + +In this post, we are introducing "full filter block" --- a new bloom filter format for [block based table](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format). This could bring about 40% of improvement for key query under in-memory (all data stored in memory, files stored in tmpfs/ramfs, an [example](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks) workload. The main idea behind is to generate a big filter that covers all the keys in SST file to avoid lots of unnecessary memory look ups. + + +<!--truncate--> + +## What is Bloom Filter + +In brief, [bloom filter](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter) is a bits array generated for a set of keys that could tell if an arbitrary key may exist in that set. + +In RocksDB, we generate such a bloom filter for each SST file. When we conduct a query for a key, we first goes to the bloom filter block of SST file. If key may exist in filter, we goes into data block in SST file to search for the key. If not, we would return directly. So it could help speed up point look up operation a lot. + +## Original Bloom Filter Format + +Original bloom filter creates filters for each individual data block in SST file. It has complex structure (ref [here](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format#filter-meta-block)) which results in a lot of non-adjacent memory look ups. + +Here's the work flow for checking original bloom filter in block based table: + +1. Given the target key, we goes to the index block to get the "data block ID" where this key may reside. +1. Using the "data block ID", we goes to the filter block and get the correct "offset of filter". +1. Using the "offset of filter", we goes to the actual filter and do the checking. + +## New Bloom Filter Format + +New bloom filter creates filter for all keys in SST file and we name it "full filter". The data structure of full filter is very simple, there is just one big filter: + + [ full filter ] + +In this way, the work flow of bloom filter checking is much simplified. + +(1) Given the target key, we goes directly to the filter block and conduct the filter checking. + +To be specific, there would be no checking for index block and no address jumping inside of filter block. + +Though it is a big filter, the total filter size would be the same as the original filter. + +One little draw back is that the new bloom filter introduces more memory consumption when building SST file because we need to buffer keys (or their hashes) before generating filter. Original filter just creates a bunch of small filters so it just buffer a small amount of keys. For full filter, we buffer hashes of all keys, which would take more memory when SST file size increases. + + +## Usage & Customization + +You can refer to the document here for [usage](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#usage-of-new-bloom-filter) and [customization](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#customize-your-own-filterpolicy). diff --git a/src/rocksdb/docs/_posts/2014-09-15-rocksdb-3-5-release.markdown b/src/rocksdb/docs/_posts/2014-09-15-rocksdb-3-5-release.markdown new file mode 100644 index 000000000..1878a5a56 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-09-15-rocksdb-3-5-release.markdown @@ -0,0 +1,38 @@ +--- +title: RocksDB 3.5 Release! +layout: post +author: leijin +category: blog +redirect_from: + - /blog/1547/rocksdb-3-5-release/ +--- + +New RocksDB release - 3.5! + + +**New Features** + + + 1. Add include/utilities/write_batch_with_index.h, providing a utility class to query data out of WriteBatch when building it. + + + 2. new ReadOptions.total_order_seek to force total order seek when block-based table is built with hash index. + +<!--truncate--> + +**Public API changes** + + + 1. The Prefix Extractor used with V2 compaction filters is now passed user key to SliceTransform::Transform instead of unparsed RocksDB key. + + + 2. Move BlockBasedTable related options to BlockBasedTableOptions from Options. Change corresponding JNI interface. Options affected include: no_block_cache, block_cache, block_cache_compressed, block_size, block_size_deviation, block_restart_interval, filter_policy, whole_key_filtering. filter_policy is changed to shared_ptr from a raw pointer. + + + 3. Remove deprecated options: disable_seek_compaction and db_stats_log_interval + + + 4. OptimizeForPointLookup() takes one parameter for block cache size. It now builds hash index, bloom filter, and block cache. + + +[https://github.com/facebook/rocksdb/releases/tag/v3.5](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.5) diff --git a/src/rocksdb/docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown b/src/rocksdb/docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown new file mode 100644 index 000000000..f18de0bbc --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown @@ -0,0 +1,112 @@ +--- +title: Migrating from LevelDB to RocksDB +layout: post +author: lgalanis +category: blog +redirect_from: + - /blog/1811/migrating-from-leveldb-to-rocksdb-2/ +--- + +If you have an existing application that uses LevelDB and would like to migrate to using RocksDB, one problem you need to overcome is to map the options for LevelDB to proper options for RocksDB. As of release 3.9 this can be automatically done by using our option conversion utility found in rocksdb/utilities/leveldb_options.h. What is needed, is to first replace `leveldb::Options` with `rocksdb::LevelDBOptions`. Then, use `rocksdb::ConvertOptions( )` to convert the `LevelDBOptions` struct into appropriate RocksDB options. Here is an example: + +<!--truncate--> + +LevelDB code: + +```c++ +#include <string> +#include "leveldb/db.h" + +using namespace leveldb; + +int main(int argc, char** argv) { + DB *db; + + Options opt; + opt.create_if_missing = true; + opt.max_open_files = 1000; + opt.block_size = 4096; + + Status s = DB::Open(opt, "/tmp/mydb", &db); + + delete db; +} +``` + +RocksDB code: + +```c++ +#include <string> +#include "rocksdb/db.h" +#include "rocksdb/utilities/leveldb_options.h" + +using namespace rocksdb; + +int main(int argc, char** argv) { + DB *db; + + LevelDBOptions opt; + opt.create_if_missing = true; + opt.max_open_files = 1000; + opt.block_size = 4096; + + Options rocksdb_options = ConvertOptions(opt); + // add rocksdb specific options here + + Status s = DB::Open(rocksdb_options, "/tmp/mydb_rocks", &db); + + delete db; +} +``` + +The difference is: + +```diff +-#include "leveldb/db.h" ++#include "rocksdb/db.h" ++#include "rocksdb/utilities/leveldb_options.h" + +-using namespace leveldb; ++using namespace rocksdb; + +- Options opt; ++ LevelDBOptions opt; + +- Status s = DB::Open(opt, "/tmp/mydb", &db); ++ Options rocksdb_options = ConvertOptions(opt); ++ // add rockdb specific options here ++ ++ Status s = DB::Open(rocksdb_options, "/tmp/mydb_rocks", &db); +``` + +Once you get up and running with RocksDB you can then focus on tuning RocksDB further by modifying the converted options struct. + +The reason why ConvertOptions is handy is because a lot of individual options in RocksDB have moved to other structures in different components. For example, block_size is not available in struct rocksdb::Options. It resides in struct rocksdb::BlockBasedTableOptions, which is used to create a TableFactory object that RocksDB uses internally to create the proper TableBuilder objects. If you were to write your application from scratch it would look like this: + +RocksDB code from scratch: + +```c++ +#include <string> +#include "rocksdb/db.h" +#include "rocksdb/table.h" + +using namespace rocksdb; + +int main(int argc, char** argv) { + DB *db; + + Options opt; + opt.create_if_missing = true; + opt.max_open_files = 1000; + + BlockBasedTableOptions topt; + topt.block_size = 4096; + opt.table_factory.reset(NewBlockBasedTableFactory(topt)); + + Status s = DB::Open(opt, "/tmp/mydb_rocks", &db); + + delete db; +} +``` + +The LevelDBOptions utility can ease migration to RocksDB from LevelDB and allows us to break down the various options across classes as it is needed. diff --git a/src/rocksdb/docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown b/src/rocksdb/docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown new file mode 100644 index 000000000..cddc0dd01 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown @@ -0,0 +1,41 @@ +--- +title: Reading RocksDB options from a file +layout: post +author: lgalanis +category: blog +redirect_from: + - /blog/1883/reading-rocksdb-options-from-a-file/ +--- + +RocksDB options can be provided using a file or any string to RocksDB. The format is straightforward: `write_buffer_size=1024;max_write_buffer_number=2`. Any whitespace around `=` and `;` is OK. Moreover, options can be nested as necessary. For example `BlockBasedTableOptions` can be nested as follows: `write_buffer_size=1024; max_write_buffer_number=2; block_based_table_factory={block_size=4k};`. Similarly any white space around `{` or `}` is ok. Here is what it looks like in code: + +<!--truncate--> + +```c++ +#include <string> +#include "rocksdb/db.h" +#include "rocksdb/table.h" +#include "rocksdb/utilities/convenience.h" + +using namespace rocksdb; + +int main(int argc, char** argv) { + DB *db; + + Options opt; + + std::string options_string = + "create_if_missing=true;max_open_files=1000;" + "block_based_table_factory={block_size=4096}"; + + Status s = GetDBOptionsFromString(opt, options_string, &opt); + + s = DB::Open(opt, "/tmp/mydb_rocks", &db); + + // use db + + delete db; +} +``` + +Using `GetDBOptionsFromString` is a convenient way of changing options for your RocksDB application without needing to resort to recompilation or tedious command line parsing. diff --git a/src/rocksdb/docs/_posts/2015-02-27-write-batch-with-index.markdown b/src/rocksdb/docs/_posts/2015-02-27-write-batch-with-index.markdown new file mode 100644 index 000000000..7f9f77653 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-02-27-write-batch-with-index.markdown @@ -0,0 +1,20 @@ +--- +title: 'WriteBatchWithIndex: Utility for Implementing Read-Your-Own-Writes' +layout: post +author: sdong +category: blog +redirect_from: + - /blog/1901/write-batch-with-index/ +--- + +RocksDB can be used as a storage engine of a higher level database. In fact, we are currently plugging RocksDB into MySQL and MongoDB as one of their storage engines. RocksDB can help with guaranteeing some of the ACID properties: durability is guaranteed by RocksDB by design; while consistency and isolation need to be enforced by concurrency controls on top of RocksDB; Atomicity can be implemented by committing a transaction's writes with one write batch to RocksDB in the end. + +<!--truncate--> + +However, if we enforce atomicity by only committing all writes in the end of the transaction in one batch, you cannot get the updated value from RocksDB previously written by the same transaction (read-your-own-write). To read the updated value, the databases on top of RocksDB need to maintain an internal buffer for all the written keys, and when a read happens they need to merge the result from RocksDB and from this buffer. This is a problem we faced when building the RocksDB storage engine in MongoDB. We solved it by creating a utility class, WriteBatchWithIndex (a write batch with a searchable index) and made it part of public API so that the community can also benefit from it. + +Before talking about the index part, let me introduce write batch first. The write batch class, `WriteBatch`, is a RocksDB data structure for atomic writes of multiple keys. Users can buffer their updates to a `WriteBatch` by calling `write_batch.Put("key1", "value1")` or `write_batch.Delete("key2")`, similar as calling RocksDB's functions of the same names. In the end, they call `db->Write(write_batch)` to atomically update all those batched operations to the DB. It is how a database can guarantee atomicity, as shown above. Adding a searchable index to `WriteBatch`, we now have `WriteBatchWithIndex`. Users can put updates to WriteBatchIndex in the same way as to `WriteBatch`. In the end, users can get a `WriteBatch` object from it and issue `db->Write()`. Additionally, users can create an iterator of a WriteBatchWithIndex, seek to any key location and iterate from there. + +To implement read-your-own-write using `WriteBatchWithIndex`, every time the user creates a transaction, we create a `WriteBatchWithIndex` attached to it. All the writes of the transaction go to the `WriteBatchWithIndex` first. When we commit the transaction, we atomically write the batch to RocksDB. When the user wants to call `Get()`, we first check if the value exists in the `WriteBatchWithIndex` and return the value if existing, by seeking and reading from an iterator of the write batch, before checking data in RocksDB. For example, here is the we implement it in MongoDB's RocksDB storage engine: [link](https://github.com/mongodb/mongo/blob/a31cc114a89a3645e97645805ba77db32c433dce/src/mongo/db/storage/rocks/rocks_recovery_unit.cpp#L245-L260). If a range query comes, we pass a DB's iterator to `WriteBatchWithIndex`, which creates a super iterator which combines the results from the DB iterator with the batch's iterator. Using this super iterator, we can iterate the DB with the transaction's own writes. Here is the iterator creation codes in MongoDB's RocksDB storage engine: [link](https://github.com/mongodb/mongo/blob/a31cc114a89a3645e97645805ba77db32c433dce/src/mongo/db/storage/rocks/rocks_recovery_unit.cpp#L266-L269). In this way, the database can solve the read-your-own-write problem by using RocksDB to handle a transaction's uncommitted writes. + +Using `WriteBatchWithIndex`, we successfully implemented read-your-own-writes in the RocksDB storage engine of MongoDB. If you also have a read-your-own-write problem, `WriteBatchWithIndex` can help you implement it quickly and correctly. diff --git a/src/rocksdb/docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown b/src/rocksdb/docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown new file mode 100644 index 000000000..1ffe2c532 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown @@ -0,0 +1,16 @@ +--- +title: Integrating RocksDB with MongoDB +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/1967/integrating-rocksdb-with-mongodb-2/ +--- + +Over the last couple of years, we have been busy integrating RocksDB with various services here at Facebook that needed to store key-value pairs locally. We have also seen other companies using RocksDB as local storage components of their distributed systems. + +<!--truncate--> + +The next big challenge for us is to bring RocksDB storage engine to general purpose databases. Today we have an exciting milestone to share with our community! We're running MongoDB with RocksDB in production and seeing great results! You can read more about it here: [http://blog.parse.com/announcements/mongodb-rocksdb-parse/](http://blog.parse.com/announcements/mongodb-rocksdb-parse/) + +Keep tuned for benchmarks and more stability and performance improvements. diff --git a/src/rocksdb/docs/_posts/2015-06-12-rocksdb-in-osquery.markdown b/src/rocksdb/docs/_posts/2015-06-12-rocksdb-in-osquery.markdown new file mode 100644 index 000000000..f3a55faae --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-06-12-rocksdb-in-osquery.markdown @@ -0,0 +1,10 @@ +--- +title: RocksDB in osquery +layout: post +author: icanadi +category: lgalanis +redirect_from: + - /blog/1997/rocksdb-in-osquery/ +--- + +Check out [this](https://code.facebook.com/posts/1411870269134471/how-rocksdb-is-used-in-osquery/) blog post by [Mike Arpaia](https://www.facebook.com/mike.arpaia) and [Ted Reed](https://www.facebook.com/treeded) about how osquery leverages RocksDB to build an embedded pub-sub system. This article is a great read and contains insights on how to properly use RocksDB. diff --git a/src/rocksdb/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown b/src/rocksdb/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown new file mode 100644 index 000000000..b3e2703fc --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown @@ -0,0 +1,92 @@ +--- +title: RocksDB 2015 H2 roadmap +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/2015/rocksdb-2015-h2-roadmap/ +--- + +Every 6 months, RocksDB team gets together to prioritize the work ahead of us. We just went through this exercise and we wanted to share the results with the community. Here's what RocksDB team will be focusing on for the next 6 months: + +<!--truncate--> + +**MyRocks** + +As you might know, we're working hard to integrate RocksDB as a storage engine for MySQL. This project is pretty important for us because we're heavy users of MySQL. We're already getting pretty good performance results, but there is more work to be done. We need to focus on both performance and stability. The most high priority items on are list are: + + + + + 1. Reduce CPU costs of RocksDB as a MySQL storage engine + + + 2. Implement pessimistic concurrency control to support repeatable read isolation level in MyRocks + + + 3. Reduce P99 read latency, which is high mostly because of lingering tombstones + + + 4. Port ZSTD compression + + +**MongoRocks** + +Another database that we're working on is MongoDB. The project of integrating MongoDB with RocksDB storage engine is called MongoRocks. It's already running in production at Parse [1] and we're seeing surprisingly few issues. Our plans for the next half: + + + + + 1. Keep improving performance and stability, possibly reuse work done on MyRocks (workloads are pretty similar). + + + 2. Increase internal and external adoption. + + + 3. Support new MongoDB 3.2. + + +**RocksDB on cheaper storage media** + +Up to now, our mission was to build the best key-value store “for fast storage” (flash and in-memory). However, there are some use-cases at Facebook that don't need expensive high-end storage. In the next six months, we plan to deploy RocksDB on cheaper storage media. We will optimize performance to RocksDB on either or both: + + + + + 1. Hard drive storage array. + + + 2. Tiered Storage. + + +**Quality of Service** + +When talking to our customers, there are couple of issues that keep reoccurring. We need to fix them to make our customers happy. We will improve RocksDB to provide better assurance of performance and resource usage. Non-exhaustive list includes: + + + + + 1. Iterate P99 can be high due to the presence of tombstones. + + + 2. Write stalls can happen during high write loads. + + + 3. Better control of memory and disk usage. + + + 4. Service quality and performance of backup engine. + + +**Operation's user experience** + +As we increase deployment of RocksDB, engineers are spending more time on debugging RocksDB issues. We plan to improve user experience when running RocksDB. The goal is to reduce TTD (time-to-debug). The work includes monitoring, visualizations and documentations. + +[1]( http://blog.parse.com/announcements/mongodb-rocksdb-parse/](http://blog.parse.com/announcements/mongodb-rocksdb-parse/) + + +### Comments + +**[Mike](allspace2012@outlook.com)** + +What’s the status of this roadmap? “RocksDB on cheaper storage media”, has this been implemented? diff --git a/src/rocksdb/docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown b/src/rocksdb/docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown new file mode 100644 index 000000000..53c1f5a90 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown @@ -0,0 +1,78 @@ +--- +title: Spatial indexing in RocksDB +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/2039/spatial-indexing-in-rocksdb/ +--- + +About a year ago, there was a need to develop a spatial database at Facebook. We needed to store and index Earth's map data. Before building our own, we looked at the existing spatial databases. They were all very good technology, but also general purpose. We could sacrifice a general-purpose API, so we thought we could build a more performant database, since it would be specifically designed for our use-case. Furthermore, we decided to build the spatial database on top of RocksDB, because we have a lot of operational experience with running and tuning RocksDB at a large scale. + +<!--truncate--> + +When we started looking at this project, the first thing that surprised us was that our planet is not that big. Earth's entire map data can fit in memory on a reasonably high-end machine. Thus, we also decided to build a spatial database optimized for memory-resident dataset. + +The first use-case of our spatial database was an experimental map renderer. As part of our project, we successfully loaded [Open Street Maps](https://www.openstreetmap.org/) dataset and hooked it up with [Mapnik](http://mapnik.org/), a map rendering engine. + +The usual Mapnik workflow is to load the map data into a SQL-based database and then define map layers with SQL statements. To render a tile, Mapnik needs to execute a couple of SQL queries. The benefit of this approach is that you don't need to reload your database when you change your map style. You can just change your SQL query and Mapnik picks it up. In our model, we decided to precompute the features we need for each tile. We need to know the map style before we create the database. However, when rendering the map tile, we only fetch the features that we need to render. + +We haven't open sourced the RocksDB Mapnik plugin or the database loading pipeline. However, the spatial indexing is available in RocksDB under a name [SpatialDB](https://github.com/facebook/rocksdb/blob/main/include/rocksdb/utilities/spatial_db.h). The API is focused on map rendering use-case, but we hope that it can also be used for other spatial-based applications. + +Let's take a tour of the API. When you create a spatial database, you specify the spatial indexes that need to be built. Each spatial index is defined by a bounding box and granularity. For map rendering, we create a spatial index for each zoom levels. Higher zoom levels have more granularity. + + + + SpatialDB::Create( + SpatialDBOptions(), + "/data/map", { + SpatialIndexOptions("zoom10", BoundingBox(0, 0, 100, 100), 10), + SpatialIndexOptions("zoom16", BoundingBox(0, 0, 100, 100), 16) + } + ); + + + + +When you insert a feature (building, street, country border) into SpatialDB, you need to specify the list of spatial indexes that will index the feature. In the loading phase we process the map style to determine the list of zoom levels on which we'll render the feature. For example, we will not render the building on zoom level that shows an entire country. Building will only be indexed on higher zoom level's index. Country borders will be indexes on all zoom levels. + + + + FeatureSet feature; + feature.Set("type", "building"); + feature.Set("height", 6); + db->Insert(WriteOptions(), BoundingBox<double>(5, 5, 10, 10), + well_known_binary_blob, feature, {"zoom16"}); + + + + +The indexing part is pretty simple. For each feature, we first find a list of index tiles that it intersects. Then, we add a link from the tile's [quad key](https://msdn.microsoft.com/en-us/library/bb259689.aspx) to the feature's primary key. Using quad keys improves data locality, i.e. features closer together geographically will have similar quad keys. Even though we're optimizing for a memory-resident dataset, data locality is still very important due to different caching effects. + +After you're done inserting all the features, you can call an API Compact() that will compact the dataset and speed up read queries. + + + + db->Compact(); + + + + +SpatialDB's query specifies: 1) bounding box we're interested in, and 2) a zoom level. We find all tiles that intersect with the query's bounding box and return all features in those tiles. + + + + + Cursor* c = db_->Query(ReadOptions(), BoundingBox<double>(1, 1, 7, 7), "zoom16"); + for (c->Valid(); c->Next()) { + Render(c->blob(), c->feature_set()); + } + + + + +Note: `Render()` function is not part of RocksDB. You will need to use one of many open source map renderers, for example check out [Mapnik](http://mapnik.org/). + +TL;DR If you need an embedded spatial database, check out RocksDB's SpatialDB. [Let us know](https://www.facebook.com/groups/rocksdb.dev/) how we can make it better. + +If you're interested in learning more, check out this [talk](https://www.youtube.com/watch?v=T1jWsDMONM8). diff --git a/src/rocksdb/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown b/src/rocksdb/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown new file mode 100644 index 000000000..b6bb47d53 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown @@ -0,0 +1,30 @@ +--- +title: RocksDB is now available in Windows Platform +layout: post +author: dmitrism +category: blog +redirect_from: + - /blog/2033/rocksdb-is-now-available-in-windows-platform/ +--- + +Over the past 6 months we have seen a number of use cases where RocksDB is successfully used by the community and various companies to achieve high throughput and volume in a modern server environment. + +We at Microsoft Bing could not be left behind. As a result we are happy to [announce](http://bit.ly/1OmWBT9) the availability of the Windows Port created here at Microsoft which we intend to use as a storage option for one of our key/value data stores. + +<!--truncate--> + +We are happy to make this available for the community. Keep tuned for more announcements to come. + +### Comments + +**[Siying Dong](siying.d@fb.com)** + +Appreciate your contributions to RocksDB project! I believe it will benefits many users! + +**[empresas sevilla](oxofkx@gmail.com)** + +Magnifico artículo|, un placer leer el blog + +**[jak usunac](tomogedac@o2.pl)** + +I believe it will benefits too diff --git a/src/rocksdb/docs/_posts/2015-07-23-dynamic-level.markdown b/src/rocksdb/docs/_posts/2015-07-23-dynamic-level.markdown new file mode 100644 index 000000000..0ff3a0542 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-07-23-dynamic-level.markdown @@ -0,0 +1,29 @@ +--- +title: Dynamic Level Size for Level-Based Compaction +layout: post +author: sdong +category: blog +redirect_from: + - /blog/2207/dynamic-level/ +--- + +In this article, we follow up on the first part of an answer to one of the questions in our [AMA](https://www.reddit.com/r/IAmA/comments/3de3cv/we_are_rocksdb_engineering_team_ask_us_anything/ct4a8tb), the dynamic level size in level-based compaction. + +<!--truncate--> + +Level-based compaction is the original LevelDB compaction style and one of the two major compaction styles in RocksDB (See [our wiki](https://github.com/facebook/rocksdb/wiki/RocksDB-Basics#multi-threaded-compactions)). In RocksDB we introduced parallelism and more configurable options to it but the main algorithm stayed the same, until we recently introduced the dynamic level size mode. + + +In level-based compaction, we organize data to different sorted runs, called levels. Each level has a target size. Usually target size of levels increases by the same size multiplier. For example, you can set target size of level 1 to be 1GB, and size multiplier to be 10, and the target size of level 1, 2, 3, 4 will be 1GB, 10GB, 100GB and 1000GB. Before level 1, there will be some staging file flushed from mem tables, called Level 0 files, which will later be merged to level 1. Compactions will be triggered as soon as actual size of a level exceeds its target size. We will merge a subset of data of that level to next level, to reduce size of the level. More compactions will be triggered until sizes of all the levels are lower than their target sizes. In a steady state, the size of each level will be around the same size of the size of level targets. + + +Level-based compaction’s advantage is its good space efficiency. We usually use the metric space amplification to measure the space efficiency. In this article ignore the effects of data compression so space amplification= size_on_file_system / size_of_user_data. + + +How do we estimate space amplification of level-based compaction? We focus specifically on the databases in steady state, which means database size is stable or grows slowly over time. This means updates will add roughly the same or little more data than what is removed by deletes. Given that, if we compact all the data all to the last level, the size of level will be equal as the size of last level before the compaction. On the other hand, the size of user data will be approximately the size of DB if we compact all the levels down to the last level. So the size of the last level will be a good estimation of user data size. So total size of the DB divided by the size of the last level will be a good estimation of space amplification. + + +Applying the equation, if we have four non-zero levels, their sizes are 1GB, 10GB, 100GB, 1000GB, the size amplification will be approximately (1000GB + 100GB + 10GB + 1GB) / 1000GB = 1.111, which is a very good number. However, there is a catch here: how to make sure the last level’s size is 1000GB, the same as the level’s size target? A user has to fine tune level sizes to achieve this number and will need to re-tune if DB size changes. The theoretic number 1.11 is hard to achieve in practice. In a worse case, if you have the target size of last level to be 1000GB but the user data is only 200GB, then the actual space amplification will be (200GB + 100GB + 10GB + 1GB) / 200GB = 1.555, a much worse number. + + +To solve this problem, my colleague Igor Kabiljo came up with a solution of dynamic level size target mode. You can enable it by setting options.level_compaction_dynamic_level_bytes=true. In this mode, size target of levels are changed dynamically based on size of the last level. Suppose the level size multiplier to be 10, and the DB size is 200GB. The target size of the last level is automatically set to be the actual size of the level, which is 200GB, the second to last level’s size target will be automatically set to be size_last_level / 10 = 20GB, the third last level’s will be size_last_level/100 = 2GB, and next level to be size_last_level/1000 = 200MB. We stop here because 200MB is within the range of the first level. In this way, we can achieve the 1.111 space amplification, without fine tuning of the level size targets. More details can be found in [code comments of the option](https://github.com/facebook/rocksdb/blob/v3.11/include/rocksdb/options.h#L366-L423) in the header file. diff --git a/src/rocksdb/docs/_posts/2015-10-27-getthreadlist.markdown b/src/rocksdb/docs/_posts/2015-10-27-getthreadlist.markdown new file mode 100644 index 000000000..92f743adc --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-10-27-getthreadlist.markdown @@ -0,0 +1,193 @@ +--- +title: GetThreadList +layout: post +author: yhciang +category: blog +redirect_from: + - /blog/2261/getthreadlist/ +--- + +We recently added a new API, called `GetThreadList()`, that exposes the RocksDB background thread activity. With this feature, developers will be able to obtain the real-time information about the currently running compactions and flushes such as the input / output size, elapsed time, the number of bytes it has written. Below is an example output of `GetThreadList`. To better illustrate the example, we have put a sample output of `GetThreadList` into a table where each column represents a thread status: + +<!--truncate--> + +<table width="637" > +<tbody > +<tr style="border:2px solid #000000" > + +<td style="padding:3px" >ThreadID +</td> + +<td >140716395198208 +</td> + +<td >140716416169728 +</td> +</tr> +<tr > + +<td style="padding:3px" >DB +</td> + +<td >db1 +</td> + +<td >db2 +</td> +</tr> +<tr > + +<td style="padding:3px" >CF +</td> + +<td >default +</td> + +<td >picachu +</td> +</tr> +<tr > + +<td style="padding:3px" >ThreadType +</td> + +<td >High Pri +</td> + +<td >Low Pri +</td> +</tr> +<tr > + +<td style="padding:3px" >Operation +</td> + +<td >Flush +</td> + +<td >Compaction +</td> +</tr> +<tr > + +<td style="padding:3px" >ElapsedTime +</td> + +<td >143.459 ms +</td> + +<td >607.538 ms +</td> +</tr> +<tr > + +<td style="padding:3px" >Stage +</td> + +<td >FlushJob::WriteLevel0Table +</td> + +<td >CompactionJob::Install +</td> +</tr> +<tr > + +<td style="vertical-align:top;padding:3px" >OperationProperties +</td> + +<td style="vertical-align:top;padding:3px" > +BytesMemtables 4092938 +BytesWritten 1050701 +</td> + +<td style="vertical-align:top" > +BaseInputLevel 1 +BytesRead 4876417 +BytesWritten 4140109 +IsDeletion 0 +IsManual 0 +IsTrivialMove 0 +JobID 146 +OutputLevel 2 +TotalInputBytes 4883044 +</td> +</tr> +</tbody> +</table> + +In the above output, we can see `GetThreadList()` reports the activity of two threads: one thread running flush job (middle column) and the other thread running a compaction job (right-most column). In each thread status, it shows basic information about the thread such as thread id, it's target db / column family, and the job it is currently doing and the current status of the job. For instance, we can see thread 140716416169728 is doing compaction on the `picachu` column family in database `db2`. In addition, we can see the compaction has been running for 600 ms, and it has read 4876417 bytes out of 4883044 bytes. This indicates the compaction is about to complete. The stage property indicates which code block the thread is currently executing. For instance, thread 140716416169728 is currently running `CompactionJob::Install`, which further indicates the compaction job is almost done. + +Below we briefly describe its API. + + +## How to Enable it? + + +To enable thread-tracking of a rocksdb instance, simply set `enable_thread_tracking` to true in its DBOptions: + +```c++ +// If true, then the status of the threads involved in this DB will +// be tracked and available via GetThreadList() API. +// +// Default: false +bool enable_thread_tracking; +``` + + + +## The API + + +The GetThreadList API is defined in [include/rocksdb/env.h](https://github.com/facebook/rocksdb/blob/main/include/rocksdb/env.h#L317-L318), which is an Env +function: + +```c++ +virtual Status GetThreadList(std::vector* thread_list) +``` + +Since an Env can be shared across multiple rocksdb instances, the output of +`GetThreadList()` include the background activity of all the rocksdb instances +that using the same Env. + +The `GetThreadList()` API simply returns a vector of `ThreadStatus`, each describes +the current status of a thread. The `ThreadStatus` structure, defined in +[include/rocksdb/thread_status.h](https://github.com/facebook/rocksdb/blob/main/include/rocksdb/thread_status.h), contains the following information: + +```c++ +// An unique ID for the thread. +const uint64_t thread_id; + +// The type of the thread, it could be HIGH_PRIORITY, +// LOW_PRIORITY, and USER +const ThreadType thread_type; + +// The name of the DB instance where the thread is currently +// involved with. It would be set to empty string if the thread +// does not involve in any DB operation. +const std::string db_name; + +// The name of the column family where the thread is currently +// It would be set to empty string if the thread does not involve +// in any column family. +const std::string cf_name; + +// The operation (high-level action) that the current thread is involved. +const OperationType operation_type; + +// The elapsed time in micros of the current thread operation. +const uint64_t op_elapsed_micros; + +// An integer showing the current stage where the thread is involved +// in the current operation. +const OperationStage operation_stage; + +// A list of properties that describe some details about the current +// operation. Same field in op_properties[] might have different +// meanings for different operations. +uint64_t op_properties[kNumOperationProperties]; + +// The state (lower-level action) that the current thread is involved. +const StateType state_type; +``` + +If you are interested in the background thread activity of your RocksDB application, please feel free to give `GetThreadList()` a try :) diff --git a/src/rocksdb/docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown b/src/rocksdb/docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown new file mode 100644 index 000000000..6852b8ffa --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown @@ -0,0 +1,45 @@ +--- +title: Use Checkpoints for Efficient Snapshots +layout: post +author: rven2 +category: blog +redirect_from: + - /blog/2609/use-checkpoints-for-efficient-snapshots/ +--- + +**Checkpoint** is a feature in RocksDB which provides the ability to take a snapshot of a running RocksDB database in a separate directory. Checkpoints can be used as a point in time snapshot, which can be opened Read-only to query rows as of the point in time or as a Writeable snapshot by opening it Read-Write. Checkpoints can be used for both full and incremental backups. + +<!--truncate--> + + +The Checkpoint feature enables RocksDB to create a consistent snapshot of a given RocksDB database in the specified directory. If the snapshot is on the same filesystem as the original database, the SST files will be hard-linked, otherwise SST files will be copied. The manifest and CURRENT files will be copied. In addition, if there are multiple column families, log files will be copied for the period covering the start and end of the checkpoint, in order to provide a consistent snapshot across column families. + + + + +A Checkpoint object needs to be created for a database before checkpoints are created. The API is as follows: + + + + +`Status Create(DB* db, Checkpoint** checkpoint_ptr);` + + + + +Given a checkpoint object and a directory, the CreateCheckpoint function creates a consistent snapshot of the database in the given directory. + + + + +`Status CreateCheckpoint(const std::string& checkpoint_dir);` + + + + +The directory should not already exist and will be created by this API. The directory will be an absolute path. The checkpoint can be used as a read-only copy of the DB or can be opened as a standalone DB. When opened read/write, the SST files continue to be hard links and these links are removed when the files are obsoleted. When the user is done with the snapshot, the user can delete the directory to remove the snapshot. + + + + +Checkpoints are used for online backup in MyRocks. which is MySQL using RocksDB as the storage engine . ([MySQL on RocksDB](https://github.com/facebook/mysql-5.6)) diff --git a/src/rocksdb/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown b/src/rocksdb/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown new file mode 100644 index 000000000..b21b04fe3 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown @@ -0,0 +1,244 @@ +--- +title: Analysis File Read Latency by Level +layout: post +author: sdong +category: blog +redirect_from: + - /blog/2537/analysis-file-read-latency-by-level/ +--- + +In many use cases of RocksDB, people rely on OS page cache for caching compressed data. With this approach, verifying effective of the OS page caching is challenging, because file system is a black box to users. + +As an example, a user can tune the DB as following: use level-based compaction, with L1 - L4 sizes to be 1GB, 10GB, 100GB and 1TB. And they reserve about 20GB memory as OS page cache, expecting level 0, 1 and 2 are mostly cached in memory, leaving only reads from level 3 and 4 requiring disk I/Os. However, in practice, it's not easy to verify whether OS page cache does exactly what we expect. For example, if we end up with doing 4 instead of 2 I/Os per query, it's not easy for users to figure out whether the it's because of efficiency of OS page cache or reading multiple blocks for a level. Analysis like it is especially important if users run RocksDB on hard drive disks, for the gap of latency between hard drives and memory is much higher than flash-based SSDs. + +<!--truncate--> + +In order to make tuning easier, we added new instrumentation to help users analysis latency distribution of file reads in different levels. If users turn DB statistics on, we always keep track of distribution of file read latency for each level. Users can retrieve the information by querying DB property “rocksdb.stats” ( [https://github.com/facebook/rocksdb/blob/v3.13.1/include/rocksdb/db.h#L315-L316](https://github.com/facebook/rocksdb/blob/v3.13.1/include/rocksdb/db.h#L315-L316) ). It will also printed out as a part of compaction summary in info logs periodically. + +The output looks like this: + + +``` +** Level 0 read latency histogram (micros): +Count: 696 Average: 489.8118 StdDev: 222.40 +Min: 3.0000 Median: 452.3077 Max: 1896.0000 +Percentiles: P50: 452.31 P75: 641.30 P99: 1068.00 P99.9: 1860.80 P99.99: 1896.00 +------------------------------------------------------ +[ 2, 3 ) 1 0.144% 0.144% +[ 18, 20 ) 1 0.144% 0.287% +[ 45, 50 ) 5 0.718% 1.006% +[ 50, 60 ) 26 3.736% 4.741% # +[ 60, 70 ) 6 0.862% 5.603% +[ 90, 100 ) 1 0.144% 5.747% +[ 120, 140 ) 2 0.287% 6.034% +[ 140, 160 ) 1 0.144% 6.178% +[ 160, 180 ) 1 0.144% 6.322% +[ 200, 250 ) 9 1.293% 7.615% +[ 250, 300 ) 45 6.466% 14.080% # +[ 300, 350 ) 88 12.644% 26.724% ### +[ 350, 400 ) 88 12.644% 39.368% ### +[ 400, 450 ) 71 10.201% 49.569% ## +[ 450, 500 ) 65 9.339% 58.908% ## +[ 500, 600 ) 74 10.632% 69.540% ## +[ 600, 700 ) 92 13.218% 82.759% ### +[ 700, 800 ) 64 9.195% 91.954% ## +[ 800, 900 ) 35 5.029% 96.983% # +[ 900, 1000 ) 12 1.724% 98.707% +[ 1000, 1200 ) 6 0.862% 99.569% +[ 1200, 1400 ) 2 0.287% 99.856% +[ 1800, 2000 ) 1 0.144% 100.000% + +** Level 1 read latency histogram (micros): +(......not pasted.....) + +** Level 2 read latency histogram (micros): +(......not pasted.....) + +** Level 3 read latency histogram (micros): +(......not pasted.....) + +** Level 4 read latency histogram (micros): +(......not pasted.....) + +** Level 5 read latency histogram (micros): +Count: 25583746 Average: 421.1326 StdDev: 385.11 +Min: 1.0000 Median: 376.0011 Max: 202444.0000 +Percentiles: P50: 376.00 P75: 438.00 P99: 1421.68 P99.9: 4164.43 P99.99: 9056.52 +------------------------------------------------------ +[ 0, 1 ) 2351 0.009% 0.009% +[ 1, 2 ) 6077 0.024% 0.033% +[ 2, 3 ) 8471 0.033% 0.066% +[ 3, 4 ) 788 0.003% 0.069% +[ 4, 5 ) 393 0.002% 0.071% +[ 5, 6 ) 786 0.003% 0.074% +[ 6, 7 ) 1709 0.007% 0.080% +[ 7, 8 ) 1769 0.007% 0.087% +[ 8, 9 ) 1573 0.006% 0.093% +[ 9, 10 ) 1495 0.006% 0.099% +[ 10, 12 ) 3043 0.012% 0.111% +[ 12, 14 ) 2259 0.009% 0.120% +[ 14, 16 ) 1233 0.005% 0.125% +[ 16, 18 ) 762 0.003% 0.128% +[ 18, 20 ) 451 0.002% 0.130% +[ 20, 25 ) 794 0.003% 0.133% +[ 25, 30 ) 1279 0.005% 0.138% +[ 30, 35 ) 1172 0.005% 0.142% +[ 35, 40 ) 1363 0.005% 0.148% +[ 40, 45 ) 409 0.002% 0.149% +[ 45, 50 ) 105 0.000% 0.150% +[ 50, 60 ) 80 0.000% 0.150% +[ 60, 70 ) 280 0.001% 0.151% +[ 70, 80 ) 1583 0.006% 0.157% +[ 80, 90 ) 4245 0.017% 0.174% +[ 90, 100 ) 6572 0.026% 0.200% +[ 100, 120 ) 9724 0.038% 0.238% +[ 120, 140 ) 3713 0.015% 0.252% +[ 140, 160 ) 2383 0.009% 0.261% +[ 160, 180 ) 18344 0.072% 0.333% +[ 180, 200 ) 51873 0.203% 0.536% +[ 200, 250 ) 631722 2.469% 3.005% +[ 250, 300 ) 2721970 10.639% 13.644% ## +[ 300, 350 ) 5909249 23.098% 36.742% ##### +[ 350, 400 ) 6522507 25.495% 62.237% ##### +[ 400, 450 ) 4296332 16.793% 79.030% ### +[ 450, 500 ) 2130323 8.327% 87.357% ## +[ 500, 600 ) 1553208 6.071% 93.428% # +[ 600, 700 ) 642129 2.510% 95.938% # +[ 700, 800 ) 372428 1.456% 97.394% +[ 800, 900 ) 187561 0.733% 98.127% +[ 900, 1000 ) 85858 0.336% 98.462% +[ 1000, 1200 ) 82730 0.323% 98.786% +[ 1200, 1400 ) 50691 0.198% 98.984% +[ 1400, 1600 ) 38026 0.149% 99.133% +[ 1600, 1800 ) 32991 0.129% 99.261% +[ 1800, 2000 ) 30200 0.118% 99.380% +[ 2000, 2500 ) 62195 0.243% 99.623% +[ 2500, 3000 ) 36684 0.143% 99.766% +[ 3000, 3500 ) 21317 0.083% 99.849% +[ 3500, 4000 ) 10216 0.040% 99.889% +[ 4000, 4500 ) 8351 0.033% 99.922% +[ 4500, 5000 ) 4152 0.016% 99.938% +[ 5000, 6000 ) 6328 0.025% 99.963% +[ 6000, 7000 ) 3253 0.013% 99.976% +[ 7000, 8000 ) 2082 0.008% 99.984% +[ 8000, 9000 ) 1546 0.006% 99.990% +[ 9000, 10000 ) 1055 0.004% 99.994% +[ 10000, 12000 ) 1566 0.006% 100.000% +[ 12000, 14000 ) 761 0.003% 100.003% +[ 14000, 16000 ) 462 0.002% 100.005% +[ 16000, 18000 ) 226 0.001% 100.006% +[ 18000, 20000 ) 126 0.000% 100.006% +[ 20000, 25000 ) 107 0.000% 100.007% +[ 25000, 30000 ) 43 0.000% 100.007% +[ 30000, 35000 ) 15 0.000% 100.007% +[ 35000, 40000 ) 14 0.000% 100.007% +[ 40000, 45000 ) 16 0.000% 100.007% +[ 45000, 50000 ) 1 0.000% 100.007% +[ 50000, 60000 ) 22 0.000% 100.007% +[ 60000, 70000 ) 10 0.000% 100.007% +[ 70000, 80000 ) 5 0.000% 100.007% +[ 80000, 90000 ) 14 0.000% 100.007% +[ 90000, 100000 ) 11 0.000% 100.007% +[ 100000, 120000 ) 33 0.000% 100.007% +[ 120000, 140000 ) 6 0.000% 100.007% +[ 140000, 160000 ) 3 0.000% 100.007% +[ 160000, 180000 ) 7 0.000% 100.007% +[ 200000, 250000 ) 2 0.000% 100.007% +``` + + +In this example, you can see we only issued 696 reads from level 0 while issued 25 million reads from level 5. The latency distribution is also clearly shown among those reads. This will be helpful for users to analysis OS page cache efficiency. + +Currently the read latency per level includes reads from data blocks, index blocks, as well as bloom filter blocks. We are also working on a feature to break down those three type of blocks. + +### Comments + +**[Tao Feng](fengtao04@gmail.com)** + +Is this feature also included in RocksJava? + +**[Siying Dong](siying.d@fb.com)** + +Should be. As long as you enable statistics, you should be able to get the value from `RocksDB.getProperty()` with property `rocksdb.dbstats`. Let me know if you can’t find it. + +**[chiddu](cnbscience@gmail.com)** + +> In this example, you can see we only issued 696 reads from level 0 while issued 256K reads from level 5. + +Isn’t it 2.5 M of reads instead of 256K ? . + +Also could anyone please provide more description on the histogram ? especially + +> Count: 25583746 Average: 421.1326 StdDev: 385.11 +> Min: 1.0000 Median: 376.0011 Max: 202444.0000 +> Percentiles: P50: 376.00 P75: 438.00 P99: 1421.68 P99.9: 4164.43 P99.99: 9056.52 + +and + +> [ 0, 1 ) 2351 0.009% 0.009% +> [ 1, 2 ) 6077 0.024% 0.033% +> [ 2, 3 ) 8471 0.033% 0.066% +> [ 3, 4 ) 788 0.003% 0.069%” + +thanks in advance + +**[Siying Dong](siying.d@fb.com)** + +Thank you for pointing out the mistake. I fixed it now. + +In this output, there are 2.5 million samples, average latency is 421 micro seconds, with standard deviation 385. Median is 376, max value is 202 milliseconds. 0.009% has value of 1, 0.024% has value of 1, 0.033% has value of 2. Accumulated value from 0 to 2 is 0.066%. + +Hope it helps. + +**[chiddu](cnbscience@gmail.com)** + +Thank you Siying for the quick reply, I was running couple of benchmark testing to check the performance of rocksdb on SSD. One of the test is similar to what is mentioned in the wiki, TEST 4 : Random read , except the key_size is 10 and value_size is 20. I am inserting 1 billion hashes and reading 1 billion hashes with 32 threads. The histogram shows something like this + +``` +Level 5 read latency histogram (micros): +Count: 7133903059 Average: 480.4357 StdDev: 309.18 +Min: 0.0000 Median: 551.1491 Max: 224142.0000 +Percentiles: P50: 551.15 P75: 651.44 P99: 996.52 P99.9: 2073.07 P99.99: 3196.32 +—————————————————— +[ 0, 1 ) 28587385 0.401% 0.401% +[ 1, 2 ) 686572516 9.624% 10.025% ## +[ 2, 3 ) 567317522 7.952% 17.977% ## +[ 3, 4 ) 44979472 0.631% 18.608% +[ 4, 5 ) 50379685 0.706% 19.314% +[ 5, 6 ) 64930061 0.910% 20.224% +[ 6, 7 ) 22613561 0.317% 20.541% +…………more…………. +``` + +If I understand your previous comment correctly, + +1. How is it that the count is around 7 billion when I have only inserted 1 billion hashes ? is the stat broken ? +1. What does the percentiles and the numbers signify ? +1. 0, 1 ) 28587385 0.401% 0.401% what does this “28587385” stand for in the histogram row ? + +**[Siying Dong](siying.d@fb.com)** + +If I remember correctly, with db_bench, if you specify –num=1000000000 –threads=32, it is every thread reading one billion keys, total of 32 billions. Is it the case you ran into? + +28,587,385 means that number of data points take the value [0,1) +28,587,385 / 7,133,903,058 = 0.401% provides percentage. + +**[chiddu](cnbscience@gmail.com)** + +I do have `num=1000000000` and `t=32`. The script says reading 1 billion hashes and not 32 billion hashes. + +this is the script on which I have used + +``` +echo “Load 1B keys sequentially into database…..” +bpl=10485760;overlap=10;mcz=2;del=300000000;levels=6;ctrig=4; delay=8; stop=12; wbn=3; mbc=20; mb=67108864;wbs=134217728; dds=1; sync=0; r=1000000000; t=1; vs=20; bs=4096; cs=1048576; of=500000; si=1000000; ./db_bench –benchmarks=fillseq –disable_seek_compaction=1 –mmap_read=0 –statistics=1 –histogram=1 –num=$r –threads=$t –value_size=$vs –block_size=$bs –cache_size=$cs –bloom_bits=10 –cache_numshardbits=6 –open_files=$of –verify_checksum=1 –db=/data/mysql/leveldb/test –sync=$sync –disable_wal=1 –compression_type=none –stats_interval=$si –compression_ratio=0.5 –disable_data_sync=$dds –write_buffer_size=$wbs –target_file_size_base=$mb –max_write_buffer_number=$wbn –max_background_compactions=$mbc –level0_file_num_compaction_trigger=$ctrig –level0_slowdown_writes_trigger=$delay –level0_stop_writes_trigger=$stop –num_levels=$levels –delete_obsolete_files_period_micros=$del –min_level_to_compress=$mcz –max_grandparent_overlap_factor=$overlap –stats_per_interval=1 –max_bytes_for_level_base=$bpl –use_existing_db=0 –key_size=10 + +echo “Reading 1B keys in database in random order….” +bpl=10485760;overlap=10;mcz=2;del=300000000;levels=6;ctrig=4; delay=8; stop=12; wbn=3; mbc=20; mb=67108864;wbs=134217728; dds=0; sync=0; r=1000000000; t=32; vs=20; bs=4096; cs=1048576; of=500000; si=1000000; ./db_bench –benchmarks=readrandom –disable_seek_compaction=1 –mmap_read=0 –statistics=1 –histogram=1 –num=$r –threads=$t –value_size=$vs –block_size=$bs –cache_size=$cs –bloom_bits=10 –cache_numshardbits=6 –open_files=$of –verify_checksum=1 –db=/some_data_base –sync=$sync –disable_wal=1 –compression_type=none –stats_interval=$si –compression_ratio=0.5 –disable_data_sync=$dds –write_buffer_size=$wbs –target_file_size_base=$mb –max_write_buffer_number=$wbn –max_background_compactions=$mbc –level0_file_num_compaction_trigger=$ctrig –level0_slowdown_writes_trigger=$delay –level0_stop_writes_trigger=$stop –num_levels=$levels –delete_obsolete_files_period_micros=$del –min_level_to_compress=$mcz –max_grandparent_overlap_factor=$overlap –stats_per_interval=1 –max_bytes_for_level_base=$bpl –use_existing_db=1 –key_size=10 +``` + +After running this script, there were no issues wrt to loading billion hashes , but when it came to reading part, its been almost 4 days and still I have only read 7 billion hashes and have read 200 million hashes in 2 and half days. Is there something which is missing in db_bench or something which I am missing ? + +**[Siying Dong](siying.d@fb.com)** + +It’s a printing error then. If you have `num=1000000000` and `t=32`, it will be 32 threads, and each reads 1 billion keys. diff --git a/src/rocksdb/docs/_posts/2016-01-29-compaction_pri.markdown b/src/rocksdb/docs/_posts/2016-01-29-compaction_pri.markdown new file mode 100644 index 000000000..ba9ee627c --- /dev/null +++ b/src/rocksdb/docs/_posts/2016-01-29-compaction_pri.markdown @@ -0,0 +1,51 @@ +--- +title: Option of Compaction Priority +layout: post +author: sdong +category: blog +redirect_from: + - /blog/2921/compaction_pri/ +--- + +The most popular compaction style of RocksDB is level-based compaction, which is an improved version of LevelDB's compaction algorithm. Page 9- 16 of this [slides](https://github.com/facebook/rocksdb/blob/gh-pages/talks/2015-09-29-HPTS-Siying-RocksDB.pdf) gives an illustrated introduction of this compaction style. The basic idea that: data is organized by multiple levels with exponential increasing target size. Except a special level 0, every level is key-range partitioned into many files. When size of a level exceeds its target size, we pick one or more of its files, and merge the file into the next level. + +<!--truncate--> + +Which file to pick to compact is an interesting question. LevelDB only uses one thread for compaction and it always picks files in round robin manner. We implemented multi-thread compaction in RocksDB by picking multiple files from the same level and compact them in parallel. We had to move away from LevelDB's file picking approach. Recently, we created an option [options.compaction_pri](https://github.com/facebook/rocksdb/blob/d6c838f1e130d8860407bc771fa6d4ac238859ba/include/rocksdb/options.h#L83-L93), which indicated three different algorithms to pick files to compact. + +Why do we need to multiple algorithms to choose from? Because there are different factors to consider when picking the files, and we now don't yet know how to balance them automatically, so we expose it to users to choose. Here are factors to consider: + +**Write amplification** + +When we estimate write amplification, we usually simplify the problem by assuming keys are uniformly distributed inside each level. In reality, it is not the case, even if user updates are uniformly distributed across the whole key range. For instance, when we compact one file of a level to the next level, it creates a hole. Over time, incoming compaction will fill data to the hole, but the density will still be lower for a while. Picking a file with keys least densely populated is more expensive to get the file to the next level, because there will be more overlapping files in the next level so we need to rewrite more data. For example, assume a file is 100MB, if an L2 file overlaps with 8 L3 files, we need to rewrite about 800MB of data to get the file to L3. If the file overlaps with 12 L3 files, we'll need to rewrite about 1200MB to get a file of the same size out of L2. It uses 50% more writes. (This analysis ignores the key density of the next level, because the range covers N times of files in that level so one hole only impacts write amplification by 1/N) + +If all the updates are uniformly distributed, LevelDB's approach optimizes write amplification, because a file being picked covers a range whose last compaction time to the next level is the oldest, so the range will accumulated keys from incoming compactions for the longest and the density is the highest. + +We created a compaction priority **kOldestSmallestSeqFirst** for the same effect. With this mode, we always pick the file covers the oldest updates in the level, which usually is contains the densest key range. If you have a use case where writes are uniformly distributed across the key space and you want to reduce write amplification, you should set options.compaction_pri=kOldestSmallestSeqFirst. + +**Optimize for small working set** + +We are assuming updates are uniformly distributed across the whole key space in previous analysis. However, in many use cases, there are subset of keys that are frequently updated while other key ranges are very cold. In this case, keeping hot key ranges from compacting to deeper levels will benefit write amplification, as well as space amplification. For example, if in a DB only key 150-160 are updated and other keys are seldom updated. If level 1 contains 20 keys, we want to keep 150-160 all stay in level 1. Because when next level 0 -> 1 compaction comes, it will simply overwrite existing keys so size level 1 doesn't increase, so no need to schedule further compaction for level 1->2. On the other hand, if we compact key 150-155 to level2, when a new Level 1->2 compaction comes, it increases the size of level 1, making size of level 1 exceed target size and more compactions will be needed, which generates more writes. + +The compaction priority **kOldestLargestSeqFirst** optimizes this use case. In this mode, we will pick a file whose latest update is the oldest. It means there is no incoming data for the range for the longest. Usually it is the coldest range. By compacting coldest range first, we leave the hot ranges in the level. If your use case is to overwrite existing keys in a small range, try options.compaction_pri=kOldestLargestSeqFirst**.** + +**Drop delete marker sooner** + +If one file contains a lot of delete markers, it may slow down iterating over this area, because we still need to iterate those deleted keys just to ignore them. Furthermore, the sooner we compact delete keys into the last level, the sooner the disk space is reclaimed, so it is good for space efficiency. + +Our default compaction priority **kByCompensatedSize** considers the case. If number of deletes in a file exceeds number of inserts, it is more likely to be picked for compaction. The more number of deletes exceed inserts, the more likely it is being compacted. The optimization is added to avoid the worst performance of space efficiency and query performance when a large percentage of the DB is deleted. + +**Efficiency of compaction filter** + +Usually people use [compaction filters](https://github.com/facebook/rocksdb/blob/v4.1/include/rocksdb/options.h#L201-L226) to clean up old data to free up space. Picking files to compact may impact space efficiency. We don't yet have a a compaction priority to optimize this case. In some of our use cases, we solved the problem in a different way: we have an external service checking modify time of all SST files. If any of the files is too old, we force the single file to compaction by calling DB::CompactFiles() using the single file. In this way, we can provide a time bound of data passing through compaction filters. + + +In all, there three choices of compaction priority modes optimizing different scenarios. if you have a new use case, we suggest you start with `options.compaction_pri=kOldestSmallestSeqFirst` (note it is not the default one for backward compatible reason). If you want to further optimize your use case, you can try other two use cases if your use cases apply. + +If you have good ideas about better compaction picker approach, you are welcome to implement and benchmark it. We'll be glad to review and merge your a pull requests. + +### Comments + +**[Mark Callaghan](mdcallag@gmail.com)** + +Performance results for compaction_pri values and linkbench are explained at [http://smalldatum.blogspot.com/2016/02/compaction-priority-in-rocksdb.html](http://smalldatum.blogspot.com/2016/02/compaction-priority-in-rocksdb.html) diff --git a/src/rocksdb/docs/_posts/2016-02-24-rocksdb-4-2-release.markdown b/src/rocksdb/docs/_posts/2016-02-24-rocksdb-4-2-release.markdown new file mode 100644 index 000000000..409015cc8 --- /dev/null +++ b/src/rocksdb/docs/_posts/2016-02-24-rocksdb-4-2-release.markdown @@ -0,0 +1,41 @@ +--- +title: RocksDB 4.2 Release! +layout: post +author: sdong +category: blog +redirect_from: + - /blog/3017/rocksdb-4-2-release/ +--- + +New RocksDB release - 4.2! + + +**New Features** + + 1. Introduce CreateLoggerFromOptions(), this function create a Logger for provided DBOptions. + + + 2. Add GetAggregatedIntProperty(), which returns the sum of the GetIntProperty of all the column families. + + + 3. Add MemoryUtil in rocksdb/utilities/memory.h. It currently offers a way to get the memory usage by type from a list rocksdb instances. + + +<!--truncate--> + + +**Public API changes** + + 1. CompactionFilter::Context includes information of Column Family ID + + + 2. The need-compaction hint given by TablePropertiesCollector::NeedCompact() will be persistent and recoverable after DB recovery. This introduces a breaking format change. If you use this experimental feature, including NewCompactOnDeletionCollectorFactory() in the new version, you may not be able to directly downgrade the DB back to version 4.0 or lower. + + + 3. TablePropertiesCollectorFactory::CreateTablePropertiesCollector() now takes an option Context, containing the information of column family ID for the file being written. + + + 4. Remove DefaultCompactionFilterFactory. + + +[https://github.com/facebook/rocksdb/releases/tag/v4.2](https://github.com/facebook/rocksdb/releases/tag/v4.2) diff --git a/src/rocksdb/docs/_posts/2016-02-25-rocksdb-ama.markdown b/src/rocksdb/docs/_posts/2016-02-25-rocksdb-ama.markdown new file mode 100644 index 000000000..2ba04f39a --- /dev/null +++ b/src/rocksdb/docs/_posts/2016-02-25-rocksdb-ama.markdown @@ -0,0 +1,20 @@ +--- +title: RocksDB AMA +layout: post +author: yhchiang +category: blog +redirect_from: + - /blog/3065/rocksdb-ama/ +--- + +RocksDB developers are doing a Reddit Ask-Me-Anything now at 10AM – 11AM PDT! We welcome you to stop by and ask any RocksDB related questions, including existing / upcoming features, tuning tips, or database design. + +Here are some enhancements that we'd like to focus on over the next six months: + +* 2-Phase Commit +* Lua support in some custom functions +* Backup and repair tools +* Direct I/O to bypass OS cache +* RocksDB Java API + +[https://www.reddit.com/r/IAmA/comments/47k1si/we_are_rocksdb_developers_ask_us_anything/](https://www.reddit.com/r/IAmA/comments/47k1si/we_are_rocksdb_developers_ask_us_anything/) diff --git a/src/rocksdb/docs/_posts/2016-03-07-rocksdb-options-file.markdown b/src/rocksdb/docs/_posts/2016-03-07-rocksdb-options-file.markdown new file mode 100644 index 000000000..703449b01 --- /dev/null +++ b/src/rocksdb/docs/_posts/2016-03-07-rocksdb-options-file.markdown @@ -0,0 +1,24 @@ +--- +title: RocksDB Options File +layout: post +author: yhciang +category: blog +redirect_from: + - /blog/3089/rocksdb-options-file/ +--- + +In RocksDB 4.3, we added a new set of features that makes managing RocksDB options easier. Specifically: + + * **Persisting Options Automatically**: Each RocksDB database will now automatically persist its current set of options into an INI file on every successful call of DB::Open(), SetOptions(), and CreateColumnFamily() / DropColumnFamily(). + + + + * **Load Options from File**: We added [LoadLatestOptions() / LoadOptionsFromFile()](https://github.com/facebook/rocksdb/blob/4.3.fb/include/rocksdb/utilities/options_util.h#L48-L58) that enables developers to construct RocksDB options object from an options file. + + + + * **Sanity Check Options**: We added [CheckOptionsCompatibility](https://github.com/facebook/rocksdb/blob/4.3.fb/include/rocksdb/utilities/options_util.h#L64-L77) that performs compatibility check on two sets of RocksDB options. + +<!--truncate--> + +Want to know more about how to use this new features? Check out the [RocksDB Options File wiki page](https://github.com/facebook/rocksdb/wiki/RocksDB-Options-File) and start using this new feature today! diff --git a/src/rocksdb/docs/_posts/2016-04-26-rocksdb-4-5-1-released.markdown b/src/rocksdb/docs/_posts/2016-04-26-rocksdb-4-5-1-released.markdown new file mode 100644 index 000000000..247768d30 --- /dev/null +++ b/src/rocksdb/docs/_posts/2016-04-26-rocksdb-4-5-1-released.markdown @@ -0,0 +1,60 @@ +--- +title: RocksDB 4.5.1 Released! +layout: post +author: sdong +category: blog +redirect_from: + - /blog/3179/rocksdb-4-5-1-released/ +--- + +## 4.5.1 (3/25/2016) + +### Bug Fixes + + * Fix failures caused by the destorying order of singleton objects. + +<br/> + +## 4.5.0 (2/5/2016) + +### Public API Changes + + * Add a new perf context level between kEnableCount and kEnableTime. Level 2 now does not include timers for mutexes. + * Statistics of mutex operation durations will not be measured by default. If you want to have them enabled, you need to set Statistics::stats_level_ to kAll. + * DBOptions::delete_scheduler and NewDeleteScheduler() are removed, please use DBOptions::sst_file_manager and NewSstFileManager() instead + +### New Features + * ldb tool now supports operations to non-default column families. + * Add kPersistedTier to ReadTier. This option allows Get and MultiGet to read only the persited data and skip mem-tables if writes were done with disableWAL = true. + * Add DBOptions::sst_file_manager. Use NewSstFileManager() in include/rocksdb/sst_file_manager.h to create a SstFileManager that can be used to track the total size of SST files and control the SST files deletion rate. + +<br/> + +<!--truncate--> + +## 4.4.0 (1/14/2016) + +### Public API Changes + + * Change names in CompactionPri and add a new one. + * Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit. + * If options.max_write_buffer_number > 3, writes will be slowed down when writing to the last write buffer to delay a full stop. + * Introduce CompactionJobInfo::compaction_reason, this field include the reason to trigger the compaction. + * After slow down is triggered, if estimated pending compaction bytes keep increasing, slowdown more. + * Increase default options.delayed_write_rate to 2MB/s. + * Added a new parameter --path to ldb tool. --path accepts the name of either MANIFEST, SST or a WAL file. Either --db or --path can be used when calling ldb. + +<br/> + +## 4.3.0 (12/8/2015) + +### New Features + + * CompactionFilter has new member function called IgnoreSnapshots which allows CompactionFilter to be called even if there are snapshots later than the key. + * RocksDB will now persist options under the same directory as the RocksDB database on successful DB::Open, CreateColumnFamily, DropColumnFamily, and SetOptions. + * Introduce LoadLatestOptions() in rocksdb/utilities/options_util.h. This function can construct the latest DBOptions / ColumnFamilyOptions used by the specified RocksDB intance. + * Introduce CheckOptionsCompatibility() in rocksdb/utilities/options_util.h. This function checks whether the input set of options is able to open the specified DB successfully. + +### Public API Changes + + * When options.db_write_buffer_size triggers, only the column family with the largest column family size will be flushed, not all the column families. diff --git a/src/rocksdb/docs/_posts/2016-07-26-rocksdb-4-8-released.markdown b/src/rocksdb/docs/_posts/2016-07-26-rocksdb-4-8-released.markdown new file mode 100644 index 000000000..0db275ddf --- /dev/null +++ b/src/rocksdb/docs/_posts/2016-07-26-rocksdb-4-8-released.markdown @@ -0,0 +1,48 @@ +--- +title: RocksDB 4.8 Released! +layout: post +author: yiwu +category: blog +redirect_from: + - /blog/3239/rocksdb-4-8-released/ +--- + +## 4.8.0 (5/2/2016) + +### [](https://github.com/facebook/rocksdb/blob/main/HISTORY.md#public-api-change-1)Public API Change + + * Allow preset compression dictionary for improved compression of block-based tables. This is supported for zlib, zstd, and lz4. The compression dictionary's size is configurable via CompressionOptions::max_dict_bytes. + * Delete deprecated classes for creating backups (BackupableDB) and restoring from backups (RestoreBackupableDB). Now, BackupEngine should be used for creating backups, and BackupEngineReadOnly should be used for restorations. For more details, see [https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F](https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F) + * Expose estimate of per-level compression ratio via DB property: "rocksdb.compression-ratio-at-levelN". + * Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status. + +### [](https://github.com/facebook/rocksdb/blob/main/HISTORY.md#new-features-2)New Features + + * Add ReadOptions::readahead_size. If non-zero, NewIterator will create a new table reader which performs reads of the given size. + +<br/> + +<!--truncate--> + +## [](https://github.com/facebook/rocksdb/blob/main/HISTORY.md#470-482016)4.7.0 (4/8/2016) + +### [](https://github.com/facebook/rocksdb/blob/main/HISTORY.md#public-api-change-2)Public API Change + + * rename options compaction_measure_io_stats to report_bg_io_stats and include flush too. + * Change some default options. Now default options will optimize for server-workloads. Also enable slowdown and full stop triggers for pending compaction bytes. These changes may cause sub-optimal performance or significant increase of resource usage. To avoid these risks, users can open existing RocksDB with options extracted from RocksDB option files. See [https://github.com/facebook/rocksdb/wiki/RocksDB-Options-File](https://github.com/facebook/rocksdb/wiki/RocksDB-Options-File) for how to use RocksDB option files. Or you can call Options.OldDefaults() to recover old defaults. DEFAULT_OPTIONS_HISTORY.md will track change history of default options. + +<br/> + +## [](https://github.com/facebook/rocksdb/blob/main/HISTORY.md#460-3102016)4.6.0 (3/10/2016) + +### [](https://github.com/facebook/rocksdb/blob/main/HISTORY.md#public-api-changes-1)Public API Changes + + * Change default of BlockBasedTableOptions.format_version to 2. It means default DB created by 4.6 or up cannot be opened by RocksDB version 3.9 or earlier + * Added strict_capacity_limit option to NewLRUCache. If the flag is set to true, insert to cache will fail if no enough capacity can be free. Signature of Cache::Insert() is updated accordingly. + * Tickers [NUMBER_DB_NEXT, NUMBER_DB_PREV, NUMBER_DB_NEXT_FOUND, NUMBER_DB_PREV_FOUND, ITER_BYTES_READ] are not updated immediately. The are updated when the Iterator is deleted. + * Add monotonically increasing counter (DB property "rocksdb.current-super-version-number") that increments upon any change to the LSM tree. + +### [](https://github.com/facebook/rocksdb/blob/main/HISTORY.md#new-features-3)New Features + + * Add CompactionPri::kMinOverlappingRatio, a compaction picking mode friendly to write amplification. + * Deprecate Iterator::IsKeyPinned() and replace it with Iterator::GetProperty() with prop_name="rocksdb.iterator.is.key.pinned" diff --git a/src/rocksdb/docs/_posts/2016-09-28-rocksdb-4-11-2-released.markdown b/src/rocksdb/docs/_posts/2016-09-28-rocksdb-4-11-2-released.markdown new file mode 100644 index 000000000..87c20eb47 --- /dev/null +++ b/src/rocksdb/docs/_posts/2016-09-28-rocksdb-4-11-2-released.markdown @@ -0,0 +1,49 @@ +--- +title: RocksDB 4.11.2 Released! +layout: post +author: sdong +category: blog +--- +We abandoned release candidates 4.10.x and directly go to 4.11.2 from 4.9, to make sure the latest release is stable. In 4.11.2, we fixed several data corruption related bugs introduced in 4.9.0. + +## 4.11.2 (9/15/2016) + +### Bug fixes + + * Segfault when failing to open an SST file for read-ahead iterators. + * WAL without data for all CFs is not deleted after recovery. + +<!--truncate--> + +## 4.11.1 (8/30/2016) + +### Bug Fixes + + * Mitigate the regression bug of deadlock condition during recovery when options.max_successive_merges hits. + * Fix data race condition related to hash index in block based table when putting indexes in the block cache. + +## 4.11.0 (8/1/2016) + +### Public API Change + + * options.memtable_prefix_bloom_huge_page_tlb_size => memtable_huge_page_size. When it is set, RocksDB will try to allocate memory from huge page for memtable too, rather than just memtable bloom filter. + +### New Features + + * A tool to migrate DB after options change. See include/rocksdb/utilities/option_change_migration.h. + * Add ReadOptions.background_purge_on_iterator_cleanup. If true, we avoid file deletion when destorying iterators. + +## 4.10.0 (7/5/2016) + +### Public API Change + + * options.memtable_prefix_bloom_bits changes to options.memtable_prefix_bloom_bits_ratio and deprecate options.memtable_prefix_bloom_probes + * enum type CompressionType and PerfLevel changes from char to unsigned char. Value of all PerfLevel shift by one. + * Deprecate options.filter_deletes. + +### New Features + + * Add avoid_flush_during_recovery option. + * Add a read option background_purge_on_iterator_cleanup to avoid deleting files in foreground when destroying iterators. Instead, a job is scheduled in high priority queue and would be executed in a separate background thread. + * RepairDB support for column families. RepairDB now associates data with non-default column families using information embedded in the SST/WAL files (4.7 or later). For data written by 4.6 or earlier, RepairDB associates it with the default column family. + * Add options.write_buffer_manager which allows users to control total memtable sizes across multiple DB instances. diff --git a/src/rocksdb/docs/_posts/2017-01-06-rocksdb-5-0-1-released.markdown b/src/rocksdb/docs/_posts/2017-01-06-rocksdb-5-0-1-released.markdown new file mode 100644 index 000000000..fb0413055 --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-01-06-rocksdb-5-0-1-released.markdown @@ -0,0 +1,26 @@ +--- +title: RocksDB 5.0.1 Released! +layout: post +author: yiwu +category: blog +--- + +### Public API Change + + * Options::max_bytes_for_level_multiplier is now a double along with all getters and setters. + * Support dynamically change `delayed_write_rate` and `max_total_wal_size` options via SetDBOptions(). + * Introduce DB::DeleteRange for optimized deletion of large ranges of contiguous keys. + * Support dynamically change `delayed_write_rate` option via SetDBOptions(). + * Options::allow_concurrent_memtable_write and Options::enable_write_thread_adaptive_yield are now true by default. + * Remove Tickers::SEQUENCE_NUMBER to avoid confusion if statistics object is shared among RocksDB instance. Alternatively DB::GetLatestSequenceNumber() can be used to get the same value. + * Options.level0_stop_writes_trigger default value changes from 24 to 32. + * New compaction filter API: CompactionFilter::FilterV2(). Allows to drop ranges of keys. + * Removed flashcache support. + * DB::AddFile() is deprecated and is replaced with DB::IngestExternalFile(). DB::IngestExternalFile() remove all the restrictions that existed for DB::AddFile. + +### New Features + + * Add avoid_flush_during_shutdown option, which speeds up DB shutdown by not flushing unpersisted data (i.e. with disableWAL = true). Unpersisted data will be lost. The options is dynamically changeable via SetDBOptions(). + * Add memtable_insert_with_hint_prefix_extractor option. The option is mean to reduce CPU usage for inserting keys into memtable, if keys can be group by prefix and insert for each prefix are sequential or almost sequential. See include/rocksdb/options.h for more details. + * Add LuaCompactionFilter in utilities. This allows developers to write compaction filters in Lua. To use this feature, LUA_PATH needs to be set to the root directory of Lua. + * No longer populate "LATEST_BACKUP" file in backup directory, which formerly contained the number of the latest backup. The latest backup can be determined by finding the highest numbered file in the "meta/" subdirectory. diff --git a/src/rocksdb/docs/_posts/2017-02-07-rocksdb-5-1-2-released.markdown b/src/rocksdb/docs/_posts/2017-02-07-rocksdb-5-1-2-released.markdown new file mode 100644 index 000000000..35bafb219 --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-02-07-rocksdb-5-1-2-released.markdown @@ -0,0 +1,15 @@ +--- +title: RocksDB 5.1.2 Released! +layout: post +author: maysamyabandeh +category: blog +--- + +### Public API Change +* Support dynamically change `delete_obsolete_files_period_micros` option via SetDBOptions(). +* Added EventListener::OnExternalFileIngested which will be called when IngestExternalFile() add a file successfully. +* BackupEngine::Open and BackupEngineReadOnly::Open now always return error statuses matching those of the backup Env. + +### Bug Fixes +* Fix the bug that if 2PC is enabled, checkpoints may loss some recent transactions. +* When file copying is needed when creating checkpoints or bulk loading files, fsync the file after the file copying. diff --git a/src/rocksdb/docs/_posts/2017-02-17-bulkoad-ingest-sst-file.markdown b/src/rocksdb/docs/_posts/2017-02-17-bulkoad-ingest-sst-file.markdown new file mode 100644 index 000000000..9a43a846a --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-02-17-bulkoad-ingest-sst-file.markdown @@ -0,0 +1,50 @@ +--- +title: Bulkloading by ingesting external SST files +layout: post +author: IslamAbdelRahman +category: blog +--- + +## Introduction + +One of the basic operations of RocksDB is writing to RocksDB, Writes happen when user call (DB::Put, DB::Write, DB::Delete ... ), but what happens when you write to RocksDB ? .. this is a brief description of what happens. +- User insert a new key/value by calling DB::Put() (or DB::Write()) +- We create a new entry for the new key/value in our in-memory structure (memtable / SkipList by default) and we assign it a new sequence number. +- When the memtable exceeds a specific size (64 MB for example), we convert this memtable to a SST file, and put this file in level 0 of our LSM-Tree +- Later, compaction will kick in and move data from level 0 to level 1, and then from level 1 to level 2 .. and so on + +But what if we can skip these steps and add data to the lowest possible level directly ? This is what bulk-loading does + +## Bulkloading + +- Write all of our keys and values into SST file outside of the DB +- Add the SST file into the LSM directly + +This is bulk-loading, and in specific use-cases it allow users to achieve faster data loading and better write-amplification. + +and doing it is as simple as +```cpp +Options options; +SstFileWriter sst_file_writer(EnvOptions(), options, options.comparator); +Status s = sst_file_writer.Open(file_path); +assert(s.ok()); + +// Insert rows into the SST file, note that inserted keys must be +// strictly increasing (based on options.comparator) +for (...) { + s = sst_file_writer.Add(key, value); + assert(s.ok()); +} + +// Ingest the external SST file into the DB +s = db_->IngestExternalFile({"/home/usr/file1.sst"}, IngestExternalFileOptions()); +assert(s.ok()); +``` + +You can find more details about how to generate SST files and ingesting them into RocksDB in this [wiki page](https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files) + +## Use cases +There are multiple use cases where bulkloading could be useful, for example +- Generating SST files in offline jobs in Hadoop, then downloading and ingesting the SST files into RocksDB +- Migrating shards between machines by dumping key-range in SST File and loading the file in a different machine +- Migrating from a different storage (InnoDB to RocksDB migration in MyRocks) diff --git a/src/rocksdb/docs/_posts/2017-03-02-rocksdb-5-2-1-released.markdown b/src/rocksdb/docs/_posts/2017-03-02-rocksdb-5-2-1-released.markdown new file mode 100644 index 000000000..c6ce27d64 --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-03-02-rocksdb-5-2-1-released.markdown @@ -0,0 +1,22 @@ +--- +title: RocksDB 5.2.1 Released! +layout: post +author: sdong +category: blog +--- + +### Public API Change +* NewLRUCache() will determine number of shard bits automatically based on capacity, if the user doesn't pass one. This also impacts the default block cache when the user doesn't explict provide one. +* Change the default of delayed slowdown value to 16MB/s and further increase the L0 stop condition to 36 files. + +### New Features +* Added new overloaded function GetApproximateSizes that allows to specify if memtable stats should be computed only without computing SST files' stats approximations. +* Added new function GetApproximateMemTableStats that approximates both number of records and size of memtables. +* (Experimental) Two-level indexing that partition the index and creates a 2nd level index on the partitions. The feature can be enabled by setting kTwoLevelIndexSearch as IndexType and configuring index_per_partition. + +### Bug Fixes +* RangeSync() should work if ROCKSDB_FALLOCATE_PRESENT is not set +* Fix wrong results in a data race case in Get() +* Some fixes related to 2PC. +* Fix several bugs in Direct I/O supports. +* Fix a regression bug which can cause Seek() to miss some keys if the return key has been updated many times after the snapshot which is used by the iterator. diff --git a/src/rocksdb/docs/_posts/2017-05-12-partitioned-index-filter.markdown b/src/rocksdb/docs/_posts/2017-05-12-partitioned-index-filter.markdown new file mode 100644 index 000000000..a537feb0c --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-05-12-partitioned-index-filter.markdown @@ -0,0 +1,34 @@ +--- +title: Partitioned Index/Filters +layout: post +author: maysamyabandeh +category: blog +--- + +As DB/mem ratio gets larger, the memory footprint of filter/index blocks becomes non-trivial. Although `cache_index_and_filter_blocks` allows storing only a subset of them in block cache, their relatively large size negatively affects the performance by i) occupying the block cache space that could otherwise be used for caching data, ii) increasing the load on the disk storage by loading them into the cache after a miss. Here we illustrate these problems in more detail and explain how partitioning index/filters alleviates the overhead. + +### How large are the index/filter blocks? + +RocksDB has by default one index/filter block per SST file. The size of the index/filter varies based on the configuration but for a SST of size 256MB the index/filter block of size 0.5/5MB is typical, which is much larger than the typical data block size of 4-32KB. That is fine when all index/filters fit perfectly into memory and hence are read once per SST lifetime, not so much when they compete with data blocks for the block cache space and are also likely to be re-read many times from the disk. + +### What is the big deal with large index/filter blocks? + +When index/filter blocks are stored in block cache they are effectively competing with data blocks (as well as with each other) on this scarce resource. A filter of size 5MB is occupying the space that could otherwise be used to cache 1000s of data blocks (of size 4KB). This would result in more cache misses for data blocks. The large index/filters also kick each other out of the block cache more often and exacerbate their own cache miss rate too. This is while only a small part of the index/filter block might have been actually used during its lifetime in the cache. + +After the cache miss of an index/filter, it has to be reloaded from the disk, and its large size is not helping in reducing the IO cost. While a simple point lookup might need at most a couple of data block reads (of size 4KB) one from each layer of LSM, it might end up also loading multiple megabytes of index/filter blocks. If that happens often then the disk is spending more time serving index/filters rather than the actual data blocks. + +## What is partitioned index/filters? + +With partitioning, the index/filter of a SST file is partitioned into smaller blocks with an additional top-level index on them. When reading an index/filter, only top-level index is loaded into memory. The partitioned index/filter then uses the top-level index to load on demand into the block cache the partitions that are required to perform the index/filter query. The top-level index, which has much smaller memory footprint, can be stored in heap or block cache depending on the `cache_index_and_filter_blocks` setting. + +### Success stories + +#### HDD, 100TB DB + +In this example we have a DB of size 86G on HDD and emulate the small memory that is present to a node with 100TB of data by using direct IO (skipping OS file cache) and a very small block cache of size 60MB. Partitioning improves throughput by 11x from 5 op/s to 55 op/s. + +#### SSD, Linkbench + +In this example we have a DB of size 300G on SSD and emulate the small memory that would be available in presence of other DBs on the same node by by using direct IO (skipping OS file cache) and block cache of size 6G and 2G. Without partitioning the linkbench throughput drops from 38k tps to 23k when reducing block cache size from 6G to 2G. With partitioning the throughput drops from 38k to only 30k. + +Learn more [here](https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters). diff --git a/src/rocksdb/docs/_posts/2017-05-14-core-local-stats.markdown b/src/rocksdb/docs/_posts/2017-05-14-core-local-stats.markdown new file mode 100644 index 000000000..a806541fc --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-05-14-core-local-stats.markdown @@ -0,0 +1,106 @@ +--- +title: Core-local Statistics +layout: post +author: ajkr +category: blog +--- + +## Origins: Global Atomics + +Until RocksDB 4.12, ticker/histogram statistics were implemented with std::atomic values shared across the entire program. A ticker consists of a single atomic, while a histogram consists of several atomics to represent things like min/max/per-bucket counters. These statistics could be updated by all user/background threads. + +For concurrent/high-throughput workloads, cache line bouncing of atomics caused high CPU utilization. For example, we have tickers that count block cache hits and misses. Almost every user read increments these tickers a few times. Many concurrent user reads would cause the cache lines containing these atomics to bounce between cores. + +### Performance + +Here are perf results for 32 reader threads where most reads (99%+) are served by uncompressed block cache. Such a scenario stresses the statistics code heavily. + +Benchmark command: `TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench -statistics -use_existing_db=true -benchmarks=readrandom -threads=32 -cache_size=1048576000 -num=1000000 -reads=1000000 && perf report -g --children` + +Perf snippet for "cycles" event: + +``` + Children Self Command Shared Object Symbol ++ 30.33% 30.17% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick ++ 3.65% 0.98% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime +``` + +Perf snippet for "cache-misses" event: + +``` + Children Self Command Shared Object Symbol ++ 19.54% 19.50% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick ++ 3.44% 0.57% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime +``` + +The high CPU overhead for updating tickers and histograms corresponds well to the high cache misses. + +## Thread-locals: Faster Updates + +Since RocksDB 4.12, ticker/histogram statistics use thread-local storage. Each thread has a local set of atomic values that no other thread can update. This prevents the cache line bouncing problem described above. Even though updates to a given value are always made by the same thread, atomics are still useful to synchronize with aggregations for querying statistics. + +Implementing this approach involved a couple challenges. First, each query for a statistic's global value must aggregate all threads' local values. This adds some overhead, which may pass unnoticed if statistics are queried infrequently. Second, exited threads' local values are still needed to provide accurate statistics. We handle this by merging a thread's local values into process-wide variables upon thread exit. + +### Performance + +Update benchmark setup is same as before. CPU overhead improved 7.8x compared to global atomics, corresponding to a 17.8x reduction in cache-misses overhead. + +Perf snippet for "cycles" event: + +``` + Children Self Command Shared Object Symbol ++ 2.96% 0.87% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick ++ 1.37% 0.10% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime +``` + +Perf snippet for "cache-misses" event: + +``` + Children Self Command Shared Object Symbol ++ 1.21% 0.65% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick + 0.08% 0.00% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime +``` + +To measure statistics query latency, we ran sysbench with 4K OLTP clients concurrently with one client that queries statistics repeatedly. Times shown are in milliseconds. + +``` + min: 18.45 + avg: 27.91 + max: 231.65 + 95th percentile: 55.82 +``` + +## Core-locals: Faster Querying + +The thread-local approach is working well for applications calling RocksDB from only a few threads, or polling statistics infrequently. Eventually, though, we found use cases where those assumptions do not hold. For example, one application has per-connection threads and typically runs into performance issues when connection count grows very high. For debugging such issues, they want high-frequency statistics polling to correlate issues in their application with changes in RocksDB's state. + +Once [PR #2258](https://github.com/facebook/rocksdb/pull/2258) lands, ticker/histogram statistics will be local to each CPU core. Similarly to thread-local, each core updates only its local values, thus avoiding cache line bouncing. Local values are still atomics to make aggregation possible. With this change, query work depends only on number of cores, not the number of threads. So, applications with many more threads than cores can no longer impact statistics query latency. + +### Performance + +Update benchmark setup is same as before. CPU overhead worsened ~23% compared to thread-local, while cache performance was unchanged. + +Perf snippet for "cycles" event: + +``` + Children Self Command Shared Object Symbol ++ 2.96% 0.87% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick ++ 1.37% 0.10% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime +``` + +Perf snippet for "cache-misses" event: + +``` + Children Self Command Shared Object Symbol ++ 1.21% 0.65% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick + 0.08% 0.00% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime +``` + +Query latency is measured same as before with times in milliseconds. Average latency improved by 6.3x compared to thread-local. + +``` + min: 2.47 + avg: 4.45 + max: 91.13 + 95th percentile: 7.56 +``` diff --git a/src/rocksdb/docs/_posts/2017-05-26-rocksdb-5-4-5-released.markdown b/src/rocksdb/docs/_posts/2017-05-26-rocksdb-5-4-5-released.markdown new file mode 100644 index 000000000..561dab4c2 --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-05-26-rocksdb-5-4-5-released.markdown @@ -0,0 +1,39 @@ +--- +title: RocksDB 5.4.5 Released! +layout: post +author: sagar0 +category: blog +--- + +### Public API Change +* Support dynamically changing `stats_dump_period_sec` option via SetDBOptions(). +* Added ReadOptions::max_skippable_internal_keys to set a threshold to fail a request as incomplete when too many keys are being skipped while using iterators. +* DB::Get in place of std::string accepts PinnableSlice, which avoids the extra memcpy of value to std::string in most of cases. + * PinnableSlice releases the pinned resources that contain the value when it is destructed or when ::Reset() is called on it. + * The old API that accepts std::string, although discouraged, is still supported. +* Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction. See Direct IO wiki for details. + +### New Features +* Memtable flush can be avoided during checkpoint creation if total log file size is smaller than a threshold specified by the user. +* Introduce level-based L0->L0 compactions to reduce file count, so write delays are incurred less often. +* (Experimental) Partitioning filters which creates an index on the partitions. The feature can be enabled by setting partition_filters when using kFullFilter. Currently the feature also requires two-level indexing to be enabled. Number of partitions is the same as the number of partitions for indexes, which is controlled by metadata_block_size. +* DB::ResetStats() to reset internal stats. +* Added CompactionEventListener and EventListener::OnFlushBegin interfaces. +* Added DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. +* Facility for cross-building RocksJava using Docker. + +### Bug Fixes +* Fix WriteBatchWithIndex address use after scope error. +* Fix WritableFile buffer size in direct IO. +* Add prefetch to PosixRandomAccessFile in buffered io. +* Fix PinnableSlice access invalid address when row cache is enabled. +* Fix huge fallocate calls fail and make XFS unhappy. +* Fix memory alignment with logical sector size. +* Fix alignment in ReadaheadRandomAccessFile. +* Fix bias with read amplification stats (READ_AMP_ESTIMATE_USEFUL_BYTES and READ_AMP_TOTAL_READ_BYTES). +* Fix a manual / auto compaction data race. +* Fix CentOS 5 cross-building of RocksJava. +* Build and link with ZStd when creating the static RocksJava build. +* Fix snprintf's usage to be cross-platform. +* Fix build errors with blob DB. +* Fix readamp test type inconsistency. diff --git a/src/rocksdb/docs/_posts/2017-06-26-17-level-based-changes.markdown b/src/rocksdb/docs/_posts/2017-06-26-17-level-based-changes.markdown new file mode 100644 index 000000000..9e838eb7f --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-06-26-17-level-based-changes.markdown @@ -0,0 +1,60 @@ +--- +title: Level-based Compaction Changes +layout: post +author: ajkr +category: blog +--- + +### Introduction + +RocksDB provides an option to limit the number of L0 files, which bounds read-amplification. Since L0 files (unlike files at lower levels) can span the entire key-range, a key might be in any file, thus reads need to check them one-by-one. Users often wish to configure a low limit to improve their read latency. + +Although, the mechanism with which we enforce L0's file count limit may be unappealing. When the limit is reached, RocksDB intentionally delays user writes. This slows down accumulation of files in L0, and frees up resources for compacting files down to lower levels. But adding delays will significantly increase user-visible write latency jitter. + +Also, due to how L0 files can span the entire key-range, compaction parallelization is limited. Files at L0 or L1 may be locked due to involvement in pending L0->L1 or L1->L2 compactions. We can only schedule a parallel L0->L1 compaction if it does not require any of the locked files, which is typically not the case. + +To handle these constraints better, we added a new type of compaction, L0->L0. It quickly reduces file count in L0 and can be scheduled even when L1 files are locked, unlike L0->L1. We also changed the L0->L1 picking algorithm to increase opportunities for parallelism. + +### Old L0->L1 Picking Logic + +Previously, our logic for picking which L0 file to compact was the same as every other level: pick the largest file in the level. One special property of L0->L1 compaction is that files can overlap in the input level, so those overlapping files must be pulled in as well. For example, a compaction may look like this: + +![full-range.png](/static/images/compaction/full-range.png) + +This compaction pulls in every L0 and L1 file. This happens regardless of which L0 file is initially chosen as each file overlaps with every other file. + +Users may insert their data less uniformly in the key-range. For example, a database may look like this during L0->L1 compaction: + +![part-range-old.png](/static/images/compaction/part-range-old.png) + +Let's say the third file from the top is the largest, and let's say the top two files are created after the compaction started. When the compaction is picked, the fourth L0 file and six rightmost L1 files are pulled in due to overlap. Notice this leaves the database in a state where we might not be able to schedule parallel compactions. For example, if the sixth file from the top is the next largest, we can't compact it because it overlaps with the top two files, which overlap with the locked L0 files. + +We can now see the high-level problems with this approach more clearly. First, locked files in L0 or L1 prevent us from parallelizing compactions. When locked files block L0->L1 compaction, there is nothing we can do to eliminate L0 files. Second, L0->L1 compactions are relatively slow. As we saw, when keys are uniformly distributed, L0->L1 compacts two entire levels. While this is happening, new files are being flushed to L0, advancing towards the file count limit. + +### New L0->L0 Algorithm + +We introduced compaction within L0 to improve both parallelization and speed of reducing L0 file count. An L0->L0 compaction may look like this: + +![l1-l2-contend.png](/static/images/compaction/l1-l2-contend.png) + +Say the L1->L2 compaction started first. Now L0->L1 is prevented by the locked L1 file. In this case, we compact files within L0. This allows us to start the work for eliminating L0 files earlier. It also lets us do less work since we don't pull in any L1 files, whereas L0->L1 compaction would've pulled in all of them. This lets us quickly reduce L0 file count to keep read-amp low while sustaining large bursts of writes (i.e., fast accumulation of L0 files). + +The tradeoff is this increases total compaction work, as we're now compacting files without contributing towards our eventual goal of moving them towards lower levels. Our benchmarks, though, consistently show less compaction stalls and improved write throughput. One justification is that L0 file data is highly likely in page cache and/or block cache due to it being recently written and frequently accessed. So, this type of compaction is relatively cheap compared to compactions at lower levels. + +This feature is available since RocksDB 5.4. + +### New L0->L1 Picking Logic + +Recall how the old L0->L1 picking algorithm chose the largest L0 file for compaction. This didn't fit well with L0->L0 compaction, which operates on a span of files. That span begins at the newest L0 file, and expands towards older files as long as they're not being compacted. Since the largest file may be anywhere, the old L0->L1 picking logic could arbitrarily prevent us from getting a long span of files. See the second illustration in this post for a scenario where this would happen. + +So, we changed the L0->L1 picking algorithm to start from the oldest file and expand towards newer files as long as they're not being compacted. For example: + +![l0-l1-contend.png](/static/images/compaction/l0-l1-contend.png) + +Now, there can never be L0 files unreachable for L0->L0 due to L0->L1 selecting files in the middle. When longer spans of files are available for L0->L0, we perform less compaction work per deleted L0 file, thus improving efficiency. + +This feature will be available in RocksDB 5.7. + +### Performance Changes + +Mark Callaghan did the most extensive benchmarking of this feature's impact on MyRocks. See his results [here](http://smalldatum.blogspot.com/2017/05/innodb-myrocks-and-tokudb-on-insert.html). Note the primary change between his March 17 and April 14 builds is the latter performs L0->L0 compaction. diff --git a/src/rocksdb/docs/_posts/2017-06-29-rocksdb-5-5-1-released.markdown b/src/rocksdb/docs/_posts/2017-06-29-rocksdb-5-5-1-released.markdown new file mode 100644 index 000000000..d7856088b --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-06-29-rocksdb-5-5-1-released.markdown @@ -0,0 +1,22 @@ +--- +title: RocksDB 5.5.1 Released! +layout: post +author: lightmark +category: blog +--- + +### New Features +* FIFO compaction to support Intra L0 compaction too with CompactionOptionsFIFO.allow_compaction=true. +* Statistics::Reset() to reset user stats. +* ldb add option --try_load_options, which will open DB with its own option file. +* Introduce WriteBatch::PopSavePoint to pop the most recent save point explicitly. +* Support dynamically change `max_open_files` option via SetDBOptions() +* Added DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. +* Add debugging function `GetAllKeyVersions` to see internal versions of a range of keys. +* Support file ingestion with universal compaction style +* Support file ingestion behind with option `allow_ingest_behind` +* New option enable_pipelined_write which may improve write throughput in case writing from multiple threads and WAL enabled. + +### Bug Fixes +* Fix the bug that Direct I/O uses direct reads for non-SST file +* Fix the bug that flush doesn't respond to fsync result diff --git a/src/rocksdb/docs/_posts/2017-07-25-rocksdb-5-6-1-released.markdown b/src/rocksdb/docs/_posts/2017-07-25-rocksdb-5-6-1-released.markdown new file mode 100644 index 000000000..3b54ffd5a --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-07-25-rocksdb-5-6-1-released.markdown @@ -0,0 +1,22 @@ +--- +title: RocksDB 5.6.1 Released! +layout: post +author: yiwu +category: blog +--- + +### Public API Change +* Scheduling flushes and compactions in the same thread pool is no longer supported by setting `max_background_flushes=0`. Instead, users can achieve this by configuring their high-pri thread pool to have zero threads. See https://github.com/facebook/rocksdb/wiki/Thread-Pool for more details. +* Replace `Options::max_background_flushes`, `Options::max_background_compactions`, and `Options::base_background_compactions` all with `Options::max_background_jobs`, which automatically decides how many threads to allocate towards flush/compaction. +* options.delayed_write_rate by default take the value of options.rate_limiter rate. +* Replace global variable `IOStatsContext iostats_context` with `IOStatsContext* get_iostats_context()`; replace global variable `PerfContext perf_context` with `PerfContext* get_perf_context()`. + +### New Features +* Change ticker/histogram statistics implementations to use core-local storage. This improves aggregation speed compared to our previous thread-local approach, particularly for applications with many threads. See http://rocksdb.org/blog/2017/05/14/core-local-stats.html for more details. +* Users can pass a cache object to write buffer manager, so that they can cap memory usage for memtable and block cache using one single limit. +* Flush will be triggered when 7/8 of the limit introduced by write_buffer_manager or db_write_buffer_size is triggered, so that the hard threshold is hard to hit. See https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager for more details. +* Introduce WriteOptions.low_pri. If it is true, low priority writes will be throttled if the compaction is behind. See https://github.com/facebook/rocksdb/wiki/Low-Priority-Write for more details. +* `DB::IngestExternalFile()` now supports ingesting files into a database containing range deletions. + +### Bug Fixes +* Shouldn't ignore return value of fsync() in flush. diff --git a/src/rocksdb/docs/_posts/2017-08-24-pinnableslice.markdown b/src/rocksdb/docs/_posts/2017-08-24-pinnableslice.markdown new file mode 100644 index 000000000..06e0bcb2f --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-08-24-pinnableslice.markdown @@ -0,0 +1,37 @@ +--- +title: PinnableSlice; less memcpy with point lookups +layout: post +author: maysamyabandeh +category: blog +--- + +The classic API for [DB::Get](https://github.com/facebook/rocksdb/blob/9e583711144f580390ce21a49a8ceacca338fcd5/include/rocksdb/db.h#L310) receives a std::string as argument to which it will copy the value. The memcpy overhead could be non-trivial when the value is large. The [new API](https://github.com/facebook/rocksdb/blob/9e583711144f580390ce21a49a8ceacca338fcd5/include/rocksdb/db.h#L322) receives a PinnableSlice instead, which avoids memcpy in most of the cases. + +### What is PinnableSlice? + +Similarly to Slice, PinnableSlice refers to some in-memory data so it does not incur the memcpy cost. To ensure that the data will not be erased while it is being processed by the user, PinnableSlice, as its name suggests, has the data pinned in memory. The pinned data are released when PinnableSlice object is destructed or when ::Reset is invoked explicitly on it. + +### How good is it? + +Here are the improvements in throughput for an [in-memory benchmark](https://github.com/facebook/rocksdb/pull/1756#issuecomment-286201693): +* value 1k byte: 14% +* value 10k byte: 34% + +### Any limitations? + +PinnableSlice tries to avoid memcpy as much as possible. The primary gain is when reading large values from the block cache. There are however cases that it would still have to copy the data into its internal buffer. The reason is mainly the complexity of implementation and if there is enough motivation on the application side. the scope of PinnableSlice could be extended to such cases too. These include: +* Merged values +* Reads from memtables + +### How to use it? + +```cpp +PinnableSlice pinnable_val; +while (!stopped) { + auto s = db->Get(opt, cf, key, &pinnable_val); + // ... use it + pinnable_val.Reset(); // then release it immediately +} +``` + +You can also [initialize the internal buffer](https://github.com/facebook/rocksdb/blob/9e583711144f580390ce21a49a8ceacca338fcd5/include/rocksdb/db.h#L314) of PinnableSlice by passing your own string in the constructor. [simple_example.cc](https://github.com/facebook/rocksdb/blob/main/examples/simple_example.cc) demonstrates that with more examples. diff --git a/src/rocksdb/docs/_posts/2017-08-25-flushwal.markdown b/src/rocksdb/docs/_posts/2017-08-25-flushwal.markdown new file mode 100644 index 000000000..751fe5249 --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-08-25-flushwal.markdown @@ -0,0 +1,26 @@ +--- +title: FlushWAL; less fwrite, faster writes +layout: post +author: maysamyabandeh +category: blog +--- + +When `DB::Put` is called, the data is written to both memtable (to be flushed to SST files later) and the WAL (write-ahead log) if it is enabled. In the case of a crash, RocksDB can recover as much as the memtable state that is reflected into the WAL. By default RocksDB automatically flushes the WAL from the application memory to the OS buffer after each `::Put`. It however can be configured to perform the flush manually after an explicit call to `::FlushWAL`. Not doing fwrite syscall after each `::Put` offers a tradeoff between reliability and write latency for the general case. As we explain below, some applications such as MyRocks benefit from this API to gain higher write throughput with however no compromise in reliability. + +### How much is the gain? + +Using `::FlushWAL` API along with setting `DBOptions.concurrent_prepare`, MyRocks achieves 40% higher throughput in Sysbench's [update-nonindex](https://github.com/akopytov/sysbench/blob/master/src/lua/oltp_update_non_index.lua) benchmark. + +### Write, Flush, and Sync + +The write to the WAL is first written to the application memory buffer. The buffer in the next step is "flushed" to OS buffer by calling fwrite syscall. The OS buffer is later "synced" to the persistent storage. The data in the OS buffer, although not persisted yet, will survive the application crash. By default, the flush occurs automatically upon each call to `DB::Put` or `DB::Write`. The user can additionally request sync after each write by setting `WriteOptions::sync`. + +### FlushWAL API + +The user can turn off the automatic flush of the WAL by setting `DBOptions::manual_wal_flush`. In that case, the WAL buffer is flushed when it is either full or `DB::FlushWAL` is called by the user. The API also accepts a boolean argument should we want to sync right after the flush: `::FlushWAL(true)`. + +### Success story: MyRocks + +Some applications that use RocksDB, already have other machinsims in place to provide reliability. MySQL for example uses 2PC (two-phase commit) to write to both binlog as well as the storage engine such as InnoDB and MyRocks. The group commit logic in MySQL allows the 1st phase (Prepare) to be run in parallel but after a commit group is formed performs the 2nd phase (Commit) in a serial manner. This makes low commit latency in the storage engine essential for achieving high throughput. The commit in MyRocks includes writing to the RocksDB WAL, which as explaiend above, by default incures the latency of flushing the WAL new appends to the OS buffer. + +Since binlog helps in recovering from some failure scenarios, MySQL can provide reliability without however needing a storage WAL flush after each individual commit. MyRocks benefits from this property, disables automatic WAL flush in RocksDB, and manually calls `::FlushWAL` when requested by MySQL. diff --git a/src/rocksdb/docs/_posts/2017-09-28-rocksdb-5-8-released.markdown b/src/rocksdb/docs/_posts/2017-09-28-rocksdb-5-8-released.markdown new file mode 100644 index 000000000..a22dcaa1c --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-09-28-rocksdb-5-8-released.markdown @@ -0,0 +1,25 @@ +--- +title: RocksDB 5.8 Released! +layout: post +author: maysamyabandeh +category: blog +--- + +### Public API Change +* Users of `Statistics::getHistogramString()` will see fewer histogram buckets and different bucket endpoints. +* `Slice::compare` and BytewiseComparator `Compare` no longer accept `Slice`s containing nullptr. +* `Transaction::Get` and `Transaction::GetForUpdate` variants with `PinnableSlice` added. + +### New Features +* Add Iterator::Refresh(), which allows users to update the iterator state so that they can avoid some initialization costs of recreating iterators. +* Replace dynamic_cast<> (except unit test) so people can choose to build with RTTI off. With make, release mode is by default built with -fno-rtti and debug mode is built without it. Users can override it by setting USE_RTTI=0 or 1. +* Universal compactions including the bottom level can be executed in a dedicated thread pool. This alleviates head-of-line blocking in the compaction queue, which cause write stalling, particularly in multi-instance use cases. Users can enable this feature via `Env::SetBackgroundThreads(N, Env::Priority::BOTTOM)`, where `N > 0`. +* Allow merge operator to be called even with a single merge operand during compactions, by appropriately overriding `MergeOperator::AllowSingleOperand`. +* Add `DB::VerifyChecksum()`, which verifies the checksums in all SST files in a running DB. +* Block-based table support for disabling checksums by setting `BlockBasedTableOptions::checksum = kNoChecksum`. + +### Bug Fixes +* Fix wrong latencies in `rocksdb.db.get.micros`, `rocksdb.db.write.micros`, and `rocksdb.sst.read.micros`. +* Fix incorrect dropping of deletions during intra-L0 compaction. +* Fix transient reappearance of keys covered by range deletions when memtable prefix bloom filter is enabled. +* Fix potentially wrong file smallest key when range deletions separated by snapshot are written together. diff --git a/src/rocksdb/docs/_posts/2017-12-18-17-auto-tuned-rate-limiter.markdown b/src/rocksdb/docs/_posts/2017-12-18-17-auto-tuned-rate-limiter.markdown new file mode 100644 index 000000000..d2e6204e1 --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-12-18-17-auto-tuned-rate-limiter.markdown @@ -0,0 +1,28 @@ +--- +title: Auto-tuned Rate Limiter +layout: post +author: ajkr +category: blog +--- + +### Introduction + +Our rate limiter has been hard to configure since users need to pick a value that is low enough to prevent background I/O spikes, which can impact user-visible read/write latencies. Meanwhile, picking too low a value can cause memtables and L0 files to pile up, eventually leading to writes stalling. Tuning the rate limiter has been especially difficult for users whose DB instances have different workloads, or have workloads that vary over time, or commonly both. + +To address this, in RocksDB 5.9 we released a dynamic rate limiter that adjusts itself over time according to demand for background I/O. It can be enabled simply by passing `auto_tuned=true` in the `NewGenericRateLimiter()` call. In this case `rate_bytes_per_sec` will indicate the upper-bound of the window within which a rate limit will be picked dynamically. The chosen rate limit will be much lower unless absolutely necessary, so setting this to the device's maximum throughput is a reasonable choice on dedicated hosts. + +### Algorithm + +We use a simple multiplicative-increase, multiplicative-decrease algorithm. We measure demand for background I/O as the ratio of intervals where the rate limiter is drained. There are low and high watermarks for this ratio, which will trigger a change in rate limit when breached. The rate limit can move within a window bounded by the user-specified upper-bound, and a lower-bound that we derive internally. Users can expect this lower bound to be 1-2 orders of magnitude less than the provided upper-bound (so don't provide INT64_MAX as your upper-bound), although it's subject to change. + +### Benchmark Results + +Data is ingested at 10MB/s and the rate limiter was created with 1000MB/s as its upper bound. The dynamically chosen rate limit hovers around 125MB/s. The other clustering of points at 50MB/s is due to number of compaction threads being reduced to one when there's no compaction pressure. + +![](/static/images/rate-limiter/write-KBps-series.png) + +![](/static/images/rate-limiter/auto-tuned-write-KBps-series.png) + +The following graph summarizes the above two time series graphs in CDF form. In particular, notice the p90 - p100 for background write rate are significantly lower with auto-tuned rate limiter enabled. + +![](/static/images/rate-limiter/write-KBps-cdf.png) diff --git a/src/rocksdb/docs/_posts/2017-12-19-write-prepared-txn.markdown b/src/rocksdb/docs/_posts/2017-12-19-write-prepared-txn.markdown new file mode 100644 index 000000000..439b3f83c --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-12-19-write-prepared-txn.markdown @@ -0,0 +1,41 @@ +--- +title: WritePrepared Transactions +layout: post +author: maysamyabandeh +category: blog +--- + +RocksDB supports both optimistic and pessimistic concurrency controls. The pessimistic transactions make use of locks to provide isolation between the transactions. The default write policy in pessimistic transactions is _WriteCommitted_, which means that the data is written to the DB, i.e., the memtable, only after the transaction is committed. This policy simplified the implementation but came with some limitations in throughput, transaction size, and variety in supported isolation levels. In the below, we explain these in detail and present the other write policies, _WritePrepared_ and _WriteUnprepared_. We then dive into the design of _WritePrepared_ transactions. + +### WriteCommitted, Pros and Cons + +With _WriteCommitted_ write policy, the data is written to the memtable only after the transaction commits. This greatly simplifies the read path as any data that is read by other transactions can be assumed to be committed. This write policy, however, implies that the writes are buffered in memory in the meanwhile. This makes memory a bottleneck for large transactions. The delay of the commit phase in 2PC (two-phase commit) also becomes noticeable since most of the work, i.e., writing to memtable, is done at the commit phase. When the commit of multiple transactions are done in a serial fashion, such as in 2PC implementation of MySQL, the lengthy commit latency becomes a major contributor to lower throughput. Moreover this write policy cannot provide weaker isolation levels, such as READ UNCOMMITTED, that could potentially provide higher throughput for some applications. + +### Alternatives: _WritePrepared_ and _WriteUnprepared_ + +To tackle the lengthy commit issue, we should do memtable writes at earlier phases of 2PC so that the commit phase become lightweight and fast. 2PC is composed of Write stage, where the transaction `::Put` is invoked, the prepare phase, where `::Prepare` is invoked (upon which the DB promises to commit the transaction if later is requested), and commit phase, where `::Commit` is invoked and the transaction writes become visible to all readers. To make the commit phase lightweight, the memtable write could be done at either `::Prepare` or `::Put` stages, resulting into _WritePrepared_ and _WriteUnprepared_ write policies respectively. The downside is that when another transaction is reading data, it would need a way to tell apart which data is committed, and if they are, whether they are committed before the transaction's start, i.e., in the read snapshot of the transaction. _WritePrepared_ would still have the issue of buffering the data, which makes the memory the bottleneck for large transactions. It however provides a good milestone for transitioning from _WriteCommitted_ to _WriteUnprepared_ write policy. Here we explain the design of _WritePrepared_ policy. We will cover the changes that make the design to also supported _WriteUnprepared_ in an upcoming post. + +### _WritePrepared_ in a nutshell + +These are the primary design questions that needs to be addressed: +1) How do we identify the key/values in the DB with transactions that wrote them? +2) How do we figure if a key/value written by transaction Txn_w is in the read snapshot of the reading transaction Txn_r? +3) How do we rollback the data written by aborted transactions? + +With _WritePrepared_, a transaction still buffers the writes in a write batch object in memory. When 2PC `::Prepare` is called, it writes the in-memory write batch to the WAL (write-ahead log) as well as to the memtable(s) (one memtable per column family); We reuse the existing notion of sequence numbers in RocksDB to tag all the key/values in the same write batch with the same sequence number, `prepare_seq`, which is also used as the identifier for the transaction. At commit time, it writes a commit marker to the WAL, whose sequence number, `commit_seq`, will be used as the commit timestamp of the transaction. Before releasing the commit sequence number to the readers, it stores a mapping from `prepare_seq` to `commit_seq` in an in-memory data structure that we call _CommitCache_. When a transaction reading values from the DB (tagged with `prepare_seq`) it makes use of the _CommitCache_ to figure if `commit_seq` of the value is in its read snapshot. To rollback an aborted transaction, we apply the status before the transaction by making another write that cancels out the writes of the aborted transaction. + +The _CommitCache_ is a lock-free data structure that caches the recent commit entries. Looking up the entries in the cache must be enough for almost all th transactions that commit in a timely manner. When evicting the older entries from the cache, it still maintains some other data structures to cover the corner cases for transactions that takes abnormally too long to finish. We will cover them in the design details below. + +### Benchmark Results +Here we presents the improvements observed in MyRocks with sysbench and linkbench: +* benchmark...........tps.........p95 latency....cpu/query +* insert...................68% +* update-noindex...30%......38% +* update-index.......61%.......28% +* read-write............6%........3.5% +* read-only...........-1.2%.....-1.8% +* linkbench.............1.9%......+overall........0.6% + +Here are also the detailed results for [In-Memory Sysbench](https://gist.github.com/maysamyabandeh/bdb868091b2929a6d938615fdcf58424) and [SSD Sysbench](https://gist.github.com/maysamyabandeh/ff94f378ab48925025c34c47eff99306) curtesy of [@mdcallag](https://github.com/mdcallag). + +Learn more [here](https://github.com/facebook/rocksdb/wiki/WritePrepared-Transactions). diff --git a/src/rocksdb/docs/_posts/2018-02-05-rocksdb-5-10-2-released.markdown b/src/rocksdb/docs/_posts/2018-02-05-rocksdb-5-10-2-released.markdown new file mode 100644 index 000000000..9f32d3f94 --- /dev/null +++ b/src/rocksdb/docs/_posts/2018-02-05-rocksdb-5-10-2-released.markdown @@ -0,0 +1,22 @@ +--- +title: RocksDB 5.10.2 Released! +layout: post +author: siying +category: blog +--- + +### Public API Change +* When running `make` with environment variable `USE_SSE` set and `PORTABLE` unset, will use all machine features available locally. Previously this combination only compiled SSE-related features. + +### New Features +* CRC32C is now using the 3-way pipelined SSE algorithm `crc32c_3way` on supported platforms to improve performance. The system will choose to use this algorithm on supported platforms automatically whenever possible. If PCLMULQDQ is not supported it will fall back to the old Fast_CRC32 algorithm. +* Provide lifetime hints when writing files on Linux. This reduces hardware write-amp on storage devices supporting multiple streams. +* Add a DB stat, `NUMBER_ITER_SKIP`, which returns how many internal keys were skipped during iterations (e.g., due to being tombstones or duplicate versions of a key). +* Add PerfContext counters, `key_lock_wait_count` and `key_lock_wait_time`, which measure the number of times transactions wait on key locks and total amount of time waiting. + +### Bug Fixes +* Fix IOError on WAL write doesn't propagate to write group follower +* Make iterator invalid on merge error. +* Fix performance issue in `IngestExternalFile()` affecting databases with large number of SST files. +* Fix possible corruption to LSM structure when `DeleteFilesInRange()` deletes a subset of files spanned by a `DeleteRange()` marker. +* Fix DB::Flush() keep waiting after flush finish under certain condition. diff --git a/src/rocksdb/docs/_posts/2018-08-01-rocksdb-tuning-advisor.markdown b/src/rocksdb/docs/_posts/2018-08-01-rocksdb-tuning-advisor.markdown new file mode 100644 index 000000000..ff9b1e464 --- /dev/null +++ b/src/rocksdb/docs/_posts/2018-08-01-rocksdb-tuning-advisor.markdown @@ -0,0 +1,58 @@ +--- +title: Rocksdb Tuning Advisor +layout: post +author: poojam23 +category: blog +--- + +The performance of Rocksdb is contingent on its tuning. However, because +of the complexity of its underlying technology and a large number of +configurable parameters, a good configuration is sometimes hard to obtain. The aim of +the python command-line tool, Rocksdb Advisor, is to automate the process of +suggesting improvements in the configuration based on advice from Rocksdb +experts. + +### Overview + +Experts share their wisdom as rules comprising of conditions and suggestions in the INI format (refer +[rules.ini](https://github.com/facebook/rocksdb/blob/main/tools/advisor/advisor/rules.ini)). +Users provide the Rocksdb configuration that they want to improve upon (as the +familiar Rocksdb OPTIONS file — +[example](https://github.com/facebook/rocksdb/blob/main/examples/rocksdb_option_file_example.ini)) +and the path of the file which contains Rocksdb logs and statistics. +The [Advisor](https://github.com/facebook/rocksdb/blob/main/tools/advisor/advisor/rule_parser_example.py) +creates appropriate DataSource objects (for Rocksdb +[logs](https://github.com/facebook/rocksdb/blob/main/tools/advisor/advisor/db_log_parser.py), +[options](https://github.com/facebook/rocksdb/blob/main/tools/advisor/advisor/db_options_parser.py), +[statistics](https://github.com/facebook/rocksdb/blob/main/tools/advisor/advisor/db_stats_fetcher.py) etc.) +and provides them to the [Rules Engine](https://github.com/facebook/rocksdb/blob/main/tools/advisor/advisor/rule_parser.py). +The Rules uses rules from experts to parse data-sources and trigger appropriate rules. +The Advisor's output gives information about which rules were triggered, +why they were triggered and what each of them suggests. Each suggestion +provided by a triggered rule advises some action on a Rocksdb +configuration option, for example, increase CFOptions.write_buffer_size, +set bloom_bits to 2 etc. + +### Usage + +An example command to run the tool: + +```shell +cd rocksdb/tools/advisor +python3 -m advisor.rule_parser_example --rules_spec=advisor/rules.ini --rocksdb_options=test/input_files/OPTIONS-000005 --log_files_path_prefix=test/input_files/LOG-0 --stats_dump_period_sec=20 +``` + +Sample output where a Rocksdb log-based rule has been triggered : + +```shell +Rule: stall-too-many-memtables +LogCondition: stall-too-many-memtables regex: Stopping writes because we have \d+ immutable memtables \(waiting for flush\), max_write_buffer_number is set to \d+ +Suggestion: inc-bg-flush option : DBOptions.max_background_flushes action : increase suggested_values : ['2'] +Suggestion: inc-write-buffer option : CFOptions.max_write_buffer_number action : increase +scope: col_fam: +{'default'} +``` + +### Read more + +For more information, refer to [advisor](https://github.com/facebook/rocksdb/tree/main/tools/advisor/README.md). diff --git a/src/rocksdb/docs/_posts/2018-08-23-data-block-hash-index.markdown b/src/rocksdb/docs/_posts/2018-08-23-data-block-hash-index.markdown new file mode 100644 index 000000000..c4b24ec2a --- /dev/null +++ b/src/rocksdb/docs/_posts/2018-08-23-data-block-hash-index.markdown @@ -0,0 +1,118 @@ +--- +title: Improving Point-Lookup Using Data Block Hash Index +layout: post +author: fgwu +category: blog +--- +We've designed and implemented a _data block hash index_ in RocksDB that has the benefit of both reducing the CPU util and increasing the throughput for point lookup queries with a reasonable and tunable space overhead. + +Specifially, we append a compact hash table to the end of the data block for efficient indexing. It is backward compatible with the data base created without this feature. After turned on the hash index feature, existing data will be gradually converted to the hash index format. + +Benchmarks with `db_bench` show the CPU utilization of one of the main functions in the point lookup code path, `DataBlockIter::Seek()`, is reduced by 21.8%, and the overall RocksDB throughput is increased by 10% under purely cached workloads, at an overhead of 4.6% more space. Shadow testing with Facebook production traffic shows good CPU improvements too. + + +### How to use it +Two new options are added as part of this feature: `BlockBasedTableOptions::data_block_index_type` and `BlockBasedTableOptions::data_block_hash_table_util_ratio`. + +The hash index is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash`. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`, which is valid only if `data_block_index_type = kDataBlockBinaryAndHash`. + + +``` +// the definitions can be found in include/rocksdb/table.h + +// The index type that will be used for the data block. +enum DataBlockIndexType : char { + kDataBlockBinarySearch = 0, // traditional block type + kDataBlockBinaryAndHash = 1, // additional hash index +}; + +// Set to kDataBlockBinaryAndHash to enable hash index +DataBlockIndexType data_block_index_type = kDataBlockBinarySearch; + +// #entries/#buckets. It is valid only when data_block_hash_index_type is +// kDataBlockBinaryAndHash. +double data_block_hash_table_util_ratio = 0.75; + +``` + + +### Data Block Hash Index Design + +Current data block format groups adjacent keys together as a restart interval. One block consists of multiple restart intervals. The byte offset of the beginning of each restart interval, i.e. a restart point, is stored in an array called restart interval index or binary seek index. RocksDB does a binary search when performing point lookup for keys in data blocks to find the right restart interval the key may reside. We will use binary seek and binary search interchangeably in this post. + +In order to find the right location where the key may reside using binary search, multiple key parsing and comparison are needed. Each binary search branching triggers CPU cache miss, causing much CPU utilization. We have seen that this binary search takes up considerable CPU in production use-cases. + +![](/static/images/data-block-hash-index/block-format-binary-seek.png) + +We implemented a hash map at the end of the block to index the key to reduce the CPU overhead of the binary search. The hash index is just an array of pointers pointing into the binary seek index. + +![](/static/images/data-block-hash-index/block-format-hash-index.png) + + +Each array element is considered as a hash bucket when storing the location of a key (or more precisely, the restart index of the restart interval where the key resides). When multiple keys happen to hash into the same bucket (hash collision), we just mark the bucket as “collision”. So that when later querying on that key, the hash table lookup knows that there was a hash collision happened so it can fall back to the traditional binary search to find the location of the key. + +We define hash table utilization ratio as the #keys/#buckets. If a utilization ratio is 0.5 and there are 100 buckets, 50 keys are stored in the bucket. The less the util ratio, the less hash collision, and the less chance for a point lookup falls back to binary seek (fall back ratio) due to the collision. So a small util ratio has more benefit to reduce the CPU time but introduces more space overhead. + +Space overhead depends on the util ratio. Each bucket is a `uint8_t` (i.e. one byte). For a util ratio of 1, the space overhead is 1Byte per key, the fall back ratio observed is ~52%. + +![](/static/images/data-block-hash-index/hash-index-data-structure.png) + +### Things that Need Attention + +**Customized Comparator** + +Hash index will hash different keys (keys with different content, or byte sequence) into different hash values. This assumes the comparator will not treat different keys as equal if they have different content. + +The default bytewise comparator orders the keys in alphabetical order and works well with hash index, as different keys will never be regarded as equal. However, some specially crafted comparators will do. For example, say, a `StringToIntComparator` can convert a string into an integer, and use the integer to perform the comparison. Key string “16” and “0x10” is equal to each other as seen by this `StringToIntComparator`, but they probably hash to different value. Later queries to one form of the key will not be able to find the existing key been stored in the other format. + +We add a new function member to the comparator interface: + +``` +virtual bool CanKeysWithDifferentByteContentsBeEqual() const { return true; } +``` + + +Every comparator implementation should override this function and specify the behavior of the comparator. If a comparator can regard different keys equal, the function returns true, and as a result the hash index feature will not be enabled, and vice versa. + +NOTE: to use the hash index feature, one should 1) have a comparator that can never treat different keys as equal; and 2) override the `CanKeysWithDifferentByteContentsBeEqual()` function to return `false`, so the hash index can be enabled. + + +**Util Ratio's Impact on Data Block Cache** + +Adding the hash index to the end of the data block essentially takes up the data block cache space, making the effective data block cache size smaller and increasing the data block cache miss ratio. Therefore, a very small util ratio will result in a large data block cache miss ratio, and the extra I/O may drag down the throughput gain achieved by the hash index lookup. Besides, when compression is enabled, cache miss also incurs data block decompression, which is CPU-consuming. Therefore the CPU may even increase if using a too small util ratio. The best util ratio depends on workloads, cache to data ratio, disk bandwidth/latency etc. In our experiment, we found util ratio = 0.5 ~ 1 is a good range to explore that brings both CPU and throughput gains. + + +### Limitations + +As we use `uint8_t` to store binary seek index, i.e. restart interval index, the total number of restart intervals cannot be more than 253 (we reserved 255 and 254 as special flags). For blocks having a larger number of restart intervals, the hash index will not be created and the point lookup will be done by traditional binary seek. + +Data block hash index only supports point lookup. We do not support range lookup. Range lookup request will fall back to BinarySeek. + +RocksDB supports many types of records, such as `Put`, `Delete`, `Merge`, etc (visit [here](https://github.com/facebook/rocksdb/wiki/rocksdb-basics) for more information). Currently we only support `Put` and `Delete`, but not `Merge`. Internally we have a limited set of supported record types: + + +``` +kPutRecord, <=== supported +kDeleteRecord, <=== supported +kSingleDeleteRecord, <=== supported +kTypeBlobIndex, <=== supported +``` + +For records not supported, the searching process will fall back to the traditional binary seek. + + + +### Evaluation +To evaluate the CPU util reduction and isolate other factors such as disk I/O and block decompression, we first evaluate the hash idnex in a purely cached workload. We observe that the CPU utilization of one of the main functions in the point lookup code path, DataBlockIter::Seek(), is reduced by 21.8% and the overall throughput is increased by 10% at an overhead of 4.6% more space. + +However, general worload is not always purely cached. So we also evaluate the performance under different cache space pressure. In the following test, we use `db_bench` with RocksDB deployed on SSDs. The total DB size is 5~6GB, and it is about 14GB if decompressed. Different block cache sizes are used, ranging from 14GB down to 2GB, with an increasing cache miss ratio. + +Orange bars are representing our hash index performance. We use a hash util ratio of 1.0 in this test. Block size are set to 16KiB with the restart interval as 16. + +![](/static/images/data-block-hash-index/perf-throughput.png) +![](/static/images/data-block-hash-index/perf-cache-miss.png) + +We can see that if cache size is greater than 8GB, hash index can bring throughput gain. Cache size greater than 8GB can be translated to a cache miss ratio smaller than 40%. So if the workload has a cache miss ratio smaller than 40%, hash index is able to increase the throughput. + +Besides, shadow testing with Facebook production traffic shows good CPU improvements too. + diff --git a/src/rocksdb/docs/_posts/2018-11-21-delete-range.markdown b/src/rocksdb/docs/_posts/2018-11-21-delete-range.markdown new file mode 100644 index 000000000..96fc3562d --- /dev/null +++ b/src/rocksdb/docs/_posts/2018-11-21-delete-range.markdown @@ -0,0 +1,292 @@ +--- +title: "DeleteRange: A New Native RocksDB Operation" +layout: post +author: +- abhimadan +- ajkr +category: blog +--- +## Motivation + +### Deletion patterns in LSM + +Deleting a range of keys is a common pattern in RocksDB. Most systems built on top of +RocksDB have multi-component key schemas, where keys sharing a common prefix are +logically related. Here are some examples. + +MyRocks is a MySQL fork using RocksDB as its storage engine. Each key's first +four bytes identify the table or index to which that key belongs. Thus dropping +a table or index involves deleting all the keys with that prefix. + +Rockssandra is a Cassandra variant that uses RocksDB as its storage engine. One +of its admin tool commands, `nodetool cleanup`, removes key-ranges that have been migrated +to other nodes in the cluster. + +Marketplace uses RocksDB to store product data. Its key begins with product ID, +and it stores various data associated with the product in separate keys. When a +product is removed, all these keys must be deleted. + +When we decide what to improve, we try to find a use case that's common across +users, since we want to build a generally useful system, not one that has many +one-off features for individual users. The range deletion pattern is common as +illustrated above, so from this perspective it's a good target for optimization. + +### Existing mechanisms: challenges and opportunities + +The most common pattern we see is scan-and-delete, i.e., advance an iterator +through the to-be-deleted range, and issue a `Delete` for each key. This is +slow (involves read I/O) so cannot be done in any critical path. Additionally, +it creates many tombstones, which slows down iterators and doesn't offer a deadline +for space reclamation. + +Another common pattern is using a custom compaction filter that drops keys in +the deleted range(s). This deletes the range asynchronously, so cannot be used +in cases where readers must not see keys in deleted ranges. Further, it has the +disadvantage of outputting tombstones to all but the bottom level. That's +because compaction cannot detect whether dropping a key would cause an older +version at a lower level to reappear. + +If space reclamation time is important, or it is important that the deleted +range not affect iterators, the user can trigger `CompactRange` on the deleted +range. This can involve arbitrarily long waits in the compaction queue, and +increases write-amp. By the time it's finished, however, the range is completely +gone from the LSM. + +`DeleteFilesInRange` can be used prior to compacting the deleted range as long +as snapshot readers do not need to access them. It drops files that are +completely contained in the deleted range. That saves write-amp because, in +`CompactRange`, the file data would have to be rewritten several times before it +reaches the bottom of the LSM, where tombstones can finally be dropped. + +In addition to the above approaches having various drawbacks, they are quite +complicated to reason about and implement. In an ideal world, deleting a range +of keys would be (1) simple, i.e., a single API call; (2) synchronous, i.e., +when the call finishes, the keys are guaranteed to be wiped from the DB; (3) low +latency so it can be used in critical paths; and (4) a first-class operation +with all the guarantees of any other write, like atomicity, crash-recovery, etc. + +## v1: Getting it to work + +### Where to persist them? + +The first place we thought about storing them is inline with the data blocks. +We could not think of a good way to do it, however, since the start of a range +tombstone covering a key could be anywhere, making binary search impossible. +So, we decided to investigate segregated storage. + +A second solution we considered is appending to the manifest. This file is +append-only, periodically compacted, and stores metadata like the level to which +each SST belongs. This is tempting because it leverages an existing file, which +is maintained in the background and fully read when the DB is opened. However, +it conceptually violates the manifest's purpose, which is to store metadata. It +also has no way to detect when a range tombstone no longer covers anything and +is droppable. Further, it'd be possible for keys above a range tombstone to disappear +when they have their seqnums zeroed upon compaction to the bottommost level. + +A third candidate is using a separate column family. This has similar problems +to the manifest approach. That is, we cannot easily detect when a range +tombstone is obsolete, and seqnum zeroing can cause a key +to go from above a range tombstone to below, i.e., disappearing. The upside is +we can reuse logic for memory buffering, consistent reads/writes, etc. + +The problems with the second and third solutions indicate a need for range +tombstones to be aware of flush/compaction. An easy way to achieve this is put +them in the SST files themselves - but not in the data blocks, as explained for +the first solution. So, we introduced a separate meta-block for range tombstones. +This resolved the problem of when to obsolete range tombstones, as it's simple: +when they're compacted to the bottom level. We also reused the LSM invariants +that newer versions of a key are always in a higher level to prevent the seqnum +zeroing problem. This approach has the side benefit of constraining the range +tombstones seen during reads to ones in a similar key-range. + +![](/static/images/delrange/delrange_sst_blocks.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +*When there are range tombstones in an SST, they are segregated in a separate meta-block* +{: style="text-align: center"} + +![](/static/images/delrange/delrange_key_schema.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +*Logical range tombstones (left) and their corresponding physical key-value representation (right)* +{: style="text-align: center"} + +### Write path + +`WriteBatch` stores range tombstones in its buffer which are logged to the WAL and +then applied to a dedicated range tombstone memtable during `Write`. Later in +the background the range tombstone memtable and its corresponding data memtable +are flushed together into a single SST with a range tombstone meta-block. SSTs +periodically undergo compaction which rewrites SSTs with point data and range +tombstones dropped or merged wherever possible. + +We chose to use a dedicated memtable for range tombstones. The memtable +representation is always skiplist in order to minimize overhead in the usual +case, which is the memtable contains zero or a small number of range tombstones. +The range tombstones are segregated to a separate memtable for the same reason +we segregated range tombstones in SSTs. That is, we did not know how to +interleave the range tombstone with point data in a way that we would be able to +find it for arbitrary keys that it covers. + +![](/static/images/delrange/delrange_write_path.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 70%"} + +*Lifetime of point keys and range tombstones in RocksDB* +{: style="text-align: center"} + +During flush and compaction, we chose to write out all non-obsolete range +tombstones unsorted. Sorting by a single dimension is easy to implement, but +doesn't bring asymptotic improvement to queries over range data. Ideally, we +want to store skylines (see “Read Path” subsection below) computed over our ranges so we can binary search. +However, a couple of concerns cause doing this in flush and compaction to feel +unsatisfactory: (1) we need to store multiple skylines, one for each snapshot, +which further complicates the range tombstone meta-block encoding; and (2) even +if we implement this, the range tombstone memtable still needs to be linearly +scanned. Given these concerns we decided to defer collapsing work to the read +side, hoping a good caching strategy could optimize this at some future point. + + +### Read path + +In point lookups, we aggregate range tombstones in an unordered vector as we +search through live memtable, immutable memtables, and then SSTs. When a key is +found that matches the lookup key, we do a scan through the vector, checking +whether the key is deleted. + +In iterators, we aggregate range tombstones into a skyline as we visit live +memtable, immutable memtables, and SSTs. The skyline is expensive to construct but fast to determine whether a key is covered. The skyline keeps track of the most recent range tombstone found to optimize `Next` and `Prev`. + +|![](/static/images/delrange/delrange_uncollapsed.png) |![](/static/images/delrange/delrange_collapsed.png) | + +*([Image source: Leetcode](https://leetcode.com/problems/the-skyline-problem/description/)) The skyline problem involves taking building location/height data in the +unsearchable form of A and converting it to the form of B, which is +binary-searchable. With overlapping range tombstones, to achieve efficient +searching we need to solve an analogous problem, where the x-axis is the +key-space and the y-axis is the sequence number.* +{: style="text-align: center"} + +### Performance characteristics + +For the v1 implementation, writes are much faster compared to the scan and +delete (optionally within a transaction) pattern. `DeleteRange` only logs to WAL +and applies to memtable. Logging to WAL always `fflush`es, and optionally +`fsync`s or `fdatasync`s. Applying to memtable is always an in-memory operation. +Since range tombstones have a dedicated skiplist memtable, the complexity of inserting is O(log(T)), where T is the number of existing buffered range tombstones. + +Reading in the presence of v1 range tombstones, however, is much slower than reads +in a database where scan-and-delete has happened, due to the linear scan over +range tombstone memtables/meta-blocks. + +Iterating in a database with v1 range tombstones is usually slower than in a +scan-and-delete database, although the gap lessens as iterations grow longer. +When an iterator is first created and seeked, we construct a skyline over its +tombstones. This operation is O(T\*log(T)) where T is the number of tombstones +found across live memtable, immutable memtable, L0 files, and one file from each +of the L1+ levels. However, moving the iterator forwards or backwards is simply +a constant-time operation (excluding edge cases, e.g., many range tombstones +between consecutive point keys). + +## v2: Making it fast + +`DeleteRange`’s negative impact on read perf is a barrier to its adoption. The +root cause is range tombstones are not stored or cached in a format that can be +efficiently searched. We needed to design DeleteRange so that we could maintain +write performance while making read performance competitive with workarounds +used in production (e.g., scan-and-delete). + +### Representations + +The key idea of the redesign is that, instead of globally collapsing range tombstones, + we can locally “fragment” them for each SST file and memtable to guarantee that: + +* no range tombstones overlap; and +* range tombstones are ordered by start key. + +Combined, these properties make range tombstones binary searchable. This + fragmentation will happen on the read path, but unlike the previous design, we can + easily cache many of these range tombstone fragments on the read path. + +### Write path + +The write path remains unchanged. + +### Read path + +When an SST file is opened, its range tombstones are fragmented and cached. For point + lookups, we binary search each file's fragmented range tombstones for one that covers + the lookup key. Unlike the old design, once we find a tombstone, we no longer need to + search for the key in lower levels, since we know that any keys on those levels will be + covered (though we do still check the current level since there may be keys written after + the range tombstone). + +For range scans, we create iterators over all the fragmented range + tombstones and store them in a list, seeking each one to cover the start key of the range + scan (if possible), and query each encountered key in this structure as in the old design, + advancing range tombstone iterators as necessary. In effect, we implicitly create a skyline. + This requires significantly less work on iterator creation, but since each memtable/SST has +its own range tombstone iterator, querying range tombstones requires key comparisons (and +possibly iterator increments) for several iterators (as opposed to v1, where we had a global +collapsed representation of all range tombstones). As a result, very long range scans may become + slower than before, but short range scans are an order of magnitude faster, which are the + more common class of range scan. + +## Benchmarks + +To understand the performance of this new design, we used `db_bench` to compare point lookup, short range scan, + and long range scan performance across: + +* the v1 DeleteRange design, +* the scan-and-delete workaround, and +* the v2 DeleteRange design. + +In these benchmarks, we used a database with 5 million data keys, and 10000 range tombstones (ignoring +those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written. +Writing the range tombstones ensures that most of them are not compacted away, and we have more tombstones +in higher levels that cover keys in lower levels, which allows the benchmarks to exercise more interesting behavior +when reading deleted keys. + +Point lookup benchmarks read 100000 keys from a database using `readwhilewriting`. Range scan benchmarks used +`seekrandomwhilewriting` and seeked 100000 times, and advanced up to 10 keys away from the seek position for short range scans, and advanced up to 1000 keys away from the seek position for long range scans. + +The results are summarized in the tables below, averaged over 10 runs (note the +different SHAs for v1 benchmarks are due to a new `db_bench` flag that was added in order to compare performance with databases with no tombstones; for brevity, those results are not reported here). Also note that the block cache was large enough to hold the entire db, so the large throughput is due to limited I/Os and little time spent on decompression. The range tombstone blocks are always pinned uncompressed in memory. We believe these setup details should not affect relative performance between versions. + +### Point Lookups + +|Name |SHA |avg micros/op |avg ops/sec | +|v1 |35cd754a6 |1.3179 |759,830.90 | +|scan-del |7528130e3 |0.6036 |1,667,237.70 | +|v2 |7528130e3 |0.6128 |1,634,633.40 | + +### Short Range Scans + +|Name |SHA |avg micros/op |avg ops/sec | +|v1 |0ed738fdd |6.23 |176,562.00 | +|scan-del |PR 4677 |2.6844 |377,313.00 | +|v2 |PR 4677 |2.8226 |361,249.70 | + +### Long Range scans + +|Name |SHA |avg micros/op |avg ops/sec | +|v1 |0ed738fdd |52.7066 |19,074.00 | +|scan-del |PR 4677 |38.0325 |26,648.60 | +|v2 |PR 4677 |41.2882 |24,714.70 | + +## Future Work + +Note that memtable range tombstones are fragmented every read; for now this is acceptable, + since we expect there to be relatively few range tombstones in memtables (and users can + enforce this by keeping track of the number of memtable range deletions and manually flushing + after it passes a threshold). In the future, a specialized data structure can be used for storing + range tombstones in memory to avoid this work. + +Another future optimization is to create a new format version that requires range tombstones to + be stored in a fragmented form. This would save time when opening SST files, and when `max_open_files` +is not -1 (i.e., files may be opened several times). + +## Acknowledgements + +Special thanks to Peter Mattis and Nikhil Benesch from Cockroach Labs, who were early users of +DeleteRange v1 in production, contributed the cleanest/most efficient v1 aggregation implementation, found and fixed bugs, and provided initial DeleteRange v2 design and continued help. + +Thanks to Huachao Huang and Jinpeng Zhang from PingCAP for early DeleteRange v1 adoption, bug reports, and fixes. diff --git a/src/rocksdb/docs/_posts/2019-03-08-format-version-4.markdown b/src/rocksdb/docs/_posts/2019-03-08-format-version-4.markdown new file mode 100644 index 000000000..ce657696c --- /dev/null +++ b/src/rocksdb/docs/_posts/2019-03-08-format-version-4.markdown @@ -0,0 +1,36 @@ +--- +title: format_version 4 +layout: post +author: maysamyabandeh +category: blog +--- + +The data blocks in RocksDB consist of a sequence of key/values pairs sorted by key, where the pairs are grouped into _restart intervals_ specified by `block_restart_interval`. Up to RocksDB version 5.14, where the latest and default value of `BlockBasedTableOptions::format_version` is 2, the format of index and data blocks are the same: index blocks use the same key format of <`user_key`,`seq`> and encode pointers to data blocks, <`offset`,`size`>, to a byte string and use them as values. The only difference is that the index blocks use `index_block_restart_interval` for the size of _restart intervals_. `format_version=`3,4 offer more optimized, backward-compatible, yet forward-incompatible format for index blocks. + +### Pros + +Using `format_version`=4 significantly reduces the index block size, in some cases around 4-5x. This frees more space in block cache, which would result in higher hit rate for data and filter blocks, or offer the same performance with a smaller block cache size. + +### Cons + +Being _forward-incompatible_ means that if you enable `format_version=`4 you cannot downgrade to a RocksDB version lower than 5.16. + +### How to use it? + +- `BlockBasedTableOptions::format_version` = 4 +- `BlockBasedTableOptions::index_block_restart_interval` = 16 + +### What is format_version 3? +(Since RocksDB 5.15) In most cases, the sequence number `seq` is not necessary for keys in the index blocks. In such cases, `format_version`=3 skips encoding the sequence number and sets `index_key_is_user_key` in TableProperties, which is used by the reader to know how to decode the index block. + +### What is format_version 4? +(Since RocksDB 5.16) Changes the format of index blocks by delta encoding the index values, which are the block handles. This saves the encoding of `BlockHandle::offset` of the non-head index entries in each restart interval. If used, `TableProperties::index_value_is_delta_encoded` is set, which is used by the reader to know how to decode the index block. The format of each key is (shared_size, non_shared_size, shared, non_shared). The format of each value, i.e., block handle, is (offset, size) whenever the shared_size is 0, which included the first entry in each restart point. Otherwise the format is delta-size = block handle size - size of last block handle. + +The index format in `format_version=4` would be as follows: + + restart_point 0: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz) + restart_point 1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz) + ... + restart_point n-1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz) + where, k is key, v is value, and its encoding is in parenthesis. + diff --git a/src/rocksdb/docs/_posts/2019-08-15-unordered-write.markdown b/src/rocksdb/docs/_posts/2019-08-15-unordered-write.markdown new file mode 100644 index 000000000..5f0eb2880 --- /dev/null +++ b/src/rocksdb/docs/_posts/2019-08-15-unordered-write.markdown @@ -0,0 +1,56 @@ +--- +title: Higher write throughput with `unordered_write` feature +layout: post +author: maysamyabandeh +category: blog +--- + +Since RocksDB 6.3, The `unordered_write=`true option together with WritePrepared transactions offers 34-42% higher write throughput compared to vanilla RocksDB. If the application can handle more relaxed ordering guarantees, the gain in throughput would increase to 63-131%. + +### Background + +Currently RocksDB API delivers the following powerful guarantees: +- Atomic reads: Either all of a write batch is visible to reads or none of it. +- Read-your-own writes: When a write thread returns to the user, a subsequent read by the same thread will be able to see its own writes. +- Immutable Snapshots: The reads visible to the snapshot are immutable in the sense that it will not be affected by any in-flight or future writes. + +### `unordered_write` + +The `unordered_write` feature, when turned on, relaxes the default guarantees of RocksDB. While it still gives read-your-own-write property, neither atomic reads nor the immutable snapshot properties are provided any longer. However, RocksDB users could still get read-your-own-write and immutable snapshots when using this feature in conjunction with TransactionDB configured with WritePrepared transactions and `two_write_queues`. You can read [here](https://github.com/facebook/rocksdb/wiki/unordered_write) to learn about the design of `unordered_write` and [here](https://github.com/facebook/rocksdb/wiki/WritePrepared-Transactions) to learn more about WritePrepared transactions. + +### How to use it? + +To get the same guarantees as vanilla RocksdB: + + DBOptions db_options; + db_options.unordered_write = true; + db_options.two_write_queues = true; + DB* db; + { + TransactionDBOptions txn_db_options; + txn_db_options.write_policy = TxnDBWritePolicy::WRITE_PREPARED; + txn_db_options.skip_concurrency_control = true; + TransactionDB* txn_db; + TransactionDB::Open(options, txn_db_options, kDBPath, &txn_db); + db = txn_db; + } + db->Write(...); + +To get relaxed guarantees: + + DBOptions db_options; + db_options.unordered_write = true; + DB* db; + DB::Open(db_options, kDBPath, &db); + db->Write(...); + +# Benchmarks + + TEST_TMPDIR=/dev/shm/ ~/db_bench --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --transaction_db=true --unordered_write=1 --disable_wal=0 + +Throughput with `unordered_write`=true and using WritePrepared transaction: +- WAL: +42% +- No-WAL: +34% +Throughput with `unordered_write`=true +- WAL: +63% +- NoWAL: +131% diff --git a/src/rocksdb/docs/_posts/2021-04-12-universal-improvements.markdown b/src/rocksdb/docs/_posts/2021-04-12-universal-improvements.markdown new file mode 100644 index 000000000..fa4e9d463 --- /dev/null +++ b/src/rocksdb/docs/_posts/2021-04-12-universal-improvements.markdown @@ -0,0 +1,46 @@ +--- +title: (Call For Contribution) Make Universal Compaction More Incremental +layout: post +author: sdong +category: blog +--- + +### Motivation + +Universal Compaction is an important compaction style, but few changes were made after we made the structure multi-leveled. Yet the major restriction of always compacting full sorted run is not relaxed. Compared to Leveled Compaction, where we usually only compile several SST files together, in universal compaction, we frequently compact GBs of data. Two issues with this gap: 1. it makes it harder to unify universal and leveled compaction; 2. periodically data is fully compacted, and in the mean time space is doubled. To ease the problem, we can break the restriction and do similar as leveled compaction, and bring it closer to unified compaction. + +We call for help for making following improvements. + + +### How Universal Compaction Works + +In universal, whole levels are compacted together to satisfy two conditions (See [wiki page](https://github.com/facebook/rocksdb/wiki/Universal-Compaction) for more details): + +1. total size / bottommost level size > a threshold, or +2. total number of sorted runs (non-0 levels + L0 files) is within a threshold + +1 is to limit extra space overhead used for dead data and 2 is for read performance. + +If 1 is triggered, likely a full compaction will be triggered. If 2 is triggered, RocksDB compact some sorted runs to bring the number down. It does it by using a simple heuristic so that less writes needed for that purpose over time: it starts from compacting smaller files, but if total size to compact is similar to or larger than size of the next level, it will take that level together, as soon on (whether it is the best heuristic is another question and we’ve never seriously looked at it). + +### How We Can Improve? + +Let’s start from condition 1. Here we do full compaction but is not necessary. A simple optimization would be to compact so that just enough files are merged into the bottommost level (Lmax) to satisfy condition 1. It would work if we only need to pick some files from Lmax-1, or if it is cheaper over time, we can pick some files from other levels too. + +Then condition 2. If we finish condition 1, there might be holes in some ranges in older levels. These holes might make it possible that only by compacting some sub ranges, we can fix the LSM-tree for condition 2. RocksDB can take single files into consideration and apply more sophisticated heuristic. + +This new approach makes universal compaction closer to leveled compaction. The operation for 1 is closer to how Leveled compaction triggeres Lmax-1 to Lmax compaction. And 2 can potentially be implemented as something similar to level picking in Leveled Compaction. In fact, all those file picking can co-existing in one single compaction style and there isn’t fundamental conflicts to that. + +### Limitation + +There are two limitations: + +* Periodic automatic full compaction is unpleasant but at the same time is pleasant in another way. Some users might uses it to reason that everything is periodically collapsed so dead data is gone and old data is rewritten. We need to make sure periodic compaction works to continue with that. +* L0 to the first non-L0 level compaction is the first time data is partitioned in LSM-tree so that incremental compaction by range is possible. We might need to do more of these compactions in order to make incremental possible, which will increase compaction slightly. +* Compacting subset of a level would introduce some extra overhead for unaligned files, just as in leveled compaction. More SST boundary cutting heuristic can reduce this overhead but it will be there. + +But I believe the benefits would outweight the limitations. Reducing temporary space doubling and moving towards to unified compaction would be important achievements. + +### Interested in Help? + +Compaction is the core of LSM-tree, but its improvements are far overdue. If you are a user of universal compaction and would be able to benefit from those improvements, we will be happy to work with you on speeding up the project and bring them to RocksDB sooner. Feel free to communicate with us in [this issue](https://github.com/facebook/rocksdb/issues/8181). diff --git a/src/rocksdb/docs/_posts/2021-05-26-integrated-blob-db.markdown b/src/rocksdb/docs/_posts/2021-05-26-integrated-blob-db.markdown new file mode 100644 index 000000000..9f3a22fa2 --- /dev/null +++ b/src/rocksdb/docs/_posts/2021-05-26-integrated-blob-db.markdown @@ -0,0 +1,101 @@ +--- +title: Integrated BlobDB +layout: post +author: ltamasi +category: blog +--- +## Background + +BlobDB is essentially RocksDB for large-value use cases. The basic idea, which was proposed in the [WiscKey paper](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf), is key-value separation: by storing large values in dedicated blob files and storing only small pointers to them in the LSM tree, we avoid copying the values over and over again during compaction, thus reducing write amplification. Historically, BlobDB supported only FIFO and TTL based use cases that can tolerate some data loss. In addition, it was incompatible with many widely used RocksDB features, and required users to adopt a custom API. In 2020, we decided to rearchitect BlobDB from the ground up, taking the lessons learned from WiscKey and the original BlobDB but also drawing inspiration and incorporating ideas from other similar systems. Our goals were to eliminate the above limitations and to create a new integrated version that enables customers to use the well-known RocksDB API, has feature parity with the core of RocksDB, and offers better performance. This new implementation is now available and provides the following improvements over the original: + +* **API.** In contrast with the legacy BlobDB implementation, which had its own `StackableDB`-based interface (`rocksdb::blob_db::BlobDB`), the new version can be used via the well-known `rocksdb::DB` API, and can be configured simply by using a few column family options. +* **Consistency.** With the integrated BlobDB implementation, RocksDB’s consistency guarantees and various write options (like using the WAL or synchronous writes) now apply to blobs as well. Moreover, the new BlobDB keeps track of blob files in the RocksDB MANIFEST. +* **Write performance.** When using the old BlobDB, blobs are extracted and immediately written to blob files by the BlobDB layer *in the application thread*. This has multiple drawbacks from a performance perspective: first, it requires synchronization; second, it means that expensive operations like compression are performed in the application thread; and finally, it involves flushing the blob file after each blob. The new code takes a completely different approach by *offloading blob file building to RocksDB’s background jobs*, i.e. flushes and compactions. This means that similarly to SSTs, any given blob file is now written by a single background thread, eliminating the need for locking, flushing, or performing compression in the foreground. Note that this approach is also a better fit for network-based file systems where small writes might be expensive and opens up the possibility of file format optimizations that involve buffering (like dictionary compression). +* **Read performance.** The old code relies on each read (i.e. `Get`, `MultiGet`, or iterator) taking a snapshot and uses those snapshots when deciding which obsolete blob files can be removed. The new BlobDB improves this by generalizing RocksDB’s Version concept, which historically referred to the set of live SST files at a given point in time, to include the set of live blob files as well. This has performance benefits like [making the read path mostly lock-free by utilizing thread-local storage](https://rocksdb.org/blog/2014/06/27/avoid-expensive-locks-in-get.html). We have also introduced a blob file cache that can be utilized to keep frequently accessed blob files open. +* **Garbage collection.** Key-value separation means that if a key pointing to a blob gets overwritten or deleted, the blob becomes unreferenced garbage. To be able to reclaim this space, BlobDB now has garbage collection capabilities. GC is integrated into the compaction process and works by relocating valid blobs residing in old blob files as they are encountered during compaction. Blob files can be marked obsolete (and eventually deleted in one shot) once they contain nothing but garbage. This is more efficient than the method used by WiscKey, which involves performing a `Get` operation to find out whether a blob is still referenced followed by a `Put` to update the reference, which in turn results in garbage collection competing and potentially conflicting with the application’s writes. +* **Feature parity with the RocksDB core.** The new BlobDB supports way more features than the original and is near feature parity with vanilla RocksDB. In particular, we support all basic read/write APIs (with the exception of `Merge`, which is coming soon), recovery, compression, atomic flush, column families, compaction filters, checkpoints, backup/restore, transactions, per-file checksums, and the SST file manager. In addition, the new BlobDB’s options can be dynamically adjusted using the `SetOptions` interface. + +## API + +The new BlobDB can be configured (on a per-column family basis if needed) simply by using the following options: + +* `enable_blob_files`: set it to `true` to enable key-value separation. +* `min_blob_size`: values at or above this threshold will be written to blob files during flush or compaction. +* `blob_file_size`: the size limit for blob files. +* `blob_compression_type`: the compression type to use for blob files. All blobs in the same file are compressed using the same algorithm. +* `enable_blob_garbage_collection`: set this to `true` to make BlobDB actively relocate valid blobs from the oldest blob files as they are encountered during compaction. +* `blob_garbage_collection_age_cutoff`: the threshold that the GC logic uses to determine which blob files should be considered “old.” For example, the default value of 0.25 signals to RocksDB that blobs residing in the oldest 25% of blob files should be relocated by GC. This parameter can be tuned to adjust the trade-off between write amplification and space amplification. + +The above options are all dynamically adjustable via the `SetOptions` API; changing them will affect subsequent flushes and compactions but not ones that are already in progress. + +In terms of compaction styles, we recommend using leveled compaction with BlobDB. The rationale behind universal compaction in general is to provide lower write amplification at the expense of higher read amplification; however, as we will see later in the Performance section, BlobDB can provide very low write amp and good read performance with leveled compaction. Therefore, there is really no reason to take the hit in read performance that comes with universal compaction. + +In addition to the above, consider tuning the following non-BlobDB specific options: + +* `write_buffer_size`: this is the memtable size. You might want to increase it for large-value workloads to ensure that SST and blob files contain a decent number of keys. +* `target_file_size_base`: the target size of SST files. Note that even when using BlobDB, it is important to have an LSM tree with a “nice” shape and multiple levels and files per level to prevent heavy compactions. Since BlobDB extracts and writes large values to blob files, it makes sense to make this parameter significantly smaller than the memtable size. One guideline is to set `blob_file_size` to the same value as `write_buffer_size` (adjusted for compression if needed) and make `target_file_size_base` proportionally smaller based on the ratio of key size to value size. +* `max_bytes_for_level_base`: consider setting this to a multiple (e.g. 8x or 10x) of `target_file_size_base`. + +As mentioned above, the new BlobDB now also supports compaction filters. Key-value separation actually enables an optimization here: if the compaction filter of an application can make a decision about a key-value solely based on the key, it is unnecessary to read the value from the blob file. Applications can take advantage of this optimization by implementing the new `FilterBlobByKey` method of the `CompactionFilter` interface. This method gets called by RocksDB first whenever it encounters a key-value where the value is stored in a blob file. If this method returns a “final” decision like `kKeep`, `kRemove`, `kChangeValue`, or `kRemoveAndSkipUntil`, RocksDB will honor that decision; on the other hand, if the method returns `kUndetermined`, RocksDB will read the blob from the blob file and call `FilterV2` with the value in the usual fashion. + +## Performance + +We tested the performance of the new BlobDB for six different value sizes between 1 KB and 1 MB using a customized version of our [standard benchmark suite](https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks) on a box with an 18-core Skylake DE CPU (running at 1.6 GHz, with hyperthreading enabled), 64 GB RAM, a 512 GB boot SSD, and two 1.88 TB M.2 SSDs in a RAID0 configuration for data. The RocksDB version used was equivalent to 6.18.1, with some benchmarking and statistics related enhancements. Leveled and universal compaction without key-value separation were used as reference points. Note that for simplicity, we use “leveled compaction” and “universal compaction” as shorthand for leveled and universal compaction without key-value separation, respectively, and “BlobDB” for BlobDB with leveled compaction. + +Our benchmarks cycled through six different workloads: two write-only ones (initial load and overwrite), two read/write ones (point lookup/write mix and range scan/write mix), and finally two read-only ones (point lookups and range scans). The first two phases performed a fixed amount of work (see below), while the final four were run for a fixed amount of time, namely 30 minutes each. Each phase other than the first one started with the database state left behind by the previous one. Here’s a brief description of the workloads: + +* **Initial load**: this workload has two distinct stages, a single-threaded random write stage during which compactions are disabled (so all data is flushed to L0, where it remains for the rest of the stage), followed by a full manual compaction. The random writes are performed with load-optimized settings, namely using the vector memtable implementation and with concurrent memtable writes and WAL disabled. This stage was used to populate the database with 1 TB worth of raw values, e.g. 2^30 (~1 billion) 1 KB values or 2^20 (~1 million) 1 MB values. +* **Overwrite**: this is a multi-threaded random write workload using the usual skiplist memtable, with compactions, WAL, and concurrent memtable writes enabled. In our tests, 16 writer threads were used. The total number of writes was set to the same number as in the initial load stage and split up evenly between the writer threads. For instance, for the 1 MB value size, we had 2^20 writes divided up between the 16 threads, resulting in each thread performing 2^16 write operations. At the end of this phase, a “wait for compactions” step was added to prevent this workload from exhibiting artificially low write amp or conversely, the next phase showing inflated write amp. +* **Point lookup/write mix**: a single writer thread performing random writes while N (in our case, 16) threads perform random point lookups. WAL is enabled and all writes are synced. +* **Range scan/write mix**: similar to the above, with one writer thread and N reader threads (where N was again set to 16 in our tests). The reader threads perform random range scans, with 10 `Next` calls per `Seek`. Again, WAL is enabled, and sync writes are used. +* **Point lookups (read-only)**: N=16 threads perform random point lookups. +* **Range scans (read-only)**: N=16 threads execute random range scans, with 10 `Next`s per `Seek` like above. + +With that out of the way, let’s see how the new BlobDB performs against traditional leveled and universal compaction. In the next few sections, we’ll be looking at write amplification as well as read and write performance. We’ll also briefly compare the write performance of the new BlobDB with the legacy implementation. + +### Write amplification + +Reducing write amp is the original motivation for key-value separation. Here, we follow RocksDB’s definition of write amplification (as used in compaction statistics and the info log). That is, we define write amp as the total amount of data written by flushes and compactions divided by the amount of data written by flushes, where “data written” includes SST files and blob files as well (if applicable). The following charts show that BlobDB significantly reduces write amplification for all of our (non-read only) workloads. + +For the initial load, where due to the nature of the workload both leveled and universal already have a low write amp factor of 1.6, BlobDB has a write amp close to the theoretical minimum of 1.0, namely in the 1.0..1.02 range, depending on value size. How is this possible? Well, the trick is that when key-value separation is used, the full compaction step only has to sort the keys but not the values. This results in a write amp that is about **36% lower** than the already low write amp you get with either leveled or universal. + +In the case of the overwrite workload, BlobDB had a write amp between 1.4 and 1.7 depending on value size. This is around **75-78% lower** than the write amp of leveled compaction (6.1 to 6.8) and **70-77% lower** than universal (5.7 to 6.2); for this workload, there wasn’t a huge difference between the performance of leveled and universal. + +When it comes to the point lookup/write mix workload, BlobDB had a write amp between 1.4 and 1.8. This is **83-88% lower** than the write amp of leveled compaction, which had values between 10.8 and 12.5. Universal fared much better than leveled under this workload, and had write amp in the 2.2..6.6 range; however, BlobDB still provided significant gains for all value sizes we tested: namely, write amp was **18-77% lower** than that of universal, depending on value size. + +As for the range scan/write mix workload, BlobDB again had a write amp between 1.4 and 1.8, while leveled had values between 13.6 and 14.9, and universal was between 2.8 and 5.0. In other words, BlobDB’s write amp was **88-90% lower** than that of leveled, and **46-70% lower** than that of universal. + +![Write amplification](/static/images/integrated-blob-db/BlobDB_Benchmarks_Write_Amp.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +### Write performance + +In terms of write performance, there are other factors to consider besides write amplification. The following charts show some interesting metrics for the two write-only workloads (initial load and overwrite). As discussed earlier, these two workloads perform a fixed amount of work; the two charts in the top row show how long it took BlobDB, leveled, and universal to complete that work. Note that each bar is broken down into two, corresponding to the two stages of each workload (random write and full compaction for initial load, and random write and waiting for compactions for overwrite). + +For initial load, note that the random write stage takes the same amount of time regardless of which algorithm is used. This is not surprising considering the fact that compactions are disabled during this stage and thus RocksDB is simply writing L0 files (and in BlobDB’s case, blob files) as fast as it can. The second stage, on the other hand, is very different: as mentioned above, BlobDB essentially only needs to read, sort, and rewrite the keys during compaction, which can be done much much faster (with 1 MB values, more than a hundred times faster) than doing the same for large key-values. Due to this, initial load completed **2.3x to 4.7x faster** overall when using BlobDB. + +As for the overwrite workload, BlobDB performs much better during both stages. The two charts in the bottom row help explain why. In the case of both leveled and universal compaction, compactions can’t keep up with the write rate, which eventually leads to back pressure in the form of write stalls. As shown in the chart below, both leveled and universal stall between ~40% and ~70% of the time; on the other hand, BlobDB is stall-free except for the largest value size tested (1 MB). This naturally leads to higher throughput, namely **2.1x to 3.5x higher** throughput compared to leveled, and **1.6x to 3.0x higher** throughput compared to universal. The overwrite time chart also shows that the catch-up stage that waits for all compactions to finish is much shorter (and in fact, at larger value sizes, negligible) with BlobDB. + +![Write performance](/static/images/integrated-blob-db/BlobDB_Benchmarks_Write_Perf.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +### Read/write and read-only performance + +The charts below show the read performance (in terms of operations per second) of BlobDB versus leveled and universal compaction under the two read/write workloads and the two read-only workloads. BlobDB meets or exceeds the read performance of leveled compaction, except for workloads involving range scans at the two smallest value sizes tested (1 KB and 4 KB). It also provides better (in some cases, much better) read performance than universal across the board. In particular, BlobDB provides up **1.4x higher** read performance than leveled (for larger values), and up to **5.6x higher** than universal. + +![Read-write and read-only performance](/static/images/integrated-blob-db/BlobDB_Benchmarks_RW_RO_Perf.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +### Comparing the two BlobDB implementations + +To compare the write performance of the new BlobDB with the legacy implementation, we ran two versions of the first (single-threaded random write) stage of the initial load benchmark using 1 KB values: one with WAL disabled, and one with WAL enabled. The new implementation completed the load **4.6x faster** than the old one without WAL, and **2.3x faster** with WAL. + +![Comparing the two BlobDB implementations](/static/images/integrated-blob-db/BlobDB_Benchmarks_Legacy_Vs_Integrated.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +## Future work + +There are a few remaining features that are not yet supported by the new BlobDB. The most important one is `Merge` (and the related `GetMergeOperands` API); in addition, we don’t currently support the `EventListener` interface, the `GetLiveFilesMetaData` and `GetColumnFamilyMetaData` APIs, secondary instances, and ingestion of blob files. We will continue to work on closing this gap. + +We also have further plans when it comes to performance. These include optimizing garbage collection, introducing a dedicated cache for blobs, improving iterator and `MultiGet` performance, and evolving the blob file format amongst others. + diff --git a/src/rocksdb/docs/_posts/2021-05-26-online-validation.markdown b/src/rocksdb/docs/_posts/2021-05-26-online-validation.markdown new file mode 100644 index 000000000..33e9dfc15 --- /dev/null +++ b/src/rocksdb/docs/_posts/2021-05-26-online-validation.markdown @@ -0,0 +1,17 @@ +--- +title: Online Validation +layout: post +author: sdong +category: blog +--- +To prevent or mitigate data corrution in RocksDB when some software or hardware issues happens, we keep adding online consistency checks and improving existing ones. + +We improved ColumnFamilyOptions::force_consistency_checks and enabled it by default. The option does some basic consistency checks to LSM-tree, e.g., files in one level are not overlapping. The DB will be frozen from new writes if a violation is detected. Previously, the feature’s check was too limited and didn’t always freeze the DB in a timely manner. Last year, we made the checking stricter so that it can [catch much more corrupted LSM-tree structures](https://github.com/facebook/rocksdb/pull/6901). We also fixed several issues where the checking failure was swallowed without freezing the DB. After making force_consistency_checks more reliable, we changed the default value to be on. + +ColumnFamilyOptions::paranoid_file_checks does some more expensive extra checking when generating a new SST file. Last year, we advanced coverage to this feature: after every SST file is generated, the SST file is created, read back keys one by one and check two things: (1) the keys are in comparator order (also available and enabled by default during file write via ColumnFamilyOptions::check_flush_compaction_key_order); (2) the hash of all the KVs is the same as calculated when we add KVs into it. These checks detect certain corruptions so we can prevent the corrupt files from being applied to the DB. We suggest users turn it on at least in shadow environments, and consider to run it in production too if you can afford the overheads. + +A recent feature is added to check the count of entries added into memtable while flushing it into an SST file. This feature is to have some online coverage to memtable corruption, caused by either software bug or hardware issue. This feature will be released in the coming release (6.21) and by default on. In the future, we will check more counters during memtables, e.g. number of puts or number of deletes. + +We also improved the reporting of online validation errors to improve debuggability. For example, failure to parse a corrupt key now reports details about the corrupt key. Since we did not want to expose key data in logs, error messages, etc., by default, this reporting is opt-in via DBOptions::allow_data_in_errors. + +More online checking features are planned and some are more sophisticated, including key/value checksums and sample based query validation. diff --git a/src/rocksdb/docs/_posts/2021-05-27-rocksdb-secondary-cache.markdown b/src/rocksdb/docs/_posts/2021-05-27-rocksdb-secondary-cache.markdown new file mode 100644 index 000000000..3ad1141bf --- /dev/null +++ b/src/rocksdb/docs/_posts/2021-05-27-rocksdb-secondary-cache.markdown @@ -0,0 +1,195 @@ +--- +title: RocksDB Secondary Cache +layout: post +author: anand1976 +category: blog +--- +## Introduction + +The RocksDB team is implementing support for a block cache on non-volatile media, such as a local flash device or NVM/SCM. It can be viewed as an extension of RocksDB’s current volatile block cache (LRUCache or ClockCache). The non-volatile block cache acts as a second tier cache that contains blocks evicted from the volatile cache. Those blocks are then promoted to the volatile cache as they become hotter due to access. + +This feature is meant for cases where the DB is located on remote storage or cloud storage. The non-volatile cache is officially referred to in RocksDB as the SecondaryCache. By maintaining a SecondaryCache that’s an order of magnitude larger than DRAM, fewer reads would be required from remote storage, thus reducing read latency as well as network bandwidth consumption. + +From the user point of view, the local flash cache will support the following requirements - + +1. Provide a pointer to a secondary cache when opening a DB +2. Be able to share the secondary cache across DBs in the same process +3. Have multiple secondary caches on a host +4. Support persisting the cache across process restarts and reboots by ensuring repeatability of the cache key + +![Architecture](/static/images/rocksdb-secondary-cache/arch_diagram.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +## Design + +When designing the API for a SecondaryCache, we had a choice between making it visible to the RocksDB code (table reader) or hiding it behind the RocksDB block cache. There are several advantages of hiding it behind the block cache - + +* Allows flexibility in insertion of blocks into the secondary cache. A block can be inserted on eviction from the RAM tier, or it could be eagerly inserted. +* It makes the rest of the RocksDB code less complex by providing a uniform interface regardless of whether a secondary cache is configured or not +* Makes parallel reads, peeking in the cache for prefetching, failure handling etc. easier +* Makes it easier to extend to compressed data if needed, and allows other persistent media, such as PM, to be added as an additional tier + + +We decided to make the secondary cache transparent to the rest of RocksDB code by hiding it behind the block cache. A key issue that we needed to address was the allocation and ownership of memory of the cached items - insertion into the secondary cache may require that memory be allocated by the same. This means that parts of the cached object that can be transferred to the secondary cache needs to be copied out (referred to as **unpacking**), and on a lookup the data stored in the secondary cache needs to be provided to the object constructor (referred to as **packing**). For RocksDB cached objects such as data blocks, index and filter blocks, and compression dictionaries, unpacking involves copying out the raw uncompressed BlockContents of the block, and packing involves constructing the corresponding block/index/filter/dictionary object using the raw uncompressed data. + +Another alternative we considered was the existing PersistentCache interface. However, we decided to not pursue it and eventually deprecate it for the following reasons - +* It is exposed directly to the table reader code, which makes it more difficult to implement different policies such as inclusive/exclusive cache, as well as extending it to more sophisticated admission control policies +* The interface does not allow for custom memory allocation and object packing/unpacking, so new APIs would have to be defined anyway +* The current PersistentCache implementation is very simple and does not have any admission control policies + +## API + +The interface between RocksDB’s block cache and the secondary cache is designed to allow pluggable implementations. For FB internal usage, we plan to use Cachelib with a wrapper to provide the plug-in implementation and use folly and other fbcode libraries, which cannot be used directly by RocksDB, to efficiently implement the cache operations. The following diagrams show the flow of insertion and lookup of a block. + +![Insert flow](/static/images/rocksdb-secondary-cache/insert_flow.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +![Lookup flow](/static/images/rocksdb-secondary-cache/lookup_flow.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +An item in the secondary cache is referenced by a SecondaryCacheHandle. The handle may not be immediately ready or have a valid value. The caller can call IsReady() to determine if its ready, and can call Wait() in order to block until it becomes ready. The caller must call Value() after it becomes ready to determine if the item was successfully read. Value() must return nullptr on failure. + +``` +class SecondaryCacheHandle { + public: + virtual ~SecondaryCacheHandle() {} + + // Returns whether the handle is ready or not + virtual bool IsReady() = 0; + + // Block until handle becomes ready + virtual void Wait() = 0; + + // Return the value. If nullptr, it means the lookup was unsuccessful + virtual void* Value() = 0; + + // Return the size of value + virtual size_t Size() = 0; +}; +``` + +The user of the secondary cache (for example, BlockBasedTableReader indirectly through LRUCache) must implement the callbacks defined in CacheItemHelper, in order to facilitate the unpacking/packing of objects for saving to and restoring from the secondary cache. The CreateCallback must be implemented to construct a cacheable object from the raw data in secondary cache. + +``` + // The SizeCallback takes a void* pointer to the object and returns the size + // of the persistable data. It can be used by the secondary cache to allocate + // memory if needed. + using SizeCallback = size_t (*)(void* obj); + + // The SaveToCallback takes a void* object pointer and saves the persistable + // data into a buffer. The secondary cache may decide to not store it in a + // contiguous buffer, in which case this callback will be called multiple + // times with increasing offset + using SaveToCallback = Status (*)(void* from_obj, size_t from_offset, + size_t length, void* out); + + // A function pointer type for custom destruction of an entry's + // value. The Cache is responsible for copying and reclaiming space + // for the key, but values are managed by the caller. + using DeleterFn = void (*)(const Slice& key, void* value); + + // A struct with pointers to helper functions for spilling items from the + // cache into the secondary cache. May be extended in the future. An + // instance of this struct is expected to outlive the cache. + struct CacheItemHelper { + SizeCallback size_cb; + SaveToCallback saveto_cb; + DeleterFn del_cb; + + CacheItemHelper() : size_cb(nullptr), saveto_cb(nullptr), del_cb(nullptr) {} + CacheItemHelper(SizeCallback _size_cb, SaveToCallback _saveto_cb, + DeleterFn _del_cb) + : size_cb(_size_cb), saveto_cb(_saveto_cb), del_cb(_del_cb) {} + }; + + // The CreateCallback is passed by the block cache user to Lookup(). It + // takes in a buffer from the NVM cache and constructs an object using + // it. The callback doesn't have ownership of the buffer and should + // copy the contents into its own buffer. + // typedef std::function<Status(void* buf, size_t size, void** out_obj, + // size_t* charge)> + // CreateCallback; + using CreateCallback = std::function<Status(void* buf, size_t size, + void** out_obj, size_t* charge)>; +``` + +The secondary cache provider must provide a concrete implementation of the SecondaryCache abstract class. + +``` +// SecondaryCache +// +// Cache interface for caching blocks on a secondary tier (which can include +// non-volatile media, or alternate forms of caching such as compressed data) +class SecondaryCache { + public: + virtual ~SecondaryCache() {} + + virtual std::string Name() = 0; + + static const std::string Type() { return "SecondaryCache"; } + + // Insert the given value into this cache. The value is not written + // directly. Rather, the SaveToCallback provided by helper_cb will be + // used to extract the persistable data in value, which will be written + // to this tier. The implementation may or may not write it to cache + // depending on the admission control policy, even if the return status is + // success. + virtual Status Insert(const Slice& key, void* value, + const Cache::CacheItemHelper* helper) = 0; + + // Lookup the data for the given key in this cache. The create_cb + // will be used to create the object. The handle returned may not be + // ready yet, unless wait=true, in which case Lookup() will block until + // the handle is ready + virtual std::unique_ptr<SecondaryCacheHandle> Lookup( + const Slice& key, const Cache::CreateCallback& create_cb, bool wait) = 0; + + // At the discretion of the implementation, erase the data associated + // with key + virtual void Erase(const Slice& key) = 0; + + // Wait for a collection of handles to become ready. This would be used + // by MultiGet, for example, to read multitple data blocks in parallel + virtual void WaitAll(std::vector<SecondaryCacheHandle*> handles) = 0; + + virtual std::string GetPrintableOptions() const = 0; +}; +``` + +A SecondaryCache is configured by the user by providing a pointer to it in LRUCacheOptions - +``` +struct LRUCacheOptions { + ... + // A SecondaryCache instance to use as an additional cache tier + std::shared_ptr<SecondaryCache> secondary_cache; + ... +}; +``` + +## Current Status + +The initial RocksDB support for the secondary cache has been merged into the main branch, and will be available in the 6.21 release. This includes providing a way for the user to configure a secondary cache when instantiating RocksDB’s LRU cache (volatile block cache), spilling blocks evicted from the LRU cache to the flash cache, promoting a block read from the SecondaryCache to the LRU cache, update tools such as cache_bench and db_bench to specify a flash cache. The relevant PRs are [#8271](https://github.com/facebook/rocksdb/pull/8271), [#8191](https://github.com/facebook/rocksdb/pull/8191), and [#8312](https://github.com/facebook/rocksdb/pull/8312). + +We prototyped an end-to-end solution, with the above PRs as well as a Cachelib based implementation of the SecondaryCache. We ran a mixgraph benchmark to simulate a realistic read/write workload. The results showed a 15% gain with the local flash cache over no local cache, and a ~25-30% reduction in network reads with a corresponding decrease in cache misses. + +![Throughput](/static/images/rocksdb-secondary-cache/Mixgraph_throughput.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +![Hit Rate](/static/images/rocksdb-secondary-cache/Mixgraph_hit_rate.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +## Future Work + +In the short term, we plan to do the following in order to fully integrate the SecondaryCache with RocksDB - + +1. Use DB session ID as the cache key prefix to ensure uniqueness and repeatability +2. Optimize flash cache usage of MultiGet and iterator workloads +3. Stress testing +4. More benchmarking + +Longer term, we plan to deploy this in production at Facebook. + +## Call to Action + +We are hoping for a community contribution of a secondary cache implementation, which would make this feature usable by the broader RocksDB userbase. If you are interested in contributing, please reach out to us in [this issue](https://github.com/facebook/rocksdb/issues/8347). + diff --git a/src/rocksdb/docs/_posts/2021-05-31-dictionary-compression.markdown b/src/rocksdb/docs/_posts/2021-05-31-dictionary-compression.markdown new file mode 100644 index 000000000..9b0f45293 --- /dev/null +++ b/src/rocksdb/docs/_posts/2021-05-31-dictionary-compression.markdown @@ -0,0 +1,157 @@ +--- +title: Preset Dictionary Compression +layout: post +author: ajkr +category: blog +--- + +## Summary + +Compression algorithms relying on an adaptive dictionary, such as LZ4, zstd, and zlib, struggle to achieve good compression ratios on small inputs when using the basic compress API. +With the basic compress API, the compressor starts with an empty dictionary. +With small inputs, not much content gets added to the dictionary during the compression. +Combined, these factors suggest the dictionary will never have enough contents to achieve great compression ratios. + +RocksDB groups key-value pairs into data blocks before storing them in files. +For use cases that are heavy on random accesses, smaller data block size is sometimes desirable for reducing I/O and CPU spent reading blocks. +However, as explained above, smaller data block size comes with the downside of worse compression ratio when using the basic compress API. + +Fortunately, zstd and other libraries offer advanced compress APIs that preset the dictionary. +A preset dictionary makes it possible for the compressor to start from a useful state instead of from an empty one, making compression immediately effective. + +RocksDB now optionally takes advantage of these dictionary presetting APIs. +The challenges in integrating this feature into the storage engine were more substantial than apparent on the surface. +First, we need to target a preset dictionary to the relevant data. +Second, preset dictionaries need to be trained from data samples, which need to be gathered. +Third, preset dictionaries need to be persisted since they are needed at decompression time. +Fourth, overhead in accessing the preset dictionary must be minimized to prevent regression in critical code paths. +Fifth, we need easy-to-use measurement to evaluate candidate use cases and production impact. + +In production, we have deployed dictionary presetting to save space in multiple RocksDB use cases with data block size 8KB or smaller. +We have measured meaningful benefit to compression ratio in use cases with data block size up to 16KB. +We have also measured a use case that can save both CPU and space by reducing data block size and turning on dictionary presetting at the same time. + +## Feature design +#### Targeting + +Over time we have considered a few possibilities for the scope of a dictionary. + +- Subcompaction +- SST file +- Column family + +The original choice was subcompaction scope. +This enabled an approach with minimal buffering overhead because we could collect samples while generating the first output SST file. +The dictionary could then be trained and applied to subsequent SST files in the same subcompaction. + +However, we found a large use case where the proximity of data in the keyspace was more correlated with its similarity than we had predicted. +In particular, the approach of training a dictionary on an adjacent file yielded substantially worse ratios than training the dictionary on the same file it would be used to compress. +In response to this finding, we changed the preset dictionary scope to per SST file. + +With this change in approach, we had to face the problem we had hoped to avoid: how can we compress all of an SST file's data blocks with the same preset dictionary while that dictionary can only be trained after many data blocks have been sampled? +The solutions we considered both involved a new overhead. +We could read the input more than once and introduce I/O overhead, or we could buffer the uncompressed output file data blocks until a dictionary is trained, introducing memory overhead. +We chose to take the hit on memory overhead. + +Another approach that we considered was associating multiple dictionaries with a column family. +For example, in MyRocks there could be a dictionary trained on data from each large table. +When compressing a data block, we would look at the table to which its data belongs and pick the corresponding dictionary. +However, this approach would introduce many challenges. +RocksDB would need to be aware of the key schema to know where are the table boundaries. +RocksDB would also need to periodically update the dictionaries to account for changes in data pattern. +It would need somewhere to store dictionaries at column family scope. +Overall, we thought these challenges were too difficult to pursue the approach. + +#### Training + +![](/static/images/dictcmp/dictcmp_raw_sampled.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} +<p align="center"><i> +Raw samples mode (`zstd_max_train_bytes == 0`) +</i></p> + +As mentioned earlier, the approach we took is to build the dictionary from buffered uncompressed data blocks. +The first row of data blocks in these diagrams illustrate this buffering. +The second row illustrates training samples selected from the buffered blocks. +In raw samples mode (above), the final dictionary is simply the concatenation of these samples. +Whereas, in zstd training mode (below), these samples will be passed to the trainer to produce the final dictionary. + +![](/static/images/dictcmp/dictcmp_zstd_trained.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} +<p align="center"><i> +zstd training mode (`zstd_max_train_bytes > 0`) +</i></p> + +#### Compression path + +Once the preset dictionary is generated by the above process, we apply it to the buffered data blocks and write them to the output file. +Thereafter, newly generated data blocks are immediately compressed and written out. + +One optimization here is available to zstd v0.7.0+ users. +Instead of deserializing the dictionary on each compress invocation, we can do that work once and reuse it. +A `ZSTD_CDict` holds this digested dictionary state and is passed to the compress API. + +#### Persistence + +When an SST file's data blocks are compressed using a preset dictionary, that dictionary is stored inside the file for later use in decompression. + +![](/static/images/dictcmp/dictcmp_sst_blocks.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} +<p align="center"><i> +SST file layout with the preset dictionary in its own (uncompressed) block +</i></p> + +#### Decompression path + +To decompress, we need to provide both the data block and the dictionary used to compress it. +Since dictionaries are just blocks in a file, we access them through block cache. +However this additional load on block cache can be problematic. +It can be alleviated by pinning the dictionaries to avoid going through the LRU locks. + +An optimization analogous to the digested dictionary exists for certain zstd users (see User API section for details). +When enabled, the block cache stores the digested dictionary state for decompression (`ZSTD_DDict`) instead of the block contents. +In some cases we have seen decompression CPU decrease overall when enabling dictionary thanks to this optimization. + +#### Measurement + +Typically our first step in evaluating a candidate use case is an offline analysis of the data. +This gives us a quick idea whether presetting dictionary will be beneficial without any code, config, or data changes. +Our `sst_dump` tool reports what size SST files would have been using specified compression libraries and options. +We can select random SST files and compare the size with vs. without dictionary. + +When that goes well, the next step is to see how it works in a live DB, like a production shadow or canary. +There we can observe how it affects application/system metrics. + +Even after dictionary is enabled, there is the question of how much space was finally saved. +We provide a way to A/B test size with vs. without dictionary while running in production. +This feature picks a sample of data blocks to compress in multiple ways -- one of the outputs is stored, while the other outputs are thrown away after counting their size. +Due to API limitations, the stored output always has to be the dictionary-compressed one, so this feature can only be used after enabling dictionary. +The size with and without dictionary are stored in the SST file as table properties. +These properties can be aggregated across all SST files in a DB (and across all DBs in a tier) to learn the final space saving. + +## User API + +RocksDB allows presetting compression dictionary for users of LZ4, zstd, and zlib. +The most advanced capabilities are available to zstd v1.1.4+ users who statically link (see below). +Newer versions of zstd (v1.3.6+) have internal changes to the dictionary trainer and digested dictionary management, which significantly improve memory and CPU efficiency. + +Run-time settings: + +- `CompressionOptions::max_dict_bytes`: Limit on per-SST file dictionary size. Increasing this causes dictionaries to consume more space and memory for the possibility of better data block compression. A typical value we use is 16KB. +- (**zstd only**) `CompressionOptions::zstd_max_train_bytes`: Limit on training data passed to zstd dictionary trainer. Larger values cause the training to consume more CPU (and take longer) while generating more effective dictionaries. The starting point guidance we received from zstd team is to set it to 100x `CompressionOptions::max_dict_bytes`. +- `CompressionOptions::max_dict_buffer_bytes`: Limit on data buffering from which training samples are gathered. By default we buffer up to the target file size per ongoing background job. If this amount of memory is concerning, this option can constrain the buffering with the downside that training samples will cover a smaller portion of the SST file. Work is ongoing to charge this memory usage to block cache so it will not need to be accounted for separately. +- `BlockBasedTableOptions::cache_index_and_filter_blocks`: Controls whether metadata blocks including dictionary are accessed through block cache or held in table reader memory (yes, its name is outdated). +- `BlockBasedTableOptions::metadata_cache_options`: Controls what metadata blocks are pinned in block cache. Pinning avoids LRU contention at the risk of cold blocks holding memory. +- `ColumnFamilyOptions::sample_for_compression`: Controls frequency of measuring extra compressions on data blocks using various libraries with default settings (i.e., without preset dictionary). + +Compile-time setting: + +- (**zstd only**) `EXTRA_CXXFLAGS=-DZSTD_STATIC_LINKING_ONLY`: Hold digested dictionaries in block cache to save repetitive deserialization overhead. This saves a lot of CPU for read-heavy workloads. This compiler flag is necessary because one of the digested dictionary APIs we use is marked as experimental. We still use it in production, however. + +Function: + +- `DB::GetPropertiesOfAllTables()`: The properties `kSlowCompressionEstimatedDataSize` and `kFastCompressionEstimatedDataSize` estimate what the data block size (`kDataSize`) would have been if the corresponding compression library had been used. These properties are only present when `ColumnFamilyOptions::sample_for_compression` causes one or more samples to be measured, and they become more accurate with higher sampling frequency. + +Tool: + +- `sst_dump --command=recompress`: Offline analysis tool that reports what the SST file size would have been using the specified compression library and options. diff --git a/src/rocksdb/docs/_posts/2021-12-29-ribbon-filter.markdown b/src/rocksdb/docs/_posts/2021-12-29-ribbon-filter.markdown new file mode 100644 index 000000000..c6a52ce84 --- /dev/null +++ b/src/rocksdb/docs/_posts/2021-12-29-ribbon-filter.markdown @@ -0,0 +1,281 @@ +--- +title: Ribbon Filter +layout: post +author: pdillinger +category: blog +--- + +## Summary +Since version 6.15 last year, RocksDB supports Ribbon filters, a new +alternative to Bloom filters that save space, especially memory, at +the cost of more CPU usage, mostly in constructing the filters in the +background. Most applications with long-lived data (many hours or +longer) will likely benefit from adopting a Ribbon+Bloom hybrid filter +policy. Here we explain why and how. + +[Ribbon filter on RocksDB wiki](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#ribbon-filter) + +[Ribbon filter paper](https://arxiv.org/abs/2103.02515) + +## Problem & background +Bloom filters play a critical role in optimizing point queries and +some range queries in LSM-tree storage systems like RocksDB. Very +large DBs can use 10% or more of their RAM memory for (Bloom) filters, +so that (average case) read performance can be very good despite high +(worst case) read amplification, [which is useful for lowering write +and/or space +amplification](http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html). +Although the `format_version=5` Bloom filter in RocksDB is extremely +fast, all Bloom filters use around 50% more space than is +theoretically possible for a hashed structure configured for the same +false positive (FP) rate and number of keys added. What would it take +to save that significant share of “wasted” filter memory, and when +does it make sense to use such a Bloom alternative? + +A number of alternatives to Bloom filters were known, especially for +static filters (not modified after construction), but all the +previously known structures were unsatisfying for SSTs because of some +combination of +* Not enough space savings for CPU increase. For example, [Xor + filters](https://arxiv.org/abs/1912.08258) use 3-4x more CPU than + Bloom but only save 15-20% of + space. [GOV](https://arxiv.org/pdf/1603.04330.pdf) can save around + 30% space but requires around 10x more CPU than Bloom. +* Inconsistent space savings. [Cuckoo + filters](https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf) + and Xor+ filters offer significant space savings for very low FP + rates (high bits per key) but little or no savings for higher FP + rates (low bits per key). ([Higher FP rates are considered best for + largest levels of + LSM.](https://stratos.seas.harvard.edu/files/stratos/files/monkeykeyvaluestore.pdf)) + [Spatially-coupled Xor + filters](https://arxiv.org/pdf/2001.10500.pdf) require very large + number of keys per filter for large space savings. +* Inflexible configuration. No published alternatives offered the same + continuous configurability of Bloom filters, where any FP rate and + any fractional bits per key could be chosen. This flexibility + improves memory efficiency with the `optimize_filters_for_memory` + option that minimizes internal fragmentation on filters. + +## Ribbon filter development and implementation +The Ribbon filter came about when I developed a faster, simpler, and +more adaptable algorithm for constructing a little-known [Xor-based +structure from Dietzfelbinger and +Walzer](https://arxiv.org/pdf/1907.04750.pdf). It has very good space +usage for required CPU time (~30% space savings for 3-4x CPU) and, +with some engineering, Bloom-like configurability. The complications +were managable for use in RocksDB: +* Ribbon space efficiency does not naturally scale to very large + number of keys in a single filter (whole SST file or partition), but + with the current 128-bit Ribbon implementation in RocksDB, even 100 + million keys in one filter saves 27% space vs. Bloom rather than 30% + for 100,000 keys in a filter. +* More temporary memory is required during construction, ~230 bits per + key for 128-bit Ribbon vs. ~75 bits per key for Bloom filter. A + quick calculation shows that if you are saving 3 bits per key on the + generated filter, you only need about 50 generated filters in memory + to offset this temporary memory usage. (Thousands of filters in + memory is typical.) Starting in RocksDB version 6.27, this temporary + memory can be accounted for under block cache using + `BlockBasedTableOptions::reserve_table_builder_memory`. +* Ribbon filter queries use relatively more CPU for lower FP rates + (but still O(1) relative to number of keys added to filter). This + should be OK because lower FP rates are only appropriate when then + cost of a false positive is very high (worth extra query time) or + memory is not so constrained (can use Bloom instead). + +Future: data in [the paper](https://arxiv.org/abs/2103.02515) suggests +that 32-bit Balanced Ribbon (new name: [Bump-Once +Ribbon](https://arxiv.org/pdf/2109.01892.pdf)) would improve all of +these issues and be better all around (except for code complexity). + +## Ribbon vs. Bloom in RocksDB configuration +Different applications and hardware configurations have different +constraints, but we can use hardware costs to examine and better +understand the trade-off between Bloom and Ribbon. + +### Same FP rate, RAM vs. CPU hardware cost +Under ideal conditions where we can adjust our hardware to suit the +application, in terms of dollars, how much does it cost to construct, +query, and keep in memory a Bloom filter vs. a Ribbon filter? The +Ribbon filter costs more for CPU but less for RAM. Importantly, the +RAM cost directly depends on how long the filter is kept in memory, +which in RocksDB is essentially the lifetime of the filter. +(Temporary RAM during construction is so short-lived that it is +ignored.) Using some consumer hardware and electricity prices and a +predicted balance between construction and queries, we can compute a +“break even” duration in memory. To minimize cost, filters with a +lifetime shorter than this should be Bloom and filters with a lifetime +longer than this should be Ribbon. (Python code) + +``` +# Commodity prices based roughly on consumer prices and rough guesses +# Upfront cost of a CPU per hardware thread +upfront_dollars_per_cpu_thread = 30.0 + +# CPU average power usage per hardware thread +watts_per_cpu_thread = 3.5 + +# Upfront cost of a GB of RAM +upfront_dollars_per_gb_ram = 8.0 + +# RAM average power usage per GB +# https://www.crucial.com/support/articles-faq-memory/how-much-power-does-memory-use +watts_per_gb_ram = 0.375 + +# Estimated price of power per kilowatt-hour, including overheads like conversion losses and cooling +dollars_per_kwh = 0.35 + +# Assume 3 year hardware lifetime +hours_per_lifetime = 3 * 365 * 24 +seconds_per_lifetime = hours_per_lifetime * 60 * 60 + +# Number of filter queries per key added in filter construction is heavily dependent on workload. +# When replication is in layer above RocksDB, it will be low, likely < 1. When replication is in +# storage layer below RocksDB, it will likely be > 1. Using a rough and general guesstimate. +key_query_per_construct = 1.0 + +#================================== +# Bloom & Ribbon filter performance +typical_bloom_bits_per_key = 10.0 +typical_ribbon_bits_per_key = 7.0 + +# Speeds here are sensitive to many variables, especially query speed because it +# is so dependent on memory latency. Using this benchmark here: +# for IMPL in 2 3; do +# ./filter_bench -impl=$IMPL -quick -m_keys_total_max=200 -use_full_block_reader +# done +# and "Random filter" queries. +nanoseconds_per_construct_bloom_key = 32.0 +nanoseconds_per_construct_ribbon_key = 140.0 + +nanoseconds_per_query_bloom_key = 500.0 +nanoseconds_per_query_ribbon_key = 600.0 + +#================================== +# Some constants +kwh_per_watt_lifetime = hours_per_lifetime / 1000.0 +bits_per_gb = 8 * 1024 * 1024 * 1024 + +#================================== +# Crunching the numbers +# on CPU for constructing filters +dollars_per_cpu_thread_lifetime = upfront_dollars_per_cpu_thread + watts_per_cpu_thread * kwh_per_watt_lifetime * dollars_per_kwh +dollars_per_cpu_thread_second = dollars_per_cpu_thread_lifetime / seconds_per_lifetime + +dollars_per_construct_bloom_key = dollars_per_cpu_thread_second * nanoseconds_per_construct_bloom_key / 10**9 +dollars_per_construct_ribbon_key = dollars_per_cpu_thread_second * nanoseconds_per_construct_ribbon_key / 10**9 + +dollars_per_query_bloom_key = dollars_per_cpu_thread_second * nanoseconds_per_query_bloom_key / 10**9 +dollars_per_query_ribbon_key = dollars_per_cpu_thread_second * nanoseconds_per_query_ribbon_key / 10**9 + +dollars_per_bloom_key_cpu = dollars_per_construct_bloom_key + key_query_per_construct * dollars_per_query_bloom_key +dollars_per_ribbon_key_cpu = dollars_per_construct_ribbon_key + key_query_per_construct * dollars_per_query_ribbon_key + +# on holding filters in RAM +dollars_per_gb_ram_lifetime = upfront_dollars_per_gb_ram + watts_per_gb_ram * kwh_per_watt_lifetime * dollars_per_kwh +dollars_per_gb_ram_second = dollars_per_gb_ram_lifetime / seconds_per_lifetime + +dollars_per_bloom_key_in_ram_second = dollars_per_gb_ram_second / bits_per_gb * typical_bloom_bits_per_key +dollars_per_ribbon_key_in_ram_second = dollars_per_gb_ram_second / bits_per_gb * typical_ribbon_bits_per_key + +#================================== +# How many seconds does it take for the added cost of constructing a ribbon filter instead +# of bloom to be offset by the added cost of holding the bloom filter in memory? +break_even_seconds = (dollars_per_ribbon_key_cpu - dollars_per_bloom_key_cpu) / (dollars_per_bloom_key_in_ram_second - dollars_per_ribbon_key_in_ram_second) +print(break_even_seconds) +# -> 3235.1647730256936 +``` + +So roughly speaking, filters that live in memory for more than an hour +should be Ribbon, and filters that live less than an hour should be +Bloom. This is very interesting, but how long do filters live in +RocksDB? + +First let's consider the average case. Write-heavy RocksDB loads are +often backed by flash storage, which has some specified write +endurance for its intended lifetime. This can be expressed as *device +writes per day* (DWPD), and supported DWPD is typically < 10.0 even +for high end devices (excluding NVRAM). Roughly speaking, the DB would +need to be writing at a rate of 20+ DWPD for data to have an average +lifetime of less than one hour. Thus, unless you are prematurely +burning out your flash or massively under-utilizing available storage, +using the Ribbon filter has the better cost profile *on average*. + +### Predictable lifetime +But we can do even better than optimizing for the average case. LSM +levels give us very strong data lifetime hints. Data in L0 might live +for minutes or a small number of hours. Data in Lmax might live for +days or weeks. So even if Ribbon filters weren't the best choice on +average for a workload, they almost certainly make sense for the +larger, longer-lived levels of the LSM. As of RocksDB 6.24, you can +specify a minimum LSM level for Ribbon filters with +`NewRibbonFilterPolicy`, and earlier levels will use Bloom filters. + +### Resident filter memory +The above analysis assumes that nearly all filters for all live SST +files are resident in memory. This is true if using +`cache_index_and_filter_blocks=0` and `max_open_files=-1` (defaults), +but `cache_index_and_filter_blocks=1` is popular. In that case, +if you use `optimize_filters_for_hits=1` and non-partitioned filters +(a popular MyRocks configuration), it is also likely that nearly all +live filters are in memory. However, if you don't use +`optimize_filters_for_hits` and use partitioned filters, then +cold data (by age or by key range) can lead to only a portion of +filters being resident in memory. In that case, benefit from Ribbon +filter is not as clear, though because Ribbon filters are smaller, +they are more efficient to read into memory. + +RocksDB version 6.21 and later include a rough feature to determine +block cache usage for data blocks, filter blocks, index blocks, etc. +Data like this is periodically dumped to LOG file +(`stats_dump_period_sec`): + +``` +Block cache entry stats(count,size,portion): DataBlock(441761,6.82 GB,75.765%) FilterBlock(3002,1.27 GB,14.1387%) IndexBlock(17777,887.75 MB,9.63267%) Misc(1,0.00 KB,0%) +Block cache LRUCache@0x7fdd08104290#7004432 capacity: 9.00 GB collections: 2573 last_copies: 10 last_secs: 0.143248 secs_since: 0 +``` + +This indicates that at this moment in time, the block cache object +identified by `LRUCache@0x7fdd08104290#7004432` (potentially used +by multiple DBs) uses roughly 14% of its 9GB, about 1.27 GB, on filter +blocks. This same data is available through `DB::GetMapProperty` with +`DB::Properties::kBlockCacheEntryStats`, and (with some effort) can +be compared to total size of all filters (not necessarily in memory) +using `rocksdb.filter.size` from +`DB::Properties::kAggregatedTableProperties`. + +### Sanity checking lifetime +Can we be sure that using filters even makes sense for such long-lived +data? We can apply [the current 5 minute rule for caching SSD data in +RAM](http://renata.borovica-gajic.com/data/adms2017_5minuterule.pdf). A +4KB filter page holds data for roughly 4K keys. If we assume at least +one negative (useful) filter query in its lifetime per added key, it +can satisfy the 5 minute rule with a lifetime of up to about two +weeks. Thus, the lifetime threshold for “no filter” is about 300x +higher than the lifetime threshold for Ribbon filter. + +### What to do with saved memory +The default way to improve overall RocksDB performance with more +available memory is to use more space for caching, which improves +latency, CPU load, read IOs, etc. With +`cache_index_and_filter_blocks=1`, savings in filters will +automatically make room for caching more data blocks in block +cache. With `cache_index_and_filter_blocks=0`, consider increasing +block cache size. + +Using the space savings to lower filter FP rates is also an option, +but there is less evidence for this commonly improving existing +*optimized* configurations. + +## Generic recommendation +If using `NewBloomFilterPolicy(bpk)` for a large persistent DB using +compression, try using `NewRibbonFilterPolicy(bpk)` instead, which +will generate Ribbon filters during compaction and Bloom filters +for flush, both with the same FP rate as the old setting. Once new SST +files are generated under the new policy, this should free up some +memory for more caching without much effect on burst or sustained +write speed. Both kinds of filters can be read under either policy, so +there's always an option to adjust settings or gracefully roll back to +using Bloom filter only (keeping in mind that SST files must be +replaced to see effect of that change). diff --git a/src/rocksdb/docs/_posts/2022-07-18-per-key-value-checksum.markdown b/src/rocksdb/docs/_posts/2022-07-18-per-key-value-checksum.markdown new file mode 100644 index 000000000..6b9ad801c --- /dev/null +++ b/src/rocksdb/docs/_posts/2022-07-18-per-key-value-checksum.markdown @@ -0,0 +1,142 @@ +--- +title: "Per Key-Value Checksum" +layout: post +author: +- cbi42 +- ajkr +category: blog +--- + +## Summary + +Silent data corruptions can severely impact RocksDB users. As a key-value library, RocksDB resides at the bottom of the user space software stack for many diverse applications. Returning wrong query results can cause unpredictable consequences for our users so must be avoided. + +To prevent and detect corruption, RocksDB has several consistency checks [1], especially focusing on the storage layer. For example, SST files contain block checksums that are verified during reads, and each SST file has a full file checksum that can be verified when files are transferred. + +Other sources of corruptions, such as those from faulty CPU/memory or heap corruptions, pose risks for which protections are relatively underdeveloped. Meanwhile, recent work [2] suggests one per thousand machines in our fleet will at some point experience a hardware error that is exposed to an application. Additionally, software bugs can increase the risk of heap corruptions at any time. + +Hardware/heap corruptions are naturally difficult to detect in the application layer since they can compromise any data or control flow. Some factors we take into account when choosing where to add protection are the volume of data, the importance of the data, the CPU instructions that operate on the data, and the duration it resides in memory. One recently added protection, `detect_filter_construct_corruption`, has proven itself useful in preventing corrupt filters from being persisted. We have seen hardware encounter machine-check exceptions a few hours after we detected a corrupt filter. + +The next way we intend to detect hardware and heap corruptions before they cause queries to return wrong results is through developing a new feature: per key-value checksum. This feature will eventually provide optional end-to-end integrity protection for every key-value pair. RocksDB 7.4 offers substantial coverage of the user write and recovery paths with per key-value checksum protection. + +## User API + +For integrity protection during recovery, no change is required. Recovery is always protected. + +For user write protection, RocksDB allows the user to specify per key-value protection through `WriteOptions::protection_bytes_per_key` or pass in `protection_bytes_per_key` to `WriteBatch` constructor when creating a `WriteBatch` directly. Currently, only 0 (default, no protection) and 8 bytes per key are supported. This should be fine for write batches as they do not usually contain a huge number of keys. We are working on supporting more settings as 8 bytes per key might cause considerable memory overhead when the protection is extended to memtable entries. + +## Feature Design + +### Data Structures + +#### Protection info + +For protecting key-value pairs, we chose to use a hashing algorithm, xxh3 [3], for its good efficiency without relying on special hardware. While algorithms like crc32c can guarantee detection of certain patterns of bit flips, xxh3 offers no such guarantees. This is acceptable for us as we do not expect any particular error pattern [4], and even if we did, xxh3 can achieve a collision probability close enough to zero for us by tuning the number of protection bytes per key-value. + +Key-value pairs have multiple representations in RocksDB: in [WriteBatch](https://github.com/facebook/rocksdb/blob/7d0ecab570742c7280628b08ddc03cfd692f484f/db/write_batch.cc#L14-L31), in memtable [entries](https://github.com/facebook/rocksdb/blob/fc51b7f33adcba7ac725ed0e7fe8b8155aaeaee4/db/memtable.cc#L541-L545) and in [data blocks](https://github.com/facebook/rocksdb/blob/fc51b7f33adcba7ac725ed0e7fe8b8155aaeaee4/table/block_based/block_builder.cc#L21-L27). In this post we focus on key-values in write batches and memtable as in-memory data blocks are not yet protected. + +Besides user key and value, RocksDB includes internal metadata in the per key-value checksum calculation. Depending on the representation, internal metadata consists of some combination of sequence number, operation type, and column family ID. Note that since timestamp (when enabled) is part of the user key it is protected as well. + +The protection info consists of the XOR’d result of the xxh3 hash for all the protected components. This allows us to efficiently transform protection info for different representations. See below for an example converting WriteBatch protection info to memtable protection info. + +A risk of using XOR is the possibility of swapping corruptions (e.g., key becomes the value and the value becomes the key). To mitigate this risk, we use an independent seed for hashing each type of component. + +The following two figures illustrate how protection info in WriteBatch and memtable are calculated from a key-value’s components. + +![](/static/images/kv-checksum/ProtInfo-Writebatch.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +*Protection info for a key-value in a WriteBatch* +{: style="text-align: center"} + +![](/static/images/kv-checksum/ProtInfo-Memtable.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +*Protection info for a key-value in a memtable* +{: style="text-align: center"} + +The next figure illustrates how protection info for a key-value can be transformed to protect that same key-value in a different representation. Note this is done without recalculating the hash for all the key-value’s components. + +![](/static/images/kv-checksum/ProtInfo-Writebatch-to-Memtable.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +*Protection info for a key-value in a memtable derived from an existing WriteBatch protection info* +{: style="text-align: center"} + +Above, we see two (small) components are hashed: column family ID and sequence number. When a key-value is inserted from WriteBatch into memtable, it is assigned a sequence number and drops the column family ID since each memtable is associated with one column family. Recall the xxh3 of column family ID was included in the WriteBatch protection info, which is canceled out by the column family ID xxh3 included in the XOR. + +#### WAL fragment + +WAL (Write-ahead-log) persists write batches that correspond to operations in memtables and enables consistent database recovery after restart. RocksDB writes to WAL in chunks of some [fixed block size](https://github.com/facebook/rocksdb/blob/fc51b7f33adcba7ac725ed0e7fe8b8155aaeaee4/db/log_writer.h#L44) for efficiency. It is possible that some write batch does not fit into the space left in the current block and/or is larger than the fixed block size. Thus, serialized write batches (WAL records) are divided into WAL fragments before being written to WAL. The format of a WAL fragment is in the following diagram (there is another legacy format detailed in code [comments](https://github.com/facebook/rocksdb/blob/fc51b7f33adcba7ac725ed0e7fe8b8155aaeaee4/db/log_writer.h#L47-L59)). Roughly, the `Type` field indicates whether a fragment is at the beginning, middle or end of a record, and is used to group fragments. + +![](/static/images/kv-checksum/WAL-fragment.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +Note that each fragment is prefixed by a crc32c checksum that is calculated over `Type`, `Log #` and `Payload`. This ensures that RocksDB can detect corruptions that happened to the WAL in the storage layer. + +#### Write batch + +As mentioned above, a WAL record is a serialized `WriteBatch` that is split into physical fragments during writes to WAL. During DB recovery, once a WAL record is reconstructed from one or more fragments, it is [copied](https://github.com/facebook/rocksdb/blob/fc51b7f33adcba7ac725ed0e7fe8b8155aaeaee4/db/db_impl/db_impl_open.cc#L1127) into the content of a `WriteBatch`. The write batch will then be used to restore the memtable states. + +Besides the recovery path, a write batch is always constructed during user writes. Firstly, RocksDB allows users to construct a write batch directly, and pass it to DB through `DB::Write()` API for execution. Higher-level buffered write APIs like Transaction rely on a write batch to buffer writes prior to executing them. For unbuffered write APIs like `DB::Put()`, RocksDB constructs a write batch internally with the input user key and value. + +![](/static/images/kv-checksum/Write-batch.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +The above diagram shows a rough representation of a write batch in memory. `Contents` is the concatenation of serialized user operations in this write batch. Each operation consists of user key, value, op_type and optionally column family ID. With per key-value checksum protection enabled, a vector of ProtectionInfo is stored in the write batch, one for each user operation. + +#### Memtable entry + +![](/static/images/kv-checksum/Memtable-entry.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +A memtable entry is similar to write batch content, except that it captures only a single user operation and that it does not contain column family ID (since memtable is per column family). User key and value are length-prefixed, and seqno and optype are combined in a fixed 8 bytes representation. + +### Processes + +In order to protect user writes and recovery, per key-value checksum is covered in the following code paths. + +#### WriteBatch write + +Per key-value checksum coverage starts with the user buffers that contain user key and/or value. When users call DB Write APIs (e.g., `DB::Put()`), or when users add operations into write batches directly (e.g. `WriteBatch::Put()`), RocksDB constructs `ProtectionInfo` from the user buffer (e.g. [here](https://github.com/facebook/rocksdb/blob/96206531bc0bb56d87012921c5458c8a3047a6b3/db/write_batch.cc#L813)) and [stores](https://github.com/facebook/rocksdb/blob/96206531bc0bb56d87012921c5458c8a3047a6b3/include/rocksdb/write_batch.h#L478) the protection information within the corresponding `WriteBatch` object as diagramed below. Then the user key and/or value are copied into the `WriteBatch`, thus starting per key-value checksum protection from user buffer. + +![](/static/images/kv-checksum/Writebatch-write.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + + +#### WAL write + +Before a `WriteBatch` leaves RocksDB and be persisted in a WAL file, it is verified against its `ProtectionInfo` to ensure its content is not corrupted. We added `WriteBatch::VerifyChecksum()` for this purpose. Once we verify the content of a `WriteBatch`, it is then divided into potentially multiple WAL fragments and persisted in the underlying file system. From that point on, the integrity protection is handed off to the per fragment crc32c checksum that is persisted in WAL too. + +![](/static/images/kv-checksum/WAL-write.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +#### Memtable write + +Similar to the WAL write path, `ProtectionInfo` is verified before an entry is inserted into a memtable. The difference here is that an memtable entry has its own buffer, and the content of a `WriteBatch` is copied into the memtable entry. So the `ProtectionInfo` is verified against the memtable entry buffer instead. The current per key-value checksum protection ends at this verification on the buffer containing a memtable entry, and one of the future work is to extend the coverage to key-value pairs in memtables. + +![](/static/images/kv-checksum/Memtable-write.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +#### WAL read + +This is for the DB recovery path: WAL fragments are read into memory, concatenated together to form WAL records, and then `WriteBatch`es are constructed from WAL records and added to memtables. In RocksDB 7.4, once a `WriteBatch` copies its content from a WAL record, `ProtectionInfo` is constructed from the `WriteBatch` content and per key-value protection starts. However, this copy operation is not protected, neither is the reconstruction of a WAL record from WAL fragments. To provide protection from silent data corruption during these memory copying operations, we added checksum handshake detailed below in RocksDB 7.5. + +When a WAL fragment is first read into memory, its crc32c checksum is [verified](https://github.com/facebook/rocksdb/blob/2f13f5f7d09c589d5adebf0cbc42fadf0da0f00e/db/log_reader.cc#L483). The WAL fragment is then appended to the buffer containing a WAL record. RocksDB uses xxh3’s streaming API to calculate the checksum of the WAL record and updates the streaming hash state with the new WAL fragment content whenever it is appended to the WAL record buffer (e.g. [here](https://github.com/facebook/rocksdb/blob/2f13f5f7d09c589d5adebf0cbc42fadf0da0f00e/db/log_reader.cc#L135)). After the WAL record is constructed, it is copied into a `WriteBatch` and `ProtectionInfo` is constructed from the write batch content. Then, the xxh3 checksum of the WAL record is [verified](https://github.com/facebook/rocksdb/blob/2f13f5f7d09c589d5adebf0cbc42fadf0da0f00e/db/write_batch.cc#L3081-L3085) against the write batch content to complete the checksum handshake. If the checksum verification succeeds, then we are more confident that `ProtectionInfo` is calculated based on uncorrupted data, and the protection coverage continues with the newly constructed `ProtectionInfo` along the write code paths mentioned above. + +![](/static/images/kv-checksum/WAL-read.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +## Future work + +Future coverage expansion will cover memtable KVs, flush, compaction and user reads etc. + +## References + +[1] http://rocksdb.org/blog/2021/05/26/online-validation.html + +[2] H. D. Dixit, L. Boyle, G. Vunnam, S. Pendharkar, M. Beadon, and S. Sankar, ‘Detecting silent data corruptions in the wild’. arXiv, 2022. + +[3] https://github.com/Cyan4973/xxHash + +[4] https://github.com/Cyan4973/xxHash/issues/229#issuecomment-511956403 diff --git a/src/rocksdb/docs/_posts/2022-10-05-lost-buffered-write-recovery.markdown b/src/rocksdb/docs/_posts/2022-10-05-lost-buffered-write-recovery.markdown new file mode 100644 index 000000000..fca3ea739 --- /dev/null +++ b/src/rocksdb/docs/_posts/2022-10-05-lost-buffered-write-recovery.markdown @@ -0,0 +1,123 @@ +--- +title: "Verifying crash-recovery with lost buffered writes" +layout: post +author: +- ajkr +category: blog +--- + +## Introduction + +Writes to a RocksDB instance go through multiple layers before they are fully persisted. +Those layers may buffer writes, delaying their persistence. +Depending on the layer, buffered writes may be lost in a process or system crash. +A process crash loses writes buffered in process memory only. +A system crash additionally loses writes buffered in OS memory. + +The new test coverage introduced in this post verifies there is no hole in the recovered data in either type of crash. +A hole would exist if any recovered write were newer than any lost write, as illustrated below. +This guarantee is important for many applications, such as those that use the newest recovered write to determine the starting point for replication. + +![](/static/images/lost-buffered-write-recovery/happy-cat.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +*Valid (no hole) recovery: all recovered writes (1 and 2) are older than all lost writes (3 and 4)* +{: style="text-align: center"} + +![](/static/images/lost-buffered-write-recovery/angry-cat.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +*Invalid (hole) recovery: a recovered write (4) is newer than a lost write (3)* +{: style="text-align: center"} + +The new test coverage assumes all writes use the same options related to buffering/persistence. +For example, we do not cover the case of alternating writes with WAL disabled and WAL enabled (`WriteOptions::disableWAL`). +It also assumes the crash does not have any unexpected consequences like corrupting persisted data. + +Testing for holes in the recovery is challenging because there are many valid recovery outcomes. +Our solution involves tracing all the writes and then verifying the recovery matches a prefix of the trace. +This proves there are no holes in the recovery. +See "Extensions for lost buffered writes" subsection below for more details. + +Testing actual system crashes would be operationally difficult. +Our solution simulates system crash by buffering written but unsynced data in process memory such that it is lost in a process crash. +See "Simulating system crash" subsection below for more details. + +## Scenarios covered + +We began testing recovery has no hole in the following new scenarios. +This coverage is included in our internal CI that periodically runs against the latest commit on the main branch. + +1. **Process crash with WAL disabled** (`WriteOptions::disableWAL=1`), which loses writes since the last memtable flush. +2. **System crash with WAL enabled** (`WriteOptions::disableWAL=0`), which loses writes since the last memtable flush or WAL sync (`WriteOptions::sync=1`, `SyncWAL()`, or `FlushWAL(true /* sync */)`). +3. **Process crash with manual WAL flush** (`DBOptions::manual_wal_flush=1`), which loses writes since the last memtable flush or manual WAL flush (`FlushWAL()`). +4. **System crash with manual WAL flush** (`DBOptions::manual_wal_flush=1`), which loses writes since the last memtable flush or synced manual WAL flush (`FlushWAL(true /* sync */)`, or `FlushWAL(false /* sync */)` followed by WAL sync). + +## Issues found + +* [False detection of corruption after system crash due to race condition with WAL sync and `track_and_verify_wals_in_manifest](https://github.com/facebook/rocksdb/pull/10185) +* [Undetected hole in recovery after system crash due to race condition in WAL sync](https://github.com/facebook/rocksdb/pull/10560) +* [Recovery failure after system crash due to missing directory sync for critical metadata file](https://github.com/facebook/rocksdb/pull/10573) + +## Solution details + +### Basic setup + +![](/static/images/lost-buffered-write-recovery/basic-setup.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +Our correctness testing framework consists of a stress test program (`db_stress`) and a wrapper script (`db_crashtest.py`). +`db_crashtest.py` manages instances of `db_stress`, starting them and injecting crashes. +`db_stress` operates a DB and test oracle ("Latest values file"). + +At startup, `db_stress` verifies the DB using the test oracle, skipping keys that had pending writes when the last crash happened. +`db_stress` then stresses the DB with random operations, keeping the test oracle up-to-date. + +As the name "Latest values file" implies, this test oracle only tracks the latest value for each key. +As a result, this setup is unable to verify recoveries involving lost buffered writes, where recovering older values is tolerated as long as there is no hole. + +### Extensions for lost buffered writes + +To accommodate lost buffered writes, we extended the test oracle to include two new files: "`verifiedSeqno`.state" and "`verifiedSeqno`.trace". +`verifiedSeqno` is the sequence number of the last successful verification. +"`verifiedSeqno`.state" is the expected values file at that sequence number, and "`verifiedSeqno`.trace" is the trace file of all operations that happened after that sequence number. + +![](/static/images/lost-buffered-write-recovery/replay-extension.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +When buffered writes may have been lost by the previous `db_stress` instance, the current `db_stress` instance must reconstruct the latest values file before startup verification. +M is the recovery sequence number of the current `db_stress` instance and N is the recovery sequence number of the previous `db_stress` instance. +M is learned from the DB, while N is learned from the filesystem by parsing the "*.{trace,state}" filenames. +Then, the latest values file ("LATEST.state") can be reconstructed by replaying the first M-N traced operations (in "N.trace") on top of the last instance's starting point ("N.state"). + +![](/static/images/lost-buffered-write-recovery/trace-extension.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +When buffered writes may be lost by the current `db_stress` instance, we save the current expected values into "M.state" and begin tracing newer operations in "M.trace". + +### Simulating system crash + +When simulating system crash, we send file writes to a `TestFSWritableFile`, which buffers unsynced writes in process memory. +That way, the existing `db_stress` process crash mechanism will lose unsynced writes. + +![](/static/images/lost-buffered-write-recovery/test-fs-writable-file.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +`TestFSWritableFile` is implemented as follows. + +* `Append()` buffers the write in a local `std::string` rather than calling `write()`. +* `Sync()` transfers the local `std::string`s content to `PosixWritableFile::Append()`, which will then `write()` it to the OS page cache. + +## Next steps +An untested guarantee is that RocksDB recovers all writes that the user explicitly flushed out of the buffers lost in the crash. +We may recover more writes than these due to internal flushing of buffers, but never less. +Our test oracle needs to be further extended to track the lower bound on the sequence number that is expected to survive a crash. + +We would also like to make our system crash simulation more realistic. +Currently we only drop unsynced regular file data, but we should drop unsynced directory entries as well. + +## Acknowledgements + +Hui Xiao added the manual WAL flush coverage and compatibility with `TransactionDB`. +Zhichao Cao added the system crash simulation. +Several RocksDB team members contributed to this feature's dependencies. diff --git a/src/rocksdb/docs/_posts/2022-10-07-asynchronous-io-in-rocksdb.markdown b/src/rocksdb/docs/_posts/2022-10-07-asynchronous-io-in-rocksdb.markdown new file mode 100644 index 000000000..0586f1c3d --- /dev/null +++ b/src/rocksdb/docs/_posts/2022-10-07-asynchronous-io-in-rocksdb.markdown @@ -0,0 +1,133 @@ +--- +title: Asynchronous IO in RocksDB +layout: post +author: +- akankshamahajan15 +- anand1976 +category: blog +--- +## Summary + +RocksDB provides several APIs to read KV pairs from a database, including Get and MultiGet for point lookups and Iterator for sequential scanning. These APIs may result in RocksDB reading blocks from SST files on disk storage. The types of blocks and the frequency with which they are read from storage is workload dependent. Some workloads may have a small working set and thus may be able to cache most of the data required, while others may have large working sets and have to read from disk more often. In the latter case, the latency would be much higher and throughput would be lower than the former. They would also be dependent on the characteristics of the underlying storage media, making it difficult to migrate from one medium to another, for example, local flash to disaggregated flash. + +One way to mitigate the impact of storage latency is to read asynchronously and in parallel as much as possible, in order to hide IO latency. We have implemented this in RocksDB in Iterators and MultiGet. In Iterators, we prefetch data asynchronously in the background for each file being iterated on, unlike the current implementation that does prefetching synchronously, thus blocking the iterator thread. In MultiGet, we determine the set of files that a given batch of keys overlaps, and read the necessary data blocks from those files in parallel using an asynchronous file system API. These optimizations have significantly decreased the overall latency of the RocksDB MultiGet and iteration APIs on slower storage compared to local flash. + +The optimizations described here are in the internal implementation of Iterator and MultiGet in RocksDB. The user API is still synchronous, so existing code can easily benefit from it. We might consider async user APIs in the future. + + +## Design + +### API + +A new flag in `ReadOptions`, `async_io`, controls the usage of async IO. This flag, when set, enables async IO in Iterators and MultiGet. For MultiGet, an additional `ReadOptions` flag, `optimize_multiget_for_io` (defaults to true), controls how aggressively to use async IO. If the flag is not set, files in the same level are read in parallel but not different levels. If the flag is set, the level restriction is removed and as many files as possible are read in parallel, regardless of level. The latter might have a higher CPU cost depending on the workload. + +At the FileSystem layer, we use the `FSRandomAccessFile::ReadAsync` API to start an async read, providing a completion callback. + +### Scan + +A RocksDB scan usually involves the allocation of a new iterator, followed by a Seek call with a target key to position the iterator, followed by multiple Next calls to iterate through the keys sequentially. Both the Seek and Next operations present opportunities to read asynchronously, thereby reducing the scan latency. + +A scan usually involves iterating through keys in multiple entities - the active memtable, sealed and unflushed memtables, every L0 file, and every non-empty non-zero level. The first two are completely in memory and thus not impacted by IO latency. The latter two involve reading from SST files. This means that an increase in IO latency has a multiplier effect, since multiple L0 files and levels have to be iterated on. + +Some factors, such as block cache and prefix bloom filters, can reduce the number of files to iterate and number of reads from the files. Nevertheless, even a few reads from disk can dominate the overall latency. RocksDB uses async IO in both Seek and Next to mitigate the latency impact, as described below. + + +#### Seek + +A RocksDB iterator maintains a collection of child iterators, one for each L0 file and for each non-empty non-zero levels. For a Seek operation every child iterator has to Seek to the target key. This is normally done serially, by doing synchronous reads from SST files when the required data blocks are not in cache. When the async_io option is enabled, RocksDB performs the Seek in 2 phases - 1) Locate the data block required for Seek in each file/level and issue an async read, and 2) in the second phase, reseek with the same key, which will wait for the async read to finish at each level and position the table iterator. Phase 1 reads multiple blocks in parallel, reducing overall Seek latency. + + +#### Next + +For the iterator Next operation, RocksDB tries to reduce the latency due to IO by prefetching data from the file. This prefetching occurs when a data block required by Next is not present in the cache. The reads from file and prefetching is managed by the FilePrefetchBuffer, which is an object that’s created per table iterator (BlockBasedTableIterator). The FilePrefetchBuffer reads the required data block, and an additional amount of data that varies depending on the options provided by the user in ReadOptions and BlockBasedTableOptions. The default behavior is to start prefetching on the third read from a file, with an initial prefetch size of 8KB and doubling it on every subsequent read, upto a max of 256KB. + +While the prefetching in the previous paragraph helps, it is still synchronous and contributes to the iterator latency. When the async_io option is enabled, RocksDB prefetches in the background, i.e while the iterator is scanning KV pairs. This is accomplished in FilePrefetchBuffer by maintaining two prefetch buffers. The prefetch size is calculated as usual, but its then split across the two buffers. As the iteration proceeds and data in the first buffer is consumed, the buffer is cleared and an async read is scheduled to prefetch additional data. This read continues in the background while the iterator continues to process data in the second buffer. At this point, the roles of the two buffers are reversed. This does not completely hide the IO latency, since the iterator would have to wait for an async read to complete after the data in memory has been consumed. However, it does hide some of it by overlapping CPU and IO, and async prefetch can be happening on multiple levels in parallel, further reducing the latency. + +![Scan flow](/static/images/asynchronous-io/scan_async.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +### MultiGet + +The MultiGet API accepts a batch of keys as input. Its a more efficient way of looking up multiple keys compared to a loop of Gets. One way MultiGet is more efficient is by reading multiple data blocks from an SST file in a batch, for keys in the same file. This greatly reduces the latency of the request, compared to a loop of Gets. The MultiRead FileSystem API is used to read a batch of data blocks. + +![MultiGet flow](/static/images/asynchronous-io/mget_async.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +Even with the MultiRead optimization, subset of keys that are in different files still need to be read serially. We can take this one step further and read multiple files in parallel. In order to do this, a few fundamental changes were required in the MultiGet implementation - + +1. Coroutines - A MultiGet involves determining the set of keys in a batch that overlap an SST file, and then calling TableReader::MultiGet to do the actual lookup. The TableReader probes the bloom filter, traverses the index block, looks up the block cache for the necessary, reads the missing data blocks from the SST file, and then searches for the keys in the data blocks. There is a significant amount of context that’s accumulated at each stage, and it would be rather complex to interleave data blocks reads by multiple TableReaders. In order to simplify it, we used async IO with C++ coroutines. The TableReader::MultiGet is implemented as a coroutine, and the coroutine is suspended after issuing async reads for missing data blocks. This allows the top-level MultiGet to iterate through the TableReaders for all the keys, before waiting for the reads to finish and resuming the coroutines. +2. Filtering - The downside of using coroutines is the CPU overhead, which is non-trivial. To minimize the overhead, its desirable to not use coroutines as much as possible. One scenario in which we can completely avoid the call to a TableReader::MultiGet coroutine is if we know that none of the overlapping keys are actually present in the SST file. This can easily determined by probing the bloom filter. In the previous implementation, the bloom filter lookup was embedded in TableReader::MultiGet. However, we could easily implement is as a separate step, before calling TableReader::MultiGet. +3. Splitting batches - The default strategy of MultiGet is to lookup keys in one level (or L0 file), before moving on to the next. This limits the amount of IO parallelism we can exploit. For example, the keys in a batch may not be clustered together, and may be scattered over multiple files. Even if they are clustered together in the key space, they may not all be in the same level. In order to optimize for these situations, we determine the subset of keys that are likely to be in a given level, and then split the MultiGet batch into 2 - the subset in that level, and the remainder. The batch containing the remainder can then be processed in parallel. The subset of keys likely to be in a level is determined by the filtering step. + +Together, these changes enabled two types of latency optimization in MultiGet using async IO - single-level and multi-level. The former reads data blocks in parallel from multiple files in the same LSM level, while the latter reads in parallel from multiple files in multiple levels. + +## Results + +Command used to generate the database: + +`buck-out/opt/gen/rocks/tools/rocks_db_bench —db=/rocks_db_team/prefix_scan —env_uri=ws://ws.flash.ftw3preprod1 -logtostderr=false -benchmarks="fillseqdeterministic" -key_size=32 -value_size=512 -num=5000000 -num_levels=4 -multiread_batched=true -use_direct_reads=false -adaptive_readahead=true -threads=1 -cache_size=10485760000 -async_io=false -multiread_stride=40000 -disable_auto_compactions=true -compaction_style=1 -bloom_bits=10` + +Structure of the database: + +`Level[0]: /000233.sst(size: 24828520 bytes)` +`Level[0]: /000232.sst(size: 49874113 bytes)` +`Level[0]: /000231.sst(size: 100243447 bytes)` +`Level[0]: /000230.sst(size: 201507232 bytes)` +`Level[1]: /000224.sst - /000229.sst(total size: 405046844 bytes)` +`Level[2]: /000211.sst - /000223.sst(total size: 814190051 bytes)` +`Level[3]: /000188.sst - /000210.sst(total size: 1515327216 bytes)` + + +### MultiGet + +MultiGet benchmark command: + +`buck-out/opt/gen/rocks/tools/rocks_db_bench -use_existing_db=true —db=/rocks_db_team/prefix_scan -benchmarks="multireadrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=8 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -threads=4 -cache_size=300000000 -async_io=true -multiread_stride=40000 -statistics —env_uri=ws://ws.flash.ftw3preprod1 -logtostderr=false -adaptive_readahead=true -bloom_bits=10` + +#### Single-file + +The default MultiGet implementation of reading from one file at a time had a latency of 1292 micros/op. + +`multireadrandom : 1291.992 micros/op 3095 ops/sec 60.007 seconds 185768 operations; 1.6 MB/s (46768 of 46768 found) ` +`rocksdb.db.multiget.micros P50 : 9664.419795 P95 : 20757.097056 P99 : 29329.444444 P100 : 46162.000000 COUNT : 23221 SUM : 239839394` + +#### Single-level + +MultiGet with async_io=true and optimize_multiget_for_io=false had a latency of 775 micros/op. + +`multireadrandom : 774.587 micros/op 5163 ops/sec 60.009 seconds 309864 operations; 2.7 MB/s (77816 of 77816 found)` +`rocksdb.db.multiget.micros P50 : [6029.601964](tel:6029601964) P95 : 10727.467932 P99 : 13986.683940 P100 : 47466.000000 COUNT : 38733 SUM : 239750172` + +#### Multi-level + +With all optimizations turned on, MultiGet had the lowest latency of 508 micros/op. + +`multireadrandom : 507.533 micros/op 7881 ops/sec 60.003 seconds 472896 operations; 4.1 MB/s (117536 of 117536 found)` +`rocksdb.db.multiget.micros P50 : 3923.819467 P95 : 7356.182075 P99 : 10880.728723 P100 : 28511.000000 COUNT : 59112 SUM : 239642721` + +### Scan + +Benchmark command: + +`buck-out/opt/gen/rocks/tools/rocks_db_bench -use_existing_db=true —db=/rocks_db_team/prefix_scan -ben``chmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=8 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_che``cks=1 -readonly=true -threads=4 -cache_size=300000000 -async_io=true -multiread_stride=40000 -statistics —env_uri=ws://ws.flash.ftw3preprod1 -logtostderr=false -a``daptive_readahead=true -bloom_bits=10 -seek_nexts=65536` + +### With async scan + +`seekrandom : 414442.303 micros/op 9 ops/sec 60.288 seconds 581 operations; 326.2 MB/s (145 of 145 found)` + +### Without async scan + +`seekrandom : 848858.669 micros/op 4 ops/sec 60.529 seconds 284 operations; 158.1 MB/s (74 of 74 found)` + +## Known Limitations + +These optimizations apply only to block based table SSTs. File system support for the `ReadAsync` and `Poll` interfaces is required. Currently, it is available only for `PosixFileSystem`. + +The MultiGet async IO optimization has a few additional limitations - + +1. Depends on folly, which introduces a few additional build steps +2. Higher CPU overhead due to coroutines. The CPU overhead of MultiGet may increase 6-15%, with the worst case being a single threaded MultiGet batch of keys with 1 key/file intersection and 100% cache hit rate. A more realistic case of multiple threads with a few keys (~4) overlap per file should see ~6% higher CPU util. +3. No parallelization of metadata reads. A metadata read will block the thread. +4. A few other cases will also be in serial, such as additional block reads for merge operands. + + diff --git a/src/rocksdb/docs/_posts/2022-10-31-align-compaction-output-file.markdown b/src/rocksdb/docs/_posts/2022-10-31-align-compaction-output-file.markdown new file mode 100644 index 000000000..6df61b551 --- /dev/null +++ b/src/rocksdb/docs/_posts/2022-10-31-align-compaction-output-file.markdown @@ -0,0 +1,107 @@ +--- +title: Reduce Write Amplification by Aligning Compaction Output File Boundaries +layout: post +author: +- zjay +category: blog +--- +## TL;DR +By cutting the compaction output file earlier and allowing larger than targeted_file_size to align the compaction output files to the next level files, it can **reduce WA (Write Amplification) by more than 10%**. The feature is **enabled by default** after the user upgrades RocksDB to version `7.8.0+`. + +## Background +RocksDB level compaction picks one file from the source level and compacts to the next level, which is a typical partial merge compaction algorithm. Compared to the full merge compaction strategy for example [universal compaction](https://github.com/facebook/rocksdb/wiki/Universal-Compaction), it has the benefits of smaller compaction size, better parallelism, etc. But it also has a larger write amplification (typically 20-30 times user data). One of the problems is wasted compaction at the beginning and ending: + +![](/static/images/align-compaction-output/file_cut_normal.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +In the diagram above, `SST11` is selected for the compaction, it overlaps with `SST20` to `SST23`, so all these files are selected for compaction. But the beginning and ending of the SST on Level 2 are wasted, which also means it will be compacted again when `SST10` is compacting down. If the file boundaries are aligned, then the wasted compaction size could be reduced. On average, the wasted compaction is `1` file size: `0.5` at the beginning, and `0.5` at the end. Typically the average compaction fan-out is about 6 (with the default max_bytes_for_level_multiplier = 10), then `1 / (6 + 1) ~= 14%` of compaction is wasted. +## Implemtation +To reduce such wasted compaction, RocksDB now tries to align the compaction output file to the next level's file. So future compactions will have fewer wasted compaction. For example, the above case might be cut like this: + +![](/static/images/align-compaction-output/file_cut_align.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +The trade-off is the file won't be cut exactly after it exceeds target_file_size_base, instead, it will be more likely cut when it's aligned with the next level file's boundary, so the file size might be more varied. It could be as small as 50% of `target_file_size` or as large as `2x target_file_size`. It will only impact non-bottommost-level files, which should be only `~11%` of the data. +Internally, RocksDB tries to cut the file so its size is close to the `target_file_size` setting but also aligned with the next level boundary. When the compaction output file hit a next-level file boundary, either the beginning or ending boundary, it will cut if: +``` +current_size > ((5 * min(bounderies_num, 8) + 50) / 100) * target_file_size +``` +([details](https://github.com/facebook/rocksdb/blob/23fa5b7789d6acd0c211d6bdd41448bbf1513bb6/db/compaction/compaction_outputs.cc#L270-L290)) + +The file size is also capped at `2x target_file_size`: [details](https://github.com/facebook/rocksdb/blob/f726d29a8268ae4e2ffeec09172383cff2ab4db9/db/compaction/compaction.cc#L273-L277). +Another benefit of cutting the file earlier is having more trivial move compaction, which is moving the file from a high level to a low level without compacting anything. Based on a compaction simulator test, the trivial move data is increased by 30% (but still less than 1% compaction data is trivial move): + +![](/static/images/align-compaction-output/file_cut_trival_move.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +Based on the db_bench test, it can save `~12%` compaction load, here is the test command and result: +``` +TEST_TMPDIR=/data/dbbench ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432 + +# baseline: +Flush(GB): cumulative 25.882, interval 7.216 +Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds + +# with this change: +Flush(GB): cumulative 25.882, interval 7.753 +Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds +``` + +The feature is enabled by default by upgrading to RocksDB 7.8 or later versions, as the feature should have a limited impact on the file size and have great write amplification improvements. If in a rare case, it needs to opt out, set +``` +options.level_compaction_dynamic_file_size = false; +``` + +## Other Options and Benchmark +We also tested a few other options, starting with a fixed threshold: 75% of the target_file_size and 50%. Then with a dynamic threshold that is explained, but still limiting file size smaller than the target_file_size. +1. Baseline (main branch before [PR#10655](https://github.com/facebook/rocksdb/pull/10655)); +2. Fixed Threshold `75%`: after 75% of target file size, cut the file whenever it aligns with a low level file boundary; +3. Fixed Threshold `50%`: reduce the threshold to 50% of target file size; +4. Dynamic Threshold `(5*bounderies_num + 50)` percent of target file size and maxed at 90%; +5. Dynamic Threshold + allow 2x the target file size (chosen option). + +### Test Environment and Data +To speed up the benchmark, we introduced a compaction simulator within Rocksdb ([details](https://github.com/jay-zhuang/rocksdb/tree/compaction_sim)), which replaced the physical SST with in-memory data (a large bitset). Which can test compaction more consistently. As it's a simulator, it has its limitations: + +it assumes each key-value has the same size; +1. no deletion (but has override); +2. doesn't consider data compression; +3. single-threaded and finish all compactions before the next flush (so no write stall). + +We use 3 kinds of the dataset for tests: +1. Random Data, has an override, evenly distributed; +2. Zipf distribution with alpha = 1.01, moderately skewed; +3. Zipf distribution with alpha = 1.2, highly skewed. + +#### Write Amplification + +![](/static/images/align-compaction-output/write_amp_compare.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 100%"} + +As we can see, all options are better than the baseline. Option5 (brown) and option3 (green) have similar WA improvements. (The sudden WA drop during ~40G Random Dataset is because we enabled `level_compaction_dynamic_level_bytes` and the level number was increased from 3 to 4, the similar test result without enabling `level_compaction_dynamic_level_bytes`). + +#### File Size Distribution at the End of Test +This is the file size distribution at the end of the test, which loads about 100G data. As this change only impacts the non-bottommost file size, and the majority of the SST files are bottommost, there're no significant differences: + +![](/static/images/align-compaction-output/file_size_compare.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 100%"} + +#### All Compaction Generated File Sizes +The high-level files are much more likely to be compacted, so all compaction-generated files size has more significant change: + +![](/static/images/align-compaction-output/compaction_output_file_size_compare.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 100%"} + +Overall option5 has most of the file size close to the target file size. vs. option3 has a much smaller size. Here are more detailed stats for compaction output file size: +``` + base 50p 75p dynamic 2xdynamic +count 1.656000e+03 1.960000e+03 1.770000e+03 1.687000e+03 1.705000e+03 +mean 3.116062e+07 2.634125e+07 2.917876e+07 3.060135e+07 3.028076e+07 +std 7.145242e+06 1.065134e+07 8.800474e+06 7.612939e+06 8.046139e+06 +``` + +## Summary +Allowing more dynamic file size and aligning the compaction output file to the next level file's boundary improves the RocksDB write amplification by more than 10%, which will be enabled by default in `7.8.0` release. We picked a simple algorithm to decide when to cut the output file, which can be further improved. For example, by estimating output file size with index information. Any suggestions or PR are welcomed. + +## Acknowledgements +We thank Siying Dong for initializing the file-cutting idea and thank Andrew Kryczka, Mark Callaghan for contributing to the ideas. And Changyu Bi for the detailed code review. |