diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-07 18:45:59 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-07 18:45:59 +0000 |
commit | 19fcec84d8d7d21e796c7624e521b60d28ee21ed (patch) | |
tree | 42d26aa27d1e3f7c0b8bd3fd14e7d7082f5008dc /src/rocksdb/docs/_posts | |
parent | Initial commit. (diff) | |
download | ceph-upstream.tar.xz ceph-upstream.zip |
Adding upstream version 16.2.11+ds.upstream/16.2.11+dsupstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
55 files changed, 3144 insertions, 0 deletions
diff --git a/src/rocksdb/docs/_posts/2014-03-27-how-to-backup-rocksdb.markdown b/src/rocksdb/docs/_posts/2014-03-27-how-to-backup-rocksdb.markdown new file mode 100644 index 000000000..f9e4a5444 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-03-27-how-to-backup-rocksdb.markdown @@ -0,0 +1,135 @@ +--- +title: How to backup RocksDB? +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/191/how-to-backup-rocksdb/ +--- + +In RocksDB, we have implemented an easy way to backup your DB. Here is a simple example: + + + + #include "rocksdb/db.h" + #include "utilities/backupable_db.h" + using namespace rocksdb; + + DB* db; + DB::Open(Options(), "/tmp/rocksdb", &db); + BackupableDB* backupable_db = new BackupableDB(db, BackupableDBOptions("/tmp/rocksdb_backup")); + backupable_db->Put(...); // do your thing + backupable_db->CreateNewBackup(); + delete backupable_db; // no need to also delete db + +<!--truncate--> + + +This simple example will create a backup of your DB in "/tmp/rocksdb_backup". Creating new BackupableDB consumes DB* and you should be calling all the DB methods on object `backupable_db` going forward. + +Restoring is also easy: + + + + RestoreBackupableDB* restore = new RestoreBackupableDB(Env::Default(), BackupableDBOptions("/tmp/rocksdb_backup")); + restore->RestoreDBFromLatestBackup("/tmp/rocksdb", "/tmp/rocksdb"); + delete restore; + + + + +This code will restore the backup back to "/tmp/rocksdb". The second parameter is the location of log files (In some DBs they are different from DB directory, but usually they are the same. See Options::wal_dir for more info). + +An alternative API for backups is to use BackupEngine directly: + + + + #include "rocksdb/db.h" + #include "utilities/backupable_db.h" + using namespace rocksdb; + + DB* db; + DB::Open(Options(), "/tmp/rocksdb", &db); + db->Put(...); // do your thing + BackupEngine* backup_engine = BackupEngine::NewBackupEngine(Env::Default(), BackupableDBOptions("/tmp/rocksdb_backup")); + backup_engine->CreateNewBackup(db); + delete db; + delete backup_engine; + + + + +Restoring with BackupEngine is similar to RestoreBackupableDB: + + + + BackupEngine* backup_engine = BackupEngine::NewBackupEngine(Env::Default(), BackupableDBOptions("/tmp/rocksdb_backup")); + backup_engine->RestoreDBFromLatestBackup("/tmp/rocksdb", "/tmp/rocksdb"); + delete backup_engine; + + + + +Backups are incremental. You can create a new backup with `CreateNewBackup()` and only the new data will be copied to backup directory (for more details on what gets copied, see "Under the hood"). Checksum is always calculated for any backuped file (including sst, log, and etc). It is used to make sure files are kept sound in the file system. Checksum is also verified for files from the previous backups even though they do not need to be copied. A checksum mismatch aborts the current backup (see "Under the hood" for more details). Once you have more backups saved, you can issue `GetBackupInfo()` call to get a list of all backups together with information on timestamp of the backup and the size (please note that sum of all backups' sizes is bigger than the actual size of the backup directory because some data is shared by multiple backups). Backups are identified by their always-increasing IDs. `GetBackupInfo()` is available both in `BackupableDB` and `RestoreBackupableDB`. + +You probably want to keep around only small number of backups. To delete old backups, just call `PurgeOldBackups(N)`, where N is how many backups you'd like to keep. All backups except the N newest ones will be deleted. You can also choose to delete arbitrary backup with call `DeleteBackup(id)`. + +`RestoreDBFromLatestBackup()` will restore the DB from the latest consistent backup. An alternative is `RestoreDBFromBackup()` which takes a backup ID and restores that particular backup. Checksum is calculated for any restored file and compared against the one stored during the backup time. If a checksum mismatch is detected, the restore process is aborted and `Status::Corruption` is returned. Very important thing to note here: Let's say you have backups 1, 2, 3, 4. If you restore from backup 2 and start writing more data to your database, newly created backup will delete old backups 3 and 4 and create new backup 3 on top of 2. + + + +## Advanced usage + + +Let's say you want to backup your DB to HDFS. There is an option in `BackupableDBOptions` to set `backup_env`, which will be used for all file I/O related to backup dir (writes when backuping, reads when restoring). If you set it to HDFS Env, all the backups will be stored in HDFS. + +`BackupableDBOptions::info_log` is a Logger object that is used to print out LOG messages if not-nullptr. + +If `BackupableDBOptions::sync` is true, we will sync data to disk after every file write, guaranteeing that backups will be consistent after a reboot or if machine crashes. Setting it to false will speed things up a bit, but some (newer) backups might be inconsistent. In most cases, everything should be fine, though. + +If you set `BackupableDBOptions::destroy_old_data` to true, creating new `BackupableDB` will delete all the old backups in the backup directory. + +`BackupableDB::CreateNewBackup()` method takes a parameter `flush_before_backup`, which is false by default. When `flush_before_backup` is true, `BackupableDB` will first issue a memtable flush and only then copy the DB files to the backup directory. Doing so will prevent log files from being copied to the backup directory (since flush will delete them). If `flush_before_backup` is false, backup will not issue flush before starting the backup. In that case, the backup will also include log files corresponding to live memtables. Backup will be consistent with current state of the database regardless of `flush_before_backup` parameter. + + + +## Under the hood + + +`BackupableDB` implements `DB` interface and adds four methods to it: `CreateNewBackup()`, `GetBackupInfo()`, `PurgeOldBackups()`, `DeleteBackup()`. Any `DB` interface calls will get forwarded to underlying `DB` object. + +When you call `BackupableDB::CreateNewBackup()`, it does the following: + + + + + + 1. Disable file deletions + + + + 2. Get live files (this includes table files, current and manifest file). + + + + 3. Copy live files to the backup directory. Since table files are immutable and filenames unique, we don't copy a table file that is already present in the backup directory. For example, if there is a file `00050.sst` already backed up and `GetLiveFiles()` returns `00050.sst`, we will not copy that file to the backup directory. However, checksum is calculated for all files regardless if a file needs to be copied or not. If a file is already present, the calculated checksum is compared against previously calculated checksum to make sure nothing crazy happened between backups. If a mismatch is detected, backup is aborted and the system is restored back to the state before `BackupableDB::CreateNewBackup()` is called. One thing to note is that a backup abortion could mean a corruption from a file in backup directory or the corresponding live file in current DB. Both manifest and current files are copied, since they are not immutable. + + + + 4. If `flush_before_backup` was set to false, we also need to copy log files to the backup directory. We call `GetSortedWalFiles()` and copy all live files to the backup directory. + + + + 5. Enable file deletions + + + + +Backup IDs are always increasing and we have a file `LATEST_BACKUP` that contains the ID of the latest backup. If we crash in middle of backing up, on a restart we will detect that there are newer backup files than `LATEST_BACKUP` claims there are. In that case, we will delete any backup newer than `LATEST_BACKUP` and clean up all the files since some of the table files might be corrupted. Having corrupted table files in the backup directory is dangerous because of our deduplication strategy. + + + +## Further reading + + +For the API details, see `include/utilities/backupable_db.h`. For the implementation, see `utilities/backupable/backupable_db.cc`. diff --git a/src/rocksdb/docs/_posts/2014-03-27-how-to-persist-in-memory-rocksdb-database.markdown b/src/rocksdb/docs/_posts/2014-03-27-how-to-persist-in-memory-rocksdb-database.markdown new file mode 100644 index 000000000..89ffb2d97 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-03-27-how-to-persist-in-memory-rocksdb-database.markdown @@ -0,0 +1,54 @@ +--- +title: How to persist in-memory RocksDB database? +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/245/how-to-persist-in-memory-rocksdb-database/ +--- + +In recent months, we have focused on optimizing RocksDB for in-memory workloads. With growing RAM sizes and strict low-latency requirements, lots of applications decide to keep their entire data in memory. Running in-memory database with RocksDB is easy -- just mount your RocksDB directory on tmpfs or ramfs [1]. Even if the process crashes, RocksDB can recover all of your data from in-memory filesystem. However, what happens if the machine reboots? + +<!--truncate--> + +In this article we will explain how you can recover your in-memory RocksDB database even after a machine reboot. + +Every update to RocksDB is written to two places - one is an in-memory data structure called memtable and second is write-ahead log. Write-ahead log can be used to completely recover the data in memtable. By default, when we flush the memtable to table file, we also delete the current log, since we don't need it anymore for recovery (the data from the log is "persisted" in the table file -- we say that the log file is obsolete). However, if your table file is stored in in-memory file system, you may need the obsolete write-ahead log to recover the data after the machine reboots. Here's how you can do that. + +Options::wal_dir is the directory where RocksDB stores write-ahead log files. If you configure this directory to be on flash or disk, you will not lose current log file on machine reboot. +Options::WAL_ttl_seconds is the timeout when we delete the archived log files. If the timeout is non-zero, obsolete log files will be moved to `archive/` directory under Options::wal_dir. Those archived log files will only be deleted after the specified timeout. + +Let's assume Options::wal_dir is a directory on persistent storage and Options::WAL_ttl_seconds is set to one day. To fully recover the DB, we also need to backup the current snapshot of the database (containing table and metadata files) with a frequency of less than one day. RocksDB provides an utility that enables you to easily backup the snapshot of your database. You can learn more about it here: [How to backup RocksDB?](https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F) + +You should configure the backup process to avoid backing up log files, since they are already stored in persistent storage. To do that, set BackupableDBOptions::backup_log_files to false. + +Restore process by default cleans up entire DB and WAL directory. Since we didn't include log files in the backup, we need to make sure that restoring the database doesn't delete log files in WAL directory. When restoring, configure RestoreOptions::keep_log_file to true. That option will also move any archived log files back to WAL directory, enabling RocksDB to replay all archived log files and rebuild the in-memory database state. + +To reiterate, here's what you have to do: + + + + + * Set DB directory to tmpfs or ramfs mounted drive + + + + * Set Options::wal_log to a directory on persistent storage + + + + * Set Options::WAL_ttl_seconds to T seconds + + + + * Backup RocksDB every T/2 seconds, with BackupableDBOptions::backup_log_files = false + + + + * When you lose data, restore from backup with RestoreOptions::keep_log_file = true + + + + + +[1] You might also want to consider using [PlainTable format](https://github.com/facebook/rocksdb/wiki/PlainTable-Format) for table files diff --git a/src/rocksdb/docs/_posts/2014-04-02-the-1st-rocksdb-local-meetup-held-on-march-27-2014.markdown b/src/rocksdb/docs/_posts/2014-04-02-the-1st-rocksdb-local-meetup-held-on-march-27-2014.markdown new file mode 100644 index 000000000..7ccbdbaad --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-04-02-the-1st-rocksdb-local-meetup-held-on-march-27-2014.markdown @@ -0,0 +1,53 @@ +--- +title: The 1st RocksDB Local Meetup Held on March 27, 2014 +layout: post +author: xjin +category: blog +redirect_from: + - /blog/323/the-1st-rocksdb-local-meetup-held-on-march-27-2014/ +--- + +On Mar 27, 2014, RocksDB team @ Facebook held the 1st RocksDB local meetup in FB HQ (Menlo Park, California). We invited around 80 guests from 20+ local companies, including LinkedIn, Twitter, Dropbox, Square, Pinterest, MapR, Microsoft and IBM. Finally around 50 guests showed up, totaling around 60% show-up rate. + +<!--truncate--> + +[![Resize of 20140327_200754](/static/images/Resize-of-20140327_200754-300x225.jpg)](/static/images/Resize-of-20140327_200754-300x225.jpg) + +RocksDB team @ Facebook gave four talks about the latest progress and experience on RocksDB: + + + + + * [Supporting a 1PB In-Memory Workload](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Haobo-RocksDB-In-Memory.pdf) + + + + + * [Column Families in RocksDB](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Igor-Column-Families.pdf) + + + + + * ["Lockless" Get() in RocksDB?](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf) + + + + + * [Prefix Hashing in RocksDB](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Siying-Prefix-Hash.pdf) + + +A very interesting question asked by a massive number of guests is: does RocksDB plan to provide replication functionality? Obviously, many applications need a resilient and distributed storage solution, not just single-node storage. We are considering how to approach this issue. + +When will be the next meetup? We haven't decided yet. We will see whether the community is interested in it and how it can help RocksDB grow. + +If you have any questions or feedback for the meetup or RocksDB, please let us know in [our Facebook group](https://www.facebook.com/groups/rocksdb.dev/). + +### Comments + +**[Rajiv](geetasen@gmail.com)** + +Have any of these talks been recorded and if so will they be published? + +**[Igor Canadi](icanadi@fb.com)** + +Yes, I think we plan to publish them soon. diff --git a/src/rocksdb/docs/_posts/2014-04-07-rocksdb-2-8-release.markdown b/src/rocksdb/docs/_posts/2014-04-07-rocksdb-2-8-release.markdown new file mode 100644 index 000000000..7be7842a5 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-04-07-rocksdb-2-8-release.markdown @@ -0,0 +1,40 @@ +--- +title: RocksDB 2.8 release +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/371/rocksdb-2-8-release/ +--- + +Check out the new RocksDB 2.8 release on [Github](https://github.com/facebook/rocksdb/releases/tag/2.8.fb). + +RocksDB 2.8. is mostly focused on improving performance for in-memory workloads. We are seeing read QPS as high as 5M (we will write a separate blog post on this). + +<!--truncate--> + +Here is the summary of new features: + + * Added a new table format called PlainTable, which is optimized for RAM storage (ramfs or tmpfs). You can read more details about it on [our wiki](https://github.com/facebook/rocksdb/wiki/PlainTable-Format). + + + * New prefixed memtable format HashLinkedList, which is optimized for cases where there are only a few keys for each prefix. + + + * Merge operator supports a new function PartialMergeMulti() that allows users to do partial merges against multiple operands. This function enables big speedups for workloads that use merge operators. + + + * Added a V2 compaction filter interface. It buffers the kv-pairs sharing the same key prefix, process them in batches, and return the batched results back to DB. + + + * Geo-spatial support for locations and radial-search. + + + * Improved read performance using thread local cache for frequently accessed data. + + + * Stability improvements -- we're now ignoring partially written tailing record to MANIFEST or WAL files. + + + +We have also introduced small incompatible API changes (mostly for advanced users). You can see full release notes in our [HISTORY.my](https://github.com/facebook/rocksdb/blob/2.8.fb/HISTORY.md) file. diff --git a/src/rocksdb/docs/_posts/2014-04-21-indexing-sst-files-for-better-lookup-performance.markdown b/src/rocksdb/docs/_posts/2014-04-21-indexing-sst-files-for-better-lookup-performance.markdown new file mode 100644 index 000000000..368055d2c --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-04-21-indexing-sst-files-for-better-lookup-performance.markdown @@ -0,0 +1,28 @@ +--- +title: Indexing SST Files for Better Lookup Performance +layout: post +author: leijin +category: blog +redirect_from: + - /blog/431/indexing-sst-files-for-better-lookup-performance/ +--- + +For a `Get()` request, RocksDB goes through mutable memtable, list of immutable memtables, and SST files to look up the target key. SST files are organized in levels. + +On level 0, files are sorted based on the time they are flushed. Their key range (as defined by FileMetaData.smallest and FileMetaData.largest) are mostly overlapped with each other. So it needs to look up every L0 file. + +<!--truncate--> + +Compaction is scheduled periodically to pick up files from an upper level and merges them with files from lower level. As a result, key/values are moved from L0 down the LSM tree gradually. Compaction sorts key/values and split them into files. From level 1 and below, SST files are sorted based on key. Their key range are mutually exclusive. Instead of scanning through each SST file and checking if a key falls into its range, RocksDB performs a binary search based on FileMetaData.largest to locate a candidate file that can potentially contain the target key. This reduces complexity from O(N) to O(log(N)). However, log(N) can still be large for bottom levels. For a fan-out ratio of 10, level 3 can have 1000 files. That requires 10 comparisons to locate a candidate file. This is a significant cost for an in-memory database when you can do [several million gets per second](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks). + +One observation to this problem is that: after the LSM tree is built, an SST file's position in its level is fixed. Furthermore, its order relative to files from the next level is also fixed. Based on this idea, we can perform [fractional cascading](http://en.wikipedia.org/wiki/Fractional_cascading) kind of optimization to narrow down the binary search range. Here is an example: + +[![tree_example](/static/images/tree_example1.png)](/static/images/tree_example1.png) + +Level 1 has 2 files and level 2 has 8 files. Now, we want to look up key 80. A binary search based FileMetaData.largest tells you file 1 is the candidate. Then key 80 is compared with its FileMetaData.smallest and FileMetaData.largest to decide if it falls into the range. The comparison shows 80 is less than FileMetaData.smallest (100), so file 1 does not possibly contain key 80. We to proceed to check level 2. Usually, we need to do binary search among all 8 files on level 2. But since we already know target key 80 is less than 100 and only file 1 to file 3 can contain key less than 100, we can safely exclude other files from the search. As a result we cut down the search space from 8 files to 3 files. + +Let's look at another example. We want to get key 230. A binary search on level 1 locates to file 2 (this also implies key 230 is larger than file 1's FileMetaData.largest 200). A comparison with file 2's range shows the target key is smaller than file 2's FileMetaData.smallest 300. Even though, we couldn't find key on level 1, we have derived hints that target key is in range between 200 and 300. Any files on level 2 that cannot overlap with [200, 300] can be safely excluded. As a result, we only need to look at file 5 and file 6 on level 2. + +Inspired by this concept, we pre-build pointers at compaction time on level 1 files that point to a range of files on level 2. For example, file 1 on level 1 points to file 3 (on level 2) on the left and file 4 on the right. File 2 will point to level 2 files 6 and 7. At query time, these pointers are used to determine the actual binary search range based on comparison result. + +Our benchmark shows that this optimization improves lookup QPS by ~5% for similar setup mentioned [here](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks). diff --git a/src/rocksdb/docs/_posts/2014-05-14-lock.markdown b/src/rocksdb/docs/_posts/2014-05-14-lock.markdown new file mode 100644 index 000000000..12009cc88 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-05-14-lock.markdown @@ -0,0 +1,88 @@ +--- +title: Reducing Lock Contention in RocksDB +layout: post +author: sdong +category: blog +redirect_from: + - /blog/521/lock/ +--- + +In this post, we briefly introduce the recent improvements we did to RocksDB to improve the issue of lock contention costs. + +RocksDB has a simple thread synchronization mechanism (See [RocksDB Architecture Guide](https://github.com/facebook/rocksdb/wiki/Rocksdb-Architecture-Guide) to understand terms used below, like SST tables or mem tables). SST tables are immutable after being written and mem tables are lock-free data structures supporting single writer and multiple readers. There is only one single major lock, the DB mutex (DBImpl.mutex_) protecting all the meta operations, including: + +<!--truncate--> + + * Increase or decrease reference counters of mem tables and SST tables + + + * Change and check meta data structures, before and after finishing compactions, flushes and new mem table creations + + + * Coordinating writers + + +This DB mutex used to be scalability bottleneck preventing us from scaling to more than 16 threads. To address the issue, we improved RocksDB in several ways. + +1. Consolidate reference counters and introduce "super version". For every read operation, mutex was acquired, and reference counters for each mem table and each SST table were increased. One such operation is not expensive but if you are building a high throughput server with lots of reads, the lock contention will become the bottleneck. This is especially true if you store all your data in RAM. + +To solve this problem, we created a meta-meta data structure called “[super version](https://reviews.facebook.net/rROCKSDB1fdb3f7dc60e96394e3e5b69a46ede5d67fb976c)”, which holds reference counters to all those mem table and SST tables, so that readers only need to increase the reference counters for this single data structure. In RocksDB, list of live mem tables and SST tables only changes infrequently, which would happen when new mem tables are created or flush/compaction happens. Now, at those times, a new super version is created with their reference counters increased. A super version lists live mem tables and SST tables so a reader only needs acquire the lock in order to find the latest super version and increase its reference counter. From the super version, the reader can find all the mem and SST tables which are safety accessible as long as the reader holds the reference count for the super version. + +2. We replace some reference counters to stc::atomic objects, so that decreasing reference count of an object usually doesn’t need to be inside the mutex any more. + +3. Make fetching super version and reference counting lock-free in read queries. After consolidating reference counting to one single super version and removing the locking for decreasing reference counts, in read case, we only acquire mutex for one thing: fetch the latest super version and increase the reference count for that (dereference the counter is done in an atomic decrease). We designed and implemented a (mostly) lock-free approach to do it. See [details](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf). We will write a separate blog post for that. + +4. Avoid disk I/O inside the mutex. As we know, each disk I/O to hard drives takes several milliseconds. It can be even longer if file system journal is involved or I/Os are queued. Even occasional disk I/O within mutex can cause huge performance outliers. +We identified in two situations, we might do disk I/O inside mutex and we removed them: +(1) Opening and closing transactional log files. We moved those operations out of the mutex. +(2) Information logging. In multiple places we write to logs within mutex. There is a chance that file write will wait for disk I/O to finish before finishing, even if fsync() is not issued, especially in EXT systems. We occasionally see 100+ milliseconds write() latency on EXT. Instead of removing those logging, we came up with a solution of delay logging. When inside mutex, instead of directly writing to the log file, we write to a log buffer, with the timing information. As soon as mutex is released, we flush the log buffer to log files. + +5. Reduce object creation inside the mutex. +Object creation can be slow because it involves malloc (in our case). Malloc sometimes is slow because it needs to lock some shared data structures. Allocating can also be slow because we sometimes do expensive operations in some of our classes' constructors. For these reasons, we try to reduce object creations inside the mutex. Here are two examples: + +(1) std::vector uses malloc inside. We introduced “[autovector](https://reviews.facebook.net/rROCKSDBc01676e46d3be08c3c140361ef1f5884f47d3b3c)” data structure, in which memory for first a few elements are pre-allocated as members of the autovector class. When an autovector is used as a stack variable, no malloc will be needed unless the pre-allocated buffer is used up. This autovector is quite useful for manipulating those meta data structures. Those meta operations are often locked inside DB mutex. + +(2) When building an iterator, we used to creating iterator of every live men table and SST table within the mutex and a merging iterator on top of them. Besides malloc, some of those iterators can be quite expensive to create, like sorting. Now, instead of doing that, we simply increase the reference counters of them, and release the mutex before creating any iterator. + +6. Deal with mutexes in LRU caches. +When I said there was only one single major lock, I was lying. In RocksDB, all LRU caches had exclusive mutexes within to protect writes to the LRU lists, which are done in both of read and write operations. LRU caches are used in block cache and table cache. Both of them are accessed more frequently than DB data structures. Lock contention of these two locks are as intense as the DB mutex. Even if LRU cache is sharded into ShardedLRUCache, we can still see lock contentions, especially table caches. We further address this issue in two way: +(1) Bypassing table caches. A table cache maintains list of SST table’s read handlers. Those handlers contain SST files’ descriptors, table metadata, and possibly data indexes, as well as bloom filters. When the table handler needs to be evicted based on LRU, those information is cleared. When the SST table needs to be read and its table handler is not in LRU cache, the table is opened and those metadata is loaded. In some cases, users want to tune the system in a way that table handler evictions should never happen. It is common for high-throughput, low-latency servers. We introduce a mode where table cache is bypassed in read queries. In this mode, all table handlers are cached and accessed directly, so there is no need to query and adjust table caches for reading the database. It is the users’ responsibility to reserve enough resource for it. This mode can be turned on by setting options.max_open_files=-1. + +(2) [New PlainTable format](//github.com/facebook/rocksdb/wiki/PlainTable-Format) (optimized for SST in ramfs/tmpfs) does not organize data by blocks. Data are located by memory addresses so no block cache is needed. + +With all of those improvements, lock contention is not a bottleneck anymore, which is shown in our [memory-only benchmark](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks) . Furthermore, lock contentions are not causing some huge (50 milliseconds+) latency outliers they used to cause. + +### Comments + +**[Lee Hounshell](lee@apsalar.com)** + +Please post an example of reading the same rocksdb concurrently. + +We are using the latest 3.0 rocksdb; however, when two separate processes +try and open the same rocksdb for reading, only one of the open requests +succeed. The other open always fails with “db/LOCK: Resource temporarily unavailable” So far we have not found an option that allows sharing the rocksdb for reads. An example would be most appreciated. + +**[Siying Dong](siying.d@fb.com)** + +Sorry for the delay. We don’t have feature support for this scenario yet. Here is an example you can work around this problem. You can build a snapshot of the DB by doing this: + +1. create a separate directory on the same host for a snapshot of the DB. +1. call `DB::DisableFileDeletions()` +1. call `DB::GetLiveFiles()` to get a full list of the files. +1. for all the files except manifest, add a hardlink file in your new directory pointing to the original file +1. copy the manifest file and truncate the size (you can read the comments of `DB::GetLiveFiles()` for more information) +1. call `DB::EnableFileDeletions()` +1. now you can open the snapshot directory in another process to access those files. Please remember to delete the directory after reading the data to allow those files to be recycled. + +By the way, the best way to ask those questions is in our [facebook group](https://www.facebook.com/groups/rocksdb.dev/). Let us know if you need any further help. + +**[Darshan](darshan.ghumare@gmail.com)** + +Will this consistency problem of RocksDB all occurs in case of single put/write? +What all ACID properties is supported by RocksDB, only durability irrespective of single or batch write? + +**[Siying Dong](siying.d@fb.com)** + +We recently [introduced optimistic transaction](https://reviews.facebook.net/D33435) which can help you ensure all of ACID. + +This blog post is mainly about optimizations in implementation. The RocksDB consistency semantic is not changed. diff --git a/src/rocksdb/docs/_posts/2014-05-19-rocksdb-3-0-release.markdown b/src/rocksdb/docs/_posts/2014-05-19-rocksdb-3-0-release.markdown new file mode 100644 index 000000000..61c90dc93 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-05-19-rocksdb-3-0-release.markdown @@ -0,0 +1,24 @@ +--- +title: RocksDB 3.0 release +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/557/rocksdb-3-0-release/ +--- + +Check out new RocksDB release on [Github](https://github.com/facebook/rocksdb/releases/tag/3.0.fb)! + +New features in RocksDB 3.0: + + * [Column Family support](https://github.com/facebook/rocksdb/wiki/Column-Families) + + + * [Ability to chose different checksum function](https://github.com/facebook/rocksdb/commit/0afc8bc29a5800e3212388c327c750d32e31f3d6) + + + * Deprecated ReadOptions::prefix_seek and ReadOptions::prefix + +<!--truncate--> + +Check out the full [change log](https://github.com/facebook/rocksdb/blob/3.0.fb/HISTORY.md). diff --git a/src/rocksdb/docs/_posts/2014-05-22-rocksdb-3-1-release.markdown b/src/rocksdb/docs/_posts/2014-05-22-rocksdb-3-1-release.markdown new file mode 100644 index 000000000..30156742b --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-05-22-rocksdb-3-1-release.markdown @@ -0,0 +1,20 @@ +--- +title: RocksDB 3.1 release +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/575/rocksdb-3-1-release/ +--- + +Check out the new release on [Github](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.1)! + +New features in RocksDB 3.1: + + * [Materialized hash index](https://github.com/facebook/rocksdb/commit/0b3d03d026a7248e438341264b4c6df339edc1d7) + + + * [FIFO compaction style](https://github.com/facebook/rocksdb/wiki/FIFO-compaction-style) + + +We released 3.1 so fast after 3.0 because one of our internal customers needed materialized hash index. diff --git a/src/rocksdb/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown b/src/rocksdb/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown new file mode 100644 index 000000000..6a641f233 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown @@ -0,0 +1,47 @@ +--- +title: PlainTable — A New File Format +layout: post +author: sdong +category: blog +redirect_from: + - /blog/599/plaintable-a-new-file-format/ +--- + +In this post, we are introducing "PlainTable" -- a file format we designed for RocksDB, initially to satisfy a production use case at Facebook. + +Design goals: + +1. All data stored in memory, in files stored in tmpfs/ramfs. Support DBs larger than 100GB (may be sharded across multiple RocksDB instance). +1. Optimize for [prefix hashing](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Siying-Prefix-Hash.pdf) +1. Less than or around 1 micro-second average latency for single Get() or Seek(). +1. Minimize memory consumption. +1. Queries efficiently return empty results + +<!--truncate--> + +Notice that our priority was not to maximize query performance, but to strike a balance between query performance and memory consumption. PlainTable query performance is not as good as you would see with a nicely-designed hash table, but they are of the same order of magnitude, while keeping memory overhead to a minimum. + +Since we are targeting micro-second latency, it is on the level of the number of CPU cache misses (if they cannot be parallellized, which are usually the case for index look-ups). On our target hardware with Intel CPUs of multiple sockets with NUMA, we can only allow 4-5 CPU cache misses (including costs of data TLB). + +To meet our requirements, given that only hash prefix iterating is needed, we made two decisions: + +1. to use a hash index, which is +1. directly addressed to rows, with no block structure. + +Having addressed our latency goal, the next task was to design a very compact hash index to minimize memory consumption. Some tricks we used to meet this goal: + +1. We only use 32-bit integers for data and index offsets.The first bit serves as a flag, so we can avoid using 8-byte pointers. +1. We never copy keys or parts of keys to index search structures. We store only offsets from which keys can be retrieved, to make comparisons with search keys. +1. Since our file is immutable, we can accurately estimate the number of hash buckets needed. + +To make sure the format works efficiently with empty queries, we added a bloom filter check before the query. This adds only one cache miss for non-empty cases [1], but avoids multiple cache misses for most empty results queries. This is a good trade-off for use cases with a large percentage of empty results. + +These are the design goals and basic ideas of PlainTable file format. For detailed information, see [this wiki page](https://github.com/facebook/rocksdb/wiki/PlainTable-Format). + +[1] Bloom filter checks typically require multiple memory access. However, because they are independent, they usually do not make the CPU pipeline stale. In any case, we improved the bloom filter to improve data locality - we may cover this further in a future blog post. + +### Comments + +**[Siying Dong](siying.d@fb.com)** + +Does [http://rocksdb.org/feed/](http://rocksdb.org/feed/) work? diff --git a/src/rocksdb/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown b/src/rocksdb/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown new file mode 100644 index 000000000..4411c7ae3 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown @@ -0,0 +1,89 @@ +--- +title: Avoid Expensive Locks in Get() +layout: post +author: leijin +category: blog +redirect_from: + - /blog/677/avoid-expensive-locks-in-get/ +--- + +As promised in the previous [blog post](blog/2014/05/14/lock.html)! + +RocksDB employs a multiversion concurrency control strategy. Before reading data, it needs to grab the current version, which is encapsulated in a data structure called [SuperVersion](https://reviews.facebook.net/rROCKSDB1fdb3f7dc60e96394e3e5b69a46ede5d67fb976c). + +<!--truncate--> + +At the beginning of `GetImpl()`, it used to do this: + + + <span class="zw-portion">mutex_.Lock(); + </span>auto* s = super_version_->Ref(); + mutex_.Unlock(); + + +The lock is necessary because pointer super_version_ may be updated, the corresponding SuperVersion may be deleted while Ref() is in progress. + + +`Ref()` simply increases the reference counter and returns “this” pointer. However, this simple operation posed big challenges for in-memory workload and stopped RocksDB from scaling read throughput beyond 8 cores. Running 32 read threads on a 32-core CPU leads to [70% system CPU usage](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf). This is outrageous! + + + + +Luckily, we found a way to circumvent this problem by using [thread local storage](http://en.wikipedia.org/wiki/Thread-local_storage). Version change is a rare event comparable to millions of read requests. On the very first Get() request, each thread pays the mutex cost to acquire a reference to the new super version. Instead of releasing the reference after use, the reference is cached in thread’s local storage. An atomic variable is used to track global super version number. Subsequent reads simply compare the local super version number against the global super version number. If they are the same, the cached super version reference may be used directly, at no cost. If a version change is detected, mutex must be acquired to update the reference. The cost of mutex lock is amortized among millions of reads and becomes negligible. + + + + +The code looks something like this: + + + + + + SuperVersion* s = thread_local_->Get(); + if (s->version_number != super_version_number_.load()) { + // slow path, cleanup of current super version is omitted + mutex_.Lock(); + s = super_version_->Ref(); + mutex_.Unlock(); + } + + + + +The result is quite amazing. RocksDB can nicely [scale to 32 cores](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf) and most CPU time is spent in user land. + + + + +Daryl Grove gives a pretty good [comparison between mutex and atomic](https://blogs.oracle.com/d/entry/the_cost_of_mutexes). However, the real cost difference lies beyond what is shown in the assembly code. Mutex can keep threads spinning on CPU or even trigger thread context switches in which all readers compete to access the critical area. Our approach prevents mutual competition by directing threads to check against a global version which does not change at high frequency, and is therefore much more cache-friendly. + + + + +The new approach entails one issue: a thread can visit GetImpl() once but can never come back again. SuperVersion is referenced and cached in its thread local storage. All resources (e.g., memtables, files) which belong to that version are frozen. A “supervisor” is required to visit each thread’s local storage and free its resources without incurring a lock. We designed a lockless sweep using CAS (compare and switch instruction). Here is how it works: + + + + +(1) A reader thread uses CAS to acquire SuperVersion from its local storage and to put in a special flag (SuperVersion::kSVInUse). + + + + +(2) Upon completion of GetImpl(), the reader thread tries to return SuperVersion to local storage by CAS, expecting the special flag (SuperVersion::kSVInUse) in its local storage. If it does not see SuperVersion::kSVInUse, that means a “sweep” was done and the reader thread is responsible for cleanup (this is expensive, but does not happen often on the hot path). + + + + +(3) After any flush/compaction, the background thread performs a sweep (CAS) across all threads’ local storage and frees encountered SuperVersion. A reader thread must re-acquire a new SuperVersion reference on its next visit. + +### Comments + +**[David Barbour](dmbarbour@gmail.com)** + +Please post an example of reading the same rocksdb concurrently. + +We are using the latest 3.0 rocksdb; however, when two separate processes +try and open the same rocksdb for reading, only one of the open requests +succeed. The other open always fails with “db/LOCK: Resource temporarily unavailable” So far we have not found an option that allows sharing the rocksdb for reads. An example would be most appreciated. diff --git a/src/rocksdb/docs/_posts/2014-06-27-rocksdb-3-2-release.markdown b/src/rocksdb/docs/_posts/2014-06-27-rocksdb-3-2-release.markdown new file mode 100644 index 000000000..e4eba6af4 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-06-27-rocksdb-3-2-release.markdown @@ -0,0 +1,30 @@ +--- +title: RocksDB 3.2 release +layout: post +author: leijin +category: blog +redirect_from: + - /blog/647/rocksdb-3-2-release/ +--- + +Check out new RocksDB release on [GitHub](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.2)! + +New Features in RocksDB 3.2: + + * PlainTable now supports a new key encoding: for keys of the same prefix, the prefix is only written once. It can be enabled through encoding_type paramter of NewPlainTableFactory() + + + * Add AdaptiveTableFactory, which is used to convert from a DB of PlainTable to BlockBasedTabe, or vise versa. It can be created using NewAdaptiveTableFactory() + +<!--truncate--> + +Public API changes: + + + * We removed seek compaction as a concept from RocksDB + + + * Add two paramters to NewHashLinkListRepFactory() for logging on too many entries in a hash bucket when flushing + + + * Added new option BlockBasedTableOptions::hash_index_allow_collision. When enabled, prefix hash index for block-based table will not store prefix and allow hash collision, reducing memory consumption diff --git a/src/rocksdb/docs/_posts/2014-07-29-rocksdb-3-3-release.markdown b/src/rocksdb/docs/_posts/2014-07-29-rocksdb-3-3-release.markdown new file mode 100644 index 000000000..d858e4faf --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-07-29-rocksdb-3-3-release.markdown @@ -0,0 +1,34 @@ +--- +title: RocksDB 3.3 Release +layout: post +author: yhciang +category: blog +redirect_from: + - /blog/1301/rocksdb-3-3-release/ +--- + +Check out new RocksDB release on [GitHub](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.3)! + +New Features in RocksDB 3.3: + + * **JSON API prototype**. + + + * **Performance improvement on HashLinkList**: We addressed performance outlier of HashLinkList caused by skewed bucket by switching data in the bucket from linked list to skip list. Add parameter threshold_use_skiplist in NewHashLinkListRepFactory(). + +<!--truncate--> + + * **More effective on storage space reclaim**: RocksDB is now able to reclaim storage space more effectively during the compaction process. This is done by compensating the size of each deletion entry by the 2X average value size, which makes compaction to be triggerred by deletion entries more easily. + + + * **TimeOut API to write**: Now WriteOptions have a variable called timeout_hint_us. With timeout_hint_us set to non-zero, any write associated with this timeout_hint_us may be aborted when it runs longer than the specified timeout_hint_us, and it is guaranteed that any write completes earlier than the specified time-out will not be aborted due to the time-out condition. + + + * **rate_limiter option**: We added an option that controls total throughput of flush and compaction. The throughput is specified in bytes/sec. Flush always has precedence over compaction when available bandwidth is constrained. + + + +Public API changes: + + + * Removed NewTotalOrderPlainTableFactory because it is not used and implemented semantically incorrect. diff --git a/src/rocksdb/docs/_posts/2014-09-12-cuckoo.markdown b/src/rocksdb/docs/_posts/2014-09-12-cuckoo.markdown new file mode 100644 index 000000000..22178f7ca --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-09-12-cuckoo.markdown @@ -0,0 +1,74 @@ +--- +title: Cuckoo Hashing Table Format +layout: post +author: radheshyam +category: blog +redirect_from: + - /blog/1427/new-bloom-filter-format/ +--- + +## Introduction + +We recently introduced a new [Cuckoo Hashing](http://en.wikipedia.org/wiki/Cuckoo_hashing) based SST file format which is optimized for fast point lookups. The new format was built for applications which require very high point lookup rates (~4Mqps) in read only mode but do not use operations like range scan, merge operator, etc. But, the existing RocksDB file formats were built to support range scan and other operations and the current best point lookup in RocksDB is 1.2 Mqps given by [PlainTable](https://github.com/facebook/rocksdb/wiki/PlainTable-Format)[ format](https://github.com/facebook/rocksdb/wiki/PlainTable-Format). This prompted a hashing based file format, which we present here. The new table format uses a cache friendly version of Cuckoo Hashing algorithm with only 1 or 2 memory accesses per lookup. + +<!--truncate--> + +Goals: + + * Reduce memory accesses per lookup to 1 or 2 + + + * Get an end to end point lookup rate of at least 4 Mqps + + + * Minimize database size + + +Assumptions: + + * Key length and value length are fixed + + + * The database is operated in read only mode + + +Non-goals: + + + * While optimizing the performance of Get() operation was our primary goal, compaction and build times were secondary. We may work on improving them in future. + + +Details for setting up the table format can be found in [GitHub](https://github.com/facebook/rocksdb/wiki/CuckooTable-Format). + + +## Cuckoo Hashing Algorithm + +In order to achieve high lookup speeds, we did multiple optimizations, including a cache friendly cuckoo hash algorithm. Cuckoo Hashing uses multiple hash functions, _h1, ..., __hn._ + +### Original Cuckoo Hashing + +To insert any new key _k_, we compute hashes of the key _h1(k), ..., __hn__(k)_. We insert the key in the first hash location that is free. If all the locations are blocked, we try to move one of the colliding keys to a different location by trying to re-insert it. + +Finding smallest set of keys to displace in order to accommodate the new key is naturally a shortest path problem in a directed graph where nodes are buckets of hash table and there is an edge from bucket _A_ to bucket _B_ if the element stored in bucket _A_ can be accommodated in bucket _B_ using one of the hash functions. The source nodes are the possible hash locations for the given key _k_ and destination is any one of the empty buckets. We use this algorithm to handle collision. + +To retrieve a key _k_, we compute hashes, _h1(k), ..., __hn__(k)_ and the key must be present in one of these locations. + +Our goal is to minimize average (and maximum) number of hash functions required and hence the number of memory accesses. In our experiments, with a hash utilization of 90%, we found that the average number of lookups is 1.8 and maximum is 3. Around 44% of keys are accommodated in first hash location and 33% in second location. + + +### Cache Friendly Cuckoo Hashing + +We noticed the following two sub-optimal properties in original Cuckoo implementation: + + + * If the key is not present in first hash location, we jump to second hash location which may not be in cache. This results in many cache misses. + + + * Because only 44% of keys are located in first cuckoo block, we couldn't have an optimal prefetching strategy - prefetching all hash locations for a key is wasteful. But prefetching only the first hash location helps only 44% of cases. + + + +The solution is to insert more keys near first location. In case of collision in the first hash location - _h1(k)_, we try to insert it in next few buckets, _h1(k)+1, _h1(k)+2, _..., h1(k)+t-1_. If all of these _t_ locations are occupied, we skip over to next hash function _h2_ and repeat the process. We call the set of _t_ buckets as a _Cuckoo Block_. We chose _t_ such that size of a block is not bigger than a cache line and we prefetch the first cuckoo block. + + +With the new algorithm, for 90% hash utilization, we found that 85% of keys are accommodated in first Cuckoo Block. Prefetching the first cuckoo block yields best results. For a database of 100 million keys with key length 8 and value length 4, the hash algorithm alone can achieve 9.6 Mqps and we are working on improving it further. End to end RocksDB performance results can be found [here](https://github.com/facebook/rocksdb/wiki/CuckooTable-Format). diff --git a/src/rocksdb/docs/_posts/2014-09-12-new-bloom-filter-format.markdown b/src/rocksdb/docs/_posts/2014-09-12-new-bloom-filter-format.markdown new file mode 100644 index 000000000..96fa50a40 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-09-12-new-bloom-filter-format.markdown @@ -0,0 +1,52 @@ +--- +title: New Bloom Filter Format +layout: post +author: zagfox +category: blog +redirect_from: + - /blog/1367/cuckoo/ +--- + +## Introduction + +In this post, we are introducing "full filter block" --- a new bloom filter format for [block based table](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format). This could bring about 40% of improvement for key query under in-memory (all data stored in memory, files stored in tmpfs/ramfs, an [example](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks) workload. The main idea behind is to generate a big filter that covers all the keys in SST file to avoid lots of unnecessary memory look ups. + + +<!--truncate--> + +## What is Bloom Filter + +In brief, [bloom filter](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter) is a bits array generated for a set of keys that could tell if an arbitrary key may exist in that set. + +In RocksDB, we generate such a bloom filter for each SST file. When we conduct a query for a key, we first goes to the bloom filter block of SST file. If key may exist in filter, we goes into data block in SST file to search for the key. If not, we would return directly. So it could help speed up point look up operation a lot. + +## Original Bloom Filter Format + +Original bloom filter creates filters for each individual data block in SST file. It has complex structure (ref [here](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format#filter-meta-block)) which results in a lot of non-adjacent memory look ups. + +Here's the work flow for checking original bloom filter in block based table: + +1. Given the target key, we goes to the index block to get the "data block ID" where this key may reside. +1. Using the "data block ID", we goes to the filter block and get the correct "offset of filter". +1. Using the "offset of filter", we goes to the actual filter and do the checking. + +## New Bloom Filter Format + +New bloom filter creates filter for all keys in SST file and we name it "full filter". The data structure of full filter is very simple, there is just one big filter: + + [ full filter ] + +In this way, the work flow of bloom filter checking is much simplified. + +(1) Given the target key, we goes directly to the filter block and conduct the filter checking. + +To be specific, there would be no checking for index block and no address jumping inside of filter block. + +Though it is a big filter, the total filter size would be the same as the original filter. + +One little draw back is that the new bloom filter introduces more memory consumption when building SST file because we need to buffer keys (or their hashes) before generating filter. Original filter just creates a bunch of small filters so it just buffer a small amount of keys. For full filter, we buffer hashes of all keys, which would take more memory when SST file size increases. + + +## Usage & Customization + +You can refer to the document here for [usage](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#usage-of-new-bloom-filter) and [customization](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#customize-your-own-filterpolicy). diff --git a/src/rocksdb/docs/_posts/2014-09-15-rocksdb-3-5-release.markdown b/src/rocksdb/docs/_posts/2014-09-15-rocksdb-3-5-release.markdown new file mode 100644 index 000000000..1878a5a56 --- /dev/null +++ b/src/rocksdb/docs/_posts/2014-09-15-rocksdb-3-5-release.markdown @@ -0,0 +1,38 @@ +--- +title: RocksDB 3.5 Release! +layout: post +author: leijin +category: blog +redirect_from: + - /blog/1547/rocksdb-3-5-release/ +--- + +New RocksDB release - 3.5! + + +**New Features** + + + 1. Add include/utilities/write_batch_with_index.h, providing a utility class to query data out of WriteBatch when building it. + + + 2. new ReadOptions.total_order_seek to force total order seek when block-based table is built with hash index. + +<!--truncate--> + +**Public API changes** + + + 1. The Prefix Extractor used with V2 compaction filters is now passed user key to SliceTransform::Transform instead of unparsed RocksDB key. + + + 2. Move BlockBasedTable related options to BlockBasedTableOptions from Options. Change corresponding JNI interface. Options affected include: no_block_cache, block_cache, block_cache_compressed, block_size, block_size_deviation, block_restart_interval, filter_policy, whole_key_filtering. filter_policy is changed to shared_ptr from a raw pointer. + + + 3. Remove deprecated options: disable_seek_compaction and db_stats_log_interval + + + 4. OptimizeForPointLookup() takes one parameter for block cache size. It now builds hash index, bloom filter, and block cache. + + +[https://github.com/facebook/rocksdb/releases/tag/v3.5](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.5) diff --git a/src/rocksdb/docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown b/src/rocksdb/docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown new file mode 100644 index 000000000..f18de0bbc --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown @@ -0,0 +1,112 @@ +--- +title: Migrating from LevelDB to RocksDB +layout: post +author: lgalanis +category: blog +redirect_from: + - /blog/1811/migrating-from-leveldb-to-rocksdb-2/ +--- + +If you have an existing application that uses LevelDB and would like to migrate to using RocksDB, one problem you need to overcome is to map the options for LevelDB to proper options for RocksDB. As of release 3.9 this can be automatically done by using our option conversion utility found in rocksdb/utilities/leveldb_options.h. What is needed, is to first replace `leveldb::Options` with `rocksdb::LevelDBOptions`. Then, use `rocksdb::ConvertOptions( )` to convert the `LevelDBOptions` struct into appropriate RocksDB options. Here is an example: + +<!--truncate--> + +LevelDB code: + +```c++ +#include <string> +#include "leveldb/db.h" + +using namespace leveldb; + +int main(int argc, char** argv) { + DB *db; + + Options opt; + opt.create_if_missing = true; + opt.max_open_files = 1000; + opt.block_size = 4096; + + Status s = DB::Open(opt, "/tmp/mydb", &db); + + delete db; +} +``` + +RocksDB code: + +```c++ +#include <string> +#include "rocksdb/db.h" +#include "rocksdb/utilities/leveldb_options.h" + +using namespace rocksdb; + +int main(int argc, char** argv) { + DB *db; + + LevelDBOptions opt; + opt.create_if_missing = true; + opt.max_open_files = 1000; + opt.block_size = 4096; + + Options rocksdb_options = ConvertOptions(opt); + // add rocksdb specific options here + + Status s = DB::Open(rocksdb_options, "/tmp/mydb_rocks", &db); + + delete db; +} +``` + +The difference is: + +```diff +-#include "leveldb/db.h" ++#include "rocksdb/db.h" ++#include "rocksdb/utilities/leveldb_options.h" + +-using namespace leveldb; ++using namespace rocksdb; + +- Options opt; ++ LevelDBOptions opt; + +- Status s = DB::Open(opt, "/tmp/mydb", &db); ++ Options rocksdb_options = ConvertOptions(opt); ++ // add rockdb specific options here ++ ++ Status s = DB::Open(rocksdb_options, "/tmp/mydb_rocks", &db); +``` + +Once you get up and running with RocksDB you can then focus on tuning RocksDB further by modifying the converted options struct. + +The reason why ConvertOptions is handy is because a lot of individual options in RocksDB have moved to other structures in different components. For example, block_size is not available in struct rocksdb::Options. It resides in struct rocksdb::BlockBasedTableOptions, which is used to create a TableFactory object that RocksDB uses internally to create the proper TableBuilder objects. If you were to write your application from scratch it would look like this: + +RocksDB code from scratch: + +```c++ +#include <string> +#include "rocksdb/db.h" +#include "rocksdb/table.h" + +using namespace rocksdb; + +int main(int argc, char** argv) { + DB *db; + + Options opt; + opt.create_if_missing = true; + opt.max_open_files = 1000; + + BlockBasedTableOptions topt; + topt.block_size = 4096; + opt.table_factory.reset(NewBlockBasedTableFactory(topt)); + + Status s = DB::Open(opt, "/tmp/mydb_rocks", &db); + + delete db; +} +``` + +The LevelDBOptions utility can ease migration to RocksDB from LevelDB and allows us to break down the various options across classes as it is needed. diff --git a/src/rocksdb/docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown b/src/rocksdb/docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown new file mode 100644 index 000000000..cddc0dd01 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown @@ -0,0 +1,41 @@ +--- +title: Reading RocksDB options from a file +layout: post +author: lgalanis +category: blog +redirect_from: + - /blog/1883/reading-rocksdb-options-from-a-file/ +--- + +RocksDB options can be provided using a file or any string to RocksDB. The format is straightforward: `write_buffer_size=1024;max_write_buffer_number=2`. Any whitespace around `=` and `;` is OK. Moreover, options can be nested as necessary. For example `BlockBasedTableOptions` can be nested as follows: `write_buffer_size=1024; max_write_buffer_number=2; block_based_table_factory={block_size=4k};`. Similarly any white space around `{` or `}` is ok. Here is what it looks like in code: + +<!--truncate--> + +```c++ +#include <string> +#include "rocksdb/db.h" +#include "rocksdb/table.h" +#include "rocksdb/utilities/convenience.h" + +using namespace rocksdb; + +int main(int argc, char** argv) { + DB *db; + + Options opt; + + std::string options_string = + "create_if_missing=true;max_open_files=1000;" + "block_based_table_factory={block_size=4096}"; + + Status s = GetDBOptionsFromString(opt, options_string, &opt); + + s = DB::Open(opt, "/tmp/mydb_rocks", &db); + + // use db + + delete db; +} +``` + +Using `GetDBOptionsFromString` is a convenient way of changing options for your RocksDB application without needing to resort to recompilation or tedious command line parsing. diff --git a/src/rocksdb/docs/_posts/2015-02-27-write-batch-with-index.markdown b/src/rocksdb/docs/_posts/2015-02-27-write-batch-with-index.markdown new file mode 100644 index 000000000..7f9f77653 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-02-27-write-batch-with-index.markdown @@ -0,0 +1,20 @@ +--- +title: 'WriteBatchWithIndex: Utility for Implementing Read-Your-Own-Writes' +layout: post +author: sdong +category: blog +redirect_from: + - /blog/1901/write-batch-with-index/ +--- + +RocksDB can be used as a storage engine of a higher level database. In fact, we are currently plugging RocksDB into MySQL and MongoDB as one of their storage engines. RocksDB can help with guaranteeing some of the ACID properties: durability is guaranteed by RocksDB by design; while consistency and isolation need to be enforced by concurrency controls on top of RocksDB; Atomicity can be implemented by committing a transaction's writes with one write batch to RocksDB in the end. + +<!--truncate--> + +However, if we enforce atomicity by only committing all writes in the end of the transaction in one batch, you cannot get the updated value from RocksDB previously written by the same transaction (read-your-own-write). To read the updated value, the databases on top of RocksDB need to maintain an internal buffer for all the written keys, and when a read happens they need to merge the result from RocksDB and from this buffer. This is a problem we faced when building the RocksDB storage engine in MongoDB. We solved it by creating a utility class, WriteBatchWithIndex (a write batch with a searchable index) and made it part of public API so that the community can also benefit from it. + +Before talking about the index part, let me introduce write batch first. The write batch class, `WriteBatch`, is a RocksDB data structure for atomic writes of multiple keys. Users can buffer their updates to a `WriteBatch` by calling `write_batch.Put("key1", "value1")` or `write_batch.Delete("key2")`, similar as calling RocksDB's functions of the same names. In the end, they call `db->Write(write_batch)` to atomically update all those batched operations to the DB. It is how a database can guarantee atomicity, as shown above. Adding a searchable index to `WriteBatch`, we now have `WriteBatchWithIndex`. Users can put updates to WriteBatchIndex in the same way as to `WriteBatch`. In the end, users can get a `WriteBatch` object from it and issue `db->Write()`. Additionally, users can create an iterator of a WriteBatchWithIndex, seek to any key location and iterate from there. + +To implement read-your-own-write using `WriteBatchWithIndex`, every time the user creates a transaction, we create a `WriteBatchWithIndex` attached to it. All the writes of the transaction go to the `WriteBatchWithIndex` first. When we commit the transaction, we atomically write the batch to RocksDB. When the user wants to call `Get()`, we first check if the value exists in the `WriteBatchWithIndex` and return the value if existing, by seeking and reading from an iterator of the write batch, before checking data in RocksDB. For example, here is the we implement it in MongoDB's RocksDB storage engine: [link](https://github.com/mongodb/mongo/blob/a31cc114a89a3645e97645805ba77db32c433dce/src/mongo/db/storage/rocks/rocks_recovery_unit.cpp#L245-L260). If a range query comes, we pass a DB's iterator to `WriteBatchWithIndex`, which creates a super iterator which combines the results from the DB iterator with the batch's iterator. Using this super iterator, we can iterate the DB with the transaction's own writes. Here is the iterator creation codes in MongoDB's RocksDB storage engine: [link](https://github.com/mongodb/mongo/blob/a31cc114a89a3645e97645805ba77db32c433dce/src/mongo/db/storage/rocks/rocks_recovery_unit.cpp#L266-L269). In this way, the database can solve the read-your-own-write problem by using RocksDB to handle a transaction's uncommitted writes. + +Using `WriteBatchWithIndex`, we successfully implemented read-your-own-writes in the RocksDB storage engine of MongoDB. If you also have a read-your-own-write problem, `WriteBatchWithIndex` can help you implement it quickly and correctly. diff --git a/src/rocksdb/docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown b/src/rocksdb/docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown new file mode 100644 index 000000000..1ffe2c532 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown @@ -0,0 +1,16 @@ +--- +title: Integrating RocksDB with MongoDB +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/1967/integrating-rocksdb-with-mongodb-2/ +--- + +Over the last couple of years, we have been busy integrating RocksDB with various services here at Facebook that needed to store key-value pairs locally. We have also seen other companies using RocksDB as local storage components of their distributed systems. + +<!--truncate--> + +The next big challenge for us is to bring RocksDB storage engine to general purpose databases. Today we have an exciting milestone to share with our community! We're running MongoDB with RocksDB in production and seeing great results! You can read more about it here: [http://blog.parse.com/announcements/mongodb-rocksdb-parse/](http://blog.parse.com/announcements/mongodb-rocksdb-parse/) + +Keep tuned for benchmarks and more stability and performance improvements. diff --git a/src/rocksdb/docs/_posts/2015-06-12-rocksdb-in-osquery.markdown b/src/rocksdb/docs/_posts/2015-06-12-rocksdb-in-osquery.markdown new file mode 100644 index 000000000..f3a55faae --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-06-12-rocksdb-in-osquery.markdown @@ -0,0 +1,10 @@ +--- +title: RocksDB in osquery +layout: post +author: icanadi +category: lgalanis +redirect_from: + - /blog/1997/rocksdb-in-osquery/ +--- + +Check out [this](https://code.facebook.com/posts/1411870269134471/how-rocksdb-is-used-in-osquery/) blog post by [Mike Arpaia](https://www.facebook.com/mike.arpaia) and [Ted Reed](https://www.facebook.com/treeded) about how osquery leverages RocksDB to build an embedded pub-sub system. This article is a great read and contains insights on how to properly use RocksDB. diff --git a/src/rocksdb/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown b/src/rocksdb/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown new file mode 100644 index 000000000..b3e2703fc --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown @@ -0,0 +1,92 @@ +--- +title: RocksDB 2015 H2 roadmap +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/2015/rocksdb-2015-h2-roadmap/ +--- + +Every 6 months, RocksDB team gets together to prioritize the work ahead of us. We just went through this exercise and we wanted to share the results with the community. Here's what RocksDB team will be focusing on for the next 6 months: + +<!--truncate--> + +**MyRocks** + +As you might know, we're working hard to integrate RocksDB as a storage engine for MySQL. This project is pretty important for us because we're heavy users of MySQL. We're already getting pretty good performance results, but there is more work to be done. We need to focus on both performance and stability. The most high priority items on are list are: + + + + + 1. Reduce CPU costs of RocksDB as a MySQL storage engine + + + 2. Implement pessimistic concurrency control to support repeatable read isolation level in MyRocks + + + 3. Reduce P99 read latency, which is high mostly because of lingering tombstones + + + 4. Port ZSTD compression + + +**MongoRocks** + +Another database that we're working on is MongoDB. The project of integrating MongoDB with RocksDB storage engine is called MongoRocks. It's already running in production at Parse [1] and we're seeing surprisingly few issues. Our plans for the next half: + + + + + 1. Keep improving performance and stability, possibly reuse work done on MyRocks (workloads are pretty similar). + + + 2. Increase internal and external adoption. + + + 3. Support new MongoDB 3.2. + + +**RocksDB on cheaper storage media** + +Up to now, our mission was to build the best key-value store “for fast storage” (flash and in-memory). However, there are some use-cases at Facebook that don't need expensive high-end storage. In the next six months, we plan to deploy RocksDB on cheaper storage media. We will optimize performance to RocksDB on either or both: + + + + + 1. Hard drive storage array. + + + 2. Tiered Storage. + + +**Quality of Service** + +When talking to our customers, there are couple of issues that keep reoccurring. We need to fix them to make our customers happy. We will improve RocksDB to provide better assurance of performance and resource usage. Non-exhaustive list includes: + + + + + 1. Iterate P99 can be high due to the presence of tombstones. + + + 2. Write stalls can happen during high write loads. + + + 3. Better control of memory and disk usage. + + + 4. Service quality and performance of backup engine. + + +**Operation's user experience** + +As we increase deployment of RocksDB, engineers are spending more time on debugging RocksDB issues. We plan to improve user experience when running RocksDB. The goal is to reduce TTD (time-to-debug). The work includes monitoring, visualizations and documentations. + +[1]( http://blog.parse.com/announcements/mongodb-rocksdb-parse/](http://blog.parse.com/announcements/mongodb-rocksdb-parse/) + + +### Comments + +**[Mike](allspace2012@outlook.com)** + +What’s the status of this roadmap? “RocksDB on cheaper storage media”, has this been implemented? diff --git a/src/rocksdb/docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown b/src/rocksdb/docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown new file mode 100644 index 000000000..fe7b7b268 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown @@ -0,0 +1,78 @@ +--- +title: Spatial indexing in RocksDB +layout: post +author: icanadi +category: blog +redirect_from: + - /blog/2039/spatial-indexing-in-rocksdb/ +--- + +About a year ago, there was a need to develop a spatial database at Facebook. We needed to store and index Earth's map data. Before building our own, we looked at the existing spatial databases. They were all very good technology, but also general purpose. We could sacrifice a general-purpose API, so we thought we could build a more performant database, since it would be specifically designed for our use-case. Furthermore, we decided to build the spatial database on top of RocksDB, because we have a lot of operational experience with running and tuning RocksDB at a large scale. + +<!--truncate--> + +When we started looking at this project, the first thing that surprised us was that our planet is not that big. Earth's entire map data can fit in memory on a reasonably high-end machine. Thus, we also decided to build a spatial database optimized for memory-resident dataset. + +The first use-case of our spatial database was an experimental map renderer. As part of our project, we successfully loaded [Open Street Maps](https://www.openstreetmap.org/) dataset and hooked it up with [Mapnik](http://mapnik.org/), a map rendering engine. + +The usual Mapnik workflow is to load the map data into a SQL-based database and then define map layers with SQL statements. To render a tile, Mapnik needs to execute a couple of SQL queries. The benefit of this approach is that you don't need to reload your database when you change your map style. You can just change your SQL query and Mapnik picks it up. In our model, we decided to precompute the features we need for each tile. We need to know the map style before we create the database. However, when rendering the map tile, we only fetch the features that we need to render. + +We haven't open sourced the RocksDB Mapnik plugin or the database loading pipeline. However, the spatial indexing is available in RocksDB under a name [SpatialDB](https://github.com/facebook/rocksdb/blob/master/include/rocksdb/utilities/spatial_db.h). The API is focused on map rendering use-case, but we hope that it can also be used for other spatial-based applications. + +Let's take a tour of the API. When you create a spatial database, you specify the spatial indexes that need to be built. Each spatial index is defined by a bounding box and granularity. For map rendering, we create a spatial index for each zoom levels. Higher zoom levels have more granularity. + + + + SpatialDB::Create( + SpatialDBOptions(), + "/data/map", { + SpatialIndexOptions("zoom10", BoundingBox(0, 0, 100, 100), 10), + SpatialIndexOptions("zoom16", BoundingBox(0, 0, 100, 100), 16) + } + ); + + + + +When you insert a feature (building, street, country border) into SpatialDB, you need to specify the list of spatial indexes that will index the feature. In the loading phase we process the map style to determine the list of zoom levels on which we'll render the feature. For example, we will not render the building on zoom level that shows an entire country. Building will only be indexed on higher zoom level's index. Country borders will be indexes on all zoom levels. + + + + FeatureSet feature; + feature.Set("type", "building"); + feature.Set("height", 6); + db->Insert(WriteOptions(), BoundingBox<double>(5, 5, 10, 10), + well_known_binary_blob, feature, {"zoom16"}); + + + + +The indexing part is pretty simple. For each feature, we first find a list of index tiles that it intersects. Then, we add a link from the tile's [quad key](https://msdn.microsoft.com/en-us/library/bb259689.aspx) to the feature's primary key. Using quad keys improves data locality, i.e. features closer together geographically will have similar quad keys. Even though we're optimizing for a memory-resident dataset, data locality is still very important due to different caching effects. + +After you're done inserting all the features, you can call an API Compact() that will compact the dataset and speed up read queries. + + + + db->Compact(); + + + + +SpatialDB's query specifies: 1) bounding box we're interested in, and 2) a zoom level. We find all tiles that intersect with the query's bounding box and return all features in those tiles. + + + + + Cursor* c = db_->Query(ReadOptions(), BoundingBox<double>(1, 1, 7, 7), "zoom16"); + for (c->Valid(); c->Next()) { + Render(c->blob(), c->feature_set()); + } + + + + +Note: `Render()` function is not part of RocksDB. You will need to use one of many open source map renderers, for example check out [Mapnik](http://mapnik.org/). + +TL;DR If you need an embedded spatial database, check out RocksDB's SpatialDB. [Let us know](https://www.facebook.com/groups/rocksdb.dev/) how we can make it better. + +If you're interested in learning more, check out this [talk](https://www.youtube.com/watch?v=T1jWsDMONM8). diff --git a/src/rocksdb/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown b/src/rocksdb/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown new file mode 100644 index 000000000..b6bb47d53 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown @@ -0,0 +1,30 @@ +--- +title: RocksDB is now available in Windows Platform +layout: post +author: dmitrism +category: blog +redirect_from: + - /blog/2033/rocksdb-is-now-available-in-windows-platform/ +--- + +Over the past 6 months we have seen a number of use cases where RocksDB is successfully used by the community and various companies to achieve high throughput and volume in a modern server environment. + +We at Microsoft Bing could not be left behind. As a result we are happy to [announce](http://bit.ly/1OmWBT9) the availability of the Windows Port created here at Microsoft which we intend to use as a storage option for one of our key/value data stores. + +<!--truncate--> + +We are happy to make this available for the community. Keep tuned for more announcements to come. + +### Comments + +**[Siying Dong](siying.d@fb.com)** + +Appreciate your contributions to RocksDB project! I believe it will benefits many users! + +**[empresas sevilla](oxofkx@gmail.com)** + +Magnifico artículo|, un placer leer el blog + +**[jak usunac](tomogedac@o2.pl)** + +I believe it will benefits too diff --git a/src/rocksdb/docs/_posts/2015-07-23-dynamic-level.markdown b/src/rocksdb/docs/_posts/2015-07-23-dynamic-level.markdown new file mode 100644 index 000000000..0ff3a0542 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-07-23-dynamic-level.markdown @@ -0,0 +1,29 @@ +--- +title: Dynamic Level Size for Level-Based Compaction +layout: post +author: sdong +category: blog +redirect_from: + - /blog/2207/dynamic-level/ +--- + +In this article, we follow up on the first part of an answer to one of the questions in our [AMA](https://www.reddit.com/r/IAmA/comments/3de3cv/we_are_rocksdb_engineering_team_ask_us_anything/ct4a8tb), the dynamic level size in level-based compaction. + +<!--truncate--> + +Level-based compaction is the original LevelDB compaction style and one of the two major compaction styles in RocksDB (See [our wiki](https://github.com/facebook/rocksdb/wiki/RocksDB-Basics#multi-threaded-compactions)). In RocksDB we introduced parallelism and more configurable options to it but the main algorithm stayed the same, until we recently introduced the dynamic level size mode. + + +In level-based compaction, we organize data to different sorted runs, called levels. Each level has a target size. Usually target size of levels increases by the same size multiplier. For example, you can set target size of level 1 to be 1GB, and size multiplier to be 10, and the target size of level 1, 2, 3, 4 will be 1GB, 10GB, 100GB and 1000GB. Before level 1, there will be some staging file flushed from mem tables, called Level 0 files, which will later be merged to level 1. Compactions will be triggered as soon as actual size of a level exceeds its target size. We will merge a subset of data of that level to next level, to reduce size of the level. More compactions will be triggered until sizes of all the levels are lower than their target sizes. In a steady state, the size of each level will be around the same size of the size of level targets. + + +Level-based compaction’s advantage is its good space efficiency. We usually use the metric space amplification to measure the space efficiency. In this article ignore the effects of data compression so space amplification= size_on_file_system / size_of_user_data. + + +How do we estimate space amplification of level-based compaction? We focus specifically on the databases in steady state, which means database size is stable or grows slowly over time. This means updates will add roughly the same or little more data than what is removed by deletes. Given that, if we compact all the data all to the last level, the size of level will be equal as the size of last level before the compaction. On the other hand, the size of user data will be approximately the size of DB if we compact all the levels down to the last level. So the size of the last level will be a good estimation of user data size. So total size of the DB divided by the size of the last level will be a good estimation of space amplification. + + +Applying the equation, if we have four non-zero levels, their sizes are 1GB, 10GB, 100GB, 1000GB, the size amplification will be approximately (1000GB + 100GB + 10GB + 1GB) / 1000GB = 1.111, which is a very good number. However, there is a catch here: how to make sure the last level’s size is 1000GB, the same as the level’s size target? A user has to fine tune level sizes to achieve this number and will need to re-tune if DB size changes. The theoretic number 1.11 is hard to achieve in practice. In a worse case, if you have the target size of last level to be 1000GB but the user data is only 200GB, then the actual space amplification will be (200GB + 100GB + 10GB + 1GB) / 200GB = 1.555, a much worse number. + + +To solve this problem, my colleague Igor Kabiljo came up with a solution of dynamic level size target mode. You can enable it by setting options.level_compaction_dynamic_level_bytes=true. In this mode, size target of levels are changed dynamically based on size of the last level. Suppose the level size multiplier to be 10, and the DB size is 200GB. The target size of the last level is automatically set to be the actual size of the level, which is 200GB, the second to last level’s size target will be automatically set to be size_last_level / 10 = 20GB, the third last level’s will be size_last_level/100 = 2GB, and next level to be size_last_level/1000 = 200MB. We stop here because 200MB is within the range of the first level. In this way, we can achieve the 1.111 space amplification, without fine tuning of the level size targets. More details can be found in [code comments of the option](https://github.com/facebook/rocksdb/blob/v3.11/include/rocksdb/options.h#L366-L423) in the header file. diff --git a/src/rocksdb/docs/_posts/2015-10-27-getthreadlist.markdown b/src/rocksdb/docs/_posts/2015-10-27-getthreadlist.markdown new file mode 100644 index 000000000..332a29f02 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-10-27-getthreadlist.markdown @@ -0,0 +1,193 @@ +--- +title: GetThreadList +layout: post +author: yhciang +category: blog +redirect_from: + - /blog/2261/getthreadlist/ +--- + +We recently added a new API, called `GetThreadList()`, that exposes the RocksDB background thread activity. With this feature, developers will be able to obtain the real-time information about the currently running compactions and flushes such as the input / output size, elapsed time, the number of bytes it has written. Below is an example output of `GetThreadList`. To better illustrate the example, we have put a sample output of `GetThreadList` into a table where each column represents a thread status: + +<!--truncate--> + +<table width="637" > +<tbody > +<tr style="border:2px solid #000000" > + +<td style="padding:3px" >ThreadID +</td> + +<td >140716395198208 +</td> + +<td >140716416169728 +</td> +</tr> +<tr > + +<td style="padding:3px" >DB +</td> + +<td >db1 +</td> + +<td >db2 +</td> +</tr> +<tr > + +<td style="padding:3px" >CF +</td> + +<td >default +</td> + +<td >picachu +</td> +</tr> +<tr > + +<td style="padding:3px" >ThreadType +</td> + +<td >High Pri +</td> + +<td >Low Pri +</td> +</tr> +<tr > + +<td style="padding:3px" >Operation +</td> + +<td >Flush +</td> + +<td >Compaction +</td> +</tr> +<tr > + +<td style="padding:3px" >ElapsedTime +</td> + +<td >143.459 ms +</td> + +<td >607.538 ms +</td> +</tr> +<tr > + +<td style="padding:3px" >Stage +</td> + +<td >FlushJob::WriteLevel0Table +</td> + +<td >CompactionJob::Install +</td> +</tr> +<tr > + +<td style="vertical-align:top;padding:3px" >OperationProperties +</td> + +<td style="vertical-align:top;padding:3px" > +BytesMemtables 4092938 +BytesWritten 1050701 +</td> + +<td style="vertical-align:top" > +BaseInputLevel 1 +BytesRead 4876417 +BytesWritten 4140109 +IsDeletion 0 +IsManual 0 +IsTrivialMove 0 +JobID 146 +OutputLevel 2 +TotalInputBytes 4883044 +</td> +</tr> +</tbody> +</table> + +In the above output, we can see `GetThreadList()` reports the activity of two threads: one thread running flush job (middle column) and the other thread running a compaction job (right-most column). In each thread status, it shows basic information about the thread such as thread id, it's target db / column family, and the job it is currently doing and the current status of the job. For instance, we can see thread 140716416169728 is doing compaction on the `picachu` column family in database `db2`. In addition, we can see the compaction has been running for 600 ms, and it has read 4876417 bytes out of 4883044 bytes. This indicates the compaction is about to complete. The stage property indicates which code block the thread is currently executing. For instance, thread 140716416169728 is currently running `CompactionJob::Install`, which further indicates the compaction job is almost done. + +Below we briefly describe its API. + + +## How to Enable it? + + +To enable thread-tracking of a rocksdb instance, simply set `enable_thread_tracking` to true in its DBOptions: + +```c++ +// If true, then the status of the threads involved in this DB will +// be tracked and available via GetThreadList() API. +// +// Default: false +bool enable_thread_tracking; +``` + + + +## The API + + +The GetThreadList API is defined in [include/rocksdb/env.h](https://github.com/facebook/rocksdb/blob/master/include/rocksdb/env.h#L317-L318), which is an Env +function: + +```c++ +virtual Status GetThreadList(std::vector* thread_list) +``` + +Since an Env can be shared across multiple rocksdb instances, the output of +`GetThreadList()` include the background activity of all the rocksdb instances +that using the same Env. + +The `GetThreadList()` API simply returns a vector of `ThreadStatus`, each describes +the current status of a thread. The `ThreadStatus` structure, defined in +[include/rocksdb/thread_status.h](https://github.com/facebook/rocksdb/blob/master/include/rocksdb/thread_status.h), contains the following information: + +```c++ +// An unique ID for the thread. +const uint64_t thread_id; + +// The type of the thread, it could be HIGH_PRIORITY, +// LOW_PRIORITY, and USER +const ThreadType thread_type; + +// The name of the DB instance where the thread is currently +// involved with. It would be set to empty string if the thread +// does not involve in any DB operation. +const std::string db_name; + +// The name of the column family where the thread is currently +// It would be set to empty string if the thread does not involve +// in any column family. +const std::string cf_name; + +// The operation (high-level action) that the current thread is involved. +const OperationType operation_type; + +// The elapsed time in micros of the current thread operation. +const uint64_t op_elapsed_micros; + +// An integer showing the current stage where the thread is involved +// in the current operation. +const OperationStage operation_stage; + +// A list of properties that describe some details about the current +// operation. Same field in op_properties[] might have different +// meanings for different operations. +uint64_t op_properties[kNumOperationProperties]; + +// The state (lower-level action) that the current thread is involved. +const StateType state_type; +``` + +If you are interested in the background thread activity of your RocksDB application, please feel free to give `GetThreadList()` a try :) diff --git a/src/rocksdb/docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown b/src/rocksdb/docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown new file mode 100644 index 000000000..6852b8ffa --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown @@ -0,0 +1,45 @@ +--- +title: Use Checkpoints for Efficient Snapshots +layout: post +author: rven2 +category: blog +redirect_from: + - /blog/2609/use-checkpoints-for-efficient-snapshots/ +--- + +**Checkpoint** is a feature in RocksDB which provides the ability to take a snapshot of a running RocksDB database in a separate directory. Checkpoints can be used as a point in time snapshot, which can be opened Read-only to query rows as of the point in time or as a Writeable snapshot by opening it Read-Write. Checkpoints can be used for both full and incremental backups. + +<!--truncate--> + + +The Checkpoint feature enables RocksDB to create a consistent snapshot of a given RocksDB database in the specified directory. If the snapshot is on the same filesystem as the original database, the SST files will be hard-linked, otherwise SST files will be copied. The manifest and CURRENT files will be copied. In addition, if there are multiple column families, log files will be copied for the period covering the start and end of the checkpoint, in order to provide a consistent snapshot across column families. + + + + +A Checkpoint object needs to be created for a database before checkpoints are created. The API is as follows: + + + + +`Status Create(DB* db, Checkpoint** checkpoint_ptr);` + + + + +Given a checkpoint object and a directory, the CreateCheckpoint function creates a consistent snapshot of the database in the given directory. + + + + +`Status CreateCheckpoint(const std::string& checkpoint_dir);` + + + + +The directory should not already exist and will be created by this API. The directory will be an absolute path. The checkpoint can be used as a read-only copy of the DB or can be opened as a standalone DB. When opened read/write, the SST files continue to be hard links and these links are removed when the files are obsoleted. When the user is done with the snapshot, the user can delete the directory to remove the snapshot. + + + + +Checkpoints are used for online backup in MyRocks. which is MySQL using RocksDB as the storage engine . ([MySQL on RocksDB](https://github.com/facebook/mysql-5.6)) diff --git a/src/rocksdb/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown b/src/rocksdb/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown new file mode 100644 index 000000000..b21b04fe3 --- /dev/null +++ b/src/rocksdb/docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown @@ -0,0 +1,244 @@ +--- +title: Analysis File Read Latency by Level +layout: post +author: sdong +category: blog +redirect_from: + - /blog/2537/analysis-file-read-latency-by-level/ +--- + +In many use cases of RocksDB, people rely on OS page cache for caching compressed data. With this approach, verifying effective of the OS page caching is challenging, because file system is a black box to users. + +As an example, a user can tune the DB as following: use level-based compaction, with L1 - L4 sizes to be 1GB, 10GB, 100GB and 1TB. And they reserve about 20GB memory as OS page cache, expecting level 0, 1 and 2 are mostly cached in memory, leaving only reads from level 3 and 4 requiring disk I/Os. However, in practice, it's not easy to verify whether OS page cache does exactly what we expect. For example, if we end up with doing 4 instead of 2 I/Os per query, it's not easy for users to figure out whether the it's because of efficiency of OS page cache or reading multiple blocks for a level. Analysis like it is especially important if users run RocksDB on hard drive disks, for the gap of latency between hard drives and memory is much higher than flash-based SSDs. + +<!--truncate--> + +In order to make tuning easier, we added new instrumentation to help users analysis latency distribution of file reads in different levels. If users turn DB statistics on, we always keep track of distribution of file read latency for each level. Users can retrieve the information by querying DB property “rocksdb.stats” ( [https://github.com/facebook/rocksdb/blob/v3.13.1/include/rocksdb/db.h#L315-L316](https://github.com/facebook/rocksdb/blob/v3.13.1/include/rocksdb/db.h#L315-L316) ). It will also printed out as a part of compaction summary in info logs periodically. + +The output looks like this: + + +``` +** Level 0 read latency histogram (micros): +Count: 696 Average: 489.8118 StdDev: 222.40 +Min: 3.0000 Median: 452.3077 Max: 1896.0000 +Percentiles: P50: 452.31 P75: 641.30 P99: 1068.00 P99.9: 1860.80 P99.99: 1896.00 +------------------------------------------------------ +[ 2, 3 ) 1 0.144% 0.144% +[ 18, 20 ) 1 0.144% 0.287% +[ 45, 50 ) 5 0.718% 1.006% +[ 50, 60 ) 26 3.736% 4.741% # +[ 60, 70 ) 6 0.862% 5.603% +[ 90, 100 ) 1 0.144% 5.747% +[ 120, 140 ) 2 0.287% 6.034% +[ 140, 160 ) 1 0.144% 6.178% +[ 160, 180 ) 1 0.144% 6.322% +[ 200, 250 ) 9 1.293% 7.615% +[ 250, 300 ) 45 6.466% 14.080% # +[ 300, 350 ) 88 12.644% 26.724% ### +[ 350, 400 ) 88 12.644% 39.368% ### +[ 400, 450 ) 71 10.201% 49.569% ## +[ 450, 500 ) 65 9.339% 58.908% ## +[ 500, 600 ) 74 10.632% 69.540% ## +[ 600, 700 ) 92 13.218% 82.759% ### +[ 700, 800 ) 64 9.195% 91.954% ## +[ 800, 900 ) 35 5.029% 96.983% # +[ 900, 1000 ) 12 1.724% 98.707% +[ 1000, 1200 ) 6 0.862% 99.569% +[ 1200, 1400 ) 2 0.287% 99.856% +[ 1800, 2000 ) 1 0.144% 100.000% + +** Level 1 read latency histogram (micros): +(......not pasted.....) + +** Level 2 read latency histogram (micros): +(......not pasted.....) + +** Level 3 read latency histogram (micros): +(......not pasted.....) + +** Level 4 read latency histogram (micros): +(......not pasted.....) + +** Level 5 read latency histogram (micros): +Count: 25583746 Average: 421.1326 StdDev: 385.11 +Min: 1.0000 Median: 376.0011 Max: 202444.0000 +Percentiles: P50: 376.00 P75: 438.00 P99: 1421.68 P99.9: 4164.43 P99.99: 9056.52 +------------------------------------------------------ +[ 0, 1 ) 2351 0.009% 0.009% +[ 1, 2 ) 6077 0.024% 0.033% +[ 2, 3 ) 8471 0.033% 0.066% +[ 3, 4 ) 788 0.003% 0.069% +[ 4, 5 ) 393 0.002% 0.071% +[ 5, 6 ) 786 0.003% 0.074% +[ 6, 7 ) 1709 0.007% 0.080% +[ 7, 8 ) 1769 0.007% 0.087% +[ 8, 9 ) 1573 0.006% 0.093% +[ 9, 10 ) 1495 0.006% 0.099% +[ 10, 12 ) 3043 0.012% 0.111% +[ 12, 14 ) 2259 0.009% 0.120% +[ 14, 16 ) 1233 0.005% 0.125% +[ 16, 18 ) 762 0.003% 0.128% +[ 18, 20 ) 451 0.002% 0.130% +[ 20, 25 ) 794 0.003% 0.133% +[ 25, 30 ) 1279 0.005% 0.138% +[ 30, 35 ) 1172 0.005% 0.142% +[ 35, 40 ) 1363 0.005% 0.148% +[ 40, 45 ) 409 0.002% 0.149% +[ 45, 50 ) 105 0.000% 0.150% +[ 50, 60 ) 80 0.000% 0.150% +[ 60, 70 ) 280 0.001% 0.151% +[ 70, 80 ) 1583 0.006% 0.157% +[ 80, 90 ) 4245 0.017% 0.174% +[ 90, 100 ) 6572 0.026% 0.200% +[ 100, 120 ) 9724 0.038% 0.238% +[ 120, 140 ) 3713 0.015% 0.252% +[ 140, 160 ) 2383 0.009% 0.261% +[ 160, 180 ) 18344 0.072% 0.333% +[ 180, 200 ) 51873 0.203% 0.536% +[ 200, 250 ) 631722 2.469% 3.005% +[ 250, 300 ) 2721970 10.639% 13.644% ## +[ 300, 350 ) 5909249 23.098% 36.742% ##### +[ 350, 400 ) 6522507 25.495% 62.237% ##### +[ 400, 450 ) 4296332 16.793% 79.030% ### +[ 450, 500 ) 2130323 8.327% 87.357% ## +[ 500, 600 ) 1553208 6.071% 93.428% # +[ 600, 700 ) 642129 2.510% 95.938% # +[ 700, 800 ) 372428 1.456% 97.394% +[ 800, 900 ) 187561 0.733% 98.127% +[ 900, 1000 ) 85858 0.336% 98.462% +[ 1000, 1200 ) 82730 0.323% 98.786% +[ 1200, 1400 ) 50691 0.198% 98.984% +[ 1400, 1600 ) 38026 0.149% 99.133% +[ 1600, 1800 ) 32991 0.129% 99.261% +[ 1800, 2000 ) 30200 0.118% 99.380% +[ 2000, 2500 ) 62195 0.243% 99.623% +[ 2500, 3000 ) 36684 0.143% 99.766% +[ 3000, 3500 ) 21317 0.083% 99.849% +[ 3500, 4000 ) 10216 0.040% 99.889% +[ 4000, 4500 ) 8351 0.033% 99.922% +[ 4500, 5000 ) 4152 0.016% 99.938% +[ 5000, 6000 ) 6328 0.025% 99.963% +[ 6000, 7000 ) 3253 0.013% 99.976% +[ 7000, 8000 ) 2082 0.008% 99.984% +[ 8000, 9000 ) 1546 0.006% 99.990% +[ 9000, 10000 ) 1055 0.004% 99.994% +[ 10000, 12000 ) 1566 0.006% 100.000% +[ 12000, 14000 ) 761 0.003% 100.003% +[ 14000, 16000 ) 462 0.002% 100.005% +[ 16000, 18000 ) 226 0.001% 100.006% +[ 18000, 20000 ) 126 0.000% 100.006% +[ 20000, 25000 ) 107 0.000% 100.007% +[ 25000, 30000 ) 43 0.000% 100.007% +[ 30000, 35000 ) 15 0.000% 100.007% +[ 35000, 40000 ) 14 0.000% 100.007% +[ 40000, 45000 ) 16 0.000% 100.007% +[ 45000, 50000 ) 1 0.000% 100.007% +[ 50000, 60000 ) 22 0.000% 100.007% +[ 60000, 70000 ) 10 0.000% 100.007% +[ 70000, 80000 ) 5 0.000% 100.007% +[ 80000, 90000 ) 14 0.000% 100.007% +[ 90000, 100000 ) 11 0.000% 100.007% +[ 100000, 120000 ) 33 0.000% 100.007% +[ 120000, 140000 ) 6 0.000% 100.007% +[ 140000, 160000 ) 3 0.000% 100.007% +[ 160000, 180000 ) 7 0.000% 100.007% +[ 200000, 250000 ) 2 0.000% 100.007% +``` + + +In this example, you can see we only issued 696 reads from level 0 while issued 25 million reads from level 5. The latency distribution is also clearly shown among those reads. This will be helpful for users to analysis OS page cache efficiency. + +Currently the read latency per level includes reads from data blocks, index blocks, as well as bloom filter blocks. We are also working on a feature to break down those three type of blocks. + +### Comments + +**[Tao Feng](fengtao04@gmail.com)** + +Is this feature also included in RocksJava? + +**[Siying Dong](siying.d@fb.com)** + +Should be. As long as you enable statistics, you should be able to get the value from `RocksDB.getProperty()` with property `rocksdb.dbstats`. Let me know if you can’t find it. + +**[chiddu](cnbscience@gmail.com)** + +> In this example, you can see we only issued 696 reads from level 0 while issued 256K reads from level 5. + +Isn’t it 2.5 M of reads instead of 256K ? . + +Also could anyone please provide more description on the histogram ? especially + +> Count: 25583746 Average: 421.1326 StdDev: 385.11 +> Min: 1.0000 Median: 376.0011 Max: 202444.0000 +> Percentiles: P50: 376.00 P75: 438.00 P99: 1421.68 P99.9: 4164.43 P99.99: 9056.52 + +and + +> [ 0, 1 ) 2351 0.009% 0.009% +> [ 1, 2 ) 6077 0.024% 0.033% +> [ 2, 3 ) 8471 0.033% 0.066% +> [ 3, 4 ) 788 0.003% 0.069%” + +thanks in advance + +**[Siying Dong](siying.d@fb.com)** + +Thank you for pointing out the mistake. I fixed it now. + +In this output, there are 2.5 million samples, average latency is 421 micro seconds, with standard deviation 385. Median is 376, max value is 202 milliseconds. 0.009% has value of 1, 0.024% has value of 1, 0.033% has value of 2. Accumulated value from 0 to 2 is 0.066%. + +Hope it helps. + +**[chiddu](cnbscience@gmail.com)** + +Thank you Siying for the quick reply, I was running couple of benchmark testing to check the performance of rocksdb on SSD. One of the test is similar to what is mentioned in the wiki, TEST 4 : Random read , except the key_size is 10 and value_size is 20. I am inserting 1 billion hashes and reading 1 billion hashes with 32 threads. The histogram shows something like this + +``` +Level 5 read latency histogram (micros): +Count: 7133903059 Average: 480.4357 StdDev: 309.18 +Min: 0.0000 Median: 551.1491 Max: 224142.0000 +Percentiles: P50: 551.15 P75: 651.44 P99: 996.52 P99.9: 2073.07 P99.99: 3196.32 +—————————————————— +[ 0, 1 ) 28587385 0.401% 0.401% +[ 1, 2 ) 686572516 9.624% 10.025% ## +[ 2, 3 ) 567317522 7.952% 17.977% ## +[ 3, 4 ) 44979472 0.631% 18.608% +[ 4, 5 ) 50379685 0.706% 19.314% +[ 5, 6 ) 64930061 0.910% 20.224% +[ 6, 7 ) 22613561 0.317% 20.541% +…………more…………. +``` + +If I understand your previous comment correctly, + +1. How is it that the count is around 7 billion when I have only inserted 1 billion hashes ? is the stat broken ? +1. What does the percentiles and the numbers signify ? +1. 0, 1 ) 28587385 0.401% 0.401% what does this “28587385” stand for in the histogram row ? + +**[Siying Dong](siying.d@fb.com)** + +If I remember correctly, with db_bench, if you specify –num=1000000000 –threads=32, it is every thread reading one billion keys, total of 32 billions. Is it the case you ran into? + +28,587,385 means that number of data points take the value [0,1) +28,587,385 / 7,133,903,058 = 0.401% provides percentage. + +**[chiddu](cnbscience@gmail.com)** + +I do have `num=1000000000` and `t=32`. The script says reading 1 billion hashes and not 32 billion hashes. + +this is the script on which I have used + +``` +echo “Load 1B keys sequentially into database…..” +bpl=10485760;overlap=10;mcz=2;del=300000000;levels=6;ctrig=4; delay=8; stop=12; wbn=3; mbc=20; mb=67108864;wbs=134217728; dds=1; sync=0; r=1000000000; t=1; vs=20; bs=4096; cs=1048576; of=500000; si=1000000; ./db_bench –benchmarks=fillseq –disable_seek_compaction=1 –mmap_read=0 –statistics=1 –histogram=1 –num=$r –threads=$t –value_size=$vs –block_size=$bs –cache_size=$cs –bloom_bits=10 –cache_numshardbits=6 –open_files=$of –verify_checksum=1 –db=/data/mysql/leveldb/test –sync=$sync –disable_wal=1 –compression_type=none –stats_interval=$si –compression_ratio=0.5 –disable_data_sync=$dds –write_buffer_size=$wbs –target_file_size_base=$mb –max_write_buffer_number=$wbn –max_background_compactions=$mbc –level0_file_num_compaction_trigger=$ctrig –level0_slowdown_writes_trigger=$delay –level0_stop_writes_trigger=$stop –num_levels=$levels –delete_obsolete_files_period_micros=$del –min_level_to_compress=$mcz –max_grandparent_overlap_factor=$overlap –stats_per_interval=1 –max_bytes_for_level_base=$bpl –use_existing_db=0 –key_size=10 + +echo “Reading 1B keys in database in random order….” +bpl=10485760;overlap=10;mcz=2;del=300000000;levels=6;ctrig=4; delay=8; stop=12; wbn=3; mbc=20; mb=67108864;wbs=134217728; dds=0; sync=0; r=1000000000; t=32; vs=20; bs=4096; cs=1048576; of=500000; si=1000000; ./db_bench –benchmarks=readrandom –disable_seek_compaction=1 –mmap_read=0 –statistics=1 –histogram=1 –num=$r –threads=$t –value_size=$vs –block_size=$bs –cache_size=$cs –bloom_bits=10 –cache_numshardbits=6 –open_files=$of –verify_checksum=1 –db=/some_data_base –sync=$sync –disable_wal=1 –compression_type=none –stats_interval=$si –compression_ratio=0.5 –disable_data_sync=$dds –write_buffer_size=$wbs –target_file_size_base=$mb –max_write_buffer_number=$wbn –max_background_compactions=$mbc –level0_file_num_compaction_trigger=$ctrig –level0_slowdown_writes_trigger=$delay –level0_stop_writes_trigger=$stop –num_levels=$levels –delete_obsolete_files_period_micros=$del –min_level_to_compress=$mcz –max_grandparent_overlap_factor=$overlap –stats_per_interval=1 –max_bytes_for_level_base=$bpl –use_existing_db=1 –key_size=10 +``` + +After running this script, there were no issues wrt to loading billion hashes , but when it came to reading part, its been almost 4 days and still I have only read 7 billion hashes and have read 200 million hashes in 2 and half days. Is there something which is missing in db_bench or something which I am missing ? + +**[Siying Dong](siying.d@fb.com)** + +It’s a printing error then. If you have `num=1000000000` and `t=32`, it will be 32 threads, and each reads 1 billion keys. diff --git a/src/rocksdb/docs/_posts/2016-01-29-compaction_pri.markdown b/src/rocksdb/docs/_posts/2016-01-29-compaction_pri.markdown new file mode 100644 index 000000000..ba9ee627c --- /dev/null +++ b/src/rocksdb/docs/_posts/2016-01-29-compaction_pri.markdown @@ -0,0 +1,51 @@ +--- +title: Option of Compaction Priority +layout: post +author: sdong +category: blog +redirect_from: + - /blog/2921/compaction_pri/ +--- + +The most popular compaction style of RocksDB is level-based compaction, which is an improved version of LevelDB's compaction algorithm. Page 9- 16 of this [slides](https://github.com/facebook/rocksdb/blob/gh-pages/talks/2015-09-29-HPTS-Siying-RocksDB.pdf) gives an illustrated introduction of this compaction style. The basic idea that: data is organized by multiple levels with exponential increasing target size. Except a special level 0, every level is key-range partitioned into many files. When size of a level exceeds its target size, we pick one or more of its files, and merge the file into the next level. + +<!--truncate--> + +Which file to pick to compact is an interesting question. LevelDB only uses one thread for compaction and it always picks files in round robin manner. We implemented multi-thread compaction in RocksDB by picking multiple files from the same level and compact them in parallel. We had to move away from LevelDB's file picking approach. Recently, we created an option [options.compaction_pri](https://github.com/facebook/rocksdb/blob/d6c838f1e130d8860407bc771fa6d4ac238859ba/include/rocksdb/options.h#L83-L93), which indicated three different algorithms to pick files to compact. + +Why do we need to multiple algorithms to choose from? Because there are different factors to consider when picking the files, and we now don't yet know how to balance them automatically, so we expose it to users to choose. Here are factors to consider: + +**Write amplification** + +When we estimate write amplification, we usually simplify the problem by assuming keys are uniformly distributed inside each level. In reality, it is not the case, even if user updates are uniformly distributed across the whole key range. For instance, when we compact one file of a level to the next level, it creates a hole. Over time, incoming compaction will fill data to the hole, but the density will still be lower for a while. Picking a file with keys least densely populated is more expensive to get the file to the next level, because there will be more overlapping files in the next level so we need to rewrite more data. For example, assume a file is 100MB, if an L2 file overlaps with 8 L3 files, we need to rewrite about 800MB of data to get the file to L3. If the file overlaps with 12 L3 files, we'll need to rewrite about 1200MB to get a file of the same size out of L2. It uses 50% more writes. (This analysis ignores the key density of the next level, because the range covers N times of files in that level so one hole only impacts write amplification by 1/N) + +If all the updates are uniformly distributed, LevelDB's approach optimizes write amplification, because a file being picked covers a range whose last compaction time to the next level is the oldest, so the range will accumulated keys from incoming compactions for the longest and the density is the highest. + +We created a compaction priority **kOldestSmallestSeqFirst** for the same effect. With this mode, we always pick the file covers the oldest updates in the level, which usually is contains the densest key range. If you have a use case where writes are uniformly distributed across the key space and you want to reduce write amplification, you should set options.compaction_pri=kOldestSmallestSeqFirst. + +**Optimize for small working set** + +We are assuming updates are uniformly distributed across the whole key space in previous analysis. However, in many use cases, there are subset of keys that are frequently updated while other key ranges are very cold. In this case, keeping hot key ranges from compacting to deeper levels will benefit write amplification, as well as space amplification. For example, if in a DB only key 150-160 are updated and other keys are seldom updated. If level 1 contains 20 keys, we want to keep 150-160 all stay in level 1. Because when next level 0 -> 1 compaction comes, it will simply overwrite existing keys so size level 1 doesn't increase, so no need to schedule further compaction for level 1->2. On the other hand, if we compact key 150-155 to level2, when a new Level 1->2 compaction comes, it increases the size of level 1, making size of level 1 exceed target size and more compactions will be needed, which generates more writes. + +The compaction priority **kOldestLargestSeqFirst** optimizes this use case. In this mode, we will pick a file whose latest update is the oldest. It means there is no incoming data for the range for the longest. Usually it is the coldest range. By compacting coldest range first, we leave the hot ranges in the level. If your use case is to overwrite existing keys in a small range, try options.compaction_pri=kOldestLargestSeqFirst**.** + +**Drop delete marker sooner** + +If one file contains a lot of delete markers, it may slow down iterating over this area, because we still need to iterate those deleted keys just to ignore them. Furthermore, the sooner we compact delete keys into the last level, the sooner the disk space is reclaimed, so it is good for space efficiency. + +Our default compaction priority **kByCompensatedSize** considers the case. If number of deletes in a file exceeds number of inserts, it is more likely to be picked for compaction. The more number of deletes exceed inserts, the more likely it is being compacted. The optimization is added to avoid the worst performance of space efficiency and query performance when a large percentage of the DB is deleted. + +**Efficiency of compaction filter** + +Usually people use [compaction filters](https://github.com/facebook/rocksdb/blob/v4.1/include/rocksdb/options.h#L201-L226) to clean up old data to free up space. Picking files to compact may impact space efficiency. We don't yet have a a compaction priority to optimize this case. In some of our use cases, we solved the problem in a different way: we have an external service checking modify time of all SST files. If any of the files is too old, we force the single file to compaction by calling DB::CompactFiles() using the single file. In this way, we can provide a time bound of data passing through compaction filters. + + +In all, there three choices of compaction priority modes optimizing different scenarios. if you have a new use case, we suggest you start with `options.compaction_pri=kOldestSmallestSeqFirst` (note it is not the default one for backward compatible reason). If you want to further optimize your use case, you can try other two use cases if your use cases apply. + +If you have good ideas about better compaction picker approach, you are welcome to implement and benchmark it. We'll be glad to review and merge your a pull requests. + +### Comments + +**[Mark Callaghan](mdcallag@gmail.com)** + +Performance results for compaction_pri values and linkbench are explained at [http://smalldatum.blogspot.com/2016/02/compaction-priority-in-rocksdb.html](http://smalldatum.blogspot.com/2016/02/compaction-priority-in-rocksdb.html) diff --git a/src/rocksdb/docs/_posts/2016-02-24-rocksdb-4-2-release.markdown b/src/rocksdb/docs/_posts/2016-02-24-rocksdb-4-2-release.markdown new file mode 100644 index 000000000..409015cc8 --- /dev/null +++ b/src/rocksdb/docs/_posts/2016-02-24-rocksdb-4-2-release.markdown @@ -0,0 +1,41 @@ +--- +title: RocksDB 4.2 Release! +layout: post +author: sdong +category: blog +redirect_from: + - /blog/3017/rocksdb-4-2-release/ +--- + +New RocksDB release - 4.2! + + +**New Features** + + 1. Introduce CreateLoggerFromOptions(), this function create a Logger for provided DBOptions. + + + 2. Add GetAggregatedIntProperty(), which returns the sum of the GetIntProperty of all the column families. + + + 3. Add MemoryUtil in rocksdb/utilities/memory.h. It currently offers a way to get the memory usage by type from a list rocksdb instances. + + +<!--truncate--> + + +**Public API changes** + + 1. CompactionFilter::Context includes information of Column Family ID + + + 2. The need-compaction hint given by TablePropertiesCollector::NeedCompact() will be persistent and recoverable after DB recovery. This introduces a breaking format change. If you use this experimental feature, including NewCompactOnDeletionCollectorFactory() in the new version, you may not be able to directly downgrade the DB back to version 4.0 or lower. + + + 3. TablePropertiesCollectorFactory::CreateTablePropertiesCollector() now takes an option Context, containing the information of column family ID for the file being written. + + + 4. Remove DefaultCompactionFilterFactory. + + +[https://github.com/facebook/rocksdb/releases/tag/v4.2](https://github.com/facebook/rocksdb/releases/tag/v4.2) diff --git a/src/rocksdb/docs/_posts/2016-02-25-rocksdb-ama.markdown b/src/rocksdb/docs/_posts/2016-02-25-rocksdb-ama.markdown new file mode 100644 index 000000000..2ba04f39a --- /dev/null +++ b/src/rocksdb/docs/_posts/2016-02-25-rocksdb-ama.markdown @@ -0,0 +1,20 @@ +--- +title: RocksDB AMA +layout: post +author: yhchiang +category: blog +redirect_from: + - /blog/3065/rocksdb-ama/ +--- + +RocksDB developers are doing a Reddit Ask-Me-Anything now at 10AM – 11AM PDT! We welcome you to stop by and ask any RocksDB related questions, including existing / upcoming features, tuning tips, or database design. + +Here are some enhancements that we'd like to focus on over the next six months: + +* 2-Phase Commit +* Lua support in some custom functions +* Backup and repair tools +* Direct I/O to bypass OS cache +* RocksDB Java API + +[https://www.reddit.com/r/IAmA/comments/47k1si/we_are_rocksdb_developers_ask_us_anything/](https://www.reddit.com/r/IAmA/comments/47k1si/we_are_rocksdb_developers_ask_us_anything/) diff --git a/src/rocksdb/docs/_posts/2016-03-07-rocksdb-options-file.markdown b/src/rocksdb/docs/_posts/2016-03-07-rocksdb-options-file.markdown new file mode 100644 index 000000000..703449b01 --- /dev/null +++ b/src/rocksdb/docs/_posts/2016-03-07-rocksdb-options-file.markdown @@ -0,0 +1,24 @@ +--- +title: RocksDB Options File +layout: post +author: yhciang +category: blog +redirect_from: + - /blog/3089/rocksdb-options-file/ +--- + +In RocksDB 4.3, we added a new set of features that makes managing RocksDB options easier. Specifically: + + * **Persisting Options Automatically**: Each RocksDB database will now automatically persist its current set of options into an INI file on every successful call of DB::Open(), SetOptions(), and CreateColumnFamily() / DropColumnFamily(). + + + + * **Load Options from File**: We added [LoadLatestOptions() / LoadOptionsFromFile()](https://github.com/facebook/rocksdb/blob/4.3.fb/include/rocksdb/utilities/options_util.h#L48-L58) that enables developers to construct RocksDB options object from an options file. + + + + * **Sanity Check Options**: We added [CheckOptionsCompatibility](https://github.com/facebook/rocksdb/blob/4.3.fb/include/rocksdb/utilities/options_util.h#L64-L77) that performs compatibility check on two sets of RocksDB options. + +<!--truncate--> + +Want to know more about how to use this new features? Check out the [RocksDB Options File wiki page](https://github.com/facebook/rocksdb/wiki/RocksDB-Options-File) and start using this new feature today! diff --git a/src/rocksdb/docs/_posts/2016-04-26-rocksdb-4-5-1-released.markdown b/src/rocksdb/docs/_posts/2016-04-26-rocksdb-4-5-1-released.markdown new file mode 100644 index 000000000..247768d30 --- /dev/null +++ b/src/rocksdb/docs/_posts/2016-04-26-rocksdb-4-5-1-released.markdown @@ -0,0 +1,60 @@ +--- +title: RocksDB 4.5.1 Released! +layout: post +author: sdong +category: blog +redirect_from: + - /blog/3179/rocksdb-4-5-1-released/ +--- + +## 4.5.1 (3/25/2016) + +### Bug Fixes + + * Fix failures caused by the destorying order of singleton objects. + +<br/> + +## 4.5.0 (2/5/2016) + +### Public API Changes + + * Add a new perf context level between kEnableCount and kEnableTime. Level 2 now does not include timers for mutexes. + * Statistics of mutex operation durations will not be measured by default. If you want to have them enabled, you need to set Statistics::stats_level_ to kAll. + * DBOptions::delete_scheduler and NewDeleteScheduler() are removed, please use DBOptions::sst_file_manager and NewSstFileManager() instead + +### New Features + * ldb tool now supports operations to non-default column families. + * Add kPersistedTier to ReadTier. This option allows Get and MultiGet to read only the persited data and skip mem-tables if writes were done with disableWAL = true. + * Add DBOptions::sst_file_manager. Use NewSstFileManager() in include/rocksdb/sst_file_manager.h to create a SstFileManager that can be used to track the total size of SST files and control the SST files deletion rate. + +<br/> + +<!--truncate--> + +## 4.4.0 (1/14/2016) + +### Public API Changes + + * Change names in CompactionPri and add a new one. + * Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit. + * If options.max_write_buffer_number > 3, writes will be slowed down when writing to the last write buffer to delay a full stop. + * Introduce CompactionJobInfo::compaction_reason, this field include the reason to trigger the compaction. + * After slow down is triggered, if estimated pending compaction bytes keep increasing, slowdown more. + * Increase default options.delayed_write_rate to 2MB/s. + * Added a new parameter --path to ldb tool. --path accepts the name of either MANIFEST, SST or a WAL file. Either --db or --path can be used when calling ldb. + +<br/> + +## 4.3.0 (12/8/2015) + +### New Features + + * CompactionFilter has new member function called IgnoreSnapshots which allows CompactionFilter to be called even if there are snapshots later than the key. + * RocksDB will now persist options under the same directory as the RocksDB database on successful DB::Open, CreateColumnFamily, DropColumnFamily, and SetOptions. + * Introduce LoadLatestOptions() in rocksdb/utilities/options_util.h. This function can construct the latest DBOptions / ColumnFamilyOptions used by the specified RocksDB intance. + * Introduce CheckOptionsCompatibility() in rocksdb/utilities/options_util.h. This function checks whether the input set of options is able to open the specified DB successfully. + +### Public API Changes + + * When options.db_write_buffer_size triggers, only the column family with the largest column family size will be flushed, not all the column families. diff --git a/src/rocksdb/docs/_posts/2016-07-26-rocksdb-4-8-released.markdown b/src/rocksdb/docs/_posts/2016-07-26-rocksdb-4-8-released.markdown new file mode 100644 index 000000000..b42a66e30 --- /dev/null +++ b/src/rocksdb/docs/_posts/2016-07-26-rocksdb-4-8-released.markdown @@ -0,0 +1,48 @@ +--- +title: RocksDB 4.8 Released! +layout: post +author: yiwu +category: blog +redirect_from: + - /blog/3239/rocksdb-4-8-released/ +--- + +## 4.8.0 (5/2/2016) + +### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#public-api-change-1)Public API Change + + * Allow preset compression dictionary for improved compression of block-based tables. This is supported for zlib, zstd, and lz4. The compression dictionary's size is configurable via CompressionOptions::max_dict_bytes. + * Delete deprecated classes for creating backups (BackupableDB) and restoring from backups (RestoreBackupableDB). Now, BackupEngine should be used for creating backups, and BackupEngineReadOnly should be used for restorations. For more details, see [https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F](https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F) + * Expose estimate of per-level compression ratio via DB property: "rocksdb.compression-ratio-at-levelN". + * Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status. + +### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#new-features-2)New Features + + * Add ReadOptions::readahead_size. If non-zero, NewIterator will create a new table reader which performs reads of the given size. + +<br/> + +<!--truncate--> + +## [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#470-482016)4.7.0 (4/8/2016) + +### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#public-api-change-2)Public API Change + + * rename options compaction_measure_io_stats to report_bg_io_stats and include flush too. + * Change some default options. Now default options will optimize for server-workloads. Also enable slowdown and full stop triggers for pending compaction bytes. These changes may cause sub-optimal performance or significant increase of resource usage. To avoid these risks, users can open existing RocksDB with options extracted from RocksDB option files. See [https://github.com/facebook/rocksdb/wiki/RocksDB-Options-File](https://github.com/facebook/rocksdb/wiki/RocksDB-Options-File) for how to use RocksDB option files. Or you can call Options.OldDefaults() to recover old defaults. DEFAULT_OPTIONS_HISTORY.md will track change history of default options. + +<br/> + +## [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#460-3102016)4.6.0 (3/10/2016) + +### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#public-api-changes-1)Public API Changes + + * Change default of BlockBasedTableOptions.format_version to 2. It means default DB created by 4.6 or up cannot be opened by RocksDB version 3.9 or earlier + * Added strict_capacity_limit option to NewLRUCache. If the flag is set to true, insert to cache will fail if no enough capacity can be free. Signature of Cache::Insert() is updated accordingly. + * Tickers [NUMBER_DB_NEXT, NUMBER_DB_PREV, NUMBER_DB_NEXT_FOUND, NUMBER_DB_PREV_FOUND, ITER_BYTES_READ] are not updated immediately. The are updated when the Iterator is deleted. + * Add monotonically increasing counter (DB property "rocksdb.current-super-version-number") that increments upon any change to the LSM tree. + +### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#new-features-3)New Features + + * Add CompactionPri::kMinOverlappingRatio, a compaction picking mode friendly to write amplification. + * Deprecate Iterator::IsKeyPinned() and replace it with Iterator::GetProperty() with prop_name="rocksdb.iterator.is.key.pinned" diff --git a/src/rocksdb/docs/_posts/2016-09-28-rocksdb-4-11-2-released.markdown b/src/rocksdb/docs/_posts/2016-09-28-rocksdb-4-11-2-released.markdown new file mode 100644 index 000000000..87c20eb47 --- /dev/null +++ b/src/rocksdb/docs/_posts/2016-09-28-rocksdb-4-11-2-released.markdown @@ -0,0 +1,49 @@ +--- +title: RocksDB 4.11.2 Released! +layout: post +author: sdong +category: blog +--- +We abandoned release candidates 4.10.x and directly go to 4.11.2 from 4.9, to make sure the latest release is stable. In 4.11.2, we fixed several data corruption related bugs introduced in 4.9.0. + +## 4.11.2 (9/15/2016) + +### Bug fixes + + * Segfault when failing to open an SST file for read-ahead iterators. + * WAL without data for all CFs is not deleted after recovery. + +<!--truncate--> + +## 4.11.1 (8/30/2016) + +### Bug Fixes + + * Mitigate the regression bug of deadlock condition during recovery when options.max_successive_merges hits. + * Fix data race condition related to hash index in block based table when putting indexes in the block cache. + +## 4.11.0 (8/1/2016) + +### Public API Change + + * options.memtable_prefix_bloom_huge_page_tlb_size => memtable_huge_page_size. When it is set, RocksDB will try to allocate memory from huge page for memtable too, rather than just memtable bloom filter. + +### New Features + + * A tool to migrate DB after options change. See include/rocksdb/utilities/option_change_migration.h. + * Add ReadOptions.background_purge_on_iterator_cleanup. If true, we avoid file deletion when destorying iterators. + +## 4.10.0 (7/5/2016) + +### Public API Change + + * options.memtable_prefix_bloom_bits changes to options.memtable_prefix_bloom_bits_ratio and deprecate options.memtable_prefix_bloom_probes + * enum type CompressionType and PerfLevel changes from char to unsigned char. Value of all PerfLevel shift by one. + * Deprecate options.filter_deletes. + +### New Features + + * Add avoid_flush_during_recovery option. + * Add a read option background_purge_on_iterator_cleanup to avoid deleting files in foreground when destroying iterators. Instead, a job is scheduled in high priority queue and would be executed in a separate background thread. + * RepairDB support for column families. RepairDB now associates data with non-default column families using information embedded in the SST/WAL files (4.7 or later). For data written by 4.6 or earlier, RepairDB associates it with the default column family. + * Add options.write_buffer_manager which allows users to control total memtable sizes across multiple DB instances. diff --git a/src/rocksdb/docs/_posts/2017-01-06-rocksdb-5-0-1-released.markdown b/src/rocksdb/docs/_posts/2017-01-06-rocksdb-5-0-1-released.markdown new file mode 100644 index 000000000..fb0413055 --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-01-06-rocksdb-5-0-1-released.markdown @@ -0,0 +1,26 @@ +--- +title: RocksDB 5.0.1 Released! +layout: post +author: yiwu +category: blog +--- + +### Public API Change + + * Options::max_bytes_for_level_multiplier is now a double along with all getters and setters. + * Support dynamically change `delayed_write_rate` and `max_total_wal_size` options via SetDBOptions(). + * Introduce DB::DeleteRange for optimized deletion of large ranges of contiguous keys. + * Support dynamically change `delayed_write_rate` option via SetDBOptions(). + * Options::allow_concurrent_memtable_write and Options::enable_write_thread_adaptive_yield are now true by default. + * Remove Tickers::SEQUENCE_NUMBER to avoid confusion if statistics object is shared among RocksDB instance. Alternatively DB::GetLatestSequenceNumber() can be used to get the same value. + * Options.level0_stop_writes_trigger default value changes from 24 to 32. + * New compaction filter API: CompactionFilter::FilterV2(). Allows to drop ranges of keys. + * Removed flashcache support. + * DB::AddFile() is deprecated and is replaced with DB::IngestExternalFile(). DB::IngestExternalFile() remove all the restrictions that existed for DB::AddFile. + +### New Features + + * Add avoid_flush_during_shutdown option, which speeds up DB shutdown by not flushing unpersisted data (i.e. with disableWAL = true). Unpersisted data will be lost. The options is dynamically changeable via SetDBOptions(). + * Add memtable_insert_with_hint_prefix_extractor option. The option is mean to reduce CPU usage for inserting keys into memtable, if keys can be group by prefix and insert for each prefix are sequential or almost sequential. See include/rocksdb/options.h for more details. + * Add LuaCompactionFilter in utilities. This allows developers to write compaction filters in Lua. To use this feature, LUA_PATH needs to be set to the root directory of Lua. + * No longer populate "LATEST_BACKUP" file in backup directory, which formerly contained the number of the latest backup. The latest backup can be determined by finding the highest numbered file in the "meta/" subdirectory. diff --git a/src/rocksdb/docs/_posts/2017-02-07-rocksdb-5-1-2-released.markdown b/src/rocksdb/docs/_posts/2017-02-07-rocksdb-5-1-2-released.markdown new file mode 100644 index 000000000..35bafb219 --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-02-07-rocksdb-5-1-2-released.markdown @@ -0,0 +1,15 @@ +--- +title: RocksDB 5.1.2 Released! +layout: post +author: maysamyabandeh +category: blog +--- + +### Public API Change +* Support dynamically change `delete_obsolete_files_period_micros` option via SetDBOptions(). +* Added EventListener::OnExternalFileIngested which will be called when IngestExternalFile() add a file successfully. +* BackupEngine::Open and BackupEngineReadOnly::Open now always return error statuses matching those of the backup Env. + +### Bug Fixes +* Fix the bug that if 2PC is enabled, checkpoints may loss some recent transactions. +* When file copying is needed when creating checkpoints or bulk loading files, fsync the file after the file copying. diff --git a/src/rocksdb/docs/_posts/2017-02-17-bulkoad-ingest-sst-file.markdown b/src/rocksdb/docs/_posts/2017-02-17-bulkoad-ingest-sst-file.markdown new file mode 100644 index 000000000..9a43a846a --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-02-17-bulkoad-ingest-sst-file.markdown @@ -0,0 +1,50 @@ +--- +title: Bulkloading by ingesting external SST files +layout: post +author: IslamAbdelRahman +category: blog +--- + +## Introduction + +One of the basic operations of RocksDB is writing to RocksDB, Writes happen when user call (DB::Put, DB::Write, DB::Delete ... ), but what happens when you write to RocksDB ? .. this is a brief description of what happens. +- User insert a new key/value by calling DB::Put() (or DB::Write()) +- We create a new entry for the new key/value in our in-memory structure (memtable / SkipList by default) and we assign it a new sequence number. +- When the memtable exceeds a specific size (64 MB for example), we convert this memtable to a SST file, and put this file in level 0 of our LSM-Tree +- Later, compaction will kick in and move data from level 0 to level 1, and then from level 1 to level 2 .. and so on + +But what if we can skip these steps and add data to the lowest possible level directly ? This is what bulk-loading does + +## Bulkloading + +- Write all of our keys and values into SST file outside of the DB +- Add the SST file into the LSM directly + +This is bulk-loading, and in specific use-cases it allow users to achieve faster data loading and better write-amplification. + +and doing it is as simple as +```cpp +Options options; +SstFileWriter sst_file_writer(EnvOptions(), options, options.comparator); +Status s = sst_file_writer.Open(file_path); +assert(s.ok()); + +// Insert rows into the SST file, note that inserted keys must be +// strictly increasing (based on options.comparator) +for (...) { + s = sst_file_writer.Add(key, value); + assert(s.ok()); +} + +// Ingest the external SST file into the DB +s = db_->IngestExternalFile({"/home/usr/file1.sst"}, IngestExternalFileOptions()); +assert(s.ok()); +``` + +You can find more details about how to generate SST files and ingesting them into RocksDB in this [wiki page](https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files) + +## Use cases +There are multiple use cases where bulkloading could be useful, for example +- Generating SST files in offline jobs in Hadoop, then downloading and ingesting the SST files into RocksDB +- Migrating shards between machines by dumping key-range in SST File and loading the file in a different machine +- Migrating from a different storage (InnoDB to RocksDB migration in MyRocks) diff --git a/src/rocksdb/docs/_posts/2017-03-02-rocksdb-5-2-1-released.markdown b/src/rocksdb/docs/_posts/2017-03-02-rocksdb-5-2-1-released.markdown new file mode 100644 index 000000000..c6ce27d64 --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-03-02-rocksdb-5-2-1-released.markdown @@ -0,0 +1,22 @@ +--- +title: RocksDB 5.2.1 Released! +layout: post +author: sdong +category: blog +--- + +### Public API Change +* NewLRUCache() will determine number of shard bits automatically based on capacity, if the user doesn't pass one. This also impacts the default block cache when the user doesn't explict provide one. +* Change the default of delayed slowdown value to 16MB/s and further increase the L0 stop condition to 36 files. + +### New Features +* Added new overloaded function GetApproximateSizes that allows to specify if memtable stats should be computed only without computing SST files' stats approximations. +* Added new function GetApproximateMemTableStats that approximates both number of records and size of memtables. +* (Experimental) Two-level indexing that partition the index and creates a 2nd level index on the partitions. The feature can be enabled by setting kTwoLevelIndexSearch as IndexType and configuring index_per_partition. + +### Bug Fixes +* RangeSync() should work if ROCKSDB_FALLOCATE_PRESENT is not set +* Fix wrong results in a data race case in Get() +* Some fixes related to 2PC. +* Fix several bugs in Direct I/O supports. +* Fix a regression bug which can cause Seek() to miss some keys if the return key has been updated many times after the snapshot which is used by the iterator. diff --git a/src/rocksdb/docs/_posts/2017-05-12-partitioned-index-filter.markdown b/src/rocksdb/docs/_posts/2017-05-12-partitioned-index-filter.markdown new file mode 100644 index 000000000..a537feb0c --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-05-12-partitioned-index-filter.markdown @@ -0,0 +1,34 @@ +--- +title: Partitioned Index/Filters +layout: post +author: maysamyabandeh +category: blog +--- + +As DB/mem ratio gets larger, the memory footprint of filter/index blocks becomes non-trivial. Although `cache_index_and_filter_blocks` allows storing only a subset of them in block cache, their relatively large size negatively affects the performance by i) occupying the block cache space that could otherwise be used for caching data, ii) increasing the load on the disk storage by loading them into the cache after a miss. Here we illustrate these problems in more detail and explain how partitioning index/filters alleviates the overhead. + +### How large are the index/filter blocks? + +RocksDB has by default one index/filter block per SST file. The size of the index/filter varies based on the configuration but for a SST of size 256MB the index/filter block of size 0.5/5MB is typical, which is much larger than the typical data block size of 4-32KB. That is fine when all index/filters fit perfectly into memory and hence are read once per SST lifetime, not so much when they compete with data blocks for the block cache space and are also likely to be re-read many times from the disk. + +### What is the big deal with large index/filter blocks? + +When index/filter blocks are stored in block cache they are effectively competing with data blocks (as well as with each other) on this scarce resource. A filter of size 5MB is occupying the space that could otherwise be used to cache 1000s of data blocks (of size 4KB). This would result in more cache misses for data blocks. The large index/filters also kick each other out of the block cache more often and exacerbate their own cache miss rate too. This is while only a small part of the index/filter block might have been actually used during its lifetime in the cache. + +After the cache miss of an index/filter, it has to be reloaded from the disk, and its large size is not helping in reducing the IO cost. While a simple point lookup might need at most a couple of data block reads (of size 4KB) one from each layer of LSM, it might end up also loading multiple megabytes of index/filter blocks. If that happens often then the disk is spending more time serving index/filters rather than the actual data blocks. + +## What is partitioned index/filters? + +With partitioning, the index/filter of a SST file is partitioned into smaller blocks with an additional top-level index on them. When reading an index/filter, only top-level index is loaded into memory. The partitioned index/filter then uses the top-level index to load on demand into the block cache the partitions that are required to perform the index/filter query. The top-level index, which has much smaller memory footprint, can be stored in heap or block cache depending on the `cache_index_and_filter_blocks` setting. + +### Success stories + +#### HDD, 100TB DB + +In this example we have a DB of size 86G on HDD and emulate the small memory that is present to a node with 100TB of data by using direct IO (skipping OS file cache) and a very small block cache of size 60MB. Partitioning improves throughput by 11x from 5 op/s to 55 op/s. + +#### SSD, Linkbench + +In this example we have a DB of size 300G on SSD and emulate the small memory that would be available in presence of other DBs on the same node by by using direct IO (skipping OS file cache) and block cache of size 6G and 2G. Without partitioning the linkbench throughput drops from 38k tps to 23k when reducing block cache size from 6G to 2G. With partitioning the throughput drops from 38k to only 30k. + +Learn more [here](https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters). diff --git a/src/rocksdb/docs/_posts/2017-05-14-core-local-stats.markdown b/src/rocksdb/docs/_posts/2017-05-14-core-local-stats.markdown new file mode 100644 index 000000000..a806541fc --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-05-14-core-local-stats.markdown @@ -0,0 +1,106 @@ +--- +title: Core-local Statistics +layout: post +author: ajkr +category: blog +--- + +## Origins: Global Atomics + +Until RocksDB 4.12, ticker/histogram statistics were implemented with std::atomic values shared across the entire program. A ticker consists of a single atomic, while a histogram consists of several atomics to represent things like min/max/per-bucket counters. These statistics could be updated by all user/background threads. + +For concurrent/high-throughput workloads, cache line bouncing of atomics caused high CPU utilization. For example, we have tickers that count block cache hits and misses. Almost every user read increments these tickers a few times. Many concurrent user reads would cause the cache lines containing these atomics to bounce between cores. + +### Performance + +Here are perf results for 32 reader threads where most reads (99%+) are served by uncompressed block cache. Such a scenario stresses the statistics code heavily. + +Benchmark command: `TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench -statistics -use_existing_db=true -benchmarks=readrandom -threads=32 -cache_size=1048576000 -num=1000000 -reads=1000000 && perf report -g --children` + +Perf snippet for "cycles" event: + +``` + Children Self Command Shared Object Symbol ++ 30.33% 30.17% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick ++ 3.65% 0.98% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime +``` + +Perf snippet for "cache-misses" event: + +``` + Children Self Command Shared Object Symbol ++ 19.54% 19.50% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick ++ 3.44% 0.57% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime +``` + +The high CPU overhead for updating tickers and histograms corresponds well to the high cache misses. + +## Thread-locals: Faster Updates + +Since RocksDB 4.12, ticker/histogram statistics use thread-local storage. Each thread has a local set of atomic values that no other thread can update. This prevents the cache line bouncing problem described above. Even though updates to a given value are always made by the same thread, atomics are still useful to synchronize with aggregations for querying statistics. + +Implementing this approach involved a couple challenges. First, each query for a statistic's global value must aggregate all threads' local values. This adds some overhead, which may pass unnoticed if statistics are queried infrequently. Second, exited threads' local values are still needed to provide accurate statistics. We handle this by merging a thread's local values into process-wide variables upon thread exit. + +### Performance + +Update benchmark setup is same as before. CPU overhead improved 7.8x compared to global atomics, corresponding to a 17.8x reduction in cache-misses overhead. + +Perf snippet for "cycles" event: + +``` + Children Self Command Shared Object Symbol ++ 2.96% 0.87% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick ++ 1.37% 0.10% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime +``` + +Perf snippet for "cache-misses" event: + +``` + Children Self Command Shared Object Symbol ++ 1.21% 0.65% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick + 0.08% 0.00% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime +``` + +To measure statistics query latency, we ran sysbench with 4K OLTP clients concurrently with one client that queries statistics repeatedly. Times shown are in milliseconds. + +``` + min: 18.45 + avg: 27.91 + max: 231.65 + 95th percentile: 55.82 +``` + +## Core-locals: Faster Querying + +The thread-local approach is working well for applications calling RocksDB from only a few threads, or polling statistics infrequently. Eventually, though, we found use cases where those assumptions do not hold. For example, one application has per-connection threads and typically runs into performance issues when connection count grows very high. For debugging such issues, they want high-frequency statistics polling to correlate issues in their application with changes in RocksDB's state. + +Once [PR #2258](https://github.com/facebook/rocksdb/pull/2258) lands, ticker/histogram statistics will be local to each CPU core. Similarly to thread-local, each core updates only its local values, thus avoiding cache line bouncing. Local values are still atomics to make aggregation possible. With this change, query work depends only on number of cores, not the number of threads. So, applications with many more threads than cores can no longer impact statistics query latency. + +### Performance + +Update benchmark setup is same as before. CPU overhead worsened ~23% compared to thread-local, while cache performance was unchanged. + +Perf snippet for "cycles" event: + +``` + Children Self Command Shared Object Symbol ++ 2.96% 0.87% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick ++ 1.37% 0.10% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime +``` + +Perf snippet for "cache-misses" event: + +``` + Children Self Command Shared Object Symbol ++ 1.21% 0.65% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick + 0.08% 0.00% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime +``` + +Query latency is measured same as before with times in milliseconds. Average latency improved by 6.3x compared to thread-local. + +``` + min: 2.47 + avg: 4.45 + max: 91.13 + 95th percentile: 7.56 +``` diff --git a/src/rocksdb/docs/_posts/2017-05-26-rocksdb-5-4-5-released.markdown b/src/rocksdb/docs/_posts/2017-05-26-rocksdb-5-4-5-released.markdown new file mode 100644 index 000000000..561dab4c2 --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-05-26-rocksdb-5-4-5-released.markdown @@ -0,0 +1,39 @@ +--- +title: RocksDB 5.4.5 Released! +layout: post +author: sagar0 +category: blog +--- + +### Public API Change +* Support dynamically changing `stats_dump_period_sec` option via SetDBOptions(). +* Added ReadOptions::max_skippable_internal_keys to set a threshold to fail a request as incomplete when too many keys are being skipped while using iterators. +* DB::Get in place of std::string accepts PinnableSlice, which avoids the extra memcpy of value to std::string in most of cases. + * PinnableSlice releases the pinned resources that contain the value when it is destructed or when ::Reset() is called on it. + * The old API that accepts std::string, although discouraged, is still supported. +* Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction. See Direct IO wiki for details. + +### New Features +* Memtable flush can be avoided during checkpoint creation if total log file size is smaller than a threshold specified by the user. +* Introduce level-based L0->L0 compactions to reduce file count, so write delays are incurred less often. +* (Experimental) Partitioning filters which creates an index on the partitions. The feature can be enabled by setting partition_filters when using kFullFilter. Currently the feature also requires two-level indexing to be enabled. Number of partitions is the same as the number of partitions for indexes, which is controlled by metadata_block_size. +* DB::ResetStats() to reset internal stats. +* Added CompactionEventListener and EventListener::OnFlushBegin interfaces. +* Added DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. +* Facility for cross-building RocksJava using Docker. + +### Bug Fixes +* Fix WriteBatchWithIndex address use after scope error. +* Fix WritableFile buffer size in direct IO. +* Add prefetch to PosixRandomAccessFile in buffered io. +* Fix PinnableSlice access invalid address when row cache is enabled. +* Fix huge fallocate calls fail and make XFS unhappy. +* Fix memory alignment with logical sector size. +* Fix alignment in ReadaheadRandomAccessFile. +* Fix bias with read amplification stats (READ_AMP_ESTIMATE_USEFUL_BYTES and READ_AMP_TOTAL_READ_BYTES). +* Fix a manual / auto compaction data race. +* Fix CentOS 5 cross-building of RocksJava. +* Build and link with ZStd when creating the static RocksJava build. +* Fix snprintf's usage to be cross-platform. +* Fix build errors with blob DB. +* Fix readamp test type inconsistency. diff --git a/src/rocksdb/docs/_posts/2017-06-26-17-level-based-changes.markdown b/src/rocksdb/docs/_posts/2017-06-26-17-level-based-changes.markdown new file mode 100644 index 000000000..9e838eb7f --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-06-26-17-level-based-changes.markdown @@ -0,0 +1,60 @@ +--- +title: Level-based Compaction Changes +layout: post +author: ajkr +category: blog +--- + +### Introduction + +RocksDB provides an option to limit the number of L0 files, which bounds read-amplification. Since L0 files (unlike files at lower levels) can span the entire key-range, a key might be in any file, thus reads need to check them one-by-one. Users often wish to configure a low limit to improve their read latency. + +Although, the mechanism with which we enforce L0's file count limit may be unappealing. When the limit is reached, RocksDB intentionally delays user writes. This slows down accumulation of files in L0, and frees up resources for compacting files down to lower levels. But adding delays will significantly increase user-visible write latency jitter. + +Also, due to how L0 files can span the entire key-range, compaction parallelization is limited. Files at L0 or L1 may be locked due to involvement in pending L0->L1 or L1->L2 compactions. We can only schedule a parallel L0->L1 compaction if it does not require any of the locked files, which is typically not the case. + +To handle these constraints better, we added a new type of compaction, L0->L0. It quickly reduces file count in L0 and can be scheduled even when L1 files are locked, unlike L0->L1. We also changed the L0->L1 picking algorithm to increase opportunities for parallelism. + +### Old L0->L1 Picking Logic + +Previously, our logic for picking which L0 file to compact was the same as every other level: pick the largest file in the level. One special property of L0->L1 compaction is that files can overlap in the input level, so those overlapping files must be pulled in as well. For example, a compaction may look like this: + +![full-range.png](/static/images/compaction/full-range.png) + +This compaction pulls in every L0 and L1 file. This happens regardless of which L0 file is initially chosen as each file overlaps with every other file. + +Users may insert their data less uniformly in the key-range. For example, a database may look like this during L0->L1 compaction: + +![part-range-old.png](/static/images/compaction/part-range-old.png) + +Let's say the third file from the top is the largest, and let's say the top two files are created after the compaction started. When the compaction is picked, the fourth L0 file and six rightmost L1 files are pulled in due to overlap. Notice this leaves the database in a state where we might not be able to schedule parallel compactions. For example, if the sixth file from the top is the next largest, we can't compact it because it overlaps with the top two files, which overlap with the locked L0 files. + +We can now see the high-level problems with this approach more clearly. First, locked files in L0 or L1 prevent us from parallelizing compactions. When locked files block L0->L1 compaction, there is nothing we can do to eliminate L0 files. Second, L0->L1 compactions are relatively slow. As we saw, when keys are uniformly distributed, L0->L1 compacts two entire levels. While this is happening, new files are being flushed to L0, advancing towards the file count limit. + +### New L0->L0 Algorithm + +We introduced compaction within L0 to improve both parallelization and speed of reducing L0 file count. An L0->L0 compaction may look like this: + +![l1-l2-contend.png](/static/images/compaction/l1-l2-contend.png) + +Say the L1->L2 compaction started first. Now L0->L1 is prevented by the locked L1 file. In this case, we compact files within L0. This allows us to start the work for eliminating L0 files earlier. It also lets us do less work since we don't pull in any L1 files, whereas L0->L1 compaction would've pulled in all of them. This lets us quickly reduce L0 file count to keep read-amp low while sustaining large bursts of writes (i.e., fast accumulation of L0 files). + +The tradeoff is this increases total compaction work, as we're now compacting files without contributing towards our eventual goal of moving them towards lower levels. Our benchmarks, though, consistently show less compaction stalls and improved write throughput. One justification is that L0 file data is highly likely in page cache and/or block cache due to it being recently written and frequently accessed. So, this type of compaction is relatively cheap compared to compactions at lower levels. + +This feature is available since RocksDB 5.4. + +### New L0->L1 Picking Logic + +Recall how the old L0->L1 picking algorithm chose the largest L0 file for compaction. This didn't fit well with L0->L0 compaction, which operates on a span of files. That span begins at the newest L0 file, and expands towards older files as long as they're not being compacted. Since the largest file may be anywhere, the old L0->L1 picking logic could arbitrarily prevent us from getting a long span of files. See the second illustration in this post for a scenario where this would happen. + +So, we changed the L0->L1 picking algorithm to start from the oldest file and expand towards newer files as long as they're not being compacted. For example: + +![l0-l1-contend.png](/static/images/compaction/l0-l1-contend.png) + +Now, there can never be L0 files unreachable for L0->L0 due to L0->L1 selecting files in the middle. When longer spans of files are available for L0->L0, we perform less compaction work per deleted L0 file, thus improving efficiency. + +This feature will be available in RocksDB 5.7. + +### Performance Changes + +Mark Callaghan did the most extensive benchmarking of this feature's impact on MyRocks. See his results [here](http://smalldatum.blogspot.com/2017/05/innodb-myrocks-and-tokudb-on-insert.html). Note the primary change between his March 17 and April 14 builds is the latter performs L0->L0 compaction. diff --git a/src/rocksdb/docs/_posts/2017-06-29-rocksdb-5-5-1-released.markdown b/src/rocksdb/docs/_posts/2017-06-29-rocksdb-5-5-1-released.markdown new file mode 100644 index 000000000..d7856088b --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-06-29-rocksdb-5-5-1-released.markdown @@ -0,0 +1,22 @@ +--- +title: RocksDB 5.5.1 Released! +layout: post +author: lightmark +category: blog +--- + +### New Features +* FIFO compaction to support Intra L0 compaction too with CompactionOptionsFIFO.allow_compaction=true. +* Statistics::Reset() to reset user stats. +* ldb add option --try_load_options, which will open DB with its own option file. +* Introduce WriteBatch::PopSavePoint to pop the most recent save point explicitly. +* Support dynamically change `max_open_files` option via SetDBOptions() +* Added DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. +* Add debugging function `GetAllKeyVersions` to see internal versions of a range of keys. +* Support file ingestion with universal compaction style +* Support file ingestion behind with option `allow_ingest_behind` +* New option enable_pipelined_write which may improve write throughput in case writing from multiple threads and WAL enabled. + +### Bug Fixes +* Fix the bug that Direct I/O uses direct reads for non-SST file +* Fix the bug that flush doesn't respond to fsync result diff --git a/src/rocksdb/docs/_posts/2017-07-25-rocksdb-5-6-1-released.markdown b/src/rocksdb/docs/_posts/2017-07-25-rocksdb-5-6-1-released.markdown new file mode 100644 index 000000000..3b54ffd5a --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-07-25-rocksdb-5-6-1-released.markdown @@ -0,0 +1,22 @@ +--- +title: RocksDB 5.6.1 Released! +layout: post +author: yiwu +category: blog +--- + +### Public API Change +* Scheduling flushes and compactions in the same thread pool is no longer supported by setting `max_background_flushes=0`. Instead, users can achieve this by configuring their high-pri thread pool to have zero threads. See https://github.com/facebook/rocksdb/wiki/Thread-Pool for more details. +* Replace `Options::max_background_flushes`, `Options::max_background_compactions`, and `Options::base_background_compactions` all with `Options::max_background_jobs`, which automatically decides how many threads to allocate towards flush/compaction. +* options.delayed_write_rate by default take the value of options.rate_limiter rate. +* Replace global variable `IOStatsContext iostats_context` with `IOStatsContext* get_iostats_context()`; replace global variable `PerfContext perf_context` with `PerfContext* get_perf_context()`. + +### New Features +* Change ticker/histogram statistics implementations to use core-local storage. This improves aggregation speed compared to our previous thread-local approach, particularly for applications with many threads. See http://rocksdb.org/blog/2017/05/14/core-local-stats.html for more details. +* Users can pass a cache object to write buffer manager, so that they can cap memory usage for memtable and block cache using one single limit. +* Flush will be triggered when 7/8 of the limit introduced by write_buffer_manager or db_write_buffer_size is triggered, so that the hard threshold is hard to hit. See https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager for more details. +* Introduce WriteOptions.low_pri. If it is true, low priority writes will be throttled if the compaction is behind. See https://github.com/facebook/rocksdb/wiki/Low-Priority-Write for more details. +* `DB::IngestExternalFile()` now supports ingesting files into a database containing range deletions. + +### Bug Fixes +* Shouldn't ignore return value of fsync() in flush. diff --git a/src/rocksdb/docs/_posts/2017-08-24-pinnableslice.markdown b/src/rocksdb/docs/_posts/2017-08-24-pinnableslice.markdown new file mode 100644 index 000000000..7ac2fec34 --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-08-24-pinnableslice.markdown @@ -0,0 +1,37 @@ +--- +title: PinnableSlice; less memcpy with point lookups +layout: post +author: maysamyabandeh +category: blog +--- + +The classic API for [DB::Get](https://github.com/facebook/rocksdb/blob/9e583711144f580390ce21a49a8ceacca338fcd5/include/rocksdb/db.h#L310) receives a std::string as argument to which it will copy the value. The memcpy overhead could be non-trivial when the value is large. The [new API](https://github.com/facebook/rocksdb/blob/9e583711144f580390ce21a49a8ceacca338fcd5/include/rocksdb/db.h#L322) receives a PinnableSlice instead, which avoids memcpy in most of the cases. + +### What is PinnableSlice? + +Similarly to Slice, PinnableSlice refers to some in-memory data so it does not incur the memcpy cost. To ensure that the data will not be erased while it is being processed by the user, PinnableSlice, as its name suggests, has the data pinned in memory. The pinned data are released when PinnableSlice object is destructed or when ::Reset is invoked explicitly on it. + +### How good is it? + +Here are the improvements in throughput for an [in-memory benchmark](https://github.com/facebook/rocksdb/pull/1756#issuecomment-286201693): +* value 1k byte: 14% +* value 10k byte: 34% + +### Any limitations? + +PinnableSlice tries to avoid memcpy as much as possible. The primary gain is when reading large values from the block cache. There are however cases that it would still have to copy the data into its internal buffer. The reason is mainly the complexity of implementation and if there is enough motivation on the application side. the scope of PinnableSlice could be extended to such cases too. These include: +* Merged values +* Reads from memtables + +### How to use it? + +```cpp +PinnableSlice pinnable_val; +while (!stopped) { + auto s = db->Get(opt, cf, key, &pinnable_val); + // ... use it + pinnable_val.Reset(); // then release it immediately +} +``` + +You can also [initialize the internal buffer](https://github.com/facebook/rocksdb/blob/9e583711144f580390ce21a49a8ceacca338fcd5/include/rocksdb/db.h#L314) of PinnableSlice by passing your own string in the constructor. [simple_example.cc](https://github.com/facebook/rocksdb/blob/master/examples/simple_example.cc) demonstrates that with more examples. diff --git a/src/rocksdb/docs/_posts/2017-08-25-flushwal.markdown b/src/rocksdb/docs/_posts/2017-08-25-flushwal.markdown new file mode 100644 index 000000000..2dc5626ad --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-08-25-flushwal.markdown @@ -0,0 +1,26 @@ +--- +title: FlushWAL; less fwrite, faster writes +layout: post +author: maysamyabandeh +category: blog +--- + +When `DB::Put` is called, the data is written to both memtable (to be flushed to SST files later) and the WAL (write-ahead log) if it is enabled. In the case of a crash, RocksDB can recover as much as the memtable state that is reflected into the WAL. By default RocksDB automatically flushes the WAL from the application memory to the OS buffer after each `::Put`. It however can be configured to perform the flush manually after an explicit call to `::FlushWAL`. Not doing fwrite syscall after each `::Put` offers a tradeoff between reliability and write latency for the general case. As we explain below, some applications such as MyRocks benefit from this API to gain higher write throughput with however no compromise in reliability. + +### How much is the gain? + +Using `::FlushWAL` API along with setting `DBOptions.concurrent_prepare`, MyRocks achieves 40% higher throughput in Sysbench's [update-nonindex](https://github.com/akopytov/sysbench/blob/master/src/lua/oltp_update_non_index.lua) benchmark. + +### Write, Flush, and Sync + +The write to the WAL is first written to the application memory buffer. The buffer in the next step is "flushed" to OS buffer by calling fwrite syscall. The OS buffer is later "synced" to the persistent storage. The data in the OS buffer, although not persisted yet, will survive the application crash. By default, the flush occurs automatically upon each call to `DB::Put` or `DB::Write`. The user can additionally request sync after each write by setting `WriteOptions::sync`. + +### FlushWAL API + +The user can turn off the automatic flush of the WAL by setting `DBOptions::manual_wal_flush`. In that case, the WAL buffer is flushed when it is either full or `DB::FlushWAL` is called by the user. The API also accepts a boolean argument should we want to sync right after the flush: `::FlushWAL(true)`. + +### Success story: MyRocks + +Some applications that use RocksDB, already have other machinsims in place to provide reliability. MySQL for example uses 2PC (two-phase commit) to write to both binlog as well as the storage engine such as InnoDB and MyRocks. The group commit logic in MySQL allows the 1st phase (Prepare) to be run in parallel but after a commit group is formed performs the 2nd phase (Commit) in a serial manner. This makes low commit latency in the storage engine essential for acheiving high throughput. The commit in MyRocks includes writing to the RocksDB WAL, which as explaiend above, by default incures the latency of flushing the WAL new appends to the OS buffer. + +Since binlog helps in recovering from some failure scenarios, MySQL can provide reliability without however needing a storage WAL flush after each individual commit. MyRocks benefits from this property, disables automatic WAL flush in RocksDB, and manually calls `::FlushWAL` when requested by MySQL. diff --git a/src/rocksdb/docs/_posts/2017-09-28-rocksdb-5-8-released.markdown b/src/rocksdb/docs/_posts/2017-09-28-rocksdb-5-8-released.markdown new file mode 100644 index 000000000..a22dcaa1c --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-09-28-rocksdb-5-8-released.markdown @@ -0,0 +1,25 @@ +--- +title: RocksDB 5.8 Released! +layout: post +author: maysamyabandeh +category: blog +--- + +### Public API Change +* Users of `Statistics::getHistogramString()` will see fewer histogram buckets and different bucket endpoints. +* `Slice::compare` and BytewiseComparator `Compare` no longer accept `Slice`s containing nullptr. +* `Transaction::Get` and `Transaction::GetForUpdate` variants with `PinnableSlice` added. + +### New Features +* Add Iterator::Refresh(), which allows users to update the iterator state so that they can avoid some initialization costs of recreating iterators. +* Replace dynamic_cast<> (except unit test) so people can choose to build with RTTI off. With make, release mode is by default built with -fno-rtti and debug mode is built without it. Users can override it by setting USE_RTTI=0 or 1. +* Universal compactions including the bottom level can be executed in a dedicated thread pool. This alleviates head-of-line blocking in the compaction queue, which cause write stalling, particularly in multi-instance use cases. Users can enable this feature via `Env::SetBackgroundThreads(N, Env::Priority::BOTTOM)`, where `N > 0`. +* Allow merge operator to be called even with a single merge operand during compactions, by appropriately overriding `MergeOperator::AllowSingleOperand`. +* Add `DB::VerifyChecksum()`, which verifies the checksums in all SST files in a running DB. +* Block-based table support for disabling checksums by setting `BlockBasedTableOptions::checksum = kNoChecksum`. + +### Bug Fixes +* Fix wrong latencies in `rocksdb.db.get.micros`, `rocksdb.db.write.micros`, and `rocksdb.sst.read.micros`. +* Fix incorrect dropping of deletions during intra-L0 compaction. +* Fix transient reappearance of keys covered by range deletions when memtable prefix bloom filter is enabled. +* Fix potentially wrong file smallest key when range deletions separated by snapshot are written together. diff --git a/src/rocksdb/docs/_posts/2017-12-18-17-auto-tuned-rate-limiter.markdown b/src/rocksdb/docs/_posts/2017-12-18-17-auto-tuned-rate-limiter.markdown new file mode 100644 index 000000000..d2e6204e1 --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-12-18-17-auto-tuned-rate-limiter.markdown @@ -0,0 +1,28 @@ +--- +title: Auto-tuned Rate Limiter +layout: post +author: ajkr +category: blog +--- + +### Introduction + +Our rate limiter has been hard to configure since users need to pick a value that is low enough to prevent background I/O spikes, which can impact user-visible read/write latencies. Meanwhile, picking too low a value can cause memtables and L0 files to pile up, eventually leading to writes stalling. Tuning the rate limiter has been especially difficult for users whose DB instances have different workloads, or have workloads that vary over time, or commonly both. + +To address this, in RocksDB 5.9 we released a dynamic rate limiter that adjusts itself over time according to demand for background I/O. It can be enabled simply by passing `auto_tuned=true` in the `NewGenericRateLimiter()` call. In this case `rate_bytes_per_sec` will indicate the upper-bound of the window within which a rate limit will be picked dynamically. The chosen rate limit will be much lower unless absolutely necessary, so setting this to the device's maximum throughput is a reasonable choice on dedicated hosts. + +### Algorithm + +We use a simple multiplicative-increase, multiplicative-decrease algorithm. We measure demand for background I/O as the ratio of intervals where the rate limiter is drained. There are low and high watermarks for this ratio, which will trigger a change in rate limit when breached. The rate limit can move within a window bounded by the user-specified upper-bound, and a lower-bound that we derive internally. Users can expect this lower bound to be 1-2 orders of magnitude less than the provided upper-bound (so don't provide INT64_MAX as your upper-bound), although it's subject to change. + +### Benchmark Results + +Data is ingested at 10MB/s and the rate limiter was created with 1000MB/s as its upper bound. The dynamically chosen rate limit hovers around 125MB/s. The other clustering of points at 50MB/s is due to number of compaction threads being reduced to one when there's no compaction pressure. + +![](/static/images/rate-limiter/write-KBps-series.png) + +![](/static/images/rate-limiter/auto-tuned-write-KBps-series.png) + +The following graph summarizes the above two time series graphs in CDF form. In particular, notice the p90 - p100 for background write rate are significantly lower with auto-tuned rate limiter enabled. + +![](/static/images/rate-limiter/write-KBps-cdf.png) diff --git a/src/rocksdb/docs/_posts/2017-12-19-write-prepared-txn.markdown b/src/rocksdb/docs/_posts/2017-12-19-write-prepared-txn.markdown new file mode 100644 index 000000000..439b3f83c --- /dev/null +++ b/src/rocksdb/docs/_posts/2017-12-19-write-prepared-txn.markdown @@ -0,0 +1,41 @@ +--- +title: WritePrepared Transactions +layout: post +author: maysamyabandeh +category: blog +--- + +RocksDB supports both optimistic and pessimistic concurrency controls. The pessimistic transactions make use of locks to provide isolation between the transactions. The default write policy in pessimistic transactions is _WriteCommitted_, which means that the data is written to the DB, i.e., the memtable, only after the transaction is committed. This policy simplified the implementation but came with some limitations in throughput, transaction size, and variety in supported isolation levels. In the below, we explain these in detail and present the other write policies, _WritePrepared_ and _WriteUnprepared_. We then dive into the design of _WritePrepared_ transactions. + +### WriteCommitted, Pros and Cons + +With _WriteCommitted_ write policy, the data is written to the memtable only after the transaction commits. This greatly simplifies the read path as any data that is read by other transactions can be assumed to be committed. This write policy, however, implies that the writes are buffered in memory in the meanwhile. This makes memory a bottleneck for large transactions. The delay of the commit phase in 2PC (two-phase commit) also becomes noticeable since most of the work, i.e., writing to memtable, is done at the commit phase. When the commit of multiple transactions are done in a serial fashion, such as in 2PC implementation of MySQL, the lengthy commit latency becomes a major contributor to lower throughput. Moreover this write policy cannot provide weaker isolation levels, such as READ UNCOMMITTED, that could potentially provide higher throughput for some applications. + +### Alternatives: _WritePrepared_ and _WriteUnprepared_ + +To tackle the lengthy commit issue, we should do memtable writes at earlier phases of 2PC so that the commit phase become lightweight and fast. 2PC is composed of Write stage, where the transaction `::Put` is invoked, the prepare phase, where `::Prepare` is invoked (upon which the DB promises to commit the transaction if later is requested), and commit phase, where `::Commit` is invoked and the transaction writes become visible to all readers. To make the commit phase lightweight, the memtable write could be done at either `::Prepare` or `::Put` stages, resulting into _WritePrepared_ and _WriteUnprepared_ write policies respectively. The downside is that when another transaction is reading data, it would need a way to tell apart which data is committed, and if they are, whether they are committed before the transaction's start, i.e., in the read snapshot of the transaction. _WritePrepared_ would still have the issue of buffering the data, which makes the memory the bottleneck for large transactions. It however provides a good milestone for transitioning from _WriteCommitted_ to _WriteUnprepared_ write policy. Here we explain the design of _WritePrepared_ policy. We will cover the changes that make the design to also supported _WriteUnprepared_ in an upcoming post. + +### _WritePrepared_ in a nutshell + +These are the primary design questions that needs to be addressed: +1) How do we identify the key/values in the DB with transactions that wrote them? +2) How do we figure if a key/value written by transaction Txn_w is in the read snapshot of the reading transaction Txn_r? +3) How do we rollback the data written by aborted transactions? + +With _WritePrepared_, a transaction still buffers the writes in a write batch object in memory. When 2PC `::Prepare` is called, it writes the in-memory write batch to the WAL (write-ahead log) as well as to the memtable(s) (one memtable per column family); We reuse the existing notion of sequence numbers in RocksDB to tag all the key/values in the same write batch with the same sequence number, `prepare_seq`, which is also used as the identifier for the transaction. At commit time, it writes a commit marker to the WAL, whose sequence number, `commit_seq`, will be used as the commit timestamp of the transaction. Before releasing the commit sequence number to the readers, it stores a mapping from `prepare_seq` to `commit_seq` in an in-memory data structure that we call _CommitCache_. When a transaction reading values from the DB (tagged with `prepare_seq`) it makes use of the _CommitCache_ to figure if `commit_seq` of the value is in its read snapshot. To rollback an aborted transaction, we apply the status before the transaction by making another write that cancels out the writes of the aborted transaction. + +The _CommitCache_ is a lock-free data structure that caches the recent commit entries. Looking up the entries in the cache must be enough for almost all th transactions that commit in a timely manner. When evicting the older entries from the cache, it still maintains some other data structures to cover the corner cases for transactions that takes abnormally too long to finish. We will cover them in the design details below. + +### Benchmark Results +Here we presents the improvements observed in MyRocks with sysbench and linkbench: +* benchmark...........tps.........p95 latency....cpu/query +* insert...................68% +* update-noindex...30%......38% +* update-index.......61%.......28% +* read-write............6%........3.5% +* read-only...........-1.2%.....-1.8% +* linkbench.............1.9%......+overall........0.6% + +Here are also the detailed results for [In-Memory Sysbench](https://gist.github.com/maysamyabandeh/bdb868091b2929a6d938615fdcf58424) and [SSD Sysbench](https://gist.github.com/maysamyabandeh/ff94f378ab48925025c34c47eff99306) curtesy of [@mdcallag](https://github.com/mdcallag). + +Learn more [here](https://github.com/facebook/rocksdb/wiki/WritePrepared-Transactions). diff --git a/src/rocksdb/docs/_posts/2018-02-05-rocksdb-5-10-2-released.markdown b/src/rocksdb/docs/_posts/2018-02-05-rocksdb-5-10-2-released.markdown new file mode 100644 index 000000000..9f32d3f94 --- /dev/null +++ b/src/rocksdb/docs/_posts/2018-02-05-rocksdb-5-10-2-released.markdown @@ -0,0 +1,22 @@ +--- +title: RocksDB 5.10.2 Released! +layout: post +author: siying +category: blog +--- + +### Public API Change +* When running `make` with environment variable `USE_SSE` set and `PORTABLE` unset, will use all machine features available locally. Previously this combination only compiled SSE-related features. + +### New Features +* CRC32C is now using the 3-way pipelined SSE algorithm `crc32c_3way` on supported platforms to improve performance. The system will choose to use this algorithm on supported platforms automatically whenever possible. If PCLMULQDQ is not supported it will fall back to the old Fast_CRC32 algorithm. +* Provide lifetime hints when writing files on Linux. This reduces hardware write-amp on storage devices supporting multiple streams. +* Add a DB stat, `NUMBER_ITER_SKIP`, which returns how many internal keys were skipped during iterations (e.g., due to being tombstones or duplicate versions of a key). +* Add PerfContext counters, `key_lock_wait_count` and `key_lock_wait_time`, which measure the number of times transactions wait on key locks and total amount of time waiting. + +### Bug Fixes +* Fix IOError on WAL write doesn't propagate to write group follower +* Make iterator invalid on merge error. +* Fix performance issue in `IngestExternalFile()` affecting databases with large number of SST files. +* Fix possible corruption to LSM structure when `DeleteFilesInRange()` deletes a subset of files spanned by a `DeleteRange()` marker. +* Fix DB::Flush() keep waiting after flush finish under certain condition. diff --git a/src/rocksdb/docs/_posts/2018-08-01-rocksdb-tuning-advisor.markdown b/src/rocksdb/docs/_posts/2018-08-01-rocksdb-tuning-advisor.markdown new file mode 100644 index 000000000..c0e8c4425 --- /dev/null +++ b/src/rocksdb/docs/_posts/2018-08-01-rocksdb-tuning-advisor.markdown @@ -0,0 +1,58 @@ +--- +title: Rocksdb Tuning Advisor +layout: post +author: poojam23 +category: blog +--- + +The performance of Rocksdb is contingent on its tuning. However, because +of the complexity of its underlying technology and a large number of +configurable parameters, a good configuration is sometimes hard to obtain. The aim of +the python command-line tool, Rocksdb Advisor, is to automate the process of +suggesting improvements in the configuration based on advice from Rocksdb +experts. + +### Overview + +Experts share their wisdom as rules comprising of conditions and suggestions in the INI format (refer +[rules.ini](https://github.com/facebook/rocksdb/blob/master/tools/advisor/advisor/rules.ini)). +Users provide the Rocksdb configuration that they want to improve upon (as the +familiar Rocksdb OPTIONS file — +[example](https://github.com/facebook/rocksdb/blob/master/examples/rocksdb_option_file_example.ini)) +and the path of the file which contains Rocksdb logs and statistics. +The [Advisor](https://github.com/facebook/rocksdb/blob/master/tools/advisor/advisor/rule_parser_example.py) +creates appropriate DataSource objects (for Rocksdb +[logs](https://github.com/facebook/rocksdb/blob/master/tools/advisor/advisor/db_log_parser.py), +[options](https://github.com/facebook/rocksdb/blob/master/tools/advisor/advisor/db_options_parser.py), +[statistics](https://github.com/facebook/rocksdb/blob/master/tools/advisor/advisor/db_stats_fetcher.py) etc.) +and provides them to the [Rules Engine](https://github.com/facebook/rocksdb/blob/master/tools/advisor/advisor/rule_parser.py). +The Rules uses rules from experts to parse data-sources and trigger appropriate rules. +The Advisor's output gives information about which rules were triggered, +why they were triggered and what each of them suggests. Each suggestion +provided by a triggered rule advises some action on a Rocksdb +configuration option, for example, increase CFOptions.write_buffer_size, +set bloom_bits to 2 etc. + +### Usage + +An example command to run the tool: + +```shell +cd rocksdb/tools/advisor +python3 -m advisor.rule_parser_example --rules_spec=advisor/rules.ini --rocksdb_options=test/input_files/OPTIONS-000005 --log_files_path_prefix=test/input_files/LOG-0 --stats_dump_period_sec=20 +``` + +Sample output where a Rocksdb log-based rule has been triggered : + +```shell +Rule: stall-too-many-memtables +LogCondition: stall-too-many-memtables regex: Stopping writes because we have \d+ immutable memtables \(waiting for flush\), max_write_buffer_number is set to \d+ +Suggestion: inc-bg-flush option : DBOptions.max_background_flushes action : increase suggested_values : ['2'] +Suggestion: inc-write-buffer option : CFOptions.max_write_buffer_number action : increase +scope: col_fam: +{'default'} +``` + +### Read more + +For more information, refer to [advisor](https://github.com/facebook/rocksdb/tree/master/tools/advisor/README.md). diff --git a/src/rocksdb/docs/_posts/2018-08-23-data-block-hash-index.markdown b/src/rocksdb/docs/_posts/2018-08-23-data-block-hash-index.markdown new file mode 100644 index 000000000..c4b24ec2a --- /dev/null +++ b/src/rocksdb/docs/_posts/2018-08-23-data-block-hash-index.markdown @@ -0,0 +1,118 @@ +--- +title: Improving Point-Lookup Using Data Block Hash Index +layout: post +author: fgwu +category: blog +--- +We've designed and implemented a _data block hash index_ in RocksDB that has the benefit of both reducing the CPU util and increasing the throughput for point lookup queries with a reasonable and tunable space overhead. + +Specifially, we append a compact hash table to the end of the data block for efficient indexing. It is backward compatible with the data base created without this feature. After turned on the hash index feature, existing data will be gradually converted to the hash index format. + +Benchmarks with `db_bench` show the CPU utilization of one of the main functions in the point lookup code path, `DataBlockIter::Seek()`, is reduced by 21.8%, and the overall RocksDB throughput is increased by 10% under purely cached workloads, at an overhead of 4.6% more space. Shadow testing with Facebook production traffic shows good CPU improvements too. + + +### How to use it +Two new options are added as part of this feature: `BlockBasedTableOptions::data_block_index_type` and `BlockBasedTableOptions::data_block_hash_table_util_ratio`. + +The hash index is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash`. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`, which is valid only if `data_block_index_type = kDataBlockBinaryAndHash`. + + +``` +// the definitions can be found in include/rocksdb/table.h + +// The index type that will be used for the data block. +enum DataBlockIndexType : char { + kDataBlockBinarySearch = 0, // traditional block type + kDataBlockBinaryAndHash = 1, // additional hash index +}; + +// Set to kDataBlockBinaryAndHash to enable hash index +DataBlockIndexType data_block_index_type = kDataBlockBinarySearch; + +// #entries/#buckets. It is valid only when data_block_hash_index_type is +// kDataBlockBinaryAndHash. +double data_block_hash_table_util_ratio = 0.75; + +``` + + +### Data Block Hash Index Design + +Current data block format groups adjacent keys together as a restart interval. One block consists of multiple restart intervals. The byte offset of the beginning of each restart interval, i.e. a restart point, is stored in an array called restart interval index or binary seek index. RocksDB does a binary search when performing point lookup for keys in data blocks to find the right restart interval the key may reside. We will use binary seek and binary search interchangeably in this post. + +In order to find the right location where the key may reside using binary search, multiple key parsing and comparison are needed. Each binary search branching triggers CPU cache miss, causing much CPU utilization. We have seen that this binary search takes up considerable CPU in production use-cases. + +![](/static/images/data-block-hash-index/block-format-binary-seek.png) + +We implemented a hash map at the end of the block to index the key to reduce the CPU overhead of the binary search. The hash index is just an array of pointers pointing into the binary seek index. + +![](/static/images/data-block-hash-index/block-format-hash-index.png) + + +Each array element is considered as a hash bucket when storing the location of a key (or more precisely, the restart index of the restart interval where the key resides). When multiple keys happen to hash into the same bucket (hash collision), we just mark the bucket as “collision”. So that when later querying on that key, the hash table lookup knows that there was a hash collision happened so it can fall back to the traditional binary search to find the location of the key. + +We define hash table utilization ratio as the #keys/#buckets. If a utilization ratio is 0.5 and there are 100 buckets, 50 keys are stored in the bucket. The less the util ratio, the less hash collision, and the less chance for a point lookup falls back to binary seek (fall back ratio) due to the collision. So a small util ratio has more benefit to reduce the CPU time but introduces more space overhead. + +Space overhead depends on the util ratio. Each bucket is a `uint8_t` (i.e. one byte). For a util ratio of 1, the space overhead is 1Byte per key, the fall back ratio observed is ~52%. + +![](/static/images/data-block-hash-index/hash-index-data-structure.png) + +### Things that Need Attention + +**Customized Comparator** + +Hash index will hash different keys (keys with different content, or byte sequence) into different hash values. This assumes the comparator will not treat different keys as equal if they have different content. + +The default bytewise comparator orders the keys in alphabetical order and works well with hash index, as different keys will never be regarded as equal. However, some specially crafted comparators will do. For example, say, a `StringToIntComparator` can convert a string into an integer, and use the integer to perform the comparison. Key string “16” and “0x10” is equal to each other as seen by this `StringToIntComparator`, but they probably hash to different value. Later queries to one form of the key will not be able to find the existing key been stored in the other format. + +We add a new function member to the comparator interface: + +``` +virtual bool CanKeysWithDifferentByteContentsBeEqual() const { return true; } +``` + + +Every comparator implementation should override this function and specify the behavior of the comparator. If a comparator can regard different keys equal, the function returns true, and as a result the hash index feature will not be enabled, and vice versa. + +NOTE: to use the hash index feature, one should 1) have a comparator that can never treat different keys as equal; and 2) override the `CanKeysWithDifferentByteContentsBeEqual()` function to return `false`, so the hash index can be enabled. + + +**Util Ratio's Impact on Data Block Cache** + +Adding the hash index to the end of the data block essentially takes up the data block cache space, making the effective data block cache size smaller and increasing the data block cache miss ratio. Therefore, a very small util ratio will result in a large data block cache miss ratio, and the extra I/O may drag down the throughput gain achieved by the hash index lookup. Besides, when compression is enabled, cache miss also incurs data block decompression, which is CPU-consuming. Therefore the CPU may even increase if using a too small util ratio. The best util ratio depends on workloads, cache to data ratio, disk bandwidth/latency etc. In our experiment, we found util ratio = 0.5 ~ 1 is a good range to explore that brings both CPU and throughput gains. + + +### Limitations + +As we use `uint8_t` to store binary seek index, i.e. restart interval index, the total number of restart intervals cannot be more than 253 (we reserved 255 and 254 as special flags). For blocks having a larger number of restart intervals, the hash index will not be created and the point lookup will be done by traditional binary seek. + +Data block hash index only supports point lookup. We do not support range lookup. Range lookup request will fall back to BinarySeek. + +RocksDB supports many types of records, such as `Put`, `Delete`, `Merge`, etc (visit [here](https://github.com/facebook/rocksdb/wiki/rocksdb-basics) for more information). Currently we only support `Put` and `Delete`, but not `Merge`. Internally we have a limited set of supported record types: + + +``` +kPutRecord, <=== supported +kDeleteRecord, <=== supported +kSingleDeleteRecord, <=== supported +kTypeBlobIndex, <=== supported +``` + +For records not supported, the searching process will fall back to the traditional binary seek. + + + +### Evaluation +To evaluate the CPU util reduction and isolate other factors such as disk I/O and block decompression, we first evaluate the hash idnex in a purely cached workload. We observe that the CPU utilization of one of the main functions in the point lookup code path, DataBlockIter::Seek(), is reduced by 21.8% and the overall throughput is increased by 10% at an overhead of 4.6% more space. + +However, general worload is not always purely cached. So we also evaluate the performance under different cache space pressure. In the following test, we use `db_bench` with RocksDB deployed on SSDs. The total DB size is 5~6GB, and it is about 14GB if decompressed. Different block cache sizes are used, ranging from 14GB down to 2GB, with an increasing cache miss ratio. + +Orange bars are representing our hash index performance. We use a hash util ratio of 1.0 in this test. Block size are set to 16KiB with the restart interval as 16. + +![](/static/images/data-block-hash-index/perf-throughput.png) +![](/static/images/data-block-hash-index/perf-cache-miss.png) + +We can see that if cache size is greater than 8GB, hash index can bring throughput gain. Cache size greater than 8GB can be translated to a cache miss ratio smaller than 40%. So if the workload has a cache miss ratio smaller than 40%, hash index is able to increase the throughput. + +Besides, shadow testing with Facebook production traffic shows good CPU improvements too. + diff --git a/src/rocksdb/docs/_posts/2018-11-21-delete-range.markdown b/src/rocksdb/docs/_posts/2018-11-21-delete-range.markdown new file mode 100644 index 000000000..96fc3562d --- /dev/null +++ b/src/rocksdb/docs/_posts/2018-11-21-delete-range.markdown @@ -0,0 +1,292 @@ +--- +title: "DeleteRange: A New Native RocksDB Operation" +layout: post +author: +- abhimadan +- ajkr +category: blog +--- +## Motivation + +### Deletion patterns in LSM + +Deleting a range of keys is a common pattern in RocksDB. Most systems built on top of +RocksDB have multi-component key schemas, where keys sharing a common prefix are +logically related. Here are some examples. + +MyRocks is a MySQL fork using RocksDB as its storage engine. Each key's first +four bytes identify the table or index to which that key belongs. Thus dropping +a table or index involves deleting all the keys with that prefix. + +Rockssandra is a Cassandra variant that uses RocksDB as its storage engine. One +of its admin tool commands, `nodetool cleanup`, removes key-ranges that have been migrated +to other nodes in the cluster. + +Marketplace uses RocksDB to store product data. Its key begins with product ID, +and it stores various data associated with the product in separate keys. When a +product is removed, all these keys must be deleted. + +When we decide what to improve, we try to find a use case that's common across +users, since we want to build a generally useful system, not one that has many +one-off features for individual users. The range deletion pattern is common as +illustrated above, so from this perspective it's a good target for optimization. + +### Existing mechanisms: challenges and opportunities + +The most common pattern we see is scan-and-delete, i.e., advance an iterator +through the to-be-deleted range, and issue a `Delete` for each key. This is +slow (involves read I/O) so cannot be done in any critical path. Additionally, +it creates many tombstones, which slows down iterators and doesn't offer a deadline +for space reclamation. + +Another common pattern is using a custom compaction filter that drops keys in +the deleted range(s). This deletes the range asynchronously, so cannot be used +in cases where readers must not see keys in deleted ranges. Further, it has the +disadvantage of outputting tombstones to all but the bottom level. That's +because compaction cannot detect whether dropping a key would cause an older +version at a lower level to reappear. + +If space reclamation time is important, or it is important that the deleted +range not affect iterators, the user can trigger `CompactRange` on the deleted +range. This can involve arbitrarily long waits in the compaction queue, and +increases write-amp. By the time it's finished, however, the range is completely +gone from the LSM. + +`DeleteFilesInRange` can be used prior to compacting the deleted range as long +as snapshot readers do not need to access them. It drops files that are +completely contained in the deleted range. That saves write-amp because, in +`CompactRange`, the file data would have to be rewritten several times before it +reaches the bottom of the LSM, where tombstones can finally be dropped. + +In addition to the above approaches having various drawbacks, they are quite +complicated to reason about and implement. In an ideal world, deleting a range +of keys would be (1) simple, i.e., a single API call; (2) synchronous, i.e., +when the call finishes, the keys are guaranteed to be wiped from the DB; (3) low +latency so it can be used in critical paths; and (4) a first-class operation +with all the guarantees of any other write, like atomicity, crash-recovery, etc. + +## v1: Getting it to work + +### Where to persist them? + +The first place we thought about storing them is inline with the data blocks. +We could not think of a good way to do it, however, since the start of a range +tombstone covering a key could be anywhere, making binary search impossible. +So, we decided to investigate segregated storage. + +A second solution we considered is appending to the manifest. This file is +append-only, periodically compacted, and stores metadata like the level to which +each SST belongs. This is tempting because it leverages an existing file, which +is maintained in the background and fully read when the DB is opened. However, +it conceptually violates the manifest's purpose, which is to store metadata. It +also has no way to detect when a range tombstone no longer covers anything and +is droppable. Further, it'd be possible for keys above a range tombstone to disappear +when they have their seqnums zeroed upon compaction to the bottommost level. + +A third candidate is using a separate column family. This has similar problems +to the manifest approach. That is, we cannot easily detect when a range +tombstone is obsolete, and seqnum zeroing can cause a key +to go from above a range tombstone to below, i.e., disappearing. The upside is +we can reuse logic for memory buffering, consistent reads/writes, etc. + +The problems with the second and third solutions indicate a need for range +tombstones to be aware of flush/compaction. An easy way to achieve this is put +them in the SST files themselves - but not in the data blocks, as explained for +the first solution. So, we introduced a separate meta-block for range tombstones. +This resolved the problem of when to obsolete range tombstones, as it's simple: +when they're compacted to the bottom level. We also reused the LSM invariants +that newer versions of a key are always in a higher level to prevent the seqnum +zeroing problem. This approach has the side benefit of constraining the range +tombstones seen during reads to ones in a similar key-range. + +![](/static/images/delrange/delrange_sst_blocks.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +*When there are range tombstones in an SST, they are segregated in a separate meta-block* +{: style="text-align: center"} + +![](/static/images/delrange/delrange_key_schema.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 80%"} + +*Logical range tombstones (left) and their corresponding physical key-value representation (right)* +{: style="text-align: center"} + +### Write path + +`WriteBatch` stores range tombstones in its buffer which are logged to the WAL and +then applied to a dedicated range tombstone memtable during `Write`. Later in +the background the range tombstone memtable and its corresponding data memtable +are flushed together into a single SST with a range tombstone meta-block. SSTs +periodically undergo compaction which rewrites SSTs with point data and range +tombstones dropped or merged wherever possible. + +We chose to use a dedicated memtable for range tombstones. The memtable +representation is always skiplist in order to minimize overhead in the usual +case, which is the memtable contains zero or a small number of range tombstones. +The range tombstones are segregated to a separate memtable for the same reason +we segregated range tombstones in SSTs. That is, we did not know how to +interleave the range tombstone with point data in a way that we would be able to +find it for arbitrary keys that it covers. + +![](/static/images/delrange/delrange_write_path.png) +{: style="display: block; margin-left: auto; margin-right: auto; width: 70%"} + +*Lifetime of point keys and range tombstones in RocksDB* +{: style="text-align: center"} + +During flush and compaction, we chose to write out all non-obsolete range +tombstones unsorted. Sorting by a single dimension is easy to implement, but +doesn't bring asymptotic improvement to queries over range data. Ideally, we +want to store skylines (see “Read Path” subsection below) computed over our ranges so we can binary search. +However, a couple of concerns cause doing this in flush and compaction to feel +unsatisfactory: (1) we need to store multiple skylines, one for each snapshot, +which further complicates the range tombstone meta-block encoding; and (2) even +if we implement this, the range tombstone memtable still needs to be linearly +scanned. Given these concerns we decided to defer collapsing work to the read +side, hoping a good caching strategy could optimize this at some future point. + + +### Read path + +In point lookups, we aggregate range tombstones in an unordered vector as we +search through live memtable, immutable memtables, and then SSTs. When a key is +found that matches the lookup key, we do a scan through the vector, checking +whether the key is deleted. + +In iterators, we aggregate range tombstones into a skyline as we visit live +memtable, immutable memtables, and SSTs. The skyline is expensive to construct but fast to determine whether a key is covered. The skyline keeps track of the most recent range tombstone found to optimize `Next` and `Prev`. + +|![](/static/images/delrange/delrange_uncollapsed.png) |![](/static/images/delrange/delrange_collapsed.png) | + +*([Image source: Leetcode](https://leetcode.com/problems/the-skyline-problem/description/)) The skyline problem involves taking building location/height data in the +unsearchable form of A and converting it to the form of B, which is +binary-searchable. With overlapping range tombstones, to achieve efficient +searching we need to solve an analogous problem, where the x-axis is the +key-space and the y-axis is the sequence number.* +{: style="text-align: center"} + +### Performance characteristics + +For the v1 implementation, writes are much faster compared to the scan and +delete (optionally within a transaction) pattern. `DeleteRange` only logs to WAL +and applies to memtable. Logging to WAL always `fflush`es, and optionally +`fsync`s or `fdatasync`s. Applying to memtable is always an in-memory operation. +Since range tombstones have a dedicated skiplist memtable, the complexity of inserting is O(log(T)), where T is the number of existing buffered range tombstones. + +Reading in the presence of v1 range tombstones, however, is much slower than reads +in a database where scan-and-delete has happened, due to the linear scan over +range tombstone memtables/meta-blocks. + +Iterating in a database with v1 range tombstones is usually slower than in a +scan-and-delete database, although the gap lessens as iterations grow longer. +When an iterator is first created and seeked, we construct a skyline over its +tombstones. This operation is O(T\*log(T)) where T is the number of tombstones +found across live memtable, immutable memtable, L0 files, and one file from each +of the L1+ levels. However, moving the iterator forwards or backwards is simply +a constant-time operation (excluding edge cases, e.g., many range tombstones +between consecutive point keys). + +## v2: Making it fast + +`DeleteRange`’s negative impact on read perf is a barrier to its adoption. The +root cause is range tombstones are not stored or cached in a format that can be +efficiently searched. We needed to design DeleteRange so that we could maintain +write performance while making read performance competitive with workarounds +used in production (e.g., scan-and-delete). + +### Representations + +The key idea of the redesign is that, instead of globally collapsing range tombstones, + we can locally “fragment” them for each SST file and memtable to guarantee that: + +* no range tombstones overlap; and +* range tombstones are ordered by start key. + +Combined, these properties make range tombstones binary searchable. This + fragmentation will happen on the read path, but unlike the previous design, we can + easily cache many of these range tombstone fragments on the read path. + +### Write path + +The write path remains unchanged. + +### Read path + +When an SST file is opened, its range tombstones are fragmented and cached. For point + lookups, we binary search each file's fragmented range tombstones for one that covers + the lookup key. Unlike the old design, once we find a tombstone, we no longer need to + search for the key in lower levels, since we know that any keys on those levels will be + covered (though we do still check the current level since there may be keys written after + the range tombstone). + +For range scans, we create iterators over all the fragmented range + tombstones and store them in a list, seeking each one to cover the start key of the range + scan (if possible), and query each encountered key in this structure as in the old design, + advancing range tombstone iterators as necessary. In effect, we implicitly create a skyline. + This requires significantly less work on iterator creation, but since each memtable/SST has +its own range tombstone iterator, querying range tombstones requires key comparisons (and +possibly iterator increments) for several iterators (as opposed to v1, where we had a global +collapsed representation of all range tombstones). As a result, very long range scans may become + slower than before, but short range scans are an order of magnitude faster, which are the + more common class of range scan. + +## Benchmarks + +To understand the performance of this new design, we used `db_bench` to compare point lookup, short range scan, + and long range scan performance across: + +* the v1 DeleteRange design, +* the scan-and-delete workaround, and +* the v2 DeleteRange design. + +In these benchmarks, we used a database with 5 million data keys, and 10000 range tombstones (ignoring +those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written. +Writing the range tombstones ensures that most of them are not compacted away, and we have more tombstones +in higher levels that cover keys in lower levels, which allows the benchmarks to exercise more interesting behavior +when reading deleted keys. + +Point lookup benchmarks read 100000 keys from a database using `readwhilewriting`. Range scan benchmarks used +`seekrandomwhilewriting` and seeked 100000 times, and advanced up to 10 keys away from the seek position for short range scans, and advanced up to 1000 keys away from the seek position for long range scans. + +The results are summarized in the tables below, averaged over 10 runs (note the +different SHAs for v1 benchmarks are due to a new `db_bench` flag that was added in order to compare performance with databases with no tombstones; for brevity, those results are not reported here). Also note that the block cache was large enough to hold the entire db, so the large throughput is due to limited I/Os and little time spent on decompression. The range tombstone blocks are always pinned uncompressed in memory. We believe these setup details should not affect relative performance between versions. + +### Point Lookups + +|Name |SHA |avg micros/op |avg ops/sec | +|v1 |35cd754a6 |1.3179 |759,830.90 | +|scan-del |7528130e3 |0.6036 |1,667,237.70 | +|v2 |7528130e3 |0.6128 |1,634,633.40 | + +### Short Range Scans + +|Name |SHA |avg micros/op |avg ops/sec | +|v1 |0ed738fdd |6.23 |176,562.00 | +|scan-del |PR 4677 |2.6844 |377,313.00 | +|v2 |PR 4677 |2.8226 |361,249.70 | + +### Long Range scans + +|Name |SHA |avg micros/op |avg ops/sec | +|v1 |0ed738fdd |52.7066 |19,074.00 | +|scan-del |PR 4677 |38.0325 |26,648.60 | +|v2 |PR 4677 |41.2882 |24,714.70 | + +## Future Work + +Note that memtable range tombstones are fragmented every read; for now this is acceptable, + since we expect there to be relatively few range tombstones in memtables (and users can + enforce this by keeping track of the number of memtable range deletions and manually flushing + after it passes a threshold). In the future, a specialized data structure can be used for storing + range tombstones in memory to avoid this work. + +Another future optimization is to create a new format version that requires range tombstones to + be stored in a fragmented form. This would save time when opening SST files, and when `max_open_files` +is not -1 (i.e., files may be opened several times). + +## Acknowledgements + +Special thanks to Peter Mattis and Nikhil Benesch from Cockroach Labs, who were early users of +DeleteRange v1 in production, contributed the cleanest/most efficient v1 aggregation implementation, found and fixed bugs, and provided initial DeleteRange v2 design and continued help. + +Thanks to Huachao Huang and Jinpeng Zhang from PingCAP for early DeleteRange v1 adoption, bug reports, and fixes. diff --git a/src/rocksdb/docs/_posts/2019-03-08-format-version-4.markdown b/src/rocksdb/docs/_posts/2019-03-08-format-version-4.markdown new file mode 100644 index 000000000..ce657696c --- /dev/null +++ b/src/rocksdb/docs/_posts/2019-03-08-format-version-4.markdown @@ -0,0 +1,36 @@ +--- +title: format_version 4 +layout: post +author: maysamyabandeh +category: blog +--- + +The data blocks in RocksDB consist of a sequence of key/values pairs sorted by key, where the pairs are grouped into _restart intervals_ specified by `block_restart_interval`. Up to RocksDB version 5.14, where the latest and default value of `BlockBasedTableOptions::format_version` is 2, the format of index and data blocks are the same: index blocks use the same key format of <`user_key`,`seq`> and encode pointers to data blocks, <`offset`,`size`>, to a byte string and use them as values. The only difference is that the index blocks use `index_block_restart_interval` for the size of _restart intervals_. `format_version=`3,4 offer more optimized, backward-compatible, yet forward-incompatible format for index blocks. + +### Pros + +Using `format_version`=4 significantly reduces the index block size, in some cases around 4-5x. This frees more space in block cache, which would result in higher hit rate for data and filter blocks, or offer the same performance with a smaller block cache size. + +### Cons + +Being _forward-incompatible_ means that if you enable `format_version=`4 you cannot downgrade to a RocksDB version lower than 5.16. + +### How to use it? + +- `BlockBasedTableOptions::format_version` = 4 +- `BlockBasedTableOptions::index_block_restart_interval` = 16 + +### What is format_version 3? +(Since RocksDB 5.15) In most cases, the sequence number `seq` is not necessary for keys in the index blocks. In such cases, `format_version`=3 skips encoding the sequence number and sets `index_key_is_user_key` in TableProperties, which is used by the reader to know how to decode the index block. + +### What is format_version 4? +(Since RocksDB 5.16) Changes the format of index blocks by delta encoding the index values, which are the block handles. This saves the encoding of `BlockHandle::offset` of the non-head index entries in each restart interval. If used, `TableProperties::index_value_is_delta_encoded` is set, which is used by the reader to know how to decode the index block. The format of each key is (shared_size, non_shared_size, shared, non_shared). The format of each value, i.e., block handle, is (offset, size) whenever the shared_size is 0, which included the first entry in each restart point. Otherwise the format is delta-size = block handle size - size of last block handle. + +The index format in `format_version=4` would be as follows: + + restart_point 0: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz) + restart_point 1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz) + ... + restart_point n-1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz) + where, k is key, v is value, and its encoding is in parenthesis. + diff --git a/src/rocksdb/docs/_posts/2019-08-15-unordered-write.markdown b/src/rocksdb/docs/_posts/2019-08-15-unordered-write.markdown new file mode 100644 index 000000000..5f0eb2880 --- /dev/null +++ b/src/rocksdb/docs/_posts/2019-08-15-unordered-write.markdown @@ -0,0 +1,56 @@ +--- +title: Higher write throughput with `unordered_write` feature +layout: post +author: maysamyabandeh +category: blog +--- + +Since RocksDB 6.3, The `unordered_write=`true option together with WritePrepared transactions offers 34-42% higher write throughput compared to vanilla RocksDB. If the application can handle more relaxed ordering guarantees, the gain in throughput would increase to 63-131%. + +### Background + +Currently RocksDB API delivers the following powerful guarantees: +- Atomic reads: Either all of a write batch is visible to reads or none of it. +- Read-your-own writes: When a write thread returns to the user, a subsequent read by the same thread will be able to see its own writes. +- Immutable Snapshots: The reads visible to the snapshot are immutable in the sense that it will not be affected by any in-flight or future writes. + +### `unordered_write` + +The `unordered_write` feature, when turned on, relaxes the default guarantees of RocksDB. While it still gives read-your-own-write property, neither atomic reads nor the immutable snapshot properties are provided any longer. However, RocksDB users could still get read-your-own-write and immutable snapshots when using this feature in conjunction with TransactionDB configured with WritePrepared transactions and `two_write_queues`. You can read [here](https://github.com/facebook/rocksdb/wiki/unordered_write) to learn about the design of `unordered_write` and [here](https://github.com/facebook/rocksdb/wiki/WritePrepared-Transactions) to learn more about WritePrepared transactions. + +### How to use it? + +To get the same guarantees as vanilla RocksdB: + + DBOptions db_options; + db_options.unordered_write = true; + db_options.two_write_queues = true; + DB* db; + { + TransactionDBOptions txn_db_options; + txn_db_options.write_policy = TxnDBWritePolicy::WRITE_PREPARED; + txn_db_options.skip_concurrency_control = true; + TransactionDB* txn_db; + TransactionDB::Open(options, txn_db_options, kDBPath, &txn_db); + db = txn_db; + } + db->Write(...); + +To get relaxed guarantees: + + DBOptions db_options; + db_options.unordered_write = true; + DB* db; + DB::Open(db_options, kDBPath, &db); + db->Write(...); + +# Benchmarks + + TEST_TMPDIR=/dev/shm/ ~/db_bench --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --transaction_db=true --unordered_write=1 --disable_wal=0 + +Throughput with `unordered_write`=true and using WritePrepared transaction: +- WAL: +42% +- No-WAL: +34% +Throughput with `unordered_write`=true +- WAL: +63% +- NoWAL: +131% |