Cache pool
==========

Purpose
-------

Use a pool of fast storage devices (probably SSDs) as a cache for an
existing slower and larger pool.

Use a replicated pool as a front-end to service most I/O, and destage
cold data to a separate erasure coded pool that does not currently (and
cannot efficiently) handle the workload.

We should be able to create and add a cache pool to an existing pool
of data, and later remove it, without disrupting service or migrating
data around.

Use cases
---------

Read-write pool, writeback
~~~~~~~~~~~~~~~~~~~~~~~~~~

We have an existing data pool and put a fast cache pool "in front" of
it.  Writes go to the cache pool and are acked immediately.  We flush
them back to the data pool based on the defined policy.

Read-only pool, weak consistency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We have an existing data pool and add one or more read-only cache
pools.  We copy data to the cache pool(s) on read.  Writes are
forwarded to the original data pool.  Stale data is expired from the
cache pools based on the defined policy.

This is likely only useful for specific applications with specific
data access patterns.  It may be a match for rgw, for example.


Interface
---------

Set up a read/write cache pool foo-hot for pool foo::

  ceph osd tier add foo foo-hot
  ceph osd tier cache-mode foo-hot writeback

Direct all traffic for foo to foo-hot::

  ceph osd tier set-overlay foo foo-hot

Set the target size and enable the tiering agent for foo-hot::

  ceph osd pool set foo-hot hit_set_type bloom
  ceph osd pool set foo-hot hit_set_count 1
  ceph osd pool set foo-hot hit_set_period 3600   # 1 hour
  ceph osd pool set foo-hot target_max_bytes 1000000000000  # 1 TB
  ceph osd pool set foo-hot min_read_recency_for_promote 1
  ceph osd pool set foo-hot min_write_recency_for_promote 1

Drain the cache in preparation for turning it off::

  ceph osd tier cache-mode foo-hot forward
  rados -p foo-hot cache-flush-evict-all

When the cache pool is finally empty, disable it::

  ceph osd tier remove-overlay foo
  ceph osd tier remove foo foo-hot

Read-only pools with lazy consistency::

  ceph osd tier add foo foo-east
  ceph osd tier cache-mode foo-east readonly
  ceph osd tier add foo foo-west
  ceph osd tier cache-mode foo-west readonly


Tiering agent
-------------

The tiering policy is defined as properties on the cache pool itself.

HitSet metadata
~~~~~~~~~~~~~~~

First, the agent requires HitSet information to be tracked on the
cache pool in order to determine which objects in the pool are being
accessed.  This is enabled with::

  ceph osd pool set foo-hot hit_set_type bloom
  ceph osd pool set foo-hot hit_set_count 1
  ceph osd pool set foo-hot hit_set_period 3600   # 1 hour

The supported HitSet types include 'bloom' (a bloom filter, the
default), 'explicit_hash', and 'explicit_object'.  The latter two
explicitly enumerate accessed objects and are less memory efficient.
They exist primarily for debugging and to demonstrate pluggability
of the infrastructure.
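As a quick sanity check, the HitSet settings currently in effect can
usually be read back with ``ceph osd pool get`` (shown here as a sketch;
the exact set of readable keys varies between Ceph releases)::

  ceph osd pool get foo-hot hit_set_type
  ceph osd pool get foo-hot hit_set_count
  ceph osd pool get foo-hot hit_set_period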
For the 'bloom' type, you can additionally define the false positive
probability of the filter (the default is 0.05)::

  ceph osd pool set foo-hot hit_set_fpp 0.15

The ``hit_set_period`` and ``hit_set_count`` define how much time each
HitSet should cover and how many such HitSets to store.  Binning
accesses over time allows Ceph to independently determine whether an
object was accessed at least once and whether it was accessed more than
once over some time period ("age" vs "temperature").

``min_read_recency_for_promote`` defines how many HitSets to check for
the existence of an object when handling a read operation.  The result
is used to decide whether to promote the object asynchronously.  Its
value should be between 0 and ``hit_set_count``.  If it is set to 0, the
object is always promoted.  If it is set to 1, only the current HitSet
is checked, and the object is promoted only if it appears there.  For
larger values, that many of the most recent HitSets (the current one
plus the archived ones) are checked, and the object is promoted if it is
found in any of them.

A similar parameter, ``min_write_recency_for_promote``, controls
promotion on write operations::

  ceph osd pool set {cachepool} min_read_recency_for_promote 1
  ceph osd pool set {cachepool} min_write_recency_for_promote 1

Note that the longer the ``hit_set_period`` and the higher the
``min_read_recency_for_promote``/``min_write_recency_for_promote``, the
more RAM will be consumed by the ceph-osd process.  In particular, when
the agent is active to flush or evict cache objects, all
``hit_set_count`` HitSets are loaded into RAM.

Cache mode
~~~~~~~~~~

The most important policy is the cache mode::

  ceph osd tier cache-mode foo-hot writeback

The supported modes are 'none', 'writeback', 'forward', and
'readonly'.  Most installations want 'writeback', which will write
into the cache tier and only later flush updates back to the base
tier.  Similarly, any object that is read will be promoted into the
cache tier.

The 'forward' mode is intended for when the cache is being disabled
and needs to be drained.  No new objects will be promoted or written
to the cache pool unless they are already present.  A background
operation can then do something like::

  rados -p foo-hot cache-try-flush-evict-all
  rados -p foo-hot cache-flush-evict-all

to force all data to be flushed back to the base tier.

The 'readonly' mode is intended for read-only workloads that do not
require consistency to be enforced by the storage system.  Writes will
be forwarded to the base tier, but objects that are read will get
promoted to the cache.  No attempt is made by Ceph to ensure that the
contents of the cache tier(s) are consistent in the presence of object
updates.

Cache sizing
~~~~~~~~~~~~

The agent performs two basic functions: flushing (writing 'dirty'
cache objects back to the base tier) and evicting (removing cold and
clean objects from the cache).

The thresholds at which Ceph will flush or evict objects are specified
relative to a 'target size' of the pool.  For example::

  ceph osd pool set foo-hot cache_target_dirty_ratio .4
  ceph osd pool set foo-hot cache_target_dirty_high_ratio .6
  ceph osd pool set foo-hot cache_target_full_ratio .8

will begin flushing dirty objects when 40% of the pool is dirty, flush
dirty objects more aggressively once 60% of the pool is dirty, and begin
evicting clean objects when usage reaches 80% of the target size.
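For a concrete sense of scale, here is a back-of-the-envelope sketch
assuming the 1 TB ``target_max_bytes`` used elsewhere in this document
(illustrative only; the agent's exact behaviour depends on its internal
accounting)::

  # hypothetical thresholds implied by a 1 TB target size
  target=1000000000000             # target_max_bytes, as in the examples above
  echo $(( target * 40 / 100 ))    # ~400 GB dirty -> flushing begins
  echo $(( target * 60 / 100 ))    # ~600 GB dirty -> high-speed flushing begins
  echo $(( target * 80 / 100 ))    # ~800 GB used  -> eviction begins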
The target size can be specified either in terms of objects or bytes::

  ceph osd pool set foo-hot target_max_bytes 1000000000000  # 1 TB
  ceph osd pool set foo-hot target_max_objects 1000000      # 1 million objects

Note that if both limits are specified, Ceph will begin flushing or
evicting when either threshold is triggered.

Other tunables
~~~~~~~~~~~~~~

You can specify a minimum object age before a recently updated object is
flushed to the base tier::

  ceph osd pool set foo-hot cache_min_flush_age 600   # 10 minutes

You can specify the minimum age of an object before it will be evicted
from the cache tier::

  ceph osd pool set foo-hot cache_min_evict_age 1800   # 30 minutes
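Once these tunables have been applied, the tiering-related settings on
the cache pool can be reviewed as a whole.  One possible sketch (the
exact output format varies between Ceph releases)::

  ceph osd pool ls detail | grep foo-hot
  ceph osd dump | grep foo-hot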