.. _ecpool:

==============
 Erasure code
==============

By default, Ceph `pools <../pools>`_ are created with the type "replicated". In
replicated-type pools, every object is copied to multiple disks. This
multiple copying is the method of data protection known as "replication".

By contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
pools use a method of data protection that is different from replication. In
erasure coding, data is broken into fragments of two kinds: data blocks and
parity blocks. If a drive fails or becomes corrupted, the parity blocks are
used to rebuild the data. At scale, erasure coding saves space relative to
replication.

In this documentation, data blocks are referred to as "data chunks"
and parity blocks are referred to as "coding chunks".

Erasure codes are also called "forward error correction codes". The
first forward error correction code was developed in 1950 by Richard
Hamming at Bell Laboratories.


Creating a sample erasure-coded pool
------------------------------------

The simplest erasure-coded pool is similar to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts:

.. prompt:: bash $

   ceph osd pool create ecpool erasure

::

   pool 'ecpool' created

.. prompt:: bash $

   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI


Erasure-code profiles
---------------------

The default erasure-code profile can sustain the simultaneous loss of two
OSDs without losing data. This erasure-code profile is equivalent to a
replicated pool of size three, but with different storage requirements:
instead of requiring 3TB to store 1TB, it requires only 2TB to store 1TB. The
default profile can be displayed with this command:

.. prompt:: bash $

   ceph osd erasure-code-profile get default

::

   k=2
   m=2
   plugin=jerasure
   crush-failure-domain=host
   technique=reed_sol_van

.. note::
   The profile just displayed is for the *default* erasure-coded pool, not the
   *simplest* erasure-coded pool. These two pools are not the same:

   The default erasure-coded pool has two data chunks (K) and two coding chunks
   (M). The profile of the default erasure-coded pool is "k=2 m=2".

   The simplest erasure-coded pool has two data chunks (K) and one coding chunk
   (M). The profile of the simplest erasure-coded pool is "k=2 m=1".

Choosing the right profile is important, because a profile cannot be modified
after the pool has been created. If you find that you need an erasure-coded
pool with a different profile, you must create a new pool with a different
(and presumably more carefully considered) profile, and then move all objects
from the wrongly configured pool to the newly created pool.
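Because the choice is permanent, it can be worth reviewing the profiles that
already exist on the cluster before creating a pool. As a minimal check, the
following commands list all defined profiles and then display one of them
(the profile names shown will vary from cluster to cluster):

.. prompt:: bash $

   ceph osd erasure-code-profile ls
   ceph osd erasure-code-profile get default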
The most important parameters of the profile are *K*, *M*, and
*crush-failure-domain*, because they define the storage overhead and
the data durability. For example, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67%,
the following profile can be defined:

.. prompt:: bash $

   ceph osd erasure-code-profile set myprofile \
      k=3 \
      m=2 \
      crush-failure-domain=rack
   ceph osd pool create ecpool erasure myprofile
   echo ABCDEFGHI | rados --pool ecpool put NYAN -
   rados --pool ecpool get NYAN -

::

   ABCDEFGHI

The *NYAN* object will be divided into three data *chunks* (*K=3*) and two
additional coding *chunks* will be created (*M=2*). The value of *M* defines
how many OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* setting creates a CRUSH rule that ensures that no
two *chunks* are stored in the same rack.

.. ditaa::

                            +-------------------+
                       name |        NYAN       |
                            +-------------------+
                    content |      ABCDEFGHI    |
                            +--------+----------+
                                     |
                                     |
                                     v
                              +------+------+
                +-------------+ encode(3,2) +-----------+
                |             +--+--+---+---+           |
                |                |  |   |               |
                |         +------+  |   +-----+         |
                |         |         |         |         |
             +--v---+  +--v---+  +--v---+  +--v---+  +--v---+
       name  | NYAN |  | NYAN |  | NYAN |  | NYAN |  | NYAN |
             +------+  +------+  +------+  +------+  +------+
      shard  |  1   |  |  2   |  |  3   |  |  4   |  |  5   |
             +------+  +------+  +------+  +------+  +------+
    content  | ABC  |  | DEF  |  | GHI  |  | YXY  |  | QGC  |
             +--+---+  +--+---+  +--+---+  +--+---+  +--+---+
                |         |         |         |         |
                |         |         v         |         |
                |         |      +--+---+     |         |
                |         |      | OSD1 |     |         |
                |         |      +------+     |         |
                |         |                   |         |
                |         |      +------+     |         |
                |         +----->| OSD2 |     |         |
                |                +------+     |         |
                |                             |         |
                |                +------+     |         |
                |                | OSD3 |<----+         |
                |                +------+               |
                |                                       |
                |                +------+               |
                |                | OSD4 |<--------------+
                |                +------+
                |
                |                +------+
                +--------------->| OSD5 |
                                 +------+

More information can be found in the `erasure-code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure-coded pools work only with operations that
perform full object writes and appends (for example, RGW).

Since Luminous, partial writes for an erasure-coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure-coded pool:

.. prompt:: bash $

   ceph osd pool set ec_pool allow_ec_overwrites true

This can be enabled only on a pool residing on BlueStore OSDs, since
BlueStore's checksumming is used during deep scrubs to detect bitrot
or other corruption. Using Filestore with EC overwrites is not only
unsafe, but it also results in lower performance compared to BlueStore.

Erasure-coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool and
their metadata in a replicated pool. For RBD, this means using the
erasure-coded pool as the ``--data-pool`` during image creation:

.. prompt:: bash $

   rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, an erasure-coded pool can be set as the default data pool during
file system creation or selected for existing directories via `file layouts
<../../../cephfs/file-layouts>`_, as sketched below.
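For example, assume a file system named ``cephfs`` mounted at ``/mnt/cephfs``
and an erasure-coded pool named ``ec_pool`` on which ``allow_ec_overwrites``
has already been enabled (these names are illustrative). The pool can be
attached to the file system as an additional data pool and then selected for
a directory with a file layout:

.. prompt:: bash $

   ceph fs add_data_pool cephfs ec_pool
   setfattr -n ceph.dir.layout.pool -v ec_pool /mnt/cephfs/ecdir

New files created under ``/mnt/cephfs/ecdir`` will then store their data in
``ec_pool``, while the file system's metadata remains in the replicated
metadata pool.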
Erasure-coded pools and cache tiering
-------------------------------------

.. note:: Cache tiering is deprecated in Reef.

Erasure-coded pools require more resources than replicated pools and
lack some of the functionality supported by replicated pools (for example,
omap). To overcome these limitations, one can set up a `cache tier
<../cache-tiering>`_ before setting up the erasure-coded pool.

For example, if the pool *hot-storage* is made of fast storage, the following
commands will place the *hot-storage* pool as a tier of *ecpool* in
*writeback* mode:

.. prompt:: bash $

   ceph osd tier add ecpool hot-storage
   ceph osd tier cache-mode hot-storage writeback
   ceph osd tier set-overlay ecpool hot-storage

The result is that every write and read to the *ecpool* actually uses
the *hot-storage* pool and benefits from its flexibility and speed.

More information can be found in the `cache tiering
<../cache-tiering>`_ documentation. Note, however, that cache tiering
is deprecated and may be removed completely in a future release.

Erasure-coded pool recovery
---------------------------

If an erasure-coded pool loses any data shards, it must recover them from the
others. This recovery involves reading from the remaining shards,
reconstructing the data, and writing new shards.

In Octopus and later releases, erasure-coded pools can recover as long as
there are at least *K* shards available. (With fewer than *K* shards, you
have actually lost data!)

Prior to Octopus, erasure-coded pools required that at least ``min_size``
shards be available, even if ``min_size`` was greater than ``K``. This was a
conservative decision made out of an abundance of caution when designing the
new pool mode. As a result, however, pools with lost OSDs but without
complete data loss were unable to recover and go active without manual
intervention to temporarily change the ``min_size`` setting.

We recommend that ``min_size`` be ``K+1`` or greater to prevent loss of
writes and loss of data.


Glossary
--------

*chunk*
   When the encoding function is called, it returns chunks of the same size
   as each other. There are two kinds of chunks: (1) *data chunks*, which can
   be concatenated to reconstruct the original object, and (2) *coding
   chunks*, which can be used to rebuild a lost chunk.

*K*
   The number of data chunks into which an object is divided. For example,
   if *K* = 2, then a 10KB object is divided into two chunks of 5KB each.

*M*
   The number of coding chunks computed by the encoding function. *M* is
   equal to the number of OSDs that can be missing from the cluster without
   the cluster suffering data loss. For example, if there are two coding
   chunks, then two OSDs can be missing without data loss.

Table of contents
-----------------

.. toctree::
   :maxdepth: 1

   erasure-code-profile
   erasure-code-jerasure
   erasure-code-isa
   erasure-code-lrc
   erasure-code-shec
   erasure-code-clay