author     Daniel Baumann <daniel.baumann@progress-linux.org>  2024-04-07 18:45:59 +0000
committer  Daniel Baumann <daniel.baumann@progress-linux.org>  2024-04-07 18:45:59 +0000
commit     19fcec84d8d7d21e796c7624e521b60d28ee21ed (patch)
tree       42d26aa27d1e3f7c0b8bd3fd14e7d7082f5008dc /doc/start/hardware-recommendations.rst
parent     Initial commit. (diff)
Adding upstream version 16.2.11+ds.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/start/hardware-recommendations.rst')
-rw-r--r--  doc/start/hardware-recommendations.rst | 509
1 file changed, 509 insertions(+), 0 deletions(-)
diff --git a/doc/start/hardware-recommendations.rst b/doc/start/hardware-recommendations.rst
new file mode 100644
index 000000000..bf8eca4ba
--- /dev/null
+++ b/doc/start/hardware-recommendations.rst
@@ -0,0 +1,509 @@
+.. _hardware-recommendations:
+
+==========================
+ Hardware Recommendations
+==========================
+
+Ceph was designed to run on commodity hardware, which makes building and
+maintaining petabyte-scale data clusters economically feasible.
+When planning out your cluster hardware, you will need to balance a number
+of considerations, including failure domains and potential performance
+issues. Hardware planning should include distributing Ceph daemons and
+other processes that use Ceph across many hosts. Generally, we recommend
+running Ceph daemons of a specific type on a host configured for that type
+of daemon. We recommend using other hosts for processes that utilize your
+data cluster (e.g., OpenStack, CloudStack, etc.).
+
+
+.. tip:: Check out the `Ceph blog`_ too.
+
+
+CPU
+===
+
+CephFS Metadata Servers (MDS) are CPU-intensive and should therefore have
+quad-core (or better) CPUs with high clock rates (GHz). OSD
+nodes need enough processing power to run the RADOS service, to calculate data
+placement with CRUSH, to replicate data, and to maintain their own copies of the
+cluster map.
+
+The requirements of one Ceph cluster are not the same as the requirements of
+another, but here are some general guidelines.
+
+In earlier versions of Ceph, we made hardware recommendations based on the
+number of cores per OSD, but the cores-per-OSD metric is no longer as useful
+as the number of cycles per IOP and the number of IOPS per OSD. For example,
+with NVMe drives, Ceph can easily utilize five or six cores on real clusters
+and up to about fourteen cores on single OSDs in isolation. Cores per OSD are
+therefore no longer as pressing a concern as they once were. When selecting
+hardware, select for IOPS per core.
+
+Monitor nodes and manager nodes have no heavy CPU demands and require only
+modest processors. If your host machines will run CPU-intensive processes in
+addition to Ceph daemons, make sure that you have enough processing power to
+run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
+one such example of a CPU-intensive process.) We recommend that you run
+non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
+not your monitor and manager nodes) in order to avoid resource contention.
+
+RAM
+===
+
+Generally, more RAM is better. Monitor / manager nodes for a modest cluster
+might do fine with 64GB; for a larger cluster with hundreds of OSDs, 128GB
+is a reasonable target. There is a memory target for BlueStore OSDs that
+defaults to 4GB. Factor in a prudent margin for the operating system and
+administrative tasks (like monitoring and metrics) as well as increased
+consumption during recovery: provisioning ~8GB per BlueStore OSD
+is advised.
+
+Monitors and managers (ceph-mon and ceph-mgr)
+---------------------------------------------
+
+Monitor and manager daemon memory usage generally scales with the size of the
+cluster. Note that at boot-time and during topology changes and recovery these
+daemons will need more RAM than they do during steady-state operation, so plan
+for peak usage. For very small clusters, 32 GB suffices. For clusters of up
+to, say, 300 OSDs, go with 64GB. For clusters built with (or which will grow
+to) even more OSDs, you should provision 128GB. You may also want to consider
+tuning the following settings (an example follows the list):
+
+* `mon_osd_cache_size`
+* `rocksdb_cache_size`
+
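+If tuning proves necessary, these options can be changed at runtime with
+``ceph config set``. A minimal sketch (the values are illustrative only; check
+each option's units and defaults with ``ceph config help <option>``):
+
+.. code-block:: console
+
+   # ceph config set mon mon_osd_cache_size 1000
+   # ceph config set mon rocksdb_cache_size 1073741824
+   # ceph config get mon rocksdb_cache_size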
+
+Metadata servers (ceph-mds)
+---------------------------
+
+The metadata daemon memory utilization depends on how much memory its cache is
+configured to consume. We recommend 1 GB as a minimum for most systems. See
+``mds_cache_memory_limit``.
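+
+For example, to raise the MDS cache memory limit to 8 GiB at runtime (an
+illustrative value, not a recommendation):
+
+.. code-block:: console
+
+   # ceph config set mds mds_cache_memory_limit 8589934592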
+
+Memory
+======
+
+BlueStore uses its own memory to cache data rather than relying on the
+operating system's page cache. In BlueStore you can adjust the amount of memory
+that the OSD attempts to consume by changing the `osd_memory_target`
+configuration option (an example appears after the list below).
+
+- Setting the `osd_memory_target` below 2GB is typically not
+ recommended (Ceph may fail to keep the memory consumption under 2GB and
+ this may cause extremely slow performance).
+
+- Setting the memory target between 2GB and 4GB typically works but may result
+ in degraded performance: metadata may be read from disk during IO unless the
+ active data set is relatively small.
+
+- 4GB is the current default `osd_memory_target` size. This default
+ was chosen for typical use cases, and is intended to balance memory
+ requirements and OSD performance.
+
+- Setting the `osd_memory_target` higher than 4GB can improve
+  performance when there are many (small) objects or when large (256GB/OSD
+  or more) data sets are processed.
+
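+For example, to raise the target to 8 GiB for all OSDs, or for one specific
+OSD (illustrative values only):
+
+.. code-block:: console
+
+   # ceph config set osd osd_memory_target 8589934592
+   # ceph config set osd.0 osd_memory_target 8589934592
+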
+.. important:: OSD memory autotuning is "best effort". Although the OSD may
+ unmap memory to allow the kernel to reclaim it, there is no guarantee that
+ the kernel will actually reclaim freed memory within a specific time
+ frame. This applies especially in older versions of Ceph, where transparent
+ huge pages can prevent the kernel from reclaiming memory that was freed from
+ fragmented huge pages. Modern versions of Ceph disable transparent huge
+ pages at the application level to avoid this, but that does not
+ guarantee that the kernel will immediately reclaim unmapped memory. The OSD
+ may still at times exceed its memory target. We recommend budgeting
+ approximately 20% extra memory on your system to prevent OSDs from going OOM
+ (**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
+ the kernel reclaiming freed pages. That 20% value might be more or less than
+ needed, depending on the exact configuration of the system.
+
+When using the legacy FileStore back end, the page cache is used for caching
+data, so no tuning is normally needed. With FileStore, the OSD memory
+consumption is related to the number of PGs per daemon in the system.
+
+
+Data Storage
+============
+
+Plan your data storage configuration carefully. There are significant cost and
+performance tradeoffs to consider when planning for data storage. Simultaneous
+OS operations and simultaneous requests from multiple daemons for read and
+write operations against a single drive can slow performance.
+
+Hard Disk Drives
+----------------
+
+OSDs should have plenty of storage drive space for object data. We recommend a
+minimum disk drive size of 1 terabyte. Consider the cost-per-gigabyte advantage
+of larger disks. We recommend dividing the price of the disk drive by the
+number of gigabytes to arrive at a cost per gigabyte, because larger drives may
+have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte
+hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 =
+0.0732). By contrast, a 3 terabyte disk priced at $150.00 has a cost of $0.05
+per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the
+1 terabyte disks would generally increase the cost per gigabyte by
+50% ($0.0732 vs. $0.0488)--rendering your cluster substantially less cost
+efficient.
+
+.. tip:: Running multiple OSDs on a single SAS / SATA drive
+ is **NOT** a good idea. NVMe drives, however, can achieve
+ improved performance by being split into two or more OSDs.
+
+.. tip:: Running an OSD and a monitor or a metadata server on a single
+ drive is also **NOT** a good idea.
+
+.. tip:: With spinning disks, the SATA and SAS interface increasingly
+ becomes a bottleneck at larger capacities. See also the `Storage Networking
+ Industry Association's Total Cost of Ownership calculator`_.
+
+
+Storage drives are subject to limitations on seek time, access time, read and
+write times, as well as total throughput. These physical limitations affect
+overall system performance--especially during recovery. We recommend using a
+dedicated (ideally mirrored) drive for the operating system and software, and
+one drive for each Ceph OSD Daemon you run on the host (modulo NVMe above).
+Many "slow OSD" issues (when they are not attributable to hardware failure)
+arise from running an operating system and multiple OSDs on the same drive.
+
+It is technically possible to run multiple Ceph OSD Daemons per SAS / SATA
+drive, but this will lead to resource contention and diminish overall
+throughput.
+
+To get the best performance out of Ceph, run the following on separate drives:
+(1) operating systems, (2) OSD data, and (3) BlueStore db. For more
+information on how to effectively use a mix of fast drives and slow drives in
+your Ceph cluster, see the `block and block.db`_ section of the Bluestore
+Configuration Reference.
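+
+A hedged example of provisioning an OSD whose data lives on an HDD and whose
+BlueStore DB lives on a faster device (the device paths below are placeholders):
+
+.. code-block:: console
+
+   # ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1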
+
+Solid State Drives
+------------------
+
+Ceph performance can be improved by using solid-state drives (SSDs), which
+reduce random access time and latency while increasing throughput.
+
+SSDs cost more per gigabyte than do hard disk drives, but SSDs often offer
+access times that are, at a minimum, 100 times faster than hard disk drives.
+SSDs avoid hotspot issues and bottleneck issues within busy clusters, and
+they may offer better economics when TCO is evaluated holistically.
+
+SSDs do not have moving mechanical parts, so they are not necessarily subject
+to the same types of limitations as hard disk drives. SSDs do have significant
+limitations though. When evaluating SSDs, it is important to consider the
+performance of sequential reads and writes.
+
+.. important:: We recommend exploring the use of SSDs to improve performance.
+ However, before making a significant investment in SSDs, we **strongly
+ recommend** reviewing the performance metrics of an SSD and testing the
+ SSD in a test configuration in order to gauge performance.
+
+Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
+Acceptable IOPS are not the only factor to consider when selecting an SSD for
+use with Ceph.
+
+SSDs have historically been cost prohibitive for object storage, but emerging
+QLC drives are closing the gap, offering greater density with lower power
+consumption and less power spent on cooling. HDD OSDs may see a significant
+performance improvement by offloading WAL+DB onto an SSD.
+
+To get a better sense of the factors that determine the cost of storage, you
+might use the `Storage Networking Industry Association's Total Cost of
+Ownership calculator`_
+
+Partition Alignment
+~~~~~~~~~~~~~~~~~~~
+
+When using SSDs with Ceph, make sure that your partitions are properly aligned.
+Improperly aligned partitions suffer slower data transfer speeds than do
+properly aligned partitions. For more information about proper partition
+alignment and example commands that show how to align partitions properly, see
+`Werner Fischer's blog post on partition alignment`_.
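+
+As a quick check of whether an existing partition is optimally aligned (a
+sketch; ``1`` is the partition number and ``/dev/sdX`` is a placeholder):
+
+.. code-block:: console
+
+   # parted /dev/sdX align-check optimal 1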
+
+CephFS Metadata Segregation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+One way that Ceph accelerates CephFS file system performance is by segregating
+the storage of CephFS metadata from the storage of the CephFS file contents.
+Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
+have to create a pool for CephFS metadata, but you can create a CRUSH map
+hierarchy for your CephFS metadata pool that points only to SSD storage media.
+See :ref:`CRUSH Device Class<crush-map-device-class>` for details.
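+
+A hedged sketch of steering an existing CephFS metadata pool onto SSDs with a
+device-class CRUSH rule (the rule and pool names here are examples):
+
+.. code-block:: console
+
+   # ceph osd crush rule create-replicated fast-ssd default host ssd
+   # ceph osd pool set cephfs_metadata crush_rule fast-ssd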
+
+
+Controllers
+-----------
+
+Disk controllers (HBAs) can have a significant impact on write throughput.
+Carefully consider your selection of HBAs to ensure that they do not create a
+performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
+than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
+backup can substantially increase hardware and maintenance costs. Some RAID
+HBAs can be configured with an IT-mode "personality".
+
+.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
+ performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
+ Throughput 2`_ for additional details.
+
+
+Benchmarking
+------------
+
+BlueStore opens block devices in O_DIRECT and uses fsync frequently to ensure
+that data is safely persisted to media. You can evaluate a drive's low-level
+write performance using ``fio``. For example, 4kB random write performance can
+be measured as follows (note that writing directly to a raw device destroys
+any data on it):
+
+.. code-block:: console
+
+  # fio --name=write_test --filename=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=300
+
+Write Caches
+------------
+
+Enterprise SSDs and HDDs normally include power loss protection features which
+use multi-level caches to speed up direct or synchronous writes. These devices
+can be toggled between two caching modes -- a volatile cache flushed to
+persistent media with fsync, or a non-volatile cache written synchronously.
+
+These two modes are selected by either "enabling" or "disabling" the write
+(volatile) cache. When the volatile cache is enabled, Linux uses a device in
+"write back" mode, and when disabled, it uses "write through".
+
+The default configuration (normally: caching enabled) may not be optimal;
+disabling the write cache may dramatically increase OSD performance, in terms
+of both increased IOPS and decreased commit_latency.
+
+Users are therefore encouraged to benchmark their devices with ``fio`` as
+described earlier and persist the optimal cache configuration for their
+devices.
+
+The cache configuration can be queried with ``hdparm``, ``sdparm``,
+``smartctl`` or by reading the values in ``/sys/class/scsi_disk/*/cache_type``,
+for example:
+
+.. code-block:: console
+
+ # hdparm -W /dev/sda
+
+ /dev/sda:
+ write-caching = 1 (on)
+
+ # sdparm --get WCE /dev/sda
+ /dev/sda: ATA TOSHIBA MG07ACA1 0101
+ WCE 1 [cha: y]
+ # smartctl -g wcache /dev/sda
+ smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
+ Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
+
+ Write cache is: Enabled
+
+ # cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
+ write back
+
+The write cache can be disabled with those same tools:
+
+.. code-block:: console
+
+ # hdparm -W0 /dev/sda
+
+ /dev/sda:
+ setting drive write-caching to 0 (off)
+ write-caching = 0 (off)
+
+ # sdparm --clear WCE /dev/sda
+ /dev/sda: ATA TOSHIBA MG07ACA1 0101
+ # smartctl -s wcache,off /dev/sda
+ smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
+ Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
+
+ === START OF ENABLE/DISABLE COMMANDS SECTION ===
+ Write cache disabled
+
+Normally, disabling the cache using ``hdparm``, ``sdparm``, or ``smartctl``
+results in the cache_type changing automatically to "write through". If this is
+not the case, you can try setting it directly as follows. (Users should note
+that setting cache_type also correctly persists the caching mode of the device
+until the next reboot):
+
+.. code-block:: console
+
+ # echo "write through" > /sys/class/scsi_disk/0\:0\:0\:0/cache_type
+
+ # hdparm -W /dev/sda
+
+ /dev/sda:
+ write-caching = 0 (off)
+
+.. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device cache_types to "write
+ through":
+
+ .. code-block:: console
+
+ # cat /etc/udev/rules.d/99-ceph-write-through.rules
+ ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"
+
+.. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device cache_types to "write
+ through":
+
+ .. code-block:: console
+
+ # cat /etc/udev/rules.d/99-ceph-write-through-el7.rules
+ ACTION=="add", SUBSYSTEM=="scsi_disk", RUN+="/bin/sh -c 'echo write through > /sys/class/scsi_disk/$kernel/cache_type'"
+
+.. tip:: The ``sdparm`` utility can be used to view/change the volatile write
+ cache on several devices at once:
+
+ .. code-block:: console
+
+ # sdparm --get WCE /dev/sd*
+ /dev/sda: ATA TOSHIBA MG07ACA1 0101
+ WCE 0 [cha: y]
+ /dev/sdb: ATA TOSHIBA MG07ACA1 0101
+ WCE 0 [cha: y]
+ # sdparm --clear WCE /dev/sd*
+ /dev/sda: ATA TOSHIBA MG07ACA1 0101
+ /dev/sdb: ATA TOSHIBA MG07ACA1 0101
+
+Additional Considerations
+-------------------------
+
+You typically will run multiple OSDs per host, but you should ensure that the
+aggregate throughput of your OSD drives doesn't exceed the network bandwidth
+available to serve clients' read and write requests. You should also
+consider what percentage of the overall data the cluster stores on each host. If
+the percentage on a particular host is large and the host fails, it can lead to
+problems such as exceeding the ``full ratio``, which causes Ceph to halt
+operations as a safety precaution that prevents data loss.
+
+When you run multiple OSDs per host, you also need to ensure that the kernel
+is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
+``syncfs(2)`` to ensure that your hardware performs as expected when running
+multiple OSDs per host.
+
+
+Networks
+========
+
+Provision at least 10 Gb/s networking in your racks.
+
+Speed
+-----
+
+It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it
+takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes only
+twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes
+only one hour to replicate 10 TB across a 10 Gb/s network.
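+
+To verify the negotiated link speed on a node (the interface name is an
+example):
+
+.. code-block:: console
+
+   # ethtool eth0 | grep -i speed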
+
+Cost
+----
+
+The larger the Ceph cluster, the more common OSD failures will be.
+The faster that a placement group (PG) can recover from a ``degraded`` state to
+an ``active + clean`` state, the better. Notably, fast recovery minimizes
+the likelihood of multiple, overlapping failures that can cause data to become
+temporarily unavailable or even lost. Of course, when provisioning your
+network, you will have to balance price against performance.
+
+Some deployment tools employ VLANs to make hardware and network cabling more
+manageable. VLANs that use the 802.1q protocol require VLAN-capable NICs and
+switches. The added expense of this hardware may be offset by the operational
+cost savings on network setup and maintenance. When using VLANs to handle VM
+traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
+etc.), there is additional value in using 10 Gb/s Ethernet or better; as of
+2022, 40 Gb/s or 25/50/100 Gb/s networking is common for production clusters.
+
+Top-of-rack (TOR) switches also need fast and redundant uplinks to
+spine switches / routers, often at least 40 Gb/s.
+
+
+Baseboard Management Controller (BMC)
+-------------------------------------
+
+Your server chassis should have a Baseboard Management Controller (BMC).
+Well-known examples are iDRAC (Dell), CIMC (Cisco UCS), and iLO (HPE).
+Administration and deployment tools may also use BMCs extensively, especially
+via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
+network for security and administration. Hypervisor SSH access, VM image uploads,
+OS image installs, management sockets, etc. can impose significant loads on a network.
+Running three networks may seem like overkill, but each traffic path represents
+a potential capacity, throughput and/or performance bottleneck that you should
+carefully consider before deploying a large scale data cluster.
+
+
+Failure Domains
+===============
+
+A failure domain is any failure that prevents access to one or more OSDs. That
+could be a stopped daemon on a host, a disk failure, an OS crash, a
+malfunctioning NIC, a failed power supply, a network outage, a power outage,
+and so forth. When planning out your hardware needs, you must balance the
+temptation to reduce costs by placing too many responsibilities into too few
+failure domains against the added costs of isolating every potential failure
+domain.
+
+
+Minimum Hardware Recommendations
+================================
+
+Ceph can run on inexpensive commodity hardware. Small production clusters
+and development clusters can run successfully with modest hardware.
+
++--------------+----------------+-----------------------------------------+
+| Process | Criteria | Minimum Recommended |
++==============+================+=========================================+
+| ``ceph-osd`` | Processor | - 1 core minimum |
+| | | - 1 core per 200-500 MB/s |
+| | | - 1 core per 1000-3000 IOPS |
+| | | |
+| | | * Results are before replication. |
+| | | * Results may vary with different |
+| | | CPU models and Ceph features. |
+| | | (erasure coding, compression, etc) |
+| | | * ARM processors specifically may |
+| | | require additional cores. |
+| | | * Actual performance depends on many |
+| | | factors including drives, net, and |
+| | | client throughput and latency. |
+| | | Benchmarking is highly recommended. |
+| +----------------+-----------------------------------------+
+| | RAM | - 4GB+ per daemon (more is better) |
+| | | - 2-4GB often functions (may be slow) |
+| | | - Less than 2GB not recommended |
+| +----------------+-----------------------------------------+
+| | Volume Storage | 1x storage drive per daemon |
+| +----------------+-----------------------------------------+
+| | DB/WAL | 1x SSD partition per daemon (optional) |
+| +----------------+-----------------------------------------+
+| | Network | 1x 1GbE+ NICs (10GbE+ recommended) |
++--------------+----------------+-----------------------------------------+
+| ``ceph-mon`` | Processor | - 2 cores minimum |
+| +----------------+-----------------------------------------+
+| | RAM | 2-4GB+ per daemon |
+| +----------------+-----------------------------------------+
+| | Disk Space | 60 GB per daemon |
+| +----------------+-----------------------------------------+
+| | Network | 1x 1GbE+ NICs |
++--------------+----------------+-----------------------------------------+
+| ``ceph-mds`` | Processor | - 2 cores minimum |
+| +----------------+-----------------------------------------+
+| | RAM | 2GB+ per daemon |
+| +----------------+-----------------------------------------+
+| | Disk Space | 1 MB per daemon |
+| +----------------+-----------------------------------------+
+| | Network | 1x 1GbE+ NICs |
++--------------+----------------+-----------------------------------------+
+
+.. tip:: If you are running an OSD with a single disk, create a
+ partition for your volume storage that is separate from the partition
+ containing the OS. Generally, we recommend separate disks for the
+ OS and the volume storage.
+
+
+
+.. _block and block.db: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#block-and-block-db
+.. _Ceph blog: https://ceph.com/community/blog/
+.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
+.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
+.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
+.. _OS Recommendations: ../os-recommendations
+.. _Storage Networking Industry Association's Total Cost of Ownership calculator: https://www.snia.org/forums/cmsi/programs/TCOcalc
+.. _Werner Fischer's blog post on partition alignment: https://www.thomas-krenn.com/en/wiki/Partition_Alignment_detailed_explanation