From 19fcec84d8d7d21e796c7624e521b60d28ee21ed Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Sun, 7 Apr 2024 20:45:59 +0200
Subject: Adding upstream version 16.2.11+ds.

Signed-off-by: Daniel Baumann
---
 doc/start/hardware-recommendations.rst | 509 +++++++++++++++++++++++++++++++++
 1 file changed, 509 insertions(+)
 create mode 100644 doc/start/hardware-recommendations.rst

(limited to 'doc/start/hardware-recommendations.rst')

diff --git a/doc/start/hardware-recommendations.rst b/doc/start/hardware-recommendations.rst
new file mode 100644
index 000000000..bf8eca4ba
--- /dev/null
+++ b/doc/start/hardware-recommendations.rst
@@ -0,0 +1,509 @@
.. _hardware-recommendations:

==========================
 Hardware Recommendations
==========================

Ceph was designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters economically feasible.
When planning out your cluster hardware, you will need to balance a number
of considerations, including failure domains and potential performance
issues. Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using other hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack).


.. tip:: Check out the `Ceph blog`_ too.


CPU
===

CephFS metadata servers (MDS) are CPU-intensive, so they should have
quad-core (or better) CPUs and high clock rates (GHz). OSD nodes need
enough processing power to run the RADOS service, to calculate data
placement with CRUSH, to replicate data, and to maintain their own copies
of the cluster map.

The requirements of one Ceph cluster are not the same as the requirements of
another, but here are some general guidelines.

In earlier versions of Ceph, we would make hardware recommendations based on
the number of cores per OSD, but the cores-per-OSD metric is no longer as
useful as the number of cycles per IOP and the number of IOPS per OSD.
For example, with NVMe drives, Ceph can easily utilize five or six cores on
real clusters and up to about fourteen cores on single OSDs in isolation.
Cores per OSD are therefore no longer as pressing a concern as they once
were. When selecting hardware, select for IOPS per core.

Monitor nodes and manager nodes have no heavy CPU demands and require only
modest processors. If your host machines will run CPU-intensive processes in
addition to Ceph daemons, make sure that you have enough processing power to
run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
one example of a CPU-intensive process.) We recommend that you run
non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
not your monitor and manager nodes) in order to avoid resource contention.

RAM
===

Generally, more RAM is better. Monitor / manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs, 128GB
is a reasonable target. There is a memory target for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB per BlueStore OSD
is advised.
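
As a back-of-the-envelope sketch (the OSD count and OS margin here are
illustrative assumptions, not recommendations), the arithmetic for a
hypothetical 12-OSD BlueStore host works out as follows:

.. code-block:: console

   # echo "12 OSDs x 8GB = $((12 * 8))GB; add ~16GB for OS and monitoring = $((12 * 8 + 16))GB; round up to 128GB"
   12 OSDs x 8GB = 96GB; add ~16GB for OS and monitoring = 112GB; round up to 128GB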

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage generally scales with the size of the
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32GB suffices. For clusters of up to,
say, 300 OSDs, go with 64GB. For clusters built with (or which will grow to)
even more OSDs you should provision 128GB. You may also want to consider
tuning the following settings:

* ``mon_osd_cache_size``
* ``rocksdb_cache_size``


Metadata servers (ceph-mds)
---------------------------

The metadata daemon's memory utilization depends on how much memory its cache
is configured to consume. We recommend 1GB as a minimum for most systems. See
``mds_cache_memory_limit``.

Memory
======

BlueStore uses its own memory to cache data rather than relying on the
operating system's page cache. With BlueStore you can adjust the amount of
memory that the OSD attempts to consume by changing the ``osd_memory_target``
configuration option.

- Setting the ``osd_memory_target`` below 2GB is typically not
  recommended (Ceph may fail to keep the memory consumption under 2GB and
  this may cause extremely slow performance).

- Setting the memory target between 2GB and 4GB typically works but may result
  in degraded performance: metadata may be read from disk during IO unless the
  active data set is relatively small.

- 4GB is the current default ``osd_memory_target`` size. This default
  was chosen for typical use cases, and is intended to balance memory
  requirements and OSD performance.

- Setting the ``osd_memory_target`` higher than 4GB can improve
  performance when there are many (small) objects or when large (256GB/OSD
  or more) data sets are processed, as shown in the example below.
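
On hosts with ample RAM, the target can be raised at runtime through the
central configuration database. This is a minimal sketch, not a tuned
recommendation; the 8GB value (given in bytes) is illustrative:

.. code-block:: console

   # ceph config set osd osd_memory_target 8589934592
   # ceph config get osd osd_memory_target
   8589934592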

.. important:: OSD memory autotuning is "best effort". Although the OSD may
   unmap memory to allow the kernel to reclaim it, there is no guarantee that
   the kernel will actually reclaim freed memory within a specific time
   frame. This applies especially in older versions of Ceph, where transparent
   huge pages can prevent the kernel from reclaiming memory that was freed from
   fragmented huge pages. Modern versions of Ceph disable transparent huge
   pages at the application level to avoid this, but that does not
   guarantee that the kernel will immediately reclaim unmapped memory. The OSD
   may still at times exceed its memory target. We recommend budgeting
   approximately 20% extra memory on your system to prevent OSDs from going OOM
   (**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
   the kernel reclaiming freed pages. That 20% value might be more or less than
   needed, depending on the exact configuration of the system.

When using the legacy FileStore back end, the operating system's page cache is
used for caching data, so no tuning is normally needed. With FileStore, OSD
memory consumption is related to the number of PGs per daemon in the system.


Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests from multiple daemons for read and
write operations against a single drive can slow performance.

Hard Disk Drives
----------------

OSDs should have plenty of storage drive space for object data. We recommend a
minimum disk drive size of 1 terabyte. Consider the cost-per-gigabyte advantage
of larger disks. We recommend dividing the price of the disk drive by the
number of gigabytes to arrive at a cost per gigabyte, because larger drives may
have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte
hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 =
0.0732). By contrast, a 3 terabyte disk priced at $150.00 has a cost of $0.05
per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the
1 terabyte disks would generally increase the cost per gigabyte by
40%--rendering your cluster substantially less cost efficient.

.. tip:: Running multiple OSDs on a single SAS / SATA drive
   is **NOT** a good idea. NVMe drives, however, can achieve
   improved performance by being split into two or more OSDs.

.. tip:: Running an OSD and a monitor or a metadata server on a single
   drive is also **NOT** a good idea.

.. tip:: With spinning disks, the SATA and SAS interfaces increasingly
   become a bottleneck at larger capacities. See also the `Storage Networking
   Industry Association's Total Cost of Ownership calculator`_.


Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host (with the NVMe
exception noted above). Many "slow OSD" issues (when they are not attributable
to hardware failure) arise from running an operating system and multiple OSDs
on the same drive.

It is technically possible to run multiple Ceph OSD Daemons per SAS / SATA
drive, but this will lead to resource contention and diminish overall
throughput.

To get the best performance out of Ceph, run the following on separate drives:
(1) operating systems, (2) OSD data, and (3) BlueStore db. For more
information on how to effectively use a mix of fast drives and slow drives in
your Ceph cluster, see the `block and block.db`_ section of the BlueStore
Configuration Reference.

Solid State Drives
------------------

Ceph performance can be improved by using solid-state drives (SSDs). This
reduces random access time and reduces latency while accelerating throughput.

SSDs cost more per gigabyte than do hard disk drives, but SSDs often offer
access times that are, at a minimum, 100 times faster than those of hard disk
drives. SSDs avoid hotspot issues and bottleneck issues within busy clusters,
and they may offer better economics when TCO is evaluated holistically.

SSDs do not have moving mechanical parts, so they are not necessarily subject
to the same types of limitations as hard disk drives. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential reads and writes.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration in order to gauge performance.

Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not the only factor to consider when selecting an SSD for
use with Ceph.
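
One quick, illustrative check is a single-threaded synchronous-write test with
``fio``, which approximates a worst-case WAL/journal write pattern. This is a
sketch, not a vendor procedure: ``/dev/sdX`` is a placeholder, and the run
destroys data on that device.

.. code-block:: console

   # fio --name=sync-write-test --filename=/dev/sdX --ioengine=libaio --direct=1 --sync=1 --readwrite=write --blocksize=4k --iodepth=1 --numjobs=1 --runtime=60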

SSDs have historically been cost-prohibitive for object storage, but emerging
QLC drives are closing the gap, offering greater density with lower power
consumption and less power spent on cooling. HDD OSDs may see a significant
performance improvement by offloading WAL+DB onto an SSD.

To get a better sense of the factors that determine the cost of storage, you
might use the `Storage Networking Industry Association's Total Cost of
Ownership calculator`_.

Partition Alignment
~~~~~~~~~~~~~~~~~~~

When using SSDs with Ceph, make sure that your partitions are properly aligned.
Improperly aligned partitions suffer slower data transfer speeds than do
properly aligned partitions. For more information about proper partition
alignment and example commands that show how to align partitions properly, see
`Werner Fischer's blog post on partition alignment`_.

CephFS Metadata Segregation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

One way that Ceph accelerates CephFS file system performance is by segregating
the storage of CephFS metadata from the storage of the CephFS file contents.
Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
have to create a pool for CephFS metadata, but you can create a CRUSH map
hierarchy for your CephFS metadata pool that points only to SSD storage media.
See :ref:`CRUSH Device Class` for details.


Controllers
-----------

Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection of HBAs to ensure that they do not create a
performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
backup can substantially increase hardware and maintenance costs. Some RAID
HBAs can be configured with an IT-mode "personality".

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.


Benchmarking
------------

BlueStore opens block devices with O_DIRECT and issues fsync frequently to
ensure that data is safely persisted to media. You can evaluate a drive's
low-level write performance using ``fio``. For example, 4kB random write
performance is measured as follows (``/dev/sdX`` is a placeholder; the run
overwrites data on the device):

.. code-block:: console

   # fio --name=randwrite --filename=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=300

Write Caches
------------

Enterprise SSDs and HDDs normally include power loss protection features which
use multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes -- a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.

These two modes are selected by either "enabling" or "disabling" the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through".

The default configuration (normally caching enabled) may not be optimal:
disabling the write cache may dramatically improve OSD performance, with
higher IOPS and lower commit_latency.

Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and persist the optimal cache configuration for their
devices.
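
A hedged A/B comparison might look like the following sketch; ``/dev/sdX`` is
a placeholder, and both runs overwrite data on the device:

.. code-block:: console

   # hdparm -W1 /dev/sdX    # volatile cache on ("write back")
   # fio --name=cache-on --filename=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=60
   # hdparm -W0 /dev/sdX    # volatile cache off ("write through")
   # fio --name=cache-off --filename=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=60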

The cache configuration can be queried with ``hdparm``, ``sdparm``,
``smartctl`` or by reading the values in ``/sys/class/scsi_disk/*/cache_type``,
for example:

.. code-block:: console

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching = 1 (on)

   # sdparm --get WCE /dev/sda
       /dev/sda: ATA  TOSHIBA MG07ACA1  0101
   WCE         1  [cha: y]
   # smartctl -g wcache /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   Write cache is:   Enabled

   # cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
   write back

The write cache can be disabled with those same tools:

.. code-block:: console

   # hdparm -W0 /dev/sda

   /dev/sda:
    setting drive write-caching to 0 (off)
    write-caching = 0 (off)

   # sdparm --clear WCE /dev/sda
       /dev/sda: ATA  TOSHIBA MG07ACA1  0101
   # smartctl -s wcache,off /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   === START OF ENABLE/DISABLE COMMANDS SECTION ===
   Write cache disabled

Normally, disabling the cache using ``hdparm``, ``sdparm``, or ``smartctl``
results in the cache_type changing automatically to "write through". If this is
not the case, you can try setting it directly as follows. (Note that setting
``cache_type`` this way also changes the device's caching behavior, and
persists only until the next reboot; the udev rules shown below reapply the
setting at boot):

.. code-block:: console

   # echo "write through" > /sys/class/scsi_disk/0\:0\:0\:0/cache_type

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching = 0 (off)

.. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device
   cache_types to "write through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

.. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device
   cache_types to "write through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through-el7.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", RUN+="/bin/sh -c 'echo write through > /sys/class/scsi_disk/$kernel/cache_type'"

.. tip:: The ``sdparm`` utility can be used to view/change the volatile write
   cache on several devices at once:

   .. code-block:: console

      # sdparm --get WCE /dev/sd*
          /dev/sda: ATA  TOSHIBA MG07ACA1  0101
      WCE         0  [cha: y]
          /dev/sdb: ATA  TOSHIBA MG07ACA1  0101
      WCE         0  [cha: y]
      # sdparm --clear WCE /dev/sd*
          /dev/sda: ATA  TOSHIBA MG07ACA1  0101
          /dev/sdb: ATA  TOSHIBA MG07ACA1  0101

Additional Considerations
-------------------------

You will typically run multiple OSDs per host, but you should ensure that the
aggregate throughput of your OSD drives doesn't exceed the network bandwidth
required to service a client's need to read or write data (a
back-of-the-envelope check follows below). You should also consider what
percentage of the overall data the cluster stores on each host. If the
percentage on a particular host is large and the host fails, it can lead to
problems such as exceeding the ``full ratio``, which causes Ceph to halt
operations as a safety precaution that prevents data loss.

When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.
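
As a rough check of aggregate drive bandwidth against host network capacity
(the per-drive figure below is an assumed value for illustration, not a
measurement):

.. code-block:: console

   # echo "12 OSDs x 150 MB/s = $((12 * 150)) MB/s = ~$((12 * 150 * 8 / 1000)) Gb/s, more than one 10 Gb/s link"
   12 OSDs x 150 MB/s = 1800 MB/s = ~14 Gb/s, more than one 10 Gb/s link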


Networks
========

Provision at least 10 Gb/s networking in your racks.

Speed
-----

It takes three hours to replicate 1 TB of data across a 1 Gb/s network and
thirty hours to replicate 10 TB. By contrast, it takes only twenty minutes to
replicate 1 TB across a 10 Gb/s network, and only one hour to replicate 10 TB.
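
These figures build in a margin for protocol overhead; the raw wire-speed
lower bounds are easy to check. A sketch, using 1 TB = 8000 gigabits:

.. code-block:: console

   # echo "1 Gb/s: $((8000 / 60)) min minimum; 10 Gb/s: $((8000 / 10 / 60)) min minimum"
   1 Gb/s: 133 min minimum; 10 Gb/s: 13 min minimum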

Cost
----

The larger the Ceph cluster, the more common OSD failures will be.
The faster that a placement group (PG) can recover from a ``degraded`` state to
an ``active + clean`` state, the better. Notably, fast recovery minimizes
the likelihood of multiple, overlapping failures that can cause data to become
temporarily unavailable or even lost. Of course, when provisioning your
network, you will have to balance price against performance.

Some deployment tools employ VLANs to make hardware and network cabling more
manageable. VLANs that use the 802.1q protocol require VLAN-capable NICs and
switches. The added expense of this hardware may be offset by the operational
cost savings on network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10 Gb/s Ethernet or better; as of
2022, 40 Gb/s or 25/50/100 Gb/s networking is common for production clusters.

Top-of-rack (TOR) switches also need fast and redundant uplinks to
spine switches / routers, often at least 40 Gb/s.


Baseboard Management Controller (BMC)
-------------------------------------

Your server chassis should have a Baseboard Management Controller (BMC).
Well-known examples are iDRAC (Dell), CIMC (Cisco UCS), and iLO (HPE).
Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
network for security and administration. Hypervisor SSH access, VM image
uploads, OS image installs, management sockets, etc. can impose significant
loads on a network. Running three networks may seem like overkill, but each
traffic path represents a potential capacity, throughput and/or performance
bottleneck that you should carefully consider before deploying a large-scale
data cluster.


Failure Domains
===============

A failure domain is any failure that prevents access to one or more OSDs:
a stopped daemon on a host, a disk failure, an OS crash, a malfunctioning NIC,
a failed power supply, a network outage, a power outage, and so forth. When
planning out your hardware needs, you must balance the temptation to reduce
costs by placing too many responsibilities into too few failure domains
against the added costs of isolating every potential failure domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Minimum Recommended                     |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1 core minimum                        |
|              |                | - 1 core per 200-500 MB/s               |
|              |                | - 1 core per 1000-3000 IOPS             |
|              |                |                                         |
|              |                | * Results are before replication.       |
|              |                | * Results may vary with different       |
|              |                |   CPU models and Ceph features          |
|              |                |   (erasure coding, compression, etc.).  |
|              |                | * ARM processors specifically may       |
|              |                |   require additional cores.             |
|              |                | * Actual performance depends on many    |
|              |                |   factors including drives, net, and    |
|              |                |   client throughput and latency.        |
|              |                |   Benchmarking is highly recommended.   |
|              +----------------+-----------------------------------------+
|              | RAM            | - 4GB+ per daemon (more is better)      |
|              |                | - 2-4GB often functions (may be slow)   |
|              |                | - Less than 2GB not recommended         |
|              +----------------+-----------------------------------------+
|              | Volume Storage | 1x storage drive per daemon             |
|              +----------------+-----------------------------------------+
|              | DB/WAL         | 1x SSD partition per daemon (optional)  |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1GbE+ NICs (10GbE+ recommended)      |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2-4GB+ per daemon                       |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 60 GB per daemon                        |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1GbE+ NICs                           |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2GB+ per daemon                         |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 MB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1GbE+ NICs                           |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD with a single disk, create a
   partition for your volume storage that is separate from the partition
   containing the OS. Generally, we recommend separate disks for the
   OS and the volume storage.



.. _block and block.db: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#block-and-block-db
.. _Ceph blog: https://ceph.com/community/blog/
.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations
.. _Storage Networking Industry Association's Total Cost of Ownership calculator: https://www.snia.org/forums/cmsi/programs/TCOcalc
.. _Werner Fischer's blog post on partition alignment: https://www.thomas-krenn.com/en/wiki/Partition_Alignment_detailed_explanation
-- cgit v1.2.3