From e6918187568dbd01842d8d1d2c808ce16a894239 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Sun, 21 Apr 2024 13:54:28 +0200
Subject: Adding upstream version 18.2.2.

Signed-off-by: Daniel Baumann
---
 doc/start/hardware-recommendations.rst | 623 +++++++++++++++++++++++++++++++++
 1 file changed, 623 insertions(+)
 create mode 100644 doc/start/hardware-recommendations.rst

diff --git a/doc/start/hardware-recommendations.rst b/doc/start/hardware-recommendations.rst
new file mode 100644
index 000000000..a63b5a457
--- /dev/null
+++ b/doc/start/hardware-recommendations.rst
@@ -0,0 +1,623 @@
.. _hardware-recommendations:

==========================
 hardware recommendations
==========================

Ceph is designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters flexible and economically feasible.
When planning your cluster's hardware, you will need to balance a number
of considerations, including failure domains, cost, and performance.
Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using separate hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, Kubernetes, etc.).

The requirements of one Ceph cluster are not the same as the requirements of
another, but below are some general guidelines.

.. tip:: Check out the `Ceph blog`_ too.

CPU
===

CephFS Metadata Servers (MDS) are CPU-intensive. They are
single-threaded and perform best with CPUs with a high clock rate (GHz). MDS
servers do not need a large number of CPU cores unless they are also hosting other
services, such as SSD OSDs for the CephFS metadata pool.
OSD nodes need enough processing power to run the RADOS service, to calculate data
placement with CRUSH, to replicate data, and to maintain their own copies of the
cluster map.

With earlier releases of Ceph, we would make hardware recommendations based on
the number of cores per OSD, but this cores-per-OSD metric is no longer as
useful as the number of cycles per IOP and the number of IOPS per OSD.
For example, with NVMe OSD drives, Ceph can easily utilize five or six cores on real
clusters and up to about fourteen cores on single OSDs in isolation. So cores
per OSD are no longer as pressing a concern as they were. When selecting
hardware, select for IOPS per core.

.. tip:: When we speak of CPU *cores*, we mean *threads* when hyperthreading
   is enabled. Hyperthreading is usually beneficial for Ceph servers.

Monitor nodes and Manager nodes do not have heavy CPU demands and require only
modest processors. If your hosts will run CPU-intensive processes in
addition to Ceph daemons, make sure that you have enough processing power to
run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
one example of a CPU-intensive process.) We recommend that you run
non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
not your Monitor and Manager nodes) in order to avoid resource contention.
If your cluster deploys the Ceph Object Gateway, RGW daemons may co-reside
with your Mon and Manager services if the nodes have sufficient resources.
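To see how many logical CPU threads a host actually exposes, and whether
hyperthreading is active, ``lscpu`` is a quick check. The output below is
illustrative of a hypothetical host with 16 physical cores and two threads
per core:

.. code-block:: console

   # lscpu | grep -E '^CPU\(s\):|Thread\(s\) per core'
   CPU(s):                32
   Thread(s) per core:    2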
RAM
===

Generally, more RAM is better. Monitor / Manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs, 128GB
is advised.

.. tip:: When we speak of RAM and storage requirements, we often describe
   the needs of a single daemon of a given type. A given server as
   a whole will thus need at least the sum of the needs of the
   daemons that it hosts as well as resources for logs and other operating
   system components. Keep in mind that a server's need for RAM
   and storage will be greater at startup and when components
   fail or are added and the cluster rebalances. In other words,
   allow headroom past what you might see used during a calm period
   on a small initial cluster footprint.

There is an :confval:`osd_memory_target` setting for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB *per BlueStore OSD* is thus
advised.

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage scales with the size of the
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32GB suffices. For clusters of up to,
say, 300 OSDs, go with 64GB. For clusters built with (or which will grow to)
even more OSDs you should provision 128GB. You may also want to consider
tuning the following settings:

* :confval:`mon_osd_cache_size`
* :confval:`rocksdb_cache_size`


Metadata servers (ceph-mds)
---------------------------

CephFS metadata daemon memory utilization depends on the configured size of
its cache. We recommend 1 GB as a minimum for most systems. See
:confval:`mds_cache_memory_limit`.


Memory
======

BlueStore uses its own memory to cache data rather than relying on the
operating system's page cache. In BlueStore you can adjust the amount of memory
that the OSD attempts to consume by changing the :confval:`osd_memory_target`
configuration option.

- Setting the :confval:`osd_memory_target` below 2GB is not
  recommended. Ceph may fail to keep the memory consumption under 2GB and
  extremely slow performance is likely.

- Setting the memory target between 2GB and 4GB typically works but may result
  in degraded performance: metadata may need to be read from disk during IO
  unless the active data set is relatively small.

- 4GB is the current default value for :confval:`osd_memory_target`. This default
  was chosen for typical use cases, and is intended to balance RAM cost and
  OSD performance.

- Setting the :confval:`osd_memory_target` higher than 4GB can improve
  performance when there are many (small) objects or when large (256GB/OSD
  or more) data sets are processed. This is especially true with fast
  NVMe OSDs.
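As a sketch, the target can be raised at runtime with the ``ceph config``
CLI; the 8 GiB figure below is illustrative, and the value is expressed in
bytes:

.. code-block:: console

   # ceph config set osd osd_memory_target 8589934592
   # ceph config get osd.0 osd_memory_target
   8589934592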
.. important:: OSD memory management is "best effort". Although the OSD may
   unmap memory to allow the kernel to reclaim it, there is no guarantee that
   the kernel will actually reclaim freed memory within a specific time
   frame. This applies especially in older versions of Ceph, where transparent
   huge pages can prevent the kernel from reclaiming memory that was freed from
   fragmented huge pages. Modern versions of Ceph disable transparent huge
   pages at the application level to avoid this, but that does not
   guarantee that the kernel will immediately reclaim unmapped memory. The OSD
   may still at times exceed its memory target. We recommend budgeting
   at least 20% extra memory on your system to prevent OSDs from going OOM
   (**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
   the kernel reclaiming freed pages. That 20% value might be more or less than
   needed, depending on the exact configuration of the system.

.. tip:: Configuring the operating system with swap to provide additional
   virtual memory for daemons is not advised for modern systems. Doing so
   may result in lower performance, and your Ceph cluster may well be
   happier with a daemon that crashes than with one that slows to a crawl.

When using the legacy FileStore back end, the OS page cache was used for
caching data, so tuning was not normally needed; FileStore OSD memory
consumption was related to the number of PGs per daemon in the system.


Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests from multiple daemons for read and
write operations against a single drive can impact performance.

OSDs require substantial storage drive space for RADOS data. We recommend a
minimum drive size of 1 terabyte. OSD drives much smaller than one terabyte
use a significant fraction of their capacity for metadata, and drives smaller
than 100 gigabytes will not be effective at all.

It is *strongly* suggested that (enterprise-class) SSDs are provisioned for, at a
minimum, Ceph Monitor and Ceph Manager hosts, as well as CephFS Metadata Server
metadata pools and Ceph Object Gateway (RGW) index pools, even if HDDs are to
be provisioned for bulk OSD data.

To get the best performance out of Ceph, provision the following on separate
drives:

* The operating system
* OSD data
* BlueStore WAL+DB

For more information on how to effectively use a mix of fast drives and slow
drives in your Ceph cluster, see the `block and block.db`_ section of the
BlueStore Configuration Reference.

Hard Disk Drives
----------------

Consider carefully the cost-per-gigabyte advantage
of larger disks. We recommend dividing the price of the disk drive by the
number of gigabytes to arrive at a cost per gigabyte, because larger drives may
have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte
hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 =
0.0732). By contrast, a 3 terabyte disk priced at $150.00 has a cost of $0.05
per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the
1 terabyte disks would generally increase the cost per gigabyte by
50%--rendering your cluster substantially less cost efficient.

.. tip:: Hosting multiple OSDs on a single SAS / SATA HDD
   is **NOT** a good idea.

.. tip:: Hosting an OSD with monitor, manager, or MDS data on a single
   drive is also **NOT** a good idea.

.. tip:: With spinning disks, the SATA and SAS interface increasingly
   becomes a bottleneck at larger capacities. See also the `Storage Networking
   Industry Association's Total Cost of Ownership calculator`_.
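When auditing a host's drive inventory, ``lsblk`` reports whether the kernel
sees each device as rotational (``ROTA`` 1) or solid-state (``ROTA`` 0). The
device and model names below are hypothetical:

.. code-block:: console

   # lsblk -d -o NAME,ROTA,SIZE,MODEL
   NAME ROTA  SIZE MODEL
   sda     1 16.4T TOSHIBA MG08ACA16TE
   sdb     0  3.5T SAMSUNG MZ7LH3T8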
Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host.
Many "slow OSD" issues (when they are not attributable to hardware failure)
arise from running an operating system and multiple OSDs on the same drive.
Also be aware that today's 22TB HDD uses the same SATA interface as a
3TB HDD from ten years ago: more than seven times the data to squeeze
through the same interface. For this reason, when using HDDs for
OSDs, drives larger than 8TB may be best suited for storage of large
files / objects that are not at all performance-sensitive.


Solid State Drives
------------------

Ceph performance is much improved when using solid-state drives (SSDs): they
reduce random access time and latency while increasing throughput.

SSDs cost more per gigabyte than do HDDs but SSDs often offer
access times that are, at a minimum, 100 times faster than HDDs.
SSDs avoid hotspot issues and bottleneck issues within busy clusters, and
they may offer better economics when TCO is evaluated holistically. Notably,
the amortized drive cost for a given number of IOPS is much lower with SSDs
than with HDDs. SSDs do not suffer rotational or seek latency, and in addition
to improved client performance they substantially improve the speed and
reduce the client impact of cluster changes, including rebalancing when OSDs
or Monitors are added, removed, or fail.

SSDs do not have moving mechanical parts, so they are not subject
to many of the limitations of HDDs. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential and random reads and writes.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration in order to gauge performance.

Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not the only factor to consider when selecting SSDs for
use with Ceph. Bargain SSDs are often a false economy: they may experience
"cliffing", which means that after an initial burst, sustained performance
declines considerably once a limited cache is filled. Consider also durability:
a drive rated for 0.3 Drive Writes Per Day (DWPD or equivalent) may be fine for
OSDs dedicated to certain types of sequentially-written, read-mostly data, but
is not a good choice for Ceph Monitor duty. Enterprise-class SSDs are best
for Ceph: they almost always feature power loss protection (PLP) and do
not suffer the dramatic cliffing that client (desktop) models may experience.
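As a quick sanity check on a candidate or in-service drive, SMART data
reports how much of the rated endurance has been consumed. The device names
and output below are hypothetical:

.. code-block:: console

   # smartctl -a /dev/sda | grep -i -E 'percent.*used|wear'
   # nvme smart-log /dev/nvme0 | grep percentage_used
   percentage_used         : 3%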
When using a single (or mirrored pair) SSD for both operating system boot
and Ceph Monitor / Manager purposes, a minimum capacity of 256GB is advised
and at least 480GB is recommended. A drive model rated at 1+ DWPD (or the
equivalent in TBW (TeraBytes Written)) is suggested. However, for a given write
workload, a larger drive than technically required will provide more endurance
because it effectively has greater overprovisioning. We stress that
enterprise-class drives are best for production use, as they feature power
loss protection and increased durability compared to client (desktop) SKUs
that are intended for much lighter and intermittent duty cycles.

SSDs have historically been cost prohibitive for object storage, but
QLC SSDs are closing the gap, offering greater density with lower power
consumption and less power spent on cooling. Also, HDD OSDs may see a
significant write latency improvement by offloading WAL+DB onto an SSD.
Many Ceph OSD deployments do not require an SSD with greater endurance than
1 DWPD (aka "read-optimized"). "Mixed-use" SSDs in the 3 DWPD class are
often overkill for this purpose and cost significantly more.

To get a better sense of the factors that determine the total cost of storage,
you might use the `Storage Networking Industry Association's Total Cost of
Ownership calculator`_.

Partition Alignment
~~~~~~~~~~~~~~~~~~~

When using SSDs with Ceph, make sure that your partitions are properly aligned.
Improperly aligned partitions suffer slower data transfer speeds than do
properly aligned partitions. For more information about proper partition
alignment and example commands that show how to align partitions properly, see
`Werner Fischer's blog post on partition alignment`_.

CephFS Metadata Segregation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

One way that Ceph accelerates CephFS file system performance is by separating
the storage of CephFS metadata from the storage of the CephFS file contents.
Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
have to manually create a pool for CephFS metadata, but you can create a CRUSH map
hierarchy for your CephFS metadata pool that includes only SSD storage media.
See :ref:`CRUSH Device Class` for details.


Controllers
-----------

Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection of HBAs to ensure that they do not create a
performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
backup can substantially increase hardware and maintenance costs. Many RAID
HBAs can be configured with an IT-mode "personality" or "JBOD mode" for
streamlined operation.

You do not need an RoC (RAID-capable) HBA. ZFS or Linux MD software mirroring
serve well for boot volume durability. When using SAS or SATA data drives,
forgoing HBA RAID capabilities can reduce the gap between HDD and SSD
media cost. Moreover, when using NVMe SSDs, you do not need *any* HBA. This
additionally reduces the HDD vs SSD cost gap when the system as a whole is
considered. The initial cost of a fancy RAID HBA plus onboard cache plus
battery backup (BBU or supercapacitor) can easily exceed 1000 US
dollars even after discounts - a sum that goes a long way toward SSD cost parity.
An HBA-free system may also cost hundreds of US dollars less every year if one
purchases an annual maintenance contract or extended warranty.

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.
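To inventory the storage controllers in a host, ``lspci`` is a reasonable
starting point; SAS HBAs running IT-mode firmware typically report a plain
SAS class, while IR/RAID firmware often reports a RAID class. The devices
shown below are hypothetical:

.. code-block:: console

   # lspci -nn | grep -i -E 'raid|sas|non-volatile'
   01:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS3008
   02:00.0 Non-Volatile memory controller [0108]: Samsung NVMe SSD Controller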
Benchmarking
------------

BlueStore opens storage devices with ``O_DIRECT`` and issues ``fsync()``
frequently to ensure that data is safely persisted to media. You can evaluate a
drive's low-level write performance using ``fio``. For example, 4kB random write
performance is measured as follows:

.. code-block:: console

   # fio --name=write_test --filename=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=300

Write Caches
------------

Enterprise SSDs and HDDs normally include power loss protection features which
ensure data durability when power is lost while operating, and
use multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes -- a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.

These two modes are selected by either "enabling" or "disabling" the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through".

The default configuration (usually: caching is enabled) may not be optimal:
disabling this write cache may dramatically increase OSD IOPS and decrease
commit latency.

Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and persist the optimal cache configuration for their
devices.

The cache configuration can be queried with ``hdparm``, ``sdparm``,
``smartctl`` or by reading the values in ``/sys/class/scsi_disk/*/cache_type``,
for example:

.. code-block:: console

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching =  1 (on)

   # sdparm --get WCE /dev/sda
       /dev/sda: ATA       TOSHIBA MG07ACA1  0101
   WCE           1  [cha: y]
   # smartctl -g wcache /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   Write cache is:   Enabled

   # cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
   write back

The write cache can be disabled with those same tools:

.. code-block:: console

   # hdparm -W0 /dev/sda

   /dev/sda:
    setting drive write-caching to 0 (off)
    write-caching =  0 (off)

   # sdparm --clear WCE /dev/sda
       /dev/sda: ATA       TOSHIBA MG07ACA1  0101
   # smartctl -s wcache,off /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   === START OF ENABLE/DISABLE COMMANDS SECTION ===
   Write cache disabled

In most cases, disabling this cache using ``hdparm``, ``sdparm``, or ``smartctl``
results in the cache_type changing automatically to "write through". If this is
not the case, you can try setting it directly as follows. (Users should ensure
that setting cache_type also correctly persists the caching mode of the device
until the next reboot as some drives require this to be repeated at every boot):

.. code-block:: console

   # echo "write through" > /sys/class/scsi_disk/0\:0\:0\:0/cache_type

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching =  0 (off)
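NVMe SSDs expose an analogous volatile write cache through an optional
controller feature (feature ID ``0x06`` in the NVMe specification). On drives
that implement it (many enterprise models do not, because their cache is
already protected against power loss), ``nvme-cli`` can query and clear it.
A sketch, with a hypothetical device name:

.. code-block:: console

   # nvme get-feature /dev/nvme0 -f 0x06 -H
   # nvme set-feature /dev/nvme0 -f 0x06 -v 0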
.. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device
   cache_types to "write through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

.. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device
   cache_types to "write through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through-el7.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", RUN+="/bin/sh -c 'echo write through > /sys/class/scsi_disk/$kernel/cache_type'"

.. tip:: The ``sdparm`` utility can be used to view/change the volatile write
   cache on several devices at once:

   .. code-block:: console

      # sdparm --get WCE /dev/sd*
          /dev/sda: ATA       TOSHIBA MG07ACA1  0101
      WCE           0  [cha: y]
          /dev/sdb: ATA       TOSHIBA MG07ACA1  0101
      WCE           0  [cha: y]
      # sdparm --clear WCE /dev/sd*
          /dev/sda: ATA       TOSHIBA MG07ACA1  0101
          /dev/sdb: ATA       TOSHIBA MG07ACA1  0101

Additional Considerations
-------------------------

Ceph operators typically provision multiple OSDs per host, but you should
ensure that the aggregate throughput of your OSD drives doesn't exceed the
network bandwidth required to service a client's read and write operations.
You should also consider each host's percentage of the cluster's overall
capacity. If the percentage located on a particular host is large and the host
fails, it can lead to problems such as recovery causing OSDs to exceed the
``full ratio``, which in turn causes Ceph to halt operations to prevent data
loss.

When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.


Networks
========

Provision at least 10 Gb/s networking in your datacenter, both among Ceph
hosts and between clients and your Ceph cluster. Network link active/active
bonding across separate network switches is strongly recommended both for
increased throughput and for tolerance of network failures and maintenance.
Take care that your bonding hash policy distributes traffic across links.
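To confirm that an existing bond (here the hypothetical ``bond0``) is using
LACP and a load-spreading hash policy, inspect the kernel's bonding state;
the output shown is illustrative:

.. code-block:: console

   # grep -E 'Bonding Mode|Hash Policy' /proc/net/bonding/bond0
   Bonding Mode: IEEE 802.3ad Dynamic link aggregation
   Transmit Hash Policy: layer3+4 (1)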
Speed
-----

It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it
takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes only
twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes
only three hours to replicate 10 TB across a 10 Gb/s network.

Note that a 40 Gb/s network link is effectively four 10 Gb/s channels in
parallel, and that a 100 Gb/s network link is effectively four 25 Gb/s channels
in parallel. Thus, and perhaps somewhat counterintuitively, an individual
packet on a 25 Gb/s network has slightly lower latency compared to a 40 Gb/s
network.


Cost
----

The larger the Ceph cluster, the more common OSD failures will be.
The faster that a placement group (PG) can recover from a degraded state to
an ``active + clean`` state, the better. Notably, fast recovery minimizes
the likelihood of multiple, overlapping failures that can cause data to become
temporarily unavailable or even lost. Of course, when provisioning your
network, you will have to balance price against performance.

Some deployment tools employ VLANs to make hardware and network cabling more
manageable. VLANs that use the 802.1q protocol require VLAN-capable NICs and
switches. The added expense of this hardware may be offset by the operational
cost savings on network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10 Gb/s Ethernet or better; as of
2022, 25/50/100 Gb/s networking is increasingly common for production clusters.

Top-of-rack (TOR) switches also need fast and redundant uplinks to
core / spine network switches or routers, often at least 40 Gb/s.


Baseboard Management Controller (BMC)
-------------------------------------

Your server chassis should have a Baseboard Management Controller (BMC).
Well-known examples are iDRAC (Dell), CIMC (Cisco UCS), and iLO (HPE).
Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
network for security and administration. Hypervisor SSH access, VM image uploads,
OS image installs, management sockets, etc. can impose significant loads on a network.
Running multiple networks may seem like overkill, but each traffic path represents
a potential capacity, throughput and/or performance bottleneck that you should
carefully consider before deploying a large scale data cluster.

Additionally, BMCs as of 2023 rarely sport network connections faster than 1 Gb/s,
so dedicated and inexpensive 1 Gb/s switches for BMC administrative traffic
may reduce costs by wasting fewer expensive ports on faster host switches.


Failure Domains
===============

A failure domain can be thought of as any component loss that prevents access to
one or more OSDs or other Ceph daemons. Examples include a stopped daemon on a
host, a storage drive failure, an OS crash, a malfunctioning NIC, a failed power
supply, a network outage, or a power outage. When planning your hardware
deployment, you must balance the risk of reducing costs by placing too many
responsibilities into too few failure domains against the added costs of
isolating every potential failure domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware. As
we noted above: when we speak of CPU *cores*, we mean *threads* when
hyperthreading (HT) is enabled. Each modern physical x64 CPU core typically
provides two logical CPU threads; other CPU architectures may vary.

Bear in mind that many factors influence resource choices. The
minimum resources that suffice for one purpose will not necessarily suffice for
another. A sandbox cluster with one OSD built on a laptop with VirtualBox or on
a trio of Raspberry Pis will get by with fewer resources than a production
deployment with a thousand OSDs serving five thousand RBD clients. The
classic Fisher-Price PXL 2000 captures video, as does an IMAX or RED camera.
One would not expect the former to do the job of the latter. We especially
cannot stress enough the criticality of using enterprise-quality storage
media for production workloads.

Additional insights into resource planning for production clusters are
found above and elsewhere within this documentation.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Bare Minimum and Recommended            |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1 core minimum, 2 recommended         |
|              |                | - 1 core per 200-500 MB/s throughput    |
|              |                | - 1 core per 1000-3000 IOPS             |
|              |                |                                         |
|              |                | * Results are before replication.       |
|              |                | * Results may vary across CPU and drive |
|              |                |   models and Ceph configuration:        |
|              |                |   (erasure coding, compression, etc)    |
|              |                | * ARM processors specifically may       |
|              |                |   require more cores for performance.   |
|              |                | * SSD OSDs, especially NVMe, will       |
|              |                |   benefit from additional cores per OSD.|
|              |                | * Actual performance depends on many    |
|              |                |   factors including drives, net, and    |
|              |                |   client throughput and latency.        |
|              |                |   Benchmarking is highly recommended.   |
|              +----------------+-----------------------------------------+
|              | RAM            | - 4GB+ per daemon (more is better)      |
|              |                | - 2-4GB may function but may be slow    |
|              |                | - Less than 2GB is not recommended      |
|              +----------------+-----------------------------------------+
|              | Storage Drives | 1x storage drive per OSD                |
|              +----------------+-----------------------------------------+
|              | DB/WAL         | 1x SSD partition per HDD OSD            |
|              | (optional)     | 4-5x HDD OSDs per DB/WAL SATA SSD       |
|              |                | <= 10 HDD OSDs per DB/WAL NVMe SSD      |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (bonded 10+ Gb/s recommended)  |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 5GB+ per daemon (large / production     |
|              |                | clusters need more)                     |
|              +----------------+-----------------------------------------+
|              | Storage        | 100 GB per daemon, SSD is recommended   |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2GB+ per daemon (more for production)   |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 GB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD node with a single storage drive, create a
   partition for your OSD that is separate from the partition
   containing the OS. We recommend separate drives for the
   OS and for OSD storage.



.. _block and block.db: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#block-and-block-db
.. _Ceph blog: https://ceph.com/community/blog/
.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations
.. _Storage Networking Industry Association's Total Cost of Ownership calculator: https://www.snia.org/forums/cmsi/programs/TCOcalc
.. _Werner Fischer's blog post on partition alignment: https://www.thomas-krenn.com/en/wiki/Partition_Alignment_detailed_explanation
--
cgit v1.2.3