From e6918187568dbd01842d8d1d2c808ce16a894239 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Sun, 21 Apr 2024 13:54:28 +0200
Subject: Adding upstream version 18.2.2.

Signed-off-by: Daniel Baumann
---
 doc/start/hardware-recommendations.rst | 623 +++++++++++++++++++++++++++++++++
 1 file changed, 623 insertions(+)
 create mode 100644 doc/start/hardware-recommendations.rst

diff --git a/doc/start/hardware-recommendations.rst b/doc/start/hardware-recommendations.rst
new file mode 100644
index 000000000..a63b5a457
--- /dev/null
+++ b/doc/start/hardware-recommendations.rst
@@ -0,0 +1,623 @@
.. _hardware-recommendations:

==========================
 hardware recommendations
==========================

Ceph is designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters flexible and economically feasible.
When planning your cluster's hardware, you will need to balance a number
of considerations, including failure domains, cost, and performance.
Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using separate hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, Kubernetes, etc.).

The requirements of one Ceph cluster are not the same as the requirements of
another, but below are some general guidelines.

.. tip:: Check out the `Ceph blog`_ too.

CPU
===

CephFS Metadata Servers (MDS) are CPU-intensive. They are
single-threaded and perform best with CPUs with a high clock rate (GHz). MDS
servers do not need a large number of CPU cores unless they are also hosting other
services, such as SSD OSDs for the CephFS metadata pool.
OSD nodes need enough processing power to run the RADOS service, to calculate data
placement with CRUSH, to replicate data, and to maintain their own copies of the
cluster map.

With earlier releases of Ceph, we would make hardware recommendations based on
the number of cores per OSD, but this cores-per-OSD metric is no longer as
useful as the number of cycles per IOP and the number of IOPS per OSD.
For example, with NVMe OSD drives, Ceph can easily utilize five or six cores on real
clusters and up to about fourteen cores on single OSDs in isolation. So cores
per OSD are no longer as pressing a concern as they were. When selecting
hardware, select for IOPS per core.

.. tip:: When we speak of CPU *cores*, we mean *threads* when hyperthreading
   is enabled. Hyperthreading is usually beneficial for Ceph servers.

Monitor nodes and Manager nodes do not have heavy CPU demands and require only
modest processors. If your hosts will run CPU-intensive processes in
addition to Ceph daemons, make sure that you have enough processing power to
run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
one example of a CPU-intensive process.) We recommend that you run
non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
not your Monitor and Manager nodes) in order to avoid resource contention.
If your cluster deploys the Ceph Object Gateway, RGW daemons may co-reside
with your Mon and Manager services if the nodes have sufficient resources.
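To see how many logical CPU threads a host actually exposes, and whether
hyperthreading is active, ``lscpu`` is a quick check. The output below is
illustrative of a hypothetical host with 16 physical cores and two threads
per core:

.. code-block:: console

   # lscpu | grep -E '^CPU\(s\):|Thread\(s\) per core'
   CPU(s):                32
   Thread(s) per core:    2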
RAM
===

Generally, more RAM is better. Monitor / Manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs, 128GB
is advised.

.. tip:: When we speak of RAM and storage requirements, we often describe
   the needs of a single daemon of a given type. A given server as
   a whole will thus need at least the sum of the needs of the
   daemons that it hosts as well as resources for logs and other operating
   system components. Keep in mind that a server's need for RAM
   and storage will be greater at startup and when components
   fail or are added and the cluster rebalances. In other words,
   allow headroom past what you might see used during a calm period
   on a small initial cluster footprint.

There is an :confval:`osd_memory_target` setting for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB *per BlueStore OSD* is thus
advised.

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage scales with the size of the
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32GB suffices. For clusters of up to,
say, 300 OSDs, go with 64GB. For clusters built with (or which will grow to)
even more OSDs you should provision 128GB. You may also want to consider
tuning the following settings:

* :confval:`mon_osd_cache_size`
* :confval:`rocksdb_cache_size`


Metadata servers (ceph-mds)
---------------------------

CephFS metadata daemon memory utilization depends on the configured size of
its cache. We recommend 1 GB as a minimum for most systems. See
:confval:`mds_cache_memory_limit`.


Memory
======

BlueStore uses its own memory to cache data rather than relying on the
operating system's page cache. In BlueStore you can adjust the amount of memory
that the OSD attempts to consume by changing the :confval:`osd_memory_target`
configuration option.

- Setting the :confval:`osd_memory_target` below 2GB is not
  recommended. Ceph may fail to keep the memory consumption under 2GB and
  extremely slow performance is likely.

- Setting the memory target between 2GB and 4GB typically works but may result
  in degraded performance: metadata may need to be read from disk during IO
  unless the active data set is relatively small.

- 4GB is the current default value for :confval:`osd_memory_target`. This default
  was chosen for typical use cases, and is intended to balance RAM cost and
  OSD performance.

- Setting the :confval:`osd_memory_target` higher than 4GB can improve
  performance when there are many (small) objects or when large (256GB/OSD
  or more) data sets are processed. This is especially true with fast
  NVMe OSDs.
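As a sketch, the target can be raised at runtime with the ``ceph config``
CLI; the 8 GiB figure below is illustrative, and the value is expressed in
bytes:

.. code-block:: console

   # ceph config set osd osd_memory_target 8589934592
   # ceph config get osd.0 osd_memory_target
   8589934592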
.. important:: OSD memory management is "best effort". Although the OSD may
   unmap memory to allow the kernel to reclaim it, there is no guarantee that
   the kernel will actually reclaim freed memory within a specific time
   frame. This applies especially in older versions of Ceph, where transparent
   huge pages can prevent the kernel from reclaiming memory that was freed from
   fragmented huge pages. Modern versions of Ceph disable transparent huge
   pages at the application level to avoid this, but that does not
   guarantee that the kernel will immediately reclaim unmapped memory. The OSD
   may still at times exceed its memory target. We recommend budgeting
   at least 20% extra memory on your system to prevent OSDs from going OOM
   (**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
   the kernel reclaiming freed pages. That 20% value might be more or less than
   needed, depending on the exact configuration of the system.

.. tip:: Configuring the operating system with swap to provide additional
   virtual memory for daemons is not advised for modern systems. Doing so
   may result in lower performance, and your Ceph cluster may well be
   happier with a daemon that crashes than with one that slows to a crawl.

When using the legacy FileStore back end, the OS page cache was used for
caching data, so tuning was not normally needed; FileStore OSD memory
consumption was related to the number of PGs per daemon in the system.


Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests from multiple daemons for read and
write operations against a single drive can impact performance.

OSDs require substantial storage drive space for RADOS data. We recommend a
minimum drive size of 1 terabyte. OSD drives much smaller than one terabyte
use a significant fraction of their capacity for metadata, and drives smaller
than 100 gigabytes will not be effective at all.

It is *strongly* suggested that (enterprise-class) SSDs are provisioned for, at a
minimum, Ceph Monitor and Ceph Manager hosts, as well as CephFS Metadata Server
metadata pools and Ceph Object Gateway (RGW) index pools, even if HDDs are to
be provisioned for bulk OSD data.

To get the best performance out of Ceph, provision the following on separate
drives:

* The operating system
* OSD data
* BlueStore WAL+DB

For more information on how to effectively use a mix of fast drives and slow
drives in your Ceph cluster, see the `block and block.db`_ section of the
BlueStore Configuration Reference.

Hard Disk Drives
----------------

Consider carefully the cost-per-gigabyte advantage
of larger disks. We recommend dividing the price of the disk drive by the
number of gigabytes to arrive at a cost per gigabyte, because larger drives may
have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte
hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 =
0.0732). By contrast, a 3 terabyte disk priced at $150.00 has a cost of $0.05
per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the
1 terabyte disks would generally increase the cost per gigabyte by
50%--rendering your cluster substantially less cost efficient.

.. tip:: Hosting multiple OSDs on a single SAS / SATA HDD
   is **NOT** a good idea.

.. tip:: Hosting an OSD with monitor, manager, or MDS data on a single
   drive is also **NOT** a good idea.

.. tip:: With spinning disks, the SATA and SAS interface increasingly
   becomes a bottleneck at larger capacities. See also the `Storage Networking
   Industry Association's Total Cost of Ownership calculator`_.
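When auditing a host's drive inventory, ``lsblk`` reports whether the kernel
sees each device as rotational (``ROTA`` 1) or solid-state (``ROTA`` 0). The
device and model names below are hypothetical:

.. code-block:: console

   # lsblk -d -o NAME,ROTA,SIZE,MODEL
   NAME ROTA  SIZE MODEL
   sda     1 16.4T TOSHIBA MG08ACA16TE
   sdb     0  3.5T SAMSUNG MZ7LH3T8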
Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host.
Many "slow OSD" issues (when they are not attributable to hardware failure)
arise from running an operating system and multiple OSDs on the same drive.
Also be aware that today's 22TB HDD uses the same SATA interface as a
3TB HDD from ten years ago: more than seven times the data to squeeze
through the same interface. For this reason, when using HDDs for
OSDs, drives larger than 8TB may be best suited for storage of large
files / objects that are not at all performance-sensitive.


Solid State Drives
------------------

Ceph performance is much improved when using solid-state drives (SSDs): they
reduce random access time and latency while increasing throughput.

SSDs cost more per gigabyte than do HDDs but SSDs often offer
access times that are, at a minimum, 100 times faster than HDDs.
SSDs avoid hotspot issues and bottleneck issues within busy clusters, and
they may offer better economics when TCO is evaluated holistically. Notably,
the amortized drive cost for a given number of IOPS is much lower with SSDs
than with HDDs. SSDs do not suffer rotational or seek latency, and in addition
to improved client performance they substantially improve the speed and
reduce the client impact of cluster changes, including rebalancing when OSDs
or Monitors are added, removed, or fail.

SSDs do not have moving mechanical parts, so they are not subject
to many of the limitations of HDDs. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential and random reads and writes.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly
   recommend** reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration in order to gauge performance.

Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not the only factor to consider when selecting SSDs for
use with Ceph. Bargain SSDs are often a false economy: they may experience
"cliffing", which means that after an initial burst, sustained performance
declines considerably once a limited cache is filled. Consider also durability:
a drive rated for 0.3 Drive Writes Per Day (DWPD or equivalent) may be fine for
OSDs dedicated to certain types of sequentially-written, read-mostly data, but
is not a good choice for Ceph Monitor duty. Enterprise-class SSDs are best
for Ceph: they almost always feature power loss protection (PLP) and do
not suffer the dramatic cliffing that client (desktop) models may experience.
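As a quick sanity check on a candidate or in-service drive, SMART data
reports how much of the rated endurance has been consumed. The device names
and output below are hypothetical:

.. code-block:: console

   # smartctl -a /dev/sda | grep -i -E 'percent.*used|wear'
   # nvme smart-log /dev/nvme0 | grep percentage_used
   percentage_used         : 3%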
When using a single (or mirrored pair) SSD for both operating system boot
and Ceph Monitor / Manager purposes, a minimum capacity of 256GB is advised
and at least 480GB is recommended. A drive model rated at 1+ DWPD (or the
equivalent in TBW (TeraBytes Written)) is suggested. However, for a given write
workload, a larger drive than technically required will provide more endurance
because it effectively has greater overprovisioning. We stress that
enterprise-class drives are best for production use, as they feature power
loss protection and increased durability compared to client (desktop) SKUs
that are intended for much lighter and intermittent duty cycles.

SSDs have historically been cost prohibitive for object storage, but
QLC SSDs are closing the gap, offering greater density with lower power
consumption and less power spent on cooling. Also, HDD OSDs may see a
significant write latency improvement by offloading WAL+DB onto an SSD.
Many Ceph OSD deployments do not require an SSD with greater endurance than
1 DWPD (aka "read-optimized"). "Mixed-use" SSDs in the 3 DWPD class are
often overkill for this purpose and cost significantly more.

To get a better sense of the factors that determine the total cost of storage,
you might use the `Storage Networking Industry Association's Total Cost of
Ownership calculator`_.

Partition Alignment
~~~~~~~~~~~~~~~~~~~

When using SSDs with Ceph, make sure that your partitions are properly aligned.
Improperly aligned partitions suffer slower data transfer speeds than do
properly aligned partitions. For more information about proper partition
alignment and example commands that show how to align partitions properly, see
`Werner Fischer's blog post on partition alignment`_.

CephFS Metadata Segregation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

One way that Ceph accelerates CephFS file system performance is by separating
the storage of CephFS metadata from the storage of the CephFS file contents.
Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
have to manually create a pool for CephFS metadata, but you can create a CRUSH map
hierarchy for your CephFS metadata pool that includes only SSD storage media.
See :ref:`CRUSH Device Class` for details.


Controllers
-----------

Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection of HBAs to ensure that they do not create a
performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
backup can substantially increase hardware and maintenance costs. Many RAID
HBAs can be configured with an IT-mode "personality" or "JBOD mode" for
streamlined operation.

You do not need an RoC (RAID-capable) HBA. ZFS or Linux MD software mirroring
serve well for boot volume durability. When using SAS or SATA data drives,
forgoing HBA RAID capabilities can reduce the gap between HDD and SSD
media cost. Moreover, when using NVMe SSDs, you do not need *any* HBA. This
additionally reduces the HDD vs SSD cost gap when the system as a whole is
considered. The initial cost of a fancy RAID HBA plus onboard cache plus
battery backup (BBU or supercapacitor) can easily exceed 1000 US
dollars even after discounts - a sum that goes a long way toward SSD cost parity.
An HBA-free system may also cost hundreds of US dollars less every year if one
purchases an annual maintenance contract or extended warranty.

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
   Throughput 2`_ for additional details.
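To inventory the storage controllers in a host, ``lspci`` is a reasonable
starting point; SAS HBAs running IT-mode firmware typically report a plain
SAS class, while IR/RAID firmware often reports a RAID class. The devices
shown below are hypothetical:

.. code-block:: console

   # lspci -nn | grep -i -E 'raid|sas|non-volatile'
   01:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS3008
   02:00.0 Non-Volatile memory controller [0108]: Samsung NVMe SSD Controller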
Benchmarking
------------

BlueStore opens storage devices with ``O_DIRECT`` and issues ``fsync()``
frequently to ensure that data is safely persisted to media. You can evaluate a
drive's low-level write performance using ``fio``. For example, 4kB random write
performance is measured as follows:

.. code-block:: console

   # fio --name=write_test --filename=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=300

Write Caches
------------

Enterprise SSDs and HDDs normally include power loss protection features which
ensure data durability when power is lost while operating, and
use multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes -- a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.

These two modes are selected by either "enabling" or "disabling" the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through".

The default configuration (usually: caching is enabled) may not be optimal:
disabling this write cache may dramatically increase OSD IOPS and decrease
commit latency.

Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and persist the optimal cache configuration for their
devices.

The cache configuration can be queried with ``hdparm``, ``sdparm``,
``smartctl`` or by reading the values in ``/sys/class/scsi_disk/*/cache_type``,
for example:

.. code-block:: console

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching =  1 (on)

   # sdparm --get WCE /dev/sda
       /dev/sda: ATA       TOSHIBA MG07ACA1  0101
   WCE           1  [cha: y]
   # smartctl -g wcache /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   Write cache is:   Enabled

   # cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
   write back

The write cache can be disabled with those same tools:

.. code-block:: console

   # hdparm -W0 /dev/sda

   /dev/sda:
    setting drive write-caching to 0 (off)
    write-caching =  0 (off)

   # sdparm --clear WCE /dev/sda
       /dev/sda: ATA       TOSHIBA MG07ACA1  0101
   # smartctl -s wcache,off /dev/sda
   smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
   Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

   === START OF ENABLE/DISABLE COMMANDS SECTION ===
   Write cache disabled

In most cases, disabling this cache using ``hdparm``, ``sdparm``, or ``smartctl``
results in the cache_type changing automatically to "write through". If this is
not the case, you can try setting it directly as follows. (Users should ensure
that setting cache_type also correctly persists the caching mode of the device
until the next reboot as some drives require this to be repeated at every boot):

.. code-block:: console

   # echo "write through" > /sys/class/scsi_disk/0\:0\:0\:0/cache_type

   # hdparm -W /dev/sda

   /dev/sda:
    write-caching =  0 (off)
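NVMe SSDs expose an analogous volatile write cache through an optional
controller feature (feature ID ``0x06`` in the NVMe specification). On drives
that implement it (many enterprise models do not, because their cache is
already protected against power loss), ``nvme-cli`` can query and clear it.
A sketch, with a hypothetical device name:

.. code-block:: console

   # nvme get-feature /dev/nvme0 -f 0x06 -H
   # nvme set-feature /dev/nvme0 -f 0x06 -v 0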
.. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device
   cache_types to "write through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

.. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device
   cache_types to "write through":

   .. code-block:: console

      # cat /etc/udev/rules.d/99-ceph-write-through-el7.rules
      ACTION=="add", SUBSYSTEM=="scsi_disk", RUN+="/bin/sh -c 'echo write through > /sys/class/scsi_disk/$kernel/cache_type'"

.. tip:: The ``sdparm`` utility can be used to view/change the volatile write
   cache on several devices at once:

   .. code-block:: console

      # sdparm --get WCE /dev/sd*
          /dev/sda: ATA       TOSHIBA MG07ACA1  0101
      WCE           0  [cha: y]
          /dev/sdb: ATA       TOSHIBA MG07ACA1  0101
      WCE           0  [cha: y]
      # sdparm --clear WCE /dev/sd*
          /dev/sda: ATA       TOSHIBA MG07ACA1  0101
          /dev/sdb: ATA       TOSHIBA MG07ACA1  0101

Additional Considerations
-------------------------

Ceph operators typically provision multiple OSDs per host, but you should
ensure that the aggregate throughput of your OSD drives doesn't exceed the
network bandwidth required to service a client's read and write operations.
You should also consider each host's percentage of the cluster's overall
capacity. If the percentage located on a particular host is large and the host
fails, it can lead to problems such as recovery causing OSDs to exceed the
``full ratio``, which in turn causes Ceph to halt operations to prevent data
loss.

When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.


Networks
========

Provision at least 10 Gb/s networking in your datacenter, both among Ceph
hosts and between clients and your Ceph cluster. Network link active/active
bonding across separate network switches is strongly recommended both for
increased throughput and for tolerance of network failures and maintenance.
Take care that your bonding hash policy distributes traffic across links.
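To confirm that an existing bond (here the hypothetical ``bond0``) is using
LACP and a load-spreading hash policy, inspect the kernel's bonding state;
the output shown is illustrative:

.. code-block:: console

   # grep -E 'Bonding Mode|Hash Policy' /proc/net/bonding/bond0
   Bonding Mode: IEEE 802.3ad Dynamic link aggregation
   Transmit Hash Policy: layer3+4 (1)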
Speed
-----

It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it
takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes only
twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes
only three hours to replicate 10 TB across a 10 Gb/s network.

Note that a 40 Gb/s network link is effectively four 10 Gb/s channels in
parallel, and that a 100 Gb/s network link is effectively four 25 Gb/s channels
in parallel. Thus, and perhaps somewhat counterintuitively, an individual
packet on a 25 Gb/s network has slightly lower latency compared to a 40 Gb/s
network.


Cost
----

The larger the Ceph cluster, the more common OSD failures will be.
The faster that a placement group (PG) can recover from a degraded state to
an ``active + clean`` state, the better. Notably, fast recovery minimizes
the likelihood of multiple, overlapping failures that can cause data to become
temporarily unavailable or even lost. Of course, when provisioning your
network, you will have to balance price against performance.

Some deployment tools employ VLANs to make hardware and network cabling more
manageable. VLANs that use the 802.1q protocol require VLAN-capable NICs and
switches. The added expense of this hardware may be offset by the operational
cost savings on network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10 Gb/s Ethernet or better; as of
2022, 25/50/100 Gb/s networking is increasingly common for production clusters.

Top-of-rack (TOR) switches also need fast and redundant uplinks to
core / spine network switches or routers, often at least 40 Gb/s.


Baseboard Management Controller (BMC)
-------------------------------------

Your server chassis should have a Baseboard Management Controller (BMC).
Well-known examples are iDRAC (Dell), CIMC (Cisco UCS), and iLO (HPE).
Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
network for security and administration. Hypervisor SSH access, VM image uploads,
OS image installs, management sockets, etc. can impose significant loads on a network.
Running multiple networks may seem like overkill, but each traffic path represents
a potential capacity, throughput and/or performance bottleneck that you should
carefully consider before deploying a large scale data cluster.

Additionally, BMCs as of 2023 rarely sport network connections faster than 1 Gb/s,
so dedicated and inexpensive 1 Gb/s switches for BMC administrative traffic
may reduce costs by wasting fewer expensive ports on faster host switches.


Failure Domains
===============

A failure domain can be thought of as any component loss that prevents access to
one or more OSDs or other Ceph daemons. Examples include a stopped daemon on a
host, a storage drive failure, an OS crash, a malfunctioning NIC, a failed power
supply, a network outage, or a power outage. When planning your hardware
deployment, you must balance the risk of reducing costs by placing too many
responsibilities into too few failure domains against the added costs of
isolating every potential failure domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware. As
we noted above: when we speak of CPU *cores*, we mean *threads* when
hyperthreading (HT) is enabled. Each modern physical x64 CPU core typically
provides two logical CPU threads; other CPU architectures may vary.

Bear in mind that many factors influence resource choices. The
minimum resources that suffice for one purpose will not necessarily suffice for
another. A sandbox cluster with one OSD built on a laptop with VirtualBox or on
a trio of Raspberry Pis will get by with fewer resources than a production
deployment with a thousand OSDs serving five thousand RBD clients. The
classic Fisher-Price PXL 2000 captures video, as does an IMAX or RED camera.
One would not expect the former to do the job of the latter. We especially
cannot stress enough the criticality of using enterprise-quality storage
media for production workloads.

Additional insights into resource planning for production clusters are
found above and elsewhere within this documentation.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Bare Minimum and Recommended            |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1 core minimum, 2 recommended         |
|              |                | - 1 core per 200-500 MB/s throughput    |
|              |                | - 1 core per 1000-3000 IOPS             |
|              |                |                                         |
|              |                | * Results are before replication.       |
|              |                | * Results may vary across CPU and drive |
|              |                |   models and Ceph configuration:        |
|              |                |   (erasure coding, compression, etc)    |
|              |                | * ARM processors specifically may       |
|              |                |   require more cores for performance.   |
|              |                | * SSD OSDs, especially NVMe, will       |
|              |                |   benefit from additional cores per OSD.|
|              |                | * Actual performance depends on many    |
|              |                |   factors including drives, net, and    |
|              |                |   client throughput and latency.        |
|              |                |   Benchmarking is highly recommended.   |
|              +----------------+-----------------------------------------+
|              | RAM            | - 4GB+ per daemon (more is better)      |
|              |                | - 2-4GB may function but may be slow    |
|              |                | - Less than 2GB is not recommended      |
|              +----------------+-----------------------------------------+
|              | Storage Drives | 1x storage drive per OSD                |
|              +----------------+-----------------------------------------+
|              | DB/WAL         | 1x SSD partition per HDD OSD            |
|              | (optional)     | 4-5x HDD OSDs per DB/WAL SATA SSD       |
|              |                | <= 10 HDD OSDs per DB/WAL NVMe SSD      |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (bonded 10+ Gb/s recommended)  |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 5GB+ per daemon (large / production     |
|              |                | clusters need more)                     |
|              +----------------+-----------------------------------------+
|              | Storage        | 100 GB per daemon, SSD is recommended   |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2GB+ per daemon (more for production)   |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 GB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD node with a single storage drive, create a
   partition for your OSD that is separate from the partition
   containing the OS. We recommend separate drives for the
   OS and for OSD storage.



.. _block and block.db: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#block-and-block-db
.. _Ceph blog: https://ceph.com/community/blog/
.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations
.. _Storage Networking Industry Association's Total Cost of Ownership calculator: https://www.snia.org/forums/cmsi/programs/TCOcalc
.. _Werner Fischer's blog post on partition alignment: https://www.thomas-krenn.com/en/wiki/Partition_Alignment_detailed_explanation
--
cgit v1.2.3