summaryrefslogtreecommitdiffstats
path: root/doc/start/hardware-recommendations.rst
blob: a63b5a4579643b6ff8487ae2e71cf825e4d7d669 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
.. _hardware-recommendations:

==========================
 hardware recommendations
==========================

Ceph is designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters flexible and economically feasible. 
When planning your cluster's hardware, you will need to balance a number 
of considerations, including failure domains, cost, and performance.
Hardware planning should include distributing Ceph daemons and 
other processes that use Ceph across many hosts. Generally, we recommend 
running Ceph daemons of a specific type on a host configured for that type 
of daemon. We recommend using separate hosts for processes that utilize your 
data cluster (e.g., OpenStack, CloudStack, Kubernetes, etc).

The requirements of one Ceph cluster are not the same as the requirements of
another, but below are some general guidelines. 

.. tip:: check out the `ceph blog`_ too.

CPU
===

CephFS Metadata Servers (MDS) are CPU-intensive. They are
are single-threaded and perform best with CPUs with a high clock rate (GHz). MDS
servers do not need a large number of CPU cores unless they are also hosting other
services, such as SSD OSDs for the CephFS metadata pool.
OSD nodes need enough processing power to run the RADOS service, to calculate data
placement with CRUSH, to replicate data, and to maintain their own copies of the
cluster map.

With earlier releases of Ceph, we would make hardware recommendations based on
the number of cores per OSD, but this cores-per-osd metric is no longer as
useful a metric as the number of cycles per IOP and the number of IOPS per OSD.
For example, with NVMe OSD drives, Ceph can easily utilize five or six cores on real
clusters and up to about fourteen cores on single OSDs in isolation. So cores
per OSD are no longer as pressing a concern as they were. When selecting
hardware, select for IOPS per core.

.. tip:: When we speak of CPU _cores_, we mean _threads_ when hyperthreading
	 is enabled.  Hyperthreading is usually beneficial for Ceph servers.

Monitor nodes and Manager nodes do not have heavy CPU demands and require only
modest processors. if your hosts will run CPU-intensive processes in
addition to Ceph daemons, make sure that you have enough processing power to
run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
one example of a CPU-intensive process.) We recommend that you run
non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
not your Monitor and Manager nodes) in order to avoid resource contention.
If your cluster deployes the Ceph Object Gateway, RGW daemons may co-reside
with your Mon and Manager services if the nodes have sufficient resources.

RAM
===

Generally, more RAM is better.  Monitor / Manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs 128GB
is advised.

.. tip:: when we speak of RAM and storage requirements, we often describe
	 the needs of a single daemon of a given type.  A given server as
	 a whole will thus need at least the sum of the needs of the
	 daemons that it hosts as well as resources for logs and other operating
	 system components.  Keep in mind that a server's need for RAM
	 and storage will be greater at startup and when components
	 fail or are added and the cluster rebalances.  In other words,
	 allow headroom past what you might see used during a calm period
	 on a small initial cluster footprint.

There is an :confval:`osd_memory_target` setting for BlueStore OSDs that
defaults to 4GB.  Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery:  provisioning ~8GB *per BlueStore OSD* is thus
advised.

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage scales with the size of the
cluster.  Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32 GB suffices. For clusters of up to,
say, 300 OSDs go with 64GB. For clusters built with (or which will grow to)
even more OSDs you should provision 128GB. You may also want to consider
tuning the following settings:

* :confval:`mon_osd_cache_size`
* :confval:`rocksdb_cache_size`


Metadata servers (ceph-mds)
---------------------------

CephFS metadata daemon memory utilization depends on the configured size of
its cache. We recommend 1 GB as a minimum for most systems.  See
:confval:`mds_cache_memory_limit`.


Memory
======

Bluestore uses its own memory to cache data rather than relying on the
operating system's page cache. In Bluestore you can adjust the amount of memory
that the OSD attempts to consume by changing the :confval:`osd_memory_target`
configuration option.

- Setting the :confval:`osd_memory_target` below 2GB is not
  recommended. Ceph may fail to keep the memory consumption under 2GB and 
  extremely slow performance is likely.

- Setting the memory target between 2GB and 4GB typically works but may result
  in degraded performance: metadata may need to be read from disk during IO
  unless the active data set is relatively small.

- 4GB is the current default value for :confval:`osd_memory_target` This default
  was chosen for typical use cases, and is intended to balance RAM cost and
  OSD performance.

- Setting the :confval:`osd_memory_target` higher than 4GB can improve
  performance when there many (small) objects or when large (256GB/OSD 
  or more) data sets are processed.  This is especially true with fast
  NVMe OSDs.

.. important:: OSD memory management is "best effort". Although the OSD may
   unmap memory to allow the kernel to reclaim it, there is no guarantee that
   the kernel will actually reclaim freed memory within a specific time
   frame. This applies especially in older versions of Ceph, where transparent
   huge pages can prevent the kernel from reclaiming memory that was freed from
   fragmented huge pages. Modern versions of Ceph disable transparent huge
   pages at the application level to avoid this, but that does not
   guarantee that the kernel will immediately reclaim unmapped memory. The OSD
   may still at times exceed its memory target. We recommend budgeting 
   at least 20% extra memory on your system to prevent OSDs from going OOM
   (**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
   the kernel reclaiming freed pages. That 20% value might be more or less than
   needed, depending on the exact configuration of the system.

.. tip:: Configuring the operating system with swap to provide additional
	 virtual memory for daemons is not advised for modern systems.  Doing
	 may result in lower performance, and your Ceph cluster may well be
	 happier with a daemon that crashes vs one that slows to a crawl.

When using the legacy FileStore back end, the OS page cache was used for caching
data, so tuning was not normally needed. When using the legacy FileStore backend,
the OSD memory consumption was related to the number of PGs per daemon in the
system.


Data Storage
============

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests from multiple daemons for read and
write operations against a single drive can impact performance.

OSDs require substantial storage drive space for RADOS data. We recommend a
minimum drive size of 1 terabyte. OSD drives much smaller than one terabyte
use a significant fraction of their capacity for metadata, and drives smaller
than 100 gigabytes will not be effective at all.

It is *strongly* suggested that (enterprise-class) SSDs are provisioned for, at a
minimum, Ceph Monitor and Ceph Manager hosts, as well as CephFS Metadata Server
metadata pools and Ceph Object Gateway (RGW) index pools, even if HDDs are to
be provisioned for bulk OSD data.

To get the best performance out of Ceph, provision the following on separate
drives:

* The operating systems
* OSD data
* BlueStore WAL+DB

For more
information on how to effectively use a mix of fast drives and slow drives in
your Ceph cluster, see the `block and block.db`_ section of the Bluestore
Configuration Reference.

Hard Disk Drives
----------------

Consider carefully the cost-per-gigabyte advantage
of larger disks. We recommend dividing the price of the disk drive by the
number of gigabytes to arrive at a cost per gigabyte, because larger drives may
have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte
hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 =
0.0732). By contrast, a 3 terabyte disk priced at $150.00 has a cost of $0.05
per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the
1 terabyte disks would generally increase the cost per gigabyte by
40%--rendering your cluster substantially less cost efficient.

.. tip:: Hosting multiple OSDs on a single SAS / SATA HDD
   is **NOT** a good idea.

.. tip:: Hosting an OSD with monitor, manager, or MDS data on a single 
   drive is also **NOT** a good idea.

.. tip:: With spinning disks, the SATA and SAS interface increasingly
   becomes a bottleneck at larger capacities. See also the `Storage Networking 
   Industry Association's Total Cost of Ownership calculator`_.


Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host.
Many "slow OSD" issues (when they are not attributable to hardware failure)
arise from running an operating system and multiple OSDs on the same drive.
Also be aware that today's 22TB HDD uses the same SATA interface as a
3TB HDD from ten years ago: more than seven times the data to squeeze
through the same same interface.  For this reason, when using HDDs for
OSDs, drives larger than 8TB may be best suited for storage of large
files / objects that are not at all performance-sensitive.


Solid State Drives
------------------

Ceph performance is much improved when using solid-state drives (SSDs). This
reduces random access time and reduces latency while increasing throughput. 

SSDs cost more per gigabyte than do HDDs but SSDs often offer
access times that are, at a minimum, 100 times faster than HDDs.
SSDs avoid hotspot issues and bottleneck issues within busy clusters, and
they may offer better economics when TCO is evaluated holistically. Notably,
the amortized drive cost for a given number of IOPS is much lower with SSDs
than with HDDs.  SSDs do not suffer rotational or seek latency and in addition
to improved client performance, they substantially improve the speed and
client impact of cluster changes including rebalancing when OSDs or Monitors
are added, removed, or fail.

SSDs do not have moving mechanical parts, so they are not subject
to many of the limitations of HDDs.  SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential and random reads and writes.

.. important:: We recommend exploring the use of SSDs to improve performance. 
   However, before making a significant investment in SSDs, we **strongly
   recommend** reviewing the performance metrics of an SSD and testing the
   SSD in a test configuration in order to gauge performance. 

Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not the only factor to consider when selecting SSDs for
use with Ceph. Bargain SSDs are often a false economy: they may experience
"cliffing", which means that after an initial burst, sustained performance
once a limited cache is filled declines considerably.  Consider also durability:
a drive rated for 0.3 Drive Writes Per Day (DWPD or equivalent) may be fine for
OSDs dedicated to certain types of sequentially-written read-mostly data, but
are not a good choice for Ceph Monitor duty.  Enterprise-class SSDs are best
for Ceph:  they almost always feature power less protection (PLP) and do
not suffer the dramatic cliffing that client (desktop) models may experience.

When using a single (or mirrored pair) SSD for both operating system boot
and Ceph Monitor / Manager purposes, a minimum capacity of 256GB is advised
and at least 480GB is recommended. A drive model rated at 1+ DWPD (or the
equivalent in TBW (TeraBytes Written) is suggested.  However, for a given write
workload, a larger drive than technically required will provide more endurance
because it effectively has greater overprovsioning. We stress that
enterprise-class drives are best for production use, as they feature power
loss protection and increased durability compared to client (desktop) SKUs
that are intended for much lighter and intermittent duty cycles.

SSDs were historically been cost prohibitive for object storage, but
QLC SSDs are closing the gap, offering greater density with lower power
consumption and less power spent on cooling. Also, HDD OSDs may see a
significant write latency improvement by offloading WAL+DB onto an SSD.
Many Ceph OSD deployments do not require an SSD with greater endurance than
1 DWPD (aka "read-optimized").  "Mixed-use" SSDs in the 3 DWPD class are
often overkill for this purpose and cost signficantly more.

To get a better sense of the factors that determine the total cost of storage,
you might use the `Storage Networking Industry Association's Total Cost of
Ownership calculator`_

Partition Alignment
~~~~~~~~~~~~~~~~~~~

When using SSDs with Ceph, make sure that your partitions are properly aligned.
Improperly aligned partitions suffer slower data transfer speeds than do
properly aligned partitions. For more information about proper partition
alignment and example commands that show how to align partitions properly, see
`Werner Fischer's blog post on partition alignment`_.

CephFS Metadata Segregation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

One way that Ceph accelerates CephFS file system performance is by separating
the storage of CephFS metadata from the storage of the CephFS file contents.
Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
have to manually create a pool for CephFS metadata, but you can create a CRUSH map
hierarchy for your CephFS metadata pool that includes only SSD storage media.
See :ref:`CRUSH Device Class<crush-map-device-class>` for details.


Controllers
-----------

Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection of HBAs to ensure that they do not create a
performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
backup can substantially increase hardware and maintenance costs. Many RAID
HBAs can be configured with an IT-mode "personality" or "JBOD mode" for
streamlined operation.

You do not need an RoC (RAID-capable) HBA. ZFS or Linux MD software mirroring
serve well for boot volume durability.  When using SAS or SATA data drives,
forgoing HBA RAID capabilities can reduce the gap between HDD and SSD
media cost.  Moreover, when using NVMe SSDs, you do not need *any* HBA.  This
additionally reduces the HDD vs SSD cost gap when the system as a whole is
considered. The initial cost of a fancy RAID HBA plus onboard cache plus
battery backup (BBU or supercapacitor) can easily exceed more than 1000 US
dollars even after discounts - a sum that goes a log way toward SSD cost parity.
An HBA-free system may also cost hundreds of US dollars less every year if one
purchases an annual maintenance contract or extended warranty.

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write 
   Throughput 2`_ for additional details.


Benchmarking
------------

BlueStore opens storage devices with ``O_DIRECT`` and issues ``fsync()``
frequently to ensure that data is safely persisted to media. You can evaluate a
drive's low-level write performance using ``fio``. For example, 4kB random write
performance is measured as follows:

.. code-block:: console

  # fio --name=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=300

Write Caches
------------

Enterprise SSDs and HDDs normally include power loss protection features which
ensure data durability when power is lost while operating, and
use multi-level caches to speed up direct or synchronous writes.  These devices
can be toggled between two caching modes -- a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.

These two modes are selected by either "enabling" or "disabling" the write
(volatile) cache.  When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through".

The default configuration (usually: caching is enabled) may not be optimal, and
OSD performance may be dramatically increased in terms of increased IOPS and
decreased commit latency by disabling this write cache.

Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and persist the optimal cache configuration for their
devices.

The cache configuration can be queried with ``hdparm``, ``sdparm``,
``smartctl`` or by reading the values in ``/sys/class/scsi_disk/*/cache_type``,
for example:

.. code-block:: console

  # hdparm -W /dev/sda

  /dev/sda:
   write-caching =  1 (on)

  # sdparm --get WCE /dev/sda
      /dev/sda: ATA       TOSHIBA MG07ACA1  0101
  WCE           1  [cha: y]
  # smartctl -g wcache /dev/sda
  smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
  Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

  Write cache is:   Enabled

  # cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
  write back

The write cache can be disabled with those same tools:

.. code-block:: console

  # hdparm -W0 /dev/sda

  /dev/sda:
   setting drive write-caching to 0 (off)
   write-caching =  0 (off)

  # sdparm --clear WCE /dev/sda
      /dev/sda: ATA       TOSHIBA MG07ACA1  0101
  # smartctl -s wcache,off /dev/sda
  smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
  Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

  === START OF ENABLE/DISABLE COMMANDS SECTION ===
  Write cache disabled

In most cases, disabling this cache  using ``hdparm``, ``sdparm``, or ``smartctl``
results in the cache_type changing automatically to "write through". If this is
not the case, you can try setting it directly as follows. (Users should ensure
that setting cache_type also correctly persists the caching mode of the device
until the next reboot as some drives require this to be repeated at every boot):

.. code-block:: console

  # echo "write through" > /sys/class/scsi_disk/0\:0\:0\:0/cache_type

  # hdparm -W /dev/sda

  /dev/sda:
   write-caching =  0 (off)

.. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device cache_types to "write
  through":

  .. code-block:: console

    # cat /etc/udev/rules.d/99-ceph-write-through.rules
    ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

.. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device cache_types to "write
  through":

  .. code-block:: console

    # cat /etc/udev/rules.d/99-ceph-write-through-el7.rules
    ACTION=="add", SUBSYSTEM=="scsi_disk", RUN+="/bin/sh -c 'echo write through > /sys/class/scsi_disk/$kernel/cache_type'"

.. tip:: The ``sdparm`` utility can be used to view/change the volatile write
  cache on several devices at once:

  .. code-block:: console

    # sdparm --get WCE /dev/sd*
        /dev/sda: ATA       TOSHIBA MG07ACA1  0101
    WCE           0  [cha: y]
        /dev/sdb: ATA       TOSHIBA MG07ACA1  0101
    WCE           0  [cha: y]
    # sdparm --clear WCE /dev/sd*
        /dev/sda: ATA       TOSHIBA MG07ACA1  0101
        /dev/sdb: ATA       TOSHIBA MG07ACA1  0101

Additional Considerations
-------------------------

Ceph operators typically provision  multiple OSDs per host, but you should
ensure that the aggregate throughput of your OSD drives doesn't exceed the
network bandwidth required to service a client's read and write operations.
You should also each host's percentage of the cluster's overall capacity. If
the percentage located on a particular host is large and the host fails, it
can lead to problems such as recovery causing OSDs to exceed the ``full ratio``,
which in turn causes Ceph to halt operations to prevent data loss.

When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.


Networks
========

Provision at least 10 Gb/s networking in your datacenter, both among Ceph
hosts and between clients and your Ceph cluster.  Network link active/active
bonding across separate network switches is strongly recommended both for
increased throughput and for tolerance of network failures and maintenance.
Take care that your bonding hash policy distributes traffic across links.

Speed
-----

It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it
takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes only
twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes
only one hour to replicate 10 TB across a 10 Gb/s network.

Note that a 40 Gb/s network link is effectively four 10 Gb/s channels in
parallel, and that a 100Gb/s network link is effectively four 25 Gb/s channels
in parallel.  Thus, and perhaps somewhat counterintuitively, an individual
packet on a 25 Gb/s network has slightly lower latency compared to a 40 Gb/s
network.


Cost
----

The larger the Ceph cluster, the more common OSD failures will be.
The faster that a placement group (PG) can recover from a degraded state to
an ``active + clean`` state, the better. Notably, fast recovery minimizes
the likelihood of multiple, overlapping failures that can cause data to become
temporarily unavailable or even lost. Of course, when provisioning your
network, you will have to balance price against performance. 

Some deployment tools employ VLANs to make hardware and network cabling more
manageable. VLANs that use the 802.1q protocol require VLAN-capable NICs and
switches. The added expense of this hardware may be offset by the operational
cost savings on network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10 Gb/s Ethernet or better; 40 Gb/s or
increasingly 25/50/100 Gb/s networking as of 2022 is common for production clusters.

Top-of-rack (TOR) switches also need fast and redundant uplinks to
core / spine network switches or routers, often at least 40 Gb/s.


Baseboard Management Controller (BMC)
-------------------------------------

Your server chassis should have a Baseboard Management Controller (BMC).
Well-known examples are iDRAC (Dell), CIMC (Cisco UCS), and iLO (HPE).
Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
network for security and administration.  Hypervisor SSH access, VM image uploads,
OS image installs, management sockets, etc. can impose significant loads on a network.
Running multiple networks may seem like overkill, but each traffic path represents
a potential capacity, throughput and/or performance bottleneck that you should
carefully consider before deploying a large scale data cluster.

Additionally BMCs as of 2023 rarely sport network connections faster than 1 Gb/s,
so dedicated and inexpensive 1 Gb/s switches for BMC administrative traffic
may reduce costs by wasting fewer expenive ports on faster host switches.
 

Failure Domains
===============

A failure domain can be thought of as any component loss that prevents access to
one or more OSDs or other Ceph daemons. These could be a stopped daemon on a host;
a storage drive failure, an OS crash, a malfunctioning NIC, a failed power supply,
a network outage, a power outage, and so forth. When planning your hardware
deployment, you must balance the risk of reducing costs by placing too many
responsibilities into too few failure domains against the added costs of
isolating every potential failure domain.


Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware.  As
we noted above: when we speak of CPU _cores_, we mean _threads_ when
hyperthreading (HT) is enabled.  Each modern physical x64 CPU core typically
provides two logical CPU threads; other CPU architectures may vary.

Take care that there are many factors that influence resource choices.  The
minimum resources that suffice for one purpose will not necessarily suffice for
another.  A sandbox cluster with one OSD built on a laptop with VirtualBox or on
a trio of Raspberry PIs will get by with fewer resources than a production
deployment with a thousand OSDs serving five thousand of RBD clients.  The
classic Fisher Price PXL 2000 captures video, as does an IMAX or RED camera.
One would not expect the former to do the job of the latter.  We especially
cannot stress enough the criticality of using enterprise-quality storage
media for production workloads.

Additional insights into resource planning for production clusters are
found above and elsewhere within this documentation.

+--------------+----------------+-----------------------------------------+
|  Process     | Criteria       | Bare Minimum and Recommended            |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1 core minimum, 2 recommended         |
|              |                | - 1 core per 200-500 MB/s throughput    |
|              |                | - 1 core per 1000-3000 IOPS             |
|              |                |                                         |
|              |                | * Results are before replication.       |
|              |                | * Results may vary across CPU and drive |
|              |                |   models and Ceph configuration:        |
|              |                |   (erasure coding, compression, etc)    |
|              |                | * ARM processors specifically may       |
|              |                |   require more cores for performance.   |
|              |                | * SSD OSDs, especially NVMe, will       |
|              |                |   benefit from additional cores per OSD.|
|              |                | * Actual performance depends on many    |
|              |                |   factors including drives, net, and    |
|              |                |   client throughput and latency.        |
|              |                |   Benchmarking is highly recommended.   |
|              +----------------+-----------------------------------------+
|              | RAM            | - 4GB+ per daemon (more is better)      |
|              |                | - 2-4GB may function but may be slow    |
|              |                | - Less than 2GB is not recommended      |
|              +----------------+-----------------------------------------+
|              | Storage Drives |  1x storage drive per OSD               |
|              +----------------+-----------------------------------------+
|              | DB/WAL         |  1x SSD partion per HDD OSD             |
|              | (optional)     |  4-5x HDD OSDs per DB/WAL SATA SSD      |
|              |                |  <= 10 HDD OSDss per DB/WAL NVMe SSD    |
|              +----------------+-----------------------------------------+
|              | Network        |  1x 1Gb/s (bonded 10+ Gb/s recommended) |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            |  5GB+ per daemon (large / production    |
|              |                |  clusters need more)                    |
|              +----------------+-----------------------------------------+
|              | Storage        |  100 GB per daemon, SSD is recommended  |
|              +----------------+-----------------------------------------+
|              | Network        |  1x 1Gb/s (10+ Gb/s recommended)        |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            |  2GB+ per daemon (more for production)  |
|              +----------------+-----------------------------------------+
|              | Disk Space     |  1 GB per daemon                        |
|              +----------------+-----------------------------------------+
|              | Network        |  1x 1Gb/s (10+ Gb/s recommended)        |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD node with a single storage drive, create a
   partition for your OSD that is separate from the partition
   containing the OS. We recommend separate drives for the
   OS and for OSD storage.



.. _block and block.db: https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#block-and-block-db
.. _Ceph blog: https://ceph.com/community/blog/
.. _Ceph Write Throughput 1: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
.. _OS Recommendations: ../os-recommendations
.. _Storage Networking Industry Association's Total Cost of Ownership calculator: https://www.snia.org/forums/cmsi/programs/TCOcalc
.. _Werner Fischer's blog post on partition alignment: https://www.thomas-krenn.com/en/wiki/Partition_Alignment_detailed_explanation