author     Daniel Baumann <daniel.baumann@progress-linux.org>    2024-04-07 18:45:59 +0000
committer  Daniel Baumann <daniel.baumann@progress-linux.org>    2024-04-07 18:45:59 +0000
commit     19fcec84d8d7d21e796c7624e521b60d28ee21ed (patch)
tree       42d26aa27d1e3f7c0b8bd3fd14e7d7082f5008dc /doc/rados/configuration/mon-config-ref.rst
parent     Initial commit. (diff)
Adding upstream version 16.2.11+ds.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/rados/configuration/mon-config-ref.rst')
-rw-r--r--  doc/rados/configuration/mon-config-ref.rst  1243
1 file changed, 1243 insertions, 0 deletions
diff --git a/doc/rados/configuration/mon-config-ref.rst b/doc/rados/configuration/mon-config-ref.rst new file mode 100644 index 000000000..3b12af43d --- /dev/null +++ b/doc/rados/configuration/mon-config-ref.rst @@ -0,0 +1,1243 @@ +.. _monitor-config-reference: + +========================== + Monitor Config Reference +========================== + +Understanding how to configure a :term:`Ceph Monitor` is an important part of +building a reliable :term:`Ceph Storage Cluster`. **All Ceph Storage Clusters +have at least one monitor**. The monitor complement usually remains fairly +consistent, but you can add, remove or replace a monitor in a cluster. See +`Adding/Removing a Monitor`_ for details. + + +.. index:: Ceph Monitor; Paxos + +Background +========== + +Ceph Monitors maintain a "master copy" of the :term:`Cluster Map`, which means a +:term:`Ceph Client` can determine the location of all Ceph Monitors, Ceph OSD +Daemons, and Ceph Metadata Servers just by connecting to one Ceph Monitor and +retrieving a current cluster map. Before Ceph Clients can read from or write to +Ceph OSD Daemons or Ceph Metadata Servers, they must connect to a Ceph Monitor +first. With a current copy of the cluster map and the CRUSH algorithm, a Ceph +Client can compute the location for any object. The ability to compute object +locations allows a Ceph Client to talk directly to Ceph OSD Daemons, which is a +very important aspect of Ceph's high scalability and performance. See +`Scalability and High Availability`_ for additional details. + +The primary role of the Ceph Monitor is to maintain a master copy of the cluster +map. Ceph Monitors also provide authentication and logging services. Ceph +Monitors write all changes in the monitor services to a single Paxos instance, +and Paxos writes the changes to a key/value store for strong consistency. Ceph +Monitors can query the most recent version of the cluster map during sync +operations. Ceph Monitors leverage the key/value store's snapshots and iterators +(using leveldb) to perform store-wide synchronization. + +.. ditaa:: + /-------------\ /-------------\ + | Monitor | Write Changes | Paxos | + | cCCC +-------------->+ cCCC | + | | | | + +-------------+ \------+------/ + | Auth | | + +-------------+ | Write Changes + | Log | | + +-------------+ v + | Monitor Map | /------+------\ + +-------------+ | Key / Value | + | OSD Map | | Store | + +-------------+ | cCCC | + | PG Map | \------+------/ + +-------------+ ^ + | MDS Map | | Read Changes + +-------------+ | + | cCCC |*---------------------+ + \-------------/ + + +.. deprecated:: version 0.58 + +In Ceph versions 0.58 and earlier, Ceph Monitors use a Paxos instance for +each service and store the map as a file. + +.. index:: Ceph Monitor; cluster map + +Cluster Maps +------------ + +The cluster map is a composite of maps, including the monitor map, the OSD map, +the placement group map and the metadata server map. The cluster map tracks a +number of important things: which processes are ``in`` the Ceph Storage Cluster; +which processes that are ``in`` the Ceph Storage Cluster are ``up`` and running +or ``down``; whether, the placement groups are ``active`` or ``inactive``, and +``clean`` or in some other state; and, other details that reflect the current +state of the cluster such as the total amount of storage space, and the amount +of storage used. 
+ +When there is a significant change in the state of the cluster--e.g., a Ceph OSD +Daemon goes down, a placement group falls into a degraded state, etc.--the +cluster map gets updated to reflect the current state of the cluster. +Additionally, the Ceph Monitor also maintains a history of the prior states of +the cluster. The monitor map, OSD map, placement group map and metadata server +map each maintain a history of their map versions. We call each version an +"epoch." + +When operating your Ceph Storage Cluster, keeping track of these states is an +important part of your system administration duties. See `Monitoring a Cluster`_ +and `Monitoring OSDs and PGs`_ for additional details. + +.. index:: high availability; quorum + +Monitor Quorum +-------------- + +Our Configuring ceph section provides a trivial `Ceph configuration file`_ that +provides for one monitor in the test cluster. A cluster will run fine with a +single monitor; however, **a single monitor is a single-point-of-failure**. To +ensure high availability in a production Ceph Storage Cluster, you should run +Ceph with multiple monitors so that the failure of a single monitor **WILL NOT** +bring down your entire cluster. + +When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability, +Ceph Monitors use `Paxos`_ to establish consensus about the master cluster map. +A consensus requires a majority of monitors running to establish a quorum for +consensus about the cluster map (e.g., 1; 2 out of 3; 3 out of 5; 4 out of 6; +etc.). + +``mon force quorum join`` + +:Description: Force monitor to join quorum even if it has been previously removed from the map +:Type: Boolean +:Default: ``False`` + +.. index:: Ceph Monitor; consistency + +Consistency +----------- + +When you add monitor settings to your Ceph configuration file, you need to be +aware of some of the architectural aspects of Ceph Monitors. **Ceph imposes +strict consistency requirements** for a Ceph monitor when discovering another +Ceph Monitor within the cluster. Whereas, Ceph Clients and other Ceph daemons +use the Ceph configuration file to discover monitors, monitors discover each +other using the monitor map (monmap), not the Ceph configuration file. + +A Ceph Monitor always refers to the local copy of the monmap when discovering +other Ceph Monitors in the Ceph Storage Cluster. Using the monmap instead of the +Ceph configuration file avoids errors that could break the cluster (e.g., typos +in ``ceph.conf`` when specifying a monitor address or port). Since monitors use +monmaps for discovery and they share monmaps with clients and other Ceph +daemons, **the monmap provides monitors with a strict guarantee that their +consensus is valid.** + +Strict consistency also applies to updates to the monmap. As with any other +updates on the Ceph Monitor, changes to the monmap always run through a +distributed consensus algorithm called `Paxos`_. The Ceph Monitors must agree on +each update to the monmap, such as adding or removing a Ceph Monitor, to ensure +that each monitor in the quorum has the same version of the monmap. Updates to +the monmap are incremental so that Ceph Monitors have the latest agreed upon +version, and a set of previous versions. Maintaining a history enables a Ceph +Monitor that has an older version of the monmap to catch up with the current +state of the Ceph Storage Cluster. 
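To make the distinction concrete: the monitor addresses listed in the Ceph
configuration file (or the central config store) are only a starting point for
clients and other daemons, while the monitors themselves rely on the monmap. A
minimal sketch, using illustrative addresses (the Minimum Configuration section
below shows the same option in context):

.. code-block:: ini

    [global]
    # Example addresses only: clients and daemons use this list to make
    # initial contact with a monitor; monitors discover each other via
    # the monmap, not this file.
    mon_host = 10.0.0.11,10.0.0.12,10.0.0.13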
+ +If Ceph Monitors were to discover each other through the Ceph configuration file +instead of through the monmap, additional risks would be introduced because +Ceph configuration files are not updated and distributed automatically. Ceph +Monitors might inadvertently use an older Ceph configuration file, fail to +recognize a Ceph Monitor, fall out of a quorum, or develop a situation where +`Paxos`_ is not able to determine the current state of the system accurately. + + +.. index:: Ceph Monitor; bootstrapping monitors + +Bootstrapping Monitors +---------------------- + +In most configuration and deployment cases, tools that deploy Ceph help +bootstrap the Ceph Monitors by generating a monitor map for you (e.g., +``cephadm``, etc). A Ceph Monitor requires a few explicit +settings: + +- **Filesystem ID**: The ``fsid`` is the unique identifier for your + object store. Since you can run multiple clusters on the same + hardware, you must specify the unique ID of the object store when + bootstrapping a monitor. Deployment tools usually do this for you + (e.g., ``cephadm`` can call a tool like ``uuidgen``), but you + may specify the ``fsid`` manually too. + +- **Monitor ID**: A monitor ID is a unique ID assigned to each monitor within + the cluster. It is an alphanumeric value, and by convention the identifier + usually follows an alphabetical increment (e.g., ``a``, ``b``, etc.). This + can be set in a Ceph configuration file (e.g., ``[mon.a]``, ``[mon.b]``, etc.), + by a deployment tool, or using the ``ceph`` commandline. + +- **Keys**: The monitor must have secret keys. A deployment tool such as + ``cephadm`` usually does this for you, but you may + perform this step manually too. See `Monitor Keyrings`_ for details. + +For additional details on bootstrapping, see `Bootstrapping a Monitor`_. + +.. index:: Ceph Monitor; configuring monitors + +Configuring Monitors +==================== + +To apply configuration settings to the entire cluster, enter the configuration +settings under ``[global]``. To apply configuration settings to all monitors in +your cluster, enter the configuration settings under ``[mon]``. To apply +configuration settings to specific monitors, specify the monitor instance +(e.g., ``[mon.a]``). By convention, monitor instance names use alpha notation. + +.. code-block:: ini + + [global] + + [mon] + + [mon.a] + + [mon.b] + + [mon.c] + + +Minimum Configuration +--------------------- + +The bare minimum monitor settings for a Ceph monitor via the Ceph configuration +file include a hostname and a network address for each monitor. You can configure +these under ``[mon]`` or under the entry for a specific monitor. + +.. code-block:: ini + + [global] + mon host = 10.0.0.2,10.0.0.3,10.0.0.4 + +.. code-block:: ini + + [mon.a] + host = hostname1 + mon addr = 10.0.0.10:6789 + +See the `Network Configuration Reference`_ for details. + +.. note:: This minimum configuration for monitors assumes that a deployment + tool generates the ``fsid`` and the ``mon.`` key for you. + +Once you deploy a Ceph cluster, you **SHOULD NOT** change the IP addresses of +monitors. However, if you decide to change the monitor's IP address, you +must follow a specific procedure. See `Changing a Monitor's IP Address`_ for +details. + +Monitors can also be found by clients by using DNS SRV records. See `Monitor lookup through DNS`_ for details. + +Cluster ID +---------- + +Each Ceph Storage Cluster has a unique identifier (``fsid``). 
If specified, it +usually appears under the ``[global]`` section of the configuration file. +Deployment tools usually generate the ``fsid`` and store it in the monitor map, +so the value may not appear in a configuration file. The ``fsid`` makes it +possible to run daemons for multiple clusters on the same hardware. + +``fsid`` + +:Description: The cluster ID. One per cluster. +:Type: UUID +:Required: Yes. +:Default: N/A. May be generated by a deployment tool if not specified. + +.. note:: Do not set this value if you use a deployment tool that does + it for you. + + +.. index:: Ceph Monitor; initial members + +Initial Members +--------------- + +We recommend running a production Ceph Storage Cluster with at least three Ceph +Monitors to ensure high availability. When you run multiple monitors, you may +specify the initial monitors that must be members of the cluster in order to +establish a quorum. This may reduce the time it takes for your cluster to come +online. + +.. code-block:: ini + + [mon] + mon_initial_members = a,b,c + + +``mon_initial_members`` + +:Description: The IDs of initial monitors in a cluster during startup. If + specified, Ceph requires an odd number of monitors to form an + initial quorum (e.g., 3). + +:Type: String +:Default: None + +.. note:: A *majority* of monitors in your cluster must be able to reach + each other in order to establish a quorum. You can decrease the initial + number of monitors to establish a quorum with this setting. + +.. index:: Ceph Monitor; data path + +Data +---- + +Ceph provides a default path where Ceph Monitors store data. For optimal +performance in a production Ceph Storage Cluster, we recommend running Ceph +Monitors on separate hosts and drives from Ceph OSD Daemons. As leveldb uses +``mmap()`` for writing the data, Ceph Monitors flush their data from memory to disk +very often, which can interfere with Ceph OSD Daemon workloads if the data +store is co-located with the OSD Daemons. + +In Ceph versions 0.58 and earlier, Ceph Monitors store their data in plain files. This +approach allows users to inspect monitor data with common tools like ``ls`` +and ``cat``. However, this approach didn't provide strong consistency. + +In Ceph versions 0.59 and later, Ceph Monitors store their data as key/value +pairs. Ceph Monitors require `ACID`_ transactions. Using a data store prevents +recovering Ceph Monitors from running corrupted versions through Paxos, and it +enables multiple modification operations in one single atomic batch, among other +advantages. + +Generally, we do not recommend changing the default data location. If you modify +the default location, we recommend that you make it uniform across Ceph Monitors +by setting it in the ``[mon]`` section of the configuration file. + + +``mon_data`` + +:Description: The monitor's data location. +:Type: String +:Default: ``/var/lib/ceph/mon/$cluster-$id`` + + +``mon_data_size_warn`` + +:Description: Raise ``HEALTH_WARN`` status when a monitor's data + store grows to be larger than this size, 15GB by default. + +:Type: Integer +:Default: ``15*1024*1024*1024`` + + +``mon_data_avail_warn`` + +:Description: Raise ``HEALTH_WARN`` status when the filesystem that houses a + monitor's data store reports that its available capacity is + less than or equal to this percentage . 
+ +:Type: Integer +:Default: ``30`` + + +``mon_data_avail_crit`` + +:Description: Raise ``HEALTH_ERR`` status when the filesystem that houses a + monitor's data store reports that its available capacity is + less than or equal to this percentage. + +:Type: Integer +:Default: ``5`` + +``mon_warn_on_cache_pools_without_hit_sets`` + +:Description: Raise ``HEALTH_WARN`` when a cache pool does not + have the ``hit_set_type`` value configured. + See :ref:`hit_set_type <hit_set_type>` for more + details. + +:Type: Boolean +:Default: ``True`` + +``mon_warn_on_crush_straw_calc_version_zero`` + +:Description: Raise ``HEALTH_WARN`` when the CRUSH + ``straw_calc_version`` is zero. See + :ref:`CRUSH map tunables <crush-map-tunables>` for + details. + +:Type: Boolean +:Default: ``True`` + + +``mon_warn_on_legacy_crush_tunables`` + +:Description: Raise ``HEALTH_WARN`` when + CRUSH tunables are too old (older than ``mon_min_crush_required_version``) + +:Type: Boolean +:Default: ``True`` + + +``mon_crush_min_required_version`` + +:Description: The minimum tunable profile required by the cluster. + See + :ref:`CRUSH map tunables <crush-map-tunables>` for + details. + +:Type: String +:Default: ``hammer`` + + +``mon_warn_on_osd_down_out_interval_zero`` + +:Description: Raise ``HEALTH_WARN`` when + ``mon_osd_down_out_interval`` is zero. Having this option set to + zero on the leader acts much like the ``noout`` flag. It's hard + to figure out what's going wrong with clusters without the + ``noout`` flag set but acting like that just the same, so we + report a warning in this case. + +:Type: Boolean +:Default: ``True`` + + +``mon_warn_on_slow_ping_ratio`` + +:Description: Raise ``HEALTH_WARN`` when any heartbeat + between OSDs exceeds ``mon_warn_on_slow_ping_ratio`` + of ``osd_heartbeat_grace``. The default is 5%. +:Type: Float +:Default: ``0.05`` + + +``mon_warn_on_slow_ping_time`` + +:Description: Override ``mon_warn_on_slow_ping_ratio`` with a specific value. + Raise ``HEALTH_WARN`` if any heartbeat + between OSDs exceeds ``mon_warn_on_slow_ping_time`` + milliseconds. The default is 0 (disabled). +:Type: Integer +:Default: ``0`` + + +``mon_warn_on_pool_no_redundancy`` + +:Description: Raise ``HEALTH_WARN`` if any pool is + configured with no replicas. +:Type: Boolean +:Default: ``True`` + + +``mon_cache_target_full_warn_ratio`` + +:Description: Position between pool's ``cache_target_full`` and + ``target_max_object`` where we start warning + +:Type: Float +:Default: ``0.66`` + + +``mon_health_to_clog`` + +:Description: Enable sending a health summary to the cluster log periodically. +:Type: Boolean +:Default: ``True`` + + +``mon_health_to_clog_tick_interval`` + +:Description: How often (in seconds) the monitor sends a health summary to the cluster + log (a non-positive number disables). If current health summary + is empty or identical to the last time, monitor will not send it + to cluster log. + +:Type: Float +:Default: ``60.0`` + + +``mon_health_to_clog_interval`` + +:Description: How often (in seconds) the monitor sends a health summary to the cluster + log (a non-positive number disables). Monitors will always + send a summary to the cluster log whether or not it differs from + the previous summary. + +:Type: Integer +:Default: ``3600`` + + + +.. index:: Ceph Storage Cluster; capacity planning, Ceph Monitor; capacity planning + +.. 
_storage-capacity: + +Storage Capacity +---------------- + +When a Ceph Storage Cluster gets close to its maximum capacity +(see``mon_osd_full ratio``), Ceph prevents you from writing to or reading from OSDs +as a safety measure to prevent data loss. Therefore, letting a +production Ceph Storage Cluster approach its full ratio is not a good practice, +because it sacrifices high availability. The default full ratio is ``.95``, or +95% of capacity. This a very aggressive setting for a test cluster with a small +number of OSDs. + +.. tip:: When monitoring your cluster, be alert to warnings related to the + ``nearfull`` ratio. This means that a failure of some OSDs could result + in a temporary service disruption if one or more OSDs fails. Consider adding + more OSDs to increase storage capacity. + +A common scenario for test clusters involves a system administrator removing an +OSD from the Ceph Storage Cluster, watching the cluster rebalance, then removing +another OSD, and another, until at least one OSD eventually reaches the full +ratio and the cluster locks up. We recommend a bit of capacity +planning even with a test cluster. Planning enables you to gauge how much spare +capacity you will need in order to maintain high availability. Ideally, you want +to plan for a series of Ceph OSD Daemon failures where the cluster can recover +to an ``active+clean`` state without replacing those OSDs +immediately. Cluster operation continues in the ``active+degraded`` state, but this +is not ideal for normal operation and should be addressed promptly. + +The following diagram depicts a simplistic Ceph Storage Cluster containing 33 +Ceph Nodes with one OSD per host, each OSD reading from +and writing to a 3TB drive. So this exemplary Ceph Storage Cluster has a maximum +actual capacity of 99TB. With a ``mon osd full ratio`` of ``0.95``, if the Ceph +Storage Cluster falls to 5TB of remaining capacity, the cluster will not allow +Ceph Clients to read and write data. So the Ceph Storage Cluster's operating +capacity is 95TB, not 99TB. + +.. ditaa:: + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | Rack 1 | | Rack 2 | | Rack 3 | | Rack 4 | | Rack 5 | | Rack 6 | + | cCCC | | cF00 | | cCCC | | cCCC | | cCCC | | cCCC | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 1 | | OSD 7 | | OSD 13 | | OSD 19 | | OSD 25 | | OSD 31 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 2 | | OSD 8 | | OSD 14 | | OSD 20 | | OSD 26 | | OSD 32 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 3 | | OSD 9 | | OSD 15 | | OSD 21 | | OSD 27 | | OSD 33 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 4 | | OSD 10 | | OSD 16 | | OSD 22 | | OSD 28 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 5 | | OSD 11 | | OSD 17 | | OSD 23 | | OSD 29 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 6 | | OSD 12 | | OSD 18 | | OSD 24 | | OSD 30 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + +It is normal in such a cluster for one or two OSDs to fail. A less frequent but +reasonable scenario involves a rack's router or power supply failing, which +brings down multiple OSDs simultaneously (e.g., OSDs 7-12). 
In such a scenario, you should still strive for a cluster that can remain
operational and achieve an ``active + clean`` state--even if that means adding
a few hosts with additional OSDs in short order. If capacity utilization is too
high, you may not lose data, but you could still sacrifice data availability
while resolving an outage within a failure domain if utilization exceeds the
full ratio. For this reason, we recommend at least some rough capacity
planning.

Identify two numbers for your cluster:

#. The number of OSDs.
#. The total capacity of the cluster.

If you divide the total capacity of your cluster by the number of OSDs in your
cluster, you will find the mean capacity of an OSD within your cluster.
Consider multiplying that number by the number of OSDs you expect will fail
simultaneously during normal operations (a relatively small number). Finally,
multiply the capacity of the cluster by the full ratio to arrive at a maximum
operating capacity; then, subtract the amount of data on the OSDs you expect to
fail to arrive at a reasonable full ratio. For example, in the 99TB cluster
above, each of the 33 OSDs holds roughly 3TB on average; if you expect two OSDs
to fail simultaneously, subtract 6TB from the 95TB maximum operating capacity,
which suggests an effective full ratio of roughly ``.90`` rather than ``.95``.
Repeat the foregoing process with a higher number of OSD failures (e.g., a rack
of OSDs) to arrive at a reasonable number for a near full ratio.

The following settings only apply on cluster creation and are then stored in
the OSDMap. To clarify, in normal operation the values that are used by OSDs
are those found in the OSDMap, not those in the configuration file or central
config store.

.. code-block:: ini

    [global]
    mon_osd_full_ratio = .80
    mon_osd_backfillfull_ratio = .75
    mon_osd_nearfull_ratio = .70


``mon_osd_full_ratio``

:Description: The threshold percentage of device space utilized before an OSD is
              considered ``full``.

:Type: Float
:Default: ``0.95``


``mon_osd_backfillfull_ratio``

:Description: The threshold percentage of device space utilized before an OSD is
              considered too ``full`` to backfill.

:Type: Float
:Default: ``0.90``


``mon_osd_nearfull_ratio``

:Description: The threshold percentage of device space used before an OSD is
              considered ``nearfull``.

:Type: Float
:Default: ``0.85``


.. tip:: If some OSDs are nearfull, but others have plenty of capacity, you
         may have an inaccurate CRUSH weight set for the nearfull OSDs.

.. tip:: These settings only apply during cluster creation. Afterwards they
         need to be changed in the OSDMap using ``ceph osd set-nearfull-ratio``
         and ``ceph osd set-full-ratio``.

.. index:: heartbeat

Heartbeat
---------

Ceph monitors know about the cluster by requiring reports from each OSD, and by
receiving reports from OSDs about the status of their neighboring OSDs. Ceph
provides reasonable default settings for monitor/OSD interaction; however, you
may modify them as needed. See `Monitor/OSD Interaction`_ for details.


.. index:: Ceph Monitor; leader, Ceph Monitor; provider, Ceph Monitor; requester, Ceph Monitor; synchronization

Monitor Store Synchronization
-----------------------------

When you run a production cluster with multiple monitors (recommended), each
monitor checks to see if a neighboring monitor has a more recent version of the
cluster map (e.g., a map in a neighboring monitor with one or more epoch numbers
higher than the most current epoch in the map of the monitor in question).
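If store synchronization over a slow or congested link repeatedly times out,
the sync tunables described later in this section can be adjusted. A minimal
sketch; the values shown here are illustrative assumptions, not
recommendations:

.. code-block:: ini

    [mon]
    # Give the sync provider more time between messages before the
    # requester gives up and bootstraps again (default: 60 seconds).
    mon_sync_timeout = 120
    # Send smaller sync payloads so individual messages can complete
    # on a constrained link (default: 1048576 bytes).
    mon_sync_max_payload_size = 524288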
+Periodically, one monitor in the cluster may fall behind the other monitors to +the point where it must leave the quorum, synchronize to retrieve the most +current information about the cluster, and then rejoin the quorum. For the +purposes of synchronization, monitors may assume one of three roles: + +#. **Leader**: The `Leader` is the first monitor to achieve the most recent + Paxos version of the cluster map. + +#. **Provider**: The `Provider` is a monitor that has the most recent version + of the cluster map, but wasn't the first to achieve the most recent version. + +#. **Requester:** A `Requester` is a monitor that has fallen behind the leader + and must synchronize in order to retrieve the most recent information about + the cluster before it can rejoin the quorum. + +These roles enable a leader to delegate synchronization duties to a provider, +which prevents synchronization requests from overloading the leader--improving +performance. In the following diagram, the requester has learned that it has +fallen behind the other monitors. The requester asks the leader to synchronize, +and the leader tells the requester to synchronize with a provider. + + +.. ditaa:: + +-----------+ +---------+ +----------+ + | Requester | | Leader | | Provider | + +-----------+ +---------+ +----------+ + | | | + | | | + | Ask to Synchronize | | + |------------------->| | + | | | + |<-------------------| | + | Tell Requester to | | + | Sync with Provider | | + | | | + | Synchronize | + |--------------------+-------------------->| + | | | + |<-------------------+---------------------| + | Send Chunk to Requester | + | (repeat as necessary) | + | Requester Acks Chuck to Provider | + |--------------------+-------------------->| + | | + | Sync Complete | + | Notification | + |------------------->| + | | + |<-------------------| + | Ack | + | | + + +Synchronization always occurs when a new monitor joins the cluster. During +runtime operations, monitors may receive updates to the cluster map at different +times. This means the leader and provider roles may migrate from one monitor to +another. If this happens while synchronizing (e.g., a provider falls behind the +leader), the provider can terminate synchronization with a requester. + +Once synchronization is complete, Ceph performs trimming across the cluster. +Trimming requires that the placement groups are ``active+clean``. + + +``mon_sync_timeout`` + +:Description: Number of seconds the monitor will wait for the next update + message from its sync provider before it gives up and bootstrap + again. + +:Type: Double +:Default: ``60.0`` + + +``mon_sync_max_payload_size`` + +:Description: The maximum size for a sync payload (in bytes). +:Type: 32-bit Integer +:Default: ``1048576`` + + +``paxos_max_join_drift`` + +:Description: The maximum Paxos iterations before we must first sync the + monitor data stores. When a monitor finds that its peer is too + far ahead of it, it will first sync with data stores before moving + on. + +:Type: Integer +:Default: ``10`` + + +``paxos_stash_full_interval`` + +:Description: How often (in commits) to stash a full copy of the PaxosService state. + Current this setting only affects ``mds``, ``mon``, ``auth`` and ``mgr`` + PaxosServices. + +:Type: Integer +:Default: ``25`` + + +``paxos_propose_interval`` + +:Description: Gather updates for this time interval before proposing + a map update. 
+ +:Type: Double +:Default: ``1.0`` + + +``paxos_min`` + +:Description: The minimum number of Paxos states to keep around +:Type: Integer +:Default: ``500`` + + +``paxos_min_wait`` + +:Description: The minimum amount of time to gather updates after a period of + inactivity. + +:Type: Double +:Default: ``0.05`` + + +``paxos_trim_min`` + +:Description: Number of extra proposals tolerated before trimming +:Type: Integer +:Default: ``250`` + + +``paxos_trim_max`` + +:Description: The maximum number of extra proposals to trim at a time +:Type: Integer +:Default: ``500`` + + +``paxos_service_trim_min`` + +:Description: The minimum amount of versions to trigger a trim (0 disables it) +:Type: Integer +:Default: ``250`` + + +``paxos_service_trim_max`` + +:Description: The maximum amount of versions to trim during a single proposal (0 disables it) +:Type: Integer +:Default: ``500`` + + +``paxos service trim max multiplier`` + +:Description: The factor by which paxos service trim max will be multiplied + to get a new upper bound when trim sizes are high (0 disables it) +:Type: Integer +:Default: ``20`` + + +``mon mds force trim to`` + +:Description: Force monitor to trim mdsmaps to this point (0 disables it. + dangerous, use with care) + +:Type: Integer +:Default: ``0`` + + +``mon_osd_force_trim_to`` + +:Description: Force monitor to trim osdmaps to this point, even if there is + PGs not clean at the specified epoch (0 disables it. dangerous, + use with care) + +:Type: Integer +:Default: ``0`` + + +``mon_osd_cache_size`` + +:Description: The size of osdmaps cache, not to rely on underlying store's cache +:Type: Integer +:Default: ``500`` + + +``mon_election_timeout`` + +:Description: On election proposer, maximum waiting time for all ACKs in seconds. +:Type: Float +:Default: ``5.00`` + + +``mon_lease`` + +:Description: The length (in seconds) of the lease on the monitor's versions. +:Type: Float +:Default: ``5.00`` + + +``mon_lease_renew_interval_factor`` + +:Description: ``mon_lease`` \* ``mon_lease_renew_interval_factor`` will be the + interval for the Leader to renew the other monitor's leases. The + factor should be less than ``1.0``. + +:Type: Float +:Default: ``0.60`` + + +``mon_lease_ack_timeout_factor`` + +:Description: The Leader will wait ``mon_lease`` \* ``mon_lease_ack_timeout_factor`` + for the Providers to acknowledge the lease extension. + +:Type: Float +:Default: ``2.00`` + + +``mon_accept_timeout_factor`` + +:Description: The Leader will wait ``mon_lease`` \* ``mon_accept_timeout_factor`` + for the Requester(s) to accept a Paxos update. It is also used + during the Paxos recovery phase for similar purposes. + +:Type: Float +:Default: ``2.00`` + + +``mon_min_osdmap_epochs`` + +:Description: Minimum number of OSD map epochs to keep at all times. +:Type: 32-bit Integer +:Default: ``500`` + + +``mon_max_log_epochs`` + +:Description: Maximum number of Log epochs the monitor should keep. +:Type: 32-bit Integer +:Default: ``500`` + + + +.. index:: Ceph Monitor; clock + +Clock +----- + +Ceph daemons pass critical messages to each other, which must be processed +before daemons reach a timeout threshold. If the clocks in Ceph monitors +are not synchronized, it can lead to a number of anomalies. For example: + +- Daemons ignoring received messages (e.g., timestamps outdated) +- Timeouts triggered too soon/late when a message wasn't received in time. + +See `Monitor Store Synchronization`_ for details. + + +.. 
tip:: You must configure NTP or PTP daemons on your Ceph monitor hosts to + ensure that the monitor cluster operates with synchronized clocks. + It can be advantageous to have monitor hosts sync with each other + as well as with multiple quality upstream time sources. + +Clock drift may still be noticeable with NTP even though the discrepancy is not +yet harmful. Ceph's clock drift / clock skew warnings may get triggered even +though NTP maintains a reasonable level of synchronization. Increasing your +clock drift may be tolerable under such circumstances; however, a number of +factors such as workload, network latency, configuring overrides to default +timeouts and the `Monitor Store Synchronization`_ settings may influence +the level of acceptable clock drift without compromising Paxos guarantees. + +Ceph provides the following tunable options to allow you to find +acceptable values. + + +``mon_tick_interval`` + +:Description: A monitor's tick interval in seconds. +:Type: 32-bit Integer +:Default: ``5`` + + +``mon_clock_drift_allowed`` + +:Description: The clock drift in seconds allowed between monitors. +:Type: Float +:Default: ``0.05`` + + +``mon_clock_drift_warn_backoff`` + +:Description: Exponential backoff for clock drift warnings +:Type: Float +:Default: ``5.00`` + + +``mon_timecheck_interval`` + +:Description: The time check interval (clock drift check) in seconds + for the Leader. + +:Type: Float +:Default: ``300.00`` + + +``mon_timecheck_skew_interval`` + +:Description: The time check interval (clock drift check) in seconds when in + presence of a skew in seconds for the Leader. + +:Type: Float +:Default: ``30.00`` + + +Client +------ + +``mon_client_hunt_interval`` + +:Description: The client will try a new monitor every ``N`` seconds until it + establishes a connection. + +:Type: Double +:Default: ``3.00`` + + +``mon_client_ping_interval`` + +:Description: The client will ping the monitor every ``N`` seconds. +:Type: Double +:Default: ``10.00`` + + +``mon_client_max_log_entries_per_message`` + +:Description: The maximum number of log entries a monitor will generate + per client message. + +:Type: Integer +:Default: ``1000`` + + +``mon_client_bytes`` + +:Description: The amount of client message data allowed in memory (in bytes). +:Type: 64-bit Integer Unsigned +:Default: ``100ul << 20`` + +.. _pool-settings: + +Pool settings +============= + +Since version v0.94 there is support for pool flags which allow or disallow changes to be made to pools. +Monitors can also disallow removal of pools if appropriately configured. The inconvenience of this guardrail +is far outweighed by the number of accidental pool (and thus data) deletions it prevents. + +``mon_allow_pool_delete`` + +:Description: Should monitors allow pools to be removed, regardless of what the pool flags say? + +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_ec_fast_read`` + +:Description: Whether to turn on fast read on the pool or not. It will be used as + the default setting of newly created erasure coded pools if ``fast_read`` + is not specified at create time. + +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_flag_hashpspool`` + +:Description: Set the hashpspool flag on new pools +:Type: Boolean +:Default: ``true`` + + +``osd_pool_default_flag_nodelete`` + +:Description: Set the ``nodelete`` flag on new pools, which prevents pool removal. +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_flag_nopgchange`` + +:Description: Set the ``nopgchange`` flag on new pools. 
Does not allow the number of PGs to be changed. +:Type: Boolean +:Default: ``false`` + + +``osd_pool_default_flag_nosizechange`` + +:Description: Set the ``nosizechange`` flag on new pools. Does not allow the ``size`` to be changed. +:Type: Boolean +:Default: ``false`` + +For more information about the pool flags see `Pool values`_. + +Miscellaneous +============= + +``mon_max_osd`` + +:Description: The maximum number of OSDs allowed in the cluster. +:Type: 32-bit Integer +:Default: ``10000`` + + +``mon_globalid_prealloc`` + +:Description: The number of global IDs to pre-allocate for clients and daemons in the cluster. +:Type: 32-bit Integer +:Default: ``10000`` + + +``mon_subscribe_interval`` + +:Description: The refresh interval (in seconds) for subscriptions. The + subscription mechanism enables obtaining cluster maps + and log information. + +:Type: Double +:Default: ``86400.00`` + + +``mon_stat_smooth_intervals`` + +:Description: Ceph will smooth statistics over the last ``N`` PG maps. +:Type: Integer +:Default: ``6`` + + +``mon_probe_timeout`` + +:Description: Number of seconds the monitor will wait to find peers before bootstrapping. +:Type: Double +:Default: ``2.00`` + + +``mon_daemon_bytes`` + +:Description: The message memory cap for metadata server and OSD messages (in bytes). +:Type: 64-bit Integer Unsigned +:Default: ``400ul << 20`` + + +``mon_max_log_entries_per_event`` + +:Description: The maximum number of log entries per event. +:Type: Integer +:Default: ``4096`` + + +``mon_osd_prime_pg_temp`` + +:Description: Enables or disables priming the PGMap with the previous OSDs when an ``out`` + OSD comes back into the cluster. With the ``true`` setting, clients + will continue to use the previous OSDs until the newly ``in`` OSDs for + a PG have peered. + +:Type: Boolean +:Default: ``true`` + + +``mon_osd_prime pg temp max time`` + +:Description: How much time in seconds the monitor should spend trying to prime the + PGMap when an out OSD comes back into the cluster. + +:Type: Float +:Default: ``0.50`` + + +``mon_osd_prime_pg_temp_max_time_estimate`` + +:Description: Maximum estimate of time spent on each PG before we prime all PGs + in parallel. + +:Type: Float +:Default: ``0.25`` + + +``mon_mds_skip_sanity`` + +:Description: Skip safety assertions on FSMap (in case of bugs where we want to + continue anyway). Monitor terminates if the FSMap sanity check + fails, but we can disable it by enabling this option. + +:Type: Boolean +:Default: ``False`` + + +``mon_max_mdsmap_epochs`` + +:Description: The maximum number of mdsmap epochs to trim during a single proposal. +:Type: Integer +:Default: ``500`` + + +``mon_config_key_max_entry_size`` + +:Description: The maximum size of config-key entry (in bytes) +:Type: Integer +:Default: ``65536`` + + +``mon_scrub_interval`` + +:Description: How often the monitor scrubs its store by comparing + the stored checksums with the computed ones for all stored + keys. (0 disables it. dangerous, use with care) + +:Type: Seconds +:Default: ``1 day`` + + +``mon_scrub_max_keys`` + +:Description: The maximum number of keys to scrub each time. +:Type: Integer +:Default: ``100`` + + +``mon_compact_on_start`` + +:Description: Compact the database used as Ceph Monitor store on + ``ceph-mon`` start. A manual compaction helps to shrink the + monitor database and improve the performance of it if the regular + compaction fails to work. 
:Type: Boolean
:Default: ``False``


``mon_compact_on_bootstrap``

:Description: Compact the database used as Ceph Monitor store
              on bootstrap. Monitors probe each other to establish
              a quorum after bootstrap. If a monitor times out before joining
              the quorum, it will start over and bootstrap again.

:Type: Boolean
:Default: ``False``


``mon_compact_on_trim``

:Description: Compact a given prefix (including ``paxos``) when its old states
              are trimmed.
:Type: Boolean
:Default: ``True``


``mon_cpu_threads``

:Description: The number of threads used for CPU-intensive work on the monitor.
:Type: Integer
:Default: ``4``


``mon_osd_mapping_pgs_per_chunk``

:Description: The mapping from placement groups to OSDs is calculated in
              chunks. This option specifies the number of placement groups per
              chunk.

:Type: Integer
:Default: ``4096``


``mon_session_timeout``

:Description: The monitor will terminate sessions that stay idle longer than
              this time limit.

:Type: Integer
:Default: ``300``


``mon_osd_cache_size_min``

:Description: The minimum number of bytes to keep mapped in memory for OSD
              monitor caches.

:Type: 64-bit Integer
:Default: ``134217728``


``mon_memory_target``

:Description: The number of bytes of OSD monitor cache and KV cache to keep
              mapped in memory when cache auto-tuning is enabled.

:Type: 64-bit Integer
:Default: ``2147483648``


``mon_memory_autotune``

:Description: Automatically tune the cache memory used for the OSD monitors
              and the KV database.

:Type: Boolean
:Default: ``True``


.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
.. _Monitor Keyrings: ../../../dev/mon-bootstrap#secret-keys
.. _Ceph configuration file: ../ceph-conf/#monitors
.. _Network Configuration Reference: ../network-config-ref
.. _Monitor lookup through DNS: ../mon-lookup-dns
.. _ACID: https://en.wikipedia.org/wiki/ACID
.. _Adding/Removing a Monitor: ../../operations/add-or-rm-mons
.. _Monitoring a Cluster: ../../operations/monitoring
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg
.. _Bootstrapping a Monitor: ../../../dev/mon-bootstrap
.. _Changing a Monitor's IP Address: ../../operations/add-or-rm-mons#changing-a-monitor-s-ip-address
.. _Monitor/OSD Interaction: ../mon-osd-interaction
.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability
.. _Pool values: ../../operations/pools/#set-pool-values
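For reference, a consolidated sketch of a ``[mon]`` section touching several of
the options described above. The values are illustrative assumptions, not
recommendations; where an option is left unset, the defaults listed in the
corresponding entries apply.

.. code-block:: ini

    [mon]
    # Warn while more free space remains on the monitor's filesystem
    # (default threshold: 30% available).
    mon_data_avail_warn = 40
    # Tolerate slightly more clock skew before warning (default: 0.05 s).
    mon_clock_drift_allowed = 0.10
    # Keep pool deletion disabled as a guardrail (the default).
    mon_allow_pool_delete = false
    # Compact the monitor store whenever ceph-mon starts (default: false).
    mon_compact_on_start = true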