Diffstat (limited to 'src/common/options/global.yaml.in')
-rw-r--r-- | src/common/options/global.yaml.in | 6396 |
1 files changed, 6396 insertions, 0 deletions
diff --git a/src/common/options/global.yaml.in b/src/common/options/global.yaml.in new file mode 100644 index 000000000..fa426a115 --- /dev/null +++ b/src/common/options/global.yaml.in @@ -0,0 +1,6396 @@ +# -*- mode: YAML -*- +--- + +options: +- name: host + type: str + level: basic + desc: local hostname + long_desc: if blank, ceph assumes the short hostname (hostname -s) + tags: + - network + services: + - common + flags: + - no_mon_update + with_legacy: true +- name: fsid + type: uuid + level: basic + desc: cluster fsid (uuid) + fmt_desc: The cluster ID. One per cluster. + May be generated by a deployment tool if not specified. + note: Do not set this value if you use a deployment tool that does + it for you. + tags: + - service + services: + - common + flags: + - no_mon_update + - startup +- name: public_addr + type: addr + level: basic + desc: public-facing address to bind to + fmt_desc: The IP address for the public (front-side) network. + Set for each daemon. + services: + - mon + - mds + - osd + - mgr + flags: + - startup + with_legacy: true +- name: public_addrv + type: addrvec + level: basic + desc: public-facing address to bind to + services: + - mon + - mds + - osd + - mgr + flags: + - startup + with_legacy: true +- name: public_bind_addr + type: addr + level: advanced + services: + - mon + flags: + - startup + fmt_desc: In some dynamic deployments the Ceph MON daemon might bind + to an IP address locally that is different from the ``public_addr`` + advertised to other peers in the network. The environment must ensure + that routing rules are set correctly. If ``public_bind_addr`` is set + the Ceph Monitor daemon will bind to it locally and use ``public_addr`` + in the monmaps to advertise its address to peers. This behavior is limited + to the Monitor daemon. + with_legacy: true +- name: cluster_addr + type: addr + level: basic + desc: cluster-facing address to bind to + fmt_desc: The IP address for the cluster (back-side) network. 
+ Set for each daemon. + tags: + - network + services: + - osd + flags: + - startup + with_legacy: true +- name: public_network + type: str + level: advanced + desc: Network(s) from which to choose a public address to bind to + fmt_desc: The IP address and netmask of the public (front-side) network + (e.g., ``192.168.0.0/24``). Set in ``[global]``. You may specify + comma-separated subnets. The format of it looks like + ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` + tags: + - network + services: + - mon + - mds + - osd + - mgr + flags: + - startup + with_legacy: true +- name: public_network_interface + type: str + level: advanced + desc: Interface name(s) from which to choose an address from a public_network to + bind to; public_network must also be specified. + tags: + - network + services: + - mon + - mds + - osd + - mgr + see_also: + - public_network + flags: + - startup +- name: cluster_network + type: str + level: advanced + desc: Network(s) from which to choose a cluster address to bind to + fmt_desc: The IP address and netmask of the cluster (back-side) network + (e.g., ``10.0.0.0/24``). Set in ``[global]``. You may specify + comma-separated subnets. The format of it looks like + ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` + tags: + - network + services: + - osd + flags: + - startup + with_legacy: true +- name: cluster_network_interface + type: str + level: advanced + desc: Interface name(s) from which to choose an address from a cluster_network to + bind to; cluster_network must also be specified. + tags: + - network + services: + - mon + - mds + - osd + - mgr + see_also: + - cluster_network + flags: + - startup +- name: monmap + type: str + level: advanced + desc: path to MonMap file + long_desc: This option is normally used during mkfs, but can also be used to identify + which monitors to connect to. 
+ services: + - mon + flags: + - no_mon_update + - create +- name: mon_host + type: str + level: basic + desc: list of hosts or addresses to search for a monitor + long_desc: This is a list of IP addresses or hostnames that are separated by commas, whitespace, or semicolons. Hostnames are resolved via DNS. All A and AAAA records are included in the search list. + services: + - common + flags: + - no_mon_update + - startup +- name: mon_host_override + type: str + level: advanced + desc: monitor(s) to use overriding the MonMap + fmt_desc: This is the list of monitors that the Ceph process **initially** contacts when first establishing communication with the Ceph cluster. This overrides the known monitor list that is derived from MonMap updates sent to older Ceph instances (like librados cluster handles). This option is expected to be useful primarily for debugging. + services: + - common + flags: + - no_mon_update + - startup +- name: mon_dns_srv_name + type: str + level: advanced + desc: name of DNS SRV record to check for monitor addresses + fmt_desc: the service name used querying the DNS for the monitor hosts/addresses + default: ceph-mon + tags: + - network + services: + - common + see_also: + - mon_host + flags: + - startup +- name: container_image + type: str + level: basic + desc: container image (used by cephadm orchestrator) + default: docker.io/ceph/daemon-base:latest-master-devel + flags: + - startup +- name: no_config_file + type: bool + level: advanced + desc: signal that we don't require a config file to be present + long_desc: When specified, we won't be looking for a configuration file, and will + instead expect that whatever options or values are required for us to work will + be passed as arguments. 
+ default: false + tags: + - config + services: + - common + flags: + - no_mon_update + - startup +- name: lockdep + type: bool + level: dev + desc: enable lockdep lock dependency analyzer + default: false + services: + - common + flags: + - no_mon_update + - startup + with_legacy: true +- name: lockdep_force_backtrace + type: bool + level: dev + desc: always gather current backtrace at every lock + default: false + services: + - common + see_also: + - lockdep + flags: + - startup + with_legacy: true +- name: run_dir + type: str + level: advanced + desc: path for the 'run' directory for storing pid and socket files + default: /var/run/ceph + services: + - common + see_also: + - admin_socket + flags: + - startup + with_legacy: true +- name: admin_socket + type: str + level: advanced + desc: path for the runtime control socket file, used by the 'ceph daemon' command + fmt_desc: The socket for executing administrative commands on a daemon, + irrespective of whether Ceph Monitors have established a quorum. + daemon_default: $run_dir/$cluster-$name.asok + services: + - common + flags: + - startup + # default changed by common_preinit() + with_legacy: true +- name: admin_socket_mode + type: str + level: advanced + desc: file mode to set for the admin socket file, e.g, '0755' + services: + - common + see_also: + - admin_socket + flags: + - startup + with_legacy: true +- name: daemonize + type: bool + level: advanced + desc: whether to daemonize (background) after startup + default: false + daemon_default: true + tags: + - service + services: + - mon + - mgr + - osd + - mds + see_also: + - pid_file + - chdir + flags: + - no_mon_update + - startup + # default changed by common_preinit() + with_legacy: true +- name: setuser + type: str + level: advanced + desc: uid or user name to switch to on startup + long_desc: This is normally specified by the systemd unit file. 
+ tags: + - service + services: + - mon + - mgr + - osd + - mds + see_also: + - setgroup + flags: + - startup + with_legacy: true +- name: setgroup + type: str + level: advanced + desc: gid or group name to switch to on startup + long_desc: This is normally specified by the systemd unit file. + tags: + - service + services: + - mon + - mgr + - osd + - mds + see_also: + - setuser + flags: + - startup + with_legacy: true +- name: setuser_match_path + type: str + level: advanced + desc: if set, setuser/setgroup is condition on this path matching ownership + long_desc: If setuser or setgroup are specified, and this option is non-empty, then + the uid/gid of the daemon will only be changed if the file or directory specified + by this option has a matching uid and/or gid. This exists primarily to allow + switching to user ceph for OSDs to be conditional on whether the osd data contents + have also been chowned after an upgrade. This is normally specified by the systemd + unit file. + tags: + - service + services: + - mon + - mgr + - osd + - mds + see_also: + - setuser + - setgroup + flags: + - startup + with_legacy: true +- name: pid_file + type: str + level: advanced + desc: path to write a pid file (if any) + fmt_desc: The file in which the mon, osd or mds will write its + PID. For instance, ``/var/run/$cluster/$type.$id.pid`` + will create /var/run/ceph/mon.a.pid for the ``mon`` with + id ``a`` running in the ``ceph`` cluster. The ``pid + file`` is removed when the daemon stops gracefully. If + the process is not daemonized (i.e. runs with the ``-f`` + or ``-d`` option), the ``pid file`` is not created. + tags: + - service + services: + - mon + - mgr + - osd + - mds + flags: + - startup + with_legacy: true +- name: chdir + type: str + level: advanced + desc: path to chdir(2) to after daemonizing + fmt_desc: The directory Ceph daemons change to once they are + up and running. Default ``/`` directory recommended. 
+ tags: + - service + services: + - mon + - mgr + - osd + - mds + see_also: + - daemonize + flags: + - no_mon_update + - startup + with_legacy: true +- name: fatal_signal_handlers + type: bool + level: advanced + desc: whether to register signal handlers for SIGABRT etc that dump a stack trace + long_desc: This is normally true for daemons and values for libraries. + fmt_desc: If set, we will install signal handlers for SEGV, ABRT, BUS, ILL, + FPE, XCPU, XFSZ, SYS signals to generate a useful log message + default: true + tags: + - service + services: + - mon + - mgr + - osd + - mds + flags: + - startup + with_legacy: true +- name: crash_dir + type: str + level: advanced + desc: Directory where crash reports are archived + default: /var/lib/ceph/crash + flags: + - startup + with_legacy: true +- name: restapi_log_level + type: str + level: advanced + desc: default set by python code + with_legacy: true +- name: restapi_base_url + type: str + level: advanced + desc: default set by python code + with_legacy: true +- name: erasure_code_dir + type: str + level: advanced + desc: directory where erasure-code plugins can be found + default: @CEPH_INSTALL_FULL_PKGLIBDIR@/erasure-code + services: + - mon + - osd + flags: + - startup + with_legacy: true +- name: log_file + type: str + level: basic + desc: path to log file + fmt_desc: The location of the logging file for your cluster. + daemon_default: /var/log/ceph/$cluster-$name.log + see_also: + - log_to_file + - log_to_stderr + - err_to_stderr + - log_to_syslog + - err_to_syslog + # default changed by common_preinit() + with_legacy: true +- name: log_max_new + type: int + level: advanced + desc: max unwritten log entries to allow before waiting to flush to the log + fmt_desc: The maximum number of new log files. 
+ default: 1000 + see_also: + - log_max_recent + # default changed by common_preinit() + with_legacy: true +- name: log_max_recent + type: int + level: advanced + desc: recent log entries to keep in memory to dump in the event of a crash + long_desc: The purpose of this option is to log at a higher debug level only to + the in-memory buffer, and write out the detailed log messages only if there is + a crash. Only log entries below the lower log level will be written unconditionally + to the log. For example, debug_osd=1/5 will write everything <= 1 to the log + unconditionally but keep entries at levels 2-5 in memory. If there is a seg fault + or assertion failure, all entries will be dumped to the log. + min: 1 + default: 500 + daemon_default: 10000 + # default changed by common_preinit() + with_legacy: true +- name: log_to_file + type: bool + level: basic + desc: send log lines to a file + fmt_desc: Determines if logging messages should appear in a file. + default: true + see_also: + - log_file + with_legacy: true +- name: log_to_stderr + type: bool + level: basic + desc: send log lines to stderr + fmt_desc: Determines if logging messages should appear in ``stderr``. + default: true + daemon_default: false + with_legacy: true +- name: err_to_stderr + type: bool + level: basic + desc: send critical error log lines to stderr + fmt_desc: Determines if error messages should appear in ``stderr``. + default: false + daemon_default: true + with_legacy: true +- name: log_stderr_prefix + type: str + level: advanced + desc: String to prefix log messages with when sent to stderr + long_desc: This is useful in container environments when combined with mon_cluster_log_to_stderr. The + mon log prefixes each line with the channel name (e.g., 'default', 'audit'), while + log_stderr_prefix can be set to 'debug '. 
+ see_also: + - mon_cluster_log_to_stderr +- name: log_to_syslog + type: bool + level: basic + desc: send log lines to syslog facility + fmt_desc: Determines if logging messages should appear in ``syslog``. + default: false + with_legacy: true +- name: err_to_syslog + type: bool + level: basic + desc: send critical error log lines to syslog facility + fmt_desc: Determines if error messages should appear in ``syslog``. + default: false + with_legacy: true +- name: log_flush_on_exit + type: bool + level: advanced + desc: set a process exit handler to ensure the log is flushed on exit + fmt_desc: Determines if Ceph should flush the log files after exit. + default: false + with_legacy: true +- name: log_stop_at_utilization + type: float + level: basic + desc: stop writing to the log file when device utilization reaches this ratio + default: 0.97 + see_also: + - log_file + min: 0 + max: 1 + with_legacy: true +- name: log_to_graylog + type: bool + level: basic + desc: send log lines to remote graylog server + default: false + see_also: + - err_to_graylog + - log_graylog_host + - log_graylog_port + with_legacy: true +- name: err_to_graylog + type: bool + level: basic + desc: send critical error log lines to remote graylog server + default: false + see_also: + - log_to_graylog + - log_graylog_host + - log_graylog_port + with_legacy: true +- name: log_graylog_host + type: str + level: basic + desc: address or hostname of graylog server to log to + default: 127.0.0.1 + see_also: + - log_to_graylog + - err_to_graylog + - log_graylog_port + with_legacy: true +- name: log_graylog_port + type: int + level: basic + desc: port number for the remote graylog server + default: 12201 + see_also: + - log_graylog_host + with_legacy: true +- name: log_to_journald + type: bool + level: basic + desc: send log lines to journald + default: false + see_also: + - err_to_journald +- name: err_to_journald + type: bool + level: basic + desc: send critical error log lines to journald + default: 
false + see_also: + - log_to_journald +- name: log_coarse_timestamps + type: bool + level: advanced + desc: timestamp log entries from coarse system clock to improve performance + default: true + tags: + - performance + - service + services: + - common +# options will take k/v pairs, or single-item that will be assumed as general +# default for all, regardless of channel. +# e.g., "info" would be taken as the same as "default=info" +# also, "default=daemon audit=local0" would mean +# "default all to 'daemon', override 'audit' with 'local0' +- name: clog_to_monitors + type: str + level: advanced + desc: Make daemons send cluster log messages to monitors + fmt_desc: Determines if ``clog`` messages should be sent to monitors. + default: default=true + flags: + - runtime + with_legacy: true + services: + - mgr + - osd + - mds +- name: clog_to_syslog + type: str + level: advanced + desc: Make daemons send cluster log messages to syslog + fmt_desc: Determines if ``clog`` messages should be sent to syslog. 
+ default: 'false' + flags: + - runtime + with_legacy: true + services: + - mon + - mgr + - osd + - mds +- name: clog_to_syslog_level + type: str + level: advanced + desc: Syslog level for cluster log messages + default: info + see_also: + - clog_to_syslog + flags: + - runtime + with_legacy: true + services: + - mon + - mgr + - osd + - mds +- name: clog_to_syslog_facility + type: str + level: advanced + desc: Syslog facility for cluster log messages + default: default=daemon audit=local0 + see_also: + - clog_to_syslog + flags: + - runtime + with_legacy: true + services: + - mon + - mgr + - osd + - mds +- name: clog_to_graylog + type: str + level: advanced + desc: Make daemons send cluster log to graylog + default: 'false' + flags: + - runtime + services: + - mon + - mgr + - osd + - mds +- name: clog_to_graylog_host + type: str + level: advanced + desc: Graylog host to cluster log messages + default: 127.0.0.1 + see_also: + - clog_to_graylog + flags: + - runtime + with_legacy: true + services: + - mon + - mgr + - osd + - mds +- name: clog_to_graylog_port + type: str + level: advanced + desc: Graylog port number for cluster log messages + default: '12201' + see_also: + - clog_to_graylog + flags: + - runtime + with_legacy: true + services: + - mon + - mgr + - osd + - mds +- name: enable_experimental_unrecoverable_data_corrupting_features + type: str + level: advanced + desc: Enable named (or all with '*') experimental features that may be untested, + dangerous, and/or cause permanent data loss + flags: + - runtime + with_legacy: true +- name: plugin_dir + type: str + level: advanced + desc: Base directory for dynamically loaded plugins + default: @CEPH_INSTALL_FULL_PKGLIBDIR@ + services: + - mon + - osd + flags: + - startup +- name: compressor_zlib_isal + type: bool + level: advanced + desc: Use Intel ISA-L accelerated zlib implementation if available + default: false + with_legacy: true +# regular zlib compression level, not applicable to isa-l optimized version +- 
name: compressor_zlib_level + type: int + level: advanced + desc: Zlib compression level to use + default: 5 + with_legacy: true +# regular zlib compression winsize, not applicable to isa-l optimized version +- name: compressor_zlib_winsize + type: int + level: advanced + desc: Zlib compression winsize to use + default: -15 + min: -15 + max: 32 + with_legacy: true +# regular zstd compression level +- name: compressor_zstd_level + type: int + level: advanced + desc: Zstd compression level to use + default: 1 + with_legacy: true +- name: qat_compressor_enabled + type: bool + level: advanced + desc: Enable Intel QAT acceleration support for compression if available + default: false + with_legacy: true +- name: qat_compressor_session_max_number + type: uint + level: advanced + desc: Set the maximum number of session within Qatzip when using QAT compressor + default: 256 +- name: plugin_crypto_accelerator + type: str + level: advanced + desc: Crypto accelerator library to use + default: crypto_isal + with_legacy: true +- name: openssl_engine_opts + type: str + level: advanced + desc: Use engine for specific openssl algorithm + long_desc: 'Pass opts in this way: engine_id=engine1,dynamic_path=/some/path/engine1.so,default_algorithms=DIGESTS:engine_id=engine2,dynamic_path=/some/path/engine2.so,default_algorithms=CIPHERS,other_ctrl=other_value' + flags: + - startup + with_legacy: true +- name: mempool_debug + type: bool + level: dev + default: false + flags: + - no_mon_update + with_legacy: true +- name: thp + type: bool + level: dev + desc: enable transparent huge page (THP) support + long_desc: Ceph is known to suffer from memory fragmentation due to THP use. This + is indicated by RSS usage above configured memory targets. Enabling THP is currently + discouraged until selective use of THP by Ceph is implemented. 
+ default: false + flags: + - startup +- name: key + type: str + level: advanced + desc: Authentication key + long_desc: A CephX authentication key, base64 encoded. It normally looks something + like 'AQAtut9ZdMbNJBAAHz6yBAWyJyz2yYRyeMWDag=='. + fmt_desc: The key (i.e., the text string of the key itself). Not recommended. + see_also: + - keyfile + - keyring + flags: + - no_mon_update + - startup + with_legacy: true +- name: keyfile + type: str + level: advanced + desc: Path to a file containing a key + long_desc: The file should contain a CephX authentication key and optionally a trailing + newline, but nothing else. + fmt_desc: The path to a key file (i.e., a file containing only the key). + see_also: + - key + flags: + - no_mon_update + - startup + with_legacy: true +- name: keyring + type: str + level: advanced + desc: Path to a keyring file. + long_desc: A keyring file is an INI-style formatted file where the section names + are client or daemon names (e.g., 'osd.0') and each section contains a 'key' property + with CephX authentication key as the value. + # please note, the documentation is generated without access to the CMake + # variables, so please update the documentation manually with a representative + # default value using the ":default:" option of ".. confval::" directive. + default: @keyring_paths@ + see_also: + - key + - keyfile + flags: + - no_mon_update + - startup + with_legacy: true +- name: heartbeat_interval + type: int + level: advanced + desc: Frequency of internal heartbeat checks (seconds) + default: 5 + flags: + - startup + with_legacy: true +- name: heartbeat_file + type: str + level: advanced + desc: File to touch on successful internal heartbeat + long_desc: If set, this file will be touched every time an internal heartbeat check + succeeds. 
+ see_also: + - heartbeat_interval + flags: + - startup + with_legacy: true +- name: heartbeat_inject_failure + type: int + level: dev + default: 0 + with_legacy: true +- name: perf + type: bool + level: advanced + desc: Enable internal performance metrics + long_desc: If enabled, collect and expose internal health metrics + default: true + with_legacy: true +- name: ms_type + type: str + level: advanced + desc: Messenger implementation to use for network communication + fmt_desc: Transport type used by Async Messenger. Can be ``async+posix``, + ``async+dpdk`` or ``async+rdma``. Posix uses standard TCP/IP networking and is + default. Other transports may be experimental and support may be limited. + default: async+posix + flags: + - startup + with_legacy: true +- name: ms_public_type + type: str + level: advanced + desc: Messenger implementation to use for the public network + long_desc: If not specified, use ms_type + see_also: + - ms_type + flags: + - startup + with_legacy: true +- name: ms_cluster_type + type: str + level: advanced + desc: Messenger implementation to use for the internal cluster network + long_desc: If not specified, use ms_type + see_also: + - ms_type + flags: + - startup + with_legacy: true +- name: ms_mon_cluster_mode + type: str + level: basic + desc: Connection modes (crc, secure) for intra-mon connections in order of preference + fmt_desc: the connection mode (or permitted modes) to use between monitors. + default: secure crc + see_also: + - ms_mon_service_mode + - ms_mon_client_mode + - ms_service_mode + - ms_cluster_mode + - ms_client_mode + flags: + - startup +- name: ms_mon_service_mode + type: str + level: basic + desc: Allowed connection modes (crc, secure) for connections to mons + fmt_desc: a list of permitted modes for clients or + other Ceph daemons to use when connecting to monitors. 
+ default: secure crc + see_also: + - ms_service_mode + - ms_mon_cluster_mode + - ms_mon_client_mode + - ms_cluster_mode + - ms_client_mode + flags: + - startup +- name: ms_mon_client_mode + type: str + level: basic + desc: Connection modes (crc, secure) for connections from clients to monitors in + order of preference + fmt_desc: a list of connection modes, in order of + preference, for clients or non-monitor daemons to use when + connecting to monitors. + default: secure crc + see_also: + - ms_mon_service_mode + - ms_mon_cluster_mode + - ms_service_mode + - ms_cluster_mode + - ms_client_mode + flags: + - startup +- name: ms_cluster_mode + type: str + level: basic + desc: Connection modes (crc, secure) for intra-cluster connections in order of preference + fmt_desc: connection mode (or permitted modes) used + for intra-cluster communication between Ceph daemons. If multiple + modes are listed, the modes listed first are preferred. + default: crc secure + see_also: + - ms_service_mode + - ms_client_mode + flags: + - startup +- name: ms_service_mode + type: str + level: basic + desc: Allowed connection modes (crc, secure) for connections to daemons + fmt_desc: a list of permitted modes for clients to use + when connecting to the cluster. + default: crc secure + see_also: + - ms_cluster_mode + - ms_client_mode + flags: + - startup +- name: ms_client_mode + type: str + level: basic + desc: Connection modes (crc, secure) for connections from clients in order of preference + fmt_desc: a list of connection modes, in order of + preference, for clients to use (or allow) when talking to a Ceph + cluster. 
+ default: crc secure + see_also: + - ms_cluster_mode + - ms_service_mode + flags: + - startup +- name: ms_osd_compress_mode + type: str + level: advanced + desc: Compression policy to use in Messenger for communicating with OSD + default: none + services: + - osd + enum_values: + - none + - force + see_also: + - ms_compress_secure + flags: + - runtime +- name: ms_osd_compress_min_size + type: uint + level: advanced + desc: Minimal message size eligible for on-wire compression + default: 1_K + services: + - osd + see_also: + - ms_osd_compress_mode + flags: + - runtime +- name: ms_osd_compression_algorithm + type: str + level: advanced + desc: Compression algorithm to use in Messenger when communicating with OSD + long_desc: Compression algorithm for connections with OSD in order of preference. + Although the default value is set to snappy, a list + (like snappy zlib zstd etc.) is acceptable as well. + default: snappy + services: + - osd + see_also: + - ms_osd_compress_mode + flags: + - runtime +- name: ms_compress_secure + type: bool + level: advanced + desc: Allowing compression when on-wire encryption is enabled + long_desc: Combining encryption with compression reduces the level of security of + messages between peers. In case both encryption and compression are enabled, + compression setting will be ignored and message will not be compressed. + This behaviour can be overridden using this setting. + default: false + see_also: + - ms_osd_compress_mode + flags: + - runtime +- name: ms_learn_addr_from_peer + type: bool + level: advanced + desc: Learn address from what IP our first peer thinks we connect from + long_desc: Use the IP address our first peer (usually a monitor) sees that we are + connecting from. This is useful if a client is behind some sort of NAT and we + want to see it identified by its local (not NATed) address. 
+ default: true + with_legacy: true +- name: ms_tcp_nodelay + type: bool + level: advanced + desc: Disable Nagle's algorithm and send queued network traffic immediately + fmt_desc: Ceph enables ``ms_tcp_nodelay`` so that each request is sent + immediately (no buffering). Disabling `Nagle's algorithm`_ + increases network traffic, which can introduce latency. If you + experience large numbers of small packets, you may try + disabling ``ms_tcp_nodelay``. + default: true + with_legacy: true +- name: ms_tcp_rcvbuf + type: size + level: advanced + desc: Size of TCP socket receive buffer + fmt_desc: The size of the socket buffer on the receiving end of a network + connection. Disabled by default. + default: 0 + with_legacy: true +- name: ms_tcp_prefetch_max_size + type: size + level: advanced + desc: Maximum amount of data to prefetch out of the socket receive buffer + default: 4_K + with_legacy: true +- name: ms_initial_backoff + type: float + level: advanced + desc: Initial backoff after a network error is detected (seconds) + fmt_desc: The initial time to wait before reconnecting on a fault. + default: 0.2 + with_legacy: true +- name: ms_max_backoff + type: float + level: advanced + desc: Maximum backoff after a network error before retrying (seconds) + fmt_desc: The maximum time to wait before reconnecting on a fault. + default: 15 + see_also: + - ms_initial_backoff + with_legacy: true +- name: ms_crc_data + type: bool + level: dev + desc: Set and/or verify crc32c checksum on data payload sent over network + default: true + with_legacy: true +- name: ms_crc_header + type: bool + level: dev + desc: Set and/or verify crc32c checksum on header payload sent over network + default: true + with_legacy: true +- name: ms_die_on_bad_msg + type: bool + level: dev + desc: Induce a daemon crash/exit when a bad network message is received + fmt_desc: Debug option; do not configure. 
+ default: false + with_legacy: true +- name: ms_die_on_unhandled_msg + type: bool + level: dev + desc: Induce a daemon crash/exit when an unrecognized message is received + default: false + with_legacy: true +- name: ms_die_on_old_message + type: bool + level: dev + desc: Induce a daemon crash/exit when an old, undecodable message is received + default: false + with_legacy: true +- name: ms_die_on_skipped_message + type: bool + level: dev + desc: Induce a daemon crash/exit if sender skips a message sequence number + default: false + with_legacy: true +- name: ms_die_on_bug + type: bool + level: dev + desc: Induce a crash/exit on various bugs (for testing purposes) + default: false + with_legacy: true +- name: ms_dispatch_throttle_bytes + type: size + level: advanced + desc: Limit messages that are read off the network but still being processed + fmt_desc: Throttles total size of messages waiting to be dispatched. + default: 100_M + with_legacy: true +- name: ms_bind_ipv4 + type: bool + level: advanced + desc: Bind servers to IPv4 address(es) + fmt_desc: Enables Ceph daemons to bind to IPv4 addresses. + default: true + see_also: + - ms_bind_ipv6 +- name: ms_bind_ipv6 + type: bool + level: advanced + desc: Bind servers to IPv6 address(es) + fmt_desc: Enables Ceph daemons to bind to IPv6 addresses. 
+ default: false + see_also: + - ms_bind_ipv4 + with_legacy: true +- name: ms_bind_prefer_ipv4 + type: bool + level: advanced + desc: Prefer IPv4 over IPv6 address(es) + default: false +- name: ms_bind_msgr1 + type: bool + level: advanced + desc: Bind servers to msgr1 (legacy) protocol address(es) + default: true + see_also: + - ms_bind_msgr2 +- name: ms_bind_msgr2 + type: bool + level: advanced + desc: Bind servers to msgr2 (nautilus+) protocol address(es) + default: true + see_also: + - ms_bind_msgr1 +- name: ms_bind_port_min + type: int + level: advanced + desc: Lowest port number to bind daemon(s) to + fmt_desc: The minimum port number to which an OSD or MDS daemon will bind. + default: 6800 + with_legacy: true +- name: ms_bind_port_max + type: int + level: advanced + desc: Highest port number to bind daemon(s) to + fmt_desc: The maximum port number to which an OSD or MDS daemon will bind. + default: 7568 + with_legacy: true +# FreeBSD does not use SO_REUSEADDR so allow for a bit more time per default +- name: ms_bind_retry_count + type: int + level: advanced + desc: Number of attempts to make while bind(2)ing to a port + default: @ms_bind_retry_count@ + with_legacy: true +# FreeBSD does not use SO_REUSEADDR so allow for a bit more time per default +- name: ms_bind_retry_delay + type: int + level: advanced + desc: Delay between bind(2) attempts (seconds) + default: @ms_bind_retry_delay@ + with_legacy: true +- name: ms_bind_before_connect + type: bool + level: advanced + desc: Call bind(2) on client sockets + default: false + with_legacy: true +- name: ms_tcp_listen_backlog + type: int + level: advanced + desc: Size of queue of incoming connections for accept(2) + default: 512 + with_legacy: true +- name: ms_connection_ready_timeout + type: uint + level: advanced + desc: Time before we declare a not yet ready connection as dead (seconds) + default: 10 + with_legacy: true +- name: ms_connection_idle_timeout + type: uint + level: advanced + desc: Time before an 
idle connection is closed (seconds) + default: 900 + with_legacy: true +- name: ms_pq_max_tokens_per_priority + type: uint + level: dev + default: 16_M + with_legacy: true +- name: ms_pq_min_cost + type: size + level: dev + default: 64_K + with_legacy: true +- name: ms_inject_socket_failures + type: uint + level: dev + desc: Inject a socket failure every Nth socket operation + fmt_desc: Debug option; do not configure. + default: 0 + with_legacy: true +- name: ms_inject_delay_type + type: str + level: dev + desc: Entity type to inject delays for + flags: + - runtime + with_legacy: true +- name: ms_inject_delay_max + type: float + level: dev + desc: Max delay to inject + default: 1 + with_legacy: true +- name: ms_inject_delay_probability + type: float + level: dev + default: 0 + with_legacy: true +- name: ms_inject_internal_delays + type: float + level: dev + desc: Inject various internal delays to induce races (seconds) + default: 0 + with_legacy: true +- name: ms_inject_network_congestion + type: uint + level: dev + desc: Inject a network congestions that stuck with N times operations + default: 0 + with_legacy: true +- name: ms_blackhole_osd + type: bool + level: dev + default: false + with_legacy: true +- name: ms_blackhole_mon + type: bool + level: dev + default: false + with_legacy: true +- name: ms_blackhole_mds + type: bool + level: dev + default: false + with_legacy: true +- name: ms_blackhole_mgr + type: bool + level: dev + default: false + with_legacy: true +- name: ms_blackhole_client + type: bool + level: dev + default: false + with_legacy: true +- name: ms_dump_on_send + type: bool + level: advanced + desc: Hexdump message to debug log on message send + default: false + with_legacy: true +- name: ms_dump_corrupt_message_level + type: int + level: advanced + desc: Log level at which to hexdump corrupt messages we receive + default: 1 + with_legacy: true +# number of worker processing threads for async messenger created on init +- name: 
ms_async_op_threads + type: uint + level: advanced + desc: Threadpool size for AsyncMessenger (ms_type=async) + fmt_desc: Initial number of worker threads used by each Async Messenger instance. + Should be at least equal to highest number of replicas, but you can + decrease it if you are low on CPU core count and/or you host a lot of + OSDs on single server. + default: 3 + min: 1 + max: 24 + with_legacy: true +- name: ms_async_reap_threshold + type: uint + level: dev + desc: number of deleted connections before we reap + default: 5 + min: 1 + with_legacy: true +- name: ms_async_rdma_device_name + type: str + level: advanced + with_legacy: true +- name: ms_async_rdma_enable_hugepage + type: bool + level: advanced + default: false + with_legacy: true +- name: ms_async_rdma_buffer_size + type: size + level: advanced + default: 128_K + with_legacy: true +- name: ms_async_rdma_send_buffers + type: uint + level: advanced + default: 1_K + with_legacy: true +# size of the receive buffer pool, 0 is unlimited +- name: ms_async_rdma_receive_buffers + type: uint + level: advanced + default: 32_K + with_legacy: true +# max number of wr in srq +- name: ms_async_rdma_receive_queue_len + type: uint + level: advanced + default: 4_K + with_legacy: true +# support srq +- name: ms_async_rdma_support_srq + type: bool + level: advanced + default: true + with_legacy: true +- name: ms_async_rdma_port_num + type: uint + level: advanced + default: 1 + with_legacy: true +- name: ms_async_rdma_polling_us + type: uint + level: advanced + default: 1000 + with_legacy: true +- name: ms_async_rdma_gid_idx + type: int + level: advanced + desc: use gid_idx to select GID for choosing RoCEv1 or RoCEv2 + default: 0 + with_legacy: true +# GID format: "fe80:0000:0000:0000:7efe:90ff:fe72:6efe", no zero folding +- name: ms_async_rdma_local_gid + type: str + level: advanced + with_legacy: true +# 0=RoCEv1, 1=RoCEv2, 2=RoCEv1.5 +- name: ms_async_rdma_roce_ver + type: int + level: advanced + default: 1 + 
with_legacy: true +# in RoCE, this means PCP +- name: ms_async_rdma_sl + type: int + level: advanced + default: 3 + with_legacy: true +# in RoCE, this means DSCP +- name: ms_async_rdma_dscp + type: int + level: advanced + default: 96 + with_legacy: true +# when there are enough accept failures, indicating there are unrecoverable failures, +# just do ceph_abort() . Here we make it configurable. +- name: ms_max_accept_failures + type: int + level: advanced + desc: The maximum number of consecutive failed accept() calls before the + daemon is considered misconfigured and aborted. + default: 4 + with_legacy: true +# rdma connection management +- name: ms_async_rdma_cm + type: bool + level: advanced + default: false + with_legacy: true +- name: ms_async_rdma_type + type: str + level: advanced + default: ib + with_legacy: true +- name: ms_dpdk_port_id + type: int + level: advanced + default: 0 + with_legacy: true +# it is modified in unit tests, so use SAFE_OPTION to declare it +- name: ms_dpdk_coremask + type: str + level: advanced + default: '0xF' + see_also: + - ms_async_op_threads + with_legacy: true +- name: ms_dpdk_memory_channel + type: str + level: advanced + default: '4' + with_legacy: true +- name: ms_dpdk_hugepages + type: str + level: advanced + with_legacy: true +- name: ms_dpdk_pmd + type: str + level: advanced + with_legacy: true +- name: ms_dpdk_devs_allowlist + type: str + level: advanced + desc: PCIe addresses of the NICs that are allowed to be used + long_desc: for a single NIC use ms_dpdk_devs_allowlist=-a 0000:7d:01.0 or --allow=0000:7d:01.0; + for bonded NICs use ms_dpdk_devs_allowlist=--allow=0000:7d:01.0 --allow=0000:7d:02.6 + --vdev=net_bonding0,mode=2,slave=0000:7d:01.0,slave=0000:7d:02.6.
+- name: ms_dpdk_host_ipv4_addr + type: str + level: advanced + with_legacy: true +- name: ms_dpdk_gateway_ipv4_addr + type: str + level: advanced + with_legacy: true +- name: ms_dpdk_netmask_ipv4_addr + type: str + level: advanced + with_legacy: true +- name: ms_dpdk_lro + type: bool + level: advanced + default: true + with_legacy: true +- name: ms_dpdk_enable_tso + type: bool + level: advanced + default: true +- name: ms_dpdk_hw_flow_control + type: bool + level: advanced + default: true + with_legacy: true +# Weighting of a hardware network queue relative to a software queue (0=no work, 1=equal share) +- name: ms_dpdk_hw_queue_weight + type: float + level: advanced + default: 1 + with_legacy: true +- name: ms_dpdk_debug_allow_loopback + type: bool + level: dev + default: false + with_legacy: true +- name: ms_dpdk_rx_buffer_count_per_core + type: int + level: advanced + default: 8192 + with_legacy: true +- name: inject_early_sigterm + type: bool + level: dev + desc: send ourselves a SIGTERM early during startup + default: false + with_legacy: true +# list of initial cluster mon ids; if specified, need majority to form initial quorum and create new cluster +- name: mon_initial_members + type: str + level: advanced + fmt_desc: The IDs of initial monitors in a cluster during startup. If + specified, Ceph requires an odd number of monitors to form an + initial quorum (e.g., 3). + note: A *majority* of monitors in your cluster must be able to reach + each other in order to establish a quorum. You can decrease the initial + number of monitors to establish a quorum with this setting. + services: + - mon + flags: + - no_mon_update + - cluster_create + with_legacy: true +- name: mon_max_pg_per_osd + type: uint + level: advanced + desc: Max number of PGs per OSD the cluster will allow + long_desc: If the number of PGs per OSD exceeds this, a health warning will be visible + in `ceph status`.
This is also used in automated PG management, as the threshold + at which some pools' pg_num may be shrunk in order to enable increasing the pg_num + of others. + default: 250 + flags: + - runtime + services: + - mgr + - mon + min: 1 +- name: mon_osd_full_ratio + type: float + level: advanced + desc: full ratio of OSDs to be set during initial creation of the cluster + default: 0.95 + flags: + - no_mon_update + - cluster_create + with_legacy: true +- name: mon_osd_backfillfull_ratio + type: float + level: advanced + default: 0.9 + flags: + - no_mon_update + - cluster_create + with_legacy: true +- name: mon_osd_nearfull_ratio + type: float + level: advanced + desc: nearfull ratio for OSDs to be set during initial creation of cluster + default: 0.85 + flags: + - no_mon_update + - cluster_create + with_legacy: true +- name: mon_osd_initial_require_min_compat_client + type: str + level: advanced + default: luminous + flags: + - no_mon_update + - cluster_create + with_legacy: true +- name: mon_allow_pool_delete + type: bool + level: advanced + desc: allow pool deletions + fmt_desc: Should monitors allow pools to be removed, regardless of what the pool flags say? + default: false + services: + - mon + with_legacy: true +- name: mon_fake_pool_delete + type: bool + level: advanced + desc: fake pool deletions by renaming the rados pool + default: false + services: + - mon + with_legacy: true +- name: mon_globalid_prealloc + type: uint + level: advanced + desc: number of globalid values to preallocate + long_desc: This setting caps how many new clients can authenticate with the cluster + before the monitors have to perform a write to preallocate more. Large values + burn through the 64-bit ID space more quickly. + fmt_desc: The number of global IDs to pre-allocate for clients and daemons in the cluster. 
+ default: 10000 + services: + - mon + with_legacy: true +- name: mon_osd_report_timeout + type: int + level: advanced + desc: time before OSDs who do not report to the mons are marked down (seconds) + fmt_desc: The grace period in seconds before declaring + unresponsive Ceph OSD Daemons ``down``. + default: 15_min + services: + - mon + with_legacy: true +- name: mon_warn_on_insecure_global_id_reclaim + type: bool + level: advanced + desc: issue AUTH_INSECURE_GLOBAL_ID_RECLAIM health warning if any connected + clients are insecurely reclaiming global_id + default: true + services: + - mon + see_also: + - mon_warn_on_insecure_global_id_reclaim_allowed + - auth_allow_insecure_global_id_reclaim + - auth_expose_insecure_global_id_reclaim +- name: mon_warn_on_insecure_global_id_reclaim_allowed + type: bool + level: advanced + desc: issue AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED health warning if insecure + global_id reclaim is allowed + default: true + services: + - mon + see_also: + - mon_warn_on_insecure_global_id_reclaim + - auth_allow_insecure_global_id_reclaim + - auth_expose_insecure_global_id_reclaim +- name: mon_warn_on_msgr2_not_enabled + type: bool + level: advanced + desc: issue MON_MSGR2_NOT_ENABLED health warning if monitors are all running Nautilus + but not all binding to a msgr2 port + default: true + services: + - mon + see_also: + - ms_bind_msgr2 +- name: mon_warn_on_slow_ping_time + type: float + level: advanced + desc: Override mon_warn_on_slow_ping_ratio with specified threshold in milliseconds + fmt_desc: Override ``mon_warn_on_slow_ping_ratio`` with a specific value. + Raise ``HEALTH_WARN`` if any heartbeat between OSDs exceeds + ``mon_warn_on_slow_ping_time`` milliseconds. The default is 0 (disabled). 
+ default: 0 + services: + - mgr + - osd + see_also: + - mon_warn_on_slow_ping_ratio +- name: mon_warn_on_slow_ping_ratio + type: float + level: advanced + desc: Issue a health warning if heartbeat ping longer than percentage of osd_heartbeat_grace + fmt_desc: Raise ``HEALTH_WARN`` when any heartbeat between OSDs exceeds + ``mon_warn_on_slow_ping_ratio`` of ``osd_heartbeat_grace``. + default: 0.05 + services: + - mgr + - osd + see_also: + - osd_heartbeat_grace + - mon_warn_on_slow_ping_time +- name: mon_max_snap_prune_per_epoch + type: uint + level: advanced + desc: max number of pruned snaps we will process in a single OSDMap epoch + default: 100 + services: + - mon +- name: mon_min_osdmap_epochs + type: int + level: advanced + desc: min number of OSDMaps to store + fmt_desc: Minimum number of OSD map epochs to keep at all times. + default: 500 + services: + - mon + with_legacy: true +- name: mon_max_log_epochs + type: int + level: advanced + desc: max number of past cluster log epochs to store + fmt_desc: Maximum number of Log epochs the monitor should keep. + default: 500 + services: + - mon + with_legacy: true +- name: mon_max_mdsmap_epochs + type: int + level: advanced + desc: max number of FSMaps/MDSMaps to store + fmt_desc: The maximum number of mdsmap epochs to trim during a single proposal. + default: 500 + services: + - mon + with_legacy: true +- name: mon_max_mgrmap_epochs + type: int + level: advanced + desc: max number of MgrMaps to store + default: 500 + services: + - mon +- name: mon_max_osd + type: int + level: advanced + desc: max number of OSDs in a cluster + fmt_desc: The maximum number of OSDs allowed in the cluster. + default: 10000 + services: + - mon + with_legacy: true +- name: mon_probe_timeout + type: float + level: advanced + desc: timeout for querying other mons during bootstrap pre-election phase (seconds) + fmt_desc: Number of seconds the monitor will wait to find peers before bootstrapping. 
+ default: 2 + services: + - mon + with_legacy: true +- name: mon_client_bytes + type: size + level: advanced + desc: max bytes of outstanding client messages mon will read off the network + fmt_desc: The amount of client message data allowed in memory (in bytes). + default: 100_M + services: + - mon + with_legacy: true +- name: mon_warn_pg_not_scrubbed_ratio + type: float + level: advanced + desc: Percentage of the scrub max interval past the scrub max interval to warn + default: 0.5 + see_also: + - osd_scrub_max_interval + min: 0 + with_legacy: true +- name: mon_warn_pg_not_deep_scrubbed_ratio + type: float + level: advanced + desc: Percentage of the deep scrub interval past the deep scrub interval to warn + default: 0.75 + see_also: + - osd_deep_scrub_interval + min: 0 + with_legacy: true +- name: mon_scrub_interval + type: secs + level: advanced + desc: frequency for scrubbing mon database + fmt_desc: How often the monitor scrubs its store by comparing + the stored checksums with the computed ones for all stored + keys. (0 disables it. dangerous, use with care) + default: 1_day + services: + - mon +- name: mon_scrub_timeout + type: int + level: advanced + desc: timeout to restart scrub if a mon quorum participant does not respond for the + latest chunk + default: 5_min + services: + - mon + with_legacy: true +- name: mon_scrub_max_keys + type: int + level: advanced + desc: max keys per scrub chunk/step + fmt_desc: The maximum number of keys to scrub each time.
+ default: 100 + services: + - mon + with_legacy: true +# probability of injected crc mismatch [0.0, 1.0] +- name: mon_scrub_inject_crc_mismatch + type: float + level: dev + desc: probability for injecting crc mismatches into mon scrub + default: 0 + services: + - mon + with_legacy: true +# probability of injected missing keys [0.0, 1.0] +- name: mon_scrub_inject_missing_keys + type: float + level: dev + desc: probability for injecting missing keys into mon scrub + default: 0 + services: + - mon + with_legacy: true +- name: mon_config_key_max_entry_size + type: size + level: advanced + desc: Defines the number of bytes allowed to be held in a single config-key entry + fmt_desc: The maximum size of config-key entry (in bytes) + default: 64_K + services: + - mon + with_legacy: true +- name: mon_sync_timeout + type: float + level: advanced + desc: timeout before canceling sync if syncing mon does not respond + fmt_desc: Number of seconds the monitor will wait for the next update + message from its sync provider before it gives up and bootstraps + again. + default: 1_min + services: + - mon + with_legacy: true +- name: mon_sync_max_payload_size + type: size + level: advanced + desc: target max message payload for mon sync + fmt_desc: The maximum size for a sync payload (in bytes).
+ default: 1_M + services: + - mon + with_legacy: true +- name: mon_sync_max_payload_keys + type: int + level: advanced + desc: target max keys in message payload for mon sync + default: 2000 + services: + - mon + with_legacy: true +- name: mon_sync_debug + type: bool + level: dev + desc: enable extra debugging during mon sync + default: false + services: + - mon + with_legacy: true +- name: mon_inject_sync_get_chunk_delay + type: float + level: dev + desc: inject delay during sync (seconds) + default: 0 + services: + - mon + with_legacy: true +- name: mon_osd_min_down_reporters + type: uint + level: advanced + desc: number of OSDs from different subtrees who need to report a down OSD for it + to count + fmt_desc: The minimum number of Ceph OSD Daemons required to report a + ``down`` Ceph OSD Daemon. + default: 2 + services: + - mon + see_also: + - mon_osd_reporter_subtree_level +- name: mon_osd_reporter_subtree_level + type: str + level: advanced + desc: in which level of parent bucket the reporters are counted + fmt_desc: In which level of parent bucket the reporters are counted. The OSDs + send failure reports to monitors if they find a peer that is not responsive. + Monitors mark the reported ``OSD`` out and then ``down`` after a grace period. + default: host + services: + - mon + flags: + - runtime +- name: mon_osd_snap_trim_queue_warn_on + type: int + level: advanced + desc: Warn when snap trim queue is that large (or larger). 
+ long_desc: Warn when snap trim queue length for at least one PG crosses this value, + as this is an indicator of the snap trimmer not keeping up and wasting disk space + default: 32768 + services: + - mon + with_legacy: true +# force mon to trim maps to this point, regardless of min_last_epoch_clean (dangerous) +- name: mon_osd_force_trim_to + type: int + level: dev + desc: force mons to trim osdmaps through this epoch + fmt_desc: Force monitor to trim osdmaps to this point, even if there are + PGs not clean at the specified epoch (0 disables it. dangerous, + use with care) + default: 0 + services: + - mon + with_legacy: true +- name: mon_debug_extra_checks + type: bool + level: dev + desc: Enable some additional monitor checks + long_desc: Enable some additional monitor checks that would be too expensive to + run on production systems, or would only be relevant while testing or debugging. + default: false + services: + - mon +- name: mon_debug_block_osdmap_trim + type: bool + level: dev + desc: Block OSDMap trimming while the option is enabled. + long_desc: Blocking OSDMap trimming may be quite helpful to easily reproduce states + in which the monitor keeps (hundreds of) thousands of osdmaps.
+ default: false + services: + - mon +- name: mon_debug_deprecated_as_obsolete + type: bool + level: dev + desc: treat deprecated mon commands as obsolete + default: false + services: + - mon + with_legacy: true +- name: mon_debug_dump_transactions + type: bool + level: dev + desc: dump paxos transactions to log + default: false + services: + - mon + see_also: + - mon_debug_dump_location + with_legacy: true +- name: mon_debug_dump_json + type: bool + level: dev + desc: dump paxos transactions to log as json + default: false + services: + - mon + see_also: + - mon_debug_dump_transactions + with_legacy: true +- name: mon_debug_dump_location + type: str + level: dev + desc: file to dump paxos transactions to + default: /var/log/ceph/$cluster-$name.tdump + services: + - mon + see_also: + - mon_debug_dump_transactions + with_legacy: true +- name: mon_debug_no_require_quincy + type: bool + level: dev + desc: do not set quincy feature for new mon clusters + default: false + services: + - mon + flags: + - cluster_create +- name: mon_debug_no_require_reef + type: bool + level: dev + desc: do not set reef feature for new mon clusters + default: false + services: + - mon + flags: + - cluster_create +- name: mon_debug_no_require_bluestore_for_ec_overwrites + type: bool + level: dev + desc: do not require bluestore OSDs to enable EC overwrites on a rados pool + default: false + services: + - mon + with_legacy: true +- name: mon_debug_no_initial_persistent_features + type: bool + level: dev + desc: do not set any monmap features for new mon clusters + default: false + services: + - mon + flags: + - cluster_create + with_legacy: true +- name: mon_inject_transaction_delay_max + type: float + level: dev + desc: max duration of injected delay in paxos + default: 10 + services: + - mon + with_legacy: true +# range [0, 1] +- name: mon_inject_transaction_delay_probability + type: float + level: dev + desc: probability of injecting a delay in paxos + default: 0 + services: + - mon +
with_legacy: true +- name: mon_inject_pg_merge_bounce_probability + type: float + level: dev + desc: probability of failing and reverting a pg_num decrement + default: 0 + services: + - mon +# kill the sync provider at a specific point in the work flow +- name: mon_sync_provider_kill_at + type: int + level: dev + desc: kill mon sync provider at specific point + default: 0 + services: + - mon + with_legacy: true +# kill the sync requester at a specific point in the work flow +- name: mon_sync_requester_kill_at + type: int + level: dev + desc: kill mon sync requester at specific point + default: 0 + services: + - mon + with_legacy: true +# force monitor to join quorum even if it has been previously removed from the map +- name: mon_force_quorum_join + type: bool + level: advanced + desc: force mon to rejoin quorum even though it was just removed + fmt_desc: Force monitor to join quorum even if it has been previously removed from the map + default: false + services: + - mon + with_legacy: true +# type of keyvaluedb backend +- name: mon_keyvaluedb + type: str + level: advanced + desc: database backend to use for the mon database + default: rocksdb + services: + - mon + enum_values: + - leveldb + - rocksdb + flags: + - create + with_legacy: true +# UNSAFE -- TESTING ONLY! Allows addition of a cache tier with preexisting snaps +- name: mon_debug_unsafe_allow_tier_with_nonempty_snaps + type: bool + level: dev + default: false + services: + - mon + with_legacy: true +# required of mon, mds, osd daemons +- name: auth_cluster_required + type: str + level: advanced + desc: authentication methods required by the cluster + fmt_desc: If enabled, the Ceph Storage Cluster daemons (i.e., ``ceph-mon``, + ``ceph-osd``, ``ceph-mds`` and ``ceph-mgr``) must authenticate with + each other. Valid settings are ``cephx`` or ``none``.
+ default: cephx + with_legacy: true +# required by daemons of clients +- name: auth_service_required + type: str + level: advanced + desc: authentication methods required by service daemons + fmt_desc: If enabled, the Ceph Storage Cluster daemons require Ceph Clients + to authenticate with the Ceph Storage Cluster in order to access + Ceph services. Valid settings are ``cephx`` or ``none``. + default: cephx + with_legacy: true +# what clients require of daemons +- name: auth_client_required + type: str + level: advanced + desc: authentication methods allowed by clients + fmt_desc: If enabled, the Ceph Client requires the Ceph Storage Cluster to + authenticate with the Ceph Client. Valid settings are ``cephx`` + or ``none``. + default: cephx, none + with_legacy: true +# deprecated; default value for above if they are not defined. +- name: auth_supported + type: str + level: advanced + desc: authentication methods required (deprecated) + with_legacy: true +- name: max_rotating_auth_attempts + type: int + level: advanced + desc: number of attempts to initialize rotating keys before giving up + default: 10 + with_legacy: true +- name: rotating_keys_bootstrap_timeout + type: int + level: advanced + desc: timeout for obtaining rotating keys during bootstrap phase (seconds) + default: 30 +- name: rotating_keys_renewal_timeout + type: int + level: advanced + desc: timeout for updating rotating keys (seconds) + default: 10 +- name: cephx_require_signatures + type: bool + level: advanced + default: false + fmt_desc: If set to ``true``, Ceph requires signatures on all message + traffic between the Ceph Client and the Ceph Storage Cluster, and + between daemons comprising the Ceph Storage Cluster. + + Ceph Argonaut and Linux kernel versions prior to 3.19 do + not support signatures; if such clients are in use this + option can be turned off to allow them to connect. 
+ with_legacy: true +- name: cephx_require_version + type: int + level: advanced + desc: Cephx version required (1 = pre-mimic, 2 = mimic+) + default: 2 + with_legacy: true +- name: cephx_cluster_require_signatures + type: bool + level: advanced + default: false + fmt_desc: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph daemons comprising the Ceph Storage Cluster. + with_legacy: true +- name: cephx_cluster_require_version + type: int + level: advanced + desc: Cephx version required by the cluster from clients (1 = pre-mimic, 2 = mimic+) + default: 2 + with_legacy: true +- name: cephx_service_require_signatures + type: bool + level: advanced + default: false + fmt_desc: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph Clients and the Ceph Storage Cluster. + with_legacy: true +- name: cephx_service_require_version + type: int + level: advanced + desc: Cephx version required from ceph services (1 = pre-mimic, 2 = mimic+) + default: 2 + with_legacy: true +# Default to signing session messages if supported +- name: cephx_sign_messages + type: bool + level: advanced + default: true + fmt_desc: If the Ceph version supports message signing, Ceph will sign + all messages so they are more difficult to spoof. + with_legacy: true +- name: auth_mon_ticket_ttl + type: float + level: advanced + default: 72_hr + with_legacy: true +- name: auth_service_ticket_ttl + type: float + level: advanced + default: 1_hr + fmt_desc: When the Ceph Storage Cluster sends a Ceph Client a ticket for + authentication, the Ceph Storage Cluster assigns the ticket a + time to live. + with_legacy: true +- name: auth_allow_insecure_global_id_reclaim + type: bool + level: advanced + desc: Allow reclaiming global_id without presenting a valid ticket proving + previous possession of that global_id + long_desc: Allowing unauthorized global_id (re)use poses a security risk. 
+ Unfortunately, older clients may omit their ticket on reconnects and + therefore rely on this being allowed for preserving their global_id for + the lifetime of the client instance. Setting this value to false would + immediately prevent new connections from those clients (assuming + auth_expose_insecure_global_id_reclaim set to true) and eventually break + existing sessions as well (regardless of auth_expose_insecure_global_id_reclaim + setting). + default: true + see_also: + - mon_warn_on_insecure_global_id_reclaim + - mon_warn_on_insecure_global_id_reclaim_allowed + - auth_expose_insecure_global_id_reclaim + with_legacy: true +- name: auth_expose_insecure_global_id_reclaim + type: bool + level: advanced + desc: Force older clients that may omit their ticket on reconnects to + reconnect as part of establishing a session + long_desc: 'In permissive mode (auth_allow_insecure_global_id_reclaim set + to true), this helps with identifying clients that are not patched. In + enforcing mode (auth_allow_insecure_global_id_reclaim set to false), this + is a fail-fast mechanism: don''t establish a session that will almost + inevitably be broken later.' + default: true + see_also: + - mon_warn_on_insecure_global_id_reclaim + - mon_warn_on_insecure_global_id_reclaim_allowed + - auth_allow_insecure_global_id_reclaim + with_legacy: true +# if true, assert when weird things happen +- name: auth_debug + type: bool + level: dev + default: false + with_legacy: true +# how many mons to try to connect to in parallel during hunt +- name: mon_client_hunt_parallel + type: uint + level: advanced + default: 3 + with_legacy: true +# try new mon every N seconds until we connect +- name: mon_client_hunt_interval + type: float + level: advanced + default: 3 + fmt_desc: The client will try a new monitor every ``N`` seconds until it + establishes a connection. 
+ with_legacy: true +# send logs every N seconds +- name: mon_client_log_interval + type: float + level: advanced + desc: How frequently we send queued cluster log messages to mon + default: 1 + with_legacy: true +# ping every N seconds +- name: mon_client_ping_interval + type: float + level: advanced + default: 10 + fmt_desc: The client will ping the monitor every ``N`` seconds. + with_legacy: true +# fail if we don't hear back +- name: mon_client_ping_timeout + type: float + level: advanced + default: 30 + with_legacy: true +- name: mon_client_hunt_interval_backoff + type: float + level: advanced + default: 1.5 + with_legacy: true +- name: mon_client_hunt_interval_min_multiple + type: float + level: advanced + default: 1 + with_legacy: true +- name: mon_client_hunt_interval_max_multiple + type: float + level: advanced + default: 10 + with_legacy: true +- name: mon_client_max_log_entries_per_message + type: int + level: advanced + default: 1000 + fmt_desc: The maximum number of log entries a monitor will generate + per client message. 
+ with_legacy: true +- name: mon_client_directed_command_retry + type: int + level: dev + desc: Number of times to try sending a command directed at a specific monitor + default: 2 + with_legacy: true +# whitespace-separated list of key=value pairs describing crush location +- name: crush_location + type: str + level: advanced + with_legacy: true +- name: crush_location_hook + type: str + level: advanced + with_legacy: true +- name: crush_location_hook_timeout + type: int + level: advanced + default: 10 + with_legacy: true +- name: objecter_tick_interval + type: float + level: dev + default: 5 + with_legacy: true +# before we ask for a map +- name: objecter_timeout + type: float + level: advanced + desc: Seconds before in-flight op is considered 'laggy' and we query mon for the + latest OSDMap + default: 10 + with_legacy: true +- name: objecter_inflight_op_bytes + type: size + level: advanced + desc: Max in-flight data in bytes (both directions) + default: 100_M + with_legacy: true +- name: objecter_inflight_ops + type: uint + level: advanced + desc: Max in-flight operations + default: 1_K + with_legacy: true +# num of completion locks per each session, for serializing same object responses +- name: objecter_completion_locks_per_session + type: uint + level: dev + default: 32 + with_legacy: true +# suppress watch pings +- name: objecter_inject_no_watch_ping + type: bool + level: dev + default: false + with_legacy: true +# ignore the first reply for each write, and resend the osd op instead +- name: objecter_retry_writes_after_first_reply + type: bool + level: dev + default: false + with_legacy: true +- name: objecter_debug_inject_relock_delay + type: bool + level: dev + default: false + with_legacy: true +- name: filer_max_purge_ops + type: uint + level: advanced + desc: Max in-flight operations for purging a striped range (e.g., MDS journal) + default: 10 + with_legacy: true +- name: filer_max_truncate_ops + type: uint + level: advanced + desc: Max in-flight 
operations for truncating/deleting a striped sequence (e.g., + MDS journal) + default: 128 + with_legacy: true +- name: journaler_write_head_interval + type: int + level: advanced + desc: Interval in seconds between journal header updates (to help bound replay time) + default: 15 +# * journal object size +- name: journaler_prefetch_periods + type: uint + level: advanced + desc: Number of striping periods to prefetch while reading MDS journal + default: 10 + # we need at least 2 periods to make progress. + min: 2 +# * journal object size +- name: journaler_prezero_periods + type: uint + level: advanced + desc: Number of striping periods to zero head of MDS journal write position + default: 5 + # we need to zero at least two periods, minimum, to ensure that we + # have a full empty object/period in front of us. + min: 2 +- name: osd_calc_pg_upmaps_aggressively + type: bool + level: advanced + desc: try to calculate PG upmaps more aggressively, e.g., by doing a fairly exhaustive + search of existing PGs that can be unmapped or upmapped + default: true + flags: + - runtime +- name: osd_calc_pg_upmaps_aggressively_fast + type: bool + level: advanced + desc: Prevent very long (>10 minutes) calculations in some extreme cases (applicable + only to aggressive mode) + default: true + flags: + - runtime +- name: osd_calc_pg_upmaps_local_fallback_retries + type: uint + level: advanced + desc: 'Maximum number of PGs we can attempt to unmap or upmap for a specific overfull + or underfull osd per iteration ' + default: 100 + flags: + - runtime +# 1 = host +- name: osd_crush_chooseleaf_type + type: int + level: dev + desc: default chooseleaf type for osdmaptool --create + fmt_desc: The bucket type to use for ``chooseleaf`` in a CRUSH rule. Uses + ordinal rank rather than name. 
+ default: 1 + flags: + - cluster_create + with_legacy: true +# try to use gmt for hitset archive names if all osds in cluster support it +- name: osd_pool_use_gmt_hitset + type: bool + level: dev + desc: use UTC for hitset timestamps + long_desc: This setting only exists for compatibility with hammer (and older) clusters. + default: true + with_legacy: true +# whether to turn on fast read on the pool or not +- name: osd_pool_default_ec_fast_read + type: bool + level: advanced + desc: set ec_fast_read for new erasure-coded pools + fmt_desc: Whether to turn on fast read on the pool or not. It will be used as + the default setting of newly created erasure coded pools if ``fast_read`` + is not specified at create time. + default: false + services: + - mon + with_legacy: true +- name: osd_pool_default_crush_rule + type: int + level: advanced + desc: CRUSH rule for newly created pools + fmt_desc: The default CRUSH rule to use when creating a replicated pool. The + default value of ``-1`` means "pick the rule with the lowest numerical ID and + use that". This is to make pool creation work in the absence of rule 0. + default: -1 + services: + - mon +- name: osd_pool_default_size + type: uint + level: advanced + desc: the number of copies of an object for new replicated pools + fmt_desc: Sets the number of replicas for objects in the pool. The default + value is the same as + ``ceph osd pool set {pool-name} size {size}``. + default: 3 + services: + - mon + min: 0 + max: 10 + flags: + - runtime +- name: osd_pool_default_min_size + type: uint + level: advanced + desc: the minimal number of copies allowed to write to a degraded pool for new replicated + pools + long_desc: 0 means no specific default; ceph will use size-size/2 + fmt_desc: Sets the minimum number of written replicas for objects in the + pool in order to acknowledge an I/O operation to the client. If + minimum is not met, Ceph will not acknowledge the I/O to the + client, **which may result in data loss**. 
This setting ensures + a minimum number of replicas when operating in ``degraded`` mode. + The default value is ``0`` which means no particular minimum. If ``0``, + minimum is ``size - (size / 2)``. + default: 0 + services: + - mon + see_also: + - osd_pool_default_size + min: 0 + max: 255 + flags: + - runtime +- name: osd_pool_default_pg_num + type: uint + level: advanced + desc: number of PGs for new pools + fmt_desc: The default number of placement groups for a pool. The default + value is the same as ``pg_num`` with ``mkpool``. + long_desc: With default value of `osd_pool_default_pg_autoscale_mode` being + `on` the number of PGs for new pools will start out with 1 pg, unless the + user specifies the pg_num. + default: 32 + services: + - mon + see_also: + - osd_pool_default_pg_autoscale_mode + flags: + - runtime +- name: osd_pool_default_pgp_num + type: uint + level: advanced + desc: number of PGs for placement purposes (0 to match pg_num) + fmt_desc: | + The default number of placement groups for placement for a pool. + The default value is the same as ``pgp_num`` with ``mkpool``. + PG and PGP should be equal (for now). Note: should not be set unless + autoscaling is disabled. 
+ default: 0 + services: + - mon + see_also: + - osd_pool_default_pg_num + - osd_pool_default_pg_autoscale_mode + flags: + - runtime +- name: osd_pool_default_type + type: str + level: advanced + desc: default type of pool to create + default: replicated + services: + - mon + enum_values: + - replicated + - erasure + flags: + - runtime +- name: osd_pool_default_erasure_code_profile + type: str + level: advanced + desc: default erasure code profile for new erasure-coded pools + default: plugin=jerasure technique=reed_sol_van k=2 m=2 + services: + - mon + flags: + - runtime +- name: osd_erasure_code_plugins + type: str + level: advanced + desc: erasure code plugins to load + default: @osd_erasure_code_plugins@ + services: + - mon + - osd + flags: + - startup + with_legacy: true +- name: osd_pool_default_flags + type: int + level: dev + desc: (integer) flags to set on new pools + fmt_desc: The default flags for new pools. + default: 0 + services: + - mon + with_legacy: true +# use new pg hashing to prevent pool/pg overlap +- name: osd_pool_default_flag_hashpspool + type: bool + level: advanced + desc: set hashpspool (better hashing scheme) flag on new pools + default: true + services: + - mon + with_legacy: true +# pool can't be deleted +- name: osd_pool_default_flag_nodelete + type: bool + level: advanced + desc: set nodelete flag on new pools + fmt_desc: Set the ``nodelete`` flag on new pools, which prevents pool removal. + default: false + services: + - mon + with_legacy: true +# pool's pg and pgp num can't be changed +- name: osd_pool_default_flag_nopgchange + type: bool + level: advanced + desc: set nopgchange flag on new pools + fmt_desc: Set the ``nopgchange`` flag on new pools. Does not allow the number of PGs to be changed. 
+ default: false + services: + - mon + with_legacy: true +# pool's size and min size can't be changed +- name: osd_pool_default_flag_nosizechange + type: bool + level: advanced + desc: set nosizechange flag on new pools + fmt_desc: Set the ``nosizechange`` flag on new pools. Does not allow the ``size`` to be changed. + default: false + services: + - mon + with_legacy: true +- name: osd_pool_default_flag_bulk + type: bool + level: advanced + desc: set bulk flag on new pools + fmt_desc: Set the ``bulk`` flag on new pools, allowing the autoscaler to use scale-down mode. + default: false + services: + - mon + with_legacy: true +- name: osd_pool_default_hit_set_bloom_fpp + type: float + level: advanced + default: 0.05 + services: + - mon + see_also: + - osd_tier_default_cache_hit_set_type + with_legacy: true +- name: osd_pool_default_cache_target_dirty_ratio + type: float + level: advanced + default: 0.4 + with_legacy: true +- name: osd_pool_default_cache_target_dirty_high_ratio + type: float + level: advanced + default: 0.6 + with_legacy: true +- name: osd_pool_default_cache_target_full_ratio + type: float + level: advanced + default: 0.8 + with_legacy: true +# seconds +- name: osd_pool_default_cache_min_flush_age + type: int + level: advanced + default: 0 + with_legacy: true +# seconds +- name: osd_pool_default_cache_min_evict_age + type: int + level: advanced + default: 0 + with_legacy: true +# max size to check for eviction +- name: osd_pool_default_cache_max_evict_check_size + type: int + level: advanced + default: 10 + with_legacy: true +- name: osd_pool_default_pg_autoscale_mode + type: str + level: advanced + desc: Default PG autoscaling behavior for new pools + long_desc: With default value `on`, the autoscaler starts a new pool with 1 + pg, unless the user specifies the pg_num. 
+ default: 'on' + enum_values: + - 'off' + - 'warn' + - 'on' + flags: + - runtime +- name: osd_pool_default_read_lease_ratio + type: float + level: dev + desc: Default read_lease_ratio for a pool, as a multiple of osd_heartbeat_grace + long_desc: This should be <= 1.0 so that the read lease will have expired by the + time we decide to mark a peer OSD down. + default: 0.8 + see_also: + - osd_heartbeat_grace + flags: + - runtime + with_legacy: true +# min target size for a HitSet +- name: osd_hit_set_min_size + type: int + level: advanced + default: 1000 + with_legacy: true +# max target size for a HitSet +- name: osd_hit_set_max_size + type: int + level: advanced + default: 100000 + with_legacy: true +# rados namespace for hit_set tracking +- name: osd_hit_set_namespace + type: str + level: advanced + default: .ceph-internal + with_legacy: true +# conservative default throttling values +- name: osd_tier_promote_max_objects_sec + type: uint + level: advanced + default: 25 + with_legacy: true +- name: osd_tier_promote_max_bytes_sec + type: size + level: advanced + default: 5_M + with_legacy: true +- name: osd_tier_default_cache_mode + type: str + level: advanced + default: writeback + enum_values: + - none + - writeback + - forward + - readonly + - readforward + - readproxy + - proxy + flags: + - runtime +- name: osd_tier_default_cache_hit_set_count + type: uint + level: advanced + default: 4 +- name: osd_tier_default_cache_hit_set_period + type: uint + level: advanced + default: 1200 +- name: osd_tier_default_cache_hit_set_type + type: str + level: advanced + default: bloom + enum_values: + - bloom + - explicit_hash + - explicit_object + flags: + - runtime +- name: osd_tier_default_cache_min_read_recency_for_promote + type: uint + level: advanced + desc: number of recent HitSets the object must appear in to be promoted (on read) + default: 1 +- name: osd_tier_default_cache_min_write_recency_for_promote + type: uint + level: advanced + desc: number of recent HitSets 
the object must appear in to be promoted (on write) + default: 1 +- name: osd_tier_default_cache_hit_set_grade_decay_rate + type: uint + level: advanced + default: 20 +- name: osd_tier_default_cache_hit_set_search_last_n + type: uint + level: advanced + default: 1 +- name: osd_objecter_finishers + type: int + level: advanced + default: 1 + flags: + - startup + with_legacy: true +- name: osd_map_dedup + type: bool + level: advanced + default: true + fmt_desc: Enable removing duplicates in the OSD map. + with_legacy: true +- name: osd_map_message_max + type: int + level: advanced + desc: maximum number of OSDMaps to include in a single message + fmt_desc: The maximum map entries allowed per MOSDMap message. + default: 40 + services: + - osd + - mon + with_legacy: true +- name: osd_map_message_max_bytes + type: size + level: advanced + desc: maximum number of bytes worth of OSDMaps to include in a single message + default: 10_M + services: + - osd + - mon + with_legacy: true +# do not assert on divergent_prior entries which aren't in the log and whose on-disk objects are newer +- name: osd_ignore_stale_divergent_priors + type: bool + level: advanced + default: false + with_legacy: true +- name: osd_heartbeat_interval + type: int + level: dev + desc: Interval (in seconds) between peer pings + fmt_desc: How often a Ceph OSD Daemon pings its peers (in seconds). + default: 6 + min: 1 + max: 1_min + with_legacy: true +# (seconds) how long before we decide a peer has failed +# This setting is read by the MONs and OSDs and has to be set to an equal value in both settings of the configuration +- name: osd_heartbeat_grace + type: int + level: advanced + default: 20 + fmt_desc: The elapsed time without a heartbeat from a Ceph OSD Daemon + after which the Ceph Storage Cluster considers it ``down``. + This setting must be set in both the [mon] and [osd] or [global] + sections so that it is read by both monitor and OSD daemons. 
+ with_legacy: true +- name: osd_heartbeat_stale + type: int + level: advanced + desc: Interval (in seconds) we mark an unresponsive heartbeat peer as stale. + long_desc: Automatically mark unresponsive heartbeat sessions as stale and tear + them down. The primary benefit is that OSD doesn't need to keep a flood of blocked + heartbeat messages around in memory. + default: 10_min +# prio the heartbeat tcp socket and set dscp as CS6 on it if true +- name: osd_heartbeat_use_min_delay_socket + type: bool + level: advanced + default: false + with_legacy: true +# the minimum size of OSD heartbeat messages to send +- name: osd_heartbeat_min_size + type: size + level: advanced + desc: Minimum heartbeat packet size in bytes. Will add dummy payload if heartbeat + packet is smaller than this. + default: 2000 + with_legacy: true +# max number of parallel snap trims/pg +- name: osd_pg_max_concurrent_snap_trims + type: uint + level: advanced + default: 2 + min: 1 + with_legacy: true +# max number of trimming pgs +- name: osd_max_trimming_pgs + type: uint + level: advanced + default: 2 + with_legacy: true +# minimum number of peers that must be reachable to mark ourselves +# back up after being wrongly marked down. +- name: osd_heartbeat_min_healthy_ratio + type: float + level: advanced + default: 0.33 + with_legacy: true +# (seconds) how often to ping monitor if no peers +- name: osd_mon_heartbeat_interval + type: int + level: advanced + default: 30 + fmt_desc: How often the Ceph OSD Daemon pings a Ceph Monitor if it has no + Ceph OSD Daemon peers. + with_legacy: true +- name: osd_mon_heartbeat_stat_stale + type: int + level: advanced + desc: Stop reporting on heartbeat ping times not updated for this many seconds. + long_desc: Stop reporting on old heartbeat information unless this is set to zero + fmt_desc: Stop reporting on heartbeat ping times which haven't been updated for + this many seconds. Set to zero to disable this action. + default: 1_hr +# failures, up_thru, boot. 
+- name: osd_mon_report_interval + type: int + level: advanced + desc: Frequency of OSD reports to mon for peer failures, fullness status changes + fmt_desc: The number of seconds a Ceph OSD Daemon may wait + from startup or another reportable event before reporting + to a Ceph Monitor. + default: 5 + with_legacy: true +# max updates in flight +- name: osd_mon_report_max_in_flight + type: int + level: advanced + default: 2 + with_legacy: true +# (second) how often to send beacon message to monitor +- name: osd_beacon_report_interval + type: int + level: advanced + default: 5_min + with_legacy: true +# report pg stats for any given pg at least this often +- name: osd_pg_stat_report_interval_max + type: int + level: advanced + default: 500 + with_legacy: true +# Max number of snap intervals to report to mgr in pg_stat_t +- name: osd_max_snap_prune_intervals_per_epoch + type: uint + level: dev + desc: Max number of snap intervals to report to mgr in pg_stat_t + default: 512 + with_legacy: true +- name: osd_default_data_pool_replay_window + type: int + level: advanced + default: 45 + fmt_desc: The time (in seconds) for an OSD to wait for a client to replay + a request. +- name: osd_auto_mark_unfound_lost + type: bool + level: advanced + default: false + with_legacy: true +- name: osd_check_for_log_corruption + type: bool + level: advanced + default: false + fmt_desc: Check log files for corruption. Can be computationally expensive. + with_legacy: true +- name: osd_use_stale_snap + type: bool + level: advanced + default: false + with_legacy: true +- name: osd_rollback_to_cluster_snap + type: str + level: advanced + with_legacy: true +- name: osd_default_notify_timeout + type: uint + level: advanced + desc: default number of seconds after which notify propagation times out. used if + a client has not specified other value + fmt_desc: The OSD default notification timeout (in seconds). 
+ default: 30 + with_legacy: true +- name: osd_kill_backfill_at + type: int + level: dev + default: 0 + with_legacy: true +# Bounds how infrequently a new map epoch will be persisted for a pg +# make this < map_cache_size! +- name: osd_pg_epoch_persisted_max_stale + type: uint + level: advanced + default: 40 + with_legacy: true +- name: osd_target_pg_log_entries_per_osd + type: uint + level: dev + desc: target number of PG entries total on an OSD - limited per pg by the min and + max options below + default: 300000 + see_also: + - osd_max_pg_log_entries + - osd_min_pg_log_entries + with_legacy: true +- name: osd_min_pg_log_entries + type: uint + level: dev + desc: minimum number of entries to maintain in the PG log + fmt_desc: The minimum number of placement group logs to maintain + when trimming log files. + default: 250 + services: + - osd + see_also: + - osd_max_pg_log_entries + - osd_pg_log_dups_tracked + - osd_target_pg_log_entries_per_osd + with_legacy: true +- name: osd_max_pg_log_entries + type: uint + level: dev + desc: maximum number of entries to maintain in the PG log + fmt_desc: The maximum number of placement group logs to maintain + when trimming log files. 
+ default: 10000 + services: + - osd + see_also: + - osd_min_pg_log_entries + - osd_pg_log_dups_tracked + - osd_target_pg_log_entries_per_osd + with_legacy: true +- name: osd_pg_log_dups_tracked + type: uint + level: dev + desc: how many versions back to track in order to detect duplicate ops; this is + combined with both the regular pg log entries and additional minimal dup detection + entries + default: 3000 + services: + - osd + see_also: + - osd_min_pg_log_entries + - osd_max_pg_log_entries + with_legacy: true +- name: osd_object_clean_region_max_num_intervals + type: int + level: dev + desc: number of intervals in clean_offsets + long_desc: partial recovery uses multiple intervals to record the clean part of + the object. When the number of intervals is greater than osd_object_clean_region_max_num_intervals, + the minimum interval will be trimmed (0 will recover the entire object data interval) + default: 10 + services: + - osd + with_legacy: true +# max entries factor before force recovery +- name: osd_force_recovery_pg_log_entries_factor + type: float + level: dev + default: 1.3 + with_legacy: true +- name: osd_pg_log_trim_min + type: uint + level: dev + desc: Minimum number of log entries to trim at once. This lets us trim in larger + batches rather than with each write. 
+ default: 100 + see_also: + - osd_max_pg_log_entries + - osd_min_pg_log_entries + with_legacy: true +- name: osd_force_auth_primary_missing_objects + type: uint + level: advanced + desc: Approximate missing objects above which to force auth_log_shard to be primary + temporarily + default: 100 +- name: osd_async_recovery_min_cost + type: uint + level: advanced + desc: A mixture measure of number of current log entries difference and historical + missing objects, above which we switch to use asynchronous recovery when appropriate + default: 100 + flags: + - runtime +- name: osd_max_pg_per_osd_hard_ratio + type: float + level: advanced + desc: Maximum number of PG per OSD, a factor of 'mon_max_pg_per_osd' + long_desc: OSD will refuse to instantiate PG if the number of PG it serves exceeds + this number. + fmt_desc: The ratio of number of PGs per OSD allowed by the cluster before the + OSD refuses to create new PGs. An OSD stops creating new PGs if the number + of PGs it serves exceeds + ``osd_max_pg_per_osd_hard_ratio`` \* ``mon_max_pg_per_osd``. + default: 3 + see_also: + - mon_max_pg_per_osd + min: 1 +- name: osd_pg_log_trim_max + type: uint + level: advanced + desc: maximum number of entries to remove at once from the PG log + default: 10000 + services: + - osd + see_also: + - osd_min_pg_log_entries + - osd_max_pg_log_entries + with_legacy: true +# how many seconds old makes an op complaint-worthy +- name: osd_op_complaint_time + type: float + level: advanced + default: 30 + fmt_desc: An operation becomes complaint worthy after the specified number + of seconds have elapsed. + with_legacy: true +- name: osd_command_max_records + type: int + level: advanced + default: 256 + fmt_desc: Limits the number of lost objects to return. 
+ with_legacy: true +# max peer osds to report that are blocking our progress +- name: osd_max_pg_blocked_by + type: uint + level: advanced + default: 16 + with_legacy: true +- name: osd_op_log_threshold + type: int + level: advanced + default: 5 + fmt_desc: How many operations logs to display at once. + with_legacy: true +- name: osd_backoff_on_unfound + type: bool + level: advanced + default: true + with_legacy: true +# [mainly for debug?] object unreadable/writeable +- name: osd_backoff_on_degraded + type: bool + level: advanced + default: false + with_legacy: true +# [debug] pg peering +- name: osd_backoff_on_peering + type: bool + level: advanced + default: false + with_legacy: true +- name: osd_debug_shutdown + type: bool + level: dev + desc: Turn up debug levels during shutdown + default: false + with_legacy: true +# crash osd if client ignores a backoff; useful for debugging +- name: osd_debug_crash_on_ignored_backoff + type: bool + level: dev + default: false + with_legacy: true +- name: osd_debug_inject_dispatch_delay_probability + type: float + level: dev + default: 0 + with_legacy: true +- name: osd_debug_inject_dispatch_delay_duration + type: float + level: dev + default: 0.1 + with_legacy: true +- name: osd_debug_drop_ping_probability + desc: N/A + type: float + level: dev + default: 0 + with_legacy: true +- name: osd_debug_drop_ping_duration + desc: N/A + type: int + level: dev + default: 0 + with_legacy: true +- name: osd_debug_op_order + type: bool + level: dev + default: false + with_legacy: true +- name: osd_debug_verify_missing_on_start + type: bool + level: dev + default: false + with_legacy: true +- name: osd_debug_verify_snaps + type: bool + level: dev + default: false + with_legacy: true +- name: osd_debug_verify_stray_on_activate + type: bool + level: dev + default: false + with_legacy: true +- name: osd_debug_skip_full_check_in_backfill_reservation + type: bool + level: dev + default: false + with_legacy: true +- name: 
osd_debug_reject_backfill_probability + type: float + level: dev + default: 0 + with_legacy: true +# inject failure during copyfrom completion +- name: osd_debug_inject_copyfrom_error + type: bool + level: dev + default: false + with_legacy: true +- name: osd_debug_misdirected_ops + type: bool + level: dev + default: false + with_legacy: true +- name: osd_debug_skip_full_check_in_recovery + type: bool + level: dev + default: false + with_legacy: true +- name: osd_debug_random_push_read_error + type: float + level: dev + default: 0 + with_legacy: true +- name: osd_debug_verify_cached_snaps + type: bool + level: dev + default: false + with_legacy: true +- name: osd_debug_deep_scrub_sleep + type: float + level: dev + desc: Inject an expensive sleep during deep scrub IO to make it easier to induce + preemption + default: 0 + with_legacy: true +- name: osd_debug_no_acting_change + type: bool + level: dev + default: false + with_legacy: true +- name: osd_debug_no_purge_strays + type: bool + level: dev + default: false + with_legacy: true +- name: osd_debug_pretend_recovery_active + type: bool + level: dev + default: false + with_legacy: true +# enable/disable OSD op tracking +- name: osd_enable_op_tracker + type: bool + level: advanced + default: true + with_legacy: true +# The number of shards for holding the ops +- name: osd_num_op_tracker_shard + type: uint + level: advanced + default: 32 + with_legacy: true +# Max number of completed ops to track +- name: osd_op_history_size + type: uint + level: advanced + default: 20 + fmt_desc: The maximum number of completed operations to track. + with_legacy: true +# Oldest completed op to track +- name: osd_op_history_duration + type: uint + level: advanced + default: 600 + fmt_desc: The oldest completed operation to track. 
+ with_legacy: true +# Max number of slow ops to track +- name: osd_op_history_slow_op_size + type: uint + level: advanced + default: 20 + with_legacy: true +# track the op if over this threshold +- name: osd_op_history_slow_op_threshold + type: float + level: advanced + default: 10 + with_legacy: true +# to adjust various transactions that batch smaller items +- name: osd_target_transaction_size + type: int + level: advanced + default: 30 + with_legacy: true +# what % full makes an OSD "full" (failsafe) +- name: osd_failsafe_full_ratio + type: float + level: advanced + default: 0.97 + with_legacy: true +- name: osd_fast_shutdown + type: bool + level: advanced + desc: Fast, immediate shutdown + long_desc: Setting this to false makes the OSD do a slower teardown of all state + when it receives a SIGINT or SIGTERM or when shutting down for any other reason. That + slow shutdown is primarily useful for doing memory leak checking with valgrind. + default: true + with_legacy: true +- name: osd_fast_shutdown_timeout + type: int + level: advanced + desc: timeout in seconds for osd fast-shutdown (0 is unlimited) + default: 15 + with_legacy: true + min: 0 +- name: osd_fast_shutdown_notify_mon + type: bool + level: advanced + desc: Tell mon about OSD shutdown on immediate shutdown + long_desc: Tell the monitor the OSD is shutting down on immediate shutdown. This + helps with cluster log messages from other OSDs reporting it immediately failed. + default: true + see_also: + - osd_fast_shutdown + - osd_mon_shutdown_timeout + with_legacy: true +# immediately mark OSDs as down once they refuse to accept connections +- name: osd_fast_fail_on_connection_refused + type: bool + level: advanced + default: true + fmt_desc: If this option is enabled, crashed OSDs are marked down + immediately by connected peers and MONs (assuming that the + crashed OSD host survives). 
Disable it to restore old + behavior, at the expense of possible long I/O stalls when + OSDs crash in the middle of I/O operations. + with_legacy: true +- name: osd_pg_object_context_cache_count + type: int + level: advanced + default: 64 + with_legacy: true +# true if LTTng-UST tracepoints should be enabled +- name: osd_tracing + type: bool + level: advanced + default: false + with_legacy: true +# true if function instrumentation should use LTTng +- name: osd_function_tracing + type: bool + level: advanced + default: false + with_legacy: true +# use fast info attr, if we can +- name: osd_fast_info + type: bool + level: advanced + default: true + with_legacy: true +# determines whether PGLog::check() compares written out log to stored log +- name: osd_debug_pg_log_writeout + type: bool + level: dev + default: false + with_legacy: true +# Max number of loops before we reset thread-pool's handle +- name: osd_loop_before_reset_tphandle + type: uint + level: advanced + default: 64 + with_legacy: true +# default timeout while calling WaitInterval on an empty queue +- name: threadpool_default_timeout + type: int + level: advanced + default: 1_min + with_legacy: true +# default wait time for an empty queue before pinging the hb timeout +- name: threadpool_empty_queue_max_wait + type: int + level: advanced + default: 2 + with_legacy: true +- name: leveldb_log_to_ceph_log + type: bool + level: advanced + default: true + with_legacy: true +- name: leveldb_write_buffer_size + type: size + level: advanced + default: 8_M + with_legacy: true +- name: leveldb_cache_size + type: size + level: advanced + default: 128_M + with_legacy: true +- name: leveldb_block_size + type: size + level: advanced + default: 0 + with_legacy: true +- name: leveldb_bloom_size + type: int + level: advanced + default: 0 + with_legacy: true +- name: leveldb_max_open_files + type: int + level: advanced + default: 0 + with_legacy: true +- name: leveldb_compression + type: bool + level: advanced + default: 
true + with_legacy: true +- name: leveldb_paranoid + type: bool + level: advanced + default: false + with_legacy: true +- name: leveldb_log + type: str + level: advanced + default: /dev/null + with_legacy: true +- name: leveldb_compact_on_mount + type: bool + level: advanced + default: false + with_legacy: true +- name: rocksdb_log_to_ceph_log + type: bool + level: advanced + default: true + with_legacy: true +- name: rocksdb_cache_size + type: size + level: advanced + default: 512_M + flags: + - runtime + with_legacy: true +# ratio of cache for row (vs block) +- name: rocksdb_cache_row_ratio + type: float + level: advanced + default: 0 + with_legacy: true +# rocksdb block cache shard bits, 4 bit -> 16 shards +- name: rocksdb_cache_shard_bits + type: int + level: advanced + default: 4 + with_legacy: true +# 'lru' or 'clock' +- name: rocksdb_cache_type + type: str + level: advanced + default: binned_lru + with_legacy: true +- name: rocksdb_block_size + type: size + level: advanced + default: 4_K + with_legacy: true +# Enabling this will have 5-10% impact on performance for the stats collection +- name: rocksdb_perf + type: bool + level: advanced + default: false + with_legacy: true +# For rocksdb, this behavior will be an overhead of 5%~10%, collected only if rocksdb_perf is enabled. +- name: rocksdb_collect_compaction_stats + type: bool + level: advanced + default: false + with_legacy: true +# For rocksdb, this behavior will be an overhead of 5%~10%, collected only if rocksdb_perf is enabled. +- name: rocksdb_collect_extended_stats + type: bool + level: advanced + default: false + with_legacy: true +# For rocksdb, this behavior will be an overhead of 5%~10%, collected only if rocksdb_perf is enabled. +- name: rocksdb_collect_memory_stats + type: bool + level: advanced + default: false + with_legacy: true +- name: rocksdb_delete_range_threshold + type: uint + level: advanced + desc: The number of keys required to invoke DeleteRange when deleting multiple keys. 
+ default: 1_M +- name: rocksdb_bloom_bits_per_key + type: uint + level: advanced + desc: Number of bits per key to use for RocksDB's bloom filters. + long_desc: 'RocksDB bloom filters can be used to quickly answer the question of + whether or not a key may exist or definitely does not exist in a given RocksDB + SST file without having to read all keys into memory. Using a higher bit value + decreases the likelihood of false positives at the expense of additional disk + space and memory consumption when the filter is loaded into RAM. The current + default value of 20 was found to provide significant performance gains when getattr + calls are made (such as during new object creation in bluestore) without significant + memory overhead or cache pollution when combined with rocksdb partitioned index + filters. See: https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters + for more information.' + default: 20 +- name: rocksdb_cache_index_and_filter_blocks + type: bool + level: dev + desc: Whether to cache indices and filters in block cache + long_desc: By default RocksDB will load an SST file's index and bloom filters into + memory when it is opened and remove them from memory when an SST file is closed. Thus, + memory consumption by indices and bloom filters is directly tied to the number + of concurrent SST files allowed to be kept open. This option instead stores cached + indices and filters in the block cache where they directly compete with other + cached data. By default we set this option to true to better account for and + bound rocksdb memory usage and keep filters in memory even when an SST file is + closed. + default: true +- name: rocksdb_cache_index_and_filter_blocks_with_high_priority + type: bool + level: dev + desc: Whether to cache indices and filters in the block cache with high priority + long_desc: A downside of setting rocksdb_cache_index_and_filter_blocks to true is + that regular data can push indices and filters out of memory. 
Setting this option + to true means they are cached with higher priority than other data and should + typically stay in the block cache. + default: false +- name: rocksdb_pin_l0_filter_and_index_blocks_in_cache + type: bool + level: dev + desc: Whether to pin Level 0 indices and bloom filters in the block cache + long_desc: A downside of setting rocksdb_cache_index_and_filter_blocks to true is + that regular data can push indices and filters out of memory. Setting this option + to true means that level 0 SST files will always have their indices and filters + pinned in the block cache. + default: false +- name: rocksdb_index_type + type: str + level: dev + desc: 'Type of index for SST files: binary_search, hash_search, two_level' + long_desc: 'This option controls the table index type. binary_search is a space + efficient index block that is optimized for block-search-based index. hash_search + may improve prefix lookup performance at the expense of higher disk and memory + usage and potentially slower compactions. two_level is an experimental index + type that uses two binary search indexes and works in conjunction with partition + filters. See: http://rocksdb.org/blog/2017/05/12/partitioned-index-filter.html' + default: binary_search +- name: rocksdb_partition_filters + type: bool + level: dev + desc: (experimental) partition SST index/filters into smaller blocks + long_desc: 'This is an experimental option for rocksdb that works in conjunction + with two_level indices to avoid having to keep the entire filter/index in cache + when cache_index_and_filter_blocks is true. The idea is to keep a much smaller + top-level index in heap/cache and then opportunistically cache the lower level + indices. See: https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters' + default: false +- name: rocksdb_metadata_block_size + type: size + level: dev + desc: The block size for index partitions. 
(0 = rocksdb default) + default: 4_K +# osd_*_priority adjust the relative priority of client io, recovery io, +# snaptrim io, etc +# +# osd_*_priority determines the ratio of available io between client and +# recovery. Each option may be set between +# 1..63. +- name: rocksdb_cf_compact_on_deletion + type: bool + level: dev + desc: Compact the column family when a certain number of tombstones are observed within a given window. + long_desc: 'This setting instructs RocksDB to compact a column family when a certain + number of tombstones are observed during iteration within a certain sliding window. + For instance if rocksdb_cf_compact_on_deletion_sliding_window is 8192 and + rocksdb_cf_compact_on_deletion_trigger is 4096, then once 4096 tombstones are + observed after iteration over 8192 entries, the column family will be compacted.' + default: true + with_legacy: true + see_also: + - rocksdb_cf_compact_on_deletion_sliding_window + - rocksdb_cf_compact_on_deletion_trigger +- name: rocksdb_cf_compact_on_deletion_sliding_window + type: int + level: dev + desc: The sliding window to use when rocksdb_cf_compact_on_deletion is enabled. + default: 32768 + with_legacy: true + see_also: + - rocksdb_cf_compact_on_deletion +- name: rocksdb_cf_compact_on_deletion_trigger + type: int + level: dev + desc: The trigger to use when rocksdb_cf_compact_on_deletion is enabled. + default: 16384 + with_legacy: true + see_also: + - rocksdb_cf_compact_on_deletion +- name: osd_client_op_priority + type: uint + level: advanced + default: 63 + fmt_desc: The priority set for client operations. This value is relative + to that of ``osd_recovery_op_priority`` below. The default + strongly favors client ops over recovery. 
+ with_legacy: true +- name: osd_recovery_op_priority + type: uint + level: advanced + desc: Priority to use for recovery operations if not specified for the pool + fmt_desc: The priority of recovery operations vs client operations, if not specified by the + pool's ``recovery_op_priority``. The default value prioritizes client + ops (see above) over recovery ops. You may adjust the tradeoff of client + impact against the time to restore cluster health by lowering this value + for increased prioritization of client ops, or by increasing it to favor + recovery. + default: 3 + with_legacy: true +- name: osd_peering_op_priority + type: uint + level: dev + default: 255 + with_legacy: true +- name: osd_snap_trim_priority + type: uint + level: advanced + default: 5 + fmt_desc: The priority set for the snap trim work queue. + with_legacy: true +- name: osd_snap_trim_cost + type: size + level: advanced + default: 1_M + with_legacy: true +- name: osd_pg_delete_priority + type: uint + level: advanced + default: 5 + with_legacy: true +- name: osd_pg_delete_cost + type: size + level: advanced + default: 1_M + with_legacy: true +- name: osd_scrub_priority + type: uint + level: advanced + desc: Priority for scrub operations in work queue + fmt_desc: The default work queue priority for scheduled scrubs when the + pool doesn't specify a value of ``scrub_priority``. This can be + boosted to the value of ``osd_client_op_priority`` when scrubs are + blocking client operations. 
+ default: 5 + with_legacy: true +- name: osd_scrub_cost + type: size + level: advanced + desc: Cost for scrub operations in work queue + default: 50_M + with_legacy: true +- name: osd_scrub_event_cost + type: size + level: advanced + desc: Cost for each scrub operation, used when osd_op_queue=mclock_scheduler + default: 4_K + with_legacy: true +# set requested scrub priority higher than scrub priority to make the +# requested scrubs jump the queue of scheduled scrubs +- name: osd_requested_scrub_priority + type: uint + level: advanced + default: 120 + fmt_desc: The priority set for user requested scrub on the work queue. If + this value were to be smaller than ``osd_client_op_priority`` it + can be boosted to the value of ``osd_client_op_priority`` when + scrub is blocking client operations. + with_legacy: true +- name: osd_recovery_priority + type: uint + level: advanced + desc: Priority of recovery in the work queue + long_desc: Not related to a pool's recovery_priority + fmt_desc: The default priority set for recovery work queue. Not + related to a pool's ``recovery_priority``. + default: 5 + with_legacy: true +# set default cost equal to 20MB io +- name: osd_recovery_cost + type: size + level: advanced + default: 20_M + with_legacy: true +# osd_recovery_op_warn_multiple scales the normal warning threshold, +# osd_op_complaint_time, so that slow recovery ops won't cause noise +- name: osd_recovery_op_warn_multiple + type: uint + level: advanced + default: 16 + with_legacy: true +# Max time to wait between notifying mon of shutdown and shutting down +- name: osd_mon_shutdown_timeout + type: float + level: advanced + default: 5 + with_legacy: true +# crash if the OSD has stray PG refs on shutdown +- name: osd_shutdown_pgref_assert + type: bool + level: advanced + default: false + with_legacy: true +# OSD's maximum object size +- name: osd_max_object_size + type: size + level: advanced + default: 128_M + fmt_desc: The maximum size of a RADOS object in bytes. 
+ with_legacy: true +# max rados object name len +- name: osd_max_object_name_len + type: uint + level: advanced + default: 2_K + with_legacy: true +# max rados object namespace len +- name: osd_max_object_namespace_len + type: uint + level: advanced + default: 256 + with_legacy: true +# max rados attr name len; cannot go higher than 100 chars for file system backends +- name: osd_max_attr_name_len + type: uint + level: advanced + default: 100 + with_legacy: true +- name: osd_max_attr_size + type: uint + level: advanced + default: 0 + with_legacy: true +- name: osd_max_omap_entries_per_request + type: uint + level: advanced + default: 1_K + with_legacy: true +- name: osd_max_omap_bytes_per_request + type: size + level: advanced + default: 1_G + with_legacy: true +# osd_recovery_op_warn_multiple scales the normal warning threshold, +# osd_op_complaint_time, so that slow recovery ops won't cause noise +- name: osd_max_write_op_reply_len + type: size + level: advanced + desc: Max size of the per-op payload for requests with the RETURNVEC flag set + long_desc: This value caps the amount of data (per op; a request may have many ops) + that will be sent back to the client and recorded in the PG log. 
+ default: 64 + with_legacy: true +- name: osd_objectstore + type: str + level: advanced + desc: backend type for an OSD (like filestore or bluestore) + default: bluestore + enum_values: + - bluestore + - filestore + - memstore + - kstore + - seastore + - cyanstore + flags: + - create + with_legacy: true +# true if LTTng-UST tracepoints should be enabled +- name: osd_objectstore_tracing + type: bool + level: advanced + default: false + with_legacy: true +- name: osd_objectstore_fuse + type: bool + level: advanced + default: false + with_legacy: true +- name: osd_bench_small_size_max_iops + type: uint + level: advanced + default: 100 + with_legacy: true +- name: osd_bench_large_size_max_throughput + type: size + level: advanced + default: 100_M + with_legacy: true +- name: osd_bench_max_block_size + type: size + level: advanced + default: 64_M + with_legacy: true +# duration of 'osd bench', capped at 30s to avoid triggering timeouts +- name: osd_bench_duration + type: uint + level: advanced + default: 30 + with_legacy: true +# create a blkin trace for all osd requests +- name: osd_blkin_trace_all + type: bool + level: advanced + default: false + with_legacy: true +# create a blkin trace for all objecter requests +- name: osdc_blkin_trace_all + type: bool + level: advanced + default: false + with_legacy: true +- name: osd_discard_disconnected_ops + type: bool + level: advanced + default: true + with_legacy: true +- name: osd_memory_target + type: size + level: basic + desc: When tcmalloc and cache autotuning is enabled, try to keep this many bytes + mapped in memory. + long_desc: The minimum value must be at least equal to osd_memory_base + osd_memory_cache_min. + fmt_desc: | + When TCMalloc is available and cache autotuning is enabled, try to + keep this many bytes mapped in memory. Note: This may not exactly + match the RSS memory usage of the process. 
While the total amount + of heap memory mapped by the process should usually be close + to this target, there is no guarantee that the kernel will actually + reclaim memory that has been unmapped. During initial development, + it was found that some kernels result in the OSD's RSS memory + exceeding the mapped memory by up to 20%. It is hypothesised + however, that the kernel generally may be more aggressive about + reclaiming unmapped memory when there is a high amount of memory + pressure. Your mileage may vary. + default: 4_G + see_also: + - bluestore_cache_autotune + - osd_memory_cache_min + - osd_memory_base + - osd_memory_target_autotune + min: 896_M + flags: + - runtime +- name: osd_memory_target_autotune + type: bool + default: false + level: advanced + desc: If enabled, allow orchestrator to automatically tune osd_memory_target + see_also: + - osd_memory_target +- name: osd_memory_target_cgroup_limit_ratio + type: float + level: advanced + desc: Set the default value for osd_memory_target to the cgroup memory limit (if + set) times this value + long_desc: A value of 0 disables this feature. + default: 0.8 + see_also: + - osd_memory_target + min: 0 + max: 1 +- name: osd_memory_base + type: size + level: dev + desc: When tcmalloc and cache autotuning is enabled, estimate the minimum amount + of memory in bytes the OSD will need. + fmt_desc: When TCMalloc and cache autotuning are enabled, estimate the minimum + amount of memory in bytes the OSD will need. This is used to help + the autotuner estimate the expected aggregate memory consumption of + the caches. + default: 768_M + see_also: + - bluestore_cache_autotune + flags: + - runtime +- name: osd_memory_expected_fragmentation + type: float + level: dev + desc: When tcmalloc and cache autotuning is enabled, estimate the percent of memory + fragmentation. + fmt_desc: When TCMalloc and cache autotuning is enabled, estimate the + percentage of memory fragmentation. 
This is used to help the + autotuner estimate the expected aggregate memory consumption + of the caches. + default: 0.15 + see_also: + - bluestore_cache_autotune + min: 0 + max: 1 + flags: + - runtime +- name: osd_memory_cache_min + type: size + level: dev + desc: When tcmalloc and cache autotuning is enabled, set the minimum amount of memory + used for caches. + fmt_desc: | + When TCMalloc and cache autotuning are enabled, set the minimum + amount of memory used for caches. Note: Setting this value too + low can result in significant cache thrashing. + default: 128_M + see_also: + - bluestore_cache_autotune + min: 128_M + flags: + - runtime +- name: osd_memory_cache_resize_interval + type: float + level: dev + desc: When tcmalloc and cache autotuning is enabled, wait this many seconds between + resizing caches. + fmt_desc: When TCMalloc and cache autotuning are enabled, wait this many + seconds between resizing caches. This setting changes the total + amount of memory available for BlueStore to use for caching. Note + that setting this interval too small can result in memory allocator + thrashing and lower performance. 
+ default: 1 + see_also: + - bluestore_cache_autotune +- name: memstore_device_bytes + type: size + level: advanced + default: 1_G + with_legacy: true +- name: memstore_page_set + type: bool + level: advanced + default: false + with_legacy: true +- name: memstore_page_size + type: size + level: advanced + default: 64_K + with_legacy: true +- name: memstore_debug_omit_block_device_write + type: bool + level: dev + desc: write metadata only + default: false + see_also: + - bluestore_debug_omit_block_device_write + with_legacy: true +- name: objectstore_blackhole + type: bool + level: advanced + default: false + with_legacy: true +- name: bdev_debug_inflight_ios + type: bool + level: dev + default: false + with_legacy: true +# if N>0, then ~ 1/N IOs will complete before we crash on flush +- name: bdev_inject_crash + type: int + level: dev + default: 0 + with_legacy: true +# wait N more seconds on flush +- name: bdev_inject_crash_flush_delay + type: int + level: dev + default: 2 + with_legacy: true +- name: bdev_aio + type: bool + level: advanced + default: true + with_legacy: true +# milliseconds +- name: bdev_aio_poll_ms + type: int + level: advanced + default: 250 + with_legacy: true +- name: bdev_aio_max_queue_depth + type: int + level: advanced + default: 1024 + with_legacy: true +- name: bdev_aio_reap_max + type: int + level: advanced + default: 16 + with_legacy: true +- name: bdev_block_size + type: size + level: advanced + default: 4_K + with_legacy: true +- name: bdev_read_buffer_alignment + type: size + level: advanced + default: 4_K + with_legacy: true +- name: bdev_read_preallocated_huge_buffers + type: str + level: advanced + desc: description of pools arrangement for huge page-based read buffers + long_desc: Arrangement of preallocated, huge pages-based pools for reading + from a KernelDevice. Applied to minimize size of scatter-gather lists + sent to NICs. Targets really big buffers (>= 2 or 4 MBs). 
+ Keep in mind the system must be configured accordingly (see /proc/sys/vm/nr_hugepages). + Otherwise the OSD will fail early. + Beware that BlueStore, by default, stores large chunks across many smaller blobs. + Increasing bluestore_max_blob_size changes that, and thus allows the data to + be read back into a small number of huge page-backed buffers. + fmt_desc: List of key=value pairs delimited by comma, semicolon or tab. + key specifies the targeted read size and must be expressed in bytes. + value specifies the number of preallocated buffers. + For instance, to preallocate 64 buffers that will be used to serve + 2 MB-sized read requests and 128 for 4 MB, set + "2097152=64,4194304=128". + see_also: + - bluestore_max_blob_size +- name: bdev_debug_aio + type: bool + level: dev + default: false + with_legacy: true +- name: bdev_debug_aio_suicide_timeout + type: float + level: dev + default: 1_min + with_legacy: true +- name: bdev_debug_aio_log_age + type: float + level: dev + default: 5 + with_legacy: true +# if yes, osd will unbind all NVMe devices from kernel driver and bind them +# to the uio_pci_generic driver. The purpose is to prevent the case where +# NVMe driver is loaded while osd is running. +- name: bdev_nvme_unbind_from_kernel + type: bool + level: advanced + default: false + with_legacy: true +- name: bdev_enable_discard + type: bool + level: advanced + default: false + with_legacy: true +- name: bdev_async_discard + type: bool + level: advanced + default: false + with_legacy: true +- name: bdev_flock_retry_interval + type: float + level: advanced + desc: interval to retry the flock + default: 0.1 +- name: bdev_flock_retry + type: uint + level: advanced + desc: times to retry the flock + long_desc: The number of times to retry on getting the block device lock. Programs + such as systemd-udevd may compete with Ceph for this lock. 0 means 'unlimited'.
+ default: 3 +- name: bluefs_alloc_size + type: size + level: advanced + desc: Allocation unit size for DB and WAL devices + default: 1_M + with_legacy: true +- name: bluefs_shared_alloc_size + type: size + level: advanced + desc: Allocation unit size for primary/shared device + default: 64_K + with_legacy: true +- name: bluefs_failed_shared_alloc_cooldown + type: float + level: advanced + desc: duration (in seconds) until the next attempt to use + 'bluefs_shared_alloc_size' after facing ENOSPC failure. + long_desc: Cooldown period (in seconds) when BlueFS uses shared/slow device + allocation size instead of 'bluefs_shared_alloc_size' one after facing + recoverable (via fallback to smaller chunk size) ENOSPC failure. Intended + primarily to avoid repetitive unsuccessful allocations which might be + expensive. + default: 600 + with_legacy: true +- name: bluefs_max_prefetch + type: size + level: advanced + default: 1_M + with_legacy: true +# alloc when we get this low +- name: bluefs_min_log_runway + type: size + level: advanced + default: 1_M + with_legacy: true +# alloc this much at a time +- name: bluefs_max_log_runway + type: size + level: advanced + default: 4_M + with_legacy: true +# before we consider +- name: bluefs_log_compact_min_ratio + type: float + level: advanced + default: 5 + with_legacy: true +# before we consider +- name: bluefs_log_compact_min_size + type: size + level: advanced + default: 16_M + with_legacy: true +# ignore flush until it's this big +- name: bluefs_min_flush_size + type: size + level: advanced + default: 512_K + with_legacy: true +# sync or async log compaction +- name: bluefs_compact_log_sync + type: bool + level: advanced + default: false + with_legacy: true +- name: bluefs_buffered_io + type: bool + level: advanced + desc: Enable buffered IO for bluefs reads. + long_desc: When this option is enabled, bluefs will in some cases perform buffered + reads.
This allows the kernel page cache to act as a secondary cache for things + like RocksDB block reads. For example, if the rocksdb block cache isn't large + enough to hold all blocks during OMAP iteration, it may be possible to read them + from page cache instead of from the disk. This can dramatically improve + performance when the osd_memory_target is too small to hold all entries in block + cache but it does come with downsides. It has been reported to occasionally + cause excessive kernel swapping (and associated stalls) under certain workloads. + Currently the best and most consistent performing combination appears to be + enabling bluefs_buffered_io and disabling system level swap. It is possible + that this recommendation may change in the future however. + default: true + with_legacy: true +- name: bluefs_sync_write + type: bool + level: advanced + default: false + with_legacy: true +- name: bluefs_allocator + type: str + level: dev + default: hybrid + enum_values: + - bitmap + - stupid + - avl + - hybrid + with_legacy: true +- name: bluefs_log_replay_check_allocations + type: bool + level: advanced + desc: Enables checks for allocations consistency during log replay + default: true + with_legacy: true +- name: bluefs_replay_recovery + type: bool + level: dev + desc: Attempt to read bluefs log so large that it became unreadable. + long_desc: If BlueFS log grows to extreme sizes (200GB+) it is likely that it becomes + unreadable. This option enables heuristics that scan devices for missing data. + DO NOT ENABLE BY DEFAULT + default: false + with_legacy: true +- name: bluefs_replay_recovery_disable_compact + type: bool + level: advanced + default: false + with_legacy: true +- name: bluefs_check_for_zeros + type: bool + level: dev + desc: Check data read for suspicious pages + long_desc: Looks into data read to check if there is a 4K block entirely filled + with zeros. If this happens, we re-read data. If there is a difference, we print + an error to the log.
+ default: false + see_also: + - bluestore_retry_disk_reads + flags: + - runtime + with_legacy: true +- name: bluefs_check_volume_selector_on_umount + type: bool + level: dev + desc: Check validity of volume selector on umount + long_desc: Checks whether the volume selector has diverged from the state it should be in. + The reference is constructed from the bluefs inode table. Asserts on inconsistency. + default: false + flags: + - runtime + with_legacy: true +- name: bluefs_check_volume_selector_often + type: bool + level: dev + desc: Periodically check validity of volume selector + long_desc: Periodically checks whether the current volume selector has diverged from the valid state. + The reference is constructed from the bluefs inode table. Asserts on inconsistency. This is a debug feature. + default: false + see_also: + - bluefs_check_volume_selector_on_umount + flags: + - startup + with_legacy: true +- name: bluestore_bluefs + type: bool + level: dev + desc: Use BlueFS to back rocksdb + long_desc: BlueFS allows rocksdb to share the same physical device(s) as the rest + of BlueStore. It should be used in all cases unless testing/developing an alternative + metadata database for BlueStore.
+ default: true + flags: + - create + with_legacy: true +# mirror to normal Env for debug +- name: bluestore_bluefs_env_mirror + type: bool + level: dev + desc: Mirror bluefs data to file system for testing/validation + default: false + flags: + - create + with_legacy: true +- name: bluestore_bluefs_max_free + type: size + level: advanced + default: 10_G + desc: Maximum free space allocated to BlueFS +- name: bluestore_bluefs_alloc_failure_dump_interval + type: float + level: advanced + desc: How frequently (in seconds) to dump allocator on BlueFS space allocation failure + default: 0 + with_legacy: true +- name: bluestore_spdk_mem + type: size + level: dev + desc: Amount of dpdk memory size in MB + long_desc: If running multiple SPDK instances per node, you must specify the amount + of dpdk memory size in MB each instance will use, to make sure each instance uses + its own dpdk memory + default: 512 +- name: bluestore_spdk_coremask + type: str + level: dev + desc: A hexadecimal bit mask of the cores to run on. Note the core numbering can + change between platforms and should be determined beforehand + default: '0x1' +- name: bluestore_spdk_max_io_completion + type: uint + level: dev + desc: Maximal I/Os to be batched completed while checking queue pair completions, + 0 means let spdk library determine it + default: 0 +- name: bluestore_spdk_io_sleep + type: uint + level: dev + desc: Time period to wait if there is no completed I/O from polling + default: 5 +# If you want to use spdk driver, you need to specify NVMe serial number here +# with "spdk:" prefix. +# Users can use 'lspci -vvv -d 8086:0953 | grep "Device Serial Number"' to +# get the serial number of Intel(R) Fultondale NVMe controllers. 
+# Example: +# bluestore_block_path = spdk:55cd2e404bd73932 +- name: bluestore_block_path + type: str + level: dev + desc: Path to block device/file + flags: + - create + with_legacy: true +- name: bluestore_block_size + type: size + level: dev + desc: Size of file to create for backing bluestore + default: 100_G + flags: + - create + with_legacy: true +- name: bluestore_block_create + type: bool + level: dev + desc: Create bluestore_block_path if it doesn't exist + default: true + see_also: + - bluestore_block_path + - bluestore_block_size + flags: + - create + with_legacy: true +- name: bluestore_block_db_path + type: str + level: dev + desc: Path for db block device + flags: + - create + with_legacy: true +# rocksdb ssts (hot/warm) +- name: bluestore_block_db_size + type: size + level: dev + desc: Size of file to create for bluestore_block_db_path + default: 0 + flags: + - create + with_legacy: true +- name: bluestore_block_db_create + type: bool + level: dev + desc: Create bluestore_block_db_path if it doesn't exist + default: false + see_also: + - bluestore_block_db_path + - bluestore_block_db_size + flags: + - create + with_legacy: true +- name: bluestore_block_wal_path + type: str + level: dev + desc: Path to block device/file backing bluefs wal + flags: + - create + with_legacy: true +# rocksdb wal +- name: bluestore_block_wal_size + type: size + level: dev + desc: Size of file to create for bluestore_block_wal_path + default: 96_M + flags: + - create + with_legacy: true +- name: bluestore_block_wal_create + type: bool + level: dev + desc: Create bluestore_block_wal_path if it doesn't exist + default: false + see_also: + - bluestore_block_wal_path + - bluestore_block_wal_size + flags: + - create + with_legacy: true +# whether to preallocate space if block/db_path/wal_path is a file rather than a block device.
+- name: bluestore_block_preallocate_file + type: bool + level: dev + desc: Preallocate file created via bluestore_block*_create + default: false + flags: + - create + with_legacy: true +- name: bluestore_ignore_data_csum + type: bool + level: dev + desc: Ignore checksum errors on read and do not generate an EIO error + default: false + flags: + - runtime + with_legacy: true +- name: bluestore_csum_type + type: str + level: advanced + desc: Default checksum algorithm to use + long_desc: crc32c, xxhash32, and xxhash64 are available. The _16 and _8 variants + use only a subset of the bits for more compact (but less reliable) checksumming. + fmt_desc: The default checksum algorithm to use. + default: crc32c + enum_values: + - none + - crc32c + - crc32c_16 + - crc32c_8 + - xxhash32 + - xxhash64 + flags: + - runtime + with_legacy: true +- name: bluestore_retry_disk_reads + type: uint + level: advanced + desc: Number of read retries on checksum validation error + long_desc: Retries to read data from the disk this many times when checksum validation + fails to handle spurious read errors gracefully. + default: 3 + min: 0 + max: 255 + flags: + - runtime + with_legacy: true +- name: bluestore_min_alloc_size + type: uint + level: advanced + desc: Minimum allocation size to allocate for an object + long_desc: A smaller allocation size generally means less data is read and then + rewritten when a copy-on-write operation is triggered (e.g., when writing to something + that was recently snapshotted). Similarly, less data is journaled before performing + an overwrite (writes smaller than min_alloc_size must first pass through the BlueStore + journal). Larger values of min_alloc_size reduce the amount of metadata required + to describe the on-disk layout and reduce overall fragmentation. 
+ default: 0 + flags: + - create + with_legacy: true +- name: bluestore_min_alloc_size_hdd + type: size + level: advanced + desc: Default min_alloc_size value for rotational media + default: 4_K + see_also: + - bluestore_min_alloc_size + flags: + - create + with_legacy: true +- name: bluestore_min_alloc_size_ssd + type: size + level: advanced + desc: Default min_alloc_size value for non-rotational (solid state) media + default: 4_K + see_also: + - bluestore_min_alloc_size + flags: + - create + with_legacy: true +- name: bluestore_use_optimal_io_size_for_min_alloc_size + type: bool + level: advanced + desc: Discover media optimal IO Size and use for min_alloc_size + default: false + see_also: + - bluestore_min_alloc_size + flags: + - create + with_legacy: true +- name: bluestore_max_alloc_size + type: size + level: advanced + desc: Maximum size of a single allocation (0 for no max) + default: 0 + flags: + - create + with_legacy: true +- name: bluestore_prefer_deferred_size + type: size + level: advanced + desc: Writes smaller than this size will be written to the journal and then asynchronously + written to the device. This can be beneficial when using rotational media where + seeks are expensive, and is helpful both with and without solid state journal/wal + devices. 
+ default: 0 + flags: + - runtime + with_legacy: true +- name: bluestore_prefer_deferred_size_hdd + type: size + level: advanced + desc: Default bluestore_prefer_deferred_size for rotational media + default: 64_K + see_also: + - bluestore_prefer_deferred_size + flags: + - runtime + with_legacy: true +- name: bluestore_prefer_deferred_size_ssd + type: size + level: advanced + desc: Default bluestore_prefer_deferred_size for non-rotational (solid state) media + default: 0 + see_also: + - bluestore_prefer_deferred_size + flags: + - runtime + with_legacy: true +- name: bluestore_compression_mode + type: str + level: advanced + desc: Default policy for using compression when pool does not specify + long_desc: '''none'' means never use compression. ''passive'' means use compression + when clients hint that data is compressible. ''aggressive'' means use compression + unless clients hint that data is not compressible. This option is used when the + per-pool property for the compression mode is not present.' + fmt_desc: The default policy for using compression if the per-pool property + ``compression_mode`` is not set. ``none`` means never use + compression. ``passive`` means use compression when + :c:func:`clients hint <rados_set_alloc_hint>` that data is + compressible. ``aggressive`` means use compression unless + clients hint that data is not compressible. ``force`` means use + compression under all circumstances even if the clients hint that + the data is not compressible. + default: none + enum_values: + - none + - passive + - aggressive + - force + flags: + - runtime + with_legacy: true +- name: bluestore_compression_algorithm + type: str + level: advanced + desc: Default compression algorithm to use when writing object data + long_desc: This controls the default compressor to use (if any) if the per-pool + property is not set. Note that zstd is *not* recommended for bluestore due to + high CPU overhead when compressing small amounts of data. 
+ fmt_desc: The default compressor to use (if any) if the per-pool property + ``compression_algorithm`` is not set. Note that ``zstd`` is *not* + recommended for BlueStore due to high CPU overhead when + compressing small amounts of data. + default: snappy + enum_values: + - '' + - snappy + - zlib + - zstd + - lz4 + flags: + - runtime + with_legacy: true +- name: bluestore_compression_min_blob_size + type: size + level: advanced + desc: Maximum chunk size to apply compression to when random access is expected + for an object. + long_desc: Chunks larger than this are broken into smaller chunks before being compressed + fmt_desc: Chunks smaller than this are never compressed. + The per-pool property ``compression_min_blob_size`` overrides + this setting. + default: 0 + flags: + - runtime + with_legacy: true +- name: bluestore_compression_min_blob_size_hdd + type: size + level: advanced + desc: Default value of bluestore_compression_min_blob_size for rotational media + fmt_desc: Default value of ``bluestore compression min blob size`` + for rotational media. + default: 8_K + see_also: + - bluestore_compression_min_blob_size + flags: + - runtime + with_legacy: true +- name: bluestore_compression_min_blob_size_ssd + type: size + level: advanced + desc: Default value of bluestore_compression_min_blob_size for non-rotational (solid + state) media + fmt_desc: Default value of ``bluestore compression min blob size`` + for non-rotational (solid state) media. + default: 64_K + see_also: + - bluestore_compression_min_blob_size + flags: + - runtime + with_legacy: true +- name: bluestore_compression_max_blob_size + type: size + level: advanced + desc: Maximum chunk size to apply compression to when non-random access is expected + for an object. 
+ long_desc: Chunks larger than this are broken into smaller chunks before being compressed + fmt_desc: Chunks larger than this value are broken into smaller blobs of at most + ``bluestore_compression_max_blob_size`` bytes before being compressed. + The per-pool property ``compression_max_blob_size`` overrides + this setting. + default: 0 + flags: + - runtime + with_legacy: true +- name: bluestore_compression_max_blob_size_hdd + type: size + level: advanced + desc: Default value of bluestore_compression_max_blob_size for rotational media + fmt_desc: Default value of ``bluestore compression max blob size`` + for rotational media. + default: 64_K + see_also: + - bluestore_compression_max_blob_size + flags: + - runtime + with_legacy: true +- name: bluestore_compression_max_blob_size_ssd + type: size + level: advanced + desc: Default value of bluestore_compression_max_blob_size for non-rotational (solid + state) media + fmt_desc: Default value of ``bluestore compression max blob size`` + for non-rotational (SSD, NVMe) media. + default: 64_K + see_also: + - bluestore_compression_max_blob_size + flags: + - runtime + with_legacy: true +# Specifies minimum expected amount of saved allocation units +# per single blob to enable compressed blobs garbage collection +- name: bluestore_gc_enable_blob_threshold + type: int + level: dev + default: 0 + flags: + - runtime + with_legacy: true +# Specifies minimum expected amount of saved allocation units +# across all blobs to enable compressed blobs garbage collection +- name: bluestore_gc_enable_total_threshold + type: int + level: dev + default: 0 + flags: + - runtime + with_legacy: true +- name: bluestore_max_blob_size + type: size + level: dev + long_desc: Bluestore blobs are collections of extents (i.e. on-disk data) originating + from one or more objects. Blobs can be compressed, typically have checksum data, + may be overwritten, may be shared (with an extent ref map), or split.
This setting + controls the maximum size a blob is allowed to be. + default: 0 + flags: + - runtime + with_legacy: true +- name: bluestore_max_blob_size_hdd + type: size + level: dev + default: 64_K + see_also: + - bluestore_max_blob_size + flags: + - runtime + with_legacy: true +- name: bluestore_max_blob_size_ssd + type: size + level: dev + default: 64_K + see_also: + - bluestore_max_blob_size + flags: + - runtime + with_legacy: true +# Require the net gain of compression at least to be at this ratio, +# otherwise we don't compress. +# And ask for compressing at least 12.5%(1/8) off, by default. +- name: bluestore_compression_required_ratio + type: float + level: advanced + desc: Compression ratio required to store compressed data + long_desc: If we compress data and get less than this we discard the result and + store the original uncompressed data. + fmt_desc: The ratio of the size of the data chunk after + compression relative to the original size must be at + least this small in order to store the compressed + version. 
+ default: 0.875 + flags: + - runtime + with_legacy: true +- name: bluestore_extent_map_shard_max_size + type: size + level: dev + desc: Max size (bytes) for a single extent map shard before splitting + default: 1200 + with_legacy: true +- name: bluestore_extent_map_shard_target_size + type: size + level: dev + desc: Target size (bytes) for a single extent map shard + default: 500 + with_legacy: true +- name: bluestore_extent_map_shard_min_size + type: size + level: dev + desc: Min size (bytes) for a single extent map shard before merging + default: 150 + with_legacy: true +- name: bluestore_extent_map_shard_target_size_slop + type: float + level: dev + desc: Ratio above/below target for a shard when trying to align to an existing extent + or blob boundary + default: 0.2 + with_legacy: true +- name: bluestore_extent_map_inline_shard_prealloc_size + type: size + level: dev + desc: Preallocated buffer for inline shards + default: 256 + with_legacy: true +- name: bluestore_cache_trim_interval + type: float + level: advanced + desc: How frequently we trim the bluestore cache + default: 0.05 + with_legacy: true +- name: bluestore_cache_trim_max_skip_pinned + type: uint + level: dev + desc: Max pinned cache entries we consider before giving up + default: 1000 + with_legacy: true +- name: bluestore_cache_type + type: str + level: dev + desc: Cache replacement algorithm + default: 2q + enum_values: + - 2q + - lru + with_legacy: true +- name: bluestore_2q_cache_kin_ratio + type: float + level: dev + desc: 2Q paper suggests .5 + default: 0.5 + with_legacy: true +- name: bluestore_2q_cache_kout_ratio + type: float + level: dev + desc: 2Q paper suggests .5 + default: 0.5 + with_legacy: true +- name: bluestore_cache_size + type: size + level: dev + desc: Cache size (in bytes) for BlueStore + long_desc: This includes data and metadata cached by BlueStore as well as memory + devoted to rocksdb's cache(s). + fmt_desc: The amount of memory BlueStore will use for its cache. 
If zero, + ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will + be used instead. + default: 0 + with_legacy: true +- name: bluestore_cache_size_hdd + type: size + level: dev + desc: Default bluestore_cache_size for rotational media + fmt_desc: The default amount of memory BlueStore will use for its cache when + backed by an HDD. + default: 1_G + see_also: + - bluestore_cache_size + with_legacy: true +- name: bluestore_cache_size_ssd + type: size + level: dev + desc: Default bluestore_cache_size for non-rotational (solid state) media + fmt_desc: The default amount of memory BlueStore will use for its cache when + backed by an SSD. + default: 3_G + see_also: + - bluestore_cache_size + with_legacy: true +- name: bluestore_cache_meta_ratio + type: float + level: dev + desc: Ratio of bluestore cache to devote to metadata + default: 0.45 + see_also: + - bluestore_cache_size + with_legacy: true +- name: bluestore_cache_kv_ratio + type: float + level: dev + desc: Ratio of bluestore cache to devote to key/value database (RocksDB) + default: 0.45 + see_also: + - bluestore_cache_size + with_legacy: true +- name: bluestore_cache_kv_onode_ratio + type: float + level: dev + desc: Ratio of bluestore cache to devote to kv onode column family (rocksdb) + default: 0.04 + see_also: + - bluestore_cache_size +- name: bluestore_cache_autotune + type: bool + level: dev + desc: Automatically tune the ratio of caches while respecting min values. + fmt_desc: Automatically tune the space ratios assigned to various BlueStore + caches while respecting minimum values. + default: true + see_also: + - bluestore_cache_size + - bluestore_cache_meta_ratio +- name: bluestore_cache_autotune_interval + type: float + level: dev + desc: The number of seconds to wait between rebalances when cache autotune is enabled. + fmt_desc: | + The number of seconds to wait between rebalances when cache autotune is + enabled. 
`bluestore_cache_autotune_interval` sets the speed at which Ceph + recomputes the allocation ratios of various caches. Note: Setting this + interval too small can result in high CPU usage and lower performance. + default: 5 + see_also: + - bluestore_cache_autotune +- name: bluestore_cache_age_bin_interval + type: float + level: dev + desc: The duration (in seconds) represented by a single cache age bin. + fmt_desc: | + The caches used by bluestore will assign cache entries to an 'age bin' + that represents a period of time during which that cache entry was most + recently updated. By binning the caches in this way, Ceph's priority + cache balancing code can make better decisions about which caches should + receive priority based on the relative ages of items in the caches. By + default, a single cache age bin represents 1 second of time. Note: + Setting this interval too small can result in high CPU usage and lower + performance. + default: 1 + see_also: + - bluestore_cache_age_bins_kv + - bluestore_cache_age_bins_kv_onode + - bluestore_cache_age_bins_meta + - bluestore_cache_age_bins_data +- name: bluestore_cache_age_bins_kv + type: str + level: dev + desc: A 10 element, space separated list of age bins for kv cache + fmt_desc: | + A 10 element, space separated list of cache age bins grouped by + priority such that PRI1=[0,n), PRI2=[n,n+1), PRI3=[n+1,n+2) ... + PRI10=[n+8,n+9). Values represent the starting and ending bin for each + priority level. A 0 in the 2nd term will prevent any items from being + associated with that priority. bin duration is based on the + bluestore_cache_age_bin_interval value. For example, + "1 5 0 0 0 0 0 0 0 0" defines bin ranges for two priority levels. PRI1 + contains 1 age bin. Assuming the default age bin interval of 1 second, + PRI1 represents cache items that are less than 1 second old. PRI2 has 4 + bins representing cache items that are 1 to less than 5 seconds old. 
All + other cache items in this example are associated with the lowest priority + level as PRI3-PRI10 all have 0s in their second term. + default: "1 2 6 24 120 720 0 0 0 0" + see_also: + - bluestore_cache_age_bin_interval +- name: bluestore_cache_age_bins_kv_onode + type: str + level: dev + desc: A 10 element, space separated list of age bins for kv onode cache + fmt_desc: | + A 10 element, space separated list of cache age bins grouped by + priority such that PRI1=[0,n), PRI2=[n,n+1), PRI3=[n+1,n+2) ... + PRI10=[n+8,n+9). Values represent the starting and ending bin for each + priority level. A 0 in the 2nd term will prevent any items from being + associated with that priority. bin duration is based on the + bluestore_cache_age_bin_interval value. For example, + "1 5 0 0 0 0 0 0 0 0" defines bin ranges for two priority levels. PRI1 + contains 1 age bin. Assuming the default age bin interval of 1 second, + PRI1 represents cache items that are less than 1 second old. PRI2 has 4 + bins representing cache items that are 1 to less than 5 seconds old. All + other cache items in this example are associated with the lowest priority + level as PRI3-PRI10 all have 0s in their second term. + default: "0 0 0 0 0 0 0 0 0 720" + see_also: + - bluestore_cache_age_bin_interval +- name: bluestore_cache_age_bins_meta + type: str + level: dev + desc: A 10 element, space separated list of age bins for onode cache + fmt_desc: | + A 10 element, space separated list of cache age bins grouped by + priority such that PRI1=[0,n), PRI2=[n,n+1), PRI3=[n+1,n+2) ... + PRI10=[n+8,n+9). Values represent the starting and ending bin for each + priority level. A 0 in the 2nd term will prevent any items from being + associated with that priority. bin duration is based on the + bluestore_cache_age_bin_interval value. For example, + "1 5 0 0 0 0 0 0 0 0" defines bin ranges for two priority levels. PRI1 + contains 1 age bin. 
Assuming the default age bin interval of 1 second, + PRI1 represents cache items that are less than 1 second old. PRI2 has 4 + bins representing cache items that are 1 to less than 5 seconds old. All + other cache items in this example are associated with the lowest priority + level as PRI3-PRI10 all have 0s in their second term. + default: "1 2 6 24 120 720 0 0 0 0" + see_also: + - bluestore_cache_age_bin_interval +- name: bluestore_cache_age_bins_data + type: str + level: dev + desc: A 10 element, space separated list of age bins for data cache + fmt_desc: | + A 10 element, space separated list of cache age bins grouped by + priority such that PRI1=[0,n), PRI2=[n,n+1), PRI3=[n+1,n+2) ... + PRI10=[n+8,n+9). Values represent the starting and ending bin for each + priority level. A 0 in the 2nd term will prevent any items from being + associated with that priority. bin duration is based on the + bluestore_cache_age_bin_interval value. For example, + "1 5 0 0 0 0 0 0 0 0" defines bin ranges for two priority levels. PRI1 + contains 1 age bin. Assuming the default age bin interval of 1 second, + PRI1 represents cache items that are less than 1 second old. PRI2 has 4 + bins representing cache items that are 1 to less than 5 seconds old. All + other cache items in this example are associated with the lowest priority + level as PRI3-PRI10 all have 0s in their second term. + default: "1 2 6 24 120 720 0 0 0 0" + see_also: + - bluestore_cache_age_bin_interval +- name: bluestore_alloc_stats_dump_interval + type: float + level: dev + desc: The period (in seconds) for logging allocation statistics. + default: 1_day + with_legacy: true +- name: bluestore_kvbackend + type: str + level: dev + desc: Key value database to use for bluestore + default: rocksdb + flags: + - create + with_legacy: true +- name: bluestore_allocator + type: str + level: advanced + desc: Allocator policy + long_desc: Allocator to use for bluestore. Stupid should only be used for testing.
+ default: hybrid + enum_values: + - bitmap + - stupid + - avl + - hybrid + - zoned + with_legacy: true +- name: bluestore_freelist_blocks_per_key + type: size + level: dev + desc: Block (and bits) per database key + default: 128 + with_legacy: true +- name: bluestore_bitmapallocator_blocks_per_zone + type: size + level: dev + default: 1_K + with_legacy: true +- name: bluestore_bitmapallocator_span_size + type: size + level: dev + default: 1_K + with_legacy: true +- name: bluestore_max_deferred_txc + type: uint + level: advanced + desc: Max transactions with deferred writes that can accumulate before we force + flush deferred writes + default: 32 + with_legacy: true +- name: bluestore_max_defer_interval + type: float + level: advanced + desc: max duration to force deferred submit + default: 3 + with_legacy: true +- name: bluestore_rocksdb_options + type: str + level: advanced + desc: Full set of rocksdb settings to override + default: compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=6,compaction_style=kCompactionStyleLevel,write_buffer_size=16777216,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0 + with_legacy: true +- name: bluestore_rocksdb_options_annex + type: str + level: advanced + desc: An addition to bluestore_rocksdb_options. Allows setting rocksdb options without + repeating the existing defaults. + with_legacy: true +- name: bluestore_rocksdb_cf + type: bool + level: advanced + desc: Enable use of rocksdb column families for bluestore metadata + fmt_desc: Enables sharding of BlueStore's RocksDB. + When ``true``, ``bluestore_rocksdb_cfs`` is used. + Only applied when OSD is doing ``--mkfs``. 
+ default: true + verbatim: | + #ifdef WITH_SEASTAR + // This is necessary as the Seastar's allocator imposes restrictions + // on the number of threads that entered malloc/free/*. Unfortunately, + // RocksDB sharding in BlueStore dramatically lifted the number of + // threads spawned during RocksDB's init. + .set_validator([](std::string *value, std::string *error_message) { + if (const bool parsed_value = strict_strtob(value->c_str(), error_message); + error_message->empty() && parsed_value) { + *error_message = "invalid BlueStore sharding configuration." + " Be aware any change takes effect only on mkfs!"; + return -EINVAL; + } else { + return 0; + } + }) + #endif +- name: bluestore_rocksdb_cfs + type: str + level: dev + desc: Definition of column families and their sharding + long_desc: 'Space separated list of elements: column_def [ ''='' rocksdb_options + ]. column_def := column_name [ ''('' shard_count [ '','' hash_begin ''-'' [ hash_end + ] ] '')'' ]. Example: ''I=write_buffer_size=1048576 O(6) m(7,10-)''. Interval + [hash_begin..hash_end) defines characters to use for hash calculation. Recommended + hash ranges: O(0-13) P(0-8) m(0-16). Sharding of S,T,C,M,B prefixes is inadvisable' + fmt_desc: Definition of BlueStore's RocksDB sharding. + The optimal value depends on multiple factors, and modification is inadvisable. + This setting is used only when OSD is doing ``--mkfs``. + Next runs of OSD retrieve sharding from disk.
+ default: m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L=min_write_buffer_number_to_merge=32 P=min_write_buffer_number_to_merge=32 +- name: bluestore_qfsck_on_mount + type: bool + level: dev + desc: Run quick-fsck at mount comparing allocation-file to RocksDB allocation state + default: true + with_legacy: true +- name: bluestore_fsck_on_mount + type: bool + level: dev + desc: Run fsck at mount + default: false + with_legacy: true +- name: bluestore_fsck_on_mount_deep + type: bool + level: dev + desc: Run deep fsck at mount when bluestore_fsck_on_mount is set to true + default: false + with_legacy: true +- name: bluestore_fsck_quick_fix_on_mount + type: bool + level: dev + desc: Do quick-fix for the store at mount + default: false + with_legacy: true +- name: bluestore_fsck_on_umount + type: bool + level: dev + desc: Run fsck at umount + default: false + with_legacy: true +- name: bluestore_allocation_from_file + type: bool + level: dev + desc: Remove allocation info from RocksDB and store the info in a new allocation file + default: true + with_legacy: true +- name: bluestore_debug_inject_allocation_from_file_failure + type: float + level: dev + desc: Enables random error injections when restoring allocation map from file. + long_desc: Specifies error injection probability for restoring allocation map from file + hence causing full recovery. Intended primarily for testing. 
+ default: 0 + with_legacy: true +- name: bluestore_fsck_on_umount_deep + type: bool + level: dev + desc: Run deep fsck at umount when bluestore_fsck_on_umount is set to true + default: false + with_legacy: true +- name: bluestore_fsck_on_mkfs + type: bool + level: dev + desc: Run fsck after mkfs + default: true + with_legacy: true +- name: bluestore_fsck_on_mkfs_deep + type: bool + level: dev + desc: Run deep fsck after mkfs + default: false + with_legacy: true +- name: bluestore_sync_submit_transaction + type: bool + level: dev + desc: Try to submit metadata transaction to rocksdb in queuing thread context + default: false + with_legacy: true +- name: bluestore_fsck_read_bytes_cap + type: size + level: advanced + desc: Maximum bytes read at once by deep fsck + default: 64_M + flags: + - runtime + with_legacy: true +- name: bluestore_fsck_quick_fix_threads + type: int + level: advanced + desc: Number of additional threads to perform quick-fix (shallow fsck) command + default: 2 + with_legacy: true +- name: bluestore_fsck_shared_blob_tracker_size + type: float + level: dev + desc: Size(a fraction of osd_memory_target, defaults to 128MB) of a hash table to track shared blobs ref counts. Higher the size, more precise is the tracker -> less overhead during the repair. 
+ default: 0.03125 + see_also: + - osd_memory_target + flags: + - runtime +- name: bluestore_throttle_bytes + type: size + level: advanced + desc: Maximum bytes in flight before we throttle IO submission + default: 64_M + flags: + - runtime + with_legacy: true +- name: bluestore_throttle_deferred_bytes + type: size + level: advanced + desc: Maximum bytes for deferred writes before we throttle IO submission + default: 128_M + flags: + - runtime + with_legacy: true +- name: bluestore_throttle_cost_per_io + type: size + level: advanced + desc: Overhead added to transaction cost (in bytes) for each IO + default: 0 + flags: + - runtime + with_legacy: true +- name: bluestore_throttle_cost_per_io_hdd + type: uint + level: advanced + desc: Default bluestore_throttle_cost_per_io for rotational media + default: 670000 + see_also: + - bluestore_throttle_cost_per_io + flags: + - runtime + with_legacy: true +- name: bluestore_throttle_cost_per_io_ssd + type: uint + level: advanced + desc: Default bluestore_throttle_cost_per_io for non-rotational (solid state) media + default: 4000 + see_also: + - bluestore_throttle_cost_per_io + flags: + - runtime + with_legacy: true +- name: bluestore_deferred_batch_ops + type: uint + level: advanced + desc: Max number of deferred writes before we flush the deferred write queue + default: 0 + min: 0 + max: 65535 + flags: + - runtime + with_legacy: true +- name: bluestore_deferred_batch_ops_hdd + type: uint + level: advanced + desc: Default bluestore_deferred_batch_ops for rotational media + default: 64 + see_also: + - bluestore_deferred_batch_ops + min: 0 + max: 65535 + flags: + - runtime + with_legacy: true +- name: bluestore_deferred_batch_ops_ssd + type: uint + level: advanced + desc: Default bluestore_deferred_batch_ops for non-rotational (solid state) media + default: 16 + see_also: + - bluestore_deferred_batch_ops + min: 0 + max: 65535 + flags: + - runtime + with_legacy: true +- name: bluestore_nid_prealloc + type: int + level: dev + desc:
Number of unique object ids to preallocate at a time + default: 1024 + with_legacy: true +- name: bluestore_blobid_prealloc + type: uint + level: dev + desc: Number of unique blob ids to preallocate at a time + default: 10_K + with_legacy: true +- name: bluestore_clone_cow + type: bool + level: advanced + desc: Use copy-on-write when cloning objects (versus reading and rewriting them + at clone time) + default: true + flags: + - runtime + with_legacy: true +- name: bluestore_default_buffered_read + type: bool + level: advanced + desc: Cache read results by default (unless hinted NOCACHE or WONTNEED) + default: true + flags: + - runtime + with_legacy: true +- name: bluestore_default_buffered_write + type: bool + level: advanced + desc: Cache writes by default (unless hinted NOCACHE or WONTNEED) + default: false + flags: + - runtime + with_legacy: true +- name: bluestore_debug_no_reuse_blocks + type: bool + level: dev + default: false + with_legacy: true +- name: bluestore_debug_small_allocations + type: int + level: dev + default: 0 + with_legacy: true +- name: bluestore_debug_too_many_blobs_threshold + type: int + level: dev + default: 24576 + with_legacy: true +- name: bluestore_debug_freelist + type: bool + level: dev + default: false + with_legacy: true +- name: bluestore_debug_prefill + type: float + level: dev + desc: simulate fragmentation + default: 0 + with_legacy: true +- name: bluestore_debug_prefragment_max + type: size + level: dev + default: 1_M + with_legacy: true +- name: bluestore_debug_inject_read_err + type: bool + level: dev + default: false + with_legacy: true +- name: bluestore_debug_randomize_serial_transaction + type: int + level: dev + default: 0 + with_legacy: true +- name: bluestore_debug_omit_block_device_write + type: bool + level: dev + default: false + with_legacy: true +- name: bluestore_debug_fsck_abort + type: bool + level: dev + default: false + with_legacy: true +- name: bluestore_debug_omit_kv_commit + type: bool + level: dev + 
default: false + with_legacy: true +- name: bluestore_debug_permit_any_bdev_label + type: bool + level: dev + default: false + with_legacy: true +- name: bluestore_debug_random_read_err + type: float + level: dev + default: 0 + with_legacy: true +- name: bluestore_debug_inject_bug21040 + type: bool + level: dev + default: false + with_legacy: true +- name: bluestore_debug_inject_csum_err_probability + type: float + level: dev + desc: inject crc verification errors into bluestore device reads + default: 0 + with_legacy: true +- name: bluestore_debug_legacy_omap + type: bool + level: dev + desc: Allows mkfs to create OSD in legacy OMAP naming mode (neither per-pool nor per-pg). + This is intended primarily for developers' purposes. The resulting OSD might/would + be transformed to the currently default 'per-pg' format when BlueStore's quick-fix or + repair are applied. + default: false + with_legacy: true +- name: bluestore_fsck_error_on_no_per_pool_stats + type: bool + level: advanced + desc: Make fsck error (instead of warn) when bluestore lacks per-pool stats, e.g., + after an upgrade + default: false + with_legacy: true +- name: bluestore_warn_on_bluefs_spillover + type: bool + level: advanced + desc: Enable health indication on bluefs slow device usage + default: true + with_legacy: true +- name: bluestore_warn_on_legacy_statfs + type: bool + level: advanced + desc: Enable health indication on lack of per-pool statfs reporting from bluestore + default: true + with_legacy: true +- name: bluestore_warn_on_spurious_read_errors + type: bool + level: advanced + desc: Enable health indication when spurious read errors are observed by OSD + default: true + with_legacy: true +- name: bluestore_fsck_error_on_no_per_pool_omap + type: bool + level: advanced + desc: Make fsck error (instead of warn) when objects without per-pool omap are found + default: false + with_legacy: true +- name: bluestore_fsck_error_on_no_per_pg_omap + type: bool + level: advanced + desc: Make
fsck error (instead of warn) when objects without per-pg omap are found + default: false + with_legacy: true +- name: bluestore_warn_on_no_per_pool_omap + type: bool + level: advanced + desc: Enable health indication on lack of per-pool omap + default: true + with_legacy: true +- name: bluestore_warn_on_no_per_pg_omap + type: bool + level: advanced + desc: Enable health indication on lack of per-pg omap + default: false + with_legacy: true +- name: bluestore_log_op_age + type: float + level: advanced + desc: log operation if it's slower than this age (seconds) + default: 5 + with_legacy: true +- name: bluestore_log_omap_iterator_age + type: float + level: advanced + desc: log omap iteration operation if it's slower than this age (seconds) + default: 5 + with_legacy: true +- name: bluestore_log_collection_list_age + type: float + level: advanced + desc: log collection list operation if it's slower than this age (seconds) + default: 1_min + with_legacy: true +- name: bluestore_debug_enforce_settings + type: str + level: dev + desc: Enforces specific hw profile settings + long_desc: '''hdd'' enforces settings intended for BlueStore above a rotational + drive. ''ssd'' enforces settings intended for BlueStore above a solid-state drive. ''default'' + - using settings for the actual hardware.' + default: default + enum_values: + - default + - hdd + - ssd + with_legacy: true +- name: bluestore_avl_alloc_ff_max_search_count + type: uint + level: dev + desc: Search for this many ranges in first-fit mode before switching over to + best-fit mode. 0 to iterate through all ranges for required chunk. + default: 100 +- name: bluestore_avl_alloc_ff_max_search_bytes + type: size + level: dev + desc: Maximum distance to search in first-fit mode before switching over to + best-fit mode. 0 to iterate through all ranges for required chunk.
+ default: 16_M +- name: bluestore_avl_alloc_bf_threshold + type: uint + level: dev + desc: Sets threshold at which shrinking max free chunk size triggers enabling best-fit + mode. + long_desc: 'AVL allocator works in two modes: near-fit and best-fit. By default, + it uses very fast near-fit mode, in which it tries to fit a new block near the + last allocated block of similar size. The second mode is much slower best-fit + mode, in which it tries to find an exact match for the requested allocation. This + mode is used when either the device gets fragmented or when it is low on free + space. When the largest free block is smaller than ''bluestore_avl_alloc_bf_threshold'', + best-fit mode is used.' + default: 128_K + see_also: + - bluestore_avl_alloc_bf_free_pct +- name: bluestore_avl_alloc_bf_free_pct + type: uint + level: dev + desc: Sets threshold at which shrinking free space (in %, integer) triggers enabling + best-fit mode. + long_desc: 'AVL allocator works in two modes: near-fit and best-fit. By default, + it uses very fast near-fit mode, in which it tries to fit a new block near the + last allocated block of similar size. The second mode is much slower best-fit + mode, in which it tries to find an exact match for the requested allocation. This + mode is used when either the device gets fragmented or when it is low on free + space. When free space is smaller than ''bluestore_avl_alloc_bf_free_pct'', best-fit + mode is used.' + default: 4 + see_also: + - bluestore_avl_alloc_bf_threshold +- name: bluestore_hybrid_alloc_mem_cap + type: uint + level: dev + desc: Maximum RAM hybrid allocator should use before enabling bitmap supplement + default: 64_M +- name: bluestore_volume_selection_policy + type: str + level: dev + desc: Determines bluefs volume selection policy + long_desc: Determines bluefs volume selection policy. 
'use_some_extra*' policy allows + to override RocksDB level granularity and put high level's data to faster device + even when the level doesn't completely fit there. 'fit_to_fast' policy enables + using 100% of faster disk capacity and allows the user to turn on 'level_compaction_dynamic_level_bytes' + option in RocksDB options. + default: use_some_extra + enum_values: + - rocksdb_original + - use_some_extra + - use_some_extra_enforced + - fit_to_fast + with_legacy: true +- name: bluestore_volume_selection_reserved_factor + type: float + level: advanced + desc: DB level size multiplier. Determines amount of space at DB device to bar from + the usage when 'use some extra' policy is in action. Reserved size is determined + as sum(L_max_size[0], L_max_size[L-1]) + L_max_size[L] * this_factor + default: 2 + flags: + - startup + with_legacy: true +- name: bluestore_volume_selection_reserved + type: int + level: advanced + desc: Space reserved at DB device and not allowed for 'use some extra' policy usage. + Overrides 'bluestore_volume_selection_reserved_factor' setting and introduces + straightforward limit. 
+ default: 0 + flags: + - startup + with_legacy: true +- name: bdev_ioring + type: bool + level: advanced + desc: Enables Linux io_uring API instead of libaio + default: false +- name: bdev_ioring_hipri + type: bool + level: advanced + desc: Enables Linux io_uring API Use polled IO completions + default: false +- name: bdev_ioring_sqthread_poll + type: bool + level: advanced + desc: Enables Linux io_uring API Offload submission/completion to kernel thread + default: false +- name: bluestore_kv_sync_util_logging_s + type: float + level: advanced + desc: KV sync thread utilization logging period + long_desc: How often (in seconds) to print KV sync thread utilization, not logged + when set to 0 or when utilization is 0% + default: 10 + flags: + - runtime + with_legacy: true +- name: bluestore_fail_eio + type: bool + level: dev + desc: fail/crash on EIO + long_desc: whether bluestore osd fails on eio + default: false + flags: + - runtime + with_legacy: true +- name: bluestore_zero_block_detection + type: bool + level: dev + desc: punch holes instead of writing zeros + long_desc: Intended for large-scale synthetic testing. Currently this is implemented + with punch hole semantics, affecting the logical extent map of the object. This does + not interact well with some RBD and CephFS features. + default: false + flags: + - runtime + with_legacy: true +- name: kstore_max_ops + type: uint + level: advanced + default: 512 + with_legacy: true +- name: kstore_max_bytes + type: size + level: advanced + default: 64_M + with_legacy: true +- name: kstore_backend + type: str + level: advanced + default: rocksdb + with_legacy: true +- name: kstore_rocksdb_options + type: str + level: advanced + desc: Options to pass through when RocksDB is used as the KeyValueDB for kstore. + default: compression=kNoCompression + with_legacy: true +- name: kstore_fsck_on_mount + type: bool + level: advanced + desc: Whether or not to run fsck on mount for kstore. 
+ default: false + with_legacy: true +- name: kstore_fsck_on_mount_deep + type: bool + level: advanced + desc: Whether or not to run deep fsck on mount for kstore + default: true + with_legacy: true +- name: kstore_nid_prealloc + type: uint + level: advanced + default: 1_K + with_legacy: true +- name: kstore_sync_transaction + type: bool + level: advanced + default: false + with_legacy: true +- name: kstore_sync_submit_transaction + type: bool + level: advanced + default: false + with_legacy: true +- name: kstore_onode_map_size + type: uint + level: advanced + default: 1_K + with_legacy: true +- name: kstore_default_stripe_size + type: size + level: advanced + default: 64_K + with_legacy: true +# rocksdb options that will be used for omap(if omap_backend is rocksdb) +- name: filestore_rocksdb_options + type: str + level: dev + desc: Options to pass through when RocksDB is used as the KeyValueDB for filestore. + default: max_background_jobs=10,compaction_readahead_size=2097152,compression=kNoCompression + with_legacy: true +- name: filestore_omap_backend + type: str + level: dev + desc: The KeyValueDB to use for filestore metadata (ie omap). + default: rocksdb + enum_values: + - leveldb + - rocksdb + with_legacy: true +- name: filestore_omap_backend_path + type: str + level: dev + desc: The path where the filestore KeyValueDB should store its database(s).
+ with_legacy: true +# filestore wb throttle limits +- name: filestore_wbthrottle_enable + type: bool + level: advanced + desc: Enable throttling of operations to backing file system + default: true + with_legacy: true +- name: filestore_wbthrottle_btrfs_bytes_start_flusher + type: size + level: advanced + desc: Start flushing (fsyncing) when this many bytes are written (btrfs) + default: 40_M + with_legacy: true +- name: filestore_wbthrottle_btrfs_bytes_hard_limit + type: size + level: advanced + desc: Block writes when this many bytes haven't been flushed (fsynced) (btrfs) + default: 400_M + with_legacy: true +- name: filestore_wbthrottle_btrfs_ios_start_flusher + type: uint + level: advanced + desc: Start flushing (fsyncing) when this many IOs are written (btrfs) + default: 500 + with_legacy: true +- name: filestore_wbthrottle_btrfs_ios_hard_limit + type: uint + level: advanced + desc: Block writes when this many IOs haven't been flushed (fsynced) (btrfs) + default: 5000 + with_legacy: true +- name: filestore_wbthrottle_btrfs_inodes_start_flusher + type: uint + level: advanced + desc: Start flushing (fsyncing) when this many distinct inodes have been modified + (btrfs) + default: 500 + with_legacy: true +- name: filestore_wbthrottle_xfs_bytes_start_flusher + type: size + level: advanced + desc: Start flushing (fsyncing) when this many bytes are written (xfs) + default: 40_M + with_legacy: true +- name: filestore_wbthrottle_xfs_bytes_hard_limit + type: size + level: advanced + desc: Block writes when this many bytes haven't been flushed (fsynced) (xfs) + default: 400_M + with_legacy: true +- name: filestore_wbthrottle_xfs_ios_start_flusher + type: uint + level: advanced + desc: Start flushing (fsyncing) when this many IOs are written (xfs) + default: 500 + with_legacy: true +- name: filestore_wbthrottle_xfs_ios_hard_limit + type: uint + level: advanced + desc: Block writes when this many IOs haven't been flushed (fsynced) (xfs) + default: 5000 + with_legacy:
true +- name: filestore_wbthrottle_xfs_inodes_start_flusher + type: uint + level: advanced + desc: Start flushing (fsyncing) when this many distinct inodes have been modified + (xfs) + default: 500 + with_legacy: true +# These must be less than the fd limit +- name: filestore_wbthrottle_btrfs_inodes_hard_limit + type: uint + level: advanced + desc: Block writing when this many inodes have outstanding writes (btrfs) + default: 5000 + with_legacy: true +- name: filestore_wbthrottle_xfs_inodes_hard_limit + type: uint + level: advanced + desc: Block writing when this many inodes have outstanding writes (xfs) + default: 5000 + with_legacy: true +# Introduce a O_DSYNC write in the filestore +- name: filestore_odsync_write + type: bool + level: dev + desc: Write with O_DSYNC + default: false + with_legacy: true +# Tests index failure paths +- name: filestore_index_retry_probability + type: float + level: dev + default: 0 + with_legacy: true +# Allow object read error injection +- name: filestore_debug_inject_read_err + type: bool + level: dev + default: false + with_legacy: true +- name: filestore_debug_random_read_err + type: float + level: dev + default: 0 + with_legacy: true +# Expensive debugging check on sync +- name: filestore_debug_omap_check + type: bool + level: dev + default: false + fmt_desc: Debugging check on synchronization. This is an expensive operation. 
+ + with_legacy: true +- name: filestore_omap_header_cache_size + type: size + level: dev + default: 1_K + with_legacy: true +# Use omap for xattrs for attrs over +# filestore_max_inline_xattr_size or +- name: filestore_max_inline_xattr_size + type: size + level: dev + default: 0 + with_legacy: true +- name: filestore_max_inline_xattr_size_xfs + type: size + level: dev + default: 64_K + with_legacy: true +- name: filestore_max_inline_xattr_size_btrfs + type: size + level: dev + default: 2_K + with_legacy: true +- name: filestore_max_inline_xattr_size_other + type: size + level: dev + default: 512 + with_legacy: true +# for more than filestore_max_inline_xattrs attrs +- name: filestore_max_inline_xattrs + type: uint + level: dev + default: 0 + with_legacy: true +- name: filestore_max_inline_xattrs_xfs + type: uint + level: dev + default: 10 + with_legacy: true +- name: filestore_max_inline_xattrs_btrfs + type: uint + level: dev + default: 10 + with_legacy: true +- name: filestore_max_inline_xattrs_other + type: uint + level: dev + default: 2 + with_legacy: true +- name: filestore_max_xattr_value_size + type: size + level: dev + default: 0 + with_legacy: true +- name: filestore_max_xattr_value_size_xfs + type: size + level: dev + default: 64_K + with_legacy: true +- name: filestore_max_xattr_value_size_btrfs + type: size + level: dev + default: 64_K + with_legacy: true +# ext4 allows 4k xattrs total including some smallish extra fields and the +# keys. We're allowing 2 512 inline attrs in addition to some filestore +# replay attrs. After accounting for those, we still need to fit up to +# two attrs of this value. That means we need this value to be around 1k +# to be safe. This is hacky, but it's not worth complicating the code +# to work around ext4's total xattr limit. 
+- name: filestore_max_xattr_value_size_other + type: size + level: dev + default: 1_K + with_legacy: true +# track sloppy crcs +- name: filestore_sloppy_crc + type: bool + level: dev + default: false + with_legacy: true +- name: filestore_sloppy_crc_block_size + type: size + level: dev + default: 64_K + with_legacy: true +- name: filestore_max_alloc_hint_size + type: size + level: dev + default: 1_M + with_legacy: true +# seconds +- name: filestore_max_sync_interval + type: float + level: advanced + desc: Period between calls to syncfs(2) and journal trims (seconds) + default: 5 + with_legacy: true +# seconds +- name: filestore_min_sync_interval + type: float + level: dev + desc: Minimum period between calls to syncfs(2) + default: 0.01 + with_legacy: true +- name: filestore_btrfs_snap + type: bool + level: dev + default: true + with_legacy: true +- name: filestore_btrfs_clone_range + type: bool + level: advanced + desc: Use btrfs clone_range ioctl to efficiently duplicate objects + default: true + with_legacy: true +# zfsonlinux is still unstable +- name: filestore_zfs_snap + type: bool + level: dev + default: false + with_legacy: true +- name: filestore_fsync_flushes_journal_data + type: bool + level: dev + default: false + with_legacy: true +# (try to) use fiemap +- name: filestore_fiemap + type: bool + level: advanced + desc: Use fiemap ioctl(2) to determine which parts of objects are sparse + default: false + with_legacy: true +- name: filestore_punch_hole + type: bool + level: advanced + desc: Use fallocate(2) FALLOC_FL_PUNCH_HOLE to efficiently zero ranges of objects + default: false + with_legacy: true +# (try to) use seek_data/hole +- name: filestore_seek_data_hole + type: bool + level: advanced + desc: Use lseek(2) SEEK_HOLE and SEEK_DATA to determine which parts of objects are + sparse + default: false + with_legacy: true +- name: filestore_splice + type: bool + level: advanced + desc: Use splice(2) to more efficiently copy data between files + default: 
false + with_legacy: true +- name: filestore_fadvise + type: bool + level: advanced + desc: Use posix_fadvise(2) to pass hints to file system + default: true + with_legacy: true +# collect device partition information for management application to use +- name: filestore_collect_device_partition_information + type: bool + level: advanced + desc: Collect metadata about the backing file system on OSD startup + default: true + with_legacy: true +# (try to) use extsize for alloc hint NOTE: extsize seems to trigger +# data corruption in xfs prior to kernel 3.5. filestore will +# implicitly disable this if it cannot confirm the kernel is newer +# than that. +# NOTE: This option involves a tradeoff: When disabled, fragmentation is +# worse, but large sequential writes are faster. When enabled, large +# sequential writes are slower, but fragmentation is reduced. +- name: filestore_xfs_extsize + type: bool + level: advanced + desc: Use XFS extsize ioctl(2) to hint allocator about expected write sizes + default: false + with_legacy: true +- name: filestore_journal_parallel + type: bool + level: dev + default: false + with_legacy: true +- name: filestore_journal_writeahead + type: bool + level: dev + default: false + with_legacy: true +- name: filestore_journal_trailing + type: bool + level: dev + default: false + with_legacy: true +- name: filestore_queue_max_ops + type: uint + level: advanced + desc: Max IO operations in flight + default: 50 + with_legacy: true +- name: filestore_queue_max_bytes + type: size + level: advanced + desc: Max (written) bytes in flight + default: 100_M + with_legacy: true +- name: filestore_caller_concurrency + type: int + level: dev + default: 10 + with_legacy: true +# Expected filestore throughput in B/s +- name: filestore_expected_throughput_bytes + type: float + level: advanced + desc: Expected throughput of backend device (aids throttling calculations) + default: 209715200 + with_legacy: true +# Expected filestore throughput in ops/s +- name: 
filestore_expected_throughput_ops + type: float + level: advanced + desc: Expected throughput of backend device in IOPS (aids throttling calculations) + default: 200 + with_legacy: true +# Filestore max delay multiple. Defaults to 0 (disabled) +- name: filestore_queue_max_delay_multiple + type: float + level: dev + default: 0 + with_legacy: true +# Filestore high delay multiple. Defaults to 0 (disabled) +- name: filestore_queue_high_delay_multiple + type: float + level: dev + default: 0 + with_legacy: true +# Filestore max delay multiple bytes. Defaults to 0 (disabled) +- name: filestore_queue_max_delay_multiple_bytes + type: float + level: dev + default: 0 + with_legacy: true +# Filestore high delay multiple bytes. Defaults to 0 (disabled) +- name: filestore_queue_high_delay_multiple_bytes + type: float + level: dev + default: 0 + with_legacy: true +# Filestore max delay multiple ops. Defaults to 0 (disabled) +- name: filestore_queue_max_delay_multiple_ops + type: float + level: dev + default: 0 + with_legacy: true +# Filestore high delay multiple ops. 
Defaults to 0 (disabled) +- name: filestore_queue_high_delay_multiple_ops + type: float + level: dev + default: 0 + with_legacy: true +- name: filestore_queue_low_threshhold + type: float + level: dev + default: 0.3 + with_legacy: true +- name: filestore_queue_high_threshhold + type: float + level: dev + with_legacy: true + default: 0.9 +- name: filestore_op_threads + type: int + level: advanced + desc: Threads used to apply changes to backing file system + default: 2 + with_legacy: true +- name: filestore_op_thread_timeout + type: int + level: advanced + desc: Seconds before a worker thread is considered stalled + default: 1_min + with_legacy: true +- name: filestore_op_thread_suicide_timeout + type: int + level: advanced + desc: Seconds before a worker thread is considered dead + default: 3_min + with_legacy: true +- name: filestore_commit_timeout + type: float + level: advanced + desc: Seconds before backing file system is considered hung + default: 10_min + with_legacy: true +- name: filestore_fiemap_threshold + type: size + level: dev + default: 4_K + with_legacy: true +- name: filestore_merge_threshold + type: int + level: dev + default: -10 + with_legacy: true +- name: filestore_split_multiple + type: int + level: dev + default: 2 + with_legacy: true +- name: filestore_split_rand_factor + type: uint + level: dev + default: 20 + with_legacy: true +- name: filestore_update_to + type: int + level: dev + default: 1000 + with_legacy: true +- name: filestore_blackhole + type: bool + level: dev + default: false + with_legacy: true +- name: filestore_fd_cache_size + type: int + level: dev + default: 128 + with_legacy: true +- name: filestore_fd_cache_shards + type: int + level: dev + default: 16 + with_legacy: true +- name: filestore_ondisk_finisher_threads + type: int + level: dev + default: 1 + with_legacy: true +- name: filestore_apply_finisher_threads + type: int + level: dev + default: 1 + with_legacy: true +# file onto which store transaction dumps +- name: 
filestore_dump_file + type: str + level: dev + with_legacy: true +# inject a failure at the n'th opportunity +- name: filestore_kill_at + type: int + level: dev + default: 0 + with_legacy: true +# artificially stall for N seconds in op queue thread +- name: filestore_inject_stall + type: int + level: dev + default: 0 + with_legacy: true +# fail/crash on EIO +- name: filestore_fail_eio + type: bool + level: dev + default: true + with_legacy: true +- name: filestore_debug_verify_split + type: bool + level: dev + default: false + with_legacy: true +- name: journal_dio + type: bool + level: dev + default: true + fmt_desc: Enables direct i/o to the journal. Requires ``journal block + align`` set to ``true``. + with_legacy: true +- name: journal_aio + type: bool + level: dev + default: true + fmt_desc: Enables using ``libaio`` for asynchronous writes to the journal. + Requires ``journal dio`` set to ``true``. Version 0.61 and later, ``true``. + Version 0.60 and earlier, ``false``. + with_legacy: true +- name: journal_force_aio + type: bool + level: dev + default: false + with_legacy: true +- name: journal_block_size + type: size + level: dev + default: 4_K + with_legacy: true +- name: journal_block_align + type: bool + level: dev + default: true + fmt_desc: Block aligns write operations. Required for ``dio`` and ``aio``. + with_legacy: true +- name: journal_write_header_frequency + type: uint + level: dev + default: 0 + with_legacy: true +- name: journal_max_write_bytes + type: size + level: advanced + desc: Max bytes in flight to journal + fmt_desc: The maximum number of bytes the journal will write at + any one time. + default: 10_M + with_legacy: true +- name: journal_max_write_entries + type: int + level: advanced + desc: Max IOs in flight to journal + fmt_desc: The maximum number of entries the journal will write at + any one time. 
+ default: 100 + with_legacy: true +# Target range for journal fullness +- name: journal_throttle_low_threshhold + type: float + level: dev + default: 0.6 + with_legacy: true +- name: journal_throttle_high_threshhold + type: float + level: dev + default: 0.9 + with_legacy: true +# Multiple over expected at high_threshhold. Defaults to 0 (disabled). +- name: journal_throttle_high_multiple + type: float + level: dev + default: 0 + with_legacy: true +# Multiple over expected at max. Defaults to 0 (disabled). +- name: journal_throttle_max_multiple + type: float + level: dev + default: 0 + with_legacy: true +# align data payloads >= this. +- name: journal_align_min_size + type: size + level: dev + default: 64_K + fmt_desc: Align data payloads greater than the specified minimum. + with_legacy: true +- name: journal_replay_from + type: int + level: dev + default: 0 + with_legacy: true +- name: journal_zero_on_create + type: bool + level: dev + default: false + fmt_desc: | + Causes the file store to overwrite the entire journal with + ``0``'s during ``mkfs``. + with_legacy: true +# assume journal is not corrupt +- name: journal_ignore_corruption + type: bool + level: dev + default: false + with_legacy: true +# using ssd disk as journal, whether support discard nouse journal-data. 
+- name: journal_discard + type: bool + level: dev + default: false + with_legacy: true +# fio data directory for fio-objectstore +- name: fio_dir + type: str + level: advanced + default: /tmp/fio + with_legacy: true +- name: rados_mon_op_timeout + type: secs + level: advanced + desc: timeout for operations handled by monitors such as statfs (0 is unlimited) + default: 0 + min: 0 + flags: + - runtime +- name: rados_osd_op_timeout + type: secs + level: advanced + desc: timeout for operations handled by osds such as write (0 is unlimited) + default: 0 + min: 0 + flags: + - runtime +# true if LTTng-UST tracepoints should be enabled +- name: rados_tracing + type: bool + level: advanced + default: false + with_legacy: true +- name: mgr_connect_retry_interval + type: float + level: dev + default: 1 + services: + - common +- name: mgr_client_service_daemon_unregister_timeout + type: float + level: dev + desc: Time to wait during shutdown to deregister service with mgr + default: 1 +- name: throttler_perf_counter + type: bool + level: advanced + default: true + with_legacy: true +- name: event_tracing + type: bool + level: advanced + default: false + with_legacy: true +- name: bluestore_tracing + type: bool + level: advanced + desc: Enable bluestore event tracing. 
+ default: false +- name: bluestore_throttle_trace_rate + type: float + level: advanced + desc: Rate at which to sample bluestore transactions (per second) + default: 0 +- name: debug_deliberately_leak_memory + type: bool + level: dev + default: false + with_legacy: true +- name: debug_asserts_on_shutdown + type: bool + level: dev + desc: Enable certain asserts to check for refcounting bugs on shutdown; see http://tracker.ceph.com/issues/21738 + default: false +- name: debug_asok_assert_abort + type: bool + level: dev + desc: allow commands 'assert' and 'abort' via asok for testing crash dumps etc + default: false + with_legacy: true +- name: target_max_misplaced_ratio + type: float + level: basic + desc: Max ratio of misplaced objects to target when throttling data rebalancing + activity + default: 0.05 +- name: device_failure_prediction_mode + type: str + level: basic + desc: Method used to predict device failures + long_desc: To disable prediction, use 'none'. 'local' uses a prediction model that + runs inside the mgr daemon. 'cloud' will share metrics with a cloud service and + query the service for device life expectancy. + default: none + enum_values: + - none + - local + - cloud + flags: + - runtime +- name: gss_ktab_client_file + type: str + level: advanced + desc: GSS/KRB5 Keytab file for client authentication + long_desc: This sets the full path for the GSS/Kerberos client keytab file location. + default: /var/lib/ceph/$name/gss_client_$name.ktab + services: + - mon + - osd +- name: gss_target_name + type: str + level: advanced + long_desc: This sets the gss target service name. 
+ default: ceph + services: + - mon + - osd +- name: debug_disable_randomized_ping + type: bool + level: dev + desc: Disable heartbeat ping randomization for testing purposes + default: false +- name: debug_heartbeat_testing_span + type: int + level: dev + desc: Override 60 second periods for testing only + default: 0 +- name: librados_thread_count + type: uint + level: advanced + desc: Size of thread pool for Objecter + default: 2 + tags: + - client + min: 1 +- name: osd_asio_thread_count + type: uint + level: advanced + desc: Size of thread pool for ASIO completions + default: 2 + tags: + - osd + min: 1 +- name: cephsqlite_lock_renewal_interval + type: millisecs + level: advanced + desc: number of milliseconds before lock is renewed + default: 2000 + tags: + - client + see_also: + - cephsqlite_lock_renewal_timeout + min: 100 +- name: cephsqlite_lock_renewal_timeout + type: millisecs + level: advanced + desc: number of milliseconds before transaction lock times out + long_desc: The amount of time before a running libcephsqlite VFS connection has + to renew a lock on the database before the lock is automatically lost. If the + lock is lost, the VFS will abort the process to prevent database corruption. + default: 30000 + tags: + - client + see_also: + - cephsqlite_lock_renewal_interval + min: 100 +- name: cephsqlite_blocklist_dead_locker + type: bool + level: advanced + desc: blocklist the last dead owner of the database lock + long_desc: Require that the Ceph SQLite VFS blocklist the last dead owner of the + database when cleanup was incomplete. DO NOT CHANGE THIS UNLESS YOU UNDERSTAND + THE RAMIFICATIONS. CORRUPTION MAY RESULT. 
+ default: true + tags: + - client +- name: bdev_type + type: str + level: advanced + desc: Explicitly set the device type to select the driver if it's needed + enum_values: + - aio + - spdk + - pmem + - hm_smr +- name: bluestore_cleaner_sleep_interval + type: float + level: advanced + desc: How long cleaner should sleep before re-checking utilization + default: 5 + with_legacy: true +- name: jaeger_tracing_enable + type: bool + level: advanced + desc: Ceph should use jaeger tracing system + default: false + services: + - rgw + - osd + with_legacy: true +- name: jaeger_agent_port + type: int + level: advanced + desc: port number of the jaeger agent + default: 6799 + services: + - rgw + - osd +- name: mgr_ttl_cache_expire_seconds + type: uint + level: dev + desc: Set the time to live in seconds - set to 0 to disable the cache. + default: 0 + services: + - mgr |