author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-15 19:43:11 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-15 19:43:11 +0000
commit: fc22b3d6507c6745911b9dfcc68f1e665ae13dbc
tree: ce1e3bce06471410239a6f41282e328770aa404a /upstream/mageia-cauldron/man5/btrfs.5
parent: Initial commit.
Adding upstream version 4.22.0. (upstream/4.22.0)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'upstream/mageia-cauldron/man5/btrfs.5')
-rw-r--r-- upstream/mageia-cauldron/man5/btrfs.5 2902
1 files changed, 2902 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man5/btrfs.5 b/upstream/mageia-cauldron/man5/btrfs.5
new file mode 100644
index 00000000..254a8894
--- /dev/null
+++ b/upstream/mageia-cauldron/man5/btrfs.5
@@ -0,0 +1,2902 @@
+.\" Man page generated from reStructuredText.
+.
+.
+.nr rst2man-indent-level 0
+.
+.de1 rstReportMargin
+\\$1 \\n[an-margin]
+level \\n[rst2man-indent-level]
+level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
+-
+\\n[rst2man-indent0]
+\\n[rst2man-indent1]
+\\n[rst2man-indent2]
+..
+.de1 INDENT
+.\" .rstReportMargin pre:
+. RS \\$1
+. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
+. nr rst2man-indent-level +1
+.\" .rstReportMargin post:
+..
+.de UNINDENT
+. RE
+.\" indent \\n[an-margin]
+.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
+.nr rst2man-indent-level -1
+.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
+.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
+..
+.TH "BTRFS" "5" "Jan 09, 2024" "6.6.3" "BTRFS"
+.SH NAME
+btrfs \- topics about the BTRFS filesystem (mount options, supported file attributes and other)
+.SH DESCRIPTION
+.sp
+This document describes topics related to BTRFS that are not specific to the
+tools. Currently covers:
+.INDENT 0.0
+.IP 1. 4
+mount options
+.IP 2. 4
+filesystem features
+.IP 3. 4
+checksum algorithms
+.IP 4. 4
+compression
+.IP 5. 4
+sysfs interface
+.IP 6. 4
+filesystem exclusive operations
+.IP 7. 4
+filesystem limits
+.IP 8. 4
+bootloader support
+.IP 9. 4
+file attributes
+.IP 10. 4
+zoned mode
+.IP 11. 4
+control device
+.IP 12. 4
+filesystems with multiple block group profiles
+.IP 13. 4
+seeding device
+.IP 14. 4
+RAID56 status and recommended practices
+.IP 15. 4
+storage model, hardware considerations
+.UNINDENT
+.SH MOUNT OPTIONS
+.SS BTRFS SPECIFIC MOUNT OPTIONS
+.sp
+This section describes mount options specific to BTRFS. For the generic mount
+options please refer to the \fBmount(8)\fP manual page. The options are sorted alphabetically
+(discarding the \fIno\fP prefix).
+.sp
+\fBNOTE:\fP
+.INDENT 0.0
+.INDENT 3.5
+Most mount options apply to the whole filesystem and only options in the
+first mounted subvolume will take effect. This is due to lack of implementation
+and may change in the future. This means that (for example) you can\(aqt set
+per\-subvolume \fInodatacow\fP, \fInodatasum\fP, or \fIcompress\fP using mount options. This
+should eventually be fixed, but it has proved to be difficult to implement
+correctly within the Linux VFS framework.
+.UNINDENT
+.UNINDENT
+.sp
+Mount options are processed in order; only the last occurrence of an option
+takes effect, and it may disable other options due to constraints (see e.g.
+\fInodatacow\fP and \fIcompress\fP). The output of the \fBmount\fP command shows which options
+have been applied.
+.INDENT 0.0
+.TP
+.B acl, noacl
+(default: on)
+.sp
+Enable/disable support for POSIX Access Control Lists (ACLs). See the
+\fBacl(5)\fP manual page for more information about ACLs.
+.sp
+The support for ACL is build\-time configurable (BTRFS_FS_POSIX_ACL) and
+mount fails if \fIacl\fP is requested but the feature is not compiled in.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B autodefrag, noautodefrag
+(since: 3.0, default: off)
+.sp
+Enable automatic file defragmentation.
+When enabled, small random writes into files (in a range of tens of kilobytes,
+currently it\(aqs 64KiB) are detected and queued up for the defragmentation process.
+May not be well suited for large database workloads.
+.sp
+The read latency may increase due to reading the adjacent blocks that make up the
+range for defragmentation; a successive write will merge the blocks in the new
+location.
+.sp
+\fBWARNING:\fP
+.INDENT 7.0
+.INDENT 3.5
+Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14\-rc2 as
+well as with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12 or
+≥ 3.13.4 will break up the reflinks of COW data (for example files
+copied with \fBcp \-\-reflink\fP, snapshots or de\-duplicated data).
+This may cause considerable increase of space usage depending on the
+broken up reflinks.
+.UNINDENT
+.UNINDENT
+.TP
+.B barrier, nobarrier
+(default: on)
+.sp
+Ensure that all IO write operations make it through the device cache and are stored
+permanently when the filesystem is at its consistency checkpoint. This
+typically means that a flush command is sent to the device that will
+synchronize all pending data and ordinary metadata blocks, then writes the
+superblock and issues another flush.
+.sp
+The write flushes incur a slight performance hit and also prevent the IO block
+scheduler from reordering requests in a more effective way. Disabling barriers gets
+rid of that penalty but will most certainly lead to a corrupted filesystem in
+case of a crash or power loss. The ordinary metadata blocks could be yet
+unwritten at the time the new superblock is stored permanently, expecting that
+the block pointers to metadata were stored permanently before.
+.sp
+On a device with a volatile battery\-backed write\-back cache, the \fInobarrier\fP
+option will not lead to filesystem corruption as the pending blocks are
+supposed to make it to the permanent storage.
+.TP
+.B check_int, check_int_data, check_int_print_mask=<value>
+(since: 3.0, default: off)
+.sp
+These debugging options control the behavior of the integrity checking
+module (the BTRFS_FS_CHECK_INTEGRITY config option is required). The main goal is
+to verify that all blocks from a given transaction period are properly linked.
+.sp
+\fIcheck_int\fP enables the integrity checker module, which examines all
+block write requests to ensure on\-disk consistency, at a large
+memory and CPU cost.
+.sp
+\fIcheck_int_data\fP includes extent data in the integrity checks, and
+implies the \fIcheck_int\fP option.
+.sp
+\fIcheck_int_print_mask\fP takes a bitmask of BTRFSIC_PRINT_MASK_* values
+as defined in \fIfs/btrfs/check\-integrity.c\fP, to control the integrity
+checker module behavior.
+.sp
+See comments at the top of \fIfs/btrfs/check\-integrity.c\fP
+for more information.
+.TP
+.B clear_cache
+Force clearing and rebuilding of the free space cache if something
+has gone wrong.
+.sp
+For free space cache \fIv1\fP, this only clears (and, unless \fInospace_cache\fP is
+used, rebuilds) the free space cache for block groups that are modified while
+the filesystem is mounted with that option. To actually clear an entire free
+space cache \fIv1\fP, see \fBbtrfs check \-\-clear\-space\-cache v1\fP\&.
+.sp
+For free space cache \fIv2\fP, this clears the entire free space cache.
+To do so without mounting the filesystem, see
+\fBbtrfs check \-\-clear\-space\-cache v2\fP\&.
+.sp
+See also: \fIspace_cache\fP\&.
+.TP
+.B commit=<seconds>
+(since: 3.12, default: 30)
+.sp
+Set the interval of periodic transaction commit when data are synchronized
+to permanent storage. Higher interval values lead to a larger amount of unwritten
+data, which may be lost if the system crashes.
+The upper bound is not enforced, but a warning is printed if it\(aqs more than 300
+seconds (5 minutes). Use with care.
+.TP
+.B compress, compress=<type[:level]>, compress\-force, compress\-force=<type[:level]>
+(default: off, level support since: 5.1)
+.sp
+Control BTRFS file data compression. Type may be specified as \fIzlib\fP,
+\fIlzo\fP, \fIzstd\fP or \fIno\fP (for no compression, used for remounting). If no type
+is specified, \fIzlib\fP is used. If \fIcompress\-force\fP is specified,
+then compression will always be attempted, but the data may end up uncompressed
+if the compression would make them larger.
+.sp
+Both \fIzlib\fP and \fIzstd\fP (since version 5.1) expose the compression level as a
+tunable knob with higher levels trading speed and memory (\fIzstd\fP) for higher
+compression ratios. This can be set by appending a colon and the desired level.
+ZLIB accepts the range [1, 9] and ZSTD accepts [1, 15]. If no level is set,
+both currently use a default level of 3. The value 0 is an alias for the
+default level.
+.sp
+Without \fIcompress\-force\fP, some simple heuristics are applied to detect an incompressible file.
+If the first blocks written to a file are not compressible, the whole file is
+permanently marked to skip compression. As this is too simple,
+\fIcompress\-force\fP is a workaround that will compress most of the files at the
+cost of some wasted CPU cycles on failed attempts.
+Since kernel 4.15, the heuristics have been improved by using
+frequency sampling, repeated pattern detection and Shannon entropy calculation
+to avoid that.
+.sp
+\fBNOTE:\fP
+.INDENT 7.0
+.INDENT 3.5
+If compression is enabled, \fInodatacow\fP and \fInodatasum\fP are disabled.
+.UNINDENT
+.UNINDENT
+.TP
+.B datacow, nodatacow
+(default: on)
+.sp
+Enable data copy\-on\-write for newly created files.
+\fINodatacow\fP implies \fInodatasum\fP, and disables \fIcompression\fP\&. All files created
+under \fInodatacow\fP also get the NOCOW file attribute set (see \fBchattr(1)\fP).
+.sp
+\fBNOTE:\fP
+.INDENT 7.0
+.INDENT 3.5
+If \fInodatacow\fP or \fInodatasum\fP are enabled, compression is disabled.
+.UNINDENT
+.UNINDENT
+.sp
+Updates in\-place improve performance for workloads that do frequent overwrites,
+at the cost of potential partial writes, in case the write is interrupted
+(system crash, device failure).
+.TP
+.B datasum, nodatasum
+(default: on)
+.sp
+Enable data checksumming for newly created files.
+\fIDatasum\fP implies \fIdatacow\fP, i.e. the normal mode of operation. All files created
+under \fInodatasum\fP inherit the \(dqno checksums\(dq property, however there\(aqs no
+corresponding file attribute (see \fBchattr(1)\fP).
+.sp
+\fBNOTE:\fP
+.INDENT 7.0
+.INDENT 3.5
+If \fInodatacow\fP or \fInodatasum\fP are enabled, compression is disabled.
+.UNINDENT
+.UNINDENT
+.sp
+There is a slight performance gain when checksums are turned off, as the
+corresponding metadata blocks holding the checksums do not need to be updated.
+The cost of checksumming the blocks in memory is much lower than the IO, and
+modern CPUs feature hardware support for the checksumming algorithm.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B degraded
+(default: off)
+.sp
+Allow mounts with fewer devices than the RAID profile constraints
+require. A read\-write mount (or remount) may fail when there are too many devices
+missing, for example if a stripe member is completely missing from RAID0.
+.sp
+Since 4.14, the constraint checks have been improved and are verified on the
+chunk level, not at the device level. This allows degraded mounts of
+filesystems with mixed RAID profiles for data and metadata, even if the
+device number constraints would not be satisfied for some of the profiles.
+.sp
+Example: metadata \-\- raid1, data \-\- single, devices \-\- \fB/dev/sda\fP, \fB/dev/sdb\fP
+.sp
+Suppose the data are completely stored on \fIsda\fP, then missing \fIsdb\fP will not
+prevent the mount, even if 1 missing device would normally prevent (any)
+\fIsingle\fP profile from mounting. In case some of the data chunks are stored on \fIsdb\fP,
+then the constraint of single/data is not satisfied and the filesystem
+cannot be mounted.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B device=<devicepath>
+Specify a path to a device that will be scanned for a BTRFS filesystem during
+mount. This is usually done automatically by a device manager (like udev) or
+using the \fBbtrfs device scan\fP command (e.g. run from the initial ramdisk). In
+cases where this is not possible the \fIdevice\fP mount option can help.
+.sp
+\fBNOTE:\fP
+.INDENT 7.0
+.INDENT 3.5
+Booting e.g. a RAID1 system may fail even if all filesystem\(aqs \fIdevice\fP
+paths are provided as the actual device nodes may not be discovered by the
+system at that point.
+.UNINDENT
+.UNINDENT
+.TP
+.B discard, discard=sync, discard=async, nodiscard
+(default: async when devices support it since 6.2, async support since: 5.6)
+.sp
+Enable discarding of freed file blocks. This is useful for SSD devices, thinly
+provisioned LUNs, or virtual machine images; however, every storage layer must
+support discard for it to work.
+.sp
+In the synchronous mode (\fIsync\fP or without option value), lack of asynchronous
+queued TRIM on the backing device can severely degrade performance,
+because a synchronous TRIM operation will be attempted instead. Queued TRIM
+requires chipsets and devices newer than SATA revision 3.1.
+.sp
+The asynchronous mode (\fIasync\fP) gathers extents in larger chunks before sending
+them to the devices for TRIM. The overhead and performance impact should be
+negligible compared to the previous mode and it\(aqs supposed to be the preferred
+mode if needed.
+.sp
+If it is not necessary to immediately discard freed blocks, then the \fBfstrim\fP
+tool can be used to discard all free blocks in a batch. Scheduling a TRIM
+during a period of low system activity will prevent latent interference with
+the performance of other operations. Also, a device may ignore the TRIM command
+if the range is too small, so running a batch discard has a greater probability
+of actually discarding the blocks.
+.TP
+.B enospc_debug, noenospc_debug
+(default: off)
+.sp
+Enable verbose output for some ENOSPC conditions. It\(aqs safe to use but can
+be noisy if the system reaches near\-full state.
+.TP
+.B fatal_errors=<action>
+(since: 3.4, default: bug)
+.sp
+Action to take when encountering a fatal error.
+.INDENT 7.0
+.TP
+.B bug
+\fIBUG()\fP on a fatal error, the system will stay in the crashed state and may be
+still partially usable, but reboot is required for full operation
+.TP
+.B panic
+\fIpanic()\fP on a fatal error, depending on other system configuration, this may
+be followed by a reboot. Please refer to the documentation of kernel boot
+parameters, e.g. \fIpanic\fP, \fIoops\fP or \fIcrashkernel\fP\&.
+.UNINDENT
+.TP
+.B flushoncommit, noflushoncommit
+(default: off)
+.sp
+This option forces any data dirtied by a write in a prior transaction to commit
+as part of the current commit, effectively a full filesystem sync.
+.sp
+This makes the committed state a fully consistent view of the file system from
+the application\(aqs perspective (i.e. it includes all completed file system
+operations). This was previously the behavior only when a snapshot was
+created.
+.sp
+When off, the filesystem is consistent but buffered writes may last more than
+one transaction commit.
+.TP
+.B fragment=<type>
+(depends on compile\-time option CONFIG_BTRFS_DEBUG, since: 4.4, default: off)
+.sp
+A debugging helper to intentionally fragment the given \fItype\fP of block groups. The
+type can be \fIdata\fP, \fImetadata\fP or \fIall\fP\&. This mount option should not be used
+outside of debugging environments and is not recognized if the kernel config
+option \fICONFIG_BTRFS_DEBUG\fP is not enabled.
+.TP
+.B nologreplay
+(default: off, even read\-only)
+.sp
+The tree\-log contains pending updates to the filesystem until the full commit.
+The log is replayed on the next mount; this can be disabled by this option. See
+also \fItreelog\fP\&. Note that \fInologreplay\fP is the same as \fInorecovery\fP\&.
+.sp
+\fBWARNING:\fP
+.INDENT 7.0
+.INDENT 3.5
+Currently, the tree log is replayed even with a read\-only mount! To
+disable that behaviour, mount also with \fInologreplay\fP\&.
+.UNINDENT
+.UNINDENT
+.TP
+.B max_inline=<bytes>
+(default: min(2048, page size) )
+.sp
+Specify the maximum amount of space that can be inlined in
+a metadata b\-tree leaf. The value is specified in bytes, optionally
+with a K suffix (case insensitive). In practice, this value
+is limited by the filesystem block size (named \fIsectorsize\fP at mkfs time),
+and the memory page size of the system. In case of the sectorsize limit, there\(aqs
+some space unavailable due to b\-tree leaf headers. For example, with a 4KiB
+sectorsize, the maximum size of inline data is about 3900 bytes.
+.sp
+Inlining can be completely turned off by specifying 0. This will increase data
+block slack if file sizes are much smaller than block size but will reduce
+metadata consumption in return.
+.sp
+\fBNOTE:\fP
+.INDENT 7.0
+.INDENT 3.5
+The default value has changed to 2048 in kernel 4.6.
+.UNINDENT
+.UNINDENT
+.TP
+.B metadata_ratio=<value>
+(default: 0, internal logic)
+.sp
+Specifies that 1 metadata chunk should be allocated after every \fIvalue\fP data
+chunks. The default behaviour depends on internal logic: it attempts to maintain
+some percentage of unused metadata space, which is not always possible if
+there\(aqs not enough space left for chunk allocation. The option could be useful to
+override the internal logic in favor of the metadata allocation if the expected
+workload is supposed to be metadata\-intensive (snapshots, reflinks, xattrs,
+inlined files).
+.TP
+.B norecovery
+(since: 4.5, default: off)
+.sp
+Do not attempt any data recovery at mount time. This will disable \fIlogreplay\fP
+and avoids other write operations. Note that this option is the same as
+\fInologreplay\fP\&.
+.sp
+\fBNOTE:\fP
+.INDENT 7.0
+.INDENT 3.5
+The opposite option \fIrecovery\fP used to have a different meaning but was
+changed for consistency with other filesystems, where \fInorecovery\fP is used for
+skipping log replay. BTRFS does the same and in general will try to avoid any
+write operations.
+.UNINDENT
+.UNINDENT
+.TP
+.B rescan_uuid_tree
+(since: 3.12, default: off)
+.sp
+Force check and rebuild procedure of the UUID tree. This should not
+normally be needed.
+.TP
+.B rescue
+(since: 5.9)
+.sp
+Modes allowing mount with damaged filesystem structures.
+.INDENT 7.0
+.IP \(bu 2
+\fIusebackuproot\fP (since: 5.9, replaces standalone option \fIusebackuproot\fP)
+.IP \(bu 2
+\fInologreplay\fP (since: 5.9, replaces standalone option \fInologreplay\fP)
+.IP \(bu 2
+\fIignorebadroots\fP, \fIibadroots\fP (since: 5.11)
+.IP \(bu 2
+\fIignoredatacsums\fP, \fIidatacsums\fP (since: 5.11)
+.IP \(bu 2
+\fIall\fP (since: 5.9)
+.UNINDENT
+.TP
+.B skip_balance
+(since: 3.3, default: off)
+.sp
+Skip automatic resume of an interrupted balance operation. The operation can
+later be resumed with \fBbtrfs balance resume\fP, or the paused state can be
+removed with \fBbtrfs balance cancel\fP\&. The default behaviour is to resume an
+interrupted balance immediately after a volume is mounted.
+.TP
+.B space_cache, space_cache=<version>, nospace_cache
+(\fInospace_cache\fP since: 3.2, \fIspace_cache=v1\fP and \fIspace_cache=v2\fP since 4.5, default: \fIspace_cache=v2\fP)
+.sp
+Options to control the free space cache. The free space cache greatly improves
+performance when reading block group free space into memory. However, managing
+the space cache consumes some resources, including a small amount of disk
+space.
+.sp
+There are two implementations of the free space cache. The original
+one, referred to as \fIv1\fP, used to be a safe default but has been
+superseded by \fIv2\fP\&. The \fIv1\fP space cache can be disabled at mount time
+with \fInospace_cache\fP without clearing.
+.sp
+On very large filesystems (many terabytes) and certain workloads, the
+performance of the \fIv1\fP space cache may degrade drastically. The \fIv2\fP
+implementation, which adds a new b\-tree called the free space tree, addresses
+this issue. Once enabled, the \fIv2\fP space cache will always be used and cannot
+be disabled unless it is cleared. Use \fIclear_cache,space_cache=v1\fP or
+\fIclear_cache,nospace_cache\fP to do so. If \fIv2\fP is enabled, the \fIv1\fP space
+cache will be cleared (at the first mount) and kernels without \fIv2\fP
+support will only be able to mount the filesystem in read\-only mode.
+On an unmounted filesystem the caches (both versions) can be cleared by
+\fBbtrfs check \-\-clear\-space\-cache\fP\&.
+.sp
+The \fI\%btrfs\-check(8)\fP and \fI\%mkfs.btrfs(8)\fP commands have full \fIv2\fP free space
+cache support since v4.19.
+.sp
+If a version is not explicitly specified, the default implementation will be
+chosen, which is \fIv2\fP\&.
+.TP
+.B ssd, ssd_spread, nossd, nossd_spread
+(default: SSD autodetected)
+.sp
+Options to control SSD allocation schemes. By default, BTRFS will
+enable or disable SSD optimizations depending on status of a device with
+respect to rotational or non\-rotational type. This is determined by the
+contents of \fI/sys/block/DEV/queue/rotational\fP\&. If it is 0, the \fIssd\fP option is
+turned on. The option \fInossd\fP will disable the autodetection.
+.sp
+The optimizations make use of the absence of the seek penalty that\(aqs inherent
+for the rotational devices. The blocks can be typically written faster and
+are not offloaded to separate threads.
+.sp
+\fBNOTE:\fP
+.INDENT 7.0
+.INDENT 3.5
+Since 4.14, the block layout optimizations have been dropped. This used
+to help with first generations of SSD devices. Their FTL (flash translation
+layer) was not effective and the optimization was supposed to improve the wear
+by better aligning blocks. This is no longer true with modern SSD devices and
+the optimization had no real benefit. Furthermore it caused increased
+fragmentation. The layout tuning has been kept intact for the option
+\fIssd_spread\fP\&.
+.UNINDENT
+.UNINDENT
+.sp
+The \fIssd_spread\fP mount option attempts to allocate into bigger and aligned
+chunks of unused space, and may perform better on low\-end SSDs. \fIssd_spread\fP
+implies \fIssd\fP, enabling all other SSD heuristics as well. The option \fInossd\fP
+will disable all SSD options while \fInossd_spread\fP only disables \fIssd_spread\fP\&.
+.TP
+.B subvol=<path>
+Mount subvolume from \fIpath\fP rather than the toplevel subvolume. The
+\fIpath\fP is always treated as relative to the toplevel subvolume.
+This mount option overrides the default subvolume set for the given filesystem.
+.TP
+.B subvolid=<subvolid>
+Mount subvolume specified by a \fIsubvolid\fP number rather than the toplevel
+subvolume. You can use \fBbtrfs subvolume list\fP or \fBbtrfs subvolume show\fP to see
+subvolume ID numbers.
+This mount option overrides the default subvolume set for the given filesystem.
+.sp
+\fBNOTE:\fP
+.INDENT 7.0
+.INDENT 3.5
+If both \fIsubvolid\fP and \fIsubvol\fP are specified, they must point at the
+same subvolume, otherwise the mount will fail.
+.UNINDENT
+.UNINDENT
+.TP
+.B thread_pool=<number>
+(default: min(NRCPUS + 2, 8) )
+.sp
+The number of worker threads to start. NRCPUS is the number of on\-line CPUs
+detected at the time of mount. A small number leads to less parallelism in
+processing data and metadata, while higher numbers could lead to a performance hit
+due to increased locking contention, process scheduling, cache\-line bouncing or
+costly data transfers between local CPU memories.
+.TP
+.B treelog, notreelog
+(default: on)
+.sp
+Enable the tree logging used for \fIfsync\fP and \fIO_SYNC\fP writes. The tree log
+stores changes without the need of a full filesystem sync. The log operations
+are flushed at sync and transaction commit. If the system crashes between two
+such syncs, the pending tree log operations are replayed during mount.
+.sp
+\fBWARNING:\fP
+.INDENT 7.0
+.INDENT 3.5
+Currently, the tree log is replayed even with a read\-only mount! To
+disable that behaviour, also mount with \fInologreplay\fP\&.
+.UNINDENT
+.UNINDENT
+.sp
+The tree log could contain new files/directories, these would not exist on
+a mounted filesystem if the log is not replayed.
+.TP
+.B usebackuproot
+(since: 4.6, default: off)
+.sp
+Enable autorecovery attempts if a bad tree root is found at mount time.
+Currently this scans a backup list of several previous tree roots and tries to
+use the first readable one. This can be used with read\-only mounts as well.
+.sp
+\fBNOTE:\fP
+.INDENT 7.0
+.INDENT 3.5
+This option has replaced \fIrecovery\fP\&.
+.UNINDENT
+.UNINDENT
+.TP
+.B user_subvol_rm_allowed
+(default: off)
+.sp
+Allow subvolumes to be deleted by their respective owner. Otherwise, only the
+root user can do that.
+.sp
+\fBNOTE:\fP
+.INDENT 7.0
+.INDENT 3.5
+Historically, any user could create a snapshot even without being the owner
+of the source subvolume; the subvolume deletion has been restricted for that
+reason. The subvolume creation has since been restricted as well, but this mount option is
+still required. This is a usability issue.
+Since 4.18, the \fBrmdir(2)\fP syscall can delete an empty subvolume just like an
+ordinary directory. Whether this is possible can be detected at runtime, see
+\fIrmdir_subvol\fP feature in \fIFILESYSTEM FEATURES\fP\&.
+.UNINDENT
+.UNINDENT
+.UNINDENT
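+.sp
+As an illustrative sketch only, several of the options above can be combined in
+one \fBmount(8)\fP invocation or changed later by a remount. The device
+\fB/dev/sdx\fP, the mount point \fB/mnt\fP and the subvolume name \fIhome\fP are
+placeholders chosen for this example:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# mount \-o compress=zstd:3,subvol=home /dev/sdx /mnt
+# mount \-o remount,commit=60 /mnt
+# mount | grep /mnt
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The last command prints the mount entry, which shows the options that have
+actually been applied.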
+.SS DEPRECATED MOUNT OPTIONS
+.sp
+List of mount options that have been removed, kept for backward compatibility.
+.INDENT 0.0
+.TP
+.B recovery
+(since: 3.2, default: off, deprecated since: 4.5)
+.sp
+\fBNOTE:\fP
+.INDENT 7.0
+.INDENT 3.5
+This option has been replaced by \fIusebackuproot\fP and should not be used
+but will work on 4.5+ kernels.
+.UNINDENT
+.UNINDENT
+.TP
+.B inode_cache, noinode_cache
+(removed in: 5.11, since: 3.0, default: off)
+.sp
+\fBNOTE:\fP
+.INDENT 7.0
+.INDENT 3.5
+The functionality has been removed in 5.11, any stale data created by
+previous use of the \fIinode_cache\fP option can be removed by
+\fI\%btrfs rescue clear\-ino\-cache\fP\&.
+.UNINDENT
+.UNINDENT
+.UNINDENT
+.SS NOTES ON GENERIC MOUNT OPTIONS
+.sp
+Some of the general mount options from \fBmount(8)\fP that affect BTRFS and are
+worth mentioning.
+.INDENT 0.0
+.TP
+.B noatime
+under read intensive work\-loads, specifying \fInoatime\fP significantly improves
+performance because no new access time information needs to be written. Without
+this option, the default is \fIrelatime\fP, which only reduces the number of
+inode atime updates in comparison to the traditional \fIstrictatime\fP\&. The worst
+case for atime updates under \fIrelatime\fP occurs when many files are read whose
+atime is older than 24 h and which are freshly snapshotted. In that case the
+atime is updated and COW happens \- for each file \- in bulk. See also
+\fI\%https://lwn.net/Articles/499293/\fP \- \fIAtime and btrfs: a bad combination? (LWN, 2012\-05\-31)\fP\&.
+.sp
+Note that \fInoatime\fP may break applications that rely on atime updates like
+the venerable Mutt (unless you use maildir mailboxes).
+.UNINDENT
+.SH FILESYSTEM FEATURES
+.sp
+The basic set of filesystem features gets extended over time. The backward
+compatibility is maintained and the features are optional: they need to be
+explicitly asked for so accidental use will not create incompatibilities.
+.sp
+There are several classes and the respective tools to manage the features:
+.INDENT 0.0
+.TP
+.B at mkfs time only
+This is mainly for core structures, like the b\-tree nodesize or checksum
+algorithm, see \fI\%mkfs.btrfs(8)\fP for more details.
+.TP
+.B after mkfs, on an unmounted filesystem
+Features that may optimize internal structures or add new structures to support
+new functionality, see \fI\%btrfstune(8)\fP\&. The command
+\fBbtrfs inspect\-internal dump\-super /dev/sdx\fP
+will dump a superblock; you can map the value of
+\fIincompat_flags\fP to the features listed below.
+.TP
+.B after mkfs, on a mounted filesystem
+The features of a filesystem (with a given UUID) are listed in
+\fB/sys/fs/btrfs/UUID/features/\fP, one file per feature. The status is stored
+inside the file. The value \fI1\fP is for enabled and active, while \fI0\fP means the
+feature was enabled at mount time but turned off afterwards.
+.sp
+Whether a particular feature can be enabled on a mounted filesystem can be found
+in the directory \fB/sys/fs/btrfs/features/\fP, one file per feature. The value \fI1\fP
+means the feature can be enabled.
+.UNINDENT
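+.sp
+A short sketch of inspecting these files from a shell. Here \fIUUID\fP stands for
+the UUID of a mounted filesystem and \fIno_holes\fP is only an example feature
+name, whose file may not be present on a given filesystem:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# ls /sys/fs/btrfs/features/
+# cat /sys/fs/btrfs/UUID/features/no_holes
+.ft P
+.fi
+.UNINDENT
+.UNINDENT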
+.sp
+List of features (see also \fI\%mkfs.btrfs(8)\fP section
+\fI\%FILESYSTEM FEATURES\fP):
+.INDENT 0.0
+.TP
+.B big_metadata
+(since: 3.4)
+.sp
+the filesystem uses \fInodesize\fP for metadata blocks, this can be bigger than the
+page size
+.TP
+.B block_group_tree
+(since: 6.1)
+.sp
+block group item representation using a dedicated b\-tree, this can greatly
+reduce mount time for large filesystems
+.TP
+.B compress_lzo
+(since: 2.6.38)
+.sp
+the \fIlzo\fP compression has been used on the filesystem, either as a mount option
+or via \fBbtrfs filesystem defrag\fP\&.
+.TP
+.B compress_zstd
+(since: 4.14)
+.sp
+the \fIzstd\fP compression has been used on the filesystem, either as a mount option
+or via \fBbtrfs filesystem defrag\fP\&.
+.TP
+.B default_subvol
+(since: 2.6.34)
+.sp
+the default subvolume has been set on the filesystem
+.TP
+.B extended_iref
+(since: 3.7)
+.sp
+increased hardlink limit per file in a directory to 65536, older kernels
+supported a varying number of hardlinks depending on the sum of all file name
+sizes that can be stored into one metadata block
+.TP
+.B free_space_tree
+(since: 4.5)
+.sp
+free space representation using a dedicated b\-tree, successor of v1 space cache
+.TP
+.B metadata_uuid
+(since: 5.0)
+.sp
+the main filesystem UUID is the metadata_uuid, which stores the new UUID only
+in the superblock while all metadata blocks still have the UUID set at mkfs
+time, see \fI\%btrfstune(8)\fP for more
+.TP
+.B mixed_backref
+(since: 2.6.31)
+.sp
+the last major disk format change, improved backreferences, now default
+.TP
+.B mixed_groups
+(since: 2.6.37)
+.sp
+mixed data and metadata block groups, i.e. the data and metadata are not
+separated and occupy the same block groups, this mode is suitable for small
+volumes as there are no constraints on how the remaining space should be used
+(compared to the split mode, where empty metadata space cannot be used for data
+and vice versa)
+.sp
+on the other hand, the final layout is quite unpredictable and possibly highly
+fragmented, which means worse performance
+.TP
+.B no_holes
+(since: 3.14)
+.sp
+improved representation of file extents where holes are not explicitly
+stored as an extent, saves a few percent of metadata if sparse files are used
+.TP
+.B raid1c34
+(since: 5.5)
+.sp
+extended RAID1 mode with copies on 3 or 4 devices respectively
+.TP
+.B RAID56
+(since: 3.9)
+.sp
+the filesystem contains or contained a RAID56 profile of block groups
+.TP
+.B rmdir_subvol
+(since: 4.18)
+.sp
+indicate that \fBrmdir(2)\fP syscall can delete an empty subvolume just like an
+ordinary directory. Note that this feature only depends on the kernel version.
+.TP
+.B skinny_metadata
+(since: 3.10)
+.sp
+reduced\-size metadata for extent references, saves a few percent of metadata
+.TP
+.B send_stream_version
+(since: 5.10)
+.sp
+number of the highest supported send stream version
+.TP
+.B supported_checksums
+(since: 5.5)
+.sp
+list of checksum algorithms supported by the kernel module, the respective
+modules or built\-in implementing the algorithms need to be present to mount
+the filesystem, see section \fI\%CHECKSUM ALGORITHMS\fP\&.
+.TP
+.B supported_sectorsizes
+(since: 5.13)
+.sp
+list of values that are accepted as sector sizes (\fBmkfs.btrfs \-\-sectorsize\fP) by
+the running kernel
+.TP
+.B supported_rescue_options
+(since: 5.11)
+.sp
+list of values for the mount option \fIrescue\fP that are supported by the running
+kernel, see \fI\%btrfs(5)\fP
+.TP
+.B zoned
+(since: 5.12)
+.sp
+zoned mode is allocation/write friendly to host\-managed zoned devices,
+allocation space is partitioned into fixed\-size zones that must be updated
+sequentially, see section \fI\%ZONED MODE\fP
+.UNINDENT
+.SH SWAPFILE SUPPORT
+.sp
+A swapfile, when active, is a file\-backed swap area. It is supported since kernel 5.0.
+Use \fBswapon(8)\fP to activate it. Until then, and again after deactivating it
+with \fBswapoff(8)\fP, it\(aqs just a normal file (with NODATACOW set) to which the special
+restrictions for active swapfiles don\(aqt apply.
+.sp
+There are some limitations of the implementation in BTRFS and the Linux swap
+subsystem:
+.INDENT 0.0
+.IP \(bu 2
+filesystem \- must be on a single device only
+.IP \(bu 2
+filesystem \- must have only \fIsingle\fP data profile
+.IP \(bu 2
+subvolume \- cannot be snapshotted if it contains any active swapfiles
+.IP \(bu 2
+swapfile \- must be preallocated (i.e. no holes)
+.IP \(bu 2
+swapfile \- must be NODATACOW (i.e. also NODATASUM, no compression)
+.UNINDENT
+.sp
+The limitations come mainly from the COW\-based design and the block mapping
+layer that allows advanced features like relocation and multi\-device
+filesystems. However, the swap subsystem expects a simpler mapping and no
+background changes of the file block locations once they\(aqve been assigned to
+swap.
+.sp
+With active swapfiles, the following whole\-filesystem operations will skip
+swapfile extents or may fail:
+.INDENT 0.0
+.IP \(bu 2
+balance \- block groups with extents of any active swapfiles are skipped and
+reported, the rest will be processed normally
+.IP \(bu 2
+resize grow \- unaffected
+.IP \(bu 2
+resize shrink \- works as long as the extents of any active swapfiles are
+outside of the shrunk range
+.IP \(bu 2
+device add \- if the new devices do not interfere with any already active swapfiles
+this operation will work, though no new swapfile can be activated
+afterwards
+.IP \(bu 2
+device delete \- if the device has been added as above, it can be also deleted
+.IP \(bu 2
+device replace \- ditto
+.UNINDENT
+.sp
+When there are no active swapfiles and a whole\-filesystem exclusive operation
+is running (e.g. balance, device delete, shrink), the swapfiles cannot be
+temporarily activated. The operation must finish first.
+.sp
+To create and activate a swapfile run the following commands:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# truncate \-s 0 swapfile
+# chattr +C swapfile
+# fallocate \-l 2G swapfile
+# chmod 0600 swapfile
+# mkswap swapfile
+# swapon swapfile
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+Since version 6.1 it\(aqs possible to create the swapfile in a single command
+(except the activation):
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# btrfs filesystem mkswapfile \-\-size 2G swapfile
+# swapon swapfile
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+Please note that the UUID returned by the \fImkswap\fP utility identifies the swap
+\(dqfilesystem\(dq. Because it\(aqs stored in a file, it\(aqs not generally visible and
+usable as an identifier, unlike the UUID of swap on a block device.
+.sp
+Once activated the file will appear in \fB/proc/swaps\fP:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# cat /proc/swaps
+Filename Type Size Used Priority
+/path/swapfile file 2097152 0 \-2
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The swapfile can be created as a one\-time operation or, once properly created,
+activated on each boot by the \fBswapon \-a\fP command (usually started by the
+service manager). Add the following entry to \fI/etc/fstab\fP, assuming the
+filesystem that provides the \fI/path\fP has already been mounted at this point.
+Additional mount options relevant for the swapfile can be set too (like
+priority; these are not the BTRFS mount options).
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+/path/swapfile none swap defaults 0 0
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+From now on the subvolume with the active swapfile cannot be snapshotted until
+the swapfile is deactivated again by \fBswapoff\fP\&. Then the swapfile is a
+regular file and the subvolume can be snapshotted again, though this would prevent
+another activation of any swapfile that has been snapshotted. New swapfiles (not
+snapshotted) can be created and activated.
+.sp
+Otherwise, an inactive swapfile does not affect the containing subvolume. Activation
+creates a temporary in\-memory status and prevents some file operations, but is
+not stored permanently.
+.SH HIBERNATION
+.sp
+A swapfile can be used for hibernation but it\(aqs not straightforward. Before
+hibernation a resume offset must be written to the file \fI/sys/power/resume_offset\fP
+or the kernel command line parameter \fIresume_offset\fP must be set.
+.sp
+The value is the physical offset on the device. Note that \fBthis is not the same
+value that\fP \fBfilefrag\fP \fBprints as physical offset!\fP
+.sp
+The Btrfs filesystem maps logical addresses to physical ones, but such a
+physical address can still map to one or more device\-specific physical block
+addresses. It\(aqs the device\-specific physical offset that is suitable as the resume
+offset.
+.sp
+Since version 6.1 there\(aqs a command \fI\%btrfs inspect\-internal map\-swapfile\fP
+that will print the device physical offset and the adjusted value for
+\fB/sys/power/resume_offset\fP\&. Note that the value is divided by page size, i.e.
+it\(aqs not the offset itself.
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# btrfs filesystem mkswapfile swapfile
+# btrfs inspect\-internal map\-swapfile swapfile
+Physical start: 811511726080
+Resume offset: 198122980
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
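+.sp
+The relation between the two printed values can be sketched in shell. This is
+only an illustration: the helper name \fIresume_offset\fP is hypothetical and a
+page size of 4096 bytes is assumed (check with \fBgetconf PAGESIZE\fP). Prefer
+the value printed by \fBmap\-swapfile\fP itself.

```shell
# Hypothetical helper: convert a device physical offset in bytes to the
# page-based value expected by /sys/power/resume_offset.
# $1 = physical start in bytes, $2 = page size in bytes (assumed 4096 if unset)
resume_offset() {
    phys=$1
    pagesize=${2:-4096}
    echo $(( phys / pagesize ))
}

resume_offset 811511726080 4096
```

+.sp
+With the physical start from the example above and a 4096\-byte page, the
+result is 198122980, which matches the printed resume offset.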
+.sp
+For scripting and convenience the option \fI\-r\fP will print just the offset:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# btrfs inspect\-internal map\-swapfile \-r swapfile
+198122980
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The command \fBmap\-swapfile\fP also verifies all the requirements, i.e. no holes,
+single device, etc.
+.SH TROUBLESHOOTING
+.sp
+If the swapfile activation fails please verify that you followed all the steps
+above or check the system log (e.g. \fBdmesg\fP or \fBjournalctl\fP) for more
+information.
+.sp
+Notably, the \fBswapon\fP utility exits with a message that does not say what
+failed:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# swapon /path/swapfile
+swapon: /path/swapfile: swapon failed: Invalid argument
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The specific reason is likely to be printed to the system log by the btrfs
+module:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# journalctl \-t kernel | grep swapfile
+kernel: BTRFS warning (device sda): swapfile must have single data profile
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.SH CHECKSUM ALGORITHMS
+.sp
+Data and metadata are checksummed by default, the checksum is calculated before
+write and verified after reading the blocks from devices. The whole metadata
+block has a checksum stored inline in the b\-tree node header, each data block
+has a detached checksum stored in the checksum tree.
+.sp
+There are several checksum algorithms supported. The default and backward
+compatible is \fIcrc32c\fP\&. Since kernel 5.5 there are three more with different
+characteristics and trade\-offs regarding speed and strength. The following list
+may help you to decide which one to select.
+.INDENT 0.0
+.TP
+.B CRC32C (32bit digest)
+default, best backward compatibility, very fast, modern CPUs have
+instruction\-level support, not collision\-resistant but still good error
+detection capabilities
+.TP
+.B XXHASH (64bit digest)
+can be used as CRC32C successor, very fast, optimized for modern CPUs utilizing
+instruction pipelining, good collision resistance and error detection
+.TP
+.B SHA256 (256bit digest)
+a cryptographic\-strength hash, relatively slow but with possible CPU
+instruction acceleration or specialized hardware cards, FIPS certified and
+in wide use
+.TP
+.B BLAKE2b (256bit digest)
+a cryptographic\-strength hash, relatively fast with possible CPU acceleration
+using SIMD extensions, not standardized but based on BLAKE which was a SHA3
+finalist, in wide use, the algorithm used is BLAKE2b\-256 that\(aqs optimized for
+64bit platforms
+.UNINDENT
+.sp
+The \fIdigest size\fP affects overall size of data block checksums stored in the
+filesystem. The metadata blocks have a fixed area up to 256 bits (32 bytes), so
+there\(aqs no increase. Each data block has a separate checksum stored, with
+additional overhead of the b\-tree leaves.
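+.sp
+To get a feeling for the effect of the digest size, the raw size of data
+checksums can be estimated as follows. This is a rough sketch: the helper name
+\fIcsum_bytes\fP is hypothetical, one checksum per 4KiB data block is assumed,
+and the b\-tree leaf overhead mentioned above is ignored.

```shell
# Hypothetical helper: raw size of data checksums in bytes.
# $1 = data size in bytes, $2 = digest size in bytes
# Assumes one checksum per 4096-byte data block; b-tree overhead is ignored.
csum_bytes() {
    data=$1
    digest=$2
    blocks=$(( (data + 4095) / 4096 ))
    echo $(( blocks * digest ))
}

one_tib=$(( 1024 * 1024 * 1024 * 1024 ))
echo "crc32c:  $(csum_bytes "$one_tib" 4) bytes"
echo "xxhash:  $(csum_bytes "$one_tib" 8) bytes"
echo "sha256:  $(csum_bytes "$one_tib" 32) bytes"
echo "blake2b: $(csum_bytes "$one_tib" 32) bytes"
```

+.sp
+Under these assumptions 1TiB of data needs about 1GiB of checksums with
+CRC32C, 2GiB with XXHASH and 8GiB with SHA256 or BLAKE2b.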
+.sp
+Approximate relative performance of the algorithms, measured against CRC32C
+using reference software implementations on a 3.5GHz Intel CPU:
+.TS
+center;
+|l|l|l|l|.
+_
+T{
+Digest
+T} T{
+Cycles/4KiB
+T} T{
+Ratio
+T} T{
+Implementation
+T}
+_
+T{
+CRC32C
+T} T{
+1700
+T} T{
+1.00
+T} T{
+CPU instruction
+T}
+_
+T{
+XXHASH
+T} T{
+2500
+T} T{
+1.44
+T} T{
+reference impl.
+T}
+_
+T{
+SHA256
+T} T{
+105000
+T} T{
+61
+T} T{
+reference impl.
+T}
+_
+T{
+SHA256
+T} T{
+36000
+T} T{
+21
+T} T{
+libgcrypt/AVX2
+T}
+_
+T{
+SHA256
+T} T{
+63000
+T} T{
+37
+T} T{
+libsodium/AVX2
+T}
+_
+T{
+BLAKE2b
+T} T{
+22000
+T} T{
+13
+T} T{
+reference impl.
+T}
+_
+T{
+BLAKE2b
+T} T{
+19000
+T} T{
+11
+T} T{
+libgcrypt/AVX2
+T}
+_
+T{
+BLAKE2b
+T} T{
+19000
+T} T{
+11
+T} T{
+libsodium/AVX2
+T}
+_
+.TE
+.sp
+Many kernels are configured with SHA256 as built\-in and not as a module.
+The accelerated versions are however provided by modules that must be loaded
+explicitly (\fBmodprobe sha256\fP) before mounting the filesystem to make use of
+them. You can check in \fB/sys/fs/btrfs/FSID/checksum\fP which one is used. If you
+see \fIsha256\-generic\fP, you may want to load the module and then unmount and mount
+the filesystem again, as the implementation cannot be changed on a mounted filesystem.
+To see the available implementations, check the file \fB/proc/crypto\fP\&. When the implementation is built\-in, you\(aqd find
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+name : sha256
+driver : sha256\-generic
+module : kernel
+priority : 100
+\&...
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+while an accelerated implementation is e.g.
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+name : sha256
+driver : sha256\-avx2
+module : sha256_ssse3
+priority : 170
+\&...
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
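+.sp
+The records shown above can be filtered with a short script. The following is
+a sketch, not part of btrfs\-progs: the helper name \fIcrypto_drivers\fP is
+hypothetical and it assumes the \fBname : value\fP layout of \fB/proc/crypto\fP
+shown in the examples.

```shell
# Hypothetical helper: print "driver priority" for every implementation of
# the algorithm named in $1, reading /proc/crypto-style text from stdin.
crypto_drivers() {
    awk -v algo="$1" '
        $1 == "name"     { name = $3 }
        $1 == "driver"   { driver = $3 }
        $1 == "priority" { if (name == algo) print driver, $3 }
    '
}

# Typical use on a running system:
#   crypto_drivers sha256 < /proc/crypto
```

+.sp
+If only \fIsha256\-generic\fP is listed, no accelerated implementation has been
+loaded. When several drivers are listed, the kernel crypto API generally
+prefers the one with the highest priority.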
+.SH COMPRESSION
+.sp
+Btrfs supports transparent file compression. There are three algorithms
+available: ZLIB, LZO and ZSTD (since v4.14), with various levels.
+The compression happens on the level of file extents and the algorithm is
+selected by file property, mount option or by a defrag command.
+You can have a single btrfs mount point that has some files that are
+uncompressed, some that are compressed with LZO, some with ZLIB, for instance
+(though you may not want it that way, it is supported).
+.sp
+Once the compression is set, all newly written data will be compressed, i.e.
+existing data are untouched. Data are split into smaller chunks (128KiB) before
+compression to make random rewrites possible without a high performance hit. Due
+to the increased number of extents the metadata consumption is higher. The
+chunks are compressed in parallel.
+.sp
+The algorithms can be characterized as follows regarding the speed/ratio
+trade\-offs:
+.INDENT 0.0
+.TP
+.B ZLIB
+.INDENT 7.0
+.IP \(bu 2
+slower, higher compression ratio
+.IP \(bu 2
+levels: 1 to 9, mapped directly, default level is 3
+.IP \(bu 2
+good backward compatibility
+.UNINDENT
+.TP
+.B LZO
+.INDENT 7.0
+.IP \(bu 2
+faster compression and decompression than ZLIB, worse compression ratio, designed to be fast
+.IP \(bu 2
+no levels
+.IP \(bu 2
+good backward compatibility
+.UNINDENT
+.TP
+.B ZSTD
+.INDENT 7.0
+.IP \(bu 2
+compression comparable to ZLIB with higher compression/decompression speeds and different ratio
+.IP \(bu 2
+levels: 1 to 15, mapped directly (higher levels are not available)
+.IP \(bu 2
+since 4.14, levels since 5.1
+.UNINDENT
+.UNINDENT
+.sp
+The differences depend on the actual data set and cannot be expressed by a
+single number or recommendation. Higher levels consume more CPU time and may
+not bring a significant improvement, lower levels are close to real time.
+.SH HOW TO ENABLE COMPRESSION
+.sp
+Typically the compression can be enabled on the whole filesystem, specified for
+the mount point. Note that the compression mount options are shared among all
+mounts of the same filesystem, either bind mounts or subvolume mounts.
+Please refer to \fI\%btrfs(5)\fP section
+\fI\%MOUNT OPTIONS\fP\&.
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+$ mount \-o compress=zstd /dev/sdx /mnt
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+This will enable the \fBzstd\fP algorithm on the default level (which is 3).
+The level can be specified manually too like \fBzstd:3\fP\&. Higher levels compress
+better at the cost of time, which in turn may cause increased write latency. Low
+levels are suitable for real\-time compression and on a reasonably fast CPU don\(aqt
+cause noticeable performance drops.
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+$ btrfs filesystem defrag \-czstd file
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The command above will start defragmentation of the whole \fIfile\fP and apply
+the compression, regardless of the mount option. (Note: specifying level is not
+yet implemented). The compression algorithm is not persistent and applies only
+to the defragmentation command, for any other writes other compression settings
+apply.
+.sp
+Persistent settings on a per\-file basis can be set in two ways:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+$ chattr +c file
+$ btrfs property set file compression zstd
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The first command uses the legacy interface of file attributes inherited from
+the ext2 filesystem and is not flexible, so by default the \fIzlib\fP compression is
+set. The other command sets a property on the file with the given algorithm.
+(Note: setting level that way is not yet implemented.)
+.SH COMPRESSION LEVELS
+.sp
+The level support of ZLIB has been added in v4.14, LZO does not support levels
+(the kernel implementation provides only one), ZSTD level support has been added
+in v5.1.
+.sp
+There are 9 levels of ZLIB supported (1 to 9), mapping 1:1 from the mount option
+to the algorithm defined level. The default is level 3, which provides a
+reasonably good compression ratio and is still reasonably fast. The difference
+in compression gain of levels 7, 8 and 9 is comparable but the higher levels
+take longer.
+.sp
+The ZSTD support includes levels 1 to 15, a subset of full range of what ZSTD
+provides. Levels 1\-3 are real\-time, 4\-8 slower with improved compression and
+9\-15 try even harder though the resulting size may not be significantly improved.
+.sp
+Level 0 always maps to the default. The compression level does not affect
+compatibility.
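+.sp
+The level rules above can be summarized in a small shell sketch. The helper
+name \fIeffective_level\fP is hypothetical and so is the choice to reject
+values outside the documented ranges; the kernel may handle such values
+differently.

```shell
# Hypothetical helper: print the effective level for an algorithm ($1) and a
# requested level ($2), following the ranges described above. Returns 1 for
# values outside the documented range (an assumption of this sketch, not
# necessarily the kernel behaviour).
effective_level() {
    algo=$1
    level=${2:-0}
    case $algo in
        zlib) max=9 ;;
        zstd) max=15 ;;
        lzo)  echo "none"; return 0 ;;   # LZO has no levels
        *)    return 1 ;;
    esac
    if [ "$level" -eq 0 ]; then
        echo 3                           # level 0 maps to the default (3)
    elif [ "$level" -ge 1 ] && [ "$level" -le "$max" ]; then
        echo "$level"
    else
        return 1
    fi
}

effective_level zstd 0
effective_level zlib 9
```

+.sp
+The two calls print 3 and 9: level 0 falls back to the default, while a level
+within the supported range is used as given.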
+.SH INCOMPRESSIBLE DATA
+.sp
+Files with already compressed data, or with data that won\(aqt compress well within
+the CPU and memory constraints of the kernel implementations, are handled by a simple
+decision logic. If the first portion of data being compressed is not smaller
+than the original, the compression of the file is disabled \-\- unless the
+filesystem is mounted with \fIcompress\-force\fP\&. In that case compression will
+always be attempted on the file, even if the result is later discarded. This is not optimal
+and subject to optimizations and further development.
+.sp
+If a file is identified as incompressible, a flag is set (\fINOCOMPRESS\fP) and it\(aqs
+sticky. On that file compression won\(aqt be performed unless forced. The flag
+can be also set by \fBchattr +m\fP (since e2fsprogs 1.46.2) or by properties with
+value \fIno\fP or \fInone\fP\&. Empty value will reset it to the default that\(aqs currently
+applicable on the mounted filesystem.
+.sp
+There are two ways to detect incompressible data:
+.INDENT 0.0
+.IP \(bu 2
+actual compression attempt \- data are compressed, if the result is not smaller,
+it\(aqs discarded, so this depends on the algorithm and level
+.IP \(bu 2
+pre\-compression heuristics \- a quick statistical evaluation on the data is
+performed and based on the result either compression is performed or skipped,
+the NOCOMPRESS bit is not set just by the heuristic, only if the compression
+algorithm does not make an improvement
+.UNINDENT
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+$ lsattr file
+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-m file
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+Forcing compression is not recommended, the heuristics are
+supposed to decide that and compression algorithms internally detect
+incompressible data too.
+.SH PRE-COMPRESSION HEURISTICS
+.sp
+The heuristics aim to do a few quick statistical tests on the data to be compressed
+in order to avoid a potentially costly compression that would turn out to be
+inefficient. Compression algorithms could have internal detection of
+incompressible data too but this leads to more overhead as the compression is
+done in another thread and has to write the data anyway. The heuristic is
+read\-only and can utilize cached memory.
+.sp
+The tests performed are based on the following: data sampling, long repeated
+pattern detection, byte frequency, Shannon entropy.
+.SH COMPATIBILITY
+.sp
+Compression is done using the COW mechanism so it\(aqs incompatible with
+\fInodatacow\fP\&. Direct IO works on compressed files but will fall back to buffered
+writes and leads to recompression. Currently \fInodatasum\fP and compression don\(aqt
+work together.
+.sp
+The compression algorithms have been added over time so the version
+compatibility should be also considered, together with other tools that may
+access the compressed data like bootloaders.
+.SH SYSFS INTERFACE
+.sp
+Btrfs has a sysfs interface to provide extra knobs.
+.sp
+The top level path is \fB/sys/fs/btrfs/\fP, and the main directory layout is the following:
+.TS
+center;
+|l|l|l|.
+_
+T{
+Relative Path
+T} T{
+Description
+T} T{
+Version
+T}
+_
+T{
+features/
+T} T{
+All supported features
+T} T{
+3.14+
+T}
+_
+T{
+<UUID>/
+T} T{
+Mounted fs UUID
+T} T{
+3.14+
+T}
+_
+T{
+<UUID>/allocation/
+T} T{
+Space allocation info
+T} T{
+3.14+
+T}
+_
+T{
+<UUID>/features/
+T} T{
+Features of the filesystem
+T} T{
+3.14+
+T}
+_
+T{
+<UUID>/devices/<DEVID>/
+T} T{
+Symlink to each block device sysfs
+T} T{
+5.6+
+T}
+_
+T{
+<UUID>/devinfo/<DEVID>/
+T} T{
+Btrfs specific info for each device
+T} T{
+5.6+
+T}
+_
+T{
+<UUID>/qgroups/
+T} T{
+Global qgroup info
+T} T{
+5.9+
+T}
+_
+T{
+<UUID>/qgroups/<LEVEL>_<ID>/
+T} T{
+Info for each qgroup
+T} T{
+5.9+
+T}
+_
+T{
+<UUID>/discard/
+T} T{
+Discard stats and tunables
+T} T{
+6.1+
+T}
+_
+.TE
+.sp
+For \fB/sys/fs/btrfs/features/\fP directory, each file means a supported feature
+for the current kernel.
+.sp
+For \fB/sys/fs/btrfs/<UUID>/features/\fP directory, each file means an enabled
+feature for the mounted filesystem.
+.sp
+The features share the same names as in section
+\fI\%FILESYSTEM FEATURES\fP\&.
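+.sp
+As the features are plain files, listing them needs no special tool. The
+following sketch wraps the two directories described above; the helper name
+\fIbtrfs_features\fP and its optional second argument (an alternative sysfs
+root, useful for testing) are hypothetical.

```shell
# Hypothetical helper: list feature names found in a features/ directory.
# $1 = filesystem UUID, or empty for the features supported by the kernel
# $2 = sysfs root (assumed /sys/fs/btrfs if unset)
btrfs_features() {
    root=${2:-/sys/fs/btrfs}
    if [ -n "$1" ]; then
        dir=$root/$1/features
    else
        dir=$root/features
    fi
    [ -d "$dir" ] || return 1
    ls "$dir"
}

# Typical use:
#   btrfs_features ""        # supported by the running kernel
#   btrfs_features "$uuid"   # enabled on the mounted filesystem
```

+.sp
+Comparing the two lists shows which features the running kernel supports but
+the mounted filesystem does not have enabled.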
+.sp
+Files in \fB/sys/fs/btrfs/<UUID>/\fP directory are:
+.INDENT 0.0
+.TP
+.B bg_reclaim_threshold
+(RW, since: 5.19)
+.sp
+Used space percentage of total device space to start auto block group reclaim.
+Mostly for zoned devices.
+.TP
+.B checksum
+(RO, since: 5.5)
+.sp
+The checksum used for the mounted filesystem.
+This includes both the checksum type (see section
+\fI\%CHECKSUM ALGORITHMS\fP)
+and the implemented driver (mostly shows if it\(aqs hardware accelerated).
+.TP
+.B clone_alignment
+(RO, since: 3.16)
+.sp
+The byte alignment for \fIclone\fP and \fIdedupe\fP ioctls.
+.TP
+.B commit_stats
+(RW, since: 6.0)
+.sp
+The performance statistics for btrfs transaction commit.
+Mostly for debug purposes.
+.sp
+Writing into this file will reset the maximum commit duration to
+the input value.
+.TP
+.B exclusive_operation
+(RO, since: 5.10)
+.sp
+Shows the running exclusive operation.
+Check section
+\fI\%FILESYSTEM EXCLUSIVE OPERATIONS\fP
+for details.
+.TP
+.B generation
+(RO, since: 5.11)
+.sp
+Shows the generation of the mounted filesystem.
+.TP
+.B label
+(RW, since: 3.14)
+.sp
+Shows the current label of the mounted filesystem.
+.TP
+.B metadata_uuid
+(RO, since: 5.0)
+.sp
+Shows the metadata uuid of the mounted filesystem.
+Check \fImetadata_uuid\fP feature for more details.
+.TP
+.B nodesize
+(RO, since: 3.14)
+.sp
+Shows the nodesize of the mounted filesystem.
+.TP
+.B quota_override
+(RW, since: 4.13)
+.sp
+Shows the current quota override status.
+0 means no quota override.
+1 means quota override, quota can ignore the existing limit settings.
+.TP
+.B read_policy
+(RW, since: 5.11)
+.sp
+Shows the current balance policy for reads.
+Currently only \(dqpid\(dq (balance using pid value) is supported.
+.TP
+.B sectorsize
+(RO, since: 3.14)
+.sp
+Shows the sectorsize of the mounted filesystem.
+.UNINDENT
+.sp
+Files and directories in \fB/sys/fs/btrfs/<UUID>/allocation\fP directory are:
+.INDENT 0.0
+.TP
+.B global_rsv_reserved
+(RO, since: 3.14)
+.sp
+The used bytes of the global reservation.
+.TP
+.B global_rsv_size
+(RO, since: 3.14)
+.sp
+The total size of the global reservation.
+.TP
+.B \fIdata/\fP, \fImetadata/\fP and \fIsystem/\fP directories
+(RO, since: 5.14)
+.sp
+Space info accounting for the 3 chunk types.
+Mostly for debug purposes.
+.UNINDENT
+.sp
+Files in \fB/sys/fs/btrfs/<UUID>/allocation/\fP\fIdata,metadata,system\fP directory are:
+.INDENT 0.0
+.TP
+.B bg_reclaim_threshold
+(RW, since: 5.19)
+.sp
+Reclaimable space percentage of block group\(aqs size (excluding
+permanently unusable space) to reclaim the block group.
+Can be used on regular or zoned devices.
+.TP
+.B chunk_size
+(RW, since: 6.0)
+.sp
+Shows the chunk size. Can be changed for data and metadata.
+Cannot be set for zoned devices.
+.UNINDENT
+.sp
+Files in \fB/sys/fs/btrfs/<UUID>/devinfo/<DEVID>\fP directory are:
+.INDENT 0.0
+.TP
+.B error_stats
+(RO, since: 5.14)
+.sp
+Shows all the historical error numbers of the device.
+.TP
+.B fsid
+(RO, since: 5.17)
+.sp
+Shows the fsid which the device belongs to.
+It can be different than the \fI<UUID>\fP if it\(aqs a seed device.
+.TP
+.B in_fs_metadata
+(RO, since: 5.6)
+.sp
+Shows whether we have found the device.
+Should always be 1, as if this turns to 0, the \fI<DEVID>\fP directory
+would get removed automatically.
+.TP
+.B missing
+(RO, since: 5.6)
+.sp
+Shows whether the device is missing.
+.TP
+.B replace_target
+(RO, since: 5.6)
+.sp
+Shows whether the device is the replace target.
+If no dev\-replace is running, this value should be 0.
+.TP
+.B scrub_speed_max
+(RW, since: 5.14)
+.sp
+Shows the scrub speed limit for this device. The unit is Bytes/s.
+0 means no limit.
+.TP
+.B writeable
+(RO, since: 5.6)
+.sp
+Shows if the device is writeable.
+.UNINDENT
+.sp
+Files in \fB/sys/fs/btrfs/<UUID>/qgroups/\fP directory are:
+.INDENT 0.0
+.TP
+.B enabled
+(RO, since: 6.1)
+.sp
+Shows if qgroup is enabled.
+Also, if qgroup is disabled, the \fIqgroups\fP directory would
+be removed automatically.
+.TP
+.B inconsistent
+(RO, since: 6.1)
+.sp
+Shows if the qgroup numbers are inconsistent.
+If 1, it\(aqs recommended to do a qgroup rescan.
+.TP
+.B drop_subtree_threshold
+(RW, since: 6.1)
+.sp
+Shows the subtree drop threshold to automatically mark qgroup inconsistent.
+.sp
+When dropping large subvolumes with qgroup enabled, there would be a huge
+load for qgroup accounting.
+If we have a subtree whose level is larger than or equal to this value,
+we will not trigger qgroup account at all, but mark qgroup inconsistent to
+avoid the huge workload.
+.sp
+The default value is 8, with which no subtree drop can mark qgroup inconsistent.
+.sp
+Lower value can reduce qgroup workload, at the cost of extra qgroup rescan
+to re\-calculate the numbers.
+.UNINDENT
+.sp
+Files in \fB/sys/fs/btrfs/<UUID>/qgroups/<LEVEL>_<ID>/\fP directory are:
+.INDENT 0.0
+.TP
+.B exclusive
+(RO, since: 5.9)
+.sp
+Shows the exclusively owned bytes of the qgroup.
+.TP
+.B limit_flags
+(RO, since: 5.9)
+.sp
+Shows the numeric value of the limit flags.
+0 means no limit is set.
+.TP
+.B max_exclusive
+(RO, since: 5.9)
+.sp
+Shows the limits on exclusively owned bytes.
+.TP
+.B max_referenced
+(RO, since: 5.9)
+.sp
+Shows the limits on referenced bytes.
+.TP
+.B referenced
+(RO, since: 5.9)
+.sp
+Shows the referenced bytes of the qgroup.
+.TP
+.B rsv_data
+(RO, since: 5.9)
+.sp
+Shows the reserved bytes for data.
+.TP
+.B rsv_meta_pertrans
+(RO, since: 5.9)
+.sp
+Shows the reserved bytes for per transaction metadata.
+.TP
+.B rsv_meta_prealloc
+(RO, since: 5.9)
+.sp
+Shows the reserved bytes for preallocated metadata.
+.UNINDENT
+.sp
+Files in \fB/sys/fs/btrfs/<UUID>/discard/\fP directory are:
+.INDENT 0.0
+.TP
+.B discardable_bytes
+(RO, since: 6.1)
+.sp
+Shows amount of bytes that can be discarded in the async discard and
+nodiscard mode.
+.TP
+.B discardable_extents
+(RO, since: 6.1)
+.sp
+Shows number of extents to be discarded in the async discard and
+nodiscard mode.
+.TP
+.B discard_bitmap_bytes
+(RO, since: 6.1)
+.sp
+Shows amount of discarded bytes from data tracked as bitmaps.
+.TP
+.B discard_extent_bytes
+(RO, since: 6.1)
+.sp
+Shows amount of discarded bytes from data tracked as extents.
+.TP
+.B discard_bytes_saved
+(RO, since: 6.1)
+.sp
+Shows the amount of bytes that were reallocated without being discarded.
+.TP
+.B kbps_limit
+(RW, since: 6.1)
+.sp
+Tunable limit of kilobytes per second issued as discard IO in the async
+discard mode.
+.TP
+.B iops_limit
+(RW, since: 6.1)
+.sp
+Tunable limit of number of discard IO operations to be issued in the
+async discard mode.
+.TP
+.B max_discard_size
+(RW, since: 6.1)
+.sp
+Tunable limit for size of one IO discard request.
+.UNINDENT
+.SH FILESYSTEM EXCLUSIVE OPERATIONS
+.sp
+There are several operations that affect the whole filesystem and cannot be run
+in parallel. An attempt to start one while another is running will fail (see
+exceptions below).
+.sp
+Since kernel 5.10 the currently running operation can be obtained from
+\fB/sys/fs/btrfs/UUID/exclusive_operation\fP with the following values and operations:
+.INDENT 0.0
+.IP \(bu 2
+balance
+.IP \(bu 2
+balance paused (since 5.17)
+.IP \(bu 2
+device add
+.IP \(bu 2
+device delete
+.IP \(bu 2
+device replace
+.IP \(bu 2
+resize
+.IP \(bu 2
+swapfile activate
+.IP \(bu 2
+none
+.UNINDENT
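+.sp
+A script that needs to start an exclusive operation can poll the file until
+it reads \fInone\fP\&. The following is a minimal sketch: the helper name
+\fIwait_exclusive_idle\fP and the polling interval are hypothetical, and the
+file is expected at the \fIexclusive_operation\fP path described in section
+SYSFS INTERFACE.

```shell
# Hypothetical helper: wait until no exclusive operation is running.
# $1 = path to the exclusive_operation file
# $2 = polling interval in seconds (assumed 1 if unset)
wait_exclusive_idle() {
    file=$1
    interval=${2:-1}
    while :; do
        op=$(cat "$file") || return 1
        [ "$op" = "none" ] && return 0
        sleep "$interval"
    done
}

# Typical use:
#   wait_exclusive_idle "/sys/fs/btrfs/$uuid/exclusive_operation"
```

+.sp
+Note that \fIbalance paused\fP is also reported as an operation, so this loop
+keeps waiting while a balance is paused.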
+.sp
+Enqueuing is supported for several btrfs subcommands so they can be started
+at once and then serialized.
+.sp
+As an exception, a paused balance allows a device add operation to start,
+as they don\(aqt really collide. This can be used to add more space
+for the balance to finish.
+.SH FILESYSTEM LIMITS
+.INDENT 0.0
+.TP
+.B maximum file name length
+255
+.sp
+This limit is imposed by Linux VFS, the structures of BTRFS could store
+larger file names.
+.TP
+.B maximum symlink target length
+depends on the \fInodesize\fP value, for 4KiB it\(aqs 3949 bytes, for larger nodesize
+it\(aqs 4095 due to the system limit PATH_MAX
+.sp
+The symlink target may not be a valid path, i.e. the path name components
+can exceed the limits (NAME_MAX), there\(aqs no content validation at \fBsymlink(2)\fP
+creation.
+.TP
+.B maximum number of inodes
+2\s-2\u64\d\s0 but depends on the available metadata space as the inodes are created
+dynamically
+.sp
+Each subvolume is an independent namespace of inodes and thus their
+numbers, so the limit is per subvolume, not for the whole filesystem.
+.TP
+.B inode numbers
+minimum number: 256 (for subvolumes), regular files and directories: 257,
+maximum number: (2\s-2\u64\d\s0 \- 256)
+.sp
+The inode numbers that can be assigned to user created files are from
+the whole 64bit space except first 256 and last 256 in that range that
+are reserved for internal b\-tree identifiers.
+.TP
+.B maximum file length
+inherent limit of BTRFS is 2\s-2\u64\d\s0 (16 EiB) but the practical
+limit of Linux VFS is 2\s-2\u63\d\s0 (8 EiB)
+.TP
+.B maximum number of subvolumes
+the subvolume ids can go up to 2\s-2\u48\d\s0 but the number of actual subvolumes
+depends on the available metadata space
+.sp
+The space consumed by all subvolume metadata, which includes bookkeeping of
+shared extents, can be large (MiB, GiB). The range is not the full 64bit
+range because of qgroups that use the upper 16 bits for other
+purposes.
+.TP
+.B maximum number of hardlinks of a file in a directory
+65536 when the \fIextref\fP feature is turned on during mkfs (default), roughly
+100 otherwise and depends on file name length that fits into one metadata node
+.TP
+.B minimum filesystem size
+the minimal size of each device depends on the \fImixed\-bg\fP feature, without that
+(the default) it\(aqs about 109MiB, with mixed\-bg it\(aqs 16MiB
+.UNINDENT
+.SH BOOTLOADER SUPPORT
+.sp
+GRUB2 (\fI\%https://www.gnu.org/software/grub\fP) has the most advanced support of
+booting from BTRFS with respect to features.
+.sp
+U\-Boot (\fI\%https://www.denx.de/wiki/U\-Boot/\fP) has decent support for booting but
+not all BTRFS features are implemented, check the documentation.
+.sp
+In general, the first 1MiB on each device is unused with the exception of
+the primary superblock, which is at offset 64KiB and spans 4KiB. The rest can be
+freely used by bootloaders or for other system information. Note that booting
+from a filesystem on \fI\%zoned device\fP is not supported.
+.SH FILE ATTRIBUTES
+.sp
+The btrfs filesystem supports setting file attributes or flags. Note there are
+old and new interfaces, with confusing names. The following list should clarify
+that:
+.INDENT 0.0
+.IP \(bu 2
+\fIattributes\fP: \fBchattr(1)\fP or \fBlsattr(1)\fP utilities (the ioctls are
+FS_IOC_GETFLAGS and FS_IOC_SETFLAGS), due to the ioctl names the attributes
+are also called flags
+.IP \(bu 2
+\fIxflags\fP: to distinguish from the previous, it\(aqs extended flags, with tunable
+bits similar to the attributes but extensible and new bits will be added in
+the future (the ioctls are FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR but they
+are not related to extended attributes that are also called xattrs), there\(aqs
+no standard tool to change the bits, there\(aqs support in \fBxfs_io(8)\fP as
+command \fBxfs_io \-c chattr\fP
+.UNINDENT
+.SS Attributes
+.INDENT 0.0
+.TP
+.B a
+\fIappend only\fP, new writes are always written at the end of the file
+.TP
+.B A
+\fIno atime updates\fP
+.TP
+.B c
+\fIcompress data\fP, all data written after this attribute is set will be compressed.
+Please note that compression is also affected by the mount options or the parent
+directory attributes.
+.sp
+When set on a directory, all newly created files will inherit this attribute.
+This attribute cannot be set with \(aqm\(aq at the same time.
+.TP
+.B C
+\fIno copy\-on\-write\fP, file data modifications are done in\-place
+.sp
+When set on a directory, all newly created files will inherit this attribute.
+.sp
+\fBNOTE:\fP
+.INDENT 7.0
+.INDENT 3.5
+Due to implementation limitations, this flag can be set/unset only on
+empty files.
+.UNINDENT
+.UNINDENT
+.TP
+.B d
+\fIno dump\fP, makes sense with 3rd party tools like \fBdump(8)\fP, on BTRFS the
+attribute can be set/unset but no other special handling is done
+.TP
+.B D
+\fIsynchronous directory updates\fP, for more details search \fBopen(2)\fP for \fIO_SYNC\fP
+and \fIO_DSYNC\fP
+.TP
+.B i
+\fIimmutable\fP, no file data and metadata changes allowed even to the root user as
+long as this attribute is set (obviously the exception is unsetting the attribute)
+.TP
+.B m
+\fIno compression\fP, permanently turn off compression on the given file. Any
+compression mount options will not affect this file. (\fBchattr\fP support added in
+1.46.2)
+.sp
+When set on a directory, all newly created files will inherit this attribute.
+This attribute cannot be set with \fIc\fP at the same time.
+.TP
+.B S
+\fIsynchronous updates\fP, for more details search \fBopen(2)\fP for \fIO_SYNC\fP and
+\fIO_DSYNC\fP
+.UNINDENT
+.sp
+No other attributes are supported. For the complete list please refer to the
+\fBchattr(1)\fP manual page.
+.SS XFLAGS
+.sp
+There\(aqs an overlap of letters assigned to the bits with the attributes, this list
+refers to what \fBxfs_io(8)\fP provides:
+.INDENT 0.0
+.TP
+.B i
+\fIimmutable\fP, same as the attribute
+.TP
+.B a
+\fIappend only\fP, same as the attribute
+.TP
+.B s
+\fIsynchronous updates\fP, same as the attribute \fIS\fP
+.TP
+.B A
+\fIno atime updates\fP, same as the attribute
+.TP
+.B d
+\fIno dump\fP, same as the attribute
+.UNINDENT
+.SH ZONED MODE
+.sp
+Since version 5.12 btrfs supports so called \fIzoned mode\fP\&. This is a special
+on\-disk format and allocation/write strategy that\(aqs friendly to zoned devices.
+In short, a device is partitioned into fixed\-size zones and each zone can only
+be written in an append\-only manner, or reset. As btrfs has no fixed data
+structures, except the super blocks, the zoned mode only requires block
+placement that follows the device constraints. You can learn about the whole
+architecture at \fI\%https://zonedstorage.io\fP .
+.sp
+The devices are also called SMR/ZBC/ZNS, in \fIhost\-managed\fP mode. Note that
+there are devices that appear as non\-zoned but are actually zoned internally;
+these are \fIdrive\-managed\fP and using zoned mode won\(aqt help.
+.sp
+The zone size depends on the device, typical sizes are 256MiB or 1GiB. In
+general it must be a power of two. Emulated zoned devices like \fInull_blk\fP allow
+setting various zone sizes.
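+.sp
+To check whether the kernel sees a device as zoned, the block layer attributes
+in sysfs can be read. A minimal sketch, assuming a hypothetical device
+\fB/dev/sdX\fP; \fIzoned\fP reports the zone model (e.g. \fInone\fP or
+\fIhost\-managed\fP) and \fIchunk_sectors\fP reports the zone size in 512 byte
+sectors:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+$ cat /sys/block/sdX/queue/zoned
+$ cat /sys/block/sdX/queue/chunk_sectors
+.ft P
+.fi
+.UNINDENT
+.UNINDENT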
+.SS Requirements, limitations
+.INDENT 0.0
+.IP \(bu 2
+all devices must have the same zone size
+.IP \(bu 2
+maximum zone size is 8GiB
+.IP \(bu 2
+minimum zone size is 4MiB
+.IP \(bu 2
+mixing zoned and non\-zoned devices is possible, the zone writes are emulated,
+but this is mainly for testing
+.IP \(bu 2
+the super block is handled in a special way and is at different locations than on a non\-zoned filesystem:
+.INDENT 2.0
+.IP \(bu 2
+primary: 0B (two consecutive zones starting at this offset)
+.IP \(bu 2
+secondary: 512GiB (two consecutive zones starting at this offset)
+.IP \(bu 2
+tertiary: 4TiB (4096GiB, two consecutive zones starting at this offset)
+.UNINDENT
+.UNINDENT
+.SS Incompatible features
+.sp
+The main constraint of the zoned devices is lack of in\-place update of the data.
+This is inherently incompatible with some features:
+.INDENT 0.0
+.IP \(bu 2
+NODATACOW \- overwrite in\-place, cannot create such files
+.IP \(bu 2
+fallocate \- preallocating space for in\-place first write
+.IP \(bu 2
+mixed\-bg \- unordered writes to data and metadata, fixing that means using
+separate data and metadata block groups
+.IP \(bu 2
+booting \- the zone at offset 0 contains superblock, resetting the zone would
+destroy the bootloader data
+.UNINDENT
+.sp
+Initial support lacks some features but they\(aqre planned:
+.INDENT 0.0
+.IP \(bu 2
+only single (data, metadata) and DUP (metadata) profiles are supported
+.IP \(bu 2
+fstrim \- due to dependency on free space cache v1
+.UNINDENT
+.SS Super block
+.sp
+As said above, the super block is handled in a special way. In order to be crash
+safe, at least one zone in a known location must contain a valid superblock.
+This is implemented as a ring buffer in two consecutive zones, starting from
+known offsets 0B, 512GiB and 4TiB.
+.sp
+The values are different than on non\-zoned devices. Each new super block is
+appended to the end of the zone; once it\(aqs filled, the zone is reset and writes
+continue to the next one. Looking up the latest super block needs to read
+offsets of both zones and determine the last written version.
+.sp
+The amount of space reserved for super block depends on the zone size. The
+secondary and tertiary copies are at distant offsets as the capacity of the
+devices is expected to be large, tens of terabytes. Maximum zone size supported
+is 8GiB, which would mean that e.g. offset 0\-16GiB would be reserved just for
+the super block on a hypothetical device of that zone size. This is wasteful
+but required to guarantee crash safety.
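+.sp
+As a worked example of the reservation, each of the three copies occupies two
+consecutive zones, so the space reserved per copy is twice the zone size. A
+small shell sketch, using the typical and maximum zone sizes mentioned above
+as illustrative inputs:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+for zone_mib in 256 1024 8192; do
+    echo "zone ${zone_mib}MiB: $((2 * zone_mib))MiB reserved per super block copy"
+done
+.ft P
+.fi
+.UNINDENT
+.UNINDENT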
+.SS Devices
+.SS Real hardware
+.sp
+The WD Ultrastar series 600 advertises HM\-SMR, i.e. the host\-managed zoned
+mode. There are two more: DM (device managed, no zoned information exported to
+the system), HA (host aware, can be used as regular disk but zoned writes
+improve performance). There are not many devices available at the moment and
+the information about the exact zoned mode is hard to find; check data sheets or
+community sources gathering information from real devices.
+.sp
+Note: zoned mode won\(aqt work with DM\-SMR disks.
+.INDENT 0.0
+.IP \(bu 2
+Ultrastar® DC ZN540 NVMe ZNS SSD (\fI\%product
+brief\fP)
+.UNINDENT
+.SS Emulated: null_blk
+.sp
+The driver \fInull_blk\fP provides a memory\-backed device and is suitable for
+testing. There are some quirks setting up the devices. The module must be
+loaded with \fInr_devices=0\fP or the numbering of device nodes will be offset. The
+\fIconfigfs\fP must be mounted at \fI/sys/kernel/config\fP and the administration of
+the null_blk devices is done in \fI/sys/kernel/config/nullb\fP\&. The device nodes
+are named like \fB/dev/nullb0\fP and are numbered sequentially. NOTE: the device
+name may be different than the named directory in configfs!
+.sp
+Setup:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+modprobe\ configfs
+modprobe\ null_blk\ nr_devices=0
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+Create a device \fImydev\fP, assuming no other previously created devices, with a
+size of 2048MiB and a zone size of 256MiB. There are more tunable parameters,
+this is a minimal example taking defaults:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+cd\ /sys/kernel/config/nullb/
+mkdir\ mydev
+cd\ mydev
+echo\ 2048\ >\ size
+echo\ 1\ >\ zoned
+echo\ 1\ >\ memory_backed
+echo\ 256\ >\ zone_size
+echo\ 1\ >\ power
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+This will create a device \fB/dev/nullb0\fP and the value of file \fIindex\fP will
+match the ending number of the device node.
+.sp
+Remove the device:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+rmdir\ /sys/kernel/config/nullb/mydev
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+Once the device has been created (and before it is removed), continue with
+\fBmkfs.btrfs /dev/nullb0\fP, the zoned mode is auto\-detected.
+.sp
+For convenience, there\(aqs a script wrapping the basic null_blk management operations
+\fI\%https://github.com/kdave/nullb.git\fP, the above commands become:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+nullb setup
+nullb create \-s 2g \-z 256
+mkfs.btrfs /dev/nullb0
+\&...
+nullb rm nullb0
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.SS Emulated: TCMU runner
+.sp
+TCMU is a framework to emulate SCSI devices in userspace, providing various
+backends for the storage, with zoned support as well. A file\-backed zoned
+device can provide more options for larger storage and zone size. Please follow
+the instructions at \fI\%https://zonedstorage.io/projects/tcmu\-runner/\fP .
+.SS Compatibility, incompatibility
+.INDENT 0.0
+.IP \(bu 2
+the feature sets an incompat bit and requires a kernel with zoned support
+(5.12 or newer) to access the filesystem (for both read and write)
+.IP \(bu 2
+superblock needs to be handled in a special way, there are still 3 copies
+but at different offsets (0, 512GiB, 4TiB) and the 2 consecutive zones are a
+ring buffer of the superblocks, finding the latest one requires reading it from
+the write pointer or doing a full scan of the zones
+.IP \(bu 2
+mixing zoned and non zoned devices is possible (zones are emulated) but is
+recommended only for testing
+.IP \(bu 2
+mixing zoned devices with different zone sizes is not possible
+.IP \(bu 2
+zone sizes must be a power of two, zone sizes of real devices are e.g. 256MiB
+or 1GiB, larger size is expected, maximum zone size supported by btrfs is
+8GiB
+.UNINDENT
+.SS Status, stability, reporting bugs
+.sp
+The zoned mode has been released in 5.12 and there are still some rough edges
+and corner cases one can hit during testing. Please report bugs to
+\fI\%https://github.com/naota/linux/issues/\fP .
+.SS References
+.INDENT 0.0
+.IP \(bu 2
+\fI\%https://zonedstorage.io\fP
+.INDENT 2.0
+.IP \(bu 2
+\fI\%https://zonedstorage.io/projects/libzbc/\fP \-\- \fIlibzbc\fP is a library and a set
+of tools to directly manipulate devices with ZBC/ZAC support
+.IP \(bu 2
+\fI\%https://zonedstorage.io/projects/libzbd/\fP \-\- \fIlibzbd\fP uses the kernel
+provided zoned block device interface based on the ioctl() system calls
+.UNINDENT
+.IP \(bu 2
+\fI\%https://hddscan.com/blog/2020/hdd\-wd\-smr.html\fP \-\- some details about exact device types
+.IP \(bu 2
+\fI\%https://lwn.net/Articles/853308/\fP \-\- \fIBtrfs on zoned block devices\fP
+.IP \(bu 2
+\fI\%https://www.usenix.org/conference/vault20/presentation/bjorling\fP \-\- Zone
+Append: A New Way of Writing to Zoned Storage
+.UNINDENT
+.SH CONTROL DEVICE
+.sp
+There\(aqs a character special device \fB/dev/btrfs\-control\fP with major and minor
+numbers 10 and 234 (the device can be found under the \fImisc\fP category).
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+$ ls \-l /dev/btrfs\-control
+crw\-\-\-\-\-\-\- 1 root root 10, 234 Jan 1 12:00 /dev/btrfs\-control
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The device accepts some ioctl calls that can perform the following actions on the
+filesystem module:
+.INDENT 0.0
+.IP \(bu 2
+scan devices for btrfs filesystem (i.e. to let multi\-device filesystems mount
+automatically) and register them with the kernel module
+.IP \(bu 2
+similar to scan, but also wait until the device scanning process is finished
+for a given filesystem
+.IP \(bu 2
+get the supported features (can be also found under \fB/sys/fs/btrfs/features\fP)
+.UNINDENT
+.sp
+The device is created when btrfs is initialized, either as a module or a
+built\-in functionality and makes sense only in connection with that. Running
+e.g. mkfs without the module loaded will not register the device and will
+probably warn about that.
+.sp
+In rare cases when the module is loaded but the device is not present (most
+likely accidentally deleted), it\(aqs possible to recreate it by
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# mknod \-\-mode=600 /dev/btrfs\-control c 10 234
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+or (since 5.11) by a convenience command
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# btrfs rescue create\-control\-device
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The control device is not strictly required but the device scanning will not
+work and a workaround would need to be used to mount a multi\-device filesystem.
+The mount option \fIdevice\fP can trigger the device scanning during mount, see
+also \fBbtrfs device scan\fP\&.
+.SH FILESYSTEM WITH MULTIPLE PROFILES
+.sp
+It is possible that a btrfs filesystem contains multiple block group profiles
+of the same type. This could happen when a profile conversion using balance
+filters is interrupted (see \fI\%btrfs\-balance(8)\fP). Some
+\fBbtrfs\fP commands perform
+a test to detect this kind of condition and print a warning like this:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+WARNING: Multiple block group profiles detected, see \(aqman btrfs(5)\(aq.
+WARNING: Data: single, raid1
+WARNING: Metadata: single, raid1
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The corresponding output of \fBbtrfs filesystem df\fP might look like:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+WARNING: Multiple block group profiles detected, see \(aqman btrfs(5)\(aq.
+WARNING: Data: single, raid1
+WARNING: Metadata: single, raid1
+Data, RAID1: total=832.00MiB, used=0.00B
+Data, single: total=1.63GiB, used=0.00B
+System, single: total=4.00MiB, used=16.00KiB
+Metadata, single: total=8.00MiB, used=112.00KiB
+Metadata, RAID1: total=64.00MiB, used=32.00KiB
+GlobalReserve, single: total=16.25MiB, used=0.00B
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+There\(aqs more than one line for type \fIData\fP and \fIMetadata\fP, while the profiles
+are \fIsingle\fP and \fIRAID1\fP\&.
+.sp
+This state of the filesystem is OK but most likely needs the user/administrator to
+take an action and finish the interrupted tasks. This cannot be easily done
+automatically, as only the user knows the expected final profiles.
+.sp
+In the example above, the filesystem started as a single device and \fIsingle\fP
+block group profile. Then another device was added, followed by balance with
+\fIconvert=raid1\fP that for some reason hasn\(aqt finished. Restarting the balance
+with \fIconvert=raid1\fP will continue and end up with a filesystem with all block
+group profiles \fIRAID1\fP\&.
+.sp
+\fBNOTE:\fP
+.INDENT 0.0
+.INDENT 3.5
+If you\(aqre familiar with balance filters, you can use
+\fIconvert=raid1,profiles=single,soft\fP, which will take only the unconverted
+\fIsingle\fP profiles and convert them to \fIraid1\fP\&. This may speed up the conversion
+as it would not try to rewrite the already converted \fIraid1\fP profiles.
+.UNINDENT
+.UNINDENT
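+.sp
+As a sketch of finishing the interrupted conversion described above, assuming
+the filesystem is mounted at the hypothetical path \fB/mnt\fP and both data and
+metadata should end up as \fIraid1\fP:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# btrfs balance start \-dconvert=raid1,soft \-mconvert=raid1,soft /mnt
+# btrfs filesystem df /mnt
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The \fIsoft\fP filter skips block groups that already have the target profile, and
+the second command can be used to check that only one profile per type is left.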
+.sp
+Having just one profile is desired as this also clearly defines the profile of
+newly allocated block groups, otherwise this depends on internal allocation
+policy. When there are multiple profiles present, the order of selection is
+RAID56, RAID10, RAID1, RAID0 as long as the device number constraints are
+satisfied.
+.sp
+The commands that print the warning were chosen so that it\(aqs brought to the
+user\(aqs attention when the filesystem state is being changed in that regard.
+These are: \fBdevice add\fP, \fBdevice delete\fP, \fBbalance cancel\fP, \fBbalance pause\fP,
+and the commands that report space usage: \fBfilesystem df\fP, \fBdevice usage\fP\&.
+The command \fBfilesystem usage\fP provides a line in the overall summary:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+Multiple profiles: yes (data, metadata)
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.SH SEEDING DEVICE
+.sp
+The COW mechanism and multiple devices under one hood enable an interesting
+concept, called a seeding device: extending a read\-only filesystem on a
+device with another device that captures all writes. For example
+imagine an immutable golden image of an operating system enhanced with another
+device that allows using the data from the golden image during normal operation.
+This idea originated with CD\-ROMs containing a base OS that were used for live
+systems, but this became obsolete. There are technologies providing similar
+functionality, like \fI\%unionmount\fP,
+\fI\%overlayfs\fP or
+\fI\%qcow2\fP image snapshot.
+.sp
+The seeding device starts as a normal filesystem, once the contents are ready,
+\fBbtrfstune \-S 1\fP is used to flag it as a seeding device. Mounting such a device
+will not allow any writes, except adding a new device by \fBbtrfs device add\fP\&.
+Then the filesystem can be remounted as read\-write.
+.sp
+Given that the filesystem on the seeding device is always recognized as
+read\-only, it can be used to seed multiple filesystems from one device at the
+same time. The UUID that is normally attached to a device is automatically
+changed to a random UUID on each mount.
+.sp
+Once the seeding device is mounted, it needs the writable device. After adding
+it, unmounting and mounting with \fBumount /path; mount /dev/writable
+/path\fP or remounting read\-write with \fBmount \-o remount,rw /path\fP makes the
+filesystem at \fB/path\fP ready for use.
+.sp
+\fBNOTE:\fP
+.INDENT 0.0
+.INDENT 3.5
+There is a known bug with using remount to make the mount writeable:
+remount will leave the filesystem in a state where it is unable to
+clean deleted snapshots, so it will leak space until it is unmounted
+and mounted properly.
+.UNINDENT
+.UNINDENT
+.sp
+Furthermore, deleting the seeding device from the filesystem can turn it into
+a normal filesystem, provided that the writable device can also contain all the
+data from the seeding device.
+.sp
+The seeding device flag can be cleared again by \fBbtrfstune \-f \-S 0\fP, e.g.
+allowing to update with newer data but please note that this will invalidate
+all existing filesystems that use this particular seeding device. This works
+for some use cases, not for others, and the forcing flag to the command is
+mandatory to avoid accidental mistakes.
+.sp
+Example of how to create and use one seeding device:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# mkfs.btrfs /dev/sda
+# mount /dev/sda /mnt/mnt1
+\&... fill mnt1 with data
+# umount /mnt/mnt1
+
+# btrfstune \-S 1 /dev/sda
+
+# mount /dev/sda /mnt/mnt1
+# btrfs device add /dev/sdb /mnt/mnt1
+# umount /mnt/mnt1
+# mount /dev/sdb /mnt/mnt1
+\&... /mnt/mnt1 is now writable
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+Now \fB/mnt/mnt1\fP can be used normally. The device \fB/dev/sda\fP can be mounted
+again with another writable device:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# mount /dev/sda /mnt/mnt2
+# btrfs device add /dev/sdc /mnt/mnt2
+# umount /mnt/mnt2
+# mount /dev/sdc /mnt/mnt2
+\&... /mnt/mnt2 is now writable
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The writable device (\fB/dev/sdb\fP) can be decoupled from the seeding device and
+used independently:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# btrfs device delete /dev/sda /mnt/mnt1
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+As the contents originated in the seeding device, it\(aqs possible to turn
+\fB/dev/sdb\fP into a seeding device again and repeat the whole process.
+.sp
+A few things to note:
+.INDENT 0.0
+.IP \(bu 2
+it\(aqs recommended to use only a single device for the seeding device, it works
+for multiple devices but the \fIsingle\fP profile must be used in order to make
+the seeding device deletion work
+.IP \(bu 2
+block group profiles \fIsingle\fP and \fIdup\fP support the use cases above
+.IP \(bu 2
+the label is copied from the seeding device and can be changed by \fBbtrfs filesystem label\fP
+.IP \(bu 2
+each new mount of the seeding device gets a new random UUID
+.IP \(bu 2
+\fBumount /path; mount /dev/writable /path\fP can be replaced with
+\fBmount \-o remount,rw /path\fP
+but it won\(aqt reclaim space of deleted subvolumes until the seeding device
+is mounted read\-write again before making it seeding again
+.UNINDENT
+.SS Chained seeding devices
+.sp
+Though it\(aqs not recommended and is rather an obscure and untested use case,
+chaining seeding devices is possible. In the first example, the writable device
+\fB/dev/sdb\fP can be turned into another seeding device again, depending on the
+unchanged seeding device \fB/dev/sda\fP\&. Then using \fB/dev/sdb\fP as the primary
+seeding device it can be extended with another writable device, say \fB/dev/sdc\fP,
+and it continues as before as a simple tree structure on devices.
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# mkfs.btrfs /dev/sda
+# mount /dev/sda /mnt/mnt1
+\&... fill mnt1 with data
+# umount /mnt/mnt1
+
+# btrfstune \-S 1 /dev/sda
+
+# mount /dev/sda /mnt/mnt1
+# btrfs device add /dev/sdb /mnt/mnt1
+# mount \-o remount,rw /mnt/mnt1
+\&... /mnt/mnt1 is now writable
+# umount /mnt/mnt1
+
+# btrfstune \-S 1 /dev/sdb
+
+# mount /dev/sdb /mnt/mnt1
+# btrfs device add /dev/sdc /mnt/mnt1
+# mount \-o remount,rw /mnt/mnt1
+\&... /mnt/mnt1 is now writable
+# umount /mnt/mnt1
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+As a result we have:
+.INDENT 0.0
+.IP \(bu 2
+\fIsda\fP is a single seeding device, with its initial contents
+.IP \(bu 2
+\fIsdb\fP is a seeding device but requires \fIsda\fP, the contents are from the time
+when \fIsdb\fP is made seeding, i.e. contents of \fIsda\fP with any later changes
+.IP \(bu 2
+\fIsdc\fP is the last writable device, it can be made a seeding one the same way
+as \fIsdb\fP was, preserving its contents and depending on \fIsda\fP and \fIsdb\fP
+.UNINDENT
+.sp
+As long as the seeding devices are unmodified and available, they can be used
+to start another branch.
+.SH RAID56 STATUS AND RECOMMENDED PRACTICES
+.sp
+The RAID56 feature provides striping and parity over several devices, same as
+the traditional RAID5/6. There are some implementation and design deficiencies
+that make it unreliable for some corner cases and the feature \fBshould not be
+used in production, only for evaluation or testing\fP\&. The power failure safety
+for metadata with RAID56 is not 100%.
+.SS Metadata
+.sp
+Do not use \fIraid5\fP nor \fIraid6\fP for metadata. Use \fIraid1\fP or \fIraid1c3\fP
+respectively.
+.sp
+The substitute profiles provide the same guarantees against loss of 1 or 2
+devices, and in some respect can be an improvement. Recovering from one
+missing device will only need to access the remaining 1st or 2nd copy, that in
+general may be stored on some other devices due to the way RAID1 works on
+btrfs, unlike on a striped profile (similar to \fIraid0\fP) that would need all
+devices all the time.
+.sp
+The space allocation pattern and consumption is different (e.g. on N devices):
+for \fIraid5\fP as an example, a 1GiB chunk is reserved on each device, while with
+\fIraid1\fP there\(aqs each 1GiB chunk stored on 2 devices. The consumption of each
+1GiB of used metadata is then \fIN * 1GiB\fP for \fIraid5\fP vs \fI2 * 1GiB\fP for \fIraid1\fP\&.
+Using \fIraid1\fP is also more convenient for balancing/converting to other profile
+due to lower requirement on the available chunk space.
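+.sp
+A sketch of converting metadata away from RAID56 with a balance filter,
+assuming the filesystem is mounted at the hypothetical path \fB/mnt\fP (use
+\fIraid1c3\fP instead of \fIraid1\fP when the data profile is \fIraid6\fP):
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# btrfs balance start \-mconvert=raid1 /mnt
+.ft P
+.fi
+.UNINDENT
+.UNINDENT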
+.SS Missing/incomplete support
+.sp
+When RAID56 is on the same filesystem with different raid profiles, the space
+reporting is inaccurate, e.g. \fBdf\fP, \fBbtrfs filesystem df\fP or
+\fBbtrfs filesystem usage\fP\&. When there\(aqs only one profile per block
+group type (e.g. RAID5 for data) the reporting is accurate.
+.sp
+When scrub is started on a RAID56 filesystem, it\(aqs started on all devices,
+which degrades the performance. The workaround is to start it on each device
+separately. Due to that the device stats may not match the actual state and
+some errors might get reported multiple times.
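+.sp
+A sketch of the per\-device workaround, assuming a filesystem made of the
+hypothetical devices \fB/dev/sda\fP and \fB/dev/sdb\fP; the option \fI\-B\fP keeps each
+scrub in the foreground so the devices are processed one after another:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# btrfs scrub start \-B /dev/sda
+# btrfs scrub start \-B /dev/sdb
+.ft P
+.fi
+.UNINDENT
+.UNINDENT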
+.sp
+The \fIwrite hole\fP problem. An unclean shutdown could leave a partially written
+stripe in a state where some stripe ranges and the parity are from the old
+writes and some are new. The information which is which is not tracked. Write
+journal is not implemented. Alternatively a full read\-modify\-write would make
+sure that a full stripe is always written, avoiding the write hole completely,
+but performance in that case turned out to be too bad for use.
+.sp
+The striping happens on all available devices (at the time the chunks were
+allocated), so in case a new device is added it may not be utilized
+immediately and would require a rebalance. A fixed configured stripe width is
+not implemented.
+.SH STORAGE MODEL, HARDWARE CONSIDERATIONS
+.SS Storage model
+.sp
+\fIA storage model is a model that captures key physical aspects of data
+structure in a data store. A filesystem is the logical structure organizing
+data on top of the storage device.\fP
+.sp
+The filesystem assumes several features or limitations of the storage device
+and utilizes them or applies measures to guarantee reliability. BTRFS in
+particular is based on a COW (copy on write) mode of writing, i.e. not updating
+data in place but rather writing a new copy to a different location and then
+atomically switching the pointers.
+.sp
+In an ideal world, the device does what it promises. The filesystem assumes
+that this may not be true so additional mechanisms are applied to either detect
+misbehaving hardware or get valid data by other means. The devices may (and do)
+apply their own detection and repair mechanisms but we won\(aqt assume any.
+.sp
+The following assumptions about storage devices are considered (sorted by
+importance, numbers are for further reference):
+.INDENT 0.0
+.IP 1. 3
+atomicity of reads and writes of blocks/sectors (the smallest unit of data
+the device presents to the upper layers)
+.IP 2. 3
+there\(aqs a flush command that instructs the device to forcibly order writes
+before and after the command; alternatively there\(aqs a barrier command that
+facilitates the ordering but may not flush the data
+.IP 3. 3
+data sent to write to a given device offset will be written without further
+changes to the data and to the offset
+.IP 4. 3
+writes can be reordered by the device, unless explicitly serialized by the
+flush command
+.IP 5. 3
+reads and writes can be freely reordered and interleaved
+.UNINDENT
+.sp
+The consistency model of BTRFS builds on these assumptions. The logical data
+updates are grouped into a generation, written on the device, serialized by
+the flush command and then the super block is written ending the generation.
+All logical links among metadata comprising a consistent view of the data may
+not cross the generation boundary.
+.SS When things go wrong
+.sp
+\fBNo or partial atomicity of block reads/writes (1)\fP
+.INDENT 0.0
+.IP \(bu 2
+\fIProblem\fP: only part of the block contents is written (\fItorn write\fP), e.g. due to a
+power glitch or other electronics failure during the read/write
+.IP \(bu 2
+\fIDetection\fP: checksum mismatch on read
+.IP \(bu 2
+\fIRepair\fP: use another copy or rebuild from multiple blocks using some encoding
+scheme
+.UNINDENT
+.sp
+\fBThe flush command does not flush (2)\fP
+.sp
+This is perhaps the most serious problem and impossible to mitigate by
+the filesystem without limitations and design restrictions. What could happen in
+the worst case is that writes from one generation bleed to another one, while
+still letting the filesystem consider the generations isolated. Crash at any
+point would leave data on the device in an inconsistent state without any hint
+what exactly got written, what is missing and leading to stale metadata link
+information.
+.sp
+Devices usually honor the flush command, but for performance reasons may do
+internal caching, where the flushed data are not yet persistently stored. A
+power failure could lead to a similar scenario as above, although it\(aqs less
+likely that later writes would be written before the cached ones. This is
+beyond what a filesystem can take into account. Devices or controllers are
+usually equipped with batteries or capacitors to write the cache contents even
+after power is cut. (\fIBattery backed write cache\fP)
+.sp
+\fBData get silently changed on write (3)\fP
+.sp
+Such a thing should not happen frequently, but still can happen spuriously due
+to the complex internal workings of devices or physical effects of the storage
+media itself.
+.INDENT 0.0
+.IP \(bu 2
+\fIProblem\fP: while the data are written atomically, the contents get changed
+.IP \(bu 2
+\fIDetection\fP: checksum mismatch on read
+.IP \(bu 2
+\fIRepair\fP: use another copy or rebuild from multiple blocks using some
+encoding scheme
+.UNINDENT
+.sp
+\fBData get silently written to another offset (3)\fP
+.sp
+This would be another serious problem as the filesystem has no information
+when it happens. For that reason the measures have to be done ahead of time.
+This problem is also commonly called \fIghost write\fP\&.
+.sp
+The metadata blocks have the checksum embedded in the blocks, so a correct
+atomic write would not corrupt the checksum. It\(aqs likely that after reading
+such a block the data inside would not be consistent with the rest. To rule that
+out there\(aqs an embedded block number in the metadata block. It\(aqs the logical
+block number because this is what the logical structure expects and verifies.
+.sp
+The following is based on information publicly available, user feedback,
+community discussions or bug report analyses. It\(aqs not complete and further
+research is encouraged when in doubt.
+.SS Main memory
+.sp
+The data structures and raw data blocks are temporarily stored in computer
+memory before they get written to the device. It is critical that memory is
+reliable because even simple bit flips can have vast consequences and lead to
+damaged structures, not only in the filesystem but in the whole operating
+system.
+.sp
+Based on experience in the community, memory bit flips are more common than one
+would think. When it happens, it\(aqs reported by the tree\-checker or by a checksum
+mismatch after reading blocks. There are some very obvious instances of bit
+flips that happen, e.g. in an ordered sequence of keys in metadata blocks. We can
+easily infer from the other data what values get damaged and how. However, fixing
+that is not straightforward and would require cross\-referencing data from the
+entire filesystem to see the scope.
+.sp
+If available, ECC memory should lower the chances of bit flips, but this
+type of memory is not available in all cases. A memory test should be performed
+in case there\(aqs a visible bit flip pattern, though this may not detect a faulty
+memory module because the actual load of the system could be the factor making
+the problems appear. In recent years attacks on how the memory modules operate
+have been demonstrated (\fIrowhammer\fP) causing specific bits to be flipped.
+While these were targeted, this shows that a series of reads or writes can
+affect unrelated parts of memory.
+.sp
+Further reading:
+.INDENT 0.0
+.IP \(bu 2
+\fI\%https://en.wikipedia.org/wiki/Row_hammer\fP
+.UNINDENT
+.sp
+What to do:
+.INDENT 0.0
+.IP \(bu 2
+run \fImemtest\fP, note that sometimes memory errors happen only when the system
+is under heavy load that the default memtest cannot trigger
+.IP \(bu 2
+memory errors may appear as the filesystem going read\-only due to \(dqpre write\(dq
+checks, which verify metadata before it gets written and abort the write when
+basic consistency checks fail
+.UNINDENT
+.SS Direct memory access (DMA)
+.sp
+Another class of errors is related to DMA (direct memory access) performed
+by device drivers. While this could be considered a software error, the
+data transfers that happen without CPU assistance may accidentally corrupt
+other pages. Storage devices utilize DMA for performance reasons, the
+filesystem structures and data pages are passed back and forth, making
+errors possible in case page life time is not properly tracked.
+.sp
+There are lots of quirks (device\-specific workarounds) in Linux kernel
+drivers (regarding not only DMA) that are added when found. The quirks
+may avoid specific errors or disable some features to avoid worse problems.
+.sp
+What to do:
+.INDENT 0.0
+.IP \(bu 2
+use an up\-to\-date kernel (recent releases or maintained long term support versions)
+.IP \(bu 2
+as this may be caused by faulty drivers, keep the systems up\-to\-date
+.UNINDENT
+.SS Rotational disks (HDD)
+.sp
+Rotational HDDs typically fail at the level of individual sectors or small clusters.
+Read failures are caught on the levels below the filesystem and are returned to
+the user as \fIEIO \- Input/output error\fP\&. Reading the blocks repeatedly may
+return the data eventually, but this is better done by specialized tools and
+the filesystem takes the result of the lower layers. Rewriting the sectors may
+trigger internal remapping but this inevitably leads to data loss.
+.sp
+Disk firmware is technically software but from the filesystem perspective is
+part of the hardware. IO requests are processed, and caching or various
+other optimizations are performed, which may lead to bugs under high load or
+unexpected physical conditions or unsupported use cases.
+.sp
+Disks are connected by cables with two ends, both of which can cause problems
+when not attached properly. Data transfers are protected by checksums and the
+lower layers try hard to transfer the data correctly or not at all. The errors
+from badly\-connecting cables may manifest as a large number of failed read or
+write requests, or as short error bursts depending on physical conditions.
+.sp
+What to do:
+.INDENT 0.0
+.IP \(bu 2
+check \fBsmartctl\fP for potential issues
+.UNINDENT
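+.sp
+For example, assuming the \fIsmartmontools\fP package is installed and using a
+hypothetical device \fB/dev/sdX\fP, the overall health status and the full set of
+SMART information can be printed by:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# smartctl \-H /dev/sdX
+# smartctl \-a /dev/sdX
+.ft P
+.fi
+.UNINDENT
+.UNINDENT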
+.SS Solid state drives (SSD)
+.sp
+The mechanism of information storage is different from HDDs and this affects
+the failure mode as well. The data are stored in cells grouped in large blocks
+with limited number of resets and other write constraints. The firmware tries
+to avoid unnecessary resets and performs optimizations to maximize the storage
+media lifetime. The known techniques are deduplication (blocks with same
+fingerprint/hash are mapped to same physical block), compression or internal
+remapping and garbage collection of used memory cells. Due to the additional
+processing there are measures to verify the data e.g. by ECC codes.
+.sp
+The observations of failing SSDs show that the whole electronics fail at once
+or the failure affects a lot of data (e.g. stored on one chip). Recovering such data
+may need specialized equipment and reading data repeatedly does not help as
+it can with HDDs.
+.sp
+There are several technologies of the memory cells with different
+characteristics and price. The lifetime is directly affected by the type and
+frequency of data written. Writing \(dqtoo much\(dq distinct data (e.g. encrypted)
+may render the internal deduplication ineffective and lead to a lot of rewrites
+and increased wear of the memory cells.
+.sp
+There are many technologies and manufacturers, so it\(aqs hard to describe them
+all, but some classes of devices exhibit similar behaviour:
+.INDENT 0.0
+.IP \(bu 2
+an expensive SSD uses more durable memory cells and is optimized for
+reliability and high load
+.IP \(bu 2
+a cheap SSD is designed for a lower load (\(dqdesktop user\(dq) and is optimized
+for cost; it may employ the optimizations and/or extended error reporting
+partially or not at all
+.UNINDENT
+.sp
+It\(aqs not possible to reliably determine the expected lifetime of an SSD due to
+lack of information about how it works or due to lack of reliable stats provided
+by the device.
+.sp
+Metadata writes tend to be the biggest component of lifetime writes to an SSD,
+so there is some value in reducing them. Depending on the device class (high
+end/low end), features like DUP block group profiles may affect
+reliability either way:
+.INDENT 0.0
+.IP \(bu 2
+\fIhigh end\fP devices are typically more reliable and using \fIsingle\fP for data
+and metadata could be suitable to reduce device wear
+.IP \(bu 2
+\fIlow end\fP devices could lack the ability to identify errors, so additional
+redundancy at the filesystem level (checksums, \fIDUP\fP) could help
+.UNINDENT
+.sp
+Only users who consume 50 to 100% of the SSD\(aqs actual lifetime writes need to be
+concerned about the write amplification of btrfs DUP metadata. Most users will be
+far below 50% of the actual lifetime, or will write the drive to death and
+discover how many writes 100% of the actual lifetime was. SSD firmware often
+adds its own write multipliers that can be arbitrary and unpredictable and
+dependent on application behavior, and these will typically have far greater
+effect on SSD lifespan than DUP metadata. It\(aqs more or less impossible to
+predict when an SSD will run out of lifetime writes to within a factor of two, so
+it\(aqs hard to justify wear reduction as a benefit.
+.sp
+Further reading:
+.INDENT 0.0
+.IP \(bu 2
+\fI\%https://www.snia.org/educational\-library/ssd\-and\-deduplication\-end\-spinning\-disk\-2012\fP
+.IP \(bu 2
+\fI\%https://www.snia.org/educational\-library/realities\-solid\-state\-storage\-2013\-2013\fP
+.IP \(bu 2
+\fI\%https://www.snia.org/educational\-library/ssd\-performance\-primer\-2013\fP
+.IP \(bu 2
+\fI\%https://www.snia.org/educational\-library/how\-controllers\-maximize\-ssd\-life\-2013\fP
+.UNINDENT
+.sp
+What to do:
+.INDENT 0.0
+.IP \(bu 2
+run \fBsmartctl\fP or self\-tests to look for potential issues
+.IP \(bu 2
+keep the firmware up\-to\-date
+.UNINDENT
+.SS NVM express, non\-volatile memory (NVMe)
+.sp
+NVMe is a type of persistent memory usually connected over a system bus (PCIe)
+or similar interface and the speeds are an order of magnitude faster than SSD.
+It is also a non\-rotating type of storage, and is not typically connected by a
+cable. It\(aqs not a SCSI type device either but rather a complete specification
+for a logical device interface.
+.sp
+In a way the errors could be compared to a combination of SSD class and regular
+memory. Errors may manifest as random bit flips or IO failures. There are tools
+to access the internal log (\fBnvme log\fP and \fBnvme\-cli\fP) for a more detailed
+analysis.
+.sp
+Separate error detection and correction steps are performed, e.g. at the bus
+level, and in most cases the errors never make it to the filesystem level. If
+they do, it could mean there\(aqs some systematic error like overheating or a bad
+physical connection of the device. You may want to run self\-tests (using
+\fBsmartctl\fP).
+.INDENT 0.0
+.IP \(bu 2
+\fI\%https://en.wikipedia.org/wiki/NVM_Express\fP
+.IP \(bu 2
+\fI\%https://www.smartmontools.org/wiki/NVMe_Support\fP
+.UNINDENT
+.SS Drive firmware
+.sp
+Firmware is technically still software, but embedded into the hardware. As all
+software has bugs, so does firmware. The firmware of storage devices can be
+updated to fix known bugs. In some cases it\(aqs possible to avoid certain bugs by
+quirks (device\-specific workarounds) in the Linux kernel.
+.sp
+Faulty firmware can cause a wide range of corruptions, from small and localized
+ones to large ones affecting lots of data. Self\-repair capabilities may not be
+sufficient.
+.sp
+What to do:
+.INDENT 0.0
+.IP \(bu 2
+check for firmware updates in case there are known problems; note that
+updating firmware can be risky in itself
+.IP \(bu 2
+use an up\-to\-date kernel (recent releases or maintained long term support versions)
+.UNINDENT
+.SS SD flash cards
+.sp
+Many low\-power devices use storage media with low power consumption as well,
+typically flash memory on a chip enclosed in a detachable card package. An
+improperly inserted card may be damaged by electrical spikes when the device is
+turned on or off. The chips storing data may in turn be damaged permanently.
+All types of flash memory have a limited number of rewrites, so the data are
+internally translated by the FTL (flash translation layer). This is implemented
+in firmware (technically software) and is prone to bugs that manifest as
+hardware errors.
+.sp
+Adding redundancy, such as using DUP profiles for both data and metadata, can
+help in some cases, but a full backup might be the best option once problems
+appear, and replacing the card could be required as well.
+.SS Hardware as the main source of filesystem corruptions
+.sp
+\fBIf you use unreliable hardware and don\(aqt know about that, don\(aqt blame the
+filesystem when it tells you.\fP
+.SH SEE ALSO
+.sp
+\fBacl(5)\fP,
+\fI\%btrfs(8)\fP,
+\fBchattr(1)\fP,
+\fBfstrim(8)\fP,
+\fBioctl(2)\fP,
+\fI\%mkfs.btrfs(8)\fP,
+\fBmount(8)\fP,
+\fBswapon(8)\fP
+.\" Generated by docutils manpage writer.
+.