From ace9429bb58fd418f0c81d4c2835699bddf6bde6 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Thu, 11 Apr 2024 10:27:49 +0200 Subject: Adding upstream version 6.6.15. Signed-off-by: Daniel Baumann --- Documentation/filesystems/erofs.rst | 364 ++++++++++++++++++++++++++++++++++++ 1 file changed, 364 insertions(+) create mode 100644 Documentation/filesystems/erofs.rst (limited to 'Documentation/filesystems/erofs.rst') diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst new file mode 100644 index 0000000000..f200d78744 --- /dev/null +++ b/Documentation/filesystems/erofs.rst @@ -0,0 +1,364 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================== +EROFS - Enhanced Read-Only File System +====================================== + +Overview +======== + +EROFS filesystem stands for Enhanced Read-Only File System. It aims to form a +generic read-only filesystem solution for various read-only use cases instead +of just focusing on storage space saving without considering any side effects +of runtime performance. + +It is designed to meet the needs of flexibility, feature extendability and user +payload friendly, etc. Apart from those, it is still kept as a simple +random-access friendly high-performance filesystem to get rid of unneeded I/O +amplification and memory-resident overhead compared to similar approaches. + +It is implemented to be a better choice for the following scenarios: + + - read-only storage media or + + - part of a fully trusted read-only solution, which means it needs to be + immutable and bit-for-bit identical to the official golden image for + their releases due to security or other considerations and + + - hope to minimize extra storage space with guaranteed end-to-end performance + by using compact layout, transparent file compression and direct access, + especially for those embedded devices with limited memory and high-density + hosts with numerous containers. + +Here are the main features of EROFS: + + - Little endian on-disk design; + + - Block-based distribution and file-based distribution over fscache are + supported; + + - Support multiple devices to refer to external blobs, which can be used + for container images; + + - 32-bit block addresses for each device, therefore 16TiB address space at + most with 4KiB block size for now; + + - Two inode layouts for different requirements: + + ===================== ============ ====================================== + compact (v1) extended (v2) + ===================== ============ ====================================== + Inode metadata size 32 bytes 64 bytes + Max file size 4 GiB 16 EiB (also limited by max. vol size) + Max uids/gids 65536 4294967296 + Per-inode timestamp no yes (64 + 32-bit timestamp) + Max hardlinks 65536 4294967296 + Metadata reserved 8 bytes 18 bytes + ===================== ============ ====================================== + + - Support extended attributes as an option; + + - Support a bloom filter that speeds up negative extended attribute lookups; + + - Support POSIX.1e ACLs by using extended attributes; + + - Support transparent data compression as an option: + LZ4, MicroLZMA and DEFLATE algorithms can be used on a per-file basis; In + addition, inplace decompression is also supported to avoid bounce compressed + buffers and unnecessary page cache thrashing. + + - Support chunk-based data deduplication and rolling-hash compressed data + deduplication; + + - Support tailpacking inline compared to byte-addressed unaligned metadata + or smaller block size alternatives; + + - Support merging tail-end data into a special inode as fragments. + + - Support large folios for uncompressed files. + + - Support direct I/O on uncompressed files to avoid double caching for loop + devices; + + - Support FSDAX on uncompressed images for secure containers and ramdisks in + order to get rid of unnecessary page cache. + + - Support file-based on-demand loading with the Fscache infrastructure. + +The following git tree provides the file system user-space tools under +development, such as a formatting tool (mkfs.erofs), an on-disk consistency & +compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs): + +- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git + +Bugs and patches are welcome, please kindly help us and send to the following +linux-erofs mailing list: + +- linux-erofs mailing list + +Mount options +============= + +=================== ========================================================= +(no)user_xattr Setup Extended User Attributes. Note: xattr is enabled + by default if CONFIG_EROFS_FS_XATTR is selected. +(no)acl Setup POSIX Access Control List. Note: acl is enabled + by default if CONFIG_EROFS_FS_POSIX_ACL is selected. +cache_strategy=%s Select a strategy for cached decompression from now on: + + ========== ============================================= + disabled In-place I/O decompression only; + readahead Cache the last incomplete compressed physical + cluster for further reading. It still does + in-place I/O decompression for the rest + compressed physical clusters; + readaround Cache the both ends of incomplete compressed + physical clusters for further reading. + It still does in-place I/O decompression + for the rest compressed physical clusters. + ========== ============================================= +dax={always,never} Use direct access (no page cache). See + Documentation/filesystems/dax.rst. +dax A legacy option which is an alias for ``dax=always``. +device=%s Specify a path to an extra device to be used together. +fsid=%s Specify a filesystem image ID for Fscache back-end. +domain_id=%s Specify a domain ID in fscache mode so that different images + with the same blobs under a given domain ID can share storage. +=================== ========================================================= + +Sysfs Entries +============= + +Information about mounted erofs file systems can be found in /sys/fs/erofs. +Each mounted filesystem will have a directory in /sys/fs/erofs based on its +device name (i.e., /sys/fs/erofs/sda). +(see also Documentation/ABI/testing/sysfs-fs-erofs) + +On-disk details +=============== + +Summary +------- +Different from other read-only file systems, an EROFS volume is designed +to be as simple as possible:: + + |-> aligned with the block size + ____________________________________________________________ + | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data | + |_|__|_|_____|__________|_____|______|__________|_____|______| + 0 +1K + +All data areas should be aligned with the block size, but metadata areas +may not. All metadatas can be now observed in two different spaces (views): + + 1. Inode metadata space + + Each valid inode should be aligned with an inode slot, which is a fixed + value (32 bytes) and designed to be kept in line with compact inode size. + + Each inode can be directly found with the following formula: + inode offset = meta_blkaddr * block_size + 32 * nid + + :: + + |-> aligned with 8B + |-> followed closely + + meta_blkaddr blocks |-> another slot + _____________________________________________________________________ + | ... | inode | xattrs | extents | data inline | ... | inode ... + |________|_______|(optional)|(optional)|__(optional)_|_____|__________ + |-> aligned with the inode slot size + . . + . . + . . + . . + . . + . . + .____________________________________________________|-> aligned with 4B + | xattr_ibody_header | shared xattrs | inline xattrs | + |____________________|_______________|_______________| + |-> 12 bytes <-|->x * 4 bytes<-| . + . . . + . . . + . . . + ._______________________________.______________________. + | id | id | id | id | ... | id | ent | ... | ent| ... | + |____|____|____|____|______|____|_____|_____|____|_____| + |-> aligned with 4B + |-> aligned with 4B + + Inode could be 32 or 64 bytes, which can be distinguished from a common + field which all inode versions have -- i_format:: + + __________________ __________________ + | i_format | | i_format | + |__________________| |__________________| + | ... | | ... | + | | | | + |__________________| 32 bytes | | + | | + |__________________| 64 bytes + + Xattrs, extents, data inline are followed by the corresponding inode with + proper alignment, and they could be optional for different data mappings. + _currently_ total 5 data layouts are supported: + + == ==================================================================== + 0 flat file data without data inline (no extent); + 1 fixed-sized output data compression (with non-compacted indexes); + 2 flat file data with tail packing data inline (no extent); + 3 fixed-sized output data compression (with compacted indexes, v5.3+); + 4 chunk-based file (v5.15+). + == ==================================================================== + + The size of the optional xattrs is indicated by i_xattr_count in inode + header. Large xattrs or xattrs shared by many different files can be + stored in shared xattrs metadata rather than inlined right after inode. + + 2. Shared xattrs metadata space + + Shared xattrs space is similar to the above inode space, started with + a specific block indicated by xattr_blkaddr, organized one by one with + proper align. + + Each share xattr can also be directly found by the following formula: + xattr offset = xattr_blkaddr * block_size + 4 * xattr_id + +:: + + |-> aligned by 4 bytes + + xattr_blkaddr blocks |-> aligned with 4 bytes + _________________________________________________________________________ + | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ... + |________|_____________|_____________|_____|______________|_______________ + +Directories +----------- +All directories are now organized in a compact on-disk format. Note that +each directory block is divided into index and name areas in order to support +random file lookup, and all directory entries are _strictly_ recorded in +alphabetical order in order to support improved prefix binary search +algorithm (could refer to the related source code). + +:: + + ___________________________ + / | + / ______________|________________ + / / | nameoff1 | nameoffN-1 + ____________.______________._______________v________________v__________ + | dirent | dirent | ... | dirent | filename | filename | ... | filename | + |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____| + \ ^ + \ | * could have + \ | trailing '\0' + \________________________| nameoff0 + Directory block + +Note that apart from the offset of the first filename, nameoff0 also indicates +the total number of directory entries in this block since it is no need to +introduce another on-disk field at all. + +Chunk-based files +----------------- +In order to support chunk-based data deduplication, a new inode data layout has +been supported since Linux v5.15: Files are split in equal-sized data chunks +with ``extents`` area of the inode metadata indicating how to get the chunk +data: these can be simply as a 4-byte block address array or in the 8-byte +chunk index form (see struct erofs_inode_chunk_index in erofs_fs.h for more +details.) + +By the way, chunk-based files are all uncompressed for now. + +Long extended attribute name prefixes +------------------------------------- +There are use cases where extended attributes with different values can have +only a few common prefixes (such as overlayfs xattrs). The predefined prefixes +work inefficiently in both image size and runtime performance in such cases. + +The long xattr name prefixes feature is introduced to address this issue. The +overall idea is that, apart from the existing predefined prefixes, the xattr +entry could also refer to user-specified long xattr name prefixes, e.g. +"trusted.overlay.". + +When referring to a long xattr name prefix, the highest bit (bit 7) of +erofs_xattr_entry.e_name_index is set, while the lower bits (bit 0-6) as a whole +represent the index of the referred long name prefix among all long name +prefixes. Therefore, only the trailing part of the name apart from the long +xattr name prefix is stored in erofs_xattr_entry.e_name, which could be empty if +the full xattr name matches exactly as its long xattr name prefix. + +All long xattr prefixes are stored one by one in the packed inode as long as +the packed inode is valid, or in the meta inode otherwise. The +xattr_prefix_count (of the on-disk superblock) indicates the total number of +long xattr name prefixes, while (xattr_prefix_start * 4) indicates the start +offset of long name prefixes in the packed/meta inode. Note that, long extended +attribute name prefixes are disabled if xattr_prefix_count is 0. + +Each long name prefix is stored in the format: ALIGN({__le16 len, data}, 4), +where len represents the total size of the data part. The data part is actually +represented by 'struct erofs_xattr_long_prefix', where base_index represents the +index of the predefined xattr name prefix, e.g. EROFS_XATTR_INDEX_TRUSTED for +"trusted.overlay." long name prefix, while the infix string keeps the string +after stripping the short prefix, e.g. "overlay." for the example above. + +Data compression +---------------- +EROFS implements fixed-sized output compression which generates fixed-sized +compressed data blocks from variable-sized input in contrast to other existing +fixed-sized input solutions. Relatively higher compression ratios can be gotten +by using fixed-sized output compression since nowadays popular data compression +algorithms are mostly LZ77-based and such fixed-sized output approach can be +benefited from the historical dictionary (aka. sliding window). + +In details, original (uncompressed) data is turned into several variable-sized +extents and in the meanwhile, compressed into physical clusters (pclusters). +In order to record each variable-sized extent, logical clusters (lclusters) are +introduced as the basic unit of compress indexes to indicate whether a new +extent is generated within the range (HEAD) or not (NONHEAD). Lclusters are now +fixed in block size, as illustrated below:: + + |<- variable-sized extent ->|<- VLE ->| + clusterofs clusterofs clusterofs + | | | + _________v_________________________________v_______________________v________ + ... | . | | . | | . ... + ____|____._________|______________|________.___ _|______________|__.________ + |-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-| + (HEAD) (NONHEAD) (HEAD) (NONHEAD) . + . CBLKCNT . . + . . . + . . . + _______._____________________________.______________._________________ + ... | | | | ... + _______|______________|______________|______________|_________________ + |-> big pcluster <-|-> pcluster <-| + +A physical cluster can be seen as a container of physical compressed blocks +which contains compressed data. Previously, only lcluster-sized (4KB) pclusters +were supported. After big pcluster feature is introduced (available since +Linux v5.13), pcluster can be a multiple of lcluster size. + +For each HEAD lcluster, clusterofs is recorded to indicate where a new extent +starts and blkaddr is used to seek the compressed data. For each NONHEAD +lcluster, delta0 and delta1 are available instead of blkaddr to indicate the +distance to its HEAD lcluster and the next HEAD lcluster. A PLAIN lcluster is +also a HEAD lcluster except that its data is uncompressed. See the comments +around "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details. + +If big pcluster is enabled, pcluster size in lclusters needs to be recorded as +well. Let the delta0 of the first NONHEAD lcluster store the compressed block +count with a special flag as a new called CBLKCNT NONHEAD lcluster. It's easy +to understand its delta0 is constantly 1, as illustrated below:: + + __________________________________________________________ + | HEAD | NONHEAD | NONHEAD | ... | NONHEAD | HEAD | HEAD | + |__:___|_(CBLKCNT)_|_________|_____|_________|__:___|____:_| + |<----- a big pcluster (with CBLKCNT) ------>|<-- -->| + a lcluster-sized pcluster (without CBLKCNT) ^ + +If another HEAD follows a HEAD lcluster, there is no room to record CBLKCNT, +but it's easy to know the size of such pcluster is 1 lcluster as well. + +Since Linux v6.1, each pcluster can be used for multiple variable-sized extents, +therefore it can be used for compressed data deduplication. -- cgit v1.2.3