Diffstat (limited to '')
-rw-r--r-- | Documentation/filesystems/ext4/ondisk/directory.rst | 426 |
1 files changed, 426 insertions, 0 deletions
diff --git a/Documentation/filesystems/ext4/ondisk/directory.rst b/Documentation/filesystems/ext4/ondisk/directory.rst
new file mode 100644
index 000000000..8fcba68c2
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/directory.rst
@@ -0,0 +1,426 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Directory Entries
+-----------------
+
+In an ext4 filesystem, a directory is more or less a flat file that maps
+an arbitrary byte string (usually ASCII) to an inode number on the
+filesystem. There can be many directory entries across the filesystem
+that reference the same inode number--these are known as hard links, and
+that is why hard links cannot reference files on other filesystems. As
+such, directory entries are found by reading the data block(s)
+associated with a directory file for the particular directory entry that
+is desired.
+
+Linear (Classic) Directories
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, each directory lists its entries in an “almost-linear”
+array. I write “almost” because it's not a linear array in the memory
+sense, since directory entries are not split across filesystem blocks.
+Therefore, it is more accurate to say that a directory is a series of
+data blocks and that each block contains a linear array of directory
+entries. The end of each per-block array is signified by reaching the
+end of the block; the last entry in the block has a record length that
+takes it all the way to the end of the block. The end of the entire
+directory is of course signified by reaching the end of the file. Unused
+directory entries are signified by inode = 0. By default the filesystem
+uses ``struct ext4_dir_entry_2`` for directory entries unless the
+“filetype” feature flag is not set, in which case it uses
+``struct ext4_dir_entry``.
+
+The original directory entry format is ``struct ext4_dir_entry``, which
+is at most 263 bytes long, though on disk you'll need to reference
+``dirent.rec_len`` to know for sure.
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Size
+     - Name
+     - Description
+   * - 0x0
+     - \_\_le32
+     - inode
+     - Number of the inode that this directory entry points to.
+   * - 0x4
+     - \_\_le16
+     - rec\_len
+     - Length of this directory entry. Must be a multiple of 4.
+   * - 0x6
+     - \_\_le16
+     - name\_len
+     - Length of the file name.
+   * - 0x8
+     - char
+     - name[EXT4\_NAME\_LEN]
+     - File name.
+
+Since file names cannot be longer than 255 bytes, the new directory
+entry format shortens the name\_len field and uses the space for a file
+type flag, probably to avoid having to load every inode during directory
+tree traversal. This format is ``ext4_dir_entry_2``, which is at most
+263 bytes long, though on disk you'll need to reference
+``dirent.rec_len`` to know for sure.
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Size
+     - Name
+     - Description
+   * - 0x0
+     - \_\_le32
+     - inode
+     - Number of the inode that this directory entry points to.
+   * - 0x4
+     - \_\_le16
+     - rec\_len
+     - Length of this directory entry.
+   * - 0x6
+     - \_\_u8
+     - name\_len
+     - Length of the file name.
+   * - 0x7
+     - \_\_u8
+     - file\_type
+     - File type code, see ftype_ table below.
+   * - 0x8
+     - char
+     - name[EXT4\_NAME\_LEN]
+     - File name.
+
+.. _ftype:
+
+The directory file type is one of the following values:
+
+.. list-table::
+   :widths: 1 79
+   :header-rows: 1
+
+   * - Value
+     - Description
+   * - 0x0
+     - Unknown.
+   * - 0x1
+     - Regular file.
+   * - 0x2
+     - Directory.
+   * - 0x3
+     - Character device file.
+   * - 0x4
+     - Block device file.
+   * - 0x5
+     - FIFO.
+   * - 0x6
+     - Socket.
+   * - 0x7
+     - Symbolic link.
+
+In order to add checksums to these classic directory blocks, a phony
+``struct ext4_dir_entry`` is placed at the end of each leaf block to
+hold the checksum. The directory entry is 12 bytes long. The inode
+number and name\_len fields are set to zero to fool old software into
+ignoring an apparently empty directory entry, and the checksum is stored
+in the place where the name normally goes. The structure is
+``struct ext4_dir_entry_tail``:
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Size
+     - Name
+     - Description
+   * - 0x0
+     - \_\_le32
+     - det\_reserved\_zero1
+     - Inode number, which must be zero.
+   * - 0x4
+     - \_\_le16
+     - det\_rec\_len
+     - Length of this directory entry, which must be 12.
+   * - 0x6
+     - \_\_u8
+     - det\_reserved\_zero2
+     - Length of the file name, which must be zero.
+   * - 0x7
+     - \_\_u8
+     - det\_reserved\_ft
+     - File type, which must be 0xDE.
+   * - 0x8
+     - \_\_le32
+     - det\_checksum
+     - Directory leaf block checksum.
+
+The leaf directory block checksum is calculated against the FS UUID, the
+directory's inode number, the directory's inode generation number, and
+the entire directory entry block up to (but not including) the fake
+directory entry.
+
+Hash Tree Directories
+~~~~~~~~~~~~~~~~~~~~~
+
+A linear array of directory entries isn't great for performance, so a
+new feature was added to ext3 to provide a faster (but peculiar)
+balanced tree keyed off a hash of the directory entry name. If the
+EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a
+hashed btree (htree) to organize and find directory entries. For
+backwards read-only compatibility with ext2, this tree is actually
+hidden inside the directory file, masquerading as “empty” directory data
+blocks! It was stated previously that unused directory entries are
+signified by inode = 0; this convention is
+(ab)used to fool the old linear-scan algorithm into thinking that the
+rest of the directory block is empty so that it moves on.
+
+The root of the tree always lives in the first data block of the
+directory. By ext2 custom, the '.' and '..' entries must appear at the
+beginning of this first block, so they are put here as two
+``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of
+the root node contains metadata about the tree and finally a hash->block
+map to find nodes that are lower in the htree. If
+``dx_root.info.indirect_levels`` is non-zero then the htree has two
+levels; the data block pointed to by the root node's map is an interior
+node, which is indexed by a minor hash. Interior nodes in this tree
+contain a zeroed out ``struct ext4_dir_entry_2`` followed by a
+minor\_hash->block map to find leaf nodes. Leaf nodes contain a linear
+array of all ``struct ext4_dir_entry_2``; all of these entries
+(presumably) hash to the same value. If there is an overflow, the
+entries simply overflow into the next leaf node, and the
+least-significant bit of the hash (in the interior node map) that gets
+us to this next leaf node is set.
+
+To traverse the directory as an htree, the code calculates the hash of
+the desired file name and uses it to find the corresponding block
+number. If the tree is flat, the block is a linear array of directory
+entries that can be searched; otherwise, the minor hash of the file name
+is computed and used against this second block to find the corresponding
+third block number. That third block number will be a linear array of
+directory entries.
+
+To traverse the directory as a linear array (such as the old code does),
+the code simply reads every data block in the directory. The blocks used
+for the htree will appear to have no entries (aside from '.' and '..')
+and so only the leaf nodes will appear to have any interesting content.
+
+The root of the htree is in ``struct dx_root``, which is the full length
+of a data block:
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - \_\_le32
+     - dot.inode
+     - inode number of this directory.
+   * - 0x4
+     - \_\_le16
+     - dot.rec\_len
+     - Length of this record, 12.
+   * - 0x6
+     - u8
+     - dot.name\_len
+     - Length of the name, 1.
+   * - 0x7
+     - u8
+     - dot.file\_type
+     - File type of this entry, 0x2 (directory) (if the feature flag is set).
+   * - 0x8
+     - char
+     - dot.name[4]
+     - “.\\0\\0\\0”
+   * - 0xC
+     - \_\_le32
+     - dotdot.inode
+     - inode number of parent directory.
+   * - 0x10
+     - \_\_le16
+     - dotdot.rec\_len
+     - block\_size - 12. The record length is long enough to cover all htree
+       data.
+   * - 0x12
+     - u8
+     - dotdot.name\_len
+     - Length of the name, 2.
+   * - 0x13
+     - u8
+     - dotdot.file\_type
+     - File type of this entry, 0x2 (directory) (if the feature flag is set).
+   * - 0x14
+     - char
+     - dotdot.name[4]
+     - “..\\0\\0”
+   * - 0x18
+     - \_\_le32
+     - struct dx\_root\_info.reserved\_zero
+     - Zero.
+   * - 0x1C
+     - u8
+     - struct dx\_root\_info.hash\_version
+     - Hash type, see dirhash_ table below.
+   * - 0x1D
+     - u8
+     - struct dx\_root\_info.info\_length
+     - Length of the tree information, 0x8.
+   * - 0x1E
+     - u8
+     - struct dx\_root\_info.indirect\_levels
+     - Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR
+       feature is set; cannot be larger than 2 otherwise.
+   * - 0x1F
+     - u8
+     - struct dx\_root\_info.unused\_flags
+     -
+   * - 0x20
+     - \_\_le16
+     - limit
+     - Maximum number of dx\_entries that can follow this header, plus 1 for
+       the header itself.
+   * - 0x22
+     - \_\_le16
+     - count
+     - Actual number of dx\_entries that follow this header, plus 1 for the
+       header itself.
+   * - 0x24
+     - \_\_le32
+     - block
+     - The block number (within the directory file) that goes with hash=0.
+   * - 0x28
+     - struct dx\_entry
+     - entries[0]
+     - As many 8-byte ``struct dx_entry`` as fit in the rest of the data block.
+
+.. _dirhash:
+
+The directory hash is one of the following values:
+
+.. list-table::
+   :widths: 1 79
+   :header-rows: 1
+
+   * - Value
+     - Description
+   * - 0x0
+     - Legacy.
+   * - 0x1
+     - Half MD4.
+   * - 0x2
+     - Tea.
+   * - 0x3
+     - Legacy, unsigned.
+   * - 0x4
+     - Half MD4, unsigned.
+   * - 0x5
+     - Tea, unsigned.
+
+Interior nodes of an htree are recorded as ``struct dx_node``, which is
+also the full length of a data block:
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - \_\_le32
+     - fake.inode
+     - Zero, to make it look like this entry is not in use.
+   * - 0x4
+     - \_\_le16
+     - fake.rec\_len
+     - The size of the block, in order to hide all of the dx\_node data.
+   * - 0x6
+     - u8
+     - name\_len
+     - Zero. There is no name for this “unused” directory entry.
+   * - 0x7
+     - u8
+     - file\_type
+     - Zero. There is no file type for this “unused” directory entry.
+   * - 0x8
+     - \_\_le16
+     - limit
+     - Maximum number of dx\_entries that can follow this header, plus 1 for
+       the header itself.
+   * - 0xA
+     - \_\_le16
+     - count
+     - Actual number of dx\_entries that follow this header, plus 1 for the
+       header itself.
+   * - 0xC
+     - \_\_le32
+     - block
+     - The block number (within the directory file) that goes with the lowest
+       hash value of this block. This value is stored in the parent block.
+   * - 0x10
+     - struct dx\_entry
+     - entries[0]
+     - As many 8-byte ``struct dx_entry`` as fit in the rest of the data block.
+
+The hash maps that exist in both ``struct dx_root`` and
+``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes
+long:
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - \_\_le32
+     - hash
+     - Hash code.
+   * - 0x4
+     - \_\_le32
+     - block
+     - Block number (within the directory file, not filesystem blocks) of the
+       next node in the htree.
+
+(If you think this is all quite clever and peculiar, so does the
+author.)
+
+If metadata checksums are enabled, the last 8 bytes of the directory
+block (precisely the length of one dx\_entry) are used to store a
+``struct dx_tail``, which contains the checksum. The ``limit`` and
+``count`` entries in the dx\_root/dx\_node structures are adjusted as
+necessary to fit the dx\_tail into the block. If there is no space for
+the dx\_tail, the user is notified to run e2fsck -D to rebuild the
+directory index (which will ensure that there's space for the checksum).
+The dx\_tail structure is 8 bytes long and looks like this:
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - u32
+     - dt\_reserved
+     - Zero.
+   * - 0x4
+     - \_\_le32
+     - dt\_checksum
+     - Checksum of the htree directory block.
+
+The checksum is calculated against the FS UUID, the htree index header
+(dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in
+use, and the tail block (dx\_tail).
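
As an illustration of the linear leaf-block layout described in the patch, the following is a minimal Python sketch, not kernel code. It walks one directory block of ``struct ext4_dir_entry_2`` records, skips entries with inode = 0, and reads the checksum out of a trailing ``struct ext4_dir_entry_tail`` if one appears to be present. The function names ``walk_dir_block``, ``tail_checksum`` and ``pack_entry`` are hypothetical helpers invented for this example. The sketch assumes that the "filetype" feature is enabled and that ``rec_len`` is stored literally, which may not hold for very large block sizes. It does not verify the checksum.

```python
import struct

# inode (__le32), rec_len (__le16), name_len (__u8), file_type (__u8)
DIRENT_HEADER = struct.Struct("<IHBB")
# det_reserved_zero1, det_rec_len, det_reserved_zero2, det_reserved_ft, det_checksum
DIRENT_TAIL = struct.Struct("<IHBBI")
TAIL_FILE_TYPE = 0xDE  # value required in det_reserved_ft


def walk_dir_block(block):
    """Yield (inode, file_type, name) for every in-use entry of one block."""
    offset = 0
    while offset + DIRENT_HEADER.size <= len(block):
        inode, rec_len, name_len, file_type = DIRENT_HEADER.unpack_from(block, offset)
        # Assumption: rec_len is a literal byte count and a multiple of 4.
        if rec_len < DIRENT_HEADER.size or rec_len % 4 or offset + rec_len > len(block):
            raise ValueError(f"bad rec_len {rec_len} at offset {offset}")
        if name_len > rec_len - DIRENT_HEADER.size:
            raise ValueError(f"name_len {name_len} overruns entry at offset {offset}")
        if inode != 0:  # inode = 0 marks an unused entry (this includes the tail)
            start = offset + DIRENT_HEADER.size
            yield inode, file_type, bytes(block[start:start + name_len])
        offset += rec_len


def tail_checksum(block):
    """Return det_checksum if the last 12 bytes look like a checksum tail, else None."""
    if len(block) < DIRENT_TAIL.size:
        return None
    inode, rec_len, name_len, file_type, checksum = DIRENT_TAIL.unpack_from(
        block, len(block) - DIRENT_TAIL.size)
    if inode == 0 and rec_len == 12 and name_len == 0 and file_type == TAIL_FILE_TYPE:
        return checksum
    return None


def pack_entry(inode, rec_len, file_type, name):
    """Hypothetical helper: build one on-disk entry, zero-padded to rec_len."""
    header = DIRENT_HEADER.pack(inode, rec_len, len(name), file_type)
    return header + name.ljust(rec_len - DIRENT_HEADER.size, b"\0")
```

A block assembled with ``pack_entry`` for '.', '..', one unused entry, one regular file whose ``rec_len`` runs up to the tail, and a 12-byte tail can be passed to ``walk_dir_block`` to list the three live names. The same loop applies to htree leaf blocks, since they hold the same linear array of entries.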