summaryrefslogtreecommitdiffstats
path: root/Documentation/filesystems/ext4/journal.rst
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-11 08:27:49 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-11 08:27:49 +0000
commitace9429bb58fd418f0c81d4c2835699bddf6bde6 (patch)
treeb2d64bc10158fdd5497876388cd68142ca374ed3 /Documentation/filesystems/ext4/journal.rst
parentInitial commit. (diff)
downloadlinux-ace9429bb58fd418f0c81d4c2835699bddf6bde6.tar.xz
linux-ace9429bb58fd418f0c81d4c2835699bddf6bde6.zip
Adding upstream version 6.6.15.upstream/6.6.15
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'Documentation/filesystems/ext4/journal.rst')
-rw-r--r--Documentation/filesystems/ext4/journal.rst761
1 files changed, 761 insertions, 0 deletions
diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
new file mode 100644
index 000000000..6e8fb2d4b
--- /dev/null
+++ b/Documentation/filesystems/ext4/journal.rst
@@ -0,0 +1,761 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Journal (jbd2)
+--------------
+
+Introduced in ext3, the ext4 filesystem employs a journal to protect the
+filesystem against metadata inconsistencies in the case of a system crash. Up
+to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal
+size limits) can be reserved inside the filesystem as a place to land
+“important” data writes on-disk as quickly as possible. Once the important
+data transaction is fully written to the disk and flushed from the disk write
+cache, a record of the data being committed is also written to the journal. At
+some later point in time, the journal code writes the transactions to their
+final locations on disk (this could involve a lot of seeking or a lot of small
+read-write-erases) before erasing the commit record. Should the system
+crash during the second slow write, the journal can be replayed all the
+way to the latest commit record, guaranteeing the atomicity of whatever
+gets written through the journal to the disk. The effect of this is to
+guarantee that the filesystem does not become stuck midway through a
+metadata update.
+
+For performance reasons, ext4 by default only writes filesystem metadata
+through the journal. This means that file data blocks are /not/
+guaranteed to be in any consistent state after a crash. If this default
+guarantee level (``data=ordered``) is not satisfactory, there is a mount
+option to control journal behavior. If ``data=journal``, all data and
+metadata are written to disk through the journal. This is slower but
+safest. If ``data=writeback``, dirty data blocks are not flushed to the
+disk before the metadata are written to disk through the journal.
+
+In case of ``data=ordered`` mode, Ext4 also supports fast commits which
+help reduce commit latency significantly. The default ``data=ordered``
+mode works by logging metadata blocks to the journal. In fast commit
+mode, Ext4 only stores the minimal delta needed to recreate the
+affected metadata in fast commit space that is shared with JBD2.
+Once the fast commit area fills in or if fast commit is not possible
+or if JBD2 commit timer goes off, Ext4 performs a traditional full commit.
+A full commit invalidates all the fast commits that happened before
+it and thus it makes the fast commit area empty for further fast
+commits. This feature needs to be enabled at mkfs time.
+
+The journal inode is typically inode 8. The first 68 bytes of the
+journal inode are replicated in the ext4 superblock. The journal itself
+is normal (but hidden) file within the filesystem. The file usually
+consumes an entire block group, though mke2fs tries to put it in the
+middle of the disk.
+
+All fields in jbd2 are written to disk in big-endian order. This is the
+opposite of ext4.
+
+NOTE: Both ext4 and ocfs2 use jbd2.
+
+The maximum size of a journal embedded in an ext4 filesystem is 2^32
+blocks. jbd2 itself does not seem to care.
+
+Layout
+~~~~~~
+
+Generally speaking, the journal has this format:
+
+.. list-table::
+ :widths: 16 48 16
+ :header-rows: 1
+
+ * - Superblock
+ - descriptor_block (data_blocks or revocation_block) [more data or
+ revocations] commmit_block
+ - [more transactions...]
+ * -
+ - One transaction
+ -
+
+Notice that a transaction begins with either a descriptor and some data,
+or a block revocation list. A finished transaction always ends with a
+commit. If there is no commit record (or the checksums don't match), the
+transaction will be discarded during replay.
+
+External Journal
+~~~~~~~~~~~~~~~~
+
+Optionally, an ext4 filesystem can be created with an external journal
+device (as opposed to an internal journal, which uses a reserved inode).
+In this case, on the filesystem device, ``s_journal_inum`` should be
+zero and ``s_journal_uuid`` should be set. On the journal device there
+will be an ext4 super block in the usual place, with a matching UUID.
+The journal superblock will be in the next full block after the
+superblock.
+
+.. list-table::
+ :widths: 12 12 12 32 12
+ :header-rows: 1
+
+ * - 1024 bytes of padding
+ - ext4 Superblock
+ - Journal Superblock
+ - descriptor_block (data_blocks or revocation_block) [more data or
+ revocations] commmit_block
+ - [more transactions...]
+ * -
+ -
+ -
+ - One transaction
+ -
+
+Block Header
+~~~~~~~~~~~~
+
+Every block in the journal starts with a common 12-byte header
+``struct journal_header_s``:
+
+.. list-table::
+ :widths: 8 8 24 40
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Description
+ * - 0x0
+ - __be32
+ - h_magic
+ - jbd2 magic number, 0xC03B3998.
+ * - 0x4
+ - __be32
+ - h_blocktype
+ - Description of what this block contains. See the jbd2_blocktype_ table
+ below.
+ * - 0x8
+ - __be32
+ - h_sequence
+ - The transaction ID that goes with this block.
+
+.. _jbd2_blocktype:
+
+The journal block type can be any one of:
+
+.. list-table::
+ :widths: 16 64
+ :header-rows: 1
+
+ * - Value
+ - Description
+ * - 1
+ - Descriptor. This block precedes a series of data blocks that were
+ written through the journal during a transaction.
+ * - 2
+ - Block commit record. This block signifies the completion of a
+ transaction.
+ * - 3
+ - Journal superblock, v1.
+ * - 4
+ - Journal superblock, v2.
+ * - 5
+ - Block revocation records. This speeds up recovery by enabling the
+ journal to skip writing blocks that were subsequently rewritten.
+
+Super Block
+~~~~~~~~~~~
+
+The super block for the journal is much simpler as compared to ext4's.
+The key data kept within are size of the journal, and where to find the
+start of the log of transactions.
+
+The journal superblock is recorded as ``struct journal_superblock_s``,
+which is 1024 bytes long:
+
+.. list-table::
+ :widths: 8 8 24 40
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Description
+ * -
+ -
+ -
+ - Static information describing the journal.
+ * - 0x0
+ - journal_header_t (12 bytes)
+ - s_header
+ - Common header identifying this as a superblock.
+ * - 0xC
+ - __be32
+ - s_blocksize
+ - Journal device block size.
+ * - 0x10
+ - __be32
+ - s_maxlen
+ - Total number of blocks in this journal.
+ * - 0x14
+ - __be32
+ - s_first
+ - First block of log information.
+ * -
+ -
+ -
+ - Dynamic information describing the current state of the log.
+ * - 0x18
+ - __be32
+ - s_sequence
+ - First commit ID expected in log.
+ * - 0x1C
+ - __be32
+ - s_start
+ - Block number of the start of log. Contrary to the comments, this field
+ being zero does not imply that the journal is clean!
+ * - 0x20
+ - __be32
+ - s_errno
+ - Error value, as set by jbd2_journal_abort().
+ * -
+ -
+ -
+ - The remaining fields are only valid in a v2 superblock.
+ * - 0x24
+ - __be32
+ - s_feature_compat;
+ - Compatible feature set. See the table jbd2_compat_ below.
+ * - 0x28
+ - __be32
+ - s_feature_incompat
+ - Incompatible feature set. See the table jbd2_incompat_ below.
+ * - 0x2C
+ - __be32
+ - s_feature_ro_compat
+ - Read-only compatible feature set. There aren't any of these currently.
+ * - 0x30
+ - __u8
+ - s_uuid[16]
+ - 128-bit uuid for journal. This is compared against the copy in the ext4
+ super block at mount time.
+ * - 0x40
+ - __be32
+ - s_nr_users
+ - Number of file systems sharing this journal.
+ * - 0x44
+ - __be32
+ - s_dynsuper
+ - Location of dynamic super block copy. (Not used?)
+ * - 0x48
+ - __be32
+ - s_max_transaction
+ - Limit of journal blocks per transaction. (Not used?)
+ * - 0x4C
+ - __be32
+ - s_max_trans_data
+ - Limit of data blocks per transaction. (Not used?)
+ * - 0x50
+ - __u8
+ - s_checksum_type
+ - Checksum algorithm used for the journal. See jbd2_checksum_type_ for
+ more info.
+ * - 0x51
+ - __u8[3]
+ - s_padding2
+ -
+ * - 0x54
+ - __be32
+ - s_num_fc_blocks
+ - Number of fast commit blocks in the journal.
+ * - 0x58
+ - __be32
+ - s_head
+ - Block number of the head (first unused block) of the journal, only
+ up-to-date when the journal is empty.
+ * - 0x5C
+ - __u32
+ - s_padding[40]
+ -
+ * - 0xFC
+ - __be32
+ - s_checksum
+ - Checksum of the entire superblock, with this field set to zero.
+ * - 0x100
+ - __u8
+ - s_users[16*48]
+ - ids of all file systems sharing the log. e2fsprogs/Linux don't allow
+ shared external journals, but I imagine Lustre (or ocfs2?), which use
+ the jbd2 code, might.
+
+.. _jbd2_compat:
+
+The journal compat features are any combination of the following:
+
+.. list-table::
+ :widths: 16 64
+ :header-rows: 1
+
+ * - Value
+ - Description
+ * - 0x1
+ - Journal maintains checksums on the data blocks.
+ (JBD2_FEATURE_COMPAT_CHECKSUM)
+
+.. _jbd2_incompat:
+
+The journal incompat features are any combination of the following:
+
+.. list-table::
+ :widths: 16 64
+ :header-rows: 1
+
+ * - Value
+ - Description
+ * - 0x1
+ - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
+ * - 0x2
+ - Journal can deal with 64-bit block numbers.
+ (JBD2_FEATURE_INCOMPAT_64BIT)
+ * - 0x4
+ - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
+ * - 0x8
+ - This journal uses v2 of the checksum on-disk format. Each journal
+ metadata block gets its own checksum, and the block tags in the
+ descriptor table contain checksums for each of the data blocks in the
+ journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2)
+ * - 0x10
+ - This journal uses v3 of the checksum on-disk format. This is the same as
+ v2, but the journal block tag size is fixed regardless of the size of
+ block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
+ * - 0x20
+ - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
+
+.. _jbd2_checksum_type:
+
+Journal checksum type codes are one of the following. crc32 or crc32c are the
+most likely choices.
+
+.. list-table::
+ :widths: 16 64
+ :header-rows: 1
+
+ * - Value
+ - Description
+ * - 1
+ - CRC32
+ * - 2
+ - MD5
+ * - 3
+ - SHA1
+ * - 4
+ - CRC32C
+
+Descriptor Block
+~~~~~~~~~~~~~~~~
+
+The descriptor block contains an array of journal block tags that
+describe the final locations of the data blocks that follow in the
+journal. Descriptor blocks are open-coded instead of being completely
+described by a data structure, but here is the block structure anyway.
+Descriptor blocks consume at least 36 bytes, but use a full block:
+
+.. list-table::
+ :widths: 8 8 24 40
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Descriptor
+ * - 0x0
+ - journal_header_t
+ - (open coded)
+ - Common block header.
+ * - 0xC
+ - struct journal_block_tag_s
+ - open coded array[]
+ - Enough tags either to fill up the block or to describe all the data
+ blocks that follow this descriptor block.
+
+Journal block tags have any of the following formats, depending on which
+journal feature and block tag flags are set.
+
+If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
+defined as ``struct journal_block_tag3_s``, which looks like the
+following. The size is 16 or 32 bytes.
+
+.. list-table::
+ :widths: 8 8 24 40
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Descriptor
+ * - 0x0
+ - __be32
+ - t_blocknr
+ - Lower 32-bits of the location of where the corresponding data block
+ should end up on disk.
+ * - 0x4
+ - __be32
+ - t_flags
+ - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
+ more info.
+ * - 0x8
+ - __be32
+ - t_blocknr_high
+ - Upper 32-bits of the location of where the corresponding data block
+ should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
+ not enabled.
+ * - 0xC
+ - __be32
+ - t_checksum
+ - Checksum of the journal UUID, the sequence number, and the data block.
+ * -
+ -
+ -
+ - This field appears to be open coded. It always comes at the end of the
+ tag, after t_checksum. This field is not present if the "same UUID" flag
+ is set.
+ * - 0x8 or 0xC
+ - char
+ - uuid[16]
+ - A UUID to go with this tag. This field appears to be copied from the
+ ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
+ field.
+
+.. _jbd2_tag_flags:
+
+The journal tag flags are any combination of the following:
+
+.. list-table::
+ :widths: 16 64
+ :header-rows: 1
+
+ * - Value
+ - Description
+ * - 0x1
+ - On-disk block is escaped. The first four bytes of the data block just
+ happened to match the jbd2 magic number.
+ * - 0x2
+ - This block has the same UUID as previous, therefore the UUID field is
+ omitted.
+ * - 0x4
+ - The data block was deleted by the transaction. (Not used?)
+ * - 0x8
+ - This is the last tag in this descriptor block.
+
+If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
+is defined as ``struct journal_block_tag_s``, which looks like the
+following. The size is 8, 12, 24, or 28 bytes:
+
+.. list-table::
+ :widths: 8 8 24 40
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Descriptor
+ * - 0x0
+ - __be32
+ - t_blocknr
+ - Lower 32-bits of the location of where the corresponding data block
+ should end up on disk.
+ * - 0x4
+ - __be16
+ - t_checksum
+ - Checksum of the journal UUID, the sequence number, and the data block.
+ Note that only the lower 16 bits are stored.
+ * - 0x6
+ - __be16
+ - t_flags
+ - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
+ more info.
+ * -
+ -
+ -
+ - This next field is only present if the super block indicates support for
+ 64-bit block numbers.
+ * - 0x8
+ - __be32
+ - t_blocknr_high
+ - Upper 32-bits of the location of where the corresponding data block
+ should end up on disk.
+ * -
+ -
+ -
+ - This field appears to be open coded. It always comes at the end of the
+ tag, after t_flags or t_blocknr_high. This field is not present if the
+ "same UUID" flag is set.
+ * - 0x8 or 0xC
+ - char
+ - uuid[16]
+ - A UUID to go with this tag. This field appears to be copied from the
+ ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
+ field.
+
+If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
+JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the block is a
+``struct jbd2_journal_block_tail``, which looks like this:
+
+.. list-table::
+ :widths: 8 8 24 40
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Descriptor
+ * - 0x0
+ - __be32
+ - t_checksum
+ - Checksum of the journal UUID + the descriptor block, with this field set
+ to zero.
+
+Data Block
+~~~~~~~~~~
+
+In general, the data blocks being written to disk through the journal
+are written verbatim into the journal file after the descriptor block.
+However, if the first four bytes of the block match the jbd2 magic
+number then those four bytes are replaced with zeroes and the “escaped”
+flag is set in the descriptor block tag.
+
+Revocation Block
+~~~~~~~~~~~~~~~~
+
+A revocation block is used to prevent replay of a block in an earlier
+transaction. This is used to mark blocks that were journalled at one
+time but are no longer journalled. Typically this happens if a metadata
+block is freed and re-allocated as a file data block; in this case, a
+journal replay after the file block was written to disk will cause
+corruption.
+
+**NOTE**: This mechanism is NOT used to express “this journal block is
+superseded by this other journal block”, as the author (djwong)
+mistakenly thought. Any block being added to a transaction will cause
+the removal of all existing revocation records for that block.
+
+Revocation blocks are described in
+``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
+length, but use a full block:
+
+.. list-table::
+ :widths: 8 8 24 40
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Description
+ * - 0x0
+ - journal_header_t
+ - r_header
+ - Common block header.
+ * - 0xC
+ - __be32
+ - r_count
+ - Number of bytes used in this block.
+ * - 0x10
+ - __be32 or __be64
+ - blocks[0]
+ - Blocks to revoke.
+
+After r_count is a linear array of block numbers that are effectively
+revoked by this transaction. The size of each block number is 8 bytes if
+the superblock advertises 64-bit block number support, or 4 bytes
+otherwise.
+
+If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
+JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the revocation
+block is a ``struct jbd2_journal_revoke_tail``, which has this format:
+
+.. list-table::
+ :widths: 8 8 24 40
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Description
+ * - 0x0
+ - __be32
+ - r_checksum
+ - Checksum of the journal UUID + revocation block
+
+Commit Block
+~~~~~~~~~~~~
+
+The commit block is a sentry that indicates that a transaction has been
+completely written to the journal. Once this commit block reaches the
+journal, the data stored with this transaction can be written to their
+final locations on disk.
+
+The commit block is described by ``struct commit_header``, which is 32
+bytes long (but uses a full block):
+
+.. list-table::
+ :widths: 8 8 24 40
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Descriptor
+ * - 0x0
+ - journal_header_s
+ - (open coded)
+ - Common block header.
+ * - 0xC
+ - unsigned char
+ - h_chksum_type
+ - The type of checksum to use to verify the integrity of the data blocks
+ in the transaction. See jbd2_checksum_type_ for more info.
+ * - 0xD
+ - unsigned char
+ - h_chksum_size
+ - The number of bytes used by the checksum. Most likely 4.
+ * - 0xE
+ - unsigned char
+ - h_padding[2]
+ -
+ * - 0x10
+ - __be32
+ - h_chksum[JBD2_CHECKSUM_BYTES]
+ - 32 bytes of space to store checksums. If
+ JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3
+ are set, the first ``__be32`` is the checksum of the journal UUID and
+ the entire commit block, with this field zeroed. If
+ JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the
+ crc32 of all the blocks already written to the transaction.
+ * - 0x30
+ - __be64
+ - h_commit_sec
+ - The time that the transaction was committed, in seconds since the epoch.
+ * - 0x38
+ - __be32
+ - h_commit_nsec
+ - Nanoseconds component of the above timestamp.
+
+Fast commits
+~~~~~~~~~~~~
+
+Fast commit area is organized as a log of tag length values. Each TLV has
+a ``struct ext4_fc_tl`` in the beginning which stores the tag and the length
+of the entire field. It is followed by variable length tag specific value.
+Here is the list of supported tags and their meanings:
+
+.. list-table::
+ :widths: 8 20 20 32
+ :header-rows: 1
+
+ * - Tag
+ - Meaning
+ - Value struct
+ - Description
+ * - EXT4_FC_TAG_HEAD
+ - Fast commit area header
+ - ``struct ext4_fc_head``
+ - Stores the TID of the transaction after which these fast commits should
+ be applied.
+ * - EXT4_FC_TAG_ADD_RANGE
+ - Add extent to inode
+ - ``struct ext4_fc_add_range``
+ - Stores the inode number and extent to be added in this inode
+ * - EXT4_FC_TAG_DEL_RANGE
+ - Remove logical offsets to inode
+ - ``struct ext4_fc_del_range``
+ - Stores the inode number and the logical offset range that needs to be
+ removed
+ * - EXT4_FC_TAG_CREAT
+ - Create directory entry for a newly created file
+ - ``struct ext4_fc_dentry_info``
+ - Stores the parent inode number, inode number and directory entry of the
+ newly created file
+ * - EXT4_FC_TAG_LINK
+ - Link a directory entry to an inode
+ - ``struct ext4_fc_dentry_info``
+ - Stores the parent inode number, inode number and directory entry
+ * - EXT4_FC_TAG_UNLINK
+ - Unlink a directory entry of an inode
+ - ``struct ext4_fc_dentry_info``
+ - Stores the parent inode number, inode number and directory entry
+
+ * - EXT4_FC_TAG_PAD
+ - Padding (unused area)
+ - None
+ - Unused bytes in the fast commit area.
+
+ * - EXT4_FC_TAG_TAIL
+ - Mark the end of a fast commit
+ - ``struct ext4_fc_tail``
+ - Stores the TID of the commit, CRC of the fast commit of which this tag
+ represents the end of
+
+Fast Commit Replay Idempotence
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Fast commits tags are idempotent in nature provided the recovery code follows
+certain rules. The guiding principle that the commit path follows while
+committing is that it stores the result of a particular operation instead of
+storing the procedure.
+
+Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
+was associated with inode 10. During fast commit, instead of storing this
+operation as a procedure "rename a to b", we store the resulting file system
+state as a "series" of outcomes:
+
+- Link dirent b to inode 10
+- Unlink dirent a
+- Inode 10 with valid refcount
+
+Now when recovery code runs, it needs "enforce" this state on the file
+system. This is what guarantees idempotence of fast commit replay.
+
+Let's take an example of a procedure that is not idempotent and see how fast
+commits make it idempotent. Consider following sequence of operations:
+
+1) rm A
+2) mv B A
+3) read A
+
+If we store this sequence of operations as is then the replay is not idempotent.
+Let's say while in replay, we crash after (2). During the second replay,
+file A (which was actually created as a result of "mv B A" operation) would get
+deleted. Thus, file named A would be absent when we try to read A. So, this
+sequence of operations is not idempotent. However, as mentioned above, instead
+of storing the procedure fast commits store the outcome of each procedure. Thus
+the fast commit log for above procedure would be as follows:
+
+(Let's assume dirent A was linked to inode 10 and dirent B was linked to
+inode 11 before the replay)
+
+1) Unlink A
+2) Link A to inode 11
+3) Unlink B
+4) Inode 11
+
+If we crash after (3) we will have file A linked to inode 11. During the second
+replay, we will remove file A (inode 11). But we will create it back and make
+it point to inode 11. We won't find B, so we'll just skip that step. At this
+point, the refcount for inode 11 is not reliable, but that gets fixed by the
+replay of last inode 11 tag. Thus, by converting a non-idempotent procedure
+into a series of idempotent outcomes, fast commits ensured idempotence during
+the replay.
+
+Journal Checkpoint
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Checkpointing the journal ensures all transactions and their associated buffers
+are submitted to the disk. In-progress transactions are waited upon and included
+in the checkpoint. Checkpointing is used internally during critical updates to
+the filesystem including journal recovery, filesystem resizing, and freeing of
+the journal_t structure.
+
+A journal checkpoint can be triggered from userspace via the ioctl
+EXT4_IOC_CHECKPOINT. This ioctl takes a single, u64 argument for flags.
+Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
+can be used to verify input to the ioctl. It returns error if there is any
+invalid input, otherwise it returns success without performing
+any checkpointing. This can be used to check whether the ioctl exists on a
+system and to verify there are no issues with arguments or flags. The
+other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
+EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
+discarded or zero-filled, respectively, after the journal checkpoint is
+complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
+cannot both be set. The ioctl may be useful when snapshotting a system or for
+complying with content deletion SLOs.