Diffstat (limited to 'ext2ed/doc/ext2fs-overview.sgml')
-rw-r--r-- | ext2ed/doc/ext2fs-overview.sgml | 1569 |
1 files changed, 1569 insertions, 0 deletions
diff --git a/ext2ed/doc/ext2fs-overview.sgml b/ext2ed/doc/ext2fs-overview.sgml new file mode 100644 index 0000000..0d54f07 --- /dev/null +++ b/ext2ed/doc/ext2fs-overview.sgml @@ -0,0 +1,1569 @@ +<!DOCTYPE Article PUBLIC "-//Davenport//DTD DocBook V3.0//EN"> + +<Article> + +<ArtHeader> + +<Title>The extended-2 filesystem overview</Title> +<AUTHOR +> +<FirstName>Gadi Oxman, tgud@tochnapc2.technion.ac.il</FirstName> +</AUTHOR +> +<PubDate>v0.1, August 3 1995</PubDate> + +</ArtHeader> + +<Sect1> +<Title>Preface</Title> + +<Para> +This document attempts to present an overview of the internal structure of +the ext2 filesystem. It was written in summer 95, while I was working on the +<Literal remap="tt">ext2 filesystem editor project (EXT2ED)</Literal>. +</Para> + +<Para> +In the process of constructing EXT2ED, I acquired knowledge of the various +design aspects of the ext2 filesystem. This document is a result of an +effort to document this knowledge. +</Para> + +<Para> +This is only the initial version of this document. It is obviously neither +error-free nor complete, but at least it provides a starting point. +</Para> + +<Para> +In the process of learning the subject, I have used the following sources / +tools: + +<ItemizedList> +<ListItem> + +<Para> + Experimenting with EXT2ED, as it was developed. +</Para> +</ListItem> +<ListItem> + +<Para> + The ext2 kernel sources: + +<ItemizedList> +<ListItem> + +<Para> + The main ext2 include file, +<FILENAME>/usr/include/linux/ext2_fs.h</FILENAME> +</Para> +</ListItem> +<ListItem> + +<Para> + The contents of the directory <FILENAME>/usr/src/linux/fs/ext2</FILENAME>. +</Para> +</ListItem> +<ListItem> + +<Para> + The VFS layer sources (only a bit). +</Para> +</ListItem> + +</ItemizedList> + +</Para> +</ListItem> +<ListItem> + +<Para> + The slides: The Second Extended File System, Current State, Future +Development, by <personname><firstname>Remy</firstname> <surname>Card</surname></personname>.
+</Para> +</ListItem> +<ListItem> + +<Para> + The slides: Optimisation in File Systems, by <personname><firstname>Stephen</firstname> <surname>Tweedie</surname></personname>. +</Para> +</ListItem> +<ListItem> + +<Para> + The various ext2 utilities. +</Para> +</ListItem> + +</ItemizedList> + +</Para> + +</Sect1> + +<Sect1> +<Title>Introduction</Title> + +<Para> +The <Literal remap="tt">Second Extended File System (Ext2fs)</Literal> is very popular among Linux +users. If you use Linux, chances are that you are using the ext2 filesystem. +</Para> + +<Para> +Ext2fs was designed by <personname><firstname>Remy</firstname> <surname>Card</surname></personname> and <personname><firstname>Wayne</firstname> <surname>Davison</surname></personname>. It was +implemented by <personname><firstname>Remy</firstname> <surname>Card</surname></personname> and was further enhanced by <personname><firstname>Stephen</firstname> +<surname>Tweedie</surname></personname> and <personname><firstname>Theodore</firstname> <surname>Ts'o</surname></personname>. +</Para> + +<Para> +The ext2 filesystem is still under development. I will document here +version 0.5a, which is distributed along with Linux 1.2.x. At the time of +writing, the most recent version of Linux is 1.3.13, and the version of the +ext2 kernel source is 0.5b. A lot of fancy enhancements are planned for the +ext2 filesystem in Linux 1.3, so stay tuned. +</Para> + +</Sect1> + +<Sect1> +<Title>A filesystem - Why do we need it?</Title> + +<Para> +I thought that before we dive into the various small details, I'll reserve a +few minutes for the discussion of filesystems from a general point of view. +</Para> + +<Para> +A <Literal remap="tt">filesystem</Literal> consists of two words - <Literal remap="tt">file</Literal> and <Literal remap="tt">system</Literal>. +</Para> + +<Para> +Everyone knows the meaning of the word <Literal remap="tt">file</Literal> - A bunch of data put +somewhere. Where? This is an important question.
I, for example, usually +throw almost everything into a single drawer, and have difficulties finding +something later. +</Para> + +<Para> +This is where the <Literal remap="tt">system</Literal> comes in - Instead of just throwing the data +to the device, we generalize and construct a <Literal remap="tt">system</Literal> which will +virtualize for us a nice and ordered structure in which we could arrange our +data in much the same way as books are arranged in a library. The purpose of +the filesystem, as I understand it, is to make it easy for us to update and +maintain our data. +</Para> + +<Para> +Normally, by <Literal remap="tt">mounting</Literal> filesystems, we just use the nice and logical +virtual structure. However, the disk knows nothing about that - The device +driver views the disk as a large continuous paper in which we can write notes +wherever we wish. It is the task of the filesystem management code to store +bookkeeping information which will serve the kernel for showing us the nice +and ordered virtual structure. +</Para> + +<Para> +In this document, we consider one particular administrative structure - The +Second Extended Filesystem. +</Para> + +</Sect1> + +<Sect1> +<Title>The Linux VFS layer</Title> + +<Para> +When Linux was first developed, it supported only one filesystem - The +<Literal remap="tt">Minix</Literal> filesystem. Today, Linux has the ability to support several +filesystems concurrently. This was done by the introduction of another layer +between the kernel and the filesystem code - The Virtual File System (VFS). +</Para> + +<Para> +The kernel "speaks" with the VFS layer. The VFS layer passes the kernel's +request to the proper filesystem management code. I haven't learned much of +the VFS layer as I didn't need it for the construction of EXT2ED so that I +can't elaborate on it. Just be aware that it exists. 
+</Para> + +</Sect1> + +<Sect1> +<Title>About blocks and block groups</Title> + +<Para> +In order to ease management, the ext2 filesystem logically divides the disk +into small units called <Literal remap="tt">blocks</Literal>. A block is the smallest unit which +can be allocated. Each block in the filesystem can be <Literal remap="tt">allocated</Literal> or +<Literal remap="tt">free</Literal>. +<FOOTNOTE> + +<Para> +The Ext2fs source code refers to the concept of <Literal remap="tt">fragments</Literal>, which I +believe are supposed to be sub-block allocations. As far as I know, +fragments are currently unsupported in Ext2fs. +</Para> + +</FOOTNOTE> + +The block size can be selected to be 1024, 2048 or 4096 bytes when creating +the filesystem. +</Para> + +<Para> +Ext2fs groups together a fixed number of sequential blocks into a <Literal remap="tt">blocks +group</Literal>. The resulting situation is that the filesystem is managed as a +series of blocks groups. This is done in order to keep related information +physically close on the disk and to ease the management task. As a result, +much of the filesystem management reduces to management of a single blocks +group. +</Para> + +</Sect1> + +<Sect1> +<Title>The view of inodes from the point of view of a blocks group</Title> + +<Para> +Each file in the filesystem is reserved a special <Literal remap="tt">inode</Literal>. I don't want +to explain inodes now. Rather, I would like to treat the inode as another resource, +much like a <Literal remap="tt">block</Literal> - Each blocks group contains a limited number of +inodes, while any specific inode can be <Literal remap="tt">allocated</Literal> or +<Literal remap="tt">unallocated</Literal>. +</Para> + +</Sect1> + +<Sect1> +<Title>The group descriptors</Title> + +<Para> +Each blocks group is accompanied by a <Literal remap="tt">group descriptor</Literal>. The group +descriptor summarizes some necessary information about the specific blocks +group.
Follows the definition of the group descriptor, as defined in +<FILENAME>/usr/include/linux/ext2_fs.h</FILENAME>: +</Para> + +<Para> + +<ProgramListing> +struct ext2_group_desc +{ + __u32 bg_block_bitmap; /* Blocks bitmap block */ + __u32 bg_inode_bitmap; /* Inodes bitmap block */ + __u32 bg_inode_table; /* Inodes table block */ + __u16 bg_free_blocks_count; /* Free blocks count */ + __u16 bg_free_inodes_count; /* Free inodes count */ + __u16 bg_used_dirs_count; /* Directories count */ + __u16 bg_pad; + __u32 bg_reserved[3]; +}; +</ProgramListing> + +</Para> + +<Para> +The last three variables: <Literal remap="tt">bg_free_blocks_count, bg_free_inodes_count and bg_used_dirs_count</Literal> provide statistics about the use of the three +resources in a blocks group - The <Literal remap="tt">blocks</Literal>, the <Literal remap="tt">inodes</Literal> and the +<Literal remap="tt">directories</Literal>. I believe that they are used by the kernel for balancing +the load between the various blocks groups. +</Para> + +<Para> +<Literal remap="tt">bg_block_bitmap</Literal> contains the block number of the <Literal remap="tt">block allocation +bitmap block</Literal>. This is used to allocate / deallocate each block in the +specific blocks group. +</Para> + +<Para> +<Literal remap="tt">bg_inode_bitmap</Literal> is fully analogous to the previous variable - It +contains the block number of the <Literal remap="tt">inode allocation bitmap block</Literal>, which +is used to allocate / deallocate each specific inode in the blocks group. +</Para> + +<Para> +<Literal remap="tt">bg_inode_table</Literal> contains the block number of the start of the +<Literal remap="tt">inode table of the current blocks group</Literal>. The <Literal remap="tt">inode table</Literal> is +just the actual inodes which are reserved for the current blocks group. +</Para> + +<Para> +The block bitmap block, inode bitmap block and the inode table are created +when the filesystem is created.
+</Para> + +<Para> +The group descriptors are placed one after the other. Together they make the +<Literal remap="tt">group descriptors table</Literal>. +</Para> + +<Para> +Each blocks group contains the entire table of group descriptors in its +second block, right after the superblock. However, only the first copy (in +group 0) is actually used by the kernel. The other copies are there for +backup purposes and can be of use if the main copy gets corrupted. +</Para> + +</Sect1> + +<Sect1> +<Title>The block bitmap allocation block</Title> + +<Para> +Each blocks group contains one special block which is actually a map of +all the blocks in the group, with respect to their allocation status. Each +<Literal remap="tt">bit</Literal> in the block bitmap indicates whether a specific block in the +group is used or free. +</Para> + +<Para> +The format is actually quite simple - Just view the entire block as a series +of bits. For example, +</Para> + +<Para> +Suppose the block size is 1024 bytes. As such, there is room for +1024*8=8192 blocks in a blocks group. This number is one of the fields in the +filesystem's <Literal remap="tt">superblock</Literal>, which will be explained later. +</Para> + +<Para> + +<ItemizedList> +<ListItem> + +<Para> + Block 0 in the blocks group is managed by bit 0 of byte 0 in the bitmap +block. +</Para> +</ListItem> +<ListItem> + +<Para> + Block 7 in the blocks group is managed by bit 7 of byte 0 in the bitmap +block. +</Para> +</ListItem> +<ListItem> + +<Para> + Block 8 in the blocks group is managed by bit 0 of byte 1 in the bitmap +block. +</Para> +</ListItem> +<ListItem> + +<Para> + Block 8191 in the blocks group is managed by bit 7 of byte 1023 in the +bitmap block. +</Para> +</ListItem> + +</ItemizedList> + +</Para> + +<Para> +A value of "<Literal remap="tt">1</Literal>" in the appropriate bit signals that the block is +allocated, while a value of "<Literal remap="tt">0</Literal>" signals that the block is +unallocated.
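As a rough illustration of the bit layout listed above, the following C sketch looks up the allocation bit of a block. The helper name block_is_allocated is hypothetical and is not taken from the kernel sources, and the sketch assumes that bit 0 is the least significant bit of each byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the kernel sources: returns 1 if the block
 * (numbered relative to the start of its blocks group) is marked allocated
 * in the bitmap block, and 0 if it is marked free.
 * Assumption: bit 0 is the least significant bit of each byte. */
static int block_is_allocated(const unsigned char *bitmap, size_t bitmap_bytes,
                              unsigned long block_in_group)
{
    size_t byte = block_in_group / 8;   /* byte which holds the bit */
    unsigned bit = block_in_group % 8;  /* position of the bit in that byte */

    assert(byte < bitmap_bytes);        /* stay inside the bitmap block */
    return (bitmap[byte] >> bit) & 1;
}
```

With a 1024 byte bitmap block, block_is_allocated(bitmap, 1024, 8) examines bit 0 of byte 1, which matches the third item in the list above.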
+</Para> + +<Para> +You will probably notice that typically, all the bits in a byte contain the +same value, making the byte's value <Literal remap="tt">0</Literal> or <Literal remap="tt">0ffh</Literal>. This is done by +the kernel on purpose in order to group related data in physically close +blocks, since the physical device is usually optimized to handle such a close +relationship. +</Para> + +</Sect1> + +<Sect1> +<Title>The inode allocation bitmap</Title> + +<Para> +The format of the inode allocation bitmap block is exactly like the format of +the block allocation bitmap block. The explanation above is valid here, with +the word <Literal remap="tt">block</Literal> replaced by <Literal remap="tt">inode</Literal>. Typically, there are far fewer +inodes than blocks in a blocks group and thus only part of the inode bitmap +block is used. The number of inodes in a blocks group is another variable +which is listed in the <Literal remap="tt">superblock</Literal>. +</Para> + +</Sect1> + +<Sect1> +<Title>On the inode and the inode tables</Title> + +<Para> +An inode is a main resource in the ext2 filesystem. It is used for various +purposes, but the main two are: + +<ItemizedList> +<ListItem> + +<Para> + Support of files +</Para> +</ListItem> +<ListItem> + +<Para> + Support of directories +</Para> +</ListItem> + +</ItemizedList> + +</Para> + +<Para> +Each file, for example, will allocate one inode from the filesystem +resources. +</Para> + +<Para> +An ext2 filesystem has a total number of available inodes which is determined +while creating the filesystem. When all the inodes are used, for example, you +will not be able to create an additional file even though there will still +be free blocks on the filesystem. +</Para> + +<Para> +Each inode takes up 128 bytes in the filesystem. By default, <Literal remap="tt">mke2fs</Literal> +reserves an inode for each 4096 bytes of the filesystem space.
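The two figures above allow a back-of-the-envelope estimate of how many inodes a filesystem gets and how much space its inode tables take. The C sketch below is only an approximation under the defaults quoted in the text; the helper names are hypothetical, and the real mke2fs may round the result differently.

```c
#include <assert.h>

enum {
    BYTES_PER_INODE = 4096, /* default ratio quoted in the text */
    INODE_SIZE = 128        /* size of one inode, as quoted in the text */
};

/* Hypothetical helper: rough inode count for a filesystem of fs_bytes bytes. */
static unsigned long long estimate_inode_count(unsigned long long fs_bytes)
{
    return fs_bytes / BYTES_PER_INODE;
}

/* Hypothetical helper: bytes taken by all the inode tables together. */
static unsigned long long inode_table_bytes(unsigned long long inode_count)
{
    assert(inode_count <= ~0ULL / INODE_SIZE); /* guard against overflow */
    return inode_count * INODE_SIZE;
}
```

For a filesystem of 104857600 bytes (100 MB) this gives 25600 inodes and 3276800 bytes of inode tables, that is 128/4096, or roughly 3 percent, of the space.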
+</Para> + +<Para> +The inodes are placed in several tables, each of which contains the same +number of inodes and is placed at a different blocks group. The goal is to +place inodes and their related files in the same blocks group because of +locality arguments. +</Para> + +<Para> +The number of inodes in a blocks group is available in the superblock variable +<Literal remap="tt">s_inodes_per_group</Literal>. For example, if there are 2000 inodes per group, +group 0 will contain the inodes 1-2000, group 1 will contain the inodes +2001-4000, and so on. +</Para> + +<Para> +Each inode table is accessed from the group descriptor of the specific +blocks group which contains the table. +</Para> + +<Para> +Follows the structure of an inode in Ext2fs: +</Para> + +<Para> + +<ProgramListing> +struct ext2_inode { + __u16 i_mode; /* File mode */ + __u16 i_uid; /* Owner Uid */ + __u32 i_size; /* Size in bytes */ + __u32 i_atime; /* Access time */ + __u32 i_ctime; /* Creation time */ + __u32 i_mtime; /* Modification time */ + __u32 i_dtime; /* Deletion Time */ + __u16 i_gid; /* Group Id */ + __u16 i_links_count; /* Links count */ + __u32 i_blocks; /* Blocks count */ + __u32 i_flags; /* File flags */ + union { + struct { + __u32 l_i_reserved1; + } linux1; + struct { + __u32 h_i_translator; + } hurd1; + struct { + __u32 m_i_reserved1; + } masix1; + } osd1; /* OS dependent 1 */ + __u32 i_block[EXT2_N_BLOCKS];/* Pointers to blocks */ + __u32 i_version; /* File version (for NFS) */ + __u32 i_file_acl; /* File ACL */ + __u32 i_size_high; /* High 32bits of size */ + __u32 i_faddr; /* Fragment address */ + union { + struct { + __u8 l_i_frag; /* Fragment number */ + __u8 l_i_fsize; /* Fragment size */ + __u16 i_pad1; + __u32 l_i_reserved2[2]; + } linux2; + struct { + __u8 h_i_frag; /* Fragment number */ + __u8 h_i_fsize; /* Fragment size */ + __u16 h_i_mode_high; + __u16 h_i_uid_high; + __u16 h_i_gid_high; + __u32 h_i_author; + } hurd2; + struct { + __u8 m_i_frag; /* Fragment number */
+ __u8 m_i_fsize; /* Fragment size */ + __u16 m_pad1; + __u32 m_i_reserved2[2]; + } masix2; + } osd2; /* OS dependent 2 */ +}; +</ProgramListing> + +</Para> + +<Sect2> +<Title>The allocated blocks</Title> + +<Para> +The basic functionality of an inode is to group together a series of +allocated blocks. There is no limitation on the allocated blocks - Each +block can be allocated to each inode. Nevertheless, block allocation will +usually be done in series to take advantage of the locality principle. +</Para> + +<Para> +The inode is not always used in that way. I will now explain the allocation +of blocks, assuming that the current inode type indeed refers to a list of +allocated blocks. +</Para> + +<Para> +It was found experimentally that many of the files in the filesystem are +actually quite small. To take advantage of this effect, the kernel provides +storage of up to 12 block numbers in the inode itself. Those blocks are +called <Literal remap="tt">direct blocks</Literal>. The advantage is that once the kernel has the +inode, it can directly access the file's blocks, without an additional disk +access. Those 12 blocks are directly specified in the variables +<Literal remap="tt">i_block[0] to i_block[11]</Literal>. +</Para> + +<Para> +<Literal remap="tt">i_block[12]</Literal> is the <Literal remap="tt">indirect block</Literal> - The block pointed by +i_block[12] will <Literal remap="tt">not</Literal> be a data block. Rather, it will just contain a +list of direct blocks. For example, if the block size is 1024 bytes, since +each block number is 4 bytes long, there will be place for 256 indirect +blocks. That is, block 13 till block 268 in the file will be accessed by the +<Literal remap="tt">indirect block</Literal> method. The penalty in this case, compared to the +direct blocks case, is that an additional access to the device is needed - +We need <Literal remap="tt">two</Literal> accesses to reach the required data block. 
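The direct and indirect cases described so far can be sketched in C as follows. This is only an illustration: the names N_DIRECT and locate_block are hypothetical, and the sketch counts the blocks of a file from 0, so that "block 13 till block 268" in the text corresponds to indices 12 to 267 here.

```c
#include <assert.h>
#include <stddef.h>

enum { N_DIRECT = 12 }; /* i_block[0] to i_block[11] hold direct blocks */

/* Hypothetical helper: classifies a 0-based block index within a file.
 * Returns 0 if the block is reached through a direct slot (*slot is then the
 * index into i_block), 1 if it is reached through the indirect block pointed
 * to by i_block[12] (*slot is then the entry inside that block), and -1 if a
 * deeper level of indirection is needed. */
static int locate_block(unsigned long file_block, unsigned long block_size,
                        unsigned long *slot)
{
    unsigned long per_block = block_size / 4; /* 4 bytes per block number */

    assert(slot != NULL);
    if (file_block < N_DIRECT) {
        *slot = file_block;
        return 0;                 /* one access: the data block itself */
    }
    file_block -= N_DIRECT;
    if (file_block < per_block) {
        *slot = file_block;
        return 1;                 /* two accesses: indirect block, then data */
    }
    return -1;                    /* beyond the single indirect block */
}
```

With a block size of 1024 bytes, index 12 maps to entry 0 of the indirect block and index 267 maps to entry 255, its last entry.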
+</Para> + +<Para> +In much the same way, <Literal remap="tt">i_block[13]</Literal> is the <Literal remap="tt">double indirect block</Literal> +and <Literal remap="tt">i_block[14]</Literal> is the <Literal remap="tt">triple indirect block</Literal>. +</Para> + +<Para> +<Literal remap="tt">i_block[13]</Literal> points to a block which contains pointers to indirect +blocks. Each one of them is handled in the way described above. +</Para> + +<Para> +In much the same way, the triple indirect block is just an additional level +of indirection - It will point to a list of double indirect blocks. +</Para> + +</Sect2> + +<Sect2> +<Title>The i_mode variable</Title> + +<Para> +The i_mode variable is used to determine the <Literal remap="tt">inode type</Literal> and the +associated <Literal remap="tt">permissions</Literal>. It is best described by representing it as an +octal number. Since it is a 16 bit variable, there will be 6 octal digits. +Those are divided into two parts - The rightmost 4 digits and the leftmost 2 +digits. +</Para> + +<Sect3> +<Title>The rightmost 4 octal digits</Title> + +<Para> +The rightmost 4 digits are <Literal remap="tt">bit options</Literal> - Each bit has its own +purpose. +</Para> + +<Para> +The last 3 digits (Octal digits 0,1 and 2) are just the usual permissions, +in the known form <Literal remap="tt">rwxrwxrwx</Literal>. Digit 2 refers to the user, digit 1 to +the group and digit 0 to everyone else. They are used by the kernel to grant +or deny access to the object presented by this inode. +<FOOTNOTE> + +<Para> +A <Literal remap="tt">smarter</Literal> permissions control is one of the enhancements planned for +Linux 1.3 - The ACL (Access Control Lists). Actually, from browsing of the +kernel source, some of the ACL handling is already done.
+</Para> + +</FOOTNOTE> + +</Para> + +<Para> +Bit number 9 signals that the file (I'll refer to the object presented by +the inode as file even though it can be a special device, for example) is +<Literal remap="tt">set VTX</Literal>. I still don't know what is the meaning of "VTX". +</Para> + +<Para> +Bit number 10 signals that the file is <Literal remap="tt">set group id</Literal> - I don't know +exactly the meaning of the above either. +</Para> + +<Para> +Bit number 11 signals that the file is <Literal remap="tt">set user id</Literal>, which means that +the file will run with the effective user id of the file's owner. +</Para> + +</Sect3> + +<Sect3> +<Title>The leftmost two octal digits</Title> + +<Para> +Note that the leftmost octal digit can only be 0 or 1, since the total number +of bits is 16. +</Para> + +<Para> +Those digits, as opposed to the rightmost 4 digits, are not bit mapped +options. They determine the type of the "file" to which the inode belongs: + +<ItemizedList> +<ListItem> + +<Para> + <Literal remap="tt">01</Literal> - The file is a <Literal remap="tt">FIFO</Literal>. +</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">02</Literal> - The file is a <Literal remap="tt">character device</Literal>. +</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">04</Literal> - The file is a <Literal remap="tt">directory</Literal>. +</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">06</Literal> - The file is a <Literal remap="tt">block device</Literal>. +</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">10</Literal> - The file is a <Literal remap="tt">regular file</Literal>. +</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">12</Literal> - The file is a <Literal remap="tt">symbolic link</Literal>. +</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">14</Literal> - The file is a <Literal remap="tt">socket</Literal>.
+</Para> +</ListItem> + +</ItemizedList> + +</Para> + +</Sect3> + +</Sect2> + +<Sect2> +<Title>Time and date</Title> + +<Para> +Linux records the last time in which various operations occurred with the +file. The time and date are saved in the standard C library format - The +number of seconds which passed since 00:00:00 GMT, January 1, 1970. The +following times are recorded: + +<ItemizedList> +<ListItem> + +<Para> + <Literal remap="tt">i_ctime</Literal> - The time in which the inode was last allocated. In +other words, the time in which the file was created. +</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">i_mtime</Literal> - The time in which the file was last modified. +</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">i_atime</Literal> - The time in which the file was last accessed. +</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">i_dtime</Literal> - The time in which the inode was deallocated. In +other words, the time in which the file was deleted. +</Para> +</ListItem> + +</ItemizedList> + +</Para> + +</Sect2> + +<Sect2> +<Title>i_size</Title> + +<Para> +<Literal remap="tt">i_size</Literal> contains information about the size of the object presented by +the inode. If the inode corresponds to a regular file, this is just the size +of the file in bytes. In other cases, the interpretation of the variable is +different. +</Para> + +</Sect2> + +<Sect2> +<Title>User and group id</Title> + +<Para> +The user and group id of the file are just saved in the variables +<Literal remap="tt">i_uid</Literal> and <Literal remap="tt">i_gid</Literal>. +</Para> + +</Sect2> + +<Sect2> +<Title>Hard links</Title> + +<Para> +Later, when we'll discuss the implementation of directories, it will be +explained that each <Literal remap="tt">directory entry</Literal> points to an inode. It is quite +possible that a <Literal remap="tt">single inode</Literal> will be pointed to from <Literal remap="tt">several</Literal> +directories. 
In that case, we say that there exist <Literal remap="tt">hard links</Literal> to the +file - The file can be accessed from each of the directories. +</Para> + +<Para> +The kernel keeps track of the number of hard links in the variable +<Literal remap="tt">i_links_count</Literal>. The variable is set to "1" when first allocating the +inode, and is incremented with each additional link. Deletion of a file will +delete the current directory entry and will decrement the number of links. +Only when this number reaches zero will the inode actually be deallocated. +</Para> + +<Para> +The name <Literal remap="tt">hard link</Literal> is used to distinguish between the alias method +described above and another alias method called <Literal remap="tt">symbolic linking</Literal>, +which will be described later. +</Para> + +</Sect2> + +<Sect2> +<Title>The Ext2fs extended flags</Title> + +<Para> +The ext2 filesystem associates additional flags with an inode. The extended +attributes are stored in the variable <Literal remap="tt">i_flags</Literal>. <Literal remap="tt">i_flags</Literal> is a 32 +bit variable. Only the 7 rightmost bits are defined. Of them, only 5 bits +are used in version 0.5a of the filesystem. Specifically, the +<Literal remap="tt">undelete</Literal> and the <Literal remap="tt">compress</Literal> features are not implemented, and +are to be introduced in Linux 1.3 development. +</Para> + +<Para> +The currently available flags are: + +<ItemizedList> +<ListItem> + +<Para> + bit 0 - Secure deletion. + +When this bit is on, the file's blocks are zeroed when the file is +deleted. With this bit off, they will just be left with their +original data when the inode is deallocated. +</Para> +</ListItem> +<ListItem> + +<Para> + bit 1 - Undelete. + +This bit is not supported yet. It will be used to provide an +<Literal remap="tt">undelete</Literal> feature in future Ext2fs developments. +</Para> +</ListItem> +<ListItem> + +<Para> + bit 2 - Compress file.
+ +This bit is also not supported. The plan is to offer "compression on +the fly" in future releases. +</Para> +</ListItem> +<ListItem> + +<Para> + bit 3 - Synchronous updates. + +With this bit on, the meta-data will be written synchronously to the +disk, as if the filesystem was mounted with the "sync" mount option. +</Para> +</ListItem> +<ListItem> + +<Para> + bit 4 - Immutable file. + +When this bit is on, the file will stay as it is - It cannot be +changed, deleted, renamed, hard linked, etc, before the bit is +cleared. +</Para> +</ListItem> +<ListItem> + +<Para> + bit 5 - Append only file. + +With this option active, data will only be appended to the file. +</Para> +</ListItem> +<ListItem> + +<Para> + bit 6 - Do not dump this file. + +I think that this bit is used by the port of dump to Linux (ported by +<Literal remap="tt">Remy Card</Literal>) to check if the file should not be dumped. +</Para> +</ListItem> + +</ItemizedList> + +</Para> + +</Sect2> + +<Sect2> +<Title>Symbolic links</Title> + +<Para> +The <Literal remap="tt">hard links</Literal> presented above are just additional pointers to the same +inode. The important aspect is that the inode number is <Literal remap="tt">fixed</Literal> when +the link is created. This means that the implementation details of the +filesystem are visible to the user - In a pure abstract usage of the +filesystem, the user should not care about inodes. +</Para> + +<Para> +The above causes several limitations: + +<ItemizedList> +<ListItem> + +<Para> + Hard links can be made only within the same filesystem. This is obvious, +since a hard link is just an inode number in some directory entry, +and the above elements are filesystem specific. +</Para> +</ListItem> +<ListItem> + +<Para> + You cannot "replace" the file which is pointed to by the hard link +after the link creation.
"Replacing" the file in one directory will +still leave the original file in the other directory - The +"replacement" will not deallocate the original inode, but rather +allocate another inode for the new version, and the directory entry +at the other place will just point to the old inode number. +</Para> +</ListItem> + +</ItemizedList> + +</Para> + +<Para> +A <Literal remap="tt">symbolic link</Literal>, on the other hand, is analyzed at <Literal remap="tt">run time</Literal>. A +symbolic link is just a <Literal remap="tt">pathname</Literal> which is accessible from an inode. +As such, it "speaks" in the language of the abstract filesystem. When the +kernel reaches a symbolic link, it will <Literal remap="tt">follow it in run time</Literal> using +its normal way of reaching directories. +</Para> + +<Para> +As such, symbolic links can be made <Literal remap="tt">across different filesystems</Literal> and a +replacement of a file with a new version will automatically be active on all +its symbolic links. +</Para> + +<Para> +The disadvantage is that a hard link consumes no space except for a small +directory entry. A symbolic link, on the other hand, consumes at least an +inode, and can also consume one block. +</Para> + +<Para> +When the inode is identified as a symbolic link, the kernel needs to find +the path to which it points. +</Para> + +<Sect3> +<Title>Fast symbolic links</Title> + +<Para> +When the pathname is shorter than 60 bytes, it can be saved directly in the +inode, in the <Literal remap="tt">i_block[0] - i_block[14]</Literal> variables (15 entries of 4 bytes each), since those are not +needed in that case. This is called a <Literal remap="tt">fast</Literal> symbolic link. It is fast +because the pathname resolution can be done using the inode itself, without +accessing additional blocks. It is also economical, since it allocates only +an inode. The length of the pathname is stored in the <Literal remap="tt">i_size</Literal> +variable.
+</Para> + +</Sect3> + +<Sect3> +<Title>Slow symbolic links</Title> + +<Para> +Starting from 60 bytes, an additional block is allocated (by the use of +<Literal remap="tt">i_block[0]</Literal>) and the pathname is stored in it. It is called slow +because the kernel needs to read an additional block to resolve the pathname. +The length is again saved in <Literal remap="tt">i_size</Literal>. +</Para> + +</Sect3> + +</Sect2> + +<Sect2> +<Title>i_version</Title> + +<Para> +<Literal remap="tt">i_version</Literal> is used with regard to Network File System. I don't know +its exact use. +</Para> + +</Sect2> + +<Sect2> +<Title>Reserved variables</Title> + +<Para> +As far as I know, the variables which are connected to ACL and fragments +are not currently used. They will be supported in future versions. +</Para> + +<Para> +Ext2fs is being ported to other operating systems. As far as I know, +at least in Linux, the OS dependent variables are also not used. +</Para> + +</Sect2> + +<Sect2> +<Title>Special reserved inodes</Title> + +<Para> +The first ten inodes on the filesystem are special inodes: + +<ItemizedList> +<ListItem> + +<Para> + Inode 1 is the <Literal remap="tt">bad blocks inode</Literal> - I believe that its data +blocks contain a list of the bad blocks in the filesystem, which +should not be allocated. +</Para> +</ListItem> +<ListItem> + +<Para> + Inode 2 is the <Literal remap="tt">root inode</Literal> - The inode of the root directory. +It is the starting point for reaching a known path in the filesystem. +</Para> +</ListItem> +<ListItem> + +<Para> + Inode 3 is the <Literal remap="tt">acl index inode</Literal>. Access control lists are +currently not supported by the ext2 filesystem, so I believe this +inode is not used. +</Para> +</ListItem> +<ListItem> + +<Para> + Inode 4 is the <Literal remap="tt">acl data inode</Literal>. Of course, the above applies +here too.
+</Para> +</ListItem> +<ListItem> + +<Para> + Inode 5 is the <Literal remap="tt">boot loader inode</Literal>. I don't know its +usage. +</Para> +</ListItem> +<ListItem> + +<Para> + Inode 6 is the <Literal remap="tt">undelete directory inode</Literal>. It is also a +foundation for future enhancements, and is currently not used. +</Para> +</ListItem> +<ListItem> + +<Para> + Inodes 7-10 are <Literal remap="tt">reserved</Literal> and currently not used. +</Para> +</ListItem> + +</ItemizedList> + +</Para> + +</Sect2> + +</Sect1> + +<Sect1> +<Title>Directories</Title> + +<Para> +A directory is implemented in the same way as files are implemented (with +the direct blocks, indirect blocks, etc) - It is just a file which is +formatted with a special format - A list of directory entries. +</Para> + +<Para> +Follows the definition of a directory entry: +</Para> + +<Para> + +<ProgramListing> +struct ext2_dir_entry { + __u32 inode; /* Inode number */ + __u16 rec_len; /* Directory entry length */ + __u16 name_len; /* Name length */ + char name[EXT2_NAME_LEN]; /* File name */ +}; +</ProgramListing> + +</Para> + +<Para> +Ext2fs supports file names of varying lengths, up to 255 bytes. The +<Literal remap="tt">name</Literal> field above just contains the file name. Note that it is +<Literal remap="tt">not zero terminated</Literal>; Instead, the variable <Literal remap="tt">name_len</Literal> contains +the length of the file name. +</Para> + +<Para> +The variable <Literal remap="tt">rec_len</Literal> is provided because the directory entries are +padded with zeroes so that the next entry will be at an offset which is +a multiple of 4. The resulting directory entry size is stored in +<Literal remap="tt">rec_len</Literal>. If the directory entry is the last in the block, it is +padded with zeroes till the end of the block, and rec_len is updated +accordingly. +</Para> + +<Para> +The <Literal remap="tt">inode</Literal> variable points to the inode of the above file.
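The padding rule for rec_len can be sketched in C as follows. The names DIR_ENTRY_HEADER and dir_entry_min_len are hypothetical; the 8 byte header size is simply the sum of the inode, rec_len and name_len fields in the structure above, and the sketch computes only the smallest legal entry size, not the enlarged size of the last entry in a block.

```c
#include <assert.h>

enum { DIR_ENTRY_HEADER = 8 }; /* inode (4) + rec_len (2) + name_len (2) */

/* Hypothetical helper: the smallest rec_len which can hold a name of
 * name_len bytes, rounded up so that the next entry starts at an offset
 * which is a multiple of 4. */
static unsigned dir_entry_min_len(unsigned name_len)
{
    assert(name_len >= 1 && name_len <= 255); /* limits quoted in the text */
    return (DIR_ENTRY_HEADER + name_len + 3) & ~3u;
}
```

A one byte name therefore needs a 12 byte entry, and a 5 byte name needs 16 bytes.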
+</Para> + +<Para> +Deletion of a directory entry is done by appending the deleted entry's +space to the previous (or next, I am not sure) entry. +</Para> + +</Sect1> + +<Sect1> +<Title>The superblock</Title> + +<Para> +The <Literal remap="tt">superblock</Literal> is a block which contains information which describes +the state of the internal filesystem. +</Para> + +<Para> +The superblock is located at the <Literal remap="tt">fixed offset 1024</Literal> in the device. Its +length is also 1024 bytes. +</Para> + +<Para> +The superblock, like the group descriptors, is copied on each blocks group +boundary for backup purposes. However, only the main copy is used by the +kernel. +</Para> + +<Para> +The superblock contains three types of information: + +<ItemizedList> +<ListItem> + +<Para> + Filesystem parameters which are fixed and which were determined when +this specific filesystem was created. Some of those parameters can +be different in different installations of the ext2 filesystem, but +cannot be changed once the filesystem was created. +</Para> +</ListItem> +<ListItem> + +<Para> + Filesystem parameters which are tunable - Can always be changed. +</Para> +</ListItem> +<ListItem> + +<Para> + Information about the current filesystem state. 
+</Para> +</ListItem> + +</ItemizedList> + +</Para> + +<Para> +The superblock definition follows: +</Para> + +<Para> + +<ProgramListing> +struct ext2_super_block { + __u32 s_inodes_count; /* Inodes count */ + __u32 s_blocks_count; /* Blocks count */ + __u32 s_r_blocks_count; /* Reserved blocks count */ + __u32 s_free_blocks_count; /* Free blocks count */ + __u32 s_free_inodes_count; /* Free inodes count */ + __u32 s_first_data_block; /* First Data Block */ + __u32 s_log_block_size; /* Block size */ + __s32 s_log_frag_size; /* Fragment size */ + __u32 s_blocks_per_group; /* # Blocks per group */ + __u32 s_frags_per_group; /* # Fragments per group */ + __u32 s_inodes_per_group; /* # Inodes per group */ + __u32 s_mtime; /* Mount time */ + __u32 s_wtime; /* Write time */ + __u16 s_mnt_count; /* Mount count */ + __s16 s_max_mnt_count; /* Maximal mount count */ + __u16 s_magic; /* Magic signature */ + __u16 s_state; /* File system state */ + __u16 s_errors; /* Behaviour when detecting errors */ + __u16 s_pad; + __u32 s_lastcheck; /* time of last check */ + __u32 s_checkinterval; /* max. time between checks */ + __u32 s_creator_os; /* OS */ + __u32 s_rev_level; /* Revision level */ + __u16 s_def_resuid; /* Default uid for reserved blocks */ + __u16 s_def_resgid; /* Default gid for reserved blocks */ + __u32 s_reserved[235]; /* Padding to the end of the block */ +}; +</ProgramListing> + +</Para> + +<Sect2> +<Title>Superblock identification</Title> + +<Para> +The ext2 filesystem's superblock is identified by the <Literal remap="tt">s_magic</Literal> field. +The current ext2 magic number is 0xEF53. I presume that "EF" means "Extended +Filesystem". In versions of the ext2 filesystem prior to 0.2B, the magic +number was 0xEF51. Those filesystems are not compatible with the current +versions; Specifically, the group descriptors definition is different. I +doubt that such an installation still exists. 
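The identification step can be sketched in C as follows. This is only an illustration under stated assumptions: has_ext2_magic and read_le16 are hypothetical helper names, the caller is assumed to have already read the first 2048 bytes of the device into a buffer, and the on-disk fields are assumed to be little-endian. The field offset 56 is obtained by adding up the sizes of the members which precede s_magic in the definition above (thirteen 32-bit fields and two 16-bit fields).

```c
#include <stddef.h>
#include <stdint.h>

#define SUPERBLOCK_OFFSET  1024   /* fixed byte offset in the device */
#define SUPERBLOCK_SIZE    1024   /* length of the superblock */
#define MAGIC_FIELD_OFFSET 56     /* offset of s_magic inside the superblock */
#define EXT2_MAGIC         0xEF53 /* current ext2 magic number */

/* Decodes a 16-bit value stored least significant byte first
 * (assumption: ext2 keeps its on-disk fields little-endian). */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Hypothetical helper: 'device' holds the first 'len' bytes of the device.
 * Returns 1 if s_magic equals the current ext2 magic number, 0 if it
 * does not, and -1 if the buffer is too short to contain the superblock. */
int has_ext2_magic(const unsigned char *device, size_t len)
{
    if (len < SUPERBLOCK_OFFSET + SUPERBLOCK_SIZE)
        return -1;

    return read_le16(device + SUPERBLOCK_OFFSET + MAGIC_FIELD_OFFSET)
           == EXT2_MAGIC;
}
```

Decoding the two bytes by hand, rather than casting the buffer to a structure, keeps the sketch independent of the host's byte order and alignment rules.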
+</Para> + +</Sect2> + +<Sect2> +<Title>Filesystem fixed parameters</Title> + +<Para> +By using the word <Literal remap="tt">fixed</Literal>, I mean fixed with respect to a particular +installation. Those variables are usually not fixed with respect to +different installations. +</Para> + +<Para> +The <Literal remap="tt">block size</Literal> is determined by using the <Literal remap="tt">s_log_block_size</Literal> +variable. The block size is 1024*pow(2, s_log_block_size) and should be +between 1024 and 4096. The available options are 1024, 2048 and 4096. +</Para> + +<Para> +<Literal remap="tt">s_inodes_count</Literal> contains the total number of available inodes. +</Para> + +<Para> +<Literal remap="tt">s_blocks_count</Literal> contains the total number of available blocks. +</Para> + +<Para> +<Literal remap="tt">s_first_data_block</Literal> specifies in which <Literal remap="tt">device block</Literal> the +<Literal remap="tt">superblock</Literal> is present. The superblock is always present at the fixed +offset 1024, but the device block numbering can differ. For example, if the +block size is 1024, the superblock will be at <Literal remap="tt">block 1</Literal> with respect to +the device. However, if the block size is 4096, offset 1024 is included in +<Literal remap="tt">block 0</Literal> of the device, and in that case <Literal remap="tt">s_first_data_block</Literal> +will contain 0. At least this is how I understood this variable. +</Para> + +<Para> +<Literal remap="tt">s_blocks_per_group</Literal> contains the number of blocks which are grouped +together as a blocks group. +</Para> + +<Para> +<Literal remap="tt">s_inodes_per_group</Literal> contains the number of inodes available in a blocks +group. I think that this is always the total number of inodes divided by the +number of blocks groups. 
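The arithmetic behind these fixed parameters can be sketched in C. The function names below are hypothetical and chosen for this example only. The first two functions follow the description above directly; the rounding-up formula used for the number of blocks groups is an assumption on my part and is not stated in this document.

```c
#include <stdint.h>

/* Block size in bytes: 1024 * 2^s_log_block_size.
 * Returns 0 for values outside the 1024..4096 range described above. */
uint32_t ext2_block_size(uint32_t s_log_block_size)
{
    if (s_log_block_size > 2)
        return 0;
    return (uint32_t)1024 << s_log_block_size;
}

/* Device block which contains byte offset 1024, i.e. the superblock.
 * Returns 1 for a 1024-byte block size and 0 for larger block sizes. */
uint32_t superblock_block(uint32_t block_size)
{
    if (block_size == 0)
        return 0; /* invalid input; avoid dividing by zero */
    return 1024 / block_size;
}

/* Number of blocks groups. ASSUMPTION: the blocks starting at
 * s_first_data_block are split into groups of s_blocks_per_group,
 * and a partial last group still counts as a group. */
uint32_t group_count(uint32_t s_blocks_count,
                     uint32_t s_first_data_block,
                     uint32_t s_blocks_per_group)
{
    if (s_blocks_per_group == 0 || s_blocks_count < s_first_data_block)
        return 0;
    return (s_blocks_count - s_first_data_block + s_blocks_per_group - 1)
           / s_blocks_per_group;
}
```

For example, a block size of 1024 places the superblock in device block 1, while a block size of 4096 places it in device block 0, which matches the two cases described for s_first_data_block.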
+</Para> + +<Para> +<Literal remap="tt">s_creator_os</Literal> contains a code number which specifies the operating +system which created this specific filesystem: + +<ItemizedList> +<ListItem> + +<Para> + <Literal remap="tt">Linux</Literal> :-) is specified by the value <Literal remap="tt">0</Literal>. +</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">Hurd</Literal> is specified by the value <Literal remap="tt">1</Literal>. +</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">Masix</Literal> is specified by the value <Literal remap="tt">2</Literal>. +</Para> +</ListItem> + +</ItemizedList> + +</Para> + +<Para> +<Literal remap="tt">s_rev_level</Literal> contains the major version of the ext2 filesystem. +Currently this is always <Literal remap="tt">0</Literal>, as the most recent version is 0.5B. It +will probably take some time until we reach version 1.0. +</Para> + +<Para> +As far as I know, fragments (sub-block allocations) are currently not +supported and hence a block is equal to a fragment. As a result, +<Literal remap="tt">s_log_frag_size</Literal> and <Literal remap="tt">s_frags_per_group</Literal> are always equal to +<Literal remap="tt">s_log_block_size</Literal> and <Literal remap="tt">s_blocks_per_group</Literal>, respectively. +</Para> + +</Sect2> + +<Sect2> +<Title>Ext2fs error handling</Title> + +<Para> +The ext2 filesystem error handling is based on the following philosophy: + +<OrderedList> +<ListItem> + +<Para> + Identification of problems is done by the kernel code. +</Para> +</ListItem> +<ListItem> + +<Para> + The correction task is left to an external utility, such as +<Literal remap="tt">e2fsck by Theodore Ts'o</Literal> for <Literal remap="tt">automatic</Literal> analysis and +correction, or perhaps <Literal remap="tt">debugfs by Theodore Ts'o</Literal> and +<Literal remap="tt">EXT2ED by myself</Literal>, for <Literal remap="tt">hand</Literal> analysis and correction. 
+</Para> +</ListItem> + +</OrderedList> + +</Para> + +<Para> +The <Literal remap="tt">s_state</Literal> variable is used by the kernel to pass the identification +result to third-party utilities: + +<ItemizedList> +<ListItem> + +<Para> + <Literal remap="tt">bit 0</Literal> of s_state is reset when the partition is mounted and +set when the partition is unmounted. Thus, a value of 0 in this bit on an +unmounted filesystem means that the filesystem was not unmounted +properly - The filesystem is not "clean" and probably contains +errors. +</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">bit 1</Literal> of s_state is set by the kernel when it detects an +error in the filesystem. A value of 0 doesn't mean that there isn't +an error in the filesystem, just that the kernel didn't find any. +</Para> +</ListItem> + +</ItemizedList> + +</Para> + +<Para> +The kernel behavior when an error is found is determined by the user-tunable +parameter <Literal remap="tt">s_errors</Literal>: + +<ItemizedList> +<ListItem> + +<Para> + The kernel will ignore the error and continue if <Literal remap="tt">s_errors=1</Literal>. +</Para> +</ListItem> +<ListItem> + +<Para> + The kernel will remount the filesystem in read-only mode if +<Literal remap="tt">s_errors=2</Literal>. +</Para> +</ListItem> +<ListItem> + +<Para> + A kernel panic will be issued if <Literal remap="tt">s_errors=3</Literal>. +</Para> +</ListItem> + +</ItemizedList> + +</Para> + +<Para> +The default behavior is to ignore the error. +</Para> + +</Sect2> + +<Sect2> +<Title>Additional parameters used by e2fsck</Title> + +<Para> +Of course, <Literal remap="tt">e2fsck</Literal> will check the filesystem if errors were detected +or if the filesystem is not clean. +</Para> + +<Para> +In addition, each time the filesystem is mounted, <Literal remap="tt">s_mnt_count</Literal> is +incremented. 
When s_mnt_count reaches <Literal remap="tt">s_max_mnt_count</Literal>, <Literal remap="tt">e2fsck</Literal> +will force a check on the filesystem even though it may be clean. It will +then zero s_mnt_count. <Literal remap="tt">s_max_mnt_count</Literal> is a tunable parameter. +</Para> + +<Para> +E2fsck also records the last time at which the filesystem was checked in +the <Literal remap="tt">s_lastcheck</Literal> variable. The user-tunable parameter +<Literal remap="tt">s_checkinterval</Literal> contains the number of seconds which are allowed +to pass since <Literal remap="tt">s_lastcheck</Literal> until a check is forced. A value of +<Literal remap="tt">0</Literal> disables the time-based check. +</Para> + +</Sect2> + +<Sect2> +<Title>Additional user tunable parameters</Title> + +<Para> +<Literal remap="tt">s_r_blocks_count</Literal> contains the number of disk blocks which are +reserved for root, the user whose id number is <Literal remap="tt">s_def_resuid</Literal> and the +group whose id number is <Literal remap="tt">s_def_resgid</Literal>. The kernel will refuse to +allocate those last <Literal remap="tt">s_r_blocks_count</Literal> blocks if the user is not one of the +above. This is done so that the filesystem will usually not be 100% full, +since 100% full filesystems can affect various aspects of operation. +</Para> + +<Para> +<Literal remap="tt">s_def_resuid</Literal> and <Literal remap="tt">s_def_resgid</Literal> contain the id of the user and +of the group who can use the reserved blocks in addition to root. +</Para> + +</Sect2> + +<Sect2> +<Title>Filesystem current state</Title> + +<Para> +<Literal remap="tt">s_free_blocks_count</Literal> contains the current number of free blocks +in the filesystem. +</Para> + +<Para> +<Literal remap="tt">s_free_inodes_count</Literal> contains the current number of free inodes in the +filesystem. +</Para> + +<Para> +<Literal remap="tt">s_mtime</Literal> contains the time at which the filesystem was last mounted. 
+</Para> + +<Para> +<Literal remap="tt">s_wtime</Literal> contains the last time at which something was changed in the +filesystem. +</Para> + +</Sect2> + +</Sect1> + +<Sect1> +<Title>Copyright</Title> + +<Para> +This document contains source code which was taken from the Linux ext2 +kernel source code, mainly from <FILENAME>/usr/include/linux/ext2_fs.h</FILENAME>. The +original copyright follows: +</Para> + +<Para> + +<ProgramListing> +/* + * linux/include/linux/ext2_fs.h + * + * Copyright (C) 1992, 1993, 1994, 1995 + * Remy Card (card@masi.ibp.fr) + * Laboratoire MASI - Institut Blaise Pascal + * Universite Pierre et Marie Curie (Paris VI) + * + * from + * + * linux/include/linux/minix_fs.h + * + * Copyright (C) 1991, 1992 Linus Torvalds + */ + +</ProgramListing> + +</Para> + +</Sect1> + +<Sect1> +<Title>Acknowledgments</Title> + +<Para> +I would like to thank the following people, who were involved in the +design and implementation of the ext2 filesystem kernel code and support +utilities: + +<ItemizedList> +<ListItem> + +<Para> + <Literal remap="tt">Remy Card</Literal> + +Who designed, implemented and maintains the ext2 filesystem kernel +code, and some of the ext2 utilities. <Literal remap="tt">Remy Card</Literal> is also the +author of several helpful slides concerning the ext2 filesystem. +Specifically, he is the author of <Literal remap="tt">File Management in the Linux +Kernel</Literal> and of <Literal remap="tt">The Second Extended File System - Current +State, Future Development</Literal>. + +</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">Wayne Davison</Literal> + +Who designed the ext2 filesystem. +</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">Stephen Tweedie</Literal> + +Who helped design the ext2 filesystem kernel code and wrote the +slides <Literal remap="tt">Optimizations in File Systems</Literal>. 
+</Para> +</ListItem> +<ListItem> + +<Para> + <Literal remap="tt">Theodore Ts'o</Literal> + +Who is the author of several ext2 utilities and of the ext2 library +<Literal remap="tt">libext2fs</Literal> (which I didn't use, simply because I didn't know +it existed when I started to work on my project). +</Para> +</ListItem> + +</ItemizedList> + +</Para> + +<Para> +Lastly, I would like to thank, of course, <Literal remap="tt">Linus Torvalds</Literal> and the +<Literal remap="tt">Linux community</Literal> for providing all of us with such a great operating +system. +</Para> + +<Para> +Please contact me with error reports, suggestions, or just about +anything concerning this document. +</Para> + +<Para> +Enjoy, +</Para> + +<Para> +Gadi Oxman <tgud@tochnapc2.technion.ac.il> +</Para> + +<Para> +Haifa, August 95 +</Para> + +</Sect1> + +</Article>