diff options
Diffstat (limited to 'Documentation/technical/bitmap-format.txt')
-rw-r--r-- | Documentation/technical/bitmap-format.txt | 257 |
1 files changed, 257 insertions, 0 deletions
diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt new file mode 100644 index 0000000..f5d2009 --- /dev/null +++ b/Documentation/technical/bitmap-format.txt @@ -0,0 +1,257 @@ +GIT bitmap v1 format +==================== + +== Pack and multi-pack bitmaps + +Bitmaps store reachability information about the set of objects in a packfile, +or a multi-pack index (MIDX). The former is defined obviously, and the latter is +defined as the union of objects in packs contained in the MIDX. + +A bitmap may belong to either one pack, or the repository's multi-pack index (if +it exists). A repository may have at most one bitmap. + +An object is uniquely described by its bit position within a bitmap: + + - If the bitmap belongs to a packfile, the __n__th bit corresponds to + the __n__th object in pack order. For a function `offset` which maps + objects to their byte offset within a pack, pack order is defined as + follows: + + o1 <= o2 <==> offset(o1) <= offset(o2) + + - If the bitmap belongs to a MIDX, the __n__th bit corresponds to the + __n__th object in MIDX order. With an additional function `pack` which + maps objects to the pack they were selected from by the MIDX, MIDX order + is defined as follows: + + o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2) ++ +The ordering between packs is done according to the MIDX's .rev file. +Notably, the preferred pack sorts ahead of all other packs. + +The on-disk representation (described below) of a bitmap is the same regardless +of whether or not that bitmap belongs to a packfile or a MIDX. The only +difference is the interpretation of the bits, which is described above. + +Certain bitmap extensions are supported (see: Appendix B). No extensions are +required for bitmaps corresponding to packfiles. For bitmaps that correspond to +MIDXs, both the bit-cache and rev-cache extensions are required. + +== On-disk format + + * A header appears at the beginning: + + 4-byte signature: :: {'B', 'I', 'T', 'M'} + + 2-byte version number (network byte order): :: + + The current implementation only supports version 1 + of the bitmap index (the same one as JGit). + + 2-byte flags (network byte order): :: + + The following flags are supported: + + ** {empty} + BITMAP_OPT_FULL_DAG (0x1) REQUIRED: ::: + + This flag must always be present. It implies that the + bitmap index has been generated for a packfile or + multi-pack index (MIDX) with full closure (i.e. where + every single object in the packfile/MIDX can find its + parent links inside the same packfile/MIDX). This is a + requirement for the bitmap index format, also present in + JGit, that greatly reduces the complexity of the + implementation. + + ** {empty} + BITMAP_OPT_HASH_CACHE (0x4): ::: + + If present, the end of the bitmap file contains + `N` 32-bit name-hash values, one per object in the + pack/MIDX. The format and meaning of the name-hash is + described below. + + ** {empty} + BITMAP_OPT_LOOKUP_TABLE (0x10): ::: + If present, the end of the bitmap file contains a table + containing a list of `N` <commit_pos, offset, xor_row> + triplets. The format and meaning of the table is described + below. ++ +NOTE: Unlike the xor_offset used to compress an individual bitmap, +`xor_row` stores an *absolute* index into the lookup table, not a location +relative to the current entry. + + 4-byte entry count (network byte order): :: + The total count of entries (bitmapped commits) in this bitmap index. + + 20-byte checksum: :: + The SHA1 checksum of the pack/MIDX this bitmap index + belongs to. + + * 4 EWAH bitmaps that act as type indexes ++ +Type indexes are serialized after the hash cache in the shape +of four EWAH bitmaps stored consecutively (see Appendix A for +the serialization format of an EWAH bitmap). ++ +There is a bitmap for each Git object type, stored in the following +order: ++ + - Commits + - Trees + - Blobs + - Tags + ++ +In each bitmap, the `n`th bit is set to true if the `n`th object +in the packfile or multi-pack index is of that type. ++ +The obvious consequence is that the OR of all 4 bitmaps will result +in a full set (all bits set), and the AND of all 4 bitmaps will +result in an empty bitmap (no bits set). + + * N entries with compressed bitmaps, one for each indexed commit ++ +Where `N` is the total number of entries in this bitmap index. +Each entry contains the following: + + ** {empty} + 4-byte object position (network byte order): :: + The position **in the index for the packfile or + multi-pack index** where the bitmap for this commit is + found. + + ** {empty} + 1-byte XOR-offset: :: + The xor offset used to compress this bitmap. For an entry + in position `x`, an XOR offset of `y` means that the actual + bitmap representing this commit is composed by XORing the + bitmap for this entry with the bitmap in entry `x-y` (i.e. + the bitmap `y` entries before this one). ++ +NOTE: This compression can be recursive. In order to +XOR this entry with a previous one, the previous entry needs +to be decompressed first, and so on. ++ +The hard-limit for this offset is 160 (an entry can only be +xor'ed against one of the 160 entries preceding it). This +number is always positive, and hence entries are always xor'ed +with **previous** bitmaps, not bitmaps that will come afterwards +in the index. + + ** {empty} + 1-byte flags for this bitmap: :: + At the moment the only available flag is `0x1`, which hints + that this bitmap can be re-used when rebuilding bitmap indexes + for the repository. + + ** The compressed bitmap itself, see Appendix A. + + * {empty} + TRAILER: :: + Trailing checksum of the preceding contents. + +== Appendix A: Serialization format for an EWAH bitmap + +Ewah bitmaps are serialized in the same protocol as the JAVAEWAH +library, making them backwards compatible with the JGit +implementation: + + - 4-byte number of bits of the resulting UNCOMPRESSED bitmap + + - 4-byte number of words of the COMPRESSED bitmap, when stored + + - N x 8-byte words, as specified by the previous field ++ +This is the actual content of the compressed bitmap. + + - 4-byte position of the current RLW for the compressed + bitmap + +All words are stored in network byte order for their corresponding +sizes. + +The compressed bitmap is stored in a form of run-length encoding, as +follows. It consists of a concatenation of an arbitrary number of +chunks. Each chunk consists of one or more 64-bit words + + H L_1 L_2 L_3 .... L_M + +H is called RLW (run length word). It consists of (from lower to higher +order bits): + + - 1 bit: the repeated bit B + + - 32 bits: repetition count K (unsigned) + + - 31 bits: literal word count M (unsigned) + +The bitstream represented by the above chunk is then: + + - K repetitions of B + + - The bits stored in `L_1` through `L_M`. Within a word, bits at + lower order come earlier in the stream than those at higher + order. + +The next word after `L_M` (if any) must again be a RLW, for the next +chunk. For efficient appending to the bitstream, the EWAH stores a +pointer to the last RLW in the stream. + + +== Appendix B: Optional Bitmap Sections + +These sections may or may not be present in the `.bitmap` file; their +presence is indicated by the header flags section described above. + +Name-hash cache +--------------- + +If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains +a cache of 32-bit values, one per object in the pack/MIDX. The value at +position `i` is the hash of the pathname at which the `i`th object +(counting in index or multi-pack index order) in the pack/MIDX can be found. +This can be fed into the delta heuristics to compare objects with similar +pathnames. + +The hash algorithm used is: + + hash = 0; + while ((c = *name++)) + if (!isspace(c)) + hash = (hash >> 2) + (c << 24); + +Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag. +If implementations want to choose a different hashing scheme, they are +free to do so, but MUST allocate a new header flag (because comparing +hashes made under two different schemes would be pointless). + +Commit lookup table +------------------- + +If the BITMAP_OPT_LOOKUP_TABLE flag is set, the last `N * (4 + 8 + 4)` +bytes (preceding the name-hash cache and trailing hash) of the `.bitmap` +file contains a lookup table specifying the information needed to get +the desired bitmap from the entries without parsing previous unnecessary +bitmaps. + +For a `.bitmap` containing `nr_entries` reachability bitmaps, the table +contains a list of `nr_entries` <commit_pos, offset, xor_row> triplets +(sorted in the ascending order of `commit_pos`). The content of the i'th +triplet is - + + * {empty} + commit_pos (4 byte integer, network byte order): :: + It stores the object position of a commit (in the midx or pack + index). + + * {empty} + offset (8 byte integer, network byte order): :: + The offset from which that commit's bitmap can be read. + + * {empty} + xor_row (4 byte integer, network byte order): :: + The position of the triplet whose bitmap is used to compress + this one, or `0xffffffff` if no such bitmap exists. |