author    Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-09 13:34:27 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-09 13:34:27 +0000
commit    4dbdc42d9e7c3968ff7f690d00680419c9b8cb0f (patch)
tree      47c1d492e9c956c1cd2b74dbd3b9d8b0db44dc4e /Documentation/technical
parent    Initial commit. (diff)
Adding upstream version 1:2.43.0.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'Documentation/technical')
-rw-r--r--  Documentation/technical/.gitignore  1
-rw-r--r--  Documentation/technical/api-error-handling.txt  103
-rw-r--r--  Documentation/technical/api-index-skel.txt  13
-rwxr-xr-x  Documentation/technical/api-index.sh  28
-rw-r--r--  Documentation/technical/api-merge.txt  36
-rw-r--r--  Documentation/technical/api-parse-options.txt  349
-rw-r--r--  Documentation/technical/api-simple-ipc.txt  105
-rw-r--r--  Documentation/technical/api-trace2.txt  1339
-rw-r--r--  Documentation/technical/bitmap-format.txt  257
-rw-r--r--  Documentation/technical/bundle-uri.txt  572
-rw-r--r--  Documentation/technical/commit-graph.txt  401
-rw-r--r--  Documentation/technical/directory-rename-detection.txt  118
-rw-r--r--  Documentation/technical/hash-function-transition.txt  830
-rw-r--r--  Documentation/technical/long-running-process-protocol.txt  50
-rw-r--r--  Documentation/technical/multi-pack-index.txt  100
-rw-r--r--  Documentation/technical/pack-heuristics.txt  460
-rw-r--r--  Documentation/technical/packfile-uri.txt  82
-rw-r--r--  Documentation/technical/parallel-checkout.txt  270
-rw-r--r--  Documentation/technical/partial-clone.txt  367
-rw-r--r--  Documentation/technical/racy-git.txt  201
-rw-r--r--  Documentation/technical/reftable.txt  1098
-rw-r--r--  Documentation/technical/remembering-renames.txt  671
-rw-r--r--  Documentation/technical/repository-version.txt  102
-rw-r--r--  Documentation/technical/rerere.txt  186
-rw-r--r--  Documentation/technical/scalar.txt  66
-rw-r--r--  Documentation/technical/send-pack-pipeline.txt  63
-rw-r--r--  Documentation/technical/shallow.txt  60
-rw-r--r--  Documentation/technical/sparse-checkout.txt  1103
-rw-r--r--  Documentation/technical/sparse-index.txt  208
-rw-r--r--  Documentation/technical/trivial-merge.txt  121
30 files changed, 9360 insertions, 0 deletions
diff --git a/Documentation/technical/.gitignore b/Documentation/technical/.gitignore
new file mode 100644
index 0000000..8aa891d
--- /dev/null
+++ b/Documentation/technical/.gitignore
@@ -0,0 +1 @@
+api-index.txt
diff --git a/Documentation/technical/api-error-handling.txt b/Documentation/technical/api-error-handling.txt
new file mode 100644
index 0000000..665c496
--- /dev/null
+++ b/Documentation/technical/api-error-handling.txt
@@ -0,0 +1,103 @@
+Error reporting in git
+======================
+
+`BUG`, `bug`, `die`, `usage`, `error`, and `warning` report errors of
+various kinds.
+
+- `BUG` is for failed internal assertions that should never happen,
+ i.e. a bug in git itself.
+
+- `bug` (lower-case, not `BUG`) is supposed to be used like `BUG` but
+ prints a "BUG" message instead of calling `abort()`.
++
+A call to `bug()` will then result in a "real" call to the `BUG()`
+function, either explicitly by invoking `BUG_if_bug()` after call(s)
+to `bug()`, or implicitly at `exit()` time where we'll check if we
+encountered any outstanding `bug()` invocations.
++
+If there were no prior calls to `bug()` before invoking `BUG_if_bug()`
+the latter is a NOOP. The `BUG_if_bug()` function takes the same
+arguments as `BUG()` itself. Calling `BUG_if_bug()` explicitly isn't
+necessary, but ensures that we die as soon as possible.
++
+If you know you had prior calls to `bug()` then calling `BUG()` itself
+is equivalent to calling `BUG_if_bug()`, the latter being a wrapper
+calling `BUG()` if we've set a flag indicating that we've called
+`bug()`.
++
+This is for the convenience of APIs that would like to report more
+than one "bug", such as the optbug() validation in parse-options.c
+(see the sketch following this list).
+
+- `die` is for fatal application errors. It prints a message to
+ the user and exits with status 128.
+
+- `usage` is for errors in command line usage. After printing its
+ message, it exits with status 129. (See also `usage_with_options`
+ in the link:api-parse-options.html[parse-options API].)
+
+- `error` is for non-fatal library errors. It prints a message
+ to the user and returns -1 for convenience in signaling the error
+ to the caller.
+
+- `warning` is for reporting situations that probably should not
+ occur but which the user (and Git) can continue to work around
+ without running into too many problems. Like `error`, it
+ returns -1 after reporting the situation to the caller.
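+
+The `bug()`/`BUG_if_bug()` pattern might look like the following
+minimal sketch; `check_widgets()` and `struct widget` are hypothetical
+and exist only for illustration:
+
+	static void check_widgets(struct widget *widgets, int nr)
+	{
+		int i;
+		for (i = 0; i < nr; i++)
+			if (!widgets[i].name)
+				bug("widget %d has no name", i);
+		/* dies via BUG() if and only if bug() was called above */
+		BUG_if_bug("invalid widget configuration");
+	}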
+
+These reports will be logged via the trace2 facility. See the "error"
+event in link:api-trace2.html[trace2 API].
+
+Customizable error handlers
+---------------------------
+
+The default behavior of `die` and `error` is to write a message to
+stderr and then exit or return as appropriate. This behavior can be
+overridden using `set_die_routine` and `set_error_routine`. For
+example, "git daemon" uses set_die_routine to write the reason `die`
+was called to syslog before exiting.
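+
+As a minimal sketch, assuming the conventional handler signature of
+`void (*)(const char *err, va_list params)` (check usage.c for the exact
+`report_fn` type) and a hypothetical `my_log()` helper:
+
+	static NORETURN void my_die_routine(const char *err, va_list params)
+	{
+		my_log(err, params);	/* hypothetical: record the reason */
+		exit(128);
+	}
+	...
+	set_die_routine(my_die_routine);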
+
+Library errors
+--------------
+
+Functions return a negative integer on error. Details beyond that
+vary from function to function:
+
+- Some functions return -1 for all errors. Others return a more
+ specific value depending on how the caller might want to react
+ to the error.
+
+- Some functions report the error to stderr with `error`,
+ while others leave that for the caller to do.
+
+- errno is not meaningful on return from most functions (except
+ for thin wrappers for system calls).
+
+Check the function's API documentation to be sure.
+
+Caller-handled errors
+---------------------
+
+An increasing number of functions take a parameter 'struct strbuf *err'.
+On error, such functions append a message about what went wrong to the
+'err' strbuf. The message is meant to be complete enough to be passed
+to `die` or `error` as-is. For example:
+
+ if (ref_transaction_commit(transaction, &err))
+ die("%s", err.buf);
+
+The 'err' parameter will be untouched if no error occurred, so multiple
+function calls can be chained:
+
+ t = ref_transaction_begin(&err);
+ if (!t ||
+ ref_transaction_update(t, "HEAD", ..., &err) ||
+	    ref_transaction_commit(t, &err))
+ die("%s", err.buf);
+
+The 'err' parameter must be a pointer to a valid strbuf. To silence
+a message, pass a strbuf that is explicitly ignored:
+
+ if (thing_that_can_fail_in_an_ignorable_way(..., &err))
+ /* This failure is okay. */
+ strbuf_reset(&err);
diff --git a/Documentation/technical/api-index-skel.txt b/Documentation/technical/api-index-skel.txt
new file mode 100644
index 0000000..7780a76
--- /dev/null
+++ b/Documentation/technical/api-index-skel.txt
@@ -0,0 +1,13 @@
+Git API Documents
+=================
+
+Git has grown a set of internal APIs over time. This collection
+documents them.
+
+////////////////////////////////////////////////////////////////
+// table of contents begin
+////////////////////////////////////////////////////////////////
+
+////////////////////////////////////////////////////////////////
+// table of contents end
+////////////////////////////////////////////////////////////////
diff --git a/Documentation/technical/api-index.sh b/Documentation/technical/api-index.sh
new file mode 100755
index 0000000..9c3f413
--- /dev/null
+++ b/Documentation/technical/api-index.sh
@@ -0,0 +1,28 @@
+#!/bin/sh
+
+(
+ c=////////////////////////////////////////////////////////////////
+ skel=api-index-skel.txt
+ sed -e '/^\/\/ table of contents begin/q' "$skel"
+ echo "$c"
+
+ ls api-*.txt |
+ while read filename
+ do
+ case "$filename" in
+ api-index-skel.txt | api-index.txt) continue ;;
+ esac
+ title=$(sed -e 1q "$filename")
+ html=${filename%.txt}.html
+ echo "* link:$html[$title]"
+ done
+ echo "$c"
+ sed -n -e '/^\/\/ table of contents end/,$p' "$skel"
+) >api-index.txt+
+
+if test -f api-index.txt && cmp api-index.txt api-index.txt+ >/dev/null
+then
+ rm -f api-index.txt+
+else
+ mv api-index.txt+ api-index.txt
+fi
diff --git a/Documentation/technical/api-merge.txt b/Documentation/technical/api-merge.txt
new file mode 100644
index 0000000..c2ba018
--- /dev/null
+++ b/Documentation/technical/api-merge.txt
@@ -0,0 +1,36 @@
+merge API
+=========
+
+The merge API helps a program to reconcile two competing sets of
+improvements to some files (e.g., unregistered changes from the work
+tree versus changes involved in switching to a new branch), reporting
+conflicts if found. The library called through this API is
+responsible for a few things.
+
+ * determining which trees to merge (recursive ancestor consolidation);
+
+ * lining up corresponding files in the trees to be merged (rename
+ detection, subtree shifting), reporting edge cases like add/add
+ and rename/rename conflicts to the user;
+
+ * performing a three-way merge of corresponding files, taking
+ path-specific merge drivers (specified in `.gitattributes`)
+ into account.
+
+Data structures
+---------------
+
+* `mmbuffer_t`, `mmfile_t`
+
+These store data for use by the xdiff backend, for writing and
+for reading, respectively. See `xdiff/xdiff.h` for the definitions
+and `diff.c` for examples.
+
+* `struct ll_merge_options`
+
+Check merge-ll.h for details.
+
+Low-level (single file) merge
+-----------------------------
+
+Check merge-ll.h for details.
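+
+As a rough sketch of a single-file merge (the parameter order and the
+result codes are paraphrased from merge-ll.h and should be checked
+there; the buffers, the path, and `istate` below are placeholders):
+
+	mmfile_t base = { base_buf, base_len };		/* common ancestor */
+	mmfile_t ours = { our_buf, our_len };
+	mmfile_t theirs = { their_buf, their_len };
+	mmbuffer_t result = { NULL, 0 };
+	struct ll_merge_options opts = { 0 };		/* default options */
+	enum ll_merge_result ret;
+	ret = ll_merge(&result, "path/in/repo", &base, "base",
+		       &ours, "ours", &theirs, "theirs",
+		       istate, &opts);
+	/* a conflicted result leaves conflict markers in result.ptr;
+	   the caller owns and must free result.ptr */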
diff --git a/Documentation/technical/api-parse-options.txt b/Documentation/technical/api-parse-options.txt
new file mode 100644
index 0000000..61fa6ee
--- /dev/null
+++ b/Documentation/technical/api-parse-options.txt
@@ -0,0 +1,349 @@
+parse-options API
+=================
+
+The parse-options API is used to parse and massage options in Git
+and to provide a usage help with consistent look.
+
+Basics
+------
+
+The argument vector `argv[]` usually contains mandatory or optional
+'non-option arguments' (e.g. a filename or a branch), 'options', and
+'subcommands'.
+Options are optional arguments that start with a dash and
+that change the behavior of a command.
+
+* There are basically three types of options:
+ 'boolean' options,
+ options with (mandatory) 'arguments' and
+ options with 'optional arguments'
+ (i.e. a boolean option that can be adjusted).
+
+* There are basically two forms of options:
+ 'Short options' consist of one dash (`-`) and one alphanumeric
+ character.
+ 'Long options' begin with two dashes (`--`) and some
+ alphanumeric characters.
+
+* Options are case-sensitive.
+ Please define 'lower-case long options' only.
+
+The parse-options API allows:
+
+* 'stuck' and 'separate form' of options with arguments.
+ `-oArg` is stuck, `-o Arg` is separate form.
+ `--option=Arg` is stuck, `--option Arg` is separate form.
+
+* Long options may be 'abbreviated', as long as the abbreviation
+ is unambiguous.
+
+* Short options may be bundled, e.g. `-a -b` can be specified as `-ab`.
+
+* Boolean long options can be 'negated' (or 'unset') by prepending
+ `no-`, e.g. `--no-abbrev` instead of `--abbrev`. Conversely,
+ options that begin with `no-` can be 'negated' by removing it.
+ Other long options can be unset (e.g., set string to NULL, set
+ integer to 0) by prepending `no-`.
+
+* Options and non-option arguments can clearly be separated using the `--`
+ option, e.g. `-a -b --option -- --this-is-a-file` indicates that
+ `--this-is-a-file` must not be processed as an option.
+
+Subcommands are special in a couple of ways:
+
+* Subcommands have only a long form; they have no double-dash prefix, no
+  negated form, and no description. They don't take any arguments and
+  can't be abbreviated.
+
+* There must be exactly one subcommand among the arguments, or zero if the
+ command has a default operation mode.
+
+* All arguments following the subcommand are considered to be arguments of
+ the subcommand, and, conversely, arguments meant for the subcommand may
+ not precede the subcommand.
+
+Therefore, if the options array contains at least one subcommand and
+`parse_options()` encounters the first dashless argument, it will either:
+
+* stop and return, if that dashless argument is a known subcommand, setting
+ `value` to the function pointer associated with that subcommand, storing
+ the name of the subcommand in argv[0], and leaving the rest of the
+ arguments unprocessed, or
+
+* stop and return, if it was invoked with the `PARSE_OPT_SUBCOMMAND_OPTIONAL`
+ flag and that dashless argument doesn't match any subcommands, leaving
+ `value` unchanged and the rest of the arguments unprocessed, or
+
+* show error and usage, and abort.
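+
+For example, a command with a `list` subcommand might be parsed like this
+(a minimal sketch; `cmd_foo_list()` is a hypothetical `parse_opt_subcommand_fn`
+implementing "git foo list", and `builtin_foo_usage` is the usage-string
+array described in the next section):
+
+	parse_opt_subcommand_fn *fn = NULL;
+	struct option options[] = {
+		OPT_SUBCOMMAND("list", &fn, cmd_foo_list),
+		OPT_END()
+	};
+	argc = parse_options(argc, argv, prefix, options, builtin_foo_usage, 0);
+	/* without PARSE_OPT_SUBCOMMAND_OPTIONAL, fn is now non-NULL */
+	return fn(argc, argv, prefix);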
+
+Steps to parse options
+----------------------
+
+. `#include "parse-options.h"`
+
+. define a NULL-terminated
+ `static const char * const builtin_foo_usage[]` array
+ containing alternative usage strings
+
+. define `builtin_foo_options` array as described below
+ in section 'Data Structure'.
+
+. in `cmd_foo(int argc, const char **argv, const char *prefix)`
+ call
+
+ argc = parse_options(argc, argv, prefix, builtin_foo_options, builtin_foo_usage, flags);
++
+`parse_options()` will filter out the processed options of `argv[]` and leave the
+non-option arguments in `argv[]`.
+`argc` is updated appropriately because of the assignment.
++
+You can also pass NULL instead of a usage array as the fifth parameter of
+parse_options(), to avoid displaying a help screen with usage info and
+option list. This should only be done if necessary, e.g. to implement
+a limited parser for only a subset of the options that needs to be run
+before the full parser, which in turn shows the full help message.
++
+Flags are the bitwise-or of:
+
+`PARSE_OPT_KEEP_DASHDASH`::
+ Keep the `--` that usually separates options from
+ non-option arguments.
+
+`PARSE_OPT_STOP_AT_NON_OPTION`::
+ Usually the whole argument vector is massaged and reordered.
+ Using this flag, processing is stopped at the first non-option
+ argument.
+
+`PARSE_OPT_KEEP_ARGV0`::
+ Keep the first argument, which contains the program name. It's
+ removed from argv[] by default.
+
+`PARSE_OPT_KEEP_UNKNOWN_OPT`::
+ Keep unknown options instead of erroring out. This doesn't
+ work for all combinations of arguments as users might expect
+ it to do. E.g. if the first argument in `--unknown --known`
+ takes a value (which we can't know), the second one is
+ mistakenly interpreted as a known option. Similarly, if
+ `PARSE_OPT_STOP_AT_NON_OPTION` is set, the second argument in
+ `--unknown value` will be mistakenly interpreted as a
+	non-option, not as a value belonging to the unknown option,
+	stopping the parser early. That's why parse_options() errors out
+	if both flags are set.
+ Note that non-option arguments are always kept, even without
+ this flag.
+
+`PARSE_OPT_NO_INTERNAL_HELP`::
+ By default, parse_options() handles `-h`, `--help` and
+ `--help-all` internally, by showing a help screen. This option
+ turns it off and allows one to add custom handlers for these
+ options, or to just leave them unknown.
+
+`PARSE_OPT_SUBCOMMAND_OPTIONAL`::
+ Don't error out when no subcommand is specified.
+
+Note that `PARSE_OPT_STOP_AT_NON_OPTION` is incompatible with subcommands;
+while `PARSE_OPT_KEEP_DASHDASH` and `PARSE_OPT_KEEP_UNKNOWN_OPT` can only be
+used with subcommands when combined with `PARSE_OPT_SUBCOMMAND_OPTIONAL`.
+
+Data Structure
+--------------
+
+The main data structure is an array of the `option` struct,
+say `static struct option builtin_add_options[]`.
+There are some macros to easily define options:
+
+`OPT__ABBREV(&int_var)`::
+ Add `--abbrev[=<n>]`.
+
+`OPT__COLOR(&int_var, description)`::
+ Add `--color[=<when>]` and `--no-color`.
+
+`OPT__DRY_RUN(&int_var, description)`::
+ Add `-n, --dry-run`.
+
+`OPT__FORCE(&int_var, description)`::
+ Add `-f, --force`.
+
+`OPT__QUIET(&int_var, description)`::
+ Add `-q, --quiet`.
+
+`OPT__VERBOSE(&int_var, description)`::
+ Add `-v, --verbose`.
+
+`OPT_GROUP(description)`::
+ Start an option group. `description` is a short string that
+ describes the group or an empty string.
+ Start the description with an upper-case letter.
+
+`OPT_BOOL(short, long, &int_var, description)`::
+ Introduce a boolean option. `int_var` is set to one with
+ `--option` and set to zero with `--no-option`.
+
+`OPT_COUNTUP(short, long, &int_var, description)`::
+ Introduce a count-up option.
+ Each use of `--option` increments `int_var`, starting from zero
+ (even if initially negative), and `--no-option` resets it to
+ zero. To determine if `--option` or `--no-option` was encountered at
+ all, initialize `int_var` to a negative value, and if it is still
+ negative after parse_options(), then neither `--option` nor
+ `--no-option` was seen.
+
+`OPT_BIT(short, long, &int_var, description, mask)`::
+ Introduce a boolean option.
+ If used, `int_var` is bitwise-ored with `mask`.
+
+`OPT_NEGBIT(short, long, &int_var, description, mask)`::
+ Introduce a boolean option.
+ If used, `int_var` is bitwise-anded with the inverted `mask`.
+
+`OPT_SET_INT(short, long, &int_var, description, integer)`::
+ Introduce an integer option.
+ `int_var` is set to `integer` with `--option`, and
+ reset to zero with `--no-option`.
+
+`OPT_STRING(short, long, &str_var, arg_str, description)`::
+ Introduce an option with string argument.
+ The string argument is put into `str_var`.
+
+`OPT_STRING_LIST(short, long, &struct string_list, arg_str, description)`::
+ Introduce an option with string argument.
+ The string argument is stored as an element in `string_list`.
+ Use of `--no-option` will clear the list of preceding values.
+
+`OPT_INTEGER(short, long, &int_var, description)`::
+ Introduce an option with integer argument.
+ The integer is put into `int_var`.
+
+`OPT_MAGNITUDE(short, long, &unsigned_long_var, description)`::
+ Introduce an option with a size argument. The argument must be a
+ non-negative integer and may include a suffix of 'k', 'm' or 'g' to
+ scale the provided value by 1024, 1024^2 or 1024^3 respectively.
+ The scaled value is put into `unsigned_long_var`.
+
+`OPT_EXPIRY_DATE(short, long, &timestamp_t_var, description)`::
+ Introduce an option with expiry date argument, see `parse_expiry_date()`.
+ The timestamp is put into `timestamp_t_var`.
+
+`OPT_CALLBACK(short, long, &var, arg_str, description, func_ptr)`::
+ Introduce an option with argument.
+ The argument will be fed into the function given by `func_ptr`
+ and the result will be put into `var`.
+ See 'Option Callbacks' below for a more elaborate description.
+
+`OPT_FILENAME(short, long, &var, description)`::
+ Introduce an option with a filename argument.
+ The filename will be prefixed by passing the filename along with
+ the prefix argument of `parse_options()` to `prefix_filename()`.
+
+`OPT_NUMBER_CALLBACK(&var, description, func_ptr)`::
+ Recognize numerical options like -123 and feed the integer as
+ if it was an argument to the function given by `func_ptr`.
+ The result will be put into `var`. There can be only one such
+ option definition. It cannot be negated and it takes no
+ arguments. Short options that happen to be digits take
+ precedence over it.
+
+`OPT_COLOR_FLAG(short, long, &int_var, description)`::
+ Introduce an option that takes an optional argument that can
+ have one of three values: "always", "never", or "auto". If the
+ argument is not given, it defaults to "always". The `--no-` form
+ works like `--long=never`; it cannot take an argument. If
+ "always", set `int_var` to 1; if "never", set `int_var` to 0; if
+ "auto", set `int_var` to 1 if stdout is a tty or a pager,
+ 0 otherwise.
+
+`OPT_NOOP_NOARG(short, long)`::
+ Introduce an option that has no effect and takes no arguments.
+ Use it to hide deprecated options that are still to be recognized
+ and ignored silently.
+
+`OPT_PASSTHRU(short, long, &char_var, arg_str, description, flags)`::
+ Introduce an option that will be reconstructed into a char* string,
+ which must be initialized to NULL. This is useful when you need to
+ pass the command-line option to another command. Any previous value
+ will be overwritten, so this should only be used for options where
+ the last one specified on the command line wins.
+
+`OPT_PASSTHRU_ARGV(short, long, &strvec_var, arg_str, description, flags)`::
+ Introduce an option where all instances of it on the command-line will
+ be reconstructed into a strvec. This is useful when you need to
+ pass the command-line option, which can be specified multiple times,
+ to another command.
+
+`OPT_CMDMODE(short, long, &int_var, description, enum_val)`::
+	Define an "operation mode" option; only one option in the same
+	group of "operation mode" options that share the same `int_var`
+	can be given by the user. `int_var` is set to `enum_val` when the
+	option is used, but an error is reported if another "operation
+	mode" option has already stored a value in the same `int_var`.
+ In new commands consider using subcommands instead.
+
+`OPT_SUBCOMMAND(long, &fn_ptr, subcommand_fn)`::
+ Define a subcommand. `subcommand_fn` is put into `fn_ptr` when
+ this subcommand is used.
+
+The last element of the array must be `OPT_END()`.
+
+If not stated otherwise, interpret the arguments as follows:
+
+* `short` is a character for the short option
+ (e.g. `'e'` for `-e`, use `0` to omit),
+
+* `long` is a string for the long option
+ (e.g. `"example"` for `--example`, use `NULL` to omit),
+
+* `int_var` is an integer variable,
+
+* `str_var` is a string variable (`char *`),
+
+* `arg_str` is the string that is shown as argument
+ (e.g. `"branch"` will result in `<branch>`).
+ If set to `NULL`, three dots (`...`) will be displayed.
+
+* `description` is a short string to describe the effect of the option.
+ It shall begin with a lower-case letter and a full stop (`.`) shall be
+ omitted at the end.
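+
+Putting these pieces together, a minimal (hypothetical) builtin using a
+few of the macros above might look like this:
+
+	static const char * const builtin_foo_usage[] = {
+		"git foo [-q] [--depth <n>] [-m <msg>] [<file>...]",
+		NULL
+	};
+	int cmd_foo(int argc, const char **argv, const char *prefix)
+	{
+		int quiet = 0, depth = -1;
+		const char *message = NULL;
+		struct option builtin_foo_options[] = {
+			OPT__QUIET(&quiet, "suppress output"),
+			OPT_INTEGER('d', "depth", &depth, "limit recursion depth"),
+			OPT_STRING('m', "message", &message, "msg",
+				   "use the given message"),
+			OPT_END()
+		};
+		argc = parse_options(argc, argv, prefix, builtin_foo_options,
+				     builtin_foo_usage, 0);
+		/* argc now counts the remaining non-option arguments */
+		return 0;
+	}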
+
+Option Callbacks
+----------------
+
+The function must be defined in this form:
+
+ int func(const struct option *opt, const char *arg, int unset)
+
+The callback mechanism is as follows:
+
+* Inside `func`, the only interesting member of the structure
+ given by `opt` is the void pointer `opt->value`.
+ `*opt->value` will be the value that is saved into `var`, if you
+ use `OPT_CALLBACK()`.
+ For example, do `*(unsigned long *)opt->value = 42;` to get 42
+ into an `unsigned long` variable.
+
+* Return value `0` indicates success and non-zero return
+ value will invoke `usage_with_options()` and, thus, die.
+
+* If the user negates the option, `arg` is `NULL` and `unset` is 1.
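+
+A small hypothetical callback following these rules (parsing its argument
+into an `unsigned long`; `depth_var` stands for an `unsigned long` in the
+caller) might look like:
+
+	static int parse_depth_cb(const struct option *opt, const char *arg,
+				  int unset)
+	{
+		unsigned long *depth = opt->value;
+		if (unset) {		/* --no-depth given */
+			*depth = 0;
+			return 0;
+		}
+		*depth = strtoul(arg, NULL, 10);
+		return 0;	/* non-zero would show usage and die */
+	}
+	...
+	OPT_CALLBACK('d', "depth", &depth_var, "n",
+		     "maximum depth", parse_depth_cb),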
+
+Sophisticated option parsing
+----------------------------
+
+If you need, for example, option callbacks with optional arguments
+or without arguments at all, or other special cases that are not
+handled by the macros above, you need to specify the members of the
+`option` structure manually.
+
+This is not covered in this document, but is well documented
+in `parse-options.h` itself.
+
+Examples
+--------
+
+See `test-parse-options.c` and
+`builtin/add.c`,
+`builtin/clone.c`,
+`builtin/commit.c`,
+`builtin/fetch.c`,
+`builtin/fsck.c`,
+`builtin/rm.c`
+for real-world examples.
diff --git a/Documentation/technical/api-simple-ipc.txt b/Documentation/technical/api-simple-ipc.txt
new file mode 100644
index 0000000..c4fb152
--- /dev/null
+++ b/Documentation/technical/api-simple-ipc.txt
@@ -0,0 +1,105 @@
+Simple-IPC API
+==============
+
+The Simple-IPC API is a collection of `ipc_` prefixed library routines
+and a basic communication protocol that allows an IPC-client process to
+send an application-specific IPC-request message to an IPC-server
+process and receive an application-specific IPC-response message.
+
+Communication occurs over a named pipe on Windows and a Unix domain
+socket on other platforms. IPC-clients and IPC-servers rendezvous at
+a previously agreed-to application-specific pathname (which is outside
+the scope of this design) that is local to the computer system.
+
+The IPC-server routines within the server application process create a
+thread pool to listen for connections and receive request messages
+from multiple concurrent IPC-clients. When received, these messages
+are dispatched up to the server application callbacks for handling.
+IPC-server routines then incrementally relay responses back to the
+IPC-client.
+
+The IPC-client routines within a client application process connect
+to the IPC-server and send a request message and wait for a response.
+When received, the response is returned back to the caller.
+
+For example, the `fsmonitor--daemon` feature will be built as a server
+application on top of the IPC-server library routines. It will have
+threads watching for file system events and a thread pool waiting for
+client connections. Clients, such as `git status`, will request a list
+of file system events since a point in time and the server will
+respond with a list of changed files and directories. The formats of
+the request and response are application-specific; the IPC-client and
+IPC-server routines treat them as opaque byte streams.
+
+
+Comparison with sub-process model
+---------------------------------
+
+The Simple-IPC mechanism differs from the existing `sub-process.c`
+model (Documentation/technical/long-running-process-protocol.txt)
+used by applications like Git-LFS. In the LFS-style sub-process model,
+the helper is started by the foreground process, communication happens
+via a pair of file descriptors bound to the stdin/stdout of the
+sub-process, the sub-process only serves the current foreground
+process, and the sub-process exits when the foreground process
+terminates.
+
+In the Simple-IPC model the server is a very long-running service. It
+can service many clients at the same time and has a private socket or
+named pipe connection to each active client. It might be started
+(on-demand) by the current client process or it might have been
+started by a previous client or by the OS at boot time. The server
+process is not associated with a terminal and it persists after
+clients terminate. Clients do not have access to the stdin/stdout of
+the server process and therefore must communicate over sockets or
+named pipes.
+
+
+Server startup and shutdown
+---------------------------
+
+How an application server based upon IPC-server is started is also
+outside the scope of the Simple-IPC design and is a property of the
+application using it. For example, the server might be started or
+restarted during routine maintenance operations, or it might be
+started as a system service during the system boot-up sequence, or it
+might be started on-demand by a foreground Git command when needed.
+
+Similarly, server shutdown is a property of the application using
+the simple-ipc routines. For example, the server might decide to
+shutdown when idle or only upon explicit request.
+
+
+Simple-IPC protocol
+-------------------
+
+The Simple-IPC protocol consists of a single request message from the
+client and an optional response message from the server. Both the
+client and server messages are unlimited in length and are terminated
+with a flush packet.
+
+The pkt-line routines (linkgit:gitprotocol-common[5])
+are used to simplify buffer management during message generation,
+transmission, and reception. A flush packet is used to mark the end
+of the message. This allows the sender to incrementally generate and
+transmit the message. It allows the receiver to incrementally receive
+the message in chunks and to know when they have received the entire
+message.
+
+The actual byte format of the client request and server response
+messages is application-specific. The IPC layer transmits and
+receives them as opaque byte buffers without any concern for the
+content within. It is the job of the calling application layer to
+understand the contents of the request and response messages.
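+
+As an illustration of the intended application-level flow (the names below
+are placeholders rather than the actual `ipc_` entry points, which are
+declared in simple-ipc.h):
+
+	struct strbuf answer = STRBUF_INIT;
+	/* placeholder wrapper around the ipc_client_* routines */
+	if (app_ipc_send_request(rendezvous_path,
+				 "changed-paths since <token>",
+				 &answer) < 0)
+		die("IPC request failed");
+	/* 'answer' now holds the opaque application-defined response */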
+
+
+Summary
+-------
+
+Conceptually, the Simple-IPC protocol is similar to an HTTP REST
+request. Clients connect, make an application-specific and
+stateless request, receive an application-specific
+response, and disconnect. It is a one round trip facility for
+querying the server. The Simple-IPC routines hide the socket,
+named pipe, and thread pool details and allow the application
+layer to focus on the task at hand.
diff --git a/Documentation/technical/api-trace2.txt b/Documentation/technical/api-trace2.txt
new file mode 100644
index 0000000..de5fc25
--- /dev/null
+++ b/Documentation/technical/api-trace2.txt
@@ -0,0 +1,1339 @@
+= Trace2 API
+
+The Trace2 API can be used to print debug, performance, and telemetry
+information to stderr or a file. The Trace2 feature is inactive unless
+explicitly enabled by enabling one or more Trace2 Targets.
+
+The Trace2 API is intended to replace the existing (Trace1)
+`printf()`-style tracing provided by the existing `GIT_TRACE` and
+`GIT_TRACE_PERFORMANCE` facilities. During initial implementation,
+Trace2 and Trace1 may operate in parallel.
+
+The Trace2 API defines a set of high-level messages with known fields,
+such as (`start`: `argv`) and (`exit`: {`exit-code`, `elapsed-time`}).
+
+Trace2 instrumentation throughout the Git code base sends Trace2
+messages to the enabled Trace2 Targets. Targets transform the content
+of these messages into purpose-specific formats and write events to
+their data streams. In this manner, the Trace2 API can drive
+many different types of analysis.
+
+Targets are defined using a VTable allowing easy extension to other
+formats in the future. This might be used to define a binary format,
+for example.
+
+Trace2 is controlled using `trace2.*` config values in the system and
+global config files and `GIT_TRACE2*` environment variables. Trace2 does
+not read from repo local or worktree config files, nor does it respect
+`-c` command line config settings.
+
+== Trace2 Targets
+
+Trace2 defines the following set of Trace2 Targets.
+Format details are given in a later section.
+
+=== The Normal Format Target
+
+The normal format target is a traditional `printf()` format and similar
+to the `GIT_TRACE` format. This format is enabled with the `GIT_TRACE2`
+environment variable or the `trace2.normalTarget` system or global
+config setting.
+
+For example
+
+------------
+$ export GIT_TRACE2=~/log.normal
+$ git version
+git version 2.20.1.155.g426c96fcdb
+------------
+
+or
+
+------------
+$ git config --global trace2.normalTarget ~/log.normal
+$ git version
+git version 2.20.1.155.g426c96fcdb
+------------
+
+yields
+
+------------
+$ cat ~/log.normal
+12:28:42.620009 common-main.c:38 version 2.20.1.155.g426c96fcdb
+12:28:42.620989 common-main.c:39 start git version
+12:28:42.621101 git.c:432 cmd_name version (version)
+12:28:42.621215 git.c:662 exit elapsed:0.001227 code:0
+12:28:42.621250 trace2/tr2_tgt_normal.c:124 atexit elapsed:0.001265 code:0
+------------
+
+=== The Performance Format Target
+
+The performance format target (PERF) is a column-based format to
+replace `GIT_TRACE_PERFORMANCE` and is suitable for development and
+testing, possibly to complement tools like `gprof`. This format is
+enabled with the `GIT_TRACE2_PERF` environment variable or the
+`trace2.perfTarget` system or global config setting.
+
+For example
+
+------------
+$ export GIT_TRACE2_PERF=~/log.perf
+$ git version
+git version 2.20.1.155.g426c96fcdb
+------------
+
+or
+
+------------
+$ git config --global trace2.perfTarget ~/log.perf
+$ git version
+git version 2.20.1.155.g426c96fcdb
+------------
+
+yields
+
+------------
+$ cat ~/log.perf
+12:28:42.620675 common-main.c:38 | d0 | main | version | | | | | 2.20.1.155.g426c96fcdb
+12:28:42.621001 common-main.c:39 | d0 | main | start | | 0.001173 | | | git version
+12:28:42.621111 git.c:432 | d0 | main | cmd_name | | | | | version (version)
+12:28:42.621225 git.c:662 | d0 | main | exit | | 0.001227 | | | code:0
+12:28:42.621259 trace2/tr2_tgt_perf.c:211 | d0 | main | atexit | | 0.001265 | | | code:0
+------------
+
+=== The Event Format Target
+
+The event format target is a JSON-based format of event data suitable
+for telemetry analysis. This format is enabled with the `GIT_TRACE2_EVENT`
+environment variable or the `trace2.eventTarget` system or global config
+setting.
+
+For example
+
+------------
+$ export GIT_TRACE2_EVENT=~/log.event
+$ git version
+git version 2.20.1.155.g426c96fcdb
+------------
+
+or
+
+------------
+$ git config --global trace2.eventTarget ~/log.event
+$ git version
+git version 2.20.1.155.g426c96fcdb
+------------
+
+yields
+
+------------
+$ cat ~/log.event
+{"event":"version","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.620713Z","file":"common-main.c","line":38,"evt":"3","exe":"2.20.1.155.g426c96fcdb"}
+{"event":"start","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621027Z","file":"common-main.c","line":39,"t_abs":0.001173,"argv":["git","version"]}
+{"event":"cmd_name","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621122Z","file":"git.c","line":432,"name":"version","hierarchy":"version"}
+{"event":"exit","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621236Z","file":"git.c","line":662,"t_abs":0.001227,"code":0}
+{"event":"atexit","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621268Z","file":"trace2/tr2_tgt_event.c","line":163,"t_abs":0.001265,"code":0}
+------------
+
+=== Enabling a Target
+
+To enable a target, set the corresponding environment variable or
+system or global config value to one of the following:
+
+include::../trace2-target-values.txt[]
+
+When trace files are written to a target directory, they will be named according
+to the last component of the SID (optionally followed by a counter to avoid
+filename collisions).
+
+== Trace2 API
+
+The Trace2 public API is defined and documented in `trace2.h`; refer to it for
+more information. All public functions and macros are prefixed
+with `trace2_` and are implemented in `trace2.c`.
+
+There are no public Trace2 data structures.
+
+The Trace2 code also defines a set of private functions and data types
+in the `trace2/` directory. These symbols are prefixed with `tr2_`
+and should only be used by functions in `trace2.c` (or other private
+source files in `trace2/`).
+
+=== Conventions for Public Functions and Macros
+
+Some functions have a `_fl()` suffix to indicate that they take `file`
+and `line-number` arguments.
+
+Some functions have a `_va_fl()` suffix to indicate that they also
+take a `va_list` argument.
+
+Some functions have a `_printf_fl()` suffix to indicate that they also
+take a `printf()` style format with a variable number of arguments.
+
+CPP wrapper macros are defined to hide most of these details.
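+
+For example, the `trace2_cmd_name()` macro that callers use is roughly a
+wrapper of this shape (the authoritative declarations are in `trace2.h`):
+
+----------------
+/* approximate shape; see trace2.h for the real declarations */
+void trace2_cmd_name_fl(const char *file, int line, const char *name);
+
+#define trace2_cmd_name(name) \
+	trace2_cmd_name_fl(__FILE__, __LINE__, (name))
+----------------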
+
+== Trace2 Target Formats
+
+=== NORMAL Format
+
+Events are written as lines of the form:
+
+------------
+[<time> SP <filename>:<line> SP+] <event-name> [[SP] <event-message>] LF
+------------
+
+`<event-name>`::
+
+ is the event name.
+
+`<event-message>`::
+ is a free-form `printf()` message intended for human consumption.
++
+Note that this may contain embedded LF or CRLF characters that are
+not escaped, so the event may spill across multiple lines.
+
+If `GIT_TRACE2_BRIEF` or `trace2.normalBrief` is true, the `time`, `filename`,
+and `line` fields are omitted.
+
+This target is intended to be more of a summary (like GIT_TRACE) and
+less detailed than the other targets. It ignores thread, region, and
+data messages, for example.
+
+=== PERF Format
+
+Events are written as lines of the form:
+
+------------
+[<time> SP <filename>:<line> SP+
+ BAR SP] d<depth> SP
+ BAR SP <thread-name> SP+
+ BAR SP <event-name> SP+
+ BAR SP [r<repo-id>] SP+
+ BAR SP [<t_abs>] SP+
+ BAR SP [<t_rel>] SP+
+ BAR SP [<category>] SP+
+ BAR SP DOTS* <perf-event-message>
+ LF
+------------
+
+`<depth>`::
+ is the git process depth. This is the number of parent
+ git processes. A top-level git command has depth value "d0".
+ A child of it has depth value "d1". A second level child
+ has depth value "d2" and so on.
+
+`<thread-name>`::
+ is a unique name for the thread. The primary thread
+ is called "main". Other thread names are of the form "th%d:%s"
+ and include a unique number and the name of the thread-proc.
+
+`<event-name>`::
+ is the event name.
+
+`<repo-id>`::
+ when present, is a number indicating the repository
+ in use. A `def_repo` event is emitted when a repository is
+ opened. This defines the repo-id and associated worktree.
+ Subsequent repo-specific events will reference this repo-id.
++
+Currently, this is always "r1" for the main repository.
+This field is in anticipation of in-proc submodules in the future.
+
+`<t_abs>`::
+ when present, is the absolute time in seconds since the
+ program started.
+
+`<t_rel>`::
+ when present, is time in seconds relative to the start of
+ the current region. For a thread-exit event, it is the elapsed
+ time of the thread.
+
+`<category>`::
+ is present on region and data events and is used to
+ indicate a broad category, such as "index" or "status".
+
+`<perf-event-message>`::
+ is a free-form `printf()` message intended for human consumption.
+
+------------
+15:33:33.532712 wt-status.c:2310 | d0 | main | region_enter | r1 | 0.126064 | | status | label:print
+15:33:33.532712 wt-status.c:2331 | d0 | main | region_leave | r1 | 0.127568 | 0.001504 | status | label:print
+------------
+
+If `GIT_TRACE2_PERF_BRIEF` or `trace2.perfBrief` is true, the `time`, `file`,
+and `line` fields are omitted.
+
+------------
+d0 | main | region_leave | r1 | 0.011717 | 0.009122 | index | label:preload
+------------
+
+The PERF target is intended for interactive performance analysis
+during development and is quite noisy.
+
+=== EVENT Format
+
+Each event is a JSON-object containing multiple key/value pairs
+written as a single line and followed by a LF.
+
+------------
+'{' <key> ':' <value> [',' <key> ':' <value>]* '}' LF
+------------
+
+Some key/value pairs are common to all events and some are
+event-specific.
+
+==== Common Key/Value Pairs
+
+The following key/value pairs are common to all events:
+
+------------
+{
+ "event":"version",
+ "sid":"20190408T191827.272759Z-H9b68c35f-P00003510",
+ "thread":"main",
+ "time":"2019-04-08T19:18:27.282761Z",
+ "file":"common-main.c",
+ "line":42,
+ ...
+}
+------------
+
+`"event":<event>`::
+ is the event name.
+
+`"sid":<sid>`::
+ is the session-id. This is a unique string to identify the
+ process instance to allow all events emitted by a process to
+ be identified. A session-id is used instead of a PID because
+ PIDs are recycled by the OS. For child git processes, the
+ session-id is prepended with the session-id of the parent git
+ process to allow parent-child relationships to be identified
+ during post-processing.
+
+`"thread":<thread>`::
+ is the thread name.
+
+`"time":<time>`::
+ is the UTC time of the event.
+
+`"file":<filename>`::
+	is the source file generating the event.
+
+`"line":<line-number>`::
+ is the integer source line number generating the event.
+
+`"repo":<repo-id>`::
+ when present, is the integer repo-id as described previously.
+
+If `GIT_TRACE2_EVENT_BRIEF` or `trace2.eventBrief` is true, the `file`
+and `line` fields are omitted from all events and the `time` field is
+only present on the "start" and "atexit" events.
+
+==== Event-Specific Key/Value Pairs
+
+`"version"`::
+ This event gives the version of the executable and the EVENT format. It
+ should always be the first event in a trace session. The EVENT format
+ version will be incremented if new event types are added, if existing
+ fields are removed, or if there are significant changes in
+ interpretation of existing events or fields. Smaller changes, such as
+ adding a new field to an existing event, will not require an increment
+ to the EVENT format version.
++
+------------
+{
+ "event":"version",
+ ...
+ "evt":"3", # EVENT format version
+ "exe":"2.20.1.155.g426c96fcdb" # git version
+}
+------------
+
+`"too_many_files"`::
+ This event is written to the git-trace2-discard sentinel file if there
+ are too many files in the target trace directory (see the
+ trace2.maxFiles config option).
++
+------------
+{
+ "event":"too_many_files",
+ ...
+}
+------------
+
+`"start"`::
+ This event contains the complete argv received by main().
++
+------------
+{
+ "event":"start",
+ ...
+ "t_abs":0.001227, # elapsed time in seconds
+ "argv":["git","version"]
+}
+------------
+
+`"exit"`::
+ This event is emitted when git calls `exit()`.
++
+------------
+{
+ "event":"exit",
+ ...
+ "t_abs":0.001227, # elapsed time in seconds
+ "code":0 # exit code
+}
+------------
+
+`"atexit"`::
+ This event is emitted by the Trace2 `atexit` routine during
+ final shutdown. It should be the last event emitted by the
+ process.
++
+(The elapsed time reported here is greater than the time reported in
+the "exit" event because it runs after all other atexit tasks have
+completed.)
++
+------------
+{
+ "event":"atexit",
+ ...
+ "t_abs":0.001227, # elapsed time in seconds
+ "code":0 # exit code
+}
+------------
+
+`"signal"`::
+ This event is emitted when the program is terminated by a user
+ signal. Depending on the platform, the signal event may
+ prevent the "atexit" event from being generated.
++
+------------
+{
+ "event":"signal",
+ ...
+ "t_abs":0.001227, # elapsed time in seconds
+ "signo":13 # SIGTERM, SIGINT, etc.
+}
+------------
+
+`"error"`::
+ This event is emitted when one of the `BUG()`, `bug()`, `error()`,
+ `die()`, `warning()`, or `usage()` functions are called.
++
+------------
+{
+ "event":"error",
+ ...
+ "msg":"invalid option: --cahced", # formatted error message
+ "fmt":"invalid option: %s" # error format string
+}
+------------
++
+The error event may be emitted more than once. The format string
+allows post-processors to group errors by type without worrying
+about specific error arguments.
+
+`"cmd_path"`::
+ This event contains the discovered full path of the git
+ executable (on platforms that are configured to resolve it).
++
+------------
+{
+ "event":"cmd_path",
+ ...
+ "path":"C:/work/gfw/git.exe"
+}
+------------
+
+`"cmd_ancestry"`::
+ This event contains the text command name for the parent (and earlier
+ generations of parents) of the current process, in an array ordered from
+ nearest parent to furthest great-grandparent. It may not be implemented
+ on all platforms.
++
+------------
+{
+ "event":"cmd_ancestry",
+ ...
+ "ancestry":["bash","tmux: server","systemd"]
+}
+------------
+
+`"cmd_name"`::
+ This event contains the command name for this git process
+ and the hierarchy of commands from parent git processes.
++
+------------
+{
+ "event":"cmd_name",
+ ...
+ "name":"pack-objects",
+ "hierarchy":"push/pack-objects"
+}
+------------
++
+Normally, the "name" field contains the canonical name of the
+command. When a canonical name is not available, one of
+these special values is used:
++
+------------
+"_query_" # "git --html-path"
+"_run_dashed_" # when "git foo" tries to run "git-foo"
+"_run_shell_alias_" # alias expansion to a shell command
+"_run_git_alias_" # alias expansion to a git command
+"_usage_" # usage error
+------------
+
+`"cmd_mode"`::
+ This event, when present, describes the command variant. This
+ event may be emitted more than once.
++
+------------
+{
+ "event":"cmd_mode",
+ ...
+ "name":"branch"
+}
+------------
++
+The "name" field is an arbitrary string to describe the command mode.
+For example, checkout can check out a branch or an individual file,
+and these variations typically have different performance
+characteristics that are not comparable.
+
+`"alias"`::
+ This event is present when an alias is expanded.
++
+------------
+{
+ "event":"alias",
+ ...
+ "alias":"l", # registered alias
+ "argv":["log","--graph"] # alias expansion
+}
+------------
+
+`"child_start"`::
+ This event describes a child process that is about to be
+ spawned.
++
+------------
+{
+ "event":"child_start",
+ ...
+ "child_id":2,
+ "child_class":"?",
+ "use_shell":false,
+ "argv":["git","rev-list","--objects","--stdin","--not","--all","--quiet"]
+
+ "hook_name":"<hook_name>" # present when child_class is "hook"
+ "cd":"<path>" # present when cd is required
+}
+------------
++
+The "child_id" field can be used to match this child_start with the
+corresponding child_exit event.
++
+The "child_class" field is a rough classification, such as "editor",
+"pager", "transport/*", and "hook". Unclassified children are classified
+with "?".
+
+`"child_exit"`::
+ This event is generated after the current process has returned
+ from the `waitpid()` and collected the exit information from the
+ child.
++
+------------
+{
+ "event":"child_exit",
+ ...
+ "child_id":2,
+ "pid":14708, # child PID
+ "code":0, # child exit-code
+ "t_rel":0.110605 # observed run-time of child process
+}
+------------
++
+Note that the session-id of the child process is not available to
+the current/spawning process, so the child's PID is reported here as
+a hint for post-processing. (But it is only a hint because the child
+process may be a shell script which doesn't have a session-id.)
++
+Note that the `t_rel` field contains the observed run time in seconds
+for the child process (starting before the fork/exec/spawn and
+stopping after the `waitpid()`), and so includes OS process creation overhead.
+So this time will be slightly larger than the atexit time reported by
+the child process itself.
+
+`"child_ready"`::
+ This event is generated after the current process has started
+ a background process and released all handles to it.
++
+------------
+{
+ "event":"child_ready",
+ ...
+ "child_id":2,
+ "pid":14708, # child PID
+ "ready":"ready", # child ready state
+ "t_rel":0.110605 # observed run-time of child process
+}
+------------
++
+Note that the session-id of the child process is not available to
+the current/spawning process, so the child's PID is reported here as
+a hint for post-processing. (But it is only a hint because the child
+process may be a shell script which doesn't have a session-id.)
++
+This event is generated after the child is started in the background
+and given a little time to boot up and start working. If the child
+starts up normally while the parent is still waiting, the "ready"
+field will have the value "ready".
+If the child is too slow to start and the parent times out, the field
+will have the value "timeout".
+If the child starts but the parent is unable to probe it, the field
+will have the value "error".
++
+After the parent process emits this event, it will release all of its
+handles to the child process and treat the child as a background
+daemon. So even if the child does eventually finish booting up,
+the parent will not emit an updated event.
++
+Note that the `t_rel` field contains the observed run time in seconds
+when the parent released the child process into the background.
+The child is assumed to be a long-running daemon process and may
+outlive the parent process. So the parent's child event times should
+not be compared to the child's atexit times.
+
+`"exec"`::
+ This event is generated before git attempts to `exec()`
+ another command rather than starting a child process.
++
+------------
+{
+ "event":"exec",
+ ...
+ "exec_id":0,
+ "exe":"git",
+ "argv":["foo", "bar"]
+}
+------------
++
+The "exec_id" field is a command-unique id and is only useful if the
+`exec()` fails and a corresponding exec_result event is generated.
+
+`"exec_result"`::
+ This event is generated if the `exec()` fails and control
+ returns to the current git command.
++
+------------
+{
+ "event":"exec_result",
+ ...
+ "exec_id":0,
+ "code":1 # error code (errno) from exec()
+}
+------------
+
+`"thread_start"`::
+ This event is generated when a thread is started. It is
+ generated from *within* the new thread's thread-proc (because
+ it needs to access data in the thread's thread-local storage).
++
+------------
+{
+ "event":"thread_start",
+ ...
+ "thread":"th02:preload_thread" # thread name
+}
+------------
+
+`"thread_exit"`::
+ This event is generated when a thread exits. It is generated
+ from *within* the thread's thread-proc.
++
+------------
+{
+ "event":"thread_exit",
+ ...
+ "thread":"th02:preload_thread", # thread name
+ "t_rel":0.007328 # thread elapsed time
+}
+------------
+
+`"def_param"`::
+ This event is generated to log a global parameter, such as a config
+ setting, command-line flag, or environment variable.
++
+------------
+{
+ "event":"def_param",
+ ...
+ "scope":"global",
+ "param":"core.abbrev",
+ "value":"7"
+}
+------------
+
+`"def_repo"`::
+ This event defines a repo-id and associates it with the root
+ of the worktree.
++
+------------
+{
+ "event":"def_repo",
+ ...
+ "repo":1,
+ "worktree":"/Users/jeffhost/work/gfw"
+}
+------------
++
+As stated earlier, the repo-id is currently always 1, so there will
+only be one def_repo event. Later, if in-proc submodules are
+supported, a def_repo event should be emitted for each submodule
+visited.
+
+`"region_enter"`::
+ This event is generated when entering a region.
++
+------------
+{
+ "event":"region_enter",
+ ...
+ "repo":1, # optional
+ "nesting":1, # current region stack depth
+ "category":"index", # optional
+ "label":"do_read_index", # optional
+ "msg":".git/index" # optional
+}
+------------
++
+The `category` field may be used in a future enhancement to
+do category-based filtering.
++
+`GIT_TRACE2_EVENT_NESTING` or `trace2.eventNesting` can be used to
+filter deeply nested regions and data events. It defaults to "2".
+
+`"region_leave"`::
+ This event is generated when leaving a region.
++
+------------
+{
+ "event":"region_leave",
+ ...
+ "repo":1, # optional
+ "t_rel":0.002876, # time spent in region in seconds
+ "nesting":1, # region stack depth
+ "category":"index", # optional
+ "label":"do_read_index", # optional
+ "msg":".git/index" # optional
+}
+------------
+
+`"data"`::
+ This event is generated to log a thread- and region-local
+ key/value pair.
++
+------------
+{
+ "event":"data",
+ ...
+ "repo":1, # optional
+ "t_abs":0.024107, # absolute elapsed time
+ "t_rel":0.001031, # elapsed time in region/thread
+ "nesting":2, # region stack depth
+ "category":"index",
+ "key":"read/cache_nr",
+ "value":"3552"
+}
+------------
++
+The "value" field may be an integer or a string.
+
+`"data-json"`::
+ This event is generated to log a pre-formatted JSON string
+ containing structured data.
++
+------------
+{
+ "event":"data_json",
+ ...
+ "repo":1, # optional
+ "t_abs":0.015905,
+ "t_rel":0.015905,
+ "nesting":1,
+ "category":"process",
+ "key":"windows/ancestry",
+ "value":["bash.exe","bash.exe"]
+}
+------------
+
+`"th_timer"`::
+ This event logs the amount of time that a stopwatch timer was
+ running in the thread. This event is generated when a thread
+ exits for timers that requested per-thread events.
++
+------------
+{
+ "event":"th_timer",
+ ...
+ "category":"my_category",
+ "name":"my_timer",
+  "intervals":5,		 # number of times it was started/stopped
+ "t_total":0.052741, # total time in seconds it was running
+ "t_min":0.010061, # shortest interval
+ "t_max":0.011648 # longest interval
+}
+------------
+
+`"timer"`::
+ This event logs the amount of time that a stopwatch timer was
+ running aggregated across all threads. This event is generated
+ when the process exits.
++
+------------
+{
+ "event":"timer",
+ ...
+ "category":"my_category",
+ "name":"my_timer",
+  "intervals":5,		 # number of times it was started/stopped
+ "t_total":0.052741, # total time in seconds it was running
+ "t_min":0.010061, # shortest interval
+ "t_max":0.011648 # longest interval
+}
+------------
+
+`"th_counter"`::
+ This event logs the value of a counter variable in a thread.
+ This event is generated when a thread exits for counters that
+ requested per-thread events.
++
+------------
+{
+ "event":"th_counter",
+ ...
+ "category":"my_category",
+ "name":"my_counter",
+ "count":23
+}
+------------
+
+`"counter"`::
+ This event logs the value of a counter variable across all threads.
+ This event is generated when the process exits. The total value
+ reported here is the sum across all threads.
++
+------------
+{
+ "event":"counter",
+ ...
+ "category":"my_category",
+ "name":"my_counter",
+ "count":23
+}
+------------
+
+
+== Example Trace2 API Usage
+
+Here is a hypothetical usage of the Trace2 API showing the intended
+usage (without worrying about the actual Git details).
+
+Initialization::
+
+ Initialization happens in `main()`. Behind the scenes, an
+ `atexit` and `signal` handler are registered.
++
+----------------
+int main(int argc, const char **argv)
+{
+ int exit_code;
+
+ trace2_initialize();
+ trace2_cmd_start(argv);
+
+ exit_code = cmd_main(argc, argv);
+
+ trace2_cmd_exit(exit_code);
+
+ return exit_code;
+}
+----------------
+
+Command Details::
+
+ After the basics are established, additional command
+ information can be sent to Trace2 as it is discovered.
++
+----------------
+int cmd_checkout(int argc, const char **argv)
+{
+ trace2_cmd_name("checkout");
+ trace2_cmd_mode("branch");
+ trace2_def_repo(the_repository);
+
+ // emit "def_param" messages for "interesting" config settings.
+ trace2_cmd_list_config();
+
+ if (do_something())
+ trace2_cmd_error("Path '%s': cannot do something", path);
+
+ return 0;
+}
+----------------
+
+Child Processes::
+
+ Wrap code spawning child processes.
++
+----------------
+void run_child(...)
+{
+ int child_exit_code;
+ struct child_process cmd = CHILD_PROCESS_INIT;
+ ...
+ cmd.trace2_child_class = "editor";
+
+ trace2_child_start(&cmd);
+ child_exit_code = spawn_child_and_wait_for_it();
+ trace2_child_exit(&cmd, child_exit_code);
+}
+----------------
++
+For example, the following fetch command spawned ssh, index-pack,
+rev-list, and gc. This example also shows that fetch took
+5.199 seconds and of that 4.932 was in ssh.
++
+----------------
+$ export GIT_TRACE2_BRIEF=1
+$ export GIT_TRACE2=~/log.normal
+$ git fetch origin
+...
+----------------
++
+----------------
+$ cat ~/log.normal
+version 2.20.1.vfs.1.1.47.g534dbe1ad1
+start git fetch origin
+worktree /Users/jeffhost/work/gfw
+cmd_name fetch (fetch)
+child_start[0] ssh git@github.com ...
+child_start[1] git index-pack ...
+... (Trace2 events from child processes omitted)
+child_exit[1] pid:14707 code:0 elapsed:0.076353
+child_exit[0] pid:14706 code:0 elapsed:4.931869
+child_start[2] git rev-list ...
+... (Trace2 events from child process omitted)
+child_exit[2] pid:14708 code:0 elapsed:0.110605
+child_start[3] git gc --auto
+... (Trace2 events from child process omitted)
+child_exit[3] pid:14709 code:0 elapsed:0.006240
+exit elapsed:5.198503 code:0
+atexit elapsed:5.198541 code:0
+----------------
++
+When a git process is a (direct or indirect) child of another
+git process, it inherits Trace2 context information. This
+allows the child to print the command hierarchy. This example
+shows gc as child[3] of fetch. When the gc process reports
+its name as "gc", it also reports the hierarchy as "fetch/gc".
+(In this example, trace2 messages from the child process are
+indented for clarity.)
++
+----------------
+$ export GIT_TRACE2_BRIEF=1
+$ export GIT_TRACE2=~/log.normal
+$ git fetch origin
+...
+----------------
++
+----------------
+$ cat ~/log.normal
+version 2.20.1.160.g5676107ecd.dirty
+start git fetch official
+worktree /Users/jeffhost/work/gfw
+cmd_name fetch (fetch)
+...
+child_start[3] git gc --auto
+ version 2.20.1.160.g5676107ecd.dirty
+ start /Users/jeffhost/work/gfw/git gc --auto
+ worktree /Users/jeffhost/work/gfw
+ cmd_name gc (fetch/gc)
+ exit elapsed:0.001959 code:0
+ atexit elapsed:0.001997 code:0
+child_exit[3] pid:20303 code:0 elapsed:0.007564
+exit elapsed:3.868938 code:0
+atexit elapsed:3.868970 code:0
+----------------
+
+Regions::
+
+ Regions can be used to time an interesting section of code.
++
+----------------
+void wt_status_collect(struct wt_status *s)
+{
+ trace2_region_enter("status", "worktrees", s->repo);
+ wt_status_collect_changes_worktree(s);
+ trace2_region_leave("status", "worktrees", s->repo);
+
+ trace2_region_enter("status", "index", s->repo);
+ wt_status_collect_changes_index(s);
+ trace2_region_leave("status", "index", s->repo);
+
+ trace2_region_enter("status", "untracked", s->repo);
+ wt_status_collect_untracked(s);
+ trace2_region_leave("status", "untracked", s->repo);
+}
+
+void wt_status_print(struct wt_status *s)
+{
+ trace2_region_enter("status", "print", s->repo);
+ switch (s->status_format) {
+ ...
+ }
+ trace2_region_leave("status", "print", s->repo);
+}
+----------------
++
+In this example, scanning for untracked files ran from +0.012568 to
++0.027149 (since the process started) and took 0.014581 seconds.
++
+----------------
+$ export GIT_TRACE2_PERF_BRIEF=1
+$ export GIT_TRACE2_PERF=~/log.perf
+$ git status
+...
+
+$ cat ~/log.perf
+d0 | main | version | | | | | 2.20.1.160.g5676107ecd.dirty
+d0 | main | start | | 0.001173 | | | git status
+d0 | main | def_repo | r1 | | | | worktree:/Users/jeffhost/work/gfw
+d0 | main | cmd_name | | | | | status (status)
+...
+d0 | main | region_enter | r1 | 0.010988 | | status | label:worktrees
+d0 | main | region_leave | r1 | 0.011236 | 0.000248 | status | label:worktrees
+d0 | main | region_enter | r1 | 0.011260 | | status | label:index
+d0 | main | region_leave | r1 | 0.012542 | 0.001282 | status | label:index
+d0 | main | region_enter | r1 | 0.012568 | | status | label:untracked
+d0 | main | region_leave | r1 | 0.027149 | 0.014581 | status | label:untracked
+d0 | main | region_enter | r1 | 0.027411 | | status | label:print
+d0 | main | region_leave | r1 | 0.028741 | 0.001330 | status | label:print
+d0 | main | exit | | 0.028778 | | | code:0
+d0 | main | atexit | | 0.028809 | | | code:0
+----------------
++
+Regions may be nested. This causes messages to be indented in the
+PERF target, for example.
+Elapsed times are relative to the start of the corresponding nesting
+level, as expected. For example, if we add region messages to:
++
+----------------
+static enum path_treatment read_directory_recursive(struct dir_struct *dir,
+ struct index_state *istate, const char *base, int baselen,
+ struct untracked_cache_dir *untracked, int check_only,
+ int stop_at_first_file, const struct pathspec *pathspec)
+{
+ enum path_treatment state, subdir_state, dir_state = path_none;
+
+ trace2_region_enter_printf("dir", "read_recursive", NULL, "%.*s", baselen, base);
+ ...
+ trace2_region_leave_printf("dir", "read_recursive", NULL, "%.*s", baselen, base);
+ return dir_state;
+}
+----------------
++
+We can further investigate the time spent scanning for untracked files.
++
+----------------
+$ export GIT_TRACE2_PERF_BRIEF=1
+$ export GIT_TRACE2_PERF=~/log.perf
+$ git status
+...
+$ cat ~/log.perf
+d0 | main | version | | | | | 2.20.1.162.gb4ccea44db.dirty
+d0 | main | start | | 0.001173 | | | git status
+d0 | main | def_repo | r1 | | | | worktree:/Users/jeffhost/work/gfw
+d0 | main | cmd_name | | | | | status (status)
+...
+d0 | main | region_enter | r1 | 0.015047 | | status | label:untracked
+d0 | main | region_enter | | 0.015132 | | dir | ..label:read_recursive
+d0 | main | region_enter | | 0.016341 | | dir | ....label:read_recursive vcs-svn/
+d0 | main | region_leave | | 0.016422 | 0.000081 | dir | ....label:read_recursive vcs-svn/
+d0 | main | region_enter | | 0.016446 | | dir | ....label:read_recursive xdiff/
+d0 | main | region_leave | | 0.016522 | 0.000076 | dir | ....label:read_recursive xdiff/
+d0 | main | region_enter | | 0.016612 | | dir | ....label:read_recursive git-gui/
+d0 | main | region_enter | | 0.016698 | | dir | ......label:read_recursive git-gui/po/
+d0 | main | region_enter | | 0.016810 | | dir | ........label:read_recursive git-gui/po/glossary/
+d0 | main | region_leave | | 0.016863 | 0.000053 | dir | ........label:read_recursive git-gui/po/glossary/
+...
+d0 | main | region_enter | | 0.031876 | | dir | ....label:read_recursive builtin/
+d0 | main | region_leave | | 0.032270 | 0.000394 | dir | ....label:read_recursive builtin/
+d0 | main | region_leave | | 0.032414 | 0.017282 | dir | ..label:read_recursive
+d0 | main | region_leave | r1 | 0.032454 | 0.017407 | status | label:untracked
+...
+d0 | main | exit | | 0.034279 | | | code:0
+d0 | main | atexit | | 0.034322 | | | code:0
+----------------
++
+Trace2 regions are similar to the existing trace_performance_enter()
+and trace_performance_leave() routines, but are thread safe and
+maintain per-thread stacks of timers.
+
+Data Messages::
+
+ Data messages added to a region.
++
+----------------
+int read_index_from(struct index_state *istate, const char *path,
+ const char *gitdir)
+{
+ trace2_region_enter_printf("index", "do_read_index", the_repository, "%s", path);
+
+ ...
+
+ trace2_data_intmax("index", the_repository, "read/version", istate->version);
+ trace2_data_intmax("index", the_repository, "read/cache_nr", istate->cache_nr);
+
+ trace2_region_leave_printf("index", "do_read_index", the_repository, "%s", path);
+}
+----------------
++
+This example shows that the index contained 3552 entries.
++
+----------------
+$ export GIT_TRACE2_PERF_BRIEF=1
+$ export GIT_TRACE2_PERF=~/log.perf
+$ git status
+...
+$ cat ~/log.perf
+d0 | main | version | | | | | 2.20.1.156.gf9916ae094.dirty
+d0 | main | start | | 0.001173 | | | git status
+d0 | main | def_repo | r1 | | | | worktree:/Users/jeffhost/work/gfw
+d0 | main | cmd_name | | | | | status (status)
+d0 | main | region_enter | r1 | 0.001791 | | index | label:do_read_index .git/index
+d0 | main | data | r1 | 0.002494 | 0.000703 | index | ..read/version:2
+d0 | main | data | r1 | 0.002520 | 0.000729 | index | ..read/cache_nr:3552
+d0 | main | region_leave | r1 | 0.002539 | 0.000748 | index | label:do_read_index .git/index
+...
+----------------
+
+Thread Events::
+
+ Thread messages added to a thread-proc.
++
+For example, the multi-threaded preload-index code can be
+instrumented with a region around the thread pool and then
+per-thread start and exit events within the thread-proc.
++
+----------------
+static void *preload_thread(void *_data)
+{
+ // start the per-thread clock and emit a message.
+ trace2_thread_start("preload_thread");
+
+ // report which chunk of the array this thread was assigned.
+ trace2_data_intmax("index", the_repository, "offset", p->offset);
+ trace2_data_intmax("index", the_repository, "count", nr);
+
+ do {
+ ...
+ } while (--nr > 0);
+ ...
+
+ // report elapsed time taken by this thread.
+ trace2_thread_exit();
+ return NULL;
+}
+
+void preload_index(struct index_state *index,
+ const struct pathspec *pathspec,
+ unsigned int refresh_flags)
+{
+ trace2_region_enter("index", "preload", the_repository);
+
+ for (i = 0; i < threads; i++) {
+ ... /* create thread */
+ }
+
+ for (i = 0; i < threads; i++) {
+ ... /* join thread */
+ }
+
+ trace2_region_leave("index", "preload", the_repository);
+}
+----------------
++
+In this example preload_index() was executed by the `main` thread
+and started the `preload` region. Seven threads, named
+`th01:preload_thread` through `th07:preload_thread`, were started.
+Events from each thread are atomically appended to the shared target
+stream as they occur, so they may appear in random order with respect
+to other threads. Finally, the main thread waits for the threads to
+finish and leaves the region.
++
+Data events are tagged with the active thread name. They are used
+to report the per-thread parameters.
++
+----------------
+$ export GIT_TRACE2_PERF_BRIEF=1
+$ export GIT_TRACE2_PERF=~/log.perf
+$ git status
+...
+$ cat ~/log.perf
+...
+d0 | main | region_enter | r1 | 0.002595 | | index | label:preload
+d0 | th01:preload_thread | thread_start | | 0.002699 | | |
+d0 | th02:preload_thread | thread_start | | 0.002721 | | |
+d0 | th01:preload_thread | data | r1 | 0.002736 | 0.000037 | index | offset:0
+d0 | th02:preload_thread | data | r1 | 0.002751 | 0.000030 | index | offset:2032
+d0 | th03:preload_thread | thread_start | | 0.002711 | | |
+d0 | th06:preload_thread | thread_start | | 0.002739 | | |
+d0 | th01:preload_thread | data | r1 | 0.002766 | 0.000067 | index | count:508
+d0 | th06:preload_thread | data | r1 | 0.002856 | 0.000117 | index | offset:2540
+d0 | th03:preload_thread | data | r1 | 0.002824 | 0.000113 | index | offset:1016
+d0 | th04:preload_thread | thread_start | | 0.002710 | | |
+d0 | th02:preload_thread | data | r1 | 0.002779 | 0.000058 | index | count:508
+d0 | th06:preload_thread | data | r1 | 0.002966 | 0.000227 | index | count:508
+d0 | th07:preload_thread | thread_start | | 0.002741 | | |
+d0 | th07:preload_thread | data | r1 | 0.003017 | 0.000276 | index | offset:3048
+d0 | th05:preload_thread | thread_start | | 0.002712 | | |
+d0 | th05:preload_thread | data | r1 | 0.003067 | 0.000355 | index | offset:1524
+d0 | th05:preload_thread | data | r1 | 0.003090 | 0.000378 | index | count:508
+d0 | th07:preload_thread | data | r1 | 0.003037 | 0.000296 | index | count:504
+d0 | th03:preload_thread | data | r1 | 0.002971 | 0.000260 | index | count:508
+d0 | th04:preload_thread | data | r1 | 0.002983 | 0.000273 | index | offset:508
+d0 | th04:preload_thread | data | r1 | 0.007311 | 0.004601 | index | count:508
+d0 | th05:preload_thread | thread_exit | | 0.008781 | 0.006069 | |
+d0 | th01:preload_thread | thread_exit | | 0.009561 | 0.006862 | |
+d0 | th03:preload_thread | thread_exit | | 0.009742 | 0.007031 | |
+d0 | th06:preload_thread | thread_exit | | 0.009820 | 0.007081 | |
+d0 | th02:preload_thread | thread_exit | | 0.010274 | 0.007553 | |
+d0 | th07:preload_thread | thread_exit | | 0.010477 | 0.007736 | |
+d0 | th04:preload_thread | thread_exit | | 0.011657 | 0.008947 | |
+d0 | main | region_leave | r1 | 0.011717 | 0.009122 | index | label:preload
+...
+d0 | main | exit | | 0.029996 | | | code:0
+d0 | main | atexit | | 0.030027 | | | code:0
+----------------
++
+In this example, the preload region took 0.009122 seconds. The 7 threads
+took between 0.006069 and 0.008947 seconds to work on their portion of
+the index. Thread "th01" worked on 508 items at offset 0. Thread "th02"
+worked on 508 items at offset 2032. Thread "th04" worked on 508 items
+at offset 508.
++
+This example also shows that thread names are assigned in a racy manner
+as each thread starts.
+
+Config (def param) Events::
+
+ Dump "interesting" config values to trace2 log.
++
+We can optionally emit configuration events; see
+`trace2.configparams` in linkgit:git-config[1] for how to enable
+them.
++
+----------------
+$ git config --system color.ui never
+$ git config --global color.ui always
+$ git config --local color.ui auto
+$ git config --list --show-scope | grep 'color.ui'
+system color.ui=never
+global color.ui=always
+local color.ui=auto
+----------------
++
+Then, mark the `color.ui` config variable as "interesting" with
+`GIT_TRACE2_CONFIG_PARAMS`:
++
+----------------
+$ export GIT_TRACE2_PERF_BRIEF=1
+$ export GIT_TRACE2_PERF=~/log.perf
+$ export GIT_TRACE2_CONFIG_PARAMS=color.ui
+$ git version
+...
+$ cat ~/log.perf
+d0 | main | version | | | | | ...
+d0 | main | start | | 0.001642 | | | /usr/local/bin/git version
+d0 | main | cmd_name | | | | | version (version)
+d0 | main | def_param | | | | scope:system | color.ui:never
+d0 | main | def_param | | | | scope:global | color.ui:always
+d0 | main | def_param | | | | scope:local | color.ui:auto
+d0 | main | data | r0 | 0.002100 | 0.002100 | fsync | fsync/writeout-only:0
+d0 | main | data | r0 | 0.002126 | 0.002126 | fsync | fsync/hardware-flush:0
+d0 | main | exit | | 0.000470 | | | code:0
+d0 | main | atexit | | 0.000477 | | | code:0
+----------------
+
+Stopwatch Timer Events::
+
+ Measure the time spent in a function call or span of code
+ that might be called from many places within the code
+ throughout the life of the process.
++
+----------------
+static void expensive_function(void)
+{
+ trace2_timer_start(TRACE2_TIMER_ID_TEST1);
+ ...
+ sleep_millisec(1000); // Do something expensive
+ ...
+ trace2_timer_stop(TRACE2_TIMER_ID_TEST1);
+}
+
+static int ut_100timer(int argc, const char **argv)
+{
+ ...
+
+ expensive_function();
+
+ // Do something else 1...
+
+ expensive_function();
+
+ // Do something else 2...
+
+ expensive_function();
+
+ return 0;
+}
+----------------
++
+In this example, we measure the total time spent in
+`expensive_function()` regardless of when it is called
+in the overall flow of the program.
++
+----------------
+$ export GIT_TRACE2_PERF_BRIEF=1
+$ export GIT_TRACE2_PERF=~/log.perf
+$ t/helper/test-tool trace2 100timer 3 1000
+...
+$ cat ~/log.perf
+d0 | main | version | | | | | ...
+d0 | main | start | | 0.001453 | | | t/helper/test-tool trace2 100timer 3 1000
+d0 | main | cmd_name | | | | | trace2 (trace2)
+d0 | main | exit | | 3.003667 | | | code:0
+d0 | main | timer | | | | test | name:test1 intervals:3 total:3.001686 min:1.000254 max:1.000929
+d0 | main | atexit | | 3.003796 | | | code:0
+----------------
+
+
+== Future Work
+
+=== Relationship to the Existing Trace API (api-trace.txt)
+
+There are a few issues to resolve before we can completely
+switch to Trace2.
+
+* Updating existing tests that assume `GIT_TRACE` format messages.
+
+* How to best handle custom `GIT_TRACE_<key>` messages?
+
+** The `GIT_TRACE_<key>` mechanism allows each <key> to write to a
+different file (in addition to just stderr).
+
+** Do we want to maintain that ability or simply write to the existing
+Trace2 targets (and convert <key> to a "category")?
diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
new file mode 100644
index 0000000..f5d2009
--- /dev/null
+++ b/Documentation/technical/bitmap-format.txt
@@ -0,0 +1,257 @@
+GIT bitmap v1 format
+====================
+
+== Pack and multi-pack bitmaps
+
+Bitmaps store reachability information about the set of objects in a packfile,
+or a multi-pack index (MIDX). In the former case, the set is simply the
+objects in that packfile; in the latter, it is the union of the objects in
+the packs contained in the MIDX.
+
+A bitmap may belong either to one pack or to the repository's multi-pack index (if
+it exists). A repository may have at most one bitmap.
+
+An object is uniquely described by its bit position within a bitmap:
+
+ - If the bitmap belongs to a packfile, the __n__th bit corresponds to
+ the __n__th object in pack order. For a function `offset` which maps
+ objects to their byte offset within a pack, pack order is defined as
+ follows:
+
+ o1 <= o2 <==> offset(o1) <= offset(o2)
+
+ - If the bitmap belongs to a MIDX, the __n__th bit corresponds to the
+ __n__th object in MIDX order. With an additional function `pack` which
+ maps objects to the pack they were selected from by the MIDX, MIDX order
+ is defined as follows:
+
+ o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)
++
+The ordering between packs is done according to the MIDX's .rev file.
+Notably, the preferred pack sorts ahead of all other packs.
+
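+For illustration, a comparator for MIDX bitmap order might look like the
+following sketch. The struct and the `pack_rank` field are hypothetical
+stand-ins (not Git's actual types); `pack_rank` is assumed to be the pack's
+position in the MIDX's pack ordering, with the preferred pack first.
+
+    #include <stdint.h>
+    struct obj_pos {
+        uint32_t pack_rank; /* position of the containing pack in MIDX order */
+        uint64_t offset;    /* byte offset of the object within that pack */
+    };
+    static int cmp_midx_order(const struct obj_pos *a, const struct obj_pos *b)
+    {
+        if (a->pack_rank != b->pack_rank)
+            return a->pack_rank < b->pack_rank ? -1 : 1;
+        if (a->offset != b->offset)
+            return a->offset < b->offset ? -1 : 1;
+        return 0;
+    }
+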
+The on-disk representation (described below) of a bitmap is the same regardless
+of whether or not that bitmap belongs to a packfile or a MIDX. The only
+difference is the interpretation of the bits, which is described above.
+
+Certain bitmap extensions are supported (see: Appendix B). No extensions are
+required for bitmaps corresponding to packfiles. For bitmaps that correspond to
+MIDXs, both the bit-cache and rev-cache extensions are required.
+
+== On-disk format
+
+ * A header appears at the beginning:
+
+ 4-byte signature: :: {'B', 'I', 'T', 'M'}
+
+ 2-byte version number (network byte order): ::
+
+ The current implementation only supports version 1
+ of the bitmap index (the same one as JGit).
+
+ 2-byte flags (network byte order): ::
+
+ The following flags are supported:
+
+ ** {empty}
+ BITMAP_OPT_FULL_DAG (0x1) REQUIRED: :::
+
+ This flag must always be present. It implies that the
+ bitmap index has been generated for a packfile or
+ multi-pack index (MIDX) with full closure (i.e. where
+ every single object in the packfile/MIDX can find its
+ parent links inside the same packfile/MIDX). This is a
+ requirement for the bitmap index format, also present in
+ JGit, that greatly reduces the complexity of the
+ implementation.
+
+ ** {empty}
+ BITMAP_OPT_HASH_CACHE (0x4): :::
+
+ If present, the end of the bitmap file contains
+ `N` 32-bit name-hash values, one per object in the
+ pack/MIDX. The format and meaning of the name-hash is
+ described below.
+
+ ** {empty}
+ BITMAP_OPT_LOOKUP_TABLE (0x10): :::
+ If present, the end of the bitmap file contains a table
+ containing a list of `N` <commit_pos, offset, xor_row>
+ triplets. The format and meaning of the table is described
+ below.
++
+NOTE: Unlike the xor_offset used to compress an individual bitmap,
+`xor_row` stores an *absolute* index into the lookup table, not a location
+relative to the current entry.
+
+ 4-byte entry count (network byte order): ::
+ The total count of entries (bitmapped commits) in this bitmap index.
+
+ 20-byte checksum: ::
+ The SHA1 checksum of the pack/MIDX this bitmap index
+ belongs to.
+
+ * 4 EWAH bitmaps that act as type indexes
++
+Type indexes are serialized after the hash cache in the shape
+of four EWAH bitmaps stored consecutively (see Appendix A for
+the serialization format of an EWAH bitmap).
++
+There is a bitmap for each Git object type, stored in the following
+order:
++
+ - Commits
+ - Trees
+ - Blobs
+ - Tags
+
++
+In each bitmap, the `n`th bit is set to true if the `n`th object
+in the packfile or multi-pack index is of that type.
++
+The obvious consequence is that the OR of all 4 bitmaps will result
+in a full set (all bits set), and the AND of all 4 bitmaps will
+result in an empty bitmap (no bits set).
+
+ * N entries with compressed bitmaps, one for each indexed commit
++
+Where `N` is the total number of entries in this bitmap index.
+Each entry contains the following:
+
+ ** {empty}
+ 4-byte object position (network byte order): ::
+ The position **in the index for the packfile or
+ multi-pack index** where the bitmap for this commit is
+ found.
+
+ ** {empty}
+ 1-byte XOR-offset: ::
+ The xor offset used to compress this bitmap. For an entry
+ in position `x`, an XOR offset of `y` means that the actual
+ bitmap representing this commit is composed by XORing the
+ bitmap for this entry with the bitmap in entry `x-y` (i.e.
+ the bitmap `y` entries before this one).
++
+NOTE: This compression can be recursive. In order to
+XOR this entry with a previous one, the previous entry needs
+to be decompressed first, and so on.
++
+The hard-limit for this offset is 160 (an entry can only be
+xor'ed against one of the 160 entries preceding it). This
+number is always positive, and hence entries are always xor'ed
+with **previous** bitmaps, not bitmaps that will come afterwards
+in the index.
+
+ ** {empty}
+ 1-byte flags for this bitmap: ::
+ At the moment the only available flag is `0x1`, which hints
+ that this bitmap can be re-used when rebuilding bitmap indexes
+ for the repository.
+
+ ** The compressed bitmap itself, see Appendix A.
+
+ * {empty}
+ TRAILER: ::
+ Trailing checksum of the preceding contents.
+
+== Appendix A: Serialization format for an EWAH bitmap
+
+Ewah bitmaps are serialized in the same format as the JAVAEWAH
+library, making them backwards compatible with the JGit
+implementation:
+
+ - 4-byte number of bits of the resulting UNCOMPRESSED bitmap
+
+ - 4-byte number of words of the COMPRESSED bitmap, when stored
+
+ - N x 8-byte words, as specified by the previous field
++
+This is the actual content of the compressed bitmap.
+
+ - 4-byte position of the current RLW for the compressed
+ bitmap
+
+All words are stored in network byte order for their corresponding
+sizes.
+
+The compressed bitmap is stored in a form of run-length encoding, as
+follows. It consists of a concatenation of an arbitrary number of
+chunks. Each chunk consists of one or more 64-bit words:
+
+ H L_1 L_2 L_3 .... L_M
+
+H is called RLW (run length word). It consists of (from lower to higher
+order bits):
+
+ - 1 bit: the repeated bit B
+
+ - 32 bits: repetition count K (unsigned)
+
+ - 31 bits: literal word count M (unsigned)
+
+The bitstream represented by the above chunk is then:
+
+ - K repetitions of B
+
+ - The bits stored in `L_1` through `L_M`. Within a word, bits at
+ lower order come earlier in the stream than those at higher
+ order.
+
+The next word after `L_M` (if any) must again be a RLW, for the next
+chunk. For efficient appending to the bitstream, the EWAH stores a
+pointer to the last RLW in the stream.
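+
+As a rough illustration, the fields of a single RLW can be unpacked as in
+the sketch below (a hypothetical helper, not code from Git or JGit):
+
+    #include <stdint.h>
+    struct rlw_fields {
+        unsigned run_bit;       /* B: the repeated bit */
+        uint64_t run_len;       /* K: repetition count */
+        uint64_t literal_words; /* M: number of literal words that follow */
+    };
+    static struct rlw_fields decode_rlw(uint64_t rlw)
+    {
+        struct rlw_fields f;
+        f.run_bit       = rlw & 1;                    /* lowest bit */
+        f.run_len       = (rlw >> 1) & 0xFFFFFFFFULL; /* next 32 bits */
+        f.literal_words = rlw >> 33;                  /* top 31 bits */
+        return f;
+    }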
+
+
+== Appendix B: Optional Bitmap Sections
+
+These sections may or may not be present in the `.bitmap` file; their
+presence is indicated by the header flags section described above.
+
+Name-hash cache
+---------------
+
+If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains
+a cache of 32-bit values, one per object in the pack/MIDX. The value at
+position `i` is the hash of the pathname at which the `i`th object
+(counting in index or multi-pack index order) in the pack/MIDX can be found.
+This can be fed into the delta heuristics to compare objects with similar
+pathnames.
+
+The hash algorithm used is:
+
+ hash = 0;
+ while ((c = *name++))
+ if (!isspace(c))
+ hash = (hash >> 2) + (c << 24);
+
+Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag.
+If implementations want to choose a different hashing scheme, they are
+free to do so, but MUST allocate a new header flag (because comparing
+hashes made under two different schemes would be pointless).
+
+Commit lookup table
+-------------------
+
+If the BITMAP_OPT_LOOKUP_TABLE flag is set, the last `N * (4 + 8 + 4)`
+bytes (preceding the name-hash cache and trailing hash) of the `.bitmap`
+file contains a lookup table specifying the information needed to get
+the desired bitmap from the entries without parsing previous unnecessary
+bitmaps.
+
+For a `.bitmap` containing `nr_entries` reachability bitmaps, the table
+contains a list of `nr_entries` <commit_pos, offset, xor_row> triplets
+(sorted in the ascending order of `commit_pos`). The content of the i'th
+triplet is as follows:
+
+ * {empty}
+ commit_pos (4 byte integer, network byte order): ::
+ It stores the object position of a commit (in the midx or pack
+ index).
+
+ * {empty}
+ offset (8 byte integer, network byte order): ::
+ The offset from which that commit's bitmap can be read.
+
+ * {empty}
+ xor_row (4 byte integer, network byte order): ::
+ The position of the triplet whose bitmap is used to compress
+ this one, or `0xffffffff` if no such bitmap exists.
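+
+As an illustration (not Git's actual code), reading the i'th triplet might
+look like the sketch below, where `table` is assumed to point at the start
+of the lookup table:
+
+    #include <stddef.h>
+    #include <stdint.h>
+    struct lookup_triplet {
+        uint32_t commit_pos; /* object position of the commit */
+        uint64_t offset;     /* where this commit's bitmap begins */
+        uint32_t xor_row;    /* triplet used for XOR compression, or 0xffffffff */
+    };
+    static uint32_t get_be32(const unsigned char *p)
+    {
+        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
+               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
+    }
+    static uint64_t get_be64(const unsigned char *p)
+    {
+        return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
+    }
+    static struct lookup_triplet read_triplet(const unsigned char *table, uint32_t i)
+    {
+        const unsigned char *p = table + (size_t)i * (4 + 8 + 4);
+        struct lookup_triplet t;
+        t.commit_pos = get_be32(p);
+        t.offset     = get_be64(p + 4);
+        t.xor_row    = get_be32(p + 12);
+        return t;
+    }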
diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
new file mode 100644
index 0000000..91d3a13
--- /dev/null
+++ b/Documentation/technical/bundle-uri.txt
@@ -0,0 +1,572 @@
+Bundle URIs
+===========
+
+Git bundles are files that store a pack-file along with some extra metadata,
+including a set of refs and a (possibly empty) set of necessary commits. See
+linkgit:git-bundle[1] and linkgit:gitformat-bundle[5] for more information.
+
+Bundle URIs are locations where Git can download one or more bundles in
+order to bootstrap the object database in advance of fetching the remaining
+objects from a remote.
+
+One goal is to speed up clones and fetches for users with poor network
+connectivity to the origin server. Another benefit is to allow heavy users,
+such as CI build farms, to use local resources for the majority of Git data,
+thereby reducing the load on the origin server.
+
+To enable the bundle URI feature, users can specify a bundle URI using
+command-line options or the origin server can advertise one or more URIs
+via a protocol v2 capability.
+
+Design Goals
+------------
+
+The bundle URI standard aims to be flexible enough to satisfy multiple
+workloads. The bundle provider and the Git client have several choices in
+how they create and consume bundle URIs.
+
+* Bundles can have whatever name the server desires. This name could refer
+ to immutable data by using a hash of the bundle contents. However, this
+ means that a new URI will be needed after every update of the content.
+ This might be acceptable if the server is advertising the URI (and the
+ server is aware of new bundles being generated) but would not be
+ ergonomic for users using the command line option.
+
+* The bundles could be organized specifically for bootstrapping full
+ clones, but could also be organized with the intention of bootstrapping
+ incremental fetches. The bundle provider must decide on one of several
+ organization schemes to minimize client downloads during incremental
+ fetches, but the Git client can also choose whether to use bundles for
+ either of these operations.
+
+* The bundle provider can choose to support full clones, partial clones,
+ or both. The client can detect which bundles are appropriate for the
+ repository's partial clone filter, if any.
+
+* The bundle provider can use a single bundle (for clones only), or a
+ list of bundles. When using a list of bundles, the provider can specify
+ whether or not the client needs _all_ of the bundle URIs for a full
+ clone, or if _any_ one of the bundle URIs is sufficient. This allows the
+ bundle provider to use different URIs for different geographies.
+
+* The bundle provider can organize the bundles using heuristics, such as
+ creation tokens, to help the client prevent downloading bundles it does
+ not need. When the bundle provider does not provide these heuristics,
+ the client can use optimizations to minimize how much of the data is
+ downloaded.
+
+* The bundle provider does not need to be associated with the Git server.
+ The client can choose to use the bundle provider without it being
+ advertised by the Git server.
+
+* The client can choose to discover bundle providers that are advertised
+ by the Git server. This could happen during `git clone`, during
+ `git fetch`, both, or neither. The user can choose which combination
+ works best for them.
+
+* The client can choose to configure a bundle provider manually at any
+ time. The client can also choose to specify a bundle provider manually
+ as a command-line option to `git clone`.
+
+Each repository is different and every Git server has different needs.
+Hopefully the bundle URI feature is flexible enough to satisfy all needs.
+If not, then the feature can be extended through its versioning mechanism.
+
+Server requirements
+-------------------
+
+To provide a server-side implementation of bundle servers, no other parts
+of the Git protocol are required. This allows server maintainers to use
+static content solutions such as CDNs in order to serve the bundle files.
+
+At the current scope of the bundle URI feature, all URIs are expected to
+be HTTP(S) URLs where content is downloaded to a local file using a `GET`
+request to that URL. The server could include authentication requirements
+to those requests with the aim of triggering the configured credential
+helper for secure access. (Future extensions could use "file://" URIs or
+SSH URIs.)
+
+Assuming a `200 OK` response from the server, the content at the URL is
+inspected. First, Git attempts to parse the file as a bundle file of
+version 2 or higher. If the file is not a bundle, then the file is parsed
+as a plain-text file using Git's config parser. The key-value pairs in
+that config file are expected to describe a list of bundle URIs. If
+neither of these parse attempts succeeds, then Git will report an error to
+the user that the bundle URI provided erroneous data.
+
+Any other data provided by the server is considered erroneous.
+
+Bundle Lists
+------------
+
+The Git server can advertise bundle URIs using a set of `key=value` pairs.
+A bundle URI can also serve a plain-text file in the Git config format
+containing these same `key=value` pairs. In both cases, we consider this
+to be a _bundle list_. The pairs specify information about the bundles
+that the client can use to make decisions for which bundles to download
+and which to ignore.
+
+A few keys focus on properties of the list itself.
+
+bundle.version::
+ (Required) This value provides a version number for the bundle
+ list. If a future Git change enables a feature that needs the Git
+ client to react to a new key in the bundle list file, then this version
+ will increment. The only current version number is 1, and if any other
+ value is specified then Git will fail to use this file.
+
+bundle.mode::
+ (Required) This value has one of two values: `all` and `any`. When `all`
+ is specified, then the client should expect to need all of the listed
+ bundle URIs that match their repository's requirements. When `any` is
+ specified, then the client should expect that any one of the bundle URIs
+ that match their repository's requirements will suffice. Typically, the
+ `any` option is used to list a number of different bundle servers
+ located in different geographies.
+
+bundle.heuristic::
+ If this string-valued key exists, then the bundle list is designed to
+ work well with incremental `git fetch` commands. The heuristic signals
+ that there are additional keys available for each bundle that help
+ determine which subset of bundles the client should download. The only
+ heuristic currently planned is `creationToken`.
+
+The remaining keys include an `<id>` segment which is a server-designated
+name for each available bundle. The `<id>` must contain only alphanumeric
+and `-` characters.
+
+bundle.<id>.uri::
+ (Required) This string value is the URI for downloading bundle `<id>`.
+ If the URI begins with a protocol (`http://` or `https://`) then the URI
+ is absolute. Otherwise, the URI is interpreted as relative to the URI
+ used for the bundle list. If the URI begins with `/`, then that relative
+ path is relative to the domain name used for the bundle list. (This use
+ of relative paths is intended to make it easier to distribute a set of
+ bundles across a large number of servers or CDNs with different domain
+ names.)
+
+bundle.<id>.filter::
+ This string value represents an object filter that should also appear in
+ the header of this bundle. The server uses this value to differentiate
+ different kinds of bundles from which the client can choose those that
+ match their object filters.
+
+bundle.<id>.creationToken::
+ This value is a nonnegative 64-bit integer used for sorting the bundles
+ list. This is used to download a subset of bundles during a fetch when
+ `bundle.heuristic=creationToken`.
+
+bundle.<id>.location::
+ This string value advertises a real-world location from where the bundle
+ URI is served. This can be used to present the user with an option for
+ which bundle URI to use or simply as an informative indicator of which
+ bundle URI was selected by Git. This is only valuable when
+ `bundle.mode` is `any`.
+
+Here is an example bundle list using the Git config format:
+
+ [bundle]
+ version = 1
+ mode = all
+ heuristic = creationToken
+
+ [bundle "2022-02-09-1644442601-daily"]
+ uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
+ creationToken = 1644442601
+
+ [bundle "2022-02-02-1643842562"]
+ uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
+ creationToken = 1643842562
+
+ [bundle "2022-02-09-1644442631-daily-blobless"]
+ uri = 2022-02-09-1644442631-daily-blobless.bundle
+ creationToken = 1644442631
+ filter = blob:none
+
+ [bundle "2022-02-02-1643842568-blobless"]
+ uri = /git/git/2022-02-02-1643842568-blobless.bundle
+ creationToken = 1643842568
+ filter = blob:none
+
+This example uses `bundle.mode=all` as well as the
+`bundle.<id>.creationToken` heuristic. It also uses the `bundle.<id>.filter`
+options to present two parallel sets of bundles: one for full clones and
+another for blobless partial clones.
+
+Suppose that this bundle list was found at the URI
+`https://bundles.example.com/git/git/` and so the two blobless bundles have
+the following fully-expanded URIs:
+
+* `https://bundles.example.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
+* `https://bundles.example.com/git/git/2022-02-02-1643842568-blobless.bundle`
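+
+A minimal sketch of how a client might resolve these relative URIs against
+the bundle list's own URI follows; the helper and its names are purely
+illustrative and are not Git's actual implementation:
+
+    #include <stdio.h>
+    #include <string.h>
+    static void resolve_bundle_uri(const char *list_uri, const char *uri,
+                                   char *out, size_t len)
+    {
+        if (!strncmp(uri, "http://", 7) || !strncmp(uri, "https://", 8)) {
+            snprintf(out, len, "%s", uri);  /* already absolute */
+        } else if (uri[0] == '/') {
+            /* relative to the domain of the bundle list URI */
+            const char *scheme_end = strstr(list_uri, "://");
+            const char *path = scheme_end ? strchr(scheme_end + 3, '/') : NULL;
+            int base_len = path ? (int)(path - list_uri) : (int)strlen(list_uri);
+            snprintf(out, len, "%.*s%s", base_len, list_uri, uri);
+        } else {
+            /* relative to the bundle list URI itself (assumed to end in '/') */
+            snprintf(out, len, "%s%s", list_uri, uri);
+        }
+    }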
+
+Advertising Bundle URIs
+-----------------------
+
+If a user knows a bundle URI for the repository they are cloning, then
+they can specify that URI manually through a command-line option. However,
+a Git host may want to advertise bundle URIs during the clone operation,
+helping users unaware of the feature.
+
+The only thing required for this feature is that the server can advertise
+one or more bundle URIs. This advertisement takes the form of a new
+protocol v2 capability specifically for discovering bundle URIs.
+
+The client could choose an arbitrary bundle URI as an option _or_ select
+the URI with best performance by some exploratory checks. It is up to the
+bundle provider to decide if having multiple URIs is preferable to a
+single URI that is geodistributed through server-side infrastructure.
+
+Cloning with Bundle URIs
+------------------------
+
+The primary need for bundle URIs is to speed up clones. The Git client
+will interact with bundle URIs according to the following flow:
+
+1. The user specifies a bundle URI with the `--bundle-uri` command-line
+ option _or_ the client discovers a bundle list advertised by the
+ Git server.
+
+2. If the downloaded data from a bundle URI is a bundle, then the client
+ inspects the bundle headers to check that the prerequisite commit OIDs
+ are present in the client repository. If some are missing, then the
+ client delays unbundling until other bundles have been unbundled,
+ making those OIDs present. When all required OIDs are present, the
+ client unbundles that data using a refspec. The default refspec is
+ `+refs/heads/*:refs/bundles/*`, but this can be configured. These refs
+ are stored so that later `git fetch` negotiations can communicate each
+ bundled ref as a `have`, reducing the size of the fetch over the Git
+ protocol. To allow pruning refs from this ref namespace, Git may
+ introduce a numbered namespace (such as `refs/bundles/<i>/*`) such that
+ stale bundle refs can be deleted.
+
+3. If the file is instead a bundle list, then the client inspects the
+ `bundle.mode` to see if the list is of the `all` or `any` form.
+
+ a. If `bundle.mode=all`, then the client considers all bundle
+ URIs. The list is reduced based on the `bundle.<id>.filter` options
+ matching the client repository's partial clone filter. Then, all
+ bundle URIs are requested. If the `bundle.<id>.creationToken`
+ heuristic is provided, then the bundles are downloaded in decreasing
+ order by the creation token, stopping when a bundle has all required
+ OIDs. The bundles can then be unbundled in increasing creation token
+ order. The client stores the latest creation token as a heuristic
+ for avoiding future downloads if the bundle list does not advertise
+ bundles with larger creation tokens.
+
+ b. If `bundle.mode=any`, then the client can choose any one of the
+ bundle URIs to inspect. The client can use a variety of ways to
+ choose among these URIs. The client can also fall back to another URI
+ if the initial choice fails to return a result.
+
+Note that during a clone we expect that all bundles will be required, and
+heuristics such as `bundle.<id>.creationToken` can be used to download
+bundles in chronological order or in parallel.
+
+If a given bundle URI is a bundle list with a `bundle.heuristic`
+value, then the client can choose to store that URI as its chosen bundle
+URI. The client can then navigate directly to that URI during later `git
+fetch` calls.
+
+When downloading bundle URIs, the client can choose to inspect the initial
+content before committing to downloading the entire content. This may
+provide enough information to determine if the URI is a bundle list or
+a bundle. In the case of a bundle, the client may inspect the bundle
+header to determine that all advertised tips are already in the client
+repository and cancel the remaining download.
+
+Fetching with Bundle URIs
+-------------------------
+
+When the client fetches new data, it can decide to fetch from bundle
+servers before fetching from the origin remote. This could be done via a
+command-line option, but it is more likely useful to use a config value
+such as the one specified during the clone.
+
+The fetch operation follows the same procedure to download bundles from a
+bundle list (although we do _not_ want to use parallel downloads here). We
+expect that the process will end when all prerequisite commit OIDs in a
+thin bundle are already in the object database.
+
+When using the `creationToken` heuristic, the client can avoid downloading
+any bundles if their creation tokens are not larger than the stored
+creation token. After fetching new bundles, Git updates this local
+creation token.
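+
+In other words, the per-bundle decision during a fetch reduces to a
+comparison along the lines of this illustrative sketch:
+
+    #include <stdint.h>
+    /* A bundle is only worth downloading when its advertised creationToken
+     * is strictly newer than the token recorded after the previous
+     * clone or fetch. */
+    static int want_bundle(uint64_t bundle_token, uint64_t stored_token)
+    {
+        return bundle_token > stored_token;
+    }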
+
+If the bundle provider does not provide a heuristic, then the client
+should attempt to inspect the bundle headers before downloading the full
+bundle data in case the bundle tips already exist in the client
+repository.
+
+Error Conditions
+----------------
+
+If the Git client discovers something unexpected while downloading
+information according to a bundle URI or the bundle list found at that
+location, then Git can ignore that data and continue as if it was not
+given a bundle URI. The remote Git server is the ultimate source of truth,
+not the bundle URI.
+
+Here are a few example error conditions:
+
+* The client fails to connect with a server at the given URI or a connection
+ is lost without any chance to recover.
+
+* The client receives a 400-level response (such as `404 Not Found` or
+ `401 Unauthorized`). The client should use the credential helper to
+ find and provide a credential for the URI, but match the semantics of
+ Git's other HTTP protocols in terms of handling specific 400-level
+ errors.
+
+* The server reports any other failure response.
+
+* The client receives data that is not parsable as a bundle or bundle list.
+
+* A bundle includes a filter that does not match expectations.
+
+* The client cannot unbundle the bundles because the prerequisite commit OIDs
+ are not in the object database and there are no more bundles to download.
+
+There are also situations that could be seen as wasteful, but are not
+error conditions:
+
+* The downloaded bundles contain more information than is requested by
+ the clone or fetch request. A primary example is if the user requests
+ a clone with `--single-branch` but downloads bundles that store every
+ reachable commit from all `refs/heads/*` references. This might be
+ initially wasteful, but perhaps these objects will become reachable by
+ a later ref update that the client cares about.
+
+* A bundle download during a `git fetch` contains objects already in the
+ object database. This is probably unavoidable if we are using bundles
+ for fetches, since the client will almost always be slightly ahead of
+ the bundle servers after performing its "catch-up" fetch to the remote
+ server. This extra work is most wasteful when the client is fetching
+ much more frequently than the server is computing bundles, such as if
+ the client is using hourly prefetches with background maintenance, but
+ the server is computing bundles weekly. For this reason, the client
+ should not use bundle URIs for fetch unless the server has explicitly
+ recommended it through a `bundle.heuristic` value.
+
+Example Bundle Provider organization
+------------------------------------
+
+The bundle URI feature is intentionally designed to be flexible to
+different ways a bundle provider wants to organize the object data.
+However, it can be helpful to have a complete organization model described
+here so providers can start from that base.
+
+This example organization is a simplified model of what is used by the
+GVFS Cache Servers (see the section near the end of this document), which have
+been beneficial in speeding up clones and fetches for very large
+repositories, albeit with extra software outside of Git.
+
+The bundle provider deploys servers across multiple geographies. Each
+server manages its own bundle set. The server can track a number of Git
+repositories, but provides a bundle list for each based on a pattern. For
+example, when mirroring a repository at `https://<domain>/<org>/<repo>`
+the bundle server could have its bundle list available at
+`https://<server-url>/<domain>/<org>/<repo>`. The origin Git server can
+list all of these servers under the "any" mode:
+
+ [bundle]
+ version = 1
+ mode = any
+
+ [bundle "eastus"]
+ uri = https://eastus.example.com/<domain>/<org>/<repo>
+
+ [bundle "europe"]
+ uri = https://europe.example.com/<domain>/<org>/<repo>
+
+ [bundle "apac"]
+ uri = https://apac.example.com/<domain>/<org>/<repo>
+
+This "list of lists" is static and only changes if a bundle server is
+added or removed.
+
+Each bundle server manages its own set of bundles. The initial bundle list
+contains only a single bundle, containing all of the objects received from
+cloning the repository from the origin server. The list uses the
+`creationToken` heuristic and a `creationToken` is made for the bundle
+based on the server's timestamp.
+
+The bundle server runs regularly-scheduled updates for the bundle list,
+such as once a day. During this task, the server fetches the latest
+contents from the origin server and generates a bundle containing the
+objects reachable from the latest origin refs, but not contained in a
+previously-computed bundle. This bundle is added to the list, with care
+that the `creationToken` is strictly greater than the previous maximum
+`creationToken`.
+
+When the bundle list grows too large, say more than 30 bundles, then the
+oldest "_N_ minus 30" bundles are combined into a single bundle. This
+bundle's `creationToken` is equal to the maximum `creationToken` among the
+merged bundles.
+
+An example bundle list is provided here, although it only has two daily
+bundles and not a full list of 30:
+
+ [bundle]
+ version = 1
+ mode = all
+ heuristic = creationToken
+
+ [bundle "2022-02-13-1644770820-daily"]
+ uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644770820-daily.bundle
+ creationToken = 1644770820
+
+ [bundle "2022-02-09-1644442601-daily"]
+ uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644442601-daily.bundle
+ creationToken = 1644442601
+
+ [bundle "2022-02-02-1643842562"]
+ uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-02-1643842562.bundle
+ creationToken = 1643842562
+
+To avoid storing and serving object data in perpetuity after it becomes
+unreachable on the origin server, this bundle merge can be more careful.
+Instead of taking an absolute union of the old bundles, the merged bundle
+can be created by looking at the newer bundles and ensuring that their
+necessary commits are all available in this merged bundle (or in another
+one of the newer bundles). This allows "expiring" object data that is not
+being used by new commits in this window of time. That data could be
+reintroduced by a later push.
+
+The intention of this data organization has two main goals. First, initial
+clones of the repository become faster by downloading precomputed object
+data from a closer source. Second, `git fetch` commands can be faster,
+especially if the client has not fetched for a few days. However, if a
+client does not fetch for 30 days, then the bundle list organization would
+cause redownloading a large amount of object data.
+
+One way to make this organization more useful to users who fetch frequently
+is to have more frequent bundle creation. For example, bundles could be
+created every hour, and then once a day those "hourly" bundles could be
+merged into a "daily" bundle. The daily bundles are merged into the
+oldest bundle after 30 days.
+
+It is recommended that this bundle strategy be repeated with the `blob:none`
+filter if clients of this repository are expecting to use blobless partial
+clones. This list of blobless bundles stays in the same list as the full
+bundles, but uses the `bundle.<id>.filter` key to separate the two groups.
+For very large repositories, the bundle provider may want to _only_ provide
+blobless bundles.
+
+Implementation Plan
+-------------------
+
+This design document is being submitted on its own as an aspirational
+document, with the goal of implementing all of the mentioned client
+features over the course of several patch series. Here is a potential
+outline for submitting these features:
+
+1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
+ This will include a new `git fetch --bundle-uri` mode for use as the
+ implementation underneath `git clone`. The initial version here will
+ expect a single bundle at the given URI.
+
+2. Implement the ability to parse a bundle list from a bundle URI and
+ update the `git fetch --bundle-uri` logic to properly distinguish
+ between `bundle.mode` options. Specifically design the feature so
+ that the config format parsing feeds a list of key-value pairs into the
+ bundle list logic.
+
+3. Create the `bundle-uri` protocol v2 command so Git servers can advertise
+ bundle URIs using the key-value pairs. Plug into the existing key-value
+ input to the bundle list logic. Allow `git clone` to discover these
+ bundle URIs and bootstrap the client repository from the bundle data.
+ (This choice is an opt-in via a config option and a command-line
+ option.)
+
+4. Allow the client to understand the `bundle.heuristic` configuration key
+ and the `bundle.<id>.creationToken` heuristic. When `git clone`
+ discovers a bundle URI with `bundle.heuristic`, it configures the client
+ repository to check that bundle URI during later `git fetch <remote>`
+ commands.
+
+5. Allow clients to discover bundle URIs during `git fetch` and configure
+ a bundle URI for later fetches if `bundle.heuristic` is set.
+
+6. Implement the "inspect headers" heuristic to reduce data downloads when
+ the `bundle.<id>.creationToken` heuristic is not available.
+
+As these features are reviewed, this plan might be updated. We also expect
+that new designs will be discovered and implemented as this feature
+matures and becomes used in real-world scenarios.
+
+Related Work: Packfile URIs
+---------------------------
+
+The Git protocol already has a capability where the Git server can list
+a set of URLs along with the packfile response when serving a client
+request. The client is then expected to download the packfiles at those
+locations in order to have a complete understanding of the response.
+
+This mechanism is used by the Gerrit server (implemented with JGit) and
+has been effective at reducing CPU load and improving user performance for
+clones.
+
+A major downside to this mechanism is that the origin server needs to know
+_exactly_ what is in those packfiles, and the packfiles need to be available
+to the user for some time after the server has responded. This coupling
+between the origin and the packfile data is difficult to manage.
+
+Further, this implementation is extremely hard to make work with fetches.
+
+Related Work: GVFS Cache Servers
+--------------------------------
+
+The GVFS Protocol [2] is a set of HTTP endpoints designed independently of
+the Git project before Git's partial clone was created. One feature of this
+protocol is the idea of a "cache server" which can be colocated with build
+machines or developer offices to transfer Git data without overloading the
+central server.
+
+The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}`
+endpoint, which allows downloading an object on-demand. This is a critical
+piece of the filesystem virtualization of that product.
+
+However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>`
+endpoint. Given an optional timestamp, the cache server responds with a list
+of precomputed packfiles containing the commits and trees that were introduced
+in those time intervals.
+
+The cache server computes these "prefetch" packfiles using the following
+strategy:
+
+1. Every hour, an "hourly" pack is generated with a given timestamp.
+2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack.
+3. Nightly, all prefetch packs more than 30 days old are rolled up into
+ one pack.
+
+When a user runs `gvfs clone` or `scalar clone` against a repo with cache
+servers, the client requests all prefetch packfiles, which is at most
+`24 + 30 + 1` packfiles, containing only commits and trees. The client
+then follows with a request to the origin server for the references, and
+attempts to checkout that tip reference. (There is an extra endpoint that
+helps get all reachable trees from a given commit, in case that commit
+was not already in a prefetch packfile.)
+
+During a `git fetch`, a hook requests the prefetch endpoint using the
+most-recent timestamp from a previously-downloaded prefetch packfile.
+Only the list of packfiles with later timestamps are downloaded. Most
+users fetch hourly, so they get at most one hourly prefetch pack. Users
+whose machines have been off or otherwise have not fetched in over 30 days
+might redownload all prefetch packfiles. This is rare.
+
+It is important to note that the clients always contact the origin server
+for the refs advertisement, so the refs are frequently "ahead" of the
+prefetched pack data. The missing objects are downloaded on-demand using
+the `GET /gvfs/objects/{oid}` requests, when needed by a command such as
+`git checkout` or `git log`. Some Git optimizations disable checks that
+would cause these on-demand downloads to be too aggressive.
+
+See Also
+--------
+
+[1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
+ An earlier RFC for a bundle URI feature.
+
+[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
+ The GVFS Protocol
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
new file mode 100644
index 0000000..2c26e95
--- /dev/null
+++ b/Documentation/technical/commit-graph.txt
@@ -0,0 +1,401 @@
+Git Commit-Graph Design Notes
+=============================
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows. The merge
+base calculation shows up in many user-facing commands, such as 'merge-base'
+or 'status', and can take minutes to compute depending on the shape of the history.
+
+There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to satisfy topological order constraints.
+
+The commit-graph file is a supplemental data structure that accelerates
+commit graph walks. If a user downgrades or disables the 'core.commitGraph'
+config setting, then the existing object database is sufficient. The file is stored
+as "commit-graph" either in the .git/objects/info directory or in the info
+directory of an alternate.
+
+The commit-graph file stores the commit graph structure along with some
+extra metadata to speed up graph walks. By listing commit OIDs in
+lexicographic order, we can identify an integer position for each commit
+and refer to the parents of a commit using those integer positions. We
+use binary search to find initial commits and then use the integer
+positions for fast lookups during the walk.
+
+A consumer may load the following info for a commit from the graph:
+
+1. The commit OID.
+2. The list of parents, along with their integer position.
+3. The commit date.
+4. The root tree OID.
+5. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+There are two definitions of generation number:
+1. Corrected committer dates (generation number v2)
+2. Topological levels (generation number v1)
+
+Define "corrected committer date" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has corrected committer date
+ equal to its committer date.
+
+ * A commit with at least one parent has corrected committer date equal to
+ the maximum of its committer date and one more than the largest corrected
+ committer date among its parents.
+
+ * As a special case, a root commit with timestamp zero has a corrected commit
+ date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
+ (that is, an uncomputed corrected commit date).
+
+Define the "topological level" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has topological level of one.
+
+ * A commit with at least one parent has topological level one more than
+ the largest topological level among its parents.
+
+Equivalently, the topological level of a commit A is one more than the
+length of a longest path from A to a root commit. The recursive definition
+is easier to use for computation and for observing the following property:
+
+ If A and B are commits with generation numbers N and M, respectively,
+ and N <= M, then A cannot reach B. That is, we know without searching
+ that B is not an ancestor of A because it is further from a root commit
+ than A.
+
+ Conversely, when checking if A is an ancestor of B, then we only need
+ to walk commits until all commits on the walk boundary have generation
+ number at most N. If we walk commits using a priority queue seeded by
+ generation numbers, then we always expand the boundary commit with highest
+ generation number and can easily detect the stopping condition.
+
+The property applies to both versions of generation number, that is both
+corrected committer dates and topological levels.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+ If A and B are commits with commit time X and Y, respectively, and
+ X < Y, then A _probably_ cannot reach B.
+
+In the absence of corrected commit dates (for example, with old versions of
+Git or mixed-generation graph chains),
+this heuristic is currently used whenever the computation is allowed to
+violate topological relationships due to clock skew (such as "git log"
+with default order), but is not used when the topological order is
+required (such as merge base calculations, "git log --graph").
+
+In practice, we expect some commits to be created recently and not stored
+in the commit-graph. We can treat these commits as having "infinite"
+generation number and walk until reaching commits with known generation
+number.
+
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+ If A and B are commits with generation numbers N and M, respectively,
+ and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits, but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.
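+
+As a small illustration (not Git's actual code), the cheap negative test
+enabled by the weaker condition above looks like:
+
+```
+#include <stdint.h>
+
+/*
+ * If gen(A) < gen(B), then A cannot reach B, so a reachability walk
+ * from A toward B can be skipped entirely.
+ */
+static int definitely_cannot_reach(uint32_t gen_a, uint32_t gen_b)
+{
+	return gen_a < gen_b;
+}
+```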
+
+We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
+topological levels (generation number v1) are computed to be at least
+this value. We limit at this value since it is the largest value that
+can be stored in the commit-graph file using the 30 bits available
+to topological levels. This presents another case where a commit can
+have generation number equal to that of a parent.
+
+Design Details
+--------------
+
+- The commit-graph file is stored in a file named 'commit-graph' in the
+ .git/objects/info directory. This could be stored in the info directory
+ of an alternate.
+
+- The core.commitGraph config setting must be on to consume graph files.
+
+- The file format includes parameters for the object ID hash function,
+ so a future change of hash algorithm does not require a change in format.
+
+- Commit grafts and replace objects can change the shape of the commit
+ history. The latter can also be enabled/disabled on the fly using
+ `--no-replace-objects`. This leads to difficulty storing both possible
+ interpretations of a commit id, especially when computing generation
+ numbers. The commit-graph will not be read or written when
+ replace-objects or grafts are present.
+
+- Shallow clones create grafts of commits by dropping their parents. This
+ leads the commit-graph to think those commits have generation number 1.
+ If and when those commits are made unshallow, those generation numbers
+ become invalid. Since shallow clones are intended to restrict the commit
+ history to a very small set of commits, the commit-graph feature is less
+ helpful for these clones, anyway. The commit-graph will not be read or
+ written when shallow commits are present.
+
+Commit-Graphs Chains
+--------------------
+
+Typically, repos grow with near-constant velocity (commits per day). Over time,
+the number of commits added by a fetch operation is much smaller than the
+number of commits in the full history. By creating a "chain" of commit-graphs,
+we enable fast writes of new commit data without rewriting the entire commit
+history -- at least, most of the time.
+
+## File Layout
+
+A commit-graph chain uses multiple files, and we use a fixed naming convention
+to organize these files. Each commit-graph file has a name
+`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex-
+valued hash stored in the footer of that file (which is a hash of the file's
+contents before that hash). For a chain of commit-graph files, a plain-text
+file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the
+hashes for the files in order from "lowest" to "highest".
+
+For example, if the `commit-graph-chain` file contains the lines
+
+```
+ {hash0}
+ {hash1}
+ {hash2}
+```
+
+then the commit-graph chain looks like the following diagram:
+
+ +-----------------------+
+ | graph-{hash2}.graph |
+ +-----------------------+
+ |
+ +-----------------------+
+ | |
+ | graph-{hash1}.graph |
+ | |
+ +-----------------------+
+ |
+ +-----------------------+
+ | |
+ | |
+ | |
+ | graph-{hash0}.graph |
+ | |
+ | |
+ | |
+ +-----------------------+
+
+Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of
+commits in `graph-{hash1}.graph`, and X2 be the number of commits in
+`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`,
+then we interpret this as being the commit in position (X0 + X1 + i), and that
+will be used as its "graph position". The commits in `graph-{hash2}.graph` use these
+positions to refer to their parents, which may be in `graph-{hash1}.graph` or
+`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking
+its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
+X2).
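+
+For illustration, resolving a graph position to a layer and a position local
+to that layer could be done as in this sketch (the helper and array are
+invented for the example, not Git's code):
+
+    #include <stdio.h>
+
+    /* num_commits[k] is the number of commits in layer k (X0, X1, X2). */
+    static int graph_position_to_layer(const int *num_commits, int nr_layers,
+                                       int pos, int *local_pos)
+    {
+        for (int layer = 0; layer < nr_layers; layer++) {
+            if (pos < num_commits[layer]) {
+                *local_pos = pos;
+                return layer;
+            }
+            pos -= num_commits[layer];
+        }
+        return -1; /* position out of range for this chain */
+    }
+
+    int main(void)
+    {
+        int num_commits[] = { 1000, 200, 30 }; /* X0, X1, X2 */
+        int local;
+        int layer = graph_position_to_layer(num_commits, 3, 1150, &local);
+        printf("layer %d, local position %d\n", layer, local); /* 1, 150 */
+        return 0;
+    }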
+
+Each commit-graph file (except the base, `graph-{hash0}.graph`) contains data
+specifying the hashes of all files in the lower layers. In the above example,
+`graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains
+`{hash0}` and `{hash1}`.
+
+## Merging commit-graph files
+
+If we only added a new commit-graph file on every write, we would run into a
+linear search problem through many commit-graph files. Instead, we use a merge
+strategy to decide when the stack should collapse some number of levels.
+
+The diagram below shows such a collapse. As a set of new commits are added, it
+is determined by the merge strategy that the files should collapse to
+`graph-{hash1}`. Thus, the new commits, the commits in `graph-{hash2}` and
+the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}`
+file.
+
+ +---------------------+
+ | |
+ | (new commits) |
+ | |
+ +---------------------+
+ | |
+ +-----------------------+ +---------------------+
+ | graph-{hash2} |->| |
+ +-----------------------+ +---------------------+
+ | | |
+ +-----------------------+ +---------------------+
+ | | | |
+ | graph-{hash1} |->| |
+ | | | |
+ +-----------------------+ +---------------------+
+ | tmp_graphXXX
+ +-----------------------+
+ | |
+ | |
+ | |
+ | graph-{hash0} |
+ | |
+ | |
+ | |
+ +-----------------------+
+
+During this process, the commits to write are combined and sorted, and we write the
+contents to a temporary file, all while holding a `commit-graph-chain.lock`
+lock-file. When the file is flushed, we rename it to `graph-{hash3}`
+according to the computed `{hash3}`. Finally, we write the new chain data to
+`commit-graph-chain.lock`:
+
+```
+ {hash3}
+ {hash0}
+```
+
+We then close the lock-file.
+
+## Merge Strategy
+
+When writing a set of commits that do not exist in the commit-graph stack of
+height N, we default to creating a new file at level N + 1. We then decide to
+merge with the Nth level if one of two conditions holds:
+
+ 1. `--size-multiple=<X>` is specified or X = 2, and the number of commits in
+ level N is less than X times the number of commits in level N + 1.
+
+ 2. `--max-commits=<C>` is specified with non-zero C and the number of commits
+ in level N + 1 is more than C commits.
+
+This decision cascades down the levels: when we merge a level we create a new
+set of commits that is then compared to the next level.
+
+The first condition bounds the number of levels to be logarithmic in the total
+number of commits. The second condition bounds the total number of commits in
+a `graph-{hashN}` file and not in the `commit-graph` file, preventing
+significant performance issues when the stack merges and another process only
+partially reads the previous stack.
+
+The merge strategy values (2 for the size multiple, 64,000 for the maximum
+number of commits) could be extracted into config settings for full
+flexibility.
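+
+A rough sketch of this cascading decision, assuming a simple array of
+per-level commit counts with level 0 as the base (names and types are
+invented for illustration, and overflow is ignored):
+
+    /*
+     * Returns how many existing top levels should be collapsed into the
+     * new level. 'size_multiple' and 'max_commits' correspond to
+     * `--size-multiple=<X>` and `--max-commits=<C>`; max_commits == 0
+     * means "no limit".
+     */
+    static int nr_levels_to_merge(const int *commits_in_level, int nr_levels,
+                                  int new_commits, int size_multiple,
+                                  int max_commits)
+    {
+        int merged = new_commits; /* commits in the level being written */
+        int levels = 0;
+
+        for (int n = nr_levels - 1; n >= 0; n--) {
+            int level_too_small = commits_in_level[n] < size_multiple * merged;
+            int new_level_too_big = max_commits && merged > max_commits;
+
+            if (!level_too_small && !new_level_too_big)
+                break;
+            merged += commits_in_level[n];
+            levels++;
+        }
+        return levels; /* nr_levels means: rewrite as a single base file */
+    }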
+
+## Handling Mixed Generation Number Chains
+
+With the introduction of generation number v2 and generation data chunk, the
+following scenario is possible:
+
+1. "New" Git writes a commit-graph with the corrected commit dates.
+2. "Old" Git writes a split commit-graph on top without corrected commit dates.
+
+A naive approach of using the newest available generation number from
+each layer would lead to violated expectations: the lower layer would
+use corrected commit dates which are much larger than the topological
+levels of the higher layer. For this reason, Git inspects the topmost
+layer to see if the layer is missing corrected commit dates. In that case,
+Git uses only topological levels for generation numbers.
+
+When writing a new layer in split commit-graph, we write corrected commit
+dates if the topmost layer has corrected commit dates written. This
+guarantees that if a layer has corrected commit dates, all lower layers
+must have corrected commit dates as well.
+
+When merging layers, we do not consider whether the merged layers had corrected
+commit dates. Instead, the new layer will have corrected commit dates if the
+layer below the new layer has corrected commit dates.
+
+While writing or merging layers, if the new layer is the only layer, it will
+have corrected commit dates when written by compatible versions of Git. Thus,
+rewriting split commit-graph as a single file (`--split=replace`) creates a
+single layer with corrected commit dates.
+
+## Deleting graph-{hash} files
+
+After a new tip file is written, some `graph-{hash}` files may no longer
+be part of a chain. It is important to remove these files from disk, eventually.
+The main reason to delay removal is that another process could read the
+`commit-graph-chain` file before it is rewritten, but then look for the
+`graph-{hash}` files after they are deleted.
+
+To allow holding old split commit-graphs for a while after they are unreferenced,
+we update the modified times of the files when they become unreferenced. Then,
+we scan the `$OBJDIR/info/commit-graphs/` directory for `graph-{hash}`
+files whose modified times are older than a given expiry window. This window
+defaults to zero, but can be changed using command-line arguments or a config
+setting.
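+
+A sketch of such an expiry scan, using plain POSIX calls and a caller-supplied
+predicate for "still referenced by the current chain" (both hypothetical;
+this is not Git's actual code):
+
+    #include <dirent.h>
+    #include <stdio.h>
+    #include <string.h>
+    #include <sys/stat.h>
+    #include <time.h>
+    #include <unistd.h>
+
+    void expire_graph_files(const char *dir, time_t expiry_window,
+                            int (*is_referenced)(const char *name))
+    {
+        DIR *d = opendir(dir);
+        struct dirent *e;
+        time_t now = time(NULL);
+        char path[4096];
+
+        if (!d)
+            return;
+        while ((e = readdir(d)) != NULL) {
+            struct stat st;
+
+            if (strncmp(e->d_name, "graph-", 6))
+                continue;           /* not a graph-{hash} file */
+            if (is_referenced(e->d_name))
+                continue;           /* still part of the chain */
+            snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
+            if (stat(path, &st))
+                continue;
+            if (now - st.st_mtime > expiry_window)
+                unlink(path);       /* unreferenced and old enough */
+        }
+        closedir(d);
+    }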
+
+## Chains across multiple object directories
+
+In a repo with alternates, we look for the `commit-graph-chain` file starting
+in the local object directory and then in each alternate. The first file that
+exists defines our chain. As we look for the `graph-{hash}` files for
+each `{hash}` in the chain file, we follow the same pattern for the host
+directories.
+
+This allows commit-graphs to be split across multiple forks in a fork network.
+The typical case is a large "base" repo with many smaller forks.
+
+As the base repo advances, it will likely update and merge its commit-graph
+chain more frequently than the forks. If a fork updates their commit-graph after
+the base repo, then it should "reparent" the commit-graph chain onto the new
+chain in the base repo. When reading each `graph-{hash}` file, we track
+the object directory containing it. During a write of a new commit-graph file,
+we check for any changes in the source object directory, read the
+`commit-graph-chain` file for that source, and create a new file based on those
+files. During this "reparent" operation, we must collapse all
+levels in the fork, as all of the files are invalid against the new base file.
+
+It is crucial to be careful when cleaning up "unreferenced" `graph-{hash}.graph`
+files in this scenario. It falls to the user to define the proper settings for
+their custom environment:
+
+ 1. When merging levels in the base repo, the unreferenced files may still be
+ referenced by chains from fork repos.
+
+ 2. The expiry time should be set to a length of time such that every fork has
+ time to recompute their commit-graph chain to "reparent" onto the new base
+ file(s).
+
+ 3. If the commit-graph chain is updated in the base, the fork will not have
+ access to the new chain until its chain is updated to reference those files.
+ (This may change in the future [5].)
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=8
+ Chromium work item for: Serialized Commit Graph
+
+[1] https://lore.kernel.org/git/20110713070517.GC18566@sigill.intra.peff.net/
+ An abandoned patch that introduced generation numbers.
+
+[2] https://lore.kernel.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
+ Discussion about generation numbers on commits and how they interact
+ with fsck.
+
+[3] https://lore.kernel.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/
+ More discussion about generation numbers and not storing them inside
+ commit objects. A valuable quote:
+
+ "I think we should be moving more in the direction of keeping
+ repo-local caches for optimizations. Reachability bitmaps have been
+ a big performance win. I think we should be doing the same with our
+ properties of commits. Not just generation numbers, but making it
+ cheap to access the graph structure without zlib-inflating whole
+ commit objects (i.e., packv4 or something like the "metapacks" I
+ proposed a few years ago)."
+
+[4] https://lore.kernel.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
+ A patch to remove the ahead-behind calculation from 'status'.
+
+[5] https://lore.kernel.org/git/f27db281-abad-5043-6d71-cbb083b1c877@gmail.com/
+ A discussion of a "two-dimensional graph position" that can allow reading
+ multiple commit-graph chains at the same time.
diff --git a/Documentation/technical/directory-rename-detection.txt b/Documentation/technical/directory-rename-detection.txt
new file mode 100644
index 0000000..029ee2c
--- /dev/null
+++ b/Documentation/technical/directory-rename-detection.txt
@@ -0,0 +1,118 @@
+Directory rename detection
+==========================
+
+Rename detection logic in diffcore-rename that checks for renames of
+individual files is also aggregated there and then analyzed in either
+merge-ort or merge-recursive for cases where combinations of renames
+indicate that a full directory has been renamed.
+
+Scope of abilities
+------------------
+
+It is perhaps easiest to start with an example:
+
+ * When all of x/a, x/b and x/c have moved to z/a, z/b and z/c, it is
+ likely that x/d added in the meantime would also want to move to z/d by
+ taking the hint that the entire directory 'x' moved to 'z'.
+
+More interesting possibilities exist, though, such as:
+
+ * one side of history renames x -> z, and the other renames some file to
+ x/e, causing the need for the merge to do a transitive rename so that
+ the rename ends up at z/e.
+
+ * one side of history renames x -> z, but also renames all files within x.
+ For example, x/a -> z/alpha, x/b -> z/bravo, etc.
+
+ * both 'x' and 'y' being merged into a single directory 'z', with a
+ directory rename being detected for both x->z and y->z.
+
+ * not all files in a directory being renamed to the same location;
+ i.e. perhaps most of the files in 'x' are now found under 'z', but a few
+ are found under 'w'.
+
+ * a directory being renamed, which also contained a subdirectory that was
+ renamed to some entirely different location. (And perhaps the inner
+ directory itself contained inner directories that were renamed to yet
+ other locations).
+
+ * combinations of the above; see t/t6423-merge-rename-directories.sh for
+ various interesting cases.
+
+Limitations -- applicability of directory renames
+-------------------------------------------------
+
+In order to prevent edge and corner cases from resulting in conflicts that
+either cannot be represented in the index or might be too complex for
+users to understand and resolve, a couple of basic rules limit when
+directory rename detection applies:
+
+ 1) If a given directory still exists on both sides of a merge, we do
+ not consider it to have been renamed.
+
+ 2) If a subset of to-be-renamed files have a file or directory in the
+ way (or would be in the way of each other), "turn off" the directory
+ rename for those specific sub-paths and report the conflict to the
+ user.
+
+ 3) If the other side of history did a directory rename to a path that
+ your side of history renamed away, then ignore that particular
+ rename from the other side of history for any implicit directory
+ renames (but warn the user).
+
+Limitations -- detailed rules and testcases
+-------------------------------------------
+
+t/t6423-merge-rename-directories.sh contains extensive tests and commentary
+which generate and explore the rules listed above. It also lists a few
+additional rules:
+
+ a) If renames split a directory into two or more others, the directory
+ with the most renames "wins".
+
+ b) Only apply implicit directory renames to directories if the other side
+ of history is the one doing the renaming.
+
+ c) Do not perform directory rename detection for directories which had no
+ new paths added to them.
+
+Limitations -- support in different commands
+--------------------------------------------
+
+Directory rename detection is supported by 'merge' and 'cherry-pick'.
+Other git commands in which users might be surprised to see limited or no
+directory rename detection support:
+
+ * diff
+
+ Folks have requested in the past that `git diff` detect directory
+ renames and somehow simplify its output. It is not clear whether this
+ would be desirable or how the output should be simplified, so this was
+ simply not implemented. Also, while diffcore-rename has most of the
+ logic for detecting directory renames, some of the logic is still found
+ within merge-ort and merge-recursive. Fully supporting directory
+ rename detection in diffs would require copying or moving the remaining
+ bits of logic to the diff machinery.
+
+ * am
+
+ git-am tries to avoid a full three way merge, instead calling
+ git-apply. That prevents us from detecting renames at all, which may
+ defeat the directory rename detection. There is a fallback, though; if
+ the initial git-apply fails and the user has specified the -3 option,
+ git-am will fall back to a three way merge. However, git-am lacks the
+ necessary information to do a "real" three way merge. Instead, it has
+ to use build_fake_ancestor() to get a merge base that is missing files
+ whose rename may have been important to detect for directory rename
+ detection to function.
+
+ * rebase
+
+ Since am-based rebases work by first generating a bunch of patches
+ (which no longer record what the original commits were and thus don't
+ have the necessary info from which we can find a real merge-base), and
+ then calling git-am, this implies that am-based rebases will not always
+ successfully detect directory renames either (see the 'am' section
+ above). Merge-based rebases (rebase -m) and cherry-pick-based rebases
+ (rebase -i) are not affected by this shortcoming, and fully support
+ directory rename detection.
diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
new file mode 100644
index 0000000..ed57481
--- /dev/null
+++ b/Documentation/technical/hash-function-transition.txt
@@ -0,0 +1,830 @@
+Git hash function transition
+============================
+
+Objective
+---------
+Migrate Git from SHA-1 to a stronger hash function.
+
+Background
+----------
+At its core, the Git version control system is a content addressable
+filesystem. It uses the SHA-1 hash function to name content. For
+example, files, directories, and revisions are referred to by hash
+values, unlike in other traditional version control systems where files
+or versions are referred to via sequential numbers. The use of a hash
+function to address its content delivers a few advantages:
+
+* Integrity checking is easy. Bit flips, for example, are easily
+ detected, as the hash of corrupted content does not match its name.
+* Lookup of objects is fast.
+
+Using a cryptographically secure hash function brings additional
+advantages:
+
+* Object names can be signed and third parties can trust the hash to
+ address the signed object and all objects it references.
+* Communication using Git protocol and out of band communication
+ methods have a short, reliable string that can be used to
+ address stored content.
+
+Over time some flaws in SHA-1 have been discovered by security
+researchers. On 23 February 2017 the SHAttered attack
+(https://shattered.io) demonstrated a practical SHA-1 hash collision.
+
+Git v2.13.0 and later subsequently moved to a hardened SHA-1
+implementation by default, which isn't vulnerable to the SHAttered
+attack, but SHA-1 is still weak.
+
+Thus it's considered prudent to move past any variant of SHA-1
+to a new hash. There's no guarantee that further attacks on SHA-1 won't
+be published in the future, and those attacks may not have viable
+mitigations.
+
+If SHA-1 and its variants were to be truly broken, Git's hash function
+could not be considered cryptographically secure any more. This would
+impact the communication of hash values because we could not trust
+that a given hash value represented the known good version of content
+that the speaker intended.
+
+SHA-1 still possesses the other properties, such as fast object lookup
+and safe error checking, but other hash functions that are believed to be
+cryptographically secure are equally suitable.
+
+Choice of Hash
+--------------
+The hash to replace the hardened SHA-1 should be stronger than SHA-1
+was: we would like it to be trustworthy and useful in practice for at
+least 10 years.
+
+Some other relevant properties:
+
+1. A 256-bit hash (long enough to match common security practice; not
+ excessively long to hurt performance and disk usage).
+
+2. High quality implementations should be widely available (e.g., in
+ OpenSSL and Apple CommonCrypto).
+
+3. The hash function's properties should match Git's needs (e.g. Git
+ requires collision and 2nd preimage resistance and does not require
+ length extension resistance).
+
+4. As a tiebreaker, the hash should be fast to compute (fortunately
+ many contenders are faster than SHA-1).
+
+There were several contenders for a successor hash to SHA-1, including
+SHA-256, SHA-512/256, SHA-256x16, K12, and BLAKE2bp-256.
+
+In late 2018 the project picked SHA-256 as its successor hash.
+
+See 0ed8d8da374 (doc hash-function-transition: pick SHA-256 as
+NewHash, 2018-08-04) and numerous mailing list threads at the time,
+particularly the one starting at
+https://lore.kernel.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/
+for more information.
+
+Goals
+-----
+1. The transition to SHA-256 can be done one local repository at a time.
+ a. Requiring no action by any other party.
+ b. A SHA-256 repository can communicate with SHA-1 Git servers
+ (push/fetch).
+ c. Users can use SHA-1 and SHA-256 identifiers for objects
+ interchangeably (see "Object names on the command line", below).
+ d. New signed objects make use of a stronger hash function than
+ SHA-1 for their security guarantees.
+2. Allow a complete transition away from SHA-1.
+ a. Local metadata for SHA-1 compatibility can be removed from a
+ repository if compatibility with SHA-1 is no longer needed.
+3. Maintainability throughout the process.
+ a. The object format is kept simple and consistent.
+ b. Creation of a generalized repository conversion tool.
+
+Non-Goals
+---------
+1. Add SHA-256 support to Git protocol. This is valuable and the
+ logical next step but it is out of scope for this initial design.
+2. Transparently improving the security of existing SHA-1 signed
+ objects.
+3. Intermixing objects using multiple hash functions in a single
+ repository.
+4. Taking the opportunity to fix other bugs in Git's formats and
+ protocols.
+5. Shallow clones and fetches into a SHA-256 repository. (This will
+ change when we add SHA-256 support to Git protocol.)
+6. Skip fetching some submodules of a project into a SHA-256
+ repository. (This also depends on SHA-256 support in Git
+ protocol.)
+
+Overview
+--------
+We introduce a new repository format extension. Repositories with this
+extension enabled use SHA-256 instead of SHA-1 to name their objects.
+This affects both object names and object content -- both the names
+of objects and all references to other objects within an object are
+switched to the new hash function.
+
+SHA-256 repositories cannot be read by older versions of Git.
+
+Alongside the packfile, a SHA-256 repository stores a bidirectional
+mapping between SHA-256 and SHA-1 object names. The mapping is generated
+locally and can be verified using "git fsck". Object lookups use this
+mapping to allow naming objects using either their SHA-1 or SHA-256 names
+interchangeably.
+
+"git cat-file" and "git hash-object" gain options to display an object
+in its SHA-1 form and write an object given its SHA-1 form. This
+requires all objects referenced by that object to be present in the
+object database so that they can be named using the appropriate name
+(using the bidirectional hash mapping).
+
+Fetches from a SHA-1 based server convert the fetched objects into
+SHA-256 form and record the mapping in the bidirectional mapping table
+(see below for details). Pushes to a SHA-1 based server convert the
+objects being pushed into SHA-1 form so the server does not have to be
+aware of the hash function the client is using.
+
+Detailed Design
+---------------
+Repository format extension
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+A SHA-256 repository uses repository format version `1` (see
+Documentation/technical/repository-version.txt) with extensions
+`objectFormat` and `compatObjectFormat`:
+
+ [core]
+ repositoryFormatVersion = 1
+ [extensions]
+ objectFormat = sha256
+ compatObjectFormat = sha1
+
+The combination of setting `core.repositoryFormatVersion=1` and
+populating `extensions.*` ensures that all versions of Git later than
+`v0.99.9l` will die instead of trying to operate on the SHA-256
+repository, producing an error message instead.
+
+ # Between v0.99.9l and v2.7.0
+ $ git status
+ fatal: Expected git repo version <= 0, found 1
+ # After v2.7.0
+ $ git status
+ fatal: unknown repository extensions found:
+ objectformat
+ compatobjectformat
+
+See the "Transition plan" section below for more details on these
+repository extensions.
+
+Object names
+~~~~~~~~~~~~
+Objects can be named by their 40 hexadecimal digit SHA-1 name or 64
+hexadecimal digit SHA-256 name, plus names derived from those (see
+gitrevisions(7)).
+
+The SHA-1 name of an object is the SHA-1 of the concatenation of its
+type, length, a nul byte, and the object's SHA-1 content. This is the
+traditional <sha1> used in Git to name objects.
+
+The SHA-256 name of an object is the SHA-256 of the concatenation of its
+type, length, a nul byte, and the object's SHA-256 content.
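+
+For illustration only, computing such a name could look like the sketch
+below. It uses OpenSSL's SHA-256 routines as a stand-in for Git's internal
+hash abstraction, and the function name is invented:
+
+    #include <openssl/sha.h>
+    #include <stdio.h>
+    #include <string.h>
+
+    /* name = SHA-256("<type> <length>" NUL <content under SHA-256>) */
+    static void sha256_object_name(const char *type,
+                                   const unsigned char *content, size_t len,
+                                   unsigned char out[SHA256_DIGEST_LENGTH])
+    {
+        SHA256_CTX ctx;
+        char header[64];
+        int header_len = snprintf(header, sizeof(header), "%s %zu", type, len);
+
+        SHA256_Init(&ctx);
+        SHA256_Update(&ctx, header, header_len + 1); /* include the NUL */
+        SHA256_Update(&ctx, content, len);
+        SHA256_Final(out, &ctx);
+    }
+
+    int main(void)
+    {
+        const char *msg = "hello\n";
+        unsigned char name[SHA256_DIGEST_LENGTH];
+
+        sha256_object_name("blob", (const unsigned char *)msg, strlen(msg), name);
+        for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
+            printf("%02x", name[i]);
+        printf("\n");
+        return 0;
+    }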
+
+Object format
+~~~~~~~~~~~~~
+The content as a byte sequence of a tag, commit, or tree object named
+by SHA-1 and SHA-256 differs because an object named by SHA-256 name refers to
+other objects by their SHA-256 names and an object named by SHA-1 name
+refers to other objects by their SHA-1 names.
+
+The SHA-256 content of an object is the same as its SHA-1 content, except
+that objects referenced by the object are named using their SHA-256 names
+instead of SHA-1 names. Because a blob object does not refer to any
+other object, its SHA-1 content and SHA-256 content are the same.
+
+The format allows round-trip conversion between SHA-256 content and
+SHA-1 content.
+
+Object storage
+~~~~~~~~~~~~~~
+Loose objects use zlib compression and packed objects use the packed
+format described in linkgit:gitformat-pack[5], just like
+today. The content that is compressed and stored uses SHA-256 content
+instead of SHA-1 content.
+
+Pack index
+~~~~~~~~~~
+Pack index (.idx) files use a new v3 format that supports multiple
+hash functions. They have the following format (all integers are in
+network byte order):
+
+- A header appears at the beginning and consists of the following:
+ * The 4-byte pack index signature: '\377tOc'
+ * 4-byte version number: 3
+ * 4-byte length of the header section, including the signature and
+ version number
+ * 4-byte number of objects contained in the pack
+ * 4-byte number of object formats in this pack index: 2
+ * For each object format:
+ ** 4-byte format identifier (e.g., 'sha1' for SHA-1)
+ ** 4-byte length in bytes of shortened object names. This is the
+ shortest possible length needed to make names in the shortened
+ object name table unambiguous.
+ ** 4-byte integer, recording where tables relating to this format
+ are stored in this index file, as an offset from the beginning.
+ * 4-byte offset to the trailer from the beginning of this file.
+ * Zero or more additional key/value pairs (4-byte key, 4-byte
+ value). Only one key is supported: 'PSRC'. See the "Loose objects
+ and unreachable objects" section for supported values and how this
+ is used. All other keys are reserved. Readers must ignore
+ unrecognized keys.
+- Zero or more NUL bytes. This can optionally be used to improve the
+ alignment of the full object name table below.
+- Tables for the first object format:
+ * A sorted table of shortened object names. These are prefixes of
+ the names of all objects in this pack file, packed together
+ without offset values to reduce the cache footprint of the binary
+ search for a specific object name.
+
+ * A table of full object names in pack order. This allows resolving
+ a reference to "the nth object in the pack file" (from a
+ reachability bitmap or from the next table of another object
+ format) to its object name.
+
+ * A table of 4-byte values mapping object name order to pack order.
+ For an object in the table of sorted shortened object names, the
+ value at the corresponding index in this table is the index in the
+ previous table for that same object.
+ This can be used to look up the object in reachability bitmaps or
+ to look up its name in another object format.
+
+ * A table of 4-byte CRC32 values of the packed object data, in the
+ order that the objects appear in the pack file. This is to allow
+ compressed data to be copied directly from pack to pack during
+ repacking without undetected data corruption.
+
+ * A table of 4-byte offset values. For an object in the table of
+ sorted shortened object names, the value at the corresponding
+ index in this table indicates where that object can be found in
+ the pack file. These are usually 31-bit pack file offsets, but
+ large offsets are encoded as an index into the next table with the
+ most significant bit set.
+
+ * A table of 8-byte offset entries (empty for pack files less than
+ 2 GiB). Pack files are organized with heavily used objects toward
+ the front, so most object references should not need to refer to
+ this table.
+- Zero or more NUL bytes.
+- Tables for the second object format, with the same layout as above,
+ up to and not including the table of CRC32 values.
+- Zero or more NUL bytes.
+- The trailer consists of the following:
+ * A copy of the 32-byte SHA-256 checksum at the end of the
+ corresponding packfile.
+
+ * 32-byte SHA-256 checksum of all of the above.
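+
+As a reading aid, the fixed-size header fields listed above could be pictured
+roughly as the following C sketch. The struct and field names are invented
+for illustration; the variable-length parts (per-format entries, trailer
+offset, key/value pairs) are noted in comments rather than modeled:
+
+    #include <stdint.h>
+
+    /* One entry per object format, directly after the fixed header. */
+    struct pack_idx_v3_format_entry {
+        uint32_t format_id;        /* e.g. 'sha1' */
+        uint32_t shortened_length; /* bytes per shortened object name */
+        uint32_t tables_offset;    /* offset of this format's tables */
+    };
+
+    struct pack_idx_v3_header {
+        uint32_t signature;     /* '\377tOc' */
+        uint32_t version;       /* 3 */
+        uint32_t header_length; /* including signature and version */
+        uint32_t nr_objects;
+        uint32_t nr_formats;    /* 2 */
+        /*
+         * Followed on disk by nr_formats format entries, the 4-byte
+         * trailer offset, and zero or more 4-byte key/value pairs
+         * (only 'PSRC' is currently defined). All integers are in
+         * network byte order.
+         */
+    };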
+
+Loose object index
+~~~~~~~~~~~~~~~~~~
+A new file $GIT_OBJECT_DIR/loose-object-idx contains information about
+all loose objects. Its format is
+
+ # loose-object-idx
+ (sha256-name SP sha1-name LF)*
+
+where the object names are in hexadecimal format. The file is not
+sorted.
+
+The loose object index is protected against concurrent writes by a
+lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose
+object:
+
+1. Write the loose object to a temporary file, like today.
+2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock.
+3. Rename the loose object into place.
+4. Open loose-object-idx with O_APPEND and append the new object's entry.
+5. Unlink loose-object-idx.lock to release the lock.
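+
+A sketch of these steps using plain POSIX calls (illustrative only; the
+function name, arguments and error handling are invented):
+
+    #include <fcntl.h>
+    #include <stdio.h>
+    #include <string.h>
+    #include <unistd.h>
+
+    int add_loose_object_entry(const char *objdir, const char *tmp_object,
+                               const char *final_object, const char *idx_line)
+    {
+        char lock_path[4096], idx_path[4096];
+        int lock_fd, idx_fd;
+
+        snprintf(lock_path, sizeof(lock_path), "%s/loose-object-idx.lock", objdir);
+        snprintf(idx_path, sizeof(idx_path), "%s/loose-object-idx", objdir);
+
+        /* Step 2: O_CREAT | O_EXCL fails if another writer holds the lock. */
+        lock_fd = open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0666);
+        if (lock_fd < 0)
+            return -1;
+
+        /* Step 3: move the already-written loose object into place. */
+        if (rename(tmp_object, final_object) < 0)
+            goto fail;
+
+        /* Step 4: append "<sha256-name> <sha1-name>\n" to the index. */
+        idx_fd = open(idx_path, O_WRONLY | O_APPEND | O_CREAT, 0666);
+        if (idx_fd < 0)
+            goto fail;
+        if (write(idx_fd, idx_line, strlen(idx_line)) < 0) {
+            close(idx_fd);
+            goto fail;
+        }
+        close(idx_fd);
+
+        /* Step 5: release the lock. */
+        close(lock_fd);
+        unlink(lock_path);
+        return 0;
+    fail:
+        close(lock_fd);
+        unlink(lock_path);
+        return -1;
+    }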
+
+To remove entries (e.g. in "git pack-refs" or "git-prune"):
+
+1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the
+ lock.
+2. Write the new content to loose-object-idx.lock.
+3. Unlink any loose objects being removed.
+4. Rename to replace loose-object-idx, releasing the lock.
+
+Translation table
+~~~~~~~~~~~~~~~~~
+The index files support a bidirectional mapping between SHA-1 names
+and SHA-256 names. The lookup proceeds similarly to ordinary object
+lookups. For example, to convert a SHA-1 name to a SHA-256 name:
+
+ 1. Look for the object in idx files. If a match is present in the
+ idx's sorted list of truncated SHA-1 names, then:
+ a. Read the corresponding entry in the SHA-1 name order to pack
+ name order mapping.
+ b. Read the corresponding entry in the full SHA-1 name table to
+ verify we found the right object. If so, then
+ c. Read the corresponding entry in the full SHA-256 name table.
+ That is the object's SHA-256 name.
+ 2. Check for a loose object. Read lines from loose-object-idx until
+ we find a match.
+
+Step (1) takes the same amount of time as an ordinary object lookup:
+O(number of packs * log(objects per pack)). Step (2) takes O(number of
+loose objects) time. To maintain good performance it will be necessary
+to keep the number of loose objects low. See the "Loose objects and
+unreachable objects" section below for more details.
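+
+Step (2) amounts to a linear scan of loose-object-idx. A minimal sketch
+(function name invented; not Git's actual code):
+
+    #include <stdio.h>
+    #include <string.h>
+
+    /* Returns 1 and fills sha256_hex on success, 0 if not found. */
+    static int loose_sha1_to_sha256(const char *idx_path, const char *sha1_hex,
+                                    char *sha256_hex, size_t out_len)
+    {
+        FILE *f = fopen(idx_path, "r");
+        char line[256];
+        int found = 0;
+
+        if (!f)
+            return 0;
+        while (fgets(line, sizeof(line), f)) {
+            char sha256[65], sha1[41];
+
+            /* each line is "<sha256-name> <sha1-name>" */
+            if (sscanf(line, "%64s %40s", sha256, sha1) != 2)
+                continue;
+            if (!strcmp(sha1, sha1_hex)) {
+                snprintf(sha256_hex, out_len, "%s", sha256);
+                found = 1;
+                break;
+            }
+        }
+        fclose(f);
+        return found;
+    }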
+
+Since all operations that make new objects (e.g., "git commit") add
+the new objects to the corresponding index, this mapping is possible
+for all objects in the object store.
+
+Reading an object's SHA-1 content
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The SHA-1 content of an object can be read by converting all SHA-256 names
+of its SHA-256 content references to SHA-1 names using the translation table.
+
+Fetch
+~~~~~
+Fetching from a SHA-1 based server requires translating between SHA-1
+and SHA-256 based representations on the fly.
+
+SHA-1s named in the ref advertisement that are present on the client
+can be translated to SHA-256 and looked up as local objects using the
+translation table.
+
+Negotiation proceeds as today. Any "have"s generated locally are
+converted to SHA-1 before being sent to the server, and SHA-1s
+mentioned by the server are converted to SHA-256 when looking them up
+locally.
+
+After negotiation, the server sends a packfile containing the
+requested objects. We convert the packfile to SHA-256 format using
+the following steps:
+
+1. index-pack: inflate each object in the packfile and compute its
+ SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
+ objects the client has locally. These objects can be looked up
+ using the translation table and their SHA-1 content read as
+ described above to resolve the deltas.
+2. topological sort: starting at the "want"s from the negotiation
+ phase, walk through objects in the pack and emit a list of them,
+ excluding blobs, in reverse topologically sorted order, with each
+ object coming later in the list than all objects it references.
+ (This list only contains objects reachable from the "wants". If the
+ pack from the server contained additional extraneous objects, then
+ they will be discarded.)
+3. convert to SHA-256: open a new SHA-256 packfile. Read the topologically
+ sorted list just generated. For each object, inflate its
+ SHA-1 content, convert to SHA-256 content, and write it to the SHA-256
+ pack. Record the new SHA-1<-->SHA-256 mapping entry for use in the idx.
+4. sort: reorder entries in the new pack to match the order of objects
+ in the pack the server generated and include blobs. Write a SHA-256 idx
+ file.
+5. clean up: remove the SHA-1 based pack file, index, and
+ topologically sorted list obtained from the server in steps 1
+ and 2.
+
+Step 3 requires every object referenced by the new object to be in the
+translation table. This is why the topological sort step is necessary.
+
+As an optimization, step 1 could write a file describing what non-blob
+objects each object it has inflated from the packfile references. This
+makes the topological sort in step 2 possible without inflating the
+objects in the packfile for a second time. The objects need to be
+inflated again in step 3, for a total of two inflations.
+
+Step 4 is probably necessary for good read-time performance. "git
+pack-objects" on the server optimizes the pack file for good data
+locality (see Documentation/technical/pack-heuristics.txt).
+
+Details of this process are likely to change. It will take some
+experimenting to get this to perform well.
+
+Push
+~~~~
+Push is simpler than fetch because the objects referenced by the
+pushed objects are already in the translation table. The SHA-1 content
+of each object being pushed can be read as described in the "Reading
+an object's SHA-1 content" section to generate the pack written by git
+send-pack.
+
+Signed Commits
+~~~~~~~~~~~~~~
+We add a new field "gpgsig-sha256" to the commit object format to allow
+signing commits without relying on SHA-1. It is similar to the
+existing "gpgsig" field. Its signed payload is the SHA-256 content of the
+commit object with any "gpgsig" and "gpgsig-sha256" fields removed.
+
+This means commits can be signed
+
+1. using SHA-1 only, as in existing signed commit objects
+2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig
+ fields.
+3. using only SHA-256, by only using the gpgsig-sha256 field.
+
+Old versions of "git verify-commit" can verify the gpgsig signature in
+cases (1) and (2) without modifications and view case (3) as an
+ordinary unsigned commit.
+
+Signed Tags
+~~~~~~~~~~~
+We add a new field "gpgsig-sha256" to the tag object format to allow
+signing tags without relying on SHA-1. Its signed payload is the
+SHA-256 content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP
+SIGNATURE-----" delimited in-body signature removed.
+
+This means tags can be signed
+
+1. using SHA-1 only, as in existing signed tag objects
+2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body
+ signature.
+3. using only SHA-256, by only using the gpgsig-sha256 field.
+
+Mergetag embedding
+~~~~~~~~~~~~~~~~~~
+The mergetag field in the SHA-1 content of a commit contains the
+SHA-1 content of a tag that was merged by that commit.
+
+The mergetag field in the SHA-256 content of the same commit contains the
+SHA-256 content of the same tag.
+
+Submodules
+~~~~~~~~~~
+To convert recorded submodule pointers, you need to have the converted
+submodule repository in place. The translation table of the submodule
+can be used to look up the new hash.
+
+Loose objects and unreachable objects
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Fast lookups in the loose-object-idx require that the number of loose
+objects not grow too high.
+
+"git gc --auto" currently waits for there to be 6700 loose objects
+present before consolidating them into a packfile. We will need to
+measure to find a more appropriate threshold for it to use.
+
+"git gc --auto" currently waits for there to be 50 packs present
+before combining packfiles. Packing loose objects more aggressively
+may cause the number of pack files to grow too quickly. This can be
+mitigated by using a strategy similar to Martin Fick's exponential
+rolling garbage collection script:
+https://gerrit-review.googlesource.com/c/gerrit/+/35215
+
+"git gc" currently expels any unreachable objects it encounters in
+pack files to loose objects in an attempt to prevent a race when
+pruning them (in case another process is simultaneously writing a new
+object that refers to the about-to-be-deleted object). This leads to
+an explosion in the number of loose objects present and disk space
+usage due to the objects in delta form being replaced with independent
+loose objects. Worse, the race is still present for loose objects.
+
+Instead, "git gc" will need to move unreachable objects to a new
+packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see
+below). To avoid the race when writing new objects referring to an
+about-to-be-deleted object, code paths that write new objects will
+need to copy any objects from UNREACHABLE_GARBAGE packs that they
+refer to new, non-UNREACHABLE_GARBAGE packs (or loose objects).
+UNREACHABLE_GARBAGE are then safe to delete if their creation time (as
+indicated by the file's mtime) is long enough ago.
+
+To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be
+combined under certain circumstances. If "gc.garbageTtl" is set to
+greater than one day, then packs created within a single calendar day,
+UTC, can be coalesced together. The resulting packfile would have an
+mtime before midnight on that day, so this makes the effective maximum
+ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day,
+then we divide the calendar day into intervals one-third of that ttl
+in duration. Packs created within the same interval can be coalesced
+together. The resulting packfile would have an mtime before the end of
+the interval, so this makes the effective maximum ttl equal to the
+garbageTtl * 4/3.
+
+This rule comes from Thirumala Reddy Mutchukota's JGit change
+https://git.eclipse.org/r/90465.
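+
+A sketch of the grouping rule (assuming `gc.garbageTtl` is expressed in
+seconds and is at least a few seconds long; names are invented for
+illustration):
+
+    #include <time.h>
+
+    /* Packs whose mtimes yield equal keys may be coalesced together. */
+    static long coalesce_key(time_t mtime, long garbage_ttl_seconds)
+    {
+        const long day = 24 * 60 * 60;
+        long day_number = (long)(mtime / day);     /* UTC calendar day */
+        long since_midnight = (long)(mtime % day);
+
+        if (garbage_ttl_seconds >= day)
+            return day_number;
+        /* divide the day into intervals one-third of the ttl long */
+        return day_number * day + since_midnight / (garbage_ttl_seconds / 3);
+    }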
+
+The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack
+index. More generally, that field indicates where a pack came from:
+
+ - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network
+ - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight
+ "gc --auto" operation
+ - 3 (PACK_SOURCE_GC) for a pack created by a full gc
+ - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage
+ discovered by gc
+ - 5 (PACK_SOURCE_INSERT) for locally created objects that were
+ written directly to a pack file, e.g. from "git add ."
+
+This information can be useful for debugging and for "gc --auto" to
+make appropriate choices about which packs to coalesce.
+
+Caveats
+-------
+Invalid objects
+~~~~~~~~~~~~~~~
+The conversion from SHA-1 content to SHA-256 content retains any
+brokenness in the original object (e.g., tree entry modes encoded with
+leading 0, tree objects whose paths are not sorted correctly, and
+commit objects without an author or committer). This is a deliberate
+feature of the design to allow the conversion to round-trip.
+
+More profoundly broken objects (e.g., a commit with a truncated "tree"
+header line) cannot be converted but were not usable by current Git
+anyway.
+
+Shallow clone and submodules
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Because it requires all referenced objects to be available in the
+locally generated translation table, this design does not support
+shallow clone or unfetched submodules. Protocol improvements might
+allow lifting this restriction.
+
+Alternates
+~~~~~~~~~~
+For the same reason, a SHA-256 repository cannot borrow objects from a
+SHA-1 repository using objects/info/alternates or
+$GIT_ALTERNATE_OBJECT_REPOSITORIES.
+
+git notes
+~~~~~~~~~
+The "git notes" tool annotates objects using their SHA-1 name as key.
+This design does not describe a way to migrate notes trees to use
+SHA-256 names. That migration is expected to happen separately (for
+example using a file at the root of the notes tree to describe which
+hash it uses).
+
+Server-side cost
+~~~~~~~~~~~~~~~~
+Until Git protocol gains SHA-256 support, using SHA-256 based storage
+on public-facing Git servers is strongly discouraged. Once Git
+protocol gains SHA-256 support, SHA-256 based servers are likely not
+to support SHA-1 compatibility, to avoid what may be a very expensive
+hash re-encode during clone and to encourage peers to modernize.
+
+The design described here allows fetches by SHA-1 clients of a
+personal SHA-256 repository because it's not much more difficult than
+allowing pushes from that repository. This support needs to be guarded
+by a configuration option -- servers like git.kernel.org that serve a
+large number of clients would not be expected to bear that cost.
+
+Meaning of signatures
+~~~~~~~~~~~~~~~~~~~~~
+The signed payload for signed commits and tags does not explicitly
+name the hash used to identify objects. If some day Git adopts a new
+hash function with the same length as the current SHA-1 (40
+hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the
+intent behind the PGP signed payload in an object signature is
+unclear:
+
+ object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
+ type commit
+ tag v2.12.0
+ tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800
+
+ Git 2.12
+
+Does this mean Git v2.12.0 is the commit with SHA-1 name
+e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
+new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?
+
+Fortunately SHA-256 and SHA-1 have different lengths. If Git starts
+using another hash with the same length to name objects, then it will
+need to change the format of signed payloads using that hash to
+address this issue.
+
+Object names on the command line
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+To support the transition (see Transition plan below), this design
+supports four different modes of operation:
+
+ 1. ("dark launch") Treat object names input by the user as SHA-1 and
+ convert any object names written to output to SHA-1, but store
+ objects using SHA-256. This allows users to test the code with no
+ visible behavior change except for performance. This allows
+ running even tests that assume the SHA-1 hash function, to
+ sanity-check the behavior of the new mode.
+
+ 2. ("early transition") Allow both SHA-1 and SHA-256 object names in
+ input. Any object names written to output use SHA-1. This allows
+ users to continue to make use of SHA-1 to communicate with peers
+ (e.g. by email) that have not migrated yet and prepares for mode 3.
+
+ 3. ("late transition") Allow both SHA-1 and SHA-256 object names in
+ input. Any object names written to output use SHA-256. In this
+ mode, users are using a more secure object naming method by
+ default. The disruption is minimal as long as most of their peers
+ are in mode 2 or mode 3.
+
+ 4. ("post-transition") Treat object names input by the user as
+ SHA-256 and write output using SHA-256. This is safer than mode 3
+ because there is less risk that input is incorrectly interpreted
+ using the wrong hash function.
+
+The mode is specified in configuration.
+
+The user can also explicitly specify which format to use for a
+particular revision specifier and for output, overriding the mode. For
+example:
+
+ git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}
+
+Transition plan
+---------------
+Some initial steps can be implemented independently of one another:
+
+- adding a hash function API (vtable)
+- teaching fsck to tolerate the gpgsig-sha256 field
+- excluding gpgsig-* from the fields copied by "git commit --amend"
+- annotating tests that depend on SHA-1 values with a SHA1 test
+ prerequisite
+- using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ
+ consistently instead of "unsigned char *" and the hardcoded
+ constants 20 and 40.
+- introducing index v3
+- adding support for the PSRC field and safer object pruning
+
+The first user-visible change is the introduction of the objectFormat
+extension (without compatObjectFormat). This requires:
+
+- teaching fsck about this mode of operation
+- using the hash function API (vtable) when computing object names
+- signing objects and verifying signatures
+- rejecting attempts to fetch from or push to an incompatible
+ repository
+
+Next comes introduction of compatObjectFormat:
+
+- implementing the loose-object-idx
+- translating object names between object formats
+- translating object content between object formats
+- generating and verifying signatures in the compat format
+- adding appropriate index entries when adding a new object to the
+ object store
+- --output-format option
+- ^{sha1} and ^{sha256} revision notation
+- configuration to specify default input and output format (see
+ "Object names on the command line" above)
+
+The next step is supporting fetches and pushes to SHA-1 repositories:
+
+- allow pushes to a repository using the compat format
+- generate a topologically sorted list of the SHA-1 names of fetched
+ objects
+- convert the fetched packfile to SHA-256 format and generate an idx
+ file
+- re-sort to match the order of objects in the fetched packfile
+
+The infrastructure supporting fetch also allows converting an existing
+repository. In converted repositories and new clones, end users can
+gain support for the new hash function without any visible change in
+behavior (see "dark launch" in the "Object names on the command line"
+section). In particular this allows users to verify SHA-256 signatures
+on objects in the repository, and it should ensure the transition code
+is stable in production in preparation for using it more widely.
+
+Over time projects would encourage their users to adopt the "early
+transition" and then "late transition" modes to take advantage of the
+new, more futureproof SHA-256 object names.
+
+When objectFormat and compatObjectFormat are both set, commands
+generating signatures would generate both SHA-1 and SHA-256 signatures
+by default to support both new and old users.
+
+In projects using SHA-256 heavily, users could be encouraged to adopt
+the "post-transition" mode to avoid accidentally making implicit use
+of SHA-1 object names.
+
+Once a critical mass of users have upgraded to a version of Git that
+can verify SHA-256 signatures and have converted their existing
+repositories to support verifying them, we can add support for a
+setting to generate only SHA-256 signatures. This is expected to be at
+least a year later.
+
+That is also a good moment to advertise the ability to convert
+repositories to use SHA-256 only, stripping out all SHA-1 related
+metadata. This improves performance by eliminating translation
+overhead and security by avoiding the possibility of accidentally
+relying on the safety of SHA-1.
+
+Updating Git's protocols to allow a server to specify which hash
+functions it supports is also an important part of this transition. It
+is not discussed in detail in this document but this transition plan
+assumes it happens. :)
+
+Alternatives considered
+-----------------------
+Upgrading everyone working on a particular project on a flag day
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Projects like the Linux kernel are large and complex enough that
+flipping the switch for all projects based on the repository at once
+is infeasible.
+
+Not only would all developers and server operators supporting
+developers have to switch on the same flag day, but supporting tooling
+(continuous integration, code review, bug trackers, etc) would have to
+be adapted as well. This also makes it difficult to get early feedback
+from some project participants testing before it is time for mass
+adoption.
+
+Using hash functions in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+(e.g. https://lore.kernel.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ )
+Objects newly created would be addressed by the new hash, but inside
+such an object (e.g. commit) it is still possible to address objects
+using the old hash function.
+
+* You cannot trust its history (needed for bisectability) in the
+ future without further work
+* Maintenance burden as the number of supported hash functions grows
+ (they will never go away, so they accumulate). In this proposal, by
+ comparison, converted objects lose all references to SHA-1.
+
+Signed objects with multiple hashes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Instead of introducing the gpgsig-sha256 field in commit and tag objects
+for SHA-256 content based signatures, an earlier version of this design
+added "hash sha256 <SHA-256 name>" fields to strengthen the existing
+SHA-1 content based signatures.
+
+In other words, a single signature was used to attest to the object
+content using both hash functions. This had some advantages:
+
+* Using one signature instead of two speeds up the signing process.
+* Having one signed payload with both hashes allows the signer to
+ attest to the SHA-1 name and SHA-256 name referring to the same object.
+* All users consume the same signature. Broken signatures are likely
+ to be detected quickly using current versions of git.
+
+However, it also came with disadvantages:
+
+* Verifying a signed object requires access to the SHA-1 names of all
+ objects it references, even after the transition is complete and
+ translation table is no longer needed for anything else. To support
+ this, the design added fields such as "hash sha1 tree <SHA-1 name>"
+ and "hash sha1 parent <SHA-1 name>" to the SHA-256 content of a signed
+ commit, complicating the conversion process.
+* Allowing signed objects without a SHA-1 (for after the transition is
+ complete) complicated the design further, requiring a "nohash sha1"
+ field to suppress including "hash sha1" fields in the SHA-256 content
+ and signed payload.
+
+Lazily populated translation table
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Some of the work of building the translation table could be deferred to
+push time, but that would significantly complicate and slow down pushes.
+Calculating the SHA-1 name at object creation time, while the object is
+being streamed to disk and its SHA-256 name is being calculated, should be
+an acceptable cost.
+
+Document History
+----------------
+
+2017-03-03
+bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com,
+sbeller@google.com
+
+* Initial version sent to https://lore.kernel.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com
+
+2017-03-03 jrnieder@gmail.com
+Incorporated suggestions from jonathantanmy and sbeller:
+
+* Describe purpose of signed objects with each hash type
+* Redefine signed object verification using object content under the
+ first hash function
+
+2017-03-06 jrnieder@gmail.com
+
+* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2]
+* Make SHA3-based signatures a separate field, avoiding the need for
+ "hash" and "nohash" fields (thanks to peff[3]).
+* Add a sorting phase to fetch (thanks to Junio for noticing the need
+ for this).
+* Omit blobs from the topological sort during fetch (thanks to peff).
+* Discuss alternates, git notes, and git servers in the caveats
+ section (thanks to Junio Hamano, brian m. carlson[4], and Shawn
+ Pearce).
+* Clarify language throughout (thanks to various commenters,
+ especially Junio).
+
+2017-09-27 jrnieder@gmail.com, sbeller@google.com
+
+* Use placeholder NewHash instead of SHA3-256
+* Describe criteria for picking a hash function.
+* Include a transition plan (thanks especially to Brandon Williams
+ for fleshing these ideas out)
+* Define the translation table (thanks, Shawn Pearce[5], Jonathan
+ Tan, and Masaya Suzuki)
+* Avoid loose object overhead by packing more aggressively in
+ "git gc --auto"
+
+Later history:
+
+* See the history of this file in git.git for the history of subsequent
+ edits. This document history is no longer being maintained as it
+ would now be superfluous to the commit log.
+
+References:
+
+ [1] https://lore.kernel.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/
+ [2] https://lore.kernel.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/
+ [3] https://lore.kernel.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/
+ [4] https://lore.kernel.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net
+ [5] https://lore.kernel.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/
diff --git a/Documentation/technical/long-running-process-protocol.txt b/Documentation/technical/long-running-process-protocol.txt
new file mode 100644
index 0000000..6f33654
--- /dev/null
+++ b/Documentation/technical/long-running-process-protocol.txt
@@ -0,0 +1,50 @@
+Long-running process protocol
+=============================
+
+This protocol is used when Git needs to communicate with an external
+process throughout the entire life of a single Git command. All
+communication is in pkt-line format (see linkgit:gitprotocol-common[5])
+over standard input and standard output.
+
+Handshake
+---------
+
+Git starts by sending a welcome message (for example,
+"git-filter-client"), a list of supported protocol version numbers, and
+a flush packet. Git expects to read the welcome message with "server"
+instead of "client" (for example, "git-filter-server"), exactly one
+protocol version number from the previously sent list, and a flush
+packet. All further communication will be based on the selected version.
+The remaining protocol description below documents "version=2". Please
+note that "version=42" in the example below does not exist and is only
+there to illustrate how the protocol would look with more than one
+version.
+
+After the version negotiation Git sends a list of all capabilities that
+it supports and a flush packet. Git expects to read a list of desired
+capabilities, which must be a subset of the supported capabilities list,
+and a flush packet as response:
+------------------------
+packet: git> git-filter-client
+packet: git> version=2
+packet: git> version=42
+packet: git> 0000
+packet: git< git-filter-server
+packet: git< version=2
+packet: git< 0000
+packet: git> capability=clean
+packet: git> capability=smudge
+packet: git> capability=not-yet-invented
+packet: git> 0000
+packet: git< capability=clean
+packet: git< capability=smudge
+packet: git< 0000
+------------------------
+
+Shutdown
+--------
+
+Git will close
+the command pipe on exit. The filter is expected to detect EOF
+and exit gracefully on its own. Git will wait until the filter
+process has stopped.
diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
new file mode 100644
index 0000000..f2221d2
--- /dev/null
+++ b/Documentation/technical/multi-pack-index.txt
@@ -0,0 +1,100 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to lookup objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its pack-
+file. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short) stores a list of objects
+and their offsets into multiple packfiles. It contains:
+
+* A list of packfile names.
+* A sorted list of object IDs.
+* A list of metadata for the ith object ID including:
+** A value j referring to the jth packfile.
+** An offset within the jth packfile for the object.
+* If large offsets are required, we use another list of large
+ offsets similar to version 2 pack-indexes.
+* An optional list of objects in pseudo-pack order (used with MIDX bitmaps).
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
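+
+As an illustration of that lookup, a binary search over the sorted object ID
+list that returns the recorded pack and offset might look like the following
+sketch (types and names are invented; the real MIDX is a chunked on-disk
+format):
+
+    #include <stdint.h>
+    #include <string.h>
+
+    struct midx_entry {
+        uint32_t pack_id;  /* index into the list of packfile names */
+        uint64_t offset;   /* offset of the object within that pack */
+    };
+
+    struct midx {
+        uint32_t nr_objects;
+        const unsigned char (*oids)[32]; /* sorted object IDs (SHA-256 width) */
+        const struct midx_entry *entries;
+    };
+
+    static const struct midx_entry *midx_lookup(const struct midx *m,
+                                                const unsigned char oid[32])
+    {
+        uint32_t lo = 0, hi = m->nr_objects;
+
+        while (lo < hi) {
+            uint32_t mid = lo + (hi - lo) / 2;
+            int cmp = memcmp(oid, m->oids[mid], 32);
+
+            if (!cmp)
+                return &m->entries[mid];
+            if (cmp < 0)
+                hi = mid;
+            else
+                lo = mid + 1;
+        }
+        return NULL; /* not in the multi-pack-index */
+    }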
+
+Design Details
+--------------
+
+- The MIDX is stored in a file named 'multi-pack-index' in the
+ .git/objects/pack directory. This could be stored in the pack
+ directory of an alternate. It refers only to packfiles in that
+ same directory.
+
+- The core.multiPackIndex config setting must be on (which is the
+ default) to consume MIDX files. Setting it to `false` prevents
+ Git from reading a MIDX file, even if one exists.
+
+- The file format includes parameters for the object ID hash
+ function, so a future change of hash algorithm does not require
+ a change in format.
+
+- The MIDX keeps only one record per object ID. If an object appears
+ in multiple packfiles, then the MIDX selects the copy in the
+ preferred packfile, otherwise selecting from the most-recently
+ modified packfile.
+
+- If there exist packfiles in the pack directory not registered in
+ the MIDX, then those packfiles are loaded into the `packed_git`
+ list and `packed_git_mru` cache.
+
+- The pack-indexes (.idx files) remain in the pack directory so we
+  can delete the MIDX file, set core.multiPackIndex to false, or
+  downgrade without any loss of information.
+
+- The MIDX file format uses a chunk-based approach (similar to the
+ commit-graph file) that allows optional data to be added.
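+
+For orientation, a chunk-based file of this kind is typically read by
+first scanning a small table of contents that maps chunk IDs to offsets
+within the file. The sketch below illustrates that idea only; the
+structures and the lookup are hypothetical and do not describe the
+actual on-disk encoding:
+
+------------------------
+#include <stdint.h>
+
+/* One table-of-contents entry: a 4-byte chunk ID and its file offset. */
+struct chunk_toc_entry {
+	uint32_t id;      /* chunk identifier, e.g. a four-letter tag */
+	uint64_t offset;  /* where the chunk's data starts in the file */
+};
+
+/* Return the offset of a chunk, or 0 if the optional chunk is absent. */
+static uint64_t find_chunk(const struct chunk_toc_entry *toc, int nr,
+			   uint32_t id)
+{
+	int i;
+
+	for (i = 0; i < nr; i++)
+		if (toc[i].id == id)
+			return toc[i].offset;
+	return 0;
+}
+------------------------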
+
+Future Work
+-----------
+
+- The multi-pack-index allows many packfiles, especially in a context
+ where repacking is expensive (such as a very large repo), or
+ unexpected maintenance time is unacceptable (such as a high-demand
+ build machine). However, the multi-pack-index needs to be rewritten
+ in full every time. We can extend the format to be incremental, so
+ writes are fast. By storing a small "tip" multi-pack-index that
+ points to large "base" MIDX files, we can keep writes fast while
+ still reducing the number of binary searches required for object
+ lookups.
+
+- If the multi-pack-index is extended to store a "stable object order"
+ (a function Order(hash) = integer that is constant for a given hash,
+ even as the multi-pack-index is updated) then MIDX bitmaps could be
+ updated independently of the MIDX.
+
+- Packfiles can be marked as "special" using empty files that share
+ the initial name but replace ".pack" with ".keep" or ".promisor".
+ We can add an optional chunk of data to the multi-pack-index that
+ records flags of information about the packfiles. This allows new
+ states, such as 'repacked' or 'redeltified', that can help with
+ pack maintenance in a multi-pack environment. It may also be
+ helpful to organize packfiles by object type (commit, tree, blob,
+ etc.) and use this metadata to help that maintenance.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=6
+ Chromium work item for: Multi-Pack Index (MIDX)
+
+[1] https://lore.kernel.org/git/20180107181459.222909-1-dstolee@microsoft.com/
+ An earlier RFC for the multi-pack-index feature
+
+[2] https://lore.kernel.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
+ Git Merge 2018 Contributor's summit notes (includes discussion of MIDX)
diff --git a/Documentation/technical/pack-heuristics.txt b/Documentation/technical/pack-heuristics.txt
new file mode 100644
index 0000000..95a07db
--- /dev/null
+++ b/Documentation/technical/pack-heuristics.txt
@@ -0,0 +1,460 @@
+Concerning Git's Packing Heuristics
+===================================
+
+ Oh, here's a really stupid question:
+
+ Where do I go
+ to learn the details
+ of Git's packing heuristics?
+
+Be careful what you ask!
+
+Followers of the Git, please open the Git IRC Log and turn to
+February 10, 2006.
+
+It's a rare occasion, and we are joined by the King Git Himself,
+Linus Torvalds (linus). Nathaniel Smith, (njs`), has the floor
+and seeks enlightenment. Others are present, but silent.
+
+Let's listen in!
+
+ <njs`> Oh, here's a really stupid question -- where do I go to
+ learn the details of Git's packing heuristics? google avails
+ me not, reading the source didn't help a lot, and wading
+ through the whole mailing list seems less efficient than any
+ of that.
+
+It is a bold start! A plea for help combined with a simultaneous
+tri-part attack on some of the tried and true mainstays in the quest
+for enlightenment. Brash accusations of google being useless. Hubris!
+Maligning the source. Heresy! Disdain for the mailing list archives.
+Woe.
+
+ <pasky> yes, the packing-related delta stuff is somewhat
+ mysterious even for me ;)
+
+Ah! Modesty after all.
+
+ <linus> njs, I don't think the docs exist. That's something where
+ I don't think anybody else than me even really got involved.
+ Most of the rest of Git others have been busy with (especially
+ Junio), but packing nobody touched after I did it.
+
+It's cryptic, yet vague. Linus in style for sure. Wise men
+interpret this as an apology. A few argue it is merely a
+statement of fact.
+
+ <njs`> I guess the next step is "read the source again", but I
+ have to build up a certain level of gumption first :-)
+
+Indeed! On both points.
+
+ <linus> The packing heuristic is actually really really simple.
+
+Bait...
+
+ <linus> But strange.
+
+And switch. That ought to do it!
+
+ <linus> Remember: Git really doesn't follow files. So what it does is
+ - generate a list of all objects
+ - sort the list according to magic heuristics
+ - walk the list, using a sliding window, seeing if an object
+ can be diffed against another object in the window
+ - write out the list in recency order
+
+The traditional understatement:
+
+ <njs`> I suspect that what I'm missing is the precise definition of
+ the word "magic"
+
+The traditional insight:
+
+ <pasky> yes
+
+And Babel-like confusion flowed.
+
+ <njs`> oh, hmm, and I'm not sure what this sliding window means either
+
+ <pasky> iirc, it appeared to me to be just the sha1 of the object
+ when reading the code casually ...
+
+ ... which simply doesn't sound as a very good heuristics, though ;)
+
+ <njs`> .....and recency order. okay, I think it's clear I didn't
+ even realize how much I wasn't realizing :-)
+
+Ah, grasshopper! And thus the enlightenment begins anew.
+
+ <linus> The "magic" is actually in theory totally arbitrary.
+ ANY order will give you a working pack, but no, it's not
+ ordered by SHA-1.
+
+ Before talking about the ordering for the sliding delta
+ window, let's talk about the recency order. That's more
+ important in one way.
+
+ <njs`> Right, but if all you want is a working way to pack things
+ together, you could just use cat and save yourself some
+ trouble...
+
+Waaait for it....
+
+ <linus> The recency ordering (which is basically: put objects
+ _physically_ into the pack in the order that they are
+ "reachable" from the head) is important.
+
+ <njs`> okay
+
+ <linus> It's important because that's the thing that gives packs
+ good locality. It keeps the objects close to the head (whether
+ they are old or new, but they are _reachable_ from the head)
+ at the head of the pack. So packs actually have absolutely
+ _wonderful_ IO patterns.
+
+Read that again, because it is important.
+
+ <linus> But recency ordering is totally useless for deciding how
+ to actually generate the deltas, so the delta ordering is
+ something else.
+
+ The delta ordering is (wait for it):
+ - first sort by the "basename" of the object, as defined by
+ the name the object was _first_ reached through when
+ generating the object list
+ - within the same basename, sort by size of the object
+ - but always sort different types separately (commits first).
+
+ That's not exactly it, but it's very close.
+
+ <njs`> The "_first_ reached" thing is not too important, just you
+ need some way to break ties since the same objects may be
+ reachable many ways, yes?
+
+And as if to clarify:
+
+ <linus> The point is that it's all really just any random
+ heuristic, and the ordering is totally unimportant for
+ correctness, but it helps a lot if the heuristic gives
+ "clumping" for things that are likely to delta well against
+ each other.
+
+It is an important point, so secretly, I did my own research and have
+included my results below. To be fair, it has changed some over time.
+And through the magic of Revisionistic History, I draw upon this entry
+from The Git IRC Logs on my father's birthday, March 1:
+
+ <gitster> The quote from the above linus should be rewritten a
+ bit (wait for it):
+ - first sort by type. Different objects never delta with
+ each other.
+ - then sort by filename/dirname. hash of the basename
+ occupies the top BITS_PER_INT-DIR_BITS bits, and bottom
+ DIR_BITS are for the hash of leading path elements.
+ - then if we are doing "thin" pack, the objects we are _not_
+ going to pack but we know about are sorted earlier than
+ other objects.
+ - and finally sort by size, larger to smaller.
+
+In one swell-foop, clarification and obscurification! Nonetheless,
+authoritative. Cryptic, yet concise. It even solicits notions of
+quotes from The Source Code. Clearly, more study is needed.
+
+ <gitster> That's the sort order. What this means is:
+ - we do not delta different object types.
+ - we prefer to delta the objects with the same full path, but
+ allow files with the same name from different directories.
+ - we always prefer to delta against objects we are not going
+ to send, if there are some.
+ - we prefer to delta against larger objects, so that we have
+ lots of removals.
+
+ The penultimate rule is for "thin" packs. It is used when
+ the other side is known to have such objects.
+
+There it is again. "Thin" packs. I'm thinking to myself, "What
+is a 'thin' pack?" So I ask:
+
+ <jdl> What is a "thin" pack?
+
+ <gitster> Use of --objects-edge to rev-list as the upstream of
+ pack-objects. The pack transfer protocol negotiates that.
+
+Woo hoo! Cleared that _right_ up!
+
+ <gitster> There are two directions - push and fetch.
+
+There! Did you see it? It is not '"push" and "pull"'! How often the
+confusion has started here. So casually mentioned, too!
+
+ <gitster> For push, git-send-pack invokes git-receive-pack on the
+ other end. The receive-pack says "I have up to these commits".
+ send-pack looks at them, and computes what are missing from
+ the other end. So "thin" could be the default there.
+
+ In the other direction, fetch, git-fetch-pack and
+ git-clone-pack invokes git-upload-pack on the other end
+ (via ssh or by talking to the daemon).
+
+ There are two cases: fetch-pack with -k and clone-pack is one,
+ fetch-pack without -k is the other. clone-pack and fetch-pack
+ with -k will keep the downloaded packfile without expanded, so
+ we do not use thin pack transfer. Otherwise, the generated
+ pack will have delta without base object in the same pack.
+
+ But fetch-pack without -k will explode the received pack into
+ individual objects, so we automatically ask upload-pack to
+ give us a thin pack if upload-pack supports it.
+
+OK then.
+
+Uh.
+
+Let's return to the previous conversation still in progress.
+
+ <njs`> and "basename" means something like "the tail of end of
+ path of file objects and dir objects, as per basename(3), and
+ we just declare all commit and tag objects to have the same
+ basename" or something?
+
+Luckily, that too is a point that gitster clarified for us!
+
+If I might add, the trick is to make files that _might_ be similar be
+located close to each other in the hash buckets based on their file
+names. It used to be that "foo/Makefile", "bar/baz/quux/Makefile" and
+"Makefile" all landed in the same bucket due to their common basename,
+"Makefile". However, now they land in "close" buckets.
+
+The algorithm allows not just for the _same_ bucket, but for _close_
+buckets to be considered delta candidates. The rationale is
+essentially that files, like Makefiles, often have very similar
+content no matter what directory they live in.
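+
+As a rough sketch of the scheme gitster describes above (basename hash
+in the high bits, leading-path hash in the low bits), the idea looks
+roughly like the code below. The hash function and the bit split are
+made up here purely for illustration; this is not the actual
+pack-objects code:
+
+------------------------
+#include <stdint.h>
+#include <string.h>
+
+#define DIR_BITS 10	/* illustrative split, not Git's actual value */
+
+/* toy string hash, standing in for whatever the real code uses */
+static uint32_t str_hash(const char *s, size_t len)
+{
+	uint32_t h = 5381;
+
+	while (len--)
+		h = (h * 33) ^ (unsigned char)*s++;
+	return h;
+}
+
+/*
+ * Basename hash in the high bits, leading-path hash in the low bits:
+ * "foo/Makefile" and "bar/baz/quux/Makefile" share the high bits and
+ * therefore land in nearby buckets of the delta sort order.
+ */
+static uint32_t name_hash(const char *path)
+{
+	const char *base = strrchr(path, '/');
+	size_t dir_len = base ? (size_t)(base - path) : 0;
+
+	base = base ? base + 1 : path;
+	return (str_hash(base, strlen(base)) << DIR_BITS) |
+	       (str_hash(path, dir_len) & ((1u << DIR_BITS) - 1));
+}
+------------------------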
+
+ <linus> I played around with different delta algorithms, and with
+ making the "delta window" bigger, but having too big of a
+ sliding window makes it very expensive to generate the pack:
+ you need to compare every object with a _ton_ of other objects.
+
+ There are a number of other trivial heuristics too, which
+ basically boil down to "don't bother even trying to delta this
+ pair" if we can tell before-hand that the delta isn't worth it
+ (due to size differences, where we can take a previous delta
+ result into account to decide that "ok, no point in trying
+ that one, it will be worse").
+
+ End result: packing is actually very size efficient. It's
+ somewhat CPU-wasteful, but on the other hand, since you're
+ really only supposed to do it maybe once a month (and you can
+ do it during the night), nobody really seems to care.
+
+Nice Engineering Touch, there. Find when it doesn't matter, and
+proclaim it a non-issue. Good style too!
+
+ <njs`> So, just to repeat to see if I'm following, we start by
+ getting a list of the objects we want to pack, we sort it by
+ this heuristic (basically lexicographically on the tuple
+ (type, basename, size)).
+
+ Then we walk through this list, and calculate a delta of
+ each object against the last n (tunable parameter) objects,
+ and pick the smallest of these deltas.
+
+Vastly simplified, but the essence is there!
+
+ <linus> Correct.
+
+ <njs`> And then once we have picked a delta or fulltext to
+ represent each object, we re-sort by recency, and write them
+ out in that order.
+
+ <linus> Yup. Some other small details:
+
+And of course there is the "Other Shoe" Factor too.
+
+ <linus> - We limit the delta depth to another magic value (right
+ now both the window and delta depth magic values are just "10")
+
+ <njs`> Hrm, my intuition is that you'd end up with really _bad_ IO
+ patterns, because the things you want are near by, but to
+ actually reconstruct them you may have to jump all over in
+ random ways.
+
+ <linus> - When we write out a delta, and we haven't yet written
+ out the object it is a delta against, we write out the base
+ object first. And no, when we reconstruct them, we actually
+ get nice IO patterns, because:
+ - larger objects tend to be "more recent" (Linus' law: files grow)
+ - we actively try to generate deltas from a larger object to a
+ smaller one
+ - this means that the top-of-tree very seldom has deltas
+ (i.e. deltas in _practice_ are "backwards deltas")
+
+Again, we should reread that whole paragraph. Not just because
+Linus has slipped Linus's Law in there on us, but because it is
+important. Let's make sure we clarify some of the points here:
+
+ <njs`> So the point is just that in practice, delta order and
+ recency order match each other quite well.
+
+ <linus> Yes. There's another nice side to this (and yes, it was
+ designed that way ;):
+ - the reason we generate deltas against the larger object is
+ actually a big space saver too!
+
+ <njs`> Hmm, but your last comment (if "we haven't yet written out
+ the object it is a delta against, we write out the base object
+ first"), seems like it would make these facts mostly
+ irrelevant because even if in practice you would not have to
+ wander around much, in fact you just brute-force say that in
+ the cases where you might have to wander, don't do that :-)
+
+ <linus> Yes and no. Notice the rule: we only write out the base
+ object first if the delta against it was more recent. That
+ means that you can actually have deltas that refer to a base
+ object that is _not_ close to the delta object, but that only
+ happens when the delta is needed to generate an _old_ object.
+
+ <linus> See?
+
+Yeah, no. I missed that on the first two or three readings myself.
+
+ <linus> This keeps the front of the pack dense. The front of the
+ pack never contains data that isn't relevant to a "recent"
+ object. The size optimization comes from our use of xdelta
+ (but is true for many other delta algorithms): removing data
+ is cheaper (in size) than adding data.
+
+ When you remove data, you only need to say "copy bytes n--m".
+ In contrast, in a delta that _adds_ data, you have to say "add
+ these bytes: 'actual data goes here'"
+
+ *** njs` has quit: Read error: 104 (Connection reset by peer)
+
+ <linus> Uhhuh. I hope I didn't blow njs` mind.
+
+ *** njs` has joined channel #git
+
+ <pasky> :)
+
+The silent observers are amused. Of course.
+
+And as if njs` was expected to be omniscient:
+
+ <linus> njs - did you miss anything?
+
+OK, I'll spell it out. That's Geek Humor. If njs` was not actually
+connected for a little bit there, how would he know if he missed
+anything while he was disconnected? He's a benevolent dictator with a
+sense of humor! Well noted!
+
+ <njs`> Stupid router. Or gremlins, or whatever.
+
+It's a cheap shot at Cisco. Take 'em when you can.
+
+ <njs`> Yes and no. Notice the rule: we only write out the base
+ object first if the delta against it was more recent.
+
+ I'm getting lost in all these orders, let me re-read :-)
+ So the write-out order is from most recent to least recent?
+ (Conceivably it could be the opposite way too, I'm not sure if
+ we've said) though my connection back at home is logging, so I
+ can just read what you said there :-)
+
+And for those of you paying attention, the Omniscient Trick has just
+been detailed!
+
+ <linus> Yes, we always write out most recent first
+
+ <njs`> And, yeah, I got the part about deeper-in-history stuff
+ having worse IO characteristics, one sort of doesn't care.
+
+ <linus> With the caveat that if the "most recent" needs an older
+ object to delta against (hey, shrinking sometimes does
+ happen), we write out the old object with the delta.
+
+ <njs`> (if only it happened more...)
+
+ <linus> Anyway, the pack-file could easily be denser still, but
+ because it's used both for streaming (the Git protocol) and
+ for on-disk, it has a few pessimizations.
+
+Actually, it is a made-up word. But it is a made-up word being
+used as setup for a later optimization, which is a real word:
+
+ <linus> In particular, while the pack-file is then compressed,
+ it's compressed just one object at a time, so the actual
+ compression factor is less than it could be in theory. But it
+ means that it's all nice random-access with a simple index to
+ do "object name->location in packfile" translation.
+
+ <njs`> I'm assuming the real win for delta-ing large->small is
+ more homogeneous statistics for gzip to run over?
+
+ (You have to put the bytes in one place or another, but
+ putting them in a larger blob wins on compression)
+
+ Actually, what is the compression strategy -- each delta
+ individually gzipped, the whole file gzipped, somewhere in
+ between, no compression at all, ....?
+
+ Right.
+
+Reality IRC sets in. For example:
+
+ <pasky> I'll read the rest in the morning, I really have to go
+ sleep or there's no hope whatsoever for me at the today's
+ exam... g'nite all.
+
+Heh.
+
+ <linus> pasky: g'nite
+
+ <njs`> pasky: 'luck
+
+ <linus> Right: large->small matters exactly because of compression
+ behaviour. If it was non-compressed, it probably wouldn't make
+ any difference.
+
+ <njs`> yeah
+
+ <linus> Anyway: I'm not even trying to claim that the pack-files
+ are perfect, but they do tend to have a nice balance of
+ density vs ease-of use.
+
+Gasp! OK, saved. That's a fair Engineering trade off. Close call!
+In fact, Linus reflects on some Basic Engineering Fundamentals,
+design options, etc.
+
+ <linus> More importantly, they allow Git to still _conceptually_
+ never deal with deltas at all, and be a "whole object" store.
+
+ Which has some problems (we discussed bad huge-file
+ behaviour on the Git lists the other day), but it does mean
+ that the basic Git concepts are really really simple and
+ straightforward.
+
+ It's all been quite stable.
+
+ Which I think is very much a result of having very simple
+ basic ideas, so that there's never any confusion about what's
+ going on.
+
+ Bugs happen, but they are "simple" bugs. And bugs that
+ actually get some object store detail wrong are almost always
+ so obvious that they never go anywhere.
+
+ <njs`> Yeah.
+
+Nuff said.
+
+ <linus> Anyway. I'm off for bed. It's not 6AM here, but I've got
+ three kids, and have to get up early in the morning to send
+ them off. I need my beauty sleep.
+
+ <njs`> :-)
+
+ <njs`> appreciate the infodump, I really was failing to find the
+ details on Git packs :-)
+
+And now you know the rest of the story.
diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
new file mode 100644
index 0000000..9d453d4
--- /dev/null
+++ b/Documentation/technical/packfile-uri.txt
@@ -0,0 +1,82 @@
+Packfile URIs
+=============
+
+This feature allows servers to serve part of their packfile response as URIs.
+This allows server designs that improve scalability in bandwidth and CPU usage
+(for example, by serving some data through a CDN), and (in the future) provides
+some measure of resumability to clients.
+
+This feature is available only in protocol version 2.
+
+Protocol
+--------
+
+The server advertises the `packfile-uris` capability.
+
+If the client then communicates which protocols (HTTPS, etc.) it supports with
+a `packfile-uris` argument, the server MAY send a `packfile-uris` section
+directly before the `packfile` section (right after `wanted-refs` if it is
+sent) containing URIs of any of the given protocols. The URIs point to
+packfiles that use only features that the client has declared that it supports
+(e.g. ofs-delta and thin-pack). See linkgit:gitprotocol-v2[5] for the documentation of
+this section.
+
+Clients should then download and index all the given URIs (in addition to
+downloading and indexing the packfile given in the `packfile` section of the
+response) before performing the connectivity check.
+
+Server design
+-------------
+
+The server can be trivially made compatible with the proposed protocol by
+having it advertise `packfile-uris`, tolerating the client sending
+`packfile-uris`, and never sending any `packfile-uris` section. But we should
+include some sort of non-trivial implementation in the Minimum Viable Product,
+at least so that we can test the client.
+
+This is the implementation: a feature, marked experimental, that allows the
+server to be configured by one or more `uploadpack.blobPackfileUri=
+<object-hash> <pack-hash> <uri>` entries. Whenever the list of objects to be
+sent is assembled, all such blobs are excluded, replaced with URIs. As noted
+in "Future work" below, the server can evolve in the future to support
+excluding other objects (or other implementations of servers could be made
+that support excluding other objects) without needing a protocol change, so
+clients should not expect that packfiles downloaded in this way only contain
+single blobs.
+
+Client design
+-------------
+
+The client has a config variable `fetch.uriprotocols` that determines which
+protocols the end user is willing to use. By default, this is empty.
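+
+For example, a user willing to let Git follow packfile URIs served over
+HTTPS could opt in with something like the following (shown with a
+single protocol name; this is only an illustration of the opt-in, not a
+full description of the value syntax):
+
+------------------------
+$ git config fetch.uriprotocols https
+------------------------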
+
+When the client downloads the given URIs, it should store them with "keep"
+files, just like it does with the packfile in the `packfile` section. These
+additional "keep" files can only be removed after the refs have been updated -
+just like the "keep" file for the packfile in the `packfile` section.
+
+The division of work (initial fetch + additional URIs) introduces convenient
+points for resumption of an interrupted clone - such resumption can be done
+after the Minimum Viable Product (see "Future work").
+
+Future work
+-----------
+
+The protocol design allows some evolution of the server and client without any
+need for protocol changes, so only a small-scoped design is included here to
+form the MVP. For example, the following can be done:
+
+ * On the server, more sophisticated means of excluding objects (e.g. by
+ specifying a commit to represent that commit and all objects that it
+ references).
+ * On the client, resumption of clone. If a clone is interrupted, information
+ could be recorded in the repository's config and a "clone-resume" command
+ can resume the clone in progress. (Resumption of subsequent fetches is more
+ difficult because that must deal with the user wanting to use the repository
+ even after the fetch was interrupted.)
+
+There are some possible features that will require a change in protocol:
+
+ * Additional HTTP headers (e.g. authentication)
+ * Byte range support
+ * Different file formats referenced by URIs (e.g. raw object)
diff --git a/Documentation/technical/parallel-checkout.txt b/Documentation/technical/parallel-checkout.txt
new file mode 100644
index 0000000..b4a144e
--- /dev/null
+++ b/Documentation/technical/parallel-checkout.txt
@@ -0,0 +1,270 @@
+Parallel Checkout Design Notes
+==============================
+
+The "Parallel Checkout" feature attempts to use multiple processes to
+parallelize the work of uncompressing the blobs, applying in-core
+filters, and writing the resulting contents to the working tree during a
+checkout operation. It can be used by all checkout-related commands,
+such as `clone`, `checkout`, `reset`, `sparse-checkout`, and others.
+
+These commands share the following basic structure:
+
+* Step 1: Read the current index file into memory.
+
+* Step 2: Modify the in-memory index based upon the command, and
+ temporarily mark all cache entries that need to be updated.
+
+* Step 3: Populate the working tree to match the new candidate index.
+  This includes iterating over all of the to-be-updated cache entries
+  and deleting, creating, or overwriting the associated files in the
+  working tree.
+
+* Step 4: Write the new index to disk.
+
+Step 3 is the focus of the "parallel checkout" effort described here.
+
+Sequential Implementation
+-------------------------
+
+For the purposes of discussion here, the current sequential
+implementation of Step 3 is divided into 3 parts, each one implemented
+in its own function:
+
+* Step 3a: `unpack-trees.c:check_updates()` contains a series of
+ sequential loops iterating over the `cache_entry`'s array. The main
+ loop in this function calls the Step 3b function for each of the
+ to-be-updated entries.
+
+* Step 3b: `entry.c:checkout_entry()` examines the existing working tree
+ for file conflicts, collisions, and unsaved changes. It removes files
+ and creates leading directories as necessary. It calls the Step 3c
+ function for each entry to be written.
+
+* Step 3c: `entry.c:write_entry()` loads the blob into memory, smudges
+ it if necessary, creates the file in the working tree, writes the
+ smudged contents, calls `fstat()` or `lstat()`, and updates the
+ associated `cache_entry` struct with the stat information gathered.
+
+It wouldn't be safe to perform Step 3b in parallel, as there could be
+race conditions between file creations and removals. Instead, the
+parallel checkout framework lets the sequential code handle Step 3b,
+and uses parallel workers to replace the sequential
+`entry.c:write_entry()` calls from Step 3c.
+
+Rejected Multi-Threaded Solution
+--------------------------------
+
+The most "straightforward" implementation would be to spread the set of
+to-be-updated cache entries across multiple threads. But due to the
+thread-unsafe functions in the object database code, we would have to use locks to
+coordinate the parallel operation. An early prototype of this solution
+showed that the multi-threaded checkout would bring performance
+improvements over the sequential code, but there was still too much lock
+contention. A `perf` profiling indicated that around 20% of the runtime
+during a local Linux clone (on an SSD) was spent in locking functions.
+For this reason this approach was rejected in favor of using multiple
+child processes, which led to better performance.
+
+Multi-Process Solution
+----------------------
+
+Parallel checkout alters the aforementioned Step 3 to use multiple
+`checkout--worker` background processes to distribute the work. The
+long-running worker processes are controlled by the foreground Git
+command using the existing run-command API.
+
+Overview
+~~~~~~~~
+
+Step 3b is only slightly altered; for each entry to be checked out, the
+main process performs the following steps:
+
+* M1: Check whether there is any untracked or unclean file in the
+ working tree which would be overwritten by this entry, and decide
+ whether to proceed (removing the file(s)) or not.
+
+* M2: Create the leading directories.
+
+* M3: Load the conversion attributes for the entry's path.
+
+* M4: Check, based on the entry's type and conversion attributes,
+ whether the entry is eligible for parallel checkout (more on this
+ later). If it is eligible, enqueue the entry and the loaded
+ attributes to later write the entry in parallel. If not, write the
+ entry right away, using the default sequential code.
+
+Note: we save the conversion attributes associated with each entry
+because the workers don't have access to the main process' index state,
+so they can't load the attributes by themselves (and the attributes are
+needed to properly smudge the entry). Additionally, this has a positive
+impact on performance as (1) we don't need to load the attributes twice
+and (2) the attributes machinery is optimized to handle paths in
+sequential order.
+
+After all entries have passed through the above steps, the main process
+checks if the number of enqueued entries is sufficient to spread among
+the workers. If not, it just writes them sequentially. Otherwise, it
+spawns the workers and distributes the queued entries uniformly in
+continuous chunks. This aims to minimize the chances of two workers
+writing to the same directory simultaneously, which could increase lock
+contention in the kernel.
+
+Then, for each assigned item, each worker:
+
+* W1: Checks if there is any non-directory file in the leading part of
+  the entry's path or if there already exists a file at the entry's
+  path. If so, it marks the entry with `PC_ITEM_COLLIDED` and skips it
+  (more on this later).
+
+* W2: Creates the file (with O_CREAT and O_EXCL).
+
+* W3: Loads the blob into memory (inflating and delta reconstructing
+ it).
+
+* W4: Applies any required in-process filter, like end-of-line
+ conversion and re-encoding.
+
+* W5: Writes the result to the file descriptor opened at W2.
+
+* W6: Calls `fstat()` or `lstat()` on the just-written path, and sends
+ the result back to the main process, together with the end status of
+ the operation and the item's identification number.
+
+Note that, when possible, steps W3 to W5 are delegated to the streaming
+machinery, removing the need to keep the entire blob in memory.
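+
+To make step W2 concrete, the sketch below shows the kind of `open()`
+call a worker could use and how a collision surfaces as an error. It
+only illustrates the mechanism; it is not the actual parallel-checkout
+code, and error handling is reduced to the bare minimum:
+
+----------------------------------------------
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+
+/*
+ * Create the destination file as in step W2.  O_CREAT | O_EXCL makes
+ * the call fail if something already exists at this path, e.g. because
+ * another path mapped to the same file on a case-insensitive
+ * filesystem, so the caller can mark the entry as collided.
+ */
+static int create_checkout_file(const char *path)
+{
+	int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0666);
+
+	if (fd < 0 && (errno == EEXIST || errno == EISDIR))
+		fprintf(stderr, "path collision detected at '%s'\n", path);
+	return fd;
+}
+----------------------------------------------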
+
+If the worker fails to read the blob or to write it to the working tree,
+it removes the created file to avoid leaving empty files behind. This is
+the *only* time a worker is allowed to remove a file.
+
+As mentioned earlier, it is the responsibility of the main process to
+remove any file that blocks the checkout operation (or abort if the
+removal(s) would cause data loss and the user didn't ask to `--force`).
+This is crucial to avoid race conditions and also to properly detect
+path collisions at Step W1.
+
+After the workers finish writing the items and sending back the required
+information, the main process handles the results in two steps:
+
+- First, it updates the in-memory index with the `lstat()` information
+ sent by the workers. (This must be done first as this information
+ might be required in the following step.)
+
+- Then it writes the items which collided on disk (i.e. items marked
+ with `PC_ITEM_COLLIDED`). More on this below.
+
+Path Collisions
+---------------
+
+Path collisions happen when two different paths correspond to the same
+entry in the file system. E.g. the paths 'a' and 'A' would collide in a
+case-insensitive file system.
+
+The sequential checkout deals with collisions in the same way that it
+deals with files that were already present in the working tree before
+checkout. Basically, it checks if the path that it wants to write
+already exists on disk, makes sure the existing file doesn't have
+unsaved data, and then overwrites it. (To be more pedantic: it deletes
+the existing file and creates the new one.) So, if there are multiple
+colliding files to be checked out, the sequential code will write each
+one of them but only the last will actually survive on disk.
+
+Parallel checkout aims to reproduce the same behavior. However, we
+cannot let the workers racily write to the same file on disk. Instead,
+the workers detect when the entry that they want to check out would
+collide with an existing file, and mark it with `PC_ITEM_COLLIDED`.
+Later, the main process can sequentially feed these entries back to
+`checkout_entry()` without the risk of race conditions. On clone, this
+also has the effect of marking the colliding entries to later emit a
+warning for the user, like the classic sequential checkout does.
+
+The workers are able to detect both collisions among the entries being
+concurrently written and collisions between a parallel-eligible entry
+and an ineligible entry. The general idea for collision detection is
+quite straightforward: for each parallel-eligible entry, the main
+process must remove all files that prevent this entry from being written
+(before enqueueing it). This includes any non-directory file in the
+leading path of the entry. Later, when a worker gets assigned the entry,
+it looks again for the non-directory files and for an already existing
+file at the entry's path. If any of these checks finds something, the
+worker knows that there was a path collision.
+
+Because parallel checkout can distinguish path collisions from the case
+where the file was already present in the working tree before checkout,
+we could alternatively choose to skip the checkout of colliding entries.
+However, each entry that doesn't get written would have NULL `lstat()`
+fields on the index. This could cause performance penalties for
+subsequent commands that need to refresh the index, as they would have
+to go to the file system to see if the entry is dirty. Thus, if we have
+N entries in a colliding group and we decide to write and `lstat()` only
+one of them, every subsequent `git-status` will have to read, convert,
+and hash the written file N - 1 times. By checking out all colliding
+entries (like the sequential code does), we only pay the overhead once,
+during checkout.
+
+Eligible Entries for Parallel Checkout
+--------------------------------------
+
+As previously mentioned, not all entries passed to `checkout_entry()`
+will be considered eligible for parallel checkout. More specifically, we
+exclude:
+
+- Symbolic links; to avoid race conditions that, in combination with
+ path collisions, could cause workers to write files at the wrong
+ place. For example, if we were to concurrently check out a symlink
+ 'a' -> 'b' and a regular file 'A/f' in a case-insensitive file system,
+ we could potentially end up writing the file 'A/f' at 'a/f', due to a
+ race condition.
+
+- Regular files that require external filters (either "one shot" filters
+ or long-running process filters). These filters are black-boxes to Git
+ and may have their own internal locking or non-concurrent assumptions.
+ So it might not be safe to run multiple instances in parallel.
++
+Besides, long-running filters may use the delayed checkout feature to
+postpone the return of some filtered blobs. The delayed checkout queue
+and the parallel checkout queue are not compatible and should remain
+separate.
++
+Note: regular files that only require internal filters, like end-of-line
+conversion and re-encoding, are eligible for parallel checkout.
+
+Ineligible entries are checked out by the classic sequential codepath
+*before* spawning workers.
+
+Note: submodules' files are also eligible for parallel checkout (as
+long as they don't fall into any of the excluding categories mentioned
+above). But since each submodule is checked out in its own child
+process, we don't mix the superproject's and the submodules' files in
+the same parallel checkout process or queue.
+
+The API
+-------
+
+The parallel checkout API was designed with the goal of minimizing
+changes to the current users of the checkout machinery. This means that
+they don't have to call a different function for sequential or parallel
+checkout. As already mentioned, `checkout_entry()` will automatically
+insert the given entry in the parallel checkout queue when this feature
+is enabled and the entry is eligible; otherwise, it will just write the
+entry right away, using the sequential code. In general, callers of the
+parallel checkout API should look similar to this:
+
+----------------------------------------------
+int pc_workers, pc_threshold, err = 0;
+struct checkout state;
+
+get_parallel_checkout_configs(&pc_workers, &pc_threshold);
+
+/*
+ * This check is not strictly required, but it
+ * should save some time in sequential mode.
+ */
+if (pc_workers > 1)
+ init_parallel_checkout();
+
+for (each cache_entry ce to-be-updated)
+ err |= checkout_entry(ce, &state, NULL, NULL);
+
+err |= run_parallel_checkout(&state, pc_workers, pc_threshold, NULL, NULL);
+----------------------------------------------
diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt
new file mode 100644
index 0000000..cd948b0
--- /dev/null
+++ b/Documentation/technical/partial-clone.txt
@@ -0,0 +1,367 @@
+Partial Clone Design Notes
+==========================
+
+The "Partial Clone" feature is a performance optimization for Git that
+allows Git to function without having a complete copy of the repository.
+The goal of this work is to allow Git to better handle extremely large
+repositories.
+
+During clone and fetch operations, Git downloads the complete contents
+and history of the repository. This includes all commits, trees, and
+blobs for the complete life of the repository. For extremely large
+repositories, clones can take hours (or days) and consume 100+GiB of disk
+space.
+
+Often in these repositories there are many blobs and trees that the user
+does not need such as:
+
+ 1. files outside of the user's work area in the tree. For example, in
+ a repository with 500K directories and 3.5M files in every commit,
+ we can avoid downloading many objects if the user only needs a
+ narrow "cone" of the source tree.
+
+ 2. large binary assets. For example, in a repository where large build
+ artifacts are checked into the tree, we can avoid downloading all
+ previous versions of these non-mergeable binary assets and only
+ download versions that are actually referenced.
+
+Partial clone allows us to avoid downloading such unneeded objects *in
+advance* during clone and fetch operations and thereby reduce download
+times and disk usage. Missing objects can later be "demand fetched"
+if/when needed.
+
+A remote that can later provide the missing objects is called a
+promisor remote, as it promises to send the objects when
+requested. Initially Git supported only one promisor remote, the origin
+remote from which the user cloned and that was configured in the
+"extensions.partialClone" config option. Later support for more than
+one promisor remote has been implemented.
+
+Use of partial clone requires that the user be online and the origin
+remote or other promisor remotes be available for on-demand fetching
+of missing objects. This may or may not be problematic for the user.
+For example, if the user can stay within the pre-selected subset of
+the source tree, they may not encounter any missing objects.
+Alternatively, the user could try to pre-fetch various objects if they
+know that they are going offline.
+
+
+Non-Goals
+---------
+
+Partial clone is a mechanism to limit the number of blobs and trees downloaded
+*within* a given range of commits -- and is therefore independent of and not
+intended to conflict with existing DAG-level mechanisms to limit the set of
+requested commits (i.e. shallow clone, single branch, or fetch '<refspec>').
+
+
+Design Overview
+---------------
+
+Partial clone logically consists of the following parts:
+
+- A mechanism for the client to describe unneeded or unwanted objects to
+ the server.
+
+- A mechanism for the server to omit such unwanted objects from packfiles
+ sent to the client.
+
+- A mechanism for the client to gracefully handle missing objects (that
+ were previously omitted by the server).
+
+- A mechanism for the client to backfill missing objects as needed.
+
+
+Design Details
+--------------
+
+- A new pack-protocol capability "filter" is added to the fetch-pack and
+ upload-pack negotiation.
++
+This uses the existing capability discovery mechanism.
+See "filter" in linkgit:gitprotocol-pack[5].
+
+- Clients pass a "filter-spec" to clone and fetch which is passed to the
+ server to request filtering during packfile construction.
++
+There are various filters available to accommodate different situations.
+See "--filter=<filter-spec>" in Documentation/rev-list-options.txt.
+
+- On the server pack-objects applies the requested filter-spec as it
+ creates "filtered" packfiles for the client.
++
+These filtered packfiles are *incomplete* in the traditional sense because
+they may contain objects that reference objects not contained in the
+packfile and that the client doesn't already have. For example, the
+filtered packfile may contain trees or tags that reference missing blobs
+or commits that reference missing trees.
+
+- On the client these incomplete packfiles are marked as "promisor packfiles"
+ and treated differently by various commands.
+
+- On the client a repository extension is added to the local config to
+  prevent older versions of git from failing mid-operation because of
+  missing objects that they cannot handle.
+  See "extensions.partialClone" in Documentation/technical/repository-version.txt.
+
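+
+As a point of reference, the pieces above are what make invocations
+like the following work; `blob:none` and `blob:limit=<n>` are two of
+the available filter-specs, and the URL is just a placeholder:
+
+------------------------
+$ git clone --filter=blob:none https://example.com/big-repo.git
+$ git fetch --filter=blob:limit=1m origin
+------------------------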
+
+Handling Missing Objects
+------------------------
+
+- An object may be missing due to a partial clone or fetch, or missing
+ due to repository corruption. To differentiate these cases, the
+ local repository specially indicates such filtered packfiles
+ obtained from promisor remotes as "promisor packfiles".
++
+These promisor packfiles consist of a "<name>.promisor" file with
+arbitrary contents (like the "<name>.keep" files), in addition to
+their "<name>.pack" and "<name>.idx" files.
+
+- The local repository considers a "promisor object" to be an object that
+ it knows (to the best of its ability) that promisor remotes have promised
+ that they have, either because the local repository has that object in one of
+ its promisor packfiles, or because another promisor object refers to it.
++
+When Git encounters a missing object, Git can see if it is a promisor object
+and handle it appropriately. If not, Git can report a corruption.
++
+This means that there is no need for the client to explicitly maintain an
+expensive-to-modify list of missing objects.[a]
+
+- Since almost all Git code currently expects any referenced object to be
+ present locally and because we do not want to force every command to do
+ a dry-run first, a fallback mechanism is added to allow Git to attempt
+ to dynamically fetch missing objects from promisor remotes.
++
+When the normal object lookup fails to find an object, Git invokes
+promisor_remote_get_direct() to try to get the object from a promisor
+remote and then retry the object lookup. This allows objects to be
+"faulted in" without complicated prediction algorithms.
++
+For efficiency reasons, no check as to whether the missing object is
+actually a promisor object is performed.
++
+Dynamic object fetching tends to be slow as objects are fetched one at
+a time.
+
+- `checkout` (and any other command using `unpack-trees`) has been taught
+ to bulk pre-fetch all required missing blobs in a single batch.
+
+- `rev-list` has been taught to print missing objects.
++
+This can be used by other commands to bulk prefetch objects.
+For example, a "git log -p A..B" may internally want to first do
+something like "git rev-list --objects --quiet --missing=print A..B"
+and prefetch those objects in bulk.
+
+- `fsck` has been updated to be fully aware of promisor objects.
+
+- `repack` in GC has been updated to not touch promisor packfiles at all,
+ and to only repack other objects.
+
+- The global variable "fetch_if_missing" is used to control whether an
+ object lookup will attempt to dynamically fetch a missing object or
+ report an error.
++
+We are not happy with this global variable and would like to remove it,
+but that requires significant refactoring of the object code to pass an
+additional flag.
+
+
+Fetching Missing Objects
+------------------------
+
+- Fetching of objects is done by invoking a "git fetch" subprocess.
+
+- The local repository sends a request with the hashes of all requested
+ objects, and does not perform any packfile negotiation.
+ It then receives a packfile.
+
+- Because we are reusing the existing fetch mechanism, fetching
+ currently fetches all objects referred to by the requested objects, even
+ though they are not necessary.
+
+- Fetching with `--refetch` will request a complete new filtered packfile from
+ the remote, which can be used to change a filter without needing to
+ dynamically fetch missing objects.
+
+Using many promisor remotes
+---------------------------
+
+Many promisor remotes can be configured and used.
+
+This allows, for example, a user to have multiple geographically-close
+cache servers for fetching missing blobs while continuing to do
+filtered `git-fetch` commands from the central server.
+
+When fetching objects, promisor remotes are tried one after the other
+until all the objects have been fetched.
+
+Remotes that are considered "promisor" remotes are those specified by
+the following configuration variables:
+
+- `extensions.partialClone = <name>`
+
+- `remote.<name>.promisor = true`
+
+- `remote.<name>.partialCloneFilter = ...`
+
+Only one promisor remote can be configured using the
+`extensions.partialClone` config variable. This promisor remote will
+be the last one tried when fetching objects.
+
+We decided to make it the last one we try, because it is likely that
+someone using many promisor remotes is doing so because the other
+promisor remotes are better for some reason (maybe they are closer or
+faster for some kind of objects) than the origin, and the origin is
+likely to be the remote specified by extensions.partialClone.
+
+This justification is not very strong, but one choice had to be made,
+and anyway the long term plan should be to make the order somehow
+fully configurable.
+
+For now though the other promisor remotes will be tried in the order
+they appear in the config file.
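+
+For illustration, a setup along the lines described above could look
+like this; "cache" is a hypothetical geographically-close server and
+"origin" is the server the repository was cloned from:
+
+------------------------
+$ git config remote.cache.promisor true
+$ git config remote.cache.partialCloneFilter blob:none
+$ git config extensions.partialClone origin
+------------------------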
+
+Current Limitations
+-------------------
+
+- It is not possible to specify the order in which the promisor
+  remotes are tried other than by the order in which they appear in
+  the config file.
++
+It is also not possible to specify an order to be used when fetching
+from one remote and a different order when fetching from another
+remote.
+
+- It is not possible to push only specific objects to a promisor
+ remote.
++
+It is not possible to push at the same time to multiple promisor
+remotes in a specific order.
+
+- Dynamic object fetching will only ask promisor remotes for missing
+ objects. We assume that promisor remotes have a complete view of the
+ repository and can satisfy all such requests.
+
+- Repack essentially treats promisor and non-promisor packfiles as 2
+ distinct partitions and does not mix them.
+
+- Dynamic object fetching invokes fetch-pack once *for each item*
+ because most algorithms stumble upon a missing object and need to have
+ it resolved before continuing their work. This may incur significant
+ overhead -- and multiple authentication requests -- if many objects are
+ needed.
+
+- Dynamic object fetching currently uses the existing pack protocol V0
+ which means that each object is requested via fetch-pack. The server
+ will send a full set of info/refs when the connection is established.
+ If there are a large number of refs, this may incur significant overhead.
+
+
+Future Work
+-----------
+
+- Improve the way to specify the order in which promisor remotes are
+ tried.
++
+For example this could allow specifying explicitly something like:
+"When fetching from this remote, I want to use these promisor remotes
+in this order, though, when pushing or fetching to that remote, I want
+to use those promisor remotes in that order."
+
+- Allow pushing to promisor remotes.
++
+The user might want to work in a triangular work flow with multiple
+promisor remotes that each have an incomplete view of the repository.
+
+- Allow non-pathname-based filters to make use of packfile bitmaps (when
+ present). This was just an omission during the initial implementation.
+
+- Investigate use of a long-running process to dynamically fetch a series
+ of objects, such as proposed in [5,6] to reduce process startup and
+ overhead costs.
++
+It would be nice if pack protocol V2 could allow that long-running
+process to make a series of requests over a single long-running
+connection.
+
+- Investigate pack protocol V2 to avoid the info/refs broadcast on
+ each connection with the server to dynamically fetch missing objects.
+
+- Investigate the need to handle loose promisor objects.
++
+Objects in promisor packfiles are allowed to reference missing objects
+that can be dynamically fetched from the server. An assumption was
+made that loose objects are only created locally and therefore should
+not reference a missing object. We may need to revisit that assumption
+if, for example, we dynamically fetch a missing tree and store it as a
+loose object rather than a single object packfile.
++
+This does not necessarily mean we need to mark loose objects as promisor;
+it may be sufficient to relax the object lookup or is-promisor functions.
+
+
+Non-Tasks
+---------
+
+- Every time the subject of "demand loading blobs" comes up it seems
+ that someone suggests that the server be allowed to "guess" and send
+ additional objects that may be related to the requested objects.
++
+No work has gone into actually doing that; we're just documenting that
+it is a common suggestion. We're not sure how it would work and have
+no plans to work on it.
++
+It is valid for the server to send more objects than requested (even
+for a dynamic object fetch), but we are not building on that.
+
+
+Footnotes
+---------
+
+[a] expensive-to-modify list of missing objects: Earlier in the design of
+ partial clone we discussed the need for a single list of missing objects.
+ This would essentially be a sorted linear list of OIDs that were
+ omitted by the server during a clone or subsequent fetches.
+
+This file would need to be loaded into memory on every object lookup.
+It would need to be read, updated, and re-written (like the .git/index)
+on every explicit "git fetch" command *and* on any dynamic object fetch.
+
+The cost to read, update, and write this file could add significant
+overhead to every command if there are many missing objects. For example,
+if there are 100M missing blobs, this file would be at least 2GiB on disk.
+
+With the "promisor" concept, we *infer* a missing object based upon the
+type of packfile that references it.
+
+
+Related Links
+-------------
+[0] https://crbug.com/git/2
+ Bug#2: Partial Clone
+
+[1] https://lore.kernel.org/git/20170113155253.1644-1-benpeart@microsoft.com/ +
+ Subject: [RFC] Add support for downloading blobs on demand +
+ Date: Fri, 13 Jan 2017 10:52:53 -0500
+
+[2] https://lore.kernel.org/git/cover.1506714999.git.jonathantanmy@google.com/ +
+ Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) +
+ Date: Fri, 29 Sep 2017 13:11:36 -0700
+
+[3] https://lore.kernel.org/git/20170426221346.25337-1-jonathantanmy@google.com/ +
+ Subject: Proposal for missing blob support in Git repos +
+ Date: Wed, 26 Apr 2017 15:13:46 -0700
+
+[4] https://lore.kernel.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ +
+ Subject: [PATCH 00/10] RFC Partial Clone and Fetch +
+ Date: Wed, 8 Mar 2017 18:50:29 +0000
+
+[5] https://lore.kernel.org/git/20170505152802.6724-1-benpeart@microsoft.com/ +
+ Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module +
+ Date: Fri, 5 May 2017 11:27:52 -0400
+
+[6] https://lore.kernel.org/git/20170714132651.170708-1-benpeart@microsoft.com/ +
+ Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand +
+ Date: Fri, 14 Jul 2017 09:26:50 -0400
diff --git a/Documentation/technical/racy-git.txt b/Documentation/technical/racy-git.txt
new file mode 100644
index 0000000..59bea66
--- /dev/null
+++ b/Documentation/technical/racy-git.txt
@@ -0,0 +1,201 @@
+Use of index and Racy Git problem
+=================================
+
+Background
+----------
+
+The index is one of the most important data structures in Git.
+It represents a virtual working tree state by recording a list of
+paths and their object names, and serves as a staging area to
+write out the next tree object to be committed. The state is
+"virtual" in the sense that it does not necessarily have to, and
+often does not, match the files in the working tree.
+
+There are cases where Git needs to examine the differences between the
+virtual working tree state in the index and the files in the
+working tree. The most obvious case is when the user asks `git
+diff` (or its low level implementation, `git diff-files`) or
+`git-ls-files --modified`. In addition, Git internally checks
+if the files in the working tree are different from what is
+recorded in the index to avoid stomping on local changes in them
+during patch application, switching branches, and merging.
+
+In order to speed up this comparison between the files in the
+working tree and the index entries, the index entries record the
+information obtained from the filesystem via `lstat(2)` system
+call when they were last updated. When checking if they differ,
+Git first runs `lstat(2)` on the files and compares the result
+with this information (this is what was originally done by the
+`ce_match_stat()` function, but the current code does it in
+`ce_match_stat_basic()` function). If some of these "cached
+stat information" fields do not match, Git can tell that the
+files are modified without even looking at their contents.
+
+Note: not all members in `struct stat` obtained via `lstat(2)`
+are used for this comparison. For example, `st_atime` obviously
+is not useful. Currently, Git compares the file type (regular
+files vs symbolic links) and executable bits (only for regular
+files) from `st_mode` member, `st_mtime` and `st_ctime`
+timestamps, `st_uid`, `st_gid`, `st_ino`, and `st_size` members.
+With a `USE_STDEV` compile-time option, `st_dev` is also
+compared, but this is not enabled by default because this member
+is not stable on network filesystems. With `USE_NSEC`
+compile-time option, `st_mtim.tv_nsec` and `st_ctim.tv_nsec`
+members are also compared. On Linux, this is not enabled by default
+because in-core timestamps can have finer granularity than
+on-disk timestamps, resulting in meaningless changes when an
+inode is evicted from the inode cache. See commit 8ce13b0
+of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
+([PATCH] Sync in core time granularity with filesystems,
+2005-01-04). This patch is included in kernel 2.6.11 and newer, but
+only fixes the issue for file systems with exactly 1 ns or 1 s
+resolution. Other file systems are still broken in current Linux
+kernels (e.g. CEPH, CIFS, NTFS, UDF), see
+https://lore.kernel.org/lkml/5577240D.7020309@gmail.com/
+
+Racy Git
+--------
+
+There is one slight problem with the optimization based on the
+cached stat information. Consider this sequence:
+
+ : modify 'foo'
+ $ git update-index 'foo'
+ : modify 'foo' again, in-place, without changing its size
+
+The first `update-index` computes the object name of the
+contents of file `foo` and updates the index entry for `foo`
+along with the `struct stat` information. If the modification
+that follows it happens very fast so that the file's `st_mtime`
+timestamp does not change, after this sequence, the cached stat
+information the index entry records still exactly matches what you
+would see in the filesystem, even though the file `foo` is now
+different.
+This way, Git can incorrectly think files in the working tree
+are unmodified even though they actually are. This is called
+the "racy Git" problem (discovered by Pasky), and the entries
+that appear clean when they may not be because of this problem
+are called "racily clean".
+
+To avoid this problem, Git does two things:
+
+. When the cached stat information says the file has not been
+ modified, and the `st_mtime` is the same as (or newer than)
+ the timestamp of the index file itself (which is the time `git
+ update-index foo` finished running in the above example), it
+ also compares the contents with the object registered in the
+ index entry to make sure they match.
+
+. When writing out a new version of the index file that contains
+  racily clean entries, the cached `st_size` information of those
+  entries is truncated to zero first.
+
+Because the index file itself is written after collecting all
+the stat information from updated paths, its `st_mtime` timestamp
+is usually the same as or newer than that of any of the paths the
+index contains. And no matter how quickly the modification that
+follows `git update-index foo` finishes, the resulting
+`st_mtime` timestamp on `foo` cannot get a value earlier
+than that of the index file. Therefore, index entries that can be
+racily clean are limited to the ones that have the same
+timestamp as the index file itself.
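+
+In code, this timestamp comparison boils down to something like the
+check below. This is only a sketch of the idea, not the actual index
+code (which, among other things, also compares the nanosecond fields
+when `USE_NSEC` is enabled):
+
+------------------------
+#include <sys/stat.h>
+#include <time.h>
+
+/*
+ * An entry whose cached mtime is not older than the index file's own
+ * timestamp might be racily clean; only then do we need to fall back
+ * to comparing the file contents with the recorded object.
+ */
+static int possibly_racy(const struct stat *cached, time_t index_mtime)
+{
+	return cached->st_mtime >= index_mtime;
+}
+------------------------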
+
+The callers that want to check if an index entry matches the
+corresponding file in the working tree continue to call
+`ce_match_stat()`, but with this change, `ce_match_stat()` uses
+`ce_modified_check_fs()` to see if racily clean ones are
+actually clean after comparing the cached stat information using
+`ce_match_stat_basic()`.
+
+The problem the latter solves is this sequence:
+
+ $ git update-index 'foo'
+ : modify 'foo' in-place without changing its size
+ : wait for enough time
+ $ git update-index 'bar'
+
+Without the latter, the timestamp of the index file gets a newer
+value, and falsely clean entry `foo` would not be caught by the
+timestamp comparison check done with the former logic anymore.
+The latter makes sure that the cached stat information for `foo`
+would never match with the file in the working tree, so later
+checks by `ce_match_stat_basic()` would report that the index entry
+does not match the file and Git does not have to fall back on more
+expensive `ce_modified_check_fs()`.
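+
+The following is a minimal Python model of the two safeguards described
+above, not Git's actual C implementation; the entry dictionaries, the
+`blob_id()` helper and the `read_working_tree_blob` callback are
+illustrative stand-ins:
+
+....
+import hashlib
+
+def blob_id(data):
+    # object name of a blob, as computed for SHA-1 repositories
+    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()
+
+def entry_is_clean(entry, st, index_mtime, read_working_tree_blob):
+    # cheap check first: cached stat information vs. current lstat(2) data
+    if (entry["size"], entry["mtime"]) != (st["size"], st["mtime"]):
+        return False
+    if entry["mtime"] < index_mtime:
+        return True        # older than the index file: trust the stat data
+    # possibly racily clean: fall back to comparing actual contents
+    return blob_id(read_working_tree_blob()) == entry["id"]
+
+def smudge_racily_clean(entries, index_mtime):
+    # before writing a new index, truncate st_size of racily clean entries
+    # so the cheap stat comparison above can never match them again
+    for entry in entries:
+        if entry["mtime"] >= index_mtime:
+            entry["size"] = 0
+....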
+
+
+Runtime penalty
+---------------
+
+The runtime penalty of falling back to `ce_modified_check_fs()`
+from `ce_match_stat()` can be very expensive when there are many
+racily clean entries. An obvious way to artificially create
+this situation is to give the same timestamp to all the files in
+the working tree in a large project, run `git update-index` on
+them, and give the same timestamp to the index file:
+
+ $ date >.datestamp
+ $ git ls-files | xargs touch -r .datestamp
+ $ git ls-files | git update-index --stdin
+ $ touch -r .datestamp .git/index
+
+This will make all index entries racily clean. In the linux project, for
+example, there are over 20,000 files in the working tree. On my
+Athlon 64 X2 3800+, after the above:
+
+ $ /usr/bin/time git diff-files
+ 1.68user 0.54system 0:02.22elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
+ 0inputs+0outputs (0major+67111minor)pagefaults 0swaps
+ $ git update-index MAINTAINERS
+ $ /usr/bin/time git diff-files
+ 0.02user 0.12system 0:00.14elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
+ 0inputs+0outputs (0major+935minor)pagefaults 0swaps
+
+Running `git update-index` in the middle checked the racily
+clean entries, and left the cached `st_mtime` for all the paths
+intact because they were actually clean (so this step took about
+the same amount of time as the first `git diff-files`). After
+that, they are not racily clean anymore but are truly clean, so
+the second invocation of `git diff-files` fully took advantage
+of the cached stat information.
+
+
+Avoiding runtime penalty
+------------------------
+
+In order to avoid the above runtime penalty, Git after 1.4.2 used
+to have code that made sure the index file got a timestamp newer
+than the youngest file in the index: when there were many young
+files whose timestamp matched the one the resulting index file
+would otherwise have had, it waited before finishing writing the
+index file out.
+
+I suspected that in practice the situation where many paths in the
+index are all racily clean was quite rare. The only code paths
+that can record a recent timestamp for a large number of paths are:
+
+. Initial `git add .` of a large project.
+
+. `git checkout` of a large project from an empty index into an
+ unpopulated working tree.
+
+Note: switching branches with `git checkout` keeps the cached
+stat information of existing working tree files that are the
+same between the current branch and the new branch, which are
+all older than the resulting index file, and they will not
+become racily clean. Only the files that are actually checked
+out can become racily clean.
+
+In a large project where raciness avoidance cost really matters,
+however, the initial computation of all object names in the
+index takes more than one second, and the index file is written
+out after all that happens. Therefore the timestamp of the
+index file will be more than one second later than the
+youngest file in the working tree. This means that in these
+cases there actually will not be any racily clean entry in
+the resulting index.
+
+Based on this discussion, the current code does not use the
+"workaround" to avoid the runtime penalty that does not exist in
+practice anymore. This was done with commit 0fc82cff on Aug 15,
+2006.
diff --git a/Documentation/technical/reftable.txt b/Documentation/technical/reftable.txt
new file mode 100644
index 0000000..dd0b37c
--- /dev/null
+++ b/Documentation/technical/reftable.txt
@@ -0,0 +1,1098 @@
+reftable
+--------
+
+Overview
+~~~~~~~~
+
+Problem statement
+^^^^^^^^^^^^^^^^^
+
+Some repositories contain a lot of references (e.g. android at 866k,
+rails at 31k). The existing packed-refs format takes up a lot of space
+(e.g. 62M), and does not scale with additional references. Lookup of a
+single reference requires linearly scanning the file.
+
+Atomic pushes modifying multiple references require copying the entire
+packed-refs file, which can be a considerable amount of data moved
+(e.g. 62M in, 62M out) for even small transactions (2 refs modified).
+
+Repositories with many loose references occupy a large number of disk
+blocks from the local file system, as each reference is its own file
+storing 41 bytes (and another file for the corresponding reflog). This
+negatively affects the number of inodes available when a large number of
+repositories are stored on the same filesystem. Readers can be penalized
+due to the larger number of syscalls required to traverse and read the
+`$GIT_DIR/refs` directory.
+
+
+Objectives
+^^^^^^^^^^
+
+* Near constant time lookup for any single reference, even when the
+repository is cold and not in process or kernel cache.
+* Near constant time verification if an object name is referred to by at least
+one reference (for allow-tip-sha1-in-want).
+* Efficient enumeration of an entire namespace, such as `refs/tags/`.
+* Support atomic push with `O(size_of_update)` operations.
+* Combine reflog storage with ref storage for small transactions.
+* Separate reflog storage for base refs and historical logs.
+
+Description
+^^^^^^^^^^^
+
+A reftable file is a portable binary file format customized for
+reference storage. References are sorted, enabling linear scans, binary
+search lookup, and range scans.
+
+Storage in the file is organized into variable sized blocks. Prefix
+compression is used within a single block to reduce disk space. Block
+size and alignment are tunable by the writer.
+
+Performance
+^^^^^^^^^^^
+
+Space used, packed-refs vs. reftable:
+
+[cols=",>,>,>,>,>",options="header",]
+|===============================================================
+|repository |packed-refs |reftable |% original |avg ref |avg obj
+|android |62.2 M |36.1 M |58.0% |33 bytes |5 bytes
+|rails |1.8 M |1.1 M |57.7% |29 bytes |4 bytes
+|git |78.7 K |48.1 K |61.0% |50 bytes |4 bytes
+|git (heads) |332 b |269 b |81.0% |33 bytes |0 bytes
+|===============================================================
+
+Scan (read 866k refs), by reference name lookup (single ref from 866k
+refs), and by SHA-1 lookup (refs with that SHA-1, from 866k refs):
+
+[cols=",>,>,>,>",options="header",]
+|=========================================================
+|format |cache |scan |by name |by SHA-1
+|packed-refs |cold |402 ms |409,660.1 usec |412,535.8 usec
+|packed-refs |hot | |6,844.6 usec |20,110.1 usec
+|reftable |cold |112 ms |33.9 usec |323.2 usec
+|reftable |hot | |20.2 usec |320.8 usec
+|=========================================================
+
+Space used for 149,932 log entries for 43,061 refs, reflog vs. reftable:
+
+[cols=",>,>",options="header",]
+|================================
+|format |size |avg entry
+|$GIT_DIR/logs |173 M |1209 bytes
+|reftable |5 M |37 bytes
+|================================
+
+Details
+~~~~~~~
+
+Peeling
+^^^^^^^
+
+References stored in a reftable are peeled: a record for an annotated
+(or signed) tag records both the tag object and the object it refers
+to. This is analogous to storage in the packed-refs format.
+
+Reference name encoding
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Reference names are an uninterpreted sequence of bytes that must pass
+linkgit:git-check-ref-format[1] as a valid reference name.
+
+Key unicity
+^^^^^^^^^^^
+
+Each entry must have a unique key; repeated keys are disallowed.
+
+Network byte order
+^^^^^^^^^^^^^^^^^^
+
+All multi-byte, fixed width fields are in network byte order.
+
+Varint encoding
+^^^^^^^^^^^^^^^
+
+Varint encoding is identical to the ofs-delta encoding method used
+within pack files.
+
+Decoder works as follows:
+
+....
+val = buf[ptr] & 0x7f
+while (buf[ptr] & 0x80) {
+ ptr++
+ val = ((val + 1) << 7) | (buf[ptr] & 0x7f)
+}
+....
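+
+For illustration, here is a Python sketch of this decoder together with
+the matching encoder (the `+1` offsetting is what distinguishes this
+encoding from a plain base-128 varint); the function names are
+illustrative only:
+
+....
+def decode_varint(buf, ptr=0):
+    val = buf[ptr] & 0x7f
+    while buf[ptr] & 0x80:
+        ptr += 1
+        val = ((val + 1) << 7) | (buf[ptr] & 0x7f)
+    return val, ptr + 1          # value and offset just past the varint
+
+def encode_varint(value):
+    out = [value & 0x7f]
+    value >>= 7
+    while value:
+        value -= 1
+        out.append(0x80 | (value & 0x7f))
+        value >>= 7
+    return bytes(reversed(out))  # most significant group first
+
+# round trip: decode_varint(encode_varint(16383))[0] == 16383
+....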
+
+Ordering
+^^^^^^^^
+
+Blocks are lexicographically ordered by their first reference.
+
+Directory/file conflicts
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+The reftable format accepts both `refs/heads/foo` and
+`refs/heads/foo/bar` as distinct references.
+
+This property is useful for retaining log records in reftable, but may
+confuse versions of Git using `$GIT_DIR/refs` directory tree to maintain
+references. Users of reftable may choose to continue to reject `foo` and
+`foo/bar` type conflicts to prevent problems for peers.
+
+File format
+~~~~~~~~~~~
+
+Structure
+^^^^^^^^^
+
+A reftable file has the following high-level structure:
+
+....
+first_block {
+ header
+ first_ref_block
+}
+ref_block*
+ref_index*
+obj_block*
+obj_index*
+log_block*
+log_index*
+footer
+....
+
+A log-only file omits the `ref_block`, `ref_index`, `obj_block` and
+`obj_index` sections, containing only the file header and log block:
+
+....
+first_block {
+ header
+}
+log_block*
+log_index*
+footer
+....
+
+In a log-only file, the first log block immediately follows the file
+header, without padding to block alignment.
+
+Block size
+^^^^^^^^^^
+
+The file's block size is arbitrarily determined by the writer, and does
+not have to be a power of 2. The block size must be larger than the
+longest reference name or log entry used in the repository, as
+references cannot span blocks.
+
+Powers of two that are friendly to the virtual memory system or
+filesystem (such as 4k or 8k) are recommended. Larger sizes (64k) can
+yield better compression, with a possible increased cost incurred by
+readers during access.
+
+The largest block size is `16777215` bytes (15.99 MiB).
+
+Block alignment
+^^^^^^^^^^^^^^^
+
+Writers may choose to align blocks at multiples of the block size by
+including `padding` filled with NUL bytes at the end of a block to round
+out to the chosen alignment. When alignment is used, writers must
+specify the alignment with the file header's `block_size` field.
+
+Block alignment is not required by the file format. Unaligned files must
+set `block_size = 0` in the file header, and omit `padding`. Unaligned
+files with more than one ref block must include the link:#Ref-index[ref
+index] to support fast lookup. Readers must be able to read both aligned
+and non-aligned files.
+
+Very small files (e.g. a single ref block) may omit `padding` and the ref
+index to reduce total file size.
+
+Header (version 1)
+^^^^^^^^^^^^^^^^^^
+
+A 24-byte header appears at the beginning of the file:
+
+....
+'REFT'
+uint8( version_number = 1 )
+uint24( block_size )
+uint64( min_update_index )
+uint64( max_update_index )
+....
+
+Aligned files must specify `block_size` to configure readers with the
+expected block alignment. Unaligned files must set `block_size = 0`.
+
+The `min_update_index` and `max_update_index` describe bounds for the
+`update_index` field of all log records in this file. When reftables are
+used in a stack for link:#Update-transactions[transactions], these
+fields can order the files such that the prior file's
+`max_update_index + 1` is the next file's `min_update_index`.
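+
+A sketch in Python of parsing this header, assuming only the field
+layout given above (not a reference implementation):
+
+....
+import struct
+
+def parse_header_v1(buf):
+    if buf[0:4] != b'REFT' or buf[4] != 1:
+        raise ValueError('not a version 1 reftable')
+    block_size = int.from_bytes(buf[5:8], 'big')   # uint24, network byte order
+    min_update_index, max_update_index = struct.unpack('>QQ', buf[8:24])
+    return block_size, min_update_index, max_update_index
+....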
+
+Header (version 2)
+^^^^^^^^^^^^^^^^^^
+
+A 28-byte header appears at the beginning of the file:
+
+....
+'REFT'
+uint8( version_number = 2 )
+uint24( block_size )
+uint64( min_update_index )
+uint64( max_update_index )
+uint32( hash_id )
+....
+
+The header is identical to `version_number=1`, with the 4-byte hash ID
+("sha1" for SHA1 and "s256" for SHA-256) appended to the header.
+
+For maximum backward compatibility, it is recommended to use version 1 when
+writing SHA1 reftables.
+
+First ref block
+^^^^^^^^^^^^^^^
+
+The first ref block shares the same block as the file header, and is 24
+bytes smaller than all other blocks in the file. The first block
+immediately begins after the file header, at position 24.
+
+If the first block is a log block (a log-only file), its block header
+begins immediately at position 24.
+
+Ref block format
+^^^^^^^^^^^^^^^^
+
+A ref block is written as:
+
+....
+'r'
+uint24( block_len )
+ref_record+
+uint24( restart_offset )+
+uint16( restart_count )
+
+padding?
+....
+
+Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which
+encodes the number of bytes in the block up to, but not including the
+optional `padding`. This is always less than or equal to the file's
+block size. In the first ref block, `block_len` includes 24 bytes for
+the file header.
+
+The 2-byte `restart_count` stores the number of entries in the
+`restart_offset` list, which must not be empty. Readers can use
+`restart_count` to binary search between restarts before starting a
+linear scan.
+
+Exactly `restart_count` 3-byte `restart_offset` values precede the
+`restart_count`. Offsets are relative to the start of the block and
+refer to the first byte of any `ref_record` whose name has not been
+prefix compressed. Entries in the `restart_offset` list must be sorted,
+ascending. Readers can start linear scans from any of these records.
+
+A variable number of `ref_record` fill the middle of the block,
+describing reference names and values. The format is described below.
+
+As the first ref block shares the first file block with the file header,
+all `restart_offset` in the first block are relative to the start of the
+file (position 0), and include the file header. This forces the first
+`restart_offset` to be `28`.
+
+ref record
+++++++++++
+
+A `ref_record` describes a single reference, storing both the name and
+its value(s). Records are formatted as:
+
+....
+varint( prefix_length )
+varint( (suffix_length << 3) | value_type )
+suffix
+varint( update_index_delta )
+value?
+....
+
+The `prefix_length` field specifies how many leading bytes of the prior
+reference record's name should be copied to obtain this reference's
+name. This must be 0 for the first reference in any block, and also must
+be 0 for any `ref_record` whose offset is listed in the `restart_offset`
+table at the end of the block.
+
+Recovering a reference name from any `ref_record` is a simple concat:
+
+....
+this_name = prior_name[0..prefix_length] + suffix
+....
+
+The `suffix_length` value provides the number of bytes available in
+`suffix` to copy from `suffix` to complete the reference name.
+
+The `update_index` that last modified the reference can be obtained by
+adding `update_index_delta` to the `min_update_index` from the file
+header: `min_update_index + update_index_delta`.
+
+The `value` follows. Its format is determined by `value_type`, one of
+the following:
+
+* `0x0`: deletion; no value data (see transactions, below)
+* `0x1`: one object name; value of the ref
+* `0x2`: two object names; value of the ref, peeled target
+* `0x3`: symbolic reference: `varint( target_len ) target`
+
+Symbolic references use `0x3`, followed by the complete name of the
+reference target. No compression is applied to the target name.
+
+Types `0x4..0x7` are reserved for future use.
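+
+Putting the pieces together, a Python sketch of decoding a single
+`ref_record`, reusing the `decode_varint()` sketch from the varint
+section above; `hash_len` would be 20 for SHA-1 or 32 for SHA-256:
+
+....
+def decode_ref_record(buf, pos, prior_name, hash_len=20):
+    prefix_len, pos = decode_varint(buf, pos)
+    packed, pos = decode_varint(buf, pos)
+    suffix_len, value_type = packed >> 3, packed & 0x7
+    name = prior_name[:prefix_len] + buf[pos:pos + suffix_len]
+    pos += suffix_len
+    update_index_delta, pos = decode_varint(buf, pos)
+    if value_type == 0x1:                  # one object name
+        value, pos = buf[pos:pos + hash_len], pos + hash_len
+    elif value_type == 0x2:                # value and peeled target
+        value, pos = buf[pos:pos + 2 * hash_len], pos + 2 * hash_len
+    elif value_type == 0x3:                # symbolic reference
+        target_len, pos = decode_varint(buf, pos)
+        value, pos = buf[pos:pos + target_len], pos + target_len
+    else:                                  # 0x0: deletion, no value
+        value = None
+    return name, value_type, update_index_delta, value, pos
+....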
+
+Ref index
+^^^^^^^^^
+
+The ref index stores the name of the last reference from every ref block
+in the file, enabling reduced disk seeks for lookups. Any reference can
+be found by searching the index, identifying the containing block, and
+searching within that block.
+
+The index may be organized into a multi-level index, where the 1st level
+index block points to additional ref index blocks (2nd level), which may
+in turn point to either additional index blocks (e.g. 3rd level) or ref
+blocks (leaf level). Disk reads required to access a ref go up with
+higher index levels. Multi-level indexes may be required to ensure no
+single index block exceeds the file format's max block size of
+`16777215` bytes (15.99 MiB). To achieve constant O(1) disk seeks for
+lookups the index must be a single level, which is permitted to exceed
+the file's configured block size, but not the format's max block size of
+15.99 MiB.
+
+If present, the ref index block(s) appears after the last ref block.
+
+If there are at least 4 ref blocks, a ref index block should be written
+to improve lookup times. Cold reads using the index require 2 disk reads
+(read index, read block), and binary searching < 4 blocks also requires
+<= 2 reads. Omitting the index block from smaller files saves space.
+
+If the file is unaligned and contains more than one ref block, the ref
+index must be written.
+
+Index block format:
+
+....
+'i'
+uint24( block_len )
+index_record+
+uint24( restart_offset )+
+uint16( restart_count )
+
+padding?
+....
+
+The index blocks begin with `block_type = 'i'` and a 3-byte `block_len`
+which encodes the number of bytes in the block, up to but not including
+the optional `padding`.
+
+The `restart_offset` and `restart_count` fields are identical in format,
+meaning and usage as in ref blocks.
+
+To reduce the number of reads required for random access in very large
+files the index block may be larger than other blocks. However, readers
+must hold the entire index in memory to benefit from this, so it's a
+time-space tradeoff in both file size and reader memory.
+
+Increasing the file's block size decreases the index size. Alternatively
+a multi-level index may be used, keeping index blocks within the file's
+block size, but increasing the number of blocks that need to be
+accessed.
+
+index record
+++++++++++++
+
+An index record describes the last entry in another block. Index records
+are written as:
+
+....
+varint( prefix_length )
+varint( (suffix_length << 3) | 0 )
+suffix
+varint( block_position )
+....
+
+Index records use prefix compression exactly like `ref_record`.
+
+Index records store `block_position` after the suffix, specifying the
+absolute position in bytes (from the start of the file) of the block
+that ends with this reference. Readers can seek to `block_position` to
+begin reading the block header.
+
+Readers must examine the block header at `block_position` to determine
+if the next block is another level index block, or the leaf-level ref
+block.
+
+Reading the index
++++++++++++++++++
+
+Readers loading the ref index must first read the footer (below) to
+obtain `ref_index_position`. If not present, the position will be 0. The
+`ref_index_position` is for the 1st level root of the ref index.
+
+Obj block format
+^^^^^^^^^^^^^^^^
+
+Object blocks are optional. Writers may choose to omit object blocks,
+especially if readers will not use the object name to ref mapping.
+
+Object blocks use unique, abbreviated 2-31 byte object name keys, mapping to
+ref blocks containing references pointing to that object directly, or as
+the peeled value of an annotated tag. Like ref blocks, object blocks use
+the file's standard block size. The abbreviation length is available in
+the footer as `obj_id_len`.
+
+To save space in small files, object blocks may be omitted if the ref
+index is not present, as brute force search will only need to read a few
+ref blocks. When missing, readers should brute force a linear search of
+all references to lookup by object name.
+
+An object block is written as:
+
+....
+'o'
+uint24( block_len )
+obj_record+
+uint24( restart_offset )+
+uint16( restart_count )
+
+padding?
+....
+
+Fields are identical to ref block. Binary search using the restart table
+works the same as in reference blocks.
+
+Because object names are abbreviated by writers to the shortest unique
+abbreviation within the reftable, obj keys have a variable length. Their
+length must be at least 2 bytes. Readers must compare only for common prefix
+match within an obj block or obj index.
+
+obj record
+++++++++++
+
+An `obj_record` describes a single object abbreviation, and the blocks
+containing references using that unique abbreviation:
+
+....
+varint( prefix_length )
+varint( (suffix_length << 3) | cnt_3 )
+suffix
+varint( cnt_large )?
+varint( position_delta )*
+....
+
+Like in reference blocks, abbreviations are prefix compressed within an
+obj block. On large reftables with many unique objects, higher block
+sizes (64k), and higher restart interval (128), a `prefix_length` of 2
+or 3 and `suffix_length` of 3 may be common in obj records (unique
+abbreviation of 5-6 raw bytes, 10-12 hex digits).
+
+Each record contains `position_count` number of positions for matching
+ref blocks. For 1-7 positions the count is stored in `cnt_3`. When
+`cnt_3 = 0` the actual count follows in a varint, `cnt_large`.
+
+The use of `cnt_3` bets that most objects are pointed to by only a single
+reference, some may be pointed to by a couple of references, and very
+few (if any) are pointed to by more than 7 references.
+
+A special case exists when `cnt_3 = 0` and `cnt_large = 0`: there are no
+`position_delta`, but at least one reference starts with this
+abbreviation. A reader that needs exact reference names must scan all
+references to find which specific references have the desired object.
+Writers should use this format when the `position_delta` list would have
+overflowed the file's block size due to a high number of references
+pointing to the same object.
+
+The first `position_delta` is the position from the start of the file.
+Additional `position_delta` entries are sorted ascending and relative to
+the prior entry, e.g. a reader would perform:
+
+....
+pos = position_delta[0]
+prior = pos
+for (j = 1; j < position_count; j++) {
+ pos = prior + position_delta[j]
+ prior = pos
+}
+....
+
+With a position in hand, a reader must linearly scan the ref block,
+starting from the first `ref_record`, testing each reference's object names
+(for `value_type = 0x1` or `0x2`) for full equality. Faster searching by
+object name within a single ref block is not supported by the reftable format.
+Smaller block sizes reduce the number of candidates this step must
+consider.
+
+Obj index
+^^^^^^^^^
+
+The obj index stores the abbreviation from the last entry for every obj
+block in the file, enabling reduced disk seeks for all lookups. It is
+formatted exactly the same as the ref index, but refers to obj blocks.
+
+The obj index should be present if obj blocks are present, as obj blocks
+should only be written in larger files.
+
+Readers loading the obj index must first read the footer (below) to
+obtain `obj_index_position`. If not present, the position will be 0.
+
+Log block format
+^^^^^^^^^^^^^^^^
+
+Unlike ref and obj blocks, log blocks are always unaligned.
+
+Log blocks are variable in size, and do not match the `block_size`
+specified in the file header or footer. Writers should choose an
+appropriate buffer size to prepare a log block for deflation, such as
+`2 * block_size`.
+
+A log block is written as:
+
+....
+'g'
+uint24( block_len )
+zlib_deflate {
+ log_record+
+ uint24( restart_offset )+
+ uint16( restart_count )
+}
+....
+
+Log blocks look similar to ref blocks, except `block_type = 'g'`.
+
+The 4-byte block header is followed by the deflated block contents using
+zlib deflate. The `block_len` in the header is the inflated size
+(including 4-byte block header), and should be used by readers to
+preallocate the inflation output buffer. A log block's `block_len` may
+exceed the file's block size.
+
+Offsets within the log block (e.g. `restart_offset`) still include the
+4-byte header. Readers may prefer prefixing the inflation output buffer
+with the 4-byte header.
+
+Within the deflate container, a variable number of `log_record` describe
+reference changes. The log record format is described below. See ref
+block format (above) for a description of `restart_offset` and
+`restart_count`.
+
+Because log blocks have no alignment or padding between blocks, readers
+must keep track of the bytes consumed by the inflater to know where the
+next log block begins.
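+
+A Python sketch of reading one log block, assuming the deflated payload
+is a zlib-wrapped stream; tracking the bytes consumed by the inflater is
+the important part:
+
+....
+import zlib
+
+def read_log_block(buf, pos):
+    assert buf[pos:pos + 1] == b'g'
+    block_len = int.from_bytes(buf[pos + 1:pos + 4], 'big')  # inflated size
+    d = zlib.decompressobj()
+    # prefix the output with the 4-byte header so restart_offset values,
+    # which include the header, can be used directly
+    inflated = buf[pos:pos + 4] + d.decompress(buf[pos + 4:])
+    assert d.eof and len(inflated) == block_len
+    next_pos = len(buf) - len(d.unused_data)   # where the next block begins
+    return inflated, next_pos
+....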
+
+log record
+++++++++++
+
+Log record keys are structured as:
+
+....
+ref_name '\0' reverse_int64( update_index )
+....
+
+where `update_index` is the unique transaction identifier. The
+`update_index` field must be unique within the scope of a `ref_name`.
+See the update transactions section below for further details.
+
+The `reverse_int64` function inverts the value so that, when the
+network byte order encoding is ordered lexicographically, the more
+recent records with higher `update_index` values sort first:
+
+....
+reverse_int64(int64 t) {
+ return 0xffffffffffffffff - t;
+}
+....
+
+Log records have a similar starting structure to ref and index records,
+utilizing the same prefix compression scheme applied to the log record
+key described above.
+
+....
+ varint( prefix_length )
+ varint( (suffix_length << 3) | log_type )
+ suffix
+ log_data {
+ old_id
+ new_id
+ varint( name_length ) name
+ varint( email_length ) email
+ varint( time_seconds )
+ sint16( tz_offset )
+ varint( message_length ) message
+ }?
+....
+
+Log record entries use `log_type` to indicate what follows:
+
+* `0x0`: deletion; no log data.
+* `0x1`: standard git reflog data using `log_data` above.
+
+The `log_type = 0x0` is mostly useful for `git stash drop`, removing an
+entry from the reflog of `refs/stash` in a transaction file (below),
+without needing to rewrite larger files. Readers reading a stack of
+reflogs must treat this as a deletion.
+
+For `log_type = 0x1`, the `log_data` section follows
+linkgit:git-update-ref[1] logging and includes:
+
+* two object names (old id, new id)
+* varint string of committer's name
+* varint string of committer's email
+* varint time in seconds since epoch (Jan 1, 1970)
+* 2-byte timezone offset in minutes (signed)
+* varint string of message
+
+`tz_offset` is the signed offset in minutes from GMT the committer was
+at the time of the update. For example `GMT-0800` is encoded in reftable
+as `sint16(-480)` and `GMT+0230` is `sint16(150)`.
+
+The committer email does not contain `<` or `>`, it's the value normally
+found between the `<>` in a git commit object header.
+
+The `message_length` may be 0, in which case there was no message
+supplied for the update.
+
+Contrary to traditional reflog (which is a file), renames are encoded as
+a combination of ref deletion and ref creation. A deletion is a log
+record with a zero new_id, and a creation is a log record with a zero old_id.
+
+Reading the log
++++++++++++++++
+
+Readers accessing the log must first read the footer (below) to
+determine the `log_position`. The first block of the log begins at
+`log_position` bytes since the start of the file. The `log_position` is
+not block aligned.
+
+Importing logs
+++++++++++++++
+
+When importing from `$GIT_DIR/logs` writers should globally order all
+log records roughly by timestamp while preserving file order, and assign
+unique, increasing `update_index` values for each log line. Newer log
+records get higher `update_index` values.
+
+Although an import may write only a single reftable file, the reftable
+file must span many unique `update_index`, as each log line requires its
+own `update_index` to preserve semantics.
+
+Log index
+^^^^^^^^^
+
+The log index stores the log key
+(`refname \0 reverse_int64(update_index)`) for the last log record of
+every log block in the file, supporting bounded-time lookup.
+
+A log index block must be written if 2 or more log blocks are written to
+the file. If present, the log index appears after the last log block.
+There is no padding used to align the log index to block alignment.
+
+Log index format is identical to ref index, except the keys are 9 bytes
+longer to include `'\0'` and the 8-byte `reverse_int64(update_index)`.
+Records use `block_position` to refer to the start of a log block.
+
+Reading the index
++++++++++++++++++
+
+Readers loading the log index must first read the footer (below) to
+obtain `log_index_position`. If not present, the position will be 0.
+
+Footer
+^^^^^^
+
+After the last block of the file, a file footer is written. It begins
+like the file header, but is extended with additional data.
+
+....
+ HEADER
+
+ uint64( ref_index_position )
+ uint64( (obj_position << 5) | obj_id_len )
+ uint64( obj_index_position )
+
+ uint64( log_position )
+ uint64( log_index_position )
+
+ uint32( CRC-32 of above )
+....
+
+If a section is missing (e.g. ref index) the corresponding position
+field (e.g. `ref_index_position`) will be 0.
+
+* `obj_position`: byte position for the first obj block.
+* `obj_id_len`: number of bytes used to abbreviate object names in
+obj blocks.
+* `log_position`: byte position for the first log block.
+* `ref_index_position`: byte position for the start of the ref index.
+* `obj_index_position`: byte position for the start of the obj index.
+* `log_index_position`: byte position for the start of the log index.
+
+The size of the footer is 68 bytes for version 1, and 72 bytes for
+version 2.
+
+Reading the footer
+++++++++++++++++++
+
+Readers must first read the file start to determine the version
+number. Then they seek to `file_length - FOOTER_LENGTH` to access the
+footer. A trusted external source (such as `stat(2)`) is necessary to
+obtain `file_length`. When reading the footer, readers must verify:
+
+* 4-byte magic is correct
+* 1-byte version number is recognized
+* 4-byte CRC-32 matches the other 64 bytes (including magic, and
+version)
+
+Once verified, the other fields of the footer can be accessed.
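+
+A Python sketch of this procedure, assuming the standard (zlib) CRC-32
+and the footer layout shown above:
+
+....
+import os, struct, zlib
+
+FOOTER_LEN = {1: 68, 2: 72}
+HEADER_LEN = {1: 24, 2: 28}
+
+def read_footer(path):
+    with open(path, 'rb') as f:
+        start = f.read(5)
+        if start[:4] != b'REFT':
+            raise ValueError('not a reftable file')
+        version = start[4]
+        file_length = os.fstat(f.fileno()).st_size   # trusted external source
+        f.seek(file_length - FOOTER_LEN[version])
+        footer = f.read(FOOTER_LEN[version])
+    stored_crc, = struct.unpack('>I', footer[-4:])
+    if zlib.crc32(footer[:-4]) != stored_crc:
+        raise ValueError('footer CRC-32 mismatch')
+    off = HEADER_LEN[version]
+    (ref_index_position, obj_field, obj_index_position,
+     log_position, log_index_position) = struct.unpack('>5Q', footer[off:off + 40])
+    return {
+        'ref_index_position': ref_index_position,
+        'obj_position': obj_field >> 5,
+        'obj_id_len': obj_field & 0x1f,
+        'obj_index_position': obj_index_position,
+        'log_position': log_position,
+        'log_index_position': log_index_position,
+    }
+....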
+
+Empty tables
+++++++++++++
+
+A reftable may be empty. In this case, the file starts with a header
+and is immediately followed by a footer.
+
+Binary search
+^^^^^^^^^^^^^
+
+Binary search within a block is supported by the `restart_offset` fields
+at the end of the block. Readers can binary search through the restart
+table to locate between which two restart points the sought reference or
+key should appear.
+
+Each record identified by a `restart_offset` stores the complete key in
+the `suffix` field of the record, making the compare operation during
+binary search straightforward.
+
+Once a restart point lexicographically before the sought reference has
+been identified, readers can linearly scan through the following record
+entries to locate the sought record, terminating if the current record
+sorts after (and therefore the sought key is not present).
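+
+A Python sketch of this lookup within a single ref block, reusing
+`decode_ref_record()` from above; `block` is the raw block, `restarts`
+is the decoded `restart_offset` list, and `record_end` is the offset
+where the restart table begins:
+
+....
+def block_lookup(block, restarts, record_end, wanted):
+    def restart_name(off):
+        # prefix_length is 0 at a restart, so the suffix is the full name
+        return decode_ref_record(block, off, b"")[0]
+
+    # binary search the restart table for the last restart key <= wanted
+    lo, hi = 0, len(restarts) - 1
+    while lo < hi:
+        mid = (lo + hi + 1) // 2
+        if restart_name(restarts[mid]) <= wanted:
+            lo = mid
+        else:
+            hi = mid - 1
+
+    # linear scan from that restart point until we reach or pass 'wanted'
+    pos, prior = restarts[lo], b""
+    while pos < record_end:
+        name, vtype, delta, value, pos = decode_ref_record(block, pos, prior)
+        if name == wanted:
+            return vtype, value
+        if name > wanted:
+            break
+        prior = name
+    return None
+....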
+
+Restart point selection
++++++++++++++++++++++++
+
+Writers determine the restart points at file creation. The process is
+arbitrary, but every 16 or 64 records is recommended. Every 16 may be
+more suitable for smaller block sizes (4k or 8k), every 64 for larger
+block sizes (64k).
+
+More frequent restart points reduce prefix compression and increase
+space consumed by the restart table, both of which increase file size.
+
+Less frequent restart points make prefix compression more effective,
+decreasing overall file size, with increased penalties for readers
+walking through more records after the binary search step.
+
+A maximum of `65535` restart points per block is supported.
+
+Considerations
+~~~~~~~~~~~~~~
+
+Lightweight refs dominate
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The reftable format assumes the vast majority of references are single
+object names valued with common prefixes, such as Gerrit Code Review's
+`refs/changes/` namespace, GitHub's `refs/pulls/` namespace, or many
+lightweight tags in the `refs/tags/` namespace.
+
+Annotated tags storing the peeled object cost an additional object name per
+reference.
+
+Low overhead
+^^^^^^^^^^^^
+
+A reftable with very few references (e.g. git.git with 5 heads) is 269
+bytes for reftable, vs. 332 bytes for packed-refs. This supports
+reftable scaling down for transaction logs (below).
+
+Block size
+^^^^^^^^^^
+
+For a Gerrit Code Review type repository with many change refs, larger
+block sizes (64 KiB) and less frequent restart points (every 64) yield
+better compression due to more references within the block compressing
+against the prior reference.
+
+Larger block sizes reduce the index size, as the reftable will require
+fewer blocks to store the same number of references.
+
+Minimal disk seeks
+^^^^^^^^^^^^^^^^^^
+
+Assuming the index block has been loaded into memory, binary searching
+for any single reference requires exactly 1 disk seek to load the
+containing block.
+
+Scans and lookups dominate
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Scanning all references and lookup by name (or namespace such as
+`refs/heads/`) are the most common activities performed on repositories.
+Object names are stored directly with references to optimize this use case.
+
+Logs are infrequently read
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Logs are infrequently accessed, but can be large. Deflating log blocks
+saves disk space, with some increased penalty at read time.
+
+Logs are stored in an isolated section from refs, reducing the burden on
+reference readers that want to ignore logs. Further, historical logs can
+be isolated into log-only files.
+
+Logs are read backwards
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Logs are frequently accessed backwards (most recent N records for master
+to answer `master@{4}`), so log records are grouped by reference, and
+sorted descending by update index.
+
+Repository format
+~~~~~~~~~~~~~~~~~
+
+Version 1
+^^^^^^^^^
+
+A repository must set its `$GIT_DIR/config` to configure reftable:
+
+....
+[core]
+ repositoryformatversion = 1
+[extensions]
+ refStorage = reftable
+....
+
+Layout
+^^^^^^
+
+A collection of reftable files are stored in the `$GIT_DIR/reftable/` directory.
+Their names should have a random element, such that each filename is globally
+unique; this helps avoid spurious failures on Windows, where open files cannot
+be removed or overwritten. It is suggested to use
+`${min_update_index}-${max_update_index}-${random}.ref` as a naming convention.
+
+Log-only files use the `.log` extension, while ref-only and mixed ref
+and log files use the `.ref` extension.
+
+The stack ordering file is `$GIT_DIR/reftable/tables.list` and lists the
+current files, one per line, in order, from oldest (base) to newest
+(most recent):
+
+....
+$ cat .git/reftable/tables.list
+00000001-00000001-RANDOM1.log
+00000002-00000002-RANDOM2.ref
+00000003-00000003-RANDOM3.ref
+....
+
+Readers must read `$GIT_DIR/reftable/tables.list` to determine which
+files are relevant right now, and search through the stack in reverse
+order (last reftable is examined first).
+
+Reftable files not listed in `tables.list` may be new (and about to be
+added to the stack by the active writer), or ancient and ready to be
+pruned.
+
+Backward compatibility
+^^^^^^^^^^^^^^^^^^^^^^
+
+Older clients should continue to recognize the directory as a git
+repository so they don't look for an enclosing repository in parent
+directories. To this end, a reftable-enabled repository must contain the
+following dummy files:
+
+* `.git/HEAD`, a regular file containing `ref: refs/heads/.invalid`.
+* `.git/refs/`, a directory
+* `.git/refs/heads`, a regular file
+
+Readers
+^^^^^^^
+
+Readers can obtain a consistent snapshot of the reference space by
+following:
+
+1. Open and read the `tables.list` file.
+2. Open each of the reftable files that it mentions.
+3. If any of the files is missing, goto 1.
+4. Read from the now-open files as long as necessary.
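+
+A sketch of that retry loop in Python, with error handling kept
+minimal:
+
+....
+import os
+
+def open_stack(reftable_dir):
+    while True:
+        with open(os.path.join(reftable_dir, 'tables.list')) as f:
+            names = f.read().split()                   # 1. read the stack
+        handles = []
+        try:
+            for name in names:                         # 2. open every table
+                handles.append(open(os.path.join(reftable_dir, name), 'rb'))
+        except FileNotFoundError:                      # 3. a file vanished: retry
+            for h in handles:
+                h.close()
+            continue
+        return handles                                 # 4. consistent snapshot
+....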
+
+Update transactions
+^^^^^^^^^^^^^^^^^^^
+
+Although reftables are immutable, mutations are supported by writing a
+new reftable and atomically appending it to the stack:
+
+1. Acquire `tables.list.lock`.
+2. Read `tables.list` to determine current reftables.
+3. Select `update_index` to be most recent file's
+`max_update_index + 1`.
+4. Prepare temp reftable `tmp_XXXXXX`, including log entries.
+5. Rename `tmp_XXXXXX` to `${update_index}-${update_index}-${random}.ref`.
+6. Copy `tables.list` to `tables.list.lock`, appending file from (5).
+7. Rename `tables.list.lock` to `tables.list`.
+
+During step 4 the new file's `min_update_index` and `max_update_index`
+are both set to the `update_index` selected by step 3. All log records
+for the transaction use the same `update_index` in their keys. This
+enables later correlation of which references were updated by the same
+transaction.
+
+Because a single `tables.list.lock` file is used to manage locking, the
+repository is single-threaded for writers. Writers may have to busy-spin
+(with backoff) around creating `tables.list.lock`, for up to an
+acceptable wait period, aborting if the repository is too busy to
+mutate. Application servers wrapped around repositories (e.g. Gerrit
+Code Review) can layer their own lock/wait queue to improve fairness to
+writers.
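+
+For illustration, a Python sketch of steps 1-7; `write_new_table()` is a
+hypothetical callback that produces the new reftable (including its log
+records), and the filename formatting and the parsing of the previous
+`max_update_index` from a file name are illustrative assumptions only (a
+real implementation would read it from the file's header):
+
+....
+import os, secrets
+
+def append_to_stack(reftable_dir, write_new_table):
+    lock_path = os.path.join(reftable_dir, 'tables.list.lock')
+    list_path = os.path.join(reftable_dir, 'tables.list')
+    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)  # 1. lock
+    try:
+        with open(list_path) as f:
+            tables = f.read().split()                              # 2. current stack
+        last_max = int(tables[-1].split('-')[1], 16) if tables else 0
+        update_index = last_max + 1                                # 3. next update_index
+        tmp_path = os.path.join(reftable_dir, 'tmp_' + secrets.token_hex(4))
+        write_new_table(tmp_path, update_index)                    # 4. temp reftable
+        new_name = '%08x-%08x-%s.ref' % (update_index, update_index,
+                                         secrets.token_hex(4))
+        os.rename(tmp_path, os.path.join(reftable_dir, new_name))  # 5. final name
+        os.write(fd, ('\n'.join(tables + [new_name]) + '\n').encode())  # 6. extend
+        os.close(fd)
+        fd = -1
+        os.rename(lock_path, list_path)                            # 7. publish
+    finally:
+        if fd != -1:                                               # failure: drop lock
+            os.close(fd)
+            os.unlink(lock_path)
+....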
+
+Reference deletions
+^^^^^^^^^^^^^^^^^^^
+
+Deletion of any reference can be explicitly stored by setting the `type`
+to `0x0` and omitting the `value` field of the `ref_record`. This serves
+as a tombstone, overriding any assertions about the existence of the
+reference from earlier files in the stack.
+
+Compaction
+^^^^^^^^^^
+
+A partial stack of reftables can be compacted by merging references
+using a straightforward merge join across reftables, selecting the most
+recent value for output, and omitting deleted references that do not
+appear in remaining, lower reftables.
+
+A compacted reftable should set its `min_update_index` to the smallest
+of the input files' `min_update_index`, and its `max_update_index`
+likewise to the largest input `max_update_index`.
+
+For sake of illustration, assume the stack currently consists of
+reftable files (from oldest to newest): A, B, C, and D. The compactor is
+going to compact B and C, leaving A and D alone.
+
+1. Obtain lock `tables.list.lock` and read the `tables.list` file.
+2. Obtain locks `B.lock` and `C.lock`. Ownership of these locks
+prevents other processes from trying to compact these files.
+3. Release `tables.list.lock`.
+4. Compact `B` and `C` into a temp file
+`${min_update_index}-${max_update_index}_XXXXXX`.
+5. Reacquire lock `tables.list.lock`.
+6. Verify that `B` and `C` are still in the stack, in that order. This
+should always be the case, assuming that other processes are adhering to
+the locking protocol.
+7. Rename `${min_update_index}-${max_update_index}_XXXXXX` to
+`${min_update_index}-${max_update_index}-${random}.ref`.
+8. Write the new stack to `tables.list.lock`, replacing `B` and `C`
+with the file from (4).
+9. Rename `tables.list.lock` to `tables.list`.
+10. Delete `B` and `C`, perhaps after a short sleep to avoid forcing
+readers to backtrack.
+
+This strategy permits compactions to proceed independently of updates.
+
+Each reftable (compacted or not) is uniquely identified by its name, so
+open reftables can be cached by their name.
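+
+A Python sketch of the merge join; each input is assumed to be an
+iterator of `(refname, update_index, value)` tuples sorted by refname
+(`value is None` marks a deletion), passed newest first, and
+`bottom_of_stack` says whether no reftables remain below the compacted
+range:
+
+....
+import heapq
+
+def merge_tables(tables_newest_first, bottom_of_stack):
+    # heapq.merge is stable, so for equal refnames the record from the
+    # newest (earliest listed) table wins
+    merged = heapq.merge(*tables_newest_first, key=lambda rec: rec[0])
+    last_name = None
+    for name, update_index, value in merged:
+        if name == last_name:
+            continue            # an older value of the same ref: drop it
+        last_name = name
+        if value is None and bottom_of_stack:
+            continue            # tombstone with nothing below to hide: drop it
+        yield name, update_index, value
+....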
+
+Windows
+^^^^^^^
+
+On Windows, and other systems that do not allow open files to be
+deleted or renamed over, compaction may succeed, but other readers
+may prevent obsolete tables from being deleted.
+
+On these platforms, the following strategy can be followed: on closing a
+reftable stack, reload `tables.list`, and delete any tables no longer mentioned
+in `tables.list`.
+
+Irregular program exit may still leave behind unused files. In this case, a
+cleanup operation should proceed as follows:
+
+* take a lock `tables.list.lock` to prevent concurrent modifications
+* refresh the reftable stack, by reading `tables.list`
+* for each `*.ref` file, remove it if
+** it is not mentioned in `tables.list`, and
+** its max update_index is not beyond the max update_index of the stack
+
+
+Alternatives considered
+~~~~~~~~~~~~~~~~~~~~~~~
+
+bzip packed-refs
+^^^^^^^^^^^^^^^^
+
+`bzip2` can significantly shrink a large packed-refs file (e.g. 62 MiB
+compresses to 23 MiB, 37%). However the bzip format does not support
+random access to a single reference. Readers must inflate and discard
+while performing a linear scan.
+
+Breaking packed-refs into chunks (individually compressing each chunk)
+would reduce the amount of data a reader must inflate, but still leaves
+the problem of indexing chunks to support readers efficiently locating
+the correct chunk.
+
+Given the compression achieved by reftable's encoding, it does not seem
+necessary to add the complexity of bzip/gzip/zlib.
+
+Michael Haggerty's alternate format
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Michael Haggerty proposed
+link:https://lore.kernel.org/git/CAMy9T_HCnyc1g8XWOOWhe7nN0aEFyyBskV2aOMb_fe%2BwGvEJ7A%40mail.gmail.com/[an
+alternate] format to reftable on the Git mailing list. This format uses
+smaller chunks, without the restart table, and avoids block alignment
+with padding. Reflog entries immediately follow each ref, and are thus
+interleaved between refs.
+
+Performance testing indicates reftable is faster for lookups (51%
+faster, 11.2 usec vs. 5.4 usec), although reftable produces a slightly
+larger file (+ ~3.2%, 28.3M vs 29.2M):
+
+[cols=">,>,>,>",options="header",]
+|=====================================
+|format |size |seek cold |seek hot
+|mh-alt |28.3 M |23.4 usec |11.2 usec
+|reftable |29.2 M |19.9 usec |5.4 usec
+|=====================================
+
+JGit Ketch RefTree
+^^^^^^^^^^^^^^^^^^
+
+https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg03073.html[JGit Ketch]
+proposed
+link:https://lore.kernel.org/git/CAJo%3DhJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg%40mail.gmail.com/[RefTree],
+an encoding of references inside Git tree objects stored as part of the
+repository's object database.
+
+The RefTree format adds additional load on the object database storage
+layer (more loose objects, more objects in packs), and relies heavily on
+the packer's delta compression to save space. Namespaces which are flat
+(e.g. thousands of tags in refs/tags) initially create very large loose
+objects, and so RefTree does not address the problem of copying many
+references to modify a handful.
+
+Flat namespaces are not efficiently searchable in RefTree, as tree
+objects in canonical formatting cannot be binary searched. This fails
+the need to handle a large number of references in a single namespace,
+such as GitHub's `refs/pulls`, or a project with many tags.
+
+LMDB
+^^^^
+
+David Turner proposed
+https://lore.kernel.org/git/1455772670-21142-26-git-send-email-dturner@twopensource.com/[using
+LMDB], as LMDB is lightweight (64k of runtime code) and has a
+GPL-compatible license.
+
+A downside of LMDB is its reliance on a single C implementation. This
+makes embedding inside JGit (a popular reimplementation of Git)
+difficult, and hoisting onto virtual storage (for JGit DFS) virtually
+impossible.
+
+A common format that can be supported by all major Git implementations
+(git-core, JGit, libgit2) is strongly preferred.
diff --git a/Documentation/technical/remembering-renames.txt b/Documentation/technical/remembering-renames.txt
new file mode 100644
index 0000000..73f4176
--- /dev/null
+++ b/Documentation/technical/remembering-renames.txt
@@ -0,0 +1,671 @@
+Rebases and cherry-picks involve a sequence of merges whose results are
+recorded as new single-parent commits. The first parent side of those
+merges represents the "upstream" side, and often includes a far larger set of
+changes than the second parent side. Traditionally, the renames on the
+first-parent side of that sequence of merges were repeatedly re-detected
+for every merge. This file explains why it is safe and effective during
+rebases and cherry-picks to remember renames on the upstream side of
+history as an optimization, assuming all merges are automatic and clean
+(i.e. no conflicts and not interrupted for user input or editing).
+
+Outline:
+
+ 0. Assumptions
+
+ 1. How rebasing and cherry-picking work
+
+ 2. Why the renames on MERGE_SIDE1 in any given pick are *always* a
+ superset of the renames on MERGE_SIDE1 for the next pick.
+
+ 3. Why any rename on MERGE_SIDE1 in any given pick is _almost_ always also
+ a rename on MERGE_SIDE1 for the next pick
+
+ 4. A detailed description of the counter-examples to #3.
+
+ 5. Why the special cases in #4 are still fully reasonable to use to pair
+ up files for three-way content merging in the merge machinery, and why
+ they do not affect the correctness of the merge.
+
+ 6. Interaction with skipping of "irrelevant" renames
+
+ 7. Additional items that need to be cached
+
+ 8. How directory rename detection interacts with the above and why this
+ optimization is still safe even if merge.directoryRenames is set to
+ "true".
+
+
+=== 0. Assumptions ===
+
+There are two assumptions that will hold throughout this document:
+
+ * The upstream side where commits are transplanted to is treated as the
+ first parent side when rebase/cherry-pick call the merge machinery
+
+ * All merges are fully automatic
+
+and a third that will hold in sections 2-5 for simplicity, that I'll later
+address in section 8:
+
+ * No directory renames occur
+
+
+Let me explain more about each assumption and why I include it:
+
+
+The first assumption is merely for the purposes of making this document
+clearer; the optimization implementation does not actually depend upon it.
+However, the assumption does hold in all cases because it reflects the way
+that both rebase and cherry-pick were implemented; and the implementation
+of cherry-pick and rebase are not readily changeable for backwards
+compatibility reasons (see for example the discussion of the --ours and
+--theirs flag in the documentation of `git checkout`, particularly the
+comments about how they behave with rebase). The optimization avoids
+checking first-parent-ness, though. It checks the conditions that make the
+optimization valid instead, so it would still continue working if someone
+changed the parent ordering that cherry-pick and rebase use. But making
+this assumption does make this document much clearer and prevents me from
+having to repeat every example twice.
+
+If the second assumption is violated, then the optimization simply is
+turned off and thus isn't relevant to consider. The second assumption can
+also be stated as "there is no interruption for a user to resolve conflicts
+or to just further edit or tweak files". While real rebases and
+cherry-picks are often interrupted (either because it's an interactive
+rebase where the user requested to stop and edit, or because there were
+conflicts that the user needs to resolve), the cache of renames is not
+stored on disk, and thus is thrown away as soon as the rebase or cherry
+pick stops for the user to resolve the operation.
+
+The third assumption makes sections 2-5 simpler, and allows people to
+understand the basics of why this optimization is safe and effective, and
+then I can go back and address the specifics in section 8. It is probably
+also worth noting that if directory renames do occur, then the default of
+merge.directoryRenames being set to "conflict" means that the operation
+will stop for users to resolve the conflicts and the cache will be thrown
+away, and thus that there won't be an optimization to apply. So, the only
+reason we need to address directory renames specifically, is that some
+users will have set merge.directoryRenames to "true" to allow the merges to
+continue to proceed automatically. The optimization is still safe with
+this config setting, but we have to discuss a few more cases to show why;
+this discussion is deferred until section 8.
+
+
+=== 1. How rebasing and cherry-picking work ===
+
+Consider the following setup (from the git-rebase manpage):
+
+ A---B---C topic
+ /
+ D---E---F---G main
+
+After rebasing or cherry-picking topic onto main, this will appear as:
+
+ A'--B'--C' topic
+ /
+ D---E---F---G main
+
+The way the commits A', B', and C' are created is through a series of
+merges, where rebase or cherry-pick sequentially uses each of the three
+A-B-C commits in a special merge operation. Let's label the three commits
+in the merge operation as MERGE_BASE, MERGE_SIDE1, and MERGE_SIDE2. For
+this picture, the three commits for each of the three merges would be:
+
+To create A':
+ MERGE_BASE: E
+ MERGE_SIDE1: G
+ MERGE_SIDE2: A
+
+To create B':
+ MERGE_BASE: A
+ MERGE_SIDE1: A'
+ MERGE_SIDE2: B
+
+To create C':
+ MERGE_BASE: B
+ MERGE_SIDE1: B'
+ MERGE_SIDE2: C
+
+Sometimes, folks are surprised that these three-way merges are done. It
+can be useful in understanding these three-way merges to view them in a
+slightly different light. For example, in creating C', you can view it as
+either:
+
+ * Apply the changes between B & C to B'
+ * Apply the changes between B & B' to C
+
+Conceptually the two statements above are the same as a three-way merge of
+B, B', and C, at least the parts before you decide to record a commit.
+
+
+=== 2. Why the renames on MERGE_SIDE1 in any given pick are always a ===
+=== superset of the renames on MERGE_SIDE1 for the next pick. ===
+
+The merge machinery uses the filenames it is fed from MERGE_BASE,
+MERGE_SIDE1, and MERGE_SIDE2. It will only move content to a different
+filename under one of three conditions:
+
+ * To make both pieces of a conflict available to a user during conflict
+ resolution (examples: directory/file conflict, add/add type conflict
+ such as symlink vs. regular file)
+
+ * When MERGE_SIDE1 renames the file.
+
+ * When MERGE_SIDE2 renames the file.
+
+First, let's remember what commits are involved in the first and second
+picks of the cherry-pick or rebase sequence:
+
+To create A':
+ MERGE_BASE: E
+ MERGE_SIDE1: G
+ MERGE_SIDE2: A
+
+To create B':
+ MERGE_BASE: A
+ MERGE_SIDE1: A'
+ MERGE_SIDE2: B
+
+So, in particular, we need to show that the renames between E and G are a
+superset of those between A and A'.
+
+A' is created by the first merge. A' will only have renames for one of the
+three reasons listed above. The first case, a conflict, results in a
+situation where the cache is dropped and thus this optimization doesn't
+take effect, so we need not consider that case. The third case, a rename
+on MERGE_SIDE2 (i.e. from G to A), will show up in A' but it also shows up
+in A -- therefore when diffing A and A' that path does not show up as a
+rename. The only remaining way for renames to show up in A' is for the
+rename to come from MERGE_SIDE1. Therefore, all renames between A and A'
+are a subset of those between E and G. Equivalently, all renames between E
+and G are a superset of those between A and A'.
+
+
+=== 3. Why any rename on MERGE_SIDE1 in any given pick is _almost_ ===
+=== always also a rename on MERGE_SIDE1 for the next pick. ===
+
+Let's again look at the first two picks:
+
+To create A':
+ MERGE_BASE: E
+ MERGE_SIDE1: G
+ MERGE_SIDE2: A
+
+To create B':
+ MERGE_BASE: A
+ MERGE_SIDE1: A'
+ MERGE_SIDE2: B
+
+Now let's look at any given rename from MERGE_SIDE1 of the first pick, i.e.
+any given rename from E to G. Let's use the filenames 'oldfile' and
+'newfile' for demonstration purposes. That first pick will function as
+follows; when the rename is detected, the merge machinery will do a
+three-way content merge of the following:
+ E:oldfile
+ G:newfile
+ A:oldfile
+and produce a new result:
+ A':newfile
+
+Note above that I've assumed that E->A did not rename oldfile. If that
+side did rename, then we most likely have a rename/rename(1to2) conflict
+that will cause the rebase or cherry-pick operation to halt and drop the
+in-memory cache of renames and thus doesn't need to be considered further.
+In the special case that E->A does rename the file but also renames it to
+newfile, then there is no conflict from the renaming and the merge can
+succeed. In this special case, the rename is not valid to cache because
+the second merge will find A:newfile in the MERGE_BASE (see also the new
+testcases in t6429 with "rename same file identically" in their
+description). So a rename/rename(1to1) needs to be specially handled by
+pruning renames from the cache and decrementing the dir_rename_counts in
+the current and leading directories associated with those renames. Or,
+since these are really rare, one could just take the easy way out and
+disable the remembering renames optimization when a rename/rename(1to1)
+happens.
+
+The previous paragraph handled the cases for E->A renaming oldfile, let's
+continue assuming that oldfile is not renamed in A.
+
+As per the diagram for creating B', MERGE_SIDE1 involves the changes from A
+to A'. So, we are curious whether A:oldfile and A':newfile will be viewed
+as renames. Note that:
+
+ * There will be no A':oldfile (because there could not have been a
+ G:oldfile as we do not do break detection in the merge machinery and
+ G:newfile was detected as a rename, and by the construction of the
+ rename above that merged cleanly, the merge machinery will ensure there
+ is no 'oldfile' in the result).
+
+ * There will be no A:newfile (if there had been, we would have had a
+ rename/add conflict).
+
+ * Clearly A:oldfile and A':newfile are "related" (A':newfile came from a
+ clean three-way content merge involving A:oldfile).
+
+We can also expound on the third point above, by noting that three-way
+content merges can also be viewed as applying the differences between the
+base and one side to the other side. Thus we can view A':newfile as
+having been created by applying the changes between E:oldfile and G:newfile
+(which were detected as being related, i.e. <50% changed) to A:oldfile.
+
+Thus A:oldfile and A':newfile are just as related as E:oldfile and
+G:newfile are -- they have exactly identical differences. Since the latter
+were detected as renames, A:oldfile and A':newfile should also be
+detectable as renames almost always.
+
+
+=== 4. A detailed description of the counter-examples to #3. ===
+
+We already noted in section 3 that rename/rename(1to1) (i.e. both sides
+renaming a file the same way) was one counter-example. The more
+interesting bit, though, is why did we need to use the "almost" qualifier
+when stating that A:oldfile and A':newfile are "almost" always detectable
+as renames?
+
+Let's repeat an earlier point that section 3 made:
+
+ A':newfile was created by applying the changes between E:oldfile and
+ G:newfile to A:oldfile. The changes between E:oldfile and G:newfile were
+ <50% of the size of E:oldfile.
+
+If those changes that were <50% of the size of E:oldfile are also <50% of
+the size of A:oldfile, then A:oldfile and A':newfile will be detectable as
+renames. However, if there is a dramatic size reduction between E:oldfile
+and A:oldfile (but the changes between E:oldfile, G:newfile, and A:oldfile
+still somehow merge cleanly), then traditional rename detection would not
+detect A:oldfile and A':newfile as renames.
+
+Here's an example where that can happen:
+ * E:oldfile had 20 lines
+ * G:newfile added 10 new lines at the beginning of the file
+ * A:oldfile kept the first 3 lines of the file, and deleted all the rest
+then
+ => A':newfile would have 13 lines, 3 of which match those in A:oldfile.
+E:oldfile -> G:newfile would be detected as a rename, but A:oldfile and
+A':newfile would not be.
+
+
+=== 5. Why the special cases in #4 are still fully reasonable to use to ===
+=== pair up files for three-way content merging in the merge machinery, ===
+=== and why they do not affect the correctness of the merge. ===
+
+In the rename/rename(1to1) case, A:newfile and A':newfile are not renames
+since they use the *same* filename. However, files with the same filename
+are obviously fine to pair up for three-way content merging (the merge
+machinery has never employed break detection). The interesting
+counter-example case is thus not the rename/rename(1to1) case, but the case
+where A did not rename oldfile. That was the case that we spent most of
+the time discussing in sections 3 and 4. The remainder of this section
+will be devoted to that case as well.
+
+So, even if A:oldfile and A':newfile aren't detectable as renames, why is
+it still reasonable to pair them up for three-way content merging in the
+merge machinery? There are multiple reasons:
+
+ * As noted in sections 3 and 4, the diff between A:oldfile and A':newfile
+ is *exactly* the same as the diff between E:oldfile and G:newfile. The
+ latter pair were detected as renames, so it seems unlikely to surprise
+ users for us to treat A:oldfile and A':newfile as renames.
+
+ * In fact, "oldfile" and "newfile" were at one point detected as renames
+ due to how they were constructed in the E..G chain. And we used that
+ information once already in this rebase/cherry-pick. I think users
+ would be unlikely to be surprised at us continuing to treat the files
+ as renames and would quickly understand why we had done so.
+
+ * Marking or declaring files as renames is *not* the end goal for merges.
+ Merges use renames to determine which files make sense to be paired up
+ for three-way content merges.
+
+ * A:oldfile and A':newfile were _already_ paired up in a three-way
+ content merge; that is how A':newfile was created. In fact, that
+ three-way content merge was clean. So using them again in a later
+ three-way content merge seems very reasonable.
+
+However, the above focuses on the common scenarios. Let's try to look
+at all possible unusual scenarios and compare the behavior without the
+optimization to the behavior with it. Consider the following theoretical
+cases; we will then dive into each to determine which of them are possible,
+and if so, what they mean:
+
+ 1. Without the optimization, the second merge results in a conflict.
+ With the optimization, the second merge also results in a conflict.
+ Questions: Are the conflicts confusingly different? Better in one case?
+
+ 2. Without the optimization, the second merge results in NO conflict.
+ With the optimization, the second merge also results in NO conflict.
+ Questions: Are the merges the same?
+
+ 3. Without the optimization, the second merge results in a conflict.
+ With the optimization, the second merge results in NO conflict.
+ Questions: Possible? Bug, bugfix, or something else?
+
+ 4. Without the optimization, the second merge results in NO conflict.
+ With the optimization, the second merge results in a conflict.
+ Questions: Possible? Bug, bugfix, or something else?
+
+I'll consider all four cases, but out of order.
+
+The fourth case is impossible. For the code without the remembering
+renames optimization to not get a conflict, B:oldfile would need to exactly
+match A:oldfile -- if it doesn't, there would be a modify/delete conflict.
+If A:oldfile matches B:oldfile exactly, then a three-way content merge
+between A:oldfile, A':newfile, and B:oldfile would have no conflict and
+just give us the version of newfile from A' as the result.
+
+From the same logic as the above paragraph, the second case would indeed
+result in identical merges. When A:oldfile exactly matches B:oldfile, an
+undetected rename would say, "Oh, I see one side didn't modify 'oldfile'
+and the other side deleted it. I'll delete it. And I see you have this
+brand new file named 'newfile' in A', so I'll keep it." That gives the
+same results as three-way content merging A:oldfile, A':newfile, and
+B:oldfile -- a removal of oldfile with the version of newfile from A'
+showing up in the result.
+
+The third case is interesting. It means that A:oldfile and A':newfile were
+not just similar enough, but that the changes between them did not conflict
+with the changes between A:oldfile and B:oldfile. This would validate our
+hunch that the files were similar enough to be used in a three-way content
+merge, and thus seems entirely correct for us to have used them that way.
+(Sidenote: One particular example here may be enlightening. Let's say that
+B was an immediate revert of A. B clearly would have been a clean revert
+of A, since A was B's immediate parent. One would assume that if you can
+pick a commit, you should also be able to cherry-pick its immediate revert.
+However, this is one of those funny corner cases; without this
+optimization, we just successfully picked a commit cleanly, but we are
+unable to cherry-pick its immediate revert due to the size differences
+between E:oldfile and A:oldfile.)
+
+That leaves only the first case to consider -- when we get conflicts both
+with or without the optimization. Without the optimization, we'll have a
+modify/delete conflict, where both A':newfile and B:oldfile are left in the
+tree for the user to deal with and no hints about the potential similarity
+between the two. With the optimization, we'll have a three-way content
+merged A:oldfile, A':newfile, and B:oldfile with conflict markers
+suggesting we thought the files were related but giving the user the chance
+to resolve. As noted above, I don't think users will find it surprising
+that we treat 'oldfile' and 'newfile' as related, since that is how they
+were related between E and G. In any event, though, this case shouldn't be
+concerning since we
+hit a conflict in both cases, told the user what we know, and asked them to
+resolve it.
+
+So, in summary, case 4 is impossible, case 2 yields the same behavior, and
+cases 1 and 3 seem to provide as good or better behavior with the
+optimization than without.
+
+
+=== 6. Interaction with skipping of "irrelevant" renames ===
+
+Previous optimizations involved skipping rename detection for paths
+considered to be "irrelevant". See for example the following commits:
+
+ * 32a56dfb99 ("merge-ort: precompute subset of sources for which we
+ need rename detection", 2021-03-11)
+ * 2fd9eda462 ("merge-ort: precompute whether directory rename
+ detection is needed", 2021-03-11)
+ * 9bd342137e ("diffcore-rename: determine which relevant_sources are
+ no longer relevant", 2021-03-13)
+
+Relevance is always determined by what the _other_ side of history has
+done, in terms of modifying a file that our side renamed, or adding a
+file to a directory which our side renamed. This means that a path
+that is "irrelevant" when picking the first commit of a series in a
+rebase or cherry-pick, may suddenly become "relevant" when picking the
+next commit.
+
+The upshot of this is that we can only cache rename detection results
+for relevant paths, and need to re-check relevance in subsequent
+commits. If those subsequent commits have additional paths that are
+relevant for rename detection, then we will need to redo rename
+detection -- though we can limit it to the paths for which we have not
+already detected renames.
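+
+As a purely illustrative sketch (not merge-ort's actual code), the rule
+amounts to something like the following, where cached_pairs holds the
+results remembered from the previous pick:
+
+    def paths_needing_rename_detection(relevant_paths, cached_pairs):
+        # Anything that became relevant for this pick but has no cached
+        # result from an earlier pick still needs rename detection.
+        return [path for path in relevant_paths if path not in cached_pairs]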
+
+
+=== 7. Additional items that need to be cached ===
+
+It turns out we have to cache more than just renames; we also cache:
+
+ A) non-renames (i.e. unpaired deletes)
+ B) counts of renames within directories
+ C) sources that were marked as RELEVANT_LOCATION, but which were
+ downgraded to RELEVANT_NO_MORE
+ D) the toplevel trees involved in the merge
+
+These are all stored in struct rename_info, and respectively appear in
+ * cached_pairs (alongside actual renames, just with a value of NULL)
+ * dir_rename_counts
+ * cached_irrelevant
+ * merge_trees
+
+The reason for (A) comes from the irrelevant renames skipping
+optimization discussed in section 6. The fact that irrelevant renames
+are skipped means we only get a subset of the potential renames
+detected and subsequent commits may need to run rename detection on
+the upstream side on a subset of the remaining renames (to get the
+renames that are relevant for that later commit). Since unpaired
+deletes are involved in rename detection too, we don't want to
+repeatedly check that those paths remain unpaired on the upstream side
+with every commit we are transplanting.
+
+The reason for (B) is that diffcore_rename_extended() is what
+generates the counts of renames by directory, which are needed in
+directory rename detection; if we don't run
+diffcore_rename_extended() again, then we need to have its output,
+including dir_rename_counts, from the previous run.
+
+The reason for (C) is that merge-ort's tree traversal will again think
+those paths are relevant (marking them as RELEVANT_LOCATION), but the
+fact that they were downgraded to RELEVANT_NO_MORE means that
+dir_rename_counts already has the information we need for directory
+rename detection. (A path which becomes RELEVANT_CONTENT in a
+subsequent commit will be removed from cached_irrelevant.)
+
+The reason for (D) is that it is how we determine whether the
+remembering renames optimization can be used. In particular,
+remembering that our sequence of merges looks like:
+
+ Merge 1:
+ MERGE_BASE: E
+ MERGE_SIDE1: G
+ MERGE_SIDE2: A
+ => Creates A'
+
+ Merge 2:
+ MERGE_BASE: A
+ MERGE_SIDE1: A'
+ MERGE_SIDE2: B
+ => Creates B'
+
+It is the fact that the trees A and A' appear in both Merge 1 and
+Merge 2, with A as a parent of A', that allows this optimization. So
+we store the trees to compare with what we are asked to merge next
+time.
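+
+As an illustrative sketch of that reuse test (names here are hypothetical;
+the real check lives in merge-ort.c), the caches from the previous pick are
+only usable when the new merge's base is the previous MERGE_SIDE2 tree and
+the new MERGE_SIDE1 is the tree that the previous merge created:
+
+    def can_reuse_cached_renames(prev_merge, new_base, new_side1):
+        # prev_merge: dict with the trees from the last merge performed,
+        # e.g. {"MERGE_SIDE2": A, "result": A_prime} for Merge 1 above.
+        return (new_base == prev_merge["MERGE_SIDE2"] and
+                new_side1 == prev_merge["result"])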
+
+
+=== 8. How directory rename detection interacts with the above and ===
+=== why this optimization is still safe even if ===
+=== merge.directoryRenames is set to "true". ===
+
+As noted in the assumptions section:
+
+ """
+ ...if directory renames do occur, then the default of
+ merge.directoryRenames being set to "conflict" means that the operation
+ will stop for users to resolve the conflicts and the cache will be
+ thrown away, and thus that there won't be an optimization to apply.
+ So, the only reason we need to address directory renames specifically,
+ is that some users will have set merge.directoryRenames to "true" to
+ allow the merges to continue to proceed automatically.
+ """
+
+Let's remember that we need to look at how any given pick affects the next
+one. So let's again use the first two picks from the diagram in section
+one:
+
+ First pick does this three-way merge:
+ MERGE_BASE: E
+ MERGE_SIDE1: G
+ MERGE_SIDE2: A
+ => creates A'
+
+ Second pick does this three-way merge:
+ MERGE_BASE: A
+ MERGE_SIDE1: A'
+ MERGE_SIDE2: B
+ => creates B'
+
+Now, directory rename detection exists so that if one side of history
+renames a directory, and the other side adds a new file to the old
+directory, then the merge (with merge.directoryRenames=true) can move the
+file into the new directory. There are two qualitatively different ways to
+add a new file to an old directory: create a new file, or rename a file
+into that directory. Also, directory renames can be done on either side of
+history, so there are four cases to consider:
+
+ * MERGE_SIDE1 renames old dir, MERGE_SIDE2 adds new file to old dir
+ * MERGE_SIDE1 renames old dir, MERGE_SIDE2 renames file into old dir
+ * MERGE_SIDE1 adds new file to old dir, MERGE_SIDE2 renames old dir
+ * MERGE_SIDE1 renames file into old dir, MERGE_SIDE2 renames old dir
+
+One last note before we consider these four cases: There are some
+important properties about how we implement this optimization with
+respect to directory rename detection that we need to bear in mind
+while considering all of these cases:
+
+ * rename caching occurs *after* applying directory renames
+
+ * a rename created by directory rename detection is recorded for the side
+ of history that did the directory rename.
+
+ * dir_rename_counts, the nested map of
+ {oldname => {newname => count}},
+ is cached between runs as well. This basically means that directory
+ rename detection is also cached, though only on the side of history
+ that we cache renames for (MERGE_SIDE1 as far as this document is
+ concerned; see the assumptions section). Two interesting sub-notes
+ about these counts:
+
+ * If we need to perform rename-detection again on the given side (e.g.
+ some paths are relevant for rename detection that weren't before),
+ then we clear dir_rename_counts and recompute it, making use of
+ cached_pairs. The reason it is important to do this is optimizations
+ around RELEVANT_LOCATION exist to prevent us from computing
+ unnecessary renames for directory rename detection and from computing
+ dir_rename_counts for irrelevant directories; but those same renames
+ or directories may become necessary for subsequent merges. The
+ easiest way to "fix up" dir_rename_counts in such cases is to just
+ recompute it.
+
+ * If we prune rename/rename(1to1) entries from the cache, then we also
+ need to update dir_rename_counts to decrement the counts for the
+ involved directory and any relevant parent directories (to undo what
+ update_dir_rename_counts() in diffcore-rename.c incremented when the
+ rename was initially found). If we instead just disable the
+ remembering renames optimization when the exceedingly rare
+ rename/rename(1to1) cases occur, then dir_rename_counts will get
+ re-computed the next time rename detection occurs, as noted above.
+
+ * the side with multiple commits to pick is the side of history that we
+ do NOT cache renames for. Thus, there are no additional commits to
+ change the number of renames in a directory, except for those done by
+ directory rename detection (which always pad the majority).
+
+ * the "renames" we cache are modified slightly by any directory rename,
+ as noted below.
+
+Now, with those notes out of the way, let's go through the four cases
+in order:
+
+Case 1: MERGE_SIDE1 renames old dir, MERGE_SIDE2 adds new file to old dir
+
+ This case looks like this:
+
+ MERGE_BASE: E, Has olddir/
+ MERGE_SIDE1: G, Renames olddir/ -> newdir/
+ MERGE_SIDE2: A, Adds olddir/newfile
+ => creates A', With newdir/newfile
+
+ MERGE_BASE: A, Has olddir/newfile
+ MERGE_SIDE1: A', Has newdir/newfile
+ MERGE_SIDE2: B, Modifies olddir/newfile
+ => expected B', with threeway-merged newdir/newfile from above
+
+ In this case, with the optimization, note that after the first commit:
+ * MERGE_SIDE1 remembers olddir/ -> newdir/
+ * MERGE_SIDE1 has cached olddir/newfile -> newdir/newfile
+ Given the cached rename noted above, the second merge can proceed as
+ expected without needing to perform rename detection from A -> A'.
+
+Case 2: MERGE_SIDE1 renames old dir, MERGE_SIDE2 renames file into old dir
+
+ This case looks like this:
+ MERGE_BASE: E oldfile, olddir/
+ MERGE_SIDE1: G oldfile, olddir/ -> newdir/
+ MERGE_SIDE2: A oldfile -> olddir/newfile
+ => creates A', With newdir/newfile representing original oldfile
+
+ MERGE_BASE: A olddir/newfile
+ MERGE_SIDE1: A' newdir/newfile
+ MERGE_SIDE2: B modify olddir/newfile
+ => expected B', with threeway-merged newdir/newfile from above
+
+ In this case, with the optimization, note that after the first commit:
+ * MERGE_SIDE1 remembers olddir/ -> newdir/
+ * MERGE_SIDE1 has cached olddir/newfile -> newdir/newfile
+ (NOT oldfile -> newdir/newfile; compare to case with
+ (p->status == 'R' && new_path) in possibly_cache_new_pair())
+
+ Given the cached rename noted above, the second merge can proceed as
+ expected without needing to perform rename detection from A -> A'.
+
+Case 3: MERGE_SIDE1 adds new file to old dir, MERGE_SIDE2 renames old dir
+
+ This case looks like this:
+
+ MERGE_BASE: E, Has olddir/
+ MERGE_SIDE1: G, Adds olddir/newfile
+ MERGE_SIDE2: A, Renames olddir/ -> newdir/
+ => creates A', With newdir/newfile
+
+ MERGE_BASE: A, Has newdir/, but no notion of newdir/newfile
+ MERGE_SIDE1: A', Has newdir/newfile
+ MERGE_SIDE2: B, Has newdir/, but no notion of newdir/newfile
+ => expected B', with newdir/newfile from A'
+
+ In this case, with the optimization, note that after the first commit there
+ were no renames on MERGE_SIDE1, and any renames on MERGE_SIDE2 are tossed.
+ But the second merge didn't need any renames so this is fine.
+
+Case 4: MERGE_SIDE1 renames file into old dir, MERGE_SIDE2 renames old dir
+
+ This case looks like this:
+
+ MERGE_BASE: E, Has olddir/
+ MERGE_SIDE1: G, Renames oldfile -> olddir/newfile
+ MERGE_SIDE2: A, Renames olddir/ -> newdir/
+ => creates A', With newdir/newfile representing original oldfile
+
+ MERGE_BASE: A, Has oldfile
+ MERGE_SIDE1: A', Has newdir/newfile
+ MERGE_SIDE2: B, Modifies oldfile
+ => expected B', with threeway-merged newdir/newfile from above
+
+ In this case, with the optimization, note that after the first commit:
+ * MERGE_SIDE1 remembers oldfile -> newdir/newfile
+ (NOT oldfile -> olddir/newfile; compare to case of second
+ block under p->status == 'R' in possibly_cache_new_pair())
+ * MERGE_SIDE2 renames are tossed because only MERGE_SIDE1 is remembered
+
+ Given the cached rename noted above, the second merge can proceed as
+ expected without needing to perform rename detection from A -> A'.
+
+Finally, I'll just note here that interactions with the
+skip-irrelevant-renames optimization mean we sometimes don't detect
+renames for any files within a directory that was renamed, in which
+case we will not have been able to detect any rename for the directory
+itself. In such a case, we do not know whether the directory was
+renamed; we want to be careful to avoid caching some kind of "this
+directory was not renamed" statement. If we did, then a subsequent
+commit being rebased could add a file to the old directory, and the
+user would expect it to end up in the correct directory -- something
+our erroneous "this directory was not renamed" cache would preclude.
diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
new file mode 100644
index 0000000..045a767
--- /dev/null
+++ b/Documentation/technical/repository-version.txt
@@ -0,0 +1,102 @@
+== Git Repository Format Versions
+
+Every git repository is marked with a numeric version in the
+`core.repositoryformatversion` key of its `config` file. This version
+specifies the rules for operating on the on-disk repository data. An
+implementation of git which does not understand a particular version
+advertised by an on-disk repository MUST NOT operate on that repository;
+doing so risks not only producing wrong results, but actually losing
+data.
+
+Because of this rule, version bumps should be kept to an absolute
+minimum. Instead, we generally prefer these strategies:
+
+ - bumping format version numbers of individual data files (e.g.,
+ index, packfiles, etc). This restricts the incompatibilities only to
+ those files.
+
+ - introducing new data that gracefully degrades when used by older
+ clients (e.g., pack bitmap files are ignored by older clients, which
+ simply do not take advantage of the optimization they provide).
+
+A whole-repository format version bump should only be part of a change
+that cannot be independently versioned. For instance, if one were to
+change the reachability rules for objects, or the rules for locking
+refs, that would require a bump of the repository format version.
+
+Note that this applies only to accessing the repository's disk contents
+directly. An older client which understands only format `0` may still
+connect via `git://` to a repository using format `1`, as long as the
+server process understands format `1`.
+
+The preferred strategy for rolling out a version bump (whether whole
+repository or for a single file) is to teach git to read the new format,
+and allow writing the new format with a config switch or command line
+option (for experimentation or for those who do not care about backwards
+compatibility with older gits). Then after a long period to allow the
+reading capability to become common, we may switch to writing the new
+format by default.
+
+The currently defined format versions are:
+
+=== Version `0`
+
+This is the format defined by the initial version of git, including but
+not limited to the format of the repository directory, the repository
+configuration file, and the object and ref storage. Specifying the
+complete behavior of git is beyond the scope of this document.
+
+=== Version `1`
+
+This format is identical to version `0`, with the following exceptions:
+
+ 1. When reading the `core.repositoryformatversion` variable, a git
+ implementation which supports version 1 MUST also read any
+ configuration keys found in the `extensions` section of the
+ configuration file.
+
+ 2. If a version-1 repository specifies any `extensions.*` keys that
+ the running git has not implemented, the operation MUST NOT
+ proceed. Similarly, if the value of any known key is not understood
+ by the implementation, the operation MUST NOT proceed.
+
+Note that if no extensions are specified in the config file, then
+`core.repositoryformatversion` SHOULD be set to `0` (setting it to `1`
+provides no benefit, and makes the repository incompatible with older
+implementations of git).
+
+This document will serve as the master list for extensions. Any
+implementation wishing to define a new extension should make a note of
+it here, in order to claim the name.
+
+The defined extensions are:
+
+==== `noop`
+
+This extension does not change git's behavior at all. It is useful only
+for testing format-1 compatibility.
+
+==== `preciousObjects`
+
+When the config key `extensions.preciousObjects` is set to `true`,
+objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
+`git repack -d`).
+
+==== `partialClone`
+
+When the config key `extensions.partialClone` is set, it indicates
+that the repo was created with a partial clone (or later performed
+a partial fetch) and that the remote may have omitted sending
+certain unwanted objects. Such a remote is called a "promisor remote"
+and it promises that all such omitted objects can be fetched from it
+in the future.
+
+The value of this key is the name of the promisor remote.
+
+==== `worktreeConfig`
+
+If set, by default "git config" reads from both the "config" and
+"config.worktree" files in GIT_DIR, in that order. In
+multiple working directory mode, the "config" file is shared while
+"config.worktree" is per working directory (i.e., it's in
+GIT_COMMON_DIR/worktrees/<id>/config.worktree).
diff --git a/Documentation/technical/rerere.txt b/Documentation/technical/rerere.txt
new file mode 100644
index 0000000..580f233
--- /dev/null
+++ b/Documentation/technical/rerere.txt
@@ -0,0 +1,186 @@
+Rerere
+======
+
+This document describes the rerere logic.
+
+Conflict normalization
+----------------------
+
+To ensure recorded conflict resolutions can be looked up in the rerere
+database, even when branches are merged in a different order, when
+different branches are merged that result in the same conflict, or
+when different conflict style settings are used, rerere normalizes the
+conflicts before writing them to the rerere database.
+
+Different conflict styles and branch names are normalized by stripping
+the labels from the conflict markers, and removing the common ancestor
+version from the `diff3` or `zdiff3` conflict styles. Branches that
+are merged in different order are normalized by sorting the conflict
+hunks. More on each of those steps in the following sections.
+
+Once these two normalization operations are applied, a conflict ID is
+calculated based on the normalized conflict, which is later used by
+rerere to look up the conflict in the rerere database.
+
+Removing the common ancestor version
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Say we have three branches AB, AC and AC2. The common ancestor of
+these branches has a file with a line containing the string "A" (for
+brevity this is called "line A" in the rest of the document). In
+branch AB this line is changed to "B", in AC, this line is changed to
+"C", and branch AC2 is forked off of AC, after the line was changed to
+"C".
+
+Forking a branch ABAC off of branch AB and then merging AC into it, we
+get a conflict like the following:
+
+ <<<<<<< HEAD
+ B
+ =======
+ C
+ >>>>>>> AC
+
+Doing the analogous with AC2 (forking a branch ABAC2 off of branch AB
+and then merging branch AC2 into it), using the diff3 or zdiff3
+conflict style, we get a conflict like the following:
+
+ <<<<<<< HEAD
+ B
+ ||||||| merged common ancestors
+ A
+ =======
+ C
+ >>>>>>> AC2
+
+By resolving this conflict to leave line D, the user declares:
+
+ After examining what branches AB and AC did, I believe that making
+ line A into line D is the best thing to do that is compatible with
+ what AB and AC wanted to do.
+
+As branch AC2 refers to the same commit as AC, the above implies that
+this is also compatible with what AB and AC2 wanted to do.
+
+By extension, this means that rerere should recognize that the above
+conflicts are the same. To do this, the labels on the conflict
+markers are stripped, and the common ancestor version is removed. The above
+examples would both result in the following normalized conflict:
+
+ <<<<<<<
+ B
+ =======
+ C
+ >>>>>>>
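+
+A minimal sketch of this normalization step (illustrative only; the real
+logic lives in rerere.c and also handles nested conflicts and custom marker
+sizes) might look like:
+
+    def strip_labels_and_ancestor(conflict_lines):
+        out, in_ancestor = [], False
+        for line in conflict_lines:
+            if line.startswith("<<<<<<<"):
+                out.append("<<<<<<<")          # drop the label
+            elif line.startswith("|||||||"):
+                in_ancestor = True             # start of diff3/zdiff3 ancestor
+            elif line.startswith("======="):
+                in_ancestor = False
+                out.append("=======")
+            elif line.startswith(">>>>>>>"):
+                out.append(">>>>>>>")          # drop the label
+            elif not in_ancestor:
+                out.append(line)
+        return out
+
+Feeding either of the two conflicts shown earlier through this produces the
+same normalized text, which is the point of the exercise.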
+
+Sorting hunks
+~~~~~~~~~~~~~
+
+As before, let's imagine that a common ancestor had a file with line A
+in its early part, and line X in its late part. And then four branches
+are forked that do these things:
+
+ - AB: changes A to B
+ - AC: changes A to C
+ - XY: changes X to Y
+ - XZ: changes X to Z
+
+Now, forking a branch ABAC off of branch AB and then merging AC into
+it, and forking a branch ACAB off of branch AC and then merging AB
+into it, would yield the conflict in a different order. The former
+would say "A became B or C, what now?" while the latter would say "A
+became C or B, what now?"
+
+As a reminder, the act of merging AC into ABAC and resolving the
+conflict to leave line D means that the user declares:
+
+ After examining what branches AB and AC did, I believe that
+ making line A into line D is the best thing to do that is
+ compatible with what AB and AC wanted to do.
+
+So the conflict we would see when merging AB into ACAB should be
+resolved the same way--it is the resolution that is in line with that
+declaration.
+
+Imagine that similarly previously a branch XYXZ was forked from XY,
+and XZ was merged into it, and resolved "X became Y or Z" into "X
+became W".
+
+Now, if a branch ABXY was forked from AB and then merged XY, then ABXY
+would have line B in its early part and line Y in its later part.
+Such a merge would be quite clean. We can construct 4 combinations
+using these four branches ((AB, AC) x (XY, XZ)).
+
+Merging ABXY and ACXZ would make "an early A became B or C, a late X
+became Y or Z" conflict, while merging ACXY and ABXZ would make "an
+early A became C or B, a late X became Y or Z". We can see there are
+4 combinations of ("B or C", "C or B") x ("Y or Z", "Z or Y").
+
+By sorting, the conflict is given its canonical name, namely, "an
+early part became B or C, a late part became Y or Z", and whenever
+any of these four patterns appear, we can get to the same conflict
+and resolution that we saw earlier.
+
+Without the sorting, we'd have to somehow find a previous resolution
+among a combinatorial explosion of equivalent conflicts.
+
+Conflict ID calculation
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Once the conflict normalization is done, the conflict ID is calculated
+as the sha1 hash of the conflict hunks appended to each other,
+separated by <NUL> characters. The conflict markers are stripped out
+before the sha1 is calculated. So in the example above, where we
+merge branch AC which changes line A to line C, into branch AB, which
+changes line A to line B, the conflict ID would be
+SHA1('B<NUL>C<NUL>').
+
+If there are multiple conflicts in one file, the sha1 is calculated
+the same way with all hunks appended to each other, in the order in
+which they appear in the file, separated by a <NUL> character.
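+
+A small sketch of this calculation (illustrative only; rerere.c is the
+authoritative implementation and also deals with newlines, marker sizes,
+and nesting) might look like:
+
+    import hashlib
+
+    def conflict_id(hunks):
+        # Each element of `hunks` is the (side1, side2) pair of one
+        # normalized conflict hunk, with labels already stripped.
+        h = hashlib.sha1()
+        for side1, side2 in hunks:
+            if side1 > side2:                  # "sorting hunks", as above
+                side1, side2 = side2, side1
+            h.update(side1.encode() + b"\0")   # hunks separated by <NUL>
+            h.update(side2.encode() + b"\0")
+        return h.hexdigest()
+
+    # The example above, where line A became B on one side and C on the
+    # other: SHA1('B<NUL>C<NUL>').
+    print(conflict_id([("B", "C")]))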
+
+Nested conflicts
+~~~~~~~~~~~~~~~~
+
+Nested conflicts are handled very similarly to "simple" conflicts.
+Similar to simple conflicts, the conflict is first normalized by
+stripping the labels from conflict markers, stripping the common ancestor
+version, and sorting the conflict hunks, both for the outer and the
+inner conflict. This is done recursively, so any number of nested
+conflicts can be handled.
+
+Note that this only works for conflict markers that "cleanly nest". If
+there are any unmatched conflict markers, rerere will fail to handle
+the conflict and will not record a conflict resolution.
+
+The only difference is in how the conflict ID is calculated. For the
+inner conflict, the conflict markers themselves are not stripped out
+before calculating the sha1.
+
+Say we have the following conflict for example:
+
+ <<<<<<< HEAD
+ 1
+ =======
+ <<<<<<< HEAD
+ 3
+ =======
+ 2
+ >>>>>>> branch-2
+ >>>>>>> branch-3~
+
+After stripping out the labels of the conflict markers, and sorting
+the hunks, the conflict would look as follows:
+
+ <<<<<<<
+ 1
+ =======
+ <<<<<<<
+ 2
+ =======
+ 3
+ >>>>>>>
+ >>>>>>>
+
+and finally the conflict ID would be calculated as:
+`sha1('1<NUL><<<<<<<\n2\n=======\n3\n>>>>>>><NUL>')`
diff --git a/Documentation/technical/scalar.txt b/Documentation/technical/scalar.txt
new file mode 100644
index 0000000..921cb10
--- /dev/null
+++ b/Documentation/technical/scalar.txt
@@ -0,0 +1,66 @@
+Scalar
+======
+
+Scalar is a repository management tool that optimizes Git for use in large
+repositories. It accomplishes this by helping users to take advantage of
+advanced performance features in Git. Unlike most other Git built-in commands,
+Scalar is not executed as a subcommand of 'git'; rather, it is built as a
+separate executable containing its own series of subcommands.
+
+Background
+----------
+
+Scalar was originally designed as an add-on to Git and implemented as a .NET
+Core application. It was created based on lessons learned from the VFS for Git
+project (another application aimed at improving the experience of working with
+large repositories). As part of its initial implementation, Scalar relied on
+custom features in the Microsoft fork of Git that have since been integrated
+into core Git:
+
+* partial clone,
+* commit graphs,
+* multi-pack index,
+* sparse checkout (cone mode),
+* scheduled background maintenance,
+* etc
+
+With the requisite Git functionality in place and a desire to bring the benefits
+of Scalar to the larger Git community, the Scalar application itself was ported
+from C# to C and integrated upstream.
+
+Features
+--------
+
+Scalar is composed of two major pieces of functionality: automatically
+configuring built-in Git performance features and managing repository
+enlistments.
+
+The Git performance features configured by Scalar (see "Background" for
+examples) confer substantial performance benefits to large repositories, but are
+either too experimental to enable for all of Git yet, or only benefit large
+repositories. As new features are introduced, Scalar should be updated
+accordingly to incorporate them. This will prevent the tool from becoming stale
+while also providing a path for more easily bringing features to the appropriate
+users.
+
+Enlistments are how Scalar knows which repositories on a user's system should
+utilize Scalar-configured features. This allows it to update performance
+settings when new ones are added to the tool, as well as centrally manage
+repository maintenance. The enlistment structure - a root directory with a
+`src/` subdirectory containing the cloned repository itself - is designed to
+encourage users to route build outputs outside of the repository to avoid the
+performance-limiting overhead of ignoring those files in Git.
+
+Design
+------
+
+Scalar is implemented in C and interacts with Git via a mix of child process
+invocations of Git and direct usage of `libgit.a`. Internally, it is structured
+much like other built-ins with subcommands (e.g., `git stash`), containing a
+`cmd_<subcommand>()` function for each subcommand, routed through a `cmd_main()`
+function. Most options are unique to each subcommand, with `scalar` respecting
+some "global" `git` options (e.g., `-c` and `-C`).
+
+Because `scalar` is not invoked as a Git subcommand (like `git scalar`), it is
+built and installed as its own executable in the `bin/` directory, alongside
+`git`, `git-gui`, etc.
diff --git a/Documentation/technical/send-pack-pipeline.txt b/Documentation/technical/send-pack-pipeline.txt
new file mode 100644
index 0000000..9b5a0bc
--- /dev/null
+++ b/Documentation/technical/send-pack-pipeline.txt
@@ -0,0 +1,63 @@
+Git-send-pack internals
+=======================
+
+Overall operation
+-----------------
+
+. Connects to the remote side and invokes git-receive-pack.
+
+. Learns what refs the remote has and what commit they point at.
+ Matches them to the refspecs we are pushing.
+
+. Checks if there are non-fast-forwards. Unlike fetch-pack,
+ the repository send-pack runs in is supposed to be a superset
+ of the recipient in fast-forward cases, so there is no need
+ for want/have exchanges, and fast-forward check can be done
+ locally. Tell the result to the other end.
+
+. Calls pack_objects() which generates a packfile and sends it
+ over to the other end.
+
+. If the remote side is new enough (v1.1.0 or later), wait for
+ the unpack and hook status from the other end.
+
+. Exit with appropriate error codes.
+
+
+Pack_objects pipeline
+---------------------
+
+This function gets one file descriptor (`fd`) which is either a
+socket (over the network) or a pipe (local). What's written to
+this fd goes to git-receive-pack to be unpacked.
+
+ send-pack ---> fd ---> receive-pack
+
+The function pack_objects creates a pipe and then forks. The
+forked child execs pack-objects with --revs to receive revision
+parameters from its standard input. This process will write the
+packfile to the other end.
+
+ send-pack
+ |
+ pack_objects() ---> fd ---> receive-pack
+ | ^ (pipe)
+ v |
+ (child)
+
+The child dup2's to arrange for its standard output to go back to
+the other end, and for its standard input to come from the
+pipe. After that it exec's pack-objects. Meanwhile,
+the parent process, before starting to feed the child pipeline,
+closes the reading side of the pipe and the fd to receive-pack.
+
+ send-pack
+ |
+ pack_objects(parent)
+ |
+ v [0]
+ pack-objects [0] ---> receive-pack
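+
+A rough illustration of this pipeline in Python (the real code is C in
+send-pack.c and uses fork/dup2/exec directly; this sketch just shows the
+shape of the data flow):
+
+    import subprocess
+
+    def pack_objects(fd, revision_args):
+        # `fd` is the descriptor already connected to git-receive-pack.
+        # pack-objects reads revision parameters from stdin (--revs) and
+        # writes the packfile straight to `fd`.
+        child = subprocess.Popen(
+            ["git", "pack-objects", "--revs", "--stdout"],
+            stdin=subprocess.PIPE,
+            stdout=fd,
+        )
+        child.stdin.write(("\n".join(revision_args) + "\n").encode())
+        child.stdin.close()      # signal end of revision parameters
+        return child.wait()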
+
+
+[jc: the pipeline was much more complex and needed documentation before
+ I understood an earlier bug, but now it is trivial and straightforward.]
diff --git a/Documentation/technical/shallow.txt b/Documentation/technical/shallow.txt
new file mode 100644
index 0000000..f3738ba
--- /dev/null
+++ b/Documentation/technical/shallow.txt
@@ -0,0 +1,60 @@
+Shallow commits
+===============
+
+.Definition
+*********************************************************
+Shallow commits do have parents, but not in the shallow
+repo, and therefore grafts are introduced pretending that
+these commits have no parents.
+*********************************************************
+
+$GIT_DIR/shallow lists commit object names and tells Git to
+pretend as if they are root commits (e.g. "git log" traversal
+stops after showing them; "git fsck" does not complain saying
+the commits listed on their "parent" lines do not exist).
+
+Each line contains exactly one object name. When read, a commit_graft
+will be constructed, which has nr_parent < 0 to make it easier
+to discern from user-provided grafts.
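+
+A sketch of reading that file (illustrative only; git's real parser builds
+commit_graft structs in C):
+
+    def read_shallow_file(path):
+        # One commit object name per line; each listed commit is treated
+        # as if it had no parents.
+        shallow = []
+        with open(path) as f:
+            for line in f:
+                oid = line.strip()
+                if oid:
+                    shallow.append(oid)
+        return shallow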
+
+Note that the shallow feature could not be changed easily to
+use replace refs: a commit containing a `mergetag` is not allowed
+to be replaced, not even by a root commit. Such a commit can be
+made shallow, though. Also, having a `shallow` file explicitly
+listing all the commits made shallow makes it a *lot* easier to
+do shallow-specific things such as to deepen the history.
+
+Since fsck-objects relies on the library to read the objects,
+it honours shallow commits automatically.
+
+There are some unfinished ends of the whole shallow business:
+
+- maybe we have to force non-thin packs when fetching into a
+ shallow repo (ATM they are forced non-thin).
+
+- A special handling of a shallow upstream is needed. At some
+ stage, upload-pack has to check if it sends a shallow commit,
+ and it should send that information early (or fail, if the
+ client does not support shallow repositories). There is no
+ support at all for this in this patch series.
+
+- Instead of locking $GIT_DIR/shallow at the start, just
+ the timestamp of it is noted, and when it comes to writing it,
+ a check is performed if the mtime is still the same, dying if
+ it is not.
+
+- It is unclear how "push into/from a shallow repo" should behave.
+
+- If you deepen a history, you'd want to get the tags of the
+ newly stored (but older!) commits. This does not work right now.
+
+To make a shallow clone, you can call "git-clone --depth 20 repo".
+The result contains only commit chains with a length of at most 20.
+It also writes an appropriate $GIT_DIR/shallow.
+
+You can deepen a shallow repository with "git-fetch --depth 20
+repo branch", which will fetch branch from repo, but stop at depth
+20, updating $GIT_DIR/shallow.
+
+The special depth 2147483647 (or 0x7fffffff, the largest positive
+number a signed 32-bit integer can contain) means infinite depth.
diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
new file mode 100644
index 0000000..fa0d01c
--- /dev/null
+++ b/Documentation/technical/sparse-checkout.txt
@@ -0,0 +1,1103 @@
+Table of contents:
+
+ * Terminology
+ * Purpose of sparse-checkouts
+ * Usecases of primary concern
+ * Oversimplified mental models ("Cliff Notes" for this document!)
+ * Desired behavior
+ * Behavior classes
+ * Subcommand-dependent defaults
+ * Sparse specification vs. sparsity patterns
+ * Implementation Questions
+ * Implementation Goals/Plans
+ * Known bugs
+ * Reference Emails
+
+
+=== Terminology ===
+
+cone mode: one of two modes for specifying the desired subset of files
+ in a sparse-checkout. In cone mode, the user specifies
+ directories (getting both everything under that directory and
+ everything in leading directories), while in non-cone
+ mode, the user specifies gitignore-style patterns. Controlled
+ by the --[no-]cone option to sparse-checkout init|set.
+
+SKIP_WORKTREE: When tracked files do not match the sparse specification and
+ are removed from the working tree, the file in the index is marked
+ with a SKIP_WORKTREE bit. Note that if a tracked file has the
+ SKIP_WORKTREE bit set but the file is later written by the user to
+ the working tree anyway, the SKIP_WORKTREE bit will be cleared at
+ the beginning of any subsequent Git operation.
+
+ Most sparse checkout users are unaware of this implementation
+ detail, and the term should generally be avoided in user-facing
+ descriptions and command flags. Unfortunately, prior to the
+ `sparse-checkout` subcommand this low-level detail was exposed,
+ and as of time of writing, is still exposed in various places.
+
+sparse-checkout: a subcommand in git used to reduce the files present in
+ the working tree to a subset of all tracked files. Also, the
+ name of the file in the $GIT_DIR/info directory used to track
+ the sparsity patterns corresponding to the user's desired
+ subset.
+
+sparse cone: see cone mode
+
+sparse directory: An entry in the index corresponding to a directory, which
+ appears in the index instead of all the files under that directory
+ that would normally appear. See also sparse-index. Something that
+ can cause confusion is that the "sparse directory" does NOT match
+ the sparse specification, i.e. the directory is NOT present in the
+ working tree. May be renamed in the future (e.g. to "skipped
+ directory").
+
+sparse index: A special mode for sparse-checkout that also makes the
+ index sparse by recording a directory entry in lieu of all the
+ files underneath that directory (thus making that a "skipped
+ directory" which unfortunately has also been called a "sparse
+ directory"), and does this for potentially multiple
+ directories. Controlled by the --[no-]sparse-index option to
+ init|set|reapply.
+
+sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
+ define the set of files of interest. A warning: It is easy to
+ over-use this term (or the shortened "patterns" term), for two
+ reasons: (1) users in cone mode specify directories rather than
+ patterns (their directories are transformed into patterns, but
+ users may think you are talking about non-cone mode if you use the
+ word "patterns"), and (b) the sparse specification might
+ transiently differ in the working tree or index from the sparsity
+ patterns (see "Sparse specification vs. sparsity patterns").
+
+sparse specification: The set of paths in the user's area of focus. This
+ is typically just the tracked files that match the sparsity
+ patterns, but the sparse specification can temporarily differ and
+ include additional files. (See also "Sparse specification
+ vs. sparsity patterns")
+
+ * When working with history, the sparse specification is exactly
+ the set of files matching the sparsity patterns.
+ * When interacting with the working tree, the sparse specification
+ is the set of tracked files with a clear SKIP_WORKTREE bit or
+ tracked files present in the working copy.
+ * When modifying or showing results from the index, the sparse
+ specification is the set of files with a clear SKIP_WORKTREE bit
+ or that differ in the index from HEAD.
+ * If working with the index and the working copy, the sparse
+ specification is the union of the paths from above.
+
+vivifying: When a command restores a tracked file to the working tree (and
+ hopefully also clears the SKIP_WORKTREE bit in the index for that
+ file), this is referred to as "vivifying" the file.
+
+
+=== Purpose of sparse-checkouts ===
+
+sparse-checkouts exist to allow users to work with a subset of their
+files.
+
+You can think of sparse-checkouts as subdividing "tracked" files into two
+categories -- a sparse subset, and all the rest. In the implementation, we
+mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them
+out of the working tree. The SKIP_WORKTREE files are still tracked, just
+not present in the working tree.
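+
+A tiny sketch of that mental model (purely illustrative; the real state
+lives in the index, not in lists like these):
+
+    def partition_tracked(tracked_paths, in_sparse_spec):
+        # Tracked files are split into the sparse subset (present in the
+        # working tree) and "all the rest" (SKIP_WORKTREE, still tracked
+        # but absent from the working tree).
+        present, skip_worktree = [], []
+        for path in tracked_paths:
+            (present if in_sparse_spec(path) else skip_worktree).append(path)
+        return present, skip_worktree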
+
+In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
+is missing from the working tree but pretend the file contents match HEAD".
+That was not only bogus (it actually meant the file missing from the
+working tree matched the index rather than HEAD), but it was also a
+low-level detail which only provided decent behavior for a few commands.
+There were a surprising number of ways in which that guiding principle gave
+command results that violated user expectations, and as such was a bad
+mental model. However, it persisted for many years and may still be found
+in some corners of the code base.
+
+Anyway, the idea of "working with a subset of files" is simple enough, but
+there are multiple different high-level usecases which affect how some Git
+subcommands should behave. Further, even if we only considered one of
+those usecases, sparse-checkouts can modify different subcommands in over a
+half dozen different ways. Let's start by considering the high level
+usecases:
+
+ A) Users are _only_ interested in the sparse portion of the repo
+
+ A*) Users are _only_ interested in the sparse portion of the repo
+ that they have downloaded so far
+
+ B) Users want a sparse working tree, but are working in a larger whole
+
+ C) sparse-checkout is a behind-the-scenes implementation detail allowing
+ Git to work with a specially crafted in-house virtual file system;
+ users are actually working with a "full" working tree that is
+ lazily populated, and sparse-checkout helps with the lazy population
+ piece.
+
+It may be worth explaining each of these in a bit more detail:
+
+
+ (Behavior A) Users are _only_ interested in the sparse portion of the repo
+
+These folks might know there are other things in the repository, but
+don't care. They are uninterested in other parts of the repository, and
+only want to know about changes within their area of interest. Showing
+them other files from history (e.g. from diff/log/grep/etc.) is a
+usability annoyance, potentially a huge one since other changes in
+history may dwarf the changes they are interested in.
+
+Some of these users also arrive at this usecase from wanting to use partial
+clones together with sparse checkouts (in a way where they have downloaded
+blobs within the sparse specification) and do disconnected development.
+Not only do these users generally not care about other parts of the
+repository, but they consider it a blocker for Git commands to try to operate
+on those. If commands attempt to access paths in history outside the sparsity
+specification, then the partial clone will attempt to download additional
+blobs on demand, fail, and then fail the user's command. (This may be
+unavoidable in some cases, e.g. when `git merge` has non-trivial changes to
+reconcile outside the sparse specification, but we should limit how often
+users are forced to connect to the network.)
+
+Also, even for users using partial clones that do not mind being
+always connected to the network, the need to download blobs as
+side-effects of various other commands (such as the printed diffstat
+after a merge or pull) can lead to worries about local repository size
+growing unnecessarily[10].
+
+ (Behavior A*) Users are _only_ interested in the sparse portion of the repo
+ that they have downloaded so far (a variant on the first usecase)
+
+This variant is driven by folks who are using partial clones together with
+sparse checkouts and doing disconnected development (so far sounding like a
+subset of behavior A users), but doing so on very large repositories. The
+reason for yet another variant is that downloading even just the blobs
+through history within their sparse specification may be too much, so they
+only download some. They would still like operations to succeed without
+network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
+or `git grep ${SEARCH_TERM} OLDREV` would need to be prepared to provide
+partial results that depend on what happens to have been downloaded.
+
+This variant could be viewed as Behavior A with the sparse specification
+for history querying operations modified from "sparsity patterns" to
+"sparsity patterns limited to the blobs we have already downloaded".
+
+ (Behavior B) Users want a sparse working tree, but are working in a
+ larger whole
+
+Stolee described this usecase this way[11]:
+
+"I'm also focused on users that know that they are a part of a larger
+whole. They know they are operating on a large repository but focus on
+what they need to contribute their part. I expect multiple "roles" to
+use very different, almost disjoint parts of the codebase. Some other
+"architect" users operate across the entire tree or hop between different
+sections of the codebase as necessary. In this situation, I'm wary of
+scoping too many features to the sparse-checkout definition, especially
+"git log," as it can be too confusing to have their view of the codebase
+depend on your "point of view."
+
+People might also end up wanting behavior B due to complex inter-project
+dependencies. The initial attempts to use sparse-checkouts usually involve
+the directories you are directly interested in plus what those directories
+depend upon within your repository. But there's a monkey wrench here: if
+you have integration tests, they invert the hierarchy: to run integration
+tests, you need not only what you are interested in and its in-tree
+dependencies, you also need everything that depends upon what you are
+interested in or that depends upon one of your dependencies...AND you need
+all the in-tree dependencies of that expanded group. That can easily
+change your sparse-checkout into a nearly dense one.
+
+Naturally, that tends to kill the benefits of sparse-checkouts. There are
+a couple solutions to this conundrum: either avoid grabbing in-repo
+dependencies (maybe have built versions of your in-repo dependencies pulled
+from a CI cache somewhere), or say that users shouldn't run integration
+tests directly and instead do it on the CI server when they submit a code
+review. Or do both. Regardless of whether you stub out your in-repo
+dependencies or stub out the things that depend upon you, there is
+certainly a reason to want to query and be aware of those other stubbed-out
+parts of the repository, particularly when the dependencies are complex or
+change relatively frequently. Thus, for such uses, sparse-checkouts can be
+used to limit what you directly build and modify, but these users do not
+necessarily want their sparse checkout paths to limit their queries of
+versions in history.
+
+Some people may also be interested in behavior B over behavior A simply as
+a performance workaround: if they are using non-cone mode, then they have
+to deal with its inherent quadratic performance problems. In that mode,
+every operation that checks whether paths match the sparsity specification
+can be expensive. As such, these users may only be willing to pay for
+those expensive checks when interacting with the working copy, and may
+prefer getting "unrelated" results from their history queries over having
+slow commands.
+
+ (Behavior C) sparse-checkout is an implementation detail supporting a
+ special VFS.
+
+This usecase goes slightly against the traditional definition of
+sparse-checkout in that it actually tries to present a full or dense
+checkout to the user. However, this usecase utilizes the same underlying
+technical underpinnings in a new way which does provide some performance
+advantages to users. The basic idea is that a company can have an in-house
+Git-aware Virtual File System which pretends all files are present in the
+working tree, by intercepting all file system accesses and using those to
+fetch and write accessed files on demand via partial clones. The VFS uses
+sparse-checkout to prevent Git from writing or paying attention to many
+files, and manually updates the sparse checkout patterns itself based on
+user access and modification of files in the working tree. See commit
+ecc7c8841d ("repo_read_index: add config to expect files outside sparse
+patterns", 2022-02-25) and the link at [17] for a more detailed description
+of such a VFS.
+
+The biggest difference here is that users are completely unaware that the
+sparse-checkout machinery is even in use. The sparse patterns are not
+specified by the user but rather are under the complete control of the VFS
+(and the patterns are updated frequently and dynamically by it). The user
+will perceive the checkout as dense, and commands should thus behave as if
+all files are present.
+
+
+=== Usecases of primary concern ===
+
+Most of the rest of this document will focus on Behavior A and Behavior
+B. Some notes about the other two cases and why we are not focusing on
+them:
+
+ (Behavior A*)
+
+Supporting this usecase is estimated to be difficult and a lot of work.
+There are no plans to implement it currently, but it may be a potential
+future alternative. Knowing about the existence of additional alternatives
+may affect our choice of command line flags (e.g. if we need tri-state or
+quad-state flags rather than just binary flags), so it was still important
+to at least note it here.
+
+Further, I believe the descriptions below for Behavior A are probably still
+valid for this usecase, with the only exception being that it redefines the
+sparse specification to restrict it to already-downloaded blobs. The hard
+part is in making commands capable of respecting that modified definition.
+
+ (Behavior C)
+
+This usecase violates some of the early sparse-checkout documented
+assumptions (since files marked as SKIP_WORKTREE will be displayed to users
+as present in the working tree). That violation may mean various
+sparse-checkout related behaviors are not well suited to this usecase and
+we may need tweaks -- to both documentation and code -- to handle it.
+However, this usecase is also perhaps the simplest model to support in that
+everything behaves like a dense checkout with a few exceptions (e.g. branch
+checkouts and switches write fewer things, knowing the VFS will lazily
+write the rest on an as-needed basis).
+
+Since there is no publicly available VFS-related code for folks to try,
+the number of folks who can test such a usecase is limited.
+
+The primary reason to note the Behavior C usecase is that as we fix things
+to better support Behaviors A and B, there may be additional places where
+we need to make tweaks allowing folks in this usecase to get the original
+non-sparse treatment. For an example, see ecc7c8841d ("repo_read_index:
+add config to expect files outside sparse patterns", 2022-02-25). The
+secondary reason to note Behavior C is so that folks taking advantage of
+Behavior C do not assume they are part of the Behavior B camp and propose
+patches that break things for the real Behavior B folks.
+
+
+=== Oversimplified mental models ===
+
+An oversimplification of the differences in the above behaviors is:
+
+ Behavior A: Restrict worktree and history operations to sparse specification
+ Behavior B: Restrict worktree operations to sparse specification; have any
+ history operations work across all files
+ Behavior C: Do not restrict either worktree or history operations to the
+ sparse specification...with the exception of branch checkouts or
+ switches which avoid writing files that will match the index so
+ they can later lazily be populated instead.
+
+
+=== Desired behavior ===
+
+As noted previously, despite the simple idea of just working with a subset
+of files, there are a range of different behavioral changes that need to be
+made to different subcommands to work well with such a feature. See
+[1,2,3,4,5,6,7,8,9,10] for various examples. In particular, at [2], we saw
+that mere composition of other commands that individually worked correctly
+in a sparse-checkout context did not imply that the higher level command
+would work correctly; it sometimes requires further tweaks. So,
+understanding these differences can be beneficial.
+
+* Commands behaving the same regardless of high-level use-case
+
+ * commands that only look at files within the sparsity specification
+
+ * diff (without --cached or REVISION arguments)
+ * grep (without --cached or REVISION arguments)
+ * diff-files
+
+ * commands that restore files to the working tree that match sparsity
+ patterns, and remove unmodified files that don't match those
+ patterns:
+
+ * switch
+ * checkout (the switch-like half)
+ * read-tree
+ * reset --hard
+
+ * commands that write conflicted files to the working tree, but otherwise
+ will omit writing files to the working tree that do not match the
+ sparsity patterns:
+
+ * merge
+ * rebase
+ * cherry-pick
+ * revert
+
+ * `am` and `apply --cached` should probably be in this section but
+ are buggy (see the "Known bugs" section below)
+
+ The behavior for these commands somewhat depends upon the merge
+ strategy being used:
+ * `ort` behaves as described above
+ * `recursive` tries to not vivify files unnecessarily, but does sometimes
+ vivify files without conflicts.
+ * `octopus` and `resolve` will always vivify any file changed in the merge
+ relative to the first parent, which is rather suboptimal.
+
+ It is also important to note that these commands WILL update the index
+ outside the sparse specification relative to when the operation began,
+ BUT these commands often make a commit just before or after such that
+ by the end of the operation there is no change to the index outside the
+ sparse specification. Of course, if the operation hits conflicts or
+ does not make a commit, then these operations clearly can modify the
+ index outside the sparse specification.
+
+ Finally, it is important to note that at least the first four of these
+ commands also try to remove differences between the sparse
+ specification and the sparsity patterns (much like the commands in the
+ previous section).
+
+ * commands that always ignore sparsity since commits must be full-tree
+
+ * archive
+ * bundle
+ * commit
+ * format-patch
+ * fast-export
+ * fast-import
+ * commit-tree
+
+ * commands that write any modified file to the working tree (conflicted
+ or not, and whether those paths match sparsity patterns or not):
+
+ * stash
+ * apply (without `--index` or `--cached`)
+
+* Commands that may slightly differ for behavior A vs. behavior B:
+
+ Commands in this category behave mostly the same between the two
+ behaviors, but may differ in verbosity and types of warning and error
+ messages.
+
+ * commands that make modifications to which files are tracked:
+ * add
+ * rm
+ * mv
+ * update-index
+
+ The fact that files can move between the 'tracked' and 'untracked'
+ categories means some commands will have to treat untracked files
+ differently. But if we have to treat untracked files differently,
+ then additional commands may also need changes:
+
+ * status
+ * clean
+
+ In particular, `status` may need to report any untracked files outside
+ the sparsity specification as an erroneous condition (especially to
+ avoid the user trying to `git add` them, forcing `git add` to display
+ an error).
+
+ It's not clear to me exactly how (or even if) `clean` would change,
+ but it's the other command that also affects untracked files.
+
+ `update-index` may be slightly special. Its --[no-]skip-worktree flag
+ may need to ignore the sparse specification by its nature. Also, its
+ current --[no-]ignore-skip-worktree-entries default is totally bogus.
+
+ * commands for manually tweaking paths in both the index and the working tree
+ * `restore`
+ * the restore-like half of `checkout`
+
+ These commands should be similar to add/rm/mv in that they should
+ only operate on the sparse specification by default, and require a
+ special flag to operate on all files.
+
+ Also, note that these commands currently have a number of issues (see
+ the "Known bugs" section below)
+
+* Commands that significantly differ for behavior A vs. behavior B:
+
+ * commands that query history
+ * diff (with --cached or REVISION arguments)
+ * grep (with --cached or REVISION arguments)
+ * show (when given commit arguments)
+ * blame (only matters when one or more -C flags are passed)
+ * and annotate
+ * log
+ * whatchanged
+ * ls-files
+ * diff-index
+ * diff-tree
+ * ls-tree
+
+ Note: for log and whatchanged, revision walking logic is unaffected,
+ but the display of patches is affected by scoping the command to the
+ sparse-checkout. (The fact that revision walking is unaffected is
+ why rev-list, shortlog, show-branch, and bisect are not in this
+ list.)
+
+ ls-files may be slightly special in that e.g. `git ls-files -t` is
+ often used to see what is sparse and what is not. Perhaps -t should
+ always work on the full tree?
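+
+ For reference, a small sketch of that usage (file names hypothetical);
+ 'H' marks a regular cached entry and 'S' marks a skip-worktree entry:
+
+     $ git ls-files -t
+     H in-cone/main.c
+     S out-of-cone/helper.c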
+
+* Commands I don't know how to classify
+
+ * range-diff
+
+ Is this like `log` or `format-patch`?
+
+ * cherry
+
+ See range-diff.
+
+* Commands unaffected by sparse-checkouts
+
+ * shortlog
+ * show-branch
+ * rev-list
+ * bisect
+
+ * branch
+ * describe
+ * fetch
+ * gc
+ * init
+ * maintenance
+ * notes
+ * pull (merge & rebase have the necessary changes)
+ * push
+ * submodule
+ * tag
+
+ * config
+ * filter-branch (works in separate checkout without sparse-checkout setup)
+ * pack-refs
+ * prune
+ * remote
+ * repack
+ * replace
+
+ * bugreport
+ * count-objects
+ * fsck
+ * gitweb
+ * help
+ * instaweb
+ * merge-tree (doesn't touch worktree or index, and merges always compute full-tree)
+ * rerere
+ * verify-commit
+ * verify-tag
+
+ * commit-graph
+ * hash-object
+ * index-pack
+ * mktag
+ * mktree
+ * multi-pack-index
+ * pack-objects
+ * prune-packed
+ * symbolic-ref
+ * unpack-objects
+ * update-ref
+ * write-tree (operates on index, possibly optimized to use sparse dir entries)
+
+ * for-each-ref
+ * get-tar-commit-id
+ * ls-remote
+ * merge-base (merges are computed full tree, so merge base should be too)
+ * name-rev
+ * pack-redundant
+ * rev-parse
+ * show-index
+ * show-ref
+ * unpack-file
+ * var
+ * verify-pack
+
+ * <Everything under 'Interacting with Others' in 'git help --all'>
+ * <Everything under 'Low-level...Syncing' in 'git help --all'>
+ * <Everything under 'Low-level...Internal Helpers' in 'git help --all'>
+ * <Everything under 'External commands' in 'git help --all'>
+
+* Commands that might be affected, but who cares?
+
+ * merge-file
+ * merge-index
+ * gitk?
+
+
+=== Behavior classes ===
+
+From the above there are a few classes of behavior:
+
+ * "restrict"
+
+ Commands in this class only read or write files in the working tree
+ within the sparse specification.
+
+ When moving to a new commit (e.g. switch, reset --hard), these commands
+ may update index files outside the sparse specification as of the start
+ of the operation, but by the end of the operation those index files
+ will match HEAD again and thus those files will again be outside the
+ sparse specification.
+
+ When paths are explicitly specified, those paths are intersected with
+ the sparse specification, and the command will only operate on the
+ resulting paths (e.g. `git restore [--staged] -- '*.png'`,
+ `git reset -p -- '*.md'`).
+
+ Some of these commands may also attempt, at the end of their operation,
+ to cull transient differences between the sparse specification and the
+ sparsity patterns (see "Sparse specification vs. sparsity patterns" for
+ details, but this basically means either removing unmodified files not
+ matching the sparsity patterns and marking those files as
+ SKIP_WORKTREE, or vivifying files that match the sparsity patterns and
+ marking those files as !SKIP_WORKTREE).
+
+ * "restrict modulo conflicts"
+
+ Commands in this class generally behave like the "restrict" class,
+ except that:
+ (1) they will ignore the sparse specification and write files with
+ conflicts to the working tree (thus temporarily expanding the
+ sparse specification to include such files.)
+ (2) they are grouped with commands which move to a new commit, since
+ they often create a commit and then move to it, even though we
+ know there are many exceptions to moving to the new commit. (For
+ example, the user may rebase a commit that becomes empty, or have
+ a cherry-pick which conflicts, or a user could run `merge
+ --no-commit`, and we also view `apply --index` kind of like `am
+ --no-commit`.) As such, these commands can make changes to index
+ files outside the sparse specification, though they'll mark such
+ files with SKIP_WORKTREE.
+
+ * "restrict also specially applied to untracked files"
+
+ Commands in this class generally behave like the "restrict" class,
+ except that they have to handle untracked files differently too, often
+ because these commands are dealing with files changing state between
+ 'tracked' and 'untracked'. Often, this may mean printing an error
+ message if the command had nothing to do, but the arguments may have
+ referred to files whose tracked-ness state could have changed were it
+ not for the sparsity patterns excluding them.
+
+ * "no restrict"
+
+ Commands in this class ignore the sparse specification entirely.
+
+ * "restrict or no restrict dependent upon behavior A vs. behavior B"
+
+ Commands in this class behave like "no restrict" for folks in the
+ behavior B camp, and like "restrict" for folks in the behavior A camp.
+ However, when behaving like "restrict" a warning of some sort might be
+ provided that history queries have been limited by the sparse-checkout
+ specification.
+
+
+=== Subcommand-dependent defaults ===
+
+Note that we have different defaults depending on the command for the
+desired behavior:
+
+ * Commands defaulting to "restrict":
+ * diff-files
+ * diff (without --cached or REVISION arguments)
+ * grep (without --cached or REVISION arguments)
+ * switch
+ * checkout (the switch-like half)
+ * reset (<commit>)
+
+ * restore
+ * checkout (the restore-like half)
+ * checkout-index
+ * reset (with pathspec)
+
+ This behavior makes sense; these interact with the working tree.
+
+ * Commands defaulting to "restrict modulo conflicts":
+ * merge
+ * rebase
+ * cherry-pick
+ * revert
+
+ * am
+ * apply --index (which is kind of like an `am --no-commit`)
+
+ * read-tree (especially with -m or -u; is kind of like a --no-commit merge)
+ * reset (<tree-ish>, due to similarity to read-tree)
+
+ These also interact with the working tree, but require slightly
+ different behavior either so that (a) conflicts can be resolved or (b)
+ because they are kind of like a merge-without-commit operation.
+
+ (See also the "Known bugs" section below regarding `am` and `apply`)
+
+ * Commands defaulting to "no restrict":
+ * archive
+ * bundle
+ * commit
+ * format-patch
+ * fast-export
+ * fast-import
+ * commit-tree
+
+ * stash
+ * apply (without `--index`)
+
+ These have completely different defaults and perhaps deserve the most
+ detailed explanation:
+
+ In the case of commands in the first group (format-patch,
+ fast-export, bundle, archive, etc.), these are commands for
+ communicating history, which will be broken if they restrict to a
+ subset of the repository. As such, they operate on full paths and
+ have no `--restrict` option for overriding. Some of these commands may
+ take paths for manually restricting what is exported, but it needs to
+ be very explicit.
+
+ In the case of stash, it needs to vivify files to avoid losing the
+ user's changes.
+
+ In the case of apply without `--index`, that command needs to update
+ the working tree without the index (or the index without the working
+ tree if `--cached` is passed), and if we restrict those updates to the
+ sparse specification then we'll lose changes from the user.
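+
+ A hedged sketch (paths hypothetical) of the stash case: if the cone is
+ narrowed while changes are stashed, popping the stash must vivify the
+ file again so the change is not lost:
+
+     $ echo fix >> sub/dir/file.c
+     $ git stash push
+     $ git sparse-checkout set keep/this/dir   # sub/dir is de-populated
+     $ git stash pop                           # vivifies sub/dir/file.c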
+
+ * Commands defaulting to "restrict also specially applied to untracked files":
+ * add
+ * rm
+ * mv
+ * update-index
+ * status
+ * clean (?)
+
+ Our original implementation for the first three of these commands was
+ "no restrict", but it had some severe usability issues:
+ * `git add <somefile>`, if honored and outside the sparse
+ specification, can result in the file randomly disappearing later
+ when some subsequent command is run (since various commands
+ automatically clean up unmodified files outside the sparse
+ specification).
+ * `git rm '*.jpg'` could very negatively surprise users if it deletes
+ files outside the range of the user's interest.
+ * `git mv` has similar surprises when moving into or out of the cone,
+ so it is best to restrict by default.
+
+ So, we switched `add` and `rm` to default to "restrict", which made
+ usability problems much less severe and less frequent, but we still got
+ complaints because commands like:
+ git add <file-outside-sparse-specification>
+ git rm <file-outside-sparse-specification>
+ would silently do nothing. We should instead print an error in those
+ cases to get usability right.
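+
+ Note that an escape hatch for deliberately operating outside the
+ sparse specification already exists for these commands; a brief
+ sketch (path hypothetical):
+
+     $ git add --sparse outside-cone/file.c
+     $ git rm --sparse outside-cone/old.c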
+
+ update-index needs to be updated to match, and status and maybe clean
+ also need to be updated to specially handle untracked paths.
+
+ There may be a difference in here between behavior A and behavior B in
+ terms of verboseness of errors or additional warnings.
+
+ * Commands falling under "restrict or no restrict dependent upon behavior
+ A vs. behavior B"
+
+ * diff (with --cached or REVISION arguments)
+ * grep (with --cached or REVISION arguments)
+ * show (when given commit arguments)
+ * blame (only matters when one or more -C flags passed)
+ * and annotate
+ * log
+ * and variants: shortlog, gitk, show-branch, whatchanged, rev-list
+ * ls-files
+ * diff-index
+ * diff-tree
+ * ls-tree
+
+ For now, we default to behavior B for these commands, which implies a
+ default of "no restrict".
+
+ Note that two of these commands -- diff and grep -- also appeared in a
+ different list with a default of "restrict", but only when limited to
+ searching the working tree. The working tree vs. history distinction
+ is fundamental in how behavior B operates, so this is expected. Note,
+ though, that for diff and grep with --cached, when doing "restrict"
+ behavior, the difference between sparse specification and sparsity
+ patterns is important to handle.
+
+ "restrict" may make more sense as the long term default for these[12].
+ Also, supporting "restrict" for these commands might be a fair amount
+ of work to implement, meaning it might be implemented over multiple
+ releases. If that behavior were the default in the commands that
+ supported it, that would force behavior B users to need to learn to
+ slowly add additional flags to their commands, depending on git
+ version, to get the behavior they want. That gradual switchover would
+ be painful, so we should avoid it at least until it's fully
+ implemented.
+
+
+=== Sparse specification vs. sparsity patterns ===
+
+In a well-behaved situation, the sparse specification is given directly
+by the $GIT_DIR/info/sparse-checkout file. However, it can transiently
+diverge for a few reasons:
+
+ * needing to resolve conflicts (merging will vivify conflicted files)
+ * running Git commands that implicitly vivify files (e.g. "git stash apply")
+ * running Git commands that explicitly vivify files (e.g. "git checkout
+ --ignore-skip-worktree-bits FILENAME")
+ * other commands that write to these files (perhaps a user copies it
+ from elsewhere)
+
+For the last item, note that we do automatically clear the SKIP_WORKTREE
+bit for files that are present in the working tree. This has been true
+since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
+2022-03-09).
+
+However, such a situation is transient because:
+
+ * Such transient differences can and will be automatically removed as
+ a side-effect of commands which call unpack_trees() (checkout,
+ merge, reset, etc.).
+ * Users can also request such transient differences be corrected via
+ running `git sparse-checkout reapply`. Various places recommend
+ running that command.
+ * Additional commands are also welcome to implicitly fix these
+ differences; we may add more in the future.
+
+While we avoid dropping unstaged changes or files which have conflicts,
+we otherwise aggressively try to fix these transient differences. If
+users want these differences to persist, they should run the `set` or
+`add` subcommands of `git sparse-checkout` to reflect their intended
+sparse specification.
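+
+For example (directory name hypothetical), the difference between
+correcting a transient divergence and persisting it looks like this:
+
+    $ git sparse-checkout reapply        # drop transient differences by
+                                         # re-applying the sparsity patterns
+    $ git sparse-checkout add extra/dir  # or persist the expansion by
+                                         # widening the sparsity patterns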
+
+However, when we need to do a query on history restricted to the
+"relevant subset of files" such a transiently expanded sparse
+specification is ignored. There are a couple of reasons for this:
+
+ * The behavior wanted when doing something like
+ git grep expression REVISION
+ is roughly what the users would expect from
+ git checkout REVISION && git grep expression
+ (modulo a "REVISION:" prefix), which has a couple ramifications:
+
+ * REVISION may have paths not in the current index, so there is no
+ path we can consult for a SKIP_WORKTREE setting for those paths.
+
+ * Since `checkout` is one of those commands that tries to remove
+ transient differences in the sparse specification, it makes sense
+ to use the corrected sparse specification
+ (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
+ consult SKIP_WORKTREE anyway.
+
+So, a transiently expanded (or restricted) sparse specification applies to
+the working tree, but not to history queries where we always use the
+sparsity patterns. (See [16] for an early discussion of this.)
+
+Similar to a transiently expanded sparse specification of the working tree
+based on additional files being present in the working tree, we also need
+to consider additional files being modified in the index. In particular,
+if the user has staged changes to files (relative to HEAD) that do not
+match the sparsity patterns, and the file is not present in the working
+tree, we still want to consider the file part of the sparse specification
+if we are specifically performing a query related to the index (e.g. git
+diff --cached [REVISION], git diff-index [REVISION], git restore --staged
+--source=REVISION -- PATHS, etc.). Note that a transiently expanded sparse
+specification for the index usually only matters under behavior A, since
+under behavior B index operations are lumped with history and tend to
+operate full-tree.
+
+
+=== Implementation Questions ===
+
+ * Do the options --scope={sparse,all} sound good to others? Are there better
+ options?
+ * Names in use, or appearing in patches, or previously suggested:
+ * --sparse/--dense
+ * --ignore-skip-worktree-bits
+ * --ignore-skip-worktree-entries
+ * --ignore-sparsity
+ * --[no-]restrict-to-sparse-paths
+ * --full-tree/--sparse-tree
+ * --[no-]restrict
+ * --scope={sparse,all}
+ * --focus/--unfocus
+ * --limit/--unlimited
+ * Rationale making me lean slightly towards --scope={sparse,all}:
+ * We want a name that works for many commands, so we need a name that
+ does not conflict
+ * We know that we have more than two possible usecases, so it is best
+ to avoid a flag that appears to be binary.
+ * --scope={sparse,all} isn't overly long and seems relatively
+ explanatory
+ * `--sparse`, as used in add/rm/mv, is totally backwards for
+ grep/log/etc. Changing the meaning of `--sparse` for these
+ commands would fix the backwardness, but possibly break existing
+ scripts. Using a new name pairing would allow us to treat
+ `--sparse` in these commands as a deprecated alias.
+ * There is a different `--sparse`/`--dense` pair for commands using
+ revision machinery, so using that naming might cause confusion
+ * There is also a `--sparse` in both pack-objects and show-branch, which
+ don't conflict but do suggest that `--sparse` is overloaded
+ * The name --ignore-skip-worktree-bits is a double negative, is
+ quite a mouthful, refers to an implementation detail that many
+ users may not be familiar with, and we'd need a negation for it
+ which would probably be even more ridiculously long. (But we
+ can make --ignore-skip-worktree-bits a deprecated alias for
+ --no-restrict.)
+
+ * If a config option is added (sparse.scope?) what should the values and
+ description be? "sparse" (behavior A), "worktree-sparse-history-dense"
+ (behavior B), "dense" (behavior C)? There's a risk of confusion,
+ because even for Behaviors A and B we want some commands to be
+ full-tree and others to operate sparsely, so the wording may need to be
+ more tied to the usecases and somehow explain that. Also, right now,
+ the primary difference we are focusing on is just the history-querying
+ commands (log/diff/grep). Previous config suggestion here: [13]
+
+ * Is `--no-expand` a good alias for ls-files's `--sparse` option?
+ (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
+ because in non-cone mode it does nothing and in cone-mode it shows the
+ sparse directory entries which are technically outside the sparse
+ specification)
+
+ * Under Behavior A:
+ * Does ls-files' `--no-expand` override the default `--scope=all`, or
+ does it need an extra flag?
+ * Does ls-files' `-t` option imply `--scope=all`?
+ * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?
+
+ * sparse-checkout: once behavior A is fully implemented, should we take
+ an interim measure to ease people into switching the default? Namely,
+ if folks are not already in a sparse checkout, then require
+ `sparse-checkout init/set` to take a
+ `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
+ would set sparse.scope according to the setting given), and throw an
+ error if the flag is not provided? That error would be a great place
+ to warn folks that the default may change in the future, and get them
+ used to specifying what they want so that the eventual default switch
+ is seamless for them.
+
+
+=== Implementation Goals/Plans ===
+
+ * Get buy-in on this document in general.
+
+ * Figure out answers to the 'Implementation Questions' section (above)
+
+ * Fix bugs in the 'Known bugs' section (below)
+
+ * Provide some kind of method for backfilling the blobs within the sparse
+ specification in a partial clone
+
+ [Below here is kind of spitballing since the first two haven't been resolved]
+
+ * update-index: flip the default to --no-ignore-skip-worktree-entries,
+ nuke this stupid "Oh, there's a bug? Let me add a flag to let users
+ request that they not trigger this bug." flag
+
+ * Flags & Config
+ * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
+ * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
+ a deprecated alias for `--scope=all`
+ * Create config option (sparse.scope?), tie it to the "Cliff notes"
+ overview
+
+ * Add --scope=sparse (and --scope=all) flag to each of the history querying
+ commands. IMPORTANT: make sure diff machinery changes don't mess with
+ format-patch, fast-export, etc.
+
+=== Known bugs ===
+
+This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've
+been working on it.
+
+0. Behavior A is not well supported in Git. (Behavior B didn't use to
+ be either, but was the easier of the two to implement.)
+
+1. am and apply:
+
+ apply, without `--index` or `--cached`, relies on files being present
+ in the working tree, and also writes to them unconditionally. As
+ such, it should first check for the files' presence, and if found to
+ be SKIP_WORKTREE, then clear the bit and vivify the paths, then do
+ its work. Currently, it just throws an error.
+
+ apply, with either `--cached` or `--index`, will not preserve the
+ SKIP_WORKTREE bit. This is fine if the file has conflicts, but
+ otherwise SKIP_WORKTREE bits should be preserved for --cached and
+ probably also for --index.
+
+ am, if there are no conflicts, will vivify files and fail to preserve
+ the SKIP_WORKTREE bit. If there are conflicts and `-3` is not
+ specified, it will vivify files and then complain the patch doesn't
+ apply. If there are conflicts and `-3` is specified, it will vivify
+ files and then complain that those vivified files would be
+ overwritten by merge.
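+
+ A hedged sketch of reproducing the `apply` half of this (path
+ hypothetical; exact messages vary by version):
+
+     $ git ls-files -t -- sparse/skipped.c
+     S sparse/skipped.c
+     $ git apply fix-skipped.patch     # patch touches sparse/skipped.c;
+                                       # this currently errors out instead
+                                       # of vivifying the file first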
+
+2. reset --hard:
+
+ reset --hard provides a confusing error message (it works correctly, but
+ misleads the user into believing it didn't):
+
+ $ touch addme
+ $ git add addme
+ $ git ls-files -t
+ H addme
+ H tracked
+ S tracked-but-maybe-skipped
+ $ git reset --hard # usually works great
+ error: Path 'addme' not uptodate; will not remove from working tree.
+ HEAD is now at bdbbb6f third
+ $ git ls-files -t
+ H tracked
+ S tracked-but-maybe-skipped
+ $ ls -1
+ tracked
+
+ `git reset --hard` DID remove addme from the index and the working tree, contrary
+ to the error message, but in line with how reset --hard should behave.
+
+3. read-tree
+
+ `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the
+ entries it reads into the index, resulting in all your files suddenly
+ appearing to be "deleted".
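+
+ A hedged sketch of this (path hypothetical): the entry loses its
+ SKIP_WORKTREE bit, so the intentionally absent file is reported as
+ deleted:
+
+     $ git ls-files -t -- outside-cone/file.c
+     S outside-cone/file.c
+     $ git read-tree HEAD
+     $ git status --porcelain -- outside-cone/file.c
+      D outside-cone/file.c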
+
+4. Checkout, restore:
+
+ These commands do not handle path & revision arguments appropriately:
+
+ $ ls
+ tracked
+ $ git ls-files -t
+ H tracked
+ S tracked-but-maybe-skipped
+ $ git status --porcelain
+ $ git checkout -- '*skipped'
+ error: pathspec '*skipped' did not match any file(s) known to git
+ $ git ls-files -- '*skipped'
+ tracked-but-maybe-skipped
+ $ git checkout HEAD -- '*skipped'
+ error: pathspec '*skipped' did not match any file(s) known to git
+ $ git ls-tree HEAD | grep skipped
+ 100644 blob 276f5a64354b791b13840f02047738c77ad0584f tracked-but-maybe-skipped
+ $ git status --porcelain
+ $ git checkout HEAD~1 -- '*skipped'
+ $ git ls-files -t
+ H tracked
+ H tracked-but-maybe-skipped
+ $ git status --porcelain
+ M tracked-but-maybe-skipped
+ $ git checkout HEAD -- '*skipped'
+ $ git status --porcelain
+ $
+
+ Note that checkout without a revision (or restore --staged) fails to
+ find a file to restore from the index, even though ls-files shows
+ such a file certainly exists.
+
+ Similar issues occur with HEAD (--source=HEAD in restore's case), but
+ the commands suddenly work when HEAD~1 is specified. And then after
+ that they will work with HEAD specified, even though they didn't
+ before.
+
+ Directories are also an issue:
+
+ $ git sparse-checkout set nomatches
+ $ git status
+ On branch main
+ You are in a sparse checkout with 0% of tracked files present.
+
+ nothing to commit, working tree clean
+ $ git checkout .
+ error: pathspec '.' did not match any file(s) known to git
+ $ git checkout HEAD~1 .
+ Updated 1 path from 58916d9
+ $ git ls-files -t
+ S tracked
+ H tracked-but-maybe-skipped
+
+5. checkout and restore --staged, continued:
+
+ These commands do not correctly scope operations to the sparse
+ specification, and make it worse by not setting important SKIP_WORKTREE
+ bits:
+
+ $ git restore --source OLDREV --staged outside-sparse-cone/
+ $ git status --porcelain
+ MD outside-sparse-cone/file1
+ MD outside-sparse-cone/file2
+ MD outside-sparse-cone/file3
+
+ We can add a --scope=all mode to `git restore` to let it operate outside
+ the sparse specification, but then it will be important to set the
+ SKIP_WORKTREE bits appropriately.
+
+6. Performance issues; see:
+ https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/
+
+
+=== Reference Emails ===
+
+Emails that detail various bugs we've had in sparse-checkout:
+
+[1] (Original descriptions of behavior A & behavior B)
+ https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
+[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences)
+ https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/
+[3] (Present-despite-skipped entries)
+ https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/
+[4] (Clone --no-checkout interaction)
+ https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/ (clone --no-checkout)
+[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`)
+ https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/
+[6] (SKIP_WORKTREE is advisory, not mandatory)
+ https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/
+[7] (`worktree add` should copy sparsity settings from current worktree)
+ https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/
+[8] (Avoid negative surprises in add, rm, and mv)
+ https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/
+ https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/
+[9] (Move from out-of-cone to in-cone)
+ https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
+ https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/
+[10] (Unnecessarily downloading objects outside sparse specification)
+ https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/
+
+[11] (Stolee's comments on high-level usecases)
+ https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
+
+[12] Others commenting on eventually switching default to behavior A:
+ * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
+ * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
+ * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/
+
+[13] Previous config name suggestion and description
+ * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/
+
+[14] Tangential issue: switch to cone mode as default sparse specification mechanism:
+ https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/
+
+[15] Lengthy email on grep behavior, covering what should be searched:
+ * https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/
+
+[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations,
+ search for the parenthetical comment starting "We do not check".
+ https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/
+
+[17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/
diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
new file mode 100644
index 0000000..3b24c1a
--- /dev/null
+++ b/Documentation/technical/sparse-index.txt
@@ -0,0 +1,208 @@
+Git Sparse-Index Design Document
+================================
+
+The sparse-checkout feature allows users to focus a working directory on
+a subset of the files at HEAD. The cone mode patterns, enabled by
+`core.sparseCheckoutCone`, allow for very fast pattern matching to
+discover which files at HEAD belong in the sparse-checkout cone.
+
+Three important scale dimensions for a Git working directory are:
+
+* `HEAD`: How many files are present at `HEAD`?
+
+* Populated: How many files are within the sparse-checkout cone?
+
+* Modified: How many files has the user modified in the working directory?
+
+We will use big-O notation -- O(X) -- to denote how expensive certain
+operations are in terms of these dimensions.
+
+These dimensions are ordered by their magnitude: users (typically) modify
+fewer files than are populated, and we can only populate files at `HEAD`.
+
+Problems occur if there is an extreme imbalance in these dimensions. For
+example, if `HEAD` contains millions of paths but the populated set has
+only tens of thousands, then commands like `git status` and `git add` can
+be dominated by operations that require O(`HEAD`) operations instead of
+O(Populated). Primarily, the cost is in parsing and rewriting the index,
+which is filled primarily with files at `HEAD` that are marked with the
+`SKIP_WORKTREE` bit.
+
+The sparse-index intends to take these commands that read and modify the
+index from O(`HEAD`) to O(Populated). To do this, we need to modify the
+index format in a significant way: add "sparse directory" entries.
+
+With cone mode patterns, it is possible to detect when an entire
+directory will have its contents outside of the sparse-checkout definition.
+Instead of listing all of the files it contains as individual entries, a
+sparse-index contains an entry with the directory name, referencing the
+object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
+If we need to discover the details for paths within that directory, we
+can parse trees to find that list.
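+
+For instance, in a cone-mode sparse-checkout using a sparse-index,
+`git ls-files --sparse` shows the collapsed form (directory names
+hypothetical); sparse directory entries are listed with a trailing
+slash:
+
+    $ git ls-files --sparse
+    in-cone/a.c
+    in-cone/b.c
+    out-of-cone/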
+
+At the time of writing, sparse-directory entries violate expectations about the
+index format and its in-memory data structure. There are many consumers in
+the codebase that expect to iterate through all of the index entries and
+see only files. In fact, these loops expect to see a reference to every
+staged file. One way to handle this is to parse trees to replace a
+sparse-directory entry with all of the files within that tree as the index
+is loaded. However, parsing trees is slower than parsing the index format,
+so that is a slower operation than if we left the index alone. The plan is
+to make all of these integrations "sparse aware" so this expansion through
+tree parsing is unnecessary and they use fewer resources than when using a
+full index.
+
+The implementation plan below follows four phases to slowly integrate with
+the sparse-index. The intention is to incrementally update Git commands to
+interact safely with the sparse-index without significant slowdowns. This
+may not always be possible, but the hope is that the primary commands that
+users need in their daily work are dramatically improved.
+
+Phase I: Format and initial speedups
+------------------------------------
+
+During this phase, Git learns to enable the sparse-index and safely parse
+one. Protections are put in place so that every consumer of the in-memory
+data structure can operate with its current assumption of every file at
+`HEAD`.
+
+At first, every index parse will call a helper method,
+`ensure_full_index()`, which scans the index for sparse-directory entries
+(pointing to trees) and replaces them with the full list of paths (with
+blob contents) by parsing tree objects. This will be slower in all cases.
+The only noticeable change in behavior will be that the serialized index
+file contains sparse-directory entries.
+
+To start, we use a new required index extension, `sdir`, to allow
+inserting sparse-directory entries into indexes with file format
+versions 2, 3, and 4. This prevents Git versions that do not understand
+the sparse-index from operating on one, while allowing tools that do not
+understand the sparse-index to operate on repositories as long as they do
+not interact with the index. A new format, index v5, will be introduced
+that includes sparse-directory entries by default. It might also
+introduce other features that have been considered for improving the
+index.
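+
+As a point of reference, a minimal sketch of enabling the feature in a
+repository (the `--sparse-index` option also records the `index.sparse`
+config setting):
+
+    $ git sparse-checkout init --cone --sparse-index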
+
+Next, consumers of the index will be guarded against operating on a
+sparse-index by inserting calls to `ensure_full_index()` or
+`expand_index_to_path()`. If a specific path is requested, then those will
+be protected from within the `index_file_exists()` and `index_name_pos()`
+API calls: they will call `ensure_full_index()` if necessary. The
+intention here is to preserve existing behavior when interacting with a
+sparse-checkout. We don't want a change to happen by accident, without
+tests. Many of these locations may not need any change before removing the
+guards, but we should not do so without tests to ensure the expected
+behavior happens.
+
+It may be desirable to _change_ the behavior of some commands in the
+presence of a sparse index or more generally in any sparse-checkout
+scenario. In such cases, these should be carefully communicated and
+tested. No such behavior changes are intended during this phase.
+
+During a scan of the codebase, not every iteration of the cache entries
+needs an `ensure_full_index()` check. The basic reasons include:
+
+1. The loop is scanning for entries with non-zero stage. These entries
+ are not collapsed into a sparse-directory entry.
+
+2. The loop is scanning for submodules. These entries are not collapsed
+ into a sparse-directory entry.
+
+3. The loop is part of the index API, especially around reading or
+ writing the format.
+
+4. The loop is checking for correct order of cache entries and that is
+ correct if and only if the sparse-directory entries are in the correct
+ location.
+
+5. The loop ignores entries with the `SKIP_WORKTREE` bit set, or is
+ otherwise already aware of sparse directory entries.
+
+6. The sparse-index is disabled at this point when using the split-index
+ feature, so no effort is made to protect the split-index API.
+
+Even after inserting these guards, we will keep expanding sparse-indexes
+for most Git commands using the `command_requires_full_index` repository
+setting. This setting will be on by default and disabled one builtin at a
+time until we have sufficient confidence that all of the index operations
+are properly guarded.
+
+To complete this phase, the commands `git status` and `git add` will be
+integrated with the sparse-index so that they operate with O(Populated)
+performance. They will be carefully tested for operations within and
+outside the sparse-checkout definition.
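+
+One way to sanity-check the O(Populated) claim, without relying on any
+particular trace region names, is to time the command with tracing
+enabled; a rough sketch:
+
+    $ GIT_TRACE2_PERF=/tmp/status.perf git status
+    # then inspect /tmp/status.perf for the time spent reading and
+    # writing the index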
+
+Phase II: Careful integrations
+------------------------------
+
+This phase focuses on ensuring that all index extensions and APIs work
+well with a sparse-index. This requires significant increases to our test
+coverage, especially for operations that interact with the working
+directory outside of the sparse-checkout definition. Some of these
+behaviors may not be the desirable ones, such as some tests already
+marked for failure in `t1092-sparse-checkout-compatibility.sh`.
+
+The index extensions that may require special integrations are:
+
+* FS Monitor
+* Untracked cache
+
+While integrating with these features, we should look for patterns that
+might lead to better APIs for interacting with the index. Coalescing
+common usage patterns into an API call can reduce the number of places
+where sparse-directories need to be handled carefully.
+
+Phase III: Important command speedups
+-------------------------------------
+
+At this point, the patterns for testing and implementing sparse-directory
+logic should be relatively stable. This phase focuses on updating some of
+the most common builtins that use the index to operate as O(Populated).
+Here is a potential list of commands that could be valuable to integrate
+at this point:
+
+* `git commit`
+* `git checkout`
+* `git merge`
+* `git rebase`
+
+Hopefully, commands such as `git merge` and `git rebase` can benefit
+instead from merge algorithms that do not use the index as a data
+structure, such as the merge-ORT strategy. As these topics mature, we
+may enable the ORT strategy by default for repositories using the
+sparse-index feature.
+
+Along with `git status` and `git add`, these commands cover the majority
+of users' interactions with the working directory. In addition, we can
+integrate with these commands:
+
+* `git grep`
+* `git rm`
+
+These have been proposed as commands whose behavior could change when in a
+repo with a sparse-checkout definition. It would be good to include this
+behavior automatically when using a sparse-index. Some clarity is needed
+to make the behavior switch clear to the user.
+
+This phase is the first where parallel work might be possible without too
+many conflicts between topics.
+
+Phase IV: The long tail
+-----------------------
+
+This last phase is less a "phase" and more "the new normal" after all of
+the previous work.
+
+To start, the `command_requires_full_index` option could be removed in
+favor of expanding only when hitting an API guard.
+
+There are many Git commands that could use special attention to operate as
+O(Populated), while some might be so rare that it is acceptable to leave
+them with additional overhead when a sparse-index is present.
+
+Here are some commands that might be useful to update:
+
+* `git sparse-checkout set`
+* `git am`
+* `git clean`
+* `git stash`
diff --git a/Documentation/technical/trivial-merge.txt b/Documentation/technical/trivial-merge.txt
new file mode 100644
index 0000000..1f1c33d
--- /dev/null
+++ b/Documentation/technical/trivial-merge.txt
@@ -0,0 +1,121 @@
+Trivial merge rules
+===================
+
+This document describes the outcomes of the trivial merge logic in read-tree.
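+
+The merge mode used by `git read-tree -m` is selected by the number of
+tree arguments given; a brief sketch of the invocations (tree names are
+placeholders):
+
+    $ git read-tree -m <newtree>                    # one-way merge
+    $ git read-tree -m <oldtree> <newtree>          # two-way merge
+    $ git read-tree -m <ancestor> <head> <remote>   # three-way merge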
+
+One-way merge
+-------------
+
+This replaces the index with a different tree, keeping the stat info
+for entries that don't change, and allowing -u to make the minimum
+required changes to the working tree to have it match.
+
+Entries marked '+' have stat information. Spaces marked '*' don't
+affect the result.
+
+ index   tree      result
+ -------------------------
+ *       (empty)   (empty)
+ (empty) tree      tree
+ index+  tree      tree
+ index+  index     index+
+
+Two-way merge
+-------------
+
+It is permitted for the index to lack an entry; this does not prevent
+any case from applying.
+
+If the index exists, it is an error for it not to match either the old
+or the result.
+
+If multiple cases apply, the one used is listed first.
+
+A result which changes the index is an error if the index is not empty
+and not up to date.
+
+Entries marked '+' have stat information. Spaces marked '*' don't
+affect the result.
+
+ case  index   old       new       result
+ -----------------------------------------
+ 0/2   (empty) *         (empty)   (empty)
+ 1/3   (empty) *         new       new
+ 4/5   index+  (empty)   (empty)   index+
+ 6/7   index+  (empty)   index     index+
+ 10    index+  index     (empty)   (empty)
+ 14/15 index+  old       old       index+
+ 18/19 index+  old       index     index+
+ 20    index+  index     new       new
+
+Three-way merge
+---------------
+
+It is permitted for the index to lack an entry; this does not prevent
+any case from applying.
+
+If the index exists, it is an error for it not to match either the
+head or (if the merge is trivial) the result.
+
+If multiple cases apply, the one used is listed first.
+
+A result of "no merge" means that index is left in stage 0, ancest in
+stage 1, head in stage 2, and remote in stage 3 (if any of these are
+empty, no entry is left for that stage). Otherwise, the given entry is
+left in stage 0, and there are no other entries.
+
+A result of "no merge" is an error if the index is not empty and not
+up to date.
+
+*empty* means that the tree must not have a directory-file conflict
+ with the entry.
+
+For multiple ancestors, a '+' means that this case applies even if
+only one ancestor or remote fits; a '^' means all of the ancestors
+must be the same.
+
+ case  ancest     head     remote    result
+ -------------------------------------------
+ 1     (empty)+   (empty)  (empty)   (empty)
+ 2ALT  (empty)+   *empty*  remote    remote
+ 2     (empty)^   (empty)  remote    no merge
+ 3ALT  (empty)+   head     *empty*   head
+ 3     (empty)^   head     (empty)   no merge
+ 4     (empty)^   head     remote    no merge
+ 5ALT  *          head     head      head
+ 6     ancest+    (empty)  (empty)   no merge
+ 8     ancest^    (empty)  ancest    no merge
+ 7     ancest+    (empty)  remote    no merge
+ 10    ancest^    ancest   (empty)   no merge
+ 9     ancest+    head     (empty)   no merge
+ 16    anc1/anc2  anc1     anc2      no merge
+ 13    ancest+    head     ancest    head
+ 14    ancest+    ancest   remote    remote
+ 11    ancest+    head     remote    no merge
+
+Only #2ALT and #3ALT use *empty*, because these are the only cases
+where there can be conflicts that didn't exist before. Note that we
+allow directory-file conflicts between things in different stages
+after the trivial merge.
+
+A possible alternative for #6 is (empty), which would make it like
+#1. This is not used, due to the likelihood that it arises from moving
+the file to multiple different locations or from moving and deleting
+it in different branches.
+
+Case #1 is included for completeness, and also in case we decide to
+put on '+' markings; any path that is never mentioned at all isn't
+handled.
+
+Note that #16 is when both #13 and #14 apply; in this case, we refuse
+the trivial merge, because we can't tell from this data which is
+right. This is a case of a reverted patch (in some direction, maybe
+multiple times), and the right answer depends on looking at crossings
+of history or common ancestors of the ancestors.
+
+Note that, between #6, #7, #9, and #11, all cases not otherwise
+covered are handled in this table.
+
+For #8 and #10, there is alternative behavior, not currently
+implemented, where the result is (empty). As currently implemented,
+the automatic merge will generally give this effect.