summaryrefslogtreecommitdiffstats
path: root/Documentation/technical/sparse-checkout.txt
diff options
context:
space:
mode:
Diffstat (limited to '')
-rw-r--r--Documentation/technical/sparse-checkout.txt1103
1 files changed, 1103 insertions, 0 deletions
diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
new file mode 100644
index 0000000..fa0d01c
--- /dev/null
+++ b/Documentation/technical/sparse-checkout.txt
@@ -0,0 +1,1103 @@
+Table of contents:
+
+ * Terminology
+ * Purpose of sparse-checkouts
+ * Usecases of primary concern
+ * Oversimplified mental models ("Cliff Notes" for this document!)
+ * Desired behavior
+ * Behavior classes
+ * Subcommand-dependent defaults
+ * Sparse specification vs. sparsity patterns
+ * Implementation Questions
+ * Implementation Goals/Plans
+ * Known bugs
+ * Reference Emails
+
+
+=== Terminology ===
+
+cone mode: one of two modes for specifying the desired subset of files
+ in a sparse-checkout. In cone-mode, the user specifies
+ directories (getting both everything under that directory as
+ well as everything in leading directories), while in non-cone
+ mode, the user specifies gitignore-style patterns. Controlled
+ by the --[no-]cone option to sparse-checkout init|set.
+
+SKIP_WORKTREE: When tracked files do not match the sparse specification and
+ are removed from the working tree, the file in the index is marked
+ with a SKIP_WORKTREE bit. Note that if a tracked file has the
+ SKIP_WORKTREE bit set but the file is later written by the user to
+ the working tree anyway, the SKIP_WORKTREE bit will be cleared at
+ the beginning of any subsequent Git operation.
+
+ Most sparse checkout users are unaware of this implementation
+ detail, and the term should generally be avoided in user-facing
+ descriptions and command flags. Unfortunately, prior to the
+ `sparse-checkout` subcommand this low-level detail was exposed,
+ and as of time of writing, is still exposed in various places.
+
+sparse-checkout: a subcommand in git used to reduce the files present in
+ the working tree to a subset of all tracked files. Also, the
+ name of the file in the $GIT_DIR/info directory used to track
+ the sparsity patterns corresponding to the user's desired
+ subset.
+
+sparse cone: see cone mode
+
+sparse directory: An entry in the index corresponding to a directory, which
+ appears in the index instead of all the files under that directory
+ that would normally appear. See also sparse-index. Something that
+ can cause confusion is that the "sparse directory" does NOT match
+ the sparse specification, i.e. the directory is NOT present in the
+ working tree. May be renamed in the future (e.g. to "skipped
+ directory").
+
+sparse index: A special mode for sparse-checkout that also makes the
+ index sparse by recording a directory entry in lieu of all the
+ files underneath that directory (thus making that a "skipped
+ directory" which unfortunately has also been called a "sparse
+ directory"), and does this for potentially multiple
+ directories. Controlled by the --[no-]sparse-index option to
+ init|set|reapply.
+
+sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
+ define the set of files of interest. A warning: It is easy to
+ over-use this term (or the shortened "patterns" term), for two
+ reasons: (1) users in cone mode specify directories rather than
+ patterns (their directories are transformed into patterns, but
+ users may think you are talking about non-cone mode if you use the
+ word "patterns"), and (b) the sparse specification might
+ transiently differ in the working tree or index from the sparsity
+ patterns (see "Sparse specification vs. sparsity patterns").
+
+sparse specification: The set of paths in the user's area of focus. This
+ is typically just the tracked files that match the sparsity
+ patterns, but the sparse specification can temporarily differ and
+ include additional files. (See also "Sparse specification
+ vs. sparsity patterns")
+
+ * When working with history, the sparse specification is exactly
+ the set of files matching the sparsity patterns.
+ * When interacting with the working tree, the sparse specification
+ is the set of tracked files with a clear SKIP_WORKTREE bit or
+ tracked files present in the working copy.
+ * When modifying or showing results from the index, the sparse
+ specification is the set of files with a clear SKIP_WORKTREE bit
+ or that differ in the index from HEAD.
+ * If working with the index and the working copy, the sparse
+ specification is the union of the paths from above.
+
+vivifying: When a command restores a tracked file to the working tree (and
+ hopefully also clears the SKIP_WORKTREE bit in the index for that
+ file), this is referred to as "vivifying" the file.
+
+
+=== Purpose of sparse-checkouts ===
+
+sparse-checkouts exist to allow users to work with a subset of their
+files.
+
+You can think of sparse-checkouts as subdividing "tracked" files into two
+categories -- a sparse subset, and all the rest. Implementationally, we
+mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them
+out of the working tree. The SKIP_WORKTREE files are still tracked, just
+not present in the working tree.
+
+In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
+is missing from the working tree but pretend the file contents match HEAD".
+That was not only bogus (it actually meant the file missing from the
+working tree matched the index rather than HEAD), but it was also a
+low-level detail which only provided decent behavior for a few commands.
+There were a surprising number of ways in which that guiding principle gave
+command results that violated user expectations, and as such was a bad
+mental model. However, it persisted for many years and may still be found
+in some corners of the code base.
+
+Anyway, the idea of "working with a subset of files" is simple enough, but
+there are multiple different high-level usecases which affect how some Git
+subcommands should behave. Further, even if we only considered one of
+those usecases, sparse-checkouts can modify different subcommands in over a
+half dozen different ways. Let's start by considering the high level
+usecases:
+
+ A) Users are _only_ interested in the sparse portion of the repo
+
+ A*) Users are _only_ interested in the sparse portion of the repo
+ that they have downloaded so far
+
+ B) Users want a sparse working tree, but are working in a larger whole
+
+ C) sparse-checkout is a behind-the-scenes implementation detail allowing
+ Git to work with a specially crafted in-house virtual file system;
+ users are actually working with a "full" working tree that is
+ lazily populated, and sparse-checkout helps with the lazy population
+ piece.
+
+It may be worth explaining each of these in a bit more detail:
+
+
+ (Behavior A) Users are _only_ interested in the sparse portion of the repo
+
+These folks might know there are other things in the repository, but
+don't care. They are uninterested in other parts of the repository, and
+only want to know about changes within their area of interest. Showing
+them other files from history (e.g. from diff/log/grep/etc.) is a
+usability annoyance, potentially a huge one since other changes in
+history may dwarf the changes they are interested in.
+
+Some of these users also arrive at this usecase from wanting to use partial
+clones together with sparse checkouts (in a way where they have downloaded
+blobs within the sparse specification) and do disconnected development.
+Not only do these users generally not care about other parts of the
+repository, but consider it a blocker for Git commands to try to operate on
+those. If commands attempt to access paths in history outside the sparsity
+specification, then the partial clone will attempt to download additional
+blobs on demand, fail, and then fail the user's command. (This may be
+unavoidable in some cases, e.g. when `git merge` has non-trivial changes to
+reconcile outside the sparse specification, but we should limit how often
+users are forced to connect to the network.)
+
+Also, even for users using partial clones that do not mind being
+always connected to the network, the need to download blobs as
+side-effects of various other commands (such as the printed diffstat
+after a merge or pull) can lead to worries about local repository size
+growing unnecessarily[10].
+
+ (Behavior A*) Users are _only_ interested in the sparse portion of the repo
+ that they have downloaded so far (a variant on the first usecase)
+
+This variant is driven by folks who using partial clones together with
+sparse checkouts and do disconnected development (so far sounding like a
+subset of behavior A users) and doing so on very large repositories. The
+reason for yet another variant is that downloading even just the blobs
+through history within their sparse specification may be too much, so they
+only download some. They would still like operations to succeed without
+network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
+or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide
+partial results that depend on what happens to have been downloaded.
+
+This variant could be viewed as Behavior A with the sparse specification
+for history querying operations modified from "sparsity patterns" to
+"sparsity patterns limited to the blobs we have already downloaded".
+
+ (Behavior B) Users want a sparse working tree, but are working in a
+ larger whole
+
+Stolee described this usecase this way[11]:
+
+"I'm also focused on users that know that they are a part of a larger
+whole. They know they are operating on a large repository but focus on
+what they need to contribute their part. I expect multiple "roles" to
+use very different, almost disjoint parts of the codebase. Some other
+"architect" users operate across the entire tree or hop between different
+sections of the codebase as necessary. In this situation, I'm wary of
+scoping too many features to the sparse-checkout definition, especially
+"git log," as it can be too confusing to have their view of the codebase
+depend on your "point of view."
+
+People might also end up wanting behavior B due to complex inter-project
+dependencies. The initial attempts to use sparse-checkouts usually involve
+the directories you are directly interested in plus what those directories
+depend upon within your repository. But there's a monkey wrench here: if
+you have integration tests, they invert the hierarchy: to run integration
+tests, you need not only what you are interested in and its in-tree
+dependencies, you also need everything that depends upon what you are
+interested in or that depends upon one of your dependencies...AND you need
+all the in-tree dependencies of that expanded group. That can easily
+change your sparse-checkout into a nearly dense one.
+
+Naturally, that tends to kill the benefits of sparse-checkouts. There are
+a couple solutions to this conundrum: either avoid grabbing in-repo
+dependencies (maybe have built versions of your in-repo dependencies pulled
+from a CI cache somewhere), or say that users shouldn't run integration
+tests directly and instead do it on the CI server when they submit a code
+review. Or do both. Regardless of whether you stub out your in-repo
+dependencies or stub out the things that depend upon you, there is
+certainly a reason to want to query and be aware of those other stubbed-out
+parts of the repository, particularly when the dependencies are complex or
+change relatively frequently. Thus, for such uses, sparse-checkouts can be
+used to limit what you directly build and modify, but these users do not
+necessarily want their sparse checkout paths to limit their queries of
+versions in history.
+
+Some people may also be interested in behavior B over behavior A simply as
+a performance workaround: if they are using non-cone mode, then they have
+to deal with its inherent quadratic performance problems. In that mode,
+every operation that checks whether paths match the sparsity specification
+can be expensive. As such, these users may only be willing to pay for
+those expensive checks when interacting with the working copy, and may
+prefer getting "unrelated" results from their history queries over having
+slow commands.
+
+ (Behavior C) sparse-checkout is an implementational detail supporting a
+ special VFS.
+
+This usecase goes slightly against the traditional definition of
+sparse-checkout in that it actually tries to present a full or dense
+checkout to the user. However, this usecase utilizes the same underlying
+technical underpinnings in a new way which does provide some performance
+advantages to users. The basic idea is that a company can have an in-house
+Git-aware Virtual File System which pretends all files are present in the
+working tree, by intercepting all file system accesses and using those to
+fetch and write accessed files on demand via partial clones. The VFS uses
+sparse-checkout to prevent Git from writing or paying attention to many
+files, and manually updates the sparse checkout patterns itself based on
+user access and modification of files in the working tree. See commit
+ecc7c8841d ("repo_read_index: add config to expect files outside sparse
+patterns", 2022-02-25) and the link at [17] for a more detailed description
+of such a VFS.
+
+The biggest difference here is that users are completely unaware that the
+sparse-checkout machinery is even in use. The sparse patterns are not
+specified by the user but rather are under the complete control of the VFS
+(and the patterns are updated frequently and dynamically by it). The user
+will perceive the checkout as dense, and commands should thus behave as if
+all files are present.
+
+
+=== Usecases of primary concern ===
+
+Most of the rest of this document will focus on Behavior A and Behavior
+B. Some notes about the other two cases and why we are not focusing on
+them:
+
+ (Behavior A*)
+
+Supporting this usecase is estimated to be difficult and a lot of work.
+There are no plans to implement it currently, but it may be a potential
+future alternative. Knowing about the existence of additional alternatives
+may affect our choice of command line flags (e.g. if we need tri-state or
+quad-state flags rather than just binary flags), so it was still important
+to at least note.
+
+Further, I believe the descriptions below for Behavior A are probably still
+valid for this usecase, with the only exception being that it redefines the
+sparse specification to restrict it to already-downloaded blobs. The hard
+part is in making commands capable of respecting that modified definition.
+
+ (Behavior C)
+
+This usecase violates some of the early sparse-checkout documented
+assumptions (since files marked as SKIP_WORKTREE will be displayed to users
+as present in the working tree). That violation may mean various
+sparse-checkout related behaviors are not well suited to this usecase and
+we may need tweaks -- to both documentation and code -- to handle it.
+However, this usecase is also perhaps the simplest model to support in that
+everything behaves like a dense checkout with a few exceptions (e.g. branch
+checkouts and switches write fewer things, knowing the VFS will lazily
+write the rest on an as-needed basis).
+
+Since there is no publically available VFS-related code for folks to try,
+the number of folks who can test such a usecase is limited.
+
+The primary reason to note the Behavior C usecase is that as we fix things
+to better support Behaviors A and B, there may be additional places where
+we need to make tweaks allowing folks in this usecase to get the original
+non-sparse treatment. For an example, see ecc7c8841d ("repo_read_index:
+add config to expect files outside sparse patterns", 2022-02-25). The
+secondary reason to note Behavior C, is so that folks taking advantage of
+Behavior C do not assume they are part of the Behavior B camp and propose
+patches that break things for the real Behavior B folks.
+
+
+=== Oversimplified mental models ===
+
+An oversimplification of the differences in the above behaviors is:
+
+ Behavior A: Restrict worktree and history operations to sparse specification
+ Behavior B: Restrict worktree operations to sparse specification; have any
+ history operations work across all files
+ Behavior C: Do not restrict either worktree or history operations to the
+ sparse specification...with the exception of branch checkouts or
+ switches which avoid writing files that will match the index so
+ they can later lazily be populated instead.
+
+
+=== Desired behavior ===
+
+As noted previously, despite the simple idea of just working with a subset
+of files, there are a range of different behavioral changes that need to be
+made to different subcommands to work well with such a feature. See
+[1,2,3,4,5,6,7,8,9,10] for various examples. In particular, at [2], we saw
+that mere composition of other commands that individually worked correctly
+in a sparse-checkout context did not imply that the higher level command
+would work correctly; it sometimes requires further tweaks. So,
+understanding these differences can be beneficial.
+
+* Commands behaving the same regardless of high-level use-case
+
+ * commands that only look at files within the sparsity specification
+
+ * diff (without --cached or REVISION arguments)
+ * grep (without --cached or REVISION arguments)
+ * diff-files
+
+ * commands that restore files to the working tree that match sparsity
+ patterns, and remove unmodified files that don't match those
+ patterns:
+
+ * switch
+ * checkout (the switch-like half)
+ * read-tree
+ * reset --hard
+
+ * commands that write conflicted files to the working tree, but otherwise
+ will omit writing files to the working tree that do not match the
+ sparsity patterns:
+
+ * merge
+ * rebase
+ * cherry-pick
+ * revert
+
+ * `am` and `apply --cached` should probably be in this section but
+ are buggy (see the "Known bugs" section below)
+
+ The behavior for these commands somewhat depends upon the merge
+ strategy being used:
+ * `ort` behaves as described above
+ * `recursive` tries to not vivify files unnecessarily, but does sometimes
+ vivify files without conflicts.
+ * `octopus` and `resolve` will always vivify any file changed in the merge
+ relative to the first parent, which is rather suboptimal.
+
+ It is also important to note that these commands WILL update the index
+ outside the sparse specification relative to when the operation began,
+ BUT these commands often make a commit just before or after such that
+ by the end of the operation there is no change to the index outside the
+ sparse specification. Of course, if the operation hits conflicts or
+ does not make a commit, then these operations clearly can modify the
+ index outside the sparse specification.
+
+ Finally, it is important to note that at least the first four of these
+ commands also try to remove differences between the sparse
+ specification and the sparsity patterns (much like the commands in the
+ previous section).
+
+ * commands that always ignore sparsity since commits must be full-tree
+
+ * archive
+ * bundle
+ * commit
+ * format-patch
+ * fast-export
+ * fast-import
+ * commit-tree
+
+ * commands that write any modified file to the working tree (conflicted
+ or not, and whether those paths match sparsity patterns or not):
+
+ * stash
+ * apply (without `--index` or `--cached`)
+
+* Commands that may slightly differ for behavior A vs. behavior B:
+
+ Commands in this category behave mostly the same between the two
+ behaviors, but may differ in verbosity and types of warning and error
+ messages.
+
+ * commands that make modifications to which files are tracked:
+ * add
+ * rm
+ * mv
+ * update-index
+
+ The fact that files can move between the 'tracked' and 'untracked'
+ categories means some commands will have to treat untracked files
+ differently. But if we have to treat untracked files differently,
+ then additional commands may also need changes:
+
+ * status
+ * clean
+
+ In particular, `status` may need to report any untracked files outside
+ the sparsity specification as an erroneous condition (especially to
+ avoid the user trying to `git add` them, forcing `git add` to display
+ an error).
+
+ It's not clear to me exactly how (or even if) `clean` would change,
+ but it's the other command that also affects untracked files.
+
+ `update-index` may be slightly special. Its --[no-]skip-worktree flag
+ may need to ignore the sparse specification by its nature. Also, its
+ current --[no-]ignore-skip-worktree-entries default is totally bogus.
+
+ * commands for manually tweaking paths in both the index and the working tree
+ * `restore`
+ * the restore-like half of `checkout`
+
+ These commands should be similar to add/rm/mv in that they should
+ only operate on the sparse specification by default, and require a
+ special flag to operate on all files.
+
+ Also, note that these commands currently have a number of issues (see
+ the "Known bugs" section below)
+
+* Commands that significantly differ for behavior A vs. behavior B:
+
+ * commands that query history
+ * diff (with --cached or REVISION arguments)
+ * grep (with --cached or REVISION arguments)
+ * show (when given commit arguments)
+ * blame (only matters when one or more -C flags are passed)
+ * and annotate
+ * log
+ * whatchanged
+ * ls-files
+ * diff-index
+ * diff-tree
+ * ls-tree
+
+ Note: for log and whatchanged, revision walking logic is unaffected
+ but displaying of patches is affected by scoping the command to the
+ sparse-checkout. (The fact that revision walking is unaffected is
+ why rev-list, shortlog, show-branch, and bisect are not in this
+ list.)
+
+ ls-files may be slightly special in that e.g. `git ls-files -t` is
+ often used to see what is sparse and what is not. Perhaps -t should
+ always work on the full tree?
+
+* Commands I don't know how to classify
+
+ * range-diff
+
+ Is this like `log` or `format-patch`?
+
+ * cherry
+
+ See range-diff
+
+* Commands unaffected by sparse-checkouts
+
+ * shortlog
+ * show-branch
+ * rev-list
+ * bisect
+
+ * branch
+ * describe
+ * fetch
+ * gc
+ * init
+ * maintenance
+ * notes
+ * pull (merge & rebase have the necessary changes)
+ * push
+ * submodule
+ * tag
+
+ * config
+ * filter-branch (works in separate checkout without sparse-checkout setup)
+ * pack-refs
+ * prune
+ * remote
+ * repack
+ * replace
+
+ * bugreport
+ * count-objects
+ * fsck
+ * gitweb
+ * help
+ * instaweb
+ * merge-tree (doesn't touch worktree or index, and merges always compute full-tree)
+ * rerere
+ * verify-commit
+ * verify-tag
+
+ * commit-graph
+ * hash-object
+ * index-pack
+ * mktag
+ * mktree
+ * multi-pack-index
+ * pack-objects
+ * prune-packed
+ * symbolic-ref
+ * unpack-objects
+ * update-ref
+ * write-tree (operates on index, possibly optimized to use sparse dir entries)
+
+ * for-each-ref
+ * get-tar-commit-id
+ * ls-remote
+ * merge-base (merges are computed full tree, so merge base should be too)
+ * name-rev
+ * pack-redundant
+ * rev-parse
+ * show-index
+ * show-ref
+ * unpack-file
+ * var
+ * verify-pack
+
+ * <Everything under 'Interacting with Others' in 'git help --all'>
+ * <Everything under 'Low-level...Syncing' in 'git help --all'>
+ * <Everything under 'Low-level...Internal Helpers' in 'git help --all'>
+ * <Everything under 'External commands' in 'git help --all'>
+
+* Commands that might be affected, but who cares?
+
+ * merge-file
+ * merge-index
+ * gitk?
+
+
+=== Behavior classes ===
+
+From the above there are a few classes of behavior:
+
+ * "restrict"
+
+ Commands in this class only read or write files in the working tree
+ within the sparse specification.
+
+ When moving to a new commit (e.g. switch, reset --hard), these commands
+ may update index files outside the sparse specification as of the start
+ of the operation, but by the end of the operation those index files
+ will match HEAD again and thus those files will again be outside the
+ sparse specification.
+
+ When paths are explicitly specified, these paths are intersected with
+ the sparse specification and will only operate on such paths.
+ (e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`)
+
+ Some of these commands may also attempt, at the end of their operation,
+ to cull transient differences between the sparse specification and the
+ sparsity patterns (see "Sparse specification vs. sparsity patterns" for
+ details, but this basically means either removing unmodified files not
+ matching the sparsity patterns and marking those files as
+ SKIP_WORKTREE, or vivifying files that match the sparsity patterns and
+ marking those files as !SKIP_WORKTREE).
+
+ * "restrict modulo conflicts"
+
+ Commands in this class generally behave like the "restrict" class,
+ except that:
+ (1) they will ignore the sparse specification and write files with
+ conflicts to the working tree (thus temporarily expanding the
+ sparse specification to include such files.)
+ (2) they are grouped with commands which move to a new commit, since
+ they often create a commit and then move to it, even though we
+ know there are many exceptions to moving to the new commit. (For
+ example, the user may rebase a commit that becomes empty, or have
+ a cherry-pick which conflicts, or a user could run `merge
+ --no-commit`, and we also view `apply --index` kind of like `am
+ --no-commit`.) As such, these commands can make changes to index
+ files outside the sparse specification, though they'll mark such
+ files with SKIP_WORKTREE.
+
+ * "restrict also specially applied to untracked files"
+
+ Commands in this class generally behave like the "restrict" class,
+ except that they have to handle untracked files differently too, often
+ because these commands are dealing with files changing state between
+ 'tracked' and 'untracked'. Often, this may mean printing an error
+ message if the command had nothing to do, but the arguments may have
+ referred to files whose tracked-ness state could have changed were it
+ not for the sparsity patterns excluding them.
+
+ * "no restrict"
+
+ Commands in this class ignore the sparse specification entirely.
+
+ * "restrict or no restrict dependent upon behavior A vs. behavior B"
+
+ Commands in this class behave like "no restrict" for folks in the
+ behavior B camp, and like "restrict" for folks in the behavior A camp.
+ However, when behaving like "restrict" a warning of some sort might be
+ provided that history queries have been limited by the sparse-checkout
+ specification.
+
+
+=== Subcommand-dependent defaults ===
+
+Note that we have different defaults depending on the command for the
+desired behavior :
+
+ * Commands defaulting to "restrict":
+ * diff-files
+ * diff (without --cached or REVISION arguments)
+ * grep (without --cached or REVISION arguments)
+ * switch
+ * checkout (the switch-like half)
+ * reset (<commit>)
+
+ * restore
+ * checkout (the restore-like half)
+ * checkout-index
+ * reset (with pathspec)
+
+ This behavior makes sense; these interact with the working tree.
+
+ * Commands defaulting to "restrict modulo conflicts":
+ * merge
+ * rebase
+ * cherry-pick
+ * revert
+
+ * am
+ * apply --index (which is kind of like an `am --no-commit`)
+
+ * read-tree (especially with -m or -u; is kind of like a --no-commit merge)
+ * reset (<tree-ish>, due to similarity to read-tree)
+
+ These also interact with the working tree, but require slightly
+ different behavior either so that (a) conflicts can be resolved or (b)
+ because they are kind of like a merge-without-commit operation.
+
+ (See also the "Known bugs" section below regarding `am` and `apply`)
+
+ * Commands defaulting to "no restrict":
+ * archive
+ * bundle
+ * commit
+ * format-patch
+ * fast-export
+ * fast-import
+ * commit-tree
+
+ * stash
+ * apply (without `--index`)
+
+ These have completely different defaults and perhaps deserve the most
+ detailed explanation:
+
+ In the case of commands in the first group (format-patch,
+ fast-export, bundle, archive, etc.), these are commands for
+ communicating history, which will be broken if they restrict to a
+ subset of the repository. As such, they operate on full paths and
+ have no `--restrict` option for overriding. Some of these commands may
+ take paths for manually restricting what is exported, but it needs to
+ be very explicit.
+
+ In the case of stash, it needs to vivify files to avoid losing the
+ user's changes.
+
+ In the case of apply without `--index`, that command needs to update
+ the working tree without the index (or the index without the working
+ tree if `--cached` is passed), and if we restrict those updates to the
+ sparse specification then we'll lose changes from the user.
+
+ * Commands defaulting to "restrict also specially applied to untracked files":
+ * add
+ * rm
+ * mv
+ * update-index
+ * status
+ * clean (?)
+
+ Our original implementation for the first three of these commands was
+ "no restrict", but it had some severe usability issues:
+ * `git add <somefile>` if honored and outside the sparse
+ specification, can result in the file randomly disappearing later
+ when some subsequent command is run (since various commands
+ automatically clean up unmodified files outside the sparse
+ specification).
+ * `git rm '*.jpg'` could very negatively surprise users if it deletes
+ files outside the range of the user's interest.
+ * `git mv` has similar surprises when moving into or out of the cone,
+ so best to restrict by default
+
+ So, we switched `add` and `rm` to default to "restrict", which made
+ usability problems much less severe and less frequent, but we still got
+ complaints because commands like:
+ git add <file-outside-sparse-specification>
+ git rm <file-outside-sparse-specification>
+ would silently do nothing. We should instead print an error in those
+ cases to get usability right.
+
+ update-index needs to be updated to match, and status and maybe clean
+ also need to be updated to specially handle untracked paths.
+
+ There may be a difference in here between behavior A and behavior B in
+ terms of verboseness of errors or additional warnings.
+
+ * Commands falling under "restrict or no restrict dependent upon behavior
+ A vs. behavior B"
+
+ * diff (with --cached or REVISION arguments)
+ * grep (with --cached or REVISION arguments)
+ * show (when given commit arguments)
+ * blame (only matters when one or more -C flags passed)
+ * and annotate
+ * log
+ * and variants: shortlog, gitk, show-branch, whatchanged, rev-list
+ * ls-files
+ * diff-index
+ * diff-tree
+ * ls-tree
+
+ For now, we default to behavior B for these, which want a default of
+ "no restrict".
+
+ Note that two of these commands -- diff and grep -- also appeared in a
+ different list with a default of "restrict", but only when limited to
+ searching the working tree. The working tree vs. history distinction
+ is fundamental in how behavior B operates, so this is expected. Note,
+ though, that for diff and grep with --cached, when doing "restrict"
+ behavior, the difference between sparse specification and sparsity
+ patterns is important to handle.
+
+ "restrict" may make more sense as the long term default for these[12].
+ Also, supporting "restrict" for these commands might be a fair amount
+ of work to implement, meaning it might be implemented over multiple
+ releases. If that behavior were the default in the commands that
+ supported it, that would force behavior B users to need to learn to
+ slowly add additional flags to their commands, depending on git
+ version, to get the behavior they want. That gradual switchover would
+ be painful, so we should avoid it at least until it's fully
+ implemented.
+
+
+=== Sparse specification vs. sparsity patterns ===
+
+In a well-behaved situation, the sparse specification is given directly
+by the $GIT_DIR/info/sparse-checkout file. However, it can transiently
+diverge for a few reasons:
+
+ * needing to resolve conflicts (merging will vivify conflicted files)
+ * running Git commands that implicitly vivify files (e.g. "git stash apply")
+ * running Git commands that explicitly vivify files (e.g. "git checkout
+ --ignore-skip-worktree-bits FILENAME")
+ * other commands that write to these files (perhaps a user copies it
+ from elsewhere)
+
+For the last item, note that we do automatically clear the SKIP_WORKTREE
+bit for files that are present in the working tree. This has been true
+since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
+2022-03-09)
+
+However, such a situation is transient because:
+
+ * Such transient differences can and will be automatically removed as
+ a side-effect of commands which call unpack_trees() (checkout,
+ merge, reset, etc.).
+ * Users can also request such transient differences be corrected via
+ running `git sparse-checkout reapply`. Various places recommend
+ running that command.
+ * Additional commands are also welcome to implicitly fix these
+ differences; we may add more in the future.
+
+While we avoid dropping unstaged changes or files which have conflicts,
+we otherwise aggressively try to fix these transient differences. If
+users want these differences to persist, they should run the `set` or
+`add` subcommands of `git sparse-checkout` to reflect their intended
+sparse specification.
+
+However, when we need to do a query on history restricted to the
+"relevant subset of files" such a transiently expanded sparse
+specification is ignored. There are a couple reasons for this:
+
+ * The behavior wanted when doing something like
+ git grep expression REVISION
+ is roughly what the users would expect from
+ git checkout REVISION && git grep expression
+ (modulo a "REVISION:" prefix), which has a couple ramifications:
+
+ * REVISION may have paths not in the current index, so there is no
+ path we can consult for a SKIP_WORKTREE setting for those paths.
+
+ * Since `checkout` is one of those commands that tries to remove
+ transient differences in the sparse specification, it makes sense
+ to use the corrected sparse specification
+ (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
+ consult SKIP_WORKTREE anyway.
+
+So, a transiently expanded (or restricted) sparse specification applies to
+the working tree, but not to history queries where we always use the
+sparsity patterns. (See [16] for an early discussion of this.)
+
+Similar to a transiently expanded sparse specification of the working tree
+based on additional files being present in the working tree, we also need
+to consider additional files being modified in the index. In particular,
+if the user has staged changes to files (relative to HEAD) that do not
+match the sparsity patterns, and the file is not present in the working
+tree, we still want to consider the file part of the sparse specification
+if we are specifically performing a query related to the index (e.g. git
+diff --cached [REVISION], git diff-index [REVISION], git restore --staged
+--source=REVISION -- PATHS, etc.) Note that a transiently expanded sparse
+specification for the index usually only matters under behavior A, since
+under behavior B index operations are lumped with history and tend to
+operate full-tree.
+
+
+=== Implementation Questions ===
+
+ * Do the options --scope={sparse,all} sound good to others? Are there better
+ options?
+ * Names in use, or appearing in patches, or previously suggested:
+ * --sparse/--dense
+ * --ignore-skip-worktree-bits
+ * --ignore-skip-worktree-entries
+ * --ignore-sparsity
+ * --[no-]restrict-to-sparse-paths
+ * --full-tree/--sparse-tree
+ * --[no-]restrict
+ * --scope={sparse,all}
+ * --focus/--unfocus
+ * --limit/--unlimited
+ * Rationale making me lean slightly towards --scope={sparse,all}:
+ * We want a name that works for many commands, so we need a name that
+ does not conflict
+ * We know that we have more than two possible usecases, so it is best
+ to avoid a flag that appears to be binary.
+ * --scope={sparse,all} isn't overly long and seems relatively
+ explanatory
+ * `--sparse`, as used in add/rm/mv, is totally backwards for
+ grep/log/etc. Changing the meaning of `--sparse` for these
+ commands would fix the backwardness, but possibly break existing
+ scripts. Using a new name pairing would allow us to treat
+ `--sparse` in these commands as a deprecated alias.
+ * There is a different `--sparse`/`--dense` pair for commands using
+ revision machinery, so using that naming might cause confusion
+ * There is also a `--sparse` in both pack-objects and show-branch, which
+ don't conflict but do suggest that `--sparse` is overloaded
+ * The name --ignore-skip-worktree-bits is a double negative, is
+ quite a mouthful, refers to an implementation detail that many
+ users may not be familiar with, and we'd need a negation for it
+ which would probably be even more ridiculously long. (But we
+ can make --ignore-skip-worktree-bits a deprecated alias for
+ --no-restrict.)
+
+ * If a config option is added (sparse.scope?) what should the values and
+ description be? "sparse" (behavior A), "worktree-sparse-history-dense"
+ (behavior B), "dense" (behavior C)? There's a risk of confusion,
+ because even for Behaviors A and B we want some commands to be
+ full-tree and others to operate sparsely, so the wording may need to be
+ more tied to the usecases and somehow explain that. Also, right now,
+ the primary difference we are focusing is just the history-querying
+ commands (log/diff/grep). Previous config suggestion here: [13]
+
+ * Is `--no-expand` a good alias for ls-files's `--sparse` option?
+ (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
+ because in non-cone mode it does nothing and in cone-mode it shows the
+ sparse directory entries which are technically outside the sparse
+ specification)
+
+ * Under Behavior A:
+ * Does ls-files' `--no-expand` override the default `--scope=all`, or
+ does it need an extra flag?
+ * Does ls-files' `-t` option imply `--scope=all`?
+ * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?
+
+ * sparse-checkout: once behavior A is fully implemented, should we take
+ an interim measure to ease people into switching the default? Namely,
+ if folks are not already in a sparse checkout, then require
+ `sparse-checkout init/set` to take a
+ `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
+ would set sparse.scope according to the setting given), and throw an
+ error if the flag is not provided? That error would be a great place
+ to warn folks that the default may change in the future, and get them
+ used to specifying what they want so that the eventual default switch
+ is seamless for them.
+
+
+=== Implementation Goals/Plans ===
+
+ * Get buy-in on this document in general.
+
+ * Figure out answers to the 'Implementation Questions' sections (above)
+
+ * Fix bugs in the 'Known bugs' section (below)
+
+ * Provide some kind of method for backfilling the blobs within the sparse
+ specification in a partial clone
+
+ [Below here is kind of spitballing since the first two haven't been resolved]
+
+ * update-index: flip the default to --no-ignore-skip-worktree-entries,
+ nuke this stupid "Oh, there's a bug? Let me add a flag to let users
+ request that they not trigger this bug." flag
+
+ * Flags & Config
+ * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
+ * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
+ a deprecated aliases for `--scope=all`
+ * Create config option (sparse.scope?), tie it to the "Cliff notes"
+ overview
+
+ * Add --scope=sparse (and --scope=all) flag to each of the history querying
+ commands. IMPORTANT: make sure diff machinery changes don't mess with
+ format-patch, fast-export, etc.
+
+=== Known bugs ===
+
+This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've
+been working on it.
+
+0. Behavior A is not well supported in Git. (Behavior B didn't used to
+ be either, but was the easier of the two to implement.)
+
+1. am and apply:
+
+ apply, without `--index` or `--cached`, relies on files being present
+ in the working copy, and also writes to them unconditionally. As
+ such, it should first check for the files' presence, and if found to
+ be SKIP_WORKTREE, then clear the bit and vivify the paths, then do
+ its work. Currently, it just throws an error.
+
+ apply, with either `--cached` or `--index`, will not preserve the
+ SKIP_WORKTREE bit. This is fine if the file has conflicts, but
+ otherwise SKIP_WORKTREE bits should be preserved for --cached and
+ probably also for --index.
+
+ am, if there are no conflicts, will vivify files and fail to preserve
+ the SKIP_WORKTREE bit. If there are conflicts and `-3` is not
+ specified, it will vivify files and then complain the patch doesn't
+ apply. If there are conflicts and `-3` is specified, it will vivify
+ files and then complain that those vivified files would be
+ overwritten by merge.
+
+2. reset --hard:
+
+ reset --hard provides confusing error message (works correctly, but
+ misleads the user into believing it didn't):
+
+ $ touch addme
+ $ git add addme
+ $ git ls-files -t
+ H addme
+ H tracked
+ S tracked-but-maybe-skipped
+ $ git reset --hard # usually works great
+ error: Path 'addme' not uptodate; will not remove from working tree.
+ HEAD is now at bdbbb6f third
+ $ git ls-files -t
+ H tracked
+ S tracked-but-maybe-skipped
+ $ ls -1
+ tracked
+
+ `git reset --hard` DID remove addme from the index and the working tree, contrary
+ to the error message, but in line with how reset --hard should behave.
+
+3. read-tree
+
+ `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the
+ entries it reads into the index, resulting in all your files suddenly
+ appearing to be "deleted".
+
+4. Checkout, restore:
+
+ These command do not handle path & revision arguments appropriately:
+
+ $ ls
+ tracked
+ $ git ls-files -t
+ H tracked
+ S tracked-but-maybe-skipped
+ $ git status --porcelain
+ $ git checkout -- '*skipped'
+ error: pathspec '*skipped' did not match any file(s) known to git
+ $ git ls-files -- '*skipped'
+ tracked-but-maybe-skipped
+ $ git checkout HEAD -- '*skipped'
+ error: pathspec '*skipped' did not match any file(s) known to git
+ $ git ls-tree HEAD | grep skipped
+ 100644 blob 276f5a64354b791b13840f02047738c77ad0584f tracked-but-maybe-skipped
+ $ git status --porcelain
+ $ git checkout HEAD~1 -- '*skipped'
+ $ git ls-files -t
+ H tracked
+ H tracked-but-maybe-skipped
+ $ git status --porcelain
+ M tracked-but-maybe-skipped
+ $ git checkout HEAD -- '*skipped'
+ $ git status --porcelain
+ $
+
+ Note that checkout without a revision (or restore --staged) fails to
+ find a file to restore from the index, even though ls-files shows
+ such a file certainly exists.
+
+ Similar issues occur with HEAD (--source=HEAD in restore's case),
+ but suddenly works when HEAD~1 is specified. And then after that it
+ will work with HEAD specified, even though it didn't before.
+
+ Directories are also an issue:
+
+ $ git sparse-checkout set nomatches
+ $ git status
+ On branch main
+ You are in a sparse checkout with 0% of tracked files present.
+
+ nothing to commit, working tree clean
+ $ git checkout .
+ error: pathspec '.' did not match any file(s) known to git
+ $ git checkout HEAD~1 .
+ Updated 1 path from 58916d9
+ $ git ls-files -t
+ S tracked
+ H tracked-but-maybe-skipped
+
+5. checkout and restore --staged, continued:
+
+ These commands do not correctly scope operations to the sparse
+ specification, and make it worse by not setting important SKIP_WORKTREE
+ bits:
+
+ $ git restore --source OLDREV --staged outside-sparse-cone/
+ $ git status --porcelain
+ MD outside-sparse-cone/file1
+ MD outside-sparse-cone/file2
+ MD outside-sparse-cone/file3
+
+ We can add a --scope=all mode to `git restore` to let it operate outside
+ the sparse specification, but then it will be important to set the
+ SKIP_WORKTREE bits appropriately.
+
+6. Performance issues; see:
+ https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/
+
+
+=== Reference Emails ===
+
+Emails that detail various bugs we've had in sparse-checkout:
+
+[1] (Original descriptions of behavior A & behavior B)
+ https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
+[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences)
+ https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/
+[3] (Present-despite-skipped entries)
+ https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/
+[4] (Clone --no-checkout interaction)
+ https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/ (clone --no-checkout)
+[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`)
+ https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/
+[6] (SKIP_WORKTREE is advisory, not mandatory)
+ https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/
+[7] (`worktree add` should copy sparsity settings from current worktree)
+ https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/
+[8] (Avoid negative surprises in add, rm, and mv)
+ https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/
+ https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/
+[9] (Move from out-of-cone to in-cone)
+ https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
+ https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/
+[10] (Unnecessarily downloading objects outside sparse specification)
+ https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/
+
+[11] (Stolee's comments on high-level usecases)
+ https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
+
+[12] Others commenting on eventually switching default to behavior A:
+ * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
+ * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
+ * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/
+
+[13] Previous config name suggestion and description
+ * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/
+
+[14] Tangential issue: switch to cone mode as default sparse specification mechanism:
+ https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/
+
+[15] Lengthy email on grep behavior, covering what should be searched:
+ * https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/
+
+[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations,
+ search for the parenthetical comment starting "We do not check".
+ https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/
+
+[17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/