diff options
Diffstat (limited to 'Documentation/technical/sparse-checkout.txt')
-rw-r--r-- | Documentation/technical/sparse-checkout.txt | 1103 |
1 files changed, 1103 insertions, 0 deletions
diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt new file mode 100644 index 0000000..fa0d01c --- /dev/null +++ b/Documentation/technical/sparse-checkout.txt @@ -0,0 +1,1103 @@ +Table of contents: + + * Terminology + * Purpose of sparse-checkouts + * Usecases of primary concern + * Oversimplified mental models ("Cliff Notes" for this document!) + * Desired behavior + * Behavior classes + * Subcommand-dependent defaults + * Sparse specification vs. sparsity patterns + * Implementation Questions + * Implementation Goals/Plans + * Known bugs + * Reference Emails + + +=== Terminology === + +cone mode: one of two modes for specifying the desired subset of files + in a sparse-checkout. In cone-mode, the user specifies + directories (getting both everything under that directory as + well as everything in leading directories), while in non-cone + mode, the user specifies gitignore-style patterns. Controlled + by the --[no-]cone option to sparse-checkout init|set. + +SKIP_WORKTREE: When tracked files do not match the sparse specification and + are removed from the working tree, the file in the index is marked + with a SKIP_WORKTREE bit. Note that if a tracked file has the + SKIP_WORKTREE bit set but the file is later written by the user to + the working tree anyway, the SKIP_WORKTREE bit will be cleared at + the beginning of any subsequent Git operation. + + Most sparse checkout users are unaware of this implementation + detail, and the term should generally be avoided in user-facing + descriptions and command flags. Unfortunately, prior to the + `sparse-checkout` subcommand this low-level detail was exposed, + and as of time of writing, is still exposed in various places. + +sparse-checkout: a subcommand in git used to reduce the files present in + the working tree to a subset of all tracked files. Also, the + name of the file in the $GIT_DIR/info directory used to track + the sparsity patterns corresponding to the user's desired + subset. + +sparse cone: see cone mode + +sparse directory: An entry in the index corresponding to a directory, which + appears in the index instead of all the files under that directory + that would normally appear. See also sparse-index. Something that + can cause confusion is that the "sparse directory" does NOT match + the sparse specification, i.e. the directory is NOT present in the + working tree. May be renamed in the future (e.g. to "skipped + directory"). + +sparse index: A special mode for sparse-checkout that also makes the + index sparse by recording a directory entry in lieu of all the + files underneath that directory (thus making that a "skipped + directory" which unfortunately has also been called a "sparse + directory"), and does this for potentially multiple + directories. Controlled by the --[no-]sparse-index option to + init|set|reapply. + +sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to + define the set of files of interest. A warning: It is easy to + over-use this term (or the shortened "patterns" term), for two + reasons: (1) users in cone mode specify directories rather than + patterns (their directories are transformed into patterns, but + users may think you are talking about non-cone mode if you use the + word "patterns"), and (b) the sparse specification might + transiently differ in the working tree or index from the sparsity + patterns (see "Sparse specification vs. sparsity patterns"). + +sparse specification: The set of paths in the user's area of focus. This + is typically just the tracked files that match the sparsity + patterns, but the sparse specification can temporarily differ and + include additional files. (See also "Sparse specification + vs. sparsity patterns") + + * When working with history, the sparse specification is exactly + the set of files matching the sparsity patterns. + * When interacting with the working tree, the sparse specification + is the set of tracked files with a clear SKIP_WORKTREE bit or + tracked files present in the working copy. + * When modifying or showing results from the index, the sparse + specification is the set of files with a clear SKIP_WORKTREE bit + or that differ in the index from HEAD. + * If working with the index and the working copy, the sparse + specification is the union of the paths from above. + +vivifying: When a command restores a tracked file to the working tree (and + hopefully also clears the SKIP_WORKTREE bit in the index for that + file), this is referred to as "vivifying" the file. + + +=== Purpose of sparse-checkouts === + +sparse-checkouts exist to allow users to work with a subset of their +files. + +You can think of sparse-checkouts as subdividing "tracked" files into two +categories -- a sparse subset, and all the rest. Implementationally, we +mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them +out of the working tree. The SKIP_WORKTREE files are still tracked, just +not present in the working tree. + +In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file +is missing from the working tree but pretend the file contents match HEAD". +That was not only bogus (it actually meant the file missing from the +working tree matched the index rather than HEAD), but it was also a +low-level detail which only provided decent behavior for a few commands. +There were a surprising number of ways in which that guiding principle gave +command results that violated user expectations, and as such was a bad +mental model. However, it persisted for many years and may still be found +in some corners of the code base. + +Anyway, the idea of "working with a subset of files" is simple enough, but +there are multiple different high-level usecases which affect how some Git +subcommands should behave. Further, even if we only considered one of +those usecases, sparse-checkouts can modify different subcommands in over a +half dozen different ways. Let's start by considering the high level +usecases: + + A) Users are _only_ interested in the sparse portion of the repo + + A*) Users are _only_ interested in the sparse portion of the repo + that they have downloaded so far + + B) Users want a sparse working tree, but are working in a larger whole + + C) sparse-checkout is a behind-the-scenes implementation detail allowing + Git to work with a specially crafted in-house virtual file system; + users are actually working with a "full" working tree that is + lazily populated, and sparse-checkout helps with the lazy population + piece. + +It may be worth explaining each of these in a bit more detail: + + + (Behavior A) Users are _only_ interested in the sparse portion of the repo + +These folks might know there are other things in the repository, but +don't care. They are uninterested in other parts of the repository, and +only want to know about changes within their area of interest. Showing +them other files from history (e.g. from diff/log/grep/etc.) is a +usability annoyance, potentially a huge one since other changes in +history may dwarf the changes they are interested in. + +Some of these users also arrive at this usecase from wanting to use partial +clones together with sparse checkouts (in a way where they have downloaded +blobs within the sparse specification) and do disconnected development. +Not only do these users generally not care about other parts of the +repository, but consider it a blocker for Git commands to try to operate on +those. If commands attempt to access paths in history outside the sparsity +specification, then the partial clone will attempt to download additional +blobs on demand, fail, and then fail the user's command. (This may be +unavoidable in some cases, e.g. when `git merge` has non-trivial changes to +reconcile outside the sparse specification, but we should limit how often +users are forced to connect to the network.) + +Also, even for users using partial clones that do not mind being +always connected to the network, the need to download blobs as +side-effects of various other commands (such as the printed diffstat +after a merge or pull) can lead to worries about local repository size +growing unnecessarily[10]. + + (Behavior A*) Users are _only_ interested in the sparse portion of the repo + that they have downloaded so far (a variant on the first usecase) + +This variant is driven by folks who using partial clones together with +sparse checkouts and do disconnected development (so far sounding like a +subset of behavior A users) and doing so on very large repositories. The +reason for yet another variant is that downloading even just the blobs +through history within their sparse specification may be too much, so they +only download some. They would still like operations to succeed without +network connectivity, though, so things like `git log -S${SEARCH_TERM} -p` +or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide +partial results that depend on what happens to have been downloaded. + +This variant could be viewed as Behavior A with the sparse specification +for history querying operations modified from "sparsity patterns" to +"sparsity patterns limited to the blobs we have already downloaded". + + (Behavior B) Users want a sparse working tree, but are working in a + larger whole + +Stolee described this usecase this way[11]: + +"I'm also focused on users that know that they are a part of a larger +whole. They know they are operating on a large repository but focus on +what they need to contribute their part. I expect multiple "roles" to +use very different, almost disjoint parts of the codebase. Some other +"architect" users operate across the entire tree or hop between different +sections of the codebase as necessary. In this situation, I'm wary of +scoping too many features to the sparse-checkout definition, especially +"git log," as it can be too confusing to have their view of the codebase +depend on your "point of view." + +People might also end up wanting behavior B due to complex inter-project +dependencies. The initial attempts to use sparse-checkouts usually involve +the directories you are directly interested in plus what those directories +depend upon within your repository. But there's a monkey wrench here: if +you have integration tests, they invert the hierarchy: to run integration +tests, you need not only what you are interested in and its in-tree +dependencies, you also need everything that depends upon what you are +interested in or that depends upon one of your dependencies...AND you need +all the in-tree dependencies of that expanded group. That can easily +change your sparse-checkout into a nearly dense one. + +Naturally, that tends to kill the benefits of sparse-checkouts. There are +a couple solutions to this conundrum: either avoid grabbing in-repo +dependencies (maybe have built versions of your in-repo dependencies pulled +from a CI cache somewhere), or say that users shouldn't run integration +tests directly and instead do it on the CI server when they submit a code +review. Or do both. Regardless of whether you stub out your in-repo +dependencies or stub out the things that depend upon you, there is +certainly a reason to want to query and be aware of those other stubbed-out +parts of the repository, particularly when the dependencies are complex or +change relatively frequently. Thus, for such uses, sparse-checkouts can be +used to limit what you directly build and modify, but these users do not +necessarily want their sparse checkout paths to limit their queries of +versions in history. + +Some people may also be interested in behavior B over behavior A simply as +a performance workaround: if they are using non-cone mode, then they have +to deal with its inherent quadratic performance problems. In that mode, +every operation that checks whether paths match the sparsity specification +can be expensive. As such, these users may only be willing to pay for +those expensive checks when interacting with the working copy, and may +prefer getting "unrelated" results from their history queries over having +slow commands. + + (Behavior C) sparse-checkout is an implementational detail supporting a + special VFS. + +This usecase goes slightly against the traditional definition of +sparse-checkout in that it actually tries to present a full or dense +checkout to the user. However, this usecase utilizes the same underlying +technical underpinnings in a new way which does provide some performance +advantages to users. The basic idea is that a company can have an in-house +Git-aware Virtual File System which pretends all files are present in the +working tree, by intercepting all file system accesses and using those to +fetch and write accessed files on demand via partial clones. The VFS uses +sparse-checkout to prevent Git from writing or paying attention to many +files, and manually updates the sparse checkout patterns itself based on +user access and modification of files in the working tree. See commit +ecc7c8841d ("repo_read_index: add config to expect files outside sparse +patterns", 2022-02-25) and the link at [17] for a more detailed description +of such a VFS. + +The biggest difference here is that users are completely unaware that the +sparse-checkout machinery is even in use. The sparse patterns are not +specified by the user but rather are under the complete control of the VFS +(and the patterns are updated frequently and dynamically by it). The user +will perceive the checkout as dense, and commands should thus behave as if +all files are present. + + +=== Usecases of primary concern === + +Most of the rest of this document will focus on Behavior A and Behavior +B. Some notes about the other two cases and why we are not focusing on +them: + + (Behavior A*) + +Supporting this usecase is estimated to be difficult and a lot of work. +There are no plans to implement it currently, but it may be a potential +future alternative. Knowing about the existence of additional alternatives +may affect our choice of command line flags (e.g. if we need tri-state or +quad-state flags rather than just binary flags), so it was still important +to at least note. + +Further, I believe the descriptions below for Behavior A are probably still +valid for this usecase, with the only exception being that it redefines the +sparse specification to restrict it to already-downloaded blobs. The hard +part is in making commands capable of respecting that modified definition. + + (Behavior C) + +This usecase violates some of the early sparse-checkout documented +assumptions (since files marked as SKIP_WORKTREE will be displayed to users +as present in the working tree). That violation may mean various +sparse-checkout related behaviors are not well suited to this usecase and +we may need tweaks -- to both documentation and code -- to handle it. +However, this usecase is also perhaps the simplest model to support in that +everything behaves like a dense checkout with a few exceptions (e.g. branch +checkouts and switches write fewer things, knowing the VFS will lazily +write the rest on an as-needed basis). + +Since there is no publically available VFS-related code for folks to try, +the number of folks who can test such a usecase is limited. + +The primary reason to note the Behavior C usecase is that as we fix things +to better support Behaviors A and B, there may be additional places where +we need to make tweaks allowing folks in this usecase to get the original +non-sparse treatment. For an example, see ecc7c8841d ("repo_read_index: +add config to expect files outside sparse patterns", 2022-02-25). The +secondary reason to note Behavior C, is so that folks taking advantage of +Behavior C do not assume they are part of the Behavior B camp and propose +patches that break things for the real Behavior B folks. + + +=== Oversimplified mental models === + +An oversimplification of the differences in the above behaviors is: + + Behavior A: Restrict worktree and history operations to sparse specification + Behavior B: Restrict worktree operations to sparse specification; have any + history operations work across all files + Behavior C: Do not restrict either worktree or history operations to the + sparse specification...with the exception of branch checkouts or + switches which avoid writing files that will match the index so + they can later lazily be populated instead. + + +=== Desired behavior === + +As noted previously, despite the simple idea of just working with a subset +of files, there are a range of different behavioral changes that need to be +made to different subcommands to work well with such a feature. See +[1,2,3,4,5,6,7,8,9,10] for various examples. In particular, at [2], we saw +that mere composition of other commands that individually worked correctly +in a sparse-checkout context did not imply that the higher level command +would work correctly; it sometimes requires further tweaks. So, +understanding these differences can be beneficial. + +* Commands behaving the same regardless of high-level use-case + + * commands that only look at files within the sparsity specification + + * diff (without --cached or REVISION arguments) + * grep (without --cached or REVISION arguments) + * diff-files + + * commands that restore files to the working tree that match sparsity + patterns, and remove unmodified files that don't match those + patterns: + + * switch + * checkout (the switch-like half) + * read-tree + * reset --hard + + * commands that write conflicted files to the working tree, but otherwise + will omit writing files to the working tree that do not match the + sparsity patterns: + + * merge + * rebase + * cherry-pick + * revert + + * `am` and `apply --cached` should probably be in this section but + are buggy (see the "Known bugs" section below) + + The behavior for these commands somewhat depends upon the merge + strategy being used: + * `ort` behaves as described above + * `recursive` tries to not vivify files unnecessarily, but does sometimes + vivify files without conflicts. + * `octopus` and `resolve` will always vivify any file changed in the merge + relative to the first parent, which is rather suboptimal. + + It is also important to note that these commands WILL update the index + outside the sparse specification relative to when the operation began, + BUT these commands often make a commit just before or after such that + by the end of the operation there is no change to the index outside the + sparse specification. Of course, if the operation hits conflicts or + does not make a commit, then these operations clearly can modify the + index outside the sparse specification. + + Finally, it is important to note that at least the first four of these + commands also try to remove differences between the sparse + specification and the sparsity patterns (much like the commands in the + previous section). + + * commands that always ignore sparsity since commits must be full-tree + + * archive + * bundle + * commit + * format-patch + * fast-export + * fast-import + * commit-tree + + * commands that write any modified file to the working tree (conflicted + or not, and whether those paths match sparsity patterns or not): + + * stash + * apply (without `--index` or `--cached`) + +* Commands that may slightly differ for behavior A vs. behavior B: + + Commands in this category behave mostly the same between the two + behaviors, but may differ in verbosity and types of warning and error + messages. + + * commands that make modifications to which files are tracked: + * add + * rm + * mv + * update-index + + The fact that files can move between the 'tracked' and 'untracked' + categories means some commands will have to treat untracked files + differently. But if we have to treat untracked files differently, + then additional commands may also need changes: + + * status + * clean + + In particular, `status` may need to report any untracked files outside + the sparsity specification as an erroneous condition (especially to + avoid the user trying to `git add` them, forcing `git add` to display + an error). + + It's not clear to me exactly how (or even if) `clean` would change, + but it's the other command that also affects untracked files. + + `update-index` may be slightly special. Its --[no-]skip-worktree flag + may need to ignore the sparse specification by its nature. Also, its + current --[no-]ignore-skip-worktree-entries default is totally bogus. + + * commands for manually tweaking paths in both the index and the working tree + * `restore` + * the restore-like half of `checkout` + + These commands should be similar to add/rm/mv in that they should + only operate on the sparse specification by default, and require a + special flag to operate on all files. + + Also, note that these commands currently have a number of issues (see + the "Known bugs" section below) + +* Commands that significantly differ for behavior A vs. behavior B: + + * commands that query history + * diff (with --cached or REVISION arguments) + * grep (with --cached or REVISION arguments) + * show (when given commit arguments) + * blame (only matters when one or more -C flags are passed) + * and annotate + * log + * whatchanged + * ls-files + * diff-index + * diff-tree + * ls-tree + + Note: for log and whatchanged, revision walking logic is unaffected + but displaying of patches is affected by scoping the command to the + sparse-checkout. (The fact that revision walking is unaffected is + why rev-list, shortlog, show-branch, and bisect are not in this + list.) + + ls-files may be slightly special in that e.g. `git ls-files -t` is + often used to see what is sparse and what is not. Perhaps -t should + always work on the full tree? + +* Commands I don't know how to classify + + * range-diff + + Is this like `log` or `format-patch`? + + * cherry + + See range-diff + +* Commands unaffected by sparse-checkouts + + * shortlog + * show-branch + * rev-list + * bisect + + * branch + * describe + * fetch + * gc + * init + * maintenance + * notes + * pull (merge & rebase have the necessary changes) + * push + * submodule + * tag + + * config + * filter-branch (works in separate checkout without sparse-checkout setup) + * pack-refs + * prune + * remote + * repack + * replace + + * bugreport + * count-objects + * fsck + * gitweb + * help + * instaweb + * merge-tree (doesn't touch worktree or index, and merges always compute full-tree) + * rerere + * verify-commit + * verify-tag + + * commit-graph + * hash-object + * index-pack + * mktag + * mktree + * multi-pack-index + * pack-objects + * prune-packed + * symbolic-ref + * unpack-objects + * update-ref + * write-tree (operates on index, possibly optimized to use sparse dir entries) + + * for-each-ref + * get-tar-commit-id + * ls-remote + * merge-base (merges are computed full tree, so merge base should be too) + * name-rev + * pack-redundant + * rev-parse + * show-index + * show-ref + * unpack-file + * var + * verify-pack + + * <Everything under 'Interacting with Others' in 'git help --all'> + * <Everything under 'Low-level...Syncing' in 'git help --all'> + * <Everything under 'Low-level...Internal Helpers' in 'git help --all'> + * <Everything under 'External commands' in 'git help --all'> + +* Commands that might be affected, but who cares? + + * merge-file + * merge-index + * gitk? + + +=== Behavior classes === + +From the above there are a few classes of behavior: + + * "restrict" + + Commands in this class only read or write files in the working tree + within the sparse specification. + + When moving to a new commit (e.g. switch, reset --hard), these commands + may update index files outside the sparse specification as of the start + of the operation, but by the end of the operation those index files + will match HEAD again and thus those files will again be outside the + sparse specification. + + When paths are explicitly specified, these paths are intersected with + the sparse specification and will only operate on such paths. + (e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`) + + Some of these commands may also attempt, at the end of their operation, + to cull transient differences between the sparse specification and the + sparsity patterns (see "Sparse specification vs. sparsity patterns" for + details, but this basically means either removing unmodified files not + matching the sparsity patterns and marking those files as + SKIP_WORKTREE, or vivifying files that match the sparsity patterns and + marking those files as !SKIP_WORKTREE). + + * "restrict modulo conflicts" + + Commands in this class generally behave like the "restrict" class, + except that: + (1) they will ignore the sparse specification and write files with + conflicts to the working tree (thus temporarily expanding the + sparse specification to include such files.) + (2) they are grouped with commands which move to a new commit, since + they often create a commit and then move to it, even though we + know there are many exceptions to moving to the new commit. (For + example, the user may rebase a commit that becomes empty, or have + a cherry-pick which conflicts, or a user could run `merge + --no-commit`, and we also view `apply --index` kind of like `am + --no-commit`.) As such, these commands can make changes to index + files outside the sparse specification, though they'll mark such + files with SKIP_WORKTREE. + + * "restrict also specially applied to untracked files" + + Commands in this class generally behave like the "restrict" class, + except that they have to handle untracked files differently too, often + because these commands are dealing with files changing state between + 'tracked' and 'untracked'. Often, this may mean printing an error + message if the command had nothing to do, but the arguments may have + referred to files whose tracked-ness state could have changed were it + not for the sparsity patterns excluding them. + + * "no restrict" + + Commands in this class ignore the sparse specification entirely. + + * "restrict or no restrict dependent upon behavior A vs. behavior B" + + Commands in this class behave like "no restrict" for folks in the + behavior B camp, and like "restrict" for folks in the behavior A camp. + However, when behaving like "restrict" a warning of some sort might be + provided that history queries have been limited by the sparse-checkout + specification. + + +=== Subcommand-dependent defaults === + +Note that we have different defaults depending on the command for the +desired behavior : + + * Commands defaulting to "restrict": + * diff-files + * diff (without --cached or REVISION arguments) + * grep (without --cached or REVISION arguments) + * switch + * checkout (the switch-like half) + * reset (<commit>) + + * restore + * checkout (the restore-like half) + * checkout-index + * reset (with pathspec) + + This behavior makes sense; these interact with the working tree. + + * Commands defaulting to "restrict modulo conflicts": + * merge + * rebase + * cherry-pick + * revert + + * am + * apply --index (which is kind of like an `am --no-commit`) + + * read-tree (especially with -m or -u; is kind of like a --no-commit merge) + * reset (<tree-ish>, due to similarity to read-tree) + + These also interact with the working tree, but require slightly + different behavior either so that (a) conflicts can be resolved or (b) + because they are kind of like a merge-without-commit operation. + + (See also the "Known bugs" section below regarding `am` and `apply`) + + * Commands defaulting to "no restrict": + * archive + * bundle + * commit + * format-patch + * fast-export + * fast-import + * commit-tree + + * stash + * apply (without `--index`) + + These have completely different defaults and perhaps deserve the most + detailed explanation: + + In the case of commands in the first group (format-patch, + fast-export, bundle, archive, etc.), these are commands for + communicating history, which will be broken if they restrict to a + subset of the repository. As such, they operate on full paths and + have no `--restrict` option for overriding. Some of these commands may + take paths for manually restricting what is exported, but it needs to + be very explicit. + + In the case of stash, it needs to vivify files to avoid losing the + user's changes. + + In the case of apply without `--index`, that command needs to update + the working tree without the index (or the index without the working + tree if `--cached` is passed), and if we restrict those updates to the + sparse specification then we'll lose changes from the user. + + * Commands defaulting to "restrict also specially applied to untracked files": + * add + * rm + * mv + * update-index + * status + * clean (?) + + Our original implementation for the first three of these commands was + "no restrict", but it had some severe usability issues: + * `git add <somefile>` if honored and outside the sparse + specification, can result in the file randomly disappearing later + when some subsequent command is run (since various commands + automatically clean up unmodified files outside the sparse + specification). + * `git rm '*.jpg'` could very negatively surprise users if it deletes + files outside the range of the user's interest. + * `git mv` has similar surprises when moving into or out of the cone, + so best to restrict by default + + So, we switched `add` and `rm` to default to "restrict", which made + usability problems much less severe and less frequent, but we still got + complaints because commands like: + git add <file-outside-sparse-specification> + git rm <file-outside-sparse-specification> + would silently do nothing. We should instead print an error in those + cases to get usability right. + + update-index needs to be updated to match, and status and maybe clean + also need to be updated to specially handle untracked paths. + + There may be a difference in here between behavior A and behavior B in + terms of verboseness of errors or additional warnings. + + * Commands falling under "restrict or no restrict dependent upon behavior + A vs. behavior B" + + * diff (with --cached or REVISION arguments) + * grep (with --cached or REVISION arguments) + * show (when given commit arguments) + * blame (only matters when one or more -C flags passed) + * and annotate + * log + * and variants: shortlog, gitk, show-branch, whatchanged, rev-list + * ls-files + * diff-index + * diff-tree + * ls-tree + + For now, we default to behavior B for these, which want a default of + "no restrict". + + Note that two of these commands -- diff and grep -- also appeared in a + different list with a default of "restrict", but only when limited to + searching the working tree. The working tree vs. history distinction + is fundamental in how behavior B operates, so this is expected. Note, + though, that for diff and grep with --cached, when doing "restrict" + behavior, the difference between sparse specification and sparsity + patterns is important to handle. + + "restrict" may make more sense as the long term default for these[12]. + Also, supporting "restrict" for these commands might be a fair amount + of work to implement, meaning it might be implemented over multiple + releases. If that behavior were the default in the commands that + supported it, that would force behavior B users to need to learn to + slowly add additional flags to their commands, depending on git + version, to get the behavior they want. That gradual switchover would + be painful, so we should avoid it at least until it's fully + implemented. + + +=== Sparse specification vs. sparsity patterns === + +In a well-behaved situation, the sparse specification is given directly +by the $GIT_DIR/info/sparse-checkout file. However, it can transiently +diverge for a few reasons: + + * needing to resolve conflicts (merging will vivify conflicted files) + * running Git commands that implicitly vivify files (e.g. "git stash apply") + * running Git commands that explicitly vivify files (e.g. "git checkout + --ignore-skip-worktree-bits FILENAME") + * other commands that write to these files (perhaps a user copies it + from elsewhere) + +For the last item, note that we do automatically clear the SKIP_WORKTREE +bit for files that are present in the working tree. This has been true +since 82386b4496 ("Merge branch 'en/present-despite-skipped'", +2022-03-09) + +However, such a situation is transient because: + + * Such transient differences can and will be automatically removed as + a side-effect of commands which call unpack_trees() (checkout, + merge, reset, etc.). + * Users can also request such transient differences be corrected via + running `git sparse-checkout reapply`. Various places recommend + running that command. + * Additional commands are also welcome to implicitly fix these + differences; we may add more in the future. + +While we avoid dropping unstaged changes or files which have conflicts, +we otherwise aggressively try to fix these transient differences. If +users want these differences to persist, they should run the `set` or +`add` subcommands of `git sparse-checkout` to reflect their intended +sparse specification. + +However, when we need to do a query on history restricted to the +"relevant subset of files" such a transiently expanded sparse +specification is ignored. There are a couple reasons for this: + + * The behavior wanted when doing something like + git grep expression REVISION + is roughly what the users would expect from + git checkout REVISION && git grep expression + (modulo a "REVISION:" prefix), which has a couple ramifications: + + * REVISION may have paths not in the current index, so there is no + path we can consult for a SKIP_WORKTREE setting for those paths. + + * Since `checkout` is one of those commands that tries to remove + transient differences in the sparse specification, it makes sense + to use the corrected sparse specification + (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to + consult SKIP_WORKTREE anyway. + +So, a transiently expanded (or restricted) sparse specification applies to +the working tree, but not to history queries where we always use the +sparsity patterns. (See [16] for an early discussion of this.) + +Similar to a transiently expanded sparse specification of the working tree +based on additional files being present in the working tree, we also need +to consider additional files being modified in the index. In particular, +if the user has staged changes to files (relative to HEAD) that do not +match the sparsity patterns, and the file is not present in the working +tree, we still want to consider the file part of the sparse specification +if we are specifically performing a query related to the index (e.g. git +diff --cached [REVISION], git diff-index [REVISION], git restore --staged +--source=REVISION -- PATHS, etc.) Note that a transiently expanded sparse +specification for the index usually only matters under behavior A, since +under behavior B index operations are lumped with history and tend to +operate full-tree. + + +=== Implementation Questions === + + * Do the options --scope={sparse,all} sound good to others? Are there better + options? + * Names in use, or appearing in patches, or previously suggested: + * --sparse/--dense + * --ignore-skip-worktree-bits + * --ignore-skip-worktree-entries + * --ignore-sparsity + * --[no-]restrict-to-sparse-paths + * --full-tree/--sparse-tree + * --[no-]restrict + * --scope={sparse,all} + * --focus/--unfocus + * --limit/--unlimited + * Rationale making me lean slightly towards --scope={sparse,all}: + * We want a name that works for many commands, so we need a name that + does not conflict + * We know that we have more than two possible usecases, so it is best + to avoid a flag that appears to be binary. + * --scope={sparse,all} isn't overly long and seems relatively + explanatory + * `--sparse`, as used in add/rm/mv, is totally backwards for + grep/log/etc. Changing the meaning of `--sparse` for these + commands would fix the backwardness, but possibly break existing + scripts. Using a new name pairing would allow us to treat + `--sparse` in these commands as a deprecated alias. + * There is a different `--sparse`/`--dense` pair for commands using + revision machinery, so using that naming might cause confusion + * There is also a `--sparse` in both pack-objects and show-branch, which + don't conflict but do suggest that `--sparse` is overloaded + * The name --ignore-skip-worktree-bits is a double negative, is + quite a mouthful, refers to an implementation detail that many + users may not be familiar with, and we'd need a negation for it + which would probably be even more ridiculously long. (But we + can make --ignore-skip-worktree-bits a deprecated alias for + --no-restrict.) + + * If a config option is added (sparse.scope?) what should the values and + description be? "sparse" (behavior A), "worktree-sparse-history-dense" + (behavior B), "dense" (behavior C)? There's a risk of confusion, + because even for Behaviors A and B we want some commands to be + full-tree and others to operate sparsely, so the wording may need to be + more tied to the usecases and somehow explain that. Also, right now, + the primary difference we are focusing is just the history-querying + commands (log/diff/grep). Previous config suggestion here: [13] + + * Is `--no-expand` a good alias for ls-files's `--sparse` option? + (`--sparse` does not map to either `--scope=sparse` or `--scope=all`, + because in non-cone mode it does nothing and in cone-mode it shows the + sparse directory entries which are technically outside the sparse + specification) + + * Under Behavior A: + * Does ls-files' `--no-expand` override the default `--scope=all`, or + does it need an extra flag? + * Does ls-files' `-t` option imply `--scope=all`? + * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`? + + * sparse-checkout: once behavior A is fully implemented, should we take + an interim measure to ease people into switching the default? Namely, + if folks are not already in a sparse checkout, then require + `sparse-checkout init/set` to take a + `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which + would set sparse.scope according to the setting given), and throw an + error if the flag is not provided? That error would be a great place + to warn folks that the default may change in the future, and get them + used to specifying what they want so that the eventual default switch + is seamless for them. + + +=== Implementation Goals/Plans === + + * Get buy-in on this document in general. + + * Figure out answers to the 'Implementation Questions' sections (above) + + * Fix bugs in the 'Known bugs' section (below) + + * Provide some kind of method for backfilling the blobs within the sparse + specification in a partial clone + + [Below here is kind of spitballing since the first two haven't been resolved] + + * update-index: flip the default to --no-ignore-skip-worktree-entries, + nuke this stupid "Oh, there's a bug? Let me add a flag to let users + request that they not trigger this bug." flag + + * Flags & Config + * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all` + * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore + a deprecated aliases for `--scope=all` + * Create config option (sparse.scope?), tie it to the "Cliff notes" + overview + + * Add --scope=sparse (and --scope=all) flag to each of the history querying + commands. IMPORTANT: make sure diff machinery changes don't mess with + format-patch, fast-export, etc. + +=== Known bugs === + +This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've +been working on it. + +0. Behavior A is not well supported in Git. (Behavior B didn't used to + be either, but was the easier of the two to implement.) + +1. am and apply: + + apply, without `--index` or `--cached`, relies on files being present + in the working copy, and also writes to them unconditionally. As + such, it should first check for the files' presence, and if found to + be SKIP_WORKTREE, then clear the bit and vivify the paths, then do + its work. Currently, it just throws an error. + + apply, with either `--cached` or `--index`, will not preserve the + SKIP_WORKTREE bit. This is fine if the file has conflicts, but + otherwise SKIP_WORKTREE bits should be preserved for --cached and + probably also for --index. + + am, if there are no conflicts, will vivify files and fail to preserve + the SKIP_WORKTREE bit. If there are conflicts and `-3` is not + specified, it will vivify files and then complain the patch doesn't + apply. If there are conflicts and `-3` is specified, it will vivify + files and then complain that those vivified files would be + overwritten by merge. + +2. reset --hard: + + reset --hard provides confusing error message (works correctly, but + misleads the user into believing it didn't): + + $ touch addme + $ git add addme + $ git ls-files -t + H addme + H tracked + S tracked-but-maybe-skipped + $ git reset --hard # usually works great + error: Path 'addme' not uptodate; will not remove from working tree. + HEAD is now at bdbbb6f third + $ git ls-files -t + H tracked + S tracked-but-maybe-skipped + $ ls -1 + tracked + + `git reset --hard` DID remove addme from the index and the working tree, contrary + to the error message, but in line with how reset --hard should behave. + +3. read-tree + + `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the + entries it reads into the index, resulting in all your files suddenly + appearing to be "deleted". + +4. Checkout, restore: + + These command do not handle path & revision arguments appropriately: + + $ ls + tracked + $ git ls-files -t + H tracked + S tracked-but-maybe-skipped + $ git status --porcelain + $ git checkout -- '*skipped' + error: pathspec '*skipped' did not match any file(s) known to git + $ git ls-files -- '*skipped' + tracked-but-maybe-skipped + $ git checkout HEAD -- '*skipped' + error: pathspec '*skipped' did not match any file(s) known to git + $ git ls-tree HEAD | grep skipped + 100644 blob 276f5a64354b791b13840f02047738c77ad0584f tracked-but-maybe-skipped + $ git status --porcelain + $ git checkout HEAD~1 -- '*skipped' + $ git ls-files -t + H tracked + H tracked-but-maybe-skipped + $ git status --porcelain + M tracked-but-maybe-skipped + $ git checkout HEAD -- '*skipped' + $ git status --porcelain + $ + + Note that checkout without a revision (or restore --staged) fails to + find a file to restore from the index, even though ls-files shows + such a file certainly exists. + + Similar issues occur with HEAD (--source=HEAD in restore's case), + but suddenly works when HEAD~1 is specified. And then after that it + will work with HEAD specified, even though it didn't before. + + Directories are also an issue: + + $ git sparse-checkout set nomatches + $ git status + On branch main + You are in a sparse checkout with 0% of tracked files present. + + nothing to commit, working tree clean + $ git checkout . + error: pathspec '.' did not match any file(s) known to git + $ git checkout HEAD~1 . + Updated 1 path from 58916d9 + $ git ls-files -t + S tracked + H tracked-but-maybe-skipped + +5. checkout and restore --staged, continued: + + These commands do not correctly scope operations to the sparse + specification, and make it worse by not setting important SKIP_WORKTREE + bits: + + $ git restore --source OLDREV --staged outside-sparse-cone/ + $ git status --porcelain + MD outside-sparse-cone/file1 + MD outside-sparse-cone/file2 + MD outside-sparse-cone/file3 + + We can add a --scope=all mode to `git restore` to let it operate outside + the sparse specification, but then it will be important to set the + SKIP_WORKTREE bits appropriately. + +6. Performance issues; see: + https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/ + + +=== Reference Emails === + +Emails that detail various bugs we've had in sparse-checkout: + +[1] (Original descriptions of behavior A & behavior B) + https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/ +[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences) + https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/ +[3] (Present-despite-skipped entries) + https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/ +[4] (Clone --no-checkout interaction) + https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/ (clone --no-checkout) +[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`) + https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/ +[6] (SKIP_WORKTREE is advisory, not mandatory) + https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/ +[7] (`worktree add` should copy sparsity settings from current worktree) + https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/ +[8] (Avoid negative surprises in add, rm, and mv) + https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/ + https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/ +[9] (Move from out-of-cone to in-cone) + https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/ + https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/ +[10] (Unnecessarily downloading objects outside sparse specification) + https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/ + +[11] (Stolee's comments on high-level usecases) + https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/ + +[12] Others commenting on eventually switching default to behavior A: + * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/ + * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/ + * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/ + +[13] Previous config name suggestion and description + * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/ + +[14] Tangential issue: switch to cone mode as default sparse specification mechanism: + https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/ + +[15] Lengthy email on grep behavior, covering what should be searched: + * https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/ + +[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations, + search for the parenthetical comment starting "We do not check". + https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/ + +[17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/ |