git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
@ 2022-06-29 19:11 Dian Xu
  2022-06-29 21:53 ` Victoria Dye
  2022-06-30  3:10 ` Elijah Newren
  0 siblings, 2 replies; 12+ messages in thread
From: Dian Xu @ 2022-06-29 19:11 UTC (permalink / raw)
  To: git

Dear Git developers,

Reporting Issue:
              'git add' hangs in a large repo which has
sparse-checkout file with large number of patterns in it

Found in:
              Git 2.34.3. Issue occurs after 'audit for interaction
with sparse-index' was introduced in add.c

Reproduction steps:
              1. Clone a repo which has e.g. 2 million plus files
              2. Enable sparse checkout by: git config core.sparsecheckout true
              3. Create a .git/info/sparse-checkout file with a large
number of patterns, e.g. 16k plus lines
              4. Run 'git add', which will hang

Investigations:
              1. Stack trace:
                       add.c: cmd_add
                  -> add.c: prune_directory
                  -> pathspec.c: add_pathspec_matches_against_index
                  -> dir.c: path_in_sparse_checkout_1
              2. In Git 2.33.3, the loop at pathspec.c line 42 runs
fast, even when istate->cache_nr is at 2 million
              3. Since Git 2.34.3, the newly introduced 'audit for
interaction with sparse-index' (dir.c line 1459:
path_in_sparse_checkout_1) decides to loop through 2 million files and
match each one of them against the sparse-checkout patterns
              4. This hits the O(n^2) problem thus causes 'git add' to
hang (or ~1.5 hours to finish)

Please help us take a look at this issue and let us know if you need
more information.

Thanks,

Dian Xu
Mathworks, Inc
1 Lakeside Campus Drive, Natick, MA 01760
508-647-3583

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
  2022-06-29 19:11 git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it Dian Xu
@ 2022-06-29 21:53 ` Victoria Dye
  2022-06-30  4:06   ` Elijah Newren
  2022-06-30  3:10 ` Elijah Newren
  1 sibling, 1 reply; 12+ messages in thread
From: Victoria Dye @ 2022-06-29 21:53 UTC (permalink / raw)
  To: Dian Xu, git, Derrick Stolee

Dian Xu wrote:
> Dear Git developers,
> 
> Reporting Issue:
>               'git add' hangs in a large repo which has
> sparse-checkout file with large number of patterns in it
> 
> Found in:
>               Git 2.34.3. Issue occurs after 'audit for interaction
> with sparse-index' was introduced in add.c
> 
> Reproduction steps:
>               1. Clone a repo which has e.g. 2 million plus files
>               2. Enable sparse checkout by: git config core.sparsecheckout true
>               3. Create a .git/info/sparse-checkout file with a large
> number of patterns, e.g. 16k plus lines
>               4. Run 'git add', which will hang> 
> Investigations:
>               1. Stack trace:
>                        add.c: cmd_add
>                   -> add.c: prune_directory
>                   -> pathspec.c: add_pathspec_matches_against_index
>                   -> dir.c: path_in_sparse_checkout_1
>               2. In Git 2.33.3, the loop at pathspec.c line 42 runs
> fast, even when istate->cache_nr is at 2 million
>               3. Since Git 2.34.3, the newly introduced 'audit for
> interaction with sparse-index' (dir.c line 1459:
> path_in_sparse_checkout_1) decides to loop through 2 million files and
> match each one of them against the sparse-checkout patterns
>               4. This hits the O(n^2) problem thus causes 'git add' to
> hang (or ~1.5 hours to finish)

Thanks for the explanation, it helped me narrow down the source to an exact
commit (49fdd51a23 (add: skip tracked paths outside sparse-checkout cone,
2021-09-24)). 

You're correct that the `path_in_sparse_checkout()` check is slow [1].
However, it only runs on files that are not "hidden" with the
`SKIP_WORKTREE` flag. Ideally, if you're using sparse-checkout, this will
only be a small subset of your 2 million files. 

In your repro steps, you're adding patterns to a file then immediately
running `git add`. If that reflects how you're usually working with
sparse-checkout, `SKIP_WORKTREE` likely isn't being applied properly before
the `add`. You can check to see whether file(s) have the flag properly
applied with `git ls-files -t <file or dir names>` - each `SKIP_WORKTREE`
file should have an "S" next to it. "H" indicates that the flag is *not*
applied.

If you see that most of the files that *should* be sparse don't have
`SKIP_WORKTREE` applied, you can run `git sparse-checkout reapply` (make
sure you don't have any modified files outside the patterns you're
applying!). The downside is that it'll be as slow as what you're reporting
for `git add`, but any subsequent `add` (or reset, status, etc.) should be
much faster.

If you do all of that but things are still slow, then the way we check
pathspecs in `git add` would need to change (not trivial, but probably not
impossible either). At a cursory glance, I can think of a few options for
that:

1. Remove the `path_in_sparse_checkout()` check. It's the simplest solution,
   but it means you'd be able to stage files for commit outside the
   sparse-checkout patterns without using the '--sparse' option. I don't
   personally think that's a huge issue, but given that the implementation
   was intentionally changed *away* from this approach, I'd defer to other
   contributors to see if that's an okay change to make.
2. After every call to `ce_path_match()`, check if all pathspecs are marked
   as `seen` and, if so, return early. This would slow down each individual
   file check and wouldn't help you if a pathspec matches nothing, but
   prevents checking more files once all pathspecs are matched.
3. Do some heuristic checks on the pathspecs before checking index entries.
   For example, exact file or directory matches could be searched in the
   index. This would still require falling back on the per-file checks if
   not all pathspecs are matched, but makes some typical use-cases much
   faster.

There are almost certainly other options, and I can dig around `add.c` more
to see if there's anything I'm missing (although I'd love to hear other
ideas too!). 

Hopefully this helps!
- Victoria

[1] `path_in_sparse_checkout()` is significantly faster in cone mode, but
with 16k patterns I'm assuming you're not using cone patterns ;)

> 
> Please help us take a look at this issue and let us know if you need
> more information.
> 
> Thanks,
> 
> Dian Xu
> Mathworks, Inc
> 1 Lakeside Campus Drive, Natick, MA 01760
> 508-647-3583

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
  2022-06-29 19:11 git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it Dian Xu
  2022-06-29 21:53 ` Victoria Dye
@ 2022-06-30  3:10 ` Elijah Newren
  1 sibling, 0 replies; 12+ messages in thread
From: Elijah Newren @ 2022-06-30  3:10 UTC (permalink / raw)
  To: Dian Xu; +Cc: Git Mailing List

On Wed, Jun 29, 2022 at 12:50 PM Dian Xu <dianxudev@gmail.com> wrote:
>
> Dear Git developers,
>
> Reporting Issue:
>               'git add' hangs in a large repo which has
> sparse-checkout file with large number of patterns in it
>
> Found in:
>               Git 2.34.3. Issue occurs after 'audit for interaction
> with sparse-index' was introduced in add.c
>
> Reproduction steps:
>               1. Clone a repo which has e.g. 2 million plus files
>               2. Enable sparse checkout by: git config core.sparsecheckout true
>               3. Create a .git/info/sparse-checkout file with a large
> number of patterns, e.g. 16k plus lines

Did you run `git read-tree -mu HEAD` or even `git sparse-checkout
reapply` after step 3 and before step 4?  If not, you've left the
working tree out-of-sync with the specified sparsity paths and should
fix that before running step 4.

>               4. Run 'git add', which will hang

Alternatively to the above, if you really want to add a file and
ignore the fact that it might be outside the sparsity patterns (and
risk it later randomly disappearing with checkout/rebase/merge/etc.
commands), then you can use `git add --sparse $FILENAME`.

> Investigations:
>               1. Stack trace:
>                        add.c: cmd_add
>                   -> add.c: prune_directory
>                   -> pathspec.c: add_pathspec_matches_against_index
>                   -> dir.c: path_in_sparse_checkout_1
>               2. In Git 2.33.3, the loop at pathspec.c line 42 runs
> fast, even when istate->cache_nr is at 2 million
>               3. Since Git 2.34.3, the newly introduced 'audit for
> interaction with sparse-index' (dir.c line 1459:
> path_in_sparse_checkout_1) decides to loop through 2 million files and
> match each one of them against the sparse-checkout patterns
>               4. This hits the O(n^2) problem thus causes 'git add' to
> hang (or ~1.5 hours to finish)
>
> Please help us take a look at this issue and let us know if you need
> more information.

I'm also curious if you can use --cone mode in sparse-checkout.  The
O(N*M) behavior of sparse checkouts in non-cone mode is pretty
fundamental, and we may need to add additional paths checking the
sparsity patterns (i.e. more O(N*M) codepaths) to fix various
user-observed bugs.  Usage of --cone mode drops all of these to a
linear cost.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
  2022-06-29 21:53 ` Victoria Dye
@ 2022-06-30  4:06   ` Elijah Newren
  2022-06-30  5:06     ` Victoria Dye
  0 siblings, 1 reply; 12+ messages in thread
From: Elijah Newren @ 2022-06-30  4:06 UTC (permalink / raw)
  To: Victoria Dye; +Cc: Dian Xu, Git Mailing List, Derrick Stolee

On Wed, Jun 29, 2022 at 3:04 PM Victoria Dye <vdye@github.com> wrote:
>
> Dian Xu wrote:
> > Dear Git developers,
> >
> > Reporting Issue:
> >               'git add' hangs in a large repo which has
> > sparse-checkout file with large number of patterns in it
> >
> > Found in:
> >               Git 2.34.3. Issue occurs after 'audit for interaction
> > with sparse-index' was introduced in add.c
> >
> > Reproduction steps:
> >               1. Clone a repo which has e.g. 2 million plus files
> >               2. Enable sparse checkout by: git config core.sparsecheckout true
> >               3. Create a .git/info/sparse-checkout file with a large
> > number of patterns, e.g. 16k plus lines
> >               4. Run 'git add', which will hang>
> > Investigations:
> >               1. Stack trace:
> >                        add.c: cmd_add
> >                   -> add.c: prune_directory
> >                   -> pathspec.c: add_pathspec_matches_against_index
> >                   -> dir.c: path_in_sparse_checkout_1
> >               2. In Git 2.33.3, the loop at pathspec.c line 42 runs
> > fast, even when istate->cache_nr is at 2 million
> >               3. Since Git 2.34.3, the newly introduced 'audit for
> > interaction with sparse-index' (dir.c line 1459:
> > path_in_sparse_checkout_1) decides to loop through 2 million files and
> > match each one of them against the sparse-checkout patterns
> >               4. This hits the O(n^2) problem thus causes 'git add' to
> > hang (or ~1.5 hours to finish)
>
> Thanks for the explanation, it helped me narrow down the source to an exact
> commit (49fdd51a23 (add: skip tracked paths outside sparse-checkout cone,
> 2021-09-24)).
>
> You're correct that the `path_in_sparse_checkout()` check is slow [1].
> However, it only runs on files that are not "hidden" with the
> `SKIP_WORKTREE` flag. Ideally, if you're using sparse-checkout, this will
> only be a small subset of your 2 million files.
>
> In your repro steps, you're adding patterns to a file then immediately
> running `git add`. If that reflects how you're usually working with
> sparse-checkout, `SKIP_WORKTREE` likely isn't being applied properly before
> the `add`. You can check to see whether file(s) have the flag properly
> applied with `git ls-files -t <file or dir names>` - each `SKIP_WORKTREE`
> file should have an "S" next to it. "H" indicates that the flag is *not*
> applied.
>
> If you see that most of the files that *should* be sparse don't have
> `SKIP_WORKTREE` applied, you can run `git sparse-checkout reapply` (make
> sure you don't have any modified files outside the patterns you're
> applying!). The downside is that it'll be as slow as what you're reporting
> for `git add`, but any subsequent `add` (or reset, status, etc.) should be
> much faster.
>
> If you do all of that but things are still slow, then the way we check
> pathspecs in `git add` would need to change (not trivial, but probably not
> impossible either). At a cursory glance, I can think of a few options for
> that:
>
> 1. Remove the `path_in_sparse_checkout()` check. It's the simplest solution,
>    but it means you'd be able to stage files for commit outside the
>    sparse-checkout patterns without using the '--sparse' option. I don't
>    personally think that's a huge issue, but given that the implementation
>    was intentionally changed *away* from this approach, I'd defer to other
>    contributors to see if that's an okay change to make.

I'm strongly against this.  This just restores the original bug we
were trying to fix, attempts to paper over the fact that non-cone mode
is fundamentally O(N*M) in one small instance, and sets the precedent
that we can't fix further sparse-checkout-based usability bugs because
it might add performance bottlenecks in additional places given
non-cone-mode's fundamental performance design problem.  We should
instead (in rough priority order)

* encourage people to adopt cone mode
* discourage people still using non-cone mode from having lots of patterns
* make sure people aren't misusing things (the lack of a `git
read-tree -mu HEAD` or `git sparse-checkout reapply` seemed very
suspicious)
* educate people that non-cone mode is just fundamentally slow, among
other problems, and that the slowness might appear in additional
places in the future as we fix various usability issues.
* provide workarounds users can adopt if they really want to persist
with non-cone mode with lots of patterns (e.g. add "--sparse" to their
"git add" commands), though warn them about the possible side effects
they'll face (the added files can seemingly randomly disappear in the
working tree with future checkout/pull/merge/rebase/reset/etc commands
if the added files don't match the sparsity patterns).
* investigate ways to optimize the code to lower the constant in the
O(N*M) behavior on a case-by-case basis

We deprecated non-cone mode in v2.37 in part because of this type of
issue, and I really don't want the see the deprecated side of things
dictating how commands work for the now-default mode.

> 2. After every call to `ce_path_match()`, check if all pathspecs are marked
>    as `seen` and, if so, return early. This would slow down each individual
>    file check and wouldn't help you if a pathspec matches nothing, but
>    prevents checking more files once all pathspecs are matched.

Might be interesting.  Would need some careful measurements and
attempts to validate how often all pathspecs are matched early in some
kind of way, because this would definitely slow down some cases and
speed others up, but I don't have a good feel for which side happens
more frequently in practice.

> 3. Do some heuristic checks on the pathspecs before checking index entries.
>    For example, exact file or directory matches could be searched in the
>    index. This would still require falling back on the per-file checks if
>    not all pathspecs are matched, but makes some typical use-cases much
>    faster.

I'm confused.  "before checking index entries", you're checking things
(namely, exact file or directory matches) "in the index"?

> There are almost certainly other options, and I can dig around `add.c` more
> to see if there's anything I'm missing (although I'd love to hear other
> ideas too!).
>
> Hopefully this helps!
> - Victoria
>
> [1] `path_in_sparse_checkout()` is significantly faster in cone mode, but
> with 16k patterns I'm assuming you're not using cone patterns ;)
>
> >
> > Please help us take a look at this issue and let us know if you need
> > more information.
> >
> > Thanks,
> >
> > Dian Xu
> > Mathworks, Inc
> > 1 Lakeside Campus Drive, Natick, MA 01760
> > 508-647-3583
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
  2022-06-30  4:06   ` Elijah Newren
@ 2022-06-30  5:06     ` Victoria Dye
  2022-07-01  3:42       ` Elijah Newren
  0 siblings, 1 reply; 12+ messages in thread
From: Victoria Dye @ 2022-06-30  5:06 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Dian Xu, Git Mailing List, Derrick Stolee

Elijah Newren wrote:
> On Wed, Jun 29, 2022 at 3:04 PM Victoria Dye <vdye@github.com> wrote:
>>
>> Dian Xu wrote:
>>> Dear Git developers,
>>>
>>> Reporting Issue:
>>>               'git add' hangs in a large repo which has
>>> sparse-checkout file with large number of patterns in it
>>>
>>> Found in:
>>>               Git 2.34.3. Issue occurs after 'audit for interaction
>>> with sparse-index' was introduced in add.c
>>>
>>> Reproduction steps:
>>>               1. Clone a repo which has e.g. 2 million plus files
>>>               2. Enable sparse checkout by: git config core.sparsecheckout true
>>>               3. Create a .git/info/sparse-checkout file with a large
>>> number of patterns, e.g. 16k plus lines
>>>               4. Run 'git add', which will hang>
>>> Investigations:
>>>               1. Stack trace:
>>>                        add.c: cmd_add
>>>                   -> add.c: prune_directory
>>>                   -> pathspec.c: add_pathspec_matches_against_index
>>>                   -> dir.c: path_in_sparse_checkout_1
>>>               2. In Git 2.33.3, the loop at pathspec.c line 42 runs
>>> fast, even when istate->cache_nr is at 2 million
>>>               3. Since Git 2.34.3, the newly introduced 'audit for
>>> interaction with sparse-index' (dir.c line 1459:
>>> path_in_sparse_checkout_1) decides to loop through 2 million files and
>>> match each one of them against the sparse-checkout patterns
>>>               4. This hits the O(n^2) problem thus causes 'git add' to
>>> hang (or ~1.5 hours to finish)
>>
>> Thanks for the explanation, it helped me narrow down the source to an exact
>> commit (49fdd51a23 (add: skip tracked paths outside sparse-checkout cone,
>> 2021-09-24)).
>>
>> You're correct that the `path_in_sparse_checkout()` check is slow [1].
>> However, it only runs on files that are not "hidden" with the
>> `SKIP_WORKTREE` flag. Ideally, if you're using sparse-checkout, this will
>> only be a small subset of your 2 million files.
>>
>> In your repro steps, you're adding patterns to a file then immediately
>> running `git add`. If that reflects how you're usually working with
>> sparse-checkout, `SKIP_WORKTREE` likely isn't being applied properly before
>> the `add`. You can check to see whether file(s) have the flag properly
>> applied with `git ls-files -t <file or dir names>` - each `SKIP_WORKTREE`
>> file should have an "S" next to it. "H" indicates that the flag is *not*
>> applied.
>>
>> If you see that most of the files that *should* be sparse don't have
>> `SKIP_WORKTREE` applied, you can run `git sparse-checkout reapply` (make
>> sure you don't have any modified files outside the patterns you're
>> applying!). The downside is that it'll be as slow as what you're reporting
>> for `git add`, but any subsequent `add` (or reset, status, etc.) should be
>> much faster.
>>
>> If you do all of that but things are still slow, then the way we check
>> pathspecs in `git add` would need to change (not trivial, but probably not
>> impossible either). At a cursory glance, I can think of a few options for
>> that:
>>
>> 1. Remove the `path_in_sparse_checkout()` check. It's the simplest solution,
>>    but it means you'd be able to stage files for commit outside the
>>    sparse-checkout patterns without using the '--sparse' option. I don't
>>    personally think that's a huge issue, but given that the implementation
>>    was intentionally changed *away* from this approach, I'd defer to other
>>    contributors to see if that's an okay change to make.
> 
> I'm strongly against this.  This just restores the original bug we
> were trying to fix, attempts to paper over the fact that non-cone mode
> is fundamentally O(N*M) in one small instance, and sets the precedent
> that we can't fix further sparse-checkout-based usability bugs because
> it might add performance bottlenecks in additional places given
> non-cone-mode's fundamental performance design problem.  We should
> instead (in rough priority order)

I'm not sure what the bug was - although I can (and should) read through the
list archive to find out - but the rest of your point is convincing enough
on its own. Even if we sacrificed correctness for performance in this one
case, there are countless other places in the code like it, and changing all
of them could seriously hurt user experience in other ways.

Thanks for your perspective!

> 
> * encourage people to adopt cone mode
> * discourage people still using non-cone mode from having lots of patterns

While I know these are the recommended best practice, I do want to
acknowledge that switching to cone mode in some repositories is a huge lift
or otherwise infeasible [1]. While people make that transition (if they even
can), I don't think it's unreasonable to look for ways to improve
performance in non-cone sparse checkout, especially if those performance
gains can also be realized in cone mode.

[1] https://lore.kernel.org/git/nycvar.QRO.7.76.6.2205212347060.352@tvgsbejvaqbjf.bet/

> * make sure people aren't misusing things (the lack of a `git
> read-tree -mu HEAD` or `git sparse-checkout reapply` seemed very
> suspicious)

A warning if a particular git operation sees a lot of out-of-cone
non-`SKIP_WORKTREE` files might help with this (and would be especially
impactful for someone working with a sparse index). I'm not sure how to
quantify "a lot", though.

> * educate people that non-cone mode is just fundamentally slow, among
> other problems, and that the slowness might appear in additional
> places in the future as we fix various usability issues.
> * provide workarounds users can adopt if they really want to persist
> with non-cone mode with lots of patterns (e.g. add "--sparse" to their
> "git add" commands), though warn them about the possible side effects
> they'll face (the added files can seemingly randomly disappear in the
> working tree with future checkout/pull/merge/rebase/reset/etc commands
> if the added files don't match the sparsity patterns).
> * investigate ways to optimize the code to lower the constant in the
> O(N*M) behavior on a case-by-case basis
> 
> We deprecated non-cone mode in v2.37 in part because of this type of
> issue, and I really don't want the see the deprecated side of things
> dictating how commands work for the now-default mode.
> 
>> 2. After every call to `ce_path_match()`, check if all pathspecs are marked
>>    as `seen` and, if so, return early. This would slow down each individual
>>    file check and wouldn't help you if a pathspec matches nothing, but
>>    prevents checking more files once all pathspecs are matched.
> 
> Might be interesting.  Would need some careful measurements and
> attempts to validate how often all pathspecs are matched early in some
> kind of way, because this would definitely slow down some cases and
> speed others up, but I don't have a good feel for which side happens
> more frequently in practice.
> 
>> 3. Do some heuristic checks on the pathspecs before checking index entries.
>>    For example, exact file or directory matches could be searched in the
>>    index. This would still require falling back on the per-file checks if
>>    not all pathspecs are matched, but makes some typical use-cases much
>>    faster.
> 
> I'm confused.  "before checking index entries", you're checking things
> (namely, exact file or directory matches) "in the index"?

Sorry, I definitely wasn't clear. I mean "perform heuristic checks *per
pathspec item* before iterating *per index entry*." Pathspecs used in `git
add` are (at least in my experience) pretty small, so there could be
performance gains if all the items can be marked `seen` without iterating
over every entry of the index. I was thinking something like
`pathspec_needs_expanded_index()` in `reset` (4d1cfc1351 (reset: make
--mixed sparse-aware, 2021-11-29)), but tailored to this particular case.

> 
>> There are almost certainly other options, and I can dig around `add.c` more
>> to see if there's anything I'm missing (although I'd love to hear other
>> ideas too!).
>>
>> Hopefully this helps!
>> - Victoria
>>
>> [1] `path_in_sparse_checkout()` is significantly faster in cone mode, but
>> with 16k patterns I'm assuming you're not using cone patterns ;)
>>
>>>
>>> Please help us take a look at this issue and let us know if you need
>>> more information.
>>>
>>> Thanks,
>>>
>>> Dian Xu
>>> Mathworks, Inc
>>> 1 Lakeside Campus Drive, Natick, MA 01760
>>> 508-647-3583
>>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
  2022-06-30  5:06     ` Victoria Dye
@ 2022-07-01  3:42       ` Elijah Newren
  2022-07-01 20:24         ` Dian Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Elijah Newren @ 2022-07-01  3:42 UTC (permalink / raw)
  To: Victoria Dye; +Cc: Dian Xu, Git Mailing List, Derrick Stolee

On Wed, Jun 29, 2022 at 10:06 PM Victoria Dye <vdye@github.com> wrote:
>
> Elijah Newren wrote:
> > On Wed, Jun 29, 2022 at 3:04 PM Victoria Dye <vdye@github.com> wrote:
> >>
> >> Dian Xu wrote:
> >>> Dear Git developers,
> >>>
> >>> Reporting Issue:
> >>>               'git add' hangs in a large repo which has
> >>> sparse-checkout file with large number of patterns in it
> >>>
> >>> Found in:
> >>>               Git 2.34.3. Issue occurs after 'audit for interaction
> >>> with sparse-index' was introduced in add.c
> >>>
> >>> Reproduction steps:
> >>>               1. Clone a repo which has e.g. 2 million plus files
> >>>               2. Enable sparse checkout by: git config core.sparsecheckout true
> >>>               3. Create a .git/info/sparse-checkout file with a large
> >>> number of patterns, e.g. 16k plus lines
> >>>               4. Run 'git add', which will hang>
> >>> Investigations:
> >>>               1. Stack trace:
> >>>                        add.c: cmd_add
> >>>                   -> add.c: prune_directory
> >>>                   -> pathspec.c: add_pathspec_matches_against_index
> >>>                   -> dir.c: path_in_sparse_checkout_1
> >>>               2. In Git 2.33.3, the loop at pathspec.c line 42 runs
> >>> fast, even when istate->cache_nr is at 2 million
> >>>               3. Since Git 2.34.3, the newly introduced 'audit for
> >>> interaction with sparse-index' (dir.c line 1459:
> >>> path_in_sparse_checkout_1) decides to loop through 2 million files and
> >>> match each one of them against the sparse-checkout patterns
> >>>               4. This hits the O(n^2) problem thus causes 'git add' to
> >>> hang (or ~1.5 hours to finish)
> >>
> >> Thanks for the explanation, it helped me narrow down the source to an exact
> >> commit (49fdd51a23 (add: skip tracked paths outside sparse-checkout cone,
> >> 2021-09-24)).
> >>
> >> You're correct that the `path_in_sparse_checkout()` check is slow [1].
> >> However, it only runs on files that are not "hidden" with the
> >> `SKIP_WORKTREE` flag. Ideally, if you're using sparse-checkout, this will
> >> only be a small subset of your 2 million files.
> >>
> >> In your repro steps, you're adding patterns to a file then immediately
> >> running `git add`. If that reflects how you're usually working with
> >> sparse-checkout, `SKIP_WORKTREE` likely isn't being applied properly before
> >> the `add`. You can check to see whether file(s) have the flag properly
> >> applied with `git ls-files -t <file or dir names>` - each `SKIP_WORKTREE`
> >> file should have an "S" next to it. "H" indicates that the flag is *not*
> >> applied.
> >>
> >> If you see that most of the files that *should* be sparse don't have
> >> `SKIP_WORKTREE` applied, you can run `git sparse-checkout reapply` (make
> >> sure you don't have any modified files outside the patterns you're
> >> applying!). The downside is that it'll be as slow as what you're reporting
> >> for `git add`, but any subsequent `add` (or reset, status, etc.) should be
> >> much faster.
> >>
> >> If you do all of that but things are still slow, then the way we check
> >> pathspecs in `git add` would need to change (not trivial, but probably not
> >> impossible either). At a cursory glance, I can think of a few options for
> >> that:
> >>
> >> 1. Remove the `path_in_sparse_checkout()` check. It's the simplest solution,
> >>    but it means you'd be able to stage files for commit outside the
> >>    sparse-checkout patterns without using the '--sparse' option. I don't
> >>    personally think that's a huge issue, but given that the implementation
> >>    was intentionally changed *away* from this approach, I'd defer to other
> >>    contributors to see if that's an okay change to make.
> >
> > I'm strongly against this.  This just restores the original bug we
> > were trying to fix, attempts to paper over the fact that non-cone mode
> > is fundamentally O(N*M) in one small instance, and sets the precedent
> > that we can't fix further sparse-checkout-based usability bugs because
> > it might add performance bottlenecks in additional places given
> > non-cone-mode's fundamental performance design problem.  We should
> > instead (in rough priority order)
>
> I'm not sure what the bug was - although I can (and should) read through the
> list archive to find out - but the rest of your point is convincing enough
> on its own. Even if we sacrificed correctness for performance in this one
> case, there are countless other places in the code like it, and changing all
> of them could seriously hurt user experience in other ways.
>
> Thanks for your perspective!

:-)

> >
> > * encourage people to adopt cone mode
> > * discourage people still using non-cone mode from having lots of patterns
>
> While I know these are the recommended best practice, I do want to
> acknowledge that switching to cone mode in some repositories is a huge lift
> or otherwise infeasible [1]. While people make that transition (if they even
> can), I don't think it's unreasonable to look for ways to improve
> performance in non-cone sparse checkout, especially if those performance
> gains can also be realized in cone mode.
>
> [1] https://lore.kernel.org/git/nycvar.QRO.7.76.6.2205212347060.352@tvgsbejvaqbjf.bet/

Yep, very good point.  These dovetail with why I used "encourage" and
"discourage", and with why I had several more things that should be
done in my list, including performance work.  I know that non-cone
mode is important to still support.

But I would also like to point out that folks sometimes aren't
adopting cone mode out of inertia or bad assumptions about how cone
mode operates, rather than having sound reasons.  In fact, I've even
seen them describe conditions that sound like a perfect fit for cone
mode and yet use their described usecase as rationale to _not_ use
cone mode simply because they assume cone mode does something other
than what it really does.  (See
https://lore.kernel.org/git/CABPp-BEwMAPHGt5xD9jDU58grbrAqCdqNY9Nh8UJGLKuLbArXQ@mail.gmail.com/
and the previous email in that thread for an example).  If even other
Git developers do that, that suggests we do need to probe a bit and
see if people can switch instead of just accepting they are one of the
cases that can't.  We still have some education work to do.

> > * make sure people aren't misusing things (the lack of a `git
> > read-tree -mu HEAD` or `git sparse-checkout reapply` seemed very
> > suspicious)
>
> A warning if a particular git operation sees a lot of out-of-cone
> non-`SKIP_WORKTREE` files might help with this (and would be especially
> impactful for someone working with a sparse index). I'm not sure how to
> quantify "a lot", though.

Yeah, this kind of reminds me of the present-despite-skipped check we
added.  Adding something like that which always runs is probably a
no-go, though, since this additional check would be much more
expensive than the present-despite-skipped one.  And, like you, I'm
also a little unsure how to quantify "a lot".

However, perhaps there's a way to tackle this problem from a different
angle.  I just noticed that the only place outside of the "git
sparse-checkout" command that sparse checkouts are documented, in
git-read-tree(1), that it didn't bother to give a list of steps for
employing sparse-checkouts and that people have to figure it out by
trial and error.  (Or read a random mailing list post or commit
message like 94c0956b60 (sparse-checkout: create builtin with 'list'
subcommand, 2019-11-21)).  So perhaps it's not surprising that users
miss one of the crucial steps.  Perhaps if we fix that documentation
to mention the necessary steps ("git config core.sparseCheckout true",
populate $GIT_DIR/info/sparse-checkout, then either run "git read-tree
-mu HEAD" or "git sparse-checkout reapply"), then users can discover
and make sure to do all the steps instead of just a subset?

> > * educate people that non-cone mode is just fundamentally slow, among
> > other problems, and that the slowness might appear in additional
> > places in the future as we fix various usability issues.
> > * provide workarounds users can adopt if they really want to persist
> > with non-cone mode with lots of patterns (e.g. add "--sparse" to their
> > "git add" commands), though warn them about the possible side effects
> > they'll face (the added files can seemingly randomly disappear in the
> > working tree with future checkout/pull/merge/rebase/reset/etc commands
> > if the added files don't match the sparsity patterns).
> > * investigate ways to optimize the code to lower the constant in the
> > O(N*M) behavior on a case-by-case basis
> >
> > We deprecated non-cone mode in v2.37 in part because of this type of
> > issue, and I really don't want the see the deprecated side of things
> > dictating how commands work for the now-default mode.
> >
> >> 2. After every call to `ce_path_match()`, check if all pathspecs are marked
> >>    as `seen` and, if so, return early. This would slow down each individual
> >>    file check and wouldn't help you if a pathspec matches nothing, but
> >>    prevents checking more files once all pathspecs are matched.
> >
> > Might be interesting.  Would need some careful measurements and
> > attempts to validate how often all pathspecs are matched early in some
> > kind of way, because this would definitely slow down some cases and
> > speed others up, but I don't have a good feel for which side happens
> > more frequently in practice.
> >
> >> 3. Do some heuristic checks on the pathspecs before checking index entries.
> >>    For example, exact file or directory matches could be searched in the
> >>    index. This would still require falling back on the per-file checks if
> >>    not all pathspecs are matched, but makes some typical use-cases much
> >>    faster.
> >
> > I'm confused.  "before checking index entries", you're checking things
> > (namely, exact file or directory matches) "in the index"?
>
> Sorry, I definitely wasn't clear. I mean "perform heuristic checks *per
> pathspec item* before iterating *per index entry*." Pathspecs used in `git
> add` are (at least in my experience) pretty small, so there could be
> performance gains if all the items can be marked `seen` without iterating
> over every entry of the index. I was thinking something like
> `pathspec_needs_expanded_index()` in `reset` (4d1cfc1351 (reset: make
> --mixed sparse-aware, 2021-11-29)), but tailored to this particular case.

Ah, okay makes sense now.





>
> >
> >> There are almost certainly other options, and I can dig around `add.c` more
> >> to see if there's anything I'm missing (although I'd love to hear other
> >> ideas too!).
> >>
> >> Hopefully this helps!
> >> - Victoria
> >>
> >> [1] `path_in_sparse_checkout()` is significantly faster in cone mode, but
> >> with 16k patterns I'm assuming you're not using cone patterns ;)
> >>
> >>>
> >>> Please help us take a look at this issue and let us know if you need
> >>> more information.
> >>>
> >>> Thanks,
> >>>
> >>> Dian Xu
> >>> Mathworks, Inc
> >>> 1 Lakeside Campus Drive, Natick, MA 01760
> >>> 508-647-3583
> >>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
  2022-07-01  3:42       ` Elijah Newren
@ 2022-07-01 20:24         ` Dian Xu
  2022-07-01 21:52           ` Elijah Newren
  0 siblings, 1 reply; 12+ messages in thread
From: Dian Xu @ 2022-07-01 20:24 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Victoria Dye, Git Mailing List, Derrick Stolee

Hi Victoria, Elijah, Derrick,

Thanks a lot for the detailed insight.

(Btw our company’s email mathworks.com is blocked by
mailto:git@vger.kernel.org, hope someone can help take a look)

1. We use a no-cone version of sparse-checkout to control the 'shape'
(set of scm files) of our source code. In this case, the local sandbox
is not necessarily 'sparse' (2m files), but it's very convenient that
we can use git to check out the exact amount (shape) of files. To
Victoria's question, all these 2m files are "H".

2. Below is the detail steps to create the local repo (sparse-checkout
was defined 'before' git checkout)
      % git init
      % git remote add origin <url>
      % git config core.sparsecheckout true
      % vi .git/info/sparse-checkout
      % git fetch
      % git checkout -b <SHA>
    Do I still need to 'git sparse-checkout reapply' after checkout?
(Thanks for pointing out to run reapply once .git/info/sparse-checkout
changed)

3. Unfortunately, after executing reapply (btw it is very slow on this
2m files * 16k patterns scenario: 30 mins), 'git add', and 'git add
--sparse' still hangs.

4. --cone is a big topic for us now, since 2.37.0 deprecates
--no-cone. We do have our own challenges to move away from --no-cone
(E.g. we use lots of file specifiers and/or exclusion patterns to
define our source code shape), which will be a huge amount of work, if
feasible. We've established a set of workflows based on --no-cone,
because of its merit of being capable of defining a fine-grained scm
shape.

5. Back to this case, what we've experimented on are:
      - Remove all files/*/! patterns from our shape definition, which
leave us with 14k directories (Obviously the scm shape no longe
matches, but just to proof of concept here)
      - 'git sparse-checkout set <14k directories>' finishes fast
      - 'git add' finishes fast
    As Victoria mentioned, I hope this --no-cone 'git add' performance
can be addressed because 'those performance gains can also be realized
in cone mode', as we saw here.

Again thank you for your help and looking forward to your thoughts.

Thanks,

Dian Xu
Mathworks, Inc
1 Lakeside Campus Drive, Natick, MA 01760
508-647-3583

On Thu, Jun 30, 2022 at 11:42 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Wed, Jun 29, 2022 at 10:06 PM Victoria Dye <vdye@github.com> wrote:
> >
> > Elijah Newren wrote:
> > > On Wed, Jun 29, 2022 at 3:04 PM Victoria Dye <vdye@github.com> wrote:
> > >>
> > >> Dian Xu wrote:
> > >>> Dear Git developers,
> > >>>
> > >>> Reporting Issue:
> > >>>               'git add' hangs in a large repo which has
> > >>> sparse-checkout file with large number of patterns in it
> > >>>
> > >>> Found in:
> > >>>               Git 2.34.3. Issue occurs after 'audit for interaction
> > >>> with sparse-index' was introduced in add.c
> > >>>
> > >>> Reproduction steps:
> > >>>               1. Clone a repo which has e.g. 2 million plus files
> > >>>               2. Enable sparse checkout by: git config core.sparsecheckout true
> > >>>               3. Create a .git/info/sparse-checkout file with a large
> > >>> number of patterns, e.g. 16k plus lines
> > >>>               4. Run 'git add', which will hang>
> > >>> Investigations:
> > >>>               1. Stack trace:
> > >>>                        add.c: cmd_add
> > >>>                   -> add.c: prune_directory
> > >>>                   -> pathspec.c: add_pathspec_matches_against_index
> > >>>                   -> dir.c: path_in_sparse_checkout_1
> > >>>               2. In Git 2.33.3, the loop at pathspec.c line 42 runs
> > >>> fast, even when istate->cache_nr is at 2 million
> > >>>               3. Since Git 2.34.3, the newly introduced 'audit for
> > >>> interaction with sparse-index' (dir.c line 1459:
> > >>> path_in_sparse_checkout_1) decides to loop through 2 million files and
> > >>> match each one of them against the sparse-checkout patterns
> > >>>               4. This hits the O(n^2) problem thus causes 'git add' to
> > >>> hang (or ~1.5 hours to finish)
> > >>
> > >> Thanks for the explanation, it helped me narrow down the source to an exact
> > >> commit (49fdd51a23 (add: skip tracked paths outside sparse-checkout cone,
> > >> 2021-09-24)).
> > >>
> > >> You're correct that the `path_in_sparse_checkout()` check is slow [1].
> > >> However, it only runs on files that are not "hidden" with the
> > >> `SKIP_WORKTREE` flag. Ideally, if you're using sparse-checkout, this will
> > >> only be a small subset of your 2 million files.
> > >>
> > >> In your repro steps, you're adding patterns to a file then immediately
> > >> running `git add`. If that reflects how you're usually working with
> > >> sparse-checkout, `SKIP_WORKTREE` likely isn't being applied properly before
> > >> the `add`. You can check to see whether file(s) have the flag properly
> > >> applied with `git ls-files -t <file or dir names>` - each `SKIP_WORKTREE`
> > >> file should have an "S" next to it. "H" indicates that the flag is *not*
> > >> applied.
> > >>
> > >> If you see that most of the files that *should* be sparse don't have
> > >> `SKIP_WORKTREE` applied, you can run `git sparse-checkout reapply` (make
> > >> sure you don't have any modified files outside the patterns you're
> > >> applying!). The downside is that it'll be as slow as what you're reporting
> > >> for `git add`, but any subsequent `add` (or reset, status, etc.) should be
> > >> much faster.
> > >>
> > >> If you do all of that but things are still slow, then the way we check
> > >> pathspecs in `git add` would need to change (not trivial, but probably not
> > >> impossible either). At a cursory glance, I can think of a few options for
> > >> that:
> > >>
> > >> 1. Remove the `path_in_sparse_checkout()` check. It's the simplest solution,
> > >>    but it means you'd be able to stage files for commit outside the
> > >>    sparse-checkout patterns without using the '--sparse' option. I don't
> > >>    personally think that's a huge issue, but given that the implementation
> > >>    was intentionally changed *away* from this approach, I'd defer to other
> > >>    contributors to see if that's an okay change to make.
> > >
> > > I'm strongly against this.  This just restores the original bug we
> > > were trying to fix, attempts to paper over the fact that non-cone mode
> > > is fundamentally O(N*M) in one small instance, and sets the precedent
> > > that we can't fix further sparse-checkout-based usability bugs because
> > > it might add performance bottlenecks in additional places given
> > > non-cone-mode's fundamental performance design problem.  We should
> > > instead (in rough priority order)
> >
> > I'm not sure what the bug was - although I can (and should) read through the
> > list archive to find out - but the rest of your point is convincing enough
> > on its own. Even if we sacrificed correctness for performance in this one
> > case, there are countless other places in the code like it, and changing all
> > of them could seriously hurt user experience in other ways.
> >
> > Thanks for your perspective!
>
> :-)
>
> > >
> > > * encourage people to adopt cone mode
> > > * discourage people still using non-cone mode from having lots of patterns
> >
> > While I know these are the recommended best practice, I do want to
> > acknowledge that switching to cone mode in some repositories is a huge lift
> > or otherwise infeasible [1]. While people make that transition (if they even
> > can), I don't think it's unreasonable to look for ways to improve
> > performance in non-cone sparse checkout, especially if those performance
> > gains can also be realized in cone mode.
> >
> > [1] https://lore.kernel.org/git/nycvar.QRO.7.76.6.2205212347060.352@tvgsbejvaqbjf.bet/
>
> Yep, very good point.  These dovetail with why I used "encourage" and
> "discourage", and with why I had several more things that should be
> done in my list, including performance work.  I know that non-cone
> mode is important to still support.
>
> But I would also like to point out that folks sometimes aren't
> adopting cone mode out of inertia or bad assumptions about how cone
> mode operates, rather than having sound reasons.  In fact, I've even
> seen them describe conditions that sound like a perfect fit for cone
> mode and yet use their described usecase as rationale to _not_ use
> cone mode simply because they assume cone mode does something other
> than what it really does.  (See
> https://lore.kernel.org/git/CABPp-BEwMAPHGt5xD9jDU58grbrAqCdqNY9Nh8UJGLKuLbArXQ@mail.gmail.com/
> and the previous email in that thread for an example).  If even other
> Git developers do that, that suggests we do need to probe a bit and
> see if people can switch instead of just accepting they are one of the
> cases that can't.  We still have some education work to do.
>
> > > * make sure people aren't misusing things (the lack of a `git
> > > read-tree -mu HEAD` or `git sparse-checkout reapply` seemed very
> > > suspicious)
> >
> > A warning if a particular git operation sees a lot of out-of-cone
> > non-`SKIP_WORKTREE` files might help with this (and would be especially
> > impactful for someone working with a sparse index). I'm not sure how to
> > quantify "a lot", though.
>
> Yeah, this kind of reminds me of the present-despite-skipped check we
> added.  Adding something like that which always runs is probably a
> no-go, though, since this additional check would be much more
> expensive than the present-despite-skipped one.  And, like you, I'm
> also a little unsure how to quantify "a lot".
>
> However, perhaps there's a way to tackle this problem from a different
> angle.  I just noticed that the only place outside of the "git
> sparse-checkout" command that sparse checkouts are documented, in
> git-read-tree(1), that it didn't bother to give a list of steps for
> employing sparse-checkouts and that people have to figure it out by
> trial and error.  (Or read a random mailing list post or commit
> message like 94c0956b60 (sparse-checkout: create builtin with 'list'
> subcommand, 2019-11-21)).  So perhaps it's not surprising that users
> miss one of the crucial steps.  Perhaps if we fix that documentation
> to mention the necessary steps ("git config core.sparseCheckout true",
> populate $GIT_DIR/info/sparse-checkout, then either run "git read-tree
> -mu HEAD" or "git sparse-checkout reapply"), then users can discover
> and make sure to do all the steps instead of just a subset?
>
> > > * educate people that non-cone mode is just fundamentally slow, among
> > > other problems, and that the slowness might appear in additional
> > > places in the future as we fix various usability issues.
> > > * provide workarounds users can adopt if they really want to persist
> > > with non-cone mode with lots of patterns (e.g. add "--sparse" to their
> > > "git add" commands), though warn them about the possible side effects
> > > they'll face (the added files can seemingly randomly disappear in the
> > > working tree with future checkout/pull/merge/rebase/reset/etc commands
> > > if the added files don't match the sparsity patterns).
> > > * investigate ways to optimize the code to lower the constant in the
> > > O(N*M) behavior on a case-by-case basis
> > >
> > > We deprecated non-cone mode in v2.37 in part because of this type of
> > > issue, and I really don't want the see the deprecated side of things
> > > dictating how commands work for the now-default mode.
> > >
> > >> 2. After every call to `ce_path_match()`, check if all pathspecs are marked
> > >>    as `seen` and, if so, return early. This would slow down each individual
> > >>    file check and wouldn't help you if a pathspec matches nothing, but
> > >>    prevents checking more files once all pathspecs are matched.
> > >
> > > Might be interesting.  Would need some careful measurements and
> > > attempts to validate how often all pathspecs are matched early in some
> > > kind of way, because this would definitely slow down some cases and
> > > speed others up, but I don't have a good feel for which side happens
> > > more frequently in practice.
> > >
> > >> 3. Do some heuristic checks on the pathspecs before checking index entries.
> > >>    For example, exact file or directory matches could be searched in the
> > >>    index. This would still require falling back on the per-file checks if
> > >>    not all pathspecs are matched, but makes some typical use-cases much
> > >>    faster.
> > >
> > > I'm confused.  "before checking index entries", you're checking things
> > > (namely, exact file or directory matches) "in the index"?
> >
> > Sorry, I definitely wasn't clear. I mean "perform heuristic checks *per
> > pathspec item* before iterating *per index entry*." Pathspecs used in `git
> > add` are (at least in my experience) pretty small, so there could be
> > performance gains if all the items can be marked `seen` without iterating
> > over every entry of the index. I was thinking something like
> > `pathspec_needs_expanded_index()` in `reset` (4d1cfc1351 (reset: make
> > --mixed sparse-aware, 2021-11-29)), but tailored to this particular case.
>
> Ah, okay makes sense now.
>
>
>
>
>
> >
> > >
> > >> There are almost certainly other options, and I can dig around `add.c` more
> > >> to see if there's anything I'm missing (although I'd love to hear other
> > >> ideas too!).
> > >>
> > >> Hopefully this helps!
> > >> - Victoria
> > >>
> > >> [1] `path_in_sparse_checkout()` is significantly faster in cone mode, but
> > >> with 16k patterns I'm assuming you're not using cone patterns ;)
> > >>
> > >>>
> > >>> Please help us take a look at this issue and let us know if you need
> > >>> more information.
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Dian Xu
> > >>> Mathworks, Inc
> > >>> 1 Lakeside Campus Drive, Natick, MA 01760
> > >>> 508-647-3583
> > >>
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
  2022-07-01 20:24         ` Dian Xu
@ 2022-07-01 21:52           ` Elijah Newren
  2022-07-04 19:11             ` Konstantin Ryabitsev
  0 siblings, 1 reply; 12+ messages in thread
From: Elijah Newren @ 2022-07-01 21:52 UTC (permalink / raw)
  To: Dian Xu
  Cc: Victoria Dye, Git Mailing List, Derrick Stolee,
	Konstantin Ryabitsev

Hi Dian,

As a heads up, note that on this list we don't top-post.

On Fri, Jul 1, 2022 at 1:24 PM Dian Xu <dianxudev@gmail.com> wrote:
>
> Hi Victoria, Elijah, Derrick,
>
> Thanks a lot for the detailed insight.
>
> (Btw our company’s email mathworks.com is blocked by
> mailto:git@vger.kernel.org, hope someone can help take a look)

Konstantin: Is this something you know how to look into?  (Or do you
know who to ask?)

> 1. We use a no-cone version of sparse-checkout to control the 'shape'
> (set of scm files) of our source code. In this case, the local sandbox
> is not necessarily 'sparse' (2m files), but it's very convenient that
> we can use git to check out the exact amount (shape) of files. To
> Victoria's question, all these 2m files are "H".

How many are "H", how many are "S", and how many files in total?  I'd
like to try to construct a way to reproduce your issue, and knowing
how many of each will help.

> 2. Below is the detail steps to create the local repo (sparse-checkout
> was defined 'before' git checkout)
>       % git init
>       % git remote add origin <url>
>       % git config core.sparsecheckout true
>       % vi .git/info/sparse-checkout
>       % git fetch
>       % git checkout -b <SHA>
>     Do I still need to 'git sparse-checkout reapply' after checkout?
> (Thanks for pointing out to run reapply once .git/info/sparse-checkout
> changed)

Why didn't you list 'git sparse-checkout reapply' after editing
.git/info/sparse-checkout?  You mention it later, so I'm hoping you
ran it at that point.

You should only need to run the sparse-checkout reapply command after
manually editing the .git/info/sparse-checkout file.  There are
special cases where it might be useful after other commands, but it's
pretty rare.  Most git commands, and particularly checkout, will keep
the sparsity of the working tree up-to-date with the sparse-checkout
file -- assuming it was up-to-date beforehand.  Basically, feel free
to use the rule that you only need to reapply after manual edits of
the $GIT_DIR/info/sparse-checkout file.

Also, with newer git, you can replace all three of
   git config core.sparsecheckout true
   vi .git/info/sparse-checkout
   git sparse-checkout reapply
with
   git sparse-checkout set --no-cone <space-separated list of patterns
to insert into the .git/info/sparse-checkout file>

With older git, you can replace those three commands with two: `git
sparse-checkout init --no-cone && git sparse-checkout set <list of
patterns>`.  But that's sometimes not wanted since the init command
sparsifies everything away except files in the toplevel directory, and
then the second step restores all the files, and that two-step
approach is really slow as it deletes and then restores a huge number
of files from the working directory.

> 3. Unfortunately, after executing reapply (btw it is very slow on this
> 2m files * 16k patterns scenario: 30 mins), 'git add', and 'git add
> --sparse' still hangs.

'git add --sparse' is still slow?  That sounds like a bug I'd like to
investigate.

What's the particular timing you get for each of 'git add' and 'git
add --sparse'?  Are you giving it individual files (if so, how many?),
or directories (how many files under those directories?), or globs?
(This information will be helpful in my attempts to get a synthetic
setup aiming to be similar to yours.)

> 4. --cone is a big topic for us now, since 2.37.0 deprecates
> --no-cone. We do have our own challenges to move away from --no-cone
> (E.g. we use lots of file specifiers and/or exclusion patterns to
> define our source code shape), which will be a huge amount of work, if
> feasible. We've established a set of workflows based on --no-cone,
> because of its merit of being capable of defining a fine-grained scm
> shape.

To be fair, --no-cone is deprecated as in discouraged due to various
usability problems (including performance), but we currently have no
plans to remove it from Git.  I do heartily recommend migrating to
--cone since it solves so many problems, but we'll still support
--no-cone users as best we can.

> 5. Back to this case, what we've experimented on are:
>       - Remove all files/*/! patterns from our shape definition, which
> leave us with 14k directories (Obviously the scm shape no longe
> matches, but just to proof of concept here)
>       - 'git sparse-checkout set <14k directories>' finishes fast

Now I'm surprised.  You said in the previous email that you were using
git 2.34.2.  In that version, --no-cone is the default, so this would
still be using --no-cone mode.  That either suggests you switched to
v2.37 since your email and didn't include that detail here, or that
the performance issue is actually with certain specific patterns.
What version of git did you use here?  And did you have either an
explicit --cone or --no-cone when using the sparse-checkout set
command?

>       - 'git add' finishes fast

>     As Victoria mentioned, I hope this --no-cone 'git add' performance
> can be addressed because 'those performance gains can also be realized
> in cone mode', as we saw here.

Are we sure we saw that here?  Could you verify by reporting: (a) what
version of git were you using, and (b) does `git config --list | grep
-i sparse` show both core.sparsecheckout and core.sparsecheckoutcone
as being true after your do your sparse-checkout set?

Elijah

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
  2022-07-01 21:52           ` Elijah Newren
@ 2022-07-04 19:11             ` Konstantin Ryabitsev
  2022-07-05 13:08               ` Dian Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Konstantin Ryabitsev @ 2022-07-04 19:11 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Dian Xu, Victoria Dye, Git Mailing List, Derrick Stolee

On Fri, 1 Jul 2022 at 17:53, Elijah Newren <newren@gmail.com> wrote:
> > (Btw our company’s email mathworks.com is blocked by
> > mailto:git@vger.kernel.org, hope someone can help take a look)
>
> Konstantin: Is this something you know how to look into?  (Or do you
> know who to ask?)

Unfortunately, I'm not in charge of vger -- it's historically managed
by volunteers outside of LF.
You should reach out to postmaster@vger.kernel.org.

Best regards,
-Konstantin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
  2022-07-04 19:11             ` Konstantin Ryabitsev
@ 2022-07-05 13:08               ` Dian Xu
  2022-07-08  1:53                 ` Elijah Newren
  0 siblings, 1 reply; 12+ messages in thread
From: Dian Xu @ 2022-07-05 13:08 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Elijah Newren, Victoria Dye, Git Mailing List, Derrick Stolee

Hi Elijah,

Please see answers below:

1.  H: 2.27m; S: 7.7k; Total: 2.28m

2.  Sure I will run 'reapply' after the sparse-checkout file has
changed. Just curious, do I have to run 'reapply' if 'checkout' is the
next immediate cmd? I thought 'checkout' does the updating index as
well

3.  I simply added one file only, 'git add' and 'git add --sparse'
still hang. Let me know if you need me to send you any debug info from
pathspec.c/dir.c

4.  Good to know and we are investigating if we have a way out from --no-cone

5.  I should've been clearer: The experiment done here uses 2.37.0

Thanks,

Dian Xu
Mathworks, Inc
1 Lakeside Campus Drive, Natick, MA 01760
508-647-3583

On Mon, Jul 4, 2022 at 3:12 PM Konstantin Ryabitsev
<konstantin@linuxfoundation.org> wrote:
>
> On Fri, 1 Jul 2022 at 17:53, Elijah Newren <newren@gmail.com> wrote:
> > > (Btw our company’s email mathworks.com is blocked by
> > > mailto:git@vger.kernel.org, hope someone can help take a look)
> >
> > Konstantin: Is this something you know how to look into?  (Or do you
> > know who to ask?)
>
> Unfortunately, I'm not in charge of vger -- it's historically managed
> by volunteers outside of LF.
> You should reach out to postmaster@vger.kernel.org.
>
> Best regards,
> -Konstantin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
  2022-07-05 13:08               ` Dian Xu
@ 2022-07-08  1:53                 ` Elijah Newren
  2022-07-12 13:00                   ` Dian Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Elijah Newren @ 2022-07-08  1:53 UTC (permalink / raw)
  To: Dian Xu
  Cc: Konstantin Ryabitsev, Victoria Dye, Git Mailing List,
	Derrick Stolee

On Tue, Jul 5, 2022 at 6:08 AM Dian Xu <dianxudev@gmail.com> wrote:
>
> Hi Elijah,

Hi Dian,

Please don't top post on this list.  It'd also help to respond to the
relevant email instead of picking a different email in the thread to
put your answers in.  Anyway, that aside...

> Please see answers below:
>
> 1.  H: 2.27m; S: 7.7k; Total: 2.28m
>
> 2.  Sure I will run 'reapply' after the sparse-checkout file has
> changed. Just curious, do I have to run 'reapply' if 'checkout' is the
> next immediate cmd? I thought 'checkout' does the updating index as
> well
>
> 3.  I simply added one file only, 'git add' and 'git add --sparse'
> still hang. Let me know if you need me to send you any debug info from
> pathspec.c/dir.c
>
> 4.  Good to know and we are investigating if we have a way out from --no-cone
>
> 5.  I should've been clearer: The experiment done here uses 2.37.0

Thanks for providing these details.  It was enough to at least get me
started, and from my experiments, it appears the arguments to `git
add` are important.  In particular, I could not trigger this when
passing actual filenames that existed.  I could when passing a fake
filename.  Here's the concrete steps I used to reproduce:

    git clone git@github.com:newren/gvfs-like-git-bomb
    cd gvfs-like-git-bomb

    git init attempt
    cd attempt
    ../make-a-git-bomb.sh

    time git checkout bomb

    echo "/*" >.git/info/sparse-checkout
    echo '!/bomb/j/j/' >>.git/info/sparse-checkout
    for i in $(seq 1 10000); do
        printf '!some/random/file/path-%05d\n' $i
    done >>.git/info/sparse-checkout
    git config core.sparseCheckout true
    time git sparse-checkout reapply

    echo hello >world
    time git add --sparse world nonexistent
    time git rm --cached --sparse world nonexistent
    time git add world nonexistent
    time git rm --cached world nonexistent

This sequence of steps will (1) clone a repo with 2 files, (2) create
another repository in subdirectory 'attempt' that has 1000001 files
(but only two unique files, and only six or so unique trees) in a
branch called 'bomb', (3) check it out, (4) create 10002 patterns for
the sparse-checkout file (only the first 2 of which match anything)
which will leave ~99% of files still present (990001 files checked out
and 10000 files sparse) and turn on sparsity, (5) measure how long it
takes to add and remove a file from the index, both with and without
the --sparse flag, but always listing an extra path that won't match
anything.

The timings I see for the setup steps are:
    4m10.444s  checkout bomb
    1m0.380s   sparse-checkout reapply

And the timings for the add/rm steps are:
    4m43.353s  add --sparse world nonexistent
    9m25.666s  add world nonexistent
    0m0.129s  rm --cached --sparse world nonexistent
    9m23.601s  rm --cached world nonexistent

which shows that 'rm' also has a performance problem without the
'--sparse' flag (which seems like another bug).

Now, if I remove the 'nonexistent' argument from the commands, then
the timings drop to:
    0m0.236s   add --sparse world
    0m0.233s   add world
    0m0.175s   rm --cached --sparse world
    4m43.744s  rm --cached world

So, I can reproduce some slowness.  'rm' without --sparse seems
buggily slow for either set, whereas 'add' is only slow when given a
fake path.  You never mentioned anything about the arguments you were
passing to `git add`, so I don't know whether you are using specific
filenames that just don't exist (like I did above), or globs that
perhaps match some files, or something else.  That might be useful to
know.  But there appears to be something here for both 'add' and 'rm'
that we could look into optimizing.  I don't have time right now.  I'm
not sure if someone else has some time to look into it; if no one else
does, I'll eventually try to get back to it.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
  2022-07-08  1:53                 ` Elijah Newren
@ 2022-07-12 13:00                   ` Dian Xu
  0 siblings, 0 replies; 12+ messages in thread
From: Dian Xu @ 2022-07-12 13:00 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Konstantin Ryabitsev, Victoria Dye, Git Mailing List,
	Derrick Stolee

On Thu, Jul 7, 2022 at 9:53 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, Jul 5, 2022 at 6:08 AM Dian Xu <dianxudev@gmail.com> wrote:
> >
> > Hi Elijah,
>
> Hi Dian,
>
> Please don't top post on this list.  It'd also help to respond to the
> relevant email instead of picking a different email in the thread to
> put your answers in.  Anyway, that aside...
>
> > Please see answers below:
> >
> > 1.  H: 2.27m; S: 7.7k; Total: 2.28m
> >
> > 2.  Sure I will run 'reapply' after the sparse-checkout file has
> > changed. Just curious, do I have to run 'reapply' if 'checkout' is the
> > next immediate cmd? I thought 'checkout' does the updating index as
> > well
> >
> > 3.  I simply added one file only, 'git add' and 'git add --sparse'
> > still hang. Let me know if you need me to send you any debug info from
> > pathspec.c/dir.c
> >
> > 4.  Good to know and we are investigating if we have a way out from --no-cone
> >
> > 5.  I should've been clearer: The experiment done here uses 2.37.0
>
> Thanks for providing these details.  It was enough to at least get me
> started, and from my experiments, it appears the arguments to `git
> add` are important.  In particular, I could not trigger this when
> passing actual filenames that existed.  I could when passing a fake
> filename.  Here's the concrete steps I used to reproduce:
>
>     git clone git@github.com:newren/gvfs-like-git-bomb
>     cd gvfs-like-git-bomb
>
>     git init attempt
>     cd attempt
>     ../make-a-git-bomb.sh
>
>     time git checkout bomb
>
>     echo "/*" >.git/info/sparse-checkout
>     echo '!/bomb/j/j/' >>.git/info/sparse-checkout
>     for i in $(seq 1 10000); do
>         printf '!some/random/file/path-%05d\n' $i
>     done >>.git/info/sparse-checkout
>     git config core.sparseCheckout true
>     time git sparse-checkout reapply
>
>     echo hello >world
>     time git add --sparse world nonexistent
>     time git rm --cached --sparse world nonexistent
>     time git add world nonexistent
>     time git rm --cached world nonexistent
>
> This sequence of steps will (1) clone a repo with 2 files, (2) create
> another repository in subdirectory 'attempt' that has 1000001 files
> (but only two unique files, and only six or so unique trees) in a
> branch called 'bomb', (3) check it out, (4) create 10002 patterns for
> the sparse-checkout file (only the first 2 of which match anything)
> which will leave ~99% of files still present (990001 files checked out
> and 10000 files sparse) and turn on sparsity, (5) measure how long it
> takes to add and remove a file from the index, both with and without
> the --sparse flag, but always listing an extra path that won't match
> anything.
>
> The timings I see for the setup steps are:
>     4m10.444s  checkout bomb
>     1m0.380s   sparse-checkout reapply
>
> And the timings for the add/rm steps are:
>     4m43.353s  add --sparse world nonexistent
>     9m25.666s  add world nonexistent
>     0m0.129s  rm --cached --sparse world nonexistent
>     9m23.601s  rm --cached world nonexistent
>
> which shows that 'rm' also has a performance problem without the
> '--sparse' flag (which seems like another bug).
>
> Now, if I remove the 'nonexistent' argument from the commands, then
> the timings drop to:
>     0m0.236s   add --sparse world
>     0m0.233s   add world
>     0m0.175s   rm --cached --sparse world
>     4m43.744s  rm --cached world
>
> So, I can reproduce some slowness.  'rm' without --sparse seems
> buggily slow for either set, whereas 'add' is only slow when given a
> fake path.  You never mentioned anything about the arguments you were
> passing to `git add`, so I don't know whether you are using specific
> filenames that just don't exist (like I did above), or globs that
> perhaps match some files, or something else.  That might be useful to
> know.  But there appears to be something here for both 'add' and 'rm'
> that we could look into optimizing.  I don't have time right now.  I'm
> not sure if someone else has some time to look into it; if no one else
> does, I'll eventually try to get back to it.

Hi Elijah,

Thank you for sharing the reproduction steps. I believe they represent
our workflow.

We use 'git add <path_to_file>', where path_to_file is an existing
file, which is also within sparse-checkout shape.

Not sure this is related but we also use --reference while setting up the clone.

Dian Xu
Mathworks, Inc
1 Lakeside Campus Drive, Natick, MA 01760
508-647-3583

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-07-12 13:00 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-29 19:11 git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it Dian Xu
2022-06-29 21:53 ` Victoria Dye
2022-06-30  4:06   ` Elijah Newren
2022-06-30  5:06     ` Victoria Dye
2022-07-01  3:42       ` Elijah Newren
2022-07-01 20:24         ` Dian Xu
2022-07-01 21:52           ` Elijah Newren
2022-07-04 19:11             ` Konstantin Ryabitsev
2022-07-05 13:08               ` Dian Xu
2022-07-08  1:53                 ` Elijah Newren
2022-07-12 13:00                   ` Dian Xu
2022-06-30  3:10 ` Elijah Newren

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).