git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Victoria Dye <vdye@github.com>
To: Dian Xu <dianxudev@gmail.com>,
	git@vger.kernel.org, Derrick Stolee <derrickstolee@github.com>
Subject: Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
Date: Wed, 29 Jun 2022 14:53:37 -0700	[thread overview]
Message-ID: <96034b3f-9811-38c1-7afe-c1c9dd243f0e@github.com> (raw)
In-Reply-To: <CAKSRnEzMaQEv=3mkNWRpy6+-c0rz=R191iqheCQ2ptXFs1CQgw@mail.gmail.com>

Dian Xu wrote:
> Dear Git developers,
> 
> Reporting Issue:
>               'git add' hangs in a large repo which has
> sparse-checkout file with large number of patterns in it
> 
> Found in:
>               Git 2.34.3. Issue occurs after 'audit for interaction
> with sparse-index' was introduced in add.c
> 
> Reproduction steps:
>               1. Clone a repo which has e.g. 2 million plus files
>               2. Enable sparse checkout by: git config core.sparsecheckout true
>               3. Create a .git/info/sparse-checkout file with a large
> number of patterns, e.g. 16k plus lines
>               4. Run 'git add', which will hang> 
> Investigations:
>               1. Stack trace:
>                        add.c: cmd_add
>                   -> add.c: prune_directory
>                   -> pathspec.c: add_pathspec_matches_against_index
>                   -> dir.c: path_in_sparse_checkout_1
>               2. In Git 2.33.3, the loop at pathspec.c line 42 runs
> fast, even when istate->cache_nr is at 2 million
>               3. Since Git 2.34.3, the newly introduced 'audit for
> interaction with sparse-index' (dir.c line 1459:
> path_in_sparse_checkout_1) decides to loop through 2 million files and
> match each one of them against the sparse-checkout patterns
>               4. This hits the O(n^2) problem thus causes 'git add' to
> hang (or ~1.5 hours to finish)

Thanks for the explanation, it helped me narrow down the source to an exact
commit (49fdd51a23 (add: skip tracked paths outside sparse-checkout cone,
2021-09-24)). 

You're correct that the `path_in_sparse_checkout()` check is slow [1].
However, it only runs on files that are not "hidden" with the
`SKIP_WORKTREE` flag. Ideally, if you're using sparse-checkout, this will
only be a small subset of your 2 million files. 

In your repro steps, you're adding patterns to a file then immediately
running `git add`. If that reflects how you're usually working with
sparse-checkout, `SKIP_WORKTREE` likely isn't being applied properly before
the `add`. You can check to see whether file(s) have the flag properly
applied with `git ls-files -t <file or dir names>` - each `SKIP_WORKTREE`
file should have an "S" next to it. "H" indicates that the flag is *not*
applied.

If you see that most of the files that *should* be sparse don't have
`SKIP_WORKTREE` applied, you can run `git sparse-checkout reapply` (make
sure you don't have any modified files outside the patterns you're
applying!). The downside is that it'll be as slow as what you're reporting
for `git add`, but any subsequent `add` (or reset, status, etc.) should be
much faster.

If you do all of that but things are still slow, then the way we check
pathspecs in `git add` would need to change (not trivial, but probably not
impossible either). At a cursory glance, I can think of a few options for
that:

1. Remove the `path_in_sparse_checkout()` check. It's the simplest solution,
   but it means you'd be able to stage files for commit outside the
   sparse-checkout patterns without using the '--sparse' option. I don't
   personally think that's a huge issue, but given that the implementation
   was intentionally changed *away* from this approach, I'd defer to other
   contributors to see if that's an okay change to make.
2. After every call to `ce_path_match()`, check if all pathspecs are marked
   as `seen` and, if so, return early. This would slow down each individual
   file check and wouldn't help you if a pathspec matches nothing, but
   prevents checking more files once all pathspecs are matched.
3. Do some heuristic checks on the pathspecs before checking index entries.
   For example, exact file or directory matches could be searched in the
   index. This would still require falling back on the per-file checks if
   not all pathspecs are matched, but makes some typical use-cases much
   faster.

There are almost certainly other options, and I can dig around `add.c` more
to see if there's anything I'm missing (although I'd love to hear other
ideas too!). 

Hopefully this helps!
- Victoria

[1] `path_in_sparse_checkout()` is significantly faster in cone mode, but
with 16k patterns I'm assuming you're not using cone patterns ;)

> 
> Please help us take a look at this issue and let us know if you need
> more information.
> 
> Thanks,
> 
> Dian Xu
> Mathworks, Inc
> 1 Lakeside Campus Drive, Natick, MA 01760
> 508-647-3583


  reply	other threads:[~2022-06-29 21:53 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-29 19:11 git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it Dian Xu
2022-06-29 21:53 ` Victoria Dye [this message]
2022-06-30  4:06   ` Elijah Newren
2022-06-30  5:06     ` Victoria Dye
2022-07-01  3:42       ` Elijah Newren
2022-07-01 20:24         ` Dian Xu
2022-07-01 21:52           ` Elijah Newren
2022-07-04 19:11             ` Konstantin Ryabitsev
2022-07-05 13:08               ` Dian Xu
2022-07-08  1:53                 ` Elijah Newren
2022-07-12 13:00                   ` Dian Xu
2022-06-30  3:10 ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=96034b3f-9811-38c1-7afe-c1c9dd243f0e@github.com \
    --to=vdye@github.com \
    --cc=derrickstolee@github.com \
    --cc=dianxudev@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).