git@vger.kernel.org mailing list mirror (one of many)
From: Dian Xu <dianxudev@gmail.com>
To: Elijah Newren <newren@gmail.com>
Cc: Konstantin Ryabitsev <konstantin@linuxfoundation.org>,
	Victoria Dye <vdye@github.com>,
	Git Mailing List <git@vger.kernel.org>,
	Derrick Stolee <derrickstolee@github.com>
Subject: Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
Date: Tue, 12 Jul 2022 09:00:21 -0400
Message-ID: <CAKSRnEz6xg6jiTqaVtYf+32kEzS0jydiajbTnvK7qy_Po=y=uA@mail.gmail.com>
In-Reply-To: <CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com>

On Thu, Jul 7, 2022 at 9:53 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, Jul 5, 2022 at 6:08 AM Dian Xu <dianxudev@gmail.com> wrote:
> >
> > Hi Elijah,
>
> Hi Dian,
>
> Please don't top post on this list.  It'd also help to respond to the
> relevant email instead of picking a different email in the thread to
> put your answers in.  Anyway, that aside...
>
> > Please see answers below:
> >
> > 1.  H: 2.27m; S: 7.7k; Total: 2.28m
> >
> > 2.  Sure I will run 'reapply' after the sparse-checkout file has
> > changed. Just curious, do I have to run 'reapply' if 'checkout' is the
> > next immediate cmd? I thought 'checkout' does the updating index as
> > well
> >
> > 3.  I simply added one file only, 'git add' and 'git add --sparse'
> > still hang. Let me know if you need me to send you any debug info from
> > pathspec.c/dir.c
> >
> > 4.  Good to know and we are investigating if we have a way out from --no-cone
> >
> > 5.  I should've been clearer: The experiment done here uses 2.37.0
>
> Thanks for providing these details.  It was enough to at least get me
> started, and from my experiments, it appears the arguments to `git
> add` are important.  In particular, I could not trigger this when
> passing actual filenames that existed.  I could when passing a fake
> filename.  Here's the concrete steps I used to reproduce:
>
>     git clone git@github.com:newren/gvfs-like-git-bomb
>     cd gvfs-like-git-bomb
>
>     git init attempt
>     cd attempt
>     ../make-a-git-bomb.sh
>
>     time git checkout bomb
>
>     echo "/*" >.git/info/sparse-checkout
>     echo '!/bomb/j/j/' >>.git/info/sparse-checkout
>     for i in $(seq 1 10000); do
>         printf '!some/random/file/path-%05d\n' $i
>     done >>.git/info/sparse-checkout
>     git config core.sparseCheckout true
>     time git sparse-checkout reapply
>
>     echo hello >world
>     time git add --sparse world nonexistent
>     time git rm --cached --sparse world nonexistent
>     time git add world nonexistent
>     time git rm --cached world nonexistent
>
> This sequence of steps will (1) clone a repo with 2 files, (2) create
> another repository in subdirectory 'attempt' that has 1000001 files
> (but only two unique files, and only six or so unique trees) in a
> branch called 'bomb', (3) check it out, (4) create 10002 patterns for
> the sparse-checkout file (only the first 2 of which match anything)
> which will leave ~99% of files still present (990001 files checked out
> and 10000 files sparse) and turn on sparsity, (5) measure how long it
> takes to add and remove a file from the index, both with and without
> the --sparse flag, but always listing an extra path that won't match
> anything.
>
> The timings I see for the setup steps are:
>     4m10.444s  checkout bomb
>     1m0.380s   sparse-checkout reapply
>
> And the timings for the add/rm steps are:
>     4m43.353s  add --sparse world nonexistent
>     9m25.666s  add world nonexistent
>     0m0.129s  rm --cached --sparse world nonexistent
>     9m23.601s  rm --cached world nonexistent
>
> which shows that 'rm' also has a performance problem without the
> '--sparse' flag (which seems like another bug).
>
> Now, if I remove the 'nonexistent' argument from the commands, then
> the timings drop to:
>     0m0.236s   add --sparse world
>     0m0.233s   add world
>     0m0.175s   rm --cached --sparse world
>     4m43.744s  rm --cached world
>
> So, I can reproduce some slowness.  'rm' without --sparse seems
> buggily slow for either set, whereas 'add' is only slow when given a
> fake path.  You never mentioned anything about the arguments you were
> passing to `git add`, so I don't know whether you are using specific
> filenames that just don't exist (like I did above), or globs that
> perhaps match some files, or something else.  That might be useful to
> know.  But there appears to be something here for both 'add' and 'rm'
> that we could look into optimizing.  I don't have time right now.  I'm
> not sure if someone else has some time to look into it; if no one else
> does, I'll eventually try to get back to it.

Hi Elijah,

Thank you for sharing the reproduction steps. I believe they represent
our workflow.

We use 'git add <path_to_file>', where path_to_file is an existing
file that is also within the sparse-checkout shape.

I'm not sure whether this is related, but we also use --reference when
setting up the clone.
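
In case it helps anyone vary the number of patterns while testing the
existing-file case, here is a minimal sketch that emits a pattern list
shaped like the one in the quoted reproduction. The helper name
make_sparse_patterns is hypothetical (it is not part of git), and the
pattern text is simply copied from the steps above; it is an
illustration under those assumptions, not our actual sparse-checkout
file.

```shell
#!/bin/sh
# Hypothetical helper (not part of git): print a non-cone sparse-checkout
# pattern list shaped like the one in the quoted reproduction steps.
# $1 = number of negated, non-matching patterns to append.
make_sparse_patterns() {
    count=$1
    # The two patterns that actually match something in the 'bomb' branch.
    printf '%s\n' '/*' '!/bomb/j/j/'
    # Filler patterns that match nothing, numbered with five digits.
    i=1
    while [ "$i" -le "$count" ]; do
        printf '!some/random/file/path-%05d\n' "$i"
        i=$((i + 1))
    done
}

# Example: write the list to a scratch file and report how many lines it has.
patterns_file=$(mktemp)
make_sparse_patterns 10000 >"$patterns_file"
wc -l <"$patterns_file"
rm -f "$patterns_file"
```

Inside the 'attempt' repository from the quoted steps, the output could
be redirected to .git/info/sparse-checkout before running 'git
sparse-checkout reapply', and 'time git add <path_to_file>' could then be
run against an existing file with different pattern counts.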

Dian Xu
Mathworks, Inc
1 Lakeside Campus Drive, Natick, MA 01760
508-647-3583
