git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Patrick Marlier <patrick.marlier@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: Optimization for "git clean -ffdx"
Date: Sat, 12 Feb 2022 19:25:18 -0800	[thread overview]
Message-ID: <CABPp-BE2q5_LVVUw=2Wfys0AF350tTMMr43ZLpfsjE4ecb1Tpw@mail.gmail.com> (raw)
In-Reply-To: <CAKQMxzSQRL-Q5daxETF+gYhVScmq_n=r2LJAeEuxpM7=jPajZQ@mail.gmail.com>

Hi Patrick,

On Wed, Feb 9, 2022 at 2:42 AM Patrick Marlier
<patrick.marlier@gmail.com> wrote:
>
> Dear git Developers,
>
> In a big repository with a lot of untracked directories and files
> (build in tree), "git clean -ffdx" can be optimized. Indeed, "git
> clean" goes recursively into all untracked and nested directories to
> look for .git files even if "-ff" is specified.

Yeah, seems like a waste of work.  Thanks for digging in.

> Using breakpoint on stat or "strace -e newfstatat", it is possible to
> see the recursing search for ".git" and ".git/HEAD". Also it seems to
> traverse the untracked directories a few times, which I am not sure
> why.

There are two steps -- collect the list of things that are removable,
and then a separate step to remove the things that are removable.  It
can be the case that we recurse into the same directory twice, once
for each step.  For the first step, see the fill_directory() call.  If
you left off "-x" or only had one "-f", or your directories had some
tracked files, then fill_directory()'s job is finding out which things
are removable and it'd likely only be a subset of those directories.
For the second step, some of those removable things may be entire
directories.  So when it hits one of those and later calls
remove_dirs() to remove it, it has to recurse again.

> Using "-ff" should not check for nested .git and no need to recurse if
> the directory is already untracked.

Not quite; the -x and lack of -e options is critical here, otherwise
we cannot just nuke the whole directory.  (Also, -d is important, but
that just kind of goes without saying.)

> Doing the following, it seems to avoid looking for nested .git and all
> tests are passing.
>
> @@ -1007,6 +1008,12 @@ int cmd_clean(int argc, const char **argv,
> const char *prefix)
>                  * the code clearer to exclude it, though.
>                  */
>                 dir.flags |= DIR_KEEP_UNTRACKED_CONTENTS;
> +
> +               /*
> +                * No need to go to deeper in directories if already untracked
> +                */
> +               if (rm_flags == 0)
> +                       dir.flags |= DIR_NO_GITLINKS;
>         }
>
>         if (read_cache() < 0)
>
> However reading the documentation of DIR_NO_GITLINKS seems to say that
> is not the right fix.

I think DIR_NO_GITLINKS has an unfortunately poor description.  The
fill_directory() API has many different callers who use it
differently, see commit 8d92fb2927 ("dir: replace exponential
algorithm with a linear one", 2020-04-01) for some description of
this.  I often ran into problems where descriptions of variables and
items in the documentation tended to focus on one of those modes in
such a way as to make the other modes that also called in be
confusing.  I think the author of the current text was just thinking
of one of those other callers, probably 'git status'.  Looking at the
code, it's intended purpose is just what you're using it for.  (Also,
I probably should have documented DIR_SKIP_NESTED_GIT when I added it
in commit 09487f2cba ("clean: avoid removing untracked files in a
nested git repository", 2019-09-17); the fact that it is similar and
undocumented isn't helping matters, although at the time I didn't know
there was documentation for any of the flags given that they were in
an entirely separate file.)

Anyway, documentation issues aside, I think this is the flag you want
to use, but this is not the right place to set it.  I think you should
set it where rm_flags is set to 0 instead.

Further, setting this flag is only solving part of the performance
problem.  We'll still recurse into the directories to look for ignored
files, which should be avoided since we don't need to differentiate.
That basically means that we need to avoid all the code in the current

   if (remove_directories && !ignored_only)

block, whenever (remove_directories && ignored && !exclude_list.nr &&
force > 1).

> Another thing to note is that it shows "Removing XXX" but it shows it
> when the directory is already gone. So we could change to "Removed
> XXX" or display the "Removing XXX" before starting to remove the
> directory.

I commented on Bagas' patch, but I think this is minutiae that won't
be relevant to the end user, and would rather either ignore it or just
move the print statement earlier rather than increasing work for
translators.  That is, unless we tend to use past tense elsewhere in
the UI and we want to make a concerted effort to convert to using past
tense.

> Thanks in advance for any fix or help in getting it right.

You've clearly taken the time to investigate, and you found the basic
solution too.  That's pretty good, especially considering that you're
dealing with the dragons in dir.[ch].  We could potentially have 1-3
patches here:
  * avoiding the unnecessary checks for is_nonbare_repository_dir()
via setting DIR_NO_GITLINKS.  This is basically your change, just
moved slightly.
  * avoiding the unnecessary attempts to differentiate untracked and
ignored via avoiding the "if (remove_directories && !ignored_only)"
code block when appropriate, as highlighted above
  * fixing up the documentation for DIR_NO_GITLINKS and adding some
for DIR_SKIP_NESTED_GIT.

You've already done most the work for the first one, you'd just need
to move the line elsewhere and add a commit message (though adding a
testcase might be nice too).  Are you up for that?  Would you also
like to tackle either of the other two items?  If you combine the
first and the second, you might find be able to generate a good
testcase by running
    GIT_TRACE2_PERF="$(pwd)/.git/trace.output" git clean -ffdx
and then grepping the trace.output file for "directories-visited" and
"paths-visited" and checking that the new flags indeed reduce both of
those numbers.  Anyway, I'll be happy to review your patch(es) and
will take whichever parts you don't want to tackle.

Thanks for digging in, reporting, and helping make Git better!

  reply	other threads:[~2022-02-13  3:25 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-08 22:03 Optimization for "git clean -ffdx" Patrick Marlier
2022-02-13  3:25 ` Elijah Newren [this message]
2022-02-15 22:16   ` Patrick Marlier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CABPp-BE2q5_LVVUw=2Wfys0AF350tTMMr43ZLpfsjE4ecb1Tpw@mail.gmail.com' \
    --to=newren@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=patrick.marlier@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).