git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Optimization for "git clean -ffdx"
@ 2022-02-08 22:03 Patrick Marlier
  2022-02-13  3:25 ` Elijah Newren
  0 siblings, 1 reply; 3+ messages in thread
From: Patrick Marlier @ 2022-02-08 22:03 UTC (permalink / raw)
  To: git

Dear git Developers,

In a big repository with a lot of untracked directories and files
(build in tree), "git clean -ffdx" can be optimized. Indeed, "git
clean" goes recursively into all untracked and nested directories to
look for .git files even if "-ff" is specified.
Using breakpoint on stat or "strace -e newfstatat", it is possible to
see the recursing search for ".git" and ".git/HEAD". Also it seems to
traverse the untracked directories a few times, which I am not sure
why.

Using "-ff" should not check for nested .git and no need to recurse if
the directory is already untracked.

Doing the following, it seems to avoid looking for nested .git and all
tests are passing.

@@ -1007,6 +1008,12 @@ int cmd_clean(int argc, const char **argv,
const char *prefix)
                 * the code clearer to exclude it, though.
                 */
                dir.flags |= DIR_KEEP_UNTRACKED_CONTENTS;
+
+               /*
+                * No need to go to deeper in directories if already untracked
+                */
+               if (rm_flags == 0)
+                       dir.flags |= DIR_NO_GITLINKS;
        }

        if (read_cache() < 0)

However reading the documentation of DIR_NO_GITLINKS seems to say that
is not the right fix.

Another thing to note is that it shows "Removing XXX" but it shows it
when the directory is already gone. So we could change to "Removed
XXX" or display the "Removing XXX" before starting to remove the
directory.

Thanks in advance for any fix or help in getting it right.
--
Patrick Marlier

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Optimization for "git clean -ffdx"
  2022-02-08 22:03 Optimization for "git clean -ffdx" Patrick Marlier
@ 2022-02-13  3:25 ` Elijah Newren
  2022-02-15 22:16   ` Patrick Marlier
  0 siblings, 1 reply; 3+ messages in thread
From: Elijah Newren @ 2022-02-13  3:25 UTC (permalink / raw)
  To: Patrick Marlier; +Cc: Git Mailing List

Hi Patrick,

On Wed, Feb 9, 2022 at 2:42 AM Patrick Marlier
<patrick.marlier@gmail.com> wrote:
>
> Dear git Developers,
>
> In a big repository with a lot of untracked directories and files
> (build in tree), "git clean -ffdx" can be optimized. Indeed, "git
> clean" goes recursively into all untracked and nested directories to
> look for .git files even if "-ff" is specified.

Yeah, seems like a waste of work.  Thanks for digging in.

> Using breakpoint on stat or "strace -e newfstatat", it is possible to
> see the recursing search for ".git" and ".git/HEAD". Also it seems to
> traverse the untracked directories a few times, which I am not sure
> why.

There are two steps -- collect the list of things that are removable,
and then a separate step to remove the things that are removable.  It
can be the case that we recurse into the same directory twice, once
for each step.  For the first step, see the fill_directory() call.  If
you left off "-x" or only had one "-f", or your directories had some
tracked files, then fill_directory()'s job is finding out which things
are removable and it'd likely only be a subset of those directories.
For the second step, some of those removable things may be entire
directories.  So when it hits one of those and later calls
remove_dirs() to remove it, it has to recurse again.

> Using "-ff" should not check for nested .git and no need to recurse if
> the directory is already untracked.

Not quite; the -x and lack of -e options is critical here, otherwise
we cannot just nuke the whole directory.  (Also, -d is important, but
that just kind of goes without saying.)

> Doing the following, it seems to avoid looking for nested .git and all
> tests are passing.
>
> @@ -1007,6 +1008,12 @@ int cmd_clean(int argc, const char **argv,
> const char *prefix)
>                  * the code clearer to exclude it, though.
>                  */
>                 dir.flags |= DIR_KEEP_UNTRACKED_CONTENTS;
> +
> +               /*
> +                * No need to go to deeper in directories if already untracked
> +                */
> +               if (rm_flags == 0)
> +                       dir.flags |= DIR_NO_GITLINKS;
>         }
>
>         if (read_cache() < 0)
>
> However reading the documentation of DIR_NO_GITLINKS seems to say that
> is not the right fix.

I think DIR_NO_GITLINKS has an unfortunately poor description.  The
fill_directory() API has many different callers who use it
differently, see commit 8d92fb2927 ("dir: replace exponential
algorithm with a linear one", 2020-04-01) for some description of
this.  I often ran into problems where descriptions of variables and
items in the documentation tended to focus on one of those modes in
such a way as to make the other modes that also called in be
confusing.  I think the author of the current text was just thinking
of one of those other callers, probably 'git status'.  Looking at the
code, it's intended purpose is just what you're using it for.  (Also,
I probably should have documented DIR_SKIP_NESTED_GIT when I added it
in commit 09487f2cba ("clean: avoid removing untracked files in a
nested git repository", 2019-09-17); the fact that it is similar and
undocumented isn't helping matters, although at the time I didn't know
there was documentation for any of the flags given that they were in
an entirely separate file.)

Anyway, documentation issues aside, I think this is the flag you want
to use, but this is not the right place to set it.  I think you should
set it where rm_flags is set to 0 instead.

Further, setting this flag is only solving part of the performance
problem.  We'll still recurse into the directories to look for ignored
files, which should be avoided since we don't need to differentiate.
That basically means that we need to avoid all the code in the current

   if (remove_directories && !ignored_only)

block, whenever (remove_directories && ignored && !exclude_list.nr &&
force > 1).

> Another thing to note is that it shows "Removing XXX" but it shows it
> when the directory is already gone. So we could change to "Removed
> XXX" or display the "Removing XXX" before starting to remove the
> directory.

I commented on Bagas' patch, but I think this is minutiae that won't
be relevant to the end user, and would rather either ignore it or just
move the print statement earlier rather than increasing work for
translators.  That is, unless we tend to use past tense elsewhere in
the UI and we want to make a concerted effort to convert to using past
tense.

> Thanks in advance for any fix or help in getting it right.

You've clearly taken the time to investigate, and you found the basic
solution too.  That's pretty good, especially considering that you're
dealing with the dragons in dir.[ch].  We could potentially have 1-3
patches here:
  * avoiding the unnecessary checks for is_nonbare_repository_dir()
via setting DIR_NO_GITLINKS.  This is basically your change, just
moved slightly.
  * avoiding the unnecessary attempts to differentiate untracked and
ignored via avoiding the "if (remove_directories && !ignored_only)"
code block when appropriate, as highlighted above
  * fixing up the documentation for DIR_NO_GITLINKS and adding some
for DIR_SKIP_NESTED_GIT.

You've already done most the work for the first one, you'd just need
to move the line elsewhere and add a commit message (though adding a
testcase might be nice too).  Are you up for that?  Would you also
like to tackle either of the other two items?  If you combine the
first and the second, you might find be able to generate a good
testcase by running
    GIT_TRACE2_PERF="$(pwd)/.git/trace.output" git clean -ffdx
and then grepping the trace.output file for "directories-visited" and
"paths-visited" and checking that the new flags indeed reduce both of
those numbers.  Anyway, I'll be happy to review your patch(es) and
will take whichever parts you don't want to tackle.

Thanks for digging in, reporting, and helping make Git better!

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Optimization for "git clean -ffdx"
  2022-02-13  3:25 ` Elijah Newren
@ 2022-02-15 22:16   ` Patrick Marlier
  0 siblings, 0 replies; 3+ messages in thread
From: Patrick Marlier @ 2022-02-15 22:16 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List

Hi Elijah,

Thanks for this nice and very detailed email!

On Sun, Feb 13, 2022 at 4:25 AM Elijah Newren <newren@gmail.com> wrote:
> There are two steps -- collect the list of things that are removable,
> and then a separate step to remove the things that are removable.  It
> can be the case that we recurse into the same directory twice, once
> for each step.  For the first step, see the fill_directory() call.  If
> you left off "-x" or only had one "-f", or your directories had some
> tracked files, then fill_directory()'s job is finding out which things
> are removable and it'd likely only be a subset of those directories.
> For the second step, some of those removable things may be entire
> directories.  So when it hits one of those and later calls
> remove_dirs() to remove it, it has to recurse again.

I am not sure I understand perfectly why there are 2 steps but this is detail.

The observation I had in the second step for removal, it is this type
of patterns using strace:
getcwd("/tmp/repo", 129) = 46
newfstatat(AT_FDCWD, "/tmp/repo/a", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/tmp/repo/a/a", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/tmp/repo/a/a/a", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/tmp/repo/a/a/a/a", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
rmdir("a/a/a/a")                        = 0
newfstatat(AT_FDCWD, "a/a/a/b", {st_mode=S_IFDIR|0700, st_size=4096,
...}, AT_SYMLINK_NOFOLLOW) = 0
openat(AT_FDCWD, "a/a/a/b", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 6
newfstatat(6, "", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_EMPTY_PATH) = 0
getdents64(6, 0x55b718d8d650 /* 2 entries */, 32768) = 48
getdents64(6, 0x55b718d8d650 /* 0 entries */, 32768) = 0
close(6)                                = 0
getcwd("/tmp/repo", 129) = 46
newfstatat(AT_FDCWD, "/tmp/repo/a", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/tmp/repo/a/a", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/tmp/repo/a/a/a", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/tmp/repo/a/a/a/b", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
rmdir("a/a/a/b")                        = 0

Here I have a repository with quite some nested and untracked directories.
We can see that many "newfstatat" are repeated for the same path. Also
"getcwd" is repeated many times.
I guess here "strbuf_realpath" would benefit from some caching.


> > Using "-ff" should not check for nested .git and no need to recurse if
> > the directory is already untracked.
>
> Not quite; the -x and lack of -e options is critical here, otherwise
> we cannot just nuke the whole directory.  (Also, -d is important, but
> that just kind of goes without saying.)

Indeed, you are right.


> Further, setting this flag is only solving part of the performance
> problem.  We'll still recurse into the directories to look for ignored
> files, which should be avoided since we don't need to differentiate.
> That basically means that we need to avoid all the code in the current
>
>    if (remove_directories && !ignored_only)
>
> block, whenever (remove_directories && ignored && !exclude_list.nr &&
> force > 1).

Good point!


> > Another thfing to note is that it shows "Removing XXX" but it shows it
> > when the directory is already gone. So we could change to "Removed
> > XXX" or display the "Removing XXX" before starting to remove the
> > directory.
>
> I commented on Bagas' patch, but I think this is minutiae that won't
> be relevant to the end user, and would rather either ignore it or just
> move the print statement earlier rather than increasing work for
> translators.  That is, unless we tend to use past tense elsewhere in
> the UI and we want to make a concerted effort to convert to using past
> tense.

If we want to be correct, then we should then move the "Removing"
printing before but then it makes the "gone" flag a bit useless.
I will defer to you if this is something we want to fix.


> Anyway, I'll be happy to review your patch(es) and
> will take whichever parts you don't want to tackle.

If you don't mind, I will let you adjust the documentation.
Following the patch series for the 2 other changes.

Thanks a lot Elijah. Very appreciated!
--
Patrick Marlier

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2022-02-15 22:17 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-08 22:03 Optimization for "git clean -ffdx" Patrick Marlier
2022-02-13  3:25 ` Elijah Newren
2022-02-15 22:16   ` Patrick Marlier

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).