git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* Git clean enumerates ignored directories (since 2.27)
@ 2021-04-14 17:17 Jason Gore
  2021-04-14 22:56 ` brian m. carlson
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gore @ 2021-04-14 17:17 UTC (permalink / raw)
  To: git

I'm unsure of whether this issue is Windows-specific but at the very least I suspect this is a change of behavior that would affect all platforms.

In July 2020 I reported an issue I initially encountered in 2.27 that still seems to be present in 2.31.1.windows.1: https://github.com/git-for-windows/git/issues/2732

Since I haven't seen a response on the issue and it still occurs I decided to email this list as well.

Package managers such as PNPM tend to create very long filenames due to symlinking. Having git not enumerate these ignored directories is essential to leveraging any sort of clean behavior (both for performance and for warning output) as we did before version 2.27. Our repo clean functions went from taking a few seconds to over 10 minutes due to this change in behavior.

I've also had repeated problems sending you this email as your email server seems to reject any email with a URL in it (per the github link above.)

Thanks,
Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Git clean enumerates ignored directories (since 2.27)
  2021-04-14 17:17 Git clean enumerates ignored directories (since 2.27) Jason Gore
@ 2021-04-14 22:56 ` brian m. carlson
  2021-04-15  8:51   ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: brian m. carlson @ 2021-04-14 22:56 UTC (permalink / raw)
  To: Jason Gore; +Cc: git, Elijah Newren

[-- Attachment #1: Type: text/plain, Size: 2380 bytes --]

On 2021-04-14 at 17:17:46, Jason Gore wrote:
> I'm unsure of whether this issue is Windows-specific but at the very least I suspect this is a change of behavior that would affect all platforms.
> 
> In July 2020 I reported an issue I initially encountered in 2.27 that still seems to be present in 2.31.1.windows.1: https://github.com/git-for-windows/git/issues/2732
> 
> Since I haven't seen a response on the issue and it still occurs I decided to email this list as well.
> 
> Package managers such as PNPM tend to create very long filenames due to symlinking. Having git not enumerate these ignored directories is essential to leveraging any sort of clean behavior (both for performance and for warning output) as we did before version 2.27. Our repo clean functions went from taking a few seconds to over 10 minutes due to this change in behavior.

It looks like there were some changes to the dir.c code between 2.26 and
2.27 which may have impacted clean.  I'm CCing Elijah Newren who wrote
those patches and in any event would be more familiar with that code
than I am.

I've taken the liberty of pulling in your testcase from that issue and
converting it to work in POSIX sh on a Linux machine.  If the path is
not sufficiently long for your needs, you can bump the value of the variable
"last" below.

----
#!/bin/sh

git init test-repo
cd test-repo
longname="directory"
touch "$longname.txt"
last=400
for x in $(seq 1 $last); do
  mkdir "x$longname$x"
  mv directory* "x$longname$x"
  mv "x$longname$x" "$longname$x"
done
git clean -ffdxn -e directory$last
----

When it fails, it will complain that it wasn't able to open the
directory.  It still exits zero, however.

I haven't bisected this, so I don't know if those patches are related to
the problem or not.  I'm a little short on time today to investigate
further, but hopefully this can get someone on the right path with a
modified version and git bisect run if nothing else.

> I've also had repeated problems sending you this email as your email server seems to reject any email with a URL in it (per the github link above.)

I don't think that's the problem, since this email came through just
fine.  It's more likely that your email was being bounced for HTML,
which isn't allowed on the list.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Git clean enumerates ignored directories (since 2.27)
  2021-04-14 22:56 ` brian m. carlson
@ 2021-04-15  8:51   ` Jeff King
  2021-04-22 17:18     ` Jason Gore
  2021-05-07  2:31     ` Elijah Newren
  0 siblings, 2 replies; 10+ messages in thread
From: Jeff King @ 2021-04-15  8:51 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Jason Gore, git, Elijah Newren

On Wed, Apr 14, 2021 at 10:56:14PM +0000, brian m. carlson wrote:

> ----
> #!/bin/sh
> 
> git init test-repo
> cd test-repo
> longname="directory"
> touch "$longname.txt"
> last=400
> for x in $(seq 1 $last); do
>   mkdir "x$longname$x"
>   mv directory* "x$longname$x"
>   mv "x$longname$x" "$longname$x"
> done
> git clean -ffdxn -e directory$last
> ----
> 
> When it fails, it will complain that it wasn't able to open the
> directory.  It still exits zero, however.
> 
> I haven't bisected this, so I don't know if those patches are related to
> the problem or not.  I'm a little short on time today to investigate
> further, but hopefully this can get someone on the right path with a
> modified version and git bisect run if nothing else.

It bisects to 8d92fb2927 (dir: replace exponential algorithm with a
linear one, 2020-04-01). I won't pretend to understand everything going
on in that commit, though.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: Git clean enumerates ignored directories (since 2.27)
  2021-04-15  8:51   ` Jeff King
@ 2021-04-22 17:18     ` Jason Gore
  2021-04-22 17:24       ` Elijah Newren
  2021-05-07  4:08       ` Elijah Newren
  2021-05-07  2:31     ` Elijah Newren
  1 sibling, 2 replies; 10+ messages in thread
From: Jason Gore @ 2021-04-22 17:18 UTC (permalink / raw)
  To: peff, sandals; +Cc: git, newren

Sorry to add noise to this thread but since the github issue was closed I wanted to make sure I'm not missing any updates here. Would any updates come through on this thread? Thanks!

-----Original Message-----
From: Jeff King <peff@peff.net> 
Sent: Thursday, April 15, 2021 1:52 AM
To: brian m. carlson <sandals@crustytoothpaste.net>
Cc: Jason Gore <Jason.Gore@microsoft.com>; git@vger.kernel.org; Elijah Newren <newren@gmail.com>
Subject: Re: Git clean enumerates ignored directories (since 2.27)

On Wed, Apr 14, 2021 at 10:56:14PM +0000, brian m. carlson wrote:

> ----
> #!/bin/sh
> 
> git init test-repo
> cd test-repo
> longname="directory"
> touch "$longname.txt"
> last=400
> for x in $(seq 1 $last); do
>   mkdir "x$longname$x"
>   mv directory* "x$longname$x"
>   mv "x$longname$x" "$longname$x"
> done
> git clean -ffdxn -e directory$last
> ----
> 
> When it fails, it will complain that it wasn't able to open the 
> directory.  It still exits zero, however.
> 
> I haven't bisected this, so I don't know if those patches are related 
> to the problem or not.  I'm a little short on time today to 
> investigate further, but hopefully this can get someone on the right 
> path with a modified version and git bisect run if nothing else.

It bisects to 8d92fb2927 (dir: replace exponential algorithm with a linear one, 2020-04-01). I won't pretend to understand everything going on in that commit, though.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Git clean enumerates ignored directories (since 2.27)
  2021-04-22 17:18     ` Jason Gore
@ 2021-04-22 17:24       ` Elijah Newren
  2021-05-07  4:08       ` Elijah Newren
  1 sibling, 0 replies; 10+ messages in thread
From: Elijah Newren @ 2021-04-22 17:24 UTC (permalink / raw)
  To: Jason Gore; +Cc: peff, sandals, git

Yes, sorry, I've just been swamped.  It's still on my list to look at.

On Thu, Apr 22, 2021 at 10:18 AM Jason Gore <Jason.Gore@microsoft.com> wrote:
>
> Sorry to add noise to this thread but since the github issue was closed I wanted to make sure I'm not missing any updates here. Would any updates come through on this thread? Thanks!
>
> -----Original Message-----
> From: Jeff King <peff@peff.net>
> Sent: Thursday, April 15, 2021 1:52 AM
> To: brian m. carlson <sandals@crustytoothpaste.net>
> Cc: Jason Gore <Jason.Gore@microsoft.com>; git@vger.kernel.org; Elijah Newren <newren@gmail.com>
> Subject: Re: Git clean enumerates ignored directories (since 2.27)
>
> On Wed, Apr 14, 2021 at 10:56:14PM +0000, brian m. carlson wrote:
>
> > ----
> > #!/bin/sh
> >
> > git init test-repo
> > cd test-repo
> > longname="directory"
> > touch "$longname.txt"
> > last=400
> > for x in $(seq 1 $last); do
> >   mkdir "x$longname$x"
> >   mv directory* "x$longname$x"
> >   mv "x$longname$x" "$longname$x"
> > done
> > git clean -ffdxn -e directory$last
> > ----
> >
> > When it fails, it will complain that it wasn't able to open the
> > directory.  It still exits zero, however.
> >
> > I haven't bisected this, so I don't know if those patches are related
> > to the problem or not.  I'm a little short on time today to
> > investigate further, but hopefully this can get someone on the right
> > path with a modified version and git bisect run if nothing else.
>
> It bisects to 8d92fb2927 (dir: replace exponential algorithm with a linear one, 2020-04-01). I won't pretend to understand everything going on in that commit, though.
>
> -Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Git clean enumerates ignored directories (since 2.27)
  2021-04-15  8:51   ` Jeff King
  2021-04-22 17:18     ` Jason Gore
@ 2021-05-07  2:31     ` Elijah Newren
  2021-05-07  3:25       ` Jeff King
  1 sibling, 1 reply; 10+ messages in thread
From: Elijah Newren @ 2021-05-07  2:31 UTC (permalink / raw)
  To: Jeff King; +Cc: brian m. carlson, Jason Gore, git

On Thu, Apr 15, 2021 at 1:51 AM Jeff King <peff@peff.net> wrote:
>
> On Wed, Apr 14, 2021 at 10:56:14PM +0000, brian m. carlson wrote:
>
> > ----
> > #!/bin/sh
> >
> > git init test-repo
> > cd test-repo
> > longname="directory"
> > touch "$longname.txt"
> > last=400
> > for x in $(seq 1 $last); do
> >   mkdir "x$longname$x"
> >   mv directory* "x$longname$x"
> >   mv "x$longname$x" "$longname$x"
> > done
> > git clean -ffdxn -e directory$last
> > ----
> >
> > When it fails, it will complain that it wasn't able to open the
> > directory.  It still exits zero, however.
> >
> > I haven't bisected this, so I don't know if those patches are related to
> > the problem or not.  I'm a little short on time today to investigate
> > further, but hopefully this can get someone on the right path with a
> > modified version and git bisect run if nothing else.
>
> It bisects to 8d92fb2927 (dir: replace exponential algorithm with a
> linear one, 2020-04-01). I won't pretend to understand everything going
> on in that commit, though.

I've got a patch series to fix this and another traversal problem, and
the tests pass on all platforms.  Wahoo!

BUT!

On alpine linux-musl, I get an "error: Tests passed but test cleanup
failed; aborting", which makes it report as a failed build.  I'm not
sure how to fix it and am asking for ideas.

Apparently the deeply nested directory hierarchy cannot be removed on
that platform with a simple "rm -rf $dirname".  It throws a "rm: can't
stat '/__w/git/git/t/trash
directory.t7300-clean/avoid-traversing-deep-hierarchy/directory400/directory399/directory398/.....(you
get the idea)....': Filename too long" error message.[1]

Adding a "test_when_finished find directory400 -delete" also gives a
"Filename too long" message followed by a lot of "Directory not empty"
messages.[2]

Anyone have any bright ideas about how to tweak this test?  See [3]
for the current incarnation of the code, which was basically taken
from Brian's sample testcase.

[1] https://github.com/git/git/pull/1020/checks?check_run_id=2523414561
[2] https://github.com/git/git/pull/1020/checks?check_run_id=2523594095
[3] https://github.com/git/git/pull/1020

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Git clean enumerates ignored directories (since 2.27)
  2021-05-07  2:31     ` Elijah Newren
@ 2021-05-07  3:25       ` Jeff King
  2021-05-07  3:43         ` Jeff King
  2021-05-07  3:44         ` Elijah Newren
  0 siblings, 2 replies; 10+ messages in thread
From: Jeff King @ 2021-05-07  3:25 UTC (permalink / raw)
  To: Elijah Newren; +Cc: brian m. carlson, Jason Gore, git

On Thu, May 06, 2021 at 07:31:40PM -0700, Elijah Newren wrote:

> On alpine linux-musl, I get an "error: Tests passed but test cleanup
> failed; aborting", which makes it report as a failed build.  I'm not
> sure how to fix it and am asking for ideas.
> 
> Apparently the deeply nested directory hierarchy cannot be removed on
> that platform with a simple "rm -rf $dirname".  It throws a "rm: can't
> stat '/__w/git/git/t/trash
> directory.t7300-clean/avoid-traversing-deep-hierarchy/directory400/directory399/directory398/.....(you
> get the idea)....': Filename too long" error message.[1]
> 
> Adding a "test_when_finished find directory400 -delete" also gives a
> "Filename too long" message followed by a lot of "Directory not empty"
> messages.[2]
> 
> Anyone have any bright ideas about how to tweak this test?  See [3]
> for the current incarnation of the code, which was basically taken
> from Brian's sample testcase.

My guess is that that version of "rm" is trying to feed the entire
pathname directly to unlink() and rmdir(), and it exceeds PATH_MAX.

Even with GNU tools, for instance, I get:

  $ rmdir $(find avoid-traversing-deep-hierarchy -type d | tail -1)
  rmdir: failed to remove 'avoid-traversing-deep-hierarchy/directory400/
    [...and so on...]/directory1': File name too long

because it feeds the whole to a single rmdir() call. Whereas stracing
GNU "rm -rf", it uses unlinkat() and openat() to delete each level
individually (probably to avoid this exact problem).

Is the actual path length important, or just the depth? If the latter,
then calling it "d400/d399/.../d2/d1" would likely help, as that's less
than 2000 bytes.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Git clean enumerates ignored directories (since 2.27)
  2021-05-07  3:25       ` Jeff King
@ 2021-05-07  3:43         ` Jeff King
  2021-05-07  3:44         ` Elijah Newren
  1 sibling, 0 replies; 10+ messages in thread
From: Jeff King @ 2021-05-07  3:43 UTC (permalink / raw)
  To: Elijah Newren; +Cc: brian m. carlson, Jason Gore, git

On Thu, May 06, 2021 at 11:25:54PM -0400, Jeff King wrote:

> > Anyone have any bright ideas about how to tweak this test?  See [3]
> > for the current incarnation of the code, which was basically taken
> > from Brian's sample testcase.
> 
> My guess is that that version of "rm" is trying to feed the entire
> pathname directly to unlink() and rmdir(), and it exceeds PATH_MAX.
> 
> Even with GNU tools, for instance, I get:
> 
>   $ rmdir $(find avoid-traversing-deep-hierarchy -type d | tail -1)
>   rmdir: failed to remove 'avoid-traversing-deep-hierarchy/directory400/
>     [...and so on...]/directory1': File name too long
> 
> because it feeds the whole to a single rmdir() call. Whereas stracing
> GNU "rm -rf", it uses unlinkat() and openat() to delete each level
> individually (probably to avoid this exact problem).
> 
> Is the actual path length important, or just the depth? If the latter,
> then calling it "d400/d399/.../d2/d1" would likely help, as that's less
> than 2000 bytes.

Reading your commit messages a little more carefully, I'm still not
quite sure of the answer to that question.

But if you really do need the long length, a workaround is to avoid
dealing with the full path all at once. E.g., make two strings, one with
"directory400/.../directory200", and one with "directory199/.../directory1".

And then you can probably:

  (cd $one && rm -rf directory199) &&
  rm -rf directory400

to do it in two parts, with each "rm" seeing only a half-length path.

I notice you also run O(n) "mkdir" and "mv" calls to create the
directory. I "mkdir -p" would be much more efficient, though it might
run afoul of similar path-length problems (especially on non-GNU
systems).

It might be worth turning to perl here:

  perl -e '
    for (reverse 1..400) {
      my $d = "directory$_";
      mkdir($d) and chdir($d) or die "mkdir($d): $!";
    }
    open(my $fh, ">", "some-file");
  '

and you could probably do something similar to remove it. Sadly, I don't
think using File::Path makes building it easier, because it hits the
same path limit (it builds up the string internally). However, its
removal does work (and is in the set of core modules that we can count
on always being available):

  perl -MFile::Path=remove_tree -e 'remove_tree("directory400")'

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Git clean enumerates ignored directories (since 2.27)
  2021-05-07  3:25       ` Jeff King
  2021-05-07  3:43         ` Jeff King
@ 2021-05-07  3:44         ` Elijah Newren
  1 sibling, 0 replies; 10+ messages in thread
From: Elijah Newren @ 2021-05-07  3:44 UTC (permalink / raw)
  To: Jeff King; +Cc: brian m. carlson, Jason Gore, git

On Thu, May 6, 2021 at 8:25 PM Jeff King <peff@peff.net> wrote:
>
> On Thu, May 06, 2021 at 07:31:40PM -0700, Elijah Newren wrote:
>
> > On alpine linux-musl, I get an "error: Tests passed but test cleanup
> > failed; aborting", which makes it report as a failed build.  I'm not
> > sure how to fix it and am asking for ideas.
> >
> > Apparently the deeply nested directory hierarchy cannot be removed on
> > that platform with a simple "rm -rf $dirname".  It throws a "rm: can't
> > stat '/__w/git/git/t/trash
> > directory.t7300-clean/avoid-traversing-deep-hierarchy/directory400/directory399/directory398/.....(you
> > get the idea)....': Filename too long" error message.[1]
> >
> > Adding a "test_when_finished find directory400 -delete" also gives a
> > "Filename too long" message followed by a lot of "Directory not empty"
> > messages.[2]
> >
> > Anyone have any bright ideas about how to tweak this test?  See [3]
> > for the current incarnation of the code, which was basically taken
> > from Brian's sample testcase.
>
> My guess is that that version of "rm" is trying to feed the entire
> pathname directly to unlink() and rmdir(), and it exceeds PATH_MAX.
>
> Even with GNU tools, for instance, I get:
>
>   $ rmdir $(find avoid-traversing-deep-hierarchy -type d | tail -1)
>   rmdir: failed to remove 'avoid-traversing-deep-hierarchy/directory400/
>     [...and so on...]/directory1': File name too long
>
> because it feeds the whole to a single rmdir() call. Whereas stracing
> GNU "rm -rf", it uses unlinkat() and openat() to delete each level
> individually (probably to avoid this exact problem).
>
> Is the actual path length important, or just the depth? If the latter,
> then calling it "d400/d399/.../d2/d1" would likely help, as that's less
> than 2000 bytes.

I needed some kind of way to notice and test that we had recursed into
the directory.  Excessively long paths trigger failures in open()
calls in git itself, which was the only marker I knew of.  So, I was
essentially depending on the long paths.  (The actual depth was not
important.). If there were a different way to determine whether we
recursed into the first level subdirectory, that'd be nicer.  I
thought about adding some trace2_region_enter/trace2_region_leave
calls and letting them nest up to whatever depth, though that seemed a
bit ugly.  I also thought about trying to determine the maximum
recursion level and seeing if that could be emitted via some kind of
trace2 magic, but that sounded like another rabbit hole.  In the end,
I gave up and used Brian's modification of Jason's testcase.

However, it looks like manually unstacking and deleting the directory
is going to work.  Incredibly ugly and slow, but adding the following
at the end of the testcase makes it work (at least on that platform,
still awaiting results on others):

    # alpine-linux-musl fails to "rm -rf" a directory with such
    # a deeply nested hierarchy.  Help it out by deleting the
    # leading directories ourselves.  Super slow, but, what else
    # can we do?  Without this, we will hit a
    #     error: Tests passed but test cleanup failed; aborting
    # so do this ugly manual cleanup...
    while test ! -f directory-random-file.txt; do
      name=$(ls -d directory*) &&
      mv $name/* . &&
      rmdir $name
    done

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Git clean enumerates ignored directories (since 2.27)
  2021-04-22 17:18     ` Jason Gore
  2021-04-22 17:24       ` Elijah Newren
@ 2021-05-07  4:08       ` Elijah Newren
  1 sibling, 0 replies; 10+ messages in thread
From: Elijah Newren @ 2021-05-07  4:08 UTC (permalink / raw)
  To: Jason Gore; +Cc: peff, sandals, git

On Thu, Apr 22, 2021 at 10:18 AM Jason Gore <Jason.Gore@microsoft.com> wrote:
>
> Sorry to add noise to this thread but since the github issue was closed I wanted to make sure I'm not missing any updates here. Would any updates come through on this thread? Thanks!

I have posted some proposed fixes at
https://lore.kernel.org/git/pull.1020.git.git.1620360300.gitgitgadget@gmail.com/T/#t
(particularly patches 1 & 3); feel free to watch there, or over at
https://github.com/git/git/pull/1020.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-05-07  4:08 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-14 17:17 Git clean enumerates ignored directories (since 2.27) Jason Gore
2021-04-14 22:56 ` brian m. carlson
2021-04-15  8:51   ` Jeff King
2021-04-22 17:18     ` Jason Gore
2021-04-22 17:24       ` Elijah Newren
2021-05-07  4:08       ` Elijah Newren
2021-05-07  2:31     ` Elijah Newren
2021-05-07  3:25       ` Jeff King
2021-05-07  3:43         ` Jeff King
2021-05-07  3:44         ` Elijah Newren

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).