git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] add--interactive: do not expand pathspecs with ls-files
@ 2017-03-14 16:30 Jeff King
  2017-03-14 16:31 ` Jeff King
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Jeff King @ 2017-03-14 16:30 UTC (permalink / raw)
  To: git; +Cc: Wincent Colaiuta, Mislav Marohnić

When we want to get the list of modified files, we first
expand any user-provided pathspecs with "ls-files", and then
feed the resulting list of paths as arguments to
"diff-index" and "diff-files". If your pathspec expands into
a large number of paths, you may run into one of two
problems:

  1. The OS may complain about the size of the argument
     list, and refuse to run. For example:

       $ (ulimit -s 128 && git add -p drivers)
       Can't exec "git": Argument list too long at .../git-add--interactive line 177.
       Died at .../git-add--interactive line 177.

     That's on the linux.git repository, which has about 20K
     files in the "drivers" directory (none of them modified
     in this case). The "ulimit -s" trick is necessary to
     show the problem on Linux even for such a gigantic set
     of paths. Other operating systems have much smaller
     limits (e.g., a real-world case was seen with only 5K
     files on OS X).

  2. Even when it does work, it's really slow. The pathspec
     code is not optimized for huge numbers of paths. Here's
     the same case without the ulimit:

       $ time git add -p drivers
       No changes.

       real	0m16.559s
       user	0m53.140s
       sys	0m0.220s

We can improve this by skipping "ls-files" completely, and
just feeding the original pathspecs to the diff commands.
This solution was discussed in 2010:

  http://public-inbox.org/git/20100105041438.GB12574@coredump.intra.peff.net/

but at the time the diff code's pathspecs were more
primitive than those used by ls-files (e.g., they did not
support globs). Making the change would have caused a
user-visible regression, so we didn't.

Since then, the pathspec code has been unified, and the diff
commands natively understand pathspecs like '*.c'.

This patch implements that solution. That skips the
argument-list limits, and the result runs much faster:

  $ time git add -p drivers
  No changes.

  real	0m0.149s
  user	0m0.116s
  sys	0m0.080s

There are two new tests. The first just exercises the
globbing behavior to confirm that we are not causing a
regression there. The second checks the actual argument
behavior using GIT_TRACE. We _could_ do it with the "ulimit
-s" trick, as above. But that would mean the test could only
run where "ulimit -s" works. And tests of that sort are
expensive, because we have to come up with enough files to
actually bust the limit (we can't just shrink the "128" down
infinitely, since it is also the in-program stack size).

Finally, two caveats and possibilities for future work:

  a. This fixes one argument-list expansion, but there may
     be others. In fact, it's very likely that if you run
     "git add -i" and select a large number of modified
     files that the script would try to feed them all to a
     single git command.

     In practice this is probably fine. The real issue here
     is that the argument list was growing with the _total_
     number of files, not the number of modified or selected
     files.

  b. If the repository contains filenames with literal wildcard
     characters (e.g., "foo*"), the original code expanded
     them via "ls-files" and then fed those wildcard names
     to "diff-index", which would have treated them as
     wildcards. This was a bug, which is now fixed (though
     unless you really go through some contortions with
     ":(literal)", it's likely that your original pathspec
     would match whatever the accidentally-expanded wildcard
     would anyway).

     So this takes us one step closer to working correctly
     with files whose names contain wildcard characters, but
     it's likely that others remain (e.g., if "git add -i"
     feeds the selected paths to "git add").

Reported-by: Wincent Colaiuta <win@wincent.com>
Reported-by: Mislav Marohnić <mislav.marohnic@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
---
I credited Wincent for reporting it in 2010. Hopefully his email still
works. :)

 git-add--interactive.perl  | 13 ++-----------
 t/t3701-add-interactive.sh | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/git-add--interactive.perl b/git-add--interactive.perl
index f5c816e27..77b4ed53a 100755
--- a/git-add--interactive.perl
+++ b/git-add--interactive.perl
@@ -275,20 +275,11 @@ sub list_modified {
 	my ($only) = @_;
 	my (%data, @return);
 	my ($add, $del, $adddel, $file);
-	my @tracked = ();
-
-	if (@ARGV) {
-		@tracked = map {
-			chomp $_;
-			unquote_path($_);
-		} run_cmd_pipe(qw(git ls-files --), @ARGV);
-		return if (!@tracked);
-	}
 
 	my $reference = get_diff_reference($patch_mode_revision);
 	for (run_cmd_pipe(qw(git diff-index --cached
 			     --numstat --summary), $reference,
-			     '--', @tracked)) {
+			     '--', @ARGV)) {
 		if (($add, $del, $file) =
 		    /^([-\d]+)	([-\d]+)	(.*)/) {
 			my ($change, $bin);
@@ -313,7 +304,7 @@ sub list_modified {
 		}
 	}
 
-	for (run_cmd_pipe(qw(git diff-files --numstat --summary --raw --), @tracked)) {
+	for (run_cmd_pipe(qw(git diff-files --numstat --summary --raw --), @ARGV)) {
 		if (($add, $del, $file) =
 		    /^([-\d]+)	([-\d]+)	(.*)/) {
 			$file = unquote_path($file);
diff --git a/t/t3701-add-interactive.sh b/t/t3701-add-interactive.sh
index aaa258daa..f9528fa00 100755
--- a/t/t3701-add-interactive.sh
+++ b/t/t3701-add-interactive.sh
@@ -412,4 +412,47 @@ test_expect_success 'patch-mode via -i prompts for files' '
 	test_cmp expect actual
 '
 
+test_expect_success 'add -p handles globs' '
+	git reset --hard &&
+
+	mkdir -p subdir &&
+	echo base >one.c &&
+	echo base >subdir/two.c &&
+	git add "*.c" &&
+	git commit -m base &&
+
+	echo change >one.c &&
+	echo change >subdir/two.c &&
+	git add -p "*.c" <<-\EOF &&
+	y
+	y
+	EOF
+
+	cat >expect <<-\EOF &&
+	one.c
+	subdir/two.c
+	EOF
+	git diff --cached --name-only >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'add -p does not expand argument lists' '
+	git reset --hard &&
+
+	echo content >not-changed &&
+	git add not-changed &&
+	git commit -m "add not-changed file" &&
+
+	echo change >file &&
+	GIT_TRACE=$(pwd)/trace.out git add -p . <<-\EOF &&
+	y
+	EOF
+
+	# we know that "file" must be mentioned since we actually
+	# update it, but we want to be sure that our "." pathspec
+	# was not expanded into the argument list of any command.
+	# So look only for "not-changed".
+	! grep not-changed trace.out
+'
+
 test_done
-- 
2.12.0.550.g163ff0384.dirty

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] add--interactive: do not expand pathspecs with ls-files
  2017-03-14 16:30 [PATCH] add--interactive: do not expand pathspecs with ls-files Jeff King
@ 2017-03-14 16:31 ` Jeff King
  2017-03-14 17:25 ` Brandon Williams
  2017-03-14 22:03 ` Junio C Hamano
  2 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2017-03-14 16:31 UTC (permalink / raw)
  To: git; +Cc: Wincent Colaiuta, Mislav Marohnić

On Tue, Mar 14, 2017 at 12:30:24PM -0400, Jeff King wrote:

>   2. Even when it does work, it's really slow. The pathspec
>      code is not optimized for huge numbers of paths. Here's
>      the same case without the ulimit:
> 
>        $ time git add -p drivers
>        No changes.
> 
>        real	0m16.559s
>        user	0m53.140s
>        sys	0m0.220s

By the way, I wondered how we managed to consume more CPU than
wall-clock time. I think the answer is that diff-files does the pathspec
matching during the threaded preload stage.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] add--interactive: do not expand pathspecs with ls-files
  2017-03-14 16:30 [PATCH] add--interactive: do not expand pathspecs with ls-files Jeff King
  2017-03-14 16:31 ` Jeff King
@ 2017-03-14 17:25 ` Brandon Williams
  2017-03-14 17:31   ` Jeff King
  2017-03-14 22:03 ` Junio C Hamano
  2 siblings, 1 reply; 6+ messages in thread
From: Brandon Williams @ 2017-03-14 17:25 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Wincent Colaiuta, Mislav Marohnić

On 03/14, Jeff King wrote:
>   b. If the repository contains filenames with literal wildcard
>      characters (e.g., "foo*"), the original code expanded
>      them via "ls-files" and then fed those wildcard names
>      to "diff-index", which would have treated them as
>      wildcards. This was a bug, which is now fixed (though
>      unless you really go through some contortions with
>      ":(literal)", it's likely that your original pathspec
>      would match whatever the accidentally-expanded wildcard
>      would anyway).
> 
>      So this takes us one step closer to working correctly
>      with files whose names contain wildcard characters, but
>      it's likely that others remain (e.g., if "git add -i"
>      feeds the selected paths to "git add").

It definitely makes things more difficult when people use wildcard
characters in filenames.  Is there any reason people use wildcards as
literal characters in filenames other than them not knowing any better?

> diff --git a/git-add--interactive.perl b/git-add--interactive.perl
> index f5c816e27..77b4ed53a 100755
> --- a/git-add--interactive.perl
> +++ b/git-add--interactive.perl
> @@ -275,20 +275,11 @@ sub list_modified {
>  	my ($only) = @_;
>  	my (%data, @return);
>  	my ($add, $del, $adddel, $file);
> -	my @tracked = ();
> -
> -	if (@ARGV) {
> -		@tracked = map {
> -			chomp $_;
> -			unquote_path($_);
> -		} run_cmd_pipe(qw(git ls-files --), @ARGV);
> -		return if (!@tracked);
> -	}
>  
>  	my $reference = get_diff_reference($patch_mode_revision);
>  	for (run_cmd_pipe(qw(git diff-index --cached
>  			     --numstat --summary), $reference,
> -			     '--', @tracked)) {
> +			     '--', @ARGV)) {
>  		if (($add, $del, $file) =
>  		    /^([-\d]+)	([-\d]+)	(.*)/) {
>  			my ($change, $bin);
> @@ -313,7 +304,7 @@ sub list_modified {
>  		}
>  	}
>  
> -	for (run_cmd_pipe(qw(git diff-files --numstat --summary --raw --), @tracked)) {
> +	for (run_cmd_pipe(qw(git diff-files --numstat --summary --raw --), @ARGV)) {

Such a small and easy change for that big of a perf win.

>  		if (($add, $del, $file) =
>  		    /^([-\d]+)	([-\d]+)	(.*)/) {
>  			$file = unquote_path($file);
> diff --git a/t/t3701-add-interactive.sh b/t/t3701-add-interactive.sh
> index aaa258daa..f9528fa00 100755
> --- a/t/t3701-add-interactive.sh
> +++ b/t/t3701-add-interactive.sh
> @@ -412,4 +412,47 @@ test_expect_success 'patch-mode via -i prompts for files' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success 'add -p handles globs' '
> +	git reset --hard &&
> +
> +	mkdir -p subdir &&
> +	echo base >one.c &&
> +	echo base >subdir/two.c &&
> +	git add "*.c" &&
> +	git commit -m base &&
> +
> +	echo change >one.c &&
> +	echo change >subdir/two.c &&
> +	git add -p "*.c" <<-\EOF &&
> +	y
> +	y
> +	EOF
> +
> +	cat >expect <<-\EOF &&
> +	one.c
> +	subdir/two.c
> +	EOF
> +	git diff --cached --name-only >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'add -p does not expand argument lists' '
> +	git reset --hard &&
> +
> +	echo content >not-changed &&
> +	git add not-changed &&
> +	git commit -m "add not-changed file" &&
> +
> +	echo change >file &&
> +	GIT_TRACE=$(pwd)/trace.out git add -p . <<-\EOF &&
> +	y
> +	EOF
> +
> +	# we know that "file" must be mentioned since we actually
> +	# update it, but we want to be sure that our "." pathspec
> +	# was not expanded into the argument list of any command.
> +	# So look only for "not-changed".
> +	! grep not-changed trace.out
> +'
> +
>  test_done

Tests look sane to me.

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] add--interactive: do not expand pathspecs with ls-files
  2017-03-14 17:25 ` Brandon Williams
@ 2017-03-14 17:31   ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2017-03-14 17:31 UTC (permalink / raw)
  To: Brandon Williams; +Cc: git, Wincent Colaiuta, Mislav Marohnić

On Tue, Mar 14, 2017 at 10:25:04AM -0700, Brandon Williams wrote:

> On 03/14, Jeff King wrote:
> >   b. If the repository contains filenames with literal wildcard
> >      characters (e.g., "foo*"), the original code expanded
> >      them via "ls-files" and then fed those wildcard names
> >      to "diff-index", which would have treated them as
> >      wildcards. This was a bug, which is now fixed (though
> >      unless you really go through some contortions with
> >      ":(literal)", it's likely that your original pathspec
> >      would match whatever the accidentally-expanded wildcard
> >      would anyway).
> > 
> >      So this takes us one step closer to working correctly
> >      with files whose names contain wildcard characters, but
> >      it's likely that others remain (e.g., if "git add -i"
> >      feeds the selected paths to "git add").
> 
> It definitely makes things more difficult when people use wildcard
> characters in filenames.  Is there any reason people use wildcards as
> literal characters in filenames other than them not knowing any better?

The philosophers of the ages have struggled with the question of why
users do anything.

I suspect that wildcards in filenames are pretty uncommon, just judging
from the lack of complaints. But they do come up. The most recent one
was:

  http://public-inbox.org/git/c9876671-6252-5dfa-18df-a6719dc6834c@gmail.com/

It's unclear if that was stimulated by a real-world case, or just
mucking around.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] add--interactive: do not expand pathspecs with ls-files
  2017-03-14 16:30 [PATCH] add--interactive: do not expand pathspecs with ls-files Jeff King
  2017-03-14 16:31 ` Jeff King
  2017-03-14 17:25 ` Brandon Williams
@ 2017-03-14 22:03 ` Junio C Hamano
  2017-03-14 22:09   ` Jeff King
  2 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2017-03-14 22:03 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Wincent Colaiuta, Mislav Marohnić

Jeff King <peff@peff.net> writes:

> We can improve this by skipping "ls-files" completely, and
> just feeding the original pathspecs to the diff commands.
> This solution was discussed in 2010:
>
>   http://public-inbox.org/git/20100105041438.GB12574@coredump.intra.peff.net/
>
> but at the time the diff code's pathspecs were more
> primitive than those used by ls-files (e.g., they did not
> support globs). Making the change would have caused a
> user-visible regression, so we didn't.

Heh.  The change and the reasoning are both obviously correct, but
how did you find this?  Do you have a pile of "old patches that
should be resurrected when time is right" and this was picked out of
it, or did you see somebody else hit the same thing recently and
then went back to the archive?

>   b. If the repository contains filenames with literal wildcard
>      characters (e.g., "foo*"), the original code expanded
>      them via "ls-files" and then fed those wildcard names
>      to "diff-index", which would have treated them as
>      wildcards. This was a bug, which is now fixed (though
>      unless you really go through some contortions with
>      ":(literal)", it's likely that your original pathspec
>      would match whatever the accidentally-expanded wildcard
>      would anyway).
>
>      So this takes us one step closer to working correctly
>      with files whose names contain wildcard characters, but
>      it's likely that others remain (e.g., if "git add -i"
>      feeds the selected paths to "git add").

We didn't run with --literal-pathspecs which was a bug, but I
suspect that it didn't exist back then ;-).

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] add--interactive: do not expand pathspecs with ls-files
  2017-03-14 22:03 ` Junio C Hamano
@ 2017-03-14 22:09   ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2017-03-14 22:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Wincent Colaiuta, Mislav Marohnić

On Tue, Mar 14, 2017 at 03:03:34PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > We can improve this by skipping "ls-files" completely, and
> > just feeding the original pathspecs to the diff commands.
> > This solution was discussed in 2010:
> >
> >   http://public-inbox.org/git/20100105041438.GB12574@coredump.intra.peff.net/
> >
> > but at the time the diff code's pathspecs were more
> > primitive than those used by ls-files (e.g., they did not
> > support globs). Making the change would have caused a
> > user-visible regression, so we didn't.
> 
> Heh.  The change and the reasoning are both obviously correct, but
> how did you find this?  Do you have a pile of "old patches that
> should be resurrected when time is right" and this was picked out of
> it, or did you see somebody else hit the same thing recently and
> then went back to the archive?

The latter.  Mislav reported it to me off-list earlier today, and I
generated the patch. But after scratching my head at the familiarity and
especially wondering if there was some reason to use "ls-files" here, I
dug up the linked thread. The fact that the patches are identical is
just coincidence (though it's such a simple change that it seems highly
likely).

> >      So this takes us one step closer to working correctly
> >      with files whose names contain wildcard characters, but
> >      it's likely that others remain (e.g., if "git add -i"
> >      feeds the selected paths to "git add").
> 
> We didn't run with --literal-pathspecs which was a bug, but I
> suspect that it didn't exist back then ;-).

Yep. I think judiciously inserting "--literal-pathspecs" is probably the
correct fix. In an earlier thread (that I linked elsewhere from the
discussion here) I suggested just setting $GIT_LITERAL_PATHSPECS to 1.
But that is probably a bad idea. As this patch shows, we do still
sometimes feed the original non-expanded pathspec to some commands.

I left that as potential low-hanging fruit for somebody who cares more
(the trickiest part is probably not the fix, but coming up with a test
case).

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-03-14 22:09 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-14 16:30 [PATCH] add--interactive: do not expand pathspecs with ls-files Jeff King
2017-03-14 16:31 ` Jeff King
2017-03-14 17:25 ` Brandon Williams
2017-03-14 17:31   ` Jeff King
2017-03-14 22:03 ` Junio C Hamano
2017-03-14 22:09   ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).