git-clone causes out of memory

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* git-clone causes out of memory
@ 2017-10-13  9:51 Constantine
  2017-10-13 10:06 ` Mike Hommey
  0 siblings, 1 reply; 23+ messages in thread
From: Constantine @ 2017-10-13  9:51 UTC (permalink / raw)
  To: git

There's a gitbomb on github. It is undoubtedly creative and funny, but 
since this is a bug in git, I thought it'd be nice to report. The command:

	$ git clone https://github.com/x0rz/ShadowBrokersFiles

quickly fills out the RAM (e.g. 4GB of free memory for me). To recover, 
call oom-killer through Alt+SysRq+f, then switch to a TTY and back.

Git version: 2.14.2

P.S.: I am not subscribed to the ML, acc. to 
https://git-scm.com/community I will be added to CC.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13  9:51 git-clone causes out of memory Constantine
@ 2017-10-13 10:06 ` Mike Hommey
  2017-10-13 10:26   ` Christian Couder
  2017-10-13 12:35   ` git-clone causes out of memory Jeff King
  0 siblings, 2 replies; 23+ messages in thread
From: Mike Hommey @ 2017-10-13 10:06 UTC (permalink / raw)
  To: Constantine; +Cc: git

On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote:
> There's a gitbomb on github. It is undoubtedly creative and funny, but since
> this is a bug in git, I thought it'd be nice to report. The command:
> 
> 	$ git clone https://github.com/x0rz/ShadowBrokersFiles

What fills memory is actually the checkout part of the command. git
clone -n doesn't fail.

Credit should go where it's due: https://kate.io/blog/git-bomb/
(with the bonus that it comes with explanations)

Mike

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 10:06 ` Mike Hommey
@ 2017-10-13 10:26   ` Christian Couder
  2017-10-13 10:37     ` Mike Hommey
  2017-10-13 12:35   ` git-clone causes out of memory Jeff King
  1 sibling, 1 reply; 23+ messages in thread
From: Christian Couder @ 2017-10-13 10:26 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Constantine, git

On Fri, Oct 13, 2017 at 12:06 PM, Mike Hommey <mh@glandium.org> wrote:
> On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote:
>> There's a gitbomb on github. It is undoubtedly creative and funny, but since
>> this is a bug in git, I thought it'd be nice to report. The command:
>>
>>       $ git clone https://github.com/x0rz/ShadowBrokersFiles
>
> What fills memory is actually the checkout part of the command. git
> clone -n doesn't fail.
>
> Credit should go where it's due: https://kate.io/blog/git-bomb/
> (with the bonus that it comes with explanations)

Yeah, there is a thread on Hacker News about this too:

https://news.ycombinator.com/item?id=15457076

The original repo on GitHub is:

https://github.com/Katee/git-bomb.git

After cloning it with -n, there is the following "funny" situation:

$ time git rev-list HEAD
7af99c9e7d4768fa681f4fe4ff61259794cf719b
18ed56cbc5012117e24a603e7c072cf65d36d469
45546f17e5801791d4bc5968b91253a2f4b0db72

real    0m0.004s
user    0m0.000s
sys     0m0.004s
$ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0/f0

real    0m0.004s
user    0m0.000s
sys     0m0.000s
$ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0

real    0m0.004s
user    0m0.000s
sys     0m0.000s
$ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/
45546f17e5801791d4bc5968b91253a2f4b0db72

real    0m0.005s
user    0m0.008s
sys     0m0.000s
$ time git rev-list HEAD -- d0/d0/d0/d0/d0/
45546f17e5801791d4bc5968b91253a2f4b0db72

real    0m0.203s
user    0m0.112s
sys     0m0.088s
$ time git rev-list HEAD -- d0/d0/d0/d0/
45546f17e5801791d4bc5968b91253a2f4b0db72

real    0m1.305s
user    0m0.720s
sys     0m0.580s
$ time git rev-list HEAD -- d0/d0/d0/
45546f17e5801791d4bc5968b91253a2f4b0db72

real    0m12.135s
user    0m6.700s
sys     0m5.412s

So `git rev-list` becomes exponentially more expensive when you run it
on a shorter directory path, though it is fast if you run it without a
path.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 10:26   ` Christian Couder
@ 2017-10-13 10:37     ` Mike Hommey
  2017-10-13 10:44       ` Christian Couder
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Hommey @ 2017-10-13 10:37 UTC (permalink / raw)
  To: Christian Couder; +Cc: Constantine, git

On Fri, Oct 13, 2017 at 12:26:46PM +0200, Christian Couder wrote:
> On Fri, Oct 13, 2017 at 12:06 PM, Mike Hommey <mh@glandium.org> wrote:
> > On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote:
> >> There's a gitbomb on github. It is undoubtedly creative and funny, but since
> >> this is a bug in git, I thought it'd be nice to report. The command:
> >>
> >>       $ git clone https://github.com/x0rz/ShadowBrokersFiles
> >
> > What fills memory is actually the checkout part of the command. git
> > clone -n doesn't fail.
> >
> > Credit should go where it's due: https://kate.io/blog/git-bomb/
> > (with the bonus that it comes with explanations)
> 
> Yeah, there is a thread on Hacker News about this too:
> 
> https://news.ycombinator.com/item?id=15457076
> 
> The original repo on GitHub is:
> 
> https://github.com/Katee/git-bomb.git
> 
> After cloning it with -n, there is the following "funny" situation:
> 
> $ time git rev-list HEAD
> 7af99c9e7d4768fa681f4fe4ff61259794cf719b
> 18ed56cbc5012117e24a603e7c072cf65d36d469
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m0.004s
> user    0m0.000s
> sys     0m0.004s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0/f0
> 
> real    0m0.004s
> user    0m0.000s
> sys     0m0.000s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0
> 
> real    0m0.004s
> user    0m0.000s
> sys     0m0.000s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m0.005s
> user    0m0.008s
> sys     0m0.000s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m0.203s
> user    0m0.112s
> sys     0m0.088s
> $ time git rev-list HEAD -- d0/d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m1.305s
> user    0m0.720s
> sys     0m0.580s
> $ time git rev-list HEAD -- d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m12.135s
> user    0m6.700s
> sys     0m5.412s
> 
> So `git rev-list` becomes exponentially more expensive when you run it
> on a shorter directory path, though it is fast if you run it without a
> path.

That's because there are 10^7 files under d0/d0/d0, 10^6 under
d0/d0/d0/d0/, 10^5 under d0/d0/d0/d0/d0/ etc.

So really, this is all about things being slower when there's a crazy
number of files. Picture me surprised.

What makes it kind of special is that the repository contains a lot of
paths/files, but very few objects, because it's duplicating everything.

All the 10^10 blobs have the same content, all the 10^9 trees that point
to them have the same content, all the 10^8 trees that point to those
trees have the same content, etc.

If git wasn't effectively deduplicating identical content, the repository
would be multiple gigabytes large.

Mike

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 10:37     ` Mike Hommey
@ 2017-10-13 10:44       ` Christian Couder
  2017-10-13 12:04         ` Junio C Hamano
  0 siblings, 1 reply; 23+ messages in thread
From: Christian Couder @ 2017-10-13 10:44 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Constantine, git

On Fri, Oct 13, 2017 at 12:37 PM, Mike Hommey <mh@glandium.org> wrote:
> On Fri, Oct 13, 2017 at 12:26:46PM +0200, Christian Couder wrote:
>>
>> After cloning it with -n, there is the following "funny" situation:
>>
>> $ time git rev-list HEAD
>> 7af99c9e7d4768fa681f4fe4ff61259794cf719b
>> 18ed56cbc5012117e24a603e7c072cf65d36d469
>> 45546f17e5801791d4bc5968b91253a2f4b0db72
>>
>> real    0m0.004s
>> user    0m0.000s
>> sys     0m0.004s
>> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0/f0
>>
>> real    0m0.004s
>> user    0m0.000s
>> sys     0m0.000s
>> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0
>>
>> real    0m0.004s
>> user    0m0.000s
>> sys     0m0.000s
>> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/
>> 45546f17e5801791d4bc5968b91253a2f4b0db72
>>
>> real    0m0.005s
>> user    0m0.008s
>> sys     0m0.000s
>> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/
>> 45546f17e5801791d4bc5968b91253a2f4b0db72
>>
>> real    0m0.203s
>> user    0m0.112s
>> sys     0m0.088s
>> $ time git rev-list HEAD -- d0/d0/d0/d0/
>> 45546f17e5801791d4bc5968b91253a2f4b0db72
>>
>> real    0m1.305s
>> user    0m0.720s
>> sys     0m0.580s
>> $ time git rev-list HEAD -- d0/d0/d0/
>> 45546f17e5801791d4bc5968b91253a2f4b0db72
>>
>> real    0m12.135s
>> user    0m6.700s
>> sys     0m5.412s
>>
>> So `git rev-list` becomes exponentially more expensive when you run it
>> on a shorter directory path, though it is fast if you run it without a
>> path.
>
> That's because there are 10^7 files under d0/d0/d0, 10^6 under
> d0/d0/d0/d0/, 10^5 under d0/d0/d0/d0/d0/ etc.
>
> So really, this is all about things being slower when there's a crazy
> number of files. Picture me surprised.
>
> What makes it kind of special is that the repository contains a lot of
> paths/files, but very few objects, because it's duplicating everything.
>
> All the 10^10 blobs have the same content, all the 10^9 trees that point
> to them have the same content, all the 10^8 trees that point to those
> trees have the same content, etc.
>
> If git wasn't effectively deduplicating identical content, the repository
> would be multiple gigabytes large.

Yeah, but perhaps Git could be smarter when rev-listing too and avoid
processing files or directories it has already seen?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 10:44       ` Christian Couder
@ 2017-10-13 12:04         ` Junio C Hamano
  2017-10-13 12:12           ` Constantine
  0 siblings, 1 reply; 23+ messages in thread
From: Junio C Hamano @ 2017-10-13 12:04 UTC (permalink / raw)
  To: Christian Couder; +Cc: Mike Hommey, Constantine, git

Christian Couder <christian.couder@gmail.com> writes:

> Yeah, but perhaps Git could be smarter when rev-listing too and avoid
> processing files or directories it has already seen?

Aren't you suggesting to optimize for a wrong case?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 12:04         ` Junio C Hamano
@ 2017-10-13 12:12           ` Constantine
  2017-10-13 12:44             ` Jeff King
  0 siblings, 1 reply; 23+ messages in thread
From: Constantine @ 2017-10-13 12:12 UTC (permalink / raw)
  To: Junio C Hamano, Christian Couder; +Cc: Mike Hommey, git

On 13.10.2017 15:04, Junio C Hamano wrote:
> Christian Couder <christian.couder@gmail.com> writes:
> 
>> Yeah, but perhaps Git could be smarter when rev-listing too and avoid
>> processing files or directories it has already seen?
> 
> Aren't you suggesting to optimize for a wrong case?
> 

Anything that is possible with a software should be considered as a 
possible usecase. It's in fact a DoS attack. Imagine there's a server 
that using git to process something, and now there's a way to knock down 
this server. It's also bad from a promotional stand point.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 12:12           ` Constantine
@ 2017-10-13 12:44             ` Jeff King
  2017-10-13 13:15               ` Derrick Stolee
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff King @ 2017-10-13 12:44 UTC (permalink / raw)
  To: Constantine; +Cc: Junio C Hamano, Christian Couder, Mike Hommey, git

On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote:

> On 13.10.2017 15:04, Junio C Hamano wrote:
> > Christian Couder <christian.couder@gmail.com> writes:
> > 
> > > Yeah, but perhaps Git could be smarter when rev-listing too and avoid
> > > processing files or directories it has already seen?
> > 
> > Aren't you suggesting to optimize for a wrong case?
> > 
> 
> Anything that is possible with a software should be considered as a possible
> usecase. It's in fact a DoS attack. Imagine there's a server that using git
> to process something, and now there's a way to knock down this server. It's
> also bad from a promotional stand point.

But the point is that you'd have the same problem with any repository
that had 10^7 files in it. Yes, it's convenient for the attacker that
there are only 9 objects, but fundamentally it's pretty easy for an
attacker to construct repositories that have large trees (or very deep
trees -- that's what causes stack exhaustion in some cases).

Note too that this attack almost always comes down to the diff code
(which is why it kicks in for pathspec limiting) which has to actually
expand the tree. Most "normal" server-side operations (like accepting
pushes or serving fetches) operate only on the object graph and _do_
avoid processing already-seen objects.

As soon as servers start trying to checkout or diff, though, the attack
surface gets quite large. And you really need to start thinking about
having resource limits and quotas for CPU and memory use of each process
(and group by requesting user, IP, repo, etc).

I think the main thing Git could be doing here is to limit the size of
the tree (both width and depth). But arbitrary limits like that have a
way of being annoying, and I think it just pushes resource-exhaustion
attacks off a little (e.g., can you construct a blob that behaves badly
with the "--patch"?).

-Peff

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 12:44             ` Jeff King
@ 2017-10-13 13:15               ` Derrick Stolee
  2017-10-13 13:39                 ` Derrick Stolee
  0 siblings, 1 reply; 23+ messages in thread
From: Derrick Stolee @ 2017-10-13 13:15 UTC (permalink / raw)
  To: Jeff King, Constantine; +Cc: Junio C Hamano, Christian Couder, Mike Hommey, git

On 10/13/2017 8:44 AM, Jeff King wrote:
> On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote:
>
>> On 13.10.2017 15:04, Junio C Hamano wrote:
>>> Christian Couder <christian.couder@gmail.com> writes:
>>>
>>>> Yeah, but perhaps Git could be smarter when rev-listing too and avoid
>>>> processing files or directories it has already seen?
>>> Aren't you suggesting to optimize for a wrong case?
>>>
>> Anything that is possible with a software should be considered as a possible
>> usecase. It's in fact a DoS attack. Imagine there's a server that using git
>> to process something, and now there's a way to knock down this server. It's
>> also bad from a promotional stand point.
> But the point is that you'd have the same problem with any repository
> that had 10^7 files in it. Yes, it's convenient for the attacker that
> there are only 9 objects, but fundamentally it's pretty easy for an
> attacker to construct repositories that have large trees (or very deep
> trees -- that's what causes stack exhaustion in some cases).
>
> Note too that this attack almost always comes down to the diff code
> (which is why it kicks in for pathspec limiting) which has to actually
> expand the tree. Most "normal" server-side operations (like accepting
> pushes or serving fetches) operate only on the object graph and _do_
> avoid processing already-seen objects.
>
> As soon as servers start trying to checkout or diff, though, the attack
> surface gets quite large. And you really need to start thinking about
> having resource limits and quotas for CPU and memory use of each process
> (and group by requesting user, IP, repo, etc).
>
> I think the main thing Git could be doing here is to limit the size of
> the tree (both width and depth). But arbitrary limits like that have a
> way of being annoying, and I think it just pushes resource-exhaustion
> attacks off a little (e.g., can you construct a blob that behaves badly
> with the "--patch"?).
>
> -Peff

I'm particularly interested in why `git rev-list HEAD -- [path]` gets 
slower in this case, because I wrote the history algorithm used by VSTS. 
In our algorithm, we only walk the list of objects from commit to the 
tree containing the path item. For example, in the path d0/d0/d0, we 
would only walk:

     commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 
entry]

 From this, we can determine the TREESAME relationship by parsing four 
objects without parsing all contents below d0/d0/d0.

The reason we have exponential behavior in `git rev-list` is because we 
are calling diff_tree_oid() in tree-diff.c recursively without 
short-circuiting on equal OIDs.

I will prepare a patch that adds this OID-equal short-circuit to avoid 
this exponential behavior. I'll model my patch against a similar patch 
in master:

     commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees: avoid 
duplicate ODB lookups during checkout

It will also significantly speed up rev-list calls for short paths in 
deep repositories. It will not be very measurable in the git or Linux 
repos because their shallow folder structure.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 13:15               ` Derrick Stolee
@ 2017-10-13 13:39                 ` Derrick Stolee
  2017-10-13 13:50                   ` Jeff King
  0 siblings, 1 reply; 23+ messages in thread
From: Derrick Stolee @ 2017-10-13 13:39 UTC (permalink / raw)
  To: Jeff King, Constantine; +Cc: Junio C Hamano, Christian Couder, Mike Hommey, git

On 10/13/2017 9:15 AM, Derrick Stolee wrote:
> On 10/13/2017 8:44 AM, Jeff King wrote:
>> On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote:
>>
>>> On 13.10.2017 15:04, Junio C Hamano wrote:
>>>> Christian Couder <christian.couder@gmail.com> writes:
>>>>
>>>>> Yeah, but perhaps Git could be smarter when rev-listing too and avoid
>>>>> processing files or directories it has already seen?
>>>> Aren't you suggesting to optimize for a wrong case?
>>>>
>>> Anything that is possible with a software should be considered as a 
>>> possible
>>> usecase. It's in fact a DoS attack. Imagine there's a server that 
>>> using git
>>> to process something, and now there's a way to knock down this 
>>> server. It's
>>> also bad from a promotional stand point.
>> But the point is that you'd have the same problem with any repository
>> that had 10^7 files in it. Yes, it's convenient for the attacker that
>> there are only 9 objects, but fundamentally it's pretty easy for an
>> attacker to construct repositories that have large trees (or very deep
>> trees -- that's what causes stack exhaustion in some cases).
>>
>> Note too that this attack almost always comes down to the diff code
>> (which is why it kicks in for pathspec limiting) which has to actually
>> expand the tree. Most "normal" server-side operations (like accepting
>> pushes or serving fetches) operate only on the object graph and _do_
>> avoid processing already-seen objects.
>>
>> As soon as servers start trying to checkout or diff, though, the attack
>> surface gets quite large. And you really need to start thinking about
>> having resource limits and quotas for CPU and memory use of each process
>> (and group by requesting user, IP, repo, etc).
>>
>> I think the main thing Git could be doing here is to limit the size of
>> the tree (both width and depth). But arbitrary limits like that have a
>> way of being annoying, and I think it just pushes resource-exhaustion
>> attacks off a little (e.g., can you construct a blob that behaves badly
>> with the "--patch"?).
>>
>> -Peff
>
> I'm particularly interested in why `git rev-list HEAD -- [path]` gets 
> slower in this case, because I wrote the history algorithm used by 
> VSTS. In our algorithm, we only walk the list of objects from commit 
> to the tree containing the path item. For example, in the path 
> d0/d0/d0, we would only walk:
>
>     commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 
> entry]
>
> From this, we can determine the TREESAME relationship by parsing four 
> objects without parsing all contents below d0/d0/d0.
>
> The reason we have exponential behavior in `git rev-list` is because 
> we are calling diff_tree_oid() in tree-diff.c recursively without 
> short-circuiting on equal OIDs.
>
> I will prepare a patch that adds this OID-equal short-circuit to avoid 
> this exponential behavior. I'll model my patch against a similar patch 
> in master:
>
>     commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees: 
> avoid duplicate ODB lookups during checkout
>
> It will also significantly speed up rev-list calls for short paths in 
> deep repositories. It will not be very measurable in the git or Linux 
> repos because their shallow folder structure.
>
> Thanks,
> -Stolee

Since I don't understand enough about the consumers to diff_tree_oid() 
(and the fact that the recursive behavior may be wanted in some cases), 
I think we can fix this in builtin/rev-list.c with this simple diff:

---

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ded1577424..b2e8e02cc8 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const 
char *prefix)

         git_config(git_default_config, NULL);
         init_revisions(&revs, prefix);
+
+       revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE;
+
         revs.abbrev = DEFAULT_ABBREV;
         revs.commit_format = CMIT_FMT_UNSPECIFIED;
         argc = setup_revisions(argc, argv, &revs, NULL);


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 13:39                 ` Derrick Stolee
@ 2017-10-13 13:50                   ` Jeff King
  2017-10-13 13:55                     ` Derrick Stolee
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff King @ 2017-10-13 13:50 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git

On Fri, Oct 13, 2017 at 09:39:14AM -0400, Derrick Stolee wrote:

> Since I don't understand enough about the consumers to diff_tree_oid() (and
> the fact that the recursive behavior may be wanted in some cases), I think
> we can fix this in builtin/rev-list.c with this simple diff:
> 
> ---
> 
> diff --git a/builtin/rev-list.c b/builtin/rev-list.c
> index ded1577424..b2e8e02cc8 100644
> --- a/builtin/rev-list.c
> +++ b/builtin/rev-list.c
> @@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char
> *prefix)
> 
>         git_config(git_default_config, NULL);
>         init_revisions(&revs, prefix);
> +
> +       revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE;
> +

Hmm, this feels wrong, because we _do_ want to recurse down and follow
the pathspec to see if there are real changes.

We should be comparing an empty tree and d0/d0/d0/d0 (or however deep
your pathspec goes). We should be able to see immediately that the entry
is not present between the two and not bother descending. After all,
we've set the QUICK flag in init_revisions(). So the real question is
why QUICK is not kicking in.

-Peff

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 13:50                   ` Jeff King
@ 2017-10-13 13:55                     ` Derrick Stolee
  2017-10-13 13:56                       ` Jeff King
  0 siblings, 1 reply; 23+ messages in thread
From: Derrick Stolee @ 2017-10-13 13:55 UTC (permalink / raw)
  To: Jeff King; +Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git

On 10/13/2017 9:50 AM, Jeff King wrote:
> On Fri, Oct 13, 2017 at 09:39:14AM -0400, Derrick Stolee wrote:
>
>> Since I don't understand enough about the consumers to diff_tree_oid() (and
>> the fact that the recursive behavior may be wanted in some cases), I think
>> we can fix this in builtin/rev-list.c with this simple diff:
>>
>> ---
>>
>> diff --git a/builtin/rev-list.c b/builtin/rev-list.c
>> index ded1577424..b2e8e02cc8 100644
>> --- a/builtin/rev-list.c
>> +++ b/builtin/rev-list.c
>> @@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char
>> *prefix)
>>
>>          git_config(git_default_config, NULL);
>>          init_revisions(&revs, prefix);
>> +
>> +       revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE;
>> +

(Note: I'm running tests now and see that this breaks behavior. 
Definitely not the solution we want.)

> Hmm, this feels wrong, because we _do_ want to recurse down and follow
> the pathspec to see if there are real changes.
>
> We should be comparing an empty tree and d0/d0/d0/d0 (or however deep
> your pathspec goes). We should be able to see immediately that the entry
> is not present between the two and not bother descending. After all,
> we've set the QUICK flag in init_revisions(). So the real question is
> why QUICK is not kicking in.
>
> -Peff

I'm struggling to understand your meaning. We want to walk from root to 
d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe 
that's what the QUICK flag is supposed to do.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 13:55                     ` Derrick Stolee
@ 2017-10-13 13:56                       ` Jeff King
  2017-10-13 14:10                         ` Jeff King
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff King @ 2017-10-13 13:56 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git

On Fri, Oct 13, 2017 at 09:55:15AM -0400, Derrick Stolee wrote:

> > We should be comparing an empty tree and d0/d0/d0/d0 (or however deep
> > your pathspec goes). We should be able to see immediately that the entry
> > is not present between the two and not bother descending. After all,
> > we've set the QUICK flag in init_revisions(). So the real question is
> > why QUICK is not kicking in.
> 
> I'm struggling to understand your meaning. We want to walk from root to
> d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe
> that's what the QUICK flag is supposed to do.

Yes, that's exactly what it is for. When we see the first difference we
should say "aha, the caller only wanted to know whether there was a
difference, not what it was" and return immediately. See
diff_can_quit_early().

-Peff

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 13:56                       ` Jeff King
@ 2017-10-13 14:10                         ` Jeff King
  2017-10-13 14:20                           ` Jeff King
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff King @ 2017-10-13 14:10 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git

On Fri, Oct 13, 2017 at 09:56:36AM -0400, Jeff King wrote:

> On Fri, Oct 13, 2017 at 09:55:15AM -0400, Derrick Stolee wrote:
> 
> > > We should be comparing an empty tree and d0/d0/d0/d0 (or however deep
> > > your pathspec goes). We should be able to see immediately that the entry
> > > is not present between the two and not bother descending. After all,
> > > we've set the QUICK flag in init_revisions(). So the real question is
> > > why QUICK is not kicking in.
> > 
> > I'm struggling to understand your meaning. We want to walk from root to
> > d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe
> > that's what the QUICK flag is supposed to do.
> 
> Yes, that's exactly what it is for. When we see the first difference we
> should say "aha, the caller only wanted to know whether there was a
> difference, not what it was" and return immediately. See
> diff_can_quit_early().

Hmm. So this patch makes it go fast:

diff --git a/revision.c b/revision.c
index d167223e69..b52ea4e9d8 100644
--- a/revision.c
+++ b/revision.c
@@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options,
 	int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
 
 	tree_difference |= diff;
-	if (tree_difference == REV_TREE_DIFFERENT)
+	if (tree_difference & REV_TREE_DIFFERENT)
 		DIFF_OPT_SET(options, HAS_CHANGES);
 }
 

But that essentially makes the conditional a noop (since we know we set
either NEW or OLD above and DIFFERENT is the union of those flags).

I'm not sure I understand why file_add_remove() would ever want to avoid
setting HAS_CHANGES (certainly its companion file_change() always does).
This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use
diff-tree --quiet machinery., 2007-03-14).

Maybe I am missing something, but AFAICT this was always buggy. But
since it only affects adds and deletes, maybe nobody noticed? I'm also
not sure if it only causes a slowdown, or if this could cause us to
erroneously mark something as TREESAME which isn't (I _do_ think people
would have noticed that).

-Peff

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 14:10                         ` Jeff King
@ 2017-10-13 14:20                           ` Jeff King
  2017-10-13 14:25                             ` Derrick Stolee
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff King @ 2017-10-13 14:20 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git

On Fri, Oct 13, 2017 at 10:10:18AM -0400, Jeff King wrote:

> Hmm. So this patch makes it go fast:
> 
> diff --git a/revision.c b/revision.c
> index d167223e69..b52ea4e9d8 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options,
>  	int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
>  
>  	tree_difference |= diff;
> -	if (tree_difference == REV_TREE_DIFFERENT)
> +	if (tree_difference & REV_TREE_DIFFERENT)
>  		DIFF_OPT_SET(options, HAS_CHANGES);
>  }
>  
> 
> But that essentially makes the conditional a noop (since we know we set
> either NEW or OLD above and DIFFERENT is the union of those flags).
> 
> I'm not sure I understand why file_add_remove() would ever want to avoid
> setting HAS_CHANGES (certainly its companion file_change() always does).
> This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use
> diff-tree --quiet machinery., 2007-03-14).
> 
> Maybe I am missing something, but AFAICT this was always buggy. But
> since it only affects adds and deletes, maybe nobody noticed? I'm also
> not sure if it only causes a slowdown, or if this could cause us to
> erroneously mark something as TREESAME which isn't (I _do_ think people
> would have noticed that).

Answering my own question a little, there is a hint in the comment
a few lines above:

  /*
   * The goal is to get REV_TREE_NEW as the result only if the
   * diff consists of all '+' (and no other changes), REV_TREE_OLD
   * if the whole diff is removal of old data, and otherwise
   * REV_TREE_DIFFERENT (of course if the trees are the same we
   * want REV_TREE_SAME).
   * That means that once we get to REV_TREE_DIFFERENT, we do not
   * have to look any further.
   */

So my patch above is breaking that. But it's not clear from that comment
why we care about knowing the different between NEW, OLD, and DIFFERENT.

Grepping around for REV_TREE_NEW and REV_TREE_OLD, I think the answer is
in try_to_simplify_commit():

     case REV_TREE_NEW:
              if (revs->remove_empty_trees &&
                  rev_same_tree_as_empty(revs, p)) {
                      /* We are adding all the specified
                       * paths from this parent, so the
                       * history beyond this parent is not
                       * interesting.  Remove its parents
                       * (they are grandparents for us).
                       * IOW, we pretend this parent is a
                       * "root" commit.
                       */
                      if (parse_commit(p) < 0)
                              die("cannot simplify commit %s (invalid %s)",
                                  oid_to_hex(&commit->object.oid),
                                  oid_to_hex(&p->object.oid));
                      p->parents = NULL;
              }
    /* fallthrough */
    case REV_TREE_OLD:
    case REV_TREE_DIFFERENT:

So when --remove-empty is not in effect (and it's not by default), we
don't care about OLD vs NEW, and we should be able to optimize further.

-Peff

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 14:20                           ` Jeff King
@ 2017-10-13 14:25                             ` Derrick Stolee
  2017-10-13 14:26                               ` Jeff King
  0 siblings, 1 reply; 23+ messages in thread
From: Derrick Stolee @ 2017-10-13 14:25 UTC (permalink / raw)
  To: Jeff King; +Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git

On 10/13/2017 10:20 AM, Jeff King wrote:
> On Fri, Oct 13, 2017 at 10:10:18AM -0400, Jeff King wrote:
>
>> Hmm. So this patch makes it go fast:
>>
>> diff --git a/revision.c b/revision.c
>> index d167223e69..b52ea4e9d8 100644
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options,
>>   	int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
>>   
>>   	tree_difference |= diff;
>> -	if (tree_difference == REV_TREE_DIFFERENT)
>> +	if (tree_difference & REV_TREE_DIFFERENT)
>>   		DIFF_OPT_SET(options, HAS_CHANGES);
>>   }
>>   
>>
>> But that essentially makes the conditional a noop (since we know we set
>> either NEW or OLD above and DIFFERENT is the union of those flags).
>>
>> I'm not sure I understand why file_add_remove() would ever want to avoid
>> setting HAS_CHANGES (certainly its companion file_change() always does).
>> This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use
>> diff-tree --quiet machinery., 2007-03-14).
>>
>> Maybe I am missing something, but AFAICT this was always buggy. But
>> since it only affects adds and deletes, maybe nobody noticed? I'm also
>> not sure if it only causes a slowdown, or if this could cause us to
>> erroneously mark something as TREESAME which isn't (I _do_ think people
>> would have noticed that).
> Answering my own question a little, there is a hint in the comment
> a few lines above:
>
>    /*
>     * The goal is to get REV_TREE_NEW as the result only if the
>     * diff consists of all '+' (and no other changes), REV_TREE_OLD
>     * if the whole diff is removal of old data, and otherwise
>     * REV_TREE_DIFFERENT (of course if the trees are the same we
>     * want REV_TREE_SAME).
>     * That means that once we get to REV_TREE_DIFFERENT, we do not
>     * have to look any further.
>     */
>
> So my patch above is breaking that. But it's not clear from that comment
> why we care about knowing the different between NEW, OLD, and DIFFERENT.
>
> Grepping around for REV_TREE_NEW and REV_TREE_OLD, I think the answer is
> in try_to_simplify_commit():
>
>       case REV_TREE_NEW:
>                if (revs->remove_empty_trees &&
>                    rev_same_tree_as_empty(revs, p)) {
>                        /* We are adding all the specified
>                         * paths from this parent, so the
>                         * history beyond this parent is not
>                         * interesting.  Remove its parents
>                         * (they are grandparents for us).
>                         * IOW, we pretend this parent is a
>                         * "root" commit.
>                         */
>                        if (parse_commit(p) < 0)
>                                die("cannot simplify commit %s (invalid %s)",
>                                    oid_to_hex(&commit->object.oid),
>                                    oid_to_hex(&p->object.oid));
>                        p->parents = NULL;
>                }
>      /* fallthrough */
>      case REV_TREE_OLD:
>      case REV_TREE_DIFFERENT:
>
> So when --remove-empty is not in effect (and it's not by default), we
> don't care about OLD vs NEW, and we should be able to optimize further.
>
> -Peff

This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is 
causing diff_can_quit_early() to return false. Due to the corner-case of 
the bug it seems it will not be a huge performance improvement in most 
cases. Still worth fixing and I'm looking at your suggestions to try and 
learn this area better.

It will speed up natural cases of someone adding or renaming a folder 
with a lot of contents, including someone initializing a repository from 
an existing codebase. This git bomb case happens to add all of the 
folders at once, which is why the performance is so noticeable. I use a 
version of this "exponential file growth" for testing that increases the 
folder-depth one commit at a time, so the contents are rather small in 
the first "add", hence not noticing this issue.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 14:25                             ` Derrick Stolee
@ 2017-10-13 14:26                               ` Jeff King
  2017-10-13 14:30                                 ` Derrick Stolee
  2017-10-13 15:27                                 ` [PATCH] revision: quit pruning diff more quickly when possible Jeff King
  0 siblings, 2 replies; 23+ messages in thread
From: Jeff King @ 2017-10-13 14:26 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git

On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote:

> This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
> causing diff_can_quit_early() to return false. Due to the corner-case of the
> bug it seems it will not be a huge performance improvement in most cases.
> Still worth fixing and I'm looking at your suggestions to try and learn this
> area better.

Yeah, I just timed some pathspec limits on linux.git, and it makes at
best a fraction of a percent improvement (but any improvement is well
within run-to-run noise). Which is not surprising.

I agree it's worth fixing, though.

-Peff

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 14:26                               ` Jeff King
@ 2017-10-13 14:30                                 ` Derrick Stolee
  2017-10-13 15:27                                 ` [PATCH] revision: quit pruning diff more quickly when possible Jeff King
  1 sibling, 0 replies; 23+ messages in thread
From: Derrick Stolee @ 2017-10-13 14:30 UTC (permalink / raw)
  To: Jeff King; +Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git

On 10/13/2017 10:26 AM, Jeff King wrote:
> On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote:
>
>> This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
>> causing diff_can_quit_early() to return false. Due to the corner-case of the
>> bug it seems it will not be a huge performance improvement in most cases.
>> Still worth fixing and I'm looking at your suggestions to try and learn this
>> area better.
> Yeah, I just timed some pathspec limits on linux.git, and it makes at
> best a fraction of a percent improvement (but any improvement is well
> within run-to-run noise). Which is not surprising.
>
> I agree it's worth fixing, though.
>
> -Peff

I just ran a first-level folder history in the Windows repository and 
`git rev-list HEAD -- windows >/dev/null` went from 0.43s to 0.04s. So, 
a good percentage improvement but even on this enormous repo the bug 
doesn't present an issue to human users.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH] revision: quit pruning diff more quickly when possible
  2017-10-13 14:26                               ` Jeff King
  2017-10-13 14:30                                 ` Derrick Stolee
@ 2017-10-13 15:27                                 ` Jeff King
  2017-10-13 15:37                                   ` Derrick Stolee
  2017-10-14  2:43                                   ` Junio C Hamano
  1 sibling, 2 replies; 23+ messages in thread
From: Jeff King @ 2017-10-13 15:27 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git

On Fri, Oct 13, 2017 at 10:26:46AM -0400, Jeff King wrote:

> On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote:
> 
> > This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
> > causing diff_can_quit_early() to return false. Due to the corner-case of the
> > bug it seems it will not be a huge performance improvement in most cases.
> > Still worth fixing and I'm looking at your suggestions to try and learn this
> > area better.
> 
> Yeah, I just timed some pathspec limits on linux.git, and it makes at
> best a fraction of a percent improvement (but any improvement is well
> within run-to-run noise). Which is not surprising.
> 
> I agree it's worth fixing, though.

Here it is cleaned up and with a commit message. There's another case
that can be optimized, too: --remove-empty with an all-deletions commit.
That's probably even more obscure and pathological, but it was easy to
cover in the same breath.

I didn't bother making a perf script, since this really isn't indicative
of real-world performance. If we wanted to do perf regression tests
here, I think the best path forward would be:

  1. Make sure there the perf tests cover pathspecs (maybe in p0001?).

  2. Make it easy to run the whole perf suite against a "bomb" repo.
     This surely isn't the only slow thing of interest.

-- >8 --
Subject: revision: quit pruning diff more quickly when possible

When the revision traversal machinery is given a pathspec,
we must compute the parent-diff for each commit to determine
which ones are TREESAME. We set the QUICK diff flag to avoid
looking at more entries than we need; we really just care
whether there are any changes at all.

But there is one case where we want to know a bit more: if
--remove-empty is set, we care about finding cases where the
change consists only of added entries (in which case we may
prune the parent in try_to_simplify_commit()). To cover that
case, our file_add_remove() callback does not quit the diff
upon seeing an added entry; it keeps looking for other types
of entries.

But this means when --remove-empty is not set (and it is not
by default), we compute more of the diff than is necessary.
You can see this in a pathological case where a commit adds
a very large number of entries, and we limit based on a
broad pathspec. E.g.:

  perl -e '
    chomp(my $blob = `git hash-object -w --stdin </dev/null`);
    for my $a (1..1000) {
      for my $b (1..1000) {
        print "100644 $blob\t$a/$b\n";
      }
    }
  ' | git update-index --index-info
  git commit -qm add

  git rev-list HEAD -- .

This case takes about 100ms now, but after this patch only
needs 6ms. That's not a huge improvement, but it's easy to
get and it protects us against even more pathological cases
(e.g., going from 1 million to 10 million files would take
ten times as long with the current code, but not increase at
all after this patch).

This is reported to minorly speed-up pathspec limiting in
real world repositories (like the 100-million-file Windows
repository), but probably won't make a noticeable difference
outside of pathological setups.

This patch actually covers the case without --remove-empty,
and the case where we see only deletions. See the in-code
comment for details.

Note that we have to add a new member to the diff_options
struct so that our callback can see the value of
revs->remove_empty_trees. This callback parameter could be
passed to the "add_remove" and "change" callbacks, but
there's not much point. They already receive the
diff_options struct, and doing it this way avoids having to
update the function signature of the other callbacks
(arguably the format_callback and output_prefix functions
could benefit from the same simplification).

Signed-off-by: Jeff King <peff@peff.net>
---
 diff.h     |  1 +
 revision.c | 16 +++++++++++++---
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/diff.h b/diff.h
index 7dcfcfbef7..4a34d256f1 100644
--- a/diff.h
+++ b/diff.h
@@ -180,6 +180,7 @@ struct diff_options {
 	pathchange_fn_t pathchange;
 	change_fn_t change;
 	add_remove_fn_t add_remove;
+	void *change_fn_data;
 	diff_format_fn_t format_callback;
 	void *format_callback_data;
 	diff_prefix_fn_t output_prefix;
diff --git a/revision.c b/revision.c
index 8fd222f3bf..a3f245e2cc 100644
--- a/revision.c
+++ b/revision.c
@@ -399,8 +399,16 @@ static struct commit *one_relevant_parent(const struct rev_info *revs,
  * if the whole diff is removal of old data, and otherwise
  * REV_TREE_DIFFERENT (of course if the trees are the same we
  * want REV_TREE_SAME).
- * That means that once we get to REV_TREE_DIFFERENT, we do not
- * have to look any further.
+ *
+ * The only time we care about the distinction is when
+ * remove_empty_trees is in effect, in which case we care only about
+ * whether the whole change is REV_TREE_NEW, or if there's another type
+ * of change. Which means we can stop the diff early in either of these
+ * cases:
+ *
+ *   1. We're not using remove_empty_trees at all.
+ *
+ *   2. We saw anything except REV_TREE_NEW.
  */
 static int tree_difference = REV_TREE_SAME;

@@ -411,9 +419,10 @@ static void file_add_remove(struct diff_options *options,
 		    const char *fullpath, unsigned dirty_submodule)
 {
 	int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
+	struct rev_info *revs = options->change_fn_data;

 	tree_difference |= diff;
-	if (tree_difference == REV_TREE_DIFFERENT)
+	if (!revs->remove_empty_trees || tree_difference != REV_TREE_NEW)
 		DIFF_OPT_SET(options, HAS_CHANGES);
 }

@@ -1351,6 +1360,7 @@ void init_revisions(struct rev_info *revs, const char *prefix)
 	DIFF_OPT_SET(&revs->pruning, QUICK);
 	revs->pruning.add_remove = file_add_remove;
 	revs->pruning.change = file_change;
+	revs->pruning.change_fn_data = revs;
 	revs->sort_order = REV_SORT_IN_GRAPH_ORDER;
 	revs->dense = 1;
 	revs->prefix = prefix;
-- 
2.15.0.rc1.395.ga4290b5804

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] revision: quit pruning diff more quickly when possible
  2017-10-13 15:27                                 ` [PATCH] revision: quit pruning diff more quickly when possible Jeff King
@ 2017-10-13 15:37                                   ` Derrick Stolee
  2017-10-13 15:44                                     ` Jeff King
  2017-10-14  2:43                                   ` Junio C Hamano
  1 sibling, 1 reply; 23+ messages in thread
From: Derrick Stolee @ 2017-10-13 15:37 UTC (permalink / raw)
  To: Jeff King; +Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git

On 10/13/2017 11:27 AM, Jeff King wrote:
> On Fri, Oct 13, 2017 at 10:26:46AM -0400, Jeff King wrote:
>
>> On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote:
>>
>>> This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
>>> causing diff_can_quit_early() to return false. Due to the corner-case of the
>>> bug it seems it will not be a huge performance improvement in most cases.
>>> Still worth fixing and I'm looking at your suggestions to try and learn this
>>> area better.
>> Yeah, I just timed some pathspec limits on linux.git, and it makes at
>> best a fraction of a percent improvement (but any improvement is well
>> within run-to-run noise). Which is not surprising.
>>
>> I agree it's worth fixing, though.
> Here it is cleaned up and with a commit message. There's another case
> that can be optimized, too: --remove-empty with an all-deletions commit.
> That's probably even more obscure and pathological, but it was easy to
> cover in the same breath.
>
> I didn't bother making a perf script, since this really isn't indicative
> of real-world performance. If we wanted to do perf regression tests
> here, I think the best path forward would be:
>
>    1. Make sure there the perf tests cover pathspecs (maybe in p0001?).
>
>    2. Make it easy to run the whole perf suite against a "bomb" repo.
>       This surely isn't the only slow thing of interest.
>
> -- >8 --
> Subject: revision: quit pruning diff more quickly when possible
>
> When the revision traversal machinery is given a pathspec,
> we must compute the parent-diff for each commit to determine
> which ones are TREESAME. We set the QUICK diff flag to avoid
> looking at more entries than we need; we really just care
> whether there are any changes at all.
>
> But there is one case where we want to know a bit more: if
> --remove-empty is set, we care about finding cases where the
> change consists only of added entries (in which case we may
> prune the parent in try_to_simplify_commit()). To cover that
> case, our file_add_remove() callback does not quit the diff
> upon seeing an added entry; it keeps looking for other types
> of entries.
>
> But this means when --remove-empty is not set (and it is not
> by default), we compute more of the diff than is necessary.
> You can see this in a pathological case where a commit adds
> a very large number of entries, and we limit based on a
> broad pathspec. E.g.:
>
>    perl -e '
>      chomp(my $blob = `git hash-object -w --stdin </dev/null`);
>      for my $a (1..1000) {
>        for my $b (1..1000) {
>          print "100644 $blob\t$a/$b\n";
>        }
>      }
>    ' | git update-index --index-info
>    git commit -qm add
>
>    git rev-list HEAD -- .
>
> This case takes about 100ms now, but after this patch only
> needs 6ms. That's not a huge improvement, but it's easy to
> get and it protects us against even more pathological cases
> (e.g., going from 1 million to 10 million files would take
> ten times as long with the current code, but not increase at
> all after this patch).
>
> This is reported to minorly speed-up pathspec limiting in
> real world repositories (like the 100-million-file Windows
> repository), but probably won't make a noticeable difference
> outside of pathological setups.
>
> This patch actually covers the case without --remove-empty,
> and the case where we see only deletions. See the in-code
> comment for details.
>
> Note that we have to add a new member to the diff_options
> struct so that our callback can see the value of
> revs->remove_empty_trees. This callback parameter could be
> passed to the "add_remove" and "change" callbacks, but
> there's not much point. They already receive the
> diff_options struct, and doing it this way avoids having to
> update the function signature of the other callbacks
> (arguably the format_callback and output_prefix functions
> could benefit from the same simplification).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>   diff.h     |  1 +
>   revision.c | 16 +++++++++++++---
>   2 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/diff.h b/diff.h
> index 7dcfcfbef7..4a34d256f1 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -180,6 +180,7 @@ struct diff_options {
>   	pathchange_fn_t pathchange;
>   	change_fn_t change;
>   	add_remove_fn_t add_remove;
> +	void *change_fn_data;
>   	diff_format_fn_t format_callback;
>   	void *format_callback_data;
>   	diff_prefix_fn_t output_prefix;
> diff --git a/revision.c b/revision.c
> index 8fd222f3bf..a3f245e2cc 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -399,8 +399,16 @@ static struct commit *one_relevant_parent(const struct rev_info *revs,
>    * if the whole diff is removal of old data, and otherwise
>    * REV_TREE_DIFFERENT (of course if the trees are the same we
>    * want REV_TREE_SAME).
> - * That means that once we get to REV_TREE_DIFFERENT, we do not
> - * have to look any further.
> + *
> + * The only time we care about the distinction is when
> + * remove_empty_trees is in effect, in which case we care only about
> + * whether the whole change is REV_TREE_NEW, or if there's another type
> + * of change. Which means we can stop the diff early in either of these
> + * cases:
> + *
> + *   1. We're not using remove_empty_trees at all.
> + *
> + *   2. We saw anything except REV_TREE_NEW.
>    */
>   static int tree_difference = REV_TREE_SAME;
>   
> @@ -411,9 +419,10 @@ static void file_add_remove(struct diff_options *options,
>   		    const char *fullpath, unsigned dirty_submodule)
>   {
>   	int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
> +	struct rev_info *revs = options->change_fn_data;
>   
>   	tree_difference |= diff;
> -	if (tree_difference == REV_TREE_DIFFERENT)
> +	if (!revs->remove_empty_trees || tree_difference != REV_TREE_NEW)
>   		DIFF_OPT_SET(options, HAS_CHANGES);
>   }
>   
> @@ -1351,6 +1360,7 @@ void init_revisions(struct rev_info *revs, const char *prefix)
>   	DIFF_OPT_SET(&revs->pruning, QUICK);
>   	revs->pruning.add_remove = file_add_remove;
>   	revs->pruning.change = file_change;
> +	revs->pruning.change_fn_data = revs;
>   	revs->sort_order = REV_SORT_IN_GRAPH_ORDER;
>   	revs->dense = 1;
>   	revs->prefix = prefix;

Thanks, Peff. This patch looks good to me.

I tried a few other things like adding a flag DIFF_OPT_HAS_ANY_CHANGE 
next to DIFF_OPT_HAS_CHANGES that we could check in 
diff_can_quit_early() but it had side-effects that broke existing tests. 
From this exploration, it does seem necessary to be aware of 
'remove_empty_trees'.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] revision: quit pruning diff more quickly when possible
  2017-10-13 15:37                                   ` Derrick Stolee
@ 2017-10-13 15:44                                     ` Jeff King
  0 siblings, 0 replies; 23+ messages in thread
From: Jeff King @ 2017-10-13 15:44 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git

On Fri, Oct 13, 2017 at 11:37:50AM -0400, Derrick Stolee wrote:

> Thanks, Peff. This patch looks good to me.
> 
> I tried a few other things like adding a flag DIFF_OPT_HAS_ANY_CHANGE next
> to DIFF_OPT_HAS_CHANGES that we could check in diff_can_quit_early() but it
> had side-effects that broke existing tests. From this exploration, it does
> seem necessary to be aware of 'remove_empty_trees'.

Keep in mind that the regular diff_change callbacks already handle this
case[1].

The file_change callbacks are specific to the revision machinery's
pruning diff, and intentionally hold back the HAS_CHANGES flag.

-Peff

[1] I tried "git diff-tree --root -r --quiet 45546f17e" on the bomb
    repo, and it went quickly. Dropping --quiet makes it take a really
    long time.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] revision: quit pruning diff more quickly when possible
  2017-10-13 15:27                                 ` [PATCH] revision: quit pruning diff more quickly when possible Jeff King
  2017-10-13 15:37                                   ` Derrick Stolee
@ 2017-10-14  2:43                                   ` Junio C Hamano
  1 sibling, 0 replies; 23+ messages in thread
From: Junio C Hamano @ 2017-10-14  2:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, Constantine, Christian Couder, Mike Hommey, git

Jeff King <peff@peff.net> writes:

> Here it is cleaned up and with a commit message. There's another case
> that can be optimized, too: --remove-empty with an all-deletions commit.
> That's probably even more obscure and pathological, but it was easy to
> cover in the same breath.

This one looks good.  It appears that again you guys had all the fun
while I was offline ;-).  And I am happy to see that we didn't veer
in the direction to optimize for a wrong case by keeping track of
what trees we already saw and things like that, of course.

Thanks.

> Subject: revision: quit pruning diff more quickly when possible
>
> When the revision traversal machinery is given a pathspec,
> we must compute the parent-diff for each commit to determine
> which ones are TREESAME. We set the QUICK diff flag to avoid
> looking at more entries than we need; we really just care
> whether there are any changes at all.
>
> But there is one case where we want to know a bit more: if
> --remove-empty is set, we care about finding cases where the
> change consists only of added entries (in which case we may
> prune the parent in try_to_simplify_commit()). To cover that
> case, our file_add_remove() callback does not quit the diff
> upon seeing an added entry; it keeps looking for other types
> of entries.
>
> But this means when --remove-empty is not set (and it is not
> by default), we compute more of the diff than is necessary.
> You can see this in a pathological case where a commit adds
> a very large number of entries, and we limit based on a
> broad pathspec. E.g.:
>
>   perl -e '
>     chomp(my $blob = `git hash-object -w --stdin </dev/null`);
>     for my $a (1..1000) {
>       for my $b (1..1000) {
>         print "100644 $blob\t$a/$b\n";
>       }
>     }
>   ' | git update-index --index-info
>   git commit -qm add
>
>   git rev-list HEAD -- .
>
> This case takes about 100ms now, but after this patch only
> needs 6ms. That's not a huge improvement, but it's easy to
> get and it protects us against even more pathological cases
> (e.g., going from 1 million to 10 million files would take
> ten times as long with the current code, but not increase at
> all after this patch).
>
> This is reported to minorly speed-up pathspec limiting in
> real world repositories (like the 100-million-file Windows
> repository), but probably won't make a noticeable difference
> outside of pathological setups.
>
> This patch actually covers the case without --remove-empty,
> and the case where we see only deletions. See the in-code
> comment for details.
>
> Note that we have to add a new member to the diff_options
> struct so that our callback can see the value of
> revs->remove_empty_trees. This callback parameter could be
> passed to the "add_remove" and "change" callbacks, but
> there's not much point. They already receive the
> diff_options struct, and doing it this way avoids having to
> update the function signature of the other callbacks
> (arguably the format_callback and output_prefix functions
> could benefit from the same simplification).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  diff.h     |  1 +
>  revision.c | 16 +++++++++++++---
>  2 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/diff.h b/diff.h
> index 7dcfcfbef7..4a34d256f1 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -180,6 +180,7 @@ struct diff_options {
>  	pathchange_fn_t pathchange;
>  	change_fn_t change;
>  	add_remove_fn_t add_remove;
> +	void *change_fn_data;
>  	diff_format_fn_t format_callback;
>  	void *format_callback_data;
>  	diff_prefix_fn_t output_prefix;
> diff --git a/revision.c b/revision.c
> index 8fd222f3bf..a3f245e2cc 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -399,8 +399,16 @@ static struct commit *one_relevant_parent(const struct rev_info *revs,
>   * if the whole diff is removal of old data, and otherwise
>   * REV_TREE_DIFFERENT (of course if the trees are the same we
>   * want REV_TREE_SAME).
> - * That means that once we get to REV_TREE_DIFFERENT, we do not
> - * have to look any further.
> + *
> + * The only time we care about the distinction is when
> + * remove_empty_trees is in effect, in which case we care only about
> + * whether the whole change is REV_TREE_NEW, or if there's another type
> + * of change. Which means we can stop the diff early in either of these
> + * cases:
> + *
> + *   1. We're not using remove_empty_trees at all.
> + *
> + *   2. We saw anything except REV_TREE_NEW.
>   */
>  static int tree_difference = REV_TREE_SAME;
>  
> @@ -411,9 +419,10 @@ static void file_add_remove(struct diff_options *options,
>  		    const char *fullpath, unsigned dirty_submodule)
>  {
>  	int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
> +	struct rev_info *revs = options->change_fn_data;
>  
>  	tree_difference |= diff;
> -	if (tree_difference == REV_TREE_DIFFERENT)
> +	if (!revs->remove_empty_trees || tree_difference != REV_TREE_NEW)
>  		DIFF_OPT_SET(options, HAS_CHANGES);
>  }
>  
> @@ -1351,6 +1360,7 @@ void init_revisions(struct rev_info *revs, const char *prefix)
>  	DIFF_OPT_SET(&revs->pruning, QUICK);
>  	revs->pruning.add_remove = file_add_remove;
>  	revs->pruning.change = file_change;
> +	revs->pruning.change_fn_data = revs;
>  	revs->sort_order = REV_SORT_IN_GRAPH_ORDER;
>  	revs->dense = 1;
>  	revs->prefix = prefix;

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: git-clone causes out of memory
  2017-10-13 10:06 ` Mike Hommey
  2017-10-13 10:26   ` Christian Couder
@ 2017-10-13 12:35   ` Jeff King
  1 sibling, 0 replies; 23+ messages in thread
From: Jeff King @ 2017-10-13 12:35 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Constantine, git

On Fri, Oct 13, 2017 at 07:06:03PM +0900, Mike Hommey wrote:

> On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote:
> > There's a gitbomb on github. It is undoubtedly creative and funny, but since
> > this is a bug in git, I thought it'd be nice to report. The command:
> > 
> > 	$ git clone https://github.com/x0rz/ShadowBrokersFiles
> 
> What fills memory is actually the checkout part of the command. git
> clone -n doesn't fail.
> 
> Credit should go where it's due: https://kate.io/blog/git-bomb/
> (with the bonus that it comes with explanations)

That is a nice explanation, and I think the dates point to the kate.io
post as the parent there.

I certainly have come across this type of "bomb" repo before, but I
couldn't come up with any public references (somebody uploaded such a
repository to GitHub in 2014). So I think this may be the first public
discussion of the idea.

-Peff

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-10-14  2:43 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-13  9:51 git-clone causes out of memory Constantine
2017-10-13 10:06 ` Mike Hommey
2017-10-13 10:26   ` Christian Couder
2017-10-13 10:37     ` Mike Hommey
2017-10-13 10:44       ` Christian Couder
2017-10-13 12:04         ` Junio C Hamano
2017-10-13 12:12           ` Constantine
2017-10-13 12:44             ` Jeff King
2017-10-13 13:15               ` Derrick Stolee
2017-10-13 13:39                 ` Derrick Stolee
2017-10-13 13:50                   ` Jeff King
2017-10-13 13:55                     ` Derrick Stolee
2017-10-13 13:56                       ` Jeff King
2017-10-13 14:10                         ` Jeff King
2017-10-13 14:20                           ` Jeff King
2017-10-13 14:25                             ` Derrick Stolee
2017-10-13 14:26                               ` Jeff King
2017-10-13 14:30                                 ` Derrick Stolee
2017-10-13 15:27                                 ` [PATCH] revision: quit pruning diff more quickly when possible Jeff King
2017-10-13 15:37                                   ` Derrick Stolee
2017-10-13 15:44                                     ` Jeff King
2017-10-14  2:43                                   ` Junio C Hamano
2017-10-13 12:35   ` git-clone causes out of memory Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).