* git-clone causes out of memory @ 2017-10-13 9:51 Constantine 2017-10-13 10:06 ` Mike Hommey 0 siblings, 1 reply; 23+ messages in thread From: Constantine @ 2017-10-13 9:51 UTC (permalink / raw) To: git There's a gitbomb on github. It is undoubtedly creative and funny, but since this is a bug in git, I thought it'd be nice to report. The command: $ git clone https://github.com/x0rz/ShadowBrokersFiles quickly fills out the RAM (e.g. 4GB of free memory for me). To recover, call oom-killer through Alt+SysRq+f, then switch to a TTY and back. Git version: 2.14.2 P.S.: I am not subscribed to the ML, acc. to https://git-scm.com/community I will be added to CC. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 9:51 git-clone causes out of memory Constantine @ 2017-10-13 10:06 ` Mike Hommey 2017-10-13 10:26 ` Christian Couder 2017-10-13 12:35 ` git-clone causes out of memory Jeff King 0 siblings, 2 replies; 23+ messages in thread From: Mike Hommey @ 2017-10-13 10:06 UTC (permalink / raw) To: Constantine; +Cc: git On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote: > There's a gitbomb on github. It is undoubtedly creative and funny, but since > this is a bug in git, I thought it'd be nice to report. The command: > > $ git clone https://github.com/x0rz/ShadowBrokersFiles What fills memory is actually the checkout part of the command. git clone -n doesn't fail. Credit should go where it's due: https://kate.io/blog/git-bomb/ (with the bonus that it comes with explanations) Mike ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 10:06 ` Mike Hommey @ 2017-10-13 10:26 ` Christian Couder 2017-10-13 10:37 ` Mike Hommey 2017-10-13 12:35 ` git-clone causes out of memory Jeff King 1 sibling, 1 reply; 23+ messages in thread From: Christian Couder @ 2017-10-13 10:26 UTC (permalink / raw) To: Mike Hommey; +Cc: Constantine, git On Fri, Oct 13, 2017 at 12:06 PM, Mike Hommey <mh@glandium.org> wrote: > On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote: >> There's a gitbomb on github. It is undoubtedly creative and funny, but since >> this is a bug in git, I thought it'd be nice to report. The command: >> >> $ git clone https://github.com/x0rz/ShadowBrokersFiles > > What fills memory is actually the checkout part of the command. git > clone -n doesn't fail. > > Credit should go where it's due: https://kate.io/blog/git-bomb/ > (with the bonus that it comes with explanations) Yeah, there is a thread on Hacker News about this too: https://news.ycombinator.com/item?id=15457076 The original repo on GitHub is: https://github.com/Katee/git-bomb.git After cloning it with -n, there is the following "funny" situation: $ time git rev-list HEAD 7af99c9e7d4768fa681f4fe4ff61259794cf719b 18ed56cbc5012117e24a603e7c072cf65d36d469 45546f17e5801791d4bc5968b91253a2f4b0db72 real 0m0.004s user 0m0.000s sys 0m0.004s $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0/f0 real 0m0.004s user 0m0.000s sys 0m0.000s $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0 real 0m0.004s user 0m0.000s sys 0m0.000s $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/ 45546f17e5801791d4bc5968b91253a2f4b0db72 real 0m0.005s user 0m0.008s sys 0m0.000s $ time git rev-list HEAD -- d0/d0/d0/d0/d0/ 45546f17e5801791d4bc5968b91253a2f4b0db72 real 0m0.203s user 0m0.112s sys 0m0.088s $ time git rev-list HEAD -- d0/d0/d0/d0/ 45546f17e5801791d4bc5968b91253a2f4b0db72 real 0m1.305s user 0m0.720s sys 0m0.580s $ time git rev-list HEAD -- d0/d0/d0/ 45546f17e5801791d4bc5968b91253a2f4b0db72 real 0m12.135s user 0m6.700s sys 0m5.412s So `git rev-list` becomes exponentially more expensive when you run it on a shorter directory path, though it is fast if you run it without a path. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 10:26 ` Christian Couder @ 2017-10-13 10:37 ` Mike Hommey 2017-10-13 10:44 ` Christian Couder 0 siblings, 1 reply; 23+ messages in thread From: Mike Hommey @ 2017-10-13 10:37 UTC (permalink / raw) To: Christian Couder; +Cc: Constantine, git On Fri, Oct 13, 2017 at 12:26:46PM +0200, Christian Couder wrote: > On Fri, Oct 13, 2017 at 12:06 PM, Mike Hommey <mh@glandium.org> wrote: > > On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote: > >> There's a gitbomb on github. It is undoubtedly creative and funny, but since > >> this is a bug in git, I thought it'd be nice to report. The command: > >> > >> $ git clone https://github.com/x0rz/ShadowBrokersFiles > > > > What fills memory is actually the checkout part of the command. git > > clone -n doesn't fail. > > > > Credit should go where it's due: https://kate.io/blog/git-bomb/ > > (with the bonus that it comes with explanations) > > Yeah, there is a thread on Hacker News about this too: > > https://news.ycombinator.com/item?id=15457076 > > The original repo on GitHub is: > > https://github.com/Katee/git-bomb.git > > After cloning it with -n, there is the following "funny" situation: > > $ time git rev-list HEAD > 7af99c9e7d4768fa681f4fe4ff61259794cf719b > 18ed56cbc5012117e24a603e7c072cf65d36d469 > 45546f17e5801791d4bc5968b91253a2f4b0db72 > > real 0m0.004s > user 0m0.000s > sys 0m0.004s > $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0/f0 > > real 0m0.004s > user 0m0.000s > sys 0m0.000s > $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0 > > real 0m0.004s > user 0m0.000s > sys 0m0.000s > $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/ > 45546f17e5801791d4bc5968b91253a2f4b0db72 > > real 0m0.005s > user 0m0.008s > sys 0m0.000s > $ time git rev-list HEAD -- d0/d0/d0/d0/d0/ > 45546f17e5801791d4bc5968b91253a2f4b0db72 > > real 0m0.203s > user 0m0.112s > sys 0m0.088s > $ time git rev-list HEAD -- d0/d0/d0/d0/ > 45546f17e5801791d4bc5968b91253a2f4b0db72 > > real 0m1.305s > user 0m0.720s > sys 0m0.580s > $ time git rev-list HEAD -- d0/d0/d0/ > 45546f17e5801791d4bc5968b91253a2f4b0db72 > > real 0m12.135s > user 0m6.700s > sys 0m5.412s > > So `git rev-list` becomes exponentially more expensive when you run it > on a shorter directory path, though it is fast if you run it without a > path. That's because there are 10^7 files under d0/d0/d0, 10^6 under d0/d0/d0/d0/, 10^5 under d0/d0/d0/d0/d0/ etc. So really, this is all about things being slower when there's a crazy number of files. Picture me surprised. What makes it kind of special is that the repository contains a lot of paths/files, but very few objects, because it's duplicating everything. All the 10^10 blobs have the same content, all the 10^9 trees that point to them have the same content, all the 10^8 trees that point to those trees have the same content, etc. If git wasn't effectively deduplicating identical content, the repository would be multiple gigabytes large. Mike ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 10:37 ` Mike Hommey @ 2017-10-13 10:44 ` Christian Couder 2017-10-13 12:04 ` Junio C Hamano 0 siblings, 1 reply; 23+ messages in thread From: Christian Couder @ 2017-10-13 10:44 UTC (permalink / raw) To: Mike Hommey; +Cc: Constantine, git On Fri, Oct 13, 2017 at 12:37 PM, Mike Hommey <mh@glandium.org> wrote: > On Fri, Oct 13, 2017 at 12:26:46PM +0200, Christian Couder wrote: >> >> After cloning it with -n, there is the following "funny" situation: >> >> $ time git rev-list HEAD >> 7af99c9e7d4768fa681f4fe4ff61259794cf719b >> 18ed56cbc5012117e24a603e7c072cf65d36d469 >> 45546f17e5801791d4bc5968b91253a2f4b0db72 >> >> real 0m0.004s >> user 0m0.000s >> sys 0m0.004s >> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0/f0 >> >> real 0m0.004s >> user 0m0.000s >> sys 0m0.000s >> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0 >> >> real 0m0.004s >> user 0m0.000s >> sys 0m0.000s >> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/ >> 45546f17e5801791d4bc5968b91253a2f4b0db72 >> >> real 0m0.005s >> user 0m0.008s >> sys 0m0.000s >> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/ >> 45546f17e5801791d4bc5968b91253a2f4b0db72 >> >> real 0m0.203s >> user 0m0.112s >> sys 0m0.088s >> $ time git rev-list HEAD -- d0/d0/d0/d0/ >> 45546f17e5801791d4bc5968b91253a2f4b0db72 >> >> real 0m1.305s >> user 0m0.720s >> sys 0m0.580s >> $ time git rev-list HEAD -- d0/d0/d0/ >> 45546f17e5801791d4bc5968b91253a2f4b0db72 >> >> real 0m12.135s >> user 0m6.700s >> sys 0m5.412s >> >> So `git rev-list` becomes exponentially more expensive when you run it >> on a shorter directory path, though it is fast if you run it without a >> path. > > That's because there are 10^7 files under d0/d0/d0, 10^6 under > d0/d0/d0/d0/, 10^5 under d0/d0/d0/d0/d0/ etc. > > So really, this is all about things being slower when there's a crazy > number of files. Picture me surprised. > > What makes it kind of special is that the repository contains a lot of > paths/files, but very few objects, because it's duplicating everything. > > All the 10^10 blobs have the same content, all the 10^9 trees that point > to them have the same content, all the 10^8 trees that point to those > trees have the same content, etc. > > If git wasn't effectively deduplicating identical content, the repository > would be multiple gigabytes large. Yeah, but perhaps Git could be smarter when rev-listing too and avoid processing files or directories it has already seen? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 10:44 ` Christian Couder @ 2017-10-13 12:04 ` Junio C Hamano 2017-10-13 12:12 ` Constantine 0 siblings, 1 reply; 23+ messages in thread From: Junio C Hamano @ 2017-10-13 12:04 UTC (permalink / raw) To: Christian Couder; +Cc: Mike Hommey, Constantine, git Christian Couder <christian.couder@gmail.com> writes: > Yeah, but perhaps Git could be smarter when rev-listing too and avoid > processing files or directories it has already seen? Aren't you suggesting to optimize for a wrong case? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 12:04 ` Junio C Hamano @ 2017-10-13 12:12 ` Constantine 2017-10-13 12:44 ` Jeff King 0 siblings, 1 reply; 23+ messages in thread From: Constantine @ 2017-10-13 12:12 UTC (permalink / raw) To: Junio C Hamano, Christian Couder; +Cc: Mike Hommey, git On 13.10.2017 15:04, Junio C Hamano wrote: > Christian Couder <christian.couder@gmail.com> writes: > >> Yeah, but perhaps Git could be smarter when rev-listing too and avoid >> processing files or directories it has already seen? > > Aren't you suggesting to optimize for a wrong case? > Anything that is possible with a software should be considered as a possible usecase. It's in fact a DoS attack. Imagine there's a server that using git to process something, and now there's a way to knock down this server. It's also bad from a promotional stand point. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 12:12 ` Constantine @ 2017-10-13 12:44 ` Jeff King 2017-10-13 13:15 ` Derrick Stolee 0 siblings, 1 reply; 23+ messages in thread From: Jeff King @ 2017-10-13 12:44 UTC (permalink / raw) To: Constantine; +Cc: Junio C Hamano, Christian Couder, Mike Hommey, git On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote: > On 13.10.2017 15:04, Junio C Hamano wrote: > > Christian Couder <christian.couder@gmail.com> writes: > > > > > Yeah, but perhaps Git could be smarter when rev-listing too and avoid > > > processing files or directories it has already seen? > > > > Aren't you suggesting to optimize for a wrong case? > > > > Anything that is possible with a software should be considered as a possible > usecase. It's in fact a DoS attack. Imagine there's a server that using git > to process something, and now there's a way to knock down this server. It's > also bad from a promotional stand point. But the point is that you'd have the same problem with any repository that had 10^7 files in it. Yes, it's convenient for the attacker that there are only 9 objects, but fundamentally it's pretty easy for an attacker to construct repositories that have large trees (or very deep trees -- that's what causes stack exhaustion in some cases). Note too that this attack almost always comes down to the diff code (which is why it kicks in for pathspec limiting) which has to actually expand the tree. Most "normal" server-side operations (like accepting pushes or serving fetches) operate only on the object graph and _do_ avoid processing already-seen objects. As soon as servers start trying to checkout or diff, though, the attack surface gets quite large. And you really need to start thinking about having resource limits and quotas for CPU and memory use of each process (and group by requesting user, IP, repo, etc). I think the main thing Git could be doing here is to limit the size of the tree (both width and depth). But arbitrary limits like that have a way of being annoying, and I think it just pushes resource-exhaustion attacks off a little (e.g., can you construct a blob that behaves badly with the "--patch"?). -Peff ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 12:44 ` Jeff King @ 2017-10-13 13:15 ` Derrick Stolee 2017-10-13 13:39 ` Derrick Stolee 0 siblings, 1 reply; 23+ messages in thread From: Derrick Stolee @ 2017-10-13 13:15 UTC (permalink / raw) To: Jeff King, Constantine; +Cc: Junio C Hamano, Christian Couder, Mike Hommey, git On 10/13/2017 8:44 AM, Jeff King wrote: > On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote: > >> On 13.10.2017 15:04, Junio C Hamano wrote: >>> Christian Couder <christian.couder@gmail.com> writes: >>> >>>> Yeah, but perhaps Git could be smarter when rev-listing too and avoid >>>> processing files or directories it has already seen? >>> Aren't you suggesting to optimize for a wrong case? >>> >> Anything that is possible with a software should be considered as a possible >> usecase. It's in fact a DoS attack. Imagine there's a server that using git >> to process something, and now there's a way to knock down this server. It's >> also bad from a promotional stand point. > But the point is that you'd have the same problem with any repository > that had 10^7 files in it. Yes, it's convenient for the attacker that > there are only 9 objects, but fundamentally it's pretty easy for an > attacker to construct repositories that have large trees (or very deep > trees -- that's what causes stack exhaustion in some cases). > > Note too that this attack almost always comes down to the diff code > (which is why it kicks in for pathspec limiting) which has to actually > expand the tree. Most "normal" server-side operations (like accepting > pushes or serving fetches) operate only on the object graph and _do_ > avoid processing already-seen objects. > > As soon as servers start trying to checkout or diff, though, the attack > surface gets quite large. And you really need to start thinking about > having resource limits and quotas for CPU and memory use of each process > (and group by requesting user, IP, repo, etc). > > I think the main thing Git could be doing here is to limit the size of > the tree (both width and depth). But arbitrary limits like that have a > way of being annoying, and I think it just pushes resource-exhaustion > attacks off a little (e.g., can you construct a blob that behaves badly > with the "--patch"?). > > -Peff I'm particularly interested in why `git rev-list HEAD -- [path]` gets slower in this case, because I wrote the history algorithm used by VSTS. In our algorithm, we only walk the list of objects from commit to the tree containing the path item. For example, in the path d0/d0/d0, we would only walk: commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 entry] From this, we can determine the TREESAME relationship by parsing four objects without parsing all contents below d0/d0/d0. The reason we have exponential behavior in `git rev-list` is because we are calling diff_tree_oid() in tree-diff.c recursively without short-circuiting on equal OIDs. I will prepare a patch that adds this OID-equal short-circuit to avoid this exponential behavior. I'll model my patch against a similar patch in master: commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees: avoid duplicate ODB lookups during checkout It will also significantly speed up rev-list calls for short paths in deep repositories. It will not be very measurable in the git or Linux repos because their shallow folder structure. Thanks, -Stolee ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 13:15 ` Derrick Stolee @ 2017-10-13 13:39 ` Derrick Stolee 2017-10-13 13:50 ` Jeff King 0 siblings, 1 reply; 23+ messages in thread From: Derrick Stolee @ 2017-10-13 13:39 UTC (permalink / raw) To: Jeff King, Constantine; +Cc: Junio C Hamano, Christian Couder, Mike Hommey, git On 10/13/2017 9:15 AM, Derrick Stolee wrote: > On 10/13/2017 8:44 AM, Jeff King wrote: >> On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote: >> >>> On 13.10.2017 15:04, Junio C Hamano wrote: >>>> Christian Couder <christian.couder@gmail.com> writes: >>>> >>>>> Yeah, but perhaps Git could be smarter when rev-listing too and avoid >>>>> processing files or directories it has already seen? >>>> Aren't you suggesting to optimize for a wrong case? >>>> >>> Anything that is possible with a software should be considered as a >>> possible >>> usecase. It's in fact a DoS attack. Imagine there's a server that >>> using git >>> to process something, and now there's a way to knock down this >>> server. It's >>> also bad from a promotional stand point. >> But the point is that you'd have the same problem with any repository >> that had 10^7 files in it. Yes, it's convenient for the attacker that >> there are only 9 objects, but fundamentally it's pretty easy for an >> attacker to construct repositories that have large trees (or very deep >> trees -- that's what causes stack exhaustion in some cases). >> >> Note too that this attack almost always comes down to the diff code >> (which is why it kicks in for pathspec limiting) which has to actually >> expand the tree. Most "normal" server-side operations (like accepting >> pushes or serving fetches) operate only on the object graph and _do_ >> avoid processing already-seen objects. >> >> As soon as servers start trying to checkout or diff, though, the attack >> surface gets quite large. And you really need to start thinking about >> having resource limits and quotas for CPU and memory use of each process >> (and group by requesting user, IP, repo, etc). >> >> I think the main thing Git could be doing here is to limit the size of >> the tree (both width and depth). But arbitrary limits like that have a >> way of being annoying, and I think it just pushes resource-exhaustion >> attacks off a little (e.g., can you construct a blob that behaves badly >> with the "--patch"?). >> >> -Peff > > I'm particularly interested in why `git rev-list HEAD -- [path]` gets > slower in this case, because I wrote the history algorithm used by > VSTS. In our algorithm, we only walk the list of objects from commit > to the tree containing the path item. For example, in the path > d0/d0/d0, we would only walk: > > commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 > entry] > > From this, we can determine the TREESAME relationship by parsing four > objects without parsing all contents below d0/d0/d0. > > The reason we have exponential behavior in `git rev-list` is because > we are calling diff_tree_oid() in tree-diff.c recursively without > short-circuiting on equal OIDs. > > I will prepare a patch that adds this OID-equal short-circuit to avoid > this exponential behavior. I'll model my patch against a similar patch > in master: > > commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees: > avoid duplicate ODB lookups during checkout > > It will also significantly speed up rev-list calls for short paths in > deep repositories. It will not be very measurable in the git or Linux > repos because their shallow folder structure. > > Thanks, > -Stolee Since I don't understand enough about the consumers to diff_tree_oid() (and the fact that the recursive behavior may be wanted in some cases), I think we can fix this in builtin/rev-list.c with this simple diff: --- diff --git a/builtin/rev-list.c b/builtin/rev-list.c index ded1577424..b2e8e02cc8 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) git_config(git_default_config, NULL); init_revisions(&revs, prefix); + + revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE; + revs.abbrev = DEFAULT_ABBREV; revs.commit_format = CMIT_FMT_UNSPECIFIED; argc = setup_revisions(argc, argv, &revs, NULL); ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 13:39 ` Derrick Stolee @ 2017-10-13 13:50 ` Jeff King 2017-10-13 13:55 ` Derrick Stolee 0 siblings, 1 reply; 23+ messages in thread From: Jeff King @ 2017-10-13 13:50 UTC (permalink / raw) To: Derrick Stolee Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git On Fri, Oct 13, 2017 at 09:39:14AM -0400, Derrick Stolee wrote: > Since I don't understand enough about the consumers to diff_tree_oid() (and > the fact that the recursive behavior may be wanted in some cases), I think > we can fix this in builtin/rev-list.c with this simple diff: > > --- > > diff --git a/builtin/rev-list.c b/builtin/rev-list.c > index ded1577424..b2e8e02cc8 100644 > --- a/builtin/rev-list.c > +++ b/builtin/rev-list.c > @@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char > *prefix) > > git_config(git_default_config, NULL); > init_revisions(&revs, prefix); > + > + revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE; > + Hmm, this feels wrong, because we _do_ want to recurse down and follow the pathspec to see if there are real changes. We should be comparing an empty tree and d0/d0/d0/d0 (or however deep your pathspec goes). We should be able to see immediately that the entry is not present between the two and not bother descending. After all, we've set the QUICK flag in init_revisions(). So the real question is why QUICK is not kicking in. -Peff ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 13:50 ` Jeff King @ 2017-10-13 13:55 ` Derrick Stolee 2017-10-13 13:56 ` Jeff King 0 siblings, 1 reply; 23+ messages in thread From: Derrick Stolee @ 2017-10-13 13:55 UTC (permalink / raw) To: Jeff King; +Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git On 10/13/2017 9:50 AM, Jeff King wrote: > On Fri, Oct 13, 2017 at 09:39:14AM -0400, Derrick Stolee wrote: > >> Since I don't understand enough about the consumers to diff_tree_oid() (and >> the fact that the recursive behavior may be wanted in some cases), I think >> we can fix this in builtin/rev-list.c with this simple diff: >> >> --- >> >> diff --git a/builtin/rev-list.c b/builtin/rev-list.c >> index ded1577424..b2e8e02cc8 100644 >> --- a/builtin/rev-list.c >> +++ b/builtin/rev-list.c >> @@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char >> *prefix) >> >> git_config(git_default_config, NULL); >> init_revisions(&revs, prefix); >> + >> + revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE; >> + (Note: I'm running tests now and see that this breaks behavior. Definitely not the solution we want.) > Hmm, this feels wrong, because we _do_ want to recurse down and follow > the pathspec to see if there are real changes. > > We should be comparing an empty tree and d0/d0/d0/d0 (or however deep > your pathspec goes). We should be able to see immediately that the entry > is not present between the two and not bother descending. After all, > we've set the QUICK flag in init_revisions(). So the real question is > why QUICK is not kicking in. > > -Peff I'm struggling to understand your meaning. We want to walk from root to d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe that's what the QUICK flag is supposed to do. Thanks, -Stolee ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 13:55 ` Derrick Stolee @ 2017-10-13 13:56 ` Jeff King 2017-10-13 14:10 ` Jeff King 0 siblings, 1 reply; 23+ messages in thread From: Jeff King @ 2017-10-13 13:56 UTC (permalink / raw) To: Derrick Stolee Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git On Fri, Oct 13, 2017 at 09:55:15AM -0400, Derrick Stolee wrote: > > We should be comparing an empty tree and d0/d0/d0/d0 (or however deep > > your pathspec goes). We should be able to see immediately that the entry > > is not present between the two and not bother descending. After all, > > we've set the QUICK flag in init_revisions(). So the real question is > > why QUICK is not kicking in. > > I'm struggling to understand your meaning. We want to walk from root to > d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe > that's what the QUICK flag is supposed to do. Yes, that's exactly what it is for. When we see the first difference we should say "aha, the caller only wanted to know whether there was a difference, not what it was" and return immediately. See diff_can_quit_early(). -Peff ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 13:56 ` Jeff King @ 2017-10-13 14:10 ` Jeff King 2017-10-13 14:20 ` Jeff King 0 siblings, 1 reply; 23+ messages in thread From: Jeff King @ 2017-10-13 14:10 UTC (permalink / raw) To: Derrick Stolee Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git On Fri, Oct 13, 2017 at 09:56:36AM -0400, Jeff King wrote: > On Fri, Oct 13, 2017 at 09:55:15AM -0400, Derrick Stolee wrote: > > > > We should be comparing an empty tree and d0/d0/d0/d0 (or however deep > > > your pathspec goes). We should be able to see immediately that the entry > > > is not present between the two and not bother descending. After all, > > > we've set the QUICK flag in init_revisions(). So the real question is > > > why QUICK is not kicking in. > > > > I'm struggling to understand your meaning. We want to walk from root to > > d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe > > that's what the QUICK flag is supposed to do. > > Yes, that's exactly what it is for. When we see the first difference we > should say "aha, the caller only wanted to know whether there was a > difference, not what it was" and return immediately. See > diff_can_quit_early(). Hmm. So this patch makes it go fast: diff --git a/revision.c b/revision.c index d167223e69..b52ea4e9d8 100644 --- a/revision.c +++ b/revision.c @@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options, int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD; tree_difference |= diff; - if (tree_difference == REV_TREE_DIFFERENT) + if (tree_difference & REV_TREE_DIFFERENT) DIFF_OPT_SET(options, HAS_CHANGES); } But that essentially makes the conditional a noop (since we know we set either NEW or OLD above and DIFFERENT is the union of those flags). I'm not sure I understand why file_add_remove() would ever want to avoid setting HAS_CHANGES (certainly its companion file_change() always does). This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use diff-tree --quiet machinery., 2007-03-14). Maybe I am missing something, but AFAICT this was always buggy. But since it only affects adds and deletes, maybe nobody noticed? I'm also not sure if it only causes a slowdown, or if this could cause us to erroneously mark something as TREESAME which isn't (I _do_ think people would have noticed that). -Peff ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 14:10 ` Jeff King @ 2017-10-13 14:20 ` Jeff King 2017-10-13 14:25 ` Derrick Stolee 0 siblings, 1 reply; 23+ messages in thread From: Jeff King @ 2017-10-13 14:20 UTC (permalink / raw) To: Derrick Stolee Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git On Fri, Oct 13, 2017 at 10:10:18AM -0400, Jeff King wrote: > Hmm. So this patch makes it go fast: > > diff --git a/revision.c b/revision.c > index d167223e69..b52ea4e9d8 100644 > --- a/revision.c > +++ b/revision.c > @@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options, > int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD; > > tree_difference |= diff; > - if (tree_difference == REV_TREE_DIFFERENT) > + if (tree_difference & REV_TREE_DIFFERENT) > DIFF_OPT_SET(options, HAS_CHANGES); > } > > > But that essentially makes the conditional a noop (since we know we set > either NEW or OLD above and DIFFERENT is the union of those flags). > > I'm not sure I understand why file_add_remove() would ever want to avoid > setting HAS_CHANGES (certainly its companion file_change() always does). > This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use > diff-tree --quiet machinery., 2007-03-14). > > Maybe I am missing something, but AFAICT this was always buggy. But > since it only affects adds and deletes, maybe nobody noticed? I'm also > not sure if it only causes a slowdown, or if this could cause us to > erroneously mark something as TREESAME which isn't (I _do_ think people > would have noticed that). Answering my own question a little, there is a hint in the comment a few lines above: /* * The goal is to get REV_TREE_NEW as the result only if the * diff consists of all '+' (and no other changes), REV_TREE_OLD * if the whole diff is removal of old data, and otherwise * REV_TREE_DIFFERENT (of course if the trees are the same we * want REV_TREE_SAME). * That means that once we get to REV_TREE_DIFFERENT, we do not * have to look any further. */ So my patch above is breaking that. But it's not clear from that comment why we care about knowing the different between NEW, OLD, and DIFFERENT. Grepping around for REV_TREE_NEW and REV_TREE_OLD, I think the answer is in try_to_simplify_commit(): case REV_TREE_NEW: if (revs->remove_empty_trees && rev_same_tree_as_empty(revs, p)) { /* We are adding all the specified * paths from this parent, so the * history beyond this parent is not * interesting. Remove its parents * (they are grandparents for us). * IOW, we pretend this parent is a * "root" commit. */ if (parse_commit(p) < 0) die("cannot simplify commit %s (invalid %s)", oid_to_hex(&commit->object.oid), oid_to_hex(&p->object.oid)); p->parents = NULL; } /* fallthrough */ case REV_TREE_OLD: case REV_TREE_DIFFERENT: So when --remove-empty is not in effect (and it's not by default), we don't care about OLD vs NEW, and we should be able to optimize further. -Peff ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 14:20 ` Jeff King @ 2017-10-13 14:25 ` Derrick Stolee 2017-10-13 14:26 ` Jeff King 0 siblings, 1 reply; 23+ messages in thread From: Derrick Stolee @ 2017-10-13 14:25 UTC (permalink / raw) To: Jeff King; +Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git On 10/13/2017 10:20 AM, Jeff King wrote: > On Fri, Oct 13, 2017 at 10:10:18AM -0400, Jeff King wrote: > >> Hmm. So this patch makes it go fast: >> >> diff --git a/revision.c b/revision.c >> index d167223e69..b52ea4e9d8 100644 >> --- a/revision.c >> +++ b/revision.c >> @@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options, >> int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD; >> >> tree_difference |= diff; >> - if (tree_difference == REV_TREE_DIFFERENT) >> + if (tree_difference & REV_TREE_DIFFERENT) >> DIFF_OPT_SET(options, HAS_CHANGES); >> } >> >> >> But that essentially makes the conditional a noop (since we know we set >> either NEW or OLD above and DIFFERENT is the union of those flags). >> >> I'm not sure I understand why file_add_remove() would ever want to avoid >> setting HAS_CHANGES (certainly its companion file_change() always does). >> This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use >> diff-tree --quiet machinery., 2007-03-14). >> >> Maybe I am missing something, but AFAICT this was always buggy. But >> since it only affects adds and deletes, maybe nobody noticed? I'm also >> not sure if it only causes a slowdown, or if this could cause us to >> erroneously mark something as TREESAME which isn't (I _do_ think people >> would have noticed that). > Answering my own question a little, there is a hint in the comment > a few lines above: > > /* > * The goal is to get REV_TREE_NEW as the result only if the > * diff consists of all '+' (and no other changes), REV_TREE_OLD > * if the whole diff is removal of old data, and otherwise > * REV_TREE_DIFFERENT (of course if the trees are the same we > * want REV_TREE_SAME). > * That means that once we get to REV_TREE_DIFFERENT, we do not > * have to look any further. > */ > > So my patch above is breaking that. But it's not clear from that comment > why we care about knowing the different between NEW, OLD, and DIFFERENT. > > Grepping around for REV_TREE_NEW and REV_TREE_OLD, I think the answer is > in try_to_simplify_commit(): > > case REV_TREE_NEW: > if (revs->remove_empty_trees && > rev_same_tree_as_empty(revs, p)) { > /* We are adding all the specified > * paths from this parent, so the > * history beyond this parent is not > * interesting. Remove its parents > * (they are grandparents for us). > * IOW, we pretend this parent is a > * "root" commit. > */ > if (parse_commit(p) < 0) > die("cannot simplify commit %s (invalid %s)", > oid_to_hex(&commit->object.oid), > oid_to_hex(&p->object.oid)); > p->parents = NULL; > } > /* fallthrough */ > case REV_TREE_OLD: > case REV_TREE_DIFFERENT: > > So when --remove-empty is not in effect (and it's not by default), we > don't care about OLD vs NEW, and we should be able to optimize further. > > -Peff This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is causing diff_can_quit_early() to return false. Due to the corner-case of the bug it seems it will not be a huge performance improvement in most cases. Still worth fixing and I'm looking at your suggestions to try and learn this area better. It will speed up natural cases of someone adding or renaming a folder with a lot of contents, including someone initializing a repository from an existing codebase. This git bomb case happens to add all of the folders at once, which is why the performance is so noticeable. I use a version of this "exponential file growth" for testing that increases the folder-depth one commit at a time, so the contents are rather small in the first "add", hence not noticing this issue. Thanks, -Stolee ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 14:25 ` Derrick Stolee @ 2017-10-13 14:26 ` Jeff King 2017-10-13 14:30 ` Derrick Stolee 2017-10-13 15:27 ` [PATCH] revision: quit pruning diff more quickly when possible Jeff King 0 siblings, 2 replies; 23+ messages in thread From: Jeff King @ 2017-10-13 14:26 UTC (permalink / raw) To: Derrick Stolee Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote: > This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is > causing diff_can_quit_early() to return false. Due to the corner-case of the > bug it seems it will not be a huge performance improvement in most cases. > Still worth fixing and I'm looking at your suggestions to try and learn this > area better. Yeah, I just timed some pathspec limits on linux.git, and it makes at best a fraction of a percent improvement (but any improvement is well within run-to-run noise). Which is not surprising. I agree it's worth fixing, though. -Peff ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 14:26 ` Jeff King @ 2017-10-13 14:30 ` Derrick Stolee 2017-10-13 15:27 ` [PATCH] revision: quit pruning diff more quickly when possible Jeff King 1 sibling, 0 replies; 23+ messages in thread From: Derrick Stolee @ 2017-10-13 14:30 UTC (permalink / raw) To: Jeff King; +Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git On 10/13/2017 10:26 AM, Jeff King wrote: > On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote: > >> This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is >> causing diff_can_quit_early() to return false. Due to the corner-case of the >> bug it seems it will not be a huge performance improvement in most cases. >> Still worth fixing and I'm looking at your suggestions to try and learn this >> area better. > Yeah, I just timed some pathspec limits on linux.git, and it makes at > best a fraction of a percent improvement (but any improvement is well > within run-to-run noise). Which is not surprising. > > I agree it's worth fixing, though. > > -Peff I just ran a first-level folder history in the Windows repository and `git rev-list HEAD -- windows >/dev/null` went from 0.43s to 0.04s. So, a good percentage improvement but even on this enormous repo the bug doesn't present an issue to human users. Thanks, -Stolee ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH] revision: quit pruning diff more quickly when possible 2017-10-13 14:26 ` Jeff King 2017-10-13 14:30 ` Derrick Stolee @ 2017-10-13 15:27 ` Jeff King 2017-10-13 15:37 ` Derrick Stolee 2017-10-14 2:43 ` Junio C Hamano 1 sibling, 2 replies; 23+ messages in thread From: Jeff King @ 2017-10-13 15:27 UTC (permalink / raw) To: Derrick Stolee Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git On Fri, Oct 13, 2017 at 10:26:46AM -0400, Jeff King wrote: > On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote: > > > This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is > > causing diff_can_quit_early() to return false. Due to the corner-case of the > > bug it seems it will not be a huge performance improvement in most cases. > > Still worth fixing and I'm looking at your suggestions to try and learn this > > area better. > > Yeah, I just timed some pathspec limits on linux.git, and it makes at > best a fraction of a percent improvement (but any improvement is well > within run-to-run noise). Which is not surprising. > > I agree it's worth fixing, though. Here it is cleaned up and with a commit message. There's another case that can be optimized, too: --remove-empty with an all-deletions commit. That's probably even more obscure and pathological, but it was easy to cover in the same breath. I didn't bother making a perf script, since this really isn't indicative of real-world performance. If we wanted to do perf regression tests here, I think the best path forward would be: 1. Make sure there the perf tests cover pathspecs (maybe in p0001?). 2. Make it easy to run the whole perf suite against a "bomb" repo. This surely isn't the only slow thing of interest. -- >8 -- Subject: revision: quit pruning diff more quickly when possible When the revision traversal machinery is given a pathspec, we must compute the parent-diff for each commit to determine which ones are TREESAME. We set the QUICK diff flag to avoid looking at more entries than we need; we really just care whether there are any changes at all. But there is one case where we want to know a bit more: if --remove-empty is set, we care about finding cases where the change consists only of added entries (in which case we may prune the parent in try_to_simplify_commit()). To cover that case, our file_add_remove() callback does not quit the diff upon seeing an added entry; it keeps looking for other types of entries. But this means when --remove-empty is not set (and it is not by default), we compute more of the diff than is necessary. You can see this in a pathological case where a commit adds a very large number of entries, and we limit based on a broad pathspec. E.g.: perl -e ' chomp(my $blob = `git hash-object -w --stdin </dev/null`); for my $a (1..1000) { for my $b (1..1000) { print "100644 $blob\t$a/$b\n"; } } ' | git update-index --index-info git commit -qm add git rev-list HEAD -- . This case takes about 100ms now, but after this patch only needs 6ms. That's not a huge improvement, but it's easy to get and it protects us against even more pathological cases (e.g., going from 1 million to 10 million files would take ten times as long with the current code, but not increase at all after this patch). This is reported to minorly speed-up pathspec limiting in real world repositories (like the 100-million-file Windows repository), but probably won't make a noticeable difference outside of pathological setups. This patch actually covers the case without --remove-empty, and the case where we see only deletions. See the in-code comment for details. Note that we have to add a new member to the diff_options struct so that our callback can see the value of revs->remove_empty_trees. This callback parameter could be passed to the "add_remove" and "change" callbacks, but there's not much point. They already receive the diff_options struct, and doing it this way avoids having to update the function signature of the other callbacks (arguably the format_callback and output_prefix functions could benefit from the same simplification). Signed-off-by: Jeff King <peff@peff.net> --- diff.h | 1 + revision.c | 16 +++++++++++++--- 2 files changed, 14 insertions(+), 3 deletions(-) diff --git a/diff.h b/diff.h index 7dcfcfbef7..4a34d256f1 100644 --- a/diff.h +++ b/diff.h @@ -180,6 +180,7 @@ struct diff_options { pathchange_fn_t pathchange; change_fn_t change; add_remove_fn_t add_remove; + void *change_fn_data; diff_format_fn_t format_callback; void *format_callback_data; diff_prefix_fn_t output_prefix; diff --git a/revision.c b/revision.c index 8fd222f3bf..a3f245e2cc 100644 --- a/revision.c +++ b/revision.c @@ -399,8 +399,16 @@ static struct commit *one_relevant_parent(const struct rev_info *revs, * if the whole diff is removal of old data, and otherwise * REV_TREE_DIFFERENT (of course if the trees are the same we * want REV_TREE_SAME). - * That means that once we get to REV_TREE_DIFFERENT, we do not - * have to look any further. + * + * The only time we care about the distinction is when + * remove_empty_trees is in effect, in which case we care only about + * whether the whole change is REV_TREE_NEW, or if there's another type + * of change. Which means we can stop the diff early in either of these + * cases: + * + * 1. We're not using remove_empty_trees at all. + * + * 2. We saw anything except REV_TREE_NEW. */ static int tree_difference = REV_TREE_SAME; @@ -411,9 +419,10 @@ static void file_add_remove(struct diff_options *options, const char *fullpath, unsigned dirty_submodule) { int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD; + struct rev_info *revs = options->change_fn_data; tree_difference |= diff; - if (tree_difference == REV_TREE_DIFFERENT) + if (!revs->remove_empty_trees || tree_difference != REV_TREE_NEW) DIFF_OPT_SET(options, HAS_CHANGES); } @@ -1351,6 +1360,7 @@ void init_revisions(struct rev_info *revs, const char *prefix) DIFF_OPT_SET(&revs->pruning, QUICK); revs->pruning.add_remove = file_add_remove; revs->pruning.change = file_change; + revs->pruning.change_fn_data = revs; revs->sort_order = REV_SORT_IN_GRAPH_ORDER; revs->dense = 1; revs->prefix = prefix; -- 2.15.0.rc1.395.ga4290b5804 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH] revision: quit pruning diff more quickly when possible 2017-10-13 15:27 ` [PATCH] revision: quit pruning diff more quickly when possible Jeff King @ 2017-10-13 15:37 ` Derrick Stolee 2017-10-13 15:44 ` Jeff King 2017-10-14 2:43 ` Junio C Hamano 1 sibling, 1 reply; 23+ messages in thread From: Derrick Stolee @ 2017-10-13 15:37 UTC (permalink / raw) To: Jeff King; +Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git On 10/13/2017 11:27 AM, Jeff King wrote: > On Fri, Oct 13, 2017 at 10:26:46AM -0400, Jeff King wrote: > >> On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote: >> >>> This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is >>> causing diff_can_quit_early() to return false. Due to the corner-case of the >>> bug it seems it will not be a huge performance improvement in most cases. >>> Still worth fixing and I'm looking at your suggestions to try and learn this >>> area better. >> Yeah, I just timed some pathspec limits on linux.git, and it makes at >> best a fraction of a percent improvement (but any improvement is well >> within run-to-run noise). Which is not surprising. >> >> I agree it's worth fixing, though. > Here it is cleaned up and with a commit message. There's another case > that can be optimized, too: --remove-empty with an all-deletions commit. > That's probably even more obscure and pathological, but it was easy to > cover in the same breath. > > I didn't bother making a perf script, since this really isn't indicative > of real-world performance. If we wanted to do perf regression tests > here, I think the best path forward would be: > > 1. Make sure there the perf tests cover pathspecs (maybe in p0001?). > > 2. Make it easy to run the whole perf suite against a "bomb" repo. > This surely isn't the only slow thing of interest. > > -- >8 -- > Subject: revision: quit pruning diff more quickly when possible > > When the revision traversal machinery is given a pathspec, > we must compute the parent-diff for each commit to determine > which ones are TREESAME. We set the QUICK diff flag to avoid > looking at more entries than we need; we really just care > whether there are any changes at all. > > But there is one case where we want to know a bit more: if > --remove-empty is set, we care about finding cases where the > change consists only of added entries (in which case we may > prune the parent in try_to_simplify_commit()). To cover that > case, our file_add_remove() callback does not quit the diff > upon seeing an added entry; it keeps looking for other types > of entries. > > But this means when --remove-empty is not set (and it is not > by default), we compute more of the diff than is necessary. > You can see this in a pathological case where a commit adds > a very large number of entries, and we limit based on a > broad pathspec. E.g.: > > perl -e ' > chomp(my $blob = `git hash-object -w --stdin </dev/null`); > for my $a (1..1000) { > for my $b (1..1000) { > print "100644 $blob\t$a/$b\n"; > } > } > ' | git update-index --index-info > git commit -qm add > > git rev-list HEAD -- . > > This case takes about 100ms now, but after this patch only > needs 6ms. That's not a huge improvement, but it's easy to > get and it protects us against even more pathological cases > (e.g., going from 1 million to 10 million files would take > ten times as long with the current code, but not increase at > all after this patch). > > This is reported to minorly speed-up pathspec limiting in > real world repositories (like the 100-million-file Windows > repository), but probably won't make a noticeable difference > outside of pathological setups. > > This patch actually covers the case without --remove-empty, > and the case where we see only deletions. See the in-code > comment for details. > > Note that we have to add a new member to the diff_options > struct so that our callback can see the value of > revs->remove_empty_trees. This callback parameter could be > passed to the "add_remove" and "change" callbacks, but > there's not much point. They already receive the > diff_options struct, and doing it this way avoids having to > update the function signature of the other callbacks > (arguably the format_callback and output_prefix functions > could benefit from the same simplification). > > Signed-off-by: Jeff King <peff@peff.net> > --- > diff.h | 1 + > revision.c | 16 +++++++++++++--- > 2 files changed, 14 insertions(+), 3 deletions(-) > > diff --git a/diff.h b/diff.h > index 7dcfcfbef7..4a34d256f1 100644 > --- a/diff.h > +++ b/diff.h > @@ -180,6 +180,7 @@ struct diff_options { > pathchange_fn_t pathchange; > change_fn_t change; > add_remove_fn_t add_remove; > + void *change_fn_data; > diff_format_fn_t format_callback; > void *format_callback_data; > diff_prefix_fn_t output_prefix; > diff --git a/revision.c b/revision.c > index 8fd222f3bf..a3f245e2cc 100644 > --- a/revision.c > +++ b/revision.c > @@ -399,8 +399,16 @@ static struct commit *one_relevant_parent(const struct rev_info *revs, > * if the whole diff is removal of old data, and otherwise > * REV_TREE_DIFFERENT (of course if the trees are the same we > * want REV_TREE_SAME). > - * That means that once we get to REV_TREE_DIFFERENT, we do not > - * have to look any further. > + * > + * The only time we care about the distinction is when > + * remove_empty_trees is in effect, in which case we care only about > + * whether the whole change is REV_TREE_NEW, or if there's another type > + * of change. Which means we can stop the diff early in either of these > + * cases: > + * > + * 1. We're not using remove_empty_trees at all. > + * > + * 2. We saw anything except REV_TREE_NEW. > */ > static int tree_difference = REV_TREE_SAME; > > @@ -411,9 +419,10 @@ static void file_add_remove(struct diff_options *options, > const char *fullpath, unsigned dirty_submodule) > { > int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD; > + struct rev_info *revs = options->change_fn_data; > > tree_difference |= diff; > - if (tree_difference == REV_TREE_DIFFERENT) > + if (!revs->remove_empty_trees || tree_difference != REV_TREE_NEW) > DIFF_OPT_SET(options, HAS_CHANGES); > } > > @@ -1351,6 +1360,7 @@ void init_revisions(struct rev_info *revs, const char *prefix) > DIFF_OPT_SET(&revs->pruning, QUICK); > revs->pruning.add_remove = file_add_remove; > revs->pruning.change = file_change; > + revs->pruning.change_fn_data = revs; > revs->sort_order = REV_SORT_IN_GRAPH_ORDER; > revs->dense = 1; > revs->prefix = prefix; Thanks, Peff. This patch looks good to me. I tried a few other things like adding a flag DIFF_OPT_HAS_ANY_CHANGE next to DIFF_OPT_HAS_CHANGES that we could check in diff_can_quit_early() but it had side-effects that broke existing tests. From this exploration, it does seem necessary to be aware of 'remove_empty_trees'. Thanks, -Stolee ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] revision: quit pruning diff more quickly when possible 2017-10-13 15:37 ` Derrick Stolee @ 2017-10-13 15:44 ` Jeff King 0 siblings, 0 replies; 23+ messages in thread From: Jeff King @ 2017-10-13 15:44 UTC (permalink / raw) To: Derrick Stolee Cc: Constantine, Junio C Hamano, Christian Couder, Mike Hommey, git On Fri, Oct 13, 2017 at 11:37:50AM -0400, Derrick Stolee wrote: > Thanks, Peff. This patch looks good to me. > > I tried a few other things like adding a flag DIFF_OPT_HAS_ANY_CHANGE next > to DIFF_OPT_HAS_CHANGES that we could check in diff_can_quit_early() but it > had side-effects that broke existing tests. From this exploration, it does > seem necessary to be aware of 'remove_empty_trees'. Keep in mind that the regular diff_change callbacks already handle this case[1]. The file_change callbacks are specific to the revision machinery's pruning diff, and intentionally hold back the HAS_CHANGES flag. -Peff [1] I tried "git diff-tree --root -r --quiet 45546f17e" on the bomb repo, and it went quickly. Dropping --quiet makes it take a really long time. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] revision: quit pruning diff more quickly when possible 2017-10-13 15:27 ` [PATCH] revision: quit pruning diff more quickly when possible Jeff King 2017-10-13 15:37 ` Derrick Stolee @ 2017-10-14 2:43 ` Junio C Hamano 1 sibling, 0 replies; 23+ messages in thread From: Junio C Hamano @ 2017-10-14 2:43 UTC (permalink / raw) To: Jeff King; +Cc: Derrick Stolee, Constantine, Christian Couder, Mike Hommey, git Jeff King <peff@peff.net> writes: > Here it is cleaned up and with a commit message. There's another case > that can be optimized, too: --remove-empty with an all-deletions commit. > That's probably even more obscure and pathological, but it was easy to > cover in the same breath. This one looks good. It appears that again you guys had all the fun while I was offline ;-). And I am happy to see that we didn't veer in the direction to optimize for a wrong case by keeping track of what trees we already saw and things like that, of course. Thanks. > Subject: revision: quit pruning diff more quickly when possible > > When the revision traversal machinery is given a pathspec, > we must compute the parent-diff for each commit to determine > which ones are TREESAME. We set the QUICK diff flag to avoid > looking at more entries than we need; we really just care > whether there are any changes at all. > > But there is one case where we want to know a bit more: if > --remove-empty is set, we care about finding cases where the > change consists only of added entries (in which case we may > prune the parent in try_to_simplify_commit()). To cover that > case, our file_add_remove() callback does not quit the diff > upon seeing an added entry; it keeps looking for other types > of entries. > > But this means when --remove-empty is not set (and it is not > by default), we compute more of the diff than is necessary. > You can see this in a pathological case where a commit adds > a very large number of entries, and we limit based on a > broad pathspec. E.g.: > > perl -e ' > chomp(my $blob = `git hash-object -w --stdin </dev/null`); > for my $a (1..1000) { > for my $b (1..1000) { > print "100644 $blob\t$a/$b\n"; > } > } > ' | git update-index --index-info > git commit -qm add > > git rev-list HEAD -- . > > This case takes about 100ms now, but after this patch only > needs 6ms. That's not a huge improvement, but it's easy to > get and it protects us against even more pathological cases > (e.g., going from 1 million to 10 million files would take > ten times as long with the current code, but not increase at > all after this patch). > > This is reported to minorly speed-up pathspec limiting in > real world repositories (like the 100-million-file Windows > repository), but probably won't make a noticeable difference > outside of pathological setups. > > This patch actually covers the case without --remove-empty, > and the case where we see only deletions. See the in-code > comment for details. > > Note that we have to add a new member to the diff_options > struct so that our callback can see the value of > revs->remove_empty_trees. This callback parameter could be > passed to the "add_remove" and "change" callbacks, but > there's not much point. They already receive the > diff_options struct, and doing it this way avoids having to > update the function signature of the other callbacks > (arguably the format_callback and output_prefix functions > could benefit from the same simplification). > > Signed-off-by: Jeff King <peff@peff.net> > --- > diff.h | 1 + > revision.c | 16 +++++++++++++--- > 2 files changed, 14 insertions(+), 3 deletions(-) > > diff --git a/diff.h b/diff.h > index 7dcfcfbef7..4a34d256f1 100644 > --- a/diff.h > +++ b/diff.h > @@ -180,6 +180,7 @@ struct diff_options { > pathchange_fn_t pathchange; > change_fn_t change; > add_remove_fn_t add_remove; > + void *change_fn_data; > diff_format_fn_t format_callback; > void *format_callback_data; > diff_prefix_fn_t output_prefix; > diff --git a/revision.c b/revision.c > index 8fd222f3bf..a3f245e2cc 100644 > --- a/revision.c > +++ b/revision.c > @@ -399,8 +399,16 @@ static struct commit *one_relevant_parent(const struct rev_info *revs, > * if the whole diff is removal of old data, and otherwise > * REV_TREE_DIFFERENT (of course if the trees are the same we > * want REV_TREE_SAME). > - * That means that once we get to REV_TREE_DIFFERENT, we do not > - * have to look any further. > + * > + * The only time we care about the distinction is when > + * remove_empty_trees is in effect, in which case we care only about > + * whether the whole change is REV_TREE_NEW, or if there's another type > + * of change. Which means we can stop the diff early in either of these > + * cases: > + * > + * 1. We're not using remove_empty_trees at all. > + * > + * 2. We saw anything except REV_TREE_NEW. > */ > static int tree_difference = REV_TREE_SAME; > > @@ -411,9 +419,10 @@ static void file_add_remove(struct diff_options *options, > const char *fullpath, unsigned dirty_submodule) > { > int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD; > + struct rev_info *revs = options->change_fn_data; > > tree_difference |= diff; > - if (tree_difference == REV_TREE_DIFFERENT) > + if (!revs->remove_empty_trees || tree_difference != REV_TREE_NEW) > DIFF_OPT_SET(options, HAS_CHANGES); > } > > @@ -1351,6 +1360,7 @@ void init_revisions(struct rev_info *revs, const char *prefix) > DIFF_OPT_SET(&revs->pruning, QUICK); > revs->pruning.add_remove = file_add_remove; > revs->pruning.change = file_change; > + revs->pruning.change_fn_data = revs; > revs->sort_order = REV_SORT_IN_GRAPH_ORDER; > revs->dense = 1; > revs->prefix = prefix; ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: git-clone causes out of memory 2017-10-13 10:06 ` Mike Hommey 2017-10-13 10:26 ` Christian Couder @ 2017-10-13 12:35 ` Jeff King 1 sibling, 0 replies; 23+ messages in thread From: Jeff King @ 2017-10-13 12:35 UTC (permalink / raw) To: Mike Hommey; +Cc: Constantine, git On Fri, Oct 13, 2017 at 07:06:03PM +0900, Mike Hommey wrote: > On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote: > > There's a gitbomb on github. It is undoubtedly creative and funny, but since > > this is a bug in git, I thought it'd be nice to report. The command: > > > > $ git clone https://github.com/x0rz/ShadowBrokersFiles > > What fills memory is actually the checkout part of the command. git > clone -n doesn't fail. > > Credit should go where it's due: https://kate.io/blog/git-bomb/ > (with the bonus that it comes with explanations) That is a nice explanation, and I think the dates point to the kate.io post as the parent there. I certainly have come across this type of "bomb" repo before, but I couldn't come up with any public references (somebody uploaded such a repository to GitHub in 2014). So I think this may be the first public discussion of the idea. -Peff ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2017-10-14 2:43 UTC | newest] Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-10-13 9:51 git-clone causes out of memory Constantine 2017-10-13 10:06 ` Mike Hommey 2017-10-13 10:26 ` Christian Couder 2017-10-13 10:37 ` Mike Hommey 2017-10-13 10:44 ` Christian Couder 2017-10-13 12:04 ` Junio C Hamano 2017-10-13 12:12 ` Constantine 2017-10-13 12:44 ` Jeff King 2017-10-13 13:15 ` Derrick Stolee 2017-10-13 13:39 ` Derrick Stolee 2017-10-13 13:50 ` Jeff King 2017-10-13 13:55 ` Derrick Stolee 2017-10-13 13:56 ` Jeff King 2017-10-13 14:10 ` Jeff King 2017-10-13 14:20 ` Jeff King 2017-10-13 14:25 ` Derrick Stolee 2017-10-13 14:26 ` Jeff King 2017-10-13 14:30 ` Derrick Stolee 2017-10-13 15:27 ` [PATCH] revision: quit pruning diff more quickly when possible Jeff King 2017-10-13 15:37 ` Derrick Stolee 2017-10-13 15:44 ` Jeff King 2017-10-14 2:43 ` Junio C Hamano 2017-10-13 12:35 ` git-clone causes out of memory Jeff King
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).