From: Derrick Stolee <firstname.lastname@example.org> To: Jeff King <email@example.com>, Constantine <firstname.lastname@example.org> Cc: Junio C Hamano <email@example.com>, Christian Couder <firstname.lastname@example.org>, Mike Hommey <email@example.com>, git <firstname.lastname@example.org> Subject: Re: git-clone causes out of memory Date: Fri, 13 Oct 2017 09:39:14 -0400 [thread overview] Message-ID: <email@example.com> (raw) In-Reply-To: <firstname.lastname@example.org> On 10/13/2017 9:15 AM, Derrick Stolee wrote: > On 10/13/2017 8:44 AM, Jeff King wrote: >> On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote: >> >>> On 13.10.2017 15:04, Junio C Hamano wrote: >>>> Christian Couder <email@example.com> writes: >>>> >>>>> Yeah, but perhaps Git could be smarter when rev-listing too and avoid >>>>> processing files or directories it has already seen? >>>> Aren't you suggesting to optimize for a wrong case? >>>> >>> Anything that is possible with a software should be considered as a >>> possible >>> usecase. It's in fact a DoS attack. Imagine there's a server that >>> using git >>> to process something, and now there's a way to knock down this >>> server. It's >>> also bad from a promotional stand point. >> But the point is that you'd have the same problem with any repository >> that had 10^7 files in it. Yes, it's convenient for the attacker that >> there are only 9 objects, but fundamentally it's pretty easy for an >> attacker to construct repositories that have large trees (or very deep >> trees -- that's what causes stack exhaustion in some cases). >> >> Note too that this attack almost always comes down to the diff code >> (which is why it kicks in for pathspec limiting) which has to actually >> expand the tree. Most "normal" server-side operations (like accepting >> pushes or serving fetches) operate only on the object graph and _do_ >> avoid processing already-seen objects. >> >> As soon as servers start trying to checkout or diff, though, the attack >> surface gets quite large. And you really need to start thinking about >> having resource limits and quotas for CPU and memory use of each process >> (and group by requesting user, IP, repo, etc). >> >> I think the main thing Git could be doing here is to limit the size of >> the tree (both width and depth). But arbitrary limits like that have a >> way of being annoying, and I think it just pushes resource-exhaustion >> attacks off a little (e.g., can you construct a blob that behaves badly >> with the "--patch"?). >> >> -Peff > > I'm particularly interested in why `git rev-list HEAD -- [path]` gets > slower in this case, because I wrote the history algorithm used by > VSTS. In our algorithm, we only walk the list of objects from commit > to the tree containing the path item. For example, in the path > d0/d0/d0, we would only walk: > > commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 > entry] > > From this, we can determine the TREESAME relationship by parsing four > objects without parsing all contents below d0/d0/d0. > > The reason we have exponential behavior in `git rev-list` is because > we are calling diff_tree_oid() in tree-diff.c recursively without > short-circuiting on equal OIDs. > > I will prepare a patch that adds this OID-equal short-circuit to avoid > this exponential behavior. I'll model my patch against a similar patch > in master: > > commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees: > avoid duplicate ODB lookups during checkout > > It will also significantly speed up rev-list calls for short paths in > deep repositories. It will not be very measurable in the git or Linux > repos because their shallow folder structure. > > Thanks, > -Stolee Since I don't understand enough about the consumers to diff_tree_oid() (and the fact that the recursive behavior may be wanted in some cases), I think we can fix this in builtin/rev-list.c with this simple diff: --- diff --git a/builtin/rev-list.c b/builtin/rev-list.c index ded1577424..b2e8e02cc8 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) git_config(git_default_config, NULL); init_revisions(&revs, prefix); + + revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE; + revs.abbrev = DEFAULT_ABBREV; revs.commit_format = CMIT_FMT_UNSPECIFIED; argc = setup_revisions(argc, argv, &revs, NULL);
next prev parent reply other threads:[~2017-10-13 13:39 UTC|newest] Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top 2017-10-13 9:51 Constantine 2017-10-13 10:06 ` Mike Hommey 2017-10-13 10:26 ` Christian Couder 2017-10-13 10:37 ` Mike Hommey 2017-10-13 10:44 ` Christian Couder 2017-10-13 12:04 ` Junio C Hamano 2017-10-13 12:12 ` Constantine 2017-10-13 12:44 ` Jeff King 2017-10-13 13:15 ` Derrick Stolee 2017-10-13 13:39 ` Derrick Stolee [this message] 2017-10-13 13:50 ` Jeff King 2017-10-13 13:55 ` Derrick Stolee 2017-10-13 13:56 ` Jeff King 2017-10-13 14:10 ` Jeff King 2017-10-13 14:20 ` Jeff King 2017-10-13 14:25 ` Derrick Stolee 2017-10-13 14:26 ` Jeff King 2017-10-13 14:30 ` Derrick Stolee 2017-10-13 15:27 ` [PATCH] revision: quit pruning diff more quickly when possible Jeff King 2017-10-13 15:37 ` Derrick Stolee 2017-10-13 15:44 ` Jeff King 2017-10-14 2:43 ` Junio C Hamano 2017-10-13 12:35 ` git-clone causes out of memory Jeff King
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style List information: http://vger.kernel.org/majordomo-info.html * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --subject='Re: git-clone causes out of memory' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
Code repositories for project(s) associated with this inbox: https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).