git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Jeff King <peff@peff.net>, Constantine <hi-angel@yandex.ru>
Cc: Junio C Hamano <gitster@pobox.com>,
	Christian Couder <christian.couder@gmail.com>,
	Mike Hommey <mh@glandium.org>, git <git@vger.kernel.org>
Subject: Re: git-clone causes out of memory
Date: Fri, 13 Oct 2017 09:39:14 -0400	[thread overview]
Message-ID: <a4ebf552-35d4-d55f-6f08-731afa2cd2de@gmail.com> (raw)
In-Reply-To: <f35d03b5-a525-87b3-a426-bd892edf0c36@gmail.com>

On 10/13/2017 9:15 AM, Derrick Stolee wrote:
> On 10/13/2017 8:44 AM, Jeff King wrote:
>> On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote:
>>
>>> On 13.10.2017 15:04, Junio C Hamano wrote:
>>>> Christian Couder <christian.couder@gmail.com> writes:
>>>>
>>>>> Yeah, but perhaps Git could be smarter when rev-listing too and avoid
>>>>> processing files or directories it has already seen?
>>>> Aren't you suggesting to optimize for a wrong case?
>>>>
>>> Anything that is possible with a software should be considered as a 
>>> possible
>>> usecase. It's in fact a DoS attack. Imagine there's a server that 
>>> using git
>>> to process something, and now there's a way to knock down this 
>>> server. It's
>>> also bad from a promotional stand point.
>> But the point is that you'd have the same problem with any repository
>> that had 10^7 files in it. Yes, it's convenient for the attacker that
>> there are only 9 objects, but fundamentally it's pretty easy for an
>> attacker to construct repositories that have large trees (or very deep
>> trees -- that's what causes stack exhaustion in some cases).
>>
>> Note too that this attack almost always comes down to the diff code
>> (which is why it kicks in for pathspec limiting) which has to actually
>> expand the tree. Most "normal" server-side operations (like accepting
>> pushes or serving fetches) operate only on the object graph and _do_
>> avoid processing already-seen objects.
>>
>> As soon as servers start trying to checkout or diff, though, the attack
>> surface gets quite large. And you really need to start thinking about
>> having resource limits and quotas for CPU and memory use of each process
>> (and group by requesting user, IP, repo, etc).
>>
>> I think the main thing Git could be doing here is to limit the size of
>> the tree (both width and depth). But arbitrary limits like that have a
>> way of being annoying, and I think it just pushes resource-exhaustion
>> attacks off a little (e.g., can you construct a blob that behaves badly
>> with the "--patch"?).
>>
>> -Peff
>
> I'm particularly interested in why `git rev-list HEAD -- [path]` gets 
> slower in this case, because I wrote the history algorithm used by 
> VSTS. In our algorithm, we only walk the list of objects from commit 
> to the tree containing the path item. For example, in the path 
> d0/d0/d0, we would only walk:
>
>     commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 
> entry]
>
> From this, we can determine the TREESAME relationship by parsing four 
> objects without parsing all contents below d0/d0/d0.
>
> The reason we have exponential behavior in `git rev-list` is because 
> we are calling diff_tree_oid() in tree-diff.c recursively without 
> short-circuiting on equal OIDs.
>
> I will prepare a patch that adds this OID-equal short-circuit to avoid 
> this exponential behavior. I'll model my patch against a similar patch 
> in master:
>
>     commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees: 
> avoid duplicate ODB lookups during checkout
>
> It will also significantly speed up rev-list calls for short paths in 
> deep repositories. It will not be very measurable in the git or Linux 
> repos because their shallow folder structure.
>
> Thanks,
> -Stolee

Since I don't understand enough about the consumers to diff_tree_oid() 
(and the fact that the recursive behavior may be wanted in some cases), 
I think we can fix this in builtin/rev-list.c with this simple diff:

---

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ded1577424..b2e8e02cc8 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const 
char *prefix)

         git_config(git_default_config, NULL);
         init_revisions(&revs, prefix);
+
+       revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE;
+
         revs.abbrev = DEFAULT_ABBREV;
         revs.commit_format = CMIT_FMT_UNSPECIFIED;
         argc = setup_revisions(argc, argv, &revs, NULL);


  reply	other threads:[~2017-10-13 13:39 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-13  9:51 git-clone causes out of memory Constantine
2017-10-13 10:06 ` Mike Hommey
2017-10-13 10:26   ` Christian Couder
2017-10-13 10:37     ` Mike Hommey
2017-10-13 10:44       ` Christian Couder
2017-10-13 12:04         ` Junio C Hamano
2017-10-13 12:12           ` Constantine
2017-10-13 12:44             ` Jeff King
2017-10-13 13:15               ` Derrick Stolee
2017-10-13 13:39                 ` Derrick Stolee [this message]
2017-10-13 13:50                   ` Jeff King
2017-10-13 13:55                     ` Derrick Stolee
2017-10-13 13:56                       ` Jeff King
2017-10-13 14:10                         ` Jeff King
2017-10-13 14:20                           ` Jeff King
2017-10-13 14:25                             ` Derrick Stolee
2017-10-13 14:26                               ` Jeff King
2017-10-13 14:30                                 ` Derrick Stolee
2017-10-13 15:27                                 ` [PATCH] revision: quit pruning diff more quickly when possible Jeff King
2017-10-13 15:37                                   ` Derrick Stolee
2017-10-13 15:44                                     ` Jeff King
2017-10-14  2:43                                   ` Junio C Hamano
2017-10-13 12:35   ` git-clone causes out of memory Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a4ebf552-35d4-d55f-6f08-731afa2cd2de@gmail.com \
    --to=stolee@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=hi-angel@yandex.ru \
    --cc=mh@glandium.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).