git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Mike Hommey <mh@glandium.org>
To: Christian Couder <christian.couder@gmail.com>
Cc: Constantine <hi-angel@yandex.ru>, git <git@vger.kernel.org>
Subject: Re: git-clone causes out of memory
Date: Fri, 13 Oct 2017 19:37:22 +0900	[thread overview]
Message-ID: <20171013103722.rvr7536mu2hoo4wb@glandium.org> (raw)
In-Reply-To: <CAP8UFD1KuBdUCo=x_q4__s1kW15CWMH1jJkKzXqmf3=T3jcrng@mail.gmail.com>

On Fri, Oct 13, 2017 at 12:26:46PM +0200, Christian Couder wrote:
> On Fri, Oct 13, 2017 at 12:06 PM, Mike Hommey <mh@glandium.org> wrote:
> > On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote:
> >> There's a gitbomb on github. It is undoubtedly creative and funny, but since
> >> this is a bug in git, I thought it'd be nice to report. The command:
> >>
> >>       $ git clone https://github.com/x0rz/ShadowBrokersFiles
> >
> > What fills memory is actually the checkout part of the command. git
> > clone -n doesn't fail.
> >
> > Credit should go where it's due: https://kate.io/blog/git-bomb/
> > (with the bonus that it comes with explanations)
> 
> Yeah, there is a thread on Hacker News about this too:
> 
> https://news.ycombinator.com/item?id=15457076
> 
> The original repo on GitHub is:
> 
> https://github.com/Katee/git-bomb.git
> 
> After cloning it with -n, there is the following "funny" situation:
> 
> $ time git rev-list HEAD
> 7af99c9e7d4768fa681f4fe4ff61259794cf719b
> 18ed56cbc5012117e24a603e7c072cf65d36d469
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m0.004s
> user    0m0.000s
> sys     0m0.004s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0/f0
> 
> real    0m0.004s
> user    0m0.000s
> sys     0m0.000s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0
> 
> real    0m0.004s
> user    0m0.000s
> sys     0m0.000s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m0.005s
> user    0m0.008s
> sys     0m0.000s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m0.203s
> user    0m0.112s
> sys     0m0.088s
> $ time git rev-list HEAD -- d0/d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m1.305s
> user    0m0.720s
> sys     0m0.580s
> $ time git rev-list HEAD -- d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m12.135s
> user    0m6.700s
> sys     0m5.412s
> 
> So `git rev-list` becomes exponentially more expensive when you run it
> on a shorter directory path, though it is fast if you run it without a
> path.

That's because there are 10^7 files under d0/d0/d0, 10^6 under
d0/d0/d0/d0/, 10^5 under d0/d0/d0/d0/d0/ etc.

So really, this is all about things being slower when there's a crazy
number of files. Picture me surprised.

What makes it kind of special is that the repository contains a lot of
paths/files, but very few objects, because it's duplicating everything.

All the 10^10 blobs have the same content, all the 10^9 trees that point
to them have the same content, all the 10^8 trees that point to those
trees have the same content, etc.

If git wasn't effectively deduplicating identical content, the repository
would be multiple gigabytes large.

Mike

  reply	other threads:[~2017-10-13 10:37 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-13  9:51 git-clone causes out of memory Constantine
2017-10-13 10:06 ` Mike Hommey
2017-10-13 10:26   ` Christian Couder
2017-10-13 10:37     ` Mike Hommey [this message]
2017-10-13 10:44       ` Christian Couder
2017-10-13 12:04         ` Junio C Hamano
2017-10-13 12:12           ` Constantine
2017-10-13 12:44             ` Jeff King
2017-10-13 13:15               ` Derrick Stolee
2017-10-13 13:39                 ` Derrick Stolee
2017-10-13 13:50                   ` Jeff King
2017-10-13 13:55                     ` Derrick Stolee
2017-10-13 13:56                       ` Jeff King
2017-10-13 14:10                         ` Jeff King
2017-10-13 14:20                           ` Jeff King
2017-10-13 14:25                             ` Derrick Stolee
2017-10-13 14:26                               ` Jeff King
2017-10-13 14:30                                 ` Derrick Stolee
2017-10-13 15:27                                 ` [PATCH] revision: quit pruning diff more quickly when possible Jeff King
2017-10-13 15:37                                   ` Derrick Stolee
2017-10-13 15:44                                     ` Jeff King
2017-10-14  2:43                                   ` Junio C Hamano
2017-10-13 12:35   ` git-clone causes out of memory Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171013103722.rvr7536mu2hoo4wb@glandium.org \
    --to=mh@glandium.org \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=hi-angel@yandex.ru \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).