From: Roman Shaposhnik <rvs@sun.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: git@vger.kernel.org
Subject: Re: Achieving efficient storage of weirdly structured repos
Date: Fri, 04 Apr 2008 16:30:58 -0700
Message-ID: <1207351858.13123.52.camel@work.sfbay.sun.com>
In-Reply-To: <alpine.LFD.1.00.0804031402530.14670@woody.linux-foundation.org>

Hi Linus!

On Thu, 2008-04-03 at 14:11 -0700, Linus Torvalds wrote:
> 
> On Thu, 3 Apr 2008, Roman Shaposhnik wrote:
> > 
> > The repository was created using hg2git (the one based on git-fast-import)
> > and it was GC'ed and REPACK'ed just in case.
> 
> Before going any further - exactly _how_ was it repacked?

I believe it was the following two steps:
   $ git gc --aggressive
   $ git repack

> In particular, when using importers that do partial packing on their own 
> (and any "git-fastimport" user is that by definition - and I think 
> hg2git does that), at the end of it all you have to make sure to repack in 
> a way where the repacking will totally discard the import-time packfiles.

Good point. Speaking of which: do you have an FAQ for importers? The
entries in the official FAQ (http://git.or.cz/gitwiki/GitFaq#head-929a8825d04dde226c2530f5337d3b3ed8dcc7ce)
seem a bit stale for such an important issue. After all, importing from
an existing SCM is what usually forms a first time impression of Git's
effectiveness.

> IOW, that's one of the very few times you should use "-f" to git repack.

Got it!

> It's usually also a good place to make sure that since you ignore the old 
> packing information, it's best to also make sure that the new packing info 
> is good by using a bigger window (and perhaps a bigger depth). That makes 
> the packing much slower, of course, but this is meant to be a one-time 
> event.
> 
> So try something like
> 
> 	git repack -a -d -f --depth=100 --window=100
> 
> if you have a good CPU and plenty of memory.

That turned out to be a perfect suggestion. Thank you. I'm now the
happiest camper ever. And I'm also pretty dumbfounded ;-)
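
For reference, the one-time repack suggested above can be wrapped in a
small helper that also reports the resulting pack statistics. This is only
a minimal sketch: the function name `repack_after_import` is made up for
illustration, and the window and depth defaults are simply the values from
the suggestion, not tuned recommendations.

```shell
#!/bin/sh
# repack_after_import REPO [WINDOW] [DEPTH]
# Hypothetical helper: throw away existing delta choices, repack everything
# into a single pack, then print object and pack statistics.
repack_after_import() {
    repo=$1
    window=${2:-100}
    depth=${3:-100}
    # -a: put everything referenced into one pack
    # -d: delete the now-redundant old packs and loose objects
    # -f: recompute deltas instead of reusing the import-time ones
    git -C "$repo" repack -a -d -f -q \
        --window="$window" --depth="$depth" || return 1
    # "size-pack" in this output is reported in KiB.
    git -C "$repo" count-objects -v
}
```

Calling `repack_after_import /path/to/repo` and reading the `packs:` and
`size-pack:` lines gives a quick before/after comparison without `du`.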

Here's what happened. 

I started with a repository filled with "loose" (one object per file)
objects (I needed it for the ease of sleuthing through
individual objects, and it was created by git-unpack-objects from that
initial 1.1GB pack). And I tried to pack it exactly like you
suggested:
   $ git-pack-objects --depth=100 --window=100 --delta-base-offset --progress pack < objects
   Generating pack...
   Counting objects: 1096305
   Done counting 1159628 objects.
   Deltifying 1159628 objects...
      100% (1159628/1159628) done
   Writing 1159628 objects...
   dd134c407324dc55b0cd2aa3a9e1b3420c2bba3f

   Total 1159628 (delta 386980), reused 0 (delta 0)

and it paid off reasonably well:
    $ du -s NB-clone
    670M NB-clone

It still was bigger than the Mercurial repository but at least it got
2 times smaller than the original result of hg2git. Now, if it wasn't
for a friend of mine, I probably would've stopped there. But he
showed up and saved the day ;-) His comments made me try something
that I didn't consider to be of any use -- repacking a freshly packed
pack with the *same* --depth=100 --window=100:
    $ git repack -a  -f --window=100 --depth=100 
    Generating pack...
    Counting objects: 1056829
    Done counting 1159628 objects.
    Deltifying 1159628 objects...
       100% (1159628/1159628) done
    Writing 1159628 objects...
       100% (1159628/1159628) done
    Total 1159628 (delta 614516), reused 0 (delta 0)
    Pack pack-dd134c407324dc55b0cd2aa3a9e1b3420c2bba3f created.
And then, a miracle occurred:
     $ du -sh NB-small 
     268M NB-small

Now, don't get me wrong: I'm as happy as a clam. The repository is now
*smaller* than Mercurial's, and because the structure of the
tree is so weird Git gets major points here. The only question that
is still bothering me is: how did it happen? Why did repacking
a repository with exactly the same set of objects, the only
difference being where these objects resided (the filesystem in the
former case, an intermediate pack in the latter), make such a huge
difference?
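
One thing that may be worth ruling out (this is an assumption, not
something established in this message) is whether the two runs really got
the same input. git-pack-objects reads object IDs from standard input, and
each line may carry a path after the ID, which is the form that
`git rev-list --objects` prints. If the `objects` file held bare IDs only,
the first run had no names to go by. A minimal sketch for checking that,
with a hypothetical function name:

```shell
#!/bin/sh
# count_named_entries: read a pack-objects style object list on stdin and
# report how many lines have anything after the object ID.
count_named_entries() {
    awk 'NF > 1 { named++ }
         END { printf "%d of %d entries carry a path\n", named, NR }'
}

# Usage, assuming the list lives in a file called "objects" as above:
#   count_named_entries < objects
# A list that includes paths could be produced for comparison with:
#   git rev-list --objects --all
```

Commits never carry a path in such a list, so a count somewhat below the
total is expected; a count of zero would mean no names at all were supplied.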

Please help!

> > The last item (trees) also seems to take the most space and the most 
> > reasonable explanation that I can offer is that NetBeans repository has 
> > a really weird structure where they have approximately 700 (yes, seven 
> > hundred!) top-level subdirectories there. They are clearly 
> > Submodules-shy, but that's another issue that I will need to address 
> > with them.
> 
> Trees taking the biggest amount of space is not unheard of, and it may 
> also be that the name heuristics (for finding good packing partners) could 
> be failing, which would result in a much bigger pack than necessary. 

Is there any documentation that describes the heuristics involved in
creating a pack?

> So if you already did an aggressive repack like the above, I'd happily 
> take a look at whether maybe it's bad heuristics for finding tree objects 
> to pair up for delta-compression. Do you have a place where you can put 
> that repo for people to clone and look at?

Unfortunately I don't. The only thing I can do is create
a *.tar.bz2 and put it on Sun's FTP server. Actually, that makes me
wonder: is there any public Git hosting available where publishing
a hefty repository for forensic purposes only wouldn't violate the
terms of use?

Thanks,
Roman.

P.S. Oh, and here's one extra tiny question that I also have: what
does the output:
   Total 1159628 (delta 614516), reused 0 (delta 0)
really mean?
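
As a rough aid for comparing the two runs above, the summary line can be
picked apart with a few lines of shell. The sketch assumes the usual
reading of that line: objects written, how many of those were stored as
deltas, then how many were reused from existing packs and how many of the
reused ones were deltas. The function name `delta_share` is hypothetical.

```shell
#!/bin/sh
# delta_share: read a "Total N (delta M), reused R (delta D)" line on stdin
# and print the percentage of written objects that were stored as deltas.
delta_share() {
    # Turn every non-digit into a space so the fields become: N M R D
    tr -c '0-9\n' ' ' |
        awk 'NF >= 2 && $1 > 0 { printf "%.1f\n", 100 * $2 / $1 }'
}

# Summary line from the first pack (670M on disk)
echo "Total 1159628 (delta 386980), reused 0 (delta 0)" | delta_share
# Summary line from the second pack (268M on disk)
echo "Total 1159628 (delta 614516), reused 0 (delta 0)" | delta_share
```

Under that reading, about a third of the objects were deltified in the
first run and a little over half in the second, which is at least
consistent with the difference in size.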
