From: Roman Shaposhnik <rvs@sun.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: git@vger.kernel.org
Subject: Re: Achieving efficient storage of weirdly structured repos
Date: Fri, 04 Apr 2008 16:30:58 -0700
Message-ID: <1207351858.13123.52.camel@work.sfbay.sun.com>
In-Reply-To: <alpine.LFD.1.00.0804031402530.14670@woody.linux-foundation.org>
Hi Linus!
On Thu, 2008-04-03 at 14:11 -0700, Linus Torvalds wrote:
>
> On Thu, 3 Apr 2008, Roman Shaposhnik wrote:
> >
> > The repository was created using hg2git (the one based on git-fast-import)
> > and it was GC'ed and REPACK'ed just in case.
>
> Before going any further - exactly _how_ was it repacked?
I believe it was the following two steps:
$ git gc --aggressive
$ git repack
> In particular, when using importers that do partial packing on their own
> (and any "git-fastimport" user is that by definition - and I think
> hg2git does that), at the end of it all you have to make sure to repack in
> a way where the repacking will totally discard the import-time packfiles.
Good point. Speaking of which: do you have an FAQ for importers? The
entries in the official FAQ (http://git.or.cz/gitwiki/GitFaq#head-929a8825d04dde226c2530f5337d3b3ed8dcc7ce)
seem a bit stale for such an important issue. After all, importing from
an existing SCM is what usually forms the first impression of Git's
effectiveness.
> IOW, that's one of the very few times you should use "-f" to git repack.
Got it!
> It's usually also a good place to make sure that since you ignore the old
> packing information, it's best to also make sure that the new packing info
> is good by using a bigger window (and perhaps a bigger depth). That makes
> the packing much slower, of course, but this is meant to be a one-time
> event.
>
> So try something like
>
> git repack -a -d -f --depth=100 --window=100
>
> if you have a good CPU and plenty of memory.
That turned out to be a perfect suggestion. Thank you. I'm now the
happiest camper ever. And I'm also pretty dumbfounded ;-)
Here's what happened.
I started with a repository filled with "loose" (one object per file)
objects (the reason I needed it was for the ease of sleuthing through
individual objects and it was created by git-unpack-objects from that
initial 1.1Gb pack). And I tried to pack it exactly like you
suggested:
$ git-pack-objects --depth=100 --window=100 --delta-base-offset --progress pack < objects
Generating pack...
Counting objects: 1096305
Done counting 1159628 objects.
Deltifying 1159628 objects...
100% (1159628/1159628) done
Writing 1159628 objects...
dd134c407324dc55b0cd2aa3a9e1b3420c2bba3f
Total 1159628 (delta 386980), reused 0 (delta 0)
and it paid off reasonably well:
$ du -sh NB-clone
670M NB-clone
It still was bigger than the Mercurial repository but at least it got
2 times smaller than the original result of hg2git. Now, if it wasn't
for a friend of mine, I probably would've stopped there. But he
showed up and saved the day ;-) His comments made me try something
that I didn't consider to be of any use -- repacking a freshly packed
pack with the *same* --depth=100 --window=100:
$ git repack -a -f --window=100 --depth=100
Generating pack...
Counting objects: 1056829
Done counting 1159628 objects.
Deltifying 1159628 objects...
100% (1159628/1159628) done
Writing 1159628 objects...
100% (1159628/1159628) done
Total 1159628 (delta 614516), reused 0 (delta 0)
Pack pack-dd134c407324dc55b0cd2aa3a9e1b3420c2bba3f created.
And then, a miracle occurred:
$ du -sh NB-small
268M NB-small
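
The two summary lines above can be compared with a small helper. This is only a sketch: it assumes the number after "delta" is the count of objects written as deltas, and `delta_ratio` is a hypothetical name, not a Git command.

```shell
# Hypothetical helper: reduce a pack summary line to a delta ratio.
# Input:  a line such as "Total 1159628 (delta 614516), reused 0 (delta 0)"
# Output: "<deltas> of <total> objects stored as deltas (<percent>%)"
delta_ratio() {
    echo "$1" | awk '{
        # Strip punctuation so the counts become plain fields:
        # Total <total> delta <deltas> reused <n> delta <m>
        gsub(/[(),]/, "")
        printf "%d of %d objects stored as deltas (%.1f%%)\n", $4, $2, 100 * $4 / $2
    }'
}

# The two runs from this message:
delta_ratio "Total 1159628 (delta 386980), reused 0 (delta 0)"
delta_ratio "Total 1159628 (delta 614516), reused 0 (delta 0)"
```

Under that assumption, the first run stored roughly a third of the objects as deltas and the second run just over half, which is at least consistent with the drop in size.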
Now, don't get me wrong: I'm as happy as a clam. The repository is now
*smaller* than Mercurial's, and because the structure of the
tree is so weird, Git gets major points here. The only question that
is still bothering me is: how did it happen? Why did repacking
a repository with exactly the same set of objects, the only
difference being where these objects resided (the filesystem in the
former case, an intermediate pack in the latter), make such a huge difference?
Please help!
> > The last item (trees) also seem to take the most space and the most
> > reasonable explanation that I can offer is that NetBeans repository has
> > a really weird structure where they have approximately 700 (yes, seven
> > hundred!) top-level subdirectories there. They are clearly
> > Submodules-shy, but that's another issue that I will need to address
> > with them.
>
> Trees taking the biggest amount of space is not unheard of, and it may
> also be that the name heuristics (for finding good packing partners) could
> be failing, which would result in a much bigger pack than necessary.
Is there any documentation that describes the heuristics involved in
creating a pack?
> So if you already did an aggressive repack like the above, I'd happily
> take a look at whether maybe it's bad heuristics for finding tree objects
> to pair up for delta-compression. Do you have a place where you can put
> that repo for people to clone and look at?
Unfortunately I don't. The only thing I can do is create
a *.tar.bz2 and put it on Sun's ftp server. Actually, that makes me
wonder: is there any public Git hosting available such that publishing
a hefty repository for forensic purposes only wouldn't violate their
terms of use?
Thanks,
Roman.
P.S. Oh, and here's one extra tiny question that I also have: what
does the output:
Total 1159628 (delta 614516), reused 0 (delta 0)
really mean?