From: Linus Torvalds <torvalds@osdl.org>
To: Chris Mason <mason@suse.com>
Cc: Jeff Garzik <jgarzik@pobox.com>,
"David S. Miller" <davem@davemloft.net>,
Git Mailing List <git@vger.kernel.org>,
Nicolas Pitre <nico@cam.org>
Subject: Re: kernel.org and GIT tree rebuilding
Date: Sun, 26 Jun 2005 14:40:59 -0700 (PDT)
Message-ID: <Pine.LNX.4.58.0506261359370.19755@ppc970.osdl.org>
In-Reply-To: <200506261652.59373.mason@suse.com>

On Sun, 26 Jun 2005, Chris Mason wrote:
>
> Without having read the code, the big thing that hurt performance in my early
> packed file work was compressing the whole packed file instead of individual
> sub-objects. It takes more room to compress each object, but when I
> compressed the whole thing read performance was quite bad.

Since I wanted random-access, compressing the whole thing just wasn't an
option.

Besides, the big space savings come from finding deltas, which is
obviously also a compression, just at a higher level. The biggest problem
there is to find a guess of objects to try to delta against, and right now
that part is pretty stupid and could possibly be improved (it just sorts
objects by size and tries to delta against "close" objects).
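That sort-by-size, try-nearby-objects scheme can be sketched roughly like this (a toy illustration, not git's actual code; `delta_size` here is a crude stand-in for the real delta algorithm):

```python
def delta_size(a: bytes, b: bytes) -> int:
    # Crude stand-in for a real delta: bytes of b not covered by a
    # common prefix/suffix with the candidate base a.
    p = 0
    while p < min(len(a), len(b)) and a[p] == b[p]:
        p += 1
    s = 0
    while s < min(len(a), len(b)) - p and a[-1 - s] == b[-1 - s]:
        s += 1
    return len(b) - p - s

def pick_deltas(objects, window=10):
    # Sort by size, then try to delta each object against the
    # preceding `window` objects, keeping the smallest delta found.
    objs = sorted(objects, key=len)
    choices = []
    for i, obj in enumerate(objs):
        best = None  # (delta size, index of base object)
        for j in range(max(0, i - window), i):
            d = delta_size(objs[j], obj)
            if best is None or d < best[0]:
                best = (d, j)
        choices.append(best)
    return objs, choices
```

A bigger `window` finds more (and better) bases but costs proportionally more delta attempts, which is exactly the size/time trade-off in the numbers below.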

To generate a better sort _would_ actually be pretty close to doing a
global compression (it really does boil down to the same thing: finding
big sub-sequences, except it's in a "fragmented" space), but one issue is
that I don't want to read in the whole data set in one go, so it would
have to be based on some rolling hash or something. Davide pointed to
rzip, and a variation of that (which knows about object boundaries) might
work.
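A minimal sketch of what such a rolling hash might look like (Rabin-Karp style, purely illustrative: the window size, constants, and similarity measure are made up here, not anything rzip or git actually uses):

```python
WINDOW = 16          # bytes per hashed window (illustrative choice)
BASE = 257
MOD = (1 << 31) - 1

def rolling_hashes(data: bytes):
    # Hash every WINDOW-byte substring in O(len(data)) by rolling:
    # drop the oldest byte, shift, add the new byte.
    if len(data) < WINDOW:
        return set()
    pow_w = pow(BASE, WINDOW - 1, MOD)
    h = 0
    for b in data[:WINDOW]:
        h = (h * BASE + b) % MOD
    hashes = {h}
    for i in range(WINDOW, len(data)):
        h = ((h - data[i - WINDOW] * pow_w) * BASE + data[i]) % MOD
        hashes.add(h)
    return hashes

def similarity(a: bytes, b: bytes) -> float:
    # Objects sharing big sub-sequences share many window hashes,
    # so this score could drive a better delta-candidate sort.
    ha, hb = rolling_hashes(a), rolling_hashes(b)
    if not ha or not hb:
        return 0.0
    return len(ha & hb) / min(len(ha), len(hb))
```

The point of the rolling update is that each object can be streamed through once, so the whole data set never has to sit in memory at the same time.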

(You can also sort by filename, if you want to try. I don't track
filenames at all there and it's actually non-trivial to do, so that would
require some new and pretty nasty code, but it's possible in _theory_ at
least.)

Anyway, that's all potential improvement for generating better packing,
and it should certainly be possible without changing the format - just
generate a better initial sort, in order to find more deltas (or rather,
find them faster by using a smaller window size).

So the stupid sort I have now does actually work, but exactly because it's
so stupid it wants a big window for best packing (because there might be a
lot of objects that aren't interesting), which in turn is quite expensive.
So a better sort would make a smaller window more effective.

[ Some numbers: a window of 10 objects is the default, and packs the
current kernel down to 77MB in 2m21s. A window of 20 objects improves
that packing to 71MB, but makes the packing time go up to 3m36s for me.
And a window of 100 gets us down to 62M but takes 11m54s.

A window of 200 (with a delta depth of 200 too - likely _way_ too deep
for normal use) gives you a 59M pack, but takes 20m59s, so there's
definitely a point of diminishing returns.

This is all for the current HEAD, which takes up 264M the "traditional"
git way and takes 141M without any deltas, just packed tightly with no
filesystem blocking.

Now, as you can notice, that's actually a slightly sub-linear increase
in time: as we find a delta, we will only accept smaller deltas in the
future, so we can often stop comparing even before we've reached the
maximum window size, and so the effort is slightly less than linear
because there's effectively a constant component to part of it.
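That early-exit effect can be shown with a toy model (hypothetical, not git's real heuristics: the "delta cost" is faked from the size difference alone, and the scan walks the size-sorted window backwards so it can stop as soon as the size gap already rules out any improvement):

```python
def comparisons(sizes, window):
    """Count how many delta attempts the windowed scan really makes."""
    sizes = sorted(sizes)
    total = 0
    for i in range(len(sizes)):
        best = sizes[i]  # storing the object whole is the fallback cost
        # Walk the window from nearest size to farthest.
        for j in range(i - 1, max(0, i - window) - 1, -1):
            if sizes[i] - sizes[j] >= best:
                break  # sorted order: every earlier candidate is even smaller
            total += 1
            best = min(best, sizes[i] - sizes[j] + 1)  # faked delta cost
    return total
```

With widely spread sizes the scan bails out almost immediately, so growing the window barely adds work; only clumps of similar-sized objects pay for the full window. That is the "constant component" keeping the growth below linear.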

Also, the good news is that you probably don't want to generate one
humungous pack archive anyway, but you're likely better off doing a new
incremental pack every few months. So we'll never have the situation
that creating a pack gets increasingly more costly, since at some point
you just say "ok, I created a perfect pack for the first 4 months of
development, I'll now do subsequent packs on top of that instead".

The other good news is that a pack is also a natural boundary for fsck
(as in "ok, I found that object in a pack, so I won't bother going
deeper in the reachability chain"), so if you start packing your
repository, fsck will only have to worry about the objects that are
unpacked. That makes them work really naturally for archiving, ie this
all means that you can avoid a lot of overhead by packing your history
every once in a while, with it all being entirely transparent.

In other words, if you just pack every month, you can basically
guarantee that fsck costs etc never really go up, and your diskspace
also goes up only very slowly. The packed format is quite efficient in
many ways, but it is totally immutable (ie you can't add anything to an
archive - a pack stays the way it always was, and if you want to pack
more you have to either re-do the pack or just create a new one) ]

		Linus