Re: kernel.org and GIT tree rebuilding

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Linus Torvalds <torvalds@osdl.org>
To: Chris Mason <mason@suse.com>
Cc: Jeff Garzik <jgarzik@pobox.com>,
	"David S. Miller" <davem@davemloft.net>,
	Git Mailing List <git@vger.kernel.org>,
	Nicolas Pitre <nico@cam.org>
Subject: Re: kernel.org and GIT tree rebuilding
Date: Sun, 26 Jun 2005 14:40:59 -0700 (PDT)	[thread overview]
Message-ID: <Pine.LNX.4.58.0506261359370.19755@ppc970.osdl.org> (raw)
In-Reply-To: <200506261652.59373.mason@suse.com>

On Sun, 26 Jun 2005, Chris Mason wrote:
> 
> Without having read the code, the big thing that hurt performance in my early 
> packed file work was compressing the whole packed file instead of individual 
> sub-objects.  It takes more room to compress each object, but when I 
> compressed the whole thing read performance was quite bad.

Since I wanted random-access, compressing the whole thing just wasn't an 
option.

Besides, the big space savings come from finding deltas, which is 
obviously also a compression, just at a higher level. The biggest problem 
there is to find a guess of objects to try to delta against, and right now 
that part is pretty stupid and could possibly be improved (it just sorts 
objects by size and tries to delta against "close" objects).

To generate a better sort _would_ actually be pretty close to doing a 
global compression (it really does boil down to the same thing: finding 
big sub-sequences, except it's in a "fragmented" space), but one issue is 
that I don't want to read in the whole data set in one go, so it would 
have to be based on some rolling hash or something. Davide pointed to 
rzip, and a variation of that (which knows about object boundaries) might 
work.

(You can also sort by filename, if you want to try. I don't track
filenames at all there and it's actually non-trivial to do, so that would
require some new and pretty nasty code, but it's possible in _theory_ at
least.)

Anyway, that's all potential improvement for generating better packing,
and it should certainly be possible without changing the format - just
generate a better initial sort, in otder to find more deltas (or rather,
find them faster by using a smaller window size).

So the stupid sort I have now does actually work, but exactly because it's
so stupid it wants a big window for best packing (because there might be a
lot of objects that aren't interesting), which in turn is quite expensive.
So a better sort would make a smaller window more effective.

[ Some numbers: a window of 10 objects is the default, and packs the
  current kernel down to 77MB in 2m21s. A window of 20 objects improves
  that packing to 71MB, but makes the packing time go up to 3m36s for me.  
  And a window of 100 gets us down to 62M but takes 11m54s.

  A window of 200 (with a delta depth of 200 too - likely _way_ too deep
  for normal use) gives you a 59M pack, but takes 20m59s, so there's
  definitely a point of diminishing returns.

  This is all for the current HEAD, which takes up 264M the "traditional" 
  git way and takes 141M without any deltas, just packed tightly with no 
  filesystem blocking.

  Now, as you can notice that's actually a slightly sub-linear increase in
  time, because as we find a delta, we will only accept smaller deltas in
  the future, so we can often stop comparing even before we've reached the
  maximum window size, and so effort is slightly less than linear because 
  there's effectively a constant component to part of it.

  Also, the good news is that you probably don't want to generate one 
  humungous pack archive anyway, but you're likely better off doing a new 
  incremental pack every few months. So we'll never have the situation 
  that creating a pack gets increasingly more costly, since at some point 
  you just say "ok, I created a perfect pack for the first 4 months of
  development, I'll now do subsequent packs on top of that instead".

  The other good news is that a pack is also a natural boundary for fsck 
  (as in "ok, I found that object in a pack, so I won't bother going
  deeper in the reachability chain"), so if you start packing your
  repository, fsck will only have to worry about the objects that are
  unpacked. That makes them work really naturally for archiving, ie this 
  all means that you can avoid a lot of overhead by packing your history
  every once in a while, with it all being entirely transparent.

  In other words, if you just pack every month, you can basically 
  guarantee that fsck costs etc never really go up, and your diskspace 
  also goes up only very slowly. The packed format is quite efficient in 
  many ways, but it is totally immutable (ie you can't add anything to an 
  archive - a pack stays the way it always was, and if you want to pack 
  more you have to either re-do the pack or just create a new one) ]

		Linus

next prev parent reply	other threads:[~2005-06-26 21:34 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-06-25  4:20 kernel.org and GIT tree rebuilding David S. Miller
2005-06-25  4:40 ` Jeff Garzik
2005-06-25  5:23   ` Linus Torvalds
2005-06-25  5:48     ` Jeff Garzik
2005-06-25  6:16       ` Linus Torvalds
2005-06-26 16:41         ` Linus Torvalds
2005-06-26 18:39           ` Junio C Hamano
2005-06-26 19:19             ` Linus Torvalds
2005-06-26 19:45               ` Junio C Hamano
     [not found]                 ` <7v1x6om6o5.fsf@assigned-by-dhcp.cox.net>
     [not found]                   ` <Pine.LNX.4.58.0506271227160.19755@ppc970.osdl.org>
     [not found]                     ` <7v64vzyqyw.fsf_-_@assigned-by-dhcp.cox.net>
2005-06-28  6:56                       ` [PATCH] Obtain sha1_file_info() for deltified pack entry properly Junio C Hamano
2005-06-28  6:58                         ` Junio C Hamano
2005-06-28  6:58                         ` [PATCH 2/3] git-cat-file: use sha1_object_info() on '-t' Junio C Hamano
2005-06-28  6:59                         ` [PATCH 3/3] git-cat-file: '-s' to find out object size Junio C Hamano
2005-06-26 20:52           ` kernel.org and GIT tree rebuilding Chris Mason
2005-06-26 21:03             ` Chris Mason
2005-06-26 21:40             ` Linus Torvalds [this message]
2005-06-26 22:34               ` Linus Torvalds
2005-06-28 18:06           ` Nicolas Pitre
2005-06-28 19:28             ` Linus Torvalds
2005-06-28 21:08               ` Nicolas Pitre
2005-06-28 21:27                 ` Linus Torvalds
2005-06-28 21:55                   ` [PATCH] Bugfix: initialize pack_base to NULL Junio C Hamano
2005-06-29  3:55                   ` kernel.org and GIT tree rebuilding Nicolas Pitre
2005-06-29  5:16                     ` Nicolas Pitre
2005-06-29  5:43                       ` Linus Torvalds
2005-06-29  5:54                         ` Linus Torvalds
2005-06-29  7:16                           ` Last mile for 1.0 again Junio C Hamano
2005-06-29  9:51                             ` [PATCH] Add git-verify-pack command Junio C Hamano
2005-06-29 16:15                               ` Linus Torvalds
2005-07-04 21:40                             ` Last mile for 1.0 again Daniel Barkalow
2005-07-04 21:45                               ` Junio C Hamano
2005-07-04 21:59                               ` Linus Torvalds
2005-07-04 22:41                                 ` Daniel Barkalow
2005-07-04 23:06                                   ` Junio C Hamano
2005-07-05  1:54                                     ` Daniel Barkalow
2005-07-05  6:24                                       ` Junio C Hamano
2005-07-05 13:34                                         ` Marco Costalba
2005-06-25  5:04 ` kernel.org and GIT tree rebuilding Junio C Hamano
  -- strict thread matches above, loose matches on Subject: below --
2005-07-03  2:51 linux

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.58.0506261359370.19755@ppc970.osdl.org \
    --to=torvalds@osdl.org \
    --cc=davem@davemloft.net \
    --cc=git@vger.kernel.org \
    --cc=jgarzik@pobox.com \
    --cc=mason@suse.com \
    --cc=nico@cam.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).