git@vger.kernel.org mailing list mirror (one of many)
From: Linus Torvalds <torvalds@osdl.org>
To: Jeff Garzik <jgarzik@pobox.com>
Cc: "David S. Miller" <davem@davemloft.net>,
	Git Mailing List <git@vger.kernel.org>,
	Nicolas Pitre <nico@cam.org>, Chris Mason <mason@suse.com>
Subject: Re: kernel.org and GIT tree rebuilding
Date: Sun, 26 Jun 2005 09:41:02 -0700 (PDT)
Message-ID: <Pine.LNX.4.58.0506260905200.19755@ppc970.osdl.org>
In-Reply-To: <Pine.LNX.4.58.0506242257450.11175@ppc970.osdl.org>



On Fri, 24 Jun 2005, Linus Torvalds wrote:
> 
> yeah, it clearly needs some refining to be useful, but I think you can
> kind of see how it would work.

Ok, here's how it works.

 - Pick a starting commit (or a hundred)

 - Pick an ending commit (or a hundred)

 - generate the list of objects in between them

	git-rev-list --objects end ^start > object-list

 - Pack that list of objects into an "object pack":

	git-pack-objects out < object-list

   (This actually generates two files: "out.idx" is the index file, 
   "out.pack" is the data file, but I'll make it concatenate the two at 
   some point)

 - move the pack-files over somewhere else

 - unpack them

	git-unpack-objects out

and you're done.
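
As a rough illustration of the "list of objects in between" step, here is a toy model in Python. It is only a sketch: the object graph and all the names in it are made up, and the real git-rev-list walks actual commits and trees rather than a dictionary. The idea it shows is that the list is roughly "everything reachable from the end commits, minus everything reachable from the start commits".

```python
# Toy object graph: each name maps to the objects it references directly.
# Commits point at a parent commit and a tree; trees point at blobs.
# Every name here is hypothetical and exists only for illustration.
GRAPH = {
    "commit1": ["tree1"],
    "commit2": ["commit1", "tree2"],
    "commit3": ["commit2", "tree3"],
    "tree1": ["blobA"],
    "tree2": ["blobA", "blobB"],
    "tree3": ["blobA", "blobB2"],
    "blobA": [],
    "blobB": [],
    "blobB2": [],
}


def reachable(roots):
    """Return every object reachable from the given root names."""
    seen = set()
    stack = list(roots)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(GRAPH[name])
    return seen


def objects_between(ends, starts):
    """Objects reachable from 'ends' but not from 'starts'."""
    return sorted(reachable(ends) - reachable(starts))


# Analogue of "end ^start" with commit3 as the end and commit1 as the start.
object_list = objects_between(["commit3"], ["commit1"])
print(object_list)
```

Note that blobA never makes it into the list: the receiving side is assumed to have it already, since it is reachable from the start commit.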

Now, the reason I use "pack" and "unpack" instead of just "tar" to
transport the objects is that this allows me to do a fairly efficient
packing. I wanted these pack-files to be independent (ie they do _not_
depend on any objects outside of the pack-file), but within the objects
described in the pack I can do delta-compression.

Now, that doesn't much help for small updates (where the objects are just 
unrelated and have no deltas), but it helps increasingly for big ones. The 
biggest one obviously being the whole path from the start to the HEAD..

For example, the "du -sh .git/objects" for the git project itself is 17MB 
for me, and I can do:

	torvalds@ppc970:~/git> du -sh .git/objects 
	17M     .git/objects

	torvalds@ppc970:~/git> time git-rev-list --objects HEAD | git-pack-objects out
	Packing 3656 objects

	real    0m3.779s
	user    0m3.169s
	sys     0m0.602s

	torvalds@ppc970:~/git> ls -lh out.*
	-rw-rw-r--  1 torvalds torvalds  87K Jun 26 09:12 out.idx
	-rw-rw-r--  1 torvalds torvalds 2.0M Jun 26 09:12 out.pack

ie it packs down to a nice 2MB pack-file with a small index. Move that
over to somewhere else, and unpack it, and you'll get all the regular
objects (it doesn't move tags and refs over, you'll have to do that
outside of the packing).

Now, you can trade off some packing time to get a better pack:

	torvalds@ppc970:~/git> time git-rev-list --objects HEAD | git-pack-objects --window=100 out
	Packing 3656 objects
	
	real    0m11.953s
	user    0m11.294s
	sys     0m0.663s

	torvalds@ppc970:~/git> ls -lh out.*
	-rw-rw-r--  1 torvalds torvalds  87K Jun 26 09:14 out.idx
	-rw-rw-r--  1 torvalds torvalds 1.6M Jun 26 09:14 out.pack

and if you want to allow deep delta chains (the default delta depth
limiting is 10), you can get even better results:

	torvalds@ppc970:~/git> time git-rev-list --objects HEAD | git-pack-objects --window=100 --depth=100 out
	Packing 3656 objects
	
	real    0m12.374s
	user    0m11.704s
	sys     0m0.659s

	torvalds@ppc970:~/git> ls -lh out.*
	-rw-rw-r--  1 torvalds torvalds  87K Jun 26 09:16 out.idx
	-rw-rw-r--  1 torvalds torvalds 1.3M Jun 26 09:16 out.pack

but then unpacking will be slightly heavier.
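
To make the window/depth trade-off concrete, here is a toy model in Python. It is a sketch under loud assumptions: the "delta" is a crude shared-prefix/shared-suffix cost, the 8-byte delta overhead is invented, and the candidate selection is a simple greedy scan. None of this is the actual git-pack-objects algorithm. It only shows why a larger window finds more delta bases, and why a depth limit forces some objects to be stored in full.

```python
DELTA_OVERHEAD = 8  # assumed fixed cost of storing a delta (hypothetical)


def delta_size(base: bytes, target: bytes) -> int:
    """Cost of a crude delta: the bytes of target not covered by a
    prefix or suffix it shares with base, plus a fixed overhead."""
    limit = min(len(base), len(target))
    prefix = 0
    while prefix < limit and base[prefix] == target[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and base[-1 - suffix] == target[-1 - suffix]:
        suffix += 1
    return len(target) - prefix - suffix + DELTA_OVERHEAD


def pack_size(objects, window, max_depth):
    """Greedy packing: each object may delta against one of the 'window'
    objects before it, as long as the chain stays within max_depth.
    Returns (total stored bytes, deepest delta chain)."""
    depth = []  # delta chain depth of each object packed so far
    total = 0
    for i, obj in enumerate(objects):
        best_cost, best_depth = len(obj), 0  # fallback: store in full
        for j in range(max(0, i - window), i):
            if depth[j] >= max_depth:
                continue  # using this base would exceed the depth limit
            cost = delta_size(objects[j], obj)
            if cost < best_cost:
                best_cost, best_depth = cost, depth[j] + 1
        depth.append(best_depth)
        total += best_cost
    return total, max(depth, default=0)


# Four versions each of two unrelated "files", deliberately interleaved so
# that the nearest neighbour is never a good delta base.
objects = []
for k in range(4):
    objects.append(b"a" * 300 + b"v%d" % k)
    objects.append(b"b" * 300 + b"v%d" % k)

for window, max_depth in [(1, 10), (2, 10), (2, 1)]:
    print(window, max_depth, pack_size(objects, window, max_depth))
```

In this toy data a window of 1 finds no usable base at all, a window of 2 lets every later version delta against the previous one, and capping the depth at 1 makes every second version a full copy again.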

(Doing the same for the kernel is obviously much more expensive just
because the kernel is so much bigger. A big delta discovery window like
100 takes about fifteen minutes to pack on my machine, but gets the
current kernel archive down to 70MB or so. That's ok for a monthly "pack
all the objects" to keep size requirements down, but you clearly don't
want to do this all the time ;).

Now, perhaps the more interesting part is that I also designed the pack
format so that it should be a good "history" format, not just a way to
move objects from one place to the other. Ie if you worry about diskspace,
you can pack everything up to now into one big pack, and then remove
the original objects.

Don't do that yet, btw - I haven't actually written the code to read stuff
out of packs if we don't find it in the object directory yet, but the
layout is such that it should be straightforward and pretty efficient (but
a deep delta chain obviously _will_ cause a performance hit there).
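
The cost of a deep chain can be sketched like this in Python. It is a toy model only: the in-memory "pack" dictionary, the object names and the (prefix, suffix, middle) delta encoding are all assumptions made for illustration, not the real pack format. The point is that reading one object means applying one delta per link in its chain.

```python
def apply_delta(base: bytes, delta) -> bytes:
    """Rebuild a target from its base: keep prefix_len bytes from the
    start of base, insert 'middle', keep suffix_len bytes from the end."""
    prefix_len, suffix_len, middle = delta
    return base[:prefix_len] + middle + base[len(base) - suffix_len:]


# Hypothetical pack contents: one full object and two chained deltas.
PACK = {
    "v1": ("full", b"hello world\n"),
    "v2": ("delta", "v1", (6, 1, b"there")),
    "v3": ("delta", "v2", (0, 7, b"HELLO")),
}


def read_object(pack, name):
    """Return (data, number of deltas applied) for one object."""
    pending = []
    kind, *rest = pack[name]
    while kind == "delta":
        base_name, delta = rest
        pending.append(delta)
        kind, *rest = pack[base_name]
    data = rest[0]
    # Deltas were collected from the requested object back to its base,
    # so they are applied in the opposite order.
    for delta in reversed(pending):
        data = apply_delta(data, delta)
    return data, len(pending)


for name in ("v1", "v2", "v3"):
    print(name, read_object(PACK, name))
```

With a depth limit of 10 a lookup applies at most ten deltas; with --depth=100 it can be up to a hundred, which is where the performance hit comes from.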

I actually like this approach better than having delta-objects in the
filesystem. Partly because the pack-file is self-contained, partly because
it also solves the fs blocking issue, yet is still efficient to look up
the results without having hardlinks etc to duplicate objects virtually.  
And when you do the packing by hand as an "archival" mechanism, it also
doesn't have any of the downsides that Chris' packing approach had.

Nico? Chris? Interested in giving it a look? It's kind of a combination of 
your things, generalized and then made to have fast lookup with the index.

Fast lookup doesn't matter for a normal unpack, of course, and if I just
always wanted to unpack all the objects (ie just an object transfer
mechanism) I'd have made the index be a toposort of the objects. But
because I wanted to be able to use it as an archival format, I needed it
to be "random-access" by object name. So the index is in fact a binary
tree (well, sorted array, so the lookup degenerates into a binary search)
with a top-level index splitting up the contents based on the first byte
(the same way the filesystem layout does).
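
A minimal sketch of that lookup in Python, assuming 20-byte SHA1 names, a 256-entry table of cumulative counts keyed by the first byte, and a parallel list of pack offsets. The in-memory layout, the function names and the fake offsets are illustrative assumptions; the message does not spell out the on-disk format.

```python
import bisect
import hashlib


def build_index(named_offsets):
    """Build (fanout, names, offsets) from (name, offset) pairs.
    fanout[b] is the number of names whose first byte is <= b."""
    pairs = sorted(named_offsets)
    names = [name for name, _ in pairs]
    offsets = [offset for _, offset in pairs]
    fanout = [0] * 256
    for name in names:
        fanout[name[0]] += 1
    for b in range(1, 256):
        fanout[b] += fanout[b - 1]
    return fanout, names, offsets


def lookup(fanout, names, name):
    """Return the position of name in the sorted array, or None."""
    first = name[0]
    lo = fanout[first - 1] if first else 0  # start of this byte's slice
    hi = fanout[first]                      # one past the end of the slice
    pos = bisect.bisect_left(names, name, lo, hi)
    if pos < hi and names[pos] == name:
        return pos
    return None


# Fake objects: the names are real SHA1 digests, the offsets are made up.
sample = [(hashlib.sha1(b"object %d" % i).digest(), i * 100)
          for i in range(1000)]
fanout, names, offsets = build_index(sample)

wanted = sample[42][0]
pos = lookup(fanout, names, wanted)
print(pos is not None and offsets[pos])
print(lookup(fanout, names, hashlib.sha1(b"missing").digest()))
```

The top-level table narrows the search to the names sharing a first byte, so the binary search only has to cover about 1/256th of the array.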

		Linus
