From: Nicolas Pitre <nico@cam.org>
To: Jakub Narebski <jnareb@gmail.com>
Cc: Tomasz Kontusz <roverorna@gmail.com>, git <git@vger.kernel.org>
Subject: Re: Continue git clone after interruption
Date: Wed, 19 Aug 2009 15:04:59 -0400 (EDT)
Message-ID: <alpine.LFD.2.00.0908191326360.6044@xanadu.home>
In-Reply-To: <200908191719.52974.jnareb@gmail.com>
On Wed, 19 Aug 2009, Jakub Narebski wrote:
> There are 114937 objects in this packfile, including 56249 objects
> used as base (can be deltified or not). git-verify-pack -v shows
> that all objects have total size-in-packfile of 33 MB (which agrees
> with packfile size of 33 MB), with 17 MB size-in-packfile taken by
> deltaified objects, and 16 MB taken by base objects.
>
> git verify-pack -v |
>     grep -v "^chain" |
>     grep -v "objects/pack/pack-" > verify-pack.out
>
> sum=0; bsum=0; dsum=0
> while read sha1 type size packsize off depth base; do
>     echo "$sha1" >> verify-pack.sha1.out
>     sum=$(( $sum + $packsize ))
>     if [ -n "$base" ]; then
>         echo "$sha1" >> verify-pack.delta.out
>         dsum=$(( $dsum + $packsize ))
>     else
>         echo "$sha1" >> verify-pack.base.out
>         bsum=$(( $bsum + $packsize ))
>     fi
> done < verify-pack.out
> echo "sum=$sum; bsum=$bsum; dsum=$dsum"
Your object classification is misleading. Just because an object has no
base doesn't mean it is necessarily a base itself. You'd have to store
$base into a separate file, then sort it and remove duplicates, to know
the actual number of base objects. What you have right now is strictly
delta objects versus non-delta objects. And of course base objects can
themselves be delta objects.
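A sketch of that fix: collect the base column, de-duplicate it, and count.
The input below is fabricated for illustration, but it follows the
'git verify-pack -v' line format, where delta entries carry two extra
fields (depth and base SHA-1):

```shell
# Fabricated sample in the 'git verify-pack -v' format:
#   sha1 type size size-in-packfile offset [depth base-sha1]
cat > /tmp/verify-pack.out <<'EOF'
aaa111 blob 120 60 100
bbb222 blob 130 30 160 1 aaa111
ccc333 blob 125 25 190 2 bbb222
ddd444 blob 110 28 215 1 aaa111
EOF

# Delta lines have 7 fields; field 7 names the base object.
# De-duplicating gives the number of objects actually used as a base.
awk 'NF >= 7 { print $7 }' /tmp/verify-pack.out | sort -u > /tmp/bases.out
wc -l < /tmp/bases.out    # here: 2 distinct bases (aaa111 and bbb222)
```

Note that bbb222 is counted as a base even though it is itself a delta,
which is exactly the case the simple has-no-base test misses.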
Also... my git repo after 'git gc --aggressive' contains a pack whose
size is 22 MB. Your script tells me:
sum=22930254; bsum=14142012; dsum=8788242
and:
29558 verify-pack.base.out
82043 verify-pack.delta.out
111601 verify-pack.out
111601 verify-pack.sha1.out
meaning that I have 111601 total objects, of which 29558 are non-deltas
occupying 14 MB and 82043 are deltas occupying 8 MB. That certainly
shows how space-efficient deltas are. And with a minor modification to
your script, I know that 44985 objects are actually used as a delta
base. So, on average, each base is responsible for nearly 2 deltas.
> >>>> (BTW what happens if this pack is larger than file size limit for
> >>>> given filesystem?).
> [...]
>
> >> If I remember correctly FAT28^W FAT32 has maximum file size of 2 GB.
> >> FAT is often used on SSD, on USB drive. Although if you have 2 GB
> >> packfile, you are doing something wrong, or UGFWIINI (Using Git For
> >> What It Is Not Intended).
> >
> > Hopefully you're not performing a 'git clone' off of a FAT filesystem.
> > For physical transport you may repack with the appropriate switches.
>
> Not off a FAT filesystem, but into a FAT filesystem.
That's what I meant, sorry. My point still stands.
> > The front of the pack is the critical point. If you get enough to
> > create the top commit then further transfers can be done incrementally
> > with only the deltas between each commit.
>
> How? You have some objects that can be used as base; how to tell
> git-daemon that we have them (but not theirs prerequisites), and how
> to generate incrementals?
Just the same as when you perform a fetch to update your local copy of a
remote branch: you tell the remote about the commit you have and the one
you want, and git-repack will create delta objects for the commit you
want against similar objects from the commit you already have, and skip
those objects from the commit you want that are already included in the
commit you have.
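A minimal local sketch of that negotiation (repository paths and commit
messages here are invented for the demo): after the clone, the client's
second fetch advertises the commit it already has, so the server packs
and sends only the difference.

```shell
set -e
rm -rf /tmp/neg-src /tmp/neg-dst
git init -q /tmp/neg-src
git -C /tmp/neg-src -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m 'first'
git clone -q /tmp/neg-src /tmp/neg-dst

# New commit on the server side only.
git -C /tmp/neg-src -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m 'second'

# The client reports the commit it has; the server sends just the rest.
git -C /tmp/neg-dst fetch -q origin
git -C /tmp/neg-dst rev-list --count FETCH_HEAD    # both commits present
```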
> >> A question about pack protocol negotiation. If clients presents some
> >> objects as "have", server can and does assume that client has all
> >> prerequisites for such objects, e.g. for tree objects that it has
> >> all objects for files and directories inside tree; for commit it means
> >> all ancestors and all objects in snapshot (have top tree, and its
> >> prerequisites). Do I understand this correctly?
> >
> > That works only for commits.
>
> Hmmmm... how do you intend for "prefetch top objects restartable-y first"
> to work, then?
See my latest reply to dscho (you were in CC already).
> >> BTW. because of compression it might be more difficult to resume
> >> archive creation in the middle, I think...
> >
> > Why so? the tar+gzip format is streamable.
>
> gzip format uses sliding window in compression. "cat a b | gzip"
> is different from "cat <(gzip a) <(gzip b)".
>
> But that doesn't matter. If we are interrupted in the middle, we can
> uncompress what we have to check how far we got, and tell the server
> to send the rest; this way the server wouldn't even have to generate
> (let alone send) what we already received in the partial transfer.
You got it.
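That property is easy to demonstrate: a truncated gzip stream still
decompresses to a prefix of the original data, so the client can tell
exactly how far it got (file names and the cut-off size below are
arbitrary):

```shell
# Compress a known sequence, then keep only the first part of the
# stream, as if the transfer had been interrupted.
seq 1 100000 | gzip -c > /tmp/full.gz
head -c 20000 /tmp/full.gz > /tmp/partial.gz

# zcat reports the truncation but still emits the decodable prefix,
# so the client knows exactly where to resume.
zcat /tmp/partial.gz 2>/dev/null > /tmp/prefix || true
head -1 /tmp/prefix              # prints: 1
wc -l < /tmp/prefix              # far fewer than 100000 lines recovered
```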
> P.S. What do you think about 'bundle' capability extension mentioned
> in a side sub-thread?
I don't like it. The reason is that it forces the server to be (somewhat)
stateful by having to keep track of those bundles and cycle them, and it
doubles the disk usage by having one copy of the repository in the form
of the original pack(s) and another copy as a bundle.
Of course, the idea of having a cron job generating a bundle and
offering it for download through HTTP or the like is fine if people are
OK with that, and that requires zero modifications to git. But I don't
think that is a solution that scales.
If you think about git.kernel.org, which hosts maybe hundreds of
repositories, the big majority of which are actually forks of Linus'
own repository, then having all those forks reference Linus' repository
is a big disk space saver (and an IO saver too, as the referenced
repository is likely to remain cached in memory). Having a bundle ready
for each of them would simply kill that space advantage, unless they
all shared the same bundle.
Now sharing that common bundle could of course be done, but that makes
things yet more complex while still wasting IO, because some requests
will hit the common pack and others will hit the bundle, making less
efficient use of the disk cache on the server.
Furthermore, that bundle would probably not contain the latest revision
if it is only periodically updated, even less so if it is shared between
multiple repositories as outlined above. And what people with
slow/unreliable network links are probably most interested in is the
latest revision and maybe a few older ones, but probably not the whole
repository, as that is simply too long to wait for. Hence a big bundle
is not flexible with regard to the actual data transfer size either.
Hence a restartable git-archive service to create the top revision, with
the ability to cheaply (in terms of network bandwidth) deepen the
history afterwards, is probably the most straightforward way to achieve
that. The server need not be aware of separate bundles, etc. And the
shared object store still works as usual with the same cached IO,
whether the data is needed for a traditional fetch or a "git archive"
operation.
Why "git archive"? Because its content is well defined. If you give it
a commit SHA1 you will always get the same stream of bytes (after
decompression), since the way git sorts files is strictly defined. It
is therefore easy to tell a remote "git archive" instance that we want
the content for commit xyz, but that we already received n files, and
that we have m bytes of the last file. There is simply no confusion
about what we already have, unlike with a partial pack, which might
need yet-to-be-received objects in order to make sense of what has
already been received. The server simply has to skip that many files
and resume the transfer at that point, independently of the compression
or even the archive format.
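The determinism claim is easy to check (the repository path and file
content below are made up for the demo): archiving the same commit
twice yields byte-identical output, because entry order and metadata
are fully determined by the tree and commit.

```shell
set -e
rm -rf /tmp/arch-demo
git init -q /tmp/arch-demo
echo 'hello' > /tmp/arch-demo/file
git -C /tmp/arch-demo add file
git -C /tmp/arch-demo -c user.email=you@example.com -c user.name=you \
    commit -q -m 'init'

# Two independent archive runs of the same commit...
git -C /tmp/arch-demo archive --format=tar HEAD > /tmp/a1.tar
git -C /tmp/arch-demo archive --format=tar HEAD > /tmp/a2.tar

# ...produce byte-for-byte identical streams.
cmp /tmp/a1.tar /tmp/a2.tar && echo identical
```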
Nicolas
Thread overview: 39+ messages (latest: 2009-08-19 19:05 UTC)
2009-08-17 11:42 Continue git clone after interruption Tomasz Kontusz
2009-08-17 12:31 ` Johannes Schindelin
2009-08-17 15:23 ` Shawn O. Pearce
2009-08-18 5:43 ` Matthieu Moy
2009-08-18 6:58 ` Tomasz Kontusz
2009-08-18 17:56 ` Nicolas Pitre
2009-08-18 18:45 ` Jakub Narebski
2009-08-18 20:01 ` Nicolas Pitre
2009-08-18 21:02 ` Jakub Narebski
2009-08-18 21:32 ` Nicolas Pitre
2009-08-19 15:19 ` Jakub Narebski
2009-08-19 19:04 ` Nicolas Pitre [this message]
2009-08-19 19:42 ` Jakub Narebski
2009-08-19 21:13 ` Nicolas Pitre
2009-08-20 0:26 ` Sam Vilain
2009-08-20 7:37 ` Jakub Narebski
2009-08-20 7:48 ` Nguyen Thai Ngoc Duy
2009-08-20 8:23 ` Jakub Narebski
2009-08-20 18:41 ` Nicolas Pitre
2009-08-21 10:07 ` Jakub Narebski
2009-08-21 10:26 ` Matthieu Moy
2009-08-21 21:07 ` Nicolas Pitre
2009-08-21 21:41 ` Jakub Narebski
2009-08-22 0:59 ` Nicolas Pitre
2009-08-21 23:07 ` Sam Vilain
2009-08-22 3:37 ` Nicolas Pitre
2009-08-22 5:50 ` Sam Vilain
2009-08-22 8:13 ` Nicolas Pitre
2009-08-23 10:37 ` Sam Vilain
2009-08-20 22:57 ` Sam Vilain
2009-08-18 22:28 ` Johannes Schindelin
2009-08-18 23:40 ` Nicolas Pitre
2009-08-19 7:35 ` Johannes Schindelin
2009-08-19 8:25 ` Nguyen Thai Ngoc Duy
2009-08-19 9:52 ` Johannes Schindelin
2009-08-19 17:21 ` Nicolas Pitre
2009-08-19 22:23 ` René Scharfe
2009-08-19 4:42 ` Sitaram Chamarty
2009-08-19 9:53 ` Jakub Narebski