From: Nicolas Pitre <nico@cam.org>
To: Jakub Narebski <jnareb@gmail.com>
Cc: Tomasz Kontusz <roverorna@gmail.com>, git <git@vger.kernel.org>
Subject: Re: Continue git clone after interruption
Date: Wed, 19 Aug 2009 15:04:59 -0400 (EDT)
Message-ID: <alpine.LFD.2.00.0908191326360.6044@xanadu.home>
In-Reply-To: <200908191719.52974.jnareb@gmail.com>

On Wed, 19 Aug 2009, Jakub Narebski wrote:

> There are 114937 objects in this packfile, including 56249 objects
> used as base (can be deltified or not).  git-verify-pack -v shows
> that all objects have total size-in-packfile of 33 MB (which agrees
> with packfile size of 33 MB), with 17 MB size-in-packfile taken by
> deltified objects, and 16 MB taken by base objects.
> 
>   git verify-pack -v | 
>     grep -v "^chain" | 
>     grep -v "objects/pack/pack-" > verify-pack.out
> 
>   sum=0; bsum=0; dsum=0; 
>   while read sha1 type size packsize off depth base; do
>     echo "$sha1" >> verify-pack.sha1.out
>     sum=$(( $sum + $packsize ))
>     if [ -n "$base" ]; then 
>        echo "$sha1" >> verify-pack.delta.out
>        dsum=$(( $dsum + $packsize ))
>     else
>        echo "$sha1" >> verify-pack.base.out
>        bsum=$(( $bsum + $packsize ))
>     fi
>   done < verify-pack.out
>   echo "sum=$sum; bsum=$bsum; dsum=$dsum"

Your object classification is misleading.  Just because an object has no 
base doesn't mean it is necessarily a base itself.  You'd have to 
store $base into a separate file and then sort it and remove duplicates 
to know the actual number of base objects.  What you have right now is 
strictly delta objects and non-delta objects. And base objects can 
themselves be delta objects already of course.
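
A minimal sketch of that counting, assuming verify-pack.out holds the 
filtered 'git verify-pack -v' lines from the script above, where delta 
lines carry the base SHA1 as a seventh field.  The function names 
count_bases and count_delta_bases are made up for illustration:

```shell
# Count distinct objects that are used as a delta base.
# Input (stdin): object lines from `git verify-pack -v`; only delta
# lines have 7 fields, the last one being the base SHA1.
count_bases() {
    awk 'NF == 7 { print $7 }' | sort -u | wc -l | tr -d ' '
}

# Count bases that are themselves stored as deltas.
count_delta_bases() {
    awk 'NF == 7 { isdelta[$1] = 1; base[$7] = 1 }
         END { n = 0; for (b in base) if (b in isdelta) n++; print n }'
}
```

Usage would be 'count_bases < verify-pack.out' and 
'count_delta_bases < verify-pack.out'.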

Also... my git repo after 'git gc --aggressive' contains a pack whose 
size is 22 MB.  Your script tells me:

sum=22930254; bsum=14142012; dsum=8788242

and:

   29558 verify-pack.base.out
   82043 verify-pack.delta.out
  111601 verify-pack.out
  111601 verify-pack.sha1.out

meaning that I have 111601 total objects, of which 29558 are non-deltas 
occupying 14 MB and 82043 are deltas occupying 8 MB.  That certainly 
shows how deltas are space efficient.  And with a minor modification to 
your script, I know that 44985 objects are actually used as a delta 
base.  So, on average, each base is responsible for nearly 2 deltas.
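
The averages behind those statements can be rechecked with a few lines 
of shell.  This is only a sketch using the figures quoted above; the 
helper name ratio is made up:

```shell
# Print a / b with two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 82043 44985      # deltas per base object            -> 1.82
ratio 14142012 29558   # average bytes per non-delta object -> 478.45
ratio 8788242 82043    # average bytes per delta object     -> 107.12
```

So a delta takes on average less than a quarter of the space of a 
non-delta object in this pack.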

> >>>> (BTW what happens if this pack is larger than file size limit for 
> >>>> given filesystem?).
> [...]
> 
> >> If I remember correctly FAT28^W FAT32 has maximum file size of 4 GB.
> >> FAT is often used on SSD, on USB drive.  Although if you have a 2 GB
> >> packfile, you are doing something wrong, or UGFWIINI (Using Git For
> >> What It Is Not Intended).
> > 
> > Hopefully you're not performing a 'git clone' off of a FAT filesystem.  
> > For physical transport you may repack with the appropriate switches.
> 
> Not off a FAT filesystem, but into a FAT filesystem.

That's what I meant, sorry.  My point still stands.

> > The front of the pack is the critical point.  If you get enough to 
> > create the top commit then further transfers can be done incrementally 
> > with only the deltas between each commit.
> 
> How?  You have some objects that can be used as base; how to tell 
> git-daemon that we have them (but not their prerequisites), and how
> to generate incrementals?

Just the same as when you perform a fetch to update your local copy of a 
remote branch: you tell the remote about the commit you have and the one 
you want, and git-pack-objects will create delta objects for the commit you 
want against similar objects from the commit you already have, and skip 
those objects from the commit you want that are already included in the 
commit you have.

> >> A question about pack protocol negotiation.  If a client presents some
> >> objects as "have", server can and does assume that client has all 
> >> prerequisites for such objects, e.g. for tree objects that it has
> >> all objects for files and directories inside tree; for commit it means
> >> all ancestors and all objects in snapshot (have top tree, and its 
> >> prerequisites).  Do I understand this correctly?
> > 
> > That works only for commits.
> 
> Hmmmm... how do you intend for "prefetch top objects restartable-y first"
> to work, then?

See my latest reply to dscho (you were in CC already).

> >> BTW. because of compression it might be more difficult to resume 
> >> archive creation in the middle, I think...
> > 
> > Why so?  The tar+gzip format is streamable.
> 
> gzip format uses sliding window in compression.  "cat a b | gzip"
> is different from "cat <(gzip a) <(gzip b)".
> 
> But that doesn't matter.  If we are interrupted in the middle, we can
> uncompress what we have to check how far we got, and tell the server
> to send the rest; this way the server wouldn't even have to generate
> (only to not send) what we already got in the partial transfer.

You got it.
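
For the record, the difference is only in the compressed bytes.  A small 
sketch, assuming gzip and mktemp are available (the function name and 
file contents are made up for illustration):

```shell
# Compare "cat a b | gzip" with the concatenation of "gzip a" and
# "gzip b": the compressed streams differ, the decompressed data
# does not.
gzip_concat_demo() {
    tmp=$(mktemp -d) || return 1
    printf 'first file\n'  > "$tmp/a"
    printf 'second file\n' > "$tmp/b"

    cat "$tmp/a" "$tmp/b" | gzip -c > "$tmp/whole.gz"
    { gzip -c "$tmp/a"; gzip -c "$tmp/b"; } > "$tmp/parts.gz"

    if cmp -s "$tmp/whole.gz" "$tmp/parts.gz"; then
        echo "compressed: same"
    else
        echo "compressed: differ"
    fi

    # A gzip file may hold several members; gzip -dc decodes them all.
    gzip -dc "$tmp/whole.gz" > "$tmp/whole.out"
    gzip -dc "$tmp/parts.gz" > "$tmp/parts.out"
    if cmp -s "$tmp/whole.out" "$tmp/parts.out"; then
        echo "decompressed: same"
    else
        echo "decompressed: differ"
    fi
    rm -rf "$tmp"
}
```

What matters for resuming is therefore the position in the uncompressed 
stream, not in the compressed one.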

> P.S. What do you think about 'bundle' capability extension mentioned
>      in a side sub-thread?

I don't like it.  Reason is that it forces the server to be (somewhat) 
stateful by having to keep track of those bundles and cycle them, and it 
doubles the disk usage by having one copy of the repository in the form 
of the original pack(s) and another copy as a bundle.

Of course, the idea of having a cron job generating a bundle and 
offering it for download through HTTP or the like is fine if people are 
OK with that, and that requires zero modifications to git.  But I don't 
think that is a solution that scales.

If you think about git.kernel.org which has maybe hundreds of 
repositories where the big majority of them are actually forks of Linus' 
own repository, then having all those forks reference Linus' repository 
is a big disk space saver (and IO too as the referenced repository is 
likely to remain cached in memory).  Having a bundle ready for each of 
them will simply kill that space advantage, unless they all share the 
same bundle.

Now sharing that common bundle could be done of course, but that makes 
things yet more complex while still wasting IO because some requests 
will hit the common pack and some others will hit the bundle, making 
less efficient usage of the disk cache on the server.

Yet, that bundle would probably not contain the latest revision if it is 
only periodically updated, even less so if it is shared between multiple 
repositories as outlined above.  And what people with slow/unreliable 
network links are probably most interested in is the latest revision and 
maybe a few older revisions, but probably not the whole repository as 
that is simply too long to wait for.  Hence having a big bundle is not 
flexible either with regards to the actual data transfer size.

Hence having a restartable git-archive service to create the top 
revision with the ability to cheaply (in terms of network bandwidth) 
deepen the history afterwards is probably the most straightforward way 
to achieve that.  The server need not be aware of separate bundles, etc.  
And the shared object store still works as usual with the same cached IO 
whether the data is needed for a traditional fetch or a "git archive" 
operation.

Why "git archive"?  Because its content is well defined.  So if you give 
it a commit SHA1 you will always get the same stream of bytes (after 
decompression) since the way git sorts files is strictly defined.  It is 
therefore easy to tell a remote "git archive" instance that we want the 
content for commit xyz but that we already got n files, and that 
the last file we've got has m bytes.  There is simply no confusion about 
what we've got already, unlike with a partial pack which might need 
yet-to-be-received objects in order to make sense of what has been 
already received.  The server simply has to skip that many files and 
resume the transfer at that point, independently of the compression or 
even the archive format.
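
To illustrate the idea only: the sketch below is not an existing git 
feature.  It models the archive as the raw concatenation of a 
directory's files in byte-wise path order (a stand-in for git's own 
ordering, and without any tar headers), and the function name 
resume_stream and its arguments are hypothetical.  Here n counts the 
files received so far including the partial last one, and m is the 
number of bytes received of that last file:

```shell
# Emit the part of the stream the client is still missing.
#   $1 = directory standing in for the commit's tree (assumption)
#   $2 = n, files received so far, the last one possibly partial
#   $3 = m, bytes already received of file number n
resume_stream() {
    dir=$1 n=$2 m=$3
    ( cd "$dir" && find . -type f | LC_ALL=C sort ) | {
        i=0
        while IFS= read -r f; do
            i=$((i + 1))
            if [ "$i" -lt "$n" ]; then
                continue                            # fully received
            elif [ "$i" -eq "$n" ]; then
                tail -c +"$((m + 1))" "$dir/$f"     # rest of partial file
            else
                cat "$dir/$f"                       # not received yet
            fi
        done
    }
}
```

With three files a, b and c, 'resume_stream dir 2 1' would skip a, send 
b from its second byte on, then send c in full.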


Nicolas
