Re: Continue git clone after interruption

From: Nicolas Pitre <nico@cam.org>
To: Tomasz Kontusz <roverorna@gmail.com>
Cc: git <git@vger.kernel.org>
Subject: Re: Continue git clone after interruption
Date: Tue, 18 Aug 2009 13:56:16 -0400 (EDT)
Message-ID: <alpine.LFD.2.00.0908181246470.6044@xanadu.home>
In-Reply-To: <1250578735.2885.40.camel@cf-48>

On Tue, 18 Aug 2009, Tomasz Kontusz wrote:

> Ok, so it looks like it's not implementable without some kind of cache
> server-side, so the server would know what the pack it was sending
> looked like.
> But here's my idea: make server send objects in different order (the
> newest commit + whatever it points to first, then next one, then
> another...). Then it would be possible to look at what we got, tell
> server we have nothing, and want [the newest commit that was not
> complete]. I know the reason why it is sorted the way it is, but I think
> that the way data is stored after clone is client's problem, so the
> client should reorganize packs the way it wants.

That won't buy you much.  You should realize that a pack is made of:

1) Commit objects.  Yes they're all put together at the front of the pack,
   but they roughly are the equivalent of:

	git log --pretty=raw | gzip | wc -c

   For the Linux repo as of now that is around 32 MB.

2) Tree and blob objects.  Those are the bulk of the content for the top 
   commit.  The top commit is usually not delta compressed because we 
   want fast access to the top commit, and that is used as the base for 
   further delta compression for older commits.  So the top commit is 
   stored whole at the front of the pack right after the commit 
   objects.  You can estimate the size of this data with:

	git archive --format=tar HEAD | gzip | wc -c

   On the same Linux repo this is currently 75 MB.

3) Delta objects.  Those make up the rest of the pack, plus a couple of 
   tree/blob objects that were not found in the top commit and are 
   different enough from any object in that top commit not to be 
   represented as deltas.  Still, the majority of objects for all the 
   remaining commits are delta objects.

So... if we reorder objects, all that we can do is to spread commit 
objects around so that the objects referenced by one commit are all seen 
before another commit object is included.  That would cut down on that 
initial 32 MB.

However you still have to get that 75 MB in order to at least be able to 
look at _one_ commit.  So you've only reduced your critical download 
size from 107 MB to 75 MB.  This is some improvement, of course, but not 
worth the bother IMHO.  If we're to have restartable clone, it has to 
work for any size.
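
A small sketch of the arithmetic behind those figures.  The sizes are 
the rough ones quoted above for the Linux repo, and the function name 
critical_mb is made up for illustration:

```shell
# critical_mb COMMITS_MB TOP_MB REORDERED
# Prints roughly how many MB must arrive before one commit is usable.
critical_mb() {
    cm_commits=$1
    cm_top=$2
    cm_reordered=$3
    if [ "$cm_reordered" = yes ]; then
        # Only the top commit object precedes its trees and blobs; its
        # own size is negligible next to the tree/blob data.
        echo "$cm_top"
    else
        # All commit objects come first, then the top commit's content.
        echo $((cm_commits + cm_top))
    fi
}

critical_mb 32 75 no    # prints 107
critical_mb 32 75 yes   # prints 75
```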

And that's where the real problem is.  I don't think having servers 
cache pack results for every fetch request is sensible, as that would be 
an immediate DoS attack vector.

And because the object order in a pack is not defined by the protocol, 
we cannot expect the server to necessarily always provide the same 
object order either.  For example, it is already undefined in which 
order you'll receive objects as threaded delta search is 
non-deterministic and two identical fetch requests may end up with slightly 
different packing.  Or load balancing may redirect your fetch requests 
to different git servers which might have different versions of zlib, or 
even git itself, affecting the object packing order and/or size.

Now... What _could_ be done, though, is some extension to the 
git-archive command.  One thing that is well and strictly defined in git 
is the file path sort order.  So given a commit SHA1, you should always 
get the same files in the same order from git-archive.  For an initial 
clone, git could attempt fetching the top commit using the remote 
git-archive service and locally reconstruct that top commit that way.  
If the transfer is interrupted in the middle, then the remote 
git-archive could be told how to resume the transfer by telling it how 
many files and how many bytes in the current file to skip.  This way the 
server doesn't need to perform any sort of caching and remains 
stateless.
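
The resume idea can be approximated with existing plumbing.  This is 
only a sketch under stated assumptions: resume_stream is a hypothetical 
helper, not a git-archive option, and it emits raw file contents with 
no tar headers.  It relies on git ls-tree -r listing the paths of a 
commit in git's defined sort order:

```shell
# resume_stream COMMIT SKIP_FILES SKIP_BYTES
# Hypothetical helper: emit the blobs of COMMIT in path order, skipping
# SKIP_FILES whole files and then SKIP_BYTES bytes of the next file.
# Assumes plain file names (no newlines or quoted characters) and no
# submodule entries.
resume_stream() {
    rs_commit=$1
    rs_skip_files=$2
    rs_skip_bytes=$3
    rs_first=1
    # The same commit always yields the same path sequence, so the
    # server needs no memory of the earlier, interrupted transfer.
    git ls-tree -r --name-only "$rs_commit" |
    tail -n +"$((rs_skip_files + 1))" |
    while IFS= read -r rs_path; do
        if [ "$rs_first" -eq 1 ]; then
            # Partially received file: drop the bytes the client has.
            git cat-file blob "$rs_commit:$rs_path" |
                tail -c +"$((rs_skip_bytes + 1))"
            rs_first=0
        else
            git cat-file blob "$rs_commit:$rs_path"
        fi
    done
}
```

A client that already holds one complete file and the first two bytes 
of the second would ask for "resume_stream HEAD 1 2".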

You then end up with a pretty shallow repository.  The clone process 
could then fall back to the traditional native git transfer protocol to 
deepen the history of that shallow repository.  And then that special 
packing sort order to distribute commit objects would make sense since 
each commit would then have a fairly small set of new objects, and most 
of them would be deltas anyway, making the data size per commit really 
small and any interrupted transfer much less of an issue.
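
The second half of that scheme can be tried with options git already 
has.  In this sketch the git-archive step proposed above is replaced by 
an ordinary depth-1 clone, which is not resumable; only the incremental 
deepening is illustrated.  The helper name and its parameters are 
hypothetical, and --deepen needs a reasonably recent git:

```shell
# deepen_in_steps URL DIR STEP ROUNDS
# Hypothetical helper: make a depth-1 clone of URL in DIR, then deepen
# the history by STEP commits, ROUNDS times.
deepen_in_steps() {
    ds_url=$1
    ds_dir=$2
    ds_step=$3
    ds_rounds=$4
    git clone --quiet --depth 1 "$ds_url" "$ds_dir" || return 1
    (
        cd "$ds_dir" || exit 1
        ds_i=0
        while [ "$ds_i" -lt "$ds_rounds" ]; do
            # Each fetch is a small, independent transfer; if one is
            # interrupted, the history fetched so far is kept.
            git fetch --quiet --deepen="$ds_step" || exit 1
            ds_i=$((ds_i + 1))
        done
    )
}
```

A real implementation would retry a failed step rather than give up, 
and would keep going until the repository is no longer shallow.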

Nicolas
