From: Nicolas Pitre <nico@cam.org>
To: Tomasz Kontusz <roverorna@gmail.com>
Cc: git <git@vger.kernel.org>
Subject: Re: Continue git clone after interruption
Date: Tue, 18 Aug 2009 13:56:16 -0400 (EDT)
Message-ID: <alpine.LFD.2.00.0908181246470.6044@xanadu.home>
In-Reply-To: <1250578735.2885.40.camel@cf-48>
On Tue, 18 Aug 2009, Tomasz Kontusz wrote:
> Ok, so it looks like it's not implementable without some kind of cache
> server-side, so the server would know what the pack it was sending
> looked like.
> But here's my idea: make server send objects in different order (the
> newest commit + whatever it points to first, then the next one, then
> another...). Then it would be possible to look at what we got, tell
> server we have nothing, and want [the newest commit that was not
> complete]. I know the reason why it is sorted the way it is, but I think
> that the way data is stored after clone is the client's problem, so the
> client should reorganize packs the way it wants.
That won't buy you much. You should realize that a pack is made of:

1) Commit objects.  Yes, they're all put together at the front of the pack,
   but they are roughly the equivalent of:

       git log --pretty=raw | gzip | wc -c

   For the Linux repo as of now that is around 32 MB.

2) Tree and blob objects.  Those are the bulk of the content for the top
   commit.  The top commit is usually not delta compressed because we
   want fast access to the top commit, and that is used as the base for
   further delta compression for older commits.  So the content of the
   top commit is stored whole at the front of the pack, right after the
   commit objects.  You can estimate the size of this data with:

       git archive --format=tar HEAD | gzip | wc -c

   On the same Linux repo this is currently 75 MB.

3) Delta objects.  Those make up the rest of the pack, plus a couple of
   tree/blob objects that were not found in the top commit and are
   different enough from any object in that top commit not to be
   represented as deltas.  Still, the majority of objects for all the
   remaining commits are delta objects.

So... if we reorder objects, all that we can do is to spread commit
objects around so that the objects referenced by one commit are all seen
before another commit object is included. That would cut down on that
initial 32 MB.
However you still have to get that 75 MB in order to at least be able to
look at _one_ commit. So you've only reduced your critical download
size from 107 MB to 75 MB. This is some improvement, of course, but not
worth the bother IMHO. If we're to have restartable clone, it has to
work for any size.
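
The arithmetic above, spelled out as a small sketch. The two sizes are the estimates quoted earlier for the Linux repo; the function name and the simplification that a single commit object has negligible size are assumptions made for illustration.

```python
# Sizes quoted above for the Linux repo (compressed, approximate).
COMMIT_OBJECTS_MB = 32       # all commit objects at the front of the pack
TOP_COMMIT_CONTENT_MB = 75   # trees and blobs of the top commit


def critical_download_mb(commits_interleaved: bool) -> int:
    """Data needed before one complete commit can be looked at.

    Assumption: when commit objects are spread through the pack, the
    single commit object needed up front is small enough to ignore.
    """
    if commits_interleaved:
        return TOP_COMMIT_CONTENT_MB
    return COMMIT_OBJECTS_MB + TOP_COMMIT_CONTENT_MB


if __name__ == "__main__":
    print(critical_download_mb(commits_interleaved=False))
    print(critical_download_mb(commits_interleaved=True))
```

Either way the 75 MB floor remains, which is the point: reordering alone does not make an interrupted clone cheap to restart.
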
And that's where the real problem is. I don't think having servers
cache pack results for every fetch request is sensible, as that would be
an immediate DoS attack vector.
And because the object order in a pack is not defined by the protocol,
we cannot expect the server to necessarily always provide the same
object order either. For example, it is already undefined in which
order you'll receive objects, as threaded delta search is
non-deterministic and two identical fetch requests may end up with slightly
different packing. Or load balancing may redirect your fetch requests
to different git servers which might have different versions of zlib, or
even git itself, affecting the object packing order and/or size.
Now... What _could_ be done, though, is some extension to the
git-archive command. One thing that is well and strictly defined in git
is the file path sort order. So given a commit SHA1, you should always
get the same files in the same order from git-archive. For an initial
clone, git could attempt fetching the top commit using the remote
git-archive service and locally reconstruct that top commit that way.
If the transfer is interrupted in the middle, then the remote
git-archive could be told how to resume the transfer by telling it how
many files and how many bytes in the current file to skip. This way the
server doesn't need to perform any sort of caching and remains
stateless.
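
A minimal sketch of that resume scheme, written as a Python simulation rather than as an actual git-archive extension. Everything in it is hypothetical: the names `serve`, `Client`, `skip_files` and `skip_bytes`, and the (path, size, chunk) framing, are made up for illustration, and Python's `sorted()` only stands in for git's own path sort order. What it shows is that the client's resume point is fully described by two numbers, so the server needs no memory of the earlier transfer.

```python
from typing import Dict, Iterator, Tuple

Chunk = Tuple[str, int, bytes]  # (path, total file size, piece of data)


def serve(tree: Dict[str, bytes], skip_files: int = 0,
          skip_bytes: int = 0, chunk_size: int = 4) -> Iterator[Chunk]:
    """Hypothetical stateless server: stream files in a fixed path order.

    skip_files complete files are skipped, then skip_bytes bytes of the
    next file. Nothing about earlier requests is remembered.
    """
    paths = sorted(tree)  # stand-in for git's well-defined path order
    for index in range(skip_files, len(paths)):
        path = paths[index]
        data = tree[path]
        offset = skip_bytes if index == skip_files else 0
        while True:
            piece = data[offset:offset + chunk_size]
            yield path, len(data), piece  # empty files still get one chunk
            offset += len(piece)
            if offset >= len(data):
                break


class Client:
    """Hypothetical client that records what it has received so far."""

    def __init__(self) -> None:
        self.files: Dict[str, bytearray] = {}
        self.sizes: Dict[str, int] = {}

    def receive(self, stream: Iterator[Chunk], max_chunks=None) -> bool:
        """Consume the stream; stop early to simulate an interruption."""
        for count, (path, size, piece) in enumerate(stream):
            if max_chunks is not None and count >= max_chunks:
                return False  # connection dropped, this chunk is lost
            self.sizes[path] = size
            self.files.setdefault(path, bytearray()).extend(piece)
        return True

    def resume_point(self) -> Tuple[int, int]:
        """Return (files to skip, bytes to skip in the next file)."""
        complete = sum(1 for p, d in self.files.items()
                       if len(d) == self.sizes[p])
        partial = [d for p, d in self.files.items()
                   if len(d) < self.sizes[p]]
        return complete, (len(partial[0]) if partial else 0)


if __name__ == "__main__":
    tree = {"b.txt": b"hello world", "a/x": b"",
            "a/y": b"0123456789", "c": b"zz"}
    client = Client()
    client.receive(serve(tree), max_chunks=3)      # interrupted transfer
    skip_files, skip_bytes = client.resume_point()
    print(skip_files, skip_bytes)                  # prints "1 8"
    client.receive(serve(tree, skip_files, skip_bytes))
    print({p: bytes(d) for p, d in client.files.items()} == tree)
```

Because files arrive in a fixed order, at most one file can be partially received, which is why a file count plus a byte offset is enough.
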
You then end up with a pretty shallow repository. The clone process
could then fall back to the traditional native git transfer protocol to
deepen the history of that shallow repository. And then that special
packing sort order to distribute commit objects would make sense since
each commit would then have a fairly small set of new objects, and most
of them would be deltas anyway, making the data size per commit really
small and any interrupted transfer much less of an issue.
Nicolas