git@vger.kernel.org mailing list mirror (one of many)
From: Junio C Hamano <gitster@pobox.com>
To: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Stefan Beller <sbeller@google.com>,
	Josh Triplett <josh@joshtriplett.org>,
	Duy Nguyen <pclouds@gmail.com>,
	"git\@vger.kernel.org" <git@vger.kernel.org>,
	sarah@thesharps.us
Subject: Re: Resumable git clone?
Date: Tue, 01 Mar 2016 22:31:30 -0800
Message-ID: <xmqq8u215r25.fsf@gitster.mtv.corp.google.com>
In-Reply-To: <20160302023024.GG17997@ZenIV.linux.org.uk> (Al Viro's message of "Wed, 2 Mar 2016 02:30:24 +0000")

Al Viro <viro@ZenIV.linux.org.uk> writes:

> FWIW, I wasn't proposing to recreate the remaining bits of that _pack_;
> just do the normal pull with one addition: start with sending the list
> of sha1 of objects you are about to send and let the recipient reply
> with "I already have <set of sha1>, don't bother with those".  And exclude
> those from the transfer.

I did a quick-and-dirty unscientific experiment.

I had a clone of Linus's repository that was about a week old, whose
tip was at 4de8ebef (Merge tag 'trace-fixes-v4.5-rc5' of
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace,
2016-02-22).  To bring it up to date (i.e. a pull of about a week's
worth of progress) to f691b77b (Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs, 2016-03-01):

    $ git rev-list --objects 4de8ebef..f691b77b1fc | wc -l
    1396
    $ git rev-parse 4de8ebef..f691b77b1fc |
      git pack-objects --revs --delta-base-offset --stdout |
      wc -c
    2444127

So in order to salvage some transfer out of 2.4MB, the hypothetical
Al protocol would first have the upload-pack give 20*1396 = 28kB of
object names to fetch-pack; no matter how fetch-pack encodes its
preference, its answer would be less than 28kB.  We would likely
design this part of the new protocol in line with the existing part
and use textual object names, so let's round them up to 100kB.
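
To make the shape of that exchange concrete, here is a rough sketch.
It is not an actual protocol; negotiate() and its list-of-names data
shapes are made up for illustration.  The sender advertises what it
is about to send, the receiver answers with the subset it already
holds, and the sender drops those from the transfer:

```python
def negotiate(to_send, receiver_has):
    """Return (advertised, already_have, still_needed) lists of names.

    to_send      -- names the sending (upload-pack) side plans to transfer
    receiver_has -- names the receiving (fetch-pack) side already holds
    """
    advertised = list(to_send)                       # sender -> receiver
    have = set(receiver_has)
    # receiver -> sender: "I already have these, don't bother"
    already_have = [name for name in advertised if name in have]
    skip = set(already_have)
    still_needed = [name for name in advertised if name not in skip]
    return advertised, already_have, still_needed


# Hypothetical object names (real ones are 40 hex digits).
plan = ["a" * 40, "b" * 40, "c" * 40]
salvaged = ["b" * 40, "f" * 40]
advertised, already_have, still_needed = negotiate(plan, salvaged)
```

The receiver's answer can never be longer than the advertisement,
which is where the "less than 28kB" bound above comes from.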

That is quite small.  Even if you are on a crappy connection on which
you need to retry 5 times, the additional overhead to negotiate the
list of objects alone would be 0.5MB (or about 20% of the real
transfer).
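
The back-of-the-envelope numbers can be redone as follows.  The
100kB per attempt is the rounded-up guess from above, not a
measurement, and the variable names are only for this sketch:

```python
OBJECTS = 1396           # from `git rev-list --objects ... | wc -l`
PACK_BYTES = 2444127     # from `git pack-objects ... | wc -c`
RAW_NAME_BYTES = 20      # a binary SHA-1 object name
NEGOTIATION = 100_000    # assumed cost of one negotiation (rounded up)
RETRIES = 5

raw_list = RAW_NAME_BYTES * OBJECTS    # 27920 bytes, i.e. ~28kB
overhead = NEGOTIATION * RETRIES       # 500000 bytes, i.e. 0.5MB
ratio = overhead / PACK_BYTES          # about 0.2 of the real transfer
```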

That is quite interesting [*1*].

For the approach to be practical, you would have to write a program
that reads from a truncated packfile and writes a new packfile,
excising deltas that lack their bases, to salvage objects from a
half-transferred packfile; it is however unclear how involved the
code would get.
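
A very rough sketch of the scanning half of such a program follows.
It only walks a truncated pack and reports which entries arrived
whole; it does not write the new packfile.  The names salvage_scan()
and read_entry() are made up, and as a simplifying assumption every
ref-delta is treated as lacking its base, because checking that
would need an object-name lookup this sketch does not do:

```python
import struct
import zlib

TYPE_NAMES = {1: "commit", 2: "tree", 3: "blob", 4: "tag",
              6: "ofs-delta", 7: "ref-delta"}


class Truncated(Exception):
    """The data ended in the middle of an entry."""


def read_entry(data, pos):
    """Parse the entry at pos; return (type, base, end_offset)."""
    start = pos
    try:
        byte = data[pos]
        pos += 1
        obj_type = (byte >> 4) & 7
        size = byte & 0x0F
        shift = 4
        while byte & 0x80:              # size continues in 7-bit groups
            byte = data[pos]
            pos += 1
            size |= (byte & 0x7F) << shift
            shift += 7
        base = None
        if obj_type == 6:               # ofs-delta: distance back to base
            byte = data[pos]
            pos += 1
            dist = byte & 0x7F
            while byte & 0x80:
                byte = data[pos]
                pos += 1
                dist = ((dist + 1) << 7) | (byte & 0x7F)
            base = start - dist
        elif obj_type == 7:             # ref-delta: 20-byte base name
            if pos + 20 > len(data):
                raise Truncated
            base = data[pos:pos + 20]
            pos += 20
    except IndexError:
        raise Truncated from None
    inflater = zlib.decompressobj()
    body = inflater.decompress(data[pos:])   # copies the tail; fine for a sketch
    if not inflater.eof or len(body) != size:
        raise Truncated
    return obj_type, base, len(data) - len(inflater.unused_data)


def salvage_scan(data):
    """Return (kept, dropped) for a possibly truncated pack.

    kept maps entry offset -> type name; dropped lists offsets of
    complete deltas whose base is not known to be usable.
    """
    magic, _version, count = struct.unpack(">4sII", data[:12])
    if magic != b"PACK":
        raise ValueError("not a packfile")
    kept, dropped, pos = {}, [], 12
    for _ in range(count):
        try:
            obj_type, base, end = read_entry(data, pos)
        except Truncated:
            break                       # everything from here on is lost
        if obj_type == 7 or (obj_type == 6 and base not in kept):
            dropped.append(pos)
        else:
            kept[pos] = TYPE_NAMES.get(obj_type, "unknown")
        pos = end
    return kept, dropped


# Tiny hand-built pack: a blob, an ofs-delta on it, and another blob.
def _entry(obj_type, payload, extra=b""):
    header = bytes([(obj_type << 4) | len(payload)])   # payload < 16 bytes
    return header + extra + zlib.compress(payload)

blob = _entry(3, b"hello\n")
delta = _entry(6, b"\x06\x06\x90\x06", extra=bytes([len(blob)]))
pack = struct.pack(">4sII", b"PACK", 2, 3) + blob + delta + _entry(3, b"world\n")
kept, dropped = salvage_scan(pack[:-6])     # cut off inside the last entry
```

A real tool would still have to copy the kept entries into a new
pack, fix up the object count and the trailing checksum, and decide
what to do about ref-deltas, which is where the code is likely to
get involved.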

It is probably OK for a tiny pack that has only 1400 objects--we
could just pass the early part through unpack-objects and let it die
when it hits EOF, but for a "resumable clone", I do not think you
can afford to unpack 4.6M objects in the kernel repository into
loose objects.

The approach of course requires the server end to spend 5 times as
many cycles as usual in order to help a client that retries 5 times.

On the other hand, the resumable "clone" we were discussing, which
allows the server to respond with a slightly older bundle or a pack
and then asks the client to fill in the latest bits with a follow-up
fetch, aims to reduce the load on the server side (the "slightly
older" part can be offloaded to CDN).  It is a happy side effect
that material offloaded to CDN can more easily be obtained via
HTTPS, which is trivially resumable ;-)
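
"Trivially resumable" here means an ordinary HTTP range request.  A
minimal sketch, assuming the CDN honors Range headers; the URL and
the resume_request() helper are hypothetical:

```python
import os
import tempfile
import urllib.request


def resume_request(url, partial_path):
    """Build a GET request asking only for the bytes not yet on disk."""
    have = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    request = urllib.request.Request(url)
    if have:
        request.add_header("Range", f"bytes={have}-")
    return request


# Pretend an earlier attempt stopped after 1000 bytes.
with tempfile.NamedTemporaryFile(delete=False) as partial:
    partial.write(b"\0" * 1000)
request = resume_request("https://cdn.example.com/linux.bundle", partial.name)
os.unlink(partial.name)
```

The client would append the response body to the partial file only
if the server answers with 206 Partial Content; a plain 200 means
the server ignored the range and is sending the whole file again.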

I think your "I've got these already" extension may be worth trying,
and it is definitely better than the "let's make sure the server end
creates byte-for-byte identical pack stream, and discard the early
part without sending it to the network" approach.  It may help
resuming a small incremental fetch, but I do not think it is
advisable to use it for a full clone, given that it is very likely
that we would be adding the "offload 'clone' to CDN" kind.  Even
though I can foresee both kinds co-existing, I do not think it is
practical to offer it for resuming multi-hour cloning of the kernel
repository (or worse, Android repositories) over a trans-Pacific
link, for example.


[Footnote]

*1* Updating v4.5-rc1 to today's HEAD involves 10809 objects, and
    the pack data takes 14955728 bytes.  That translates to ~440kB
    needed to advertise a list of textual object names to salvage
    object transfer of 15MB.
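
    The ~440kB figure can be reproduced like this, assuming each
    textual name costs 41 bytes (40 hex digits plus a newline); that
    per-name cost is an assumption of this sketch, not something a
    real protocol has fixed:

```python
OBJECTS = 10809           # objects between v4.5-rc1 and HEAD
PACK_BYTES = 14955728     # size of the corresponding pack data
BYTES_PER_NAME = 41       # assumption: 40 hex digits plus a newline

advert = OBJECTS * BYTES_PER_NAME    # 443169 bytes, i.e. ~440kB
share = advert / PACK_BYTES          # about 3% of the pack data
```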

