From: Jeff King <peff@peff.net>
To: Shawn Pearce <spearce@spearce.org>
Cc: git <git@vger.kernel.org>
Subject: Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP
Date: Wed, 10 Feb 2016 16:49:46 -0500
Message-ID: <20160210214945.GA5853@sigill.intra.peff.net>
In-Reply-To: <CAJo=hJuRxoe6tXe65ci-A35c_PWJEP7KEPFu5Ocn147HwVuo3A@mail.gmail.com>
On Wed, Feb 10, 2016 at 12:11:46PM -0800, Shawn Pearce wrote:
> On Wed, Feb 10, 2016 at 10:59 AM, Shawn Pearce <spearce@spearce.org> wrote:
> >
> > ... Thoughts?
>
> Several of us at $DAY_JOB talked about this more today and thought a
> variation makes more sense:
>
> 1. Clients attempting clone ask for /info/refs?service=git-upload-pack
> like they do today.
>
> 2. Servers that support resumable clone include a "resumable"
> capability in the advertisement.
Because the magic happens in the git protocol, that would mean this does
not have to be limited to git-over-http. It could be "resumable=<url>"
to point the client anywhere (the same server over a different protocol,
another server, etc).
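As a rough sketch of the client side, assuming the capability is spelled "resumable=<url>" and rides in the usual space-separated capability list after the NUL on the first advertised ref line. The capability is only a proposal here, and the function names are made up for illustration:

```python
def parse_capabilities(first_ref_line: bytes) -> dict:
    """Split the capability list that follows the NUL on the first ref line."""
    _, _, caps = first_ref_line.rstrip(b"\n").partition(b"\0")
    parsed = {}
    for cap in caps.decode().split():
        # "name=value" capabilities keep their value; bare ones map to None.
        name, sep, value = cap.partition("=")
        parsed[name] = value if sep else None
    return parsed


def resumable_url(first_ref_line: bytes):
    """Return the URL from the proposed "resumable" capability, or None."""
    # Hypothetical capability name, taken from the proposal in this thread.
    return parse_capabilities(first_ref_line).get("resumable")
```

A client that gets None back would simply clone the way it does today.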
> 3. Updated clients on clone request GET /info/refs?service=git-resumable-clone.
>
> 4. The server may return a 302 Redirect to its current "mostly whole"
> pack file. This can be more flexible than "refs/heads/*", it just
> needs to be a mostly complete pack file that contains a complete graph
> from any arbitrary roots.
And with "resumable=<url>", the client does not have to hit the server
to do a redirect; it can go straight to the final URL, saving a
round-trip.
> 5. Clients fetch the file using standard HTTP GET, possibly with
> byte-ranges to resume.
>
> 6. Once stored and indexed with .idx, clients run `git fsck
> --lost-found` to discover the roots of the pack it downloaded. These
> are saved as temporary references.
Clients do not have to _just_ fetch a packfile. They could get a bundle
file that contains the roots along with the packfile. I know that one of
your goals is not duplicating the storage of the packfile on the server,
but it would not be hard for the server to store the packfile and the
bundle header separately, and concatenate them on the fly.
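A minimal sketch of that on-the-fly concatenation, assuming a v2 bundle header (signature line, one "<oid> <refname>" line per tip, then a blank line) stored apart from the pack. The helper names are hypothetical, and a real server would also have to map byte-range requests onto the two pieces:

```python
import shutil


def bundle_header(tips):
    """Build a v2 bundle header: signature, "<oid> <refname>" lines, blank line."""
    lines = ["# v2 git bundle"]
    lines += [f"{oid} {ref}" for oid, ref in tips]
    return ("\n".join(lines) + "\n\n").encode()


def stream_bundle(header: bytes, pack, out, chunk_size=64 * 1024):
    """Write the stored header, then the pack bytes, without a second copy on disk."""
    out.write(header)
    # Copy the existing packfile through in chunks; nothing is spooled.
    shutil.copyfileobj(pack, out, chunk_size)
```

Here "pack" and "out" are any binary file objects, e.g. the open packfile and the HTTP response body.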
Right now the clients can't clone from bundles directly via HTTP. I
wrote patches for that ages ago, but got stuck on this very issue
(basically that I had to spool the bundle and then clone from it, which
temporarily doubled the client's disk space requirements). One
alternative would be to amend the bundle format so that rather than a
single file, you get a bundle header whose end says "...and my matching
packfile is 1234-abcd". And then the client knows that they can fetch
that separately from the same source.
It's an extra HTTP request, but it makes the code for client _and_
server way simpler. So the whole thing is basically then:
  0. During gc, server generates pack-1234abcd.pack. It writes matching
     tips into pack-1234abcd.info, which is essentially a bundle file
     whose final line says "pack-1234abcd.pack".

  1. Client contacts server via any git protocol. Server says
     "resumable=<url>". Let's say that <url> is
     https://example.com/repo/clones/1234abcd.bundle.

  2. Client goes to <url>. They see that they are fetching a bundle,
     and know not to do the usual smart-http or dumb-http protocols.
     They can fetch the bundle header resumably (though it's tiny, so it
     doesn't really matter).

  3. After finishing the bundle header, they see they need to grab the
     packfile. Based on the bundle header's URL and the filename
     contained within it, they know to get
     https://example.com/repo/clones/pack-1234abcd.pack. This is
     resumable, too.

  4. Client clones from bundled pack as normal; no root-finding magic
     required.

  5. Client runs incremental fetch against original repo from step 1.
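Steps 2 and 3 could look something like the following on the client. This is only a sketch: the split-bundle format does not exist, so the layout (v2 bundle signature, tip lines, and a last non-empty line naming the pack) and the "pack-<hex>.pack" naming check are assumptions, as are the function names:

```python
import re
from urllib.parse import urljoin

SIGNATURE = "# v2 git bundle"


def parse_split_bundle(text: str):
    """Return (tips, pack_name) from a header whose last non-empty line names the pack."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2 or lines[0] != SIGNATURE:
        raise ValueError("not a split bundle header")
    *ref_lines, pack_name = lines[1:]
    tips = {}
    for ln in ref_lines:
        if ln.startswith("-"):
            continue  # prerequisite line; not expected for a full clone
        oid, ref = ln.split(" ", 1)
        tips[ref] = oid
    return tips, pack_name


def pack_url(bundle_url: str, pack_name: str) -> str:
    """Resolve the pack name against the bundle URL, as in step 3."""
    # Accept only a bare pack filename so the header cannot point elsewhere.
    if not re.fullmatch(r"pack-[0-9a-f]+\.pack", pack_name):
        raise ValueError("unexpected pack name")
    return urljoin(bundle_url, pack_name)
```

With the URL from step 1 and a header ending in "pack-1234abcd.pack", pack_url() yields the URL shown in step 3.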
And you'll notice, too, that all of the bundle-http magic kicks in
during step 2 because the client sees they're grabbing a bundle. Which
means that the <url> in step 1 doesn't _have_ to be a bundle. It can be
"go fetch from kernel.org, then come back to me".
> An advantage to this process is its much more flexible for the server.
> There is no additional pack-*.info file required. GC can organize
> packs anyway it wants, etc.
Yes, it's much better than your original email, at least for GitHub
servers. We're not very flexible with GC tricks, because we need bitmaps
to work, and because we get a lot of benefit from sharing the object
storage for forks of a single repository.
> To make step 4 really resume well, clients may need to save the first
> Location header it gets back from
> /info/refs?service=git-resumable-clone and use that on resume. Servers
> are likely to embed the pack SHA-1 in the Location header, and the
> client wants to use this on subsequent GET attempts to abort early if
> the server has deleted the pack the client is trying to obtain.
You could possibly do away with this trick if the server hands out a
unique URL in its "resumable" header. Though I imagine it might be
convenient for server admins to always point to a generic url, and
put the logic in the HTTP layer.
OTOH, if you do the "split bundle" thing I mentioned above, then this
happens for free. The client caches the bundle header it grabs in my
step 2, and then that contains the unique pack name to fetch in step 3.
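To make the resume part concrete, here is a sketch of how a client might pick its Range header and react to the server's answer once it has cached the unique pack name. The action names and the ".part" file convention are invented for illustration; only the HTTP status codes are standard:

```python
import os


def resume_headers(partial_path: str) -> dict:
    """Ask only for the bytes we are missing; an empty dict means start over."""
    try:
        have = os.path.getsize(partial_path)
    except OSError:
        have = 0  # nothing downloaded yet
    return {"Range": f"bytes={have}-"} if have > 0 else {}


def next_action(status: int, sent_range: bool) -> str:
    """Decide what to do with the response to a (possibly ranged) GET."""
    if status == 206 and sent_range:
        return "append"           # range honored; append to the partial file
    if status == 200:
        return "restart"          # full body sent; overwrite what we had
    if status == 416:
        return "verify-complete"  # offset at or past the end; file may be complete
    if status in (404, 410):
        return "refetch-header"   # pack is gone; get the bundle header again
    return "fail"
```

Because the pack name is unique, a 404 on resume is the early abort Shawn was after: the client drops the partial file and starts again from the bundle header.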
-Peff