git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] index-pack: --clone-bundle option
Date: Fri, 4 Mar 2016 10:34:00 -0500	[thread overview]
Message-ID: <20160304153359.GA16300@sigill.intra.peff.net> (raw)
In-Reply-To: <xmqqfuw7186z.fsf_-_@gitster.mtv.corp.google.com>

On Thu, Mar 03, 2016 at 02:57:08PM -0800, Junio C Hamano wrote:

> Teach a new option "--clone-bundle" to "git index-pack" to create a
> split bundle file that uses an existing packfile as its data part.
> 
> The expected "typical" preparation for helping initial clone would
> start by preparing a packfile that contains most of the history and
> add another packfile that contains the remainder (e.g. the objects
> that are only reachable from reflog entries).  The first pack can
> then be used as the data part of a split bundle and these two files
> can be served as static files to bootstrap the clients without
> incurring any more CPU cycles to the server side.

Just FYI, because I know our setup is anything but typical, but this
probably _won't_ be what we do at GitHub. We try to keep everything in a
single packfile that contains many "islands", which are not allowed to
delta between each other[1]. Individual forks of a repo get their own
islands, unreachable objects are in another one, and so forth. So we'd
probably never want to provide a whole packfile, as it has way too much
data. But we _would_ be able to take a bitmap over the set of packed
objects such that if you blindly spew out every object with its bit set,
the resulting pack has the desired objects.

That's not as turn-key for serving with a dumb http server, obviously.
And I don't expect you to write that part. I just wanted to let you know
where I foresee having to take this for GitHub to get it deployed.

[1] It's a little more complicated than "don't delta with each other".
    Each object has its own bitmap of which islands it is in, and the
    rule is that you can use a delta base iff it is in a subset of your
    own islands. The point is that a clone of a particular island should
    never have to throw away on-disk deltas.

    Cleaning up and sharing that code upstream has been on my todo list
    for about 2 years, but somehow there's always new code to be
    written. :-/ I'm happy to share if people want to look at the messy
    state.

> Among the objects in the packfile, the ones that are not referenced
> by no other objects are identified and recorded as the "references"

s/no/any/ (double negative)

> in the resulting bundle.  As the packfile does not record any ref
> information, however, the names of the "references" recorded in the
> bundle need to be synthesized; we arbitrarily choose to record the
> object whose name is $SHA1 as refs/objects/$SHA1.

Makes sense. In an "island" world I'd want to write one bitmap per
island, and then reconstruct that list on the fly by referencing the
island bitmap with the sha1 names in the pack revidx.

> +static void write_bundle_file(const char *filename, unsigned char *sha1,
> +			      const char *packname)
> +{
> +	int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0600);
> +	struct strbuf buf = STRBUF_INIT;
> +	struct stat st;
> +	int i;
> +

We should probably bail here if "fd < 0", though I guess technically we
will notice in write_in_full(). :)

> +	if (stat(packname, &st))
> +		die(_("cannot stat %s"), packname);
> +
> +	strbuf_addstr(&buf, "# v3 git bundle\n");
> +	strbuf_addf(&buf, "size: %lu\n", (unsigned long) st.st_size);
> +	strbuf_addf(&buf, "sha1: %s\n", sha1_to_hex(sha1));
> +	strbuf_addf(&buf, "data: %s\n\n", packname);

This "sha1" field is the sha1 of the packfile, right? If so, I think
it's redundant with the pack trailer found in the pack at "packname".
I'd prefer not to include that here, because it makes it harder to
generate these v3 bundle headers dynamically (you have to actually
checksum the pack subset to come up with sha1 up front, as opposed to
checksumming as you stream the pack out).

It _could_ be used as an http etag, but I think it would make more sense
to push implementations toward using a unique value for the "packname"
(i.e., so that fetching "https://example.com/foo/1234abcd.pack" is
basically immutable). And I think that comes basically for free, because
the packname has that same hash in it already.

-Peff

  parent reply	other threads:[~2016-03-04 15:34 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-03 19:14 [PATCH] index-pack: correct --keep[=<msg>] Junio C Hamano
2016-03-03 19:47 ` Jeff King
2016-03-03 21:37 ` [PATCH] index-pack: add a helper function to derive .idx/.keep filename Junio C Hamano
2016-03-03 22:29   ` Jeff King
2016-03-03 22:57     ` [PATCH] index-pack: --clone-bundle option Junio C Hamano
2016-03-03 23:20       ` Junio C Hamano
2016-03-04 15:51         ` Jeff King
2016-03-04 15:34       ` Jeff King [this message]
2016-03-03 21:44 ` [PATCH] index-pack: correct --keep[=<msg>] Eric Sunshine

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160304153359.GA16300@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).