git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jonathan Nieder <jrnieder@gmail.com>
To: Jeff Hostetler <git@jeffhostetler.com>
Cc: git@vger.kernel.org, jeffhost@microsoft.com, peff@peff.net,
	gitster@pobox.com, markbt@efaref.net, benpeart@microsoft.com,
	jonathantanmy@google.com
Subject: Re: [PATCH 00/10] RFC Partial Clone and Fetch
Date: Thu, 4 May 2017 11:41:51 -0700	[thread overview]
Message-ID: <20170504184150.GA15203@aiede.svl.corp.google.com> (raw)
In-Reply-To: <9b58bec9-dafb-aebc-d421-5327f2f607eb@jeffhostetler.com>

Hi again,

Jeff Hostetler wrote:

> In my original RFC there were comments/complaints that with
> missing blobs we lose the ability to detect corruptions.  My
> proposed changes to index-pack and rev-list (and suggestions
> for other commands like fsck) just disabled those errors.
> Personally, I'm OK with that, but I understand that others
> would like to save the ability to distinguish between missing
> and corrupted.

I'm also okay with it.  In a partial clone, in the same way as a
missing ref represents a different valid state and thus passes fsck
regardless of how it happened, a missing blob is a valid state and it
is sensible for it to pass fsck.

A person might object that previously a repository that passed "git
fsck" was a repository where "git fast-export --all" would succeed,
and if I omit a blob that is not present on the remote server then
that invariant is gone.  But that problem exists even if we have a
list of missing blobs.  The server could rewind history and garbage
collect, causing attempts on the client to fetch a previously
advertised missing blob to fail.  Or the server can disappear
completely, or it can lose all its data and have to be restored from
an older backup that is missing newer blobs.

> Right, only the .pack is sent over the wire.  And that's why I
> suggest listing the missing SHAs in it.  I think we need the server
> to send a list of them -- whether in individual per-file type-5
> records as I suggested, or a single record containing a list of
> SHAs+data (which I think I prefer in hindsight).

A list of SHAs+data sounds sensible as a way of conveying additional
per-blob information (such as size).

> WRT being able to discover the missing blobs, is that true in
> all cases?  I don't think it is for thin-packs -- where the server
> only sends stuff you don't (supposedly) already have, right ?

Generate the list of blobs referenced by trees in the pack, when you
are inflating them in git index-pack.  Omit any objects that you
already know about.  The remainder is the set of missing blobs.

One thing this doesn't tell you is if the same missing blob is
available from multiple remotes.  It associates each blob with a
single remote, the first one it was discovered from.

> If instead, we have pack-object indicate that it *would have*
> sent this blob normally, we don't change any of the semantics
> of how pack files are assembled.  This gives the client a
> definitive list of what's missing.

If there is additional information the server wants to convey about
the missing blobs, then this makes sense to me --- it has to send it
somewhere, and a separate list outside the pack seems like a good
place to put that.

When there is no additional information beyond "this is a blob I am
omitting", there is nothing the wire protocol needs to convey.  But
you've convinced me that that's usually moot because the blob size
is valuable information.

[...]
> The more I think about it, I'd like to NOT put the list in the .idx
> file.  If we put it in a separate peer file next to the *.{pack,idx}
> we could decorate it with the name of the remote used in the fetch
> or clone.

I have no strong opinions about this in either direction.

Since it only affects the local repository format and doesn't affect
the protocol, we can experiment without too much fuss. :)

[...]
> I've always wondered if repack, fetch, and friends should build
> a meta-idx that merges all of the current .idx files, but that
> is a different conversation....

Yes, we've been thinking about a one-index-for-many-packs facility
to deal with the proliferation of packfiles with only one or a few
large objects without having to waste I/O copying them into a
concatenated pack file.

Another thing we're looking into is incorporating something like
Martin Fick's "git exproll" (
http://public-inbox.org/git/1375756727-1275-1-git-send-email-artagnon@gmail.com/)
into "git gc --auto" to more aggressively keep the number of packs
down.

> On 5/3/2017 2:27 PM, Jonathan Nieder wrote:

>> If we were starting over, would this belong in the tree object?
>> (Part of the reason I ask is that we have an opportunity to sort
>> of start over as part of the transition away from SHA-1.)
>
> Yes, putting the size in the tree would be nice.  That does
> add 5-8 bytes to every entry in every tree (on top of the
> larger hash), but it would be useful.
>
> If we're going there, we might just define the new hash
> as the concatenation of the SHA* and the length of the data
> hashed.  So instead of a 32-byte SHA256, we'd have a (32 + 8)
> byte value.  (Or maybe a (32 + 5) if we want to squeeze it.)

Thanks --- that feedback helps.

It doesn't stop us from having to figure something else out in the
short term, of course.

[...]
>> I am worried about the implications of storing this kind of policy
>> information in the pack file.  There may be useful information along
>> these lines for a server to advertise, but I don't think it belongs in
>> the pack file.  Git protocol relies on it being cheap to stream pack
>> files as is.
>
> Perhaps.  I'm not sure it would be that expensive -- I'm thinking
> about it being a server constant rather than a per-user/per-fetch
> field.  But maybe we don't need it.  Assuming we can correctly
> associate a missing blob with the server/remote that omitted it.

Right.

I think we're heading toward a consensus:

- that the server should (at least if requested?) inform the client of
  the blobs it is omitting and their sizes

- that this would go in the same response as the packfile, but
  out-of-line from the pack

- that it (at least optionally?) gets stored *somewhere* locally
  associated with the pack --- either through an extension to the idx
  file format or in a separate file alongside the pack file and idx
  file.  This doesn't affect the protocol so it's not expensive to
  experiment

Thanks for the thoughtful replies.

Sincerely,
Jonathan

  reply	other threads:[~2017-05-04 18:42 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-08 18:50 [PATCH 00/10] RFC Partial Clone and Fetch git
2017-03-08 18:50 ` [PATCH 01/10] pack-objects: eat CR in addition to LF after fgets git
2017-03-08 18:50 ` [PATCH 02/10] pack-objects: add --partial-by-size=n --partial-special git
2017-03-08 18:50 ` [PATCH 03/10] pack-objects: test for --partial-by-size --partial-special git
2017-03-08 18:50 ` [PATCH 04/10] upload-pack: add partial (sparse) fetch git
2017-03-08 18:50 ` [PATCH 05/10] fetch-pack: add partial-by-size and partial-special git
2017-03-08 18:50 ` [PATCH 06/10] rev-list: add --allow-partial option to relax connectivity checks git
2017-03-08 18:50 ` [PATCH 07/10] index-pack: add --allow-partial option to relax blob existence checks git
2017-03-08 18:50 ` [PATCH 08/10] fetch: add partial-by-size and partial-special arguments git
2017-03-08 18:50 ` [PATCH 09/10] clone: " git
2017-03-08 18:50 ` [PATCH 10/10] ls-partial: created command to list missing blobs git
2017-03-09 20:18 ` [PATCH 00/10] RFC Partial Clone and Fetch Jonathan Tan
2017-03-16 21:43   ` Jeff Hostetler
2017-03-17 14:13     ` Jeff Hostetler
2017-03-22 15:16 ` ankostis
2017-03-22 16:21   ` Johannes Schindelin
2017-03-22 17:51     ` Jeff Hostetler
2017-05-03 16:38 ` Jeff Hostetler
2017-05-03 18:27   ` Jonathan Nieder
2017-05-04 16:51     ` Jeff Hostetler
2017-05-04 18:41       ` Jonathan Nieder [this message]
2017-05-08  0:15     ` Junio C Hamano
2017-05-03 20:40   ` Jonathan Tan
2017-05-03 21:08     ` Jonathan Nieder
  -- strict thread matches above, loose matches on Subject: below --
2017-03-08 17:37 Jeff Hostetler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170504184150.GA15203@aiede.svl.corp.google.com \
    --to=jrnieder@gmail.com \
    --cc=benpeart@microsoft.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jeffhost@microsoft.com \
    --cc=jonathantanmy@google.com \
    --cc=markbt@efaref.net \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).