git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jonathan Nieder <jrnieder@gmail.com>
To: Jeff Hostetler <git@jeffhostetler.com>
Cc: git@vger.kernel.org, jeffhost@microsoft.com, peff@peff.net,
	gitster@pobox.com, markbt@efaref.net, benpeart@microsoft.com,
	jonathantanmy@google.com
Subject: Re: [PATCH 00/10] RFC Partial Clone and Fetch
Date: Wed, 3 May 2017 11:27:26 -0700	[thread overview]
Message-ID: <20170503182725.GC28740@aiede.svl.corp.google.com> (raw)
In-Reply-To: <777ab8f2-c31a-d07b-ffe3-f8333f408ea1@jeffhostetler.com>

Hi,

Jeff Hostetler wrote:
> On 3/8/2017 1:50 PM, git@jeffhostetler.com wrote:

>> [RFC] Partial Clone and Fetch
>> =============================
>> [...]
>> E. Unresolved Thoughts
>> ======================
>>
>> *TODO* The server should optionally return (in a side-band?) a list
>> of the blobs that it omitted from the packfile (and possibly the sizes
>> or sha1_object_info() data for them) during the fetch-pack/upload-pack
>> operation.  This would allow the client to distinguish from invalid
>> SHAs and missing ones.  Size information would allow the client to
>> maybe choose between various servers.
>
> Since I first posted this, Jonathan Tan has started a related
> discussion on missing blob support.
> https://public-inbox.org/git/CAGf8dgK05+f4uX-8+iMFvQd0n2JP6YxJ18ag8uDaEH6qc6SgVQ@mail.gmail.com/T/
>
> I want to respond to both of these threads here.

Thanks much for this.

> Missing-Blob Support
> ====================
>
> Let me offer up an alternative idea for representing
> missing blobs.  This is differs from both of our previous
> proposals.  (I don't have any code for this new proposal,
> I just want to think out loud a bit and see if this is a
> direction worth pursuing -- or a complete non-starter.)
>
> Both proposals talk about detecting and adapting to a missing
> blob and ways to recover -- when we fail to find a blob.
> Comments on the thread asked about:
> () being able to detect missing blobs vs corrupt repos
> () being unable to detect duplicate blobs
> () expense of blob search.
>
> Suppose we store "positive" information about missing blobs?
> This would let us know that a blob is intentionally missing
> and possibly some meta-data about it.

We've discussed this a little informally but didn't go more into
it, so I'm glad you brought it up.

There are two use cases I care about.  I'll want names to refer to
them later, so describing them now:

 A. A p4 or svn style "monorepo" containing all code within a company.
    We want to make git scale well when working with such a
    repository.  Disk usage, network usage, index size, and object
    lookup time are all issues for such a repository.

    A repository can creep up in size so it starts falling into this
    category even though git coped well with it before.  Another way
    to end up in this category is a conversion from a version control
    system like p4 or svn.

 B. A more modestly sized repository with some large blobs in it.  At
    clone time we want to omit unneeded large blobs to speed up the
    clone, save disk space, and save bandwidth.

    For this kind of repository, the idx file already contained all
    those blobs and that was not causing problems --- the only problem
    was the actual blob size.

> 1. Suppose we update the .pack file format slightly.
[...]
> 2. Make a similar change in the .idx format and git-index-pack
>    to include them there.  Then blob lookup operations could
>    definitively determine that a blob exists and is just not
>    present locally.

Some nits:

- there shouldn't be any need for the blobs to even be mentioned in
  the .pack stored locally.  The .idx file maps from sha1 to offset
  within the packfile --- a special offset could mean "this is a
  missing blob".

- Git protocol works by sending pack files over the wire.  The idx
  file is not transmitted to the client --- the client instead
  reconstructs it from the pack file.  I assume this is why you
  stored the SHA-1 of the object in the pack file, but it could
  equally well be sent through another stream (since this proposal
  involves a change to git protocol anyway).

- However, the list of missing blobs can be inferred from the existing
  pack format, without a change to the pack format used in git
  protocol.  As part of constructing the idx, "git index-pack"
  inflates every object in the pack file sent by the server.  This
  means we know what blobs they reference, so we can easily produce a
  list for the idx file without changing the pack file format.

If this is done by only changing the idx file format and not the pack
file, then it does not affect the protocol.  That is good for
experimentation --- it lets us try out different formats client-side
without having to coordinate with server authors.

In case (A), this proposal means we get back some of the per-object
overhead that we were trying to avoid.  I would like to avoid that
if possible.

In case (B), this proposal doesn't hurt.

One problem with proposals so far has been how to handle repositories
with multiple remotes.  Having a local list of missing blobs is
convenient because it can be associated to the remote --- when a blob
is referenced later, we know which remote to ask for for it.  This is
a niche feature, but it's a nice bonus.

[...]
> 3. With this, packfile-based blob-lookup operations can get a
>    "missing-blob" result.
>    () It should be possible to short-cut searching in other
>       packfiles (because we don't have to assume that the blob
>       was just misplaced in another packfile).
>    () Lookup can still look for the corresponding loose blob
>       (in case a previous lookup already "faulted it in").

The number of packfiles to search is always a problem for object
lookups --- we need to be able to keep the number of packfiles low.
If we solve that for missing blobs but not for the ones that are
present, then git's object access performance is still slow.

I have some ideas for improving git's gc behavior to bound the number
of packfiles better.  I think that's mostly orthogonal to this
proposal.  The main point of overlap is the number of packfiles
produced if someone uses one packfile per large blob, but that is bad
for the performance in successful lookups, not just negative lookups,
so it needs to be solved directly anyway.

> 4. We can then think about dynamically fetching it.

This also seems orthogonal to whether the missing blobs list exists.

[...]
> A missing-blob entry needs to contain the SHA-1 value of
> the blob (obviously).  Other fields are nice to have, but
> are not necessary.  Here are a few fields to consider.
[...]
> B. The raw size of the blob (5? bytes).

This would be a very nice thing.

If we were starting over, would this belong in the tree object?
(Part of the reason I ask is that we have an opportunity to sort
of start over as part of the transition away from SHA-1.)

[...]
> C. A server "hint" (20 bytes)
>    () Instructions to help the client fetch the blob.
>    () If I have multiple remotes configured, a missing-blob
>       should be fetched from the same server that created
>       the missing-blob entry (since it may be the only
>       one that has it).

I am worried about the implications of storing this kind of policy
information in the pack file.  There may be useful information along
these lines for a server to advertise, but I don't think it belongs in
the pack file.  Git protocol relies on it being cheap to stream pack
files as is.

[...]
> Combining the ideas here with the partial clone/fetch
> parameters and the various blob back-filling proposals
> gives us the ability to create and work with sparse
> repos.
> () Filtering can be based upon blob size; this could be
>    seen as an alternative solution to LFS for repos with
>    large objects.
> () Filtering could also be based upon pathnames (such as
>    a sparse-checkout filter) and greatly help performance
>    on very large repos where developers only work with
>    small areas of the tree.

To summarize my thoughts:

+ It is possible to construct a list of blobs missing from the
  packfile sent by the server, without changing the protocol.
  This is a very good thing, since it makes local experimentation
  possible.

- Except in the multiple-remotes case, I am mostly unconvinced about
  the benefit of having the list of missing blobs around locally.
  For example:

  - "git fsck" can use the list of missing blobs to see whether a
    reachable missing blob is intentional or a violation of the
    repository's integrity.  I don't think this is useful.  In a
    partial clone repository, missing blobs are always permitted
    and do not indicate repository corruption, since git operations
    have a way of coping with them (by fetching them on demand using
    read-blob protocol).
 - In a partial clone repository that is acting as a server, the
   list of missing blobs can be used to deny a read-blob request
   without passing the request on to "origin".  Without this
   check, such checks could be a DoS vector.  Since there are other,
   simpler requests that have the potential to DoS "origin" (e.g.
   using plain clone --mirror) in such a setup, I think it best to
   disallow read-blob requests to partial clone repositories
   altogether for now.

- I am also unconvinced about the server hint.

+ Having the list of missing blob sizes locally seems very useful.  I
  can imagine a client that wants this information for some blobs but
  not all --- e.g. they are only interested in a particular
  subdirectory so they want blob sizes from that directory.

The blob-size use case is compelling, but I still lean toward
constructing the list of blobs locally for now where it's useful,
instead of adding a missing-blob entry to the pack file.

I'll think more about blob sizes.  I agree that we need to do
something to handle them.

Thanks and hope that helps,
Jonathan

  reply	other threads:[~2017-05-03 18:27 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-08 18:50 [PATCH 00/10] RFC Partial Clone and Fetch git
2017-03-08 18:50 ` [PATCH 01/10] pack-objects: eat CR in addition to LF after fgets git
2017-03-08 18:50 ` [PATCH 02/10] pack-objects: add --partial-by-size=n --partial-special git
2017-03-08 18:50 ` [PATCH 03/10] pack-objects: test for --partial-by-size --partial-special git
2017-03-08 18:50 ` [PATCH 04/10] upload-pack: add partial (sparse) fetch git
2017-03-08 18:50 ` [PATCH 05/10] fetch-pack: add partial-by-size and partial-special git
2017-03-08 18:50 ` [PATCH 06/10] rev-list: add --allow-partial option to relax connectivity checks git
2017-03-08 18:50 ` [PATCH 07/10] index-pack: add --allow-partial option to relax blob existence checks git
2017-03-08 18:50 ` [PATCH 08/10] fetch: add partial-by-size and partial-special arguments git
2017-03-08 18:50 ` [PATCH 09/10] clone: " git
2017-03-08 18:50 ` [PATCH 10/10] ls-partial: created command to list missing blobs git
2017-03-09 20:18 ` [PATCH 00/10] RFC Partial Clone and Fetch Jonathan Tan
2017-03-16 21:43   ` Jeff Hostetler
2017-03-17 14:13     ` Jeff Hostetler
2017-03-22 15:16 ` ankostis
2017-03-22 16:21   ` Johannes Schindelin
2017-03-22 17:51     ` Jeff Hostetler
2017-05-03 16:38 ` Jeff Hostetler
2017-05-03 18:27   ` Jonathan Nieder [this message]
2017-05-04 16:51     ` Jeff Hostetler
2017-05-04 18:41       ` Jonathan Nieder
2017-05-08  0:15     ` Junio C Hamano
2017-05-03 20:40   ` Jonathan Tan
2017-05-03 21:08     ` Jonathan Nieder
  -- strict thread matches above, loose matches on Subject: below --
2017-03-08 17:37 Jeff Hostetler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170503182725.GC28740@aiede.svl.corp.google.com \
    --to=jrnieder@gmail.com \
    --cc=benpeart@microsoft.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jeffhost@microsoft.com \
    --cc=jonathantanmy@google.com \
    --cc=markbt@efaref.net \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).