git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff Hostetler <git@jeffhostetler.com>
To: git@vger.kernel.org
Cc: jeffhost@microsoft.com, peff@peff.net, gitster@pobox.com,
	markbt@efaref.net, benpeart@microsoft.com,
	jonathantanmy@google.com
Subject: Re: [PATCH 00/10] RFC Partial Clone and Fetch
Date: Wed, 3 May 2017 12:38:33 -0400	[thread overview]
Message-ID: <777ab8f2-c31a-d07b-ffe3-f8333f408ea1@jeffhostetler.com> (raw)
In-Reply-To: <1488999039-37631-1-git-send-email-git@jeffhostetler.com>



On 3/8/2017 1:50 PM, git@jeffhostetler.com wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
>
>
> [RFC] Partial Clone and Fetch
> =============================
> [...]
> E. Unresolved Thoughts
> ======================
>
> *TODO* The server should optionally return (in a side-band?) a list
> of the blobs that it omitted from the packfile (and possibly the sizes
> or sha1_object_info() data for them) during the fetch-pack/upload-pack
> operation.  This would allow the client to distinguish from invalid
> SHAs and missing ones.  Size information would allow the client to
> maybe choose between various servers.

Since I first posted this, Jonathan Tan has started a related
discussion on missing blob support.
https://public-inbox.org/git/CAGf8dgK05+f4uX-8+iMFvQd0n2JP6YxJ18ag8uDaEH6qc6SgVQ@mail.gmail.com/T/

I want to respond to both of these threads here.
-------------------------------------------------

Missing-Blob Support
====================

Let me offer up an alternative idea for representing
missing blobs.  This is differs from both of our previous
proposals.  (I don't have any code for this new proposal,
I just want to think out loud a bit and see if this is a
direction worth pursuing -- or a complete non-starter.)

Both proposals talk about detecting and adapting to a missing
blob and ways to recover -- when we fail to find a blob.
Comments on the thread asked about:
() being able to detect missing blobs vs corrupt repos
() being unable to detect duplicate blobs
() expense of blob search.

Suppose we store "positive" information about missing blobs?
This would let us know that a blob is intentionally missing
and possibly some meta-data about it.


1. Suppose we update the .pack file format slightly.
    () We use the 5 value in "enum object_type" to mean a
       "missing-blob".
    () We update git-pack-object as I did in my RFC, but have it
       create type 5 entries for the blobs that are omitted,
       rather than nothing.
    () Hopefully, the same logic that currently keeps pack-object
       from sending unnecessary blobs on subsequent fetches can
       also be used to keep it from sending unnecessary missing-blob
       entries.
    () The type 5 missing-blob entry would contain the SHA-1 of the
       blob and some meta-data to be explained later.

2. Make a similar change in the .idx format and git-index-pack
    to include them there.  Then blob lookup operations could
    definitively determine that a blob exists and is just not
    present locally.

3. With this, packfile-based blob-lookup operations can get a
    "missing-blob" result.
    () It should be possible to short-cut searching in other
       packfiles (because we don't have to assume that the blob
       was just misplaced in another packfile).
    () Lookup can still look for the corresponding loose blob
       (in case a previous lookup already "faulted it in").

4. We can then think about dynamically fetching it.
    () Several techniques for this are currently being
       discussed on the mailing list in other threads,
       so I won't go into this here.
    () There has also been debate about whether this should
       yield a loose blob or a new packfile.  I think both
       forms have merit and depend on whether we are limited
       to asking for a single blob or can make a batch request.
    () A dynamically-fetched loose blob is placed in the normal
       loose blob directory hierarchy so that subsequent
       lookups can find it as mentioned above.
    () A dynamically-fetched packfile (with one or more blobs)
       is written to the ODB and then the lookup operation
       completes.
       {} I want to isolate these packfiles from the main
          packfiles, so that they behave like a second-stage
          lookup and don't affect the caching/LRU nature of
          the existing first-stage packfile lookup.
       {} I also don't want the ambiguity of having 2 primary
          packfiles with a blob marked as missing in 1 and
          present in the other.

5. git-repack should be updated to "do the right thing" and
    squash missing-blob entries.

6. And etc.


Missing-Blob Entry Data
=======================

A missing-blob entry needs to contain the SHA-1 value of
the blob (obviously).  Other fields are nice to have, but
are not necessary.  Here are a few fields to consider.

A. The SHA-1 (20 bytes)

B. The raw size of the blob (5? bytes).
    () This is the cleaned size of the file as stored.  The
       server does not (and should not) have any knowledge
       of the smudging that may happen.
    () This may be useful if whatever dynamic-fetch-hook
       wants to customize its behavior, such as individually
       fetching large blobs and batch fetching smaller ones
       from the same server.
    () GVFS found it necessary to create a custom server
       end-point to get blob size data so that "ls -l"
       could show file sizes for non-present virtualized
       files.
    () 5 bytes (uint:40) should be more than enough for this.

C. A server "hint" (20 bytes)
    () Instructions to help the client fetch the blob.
    () If I have multiple remotes configured, a missing-blob
       should be fetched from the same server that created
       the missing-blob entry (since it may be the only
       one that has it).
    () If a blob is very large (and was omitted for this
       reason), the server may want to redirect the client
       to a geographically closer CDN.
    () This is the SHA-1 of a file in the repository of a
       hook (or a set of parameters to be used by a hook).
       {} This is a bit of *hand-wave* right now, but the
          idea is that you can use the information here to
          individually fetch a blob or batch fetch a set
          of blobs that have the same hint.
       {} Yes, there are security concerns here, so perhaps
          the hint file should just contain parameters for
          a stock git-fetch-pack or git-fetch-blob-pack or
          curl command (or wrapper script) that "does the
          right thing".
       {} I thought this would be more compact than listing
          detailed fetch data per-blob.  And we don't have
          to define yet another syntax.  For example, we can
          let the SHA-1 point to an administrator configured
          shell script and be done.
    () We assume that the SHA-1 file is present locally
       (not missing).  This might refer to a pinned file
       in a special ".git*" file (that we never omit) in
       HEAD.  Or it might be in a branch that all clients
       are assumed to have.


Concluding Thoughts
===================

Combining the ideas here with the partial clone/fetch
parameters and the various blob back-filling proposals
gives us the ability to create and work with sparse
repos.
() Filtering can be based upon blob size; this could be
    seen as an alternative solution to LFS for repos with
    large objects.
() Filtering could also be based upon pathnames (such as
    a sparse-checkout filter) and greatly help performance
    on very large repos where developers only work with
    small areas of the tree.

Thanks
Jeff








  parent reply	other threads:[~2017-05-03 16:38 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-08 18:50 [PATCH 00/10] RFC Partial Clone and Fetch git
2017-03-08 18:50 ` [PATCH 01/10] pack-objects: eat CR in addition to LF after fgets git
2017-03-08 18:50 ` [PATCH 02/10] pack-objects: add --partial-by-size=n --partial-special git
2017-03-08 18:50 ` [PATCH 03/10] pack-objects: test for --partial-by-size --partial-special git
2017-03-08 18:50 ` [PATCH 04/10] upload-pack: add partial (sparse) fetch git
2017-03-08 18:50 ` [PATCH 05/10] fetch-pack: add partial-by-size and partial-special git
2017-03-08 18:50 ` [PATCH 06/10] rev-list: add --allow-partial option to relax connectivity checks git
2017-03-08 18:50 ` [PATCH 07/10] index-pack: add --allow-partial option to relax blob existence checks git
2017-03-08 18:50 ` [PATCH 08/10] fetch: add partial-by-size and partial-special arguments git
2017-03-08 18:50 ` [PATCH 09/10] clone: " git
2017-03-08 18:50 ` [PATCH 10/10] ls-partial: created command to list missing blobs git
2017-03-09 20:18 ` [PATCH 00/10] RFC Partial Clone and Fetch Jonathan Tan
2017-03-16 21:43   ` Jeff Hostetler
2017-03-17 14:13     ` Jeff Hostetler
2017-03-22 15:16 ` ankostis
2017-03-22 16:21   ` Johannes Schindelin
2017-03-22 17:51     ` Jeff Hostetler
2017-05-03 16:38 ` Jeff Hostetler [this message]
2017-05-03 18:27   ` Jonathan Nieder
2017-05-04 16:51     ` Jeff Hostetler
2017-05-04 18:41       ` Jonathan Nieder
2017-05-08  0:15     ` Junio C Hamano
2017-05-03 20:40   ` Jonathan Tan
2017-05-03 21:08     ` Jonathan Nieder
  -- strict thread matches above, loose matches on Subject: below --
2017-03-08 17:37 Jeff Hostetler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=777ab8f2-c31a-d07b-ffe3-f8333f408ea1@jeffhostetler.com \
    --to=git@jeffhostetler.com \
    --cc=benpeart@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jeffhost@microsoft.com \
    --cc=jonathantanmy@google.com \
    --cc=markbt@efaref.net \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).