git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jonathan Tan <jonathantanmy@google.com>
To: Jeff Hostetler <git@jeffhostetler.com>, git@vger.kernel.org
Cc: jeffhost@microsoft.com, peff@peff.net, gitster@pobox.com,
	markbt@efaref.net, benpeart@microsoft.com
Subject: Re: [PATCH 00/10] RFC Partial Clone and Fetch
Date: Wed, 3 May 2017 13:40:08 -0700	[thread overview]
Message-ID: <2b1b504a-07c9-81b3-fed2-e9c029a5b284@google.com> (raw)
In-Reply-To: <777ab8f2-c31a-d07b-ffe3-f8333f408ea1@jeffhostetler.com>

On 05/03/2017 09:38 AM, Jeff Hostetler wrote:
>
>
> On 3/8/2017 1:50 PM, git@jeffhostetler.com wrote:
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>>
>> [RFC] Partial Clone and Fetch
>> =============================
>> [...]
>> E. Unresolved Thoughts
>> ======================
>>
>> *TODO* The server should optionally return (in a side-band?) a list
>> of the blobs that it omitted from the packfile (and possibly the sizes
>> or sha1_object_info() data for them) during the fetch-pack/upload-pack
>> operation.  This would allow the client to distinguish from invalid
>> SHAs and missing ones.  Size information would allow the client to
>> maybe choose between various servers.
>
> Since I first posted this, Jonathan Tan has started a related
> discussion on missing blob support.
> https://public-inbox.org/git/CAGf8dgK05+f4uX-8+iMFvQd0n2JP6YxJ18ag8uDaEH6qc6SgVQ@mail.gmail.com/T/
>
>
> I want to respond to both of these threads here.
> -------------------------------------------------

Thanks for your input. I see that you have explained both "storing 
'positive' information about missing blobs" and "what to store with 
those positive information"; I'll just comment on the former for now.

> Missing-Blob Support
> ====================
>
> Let me offer up an alternative idea for representing
> missing blobs.  This is differs from both of our previous
> proposals.  (I don't have any code for this new proposal,
> I just want to think out loud a bit and see if this is a
> direction worth pursuing -- or a complete non-starter.)
>
> Both proposals talk about detecting and adapting to a missing
> blob and ways to recover -- when we fail to find a blob.
> Comments on the thread asked about:
> () being able to detect missing blobs vs corrupt repos
> () being unable to detect duplicate blobs
> () expense of blob search.
>
> Suppose we store "positive" information about missing blobs?
> This would let us know that a blob is intentionally missing
> and possibly some meta-data about it.

I thought about this (see "Some alternative designs" in [1]), listing 
some similar benefits, but concluded that "it is difficult to scale to 
large repos".

Firstly, to be clear, by large repos I meant (and mean) the svn-style 
"monorepos" that Jonathan Nieder mentions as use case "A" [2].

My concern is that such lists (whether in separate file(s) or in .idx 
files) would be too unwieldy to manipulate. Even if we design things to 
avoid modifying such lists (for example, by adding a new list whenever 
we fetch instead of trying to modify an existing one), we would at least 
need to sort their contents (for example, when generating an .idx in the 
first place). For a repo with 10M-100M blobs [3], this might be doable 
on today's computers, but I would be concerned if a repo would exceed 
such numbers.

[1] <20170426221346.25337-1-jonathantanmy@google.com>
[2] <20170503182725.GC28740@aiede.svl.corp.google.com>
[3] In Microsoft's announcement of Git Virtual File System [4], they 
mentioned "over 3.5 million files" in the Windows codebase. I'm not sure 
if this refers to files in a snapshot (that is, working copy) or all 
historical versions.
[4] 
https://blogs.msdn.microsoft.com/visualstudioalm/2017/02/03/announcing-gvfs-git-virtual-file-system/

> 1. Suppose we update the .pack file format slightly.
>    () We use the 5 value in "enum object_type" to mean a
>       "missing-blob".
>    () We update git-pack-object as I did in my RFC, but have it
>       create type 5 entries for the blobs that are omitted,
>       rather than nothing.
>    () Hopefully, the same logic that currently keeps pack-object
>       from sending unnecessary blobs on subsequent fetches can
>       also be used to keep it from sending unnecessary missing-blob
>       entries.
>    () The type 5 missing-blob entry would contain the SHA-1 of the
>       blob and some meta-data to be explained later.

My original idea was to have sorted list(s) of hashes in separate 
file(s) much like the currently existing shallow file; it would have the 
semantics of "a hash here might be present or absent; if it is absent, 
use the hook". (Initially I thought that one list would be sufficient, 
but after reading your idea and considering it some more, multiple lists 
might be better.)

Your idea of storing them in an .idx (and possibly corresponding .pack 
file) is similar, I think. Although mine is probably simpler - at least, 
we wouldn't need a new object_type.

As described above, I don't think this list-of-hashes idea will work 
(because of the large numbers of blobs involved), but I'll compare it to 
yours anyway just in case we end up being convinced that this general 
idea works.

> 2. Make a similar change in the .idx format and git-index-pack
>    to include them there.  Then blob lookup operations could
>    definitively determine that a blob exists and is just not
>    present locally.
>
> 3. With this, packfile-based blob-lookup operations can get a
>    "missing-blob" result.
>    () It should be possible to short-cut searching in other
>       packfiles (because we don't have to assume that the blob
>       was just misplaced in another packfile).
>    () Lookup can still look for the corresponding loose blob
>       (in case a previous lookup already "faulted it in").

The binary search to lookup a packfile offset from a .idx file (which 
involves disk reads) would take longer for all lookups (not just lookups 
for missing blobs) - I think I prefer keeping the lists separate, to 
avoid pessimizing the (likely) usual case where the relevant blobs are 
all already in local repo storage.

> 4. We can then think about dynamically fetching it.
>    () Several techniques for this are currently being
>       discussed on the mailing list in other threads,
>       so I won't go into this here.
>    () There has also been debate about whether this should
>       yield a loose blob or a new packfile.  I think both
>       forms have merit and depend on whether we are limited
>       to asking for a single blob or can make a batch request.
>    () A dynamically-fetched loose blob is placed in the normal
>       loose blob directory hierarchy so that subsequent
>       lookups can find it as mentioned above.
>    () A dynamically-fetched packfile (with one or more blobs)
>       is written to the ODB and then the lookup operation
>       completes.
>       {} I want to isolate these packfiles from the main
>          packfiles, so that they behave like a second-stage
>          lookup and don't affect the caching/LRU nature of
>          the existing first-stage packfile lookup.
>       {} I also don't want the ambiguity of having 2 primary
>          packfiles with a blob marked as missing in 1 and
>          present in the other.

With my idea, the second-stage lookup is done on the list of missing 
hashes; there is no division between packfiles.

> 5. git-repack should be updated to "do the right thing" and
>    squash missing-blob entries.
>
> 6. And etc.

  parent reply	other threads:[~2017-05-03 20:40 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-08 18:50 [PATCH 00/10] RFC Partial Clone and Fetch git
2017-03-08 18:50 ` [PATCH 01/10] pack-objects: eat CR in addition to LF after fgets git
2017-03-08 18:50 ` [PATCH 02/10] pack-objects: add --partial-by-size=n --partial-special git
2017-03-08 18:50 ` [PATCH 03/10] pack-objects: test for --partial-by-size --partial-special git
2017-03-08 18:50 ` [PATCH 04/10] upload-pack: add partial (sparse) fetch git
2017-03-08 18:50 ` [PATCH 05/10] fetch-pack: add partial-by-size and partial-special git
2017-03-08 18:50 ` [PATCH 06/10] rev-list: add --allow-partial option to relax connectivity checks git
2017-03-08 18:50 ` [PATCH 07/10] index-pack: add --allow-partial option to relax blob existence checks git
2017-03-08 18:50 ` [PATCH 08/10] fetch: add partial-by-size and partial-special arguments git
2017-03-08 18:50 ` [PATCH 09/10] clone: " git
2017-03-08 18:50 ` [PATCH 10/10] ls-partial: created command to list missing blobs git
2017-03-09 20:18 ` [PATCH 00/10] RFC Partial Clone and Fetch Jonathan Tan
2017-03-16 21:43   ` Jeff Hostetler
2017-03-17 14:13     ` Jeff Hostetler
2017-03-22 15:16 ` ankostis
2017-03-22 16:21   ` Johannes Schindelin
2017-03-22 17:51     ` Jeff Hostetler
2017-05-03 16:38 ` Jeff Hostetler
2017-05-03 18:27   ` Jonathan Nieder
2017-05-04 16:51     ` Jeff Hostetler
2017-05-04 18:41       ` Jonathan Nieder
2017-05-08  0:15     ` Junio C Hamano
2017-05-03 20:40   ` Jonathan Tan [this message]
2017-05-03 21:08     ` Jonathan Nieder
  -- strict thread matches above, loose matches on Subject: below --
2017-03-08 17:37 Jeff Hostetler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2b1b504a-07c9-81b3-fed2-e9c029a5b284@google.com \
    --to=jonathantanmy@google.com \
    --cc=benpeart@microsoft.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jeffhost@microsoft.com \
    --cc=markbt@efaref.net \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).