From: Jeff Hostetler <git@jeffhostetler.com>
To: Philip Oakley <philipoakley@iee.org>, Vitaly Arbuzov <vit@uber.com>
Cc: Git List <git@vger.kernel.org>
Subject: Re: How hard would it be to implement sparse fetching/pulling?
Date: Tue, 5 Dec 2017 14:14:53 -0500
Message-ID: <7bcdbd52-0cfa-58b2-e40a-1852cc70ce69@jeffhostetler.com>
In-Reply-To: <6C1247A43F8841F98E070C264045BF49@PhilipOakley>



On 12/2/2017 1:24 PM, Philip Oakley wrote:
> From: "Jeff Hostetler" <git@jeffhostetler.com>
> Sent: Friday, December 01, 2017 5:23 PM
>> On 11/30/2017 6:43 PM, Philip Oakley wrote:
[...]
>>
>> Discussing this feature in the context of the defense industry
>> makes me a little nervous.  (I used to be in that area.)
> 
> I'm viewing the desire for codebase partitioning from a 'soft
> layering of risk' view (perhaps a more UK than USA approach ;-)

I'm not sure I know what this means or how the UK defense
security models/policy/procedures are different from the US,
so I can't say much here.  I'm just thinking that even if we
get a *perfectly working* partial clone/fetch/push/etc. that
it would not pass a security audit.  I might be wrong here
(and I'm no expert on the subject), but I think they would
push you towards a different solution architecture.


> 
>> What we have in the code so far may be a nice start, but
>> probably doesn't have the assurances that you would need
>> for actual deployment.  But it's a start....
> 
> True. I need to get some of my colleagues more engaged...
>>
[...]
>> Yes, this does tend to lead towards an always-online mentality.
>> However, there are 2 parts:
>> [a] dynamic object fetching for missing objects, such as during a
>>     random command like diff or blame or merge.  We need this
>>     regardless of usage -- because we can't always predict (or
>>     dry-run) every command the user might run in advance.
> 
> Making something "useful" happen here when offline is an obvious goal.
> 
>> [b] batch fetch mode, such as using partial-fetch to match your
>>     sparse-checkout so that you always have the blobs of interest
>>     to you.  And assuming you don't wander outside of this subset
>>     of the tree, you should be able to work offline as usual.
>> If you can work within the confines of [b], you wouldn't need to
>> always be online.
> 
> I feel this is the area that does need to ensure a capability to avoid
> any perception of the much-maligned 'Embrace, extend, and extinguish'
> by accidental lockout.
> 
> I don't think this should be viewed as a type of sparse checkout -
> it's just a checkout of what you have (under the hood it could use
> the same code though).

Right, I'm only thinking of this effort as a way to get a partial
clone and fetch that omits unneeded (or, not immediately needed)
objects for performance reasons.  There are several use scenarios
that I've discussed and sparse-checkout is one of them, but I do
not consider this to be a sparse-checkout feature.
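
To make the [b] batch mode above concrete, here is a rough sketch of
that workflow, driven from Python.  The flag and subcommand names are
illustrative of the interfaces being built here (e.g. "--filter" and
"sparse-checkout") and may not match the final syntax; the URL is
made up.

    import subprocess

    def git(*args, cwd=None):
        """Run one git command, raising on failure."""
        subprocess.run(["git", *args], check=True, cwd=cwd)

    # Partial clone: get commits and trees, but no blobs yet.
    git("clone", "--no-checkout", "--filter=blob:none",
        "https://example.com/big-repo.git", "big-repo")

    # Limit the working tree to the cone we care about, then check out.
    # The checkout batch-fetches just the blobs under Documentation/.
    git("sparse-checkout", "set", "Documentation", cwd="big-repo")
    git("checkout", "master", cwd="big-repo")

    # As long as we stay inside Documentation/, diff/blame/commit can
    # work offline; wandering outside the cone falls back to [a].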

[...]
>>
>> The main problem with markers or other lists of missing objects is
>> that they have scale problems for large repos.  Suppose I have 100M
>> blobs in my repo.  If I do a blob:none clone, I'd have 100M missing
>> blobs that would need tracking.  If I then do a batch fetch of the
>> blobs needed to do a sparse checkout of HEAD, I'd have to remove
>> those entries from the tracking data.  Not impossible, but not
>> speedy either.
> 
> ** Ahhh. I see. That's a consequence of having all the trees, isn't it? **
> 
> I've always thought that limiting the trees is at the heart of the Narrow clone/fetch problem.
> 
> OK, so if you have flat, wide structures with 10k files/directories per tree, then it's still a fair-sized problem, but it should *scale logarithmically* for the part of the tree structure that's not being downloaded.
> 
> You never have to add a marker for a blob that you have no containing tree for, nor for the tree that contained the blob's tree, all the way up the primary line of descent to the tree of concern. All those trees are never downloaded; there are just a few markers (.gitNarrowTree files) for those tree stubs, and certainly no 100M missing-blob markers.

Currently, the code only omits blobs.  I want to extend the current
code to have filters that also exclude unneeded trees.  That will help
address some of these size concerns, but there are still perf issues
here.
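
To put a number on the 100M-blob example (back-of-envelope only):

    # Illustrative arithmetic: a flat tracking list of 100M missing
    # blobs, stored as raw 20-byte SHA-1s, with no other overhead.
    missing_blobs = 100_000_000
    oid_size = 20                      # bytes per binary SHA-1
    print(missing_blobs * oid_size)    # 2_000_000_000 -- about 2GB just
                                       # for the raw oids, before you add
                                       # or remove a single entry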


>>> * Marking of 'missing' objects in the local object store, and on the wire.
>>> The missing objects are replaced by a placeholder object, which uses the
>>> same oid/sha1, but has a short fixed length, with content “GitNarrowObject
>>> <oid>”. The chance that that string would actually have such an oid clash is
>>> the same as for all other object hashes, so it is a *safe* self-referential device.
>>
>> Again, there is a scale problem here.  If I have 100M missing blobs,
>> I can't afford to create 100M loose placeholder files.  Or juggle
>> a 2GB file of missing objects during various operations.
> 
> As above, I'm also trimming the trees, so in general, there would be no missing blobs, just the content of the directory one was interested in.
> 
> That's not quite true if higher-level trees have blob references in them that are otherwise unwanted - they may each need a marker. [Or maybe a special single 'tree-of-blobs' marker for them all, thus only one marker per tree - over-thinking, maybe...]

Also omitting certain trees means you now (obviously) have both missing
trees and blobs.  And both need to be dynamically or batch fetched as
needed.  And certain operations will need multiple round trips to fully
resolve -- fault in a tree and then fault in blobs referenced by it.

And right, you still need to be able to refer to trees that have *some*
of their children missing.  It's not a clean tree-only boundary.

So, given all that, any set of markers would be incomplete and/or would
need to be aggressively updated to be correct.  What we have now in
Jonathan's "promisor" code allows us to infer at object-lookup time
that any missing object (from a tree-to-child or commit-to-tree reference)
is expected and can be resolved.  And this doesn't require any markers
or additional on-disk lists of SHAs or packfile format changes.
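
A toy model of that lookup-time inference (hypothetical names, nothing
like git's actual C code, just the shape of the rule):

    # If the repo has a promisor remote, a missing object is not an
    # error -- it is, by definition, fetchable from that remote.
    objects = {}            # stand-in for the local object store
    have_promisor = True    # i.e. the repo was cloned with a filter

    def fetch_from_promisor(oid):
        return b"...object contents from the server..."   # placeholder

    def read_object(oid):
        if oid in objects:
            return objects[oid]
        if have_promisor:
            objects[oid] = fetch_from_promisor(oid)   # dynamic fetch
            return objects[oid]
        raise KeyError("missing object %s: repository is corrupt" % oid)

    print(read_object("abc123"))   # faults in; no marker files needed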


[...]
>>> * The stored object already includes length (and inferred type), so we do
>>> know what it stands in for. Thus the local index (index file) should be able
>>> to be recreated from the object store alone (including the ‘promised /
>>> narrow / missing’ files/directory markers)

The packfile only contains the objects it contains.  The IDX file
is an index of that.  Neither knows of objects (or sizes of objects)
that they don't have.  They have child references (tree to contained
blob), but those are just dangling -- and may be in a different packfile.
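
In toy form (again hypothetical, just to show who knows what):

    # A pack knows its own objects; the .idx maps those oids (and only
    # those) to offsets.  A tree's child oid can be dangling here and
    # still be satisfied by another pack -- or by no pack at all.
    pack = {"tree1": b"...tree entry naming child blobX..."}
    idx = {oid: off for off, oid in enumerate(pack)}   # oid -> offset

    print("tree1" in idx)   # True: this pack stores it
    print("blobX" in idx)   # False: referenced by tree1, but this pack
                            # (and its idx) know nothing about it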

[...]
>>> As a safety it could be that the .gitNarrowIgnore is sent with the pack so
>>> that folk know what they missed, and fsck could check that they are locally
>>> not narrower than some specific project's .gitNarrowIgnore spec.
>>
>> Currently, we store the filter_spec used for the partial clone (or the
>> first partial fetch) as a default for subsequent fetches, but its use
>> stops there.  That is, for operations like checkout or blame or whatever,
>> it doesn't matter why a blob is missing or what filter criteria were
>> used to cause it to be omitted -- just that it is.  Any time we need
>> a missing object, we have to go get it -- whether that is via dynamic
>> or bulk fetching.
> 
> Deciding *if* we have to get it, while still being 'useful', is part of the question I raised above. In my world view, we already have the interesting blobs, so we shouldn't need to get anything. Diffs, blames, and checkouts simply go with the stub values, and everything is cushty.

That is not possible in general.  Suppose I have all of the trees and blobs
for my "cone" of the source tree (say "Documentation/") and I only plan to
make edits within that cone.  I can do sparse-checkout and commits and all
is well.  Suppose I want to merge my work with Alice and Bob.  I can pull
their branches and I can merge any edits they also made in my cone of the
tree and all is well.  *BUT* if they both edited a file that is outside of
my cone, my git-merge has to file-merge the contents of the 3 versions (the
ancestor, Alice's, and Bob's) of the file.  I don't have the blobs for them
because I only got blobs for my cone of the tree.  Likewise, I also may not
have the 3 containing tree nodes.

So, I either need dynamic object fetching -or- I need a dry-run mode to
predict the missing objects.
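
A toy illustration of that merge (oids hypothetical):

    # Merging one out-of-cone file needs all three blob versions in
    # full -- stub values are not enough to produce merged contents.
    local_blobs = {}   # cone-limited clone: none of these are present

    def load_blob(oid):
        if oid not in local_blobs:
            print("dynamic fetch of", oid)    # round trip to the server
            local_blobs[oid] = b"..."         # placeholder contents
        return local_blobs[oid]

    ancestor, alice, bob = "oid-base", "oid-alice", "oid-bob"
    versions = [load_blob(o) for o in (ancestor, alice, bob)]
    # git-merge can now do the usual 3-way file merge; a dry-run mode
    # would instead have predicted these three oids up front.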

>>
[...]
>>>
>>> I believe that it's all doable, and that Jeff H's work already puts much of
>>> it in place, or touches those places.
>>>
>>> That said, it has been just _thinking_, without sufficient time to delve
>>> into the code.
>>>
>>> Phil
>> [...]
>>
>> Thanks
>> Jeff
>>
> 
> Thanks for the great work.
> 
> Philip

Thanks for the comments,
Jeff
