git@vger.kernel.org mailing list mirror (one of many)
From: "Philip Oakley" <philipoakley@iee.org>
To: "Vitaly Arbuzov" <vit@uber.com>,
	"Jeff Hostetler" <git@jeffhostetler.com>
Cc: "Git List" <git@vger.kernel.org>
Subject: Re: How hard would it be to implement sparse fetching/pulling?
Date: Tue, 5 Dec 2017 23:46:32 -0000	[thread overview]
Message-ID: <67BC0CB8ADDA407CB88B60B9E3EE0093@PhilipOakley> (raw)
In-Reply-To: <c10b559d-40b3-dd94-daa1-67279e79d533@jeffhostetler.com>

From: "Jeff Hostetler" <git@jeffhostetler.com>
Sent: Monday, December 04, 2017 3:36 PM
>
> On 12/2/2017 11:30 AM, Philip Oakley wrote:
>> From: "Jeff Hostetler" <git@jeffhostetler.com>
>> Sent: Friday, December 01, 2017 2:30 PM
>>> On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote:
>>>> I think it would be great if we agree at a high level on the desired
>>>> user experience, so let me put a few possible use cases here.
>>>>
>>>> 1. Init and fetch into a new repo with a sparse list.
>>>> Preconditions: origin blah exists and has a lot of folders inside of
>>>> src including "bar".
>>>> Actions:
>>>> git init foo && cd foo
>>>> git config core.sparseAll true # New flag to activate all sparse
>>>> operations by default so you don't need to pass options to each
>>>> command.
>>>> echo "src/bar" > .git/info/sparse-checkout
>>>> git remote add origin blah
>>>> git pull origin master
>>>> Expected results: foo contains src/bar folder and nothing else,
>>>> objects that are unrelated to this tree are not fetched.
>>>> Notes: This should work the same when fetch/merge/checkout operations
>>>> are used in the right order.
>>>
>>> With the current patches (parts 1,2,3) we can pass a blob-ish
>>> to the server during a clone that refers to a sparse-checkout
>>> specification.
>>
>> I hadn't appreciated this capability. I see it as important; it should
>> be available both ways, so that a .gitNarrow spec can be imposed from the
>> server side as well as by the requester.
>>
>> It could also be used to assist in the 'precious/secret' blob problem, so 
>> that AWS keys are never pushed, nor available for fetching!
>
> To be honest, I've always considered partial clone/fetch
> a client-side request: a performance feature to minimize
> download times and disk space requirements on the client.

Mine was a two-way view, where one side or the other specifies an extent
for the narrow clone, to achieve either the speed/space improvement or a
partitioning capability.

> I've not thought of it from the "server has secrets" point
> of view.

My potential for "secrets" was a little softer than some of the 'hard'
security that is often discussed. I'm for the layered-risk approach (the
Swiss cheese model).
>
> We can talk about it, but I'd like to keep it outside the
> scope of the current effort.

Agreed.

>  My concern is that this is
> not the appropriate mechanism to enforce MAC/DAC-like security
> policies.  For example:
> [a] The client will still receive the containing trees that
>     refer to the sensitive blobs, so the user can tell when
>     the secret blobs change -- they wouldn't have either blob,
>     but can tell when they are changed.  This event by itself
>     may or may not leak sensitive information depending on the
>     terms of the security policy in place.
> [b] The existence of such missing blobs would tell the client
>     which blobs are significant and secret and allow them to
>     focus their attack.  It would be better if those assets
>     were completely hidden and not in the tree at all.
> [c] The client could push a fake secret blob to replace the
>     valid one on the server.  You would have to audit the
>     server to ensure that it never accepts a push containing
>     a change to any secret blob.  And the server would need
>     an infrastructure to know about all secrets in the tree.
> [d] When a secret blob does change, any local merges by the
>     user lack information to complete the merge -- they can't
>     merge the secrets and they can't be trusted to correctly
>     pick-ours or pick-theirs -- so their workflows are broken.
> I'm not trying to blindly spread FUD here, but it is arguments
> like these that make me suggest that the partial clone mechanism
> is not the right vehicle for such "secret" blobs.

I'm on the 'a little security is better than no security' side, but all the 
points are valid.
>
>
>>
>>> There's a bit of a chicken-n-egg problem getting
>>> things set up. So if we assume your team would create a series
>>> of "known enlistments" under version control, then you could
>>
>> s/enlistments/entitlements/ I presume?
>
> Within my org we speak of "enlistments" as subset of the tree
> that you plan to work on.  For example, you might enlist in the
> "file system" portion of the tree or in the "device drivers"
> portion.  If the Makefiles have good partitioning, you should
> only need one of the above portions to do productive work within
> a feature area.
Ah, so it's the set of things that the client has asked to work on ("I'd
like to enlist in ...").

>
> I'm not sure what you mean by "entitlements".

It is like having the title deeds to a house - a list of things you have,
or can have (e.g. a father saying: you can have the car on Saturday,
6pm-11pm).

At the end of the day the particular lists would be the same: they guide
what is sent.

>
>>
>>> just reference one by <branch>:<path> during your clone. The
>>> server can lookup that blob and just use it.
>>>
>>> git clone --filter=sparse:oid=master:templates/bar URL
>>>
>>> And then the server will filter-out the unwanted blobs during
>>> the clone. (The current version only filters blobs; you still
>>> get full commits and trees. That will be revisited later.)
>>
>> I'm for the idea that only the in-hierarchy trees should be sent.
>> It should also be possible that the server replies that it is only 
>> sending a narrow clone, with the given (accessible?) spec.
>
> I do want to extend this to have unneeded tree filtering too.
> It is just not in this version.

Great. That's good to hear.
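[As an aside for readers trying this out: once blob filtering is in place,
the client can list which objects it lacks. With the partial-clone support
as it later shipped in Git (the flag is from that version, not from these
patches), something like:

```shell
# List objects referenced by HEAD that are absent from the local
# object store of a partial clone; missing ones print as "?<oid>".
git rev-list --objects --missing=print HEAD | grep '^?' | head
```
]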

>
>>
>>>
>>> On the client side, the partial clone installs local config
>>> settings into the repo so that subsequent fetches default to
>>> the same filter criteria as used in the clone.
>>>
>>>
>>> I don't currently have provision to send a full sparse-checkout
>>> specification to the server during a clone or fetch. That
>>> seemed like too much to try to squeeze into the protocols.
>>> We can revisit this later if there is interest, but it wasn't
>>> critical for the initial phase.
>>>
>> Agreed. I think it should be somewhere 'visible' to the user, but could
>> be set up by the server admin / repo maintainer if they don't have write
>> access. But there could still be the catch-22 - maybe one starts with a
>> <commit | toptree> : <tree> pair to define an origin point (it's not as
>> refined as a .gitNarrow spec file, but is definitive). The toptree option
>> could even allow sub-tree clones.. maybe..
>
> That's why I suggest having the sparse-checkout specifications
> be stored under version control in the tree in a known location.
> The user could be told out-of-band that "master:enlistments/*"
> contains all of the well-defined enlistment specs -- I'm not
> proposing such an area, just that a sysadmin could agree to
> layout their tree with one.  Perhaps they have an enlistment
> that just includes that directory.  Then the user could clone
> that and look thru it -- add a new one if they need to and push
> it -- and then do a partial fetch using a different enlistment
> spec.  Again, I'm not dictating this mechanism, but just saying
> that something like the above is possible.

Ideal.
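[For concreteness, the workflow Jeff sketches might look roughly like this;
the repo layout, spec names, and URL are hypothetical, and the filter
syntax is the one from the current patches:

```shell
# Hypothetical: the admin keeps sparse-checkout specs under
# version control in enlistments/.
git clone --filter=sparse:oid=master:enlistments/minimal URL repo
cd repo
ls enlistments/      # browse the well-known enlistment specs
# Later, fetch using a different (hypothetical) enlistment spec:
git fetch --filter=sparse:oid=master:enlistments/drivers origin
```
]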

>
>>
>>>
>>>>
>>>> 2. Add a file and push changes.
>>>> Preconditions: all steps above followed.
>>>> touch src/bar/baz.txt && git add -A && git commit -m "added a file"
>>>> git push origin master
>>>> Expected results: changes are pushed to remote.
>>>
>>> I don't believe partial clone and/or partial fetch will cause
>>> any changes for push.
>>
>> I suspect that pushes could be rejected if the user 'pretends'
>> to modify files or trees outside their area. It would need the
>> user to spoof part of a tree they don't have, so an
>> upstream / remote would immediately know it was a spoof, but
>> locally the narrow clone doesn't have enough detail about the
>> 'bad' oid. It would be right to reject such attempts!
>
> There is nothing in the partial clone/fetch to support this.
> The server doesn't know which parts of the tree the user has
> or doesn't have.  There is nothing to prevent the user from
> creating a new file anywhere in the tree -- even if they don't
> have blobs for anything else in the surrounding directory --
> and including it in a push -- since the local "git commit"
> would see it as adding a single file with a new SHA and
> use the existing SHAs for the neighboring files.
>

OK. That's something I'll have a think about.

I was thinking that the receiving server would cross-check the received
trees against the enlistments/entitlements to confirm that the user hasn't
suggested that they have a blob/tree that neither knows about (at least for
a server that has a requirement to do that check!). This is something
Randall also brought up in his thread.
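Such a cross-check could live in a server-side pre-receive hook. A minimal
sketch, with the enlistment path hard-coded and hypothetical (a real
deployment would look up the pusher's enlistment somewhere):

```shell
#!/bin/sh
# Hypothetical pre-receive hook: reject a push if an updated ref
# touches paths outside the pusher's enlistment (here: src/bar/).
# Simplified: ref creation/deletion (all-zero ids) is skipped.
zero=0000000000000000000000000000000000000000
while read old new ref; do
	[ "$old" = "$zero" ] && continue
	[ "$new" = "$zero" ] && continue
	if git diff --name-only "$old" "$new" | grep -qv '^src/bar/'; then
		echo "rejected: $ref touches paths outside your enlistment" >&2
		exit 1
	fi
done
exit 0
```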

>
>>
>>>
>>>>
>>>> 3. Clone a repo with a sparse list as a filter.
>>>> Preconditions: same as for #1
>>>> Actions:
>>>> echo "src/bar" > /tmp/blah-sparse-checkout
>>>> git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be
>>>> the only command that would require a specific option being passed.
>>>> Expected results: same as for #1 plus /tmp/blah-sparse-checkout is
>>>> copied into .git/info/sparse-checkout
>>
>> I presume clone and fetch are treated equivalently here.
>
> Yes, for the download and config setup, both clone and fetch are
> equivalent.  They can pass a filter-spec to the server and set
> the local config variables to indicate that a partial clone/fetch
> was used.

Good.
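[For reference: in the version of this feature that eventually shipped in
Git, the local state a partial clone leaves behind is a pair of per-remote
config keys, roughly like this (values illustrative):

```
[remote "origin"]
	url = <URL>
	fetch = +refs/heads/*:refs/remotes/origin/*
	promisor = true
	partialclonefilter = sparse:oid=master:templates/bar
```
]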
>
>>
>>>
>>> There are 2 independent concepts here: clone and checkout.
>>> Currently, there isn't any automatic linkage of the partial clone to
>>> the sparse-checkout settings, so you could do something like this:
>>>
>> I see an implicit link that clearly one cannot checkout
>> (inflate/populate) a file/directory that one does not have
>> in the object store. But that does not imply the reverse linkage.
>> The regular sparse checkout should be available independently of
>> the local clone being a narrow one.
>
> Right, I wasn't talking about changing sparse-checkout.  I was
> just pointing out that after you complete a partial-clone, you
> need to take a few steps before you can do a checkout that won't
> complain about the missing blobs -- or worse, demand load all
> of them.
>

OK.
>
>>
>>> git clone --no-checkout --filter=sparse:oid=master:templates/bar URL
>>> git cat-file ... templates/bar >.git/info/sparse-checkout
>>> git config core.sparsecheckout true
>>> git checkout ...
>>>
>>> I've been focused on the clone/fetch issues and have not looked
>>> into the automation to couple them.
>>>
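[Filling in the elided steps from the quoted sequence, a complete,
illustrative version might be (URL and the templates/bar spec path are
placeholders from the example above):

```shell
git clone --no-checkout --filter=sparse:oid=master:templates/bar URL repo
cd repo
# templates/bar is assumed to hold sparse-checkout patterns, one per line
git cat-file blob master:templates/bar > .git/info/sparse-checkout
git config core.sparsecheckout true
git checkout master
```
]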
>>
>> I foresee that large files and certain files need to be filterable
>> for fetch-clone, and that might not be (backward) compatible
>> with the sparse-checkout.
>
> There are several filter-spec criteria: no-blobs, no-large-blobs,
> sparse-checkout-compatible and others may be added later.  Each
> may require different post-clone or post-fetch handling.
>
> For example, you may want to configure your client to only omit
> large blobs and configure a "missing blob helper" (outside of my
> scope) that fetches them from S3 rather than from the origin
> remote.  This would not need to use sparse-checkout.
>
> I guess what I'm trying to say is that this effort is providing
> a mechanism to let the git client request object filtering and
> work with missing objects locally.

OK.
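[The filter-spec criteria Jeff lists map onto concrete flags; in the
syntax the feature later shipped with in Git, for illustration:

```shell
git clone --filter=blob:none URL       # omit all blobs
git clone --filter=blob:limit=1m URL   # omit blobs larger than 1 MiB
git clone --filter=sparse:oid=master:templates/bar URL  # sparse spec blob
```
]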
>
>>
>>>
>>>>
>>>> 4. Showing log for sparsely cloned repo.
>>>> Preconditions: #3 is followed
>>>> Actions:
>>>> git log
>>>> Expected results: recent changes that affect src/bar tree.
>>>
>>> If I understand your meaning, log would only show changes
>>> within the sparse subset of the tree. This is not on my
>>> radar for partial clone/fetch. It would be a nice feature
>>> to have, but I think it would be better to think about it
>>> from the point of view of sparse-checkout rather than clone.
>>>
>> One option maybe by making a marker for the tree/blob to
>> be a first class citizen. So the oid (and worktree file)
>> has content ".gitNarrowTree <oid>" or ".gitNarrowBlob <oid>"
>> as required (*), which is safe, and allows a consistent
>> alter-ego view of the tree contents, and hence for git-log et al.
>>
>> (*) I keep flip flopping between a single object marker, and
>> distinct object markers for the types. It partly depends on
>> whether one can know in advance, locally, what the oid type
>> should be, and how it should be embedded in the object store
>> - need to re-check the specs.
>>
>> I'm tending toward distinct types to cope with the D/F conflict
>> in the worktrees - the directory must be created (holds the
>> name etc), and the alter-ego content then must be placed in a
>> _known_ sub-file ".gitNarrowTree" (without the oid in the file
>> name, but included in the content). A ".gitNarrowTree" file should
>> stand alone in the directory when that part of the
>> work-tree is clean.
>
> I'm not sure I follow this, but it is outside of my scope
> for partial clone/fetch, so I'd rather not dive too deep on
> this here.  If we really want a version of "git log" that
> respects sparse-checkout boundaries, we should start a new
> thread.  Thanks.
>

Agreed.
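[Worth noting: even without any narrow-clone awareness, pathspec limiting
already gives most of the "log for my area" behaviour from use case 4,
provided the relevant trees are present locally:

```shell
# Show only commits that touch src/bar
git log --oneline -- src/bar
```
]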

>>
>>>
>>>>
>>>> 5. Showing diff.
>>>> Preconditions: #3 is followed
>>>> Actions:
>>>> git diff HEAD^ HEAD
>>>> Expected results: changes from the most recent commit affecting
>>>> src/bar folder are shown.
>>>> Notes: this can be tricky operation as filtering must be done to
>>>> remove results from unrelated subtrees.
>>>
>>> I don't have any plan for this and I don't think it fits within
>>> the scope of clone/fetch. I think this too would be a sparse-checkout
>>> feature.
>>>
>>
>> See my note about first class citizens for marker OIDs
>>
>>>
>>>>
>>>> *Note that I intentionally didn't mention use cases that are related
>>>> to filtering by blob size as I think we should logically consider them
>>>> as a separate, although related, feature.
>>>
>>> I've grouped blob-size and sparse filter together for the
>>> purposes of clone/fetch since the basic mechanisms (filtering,
>>> transport, and missing object handling) are the same for both.
>>> They do lead to different end-uses, but that is above my level
>>> here.
>>>
>>>
>>>>
>>>> What do you think about these examples above? Is that something that
>>>> more-or-less fits into current development? Are there other important
>>>> flows that I've missed?
>>>
>>> These are all good ideas and it is good to have someone else who
>>> wants to use partial+sparse thinking about it and looking for gaps
>>> as we try to make a complete end-to-end feature.
>>>>
>>>> -Vitaly
>>>
>>> Thanks
>>> Jeff
>>>
>>
>> Philip
>>
>
> Thanks!
> Jeff
>
--
Philip 



Thread overview: 26+ messages
2017-11-30  3:16 How hard would it be to implement sparse fetching/pulling? Vitaly Arbuzov
2017-11-30 14:24 ` Jeff Hostetler
2017-11-30 17:01   ` Vitaly Arbuzov
2017-11-30 17:44     ` Vitaly Arbuzov
2017-11-30 20:03       ` Jonathan Nieder
2017-12-01 16:03         ` Jeff Hostetler
2017-12-01 18:16           ` Jonathan Nieder
2017-11-30 23:43       ` Philip Oakley
2017-12-01  1:27         ` Vitaly Arbuzov
2017-12-01  1:51           ` Vitaly Arbuzov
2017-12-01  2:51             ` Jonathan Nieder
2017-12-01  3:37               ` Vitaly Arbuzov
2017-12-02 16:59               ` Philip Oakley
2017-12-01 14:30             ` Jeff Hostetler
2017-12-02 16:30               ` Philip Oakley
2017-12-04 15:36                 ` Jeff Hostetler
2017-12-05 23:46                   ` Philip Oakley [this message]
2017-12-02 15:04           ` Philip Oakley
2017-12-01 17:23         ` Jeff Hostetler
2017-12-01 18:24           ` Jonathan Nieder
2017-12-04 15:53             ` Jeff Hostetler
2017-12-02 18:24           ` Philip Oakley
2017-12-05 19:14             ` Jeff Hostetler
2017-12-05 20:07               ` Jonathan Nieder
2017-12-01 15:28       ` Jeff Hostetler
2017-12-01 14:50     ` Jeff Hostetler
