git@vger.kernel.org mailing list mirror (one of many)
From: Jeff Hostetler <git@jeffhostetler.com>
To: Philip Oakley <philipoakley@iee.org>, Vitaly Arbuzov <vit@uber.com>
Cc: Git List <git@vger.kernel.org>
Subject: Re: How hard would it be to implement sparse fetching/pulling?
Date: Mon, 4 Dec 2017 10:36:07 -0500	[thread overview]
Message-ID: <c10b559d-40b3-dd94-daa1-67279e79d533@jeffhostetler.com> (raw)
In-Reply-To: <A2F606CCDE4542D8AAFC41EB50909FE3@PhilipOakley>



On 12/2/2017 11:30 AM, Philip Oakley wrote:
> From: "Jeff Hostetler" <git@jeffhostetler.com>
> Sent: Friday, December 01, 2017 2:30 PM
>> On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote:
>>> I think it would be great if we agree at a high level on the desired
>>> user experience, so let me put a few possible use cases here.
>>>
>>> 1. Init and fetch into a new repo with a sparse list.
>>> Preconditions: origin blah exists and has a lot of folders inside of
>>> src including "bar".
>>> Actions:
>>> git init foo && cd foo
>>> git config core.sparseAll true # New flag to activate all sparse
>>> operations by default so you don't need to pass options to each
>>> command.
>>> echo "src/bar" > .git/info/sparse-checkout
>>> git remote add origin blah
>>> git pull origin master
>>> Expected results: foo contains src/bar folder and nothing else,
>>> objects that are unrelated to this tree are not fetched.
>>> Notes: This should work the same when fetch/merge/checkout operations
>>> are used in the right order.
>>
>> With the current patches (parts 1,2,3) we can pass a blob-ish
>> to the server during a clone that refers to a sparse-checkout
>> specification.
> 
> I hadn't appreciated this capability. I see it as important, and should be available both ways, so that a .gitNarrow spec can be imposed from the server side, as well as by the requester.
> 
> It could also be used to assist in the 'precious/secret' blob problem, so that AWS keys are never pushed, nor available for fetching!

To be honest, I've always considered partial clone/fetch as
a client-side request as a performance feature to minimize
download times and disk space requirements on the client.
I've not thought of it from the "server has secrets" point
of view.

We can talk about it, but I'd like to keep it outside the
scope of the current effort.  My concern is that this is
not the appropriate mechanism for enforcing MAC/DAC-style
security.  For example:
[a] The client will still receive the containing trees that
     refer to the sensitive blobs, so the user can tell when
     the secret blobs change -- they wouldn't have either
     version of the blob, but they could tell when it changed.
     This event by itself may or may not leak sensitive
     information, depending on the terms of the security
     policy in place.
[b] The existence of such missing blobs would tell the client
     which blobs are significant and secret and allow them to
     focus their attack.  It would be better if those assets
     were completely hidden and not in the tree at all.
[c] The client could push a fake secret blob to replace the
     valid one on the server.  You would have to audit the
     server to ensure that it never accepts a push containing
     a change to any secret blob.  And the server would need
     an infrastructure to know about all secrets in the tree.
[d] When a secret blob does change, any local merges by the
     user lack information to complete the merge -- they can't
     merge the secrets and they can't be trusted to correctly
     pick-ours or pick-theirs -- so their workflows are broken.
I'm not trying to blindly spread FUD here, but it is arguments
like these that make me suggest that the partial clone mechanism
is not the right vehicle for such "secret" blobs.


> 
>>        There's a bit of a chicken-n-egg problem getting
>> things set up.  So if we assume your team would create a series
>> of "known enlistments" under version control, then you could
> 
> s/enlistments/entitlements/ I presume?

Within my org we speak of "enlistments" as subset of the tree
that you plan to work on.  For example, you might enlist in the
"file system" portion of the tree or in the "device drivers"
portion.  If the Makefiles have good partitioning, you should
only need one of the above portions to do productive work within
a feature area.
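For illustration (the paths here are made up, not a proposal), an
enlistment in sparse-checkout terms would just be a pattern file
naming one area of the tree:

```
# enlistments/file-system -- hypothetical "file system" enlistment
fs/
include/fs/
Makefile
```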

I'm not sure what you mean by "entitlements".

> 
>> just reference one by <branch>:<path> during your clone.  The
>> server can lookup that blob and just use it.
>>
>>     git clone --filter=sparse:oid=master:templates/bar URL
>>
>> And then the server will filter-out the unwanted blobs during
>> the clone.  (The current version only filters blobs; you still
>> get full commits and trees.  That will be revisited later.)
> 
> I'm for the idea that only the in-hierarchy trees should be sent.
> It should also be possible for the server to reply that it is
> only sending a narrow clone, with the given (accessible?) spec.

I do want to extend this to have unneeded tree filtering too.
It is just not in this version.

> 
>>
>> On the client side, the partial clone installs local config
>> settings into the repo so that subsequent fetches default to
>> the same filter criteria as used in the clone.
>>
>>
>> I don't currently have provision to send a full sparse-checkout
>> specification to the server during a clone or fetch.  That
>> seemed like too much to try to squeeze into the protocols.
>> We can revisit this later if there is interest, but it wasn't
>> critical for the initial phase.
>>
> Agreed. I think it should be somewhere 'visible' to the user, but it could be set up by the server admin / repo maintainer if the user doesn't have write access. There could still be the catch-22, though - maybe one starts with a <commit | toptree> : <tree> pair to define an origin point (it's not as refined as a .gitNarrow spec file, but it is definitive). The toptree option could even allow sub-tree clones.. maybe..

That's why I suggest having the sparse-checkout specifications
be stored under version control in the tree in a known location.
The user could be told out-of-band that "master:enlistments/*"
contains all of the well-defined enlistment specs -- I'm not
proposing such an area, just that a sysadmin could agree to
layout their tree with one.  Perhaps they have an enlistment
that just includes that directory.  Then the user could clone
that and look thru it -- add a new one if they need to and push
it -- and then do a partial fetch using a different enlistment
spec.  Again, I'm not dictating this mechanism, but just saying
that something like the above is possible.
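A runnable sketch of that flow, with a plain local repo standing in
for the server; the enlistments/ layout is hypothetical, and the
classic sparse-checkout recipe stands in for the partial-fetch
filtering (which would additionally omit the unwanted blobs from the
transfer itself):

```shell
set -e
dir=$(mktemp -d); cd "$dir"
# A "server" repo whose tree carries its own enlistment specs.
git init -q server && cd server
mkdir -p enlistments src/bar src/baz
echo "src/bar" > enlistments/bar     # spec: just the src/bar area
echo bar > src/bar/f.txt
echo baz > src/baz/f.txt
git add -A
git -c user.email=you@example.com -c user.name=you \
    commit -q -m "tree with enlistment specs"
cd "$dir"
# Clone without populating the working tree, then use the in-tree
# spec to drive a sparse checkout.
git clone -q --no-checkout server work && cd work
git config core.sparseCheckout true
git cat-file blob HEAD:enlistments/bar > .git/info/sparse-checkout
git read-tree -mu HEAD
ls src                               # prints "bar"
```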

> 
>>
>>>
>>> 2. Add a file and push changes.
>>> Preconditions: all steps above followed.
>>> touch src/bar/baz.txt && git add -A && git commit -m "added a file"
>>> git push origin master
>>> Expected results: changes are pushed to remote.
>>
>> I don't believe partial clone and/or partial fetch will cause
>> any changes for push.
> 
> I suspect that pushes could be rejected if the user 'pretends'
> to modify files or trees outside their area. That would require
> the user to spoof part of a tree they don't have; an upstream /
> remote would immediately know it was a spoof, but locally the
> narrow clone doesn't have enough detail about the 'bad' oid.
> It would be right to reject such attempts!

There is nothing in the partial clone/fetch to support this.
The server doesn't know which parts of the tree the user has
or doesn't have.  There is nothing to prevent the user from
creating a new file anywhere in the tree -- even if they don't
have blobs for anything else in the surrounding directory --
and including it in a push -- since the local "git commit"
would see it as adding a single file with a new SHA and
use the existing SHAs for the neighboring files.
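To make that concrete with plumbing commands (the file names are
made up): a tree entry is only (mode, type, oid, name), so a client
can splice a new file into a directory using just its neighbors'
object IDs, never their contents.

```shell
set -e
dir=$(mktemp -d); cd "$dir"
git init -q repo && cd repo
mkdir -p src/bar
echo existing > src/bar/old.txt
git add -A
git -c user.email=you@example.com -c user.name=you commit -q -m init
# Build a replacement tree for src/bar: keep old.txt by oid only,
# and add new.txt alongside it.
new_blob=$(echo fake-secret | git hash-object -w --stdin)
new_tree=$({ git ls-tree HEAD:src/bar
             printf '100644 blob %s\tnew.txt\n' "$new_blob"; } | git mktree)
git ls-tree "$new_tree"    # both entries, though old.txt was never read
```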


> 
>>
>>>
>>> 3. Clone a repo with a sparse list as a filter.
>>> Preconditions: same as for #1
>>> Actions:
>>> echo "src/bar" > /tmp/blah-sparse-checkout
>>> git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be
>>> the only command that requires a specific option to be passed.
>>> Expected results: same as for #1 plus /tmp/blah-sparse-checkout is
>>> copied into .git/info/sparse-checkout
> 
> I presume clone and fetch are treated equivalently here.

Yes, for the download and config setup, both clone and fetch are
equivalent.  They can pass a filter-spec to the server and set
the local config variables to indicate that a partial clone/fetch
was used.
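For reference, in the design that eventually shipped (hedged: the
config names below are from released git, 2.19 and later, not
necessarily this patch series), a partial clone records the filter
so that later fetches reuse it:

```shell
set -e
dir=$(mktemp -d); cd "$dir"
git init -q server
git -C server config uploadpack.allowfilter true
git -C server -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m init
# file:// forces the wire protocol so the filter is honored.
git clone -q --filter=blob:none "file://$dir/server" client
git -C client config remote.origin.promisor            # prints "true"
git -C client config remote.origin.partialclonefilter  # prints "blob:none"
```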

> 
>>
>> There are 2 independent concepts here: clone and checkout.
>> Currently, there isn't any automatic linkage of the partial clone to
>> the sparse-checkout settings, so you could do something like this:
>>
> I see an implicit link that clearly one cannot checkout
> (inflate/populate) a file/directory that one does not have
> in the object store. But that does not imply the reverse linkage.
> The regular sparse checkout should be available independently of
> the local clone being a narrow one.

Right, I wasn't talking about changing sparse-checkout.  I was
just pointing out that after you complete a partial clone, you
need to take a few steps before you can do a checkout that won't
complain about the missing blobs -- or worse, demand-load all
of them.


> 
>>     git clone --no-checkout --filter=sparse:oid=master:templates/bar URL
>>     git cat-file ... templates/bar >.git/info/sparse-checkout
>>     git config core.sparsecheckout true
>>     git checkout ...
>>
>> I've been focused on the clone/fetch issues and have not looked
>> into the automation to couple them.
>>
> 
> I foresee that large files and certain files need to be filterable
> for fetch-clone, and that might not be (backward) compatible
> with the sparse-checkout.

There are several filter-spec criteria: no-blobs, no-large-blobs,
sparse-checkout-compatible and others may be added later.  Each
may require different post-clone or post-fetch handling.

For example, you may want to configure your client to only omit
large blobs and configure a "missing blob helper" (outside of my
scope) that fetches them from S3 rather than from the origin
remote.  This would not need to use sparse-checkout.
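In the filter syntax that later landed upstream (hedged: git 2.19+,
not necessarily this series), blob-size filtering looks like this;
the S3-backed helper remains hypothetical:

```shell
set -e
dir=$(mktemp -d); cd "$dir"
git init -q server && cd server
git config uploadpack.allowfilter true
echo small > small.txt
head -c 2097152 /dev/zero > big.bin      # 2 MiB blob
git add -A
git -c user.email=you@example.com -c user.name=you commit -q -m files
cd "$dir"
# Omit only large blobs; --no-checkout avoids demand-loading them.
git clone -q --no-checkout --filter=blob:limit=1m "file://$dir/server" client
# Exactly one object (big.bin's blob) is missing locally; a helper
# could fetch it from S3 instead of the origin remote.
git -C client rev-list --objects --missing=print HEAD | grep -c '^?'   # prints 1
```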

I guess what I'm trying to say is that this effort is providing
a mechanism to let the git client request object filtering and
work with missing objects locally.

> 
>>
>>>
>>> 4. Showing log for sparsely cloned repo.
>>> Preconditions: #3 is followed
>>> Actions:
>>> git log
>>> Expected results: recent changes that affect src/bar tree.
>>
>> If I understand your meaning, log would only show changes
>> within the sparse subset of the tree.  This is not on my
>> radar for partial clone/fetch.  It would be a nice feature
>> to have, but I think it would be better to think about it
>> from the point of view of sparse-checkout rather than clone.
>>
> One option may be to make the marker for the tree/blob a first
> class citizen. So the oid (and worktree file) has content
> ".gitNarrowTree <oid>" or ".gitNarrowBlob <oid>" as required (*),
> which is safe, and allows a consistent alter-ego view of the tree
> contents, and hence works for git-log et al.
> 
> (*) I keep flip flopping between a single object marker, and
> distinct object markers for the types. It partly depends on
> whether one can know in advance, locally, what the oid type
> should be, and how it should be embedded in the object store
> - need to re-check the specs.
> 
> I'm tending toward distinct types to cope with the D/F conflict
> in the worktrees - the directory must be created (holds the
> name etc), and the alter-ego content then must be placed in a
> _known_ sub-file ".gitNarrowTree" (without the oid in the file
> name, but included in the content). Presence of a ".gitNarrowTree"
> should be standalone in the directory when that part of the
> work-tree is clean.

I'm not sure I follow this, but it is outside of my scope
for partial clone/fetch, so I'd rather not dive too deep on
this here.  If we really want a version of "git log" that
respects sparse-checkout boundaries, we should start a new
thread.  Thanks.

> 
>>
>>>
>>> 5. Showing diff.
>>> Preconditions: #3 is followed
>>> Actions:
>>> git diff HEAD^ HEAD
>>> Expected results: changes from the most recent commit affecting
>>> src/bar folder are shown.
>>> Notes: this can be tricky operation as filtering must be done to
>>> remove results from unrelated subtrees.
>>
>> I don't have any plan for this and I don't think it fits within
>> the scope of clone/fetch.  I think this too would be a sparse-checkout
>> feature.
>>
> 
> See my note about first class citizens for marker OIDs
> 
>>
>>>
>>> *Note that I intentionally didn't mention use cases that are related
>>> to filtering by blob size as I think we should logically consider them
>>> as a separate, although related, feature.
>>
>> I've grouped blob-size and sparse filter together for the
>> purposes of clone/fetch since the basic mechanisms (filtering,
>> transport, and missing object handling) are the same for both.
>> They do lead to different end-uses, but that is above my level
>> here.
>>
>>
>>>
>>> What do you think about these examples above? Is that something that
>>> more-or-less fits into current development? Are there other important
>>> flows that I've missed?
>>
>> These are all good ideas and it is good to have someone else who
>> wants to use partial+sparse thinking about it and looking for gaps
>> as we try to make a complete end-to-end feature.
>>>
>>> -Vitaly
>>
>> Thanks
>> Jeff
>>
> 
> Philip
> 

Thanks!
Jeff


Thread overview: 26+ messages
2017-11-30  3:16 How hard would it be to implement sparse fetching/pulling? Vitaly Arbuzov
2017-11-30 14:24 ` Jeff Hostetler
2017-11-30 17:01   ` Vitaly Arbuzov
2017-11-30 17:44     ` Vitaly Arbuzov
2017-11-30 20:03       ` Jonathan Nieder
2017-12-01 16:03         ` Jeff Hostetler
2017-12-01 18:16           ` Jonathan Nieder
2017-11-30 23:43       ` Philip Oakley
2017-12-01  1:27         ` Vitaly Arbuzov
2017-12-01  1:51           ` Vitaly Arbuzov
2017-12-01  2:51             ` Jonathan Nieder
2017-12-01  3:37               ` Vitaly Arbuzov
2017-12-02 16:59               ` Philip Oakley
2017-12-01 14:30             ` Jeff Hostetler
2017-12-02 16:30               ` Philip Oakley
2017-12-04 15:36                 ` Jeff Hostetler [this message]
2017-12-05 23:46                   ` Philip Oakley
2017-12-02 15:04           ` Philip Oakley
2017-12-01 17:23         ` Jeff Hostetler
2017-12-01 18:24           ` Jonathan Nieder
2017-12-04 15:53             ` Jeff Hostetler
2017-12-02 18:24           ` Philip Oakley
2017-12-05 19:14             ` Jeff Hostetler
2017-12-05 20:07               ` Jonathan Nieder
2017-12-01 15:28       ` Jeff Hostetler
2017-12-01 14:50     ` Jeff Hostetler
