git@vger.kernel.org mailing list mirror (one of many)
* How hard would it be to implement sparse fetching/pulling?
@ 2017-11-30  3:16 Vitaly Arbuzov
  2017-11-30 14:24 ` Jeff Hostetler
  0 siblings, 1 reply; 26+ messages in thread
From: Vitaly Arbuzov @ 2017-11-30  3:16 UTC (permalink / raw)
  To: git

Hi guys,

I'm looking for ways to improve fetch/pull/clone time for large git
(mono)repositories with unrelated source trees (that span across
multiple services).
I've found the sparse checkout approach appealing and helpful for most
client-side operations (e.g. status, reset, commit, etc.).
The problem is that there is no feature like sparse fetch/pull in git,
which means that ALL objects in unrelated trees are always fetched.
That can take a lot of time for large repositories and imposes some
practical scalability limits on git.
This has forced some large companies like Facebook and Google to move
to Mercurial, as they were unable to improve the client-side experience
with git, while Microsoft has developed GVFS, which seems to be a step
back toward the CVCS world.
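
For reference, the sparse checkout mechanism I mention above is set up
roughly like this (a minimal sketch of the existing feature; it only
limits what gets populated in the working tree, all objects are still
fetched):

    git config core.sparseCheckout true
    echo "src/bar/" > .git/info/sparse-checkout
    git read-tree -mu HEAD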

I want to get feedback (from more experienced git users than I am)
on what it would take to implement sparse fetching/pulling
(downloading only the objects related to the sparse-checkout list).
Are there any issues with missing hashes?
Are there any fundamental problems why it can't be done?
Can we get away with only client-side changes, or would it require
special features on the server side?

If we had such a feature, then all we would need on top is a separate
tool that builds the right "sparse" scope for the workspace based on
the paths that a developer wants to work on.

In a world where more and more companies are moving towards large
monorepos, this improvement would provide a good way of scaling git to
meet that demand.

PS. Please don't advise splitting things up; there are some good
reasons, which you can easily find online, why many companies decide
to keep their code in a monorepo. So let's keep that part out of
scope.

-Vitaly


* Re: How hard would it be to implement sparse fetching/pulling?
  2017-11-30  3:16 How hard would it be to implement sparse fetching/pulling? Vitaly Arbuzov
@ 2017-11-30 14:24 ` Jeff Hostetler
  2017-11-30 17:01   ` Vitaly Arbuzov
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff Hostetler @ 2017-11-30 14:24 UTC (permalink / raw)
  To: Vitaly Arbuzov, git



On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
> Hi guys,
> 
> I'm looking for ways to improve fetch/pull/clone time for large git
> (mono)repositories with unrelated source trees (that span across
> multiple services).
> I've found sparse checkout approach appealing and helpful for most of
> client-side operations (e.g. status, reset, commit, etc.)
> The problem is that there is no feature like sparse fetch/pull in git,
> this means that ALL objects in unrelated trees are always fetched.
> It may take a lot of time for large repositories and results in some
> practical scalability limits for git.
> This forced some large companies like Facebook and Google to move to
> Mercurial as they were unable to improve client-side experience with
> git while Microsoft has developed GVFS, which seems to be a step back
> to CVCS world.
> 
> I want to get a feedback (from more experienced git users than I am)
> on what it would take to implement sparse fetching/pulling.
> (Downloading only objects related to the sparse-checkout list)
> Are there any issues with missing hashes?
> Are there any fundamental problems why it can't be done?
> Can we get away with only client-side changes or would it require
> special features on the server side?
> 
> If we had such a feature then all we would need on top is a separate
> tool that builds the right "sparse" scope for the workspace based on
> paths that developer wants to work on.
> 
> In the world where more and more companies are moving towards large
> monorepos this improvement would provide a good way of scaling git to
> meet this demand.
> 
> PS. Please don't advice to split things up, as there are some good
> reasons why many companies decide to keep their code in the monorepo,
> which you can easily find online. So let's keep that part out the
> scope.
> 
> -Vitaly
> 


This work is in progress now.  A short summary of the current
parts 1, 2, and 3 can be found in [1].

> * jh/object-filtering (2017-11-22) 6 commits
> * jh/fsck-promisors (2017-11-22) 10 commits
> * jh/partial-clone (2017-11-22) 14 commits

[1] https://public-inbox.org/git/xmqq1skh6fyz.fsf@gitster.mtv.corp.google.com/T/

I have a branch that contains V5 of all 3 parts:
https://github.com/jeffhostetler/git/tree/core/pc5_p3

This is a WIP, so there are some rough edges....
I hope to have a V6 out before the weekend with some
bug fixes and cleanup.

Please give it a try and see if it fits your needs.
Currently, there are filter methods to filter all blobs,
all large blobs, and one to match a sparse-checkout
specification.
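
Roughly, that looks like this on the command line (a sketch only; the
exact filter spellings, and the size argument in particular, may change
before V6):

     git clone --filter=blobs:none <url> no-blobs
     git clone --filter=blobs:limit=1m <url> no-big-blobs
     git clone --filter=sparse:oid=master:templates/bar <url> narrow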

Let me know if you have any questions or problems.

Thanks,
Jeff


* Re: How hard would it be to implement sparse fetching/pulling?
  2017-11-30 14:24 ` Jeff Hostetler
@ 2017-11-30 17:01   ` Vitaly Arbuzov
  2017-11-30 17:44     ` Vitaly Arbuzov
  2017-12-01 14:50     ` Jeff Hostetler
  0 siblings, 2 replies; 26+ messages in thread
From: Vitaly Arbuzov @ 2017-11-30 17:01 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git

Hey Jeff,

That's great, I didn't expect that anyone was actively working on this.
I'll check out your branch; meanwhile, do you have any design docs that
describe these changes, or can you define the high-level goals that you
want to achieve?

On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:
>
>
> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
>>
>> Hi guys,
>>
>> I'm looking for ways to improve fetch/pull/clone time for large git
>> (mono)repositories with unrelated source trees (that span across
>> multiple services).
>> I've found sparse checkout approach appealing and helpful for most of
>> client-side operations (e.g. status, reset, commit, etc.)
>> The problem is that there is no feature like sparse fetch/pull in git,
>> this means that ALL objects in unrelated trees are always fetched.
>> It may take a lot of time for large repositories and results in some
>> practical scalability limits for git.
>> This forced some large companies like Facebook and Google to move to
>> Mercurial as they were unable to improve client-side experience with
>> git while Microsoft has developed GVFS, which seems to be a step back
>> to CVCS world.
>>
>> I want to get a feedback (from more experienced git users than I am)
>> on what it would take to implement sparse fetching/pulling.
>> (Downloading only objects related to the sparse-checkout list)
>> Are there any issues with missing hashes?
>> Are there any fundamental problems why it can't be done?
>> Can we get away with only client-side changes or would it require
>> special features on the server side?
>>
>> If we had such a feature then all we would need on top is a separate
>> tool that builds the right "sparse" scope for the workspace based on
>> paths that developer wants to work on.
>>
>> In the world where more and more companies are moving towards large
>> monorepos this improvement would provide a good way of scaling git to
>> meet this demand.
>>
>> PS. Please don't advice to split things up, as there are some good
>> reasons why many companies decide to keep their code in the monorepo,
>> which you can easily find online. So let's keep that part out the
>> scope.
>>
>> -Vitaly
>>
>
>
> This work is in-progress now.  A short summary can be found in [1]
> of the current parts 1, 2, and 3.
>
>> * jh/object-filtering (2017-11-22) 6 commits
>> * jh/fsck-promisors (2017-11-22) 10 commits
>> * jh/partial-clone (2017-11-22) 14 commits
>
>
> [1]
> https://public-inbox.org/git/xmqq1skh6fyz.fsf@gitster.mtv.corp.google.com/T/
>
> I have a branch that contains V5 all 3 parts:
> https://github.com/jeffhostetler/git/tree/core/pc5_p3
>
> This is a WIP, so there are some rough edges....
> I hope to have a V6 out before the weekend with some
> bug fixes and cleanup.
>
> Please give it a try and see if it fits your needs.
> Currently, there are filter methods to filter all blobs,
> all large blobs, and one to match a sparse-checkout
> specification.
>
> Let me know if you have any questions or problems.
>
> Thanks,
> Jeff


* Re: How hard would it be to implement sparse fetching/pulling?
  2017-11-30 17:01   ` Vitaly Arbuzov
@ 2017-11-30 17:44     ` Vitaly Arbuzov
  2017-11-30 20:03       ` Jonathan Nieder
                         ` (2 more replies)
  2017-12-01 14:50     ` Jeff Hostetler
  1 sibling, 3 replies; 26+ messages in thread
From: Vitaly Arbuzov @ 2017-11-30 17:44 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git

Found some details here: https://github.com/jeffhostetler/git/pull/3

Looking at the commits, I see that you've done a lot of work already,
including packing, filtering, fetching, cloning, etc.
What are some areas that aren't complete yet? Do you need any help
with the implementation?


On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov <vit@uber.com> wrote:
> Hey Jeff,
>
> It's great, I didn't expect that anyone is actively working on this.
> I'll check out your branch, meanwhile do you have any design docs that
> describe these changes or can you define high level goals that you
> want to achieve?
>
> On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:
>>
>>
>> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
>>>
>>> Hi guys,
>>>
>>> I'm looking for ways to improve fetch/pull/clone time for large git
>>> (mono)repositories with unrelated source trees (that span across
>>> multiple services).
>>> I've found sparse checkout approach appealing and helpful for most of
>>> client-side operations (e.g. status, reset, commit, etc.)
>>> The problem is that there is no feature like sparse fetch/pull in git,
>>> this means that ALL objects in unrelated trees are always fetched.
>>> It may take a lot of time for large repositories and results in some
>>> practical scalability limits for git.
>>> This forced some large companies like Facebook and Google to move to
>>> Mercurial as they were unable to improve client-side experience with
>>> git while Microsoft has developed GVFS, which seems to be a step back
>>> to CVCS world.
>>>
>>> I want to get a feedback (from more experienced git users than I am)
>>> on what it would take to implement sparse fetching/pulling.
>>> (Downloading only objects related to the sparse-checkout list)
>>> Are there any issues with missing hashes?
>>> Are there any fundamental problems why it can't be done?
>>> Can we get away with only client-side changes or would it require
>>> special features on the server side?
>>>
>>> If we had such a feature then all we would need on top is a separate
>>> tool that builds the right "sparse" scope for the workspace based on
>>> paths that developer wants to work on.
>>>
>>> In the world where more and more companies are moving towards large
>>> monorepos this improvement would provide a good way of scaling git to
>>> meet this demand.
>>>
>>> PS. Please don't advice to split things up, as there are some good
>>> reasons why many companies decide to keep their code in the monorepo,
>>> which you can easily find online. So let's keep that part out the
>>> scope.
>>>
>>> -Vitaly
>>>
>>
>>
>> This work is in-progress now.  A short summary can be found in [1]
>> of the current parts 1, 2, and 3.
>>
>>> * jh/object-filtering (2017-11-22) 6 commits
>>> * jh/fsck-promisors (2017-11-22) 10 commits
>>> * jh/partial-clone (2017-11-22) 14 commits
>>
>>
>> [1]
>> https://public-inbox.org/git/xmqq1skh6fyz.fsf@gitster.mtv.corp.google.com/T/
>>
>> I have a branch that contains V5 all 3 parts:
>> https://github.com/jeffhostetler/git/tree/core/pc5_p3
>>
>> This is a WIP, so there are some rough edges....
>> I hope to have a V6 out before the weekend with some
>> bug fixes and cleanup.
>>
>> Please give it a try and see if it fits your needs.
>> Currently, there are filter methods to filter all blobs,
>> all large blobs, and one to match a sparse-checkout
>> specification.
>>
>> Let me know if you have any questions or problems.
>>
>> Thanks,
>> Jeff


* Re: How hard would it be to implement sparse fetching/pulling?
  2017-11-30 17:44     ` Vitaly Arbuzov
@ 2017-11-30 20:03       ` Jonathan Nieder
  2017-12-01 16:03         ` Jeff Hostetler
  2017-11-30 23:43       ` Philip Oakley
  2017-12-01 15:28       ` Jeff Hostetler
  2 siblings, 1 reply; 26+ messages in thread
From: Jonathan Nieder @ 2017-11-30 20:03 UTC (permalink / raw)
  To: Vitaly Arbuzov
  Cc: Jeff Hostetler, git, Konstantin Khomoutov, git-users,
	jonathantanmy, Christian Couder

Hi Vitaly,

Vitaly Arbuzov wrote:

> Found some details here: https://github.com/jeffhostetler/git/pull/3
>
> Looking at commits I see that you've done a lot of work already,
> including packing, filtering, fetching, cloning etc.
> What are some areas that aren't complete yet? Do you need any help
> with implementation?

That's a great question!  I've filed https://crbug.com/git/2 to track
this project.  Feel free to star it to get updates there, or to add
updates of your own.

As described at https://crbug.com/git/2#c1, currently there are three
patch series for which review would be very welcome.  Building on top
of them is welcome as well.  Please make sure to coordinate with
jeffhost@microsoft.com and jonathantanmy@google.com (e.g. through the
bug tracker or email).

One piece of missing functionality that looks interesting to me: that
series batches fetches of the missing blobs involved in a "git
checkout" command:

 https://public-inbox.org/git/20171121211528.21891-14-git@jeffhostetler.com/

But it doesn't batch fetches of the missing blobs involved in a "git
diff <commit> <commit>" command.  That might be a good place to get
your hands dirty. :)
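
To illustrate the gap (a rough sketch; the filter spelling is taken
from Jeff's series and the repository path is made up):

    git clone --filter=blobs:none file:///path/to/big-repo.git demo
    cd demo
    git checkout master    # missing blobs fetched in one batched request
    git diff HEAD~5 HEAD   # missing blobs not yet fetched in a batch here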

Thanks,
Jonathan


* Re: How hard would it be to implement sparse fetching/pulling?
  2017-11-30 17:44     ` Vitaly Arbuzov
  2017-11-30 20:03       ` Jonathan Nieder
@ 2017-11-30 23:43       ` Philip Oakley
  2017-12-01  1:27         ` Vitaly Arbuzov
  2017-12-01 17:23         ` Jeff Hostetler
  2017-12-01 15:28       ` Jeff Hostetler
  2 siblings, 2 replies; 26+ messages in thread
From: Philip Oakley @ 2017-11-30 23:43 UTC (permalink / raw)
  To: Vitaly Arbuzov, Jeff Hostetler; +Cc: Git List

From: "Vitaly Arbuzov" <vit@uber.com>
> Found some details here: https://github.com/jeffhostetler/git/pull/3
>
> Looking at commits I see that you've done a lot of work already,
> including packing, filtering, fetching, cloning etc.
> What are some areas that aren't complete yet? Do you need any help
> with implementation?
>

comments below..
>
> On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov <vit@uber.com> wrote:
>> Hey Jeff,
>>
>> It's great, I didn't expect that anyone is actively working on this.
>> I'll check out your branch, meanwhile do you have any design docs that
>> describe these changes or can you define high level goals that you
>> want to achieve?
>>
>> On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler <git@jeffhostetler.com>
>> wrote:
>>>
>>>
>>> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> I'm looking for ways to improve fetch/pull/clone time for large git
>>>> (mono)repositories with unrelated source trees (that span across
>>>> multiple services).
>>>> I've found sparse checkout approach appealing and helpful for most of
>>>> client-side operations (e.g. status, reset, commit, etc.)
>>>> The problem is that there is no feature like sparse fetch/pull in git,
>>>> this means that ALL objects in unrelated trees are always fetched.
>>>> It may take a lot of time for large repositories and results in some
>>>> practical scalability limits for git.
>>>> This forced some large companies like Facebook and Google to move to
>>>> Mercurial as they were unable to improve client-side experience with
>>>> git while Microsoft has developed GVFS, which seems to be a step back
>>>> to CVCS world.
>>>>
>>>> I want to get a feedback (from more experienced git users than I am)
>>>> on what it would take to implement sparse fetching/pulling.
>>>> (Downloading only objects related to the sparse-checkout list)
>>>> Are there any issues with missing hashes?
>>>> Are there any fundamental problems why it can't be done?
>>>> Can we get away with only client-side changes or would it require
>>>> special features on the server side?
>>>>

I have, for separate reasons, been _thinking_ about the issue ($dayjob
is in defence, so a similar partition would be useful).

The changes would almost certainly need to be server side (as well as client
side), as it is the server that decides what is sent over the wire in the 
pack files, which would need to be a 'narrow' pack file.

>>>> If we had such a feature then all we would need on top is a separate
>>>> tool that builds the right "sparse" scope for the workspace based on
>>>> paths that developer wants to work on.
>>>>
>>>> In the world where more and more companies are moving towards large
>>>> monorepos this improvement would provide a good way of scaling git to
>>>> meet this demand.

The 'companies' problem is that it tends to force a client-server,
always-on, on-line mentality. I'm also wanting the original DVCS off-line
capability to still be available, with _user_ control, in a generic sense,
of what they have locally available (including files/directories they have
not yet looked at, but expect to have). IIUC Jeff's work is the on-line
view, without the off-line capability.

I'd commented early in the series at [1,2,3].


At its core, my idea was to use the object store to hold markers for the
'not yet fetched' objects (mainly trees and blobs). These would be in a 
known fixed format, and have the same effect (conceptually) as the 
sub-module markers - they _confirm_ the oid, yet say 'not here, try 
elsewhere'.

The comparison with submodules means there is the same chance of
de-synchronisation with triangular and upstream servers, unless managed.

The server side, as noted, will need to be included as it is the one that
decides the pack file.

Options for server management are:

- "I accept narrow packs?" No; yes.

- "I serve narrow packs?" No; yes.

- "Repo completeness checks on receipt": (must be complete) || (allow
narrow to nothing).

For server farms (e.g. GitHub...) the settings could be global, or per
repo. (Note that the completeness requirement and the narrow-receipt
option are not incompatible - the recipient server can reject the pack
from a narrow subordinate as incomplete - see below.)

* Marking of 'missing' objects in the local object store, and on the wire.
The missing objects are replaced by a placeholder object, which uses the
same oid/sha1, but has a short fixed length, with content “GitNarrowObject
<oid>”. The chance that that string would actually have such an oid clash
is the same as for all other object hashes, so it is a *safe*
self-referential device.


* The stored object already includes its length (and inferred type), so we
do know what it stands in for. Thus the local index (index file) should be
able to be recreated from the object store alone (including the ‘promised /
narrow / missing’ file/directory markers).

* The ‘same’ as sub-modules.
The potential for loss of synchronisation with a golden complete repo is
just the same as for sub-modules. (We expected object/commit X here, but
it’s not in the store.) This could happen with a small user group who have
locally narrow clones, who interact with their local narrow server for
‘backup’, and then fail to push further upstream to a server that mandates
completeness. They could create a death by a thousand narrow cuts. Having a
golden upstream config reference (indicating which is the upstream) could
allow checks to ensure that doesn’t happen.

The fsck can be taught the config option of 'allowNarrow'.

The narrowness would be defined in a locally stored '.gitNarrowIgnore' file
(which can include the size constraints being developed elsewhere on the
list).

As a safety measure, the .gitNarrowIgnore could be sent with the pack so
that folk know what they missed, and fsck could check that they are locally
not narrower than some specific project .gitNarrowIgnore spec.
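
To make that concrete, here is a sketch of how the proposal above might
look in use (none of this exists today; the marker format, the fsck
knob, and the .gitNarrowIgnore file are only the names proposed here):

    # a 'missing' tree/blob is kept as a short placeholder object that
    # retains its original oid but has the fixed content:
    #   GitNarrowObject <oid>
    git config fsck.allowNarrow true   # proposed fsck knob, not implemented
    # the local narrowness (paths, and possibly size limits) would live
    # in the '.gitNarrowIgnore' file and travel with narrow packs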

The benefit of this is that the off-line operation capability of Git
continues, which GVFS doesn’t quite do (accidental lock-in to a
client-server model, aka all those other VCS systems).

I believe that it's all doable, and that Jeff H's work already puts much
of it in place, or touches those places.

That said, it has been just _thinking_, without sufficient time to delve
into the code.

Phil

>>>>
>>>> PS. Please don't advice to split things up, as there are some good
>>>> reasons why many companies decide to keep their code in the monorepo,
>>>> which you can easily find online. So let's keep that part out the
>>>> scope.
>>>>
>>>> -Vitaly
>>>>
>>>
>>>
>>> This work is in-progress now.  A short summary can be found in [1]
>>> of the current parts 1, 2, and 3.
>>>
>>>> * jh/object-filtering (2017-11-22) 6 commits
>>>> * jh/fsck-promisors (2017-11-22) 10 commits
>>>> * jh/partial-clone (2017-11-22) 14 commits
>>>
>>>
>>> [1]
>>> https://public-inbox.org/git/xmqq1skh6fyz.fsf@gitster.mtv.corp.google.com/T/
>>>
>>> I have a branch that contains V5 all 3 parts:
>>> https://github.com/jeffhostetler/git/tree/core/pc5_p3
>>>
>>> This is a WIP, so there are some rough edges....
>>> I hope to have a V6 out before the weekend with some
>>> bug fixes and cleanup.
>>>
>>> Please give it a try and see if it fits your needs.
>>> Currently, there are filter methods to filter all blobs,
>>> all large blobs, and one to match a sparse-checkout
>>> specification.
>>>
>>> Let me know if you have any questions or problems.
>>>
>>> Thanks,
>>> Jeff

[1,2]  [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing
blobs")
https://public-inbox.org/git/BC1048A63B034E46A11A01758BC04855@PhilipOakley/
Date: Tue, 25 Jul 2017 21:48:46 +0100
https://public-inbox.org/git/8EE0108BA72B42EA9494B571DDE2005D@PhilipOakley/
Date: Sat, 29 Jul 2017 13:51:16 +0100

[3] [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
https://public-inbox.org/git/244AA0848E9D46F480E7CA407582A162@PhilipOakley/
Date: Sat, 29 Jul 2017 14:26:52 +0100



* Re: How hard would it be to implement sparse fetching/pulling?
  2017-11-30 23:43       ` Philip Oakley
@ 2017-12-01  1:27         ` Vitaly Arbuzov
  2017-12-01  1:51           ` Vitaly Arbuzov
  2017-12-02 15:04           ` Philip Oakley
  2017-12-01 17:23         ` Jeff Hostetler
  1 sibling, 2 replies; 26+ messages in thread
From: Vitaly Arbuzov @ 2017-12-01  1:27 UTC (permalink / raw)
  To: Philip Oakley; +Cc: Jeff Hostetler, Git List

Jonathan, thanks for the references, that is super helpful; I will
follow your suggestions.

Philip, I agree that keeping the original DVCS off-line capability is an
important point. Ideally this feature should work even with remotes
that are located on the local disk.
Which part of Jeff's work do you think wouldn't work offline after
repo initialization is done and a sparse fetch is performed? All the
stuff that I've seen seems to be quite usable without GVFS.
I'm not sure we need to store markers/tombstones on the client; what
problem does that solve?

On Thu, Nov 30, 2017 at 3:43 PM, Philip Oakley <philipoakley@iee.org> wrote:
> From: "Vitaly Arbuzov" <vit@uber.com>
>>
>> Found some details here: https://github.com/jeffhostetler/git/pull/3
>>
>> Looking at commits I see that you've done a lot of work already,
>> including packing, filtering, fetching, cloning etc.
>> What are some areas that aren't complete yet? Do you need any help
>> with implementation?
>>
>
> comments below..
>
>>
>> On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov <vit@uber.com> wrote:
>>>
>>> Hey Jeff,
>>>
>>> It's great, I didn't expect that anyone is actively working on this.
>>> I'll check out your branch, meanwhile do you have any design docs that
>>> describe these changes or can you define high level goals that you
>>> want to achieve?
>>>
>>> On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler <git@jeffhostetler.com>
>>> wrote:
>>>>
>>>>
>>>>
>>>> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
>>>>>
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> I'm looking for ways to improve fetch/pull/clone time for large git
>>>>> (mono)repositories with unrelated source trees (that span across
>>>>> multiple services).
>>>>> I've found sparse checkout approach appealing and helpful for most of
>>>>> client-side operations (e.g. status, reset, commit, etc.)
>>>>> The problem is that there is no feature like sparse fetch/pull in git,
>>>>> this means that ALL objects in unrelated trees are always fetched.
>>>>> It may take a lot of time for large repositories and results in some
>>>>> practical scalability limits for git.
>>>>> This forced some large companies like Facebook and Google to move to
>>>>> Mercurial as they were unable to improve client-side experience with
>>>>> git while Microsoft has developed GVFS, which seems to be a step back
>>>>> to CVCS world.
>>>>>
>>>>> I want to get a feedback (from more experienced git users than I am)
>>>>> on what it would take to implement sparse fetching/pulling.
>>>>> (Downloading only objects related to the sparse-checkout list)
>>>>> Are there any issues with missing hashes?
>>>>> Are there any fundamental problems why it can't be done?
>>>>> Can we get away with only client-side changes or would it require
>>>>> special features on the server side?
>>>>>
>
> I have, for separate reasons been _thinking_ about the issue ($dayjob is in
> defence, so a similar partition would be useful).
>
> The changes would almost certainly need to be server side (as well as client
> side), as it is the server that decides what is sent over the wire in the
> pack files, which would need to be a 'narrow' pack file.
>
>>>>> If we had such a feature then all we would need on top is a separate
>>>>> tool that builds the right "sparse" scope for the workspace based on
>>>>> paths that developer wants to work on.
>>>>>
>>>>> In the world where more and more companies are moving towards large
>>>>> monorepos this improvement would provide a good way of scaling git to
>>>>> meet this demand.
>
>
> The 'companies' problem is that it tends to force a client-server, always-on
> on-line mentality. I'm also wanting the original DVCS off-line capability to
> still be available, with _user_ control, in a generic sense, of what they
> have locally available (including files/directories they have not yet looked
> at, but expect to have. IIUC Jeff's work is that on-line view, without the
> off-line capability.
>
> I'd commented early in the series at [1,2,3].
>
>
> At its core, my idea was to use the object store to hold markers for the
> 'not yet fetched' objects (mainly trees and blobs). These would be in a
> known fixed format, and have the same effect (conceptually) as the
> sub-module markers - they _confirm_ the oid, yet say 'not here, try
> elsewhere'.
>
> The comaprison with submodules mean there is the same chance of
> de-synchronisation with triangular and upstream servers, unless managed.
>
> The server side, as noted, will need to be included as it is the one that
> decides the pack file.
>
> Options for a server management are:
>
> - "I accept narrow packs?" No; yes
>
> - "I serve narrow packs?" No; yes.
>
> - "Repo completeness checks on reciept": (must be complete) || (allow narrow
> to nothing).
>
> For server farms (e.g. Github..) the settings could be global, or by repo.
> (note that the completeness requirement and narrow reciept option are not
> incompatible - the recipient server can reject the pack from a narrow
> subordinate as incomplete - see below)
>
> * Marking of 'missing' objects in the local object store, and on the wire.
> The missing objects are replaced by a place holder object, which used the
> same oid/sha1, but has a short fixed length, with content “GitNarrowObject
> <oid>”. The chance that that string would actually have such an oid clash is
> the same as all other object hashes, so is a *safe* self-referential device.
>
>
> * The stored object already includes length (and inferred type), so we do
> know what it stands in for. Thus the local index (index file) should be able
> to be recreated from the object store alone (including the ‘promised /
> narrow / missing’ files/directory markers)
>
> * the ‘same’ as sub-modules.
> The potential for loss of synchronisation with a golden complete repo is
> just the same as for sub-modules. (We expected object/commit X here, but
> it’s not in the store). This could happen with a small user group who have
> locally narrow clones, who interact with their local narrow server for
> ‘backup’, and then fail to push further upstream to a server that mandates
> completeness. They could create a death by a thousand narrow cuts. Having a
> golden upstream config reference (indicating which is the upstream) could
> allow checks to ensure that doesn’t happen.
>
> The fsck can be taught the config option of 'allowNarrow'.
>
> The narrowness would be defined in a locally stored '.gitNarrowIgnore' file
> (which can include the size constraints being developed elsewhere on the
> list)
>
> As a safety it could be that the .gitNarrowIgnore is sent with the pack so
> that fold know what they missed, and fsck could check that they are locally
> not narrower than some specific project .gitNarrowIgnore spec.
>
> The benefit of this that the off-line operation capability of Git continues,
> which GVFS doesn’t quite do (accidental lock in to a client-server model aka
> all those other VCS systems).
>
> I believe that its all doable, and that Jeff H's work already puts much of
> it in place, or touches those places
>
> That said, it has been just _thinking_, without sufficient time to delve
> into the code.
>
> Phil
>
>>>>>
>>>>> PS. Please don't advice to split things up, as there are some good
>>>>> reasons why many companies decide to keep their code in the monorepo,
>>>>> which you can easily find online. So let's keep that part out the
>>>>> scope.
>>>>>
>>>>> -Vitaly
>>>>>
>>>>
>>>>
>>>> This work is in-progress now.  A short summary can be found in [1]
>>>> of the current parts 1, 2, and 3.
>>>>
>>>>> * jh/object-filtering (2017-11-22) 6 commits
>>>>> * jh/fsck-promisors (2017-11-22) 10 commits
>>>>> * jh/partial-clone (2017-11-22) 14 commits
>>>>
>>>>
>>>>
>>>> [1]
>>>>
>>>> https://public-inbox.org/git/xmqq1skh6fyz.fsf@gitster.mtv.corp.google.com/T/
>>>>
>>>> I have a branch that contains V5 all 3 parts:
>>>> https://github.com/jeffhostetler/git/tree/core/pc5_p3
>>>>
>>>> This is a WIP, so there are some rough edges....
>>>> I hope to have a V6 out before the weekend with some
>>>> bug fixes and cleanup.
>>>>
>>>> Please give it a try and see if it fits your needs.
>>>> Currently, there are filter methods to filter all blobs,
>>>> all large blobs, and one to match a sparse-checkout
>>>> specification.
>>>>
>>>> Let me know if you have any questions or problems.
>>>>
>>>> Thanks,
>>>> Jeff
>
>
> [1,2]  [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing
> blobs")
> https://public-inbox.org/git/BC1048A63B034E46A11A01758BC04855@PhilipOakley/
> Date: Tue, 25 Jul 2017 21:48:46 +0100
> https://public-inbox.org/git/8EE0108BA72B42EA9494B571DDE2005D@PhilipOakley/
> Date: Sat, 29 Jul 2017 13:51:16 +0100
>
> [3] [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
> https://public-inbox.org/git/244AA0848E9D46F480E7CA407582A162@PhilipOakley/
> Date: Sat, 29 Jul 2017 14:26:52 +0100
>


* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-01  1:27         ` Vitaly Arbuzov
@ 2017-12-01  1:51           ` Vitaly Arbuzov
  2017-12-01  2:51             ` Jonathan Nieder
  2017-12-01 14:30             ` Jeff Hostetler
  2017-12-02 15:04           ` Philip Oakley
  1 sibling, 2 replies; 26+ messages in thread
From: Vitaly Arbuzov @ 2017-12-01  1:51 UTC (permalink / raw)
  To: Philip Oakley; +Cc: Jeff Hostetler, Git List

I think it would be great if we agree at a high level on the desired
user experience, so let me put a few possible use cases here.

1. Init and fetch into a new repo with a sparse list.
Preconditions: origin blah exists and has a lot of folders inside of
src including "bar".
Actions:
git init foo && cd foo
git config core.sparseAll true # New flag to activate all sparse
operations by default so you don't need to pass options to each
command.
echo "src/bar" > .git/info/sparse-checkout
git remote add origin blah
git pull origin master
Expected results: foo contains the src/bar folder and nothing else;
objects that are unrelated to this tree are not fetched.
Notes: this should work the same when fetch/merge/checkout operations
are used in the right order.

2. Add a file and push changes.
Preconditions: all steps above followed.
touch src/bar/baz.txt && git add -A && git commit -m "added a file"
git push origin master
Expected results: changes are pushed to remote.

3. Clone a repo with a sparse list as a filter.
Preconditions: same as for #1
Actions:
echo "src/bar" > /tmp/blah-sparse-checkout
git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be
the only command that requires a specific option to be passed.
Expected results: same as for #1 plus /tmp/blah-sparse-checkout is
copied into .git/info/sparse-checkout

4. Showing log for sparsely cloned repo.
Preconditions: #3 is followed
Actions:
git log
Expected results: recent changes that affect src/bar tree.

5. Showing diff.
Preconditions: #3 is followed
Actions:
git diff HEAD^ HEAD
Expected results: changes from the most recent commit affecting
src/bar folder are shown.
Notes: this can be a tricky operation, as filtering must be done to
remove results from unrelated subtrees.

*Note that I intentionally didn't mention use cases related to
filtering by blob size, as I think we should logically consider them
a separate, although related, feature.

What do you think about the examples above? Is that something that
more or less fits into the current development? Are there other
important flows that I've missed?

-Vitaly

On Thu, Nov 30, 2017 at 5:27 PM, Vitaly Arbuzov <vit@uber.com> wrote:
> Jonathan, thanks for references, that is super helpful, I will follow
> your suggestions.
>
> Philip, I agree that keeping original DVCS off-line capability is an
> important point. Ideally this feature should work even with remotes
> that are located on the local disk.
> Which part of Jeff's work do you think wouldn't work offline after
> repo initialization is done and sparse fetch is performed? All the
> stuff that I've seen seems to be quite usable without GVFS.
> I'm not sure if we need to store markers/tombstones on the client,
> what problem does it solve?
>
> On Thu, Nov 30, 2017 at 3:43 PM, Philip Oakley <philipoakley@iee.org> wrote:
>> From: "Vitaly Arbuzov" <vit@uber.com>
>>>
>>> Found some details here: https://github.com/jeffhostetler/git/pull/3
>>>
>>> Looking at commits I see that you've done a lot of work already,
>>> including packing, filtering, fetching, cloning etc.
>>> What are some areas that aren't complete yet? Do you need any help
>>> with implementation?
>>>
>>
>> comments below..
>>
>>>
>>> On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov <vit@uber.com> wrote:
>>>>
>>>> Hey Jeff,
>>>>
>>>> It's great, I didn't expect that anyone is actively working on this.
>>>> I'll check out your branch, meanwhile do you have any design docs that
>>>> describe these changes or can you define high level goals that you
>>>> want to achieve?
>>>>
>>>> On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler <git@jeffhostetler.com>
>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
>>>>>>
>>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I'm looking for ways to improve fetch/pull/clone time for large git
>>>>>> (mono)repositories with unrelated source trees (that span across
>>>>>> multiple services).
>>>>>> I've found sparse checkout approach appealing and helpful for most of
>>>>>> client-side operations (e.g. status, reset, commit, etc.)
>>>>>> The problem is that there is no feature like sparse fetch/pull in git,
>>>>>> this means that ALL objects in unrelated trees are always fetched.
>>>>>> It may take a lot of time for large repositories and results in some
>>>>>> practical scalability limits for git.
>>>>>> This forced some large companies like Facebook and Google to move to
>>>>>> Mercurial as they were unable to improve client-side experience with
>>>>>> git while Microsoft has developed GVFS, which seems to be a step back
>>>>>> to CVCS world.
>>>>>>
>>>>>> I want to get a feedback (from more experienced git users than I am)
>>>>>> on what it would take to implement sparse fetching/pulling.
>>>>>> (Downloading only objects related to the sparse-checkout list)
>>>>>> Are there any issues with missing hashes?
>>>>>> Are there any fundamental problems why it can't be done?
>>>>>> Can we get away with only client-side changes or would it require
>>>>>> special features on the server side?
>>>>>>
>>
>> I have, for separate reasons been _thinking_ about the issue ($dayjob is in
>> defence, so a similar partition would be useful).
>>
>> The changes would almost certainly need to be server side (as well as client
>> side), as it is the server that decides what is sent over the wire in the
>> pack files, which would need to be a 'narrow' pack file.
>>
>>>>>> If we had such a feature then all we would need on top is a separate
>>>>>> tool that builds the right "sparse" scope for the workspace based on
>>>>>> paths that developer wants to work on.
>>>>>>
>>>>>> In the world where more and more companies are moving towards large
>>>>>> monorepos this improvement would provide a good way of scaling git to
>>>>>> meet this demand.
>>
>>
>> The 'companies' problem is that it tends to force a client-server, always-on
>> on-line mentality. I'm also wanting the original DVCS off-line capability to
>> still be available, with _user_ control, in a generic sense, of what they
>> have locally available (including files/directories they have not yet looked
>> at, but expect to have. IIUC Jeff's work is that on-line view, without the
>> off-line capability.
>>
>> I'd commented early in the series at [1,2,3].
>>
>>
>> At its core, my idea was to use the object store to hold markers for the
>> 'not yet fetched' objects (mainly trees and blobs). These would be in a
>> known fixed format, and have the same effect (conceptually) as the
>> sub-module markers - they _confirm_ the oid, yet say 'not here, try
>> elsewhere'.
>>
>> The comaprison with submodules mean there is the same chance of
>> de-synchronisation with triangular and upstream servers, unless managed.
>>
>> The server side, as noted, will need to be included as it is the one that
>> decides the pack file.
>>
>> Options for a server management are:
>>
>> - "I accept narrow packs?" No; yes
>>
>> - "I serve narrow packs?" No; yes.
>>
>> - "Repo completeness checks on reciept": (must be complete) || (allow narrow
>> to nothing).
>>
>> For server farms (e.g. Github..) the settings could be global, or by repo.
>> (note that the completeness requirement and narrow reciept option are not
>> incompatible - the recipient server can reject the pack from a narrow
>> subordinate as incomplete - see below)
>>
>> * Marking of 'missing' objects in the local object store, and on the wire.
>> The missing objects are replaced by a place holder object, which used the
>> same oid/sha1, but has a short fixed length, with content “GitNarrowObject
>> <oid>”. The chance that that string would actually have such an oid clash is
>> the same as all other object hashes, so is a *safe* self-referential device.
>>
>>
>> * The stored object already includes length (and inferred type), so we do
>> know what it stands in for. Thus the local index (index file) should be able
>> to be recreated from the object store alone (including the ‘promised /
>> narrow / missing’ files/directory markers)
>>
>> * the ‘same’ as sub-modules.
>> The potential for loss of synchronisation with a golden complete repo is
>> just the same as for sub-modules. (We expected object/commit X here, but
>> it’s not in the store). This could happen with a small user group who have
>> locally narrow clones, who interact with their local narrow server for
>> ‘backup’, and then fail to push further upstream to a server that mandates
>> completeness. They could create a death by a thousand narrow cuts. Having a
>> golden upstream config reference (indicating which is the upstream) could
>> allow checks to ensure that doesn’t happen.
>>
>> The fsck can be taught the config option of 'allowNarrow'.
>>
>> The narrowness would be defined in a locally stored '.gitNarrowIgnore' file
>> (which can include the size constraints being developed elsewhere on the
>> list)
>>
>> As a safety it could be that the .gitNarrowIgnore is sent with the pack so
>> that fold know what they missed, and fsck could check that they are locally
>> not narrower than some specific project .gitNarrowIgnore spec.
>>
>> The benefit of this that the off-line operation capability of Git continues,
>> which GVFS doesn’t quite do (accidental lock in to a client-server model aka
>> all those other VCS systems).
>>
>> I believe that its all doable, and that Jeff H's work already puts much of
>> it in place, or touches those places
>>
>> That said, it has been just _thinking_, without sufficient time to delve
>> into the code.
>>
>> Phil
>>
>>>>>>
>>>>>> PS. Please don't advice to split things up, as there are some good
>>>>>> reasons why many companies decide to keep their code in the monorepo,
>>>>>> which you can easily find online. So let's keep that part out the
>>>>>> scope.
>>>>>>
>>>>>> -Vitaly
>>>>>>
>>>>>
>>>>>
>>>>> This work is in-progress now.  A short summary can be found in [1]
>>>>> of the current parts 1, 2, and 3.
>>>>>
>>>>>> * jh/object-filtering (2017-11-22) 6 commits
>>>>>> * jh/fsck-promisors (2017-11-22) 10 commits
>>>>>> * jh/partial-clone (2017-11-22) 14 commits
>>>>>
>>>>>
>>>>>
>>>>> [1]
>>>>>
>>>>> https://public-inbox.org/git/xmqq1skh6fyz.fsf@gitster.mtv.corp.google.com/T/
>>>>>
>>>>> I have a branch that contains V5 all 3 parts:
>>>>> https://github.com/jeffhostetler/git/tree/core/pc5_p3
>>>>>
>>>>> This is a WIP, so there are some rough edges....
>>>>> I hope to have a V6 out before the weekend with some
>>>>> bug fixes and cleanup.
>>>>>
>>>>> Please give it a try and see if it fits your needs.
>>>>> Currently, there are filter methods to filter all blobs,
>>>>> all large blobs, and one to match a sparse-checkout
>>>>> specification.
>>>>>
>>>>> Let me know if you have any questions or problems.
>>>>>
>>>>> Thanks,
>>>>> Jeff
>>
>>
>> [1,2]  [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing
>> blobs")
>> https://public-inbox.org/git/BC1048A63B034E46A11A01758BC04855@PhilipOakley/
>> Date: Tue, 25 Jul 2017 21:48:46 +0100
>> https://public-inbox.org/git/8EE0108BA72B42EA9494B571DDE2005D@PhilipOakley/
>> Date: Sat, 29 Jul 2017 13:51:16 +0100
>>
>> [3] [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
>> https://public-inbox.org/git/244AA0848E9D46F480E7CA407582A162@PhilipOakley/
>> Date: Sat, 29 Jul 2017 14:26:52 +0100
>>


* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-01  1:51           ` Vitaly Arbuzov
@ 2017-12-01  2:51             ` Jonathan Nieder
  2017-12-01  3:37               ` Vitaly Arbuzov
  2017-12-02 16:59               ` Philip Oakley
  2017-12-01 14:30             ` Jeff Hostetler
  1 sibling, 2 replies; 26+ messages in thread
From: Jonathan Nieder @ 2017-12-01  2:51 UTC (permalink / raw)
  To: Vitaly Arbuzov; +Cc: Philip Oakley, Jeff Hostetler, Git List

Hi Vitaly,

Vitaly Arbuzov wrote:

> I think it would be great if we high level agree on desired user
> experience, so let me put a few possible use cases here.

I think one thing this thread is pointing to is a lack of overview
documentation about how the 'partial clone' series currently works.
The basic components are:

 1. extending the git protocol to (1) allow fetching only a subset of
    the objects reachable from the commits being fetched and (2) later,
    going back and fetching the objects that were left out.

    We've also discussed some other protocol changes, e.g. to allow
    obtaining the sizes of un-fetched objects without fetching the
    objects themselves.

 2. extending git's on-disk format to allow having some objects not be
    present but only be "promised" to be obtainable from a remote
    repository.  When running a command that requires those objects,
    the user can choose to have it either (a) error out ("airplane
    mode") or (b) fetch the required objects.

    It is still possible to work fully locally in such a repo, make
    changes, get useful results out of "git fsck", etc.  It is kind of
    similar to the existing "shallow clone" feature, except that there
    is a more straightforward way to obtain objects that are outside
    the "shallow" clone when needed on demand.

 3. improving everyday commands to require fewer objects.  For
    example, if I run "git log -p", then I want to see the history of
    most files but I don't necessarily want to download large binary
    files just to print 'Binary files differ' for them.

    And by the same token, we might want to have a mode for commands
    like "git log -p" to default to restricting to a particular
    directory, instead of downloading files outside that directory.

    There are some fundamental changes to make in this category ---
    e.g. modifying the index format to not require entries for files
    outside the sparse checkout, to avoid having to download the
    trees for them.

The overall goal is to make git scale better.

The existing patches do (1) and (2), though it is possible to do more
in those categories. :)  We have plans to work on (3) as well.

These are overall changes that happen at a fairly low level in git.
They mostly don't require changes command-by-command.

Thanks,
Jonathan


* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-01  2:51             ` Jonathan Nieder
@ 2017-12-01  3:37               ` Vitaly Arbuzov
  2017-12-02 16:59               ` Philip Oakley
  1 sibling, 0 replies; 26+ messages in thread
From: Vitaly Arbuzov @ 2017-12-01  3:37 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Philip Oakley, Jeff Hostetler, Git List

Makes sense; I think this perfectly aligns with our needs too.
Let me dive deeper into the patches and previous discussions that
you've kindly shared above, so I can better understand the details.

I'm very excited about what you guys have already done; it's a big deal
for the community!


On Thu, Nov 30, 2017 at 6:51 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Hi Vitaly,
>
> Vitaly Arbuzov wrote:
>
>> I think it would be great if we high level agree on desired user
>> experience, so let me put a few possible use cases here.
>
> I think one thing this thread is pointing to is a lack of overview
> documentation about how the 'partial clone' series currently works.
> The basic components are:
>
>  1. extending git protocol to (1) allow fetching only a subset of the
>     objects reachable from the commits being fetched and (2) later,
>     going back and fetching the objects that were left out.
>
>     We've also discussed some other protocol changes, e.g. to allow
>     obtaining the sizes of un-fetched objects without fetching the
>     objects themselves
>
>  2. extending git's on-disk format to allow having some objects not be
>     present but only be "promised" to be obtainable from a remote
>     repository.  When running a command that requires those objects,
>     the user can choose to have it either (a) error out ("airplane
>     mode") or (b) fetch the required objects.
>
>     It is still possible to work fully locally in such a repo, make
>     changes, get useful results out of "git fsck", etc.  It is kind of
>     similar to the existing "shallow clone" feature, except that there
>     is a more straightforward way to obtain objects that are outside
>     the "shallow" clone when needed on demand.
>
>  3. improving everyday commands to require fewer objects.  For
>     example, if I run "git log -p", then I way to see the history of
>     most files but I don't necessarily want to download large binary
>     files just to print 'Binary files differ' for them.
>
>     And by the same token, we might want to have a mode for commands
>     like "git log -p" to default to restricting to a particular
>     directory, instead of downloading files outside that directory.
>
>     There are some fundamental changes to make in this category ---
>     e.g. modifying the index format to not require entries for files
>     outside the sparse checkout, to avoid having to download the
>     trees for them.
>
> The overall goal is to make git scale better.
>
> The existing patches do (1) and (2), though it is possible to do more
> in those categories. :)  We have plans to work on (3) as well.
>
> These are overall changes that happen at a fairly low level in git.
> They mostly don't require changes command-by-command.
>
> Thanks,
> Jonathan


* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-01  1:51           ` Vitaly Arbuzov
  2017-12-01  2:51             ` Jonathan Nieder
@ 2017-12-01 14:30             ` Jeff Hostetler
  2017-12-02 16:30               ` Philip Oakley
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff Hostetler @ 2017-12-01 14:30 UTC (permalink / raw)
  To: Vitaly Arbuzov, Philip Oakley; +Cc: Git List



On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote:
> I think it would be great if we high level agree on desired user
> experience, so let me put a few possible use cases here.
> 
> 1. Init and fetch into a new repo with a sparse list.
> Preconditions: origin blah exists and has a lot of folders inside of
> src including "bar".
> Actions:
> git init foo && cd foo
> git config core.sparseAll true # New flag to activate all sparse
> operations by default so you don't need to pass options to each
> command.
> echo "src/bar" > .git/info/sparse-checkout
> git remote add origin blah
> git pull origin master
> Expected results: foo contains src/bar folder and nothing else,
> objects that are unrelated to this tree are not fetched.
> Notes: This should work same when fetch/merge/checkout operations are
> used in the right order.

With the current patches (parts 1, 2, 3) we can pass a blob-ish
to the server during a clone that refers to a sparse-checkout
specification.  There's a bit of a chicken-and-egg problem getting
things set up.  So if we assume your team would create a series
of "known enlistments" under version control, then you could
just reference one by <branch>:<path> during your clone.  The
server can look up that blob and just use it.

     git clone --filter=sparse:oid=master:templates/bar URL

And then the server will filter-out the unwanted blobs during
the clone.  (The current version only filters blobs; you still
get full commits and trees.  That will be revisited later.)
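
So, roughly, a team could publish such an enlistment spec and reference
it like this (a sketch; the path and contents are just for illustration,
using sparse-checkout pattern syntax):

     echo "src/bar/" > templates/bar
     git add templates/bar
     git commit -m "add 'bar' enlistment spec"
     git push origin master

     # clients then reference it by <branch>:<path> at clone time:
     git clone --filter=sparse:oid=master:templates/bar URL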

On the client side, the partial clone installs local config
settings into the repo so that subsequent fetches default to
the same filter criteria as used in the clone.


I don't currently have provision to send a full sparse-checkout
specification to the server during a clone or fetch.  That
seemed like too much to try to squeeze into the protocols.
We can revisit this later if there is interest, but it wasn't
critical for the initial phase.


> 
> 2. Add a file and push changes.
> Preconditions: all steps above followed.
> touch src/bar/baz.txt && git add -A && git commit -m "added a file"
> git push origin master
> Expected results: changes are pushed to remote.

I don't believe partial clone and/or partial fetch will cause
any changes for push.


> 
> 3. Clone a repo with a sparse list as a filter.
> Preconditions: same as for #1
> Actions:
> echo "src/bar" > /tmp/blah-sparse-checkout
> git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be
> the only command that would requires specific option key being passed.
> Expected results: same as for #1 plus /tmp/blah-sparse-checkout is
> copied into .git/info/sparse-checkout

There are 2 independent concepts here: clone and checkout.
Currently, there isn't any automatic linkage of the partial clone to
the sparse-checkout settings, so you could do something like this:

     git clone --no-checkout --filter=sparse:oid=master:templates/bar URL
     git cat-file ... templates/bar >.git/info/sparse-checkout
     git config core.sparsecheckout true
     git checkout ...

I've been focused on the clone/fetch issues and have not looked
into the automation to couple them.


> 
> 4. Showing log for sparsely cloned repo.
> Preconditions: #3 is followed
> Actions:
> git log
> Expected results: recent changes that affect src/bar tree.

If I understand your meaning, log would only show changes
within the sparse subset of the tree.  This is not on my
radar for partial clone/fetch.  It would be a nice feature
to have, but I think it would be better to think about it
from the point of view of sparse-checkout rather than clone.


> 
> 5. Showing diff.
> Preconditions: #3 is followed
> Actions:
> git diff HEAD^ HEAD
> Expected results: changes from the most recent commit affecting
> src/bar folder are shown.
> Notes: this can be tricky operation as filtering must be done to
> remove results from unrelated subtrees.

I don't have any plan for this and I don't think it fits within
the scope of clone/fetch.  I think this too would be a sparse-checkout
feature.


> 
> *Note that I intentionally didn't mention use cases that are related
> to filtering by blob size as I think we should logically consider them
> as a separate, although related, feature.

I've grouped blob-size and sparse filter together for the
purposes of clone/fetch since the basic mechanisms (filtering,
transport, and missing object handling) are the same for both.
They do lead to different end-uses, but that is above my level
here.


> 
> What do you think about these examples above? Is that something that
> more-or-less fits into current development? Are there other important
> flows that I've missed?

These are all good ideas and it is good to have someone else who
wants to use partial+sparse thinking about it and looking for gaps
as we try to make a complete end-to-end feature.
> 
> -Vitaly

Thanks
Jeff



* Re: How hard would it be to implement sparse fetching/pulling?
  2017-11-30 17:01   ` Vitaly Arbuzov
  2017-11-30 17:44     ` Vitaly Arbuzov
@ 2017-12-01 14:50     ` Jeff Hostetler
  1 sibling, 0 replies; 26+ messages in thread
From: Jeff Hostetler @ 2017-12-01 14:50 UTC (permalink / raw)
  To: Vitaly Arbuzov; +Cc: git



On 11/30/2017 12:01 PM, Vitaly Arbuzov wrote:
> Hey Jeff,
> 
> It's great, I didn't expect that anyone is actively working on this.
> I'll check out your branch, meanwhile do you have any design docs that
> describe these changes or can you define high level goals that you
> want to achieve?
> 

There are no summary docs in a traditional sense.
The patch series does have updated docs which show
the changes to some of the commands and protocols.
I would start there.

Jeff



* Re: How hard would it be to implement sparse fetching/pulling?
  2017-11-30 17:44     ` Vitaly Arbuzov
  2017-11-30 20:03       ` Jonathan Nieder
  2017-11-30 23:43       ` Philip Oakley
@ 2017-12-01 15:28       ` Jeff Hostetler
  2 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler @ 2017-12-01 15:28 UTC (permalink / raw)
  To: Vitaly Arbuzov; +Cc: git



On 11/30/2017 12:44 PM, Vitaly Arbuzov wrote:
> Found some details here: https://github.com/jeffhostetler/git/pull/3
> 
> Looking at commits I see that you've done a lot of work already,
> including packing, filtering, fetching, cloning etc.
> What are some areas that aren't complete yet? Do you need any help
> with implementation?
> 

Sure.  Extra hands are always welcome.

Jonathan Tan and I have been working on this together.
Our V5 is on the mailing list now.  We have privately exchanged
some commits that I hope to push up as a V6 today or Monday.

As for how to help, I'll have to think about that a bit.
Without knowing your experience level in the code or your
availability, it is hard to pick something specific right
now.

As a first step, build my core/pc5_p3 branch and try using
partial clone/fetch between local repos.  You can look at
the tests we added (t0410, t5317, t5616, t6112) to see sample
setups using a local pair of repos.  Then try actually using
the partial clone repo for actual work (dogfooding) and see
how it falls short of your expectations.

You might try duplicating the above tests to use a
local "git daemon" serving the remote and do partial clones
using localhost URLs rather than file:// URLs.  That would
exercise the transport differently.
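
For example, a rough sketch of such a localhost experiment (the filter
spelling is the one used in this series, the paths are illustrative,
and the server-side config that permits partial clones -- see the
series docs for its name -- must be enabled in the bare repo first):

     # make a bare copy of any existing repo to act as the "server"
     git clone --bare /path/to/some/repo /tmp/origin.git
     git daemon --export-all --base-path=/tmp --reuseaddr --detach

     # partial clone over git:// rather than file://
     git clone --filter=blobs:none git://localhost/origin.git /tmp/partial

     # commit/tree walks are fully local; asking for blob content should
     # go through the dynamic-fetch path back to the daemon
     git -C /tmp/partial log --oneline -5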

The t5616 test has the start of some end-to-end tests that
try combine multiple steps -- such as do a partial clone
with no blobs and then run blame on a file.  You could extend
that with more tests that test odd combinations of commands
and confirm that we can handle missing blobs in different
scenarios.

Since you've expressed an interest in sparse-checkout and
having a complete end-to-end experience, you might also
experiment with adapting the above tests to use the sparse
filter (--filter=sparse:oid=<blob-ish>) instead of blobs:none
or blobs:limit.  See where that takes you and add tests
as you see fit.  The goal being to get tests in place that
match the usage you want to see (even if they fail) and
see what that looks like.
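
As a sketch of what one such adaptation might look like (syntax per
this series; the spec path is purely illustrative):

     # in the test's "server" repo: commit a sparse-checkout spec
     echo "src/bar" >sparse-spec
     git add sparse-spec && git commit -m "add sparse spec"

     # clone with blobs filtered against that committed spec, then reuse
     # the same blob to drive sparse-checkout on the client side
     git clone --no-checkout --filter=sparse:oid=master:sparse-spec \
         "file://$(pwd)" narrow
     git -C narrow config core.sparsecheckout true
     git -C narrow cat-file blob master:sparse-spec \
         >narrow/.git/info/sparse-checkout   # may itself dynamic-fetch
     git -C narrow checkout master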

I know it is not as glamorous as adding new functionality,
but it would help get you up-to-speed on the code and we
do need additional tests.

Thanks
Jeff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: How hard would it be to implement sparse fetching/pulling?
  2017-11-30 20:03       ` Jonathan Nieder
@ 2017-12-01 16:03         ` Jeff Hostetler
  2017-12-01 18:16           ` Jonathan Nieder
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff Hostetler @ 2017-12-01 16:03 UTC (permalink / raw)
  To: Jonathan Nieder, Vitaly Arbuzov
  Cc: git, Konstantin Khomoutov, git-users, jonathantanmy,
	Christian Couder



On 11/30/2017 3:03 PM, Jonathan Nieder wrote:
> Hi Vitaly,
> 
> Vitaly Arbuzov wrote:
> 
>> Found some details here: https://github.com/jeffhostetler/git/pull/3
>>
>> Looking at commits I see that you've done a lot of work already,
>> including packing, filtering, fetching, cloning etc.
>> What are some areas that aren't complete yet? Do you need any help
>> with implementation?
> 
> That's a great question!  I've filed https://crbug.com/git/2 to track
> this project.  Feel free to star it to get updates there, or to add
> updates of your own.

Thanks!

> 
> As described at https://crbug.com/git/2#c1, currently there are three
> patch series for which review would be very welcome.  Building on top
> of them is welcome as well.  Please make sure to coordinate with
> jeffhost@microsoft.com and jonathantanmy@google.com (e.g. through the
> bug tracker or email).
> 
> One piece of missing functionality that looks interesting to me: that
> series batches fetches of the missing blobs involved in a "git
> checkout" command:
> 
>   https://public-inbox.org/git/20171121211528.21891-14-git@jeffhostetler.com/
> 
> But it doesn't batch fetches of the missing blobs involved in a "git
> diff <commit> <commit>" command.  That might be a good place to get
> your hands dirty. :)

Jonathan Tan added code in unpack-trees to bulk fetch missing blobs
before a checkout.  This is limited to the missing blobs needed for
the target commit.  We need this to make checkout seamless, but it
does mean that checkout may need online access.

I've also talked about a pre-fetch capability to bulk fetch missing
blobs in advance of some operation.  You could speed up the above
diff command or back-fill all the blobs I might need before going
offline for a while.

You can use the options that were added to rev-list to help with this.
For example:
     git rev-list --objects [--filter=<fs>] --missing=print <commit1>
     git rev-list --objects [--filter=<fs>] --missing=print <c1>..<c2>
And then pipe that into a "git fetch-pack --stdin".

You might experiment with this.
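
For instance, a minimal sketch of such a pre-fetch, assuming that
--missing=print marks the absent objects with a leading '?' and that
the promisor remote will serve explicitly requested oids to fetch-pack:

     # back-fill whatever is missing for a range, e.g. before going offline
     git rev-list --objects --missing=print <c1>..<c2> |
         sed -n 's/^?//p' |
         git fetch-pack --stdin <url-of-promisor-remote>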


> 
> Thanks,
> Jonathan
> 

Thanks,
Jeff


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: How hard would it be to implement sparse fetching/pulling?
  2017-11-30 23:43       ` Philip Oakley
  2017-12-01  1:27         ` Vitaly Arbuzov
@ 2017-12-01 17:23         ` Jeff Hostetler
  2017-12-01 18:24           ` Jonathan Nieder
  2017-12-02 18:24           ` Philip Oakley
  1 sibling, 2 replies; 26+ messages in thread
From: Jeff Hostetler @ 2017-12-01 17:23 UTC (permalink / raw)
  To: Philip Oakley, Vitaly Arbuzov; +Cc: Git List



On 11/30/2017 6:43 PM, Philip Oakley wrote:
> From: "Vitaly Arbuzov" <vit@uber.com>
[...]
> comments below..
>>
>> On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov <vit@uber.com> wrote:
>>> Hey Jeff,
>>>
>>> It's great, I didn't expect that anyone is actively working on this.
>>> I'll check out your branch, meanwhile do you have any design docs that
>>> describe these changes or can you define high level goals that you
>>> want to achieve?
>>>
>>> On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler <git@jeffhostetler.com>
>>> wrote:
>>>>
>>>>
>>>> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
[...]
>>>>>
> 
> I have, for separate reasons been _thinking_ about the issue ($dayjob is in
> defence, so a similar partition would be useful).
> 
> The changes would almost certainly need to be server side (as well as 
> client
> side), as it is the server that decides what is sent over the wire in 
> the pack files, which would need to be a 'narrow' pack file.

Yes, there will need to be both client and server changes.
In the current 3 part patch series, the client sends a "filter_spec"
to the server as part of the fetch-pack/upload-pack protocol.
If the server chooses to honor it, upload-pack passes the filter_spec
to pack-objects to build an "incomplete" packfile omitting various
objects (currently blobs).  Proprietary servers will need similar
changes to support this feature.

Discussing this feature in the context of the defense industry
makes me a little nervous.  (I used to be in that area.)
What we have in the code so far may be a nice start, but
probably doesn't have the assurances that you would need
for actual deployment.  But it's a start....

> 
>>>>> If we had such a feature then all we would need on top is a separate
>>>>> tool that builds the right "sparse" scope for the workspace based on
>>>>> paths that developer wants to work on.
>>>>>
>>>>> In the world where more and more companies are moving towards large
>>>>> monorepos this improvement would provide a good way of scaling git to
>>>>> meet this demand.
> 
> The 'companies' problem is that it tends to force a client-server, 
> always-on
> on-line mentality. I'm also wanting the original DVCS off-line 
> capability to
> still be available, with _user_ control, in a generic sense, of what they
> have locally available (including files/directories they have not yet 
> looked
> at, but expect to have. IIUC Jeff's work is that on-line view, without the
> off-line capability.
> 
> I'd commented early in the series at [1,2,3].

Yes, this does tend to lead towards an always-online mentality.
However, there are 2 parts:
[a] dynamic object fetching for missing objects, such as during a
     random command like diff or blame or merge.  We need this
     regardless of usage -- because we can't always predict (or
     dry-run) every command the user might run in advance.
[b] batch fetch mode, such as using partial-fetch to match your
     sparse-checkout so that you always have the blobs of interest
     to you.  And assuming you don't wander outside of this subset
     of the tree, you should be able to work offline as usual.
If you can work within the confines of [b], you wouldn't need to
always be online.
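
As a concrete sketch of [b], assuming the clone/sparse-checkout
coupling shown earlier in the thread (a sparse:oid partial clone plus
a matching .git/info/sparse-checkout of src/bar), the day-to-day
commands inside that subset should not touch the network at all:

     git log --oneline -- src/bar     # commits and trees are all local
     git diff HEAD~1 -- src/bar       # blobs inside the subset are local
     git commit -am "change inside src/bar"

     # only wandering outside the subset (say, blame on an unfetched
     # file) falls back to the dynamic fetching of [a] and needs the
     # network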

We might also add a part [c] with explicit commands to back-fill or
alter your incomplete view of the ODB (as I explained in response
to the "git diff <commit1> <commit2>" comment later in this thread.


> At its core, my idea was to use the object store to hold markers for the
> 'not yet fetched' objects (mainly trees and blobs). These would be in a 
> known fixed format, and have the same effect (conceptually) as the 
> sub-module markers - they _confirm_ the oid, yet say 'not here, try 
> elsewhere'.

We do have something like this.  Jonathan can explain better than I, but
basically, we denote possibly incomplete packfiles from partial clones
and fetches as "promisor" and have special rules in the code to assert
that a missing blob referenced from a "promisor" packfile is OK and can
be fetched later if necessary from the "promising" remote.

The main problem with markers or other lists of missing objects is
that they have scale problems for large repos.  Suppose I have 100M
blobs in my repo.  If I do a blob:none clone, I'd have 100M missing
blobs that would need tracking.  If I then do a batch fetch of the
blobs needed to do a sparse checkout of HEAD, I'd have to remove
those entries from the tracking data.  Not impossible, but not
speedy either.

> 
> The comparison with submodules means there is the same chance of
> de-synchronisation with triangular and upstream servers, unless managed.
> 
> The server side, as noted, will need to be included as it is the one that
> decides the pack file.
> 
> Options for a server management are:
> 
> - "I accept narrow packs?" No; yes
> 
> - "I serve narrow packs?" No; yes.
> 
> - "Repo completeness checks on reciept": (must be complete) || (allow 
> narrow to nothing).

we have new config settings for the server to allow/reject
partial clones.

and we have code in fsck/gc to handle these incomplete packfiles.

> 
> For server farms (e.g. Github..) the settings could be global, or by repo.
> (note that the completeness requirement and narrow receipt option are not
> incompatible - the recipient server can reject the pack from a narrow
> subordinate as incomplete - see below)

For now our scope is limited to partial clone and fetch.  We've not
considered push.

> 
> * Marking of 'missing' objects in the local object store, and on the wire.
> The missing objects are replaced by a place holder object, which used the
> same oid/sha1, but has a short fixed length, with content “GitNarrowObject
> <oid>”. The chance that that string would actually have such an oid 
> clash is
> the same as all other object hashes, so is a *safe* self-referential 
> device.

Again, there is a scale problem here.  If I have 100M missing blobs,
I can't afford to create 100M loose place holder files.  Or juggle
a 2GB file of missing objects on various operations.

> 
> 
> * The stored object already includes length (and inferred type), so we do
> know what it stands in for. Thus the local index (index file) should be 
> able
> to be recreated from the object store alone (including the ‘promised /
> narrow / missing’ files/directory markers)
> 
> * the ‘same’ as sub-modules.
> The potential for loss of synchronisation with a golden complete repo is
> just the same as for sub-modules. (We expected object/commit X here, but 
> it’s not in the store). This could happen with a small user group who 
> have locally narrow clones, who interact with their local narrow server 
> for ‘backup’, and then fail to push further upstream to a server that 
> mandates completeness. They could create a death by a thousand narrow 
> cuts. Having a golden upstream config reference (indicating which is the 
> upstream) could allow checks to ensure that doesn’t happen.
> 
> The fsck can be taught the config option of 'allowNarrow'.

We've updated fsck to be aware of partial clones.

> 
> The narrowness would be defined in a locally stored '.gitNarrowIgnore' file
> (which can include the size constraints being developed elsewhere on the
> list)
> 
> As a safety it could be that the .gitNarrowIgnore is sent with the pack so
> that fold know what they missed, and fsck could check that they are locally
> not narrower than some specific project .gitNarrowIgnore spec.

Currently, we store the filter_spec used for the partial clone (or the
first partial fetch) as a default for subsequent fetches, but we limit
it there.  That is, for operations like checkout or blame or whatever,
it doesn't matter why a blob is missing or what filter criteria was
used to cause it to be omitted -- just that it is.  Any time we need
a missing object, we have to go get it -- whether that is via dynamic
or bulk fetching.

> 
> The benefit of this is that the off-line operation capability of Git 
> continues,
> which GVFS doesn’t quite do (accidental lock in to a client-server model 
> aka
> all those other VCS systems).

Yes, there are limitations of GVFS and you must be online to use it.
It essentially does a blob:none filter and dynamically faults in every
blob.  (But it is also using a kernel driver and daemon to dynamically
populate the file system when a file is first opened for writing --
much like copy-on-write virtual memory.  And yes, these 2 operations
are independent, but currently combined in the GVFS code.)

And yes, there is work here to make sure that most normal
operations can continue to work offline.


> 
> I believe that its all doable, and that Jeff H's work already puts much of
> it in place, or touches those places
> 
> That said, it has been just _thinking_, without sufficient time to delve
> into the code.
> 
> Phil
[...]

Thanks
Jeff


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-01 16:03         ` Jeff Hostetler
@ 2017-12-01 18:16           ` Jonathan Nieder
  0 siblings, 0 replies; 26+ messages in thread
From: Jonathan Nieder @ 2017-12-01 18:16 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Vitaly Arbuzov, git, Konstantin Khomoutov, git-users,
	jonathantanmy, Christian Couder

Hi,

Jeff Hostetler wrote:
> On 11/30/2017 3:03 PM, Jonathan Nieder wrote:

>> One piece of missing functionality that looks interesting to me: that
>> series batches fetches of the missing blobs involved in a "git
>> checkout" command:
>>
>>   https://public-inbox.org/git/20171121211528.21891-14-git@jeffhostetler.com/
>>
>> But it doesn't batch fetches of the missing blobs involved in a "git
>> diff <commit> <commit>" command.  That might be a good place to get
>> your hands dirty. :)
>
> Jonathan Tan added code in unpack-trees to bulk fetch missing blobs
> before a checkout.  This is limited to the missing blobs needed for
> the target commit.  We need this to make checkout seamless, but it
> does mean that checkout may need online access.

Just to clarify: other parts of the series already fetch all missing
blobs that are required for a command.  What that bulk-fetch patch
does is to make that more efficient, by using a single fetch request
to grab all the blobs that are needed for checkout, instead of one
fetch per blob.

This doesn't change the online access requirement: online access is
needed if and only if you don't have the required objects already
available locally.  For example, if at clone time you specified a
sparse checkout pattern and you haven't changed that sparse checkout
pattern, then online access is not needed for checkout.

> I've also talked about a pre-fetch capability to bulk fetch missing
> blobs in advance of some operation.  You could speed up the above
> diff command or back-fill all the blobs I might need before going
> offline for a while.

In particular, something like this seems like a very valuable thing to
have when changing the sparse checkout pattern.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-01 17:23         ` Jeff Hostetler
@ 2017-12-01 18:24           ` Jonathan Nieder
  2017-12-04 15:53             ` Jeff Hostetler
  2017-12-02 18:24           ` Philip Oakley
  1 sibling, 1 reply; 26+ messages in thread
From: Jonathan Nieder @ 2017-12-01 18:24 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: Philip Oakley, Vitaly Arbuzov, Git List

Jeff Hostetler wrote:
> On 11/30/2017 6:43 PM, Philip Oakley wrote:

>> The 'companies' problem is that it tends to force a client-server, always-on
>> on-line mentality. I'm also wanting the original DVCS off-line capability to
>> still be available, with _user_ control, in a generic sense, of what they
>> have locally available (including files/directories they have not yet looked
>> at, but expect to have. IIUC Jeff's work is that on-line view, without the
>> off-line capability.
>>
>> I'd commented early in the series at [1,2,3].
>
> Yes, this does tend to lead towards an always-online mentality.
> However, there are 2 parts:
> [a] dynamic object fetching for missing objects, such as during a
>     random command like diff or blame or merge.  We need this
>     regardless of usage -- because we can't always predict (or
>     dry-run) every command the user might run in advance.
> [b] batch fetch mode, such as using partial-fetch to match your
>     sparse-checkout so that you always have the blobs of interest
>     to you.  And assuming you don't wander outside of this subset
>     of the tree, you should be able to work offline as usual.
> If you can work within the confines of [b], you wouldn't need to
> always be online.

Just to amplify this: for our internal use we care a lot about
disconnected usage working.  So it is not like we have forgotten about
this use case.

> We might also add a part [c] with explicit commands to back-fill or
> alter your incomplete view of the ODB

Agreed, this will be a nice thing to add.

[...]
>> At its core, my idea was to use the object store to hold markers for the
>> 'not yet fetched' objects (mainly trees and blobs). These would be in a
>> known fixed format, and have the same effect (conceptually) as the
>> sub-module markers - they _confirm_ the oid, yet say 'not here, try
>> elsewhere'.
>
> We do have something like this.  Jonathan can explain better than I, but
> basically, we denote possibly incomplete packfiles from partial clones
> and fetches as "promisor" and have special rules in the code to assert
> that a missing blob referenced from a "promisor" packfile is OK and can
> be fetched later if necessary from the "promising" remote.
>
> The main problem with markers or other lists of missing objects is
> that they have scale problems for large repos.

Any chance that we can get a design doc in Documentation/technical/
giving an overview of the design, with a brief "alternatives
considered" section describing this kind of thing?

E.g. some of the earlier descriptions like
 https://public-inbox.org/git/20170915134343.3814dc38@twelve2.svl.corp.google.com/
 https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/
 https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/
may help as a starting point.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-01  1:27         ` Vitaly Arbuzov
  2017-12-01  1:51           ` Vitaly Arbuzov
@ 2017-12-02 15:04           ` Philip Oakley
  1 sibling, 0 replies; 26+ messages in thread
From: Philip Oakley @ 2017-12-02 15:04 UTC (permalink / raw)
  To: Vitaly Arbuzov; +Cc: Jeff Hostetler, Git List

From: "Vitaly Arbuzov" <vit@uber.com>
Sent: Friday, December 01, 2017 1:27 AM
> Jonathan, thanks for references, that is super helpful, I will follow
your suggestions.

> Philip, I agree that keeping original DVCS off-line capability is an
important point. Ideally this feature should work even with remotes
that are located on the local disk.

And with any other remote (even to the extent that the other remote 
may indicate it has no capability: sorry, go away..)
E.g. one ought to be able to have/create a GitHub narrow fork of only the 
git.git/Documentation repo, and interact with that. (how much nicer if it was 
git.git/Documentation/ManPages/ to ease the exclusion of RelNotes/, howto/ 
and technical/ )

> Which part of Jeff's work do you think wouldn't work offline after
repo initialization is done and sparse fetch is performed? All the
stuff that I've seen seems to be quite usable without GVFS.

I think it's that initial download that may be different, and what is 
expected of it. In my case, one may never connect to that server again, yet 
still be able to work both off-line and with other remotes (push and pull as 
per capabilities). Below I note that I'd only fetch the needed trees, not 
all of them. Also one needs to fetch a complete (pre-defined) subset, rather 
than an on-demand subset.

> I'm not sure if we need to store markers/tombstones on the client,
what problem does it solve?

The part that the markers hope to solve is the part that I hadn't said: 
that they should also show in the work tree so that users can see what is 
missing and where.

Importantly, I would also trim the directory (tree) structure so only the 
direct hierarchy of those files the user sees is visible, though at each 
level they would see side directory names (which are embedded in the 
hierarchical tree objects). (IIUC Jeff H's scheme downloads *all* trees, not 
just a few.)

It would mean that users can create a complete fresh tree and commit that 
can be merged and picked onto the upstream tree from the _directory worktree 
alone_, because the oids of all the parts are listed in the worktree. The 
actual objects for the missing oids are available in the appropriate 
upstream.

It also means the index can be deleted, and with only the local narrow pack 
files and the current worktree the index can be recreated at the current 
sparseness level. (I'm hoping I've understood the distribution of data 
between the index and narrow packs correctly here ;-)

--
Philip

On Thu, Nov 30, 2017 at 3:43 PM, Philip Oakley <philipoakley@iee.org> wrote:
> From: "Vitaly Arbuzov" <vit@uber.com>
>>
>> Found some details here: https://github.com/jeffhostetler/git/pull/3
>>
>> Looking at commits I see that you've done a lot of work already,
>> including packing, filtering, fetching, cloning etc.
>> What are some areas that aren't complete yet? Do you need any help
>> with implementation?
>>
>
> comments below..
>
>>
>> On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov <vit@uber.com> wrote:
>>>
>>> Hey Jeff,
>>>
>>> It's great, I didn't expect that anyone is actively working on this.
>>> I'll check out your branch, meanwhile do you have any design docs that
>>> describe these changes or can you define high level goals that you
>>> want to achieve?
>>>
>>> On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler <git@jeffhostetler.com>
>>> wrote:
>>>>
>>>>
>>>>
>>>> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
>>>>>
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> I'm looking for ways to improve fetch/pull/clone time for large git
>>>>> (mono)repositories with unrelated source trees (that span across
>>>>> multiple services).
>>>>> I've found sparse checkout approach appealing and helpful for most of
>>>>> client-side operations (e.g. status, reset, commit, etc.)
>>>>> The problem is that there is no feature like sparse fetch/pull in git,
>>>>> this means that ALL objects in unrelated trees are always fetched.
>>>>> It may take a lot of time for large repositories and results in some
>>>>> practical scalability limits for git.
>>>>> This forced some large companies like Facebook and Google to move to
>>>>> Mercurial as they were unable to improve client-side experience with
>>>>> git while Microsoft has developed GVFS, which seems to be a step back
>>>>> to CVCS world.
>>>>>
>>>>> I want to get a feedback (from more experienced git users than I am)
>>>>> on what it would take to implement sparse fetching/pulling.
>>>>> (Downloading only objects related to the sparse-checkout list)
>>>>> Are there any issues with missing hashes?
>>>>> Are there any fundamental problems why it can't be done?
>>>>> Can we get away with only client-side changes or would it require
>>>>> special features on the server side?
>>>>>
>
> I have, for separate reasons been _thinking_ about the issue ($dayjob is 
> in
> defence, so a similar partition would be useful).
>
> The changes would almost certainly need to be server side (as well as 
> client
> side), as it is the server that decides what is sent over the wire in the
> pack files, which would need to be a 'narrow' pack file.
>
>>>>> If we had such a feature then all we would need on top is a separate
>>>>> tool that builds the right "sparse" scope for the workspace based on
>>>>> paths that developer wants to work on.
>>>>>
>>>>> In the world where more and more companies are moving towards large
>>>>> monorepos this improvement would provide a good way of scaling git to
>>>>> meet this demand.
>
>
> The 'companies' problem is that it tends to force a client-server, 
> always-on
> on-line mentality. I'm also wanting the original DVCS off-line capability 
> to
> still be available, with _user_ control, in a generic sense, of what they
> have locally available (including files/directories they have not yet 
> looked
> at, but expect to have. IIUC Jeff's work is that on-line view, without the
> off-line capability.
>
> I'd commented early in the series at [1,2,3].
>
>
> At its core, my idea was to use the object store to hold markers for the
> 'not yet fetched' objects (mainly trees and blobs). These would be in a
> known fixed format, and have the same effect (conceptually) as the
> sub-module markers - they _confirm_ the oid, yet say 'not here, try
> elsewhere'.
>
> The comparison with submodules means there is the same chance of
> de-synchronisation with triangular and upstream servers, unless managed.
>
> The server side, as noted, will need to be included as it is the one that
> decides the pack file.
>
> Options for a server management are:
>
> - "I accept narrow packs?" No; yes
>
> - "I serve narrow packs?" No; yes.
>
> - "Repo completeness checks on reciept": (must be complete) || (allow 
> narrow
> to nothing).
>
> For server farms (e.g. Github..) the settings could be global, or by repo.
> (note that the completeness requirement and narrow receipt option are not
> incompatible - the recipient server can reject the pack from a narrow
> subordinate as incomplete - see below)
>
> * Marking of 'missing' objects in the local object store, and on the wire.
> The missing objects are replaced by a place holder object, which used the
> same oid/sha1, but has a short fixed length, with content “GitNarrowObject
> <oid>”. The chance that that string would actually have such an oid clash 
> is
> the same as all other object hashes, so is a *safe* self-referential 
> device.
>
>
> * The stored object already includes length (and inferred type), so we do
> know what it stands in for. Thus the local index (index file) should be 
> able
> to be recreated from the object store alone (including the ‘promised /
> narrow / missing’ files/directory markers)
>
> * the ‘same’ as sub-modules.
> The potential for loss of synchronisation with a golden complete repo is
> just the same as for sub-modules. (We expected object/commit X here, but
> it’s not in the store). This could happen with a small user group who have
> locally narrow clones, who interact with their local narrow server for
> ‘backup’, and then fail to push further upstream to a server that mandates
> completeness. They could create a death by a thousand narrow cuts. Having 
> a
> golden upstream config reference (indicating which is the upstream) could
> allow checks to ensure that doesn’t happen.
>
> The fsck can be taught the config option of 'allowNarrow'.
>
> The narrowness would be defined in a locally stored '.gitNarrowIgnore' 
> file
> (which can include the size constraints being developed elsewhere on the
> list)
>
> As a safety it could be that the .gitNarrowIgnore is sent with the pack so
> that folk know what they missed, and fsck could check that they are 
> locally
> not narrower than some specific project .gitNarrowIgnore spec.
>
> The benefit of this is that the off-line operation capability of Git 
> continues,
> which GVFS doesn’t quite do (accidental lock in to a client-server model 
> aka
> all those other VCS systems).
>
> I believe that its all doable, and that Jeff H's work already puts much of
> it in place, or touches those places
>
> That said, it has been just _thinking_, without sufficient time to delve
> into the code.
>
> Phil
>
>>>>>
>>>>> PS. Please don't advice to split things up, as there are some good
>>>>> reasons why many companies decide to keep their code in the monorepo,
>>>>> which you can easily find online. So let's keep that part out the
>>>>> scope.
>>>>>
>>>>> -Vitaly
>>>>>
>>>>
>>>>
>>>> This work is in-progress now.  A short summary can be found in [1]
>>>> of the current parts 1, 2, and 3.
>>>>
>>>>> * jh/object-filtering (2017-11-22) 6 commits
>>>>> * jh/fsck-promisors (2017-11-22) 10 commits
>>>>> * jh/partial-clone (2017-11-22) 14 commits
>>>>
>>>>
>>>>
>>>> [1]
>>>>
>>>> https://public-inbox.org/git/xmqq1skh6fyz.fsf@gitster.mtv.corp.google.com/T/
>>>>
>>>> I have a branch that contains V5 all 3 parts:
>>>> https://github.com/jeffhostetler/git/tree/core/pc5_p3
>>>>
>>>> This is a WIP, so there are some rough edges....
>>>> I hope to have a V6 out before the weekend with some
>>>> bug fixes and cleanup.
>>>>
>>>> Please give it a try and see if it fits your needs.
>>>> Currently, there are filter methods to filter all blobs,
>>>> all large blobs, and one to match a sparse-checkout
>>>> specification.
>>>>
>>>> Let me know if you have any questions or problems.
>>>>
>>>> Thanks,
>>>> Jeff
>
>
> [1,2]  [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing
> blobs")
> https://public-inbox.org/git/BC1048A63B034E46A11A01758BC04855@PhilipOakley/
> Date: Tue, 25 Jul 2017 21:48:46 +0100
> https://public-inbox.org/git/8EE0108BA72B42EA9494B571DDE2005D@PhilipOakley/
> Date: Sat, 29 Jul 2017 13:51:16 +0100
>
> [3] [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
> https://public-inbox.org/git/244AA0848E9D46F480E7CA407582A162@PhilipOakley/
> Date: Sat, 29 Jul 2017 14:26:52 +0100
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-01 14:30             ` Jeff Hostetler
@ 2017-12-02 16:30               ` Philip Oakley
  2017-12-04 15:36                 ` Jeff Hostetler
  0 siblings, 1 reply; 26+ messages in thread
From: Philip Oakley @ 2017-12-02 16:30 UTC (permalink / raw)
  To: Vitaly Arbuzov, Jeff Hostetler; +Cc: Git List

From: "Jeff Hostetler" <git@jeffhostetler.com>
Sent: Friday, December 01, 2017 2:30 PM
> On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote:
>> I think it would be great if we high level agree on desired user
>> experience, so let me put a few possible use cases here.
>>
>> 1. Init and fetch into a new repo with a sparse list.
>> Preconditions: origin blah exists and has a lot of folders inside of
>> src including "bar".
>> Actions:
>> git init foo && cd foo
>> git config core.sparseAll true # New flag to activate all sparse
>> operations by default so you don't need to pass options to each
>> command.
>> echo "src/bar" > .git/info/sparse-checkout
>> git remote add origin blah
>> git pull origin master
>> Expected results: foo contains src/bar folder and nothing else,
>> objects that are unrelated to this tree are not fetched.
>> Notes: This should work same when fetch/merge/checkout operations are
>> used in the right order.
>
> With the current patches (parts 1,2,3) we can pass a blob-ish
> to the server during a clone that refers to a sparse-checkout
> specification.

I hadn't appreciated this capability. I see it as important, and should be 
available both ways, so that a .gitNarrow spec can be imposed from the 
server side, as well as by the requester.

It could also be used to assist in the 'precious/secret' blob problem, so 
that AWS keys are never pushed, nor available for fetching!

>        There's a bit of a chicken-n-egg problem getting
> things set up.  So if we assume your team would create a series
> of "known enlistments" under version control, then you could

s/enlistments/entitlements/ I presume?

> just reference one by <branch>:<path> during your clone.  The
> server can lookup that blob and just use it.
>
>     git clone --filter=sparse:oid=master:templates/bar URL
>
> And then the server will filter-out the unwanted blobs during
> the clone.  (The current version only filters blobs; you still
> get full commits and trees.  That will be revisited later.)

I'm for the idea that only the in-hierarchy trees should be sent.
It should also be possible that the server replies that it is only sending a 
narrow clone, with the given (accessible?) spec.

>
> On the client side, the partial clone installs local config
> settings into the repo so that subsequent fetches default to
> the same filter criteria as used in the clone.
>
>
> I don't currently have provision to send a full sparse-checkout
> specification to the server during a clone or fetch.  That
> seemed like too much to try to squeeze into the protocols.
> We can revisit this later if there is interest, but it wasn't
> critical for the initial phase.
>
Agreed. I think it should be somewhere 'visible' to the user, but could be 
set up by the server admin / repo maintainer if they don't have write access. 
But there could still be the catch-22 - maybe one starts with a <commit | 
toptree> : <tree> pair to define an origin point (it's not as refined as a 
.gitNarrow spec file, but is definitive). The toptree option could even 
allow sub-tree clones.. maybe..

>
>>
>> 2. Add a file and push changes.
>> Preconditions: all steps above followed.
>> touch src/bar/baz.txt && git add -A && git commit -m "added a file"
>> git push origin master
>> Expected results: changes are pushed to remote.
>
> I don't believe partial clone and/or partial fetch will cause
> any changes for push.

I suspect that pushes could be rejected if the user 'pretends' to modify 
files or trees outside their area. It does need the user to be able to spoof 
part of a tree they don't have, so an upstream / remote would immediately 
know it was a spoof but locally the narrow clone doesn't have enough detail 
about the 'bad' oid. It would be right to reject such attempts!

>
>>
>> 3. Clone a repo with a sparse list as a filter.
>> Preconditions: same as for #1
>> Actions:
>> echo "src/bar" > /tmp/blah-sparse-checkout
>> git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be
>> the only command that would requires specific option key being passed.
>> Expected results: same as for #1 plus /tmp/blah-sparse-checkout is
>> copied into .git/info/sparse-checkout

I presume clone and fetch are treated equivalently here.

>
> There are 2 independent concepts here: clone and checkout.
> Currently, there isn't any automatic linkage of the partial clone to
> the sparse-checkout settings, so you could do something like this:
>
I see an implicit link that clearly one cannot check out (inflate/populate) a 
file/directory that one does not have in the object store. But that does not 
imply the reverse linkage. The regular sparse checkout should be available 
independently of the local clone being a narrow one.

>     git clone --no-checkout --filter=sparse:oid=master:templates/bar URL
>     git cat-file ... templates/bar >.git/info/sparse-checkout
>     git config core.sparsecheckout true
>     git checkout ...
>
> I've been focused on the clone/fetch issues and have not looked
> into the automation to couple them.
>

I foresee that large files and certain files need to be filterable for 
fetch-clone, and that might not be (backward) compatible with the 
sparse-checkout.

>
>>
>> 4. Showing log for sparsely cloned repo.
>> Preconditions: #3 is followed
>> Actions:
>> git log
>> Expected results: recent changes that affect src/bar tree.
>
> If I understand your meaning, log would only show changes
> within the sparse subset of the tree.  This is not on my
> radar for partial clone/fetch.  It would be a nice feature
> to have, but I think it would be better to think about it
> from the point of view of sparse-checkout rather than clone.
>
One option may be making a marker for the tree/blob a first-class 
citizen. So the oid (and worktree file) has content ".gitNarrowTree <oid>" 
or ".gitNarrowBlob <oid>" as required (*), which is safe, and allows a 
consistent alter-ego view of the tree contents and hence for git-log et al.

(*) I keep flip flopping between a single object marker, and distinct object 
markers for the types. It partly depends on whether one can know in advance, 
locally, what the oid type should be, and how it should be embedded in the 
object store - need to re-check the specs.

I'm tending toward distinct types to cope with the D/F conflict in the 
worktrees - the directory must be created (it holds the name etc.), and the 
alter-ego content then must be placed in a _known_ sub-file ".gitNarrowTree" 
(without the oid in the file name, but included in the content). The 
".gitNarrowTree" file should stand alone in the directory when that part of 
the work-tree is clean.

>
>>
>> 5. Showing diff.
>> Preconditions: #3 is followed
>> Actions:
>> git diff HEAD^ HEAD
>> Expected results: changes from the most recent commit affecting
>> src/bar folder are shown.
>> Notes: this can be tricky operation as filtering must be done to
>> remove results from unrelated subtrees.
>
> I don't have any plan for this and I don't think it fits within
> the scope of clone/fetch.  I think this too would be a sparse-checkout
> feature.
>

See my note about first class citizens for marker OIDs

>
>>
>> *Note that I intentionally didn't mention use cases that are related
>> to filtering by blob size as I think we should logically consider them
>> as a separate, although related, feature.
>
> I've grouped blob-size and sparse filter together for the
> purposes of clone/fetch since the basic mechanisms (filtering,
> transport, and missing object handling) are the same for both.
> They do lead to different end-uses, but that is above my level
> here.
>
>
>>
>> What do you think about these examples above? Is that something that
>> more-or-less fits into current development? Are there other important
>> flows that I've missed?
>
> These are all good ideas and it is good to have someone else who
> wants to use partial+sparse thinking about it and looking for gaps
> as we try to make a complete end-to-end feature.
>>
>> -Vitaly
>
> Thanks
> Jeff
>

Philip


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-01  2:51             ` Jonathan Nieder
  2017-12-01  3:37               ` Vitaly Arbuzov
@ 2017-12-02 16:59               ` Philip Oakley
  1 sibling, 0 replies; 26+ messages in thread
From: Philip Oakley @ 2017-12-02 16:59 UTC (permalink / raw)
  To: Jonathan Nieder, Vitaly Arbuzov; +Cc: Jeff Hostetler, Git List

Hi Jonathan,

Thanks for the outline. It has helped clarify some points and shown the very 
similar alignments.

The one thing I wasn't clear about is the "promised" objects/remote. Is that 
"promisor" remote a fixed entity, or could it be one of many remotes that 
could be a "provider"? (sort of like fetching sub-modules...)

Philip

From: "Jonathan Nieder" <jrnieder@gmail.com>
Sent: Friday, December 01, 2017 2:51 AM
> Hi Vitaly,
>
> Vitaly Arbuzov wrote:
>
>> I think it would be great if we high level agree on desired user
>> experience, so let me put a few possible use cases here.
>
> I think one thing this thread is pointing to is a lack of overview
> documentation about how the 'partial clone' series currently works.
> The basic components are:
>
> 1. extending git protocol to (1) allow fetching only a subset of the
>    objects reachable from the commits being fetched and (2) later,
>    going back and fetching the objects that were left out.
>
>    We've also discussed some other protocol changes, e.g. to allow
>    obtaining the sizes of un-fetched objects without fetching the
>    objects themselves
>
> 2. extending git's on-disk format to allow having some objects not be
>    present but only be "promised" to be obtainable from a remote
>    repository.  When running a command that requires those objects,
>    the user can choose to have it either (a) error out ("airplane
>    mode") or (b) fetch the required objects.
>
>    It is still possible to work fully locally in such a repo, make
>    changes, get useful results out of "git fsck", etc.  It is kind of
>    similar to the existing "shallow clone" feature, except that there
>    is a more straightforward way to obtain objects that are outside
>    the "shallow" clone when needed on demand.
>
> 3. improving everyday commands to require fewer objects.  For
>    example, if I run "git log -p", then I want to see the history of
>    most files but I don't necessarily want to download large binary
>    files just to print 'Binary files differ' for them.
>
>    And by the same token, we might want to have a mode for commands
>    like "git log -p" to default to restricting to a particular
>    directory, instead of downloading files outside that directory.
>
>    There are some fundamental changes to make in this category ---
>    e.g. modifying the index format to not require entries for files
>    outside the sparse checkout, to avoid having to download the
>    trees for them.
>
> The overall goal is to make git scale better.
>
> The existing patches do (1) and (2), though it is possible to do more
> in those categories. :)  We have plans to work on (3) as well.
>
> These are overall changes that happen at a fairly low level in git.
> They mostly don't require changes command-by-command.
>
> Thanks,
> Jonathan 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-01 17:23         ` Jeff Hostetler
  2017-12-01 18:24           ` Jonathan Nieder
@ 2017-12-02 18:24           ` Philip Oakley
  2017-12-05 19:14             ` Jeff Hostetler
  1 sibling, 1 reply; 26+ messages in thread
From: Philip Oakley @ 2017-12-02 18:24 UTC (permalink / raw)
  To: Vitaly Arbuzov, Jeff Hostetler; +Cc: Git List

From: "Jeff Hostetler" <git@jeffhostetler.com>
Sent: Friday, December 01, 2017 5:23 PM
> On 11/30/2017 6:43 PM, Philip Oakley wrote:
>> From: "Vitaly Arbuzov" <vit@uber.com>
> [...]
>> comments below..
>>>
>>> On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov <vit@uber.com> wrote:
>>>> Hey Jeff,
>>>>
>>>> It's great, I didn't expect that anyone is actively working on this.
>>>> I'll check out your branch, meanwhile do you have any design docs that
>>>> describe these changes or can you define high level goals that you
>>>> want to achieve?
>>>>
>>>> On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler <git@jeffhostetler.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
> [...]
>>>>>>
>>
>> I have, for separate reasons been _thinking_ about the issue ($dayjob is 
>> in
>> defence, so a similar partition would be useful).
>>
>> The changes would almost certainly need to be server side (as well as 
>> client
>> side), as it is the server that decides what is sent over the wire in the 
>> pack files, which would need to be a 'narrow' pack file.
>
> Yes, there will need to be both client and server changes.
> In the current 3 part patch series, the client sends a "filter_spec"
> to the server as part of the fetch-pack/upload-pack protocol.
> If the server chooses to honor it, upload-pack passes the filter_spec
> to pack-objects to build an "incomplete" packfile omitting various
> objects (currently blobs).  Proprietary servers will need similar
> changes to support this feature.
>
> Discussing this feature in the context of the defense industry
> makes me a little nervous.  (I used to be in that area.)

I'm viewing the desire for codebase partitioning from a soft layering of 
risk view (perhaps a more UK than USA approach ;-)

> What we have in the code so far may be a nice start, but
> probably doesn't have the assurances that you would need
> for actual deployment.  But it's a start....

True. I need to get some of my colleagues more engaged...
>
>>
>>>>>> If we had such a feature then all we would need on top is a separate
>>>>>> tool that builds the right "sparse" scope for the workspace based on
>>>>>> paths that developer wants to work on.
>>>>>>
>>>>>> In the world where more and more companies are moving towards large
>>>>>> monorepos this improvement would provide a good way of scaling git to
>>>>>> meet this demand.
>>
>> The 'companies' problem is that it tends to force a client-server, 
>> always-on
>> on-line mentality. I'm also wanting the original DVCS off-line capability 
>> to
>> still be available, with _user_ control, in a generic sense, of what they
>> have locally available (including files/directories they have not yet 
>> looked
>> at, but expect to have. IIUC Jeff's work is that on-line view, without 
>> the
>> off-line capability.
>>
>> I'd commented early in the series at [1,2,3].
>
> Yes, this does tend to lead towards an always-online mentality.
> However, there are 2 parts:
> [a] dynamic object fetching for missing objects, such as during a
>     random command like diff or blame or merge.  We need this
>     regardless of usage -- because we can't always predict (or
>     dry-run) every command the user might run in advance.

Making something "useful" happen here when off-line is an obvious goal.

> [b] batch fetch mode, such as using partial-fetch to match your
>     sparse-checkout so that you always have the blobs of interest
>     to you.  And assuming you don't wander outside of this subset
>     of the tree, you should be able to work offline as usual.
> If you can work within the confines of [b], you wouldn't need to
> always be online.

I feel this is the area that does need to ensure a capability to avoid any 
perception of the much-maligned 'Embrace, extend, and extinguish' by 
accidental lockout.

I don't think this should be viewed as a type of sparse checkout - it's just 
a checkout of what you have (under the hood it could use the same code 
though).

>
> We might also add a part [c] with explicit commands to back-fill or
> alter your incomplete view of the ODB (as I explained in response
> to the "git diff <commit1> <commit2>" comment later in this thread.
>
>
>> At its core, my idea was to use the object store to hold markers for the
>> 'not yet fetched' objects (mainly trees and blobs). These would be in a 
>> known fixed format, and have the same effect (conceptually) as the 
>> sub-module markers - they _confirm_ the oid, yet say 'not here, try 
>> elsewhere'.
>
> We do have something like this.  Jonathan can explain better than I, but
> basically, we denote possibly incomplete packfiles from partial clones
> and fetches as "promisor" and have special rules in the code to assert
> that a missing blob referenced from a "promisor" packfile is OK and can
> be fetched later if necessary from the "promising" remote.

The remote interaction is one area that may need thought, especially in a 
triangle workflow, of which there are a few.

>
> The main problem with markers or other lists of missing objects is
> that it has scale problems for large repos.  Suppose I have 100M
> blobs in my repo.  If I do a blob:none clone, I'd have 100M missing
> blobs that would need tracking.  If I then do a batch fetch of the
> blobs needed to do a sparse checkout of HEAD, I'd have to remove
> those entries from the tracking data.  Not impossible, but not
> speedy either.

** Ahhh. I see. That's a consequence of having all the trees, isn't it? **

I've always thought that limiting the trees is at the heart of the Narrow 
clone/fetch problem.

OK so if you have flat, wide structures with 10k files/directories per tree 
then it's still a fair sized problem, but it should *scale logarithmically* 
for the part of the tree structure that's not being downloaded.

You never have to add a marker for a blob that you have no containing tree 
for, nor for the tree that contained the blob's tree, all the way up the 
primary line of descent to the tree of concern. All those trees are never 
downloaded; there are only a few markers (.gitNarrowTree files) for those tree 
stubs, and certainly no 100M missing blob markers.

>
>>
>> The comparison with submodules means there is the same chance of
>> de-synchronisation with triangular and upstream servers, unless managed.
>>
>> The server side, as noted, will need to be included as it is the one that
>> decides the pack file.
>>
>> Options for a server management are:
>>
>> - "I accept narrow packs?" No; yes
>>
>> - "I serve narrow packs?" No; yes.
>>
>> - "Repo completeness checks on reciept": (must be complete) || (allow 
>> narrow to nothing).
>
> we have new config settings for the server to allow/reject
> partial clones.
>
> and we have code in fsck/gc to handle these incomplete packfiles.

good

>>
>> For server farms (e.g. Github..) the settings could be global, or by 
>> repo.
>> (note that the completeness requirement and narrow receipt option are not
>> incompatible - the recipient server can reject the pack from a narrow
>> subordinate as incomplete - see below)
>
> For now our scope is limited to partial clone and fetch.  We've not
> considered push.

OK
>
>>
>> * Marking of 'missing' objects in the local object store, and on the 
>> wire.
>> The missing objects are replaced by a place holder object, which used the
>> same oid/sha1, but has a short fixed length, with content 
>> “GitNarrowObject
>> <oid>”. The chance that that string would actually have such an oid clash 
>> is
>> the same as all other object hashes, so is a *safe* self-referential 
>> device.
>
> Again, there is a scale problem here.  If I have 100M missing blobs,
> I can't afford to create 100M loose place holder files.  Or juggle
> a 2GB file of missing objects on various operations.

As above, I'm also trimming the trees, so in general, there would be no 
missing  blobs, just the content of the directory one was interested in.

That's not quite true if higher level trees have blob references in them 
that are otherwise unwanted - they may each need a marker. [Or maybe a 
special single 'tree-of-blobs' marker for them all thus only one marker per 
tree - over-thinking maybe...]
>
>>
>>
>> * The stored object already includes length (and inferred type), so we do
>> know what it stands in for. Thus the local index (index file) should be 
>> able
>> to be recreated from the object store alone (including the ‘promised /
>> narrow / missing’ files/directory markers)
>>
>> * the ‘same’ as sub-modules.
>> The potential for loss of synchronisation with a golden complete repo is
>> just the same as for sub-modules. (We expected object/commit X here, but 
>> it’s not in the store). This could happen with a small user group who 
>> have locally narrow clones, who interact with their local narrow server 
>> for ‘backup’, and then fail to push further upstream to a server that 
>> mandates completeness. They could create a death by a thousand narrow 
>> cuts. Having a golden upstream config reference (indicating which is the 
>> upstream) could allow checks to ensure that doesn’t happen.
>>
>> The fsck can be taught the config option of 'allowNarrow'.
>
> We've updated fsck to be aware of partial clones.
>
OK
>>
>> The narrowness would be defined in a locally stored '.gitNarrowIgnore' 
>> file
>> (which can include the size constraints being developed elsewhere on the
>> list)
>>
>> As a safety it could be that the .gitNarrowIgnore is sent with the pack 
>> so
>> that folk know what they missed, and fsck could check that they are 
>> locally
>> not narrower than some specific project .gitNarrowIgnore spec.
>
> Currently, we store the filter_spec used for the partial clone (or the
> first partial fetch) as a default for subsequent fetches, but we limit
> it there.  That is, for operations like checkout or blame or whatever,
> it doesn't matter why a blob is missing or what filter criteria was
> used to cause it to be omitted -- just that it is.  Any time we need
> a missing object, we have to go get it -- whether that is via dynamic
> or bulk fetching.

Deciding *if* we have to get it, while still being 'useful', is part of the 
question I raised above. In my world view, we already have the interesting 
blobs, so we shouldn't need to get anything. Diffs, blames, and checkouts 
simply go with the stub values and everything is cushty.
>
>>
>> The benefit of this is that the off-line operation capability of Git 
>> continues,
>> which GVFS doesn’t quite do (accidental lock in to a client-server model 
>> aka
>> all those other VCS systems).
>
> Yes, there are limitations of GVFS and you must be online to use it.
> It essentially does a blob:none filter and dynamically faults in every
> blob.  (But it is also using a kernel driver and daemon to dynamically
> populate the file system when a file is first opened for writing --
> much like copy-on-write virtual memory.  And yes, these 2 operations
> are independent, but currently combined in the GVFS code.)
>
> And yes, there is work here to make sure that most normal
> operations can continue to work offline.


Magic.
>
>
>>
>> I believe that its all doable, and that Jeff H's work already puts much 
>> of
>> it in place, or touches those places
>>
>> That said, it has been just _thinking_, without sufficient time to delve
>> into the code.
>>
>> Phil
> [...]
>
> Thanks
> Jeff
>

Thanks for the great work.

Philip 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-02 16:30               ` Philip Oakley
@ 2017-12-04 15:36                 ` Jeff Hostetler
  2017-12-05 23:46                   ` Philip Oakley
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff Hostetler @ 2017-12-04 15:36 UTC (permalink / raw)
  To: Philip Oakley, Vitaly Arbuzov; +Cc: Git List



On 12/2/2017 11:30 AM, Philip Oakley wrote:
> From: "Jeff Hostetler" <git@jeffhostetler.com>
> Sent: Friday, December 01, 2017 2:30 PM
>> On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote:
>>> I think it would be great if we high level agree on desired user
>>> experience, so let me put a few possible use cases here.
>>>
>>> 1. Init and fetch into a new repo with a sparse list.
>>> Preconditions: origin blah exists and has a lot of folders inside of
>>> src including "bar".
>>> Actions:
>>> git init foo && cd foo
>>> git config core.sparseAll true # New flag to activate all sparse
>>> operations by default so you don't need to pass options to each
>>> command.
>>> echo "src/bar" > .git/info/sparse-checkout
>>> git remote add origin blah
>>> git pull origin master
>>> Expected results: foo contains src/bar folder and nothing else,
>>> objects that are unrelated to this tree are not fetched.
>>> Notes: This should work same when fetch/merge/checkout operations are
>>> used in the right order.
>>
>> With the current patches (parts 1,2,3) we can pass a blob-ish
>> to the server during a clone that refers to a sparse-checkout
>> specification.
> 
> I hadn't appreciated this capability. I see it as important, and should be available both ways, so that a .gitNarrow spec can be imposed from the server side, as well as by the requester.
> 
> It could also be used to assist in the 'precious/secret' blob problem, so that AWS keys are never pushed, nor available for fetching!

To be honest, I've always considered partial clone/fetch as
a client-side request as a performance feature to minimize
download times and disk space requirements on the client.
I've not thought of it from the "server has secrets" point
of view.

We can talk about it, but I'd like to keep it outside the
scope of the current effort.  My concerns are that that is
not the appropriate mechanism to enforce MAC/DAC like security
mechanisms.  For example:
[a] The client will still receive the containing trees that
     refer to the sensitive blobs, so the user can tell when
     the secret blobs change -- they wouldn't have either blob,
     but can tell when they are changed.  This event by itself
     may or may not leak sensitive information depending on the
     terms of the security policy in place.
[b] The existence of such missing blobs would tell the client
     which blobs are significant and secret and allow them to
     focus their attack.  It would be better if those assets
     were completely hidden and not in the tree at all.
[c] The client could push a fake secret blob to replace the
     valid one on the server.  You would have to audit the
     server to ensure that it never accepts a push containing
     a change to any secret blob.  And the server would need
     an infrastructure to know about all secrets in the tree.
[d] When a secret blob does change, any local merges by the
     user lack information to complete the merge -- they can't
     merge the secrets and they can't be trusted to correctly
     pick-ours or pick-theirs -- so their workflows are broken.
I'm not trying to blindly spread FUD here, but it is arguments
like these that make me suggest that the partial clone mechanism
is not the right vehicle for such "secret" blobs.


> 
>>        There's a bit of a chicken-n-egg problem getting
>> things set up.  So if we assume your team would create a series
>> of "known enlistments" under version control, then you could
> 
> s/enlistments/entitlements/ I presume?

Within my org we speak of "enlistments" as subset of the tree
that you plan to work on.  For example, you might enlist in the
"file system" portion of the tree or in the "device drivers"
portion.  If the Makefiles have good partitioning, you should
only need one of the above portions to do productive work within
a feature area.

I'm not sure what you mean by "entitlements".

> 
>> just reference one by <branch>:<path> during your clone.  The
>> server can lookup that blob and just use it.
>>
>>     git clone --filter=sparse:oid=master:templates/bar URL
>>
>> And then the server will filter-out the unwanted blobs during
>> the clone.  (The current version only filters blobs; you still
>> get full commits and trees.  That will be revisited later.)
> 
> I'm for the idea that only the in-hierarchy trees should be sent.
> It should also be possible that the server replies that it is 
> only sending a narrow clone, with the given (accessible?) spec.

I do want to extend this to have unneeded tree filtering too.
It is just not in this version.

> 
>>
>> On the client side, the partial clone installs local config
>> settings into the repo so that subsequent fetches default to
>> the same filter criteria as used in the clone.
>>
>>
>> I don't currently have provision to send a full sparse-checkout
>> specification to the server during a clone or fetch.  That
>> seemed like too much to try to squeeze into the protocols.
>> We can revisit this later if there is interest, but it wasn't
>> critical for the initial phase.
>>
> Agreed. I think it should be somewhere 'visible' to the user, but could be set up by the server admin / repo maintainer if they don't have write access. But there could still be the catch-22 - maybe one starts with a <commit | toptree> : <tree> pair to define an origin point (it's not as refined as a .gitNarrow spec file, but is definitive). The toptree option could even allow sub-tree clones.. maybe..

That's why I suggest having the sparse-checkout specifications
be stored under version control in the tree in a known location.
The user could be told out-of-band that "master:enlistments/*"
contains all of the well-defined enlistment specs -- I'm not
proposing such an area, just that a sysadmin could agree to
lay out their tree with one.  Perhaps they have an enlistment
that just includes that directory.  Then the user could clone
that and look thru it -- add a new one if they need to and push
it -- and then do a partial fetch using a different enlistment
spec.  Again, I'm not dictating this mechanism, but just saying
that something like the above is possible.
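
A rough sketch of that flow, with made-up spec names (the filter syntax
is the one from the current patches, and I'm glossing over the
sparse-checkout setup described elsewhere in this thread):

    # clone just enough to browse the well-known enlistment specs
    git clone --no-checkout --filter=sparse:oid=master:enlistments/minimal URL repo
    cd repo
    git cat-file blob master:enlistments/file-system   # inspect one spec

    # later, fetch the objects needed for a different enlistment
    git fetch --filter=sparse:oid=master:enlistments/file-system origin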

> 
>>
>>>
>>> 2. Add a file and push changes.
>>> Preconditions: all steps above followed.
>>> touch src/bar/baz.txt && git add -A && git commit -m "added a file"
>>> git push origin master
>>> Expected results: changes are pushed to remote.
>>
>> I don't believe partial clone and/or partial fetch will cause
>> any changes for push.
> 
> I suspect that pushes could be rejected if the user 'pretends'
> to modify files or trees outside their area. It does need the
> user to be able to spoof part of a tree they don't have, so an
> upstream / remote would immediately know it was a spoof but
> locally the narrow clone doesn't have enough detail about the
> 'bad' oid. It would be right to reject such attempts!

There is nothing in the partial clone/fetch to support this.
The server doesn't know which parts of the tree the user has
or doesn't have.  There is nothing to prevent the user from
creating a new file anywhere in the tree -- even if they don't
have blobs for anything else in the surrounding directory --
and including it in a push -- since the local "git commit"
would see it as adding a single file with a new SHA and
use the existing SHAs for the neighboring files.
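
For illustration, a minimal sketch inside a partial clone whose
sparse-checkout is limited to src/bar (the paths are made up):

    mkdir -p src/other
    echo hello > src/other/new.txt
    git add src/other/new.txt
    git commit -m "add a file outside my cone"
    # the new trees reuse the existing SHAs for everything else under
    # src/ and the root, so no other blobs are needed locally
    git push origin master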


> 
>>
>>>
>>> 3. Clone a repo with a sparse list as a filter.
>>> Preconditions: same as for #1
>>> Actions:
>>> echo "src/bar" > /tmp/blah-sparse-checkout
>>> git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be
>>> the only command that would requires specific option key being passed.
>>> Expected results: same as for #1 plus /tmp/blah-sparse-checkout is
>>> copied into .git/info/sparse-checkout
> 
> I presume clone and fetch are treated equivalently here.

Yes, for the download and config setup, both clone and fetch are
equivalent.  They can pass a filter-spec to the server and set
the local config variables to indicate that a partial clone/fetch
was used.
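
(Illustrative only -- the exact key names are whatever the series
finally settles on.)  After a partial clone you would expect to find
the remembered state with something like:

    git config --list | grep -i -e partial -e promisor
    # hypothetically: a flag naming the promising remote plus the
    # default filter-spec to reuse on subsequent fetches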

> 
>>
>> There are 2 independent concepts here: clone and checkout.
>> Currently, there isn't any automatic linkage of the partial clone to
>> the sparse-checkout settings, so you could do something like this:
>>
> I see an implicit link that clearly one cannot checkout
> (inflate/populate) a file/directory that one does not have
> in the object store. But that does not imply the reverse linkage.
> The regular sparse checkout should be available independently of
> the local clone being a narrow one.

Right, I wasn't talking about changing sparse-checkout.  I was
just pointing out that after you complete a partial-clone, you
need to take a few steps before you can do a checkout that won't
complain about the missing blobs -- or worse, demand load all
of them.
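
Concretely, a filled-in version of the quoted steps below might look
like this (same hypothetical URL and path as the earlier example):

    git clone --no-checkout --filter=sparse:oid=master:templates/bar URL repo
    cd repo
    git cat-file blob master:templates/bar > .git/info/sparse-checkout
    git config core.sparsecheckout true
    git checkout master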


> 
>>     git clone --no-checkout --filter=sparse:oid=master:templates/bar URL
>>     git cat-file ... templates/bar >.git/info/sparse-checkout
>>     git config core.sparsecheckout true
>>     git checkout ...
>>
>> I've been focused on the clone/fetch issues and have not looked
>> into the automation to couple them.
>>
> 
> I foresee that large files and certain files need to be filterable
> for fetch-clone, and that might not be (backward) compatible
> with the sparse-checkout.

There are several filter-spec criteria: no-blobs, no-large-blobs,
sparse-checkout-compatible and others may be added later.  Each
may require different post-clone or post-fetch handling.

For example, you may want to configure your client to only omit
large blobs and configure a "missing blob helper" (outside of my
scope) that fetches them from S3 rather than from the origin
remote.  This would not need to use sparse-checkout.
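
To make that list concrete, the filter-spec forms for those criteria
(using the spelling from the current series; treat it as provisional)
look roughly like:

    git clone --filter=blob:none URL                          # omit all blobs
    git clone --filter=blob:limit=1m URL                      # omit blobs larger than 1 MB
    git clone --filter=sparse:oid=master:templates/bar URL    # omit blobs outside a sparse spec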

I guess what I'm trying to say is that this effort is providing
a mechanism to let the git client request object filtering and
work with missing objects locally.

> 
>>
>>>
>>> 4. Showing log for sparsely cloned repo.
>>> Preconditions: #3 is followed
>>> Actions:
>>> git log
>>> Expected results: recent changes that affect src/bar tree.
>>
>> If I understand your meaning, log would only show changes
>> within the sparse subset of the tree.  This is not on my
>> radar for partial clone/fetch.  It would be a nice feature
>> to have, but I think it would be better to think about it
>> from the point of view of sparse-checkout rather than clone.
>>
> One option maybe by making a marker for the tree/blob to
> be a first class citizen. So the oid (and worktree file)
> has content ".gitNarrowTree <oid>" or ".gitNarrowBlob <oid>"
> as required (*), which is safe, and allows a consistent
> alter-ego view of the tree contents and hence for git-log et al.
> 
> (*) I keep flip flopping between a single object marker, and
> distinct object markers for the types. It partly depends on
> whether one can know in advance, locally, what the oid type
> should be, and how it should be embedded in the object store
> - need to re-check the specs.
> 
> I'm tending toward distinct types to cope with the D/F conflict
> in the worktrees - the directory must be created (holds the
> name etc), and the alter-ego content then must be placed in a
> _known_ sub-file ".gitNarrowTree" (without the oid in the file
> name, but included in the content). Presence of a ".gitNarrowTree"
> should be standalone in the directory when that part of the
> work-tree is clean.

I'm not sure I follow this, but it is outside of my scope
for partial clone/fetch, so I'd rather not dive too deep on
this here.  If we really want a version of "git log" that
respects sparse-checkout boundaries, we should start a new
thread.  Thanks.

> 
>>
>>>
>>> 5. Showing diff.
>>> Preconditions: #3 is followed
>>> Actions:
>>> git diff HEAD^ HEAD
>>> Expected results: changes from the most recent commit affecting
>>> src/bar folder are shown.
>>> Notes: this can be tricky operation as filtering must be done to
>>> remove results from unrelated subtrees.
>>
>> I don't have any plan for this and I don't think it fits within
>> the scope of clone/fetch.  I think this too would be a sparse-checkout
>> feature.
>>
> 
> See my note about first class citizens for marker OIDs
> 
>>
>>>
>>> *Note that I intentionally didn't mention use cases that are related
>>> to filtering by blob size as I think we should logically consider them
>>> as a separate, although related, feature.
>>
>> I've grouped blob-size and sparse filter together for the
>> purposes of clone/fetch since the basic mechanisms (filtering,
>> transport, and missing object handling) are the same for both.
>> They do lead to different end-uses, but that is above my level
>> here.
>>
>>
>>>
>>> What do you think about these examples above? Is that something that
>>> more-or-less fits into current development? Are there other important
>>> flows that I've missed?
>>
>> These are all good ideas and it is good to have someone else who
>> wants to use partial+sparse thinking about it and looking for gaps
>> as we try to make a complete end-to-end feature.
>>>
>>> -Vitaly
>>
>> Thanks
>> Jeff
>>
> 
> Philip
> 

Thanks!
Jeff


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-01 18:24           ` Jonathan Nieder
@ 2017-12-04 15:53             ` Jeff Hostetler
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler @ 2017-12-04 15:53 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Philip Oakley, Vitaly Arbuzov, Git List



On 12/1/2017 1:24 PM, Jonathan Nieder wrote:
> Jeff Hostetler wrote:
>> On 11/30/2017 6:43 PM, Philip Oakley wrote:
> 
>>> The 'companies' problem is that it tends to force a client-server, always-on
>>> on-line mentality. I'm also wanting the original DVCS off-line capability to
>>> still be available, with _user_ control, in a generic sense, of what they
>>> have locally available (including files/directories they have not yet looked
>>> at, but expect to have). IIUC Jeff's work is that on-line view, without the
>>> off-line capability.
>>>
>>> I'd commented early in the series at [1,2,3].
>>
>> Yes, this does tend to lead towards an always-online mentality.
>> However, there are 2 parts:
>> [a] dynamic object fetching for missing objects, such as during a
>>      random command like diff or blame or merge.  We need this
>>      regardless of usage -- because we can't always predict (or
>>      dry-run) every command the user might run in advance.
>> [b] batch fetch mode, such as using partial-fetch to match your
>>      sparse-checkout so that you always have the blobs of interest
>>      to you.  And assuming you don't wander outside of this subset
>>      of the tree, you should be able to work offline as usual.
>> If you can work within the confines of [b], you wouldn't need to
>> always be online.
> 
> Just to amplify this: for our internal use we care a lot about
> disconnected usage working.  So it is not like we have forgotten about
> this use case.
> 
>> We might also add a part [c] with explicit commands to back-fill or
>> alter your incomplete view of the ODB
> 
> Agreed, this will be a nice thing to add.
> 
> [...]
>>> At its core, my idea was to use the object store to hold markers for the
>>> 'not yet fetched' objects (mainly trees and blobs). These would be in a
>>> known fixed format, and have the same effect (conceptually) as the
>>> sub-module markers - they _confirm_ the oid, yet say 'not here, try
>>> elsewhere'.
>>
>> We do have something like this.  Jonathan can explain better than I, but
>> basically, we denote possibly incomplete packfiles from partial clones
>> and fetches as "promisor" and have special rules in the code to assert
>> that a missing blob referenced from a "promisor" packfile is OK and can
>> be fetched later if necessary from the "promising" remote.
>>
>> The main problem with markers or other lists of missing objects is
>> that it has scale problems for large repos.
> 
> Any chance that we can get a design doc in Documentation/technical/
> giving an overview of the design, with a brief "alternatives
> considered" section describing this kind of thing?

Yeah, I'll start one.  I have notes within the individual protocol
docs and man-pages, but no summary doc.  Thanks!

> 
> E.g. some of the earlier descriptions like
>   https://public-inbox.org/git/20170915134343.3814dc38@twelve2.svl.corp.google.com/
>   https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/
>   https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/
> may help as a starting point.
> 
> Thanks,
> Jonathan
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-02 18:24           ` Philip Oakley
@ 2017-12-05 19:14             ` Jeff Hostetler
  2017-12-05 20:07               ` Jonathan Nieder
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff Hostetler @ 2017-12-05 19:14 UTC (permalink / raw)
  To: Philip Oakley, Vitaly Arbuzov; +Cc: Git List



On 12/2/2017 1:24 PM, Philip Oakley wrote:
> From: "Jeff Hostetler" <git@jeffhostetler.com>
> Sent: Friday, December 01, 2017 5:23 PM
>> On 11/30/2017 6:43 PM, Philip Oakley wrote:
[...]
>>
>> Discussing this feature in the context of the defense industry
>> makes me a little nervous.  (I used to be in that area.)
> 
> I'm viewing the desire for codebase partitioning from a soft layering
> of risk view (perhaps a more UK than USA approach ;-)

I'm not sure I know what this means or how the UK defense
security models/policy/procedures are different from the US,
so I can't say much here.  I'm just thinking that even if we
get a *perfectly working* partial clone/fetch/push/etc. that
it would not pass a security audit.  I might be wrong here
(and I'm no expert on the subject), but I think they would
push you towards a different solution architecture.


> 
>> What we have in the code so far may be a nice start, but
>> probably doesn't have the assurances that you would need
>> for actual deployment.  But it's a start....
> 
> True. I need to get some of my collegues more engaged...
>>
[...]
>> Yes, this does tend to lead towards an always-online mentality.
>> However, there are 2 parts:
>> [a] dynamic object fetching for missing objects, such as during a
>>     random command like diff or blame or merge.  We need this
>>     regardless of usage -- because we can't always predict (or
>>     dry-run) every command the user might run in advance.
> 
> Making something "useful" happen here when off-line is an obvious goal.
> 
>> [b] batch fetch mode, such as using partial-fetch to match your
>>     sparse-checkout so that you always have the blobs of interest
>>     to you.  And assuming you don't wander outside of this subset
>>     of the tree, you should be able to work offline as usual.
>> If you can work within the confines of [b], you wouldn't need to
>> always be online.
> 
> I feel this is the area that does need to ensure a capability to avoid
> any perception of the much maligned 'Embrace, extend, and extinguish' by accidental lockout.
> 
> I don't think this should be viewed as a type of sparse checkout -
> it's just a checkout of what you have (under the hood it could use
> the same code though).

Right, I'm only thinking of this effort as a way to get a partial
clone and fetch that omits unneeded (or, not immediately needed)
objects for performance reasons.  There are several use scenarios
that I've discussed and sparse-checkout is one of them, but I do
not consider this to be a sparse-checkout feature.

  
[...]
>>
>> The main problem with markers or other lists of missing objects is
>> that it has scale problems for large repos.  Suppose I have 100M
>> blobs in my repo.  If I do a blob:none clone, I'd have 100M missing
>> blobs that would need tracking.  If I then do a batch fetch of the
>> blobs needed to do a sparse checkout of HEAD, I'd have to remove
>> those entries from the tracking data.  Not impossible, but not
>> speedy either.
> 
> ** Ahhh. I see. That's a consequence of having all the trees isn't it. **
> 
> I've always thought that limiting the trees is at the heart of the Narrow clone/fetch problem.
> 
> OK so if you have flat, wide structures with 10k files/directories per tree then it's still a fair sized problem, but it should *scale logarithmically* for the part of the tree structure that's not being downloaded.
> 
> You never have to add a marker for a blob that you have no containing tree for. Nor for the tree that contained the blob's tree, all the way up the primary line of descent to the tree of concern. All those trees are never downloaded; there are few markers (.gitNarrowTree files) for those tree stubs, certainly no 100M missing blob markers.

Currently, the code only omits blobs.  I want to extend the current
code to have filters that also exclude unneeded trees.  That will help
address some of these size concerns, but there are still perf issues
here.
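
(Hypothetical spelling, just to show the shape such an extension might
take -- nothing like this exists in the current patches:)

    git clone --filter=tree:0 URL    # omit all trees and blobs, faulting
                                     # them in on demand later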


>>> * Marking of 'missing' objects in the local object store, and on the wire.
>>> The missing objects are replaced by a placeholder object, which uses the
>>> same oid/sha1, but has a short fixed length, with content “GitNarrowObject
>>> <oid>”. The chance that that string would actually have such an oid clash is
>>> the same as all other object hashes, so is a *safe* self-referential device.
>>
>> Again, there is a scale problem here.  If I have 100M missing blobs,
>> I can't afford to create 100M loose place holder files.  Or juggle
>> a 2GB file of missing objects on various operations.
> 
> As above, I'm also trimming the trees, so in general, there would be no missing  blobs, just the content of the directory one was interested in.
> 
> That's not quite true if higher level trees have blob references in them that are otherwise unwanted - they may each need a marker. [Or maybe a special single 'tree-of-blobs' marker for them all thus only one marker per tree - over-thinking maybe...]

Also omitting certain trees means you now (obviously) have both missing
trees and blobs.  And both need to be dynamically or batch fetched as
needed.  And certain operations will need multiple round trips to fully
resolve -- fault in a tree and then fault in blobs referenced by it.

And right, you still need to be able to refer to trees that have *some*
of their children missing.  It's not a clean tree-only boundary.

So, given all that, any set of markers would be incomplete and/or would
need to be aggressively updated to be correct.  What we have now in
Jonathan's "promisor" code allows us to infer at object-lookup time
that any missing object (from a tree-to-child or commit-to-tree reference)
is expected and can be resolved.  And this doesn't require any markers
or additional on-disk lists of SHAs or packfile format changes.
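
For example (assuming the rev-list support that goes along with that
series), you can enumerate the expected-but-absent objects on demand
rather than keeping a separate list of them:

    git rev-list --objects --missing=print HEAD | grep '^?'
    # each '?<oid>' line is an object referenced from a promisor
    # packfile that has not been downloaded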


[...]
>>> * The stored object already includes length (and inferred type), so we do
>>> know what it stands in for. Thus the local index (index file) should be able
>>> to be recreated from the object store alone (including the ‘promised /
>>> narrow / missing’ files/directory markers)

The packfile only contains the objects it contains.  The IDX file
is an index of that.  Neither know of objects (or sizes of objects)
that they don't have.  The have child references (tree to contained
blob), but those are just dangling -- and may be in a different packfile.
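
For instance, listing what a pack actually knows about (plain git,
nothing new here):

    git verify-pack -v .git/objects/pack/pack-*.idx | head
    # only objects physically present in that pack are listed, with
    # their sizes; references to objects stored elsewhere are dangling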

[...]
>>> As a safety it could be that the .gitNarrowIgnore is sent with the pack so
>>> that fold know what they missed, and fsck could check that they are locally
>>> not narrower than some specific project .gitNarrowIgnore spec.
>>
>> Currently, we store the filter_spec used for the partial clone (or the
>> first partial fetch) as a default for subsequent fetches, but we limit
>> it there.  That is, for operations like checkout or blame or whatever,
>> it doesn't matter why a blob is missing or what filter criteria was
>> used to cause it to be omitted -- just that it is.  Any time we need
>> a missing object, we have to go get it -- whether that is via dynamic
>> or bulk fetching.
> 
> Deciding *if* we have to get it, while still being 'useful', is part of the question I raised above. In my world view, we already have the interesting blobs, so we shouldn't need to get anything. Diffs, blames, and checkouts simply go with the stub values and everything is cushty.

That is not possible in general.  Suppose I have all of the trees and blobs
for my "cone" of the source tree (say "Documentation/") and I only plan to
make edits with in that cone.  I can do sparse-checkout and commits and all
is well.  Suppose I want to merge my work with Alice and Bob.  I can pull
their branches and I can merge any edits they also made in my cone of the
tree and all is well.  *BUT* if they both edited a file that is outside of
my cone, my git-merge has to file-merge the contents of the 3 versions (the
ancestor, Alice's, and Bob's) of the file.  I don't have the blobs for them
because I only got blobs for my cone of the tree.  Likewise, I also may not
have the 3 containing tree nodes.

So, I either need dynamic object fetching -or- I need a dry-run mode to
predict the missing objects.
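
A sketch of that situation (hypothetical branch names), with the cone
limited to Documentation/:

    git fetch origin alice bob
    git checkout -b merge-test origin/alice
    git merge origin/bob
    # if both branches changed, say, Makefile, the file-level merge needs
    # the ancestor, alice, and bob versions of that blob -- none of which
    # were downloaded by the Documentation/-only filter, so they must be
    # fetched dynamically (or predicted up front)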

>>
[...]
>>>
>>> I believe that its all doable, and that Jeff H's work already puts much of
>>> it in place, or touches those places
>>>
>>> That said, it has been just _thinking_, without sufficient time to delve
>>> into the code.
>>>
>>> Phil
>> [...]
>>
>> Thanks
>> Jeff
>>
> 
> Thanks for the great work.
> 
> Philip

Thanks for the comments,
Jeff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-05 19:14             ` Jeff Hostetler
@ 2017-12-05 20:07               ` Jonathan Nieder
  0 siblings, 0 replies; 26+ messages in thread
From: Jonathan Nieder @ 2017-12-05 20:07 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: Philip Oakley, Vitaly Arbuzov, Git List

Hi,

Jeff Hostetler wrote:
> On 12/2/2017 1:24 PM, Philip Oakley wrote:
>> From: "Jeff Hostetler" <git@jeffhostetler.com>
>> Sent: Friday, December 01, 2017 5:23 PM

>>> Discussing this feature in the context of the defense industry
>>> makes me a little nervous.  (I used to be in that area.)
>>
>> I'm viewing the desire for codebase partitioning from a soft layering
>> of risk view (perhaps a more UK than USA approach ;-)
>
> I'm not sure I know what this means or how the UK defense
> security models/policy/procedures are different from the US,
> so I can't say much here.  I'm just thinking that even if we
> get a *perfectly working* partial clone/fetch/push/etc. that
> it would not pass a security audit.  I might be wrong here
> (and I'm no expert on the subject), but I think they would
> push you towards a different solution architecture.

I'm pretty ignorant about the defense industry, but a few more
comments:

- gitolite implements some features on top of git's server code that I
  consider to be important for security.  So much so that I've been
  considering what it would take to remove the git-shell command from
  git.git and move it to the gitolite project where people would be
  better equipped to use it in an appropriate context.

- in particular, git's reachability checking code could use some
  hardening/improvement.  In particular, think of edge cases like
  where someone pushes a pack with deltas referring to objects they
  should not be able to reach.

- Anyone willing to audit git code's security wins my approval.
  Please, please, audit git code and report the issues you find. :)

[...]
> Also omitting certain trees means you now (obviously) have both missing
> trees and blobs.  And both need to be dynamically or batch fetched as
> needed.  And certain operations will need multiple round trips to fully
> resolve -- fault in a tree and then fault in blobs referenced by it.

For omitting trees, we will need to modify the index format, since the
index has entries for all paths today.  That's on the roadmap but has
not been implemented yet.
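
For example, even in a sparse checkout today the index still carries an
entry for every tracked path; the skipped ones just have the
skip-worktree bit set:

    git ls-files | wc -l                  # counts all tracked paths, in or out of the cone
    git ls-files -v | grep '^S' | head    # 'S' marks skip-worktree (not checked out) entries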

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: How hard would it be to implement sparse fetching/pulling?
  2017-12-04 15:36                 ` Jeff Hostetler
@ 2017-12-05 23:46                   ` Philip Oakley
  0 siblings, 0 replies; 26+ messages in thread
From: Philip Oakley @ 2017-12-05 23:46 UTC (permalink / raw)
  To: Vitaly Arbuzov, Jeff Hostetler; +Cc: Git List

From: "Jeff Hostetler" <git@jeffhostetler.com>
Sent: Monday, December 04, 2017 3:36 PM
>
> On 12/2/2017 11:30 AM, Philip Oakley wrote:
>> From: "Jeff Hostetler" <git@jeffhostetler.com>
>> Sent: Friday, December 01, 2017 2:30 PM
>>> On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote:
>>>> I think it would be great if we high level agree on desired user
>>>> experience, so let me put a few possible use cases here.
>>>>
>>>> 1. Init and fetch into a new repo with a sparse list.
>>>> Preconditions: origin blah exists and has a lot of folders inside of
>>>> src including "bar".
>>>> Actions:
>>>> git init foo && cd foo
>>>> git config core.sparseAll true # New flag to activate all sparse
>>>> operations by default so you don't need to pass options to each
>>>> command.
>>>> echo "src/bar" > .git/info/sparse-checkout
>>>> git remote add origin blah
>>>> git pull origin master
>>>> Expected results: foo contains src/bar folder and nothing else,
>>>> objects that are unrelated to this tree are not fetched.
>>>> Notes: This should work same when fetch/merge/checkout operations are
>>>> used in the right order.
>>>
>>> With the current patches (parts 1,2,3) we can pass a blob-ish
>>> to the server during a clone that refers to a sparse-checkout
>>> specification.
>>
>> I hadn't appreciated this capability. I see it as important, and should 
>> be available both ways, so that a .gitNarrow spec can be imposed from the 
>> server side, as well as by the requester.
>>
>> It could also be used to assist in the 'precious/secret' blob problem, so 
>> that AWS keys are never pushed, nor available for fetching!
>
> To be honest, I've always considered partial clone/fetch as
> a client-side request as a performance feature to minimize
> download times and disk space requirements on the client.

Mine was a two-way view where one side or the other specified an extent for the
narrow clone to achieve either the speed/space improvement or partitioning 
capability.

> I've not thought of it from the "server has secrets" point
> of view.

My potential for "secrets" was a little softer than some of the 'hard'
security that is often discussed. I'm for the layered risk approach (Swiss
cheese model).
>
> We can talk about it, but I'd like to keep it outside the
> scope of the current effort.

Agreed.

>  My concerns are that that is
> not the appropriate mechanism to enforce MAC/DAC like security
> mechanisms.  For example:
> [a] The client will still receive the containing trees that
>     refer to the sensitive blobs, so the user can tell when
>     the secret blobs change -- they wouldn't have either blob,
>     but can tell when they are changed.  This event by itself
>     may or may not leak sensitive information depending on the
>     terms of the security policy in place.
> [b] The existence of such missing blobs would tell the client
>     which blobs are significant and secret and allow them to
>     focus their attack.  It would be better if those assets
>     were completely hidden and not in the tree at all.
> [c] The client could push a fake secret blob to replace the
>     valid one on the server.  You would have to audit the
>     server to ensure that it never accepts a push containing
>     a change to any secret blob.  And the server would need
>     an infrastructure to know about all secrets in the tree.
> [d] When a secret blob does change, any local merges by the
>     user lack information to complete the merge -- they can't
>     merge the secrets and they can't be trusted to correctly
>     pick-ours or pick-theirs -- so their workflows are broken.
> I'm not trying to blindly spread FUD here, but it is arguments
> like these that make me suggest that the partial clone mechanism
> is not the right vehicle for such "secret" blobs.

I'm on the 'a little security is better than no security' side, but all the 
points are valid.
>
>
>>
>>> There's a bit of a chicken-n-egg problem getting
>>> things set up. So if we assume your team would create a series
>>> of "known enlistments" under version control, then you could
>>
>> s/enlistments/entitlements/ I presume?
>
> Within my org we speak of "enlistments" as subset of the tree
> that you plan to work on.  For example, you might enlist in the
> "file system" portion of the tree or in the "device drivers"
> portion.  If the Makefiles have good partitioning, you should
> only need one of the above portions to do productive work within
> a feature area.
Ah, so it's the things that have been requested by the client ("I'd like to
enlist in ..")

>
> I'm not sure what you mean by "entitlements".

It is like having the title deeds to a house - a list of things you have, or
can have. (e.g. a father saying: you can have the car on Saturday 6pm -11pm)

At the end of the day the particular lists would be the same; they guide
what is sent.

>
>>
>>> just reference one by <branch>:<path> during your clone. The
>>> server can lookup that blob and just use it.
>>>
>>> git clone --filter=sparse:oid=master:templates/bar URL
>>>
>>> And then the server will filter-out the unwanted blobs during
>>> the clone. (The current version only filters blobs; you still
>>> get full commits and trees. That will be revisited later.)
>>
>> I'm for the idea that only the in-hierarchy trees should be sent.
>> It should also be possible that the server replies that it is only 
>> sending a narrow clone, with the given (accessible?) spec.
>
> I do want to extend this to have unneeded tree filtering too.
> It is just not in this version.

Great. That's good to hear.

>
>>
>>>
>>> On the client side, the partial clone installs local config
>>> settings into the repo so that subsequent fetches default to
>>> the same filter criteria as used in the clone.
>>>
>>>
>>> I don't currently have provision to send a full sparse-checkout
>>> specification to the server during a clone or fetch. That
>>> seemed like too much to try to squeeze into the protocols.
>>> We can revisit this later if there is interest, but it wasn't
>>> critical for the initial phase.
>>>
>> Agreed. I think it should be somewhere 'visible' to the user, but could 
>> be set up by the server admin / repo maintainer if they don't have write
>> access. But there could still be the catch-22 - maybe one starts with a 
>> <commit | toptree> : <tree> pair to define an origin point (it's not as 
>> refined as a .gitNarrow spec file, but is definitive). The toptree option
>> could even allow sub-tree clones.. maybe..
>
> That's why I suggest having the sparse-checkout specifications
> be stored under version control in the tree in a known location.
> The user could be told out-of-band that "master:enlistments/*"
> contains all of the well-defined enlistment specs -- I'm not
> proposing such an area, just that a sysadmin could agree to
> lay out their tree with one.  Perhaps they have an enlistment
> that just includes that directory.  Then the user could clone
> that and look thru it -- add a new one if they need to and push
> it -- and then do a partial fetch using a different enlistment
> spec.  Again, I'm not dictating this mechanism, but just saying
> that something like the above is possible.

Ideal.

>
>>
>>>
>>>>
>>>> 2. Add a file and push changes.
>>>> Preconditions: all steps above followed.
>>>> touch src/bar/baz.txt && git add -A && git commit -m "added a file"
>>>> git push origin master
>>>> Expected results: changes are pushed to remote.
>>>
>>> I don't believe partial clone and/or partial fetch will cause
>>> any changes for push.
>>
>> I suspect that pushes could be rejected if the user 'pretends'
>> to modify files or trees outside their area. It does need the
>> user to be able to spoof part of a tree they don't have, so an
>> upstream / remote would immediately know it was a spoof but
>> locally the narrow clone doesn't have enough detail about the
>> 'bad' oid. It would be right to reject such attempts!
>
> There is nothing in the partial clone/fetch to support this.
> The server doesn't know which parts of the tree the user has
> or doesn't have.  There is nothing to prevent the user from
> creating a new file anywhere in the tree -- even if they don't
> have blobs for anything else in the surrounding directory --
> and including it in a push -- since the local "git commit"
> would see it as adding a single file with a new SHA and
> use the existing SHAs for the neighboring files.
>

OK. That's something I'll have a think about.

I was thinking that the receiving server would cross-check the received
trees against the enlistments/entitlements to confirm that the user hasn't
suggested that they have a blob/tree that neither side knows about (at least for a
server that has that requirement to do the check!). This is something 
Randall also brought up in his thread.
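
One way a server could approximate that today is a pre-receive hook
along these lines (purely a sketch: the allowed prefix is hard-coded,
branch creations/deletions are skipped, and a real deployment would
look the entitlement up per user):

    #!/bin/sh
    allowed=src/bar/
    zero=0000000000000000000000000000000000000000
    while read old new ref; do
        [ "$old" = "$zero" ] && continue    # skip branch creation
        [ "$new" = "$zero" ] && continue    # skip branch deletion
        if git diff --name-only "$old" "$new" | grep -v "^$allowed" | grep -q .; then
            echo "push to $ref touches paths outside $allowed" >&2
            exit 1
        fi
    done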

>
>>
>>>
>>>>
>>>> 3. Clone a repo with a sparse list as a filter.
>>>> Preconditions: same as for #1
>>>> Actions:
>>>> echo "src/bar" > /tmp/blah-sparse-checkout
>>>> git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be
>>>> the only command that would requires specific option key being passed.
>>>> Expected results: same as for #1 plus /tmp/blah-sparse-checkout is
>>>> copied into .git/info/sparse-checkout
>>
>> I presume clone and fetch are treated equivalently here.
>
> Yes, for the download and config setup, both clone and fetch are
> equivalent.  They can pass a filter-spec to the server and set
> the local config variables to indicate that a partial clone/fetch
> was used.

Good.
>
>>
>>>
>>> There are 2 independent concepts here: clone and checkout.
>>> Currently, there isn't any automatic linkage of the partial clone to
>>> the sparse-checkout settings, so you could do something like this:
>>>
>> I see an implicit link that clearly one cannot checkout
>> (inflate/populate) a file/directory that one does not have
>> in the object store. But that does not imply the reverse linkage.
>> The regular sparse checkout should be available independently of
>> the local clone being a narrow one.
>
> Right, I wasn't talking about changing sparse-checkout.  I was
> just pointing out that after you complete a partial-clone, you
> need to take a few steps before you can do a checkout that won't
> complain about the missing blobs -- or worse, demand load all
> of them.
>

OK.
>
>>
>>> git clone --no-checkout --filter=sparse:oid=master:templates/bar URL
>>> git cat-file ... templates/bar >.git/info/sparse-checkout
>>> git config core.sparsecheckout true
>>> git checkout ...
>>>
>>> I've been focused on the clone/fetch issues and have not looked
>>> into the automation to couple them.
>>>
>>
>> I foresee that large files and certain files need to be filterable
>> for fetch-clone, and that might not be (backward) compatible
>> with the sparse-checkout.
>
> There are several filter-spec criteria: no-blobs, no-large-blobs,
> sparse-checkout-compatible and others may be added later.  Each
> may require different post-clone or post-fetch handling.
>
> For example, you may want to configure your client to only omit
> large blobs and configure a "missing blob helper" (outside of my
> scope) that fetches them from S3 rather than from the origin
> remote.  This would not need to use sparse-checkout.
>
> I guess what I'm trying to say is that this effort is providing
> a mechanism to let the git client request object filtering and
> work with missing objects locally.

OK.
>
>>
>>>
>>>>
>>>> 4. Showing log for sparsely cloned repo.
>>>> Preconditions: #3 is followed
>>>> Actions:
>>>> git log
>>>> Expected results: recent changes that affect src/bar tree.
>>>
>>> If I understand your meaning, log would only show changes
>>> within the sparse subset of the tree. This is not on my
>>> radar for partial clone/fetch. It would be a nice feature
>>> to have, but I think it would be better to think about it
>>> from the point of view of sparse-checkout rather than clone.
>>>
>> One option maybe by making a marker for the tree/blob to
>> be a first class citizen. So the oid (and worktree file)
>> has content ".gitNarrowTree <oid>" or ".gitNarrowBlob <oid>"
>> as required (*), which is safe, and allows a consistent
>> alter-ego view of the tree contents and hence for git-log et al.
>>
>> (*) I keep flip flopping between a single object marker, and
>> distinct object markers for the types. It partly depends on
>> whether one can know in advance, locally, what the oid type
>> should be, and how it should be embedded in the object store
>> - need to re-check the specs.
>>
>> I'm tending toward distinct types to cope with the D/F conflict
>> in the worktrees - the directory must be created (holds the
>> name etc), and the alter-ego content then must be placed in a
>> _known_ sub-file ".gitNarrowTree" (without the oid in the file
>> name, but included in the content). Presence of a ".gitNarrowTree"
>> should be standalone in the directory when that part of the
>> work-tree is clean.
>
> I'm not sure I follow this, but it is outside of my scope
> for partial clone/fetch, so I'd rather not dive too deep on
> this here.  If we really want a version of "git log" that
> respects sparse-checkout boundaries, we should start a new
> thread.  Thanks.
>

Agreed.

>>
>>>
>>>>
>>>> 5. Showing diff.
>>>> Preconditions: #3 is followed
>>>> Actions:
>>>> git diff HEAD^ HEAD
>>>> Expected results: changes from the most recent commit affecting
>>>> src/bar folder are shown.
>>>> Notes: this can be tricky operation as filtering must be done to
>>>> remove results from unrelated subtrees.
>>>
>>> I don't have any plan for this and I don't think it fits within
>>> the scope of clone/fetch. I think this too would be a sparse-checkout
>>> feature.
>>>
>>
>> See my note about first class citizens for marker OIDs
>>
>>>
>>>>
>>>> *Note that I intentionally didn't mention use cases that are related
>>>> to filtering by blob size as I think we should logically consider them
>>>> as a separate, although related, feature.
>>>
>>> I've grouped blob-size and sparse filter together for the
>>> purposes of clone/fetch since the basic mechanisms (filtering,
>>> transport, and missing object handling) are the same for both.
>>> They do lead to different end-uses, but that is above my level
>>> here.
>>>
>>>
>>>>
>>>> What do you think about these examples above? Is that something that
>>>> more-or-less fits into current development? Are there other important
>>>> flows that I've missed?
>>>
>>> These are all good ideas and it is good to have someone else who
>>> wants to use partial+sparse thinking about it and looking for gaps
>>> as we try to make a complete end-to-end feature.
>>>>
>>>> -Vitaly
>>>
>>> Thanks
>>> Jeff
>>>
>>
>> Philip
>>
>
> Thanks!
> Jeff
>
--
Philip 


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-12-05 23:46 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-30  3:16 How hard would it be to implement sparse fetching/pulling? Vitaly Arbuzov
2017-11-30 14:24 ` Jeff Hostetler
2017-11-30 17:01   ` Vitaly Arbuzov
2017-11-30 17:44     ` Vitaly Arbuzov
2017-11-30 20:03       ` Jonathan Nieder
2017-12-01 16:03         ` Jeff Hostetler
2017-12-01 18:16           ` Jonathan Nieder
2017-11-30 23:43       ` Philip Oakley
2017-12-01  1:27         ` Vitaly Arbuzov
2017-12-01  1:51           ` Vitaly Arbuzov
2017-12-01  2:51             ` Jonathan Nieder
2017-12-01  3:37               ` Vitaly Arbuzov
2017-12-02 16:59               ` Philip Oakley
2017-12-01 14:30             ` Jeff Hostetler
2017-12-02 16:30               ` Philip Oakley
2017-12-04 15:36                 ` Jeff Hostetler
2017-12-05 23:46                   ` Philip Oakley
2017-12-02 15:04           ` Philip Oakley
2017-12-01 17:23         ` Jeff Hostetler
2017-12-01 18:24           ` Jonathan Nieder
2017-12-04 15:53             ` Jeff Hostetler
2017-12-02 18:24           ` Philip Oakley
2017-12-05 19:14             ` Jeff Hostetler
2017-12-05 20:07               ` Jonathan Nieder
2017-12-01 15:28       ` Jeff Hostetler
2017-12-01 14:50     ` Jeff Hostetler

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).