* RFC: Design and code of partial clones (now, missing commits and trees OK)
From: Jonathan Tan @ 2017-09-15 20:43 UTC
  To: git; +Cc: peartben, Christian Couder, git

For those interested in partial clones and/or missing objects in repos,
I've updated my original partialclone patches to not require an explicit
list of promises. Fetch/clone still only permits exclusion of blobs, but
the infrastructure is there for a local repo to support missing trees
and commits as well.

They can be found here:

https://github.com/jonathantanmy/git/tree/partialclone2

To make the number of patches more manageable, I have omitted support
for a custom fetching hook (but it can be readded in fetch-object.c),
and only support promisor packfiles for now (but I don't have any
objection to supporting promisor loose objects in the future).

Let me know what you think of the overall approach. In particular, I'm
still wondering if there is a better approach than to toggle
"fetch_if_missing" whenever we need lazy fetching (or need to suppress
it).

Also, if there are any patches that you think might be useful to others, let
me know and I'll send them to this mailing list for review.

A demo and an overview of the design (also available from that
repository's README):

Demo
====

Obtain a repository.

    $ make prefix=$HOME/local install
    $ cd $HOME/tmp
    $ git clone https://github.com/git/git

Make it advertise the new feature and allow requests for arbitrary blobs.

    $ git -C git config uploadpack.advertiseblobmaxbytes 1
    $ git -C git config uploadpack.allowanysha1inwant 1

Perform the partial clone and check that it is indeed smaller. Specify
"file://" in order to test the partial clone mechanism. (If not, Git will
perform a local clone, which unselectively copies every object.)

    $ git clone --blob-max-bytes=100000 "file://$(pwd)/git" git2
    $ git clone "file://$(pwd)/git" git3
    $ du -sh git2 git3
    116M	git2
    129M	git3

Observe that the new repo is automatically configured to fetch missing objects
from the original repo. Subsequent fetches will also be partial.

    $ cat git2/.git/config
    [core]
    	repositoryformatversion = 1
    	filemode = true
    	bare = false
    	logallrefupdates = true
    [remote "origin"]
    	url = [snip]
    	fetch = +refs/heads/*:refs/remotes/origin/*
    	blobmaxbytes = 100000
    [extensions]
    	partialclone = origin
    [branch "master"]
    	remote = origin
    	merge = refs/heads/master

Unlike in an older version of this code (see the `partialclone` branch), this
also works with the HTTP/HTTPS protocols.

Design
======

Local repository layout
-----------------------

A repository declares its dependence on a *promisor remote* (a remote that
declares that it can serve certain objects when requested) by a repository
extension "partialclone". `extensions.partialclone` must be set to the name of
the remote ("origin" in the demo above).

A packfile can be annotated as originating from the promisor remote by the
existence of a "(packfile name).promisor" file with arbitrary contents (similar
to the ".keep" file). Whenever a promisor remote sends an object, it declares
that it can serve every object directly or indirectly referenced by the sent
object.

A promisor packfile is a packfile annotated with the ".promisor" file. A
promisor object is an object in a promisor packfile. A promised object is an
object directly referenced by a promisor object.

(In the future, we might need to add ".promisor" support to loose objects.)
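
For example, a packfile becomes a promisor packfile simply by the marker's
presence (a sketch; the pack name here is hypothetical):

    $ touch .git/objects/pack/pack-abcd1234.promisor   # contents are arbitrary
    $ ls .git/objects/pack/pack-abcd1234.*
    .git/objects/pack/pack-abcd1234.idx
    .git/objects/pack/pack-abcd1234.pack
    .git/objects/pack/pack-abcd1234.promisor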

Connectivity check and gc
-------------------------

The object walk done by the connectivity check (as used by fsck and fetch) stops
at all promisor objects and promised objects.

The object walk done by gc also stops at all promisor objects and promised
objects. Only non-promisor packfiles are deleted (if pack deletion is
requested); promisor packfiles are left alone. This maintains the distinction
between promisor packfiles and non-promisor packfiles. (In the future, we might
need to do something more sophisticated with promisor packfiles.)
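
In the demo repository above, for example, both commands are expected to
complete without complaining about the omitted blobs (a sketch of the
intended behavior):

    $ git -C git2 fsck   # object walk stops at promisor and promised objects
    $ git -C git2 gc     # promisor packfiles are kept; only others are repacked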

Fetching of promised objects
----------------------------

When `sha1_object_info_extended()` (or similar) is invoked, it will
automatically attempt to fetch a missing object from the promisor remote if that
object is not in the local repository. For efficiency, no check is made as to
whether that object is a promised object or not.
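
For example, asking for an omitted blob triggers a lazy fetch (a sketch;
<blob-oid> is a placeholder for the ID of any blob the size filter omitted):

    $ git -C git2 cat-file -p <blob-oid>   # fetched on demand from origin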

This automatic fetching can be toggled on and off by the `fetch_if_missing`
global variable, and it is on by default.

The actual fetch is done through the fetch-pack/upload-pack protocol. Right now,
this uses the fact that upload-pack allows blob and tree "want"s, and this
incurs the overhead of the unnecessary ref advertisement. I hope that protocol
v2 will allow us to declare that blob and tree "want"s are allowed, and allow
the client to declare that it does not want the ref advertisement. All packfiles
downloaded in this way are annotated with ".promisor".

Fetching with `git fetch`
-------------------------

The fetch-pack/upload-pack protocol has also been extended to support omission
of blobs above a certain size. The client only allows this when fetching from
the promisor remote, and will annotate any packs received from the promisor
remote with ".promisor".



* Re: RFC: Design and code of partial clones (now, missing commits and trees OK)
From: Junio C Hamano @ 2017-09-19  5:51 UTC
  To: Jonathan Tan; +Cc: git, peartben, Christian Couder, git

Jonathan Tan <jonathantanmy@google.com> writes:

> For those interested in partial clones and/or missing objects in repos,
> I've updated my original partialclone patches to not require an explicit
> list of promises. Fetch/clone still only permits exclusion of blobs, but
> the infrastructure is there for a local repo to support missing trees
> and commits as well.
> ...
> Demo
> ====
>
> Obtain a repository.
>
>     $ make prefix=$HOME/local install
>     $ cd $HOME/tmp
>     $ git clone https://github.com/git/git
>
> Make it advertise the new feature and allow requests for arbitrary blobs.
>
>     $ git -C git config uploadpack.advertiseblobmaxbytes 1
>     $ git -C git config uploadpack.allowanysha1inwant 1
>
> Perform the partial clone and check that it is indeed smaller. Specify
> "file://" in order to test the partial clone mechanism. (If not, Git will
> perform a local clone, which unselectively copies every object.)
>
>     $ git clone --blob-max-bytes=100000 "file://$(pwd)/git" git2
>     $ git clone "file://$(pwd)/git" git3
>     $ du -sh git2 git3
>     116M	git2
>     129M	git3
>
> Observe that the new repo is automatically configured to fetch missing objects
> from the original repo. Subsequent fetches will also be partial.
>
>     $ cat git2/.git/config
>     [core]
>     	repositoryformatversion = 1
>     	filemode = true
>     	bare = false
>     	logallrefupdates = true
>     [remote "origin"]
>     	url = [snip]
>     	fetch = +refs/heads/*:refs/remotes/origin/*
>     	blobmaxbytes = 100000
>     [extensions]
>     	partialclone = origin
>     [branch "master"]
>     	remote = origin
>     	merge = refs/heads/master

The above sequence of events makes quite a lot of sense.  And the
following description of how it is designed (snipped) is clear
enough (at least to me) to allow me to say that I quite like it.




* Re: RFC: Design and code of partial clones (now, missing commits and trees OK)
From: Jeff Hostetler @ 2017-09-21 17:57 UTC
  To: Jonathan Tan, git; +Cc: peartben, Christian Couder


There's a lot in this patch series.  I'm still studying it, but here
are some notes and questions.  I'll start with direct responses to
the RFC here and follow up in a second email with specific questions
and comments (to keep this from being too long).


On 9/15/2017 4:43 PM, Jonathan Tan wrote:
> For those interested in partial clones and/or missing objects in repos,
> I've updated my original partialclone patches to not require an explicit
> list of promises. Fetch/clone still only permits exclusion of blobs, but
> the infrastructure is there for a local repo to support missing trees
> and commits as well.
> 
> They can be found here:
> 
> https://github.com/jonathantanmy/git/tree/partialclone2
> 
> To make the number of patches more manageable, I have omitted support
> for a custom fetching hook (but it can be readded in fetch-object.c),
> and only support promisor packfiles for now (but I don't have any
> objection to supporting promisor loose objects in the future).
> 
> Let me know what you think of the overall approach. In particular, I'm
> still wondering if there is a better approach than to toggle
> "fetch_if_missing" whenever we need lazy fetching (or need to suppress
> it).
> 
> Also, if there are any patches that you think might be useful to others, let
> me know and I'll send them to this mailing list for review.
> 
> A demo and an overview of the design (also available from that
> repository's README):
> 
> Demo
> ====
> 
> Obtain a repository.
> 
>      $ make prefix=$HOME/local install
>      $ cd $HOME/tmp
>      $ git clone https://github.com/git/git
> 
> Make it advertise the new feature and allow requests for arbitrary blobs.
> 
>      $ git -C git config uploadpack.advertiseblobmaxbytes 1
>      $ git -C git config uploadpack.allowanysha1inwant 1
> 
> Perform the partial clone and check that it is indeed smaller. Specify
> "file://" in order to test the partial clone mechanism. (If not, Git will
> perform a local clone, which unselectively copies every object.)
> 
>      $ git clone --blob-max-bytes=100000 "file://$(pwd)/git" git2
>      $ git clone "file://$(pwd)/git" git3
>      $ du -sh git2 git3
>      116M	git2
>      129M	git3
> 
> Observe that the new repo is automatically configured to fetch missing objects
> from the original repo. Subsequent fetches will also be partial.
> 
>      $ cat git2/.git/config
>      [core]
>      	repositoryformatversion = 1
>      	filemode = true
>      	bare = false
>      	logallrefupdates = true
>      [remote "origin"]
>      	url = [snip]
>      	fetch = +refs/heads/*:refs/remotes/origin/*
>      	blobmaxbytes = 100000
>      [extensions]
>      	partialclone = origin
>      [branch "master"]
>      	remote = origin
>      	merge = refs/heads/master
> 

I like that git-clone saves the partial clone settings in the
.git/config.  This should make it easier for subsequent commands to
default to the right settings.

Do we allow a partial-fetch following a regular clone (such that git-fetch
would create these config file settings)?  This seems like a reasonable
upgrade path for a user with an existing repo to take advantage of partial
fetches going forward.

Or do we require that git-fetch only be allowed to do partial-fetches
after a partial-clone (so that only clone creates these settings) and
fetch always does partial-fetches thereafter?  It might be useful to
allow fetch to do a full fetch on top of a partial-clone, but there
are probably thin-pack issues to work out first.

Also, there is an assumption here that the user will want to keep
using the same filtering settings on subsequent fetches.  That's
probably fine for now and until we get a chance to try it out for
real.


> Unlike in an older version of this code (see the `partialclone` branch), this
> also works with the HTTP/HTTPS protocols.
> 
> Design
> ======
> 
> Local repository layout
> -----------------------
> 
> A repository declares its dependence on a *promisor remote* (a remote that
> declares that it can serve certain objects when requested) by a repository
> extension "partialclone". `extensions.partialclone` must be set to the name of
> the remote ("origin" in the demo above).
> 

Do we allow EXACTLY ONE promisor-remote?  That is, can we later add
another remote as a promisor-remote?  And then do partial-fetches from
either?  Do we need to disallow removing or altering a remote that is
listed as a promisor-remote?

I think for now, one promisor-remote is fine.  Just asking.

Changing a remote's URL might be fine, but deleting the promisor-remote
would put the user in a weird state.  We don't need to worry about it
now though.


> A packfile can be annotated as originating from the promisor remote by the
> existence of a "(packfile name).promisor" file with arbitrary contents (similar
> to the ".keep" file). Whenever a promisor remote sends an object, it declares
> that it can serve every object directly or indirectly referenced by the sent
> object.
> 
> A promisor packfile is a packfile annotated with the ".promisor" file. A
> promisor object is an object in a promisor packfile. A promised object is an
> object directly referenced by a promisor object.
> 

I struggled with the terms here a little when looking at the source.
() A remote responding to a partial-clone is termed a "promisor-remote".
() Packfiles received from a promisor-remote are marked with "<name>.promisor"
    like "<name>.keep" names.
() An object actually contained in such packfiles is called a "promisor-object".
() An object not-present but referenced by one of the above promisor-objects
    is called a "promised-object" (aka a "missing-object").

I think the similarity of the words "promisOR" and "promisED" threw me here
and as I was looking at the code.  The code in is_promised() [1] looked like
it was adding all promisor- and promised-objects to the "promised" OIDSET,
but I could be mistaken.

[1] https://github.com/jonathantanmy/git/commit/7a9c2d9b6e2fce293817b595dee29a7eede0dddd#diff-5d5d5dc185ef37dc30bb7d9a7ae0c4e8R1960


The contents of the "<name>.promisor" file are described as arbitrary?
Should we write the name of the remote (or some other structured data)
into this file so that later fetches can find the server?  This is more
a question for when we have multiple promisor-remotes and may need to
decide who can supply a missing object.  Not urgent now though.
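
Something like this, purely as a sketch of one possible structured format
(hypothetical file name and keys):

    # .git/objects/pack/pack-abcd1234.promisor
    remote = origin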


> (In the future, we might need to add ".promisor" support to loose objects.)
>

I assume you mean that a dynamic fetch of a single tree object would be
unpacked upon receipt (rather than creating a one object packfile) and
that we may need to mark it as promisor-object so that missing objects
that *IT* references would still follow the promised-object rules.


> Connectivity check and gc
> -------------------------
> 
> The object walk done by the connectivity check (as used by fsck and fetch) stops
> at all promisor objects and promised objects.
>

So is the assumption that as soon as you touch a promisOR-object you might
as well stop scanning, because anything it references might be missing?


The code in rev-list.c and revision.c in [2] looks like it will continue
thru PROMISOR-objects and stop at (still-missing) PROMISED-objects.

[2] https://github.com/jonathantanmy/git/commit/2d7ae2bc780dd056552643e4f5061a0ca7b5b1e5


> The object walk done by gc also stops at all promisor objects and promised
> objects. Only non-promisor packfiles are deleted (if pack deletion is
> requested); promisor packfiles are left alone. This maintains the distinction
> between promisor packfiles and non-promisor packfiles. (In the future, we might
> need to do something more sophisticated with promisor packfiles.)
>

Seems like we could combine promisor-packfiles -- so, for example, the net
result of a complete repack might be one normal- and one promisor-packfile.


Question: if the initial clone is a partial-clone, we will ONLY have
promisor-packfiles, right?  (Unless they later add a second regular remote.)
So fsck and GC will basically stop scanning immediately, right?
Likewise, repack will basically be disabled until we support repacking
promisor-packfiles, right?  So performance of the resulting repo will
deteriorate over time as subsequent partial-fetches pull in more and more
promisor-packfiles.


> Fetching of promised objects
> ----------------------------
> 
> When `sha1_object_info_extended()` (or similar) is invoked, it will
> automatically attempt to fetch a missing object from the promisor remote if that
> object is not in the local repository. For efficiency, no check is made as to
> whether that object is a promised object or not.
>

I don't understand this last statement.  A missing-object is assumed to
be a promised-object, but we don't check that.  By this do you mean that
is_promised() [1] is only used by commands like fsck and not elsewhere?


> This automatic fetching can be toggled on and off by the `fetch_if_missing`
> global variable, and it is on by default.
> 
> The actual fetch is done through the fetch-pack/upload-pack protocol. Right now,
> this uses the fact that upload-pack allows blob and tree "want"s, and this
> incurs the overhead of the unnecessary ref advertisement. I hope that protocol
> v2 will allow us to declare that blob and tree "want"s are allowed, and allow
> the client to declare that it does not want the ref advertisement. All packfiles
> downloaded in this way are annotated with ".promisor".
>

The dynamic object fetching in fetch_object() [3] only fetches 1 object from
the server.  This will spawn a transport process, network connection (and
authentication), a packfile download, and index-pack or unpack for each object.

Perhaps this is just preliminary code, but I don't see how you can adapt that
model to be efficient.  Even using a long-running background process to fetch
individual blobs, it will still be fetching blobs individually.  We do need to
have dynamic object fetching, but we should try to minimize its use; that is,
to try to have higher-level mechanisms in place to predict and bulk prefetch missing
objects.

I guess it depends on how many missing-objects you expect the client to have.
My concern here is that we're limiting the design to the "occasional" big file
problem, rather than the more general scale problem.

Also, relying on dynamic object fetching means that operations, like
checkout, may now require a network connection.  Because we don't know we
need an object until we actually need it (and are half-way thru some
operation that the user thought was safe to do off-line).

I'll talk about this more in my next response to this thread.

[3] https://github.com/jonathantanmy/git/commit/d23be53681549d9ac3c2ecb7b2be9f93d355457b#diff-5c4ccaa3282a2113fc8cafa156b30941R7


> Fetching with `git fetch`
> -------------------------
> 
> The fetch-pack/upload-pack protocol has also been extended to support omission
> of blobs above a certain size. The client only allows this when fetching from
> the promisor remote, and will annotate any packs received from the promisor
> remote with ".promisor".
> 

Thanks,
Jeff


* Re: RFC: Design and code of partial clones (now, missing commits and trees OK) (part 2/3)
From: Jeff Hostetler @ 2017-09-21 17:59 UTC
  To: Jonathan Tan, git; +Cc: peartben, Christian Couder

(part 2)

Additional overall comments on:
https://github.com/jonathantanmy/git/commits/partialclone2

{} I think it would help to split the blob-max-bytes filtering and the
    promisor/promised concepts and discuss them independently.

    {} Then we can talk about the promisor/promised functionality
       independent of any kind of filter.  The net-net is that the client
       has missing objects and it doesn't matter what filter criteria
       or mechanism caused that to happen.

    {} blob-max-bytes is but one such filter we should have.  This might
       be sufficient if the goal is to replace LFS (where you rarely ever
       need any given very very large object) and dynamically loading
       them as needed is sufficient and the network round-trip isn't
       too much of a perf penalty.

    {} But if we want to do things like "sparse-enlistments", where the
       client only needs a small part of the tree via sparse-checkout
       (for example, only populating 50,000 files from a tree of 3.5M
       files at HEAD), then we need more general filtering.

    {} And as I said above, how we choose to filter should be independent
       of how the client handles promisor/promised objects.


{} Also, if we rely strictly on dynamic object fetching to fetch missing
    objects, we are effectively tethered to the server during operations
    (such as checkout) that the user might not think about as requiring
    a network connection.  And we are forced to keep the same limitations
    of LFS in that you can't prefetch and go offline (without actually
    checking out to your worktree first).  And we can't bulk or parallel
    fetch objects.


{} I think it would also help to move the blob-max-bytes calculation out
    of pack-objects.c : add_object_entry() [1].  The current code isolates
    the computation there so that only pack-objects can do the filtering.

    Instead, put it in list-objects.c and traverse_commit_list() so that
    pack-objects and rev-list can share it (as Peff suggested [2] in
    response to my first patch series in March).

    For example, this would let the client have a pre-checkout hook, use
    rev-list to compute the set of missing objects needed for that commit,
    and pipe that to a command to BULK fetch them from the server BEFORE
    starting the actual checkout.  This would allow the savvy user to
    manually run a prefetch before going offline (see the sketch after
    the references below).

[1] https://github.com/jonathantanmy/git/commit/68e529484169f4800115c5a32e0904c25ad14bd8#diff-a8d2c9cf879e775d748056cfed48440cR1110

[2] https://public-inbox.org/git/20170309073117.g3br5btsfwntcdpe@sigill.intra.peff.net/
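
A sketch of what such a prefetch pipeline could look like; the
`--missing=print` rev-list option and feeding raw object IDs to
fetch-pack via --stdin are assumptions here, not existing behavior:

    $ # list reachable-but-absent object IDs (lines prefixed with "?")
    $ git rev-list --objects --missing=print HEAD |
          grep '^?' | cut -c2- >missing-oids
    $ # bulk-fetch them in one request (server needs allowanysha1inwant)
    $ git fetch-pack --stdin "$(git config remote.origin.url)" <missing-oids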


{} This also locks us into size-only filtering and makes it more
    difficult to add other filters, in that the add_object_entry()
    code gets called on an object after the traversal has decided
    what to do with it.  It would be difficult to add tree-trimming
    at this level, for example.


{} An early draft of this type of filtering is here [3].  I hope to push
    up a revised draft of this shortly.

[3] https://public-inbox.org/git/20170713173459.3559-1-git@jeffhostetler.com/


Thanks,
Jeff



* Re: RFC: Design and code of partial clones (now, missing commits and trees OK) (part 3)
From: Jeff Hostetler @ 2017-09-21 18:00 UTC
  To: Jonathan Tan, git; +Cc: peartben, Christian Couder

(part 3)

Additional overall comments on:
https://github.com/jonathantanmy/git/commits/partialclone2

{} WRT the code in is_promised() [1]

[1] https://github.com/jonathantanmy/git/commit/7a9c2d9b6e2fce293817b595dee29a7eede0dddd#diff-5d5d5dc185ef37dc30bb7d9a7ae0c4e8R1960

    {} it looked like it was adding ALL promisor- and promised-objects to
       the "promised" OIDSET, rather than just promised-objects, but I
       could be mistaken.

    {} Is this iterating over ALL promisor-packfiles?

    {} It looked like this was being used by fsck and rev-list.  I have
       concerns about how big this OIDSET will get and how it will scale,
       since if we start with a partial-clone all packfiles will be
       promisor-packfiles.

    {} When iterating thru a tree object, you add everything that it
       references (everything in that folder).  This adds all of the
       child OIDs -- without regard to whether they are new to this
       version of the tree object. (Granted, this is hard to compute.)

       My concern is that this will add too many objects to the OIDSET.
       That is, a new tree object (because of a recent change to something
       in that folder) will also have the OIDs of the other *unchanged*
       files which may be present in an earlier non-promisor packfile
       from an earlier commit.

       I worry that this will grow the OIDSET to essentially include
       everything.  And possibly defeating its own purpose.  I could be
       wrong here, but that's my concern.


{} I'm not opposed to the .promisor file concept, but I have concerns
    that in practice all packfiles after a partial clone will be
    promisor-packfiles and therefore short-cut during fsck, so fsck
    still won't gain anything.

    It would help if there are also non-promisor packfiles present,
    but only for objects referenced by non-promisor packfiles.

    But then I also have to wonder whether we can even support non-promisor
    packfiles after starting with a partial clone -- because of the
    properties of received thin-packs on a non-partial fetch.
    

Thanks,
Jeff



* Re: RFC: Design and code of partial clones (now, missing commits and trees OK)
From: Jonathan Tan @ 2017-09-21 22:42 UTC
  To: Jeff Hostetler; +Cc: git, peartben, Christian Couder

On Thu, 21 Sep 2017 13:57:30 -0400
Jeff Hostetler <git@jeffhostetler.com> wrote:

> There's a lot in this patch series.  I'm still studying it, but here
> are some notes and questions.  I'll start with direct responses to
> the RFC here and follow up in a second email with specific questions
> and comments (to keep this from being too long).

OK - thanks for your detailed comments.

> I like that git-clone saves the partial clone settings in the
> .git/config.  This should make it easier for subsequent commands to
> default to the right settings.
> 
> Do we allow a partial-fetch following a regular clone (such that
> git-fetch would create these config file settings)?  This seems like
> a reasonable upgrade path for a user with an existing repo to take
> advantage of partial fetches going forward.

A "git-fetch ...options... --save-options" does not sound unreasonable,
although I would think that (i) partial fetches/clones are useful on
very large repositories, and (ii) the fact that you had already cloned
this large repository shows that you can handle the disk and network
load, so partial fetch after non-partial clone doesn't seem very useful.

But if there is a use case, I think it could work. Although note that
the GC in my patch set stops at promisor objects, so if an object
originally cloned cannot be reached through a walk (with non-promisor
intermediate objects only), it might end up GC-ed (which might be fine).

> Or do we require that git-fetch only be allowed to do partial-fetches
> after a partial-clone (so that only clone creates these settings) and
> fetch always does partial-fetches thereafter?  It might be useful to
> allow fetch to do a full fetch on top of a partial-clone, but there
> are probably thin-pack issues to work out first.

About the thin-pack issue, I wonder if it is sufficient to turn on
fetch_if_missing while index-pack is trying to fix the thin pack.

If the thin-pack issues are worked out, then non-partial fetch after
partial clone seems doable (all commits from our tip to their tip are
fetched, as well as all new trees and all new blobs; any trees and blobs
still missing are already promised).

Thanks for these questions - I am concentrating on repos in which both
clone and fetch are partial, but it is good to discuss what happens if
the user does something else.

> Also, there is an assumption here that the user will want to keep
> using the same filtering settings on subsequent fetches.  That's
> probably fine for now and until we get a chance to try it out for
> real.

Yes. Having said that, the fetching of missing objects does not take
into account the filter at all, so the filter can be easily changed.
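
For example (a sketch, reusing the config key from the demo):

    $ git -C git2 config remote.origin.blobmaxbytes 1000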

> Do we allow EXACTLY ONE promisor-remote?  That is, can we later add
> another remote as a promisor-remote?  And then do partial-fetches from
> either?

Yes, exactly one (because we need to know which remote to fetch missing
objects from, and which remote to allow partial fetches from). But the
promisor remote can be switched to another.

> Do we need to disallow removing or altering a remote that is
> listed as a promisor-remote?

Perhaps, although I think that right now configs are checked when we run
the command using the config, not when we run "git config".

> I think for now, one promisor-remote is fine.  Just asking.
> 
> Changing a remote's URL might be fine, but deleting the
> promisor-remote would put the user in a weird state.  We don't need
> to worry about it now though.

Agreed.

> I struggled with the terms here a little when looking at the source.
> () A remote responding to a partial-clone is termed a "promisor-remote".
> () Packfiles received from a promisor-remote are marked with
>    "<name>.promisor" like "<name>.keep" names.
> () An object actually contained in such packfiles is called a
>    "promisor-object".
> () An object not-present but referenced by one of the above
>    promisor-objects is called a "promised-object" (aka a "missing-object").
> 
> I think the similarity of the words "promisOR" and "promisED" threw
> me here and as I was looking at the code.  The code in is_promised()
> [1] looked like it was adding all promisor- and promised-objects to
> the "promised" OIDSET, but I could be mistaken.
> 
> [1]
> https://github.com/jonathantanmy/git/commit/7a9c2d9b6e2fce293817b595dee29a7eede0dddd#diff-5d5d5dc185ef37dc30bb7d9a7ae0c4e8R1960

I was struggling a bit with the terminology, true.

Right now I'm thinking of:
 - promisor remote (as you defined)
 - promisor packfile (as you defined)
 - promisor object is an object known to belong to the promisor (whether
   because we have it in a promisor packfile or because it is referenced
   by an object in a promisor packfile)

This might eliminate "promise(d)", and thus eliminate the confusion
between "promised" and "promisor", but I haven't done an exhaustive
search.

> The contents of the "<name>.promisor" file are described as arbitrary?
> Should we write the name of the remote (or some other structured data)
> into this file so that later fetches can find the server?  This is
> more a question for when we have multiple promisor-remotes and may
> need to decide who can supply a missing object.  Not urgent now
> though.

Yeah, I don't have a use for it right now. Although it might be a good
idea to make it future-proof, by say, declaring it to be lines of
key-value pairs (with comments)...but then again, we might need to
forbid earlier tools from working on such modern repositories, so we
would need a repository extension (or something similar), and we can
upgrade the format then.

> I assume you mean that a dynamic fetch of a single tree object would
> be unpacked upon receipt (rather than creating a one object packfile)
> and that we may need to mark it as promisor-object so that missing
> objects that *IT* references would still follow the promised-object
> rules.

Yes, that's right. (And that it would not participate in GC.)

> So is the assumption that as soon as you touch a promisOR-object you
> might as well stop scanning, because anything it references might be
> missing?
> 
> 
> The code in rev-list.c and revision.c in [2] looks like it will
> continue thru PROMISOR-objects and stop at (still-missing)
> PROMISED-objects.
> 
> [2]
> https://github.com/jonathantanmy/git/commit/2d7ae2bc780dd056552643e4f5061a0ca7b5b1e5

The invocation of for_each_packed_object() in prepare_revision_walk()
should mark all objects in promisor packs as uninteresting, so the
traversal should stop there (unless I'm misunderstanding something).

> > The object walk done by gc also stops at all promisor objects and
> > promised objects. Only non-promisor packfiles are deleted (if pack
> > deletion is requested); promisor packfiles are left alone. This
> > maintains the distinction between promisor packfiles and
> > non-promisor packfiles. (In the future, we might need to do
> > something more sophisticated with promisor packfiles.)
> >
> 
> Seems like we could combine promisor-packfiles -- so, for example,
> the net result of a complete repack might be one normal- and one
> promisor-packfile.

Yes, that is a good next step. (I'm aiming for a minimum viable product
here so I didn't include that.)

> Question: if the initial clone is a partial-clone, we will ONLY have
> promisor-packfiles, right?  (Unless they later add a second regular
> remote.) So fsck and GC will basically stop scanning immediately,
> right? Likewise, repack will basically be disabled until we support
> repacking promisor-packfiles, right?  So performance of the resulting
> repo will deteriorate over time as subsequent partial-fetches pull in
> more and more promisor-packfiles.

Files locally created are non-promisor objects, and will become
non-promisor packfiles when GC/repack is run.

As for performance, you are right. We'll need to make a way to combine
promisor packfiles after this.

> > Fetching of promised objects
> > ----------------------------
> > 
> > When `sha1_object_info_extended()` (or similar) is invoked, it will
> > automatically attempt to fetch a missing object from the promisor
> > remote if that object is not in the local repository. For
> > efficiency, no check is made as to whether that object is a
> > promised object or not.
> >
> 
> I don't understand this last statement.  A missing-object is assumed
> to be a promised-object, but we don't check that.  By this do you
> mean that is_promised() [1] is only used by commands like fsck and
> not elsewhere?

Sorry, I meant not checked in sha1_object_info_extended(). Yes, fsck and
gc will check them.

> The dynamic object fetching in fetch_object() [3] only fetches 1
> object from the server.  This will spawn a transport process, network
> connection (and authentication), a packfile download, and index-pack
> or unpack for each object.
> 
> Perhaps this is just preliminary code, but I don't see how you can
> adapt that model to be efficient.  Even using a long-running
> background process to fetch individual blobs, it will still be
> fetching blobs individually.  We do need to have dynamic object
> fetching, but we should try to minimize its use; that is, to try to have
> higher-level mechanisms in place to predict and bulk prefetch missing
> objects.

My plan is to have batch fetching. We would need to extend this command
by command - for example, see my patch for batched checkout here:

https://github.com/jonathantanmy/git/commit/14f07d3f06bc3a1a2c9bca85adc8c42b230b9143

(I didn't update this for my newest patch set because I didn't figure
out a good way to deal with the possibility of missing trees, but
something like this could probably be done - for example, by only
batching blobs and, whenever we need to fetch a tree, automatically
fetching all objects to a certain depth.)

Server incompatibility is not a problem because the server already
supports multiple objects per request. (Well, except the fetching trees
to a certain depth part.)

> I guess it depends on how many missing-objects you expect the client
> to have. My concern here is that we're limiting the design to the
> "occasional" big file problem, rather than the more general scale
> problem.

Do you have a specific situation in mind?

> Also, relying on dynamic object fetching means that operations,
> like checkout, may now require a network connection.  Because we
> don't know we need an object until we actually need it (and are
> half-way thru some operation that the user thought was safe to do
> off-line).

Yes, but I think this is expected when we're dealing with partial
clones.

> Thanks,
> Jeff

Thanks.


* Re: RFC: Design and code of partial clones (now, missing commits and trees OK) (part 2/3)
From: Jonathan Tan @ 2017-09-21 22:51 UTC
  To: Jeff Hostetler; +Cc: git, peartben, Christian Couder

On Thu, 21 Sep 2017 13:59:43 -0400
Jeff Hostetler <git@jeffhostetler.com> wrote:

> (part 2)
> 
> Additional overall comments on:
> https://github.com/jonathantanmy/git/commits/partialclone2
> 
> {} I think it would help to split the blob-max-bytes filtering and the
>     promisor/promised concepts and discuss them independently.
> 
>     {} Then we can talk about the promisor/promised functionality
>        independent of any kind of filter.  The net-net is that the
>        client has missing objects and it doesn't matter what filter
>        criteria or mechanism caused that to happen.
> 
>     {} blob-max-bytes is but one such filter we should have.  This
>        might be sufficient if the goal is to replace LFS (where you
>        rarely ever need any given very very large object) and
>        dynamically loading them as needed is sufficient and the
>        network round-trip isn't too much of a perf penalty.
> 
>     {} But if we want to do things like "sparse-enlistments", where
>        the client only needs a small part of the tree via
>        sparse-checkout (for example, only populating 50,000 files from
>        a tree of 3.5M files at HEAD), then we need more general
>        filtering.
> 
>     {} And as I said above, how we choose to filter should be
>        independent of how the client handles promisor/promised objects.

I agree that they are independent. (I put everything together so that
people could see how they work together, but they can be changed
independently of each other.)

> {} Also, if we rely strictly on dynamic object fetching to fetch
> missing objects, we are effectively tethered to the server during
> operations (such as checkout) that the user might not think about as
> requiring a network connection.  And we are forced to keep the same
> limitations of LFS in that you can't prefetch and go offline (without
> actually checking out to your worktree first).  And we can't bulk or
> parallel fetch objects.

I don't think dynamic object fetching precludes any other more optimized
way of fetching or prefetching - I implemented dynamic object fetching
first so that we would have a fallback, but the others definitely can be
implemented too.

> {} I think it would also help to move the blob-max-bytes calculation
> out of pack-objects.c : add_object_entry() [1].  The current code
> isolates the computation there so that only pack-objects can do the
> filtering.
> 
>     Instead, put it in list-objects.c and traverse_commit_list() so
> that pack-objects and rev-list can share it (as Peff suggested [2] in
>     response to my first patch series in March).
> 
>     For example, this would let the client have a pre-checkout hook,
> use rev-list to compute the set of missing objects needed for that
> commit, and pipe that to a command to BULK fetch them from the server
> BEFORE starting the actual checkout.  This would allow the savvy user
> to manually run a prefetch before going offline.
> 
> [1]
> https://github.com/jonathantanmy/git/commit/68e529484169f4800115c5a32e0904c25ad14bd8#diff-a8d2c9cf879e775d748056cfed48440cR1110
> 
> [2]
> https://public-inbox.org/git/20170309073117.g3br5btsfwntcdpe@sigill.intra.peff.net/

In your specific example, how would rev-list know, on the client, to
include (or exclude) a large blob in its output if it does not have it,
and thus does not know its size?

My reason for including it in pack-objects.c is because I only needed it
there and it is much simpler, but I agree that if it can be used
elsewhere, we can put it in a more general place.

> {} This also locks us into size-only filtering and makes it more
>     difficult to add other filters, in that the add_object_entry()
>     code gets called on an object after the traversal has decided
>     what to do with it.  It would be difficult to add tree-trimming
>     at this level, for example.

That is true.

> {} An early draft of this type of filtering is here [3].  I hope to
> push up a revised draft of this shortly.
> 
> [3]
> https://public-inbox.org/git/20170713173459.3559-1-git@jeffhostetler.com/

OK - I'll take a look when that is done (I think I commented on an
earlier version of that).


* Re: RFC: Design and code of partial clones (now, missing commits and trees OK) (part 3)
From: Jonathan Tan @ 2017-09-21 23:04 UTC
  To: Jeff Hostetler; +Cc: git, peartben, Christian Couder

On Thu, 21 Sep 2017 14:00:40 -0400
Jeff Hostetler <git@jeffhostetler.com> wrote:

> (part 3)
> 
> Additional overall comments on:
> https://github.com/jonathantanmy/git/commits/partialclone2
> 
> {} WRT the code in is_promised() [1]
> 
> [1]
> https://github.com/jonathantanmy/git/commit/7a9c2d9b6e2fce293817b595dee29a7eede0dddd#diff-5d5d5dc185ef37dc30bb7d9a7ae0c4e8R1960
> 
>     {} it looked like it was adding ALL promisor- and
> promised-objects to the "promised" OIDSET, rather than just
> promised-objects, but I could be mistaken.

As far as I can tell, it is just adding the promised objects (some of
which may also be promisor objects). If you're saying that you expected
it to add the promisor objects as well, that might be a reasonable
expectation...I'm thinking of doing that.

>     {} Is this iterating over ALL promisor-packfiles?

Yes.

>     {} It looked like this was being used by fsck and rev-list.  I
> have concerns about how big this OIDSET will get and how it will
> scale, since if we start with a partial-clone all packfiles will be
>        promisor-packfiles.

It's true that scaling is an issue. I'm not sure if omitting the oidset
will solve anything, though - as it is, Git maintains an object hash and
adds to it quite liberally.

One thing that might help is some sort of flushing of objects in
promisor packfiles from the local repository - that way, the oidset
won't be so large.

> 
>     {} When iterating thru a tree object, you add everything that it
>        references (everything in that folder).  This adds all of the
>        child OIDs -- without regard to whether they are new to this
>        version of the tree object. (Granted, this is hard to compute.)

The oidset will deduplicate OIDs.

>        My concern is that this will add too many objects to the
> OIDSET. That is, a new tree object (because of a recent change to
> something in that folder) will also have the OIDs of the other
> *unchanged* files which may be present in an earlier non-promisor
> packfile from an earlier commit.
> 
>        I worry that this will grow the OIDSET to essentially include
>        everything.  And possibly defeating its own purpose.  I could
> be wrong here, but that's my concern.

Same answer as above (about flushing of objects in promisor packfiles).

> {} I'm not opposed to the .promisor file concept, but I have concerns
>     that in practice all packfiles after a partial clone will be
>     promisor-packfiles and therefore short-cut during fsck, so fsck
>     still won't gain anything.
> 
>     It would help if there are also non-promisor packfiles present,
>     but only for objects referenced by non-promisor packfiles.
> 
>     But then I also have to wonder whether we can even support
> non-promisor packfiles after starting with a partial clone -- because
> of the properties of received thin-packs on a non-partial fetch.

Same reply as to your other e-mail - locally created objects are not in
promisor packfiles. (Or were you thinking of a situation where locally
created objects are immediately uploaded to the promisor remote, thus
making them promisor objects too?)


* Re: RFC: Design and code of partial clones (now, missing commits and trees OK)
From: Jeff Hostetler @ 2017-09-22 21:02 UTC
  To: Jonathan Tan; +Cc: git, peartben, Christian Couder



On 9/21/2017 6:42 PM, Jonathan Tan wrote:
> On Thu, 21 Sep 2017 13:57:30 -0400
> Jeff Hostetler <git@jeffhostetler.com> wrote:
> 
>> There's a lot in this patch series.  I'm still studying it, but here
>> are some notes and questions.  I'll start with direct responses to
>> the RFC here and follow up in a second email with specific questions
>> and comments (to keep this from being too long).
> 
> OK - thanks for your detailed comments.
> 
>> I like that git-clone saves the partial clone settings in the
>> .git/config.  This should make it easier for subsequent commands to
>> default to the right settings.
>>
>> Do we allow a partial-fetch following a regular clone (such that
>> git-fetch would create these config file settings)?  This seems like
>> a reasonable upgrade path for a user with an existing repo to take
>> advantage of partial fetches going forward.
> 
> A "git-fetch ...options... --save-options" does not sound unreasonable,
> although I would think that (i) partial fetches/clones are useful on
> very large repositories, and (ii) the fact that you had already cloned
> this large repository shows that you can handle the disk and network
> load, so partial fetch after non-partial clone doesn't seem very useful.
> 
> But if there is a use case, I think it could work. Although note that
> the GC in my patch set stops at promisor objects, so if an object
> originally cloned cannot be reached through a walk (with non-promisor
> intermediate objects only), it might end up GC-ed (which might be fine).
> 
>> Or do we require that git-fetch only be allowed to do partial-fetches
>> after a partial-clone (so that only clone creates these settings) and
>> fetch always does partial-fetches thereafter?  It might be useful to
>> allow fetch to do a full fetch on top of a partial-clone, but there
>> are probably thin-pack issues to work out first.
> 
> About the thin-pack issue, I wonder if it is sufficient to turn on
> fetch_if_missing while index-pack is trying to fix the thin pack.
> 
> If the thin-pack issues are worked out, then non-partial fetch after
> partial clone seems doable (all commits from our tip to their tip are
> fetched, as well as all new trees and all new blobs; any trees and blobs
> still missing are already promised).

Agreed.  If we get a thin-pack and there are missing objects from the
commits in the edge set, we wouldn't be able to fix the newly received
objects without demand loading the object they are relative to.  Perhaps
we can use my filter/prefetch concept on the edge set to bulk fetch
them.  (This is a bit of a SWAG.)  Turning on the dynamic fetch would
be a first step (and may be sufficient).

> 
> Thanks for these questions - I am concentrating on repos in which both
> clone and fetch are partial, but it is good to discuss what happens if
> the user does something else.
> 
>> Also, there is an assumption here that the user will want to keep
>> using the same filtering settings on subsequent fetches.  That's
>> probably fine for now and until we get a chance to try it out for
>> real.
> 
> Yes. Having said that, the fetching of missing objects does not take
> into account the filter at all, so the filter can be easily changed.
> 
>> Do we allow EXACTLY ONE promisor-remote?  That is, can we later add
>> another remote as a promisor-remote?  And then do partial-fetches from
>> either?
> 
> Yes, exactly one (because we need to know which remote to fetch missing
> objects from, and which remote to allow partial fetches from). But the
> promisor remote can be switched to another.
> 
>> Do we need to disallow removing or altering a remote that is
>> listed as a promisor-remote?
> 
> Perhaps, although I think that right now configs are checked when we run
> the command using the config, not when we run "git config".
> 
>> I think for now, one promisor-remote is fine.  Just asking.
>>
>> Changing a remote's URL might be fine, but deleting the
>> promisor-remote would put the user in a weird state.  We don't need
>> to worry about it now though.
> 
> Agreed.
> 
>> I struggled with the terms here a little when looking at the source.
>> () A remote responding to a partial-clone is termed a "promisor-remote".
>> () Packfiles received from a promisor-remote are marked with
>>    "<name>.promisor" like "<name>.keep" names.
>> () An object actually contained in such packfiles is called a
>>    "promisor-object".
>> () An object not-present but referenced by one of the above
>>    promisor-objects is called a "promised-object" (aka a "missing-object").
>>
>> I think the similarity of the words "promisOR" and "promisED" threw
>> me here and as I was looking at the code.  The code in is_promised()
>> [1] looked like it was adding all promisor- and promised-objects to
>> the "promised" OIDSET, but I could be mistaken.
>>
>> [1]
>> https://github.com/jonathantanmy/git/commit/7a9c2d9b6e2fce293817b595dee29a7eede0dddd#diff-5d5d5dc185ef37dc30bb7d9a7ae0c4e8R1960
> 
> I was struggling a bit with the terminology, true.
> 
> Right now I'm thinking of:
>   - promisor remote (as you defined)
>   - promisor packfile (as you defined)
>   - promisor object is an object known to belong to the promisor (whether
>     because we have it in a promisor packfile or because it is referenced
>     by an object in a promisor packfile)
> 
> This might eliminate "promise(d)", and thus eliminate the confusion
> between "promised" and "promisor", but I haven't done an exhaustive
> search.
> 

Maybe just call the "promised" ones "missing".

>> The contents of the "<name>.promisor" file are described as arbitrary?
>> Should we write the name of the remote (or some other structured data)
>> into this file so that later fetches can find the server?  This is
>> more a question for when we have multiple promisor-remotes and may
>> need to decide who can supply a missing object.  Not urgent now
>> though.
> 
> Yeah, I don't have a use for it right now. Although it might be a good
> idea to make it future-proof, by say, declaring it to be lines of
> key-value pairs (with comments)...but then again, we might need to
> forbid earlier tools from working on such modern repositories, so we
> would need a repository extension (or something similar), and we can
> upgrade the format then.
> 
>> I assume you mean that a dynamic fetch of a single tree object would
>> be unpacked upon receipt (rather than creating a one object packfile)
>> and that we may need to mark it as promisor-object so that missing
>> objects that *IT* references would still follow the promised-object
>> rules.
> 
> Yes, that's right. (And that it would not participate in GC.)
> 
>> So is the assumption that as soon as you touch a promisOR-object you
>> might as well stop scanning, because anything it references might be
>> missing?
>>
>>
>> The code in rev-list.c and revision.c in [2] looks like it will
>> continue thru PROMISOR-objects and stop at (still-missing)
>> PROMISED-objects.
>>
>> [2]
>> https://github.com/jonathantanmy/git/commit/2d7ae2bc780dd056552643e4f5061a0ca7b5b1e5
> 
> The invocation of for_each_packed_object() in prepare_revision_walk()
> should mark all objects in promisor packs as uninteresting, so the
> traversal should stop there (unless I'm misunderstanding something).
> 
>>> The object walk done by gc also stops at all promisor objects and
>>> promised objects. Only non-promisor packfiles are deleted (if pack
>>> deletion is requested); promisor packfiles are left alone. This
>>> maintains the distinction between promisor packfiles and
>>> non-promisor packfiles. (In the future, we might need to do
>>> something more sophisticated with promisor packfiles.)
>>>
>>
>> Seems like we could combine promisor-packfiles -- so, for example,
>> the net result of a complete repack might be one normal- and one
>> promisor-packfile.
> 
> Yes, that is a good next step. (I'm aiming for a minimum viable product
> here so I didn't include that.)
> 
>> Question: if the initial clone is a partial-clone, we will ONLY have
>> promisor-packfiles, right?  (Unless they later add a second regular
>> remote.) So fsck and GC will basically stop scanning immediately,
>> right? Likewise, repack will basically be disabled until we support
>> repacking promisor-packfiles, right?  So performance of the resulting
>> repo will deteriorate over time as subsequent partial-fetches pull in
>> more and more promisor-packfiles.
> 
> Files locally created are non-promisor objects, and will become
> non-promisor packfiles when GC/repack is run.

I forgot about locally created objects.

> 
> As for performance, you are right. We'll need to make a way to combine
> promisor packfiles after this.
> 
>>> Fetching of promised objects
>>> ----------------------------
>>>
>>> When `sha1_object_info_extended()` (or similar) is invoked, it will
>>> automatically attempt to fetch a missing object from the promisor
>>> remote if that object is not in the local repository. For
>>> efficiency, no check is made as to whether that object is a
>>> promised object or not.
>>>
>>
>> I don't understand this last statement.  A missing-object is assumed
>> to be a promised-object, but we don't check that.  By this do you
>> mean that is_promised() [1] is only used by commands like fsck and
>> not elsewhere?
> 
> Sorry, I meant not checked in sha1_object_info_extended(). Yes, fsck and
> gc will check them.
> 
>> The dynamic object fetching in fetch_object() [3] only fetches 1
>> object from the server.  This will spawn a transport process, network
>> connection (and authentication), a packfile download, and index-pack
>> or unpack for each object.
>>
>> Perhaps this is just preliminary code, but I don't see how you can
>> adapt that model to be efficient.  Even using a long-running
>> background process to fetch individual blobs, it will still be
>> fetching blobs individually.  We do need to have dynamic object
>> fetching, but we should try to minimize its use; that is, to try have
>> higher-level mechanisms in place to predict and bulk prefetch missing
>> objects.
> 
> My plan is to have batch fetching. We would need to extend this command
> by command, for example, my patch for batched checkout here:
> 
> https://github.com/jonathantanmy/git/commit/14f07d3f06bc3a1a2c9bca85adc8c42b230b9143
> 
> (I didn't update this for my newest patch set because I didn't figure
> out a good way to deal with the possibility of missing trees, but
> something like this could probably be done - for example, by only
> batching blobs and, whenever we need to fetch a tree, automatically
> fetching all objects to a certain depth.)
> 
> Server incompatibility is not a problem because the server already
> supports multiple objects per request. (Well, except the fetching trees
> to a certain depth part.)
> 

Here I think we plug in my filter mechanism and hook it to a pre-fetch
command.  Then we have a pre-checkout hook or something that drives
this.  We may need a similar pre-merge hook (to get both tips and the
merge base).
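
To make that concrete, here's a rough sketch of what such a prefetch
driver could do on the checkout side.  fetch_objects() is hypothetical,
standing in for whatever batched entry point ends up in fetch-object.c:

    #include "cache.h"
    #include "sha1-array.h"

    static void prefetch_for_checkout(struct index_state *istate)
    {
        struct oid_array to_fetch = OID_ARRAY_INIT;
        int i;

        /* collect every OID the checkout needs but we do not have */
        for (i = 0; i < istate->cache_nr; i++) {
            const struct object_id *oid = &istate->cache[i]->oid;
            if (!has_object_file(oid))
                oid_array_append(&to_fetch, oid);
        }

        /* one bulk request instead of one round trip per object */
        if (to_fetch.nr)
            fetch_objects(to_fetch.oid, to_fetch.nr); /* hypothetical */
        oid_array_clear(&to_fetch);
    }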

>> I guess it depends on how many missing-objects you expect the client
>> to have. My concern here is that we're limiting the design to the
>> "occasional" big file problem, rather than the more general scale
>> problem.
> 
> Do you have a specific situation in mind?
> 

I would like to be able to do sparse-enlistments in the Windows
source tree. (3.5M files at HEAD.)  Most developers only need a small
feature area (a device driver or file system or whatever) and not the
whole tree.  A typical Windows developer may have only 30-50K files
populated.  If we can synchronize on a sparse-checkout spec and use
that for both the checkout and the clone/fetch, then we can bulk fetch
the blobs that they'll actually need.  GVFS can hydrate the files as
they touch them, but I can use this to pre-fetch the needed blobs in
bulk, rather than faulting and fetching them one-by-one.

So, my usage may have >95% of the ODB be missing blobs.  Contrast that
with the occasional large blob / LFS usage where you may have <5% of
the ODB as large objects (by count of OIDs not disk usage).


>> Also, relying on dynamic object fetching means that operations, like
>> checkout, may now require a network connection, because we don't know
>> we need an object until we actually need it (and are half-way thru
>> some operation that the user thought was safe to do off-line).
> 
> Yes, but I think this is expected when we're dealing with partial
> clones.

Right, but if we could pre-fetch, we could make checkout a little
less unpredictable.

> 
>> Thanks,
>> Jeff
> 
> Thanks.
> 

Thanks
Jeff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: Design and code of partial clones (now, missing commits and trees OK) (part 2/3)
  2017-09-21 22:51   ` Jonathan Tan
@ 2017-09-22 21:19     ` Jeff Hostetler
  2017-09-22 22:52       ` Jonathan Tan
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Hostetler @ 2017-09-22 21:19 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, Christian Couder



On 9/21/2017 6:51 PM, Jonathan Tan wrote:
> On Thu, 21 Sep 2017 13:59:43 -0400
> Jeff Hostetler <git@jeffhostetler.com> wrote:
> 
>> (part 2)
>>
>> Additional overall comments on:
>> https://github.com/jonathantanmy/git/commits/partialclone2
>>
>> {} I think it would help to split the blob-max-bytes filtering and
>>      the promisor/promised concepts and discuss them independently.
>>
>>      {} Then we can talk about the promisor/promised functionality
>>         independent of any kind of filter.  The net-net is that the
>>         client has missing objects and it doesn't matter what filter
>>         criteria or mechanism caused that to happen.
>>
>>      {} blob-max-bytes is but one such filter we should have.  This
>>         might be sufficient if the goal is to replace LFS (where you
>>         rarely ever need any given very, very large object),
>>         dynamically loading them as needed is sufficient, and the
>>         network round-trip isn't too much of a perf penalty.
>>
>>      {} But if we want to do things like "sparse-enlistments", where
>>         the client only needs a small part of the tree via
>>         sparse-checkout (for example, only populating 50,000 files
>>         from a tree of 3.5M files at HEAD), then we need more
>>         general filtering.
>>
>>      {} And as I said above, how we choose to filter should be
>>         independent of how the client handles promisor/promised
>>         objects.
> 
> I agree that they are independent. (I put everything together so that
> people could see how they work together, but they can be changed
> independently of each other.)
> 
>> {} Also, if we rely strictly on dynamic object fetching to fetch
>> missing objects, we are effectively tethered to the server during
>> operations (such as checkout) that the user might not think about as
>> requiring a network connection.  And we are forced to keep the same
>> limitations of LFS in that you can't prefetch and go offline (without
>> actually checking out to your worktree first).  And we can't bulk or
>> parallel fetch objects.
> 
> I don't think dynamic object fetching precludes any other more optimized
> way of fetching or prefetching - I implemented dynamic object fetching
> first so that we would have a fallback, but the others definitely can be
> implemented too.

Yes, we need that as a fallback/default for the odd cases where we
can't predict perfectly, like during a blame, a history walk, or a
merge.

I didn't mean to say we didn't need it, but rather that we should
try to minimize it.

> 
>> {} I think it would also help to move the blob-max-bytes calculation
>>      out of pack-objects.c:add_object_entry() [1].  The current code
>>      isolates the computation there so that only pack-objects can do
>>      the filtering.
>>
>>      Instead, put it in list-objects.c and traverse_commit_list() so
>>      that pack-objects and rev-list can share it (as Peff suggested
>>      [2] in response to my first patch series in March).
>>
>>      For example, this would let the client have a pre-checkout hook,
>>      use rev-list to compute the set of missing objects needed for
>>      that commit, and pipe that to a command to BULK fetch them from
>>      the server BEFORE starting the actual checkout.  This would
>>      allow the savvy user to manually run a prefetch before going
>>      offline.
>>
>> [1]
>> https://github.com/jonathantanmy/git/commit/68e529484169f4800115c5a32e0904c25ad14bd8#diff-a8d2c9cf879e775d748056cfed48440cR1110
>>
>> [2]
>> https://public-inbox.org/git/20170309073117.g3br5btsfwntcdpe@sigill.intra.peff.net/
> 
> In your specific example, how would rev-list know, on the client, to
> include (or exclude) a large blob in its output if it does not have it,
> and thus does not know its size?
> 

The client doesn't have the size. It just knows it is missing and it
needs it. It doesn't matter why it is missing.  (But I guess the client
could assume it is because it is large.)

So rev-list on the client could filter the objects it has by size.

I added that to rev-list primarily to demonstrate and debug the
filtering concept (it's easier than playing with packfiles).  But
it can be used to drive client-side queries and bulk requests.
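
To illustrate the shape of it, here's a sketch of a show_object_fn that
applies the size filter during the shared traversal (blob_max_bytes is
a placeholder for however the filter spec actually gets plumbed
through):

    static unsigned long blob_max_bytes; /* placeholder for filter spec */

    static void show_object_filtered(struct object *obj, const char *name,
                                     void *cb_data)
    {
        unsigned long size;

        /* omit oversized blobs, as pack-objects does on the server */
        if (obj->type == OBJ_BLOB &&
            sha1_object_info(obj->oid.hash, &size) == OBJ_BLOB &&
            size > blob_max_bytes)
            return;

        show_object(obj, name, cb_data); /* the existing callback */
    }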

> My reason for including it in pack-objects.c is because I only needed it
> there and it is much simpler, but I agree that if it can be used
> elsewhere, we can put it in a more general place.
> 
>> {} This also locks us into size-only filtering and makes it more
>>      difficult to add other filters, in that the add_object_entry()
>>      code gets called on an object after the traversal has decided
>>      what to do with it.  It would be difficult to add tree-trimming
>>      at this level, for example.
> 
> That is true.
> 
>> {} An early draft of this type of filtering is here [3].  I hope to
>>      push up a revised draft of this shortly.
>>
>> [3]
>> https://public-inbox.org/git/20170713173459.3559-1-git@jeffhostetler.com/
> 
OK - I'll take a look when that is done (I think I commented on an
earlier version of that).
> 

FYI I just posted my RFC this afternoon.
https://public-inbox.org/git/20170922204211.GA24036@google.com/T/


Thanks
Jeff


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: Design and code of partial clones (now, missing commits and trees OK) (part 3)
  2017-09-21 23:04   ` Jonathan Tan
@ 2017-09-22 21:32     ` Jeff Hostetler
  2017-09-22 22:58       ` Jonathan Tan
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Hostetler @ 2017-09-22 21:32 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, Christian Couder



On 9/21/2017 7:04 PM, Jonathan Tan wrote:
> On Thu, 21 Sep 2017 14:00:40 -0400
> Jeff Hostetler <git@jeffhostetler.com> wrote:
> 
>> (part 3)
>>
>> Additional overall comments on:
>> https://github.com/jonathantanmy/git/commits/partialclone2
>>
>> {} WRT the code in is_promised() [1]
>>
>> [1]
>> https://github.com/jonathantanmy/git/commit/7a9c2d9b6e2fce293817b595dee29a7eede0dddd#diff-5d5d5dc185ef37dc30bb7d9a7ae0c4e8R1960
>>
>>      {} It looked like it was adding ALL promisor- and
>>         promised-objects to the "promised" OIDSET, rather than just
>>         promised-objects, but I could be mistaken.
> 
> As far as I can tell, it is just adding the promised objects (some of
> which may also be promisor objects). If you're saying that you expected
> it to add the promisor objects as well, that might be a reasonable
> expectation...I'm thinking of doing that.
> 

It looked like it was adding both types.  I was concerned that it
might be doing too much.  But I haven't run the code; that was just
from observation.

>>      {} Is this iterating over ALL promisor-packfiles?
> 
> Yes.
> 
>>      {} It looked like this was being used by fsck and rev-list.  I
>>         have concerns about how big this OIDSET will get and how it
>>         will scale, since if we start with a partial-clone all
>>         packfiles will be promisor-packfiles.
> 
> It's true that scaling is an issue. I'm not sure if omitting the oidset
> will solve anything, though - as it is, Git maintains an object hash and
> adds to it quite liberally.

I guess I'm afraid that the first call to is_promised() is going to
cause a very long pause as it loads up a very large hash of objects.

Perhaps you could augment the OID lookup to remember where the object
was found (essentially a .promisor bit set).  Then you wouldn't need
to touch them all.
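
Something like this sketch, say, where pack_promisor is a hypothetical
bit on "struct packed_git" that is set when the matching .promisor
file exists:

    static int found_in_promisor_pack(const struct object_id *oid)
    {
        struct object_info oi = { NULL };

        if (sha1_object_info_extended(oid->hash, &oi, 0) < 0)
            return 0; /* not present locally at all */

        return oi.whence == OI_PACKED &&
               oi.u.packed.pack->pack_promisor; /* hypothetical bit */
    }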

> 
> One thing that might help is some sort of flushing of objects in
> promisor packfiles from the local repository - that way, the oidset
> won't be so large.
> 
>>
>>      {} When iterating thru a tree object, you add everything that it
>>         references (everything in that folder).  This adds all of the
>>         child OIDs -- without regard to whether they are new to this
>>         version of the tree object. (Granted, this is hard to compute.)
> 
> The oidset will deduplicate OIDs.

Right, but you still have an entry for each object.  For a repo the
size of Windows, you may have 25M+ objects in your copy of the ODB.

> 
>>         My concern is that this will add too many objects to the
>>         OIDSET.  That is, a new tree object (because of a recent
>>         change to something in that folder) will also have the OIDs
>>         of the other *unchanged* files, which may be present in an
>>         earlier non-promisor packfile from an earlier commit.
>>
>>         I worry that this will grow the OIDSET to essentially include
>>         everything, possibly defeating its own purpose.  I could be
>>         wrong here, but that's my concern.
> 
> Same answer as above (about flushing of objects in promisor packfiles).
> 
>> {} I'm not opposed to the .promisor file concept, but I have concerns
>>      that in practice all packfiles after a partial clone will be
>>      promisor-packfiles and therefore short-cut during fsck, so fsck
>>      still won't gain anything.
>>
>>      It would help if there were also non-promisor packfiles present,
>>      but only for objects referenced by non-promisor packfiles.
>>
>>      But then I also have to wonder whether we can even support
>>      non-promisor packfiles after starting with a partial clone --
>>      because of the properties of received thin-packs on a
>>      non-partial fetch.
> 
> Same reply as to your other e-mail - locally created objects are not in
> promisor packfiles. (Or were you thinking of a situation where locally
> created objects are immediately uploaded to the promisor remote, thus
> making them promisor objects too?)
> 

Thanks,
Jeff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: Design and code of partial clones (now, missing commits and trees OK)
  2017-09-22 21:02     ` Jeff Hostetler
@ 2017-09-22 22:49       ` Jonathan Tan
  0 siblings, 0 replies; 21+ messages in thread
From: Jonathan Tan @ 2017-09-22 22:49 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, peartben, Christian Couder

On Fri, 22 Sep 2017 17:02:11 -0400
Jeff Hostetler <git@jeffhostetler.com> wrote:

> > I was struggling a bit with the terminology, true.
> > 
> > Right now I'm thinking of:
> >   - promisor remote (as you defined)
> >   - promisor packfile (as you defined)
> >   - promisor object is an object known to belong to the promisor (whether
> >     because we have it in a promisor packfile or because it is referenced
> >     by an object in a promisor packfile)
> > 
> > This might eliminate "promise(d)", and thus eliminate the confusion
> > between "promised" and "promisor", but I haven't done an exhaustive
> > search.
> > 
> 
> Maybe just call the "promised" ones "missing".

They are not the same, though - "missing" usually just means that the
local repo does not have it, without regard to whether another repo has
it.

> >> I guess it depends on how many missing-objects you expect the client
> >> to have. My concern here is that we're limiting the design to the
> >> "occasional" big file problem, rather than the more general scale
> >> problem.
> > 
> > Do you have a specific situation in mind?
> > 
> 
> I would like to be able to do sparse-enlistments in the Windows
> source tree. (3.5M files at HEAD.)  Most developers only need a small
> feature area (a device driver or file system or whatever) and not the
> whole tree.  A typical Windows developer may have only 30-50K files
> populated.  If we can synchronize on a sparse-checkout spec and use
> that for both the checkout and the clone/fetch, then we can bulk fetch
> the blobs that they'll actually need.  GVFS can hydrate the files as
> they touch them, but I can use this to pre-fetch the needed blobs in
> bulk, rather than faulting and fetching them one-by-one.
> 
> So, my usage may have >95% of the ODB be missing blobs.  Contrast that
> with the occasional large blob / LFS usage where you may have <5% of
> the ODB as large objects (by count of OIDs not disk usage).

I don't think the current design precludes a more intelligent bulk
fetching (e.g. being allowed to specify a "want" tree and several "have"
trees, although we will have to figure out a design for that, including
how to select the "have" trees to inform the server about).

In the meantime, yes, this will be more useful for occasional large blob
repos, and (if/when the hook support is added) a GVFS situation where
the missing objects are available network-topologically close by.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: Design and code of partial clones (now, missing commits and trees OK) (part 2/3)
  2017-09-22 21:19     ` Jeff Hostetler
@ 2017-09-22 22:52       ` Jonathan Tan
  2017-09-26 14:03         ` Jeff Hostetler
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Tan @ 2017-09-22 22:52 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, peartben, Christian Couder

On Fri, 22 Sep 2017 17:19:50 -0400
Jeff Hostetler <git@jeffhostetler.com> wrote:

> > In your specific example, how would rev-list know, on the client, to
> > include (or exclude) a large blob in its output if it does not have it,
> > and thus does not know its size?
> > 
> 
> The client doesn't have the size. It just knows it is missing and it
> needs it. It doesn't matter why it is missing.  (But I guess the client
> could assume it is because it is large.)

Ah, OK.

> So rev-list on the client could filter the objects it has by size.

My issue is that if the purpose of this feature in rev-list is to do
prefetching, the only criterion we need to check for is absence from the
local repo, right? (Or is filtering by size on the client useful for
other reasons?)

> FYI I just posted my RFC this afternoon.
> https://public-inbox.org/git/20170922204211.GA24036@google.com/T/

Thanks, I'll take a look.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: Design and code of partial clones (now, missing commits and trees OK) (part 3)
  2017-09-22 21:32     ` Jeff Hostetler
@ 2017-09-22 22:58       ` Jonathan Tan
  2017-09-26 14:25         ` Jeff Hostetler
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Tan @ 2017-09-22 22:58 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, peartben, Christian Couder

On Fri, 22 Sep 2017 17:32:00 -0400
Jeff Hostetler <git@jeffhostetler.com> wrote:

> I guess I'm afraid that the first call to is_promised() is going to
> cause a very long pause as it loads up a very large hash of objects.

Yes, the first call will cause a long pause. (I think fsck and gc can
tolerate this, but a better solution would be appreciated.)

> Perhaps you could augment the OID lookup to remember where the object
> was found (essentially a .promisor bit set).  Then you wouldn't need
> to touch them all.

Sorry - I don't understand this. Are you saying that missing promisor
objects should go into the global object hashtable, so that we can set a
flag on them?

> > The oidset will deduplicate OIDs.
> 
> Right, but you still have an entry for each object.  For a repo the
> size of Windows, you may have 25M+ objects in your copy of the ODB.

We have entries only for the "frontier" objects (the objects directly
referenced by any promisor object). For the Windows repo, for example, I
foresee that many of the blobs, trees, and commits will be "hiding"
behind objects that the repository user did not download into their
repo.
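
In sketch form, the set is built per promisor object something like
this (only the commit case is shown; trees and tags are handled
similarly, and the object is assumed to be parsed already):

    static void add_promised_references(struct object *obj,
                                        struct oidset *promised)
    {
        if (obj->type == OBJ_COMMIT) {
            struct commit *c = (struct commit *)obj;
            struct commit_list *p;

            /* only the "frontier": the OIDs this object names directly */
            oidset_insert(promised, &c->tree->object.oid);
            for (p = c->parents; p; p = p->next)
                oidset_insert(promised, &p->item->object.oid);
        }
        /* ... similarly for tree entries and tag targets ... */
    }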

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: Design and code of partial clones (now, missing commits and trees OK) (part 2/3)
  2017-09-22 22:52       ` Jonathan Tan
@ 2017-09-26 14:03         ` Jeff Hostetler
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff Hostetler @ 2017-09-26 14:03 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, Christian Couder



On 9/22/2017 6:52 PM, Jonathan Tan wrote:
> On Fri, 22 Sep 2017 17:19:50 -0400
> Jeff Hostetler <git@jeffhostetler.com> wrote:
> 
>>> In your specific example, how would rev-list know, on the client, to
>>> include (or exclude) a large blob in its output if it does not have it,
>>> and thus does not know its size?
>>>
>>
>> The client doesn't have the size. It just knows it is missing and it
>> needs it. It doesn't matter why it is missing.  (But I guess the client
>> could assume it is because it is large.)
> 
> Ah, OK.
> 
>> So rev-list on the client could filter the objects it has by size.
> 
> My issue is that if the purpose of this feature in rev-list is to do
> prefetching, the only criterion we need to check for is absence from the
> local repo, right? (Or is filtering by size on the client useful for
> other reasons?)

The prefetch before a checkout may want all missing blobs, or it may
want to work with the sparse-checkout specification and only get the
required missing blobs in the subset of the tree.  By putting the same
filter logic in rev-list, we can do that.

It also sets the stage for later filtering trees.  (My current patch
only filters blobs.  It would be nice to have a second version of the
sparse filter that also omits trees, but that may require a recursive
option in the fetch-objects protocol.)
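
For example, the traversal callback for the sparse prefetch case might
look like this sketch, where path_in_sparse_spec() is hypothetical:

    /* collect blobs that are both missing locally and inside the spec */
    static void collect_sparse_missing(struct object *obj, const char *name,
                                       void *cb_data)
    {
        struct oid_array *to_fetch = cb_data;

        if (obj->type == OBJ_BLOB &&
            !has_object_file(&obj->oid) &&
            path_in_sparse_spec(name)) /* hypothetical */
            oid_array_append(to_fetch, &obj->oid);
    }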


> 
>> FYI I just posted my RFC this afternoon.
>> https://public-inbox.org/git/20170922204211.GA24036@google.com/T/
> 
> Thanks, I'll take a look.
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: Design and code of partial clones (now, missing commits and trees OK) (part 3)
  2017-09-22 22:58       ` Jonathan Tan
@ 2017-09-26 14:25         ` Jeff Hostetler
  2017-09-26 17:32           ` Jonathan Tan
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff Hostetler @ 2017-09-26 14:25 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, Christian Couder



On 9/22/2017 6:58 PM, Jonathan Tan wrote:
> On Fri, 22 Sep 2017 17:32:00 -0400
> Jeff Hostetler <git@jeffhostetler.com> wrote:
> 
>> I guess I'm afraid that the first call to is_promised() is going to
>> cause a very long pause as it loads up a very large hash of objects.
> 
> Yes, the first call will cause a long pause. (I think fsck and gc can
> tolerate this, but a better solution would be appreciated.)
> 
>> Perhaps you could augment the OID lookup to remember where the object
>> was found (essentially a .promisor bit set).  Then you wouldn't need
>> to touch them all.
> 
> Sorry - I don't understand this. Are you saying that missing promisor
> objects should go into the global object hashtable, so that we can set a
> flag on them?

I just meant: could we add a bit to "struct object_info" to indicate
that the object was found in a .promisor packfile?  This could be set
in sha1_object_info_extended().

Then the is_promised() calls in fsck and gc would just test that bit.

Given that that bit will be set on promisOR objects (and we won't
have object_info for missing objects), you may need to adjust the
iterator in the fsck/gc code slightly.

This is a bit of a handwave, but could something like that eliminate
the need to build this oidset?


> 
>>> The oidset will deduplicate OIDs.
>>
>> Right, but you still have an entry for each object.  For a repo the
>> size of Windows, you may have 25M+ objects in your copy of the ODB.
> 
> We have entries only for the "frontier" objects (the objects directly
> referenced by any promisor object). For the Windows repo, for example, I
> foresee that many of the blobs, trees, and commits will be "hiding"
> behind objects that the repository user did not download into their
> repo.
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: Design and code of partial clones (now, missing commits and trees OK)
  2017-09-21 22:42   ` Jonathan Tan
  2017-09-22 21:02     ` Jeff Hostetler
@ 2017-09-26 15:26     ` Michael Haggerty
  2017-09-29 20:21       ` Jonathan Tan
  1 sibling, 1 reply; 21+ messages in thread
From: Michael Haggerty @ 2017-09-26 15:26 UTC (permalink / raw)
  To: Jonathan Tan, Jeff Hostetler; +Cc: git, peartben, Christian Couder

On 09/22/2017 12:42 AM, Jonathan Tan wrote:
> On Thu, 21 Sep 2017 13:57:30 Jeff Hostetler <git@jeffhostetler.com> wrote:
> [...]
>> I struggled with the terms here a little when looking at the source.
>> () A remote responding to a partial-clone is termed a
>>    "promisor-remote".
>> () Packfiles received from a promisor-remote are marked with
>>    "<name>.promisor" like "<name>.keep" names.
>> () An object actually contained in such packfiles is called a
>>    "promisor-object".
>> () An object not-present but referenced by one of the above
>>    promisor-objects is called a "promised-object" (aka a
>>    "missing-object").
>>
>> I think the similarity of the words "promisOR" and "promisED" threw
>> me here as I was looking at the code.  The code in is_promised()
>> [1] looked like it was adding all promisor- and promised-objects to
>> the "promised" OIDSET, but I could be mistaken.
>>
>> [1]
>> https://github.com/jonathantanmy/git/commit/7a9c2d9b6e2fce293817b595dee29a7eede0dddd#diff-5d5d5dc185ef37dc30bb7d9a7ae0c4e8R1960
> 
> I was struggling a bit with the terminology, true.
> 
> Right now I'm thinking of:
>  - promisor remote (as you defined)
>  - promisor packfile (as you defined)
>  - promisor object is an object known to belong to the promisor (whether
>    because we have it in a promisor packfile or because it is referenced
>    by an object in a promisor packfile)

Maybe naming has been discussed at length before, and I am jumping into
a long-settled topic. And admittedly this is bikeshedding.

But I find these names obscure, even as a developer. And terms like this
will undoubtedly bleed into the UI and documentation, so it would be
good to put some effort into choosing the best names possible.

I suppose that the term "promisor" comes from the computer science term
"promise" [1]. In that sense it is apt, because, say, a promisor object
is something that is known to be obtainable, but we don't have it yet.

But from the user's point of view, I think this isn't a very
illuminating term. I think the user's mental model will be that there is
a distinguished remote repository that holds the project's entire
published history, and she has to remain connected to it for certain Git
operations to work [2]. Another interesting aspect of this remote is
that it has to be trusted never (well, almost never) to discard any
objects [3].

Let me brainstorm about other names or concepts that seem closer to the
user's mental model:

* "backing remote", "backing repository"

* "lazy remote", "live remote", "cloud remote", "shared remote",
  "on-demand remote"

* "full remote", "deep remote", "permanent remote"

* "attached remote", "bound remote", "linked remote"

* "trusted remote", "authoritative remote", "official remote"
  (these terms probably imply more than we want)

* "upstream", "upstream remote" (probably too confusing with how
  the term "upstream" is currently used, even if in practice they
  will often be the same remote)

* "object depot", "depot remote", "depot repository", "remote
  object depot" (I don't like the term "object" that much, because
  it is rather remote from the Git user's daily life)

* "repository of record", "remote of record" (too wordy)

* "history depot", "history warehouse" (meh)

* (dare I say it?) "central repository"

* "object archive", "archival remote" (not so good because we already
  use "archive" for other concepts)

* depository (sounds too much like "repository")

* The thesaurus suggests nouns like: store, bank, warehouse, library,
  chronicle, annals, registry, locker, strongbox, attic, bunker

Personally I think "lazy remote" and "backing remote" are not too bad.

Michael

[1] https://en.wikipedia.org/wiki/Futures_and_promises

[2] I haven't checked whether the current proposal allows for
    multiple "promisor remotes". It's certainly thinkable, if not
    now then in the future. But I suppose that even then, 99% of
    users will configure a single "promisor remote" for each
    repository.

[3] For those rare occasions where the server has to discard objects,
    it might make sense for the server to remember the names of the
    objects that were deleted, so that it can tell clients "no, you're
    not insane. I used to have that object but it has intentionally
    been obliterated", and possibly even a reason: "it is now taboo"
    vs. "I got tired of carrying it around".

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: Design and code of partial clones (now, missing commits and trees OK) (part 3)
  2017-09-26 14:25         ` Jeff Hostetler
@ 2017-09-26 17:32           ` Jonathan Tan
  0 siblings, 0 replies; 21+ messages in thread
From: Jonathan Tan @ 2017-09-26 17:32 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, peartben, Christian Couder

On Tue, 26 Sep 2017 10:25:16 -0400
Jeff Hostetler <git@jeffhostetler.com> wrote:

> >> Perhaps you could augment the OID lookup to remember where the object
> >> was found (essentially a .promisor bit set).  Then you wouldn't need
> >> to touch them all.
> > 
> > Sorry - I don't understand this. Are you saying that missing promisor
> > objects should go into the global object hashtable, so that we can set a
> > flag on them?
> 
> I just meant could we add a bit to "struct object_info" to indicate
> that the object was found in a .promisor packfile ?  This could
> be set in sha1_object_info_extended().
> 
> Then the is_promised() calls in fsck and gc would just test that bit.
> 
> Given that that bit will be set on promisOR objects (and we won't
> have object_info for missing objects), you may need to adjust the
> iterator in the fsck/gc code slightly.
> 
> This is a bit of a handwave, but could something like that eliminate
> the need to build this oidset?

This oidset is meant to contain the missing objects, and is needed as
the final check (attempt to read the object, then check this oidset).
Admittedly, right now I add objects to it even if they are present in
the DB, but that's because I think that it's better for the set to be
bigger than to incur the repeated existence checks. But even if we only
include truly missing objects in this oidset, we still need the oidset,
or to store information about missing objects in some equivalent data
structure.

The bit that you mention being set on promisOR objects is already being
set. See the invocation of mark_uninteresting() in rev-list.c.
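
In other words, the final check is roughly this (using the existing
oidset API; "promised" is the oidset built by is_promised()):

    /* a referenced object is OK if we have it, or if it is promised */
    static int object_accounted_for(const struct object_id *oid,
                                    struct oidset *promised)
    {
        return has_object_file(oid) || oidset_contains(promised, oid);
    }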

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: Design and code of partial clones (now, missing commits and trees OK)
  2017-09-15 20:43 RFC: Design and code of partial clones (now, missing commits and trees OK) Jonathan Tan
                   ` (3 preceding siblings ...)
  2017-09-21 18:00 ` RFC: Design and code of partial clones (now, missing commits and trees OK) (part 3) Jeff Hostetler
@ 2017-09-29  0:53 ` Jonathan Tan
  2017-09-29  2:03   ` Junio C Hamano
  4 siblings, 1 reply; 21+ messages in thread
From: Jonathan Tan @ 2017-09-29  0:53 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, Christian Couder, git

On Fri, 15 Sep 2017 13:43:43 -0700
Jonathan Tan <jonathantanmy@google.com> wrote:

> For those interested in partial clones and/or missing objects in repos,
> I've updated my original partialclone patches to not require an explicit
> list of promises. Fetch/clone still only permits exclusion of blobs, but
> the infrastructure is there for a local repo to support missing trees
> and commits as well.
> 
> They can be found here:
> 
> https://github.com/jonathantanmy/git/tree/partialclone2

I've pushed a new version:

https://github.com/jonathantanmy/git/tree/partialclone3

Besides some small changes as requested by comments on the GitHub
repository, I've also updated the code to do the following:
 - clarified terminology - in particular, I've tried to avoid
   "promised", only using "promisor object" to denote objects that the
   local repo knows that the promisor remote has, whether the local repo
   has them or not
 - restored bulk checkout functionality (so now you can clone with
   --blob-max-bytes=0)
 - a fix to fetch-pack to restore a global flag after it uses it, so
   commands like "git log -S" still work (but to test this, I used
   --blob-max-bytes=200000 with the Git repository, because batch
   fetching is not implemented for commands like these)

In its current form, the code is already useful for situations like:
 - a large repository with many blobs, in which the client only needs
   to perform checkouts, at most, and does not need to search through
   history locally, and
 - a repository with a few large blobs, where the client can still
   search through history as long as it is online

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: Design and code of partial clones (now, missing commits and trees OK)
  2017-09-29  0:53 ` RFC: Design and code of partial clones (now, missing commits and trees OK) Jonathan Tan
@ 2017-09-29  2:03   ` Junio C Hamano
  0 siblings, 0 replies; 21+ messages in thread
From: Junio C Hamano @ 2017-09-29  2:03 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, Christian Couder, git

Jonathan Tan <jonathantanmy@google.com> writes:

> I've pushed a new version:
>
> https://github.com/jonathantanmy/git/tree/partialclone3

Just FYI, the reason why I commented only on the first patch in your
previous series at GitHub wasn't because I found the others perfect
and nothing to comment on.  It was because I found it extremely
painful to conduct review and comment in the webform and gave up
while trying to review the series that way just after doing a single
patch.

I also found it frustrating that it is not even obvious which of the
many patches in the series have already been commented on, without
clicking into each and every commit (and back), even when the result
is to find that nobody has commented on them yet.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RFC: Design and code of partial clones (now, missing commits and trees OK)
  2017-09-26 15:26     ` Michael Haggerty
@ 2017-09-29 20:21       ` Jonathan Tan
  0 siblings, 0 replies; 21+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:21 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff Hostetler, git, peartben, Christian Couder

On Tue, 26 Sep 2017 17:26:33 +0200
Michael Haggerty <mhagger@alum.mit.edu> wrote:

> Maybe naming has been discussed at length before, and I am jumping into
> a long-settled topic. And admittedly this is bikeshedding.
> 
> But I find these names obscure, even as a developer. And terms like this
> will undoubtedly bleed into the UI and documentation, so it would be
> good to put some effort into choosing the best names possible.

Names are definitely not a long-settled topic. :-)

I agree that naming is important, and thanks for your efforts.

> I suppose that the term "promisor" comes from the computer science term
> "promise" [1]. In that sense it is apt, because, say, a promisor object
> is something that is known to be obtainable, but we don't have it yet.
> 
> But from the user's point of view, I think this isn't a very
> illuminating term. I think the user's mental model will be that there is
> a distinguished remote repository that holds the project's entire
> published history, and she has to remain connected to it for certain Git
> operations to work [2]. Another interesting aspect of this remote is
> that it has to be trusted never (well, almost never) to discard any
> objects [3].

Yes, that is the mental model I have too. I think the ordinary meaning
of the word "promise" works, though - you're not working completely on
things you have, but you're working partly based on the guarantees (or
promises) that this distinguished remote repository gives.

> Personally I think "lazy remote" and "backing remote" are not too bad.

I think these terms are imprecise. "Lazy remote" seems to me to imply
that it is the remote that is lazy, not us.

"Backing remote" does evoke the concept of a "backing store". For me,
the ability to transfer objects to the backing store to be stored
permanently (so that you don't have to store it yourself) is an
essential part of a backing store, and that is definitely something we
don't do here (although, admittedly, such a feature might be useful), so
I don't agree with that term. But if transferring objects is not
essential to a backing store, or if adding such a feature is a natural
fit to the partial clone feature, maybe we could use that.

> [2] I haven't checked whether the current proposal allows for
>     multiple "promisor remotes". It's certainly thinkable, if not
>     now then in the future. But I suppose that even then, 99% of
>     users will configure a single "promisor remote" for each
>     repository.

It does not allow for multiple "promisor remotes". Support for that
would require upgrades in the design (including knowing which remote to
fetch a missing object from), but otherwise I agree with your
statements.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-09-29 20:21 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-15 20:43 RFC: Design and code of partial clones (now, missing commits and trees OK) Jonathan Tan
2017-09-19  5:51 ` Junio C Hamano
2017-09-21 17:57 ` Jeff Hostetler
2017-09-21 22:42   ` Jonathan Tan
2017-09-22 21:02     ` Jeff Hostetler
2017-09-22 22:49       ` Jonathan Tan
2017-09-26 15:26     ` Michael Haggerty
2017-09-29 20:21       ` Jonathan Tan
2017-09-21 17:59 ` RFC: Design and code of partial clones (now, missing commits and trees OK) (part 2/3) Jeff Hostetler
2017-09-21 22:51   ` Jonathan Tan
2017-09-22 21:19     ` Jeff Hostetler
2017-09-22 22:52       ` Jonathan Tan
2017-09-26 14:03         ` Jeff Hostetler
2017-09-21 18:00 ` RFC: Design and code of partial clones (now, missing commits and trees OK) (part 3) Jeff Hostetler
2017-09-21 23:04   ` Jonathan Tan
2017-09-22 21:32     ` Jeff Hostetler
2017-09-22 22:58       ` Jonathan Tan
2017-09-26 14:25         ` Jeff Hostetler
2017-09-26 17:32           ` Jonathan Tan
2017-09-29  0:53 ` RFC: Design and code of partial clones (now, missing commits and trees OK) Jonathan Tan
2017-09-29  2:03   ` Junio C Hamano
