[RFC] Add support for downloading blobs on demand

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [RFC] Add support for downloading blobs on demand
@ 2017-01-13 15:52 Ben Peart
  2017-01-13 21:07 ` Shawn Pearce
  2017-01-17 18:42 ` Jeff King
  0 siblings, 2 replies; 13+ messages in thread
From: Ben Peart @ 2017-01-13 15:52 UTC (permalink / raw)
  To: git; +Cc: benpeart

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 4344 bytes --]

Goal
~~~~

To be able to better handle repos with many files that any individual 
developer doesn’t need it would be nice if clone/fetch only brought down 
those files that were actually needed.  

To enable that, we are proposing adding a flag to clone/fetch that will
instruct the server to limit the objects it sends to commits and trees
and to not send any blobs.  

When git performs an operation that requires a blob that isn’t currently
available locally, it will download the missing blob and add it to the
local object store.

Design
~~~~~~

Clone and fetch will pass a “--lazy-clone” flag (open to a better name 
here) similar to “--depth” that instructs the server to only return 
commits and trees and to ignore blobs.

Later during git operations like checkout, when a blob cannot be found
after checking all the regular places (loose, pack, alternates, etc), 
git will download the missing object and place it into the local object 
store (currently as a loose object) then resume the operation.

To prevent git from accidentally downloading all missing blobs, some git
operations are updated to be aware of the potential for missing blobs.  
The most obvious being check_connected which will return success as if 
everything in the requested commits is available locally.

To minimize the impact on the server, the existing dumb HTTP protocol 
endpoint “objects/<sha>” can be used to retrieve the individual missing
blobs when needed.

Performance considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~

We found that downloading commits and trees on demand had a significant 
negative performance impact.  In addition, many git commands assume all 
commits and trees are available locally so they quickly got pulled down 
anyway.  Even in very large repos the commits and trees are relatively 
small so bringing them down with the initial commit and subsequent fetch 
commands was reasonable.  

After cloning, the developer can use sparse-checkout to limit the set of 
files to the subset they need (typically only 1-10% in these large 
repos).  This allows the initial checkout to only download the set of 
files actually needed to complete their task.  At any point, the 
sparse-checkout file can be updated to include additional files which 
will be fetched transparently on demand.

Typical source files are relatively small so the overhead of connecting 
and authenticating to the server for a single file at a time is 
substantial.  As a result, having a long running process that is started 
with the first request and can cache connection information between 
requests is a significant performance win.

Now some numbers
~~~~~~~~~~~~~~~~

One repo has 3+ million files at tip across 500K folders with 5-6K 
active developers.  They have done a lot of work to remove large files 
from the repo so it is down to < 100GB.

Before changes: clone took hours to transfer the 87GB .pack + 119MB .idx

After changes: clone took 4 minutes to transfer 305MB .pack + 37MB .idx

After hydrating 35K files (the typical number any individual developer 
needs to do their work), there was an additional 460 MB of loose files 
downloaded.

Total savings: 86.24 GB * 6000 developers = 517 Terabytes saved!

We have another repo (3.1 M files, 618 GB at tip with no history with 
3K+ active developers) where the savings are even greater.

Future Work
~~~~~~~~~~~

The current prototype calls a new hook proc in sha1_object_info_extended 
and read_object, to download each missing blob.  A better solution would 
be to implement this via a long running process that is spawned on the 
first download and listens for requests to download additional objects 
until it terminates when the parent git operation exits (similar to the 
recent long running smudge and clean filter work).

Need to do more investigation into possible code paths that can trigger 
unnecessary blobs to be downloaded.  For example, we have determined 
that the rename detection logic in status can also trigger unnecessary 
blobs to be downloaded making status slow.

Need to investigate an alternate batching scheme where we can make a 
single request for a set of "related" blobs and receive single a 
packfile (especially during checkout).

Need to investigate adding a new endpoint in the smart protocol that can 
download both individual blobs as well as a batch of blobs.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] Add support for downloading blobs on demand
  2017-01-13 15:52 [RFC] Add support for downloading blobs on demand Ben Peart
@ 2017-01-13 21:07 ` Shawn Pearce
  2017-01-17 21:50   ` Ben Peart
  2017-01-17 18:42 ` Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: Shawn Pearce @ 2017-01-13 21:07 UTC (permalink / raw)
  To: Ben Peart; +Cc: git, benpeart

On Fri, Jan 13, 2017 at 7:52 AM, Ben Peart <peartben@gmail.com> wrote:
>
> Goal
> ~~~~
>
> To be able to better handle repos with many files that any individual
> developer doesn’t need it would be nice if clone/fetch only brought down
> those files that were actually needed.
>
> To enable that, we are proposing adding a flag to clone/fetch that will
> instruct the server to limit the objects it sends to commits and trees
> and to not send any blobs.
>
> When git performs an operation that requires a blob that isn’t currently
> available locally, it will download the missing blob and add it to the
> local object store.

Interesting. This is also an area I want to work on with my team at
$DAY_JOB. Repositories are growing along multiple dimensions, and
developers or editors don't always need all blobs for all time
available locally to successfully perform their work.

> Design
> ~~~~~~
>
> Clone and fetch will pass a “--lazy-clone” flag (open to a better name
> here) similar to “--depth” that instructs the server to only return
> commits and trees and to ignore blobs.

My group at $DAY_JOB hasn't talked about it yet, but I want to add a
protocol capability that lets clone/fetch ask only for blobs smaller
than a specified byte count. This could be set to a reasonable text
file size (e.g. <= 5 MiB) to predominately download only source files
and text documentation, omitting larger binaries.

If the limit was set to 0, its the same as your idea to ignore all blobs.

> Later during git operations like checkout, when a blob cannot be found
> after checking all the regular places (loose, pack, alternates, etc),
> git will download the missing object and place it into the local object
> store (currently as a loose object) then resume the operation.

Right. I'd like to have this object retrieval be inside the native Git
wire protocol, reusing the remote configuration and authentication
setup. That requires expanding the server side of the protocol
implementation slightly allowing any reachable object to be retrieved
by SHA-1 alone. Bitmap indexes can significantly reduce the
computational complexity for the server.

> To prevent git from accidentally downloading all missing blobs, some git
> operations are updated to be aware of the potential for missing blobs.
> The most obvious being check_connected which will return success as if
> everything in the requested commits is available locally.

This ... sounds risky for the developer, as the repository may be
corrupt due to a missing object, and the user cannot determine it.

Would it be reasonable for the server to return a list of SHA-1s it
knows should exist, but has omitted due to the blob threshold (above),
and the local repository store this in a binary searchable file?
During connectivity checking its assumed OK if an object is not
present in the object store, but is listed in this omitted objects
file.

> To minimize the impact on the server, the existing dumb HTTP protocol
> endpoint “objects/<sha>” can be used to retrieve the individual missing
> blobs when needed.

I'd prefer this to be in the native wire protocol, where the objects
are in pack format (which unfortunately differs from loose format). I
assume servers would combine many objects into pack files, potentially
isolating large uncompressable binaries into their own packs, stored
separately from commits/trees/small-text-blobs.

I get the value of this being in HTTP, where HTTP caching inside
proxies can be leveraged to reduce master server load. I wonder if the
native wire protocol could be taught to use a variation of an HTTP GET
that includes the object SHA-1 in the URL line, to retrieve a
one-object pack file.

> Performance considerations
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> We found that downloading commits and trees on demand had a significant
> negative performance impact.  In addition, many git commands assume all
> commits and trees are available locally so they quickly got pulled down
> anyway.  Even in very large repos the commits and trees are relatively
> small so bringing them down with the initial commit and subsequent fetch
> commands was reasonable.
>
> After cloning, the developer can use sparse-checkout to limit the set of
> files to the subset they need (typically only 1-10% in these large
> repos).  This allows the initial checkout to only download the set of
> files actually needed to complete their task.  At any point, the
> sparse-checkout file can be updated to include additional files which
> will be fetched transparently on demand.
>
> Typical source files are relatively small so the overhead of connecting
> and authenticating to the server for a single file at a time is
> substantial.  As a result, having a long running process that is started
> with the first request and can cache connection information between
> requests is a significant performance win.

Junio and I talked years ago (offline, sorry no mailing list archive)
about "narrow checkout", which is the idea of the client being able to
ask for a pack file from the server that only includes objects along
specific path names. This would allow a client to amortize the setup
costs, and even delta compress source files against each other (e.g.
boilerplate across Makefiles or license headers).

If the paths of interest can be determined as a batch before starting
the connection, this may be easier than maintaining a cross platform
connection cache in a separate process.

> Now some numbers
> ~~~~~~~~~~~~~~~~
>
> One repo has 3+ million files at tip across 500K folders with 5-6K
> active developers.  They have done a lot of work to remove large files
> from the repo so it is down to < 100GB.
>
> Before changes: clone took hours to transfer the 87GB .pack + 119MB .idx
>
> After changes: clone took 4 minutes to transfer 305MB .pack + 37MB .idx
>
> After hydrating 35K files (the typical number any individual developer
> needs to do their work), there was an additional 460 MB of loose files
> downloaded.
>
> Total savings: 86.24 GB * 6000 developers = 517 Terabytes saved!
>
> We have another repo (3.1 M files, 618 GB at tip with no history with
> 3K+ active developers) where the savings are even greater.

This is quite impressive, and shows this strategy has a lot of promise.

> Future Work
> ~~~~~~~~~~~
>
> The current prototype calls a new hook proc in sha1_object_info_extended
> and read_object, to download each missing blob.  A better solution would
> be to implement this via a long running process that is spawned on the
> first download and listens for requests to download additional objects
> until it terminates when the parent git operation exits (similar to the
> recent long running smudge and clean filter work).

Or batching these up in advance. checkout should be able to determine
which path entries from the index it wants to write to the working
tree. Once it has that set of paths it wants to write, it should be
fast to construct a subset of paths for which the blobs are not
present locally, and then pass the entire group off for download.

> Need to do more investigation into possible code paths that can trigger
> unnecessary blobs to be downloaded.  For example, we have determined
> that the rename detection logic in status can also trigger unnecessary
> blobs to be downloaded making status slow.

There isn't much of a workaround here. Only options I can see are
disabling rename detection when objects are above a certain size, or
removing entries from the rename table when the blob isn't already
local, which may yield different results than if the blob(s) were
local.

Another is to try to have actual source files always be local, and
thus we only punt on rename detection for bigger files that are more
likely to be binary, and thus less likely to match for rename[1]
unless it was SHA-1 identity match, which can be done without the
blob(s) present.

[1] I assume most really big files are some sort of media asset (e.g.
JPEG), where a change inside the source data may result in large
difference in bytes due to the compression applied by the media file
format.

> Need to investigate an alternate batching scheme where we can make a
> single request for a set of "related" blobs and receive single a
> packfile (especially during checkout).

Heh, what I just said above. Glad to see you already thought of it.

> Need to investigate adding a new endpoint in the smart protocol that can
> download both individual blobs as well as a batch of blobs.

Agreed, I said as much above. Again, glad to see you have similar ideas. :)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] Add support for downloading blobs on demand
  2017-01-13 15:52 [RFC] Add support for downloading blobs on demand Ben Peart
  2017-01-13 21:07 ` Shawn Pearce
@ 2017-01-17 18:42 ` Jeff King
  2017-01-17 21:50   ` Ben Peart
  1 sibling, 1 reply; 13+ messages in thread
From: Jeff King @ 2017-01-17 18:42 UTC (permalink / raw)
  To: Ben Peart; +Cc: git, benpeart

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=utf-8, Size: 9263 bytes --]

This is an issue I've thought a lot about. So apologies in advance that
this response turned out a bit long. :)

On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:

> Design
> ~~~~~~
> 
> Clone and fetch will pass a “--lazy-clone” flag (open to a better name 
> here) similar to “--depth” that instructs the server to only return 
> commits and trees and to ignore blobs.
> 
> Later during git operations like checkout, when a blob cannot be found
> after checking all the regular places (loose, pack, alternates, etc), 
> git will download the missing object and place it into the local object 
> store (currently as a loose object) then resume the operation.

Have you looked at the "external odb" patches I wrote a while ago, and
which Christian has been trying to resurrect?

  http://public-inbox.org/git/20161130210420.15982-1-chriscool@tuxfamily.org/

This is a similar approach, though I pushed the policy for "how do you
get the objects" out into an external script. One advantage there is
that large objects could easily be fetched from another source entirely
(e.g., S3 or equivalent) rather than the repo itself.

The downside is that it makes things more complicated, because a push or
a fetch now involves three parties (server, client, and the alternate
object store). So questions like "do I have all the objects I need" are
hard to reason about.

If you assume that there's going to be _some_ central Git repo which has
all of the objects, you might as well fetch from there (and do it over
normal git protocols). And that simplifies things a bit, at the cost of
being less flexible.

> To prevent git from accidentally downloading all missing blobs, some git
> operations are updated to be aware of the potential for missing blobs.  
> The most obvious being check_connected which will return success as if 
> everything in the requested commits is available locally.

Actually, Git is pretty good about trying not to access blobs when it
doesn't need to. The important thing is that you know enough about the
blobs to fulfill has_sha1_file() and sha1_object_info() requests without
actually fetching the data.

So the client definitely needs to have some list of which objects exist,
and which it _could_ get if it needed to.

The one place you'd probably want to tweak things is in the diff code,
as a single "git log -Sfoo" would fault in all of the blobs.

> To minimize the impact on the server, the existing dumb HTTP protocol 
> endpoint “objects/<sha>” can be used to retrieve the individual missing
> blobs when needed.

This is going to behave badly on well-packed repositories, because there
isn't a good way to fetch a single object. The best case (which is not
implemented at all in Git) is that you grab the pack .idx, then grab
"slices" of the pack corresponding to specific objects, including
hunting down delta bases.

But then next time the server repacks, you have to throw away your .idx
file. And those can be big. The .idx for linux.git is 135MB. You really
wouldn't want to do an incremental fetch of 1MB worth of objects and
have to grab the whole .idx just to figure out which bytes you needed.

You can solve this by replacing the dumb-http server with a smart one
that actually serves up the individual objects as if they were truly
sitting on the filesystem. But then you haven't really minimized impact
on the server, and you might as well teach the smart protocols to do
blob fetches.

One big hurdle to this approach, no matter the protocol, is how you are
going to handle deltas. Right now, a git client tells the server "I have
this commit, but I want this other one". And the server knows which
objects the client has from the first, and which it needs from the
second. Moreover, it knows that it can send objects in delta form
directly from disk if the other side has the delta base.

So what happens in this system? We know we don't need to send any blobs
in a regular fetch, because the whole idea is that we only send blobs on
demand. So we wait for the client to ask us for blob A. But then what do
we send? If we send the whole blob without deltas, we're going to waste
a lot of bandwidth.

The on-disk size of all of the blobs in linux.git is ~500MB. The actual
data size is ~48GB. Some of that is from zlib, which you get even for
non-deltas. But the rest of it is from the delta compression. I don't
think it's feasible to give that up, at least not for "normal" source
repos like linux.git (more on that in a minute).

So ideally you do want to send deltas. But how do you know which objects
the other side already has, which you can use as a delta base? Sending
the list of "here are the blobs I have" doesn't scale. Just the sha1s
start to add up, especially when you are doing incremental fetches.

I think this sort of things performs a lot better when you just focus on
large objects. Because they don't tend to delta well anyway, and the
savings are much bigger by avoiding ones you don't want. So a directive
like "don't bother sending blobs larger than 1MB" avoids a lot of these
issues. In other words, you have some quick shorthand to communicate
between the client and server: this what I have, and what I don't.
Normal git relies on commit reachability for that, but there are
obviously other dimensions. The key thing is that both sides be able to
express the filters succinctly, and apply them efficiently.

> After cloning, the developer can use sparse-checkout to limit the set of 
> files to the subset they need (typically only 1-10% in these large 
> repos).  This allows the initial checkout to only download the set of 
> files actually needed to complete their task.  At any point, the 
> sparse-checkout file can be updated to include additional files which 
> will be fetched transparently on demand.

If most of your benefits are not from avoiding blobs in general, but
rather just from sparsely populating the tree, then it sounds like
sparse clone might be an easier path forward. The general idea is to
restrict not just the checkout, but the actual object transfer and
reachability (in the tree dimension, the way shallow clone limits it in
the time dimension, which will require cooperation between the client
and server).

So that's another dimension of filtering, which should be expressed
pretty succinctly: "I'm interested in these paths, and not these other
ones." It's pretty easy to compute on the server side during graph
traversal (though it interacts badly with reachability bitmaps, so there
would need to be some hacks there).

It's an idea that's been talked about many times, but I don't recall
that there were ever working patches. You might dig around in the list
archive under the name "sparse clone" or possibly "narrow clone".

> Now some numbers
> ~~~~~~~~~~~~~~~~
> 
> One repo has 3+ million files at tip across 500K folders with 5-6K 
> active developers.  They have done a lot of work to remove large files 
> from the repo so it is down to < 100GB.
> 
> Before changes: clone took hours to transfer the 87GB .pack + 119MB .idx
> 
> After changes: clone took 4 minutes to transfer 305MB .pack + 37MB .idx
> 
> After hydrating 35K files (the typical number any individual developer 
> needs to do their work), there was an additional 460 MB of loose files 
> downloaded.

It sounds like you have a case where the repository has a lot of large
files that are either historical, or uninteresting the sparse-tree
dimension.

How big is that 460MB if it were actually packed with deltas?

> Future Work
> ~~~~~~~~~~~
> 
> The current prototype calls a new hook proc in sha1_object_info_extended 
> and read_object, to download each missing blob.  A better solution would 
> be to implement this via a long running process that is spawned on the 
> first download and listens for requests to download additional objects 
> until it terminates when the parent git operation exits (similar to the 
> recent long running smudge and clean filter work).

Yeah, see the external-odb discussion. Those prototypes use a process
per object, but I think we all agree after seeing how the git-lfs
interface has scaled that this is a non-starter. Recent versions of
git-lfs do the single-process thing, and I think any sort of
external-odb hook should be modeled on that protocol.

> Need to investigate an alternate batching scheme where we can make a 
> single request for a set of "related" blobs and receive single a 
> packfile (especially during checkout).

I think this sort of batching is going to be the really hard part to
retrofit onto git. Because you're throwing out the procedural notion
that you can loop over a set of objects and ask for each individually.
You have to start deferring computation until answers are ready. Some
operations can do that reasonably well (e.g., checkout), but something
like "git log -p" is constantly digging down into history. I suppose you
could just perform the skeleton of the operation _twice_, once to find
the list of objects to fault in, and the second time to actually do it.

That will make git feel a lot slower, because a lot of the illusion of
speed is the way it streams out results. OTOH, if you have to wait to
fault in objects from the network, it's going to feel pretty slow
anyway. :)

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [RFC] Add support for downloading blobs on demand
  2017-01-13 21:07 ` Shawn Pearce
@ 2017-01-17 21:50   ` Ben Peart
  2017-01-17 22:05     ` Martin Fick
  0 siblings, 1 reply; 13+ messages in thread
From: Ben Peart @ 2017-01-17 21:50 UTC (permalink / raw)
  To: 'Shawn Pearce'; +Cc: 'git', benpeart

Thanks for the encouragement, support, and good ideas to look into.

Ben

> -----Original Message-----
> From: Shawn Pearce [mailto:spearce@spearce.org]
> Sent: Friday, January 13, 2017 4:07 PM
> To: Ben Peart <peartben@gmail.com>
> Cc: git <git@vger.kernel.org>; benpeart@microsoft.com
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> On Fri, Jan 13, 2017 at 7:52 AM, Ben Peart <peartben@gmail.com> wrote:
> >
> > Goal
> > ~~~~
> >
> > To be able to better handle repos with many files that any individual
> > developer doesn’t need it would be nice if clone/fetch only brought
> > down those files that were actually needed.
> >
> > To enable that, we are proposing adding a flag to clone/fetch that
> > will instruct the server to limit the objects it sends to commits and
> > trees and to not send any blobs.
> >
> > When git performs an operation that requires a blob that isn’t
> > currently available locally, it will download the missing blob and add
> > it to the local object store.
> 
> Interesting. This is also an area I want to work on with my team at $DAY_JOB.
> Repositories are growing along multiple dimensions, and developers or
> editors don't always need all blobs for all time available locally to successfully
> perform their work.
> 
> > Design
> > ~~~~~~
> >
> > Clone and fetch will pass a “--lazy-clone” flag (open to a better name
> > here) similar to “--depth” that instructs the server to only return
> > commits and trees and to ignore blobs.
> 
> My group at $DAY_JOB hasn't talked about it yet, but I want to add a
> protocol capability that lets clone/fetch ask only for blobs smaller than a
> specified byte count. This could be set to a reasonable text file size (e.g. <= 5
> MiB) to predominately download only source files and text documentation,
> omitting larger binaries.
> 
> If the limit was set to 0, its the same as your idea to ignore all blobs.
> 

This is an interesting idea that may be an easier way to help mitigate 
the cost of very large files.  While our primary issue today is the 
sheer number of files, I'm sure at some point we'll run into issues with 
file size as well.  

> > Later during git operations like checkout, when a blob cannot be found
> > after checking all the regular places (loose, pack, alternates, etc),
> > git will download the missing object and place it into the local
> > object store (currently as a loose object) then resume the operation.
> 
> Right. I'd like to have this object retrieval be inside the native Git wire
> protocol, reusing the remote configuration and authentication setup. That
> requires expanding the server side of the protocol implementation slightly
> allowing any reachable object to be retrieved by SHA-1 alone. Bitmap indexes
> can significantly reduce the computational complexity for the server.
> 

Agree.  

> > To prevent git from accidentally downloading all missing blobs, some
> > git operations are updated to be aware of the potential for missing blobs.
> > The most obvious being check_connected which will return success as if
> > everything in the requested commits is available locally.
> 
> This ... sounds risky for the developer, as the repository may be corrupt due
> to a missing object, and the user cannot determine it.
> 
> Would it be reasonable for the server to return a list of SHA-1s it knows
> should exist, but has omitted due to the blob threshold (above), and the
> local repository store this in a binary searchable file?
> During connectivity checking its assumed OK if an object is not present in the
> object store, but is listed in this omitted objects file.
> 

Corrupt repos due to missing blobs must be pretty rare as I've never 
seen anyone report that error but for this and other reasons (see Peff's 
suggestion on how to minimize downloading unnecessary blobs) having this 
data could be valuable.  I'll add it to the list of things to look into.

> > To minimize the impact on the server, the existing dumb HTTP protocol
> > endpoint “objects/<sha>” can be used to retrieve the individual
> > missing blobs when needed.
> 
> I'd prefer this to be in the native wire protocol, where the objects are in pack
> format (which unfortunately differs from loose format). I assume servers
> would combine many objects into pack files, potentially isolating large
> uncompressable binaries into their own packs, stored separately from
> commits/trees/small-text-blobs.
> 
> I get the value of this being in HTTP, where HTTP caching inside proxies can
> be leveraged to reduce master server load. I wonder if the native wire
> protocol could be taught to use a variation of an HTTP GET that includes the
> object SHA-1 in the URL line, to retrieve a one-object pack file.
> 

You make a good point. I don't think the benefit of hitting this 
"existing" end point outweighs the many drawbacks.  Adding the ability 
to retrieve an individual blob via the native wire protocol seems a 
better plan.

> > Performance considerations
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > We found that downloading commits and trees on demand had a
> > significant negative performance impact.  In addition, many git
> > commands assume all commits and trees are available locally so they
> > quickly got pulled down anyway.  Even in very large repos the commits
> > and trees are relatively small so bringing them down with the initial
> > commit and subsequent fetch commands was reasonable.
> >
> > After cloning, the developer can use sparse-checkout to limit the set
> > of files to the subset they need (typically only 1-10% in these large
> > repos).  This allows the initial checkout to only download the set of
> > files actually needed to complete their task.  At any point, the
> > sparse-checkout file can be updated to include additional files which
> > will be fetched transparently on demand.
> >
> > Typical source files are relatively small so the overhead of
> > connecting and authenticating to the server for a single file at a
> > time is substantial.  As a result, having a long running process that
> > is started with the first request and can cache connection information
> > between requests is a significant performance win.
> 
> Junio and I talked years ago (offline, sorry no mailing list archive) about
> "narrow checkout", which is the idea of the client being able to ask for a pack
> file from the server that only includes objects along specific path names. This
> would allow a client to amortize the setup costs, and even delta compress
> source files against each other (e.g.
> boilerplate across Makefiles or license headers).
> 
> If the paths of interest can be determined as a batch before starting the
> connection, this may be easier than maintaining a cross platform connection
> cache in a separate process.
> 

We looked into sparse/narrow-clone but for a variety of reasons it 
didn't work well for our usage patterns (see my response to Peff's 
feedback for more details).

> > Now some numbers
> > ~~~~~~~~~~~~~~~~
> >
> > One repo has 3+ million files at tip across 500K folders with 5-6K
> > active developers.  They have done a lot of work to remove large files
> > from the repo so it is down to < 100GB.
> >
> > Before changes: clone took hours to transfer the 87GB .pack + 119MB
> > .idx
> >
> > After changes: clone took 4 minutes to transfer 305MB .pack + 37MB
> > .idx
> >
> > After hydrating 35K files (the typical number any individual developer
> > needs to do their work), there was an additional 460 MB of loose files
> > downloaded.
> >
> > Total savings: 86.24 GB * 6000 developers = 517 Terabytes saved!
> >
> > We have another repo (3.1 M files, 618 GB at tip with no history with
> > 3K+ active developers) where the savings are even greater.
> 
> This is quite impressive, and shows this strategy has a lot of promise.
> 
> 
> > Future Work
> > ~~~~~~~~~~~
> >
> > The current prototype calls a new hook proc in
> > sha1_object_info_extended and read_object, to download each missing
> > blob.  A better solution would be to implement this via a long running
> > process that is spawned on the first download and listens for requests
> > to download additional objects until it terminates when the parent git
> > operation exits (similar to the recent long running smudge and clean filter
> work).
> 
> Or batching these up in advance. checkout should be able to determine
> which path entries from the index it wants to write to the working tree. Once
> it has that set of paths it wants to write, it should be fast to construct a
> subset of paths for which the blobs are not present locally, and then pass the
> entire group off for download.
> 

Yes, I'm optimistic that we can optimize for the checkout case (which is 
a _very_ common case!).

> > Need to do more investigation into possible code paths that can
> > trigger unnecessary blobs to be downloaded.  For example, we have
> > determined that the rename detection logic in status can also trigger
> > unnecessary blobs to be downloaded making status slow.
> 
> There isn't much of a workaround here. Only options I can see are disabling
> rename detection when objects are above a certain size, or removing entries
> from the rename table when the blob isn't already local, which may yield
> different results than if the blob(s) were local.
> 
> Another is to try to have actual source files always be local, and thus we only
> punt on rename detection for bigger files that are more likely to be binary,
> and thus less likely to match for rename[1] unless it was SHA-1 identity
> match, which can be done without the
> blob(s) present.
> 

While large files can be a real problem, our biggest issue today is 
having a lot (millions!) of source files when any individual developer 
only needs a small percentage of them.  Git with 3+ million local files 
just doesn't perform well.

We'll see what we can come up with here - especially if we had some 
information _about_ the blob, even though we didn't have the blob itself.

> 
> [1] I assume most really big files are some sort of media asset (e.g.
> JPEG), where a change inside the source data may result in large difference in
> bytes due to the compression applied by the media file format.
> 
> > Need to investigate an alternate batching scheme where we can make a
> > single request for a set of "related" blobs and receive single a
> > packfile (especially during checkout).
> 
> Heh, what I just said above. Glad to see you already thought of it.
> 
> > Need to investigate adding a new endpoint in the smart protocol that
> > can download both individual blobs as well as a batch of blobs.
> 
> Agreed, I said as much above. Again, glad to see you have similar ideas. :)


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [RFC] Add support for downloading blobs on demand
  2017-01-17 18:42 ` Jeff King
@ 2017-01-17 21:50   ` Ben Peart
  2017-02-05 14:03     ` Christian Couder
  0 siblings, 1 reply; 13+ messages in thread
From: Ben Peart @ 2017-01-17 21:50 UTC (permalink / raw)
  To: 'Jeff King', 'Ben Peart'; +Cc: git

Thanks for the thoughtful response.  No need to appologize for the 
length, it's a tough problem to solve so I don't expect it to be handled 
with a single, short email. :)

> -----Original Message-----
> From: Jeff King [mailto:peff@peff.net]
> Sent: Tuesday, January 17, 2017 1:43 PM
> To: Ben Peart <peartben@gmail.com>
> Cc: git@vger.kernel.org; Ben Peart <Ben.Peart@microsoft.com>
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> This is an issue I've thought a lot about. So apologies in advance that this
> response turned out a bit long. :)
> 
> On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:
> 
> > Design
> > ~~~~~~
> >
> > Clone and fetch will pass a  --lazy-clone  flag (open to a better name
> > here) similar to  --depth  that instructs the server to only return
> > commits and trees and to ignore blobs.
> >
> > Later during git operations like checkout, when a blob cannot be found
> > after checking all the regular places (loose, pack, alternates, etc),
> > git will download the missing object and place it into the local
> > object store (currently as a loose object) then resume the operation.
> 
> Have you looked at the "external odb" patches I wrote a while ago, and
> which Christian has been trying to resurrect?
> 
>   https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpublic-
> inbox.org%2Fgit%2F20161130210420.15982-1-
> chriscool%40tuxfamily.org%2F&data=02%7C01%7CBen.Peart%40microsoft.c
> om%7C9596d3bf32564f123e0c08d43f08a9e1%7C72f988bf86f141af91ab2d7c
> d011db47%7C1%7C0%7C636202753822020527&sdata=a6%2BGOAQoRhjFoxS
> vftY8JZAVUssmrXuDZ9OBy3xqNZk%3D&reserved=0
> 
> This is a similar approach, though I pushed the policy for "how do you get the
> objects" out into an external script. One advantage there is that large objects
> could easily be fetched from another source entirely (e.g., S3 or equivalent)
> rather than the repo itself.
> 
> The downside is that it makes things more complicated, because a push or a
> fetch now involves three parties (server, client, and the alternate object
> store). So questions like "do I have all the objects I need" are hard to reason
> about.
> 
> If you assume that there's going to be _some_ central Git repo which has all
> of the objects, you might as well fetch from there (and do it over normal git
> protocols). And that simplifies things a bit, at the cost of being less flexible.
> 

We looked quite a bit at the external odb patches, as well as lfs and 
even using alternates.  They all share a common downside that you must 
maintain a separate service that contains _some_ of the files.  These 
files must also be versioned, replicated, backed up and the service 
itself scaled out to handle the load.  As you mentioned, having multiple 
services involved increases flexability but it also increases the 
complexity and decreases the reliability of the overall version control 
service.  

For operational simplicity, we opted to go with a design that uses a 
single, central git repo which has _all_ the objects and to focus on 
enhancing it to handle large numbers of files efficiently.  This allows 
us to focus our efforts on a great git service and to avoid having to 
build out these other services.

> > To prevent git from accidentally downloading all missing blobs, some
> > git operations are updated to be aware of the potential for missing blobs.
> > The most obvious being check_connected which will return success as if
> > everything in the requested commits is available locally.
> 
> Actually, Git is pretty good about trying not to access blobs when it doesn't
> need to. The important thing is that you know enough about the blobs to
> fulfill has_sha1_file() and sha1_object_info() requests without actually
> fetching the data.
> 
> So the client definitely needs to have some list of which objects exist, and
> which it _could_ get if it needed to.
> 
> The one place you'd probably want to tweak things is in the diff code, as a
> single "git log -Sfoo" would fault in all of the blobs.
> 

It is an interesting idea to explore how we could be smarter about 
preventing blobs from faulting in if we had enough info to fulfill 
has_sha1_file() and sha1_object_info().  Given we also heavily prune the 
working directory using sparse-checkout, this hasn't been our top focus 
but it is certainly something worth looking into.

> > To minimize the impact on the server, the existing dumb HTTP protocol
> > endpoint  objects/<sha>  can be used to retrieve the individual
> > missing blobs when needed.
> 
> This is going to behave badly on well-packed repositories, because there isn't
> a good way to fetch a single object. The best case (which is not implemented
> at all in Git) is that you grab the pack .idx, then grab "slices" of the pack
> corresponding to specific objects, including hunting down delta bases.
> 
> But then next time the server repacks, you have to throw away your .idx file.
> And those can be big. The .idx for linux.git is 135MB. You really wouldn't	
> want to do an incremental fetch of 1MB worth of objects and have to grab
> the whole .idx just to figure out which bytes you needed.
> 
> You can solve this by replacing the dumb-http server with a smart one that
> actually serves up the individual objects as if they were truly sitting on the
> filesystem. But then you haven't really minimized impact on the server, and
> you might as well teach the smart protocols to do blob fetches.
> 

Yea, we actually implemented a new endpoint that we are using to fetch 
individual blobs, I just found the dumb endpoint recently and thought 
"hey, maybe we can us this to make it easier for other git servers."  
For a number of good reasons, I don't think this is the right approach.

> 
> One big hurdle to this approach, no matter the protocol, is how you are
> going to handle deltas. Right now, a git client tells the server "I have this
> commit, but I want this other one". And the server knows which objects the
> client has from the first, and which it needs from the second. Moreover, it
> knows that it can send objects in delta form directly from disk if the other
> side has the delta base.
> 
> So what happens in this system? We know we don't need to send any blobs
> in a regular fetch, because the whole idea is that we only send blobs on
> demand. So we wait for the client to ask us for blob A. But then what do we
> send? If we send the whole blob without deltas, we're going to waste a lot of
> bandwidth.
> 
> The on-disk size of all of the blobs in linux.git is ~500MB. The actual data size
> is ~48GB. Some of that is from zlib, which you get even for non-deltas. But
> the rest of it is from the delta compression. I don't think it's feasible to give
> that up, at least not for "normal" source repos like linux.git (more on that in
> a minute).
> 
> So ideally you do want to send deltas. But how do you know which objects
> the other side already has, which you can use as a delta base? Sending the
> list of "here are the blobs I have" doesn't scale. Just the sha1s start to add
> up, especially when you are doing incremental fetches.
> 
> I think this sort of things performs a lot better when you just focus on large
> objects. Because they don't tend to delta well anyway, and the savings are
> much bigger by avoiding ones you don't want. So a directive like "don't
> bother sending blobs larger than 1MB" avoids a lot of these issues. In other
> words, you have some quick shorthand to communicate between the client
> and server: this what I have, and what I don't.
> Normal git relies on commit reachability for that, but there are obviously
> other dimensions. The key thing is that both sides be able to express the
> filters succinctly, and apply them efficiently.
> 

Our challenge has been more the sheer _number_ of files that exist in 
the repo rather than the _size_ of the files in the repo.  With >3M 
source files and any typical developer only needing a small percentage 
of those files to do their job, our focus has been pruning the tree as 
much as possible such that they only pay the cost for the files they 
actually need.  With typical text source files being 10K - 20K in size, 
the overhead of the round trip is a significant part of the overall 
transfer time so deltas don't help as much.  I agree that large files 
are also a problem but it isn't my top focus at this point in time.  

> > After cloning, the developer can use sparse-checkout to limit the set
> > of files to the subset they need (typically only 1-10% in these large
> > repos).  This allows the initial checkout to only download the set of
> > files actually needed to complete their task.  At any point, the
> > sparse-checkout file can be updated to include additional files which
> > will be fetched transparently on demand.
> 
> If most of your benefits are not from avoiding blobs in general, but rather
> just from sparsely populating the tree, then it sounds like sparse clone might
> be an easier path forward. The general idea is to restrict not just the
> checkout, but the actual object transfer and reachability (in the tree
> dimension, the way shallow clone limits it in the time dimension, which will
> require cooperation between the client and server).
> 
> So that's another dimension of filtering, which should be expressed pretty
> succinctly: "I'm interested in these paths, and not these other ones." It's
> pretty easy to compute on the server side during graph traversal (though it
> interacts badly with reachability bitmaps, so there would need to be some
> hacks there).
> 
> It's an idea that's been talked about many times, but I don't recall that there
> were ever working patches. You might dig around in the list archive under
> the name "sparse clone" or possibly "narrow clone".

While a sparse/narrow clone would work with this proposal, it isn't 
required.  You'd still probably want all the commits and trees but the 
clone would also bring down the specified blobs.  Combined with using 
"depth" you could further limit it to those blobs at tip. 

We did run into problems with this model however as our usage patterns 
are such that our working directories often contain very sparse trees 
and as a result, we can end up with thousands of entries in the sparse 
checkout file.  This makes it difficult for users to manually specify a 
sparse-checkout before they even do a clone.  We have implemented a 
hashmap based sparse-checkout to deal with the performance issues of 
having that many entries but that's a different RFC/PATCH.  In short, we 
found that a "lazy-clone" and downloading blobs on demand provided a 
better developer experience.

> 
> > Now some numbers
> > ~~~~~~~~~~~~~~~~
> >
> > One repo has 3+ million files at tip across 500K folders with 5-6K
> > active developers.  They have done a lot of work to remove large files
> > from the repo so it is down to < 100GB.
> >
> > Before changes: clone took hours to transfer the 87GB .pack + 119MB
> > .idx
> >
> > After changes: clone took 4 minutes to transfer 305MB .pack + 37MB
> > .idx
> >
> > After hydrating 35K files (the typical number any individual developer
> > needs to do their work), there was an additional 460 MB of loose files
> > downloaded.
> 
> It sounds like you have a case where the repository has a lot of large files
> that are either historical, or uninteresting the sparse-tree dimension.
> 
> How big is that 460MB if it were actually packed with deltas?
> 

Uninteresting in the sparse-tree dimension.  460 MB divided by 35K files 
is less than 13 KB per file which is fairly typical for source code.  
Given there are no versions to calculate deltas from, compressing them 
into a pack file would help some but I don't have the numbers as to how 
much.  When we get to the "future work" below and start batching up 
requests, we'll have better data on that.

> > Future Work
> > ~~~~~~~~~~~
> >
> > The current prototype calls a new hook proc in
> > sha1_object_info_extended and read_object, to download each missing
> > blob.  A better solution would be to implement this via a long running
> > process that is spawned on the first download and listens for requests
> > to download additional objects until it terminates when the parent git
> > operation exits (similar to the recent long running smudge and clean filter
> work).
> 
> Yeah, see the external-odb discussion. Those prototypes use a process per
> object, but I think we all agree after seeing how the git-lfs interface has
> scaled that this is a non-starter. Recent versions of git-lfs do the single-
> process thing, and I think any sort of external-odb hook should be modeled
> on that protocol.
> 

I'm looking into this now and plan to re-implement it this way before 
sending out the first patch series.  Glad to hear you think it is a good 
protocol to model it on.

> > Need to investigate an alternate batching scheme where we can make a
> > single request for a set of "related" blobs and receive single a
> > packfile (especially during checkout).
> 
> I think this sort of batching is going to be the really hard part to retrofit onto
> git. Because you're throwing out the procedural notion that you can loop
> over a set of objects and ask for each individually.
> You have to start deferring computation until answers are ready. Some
> operations can do that reasonably well (e.g., checkout), but something like
> "git log -p" is constantly digging down into history. I suppose you could just
> perform the skeleton of the operation _twice_, once to find the list of objects
> to fault in, and the second time to actually do it.
> 
> That will make git feel a lot slower, because a lot of the illusion of speed is
> the way it streams out results. OTOH, if you have to wait to fault in objects
> from the network, it's going to feel pretty slow anyway. :)
> 

The good news is that for most operations, git doesn't need to access to 
all the blobs.  You're right, any command that does ends up faulting in 
a bunch of blobs from the network can get pretty slow.  Sometimes you 
get streaming results and sometimes it just "hangs" while we go off 
downloading blobs in the background.  We capture telemetry to detect 
these types of issues but typically the users are more than happy to 
send us an "I just ran command 'foo' and it hung" email. :)

> -Peff


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] Add support for downloading blobs on demand
  2017-01-17 21:50   ` Ben Peart
@ 2017-01-17 22:05     ` Martin Fick
  2017-01-17 22:23       ` Stefan Beller
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Fick @ 2017-01-17 22:05 UTC (permalink / raw)
  To: Ben Peart; +Cc: 'Shawn Pearce', 'git', benpeart

On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote:
> While large files can be a real problem, our biggest issue
> today is having a lot (millions!) of source files when
> any individual developer only needs a small percentage of
> them.  Git with 3+ million local files just doesn't
> perform well.

Honestly, this sounds like a problem better dealt with by 
using git subtree or git submodules, have you considered 
that?

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] Add support for downloading blobs on demand
  2017-01-17 22:05     ` Martin Fick
@ 2017-01-17 22:23       ` Stefan Beller
  2017-01-18 18:27         ` Ben Peart
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Beller @ 2017-01-17 22:23 UTC (permalink / raw)
  To: Martin Fick; +Cc: Ben Peart, Shawn Pearce, git, benpeart

On Tue, Jan 17, 2017 at 2:05 PM, Martin Fick <mfick@codeaurora.org> wrote:
> On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote:
>> While large files can be a real problem, our biggest issue
>> today is having a lot (millions!) of source files when
>> any individual developer only needs a small percentage of
>> them.  Git with 3+ million local files just doesn't
>> perform well.
>
> Honestly, this sounds like a problem better dealt with by
> using git subtree or git submodules, have you considered
> that?
>
> -Martin
>

I cannot speak for subtrees as I have very little knowledge on them.
But there you also have the problem that *someone* has to have a
whole tree? (e.g. the build bot)

submodules however comes with a couple of things attached, both
positive as well as negative points:

* it offers ACLs along the way. ($user may not be allowed to
  clone all submodules, but only those related to the work)
* The conceptual understanding of git just got a lot harder.
  ("Yo dawg, I heard you like git, so I put git repos inside
  other git repos"), it is not easy to come up with reasonable
  defaults for all usecases, so the everyday user still has to
  have some understanding of submodules.
* typical cheap in-tree operations may become very expensive:
  -> moving a file from one location to another (in another
     submodule) adds overhead, no rename detection.
* We are actively working on submodules, so there is
  some momentum going already.
* our experiments with Android show that e.g. fetching (even
  if you have all of Android) becomes a lot faster for everyday
  usage as only a few repositories change each day). This
  comparision was against the repo tool, that we currently
  use for Android. I do not know how it would compare against
  single repo Git, as having such a large repository seemed
  complicated.
* the support for submodules in Git is already there, though
  not polished. The positive side is to have already a good base,
  the negative side is to have support current use cases.

Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [RFC] Add support for downloading blobs on demand
  2017-01-17 22:23       ` Stefan Beller
@ 2017-01-18 18:27         ` Ben Peart
  0 siblings, 0 replies; 13+ messages in thread
From: Ben Peart @ 2017-01-18 18:27 UTC (permalink / raw)
  To: 'Stefan Beller', 'Martin Fick'
  Cc: 'Shawn Pearce', 'git', benpeart

We actually pursued trying to make submodules work for some time and 
even built tooling around trying to work around some of the issues we 
ran into (not repo.py but along a similar line) before determining that 
we would be better served by having a single repo and solving the scale 
issues.  I don't want to rehash the arguments for/against a single repo 
- suffice it to say, we have opted for a single large repo. :)

Thanks,

Ben
> -----Original Message-----
> From: Stefan Beller [mailto:sbeller@google.com]
> Sent: Tuesday, January 17, 2017 5:24 PM
> To: Martin Fick <mfick@codeaurora.org>
> Cc: Ben Peart <peartben@gmail.com>; Shawn Pearce
> <spearce@spearce.org>; git <git@vger.kernel.org>;
> benpeart@microsoft.com
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> On Tue, Jan 17, 2017 at 2:05 PM, Martin Fick <mfick@codeaurora.org>
> wrote:
> > On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote:
> >> While large files can be a real problem, our biggest issue today is
> >> having a lot (millions!) of source files when any individual
> >> developer only needs a small percentage of them.  Git with 3+ million
> >> local files just doesn't perform well.
> >
> > Honestly, this sounds like a problem better dealt with by using git
> > subtree or git submodules, have you considered that?
> >
> > -Martin
> >
> 
> I cannot speak for subtrees as I have very little knowledge on them.
> But there you also have the problem that *someone* has to have a whole
> tree? (e.g. the build bot)
> 
> submodules however comes with a couple of things attached, both positive
> as well as negative points:
> 
> * it offers ACLs along the way. ($user may not be allowed to
>   clone all submodules, but only those related to the work)
> * The conceptual understanding of git just got a lot harder.
>   ("Yo dawg, I heard you like git, so I put git repos inside
>   other git repos"), it is not easy to come up with reasonable
>   defaults for all usecases, so the everyday user still has to
>   have some understanding of submodules.
> * typical cheap in-tree operations may become very expensive:
>   -> moving a file from one location to another (in another
>      submodule) adds overhead, no rename detection.
> * We are actively working on submodules, so there is
>   some momentum going already.
> * our experiments with Android show that e.g. fetching (even
>   if you have all of Android) becomes a lot faster for everyday
>   usage as only a few repositories change each day). This
>   comparision was against the repo tool, that we currently
>   use for Android. I do not know how it would compare against
>   single repo Git, as having such a large repository seemed
>   complicated.
> * the support for submodules in Git is already there, though
>   not polished. The positive side is to have already a good base,
>   the negative side is to have support current use cases.
> 
> Stefan


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] Add support for downloading blobs on demand
  2017-01-17 21:50   ` Ben Peart
@ 2017-02-05 14:03     ` Christian Couder
  2017-02-07 18:21       ` Ben Peart
  0 siblings, 1 reply; 13+ messages in thread
From: Christian Couder @ 2017-02-05 14:03 UTC (permalink / raw)
  To: Ben Peart; +Cc: Jeff King, git, Johannes Schindelin

(Sorry for the late reply and thanks to Dscho for pointing me to this thread.)

On Tue, Jan 17, 2017 at 10:50 PM, Ben Peart <peartben@gmail.com> wrote:
>> From: Jeff King [mailto:peff@peff.net]
>> On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:
>>
>> > Clone and fetch will pass a  --lazy-clone  flag (open to a better name
>> > here) similar to  --depth  that instructs the server to only return
>> > commits and trees and to ignore blobs.
>> >
>> > Later during git operations like checkout, when a blob cannot be found
>> > after checking all the regular places (loose, pack, alternates, etc),
>> > git will download the missing object and place it into the local
>> > object store (currently as a loose object) then resume the operation.
>>
>> Have you looked at the "external odb" patches I wrote a while ago, and
>> which Christian has been trying to resurrect?
>>
>>   https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpublic-
>> inbox.org%2Fgit%2F20161130210420.15982-1-
>> chriscool%40tuxfamily.org%2F&data=02%7C01%7CBen.Peart%40microsoft.c
>> om%7C9596d3bf32564f123e0c08d43f08a9e1%7C72f988bf86f141af91ab2d7c
>> d011db47%7C1%7C0%7C636202753822020527&sdata=a6%2BGOAQoRhjFoxS
>> vftY8JZAVUssmrXuDZ9OBy3xqNZk%3D&reserved=0
>>
>> This is a similar approach, though I pushed the policy for "how do you get the
>> objects" out into an external script. One advantage there is that large objects
>> could easily be fetched from another source entirely (e.g., S3 or equivalent)
>> rather than the repo itself.
>>
>> The downside is that it makes things more complicated, because a push or a
>> fetch now involves three parties (server, client, and the alternate object
>> store). So questions like "do I have all the objects I need" are hard to reason
>> about.
>>
>> If you assume that there's going to be _some_ central Git repo which has all
>> of the objects, you might as well fetch from there (and do it over normal git
>> protocols). And that simplifies things a bit, at the cost of being less flexible.
>
> We looked quite a bit at the external odb patches, as well as lfs and
> even using alternates.  They all share a common downside that you must
> maintain a separate service that contains _some_ of the files.

Pushing the policy for "how do you get the objects" out into an
external helper doesn't mean that the external helper cannot use the
main service.
The external helper is still free to do whatever it wants including
calling the main service if it thinks it's better.

> These
> files must also be versioned, replicated, backed up and the service
> itself scaled out to handle the load.  As you mentioned, having multiple
> services involved increases flexability but it also increases the
> complexity and decreases the reliability of the overall version control
> service.

About reliability, I think it depends a lot on the use case. If you
want to get very big files over an unreliable connection, it can
better if you send those big files over a restartable protocol and
service like HTTP/S on a regular web server.

> For operational simplicity, we opted to go with a design that uses a
> single, central git repo which has _all_ the objects and to focus on
> enhancing it to handle large numbers of files efficiently.  This allows
> us to focus our efforts on a great git service and to avoid having to
> build out these other services.

Ok, but I don't think it prevents you from using at least some of the
same mechanisms that the external odb series is using.
And reducing the number of mechanisms in Git itself is great for its
maintainability and simplicity.

>> > To prevent git from accidentally downloading all missing blobs, some
>> > git operations are updated to be aware of the potential for missing blobs.
>> > The most obvious being check_connected which will return success as if
>> > everything in the requested commits is available locally.
>>
>> Actually, Git is pretty good about trying not to access blobs when it doesn't
>> need to. The important thing is that you know enough about the blobs to
>> fulfill has_sha1_file() and sha1_object_info() requests without actually
>> fetching the data.
>>
>> So the client definitely needs to have some list of which objects exist, and
>> which it _could_ get if it needed to.

Yeah, and the external odb series handles that already, thanks to
Peff's initial work.

>> The one place you'd probably want to tweak things is in the diff code, as a
>> single "git log -Sfoo" would fault in all of the blobs.
>
> It is an interesting idea to explore how we could be smarter about
> preventing blobs from faulting in if we had enough info to fulfill
> has_sha1_file() and sha1_object_info().  Given we also heavily prune the
> working directory using sparse-checkout, this hasn't been our top focus
> but it is certainly something worth looking into.

The external odb series doesn't handle preventing blobs from faulting
in yet, so this could be a common problem.

[...]

>> One big hurdle to this approach, no matter the protocol, is how you are
>> going to handle deltas. Right now, a git client tells the server "I have this
>> commit, but I want this other one". And the server knows which objects the
>> client has from the first, and which it needs from the second. Moreover, it
>> knows that it can send objects in delta form directly from disk if the other
>> side has the delta base.
>>
>> So what happens in this system? We know we don't need to send any blobs
>> in a regular fetch, because the whole idea is that we only send blobs on
>> demand. So we wait for the client to ask us for blob A. But then what do we
>> send? If we send the whole blob without deltas, we're going to waste a lot of
>> bandwidth.
>>
>> The on-disk size of all of the blobs in linux.git is ~500MB. The actual data size
>> is ~48GB. Some of that is from zlib, which you get even for non-deltas. But
>> the rest of it is from the delta compression. I don't think it's feasible to give
>> that up, at least not for "normal" source repos like linux.git (more on that in
>> a minute).
>>
>> So ideally you do want to send deltas. But how do you know which objects
>> the other side already has, which you can use as a delta base? Sending the
>> list of "here are the blobs I have" doesn't scale. Just the sha1s start to add
>> up, especially when you are doing incremental fetches.

To initialize some paths that the client wants, it could perhaps just
ask for some pack files, or maybe bundle files, related to these
paths.
Those packs or bundles could be downloaded either directly from the
main server or from other web or proxy servers.

>> I think this sort of things performs a lot better when you just focus on large
>> objects. Because they don't tend to delta well anyway, and the savings are
>> much bigger by avoiding ones you don't want. So a directive like "don't
>> bother sending blobs larger than 1MB" avoids a lot of these issues. In other
>> words, you have some quick shorthand to communicate between the client
>> and server: this what I have, and what I don't.
>> Normal git relies on commit reachability for that, but there are obviously
>> other dimensions. The key thing is that both sides be able to express the
>> filters succinctly, and apply them efficiently.
>
> Our challenge has been more the sheer _number_ of files that exist in
> the repo rather than the _size_ of the files in the repo.  With >3M
> source files and any typical developer only needing a small percentage
> of those files to do their job, our focus has been pruning the tree as
> much as possible such that they only pay the cost for the files they
> actually need.  With typical text source files being 10K - 20K in size,
> the overhead of the round trip is a significant part of the overall
> transfer time so deltas don't help as much.  I agree that large files
> are also a problem but it isn't my top focus at this point in time.

Ok, but it would be nice if both problems could be solved using some
common mechanisms.
This way it could probably work better in situations where there are
both a large number of files _and_ some big files.
And from what I am seeing, there could be no real downside from using
some common mechanisms.

>> If most of your benefits are not from avoiding blobs in general, but rather
>> just from sparsely populating the tree, then it sounds like sparse clone might
>> be an easier path forward. The general idea is to restrict not just the
>> checkout, but the actual object transfer and reachability (in the tree
>> dimension, the way shallow clone limits it in the time dimension, which will
>> require cooperation between the client and server).
>>
>> So that's another dimension of filtering, which should be expressed pretty
>> succinctly: "I'm interested in these paths, and not these other ones." It's
>> pretty easy to compute on the server side during graph traversal (though it
>> interacts badly with reachability bitmaps, so there would need to be some
>> hacks there).
>>
>> It's an idea that's been talked about many times, but I don't recall that there
>> were ever working patches. You might dig around in the list archive under
>> the name "sparse clone" or possibly "narrow clone".
>
> While a sparse/narrow clone would work with this proposal, it isn't
> required.  You'd still probably want all the commits and trees but the
> clone would also bring down the specified blobs.  Combined with using
> "depth" you could further limit it to those blobs at tip.
>
> We did run into problems with this model however as our usage patterns
> are such that our working directories often contain very sparse trees
> and as a result, we can end up with thousands of entries in the sparse
> checkout file.  This makes it difficult for users to manually specify a
> sparse-checkout before they even do a clone.  We have implemented a
> hashmap based sparse-checkout to deal with the performance issues of
> having that many entries but that's a different RFC/PATCH.  In short, we
> found that a "lazy-clone" and downloading blobs on demand provided a
> better developer experience.

I think both ways are possible using the external odb mechanism.

>> > Future Work
>> > ~~~~~~~~~~~
>> >
>> > The current prototype calls a new hook proc in
>> > sha1_object_info_extended and read_object, to download each missing
>> > blob.  A better solution would be to implement this via a long running
>> > process that is spawned on the first download and listens for requests
>> > to download additional objects until it terminates when the parent git
>> > operation exits (similar to the recent long running smudge and clean filter
>> work).
>>
>> Yeah, see the external-odb discussion. Those prototypes use a process per
>> object, but I think we all agree after seeing how the git-lfs interface has
>> scaled that this is a non-starter. Recent versions of git-lfs do the single-
>> process thing, and I think any sort of external-odb hook should be modeled
>> on that protocol.

I agree that the git-lfs scaling work is great, but I think it's not
necessary in the external odb work to have the same kind of
single-process protocol from the beginning (though it should be
possible and easy to add it).
For example if the external odb work can be used or extended to handle
restartable clone by downloading a single bundle when cloning, this
would not need that kind of protocol.

> I'm looking into this now and plan to re-implement it this way before
> sending out the first patch series.  Glad to hear you think it is a good
> protocol to model it on.

Yeah, for your use case on Windows, it looks really worth it to use
this kind of protocol.

>> > Need to investigate an alternate batching scheme where we can make a
>> > single request for a set of "related" blobs and receive single a
>> > packfile (especially during checkout).
>>
>> I think this sort of batching is going to be the really hard part to retrofit onto
>> git. Because you're throwing out the procedural notion that you can loop
>> over a set of objects and ask for each individually.
>> You have to start deferring computation until answers are ready. Some
>> operations can do that reasonably well (e.g., checkout), but something like
>> "git log -p" is constantly digging down into history. I suppose you could just
>> perform the skeleton of the operation _twice_, once to find the list of objects
>> to fault in, and the second time to actually do it.

In my opinion, perhaps we can just prevent "git log -p" from faulting
in blobs and have it show a warning saying that it was performed only
on a subset of all the blobs.

[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [RFC] Add support for downloading blobs on demand
  2017-02-05 14:03     ` Christian Couder
@ 2017-02-07 18:21       ` Ben Peart
  2017-02-07 21:56         ` Jakub Narębski
  2017-02-23 15:39         ` Ben Peart
  0 siblings, 2 replies; 13+ messages in thread
From: Ben Peart @ 2017-02-07 18:21 UTC (permalink / raw)
  To: 'Christian Couder'
  Cc: 'Jeff King', 'git', 'Johannes Schindelin',
	Ben Peart

No worries about a late response, I'm sure this is the start of a long conversation. :)

> -----Original Message-----
> From: Christian Couder [mailto:christian.couder@gmail.com]
> Sent: Sunday, February 5, 2017 9:04 AM
> To: Ben Peart <peartben@gmail.com>
> Cc: Jeff King <peff@peff.net>; git <git@vger.kernel.org>; Johannes Schindelin
> <Johannes.Schindelin@gmx.de>
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> (Sorry for the late reply and thanks to Dscho for pointing me to this thread.)
> 
> On Tue, Jan 17, 2017 at 10:50 PM, Ben Peart <peartben@gmail.com> wrote:
> >> From: Jeff King [mailto:peff@peff.net] On Fri, Jan 13, 2017 at
> >> 10:52:53AM -0500, Ben Peart wrote:
> >>
> >> > Clone and fetch will pass a  --lazy-clone  flag (open to a better
> >> > name
> >> > here) similar to  --depth  that instructs the server to only return
> >> > commits and trees and to ignore blobs.
> >> >
> >> > Later during git operations like checkout, when a blob cannot be
> >> > found after checking all the regular places (loose, pack,
> >> > alternates, etc), git will download the missing object and place it
> >> > into the local object store (currently as a loose object) then resume the
> operation.
> >>
> >> Have you looked at the "external odb" patches I wrote a while ago,
> >> and which Christian has been trying to resurrect?
> >>
> >>
> >> https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpubli
> >> c-
> >> inbox.org%2Fgit%2F20161130210420.15982-1-
> >>
> chriscool%40tuxfamily.org%2F&data=02%7C01%7CBen.Peart%40microsoft.c
> >>
> om%7C9596d3bf32564f123e0c08d43f08a9e1%7C72f988bf86f141af91ab2d7c
> >>
> d011db47%7C1%7C0%7C636202753822020527&sdata=a6%2BGOAQoRhjFoxS
> >> vftY8JZAVUssmrXuDZ9OBy3xqNZk%3D&reserved=0
> >>
> >> This is a similar approach, though I pushed the policy for "how do
> >> you get the objects" out into an external script. One advantage there
> >> is that large objects could easily be fetched from another source
> >> entirely (e.g., S3 or equivalent) rather than the repo itself.
> >>
> >> The downside is that it makes things more complicated, because a push
> >> or a fetch now involves three parties (server, client, and the
> >> alternate object store). So questions like "do I have all the objects
> >> I need" are hard to reason about.
> >>
> >> If you assume that there's going to be _some_ central Git repo which
> >> has all of the objects, you might as well fetch from there (and do it
> >> over normal git protocols). And that simplifies things a bit, at the cost of
> being less flexible.
> >
> > We looked quite a bit at the external odb patches, as well as lfs and
> > even using alternates.  They all share a common downside that you must
> > maintain a separate service that contains _some_ of the files.
> 
> Pushing the policy for "how do you get the objects" out into an external
> helper doesn't mean that the external helper cannot use the main service.
> The external helper is still free to do whatever it wants including calling the
> main service if it thinks it's better.

That is a good point and you're correct, that means you can avoid having to build out multiple services.

> 
> > These
> > files must also be versioned, replicated, backed up and the service
> > itself scaled out to handle the load.  As you mentioned, having
> > multiple services involved increases flexability but it also increases
> > the complexity and decreases the reliability of the overall version
> > control service.
> 
> About reliability, I think it depends a lot on the use case. If you want to get
> very big files over an unreliable connection, it can better if you send those big
> files over a restartable protocol and service like HTTP/S on a regular web
> server.
> 

My primary concern about reliability was the multiplicative effect of making multiple requests across multiple servers to complete a single request.  Having putting this all in a single service like you suggested above brings us back to parity on the complexity.

> > For operational simplicity, we opted to go with a design that uses a
> > single, central git repo which has _all_ the objects and to focus on
> > enhancing it to handle large numbers of files efficiently.  This
> > allows us to focus our efforts on a great git service and to avoid
> > having to build out these other services.
> 
> Ok, but I don't think it prevents you from using at least some of the same
> mechanisms that the external odb series is using.
> And reducing the number of mechanisms in Git itself is great for its
> maintainability and simplicity.

I completely agree with the goal of reducing the number of mechanisms in Git itself.  Our proposal is primarily targeting speeding operations when dealing with large numbers of files.  ObjectDB is primarily targeting large objects but there is a lot of similarity in how we're approaching the solution.  I hope/believe we can come to a common solution that will solve both.

> 
> >> > To prevent git from accidentally downloading all missing blobs,
> >> > some git operations are updated to be aware of the potential for
> missing blobs.
> >> > The most obvious being check_connected which will return success as
> >> > if everything in the requested commits is available locally.
> >>
> >> Actually, Git is pretty good about trying not to access blobs when it
> >> doesn't need to. The important thing is that you know enough about
> >> the blobs to fulfill has_sha1_file() and sha1_object_info() requests
> >> without actually fetching the data.
> >>
> >> So the client definitely needs to have some list of which objects
> >> exist, and which it _could_ get if it needed to.
> 
> Yeah, and the external odb series handles that already, thanks to Peff's initial
> work.
> 

I'm currently working on a patch series that will reimplement our current read-object hook to use the LFS model for long running background processes.  As part of that, I am building a versioned interface that will support multiple commands (like get, have, put).  In my initial implementation, I'm only supporting the "get" verb as that is what we currently need but my intent is to build it so that we could add have and put in future versions.  When I have the first iteration ready, I'll push it up to our fork on github for review as code is clearer than my description in email.

Moving forward, the "have" verb is a little problematic as we would "have" 3+ million shas that we'd be required to fetch from the server and then pass along to git when requested.  It would be nice to come up with a way to avoid or reduce that cost.

> >> The one place you'd probably want to tweak things is in the diff
> >> code, as a single "git log -Sfoo" would fault in all of the blobs.
> >
> > It is an interesting idea to explore how we could be smarter about
> > preventing blobs from faulting in if we had enough info to fulfill
> > has_sha1_file() and sha1_object_info().  Given we also heavily prune
> > the working directory using sparse-checkout, this hasn't been our top
> > focus but it is certainly something worth looking into.
> 
> The external odb series doesn't handle preventing blobs from faulting in yet,
> so this could be a common problem.
> 

Agreed.  This is one we've been working on quite a bit out of necessity.  If you look at our patch series, most of the changes are related to dealing with missing objects.

> [...]
> 
> >> One big hurdle to this approach, no matter the protocol, is how you
> >> are going to handle deltas. Right now, a git client tells the server
> >> "I have this commit, but I want this other one". And the server knows
> >> which objects the client has from the first, and which it needs from
> >> the second. Moreover, it knows that it can send objects in delta form
> >> directly from disk if the other side has the delta base.
> >>
> >> So what happens in this system? We know we don't need to send any
> >> blobs in a regular fetch, because the whole idea is that we only send
> >> blobs on demand. So we wait for the client to ask us for blob A. But
> >> then what do we send? If we send the whole blob without deltas, we're
> >> going to waste a lot of bandwidth.
> >>
> >> The on-disk size of all of the blobs in linux.git is ~500MB. The
> >> actual data size is ~48GB. Some of that is from zlib, which you get
> >> even for non-deltas. But the rest of it is from the delta
> >> compression. I don't think it's feasible to give that up, at least
> >> not for "normal" source repos like linux.git (more on that in a minute).
> >>
> >> So ideally you do want to send deltas. But how do you know which
> >> objects the other side already has, which you can use as a delta
> >> base? Sending the list of "here are the blobs I have" doesn't scale.
> >> Just the sha1s start to add up, especially when you are doing incremental
> fetches.
> 
> To initialize some paths that the client wants, it could perhaps just ask for
> some pack files, or maybe bundle files, related to these paths.
> Those packs or bundles could be downloaded either directly from the main
> server or from other web or proxy servers.
> 
> >> I think this sort of things performs a lot better when you just focus
> >> on large objects. Because they don't tend to delta well anyway, and
> >> the savings are much bigger by avoiding ones you don't want. So a
> >> directive like "don't bother sending blobs larger than 1MB" avoids a
> >> lot of these issues. In other words, you have some quick shorthand to
> >> communicate between the client and server: this what I have, and what I
> don't.
> >> Normal git relies on commit reachability for that, but there are
> >> obviously other dimensions. The key thing is that both sides be able
> >> to express the filters succinctly, and apply them efficiently.
> >
> > Our challenge has been more the sheer _number_ of files that exist in
> > the repo rather than the _size_ of the files in the repo.  With >3M
> > source files and any typical developer only needing a small percentage
> > of those files to do their job, our focus has been pruning the tree as
> > much as possible such that they only pay the cost for the files they
> > actually need.  With typical text source files being 10K - 20K in
> > size, the overhead of the round trip is a significant part of the
> > overall transfer time so deltas don't help as much.  I agree that
> > large files are also a problem but it isn't my top focus at this point in time.
> 
> Ok, but it would be nice if both problems could be solved using some
> common mechanisms.
> This way it could probably work better in situations where there are both a
> large number of files _and_ some big files.
> And from what I am seeing, there could be no real downside from using
> some common mechanisms.
> 

Agree completely.  I'm hopeful that we can come up with some common mechanisms that will allow us to solve both problems.

> >> If most of your benefits are not from avoiding blobs in general, but
> >> rather just from sparsely populating the tree, then it sounds like
> >> sparse clone might be an easier path forward. The general idea is to
> >> restrict not just the checkout, but the actual object transfer and
> >> reachability (in the tree dimension, the way shallow clone limits it
> >> in the time dimension, which will require cooperation between the client
> and server).
> >>
> >> So that's another dimension of filtering, which should be expressed
> >> pretty
> >> succinctly: "I'm interested in these paths, and not these other
> >> ones." It's pretty easy to compute on the server side during graph
> >> traversal (though it interacts badly with reachability bitmaps, so
> >> there would need to be some hacks there).
> >>
> >> It's an idea that's been talked about many times, but I don't recall
> >> that there were ever working patches. You might dig around in the
> >> list archive under the name "sparse clone" or possibly "narrow clone".
> >
> > While a sparse/narrow clone would work with this proposal, it isn't
> > required.  You'd still probably want all the commits and trees but the
> > clone would also bring down the specified blobs.  Combined with using
> > "depth" you could further limit it to those blobs at tip.
> >
> > We did run into problems with this model however as our usage patterns
> > are such that our working directories often contain very sparse trees
> > and as a result, we can end up with thousands of entries in the sparse
> > checkout file.  This makes it difficult for users to manually specify
> > a sparse-checkout before they even do a clone.  We have implemented a
> > hashmap based sparse-checkout to deal with the performance issues of
> > having that many entries but that's a different RFC/PATCH.  In short,
> > we found that a "lazy-clone" and downloading blobs on demand provided
> > a better developer experience.
> 
> I think both ways are possible using the external odb mechanism.
> 
> >> > Future Work
> >> > ~~~~~~~~~~~
> >> >
> >> > The current prototype calls a new hook proc in
> >> > sha1_object_info_extended and read_object, to download each missing
> >> > blob.  A better solution would be to implement this via a long
> >> > running process that is spawned on the first download and listens
> >> > for requests to download additional objects until it terminates
> >> > when the parent git operation exits (similar to the recent long
> >> > running smudge and clean filter
> >> work).
> >>
> >> Yeah, see the external-odb discussion. Those prototypes use a process
> >> per object, but I think we all agree after seeing how the git-lfs
> >> interface has scaled that this is a non-starter. Recent versions of
> >> git-lfs do the single- process thing, and I think any sort of
> >> external-odb hook should be modeled on that protocol.
> 
> I agree that the git-lfs scaling work is great, but I think it's not necessary in the
> external odb work to have the same kind of single-process protocol from the
> beginning (though it should be possible and easy to add it).
> For example if the external odb work can be used or extended to handle
> restartable clone by downloading a single bundle when cloning, this would
> not need that kind of protocol.
> 
> > I'm looking into this now and plan to re-implement it this way before
> > sending out the first patch series.  Glad to hear you think it is a
> > good protocol to model it on.
> 
> Yeah, for your use case on Windows, it looks really worth it to use this kind
> of protocol.
> 
> >> > Need to investigate an alternate batching scheme where we can make
> >> > a single request for a set of "related" blobs and receive single a
> >> > packfile (especially during checkout).
> >>
> >> I think this sort of batching is going to be the really hard part to
> >> retrofit onto git. Because you're throwing out the procedural notion
> >> that you can loop over a set of objects and ask for each individually.
> >> You have to start deferring computation until answers are ready. Some
> >> operations can do that reasonably well (e.g., checkout), but
> >> something like "git log -p" is constantly digging down into history.
> >> I suppose you could just perform the skeleton of the operation
> >> _twice_, once to find the list of objects to fault in, and the second time to
> actually do it.
> 
> In my opinion, perhaps we can just prevent "git log -p" from faulting in blobs
> and have it show a warning saying that it was performed only on a subset of
> all the blobs.
> 

You might be surprised at how many other places end up faulting in blobs. :)  Rename detection is one we've recently been working on.

> [...]


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] Add support for downloading blobs on demand
  2017-02-07 18:21       ` Ben Peart
@ 2017-02-07 21:56         ` Jakub Narębski
  2017-02-08  2:18           ` Ben Peart
  2017-02-23 15:39         ` Ben Peart
  1 sibling, 1 reply; 13+ messages in thread
From: Jakub Narębski @ 2017-02-07 21:56 UTC (permalink / raw)
  To: Ben Peart, 'Christian Couder'
  Cc: 'Jeff King', 'git', 'Johannes Schindelin',
	Ben Peart

I'd like to point to two (or rather one and a half) solutions that I got
aware of when watching streaming of "Git Merge 2017"[0].  There should
be here people who were there; and hopefully video of those presentations
and slides / notes would be soon available.

[0]: http://git-merge.com/

First tool that I'd like to point to is Git Virtual File System, or
GVFS in short (which unfortunately shares abbreviation with GNOME Virtual
File System).

The presentation was "Scaling Git at Microsoft" by Saeed Noursalehi, 
Microsoft.  You can read about this solution in ArsTechnica article[1],
and on Microsoft blog[2].  The code (or early version of thereof) is
also available[3] - I wonder why on GitHub and not Codeplex...

[1]: https://arstechnica.com/information-technology/2017/02/microsoft-hosts-the-windows-source-in-a-monstrous-300gb-git-repository/
[2]: https://blogs.msdn.microsoft.com/visualstudioalm/2017/02/03/announcing-gvfs-git-virtual-file-system/
[3]: https://github.com/Microsoft/GVFS

The second presentation that might be of some interest is "Scaling
Mercurial at Facebook: Insights from the Other Side" by Durham Goode,
Facebook.  The code is supposedly available as open-source; though
I don't know how useful their 'blob storage' solution would be of use
for your problem.

HTH
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [RFC] Add support for downloading blobs on demand
  2017-02-07 21:56         ` Jakub Narębski
@ 2017-02-08  2:18           ` Ben Peart
  0 siblings, 0 replies; 13+ messages in thread
From: Ben Peart @ 2017-02-08  2:18 UTC (permalink / raw)
  To: 'Jakub Narębski', 'Christian Couder'
  Cc: 'Jeff King', 'git', 'Johannes Schindelin',
	'Ben Peart'

Thanks Jakub.  

Just so you are aware, this isn't a separate effort, it actually is the same effort as the GVFS effort from Microsoft.  For pragmatic reasons, we implemented the lazy clone support and on demand object downloading in our own codebase (GVFS) first and are now are working to move it into git natively so that it will be available everywhere git is available.  This RFC is just one step in that process.

As we mentioned at Git Merge, we looked into Mercurial but settled on Git as our version control solution.  We are, however, in active communication with the team from Facebook to share ideas.

Ben

> -----Original Message-----
> From: Jakub Narębski [mailto:jnareb@gmail.com]
> Sent: Tuesday, February 7, 2017 4:57 PM
> To: Ben Peart <peartben@gmail.com>; 'Christian Couder'
> <christian.couder@gmail.com>
> Cc: 'Jeff King' <peff@peff.net>; 'git' <git@vger.kernel.org>; 'Johannes
> Schindelin' <Johannes.Schindelin@gmx.de>; Ben Peart
> <benpeart@microsoft.com>
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> I'd like to point to two (or rather one and a half) solutions that I got aware of
> when watching streaming of "Git Merge 2017"[0].  There should be here
> people who were there; and hopefully video of those presentations and
> slides / notes would be soon available.
> 
> [0]: http://git-merge.com/
> 
> First tool that I'd like to point to is Git Virtual File System, or GVFS in short
> (which unfortunately shares abbreviation with GNOME Virtual File System).
> 
> The presentation was "Scaling Git at Microsoft" by Saeed Noursalehi,
> Microsoft.  You can read about this solution in ArsTechnica article[1], and on
> Microsoft blog[2].  The code (or early version of thereof) is also available[3] -
> I wonder why on GitHub and not Codeplex...
> 
> [1]: https://arstechnica.com/information-technology/2017/02/microsoft-
> hosts-the-windows-source-in-a-monstrous-300gb-git-repository/
> [2]:
> https://blogs.msdn.microsoft.com/visualstudioalm/2017/02/03/announcing-
> gvfs-git-virtual-file-system/
> [3]: https://github.com/Microsoft/GVFS
> 
> 
> The second presentation that might be of some interest is "Scaling Mercurial
> at Facebook: Insights from the Other Side" by Durham Goode, Facebook.
> The code is supposedly available as open-source; though I don't know how
> useful their 'blob storage' solution would be of use for your problem.
> 
> 
> HTH
> --
> Jakub Narębski



^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [RFC] Add support for downloading blobs on demand
  2017-02-07 18:21       ` Ben Peart
  2017-02-07 21:56         ` Jakub Narębski
@ 2017-02-23 15:39         ` Ben Peart
  1 sibling, 0 replies; 13+ messages in thread
From: Ben Peart @ 2017-02-23 15:39 UTC (permalink / raw)
  To: 'Christian Couder'
  Cc: 'Jeff King', 'git', 'Johannes Schindelin',
	'Ben Peart'

I've completed the work of switching our read_object proposal to use a 
background process (refactored from the LFS code) and have extricated it
from the rest of our GVFS fork so that it can be examined/tested
separately.  It is currently based on a Git For Windows fork that I've
pushed to GitHub for anyone who is interested in viewing it at:

https://github.com/benpeart/git/tree/read-object-process

After some additional conversations with Christian, we're working to
combine our RFC/patch series into a single solution that should meet the
requirements of both.

The combined solution needs to have an "info" function which requests
info about a single object instead of a "have" function which must
return information on all objects the ODB knows as this doesn't scale
when the number of objects is large.  

This means the "info" call has to be fast so spawning a process on every
call won't work.  The background process with a versioned interface that
allows you to negotiate capabilities should solve this problem.

Ben

> -----Original Message-----
> From: Ben Peart [mailto:peartben@gmail.com]
> Sent: Tuesday, February 7, 2017 1:21 PM
> To: 'Christian Couder' <christian.couder@gmail.com>
> Cc: 'Jeff King' <peff@peff.net>; 'git' <git@vger.kernel.org>; 'Johannes
> Schindelin' <Johannes.Schindelin@gmx.de>; Ben Peart
> <benpeart@microsoft.com>
> Subject: RE: [RFC] Add support for downloading blobs on demand
> 
> No worries about a late response, I'm sure this is the start of a long
> conversation. :)
> 
> > -----Original Message-----
> > From: Christian Couder [mailto:christian.couder@gmail.com]
> > Sent: Sunday, February 5, 2017 9:04 AM
> > To: Ben Peart <peartben@gmail.com>
> > Cc: Jeff King <peff@peff.net>; git <git@vger.kernel.org>; Johannes
> > Schindelin <Johannes.Schindelin@gmx.de>
> > Subject: Re: [RFC] Add support for downloading blobs on demand
> >
> > (Sorry for the late reply and thanks to Dscho for pointing me to this
> > thread.)
> >
> > On Tue, Jan 17, 2017 at 10:50 PM, Ben Peart <peartben@gmail.com>
> wrote:
> > >> From: Jeff King [mailto:peff@peff.net] On Fri, Jan 13, 2017 at
> > >> 10:52:53AM -0500, Ben Peart wrote:
> > >>
> > >> > Clone and fetch will pass a  --lazy-clone  flag (open to a better
> > >> > name
> > >> > here) similar to  --depth  that instructs the server to only
> > >> > return commits and trees and to ignore blobs.
> > >> >
> > >> > Later during git operations like checkout, when a blob cannot be
> > >> > found after checking all the regular places (loose, pack,
> > >> > alternates, etc), git will download the missing object and place
> > >> > it into the local object store (currently as a loose object) then
> > >> > resume the
> > operation.
> > >>
> > >> Have you looked at the "external odb" patches I wrote a while ago,
> > >> and which Christian has been trying to resurrect?
> > >>
> > >>
> > >>
> https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpub
> > >> li
> > >> c-
> > >> inbox.org%2Fgit%2F20161130210420.15982-1-
> > >>
> >
> chriscool%40tuxfamily.org%2F&data=02%7C01%7CBen.Peart%40microsoft.c
> > >>
> >
> om%7C9596d3bf32564f123e0c08d43f08a9e1%7C72f988bf86f141af91ab2d7c
> > >>
> >
> d011db47%7C1%7C0%7C636202753822020527&sdata=a6%2BGOAQoRhjFoxS
> > >> vftY8JZAVUssmrXuDZ9OBy3xqNZk%3D&reserved=0
> > >>
> > >> This is a similar approach, though I pushed the policy for "how do
> > >> you get the objects" out into an external script. One advantage
> > >> there is that large objects could easily be fetched from another
> > >> source entirely (e.g., S3 or equivalent) rather than the repo itself.
> > >>
> > >> The downside is that it makes things more complicated, because a
> > >> push or a fetch now involves three parties (server, client, and the
> > >> alternate object store). So questions like "do I have all the
> > >> objects I need" are hard to reason about.
> > >>
> > >> If you assume that there's going to be _some_ central Git repo
> > >> which has all of the objects, you might as well fetch from there
> > >> (and do it over normal git protocols). And that simplifies things a
> > >> bit, at the cost of
> > being less flexible.
> > >
> > > We looked quite a bit at the external odb patches, as well as lfs
> > > and even using alternates.  They all share a common downside that
> > > you must maintain a separate service that contains _some_ of the files.
> >
> > Pushing the policy for "how do you get the objects" out into an
> > external helper doesn't mean that the external helper cannot use the main
> service.
> > The external helper is still free to do whatever it wants including
> > calling the main service if it thinks it's better.
> 
> That is a good point and you're correct, that means you can avoid having to
> build out multiple services.
> 
> >
> > > These
> > > files must also be versioned, replicated, backed up and the service
> > > itself scaled out to handle the load.  As you mentioned, having
> > > multiple services involved increases flexability but it also
> > > increases the complexity and decreases the reliability of the
> > > overall version control service.
> >
> > About reliability, I think it depends a lot on the use case. If you
> > want to get very big files over an unreliable connection, it can
> > better if you send those big files over a restartable protocol and
> > service like HTTP/S on a regular web server.
> >
> 
> My primary concern about reliability was the multiplicative effect of making
> multiple requests across multiple servers to complete a single request.
> Having putting this all in a single service like you suggested above brings us
> back to parity on the complexity.
> 
> > > For operational simplicity, we opted to go with a design that uses a
> > > single, central git repo which has _all_ the objects and to focus on
> > > enhancing it to handle large numbers of files efficiently.  This
> > > allows us to focus our efforts on a great git service and to avoid
> > > having to build out these other services.
> >
> > Ok, but I don't think it prevents you from using at least some of the
> > same mechanisms that the external odb series is using.
> > And reducing the number of mechanisms in Git itself is great for its
> > maintainability and simplicity.
> 
> I completely agree with the goal of reducing the number of mechanisms in
> Git itself.  Our proposal is primarily targeting speeding operations when
> dealing with large numbers of files.  ObjectDB is primarily targeting large
> objects but there is a lot of similarity in how we're approaching the solution.
> I hope/believe we can come to a common solution that will solve both.
> 
> >
> > >> > To prevent git from accidentally downloading all missing blobs,
> > >> > some git operations are updated to be aware of the potential for
> > missing blobs.
> > >> > The most obvious being check_connected which will return success
> > >> > as if everything in the requested commits is available locally.
> > >>
> > >> Actually, Git is pretty good about trying not to access blobs when
> > >> it doesn't need to. The important thing is that you know enough
> > >> about the blobs to fulfill has_sha1_file() and sha1_object_info()
> > >> requests without actually fetching the data.
> > >>
> > >> So the client definitely needs to have some list of which objects
> > >> exist, and which it _could_ get if it needed to.
> >
> > Yeah, and the external odb series handles that already, thanks to
> > Peff's initial work.
> >
> 
> I'm currently working on a patch series that will reimplement our current
> read-object hook to use the LFS model for long running background
> processes.  As part of that, I am building a versioned interface that will
> support multiple commands (like get, have, put).  In my initial
> implementation, I'm only supporting the "get" verb as that is what we
> currently need but my intent is to build it so that we could add have and put
> in future versions.  When I have the first iteration ready, I'll push it up to our
> fork on github for review as code is clearer than my description in email.
> 
> Moving forward, the "have" verb is a little problematic as we would "have"
> 3+ million shas that we'd be required to fetch from the server and then pass
> along to git when requested.  It would be nice to come up with a way to
> avoid or reduce that cost.
> 
> > >> The one place you'd probably want to tweak things is in the diff
> > >> code, as a single "git log -Sfoo" would fault in all of the blobs.
> > >
> > > It is an interesting idea to explore how we could be smarter about
> > > preventing blobs from faulting in if we had enough info to fulfill
> > > has_sha1_file() and sha1_object_info().  Given we also heavily prune
> > > the working directory using sparse-checkout, this hasn't been our
> > > top focus but it is certainly something worth looking into.
> >
> > The external odb series doesn't handle preventing blobs from faulting
> > in yet, so this could be a common problem.
> >
> 
> Agreed.  This is one we've been working on quite a bit out of necessity.  If
> you look at our patch series, most of the changes are related to dealing with
> missing objects.
> 
> > [...]
> >
> > >> One big hurdle to this approach, no matter the protocol, is how you
> > >> are going to handle deltas. Right now, a git client tells the
> > >> server "I have this commit, but I want this other one". And the
> > >> server knows which objects the client has from the first, and which
> > >> it needs from the second. Moreover, it knows that it can send
> > >> objects in delta form directly from disk if the other side has the delta
> base.
> > >>
> > >> So what happens in this system? We know we don't need to send any
> > >> blobs in a regular fetch, because the whole idea is that we only
> > >> send blobs on demand. So we wait for the client to ask us for blob
> > >> A. But then what do we send? If we send the whole blob without
> > >> deltas, we're going to waste a lot of bandwidth.
> > >>
> > >> The on-disk size of all of the blobs in linux.git is ~500MB. The
> > >> actual data size is ~48GB. Some of that is from zlib, which you get
> > >> even for non-deltas. But the rest of it is from the delta
> > >> compression. I don't think it's feasible to give that up, at least
> > >> not for "normal" source repos like linux.git (more on that in a minute).
> > >>
> > >> So ideally you do want to send deltas. But how do you know which
> > >> objects the other side already has, which you can use as a delta
> > >> base? Sending the list of "here are the blobs I have" doesn't scale.
> > >> Just the sha1s start to add up, especially when you are doing
> > >> incremental
> > fetches.
> >
> > To initialize some paths that the client wants, it could perhaps just
> > ask for some pack files, or maybe bundle files, related to these paths.
> > Those packs or bundles could be downloaded either directly from the
> > main server or from other web or proxy servers.
> >
> > >> I think this sort of things performs a lot better when you just
> > >> focus on large objects. Because they don't tend to delta well
> > >> anyway, and the savings are much bigger by avoiding ones you don't
> > >> want. So a directive like "don't bother sending blobs larger than
> > >> 1MB" avoids a lot of these issues. In other words, you have some
> > >> quick shorthand to communicate between the client and server: this
> > >> what I have, and what I
> > don't.
> > >> Normal git relies on commit reachability for that, but there are
> > >> obviously other dimensions. The key thing is that both sides be
> > >> able to express the filters succinctly, and apply them efficiently.
> > >
> > > Our challenge has been more the sheer _number_ of files that exist
> > > in the repo rather than the _size_ of the files in the repo.  With
> > > >3M source files and any typical developer only needing a small
> > > percentage of those files to do their job, our focus has been
> > > pruning the tree as much as possible such that they only pay the
> > > cost for the files they actually need.  With typical text source
> > > files being 10K - 20K in size, the overhead of the round trip is a
> > > significant part of the overall transfer time so deltas don't help
> > > as much.  I agree that large files are also a problem but it isn't my top
> focus at this point in time.
> >
> > Ok, but it would be nice if both problems could be solved using some
> > common mechanisms.
> > This way it could probably work better in situations where there are
> > both a large number of files _and_ some big files.
> > And from what I am seeing, there could be no real downside from using
> > some common mechanisms.
> >
> 
> Agree completely.  I'm hopeful that we can come up with some common
> mechanisms that will allow us to solve both problems.
> 
> > >> If most of your benefits are not from avoiding blobs in general,
> > >> but rather just from sparsely populating the tree, then it sounds
> > >> like sparse clone might be an easier path forward. The general idea
> > >> is to restrict not just the checkout, but the actual object
> > >> transfer and reachability (in the tree dimension, the way shallow
> > >> clone limits it in the time dimension, which will require
> > >> cooperation between the client
> > and server).
> > >>
> > >> So that's another dimension of filtering, which should be expressed
> > >> pretty
> > >> succinctly: "I'm interested in these paths, and not these other
> > >> ones." It's pretty easy to compute on the server side during graph
> > >> traversal (though it interacts badly with reachability bitmaps, so
> > >> there would need to be some hacks there).
> > >>
> > >> It's an idea that's been talked about many times, but I don't
> > >> recall that there were ever working patches. You might dig around
> > >> in the list archive under the name "sparse clone" or possibly "narrow
> clone".
> > >
> > > While a sparse/narrow clone would work with this proposal, it isn't
> > > required.  You'd still probably want all the commits and trees but
> > > the clone would also bring down the specified blobs.  Combined with
> > > using "depth" you could further limit it to those blobs at tip.
> > >
> > > We did run into problems with this model however as our usage
> > > patterns are such that our working directories often contain very
> > > sparse trees and as a result, we can end up with thousands of
> > > entries in the sparse checkout file.  This makes it difficult for
> > > users to manually specify a sparse-checkout before they even do a
> > > clone.  We have implemented a hashmap based sparse-checkout to deal
> > > with the performance issues of having that many entries but that's a
> > > different RFC/PATCH.  In short, we found that a "lazy-clone" and
> > > downloading blobs on demand provided a better developer experience.
> >
> > I think both ways are possible using the external odb mechanism.
> >
> > >> > Future Work
> > >> > ~~~~~~~~~~~
> > >> >
> > >> > The current prototype calls a new hook proc in
> > >> > sha1_object_info_extended and read_object, to download each
> > >> > missing blob.  A better solution would be to implement this via a
> > >> > long running process that is spawned on the first download and
> > >> > listens for requests to download additional objects until it
> > >> > terminates when the parent git operation exits (similar to the
> > >> > recent long running smudge and clean filter
> > >> work).
> > >>
> > >> Yeah, see the external-odb discussion. Those prototypes use a
> > >> process per object, but I think we all agree after seeing how the
> > >> git-lfs interface has scaled that this is a non-starter. Recent
> > >> versions of git-lfs do the single- process thing, and I think any
> > >> sort of external-odb hook should be modeled on that protocol.
> >
> > I agree that the git-lfs scaling work is great, but I think it's not
> > necessary in the external odb work to have the same kind of
> > single-process protocol from the beginning (though it should be possible
> and easy to add it).
> > For example if the external odb work can be used or extended to handle
> > restartable clone by downloading a single bundle when cloning, this
> > would not need that kind of protocol.
> >
> > > I'm looking into this now and plan to re-implement it this way
> > > before sending out the first patch series.  Glad to hear you think
> > > it is a good protocol to model it on.
> >
> > Yeah, for your use case on Windows, it looks really worth it to use
> > this kind of protocol.
> >
> > >> > Need to investigate an alternate batching scheme where we can
> > >> > make a single request for a set of "related" blobs and receive
> > >> > single a packfile (especially during checkout).
> > >>
> > >> I think this sort of batching is going to be the really hard part
> > >> to retrofit onto git. Because you're throwing out the procedural
> > >> notion that you can loop over a set of objects and ask for each
> individually.
> > >> You have to start deferring computation until answers are ready.
> > >> Some operations can do that reasonably well (e.g., checkout), but
> > >> something like "git log -p" is constantly digging down into history.
> > >> I suppose you could just perform the skeleton of the operation
> > >> _twice_, once to find the list of objects to fault in, and the
> > >> second time to
> > actually do it.
> >
> > In my opinion, perhaps we can just prevent "git log -p" from faulting
> > in blobs and have it show a warning saying that it was performed only
> > on a subset of all the blobs.
> >
> 
> You might be surprised at how many other places end up faulting in blobs. :)
> Rename detection is one we've recently been working on.
> 
> > [...]



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-02-23 15:40 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-13 15:52 [RFC] Add support for downloading blobs on demand Ben Peart
2017-01-13 21:07 ` Shawn Pearce
2017-01-17 21:50   ` Ben Peart
2017-01-17 22:05     ` Martin Fick
2017-01-17 22:23       ` Stefan Beller
2017-01-18 18:27         ` Ben Peart
2017-01-17 18:42 ` Jeff King
2017-01-17 21:50   ` Ben Peart
2017-02-05 14:03     ` Christian Couder
2017-02-07 18:21       ` Ben Peart
2017-02-07 21:56         ` Jakub Narębski
2017-02-08  2:18           ` Ben Peart
2017-02-23 15:39         ` Ben Peart

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).