From: "Ben Peart" <peartben@gmail.com>
To: "'Christian Couder'" <christian.couder@gmail.com>
Cc: "'Jeff King'" <peff@peff.net>, "'git'" <git@vger.kernel.org>,
	"'Johannes Schindelin'" <Johannes.Schindelin@gmx.de>,
	"Ben Peart" <benpeart@microsoft.com>
Subject: RE: [RFC] Add support for downloading blobs on demand
Date: Tue, 7 Feb 2017 13:21:08 -0500
Message-ID: <002701d2816e$f4682fa0$dd388ee0$@gmail.com>
In-Reply-To: <CAP8UFD3R6nzDPApNvK6rcXR2qdAE6G4J3xbvEam3xsobO7viiA@mail.gmail.com>

No worries about a late response; I'm sure this is the start of a long conversation. :)

> -----Original Message-----
> From: Christian Couder [mailto:christian.couder@gmail.com]
> Sent: Sunday, February 5, 2017 9:04 AM
> To: Ben Peart <peartben@gmail.com>
> Cc: Jeff King <peff@peff.net>; git <git@vger.kernel.org>; Johannes Schindelin
> <Johannes.Schindelin@gmx.de>
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> (Sorry for the late reply and thanks to Dscho for pointing me to this thread.)
> 
> On Tue, Jan 17, 2017 at 10:50 PM, Ben Peart <peartben@gmail.com> wrote:
> >> From: Jeff King [mailto:peff@peff.net]
> >> On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:
> >>
> >> > Clone and fetch will pass a  --lazy-clone  flag (open to a better
> >> > name
> >> > here) similar to  --depth  that instructs the server to only return
> >> > commits and trees and to ignore blobs.
> >> >
> >> > Later during git operations like checkout, when a blob cannot be
> >> > found after checking all the regular places (loose, pack,
> >> > alternates, etc), git will download the missing object and place it
> >> > into the local object store (currently as a loose object) then resume the
> operation.
> >>
> >> Have you looked at the "external odb" patches I wrote a while ago,
> >> and which Christian has been trying to resurrect?
> >>
> >>
> >> http://public-inbox.org/git/20161130210420.15982-1-chriscool@tuxfamily.org/
> >>
> >> This is a similar approach, though I pushed the policy for "how do
> >> you get the objects" out into an external script. One advantage there
> >> is that large objects could easily be fetched from another source
> >> entirely (e.g., S3 or equivalent) rather than the repo itself.
> >>
> >> The downside is that it makes things more complicated, because a push
> >> or a fetch now involves three parties (server, client, and the
> >> alternate object store). So questions like "do I have all the objects
> >> I need" are hard to reason about.
> >>
> >> If you assume that there's going to be _some_ central Git repo which
> >> has all of the objects, you might as well fetch from there (and do it
> >> over normal git protocols). And that simplifies things a bit, at the cost of
> being less flexible.
> >
> > We looked quite a bit at the external odb patches, as well as lfs and
> > even using alternates.  They all share a common downside that you must
> > maintain a separate service that contains _some_ of the files.
> 
> Pushing the policy for "how do you get the objects" out into an external
> helper doesn't mean that the external helper cannot use the main service.
> The external helper is still free to do whatever it wants including calling the
> main service if it thinks it's better.

That is a good point, and you're correct: it means you can avoid having to build out multiple services.

> 
> > These
> > files must also be versioned, replicated, backed up and the service
> > itself scaled out to handle the load.  As you mentioned, having
> > multiple services involved increases flexibility but it also increases
> > the complexity and decreases the reliability of the overall version
> > control service.
> 
> About reliability, I think it depends a lot on the use case. If you want to get
> very big files over an unreliable connection, it can be better if you send those big
> files over a restartable protocol and service like HTTP/S on a regular web
> server.
> 

My primary concern about reliability was the multiplicative effect of making multiple requests across multiple servers to complete a single request.  Putting this all in a single service, as you suggested above, brings us back to parity on complexity.
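
(Back-of-the-envelope: if a single logical operation has to touch two independent services that are each 99.9% available, the combined operation is only about 99.8% available (0.999 * 0.999), and every extra hop also adds its own round-trip latency.)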

> > For operational simplicity, we opted to go with a design that uses a
> > single, central git repo which has _all_ the objects and to focus on
> > enhancing it to handle large numbers of files efficiently.  This
> > allows us to focus our efforts on a great git service and to avoid
> > having to build out these other services.
> 
> Ok, but I don't think it prevents you from using at least some of the same
> mechanisms that the external odb series is using.
> And reducing the number of mechanisms in Git itself is great for its
> maintainability and simplicity.

I completely agree with the goal of reducing the number of mechanisms in Git itself.  Our proposal primarily targets speeding up operations that deal with large numbers of files.  The external odb work primarily targets large objects, but there is a lot of similarity in how we're approaching the solution.  I hope/believe we can come to a common solution that solves both.

> 
> >> > To prevent git from accidentally downloading all missing blobs,
> >> > some git operations are updated to be aware of the potential for
> missing blobs.
> >> > The most obvious being check_connected which will return success as
> >> > if everything in the requested commits is available locally.
> >>
> >> Actually, Git is pretty good about trying not to access blobs when it
> >> doesn't need to. The important thing is that you know enough about
> >> the blobs to fulfill has_sha1_file() and sha1_object_info() requests
> >> without actually fetching the data.
> >>
> >> So the client definitely needs to have some list of which objects
> >> exist, and which it _could_ get if it needed to.
> 
> Yeah, and the external odb series handles that already, thanks to Peff's initial
> work.
> 

I'm currently working on a patch series that will reimplement our current read-object hook to use the LFS model of a long-running background process.  As part of that, I am building a versioned interface that will support multiple commands (like "get", "have", and "put").  In my initial implementation I'm only supporting the "get" verb, as that is all we currently need, but my intent is to build it so that we can add "have" and "put" in future versions.  When I have the first iteration ready, I'll push it to our fork on GitHub for review, as code is clearer than my description in email.
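
To make that concrete, here is a rough sketch of what the helper side of such an interface could look like, assuming pkt-line style framing like the long-running filter protocol uses (the version/capability handshake is omitted, and the verb and key names here are illustrative, not a final spec):

    import sys

    def read_pkt_line(stream):
        # pkt-line framing: 4 hex digits give the total length including the
        # 4-byte header itself; "0000" is a flush packet (returned as None).
        header = stream.read(4)
        if not header:
            return b""                 # EOF: the parent git process went away
        length = int(header, 16)
        if length == 0:
            return None                # flush packet
        return stream.read(length - 4)

    def write_pkt_line(stream, payload):
        stream.write(b"%04x" % (len(payload) + 4) + payload)

    def write_flush(stream):
        stream.write(b"0000")
        stream.flush()

    def fetch_blob_into_object_store(sha1):
        # Placeholder: a real helper would download the blob from the central
        # server and write it into .git/objects as a loose object.
        pass

    def serve(stdin=sys.stdin.buffer, stdout=sys.stdout.buffer):
        while True:
            line = read_pkt_line(stdin)
            if line == b"":            # EOF, parent git operation finished
                break
            if line is None or not line.startswith(b"command=get"):
                continue               # only the "get" verb in version 1
            sha1 = b""
            while True:                # read key=value lines up to the flush
                kv = read_pkt_line(stdin)
                if kv is None or kv == b"":
                    break
                if kv.startswith(b"sha1="):
                    sha1 = kv[len(b"sha1="):].strip()
            fetch_blob_into_object_store(sha1)
            write_pkt_line(stdout, b"status=success\n")
            write_flush(stdout)

The idea is that "have" and "put" would later slot in as additional command= values handled by the same loop, without changing the framing.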

Moving forward, the "have" verb is a little problematic, as we would "have" 3+ million SHAs that we'd need to fetch from the server and then pass along to Git when requested.  It would be nice to come up with a way to avoid or reduce that cost.
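
(For scale: 3 million SHA-1s is roughly 120 MB as 40-character hex strings, or about 60 MB in raw 20-byte binary form, before any type/size metadata -- a lot of data to move around just to answer "have" queries.)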

> >> The one place you'd probably want to tweak things is in the diff
> >> code, as a single "git log -Sfoo" would fault in all of the blobs.
> >
> > It is an interesting idea to explore how we could be smarter about
> > preventing blobs from faulting in if we had enough info to fulfill
> > has_sha1_file() and sha1_object_info().  Given we also heavily prune
> > the working directory using sparse-checkout, this hasn't been our top
> > focus but it is certainly something worth looking into.
> 
> The external odb series doesn't handle preventing blobs from faulting in yet,
> so this could be a common problem.
> 

Agreed.  This is one we've been working on quite a bit out of necessity.  If you look at our patch series, most of the changes are related to dealing with missing objects.
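
For a sense of the pattern, the fallback described earlier in the thread (a hook called from read_object and sha1_object_info_extended when a blob isn't found in the usual places) boils down to "look locally, run the hook once, look again".  Sketched in Python rather than the actual C code, with the hook command name purely a placeholder:

    import subprocess

    def read_object_with_fallback(sha1, local_lookup, hook_cmd=("read-object-hook",)):
        # Illustrative shape only, not the real implementation: check the
        # local object store, fault the blob in via the hook, then retry.
        data = local_lookup(sha1)          # loose, packs, alternates, ...
        if data is None:
            # The hook is expected to download the blob and store it as a
            # loose object so that the retry below can find it.
            subprocess.run(list(hook_cmd) + [sha1], check=True)
            data = local_lookup(sha1)
        if data is None:
            raise LookupError("blob %s unavailable even after the hook ran" % sha1)
        return data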

> [...]
> 
> >> One big hurdle to this approach, no matter the protocol, is how you
> >> are going to handle deltas. Right now, a git client tells the server
> >> "I have this commit, but I want this other one". And the server knows
> >> which objects the client has from the first, and which it needs from
> >> the second. Moreover, it knows that it can send objects in delta form
> >> directly from disk if the other side has the delta base.
> >>
> >> So what happens in this system? We know we don't need to send any
> >> blobs in a regular fetch, because the whole idea is that we only send
> >> blobs on demand. So we wait for the client to ask us for blob A. But
> >> then what do we send? If we send the whole blob without deltas, we're
> >> going to waste a lot of bandwidth.
> >>
> >> The on-disk size of all of the blobs in linux.git is ~500MB. The
> >> actual data size is ~48GB. Some of that is from zlib, which you get
> >> even for non-deltas. But the rest of it is from the delta
> >> compression. I don't think it's feasible to give that up, at least
> >> not for "normal" source repos like linux.git (more on that in a minute).
> >>
> >> So ideally you do want to send deltas. But how do you know which
> >> objects the other side already has, which you can use as a delta
> >> base? Sending the list of "here are the blobs I have" doesn't scale.
> >> Just the sha1s start to add up, especially when you are doing incremental
> fetches.
> 
> To initialize some paths that the client wants, it could perhaps just ask for
> some pack files, or maybe bundle files, related to these paths.
> Those packs or bundles could be downloaded either directly from the main
> server or from other web or proxy servers.
> 
> >> I think this sort of things performs a lot better when you just focus
> >> on large objects. Because they don't tend to delta well anyway, and
> >> the savings are much bigger by avoiding ones you don't want. So a
> >> directive like "don't bother sending blobs larger than 1MB" avoids a
> >> lot of these issues. In other words, you have some quick shorthand to
> >> communicate between the client and server: this what I have, and what I
> don't.
> >> Normal git relies on commit reachability for that, but there are
> >> obviously other dimensions. The key thing is that both sides be able
> >> to express the filters succinctly, and apply them efficiently.
> >
> > Our challenge has been more the sheer _number_ of files that exist in
> > the repo rather than the _size_ of the files in the repo.  With >3M
> > source files and any typical developer only needing a small percentage
> > of those files to do their job, our focus has been pruning the tree as
> > much as possible such that they only pay the cost for the files they
> > actually need.  With typical text source files being 10K - 20K in
> > size, the overhead of the round trip is a significant part of the
> > overall transfer time so deltas don't help as much.  I agree that
> > large files are also a problem but it isn't my top focus at this point in time.
> 
> Ok, but it would be nice if both problems could be solved using some
> common mechanisms.
> This way it could probably work better in situations where there are both a
> large number of files _and_ some big files.
> And from what I am seeing, there could be no real downside from using
> some common mechanisms.
> 

Agree completely.  I'm hopeful that we can come up with some common mechanisms that will allow us to solve both problems.

> >> If most of your benefits are not from avoiding blobs in general, but
> >> rather just from sparsely populating the tree, then it sounds like
> >> sparse clone might be an easier path forward. The general idea is to
> >> restrict not just the checkout, but the actual object transfer and
> >> reachability (in the tree dimension, the way shallow clone limits it
> >> in the time dimension, which will require cooperation between the client
> and server).
> >>
> >> So that's another dimension of filtering, which should be expressed
> >> pretty
> >> succinctly: "I'm interested in these paths, and not these other
> >> ones." It's pretty easy to compute on the server side during graph
> >> traversal (though it interacts badly with reachability bitmaps, so
> >> there would need to be some hacks there).
> >>
> >> It's an idea that's been talked about many times, but I don't recall
> >> that there were ever working patches. You might dig around in the
> >> list archive under the name "sparse clone" or possibly "narrow clone".
> >
> > While a sparse/narrow clone would work with this proposal, it isn't
> > required.  You'd still probably want all the commits and trees but the
> > clone would also bring down the specified blobs.  Combined with using
> > "depth" you could further limit it to those blobs at tip.
> >
> > We did run into problems with this model however as our usage patterns
> > are such that our working directories often contain very sparse trees
> > and as a result, we can end up with thousands of entries in the sparse
> > checkout file.  This makes it difficult for users to manually specify
> > a sparse-checkout before they even do a clone.  We have implemented a
> > hashmap based sparse-checkout to deal with the performance issues of
> > having that many entries but that's a different RFC/PATCH.  In short,
> > we found that a "lazy-clone" and downloading blobs on demand provided
> > a better developer experience.
> 
> I think both ways are possible using the external odb mechanism.
> 
> >> > Future Work
> >> > ~~~~~~~~~~~
> >> >
> >> > The current prototype calls a new hook proc in
> >> > sha1_object_info_extended and read_object, to download each missing
> >> > blob.  A better solution would be to implement this via a long
> >> > running process that is spawned on the first download and listens
> >> > for requests to download additional objects until it terminates
> >> > when the parent git operation exits (similar to the recent long
> >> > running smudge and clean filter
> >> work).
> >>
> >> Yeah, see the external-odb discussion. Those prototypes use a process
> >> per object, but I think we all agree after seeing how the git-lfs
> >> interface has scaled that this is a non-starter. Recent versions of
> >> git-lfs do the single- process thing, and I think any sort of
> >> external-odb hook should be modeled on that protocol.
> 
> I agree that the git-lfs scaling work is great, but I think it's not necessary in the
> external odb work to have the same kind of single-process protocol from the
> beginning (though it should be possible and easy to add it).
> For example if the external odb work can be used or extended to handle
> restartable clone by downloading a single bundle when cloning, this would
> not need that kind of protocol.
> 
> > I'm looking into this now and plan to re-implement it this way before
> > sending out the first patch series.  Glad to hear you think it is a
> > good protocol to model it on.
> 
> Yeah, for your use case on Windows, it looks really worth it to use this kind
> of protocol.
> 
> >> > Need to investigate an alternate batching scheme where we can make
> >> > a single request for a set of "related" blobs and receive a single
> >> > packfile (especially during checkout).
> >>
> >> I think this sort of batching is going to be the really hard part to
> >> retrofit onto git. Because you're throwing out the procedural notion
> >> that you can loop over a set of objects and ask for each individually.
> >> You have to start deferring computation until answers are ready. Some
> >> operations can do that reasonably well (e.g., checkout), but
> >> something like "git log -p" is constantly digging down into history.
> >> I suppose you could just perform the skeleton of the operation
> >> _twice_, once to find the list of objects to fault in, and the second time to
> actually do it.
> 
> In my opinion, perhaps we can just prevent "git log -p" from faulting in blobs
> and have it show a warning saying that it was performed only on a subset of
> all the blobs.
> 

You might be surprised at how many other places end up faulting in blobs. :)  Rename detection is one we've recently been working on.

> [...]

