git@vger.kernel.org mailing list mirror (one of many)
From: Kevin David <Kevin.David@microsoft.com>
To: Jonathan Tan <jonathantanmy@google.com>,
	Ben Peart <peartben@gmail.com>,
	Stefan Beller <sbeller@google.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>,
	Mark Thomas <markbt@efaref.net>,
	Jeff Hostetler <git@jeffhostetler.com>
Subject: RE: Proposal for "fetch-any-blob Git protocol" and server design
Date: Fri, 21 Apr 2017 16:41:14 +0000	[thread overview]
Message-ID: <CY4PR21MB05047397A12B8692C923FC67EE1A0@CY4PR21MB0504.namprd21.prod.outlook.com> (raw)
In-Reply-To: <18ebcd04-4765-bdc7-3880-b0e8cb90d35c@google.com>

Hi Jonathan,

Sorry for the delayed response - other work got in the way, unfortunately!

Kevin

-----Original Message-----
From: Jonathan Tan [mailto:jonathantanmy@google.com] 
Sent: Thursday, April 13, 2017 4:13 PM

>> I know we're considering server behavior here, but how large do you generally
>> expect these blob-want requests to be? I ask because we took an initial approach
>> very similar to this, however, we had a hard time being clever about figuring out
>> what set of blobs to request for those clients that didn't want the entire set, and
>> ended up falling back to single-blob requests.
>>
>> Obviously, this could be due to the nature of our filesystem-virtualization-based client,
>> but I also suspect that the teams attacking this problem are more often than not dealing
>> with very large blob objects, so the cost of a round-trip becomes lower relative to sending
>> the object content itself.
>
> I am envisioning (1a) as described in Jeff Hostetler's e-mail [1] ("a 
> pre-command or hook to identify needed blobs and pre-fetch them before 
> allowing the actual command to start"), so a Git command would typically 
> make a single request that contains all the blobs required, but my 
> proposal can also handle (1c) ('"fault" them in as necessary in 
> read_object() while the command is running and without any pre-fetch 
> (either synchronously or asynchronously and with/without a helper 
> process)').
> 
> Even if we decided to go with single-blob requests and responses, it is 
> still important to send them as packfiles, so that the server can serve 
> them directly from its compressed storage without first having to 
> uncompress them.
> 
> [1] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/

Ah, I missed this. If we think we can build meaningfully-sized requests via (1a) and (1b), 
then I agree - packs are optimal.

However, if single-blob requests/responses dominate, I have a few concerns, mostly from 
experience playing with something similar: 
* Regardless of the size of the object or number returned, the client will need to 
  `index-pack` the result and create a corresponding `.idx` file, which requires 
  decompression to construct (right?)

* Unless the client's repository is repacked aggressively, we'll pollute the 
  `.git\objects\pack` directory with little indexes (somewhere around 4KiB minimum) 
  and packfiles rapidly, which would degrade performance. One rudimentary workaround 
  would be to loosen these packs on the client if they were under a certain 
  size/object count. I think fetch does this already?
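On the workaround in the last bullet: fetch does indeed do this already, governed
by transfer.unpackLimit (default 100) - packs below the threshold are exploded
into loose objects via unpack-objects, larger ones are kept and indexed. A toy
sketch of that decision; the function name is illustrative, not a Git internal:

```python
# Mimics Git's transfer.unpackLimit decision for an incoming pack.
UNPACK_LIMIT = 100  # transfer.unpackLimit default

def keep_as_pack(object_count, unpack_limit=UNPACK_LIMIT):
    """True  -> keep the pack and build a .idx (index-pack),
    False -> explode its objects into loose objects (unpack-objects),
             avoiding the small-pack/small-index pollution described above."""
    return object_count >= unpack_limit
```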

In either case, shifting the pack decompression/loose object recompression problem 
to the client instead of the server is probably a good principle, but in our case it 
simply wasn't fast enough to serve single blobs interactively (e.g. opening a source 
file you don't have locally). I'm hopeful that the proposed partial clone solutions you 
referenced would reduce the frequency of this being required.

>> Being a bit more clever about packing objects (e.g. splitting blobs out from commits
>> and trees) improved this a bit, but we still hit bottlenecks from what appeared to
>> be a large number of memory-mapping operations on a ~140GiB packfile of blobs.
>> 
>> Each `pack-objects` process would consume approximately one CPU core for the
>> duration of the request. It's possible that further splitting of these large blob packs
>> would have improved performance in some scenarios, but that would increase the
>> amount of pack-index lookups necessary to find a single object.
>
> I'm not very experienced with mmap, but I thought that memory-mapping a 
> large file in itself does not incur much of a performance penalty (if 
> any) - it is the accesses that count. I experimented with 15,000 and 
> 150,000 MiB files and mmap and they seem to be handled quite well. Also, 
> how many objects are "pack-objects" packing here?

Back when we took this approach, it was ~4000 blob objects at a time. 
Perhaps we were being bitten by the Windows implementation of git_mmap [3]?
When I profiled the 4k-blob scenario, the majority of CPU and wall time was
spent in MapViewOfFileEx, which suggests the cost could lie in the accesses as well.

[3] https://github.com/git/git/blob/master/compat/win32mmap.c
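To make the map-vs-access distinction concrete, here is a rough experiment in the
spirit of Jonathan's test: create a large sparse file, time the mmap call itself,
then time touching one byte per page. On POSIX systems the mapping is typically
cheap and the page faults dominate; whether Windows' MapViewOfFileEx behaves the
same way is exactly the open question here. File size and page stride below are
arbitrary choices, not anything from Git.

```python
# Time mapping a file vs. actually touching its pages.
import mmap
import os
import tempfile
import time

def time_map_vs_access(size=16 * 1024 * 1024):
    """Return (seconds to mmap, seconds to touch one byte per 4 KiB page)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.truncate(size)  # sparse, zero-filled file
        path = f.name
    try:
        with open(path, "rb") as f:
            t0 = time.perf_counter()
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            t_map = time.perf_counter() - t0

            t0 = time.perf_counter()
            # Touch one byte per page to force faults; file is all zeros.
            _ = sum(m[i] for i in range(0, size, 4096))
            t_access = time.perf_counter() - t0
            m.close()
        return t_map, t_access
    finally:
        os.unlink(path)
```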

>> This seems like a clever way to avoid the canonical `/info/refs?service=git-upload-pack`
>> capability negotiation on every call. However, using error handling to fall back seems
>> slightly wonky to me. Hopefully users are incentivized to upgrade their clients.
>
> By "error handling to fallback", do you mean in my proposal or in a 
> possible future one (assuming my proposal is implemented)? I don't think 
> my proposal requires any error handling to fallback (since only new 
> clients can clone partially - old clients will just clone totally and 
> obliviously), but I acknowledge that this proposal does not mean that 
> any future proposal can be done without requiring error handling to 
> fallback.

Right, I was talking about the possible future one - more around the 
concept of back-compat in the event of any protocol changes. I don't want 
to spend too much time focusing on what we might want in the future, but a 
thought I just had: what about versioning as a part of the URL? For example, 
`/server-endpoint?version=1.0`. This could also enable breaking changes for 
existing commands.
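A minimal sketch of that versioning idea - dispatch on an explicit version
parameter instead of relying on error-based fallback. The endpoint name,
handler table, and default are all hypothetical, not real Git server code:

```python
# Dispatch a request to a handler chosen by an explicit ?version= parameter.
from urllib.parse import parse_qs, urlparse

HANDLERS = {
    "1.0": lambda body: b"v1 response",  # placeholder handler
}

def dispatch(url, body):
    q = parse_qs(urlparse(url).query)
    # An absent version means the oldest supported protocol, so old
    # clients keep working without any error-handling fallback dance.
    version = q.get("version", ["1.0"])[0]
    handler = HANDLERS.get(version)
    if handler is None:
        raise ValueError("unsupported protocol version " + version)
    return handler(body)
```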

>> This makes a lot of sense to me. When we built our caching proxy, we had to be careful
>> when designing how we'd handle clients requesting objects missing from the proxy.
>>
>> For example, a client requests a single blob and the proxy doesn't have it - we can't simply
>> download that object from the "authoritative" remote and stick it in the `.git\objects\xx\yyy...`
>> directory, because the repository would be made corrupt.
>
> By proxy, do you mean a Git repository? Sorry, I don't really understand 
> this part.

Yes, apologies for not being clear. The proxy we created maintains a 
nearly-up-to-date mirror of the remote as a bare repo and exposes a few 
custom endpoints I've alluded to. What I'm trying to say is that updating 
the bare repo on disk via methods other than a full `fetch` is dangerous 
without the partial-clone-marking capability.

>> Not to derail us too far off blobs, but I wonder if we need a `fetch-commit-pack` endpoint,
>> or could get away with introducing a new capability (e.g. `no-blobs`) to `upload-pack` instead.
>> As a casual observer, this seems like it would be a much smaller change since the rest of the
>> negotiation/reachability calculation would look the same, right? Or would this `fetch-commit-pack`
>> not return trees either?
>>
>> I only ask because, in our observations, when git wants to read commits it's
>> usually followed by a lot of "related" trees - again caveated with the fact that
>> we're intercepting many things at the filesystem layer.
>
> The main reason for this extra command is not to exclude blobs (which, 
> as you said, can be done with a new capability - I suspect that we will 
> need a capability or parameter of some sort anyway to indicate which 
> size of blobs to filter out) but to eliminate the mandatory ref 
> advertisement that is done whenever the client fetches. One of our use 
> cases (internal Android) has large blobs and many (more than 700k) refs, 
> so it would benefit greatly from blob filtering and elimination of the 
> mandatory ref advertisement (tens of megabytes per fetch).
> 
> As for the size of the change, I have a work in progress that implements 
> this [2].
> 
> [2] https://public-inbox.org/git/cover.1491851452.git.jonathantanmy@google.com/

Ah, interesting. We solved the many-refs problem using a different approach - 
basically, limiting what the server returns based on who the user is and preferences 
they set via our web interface or an API. I've been promised by a few colleagues that 
we'll have more to share here soon... With your proposal, how does the client choose 
which refs they want to fetch?  

I took a look at your patch series. I'm nowhere near qualified to give feedback 
given my lack of experience with the core git implementation and C in general, but it 
looks reasonable. Hoping we can come up with a better name than "server-endpoint" though :)
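For a sense of scale on the "tens of megabytes per fetch" figure above: in the
pre-v2 protocol every ref is advertised as a pkt-line of roughly
"<4-hex-length><40-hex-oid> <refname>\n". A back-of-envelope calculation, with
the average refname length being an assumed number:

```python
# Estimate the size of a mandatory ref advertisement for N refs.
def advertisement_bytes(ref_count, avg_refname_len=60):
    # pkt-len prefix + 40-hex object id + space + refname + LF
    per_line = 4 + 40 + 1 + avg_refname_len + 1
    return ref_count * per_line

mb = advertisement_bytes(700_000) / (1024 * 1024)
# ~70 MiB with these assumptions - consistent with "tens of megabytes"
```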

>> Just to keep the discussion interesting, I'll throw an alternative out there that's
>> worked well for us. As I understand it, the HTTP-based dumb transfer protocol
>> supports returning objects in loose object format, but only if they already exist
>> in loose format.
>>
>> Extending this to have the remote provide these objects via a "dumb" protocol
>> when they are packed as well - i.e. the server would "loosen" them upon request -
>> is basically what we do and it works quite well for low-latency clients. To further improve
>> performance at the cost of complexity, we've added caching at the memory and disk layer
>> for these loose objects in the same format we send to the client.
>>
>> There's a clear tradeoff here - the servers must have adequate disk and/or memory to store
>> these loose objects in optimal format. In addition, the higher the latency is to the remote,
>> the worse this solution will perform. Fortunately, in our case, network topology allows us to
>> put these caching proxies close enough to clients for it not to matter.
>
>This does make sense in the situation you describe, but (as you said) I 
>don't think we can guarantee this in the majority of situations. I think 
>some sort of batching (like the (1a) solution I talked about near the 
>start of this e-mail) and serving packed data from packed storage should 
>form the baseline, and any situation-specific optimizations (e.g. 
>serving unpacked data from topologically-close servers) can be 
>additional steps.

Agreed - our choice of solution was largely driven by our lack of ability to batch, 
performance issues with `pack-objects` at scale, and ability to deploy the 
previously-mentioned proxies very close to clients. If we can solve the first 
problem (batching), optimizing for the hopefully-less-common single blob at a time 
scenario becomes less important. If we assume that happens, I think your proposal 
looks sound overall.
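Since the "loosening" idea comes up repeatedly above: the stream such a server
has to produce is just zlib over a small header plus the raw content, i.e.
"<type> <size>\0<content>". A self-contained sketch of that encoding - the
memory/disk caching layer I described would sit on top of something like this,
and obtaining the raw content from packed storage is elided:

```python
# Produce a Git loose-object byte stream from an object's type and content.
import hashlib
import zlib

def loosen(obj_type, content):
    """Return (object id, loose-object bytes) for the given type/content."""
    header = ("%s %d" % (obj_type, len(content))).encode() + b"\x00"
    store = header + content
    oid = hashlib.sha1(store).hexdigest()  # the SHA-1 object id
    return oid, zlib.compress(store)
```

For example, the well-known blob containing "hello\n" hashes to
ce013625030ba8dba906f756967f9e9ca394464a under this encoding.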


Thread overview: 10+ messages
2017-03-14 22:57 Proposal for "fetch-any-blob Git protocol" and server design Jonathan Tan
2017-03-15 17:59 ` Junio C Hamano
2017-03-16 17:31   ` Jonathan Tan
2017-03-16 21:17     ` Junio C Hamano
2017-03-16 22:48       ` Jonathan Tan
2017-03-28 23:19 ` Stefan Beller
     [not found]   ` <00bf01d2aed7$b13492a0$139db7e0$@gmail.com>
2017-04-12 22:02     ` Kevin David
2017-04-13 20:12       ` Jonathan Tan
2017-04-21 16:41         ` Kevin David [this message]
2017-04-26 22:51           ` Jonathan Tan
