From: "Ben Peart"
To: "'Jeff King'", "'Ben Peart'"
References: <20170113155253.1644-1-benpeart@microsoft.com> <20170117184258.sd7h2hkv27w52gzt@sigill.intra.peff.net>
In-Reply-To: <20170117184258.sd7h2hkv27w52gzt@sigill.intra.peff.net>
Subject: RE: [RFC] Add support for downloading blobs on demand
Date: Tue, 17 Jan 2017 16:50:47 -0500
Message-ID: <002601d2710b$c3715890$4a5409b0$@gmail.com>

Thanks for the thoughtful response.
No need to apologize for the length; it's a tough problem to solve, so I
don't expect it to be handled with a single, short email. :)

> -----Original Message-----
> From: Jeff King [mailto:peff@peff.net]
> Sent: Tuesday, January 17, 2017 1:43 PM
> To: Ben Peart
> Cc: git@vger.kernel.org; Ben Peart
> Subject: Re: [RFC] Add support for downloading blobs on demand
>
> This is an issue I've thought a lot about. So apologies in advance that
> this response turned out a bit long. :)
>
> On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:
>
> > Design
> > ~~~~~~
> >
> > Clone and fetch will pass a --lazy-clone flag (open to a better name
> > here) similar to --depth that instructs the server to only return
> > commits and trees and to ignore blobs.
> >
> > Later during git operations like checkout, when a blob cannot be found
> > after checking all the regular places (loose, pack, alternates, etc),
> > git will download the missing object and place it into the local
> > object store (currently as a loose object) then resume the operation.
>
> Have you looked at the "external odb" patches I wrote a while ago, and
> which Christian has been trying to resurrect?
>
>   http://public-inbox.org/git/20161130210420.15982-1-chriscool@tuxfamily.org/
>
> This is a similar approach, though I pushed the policy for "how do you
> get the objects" out into an external script. One advantage there is
> that large objects could easily be fetched from another source entirely
> (e.g., S3 or equivalent) rather than the repo itself.
>
> The downside is that it makes things more complicated, because a push or
> a fetch now involves three parties (server, client, and the alternate
> object store). So questions like "do I have all the objects I need" are
> hard to reason about.
>
> If you assume that there's going to be _some_ central Git repo which has
> all of the objects, you might as well fetch from there (and do it over
> normal git protocols). And that simplifies things a bit, at the cost of
> being less flexible.

We looked quite a bit at the external odb patches, as well as lfs and
even using alternates. They all share a common downside: you must
maintain a separate service that contains _some_ of the files. Those
files must also be versioned, replicated, backed up, and the service
itself scaled out to handle the load. As you mentioned, having multiple
services involved increases flexibility, but it also increases the
complexity and decreases the reliability of the overall version control
service.

For operational simplicity, we opted to go with a design that uses a
single, central git repo which has _all_ the objects and to focus on
enhancing it to handle large numbers of files efficiently. This allows
us to focus our efforts on a great git service and to avoid having to
build out these other services.
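
To make the fallback described in the quoted "Design" section a little
more concrete, the shape of the change in the prototype is roughly the
following (heavily simplified; download_missing_blob() is a stand-in
name for the hook invocation, not the actual interface):

  /*
   * Sketch only, not the actual patch: when an object can't be found in
   * any of the usual places, ask the hook to download it into the local
   * object store, then retry the normal lookup once.
   */
  static void *read_object_or_fault_in(const unsigned char *sha1,
                                       enum object_type *type,
                                       unsigned long *size)
  {
          void *data = read_object(sha1, type, size); /* loose, packs, alternates */

          if (!data && !download_missing_blob(sha1))  /* hypothetical hook call */
                  return NULL;

          return data ? data : read_object(sha1, type, size); /* retry after download */
  }

The important property is that the fault-in happens at the object-store
boundary, so the code above it (checkout, diff, etc.) doesn't need to
know whether a blob was already local.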

> > To prevent git from accidentally downloading all missing blobs, some
> > git operations are updated to be aware of the potential for missing
> > blobs. The most obvious being check_connected, which will return
> > success as if everything in the requested commits is available
> > locally.
>
> Actually, Git is pretty good about trying not to access blobs when it
> doesn't need to. The important thing is that you know enough about the
> blobs to fulfill has_sha1_file() and sha1_object_info() requests without
> actually fetching the data.
>
> So the client definitely needs to have some list of which objects exist,
> and which it _could_ get if it needed to.
>
> The one place you'd probably want to tweak things is in the diff code,
> as a single "git log -Sfoo" would fault in all of the blobs.

It is an interesting idea to explore how we could be smarter about
preventing blobs from faulting in if we had enough info to fulfill
has_sha1_file() and sha1_object_info(). Given we also heavily prune the
working directory using sparse-checkout, this hasn't been our top focus,
but it is certainly something worth looking into.

> > To minimize the impact on the server, the existing dumb HTTP protocol
> > endpoint objects/ can be used to retrieve the individual
> > missing blobs when needed.
>
> This is going to behave badly on well-packed repositories, because there
> isn't a good way to fetch a single object. The best case (which is not
> implemented at all in Git) is that you grab the pack .idx, then grab
> "slices" of the pack corresponding to specific objects, including
> hunting down delta bases.
>
> But then next time the server repacks, you have to throw away your .idx
> file. And those can be big. The .idx for linux.git is 135MB. You really
> wouldn't want to do an incremental fetch of 1MB worth of objects and
> have to grab the whole .idx just to figure out which bytes you needed.
>
> You can solve this by replacing the dumb-http server with a smart one
> that actually serves up the individual objects as if they were truly
> sitting on the filesystem. But then you haven't really minimized impact
> on the server, and you might as well teach the smart protocols to do
> blob fetches.

Yeah, we actually implemented a new endpoint that we are using to fetch
individual blobs; I only found the dumb endpoint recently and thought
"hey, maybe we can use this to make it easier for other git servers."
For a number of good reasons, I don't think this is the right approach.
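
To spell out why the dumb endpoint falls over: it can only address loose
objects one at a time; once the server packs things up, a dumb client
has to discover and download pack data wholesale. Roughly (repo path and
object id below are made up):

  GET /repo.git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad   <- loose object: per-object fetch works
  GET /repo.git/objects/info/packs                                  <- packed object: list the packs,
  GET /repo.git/objects/pack/pack-<name>.idx                        <- pull each (large) .idx to locate it,
  GET /repo.git/objects/pack/pack-<name>.pack                       <- then download the whole pack

Which is why, as Peff says, you end up wanting either pack "slices" or a
smart endpoint.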

> One big hurdle to this approach, no matter the protocol, is how you are
> going to handle deltas. Right now, a git client tells the server "I have
> this commit, but I want this other one". And the server knows which
> objects the client has from the first, and which it needs from the
> second. Moreover, it knows that it can send objects in delta form
> directly from disk if the other side has the delta base.
>
> So what happens in this system? We know we don't need to send any blobs
> in a regular fetch, because the whole idea is that we only send blobs on
> demand. So we wait for the client to ask us for blob A. But then what do
> we send? If we send the whole blob without deltas, we're going to waste
> a lot of bandwidth.
>
> The on-disk size of all of the blobs in linux.git is ~500MB. The actual
> data size is ~48GB. Some of that is from zlib, which you get even for
> non-deltas. But the rest of it is from the delta compression. I don't
> think it's feasible to give that up, at least not for "normal" source
> repos like linux.git (more on that in a minute).
>
> So ideally you do want to send deltas. But how do you know which objects
> the other side already has, which you can use as a delta base? Sending
> the list of "here are the blobs I have" doesn't scale. Just the sha1s
> start to add up, especially when you are doing incremental fetches.
>
> I think this sort of thing performs a lot better when you just focus on
> large objects. Because they don't tend to delta well anyway, and the
> savings are much bigger by avoiding ones you don't want. So a directive
> like "don't bother sending blobs larger than 1MB" avoids a lot of these
> issues. In other words, you have some quick shorthand to communicate
> between the client and server: this is what I have, and what I don't.
> Normal git relies on commit reachability for that, but there are
> obviously other dimensions. The key thing is that both sides be able to
> express the filters succinctly, and apply them efficiently.

Our challenge has been more the sheer _number_ of files that exist in
the repo rather than the _size_ of the files in the repo. With >3M
source files and any typical developer only needing a small percentage
of those files to do their job, our focus has been pruning the tree as
much as possible such that they only pay the cost for the files they
actually need. With typical text source files being 10K - 20K in size,
the overhead of the round trip is a significant part of the overall
transfer time, so deltas don't help as much. I agree that large files
are also a problem, but it isn't my top focus at this point in time.

> > After cloning, the developer can use sparse-checkout to limit the set
> > of files to the subset they need (typically only 1-10% in these large
> > repos). This allows the initial checkout to only download the set of
> > files actually needed to complete their task. At any point, the
> > sparse-checkout file can be updated to include additional files which
> > will be fetched transparently on demand.
>
> If most of your benefits are not from avoiding blobs in general, but
> rather just from sparsely populating the tree, then it sounds like
> sparse clone might be an easier path forward. The general idea is to
> restrict not just the checkout, but the actual object transfer and
> reachability (in the tree dimension, the way shallow clone limits it in
> the time dimension, which will require cooperation between the client
> and server).
>
> So that's another dimension of filtering, which should be expressed
> pretty succinctly: "I'm interested in these paths, and not these other
> ones." It's pretty easy to compute on the server side during graph
> traversal (though it interacts badly with reachability bitmaps, so
> there would need to be some hacks there).
>
> It's an idea that's been talked about many times, but I don't recall
> that there were ever working patches. You might dig around in the list
> archive under the name "sparse clone" or possibly "narrow clone".

While a sparse/narrow clone would work with this proposal, it isn't
required. You'd still probably want all the commits and trees, but the
clone would also bring down the specified blobs. Combined with using
"depth", you could further limit it to those blobs at tip.

We did run into problems with this model, however, as our usage patterns
are such that our working directories often contain very sparse trees
and, as a result, we can end up with thousands of entries in the
sparse-checkout file. This makes it difficult for users to manually
specify a sparse-checkout before they even do a clone. We have
implemented a hashmap-based sparse-checkout to deal with the performance
issues of having that many entries, but that's a different RFC/PATCH. In
short, we found that a "lazy-clone" and downloading blobs on demand
provided a better developer experience.
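
For readers who haven't used sparse-checkout: with core.sparseCheckout
enabled, every path the developer needs has to appear in
.git/info/sparse-checkout, so a very sparse working directory in a repo
this size means a file that looks something like this (paths invented
for illustration; the real lists run to thousands of lines):

  /build/scripts/
  /src/featureA/widgets/
  /src/featureA/widgets-tests/
  /src/common/logging/
  ...

Each of those patterns gets matched against every index entry, which is
where the hashmap-based matching mentioned above comes in.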

> > Now some numbers
> > ~~~~~~~~~~~~~~~~
> >
> > One repo has 3+ million files at tip across 500K folders with 5-6K
> > active developers. They have done a lot of work to remove large files
> > from the repo so it is down to < 100GB.
> >
> > Before changes: clone took hours to transfer the 87GB .pack + 119MB
> > .idx
> >
> > After changes: clone took 4 minutes to transfer 305MB .pack + 37MB
> > .idx
> >
> > After hydrating 35K files (the typical number any individual developer
> > needs to do their work), there was an additional 460 MB of loose files
> > downloaded.
>
> It sounds like you have a case where the repository has a lot of large
> files that are either historical, or uninteresting in the sparse-tree
> dimension.
>
> How big would that 460MB be if it were actually packed with deltas?

Uninteresting in the sparse-tree dimension. 460 MB divided by 35K files
is less than 13 KB per file, which is fairly typical for source code.
Given there are no prior versions to calculate deltas from, compressing
them into a pack file would help some, but I don't have the numbers as
to how much. When we get to the "future work" below and start batching
up requests, we'll have better data on that.

> > Future Work
> > ~~~~~~~~~~~
> >
> > The current prototype calls a new hook proc in
> > sha1_object_info_extended and read_object, to download each missing
> > blob. A better solution would be to implement this via a long running
> > process that is spawned on the first download and listens for requests
> > to download additional objects until it terminates when the parent git
> > operation exits (similar to the recent long running smudge and clean
> > filter work).
>
> Yeah, see the external-odb discussion. Those prototypes use a process
> per object, but I think we all agree after seeing how the git-lfs
> interface has scaled that this is a non-starter. Recent versions of
> git-lfs do the single-process thing, and I think any sort of
> external-odb hook should be modeled on that protocol.

I'm looking into this now and plan to re-implement it this way before
sending out the first patch series. Glad to hear you think it is a good
protocol to model it on.
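
To sketch what I have in mind (the command names below are invented;
only the overall shape is borrowed from the long-running clean/smudge
filter protocol), git would start the helper once and then exchange
pkt-line framed requests over its stdin/stdout, along the lines of
("git>" is git talking to the helper, "git<" is the helper's reply,
"0000" is a pkt-line flush):

  git> git-read-object-client       # hypothetical handshake, once per git command
  git> version=1
  git> 0000
  git< git-read-object-server
  git< version=1
  git< 0000
  git> command=get                  # repeated for each missing blob (or batch)
  git> sha1=3b18e512dba79e4c8300dd08aeb37f8e728b8dad
  git> 0000
  git< status=success               # helper has written the object locally
  git< 0000

That way the process startup cost is paid once per git command instead
of once per missing blob.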

> > Need to investigate an alternate batching scheme where we can make a
> > single request for a set of "related" blobs and receive a single
> > packfile (especially during checkout).
>
> I think this sort of batching is going to be the really hard part to
> retrofit onto git, because you're throwing out the procedural notion
> that you can loop over a set of objects and ask for each individually.
> You have to start deferring computation until answers are ready. Some
> operations can do that reasonably well (e.g., checkout), but something
> like "git log -p" is constantly digging down into history. I suppose
> you could just perform the skeleton of the operation _twice_, once to
> find the list of objects to fault in, and the second time to actually
> do it.
>
> That will make git feel a lot slower, because a lot of the illusion of
> speed is the way it streams out results. OTOH, if you have to wait to
> fault in objects from the network, it's going to feel pretty slow
> anyway. :)

The good news is that for most operations, git doesn't need access to
all the blobs. You're right that any command which does end up faulting
in a bunch of blobs from the network can get pretty slow. Sometimes you
get streaming results and sometimes it just "hangs" while we go off
downloading blobs in the background. We capture telemetry to detect
these types of issues, but typically the users are more than happy to
send us an "I just ran command 'foo' and it hung" email. :)

> -Peff