From mboxrd@z Thu Jan 1 00:00:00 1970
From: Shawn Pearce
Subject: Re: RefTree: Alternate ref backend
Date: Tue, 22 Dec 2015 10:50:27 -0800
References: <56796F37.1000600@alum.mit.edu> <567985A8.2020301@alum.mit.edu>
In-Reply-To: <567985A8.2020301@alum.mit.edu>
To: Michael Haggerty
Cc: git, David Turner, Jeff King

On Tue, Dec 22, 2015 at 9:17 AM, Michael Haggerty wrote:
>
> etc. But we store branches into the main "refs/remotes/origin/"
> namespace, leaving no reserved space for the remote "HEAD" (not to
> mention other namespaces that might appear on the remote, such as
> "refs/changes/*", "refs/pull/*", a separate record of the remote's
> "refs/tags/*", etc).
>
> Maybe that is why my gut reaction to your proposal to elide the "refs"
> part of the reference hierarchy and store "HEAD" as (effectively)
> "refs/..HEAD" was negative, even though I can't think of any practical
> objections.

Good point; if the client's refs/remotes/origin/ namespace more
closely mirrored the remote's own namespace
(refs/remotes/origin/heads/master), this seems a lot less fishy. The
mapping certainly makes a bit more sense. Etc.

It's a user-visible shift, however; what was origin/master is now
origin/heads/master. Which is part of the reason why the mapping works
the way it does today. We hardly ever call a branch here heads/master,
we just call it master. So we call origin's master, origin/master. :)

> At a deeper level, the "refs/" part of reference names is actually
> pretty useless in general. I suppose it originated in the practice of
> storing loose references under "refs/" to keep them separate from other
> metadata in $GIT_DIR.

Correct.
In the beginning you used echo $sha1 >.git/HEAD and it was good.
Later more refs came along and they had to go somewhere, and so
.git/refs was born with .git/refs/heads/master. Existing tools that
knew how to write to .git/HEAD given the name HEAD could magically
work with refs/heads/master too, and it was good. But that was an
awfully long name to type, so shorthand of "master" for maybe
refs/heads/master or maybe refs/tags/master or maybe no prefix at all
(hi HEAD) came along.

Basically it's the origin story of Git. :)

> But really, aside from slightly helping
> disambiguate references from paths in the command line, what is it good
> for?

Nothing really; today the refs/ prefix is used to tell the tools that
you really meant refs/heads/master and not refs/heads/heads/master or
some other crazy construct. You can thank the DWIMery inside the ref
rev-parse logic for needing this.

> The client not only has to remember the server's reftree, but also must
> verify that it still has all of the objects implied by that reftree, in
> case a reference somehow got deleted under "refs/remotes/origin/*". At
> that point, there is no special reason to use a SHA-1 in the
> negotiation; any unique token generated by the server would suffice if
> the server can connect it back to a set of references that was sent to
> the client in the past.

True, but it's a nicer implementation if the token exchanged has a
simple meaning to the server. And it's just a diff-tree at the server
to compute the modifications the client might need to learn about.

I see your point about the client being able to use that to say "I
not only have this, I also have all of the objects". It vastly
simplifies the client's negotiation with the server. The client is
negotiating the common ancestor of the reftree and that immediately
gets the main graph ancestor negotiation system very close to a good
set. The client may still be usefully ahead on other branches, e.g.
she has pulled from the upstream and is now pulling from a
lieutenant's tree, who also recently pulled from the upstream.

> In practice, in my first "haves" announcement I would probably list a
> few "famous" namespaces in the hope that one or more of them are
> recognized by the server:
>
>     have-tree
>     have-tree
>     have-tree
>     have-tree
>     have-tree

Yes, but we also have to be careful about how long we let the "famous"
list get. :)

>> [...]
>> FWIW, JGit is able to scan the canonical trees out of a pack file and
>> inflate them in approximately the same time it takes to scan the
>> packed-refs file for some 70k references. So we don't really slow down
>> much to use this. And there's huge gains to be had by taking advantage
>> of the tree structure and only inflating the components you need to
>> answer a particular read.
>
> Yes, that's another nice aspect of the design.
>
> I do worry a bit that the hierarchical storage only helps if people
> shard their reference namespace reasonably. Somebody who stores 100k
> references in a single reference "directory" (imagine a
> "refs/ci-tests/*") is going to suffer from expensive reference update
> performance. But I guess they will suffer from poor performance within
> Git as well, and that will probably encourage them to improve their
> practices :-) I suppose this is not really much different than people
> who store 100k files within a single directory of their working tree.

Yup. Gerrit Code Review shards refs/changes/ across 100 directories
for this reason as local filesystems don't like large numbers of files
or directories in a directory. But at 100k change entries you are
still dealing with 10k subtrees in each shard. The 100-sharding isn't
quite enough.

I started considering doing a notemap-like sharding for reftree. It's
harder because the names aren't a uniform shape the way object ids are
in a notemap.
But it could be possible to split by prefix, for example start by
building a table of all 2 character prefixes in the tree. If this
produces too many entries in any single 2 character subtree, retry as
a 4 character subtree. Continue extending the prefix until either the
number of unique prefixes in the parent tree is too many, or the
subtrees are acceptable sizes. If the parent gets too many entries
(1000?), freeze the parent prefix length and start splitting the
subtrees instead. For tags you may wind up with a structure like:

  tags/
    v1../
      .0
      .2
    v2../
      .0
      0.125
    v3../
      0.98

Or whatever. Here I used ".." as a suffix on the splits like "v1.." to
indicate the name isn't itself a directory component, but a sharding
split. Thus we have tags "v1.0", "v2.0", "v20.125", "v30.98", etc.

It doesn't help the scalability of a source code tree having too many
files. But we could do some smarter splitting inside reftree to help
it scale even if people aren't sharding their ref namespaces.

Sadly this has a lot of downsides: it's complex to write and it's ugly.
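
The prefix-splitting idea above could be sketched roughly like this.
It is only an illustration in Python, not code from JGit or Git: the
function names shard and flatten, the max_leaf threshold, and growing
the prefix 2 characters at a time are all assumptions made for the
example. The only number taken from the text is the "(1000?)" guess
for the parent width. Names are assumed to be unique within one
directory level (no "/" handling).

```python
from collections import defaultdict


def shard(names, max_leaf=2, max_parent=1000, step=2):
    """Split unique names from one ref directory into a nested dict.

    Plain entries map to None. A key ending in ".." marks a sharding
    split; its value is the subtree holding the name remainders.
    All parameters are hypothetical tuning knobs for this sketch.
    """
    names = sorted(names)
    if len(names) <= max_leaf:
        return {name: None for name in names}  # small enough, keep flat

    longest = max(len(name) for name in names)
    length = step
    chosen = None
    while True:
        # Table of all prefixes of the current length.
        groups = defaultdict(list)
        for name in names:
            groups[name[:length]].append(name)
        if chosen is not None and len(groups) > max_parent:
            break  # parent too wide: freeze at the previous length
        chosen = (length, groups)
        subtrees_ok = all(len(g) <= max_leaf for g in groups.values())
        if subtrees_ok or length >= longest:
            break
        length += step  # retry with a longer prefix

    length, groups = chosen
    tree = {}
    for prefix, members in groups.items():
        rest = []
        for name in members:
            if len(name) <= length:
                tree[name] = None  # nothing left after the prefix
            else:
                rest.append(name[length:])
        if rest:
            # Oversized subtrees get split again by the recursion.
            tree[prefix + ".."] = shard(rest, max_leaf, max_parent, step)
    return tree


def flatten(tree, prefix=""):
    """Yield the original names back out of a sharded tree."""
    for key, sub in tree.items():
        if sub is None:
            yield prefix + key
        else:
            yield from flatten(sub, prefix + key[:-2])  # drop the ".."


tags = ["v1.0", "v1.2", "v2.0", "v20.125", "v30.98"]
tree = shard(tags, max_leaf=2)
print(tree)
print(sorted(flatten(tree)))
```

With max_leaf=2 the five tags come out as the "v1..", "v2..", "v3.."
layout shown above, and flatten() recovers the original names. Using
".." as the split marker relies on ".." never appearing inside a valid
ref name. The sketch skips the ugly parts: rebalancing on update, and
names too short to split any further.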