git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Shawn Pearce <spearce@spearce.org>
To: Michael Haggerty <mhagger@alum.mit.edu>
Cc: git <git@vger.kernel.org>,
	David Turner <dturner@twopensource.com>,
	Jeff King <peff@peff.net>
Subject: Re: RefTree: Alternate ref backend
Date: Tue, 22 Dec 2015 10:50:27 -0800	[thread overview]
Message-ID: <CAJo=hJtgfpZn0OjbQ=BVoO_=03yG0Czjfn9vX4RobWLYpNVENg@mail.gmail.com> (raw)
In-Reply-To: <567985A8.2020301@alum.mit.edu>

On Tue, Dec 22, 2015 at 9:17 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>
> etc. But we store branches into the main "refs/remotes/origin/"
> namespace, leaving no reserved space for the remote "HEAD" (not to
> mention other namespaces that might appear on the remote, such as
> "refs/changes/*", "refs/pull/*", a separate record of the remote's
> "refs/tags/*", etc).
>
> Maybe that is why my gut reaction to your proposal to elide the "refs"
> part of the reference hierarchy and store "HEAD" as (effectively)
> "refs/..HEAD" was negative, even though I can't think of any practical
> objections.

Good point; if the client's refs/remotes/origin/ namespace more
closely mirrored the remote's own namespace
(refs/remotes/origin/heads/master), this seems a lot less fishy. The
mapping certainly makes a bit more sense. Etc.

Its a user visible shift however; what was origin/master is now
origin/heads/master. Which is part of the reason why the mapping works
the way it does today. We hardly ever call a branch here heads/master,
we just call it master. So we call origin's master, origin/master. :)

> At a deeper level, the "refs/" part of reference names is actually
> pretty useless in general. I suppose it originated in the practice of
> storing loose references under "refs/" to keep them separate from other
> metadata in $GIT_DIR.

Correct. In the beginning you used echo $sha1 >.git/HEAD and it was good.

Later more refs came along and they had to go somewhere, and so
.git/refs was born with .git/refs/heads/master. Existing tools that
knew how to write to .git/HEAD given the name HEAD could magically
work with refs/heads/master too, and it was good. But that was an
awefully long name to type, so shorthand of "master" for maybe
refs/heads/master or maybe refs/tags/master or maybe no prefix at all
(hi HEAD) came along. Basically its the origin story of Git. :)

> But really, aside from slightly helping
> disambiguate references from paths in the command line, what is it good
> for?

Nothing really; today refs/ prefix is used to encourage to the tools
that you really meant refs/heads/master and not
refs/heads/heads/master or some other crazy construct. You can thank
the DWIMery inside the ref rev parse logic for needing this.

> The client not only has to remember the server's reftree, but also must
> verify that it still has all of the objects implied by that reftree, in
> case a reference somehow got deleted under "refs/remotes/origin/*". At
> that point, there is no special reason to use a SHA-1 in the
> negotiation; any unique token generated by the server would suffice if
> the server can connect it back to a set of references that was sent to
> the client in the past.

True, but its a nicer implementation if the token exchanged has simple
meaning to the server. And its just a diff-tree at the server to
compute the modifications the client might need to learn about.

I see your point about the client being able to use that to say "If I
not only have this, I also have all of the objects". It vastly
simplifies the client's negotiation with the server. The client is
negotiating the common ancestor of the reftree and that immediately
gets the main graph ancestor negotiation system very close to a good
set. The client may still be usefully ahead on other branches, e.g.
she has pulled from the upstream and is now pulling from a
lieutenant's tree, who also recently pulled from the upstream.

> In practice, in my first "haves" announcement I would probably list a
> few "famous" namespaces in the hope that one or more of them are
> recognized by the server:
>
>     have-tree <SHA-1 for "refs/">
>     have-tree <SHA-1 for "refs/heads/">
>     have-tree <SHA-1 for "refs/tags/">
>     have-tree <SHA-1 for "refs/remotes/origin/heads/">
>     have-tree <SHA-1 for "refs/remotes/other/heads/">

Yes, but we also have to be careful about how long we get the "famous"
list get. :)

>> [...]
>> FWIW, JGit is able to scan the canonical trees out of a pack file and
>> inflate them in approximately the same time it takes to scan the
>> packed-refs file for some 70k references. So we don't really slow down
>> much to use this. And there's huge gains to be had by taking advantage
>> of the tree structure and only inflating the components you need to
>> answer a particular read.
>
> Yes, that's another nice aspect of the design.
>
> I do worry a bit that the hierarchical storage only helps if people
> shard their reference namespace reasonably. Somebody who stores 100k
> references in a single reference "directory" (imagine a
> "refs/ci-tests/*") is going to suffer from expensive reference update
> performance. But I guess they will suffer from poor performance within
> Git as well, and that will probably encourage them to improve their
> practices :-) I suppose this is not really much different than people
> who store 100k files within a single directory of their working tree.

Yup. Gerrit Code Review shards refs/changes/ across 100 directories
for this reason as local filesystems don't like large numbers of files
or directories in a directory. But at 100k change entries you are
still dealing with 10k subtrees in each shard. The 100-sharding isn't
quite enough.

I started considering doing a notemap like sharding for reftree. Its
harder because the names aren't a uniform shape the way object ids are
in a notemap. But it could be possible to split by prefix, for example
start by building a table of all 2 character prefixes in the tree. If
this produces too many entries in any single 2 character subtree,
retry as a 4 character subtree. Continue extending the prefix until
either the number of unique prefixes in the parent tree is too many,
or the subtrees are acceptable sizes. If the parent gets to be too
many (1000?), freeze the parent prefix length and start splitting the
subtrees instead.

For tags you may wind up with a structure like:

  tags/
    v1../
      .0
      .2
    v2../
      .0
      0.125
    v3../
      0.98

Or whatever. Here I used ".." as a suffix on the splits like "v1.." to
indicate the name isn't itself a directory component, but a sharding
split. Thus we have tags "v1.0", "v2.0", "v20.125", "v30.98", etc.

It doesn't help the scalability of a source code tree having too many
files. But we could do some smarter splitting inside reftree to help
it scale even if people aren't sharding their ref namespaces. Sadly
this has a lot of downsides, its complex to write and its ugly.

  reply	other threads:[~2015-12-22 18:50 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-12-17 21:02 RefTree: Alternate ref backend Shawn Pearce
2015-12-17 21:57 ` Junio C Hamano
2015-12-17 22:15   ` Shawn Pearce
2015-12-17 22:10 ` Jeff King
2015-12-17 22:28   ` Shawn Pearce
2015-12-18  1:36     ` Mike Hommey
2015-12-22 15:41 ` Michael Haggerty
2015-12-22 16:11   ` Shawn Pearce
2015-12-22 17:04     ` Dave Borowitz
2015-12-22 17:17     ` Michael Haggerty
2015-12-22 18:50       ` Shawn Pearce [this message]
2015-12-22 19:09         ` Junio C Hamano
2015-12-22 19:11           ` Shawn Pearce
2015-12-22 19:34             ` Junio C Hamano
2015-12-23  4:59               ` Michael Haggerty
2015-12-24  1:33                 ` Junio C Hamano
     [not found]       ` <4689734.cEcQ2vR0aQ@mfick1-lnx>
2015-12-22 20:56         ` Martin Fick
2015-12-22 21:23           ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAJo=hJtgfpZn0OjbQ=BVoO_=03yG0Czjfn9vX4RobWLYpNVENg@mail.gmail.com' \
    --to=spearce@spearce.org \
    --cc=dturner@twopensource.com \
    --cc=git@vger.kernel.org \
    --cc=mhagger@alum.mit.edu \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).