git@vger.kernel.org mailing list mirror (one of many)
* RefTree: Alternate ref backend
@ 2015-12-17 21:02 Shawn Pearce
  2015-12-17 21:57 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Shawn Pearce @ 2015-12-17 21:02 UTC (permalink / raw)
  To: git; +Cc: David Turner, Michael Haggerty, Jeff King

I started playing around with the idea of storing references directly
in Git. Exploiting the GITLINK tree entry, we can associate a name to
any SHA-1.

By storing all references in a single tree, atomic transactions are
possible. It's a simple compare-and-swap of a single 40 byte SHA-1.
This of course leads to a bootstrapping problem: where do we store the
40 byte SHA-1? For this example it's just $GIT_DIR/refs/txn/committed
as a classical loose reference.
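
A minimal sketch of the compare-and-swap idea, in Python rather than
JGit. It is not the posted implementation: RootPointer, snapshot_id and
update_refs are hypothetical names, the "store" is an in-memory dict,
and a flat hash of the whole mapping stands in for a real tree object.
It only shows how several ref changes can commit or fail together on
one 40-character id.

```python
import hashlib
import threading


class RootPointer:
    """Hypothetical stand-in for the one loose ref that names the current ref tree."""

    def __init__(self, initial):
        self._value = initial
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        # The whole transaction commits or fails on this single comparison.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def snapshot_id(refs):
    """Hash a complete name -> id mapping into one 40-character hex id."""
    payload = "".join(f"{name} {oid}\n" for name, oid in sorted(refs.items()))
    return hashlib.sha1(payload.encode()).hexdigest()


def update_refs(root, store, changes):
    """Apply several ref changes at once; return False if the root moved meanwhile."""
    old_id = root.get()
    refs = dict(store[old_id])
    for name, oid in changes.items():
        if oid is None:
            refs.pop(name, None)  # None means "delete this ref"
        else:
            refs[name] = oid
    new_id = snapshot_id(refs)
    store[new_id] = refs
    return root.compare_and_swap(old_id, new_id)


empty = snapshot_id({})
store = {empty: {}}
root = RootPointer(empty)
committed = update_refs(root, store, {
    "heads/master": "f3adf457e046f92f039353762a78dcb3afb2cb13",
    "heads/next": "5ee9e94ccfede68f0c386c497dd85c017efa22d6",
})
stale = root.compare_and_swap(empty, empty)  # a writer holding the old id loses
```

Both branch updates become visible in the same swap, and a writer that
still holds the previous id gets False back and has to retry.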


I posted code for this to JGit (sorry):

   https://git.eclipse.org/r/62970


Configuration:

  [core]
    repositoryformatversion = 1
  [extensions]
    refsBackendType = RefTree

For example, recent git.git has a structure like this:

  $ git ls-tree -r refs/txn/committed
  120000 blob 22e42fc826b437033ca444e09368f53a0b169322 ..HEAD
  160000 commit 1ff88560c8d22bcdb528a6629239d638f927cb96 heads/maint
  160000 commit f3adf457e046f92f039353762a78dcb3afb2cb13 heads/master
  160000 commit 5ee9e94ccfede68f0c386c497dd85c017efa22d6 heads/next
  160000 commit d3835d54cffb16c4362979a5be3ba9958eab4116 heads/pu
  160000 commit 68a0f56b615b61afdbd86be01a3ca63dca70edc0 heads/todo
  ...
  160000 commit 17f9f635c101aef03874e1de1d8d0322187494b3 tags/v2.6.0
  160000 commit 5bebb9057df8287684c763c59c67f25f16884ef6 tags/v2.6.0-rc0
  160000 commit 16ffa6443e279a9b3b63d7a2bebeb07833506010 tags/v2.6.0-rc0^{}
  160000 commit bbdca2a7bd942e1d3ce517b48e6229b99f7d7b2b tags/v2.6.0-rc1
  160000 commit 689efb737a7b46351850eefdfa57d2ce232011fb tags/v2.6.0-rc1^{}
  160000 commit 7b269a793392ee3d71ecddac88a8ad63497cbc4d tags/v2.6.0-rc2
  160000 commit 45733fa93f287fbc04d6a6a3f5a39cc852c5cf50 tags/v2.6.0-rc2^{}
  160000 commit 27df6e2585060add45b32bbd46f6e92ef79d069b tags/v2.6.0-rc3
  160000 commit 8d530c4d64ffcc853889f7b385f554d53db375ed tags/v2.6.0-rc3^{}
  160000 commit be08dee9738eaaa0423885ed189c2b6ad8368cf0 tags/v2.6.0^{}

Tags are stored twice, to cache the peel information for network advertisements.

Packing the tree by itself is smaller than packed-refs, which is uncompressed:

  -rw-r----- 1 shawn me  60K Dec 17 12:43 packed-refs
  -r--r----- 1 shawn me 1.2K Dec 17 12:56 R-5533*.idx
  -r--r----- 1 shawn me  28K Dec 17 12:56 R-5533*.pack


By exploiting Git to store Git, we get a reflog for free:

  $ git log -p refs/txn/committed -1
  commit f7ec5ceeba6ca87fa112b3af70d8ac17364045f7
  Author: anonymous <anonymous@localhost>
  Date:   Thu Dec 17 12:53:39 2015 -0800

      push

  diff --git a/heads/tmp2 b/heads/tmp2
  deleted file mode 160000
  index f3adf45..0000000
  --- a/heads/tmp2
  +++ /dev/null
  @@ -1 +0,0 @@
  -Subproject commit f3adf457e046f92f039353762a78dcb3afb2cb13

:)


* Re: RefTree: Alternate ref backend
  2015-12-17 21:02 RefTree: Alternate ref backend Shawn Pearce
@ 2015-12-17 21:57 ` Junio C Hamano
  2015-12-17 22:15   ` Shawn Pearce
  2015-12-17 22:10 ` Jeff King
  2015-12-22 15:41 ` Michael Haggerty
  2 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2015-12-17 21:57 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, David Turner, Michael Haggerty, Jeff King

Shawn Pearce <spearce@spearce.org> writes:

> For example, recent git.git has a structure like this:
>
>   $ git ls-tree -r refs/txn/committed
>   120000 blob 22e42fc826b437033ca444e09368f53a0b169322 ..HEAD
>   160000 commit 1ff88560c8d22bcdb528a6629239d638f927cb96 heads/maint
>   160000 commit f3adf457e046f92f039353762a78dcb3afb2cb13 heads/master
>   160000 commit 5ee9e94ccfede68f0c386c497dd85c017efa22d6 heads/next
>   160000 commit d3835d54cffb16c4362979a5be3ba9958eab4116 heads/pu
>   160000 commit 68a0f56b615b61afdbd86be01a3ca63dca70edc0 heads/todo
>   ...
>   160000 commit 17f9f635c101aef03874e1de1d8d0322187494b3 tags/v2.6.0
>   160000 commit 5bebb9057df8287684c763c59c67f25f16884ef6 tags/v2.6.0-rc0
>   160000 commit 16ffa6443e279a9b3b63d7a2bebeb07833506010 tags/v2.6.0-rc0^{}
>   ...
>   160000 commit be08dee9738eaaa0423885ed189c2b6ad8368cf0 tags/v2.6.0^{}
>
> Tags are stored twice, to cache the peel information for network advertisements.

The object 17f9f635 is not a "commit" but is a "tag".  It is
somewhat unfortunate that "ls-tree -r" reports it as a commit; as
the command (or anything that deals with a gitlink) is not allowed
to look at the actual object, it is the best it could do, though.

The "..HEAD" hack looks somewhat ugly.  I can guess that the
reasoning went like "if we stored these in the most natural way, we
always need a top-level tree that holds two and only two entries,
HEAD and heads/, which would require us one level of tree unwrapping
to get to most of the refs" and "HEAD is the only oddball that is
outside refs/ hierarchy, others like MERGE_HEAD are ephemeral and
for the purpose of Gerrit that does not even do working tree
management, those things that are not necessary in order to manage
only the committed state can be omitted.", but still...


* Re: RefTree: Alternate ref backend
  2015-12-17 21:02 RefTree: Alternate ref backend Shawn Pearce
  2015-12-17 21:57 ` Junio C Hamano
@ 2015-12-17 22:10 ` Jeff King
  2015-12-17 22:28   ` Shawn Pearce
  2015-12-22 15:41 ` Michael Haggerty
  2 siblings, 1 reply; 18+ messages in thread
From: Jeff King @ 2015-12-17 22:10 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, David Turner, Michael Haggerty

On Thu, Dec 17, 2015 at 01:02:50PM -0800, Shawn Pearce wrote:

> I started playing around with the idea of storing references directly
> in Git. Exploiting the GITLINK tree entry, we can associate a name to
> any SHA-1.

Gitlink entries don't imply reachability, though. I guess that doesn't
matter if your ref backend says "no, really, these are the ref tips, and
they are reachable".  But you could not push the whole thing up to
another server and expect it to hold the whole graph.

Which is not strictly necessary, but to me seems like the real advantage
of using git objects versus some other system.

Of course, the lack of reachability has advantages, too. You can
drop commits pointed to by old reflogs without rewriting the ref
history. Unfortunately you cannot expunge the reflogs at all. That's
good if you like audit trails. Bad if you are worried that your reflogs
will grow large. :)

> By storing all references in a single tree, atomic transactions are
> possible. It's a simple compare-and-swap of a single 40 byte SHA-1.
> This of course leads to a bootstrapping problem: where do we store the
> 40 byte SHA-1? For this example it's just $GIT_DIR/refs/txn/committed
> as a classical loose reference.

Somehow putting it inside `refs/` seems weird to me, in an infinite
recursion kind of way.  I would have picked $GIT_DIR/REFSTREE or
something. But that is a minor point.

> Configuration:
> 
>   [core]
>     repositoryformatversion = 1
>   [extensions]
>     refsBackendType = RefTree

The semantics of extensions config keys are open-ended. The
formatVersion=1 spec only says "if there is a key you don't know about,
then you may not proceed". Now we're defining a refsBackendType
extension. It probably makes sense to write up a few rules (e.g., is
RefTree case-sensitive?).

-Peff


* Re: RefTree: Alternate ref backend
  2015-12-17 21:57 ` Junio C Hamano
@ 2015-12-17 22:15   ` Shawn Pearce
  0 siblings, 0 replies; 18+ messages in thread
From: Shawn Pearce @ 2015-12-17 22:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, David Turner, Michael Haggerty, Jeff King

On Thu, Dec 17, 2015 at 1:57 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>> For example, recent git.git has a structure like this:
>>
>>   $ git ls-tree -r refs/txn/committed
>>   120000 blob 22e42fc826b437033ca444e09368f53a0b169322 ..HEAD
>>   160000 commit 1ff88560c8d22bcdb528a6629239d638f927cb96 heads/maint
>>   160000 commit f3adf457e046f92f039353762a78dcb3afb2cb13 heads/master
>>   160000 commit 5ee9e94ccfede68f0c386c497dd85c017efa22d6 heads/next
>>   160000 commit d3835d54cffb16c4362979a5be3ba9958eab4116 heads/pu
>>   160000 commit 68a0f56b615b61afdbd86be01a3ca63dca70edc0 heads/todo
>>   ...
>>   160000 commit 17f9f635c101aef03874e1de1d8d0322187494b3 tags/v2.6.0
>>   160000 commit 5bebb9057df8287684c763c59c67f25f16884ef6 tags/v2.6.0-rc0
>>   160000 commit 16ffa6443e279a9b3b63d7a2bebeb07833506010 tags/v2.6.0-rc0^{}
>>   ...
>>   160000 commit be08dee9738eaaa0423885ed189c2b6ad8368cf0 tags/v2.6.0^{}
>>
>> Tags are stored twice, to cache the peel information for network advertisements.
>
> The object 17f9f635 is not a "commit" but is a "tag".  It is
> somewhat unfortunate that "ls-tree -r" reports it as a commit; as
> the command (or anything that deals with a gitlink) is not allowed
> to look at the actual object, it is the best it could do, though.

Yes; thus far GITLINK is only used for commits in submodules so it's
reasonable for it to just hardcode the text "commit".

> The "..HEAD" hack looks somewhat ugly.  I can guess that the
> reasoning went like "if we stored these in the most natural way, we
> always need a top-level tree that holds two and only two entries,
> HEAD and heads/, which would require us one level of tree unwrapping
> to get to most of the refs" and "HEAD is the only oddball that is
> outside refs/ hierarchy,

Correct.

> others like MERGE_HEAD are ephemeral and
> for the purpose of Gerrit that does not even do working tree
> management, those things that are not necessary in order to manage
> only the committed state can be omitted.", but still...

Yes. I was mostly looking at this from a bare repository server
perspective, not a user working tree. On a bare repository you
probably don't have those special refs like MERGE_HEAD, FETCH_HEAD,
etc.

They could be stored as "..MERGE_HEAD", if you had to. But only HEAD
really matters to hint to clients what to check out by default on
clone.


* Re: RefTree: Alternate ref backend
  2015-12-17 22:10 ` Jeff King
@ 2015-12-17 22:28   ` Shawn Pearce
  2015-12-18  1:36     ` Mike Hommey
  0 siblings, 1 reply; 18+ messages in thread
From: Shawn Pearce @ 2015-12-17 22:28 UTC (permalink / raw)
  To: Jeff King; +Cc: git, David Turner, Michael Haggerty

On Thu, Dec 17, 2015 at 2:10 PM, Jeff King <peff@peff.net> wrote:
> On Thu, Dec 17, 2015 at 01:02:50PM -0800, Shawn Pearce wrote:
>
>> I started playing around with the idea of storing references directly
>> in Git. Exploiting the GITLINK tree entry, we can associate a name to
>> any SHA-1.
>
> Gitlink entries don't imply reachability, though. I guess that doesn't
> matter if your ref backend says "no, really, these are the ref tips, and
> they are reachable".

Exactly. This works with existing JGit because it swaps out the ref
backend. When GC tries to enumerate the roots (current refs), it gets
these through the ref backend by scanning the tree recursively. The
packer itself doesn't care where those roots came from.

Same would be true for any other pluggable ref backend in git-core. GC
has to ask the ref backend, and then trust its reply. How/where that
ref backend tracks that is an implementation detail.

>  But you could not push the whole thing up to
> another server and expect it to hold the whole graph.

Correct, pushing this to another repository doesn't transmit the
graph. If the other repository also used this for its refs backend,
it's now corrupt and confused out of its mind. Just like copying the
packed-refs file with scp. Don't do that. :)

> Which is not strictly necessary, but to me seems like the real advantage
> of using git objects versus some other system.

One advantage is you can edit HEAD symref remotely. Commit a different
symlink value and push. :)

I want to say more, but I'm going to hold back right now. There's more
going on in my head than just this.

> Of course, the lack of reachability has advantages, too. You can
> drop commits pointed to by old reflogs without rewriting the ref
> history.

Yes.

> Unfortunately you cannot expunge the reflogs at all. That's
> good if you like audit trails. Bad if you are worried that your reflogs
> will grow large. :)

At present our servers do not truncate their reflogs. Yes some are... big.

I considered truncating this graph by just using a shallow marker. Add
a shallow entry and repack. The ancient history will eventually be
garbage collected and disappear.

One advantage of this format is deleted branches can retain a reflog
post deletion. Another is you can trivially copy the reflog using
native Git to another system for backup purposes. Or fetch it over the
network to inspect locally. So a shared group server could be
exporting its reflog, you can fetch it and review locally what
happened to branches without logging into the shared server.

So long as you remember that copying the reflog doesn't mean you
actually copied the commit histories, it works nicely.

Another advantage of this format over LMDB or TDB or whatever is Git
already understands it. The tools already understand it. Plumbing can
inspect and repair things. You can reflog the reflog using traditional
reflog ($GIT_DIR/logs/refs/txn/committed).

>> By storing all references in a single tree, atomic transactions are
>> possible. It's a simple compare-and-swap of a single 40 byte SHA-1.
>> This of course leads to a bootstrapping problem: where do we store the
>> 40 byte SHA-1? For this example it's just $GIT_DIR/refs/txn/committed
>> as a classical loose reference.
>
> Somehow putting it inside `refs/` seems weird to me, in an infinite
> recursion kind of way.  I would have picked $GIT_DIR/REFSTREE or
> something. But that is a minor point.

I had started with $GIT_DIR/REFS, but see above. I have more going on
in my head. This is only a tiny building block.

>> Configuration:
>>
>>   [core]
>>     repositoryformatversion = 1
>>   [extensions]
>>     refsBackendType = RefTree
>
> The semantics of extensions config keys are open-ended. The
> formatVersion=1 spec only says "if there is a key you don't know about,
> then you may not proceed". Now we're defining a refsBackendType
> extension. It probably makes sense to write up a few rules (e.g., is
> RefTree case-sensitive?).

In my prototype in JGit I parse it as case-insensitive, but used
CamelCase because the JavaClassNameIsNamedThatWayBecauseJava.


* Re: RefTree: Alternate ref backend
  2015-12-17 22:28   ` Shawn Pearce
@ 2015-12-18  1:36     ` Mike Hommey
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Hommey @ 2015-12-18  1:36 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Jeff King, git, David Turner, Michael Haggerty

On Thu, Dec 17, 2015 at 02:28:01PM -0800, Shawn Pearce wrote:
> On Thu, Dec 17, 2015 at 2:10 PM, Jeff King <peff@peff.net> wrote:
> > On Thu, Dec 17, 2015 at 01:02:50PM -0800, Shawn Pearce wrote:
> >
> >> I started playing around with the idea of storing references directly
> >> in Git. Exploiting the GITLINK tree entry, we can associate a name to
> >> any SHA-1.
> >
> > Gitlink entries don't imply reachability, though. I guess that doesn't
> > matter if your ref backend says "no, really, these are the ref tips, and
> > they are reachable".
> 
> Exactly. This works with existing JGit because it swaps out the ref
> backend. When GC tries to enumerate the roots (current refs), it gets
> these through the ref backend by scanning the tree recursively. The
> packer itself doesn't care where those roots came from.
> 
> Same would be true for any other pluggable ref backend in git-core. GC
> has to ask the ref backend, and then trust its reply. How/where that
> ref backend tracks that is an implementation detail.
> 
> >  But you could not push the whole thing up to
> > another server and expect it to hold the whole graph.
> 
> Correct, pushing this to another repository doesn't transmit the
> graph. If the other repository also used this for its refs backend,
> it's now corrupt and confused out of its mind. Just like copying the
> packed-refs file with scp. Don't do that. :)
> 
> > Which is not strictly necessary, but to me seems like the real advantage
> > of using git objects versus some other system.
> 
> One advantage is you can edit HEAD symref remotely. Commit a different
> symlink value and push. :)
> 
> I want to say more, but I'm going to hold back right now. There's more
> going on in my head than just this.
> 
> > Of course, the lack of reachability has advantages, too. You can
> > drop commits pointed to by old reflogs without rewriting the ref
> > history.
> 
> Yes.

Related thread: "Allowing weak references to blobs and strong references
to commits" http://marc.info/?l=git&m=142779648816577&w=2

Mike


* Re: RefTree: Alternate ref backend
  2015-12-17 21:02 RefTree: Alternate ref backend Shawn Pearce
  2015-12-17 21:57 ` Junio C Hamano
  2015-12-17 22:10 ` Jeff King
@ 2015-12-22 15:41 ` Michael Haggerty
  2015-12-22 16:11   ` Shawn Pearce
  2 siblings, 1 reply; 18+ messages in thread
From: Michael Haggerty @ 2015-12-22 15:41 UTC (permalink / raw)
  To: Shawn Pearce, git; +Cc: David Turner, Jeff King

On 12/17/2015 10:02 PM, Shawn Pearce wrote:
> I started playing around with the idea of storing references directly
> in Git. Exploiting the GITLINK tree entry, we can associate a name to
> any SHA-1.
> 
> By storing all references in a single tree, atomic transactions are
> possible. It's a simple compare-and-swap of a single 40 byte SHA-1.
> This of course leads to a bootstrapping problem: where do we store the
> 40 byte SHA-1? For this example it's just $GIT_DIR/refs/txn/committed
> as a classical loose reference.

I like this general idea a lot, even while recognizing some practical
problems that other people have pointed out. I especially like the idea
of having truly atomic multi-reference updates.

I'm curious why you decided to store all of the references in a single
list, similar to the packed-refs file. This design means that the whole
object has to be rewritten whenever any reference is updated [1].
Certainly, storing the references in a single tree *object* is not a
requirement for having atomic transitions.

I would have expected a design where the layout of the references in
trees mimics the layout of loose references in the filesystem; e.g., one
tree object for "refs/", one for "refs/heads/", one for "refs/remotes/"
etc. This design would reduce the amount of rewriting that is needed
when one or a few references are updated.

Another reason that I find a hierarchical layout intriguing would be
that one could imagine using the SHA-1s of reference namespace subtrees
to speed up the negotiation phase of "git fetch". In the common case
that I use the local namespace "refs/remotes/origin" to track an
upstream repo, the SHA-1 of my "refs/remotes/origin" tree would usually
represent a complete description of the state of the upstream references
at the time that I last fetched. My client could tell the server

    have-tree $SHA1

, where $SHA1 is the hash of the tree representing
"refs/remotes/origin/". If the server keeps a reflog as you have
described (but hierarchically), then the server could look up $SHA1 and
immediately know the full set of references, and therefore of objects,
that I fetched last time. More generally, the negotiation could proceed
down the reference namespace tree and stop whenever commonality is found.

There are a lot of "if"s in that last paragraph, and maybe it's not
workable. For example, if I'm not pruning on fetch, then my reference
tree won't be identical to one that was ever present on the server and
this technique wouldn't necessarily help. But if, for example, we change
the default to pruning, or perhaps record some extra reftree SHA-1's,
then I would expect that this trick could reduce the
effort of negotiation to negligible in most cases, and reduce the time
of the whole fetch to negligible in the case that the clone is already
up-to-date.

Michael

[1] At GitHub, we store public repositories in networks with a shared
object store. The central repository in each network can have 10M+
references. So for us, rewriting that many references for every
reference update would be unworkable.

-- 
Michael Haggerty
mhagger@alum.mit.edu


* Re: RefTree: Alternate ref backend
  2015-12-22 15:41 ` Michael Haggerty
@ 2015-12-22 16:11   ` Shawn Pearce
  2015-12-22 17:04     ` Dave Borowitz
  2015-12-22 17:17     ` Michael Haggerty
  0 siblings, 2 replies; 18+ messages in thread
From: Shawn Pearce @ 2015-12-22 16:11 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, David Turner, Jeff King

On Tue, Dec 22, 2015 at 7:41 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On 12/17/2015 10:02 PM, Shawn Pearce wrote:
>> I started playing around with the idea of storing references directly
>> in Git. Exploiting the GITLINK tree entry, we can associate a name to
>> any SHA-1.
>>
>> By storing all references in a single tree, atomic transactions are
>> possible. It's a simple compare-and-swap of a single 40 byte SHA-1.
>> This of course leads to a bootstrapping problem: where do we store the
>> 40 byte SHA-1? For this example it's just $GIT_DIR/refs/txn/committed
>> as a classical loose reference.
>
> I like this general idea a lot, even while recognizing some practical
> problems that other people have pointed out. I especially like the idea
> of having truly atomic multi-reference updates.
>
> I'm curious why you decided to store all of the references in a single
> list, similar to the packed-refs file. This design means that the whole
> object has to be rewritten whenever any reference is updated [1].
> Certainly, storing the references in a single tree *object* is not a
> requirement for having atomic transitions.
>
> I would have expected a design where the layout of the references in
> trees mimics the layout of loose references in the filesystem; e.g., one
> tree object for "refs/", one for "refs/heads/", one for "refs/remotes/"
> etc. This design would reduce the amount of rewriting that is needed
> when one or a few references are updated.

I did use tree objects for each directory component. The ls-tree I
showed was an ls-tree -r.

"heads" is a different subtree from "tags". I just skipped over the
"refs/" top level subtree because it's useless here. The root tree
would always have one child, "refs", which normally has two children,
"heads" and "tags". Why bother with the root at that point?

So we do get minimum rewriting, tags tree is unmodified and reuses its
tree node when you update master.
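
A small Python sketch of that reuse, under stated assumptions: build
and tree_id are hypothetical helpers, and the hashed text is a
simplified stand-in for Git's real tree object encoding. One nested
dict plays the role of one tree object per directory component.

```python
import hashlib


def build(refs):
    """Turn {'heads/master': id, ...} into nested dicts, one per directory."""
    root = {}
    for name, oid in refs.items():
        *dirs, leaf = name.split("/")
        node = root
        for d in dirs:
            node = node.setdefault(d, {})
        node[leaf] = oid
    return root


def tree_id(node, seen):
    """Hash a nested dict bottom-up and record every tree id in `seen`."""
    lines = []
    for name in sorted(node):
        child = node[name]
        if isinstance(child, dict):
            lines.append(f"tree {tree_id(child, seen)} {name}")
        else:
            lines.append(f"ref {child} {name}")
    digest = hashlib.sha1("\n".join(lines).encode()).hexdigest()
    seen.add(digest)
    return digest


before = {
    "heads/maint": "1ff88560c8d22bcdb528a6629239d638f927cb96",
    "heads/master": "f3adf457e046f92f039353762a78dcb3afb2cb13",
    "tags/v2.6.0": "17f9f635c101aef03874e1de1d8d0322187494b3",
}
# Move master only; everything under tags/ is untouched.
after = dict(before, **{"heads/master": "5ee9e94ccfede68f0c386c497dd85c017efa22d6"})

ids_before, ids_after = set(), set()
root_before = tree_id(build(before), ids_before)
root_after = tree_id(build(after), ids_after)
new_trees = ids_after - ids_before  # trees that had to be written for the update
```

Only the "heads" tree and the root get new ids here; the "tags" tree
hashes to the same id before and after, so it would not be rewritten.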

> Another reason that I find a hierarchical layout intriguing would be
> that one could imagine using the SHA-1s of reference namespace subtrees
> to speed up the negotiation phase of "git fetch". In the common case
> that I use the local namespace "refs/remotes/origin" to track an
> upstream repo, the SHA-1 of my "refs/remotes/origin" tree would usually
> represent a complete description of the state of the upstream references
> at the time that I last fetched. My client could tell the server
>
>     have-tree $SHA1
>
> , where $SHA1 is the hash of the tree representing
> "refs/remotes/origin/". If the server keeps a reflog as you have
> described (but hierarchically), then the server could look up $SHA1 and
> immediately know the full set of references, and therefore of objects,
> that I fetched last time. More generally, the negotiation could proceed
> down the reference namespace tree and stop whenever commonality is found.

Yes. Martin Fick and I were discussing a strategy like this at the
Gerrit User Summit in November. I totally forgot about it when I
started this thread, but I'm glad you independently proposed it. Maybe
it's not a horrible idea!  :)

One problem is clients don't mirror the heads tree exactly; they add
in HEAD as a symbolic reference in a way that the remote peer doesn't
have. Minor detail.


Martin and I were really thinking about server-server negotiation more
than client-server. Consider a master Git server that Linus pushes
to, and then a small farm of mirror servers that users actually clone
from. If an update hook on the master does a git push to each mirror,
the ls-remote advertisement is a non-trivial amount of data to
exchange. If the mirror servers are supposed to exactly match the
master, they can exchange all of their refs with a single SHA-1
instead of a big listing.

This isn't so important for Linus' repository; it's got a relatively
small number of refs. We were thinking more about Gerrit Code Review
where the refs/changes/ namespace is huge and may be causing a
multi-megabyte advertisement. It's common in large companies to have
many mirror slaves in remote offices mirroring the Gerrit server so
that end-users can fetch from their office mirror more quickly.

> There are a lot of "if"s in that last paragraph, and maybe it's not
> workable. For example, if I'm not pruning on fetch, then my reference
> tree won't be identical to one that was ever present on the server and
> this technique wouldn't necessarily help. But if, for example, we change
> the default to pruning, or perhaps record some extra reftree SHA-1's,
> then I would expect that this trick could reduce the
> effort of negotiation to negligible in most cases, and reduce the time
> of the whole fetch to negligible in the case that the clone is already
> up-to-date.

Right, maybe the client just remembers the server's reftree SHA-1 and
offers it back on reconnection. The server can then diff the
two reftrees and show the client only the modified refs that
the client cares about.
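
Roughly, and only as a Python sketch: tree_id and diff below are
hypothetical helpers, nested dicts stand in for tree objects, and the
ids are recomputed on every call where a real implementation would
read them straight out of the stored trees. The point is that a
subtree with a matching id is skipped without being opened.

```python
import hashlib


def tree_id(node):
    """Hash a nested dict of refs bottom-up (simplified, not Git's tree format)."""
    lines = []
    for name in sorted(node):
        child = node[name]
        if isinstance(child, dict):
            lines.append(f"tree {tree_id(child)} {name}")
        else:
            lines.append(f"ref {child} {name}")
    return hashlib.sha1("\n".join(lines).encode()).hexdigest()


def diff(old, new, prefix=""):
    """Yield (name, old_id, new_id) for changed refs, skipping identical subtrees."""
    if tree_id(old) == tree_id(new):
        return  # same id means same contents, so nothing below can differ
    for name in sorted(set(old) | set(new)):
        a, b = old.get(name), new.get(name)
        path = prefix + name
        a_tree = a if isinstance(a, dict) else {}
        b_tree = b if isinstance(b, dict) else {}
        a_ref = None if isinstance(a, dict) else a
        b_ref = None if isinstance(b, dict) else b
        if a_ref != b_ref:
            yield (path, a_ref, b_ref)
        if a_tree or b_tree:
            yield from diff(a_tree, b_tree, path + "/")


remembered = {
    "heads": {"master": "f3adf457e046f92f039353762a78dcb3afb2cb13",
              "next": "5ee9e94ccfede68f0c386c497dd85c017efa22d6"},
    "tags": {"v2.6.0": "17f9f635c101aef03874e1de1d8d0322187494b3"},
}
current = {
    "heads": {"master": "1ff88560c8d22bcdb528a6629239d638f927cb96",
              "next": "5ee9e94ccfede68f0c386c497dd85c017efa22d6"},
    "tags": {"v2.6.0": "17f9f635c101aef03874e1de1d8d0322187494b3"},
}
changed = list(diff(remembered, current))
```

Here the server would only need to advertise heads/master; the whole
"tags" subtree is dismissed by comparing one id.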


> [1] At GitHub, we store public repositories in networks with a shared
> object store. The central repository in each network can have 10M+
> references. So for us, rewriting that many references for every
> reference update would be unworkable.

Yup, and Gerrit Code Review servers often have 100k+ refs per
repository. We can't rewrite the entire store every time either. So
it's not a flat list. It's a directory structure using the / separators
in the ref namespace.

FWIW, JGit is able to scan the canonical trees out of a pack file and
inflate them in approximately the same time it takes to scan the
packed-refs file for some 70k references. So we don't really slow down
much to use this. And there's huge gains to be had by taking advantage
of the tree structure and only inflating the components you need to
answer a particular read.


* Re: RefTree: Alternate ref backend
  2015-12-22 16:11   ` Shawn Pearce
@ 2015-12-22 17:04     ` Dave Borowitz
  2015-12-22 17:17     ` Michael Haggerty
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Borowitz @ 2015-12-22 17:04 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Michael Haggerty, git, David Turner, Jeff King

On Tue, Dec 22, 2015 at 8:11 AM, Shawn Pearce <spearce@spearce.org> wrote:
> Yup, and Gerrit Code Review servers often have 100k+ refs per
> repository. We can't rewrite the entire store every time either. So
> it's not a flat list. It's a directory structure using the / separators
> in the ref namespace.

I wonder if this might be insufficient in some cases, and additional
sharding might be required (though I haven't thought about how to do
that).

Chromium, for example, has close to 10k tags:
$ git ls-remote https://chromium.googlesource.com/chromium/src \
    refs/tags/\* | wc -l
8733

And I doubt it's unique in this regard.


* Re: RefTree: Alternate ref backend
  2015-12-22 16:11   ` Shawn Pearce
  2015-12-22 17:04     ` Dave Borowitz
@ 2015-12-22 17:17     ` Michael Haggerty
  2015-12-22 18:50       ` Shawn Pearce
       [not found]       ` <4689734.cEcQ2vR0aQ@mfick1-lnx>
  1 sibling, 2 replies; 18+ messages in thread
From: Michael Haggerty @ 2015-12-22 17:17 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, David Turner, Jeff King

On 12/22/2015 05:11 PM, Shawn Pearce wrote:
> On Tue, Dec 22, 2015 at 7:41 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> On 12/17/2015 10:02 PM, Shawn Pearce wrote:
>>> I started playing around with the idea of storing references directly
>>> in Git. Exploiting the GITLINK tree entry, we can associate a name to
>>> any SHA-1.
>> [...]
>> I'm curious why you decided to store all of the references in a single
>> list, similar to the packed-refs file. [...]
> 
> I did use tree objects for each directory component. The ls-tree I
> showed was an ls-tree -r.

Silly me. Of course that was clear from your post, and I just overlooked it.

>> Another reason that I find a hierarchical layout intriguing would be
>> that one could imagine using the SHA-1s of reference namespace subtrees
>> to speed up the negotiation phase of "git fetch". [...]
> 
> Yes. Martin Fick and I were discussing a strategy like this at the
> Gerrit User Summit in November. I totally forgot about it when I
> started this thread, but I'm glad you independently proposed it. Maybe
> it's not a horrible idea!  :)
> 
> One problem is clients don't mirror the heads tree exactly; they add
> in HEAD as a symbolic reference in a way that the remote peer doesn't
> have. Minor detail.

Yes, and this is a side effect of leaving out a layer of the remote
reference namespace in the local refs/remotes layout. Naively one would
expect

    refs/remotes/origin/HEAD
    refs/remotes/origin/refs/heads/master

etc. But we store branches into the main "refs/remotes/origin/"
namespace, leaving no reserved space for the remote "HEAD" (not to
mention other namespaces that might appear on the remote, such as
"refs/changes/*", "refs/pull/*", a separate record of the remote's
"refs/tags/*", etc).

Maybe that is why my gut reaction to your proposal to elide the "refs"
part of the reference hierarchy and store "HEAD" as (effectively)
"refs/..HEAD" was negative, even though I can't think of any practical
objections.

At a deeper level, the "refs/" part of reference names is actually
pretty useless in general. I suppose it originated in the practice of
storing loose references under "refs/" to keep them separate from other
metadata in $GIT_DIR. But really, aside from slightly helping
disambiguate references from paths in the command line, what is it good
for? Would we really be worse off if references' full names were

    HEAD
    heads/master
    tags/v1.0.0
    remotes/origin/master (or remotes/origin/heads/master)

etc? This notation is already recognized in most places (though not in
"update-ref"). I think your decision to elide "refs/" in the reftree
hierarchy is a reflection of its uselessness. In any case, your decision
is much less questionable than the decision to mash "refs/heads/*" all
the way up to the top level like we do in "refs/remotes/".

> Martin and I were really thinking about server-server negotiation more
> than client-server. [...]

Yes, that's also an interesting application.

>> There are a lot of "if"s in that last paragraph, and maybe it's not
>> workable. For example, if I'm not pruning on fetch, then my reference
>> tree won't be identical to one that was ever present on the server and
>> this technique wouldn't necessarily help. But if, for example, we change
>> the default to pruning, or perhaps record some extra reftree SHA-1's,
>> then I would expect that this trick could reduce the
>> effort of negotiation to negligible in most cases, and reduce the time
>> of the whole fetch to negligible in the case that the clone is already
>> up-to-date.
> 
> Right, maybe the client just remembers the server's reftree SHA-1 and
> offers it back on reconnection. The server can then diff between the
> two reftrees and show the client only refs that were modified that
> the client cares about.

The client not only has to remember the server's reftree, but also must
verify that it still has all of the objects implied by that reftree, in
case a reference somehow got deleted under "refs/remotes/origin/*". At
that point, there is no special reason to use a SHA-1 in the
negotiation; any unique token generated by the server would suffice if
the server can connect it back to a set of references that was sent to
the client in the past.

The advantage of using hierarchical reftree SHA-1s in the negotiation is
that they can be used to name part of the reftree. For example, if I
fetch "refs/heads/*" from a remote but not "refs/changes/*", then what
do I report as my "have-tree"? I can't claim to have *all* of the
references that the remote had at the time. But with SHA-1s I can say
that I have the reftree that corresponds to my
"refs/remotes/origin/heads/", which the remote can notice is identical
to an old reftree that it happened to have for "refs/heads/" (without
even caring what path it represented). Bingo, we've just agreed about a
big part of the reference namespace without having to agree about the
whole namespace.

In practice, in my first "haves" announcement I would probably list a
few "famous" namespaces in the hope that one or more of them are
recognized by the server:

    have-tree <SHA-1 for "refs/">
    have-tree <SHA-1 for "refs/heads/">
    have-tree <SHA-1 for "refs/tags/">
    have-tree <SHA-1 for "refs/remotes/origin/heads/">
    have-tree <SHA-1 for "refs/remotes/other/heads/">
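A minimal sketch of the subtree matching described above, in Python. It
uses a toy hash over sorted (kind, name, id) lines rather than Git's real
tree object format, and the names build(), tree_ids() and the sample refs
are hypothetical, chosen only for illustration. Paths are relative to
"refs/".

```python
import hashlib

def build(refs):
    """Turn {"heads/master": id, ...} into a nested dict keyed by path component."""
    root = {}
    for name, oid in refs.items():
        *dirs, leaf = name.split("/")
        node = root
        for d in dirs:
            node = node.setdefault(d, {})
        node[leaf] = oid
    return root

def tree_ids(node, path="", out=None):
    """Return (id of this tree, {tree id: path}) covering every subtree.

    Toy hash only: this is NOT how Git serializes tree objects.
    """
    if out is None:
        out = {}
    h = hashlib.sha1()
    for name in sorted(node):
        child = node[name]
        if isinstance(child, dict):
            child_id, _ = tree_ids(child, path + name + "/", out)
            kind = "tree"
        else:
            child_id, kind = child, "ref"
        h.update(f"{kind} {name} {child_id}\n".encode())
    tid = h.hexdigest()
    out[tid] = path
    return tid, out

# An old server state, and a client that fetched only the branches.
server_old = build({"heads/master": "a1", "heads/next": "b2",
                    "changes/01/1/1": "c3"})
client = build({"heads/topic": "d4",
                "remotes/origin/heads/master": "a1",
                "remotes/origin/heads/next": "b2"})

_, server_known = tree_ids(server_old)
_, client_trees = tree_ids(client)

# The client offers the id of its "remotes/origin/heads/" subtree.
have_tree = next(t for t, p in client_trees.items()
                 if p == "remotes/origin/heads/")
matched = server_known.get(have_tree)
print(matched)
```

Here the server recognizes the offered id as its own old "heads/" subtree
even though the client stores it under a different path and never had the
"changes/" namespace.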

> [...]
> FWIW, JGit is able to scan the canonical trees out of a pack file and
> inflate them in approximately the same time it takes to scan the
> packed-refs file for some 70k references. So we don't really slow down
> much to use this. And there's huge gains to be had by taking advantage
> of the tree structure and only inflating the components you need to
> answer a particular read.

Yes, that's another nice aspect of the design.

I do worry a bit that the hierarchical storage only helps if people
shard their reference namespace reasonably. Somebody who stores 100k
references in a single reference "directory" (imagine a
"refs/ci-tests/*") is going to suffer from expensive reference update
performance. But I guess they will suffer from poor performance within
Git as well, and that will probably encourage them to improve their
practices :-) I suppose this is not really much different than people
who store 100k files within a single directory of their working tree.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu

* Re: RefTree: Alternate ref backend
  2015-12-22 17:17     ` Michael Haggerty
@ 2015-12-22 18:50       ` Shawn Pearce
  2015-12-22 19:09         ` Junio C Hamano
       [not found]       ` <4689734.cEcQ2vR0aQ@mfick1-lnx>
  1 sibling, 1 reply; 18+ messages in thread
From: Shawn Pearce @ 2015-12-22 18:50 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, David Turner, Jeff King

On Tue, Dec 22, 2015 at 9:17 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>
> etc. But we store branches into the main "refs/remotes/origin/"
> namespace, leaving no reserved space for the remote "HEAD" (not to
> mention other namespaces that might appear on the remote, such as
> "refs/changes/*", "refs/pull/*", a separate record of the remote's
> "refs/tags/*", etc).
>
> Maybe that is why my gut reaction to your proposal to elide the "refs"
> part of the reference hierarchy and store "HEAD" as (effectively)
> "refs/..HEAD" was negative, even though I can't think of any practical
> objections.

Good point; if the client's refs/remotes/origin/ namespace more
closely mirrored the remote's own namespace
(refs/remotes/origin/heads/master), this seems a lot less fishy. The
mapping certainly makes a bit more sense. Etc.

It's a user-visible shift, however; what was origin/master is now
origin/heads/master. Which is part of the reason why the mapping works
the way it does today. We hardly ever call a branch here heads/master,
we just call it master. So we call origin's master, origin/master. :)

> At a deeper level, the "refs/" part of reference names is actually
> pretty useless in general. I suppose it originated in the practice of
> storing loose references under "refs/" to keep them separate from other
> metadata in $GIT_DIR.

Correct. In the beginning you used echo $sha1 >.git/HEAD and it was good.

Later more refs came along and they had to go somewhere, and so
.git/refs was born with .git/refs/heads/master. Existing tools that
knew how to write to .git/HEAD given the name HEAD could magically
work with refs/heads/master too, and it was good. But that was an
awfully long name to type, so shorthand of "master" for maybe
refs/heads/master or maybe refs/tags/master or maybe no prefix at all
(hi HEAD) came along. Basically it's the origin story of Git. :)

> But really, aside from slightly helping
> disambiguate references from paths in the command line, what is it good
> for?

Nothing really; today the refs/ prefix is used to tell the tools
that you really meant refs/heads/master and not
refs/heads/heads/master or some other crazy construct. You can thank
the DWIMery inside the ref rev parse logic for needing this.
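For reference, a small Python sketch of that DWIM expansion order. The
rule list mirrors the order Git tries when resolving a short name; the
dwim() helper and the sample ref set are hypothetical, and this sketch
simply returns the first match rather than warning about ambiguity.

```python
# Expansion patterns, tried in this order for a short name.
RULES = [
    "{}",
    "refs/{}",
    "refs/tags/{}",
    "refs/heads/{}",
    "refs/remotes/{}",
    "refs/remotes/{}/HEAD",
]

def dwim(short, existing):
    """Return the first expansion of `short` that names an existing ref."""
    for rule in RULES:
        full = rule.format(short)
        if full in existing:
            return full
    return None

existing = {
    "HEAD",
    "refs/heads/master",
    "refs/heads/heads/master",   # a branch awkwardly named "heads/master"
    "refs/tags/v1.0.0",
    "refs/remotes/origin/master",
    "refs/remotes/origin/HEAD",
}

print(dwim("master", existing))        # branch, via refs/heads/
print(dwim("heads/master", existing))  # refs/ rule wins over refs/heads/
print(dwim("origin", existing))        # remote's HEAD
```

Because "refs/{}" is tried before "refs/heads/{}", "heads/master" lands on
refs/heads/master and not on refs/heads/heads/master.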

> The client not only has to remember the server's reftree, but also must
> verify that it still has all of the objects implied by that reftree, in
> case a reference somehow got deleted under "refs/remotes/origin/*". At
> that point, there is no special reason to use a SHA-1 in the
> negotiation; any unique token generated by the server would suffice if
> the server can connect it back to a set of references that was sent to
> the client in the past.

True, but it's a nicer implementation if the token exchanged has a simple
meaning to the server. And it's just a diff-tree at the server to
compute the modifications the client might need to learn about.
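A rough Python sketch of that server-side step, assuming the two reftrees
have already been flattened into name-to-id maps. A real diff-tree would
skip whole subtrees whose ids match instead of comparing every name;
changed_refs() and the sample data are hypothetical.

```python
def changed_refs(old, new, wanted_prefixes):
    """Return [(name, old_id, new_id)] for refs that differ and match a wanted prefix."""
    out = []
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)  # None means "absent"
        if before != after and name.startswith(tuple(wanted_prefixes)):
            out.append((name, before, after))
    return out

# The reftree the client remembered, and the server's current one.
remembered = {"heads/master": "a1", "heads/next": "b2",
              "changes/01/1/1": "c3"}
current = {"heads/master": "e5", "heads/next": "b2", "heads/pu": "f6",
           "changes/01/1/1": "c3", "changes/01/1/2": "c4"}

# The client only cares about branches.
updates = changed_refs(remembered, current, ["heads/"])
for name, before, after in updates:
    print(name, before, after)
```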

I see your point about the client being able to use that to say "I
not only have this, I also have all of the objects". It vastly
simplifies the client's negotiation with the server. The client is
negotiating the common ancestor of the reftree and that immediately
gets the main graph ancestor negotiation system very close to a good
set. The client may still be usefully ahead on other branches, e.g.
she has pulled from the upstream and is now pulling from a
lieutenant's tree, who also recently pulled from the upstream.

> In practice, in my first "haves" announcement I would probably list a
> few "famous" namespaces in the hope that one or more of them are
> recognized by the server:
>
>     have-tree <SHA-1 for "refs/">
>     have-tree <SHA-1 for "refs/heads/">
>     have-tree <SHA-1 for "refs/tags/">
>     have-tree <SHA-1 for "refs/remotes/origin/heads/">
>     have-tree <SHA-1 for "refs/remotes/other/heads/">

Yes, but we also have to be careful about how long we let the "famous"
list get. :)

>> [...]
>> FWIW, JGit is able to scan the canonical trees out of a pack file and
>> inflate them in approximately the same time it takes to scan the
>> packed-refs file for some 70k references. So we don't really slow down
>> much to use this. And there's huge gains to be had by taking advantage
>> of the tree structure and only inflating the components you need to
>> answer a particular read.
>
> Yes, that's another nice aspect of the design.
>
> I do worry a bit that the hierarchical storage only helps if people
> shard their reference namespace reasonably. Somebody who stores 100k
> references in a single reference "directory" (imagine a
> "refs/ci-tests/*") is going to suffer from expensive reference update
> performance. But I guess they will suffer from poor performance within
> Git as well, and that will probably encourage them to improve their
> practices :-) I suppose this is not really much different than people
> who store 100k files within a single directory of their working tree.

Yup. Gerrit Code Review shards refs/changes/ across 100 directories
for this reason as local filesystems don't like large numbers of files
or directories in a directory. But at 100k change entries you are
still dealing with 1k subtrees in each shard. The 100-sharding isn't
quite enough.
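As a small illustration of the Gerrit layout, the shard directory is the
last two digits of the change number. The change_ref() helper below is a
hypothetical Python sketch, not Gerrit code.

```python
def change_ref(change, patch_set):
    """Build a Gerrit-style ref name: refs/changes/<NN>/<change>/<patch set>."""
    shard = change % 100          # last two digits of the change number
    return f"refs/changes/{shard:02d}/{change}/{patch_set}"

print(change_ref(1234, 1))
print(change_ref(5, 2))
```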

I started considering doing a notemap-like sharding for reftree. It's
harder because the names aren't a uniform shape the way object ids are
in a notemap. But it could be possible to split by prefix, for example
start by building a table of all 2 character prefixes in the tree. If
this produces too many entries in any single 2 character subtree,
retry as a 4 character subtree. Continue extending the prefix until
either the number of unique prefixes in the parent tree is too many,
or the subtrees are acceptable sizes. If the parent gets to be too
many (1000?), freeze the parent prefix length and start splitting the
subtrees instead.

For tags you may wind up with a structure like:

  tags/
    v1../
      .0
      .2
    v2../
      .0
      0.125
    v3../
      0.98

Or whatever. Here I used ".." as a suffix on the splits like "v1.." to
indicate the name isn't itself a directory component, but a sharding
split. Thus we have tags "v1.0", "v2.0", "v20.125", "v30.98", etc.

It doesn't help the scalability of a source code tree having too many
files. But we could do some smarter splitting inside reftree to help
it scale even if people aren't sharding their ref namespaces. Sadly
this has a lot of downsides: it's complex to write and it's ugly.

* Re: RefTree: Alternate ref backend
  2015-12-22 18:50       ` Shawn Pearce
@ 2015-12-22 19:09         ` Junio C Hamano
  2015-12-22 19:11           ` Shawn Pearce
  0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2015-12-22 19:09 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Michael Haggerty, git, David Turner, Jeff King

Shawn Pearce <spearce@spearce.org> writes:

>> But really, aside from slightly helping
>> disambiguate references from paths in the command line, what is it good
>> for?
>
> Nothing really; today the refs/ prefix is used to tell the tools
> that you really meant refs/heads/master and not
> refs/heads/heads/master or some other crazy construct. You can thank
> the DWIMery inside the ref rev parse logic for needing this.

Aren't you two forgetting one minor thing, though?

A layout without refs/, i.e. $GIT_DIR/{heads,tags,...}, will force
us to keep track of where the tips of histories are anchored for
reachability purposes, every time you would add a new hierarchy
(e.g. $GIT_DIR/changes)--and those unfortunate souls who run a
slightly older version of Git that is unaware of 'changes' hierarchy
would weep after running "git gc", no?

* Re: RefTree: Alternate ref backend
  2015-12-22 19:09         ` Junio C Hamano
@ 2015-12-22 19:11           ` Shawn Pearce
  2015-12-22 19:34             ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Shawn Pearce @ 2015-12-22 19:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Michael Haggerty, git, David Turner, Jeff King

On Tue, Dec 22, 2015 at 11:09 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>>> But really, aside from slightly helping
>>> disambiguate references from paths in the command line, what is it good
>>> for?
>>
>> Nothing really; today the refs/ prefix is used to tell the tools
>> that you really meant refs/heads/master and not
>> refs/heads/heads/master or some other crazy construct. You can thank
>> the DWIMery inside the ref rev parse logic for needing this.
>
> Aren't you two forgetting one minor thing, though?
>
> A layout without refs/, i.e. $GIT_DIR/{heads,tags,...}, will force
> us to keep track of where the tips of histories are anchored for
> reachability purposes, every time you would add a new hierarchy
> (e.g. $GIT_DIR/changes)--and those unfortunate souls who run a
> slightly older version of Git that is unaware of 'changes' hierarchy
> would weep after running "git gc", no?

You still store them under refs/

All of the code that is handed a ref name knows it's a ref name and not
a sha-1 object name in the objects directory.

The catch is a few things accept HEAD, MERGE_HEAD, FETCH_HEAD, etc.
Those have to be handled even though they aren't in the refs/
directory.

* Re: RefTree: Alternate ref backend
  2015-12-22 19:11           ` Shawn Pearce
@ 2015-12-22 19:34             ` Junio C Hamano
  2015-12-23  4:59               ` Michael Haggerty
  0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2015-12-22 19:34 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Michael Haggerty, git, David Turner, Jeff King

Shawn Pearce <spearce@spearce.org> writes:

> On Tue, Dec 22, 2015 at 11:09 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> Shawn Pearce <spearce@spearce.org> writes:
>>
>>>> But really, aside from slightly helping
>>>> disambiguate references from paths in the command line, what is it good
>>>> for?
>>>
>>> Nothing really; today the refs/ prefix is used to tell the tools
>>> that you really meant refs/heads/master and not
>>> refs/heads/heads/master or some other crazy construct. You can thank
>>> the DWIMery inside the ref rev parse logic for needing this.
>>
>> Aren't you two forgetting one minor thing, though?
>>
>> A layout without refs/, i.e. $GIT_DIR/{heads,tags,...}, will force
>> us to keep track of where the tips of histories are anchored for
>> reachability purposes, every time you would add a new hierarchy
>> (e.g. $GIT_DIR/changes)--and those unfortunate souls who run a
>> slightly older version of Git that is unaware of 'changes' hierarchy
>> would weep after running "git gc", no?
>
> You still store them under refs/

Well I know; the comment was merely a reaction to the exchange
between you two, "What is refs/ good for?", "Nothing really".

You'd benefit by having "refs/" that is known to contain all the
anchoring points for reachability without knowing what subhierarchy
it may contain in the future, that is what it is good for.

* Re: RefTree: Alternate ref backend
       [not found]       ` <4689734.cEcQ2vR0aQ@mfick1-lnx>
@ 2015-12-22 20:56         ` Martin Fick
  2015-12-22 21:23           ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Martin Fick @ 2015-12-22 20:56 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Shawn Pearce, git, David Turner, Jeff King

On Tuesday, December 22, 2015 06:17:28 PM you wrote:
> On Tue, Dec 22, 2015 at 7:41 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>
> At a deeper level, the "refs/" part of reference names is
> actually pretty useless in general. I suppose it
> originated in the practice of storing loose references
> under "refs/" to keep them separate from other metadata
> in $GIT_DIR. But really, aside from slightly helping
> disambiguate references from paths in the command line,
> what is it good for? Would we really be worse off if
> references' full names were
>
>     HEAD
>     heads/master
>     tags/v1.0.0
>     remotes/origin/master (or remotes/origin/heads/master)

I think this is a bit off, because

  HEAD != refs/HEAD

so not quite useless.

But, I agree that the whole refs notation has always bugged
me, it is quirky.  It makes it hard to disambiguate when
something is meant to be absolute or not.  What if we added
a leading slash for absolute references? Then I could do
something like:

/HEAD
/refs/heads/master
/refs/tags/v1.0.0
/refs/remotes/origin/master

I don't like that plumbing has to do a dance to guess at
expansions; how many tools get it wrong (do it in different
orders, miss some expansions...)?  With an absolute
notation, plumbing could be built to require absolute
notations, giving more predictable interpretations when
called from tools.

This is a long term idea, but it might make sense to
consider it now just for the sake of storing refs, it would
eliminate the need for the ".." notation for "refs/..HEAD".

Now if we could only figure out a way to tell plumbing that
something is a SHA, not a ref? :)

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation

* Re: RefTree: Alternate ref backend
  2015-12-22 20:56         ` Martin Fick
@ 2015-12-22 21:23           ` Junio C Hamano
  0 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2015-12-22 21:23 UTC (permalink / raw)
  To: Martin Fick; +Cc: Michael Haggerty, Shawn Pearce, git, David Turner, Jeff King

Martin Fick <mfick@codeaurora.org> writes:

> ...  What if we added
> a leading slash for absolute references? Then I could do
> something like:
>
> /HEAD
> /refs/heads/master
> /refs/tags/v1.0.0
> /refs/remotes/origin/master

Yeah, that is one way to allow a tool to be absolutely certain there
is no funny DWIMmery going on.

> This is a long term idea, but it might make sense to
> consider it now just for the sake of storing refs, it would
> eliminate the need for the ".." notation for "refs/..HEAD".

I do not see how the absolute notation has anything to do with
eliminating "the need for the '..' notation" at all, though.

The funny "..HEAD" was brought up only because Shawn wanted to omit
a single level of dereferencing of a tree object, so that the
top-level tree for his ref backend would have "heads/", "tags/", etc.
in it, and because "HEAD" is not next to "heads/" and "tags/", it
needed some funny notation to avoid squatting on "HEAD" that should
mean "refs/HEAD" in the notation.

> Now if we could only figure out a way to tell plumbing that
> something is a SHA, not a ref? :)

You do not need :) there; I think we discussed something along that
line in the past few weeks (see the list archive).

* Re: RefTree: Alternate ref backend
  2015-12-22 19:34             ` Junio C Hamano
@ 2015-12-23  4:59               ` Michael Haggerty
  2015-12-24  1:33                 ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Michael Haggerty @ 2015-12-23  4:59 UTC (permalink / raw)
  To: Junio C Hamano, Shawn Pearce; +Cc: git, David Turner, Jeff King, Martin Fick

On 12/22/2015 08:34 PM, Junio C Hamano wrote:
> Shawn Pearce <spearce@spearce.org> writes:
> 
>> On Tue, Dec 22, 2015 at 11:09 AM, Junio C Hamano <gitster@pobox.com> wrote:
>>> Shawn Pearce <spearce@spearce.org> writes:
>>>
>>>>> But really, aside from slightly helping
>>>>> disambiguate references from paths in the command line, what is it good
>>>>> for?
>>>>
>>>> Nothing really; today the refs/ prefix is used to tell the tools
>>>> that you really meant refs/heads/master and not
>>>> refs/heads/heads/master or some other crazy construct. You can thank
>>>> the DWIMery inside the ref rev parse logic for needing this.
>>>
>>> Aren't you two forgetting one minor thing, though?
>>>
>>> A layout without refs/, i.e. $GIT_DIR/{heads,tags,...}, will force
>>> us to keep track of where the tips of histories are anchored for
>>> reachability purposes, every time you would add a new hierarchy
>>> (e.g. $GIT_DIR/changes)--and those unfortunate souls who run a
>>> slightly older version of Git that is unaware of 'changes' hierarchy
>>> would weep after running "git gc", no?
>>
>> You still store them under refs/
> 
> Well I know; the comment was merely a reaction to the exchange
> between you two, "What is refs/ good for?", "Nothing really".
> 
> You'd benefit by having "refs/" that is known to contain all the
> anchoring points for reachability without knowing what subhierarchy
> it may contain in the future, that is what it is good for.

You are answering "What is 'refs/' good for in the pathnames of files
that store loose references?" I was asking "What is 'refs/' good for in
the logical names of references?"

It would have been totally possible to make the full name of a branch
be, for example, "heads/master" and nevertheless store its loose
reference in "$GIT_DIR/refs/heads/master". The obvious place to store
HEAD in such a scheme would have been "$GIT_DIR/refs/HEAD" while still
calling it "HEAD". This could have avoided the problem that we now have
with pseudo-references like FETCH_HEAD being stored directly in $GIT_DIR.

On 12/22/2015 09:56 PM, Martin Fick wrote:
> On Tuesday, December 22, 2015 06:17:28 PM you wrote:
>> On Tue, Dec 22, 2015 at 7:41 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>
>> [...] Would we really be worse off if
>> references' full names were
>>
>>     HEAD
>>     heads/master
>>     tags/v1.0.0
>>     remotes/origin/master (or remotes/origin/heads/master)
>
> I think this is a bit off, because
>
>   HEAD != refs/HEAD
>
> so not quite useless.

A reference called "refs/HEAD" is not forbidden today but it's still not
very useful, is it? Do you know of some system that uses reference names
like this or are you just pointing out that it's theoretically possible?

> But, I agree that the whole refs notation has always bugged
> me, it is quirky.  It makes it hard to disambiguate when
> something is meant to be absolute or not.  What if we added
> a leading slash for absolute references? Then I could do
> something like:
>
> /HEAD
> /refs/heads/master
> /refs/tags/v1.0.0
> /refs/remotes/origin/master

I like the idea of having a way to express "absolute" reference names.
But maybe if we do so we could take a step towards deprecating "refs/"
in references' logical names, by instead using the following absolute
notation for the above references:

    /HEAD
    /heads/master
    /tags/v1.0.0
    /remotes/origin/master

Specifically:

* Any name of the form "/$name" for which is_pseudoref_syntax($name)
  returns true would be mapped to what we today call "$name" (e.g.,
  "/FETCH_HEAD" would be mapped to today's "FETCH_HEAD")

* Any other name of the form "/$name" would be mapped to today's
  "refs/$name".

Note that all of the absolute references listed above, with their leading
"/" removed, have the same interpretation under DWIM as they would as
absolute names under my proposal (provided of course, that there is no
DWIM ambiguity with other reference names).
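The two mapping rules can be sketched in Python as follows. The
is_pseudoref_syntax() here is only an approximation of Git's helper
(assumed to accept names made solely of uppercase ASCII letters, "-" and
"_"), and resolve_absolute() is a hypothetical name for the proposed
mapping.

```python
import string

PSEUDOREF_CHARS = set(string.ascii_uppercase + "-_")

def is_pseudoref_syntax(name):
    """Approximation: non-empty, only uppercase ASCII letters, '-' and '_'."""
    return bool(name) and all(c in PSEUDOREF_CHARS for c in name)

def resolve_absolute(name):
    """Map a proposed absolute name ("/...") to today's full reference name."""
    if not name.startswith("/"):
        raise ValueError("not an absolute reference name: " + name)
    rest = name[1:]
    if is_pseudoref_syntax(rest):
        return rest               # e.g. "/FETCH_HEAD" -> "FETCH_HEAD"
    return "refs/" + rest         # e.g. "/heads/master" -> "refs/heads/master"

for absolute in ["/HEAD", "/FETCH_HEAD", "/heads/master",
                 "/tags/v1.0.0", "/remotes/origin/master"]:
    print(absolute, "->", resolve_absolute(absolute))
```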

The only disadvantage that I can see with this scheme is that there
would be no "absolute" notation for a reference that currently has a
full name like "refs/HEAD" (or more generally a reference currently
called "refs/$name" where is_pseudoref_syntax($name) returns true). I
think that is acceptable: (1) such references are probably not in wide
use; (2) we wouldn't (yet) have to prohibit such references; even though
there would be no absolute notation to represent them, their old-style
names would still work.

If we ever decide to go further in banishing "refs/", the next step in
the transition would be to disallow names like "refs/HEAD", treat the
absolute reference names as the "canonical" version, and add DWIM
rules that treat a prefix "refs/" very much like a leading "/".

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu

* Re: RefTree: Alternate ref backend
  2015-12-23  4:59               ` Michael Haggerty
@ 2015-12-24  1:33                 ` Junio C Hamano
  0 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2015-12-24  1:33 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Shawn Pearce, git, David Turner, Jeff King, Martin Fick

Michael Haggerty <mhagger@alum.mit.edu> writes:

> You are answering "What is 'refs/' good for in the pathnames of files
> that store loose references?" I was asking "What is 'refs/' good for in
> the logical names of references?"
>
> It would have been totally possible to make the full name of a branch
> be, for example, "heads/master" and nevertheless store its loose
> reference in "$GIT_DIR/refs/heads/master". The obvious place to store
> HEAD in such a scheme would have been "$GIT_DIR/refs/HEAD" while still
> calling it "HEAD". This could have avoided the problem that we now have
> with pseudo-references like FETCH_HEAD being stored directly in $GIT_DIR.

I see; OK.

Thread overview: 18+ messages
2015-12-17 21:02 RefTree: Alternate ref backend Shawn Pearce
2015-12-17 21:57 ` Junio C Hamano
2015-12-17 22:15   ` Shawn Pearce
2015-12-17 22:10 ` Jeff King
2015-12-17 22:28   ` Shawn Pearce
2015-12-18  1:36     ` Mike Hommey
2015-12-22 15:41 ` Michael Haggerty
2015-12-22 16:11   ` Shawn Pearce
2015-12-22 17:04     ` Dave Borowitz
2015-12-22 17:17     ` Michael Haggerty
2015-12-22 18:50       ` Shawn Pearce
2015-12-22 19:09         ` Junio C Hamano
2015-12-22 19:11           ` Shawn Pearce
2015-12-22 19:34             ` Junio C Hamano
2015-12-23  4:59               ` Michael Haggerty
2015-12-24  1:33                 ` Junio C Hamano
     [not found]       ` <4689734.cEcQ2vR0aQ@mfick1-lnx>
2015-12-22 20:56         ` Martin Fick
2015-12-22 21:23           ` Junio C Hamano
