git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Michael Haggerty <mhagger@alum.mit.edu>
To: Shawn Pearce <spearce@spearce.org>, git <git@vger.kernel.org>
Cc: David Turner <dturner@twopensource.com>, Jeff King <peff@peff.net>
Subject: Re: RefTree: Alternate ref backend
Date: Tue, 22 Dec 2015 16:41:43 +0100	[thread overview]
Message-ID: <56796F37.1000600@alum.mit.edu> (raw)
In-Reply-To: <CAJo=hJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg@mail.gmail.com>

On 12/17/2015 10:02 PM, Shawn Pearce wrote:
> I started playing around with the idea of storing references directly
> in Git. Exploiting the GITLINK tree entry, we can associate a name to
> any SHA-1.
> 
> By storing all references in a single tree, atomic transactions are
> possible. Its a simple compare-and-swap of a single 40 byte SHA-1.
> This of course leads to a bootstrapping problem, where do we store the
> 40 byte SHA-1? For this example its just $GIT_DIR/refs/txn/committed
> as a classical loose reference.

I like this general idea a lot, even while recognizing some practical
problems that other people have pointed out. I especially like the idea
of having truly atomic multi-reference updates.

I'm curious why you decided to store all of the references in a single
list, similar to the packed-refs file. This design means that the whole
object has to be rewritten whenever any reference is updated [1].
Certainly, storing the references in a single tree *object* is not a
requirement for having atomic transitions.

I would have expected a design where the layout of the references in
trees mimics the layout of loose references in the filesystem; e.g., one
tree object for "refs/", one for "refs/heads/" one for "refs/remotes/"
etc. This design would reduce the amount of rewriting that is needed
when one or a few references are updated.

Another reason that I find a hierarchical layout intriguing would be
that one could imagine using the SHA-1s of reference namespace subtrees
to speed up the negotiation phase of "git fetch". In the common case
that I use the local namespace "refs/remotes/origin" to track an
upstream repo, the SHA-1 of my "refs/remotes/origin" tree would usually
represent a complete description of the state of the upstream references
at the time that I last fetched. My client could tell the server

    have-tree $SHA1

, where $SHA1 is the hash of the tree representing
"refs/remotes/origin/". If the server keeps a reflog as you have
described (but hierarchically), then the server could look up $SHA1 and
immediately know the full set of references, and therefore of objects,
that I fetched last time. More generally, the negotiation could proceed
down the reference namespace tree and stop whenever commonality is found.

There are a lot of "if"s in that last paragraph, and maybe it's not
workable. For example, if I'm not pruning on fetch, then my reference
tree won't be identical to one that was ever present on the server and
this technique wouldn't necessarily help. But if, for example, we change
the default to pruning, or perhaps record some extra reftree SHA-1's,
then in most cases I would expect that this trick could reduce the
effort of negotiation to negligible in most cases, and reduce the time
of the whole fetch to negligible in the case that the clone is already
up-to-date.

Michael

[1] At GitHub, we store public repositories in networks with a shared
object store. The central repository in each network can have 10M+
references. So for us, rewriting that many references for every
reference update would be unworkable.

-- 
Michael Haggerty
mhagger@alum.mit.edu

  parent reply	other threads:[~2015-12-22 15:49 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-12-17 21:02 RefTree: Alternate ref backend Shawn Pearce
2015-12-17 21:57 ` Junio C Hamano
2015-12-17 22:15   ` Shawn Pearce
2015-12-17 22:10 ` Jeff King
2015-12-17 22:28   ` Shawn Pearce
2015-12-18  1:36     ` Mike Hommey
2015-12-22 15:41 ` Michael Haggerty [this message]
2015-12-22 16:11   ` Shawn Pearce
2015-12-22 17:04     ` Dave Borowitz
2015-12-22 17:17     ` Michael Haggerty
2015-12-22 18:50       ` Shawn Pearce
2015-12-22 19:09         ` Junio C Hamano
2015-12-22 19:11           ` Shawn Pearce
2015-12-22 19:34             ` Junio C Hamano
2015-12-23  4:59               ` Michael Haggerty
2015-12-24  1:33                 ` Junio C Hamano
     [not found]       ` <4689734.cEcQ2vR0aQ@mfick1-lnx>
2015-12-22 20:56         ` Martin Fick
2015-12-22 21:23           ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56796F37.1000600@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=dturner@twopensource.com \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).