git@vger.kernel.org mailing list mirror (one of many)
* [RFC/WIP] Pluggable reference backends
@ 2014-03-10 11:00 Michael Haggerty
  2014-03-10 11:44 ` Johan Herland
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Michael Haggerty @ 2014-03-10 11:00 UTC (permalink / raw)
  To: git discussion list; +Cc: Jeff King, Vicent Marti, Brad King, Johan Herland

I have started working on pluggable ref backends.  In this email I
would like to share my plans and solicit feedback.

(This morning I removed this project from the GSoC ideas page, because
it is unfair to ask a student to shoot at a moving target.)

Why?
====

Currently, the reference- and reflog-handling code in Git is too
coupled to the rest of the system.  There are too many places that
know, for example, the difference between loose and packed refs, or
that loose references are stored as files directly under
$GIT_DIR/refs/heads/, or the locking protocols that have to be adhered
to when managing references.  This tight coupling, in turn, makes it
nearly impossible to experiment with alternate reference storage
schemes.

But there is a lot of potential to use alternate reference storage
schemes to fix some currently-unfixable problems, and to implement
some cool new features.

Unfixable problems
------------------

The on-disk format that we currently use to store references makes
some problems impossible to fix:

* It is impossible to get a self-consistent snapshot of all references
  at a given moment in time.  This makes it impossible, even in
  principle, to do object pruning in a 100% race-free way.  (Our
  current workaround of not deleting objects that are less than two
  weeks old works in most cases but, aside from being ugly, has holes.)

* There are awkward filesystem-imposed constraints on reference
  naming, for example:

  * D/F conflicts (I): it is not possible to have branches named
    "my-feature" and "my-feature/base" at the same time.

  * D/F conflicts (II): it is not possible to have reflogs for
    branches named "my-feature" and "my-feature/base" at the same
    time.  This leads to the problem that it is not, in general,
    possible to retain reflogs for branches that have been deleted.

  * There are additional constraints on reference names depending on
    the filesystem used to store them.  For example, a Git repository
    on a case-insensitive filesystem fails in confusing ways if there
    are two loose references whose names differ only in case; however,
    packed references differing in case might work for a while.  Also,
    reference names that include Unicode characters can have their
    normalization form changed if they are written on Mac OS.

* The packed-refs file has to be rewritten whenever a packed reference
  is deleted.  It might be nice to write 0{40} to a loose reference
  file to indicate that the reference has been deleted, but that would
  open the way for more D/F conflicts.

Wild new ideas
--------------

So, I would like to reorganize the Git code to allow pluggable
reference backends.  If we had this, we could try out ideas like

* Retain the idea of loose/packed references, but encode loose
  reference names using a portable naming scheme before storing them
  to the filesystem; maybe something like

      refs/heads/Foo.42 -> refs.dir/heads.dir/%46oo%2e42
      logs/refs/heads/Foo.42 -> refs.dir/heads.dir/%46oo%2e42.log

  Yes, it looks uglier.  But users shouldn't be looking in these
  directories anyway.  This single change would prevent D/F conflicts,
  allow a reference to be deleted by writing 0{40} to its loose
  reference file, allow reflogs to be kept for deleted refs, and
  remove the problem of filesystem-dependent naming constraints.

* Store references in a SQLite database, to get correct transaction
  handling.

* Store references directly in the Git object database.

* Implement repository "groups" that share a common object database
  and also a common reference store.  Each repository in a group would
  get a sub-namespace in the shared database, and store its references
  in names like "refs/member/$MEMBERID/refs/heads/...".  The member
  repos would act like restricted views of the shared database.  This
  would be like a combination between alternates (with lowered risk of
  corruption) and gitnamespaces(7) (but usable for all git commands).

* Reference transactions that can be used across multiple Git
  commands.  Imagine,

      export GIT_TRANSACTION=$(git transaction begin)
      trap 'git transaction rollback' ERR
      git foo ...
      git bar ...
      git baz ...
      if ! git transaction commit
      then
          # Transaction failed; all references rolled back
      else
          # Transaction succeeded; all references updated atomically
      fi
      trap '' ERR
      unset GIT_TRANSACTION

  The "GIT_TRANSACTION" environment variable would tell git to read
  from the usual references, overridden with any reference changes
  that have occurred during the transaction, but write any changes
  (including both old and new values) to the transaction.  The command
  "git transaction commit" would verify that the old values listed in
  the transaction still agree with the current values, and then make
  all of the changes atomically.

  Such transactions could also be broadcast to mirrors when they are
  committed to keep multiple Git repositories in sync.

* One alternate backend might even be a shim that delegates to libgit2
  to do the actual reading/writing of references.  Then new backends
  could be implemented in libgit2 to allow both git and libgit2 to
  benefit.
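
The portable naming scheme in the first bullet above can be sketched as
follows.  This is only an illustration of the two mappings shown there,
under assumptions that go beyond the proposal: lowercase letters, digits,
"_" and "-" are kept, every other byte becomes a lowercase %xx escape,
and the function names are hypothetical.

```python
# Hypothetical sketch of the "portable naming scheme" idea; the exact
# set of characters that stay unescaped is an assumption.
SAFE = set("abcdefghijklmnopqrstuvwxyz0123456789_-")


def encode_component(name: str) -> str:
    """Escape every character outside SAFE as %xx per UTF-8 byte."""
    out = []
    for ch in name:
        if ch in SAFE:
            out.append(ch)
        else:
            out.extend(f"%{b:02x}" for b in ch.encode("utf-8"))
    return "".join(out)


def encode_ref(refname: str) -> str:
    """Map a reference name to its on-disk path.

    Directory components get a ".dir" suffix, so a ref "a" and a ref
    "a/b" never compete for the same filesystem name.
    """
    *dirs, leaf = refname.split("/")
    parts = [encode_component(d) + ".dir" for d in dirs]
    parts.append(encode_component(leaf))
    return "/".join(parts)


def encode_reflog(logname: str) -> str:
    """Map "logs/<refname>" to the encoded ref path plus ".log"."""
    prefix = "logs/"
    if not logname.startswith(prefix):
        raise ValueError("reflog names are expected to start with logs/")
    return encode_ref(logname[len(prefix):]) + ".log"
```

Because "." is always escaped inside a component, the ".dir" and ".log"
suffixes cannot collide with anything a user names, and uppercase letters
never reach a case-insensitive filesystem unescaped.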


The plan
========

It is currently not possible to experiment with any of these things
because of the tight coupling between the reference code and the rest
of git. The goal of this project is first to choke the interactions
down to a coherent interface, and second to make the implementation
selectable at runtime.  The implementation of specific alternate
backends will hopefully follow.

quagga references
-----------------

The overriding task is to isolate the reference-handling code; i.e.,
make sure that only code within refs.c touches git references, and
that the refs API provides all of the features that other code needs
to do its work.

So as a whimsical first milestone, I want to make it possible to
choose a different directory name for storing references and reflogs
by changing one #define statement in refs.c.  The goal is to get the
test suite to run correctly regardless of how this variable is set,
which would be a pretty good check that all reference-handling code
paths go through the refs API.  For no special reason I've been using
"quagga" as the new place, so references go to "$GIT_DIR/quagga/HEAD",
"$GIT_DIR/quagga/refs/heads/master", etc.  (Of course we wouldn't
actually *change* this name; it is only for testing purposes.)  I've
started working on this but there is a lot of code to change
(including test code).

Reference transactions
----------------------

I want to orient the new reference API as much as possible around
transactions.  I think a transaction is a flexible abstraction that
should be implementable by any backend (albeit not always with 100%
ACID compliance) and will allow a couple of existing races to be
fixed.

So as a first step, I will soon submit a patch series that starts
fleshing out the concept of a ref_transaction, and rewrites "git
update-ref --stdin" to use the new API.  For now, ref_transaction will
only be usable within a single git command invocation, but I want to
leave the way open to the GIT_TRANSACTION idea mentioned above.
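
To make the intended semantics concrete, here is a minimal sketch of a
transaction that records old and new values and verifies the old values
at commit time.  It is not the planned C API: the class and method names
are hypothetical, the "store" is a plain in-memory dict, and real code
would need locking to make the verify-then-apply step atomic.

```python
class RefTransactionError(Exception):
    """Raised when a ref changed behind the transaction's back."""


class RefTransaction:
    """Hypothetical model of a ref transaction over a dict-based store."""

    def __init__(self, store):
        self.store = store    # maps refname -> object name
        self.updates = {}     # maps refname -> (old value, new value)

    def read(self, name):
        # Reads see pending changes first, then the usual references.
        if name in self.updates:
            return self.updates[name][1]
        return self.store.get(name)

    def update(self, name, new):
        # Remember the value first seen; new=None means "delete".
        if name in self.updates:
            old = self.updates[name][0]
        else:
            old = self.store.get(name)
        self.updates[name] = (old, new)

    def commit(self):
        # Phase 1: verify that every recorded old value still matches.
        for name, (old, _) in self.updates.items():
            if self.store.get(name) != old:
                raise RefTransactionError(f"{name} changed concurrently")
        # Phase 2: apply all changes; nothing was written before this.
        for name, (_, new) in self.updates.items():
            if new is None:
                self.store.pop(name, None)
            else:
                self.store[name] = new
        self.updates = {}
```

If verification fails, no reference has been touched, which is the
all-or-nothing behavior that "git update-ref --stdin" callers would want.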


Transition
==========

The current project is only to isolate the reference-handling code and
make it, in principle, exchangeable with another implementation.  It
doesn't require any transition.

Moreover, the changes will improve the modularity of the Git code, and
will be beneficial purely on those grounds.

When/if alternate backends are implemented, then the transition will
have to be handled on a case-by-case basis.  How references are stored
is mostly a decision internal to a single repository.  Any new
repository storage formats should be supported *in addition to* the
traditional storage scheme, to prevent the need for a flag day when
all repositories have to be converted simultaneously.

Git hosters [1] will be likely to take advantage of alternate
reference backends pretty easily, because they know which tools touch
their repositories and need only update those tools.  It is expected
that alternate reference backends will be useful for hosters even if
they don't become practical for end-users.

For end-users it is important that their repository be readable by all
of the tools that they use.  So if we want to make a new format a
viable option for normal Git users (let alone make it the new default
format), some coordination will be needed between all of the
commonly-used Git implementations (git-core, libgit2, JGit, and maybe
Dulwich, Grit, ...).  Whether or not this happens in real life depends
on how advantageous the hypothetical new format is to Git users and is
beyond the scope of this proposal.

Michael

[1] Full disclosure: this includes my employer, GitHub.

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

* Re: [RFC/WIP] Pluggable reference backends
  2014-03-10 11:00 [RFC/WIP] Pluggable reference backends Michael Haggerty
@ 2014-03-10 11:44 ` Johan Herland
  2014-03-10 14:30 ` Shawn Pearce
  2014-03-11 10:56 ` [RFC/WIP] Pluggable reference backends Karsten Blees
  2 siblings, 0 replies; 17+ messages in thread
From: Johan Herland @ 2014-03-10 11:44 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git discussion list, Jeff King, Vicent Marti, Brad King

On Mon, Mar 10, 2014 at 12:00 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> I have started working on pluggable ref backends.  In this email I
> would like to share my plans and solicit feedback.

No comments or useful feedback yet, except that I enthusiastically
approve of the objective and the plan you have for how to get there.


...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

* Re: [RFC/WIP] Pluggable reference backends
  2014-03-10 11:00 [RFC/WIP] Pluggable reference backends Michael Haggerty
  2014-03-10 11:44 ` Johan Herland
@ 2014-03-10 14:30 ` Shawn Pearce
  2014-03-10 15:51   ` Max Horn
  2014-03-10 15:52   ` Jeff King
  2014-03-11 10:56 ` [RFC/WIP] Pluggable reference backends Karsten Blees
  2 siblings, 2 replies; 17+ messages in thread
From: Shawn Pearce @ 2014-03-10 14:30 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git discussion list, Jeff King, Vicent Marti, Brad King,
	Johan Herland

On Mon, Mar 10, 2014 at 4:00 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> I have started working on pluggable ref backends.  In this email I
> would like to share my plans and solicit feedback.

Yay!

JGit already has pluggable ref backends, so it is good to see this
starting in git-core.

FWIW the Gerrit Code Review community is interested in this project.

> * Store references in a SQLite database, to get correct transaction
>   handling.

No to SQLite in git-core. Using it from JGit requires building
SQLite and a JNI wrapper, which makes JGit significantly less
portable. I know SQLite is pretty amazing, but implementing
compatibility with it from JGit will be a big nightmare for us.

> * Reference transactions that can be used across multiple Git
>   commands.  Imagine,
>
>       export GIT_TRANSACTION=$(git transaction begin)
>       trap 'git transaction rollback' ERR
>       git foo ...
>       git bar ...
>       git baz ...
>       if ! git transaction commit
>       then
>           # Transaction failed; all references rolled back
>       else
>           # Transaction succeeded; all references updated atomically
>       fi
>       trap '' ERR
>       unset GIT_TRANSACTION
>
>   The "GIT_TRANSACTION" environment variable would tell git to read
>   from the usual references, overridden with any reference changes
>   that have occurred during the transaction, but write any changes
>   (including both old and new values) to the transaction.  The command
>   "git transaction commit" would verify that the old values listed in
>   the transaction still agree with the current values, and then make
>   all of the changes atomically.

Yay!

Gerrit Code Review really wants to get transactions implemented. So I
am very much in favor of trying to improve the situation in git-core.

We want not only a transaction over 2+ references in the same
repository, but we also want to perform transactions across
repositories. Consider a git submodule child and parent being updated
at the same time. We really want to update refs/heads/master in both
repositories atomically at the central server.

>   Such transactions could also be broadcast to mirrors when they are
>   committed to keep multiple Git repositories in sync.

Ooh, this would be very interesting.

> Git hosters [1] will be likely to take advantage of alternate
> reference backends pretty easily, because they know which tools touch
> their repositories and need only update those tools.  It is expected
> that alternate reference backends will be useful for hosters even if
> they don't become practical for end-users.

Alternate reference backends are absolutely useful to large hosters.
The loose reference format isn't very scalable. The packed-refs helps,
but you can do better. IIRC our android.googlesource.com reference
backend uses only 79 bytes per reference on average, including both
the name string and the value. This super compact format is easy to
hold in RAM for hundreds of busy repositories.

> For end-users it is important that their repository be readable by all
> of the tools that they use.  So if we want to make a new format a
> viable option for normal Git users (let alone make it the new default
> format), some coordination will be needed between all of the
> commonly-used Git implementations (git-core, libgit2, JGit, and maybe
> Dulwich, Grit, ...).  Whether or not this happens in real life depends
> on how advantageous the hypothetical new format is to Git users and is
> beyond the scope of this proposal.

It is sad we have this many implementations, but as one of the authors
(JGit) I am happy to at least see you are worrying about compatibility
with them.

* Re: [RFC/WIP] Pluggable reference backends
  2014-03-10 14:30 ` Shawn Pearce
@ 2014-03-10 15:51   ` Max Horn
  2014-03-10 15:52   ` Jeff King
  1 sibling, 0 replies; 17+ messages in thread
From: Max Horn @ 2014-03-10 15:51 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Michael Haggerty, git discussion list, Jeff King, Vicent Marti,
	Brad King, Johan Herland


On 10.03.2014, at 15:30, Shawn Pearce <spearce@spearce.org> wrote:

> On Mon, Mar 10, 2014 at 4:00 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> I have started working on pluggable ref backends.  In this email I
>> would like to share my plans and solicit feedback.
> 
> Yay!

Yay, too!

> JGit already has pluggable ref backends, so it is good to see this
> starting in git-core.
> 
> FWIW the Gerrit Code Review community is interested in this project.
> 
>> * Store references in a SQLite database, to get correct transaction
>>  handling.
> 
> No to SQLite in git-core. Using it from JGit requires building
> SQLite and a JNI wrapper, which makes JGit significantly less
> portable. I know SQLite is pretty amazing, but implementing
> compatibility with it from JGit will be a big nightmare for us.

I understood this as an example (indeed, it is listed under "Wild new ideas"), not a proposal to put this into the git core. It might be an interesting experiment in any case, and if the proposed modularity is truly achieved, it could (if there was any interest in it, that is) be implemented in an external 3rd party project.


Anyway, I am quite excited about this project. Usually, I am quite skeptical about such large scope ideas ("Yeah, cool idea, but who will pull it off, and with which resources?"). But this one seems to have a good chance of being implemented gradually and inside the main repository, with the help of "feature flags". 

Thus, I am looking forward to Michael's announced initial patch series. I feel that I don't know enough yet about git overall to be of much help on my own at this point. But perhaps over time some mini- or micro-projects pop up where others can help (e.g. "adapt these 50 tests to work with the 'quagga' ref"); if they are pointed out (assuming that doing so isn't more work than just addressing them yourself ;-), I am willing to help out.


Cheers,
Max

* Re: [RFC/WIP] Pluggable reference backends
  2014-03-10 14:30 ` Shawn Pearce
  2014-03-10 15:51   ` Max Horn
@ 2014-03-10 15:52   ` Jeff King
  2014-03-10 16:14     ` David Kastrup
                       ` (2 more replies)
  1 sibling, 3 replies; 17+ messages in thread
From: Jeff King @ 2014-03-10 15:52 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Michael Haggerty, git discussion list, Vicent Marti, Brad King,
	Johan Herland

On Mon, Mar 10, 2014 at 07:30:45AM -0700, Shawn Pearce wrote:

> > * Store references in a SQLite database, to get correct transaction
> >   handling.
> 
> No to SQLite in git-core. Using it from JGit requires building
> SQLite and a JNI wrapper, which makes JGit significantly less
> portable. I know SQLite is pretty amazing, but implementing
> compatibility with it from JGit will be a big nightmare for us.

That seems like a poor reason not to implement a pluggable feature for
git-core. If we implement it, then a site using only git-core can take
advantage of it. Sites with JGit cannot, and would use a different
pluggable storage mechanism that's supported by both. But if we don't
implement it, it hurts people using only git-core, and it does not help
sites using JGit at all.

That's assuming that attention spent on implementing the feature does
not take away from implementing some other parallel scheme that does the
same thing but does not use SQLite. I don't know what that would be
offhand; mapping the ref and reflog into a relational database is pretty
simple, and we get a lot of robustness and efficiency benefits for free.
We could perhaps have some kind of "relational" backend that could use an
ODBC-like abstraction to point to a database. I have no idea if people
would want to ever store refs in a "real" server-backend RDBMS, but I
suspect Java has native support for such things.

Certainly I think we should aim for compatibility where we can, but if
there's not a compatible way to do something, I don't think the
limitations of one platform should drag other ones down. And that goes
both ways; we had to reimplement disk-compatible EWAH from scratch in C
for git-core to have bitmaps, whereas JGit just got to use a ready-made
library. I don't think that was a bad thing.  People in
mixed-implementation environments couldn't use it, but people with
JGit-only environments were free to take advantage of it.

At any rate, the repository needs to advertise "this is the ref storage
mechanism I use" in the config. We're going to need to bump
core.repositoryformatversion for such cases (because an old version of
git should not blindly lock and write to a refs/ directory that nobody
else is ever going to look at). And I'd suggest with that bump adding in
something like core.refstorage, so that an implementation can say
"foobar ref storage? Never heard of it" and barf. Whether it's because
that implementation doesn't support "foobar", because it's an old
version that doesn't understand "foobar" yet, or because it was simply
built without "foobar" support.
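
The check described above might look roughly like this.  It is a sketch
under stated assumptions: "core.refstorage" is only the name suggested
here, the bumped format version is taken to be 1, "files" is a made-up
label for the traditional loose/packed scheme, and the config is passed
in as an already-parsed dict rather than read from $GIT_DIR/config.

```python
class UnsupportedRepository(Exception):
    """The repository uses something this implementation cannot handle."""


# Backends this (hypothetical) build knows about.
SUPPORTED_BACKENDS = frozenset({"files"})


def choose_ref_backend(config, supported=SUPPORTED_BACKENDS):
    """Return the ref backend name to use, or refuse to touch the repo."""
    version = int(config.get("core.repositoryformatversion", "0"))
    if version == 0:
        # Old-style repository: only the traditional scheme is possible.
        return "files"
    if version > 1:
        raise UnsupportedRepository(
            f"unknown repository format version {version}")
    backend = config.get("core.refstorage", "files")
    if backend not in supported:
        # Same outcome whether the backend is unknown, too new, or was
        # simply compiled out of this build.
        raise UnsupportedRepository(
            f"{backend} ref storage? Never heard of it")
    return backend
```

The point of failing early is that an implementation never locks or
writes a refs/ directory that the repository's real backend ignores.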

-Peff

* Re: [RFC/WIP] Pluggable reference backends
  2014-03-10 15:52   ` Jeff King
@ 2014-03-10 16:14     ` David Kastrup
  2014-03-10 16:28       ` David Lang
  2014-03-10 19:42       ` Jeff King
  2014-03-10 17:46     ` Junio C Hamano
  2014-03-10 21:07     ` Michael Haggerty
  2 siblings, 2 replies; 17+ messages in thread
From: David Kastrup @ 2014-03-10 16:14 UTC (permalink / raw)
  To: Jeff King
  Cc: Shawn Pearce, Michael Haggerty, git discussion list, Vicent Marti,
	Brad King, Johan Herland

Jeff King <peff@peff.net> writes:

> On Mon, Mar 10, 2014 at 07:30:45AM -0700, Shawn Pearce wrote:
>
>> > * Store references in a SQLite database, to get correct transaction
>> >   handling.
>> 
>> No to SQLite in git-core. Using it from JGit requires building
>> SQLite and a JNI wrapper, which makes JGit significantly less
>> portable. I know SQLite is pretty amazing, but implementing
>> compatibility with it from JGit will be a big nightmare for us.
>
> That seems like a poor reason not to implement a pluggable feature for
> git-core. If we implement it, then a site using only git-core can take
> advantage of it. Sites with JGit cannot, and would use a different
> pluggable storage mechanism that's supported by both. But if we don't
> implement it, it hurts people using only git-core, and it does not help
> sites using JGit at all.

Of course, the basic premise for this feature is "let's assume that our
file and/or operating system suck at providing file system functionality
at file name granularity".  There have been two historical approaches
to that problem that are not independent: a) use Linux b) kick Linus.

Option b) has been fairly successful over quite a bit of time, but at
the current point of time, it has become harder to aim that kick on a
single person and/or where it counts.

The database approach is an alternative approach based on kicking an
alternate set of people, namely database rather than operating system
providers, based on the assumption that the former have softer behinds
(the backend-based approach) making them more sensitive to kicking.

So the database approach is most promising on the "what are we going to
do if our operating system vendor won't bother with sensible file system
performance" angle.  Which isn't doing total system architecture a
favor.

Personally, I have little sympathy for helping subpar systems, keeping
them on life support while they are in turn trying to squish the better
systems.

But then it is not me doing the actual work, so this is no more than an
idle reflection.

-- 
David Kastrup

* Re: [RFC/WIP] Pluggable reference backends
  2014-03-10 16:14     ` David Kastrup
@ 2014-03-10 16:28       ` David Lang
  2014-03-10 19:42       ` Jeff King
  1 sibling, 0 replies; 17+ messages in thread
From: David Lang @ 2014-03-10 16:28 UTC (permalink / raw)
  To: David Kastrup
  Cc: Jeff King, Shawn Pearce, Michael Haggerty, git discussion list,
	Vicent Marti, Brad King, Johan Herland

On Mon, 10 Mar 2014, David Kastrup wrote:

> Jeff King <peff@peff.net> writes:
>
>> On Mon, Mar 10, 2014 at 07:30:45AM -0700, Shawn Pearce wrote:
>>
>>>> * Store references in a SQLite database, to get correct transaction
>>>>   handling.
>>>
>>> No to SQLite in git-core. Using it from JGit requires building
>>> SQLite and a JNI wrapper, which makes JGit significantly less
>>> portable. I know SQLite is pretty amazing, but implementing
>>> compatibility with it from JGit will be a big nightmare for us.
>>
>> That seems like a poor reason not to implement a pluggable feature for
>> git-core. If we implement it, then a site using only git-core can take
>> advantage of it. Sites with JGit cannot, and would use a different
>> pluggable storage mechanism that's supported by both. But if we don't
>> implement it, it hurts people using only git-core, and it does not help
>> sites using JGit at all.
>
> Of course, the basic premise for this feature is "let's assume that our
> file and/or operating system suck at providing file system functionality
> at file name granularity".  There have been two historical approaches
> to that problem that are not independent: a) use Linux b) kick Linus.

As a note, if this is done properly, it could allow for plugins that connect to
the underlying storage system (similar to the Facebook Mercurial change).

Even for those who don't have the $$$$$ storage arrays, there may be other
storage-specific hacks that can be done to detect that files haven't changed.

For example, if you use btrfs and compile into a different directory than your
source, you may be able to detect that things didn't change by the fact that the
filesystem didn't have to do a rewrite of the parent node.

David Lang

* Re: [RFC/WIP] Pluggable reference backends
  2014-03-10 15:52   ` Jeff King
  2014-03-10 16:14     ` David Kastrup
@ 2014-03-10 17:46     ` Junio C Hamano
  2014-03-10 17:56       ` Jeff King
  2014-03-10 21:07     ` Michael Haggerty
  2 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2014-03-10 17:46 UTC (permalink / raw)
  To: Jeff King
  Cc: Shawn Pearce, Michael Haggerty, git discussion list, Vicent Marti,
	Brad King, Johan Herland

Jeff King <peff@peff.net> writes:

> On Mon, Mar 10, 2014 at 07:30:45AM -0700, Shawn Pearce wrote:
>
>> > * Store references in a SQLite database, to get correct transaction
>> >   handling.
>> 
>> No to SQLite in git-core. Using it from JGit requires building
>> SQLite and a JNI wrapper, which makes JGit significantly less
>> portable. I know SQLite is pretty amazing, but implementing
>> compatibility with it from JGit will be a big nightmare for us.
>
> That seems like a poor reason not to implement a pluggable feature for
> git-core. If we implement it, then a site using only git-core can take
> advantage of it. Sites with JGit cannot, and would use a different
> pluggable storage mechanism that's supported by both. But if we don't
> implement it, it hurts people using only git-core, and it does not help
> sites using JGit at all.

We would need to eventually have at least one backend that we know
will play well with different Git implementations that matter
(namely, git-core, JGit and libgit2) before the feature can be
widely adopted.

The first backend that is used while the plugging-interface is in
development can be anything and does not have to be that eventual
ubiquitous one, however, as long as it is something that we do not
mind carrying forever, along with that final reference backend.  I
take the objection from Shawn only as against making sqlite that
final one.

* Re: [RFC/WIP] Pluggable reference backends
  2014-03-10 17:46     ` Junio C Hamano
@ 2014-03-10 17:56       ` Jeff King
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2014-03-10 17:56 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Shawn Pearce, Michael Haggerty, git discussion list, Vicent Marti,
	Brad King, Johan Herland

On Mon, Mar 10, 2014 at 10:46:01AM -0700, Junio C Hamano wrote:

> >> No to SQLite in git-core. Using it from JGit requires building
> >> SQLite and a JNI wrapper, which makes JGit significantly less
> >> portable. I know SQLite is pretty amazing, but implementing
> >> compatibility with it from JGit will be a big nightmare for us.
> >
> > That seems like a poor reason not to implement a pluggable feature for
> > git-core. If we implement it, then a site using only git-core can take
> > advantage of it. Sites with JGit cannot, and would use a different
> > pluggable storage mechanism that's supported by both. But if we don't
> > implement it, it hurts people using only git-core, and it does not help
> > sites using JGit at all.
>
> We would need to eventually have at least one backend that we know
> will play well with different Git implementations that matter
> (namely, git-core, JGit and libgit2) before the feature can be
> widely adopted.

I assumed that the current refs/ and logs/ code, massaged into pluggable
backend form, would be the first such. And I wouldn't be surprised to
see some iteration on that once it is easier to move from scheme to
scheme (e.g., to use some encoding of the names on the filesystem to
avoid D/F conflicts, and thus allow reflogs for deleted refs).

> The first backend that is used while the plugging-interface is in
> development can be anything and does not have to be that eventual
> ubiquitous one, however, as long as it is something that we do not
> mind carrying forever, along with that final reference backend.  I
> take the objection from Shawn only as against making sqlite that
> final one.

Sure, I'd agree with that. I'd think something like an sqlite interface
would be mainly of interest to people running busy servers. I don't know
that it would make a good default.

-Peff

* Re: [RFC/WIP] Pluggable reference backends
  2014-03-10 16:14     ` David Kastrup
  2014-03-10 16:28       ` David Lang
@ 2014-03-10 19:42       ` Jeff King
  2014-03-10 19:56         ` David Kastrup
  1 sibling, 1 reply; 17+ messages in thread
From: Jeff King @ 2014-03-10 19:42 UTC (permalink / raw)
  To: David Kastrup
  Cc: Shawn Pearce, Michael Haggerty, git discussion list, Vicent Marti,
	Brad King, Johan Herland

On Mon, Mar 10, 2014 at 05:14:02PM +0100, David Kastrup wrote:

> [storing refs in sqlite]
>
> Of course, the basic premise for this feature is "let's assume that our
> file and/or operating system suck at providing file system functionality
> at file name granularity".  There have been two historical approaches
> to that problem that are not independent: a) use Linux b) kick Linus.

You didn't define "suck" here, but there are a number of issues with the
current ref storage system. Here is a sampling:

  1. The filesystem does not present an atomic view of the data (e.g.,
     you read "a", then while you are reading "b", somebody else updates
     "a"; your view is one that never existed at any point in time).

  2. Using the filesystem creates D/F conflicts between branches "foo"
     and "foo/bar". Because this name is a primary key even for the
     reflogs, we cannot easily persist reflogs after the ref is removed.

  3. We use packed-refs in conjunction with loose ones to achieve
     reasonable performance when there are a large number of refs. The
     scheme for determining the current value of a ref is complicated
     and error-prone (we had several race conditions that caused real
     data loss).

Those things can be solved through better support from the filesystem.
But they were also solved decades ago by relational databases.
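
As a rough illustration of how a relational store sidesteps the three
issues listed above, here is a small sketch using Python's built-in
sqlite3 module.  The schema, table names and function names are all
assumptions made for this example, not a proposed format.

```python
import sqlite3

# Hypothetical schema: ref names are plain strings, so "foo" and
# "foo/bar" are just two different primary keys (no D/F conflict).
SCHEMA = """
CREATE TABLE refs (name TEXT PRIMARY KEY, value TEXT NOT NULL);
CREATE TABLE reflog (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    old_value TEXT,
    new_value TEXT
);
"""


def open_store(path=":memory:"):
    # isolation_level=None: we issue BEGIN/COMMIT/ROLLBACK ourselves.
    db = sqlite3.connect(path, isolation_level=None)
    db.executescript(SCHEMA)
    return db


def update_ref(db, name, new, expected_old):
    """Compare-and-swap one ref and log it; new=None deletes the ref."""
    db.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        row = db.execute(
            "SELECT value FROM refs WHERE name = ?", (name,)).fetchone()
        current = row[0] if row else None
        if current != expected_old:
            raise ValueError(f"{name}: expected {expected_old}, found {current}")
        if new is None:
            db.execute("DELETE FROM refs WHERE name = ?", (name,))
        else:
            db.execute(
                "INSERT OR REPLACE INTO refs (name, value) VALUES (?, ?)",
                (name, new))
        # The reflog row is keyed by name only, so it outlives the ref.
        db.execute(
            "INSERT INTO reflog (name, old_value, new_value) VALUES (?, ?, ?)",
            (name, current, new))
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise


def snapshot(db):
    """All refs as seen by a single SELECT, i.e. one consistent view."""
    return dict(db.execute("SELECT name, value FROM refs"))


def reflog(db, name):
    return db.execute(
        "SELECT old_value, new_value FROM reflog WHERE name = ? ORDER BY id",
        (name,)).fetchall()
```

There is no loose/packed split to reconcile here; whether this performs
acceptably for real repositories is exactly what would need experimenting.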

I generally avoid databases where possible. They lock your data up in a
binary format that you can't easily touch with standard unix tools. And
they introduce complexity and opportunity for bugs.

But they are also a proven technology for solving exactly the sorts of
problems that some people are having with git. I do not see a reason not
to consider them as an option for a pluggable refs system. But I also do
not see a reason to inflict their costs on people who do not have those
problems. And that is why Michael's email is about _pluggable_ ref
backends, and not "let's convert git to sqlite".

I do not even know if sqlite is going to end up as an interesting
option. But it will be nice to be able to experiment with it easily due
to git's ref code becoming more modular.

-Peff

* Re: [RFC/WIP] Pluggable reference backends
  2014-03-10 19:42       ` Jeff King
@ 2014-03-10 19:56         ` David Kastrup
  0 siblings, 0 replies; 17+ messages in thread
From: David Kastrup @ 2014-03-10 19:56 UTC (permalink / raw)
  To: Jeff King
  Cc: Shawn Pearce, Michael Haggerty, git discussion list, Vicent Marti,
	Brad King, Johan Herland

Jeff King <peff@peff.net> writes:

> On Mon, Mar 10, 2014 at 05:14:02PM +0100, David Kastrup wrote:
>
>> [storing refs in sqlite]
>>
>> Of course, the basic premise for this feature is "let's assume that our
>> file and/or operating system suck at providing file system functionality
>> at file name granularity".  There have been two historical approaches
>> to that problem that are not independent: a) use Linux b) kick Linus.
>
> You didn't define "suck" here, but there are a number of issues with the
> current ref storage system. Here is a sampling:
>
>   1. The filesystem does not present an atomic view of the data (e.g.,
>      you read "a", then while you are reading "b", somebody else updates
>      "a"; your view is one that never existed at any point in time).

If there are no system calls suitable for addressing this problem that
fundamentally concerns the use of the file system as a file-name
addressed data store, I don't see why "kick Linus" would not apply here.

>   2. Using the filesystem creates D/F conflicts between branches "foo"
>      and "foo/bar". Because this name is a primary key even for the
>      reflogs, we cannot easily persist reflogs after the ref is
>      removed.

That actually sounds more like "kick Junio" territory (the wonderful
times when "kick Linus" could achieve almost anything are over).  To
wit: this sounds like a design shortcoming in Git's use of filesystems,
not something that is actually inherent in the use of files.

>   3. We use packed-refs in conjunction with loose ones to achieve
>      reasonable performance when there are a large number of refs. The
>      scheme for determining the current value of a ref is complicated
>      and error-prone (we had several race conditions that caused real
>      data loss).

Again, that sounds like we are talking about a scenario that is not a
problem of files inherently but rather of Git's ways of managing them.

> Those things can be solved through better support from the filesystem.
> But they were also solved decades ago by relational databases.

Relational databases that are not implemented on raw storage managed by
database servers will still map their operations to file operations.

> But they are also a proven technology for solving exactly the sorts of
> problems that some people are having with git. I do not see a reason
> not to consider them as an option for a pluggable refs system.

But I think it would be wrong to try solving "2." above at the database
level when its actual problem lies with the reference->filename mapping
scheme.
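
As a rough illustration of that last point, the Python sketch below
first reproduces the D/F conflict with a direct name-to-path mapping and
then shows that a different mapping does not have it. The percent-encoded
flat layout is purely an assumption made up for this example (nobody in
this thread has proposed it), as are the names write_direct(),
write_flat() and demo().

```python
import os
import tempfile
from urllib.parse import quote


def write_direct(root, refname, sha):
    """Mimic loose refs: the ref name is used directly as a relative path."""
    path = os.path.join(root, *refname.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(sha + "\n")


def write_flat(root, refname, sha):
    """Hypothetical mapping: percent-encode the whole name into one file name."""
    os.makedirs(root, exist_ok=True)
    with open(os.path.join(root, quote(refname, safe="")), "w") as f:
        f.write(sha + "\n")


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        direct = os.path.join(tmp, "direct")
        flat = os.path.join(tmp, "flat")

        # Direct mapping: "foo" is a file, so "foo/bar" cannot be created.
        write_direct(direct, "refs/heads/foo", "a" * 40)
        try:
            write_direct(direct, "refs/heads/foo/bar", "b" * 40)
            conflict = False
        except OSError:
            conflict = True

        # Flat mapping: both names become ordinary files in one directory.
        write_flat(flat, "refs/heads/foo", "a" * 40)
        write_flat(flat, "refs/heads/foo/bar", "b" * 40)
        return conflict, sorted(os.listdir(flat))
```

A flat layout like this has costs of its own (one huge directory, no
cheap way to list a single namespace), so it is only meant to show that
the conflict comes from the mapping and not from using files as such.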

-- 
David Kastrup


* Re: [RFC/WIP] Pluggable reference backends
  2014-03-10 15:52   ` Jeff King
  2014-03-10 16:14     ` David Kastrup
  2014-03-10 17:46     ` Junio C Hamano
@ 2014-03-10 21:07     ` Michael Haggerty
  2014-03-11  2:39       ` Shawn Pearce
  2 siblings, 1 reply; 17+ messages in thread
From: Michael Haggerty @ 2014-03-10 21:07 UTC (permalink / raw)
  To: Jeff King
  Cc: Shawn Pearce, git discussion list, Vicent Marti, Brad King,
	Johan Herland

On 03/10/2014 04:52 PM, Jeff King wrote:
> On Mon, Mar 10, 2014 at 07:30:45AM -0700, Shawn Pearce wrote:
> 
>>> * Store references in a SQLite database, to get correct transaction
>>>   handling.
>>
>> No to SQLite in git-core. Using it from JGit requires building
>> SQLite and a JNI wrapper, which makes JGit significantly less
>> portable. I know SQLite is pretty amazing, but implementing
>> compatibility with it from JGit will be a big nightmare for us.
> 
> That seems like a poor reason not to implement a pluggable feature for
> git-core. If we implement it, then a site using only git-core can take
> advantage of it. Sites with JGit cannot, and would use a different
> pluggable storage mechanism that's supported by both. But if we don't
> implement, it hurts people using only git-core, and it does not help
> sites using JGit at all.

I think it's important to distinguish between two types of backend:

* Exotic backends, optimized for servers, or embedded systems, or other
controlled environments where the person deploying Git can decide about
the whole technology stack.  Here I say let a thousand flowers bloom.
If user A wants to try an Oracle backend and only uses JGit, there's no
need for him to implement the equivalent backend for git-core or libgit2.

* Mainstream backends, intended for use by end-users on their
workstations and notebooks.  Such backends will be pretty worthless if
they are not supported more or less universally, because one user will
want to use the command line and Eclipse, another Visual Studio and
TortoiseGit, a third will use GitHub for Mac plus a bunch of shell
scripts written by his IT department.  A backend that is not supported
by the big three Git implementations (git-core, libgit2, and JGit) will
probably be rejected by users.  Realistically there will be at most a
couple of mainstream backends--in fact probably usually a single
established one and occasionally a single next-generation one waiting
for people to migrate slowly to it.  For mainstream backends I think it
is important for the implementations to plan and coordinate ahead of
time to make sure everybody's concerns are addressed.

It sounds to me like Shawn is saying "please don't make a SQLite-based
backend the new default git-core backend" and Peff is saying "there is
no reason that a Git hosting service shouldn't experiment with a
SQLite-based backend".  I see no contradiction there [1].

Also, please remember that I'm not advocating a SQLite backend or any
other at this time.  I'm only refactoring code to open the way for
*future* flamefests :-)

Michael

[1] There might of course be a technical argument about whether a
SQLite-based backend would be SO AWESOME for end-users that switching to
it would be worth the extra inconvenience for the JGit folks.
Personally I'm skeptical.

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/


* Re: [RFC/WIP] Pluggable reference backends
  2014-03-10 21:07     ` Michael Haggerty
@ 2014-03-11  2:39       ` Shawn Pearce
  2014-03-12 10:26         ` egit vs. git behaviour (was: [RFC/WIP] Pluggable reference backends) Andreas Krey
  0 siblings, 1 reply; 17+ messages in thread
From: Shawn Pearce @ 2014-03-11  2:39 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Jeff King, git discussion list, Vicent Marti, Brad King,
	Johan Herland

On Mon, Mar 10, 2014 at 2:07 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On 03/10/2014 04:52 PM, Jeff King wrote:
>> On Mon, Mar 10, 2014 at 07:30:45AM -0700, Shawn Pearce wrote:
>>
>>>> * Store references in a SQLite database, to get correct transaction
>>>>   handling.
>>>
>>> No to SQLite in git-core. Using it from JGit requires building
>>> SQLite and a JNI wrapper, which makes JGit significantly less
>>> portable. I know SQLite is pretty amazing, but implementing
>>> compatibility with it from JGit will be a big nightmare for us.
>>
>> That seems like a poor reason not to implement a pluggable feature for
>> git-core. If we implement it, then a site using only git-core can take
>> advantage of it. Sites with JGit cannot, and would use a different
>> pluggable storage mechanism that's supported by both. But if we don't
>> implement, it hurts people using only git-core, and it does not help
>> sites using JGit at all.
>
> I think it's important to distinguish between two types of backend:
>
> * Exotic backends, optimized for servers, or embedded systems, or other
> controlled environments where the person deploying Git can decide about
> the whole technology stack.  Here I say let a thousand flowers bloom.
> If user A wants to try an Oracle backend and only uses JGit, there's no
> need for him to implement the equivalent backend for git-core or libgit2.

FWIW I have been running JGit derived servers using Google Bigtable
for reference storage for years. So yes in this sort of environment
let people do what they think is best for them.

> * Mainstream backends, intended for use by end-users on their
> workstations and notebooks.  Such backends will be pretty worthless if
> they are not supported more or less universally, because one user will
> want to use the command line and Eclipse, another Visual Studio and
> TortoiseGit, a third will use GitHub for Mac plus a bunch of shell
> scripts written by his IT department.  A backend that is not supported
> by the big three Git implementations (git-core, libgit2, and JGit) will
> probably be rejected by users.  Realistically there will be at most a
> couple of mainstream backends--in fact probably usually a single
> established one and occasionally a single next-generation one waiting
> for people to migrate slowly to it.  For mainstream backends I think it
> is important for the implementations to plan and coordinate ahead of
> time to make sure everybody's concerns are addressed.

Yes, this was my real concern. Eclipse users using EGit expect EGit to
be compatible with git-core at the filesystem level so they can do
something in EGit then switch to a shell and bang out a command, or
run a script provided by their project or co-worker. Build systems
often integrate with Git to e.g. embed `git describe` output into the
binary. In mainstream use cross compatibility of the tools within a
single working directory is something that I think users have come to
expect.

> It sounds to me like Shawn is saying "please don't make a SQLite-based
> backend the new default git-core backend" and Peff is saying "there is
> no reason that a Git hosting service shouldn't experiment with a
> SQLite-based backend".  I see no contradiction there [1].

Yes. :-)

> Also, please remember that I'm not advocating a SQLite backend or any
> other at this time.  I'm only refactoring code to open the way for
> *future* flamefests :-)
>
> Michael
>
> [1] There might of course be a technical argument about whether a
> SQLite-based backend would be SO AWESOME for end-users that switching to
> it would be worth the extra inconvenience for the JGit folks.
> Personally I'm skeptical.

If it was really that amazing, yes, we would probably support it in
JGit for those that need that amazing.

But I tend to think we can (usually) find a simpler format that would
provide many of the same benefits with fewer of the drawbacks of
locking the data up into SQLite's file format. I'm with Peff, I kind
of like the fact that most of the Git data is easy to inspect by hand,
or with some simple tools written in Git's source tree. Starting with
"go get this other SQLite tool first then write this code" is a lot
less fun.


* Re: [RFC/WIP] Pluggable reference backends
  2014-03-10 11:00 [RFC/WIP] Pluggable reference backends Michael Haggerty
  2014-03-10 11:44 ` Johan Herland
  2014-03-10 14:30 ` Shawn Pearce
@ 2014-03-11 10:56 ` Karsten Blees
  2014-03-12 11:43   ` Michael Haggerty
  2 siblings, 1 reply; 17+ messages in thread
From: Karsten Blees @ 2014-03-11 10:56 UTC (permalink / raw)
  To: Michael Haggerty, git discussion list
  Cc: Jeff King, Vicent Marti, Brad King, Johan Herland

Am 10.03.2014 12:00, schrieb Michael Haggerty:
> 
> Reference transactions
> ----------------------
> 

Very cool ideas indeed.

However, I'm concerned a bit that transactions are conceptual overkill. How many concurrent updates do you expect in a repository? Wouldn't a single repo-wide lock suffice (and be _much_ simpler to implement with any backend, esp. file-based)?

The API you posted in [1] doesn't look very much like a transaction API either (rather like batch-updates). E.g. there's no rollback, the queue* methods cannot report failure, and there's no way to read a ref as part of the transaction. So I'm afraid that backends that support transactions out of the box (e.g. RDBMSs) will be hard to adapt to this.

Just my 2cents,
Karsten

[1] http://article.gmane.org/gmane.comp.version-control.git/243748


* egit vs. git behaviour (was: [RFC/WIP] Pluggable reference backends)
  2014-03-11  2:39       ` Shawn Pearce
@ 2014-03-12 10:26         ` Andreas Krey
  2014-03-12 16:48           ` Shawn Pearce
  0 siblings, 1 reply; 17+ messages in thread
From: Andreas Krey @ 2014-03-12 10:26 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git discussion list

On Mon, 10 Mar 2014 19:39:00 +0000, Shawn Pearce wrote:
> Yes, this was my real concern. Eclipse users using EGit expect EGit to
> be compatible with git-core at the filesystem level so they can do
> something in EGit then switch to a shell and bang out a command, or
> run a script provided by their project or co-worker.

A question: Where to ask/report problems with that?

We're currently running into problems that egit doesn't push to where
git would when the local and remote branches aren't the same name. It
seems that egit ignores the branch.*.merge settings. Or push.default?

Andreas

-- 
"Totally trivial. Famous last words."
From: Linus Torvalds <torvalds@*.org>
Date: Fri, 22 Jan 2010 07:29:21 -0800


* Re: [RFC/WIP] Pluggable reference backends
  2014-03-11 10:56 ` [RFC/WIP] Pluggable reference backends Karsten Blees
@ 2014-03-12 11:43   ` Michael Haggerty
  0 siblings, 0 replies; 17+ messages in thread
From: Michael Haggerty @ 2014-03-12 11:43 UTC (permalink / raw)
  To: Karsten Blees
  Cc: git discussion list, Jeff King, Vicent Marti, Brad King,
	Johan Herland

Karsten,

Thanks for your feedback!

On 03/11/2014 11:56 AM, Karsten Blees wrote:
> Am 10.03.2014 12:00, schrieb Michael Haggerty:
>> 
>> Reference transactions
>> ----------------------
> 
> Very cool ideas indeed.
> 
> However, I'm concerned a bit that transactions are conceptual
> overkill. How many concurrent updates do you expect in a repository?
> Wouldn't a single repo-wide lock suffice (and be _much_ simpler to
> implement with any backend, esp. file-based)?

I am mostly thinking about long-running processes, like "gc" and
"prune-refs", which need to be made race-free without blocking other
processes for the whole time they are running (whereas it might be quite
tolerable to have them fail or only complete part of their work in any
given invocation).  Also, I work at GitHub, where we have quite a few
repositories, some of which are quite active :-)

Remember that I'm not yet proposing anything like hard-core ACID
reference transactions.  I'm just clearing the way for various possible
changes in reference handling.  I listed the ideas only to whet people's
appetites and motivate the refactoring, which will take a while before
it bears any real fruit.

> The API you posted in [1] doesn't look very much like a transaction
> API either (rather like batch-updates). E.g. there's no rollback, the
> queue* methods cannot report failure, and there's no way to read a
> ref as part of the transaction. So I'm afraid that backends that
> support transactions out of the box (e.g. RDBMSs) will be hard to
> adapt to this.

Gmane is down at the moment but I assume you are referring to my patch
series and the ref_transaction implementation therein.

No explicit rollback is necessary at this stage, because the "commit"
function first locks all of the references that it wants to change
(first verifying that they have the expected values), and then modifies
them all.  By the time the references are locked, the whole transaction
is guaranteed to succeed [1].  If the locks can't all be acquired, then
any locks that were obtained are released.
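
The lock-then-modify behaviour described above can be modelled in a few
lines of Python. This is only a sketch under simplifying assumptions:
refs live in an in-memory dict, a "lock" is just membership in a set, and
the names RefStore, RefTransaction, queue_update() and LockError are
invented for the example and do not correspond to the C functions in the
patch series.

```python
class LockError(Exception):
    """Raised when a ref cannot be locked or has an unexpected old value."""


class RefStore:
    def __init__(self, refs=None):
        self.refs = dict(refs or {})  # refname -> current value
        self.locked = set()           # refnames locked by some transaction


class RefTransaction:
    def __init__(self, store):
        self.store = store
        self.updates = []             # (refname, new_value, expected_old)

    def queue_update(self, refname, new, old):
        # Nothing is checked here; all verification happens in commit().
        self.updates.append((refname, new, old))

    def commit(self):
        held = []
        try:
            # Phase 1: lock every ref and verify its expected old value.
            # (A ref queued twice fails here too, as it is already locked.)
            for name, _new, old in self.updates:
                if name in self.store.locked:
                    raise LockError("cannot lock " + name)
                self.store.locked.add(name)
                held.append(name)
                if self.store.refs.get(name) != old:
                    raise LockError("unexpected old value for " + name)
            # Phase 2: all locks are held, so none of these writes can
            # fail because of contention with another transaction.
            for name, new, _old in self.updates:
                if new is None:
                    self.store.refs.pop(name, None)   # deletion
                else:
                    self.store.refs[name] = new
        finally:
            # Locks are released on success and on failure alike.
            for name in held:
                self.store.locked.discard(name)
```

In this model, discarding a RefTransaction without calling commit()
leaves the store untouched, which is why no explicit rollback step is
needed at this stage.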

If a caller wants to roll back a transaction, it only needs to free the
transaction instead of committing.  I should probably make that clearer
by renaming free_ref_transaction() to rollback_ref_transaction().  By
the time we start implementing other reference backends, that function
will of course have to do more.  For that matter, maybe
create_ref_transaction() should be renamed to begin_ref_transaction().
Now would be a good time for concrete bikeshedding suggestions about
function names or other details of the API :-)

Yes, the queue_*() methods should probably later make a preliminary
check of the reference's old value and return an error if the expected
value is already incorrect.  This would allow callers to fail fast if
the transaction is doomed to failure.  But that wasn't needed yet for
the one existing caller, which builds up a transaction and commits it
immediately, so I didn't implement it yet.  And the early checks would
add overhead for this caller, so maybe they should be optional anyway.
Maybe these functions should already be declared to return an error
status, but there should be an option passed to create_ref_transaction()
that selects whether fast checks should be performed or not for that
transaction.
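
A possible shape for such an option, again as a Python sketch and not as
a proposal for the C API: the fail_fast flag, the 0/-1 return
convention and the class below are all assumptions made for the sake of
the example, and locking is left out to keep it short.

```python
class RefTransaction:
    def __init__(self, refs, fail_fast=False):
        self.refs = refs            # shared dict: refname -> current value
        self.fail_fast = fail_fast  # do a preliminary check when queueing?
        self.updates = []           # (refname, new_value, expected_old)

    def queue_update(self, refname, new, old):
        """Return 0 on success, -1 if the early check finds a stale value."""
        if self.fail_fast and self.refs.get(refname) != old:
            return -1
        self.updates.append((refname, new, old))
        return 0

    def commit(self):
        """Return 0 on success, -1 if any expected old value is wrong."""
        # The early check is only advisory: the value may change between
        # queueing and committing, so it is always verified again here.
        for name, _new, old in self.updates:
            if self.refs.get(name) != old:
                return -1
        for name, new, _old in self.updates:
            self.refs[name] = new
        return 0
```

A caller that builds a transaction and commits it immediately would
leave fail_fast off and pay for only one check; a caller that queues
updates over a longer period could turn it on to find out early that the
transaction is doomed.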

Really, all that this first patch series does is put a different API
around the mechanism that was already there, in update_refs().  There
will be a lot more steps before we see anything approaching real
reference transactions.  But I think your (implied) suggestion, to make
the API more reminiscent of something like database transactions, is a
good one and I will work on it.

Cheers,
Michael

[1] "Guaranteed" here is of course relative.  The commit could still
fail due to the process being killed, disk errors, etc.  But it can't
fail due to lock contention with another git process.

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/


* Re: egit vs. git behaviour (was: [RFC/WIP] Pluggable reference backends)
  2014-03-12 10:26         ` egit vs. git behaviour (was: [RFC/WIP] Pluggable reference backends) Andreas Krey
@ 2014-03-12 16:48           ` Shawn Pearce
  0 siblings, 0 replies; 17+ messages in thread
From: Shawn Pearce @ 2014-03-12 16:48 UTC (permalink / raw)
  To: Andreas Krey; +Cc: git discussion list

On Wed, Mar 12, 2014 at 3:26 AM, Andreas Krey <a.krey@gmx.de> wrote:
> On Mon, 10 Mar 2014 19:39:00 +0000, Shawn Pearce wrote:
>> Yes, this was my real concern. Eclipse users using EGit expect EGit to
>> be compatible with git-core at the filesystem level so they can do
>> something in EGit then switch to a shell and bang out a command, or
>> run a script provided by their project or co-worker.
>
> A question: Where to ask/report problems with that?

EGit developers have a bug tracker, from:

  http://eclipse.org/egit/support/

There is a "File a bug" entry there with a link to:

  https://bugs.eclipse.org/bugs/enter_bug.cgi?product=EGit&rep_platform=All&op_sys=All

> We're currently running into problems that egit doesn't push to where
> git would when the local and remote branches aren't the same name. It
> seems that egit ignores the branch.*.merge settings. Or push.default?

I think this is just missing code in EGit. It's probable they already
know about it, or many of them don't use these features in .git/config
and thus don't realize they are missing.


end of thread, other threads:[~2014-03-12 16:48 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-10 11:00 [RFC/WIP] Pluggable reference backends Michael Haggerty
2014-03-10 11:44 ` Johan Herland
2014-03-10 14:30 ` Shawn Pearce
2014-03-10 15:51   ` Max Horn
2014-03-10 15:52   ` Jeff King
2014-03-10 16:14     ` David Kastrup
2014-03-10 16:28       ` David Lang
2014-03-10 19:42       ` Jeff King
2014-03-10 19:56         ` David Kastrup
2014-03-10 17:46     ` Junio C Hamano
2014-03-10 17:56       ` Jeff King
2014-03-10 21:07     ` Michael Haggerty
2014-03-11  2:39       ` Shawn Pearce
2014-03-12 10:26         ` egit vs. git behaviour (was: [RFC/WIP] Pluggable reference backends) Andreas Krey
2014-03-12 16:48           ` Shawn Pearce
2014-03-11 10:56 ` [RFC/WIP] Pluggable reference backends Karsten Blees
2014-03-12 11:43   ` Michael Haggerty
