The design of refs backends, linked worktrees and submodules

git@vger.kernel.org mailing list mirror (one of many)

* The design of refs backends, linked worktrees and submodules
@ 2017-01-19 11:55 Duy Nguyen
  2017-01-19 13:30 ` Michael Haggerty
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Duy Nguyen @ 2017-01-19 11:55 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Git Mailing List

I've started working on fixing the "git gc" issue with multiple
worktrees, which brings me back to this. Just some thoughts. Comments
are really appreciated.

In the current code, files backend has special cases for both
submodules (explicitly) and linked worktrees (hidden behind git_path).
But if a backend has to handle this stuff, all future backends have to
too. Which does not sound great. Imagine we have "something" in
addition to worktrees and submodules in future, then all backends have
to learn about it.

So how about we move worktree and submodule support back to front-end?

The backend deals with refs, period. The file-based backend will be
given a directory where refs live and it works on that. Backends do
not use git_path(). Backends do not care about $GIT_DIR. Access to odb
(e.g. sha-1 validation) if needed is abstracted out via a set of
callbacks. This allows submodules to give access to submodule's
separate odb. And it's getting close to the "struct repository"
mentioned somewhere in refs "TODO" comments, even though we are
nowhere close to introducing that struct.
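
A minimal sketch of the idea above, written as a toy Python model rather
than git's actual C code. The class name FilesRefStore, its methods, and
the object_exists callback are all hypothetical; the only point is that
the store receives a directory and an odb callback and never consults
$GIT_DIR or git_path() itself.

```python
import os
import tempfile
from typing import Callable, Optional


class FilesRefStore:
    """Toy loose-ref store rooted at an arbitrary directory (hypothetical)."""

    def __init__(self, refs_root: str, object_exists: Callable[[str], bool]):
        # The store only knows the directory it was handed and a callback
        # for object lookups; it has no notion of worktrees or submodules.
        self.refs_root = refs_root
        self.object_exists = object_exists

    def _path(self, refname: str) -> str:
        return os.path.join(self.refs_root, *refname.split("/"))

    def update(self, refname: str, oid: str) -> None:
        # Object validation is delegated to whoever created the store.
        if not self.object_exists(oid):
            raise ValueError(f"unknown object {oid}")
        path = self._path(refname)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(oid + "\n")

    def read(self, refname: str) -> Optional[str]:
        try:
            with open(self._path(refname)) as f:
                return f.read().strip()
        except FileNotFoundError:
            return None


# The front-end decides which directory and which odb a store gets.
known_objects = {"a" * 40}
store = FilesRefStore(tempfile.mkdtemp(), known_objects.__contains__)
store.update("refs/heads/master", "a" * 40)
```

A submodule would get its own instance with a different directory and a
callback that looks into the submodule's odb.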

How does that sound? In particular, is there anything wrong, or
unrealistic, with that?

For that to work, I'll probably need to add a "composite" ref_store
that combines two file-based backends (for per-repo and per-worktree
refs) to represent a unified ref store. I think your work on ref
iterator makes way for that. A bit worried about transactions though,
because I think per-repo and per-worktree updates will be separated in
two transactions. But that's probably ok because future backends, like
lmdb, will have to go through the same route anyway.
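
A rough sketch of such a "composite" store, again as a toy Python model.
DictRefStore, CompositeRefStore and is_per_worktree are hypothetical
names, and the rule used here (HEAD and refs/bisect/ are per-worktree) is
a simplifying assumption for illustration.

```python
import heapq
from typing import Dict, Iterator, Optional, Tuple


class DictRefStore:
    """In-memory stand-in for one backend (hypothetical)."""

    def __init__(self) -> None:
        self.refs: Dict[str, str] = {}

    def read(self, refname: str) -> Optional[str]:
        return self.refs.get(refname)

    def update(self, refname: str, oid: str) -> None:
        self.refs[refname] = oid

    def iterate(self) -> Iterator[Tuple[str, str]]:
        return iter(sorted(self.refs.items()))


def is_per_worktree(refname: str) -> bool:
    # Assumed rule for this sketch only.
    return refname == "HEAD" or refname.startswith("refs/bisect/")


class CompositeRefStore:
    """Routes each ref to one of two stores and merges their iteration."""

    def __init__(self, shared: DictRefStore, per_worktree: DictRefStore):
        self.shared = shared
        self.per_worktree = per_worktree

    def _pick(self, refname: str) -> DictRefStore:
        return self.per_worktree if is_per_worktree(refname) else self.shared

    def read(self, refname: str) -> Optional[str]:
        return self._pick(refname).read(refname)

    def update(self, refname: str, oid: str) -> None:
        self._pick(refname).update(refname, oid)

    def iterate(self) -> Iterator[Tuple[str, str]]:
        # Both inputs are sorted by refname, so a merge keeps the order.
        return heapq.merge(self.shared.iterate(), self.per_worktree.iterate())


shared, per_wt = DictRefStore(), DictRefStore()
composite = CompositeRefStore(shared, per_wt)
composite.update("refs/heads/master", "1" * 40)
composite.update("refs/bisect/bad", "2" * 40)
```

Note that each update lands in exactly one underlying store, which is why
a single logical transaction ends up split in two.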
-- 
Duy


* Re: The design of refs backends, linked worktrees and submodules
  2017-01-19 11:55 The design of refs backends, linked worktrees and submodules Duy Nguyen
@ 2017-01-19 13:30 ` Michael Haggerty
  2017-01-19 20:04   ` Johannes Schindelin
  2017-01-20 11:22   ` Duy Nguyen
  2017-01-19 19:44 ` Junio C Hamano
  2017-02-07 15:07 ` Duy Nguyen
  2 siblings, 2 replies; 6+ messages in thread
From: Michael Haggerty @ 2017-01-19 13:30 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List

On 01/19/2017 12:55 PM, Duy Nguyen wrote:
> I've started working on fixing the "git gc" issue with multiple
> worktrees, which brings me back to this. Just some thoughts. Comments
> are really appreciated.
> 
> In the current code, files backend has special cases for both
> submodules (explicitly) and linked worktrees (hidden behind git_path).

There is another terrible hack also needed to implement linked
worktrees, namely that the `refs/bisect/` hierarchy is manually inserted
into the `ref_cache`, because otherwise it wouldn't be noticed when
iterating over loose references via `readdir()`.

Other similar hacks would be required if other reference subtrees are
declared to be per-worktree.

> But if a backend has to handle this stuff, all future backends have to
> too. Which does not sound great. Imagine we have "something" in
> addition to worktrees and submodules in future, then all backends have
> to learn about it.

Agreed, the status quo is not pretty.

I kind of think that it would have been a better design to store the
references for all linked worktrees in the main repository's ref-store.
For example, the "bisect" refs for a worktree named "<name>" could have
been stored under "refs/worktrees/<name>/bisect/*". Then either:

* teach the associated tools to read/write references there directly
(probably with DWIM rules to make command-line use easier), or
* treat these references as if they were actually at a standard place
like `refs/worktree/bisect/*`; i.e., users would need to know that they
were per-worktree references, but wouldn't need to worry about the true
locations, or
* treat these references as if they were actually in their traditional
locations (though it is not obvious how this scheme could be expanded to
cover new per-worktree references).
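
The second option could be sketched as a pure name translation. This is a
toy Python illustration; the function names are hypothetical, and the
"refs/worktree/" and "refs/worktrees/<name>/" prefixes are simply the
ones used in the examples above.

```python
VIEW_PREFIX = "refs/worktree/"


def to_storage_name(refname: str, worktree: str) -> str:
    """Map the name a user types to where the ref would really live."""
    if refname.startswith(VIEW_PREFIX):
        return f"refs/worktrees/{worktree}/{refname[len(VIEW_PREFIX):]}"
    return refname  # shared refs are stored under their own name


def to_view_name(stored: str, worktree: str) -> str:
    """Inverse mapping, as seen from inside the given worktree."""
    own_prefix = f"refs/worktrees/{worktree}/"
    if stored.startswith(own_prefix):
        return VIEW_PREFIX + stored[len(own_prefix):]
    return stored
```

The third option would use the same translation with "refs/bisect/" as
the visible prefix, which is why it is harder to extend: every new
per-worktree subtree would need its own special case.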

> So how about we move worktree and submodule support back to front-end?
> 
> The backend deals with refs, period. The file-based backend will be
> given a directory where refs live and it works on that. Backends do
> not use git_path(). Backends do not care about $GIT_DIR. Access to odb
> (e.g. sha-1 validation) if needed is abstracted out via a set of
> callbacks. This allows submodules to give access to submodule's
> separate odb. And it's getting close to the "struct repository"
> mentioned somewhere in refs "TODO" comments, even though we are
> nowhere close to introducing that struct.

This is a topic that I have thought a lot about. I definitely like this
direction. In fact I've dabbled around with some first steps; see branch
`submodule-hash` in my fork on GitHub [1]. That branch associates a
`ref_store` more closely with the directory where the references are
stored, as opposed to having a 1:1 relationship between `ref_store`s and
submodules.

I would like to see the separation of a concept "iterate over all
reachability roots" that is independent of other ref iteration. Then it
wouldn't have to include reference names, except basically for use in
error messages. So for linked worktrees, in place of the reference name
it might emit a string like "<worktree>:<refname>". (Of course it would
get its information by iterating over all of the linked reference stores
using their reference iteration APIs.)
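
As a toy Python illustration of that separation (the function name and
the dict-based stores are hypothetical stand-ins for real ref stores and
their iteration APIs):

```python
from typing import Dict, Iterator, Tuple

RefMap = Dict[str, str]  # refname -> object id


def iter_reachability_roots(
    main_refs: RefMap, worktree_refs: Dict[str, RefMap]
) -> Iterator[Tuple[str, str]]:
    """Yield (label, oid) for every root; labels are only for messages."""
    for refname, oid in sorted(main_refs.items()):
        yield refname, oid
    for worktree in sorted(worktree_refs):
        for refname, oid in sorted(worktree_refs[worktree].items()):
            # The label is not a valid refname on purpose.
            yield f"{worktree}:{refname}", oid


roots = list(
    iter_reachability_roots(
        {"refs/heads/master": "1" * 40},
        {"foobar": {"HEAD": "2" * 40, "refs/bisect/bad": "3" * 40}},
    )
)
```

A caller such as gc would only consume the object ids; the labels would
show up in error messages and nowhere else.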

> For that to work, I'll probably need to add a "composite" ref_store
> that combines two file-based backends (for per-repo and per-worktree
> refs) to represent a unified ref store. I think your work on ref
> iterator makes way for that. A bit worried about transactions though,
> because I think per-repo and per-worktree updates will be separated in
> two transactions. But that's probably ok because future backends, like
> lmdb, will have to go through the same route anyway.

Yes, that was the main motivation for the ref-iterator work.

Regarding transactions, the commit pointed at by branch
`split-transaction` in my fork shows how I think the
`transaction_commit()` method could be split into two parts,
`transaction_prepare()` and `transaction_finish()`. The idea would be
that the driver function, `ref_transaction_commit()`, calls
`transaction_prepare()` for each `ref_store` involved in the
transaction, passing each one the reference updates for references that
live in that reference store. Those methods would verify that the part
of the transaction that lives in that ref-store "should" go through,
without actually committing anything. Then `transaction_finish()` would
be called on each ref store, and that method would commit the updates.
You probably couldn't get a bulletproof kind of compound transaction out
of this (e.g., if the computer's power goes out, one `ref_store`'s
updates might be committed but another one's not). But it would probably
be good enough to cover everyday reasons for transaction failures, like
pre-checksums not matching up.
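
The prepare/finish split can be modelled in a few lines. This is a toy
Python sketch, not the code on the `split-transaction` branch: the class
and exception names are hypothetical, locking is left out entirely, and
it assumes "pre-checksums" means the expected old values of the refs.

```python
from typing import Dict, List, Optional, Tuple

# (refname, expected old value or None if the ref must not exist, new value)
Update = Tuple[str, Optional[str], str]


class TransactionError(Exception):
    pass


class RefStore:
    def __init__(self, refs: Optional[Dict[str, str]] = None) -> None:
        self.refs: Dict[str, str] = dict(refs or {})

    def transaction_prepare(self, updates: List[Update]) -> None:
        # Verify only; nothing is written in this phase.
        for refname, old, _new in updates:
            if self.refs.get(refname) != old:
                raise TransactionError(f"{refname}: old value mismatch")

    def transaction_finish(self, updates: List[Update]) -> None:
        for refname, _old, new in updates:
            self.refs[refname] = new


def ref_transaction_commit(batches: List[Tuple[RefStore, List[Update]]]) -> None:
    """Driver: prepare every store first, then finish every store."""
    for store, updates in batches:
        store.transaction_prepare(updates)
    for store, updates in batches:
        store.transaction_finish(updates)
```

If any prepare step raises, no store has been modified yet. A crash
between two finish calls can still leave the stores out of step, which
is the non-bulletproof case mentioned above.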

Let me braindump some more information about this topic. A files backend
for a repository (ignoring submodules for the moment) currently consists
of five interacting parts, each of which looks a lot like a ref-store
itself:

* A loose reference ref-store for the main repo
* A loose reference ref-store for the per-worktree references
* A ref_cache in front of the two loose reference stores
* A packed ref-store
* A second ref_cache in front of the packed ref-store

But these ref-stores are currently coupled very tightly and have
peculiarities:

* The caching code is tightly coupled to the ref-store behind it.
* It is hard to imagine a packed refs-store that doesn't have some kind
of caching in front of it.
* There are tricky ordering constraints between writes to packed and
loose references to avoid races.
* The packed ref-store can't store symbolic refs, nor can it store
reflogs. It currently relies on the loose ref-store for those things.
* There is no packed-refs ref-store associated with per-worktree refs.
* Packed references are currently locked via `*.lock` files located near
the corresponding loose references.
* There are constraints that span refstores. For example, you aren't
allowed to create a packed ref that D/F conflicts with a loose ref or
vice versa.
* Symrefs, which are loose, can point at packed references.

I've taken some stabs at picking these apart into separate ref stores,
but haven't had time to make very satisfying progress. By the time of
GitMerge I might have a better feeling for whether I can devote some
time to this project.

Michael

[1] https://github.com/mhagger/git


* Re: The design of refs backends, linked worktrees and submodules
  2017-01-19 11:55 The design of refs backends, linked worktrees and submodules Duy Nguyen
  2017-01-19 13:30 ` Michael Haggerty
@ 2017-01-19 19:44 ` Junio C Hamano
  2017-02-07 15:07 ` Duy Nguyen
  2 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2017-01-19 19:44 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Michael Haggerty, Git Mailing List

Duy Nguyen <pclouds@gmail.com> writes:

> ... A bit worried about transactions though,
> because I think per-repo and per-worktree updates will be separated in
> two transactions. But that's probably ok because future backends, like
> lmdb, will have to go through the same route anyway.

If we have per-worktree ref-store, what can be done as a useful
thing by future backends like lmdb may be to use the same single
database to store shared and per-worktree refs, similar to the way
Michael mentioned in his response to your message I am responding
to, i.e.

    I kind of think that it would have been a better design to store the
    references for all linked worktrees in the main repository's ref-store.
    For example, the "bisect" refs for a worktree named "<name>" could have
    been stored under "refs/worktrees/<name>/bisect/*".

The current design is heavily influenced by how "contrib/workdir"
lays things out, for the latter of which I take the blame X-<, but
perhaps the files backend can be retrofitted to use that layout in
the longer term?


* Re: The design of refs backends, linked worktrees and submodules
  2017-01-19 13:30 ` Michael Haggerty
@ 2017-01-19 20:04   ` Johannes Schindelin
  2017-01-20 11:22   ` Duy Nguyen
  1 sibling, 0 replies; 6+ messages in thread
From: Johannes Schindelin @ 2017-01-19 20:04 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Duy Nguyen, Git Mailing List

Hi,

On Thu, 19 Jan 2017, Michael Haggerty wrote:

> On 01/19/2017 12:55 PM, Duy Nguyen wrote:
> > I've started working on fixing the "git gc" issue with multiple
> > worktrees, which brings me back to this. Just some thoughts. Comments
> > are really appreciated.
> > 
> > In the current code, files backend has special cases for both
> > submodules (explicitly) and linked worktrees (hidden behind git_path).
> 
> There is another terrible hack also needed to implement linked
> worktrees, namely that the `refs/bisect/` hierarchy is manually inserted
> into the `ref_cache`, because otherwise it wouldn't be noticed when
> iterating over loose references via `readdir()`.
> 
> Other similar hacks would be required if other reference subtrees are
> declared to be per-worktree.
> 
> > But if a backend has to handle this stuff, all future backends have to
> > too. Which does not sound great. Imagine we have "something" in
> > addition to worktrees and submodules in future, then all backends have
> > to learn about it.
> 
> Agreed, the status quo is not pretty.
> 
> I kind of think that it would have been a better design to store the
> references for all linked worktrees in the main repository's ref-store.
> For example, the "bisect" refs for a worktree named "<name>" could have
> been stored under "refs/worktrees/<name>/bisect/*".

That strikes me as a good design, indeed. It addresses very explicitly the
root cause of the worktree problems: Git's source code was developed for
years with the assumption that there is a 1:1 mapping between ref names
and SHA-1s in each repository, and the way worktrees were implemented
broke that assumption.

So introducing a new refs/ namespace -- that is visible to all other
worktrees -- would have addressed that problem.

This, BTW, is related to my concerns about introducing a "shadow" config
layer for worktrees: I still think it would be a bad idea, and very likely
to cause regressions in surprising ways, to allow such config "overlays"
per-worktree, as Git's current code's assumption is that there is only one
config per repository, and that it can, say, set one config setting to
match another (which in the per-worktree case would possibly hold true in
only one worktree). Instead, introducing a new "namespace" in the
(single) config similar to refs/worktrees/<name> could address that
problem preemptively.

Ciao,
Johannes


* Re: The design of refs backends, linked worktrees and submodules
  2017-01-19 13:30 ` Michael Haggerty
  2017-01-19 20:04   ` Johannes Schindelin
@ 2017-01-20 11:22   ` Duy Nguyen
  1 sibling, 0 replies; 6+ messages in thread
From: Duy Nguyen @ 2017-01-20 11:22 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Git Mailing List

On Thu, Jan 19, 2017 at 8:30 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> I kind of think that it would have been a better design to store the
> references for all linked worktrees in the main repository's ref-store.
> For example, the "bisect" refs for a worktree named "<name>" could have
> been stored under "refs/worktrees/<name>/bisect/*". Then either:
>
> * teach the associated tools to read/write references there directly
> (probably with DWIM rules to make command-line use easier), or
> * treat these references as if they were actually at a standard place
> like `refs/worktree/bisect/*`; i.e., users would need to know that they
> were per-worktree references, but wouldn't need to worry about the true
> locations, or
> * treat these references as if they were actually in their traditional
> locations (though it is not obvious how this scheme could be expanded to
> cover new per-worktree references).

Well. In one direction, we store everything at one place and construct
different slices of view of the unified store. On the other far end,
we have plenty of one-purpose stores, then combine them as we need.
It's probably personal taste, but I prefer the latter.

Making a single big store could bring us closer to the "big number"
problem. Yeah we will have to handle millions of refs anyway, someday.
That does not mean we're free to increase the number of refs a few
more times. Then there are separate stores by nature like submodules
(caveat: I haven't checked out your submodule-hash branch), or the
problem with multiple repos sharing objects/info/alternates.

> This is a topic that I have thought a lot about. I definitely like this
> direction. In fact I've dabbled around with some first steps; see branch
> `submodule-hash` in my fork on GitHub [1]. That branch associates a
> `ref_store` more closely with the directory where the references are
> stored, as opposed to having a 1:1 relationship between `ref_store`s and
> submodules.

Thanks. Will check it out.

> Let me braindump some more information about this topic.
> ...

Juicy stuff :D It's hard to know these without staring really long and
hard at refs code. Thank you.

> I've taken some stabs at picking these apart into separate ref stores,
> but haven't had time to make very satisfying progress. By the time of
> GitMerge I might have a better feeling for whether I can devote some
> time to this project.

I think sending WIP patches to the list from time to time is also
helpful, even if they're not perfect. For one thing I would know what
you were doing (or thinking at least, which also counts) and we would
not be stepping on each other's toes. On my part I'm not attempting to
make any more changes (*) until after I've read your branches.

(*) I took git_path() out of refs code and was surprised that multi
worktree broke. Silly me. Wrong first step.
-- 
Duy


* Re: The design of refs backends, linked worktrees and submodules
  2017-01-19 11:55 The design of refs backends, linked worktrees and submodules Duy Nguyen
  2017-01-19 13:30 ` Michael Haggerty
  2017-01-19 19:44 ` Junio C Hamano
@ 2017-02-07 15:07 ` Duy Nguyen
  2 siblings, 0 replies; 6+ messages in thread
From: Duy Nguyen @ 2017-02-07 15:07 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Git Mailing List

On Thu, Jan 19, 2017 at 6:55 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> I've started working on fixing the "git gc" issue with multiple
> worktrees, which brings me back to this. Just some thoughts. Comments
> are really appreciated.
>
> In the current code, files backend has special cases for both
> submodules (explicitly) and linked worktrees (hidden behind git_path).

It just occurred to me that the refs directory structure of a
linked worktree is exactly like the one in a normal single-worktree
setup, minus the shared (or packed) refs. So the "files" refs backend
can just see this "per-worktree only" refs directory as a remote refs
storage, which is just another name for "submodule".

So, we could just use the exact same submodule code path in refs to
create a per-worktree refs storage. Doing it this way, the files backend
does not need to learn about linked worktrees at all. To retrieve a
per-worktree refs storage, we do
"get_ref_store(".git/worktrees/foobar")". To get all per-worktree refs,
do for_each_ref_submodule(".git/worktrees/foobar", ...).
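
A toy Python model of that lookup, to show the shape of the idea only.
The names get_ref_store and for_each_ref are borrowed from the text above
but the signatures here are hypothetical, and this is not git's C API.
The store is keyed purely by directory and has no idea whether that
directory belongs to a submodule or a linked worktree.

```python
import os
from typing import Dict, List, Tuple

_ref_stores: Dict[str, "DirRefStore"] = {}


class DirRefStore:
    """Loose refs found under <gitdir>/refs (hypothetical toy)."""

    def __init__(self, gitdir: str) -> None:
        self.gitdir = gitdir

    def for_each_ref(self) -> List[Tuple[str, str]]:
        found = []
        refs_dir = os.path.join(self.gitdir, "refs")
        for dirpath, _dirs, files in os.walk(refs_dir):
            for fname in files:
                full = os.path.join(dirpath, fname)
                # Ref names always use "/" regardless of platform.
                refname = os.path.relpath(full, self.gitdir).replace(os.sep, "/")
                with open(full) as f:
                    found.append((refname, f.read().strip()))
        return sorted(found)


def get_ref_store(gitdir: str) -> DirRefStore:
    """Return the one store for this directory, creating it on first use."""
    key = os.path.abspath(gitdir)
    if key not in _ref_stores:
        _ref_stores[key] = DirRefStore(key)
    return _ref_stores[key]
```

Under this model, get_ref_store(".git/worktrees/foobar").for_each_ref()
would list only the refs that live in that worktree's own directory.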

Does it make sense? Should we go this way?
-- 
Duy
