git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* How to move forward with Reftable
@ 2020-05-07 18:51 Han-Wen Nienhuys
  2020-05-11 18:36 ` Jonathan Nieder
  0 siblings, 1 reply; 4+ messages in thread
From: Han-Wen Nienhuys @ 2020-05-07 18:51 UTC (permalink / raw)
  To: git

Hi all,

I have noticed that not everyone is happy with the reftable series as
it currently stands. I wanted to give some background on why it looks
the way it does today, I want to solicit feedback on where to go from
here, and share my worries about the negative feedback I’ve gotten so
far.

I manage the Gerrit team at Google. Gerrit uses refs extensively: we
have many repositories with millions of refs. To address the problems
this caused for Google, Shawn designed the reftable storage format,
and we have been running it in production since 2017. Unfortunately,
this doesn’t solve the problem for ‘normal’ deployments of Gerrit, and
this limits design choices that my team can make. To fix this, I have
implemented support for reftable on local file systems in JGit, and
want to make it a feature that is supported by core git too.

I found the implementation somewhat gnarly, which makes it an
interesting project. I thought it would be good for the ecosystem if
related projects such as libgit2 could use the same implementation, so
I wrote it as a library that is completely standalone (except for a
dependency on zlib). It was originally written in Go, translated to as
idiomatic C as I could manage. It is basically object-oriented with a
small amount of polymorphism, so it should translate relatively easily
to languages like Python.

As it currently stands, the library implements the specification
completely, including the parts that are currently probably not useful
for Git, like support for the object-ID/ref mapping. The API between
reftable and git is extremely narrow: it’s about 20 functions, and 5
structure definitions. It’s also well tested (see coverage [2]. As of
today, the tests are leak-free according to valgrind). This is also
why I don’t believe Johannes’ argument that, because he found one bug,
it must be full of bugs.

This approach has been successful, in the sense that the libgit2
project has experimented with integrating it, see
https://github.com/libgit2/libgit2/pull/5462. They seem happy at the
prospect of integrating existing code rather than reimplementing it.

Johannes had suggested that this should be developed and maintained in
git-core first, and the result could then be reused by libgit2
project. According to the libgit2 folks, this what that would look
like:

“””
    - It needs to be easy to split out from git-core. If it is
      self-contained in a single directory, then I'd be sufficiently
      happy already.

    - It should continue providing a clean interface to external
      callers. This also means that its interface should stay stable so
      we don't have to adapt on every update. git-core historically
      never had such promises, but it kind of worked out for the xdiff
      code.

    - My most important fear would be that the reftable interface
      becomes heavily dependent on git-core's own data types. We had
      this discussion at the Contributor's Summit already, but if it
      starts adopting things like git-core's own string buffer then it
      would become a lot harder for us to use it.

    - Probably obvious, but contained in the above is that it stays
      compilable on its own. So even if you split out its directory and
      wire up some build instructions, it should not have any
      dependencies on git-core.
”””

(for the discussion at the summit:
https://lore.kernel.org/git/1B71B54C-E000-4CEB-8AC6-3DB86E96E31A@jramsay.com.au/)

I can make that work, but it would be good to know if this is
something the project is OK with in principle, or whether the code
needs to be completely Git-ified. If the latter happens, that would
effectively fork the code, which I think is a missed opportunity.

I have received a lot of negative comments, and I can’t gauge well
what to make of them, but they kindle several worries:

* The git community decides they don’t need the object reverse index,
and do not write ‘o’ section in reftables, because that culls 1000
lines of code. This generally works, but will cause hard to debug
performance regressions if command-line git is used on a gerrit
server.

* The git community looks at this, and decides the standard is too
complex, and goes off to create a reftable v3.

* The git community asks me to do a ton of superficial work (eg. slice
-> strbuf), and then decides the overall design needs to be different,
and should be completely rewritten.

Jonathan Nieder said I shouldn’t worry about standards compliance,
because the Git project has already ratified the reftable standard,
and wouldn’t want to break JGit compatibility, but it would be good to
have the community leaders reaffirm this stance.

For my last worry, it would be good if someone would make an
assessment of the overall design to see if it is acceptable.

Once we have a path forward we can think of a way of integrating the
code. I think it may take a while to shake out bugs. Not bugs in the
reftable library itself, but Git is not very strict in enforcing
proper use of the ref backend abstraction (eg. pseudorefs [1]). Many
integration tests also don’t go through proper channels for accessing
refs.

cheers,


[1] https://lore.kernel.org/git/CAFQ2z_NZgkPE+3oazfb_m0_7TWxHjje1yYCc0bMZG05_KUKEow@mail.gmail.com/

[2] https://hanwen.home.xs4all.nl/public/software/reftable-coverage/

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--

Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich

Registergericht und -nummer: Hamburg, HRB 86891

Sitz der Gesellschaft: Hamburg

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: How to move forward with Reftable
  2020-05-07 18:51 How to move forward with Reftable Han-Wen Nienhuys
@ 2020-05-11 18:36 ` Jonathan Nieder
  2020-05-11 18:50   ` Junio C Hamano
  2020-05-12 23:37   ` brian m. carlson
  0 siblings, 2 replies; 4+ messages in thread
From: Jonathan Nieder @ 2020-05-11 18:36 UTC (permalink / raw)
  To: Han-Wen Nienhuys
  Cc: git, Jeff King, brian m. carlson, Johannes Schindelin,
	Michael Haggerty, Patrick Steinhardt

(+cc: a few people I'm particularly interested in feedback from)

Han-Wen Nienhuys wrote[3]:

> I have noticed that not everyone is happy with the reftable series as
> it currently stands. I wanted to give some background on why it looks
> the way it does today, I want to solicit feedback on where to go from
> here, and share my worries about the negative feedback I've gotten so
> far.
[...]
> As it currently stands, the library implements the specification
> completely, including the parts that are currently probably not useful
> for Git, like support for the object-ID/ref mapping. The API between
> reftable and git is extremely narrow: it's about 20 functions, and 5
> structure definitions. It's also well tested (see coverage [2]. As of
> today, the tests are leak-free according to valgrind). This is also
> why I don't believe Johannes' argument that, because he found one bug,
> it must be full of bugs.
>
> This approach has been successful, in the sense that the libgit2
> project has experimented with integrating it, see
> https://github.com/libgit2/libgit2/pull/5462. They seem happy at the
> prospect of integrating existing code rather than reimplementing it.

Separate from the integration aspect is that this is not yet
battle-tested code.  One benefit of sharing code is to be able to share
the benefit of users testing it.

Since the ref system is fairly modular and this is about a non-default
backend, it's likely okay to integrate it initially as "experimental"
and then update docs as we gain confidence.

> Johannes had suggested that this should be developed and maintained in
> git-core first, and the result could then be reused by libgit2
> project. According to the libgit2 folks, this what that would look
> like:
>
> """
>     - It needs to be easy to split out from git-core. If it is
>       self-contained in a single directory, then I'd be sufficiently
>       happy already.
>
>     - It should continue providing a clean interface to external
>       callers. This also means that its interface should stay stable so
>       we don't have to adapt on every update. git-core historically
>       never had such promises, but it kind of worked out for the xdiff
>       code.
>
>     - My most important fear would be that the reftable interface
>       becomes heavily dependent on git-core's own data types. We had
>       this discussion at the Contributor's Summit already, but if it
>       starts adopting things like git-core's own string buffer then it
>       would become a lot harder for us to use it.
>
>     - Probably obvious, but contained in the above is that it stays
>       compilable on its own. So even if you split out its directory and
>       wire up some build instructions, it should not have any
>       dependencies on git-core.
> """
>
> (for the discussion at the summit:
> https://lore.kernel.org/git/1B71B54C-E000-4CEB-8AC6-3DB86E96E31A@jramsay.com.au/)
>
> I can make that work, but it would be good to know if this is
> something the project is OK with in principle, or whether the code
> needs to be completely Git-ified. If the latter happens, that would
> effectively fork the code, which I think is a missed opportunity.

There's been some discussion about use of strbuf versus the module's
own growing-buffer "struct slice".  Is that the only instance of this
kind of infrastructure duplication or are there others?

> I have received a lot of negative comments, and I can't gauge well
> what to make of them, but they kindle several worries:
>
> * The git community decides they don't need the object reverse index,
>   and do not write 'o' section in reftables, because that culls 1000
>   lines of code. This generally works, but will cause hard to debug
>   performance regressions if command-line git is used on a gerrit
>   server.
>
> * The git community looks at this, and decides the standard is too
>   complex, and goes off to create a reftable v3.
>
> * The git community asks me to do a ton of superficial work (eg. slice
>   -> strbuf), and then decides the overall design needs to be different,
>   and should be completely rewritten.
>
> Jonathan Nieder said I shouldn't worry about standards compliance,
> because the Git project has already ratified the reftable standard,
> and wouldn't want to break JGit compatibility, but it would be good to
> have the community leaders reaffirm this stance.

Recording from today's IRC standup for easy reference later:

  <gitster> jrnieder: that might have been solved already by me declaring
  that whether they like it or not, reftable must be in git-core to match
  what JGit has.

> For my last worry, it would be good if someone would make an
> assessment of the overall design to see if it is acceptable.

Thanks.  I think Peff did a little in this area; help from others
would be welcome as well.

> Once we have a path forward we can think of a way of integrating the
> code. I think it may take a while to shake out bugs. Not bugs in the
> reftable library itself, but Git is not very strict in enforcing
> proper use of the ref backend abstraction (eg. pseudorefs [1]). Many
> integration tests also don't go through proper channels for accessing
> refs.

Yes, I'm *very* excited that the series includes a knob for running
the testsuite using the reftable ref backend, providing a way for
anyone interested to pitch in to help with the issues they reveal
(both in Git and in its testsuite).

Summary: thanks for writing this.  I personally think your proposal is
a reasonable one, but I'm looking forward to hearing from others as
well.  I'm also looking forward to discussing "struct slice" and the
overall architecture in the context of the patches[4] and making some
more progress.

Sincerely,
Jonathan

> cheers,
>
> [1] https://lore.kernel.org/git/CAFQ2z_NZgkPE+3oazfb_m0_7TWxHjje1yYCc0bMZG05_KUKEow@mail.gmail.com/
>
> [2] https://hanwen.home.xs4all.nl/public/software/reftable-coverage/

[3] https://lore.kernel.org/git/CAFQ2z_Ne-1TWMRS88ADhzQmx4OfzNEkVUJ1anjf57mQD4Gdywg@mail.gmail.com/

[4] https://lore.kernel.org/git/pull.539.v12.git.1588845585.gitgitgadget@gmail.com/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: How to move forward with Reftable
  2020-05-11 18:36 ` Jonathan Nieder
@ 2020-05-11 18:50   ` Junio C Hamano
  2020-05-12 23:37   ` brian m. carlson
  1 sibling, 0 replies; 4+ messages in thread
From: Junio C Hamano @ 2020-05-11 18:50 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Han-Wen Nienhuys, git, Jeff King, brian m. carlson,
	Johannes Schindelin, Michael Haggerty, Patrick Steinhardt

Jonathan Nieder <jrnieder@gmail.com> writes:

> (+cc: a few people I'm particularly interested in feedback from)
> ...
> There's been some discussion about use of strbuf versus the module's
> own growing-buffer "struct slice".  Is that the only instance of this
> kind of infrastructure duplication or are there others?

Yeah, that is a very important question.  If there are own APIs that
are similar to what we have but impedance mismatch can not be easily
abstracted away, that would be a huge problem.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: How to move forward with Reftable
  2020-05-11 18:36 ` Jonathan Nieder
  2020-05-11 18:50   ` Junio C Hamano
@ 2020-05-12 23:37   ` brian m. carlson
  1 sibling, 0 replies; 4+ messages in thread
From: brian m. carlson @ 2020-05-12 23:37 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Han-Wen Nienhuys, git, Jeff King, Johannes Schindelin,
	Michael Haggerty, Patrick Steinhardt

[-- Attachment #1: Type: text/plain, Size: 4786 bytes --]

On 2020-05-11 at 18:36:02, Jonathan Nieder wrote:
> Separate from the integration aspect is that this is not yet
> battle-tested code.  One benefit of sharing code is to be able to share
> the benefit of users testing it.
> 
> Since the ref system is fairly modular and this is about a non-default
> backend, it's likely okay to integrate it initially as "experimental"
> and then update docs as we gain confidence.

If we're going to integrate it, I would like to pass the testsuite when
we do.  We can certainly do a series of preparatory patches (e.g., to
the testsuite) to get it ready, but once people can turn it on or use it
(via config, I assume), it should be fully functional and tested.

Having said that, I agree we should mark it as experimental first, at
least for a while.  I'm interested to see how it works both from a
functionality perspective as well as a performance perspective.  For
example, is reftable a win with a relatively large number of refs (say,
tens of thousands)?  Operational experience will tell us that and help
guide us to figure out if and when it should be the default.

> > Johannes had suggested that this should be developed and maintained in
> > git-core first, and the result could then be reused by libgit2
> > project. According to the libgit2 folks, this what that would look
> > like:
> >
> > """
> >     - It needs to be easy to split out from git-core. If it is
> >       self-contained in a single directory, then I'd be sufficiently
> >       happy already.
> >
> >     - It should continue providing a clean interface to external
> >       callers. This also means that its interface should stay stable so
> >       we don't have to adapt on every update. git-core historically
> >       never had such promises, but it kind of worked out for the xdiff
> >       code.
> >
> >     - My most important fear would be that the reftable interface
> >       becomes heavily dependent on git-core's own data types. We had
> >       this discussion at the Contributor's Summit already, but if it
> >       starts adopting things like git-core's own string buffer then it
> >       would become a lot harder for us to use it.
> >
> >     - Probably obvious, but contained in the above is that it stays
> >       compilable on its own. So even if you split out its directory and
> >       wire up some build instructions, it should not have any
> >       dependencies on git-core.
> > """
> >
> > (for the discussion at the summit:
> > https://lore.kernel.org/git/1B71B54C-E000-4CEB-8AC6-3DB86E96E31A@jramsay.com.au/)
> >
> > I can make that work, but it would be good to know if this is
> > something the project is OK with in principle, or whether the code
> > needs to be completely Git-ified. If the latter happens, that would
> > effectively fork the code, which I think is a missed opportunity.
> 
> There's been some discussion about use of strbuf versus the module's
> own growing-buffer "struct slice".  Is that the only instance of this
> kind of infrastructure duplication or are there others?

There's duplication of the hash algorithm stuff.  I don't know what else
because I haven't taken an in-depth look at the code other than for
SHA-256 compatibility.  I think Dscho has more insight here.

In general, my view here is that if this is going to be a part of core
Git, then it should live in core Git and use our standard tooling.  If
this is going to be a logically independent shared library (like zlib)
or an optional feature that one can compile with or not, then it can
live outside of the tree as a separate project (and shared library) that
we link against.

We've already seen a bunch of compatibility pain from sha1dc, which has
a much smaller, more well-defined interface.  I'd like to not repeat
that behavior with the reftable code, especially since Git runs on a
wide variety of systems and has significant compatibility needs.

I also don't love the fact that we have an update script that overwrites
all of our changes with the upstream code when there are some of us who
have no intention of contributing to upstream (e.g., for CLA reasons).
Barring some way of addressing those concerns, I think we're going to
need to assume that updates require some sort of manual rebase work like
with other code that we import.

> Yes, I'm *very* excited that the series includes a knob for running
> the testsuite using the reftable ref backend, providing a way for
> anyone interested to pitch in to help with the issues they reveal
> (both in Git and in its testsuite).

I think this is a good feature to have and definitely the right way
forward.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-05-12 23:37 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-07 18:51 How to move forward with Reftable Han-Wen Nienhuys
2020-05-11 18:36 ` Jonathan Nieder
2020-05-11 18:50   ` Junio C Hamano
2020-05-12 23:37   ` brian m. carlson

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).