Re: Proposal/Discussion: Turning parts of Git into libraries

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Victoria Dye <vdye@github.com>
To: Emily Shaffer <nasamuffin@google.com>, Git List <git@vger.kernel.org>
Cc: Jonathan Nieder <jrn@google.com>,
	Jose Lopes <jabolopes@google.com>,
	Aleksandr Mikhailov <avmikhailov@google.com>
Subject: Re: Proposal/Discussion: Turning parts of Git into libraries
Date: Tue, 21 Feb 2023 17:44:35 -0800	[thread overview]
Message-ID: <86e7f589-7576-5ed3-3dc9-5dec9ca346eb@github.com> (raw)
In-Reply-To: <CAJoAoZ=Cig_kLocxKGax31sU7Xe4==BGzC__Bg2_pr7krNq6MA@mail.gmail.com>

Emily Shaffer wrote:
> Hi folks,
> 
> As I mentioned in standup this week[1], my colleagues and I at Google
> have become very interested in converting parts of Git into libraries
> usable by external programs. In other words, for some modules which
> already have clear boundaries inside of Git - like config.[ch],
> strbuf.[ch], etc. - we want to remove some implicit dependencies, like
> references to globals, and make explicit other dependencies, like
> references to other modules within Git. Eventually, we'd like both for
> an external program to use Git libraries within its own process, and
> for Git to be given an alternative implementation of a library it uses
> internally (like a plugin at runtime).

I see why you've linked these two objectives (there could be some shared
infrastructure between them), but they're ultimately separate concerns, with
separate needs & tradeoffs. By trying to design with a unified solution for
both, though, this proposal's solution may be broader than it needs to be
and comes with some serious developer experience ramifications.

For example, looking at the  "call libgit.so from other applications"
objective: one way I could imagine implementing this is by creating a stable
API wrapper around various hand-picked internal functions. If those internal
functions change (e.g., adding a function argument to handle a new flag in a
builtin), the API can remain stable (by setting a sane default) without
really limiting the contributor adding a new feature. 

However, my reading of this proposal (although correct me if I'm wrong) is
that you want to use that same external-facing API as a plugin interface for
Git. In that case, the contributor can't just add a new argument and
trivially modify the wrapper; they either break the API contract, or create
a new version of the API and continue to support the old one. While that's
technically doable, it's a pretty sizeable increase to the amount of "stuff"
contributors need to keep track of and limits the project's flexibility. 

> 
> This turned out pretty long-winded, so a quick summary before I dive in:
> 
> - We want to compile parts of Git as independent libraries
> - We want to do it by making incremental code quality improvements to Git
> - Let's avoid promising stability of the interfaces of those libraries
> - We think it'll let Git do cool stuff like unit tests and allowing
> purpose-built plugins
> - Hopefully by example we can convince the rest of the project to join
> in the effort
> 
> My team has spent the past year or so trying to make improvements to
> Git's behavior with submodules, and we found that the current
> structure of Git is quite tricky to work with: because Git doesn't
> execute on second repositories in the same process well, recursing
> into submodules typically involves spawning child processes, and
> piping new arguments through the helpers around those child processes,
> and then through Git's typical codepaths, is very tricky. After
> spending more than a year trying to make improvements, we have very
> little to show for it, largely as a result of the difficulty of
> passing information between superprojects and submodules.
> 
> It seems like being able to invoke parts of Git as a library, or Git
> being able to invoke custom libraries, does a lot of good for the Git
> project:
> 
> - Having clear, modular libraries makes it easy to find the code
> responsible for some behavior, or determine where to add something
> new.

This is more an issue of code & repository organization, and isn't really an
inherent outcome of lib-ifying Git. That said, like the cleanup you mention
later, I would not be opposed to a dedicated code reorganization proposal.

> - Operations recursing into submodules can be run in-process, and the
> callsite makes it very clear which repository context is being used;
> if we're not subprocessing to recurse, then we can lose the awkward
> git-submodule.sh -> builtin/submodule__helper.c -> top-level Git
> process codepath.

Interesting - I wasn't aware this was how submodules worked internally. Is
there a specific reason this can't be refactored to perform the recursion
in-process?

> - Being able to test libraries in isolation via unit tests or mocks
> speeds up determining the root cause of bugs.

The addition of unit tests is a project of its own, and one I don't believe
is intrinsically tied to this one. More on this later.

> - If we can swap out an entire library, or just a single function of
> one, it's easy to experiment with the entire codebase without sweeping
> changes.

What sort of "experiment" are you thinking of here? Anyone can make internal
changes to Git in their own clone of the repository, then build and test it.
Shipping a custom fork of Git doesn't seem any less complex than building a
plugin library, shipping that, and injecting it into an existing Git
install.

> 
> The ability to use Git as a library also makes it easier for other
> tooling to interact with a Git repository *correctly*. As an example,
> `repo` has a long history of abusing Git by directly manipulating the
> gitdir[2], but if it were written in a world where Git exists as
> easy-to-use libraries, it probably wouldn't have needed to, as it
> could have invoked Git directly or replaced the relevant modules with
> its own implementation. Both `repo`[3] and `git-gui[4]` have
> reimplemented logic from git.git. Other interfaces that cooperate with
> Git's filesystem storage layer, like `scalar` or `jj`[5], would be
> able to interop with a Git repository without having to reimplement
> custom logic or keep up with new Git changes.

I'd alternatively suggest that these use cases could be addressed with
better plumbing command support and better (public) documentation of our
on-disk data structures. In many cases, we're treating on-disk data
structures as external-facing APIs anyway - the index [1], packfiles [2],
etc. are versioned. We could definitely add documentation that's more 
directly targeted at external integrators.

[1] https://git-scm.com/docs/index-format
[2] https://git-scm.com/docs/gitformat-pack

> 
> Of course, there's a reason Google wants it, too. We've talked
> previously about wanting better integration between Git and something
> like a VFS; as we've experimented with it internally, we've found a
> couple of tricky areas:
> 
> - The VFS relies on running Git commands to determine the state of the
> repository. However, these Git commands interact with the gitdir or
> worktree, which is populated by the VFS. For example, parsing a
> .gitattributes or .gitmodules which is already stored in the VFS
> requires the VFS to provide a POSIX file handle, spawn a Git
> subprocess, populate other files needed by that subprocess (like
> .git/config), and finally collect the output stream of the subprocess.
> As you can imagine, this interaction of VFS -> Git -> VFS [-> Git]
> creates all sort of complications. The alternative is for the VFS to
> write its own parser (or use a library like libgit2; more on that
> later). But having a Git library means that a subset of Git
> functionality can happen in-process, and that filesystem access could
> be replaced by the VFS directly providing high-level objects or plain
> bytestreams.
> 
> - A user running `git status` in a directory controlled by the VFS
> will require the VFS to populate the entire (large) worktree - even
> though the VFS is sure that only one file has been modified. The
> closest we've come with an alternative is an orchestrated use of
> sparse-checkout - but modifying the sparse-checkout configs
> automatically in response to the user's filesystem operations takes us
> right back to the previous point. If Git could take a plugin and
> replacement for the object store that directly interacts with the VFS
> daemon, a layer of complexity would disappear and performance would
> improve.

Both of these points sound more like implementation details/quirks of the
VFS prototype you're building than hard blockers coming from Git. For
example, you mention 'git status' needing to populate a full worktree -
could it instead use a mechanism similar to FSMonitor to avoid scanning for
files you know are virtualized? While there'd likely be some Git changes
involved (some of which might be a bit plugin-y), they'd be much more
tightly scoped than full libification.

> The good news is that for practical near-term purposes, "libification"
> mostly means cleanups to the Git codebase, and continuing code health
> work that the project has already cared about doing:
> 
> - Removing references to global variables and instead piping them
> through arguments
> - Finding and fixing memory leaks, especially in widely-used low-level code
> - Reducing or removing `die()` invocations in low-level code, and
> instead reporting errors back to callers in a consistent way
> - Clarifying the scope and layering of existing modules, for example
> by moving single-use helpers from the shared module's scope into the
> single user's scope
> - Making module interfaces more consistent and easier to understand,
> including moving "private" functions out of headers and into source
> files and improving in-header documentation
> 
> Basically, if this effort turns out not to be fruitful as a whole, I'd
> like for us to still have left a positive impact on the codebase.

These cleanups/best practices are a great idea regardless of any potential
lib-ification. I'll be sure to keep them in mind in any future changes I
make or review.

> 
> In the longer term, if Git has libraries with easily-replaced
> dependencies, we get a few more benefits:
> 
> - Unit tests. We already have some in t/helper/, but if we can replace
> all the dependencies of a certain library with simple stubs, it's
> easier for us to write comprehensive unit tests, in addition to the
> work we already do introducing edge cases in bash integration tests.

This doesn't really follow as a direct consequence of making libgit
externally facing. AFAIK, there's nothing explicitly stopping us from
writing or integrating a unit test framework now.

Like the "make libgit public" vs. "allow for custom backends in Git", I
think this is a separate project with its own tradeoffs to evaluate.

> - If our users can use plugins to improve performance in specific
> scenarios (like a VFS-aware object store in the VFS case I cited
> above), then Git works better for them without having to adopt a
> different workflow, such as using an alternative tool or wrapper.

I'm curious as to what you want the user experience for this to look like.
How would Git know to dynamically load a plugin? How does a user configure
(or disable) plugins? Will a plugin need to replace all of the functions in
a given library, or would Git be able to "fall back" on its own internal
implementation?

> - An easy-to-understand modular codebase makes it easy for new
> contributors to start hacking and understand the consequences of their
> patch.

As noted earlier, I think improving the developer experience in this way is
independent of the development of external APIs.

> 
> Of course, we haven't maintained any guarantee about the consistency
> of our implementation between releases. I don't anticipate that we'll
> write the perfect library interface on our first try. So I hope that
> we can be very explicit about refusing to provide any compatibility
> guarantee whatsoever between versions for quite a long time. On
> Google's end, that's well-understood and accepted. As I understand,
> some other projects already use Git's codebase as a "library" by
> including it as a submodule and using the code directly[6]; even a
> breakable API seems like an improvement over that, too.

This hints at, but sidesteps, a really important aspect of the long-term
goals of this project - are you planning on having us start guaranteeing
consistency once there's an external-facing API available? 

If not, that sounds like an unpleasant user experience (and one prone to
Hyrum's Law [3] at that). 

If so, Git contributors will either be much more constrained in the
introduction of new features, or we'll end up with a mess of
backward-compatibility APIs.

[3] https://www.hyrumslaw.com/

> 
> So what's next? Naturally, I'm looking forward to a spirited
> discussion about this topic - I'd like to know which concerns haven't
> been addressed and figure out whether we can find a way around them,
> and generally build awareness of this effort with the community.
> 
> I'm also planning to send a proposal for a document full of "best
> practices" for turning Git code into libraries (and have quite a lot
> of discussion around that document, too). My hope is that we can use
> that document to help us during implementation as well as during
> review, and refine it over time as we learn more about what works and
> what doesn't. Having this kind of documentation will make it easy for
> others to join us in moving Git's codebase towards a clean set of
> libraries. I hope that, as a project, we can settle on some tenets
> that we all agree would make Git nicer.

I think the use of multiple libraries is part of the potentially suboptimal
balance between the "load Git's internals as a library" and "let Git use
plugins in place of its own implementations" goals. The former could leave a
user in dependency hell if there are lots of small libraries to load, while
the latter may work better with smaller-scoped libraries (to avoid needing
to implement more than is useful). And, the functionality that's useful to a
program invoking Git via library may not be the same as what someone would
want to replace via plugin. 

> 
> From the rest of my own team, we're planning on working first on some
> limited scope, low-level libraries so that we can all see how the
> process works. We're starting with strbuf.[ch] (as it's used
> everywhere with few or no dependencies and helps us guarantee string
> safety at API boundaries), config.[ch] (as many external tools are
> probably interested in parsing Git config formatted files directly),
> and a subset of operations related to the object store. These starting
> points are intended to have a small impact on the codebase and teach
> us a lot about logistics and best practices while doing these kinds of
> conversions.
> 
> After that, we're still hoping to target low-level libraries first - I
> certainly don't think it will make sense to ship a high-level `git
> commit` library in the near future, if ever - in the order that
> they're required from the VFS project we're working closely with. As
> far as I can tell right now, that's likely to cover object store and
> worktree access, as well as commit creation and pushing, but we'll see
> how planning shakes out over the next month or so. But Google's
> schedule should have no bearing on what others in the Git project feel
> is important to clean up and libify, and if there is interest in the
> rest of the project in converting other existing modules into
> libraries, my team and I are excited to participate in the review.

As a general note, this libification project - its justification,
milestones, organization, etc. - seems to be primarily driven by the needs
of a VFS that is entirely opaque to the Git community you're proposing to.
As it stands, reviewers are put in a position of needing to accept, without
much evidence, that this is the best (or only) possible approach for meeting
those needs. 

While I'm normally content with "scratch your own itch"-type changes, what
you're proposing is a major paradigm shift in how Git is written,
maintained, and used. I'm not comfortable accepting that level of impact
without at least being able to evaluate whether the problem can be solved
some other way.

> 
> Much, much later on, I'm expecting us to form a plan around allowing
> "plugins" - that is, replacing library functionality we use today with
> an alternative library, such as an object store relying on a
> distributed file store like S3. Making that work well will also likely
> involve us coming up with a solution for dependency injection, and to
> begin using vtables for some libraries. I'm hoping that we can figure
> out a way to do that that won't make the Git source ugly. Around this
> time, I think it will make sense to buy into unit tests even more and
> start using an approach like mocking to test various edge cases. And
> at some point, it's likely that we'll want to make the interfaces to
> various Git libraries consistent with each other, which would involve
> some large-scale but hopefully-mechanical refactors.

Implementing dependency injection (via vtable or otherwise) and unit test
mocking is a massive undertaking on its own, let alone everything else
described so far. You've outlined some clear, fairly unobtrusive first steps
to the overarching proposal, but there's a lot of detail missing from the
plans for later steps. While there's always risk of a project getting
derailed or blocked later on due to some unforeseen issue, that risk seems
particularly high here given the size & scope of all of these components. 

Intermediate milestones/goals, each of which is valuable on its own,
outlined in the proposal would help allay those fears (for me, at least).

> 
> I'm looking forward to the discussion!
> 

So, to summarize my thoughts into some (hopefully) actionable feedback:

- It's really important that the proposal presents a clear, concrete
  long-term vision for this project. 
  - What is the desired end state, in terms of what is built and installed
    with Git?
  - What will be expected of other Git contributors to support this design?
    What happens if someone wants to "break" an API?
  - What is the impact to users (including security implications)?
- The scope for what you've proposed is pretty huge (which comes with a lot
  of risk), but I think it could be broken into smaller, *independent*
  pieces. 
- It's hard to judge or suggest adjustments to this proposal without knowing
  the specific challenges you're facing with Git as it is.

Finally, thanks for sending this email & starting a discussion. It's an
interesting topic and I'm looking forward to seeing everyone's perspectives
on the matter.

>  - Emily
> 
> 1: https://colabti.org/irclogger/irclogger_log/git-devel?date=2023-02-13#l29
> 2: https://gerrit.googlesource.com/git-repo/+/refs/heads/main/docs/internal-fs-layout.md
> 3: https://gerrit.googlesource.com/git-repo/+/refs/heads/main/git_config.py
> 4: https://github.com/git/git/blob/master/git-gui/git-gui.sh#L305
> 5: https://github.com/martinvonz/jj
> 6: https://github.com/glandium/git-cinnabar

next prev parent reply	other threads:[~2023-02-22  1:44 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-17 21:12 Proposal/Discussion: Turning parts of Git into libraries Emily Shaffer
2023-02-17 21:21 ` brian m. carlson
2023-02-17 21:38   ` Emily Shaffer
2023-02-17 22:41     ` brian m. carlson
2023-02-17 22:49       ` Emily Shaffer
2023-02-22 19:34         ` Jeff King
2023-02-24 20:31           ` Emily Shaffer
2023-02-24 21:41             ` Jeff King
2023-02-24 22:59             ` Junio C Hamano
2023-02-17 22:04   ` rsbecker
2023-02-17 22:48     ` brian m. carlson
2023-02-17 22:57 ` Junio C Hamano
2023-02-18  1:59   ` demerphq
2023-02-18 10:36     ` Phillip Wood
2023-03-23 23:22       ` Felipe Contreras
2023-03-23 23:30         ` rsbecker
2023-03-23 23:34           ` Felipe Contreras
2023-03-23 23:42             ` rsbecker
2023-03-23 23:55               ` Felipe Contreras
2023-03-24 19:27                 ` rsbecker
2023-03-24 21:21                   ` Felipe Contreras
2023-03-24 22:06                     ` rsbecker
2023-03-24 22:29                       ` Felipe Contreras
2023-02-21 21:42   ` Emily Shaffer
2023-02-22  0:22     ` Junio C Hamano
2023-02-18  4:05 ` Elijah Newren
2023-02-21 22:06   ` Emily Shaffer
2023-02-22  8:23     ` Elijah Newren
2023-02-22 19:25     ` Jeff King
2023-02-21 19:09 ` Taylor Blau
2023-02-21 22:27   ` Emily Shaffer
2023-02-22  1:44 ` Victoria Dye [this message]
2023-02-25  1:48   ` Jonathan Tan
2023-02-22 14:55 ` Derrick Stolee
2023-02-24 21:06   ` Emily Shaffer
2023-03-23 23:37 ` Felipe Contreras
2023-03-23 23:44   ` rsbecker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=86e7f589-7576-5ed3-3dc9-5dec9ca346eb@github.com \
    --to=vdye@github.com \
    --cc=avmikhailov@google.com \
    --cc=git@vger.kernel.org \
    --cc=jabolopes@google.com \
    --cc=jrn@google.com \
    --cc=nasamuffin@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).