Proposal/Discussion: Turning parts of Git into libraries

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Proposal/Discussion: Turning parts of Git into libraries
@ 2023-02-17 21:12 Emily Shaffer
  2023-02-17 21:21 ` brian m. carlson
                   ` (6 more replies)
  0 siblings, 7 replies; 37+ messages in thread
From: Emily Shaffer @ 2023-02-17 21:12 UTC (permalink / raw)
  To: Git List; +Cc: Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

Hi folks,

As I mentioned in standup this week[1], my colleagues and I at Google
have become very interested in converting parts of Git into libraries
usable by external programs. In other words, for some modules which
already have clear boundaries inside of Git - like config.[ch],
strbuf.[ch], etc. - we want to remove some implicit dependencies, like
references to globals, and make explicit other dependencies, like
references to other modules within Git. Eventually, we'd like both for
an external program to use Git libraries within its own process, and
for Git to be given an alternative implementation of a library it uses
internally (like a plugin at runtime).

This turned out pretty long-winded, so a quick summary before I dive in:

- We want to compile parts of Git as independent libraries
- We want to do it by making incremental code quality improvements to Git
- Let's avoid promising stability of the interfaces of those libraries
- We think it'll let Git do cool stuff like unit tests and allowing
purpose-built plugins
- Hopefully by example we can convince the rest of the project to join
in the effort

My team has spent the past year or so trying to make improvements to
Git's behavior with submodules, and we found that the current
structure of Git is quite tricky to work with: because Git doesn't
execute on second repositories in the same process well, recursing
into submodules typically involves spawning child processes, and
piping new arguments through the helpers around those child processes,
and then through Git's typical codepaths, is very tricky. After
spending more than a year trying to make improvements, we have very
little to show for it, largely as a result of the difficulty of
passing information between superprojects and submodules.

It seems like being able to invoke parts of Git as a library, or Git
being able to invoke custom libraries, does a lot of good for the Git
project:

- Having clear, modular libraries makes it easy to find the code
responsible for some behavior, or determine where to add something
new.
- Operations recursing into submodules can be run in-process, and the
callsite makes it very clear which repository context is being used;
if we're not subprocessing to recurse, then we can lose the awkward
git-submodule.sh -> builtin/submodule__helper.c -> top-level Git
process codepath.
- Being able to test libraries in isolation via unit tests or mocks
speeds up determining the root cause of bugs.
- If we can swap out an entire library, or just a single function of
one, it's easy to experiment with the entire codebase without sweeping
changes.

The ability to use Git as a library also makes it easier for other
tooling to interact with a Git repository *correctly*. As an example,
`repo` has a long history of abusing Git by directly manipulating the
gitdir[2], but if it were written in a world where Git exists as
easy-to-use libraries, it probably wouldn't have needed to, as it
could have invoked Git directly or replaced the relevant modules with
its own implementation. Both `repo`[3] and `git-gui[4]` have
reimplemented logic from git.git. Other interfaces that cooperate with
Git's filesystem storage layer, like `scalar` or `jj`[5], would be
able to interop with a Git repository without having to reimplement
custom logic or keep up with new Git changes.

Of course, there's a reason Google wants it, too. We've talked
previously about wanting better integration between Git and something
like a VFS; as we've experimented with it internally, we've found a
couple of tricky areas:

- The VFS relies on running Git commands to determine the state of the
repository. However, these Git commands interact with the gitdir or
worktree, which is populated by the VFS. For example, parsing a
.gitattributes or .gitmodules which is already stored in the VFS
requires the VFS to provide a POSIX file handle, spawn a Git
subprocess, populate other files needed by that subprocess (like
.git/config), and finally collect the output stream of the subprocess.
As you can imagine, this interaction of VFS -> Git -> VFS [-> Git]
creates all sort of complications. The alternative is for the VFS to
write its own parser (or use a library like libgit2; more on that
later). But having a Git library means that a subset of Git
functionality can happen in-process, and that filesystem access could
be replaced by the VFS directly providing high-level objects or plain
bytestreams.

- A user running `git status` in a directory controlled by the VFS
will require the VFS to populate the entire (large) worktree - even
though the VFS is sure that only one file has been modified. The
closest we've come with an alternative is an orchestrated use of
sparse-checkout - but modifying the sparse-checkout configs
automatically in response to the user's filesystem operations takes us
right back to the previous point. If Git could take a plugin and
replacement for the object store that directly interacts with the VFS
daemon, a layer of complexity would disappear and performance would
improve.

We discussed using an existing library like libgit2 or JGit, but it's
not a very exciting proposal: these libraries are already lagging
behind git.git in features, and trying to use them in combination with
brand-new improvements to Git (like new partial clone filters) means
that we'll always get to implement those improvements twice to bring
libgit2 up to speed. It also means that people using the `git` client
directly won't get performance benefits derived from having Git
internal libraries replaced by purpose-built ones in certain contexts.
That said, if libgit2 already provides functionality and performance
equivalent to git.git's in an appropriate wrapper, I'd be excited to
pursue integrating that library into git.git's codebase directly.

The good news is that for practical near-term purposes, "libification"
mostly means cleanups to the Git codebase, and continuing code health
work that the project has already cared about doing:

- Removing references to global variables and instead piping them
through arguments
- Finding and fixing memory leaks, especially in widely-used low-level code
- Reducing or removing `die()` invocations in low-level code, and
instead reporting errors back to callers in a consistent way
- Clarifying the scope and layering of existing modules, for example
by moving single-use helpers from the shared module's scope into the
single user's scope
- Making module interfaces more consistent and easier to understand,
including moving "private" functions out of headers and into source
files and improving in-header documentation

Basically, if this effort turns out not to be fruitful as a whole, I'd
like for us to still have left a positive impact on the codebase.

In the longer term, if Git has libraries with easily-replaced
dependencies, we get a few more benefits:

- Unit tests. We already have some in t/helper/, but if we can replace
all the dependencies of a certain library with simple stubs, it's
easier for us to write comprehensive unit tests, in addition to the
work we already do introducing edge cases in bash integration tests.
- If our users can use plugins to improve performance in specific
scenarios (like a VFS-aware object store in the VFS case I cited
above), then Git works better for them without having to adopt a
different workflow, such as using an alternative tool or wrapper.
- An easy-to-understand modular codebase makes it easy for new
contributors to start hacking and understand the consequences of their
patch.

Of course, we haven't maintained any guarantee about the consistency
of our implementation between releases. I don't anticipate that we'll
write the perfect library interface on our first try. So I hope that
we can be very explicit about refusing to provide any compatibility
guarantee whatsoever between versions for quite a long time. On
Google's end, that's well-understood and accepted. As I understand,
some other projects already use Git's codebase as a "library" by
including it as a submodule and using the code directly[6]; even a
breakable API seems like an improvement over that, too.

So what's next? Naturally, I'm looking forward to a spirited
discussion about this topic - I'd like to know which concerns haven't
been addressed and figure out whether we can find a way around them,
and generally build awareness of this effort with the community.

I'm also planning to send a proposal for a document full of "best
practices" for turning Git code into libraries (and have quite a lot
of discussion around that document, too). My hope is that we can use
that document to help us during implementation as well as during
review, and refine it over time as we learn more about what works and
what doesn't. Having this kind of documentation will make it easy for
others to join us in moving Git's codebase towards a clean set of
libraries. I hope that, as a project, we can settle on some tenets
that we all agree would make Git nicer.

From the rest of my own team, we're planning on working first on some
limited scope, low-level libraries so that we can all see how the
process works. We're starting with strbuf.[ch] (as it's used
everywhere with few or no dependencies and helps us guarantee string
safety at API boundaries), config.[ch] (as many external tools are
probably interested in parsing Git config formatted files directly),
and a subset of operations related to the object store. These starting
points are intended to have a small impact on the codebase and teach
us a lot about logistics and best practices while doing these kinds of
conversions.

After that, we're still hoping to target low-level libraries first - I
certainly don't think it will make sense to ship a high-level `git
commit` library in the near future, if ever - in the order that
they're required from the VFS project we're working closely with. As
far as I can tell right now, that's likely to cover object store and
worktree access, as well as commit creation and pushing, but we'll see
how planning shakes out over the next month or so. But Google's
schedule should have no bearing on what others in the Git project feel
is important to clean up and libify, and if there is interest in the
rest of the project in converting other existing modules into
libraries, my team and I are excited to participate in the review.

Much, much later on, I'm expecting us to form a plan around allowing
"plugins" - that is, replacing library functionality we use today with
an alternative library, such as an object store relying on a
distributed file store like S3. Making that work well will also likely
involve us coming up with a solution for dependency injection, and to
begin using vtables for some libraries. I'm hoping that we can figure
out a way to do that that won't make the Git source ugly. Around this
time, I think it will make sense to buy into unit tests even more and
start using an approach like mocking to test various edge cases. And
at some point, it's likely that we'll want to make the interfaces to
various Git libraries consistent with each other, which would involve
some large-scale but hopefully-mechanical refactors.

I'm looking forward to the discussion!

 - Emily

1: https://colabti.org/irclogger/irclogger_log/git-devel?date=2023-02-13#l29
2: https://gerrit.googlesource.com/git-repo/+/refs/heads/main/docs/internal-fs-layout.md
3: https://gerrit.googlesource.com/git-repo/+/refs/heads/main/git_config.py
4: https://github.com/git/git/blob/master/git-gui/git-gui.sh#L305
5: https://github.com/martinvonz/jj
6: https://github.com/glandium/git-cinnabar

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 21:12 Proposal/Discussion: Turning parts of Git into libraries Emily Shaffer
@ 2023-02-17 21:21 ` brian m. carlson
  2023-02-17 21:38   ` Emily Shaffer
  2023-02-17 22:04   ` rsbecker
  2023-02-17 22:57 ` Junio C Hamano
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 37+ messages in thread
From: brian m. carlson @ 2023-02-17 21:21 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Git List, Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

[-- Attachment #1: Type: text/plain, Size: 488 bytes --]

On 2023-02-17 at 21:12:23, Emily Shaffer wrote:
> Hi folks,

Hey,

> I'm looking forward to the discussion!

While I'm not personally interested in the VFS work, I think it's a
great idea to turn more of the code into libraries (or at least make it
more library-like), and so I'm fully in support of this approach.  When
I send patches in the future, I'll try to make sure that they're
friendly to this goal.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 21:21 ` brian m. carlson
@ 2023-02-17 21:38   ` Emily Shaffer
  2023-02-17 22:41     ` brian m. carlson
  2023-02-17 22:04   ` rsbecker
  1 sibling, 1 reply; 37+ messages in thread
From: Emily Shaffer @ 2023-02-17 21:38 UTC (permalink / raw)
  To: brian m. carlson, Emily Shaffer, Git List, Jonathan Nieder,
	Jose Lopes, Aleksandr Mikhailov

On Fri, Feb 17, 2023 at 1:21 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> On 2023-02-17 at 21:12:23, Emily Shaffer wrote:
> > Hi folks,
>
> Hey,
>
> > I'm looking forward to the discussion!
>
> While I'm not personally interested in the VFS work, I think it's a
> great idea to turn more of the code into libraries (or at least make it
> more library-like), and so I'm fully in support of this approach.

Yeah, I expect this sentiment is true for most contributors. And I'm
really hoping there are other things which aren't on my radar, but
other contributors are enthusiastic about, that would also be served
by this kind of library approach.

For example, I seem to remember you saying during the SHA-256 series
that the next hashing algorithm would also be painful to implement;
would that still be true if the hashing algorithm is encapsulated well
by a library interface? Or is it for a different reason?

> When
> I send patches in the future, I'll try to make sure that they're
> friendly to this goal.

That's awesome to hear. Thanks, brian.

 - Emily

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 21:21 ` brian m. carlson
  2023-02-17 21:38   ` Emily Shaffer
@ 2023-02-17 22:04   ` rsbecker
  2023-02-17 22:48     ` brian m. carlson
  1 sibling, 1 reply; 37+ messages in thread
From: rsbecker @ 2023-02-17 22:04 UTC (permalink / raw)
  To: 'brian m. carlson', 'Emily Shaffer'
  Cc: 'Git List', 'Jonathan Nieder',
	'Jose Lopes', 'Aleksandr Mikhailov'

On Friday, February 17, 2023 4:22 PM, brian m. carlson wrote:
>On 2023-02-17 at 21:12:23, Emily Shaffer wrote:
>> Hi folks,
>
>Hey,
>
>> I'm looking forward to the discussion!
>
>While I'm not personally interested in the VFS work, I think it's a great idea to turn
>more of the code into libraries (or at least make it more library-like), and so I'm fully
>in support of this approach.  When I send patches in the future, I'll try to make sure
>that they're friendly to this goal.

I am uncertain about this, from a licensing standpoint. Typically, when one links in a library from one project, the license from that project may inherit into your own project. AFAIK, GPLv3 has this implied provision - I do not think it is explicit, but the implication seems to be there. Making git libraries has the potential to cause git's license rights to be incorporated into other products. I am suggesting that we would need to tread carefully in this area. Using someone else's DLL is not so bad, as the code is not bound together, but may also cause ambiguities depending on whether the licenses are conflicting or not. I am not suggesting that this is a bad idea, just one that should be handled carefully.

--Randall

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 21:38   ` Emily Shaffer
@ 2023-02-17 22:41     ` brian m. carlson
  2023-02-17 22:49       ` Emily Shaffer
  0 siblings, 1 reply; 37+ messages in thread
From: brian m. carlson @ 2023-02-17 22:41 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Git List, Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

[-- Attachment #1: Type: text/plain, Size: 2050 bytes --]

On 2023-02-17 at 21:38:34, Emily Shaffer wrote:
> For example, I seem to remember you saying during the SHA-256 series
> that the next hashing algorithm would also be painful to implement;
> would that still be true if the hashing algorithm is encapsulated well
> by a library interface? Or is it for a different reason?

Right now, most of the code for a future hash algorithm wouldn't be too
difficult to implement, I'd think, because we can already support two of
them.  If we decide, say, to implement SHA-3-512, we basically just add
that algorithm, update all the entries in the tests (which is kind of a
pain since there's a lot of them, but not really difficult), and then
move on with our lives.

The difficulty is dealing with interop work, which is basically
switching from dealing with just one algorithm to rewriting things
between the two on the fly.  I think _that_ work would be made easier by
library work because sometimes it involves working with submodules, such
as when updating the submodule commit, and being able to deal with both
object stores more easily at the same time would be very helpful in that
regard.

I can imagine there are other things that would be easier as well, and I
can also imagine that we'll have better control over memory allocations
and leak less, which would be nice.  If we can get leaks low enough, we
could even add CI jobs to catch them and fail, which I think would be
super valuable, especially since I find even after over two decades of C
that I'm still not very good about catching all the leaks (which is one
of the reasons I've mostly switched to Rust).  We might also be able to
make nicer steps on multithreading our code as well.

Personally, I'd like to see some sort of standard error type (whether
integral or not) that would let us do more bubbling up of errors and
less die().  I don't know if that's in the cards, but I thought I'd
suggest it in case other folks are interested.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 22:04   ` rsbecker
@ 2023-02-17 22:48     ` brian m. carlson
  0 siblings, 0 replies; 37+ messages in thread
From: brian m. carlson @ 2023-02-17 22:48 UTC (permalink / raw)
  To: rsbecker
  Cc: 'Emily Shaffer', 'Git List',
	'Jonathan Nieder', 'Jose Lopes',
	'Aleksandr Mikhailov'

[-- Attachment #1: Type: text/plain, Size: 1067 bytes --]

On 2023-02-17 at 22:04:19, rsbecker@nexbridge.com wrote:
> I am uncertain about this, from a licensing standpoint. Typically,
> when one links in a library from one project, the license from that
> project may inherit into your own project. AFAIK, GPLv3 has this
> implied provision - I do not think it is explicit, but the implication
> seems to be there. Making git libraries has the potential to cause
> git's license rights to be incorporated into other products. I am
> suggesting that we would need to tread carefully in this area. Using
> someone else's DLL is not so bad, as the code is not bound together,
> but may also cause ambiguities depending on whether the licenses are
> conflicting or not. I am not suggesting that this is a bad idea, just
> one that should be handled carefully.

I think it's pretty clear that if software used Git's libraries, that
the result would be GPLv2.  That might be fine for some projects, and
for others, libgit2 would be more appealing.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 22:41     ` brian m. carlson
@ 2023-02-17 22:49       ` Emily Shaffer
  2023-02-22 19:34         ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Emily Shaffer @ 2023-02-17 22:49 UTC (permalink / raw)
  To: brian m. carlson, Emily Shaffer, Git List, Jonathan Nieder,
	Jose Lopes, Aleksandr Mikhailov

On Fri, Feb 17, 2023 at 2:41 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> On 2023-02-17 at 21:38:34, Emily Shaffer wrote:
> > For example, I seem to remember you saying during the SHA-256 series
> > that the next hashing algorithm would also be painful to implement;
> > would that still be true if the hashing algorithm is encapsulated well
> > by a library interface? Or is it for a different reason?
>
> Right now, most of the code for a future hash algorithm wouldn't be too
> difficult to implement, I'd think, because we can already support two of
> them.  If we decide, say, to implement SHA-3-512, we basically just add
> that algorithm, update all the entries in the tests (which is kind of a
> pain since there's a lot of them, but not really difficult), and then
> move on with our lives.
>
> The difficulty is dealing with interop work, which is basically
> switching from dealing with just one algorithm to rewriting things
> between the two on the fly.  I think _that_ work would be made easier by
> library work because sometimes it involves working with submodules, such
> as when updating the submodule commit, and being able to deal with both
> object stores more easily at the same time would be very helpful in that
> regard.

Ooh, I see what you mean. Thanks for the extra context here.

> I can imagine there are other things that would be easier as well, and I
> can also imagine that we'll have better control over memory allocations
> and leak less, which would be nice.  If we can get leaks low enough, we
> could even add CI jobs to catch them and fail, which I think would be
> super valuable, especially since I find even after over two decades of C
> that I'm still not very good about catching all the leaks (which is one
> of the reasons I've mostly switched to Rust).  We might also be able to
> make nicer steps on multithreading our code as well.
>
> Personally, I'd like to see some sort of standard error type (whether
> integral or not) that would let us do more bubbling up of errors and
> less die().  I don't know if that's in the cards, but I thought I'd
> suggest it in case other folks are interested.

Yes!!! We have talked about this a lot internally - but this is one
thing that will be difficult to introduce into Git without making
parts of the codebase a little uglier. Since obviously C doesn't have
an intrinsic to do this, we'll have to roll our own, which means that
manipulating it consistently at function exits might end up pretty
ugly. So hearing that there's interest outside of my team to come up
with such a type makes me optimistic that we can figure out a
neat-enough solution.

> --
> brian m. carlson (he/him or they/them)
> Toronto, Ontario, CA

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 21:12 Proposal/Discussion: Turning parts of Git into libraries Emily Shaffer
  2023-02-17 21:21 ` brian m. carlson
@ 2023-02-17 22:57 ` Junio C Hamano
  2023-02-18  1:59   ` demerphq
  2023-02-21 21:42   ` Emily Shaffer
  2023-02-18  4:05 ` Elijah Newren
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 37+ messages in thread
From: Junio C Hamano @ 2023-02-17 22:57 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Git List, Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

Emily Shaffer <nasamuffin@google.com> writes:

> Basically, if this effort turns out not to be fruitful as a whole, I'd
> like for us to still have left a positive impact on the codebase.
> ...
> So what's next? Naturally, I'm looking forward to a spirited
> discussion about this topic - I'd like to know which concerns haven't
> been addressed and figure out whether we can find a way around them,
> and generally build awareness of this effort with the community.

On of the gravest concerns is that the devil is in the details.

For example, "die() is inconvenient to callers, let's propagate
errors up the callchain" is an easy thing to say, but it would take
much more than "let's propagate errors up" to libify something like
check_connected() to do the same thing without spawning a separate
process that is expected to exit with failure.

It is not clear if we can start small, work on a subset of the
things and still reap the benefit of libification.  Is there an
existing example that we have successfully modularlized the API into
one subsystem?  Offhand, I suspect that the refs API with its two
implementations may be reasonably close, but is the inteface into
that subsystem the granularity of the library interface you guys
have in mind?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 22:57 ` Junio C Hamano
@ 2023-02-18  1:59   ` demerphq
  2023-02-18 10:36     ` Phillip Wood
  2023-02-21 21:42   ` Emily Shaffer
  1 sibling, 1 reply; 37+ messages in thread
From: demerphq @ 2023-02-18  1:59 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Emily Shaffer, Git List, Jonathan Nieder, Jose Lopes,
	Aleksandr Mikhailov

On Sat, 18 Feb 2023 at 00:24, Junio C Hamano <gitster@pobox.com> wrote:
>
> Emily Shaffer <nasamuffin@google.com> writes:
>
> > Basically, if this effort turns out not to be fruitful as a whole, I'd
> > like for us to still have left a positive impact on the codebase.
> > ...
> > So what's next? Naturally, I'm looking forward to a spirited
> > discussion about this topic - I'd like to know which concerns haven't
> > been addressed and figure out whether we can find a way around them,
> > and generally build awareness of this effort with the community.
>
> On of the gravest concerns is that the devil is in the details.
>
> For example, "die() is inconvenient to callers, let's propagate
> errors up the callchain" is an easy thing to say, but it would take
> much more than "let's propagate errors up" to libify something like
> check_connected() to do the same thing without spawning a separate
> process that is expected to exit with failure.

What does "propagate errors up the callchain" mean?  One
interpretation I can think of seems quite horrible, but another seems
quite doable and reasonable and likely not even very invasive of the
existing code:

You can use setjmp/longjmp to implement a form of "try", so that
errors dont have to be *explicitly* returned *in* the call chain. And
you could probably do so without changing very much of the existing
code at all, and maintain a high level of conceptual alignment with
the current code strategy.

To do this you need to set up a globally available linked list of
jmp_env data (see `man setjmp` for jmp_env), and a global error
object, and make the existing "die" functions populate the global
error object, and then pop the most recent jmp_env data and longjmp to
it.

At the top of any git invocation you would set up the topmost jmp_env
"frame". Any code that wants to "try" existing logic pushes a new
jmp_env (using a wrapper around setjmp), and prepares to be longjmp'ed
to. If the code does not die then it pops the jmp_env it just pushed
and returns as normal, if it is longjmp'ed to you can detect this and
do some other behavior to handle the exception (by reading the global
error object). If the code that died *really* wants to exit, then it
returns the appropriate code as part of the longjmp, and the try
handler longjmps again propagating up the chain. Eventually you either
have an error that "propagates to the top" which results in an exit
with an appropriate error message, or you have an error that is
trapped and the code does something else, and then eventually returns
normally.

FWIW, this is essentially a loose description of how Perl handles the
execution part of "eval" and supports exception handling internally.
Most of the perl internals do not know anything about exceptions, they
just call functions similar to gits die functions if they need to,
which then call into Perl_die_unwind(). which then calls the
JUMPENV_JUMP() macro which does the "pop and longjmp" dance.

Seems to me that it wouldn't be very difficult nor particularly
invasive to implement this in git. Much of the logic in the perl
project to do this is at the top of cop.h,  see the macros
JMPENV_PUSH(), JMPENV_POP(), JMPENV_JUMP(). Obviously this code
contains a bunch of perl specific logic, but the general gist of it
should be easily understood and easily converted to a more git like
context:

struct jmpenv: https://github.com/Perl/perl5/blob/blead/cop.h#L32
JMPENV_BOOTSTRAP: https://github.com/Perl/perl5/blob/blead/cop.h#L66
JMPENV_PUSH: https://github.com/Perl/perl5/blob/blead/cop.h#L113
JMPENV_POP: https://github.com/Perl/perl5/blob/blead/cop.h#L147
JMPENV_JUMP: https://github.com/Perl/perl5/blob/blead/cop.h#L159

Perl_die_unwind: https://github.com/Perl/perl5/blob/blead/pp_ctl.c#L1741
Where Perl_die_unwind() calls JMPENV_JUMP:
https://github.com/Perl/perl5/blob/blead/pp_ctl.c#L1865

You can also grep for functions of the form S_try_ in the perl code
base to find examples where the C code explicitly sets up an "eval
frame" to interoperate with the functionality above.

git grep -nP '^S_try_'
pp_ctl.c:3548:S_try_yyparse(pTHX_ int gramtype, OP *caller_op)
pp_ctl.c:3604:S_try_run_unitcheck(pTHX_ OP* caller_op)
pp_sys.c:3120:S_try_amagic_ftest(pTHX_ char chr) {

Seems to me that this gives enough prior art to convert git to use the
same strategy, and that doing so would not actually be that big a
change to the existing code.  Both environments are fairly similar if
you look at them from the right perspective. Both are C, and both have
a lot of global state, and both have lots of functions which you
really dont want to have to change to understand about exception
objects..

Here is an example of how a C function might be written to use this
kind of infrastructure to "try" functionality that might call die. In
this case there is no need for the code to inspect the global error
object, but the basic pattern is consistent. The "default" case below
handles the situation where the "tried" function is signalling an
"untrappable error" that needs to be rethrown to ultimately unwind the
entire try/catch chain and exit the program. It is derived and
simplified from S_try_yyparse mentioned above. This function handles
the "compile the code" part of an `eval EXPR`, and traps exceptions
from the parser so that they can be handled properly and distinctly
from errors trapped during execution of the compiled code. [ I am
assuming that given the historical relationship between git and perl
these concepts are not alien to everybody on this list. ]

/* S_try_yyparse():
 *
 * Run yyparse() in a setjmp wrapper. Returns:
 *   0: yyparse() successful
 *   1: yyparse() failed
 *   3: yyparse() died
 *
 * ...
 */
STATIC int
S_try_yyparse(pTHX_ int gramtype, ...)
{
    dJMPENV;

    JMPENV_PUSH(ret);
    switch (ret) {
    case 0:
        ret = yyparse(gramtype) ? 1 : 0;
        break;
    case 3:
        /* yyparse() died and we trapped the error. */
        ....
        break;
    default:
        JMPENV_POP;          /* remove our own setjmp data */
        JMPENV_JUMP(ret); /* RETHROW */
    }
    JMPENV_POP;
    return ret;
}

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 21:12 Proposal/Discussion: Turning parts of Git into libraries Emily Shaffer
  2023-02-17 21:21 ` brian m. carlson
  2023-02-17 22:57 ` Junio C Hamano
@ 2023-02-18  4:05 ` Elijah Newren
  2023-02-21 22:06   ` Emily Shaffer
  2023-02-21 19:09 ` Taylor Blau
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: Elijah Newren @ 2023-02-18  4:05 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Git List, Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

On Fri, Feb 17, 2023 at 1:45 PM Emily Shaffer <nasamuffin@google.com> wrote:
>
[...]
> This turned out pretty long-winded, so a quick summary before I dive in:
>
> - We want to compile parts of Git as independent libraries
> - We want to do it by making incremental code quality improvements to Git
> - Let's avoid promising stability of the interfaces of those libraries
> - We think it'll let Git do cool stuff like unit tests and allowing
> purpose-built plugins
> - Hopefully by example we can convince the rest of the project to join
> in the effort

Seems like quite reasonable high-level goals.

[...]
> The good news is that for practical near-term purposes, "libification"
> mostly means cleanups to the Git codebase, and continuing code health
> work that the project has already cared about doing:
>
> - Removing references to global variables and instead piping them
> through arguments
> - Finding and fixing memory leaks, especially in widely-used low-level code

Does removing memory leaks also mean converting UNLEAK to free()?
Thinking of things in a library context probably pushes us in that
direction (though, alternatively, it might just highlight the question
of what is considered "low-level" instead).

> - Reducing or removing `die()` invocations in low-level code, and
> instead reporting errors back to callers in a consistent way

What delinates "low-level" code?  (A "we don't know yet but we'll
start with obvious places and plan to have good discussions on the
appropriate boundary in the future as we submit patches" is a fine
answer, I'm just curious if you already have a rough idea of where you
intend that boundary to lie.)

> - Clarifying the scope and layering of existing modules, for example
> by moving single-use helpers from the shared module's scope into the
> single user's scope
> - Making module interfaces more consistent and easier to understand,
> including moving "private" functions out of headers and into source
> files and improving in-header documentation

I think these are very positive directions.  I like the fact that your
initial plan benefits all of us, whether or not libification is
ultimately achieved.

[...]
> So what's next? Naturally, I'm looking forward to a spirited
> discussion about this topic - I'd like to know which concerns haven't
> been addressed and figure out whether we can find a way around them,
> and generally build awareness of this effort with the community.

I'm curious whether clarifying scope/layering and cleaning up
interfaces might mean you'd be interested in things like:
  * https://github.com/newren/git/commits/header-cleanups (which was
still WIP; I paused working on it because I figured people would see
it as big "cleanup" patches with no practical benefit)
  * https://github.com/gitgitgadget/git/pull/1149 (which has been
ready to submit for a _long_ time, but I just haven't yet)
or if these two things are orthogonal to what you have in mind.

> I'm also planning to send a proposal for a document full of "best
> practices" for turning Git code into libraries (and have quite a lot
> of discussion around that document, too). My hope is that we can use
> that document to help us during implementation as well as during
> review, and refine it over time as we learn more about what works and
> what doesn't. Having this kind of documentation will make it easy for
> others to join us in moving Git's codebase towards a clean set of
> libraries. I hope that, as a project, we can settle on some tenets
> that we all agree would make Git nicer.

I like the sound of this.

> After that, we're still hoping to target low-level libraries first - I
> certainly don't think it will make sense to ship a high-level `git
> commit` library in the near future, if ever - in the order that
> they're required from the VFS project we're working closely with. As
> far as I can tell right now, that's likely to cover object store and
> worktree access, as well as commit creation and pushing, but we'll see
> how planning shakes out over the next month or so. But Google's
> schedule should have no bearing on what others in the Git project feel
> is important to clean up and libify, and if there is interest in the
> rest of the project in converting other existing modules into
> libraries, my team and I are excited to participate in the review.

If we can't libify something like commit, does that prevent libifying
higher level things like merge?

I spent some time thinking about this a while back.  I tried to
carefully design merge-ort to improve the odds it could be used
elsewhere, maybe even libgit2.  (I hope it shows in the many comments
in merge-ort.h, and I think the "priv" field in particular allowing me
to hide the first ~300 lines of merge-ort.c declaring data structures
from users was really nice.)  However, I still had to accept data in
some known format.  So input parameters are things like trees and
commits.  But tree.h and commit.h both include object.h first, which
includes cache.h, which is basically all of Git.  And the functions I
call to interoperate with the system are similarly entangled.  So, the
odds of merge-ort being reused by libgit2 or otherwise used in a
library seems essentially nil, at least without some broader
libification effort.

I'd like to make that story better, time permitting (which is much
more of a challenge these days than it was a couple years ago), but
I'm curious if you or others have thoughts on something like that.

> Much, much later on, I'm expecting us to form a plan around allowing
> "plugins" - that is, replacing library functionality we use today with
> an alternative library, such as an object store relying on a
> distributed file store like S3. Making that work well will also likely
> involve us coming up with a solution for dependency injection, and to
> begin using vtables for some libraries. I'm hoping that we can figure
> out a way to do that that won't make the Git source ugly. Around this
> time, I think it will make sense to buy into unit tests even more and
> start using an approach like mocking to test various edge cases. And
> at some point, it's likely that we'll want to make the interfaces to
> various Git libraries consistent with each other, which would involve
> some large-scale but hopefully-mechanical refactors.

Would these plugins resemble the pluggable merge backends that was
added to builtin/merge.c?  Would it replace that mechanism with a
different one?  Would it be more like the refs backends?

Would this plugin scheme allow us to, for example, use gitoxide[1] as
a clone replacement to make clones 2x as fast (and with half the
memory -- although I suspect they cheated and used sha1 instead of
sha1dc, so maybe it wouldn't really be 2x)?

Oh, and it's totally okay if you don't know the answers to any or all
of my questions right now.  I'm just curious, because I've long
thought these kinds of directions would be good.  Since I've spent
time thinking about it, I have questions that I don't know the answers
to, but I figured it couldn't hurt to bounce them off others who are
thinking about this area.

Anyway, it's a large pile of work that you're undertaking, and as
Junio comments elsewhere in this thread it's unclear if libification
can be achieved for a big enough component (and you seem to admit as
much in your email as well), but I applaud the general direction and
your initial plans.

[1] https://github.com/Byron/gitoxide/discussions/579

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-18  1:59   ` demerphq
@ 2023-02-18 10:36     ` Phillip Wood
  2023-03-23 23:22       ` Felipe Contreras
  0 siblings, 1 reply; 37+ messages in thread
From: Phillip Wood @ 2023-02-18 10:36 UTC (permalink / raw)
  To: demerphq, Junio C Hamano
  Cc: Emily Shaffer, Git List, Jonathan Nieder, Jose Lopes,
	Aleksandr Mikhailov

On 18/02/2023 01:59, demerphq wrote:
> On Sat, 18 Feb 2023 at 00:24, Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Emily Shaffer <nasamuffin@google.com> writes:
>>
>>> Basically, if this effort turns out not to be fruitful as a whole, I'd
>>> like for us to still have left a positive impact on the codebase.
>>> ...
>>> So what's next? Naturally, I'm looking forward to a spirited
>>> discussion about this topic - I'd like to know which concerns haven't
>>> been addressed and figure out whether we can find a way around them,
>>> and generally build awareness of this effort with the community.
>>
>> On of the gravest concerns is that the devil is in the details.
>>
>> For example, "die() is inconvenient to callers, let's propagate
>> errors up the callchain" is an easy thing to say, but it would take
>> much more than "let's propagate errors up" to libify something like
>> check_connected() to do the same thing without spawning a separate
>> process that is expected to exit with failure.
> 
> 
> What does "propagate errors up the callchain" mean?  One
> interpretation I can think of seems quite horrible, but another seems
> quite doable and reasonable and likely not even very invasive of the
> existing code:
> 
> You can use setjmp/longjmp to implement a form of "try", so that
> errors dont have to be *explicitly* returned *in* the call chain. And
> you could probably do so without changing very much of the existing
> code at all, and maintain a high level of conceptual alignment with
> the current code strategy.

Using setjmp/longjmp is an interesting suggestion, I think lua does 
something similar to what you describe for perl. However I think both of 
those use a allocator with garbage collection. I worry that using 
longjmp in git would be more invasive (or result in more memory leaks) 
as we'd need to to guard each allocation with some code to clean it up 
and then propagate the error. That means we're back to manually 
propagating errors up the call chain in many cases.

Best Wishes

Phillip

> To do this you need to set up a globally available linked list of
> jmp_env data (see `man setjmp` for jmp_env), and a global error
> object, and make the existing "die" functions populate the global
> error object, and then pop the most recent jmp_env data and longjmp to
> it.
> 
> At the top of any git invocation you would set up the topmost jmp_env
> "frame". Any code that wants to "try" existing logic pushes a new
> jmp_env (using a wrapper around setjmp), and prepares to be longjmp'ed
> to. If the code does not die then it pops the jmp_env it just pushed
> and returns as normal, if it is longjmp'ed to you can detect this and
> do some other behavior to handle the exception (by reading the global
> error object). If the code that died *really* wants to exit, then it
> returns the appropriate code as part of the longjmp, and the try
> handler longjmps again propagating up the chain. Eventually you either
> have an error that "propagates to the top" which results in an exit
> with an appropriate error message, or you have an error that is
> trapped and the code does something else, and then eventually returns
> normally.
> 
> FWIW, this is essentially a loose description of how Perl handles the
> execution part of "eval" and supports exception handling internally.
> Most of the perl internals do not know anything about exceptions, they
> just call functions similar to gits die functions if they need to,
> which then call into Perl_die_unwind(). which then calls the
> JUMPENV_JUMP() macro which does the "pop and longjmp" dance.
> 
> Seems to me that it wouldn't be very difficult nor particularly
> invasive to implement this in git. Much of the logic in the perl
> project to do this is at the top of cop.h,  see the macros
> JMPENV_PUSH(), JMPENV_POP(), JMPENV_JUMP(). Obviously this code
> contains a bunch of perl specific logic, but the general gist of it
> should be easily understood and easily converted to a more git like
> context:
> 
> struct jmpenv: https://github.com/Perl/perl5/blob/blead/cop.h#L32
> JMPENV_BOOTSTRAP: https://github.com/Perl/perl5/blob/blead/cop.h#L66
> JMPENV_PUSH: https://github.com/Perl/perl5/blob/blead/cop.h#L113
> JMPENV_POP: https://github.com/Perl/perl5/blob/blead/cop.h#L147
> JMPENV_JUMP: https://github.com/Perl/perl5/blob/blead/cop.h#L159
> 
> Perl_die_unwind: https://github.com/Perl/perl5/blob/blead/pp_ctl.c#L1741
> Where Perl_die_unwind() calls JMPENV_JUMP:
> https://github.com/Perl/perl5/blob/blead/pp_ctl.c#L1865
> 
> You can also grep for functions of the form S_try_ in the perl code
> base to find examples where the C code explicitly sets up an "eval
> frame" to interoperate with the functionality above.
> 
> git grep -nP '^S_try_'
> pp_ctl.c:3548:S_try_yyparse(pTHX_ int gramtype, OP *caller_op)
> pp_ctl.c:3604:S_try_run_unitcheck(pTHX_ OP* caller_op)
> pp_sys.c:3120:S_try_amagic_ftest(pTHX_ char chr) {
> 
> Seems to me that this gives enough prior art to convert git to use the
> same strategy, and that doing so would not actually be that big a
> change to the existing code.  Both environments are fairly similar if
> you look at them from the right perspective. Both are C, and both have
> a lot of global state, and both have lots of functions which you
> really dont want to have to change to understand about exception
> objects..
> 
> Here is an example of how a C function might be written to use this
> kind of infrastructure to "try" functionality that might call die. In
> this case there is no need for the code to inspect the global error
> object, but the basic pattern is consistent. The "default" case below
> handles the situation where the "tried" function is signalling an
> "untrappable error" that needs to be rethrown to ultimately unwind the
> entire try/catch chain and exit the program. It is derived and
> simplified from S_try_yyparse mentioned above. This function handles
> the "compile the code" part of an `eval EXPR`, and traps exceptions
> from the parser so that they can be handled properly and distinctly
> from errors trapped during execution of the compiled code. [ I am
> assuming that given the historical relationship between git and perl
> these concepts are not alien to everybody on this list. ]
> 
> /* S_try_yyparse():
>   *
>   * Run yyparse() in a setjmp wrapper. Returns:
>   *   0: yyparse() successful
>   *   1: yyparse() failed
>   *   3: yyparse() died
>   *
>   * ...
>   */
> STATIC int
> S_try_yyparse(pTHX_ int gramtype, ...)
> {
>      dJMPENV;
> 
>      JMPENV_PUSH(ret);
>      switch (ret) {
>      case 0:
>          ret = yyparse(gramtype) ? 1 : 0;
>          break;
>      case 3:
>          /* yyparse() died and we trapped the error. */
>          ....
>          break;
>      default:
>          JMPENV_POP;          /* remove our own setjmp data */
>          JMPENV_JUMP(ret); /* RETHROW */
>      }
>      JMPENV_POP;
>      return ret;
> }
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 21:12 Proposal/Discussion: Turning parts of Git into libraries Emily Shaffer
                   ` (2 preceding siblings ...)
  2023-02-18  4:05 ` Elijah Newren
@ 2023-02-21 19:09 ` Taylor Blau
  2023-02-21 22:27   ` Emily Shaffer
  2023-02-22  1:44 ` Victoria Dye
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: Taylor Blau @ 2023-02-21 19:09 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Git List, Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

On Fri, Feb 17, 2023 at 01:12:23PM -0800, Emily Shaffer wrote:
> This turned out pretty long-winded, so a quick summary before I dive in:
>
> - We want to compile parts of Git as independent libraries
> - We want to do it by making incremental code quality improvements to Git
> - Let's avoid promising stability of the interfaces of those libraries
> - We think it'll let Git do cool stuff like unit tests and allowing
>   purpose-built plugins
> - Hopefully by example we can convince the rest of the project to join
>   in the effort

Like others, I am less interested in the VFS-specific components you
mention here, but I suspect that is just one particular instance of
something that would be benefited by making git internals exposed via a
linkable library.

I don't have objections to things like reducing our usage of globals,
making fewer internal functions die() when they encounter an error, and
so on. But like Junio, I suspect that this is definitely an instance of
a "devil's in the details" kind of problem.

That's definitely my main concern: that this turns out to be much more
complicated than imagined and that we leave the codebase in a worse
state without much to show. A lesser version of that outcome would be
that we cause a lot of churn in the tree with not much to show either.

So I think we'd want to see some more concrete examples with clear
benefits to gauge whether this is a worthwhile direction. I think that
strbuf.h is too trivial an example to demonstrate anything useful. Being
able to extract config.h into its own library so that another non-Git
program could link against it and implement 'git config'-like
functionality would be much more interesting.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 22:57 ` Junio C Hamano
  2023-02-18  1:59   ` demerphq
@ 2023-02-21 21:42   ` Emily Shaffer
  2023-02-22  0:22     ` Junio C Hamano
  1 sibling, 1 reply; 37+ messages in thread
From: Emily Shaffer @ 2023-02-21 21:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List, Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

On Fri, Feb 17, 2023 at 2:57 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Emily Shaffer <nasamuffin@google.com> writes:
>
> > Basically, if this effort turns out not to be fruitful as a whole, I'd
> > like for us to still have left a positive impact on the codebase.
> > ...
> > So what's next? Naturally, I'm looking forward to a spirited
> > discussion about this topic - I'd like to know which concerns haven't
> > been addressed and figure out whether we can find a way around them,
> > and generally build awareness of this effort with the community.
>
> On of the gravest concerns is that the devil is in the details.
>
> For example, "die() is inconvenient to callers, let's propagate
> errors up the callchain" is an easy thing to say, but it would take
> much more than "let's propagate errors up" to libify something like
> check_connected() to do the same thing without spawning a separate
> process that is expected to exit with failure.

Because the error propagation path is complicated, you mean? Or
because the cleanup is painful?

I wonder about this idea of spawning a worker thread that can
terminate itself, though. Is it a bad idea? Is it a hacky way of
pretending that we have exceptions? I guess if we have a thread then
we still have the same concerns about memory management (which we
don't have if we use a child process). (I'll reply to demerphq's mail
in detail, but it seems like the hardest part of this is memory
cleanup, no?)

In other cases, we might want to perform some work that can be sped up
by using more threads; how do we want to expose that functionality to
the caller? Do we want to manage our own threads, or do we want to
pass off orchestrating those worker threads to the caller (who
theoretically might have a faster way to manage them, like GPU
execution or distributed execution or something, or who might be using
their own thread pool manager)?

>
> It is not clear if we can start small, work on a subset of the
> things and still reap the benefit of libification.  Is there an
> existing example that we have successfully modularlized the API into
> one subsystem?  Offhand, I suspect that the refs API with its two
> implementations may be reasonably close, but is the inteface into
> that subsystem the granularity of the library interface you guys
> have in mind?

I think many of our internal APIs, especially the lower level ones,
are actually quite well modularized, or close enough to it that you
can't really tell they aren't. run-command.h and config.h come to
mind. The ones that aren't, I tend to think are frustrating to work
with anyways - is it reasonable to consider, for example, further
cleanup of cache.h as part of this effort? Is it reasonable to rework
an ugly circular dependency between two headers as a prerequisite to
doing library work around one of them?

I had a look at the refs API documentation but it seems that we don't
actually have a way for the code to use reftable. Is that what you
meant by the two implementations of refs API, or am I missing
something else? Anyway, abstracting at the "which backend do I want to
use" layer seems absolutely appropriate to me, if we're discussing
places where Git can use an alternative implementation. (For example,
this means it's also easy for Git to use some random NoSQL table as a
ref store, if that's what the caller wants.) For the most part refs.h
seems like it has things I would want to expose to external callers
(or that I would want to reimplement as a library author).

 - Emily

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-18  4:05 ` Elijah Newren
@ 2023-02-21 22:06   ` Emily Shaffer
  2023-02-22  8:23     ` Elijah Newren
  2023-02-22 19:25     ` Jeff King
  0 siblings, 2 replies; 37+ messages in thread
From: Emily Shaffer @ 2023-02-21 22:06 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git List, Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

On Fri, Feb 17, 2023 at 8:05 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Fri, Feb 17, 2023 at 1:45 PM Emily Shaffer <nasamuffin@google.com> wrote:
> [...]
> > The good news is that for practical near-term purposes, "libification"
> > mostly means cleanups to the Git codebase, and continuing code health
> > work that the project has already cared about doing:
> >
> > - Removing references to global variables and instead piping them
> > through arguments
> > - Finding and fixing memory leaks, especially in widely-used low-level code
>
> Does removing memory leaks also mean converting UNLEAK to free()?

I suspect so - as I understand it, UNLEAK is a macro that resolves to
"don't complain to me, compiler, I meant not to free it."

> Thinking of things in a library context probably pushes us in that
> direction (though, alternatively, it might just highlight the question
> of what is considered "low-level" instead).

I'm not sure whether use of UNLEAK has so much to do with "low-level"
or not. In cases when Git is being called as an ephemeral single-run
process, UNLEAK makes a lot of sense. In cases when Git is being
called in a long-lived process, UNLEAK is just a sign that says
"there's a leak here".  So I think the distinction is not low-level or
high-level, but more simply, within a library or not.

I do anticipate that we'll still have "non-libified" code for the
builtins, and that those builtins will invoke libraries at whatever
layer. So UNLEAKing memory allocated by the builtin - seems fine to
me, even if that builtin is a "low-level" plumbing command.

>
> > - Reducing or removing `die()` invocations in low-level code, and
> > instead reporting errors back to callers in a consistent way
>
> What delinates "low-level" code?  (A "we don't know yet but we'll
> start with obvious places and plan to have good discussions on the
> appropriate boundary in the future as we submit patches" is a fine
> answer, I'm just curious if you already have a rough idea of where you
> intend that boundary to lie.)

The biggest one is our "standard library" - stuff like strbuf,
string-list, strvec, etc. etc. I'd like to expose those to callers so
that we don't end up having library interfaces passing around
unterminated buffers, which means that they'll be used in almost any
other library.

That sort of hints at the next criteria - stuff that's used by lots of
operations, or lots of other parts of Git code. So that means things
like run-command and config.

Past that, we're determining libification order based on internal
priorities. A request like "our VFS helper needs to do `git commit`
with this specific set of constraints, please give us library calls to
do it" would probably result in us working on library interfaces to
hook execution, index parsing, and ref manipulation, and anything
that's a dependency of those three. It's very unlikely that it would
result in something like `git_do_commit(struct git_commit_flags)`.
(That's what I meant about avoiding high-level libraries to begin
with.)

> > - Clarifying the scope and layering of existing modules, for example
> > by moving single-use helpers from the shared module's scope into the
> > single user's scope
> > - Making module interfaces more consistent and easier to understand,
> > including moving "private" functions out of headers and into source
> > files and improving in-header documentation
>
> I think these are very positive directions.  I like the fact that your
> initial plan benefits all of us, whether or not libification is
> ultimately achieved.
>
> [...]
> > So what's next? Naturally, I'm looking forward to a spirited
> > discussion about this topic - I'd like to know which concerns haven't
> > been addressed and figure out whether we can find a way around them,
> > and generally build awareness of this effort with the community.
>
> I'm curious whether clarifying scope/layering and cleaning up
> interfaces might mean you'd be interested in things like:
>   * https://github.com/newren/git/commits/header-cleanups (which was
> still WIP; I paused working on it because I figured people would see
> it as big "cleanup" patches with no practical benefit)
>   * https://github.com/gitgitgadget/git/pull/1149 (which has been
> ready to submit for a _long_ time, but I just haven't yet)
> or if these two things are orthogonal to what you have in mind.

Extremely yes. :) Even "small" stuff like the need for header cleanups
have already come up for Glen and Calvin working on config and strbuf.

>
> > I'm also planning to send a proposal for a document full of "best
> > practices" for turning Git code into libraries (and have quite a lot
> > of discussion around that document, too). My hope is that we can use
> > that document to help us during implementation as well as during
> > review, and refine it over time as we learn more about what works and
> > what doesn't. Having this kind of documentation will make it easy for
> > others to join us in moving Git's codebase towards a clean set of
> > libraries. I hope that, as a project, we can settle on some tenets
> > that we all agree would make Git nicer.
>
> I like the sound of this.
>
> > After that, we're still hoping to target low-level libraries first - I
> > certainly don't think it will make sense to ship a high-level `git
> > commit` library in the near future, if ever - in the order that
> > they're required from the VFS project we're working closely with. As
> > far as I can tell right now, that's likely to cover object store and
> > worktree access, as well as commit creation and pushing, but we'll see
> > how planning shakes out over the next month or so. But Google's
> > schedule should have no bearing on what others in the Git project feel
> > is important to clean up and libify, and if there is interest in the
> > rest of the project in converting other existing modules into
> > libraries, my team and I are excited to participate in the review.
>
> If we can't libify something like commit, does that prevent libifying
> higher level things like merge?
>
> I spent some time thinking about this a while back.  I tried to
> carefully design merge-ort to improve the odds it could be used
> elsewhere, maybe even libgit2.  (I hope it shows in the many comments
> in merge-ort.h, and I think the "priv" field in particular allowing me
> to hide the first ~300 lines of merge-ort.c declaring data structures
> from users was really nice.)  However, I still had to accept data in
> some known format.  So input parameters are things like trees and
> commits.  But tree.h and commit.h both include object.h first, which
> includes cache.h, which is basically all of Git.  And the functions I
> call to interoperate with the system are similarly entangled.  So, the
> odds of merge-ort being reused by libgit2 or otherwise used in a
> library seems essentially nil, at least without some broader
> libification effort.

Yeah, it depends a lot on the usage. For merge, it would be tricky to
get the scope just right - should the merge library be responsible for
locating the merge-base? Should it just perform the conflict
resolution? Something else?

As for "I tried to include this thing which eventually included
cache.h" - yeah, I think we will be pulling stuff out of cache.h quite
heavily. But IMO this falls under "cleanups we want to do in Git
anyway" - I think it's widely understood that cache.h is not
well-scoped and could use improvement.

>
> I'd like to make that story better, time permitting (which is much
> more of a challenge these days than it was a couple years ago), but
> I'm curious if you or others have thoughts on something like that.
>
> > Much, much later on, I'm expecting us to form a plan around allowing
> > "plugins" - that is, replacing library functionality we use today with
> > an alternative library, such as an object store relying on a
> > distributed file store like S3. Making that work well will also likely
> > involve us coming up with a solution for dependency injection, and to
> > begin using vtables for some libraries. I'm hoping that we can figure
> > out a way to do that that won't make the Git source ugly. Around this
> > time, I think it will make sense to buy into unit tests even more and
> > start using an approach like mocking to test various edge cases. And
> > at some point, it's likely that we'll want to make the interfaces to
> > various Git libraries consistent with each other, which would involve
> > some large-scale but hopefully-mechanical refactors.
>
> Would these plugins resemble the pluggable merge backends that was
> added to builtin/merge.c?  Would it replace that mechanism with a
> different one?  Would it be more like the refs backends?

I suspect that it's likely to be most similar to the refs backend
replacement, although I investigated it only a little bit just now.

The pluggable merge backends are an interesting thought - right now
all those alternatives are built in and we decide based on config,
right? But if we were able to easily decide which library to link
based on config during runtime, then I could see that being an
appealing use of plugins too. I wonder whether "custom" merge backends
make the story for people doing compiled asset storage in Git (like
game assets or hardware layouts, both of which famously merge
horribly) any easier.

>
> Would this plugin scheme allow us to, for example, use gitoxide[1] as
> a clone replacement to make clones 2x as fast (and with half the
> memory -- although I suspect they cheated and used sha1 instead of
> sha1dc, so maybe it wouldn't really be 2x)?

Interesting. It would depend on whether we can match the interface to
gitoxide, or write a translation layer. I could see it! I'm also a
little curious how much of that speedup is because of corner-cutting
(since you mentioned skipping the collision detection) vs. how much is
due to Rust magic. In theory, building the Git CLI out of a handful of
libraries means that we could write some of those libraries in
something besides C; in practice, I understand there's a
maintainability issue around introducing new languages into the pile
of stuff the community is expected to understand and maintain. (For
example, I think many people don't like to touch git-gui, probably
primarily because it's in Tcl.)

>
> Oh, and it's totally okay if you don't know the answers to any or all
> of my questions right now.  I'm just curious, because I've long
> thought these kinds of directions would be good.  Since I've spent
> time thinking about it, I have questions that I don't know the answers
> to, but I figured it couldn't hurt to bounce them off others who are
> thinking about this area.
>
> Anyway, it's a large pile of work that you're undertaking, and as
> Junio comments elsewhere in this thread it's unclear if libification
> can be achieved for a big enough component (and you seem to admit as
> much in your email as well), but I applaud the general direction and
> your initial plans.

Thanks for your thoughtful reply.

 - Emily

>
>
> [1] https://github.com/Byron/gitoxide/discussions/579

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-21 19:09 ` Taylor Blau
@ 2023-02-21 22:27   ` Emily Shaffer
  0 siblings, 0 replies; 37+ messages in thread
From: Emily Shaffer @ 2023-02-21 22:27 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Git List, Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

On Tue, Feb 21, 2023 at 11:10 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Fri, Feb 17, 2023 at 01:12:23PM -0800, Emily Shaffer wrote:
> > This turned out pretty long-winded, so a quick summary before I dive in:
> >
> > - We want to compile parts of Git as independent libraries
> > - We want to do it by making incremental code quality improvements to Git
> > - Let's avoid promising stability of the interfaces of those libraries
> > - We think it'll let Git do cool stuff like unit tests and allowing
> >   purpose-built plugins
> > - Hopefully by example we can convince the rest of the project to join
> >   in the effort
>
> Like others, I am less interested in the VFS-specific components you
> mention here, but I suspect that is just one particular instance of
> something that would be benefited by making git internals exposed via a
> linkable library.
>
> I don't have objections to things like reducing our usage of globals,
> making fewer internal functions die() when they encounter an error, and
> so on. But like Junio, I suspect that this is definitely an instance of
> a "devil's in the details" kind of problem.
>
> That's definitely my main concern: that this turns out to be much more
> complicated than imagined and that we leave the codebase in a worse
> state without much to show.

Yeah, I'm really hoping we don't end up with ugly half-changes too.
Some examples of "partial credit" that I'd be happy with:

- Fewer internal libraries relying on globals like
the_repository/the_index/etc (we've already started this effort,
libification or no)
- An "ugly" library interface becoming clearer and easier to use (and
internal callers updated)
- Figuring out an "error reporting type" that works well for us

There are some things that *are* ugly, for example, calling a library
via a vtable. But I do feel comfortable waiting to introduce that kind
of thing until we really need it, at which point I suspect we'll have
already made some successful strides with libification in general.

It's not so great to just trust me to say "I promise not to make ugly
changes" - I'd appreciate the community's help pushing back if we
propose doing something in an untidy way without clear justification.

> A lesser version of that outcome would be
> that we cause a lot of churn in the tree with not much to show either.

I'm actually not so concerned about this! The "churn", as I see it,
comes in the form of code cleanup that already makes Git more
understandable for Git hackers. We do spend some time on that now, as
a project, but I wouldn't be unhappy if we spent even more :)

>
> So I think we'd want to see some more concrete examples with clear
> benefits to gauge whether this is a worthwhile direction. I think that
> strbuf.h is too trivial an example to demonstrate anything useful. Being
> able to extract config.h into its own library so that another non-Git
> program could link against it and implement 'git config'-like
> functionality would be much more interesting.

Sure - I'm also looking forward to seeing it.

Thanks for your thoughtful reply.
 - Emily

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-21 21:42   ` Emily Shaffer
@ 2023-02-22  0:22     ` Junio C Hamano
  0 siblings, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2023-02-22  0:22 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Git List, Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

Emily Shaffer <nasamuffin@google.com> writes:

>> For example, "die() is inconvenient to callers, let's propagate
>> errors up the callchain" is an easy thing to say, but it would take
>> much more than "let's propagate errors up" to libify something like
>> check_connected() to do the same thing without spawning a separate
>> process that is expected to exit with failure.
>
> Because the error propagation path is complicated, you mean? Or
> because the cleanup is painful?

Both.

The amount of data the caller may want to learn about an error may
not be uneven, depending on the caller even for a single function.
And yes, cleaning up of shared resources like object flag bits after
a traversal, especially a failed one, would be very painful unless
the processing is designed from day one to allow it (and the
revision traversal codepath is not).

> ... is it reasonable to consider, for example, further
> cleanup of cache.h as part of this effort? Is it reasonable to rework
> an ugly circular dependency between two headers as a prerequisite to
> doing library work around one of them?

I am not sure about which two headers you are talking about, but if
there is circular dependency that can be untangled, it would be a
reasonable preliminary clean-up work.  I am not sure if that is
"prerequisite"---it is up to folks who want to design how the
"libification" work goes.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 21:12 Proposal/Discussion: Turning parts of Git into libraries Emily Shaffer
                   ` (3 preceding siblings ...)
  2023-02-21 19:09 ` Taylor Blau
@ 2023-02-22  1:44 ` Victoria Dye
  2023-02-25  1:48   ` Jonathan Tan
  2023-02-22 14:55 ` Derrick Stolee
  2023-03-23 23:37 ` Felipe Contreras
  6 siblings, 1 reply; 37+ messages in thread
From: Victoria Dye @ 2023-02-22  1:44 UTC (permalink / raw)
  To: Emily Shaffer, Git List; +Cc: Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

Emily Shaffer wrote:
> Hi folks,
> 
> As I mentioned in standup this week[1], my colleagues and I at Google
> have become very interested in converting parts of Git into libraries
> usable by external programs. In other words, for some modules which
> already have clear boundaries inside of Git - like config.[ch],
> strbuf.[ch], etc. - we want to remove some implicit dependencies, like
> references to globals, and make explicit other dependencies, like
> references to other modules within Git. Eventually, we'd like both for
> an external program to use Git libraries within its own process, and
> for Git to be given an alternative implementation of a library it uses
> internally (like a plugin at runtime).

I see why you've linked these two objectives (there could be some shared
infrastructure between them), but they're ultimately separate concerns, with
separate needs & tradeoffs. By trying to design with a unified solution for
both, though, this proposal's solution may be broader than it needs to be
and comes with some serious developer experience ramifications.

For example, looking at the  "call libgit.so from other applications"
objective: one way I could imagine implementing this is by creating a stable
API wrapper around various hand-picked internal functions. If those internal
functions change (e.g., adding a function argument to handle a new flag in a
builtin), the API can remain stable (by setting a sane default) without
really limiting the contributor adding a new feature. 

However, my reading of this proposal (although correct me if I'm wrong) is
that you want to use that same external-facing API as a plugin interface for
Git. In that case, the contributor can't just add a new argument and
trivially modify the wrapper; they either break the API contract, or create
a new version of the API and continue to support the old one. While that's
technically doable, it's a pretty sizeable increase to the amount of "stuff"
contributors need to keep track of and limits the project's flexibility. 

> 
> This turned out pretty long-winded, so a quick summary before I dive in:
> 
> - We want to compile parts of Git as independent libraries
> - We want to do it by making incremental code quality improvements to Git
> - Let's avoid promising stability of the interfaces of those libraries
> - We think it'll let Git do cool stuff like unit tests and allowing
> purpose-built plugins
> - Hopefully by example we can convince the rest of the project to join
> in the effort
> 
> My team has spent the past year or so trying to make improvements to
> Git's behavior with submodules, and we found that the current
> structure of Git is quite tricky to work with: because Git doesn't
> execute on second repositories in the same process well, recursing
> into submodules typically involves spawning child processes, and
> piping new arguments through the helpers around those child processes,
> and then through Git's typical codepaths, is very tricky. After
> spending more than a year trying to make improvements, we have very
> little to show for it, largely as a result of the difficulty of
> passing information between superprojects and submodules.
> 
> It seems like being able to invoke parts of Git as a library, or Git
> being able to invoke custom libraries, does a lot of good for the Git
> project:
> 
> - Having clear, modular libraries makes it easy to find the code
> responsible for some behavior, or determine where to add something
> new.

This is more an issue of code & repository organization, and isn't really an
inherent outcome of lib-ifying Git. That said, like the cleanup you mention
later, I would not be opposed to a dedicated code reorganization proposal.

> - Operations recursing into submodules can be run in-process, and the
> callsite makes it very clear which repository context is being used;
> if we're not subprocessing to recurse, then we can lose the awkward
> git-submodule.sh -> builtin/submodule__helper.c -> top-level Git
> process codepath.

Interesting - I wasn't aware this was how submodules worked internally. Is
there a specific reason this can't be refactored to perform the recursion
in-process?

> - Being able to test libraries in isolation via unit tests or mocks
> speeds up determining the root cause of bugs.

The addition of unit tests is a project of its own, and one I don't believe
is intrinsically tied to this one. More on this later.

> - If we can swap out an entire library, or just a single function of
> one, it's easy to experiment with the entire codebase without sweeping
> changes.

What sort of "experiment" are you thinking of here? Anyone can make internal
changes to Git in their own clone of the repository, then build and test it.
Shipping a custom fork of Git doesn't seem any less complex than building a
plugin library, shipping that, and injecting it into an existing Git
install.

> 
> The ability to use Git as a library also makes it easier for other
> tooling to interact with a Git repository *correctly*. As an example,
> `repo` has a long history of abusing Git by directly manipulating the
> gitdir[2], but if it were written in a world where Git exists as
> easy-to-use libraries, it probably wouldn't have needed to, as it
> could have invoked Git directly or replaced the relevant modules with
> its own implementation. Both `repo`[3] and `git-gui[4]` have
> reimplemented logic from git.git. Other interfaces that cooperate with
> Git's filesystem storage layer, like `scalar` or `jj`[5], would be
> able to interop with a Git repository without having to reimplement
> custom logic or keep up with new Git changes.

I'd alternatively suggest that these use cases could be addressed with
better plumbing command support and better (public) documentation of our
on-disk data structures. In many cases, we're treating on-disk data
structures as external-facing APIs anyway - the index [1], packfiles [2],
etc. are versioned. We could definitely add documentation that's more 
directly targeted at external integrators.

[1] https://git-scm.com/docs/index-format
[2] https://git-scm.com/docs/gitformat-pack

> 
> Of course, there's a reason Google wants it, too. We've talked
> previously about wanting better integration between Git and something
> like a VFS; as we've experimented with it internally, we've found a
> couple of tricky areas:
> 
> - The VFS relies on running Git commands to determine the state of the
> repository. However, these Git commands interact with the gitdir or
> worktree, which is populated by the VFS. For example, parsing a
> .gitattributes or .gitmodules which is already stored in the VFS
> requires the VFS to provide a POSIX file handle, spawn a Git
> subprocess, populate other files needed by that subprocess (like
> .git/config), and finally collect the output stream of the subprocess.
> As you can imagine, this interaction of VFS -> Git -> VFS [-> Git]
> creates all sort of complications. The alternative is for the VFS to
> write its own parser (or use a library like libgit2; more on that
> later). But having a Git library means that a subset of Git
> functionality can happen in-process, and that filesystem access could
> be replaced by the VFS directly providing high-level objects or plain
> bytestreams.
> 
> - A user running `git status` in a directory controlled by the VFS
> will require the VFS to populate the entire (large) worktree - even
> though the VFS is sure that only one file has been modified. The
> closest we've come with an alternative is an orchestrated use of
> sparse-checkout - but modifying the sparse-checkout configs
> automatically in response to the user's filesystem operations takes us
> right back to the previous point. If Git could take a plugin and
> replacement for the object store that directly interacts with the VFS
> daemon, a layer of complexity would disappear and performance would
> improve.

Both of these points sound more like implementation details/quirks of the
VFS prototype you're building than hard blockers coming from Git. For
example, you mention 'git status' needing to populate a full worktree -
could it instead use a mechanism similar to FSMonitor to avoid scanning for
files you know are virtualized? While there'd likely be some Git changes
involved (some of which might be a bit plugin-y), they'd be much more
tightly scoped than full libification.

> The good news is that for practical near-term purposes, "libification"
> mostly means cleanups to the Git codebase, and continuing code health
> work that the project has already cared about doing:
> 
> - Removing references to global variables and instead piping them
> through arguments
> - Finding and fixing memory leaks, especially in widely-used low-level code
> - Reducing or removing `die()` invocations in low-level code, and
> instead reporting errors back to callers in a consistent way
> - Clarifying the scope and layering of existing modules, for example
> by moving single-use helpers from the shared module's scope into the
> single user's scope
> - Making module interfaces more consistent and easier to understand,
> including moving "private" functions out of headers and into source
> files and improving in-header documentation
> 
> Basically, if this effort turns out not to be fruitful as a whole, I'd
> like for us to still have left a positive impact on the codebase.

These cleanups/best practices are a great idea regardless of any potential
lib-ification. I'll be sure to keep them in mind in any future changes I
make or review.

> 
> In the longer term, if Git has libraries with easily-replaced
> dependencies, we get a few more benefits:
> 
> - Unit tests. We already have some in t/helper/, but if we can replace
> all the dependencies of a certain library with simple stubs, it's
> easier for us to write comprehensive unit tests, in addition to the
> work we already do introducing edge cases in bash integration tests.

This doesn't really follow as a direct consequence of making libgit
externally facing. AFAIK, there's nothing explicitly stopping us from
writing or integrating a unit test framework now.

Like the "make libgit public" vs. "allow for custom backends in Git", I
think this is a separate project with its own tradeoffs to evaluate.

> - If our users can use plugins to improve performance in specific
> scenarios (like a VFS-aware object store in the VFS case I cited
> above), then Git works better for them without having to adopt a
> different workflow, such as using an alternative tool or wrapper.

I'm curious as to what you want the user experience for this to look like.
How would Git know to dynamically load a plugin? How does a user configure
(or disable) plugins? Will a plugin need to replace all of the functions in
a given library, or would Git be able to "fall back" on its own internal
implementation?

> - An easy-to-understand modular codebase makes it easy for new
> contributors to start hacking and understand the consequences of their
> patch.

As noted earlier, I think improving the developer experience in this way is
independent of the development of external APIs.

> 
> Of course, we haven't maintained any guarantee about the consistency
> of our implementation between releases. I don't anticipate that we'll
> write the perfect library interface on our first try. So I hope that
> we can be very explicit about refusing to provide any compatibility
> guarantee whatsoever between versions for quite a long time. On
> Google's end, that's well-understood and accepted. As I understand,
> some other projects already use Git's codebase as a "library" by
> including it as a submodule and using the code directly[6]; even a
> breakable API seems like an improvement over that, too.

This hints at, but sidesteps, a really important aspect of the long-term
goals of this project - are you planning on having us start guaranteeing
consistency once there's an external-facing API available? 

If not, that sounds like an unpleasant user experience (and one prone to
Hyrum's Law [3] at that). 

If so, Git contributors will either be much more constrained in the
introduction of new features, or we'll end up with a mess of
backward-compatibility APIs.

[3] https://www.hyrumslaw.com/

> 
> So what's next? Naturally, I'm looking forward to a spirited
> discussion about this topic - I'd like to know which concerns haven't
> been addressed and figure out whether we can find a way around them,
> and generally build awareness of this effort with the community.
> 
> I'm also planning to send a proposal for a document full of "best
> practices" for turning Git code into libraries (and have quite a lot
> of discussion around that document, too). My hope is that we can use
> that document to help us during implementation as well as during
> review, and refine it over time as we learn more about what works and
> what doesn't. Having this kind of documentation will make it easy for
> others to join us in moving Git's codebase towards a clean set of
> libraries. I hope that, as a project, we can settle on some tenets
> that we all agree would make Git nicer.

I think the use of multiple libraries is part of the potentially suboptimal
balance between the "load Git's internals as a library" and "let Git use
plugins in place of its own implementations" goals. The former could leave a
user in dependency hell if there are lots of small libraries to load, while
the latter may work better with smaller-scoped libraries (to avoid needing
to implement more than is useful). And, the functionality that's useful to a
program invoking Git via library may not be the same as what someone would
want to replace via plugin. 

> 
> From the rest of my own team, we're planning on working first on some
> limited scope, low-level libraries so that we can all see how the
> process works. We're starting with strbuf.[ch] (as it's used
> everywhere with few or no dependencies and helps us guarantee string
> safety at API boundaries), config.[ch] (as many external tools are
> probably interested in parsing Git config formatted files directly),
> and a subset of operations related to the object store. These starting
> points are intended to have a small impact on the codebase and teach
> us a lot about logistics and best practices while doing these kinds of
> conversions.
> 
> After that, we're still hoping to target low-level libraries first - I
> certainly don't think it will make sense to ship a high-level `git
> commit` library in the near future, if ever - in the order that
> they're required from the VFS project we're working closely with. As
> far as I can tell right now, that's likely to cover object store and
> worktree access, as well as commit creation and pushing, but we'll see
> how planning shakes out over the next month or so. But Google's
> schedule should have no bearing on what others in the Git project feel
> is important to clean up and libify, and if there is interest in the
> rest of the project in converting other existing modules into
> libraries, my team and I are excited to participate in the review.

As a general note, this libification project - its justification,
milestones, organization, etc. - seems to be primarily driven by the needs
of a VFS that is entirely opaque to the Git community you're proposing to.
As it stands, reviewers are put in a position of needing to accept, without
much evidence, that this is the best (or only) possible approach for meeting
those needs. 

While I'm normally content with "scratch your own itch"-type changes, what
you're proposing is a major paradigm shift in how Git is written,
maintained, and used. I'm not comfortable accepting that level of impact
without at least being able to evaluate whether the problem can be solved
some other way.

> 
> Much, much later on, I'm expecting us to form a plan around allowing
> "plugins" - that is, replacing library functionality we use today with
> an alternative library, such as an object store relying on a
> distributed file store like S3. Making that work well will also likely
> involve us coming up with a solution for dependency injection, and to
> begin using vtables for some libraries. I'm hoping that we can figure
> out a way to do that that won't make the Git source ugly. Around this
> time, I think it will make sense to buy into unit tests even more and
> start using an approach like mocking to test various edge cases. And
> at some point, it's likely that we'll want to make the interfaces to
> various Git libraries consistent with each other, which would involve
> some large-scale but hopefully-mechanical refactors.

Implementing dependency injection (via vtable or otherwise) and unit test
mocking is a massive undertaking on its own, let alone everything else
described so far. You've outlined some clear, fairly unobtrusive first steps
to the overarching proposal, but there's a lot of detail missing from the
plans for later steps. While there's always risk of a project getting
derailed or blocked later on due to some unforeseen issue, that risk seems
particularly high here given the size & scope of all of these components. 

Intermediate milestones/goals, each of which is valuable on its own,
outlined in the proposal would help allay those fears (for me, at least).

> 
> I'm looking forward to the discussion!
> 

So, to summarize my thoughts into some (hopefully) actionable feedback:

- It's really important that the proposal presents a clear, concrete
  long-term vision for this project. 
  - What is the desired end state, in terms of what is built and installed
    with Git?
  - What will be expected of other Git contributors to support this design?
    What happens if someone wants to "break" an API?
  - What is the impact to users (including security implications)?
- The scope for what you've proposed is pretty huge (which comes with a lot
  of risk), but I think it could be broken into smaller, *independent*
  pieces. 
- It's hard to judge or suggest adjustments to this proposal without knowing
  the specific challenges you're facing with Git as it is.

Finally, thanks for sending this email & starting a discussion. It's an
interesting topic and I'm looking forward to seeing everyone's perspectives
on the matter.

>  - Emily
> 
> 1: https://colabti.org/irclogger/irclogger_log/git-devel?date=2023-02-13#l29
> 2: https://gerrit.googlesource.com/git-repo/+/refs/heads/main/docs/internal-fs-layout.md
> 3: https://gerrit.googlesource.com/git-repo/+/refs/heads/main/git_config.py
> 4: https://github.com/git/git/blob/master/git-gui/git-gui.sh#L305
> 5: https://github.com/martinvonz/jj
> 6: https://github.com/glandium/git-cinnabar

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-21 22:06   ` Emily Shaffer
@ 2023-02-22  8:23     ` Elijah Newren
  2023-02-22 19:25     ` Jeff King
  1 sibling, 0 replies; 37+ messages in thread
From: Elijah Newren @ 2023-02-22  8:23 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Git List, Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

On Tue, Feb 21, 2023 at 2:07 PM Emily Shaffer <nasamuffin@google.com> wrote:
>
> On Fri, Feb 17, 2023 at 8:05 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Fri, Feb 17, 2023 at 1:45 PM Emily Shaffer <nasamuffin@google.com> wrote:
[...]
> > > So what's next? Naturally, I'm looking forward to a spirited
> > > discussion about this topic - I'd like to know which concerns haven't
> > > been addressed and figure out whether we can find a way around them,
> > > and generally build awareness of this effort with the community.
> >
> > I'm curious whether clarifying scope/layering and cleaning up
> > interfaces might mean you'd be interested in things like:
> >   * https://github.com/newren/git/commits/header-cleanups (which was
> > still WIP; I paused working on it because I figured people would see
> > it as big "cleanup" patches with no practical benefit)
> >   * https://github.com/gitgitgadget/git/pull/1149 (which has been
> > ready to submit for a _long_ time, but I just haven't yet)
> > or if these two things are orthogonal to what you have in mind.
>
> Extremely yes. :) Even "small" stuff like the need for header cleanups
> have already come up for Glen and Calvin working on config and strbuf.

Ok, I'll clean up what I've got and submit.

[...]
> > > Much, much later on, I'm expecting us to form a plan around allowing
> > > "plugins" - that is, replacing library functionality we use today with
> > > an alternative library, such as an object store relying on a
> > > distributed file store like S3. Making that work well will also likely
> > > involve us coming up with a solution for dependency injection, and to
> > > begin using vtables for some libraries. I'm hoping that we can figure
> > > out a way to do that that won't make the Git source ugly. Around this
> > > time, I think it will make sense to buy into unit tests even more and
> > > start using an approach like mocking to test various edge cases. And
> > > at some point, it's likely that we'll want to make the interfaces to
> > > various Git libraries consistent with each other, which would involve
> > > some large-scale but hopefully-mechanical refactors.
> >
> > Would these plugins resemble the pluggable merge backends that was
> > added to builtin/merge.c?  Would it replace that mechanism with a
> > different one?  Would it be more like the refs backends?
>
> I suspect that it's likely to be most similar to the refs backend
> replacement, although I investigated it only a little bit just now.
>
> The pluggable merge backends are an interesting thought - right now
> all those alternatives are built in and we decide based on config,
> right?

While we have several built in merge backends (recursive, resolve,
ort, octopus, ours, subtree), it is not limited to built-in
alternatives.  If someone creates a "git-merge-$STRATEGY" executable
and runs `git merge -s $STRATEGY` then builtin/merge.c will fork their
subcommand to try to resolve the merge.  (Users can even specify
something like `-s $STRATEGY1 -s $STRATEGY2` to have git find the best
strategy to use.)  The subcommand is then expected to update the
working tree and index and return an exit status signalling whether
the merge was clean.  It's very much built around assuming you
currently have a branch checked out and you want to merge into that
branch.

We do not know how widely used this feature is, but it's kept us from
fixing some API mistakes.  For example, a user strategy can return an
exit status of 2 signalling that it cannot even consider the merge in
question (e.g. most strategies cannot handle octopus merges).
However, the merge strategies are allowed to muck with the working
directory and index and leave them in a totally messed up state prior
to returning the "2" signalling that the given merge strategy is
inappropriate and another should be selected instead.  That means
builtin/merge.c is required after calling any given strategy and if it
does not succeed, go and "clean everything up".

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 21:12 Proposal/Discussion: Turning parts of Git into libraries Emily Shaffer
                   ` (4 preceding siblings ...)
  2023-02-22  1:44 ` Victoria Dye
@ 2023-02-22 14:55 ` Derrick Stolee
  2023-02-24 21:06   ` Emily Shaffer
  2023-03-23 23:37 ` Felipe Contreras
  6 siblings, 1 reply; 37+ messages in thread
From: Derrick Stolee @ 2023-02-22 14:55 UTC (permalink / raw)
  To: Emily Shaffer, Git List; +Cc: Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

On 2/17/2023 4:12 PM, Emily Shaffer wrote:

> This turned out pretty long-winded, so a quick summary before I dive in:
>
> - We want to compile parts of Git as independent libraries
> - We want to do it by making incremental code quality improvements to Git
> - Let's avoid promising stability of the interfaces of those libraries
> - We think it'll let Git do cool stuff like unit tests and allowing
> purpose-built plugins
> - Hopefully by example we can convince the rest of the project to join
> in the effort

As I was thinking about this proposal, I realized that my biggest concern
is that we are jumping to the "what" and "how" without fully exploring the
"why"s, so I've reorganized some of your points along those lines, so I
can respond to each in turn.

## Why: Modular code

> - Having clear, modular libraries makes it easy to find the code
> responsible for some behavior, or determine where to add something
> new.

> - An easy-to-understand modular codebase makes it easy for new
> contributors to start hacking and understand the consequences of their
> patch.

Generally, having a modular codebase does not guarantee that things are
easy to find or that consequences are understood. The correct abstractions
are key in this, as well as developing boundaries that do not leak into
each other.

Unless done correctly, we can have significant issues with how things are
communicated through layers of abstraction. We already have that problem
when we want to add a new option to a builtin and need to modify three or
four method prototypes in order to communicate a change of behavior to the
right layer.

I've also found that where we have abstractions, such as the refs code or
the transport code, that the indirection created by the vtables is more
confusing to discover by reading the code. I've needed the use of a
debugger to even discover the call stack for certain operations.

> - If we can swap out an entire library, or just a single function of
> one, it's easy to experiment with the entire codebase without sweeping
> changes.

I don't understand this one too much. We already have this ability through
forking the repository and mutating the code to do different things. Why
do we need modules (with vtables and all the rest of the pain that goes
into it) to make such swaps?

This only really matters in the fullness of the libification effort: if
everything is mocked, then subsystems can be replaced wholesale. That's a
lot of work just to avoid experimenting with a fork.

...

You said this about how to achieve modular code:

> - Removing references to global variables and instead piping them
> through arguments
> - Finding and fixing memory leaks, especially in widely-used low-level code
> - Reducing or removing `die()` invocations in low-level code, and
> instead reporting errors back to callers in a consistent way
...
> - Making module interfaces more consistent and easier to understand,
> including moving "private" functions out of headers and into source
> files and improving in-header documentation

I think these mechanisms should be universally welcomed. One thing that is
tricky is how we can guarantee that an API can be marked as "done" and
maintained that way in perpetuity. (Unit tests may help with that, so
let's come back to this then.)

> - Clarifying the scope and layering of existing modules, for example
> by moving single-use helpers from the shared module's scope into the
> single user's scope

This one is a bit murky to me. It sounds like that if there is only one
caller to a strbuf_*() method that it should be made 'static' inside the
caller instead of global in strbuf.h for use by future callers. Making
this a goal in itself is probably more likely to see methods moving around
more frequently due to the frequency of their use, rather than natural
groupings based on which data structures are being mutated.

What measurement would we optimize for? How could we maintain such
standards as we go? The Git codebase doesn't really have a firm
"architecture" other than "builtin code calls libgit.a code, which calls
other libgit.a code as necessary". There aren't really strong layers
between things. Everything assumes it can look up an object or load
config. Are there organizational things that we can do in this space that
would help decoupling things before jumping to libification?

## Why: Submodules

> - Operations recursing into submodules can be run in-process, and the
> callsite makes it very clear which repository context is being used;
> if we're not subprocessing to recurse, then we can lose the awkward
> git-submodule.sh -> builtin/submodule__helper.c -> top-level Git
> process codepath.

Previously, there was an effort to replace dependencies on the_repository
in favor of repository structs being passed through method parameters.
This seems to have fallen off of the priority list, and prevous APIs that
were converted have regressed in their dependencies.

Should we consider restarting that effort as an early goal? Should a
repository struct be the "god object" that also includes the default
implemenations of these modules until the modules can be teased apart and
no longer care about the entire repository?

## Why: Unit testing

> - Unit tests. We already have some in t/helper/, but if we can replace
> all the dependencies of a certain library with simple stubs, it's
> easier for us to write comprehensive unit tests, in addition to the
> work we already do introducing edge cases in bash integration tests.

Unit tests are very valuable in keeping the codebase stable and reducing
the chance that code was mutated by accident. This is achieved by very
rigid test infrastructure, requiring _replacing_ methods with mocks in
careful ways in order to test that behavior matches expectation. The
"simple stubs" are actually carefully crafted to verify their inputs and
provide a carefully selected return value.

The difficulty of writing unit tests (or mutating the code after writing
unit tests) is balanced by the ability to more explicitly create error
conditions that are unlikely to occur in real test cases. Cases such as
connection errors, file I/O problems, or other unexpected responses from
the mocked methods, are all relatively difficult to check via end-to-end
tests.

I've personally found that the hardest unit test to write is the _first_
one, and after that the rest are simpler. The difficulty arises in making
the code testable in the first place, as well as creating infrastructure
for the most-common unit test scenarios. My expectation is that it will
take significant changes to the Git codebase to make any non-trivial unit
tests outside of these very isolated cases that are presented here: strbuf
manipulation is easy to unit-test and config methods operating on
config_set structs should be similar. However, config methods that
interact with a repository and its config file(s) will be much harder to
test unless we start mocking filesystem interactions. We could create
test-tool helpers that load a full repository and check the config files
present on the filesystem, but now we're talking about integration tests
instead of unit tests.

> - Being able to test libraries in isolation via unit tests or mocks
> speeds up determining the root cause of bugs.

I'm not sure I completely agree with this statement. Unit tests can help
prevent introducing a new fault when mutating code to adjust for a bug,
but unit tests only guarantee expected behavior on the expected
preconditions. None of this necessarily helps finding a root cause,
especially in the likely event that the root cause is due to combining
units and thus not covered by unit tests.

## Why: Correct use

> The ability to use Git as a library also makes it easier for other
> tooling to interact with a Git repository *correctly*.

Is there is a correct way to interact with a Git repository? We definitely
prefer that the repository is updated by executing Git processes, leaving
all internals up to Git. Libification thus has a purpose for scenarios
that do not have an appropriate Git builtin or where the process startup
time is too expensive for that use.

However, would it not be preferrable to update Git to include these use
cases in the form of new builtin operations? Even in the case where
process startup is expensive, operations can be batched (as in 'git
cat-file --batch').

> Of course, we haven't maintained any guarantee about the consistency
> of our implementation between releases. I don't anticipate that we'll
> write the perfect library interface on our first try. So I hope that
> we can be very explicit about refusing to provide any compatibility
> guarantee whatsoever between versions for quite a long time. On
> Google's end, that's well-understood and accepted. As I understand,
> some other projects already use Git's codebase as a "library" by
> including it as a submodule and using the code directly[6]; even a
> breakable API seems like an improvement over that, too.

One thing we _do_ prioritize is the CLI as being backwards-compatible as
possible. We already have that interface as a stable one that can be
depended upon, even when the Git executable is built from a fork with
different implementations of subsystems (or operating differently in the
presence of custom config).

## Why: Virtual Filesystem Support

> Of course, there's a reason Google wants it, too. We've talked
> previously about wanting better integration between Git and something
> like a VFS; as we've experimented with it internally, we've found a
> couple of tricky areas:
>
> - The VFS relies on running Git commands to determine the state of the
> repository. However, these Git commands interact with the gitdir or
> worktree, which is populated by the VFS.

The way VFS for Git solved this was to not virtualize the gitdir and to
pass immediately to the filesystem if a worktree file was already
"hydrated". Of course, an early version was virtualizing the gitdir,
including faking that every possible loose object was present and could be
found via a network call, but the same issues you are bringing up now were
blockers for that approach.

> - A user running `git status` in a directory controlled by the VFS
> will require the VFS to populate the entire (large) worktree - even
> though the VFS is sure that only one file has been modified. The
> closest we've come with an alternative is an orchestrated use of
> sparse-checkout - but modifying the sparse-checkout configs
> automatically in response to the user's filesystem operations takes us
> right back to the previous point. If Git could take a plugin and
> replacement for the object store that directly interacts with the VFS
> daemon, a layer of complexity would disappear and performance would
> improve.

This is another case where the issue is that Git isn't aware that it is
operating in a virtual environment and doesn't speak to the virtualization
system directly. Git could talk to the virtualization layer as if it was a
filesystem monitor, and that would prevent a significant amount of these
changes. Preventing 'git checkout' from writing files to the worktree also
requires some coordination. The virtualization layer needs a signal that
it will need to update its projection of the worktree (the
post-index-change hook can do this)  and the Git process needs a way to
mark the index with skip-worktree bits for the changed files (while
keeping the bits off for files that were previously hydrated and not
changed by the index update).

In this sense, we already have _some_ of the pluggability (through hooks)
and could extend that ability more either via more hooks or by making Git
itself aware that it's in a virtual filesystem. This pluggability could be
extended by using pipe-based communication like the builtin FS Monitor,
except that the communication is a new protocol that can speak to an
arbitrary implementation on the other side.

I've said before that the goal of using git.git with a virtual filesystem
(as-is, no custom bits) is unlikely to succeed _unless_ there are changes
contributed to git.git to make it aware of a "filesystem virtualizer". I
also don't think that inserting plugins is the right way to solve for
this. Users will want to use the Git CLI on their PATH for both virtual
and non-virtual repositories, so the distinction between them needs to
happen at runtime, likely via Git config, hooks, or protocols.

## How to achieve these goals

> I'm also planning to send a proposal for a document full of "best
> practices" for turning Git code into libraries (and have quite a lot
> of discussion around that document, too). My hope is that we can use
> that document to help us during implementation as well as during
> review, and refine it over time as we learn more about what works and
> what doesn't. Having this kind of documentation will make it easy for
> others to join us in moving Git's codebase towards a clean set of
> libraries. I hope that, as a project, we can settle on some tenets
> that we all agree would make Git nicer.

I like the idea of a "best practices" document, but I would hesitate to
focus on the libification and instead aim for the high-value items such
as lack of globals and reduced memory leaks. How do we write such code?
How do we write (unit?) tests that guarantee those properties will be
maintained into the future?

> From the rest of my own team, we're planning on working first on some
> limited scope, low-level libraries so that we can all see how the
> process works. We're starting with strbuf.[ch] (as it's used
> everywhere with few or no dependencies and helps us guarantee string
> safety at API boundaries), config.[ch] (as many external tools are
> probably interested in parsing Git config formatted files directly),
> and a subset of operations related to the object store. These starting
> points are intended to have a small impact on the codebase and teach
> us a lot about logistics and best practices while doing these kinds of
> conversions.

I can see that making a library version of config.h could be helpful to
third parties wanting to read Git config. In particular, I know that Git
Credential Manager runs multiple 'git config' calls to get multiple values
out. Perhaps time would be better spent creating a 'git config --batch'
mode so these consumers could call one process, pass a list of config keys
as input, and get a list of key-value pairs out? (Use a '-z' mode to allow
newlines in the values.)

However, I'm not seeing value to anyone else to have strbuf.h available
externally. I'm also not seeing any value to the Git project by having
either of these available as a library. I can only see increased developer
friction due to the new restrictions on making changes in these areas.

If this first effort is instead swapped to say "we plan on introducing
unit tests for these libraries" then I think that is an easier thing to
support.

It happens that these efforts will make it easier to make the jump to
providing APIs as a library. Not only are unit tests _necessary_ to
support libification, they also make a smaller lift. Unit tests will
already create more friction when changing the API, since either new tests
must be written or old tests must be modified.

### Summary

Generally, making architectural change is difficult. In addition to the
amount of code that needs to be moved, adjusted, and tested in new ways,
there is a cultural element that needs to be adjusted. Expecting Git's
code to be used as a library is a fundamentally different use case than we
have supported in the past, and most of our developer guidelines reflect
that. We don't hesitate to modify APIs. We prefer end-to-end tests over
unit tests whenever possible.

There are focused bits of your proposal that I think will be universally
welcomed, and I think the best course of action in the short term is to
demonstrate improvements in those areas:

 * Reduced use of globals.
 * Reduced memory leaks.
 * Use error responses instead of die()s within low-level code.
 * Update header files to have appropriate visibility and documentation.

The things that we need to really focus on are how we can measure progress
in these areas as well as how can we prevent regression in the future.
Unit tests are one way to do this (especially with leak detection), but
also documentation and coding guidelines need to be updated based on the
new patterns discovered in this process.

I mentioned the_repository earlier as a partially-complete architectural
change. I think the_index compatibility macros are in a similar boat of
being half-completed (though, the last time I tried to remove these macros
I was told that it wasn't a goal to remove the macros from builtins). The
sparse-index's command_requires_full_index checks are also in this boat,
so I understand the difficulty of getting a big change 100% complete (and
in this case, the plan is to do at least one more GSoC project in the area
to see what progress can be made that way).

We should be careful in committing to a very long-term direction without
first delivering value on a shorter timeline.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-21 22:06   ` Emily Shaffer
  2023-02-22  8:23     ` Elijah Newren
@ 2023-02-22 19:25     ` Jeff King
  1 sibling, 0 replies; 37+ messages in thread
From: Jeff King @ 2023-02-22 19:25 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: Elijah Newren, Git List, Jonathan Nieder, Jose Lopes,
	Aleksandr Mikhailov

On Tue, Feb 21, 2023 at 02:06:55PM -0800, Emily Shaffer wrote:

> > Does removing memory leaks also mean converting UNLEAK to free()?
> 
> I suspect so - as I understand it, UNLEAK is a macro that resolves to
> "don't complain to me, compiler, I meant not to free it."

Correct. It is supposed to be used sparingly at the outermost level to
say "I'm about to exit, so yes, we are leaking this, but no, it does not
matter".

So...

> > Thinking of things in a library context probably pushes us in that
> > direction (though, alternatively, it might just highlight the question
> > of what is considered "low-level" instead).
> 
> I'm not sure whether use of UNLEAK has so much to do with "low-level"
> or not. In cases when Git is being called as an ephemeral single-run
> process, UNLEAK makes a lot of sense. In cases when Git is being
> called in a long-lived process, UNLEAK is just a sign that says
> "there's a leak here".  So I think the distinction is not low-level or
> high-level, but more simply, within a library or not.

I'd take "low-level" here to mean "far down in the call stack". That is,
code which is called potentially from a lot of places, and can't know
what is going to happen afterwards.

In that case, such code calling UNLEAK() is already doing the wrong
thing. And such code is a likely candidate for being called in a
lib-ified long-running process, which means that ignoring the leaks is
likely to be more noticeable. :)

There are probably cases where code that is currently high-level becomes
more low-level, and will need to be adapted. For example, if cmd_diff()
has a static-local helper function for "diff these two blobs", and it
knows it will run it exactly once, it is OK to UNLEAK() from there now.
But that may be a reasonable API to expose more widely, at which point
it needs to stop UNLEAK()-ing and really free.

Just my two cents as the originator of UNLEAK(). :)

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 22:49       ` Emily Shaffer
@ 2023-02-22 19:34         ` Jeff King
  2023-02-24 20:31           ` Emily Shaffer
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2023-02-22 19:34 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: brian m. carlson, Git List, Jonathan Nieder, Jose Lopes,
	Aleksandr Mikhailov

On Fri, Feb 17, 2023 at 02:49:51PM -0800, Emily Shaffer wrote:

> > Personally, I'd like to see some sort of standard error type (whether
> > integral or not) that would let us do more bubbling up of errors and
> > less die().  I don't know if that's in the cards, but I thought I'd
> > suggest it in case other folks are interested.
> 
> Yes!!! We have talked about this a lot internally - but this is one
> thing that will be difficult to introduce into Git without making
> parts of the codebase a little uglier. Since obviously C doesn't have
> an intrinsic to do this, we'll have to roll our own, which means that
> manipulating it consistently at function exits might end up pretty
> ugly. So hearing that there's interest outside of my team to come up
> with such a type makes me optimistic that we can figure out a
> neat-enough solution.

Here are some past discussions on what I thought would be a good
approach to error handling. The basic idea is to replace the "pass a
strbuf that people shove error messages into" approach with an error
context struct that has a callback. And that callback can then stuff
them into a strbuf, or report them directly, or even die.

This thread sketches out the idea, though sadly I no longer have the
more fleshed-out patches I mentioned there:

  https://lore.kernel.org/git/20160927191955.mympqgylrxhkp24n@sigill.intra.peff.net/

And then the sub-thread starting here discusses a similar approach:

  https://lore.kernel.org/git/20171103191309.sth4zjokgcupvk2e@sigill.intra.peff.net/

It does mean passing a "struct error_context" just about everywhere.
Though since the context doesn't change very much and most calls are
just forwarding it along, it would probably also be reasonable to have a
thread-local global context, and push/pop from it (sort of a poor man's
dynamic scoping).

One thing that strategy doesn't help with, though, that your
libification might want: it's not very machine-readable. The error
reporters would still fundamentally be working with strings. So a
libified process can know "OK, writing this ref failed, and I have some
error messages in a buffer". But the calling code can't know specifics
like "it failed because we tried to open file 'foo' and it got EPERM".
We _could_ design an error context that stores individual errno values
or codes in a list, but having each caller report those specific errors
is a much bigger job (and ongoing maintenance burden as we catalogue and
give an identifier to each error).

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-22 19:34         ` Jeff King
@ 2023-02-24 20:31           ` Emily Shaffer
  2023-02-24 21:41             ` Jeff King
  2023-02-24 22:59             ` Junio C Hamano
  0 siblings, 2 replies; 37+ messages in thread
From: Emily Shaffer @ 2023-02-24 20:31 UTC (permalink / raw)
  To: Jeff King
  Cc: brian m. carlson, Git List, Jonathan Nieder, Jose Lopes,
	Aleksandr Mikhailov

On Wed, Feb 22, 2023 at 11:34 AM Jeff King <peff@peff.net> wrote:
>
> On Fri, Feb 17, 2023 at 02:49:51PM -0800, Emily Shaffer wrote:
>
> > > Personally, I'd like to see some sort of standard error type (whether
> > > integral or not) that would let us do more bubbling up of errors and
> > > less die().  I don't know if that's in the cards, but I thought I'd
> > > suggest it in case other folks are interested.
> >
> > Yes!!! We have talked about this a lot internally - but this is one
> > thing that will be difficult to introduce into Git without making
> > parts of the codebase a little uglier. Since obviously C doesn't have
> > an intrinsic to do this, we'll have to roll our own, which means that
> > manipulating it consistently at function exits might end up pretty
> > ugly. So hearing that there's interest outside of my team to come up
> > with such a type makes me optimistic that we can figure out a
> > neat-enough solution.
>
> Here are some past discussions on what I thought would be a good
> approach to error handling. The basic idea is to replace the "pass a
> strbuf that people shove error messages into" approach with an error
> context struct that has a callback. And that callback can then stuff
> them into a strbuf, or report them directly, or even die.

Thanks! I'll give these a read in detail soon, I appreciate you digging them up.

>
> This thread sketches out the idea, though sadly I no longer have the
> more fleshed-out patches I mentioned there:
>
>   https://lore.kernel.org/git/20160927191955.mympqgylrxhkp24n@sigill.intra.peff.net/
>
> And then the sub-thread starting here discusses a similar approach:
>
>   https://lore.kernel.org/git/20171103191309.sth4zjokgcupvk2e@sigill.intra.peff.net/
>
> It does mean passing a "struct error_context" just about everywhere.
> Though since the context doesn't change very much and most calls are
> just forwarding it along, it would probably also be reasonable to have a
> thread-local global context, and push/pop from it (sort of a poor man's
> dynamic scoping).
>
> One thing that strategy doesn't help with, though, that your
> libification might want: it's not very machine-readable. The error
> reporters would still fundamentally be working with strings. So a
> libified process can know "OK, writing this ref failed, and I have some
> error messages in a buffer". But the calling code can't know specifics
> like "it failed because we tried to open file 'foo' and it got EPERM".
> We _could_ design an error context that stores individual errno values
> or codes in a list, but having each caller report those specific errors
> is a much bigger job (and ongoing maintenance burden as we catalogue and
> give an identifier to each error).

Is there a reason not to use this kind of struct and provide
library-specific error code enums, though, I wonder? You're right that
parsing the error string is really bad for the caller, for anything
besides just logging it. But it seems somewhat reasonable to expect
that any call from config library returning an integer error code is
referring to enum config_errors...

>
> -Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-22 14:55 ` Derrick Stolee
@ 2023-02-24 21:06   ` Emily Shaffer
  0 siblings, 0 replies; 37+ messages in thread
From: Emily Shaffer @ 2023-02-24 21:06 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git List, Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

On Wed, Feb 22, 2023 at 6:55 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 2/17/2023 4:12 PM, Emily Shaffer wrote:
>
> > This turned out pretty long-winded, so a quick summary before I dive in:
> >
> > - We want to compile parts of Git as independent libraries
> > - We want to do it by making incremental code quality improvements to Git
> > - Let's avoid promising stability of the interfaces of those libraries
> > - We think it'll let Git do cool stuff like unit tests and allowing
> > purpose-built plugins
> > - Hopefully by example we can convince the rest of the project to join
> > in the effort
>
> As I was thinking about this proposal, I realized that my biggest concern
> is that we are jumping to the "what" and "how" without fully exploring the
> "why"s, so I've reorganized some of your points along those lines, so I
> can respond to each in turn.

Thanks - and thanks for your patience in my replying, because there's
a lot here ;)

>
> ## Why: Modular code
>
> > - Having clear, modular libraries makes it easy to find the code
> > responsible for some behavior, or determine where to add something
> > new.
>
> > - An easy-to-understand modular codebase makes it easy for new
> > contributors to start hacking and understand the consequences of their
> > patch.
>
> Generally, having a modular codebase does not guarantee that things are
> easy to find or that consequences are understood. The correct abstractions
> are key in this, as well as developing boundaries that do not leak into
> each other.

Absolutely agreed. I do think we will want to be very careful about
how to abstract things - and very open to reworking the interface when
something isn't working. (See also, not guaranteeing ABI stability)

>
> Unless done correctly, we can have significant issues with how things are
> communicated through layers of abstraction. We already have that problem
> when we want to add a new option to a builtin and need to modify three or
> four method prototypes in order to communicate a change of behavior to the
> right layer.

Yes, that's true - and usually plumbing some new argument through and
coping with it appropriately in those 3-4 functions is because there's
not a way to silently provide the needed context to the bottom-most
function, if I'm thinking of the same scenario you are.

One thing we discussed internally was the need to pass around some
context struct to libraries, mostly to avoid reliance on globals. But
I wonder if that same schema means that we can reduce the amount of
argument plumbing needed when adding something to a low-level
dependency?

>
> I've also found that where we have abstractions, such as the refs code or
> the transport code, that the indirection created by the vtables is more
> confusing to discover by reading the code. I've needed the use of a
> debugger to even discover the call stack for certain operations.

This is a really good point, and part of why I'd like to hold off on
introducing vtables until we really painfully need them. I wonder when
that point will be - I guess if/when we start using runtime library
linking/selection it will be necessary.

>
> > - If we can swap out an entire library, or just a single function of
> > one, it's easy to experiment with the entire codebase without sweeping
> > changes.
>
> I don't understand this one too much. We already have this ability through
> forking the repository and mutating the code to do different things. Why
> do we need modules (with vtables and all the rest of the pain that goes
> into it) to make such swaps?
>
> This only really matters in the fullness of the libification effort: if
> everything is mocked, then subsystems can be replaced wholesale. That's a
> lot of work just to avoid experimenting with a fork.

Yeah, I see your point. Git has also experimented with two backends
selectable via config lots of times, and that's basically the same
functional role as experimenting by replacing a library. But I sure
would like it if our project's default suggestion for experimenting
with different implementations wasn't just "fork" - forks are
cumbersome to maintain and painful to reconvene with the mainline.

>
> ...
>
> You said this about how to achieve modular code:
>
> > - Removing references to global variables and instead piping them
> > through arguments
> > - Finding and fixing memory leaks, especially in widely-used low-level code
> > - Reducing or removing `die()` invocations in low-level code, and
> > instead reporting errors back to callers in a consistent way
> ...
> > - Making module interfaces more consistent and easier to understand,
> > including moving "private" functions out of headers and into source
> > files and improving in-header documentation
>
> I think these mechanisms should be universally welcomed. One thing that is
> tricky is how we can guarantee that an API can be marked as "done" and
> maintained that way in perpetuity. (Unit tests may help with that, so
> let's come back to this then.)

I'd really like to avoid that guarantee, actually! Even in the long
term, I think it's very likely that we'll want to continue extending
some APIs forever - for example, config learns new kinds of things
which might change the API quite frequently. Let's not reach for
"done, never touch the interface again" as a goal.

>
> > - Clarifying the scope and layering of existing modules, for example
> > by moving single-use helpers from the shared module's scope into the
> > single user's scope
>
> This one is a bit murky to me. It sounds like that if there is only one
> caller to a strbuf_*() method that it should be made 'static' inside the
> caller instead of global in strbuf.h for use by future callers. Making
> this a goal in itself is probably more likely to see methods moving around
> more frequently due to the frequency of their use, rather than natural
> groupings based on which data structures are being mutated.

Let me be a little more explicit about the strbuf example. strbuf
includes the purpose-built helper `strbuf_add_unique_abbrev()`, to
truncate the oid as much as possible to remain unique.
(https://github.com/git/git/blob/master/strbuf.h#L633) I'm proposing
that this helper doesn't belong in strbuf.a, but instead in oid.a or
similar. Still public, but since it relies on also having linked
oid.a, and it's in fact useless _without_ oid.a, it should live in
oid.a. (Reading my original phrasing back, I see how it didn't imply
this at all, so thanks for asking.)

>
> What measurement would we optimize for? How could we maintain such
> standards as we go? The Git codebase doesn't really have a firm
> "architecture" other than "builtin code calls libgit.a code, which calls
> other libgit.a code as necessary". There aren't really strong layers
> between things. Everything assumes it can look up an object or load
> config. Are there organizational things that we can do in this space that
> would help decoupling things before jumping to libification?

The example I cited above is a symptom of this weak layering you
describe, yeah. In practice so far, these symptoms show up in the
process of trying to isolate a single library, and require deciding
how to reorganize before we can build it as a library. But I think
part of the reason we don't have strong layering now is precisely
because we need it; I'm not sure that a soft reorganization will stick
without some kind of enforcement (for example, in the form of unit
tests operating on isolated components).

>
> ## Why: Submodules
>
> > - Operations recursing into submodules can be run in-process, and the
> > callsite makes it very clear which repository context is being used;
> > if we're not subprocessing to recurse, then we can lose the awkward
> > git-submodule.sh -> builtin/submodule__helper.c -> top-level Git
> > process codepath.
>
> Previously, there was an effort to replace dependencies on the_repository
> in favor of repository structs being passed through method parameters.
> This seems to have fallen off of the priority list, and prevous APIs that
> were converted have regressed in their dependencies.
>
> Should we consider restarting that effort as an early goal?

Yes, I think that effort is a good early goal, but I think the reason
it fell off is because there's not a lot of direct reward in moving
bits and pieces off of `the_repository`. It's hard (impossible?) to
guarantee with an integration test, and taken alone, it's too big of a
change for one person to complete in a single series. So I'm less
interested in saying "start by getting rid of the_repository in 100%
of the code!" and a lot more interested in saying "start by getting
rid of 100% of the globals in config.[ch]", if that makes sense.

> Should a
> repository struct be the "god object" that also includes the default
> implemenations of these modules until the modules can be teased apart and
> no longer care about the entire repository?

Hm, I'm not sure I follow this and the first impression I get is
pretty scary.  Could you explain a little more?

> ## Why: Unit testing
>
> > - Unit tests. We already have some in t/helper/, but if we can replace
> > all the dependencies of a certain library with simple stubs, it's
> > easier for us to write comprehensive unit tests, in addition to the
> > work we already do introducing edge cases in bash integration tests.
>
> Unit tests are very valuable in keeping the codebase stable and reducing
> the chance that code was mutated by accident. This is achieved by very
> rigid test infrastructure, requiring _replacing_ methods with mocks in
> careful ways in order to test that behavior matches expectation. The
> "simple stubs" are actually carefully crafted to verify their inputs and
> provide a carefully selected return value.
>
> The difficulty of writing unit tests (or mutating the code after writing
> unit tests) is balanced by the ability to more explicitly create error
> conditions that are unlikely to occur in real test cases. Cases such as
> connection errors, file I/O problems, or other unexpected responses from
> the mocked methods, are all relatively difficult to check via end-to-end
> tests.
>
> I've personally found that the hardest unit test to write is the _first_
> one, and after that the rest are simpler. The difficulty arises in making
> the code testable in the first place, as well as creating infrastructure
> for the most-common unit test scenarios. My expectation is that it will
> take significant changes to the Git codebase to make any non-trivial unit
> tests outside of these very isolated cases that are presented here: strbuf
> manipulation is easy to unit-test and config methods operating on
> config_set structs should be similar. However, config methods that
> interact with a repository and its config file(s) will be much harder to
> test unless we start mocking filesystem interactions. We could create
> test-tool helpers that load a full repository and check the config files
> present on the filesystem, but now we're talking about integration tests
> instead of unit tests.

Yeah, I see what you're saying. I think some of it depends on how we
build the interfaces; it's difficult to mock an entire filesystem for
an API that takes a path, but it's easy to give fake data to an API
that takes a string or bytestream. I suspect we'll want to be very
careful looking at how much work it is to build unit tests for a
certain library - overly complicated mocking setup seems like a code
smell to me.

>
> > - Being able to test libraries in isolation via unit tests or mocks
> > speeds up determining the root cause of bugs.
>
> I'm not sure I completely agree with this statement. Unit tests can help
> prevent introducing a new fault when mutating code to adjust for a bug,
> but unit tests only guarantee expected behavior on the expected
> preconditions. None of this necessarily helps finding a root cause,
> especially in the likely event that the root cause is due to combining
> units and thus not covered by unit tests.

Yeah, that's fair.

>
> ## Why: Correct use
>
> > The ability to use Git as a library also makes it easier for other
> > tooling to interact with a Git repository *correctly*.
>
> Is there is a correct way to interact with a Git repository? We definitely
> prefer that the repository is updated by executing Git processes, leaving
> all internals up to Git. Libification thus has a purpose for scenarios
> that do not have an appropriate Git builtin or where the process startup
> time is too expensive for that use.

There are certainly some incorrect ways! We get a lot of firsthand
experience with `repo` tool's poor interaction with Git repositories
:)

> However, would it not be preferrable to update Git to include these use
> cases in the form of new builtin operations? Even in the case where
> process startup is expensive, operations can be batched (as in 'git
> cat-file --batch').

In addition to expensive startup time, in some cases we've gotten
pushback about starting a subprocess at all, for security reasons.
Especially because Git loves to then start subprocesses itself.

>
> ## Why: Virtual Filesystem Support

In the initial email I CC'd a couple of the people working on our VFS,
so I'll leave this section alone and leave it to them to reply to you.

>
> ## How to achieve these goals
>
> > I'm also planning to send a proposal for a document full of "best
> > practices" for turning Git code into libraries (and have quite a lot
> > of discussion around that document, too). My hope is that we can use
> > that document to help us during implementation as well as during
> > review, and refine it over time as we learn more about what works and
> > what doesn't. Having this kind of documentation will make it easy for
> > others to join us in moving Git's codebase towards a clean set of
> > libraries. I hope that, as a project, we can settle on some tenets
> > that we all agree would make Git nicer.
>
> I like the idea of a "best practices" document, but I would hesitate to
> focus on the libification and instead aim for the high-value items such
> as lack of globals and reduced memory leaks. How do we write such code?
> How do we write (unit?) tests that guarantee those properties will be
> maintained into the future?
>
> > From the rest of my own team, we're planning on working first on some
> > limited scope, low-level libraries so that we can all see how the
> > process works. We're starting with strbuf.[ch] (as it's used
> > everywhere with few or no dependencies and helps us guarantee string
> > safety at API boundaries), config.[ch] (as many external tools are
> > probably interested in parsing Git config formatted files directly),
> > and a subset of operations related to the object store. These starting
> > points are intended to have a small impact on the codebase and teach
> > us a lot about logistics and best practices while doing these kinds of
> > conversions.
>
> I can see that making a library version of config.h could be helpful to
> third parties wanting to read Git config. In particular, I know that Git
> Credential Manager runs multiple 'git config' calls to get multiple values
> out. Perhaps time would be better spent creating a 'git config --batch'
> mode so these consumers could call one process, pass a list of config keys
> as input, and get a list of key-value pairs out? (Use a '-z' mode to allow
> newlines in the values.)
>
> However, I'm not seeing value to anyone else to have strbuf.h available
> externally. I'm also not seeing any value to the Git project by having
> either of these available as a library. I can only see increased developer
> friction due to the new restrictions on making changes in these areas.

The value in strbuf is simply this: a Git config library should
probably not be returning char*+int to its callers, or even worse, an
unwrapped linked list or map. It's too easy to make memory mistakes
dealing with raw C types, and I'd like for us to avoid that by using
the types we already have.

>
> If this first effort is instead swapped to say "we plan on introducing
> unit tests for these libraries" then I think that is an easier thing to
> support.
>
> It happens that these efforts will make it easier to make the jump to
> providing APIs as a library. Not only are unit tests _necessary_ to
> support libification, they also make a smaller lift. Unit tests will
> already create more friction when changing the API, since either new tests
> must be written or old tests must be modified.
>
>
> ### Summary
>
> Generally, making architectural change is difficult. In addition to the
> amount of code that needs to be moved, adjusted, and tested in new ways,
> there is a cultural element that needs to be adjusted. Expecting Git's
> code to be used as a library is a fundamentally different use case than we
> have supported in the past, and most of our developer guidelines reflect
> that. We don't hesitate to modify APIs. We prefer end-to-end tests over
> unit tests whenever possible.

Inherent in your summary I'm hearing a lot of concern about providing
API stability. But I agree with you there - I don't think that we
should provide API stability, especially at the beginning while we're
learning how to write good APIs. I hope we can wait as long as
possible to start promising backwards compatibility - maybe even never
- and I hope we can stand our ground if we do receive bug reports
about breaking API stability on an API still labeled as unstable. I'm
not sure how I can make this statement more strongly :)

It seems that the Linux kernel has also managed to get away with this
kind of purposeful instability
(https://www.kernel.org/doc/Documentation/process/stable-api-nonsense.rst)
and I'd be happy to try and follow their example. Another standard
library provided by Google (https://abseil.io/about/compatibility)
does the same. So I do think it's possible for us to guarantee
instability, as it were. (Thanks, Jonathan Nieder, for the links
here.)

>
> There are focused bits of your proposal that I think will be universally
> welcomed, and I think the best course of action in the short term is to
> demonstrate improvements in those areas:
>
>  * Reduced use of globals.
>  * Reduced memory leaks.
>  * Use error responses instead of die()s within low-level code.
>  * Update header files to have appropriate visibility and documentation.
>
> The things that we need to really focus on are how we can measure progress
> in these areas as well as how can we prevent regression in the future.
> Unit tests are one way to do this (especially with leak detection), but
> also documentation and coding guidelines need to be updated based on the
> new patterns discovered in this process.
>
> I mentioned the_repository earlier as a partially-complete architectural
> change. I think the_index compatibility macros are in a similar boat of
> being half-completed (though, the last time I tried to remove these macros
> I was told that it wasn't a goal to remove the macros from builtins). The
> sparse-index's command_requires_full_index checks are also in this boat,
> so I understand the difficulty of getting a big change 100% complete (and
> in this case, the plan is to do at least one more GSoC project in the area
> to see what progress can be made that way).
>
> We should be careful in committing to a very long-term direction without
> first delivering value on a shorter timeline.
>
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-24 20:31           ` Emily Shaffer
@ 2023-02-24 21:41             ` Jeff King
  2023-02-24 22:59             ` Junio C Hamano
  1 sibling, 0 replies; 37+ messages in thread
From: Jeff King @ 2023-02-24 21:41 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: brian m. carlson, Git List, Jonathan Nieder, Jose Lopes,
	Aleksandr Mikhailov

On Fri, Feb 24, 2023 at 12:31:16PM -0800, Emily Shaffer wrote:

> > One thing that strategy doesn't help with, though, that your
> > libification might want: it's not very machine-readable. The error
> > reporters would still fundamentally be working with strings. So a
> > libified process can know "OK, writing this ref failed, and I have some
> > error messages in a buffer". But the calling code can't know specifics
> > like "it failed because we tried to open file 'foo' and it got EPERM".
> > We _could_ design an error context that stores individual errno values
> > or codes in a list, but having each caller report those specific errors
> > is a much bigger job (and ongoing maintenance burden as we catalogue and
> > give an identifier to each error).
> 
> Is there a reason not to use this kind of struct and provide
> library-specific error code enums, though, I wonder? You're right that
> parsing the error string is really bad for the caller, for anything
> besides just logging it. But it seems somewhat reasonable to expect
> that any call from config library returning an integer error code is
> referring to enum config_errors...

Right, you could definitely layer the two approaches by storing the
enums in the struct. And that might be a good thing to do in the long
run. But I think it's a much harder change, as it implies assigning
those codes (and developing a taxonomy of errors). But that can also be
done incrementally, especially if it's done on top of human-readable
strings.

I.e., I can imagine a world where low-level code reports an error like:

   int read_foo(const char *path, struct error_context *err)
   {
           fd = open(path, ...);
           if (fd < 0)
                   return report_errno(&err, ERR_FOO_OPEN, "unable to open foo file '%s'", path);
           ...read and parse...
           if (some_unexpected_format)
                   return report(&err, ERR_FOO_FORMAT, "unable to parse foo file '%s'", path);
   }

and then the caller has many options:

  - pass in a context that just dumps the human readable errors to
    stderr

  - collect the error strings in a buffer to report by some other
    mechanism

  - check err.type to act on ERR_FOO_OPEN, etc, including errno (which
    I'd imagine report_errno() to record)

The details above are just a sketch, of course. Rather than FOO_OPEN,
callers may actually want to think of things more like a stack of
errors that would mirror the callstack. You could imagine something
like:

  type=ERR_REF_WRITE
  type=ERR_GET_LOCK
  type=ERR_OPEN, errno=EPERM

I do think it's possible to over-engineer this to the point of
absurdity, and end up creating a lot of work. Which is why I'd probably
start with uniformly trying to use an error context struct with strings,
after which adding on fancier features gets easier (and again, is
possibly something that can be done incrementally as various subsystems
support it).

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-24 20:31           ` Emily Shaffer
  2023-02-24 21:41             ` Jeff King
@ 2023-02-24 22:59             ` Junio C Hamano
  1 sibling, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2023-02-24 22:59 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: Jeff King, brian m. carlson, Git List, Jonathan Nieder,
	Jose Lopes, Aleksandr Mikhailov

Emily Shaffer <nasamuffin@google.com> writes:

> Is there a reason not to use this kind of struct and provide
> library-specific error code enums, though, I wonder? You're right that
> parsing the error string is really bad for the caller, for anything
> besides just logging it. But it seems somewhat reasonable to expect
> that any call from config library returning an integer error code is
> referring to enum config_errors...

In addition to what Peff already said, I think the harder part of it
is to parametralize the errors in a machine readable way.  A part of
a library may say (with an enum) that it is returning "Ref cannot be
read" error, with a parameter that says "The ref that caused this
error was 'refs/heads/next'" which makes "Ref cannot be read" error
has one parameter.  "Ref cannot be renamed" may have two (old and
new name).  Other errors from some library functions may not even be
of type "string".

Coming up with the enums to cover the error conditions (which Peff
covered well) is already a lot of work.  Making sure each of them
take sufficient parameters to describe the error usefully adds more
on top.  And the code to pass these variable number of parameters of
variable types would be, eh, fun---it would be error prone without
compiler help.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-22  1:44 ` Victoria Dye
@ 2023-02-25  1:48   ` Jonathan Tan
  0 siblings, 0 replies; 37+ messages in thread
From: Jonathan Tan @ 2023-02-25  1:48 UTC (permalink / raw)
  To: Victoria Dye
  Cc: Jonathan Tan, Emily Shaffer, Git List, Jonathan Nieder,
	Jose Lopes, Aleksandr Mikhailov

Victoria Dye <vdye@github.com> writes:
> Emily Shaffer wrote:
> > Hi folks,
> > 
> > As I mentioned in standup this week[1], my colleagues and I at Google
> > have become very interested in converting parts of Git into libraries
> > usable by external programs. In other words, for some modules which
> > already have clear boundaries inside of Git - like config.[ch],
> > strbuf.[ch], etc. - we want to remove some implicit dependencies, like
> > references to globals, and make explicit other dependencies, like
> > references to other modules within Git. Eventually, we'd like both for
> > an external program to use Git libraries within its own process, and
> > for Git to be given an alternative implementation of a library it uses
> > internally (like a plugin at runtime).
> 
> I see why you've linked these two objectives (there could be some shared
> infrastructure between them), but they're ultimately separate concerns, with
> separate needs & tradeoffs. 

I think the main link is that we need (or think we need) both of them
- for example, we do want external code to be able to use Git's config
parsing mechanism, and we want to be able to swap out the object store.
They indeed are separate concerns.

> By trying to design with a unified solution for
> both, though, this proposal's solution may be broader than it needs to be
> and comes with some serious developer experience ramifications.
> 
> For example, looking at the  "call libgit.so from other applications"
> objective: one way I could imagine implementing this is by creating a stable
> API wrapper around various hand-picked internal functions. If those internal
> functions change (e.g., adding a function argument to handle a new flag in a
> builtin), the API can remain stable (by setting a sane default) without
> really limiting the contributor adding a new feature. 
> 
> However, my reading of this proposal (although correct me if I'm wrong) is
> that you want to use that same external-facing API as a plugin interface for
> Git. In that case, the contributor can't just add a new argument and
> trivially modify the wrapper; they either break the API contract, or create
> a new version of the API and continue to support the old one. While that's
> technically doable, it's a pretty sizeable increase to the amount of "stuff"
> contributors need to keep track of and limits the project's flexibility. 

For now, we at Google think that Git shouldn't guarantee API backwards
compatibility (unlike for plumbing commands), and that any users of
the API should be prepared for API backwards compatibility to not be
something Git has (and thus be prepared to, say, pin a version of Git and/
or be prepared to continually update their code, which they might want
to do anyway because hopefully, the Git developer community would update
an API because there is a user need or because it will make internal
code that uses that API better: hence they might want to meet the same
user need or have the same code improvement).

But I agree that the way an API change would be handled is different in
the two cases.

> > - Having clear, modular libraries makes it easy to find the code
> > responsible for some behavior, or determine where to add something
> > new.
> 
> This is more an issue of code & repository organization, and isn't really an
> inherent outcome of lib-ifying Git. That said, like the cleanup you mention
> later, I would not be opposed to a dedicated code reorganization proposal.

Thanks. I think that Emily didn't intend to present libification as
the only way to accomplish such an organization, but more of "here's
the business need to explain why we're doing this, and here's how it
benefits the project". I do agree that there are other ways of achieving
such a reorganization.

> > - Operations recursing into submodules can be run in-process, and the
> > callsite makes it very clear which repository context is being used;
> > if we're not subprocessing to recurse, then we can lose the awkward
> > git-submodule.sh -> builtin/submodule__helper.c -> top-level Git
> > process codepath.
> 
> Interesting - I wasn't aware this was how submodules worked internally. Is
> there a specific reason this can't be refactored to perform the recursion
> in-process?

Same answer as above.

> > - Being able to test libraries in isolation via unit tests or mocks
> > speeds up determining the root cause of bugs.
> 
> The addition of unit tests is a project of its own, and one I don't believe
> is intrinsically tied to this one. More on this later.

Same answer as above.

> > - If we can swap out an entire library, or just a single function of
> > one, it's easy to experiment with the entire codebase without sweeping
> > changes.
> 
> What sort of "experiment" are you thinking of here? Anyone can make internal
> changes to Git in their own clone of the repository, then build and test it.
> Shipping a custom fork of Git doesn't seem any less complex than building a
> plugin library, shipping that, and injecting it into an existing Git
> install.

Having clearly separated out modules means that experimentation,
especially those that span several months (say), can be more easily
done, because what we would need to change can be confined to one or
several files instead of possibly needing to be distributed across many
files. This not only helps speed up development but also keeping up with
upstream changes (there will hopefully be fewer merge conflicts this
way). The way of shipping experimental builds could indeed be from a
fork of the codebase.

> > The ability to use Git as a library also makes it easier for other
> > tooling to interact with a Git repository *correctly*. As an example,
> > `repo` has a long history of abusing Git by directly manipulating the
> > gitdir[2], but if it were written in a world where Git exists as
> > easy-to-use libraries, it probably wouldn't have needed to, as it
> > could have invoked Git directly or replaced the relevant modules with
> > its own implementation. Both `repo`[3] and `git-gui[4]` have
> > reimplemented logic from git.git. Other interfaces that cooperate with
> > Git's filesystem storage layer, like `scalar` or `jj`[5], would be
> > able to interop with a Git repository without having to reimplement
> > custom logic or keep up with new Git changes.
> 
> I'd alternatively suggest that these use cases could be addressed with
> better plumbing command support and better (public) documentation of our
> on-disk data structures. In many cases, we're treating on-disk data
> structures as external-facing APIs anyway - the index [1], packfiles [2],
> etc. are versioned. We could definitely add documentation that's more 
> directly targeted at external integrators.
> 
> [1] https://git-scm.com/docs/index-format
> [2] https://git-scm.com/docs/gitformat-pack

Plumbing commands require spawning processes and serializing inputs
and outputs, and we can't do things like pass callbacks or an mmap-ped
buffer we already have. (As an aside, we did consider that approach with
git-batch [1], but that didn't get much traction.)

As for interoperability on the basis of on-disk data, I think
the workload of keeping up with Git improvements shouldn't be
underestimated. For example, I don't think many (if any) implementations
that can read Git's files support partial clone (at least on the
client), and we are still improving things, e.g. hasconfig:remote.*.url
includeIf was recently added to Git config.

[1] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/

> Both of these points sound more like implementation details/quirks of the
> VFS prototype you're building than hard blockers coming from Git. For
> example, you mention 'git status' needing to populate a full worktree -
> could it instead use a mechanism similar to FSMonitor to avoid scanning for
> files you know are virtualized? While there'd likely be some Git changes
> involved (some of which might be a bit plugin-y), they'd be much more
> tightly scoped than full libification.

I think the answer to this is the same as the disadvantages of outside
code calling plumbing commands (above).

> These cleanups/best practices are a great idea regardless of any potential
> lib-ification. I'll be sure to keep them in mind in any future changes I
> make or review.

Thanks.

> > In the longer term, if Git has libraries with easily-replaced
> > dependencies, we get a few more benefits:
> > 
> > - Unit tests. We already have some in t/helper/, but if we can replace
> > all the dependencies of a certain library with simple stubs, it's
> > easier for us to write comprehensive unit tests, in addition to the
> > work we already do introducing edge cases in bash integration tests.
> 
> This doesn't really follow as a direct consequence of making libgit
> externally facing. AFAIK, there's nothing explicitly stopping us from
> writing or integrating a unit test framework now.
> 
> Like the "make libgit public" vs. "allow for custom backends in Git", I
> think this is a separate project with its own tradeoffs to evaluate.

The way we're making code in Git externally facing, at least right now,
is not to make the entirety of the Git code (perhaps minus the builtins)
consumable from the outside, but just selected .c/.h files. Being able
to say "these .c/.h files can be used independently" will make unit
tests easier, I think. But yes, there is nothing stopping us from such
an effort independent of libification.

> > - If our users can use plugins to improve performance in specific
> > scenarios (like a VFS-aware object store in the VFS case I cited
> > above), then Git works better for them without having to adopt a
> > different workflow, such as using an alternative tool or wrapper.
> 
> I'm curious as to what you want the user experience for this to look like.
> How would Git know to dynamically load a plugin? How does a user configure
> (or disable) plugins? Will a plugin need to replace all of the functions in
> a given library, or would Git be able to "fall back" on its own internal
> implementation?

Right now, it's probably going to be just recompiling the Git binary
with a few .c files swapped for others, which is why it's important
to us to have defined boundaries between .c files (even if they move
over time). I do think that converting code in Git to access other
code through a vtable is desirable (that way, for example, a single
binary can access 2 kinds of object stores, for example) but even if
the Git project decides not to do that, we can still do a lot (and just
ship, for example, some extra .c files that can do both the standard
implementation and a custom one).

> > - An easy-to-understand modular codebase makes it easy for new
> > contributors to start hacking and understand the consequences of their
> > patch.
> 
> As noted earlier, I think improving the developer experience in this way is
> independent of the development of external APIs.

Noted - same answer as earlier too.

> > Of course, we haven't maintained any guarantee about the consistency
> > of our implementation between releases. I don't anticipate that we'll
> > write the perfect library interface on our first try. So I hope that
> > we can be very explicit about refusing to provide any compatibility
> > guarantee whatsoever between versions for quite a long time. On
> > Google's end, that's well-understood and accepted. As I understand,
> > some other projects already use Git's codebase as a "library" by
> > including it as a submodule and using the code directly[6]; even a
> > breakable API seems like an improvement over that, too.
> 
> This hints at, but sidesteps, a really important aspect of the long-term
> goals of this project - are you planning on having us start guaranteeing
> consistency once there's an external-facing API available? 
> 
> If not, that sounds like an unpleasant user experience (and one prone to
> Hyrum's Law [3] at that). 
> 
> If so, Git contributors will either be much more constrained in the
> introduction of new features, or we'll end up with a mess of
> backward-compatibility APIs.
> 
> [3] https://www.hyrumslaw.com/

I answered this above.

> > So what's next? Naturally, I'm looking forward to a spirited
> > discussion about this topic - I'd like to know which concerns haven't
> > been addressed and figure out whether we can find a way around them,
> > and generally build awareness of this effort with the community.
> > 
> > I'm also planning to send a proposal for a document full of "best
> > practices" for turning Git code into libraries (and have quite a lot
> > of discussion around that document, too). My hope is that we can use
> > that document to help us during implementation as well as during
> > review, and refine it over time as we learn more about what works and
> > what doesn't. Having this kind of documentation will make it easy for
> > others to join us in moving Git's codebase towards a clean set of
> > libraries. I hope that, as a project, we can settle on some tenets
> > that we all agree would make Git nicer.
> 
> I think the use of multiple libraries is part of the potentially suboptimal
> balance between the "load Git's internals as a library" and "let Git use
> plugins in place of its own implementations" goals. The former could leave a
> user in dependency hell if there are lots of small libraries to load, while
> the latter may work better with smaller-scoped libraries (to avoid needing
> to implement more than is useful). And, the functionality that's useful to a
> program invoking Git via library may not be the same as what someone would
> want to replace via plugin. 

I don't see how we can avoid separating up Git's code with either goal
(internals as library vs. plugins), but I think that such separation
makes the code clearer anyway even if one does not need to reuse
internals or use plugins.

> As a general note, this libification project - its justification,
> milestones, organization, etc. - seems to be primarily driven by the needs
> of a VFS that is entirely opaque to the Git community you're proposing to.
> As it stands, reviewers are put in a position of needing to accept, without
> much evidence, that this is the best (or only) possible approach for meeting
> those needs. 
> 
> While I'm normally content with "scratch your own itch"-type changes, what
> you're proposing is a major paradigm shift in how Git is written,
> maintained, and used. I'm not comfortable accepting that level of impact
> without at least being able to evaluate whether the problem can be solved
> some other way.

I think that reviewers should review code based on current principles,
not just based on our (we at Google) claim that it is useful for our
VFS (although perhaps our need might show that other people might have a
similar need too).

> Implementing dependency injection (via vtable or otherwise) and unit test
> mocking is a massive undertaking on its own, let alone everything else
> described so far. You've outlined some clear, fairly unobtrusive first steps
> to the overarching proposal, but there's a lot of detail missing from the
> plans for later steps. While there's always risk of a project getting
> derailed or blocked later on due to some unforeseen issue, that risk seems
> particularly high here given the size & scope of all of these components. 

True, but even the first steps will already enable us to accomplish
quite a bit. The later steps may change based on what we learn after the
first steps.

> Intermediate milestones/goals, each of which is valuable on its own,
> outlined in the proposal would help allay those fears (for me, at least).

Noted.

> So, to summarize my thoughts into some (hopefully) actionable feedback:
> 
> - It's really important that the proposal presents a clear, concrete
>   long-term vision for this project. 
>   - What is the desired end state, in terms of what is built and installed
>     with Git?

For the intermediate step (what we want to accomplish now/soon):
 - Git will compile into the same binaries as it does now, with no
   plugin functionality. There will be no additional libraries.
 - Git will have unit tests that compile a subset of Git's .c files,
   and may possibly supply their own .c files that implement existing .h
   signatures to serve as stubs.
 - Other projects that want to use Git's code will need to include Git's
   .c/.h files in their own build systems, possibly using the lists of
   files in the unit test Makefiles for reference. They will also need
   to keep up with changes themselves.

The end step might include vtables, but we probably will revisit that
later once we learn more about the task of code reuse, aided by what
we've accomplished in the intermediate step.

>   - What will be expected of other Git contributors to support this design?

To not break the unit tests by putting functionality in the wrong
place (but hopefully this is seen as a plus, in that we have something
mechanical to check such things), and to design new code to fit with the
principles of modularity.

>     What happens if someone wants to "break" an API?

I think I already answered this above.

>   - What is the impact to users (including security implications)?

As of the intermediate step, we don't allow plugins in the standard
build of Git, so I don't think there will be much impact. In certain
environments that run patched versions of Git, .c files could be swapped
out, but the impact is similar to that of what a patch can do.

> - The scope for what you've proposed is pretty huge (which comes with a lot
>   of risk), but I think it could be broken into smaller, *independent*
>   pieces. 

I think each patch set towards this goal can be considered independently.

> - It's hard to judge or suggest adjustments to this proposal without knowing
>   the specific challenges you're facing with Git as it is.

I think Emily has already outlined a few such challenges (and hopefully
I have elaborated on them), but if there are any specific ones you'd
like more information on, feel free to let us know.

> Finally, thanks for sending this email & starting a discussion. It's an
> interesting topic and I'm looking forward to seeing everyone's perspectives
> on the matter.

Thank you for your thorough review.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-18 10:36     ` Phillip Wood
@ 2023-03-23 23:22       ` Felipe Contreras
  2023-03-23 23:30         ` rsbecker
  0 siblings, 1 reply; 37+ messages in thread
From: Felipe Contreras @ 2023-03-23 23:22 UTC (permalink / raw)
  To: phillip.wood
  Cc: demerphq, Junio C Hamano, Emily Shaffer, Git List,
	Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

On Sat, Feb 18, 2023 at 5:12 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> On 18/02/2023 01:59, demerphq wrote:
> > On Sat, 18 Feb 2023 at 00:24, Junio C Hamano <gitster@pobox.com> wrote:
> >>
> >> Emily Shaffer <nasamuffin@google.com> writes:
> >>
> >>> Basically, if this effort turns out not to be fruitful as a whole, I'd
> >>> like for us to still have left a positive impact on the codebase.
> >>> ...
> >>> So what's next? Naturally, I'm looking forward to a spirited
> >>> discussion about this topic - I'd like to know which concerns haven't
> >>> been addressed and figure out whether we can find a way around them,
> >>> and generally build awareness of this effort with the community.
> >>
> >> On of the gravest concerns is that the devil is in the details.
> >>
> >> For example, "die() is inconvenient to callers, let's propagate
> >> errors up the callchain" is an easy thing to say, but it would take
> >> much more than "let's propagate errors up" to libify something like
> >> check_connected() to do the same thing without spawning a separate
> >> process that is expected to exit with failure.
> >
> >
> > What does "propagate errors up the callchain" mean?  One
> > interpretation I can think of seems quite horrible, but another seems
> > quite doable and reasonable and likely not even very invasive of the
> > existing code:
> >
> > You can use setjmp/longjmp to implement a form of "try", so that
> > errors dont have to be *explicitly* returned *in* the call chain. And
> > you could probably do so without changing very much of the existing
> > code at all, and maintain a high level of conceptual alignment with
> > the current code strategy.
>
> Using setjmp/longjmp is an interesting suggestion, I think lua does
> something similar to what you describe for perl. However I think both of
> those use a allocator with garbage collection. I worry that using
> longjmp in git would be more invasive (or result in more memory leaks)
> as we'd need to to guard each allocation with some code to clean it up
> and then propagate the error. That means we're back to manually
> propagating errors up the call chain in many cases.

We could just use talloc [1].

All the downstream code can create objects that are descendents of the
master object, when the "exception" occurs the code is sent back to
where the master object was created, it's destroyed, and all the
children are destroyed as well.

I've played around with both talloc and stack contexts and I don't see
how this isn't doable.

Cheers.

[1] https://talloc.samba.org/talloc/doc/html/index.html

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: Proposal/Discussion: Turning parts of Git into libraries
  2023-03-23 23:22       ` Felipe Contreras
@ 2023-03-23 23:30         ` rsbecker
  2023-03-23 23:34           ` Felipe Contreras
  0 siblings, 1 reply; 37+ messages in thread
From: rsbecker @ 2023-03-23 23:30 UTC (permalink / raw)
  To: 'Felipe Contreras', phillip.wood
  Cc: 'demerphq', 'Junio C Hamano',
	'Emily Shaffer', 'Git List',
	'Jonathan Nieder', 'Jose Lopes',
	'Aleksandr Mikhailov'

On Thursday, March 23, 2023 7:22 PM, Felipe Contreras wrote:
>On Sat, Feb 18, 2023 at 5:12 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>
>> On 18/02/2023 01:59, demerphq wrote:
>> > On Sat, 18 Feb 2023 at 00:24, Junio C Hamano <gitster@pobox.com> wrote:
>> >>
>> >> Emily Shaffer <nasamuffin@google.com> writes:
>> >>
>> >>> Basically, if this effort turns out not to be fruitful as a whole,
>> >>> I'd like for us to still have left a positive impact on the codebase.
>> >>> ...
>> >>> So what's next? Naturally, I'm looking forward to a spirited
>> >>> discussion about this topic - I'd like to know which concerns
>> >>> haven't been addressed and figure out whether we can find a way
>> >>> around them, and generally build awareness of this effort with the community.
>> >>
>> >> On of the gravest concerns is that the devil is in the details.
>> >>
>> >> For example, "die() is inconvenient to callers, let's propagate
>> >> errors up the callchain" is an easy thing to say, but it would take
>> >> much more than "let's propagate errors up" to libify something like
>> >> check_connected() to do the same thing without spawning a separate
>> >> process that is expected to exit with failure.
>> >
>> >
>> > What does "propagate errors up the callchain" mean?  One
>> > interpretation I can think of seems quite horrible, but another
>> > seems quite doable and reasonable and likely not even very invasive
>> > of the existing code:
>> >
>> > You can use setjmp/longjmp to implement a form of "try", so that
>> > errors dont have to be *explicitly* returned *in* the call chain.
>> > And you could probably do so without changing very much of the
>> > existing code at all, and maintain a high level of conceptual
>> > alignment with the current code strategy.
>>
>> Using setjmp/longjmp is an interesting suggestion, I think lua does
>> something similar to what you describe for perl. However I think both
>> of those use a allocator with garbage collection. I worry that using
>> longjmp in git would be more invasive (or result in more memory leaks)
>> as we'd need to to guard each allocation with some code to clean it up
>> and then propagate the error. That means we're back to manually
>> propagating errors up the call chain in many cases.
>
>We could just use talloc [1].

talloc is not portable. This would break various platforms.

--Randall


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-03-23 23:30         ` rsbecker
@ 2023-03-23 23:34           ` Felipe Contreras
  2023-03-23 23:42             ` rsbecker
  0 siblings, 1 reply; 37+ messages in thread
From: Felipe Contreras @ 2023-03-23 23:34 UTC (permalink / raw)
  To: rsbecker
  Cc: phillip.wood, demerphq, Junio C Hamano, Emily Shaffer, Git List,
	Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

On Thu, Mar 23, 2023 at 5:30 PM <rsbecker@nexbridge.com> wrote:
>
> On Thursday, March 23, 2023 7:22 PM, Felipe Contreras wrote:
> >On Sat, Feb 18, 2023 at 5:12 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
> >>
> >> On 18/02/2023 01:59, demerphq wrote:
> >> > On Sat, 18 Feb 2023 at 00:24, Junio C Hamano <gitster@pobox.com> wrote:
> >> >>
> >> >> Emily Shaffer <nasamuffin@google.com> writes:
> >> >>
> >> >>> Basically, if this effort turns out not to be fruitful as a whole,
> >> >>> I'd like for us to still have left a positive impact on the codebase.
> >> >>> ...
> >> >>> So what's next? Naturally, I'm looking forward to a spirited
> >> >>> discussion about this topic - I'd like to know which concerns
> >> >>> haven't been addressed and figure out whether we can find a way
> >> >>> around them, and generally build awareness of this effort with the community.
> >> >>
> >> >> On of the gravest concerns is that the devil is in the details.
> >> >>
> >> >> For example, "die() is inconvenient to callers, let's propagate
> >> >> errors up the callchain" is an easy thing to say, but it would take
> >> >> much more than "let's propagate errors up" to libify something like
> >> >> check_connected() to do the same thing without spawning a separate
> >> >> process that is expected to exit with failure.
> >> >
> >> >
> >> > What does "propagate errors up the callchain" mean?  One
> >> > interpretation I can think of seems quite horrible, but another
> >> > seems quite doable and reasonable and likely not even very invasive
> >> > of the existing code:
> >> >
> >> > You can use setjmp/longjmp to implement a form of "try", so that
> >> > errors dont have to be *explicitly* returned *in* the call chain.
> >> > And you could probably do so without changing very much of the
> >> > existing code at all, and maintain a high level of conceptual
> >> > alignment with the current code strategy.
> >>
> >> Using setjmp/longjmp is an interesting suggestion, I think lua does
> >> something similar to what you describe for perl. However I think both
> >> of those use a allocator with garbage collection. I worry that using
> >> longjmp in git would be more invasive (or result in more memory leaks)
> >> as we'd need to to guard each allocation with some code to clean it up
> >> and then propagate the error. That means we're back to manually
> >> propagating errors up the call chain in many cases.
> >
> >We could just use talloc [1].
>
> talloc is not portable.

What makes you say that?

Either way, there's multiple libraries that do the same thing, and of
course one could be implemented within Git. It's not that complex.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-02-17 21:12 Proposal/Discussion: Turning parts of Git into libraries Emily Shaffer
                   ` (5 preceding siblings ...)
  2023-02-22 14:55 ` Derrick Stolee
@ 2023-03-23 23:37 ` Felipe Contreras
  2023-03-23 23:44   ` rsbecker
  6 siblings, 1 reply; 37+ messages in thread
From: Felipe Contreras @ 2023-03-23 23:37 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Git List, Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

On Fri, Feb 17, 2023 at 3:45 PM Emily Shaffer <nasamuffin@google.com> wrote:

> As I mentioned in standup this week[1], my colleagues and I at Google
> have become very interested in converting parts of Git into libraries
> usable by external programs. In other words, for some modules which
> already have clear boundaries inside of Git - like config.[ch],
> strbuf.[ch], etc. - we want to remove some implicit dependencies, like
> references to globals, and make explicit other dependencies, like
> references to other modules within Git. Eventually, we'd like both for
> an external program to use Git libraries within its own process, and
> for Git to be given an alternative implementation of a library it uses
> internally (like a plugin at runtime).

This is obviously the way it should have been done from the beginning,
but unfortunately at this point the Git project has too much inertia
and too many vested interests from multi-billion dollar corporations
to change.

I wonder if a single person who isn't paid to work on Git commented on
this thread.

I don't think these kinds of laudable efforts can be achieved within
the Git project.

Cheers.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: Proposal/Discussion: Turning parts of Git into libraries
  2023-03-23 23:34           ` Felipe Contreras
@ 2023-03-23 23:42             ` rsbecker
  2023-03-23 23:55               ` Felipe Contreras
  0 siblings, 1 reply; 37+ messages in thread
From: rsbecker @ 2023-03-23 23:42 UTC (permalink / raw)
  To: 'Felipe Contreras'
  Cc: phillip.wood, 'demerphq', 'Junio C Hamano',
	'Emily Shaffer', 'Git List',
	'Jonathan Nieder', 'Jose Lopes',
	'Aleksandr Mikhailov'

On Thursday, March 23, 2023 7:35 PM, Felipe Contreras wrote:
>On Thu, Mar 23, 2023 at 5:30 PM <rsbecker@nexbridge.com> wrote:
>>
>> On Thursday, March 23, 2023 7:22 PM, Felipe Contreras wrote:
>> >On Sat, Feb 18, 2023 at 5:12 AM Phillip Wood <phillip.wood123@gmail.com>
>wrote:
>> >>
>> >> On 18/02/2023 01:59, demerphq wrote:
>> >> > On Sat, 18 Feb 2023 at 00:24, Junio C Hamano <gitster@pobox.com> wrote:
>> >> >>
>> >> >> Emily Shaffer <nasamuffin@google.com> writes:
>> >> >>
>> >> >>> Basically, if this effort turns out not to be fruitful as a
>> >> >>> whole, I'd like for us to still have left a positive impact on the codebase.
>> >> >>> ...
>> >> >>> So what's next? Naturally, I'm looking forward to a spirited
>> >> >>> discussion about this topic - I'd like to know which concerns
>> >> >>> haven't been addressed and figure out whether we can find a way
>> >> >>> around them, and generally build awareness of this effort with the
>community.
>> >> >>
>> >> >> On of the gravest concerns is that the devil is in the details.
>> >> >>
>> >> >> For example, "die() is inconvenient to callers, let's propagate
>> >> >> errors up the callchain" is an easy thing to say, but it would
>> >> >> take much more than "let's propagate errors up" to libify
>> >> >> something like
>> >> >> check_connected() to do the same thing without spawning a
>> >> >> separate process that is expected to exit with failure.
>> >> >
>> >> >
>> >> > What does "propagate errors up the callchain" mean?  One
>> >> > interpretation I can think of seems quite horrible, but another
>> >> > seems quite doable and reasonable and likely not even very
>> >> > invasive of the existing code:
>> >> >
>> >> > You can use setjmp/longjmp to implement a form of "try", so that
>> >> > errors dont have to be *explicitly* returned *in* the call chain.
>> >> > And you could probably do so without changing very much of the
>> >> > existing code at all, and maintain a high level of conceptual
>> >> > alignment with the current code strategy.
>> >>
>> >> Using setjmp/longjmp is an interesting suggestion, I think lua does
>> >> something similar to what you describe for perl. However I think
>> >> both of those use a allocator with garbage collection. I worry that
>> >> using longjmp in git would be more invasive (or result in more
>> >> memory leaks) as we'd need to to guard each allocation with some
>> >> code to clean it up and then propagate the error. That means we're
>> >> back to manually propagating errors up the call chain in many cases.
>> >
>> >We could just use talloc [1].
>>
>> talloc is not portable.
>
>What makes you say that?

talloc is not part of a POSIX standard I could find. Aside from that:

$ man talloc
(on NonStop) No manual entry for talloc
(on Cygwin) No manual entry for talloc

Just reporting my findings.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: Proposal/Discussion: Turning parts of Git into libraries
  2023-03-23 23:37 ` Felipe Contreras
@ 2023-03-23 23:44   ` rsbecker
  0 siblings, 0 replies; 37+ messages in thread
From: rsbecker @ 2023-03-23 23:44 UTC (permalink / raw)
  To: 'Felipe Contreras', 'Emily Shaffer'
  Cc: 'Git List', 'Jonathan Nieder',
	'Jose Lopes', 'Aleksandr Mikhailov'

On Thursday, March 23, 2023 7:37 PM, Felipe Contreras wrote:
>On Fri, Feb 17, 2023 at 3:45 PM Emily Shaffer <nasamuffin@google.com> wrote:
>
>> As I mentioned in standup this week[1], my colleagues and I at Google
>> have become very interested in converting parts of Git into libraries
>> usable by external programs. In other words, for some modules which
>> already have clear boundaries inside of Git - like config.[ch],
>> strbuf.[ch], etc. - we want to remove some implicit dependencies, like
>> references to globals, and make explicit other dependencies, like
>> references to other modules within Git. Eventually, we'd like both for
>> an external program to use Git libraries within its own process, and
>> for Git to be given an alternative implementation of a library it uses
>> internally (like a plugin at runtime).
>
>This is obviously the way it should have been done from the beginning, but
>unfortunately at this point the Git project has too much inertia and too many vested
>interests from multi-billion dollar corporations to change.
>
>I wonder if a single person who isn't paid to work on Git commented on this thread.

Raises hand.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-03-23 23:42             ` rsbecker
@ 2023-03-23 23:55               ` Felipe Contreras
  2023-03-24 19:27                 ` rsbecker
  0 siblings, 1 reply; 37+ messages in thread
From: Felipe Contreras @ 2023-03-23 23:55 UTC (permalink / raw)
  To: rsbecker
  Cc: phillip.wood, demerphq, Junio C Hamano, Emily Shaffer, Git List,
	Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

On Thu, Mar 23, 2023 at 5:43 PM <rsbecker@nexbridge.com> wrote:
>
> On Thursday, March 23, 2023 7:35 PM, Felipe Contreras wrote:
> >On Thu, Mar 23, 2023 at 5:30 PM <rsbecker@nexbridge.com> wrote:
> >>
> >> On Thursday, March 23, 2023 7:22 PM, Felipe Contreras wrote:
> >> >On Sat, Feb 18, 2023 at 5:12 AM Phillip Wood <phillip.wood123@gmail.com>
> >wrote:
> >> >>
> >> >> On 18/02/2023 01:59, demerphq wrote:
> >> >> > On Sat, 18 Feb 2023 at 00:24, Junio C Hamano <gitster@pobox.com> wrote:
> >> >> >>
> >> >> >> Emily Shaffer <nasamuffin@google.com> writes:
> >> >> >>
> >> >> >>> Basically, if this effort turns out not to be fruitful as a
> >> >> >>> whole, I'd like for us to still have left a positive impact on the codebase.
> >> >> >>> ...
> >> >> >>> So what's next? Naturally, I'm looking forward to a spirited
> >> >> >>> discussion about this topic - I'd like to know which concerns
> >> >> >>> haven't been addressed and figure out whether we can find a way
> >> >> >>> around them, and generally build awareness of this effort with the
> >community.
> >> >> >>
> >> >> >> On of the gravest concerns is that the devil is in the details.
> >> >> >>
> >> >> >> For example, "die() is inconvenient to callers, let's propagate
> >> >> >> errors up the callchain" is an easy thing to say, but it would
> >> >> >> take much more than "let's propagate errors up" to libify
> >> >> >> something like
> >> >> >> check_connected() to do the same thing without spawning a
> >> >> >> separate process that is expected to exit with failure.
> >> >> >
> >> >> >
> >> >> > What does "propagate errors up the callchain" mean?  One
> >> >> > interpretation I can think of seems quite horrible, but another
> >> >> > seems quite doable and reasonable and likely not even very
> >> >> > invasive of the existing code:
> >> >> >
> >> >> > You can use setjmp/longjmp to implement a form of "try", so that
> >> >> > errors dont have to be *explicitly* returned *in* the call chain.
> >> >> > And you could probably do so without changing very much of the
> >> >> > existing code at all, and maintain a high level of conceptual
> >> >> > alignment with the current code strategy.
> >> >>
> >> >> Using setjmp/longjmp is an interesting suggestion, I think lua does
> >> >> something similar to what you describe for perl. However I think
> >> >> both of those use a allocator with garbage collection. I worry that
> >> >> using longjmp in git would be more invasive (or result in more
> >> >> memory leaks) as we'd need to to guard each allocation with some
> >> >> code to clean it up and then propagate the error. That means we're
> >> >> back to manually propagating errors up the call chain in many cases.
> >> >
> >> >We could just use talloc [1].
> >>
> >> talloc is not portable.
> >
> >What makes you say that?
>
> talloc is not part of a POSIX standard I could find.

It's a library, like: z, ssl, curl, pcre2-8, etc. Libraries can be
compiled on different platforms.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: Proposal/Discussion: Turning parts of Git into libraries
  2023-03-23 23:55               ` Felipe Contreras
@ 2023-03-24 19:27                 ` rsbecker
  2023-03-24 21:21                   ` Felipe Contreras
  0 siblings, 1 reply; 37+ messages in thread
From: rsbecker @ 2023-03-24 19:27 UTC (permalink / raw)
  To: 'Felipe Contreras'
  Cc: phillip.wood, 'demerphq', 'Junio C Hamano',
	'Emily Shaffer', 'Git List',
	'Jonathan Nieder', 'Jose Lopes',
	'Aleksandr Mikhailov'

On Thursday, March 23, 2023 7:55 PM, Felipe Contreras wrote:
>On Thu, Mar 23, 2023 at 5:43 PM <rsbecker@nexbridge.com> wrote:
>>
>> On Thursday, March 23, 2023 7:35 PM, Felipe Contreras wrote:
>> >On Thu, Mar 23, 2023 at 5:30 PM <rsbecker@nexbridge.com> wrote:
>> >>
>> >> On Thursday, March 23, 2023 7:22 PM, Felipe Contreras wrote:
>> >> >On Sat, Feb 18, 2023 at 5:12 AM Phillip Wood
>> >> ><phillip.wood123@gmail.com>
>> >wrote:
>> >> >>
>> >> >> On 18/02/2023 01:59, demerphq wrote:
>> >> >> > On Sat, 18 Feb 2023 at 00:24, Junio C Hamano <gitster@pobox.com>
>wrote:
>> >> >> >>
>> >> >> >> Emily Shaffer <nasamuffin@google.com> writes:
>> >> >> >>
>> >> >> >>> Basically, if this effort turns out not to be fruitful as a
>> >> >> >>> whole, I'd like for us to still have left a positive impact on the codebase.
>> >> >> >>> ...
>> >> >> >>> So what's next? Naturally, I'm looking forward to a spirited
>> >> >> >>> discussion about this topic - I'd like to know which
>> >> >> >>> concerns haven't been addressed and figure out whether we
>> >> >> >>> can find a way around them, and generally build awareness of
>> >> >> >>> this effort with the
>> >community.
>> >> >> >>
>> >> >> >> On of the gravest concerns is that the devil is in the details.
>> >> >> >>
>> >> >> >> For example, "die() is inconvenient to callers, let's
>> >> >> >> propagate errors up the callchain" is an easy thing to say,
>> >> >> >> but it would take much more than "let's propagate errors up"
>> >> >> >> to libify something like
>> >> >> >> check_connected() to do the same thing without spawning a
>> >> >> >> separate process that is expected to exit with failure.
>> >> >> >
>> >> >> >
>> >> >> > What does "propagate errors up the callchain" mean?  One
>> >> >> > interpretation I can think of seems quite horrible, but
>> >> >> > another seems quite doable and reasonable and likely not even
>> >> >> > very invasive of the existing code:
>> >> >> >
>> >> >> > You can use setjmp/longjmp to implement a form of "try", so
>> >> >> > that errors dont have to be *explicitly* returned *in* the call chain.
>> >> >> > And you could probably do so without changing very much of the
>> >> >> > existing code at all, and maintain a high level of conceptual
>> >> >> > alignment with the current code strategy.
>> >> >>
>> >> >> Using setjmp/longjmp is an interesting suggestion, I think lua
>> >> >> does something similar to what you describe for perl. However I
>> >> >> think both of those use a allocator with garbage collection. I
>> >> >> worry that using longjmp in git would be more invasive (or
>> >> >> result in more memory leaks) as we'd need to to guard each
>> >> >> allocation with some code to clean it up and then propagate the
>> >> >> error. That means we're back to manually propagating errors up the call
>chain in many cases.
>> >> >
>> >> >We could just use talloc [1].
>> >>
>> >> talloc is not portable.
>> >
>> >What makes you say that?
>>
>> talloc is not part of a POSIX standard I could find.
>
>It's a library, like: z, ssl, curl, pcre2-8, etc. Libraries can be compiled on different
>platforms.

talloc adds additional *required* dependencies to git, including python3 - required to configure and build talloc - which is not available on the NonStop ia64 platform (required support through end of 2025). I must express my resistance to what would amount to losing support for git on this NonStop platform.

--Randall


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-03-24 19:27                 ` rsbecker
@ 2023-03-24 21:21                   ` Felipe Contreras
  2023-03-24 22:06                     ` rsbecker
  0 siblings, 1 reply; 37+ messages in thread
From: Felipe Contreras @ 2023-03-24 21:21 UTC (permalink / raw)
  To: rsbecker
  Cc: phillip.wood, demerphq, Junio C Hamano, Emily Shaffer, Git List,
	Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

On Fri, Mar 24, 2023 at 1:28 PM <rsbecker@nexbridge.com> wrote:
>
> On Thursday, March 23, 2023 7:55 PM, Felipe Contreras wrote:
> >On Thu, Mar 23, 2023 at 5:43 PM <rsbecker@nexbridge.com> wrote:
> >>
> >> On Thursday, March 23, 2023 7:35 PM, Felipe Contreras wrote:
> >> >On Thu, Mar 23, 2023 at 5:30 PM <rsbecker@nexbridge.com> wrote:
> >> >>
> >> >> On Thursday, March 23, 2023 7:22 PM, Felipe Contreras wrote:
> >> >> >On Sat, Feb 18, 2023 at 5:12 AM Phillip Wood
> >> >> ><phillip.wood123@gmail.com>
> >> >wrote:
> >> >> >>
> >> >> >> On 18/02/2023 01:59, demerphq wrote:
> >> >> >> > On Sat, 18 Feb 2023 at 00:24, Junio C Hamano <gitster@pobox.com>
> >wrote:
> >> >> >> >>
> >> >> >> >> Emily Shaffer <nasamuffin@google.com> writes:
> >> >> >> >>
> >> >> >> >>> Basically, if this effort turns out not to be fruitful as a
> >> >> >> >>> whole, I'd like for us to still have left a positive impact on the codebase.
> >> >> >> >>> ...
> >> >> >> >>> So what's next? Naturally, I'm looking forward to a spirited
> >> >> >> >>> discussion about this topic - I'd like to know which
> >> >> >> >>> concerns haven't been addressed and figure out whether we
> >> >> >> >>> can find a way around them, and generally build awareness of
> >> >> >> >>> this effort with the
> >> >community.
> >> >> >> >>
> >> >> >> >> On of the gravest concerns is that the devil is in the details.
> >> >> >> >>
> >> >> >> >> For example, "die() is inconvenient to callers, let's
> >> >> >> >> propagate errors up the callchain" is an easy thing to say,
> >> >> >> >> but it would take much more than "let's propagate errors up"
> >> >> >> >> to libify something like
> >> >> >> >> check_connected() to do the same thing without spawning a
> >> >> >> >> separate process that is expected to exit with failure.
> >> >> >> >
> >> >> >> >
> >> >> >> > What does "propagate errors up the callchain" mean?  One
> >> >> >> > interpretation I can think of seems quite horrible, but
> >> >> >> > another seems quite doable and reasonable and likely not even
> >> >> >> > very invasive of the existing code:
> >> >> >> >
> >> >> >> > You can use setjmp/longjmp to implement a form of "try", so
> >> >> >> > that errors dont have to be *explicitly* returned *in* the call chain.
> >> >> >> > And you could probably do so without changing very much of the
> >> >> >> > existing code at all, and maintain a high level of conceptual
> >> >> >> > alignment with the current code strategy.
> >> >> >>
> >> >> >> Using setjmp/longjmp is an interesting suggestion, I think lua
> >> >> >> does something similar to what you describe for perl. However I
> >> >> >> think both of those use a allocator with garbage collection. I
> >> >> >> worry that using longjmp in git would be more invasive (or
> >> >> >> result in more memory leaks) as we'd need to to guard each
> >> >> >> allocation with some code to clean it up and then propagate the
> >> >> >> error. That means we're back to manually propagating errors up the call
> >chain in many cases.
> >> >> >
> >> >> >We could just use talloc [1].
> >> >>
> >> >> talloc is not portable.
> >> >
> >> >What makes you say that?
> >>
> >> talloc is not part of a POSIX standard I could find.
> >
> >It's a library, like: z, ssl, curl, pcre2-8, etc. Libraries can be compiled on different
> >platforms.
>
> talloc adds additional *required* dependencies to git, including python3 - required to configure and build talloc - which is not available on the NonStop ia64 platform (required support through end of 2025). I must express my resistance to what would amount to losing support for git on this NonStop platform.

That is not true. You don't need python3 for talloc, not even to build
it, it's just a single simple c file, it's easy to compile.

The only reason python is used is to run waf, which is used to build
Samba, which is much more complex, but you don't need to run it,
especially if you know the characteristics of your system.

This simple Makefile builds libtalloc.so just fine:

  CC := gcc
  CFLAGS := -fPIC -I./lib/replace
  LDFLAGS := -Wl,--no-undefined

  # For talloc.c
  CFLAGS += -DTALLOC_BUILD_VERSION_MAJOR=2
-DTALLOC_BUILD_VERSION_MINOR=4 -DTALLOC_BUILD_VERSION_RELEASE=0
  CFLAGS += -DHAVE_CONSTRUCTOR_ATTRIBUTE -DHAVE_VA_COPY
-DHAVE_VALGRIND_MEMCHECK_H -DHAVE_INTPTR_T

  # For replace.h
  CFLAGS += -DNO_CONFIG_H -D__STDC_WANT_LIB_EXT1__=1
  CFLAGS += -DHAVE_STDBOOL_H -DHAVE_BOOL -DHAVE_STRING_H
-DHAVE_LIMITS_H -DHAVE_STDINT_H
  CFLAGS += -DHAVE_DLFCN_H -DHAVE_UINTPTR_T -DHAVE_C99_VSNPRINTF
-DHAVE_MEMMOVE -DHAVE_STRNLEN -DHAVE_VSNPRINTF

  libtalloc.so: talloc.o
    $(CC) $(LDFLAGS) -shared -o $@ $^

But of course, most of those defines are not even needed with a simple
"replace.h" that is less than 10 lines of code.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: Proposal/Discussion: Turning parts of Git into libraries
  2023-03-24 21:21                   ` Felipe Contreras
@ 2023-03-24 22:06                     ` rsbecker
  2023-03-24 22:29                       ` Felipe Contreras
  0 siblings, 1 reply; 37+ messages in thread
From: rsbecker @ 2023-03-24 22:06 UTC (permalink / raw)
  To: 'Felipe Contreras'
  Cc: phillip.wood, 'demerphq', 'Junio C Hamano',
	'Emily Shaffer', 'Git List',
	'Jonathan Nieder', 'Jose Lopes',
	'Aleksandr Mikhailov'

On Friday, March 24, 2023 5:22 PM, Felipe Contreras wrote:
>On Fri, Mar 24, 2023 at 1:28 PM <rsbecker@nexbridge.com> wrote:
>>
>> On Thursday, March 23, 2023 7:55 PM, Felipe Contreras wrote:
>> >On Thu, Mar 23, 2023 at 5:43 PM <rsbecker@nexbridge.com> wrote:
>> >>
>> >> On Thursday, March 23, 2023 7:35 PM, Felipe Contreras wrote:
>> >> >On Thu, Mar 23, 2023 at 5:30 PM <rsbecker@nexbridge.com> wrote:
>> >> >>
>> >> >> On Thursday, March 23, 2023 7:22 PM, Felipe Contreras wrote:
>> >> >> >On Sat, Feb 18, 2023 at 5:12 AM Phillip Wood
>> >> >> ><phillip.wood123@gmail.com>
>> >> >wrote:
>> >> >> >>
>> >> >> >> On 18/02/2023 01:59, demerphq wrote:
>> >> >> >> > On Sat, 18 Feb 2023 at 00:24, Junio C Hamano
>> >> >> >> > <gitster@pobox.com>
>> >wrote:
>> >> >> >> >>
>> >> >> >> >> Emily Shaffer <nasamuffin@google.com> writes:
>> >> >> >> >>
>> >> >> >> >>> Basically, if this effort turns out not to be fruitful as
>> >> >> >> >>> a whole, I'd like for us to still have left a positive impact on the
>codebase.
>> >> >> >> >>> ...
>> >> >> >> >>> So what's next? Naturally, I'm looking forward to a
>> >> >> >> >>> spirited discussion about this topic - I'd like to know
>> >> >> >> >>> which concerns haven't been addressed and figure out
>> >> >> >> >>> whether we can find a way around them, and generally
>> >> >> >> >>> build awareness of this effort with the
>> >> >community.
>> >> >> >> >>
>> >> >> >> >> On of the gravest concerns is that the devil is in the details.
>> >> >> >> >>
>> >> >> >> >> For example, "die() is inconvenient to callers, let's
>> >> >> >> >> propagate errors up the callchain" is an easy thing to
>> >> >> >> >> say, but it would take much more than "let's propagate errors up"
>> >> >> >> >> to libify something like
>> >> >> >> >> check_connected() to do the same thing without spawning a
>> >> >> >> >> separate process that is expected to exit with failure.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > What does "propagate errors up the callchain" mean?  One
>> >> >> >> > interpretation I can think of seems quite horrible, but
>> >> >> >> > another seems quite doable and reasonable and likely not
>> >> >> >> > even very invasive of the existing code:
>> >> >> >> >
>> >> >> >> > You can use setjmp/longjmp to implement a form of "try", so
>> >> >> >> > that errors dont have to be *explicitly* returned *in* the call chain.
>> >> >> >> > And you could probably do so without changing very much of
>> >> >> >> > the existing code at all, and maintain a high level of
>> >> >> >> > conceptual alignment with the current code strategy.
>> >> >> >>
>> >> >> >> Using setjmp/longjmp is an interesting suggestion, I think
>> >> >> >> lua does something similar to what you describe for perl.
>> >> >> >> However I think both of those use a allocator with garbage
>> >> >> >> collection. I worry that using longjmp in git would be more
>> >> >> >> invasive (or result in more memory leaks) as we'd need to to
>> >> >> >> guard each allocation with some code to clean it up and then
>> >> >> >> propagate the error. That means we're back to manually
>> >> >> >> propagating errors up the call
>> >chain in many cases.
>> >> >> >
>> >> >> >We could just use talloc [1].
>> >> >>
>> >> >> talloc is not portable.
>> >> >
>> >> >What makes you say that?
>> >>
>> >> talloc is not part of a POSIX standard I could find.
>> >
>> >It's a library, like: z, ssl, curl, pcre2-8, etc. Libraries can be
>> >compiled on different platforms.
>>
>> talloc adds additional *required* dependencies to git, including python3 - required
>to configure and build talloc - which is not available on the NonStop ia64 platform
>(required support through end of 2025). I must express my resistance to what would
>amount to losing support for git on this NonStop platform.
>
>That is not true. You don't need python3 for talloc, not even to build it, it's just a
>single simple c file, it's easy to compile.
>
>The only reason python is used is to run waf, which is used to build Samba, which is
>much more complex, but you don't need to run it, especially if you know the
>characteristics of your system.
>
>This simple Makefile builds libtalloc.so just fine:
>
>  CC := gcc
>  CFLAGS := -fPIC -I./lib/replace
>  LDFLAGS := -Wl,--no-undefined
>
>  # For talloc.c
>  CFLAGS += -DTALLOC_BUILD_VERSION_MAJOR=2
>-DTALLOC_BUILD_VERSION_MINOR=4 -DTALLOC_BUILD_VERSION_RELEASE=0
>  CFLAGS += -DHAVE_CONSTRUCTOR_ATTRIBUTE -DHAVE_VA_COPY -
>DHAVE_VALGRIND_MEMCHECK_H -DHAVE_INTPTR_T
>
>  # For replace.h
>  CFLAGS += -DNO_CONFIG_H -D__STDC_WANT_LIB_EXT1__=1
>  CFLAGS += -DHAVE_STDBOOL_H -DHAVE_BOOL -DHAVE_STRING_H -
>DHAVE_LIMITS_H -DHAVE_STDINT_H
>  CFLAGS += -DHAVE_DLFCN_H -DHAVE_UINTPTR_T -DHAVE_C99_VSNPRINTF -
>DHAVE_MEMMOVE -DHAVE_STRNLEN -DHAVE_VSNPRINTF
>
>  libtalloc.so: talloc.o
>    $(CC) $(LDFLAGS) -shared -o $@ $^
>
>But of course, most of those defines are not even needed with a simple "replace.h"
>that is less than 10 lines of code.

The 2.4.0 version is substantially more complex than one .c file. The version I just unpacked has 61 C and H files, and requires more than a trivial makefile. 

config.h requires that the ./configure script runs (which it does not because of python3). replace.h is over 1000 lines in this version.

This also gets into git supply chain issues where people who want to build git, outside of specific platforms, need to modify the build environment associated with talloc, which changes the delivered signature.

Speaking purely from a selfish standpoint, this is more than a trivial amount of work at least for me (the above Makefile does not satisfy 2.4.0) that is proposed in order to make talloc a seamless addition. Running the talloc test suite, which I consider a requirement to adding this as a dependency, is also problematic for the reasons I previously indicated.

We may just have to agree to disagree, but I stand by my concern that this suggestion will cause maintenance issues given the current state of the talloc code - and I do not have the cycles to become a community maintainer of that code also.

--Randall


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Proposal/Discussion: Turning parts of Git into libraries
  2023-03-24 22:06                     ` rsbecker
@ 2023-03-24 22:29                       ` Felipe Contreras
  0 siblings, 0 replies; 37+ messages in thread
From: Felipe Contreras @ 2023-03-24 22:29 UTC (permalink / raw)
  To: rsbecker
  Cc: phillip.wood, demerphq, Junio C Hamano, Emily Shaffer, Git List,
	Jonathan Nieder, Jose Lopes, Aleksandr Mikhailov

On Fri, Mar 24, 2023 at 4:06 PM <rsbecker@nexbridge.com> wrote:
>
> On Friday, March 24, 2023 5:22 PM, Felipe Contreras wrote:
> >On Fri, Mar 24, 2023 at 1:28 PM <rsbecker@nexbridge.com> wrote:
> >>
> >> On Thursday, March 23, 2023 7:55 PM, Felipe Contreras wrote:
> >> >On Thu, Mar 23, 2023 at 5:43 PM <rsbecker@nexbridge.com> wrote:
> >> >>
> >> >> On Thursday, March 23, 2023 7:35 PM, Felipe Contreras wrote:
> >> >> >On Thu, Mar 23, 2023 at 5:30 PM <rsbecker@nexbridge.com> wrote:
> >> >> >>
> >> >> >> On Thursday, March 23, 2023 7:22 PM, Felipe Contreras wrote:
> >> >> >> >On Sat, Feb 18, 2023 at 5:12 AM Phillip Wood
> >> >> >> ><phillip.wood123@gmail.com>
> >> >> >wrote:
> >> >> >> >>
> >> >> >> >> On 18/02/2023 01:59, demerphq wrote:
> >> >> >> >> > On Sat, 18 Feb 2023 at 00:24, Junio C Hamano
> >> >> >> >> > <gitster@pobox.com>
> >> >wrote:
> >> >> >> >> >>
> >> >> >> >> >> Emily Shaffer <nasamuffin@google.com> writes:
> >> >> >> >> >>
> >> >> >> >> >>> Basically, if this effort turns out not to be fruitful as
> >> >> >> >> >>> a whole, I'd like for us to still have left a positive impact on the
> >codebase.
> >> >> >> >> >>> ...
> >> >> >> >> >>> So what's next? Naturally, I'm looking forward to a
> >> >> >> >> >>> spirited discussion about this topic - I'd like to know
> >> >> >> >> >>> which concerns haven't been addressed and figure out
> >> >> >> >> >>> whether we can find a way around them, and generally
> >> >> >> >> >>> build awareness of this effort with the
> >> >> >community.
> >> >> >> >> >>
> >> >> >> >> >> On of the gravest concerns is that the devil is in the details.
> >> >> >> >> >>
> >> >> >> >> >> For example, "die() is inconvenient to callers, let's
> >> >> >> >> >> propagate errors up the callchain" is an easy thing to
> >> >> >> >> >> say, but it would take much more than "let's propagate errors up"
> >> >> >> >> >> to libify something like
> >> >> >> >> >> check_connected() to do the same thing without spawning a
> >> >> >> >> >> separate process that is expected to exit with failure.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > What does "propagate errors up the callchain" mean?  One
> >> >> >> >> > interpretation I can think of seems quite horrible, but
> >> >> >> >> > another seems quite doable and reasonable and likely not
> >> >> >> >> > even very invasive of the existing code:
> >> >> >> >> >
> >> >> >> >> > You can use setjmp/longjmp to implement a form of "try", so
> >> >> >> >> > that errors dont have to be *explicitly* returned *in* the call chain.
> >> >> >> >> > And you could probably do so without changing very much of
> >> >> >> >> > the existing code at all, and maintain a high level of
> >> >> >> >> > conceptual alignment with the current code strategy.
> >> >> >> >>
> >> >> >> >> Using setjmp/longjmp is an interesting suggestion, I think
> >> >> >> >> lua does something similar to what you describe for perl.
> >> >> >> >> However I think both of those use a allocator with garbage
> >> >> >> >> collection. I worry that using longjmp in git would be more
> >> >> >> >> invasive (or result in more memory leaks) as we'd need to to
> >> >> >> >> guard each allocation with some code to clean it up and then
> >> >> >> >> propagate the error. That means we're back to manually
> >> >> >> >> propagating errors up the call
> >> >chain in many cases.
> >> >> >> >
> >> >> >> >We could just use talloc [1].
> >> >> >>
> >> >> >> talloc is not portable.
> >> >> >
> >> >> >What makes you say that?
> >> >>
> >> >> talloc is not part of a POSIX standard I could find.
> >> >
> >> >It's a library, like: z, ssl, curl, pcre2-8, etc. Libraries can be
> >> >compiled on different platforms.
> >>
> >> talloc adds additional *required* dependencies to git, including python3 - required
> >to configure and build talloc - which is not available on the NonStop ia64 platform
> >(required support through end of 2025). I must express my resistance to what would
> >amount to losing support for git on this NonStop platform.
> >
> >That is not true. You don't need python3 for talloc, not even to build it, it's just a
> >single simple c file, it's easy to compile.
> >
> >The only reason python is used is to run waf, which is used to build Samba, which is
> >much more complex, but you don't need to run it, especially if you know the
> >characteristics of your system.
> >
> >This simple Makefile builds libtalloc.so just fine:
> >
> >  CC := gcc
> >  CFLAGS := -fPIC -I./lib/replace
> >  LDFLAGS := -Wl,--no-undefined
> >
> >  # For talloc.c
> >  CFLAGS += -DTALLOC_BUILD_VERSION_MAJOR=2
> >-DTALLOC_BUILD_VERSION_MINOR=4 -DTALLOC_BUILD_VERSION_RELEASE=0
> >  CFLAGS += -DHAVE_CONSTRUCTOR_ATTRIBUTE -DHAVE_VA_COPY -
> >DHAVE_VALGRIND_MEMCHECK_H -DHAVE_INTPTR_T
> >
> >  # For replace.h
> >  CFLAGS += -DNO_CONFIG_H -D__STDC_WANT_LIB_EXT1__=1
> >  CFLAGS += -DHAVE_STDBOOL_H -DHAVE_BOOL -DHAVE_STRING_H -
> >DHAVE_LIMITS_H -DHAVE_STDINT_H
> >  CFLAGS += -DHAVE_DLFCN_H -DHAVE_UINTPTR_T -DHAVE_C99_VSNPRINTF -
> >DHAVE_MEMMOVE -DHAVE_STRNLEN -DHAVE_VSNPRINTF
> >
> >  libtalloc.so: talloc.o
> >    $(CC) $(LDFLAGS) -shared -o $@ $^
> >
> >But of course, most of those defines are not even needed with a simple "replace.h"
> >that is less than 10 lines of code.
>
> The 2.4.0 version is substantially more complex than one .c file. The version I just unpacked has 61 C and H files, and requires more than a trivial makefile.

Not to build libtalloc.so, it doesn't.

> config.h requires that the ./configure script runs (which it does not because of python3). replace.h is over 1000 lines in this version.

Neither config.h nor replace.h are needed.

> This also gets into git supply chain issues where people who want to build git, outside of specific platforms, need to modify the build environment associated with talloc, which changes the delivered signature.

The same happens with -lssl and many other libraries Git relies on.

People who can't or won't use the ssl library can *alternatively* use
block-sha1/sha1.c, the same can be done with talloc, which the Git
project can easily embed.

> Speaking purely from a selfish standpoint, this is more than a trivial amount of work at least for me (the above Makefile does not satisfy 2.4.0) that is proposed in order to make talloc a seamless addition.

Nobody is asking you to do anything.

> Running the talloc test suite, which I consider a requirement to adding this as a dependency, is also problematic for the reasons I previously indicated.

Nobody is proposing to make this a dependency.

> We may just have to agree to disagree, but I stand by my concern that this suggestion will cause maintenance issues given the current state of the talloc code - and I do not have the cycles to become a community maintainer of that code also.

That is simply not true. You have no idea what maintenance issues
could be caused by a hypothetical patch, because no such patch has
been put forward, and likely never will.

The fact is that the suggestion is perfectly *doable*.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2023-03-24 22:30 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-17 21:12 Proposal/Discussion: Turning parts of Git into libraries Emily Shaffer
2023-02-17 21:21 ` brian m. carlson
2023-02-17 21:38   ` Emily Shaffer
2023-02-17 22:41     ` brian m. carlson
2023-02-17 22:49       ` Emily Shaffer
2023-02-22 19:34         ` Jeff King
2023-02-24 20:31           ` Emily Shaffer
2023-02-24 21:41             ` Jeff King
2023-02-24 22:59             ` Junio C Hamano
2023-02-17 22:04   ` rsbecker
2023-02-17 22:48     ` brian m. carlson
2023-02-17 22:57 ` Junio C Hamano
2023-02-18  1:59   ` demerphq
2023-02-18 10:36     ` Phillip Wood
2023-03-23 23:22       ` Felipe Contreras
2023-03-23 23:30         ` rsbecker
2023-03-23 23:34           ` Felipe Contreras
2023-03-23 23:42             ` rsbecker
2023-03-23 23:55               ` Felipe Contreras
2023-03-24 19:27                 ` rsbecker
2023-03-24 21:21                   ` Felipe Contreras
2023-03-24 22:06                     ` rsbecker
2023-03-24 22:29                       ` Felipe Contreras
2023-02-21 21:42   ` Emily Shaffer
2023-02-22  0:22     ` Junio C Hamano
2023-02-18  4:05 ` Elijah Newren
2023-02-21 22:06   ` Emily Shaffer
2023-02-22  8:23     ` Elijah Newren
2023-02-22 19:25     ` Jeff King
2023-02-21 19:09 ` Taylor Blau
2023-02-21 22:27   ` Emily Shaffer
2023-02-22  1:44 ` Victoria Dye
2023-02-25  1:48   ` Jonathan Tan
2023-02-22 14:55 ` Derrick Stolee
2023-02-24 21:06   ` Emily Shaffer
2023-03-23 23:37 ` Felipe Contreras
2023-03-23 23:44   ` rsbecker

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).