git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* libification: how to avoid symbol collisions?
@ 2024-02-09  2:30 Kyle Lippincott
  2024-02-09 13:31 ` rsbecker
  2024-02-09 17:09 ` Junio C Hamano
  0 siblings, 2 replies; 6+ messages in thread
From: Kyle Lippincott @ 2024-02-09  2:30 UTC (permalink / raw
  To: git

While thinking about libification, I was wondering what we can/need to
do about symbols (specifically functions, since our libraries will
likely have few to no extern variables) that need to escape their
translation unit (.c file) but that we don't want to risk colliding
with symbols from our "host" project.

For any header that we're offering up as an API boundary we can have
prefixed names, but there are symbols from git-compat-util.h with
simple and likely common names like `die`, `usage`, error`, etc. I'm
far from an expert on linkers, but I'm under the impression that even
though we'll only be #including git-compat-util.h in our own .c files
(so the compiler for the host project wouldn't see them), the produced
static library will still be "providing" these symbols unless we mark
them as `static` (and if we mark them as `static`, they can't be used
outside of their translation unit). This means that if the host
project has a version of `die` (or links against yet another library
that does), we run into what C++ calls a One Definition Rule (ODR)
violation: we have two providers of the symbol `die` with different
implementations, and the behavior is undefined (no error needs to be
generated as far as I know).

With dynamic libraries I believe that we have more control over what
gets exposed, but I don't know of functionality for this when linking
statically. I'm assuming there is no such functionality, as projects
like openssl (ty Randall for mentioning this) appear to have a
convention of prefixing the symbols they put in their "public" API
(i.e. in non-internal header files) with things like OSSL_, and of
prefixing the symbols they put in their "private" APIs that can't be
marked as `static` with `ossl_`. I'd love to be wrong about this. :)

If I'm right that this is an issue, does this imply that we'd need to
rename every non-static function in the git codebase that's part of a
library to have a `git_` prefix, even if it won't be used outside of
the git project's own .c files? Is there a solution that doesn't
involve making it so that we have to type `git_` a billion times a day
that's also maintainable? We could try to guess at how likely a name
collision would be and only do this for ones where it's obviously
going to collide, but if we get it wrong, I'm concerned that we'll run
into subtle ODR violations that *at best* erode confidence in our
library, and can actually cause outages, data corruption, and
security/privacy issues.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: libification: how to avoid symbol collisions?
  2024-02-09  2:30 libification: how to avoid symbol collisions? Kyle Lippincott
@ 2024-02-09 13:31 ` rsbecker
  2024-02-13  1:02   ` Kyle Lippincott
  2024-02-09 17:09 ` Junio C Hamano
  1 sibling, 1 reply; 6+ messages in thread
From: rsbecker @ 2024-02-09 13:31 UTC (permalink / raw
  To: 'Kyle Lippincott', git

On Thursday, February 8, 2024 9:30 PM, Kyle Lippincott wrote:
>While thinking about libification, I was wondering what we can/need to do about
>symbols (specifically functions, since our libraries will likely have few to no extern
>variables) that need to escape their translation unit (.c file) but that we don't want
>to risk colliding with symbols from our "host" project.
>
>For any header that we're offering up as an API boundary we can have prefixed
>names, but there are symbols from git-compat-util.h with simple and likely
>common names like `die`, `usage`, error`, etc. I'm far from an expert on linkers, but
>I'm under the impression that even though we'll only be #including git-compat-
>util.h in our own .c files (so the compiler for the host project wouldn't see them),
>the produced static library will still be "providing" these symbols unless we mark
>them as `static` (and if we mark them as `static`, they can't be used outside of their
>translation unit). This means that if the host project has a version of `die` (or links
>against yet another library that does), we run into what C++ calls a One Definition
>Rule (ODR)
>violation: we have two providers of the symbol `die` with different
>implementations, and the behavior is undefined (no error needs to be generated as
>far as I know).
>
>With dynamic libraries I believe that we have more control over what gets exposed,
>but I don't know of functionality for this when linking statically. I'm assuming there
>is no such functionality, as projects like openssl (ty Randall for mentioning this)
>appear to have a convention of prefixing the symbols they put in their "public" API
>(i.e. in non-internal header files) with things like OSSL_, and of prefixing the symbols
>they put in their "private" APIs that can't be marked as `static` with `ossl_`. I'd love
>to be wrong about this. :)
>
>If I'm right that this is an issue, does this imply that we'd need to rename every non-
>static function in the git codebase that's part of a library to have a `git_` prefix, even
>if it won't be used outside of the git project's own .c files? Is there a solution that
>doesn't involve making it so that we have to type `git_` a billion times a day that's
>also maintainable? We could try to guess at how likely a name collision would be
>and only do this for ones where it's obviously going to collide, but if we get it wrong,
>I'm concerned that we'll run into subtle ODR violations that *at best* erode
>confidence in our library, and can actually cause outages, data corruption, and
>security/privacy issues.

I think we only need to do this for functions that are in the libification code-base for non-static symbols (and any data elements that may end up in a DLL some day). The bulk of the non-libified code base would only need to adapt to new symbol names if those symbols are specifically moved. die(), error(), are probably going to be impacted, but they can be aliased with #defines internally to git to git_die() or git_error(), for example.
--Randall



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: libification: how to avoid symbol collisions?
  2024-02-09  2:30 libification: how to avoid symbol collisions? Kyle Lippincott
  2024-02-09 13:31 ` rsbecker
@ 2024-02-09 17:09 ` Junio C Hamano
  2024-02-13  2:48   ` Kyle Lippincott
  1 sibling, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2024-02-09 17:09 UTC (permalink / raw
  To: Kyle Lippincott; +Cc: git

Kyle Lippincott <spectral@google.com> writes:

> If I'm right that this is an issue, does this imply that we'd need to
> rename every non-static function in the git codebase that's part of a
> library to have a `git_` prefix, even if it won't be used outside of
> the git project's own .c files? Is there a solution that doesn't
> involve making it so that we have to type `git_` a billion times a day
> that's also maintainable? We could try to guess at how likely a name
> collision would be and only do this for ones where it's obviously
> going to collide, but if we get it wrong, I'm concerned that we'll run
> into subtle ODR violations that *at best* erode confidence in our
> library, and can actually cause outages, data corruption, and
> security/privacy issues.

If you end up with a helper function name "foo" that is defined in
our X.c and used by our Y.c but is not part of the published "git
library API", we may want to rename it so that such a common name
can be used by programs that link with the "git library".  We may
choose to rename it to "GitLib_foo".

Do we want to keep the source of our library code, which defines the
helper function "foo" in X.c and calls it in Y.c, intact so that the
helper is still named "foo" as far as we are concerned?  Or do we
"bite the bullet" and bulk modify both the callers and the callee?

I'd imagine that we would rather avoid such a churn at all cost [*].
After all, "foo" is *not* supposed to be callable by any human
written code, and that is why we rename it to a name "GitLib_foo"
that is unlikely to overlap with any sane human would use.

	Side note: if a public API function that we want our library
	clients to call is misnamed, we want to rename it so that we
	would both internally and externally use the same public
	name, I would imagine.

The mechanics to rename should be a solved problem, I think, as we
are obviously not the first project that wants to build a library.

If the names are all simple, we could do this in CPP, i.e. invent a
header file that has bunch of such renames like

    #define foo GitLib_foo

and include it in both X.c and Y.c.  But "foo" may also appear as
the name of a type, a member in a structure, etc., where we may not
want to touch, so in a project larger than toy scale, this approach
may not work well.

"objcopy --redefine-sym" would probably be a better way.  I haven't
written a linker script, but I heard rumors that there is RENAME()
to rename symbols, and using that might be another avenue.




^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: libification: how to avoid symbol collisions?
  2024-02-09 13:31 ` rsbecker
@ 2024-02-13  1:02   ` Kyle Lippincott
  0 siblings, 0 replies; 6+ messages in thread
From: Kyle Lippincott @ 2024-02-13  1:02 UTC (permalink / raw
  To: rsbecker; +Cc: git

On Fri, Feb 9, 2024 at 5:31 AM <rsbecker@nexbridge.com> wrote:
>
> On Thursday, February 8, 2024 9:30 PM, Kyle Lippincott wrote:
> >While thinking about libification, I was wondering what we can/need to do about
> >symbols (specifically functions, since our libraries will likely have few to no extern
> >variables) that need to escape their translation unit (.c file) but that we don't want
> >to risk colliding with symbols from our "host" project.
> >
> >For any header that we're offering up as an API boundary we can have prefixed
> >names, but there are symbols from git-compat-util.h with simple and likely
> >common names like `die`, `usage`, error`, etc. I'm far from an expert on linkers, but
> >I'm under the impression that even though we'll only be #including git-compat-
> >util.h in our own .c files (so the compiler for the host project wouldn't see them),
> >the produced static library will still be "providing" these symbols unless we mark
> >them as `static` (and if we mark them as `static`, they can't be used outside of their
> >translation unit). This means that if the host project has a version of `die` (or links
> >against yet another library that does), we run into what C++ calls a One Definition
> >Rule (ODR)
> >violation: we have two providers of the symbol `die` with different
> >implementations, and the behavior is undefined (no error needs to be generated as
> >far as I know).
> >
> >With dynamic libraries I believe that we have more control over what gets exposed,
> >but I don't know of functionality for this when linking statically. I'm assuming there
> >is no such functionality, as projects like openssl (ty Randall for mentioning this)
> >appear to have a convention of prefixing the symbols they put in their "public" API
> >(i.e. in non-internal header files) with things like OSSL_, and of prefixing the symbols
> >they put in their "private" APIs that can't be marked as `static` with `ossl_`. I'd love
> >to be wrong about this. :)
> >
> >If I'm right that this is an issue, does this imply that we'd need to rename every non-
> >static function in the git codebase that's part of a library to have a `git_` prefix, even
> >if it won't be used outside of the git project's own .c files? Is there a solution that
> >doesn't involve making it so that we have to type `git_` a billion times a day that's
> >also maintainable? We could try to guess at how likely a name collision would be
> >and only do this for ones where it's obviously going to collide, but if we get it wrong,
> >I'm concerned that we'll run into subtle ODR violations that *at best* erode
> >confidence in our library, and can actually cause outages, data corruption, and
> >security/privacy issues.
>
> I think we only need to do this for functions that are in the libification code-base for non-static symbols (and any data elements that may end up in a DLL some day).

I believe the hope is that the majority/entirety of plumbing code will
be provided as a library, and we'll likely want to have a significant
portion of porcelain code as well. I think we're really talking about
(effectively) "all of git", but not all at once. If we attempt to make
things safe based on guesses about what's likely to collide with other
project's code, we'll (a) get it wrong, and only discover later when
they try to add our library to their project, and (b) have a
maintenance burden, where we now have to think about every function
name we introduce, which would not be fun (and we'll get it wrong.
Frequently.)

> The bulk of the non-libified code base would only need to adapt to new symbol names if those symbols are specifically moved.

I'm not following what you mean by "moved" here.

> die(), error(), are probably going to be impacted, but they can be aliased with #defines internally to git to git_die() or git_error(), for example.
> --Randall
>


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: libification: how to avoid symbol collisions?
  2024-02-09 17:09 ` Junio C Hamano
@ 2024-02-13  2:48   ` Kyle Lippincott
  2024-02-13 16:50     ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Kyle Lippincott @ 2024-02-13  2:48 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

On Fri, Feb 9, 2024 at 9:10 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Kyle Lippincott <spectral@google.com> writes:
>
> > If I'm right that this is an issue, does this imply that we'd need to
> > rename every non-static function in the git codebase that's part of a
> > library to have a `git_` prefix, even if it won't be used outside of
> > the git project's own .c files? Is there a solution that doesn't
> > involve making it so that we have to type `git_` a billion times a day
> > that's also maintainable? We could try to guess at how likely a name
> > collision would be and only do this for ones where it's obviously
> > going to collide, but if we get it wrong, I'm concerned that we'll run
> > into subtle ODR violations that *at best* erode confidence in our
> > library, and can actually cause outages, data corruption, and
> > security/privacy issues.
>
> If you end up with a helper function name "foo" that is defined in
> our X.c and used by our Y.c but is not part of the published "git
> library API", we may want to rename it so that such a common name
> can be used by programs that link with the "git library".  We may
> choose to rename it to "GitLib_foo".

If it's internal, we may want to name it with a different prefix than
GitLib, if we expect the exposed API of the library to have this
prefix, just as a signal to readers where the internal/external
boundaries are.

>
> Do we want to keep the source of our library code, which defines the
> helper function "foo" in X.c and calls it in Y.c, intact so that the
> helper is still named "foo" as far as we are concerned?  Or do we
> "bite the bullet" and bulk modify both the callers and the callee?
>
> I'd imagine that we would rather avoid such a churn at all cost [*].
> After all, "foo" is *not* supposed to be callable by any human
> written code, and that is why we rename it to a name "GitLib_foo"
> that is unlikely to overlap with any sane human would use.
>
>         Side note: if a public API function that we want our library
>         clients to call is misnamed, we want to rename it so that we
>         would both internally and externally use the same public
>         name, I would imagine.
>
> The mechanics to rename should be a solved problem, I think, as we
> are obviously not the first project that wants to build a library.
>
> If the names are all simple, we could do this in CPP,

At first I thought you meant C++, and I was like "Yeah, that's a
possible solution: when building a library, compile it as C++ with
name mangling, except for the symbols we intend to export!" -- this
was not what you meant, though. Kind of amusingly, that idea might
work, and might even be maintainable once we got to that state, but
getting to that state would be a lot of cleanup because of C++'s
stricter type system (`char *p = ptr;`, where `ptr` is a `void*` for
example; maybe this is a call to malloc or similar). Since the git
libraries don't exist yet, there's technically no worries about
backwards compatibility with requiring a C++ compiler.

> i.e. invent a
> header file that has bunch of such renames like
>
>     #define foo GitLib_foo
>
> and include it in both X.c and Y.c.  But "foo" may also appear as
> the name of a type, a member in a structure, etc., where we may not
> want to touch, so in a project larger than toy scale, this approach
> may not work well.

Glancing at the tags file, it looks like there's a small number of
cases where this would be problematic, and they're mostly things where
there's a function named the same thing as either a struct variable
storing the result of the function. So it could work, but there's over
3,500 symbols (if I did my filtering of the tags file correctly) that
are not scoped to a specific file (i.e. static), or
struct/enum/typedef/union names. That's going to be quite annoying to
maintain; even if we don't end up having to do all 3,500 symbols, for
the files that are part of some public library, we'd add maintenance
burden because we'd need to remember to either make every new function
be static, or add it to this list. I assume we could create a test
that would enforce this ("static, named with <prefix>, or added to
<list>"), so the issue is catchable, but it will be exceedingly
annoying every time one encounters this.

>
> "objcopy --redefine-sym" would probably be a better way.  I haven't
> written a linker script, but I heard rumors that there is RENAME()
> to rename symbols, and using that might be another avenue.
>
>

I'd thought of linker scripts, but rejected the idea due to
assumptions I made about their portability - this could be mitigated
by having a linker-script-generator step in the build process, but
this seems difficult to maintain. It also implies the same maintenance
burden as the #defines, where when introducing a new function to X.c
that is called from Y.c we'd have to edit the list of "symbols to
rename".


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: libification: how to avoid symbol collisions?
  2024-02-13  2:48   ` Kyle Lippincott
@ 2024-02-13 16:50     ` Junio C Hamano
  0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2024-02-13 16:50 UTC (permalink / raw
  To: Kyle Lippincott; +Cc: git

Kyle Lippincott <spectral@google.com> writes:

> struct/enum/typedef/union names. That's going to be quite annoying to
> maintain; even if we don't end up having to do all 3,500 symbols, for
> the files that are part of some public library, we'd add maintenance
> burden because we'd need to remember to either make every new function
> be static, or add it to this list. I assume we could create a test
> that would enforce this ("static, named with <prefix>, or added to
> <list>"), so the issue is catchable, but it will be exceedingly
> annoying every time one encounters this.

No matter how we do this, we'd need to maintain that list, so the
choices are between "#define" and "objcopy --redefine-sym" if we
limit ourselves to static linking, I think.  The former may be more
portable but makes me feel dirty.  The debuggers will not see the
names we want to use, for one thing.  "rename selected symbols in *.o
files" approach, if it can be done on platforms we want the lib thing
on, would be more preferable.

We also should be able to create a single linkable object (e.g., Z.o
out of X.o and Y.o from the previous example---it could be Z.so that
is dynamically linked at runtime) to resolve the symbols that need
to be shared only among the object files (like "foo" that is defined
in X.o whose address is needed in Y.o) and after such a linking is
done, these "internal" symbols can be stripped away from the
resulting object file.  For that, we'd also need to maintain that
list of internal symbols that are needed in order to link our
objects together.


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-02-13 16:50 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-09  2:30 libification: how to avoid symbol collisions? Kyle Lippincott
2024-02-09 13:31 ` rsbecker
2024-02-13  1:02   ` Kyle Lippincott
2024-02-09 17:09 ` Junio C Hamano
2024-02-13  2:48   ` Kyle Lippincott
2024-02-13 16:50     ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).