git@vger.kernel.org list mirror (unofficial, one of many)
* [FR] supporting submodules with alternate version control systems (new contributor)
@ 2022-05-10 16:11 Addison Klinke
  2022-05-10 17:00 ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Addison Klinke @ 2022-05-10 16:11 UTC
  To: git; +Cc: Addison Klinke

Hello all,

I'm familiar with open source software development through GitHub, but
have not contributed to git before so apologies if I'm using the wrong
avenues. Please point me in the right direction if that is the case. I
saw this mailing list mentioned on the
[mirror](https://github.com/git/git) repository, so it seemed like the
right place to start.

I have a feature request I'd like some feedback on. The core idea is
to support submodules with alternate (i.e. non-git based) version
control systems.

* **Why:** Git is excellent for versioning code and I don't need
another VCS for that purpose. However, in machine learning (ML)
workflows it has become more
[standard](https://opendatascience.com/how-data-versioning-can-be-used-in-machine-learning/)
to version your datasets, and for this purpose many git-like tools
have been developed. See [Dolt](https://www.dolthub.com/),
[LakeFS](https://lakefs.io/), and [DVC](https://dvc.org/) for a few
examples. Currently, ML practitioners have to bifurcate their
development process - code is committed/managed with git and datasets
are committed/managed with a 3rd party VCS (and often cloned in a
different folder outside the git repository). My proposal is to unify
the data versioning tools with git submodules so that they can act as
any other 3rd party library inside a parent repository.

* **How:** Most data versioning tools already define a git-like CLI.
For instance, you have "dolt commit", "dvc push", "lakectl diff", etc.
The set of commands and options is usually a subset of the full list
available in git, but the important ones are there. My approach would
require a few steps:

1. Git defines an API for configuring 3rd party VCS tools. It's
essentially a mapping from git command to the equivalent in the 3rd
party library. This should also account for which options/flags are
supported
2. Developers from the 3rd party library integrate with this git API
by maintaining a config file for the mapping that gets installed
alongside their binaries
3. The .gitmodules syntax is extended to include a "type" field which
defaults to git but can be set to other supported values
4. Then end-users can add submodules with an alternate VCS. Once
added, the CLI interaction would appear like normal git but under the
hood it would be using a different engine (and remote storage)
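
The four steps above could be sketched roughly as follows. This is only an illustration of the idea: the `TRANSLATIONS` table, the `type` values, and the `translate` helper are all hypothetical, and nothing like them exists in git today. Only the three tool commands named earlier in this message are used.

```python
# Hypothetical sketch: nothing here is an existing git API.
# Maps a submodule "type" to the git subcommands that tool says it supports,
# using only the tool commands mentioned in this message.
TRANSLATIONS = {
    "dolt": {"commit": ["dolt", "commit"]},
    "dvc": {"push": ["dvc", "push"]},
    "lakefs": {"diff": ["lakectl", "diff"]},
}


def translate(submodule_type, git_command, args=()):
    """Return the argv to run for `git <git_command>` inside a submodule."""
    if submodule_type == "git":  # step 3: "type" defaults to git
        return ["git", git_command, *args]
    try:
        prefix = TRANSLATIONS[submodule_type][git_command]
    except KeyError:
        # Step 1: the mapping also has to say what is NOT supported.
        raise ValueError(
            f"{git_command!r} is not supported for type {submodule_type!r}"
        ) from None
    return [*prefix, *args]


print(translate("dolt", "commit", ["-m", "add rows"]))
```

A real design would also need to describe per-command options and flags, which this table leaves out.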

Is something along these lines feasible? If so, could someone who is
more familiar with the code base give me a rough idea how one might go
about this? I would like to author the PR to implement this - just
looking for some help getting started.

Thank you for the help,

Addison


* Re: [FR] supporting submodules with alternate version control systems (new contributor)
  2022-05-10 16:11 [FR] supporting submodules with alternate version control systems (new contributor) Addison Klinke
@ 2022-05-10 17:00 ` Junio C Hamano
  2022-05-10 17:20   ` Jason Pyeron
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2022-05-10 17:00 UTC
  To: Addison Klinke; +Cc: git, Addison Klinke

Addison Klinke <addison@baller.tv> writes:

> Is something along these lines feasible?

Offhand, I can only think of one thing that could make it fundamentally
infeasible.

When you bind an external repository (be it stored in Git or
somebody else's system) as a submodule, each commit in the
superproject records which exact commit in the submodule is used
with the rest of the superproject tree.  And that is done by
recording the object name of the commit in the submodule.

What does it mean for the foreign system that wants to "plug into" a
superproject in Git as a submodule?  It is required to do two
things:

 * At the time "git commit" is run at the superproject level, the
   foreign system has to be able to say "the version I have to be
   used in the context of this superproject commit is X", with X
   that somehow can be stored in the superproject's tree object
   (which is sized 20-byte for SHA-1 repositories; in SHA-256
   repositories, it is a bit wider).

 * At the time "git checkout" is run at the superproject level, the
   superproject will learn the above X (i.e. the version of the
   submodule that goes with the version of the superproject being
   checked out).  The foreign system has to be able to perform a
   "checkout" given that X.

If a foreign system cannot do the above two, then it fundamentally
would be incapable of participating in such a "superproject and
submodule" relationship.
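
As a rough illustration of the two requirements above, a foreign system would need to offer something like the interface below. The class and method names are invented for this sketch, and the toy `InMemoryStore` merely stands in for a real data VCS.

```python
import hashlib
from abc import ABC, abstractmethod

# Object name sizes in bytes: 20 for SHA-1, 32 for SHA-256.
OID_BYTES = {"sha1": 20, "sha256": 32}


class ForeignSubmodule(ABC):
    """Hypothetical contract a non-git submodule would have to satisfy."""

    @abstractmethod
    def current_version(self) -> bytes:
        """Return X, the version to record in the superproject tree."""

    @abstractmethod
    def checkout(self, version: bytes) -> None:
        """Restore the working state identified by X."""


class InMemoryStore(ForeignSubmodule):
    """Toy stand-in for a data VCS: versions are SHA-1 digests of content."""

    def __init__(self):
        self._versions = {}
        self.content = b""

    def commit(self, content: bytes) -> bytes:
        version = hashlib.sha1(content).digest()
        self._versions[version] = content
        self.content = content
        return version

    def current_version(self) -> bytes:
        return hashlib.sha1(self.content).digest()

    def checkout(self, version: bytes) -> None:
        self.content = self._versions[version]


store = InMemoryStore()
x = store.commit(b"dataset v1")
store.commit(b"dataset v2")
assert len(x) == OID_BYTES["sha1"]  # X fits in a SHA-1 sized slot
store.checkout(x)                   # the superproject hands X back
assert store.content == b"dataset v1"
```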

Everything else I think is feasible in the sense that "it is just a
matter of programming".

It is a different story how it is implemented, how much it would
cost to do so, and if it is worth maintaining it as part of Git, so
I'd stop at "is it feasible?" here, not judging "if it is realistic"
at this point ;-).



* RE: [FR] supporting submodules with alternate version control systems (new contributor)
  2022-05-10 17:00 ` Junio C Hamano
@ 2022-05-10 17:20   ` Jason Pyeron
  2022-05-10 17:26     ` Addison Klinke
  2022-05-10 20:54     ` Philip Oakley
  0 siblings, 2 replies; 13+ messages in thread
From: Jason Pyeron @ 2022-05-10 17:20 UTC
  To: 'Junio C Hamano', 'Addison Klinke'
  Cc: git, 'Addison Klinke'

> -----Original Message-----
> From: Junio C Hamano
> Sent: Tuesday, May 10, 2022 1:01 PM
> To: Addison Klinke <addison@baller.tv>
> 
> Addison Klinke <addison@baller.tv> writes:
> 
> > Is something along these lines feasible?
> 
> Offhand, I only think of one thing that could make it fundamentally
> infeasible.
> 
> When you bind an external repository (be it stored in Git or
> somebody else's system) as a submodule, each commit in the
> superproject records which exact commit in the submodule is used
> with the rest of the superproject tree.  And that is done by
> recording the object name of the commit in the submodule.
> 
> What it means for the foreign system that wants to "plug into" a
> superproject in Git as a submodule?  It is required to do two
> things:
> 
>  * At the time "git commit" is run at the superproject level, the
>    foreign system has to be able to say "the version I have to be
>    used in the context of this superproject commit is X", with X
>    that somehow can be stored in the superproject's tree object
>    (which is sized 20-byte for SHA-1 repositories; in SHA-256
>    repositories, it is a bit wider).
> 
>  * At the time "git checkout" is run at the superproject level, the
>    superproject will learn the above X (i.e. the version of the
>    submodule that goes with the version of the superproject being
>    checked out).  The foreign system has to be able to perform a
>    "checkout" given that X.
> 
> If a foreign system cannot do the above two, then it fundamentally
> would be incapable of participating in such a "superproject and
> submodule" relationship.

The submodule "type" could create an object (hashed and stored) that contains the needed "translation" details. The object would be hashed using SHA1 or SHA256 depending on the git config. The format of the object's contents would be defined by the submodule's "code".
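
A sketch of that idea: the foreign version reference is wrapped in a small payload and named the way git names a blob (a hash over a `blob <size>\0` header followed by the content). The JSON payload layout and the `translation_object_id` helper are assumptions made up for illustration; only the blob hashing rule is git's own.

```python
import hashlib
import json


def git_blob_id(content: bytes, algo: str = "sha1") -> str:
    """Compute the object name git would give a blob with this content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


def translation_object_id(vcs_type: str, foreign_ref: str, algo: str = "sha1") -> str:
    # Hypothetical payload format; the proposal leaves it to the submodule type.
    payload = json.dumps(
        {"type": vcs_type, "ref": foreign_ref}, sort_keys=True
    ).encode()
    return git_blob_id(payload, algo)


print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (git's empty blob)
print(len(translation_object_id("dolt", "some-foreign-commit-id", "sha256")))  # 64
```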


--
Jason Pyeron  | Architect
PD Inc        | Certified SBA 8(a)
10 w 24th St  | Certified SBA HUBZone
Baltimore, MD | CAGE Code: 1WVR6
 
.mil: jason.j.pyeron.ctr@mail.mil
.com: jpyeron@pdinc.us
tel : 202-741-9397





* Re: [FR] supporting submodules with alternate version control systems (new contributor)
  2022-05-10 17:20   ` Jason Pyeron
@ 2022-05-10 17:26     ` Addison Klinke
  2022-05-10 18:26       ` rsbecker
  2022-05-10 20:54     ` Philip Oakley
  1 sibling, 1 reply; 13+ messages in thread
From: Addison Klinke @ 2022-05-10 17:26 UTC
  To: Jason Pyeron; +Cc: Junio C Hamano, git, Addison Klinke

Thanks for the quick replies

> Junio Hamano: When you bind an external repository (be it stored in Git or
> somebody else's system) as a submodule, each commit in the
> superproject records which exact commit in the submodule is used
> with the rest of the superproject tree.

This should be fine then - at least the data versioning tools I'm
familiar with can all specify their current commit and checkout by
commit hash. Does it matter how the hashes are structured/stored
internally? For example, I believe Dolt keeps them in a MySQL table
that connects to Noms under the hood.

> Junio Hamano: not judging "if it is realistic" at this point

What would be the best approach for answering this portion?

> Jason Pyeron: The submodule "type" could create an object (hashed and stored) that contains the needed "translation" details

That sounds like an interesting idea. Since I'd like to offload the
burden of maintaining these translation files to the 3rd party
developers, it would be nice if they got copied to a standard location
(e.g. ~/.gitmodules/translations/tool_x) during the 3rd party install.
Then when a submodule is added with "type = tool_x", git checks that
the appropriate translation file is available, and if so, copies it
into the parent repository.
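
Roughly, the check-and-copy step could look like the sketch below. Both directory layouts are hypothetical (git defines neither), and `install_translation` and the `.git-translations` destination are names invented for this example.

```python
import shutil
import tempfile
from pathlib import Path


def install_translation(tool, repo_root, translations_dir=None,
                        dest_subdir=".git-translations"):
    """Copy the translation file for `tool` into the parent repository.

    Raises FileNotFoundError if the 3rd party install did not provide one.
    """
    if translations_dir is None:
        # Standard location suggested in this message (an assumption).
        translations_dir = Path.home() / ".gitmodules" / "translations"
    source = Path(translations_dir) / tool
    if not source.is_file():
        raise FileNotFoundError(f"no translation file installed for {tool!r}")
    dest_dir = Path(repo_root) / dest_subdir
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, dest_dir / tool))


# Demo in a throwaway directory so nothing under $HOME is touched.
with tempfile.TemporaryDirectory() as tmp:
    translations = Path(tmp) / "translations"
    translations.mkdir()
    (translations / "tool_x").write_text("commit = tool_x commit\n")
    repo = Path(tmp) / "repo"
    repo.mkdir()
    copied = install_translation("tool_x", repo, translations)
    print(copied.read_text())
```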



* RE: [FR] supporting submodules with alternate version control systems (new contributor)
  2022-05-10 17:26     ` Addison Klinke
@ 2022-05-10 18:26       ` rsbecker
  0 siblings, 0 replies; 13+ messages in thread
From: rsbecker @ 2022-05-10 18:26 UTC
  To: 'Addison Klinke', 'Jason Pyeron'
  Cc: 'Junio C Hamano', git, 'Addison Klinke'

On May 10, 2022 1:27 PM, Addison Klinke wrote:
>Thanks for the quick replies
>
>> Junio Hamano: When you bind an external repository (be it stored in
>> Git or
>somebody else's system) as a submodule, each commit in the superproject
>records which exact commit in the submodule is used with the rest of the
>superproject tree.
>
>This should be fine then - at least the data versioning tools I'm familiar with can all
>specify their current commit and checkout by commit hash. Does it matter how
>the hashes are structured/stored internally? For example, I believe Dolt keeps
>them in a MySQL table that connects to Noms under the hood.
>
> > Junio Hamano: not judging "if it is realistic" at this point
>
>What would be the best approach for answering this portion?

Basically, answer the following: Can you implement a command like the cvs2git that can be re-executed on an idempotent (repeatedly with the same result) basis?

If yes, then you can build your own automation to move code from your own VCS into a git repository used as a submodule, and then work with the submodule without the git code-base knowing about this.

If you can go the other way, from git to your other VCS system, repeatedly, then you can go back again. This is likely to be much harder as git has a much richer representation model than is typical of VCS systems.

One way may be sufficient for your purposes. Research how cvs2git works and see whether you are able to emulate its functions.
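
To make "idempotent" concrete, here is a toy sketch (it is not how cvs2git itself works): each foreign revision maps to a deterministic identifier, so re-running the import over the same history adds nothing new and always produces the same result. All names are invented for illustration.

```python
import hashlib


def import_revisions(foreign_log, imported=None):
    """Import (revision_id, content) pairs into a dict keyed by a derived id.

    Running it again over the same log leaves `imported` unchanged.
    """
    imported = {} if imported is None else imported
    parent = ""
    for revision_id, content in foreign_log:
        # The derived id depends only on the parent id and the revision data,
        # never on timestamps or on how many times the import has run.
        digest = hashlib.sha1(
            f"{parent}\n{revision_id}\n".encode() + content
        ).hexdigest()
        imported.setdefault(digest, (revision_id, content))
        parent = digest
    return imported


log = [("r1", b"first"), ("r2", b"second")]
once = import_revisions(log)
twice = import_revisions(log, dict(once))
assert once == twice  # a second run changes nothing
```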

>> Jason Pyeron: The submodule "type" could create an object (hashed and
>> stored) that contains the needed "translation" details
>
>That sounds like an interesting idea. Since I'd like to offload the burden of
>maintaining these translation files to the 3rd party developers, it would be nice if
>they got copied to a standard location (i.e. ~/.gitmodules/translations/tool_x)
>during the 3rd party install.
>Then when a submodule is added with "type = tool_x", git checks that the
>appropriate translation file is available, and if so, copies it into the parent
>repository.
>
>On Tue, May 10, 2022 at 11:20 AM Jason Pyeron <jpyeron@pdinc.us> wrote:
>>
>> > -----Original Message-----
>> > From: Junio C Hamano
>> > Sent: Tuesday, May 10, 2022 1:01 PM
>> > To: Addison Klinke <addison@baller.tv>
>> >
>> > Addison Klinke <addison@baller.tv> writes:
>> >
>> > > Is something along these lines feasible?
>> >
>> > Offhand, I only think of one thing that could make it fundamentally
>> > infeasible.
>> >
>> > When you bind an external repository (be it stored in Git or
>> > somebody else's system) as a submodule, each commit in the
>> > superproject records which exact commit in the submodule is used
>> > with the rest of the superproject tree.  And that is done by
>> > recording the object name of the commit in the submodule.
>> >
>> > What it means for the foreign system that wants to "plug into" a
>> > superproject in Git as a submodule?  It is required to do two
>> > things:
>> >
>> >  * At the time "git commit" is run at the superproject level, the
>> >    foreign system has to be able to say "the version I have to be
>> >    used in the context of this superproject commit is X", with X
>> >    that somehow can be stored in the superproject's tree object
>> >    (which is sized 20-byte for SHA-1 repositories; in SHA-256
>> >    repositories, it is a bit wider).
>> >
>> >  * At the time "git checkout" is run at the superproject level, the
>> >    superproject will learn the above X (i.e. the version of the
>> >    submodule that goes with the version of the superproject being
>> >    checked out).  The foreign system has to be able to perform a
>> >    "checkout" given that X.
>> >
>> > If a foreign system cannot do the above two, then it fundamentally
>> > would be incapable of participating in such a "superproject and
>> > submodule" relationship.
>>
>> The submodule "type" could create an object (hashed and stored) that contains
>the needed "translation" details. The object would be hashed using SHA1 or
>SHA256 depending on the git config. The format of the object's contents would be
>defined by the submodule's "code".

I would not try to do this inside the git infrastructure. What you may be able to do, following my suggestion above, is to restrict how your other VCS is used and how your team uses git, so that the mapping is repeatable. This is typical of environments where an SVN repo and a git repo are mirrored. It simplifies matters, particularly if you do not have to modify either system but are building a façade or wrapper around both.

Keep this as simple as possible to meet a minimum viable set of requirements.
--Randal 



* Re: [FR] supporting submodules with alternate version control systems (new contributor)
  2022-05-10 17:20   ` Jason Pyeron
  2022-05-10 17:26     ` Addison Klinke
@ 2022-05-10 20:54     ` Philip Oakley
  2022-06-01 12:44       ` Addison Klinke
  1 sibling, 1 reply; 13+ messages in thread
From: Philip Oakley @ 2022-05-10 20:54 UTC
  To: Jason Pyeron, 'Junio C Hamano', 'Addison Klinke'
  Cc: git, 'Addison Klinke'

On 10/05/2022 18:20, Jason Pyeron wrote:
>> -----Original Message-----
>> From: Junio C Hamano
>> Sent: Tuesday, May 10, 2022 1:01 PM
>> To: Addison Klinke <addison@baller.tv>
>>
>> Addison Klinke <addison@baller.tv> writes:
>>
>>> Is something along these lines feasible?
>> Offhand, I only think of one thing that could make it fundamentally
>> infeasible.
>>
>> When you bind an external repository (be it stored in Git or
>> somebody else's system) as a submodule, each commit in the
>> superproject records which exact commit in the submodule is used
>> with the rest of the superproject tree.  And that is done by
>> recording the object name of the commit in the submodule.
>>
>> What it means for the foreign system that wants to "plug into" a
>> superproject in Git as a submodule?  It is required to do two
>> things:
>>
>>   * At the time "git commit" is run at the superproject level, the
>>     foreign system has to be able to say "the version I have to be
>>     used in the context of this superproject commit is X", with X
>>     that somehow can be stored in the superproject's tree object
>>     (which is sized 20-byte for SHA-1 repositories; in SHA-256
>>     repositories, it is a bit wider).
>>
>>   * At the time "git checkout" is run at the superproject level, the
>>     superproject will learn the above X (i.e. the version of the
>>     submodule that goes with the version of the superproject being
>>     checked out).  The foreign system has to be able to perform a
>>     "checkout" given that X.
>>
>> If a foreign system cannot do the above two, then it fundamentally
>> would be incapable of participating in such a "superproject and
>> submodule" relationship.

The sub-modules already have that problem if the user forgets to publish
their sub-module (see notes in the docs ;-).
> The submodule "type" could create an object (hashed and stored) that contains the needed "translation" details. The object would be hashed using SHA1 or SHA256 depending on the git config. The format of the object's contents would be defined by the submodule's "code".
>
Another way of looking at the issue is via a variant of Git-LFS with a 
smudge/clean style filter. I.e. the DataVCS would be treated as a 'file'.

The LFS already uses the .gitattributes to define a 'type', while the
submodules don't yet have that capability. There is just a single
special "sub-module" entry type within a tree object: a mode 160000
commit (see https://longair.net/blog/2010/06/02/git-submodules-explained/).
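
For reference, git records a sub-module entry in a tree with mode 160000 and type `commit`, and `git ls-tree` prints each entry as `<mode> <type> <object>TAB<path>`. A small sketch of recognising such a line (the helper names are mine, and the sample object name is a placeholder, not a real commit):

```python
GITLINK_MODE = "160000"  # tree-entry mode git uses for a submodule commit


def parse_ls_tree_line(line):
    """Split one `git ls-tree` output line into (mode, type, object, path)."""
    meta, path = line.rstrip("\n").split("\t", 1)
    mode, obj_type, obj_name = meta.split(" ")
    return mode, obj_type, obj_name, path


def is_submodule(line):
    """True if the tree entry is a gitlink, i.e. a submodule commit."""
    mode, obj_type, _, _ = parse_ls_tree_line(line)
    return mode == GITLINK_MODE and obj_type == "commit"


sample = "160000 commit " + "0" * 40 + "\tdata"  # placeholder object name
print(is_submodule(sample))                                   # True
print(is_submodule("100644 blob " + "0" * 40 + "\tREADME"))   # False
```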

One thought is to use a proper sub-module that contains a single
git-lfs style 'large' file holding the hash reference for the data VCS
(https://github.com/git-lfs/git-lfs/blob/main/docs/spec.md). The
regular sub-module's .gitattributes file would then handle the data
conversion.
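
A rough sketch of what such a pointer file could look like, modelled on the key/value layout of Git LFS pointers (`version`, `oid`, `size` lines). The `version` URL, the `datavcs` and `ref` keys, and both helper names are made up for illustration and are not part of the Git LFS spec.

```python
def write_pointer(tool, ref):
    """'clean' direction: turn a data VCS reference into small pointer text."""
    return (
        "version https://example.invalid/datavcs-pointer/v1\n"  # hypothetical
        f"datavcs {tool}\n"
        f"ref {ref}\n"
    )


def read_pointer(text):
    """'smudge' direction: recover (tool, ref) so the data VCS can check it out."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    if not fields.get("version", "").endswith("/datavcs-pointer/v1"):
        raise ValueError("not a data VCS pointer")
    return fields["datavcs"], fields["ref"]


pointer = write_pointer("dolt", "some-foreign-commit-id")
assert read_pointer(pointer) == ("dolt", "some-foreign-commit-id")
```

Git would only ever store and hash the few lines of pointer text; fetching the data behind the reference stays the job of the data VCS.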

It may be converting an X-Y problem into an X-Y-Z solution, or just 
extending the problem.

--
Philip




* Re: [FR] supporting submodules with alternate version control systems (new contributor)
  2022-05-10 20:54     ` Philip Oakley
@ 2022-06-01 12:44       ` Addison Klinke
  2022-06-03 23:06         ` Philip Oakley
  0 siblings, 1 reply; 13+ messages in thread
From: Addison Klinke @ 2022-06-01 12:44 UTC
  To: philipoakley; +Cc: Jason Pyeron, Junio C Hamano, git, Addison Klinke

> rsbecker: move code into a submodule from your own VCS system
> into a git repository and then work with the submodule without the git
> code-base knowing about this

> Philip: uses a proper sub-module that within it then has
> the single 'large' file git-lfs style that hosts the hash reference for
> the data VCS

The downside I see with both of these approaches is that translating
the native data VCS to git (or LFS) negates all the benefits of having
a VCS purpose-built for data. That's why the majority of data
versioning tools exist: git (or LFS) is not ideal for
handling machine learning datasets.



* Re: [FR] supporting submodules with alternate version control systems (new contributor)
  2022-06-01 12:44       ` Addison Klinke
@ 2022-06-03 23:06         ` Philip Oakley
  2022-06-04  2:01           ` rsbecker
  0 siblings, 1 reply; 13+ messages in thread
From: Philip Oakley @ 2022-06-03 23:06 UTC
  To: Addison Klinke; +Cc: Jason Pyeron, Junio C Hamano, git, Addison Klinke

On 01/06/2022 13:44, Addison Klinke wrote:
>> rsbecker: move code into a submodule from your own VCS system
> into a git repository and then work with the submodule without the git
> code-base knowing about this
>
>> Philip: uses a proper sub-module that within it then has
> the single 'large' file git-lfs style that hosts the hash reference for
> the data VCS
>
> The downside I see with both of these approaches is that translating
> the native data VCS to git (or LFS) negates all the benefits of having
> a VCS purpose-built for data. That's why the majority of data
> versioning tools exist - because git (or LFS) are not ideal for
> handling machine learning datasets

The key aspect is deciding which of the two storage systems (the Data &
the Code) will be the overall lead system that contains the linked
reference to the other storage system to ensure the needed integrity.
That is not really a technical question. Rather, it's somewhat of a social
discussion (workflows, trust, style of integration, etc).

It may be that one of the systems does have less long-term integrity, as
has been seen in many versioning systems over the last century (both
manual and computer), but the UI is also important.

IIRC Junio did note that having a suitable API to access the other
storage system (to know its status, etc.) is likely to be core to the
ability to combine the two. It may be that a top level 'gui' is used to
control both systems and ensure synchronisation, hiding the complexities
of both systems.

I'm still thinking that the "git-lfs like" style could be the one to
use, but that is very dependent on the API that is available for
capturing the Data state into the git entry that records that state,
whether that is a file (git-lfs like) or a 'sub-module' (directory as
state) style.  Either way it still needs reifying (i.e. coding to turn
the abstract concept into a concrete implementation).

Whichever route is chosen, it still sounds to me like a worthwhile
enterprise. It's all still very abstract.



* RE: [FR] supporting submodules with alternate version control systems (new contributor)
  2022-06-03 23:06         ` Philip Oakley
@ 2022-06-04  2:01           ` rsbecker
  2022-06-04 13:27             ` Philip Oakley
  0 siblings, 1 reply; 13+ messages in thread
From: rsbecker @ 2022-06-04  2:01 UTC
  To: 'Philip Oakley', 'Addison Klinke'
  Cc: 'Jason Pyeron', 'Junio C Hamano',
	git, 'Addison Klinke'

On June 3, 2022 7:07 PM, Philip Oakley wrote:
>On 01/06/2022 13:44, Addison Klinke wrote:
>>> rsbecker: move code into a submodule from your own VCS system
>> into a git repository and then work with the submodule without the git
>> code-base knowing about this
>>
>>> Philip: uses a proper sub-module that within it then has
>> the single 'large' file git-lfs style that hosts the hash reference
>> for the data VCS
>>
>> The downside I see with both of these approaches is that translating
>> the native data VCS to git (or LFS) negates all the benefits of having
>> a VCS purpose-built for data. That's why the majority of data
>> versioning tools exist - because git (or LFS) are not ideal for
>> handling machine learning datasets
>
>The key aspect is deciding which of the two storage systems (the Data & the Code)
>will be the overall lead system that contains the linked reference to the other
>storage system to ensure the needed integrity.
>That is not really a technical question. Rather, it's somewhat of a social discussion
>(workflows, trust, style of integration, etc).
>
>It may be that one of the systems does have less long-term integrity, as has been
>seen in many versioning systems over the last century (both manual and
>computer), but the UI is also important.
>
>IIRC Junio did note that having a suitable API to access the other storage system
>(to know its status, etc.) is likely to be core to the ability to combine the two. It
>may be that a top level 'gui' is used to control both systems and ensure
>synchronisation, hiding the complexities of both systems.
>
>I'm still thinking that the "git-lfs like" style could be the one to use, but that is very
>dependent on the API that is available for capturing the Data state into the git
>entry that records that state, whether that is a file (git-lfs like) or a 'sub-module'
>(directory as state) style.  Either way it still needs reifying (i.e. coding to turn the
>abstract concept into a concrete implementation).
>
>Whichever route is chosen, it still sounds to me like a worthwhile enterprise. It's
>all still very abstract.
>>
>> On Tue, May 10, 2022 at 2:54 PM Philip Oakley <philipoakley@iee.email> wrote:
>>> On 10/05/2022 18:20, Jason Pyeron wrote:
>>>>> -----Original Message-----
>>>>> From: Junio C Hamano
>>>>> Sent: Tuesday, May 10, 2022 1:01 PM
>>>>> To: Addison Klinke <addison@baller.tv>
>>>>>
>>>>> Addison Klinke <addison@baller.tv> writes:
>>>>>
>>>>>> Is something along these lines feasible?
>>>>> Offhand, I only think of one thing that could make it fundamentally
>>>>> infeasible.
>>>>>
>>>>> When you bind an external repository (be it stored in Git or
>>>>> somebody else's system) as a submodule, each commit in the
>>>>> superproject records which exact commit in the submodule is used
>>>>> with the rest of the superproject tree.  And that is done by
>>>>> recording the object name of the commit in the submodule.
>>>>>
>>>>> What it means for the foreign system that wants to "plug into" a
>>>>> superproject in Git as a submodule?  It is required to do two
>>>>> things:
>>>>>
>>>>>   * At the time "git commit" is run at the superproject level, the
>>>>>     foreign system has to be able to say "the version I have to be
>>>>>     used in the context of this superproject commit is X", with X
>>>>>     that somehow can be stored in the superproject's tree object
>>>>>     (which is sized 20-byte for SHA-1 repositories; in SHA-256
>>>>>     repositories, it is a bit wider).
>>>>>
>>>>>   * At the time "git chekcout" is run at the superproject level, the
>>>>>     superproject will learn the above X (i.e. the version of the
>>>>>     submodule that goes with the version of the superproject being
>>>>>     checked out).  The foreign system has to be able to perform a
>>>>>     "checkout" given that X.
>>>>>
>>>>> If a foreign system cannot do the above two, then it fundamentally
>>>>> would be incapable of participating in such a "superproject and
>>>>> submodule" relationship.
>>> The sub-modules already have that problem if the user forgets publish
>>> their sub-module (see notes in the docs ;-).
>>>> The submodule "type" could create an object (hashed and stored) that
>contains the needed "translation" details. The object would be hashed using SHA1
>or SHA256 depending on the git config. The format of the object's contents would
>be defined by the submodule's "code".
>>>>
>>> Another way of looking at the issue is via a variant of Git-LFS with
>>> a smudge/clean style filter. I.e. the DataVCS would be treated as a 'file'.
>>>
>>> The LFS already uses the .gitattributes to define a 'type', while the
>>> submodules don't yet have that capability. There is just a single
>>> special type within a tree object of "sub-module"  being a mode 16000
>>> commit (see https://longair.net/blog/2010/06/02/git-submodules-explained/).
>>>
>>> One thought is that one uses a proper sub-module that within it then
>>> has the single 'large' file git-lfs style that hosts the hash
>>> reference for the data VCS
>>> (https://github.com/git-lfs/git-lfs/blob/main/docs/spec.md). It would
>>> be the regular sub-modules .gitattributes file that handles the data
>>> conversion.
>>>
>>> It may be converting an X-Y problem into an X-Y-Z solution, or just
>>> extending the problem.

The most salient issue I have with this is that signatures cannot be validated across VCS systems. Within git, a submodule commit can be signed. This ensures that the contents of the commit in the super-project can also be signed. If someone hacks an underlying VCS that is not git, either:

a) git can never sign a commit from an underlying VCS, or

b) git can never trust a commit from an underlying VCS.

This pollutes a fundamental capability of git, namely having multiple signers of the contents of a commit, and invalidates the integrity of the Merkle tree that underlies git contents.
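
To make the Merkle tree point concrete, here is a minimal Python sketch (an illustration, not git's own code) of how a tree object's name covers the commit id recorded for a submodule entry. The helper names, the path "data" and the two commit ids are made up for the example; a real tree also sorts its entries, which is skipped here because there is only one.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 object name: hash of '<type> <size>', a NUL byte, then the body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_with_gitlink(path: str, submodule_commit_hex: str) -> bytes:
    """Body of a tree holding a single submodule entry (mode 160000)."""
    # Entry layout: "<mode> <name>", a NUL byte, then the raw 20-byte id.
    return b"160000 " + path.encode() + b"\0" + bytes.fromhex(submodule_commit_hex)


# Two hypothetical submodule commit ids, for illustration only.
commit_a = "1" * 40
commit_b = "2" * 40

tree_a = git_object_id("tree", tree_with_gitlink("data", commit_a))
tree_b = git_object_id("tree", tree_with_gitlink("data", commit_b))

# The tree name differs whenever the pinned submodule commit differs.
print(tree_a != tree_b)
```

A signature on the superproject commit therefore covers whatever id is recorded for the entry; whether that id pins down the content is up to the system that produced it.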

I do not see that this concept contributes positively to the ecosystem. I do feel strongly about this and hope my points are understood.

Sincerely,
Randall


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [FR] supporting submodules with alternate version control systems (new contributor)
  2022-06-04  2:01           ` rsbecker
@ 2022-06-04 13:27             ` Philip Oakley
  2022-06-04 15:57               ` rsbecker
  0 siblings, 1 reply; 13+ messages in thread
From: Philip Oakley @ 2022-06-04 13:27 UTC (permalink / raw)
  To: rsbecker, 'Addison Klinke'
  Cc: 'Jason Pyeron', 'Junio C Hamano',
	git, 'Addison Klinke'

Hi Randall,

On 04/06/2022 03:01, rsbecker@nexbridge.com wrote:
> On June 3, 2022 7:07 PM, Philip Oakley wrote:
>> On 01/06/2022 13:44, Addison Klinke wrote:
>>>> rsbecker: move code into a submodule from your own VCS system
>>> into a git repository and the work with the submodule without the git
>>> code-base knowing about this
>>>
>>>> Philip: uses a proper sub-module that within it then has
>>> the single 'large' file git-lfs style that hosts the hash reference
>>> for the data VCS
>>>
>>> The downside I see with both of these approaches is that translating
>>> the native data VCS to git (or LFS) negates all the benefits of having
>>> a VCS purpose-built for data. That's why the majority of data
>>> versioning tools exist - because git (or LFS) are not ideal for
>>> handling machine learning datasets
>> The key aspect is deciding which of the two storage systems (the Data & the Code)
>> will be the overall lead system that contains the linked reference to the other
>> storage system to ensure the needed integrity.
>> That is not really a technical question. Rather its somewhat of a social discussion
>> (workflows, trust, style of integration, etc).
>>
>> It maybe that one of the systems does have less long-term integrity, as has been
>> seen in many versioning systems over the last century (both manual and
>> computer), but the UI is also important.
>>
>> IIRC Junio did note that having a suitable API to access the other storage system
>> (to know its status, etc.) is likely to be core to the ability to combine the two. It
>> may  be that a top level 'gui' is used control both systems and ensure
>> synchronisation to hide the complexities of both systems.
>>
>> I'm still thinking that the "git-lfs like" style could be the one to use, but that is very
>> dependant on the API that is available for capturing the Data state into the git
>> entry that records that state, whether that is a file (git-lfs like) or a 'sub-module'
>> (directory as state ) style.  Either way it still need reifying (i.e. coded to make the
>> abstract concept into a concrete implementation).
>>
>> Which ever route is chosen, it still sounds to me like a worthwhile enterprise. It's
>> all still very abstract.
>>> On Tue, May 10, 2022 at 2:54 PM Philip Oakley <philipoakley@iee.email> wrote:
>>>> On 10/05/2022 18:20, Jason Pyeron wrote:
>>>>>> -----Original Message-----
>>>>>> From: Junio C Hamano
>>>>>> Sent: Tuesday, May 10, 2022 1:01 PM
>>>>>> To: Addison Klinke <addison@baller.tv>
>>>>>>
>>>>>> Addison Klinke <addison@baller.tv> writes:
>>>>>>
>>>>>>> Is something along these lines feasible?
>>>>>> Offhand, I only think of one thing that could make it fundamentally
>>>>>> infeasible.
>>>>>>
>>>>>> When you bind an external repository (be it stored in Git or
>>>>>> somebody else's system) as a submodule, each commit in the
>>>>>> superproject records which exact commit in the submodule is used
>>>>>> with the rest of the superproject tree.  And that is done by
>>>>>> recording the object name of the commit in the submodule.
>>>>>>
>>>>>> What it means for the foreign system that wants to "plug into" a
>>>>>> superproject in Git as a submodule?  It is required to do two
>>>>>> things:
>>>>>>
>>>>>>   * At the time "git commit" is run at the superproject level, the
>>>>>>     foreign system has to be able to say "the version I have to be
>>>>>>     used in the context of this superproject commit is X", with X
>>>>>>     that somehow can be stored in the superproject's tree object
>>>>>>     (which is sized 20-byte for SHA-1 repositories; in SHA-256
>>>>>>     repositories, it is a bit wider).
>>>>>>
>>>>>>   * At the time "git chekcout" is run at the superproject level, the
>>>>>>     superproject will learn the above X (i.e. the version of the
>>>>>>     submodule that goes with the version of the superproject being
>>>>>>     checked out).  The foreign system has to be able to perform a
>>>>>>     "checkout" given that X.
>>>>>>
>>>>>> If a foreign system cannot do the above two, then it fundamentally
>>>>>> would be incapable of participating in such a "superproject and
>>>>>> submodule" relationship.
>>>> The sub-modules already have that problem if the user forgets publish
>>>> their sub-module (see notes in the docs ;-).
>>>>> The submodule "type" could create an object (hashed and stored) that
>> contains the needed "translation" details. The object would be hashed using SHA1
>> or SHA256 depending on the git config. The format of the object's contents would
>> be defined by the submodule's "code".
>>>> Another way of looking at the issue is via a variant of Git-LFS with
>>>> a smudge/clean style filter. I.e. the DataVCS would be treated as a 'file'.
>>>>
>>>> The LFS already uses the .gitattributes to define a 'type', while the
>>>> submodules don't yet have that capability. There is just a single
>>>> special type within a tree object of "sub-module"  being a mode 16000
>>>> commit (see https://longair.net/blog/2010/06/02/git-submodules-explained/).
>>>>
>>>> One thought is that one uses a proper sub-module that within it then
>>>> has the single 'large' file git-lfs style that hosts the hash
>>>> reference for the data VCS
>>>> (https://github.com/git-lfs/git-lfs/blob/main/docs/spec.md). It would
>>>> be the regular sub-modules .gitattributes file that handles the data
>>>> conversion.
>>>>
>>>> It may be converting an X-Y problem into an X-Y-Z solution, or just
>>>> extending the problem.
> The most salient issue I have with this is that signatures cannot be validated across VCS systems. 

I think I disagree, but let's be sure we are talking about the same
'signature' aspect. I think there are (at least) three different
signatures we could be talking about:

1. The hash verification 'signature' that can cascade down the trees. We
verify against a given hash.
2. The 'Signed-off-by:' legal/copyright signature - important, but I
don't think that's the one being discussed.
3. The (e.g.) PGP signature of a tag or commit. This provides a (web of)
trust mechanism for the _given_ hash in 1. Important in 'open systems',
less so in more closed systems where trust, and the _given_, is via side
channels.
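
As a rough illustration of item 1 only (a sketch in Python, not git's own code), verifying content against a given hash looks like the following. The function names are made up for the example. Item 3 is a separate step: a PGP signature would vouch for the given hash itself, which this sketch does not attempt.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Name git gives a blob: SHA-1 over 'blob <size>', a NUL byte, then the content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def verify_blob(content: bytes, given_id: str) -> bool:
    """True when the content hashes to the id we were given."""
    return git_blob_id(content) == given_id


given = git_blob_id(b"test content\n")

print(verify_blob(b"test content\n", given))      # content matches the given id
print(verify_blob(b"tampered content\n", given))  # any change breaks the match
```

The check says nothing about who supplied the given id; that is where the trust mechanism in item 3 comes in.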

Note the shift from using a hash to using the PGP for the 'signature'.


> Within git, a submodule commit can be signed. This ensures that the contents of the commit in the super-project can also be signed. If someone hacks an underlying VCS that is not git, either:
A submodule is a remote VCS; it just happens to have the same hash
validation software as the super-project, which is nice.
>
> a) git can never sign a commit from an underlying VCS, or
Git-LFS is a similar hand-off, though many accept its capability.
>
> b) git can never trust a commit from an underlying VCS.
>
> This pollutes a fundamental capability of git, being multiple signers the contents of a commit, and invalidates the integrity of the Merkel tree that underlies git contents.

The main issue is how to confirm the integrity of the other VCS. Many of
the Data VCS systems are based on Git and its hash integrity approach,
so as long as the Data VCS has similar integrity guarantees, we maintain
the level of trust in the security of the whole system.
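
One way to picture "similar integrity guarantees", together with the two requirements quoted from Junio above (report a version X at commit time, and check out a given X), is the Python sketch below. Everything in it is hypothetical: `ForeignVcs` and `ToyDataStore` are invented names, and no existing Data VCS is claimed to expose this interface. The assumption being illustrated is that X is derived from the content, so a checkout can be verified against it.

```python
import hashlib
from typing import Dict, Protocol


class ForeignVcs(Protocol):
    """Hypothetical interface a foreign system would need to offer."""

    def current_version(self) -> bytes: ...

    def checkout(self, version: bytes) -> bytes: ...


class ToyDataStore:
    """Hypothetical in-memory data store whose version ids are content hashes."""

    def __init__(self) -> None:
        self._snapshots: Dict[bytes, bytes] = {}
        self._head = b""

    def commit(self, data: bytes) -> bytes:
        version = hashlib.sha1(data).digest()  # 20 bytes, same width as a SHA-1 object name
        self._snapshots[version] = data
        self._head = version
        return version

    def current_version(self) -> bytes:
        # Requirement 1: say which version X goes with the superproject commit.
        return self._head

    def checkout(self, version: bytes) -> bytes:
        # Requirement 2: restore the data for X, refusing content that no longer matches X.
        data = self._snapshots[version]
        if hashlib.sha1(data).digest() != version:
            raise ValueError("snapshot does not match the recorded version id")
        return data


store: ForeignVcs = ToyDataStore()
```

A store whose ids are not tied to content (a counter or a timestamp, say) could still satisfy the two requirements, but the superproject would have no way to notice a modified snapshot.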

>
> I do not see that this concept contributes positively to the ecosystem. I do feel strongly about this and hope my points are understood.

I'd agree that there is a need to work out how to integrate the code VCS
and data VCS in a consistent way. Ignoring the Data VCS problem doesn't
make it go away.

Maybe if Addison was able to identify one or two lead contenders as the
Data VCS and how it/they offer their levels of security and integrity,
then it would be easier to see where in the Git model that may fit. Or
whether Git is the underlying VCS (because it has a programmable API), and
the Data VCS (esp because of scale and non-distributed nature) becomes
the "authority", even if that has less capability!
>
> Sincerely,
> Randall
>
Philip

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [FR] supporting submodules with alternate version control systems (new contributor)
  2022-06-04 13:27             ` Philip Oakley
@ 2022-06-04 15:57               ` rsbecker
  2022-06-05 21:52                 ` Philip Oakley
  0 siblings, 1 reply; 13+ messages in thread
From: rsbecker @ 2022-06-04 15:57 UTC (permalink / raw)
  To: 'Philip Oakley', 'Addison Klinke'
  Cc: 'Jason Pyeron', 'Junio C Hamano',
	git, 'Addison Klinke'

On June 4, 2022 9:28 AM, Philip Oakley wrote:
>On 04/06/2022 03:01, rsbecker@nexbridge.com wrote:
>> On June 3, 2022 7:07 PM, Philip Oakley wrote:
>>> On 01/06/2022 13:44, Addison Klinke wrote:
>>>>> rsbecker: move code into a submodule from your own VCS system
>>>> into a git repository and the work with the submodule without the
>>>> git code-base knowing about this
>>>>
>>>>> Philip: uses a proper sub-module that within it then has
>>>> the single 'large' file git-lfs style that hosts the hash reference
>>>> for the data VCS
>>>>
>>>> The downside I see with both of these approaches is that translating
>>>> the native data VCS to git (or LFS) negates all the benefits of
>>>> having a VCS purpose-built for data. That's why the majority of data
>>>> versioning tools exist - because git (or LFS) are not ideal for
>>>> handling machine learning datasets
>>> The key aspect is deciding which of the two storage systems (the Data
>>> & the Code) will be the overall lead system that contains the linked
>>> reference to the other storage system to ensure the needed integrity.
>>> That is not really a technical question. Rather its somewhat of a
>>> social discussion (workflows, trust, style of integration, etc).
>>>
>>> It maybe that one of the systems does have less long-term integrity,
>>> as has been seen in many versioning systems over the last century
>>> (both manual and computer), but the UI is also important.
>>>
>>> IIRC Junio did note that having a suitable API to access the other
>>> storage system (to know its status, etc.) is likely to be core to the
>>> ability to combine the two. It may  be that a top level 'gui' is used
>>> control both systems and ensure synchronisation to hide the complexities of
>both systems.
>>>
>>> I'm still thinking that the "git-lfs like" style could be the one to
>>> use, but that is very dependant on the API that is available for
>>> capturing the Data state into the git entry that records that state, whether that
>is a file (git-lfs like) or a 'sub-module'
>>> (directory as state ) style.  Either way it still need reifying (i.e.
>>> coded to make the abstract concept into a concrete implementation).
>>>
>>> Which ever route is chosen, it still sounds to me like a worthwhile
>>> enterprise. It's all still very abstract.
>>>> On Tue, May 10, 2022 at 2:54 PM Philip Oakley <philipoakley@iee.email>
>wrote:
>>>>> On 10/05/2022 18:20, Jason Pyeron wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: Junio C Hamano
>>>>>>> Sent: Tuesday, May 10, 2022 1:01 PM
>>>>>>> To: Addison Klinke <addison@baller.tv>
>>>>>>>
>>>>>>> Addison Klinke <addison@baller.tv> writes:
>>>>>>>
>>>>>>>> Is something along these lines feasible?
>>>>>>> Offhand, I only think of one thing that could make it
>>>>>>> fundamentally infeasible.
>>>>>>>
>>>>>>> When you bind an external repository (be it stored in Git or
>>>>>>> somebody else's system) as a submodule, each commit in the
>>>>>>> superproject records which exact commit in the submodule is used
>>>>>>> with the rest of the superproject tree.  And that is done by
>>>>>>> recording the object name of the commit in the submodule.
>>>>>>>
>>>>>>> What it means for the foreign system that wants to "plug into" a
>>>>>>> superproject in Git as a submodule?  It is required to do two
>>>>>>> things:
>>>>>>>
>>>>>>>   * At the time "git commit" is run at the superproject level, the
>>>>>>>     foreign system has to be able to say "the version I have to be
>>>>>>>     used in the context of this superproject commit is X", with X
>>>>>>>     that somehow can be stored in the superproject's tree object
>>>>>>>     (which is sized 20-byte for SHA-1 repositories; in SHA-256
>>>>>>>     repositories, it is a bit wider).
>>>>>>>
>>>>>>>   * At the time "git chekcout" is run at the superproject level, the
>>>>>>>     superproject will learn the above X (i.e. the version of the
>>>>>>>     submodule that goes with the version of the superproject being
>>>>>>>     checked out).  The foreign system has to be able to perform a
>>>>>>>     "checkout" given that X.
>>>>>>>
>>>>>>> If a foreign system cannot do the above two, then it
>>>>>>> fundamentally would be incapable of participating in such a
>>>>>>> "superproject and submodule" relationship.
>>>>> The sub-modules already have that problem if the user forgets
>>>>> publish their sub-module (see notes in the docs ;-).
>>>>>> The submodule "type" could create an object (hashed and stored)
>>>>>> that
>>> contains the needed "translation" details. The object would be hashed
>>> using SHA1 or SHA256 depending on the git config. The format of the
>>> object's contents would be defined by the submodule's "code".
>>>>> Another way of looking at the issue is via a variant of Git-LFS
>>>>> with a smudge/clean style filter. I.e. the DataVCS would be treated as a 'file'.
>>>>>
>>>>> The LFS already uses the .gitattributes to define a 'type', while
>>>>> the submodules don't yet have that capability. There is just a
>>>>> single special type within a tree object of "sub-module"  being a
>>>>> mode 16000 commit (see https://longair.net/blog/2010/06/02/git-
>submodules-explained/).
>>>>>
>>>>> One thought is that one uses a proper sub-module that within it
>>>>> then has the single 'large' file git-lfs style that hosts the hash
>>>>> reference for the data VCS
>>>>> (https://github.com/git-lfs/git-lfs/blob/main/docs/spec.md). It
>>>>> would be the regular sub-modules .gitattributes file that handles
>>>>> the data conversion.
>>>>>
>>>>> It may be converting an X-Y problem into an X-Y-Z solution, or just
>>>>> extending the problem.
>> The most salient issue I have with this is that signatures cannot be validated
>across VCS systems.
>
>I think I disagree, but let's be sure we are talking about the same 'signature'
>aspect, I think there are (at least) three different signatures we could be talking
>about
>
>1. The hash verification 'signature' that can cascade down the trees. We verify
>against a given hash.
>2. The 'Signed-off-by:' legal/copyright signature - important, but I don't think that's
>the one being discussed.
>3. The (e.g.) PGP signature of a tag or commit. This provides a (web of) trust
>mechanism for the _given_ hash in 1. Important in 'open systems', less so in more
>closed systems where trust, and the _given_, is via side channels.

The third is more my concern. I do not know of other (D)VCS systems that have the same level of trust allowed in git - simultaneously PGP/SSH signing commits and potentially multiple tags.

>Note the shift from using a hash to using the PGP for the 'signature'.
>
>
>> Within git, a submodule commit can be signed. This ensures that the contents of
>the commit in the super-project can also be signed. If someone hacks an
>underlying VCS that is not git, either:
>Submodules are a remote VCS, it just happens to have the same hash validation
>software as the super-project, which is nice.
>>
>> a) git can never sign a commit from an underlying VCS, or
>Git-LFS is a similar hand off, though many accept it's capability.
>>
>> b) git can never trust a commit from an underlying VCS.
>>
>> This pollutes a fundamental capability of git, being multiple signers the contents
>of a commit, and invalidates the integrity of the Merkel tree that underlies git
>contents.
>
>The main issue is how to confirm the integrity the other VCS. Many of the Data
>VCS systems are based on Git and it's hash integrity approach, so as long as the
>DATA VCS has similar integrity guarantees, we maintain the level of trust in the
>security of the whole system.

This is exactly my concern and what I was trying to point out, although more briefly. I do not think an underlying VCS can provide similar guarantees. It is all too easy to hack most VCS systems if you have an appropriate user id, especially most non-distributed ones. We originally moved to git because we had hacks on the underlying files of two different VCS systems.

>> I do not see that this concept contributes positively to the ecosystem. I do feel
>strongly about this and hope my points are understood.
>
>I'd agree that there is a need to work out how to integrate the code VCS and data
>VCS in a consistent way. Ignoring the Data VCS problem doesn't make it go away.
>
>Maybe if Addison was able to identify one or two lead contenders as the Data VCS
>and how it/they offer their levels of security and integrity, then it would be easier
>to see where in the Git model that may fit. Or whether Git is the underling VCS
>(because it has programmable API), and the Data VCS (esp because of scale and
>non-distributed nature) becomes the "authority", even if that has less capability!

I agree as well. I want to see assurances that this level of integrity can be maintained - or that the user will have to accept the risks that git signatures are no longer usable. It might be appropriate to disable commit.gpgsign if the underlying VCS cannot be an authority.

--Randall


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [FR] supporting submodules with alternate version control systems (new contributor)
  2022-06-04 15:57               ` rsbecker
@ 2022-06-05 21:52                 ` Philip Oakley
  2022-06-06 14:53                   ` Addison Klinke
  0 siblings, 1 reply; 13+ messages in thread
From: Philip Oakley @ 2022-06-05 21:52 UTC (permalink / raw)
  To: rsbecker, 'Addison Klinke'
  Cc: 'Jason Pyeron', 'Junio C Hamano',
	git, 'Addison Klinke'

On 04/06/2022 16:57, rsbecker@nexbridge.com wrote:
> On June 4, 2022 9:28 AM, Philip Oakley wrote:
>> On 04/06/2022 03:01, rsbecker@nexbridge.com wrote:
>>> On June 3, 2022 7:07 PM, Philip Oakley wrote:
>>>> On 01/06/2022 13:44, Addison Klinke wrote:
>>>>>> rsbecker: move code into a submodule from your own VCS system
>>>>> into a git repository and the work with the submodule without the
>>>>> git code-base knowing about this
>>>>>
>>>>>> Philip: uses a proper sub-module that within it then has
>>>>> the single 'large' file git-lfs style that hosts the hash reference
>>>>> for the data VCS
>>>>>
>>>>> The downside I see with both of these approaches is that translating
>>>>> the native data VCS to git (or LFS) negates all the benefits of
>>>>> having a VCS purpose-built for data. That's why the majority of data
>>>>> versioning tools exist - because git (or LFS) are not ideal for
>>>>> handling machine learning datasets
>>>> The key aspect is deciding which of the two storage systems (the Data
>>>> & the Code) will be the overall lead system that contains the linked
>>>> reference to the other storage system to ensure the needed integrity.
>>>> That is not really a technical question. Rather its somewhat of a
>>>> social discussion (workflows, trust, style of integration, etc).
>>>>
>>>> It maybe that one of the systems does have less long-term integrity,
>>>> as has been seen in many versioning systems over the last century
>>>> (both manual and computer), but the UI is also important.
>>>>
>>>> IIRC Junio did note that having a suitable API to access the other
>>>> storage system (to know its status, etc.) is likely to be core to the
>>>> ability to combine the two. It may  be that a top level 'gui' is used
>>>> control both systems and ensure synchronisation to hide the complexities of
>> both systems.
>>>> I'm still thinking that the "git-lfs like" style could be the one to
>>>> use, but that is very dependant on the API that is available for
>>>> capturing the Data state into the git entry that records that state, whether that
>> is a file (git-lfs like) or a 'sub-module'
>>>> (directory as state ) style.  Either way it still need reifying (i.e.
>>>> coded to make the abstract concept into a concrete implementation).
>>>>
>>>> Which ever route is chosen, it still sounds to me like a worthwhile
>>>> enterprise. It's all still very abstract.
>>>>> On Tue, May 10, 2022 at 2:54 PM Philip Oakley <philipoakley@iee.email>
>> wrote:
>>>>>> On 10/05/2022 18:20, Jason Pyeron wrote:
>>>>>>>> -----Original Message-----
>>>>>>>> From: Junio C Hamano
>>>>>>>> Sent: Tuesday, May 10, 2022 1:01 PM
>>>>>>>> To: Addison Klinke <addison@baller.tv>
>>>>>>>>
>>>>>>>> Addison Klinke <addison@baller.tv> writes:
>>>>>>>>
>>>>>>>>> Is something along these lines feasible?
>>>>>>>> Offhand, I only think of one thing that could make it
>>>>>>>> fundamentally infeasible.
>>>>>>>>
>>>>>>>> When you bind an external repository (be it stored in Git or
>>>>>>>> somebody else's system) as a submodule, each commit in the
>>>>>>>> superproject records which exact commit in the submodule is used
>>>>>>>> with the rest of the superproject tree.  And that is done by
>>>>>>>> recording the object name of the commit in the submodule.
>>>>>>>>
>>>>>>>> What it means for the foreign system that wants to "plug into" a
>>>>>>>> superproject in Git as a submodule?  It is required to do two
>>>>>>>> things:
>>>>>>>>
>>>>>>>>   * At the time "git commit" is run at the superproject level, the
>>>>>>>>     foreign system has to be able to say "the version I have to be
>>>>>>>>     used in the context of this superproject commit is X", with X
>>>>>>>>     that somehow can be stored in the superproject's tree object
>>>>>>>>     (which is sized 20-byte for SHA-1 repositories; in SHA-256
>>>>>>>>     repositories, it is a bit wider).
>>>>>>>>
>>>>>>>>   * At the time "git chekcout" is run at the superproject level, the
>>>>>>>>     superproject will learn the above X (i.e. the version of the
>>>>>>>>     submodule that goes with the version of the superproject being
>>>>>>>>     checked out).  The foreign system has to be able to perform a
>>>>>>>>     "checkout" given that X.
>>>>>>>>
>>>>>>>> If a foreign system cannot do the above two, then it
>>>>>>>> fundamentally would be incapable of participating in such a
>>>>>>>> "superproject and submodule" relationship.
>>>>>> The sub-modules already have that problem if the user forgets
>>>>>> publish their sub-module (see notes in the docs ;-).
>>>>>>> The submodule "type" could create an object (hashed and stored)
>>>>>>> that
>>>> contains the needed "translation" details. The object would be hashed
>>>> using SHA1 or SHA256 depending on the git config. The format of the
>>>> object's contents would be defined by the submodule's "code".
>>>>>> Another way of looking at the issue is via a variant of Git-LFS
>>>>>> with a smudge/clean style filter. I.e. the DataVCS would be treated as a 'file'.
>>>>>>
>>>>>> The LFS already uses the .gitattributes to define a 'type', while
>>>>>> the submodules don't yet have that capability. There is just a
>>>>>> single special type within a tree object of "sub-module"  being a
>>>>>> mode 16000 commit (see https://longair.net/blog/2010/06/02/git-
>> submodules-explained/).
>>>>>> One thought is that one uses a proper sub-module that within it
>>>>>> then has the single 'large' file git-lfs style that hosts the hash
>>>>>> reference for the data VCS
>>>>>> (https://github.com/git-lfs/git-lfs/blob/main/docs/spec.md). It
>>>>>> would be the regular sub-modules .gitattributes file that handles
>>>>>> the data conversion.
>>>>>>
>>>>>> It may be converting an X-Y problem into an X-Y-Z solution, or just
>>>>>> extending the problem.
>>> The most salient issue I have with this is that signatures cannot be validated
>> across VCS systems.
>>
>> I think I disagree, but let's be sure we are talking about the same 'signature'
>> aspect, I think there are (at least) three different signatures we could be talking
>> about
>>
>> 1. The hash verification 'signature' that can cascade down the trees. We verify
>> against a given hash.
>> 2. The 'Signed-off-by:' legal/copyright signature - important, but I don't think that's
>> the one being discussed.
>> 3. The (e.g.) PGP signature of a tag or commit. This provides a (web of) trust
>> mechanism for the _given_ hash in 1. Important in 'open systems', less so in more
>> closed systems where trust, and the _given_, is via side channels.
> The third is more my concern. I do not know of other (D)VCS systems that have the same level of trust allowed in git - simultaneously PGP/SSH signing commits and potentially multiple tags.

For the reference of other readers, that's as discussed in
https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work esp. the
'Signing Commits' and 'Everyone Must Sign' sections at the end of Ch 7.4.
>> Note the shift from using a hash to using the PGP for the 'signature'.
>>
>>
>>> Within git, a submodule commit can be signed. This ensures that the contents of
>> the commit in the super-project can also be signed. If someone hacks an
>> underlying VCS that is not git, either:
>> Submodules are a remote VCS, it just happens to have the same hash validation
>> software as the super-project, which is nice.
>>> a) git can never sign a commit from an underlying VCS, or
>> Git-LFS is a similar hand off, though many accept it's capability.
>>> b) git can never trust a commit from an underlying VCS.
>>>
>>> This pollutes a fundamental capability of git, being multiple signers the contents
>> of a commit, and invalidates the integrity of the Merkel tree that underlies git
>> contents.
>>
>> The main issue is how to confirm the integrity the other VCS. Many of the Data
>> VCS systems are based on Git and it's hash integrity approach, so as long as the
>> DATA VCS has similar integrity guarantees, we maintain the level of trust in the
>> security of the whole system.
> This is exactly my concern and what I was trying to point out - although more briefly. I do not think (an|there are) underlying VCS can provide similar guarantees. It is all too easy to hack most VCS systems if you have an appropriate user id especially most non-distributed ones. We originally moved to git because we had hacks on two different VCS systems underlying files.
>
>>> I do not see that this concept contributes positively to the ecosystem. I do feel
>> strongly about this and hope my points are understood.
>>
>> I'd agree that there is a need to work out how to integrate the code VCS and data
>> VCS in a consistent way. Ignoring the Data VCS problem doesn't make it go away.
>>
>> Maybe if Addison was able to identify one or two lead contenders as the Data VCS
>> and how it/they offer their levels of security and integrity,

Looking back at Addison's original email, he did suggest:

- [Dolt](https://www.dolthub.com/),
- [LakeFS](https://lakefs.io/), and
- [DVC](https://dvc.org/)

as examples. They all imply git hash style validation of the individual
data commits, but make no mention of [PGP] signing, though it may be
available for some.

I did see the Dolt issue [ Cryptographic signing of a changeset? #628
](https://github.com/dolthub/dolt/issues/628), so it looks like it's on
their radar, though it's likely they'll need similar discussions about
how to cross integrate with Git.

However, we also need to note the shift to the cloud for these very
large immobile data sets, where there may be concerns as to the security
and trustworthiness of the compute and storage platforms (cosmic rays,
random glitches, hacks, etc).

We are no longer importing code to our local machine that we need to be
signed, rather we are exporting our code to their compute
infrastructure, so the verification has to happen 'over-there'. So the
integrity question is still very pertinent.

>>  then it would be easier
>> to see where in the Git model that may fit. Or whether Git is the underlying VCS
>> (because it has a programmable API), and the Data VCS (esp because of scale and
>> non-distributed nature) becomes the "authority", even if that has less capability!
> I agree as well. I want to see assurances that this level of integrity can be maintained - or that the user will have to accept the risks that git signatures are no longer usable. It might be appropriate to disable commit.gpgsign if the underlying VCS cannot be an authority.
>
>
I'd also worry, like yourself, about the cloud data sets, and how the
data selection subsets are captured (e.g. if multiple individuals have
used their right to be forgotten to make the old selection no longer
accessible, then how to validate?). Interesting times.
--
Philip


* Re: [FR] supporting submodules with alternate version control systems (new contributor)
  2022-06-05 21:52                 ` Philip Oakley
@ 2022-06-06 14:53                   ` Addison Klinke
  0 siblings, 0 replies; 13+ messages in thread
From: Addison Klinke @ 2022-06-06 14:53 UTC (permalink / raw)
  To: Philip Oakley; +Cc: rsbecker, Jason Pyeron, Junio C Hamano, git, Addison Klinke

> The key aspect is deciding which of the two storage systems (the Data &
the Code) will be the overall lead system that contains the linked
reference to the other storage system

I'd prefer git as the lead system since it's a standard everyone is
already used to. With so many variations on data VCS out there, I
think it would be difficult to find consensus.

> I do not know of other (D)VCS systems that have the same level of trust allowed in git - simultaneously PGP/SSH signing commits and potentially multiple tags

I have not used signed commits/tags with git before since the majority
of machine learning work in industry is on private repositories with
internal teams. The Dolt issue thread that Philip referenced seems
quite interesting in this regard.

> It might be appropriate to disable commit.gpgsign if the underlying VCS cannot be an authority

Would it be reasonable to start working on submodule integrations and
design a way for signing to be added later on as it (hopefully)
becomes supported by each data VCS?

On Sun, Jun 5, 2022 at 3:52 PM Philip Oakley <philipoakley@iee.email> wrote:
>
> On 04/06/2022 16:57, rsbecker@nexbridge.com wrote:
> > On June 4, 2022 9:28 AM, Philip Oakley wrote:
> >> On 04/06/2022 03:01, rsbecker@nexbridge.com wrote:
> >>> On June 3, 2022 7:07 PM, Philip Oakley wrote:
> >>>> On 01/06/2022 13:44, Addison Klinke wrote:
> >>>>>> rsbecker: move code into a submodule from your own VCS system
> >>>>> into a git repository and then work with the submodule without the
> >>>>> git code-base knowing about this
> >>>>>
> >>>>>> Philip: uses a proper sub-module that within it then has
> >>>>> the single 'large' file git-lfs style that hosts the hash reference
> >>>>> for the data VCS
> >>>>>
> >>>>> The downside I see with both of these approaches is that translating
> >>>>> the native data VCS to git (or LFS) negates all the benefits of
> >>>>> having a VCS purpose-built for data. That's why the majority of data
> >>>>> versioning tools exist - because git (or LFS) are not ideal for
> >>>>> handling machine learning datasets
> >>>> The key aspect is deciding which of the two storage systems (the Data
> >>>> & the Code) will be the overall lead system that contains the linked
> >>>> reference to the other storage system to ensure the needed integrity.
> >>>> That is not really a technical question. Rather, it's somewhat of a
> >>>> social discussion (workflows, trust, style of integration, etc).
> >>>>
> >>>> It may be that one of the systems does have less long-term integrity,
> >>>> as has been seen in many versioning systems over the last century
> >>>> (both manual and computer), but the UI is also important.
> >>>>
> >>>> IIRC Junio did note that having a suitable API to access the other
> >>>> storage system (to know its status, etc.) is likely to be core to the
> >>>> ability to combine the two. It may be that a top level 'gui' is used to
> >>>> control both systems and ensure synchronisation to hide the complexities of
> >>>> both systems.
> >>>> I'm still thinking that the "git-lfs like" style could be the one to
> >>>> use, but that is very dependent on the API that is available for
> >>>> capturing the Data state into the git entry that records that state, whether that
> >>>> is a file (git-lfs like) or a 'sub-module'
> >>>> (directory as state) style.  Either way it still needs reifying (i.e.
> >>>> coded to make the abstract concept into a concrete implementation).
> >>>>
> >>>> Whichever route is chosen, it still sounds to me like a worthwhile
> >>>> enterprise. It's all still very abstract.
> >>>>> On Tue, May 10, 2022 at 2:54 PM Philip Oakley <philipoakley@iee.email>
> >> wrote:
> >>>>>> On 10/05/2022 18:20, Jason Pyeron wrote:
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Junio C Hamano
> >>>>>>>> Sent: Tuesday, May 10, 2022 1:01 PM
> >>>>>>>> To: Addison Klinke <addison@baller.tv>
> >>>>>>>>
> >>>>>>>> Addison Klinke <addison@baller.tv> writes:
> >>>>>>>>
> >>>>>>>>> Is something along these lines feasible?
> >>>>>>>> Offhand, I only think of one thing that could make it
> >>>>>>>> fundamentally infeasible.
> >>>>>>>>
> >>>>>>>> When you bind an external repository (be it stored in Git or
> >>>>>>>> somebody else's system) as a submodule, each commit in the
> >>>>>>>> superproject records which exact commit in the submodule is used
> >>>>>>>> with the rest of the superproject tree.  And that is done by
> >>>>>>>> recording the object name of the commit in the submodule.
> >>>>>>>>
> >>>>>>>> What it means for the foreign system that wants to "plug into" a
> >>>>>>>> superproject in Git as a submodule?  It is required to do two
> >>>>>>>> things:
> >>>>>>>>
> >>>>>>>>   * At the time "git commit" is run at the superproject level, the
> >>>>>>>>     foreign system has to be able to say "the version I have to be
> >>>>>>>>     used in the context of this superproject commit is X", with X
> >>>>>>>>     that somehow can be stored in the superproject's tree object
> >>>>>>>>     (which is sized 20-byte for SHA-1 repositories; in SHA-256
> >>>>>>>>     repositories, it is a bit wider).
> >>>>>>>>
> >>>>>>>>   * At the time "git checkout" is run at the superproject level, the
> >>>>>>>>     superproject will learn the above X (i.e. the version of the
> >>>>>>>>     submodule that goes with the version of the superproject being
> >>>>>>>>     checked out).  The foreign system has to be able to perform a
> >>>>>>>>     "checkout" given that X.
> >>>>>>>>
> >>>>>>>> If a foreign system cannot do the above two, then it
> >>>>>>>> fundamentally would be incapable of participating in such a
> >>>>>>>> "superproject and submodule" relationship.
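
The two requirements above can be sketched as a small adapter interface.
This is a minimal sketch, not anything git provides: `ForeignVCS`,
`ToyDataVCS`, `record_for_superproject`, and the in-memory snapshot store
are all hypothetical names, and the 20-byte width assumes a SHA-1
superproject.

```python
import hashlib
from typing import Dict, Protocol


class ForeignVCS(Protocol):
    """Hypothetical adapter a foreign system would have to implement."""

    def current_version(self) -> bytes:
        """Return the identifier X for the state to record."""
        ...

    def checkout(self, version: bytes) -> None:
        """Restore the state previously recorded as X."""
        ...


class ToyDataVCS:
    """In-memory stand-in for a data VCS (illustration only)."""

    def __init__(self) -> None:
        self._snapshots: Dict[bytes, bytes] = {}
        self.worktree = b""

    def current_version(self) -> bytes:
        # Derive a 20-byte identifier from the current data state
        # and remember the state so it can be checked out later.
        version = hashlib.sha1(self.worktree).digest()
        self._snapshots[version] = self.worktree
        return version

    def checkout(self, version: bytes) -> None:
        # Fails with KeyError if X is unknown to the foreign system.
        self.worktree = self._snapshots[version]


def record_for_superproject(vcs: ForeignVCS, width: int = 20) -> bytes:
    """Ask the foreign system for X and check it fits the tree entry."""
    x = vcs.current_version()
    if len(x) != width:
        raise ValueError(f"identifier must be {width} bytes, got {len(x)}")
    return x
```

A foreign system that cannot produce a fixed-width X, or cannot restore
state from it, fails one of these two calls.
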
> >>>>>> The sub-modules already have that problem if the user forgets to
> >>>>>> publish their sub-module (see notes in the docs ;-).
> >>>>>>> The submodule "type" could create an object (hashed and stored) that
> >>>>>>> contains the needed "translation" details. The object would be hashed
> >>>>>>> using SHA1 or SHA256 depending on the git config. The format of the
> >>>>>>> object's contents would be defined by the submodule's "code".
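
One way to picture the "translation object" idea: git names an object by
hashing a short header plus the content, so a blob holding translation
details would get an ordinary object name. The sketch below assumes the
details are stored as a plain blob; the `vcs`/`revision` payload layout
is hypothetical, since the thread leaves the format to the submodule's
code.

```python
import hashlib


def git_object_id(content: bytes, obj_type: str = "blob",
                  algo: str = "sha1") -> str:
    """Hash content the way git names loose objects: '<type> <size>\\0' + content."""
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.new(algo, header + content).hexdigest()


# Hypothetical payload: which data VCS and which of its revisions to use.
translation = b"vcs dolt\nrevision example-revision-id\n"

sha1_name = git_object_id(translation)                   # 40 hex characters
sha256_name = git_object_id(translation, algo="sha256")  # 64 hex characters
```

The resulting name has the width the superproject's tree entry expects,
whatever the native revision format of the data VCS looks like.
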
> >>>>>> Another way of looking at the issue is via a variant of Git-LFS
> >>>>>> with a smudge/clean style filter. I.e. the DataVCS would be treated as a 'file'.
> >>>>>>
> >>>>>> The LFS already uses the .gitattributes to define a 'type', while
> >>>>>> the submodules don't yet have that capability. There is just a
> >>>>>> single special type within a tree object of "sub-module" being a
> >>>>>> mode 160000 commit (see
> >>>>>> https://longair.net/blog/2010/06/02/git-submodules-explained/).
> >>>>>> One thought is that one uses a proper sub-module that within it
> >>>>>> then has the single 'large' file git-lfs style that hosts the hash
> >>>>>> reference for the data VCS
> >>>>>> (https://github.com/git-lfs/git-lfs/blob/main/docs/spec.md). It
> >>>>>> would be the regular sub-modules .gitattributes file that handles
> >>>>>> the data conversion.
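
For reference, a Git LFS pointer file is a few lines of text: a spec
version, an `oid`, and a `size`. The sketch below builds and parses one
with the standard library. Reusing this layout to carry a data VCS
revision instead of a content hash is an assumption drawn from the idea
above, not an existing Git LFS feature.

```python
import hashlib
from typing import Dict

LFS_SPEC = "https://git-lfs.github.com/spec/v1"


def make_pointer(data: bytes) -> str:
    """Build the pointer text that stands in for `data` in the repository."""
    oid = hashlib.sha256(data).hexdigest()
    return f"version {LFS_SPEC}\noid sha256:{oid}\nsize {len(data)}\n"


def parse_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a pointer file into a dict entry."""
    return dict(line.split(" ", 1) for line in text.splitlines() if line)


pointer = make_pointer(b"abc")
fields = parse_pointer(pointer)
```

In the clean/smudge arrangement, the clean filter would write such a
pointer on commit and the smudge filter would resolve it on checkout.
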
> >>>>>>
> >>>>>> It may be converting an X-Y problem into an X-Y-Z solution, or just
> >>>>>> extending the problem.
> >>> The most salient issue I have with this is that signatures cannot be validated
> >> across VCS systems.
> >>
> >> I think I disagree, but let's be sure we are talking about the same 'signature'
> >> aspect, I think there are (at least) three different signatures we could be talking
> >> about
> >>
> >> 1. The hash verification 'signature' that can cascade down the trees. We verify
> >> against a given hash.
> >> 2. The 'Signed-off-by:' legal/copyright signature - important, but I don't think that's
> >> the one being discussed.
> >> 3. The (e.g.) PGP signature of a tag or commit. This provides a (web of) trust
> >> mechanism for the _given_ hash in 1. Important in 'open systems', less so in more
> >> closed systems where trust, and the _given_, is via side channels.
> > The third is more my concern. I do not know of other (D)VCS systems that have the same level of trust allowed in git - simultaneously PGP/SSH signing commits and potentially multiple tags.
>
> for reference of other readers, that's as discussed in
> https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work esp. the
> 'Signing Commits' and 'Everyone Must Sign' sections at the end of the Ch 7.4
> >> Note the shift from using a hash to using the PGP for the 'signature'.
> >>
> >>
> >>> Within git, a submodule commit can be signed. This ensures that the contents of
> >> the commit in the super-project can also be signed. If someone hacks an
> >> underlying VCS that is not git, either:
> >> Submodules are a remote VCS, it just happens to have the same hash validation
> >> software as the super-project, which is nice.
> >>> a) git can never sign a commit from an underlying VCS, or
> >> Git-LFS is a similar hand off, though many accept its capability.
> >>> b) git can never trust a commit from an underlying VCS.
> >>>
> >>> This pollutes a fundamental capability of git, being multiple signers of the contents
> >> of a commit, and invalidates the integrity of the Merkle tree that underlies git
> >> contents.
> >>
> >> The main issue is how to confirm the integrity of the other VCS. Many of the Data
> >> VCS systems are based on Git and its hash integrity approach, so as long as the
> >> DATA VCS has similar integrity guarantees, we maintain the level of trust in the
> >> security of the whole system.
> > This is exactly my concern and what I was trying to point out - although more briefly. I do not think an underlying VCS can provide similar guarantees. It is all too easy to hack most VCS systems if you have an appropriate user id, especially most non-distributed ones. We originally moved to git because we had hacks on two different VCS systems' underlying files.
> >
> >>> I do not see that this concept contributes positively to the ecosystem. I do feel
> >> strongly about this and hope my points are understood.
> >>
> >> I'd agree that there is a need to work out how to integrate the code VCS and data
> >> VCS in a consistent way. Ignoring the Data VCS problem doesn't make it go away.
> >>
> >> Maybe if Addison was able to identify one or two lead contenders as the Data VCS
> >> and how it/they offer their levels of security and integrity,
>
> Looking back at Addison's original email, he did suggest:
>
> - [Dolt](https://www.dolthub.com/),
> - [LakeFS](https://lakefs.io/), and
> - [DVC](https://dvc.org/)
>
> as examples. They all imply git hash style validation of the individual
> data commits, but make no mention of [PGP] signing, though it may be
> available for some.
>
> I did see the Dolt issue [ Cryptographic signing of a changeset? #628
> ](https://github.com/dolthub/dolt/issues/628), so it looks like it's on
> their radar, though it's likely they'll need similar discussions about
> how to cross integrate with Git.
>
> However, we also need to note the shift to the cloud for these very
> large immobile data sets, where there may be concerns as to the security
> and trustworthiness of the compute and storage platforms (cosmic rays,
> random glitches, hacks, etc).
>
> We are no longer importing code to our local machine that we need to be
> signed, rather we are exporting our code to their compute
> infrastructure, so the verification has to happen 'over-there'. So the
> integrity question is still very pertinent.
>
> >>  then it would be easier
> >> to see where in the Git model that may fit. Or whether Git is the underlying VCS
> >> (because it has a programmable API), and the Data VCS (esp because of scale and
> >> non-distributed nature) becomes the "authority", even if that has less capability!
> > I agree as well. I want to see assurances that this level of integrity can be maintained - or that the user will have to accept the risks that git signatures are no longer usable. It might be appropriate to disable commit.gpgsign if the underlying VCS cannot be an authority.
> >
> >
> I'd also worry, like yourself, about the cloud data sets, and how the
> data selection subsets are captured (e.g. if multiple individuals have
> used their right to be forgotten to make the old selection no longer
> accessible, then how to validate?). Interesting times.
> --
> Philip


end of thread, other threads:[~2022-06-06 14:53 UTC | newest]

Thread overview: 13+ messages
2022-05-10 16:11 [FR] supporting submodules with alternate version control systems (new contributor) Addison Klinke
2022-05-10 17:00 ` Junio C Hamano
2022-05-10 17:20   ` Jason Pyeron
2022-05-10 17:26     ` Addison Klinke
2022-05-10 18:26       ` rsbecker
2022-05-10 20:54     ` Philip Oakley
2022-06-01 12:44       ` Addison Klinke
2022-06-03 23:06         ` Philip Oakley
2022-06-04  2:01           ` rsbecker
2022-06-04 13:27             ` Philip Oakley
2022-06-04 15:57               ` rsbecker
2022-06-05 21:52                 ` Philip Oakley
2022-06-06 14:53                   ` Addison Klinke
