From: <>
To: "'Philip Oakley'" <>,
	"'Addison Klinke'" <>
Cc: "'Jason Pyeron'" <>,
	"'Junio C Hamano'" <>, <>,
	"'Addison Klinke'" <>
Subject: RE: [FR] supporting submodules with alternate version control systems (new contributor)
Date: Fri, 3 Jun 2022 22:01:47 -0400
Message-ID: <000301d877b7$0fb1ca20$2f155e60$>
In-Reply-To: <>

On June 3, 2022 7:07 PM, Philip Oakley wrote:
>On 01/06/2022 13:44, Addison Klinke wrote:
>>> rsbecker: move code into a submodule from your own VCS system
>> into a git repository and then work with the submodule without the git
>> code-base knowing about this
>>> Philip: uses a proper sub-module that within it then has
>> the single 'large' file git-lfs style that hosts the hash reference
>> for the data VCS
>> The downside I see with both of these approaches is that translating
>> the native data VCS to git (or LFS) negates all the benefits of having
>> a VCS purpose-built for data. That's why the majority of data
>> versioning tools exist - because git (or LFS) are not ideal for
>> handling machine learning datasets
>The key aspect is deciding which of the two storage systems (the Data & the Code)
>will be the overall lead system that contains the linked reference to the other
>storage system to ensure the needed integrity.
>That is not really a technical question. Rather it's somewhat of a social discussion
>(workflows, trust, style of integration, etc).
>It may be that one of the systems does have less long-term integrity, as has been
>seen in many versioning systems over the last century (both manual and
>computer), but the UI is also important.
>IIRC Junio did note that having a suitable API to access the other storage system
>(to know its status, etc.) is likely to be core to the ability to combine the two. It
>may be that a top level 'gui' is used to control both systems and ensure
>synchronisation to hide the complexities of both systems.
>I'm still thinking that the "git-lfs like" style could be the one to use, but that is very
>dependent on the API that is available for capturing the Data state into the git
>entry that records that state, whether that is a file (git-lfs like) or a 'sub-module'
>(directory as state) style.  Either way it still needs reifying (i.e. coded to make the
>abstract concept into a concrete implementation).
>Whichever route is chosen, it still sounds to me like a worthwhile enterprise. It's
>all still very abstract.
>> On Tue, May 10, 2022 at 2:54 PM Philip Oakley <> wrote:
>>> On 10/05/2022 18:20, Jason Pyeron wrote:
>>>>> -----Original Message-----
>>>>> From: Junio C Hamano
>>>>> Sent: Tuesday, May 10, 2022 1:01 PM
>>>>> To: Addison Klinke <>
>>>>> Addison Klinke <> writes:
>>>>>> Is something along these lines feasible?
>>>>> Offhand, I only think of one thing that could make it fundamentally
>>>>> infeasible.
>>>>> When you bind an external repository (be it stored in Git or
>>>>> somebody else's system) as a submodule, each commit in the
>>>>> superproject records which exact commit in the submodule is used
>>>>> with the rest of the superproject tree.  And that is done by
>>>>> recording the object name of the commit in the submodule.
>>>>> What it means for the foreign system that wants to "plug into" a
>>>>> superproject in Git as a submodule?  It is required to do two
>>>>> things:
>>>>>   * At the time "git commit" is run at the superproject level, the
>>>>>     foreign system has to be able to say "the version I have to be
>>>>>     used in the context of this superproject commit is X", with X
>>>>>     that somehow can be stored in the superproject's tree object
>>>>>     (which is sized 20-byte for SHA-1 repositories; in SHA-256
>>>>>     repositories, it is a bit wider).
>>>>>   * At the time "git checkout" is run at the superproject level, the
>>>>>     superproject will learn the above X (i.e. the version of the
>>>>>     submodule that goes with the version of the superproject being
>>>>>     checked out).  The foreign system has to be able to perform a
>>>>>     "checkout" given that X.
>>>>> If a foreign system cannot do the above two, then it fundamentally
>>>>> would be incapable of participating in such a "superproject and
>>>>> submodule" relationship.
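
To make the two requirements above concrete, here is a rough sketch of the
contract a foreign system would have to meet. Everything named here
(ForeignSubmodule, InMemoryStore, record_gitlink) is hypothetical and made up
for illustration; none of it is an existing git interface. The only facts
taken from git are the raw object-name widths: 20 bytes for SHA-1 and 32 bytes
for SHA-256.

```python
import hashlib
from typing import Dict, Protocol


class ForeignSubmodule(Protocol):
    """Hypothetical adapter for a non-git system; not a real git API."""

    def current_version(self) -> bytes:
        """Return X, the identifier recorded at superproject commit time."""
        ...

    def checkout(self, version: bytes) -> None:
        """Restore the state identified by X at superproject checkout time."""
        ...


# Raw object-name widths, in bytes, for the hash algorithms git supports.
OID_BYTES = {"sha1": 20, "sha256": 32}


def record_gitlink(sub: ForeignSubmodule, algo: str = "sha1") -> bytes:
    """Ask the foreign system for X and check it fits the tree entry."""
    x = sub.current_version()
    if len(x) != OID_BYTES[algo]:
        raise ValueError(f"X is {len(x)} bytes, need {OID_BYTES[algo]}")
    return x


class InMemoryStore:
    """Toy foreign system: identifies each state by the SHA-1 of its data."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self._snapshots: Dict[bytes, bytes] = {}

    def current_version(self) -> bytes:
        x = hashlib.sha1(self.data).digest()
        self._snapshots[x] = self.data  # remember the state so X can be restored
        return x

    def checkout(self, version: bytes) -> None:
        self.data = self._snapshots[version]  # KeyError if X is unknown
```

A system that cannot produce a fixed-width X, or cannot restore its state from
X later, fails one of the two requirements and cannot take part.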
>>> The sub-modules already have that problem if the user forgets to publish
>>> their sub-module (see notes in the docs ;-).
>>>> The submodule "type" could create an object (hashed and stored) that
>>>> contains the needed "translation" details. The object would be hashed
>>>> using SHA1 or SHA256 depending on the git config. The format of the
>>>> object's contents would be defined by the submodule's "code".
>>> Another way of looking at the issue is via a variant of Git-LFS with
>>> a smudge/clean style filter. I.e. the DataVCS would be treated as a 'file'.
>>> The LFS already uses the .gitattributes to define a 'type', while the
>>> submodules don't yet have that capability. There is just a single
>>> special type within a tree object of "sub-module"  being a mode 160000
>>> commit (see
>>> One thought is that one uses a proper sub-module that within it then
>>> has the single 'large' file git-lfs style that hosts the hash
>>> reference for the data VCS
>>> ( It would
>>> be the regular sub-modules .gitattributes file that handles the data
>>> conversion.
>>> It may be converting an X-Y problem into an X-Y-Z solution, or just
>>> extending the problem.

The most salient issue I have with this is that signatures cannot be validated across VCS systems. Within git, a submodule commit can be signed. This ensures that the contents of the commit in the super-project can also be signed. If someone hacks an underlying VCS that is not git, git has no way to detect it, so either:

a) git can never sign a commit from an underlying VCS, or

b) git can never trust a commit from an underlying VCS.

This pollutes a fundamental capability of git, namely multiple signers vouching for the contents of a commit, and invalidates the integrity of the Merkle tree that underlies git contents.
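
A small illustration of the concern, as a sketch only. The git_blob_id function follows git's blob hashing for SHA-1 repositories (the header "blob <size>" plus a NUL byte, then the content). The pointer format, the "rev-42" label and the two in-memory stores are assumptions invented for the example, and the sketch assumes the foreign system's revision labels are not themselves content hashes.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object name git assigns to a blob in a SHA-1 repository."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def pointer_blob(foreign_revision: str) -> bytes:
    """Hypothetical pointer file: git stores only the foreign revision label."""
    return b"foreign-vcs-revision " + foreign_revision.encode() + b"\n"


# Assumed scenario: the foreign store is altered, but the label stays the same.
foreign_store_before = {"rev-42": b"original dataset"}
foreign_store_after = {"rev-42": b"tampered dataset"}

# What git hashes (and what a signature covers) is the pointer, not the data.
id_before = git_blob_id(pointer_blob("rev-42"))
id_after = git_blob_id(pointer_blob("rev-42"))
pointer_unchanged = id_before == id_after

# Hashing the data itself would have exposed the change.
data_changed = (
    git_blob_id(foreign_store_before["rev-42"])
    != git_blob_id(foreign_store_after["rev-42"])
)

print("pointer id unchanged:", pointer_unchanged)
print("underlying data changed:", data_changed)
```

Under those assumptions the signed git history verifies cleanly in both cases, even though the data it refers to is no longer the same.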

I do not see that this concept contributes positively to the ecosystem. I do feel strongly about this and hope my points are understood.

