To: "'Philip Oakley'" <firstname.lastname@example.org>,
"'Addison Klinke'" <email@example.com>
Cc: "'Jason Pyeron'" <firstname.lastname@example.org>,
"'Junio C Hamano'" <email@example.com>, <firstname.lastname@example.org>,
"'Addison Klinke'" <email@example.com>
Subject: RE: [FR] supporting submodules with alternate version control systems (new contributor)
Date: Fri, 3 Jun 2022 22:01:47 -0400
Message-ID: <firstname.lastname@example.org>
On June 3, 2022 7:07 PM, Philip Oakley wrote:
>On 01/06/2022 13:44, Addison Klinke wrote:
>>> rsbecker: move code into a submodule from your own VCS system
>> into a git repository and then work with the submodule without the git
>> code-base knowing about this
>>> Philip: uses a proper sub-module that within it then has
>> the single 'large' file git-lfs style that hosts the hash reference
>> for the data VCS
>> The downside I see with both of these approaches is that translating
>> the native data VCS to git (or LFS) negates all the benefits of having
>> a VCS purpose-built for data. That's why the majority of data
>> versioning tools exist - because git (or LFS) are not ideal for
>> handling machine learning datasets
>The key aspect is deciding which of the two storage systems (the Data & the Code)
>will be the overall lead system that contains the linked reference to the other
>storage system to ensure the needed integrity.
>That is not really a technical question. Rather, it's somewhat of a social discussion
>(workflows, trust, style of integration, etc).
>It may be that one of the systems does have less long-term integrity, as has been
>seen in many versioning systems over the last century (both manual and
>computer), but the UI is also important.
>IIRC Junio did note that having a suitable API to access the other storage system
>(to know its status, etc.) is likely to be core to the ability to combine the two. It
>may be that a top level 'gui' is used to control both systems and ensure
>synchronisation to hide the complexities of both systems.
>I'm still thinking that the "git-lfs like" style could be the one to use, but that is very
>dependent on the API that is available for capturing the Data state into the git
>entry that records that state, whether that is a file (git-lfs like) or a 'sub-module'
>(directory as state) style. Either way it still needs reifying (i.e. coded to make the
>abstract concept into a concrete implementation).
>Whichever route is chosen, it still sounds to me like a worthwhile enterprise. It's
>all still very abstract.
>> On Tue, May 10, 2022 at 2:54 PM Philip Oakley <email@example.com> wrote:
>>> On 10/05/2022 18:20, Jason Pyeron wrote:
>>>>> -----Original Message-----
>>>>> From: Junio C Hamano
>>>>> Sent: Tuesday, May 10, 2022 1:01 PM
>>>>> To: Addison Klinke <firstname.lastname@example.org>
>>>>> Addison Klinke <email@example.com> writes:
>>>>>> Is something along these lines feasible?
>>>>> Offhand, I only think of one thing that could make it fundamentally
>>>>> When you bind an external repository (be it stored in Git or
>>>>> somebody else's system) as a submodule, each commit in the
>>>>> superproject records which exact commit in the submodule is used
>>>>> with the rest of the superproject tree. And that is done by
>>>>> recording the object name of the commit in the submodule.
>>>>> What it means for the foreign system that wants to "plug into" a
>>>>> superproject in Git as a submodule? It is required to do two things:
>>>>> * At the time "git commit" is run at the superproject level, the
>>>>> foreign system has to be able to say "the version I have to be
>>>>> used in the context of this superproject commit is X", with X
>>>>> that somehow can be stored in the superproject's tree object
>>>>> (which is sized 20-byte for SHA-1 repositories; in SHA-256
>>>>> repositories, it is a bit wider).
>>>>> * At the time "git checkout" is run at the superproject level, the
>>>>> superproject will learn the above X (i.e. the version of the
>>>>> submodule that goes with the version of the superproject being
>>>>> checked out). The foreign system has to be able to perform a
>>>>> "checkout" given that X.
>>>>> If a foreign system cannot do the above two, then it fundamentally
>>>>> would be incapable of participating in such a "superproject and
>>>>> submodule" relationship.
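The two requirements quoted above can be sketched as a small interface. This is only an illustration under stated assumptions: ForeignVCS, ToyDataVCS, record_on_commit and restore_on_checkout are hypothetical names, not part of Git or of any existing data VCS, and the toy store simply derives its 20-byte identifier X from a SHA-1 of the content.

```python
import hashlib
from typing import Dict, Protocol

SHA1_LEN = 20    # bytes available for X in a SHA-1 repository
SHA256_LEN = 32  # bytes available for X in a SHA-256 repository


class ForeignVCS(Protocol):
    """Hypothetical interface a foreign system would need to offer."""

    def current_version(self) -> bytes:
        """Return the identifier X to record at superproject commit time."""
        ...

    def checkout(self, version: bytes) -> None:
        """Restore the working state identified by X."""
        ...


class ToyDataVCS:
    """In-memory stand-in for a data VCS, used only to exercise the interface."""

    def __init__(self) -> None:
        self._snapshots: Dict[bytes, str] = {}
        self.working = ""

    def save(self, content: str) -> None:
        # Store a snapshot keyed by a 20-byte digest of its content.
        self.working = content
        self._snapshots[hashlib.sha1(content.encode()).digest()] = content

    def current_version(self) -> bytes:
        return hashlib.sha1(self.working.encode()).digest()

    def checkout(self, version: bytes) -> None:
        self.working = self._snapshots[version]


def record_on_commit(vcs: ForeignVCS, id_len: int = SHA1_LEN) -> bytes:
    """Requirement 1: obtain an X that fits the superproject's entry size."""
    x = vcs.current_version()
    if len(x) != id_len:
        raise ValueError(f"identifier is {len(x)} bytes, expected {id_len}")
    return x


def restore_on_checkout(vcs: ForeignVCS, x: bytes) -> None:
    """Requirement 2: the foreign system checks out the recorded X."""
    vcs.checkout(x)
```

A foreign system whose native identifiers are not 20 (or 32) bytes wide would need some mapping layer in front of current_version(); the sketch does not attempt that.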
>>> The sub-modules already have that problem if the user forgets to publish
>>> their sub-module (see notes in the docs ;-).
>>>> The submodule "type" could create an object (hashed and stored) that
>contains the needed "translation" details. The object would be hashed using SHA1
>or SHA256 depending on the git config. The format of the object's contents would
>be defined by the submodule's "code".
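As a rough illustration of the "translation object" idea quoted above: Git names an object by hashing a header of the form "<type> <size>\0" followed by the content, using SHA-1 or SHA-256 depending on the repository. The payload format below (a "vcs:" line and a "revision:" line) is purely an assumption made for this sketch; the thread does not define one.

```python
import hashlib


def git_object_id(payload: bytes, obj_type: str = "blob", algo: str = "sha1") -> str:
    """Compute a Git-style object name: hash of '<type> <size>\\0' + payload."""
    header = f"{obj_type} {len(payload)}\0".encode()
    return hashlib.new(algo, header + payload).hexdigest()


# Hypothetical translation payload; the field names are illustrative only.
translation = b"vcs: example-data-vcs\nrevision: r42\n"

sha1_name = git_object_id(translation, algo="sha1")      # 40 hex digits
sha256_name = git_object_id(translation, algo="sha256")  # 64 hex digits
```

The resulting name is what would fit into the superproject's tree entry, while the payload itself carries whatever the foreign system needs to locate its own revision.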
>>> Another way of looking at the issue is via a variant of Git-LFS with
>>> a smudge/clean style filter. I.e. the DataVCS would be treated as a 'file'.
>>> The LFS already uses the .gitattributes to define a 'type', while the
>>> submodules don't yet have that capability. There is just a single
>>> special type within a tree object of "sub-module" being a mode 160000
>>> commit (see https://longair.net/blog/2010/06/02/git-submodules-explained/).
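For reference, a submodule entry shows up in "git ls-tree" output as a line with mode 160000 and type "commit". A minimal sketch of spotting such entries, assuming the default "<mode> <type> <object>\t<path>" output format; the object names in the sample lines are made up for illustration:

```python
from typing import List, NamedTuple

GITLINK_MODE = "160000"  # tree-entry mode Git uses for submodules


class TreeEntry(NamedTuple):
    mode: str
    obj_type: str
    obj_id: str
    path: str


def parse_ls_tree_line(line: str) -> TreeEntry:
    """Parse one line of default `git ls-tree` output."""
    meta, path = line.rstrip("\n").split("\t", 1)
    mode, obj_type, obj_id = meta.split(" ")
    return TreeEntry(mode, obj_type, obj_id, path)


def submodule_paths(lines: List[str]) -> List[str]:
    """Return the paths of entries recorded as submodules (gitlinks)."""
    entries = [parse_ls_tree_line(line) for line in lines]
    return [e.path for e in entries if e.mode == GITLINK_MODE]


# Sample input with made-up object names.
sample = [
    "100644 blob " + "a" * 40 + "\tREADME.md",
    "160000 commit " + "b" * 40 + "\tdata",
    "040000 tree " + "c" * 40 + "\tsrc",
]
```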
>>> One thought is that one uses a proper sub-module that within it then
>>> has the single 'large' file git-lfs style that hosts the hash
>>> reference for the data VCS
>>> (https://github.com/git-lfs/git-lfs/blob/main/docs/spec.md). It would
>>> be the regular sub-modules .gitattributes file that handles the data
>>> It may be converting an X-Y problem into an X-Y-Z solution, or just
>>> extending the problem.
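To make the "git-lfs style" file concrete: a Git LFS pointer is a short text file with a "version" line, an "oid sha256:<hex>" line and a "size" line. The sketch below builds and parses such a pointer. Whether a data VCS could reuse this layout to carry its own revision reference is an assumption of this example, not something the LFS spec or the thread establishes.

```python
import hashlib
from typing import Dict

LFS_VERSION = "https://git-lfs.github.com/spec/v1"


def make_pointer(data: bytes) -> str:
    """Build an LFS-style pointer for the given content."""
    oid = hashlib.sha256(data).hexdigest()
    return f"version {LFS_VERSION}\noid sha256:{oid}\nsize {len(data)}\n"


def parse_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a pointer into a dictionary."""
    return dict(line.split(" ", 1) for line in text.splitlines())


pointer = make_pointer(b"abc")
fields = parse_pointer(pointer)
```

In the approach discussed above, Git would only ever version this small pointer file; the smudge/clean filter configured through .gitattributes would be responsible for fetching or storing the real data.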
The most salient issue I have with this is that signatures cannot be validated across VCS systems. Within git, a submodule commit can be signed. This ensures that the contents of the commit in the super-project can also be signed. If someone hacks an underlying VCS that is not git, either:
a) git can never sign a commit from an underlying VCS, or
b) git can never trust a commit from an underlying VCS.
This pollutes a fundamental capability of git, namely multiple signers of the contents of a commit, and invalidates the integrity of the Merkle tree that underlies git contents.
I do not see that this concept contributes positively to the ecosystem. I do feel strongly about this and hope my points are understood.
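The concern can be illustrated with a small sketch. A Git tree object is hashed over its entries (mode, name, raw object ID), so a signature over a superproject commit covers only the identifier recorded for the submodule, not whatever a foreign system stores under that identifier. The foreign store below is hypothetical and deliberately assumes identifiers that are not derived from content; a content-addressed foreign system would behave differently.

```python
import hashlib
from typing import List, Tuple


def tree_id(entries: List[Tuple[str, str, bytes]]) -> str:
    """Hash a Git tree object built from (mode, name, raw 20-byte id) entries."""
    body = b"".join(
        mode.encode() + b" " + name.encode() + b"\0" + raw_id
        for mode, name, raw_id in entries
    )
    header = b"tree " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical foreign store keyed by an opaque, non-content-derived ID.
x = b"\x01" * 20
foreign_store = {x: "original dataset"}

before = tree_id([("160000", "data", x)])
foreign_store[x] = "tampered dataset"  # change made entirely outside Git
after = tree_id([("160000", "data", x)])

# The tree name (and anything signed on top of it) is unchanged,
# even though the data behind X is now different.
unchanged = before == after
```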
Thread overview: 13+ messages
2022-05-10 16:11 Addison Klinke
2022-05-10 17:00 ` Junio C Hamano
2022-05-10 17:20 ` Jason Pyeron
2022-05-10 17:26 ` Addison Klinke
2022-05-10 18:26 ` rsbecker
2022-05-10 20:54 ` Philip Oakley
2022-06-01 12:44 ` Addison Klinke
2022-06-03 23:06 ` Philip Oakley
2022-06-04 2:01 ` rsbecker [this message]
2022-06-04 13:27 ` Philip Oakley
2022-06-04 15:57 ` rsbecker
2022-06-05 21:52 ` Philip Oakley
2022-06-06 14:53 ` Addison Klinke