git@vger.kernel.org mailing list mirror (one of many)
* Submodules and SHA-256/SHA-1 interoperability
@ 2021-02-14 21:50 brian m. carlson
  2021-03-01 19:28 ` Johannes Schindelin
  0 siblings, 1 reply; 4+ messages in thread
From: brian m. carlson @ 2021-02-14 21:50 UTC (permalink / raw)
  To: git

I'm currently working on the next step of the SHA-256 transition code,
which is SHA-256/SHA-1 interoperability.  Essentially, when we write a
loose object into the store, or when we index a pack, we take one form
of the object, usually the SHA-256 form, and rewrite it so that it is in
its SHA-1 form, and then hash it to determine its SHA-1 name.  We then
write this correspondence either into the loose object index (for loose
objects) or a v3 index (for packs).

Blobs are simply hashed with both algorithms, but trees, commits, and
tags need to be rewritten to use the SHA-1 names of the objects they
refer to.  For most situations, we already have this data, since it will
exist in the loose object index, in some pack index, or elsewhere in the
pack we're indexing.
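
As a rough illustration of the two paragraphs above (a sketch, not Git's
actual implementation), the following hashes a blob under both algorithms
using Git's "<type> <size>\0" object header and rewrites a tree's raw
entries through a lookup table.  The function names and the
sha256_to_sha1 dictionary are hypothetical stand-ins for the loose object
index or a pack index.

```python
import hashlib

def object_id(obj_type: bytes, body: bytes, algo: str) -> bytes:
    # A Git object name is the hash of "<type> <size>\0" followed by the body.
    header = obj_type + b" " + str(len(body)).encode() + b"\0"
    return hashlib.new(algo, header + body).digest()

def rewrite_tree(body: bytes, sha256_to_sha1: dict) -> bytes:
    # Each tree entry is "<mode> <name>\0" followed by a raw object ID,
    # which is 32 bytes long in a SHA-256 repository.
    out = bytearray()
    pos = 0
    while pos < len(body):
        nul = body.index(b"\0", pos)
        oid = body[nul + 1:nul + 33]
        # Keep mode and name, swap the SHA-256 name for the SHA-1 name.
        out += body[pos:nul + 1] + sha256_to_sha1[oid]
        pos = nul + 33
    return bytes(out)

# Blobs need no rewriting: the same bytes are hashed twice.
content = b"hello\n"
blob256 = object_id(b"blob", content, "sha256")
blob1 = object_id(b"blob", content, "sha1")

# Hypothetical stand-in for the loose object index or a pack index.
sha256_to_sha1 = {blob256: blob1}

# A tree must be rewritten before its SHA-1 name can be computed.
tree256 = b"100644 hello.txt\0" + blob256
tree1 = rewrite_tree(tree256, sha256_to_sha1)
tree_sha1_name = object_id(b"tree", tree1, "sha1")
print(tree_sha1_name.hex())
```

The lookup in rewrite_tree is the step that depends on already knowing
the SHA-1 name of every referenced object.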

However, for submodules, we have a problem.  By definition, the object
exists in a different repository.  If we have the submodule locally on
the system, then this works fine, but if we're performing a fetch or
clone and the submodule is not present, then we cannot rewrite the tree
or anything that refers to it, directly or indirectly.
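
The failure can be sketched as follows, assuming a tree that contains one
blob and one submodule entry (mode 160000).  The object IDs here are
placeholder digests rather than real Git object names, and the `known`
table and MissingMapping exception are hypothetical names used only for
this illustration.

```python
import hashlib

GITLINK_MODE = b"160000"  # tree entry mode Git uses for a submodule commit

class MissingMapping(Exception):
    """Raised when a referenced object has no known SHA-1 name."""

def parse_tree(body: bytes, oid_len: int = 32):
    # Yield (mode, name, raw object ID) for each entry of a raw tree body.
    pos = 0
    while pos < len(body):
        nul = body.index(b"\0", pos)
        mode, name = body[pos:nul].split(b" ", 1)
        yield mode, name, body[nul + 1:nul + 1 + oid_len]
        pos = nul + 1 + oid_len

def rewrite_tree(body: bytes, known: dict) -> bytes:
    out = bytearray()
    for mode, name, oid in parse_tree(body):
        if oid not in known:
            kind = "submodule commit" if mode == GITLINK_MODE else "object"
            raise MissingMapping(f"no SHA-1 name for {kind} at {name.decode()!r}")
        out += mode + b" " + name + b"\0" + known[oid]
    return bytes(out)

# Placeholder IDs (assumption: not real Git object names).
blob256 = hashlib.sha256(b"placeholder blob").digest()
sub256 = hashlib.sha256(b"placeholder submodule commit").digest()

# The blob lives in this repository, so its SHA-1 name is known.  The
# submodule commit lives in another repository that was never fetched.
known = {blob256: hashlib.sha1(b"placeholder blob").digest()}

tree = b"100644 README\0" + blob256 + b"160000 lib\0" + sub256
try:
    rewrite_tree(tree, known)
    error = None
except MissingMapping as exc:
    error = str(exc)
print(error)
```

Because the tree cannot be rewritten, its SHA-1 name is unknown, and so
is that of every parent tree and commit that refers to it.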

So there are some possible courses of action:

* Disallow compatibility algorithms when using submodules.  This is
  simple, but inconvenient.
* Force users to always clone submodules and fetch them before fetching
  the main repository.  This is also relatively simple, but
  inconvenient.
* Have the remote server keep a list of correspondences and send them in
  a protocol extension.
* Just skip rewriting objects until the data is filled in later and
  admit the data will be incomplete.  This means that pushing to or
  pulling from a repository using an incompatible algorithm will be
  impossible.
* Something else I haven't thought of.

The third option is where I'm leaning, but it has some potential
downsides.  First, the server must support both hash algorithms and have
this data.  Second, it essentially requires all submodule updates to be
pushed from a compatible client.  Third, we need to trust that the
server hasn't tampered with the data, which should be possible by doing
an fsck on both forms (I think).  Fourth, we need to store this
somewhere, and the only place we have right now is the loose object
index, which would potentially grow to inefficient sizes.

We could potentially change this to be slightly different by asking the
submodule server for a list of correspondences instead via a new
protocol extension, but it has the same downsides except for the second
one, and additionally means that we'd need to make multiple connections.

So I'm seeking some ideas on which approach we want to use here before
I start sinking a lot of work into this.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US

* Re: Submodules and SHA-256/SHA-1 interoperability
  2021-02-14 21:50 Submodules and SHA-256/SHA-1 interoperability brian m. carlson
@ 2021-03-01 19:28 ` Johannes Schindelin
  2021-03-13 19:42   ` brian m. carlson
  0 siblings, 1 reply; 4+ messages in thread
From: Johannes Schindelin @ 2021-03-01 19:28 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git

Hi brian,

On Sun, 14 Feb 2021, brian m. carlson wrote:

> I'm currently working on the next step of the SHA-256 transition code,
> which is SHA-256/SHA-1 interoperability.  Essentially, when we write a
> loose object into the store, or when we index a pack, we take one form
> of the object, usually the SHA-256 form, and rewrite it so that it is in
> its SHA-1 form, and then hash it to determine its SHA-1 name.  We then
> write this correspondence either into the loose object index (for loose
> objects) or a v3 index (for packs).
>
> Blobs are simply hashed with both algorithms, but trees, commits, and
> tags need to be rewritten to use the SHA-1 names of the objects they
> refer to.  For most situations, we already have this data, since it will
> exist in the loose object index, in some pack index, or elsewhere in the
> pack we're indexing.
>
> However, for submodules, we have a problem.  By definition, the object
> exists in a different repository.  If we have the submodule locally on
> the system, then this works fine, but if we're performing a fetch or
> clone and the submodule is not present, then we cannot rewrite the tree
> or anything that refers to it, directly or indirectly.
>
> So there are some possible courses of action:
>
> * Disallow compatibility algorithms when using submodules.  This is
>   simple, but inconvenient.
> * Force users to always clone submodules and fetch them before fetching
>   the main repository.  This is also relatively simple, but
>   inconvenient.
> * Have the remote server keep a list of correspondences and send them in
>   a protocol extension.
> * Just skip rewriting objects until the data is filled in later and
>   admit the data will be incomplete.  This means that pushing to or
>   pulling from a repository using an incompatible algorithm will be
>   impossible.
> * Something else I haven't thought of.

While my strong urge is to add "Remove support for submodules" (which BTW
would also plug so many attack vectors that have led to many a
vulnerability in the past), I understand that this would be impractical:
the figurative barn door has been open for way too long to do that.

But I'd like to put another idea into the fray: store the mapping in
`.gitmodules`. That is, each time `git submodule add <...>` is called, it
would update `.gitmodules` to list SHA-1 *and* SHA-256 for the given path.

That would relieve us of the problem where we rely on a server's ability
to give us that mapping.
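
To make the suggestion concrete, here is a minimal sketch of what reading
such a mapping could look like.  The `sha1` and `sha256` keys are purely
hypothetical (`.gitmodules` has no such keys today), the hex values are
dummies, and the parser handles only the simple subset of the config
syntax used in this example.

```python
import re

# Dummy object names of the right length (assumption: illustration only).
SHA1_HEX = "1" * 40
SHA256_HEX = "2" * 64

# Hypothetical .gitmodules content; "sha1" and "sha256" are not real keys.
GITMODULES = f'''[submodule "lib"]
\tpath = lib
\turl = https://example.com/lib.git
\tsha1 = {SHA1_HEX}
\tsha256 = {SHA256_HEX}
'''

def read_mappings(text: str) -> dict:
    """Return {raw SHA-256 ID: raw SHA-1 ID} for sections listing both."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.fullmatch(r'\[submodule "(.+)"\]', line)
        if header:
            current = sections.setdefault(header.group(1), {})
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return {
        bytes.fromhex(s["sha256"]): bytes.fromhex(s["sha1"])
        for s in sections.values()
        if "sha1" in s and "sha256" in s
    }

mappings = read_mappings(GITMODULES)
print(len(mappings))
```

The resulting dictionary has the same shape as a repository-local
SHA-256 to SHA-1 table, so it could in principle be consulted when
rewriting a tree that contains a submodule entry.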

Ciao,
Dscho

* Re: Submodules and SHA-256/SHA-1 interoperability
  2021-03-01 19:28 ` Johannes Schindelin
@ 2021-03-13 19:42   ` brian m. carlson
  2021-03-19 14:23     ` Johannes Schindelin
  0 siblings, 1 reply; 4+ messages in thread
From: brian m. carlson @ 2021-03-13 19:42 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

On 2021-03-01 at 19:28:13, Johannes Schindelin wrote:
> While my strong urge is to add "Remove support for submodules" (which BTW
> would also plug so many attack vectors that have led to many a
> vulnerability in the past), I understand that this would be impractical:
> the figurative barn door has been open for way too long to do that.
> 
> But I'd like to put another idea into the fray: store the mapping in
> `.gitmodules`. That is, each time `git submodule add <...>` is called, it
> would update `.gitmodules` to list SHA-1 *and* SHA-256 for the given path.
> 
> That would relieve us of the problem where we rely on a server's ability
> to give us that mapping.

This is true, but it ends up causing problems because we don't know
where the .gitmodules file is for a given revision.  If we're indexing a
pack file, we lack the ability to know which .gitmodules file is
associated with the blobs in a given revision, and we can't finish
indexing that file until we have both hashes for every object.

While we could change the way we do indexing, we'd end up having to
crawl the history and that would be very slow.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US

* Re: Submodules and SHA-256/SHA-1 interoperability
  2021-03-13 19:42   ` brian m. carlson
@ 2021-03-19 14:23     ` Johannes Schindelin
  0 siblings, 0 replies; 4+ messages in thread
From: Johannes Schindelin @ 2021-03-19 14:23 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git

Hi brian,

On Sat, 13 Mar 2021, brian m. carlson wrote:

> On 2021-03-01 at 19:28:13, Johannes Schindelin wrote:
> > While my strong urge is to add "Remove support for submodules" (which BTW
> > would also plug so many attack vectors that have led to many a
> > vulnerability in the past), I understand that this would be impractical:
> > the figurative barn door has been open for way too long to do that.
> >
> > But I'd like to put another idea into the fray: store the mapping in
> > `.gitmodules`. That is, each time `git submodule add <...>` is called, it
> > would update `.gitmodules` to list SHA-1 *and* SHA-256 for the given path.
> >
> > That would relieve us of the problem where we rely on a server's ability
> > to give us that mapping.
>
> This is true, but it ends up causing problems because we don't know
> where the .gitmodules file is for a given revision.  If we're indexing a
> pack file, we lack the ability to know which .gitmodules file is
> associated with the blobs in a given revision, and we can't finish
> indexing that file until we have both hashes for every object.
>
> While we could change the way we do indexing, we'd end up having to
> crawl the history and that would be very slow.

Hrm, that's a valid point.

I just wish that we could make this more independent of servers. There
_might_ be a way to work around the flaw you pointed out, e.g. adding the
mappings from `.gitmodules` to the repository-local SHA-1 <-> SHA-256
mapping. But maybe somebody else can think of a better way?

Ciao,
Dscho
