git@vger.kernel.org mailing list mirror (one of many)
 help / Atom feed
* RFC: Another proposed hash function transition plan
@ 2017-03-04  1:12 Jonathan Nieder
  2017-03-05  2:35 ` Linus Torvalds
                   ` (4 more replies)
  0 siblings, 5 replies; 33+ messages in thread
From: Jonathan Nieder @ 2017-03-04  1:12 UTC (permalink / raw)
  To: git; +Cc: sbeller, bmwill, jonathantanmy, peff, Linus Torvalds

Hi,

This past week we came up with this idea for what a transition to a new
hash function for Git would look like.  I'd be interested in your
thoughts (especially if you can make them as comments on the document,
which makes it easier to address them and update the document).

This document is still in flux but I thought it best to send it out
early to start getting feedback.

We tried to incorporate some thoughts from the thread
http://public-inbox.org/git/20170223164306.spg2avxzukkggrpb@kitenet.net
but it is a little long so it is easy to imagine we've missed
some things already discussed there.

You can use the doc URL

 https://goo.gl/gh2Mzc

to view the latest version and comment.

Thoughts welcome, as always.

Git hash function transition
============================
Status: Draft
Last Updated: 2017-03-03

Objective
---------
Migrate Git from SHA-1 to a stronger hash function.

Background
----------
The Git version control system can be thought of as a content
addressable filesystem. It uses the SHA-1 hash function to name
content. For example, files, trees, commits are referred to by hash
values unlike in other traditional version control systems where files
or versions are referred to via sequential numbers. The use of a hash
function to address its content delivers a few advantages:

* Integrity checking is easy. Bit flips, for example, are easily
  detected, as the hash of corrupted content does not match its name.
  Lookup of objects is fast.

Using a cryptographically secure hash function brings additional advantages:

* Object names can be signed and third parties can trust the hash to
  address the signed object and all objects it references.
* Communication using Git protocol and out of band communication
  methods have a short reliable string that can be used to reliably
  address stored content.

Over time some flaws in SHA-1 have been discovered by security
researchers. https://shattered.io demonstrated a practical SHA-1 hash
collision. As a result, SHA-1 cannot be considered cryptographically
secure any more. This impacts the communication of hash values because
we cannot trust that a given hash value represents the known good
version of content that the speaker intended.

SHA-1 still possesses the other properties such as fast object lookup
and safe error checking, but other hash functions are equally suitable
that are believed to be cryptographically secure.

Goals
-----
1. The transition to SHA256 can be done one local repository at a time.
   a. Requiring no action by any other party.
   b. A SHA256 repository can communicate with SHA-1 Git servers and
      clients (push/fetch).
   c. Users can use SHA-1 and SHA256 identifiers for objects
      interchangeably.
   d. New signed objects make use of a stronger hash function than
      SHA-1 for their security guarantees.
2. Allow a complete transition away from SHA-1.
   a. Local metadata for SHA-1 compatibility can be dropped in a
      repository if compatibility with SHA-1 is no longer needed.
3. Maintainability throughout the process.
   a. The object format is kept simple and consistent.
   b. Creation of a generalized repository conversion tool.

Non-Goals
---------
1. Add SHA256 support to Git protocol. This is valuable and the
   logical next step but it is out of scope for this initial design.
2. Transparently improving the security of existing SHA-1 signed
   objects.
3. Intermixing objects using multiple hash functions in a single
   repository.
4. Taking the opportunity to fix other bugs in git's formats and
   protocols.
5. Shallow clones and fetches into a SHA256 repository. (This will
   change when we add SHA256 support to Git protocol.)
6. Skip fetching some submodules of a project into a SHA256
   repository. (This also depends on SHA256 support in Git protocol.)

Overview
--------
We introduce a new repository format extension `sha256`. Repositories
with this extension enabled use SHA256 instead of SHA-1 to name their
objects. This affects both object names and object content --- both
the names of objects and all references to other objects within an
object are switched to the new hash function.

sha256 repositories cannot be read by older versions of Git.

Alongside the packfile, a sha256 stores a bidirectional mapping
between sha256 and sha1 object names. The mapping is generated locally
and can be verified using "git fsck". Object lookups use this mapping
to allow naming objects using either their sha1 and sha256 names
interchangeably.

"git cat-file" and "git hash-object" gain options to display a sha256
object in its sha1 form and write a sha256 object given its sha1 form.
This requires all objects referenced by that object to be present in
the object database so that they can be named using the appropriate
name (using the bidirectional hash mapping).

Fetches from a SHA-1 based server convert the fetched objects into
sha256 form and record the mapping in the bidirectional mapping table
(see below for details). Pushes to a SHA-1 based server convert the
objects being pushed into sha1 form so the server does not have to be
aware of the hash function the client is using.

Detailed Design
---------------
Object names
~~~~~~~~~~~~
Objects can be named by their 40 hexadecimal digit sha1-name or 64
hexadecimal digit sha256-name, plus names derived from those (see
gitrevisions(7)).

The sha1-name of an object is the SHA-1 of the concatenation of its
type, length, a nul byte, and the object's sha1-content. This is the
traditional <sha1> used in Git to name objects.

The sha256-name of an object is the SHA-256 of the concatenation of
its type, length, a nul byte, and the object's sha256-content.

Object format
~~~~~~~~~~~~~
Objects are stored using a compressed representation of their
sha256-content. The sha256-content of an object is the same as its
sha1-content, except that:
* objects referenced by the object are named using their sha256-names
  instead of sha1-names
* signed tags, commits, and merges of signed tags get some additional
  fields (see below)

The format allows round-trip conversion between sha256-content and
sha1-content.

Loose objects use zlib compression and packed objects use the packed
format described in Documentation/technical/pack-format.txt, just like
today.

Translation table
~~~~~~~~~~~~~~~~~
A fast bidirectional mapping between sha1-names and sha256-names of
all local objects in the repository is kept on disk. The exact format
of that mapping is to be determined.

All operations that make new objects (e.g., "git commit") add the new
objects to the translation table.

Reading an object's sha1-content
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The sha1-content of an object can be read by converting all
sha256-names its sha256-content references to sha1-names using the
translation table. There is an additional minor transformation needed
for signed tags, commits, and merges (see below).

Fetch
~~~~~
Fetching from a SHA-1 based server requires translating between SHA-1
and SHA-256 based representations on the fly.

SHA-1s named in the ref advertisement can be translated to SHA-256 and
looked up as local objects using the translation table.

Negotiation proceeds as today. Any "have"s or "want"s generated
locally are converted to SHA-1 before being sent to the server, and
SHA-1s mentioned by the server are converted to SHA-256 when looking
them up locally.

After negotiation, the server sends a packfile containing the
requested objects. We convert the packfile to SHA-256 format using the
following steps:

1. index-pack: inflate each object in the packfile and compute its
   SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
   objects the client has locally. These objects can be looked up using
   the translation table and their sha1-content read as described above
   to resolve the deltas.
2. topological sort: starting at the "want"s from the negotiation
   phase, walk through objects in the pack and emit a list of them in
   topologically sorted order. (This list only contains objects
   reachable from the "wants". If the pack from the server contained
   additional extraneous objects, then they will be discarded.)
3. convert to sha256: open a new (sha256) packfile. Read the
   topologically sorted list just generated in reverse order. For each
   object, inflate its sha1-content, convert to sha256-content, and
   write it to the sha256 pack. Write an idx file for this pack and
   include the new sha1<->sha256 mapping entry in the translation
   table.
4. clean up: remove the SHA-1 based pack file, index, and
   topologically sorted list obtained from the server and steps 1 and 2.

Step 3 requires every object referenced by the new object to be in the
translation table. This is why the topological sort step is necessary.

As an optimization, step 1 can write a file describing what objects
each object it has inflated from the packfile references. This makes
the topological sort in step 2 possible without inflating the objects
in the packfile for a second time. The objects need to be inflated
again in step 3, for a total of two inflations.

Push
~~~~
Push is simpler than fetch because the objects referenced by the
pushed objects are already in the translation table. The sha1-content
of each object being pushed can be read as described in the "Reading
an object's sha1-content" section to generate the pack written by git
send-pack.

Signed Objects
~~~~~~~~~~~~~~
Commits
^^^^^^^
Commits currently have the following sequence of header lines:

	"tree" SP object-name
	("parent" SP object-name)*
	"author" SP ident
	"committer" SP ident
	("mergetag" SP object-content)?
	("gpgsig" SP pgp-signature)?

We introduce new header lines "hash" and "nohash" that come after the
"gpgsig" field. No "hash" lines may appear unless the "gpgsig" field
is present.

Hash lines have the form

	"hash" SP hash-function SP field SP alternate-object-name

Nohash lines have the form

	"nohash" SP hash-function

There are only two recognized values of hash-function: "sha1" and
"sha256". "git fsck" will tolerate values of hash-function it does not
recognize, as long as they do not come before either of those two. All
"nohash" lines come before all "hash" lines. Any "hash sha1" lines
must come before all "hash sha256" lines, and likewise for nohash. The
Git project determines any future supported hash-functions that can
come after those two and their order.

There can be at most one "nohash <hash-function>" for one hash
function, indicating that this hash function should not be used when
checking the commit's signature.

There is one "hash <hash-function>" line for each tree or parent field
in the commit object header. The hash lines record object names for
those trees and parents using the indicated hash function, to be used
when checking the commit's signature.

TODO: simplify signature rules, handle the mergetag field better.

sha256-content of signed commits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The sha256-content of a commit with a "gpgsig" header can include no
hash and nohash lines, a "nohash sha256" line and "hash sha1", or just
a "hash sha1" line.

Examples:
1. tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980
   parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   gpgsig ...
2. tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980
   parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   gpgsig ...
   nohash sha256
   hash sha1 tree c7b1cff039a93f3600a1d18b82d26688668c7dea
   hash sha1 parent c33429be94b5f2d3ee9b0adad223f877f174b05d
   hash sha1 parent 04b871796dc0420f8e7561a895b52484b701d51a
3. tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980
   parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   gpgsig ...
   hash sha1 tree c7b1cff039a93f3600a1d18b82d26688668c7dea
   hash sha1 parent c33429be94b5f2d3ee9b0adad223f877f174b05d
   hash sha1 parent 04b871796dc0420f8e7561a895b52484b701d51a

sha1-content of signed commits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The sha1-content of a commit with a "gpgsig" header can contain a
"nohash sha1" and "hash sha256" line, no hash or nohash lines, or just
a "hash sha256" line.

Examples:
1. tree c7b1cff039a93f3600a1d18b82d26688668c7dea
   parent c33429be94b5f2d3ee9b0adad223f877f174b05d
   parent 04b871796dc0420f8e7561a895b52484b701d51a
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   gpgsig ...
   nohash sha1
   hash sha256 tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   hash sha256 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980
   hash sha256 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315
2. tree c7b1cff039a93f3600a1d18b82d26688668c7dea
   parent c33429be94b5f2d3ee9b0adad223f877f174b05d
   parent 04b871796dc0420f8e7561a895b52484b701d51a
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   gpgsig ...
3. tree c7b1cff039a93f3600a1d18b82d26688668c7dea
   parent c33429be94b5f2d3ee9b0adad223f877f174b05d
   parent 04b871796dc0420f8e7561a895b52484b701d51a
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   gpgsig ...
   hash sha256 tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   hash sha256 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980
   hash sha256 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315

Converting signed commits
^^^^^^^^^^^^^^^^^^^^^^^^^
To convert the sha1-content of a signed commit to its sha256-content:

1. Change "tree" and "parent" lines to use the sha256-names of
   referenced objects, as with unsigned commits.
2. If there is a "mergetag" field, convert it from sha1-content to
   sha256-content, as with unsigned commits with a mergetag (see the
   "Mergetag" section below).
3. Unless there is a "nohash sha1" line, add a full set of "hash sha1
   <field> <sha1>" lines indicating the sha1-names of the tree and
   parents.
4. Remove any "hash sha256 <field> <sha256>" lines. If no such lines
   were present, add a "nohash sha256" line.

Converting the sha256-content of a signed commit to sha1-content uses
the same process with sha1 and sha256 switched.

Verifying signed commit signatures
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If the commit has a "hash sha1" line (or is sha1-content without a
"nohash sha1" line): check that the signature matches the sha1-content
with gpgsig field stripped out.

Otherwise: check that the signature matches the sha1-content with
gpgsig, nohash, tree, and parents fields stripped out.

With the examples above, the signed payloads are
1. author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   hash sha256 tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   hash sha256 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980
   hash sha256 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315
2. tree c7b1cff039a93f3600a1d18b82d26688668c7dea
   parent c33429be94b5f2d3ee9b0adad223f877f174b05d
   parent 04b871796dc0420f8e7561a895b52484b701d51a
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
3. tree c7b1cff039a93f3600a1d18b82d26688668c7dea
   parent c33429be94b5f2d3ee9b0adad223f877f174b05d
   parent 04b871796dc0420f8e7561a895b52484b701d51a
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   hash sha1
   hash sha256 tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   hash sha256 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980
   hash sha256 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315
   
Current versions of "git verify-commit" can verify examples (2) and (3)
(but not (1)).

Tags
~~~~
Tags currently have the following sequence of header lines:
   
   	"object" SP object-name
	"type" SP type
	"tag" SP identifier
	"tagger" SP ident

A tag's signature, if it exists, is in the message body.

We introduce new header lines "nohash" and "hash" that come after the
"tagger" field. No "nohash" or "hash" lines may appear unless the
message body contains a PGP signature.

As with commits, "nohash" lines have the form "nohash
<hash-function>", indicating that this hash function should not be
used when checking the tag's signature.

"hash" lines have the form

	"hash" SP hash-function SP alternate-object-name

This records the pointed-to object name using the indicated hash
function, to be used when checking the tag's signature.

As with commits, "sha1" and "sha256" are the only permitted values of
hash-function and can only appear in that order for a field when they
appear. There can be at most one "nohash" line, and it comes before
any "hash" lines. There can be only one "hash" line for a given hash
function.

sha256-content of signed tags
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The sha256-content of a signed tag can include no "hash" or "nohash"
lines, a "nohash sha256" and "hash sha1 <sha1>" line, or just a "hash
sha1 <sha1>" line.

Examples:
1. object 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   type tree
   tag v1.0
   tagger C O Mitter <committer@example.com> 1465981006 +0000

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   Version: GnuPG v1

   iQEcBAABAgAGBQJXYRhOAAoJEGEJLoW3InGJklkIAIcnhL7RwEb/+QeX9enkXhxn
   ...
2. object 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   type tree
   tag v1.0
   tagger C O Mitter <committer@example.com> 1465981006 +0000
   nohash sha256
   hash sha1 c7b1cff039a93f3600a1d18b82d26688668c7dea

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   ...
3. object 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   type tree
   tag v1.0
   tagger C O Mitter <committer@example.com> 1465981006 +0000
   hash sha1 c7b1cff039a93f3600a1d18b82d26688668c7dea

   Tag Demo v1.0
   ...

sha1-content of signed tags
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The sha1-content of a signed tag can include a "nohash sha1" and "hash
sha256" line, no "nohash" or "hash" lines, or just a "hash sha256
<sha256>" line.
   
Examples:
1. object c7b1cff039a93f3600a1d18b82d26688668c7dea
   ...
   tagger C O Mitter <committer@example.com> 1465981006 +0000
   nohash sha1
   hash sha256 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   ...
2. object c7b1cff039a93f3600a1d18b82d26688668c7dea
   ...
   tagger C O Mitter <committer@example.com> 1465981006 +0000

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   ...
3. object c7b1cff039a93f3600a1d18b82d26688668c7dea
   ...
   tagger C O Mitter <committer@example.com> 1465981006 +0000
   hash sha256 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   ...

Signed tags can be converted between sha1-content and sha256-content
using the same process as signed commits.

Verifying signed tags
^^^^^^^^^^^^^^^^^^^^^
As with commits, if the tag has a "hash sha1" (or is sha1-content
without a "nohash sha1" line): check that the signature matches the
sha1-content with PGP signature stripped out.
   
Otherwise: check that the signature matches the sha1-content with
nohash and object fields and PGP signature stripped out.

Mergetag signatures
~~~~~~~~~~~~~~~~~~~
The mergetag field in the sha1-content of a commit contains the
sha1-content of a tag that was merged by that commit.

The mergetag field in the sha256-content of the same commit contains
the sha256-content of the same tag.

Submodules
~~~~~~~~~~
To convert recorded submodule pointers, you need to have the converted
submodule repository in place. The bidirectional mapping of the
submodule can be used to look up the new hash.

Caveats
-------
Shallow clone and submodules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because this requires all referenced objects to be available in the
locally generated translation table, this design does not support
shallow clone or unfetched submodules.

Protocol improvements might allow lifting this restriction.

Alternatives considered
-----------------------
Upgrading everyone working on a particular project on a flag day
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Projects like the Linux kernel are large and complex enough that
flipping the switch for all projects based on the repository at once
is infeasible.

Not only would all developers and server operators supporting
developers have to switch on the same flag day, but supporting tooling
(continuous integration, code review, bug trackers, etc) would have to
be adapted as well. This also makes it difficult to get early feedback
from some project participants testing before it is time for mass
adoption.

Using hash functions in parallel 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ )
Objects newly created would be addressed by the new hash, but inside
such an object (e.g. commit) it is still possible to address objects
using the old hash function.

* You cannot trust its history (needed for bisectability) in the
  future without further work 
* Maintenance burden as the number of supported hash functions grows
  (they will never go away, so they accumulate). In this proposal, by
  comparison, converted objects lose all references to SHA-1 except
  where needed to verify signatures.

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-04  1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
@ 2017-03-05  2:35 ` Linus Torvalds
  2017-03-06  0:26   ` brian m. carlson
  2017-03-07  0:17   ` RFC v3: Another proposed hash function transition plan Jonathan Nieder
  2017-03-05 11:02 ` David Lang
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2017-03-05  2:35 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Git Mailing List, Stefan Beller, bmwill, jonathantanmy, Jeff King

On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> This document is still in flux but I thought it best to send it out
> early to start getting feedback.

This actually looks very reasonable if you can implement it cleanly
enough. In many ways the "convert entirely to a new 256-bit hash" is
the cleanest model, and interoperability was at least my personal
concern. Maybe your model solves it (devil in the details), in which
case I really like it.

I do think that if you end up essentially converting the objects
without really having any true backwards compatibility at the object
layer (just the translation code), you should seriously look at doing
some other changes at the same time. Like not using zlib compression,
it really is very slow.

Btw, I do think the particular choice of hash should still be on the
table. sha-256 may be the obvious first choice, but there are
definitely a few reasons to consider alternatives, especially if it's
a complete switch-over like this.

One is large-file behavior - a parallel (or tree) mode could improve
on that noticeably. BLAKE2 does have special support for that, for
example. And SHA-256 does have known attacks compared to SHA-3-256 or
BLAKE2 - whether that is due to age or due to more effort, I can't
really judge. But if we're switching away from SHA1 due to known
attacks, it does feel like we should be careful.

                Linus

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-04  1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
  2017-03-05  2:35 ` Linus Torvalds
@ 2017-03-05 11:02 ` David Lang
       [not found]   ` <CA+dhYEXHbQfJ6KUB1tWS9u1MLEOJL81fTYkbxu4XO-i+379LPw@mail.gmail.com>
  2017-03-06 23:40   ` Jonathan Nieder
  2017-03-06  8:43 ` Jeff King
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 33+ messages in thread
From: David Lang @ 2017-03-05 11:02 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, sbeller, bmwill, jonathantanmy, peff, Linus Torvalds

> Translation table
> ~~~~~~~~~~~~~~~~~
> A fast bidirectional mapping between sha1-names and sha256-names of
> all local objects in the repository is kept on disk. The exact format
> of that mapping is to be determined.
>
> All operations that make new objects (e.g., "git commit") add the new
> objects to the translation table.

This seems like a rather nontrival thing to design. It will need to hold 
millions of mappings, and be quickly searchable from either direction (sha1->new 
and new->sha1) while still be fairly fast to insert new records into.

For Linux, just the list of hashes recording the commits is going to be in the 
millions, whiel the list of hashes of individual files for all those commits is 
going to be substantially larger.

David Lang

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-05  2:35 ` Linus Torvalds
@ 2017-03-06  0:26   ` brian m. carlson
  2017-03-06 18:24     ` Brandon Williams
  2017-03-07  0:17   ` RFC v3: Another proposed hash function transition plan Jonathan Nieder
  1 sibling, 1 reply; 33+ messages in thread
From: brian m. carlson @ 2017-03-06  0:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jonathan Nieder, Git Mailing List, Stefan Beller, bmwill, jonathantanmy, Jeff King

[-- Attachment #1: Type: text/plain, Size: 1818 bytes --]

On Sat, Mar 04, 2017 at 06:35:38PM -0800, Linus Torvalds wrote:
> On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> >
> > This document is still in flux but I thought it best to send it out
> > early to start getting feedback.
> 
> This actually looks very reasonable if you can implement it cleanly
> enough. In many ways the "convert entirely to a new 256-bit hash" is
> the cleanest model, and interoperability was at least my personal
> concern. Maybe your model solves it (devil in the details), in which
> case I really like it.

If you think you can do it, I'm all for it.

> Btw, I do think the particular choice of hash should still be on the
> table. sha-256 may be the obvious first choice, but there are
> definitely a few reasons to consider alternatives, especially if it's
> a complete switch-over like this.
> 
> One is large-file behavior - a parallel (or tree) mode could improve
> on that noticeably. BLAKE2 does have special support for that, for
> example. And SHA-256 does have known attacks compared to SHA-3-256 or
> BLAKE2 - whether that is due to age or due to more effort, I can't
> really judge. But if we're switching away from SHA1 due to known
> attacks, it does feel like we should be careful.

I agree with Linus on this.  SHA-256 is the slowest option, and it's the
one with the most advanced cryptanalysis.  SHA-3-256 is faster on 64-bit
machines (which, as we've seen on the list, is the overwhelming majority
of machines using Git), and even BLAKE2b-256 is stronger.

Doing this all over again in another couple years should also be a
non-goal.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-04  1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
  2017-03-05  2:35 ` Linus Torvalds
  2017-03-05 11:02 ` David Lang
@ 2017-03-06  8:43 ` Jeff King
  2017-03-06 18:39   ` Jonathan Tan
  2017-03-06 18:43   ` Junio C Hamano
  2017-03-07 18:57 ` Ian Jackson
  2017-03-13  9:24 ` The Keccak Team
  4 siblings, 2 replies; 33+ messages in thread
From: Jeff King @ 2017-03-06  8:43 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, sbeller, bmwill, jonathantanmy, Linus Torvalds

On Fri, Mar 03, 2017 at 05:12:51PM -0800, Jonathan Nieder wrote:

> This past week we came up with this idea for what a transition to a new
> hash function for Git would look like.  I'd be interested in your
> thoughts (especially if you can make them as comments on the document,
> which makes it easier to address them and update the document).

Overall it's an interesting idea. I thought at first that you were
suggesting servers do on-the-fly conversion, but after a more careful
reading that isn't the case. And I don't think that would work, because
the conversion is expensive.

So this pushes the conversion cost onto the clients who decide to move
to SHA-256. That may be a problem for sites which have a lot of clients
(like CI hosts). But I guess they would just stick with SHA-1 as long as
possible, until the upstream repo switches (and that _is_ a per-repo
flag day, because the upstream host isn't going to convert back to SHA-1
on the fly to serve the old clients).

> You can use the doc URL
> 
>  https://goo.gl/gh2Mzc

I'd encourage anybody following along to follow that link. I almost
didn't, but there are a ton of comments there (I'm not sure how I feel
about splitting the discussion off the list, though).

> Goals
> -----
> 1. The transition to SHA256 can be done one local repository at a time.
>    a. Requiring no action by any other party.
>    b. A SHA256 repository can communicate with SHA-1 Git servers and
>       clients (push/fetch).
>    c. Users can use SHA-1 and SHA256 identifiers for objects
>       interchangeably.
>    d. New signed objects make use of a stronger hash function than
>       SHA-1 for their security guarantees.
> 2. Allow a complete transition away from SHA-1.
>    a. Local metadata for SHA-1 compatibility can be dropped in a
>       repository if compatibility with SHA-1 is no longer needed.

I suspect we'll never get away from keeping the mapping table. You'll
need at least the sha1->sha256 table if you want to look up names found
in historic commit messages, mailing list posts, etc.

And you'll need the sha256->sha1 table if you want to verify the gpg
signatures on old tags and commits. That might be something people are
willing to drop, though.

> After negotiation, the server sends a packfile containing the
> requested objects. We convert the packfile to SHA-256 format using the
> following steps:
> 
> 1. index-pack: inflate each object in the packfile and compute its
>    SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
>    objects the client has locally. These objects can be looked up using
>    the translation table and their sha1-content read as described above
>    to resolve the deltas.
> 2. topological sort: starting at the "want"s from the negotiation
>    phase, walk through objects in the pack and emit a list of them in
>    topologically sorted order. (This list only contains objects
>    reachable from the "wants". If the pack from the server contained
>    additional extraneous objects, then they will be discarded.)

I don't think we do this right now, but you can actually find the entry
(and exit) points of a pack during the index-pack step. Basically:

  1. Keep a hashmap of objects mentioned in the pack.

  2. When we process an object's content (i.e., compute its hash), also
     parse it for any object references. Add entries in the hashmap for
     any object mentioned this way. Mark the entry for the object we
     processed with a "HAVE" bit, and mark any referenced object with a
     "REF" bit.

  3. After processing all objects, anything with a "HAVE" but no "REF"
     is an entry point to the pack (i.e., something that we should have
     asked for with a want). Anything with a "REF" but not a "HAVE" is
     an exit point (i.e., an object that we are expected to already have
     in our repo).

     (I've thought about this before because we could possibly shortcut
     the connectivity check using the exit points. It's complicated by
     the fact that we don't assume the transitive presence of objects
     unless they are reachable).

I don't think using the "want"s as the entry points is unreasonable,
though. The server _shouldn't_ generally be sending us other cruft.

I do wonder if you might be able to omit the extra object-graph walk
from your step 2, if you could assign "depths" to each object during
step 1 instead of HAVE/REF bits. The trouble, of course, is that you're
not visiting the nodes in the right order (so given two trees, you're
not sure if one might eventually be a child of the other; how do you
assign their depths?). I have a feeling there's a proof that it's
impossible, but I might just not be clever enough.


Overall the basics of the conversion seem sound to me. The "nohash"
things seems more complicated than I think it ought to be, which
probably just means I'm missing something.  I left a few related
comments on the google doc, so I won't repeat them here.

-Peff

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
       [not found]   ` <CA+dhYEXHbQfJ6KUB1tWS9u1MLEOJL81fTYkbxu4XO-i+379LPw@mail.gmail.com>
@ 2017-03-06  9:43     ` Jeff King
  0 siblings, 0 replies; 33+ messages in thread
From: Jeff King @ 2017-03-06  9:43 UTC (permalink / raw)
  To: ankostis; +Cc: David Lang, Jonathan Nieder, Git Mailing List, Stefan Beller, bmwill, jonathantanmy, Linus Torvalds

On Mon, Mar 06, 2017 at 10:29:33AM +0100, ankostis wrote:

> On 5 March 2017 at 12:02, David Lang <david@lang.hm> wrote:
> >> Translation table
> >> ~~~~~~~~~~~~~~~~~
> >> A fast bidirectional mapping between sha1-names and sha256-names of
> >> all local objects in the repository is kept on disk. The exact format
> >> of that mapping is to be determined.
> >>
> >> All operations that make new objects (e.g., "git commit") add the new
> >> objects to the translation table.
> >
> >
> > This seems like a rather nontrival thing to design. It will need to hold
> > millions of mappings, and be quickly searchable from either direction
> > (sha1->new and new->sha1) while still be fairly fast to insert new records
> > into.
> >
> > For Linux, just the list of hashes recording the commits is going to be in
> > the millions, whiel the list of hashes of individual files for all those
> > commits is going to be substantially larger.
> 
> Apologies if it is a stupid idea, but could we avoid the mappings-table
> just by
> hard-linking to the same object from both (or more) hashes?
> So instead of creating a text-db format, just use the filesystem.

No, for a few reasons:

  1. Most of these objects will not be in the filesystem at all, but
     rather in a packfile.

  2. It's not just a different hash over the same bytes. The sha256-name
     is taken over the sha256-content (which refers to other objects
     using sha256). So they really are different objects. You probably
     wouldn't keep the sha1 version around separately, but rather
     generate it on the fly during a push to a sha1 server.

  3. You really need to be able to take a sha256 name and convert it to
     a sha1 and vice versa. Hardlinks don't help with that, because they
     only point in one direction. That get you to the same _content_,
     but not the other name (and I guess this is where your "look up the
     name and then compute the other digest comes in, but that's
     probably too expensive to be workable).

I do think updating the mapping could potentially be deferred until
interacting with a sha1 server. But because it needs to be generated in
reverse-topological order, it's conceptually easier to do it one object
at a time.

-Peff

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-06  0:26   ` brian m. carlson
@ 2017-03-06 18:24     ` Brandon Williams
  0 siblings, 0 replies; 33+ messages in thread
From: Brandon Williams @ 2017-03-06 18:24 UTC (permalink / raw)
  To: brian m. carlson, Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King

On 03/06, brian m. carlson wrote:
> On Sat, Mar 04, 2017 at 06:35:38PM -0800, Linus Torvalds wrote:
> > On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> > >
> > > This document is still in flux but I thought it best to send it out
> > > early to start getting feedback.
> > 
> > This actually looks very reasonable if you can implement it cleanly
> > enough. In many ways the "convert entirely to a new 256-bit hash" is
> > the cleanest model, and interoperability was at least my personal
> > concern. Maybe your model solves it (devil in the details), in which
> > case I really like it.
> 
> If you think you can do it, I'm all for it.
> 
> > Btw, I do think the particular choice of hash should still be on the
> > table. sha-256 may be the obvious first choice, but there are
> > definitely a few reasons to consider alternatives, especially if it's
> > a complete switch-over like this.
> > 
> > One is large-file behavior - a parallel (or tree) mode could improve
> > on that noticeably. BLAKE2 does have special support for that, for
> > example. And SHA-256 does have known attacks compared to SHA-3-256 or
> > BLAKE2 - whether that is due to age or due to more effort, I can't
> > really judge. But if we're switching away from SHA1 due to known
> > attacks, it does feel like we should be careful.
> 
> I agree with Linus on this.  SHA-256 is the slowest option, and it's the
> one with the most advanced cryptanalysis.  SHA-3-256 is faster on 64-bit
> machines (which, as we've seen on the list, is the overwhelming majority
> of machines using Git), and even BLAKE2b-256 is stronger.
> 
> Doing this all over again in another couple years should also be a
> non-goal.

I agree that when we decide to move to a new algorithm that we should
select one which we plan on using for as long as possible (much longer
than a couple years).  While writing the document we simply used
"sha256" because it was more tangible and easier to reference.

> -- 
> brian m. carlson / brian with sandals: Houston, Texas, US
> +1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
> OpenPGP: https://keybase.io/bk2204



-- 
Brandon Williams

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-06  8:43 ` Jeff King
@ 2017-03-06 18:39   ` Jonathan Tan
  2017-03-06 19:22     ` Linus Torvalds
  2017-03-07  8:59     ` Jeff King
  2017-03-06 18:43   ` Junio C Hamano
  1 sibling, 2 replies; 33+ messages in thread
From: Jonathan Tan @ 2017-03-06 18:39 UTC (permalink / raw)
  To: Jeff King, Jonathan Nieder; +Cc: git, sbeller, bmwill, Linus Torvalds

On 03/06/2017 12:43 AM, Jeff King wrote:
> Overall the basics of the conversion seem sound to me. The "nohash"
> things seems more complicated than I think it ought to be, which
> probably just means I'm missing something.  I left a few related
> comments on the google doc, so I won't repeat them here.

I think "nohash" can be explained in 2 points:
  1. When creating signed objects, "nohash" is almost never written. Just
     create the object as usual and add "hash" lines for every other hash
     function that you want the signature to cover.
  2. When converting from function A to function B, add "nohash B" if
     there were no "hash B" lines in the original object.

The "nohash" thing was in the hope of requiring only one signature to 
sign all the hashes (in all the functions) that the user wants, while 
preserving round-tripping ability.

Maybe some examples would help to address the apparent complexity. These 
examples are the same as those in the document. I'll also show future 
compatibility with a hypothetical NEW hash function, and extend the rule 
about signing/verification to 'sign in the earliest supported hash 
function in ({object's hash function} + {functions in "hash" lines} - 
{function in "nohash" line})'.

Example 1 (existing signed commit)
<sha-1 object stuff>  <sha256 object stuff>  <NEW object stuff>
                       nohash sha256          nohash new
                       hash sha1 ...          hash sha1 ...

This object was probably created in a SHA-1 repository with no knowledge 
that we were going to transition to SHA256 (but there is nothing 
preventing us from creating the middle or right object and then 
translating it to the other functions).

Example 2 (recommended way to sign a commit in a SHA256 repo)
<sha-1 object stuff>  <sha256 object stuff>  <NEW object stuff>
hash sha256 ...       hash sha1 ...          nohash new
                                              hash sha1 ...
                                              hash sha256 ...

This is the recommended way to create a SHA256 object in a SHA256 repo. 
The rule about signing/verification (as stated above) is to sign in 
SHA-1, so when signing or verifying, we convert the object to SHA-1 and 
use that as the payload. Note that the signature covers both the SHA-1 
and SHA256 hashes, and that existing Git implementations can verify the 
signature.

Example 3 (a signer that does not care about SHA-1 anymore)
<sha-1 object stuff>  <sha256 object stuff>  <NEW object stuff>
nohash sha1                                  nohash new
hash sha256 ...                              hash sha256 ...

If we were to create a SHA256 object without any mentions of SHA-1, the 
rule about signing/verification (as stated above) states that the 
signature payload is the SHA256 object. This means that existing Git 
implementations cannot verify the signature, but we can still round-trip 
to SHA-1 and back without losing any information (as far as I can tell).

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-06  8:43 ` Jeff King
  2017-03-06 18:39   ` Jonathan Tan
@ 2017-03-06 18:43   ` Junio C Hamano
  1 sibling, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2017-03-06 18:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Nieder, git, sbeller, bmwill, jonathantanmy, Linus Torvalds

Jeff King <peff@peff.net> writes:

>> You can use the doc URL
>> 
>>  https://goo.gl/gh2Mzc
>
> I'd encourage anybody following along to follow that link. I almost
> didn't, but there are a ton of comments there (I'm not sure how I feel
> about splitting the discussion off the list, though).

I am sure how I feel about it---we should really discourage it,
unless it is an effort to help polishing an early draft for wider
distribution and discussion.

> I don't think we do this right now, but you can actually find the entry
> (and exit) points of a pack during the index-pack step. Basically:

We have code to do the "entry point" computation in index-pack
already, I think, in 81a04b01 ("index-pack: --clone-bundle option",
2016-03-03).

> I don't think using the "want"s as the entry points is unreasonable,
> though. The server _shouldn't_ generally be sending us other cruft.

That's true.

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-06 18:39   ` Jonathan Tan
@ 2017-03-06 19:22     ` Linus Torvalds
  2017-03-06 19:59       ` Brandon Williams
  2017-03-06 21:53       ` Junio C Hamano
  2017-03-07  8:59     ` Jeff King
  1 sibling, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2017-03-06 19:22 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Jeff King, Jonathan Nieder, Git Mailing List, Stefan Beller, bmwill

On Mon, Mar 6, 2017 at 10:39 AM, Jonathan Tan <jonathantanmy@google.com> wrote:
>
> I think "nohash" can be explained in 2 points:

I do think that that was my least favorite part of the suggestion. Not
just "nohash", but all the special "hash" lines too.

I would honestly hope that the design should not be about "other
hashes". If you plan your expectations around the new hash being
broken, something is wrong to begin with.

I do wonder if things wouldn't be simpler if the new format just
included the SHA1 object name in the new object. Put it in the
"header" line of the object, so that every time you look up an object,
you just _see_ the SHA1 of that object. You can even think of it as an
additional protection.

Btw, the multi-collision attack referenced earlier does _not_ work for
an iterated hash that has a bigger internal state than the final hash.
Which is actually a real argument against sha-256: the internal state
of sha-256 is 256 bits, so if an attack can find collisions due to
some weakness, you really can then generate exponential collisions by
chaining a linear collision search together.

But for sha3-256 or blake2, the internal hash state is larger than the
final hash, so now you need to generate collisions not in the 256
bits, but in the much larger search space of the internal hash space
if you want to generate those exponential collisions.

So *if* the new object format uses a git header line like

    "blob <size> <sha1>\0"

then it would inherently contain that mapping from 256-bit hash to the
SHA1, but it would actually also protect against attacks on the new
hash. In fact, in particular for objects with internal format that
differs between the two hashing models (ie trees and commits which to
some degree are higher-value targets), it would make attacks really
quite complicated, I suspect.

And you wouldn't need those "hash" or "nohash" things at all. The old
SHA1 would simply always be there, and cheap to look up (ie you
wouldn't have to unpack the whole object).

Hmm?

                   Linus

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-06 19:22     ` Linus Torvalds
@ 2017-03-06 19:59       ` Brandon Williams
  2017-03-06 21:53       ` Junio C Hamano
  1 sibling, 0 replies; 33+ messages in thread
From: Brandon Williams @ 2017-03-06 19:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jonathan Tan, Jeff King, Jonathan Nieder, Git Mailing List, Stefan Beller

On 03/06, Linus Torvalds wrote:
> On Mon, Mar 6, 2017 at 10:39 AM, Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > I think "nohash" can be explained in 2 points:
> 
> I do think that that was my least favorite part of the suggestion. Not
> just "nohash", but all the special "hash" lines too.
> 
> I would honestly hope that the design should not be about "other
> hashes". If you plan your expectations around the new hash being
> broken, something is wrong to begin with.
> 
> I do wonder if things wouldn't be simpler if the new format just
> included the SHA1 object name in the new object. Put it in the
> "header" line of the object, so that every time you look up an object,
> you just _see_ the SHA1 of that object. You can even think of it as an
> additional protection.
> 
> Btw, the multi-collision attack referenced earlier does _not_ work for
> an iterated hash that has a bigger internal state than the final hash.
> Which is actually a real argument against sha-256: the internal state
> of sha-256 is 256 bits, so if an attack can find collisions due to
> some weakness, you really can then generate exponential collisions by
> chaining a linear collision search together.
> 
> But for sha3-256 or blake2, the internal hash state is larger than the
> final hash, so now you need to generate collisions not in the 256
> bits, but in the much larger search space of the internal hash space
> if you want to generate those exponential collisions.
> 
> So *if* the new object format uses a git header line like
> 
>     "blob <size> <sha1>\0"
> 
> then it would inherently contain that mapping from 256-bit hash to the
> SHA1, but it would actually also protect against attacks on the new
> hash. In fact, in particular for objects with internal format that
> differs between the two hashing models (ie trees and commits which to
> some degree are higher-value targets), it would make attacks really
> quite complicated, I suspect.
> 
> And you wouldn't need those "hash" or "nohash" things at all. The old
> SHA1 would simply always be there, and cheap to look up (ie you
> wouldn't have to unpack the whole object).
> 
> Hmm?

I'll agree that the "hash" "nohash" bit isn't my favorite and is really
only there to address the signing of tags/commits in this new non-sha1
world.  I'm inclined to take a closer look at Jeff's suggestion which
simply has a signature for the hash that the signer cares about.

I don't know if keeping around the SHA1 for every object buys you all
that much.  It would add an additional layer of protection but you would
also need to compute the SHA1 for each object indefinitely (assuming you
include the SHA1 in new objects and not just converted objects).  The
hope would be that at some point you could not worry about SHA1 at all.
That may be difficult for projects with long history with commit msgs
which reference SHA1's of other commits (if you wanted to look up the
referenced commit, for example), but projects started in the new
non-sha1 world shouldn't have to ever compute a sha1.

Also, during this transition phase you would still need to maintain the
sha1<->sha256 translation table to make looking up objects by their sha1
name in a sha256 repo fast.  Otherwise I think it would take a
non-trivial amount of time to search a sha256 repo for a sha1 name.  So
if you do include the sha1 in the new object format then you would end
up with some duplicate information, which isn't the end of the world.

-- 
Brandon Williams

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-06 19:22     ` Linus Torvalds
  2017-03-06 19:59       ` Brandon Williams
@ 2017-03-06 21:53       ` Junio C Hamano
  1 sibling, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2017-03-06 21:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jonathan Tan, Jeff King, Jonathan Nieder, Git Mailing List, Stefan Beller, bmwill

Linus Torvalds <torvalds@linux-foundation.org> writes:

> So *if* the new object format uses a git header line like
>
>     "blob <size> <sha1>\0"
>
> then it would inherently contain that mapping from 256-bit hash to the
> SHA1, but it would actually also protect against attacks on the new
> hash.

This is easy for blobs as you only need to hash twice.  I am not
sure if you can do the same for trees, though.  For that <sha1> to
be useful, the hash needs to be over the tree contents whose
references are expressed in <sha1>, which in turn would mean...

... ah, you would read these <sha1> off of the object header in the
new world and you do not need to expand the whole thing.  OK, I see
how it could work.

> In fact, in particular for objects with internal format that
> differs between the two hashing models (ie trees and commits which to
> some degree are higher-value targets), it would make attacks really
> quite complicated, I suspect.
>
> And you wouldn't need those "hash" or "nohash" things at all. The old
> SHA1 would simply always be there, and cheap to look up (ie you
> wouldn't have to unpack the whole object).

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-05 11:02 ` David Lang
       [not found]   ` <CA+dhYEXHbQfJ6KUB1tWS9u1MLEOJL81fTYkbxu4XO-i+379LPw@mail.gmail.com>
@ 2017-03-06 23:40   ` Jonathan Nieder
  2017-03-07  0:03     ` Mike Hommey
  1 sibling, 1 reply; 33+ messages in thread
From: Jonathan Nieder @ 2017-03-06 23:40 UTC (permalink / raw)
  To: David Lang; +Cc: git, sbeller, bmwill, jonathantanmy, peff, Linus Torvalds

David Lang wrote:

>> Translation table
>> ~~~~~~~~~~~~~~~~~
>> A fast bidirectional mapping between sha1-names and sha256-names of
>> all local objects in the repository is kept on disk. The exact format
>> of that mapping is to be determined.
>>
>> All operations that make new objects (e.g., "git commit") add the new
>> objects to the translation table.
>
> This seems like a rather nontrival thing to design. It will need to
> hold millions of mappings, and be quickly searchable from either
> direction (sha1->new and new->sha1) while still be fairly fast to
> insert new records into.

I am currently thinking of using LevelDB, since it has the advantages of
being simple, already existing, and having already been ported to Java
(allowing JGit can read and write the same format).

If that doesn't work, we'd try some other key-value store like Samba's
tdb or Kyoto Cabinet.

Jonathan

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-06 23:40   ` Jonathan Nieder
@ 2017-03-07  0:03     ` Mike Hommey
  0 siblings, 0 replies; 33+ messages in thread
From: Mike Hommey @ 2017-03-07  0:03 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: David Lang, git, sbeller, bmwill, jonathantanmy, peff, Linus Torvalds

On Mon, Mar 06, 2017 at 03:40:30PM -0800, Jonathan Nieder wrote:
> David Lang wrote:
> 
> >> Translation table
> >> ~~~~~~~~~~~~~~~~~
> >> A fast bidirectional mapping between sha1-names and sha256-names of
> >> all local objects in the repository is kept on disk. The exact format
> >> of that mapping is to be determined.
> >>
> >> All operations that make new objects (e.g., "git commit") add the new
> >> objects to the translation table.
> >
> > This seems like a rather nontrival thing to design. It will need to
> > hold millions of mappings, and be quickly searchable from either
> > direction (sha1->new and new->sha1) while still be fairly fast to
> > insert new records into.
> 
> I am currently thinking of using LevelDB, since it has the advantages of
> being simple, already existing, and having already been ported to Java
> (allowing JGit can read and write the same format).
> 
> If that doesn't work, we'd try some other key-value store like Samba's
> tdb or Kyoto Cabinet.

FWIW, I'm using notes-like data to store mercurial->git mappings in
git-cinnabar, (ab)using the commit type in tree items. It's fast enough.

Mike

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* RFC v3: Another proposed hash function transition plan
  2017-03-05  2:35 ` Linus Torvalds
  2017-03-06  0:26   ` brian m. carlson
@ 2017-03-07  0:17   ` Jonathan Nieder
  2017-03-09 19:14     ` Shawn Pearce
  1 sibling, 1 reply; 33+ messages in thread
From: Jonathan Nieder @ 2017-03-07  0:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Stefan Beller, bmwill, jonathantanmy, Jeff King, David Lang, brian m. carlson

Linus Torvalds wrote:
> On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:

>> This document is still in flux but I thought it best to send it out
>> early to start getting feedback.
>
> This actually looks very reasonable if you can implement it cleanly
> enough.

Thanks for the kind words on what had quite a few flaws still.  Here's
a new draft.  I think the next version will be a patch against
Documentation/technical/.

As before, comments welcome, both here and inline at

  https://goo.gl/gh2Mzc

Changes since v2:

Use SHA3-256 instead of SHA2 (thanks, Linus and brian m.
carlson).[1][2]

Make sha3-based signatures a separate field, avoiding the need for
"hash" and "nohash" fields (thanks to peff[3]).

Add a sorting phase to fetch (thanks to Junio for noticing the need
for this).

Omit blobs from the topological sort during fetch (thanks to peff).

Discuss alternates, git notes, and git servers in the caveats section
(thanks to Junio Hamano, brian m. carlson[4], and Shawn Pearce).

Clarify language throughout (thanks to various commenters, especially
Junio).

Sincerely,
Jonathan

Git hash function transition
============================
Status: Draft
Last Updated: 2017-03-06

Objective
---------
Migrate Git from SHA-1 to a stronger hash function.

Background
----------
At its core, the Git version control system is a content addressable
filesystem. It uses the SHA-1 hash function to name content. For
example, files, directories, and revisions are referred to by hash
values unlike in other traditional version control systems where files
or versions are referred to via sequential numbers. The use of a hash
function to address its content delivers a few advantages:

* Integrity checking is easy. Bit flips, for example, are easily
  detected, as the hash of corrupted content does not match its name.
* Lookup of objects is fast.

Using a cryptographically secure hash function brings additional
advantages:

* Object names can be signed and third parties can trust the hash to
  address the signed object and all objects it references.
* Communication using Git protocol and out of band communication
  methods have a short reliable string that can be used to reliably
  address stored content.

Over time some flaws in SHA-1 have been discovered by security
researchers. https://shattered.io demonstrated a practical SHA-1 hash
collision. As a result, SHA-1 cannot be considered cryptographically
secure any more. This impacts the communication of hash values because
we cannot trust that a given hash value represents the known good
version of content that the speaker intended.

SHA-1 still possesses the other properties such as fast object lookup
and safe error checking, but other hash functions are equally suitable
that are believed to be cryptographically secure.

Goals
-----
1. The transition to SHA3-256 can be done one local repository at a time.
   a. Requiring no action by any other party.
   b. A SHA3-256 repository can communicate with SHA-1 Git servers
      (push/fetch).
   c. Users can use SHA-1 and SHA3-256 identifiers for objects
      interchangeably.
   d. New signed objects make use of a stronger hash function than
      SHA-1 for their security guarantees.
2. Allow a complete transition away from SHA-1.
   a. Local metadata for SHA-1 compatibility can be removed from a
      repository if compatibility with SHA-1 is no longer needed.
3. Maintainability throughout the process.
   a. The object format is kept simple and consistent.
   b. Creation of a generalized repository conversion tool.

Non-Goals
---------
1. Add SHA3-256 support to Git protocol. This is valuable and the
   logical next step but it is out of scope for this initial design.
2. Transparently improving the security of existing SHA-1 signed
   objects.
3. Intermixing objects using multiple hash functions in a single
   repository.
4. Taking the opportunity to fix other bugs in git's formats and
   protocols.
5. Shallow clones and fetches into a SHA3-256 repository. (This will
   change when we add SHA3-256 support to Git protocol.)
6. Skip fetching some submodules of a project into a SHA3-256
   repository. (This also depends on SHA3-256 support in Git
   protocol.)

Overview
--------
We introduce a new repository format extension `sha3`. Repositories
with this extension enabled use SHA3-256 instead of SHA-1 to name
their objects. This affects both object names and object content ---
both the names of objects and all references to other objects within
an object are switched to the new hash function.

sha3 repositories cannot be read by older versions of Git.

Alongside the packfile, a sha3 repository stores a bidirectional
mapping between sha3 and sha1 object names. The mapping is generated
locally and can be verified using "git fsck". Object lookups use this
mapping to allow naming objects using either their sha1 and sha3 names
interchangeably.

"git cat-file" and "git hash-object" gain options to display an object
in its sha1 form and write an object given its sha1 form. This
requires all objects referenced by that object to be present in the
object database so that they can be named using the appropriate name
(using the bidirectional hash mapping).

Fetches from a SHA-1 based server convert the fetched objects into
sha3 form and record the mapping in the bidirectional mapping table
(see below for details). Pushes to a SHA-1 based server convert the
objects being pushed into sha1 form so the server does not have to be
aware of the hash function the client is using.

Detailed Design
---------------
Object names
~~~~~~~~~~~~
Objects can be named by their 40 hexadecimal digit sha1-name or 64
hexadecimal digit sha3-name, plus names derived from those (see
gitrevisions(7)).

The sha1-name of an object is the SHA-1 of the concatenation of its
type, length, a nul byte, and the object's sha1-content. This is the
traditional <sha1> used in Git to name objects.

The sha3-name of an object is the SHA3-256 of the concatenation of its
type, length, a nul byte, and the object's sha3-content.

Object format
~~~~~~~~~~~~~
The content as a byte sequence of a tag, commit, or tree object named
by sha1 and sha3 differ because an object named by sha3-name refers to
other objects by their sha3-names and an object named by sha1-name
refers to other objects by their sha1-names.

The sha3-content of an object is the same as its sha1-content, except
that objects referenced by the object are named using their sha3-names
instead of sha1-names. Because a blob object does not refer to any
other object, its sha1-content and sha3-content are the same.

The format allows round-trip conversion between sha3-content and
sha1-content.

Object storage
~~~~~~~~~~~~~~
Loose objects use zlib compression and packed objects use the packed
format described in Documentation/technical/pack-format.txt, just like
today. The content that is compressed and stored uses sha3-content
instead of sha1-content.

Translation table
~~~~~~~~~~~~~~~~~
A fast bidirectional mapping between sha1-names and sha3-names of all
local objects in the repository is kept on disk. The exact format of
that mapping is to be determined.

All operations that make new objects (e.g., "git commit") add the new
objects to the translation table.

(This work could have been deferred to push time, but that would
significantly complicate and slow down pushes. Calculating the
sha1-name at object creation time at the same time it is being
streamed to disk and having its sha3-name calculated should be an
acceptable cost.)

Reading an object's sha1-content
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The sha1-content of an object can be read by converting all sha3-names
its sha3-content references to sha1-names using the translation table.

Fetch
~~~~~
Fetching from a SHA-1 based server requires translating between SHA-1
and SHA3-256 based representations on the fly.

SHA-1s named in the ref advertisement that are present on the client
can be translated to SHA3-256 and looked up as local objects using the
translation table.

Negotiation proceeds as today. Any "have"s generated locally are
converted to SHA-1 before being sent to the server, and SHA-1s
mentioned by the server are converted to SHA3-256 when looking them up
locally.

After negotiation, the server sends a packfile containing the
requested objects. We convert the packfile to SHA3-256 format using
the following steps:

1. index-pack: inflate each object in the packfile and compute its
   SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
   objects the client has locally. These objects can be looked up
   using the translation table and their sha1-content read as
   described above to resolve the deltas.
2. topological sort: starting at the "want"s from the negotiation
   phase, walk through objects in the pack and emit a list of them,
   excluding blobs, in reverse topologically sorted order, with each
   object coming later in the list than all objects it references.
   (This list only contains objects reachable from the "wants". If the
   pack from the server contained additional extraneous objects, then
   they will be discarded.)
3. convert to sha3: open a new (sha3) packfile. Read the topologically
   sorted list just generated. For each object, inflate its
   sha1-content, convert to sha3-content, and write it to the sha3
   pack. Include the new sha1<->sha3 mapping entry in the translation
   table.
4. sort: reorder entries in the new pack to match the order of objects
   in the pack the server generated and include blobs. Write a sha3 idx
   file.
5. clean up: remove the SHA-1 based pack file, index, and
   topologically sorted list obtained from the server and steps 1
   and 2.

Step 3 requires every object referenced by the new object to be in the
translation table. This is why the topological sort step is necessary.

As an optimization, step 1 could write a file describing what non-blob
objects each object it has inflated from the packfile references. This
makes the topological sort in step 2 possible without inflating the
objects in the packfile for a second time. The objects need to be
inflated again in step 3, for a total of two inflations.

Step 4 is probably necessary for good read-time performance. "git
pack-objects" on the server optimizes the pack file for good data
locality (see Documentation/technical/pack-heuristics.txt).

Details of this process are likely to change. It will take some
experimenting to get this to perform well.

Push
~~~~
Push is simpler than fetch because the objects referenced by the
pushed objects are already in the translation table. The sha1-content
of each object being pushed can be read as described in the "Reading
an object's sha1-content" section to generate the pack written by git
send-pack.

Signed Commits
~~~~~~~~~~~~~~
We add a new field "gpgsig-sha3" to the commit object format to allow
signing commits without relying on SHA-1. It is similar to the
existing "gpgsig" field. Its signed payload is the sha3-content of the
commit object with any "gpgsig" and "gpgsig-sha3" fields removed.

This means commits can be signed
1. using SHA-1 only, as in existing signed commit objects
2. using both SHA-1 and SHA3-256, by using both gpgsig-sha3 and gpgsig
   fields.
3. using only SHA3-256, by only using the gpgsig-sha3 field.

Old versions of "git verify-commit" can verify the gpgsig signature in
cases (1) and (2) without modifications and view case (3) as an
ordinary unsigned commit.

Signed Tags
~~~~~~~~~~~
We add a new field "gpgsig-sha3" to the tag object format to allow
signing tags without relying on SHA-1. Its signed payload is the
sha3-content of the tag with its gpgsig-sha3 field and "-----BEGIN PGP
SIGNATURE-----" delimited in-body signature removed.

This means tags can be signed
1. using SHA-1 only, as in existing signed tag objects
2. using both SHA-1 and SHA3-256, by using gpgsig-sha3 and an in-body
   signature.
3. using only SHA3-256, by only using the gpgsig-sha3 field.

Mergetag embedding
~~~~~~~~~~~~~~~~~~
The mergetag field in the sha1-content of a commit contains the
sha1-content of a tag that was merged by that commit.

The mergetag field in the sha3-content of the same commit contains the
sha3-content of the same tag.

Submodules
~~~~~~~~~~
To convert recorded submodule pointers, you need to have the converted
submodule repository in place. The translation table of the submodule
can be used to look up the new hash.

Caveats
-------
Invalid objects
~~~~~~~~~~~~~~~
The conversion from sha1-content to sha3-content retains any
brokenness in the original object (e.g., tree entry modes encoded with
leading 0, tree objects whose paths are not sorted correctly, and
commit objects without an author or committer). This is a deliberate
feature of the design to allow the conversion to round-trip.

More profoundly broken objects (e.g., a commit with a truncated "tree"
header line) cannot be converted but were not usable by current Git
anyway.

Shallow clone and submodules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because it requires all referenced objects to be available in the
locally generated translation table, this design does not support
shallow clone or unfetched submodules. Protocol improvements might
allow lifting this restriction.

Alternates
~~~~~~~~~~
For the same reason, a sha3 repository cannot borrow objects from a
sha1 repository using objects/info/alternates or
$GIT_ALTERNATE_OBJECT_REPOSITORIES.

git notes
~~~~~~~~~
The "git notes" tool annotates objects using their sha1-name as key.
This design does not describe a way to migrate notes trees to use
sha3-names. That migration is expected to happen separately (for
example using a file at the root of the notes tree to describe which
hash it uses).

Server-side cost
~~~~~~~~~~~~~~~~
Until Git protocol gains SHA3-256 support, using sha3 based storage on
public-facing Git servers is strongly discouraged. Once Git protocol
gains SHA3-256 support, sha3 based servers are likely not to support
sha1 compatibility, to avoid what may be a very expensive hash
reencode during clone and to encourage peers to modernize.

The design described here allows fetches by SHA-1 clients of a
personal SHA256 repository because it's not much more difficult than
allowing pushes from that repository. This support needs to be guarded
by a configuration option --- servers like git.kernel.org that serve a
large number of clients would not be expected to bear that cost.

Meaning of signatures
~~~~~~~~~~~~~~~~~~~~~
The signed payload for signed commits and tags does not explicitly
name the hash used to identify objects. If some day Git adopts a new
hash function with the same length as the current SHA-1 (40
hexadecimal digit) or SHA2-256 (64 hexadecimal digit) objects then the
intent behind the PGP signed payload in an object signature is
unclear:

	object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
	type commit
	tag v2.12.0
	tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800

	Git 2.12

Does this mean Git v2.12.0 is the commit with sha1-name
e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?

Fortunately SHA3-256 and SHA-1 have different lengths. If Git starts
using another hash with the same length to name objects, then it will
need to change the format of signed payloads using that hash to
address this issue.

Alternatives considered
-----------------------
Upgrading everyone working on a particular project on a flag day
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Projects like the Linux kernel are large and complex enough that
flipping the switch for all projects based on the repository at once
is infeasible.

Not only would all developers and server operators supporting
developers have to switch on the same flag day, but supporting tooling
(continuous integration, code review, bug trackers, etc) would have to
be adapted as well. This also makes it difficult to get early feedback
from some project participants testing before it is time for mass
adoption.

Using hash functions in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ )
Objects newly created would be addressed by the new hash, but inside
such an object (e.g. commit) it is still possible to address objects
using the old hash function.
* You cannot trust its history (needed for bisectability) in the
  future without further work
* Maintenance burden as the number of supported hash functions grows
  (they will never go away, so they accumulate). In this proposal, by
  comparison, converted objects lose all references to SHA-1.

Signed objects with multiple hashes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Instead of introducing the gpgsig-sha3 field in commit and tag objects
for sha3-content based signatures, an earlier version of this design
added "hash sha3 <sha3-name>" fields to strengthen the existing
sha1-content based signatures.

In other words, a single signature was used to attest to the object
content using both hash functions. This had some advantages:
* Using one signature instead of two speeds up the signing process.
* Having one signed payload with both hashes allows the signer to
  attest to the sha1-name and sha3-name referring to the same object.
* All users consume the same signature. Broken signatures are likely
  to be detected quickly using current versions of git.

However, it also came with disadvantages:
* Verifying a signed object requires access to the sha1-names of all
  objects it references, even after the transition is complete and
  translation table is no longer needed for anything else. To support
  this, the design added fields such as "hash sha1 tree <sha1-name>"
  and "hash sha1 parent <sha1-name>" to the sha3-content of a signed
  commit, complicating the conversion process.
* Allowing signed objects without a sha1 (for after the transition is
  complete) complicated the design further, requiring a "nohash sha1"
  field to suppress including "hash sha1" fields in the sha3-content
  and signed payload.

Document History
----------------

2017-03-03
bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com,
sbeller@google.com

Initial version sent to
http://public-inbox.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com

2017-03-03 jrnieder@gmail.com
Incorporated suggestions from jonathantanmy and sbeller:
* describe purpose of signed objects with each hash type
* redefine signed object verification using object content under the
  first hash function

2017-03-06 jrnieder@gmail.com
* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2]
* Make sha3-based signatures a separate field, avoiding the need for
  "hash" and "nohash" fields (thanks to peff[3]).
* Add a sorting phase to fetch (thanks to Junio for noticing the need
  for this).
* Omit blobs from the topological sort during fetch (thanks to peff).
* Discuss alternates, git notes, and git servers in the caveats
  section (thanks to Junio Hamano, brian m. carlson[4], and Shawn
  Pearce).
* Clarify language throughout (thanks to various commenters,
  especially Junio).

[1] http://public-inbox.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/
[2] http://public-inbox.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/
[3] http://public-inbox.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/
[4] http://public-inbox.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-06 18:39   ` Jonathan Tan
  2017-03-06 19:22     ` Linus Torvalds
@ 2017-03-07  8:59     ` Jeff King
  1 sibling, 0 replies; 33+ messages in thread
From: Jeff King @ 2017-03-07  8:59 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Jonathan Nieder, git, sbeller, bmwill, Linus Torvalds

On Mon, Mar 06, 2017 at 10:39:49AM -0800, Jonathan Tan wrote:

> The "nohash" thing was in the hope of requiring only one signature to sign
> all the hashes (in all the functions) that the user wants, while preserving
> round-tripping ability.

Thanks, this explained it very well.

I understand the tradeoff now, though I am still of the opinion that
simplicity is probably a more important goal.

In practice I'd imagine that anybody doing commit-signing would just
sign the more-secure hash, and people doing tag releases would probably
do a dual-sign to be verifiable by both old and new clients. Those are
infrequent enough that the extra computation probably doesn't matter.
But that's just my gut feeling.

-Peff

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-04  1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
                   ` (2 preceding siblings ...)
  2017-03-06  8:43 ` Jeff King
@ 2017-03-07 18:57 ` Ian Jackson
  2017-03-07 19:15   ` Linus Torvalds
  2017-03-13  9:24 ` The Keccak Team
  4 siblings, 1 reply; 33+ messages in thread
From: Ian Jackson @ 2017-03-07 18:57 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, sbeller, bmwill, jonathantanmy, peff, Linus Torvalds

Jonathan Nieder writes ("RFC: Another proposed hash function transition plan"):
> This past week we came up with this idea for what a transition to a new
> hash function for Git would look like.  I'd be interested in your
> thoughts (especially if you can make them as comments on the document,
> which makes it easier to address them and update the document).

Thanks for this.

This is a reasonable plan.  It corresponds to approaches (2) and (B)
of my survey mail from the other day.  Ie, two parallel homogeneous
hash trees, rather than a unified but heterogeneous hash tree, with
old vs new object names distinguished by length.

I still prefer my proposal with the mixed hash tree, mostly because
the handling of signatures here is very awkward, and because my
proposal does not involve altering object ids stored other than in the
git object graph (eg CI system databases, etc.)

One thing you've missed, I think, is notes: notes have to be dealt
with in a more complicated way.  Do you intend to rewrite the tree
objects for notes commits so that the notes are annotations for the
new names for the annotated objects ?  And if so, when ?

Also I think you need to specify how abbreviated object names are
interpreted.

Regards,
Ian.

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-07 18:57 ` Ian Jackson
@ 2017-03-07 19:15   ` Linus Torvalds
  2017-03-08 11:20     ` Ian Jackson
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2017-03-07 19:15 UTC (permalink / raw)
  To: Ian Jackson; +Cc: Jonathan Nieder, Git Mailing List, Stefan Beller, bmwill, Jonathan Tan, Jeff King

On Tue, Mar 7, 2017 at 10:57 AM, Ian Jackson
<ijackson@chiark.greenend.org.uk> wrote:
>
> Also I think you need to specify how abbreviated object names are
> interpreted.

One option might be to not use hex for the new hash, but base64 encoding.

That would make the full size ASCII hash encoding length roughly
similar (43 base64 characters rather than 40), which would offset some
of the new costs (longer filenames in the loose format, for example).

Also, since 256 isn't evenly divisible by 6, and because you'd want
some way to explictly disambiguate the new hashes, the rule *could* be
that the ASCII representation of a new hash is the base64 encoding of
the 258-bit value that has "10" prepended to it as padding.

That way the first character of the hash would be guaranteed to not be
a hex digit, because it would be in the range [g-v] (indexes 32..47).

Of course, the downside is that base64 encoded hashes can also end up
looking very much like real words, and now case would matter too.

The "use base64 with a "10" two-bit padding prepended" also means that
the natural loose format radix format would remain the first 2
characters of the hash, but due to the first character containing the
padding, it would be a fan-out of 2**10 rather than 2**12.

Of course, having written that, I now realize how it would cause
problems for the usual shit-for-brains case-insensitive filesystems.
So I guess base64 encoding doesn't work well for that reason.

                Linus

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-07 19:15   ` Linus Torvalds
@ 2017-03-08 11:20     ` Ian Jackson
  2017-03-08 15:37       ` Johannes Schindelin
  2017-03-08 15:40       ` Johannes Schindelin
  0 siblings, 2 replies; 33+ messages in thread
From: Ian Jackson @ 2017-03-08 11:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jonathan Nieder, Git Mailing List, Stefan Beller, bmwill, Jonathan Tan, Jeff King

Linus Torvalds writes ("Re: RFC: Another proposed hash function transition plan"):
> Also, since 256 isn't evenly divisible by 6, and because you'd want
> some way to explictly disambiguate the new hashes, the rule *could* be
> that the ASCII representation of a new hash is the base64 encoding of
> the 258-bit value that has "10" prepended to it as padding.
> 
> That way the first character of the hash would be guaranteed to not be
> a hex digit, because it would be in the range [g-v] (indexes 32..47).

We should arrange for this to be an uppercase, not a lowercase,
letter, for the reasons I explained in my own proposal.  To summarise:
It would be undesirable to further increase the overlap between object
names and ref names.  Few people use uppercase in ref names because of
the case-insensitive filesystem problem; so object names starting with
uppercase ascii are distinct from most object names.

> Of course, having written that, I now realize how it would cause
> problems for the usual shit-for-brains case-insensitive filesystems.
> So I guess base64 encoding doesn't work well for that reason.

AFAIAA object names occur in publicly-visible filenames only in notes
tree objects, which are manipulated by git internally and do not
necessarily need to appear in the filesystem.

The filenames in .git/objects/ can be in whatever encoding we like, so
are not an obstacle.

Ian.

-- 
Ian Jackson <ijackson@chiark.greenend.org.uk>   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-08 11:20     ` Ian Jackson
@ 2017-03-08 15:37       ` Johannes Schindelin
  2017-03-08 15:40       ` Johannes Schindelin
  1 sibling, 0 replies; 33+ messages in thread
From: Johannes Schindelin @ 2017-03-08 15:37 UTC (permalink / raw)
  To: Ian Jackson; +Cc: Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller, bmwill, Jonathan Tan, Jeff King

Hi Ian,

On Wed, 8 Mar 2017, Ian Jackson wrote:

> Few people use uppercase in ref names because of the case-insensitive
> filesystem problem;

Not true.

Ciao,
Johannes

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-08 11:20     ` Ian Jackson
  2017-03-08 15:37       ` Johannes Schindelin
@ 2017-03-08 15:40       ` Johannes Schindelin
  2017-03-20  5:21         ` Use base32? Jason Hennessey
  1 sibling, 1 reply; 33+ messages in thread
From: Johannes Schindelin @ 2017-03-08 15:40 UTC (permalink / raw)
  To: Ian Jackson; +Cc: Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller, bmwill, Jonathan Tan, Jeff King

Hi Ian,

On Wed, 8 Mar 2017, Ian Jackson wrote:

> Linus Torvalds writes ("Re: RFC: Another proposed hash function transition plan"):
> > Of course, having written that, I now realize how it would cause
> > problems for the usual shit-for-brains case-insensitive filesystems.
> > So I guess base64 encoding doesn't work well for that reason.
> 
> AFAIAA object names occur in publicly-visible filenames only in notes
> tree objects, which are manipulated by git internally and do not
> necessarily need to appear in the filesystem.
> 
> The filenames in .git/objects/ can be in whatever encoding we like, so
> are not an obstacle.

Given that the idea was to encode the new hash in base64 or base85, we
*are* talking about an encoding. In that respect, yes, it can be whatever
encoding we like, and Linus just made a good point (with unnecessary foul
language) of explaining why base64/base85 is not that encoding.

Ciao,
Johannes

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC v3: Another proposed hash function transition plan
  2017-03-07  0:17   ` RFC v3: Another proposed hash function transition plan Jonathan Nieder
@ 2017-03-09 19:14     ` Shawn Pearce
  2017-03-09 20:24       ` Jonathan Nieder
  0 siblings, 1 reply; 33+ messages in thread
From: Shawn Pearce @ 2017-03-09 19:14 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Linus Torvalds, Git Mailing List, Stefan Beller, bmwill, Jonathan Tan, Jeff King, David Lang, brian m. carlson

On Mon, Mar 6, 2017 at 4:17 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Linus Torvalds wrote:
>> On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
>
>>> This document is still in flux but I thought it best to send it out
>>> early to start getting feedback.
>>
>> This actually looks very reasonable if you can implement it cleanly
>> enough.
>
> Thanks for the kind words on what had quite a few flaws still.  Here's
> a new draft.  I think the next version will be a patch against
> Documentation/technical/.

FWIW, I like this approach.

> Alongside the packfile, a sha3 repository stores a bidirectional
> mapping between sha3 and sha1 object names. The mapping is generated
> locally and can be verified using "git fsck". Object lookups use this
> mapping to allow naming objects using either their sha1 and sha3 names
> interchangeably.

I saw some discussion about using LevelDB for this mapping table. I
think any existing database may be overkill.

For packs, you may be able to simplify by having only one file
(pack-*.msha1) that maps SHA-1 to pack offset; idx v2. The CRC32 table
in v2 is unnecessary, but you need the 64 bit offset support.

SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to
read the SHA-3.
SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to
translate offset to SHA-1.


For loose objects, the loose object directories should have only
O(4000) entries before auto gc is strongly encouraging
packing/pruning. With 256 shards, each given directory has O(16) loose
objects in it. When writing a SHA-3 loose object, Git could also
append a line "$sha3 $sha1\n" to objects/${first_byte}/sha1, which
GC/prune rewrites to remove entries. With O(16) objects in a
directory, these files should only have O(16) entries in them.

SHA-3 to SHA-1: open objects/${sha3_first_byte}/sha1 and scan until a
match is found.
SHA-1 to SHA-3: brute force read 256 files. Callers performing this
mapping may load all 256 files into a table in memory.

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC v3: Another proposed hash function transition plan
  2017-03-09 19:14     ` Shawn Pearce
@ 2017-03-09 20:24       ` Jonathan Nieder
  2017-03-10 19:38         ` Jeff King
  0 siblings, 1 reply; 33+ messages in thread
From: Jonathan Nieder @ 2017-03-09 20:24 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Linus Torvalds, Git Mailing List, Stefan Beller, bmwill, Jonathan Tan, Jeff King, David Lang, brian m. carlson

Hi,

Shawn Pearce wrote:
> On Mon, Mar 6, 2017 at 4:17 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:

>> Alongside the packfile, a sha3 repository stores a bidirectional
>> mapping between sha3 and sha1 object names. The mapping is generated
>> locally and can be verified using "git fsck". Object lookups use this
>> mapping to allow naming objects using either their sha1 and sha3 names
>> interchangeably.
>
> I saw some discussion about using LevelDB for this mapping table. I
> think any existing database may be overkill.
>
> For packs, you may be able to simplify by having only one file
> (pack-*.msha1) that maps SHA-1 to pack offset; idx v2. The CRC32 table
> in v2 is unnecessary, but you need the 64 bit offset support.
>
> SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to
> read the SHA-3.
> SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to
> translate offset to SHA-1.

Thanks for this suggestion.  I was initially vaguely nervous about
lookup times in an idx-style file, but as you say, object reads from a
packfile already have to deal with this kind of lookup and work fine.

> For loose objects, the loose object directories should have only
> O(4000) entries before auto gc is strongly encouraging
> packing/pruning. With 256 shards, each given directory has O(16) loose
> objects in it. When writing a SHA-3 loose object, Git could also
> append a line "$sha3 $sha1\n" to objects/${first_byte}/sha1, which
> GC/prune rewrites to remove entries. With O(16) objects in a
> directory, these files should only have O(16) entries in them.

Insertion time is what worries me.  When writing a small number of
objects using a command like "git commit", I don't want to have to
regenerate an entire idx file.  I don't want to move the pain to
O(loose objects) work at read time, either --- some people disable
auto gc, and others have a large number of loose objects due to gc
ejecting unreachable objects.

But some kind of simplification along these lines should be possible.
I'll experiment.

Jonathan

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC v3: Another proposed hash function transition plan
  2017-03-09 20:24       ` Jonathan Nieder
@ 2017-03-10 19:38         ` Jeff King
  2017-03-10 19:55           ` Jonathan Nieder
  0 siblings, 1 reply; 33+ messages in thread
From: Jeff King @ 2017-03-10 19:38 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Shawn Pearce, Linus Torvalds, Git Mailing List, Stefan Beller, bmwill, Jonathan Tan, David Lang, brian m. carlson

On Thu, Mar 09, 2017 at 12:24:08PM -0800, Jonathan Nieder wrote:

> > SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to
> > read the SHA-3.
> > SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to
> > translate offset to SHA-1.
> 
> Thanks for this suggestion.  I was initially vaguely nervous about
> lookup times in an idx-style file, but as you say, object reads from a
> packfile already have to deal with this kind of lookup and work fine.

Not exactly. The "reverse .idx" step has to build the reverse mapping on
the fly, and it's non-trivial. For instance, try:

  sha1=$(git rev-parse HEAD)
  time echo $sha1 | git cat-file --batch-check='%(objectsize)'
  time echo $sha1 | git cat-file --batch-check='%(objectsize:disk)'

on a large repo (where HEAD is in a big pack). The on-disk size is
conceptually simpler, as we only need to look at the offset of the
object versus the offset of the object after it. But in practice it
takes much longer, because it has to build the revindex on the fly (I
get 7ms versus 179ms on linux.git).

The effort is linear in the number of objects (we create the revindex
with a radix sort).

The reachability bitmaps suffer from this, too, as they need the
revindex to know which object is at which bit position. At GitHub we
added an extension to the .bitmap files that stores this "bit cache".
Here are timings before and after on linux.git:

  $ time git rev-list --use-bitmap-index --count master
  659371

  real	0m0.182s
  user	0m0.136s
  sys	0m0.044s

  $ time git.gh rev-list --use-bitmap-index --count master
  659371

  real	0m0.016s
  user	0m0.008s
  sys	0m0.004s

It's not a full revindex, but it's enough for bitmap use. You can also
use it to generate the revindex slightly more quickly, because you can
skip the sorting step (you just insert the entries in the correct order
by walking the bit cache and dereferencing the offsets from the .idx
portion). So it's still linear, but with a smaller constant factor.

I think for the purposes here, though, we don't actually care about the
offsets. For the cost of one uint32_t per object, you can keep a list
mapping positions in the sha1 index into the sha3 index. So then you do
the log-n binary search to find the sha1, a constant-time lookup in the
mapping array, and that gives you the position in the sha3 index, from
which you can then access the sha3 (or the actual pack offset, for that
matter).

So I think it's solvable, but I suspect we would want an extension to
the .idx format to store the mapping array, in order to keep it log-n.

-Peff

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC v3: Another proposed hash function transition plan
  2017-03-10 19:38         ` Jeff King
@ 2017-03-10 19:55           ` Jonathan Nieder
  0 siblings, 0 replies; 33+ messages in thread
From: Jonathan Nieder @ 2017-03-10 19:55 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn Pearce, Linus Torvalds, Git Mailing List, Stefan Beller, bmwill, Jonathan Tan, David Lang, brian m. carlson

Jeff King wrote:
> On Thu, Mar 09, 2017 at 12:24:08PM -0800, Jonathan Nieder wrote:

>>> SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to
>>> read the SHA-3.
>>> SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to
>>> translate offset to SHA-1.
>>
>> Thanks for this suggestion.  I was initially vaguely nervous about
>> lookup times in an idx-style file, but as you say, object reads from a
>> packfile already have to deal with this kind of lookup and work fine.
>
> Not exactly. The "reverse .idx" step has to build the reverse mapping on
> the fly, and it's non-trivial.

Sure.  To be clear, I was handwaving over that since adding an on-disk
reverse .idx is a relatively small change.

[...]
> So I think it's solvable, but I suspect we would want an extension to
> the .idx format to store the mapping array, in order to keep it log-n.

i.e., this.

The loose object side is the more worrying bit, since we currently don't
have any practical bound on the number of loose objects.

One way to deal with that is to disallow loose objects completely.
Use packfiles for new objects, batching the objects produced by a
single process into a single packfile.  Teach "git gc --auto" a
behavior similar to Martin Fick's "git exproll" to combine packfiles
between full gcs to maintain reasonable performance.  For unreachable
objects, instead of using loose objects, use "unreachable garbage"
packs explicitly labeled as such, with similar semantics to what
JGit's DfsRepository backend uses (described in the discussion at
https://git.eclipse.org/r/89455).

That's a direction that I want in the long term anyway.  I was hoping
not to couple such changes with the hash transition but it might be
one of the simpler ways to go.

Jonathan

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-04  1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
                   ` (3 preceding siblings ...)
  2017-03-07 18:57 ` Ian Jackson
@ 2017-03-13  9:24 ` The Keccak Team
  2017-03-13 17:48   ` Jonathan Nieder
  4 siblings, 1 reply; 33+ messages in thread
From: The Keccak Team @ 2017-03-13  9:24 UTC (permalink / raw)
  To: Jonathan Nieder, git; +Cc: sbeller, bmwill, jonathantanmy, peff, Linus Torvalds

Hello,

We have read your transition plan to move away from SHA-1 and noticed
your intent to use SHA3-256 as the new hash function in the new Git
repository format and protocol. Although this is a valid choice, we
think that the new SHA-3 standard proposes alternatives that may also be
interesting for your use cases.  As designers of the Keccak function
family, we thought we could jump in the mail thread and present these
alternatives.


SHA3-256, standardized in FIPS 202 [1], is a fixed-length hash function
that provides the same interface and security level as SHA-256 (FIPS
180-4). SHA3-256's primary goal is to be drop-in compatible with the
previous standard, and to allow a fast transition for applications that
would already use SHA-256.

Since your application did not use SHA-256, you are free to choose one
of the alternatives listed below.


* SHAKE128

  SHAKE128, defined in FIPS 202, is an eXtendable-Output Function (XOF)
  that generates digests of any size. In your case, you would use
  SHAKE128 the same way you would use SHA3-256, just truncating the
  output at 256 bits. In that case, SHAKE128 provides a security level
  of 128 bits against all generic attacks, including collisions,
  preimages, etc. We think this security level is appropriate for your
  application since this is the maximum you can get with 256-bit tags in
  the case of collision attacks, and this level is beyond computation
  reach for any adversary in the foreseeable future.

  The immediate benefit of using SHAKE128 versus SHA3-256 is a
  performance gain of roughly 20%, both for SW and HW implementations.
  On Intel Core i5-6500, SHAKE128 throughput is 430MiB/s.


* ParallelHash128

  ParallelHash128 (PH128), defined in NIST Special Publication 800-185
  (SP800-185, SHA-3 Derived Functions [2]), is a XOF implementing a tree
  hash mode on top of SHAKE128 (in fact cSHAKE128) to provide higher
  performance for large-file hashing. The tree mode is designed to
  exploit any available parallelism on the CPU, either through vector
  instructions or availability of multiple cores. Note that the chosen
  level of parallelism does not impact the final result, which improves
  interoperability.

  PH128 offers the same security level and interface as SHAKE128. So
  likewise, you just truncate the output at 256 bits.

  The net advantage of using PH128 over SHAKE128 is a huge performance
  boost when hashing big files.  The advantage depends of course on the
  number of cores used for hashing and their architecture. On an Intel
  Core i5-6500 (Skylake), with a single-core, PH128 is faster than
  SHAKE128 by a factor 3 and than SHA-1 by a factor 1.5 over long
  messages, with a throughput of 1320MiB/s.


* KangarooTwelve

  KangarooTwelve (K12) [3] is a very fast parallel and secure XOF we
  defined for applications that require higher performance that the FIPS
  202 and SP800-185 functions provide, while retaining the same
  flexibility and basis of security.

  K12 is very similar to PH128. It uses the same cryptographic primitive
  (Keccak-p, defined in FIPS 202), the same sponge construction, a
  similar tree hashing mode, and targets the same generic security level
  (128 bits). The main differences are the number of rounds for the
  inner permutation, which is reduced to 12, and the tree mode
  parameters, which are optimized for both small and long messages.

  Again, the benefit of using K12 over PH128 is performance. K12 is
  twice as fast as SHAKE128 for short messages, i.e. 820MiB/s on Intel
  Core i5-6500, and twice as fast as PH128 over long messages, i.e.
  2500MiB/s on the same platform.


If performance is not your primary concern, we suggest to use SHAKE128
as the default hash function, and optionally use ParallelHash128 for
hashing big files. Both functions offer a considerable security margin
and are standardized algorithms. On the longer term, provided HW
acceleration, SHAKE128 alone would easily outperform SHA-1 thanks to its
design.

If however you value first performance, or if you would like to promote
adoption of the new repository format by offering higher performance,
then KangarooTwelve is the right candidate. On modern CPU, K12 offers
equal performance as SHA-1 for small messages and outperforms it by a
factor 3 for long messages.  Regarding security, although K12 offers of
course a smaller security margin than other alternatives, it inherits
the security assurance built up for Keccak and the FIPS 202 functions.
As of today, the best practical attack broke 6 rounds of Keccak-p, with
2^50 computation effort. The 12 rounds of K12 offers then a comfortable
security margin [4].


Lately, we made a presentation at FOSDEM covering the latest development
over the Keccak family [5].  You can find reference and optimized
implementations of the algorithms listed above in the Keccak Code
Package [6]. Also, if you have questions, don't hesitate to contact us.


Kind regards,
The Keccak Team

Links
 [1]   FIPS 202,
       http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf.
 [2]   NIST SP 800-185,

http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-185.pdf.
 [3]   KangarooTwelve, http://keccak.noekeon.org/kangarootwelve.html.
 [4]   Keccak Crunchy Crypto Collision and Pre-image Contest,
       http://keccak.noekeon.org/crunchy_contest.html.
 [5]   FOSDEM 2017, Portfolio of optimized cryptographic functions based
       on Keccak, https://fosdem.org/2017/schedule/event/keccak/.
 [6]   Keccak Code Package, https://github.com/gvanas/KeccakCodePackage.



^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-13  9:24 ` The Keccak Team
@ 2017-03-13 17:48   ` Jonathan Nieder
  2017-03-13 18:34     ` ankostis
  0 siblings, 1 reply; 33+ messages in thread
From: Jonathan Nieder @ 2017-03-13 17:48 UTC (permalink / raw)
  To: The Keccak Team; +Cc: git, sbeller, bmwill, jonathantanmy, peff, Linus Torvalds

Hi,

The Keccak Team wrote:

> We have read your transition plan to move away from SHA-1 and noticed
> your intent to use SHA3-256 as the new hash function in the new Git
> repository format and protocol. Although this is a valid choice, we
> think that the new SHA-3 standard proposes alternatives that may also be
> interesting for your use cases.  As designers of the Keccak function
> family, we thought we could jump in the mail thread and present these
> alternatives.

I indeed had some reservations about SHA3-256's performance.  The main
hash function we had in mind to compare against is blake2bp-256.  This
overview of other functions to compare against should end up being
very helpful.

Thanks for this.  When I have more questions (which I most likely
will) I'll keep you posted.

Sincerely,
Jonathan

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-13 17:48   ` Jonathan Nieder
@ 2017-03-13 18:34     ` ankostis
  2017-03-17 11:07       ` Johannes Schindelin
  0 siblings, 1 reply; 33+ messages in thread
From: ankostis @ 2017-03-13 18:34 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: The Keccak Team, Git Mailing List, Stefan Beller, bmwill, Jonathan Tan, Jeff King, Linus Torvalds

On 13 March 2017 at 18:48, Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Hi,
>
> The Keccak Team wrote:
>
> > We have read your transition plan to move away from SHA-1 and noticed
> > your intent to use SHA3-256 as the new hash function in the new Git
> > repository format and protocol. Although this is a valid choice, we
> > think that the new SHA-3 standard proposes alternatives that may also be
> > interesting for your use cases.  As designers of the Keccak function
> > family, we thought we could jump in the mail thread and present these
> > alternatives.
>
> I indeed had some reservations about SHA3-256's performance.  The main
> hash function we had in mind to compare against is blake2bp-256.  This
> overview of other functions to compare against should end up being
> very helpful.

What if some of us need this extra difficulty, and don't mind about
the performance tax,
because we need to refer to hashes 10 or 30 years from now,
or even in the Post Quantum era?

Thanks,
  Kostis

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-13 18:34     ` ankostis
@ 2017-03-17 11:07       ` Johannes Schindelin
  0 siblings, 0 replies; 33+ messages in thread
From: Johannes Schindelin @ 2017-03-17 11:07 UTC (permalink / raw)
  To: ankostis; +Cc: Jonathan Nieder, The Keccak Team, Git Mailing List, Stefan Beller, bmwill, Jonathan Tan, Jeff King, Linus Torvalds

Hi Kostis,

On Mon, 13 Mar 2017, ankostis wrote:

> On 13 March 2017 at 18:48, Jonathan Nieder <jrnieder@gmail.com> wrote:
> >
> > The Keccak Team wrote:
> >
> > > We have read your transition plan to move away from SHA-1 and
> > > noticed your intent to use SHA3-256 as the new hash function in the
> > > new Git repository format and protocol. Although this is a valid
> > > choice, we think that the new SHA-3 standard proposes alternatives
> > > that may also be interesting for your use cases.  As designers of
> > > the Keccak function family, we thought we could jump in the mail
> > > thread and present these alternatives.
> >
> > I indeed had some reservations about SHA3-256's performance.  The main
> > hash function we had in mind to compare against is blake2bp-256.  This
> > overview of other functions to compare against should end up being
> > very helpful.
> 
> What if some of us need this extra difficulty, and don't mind about the
> performance tax, because we need to refer to hashes 10 or 30 years from
> now, or even in the Post Quantum era?

If you need this extra difficulty, and if this extra difficulty would
imply a huge penalty for everybody else, it is safe to assume that that
extra difficulty would need to be an extra switch, off by default.

It simply shows that we put too much of a burden on SHA-1: we used it for
three separate purposes: to verify data integrity, to allow addressing
objects by their own content, and for signing entire commit histories
cryptographically (more as an afterthought, as I see it: the Linux project
provides the context where you never fetch from any untrusted source,
therefore cryptographically secure signatures are not quite as important
as the trust between maintainer and lieutenants).

We *will* have to separate those concerns, and maybe even switch to
different algorithms for the different concerns. There are much better
algorithms for validating data integrity, for example, including error
correction (which SHA-1 never wanted to do anyway).

In your case, I could imagine that you would simply require verifiable
cryptographic signatures (.asc files) to be committed together with the
documents; it would be much harder to find a collision where those
signatures still match (or a double collision where the forged document's
signature would collide with the non-forget document's signature, in
addition to the two documents colliding).

Another idea would be to use Jonathan Nieder's proposed transition plan
and simply extend it. That transition plan details how the objects would
be hashed with two algorithms locally and how to maintain a bidirectional
mapping between the two. You could simply piggyback on that code and
provide patches that allow for a third, configurable algorithm, and that
algorithm's hashes would simply be added to the commit objects and fsck
would then know to verify those, too. That would be an opt-in feature, of
course, so that only those who need the extra long term security have to
pay the price of a substantially slower hashing.

What we cannot do is to pick a super slow hash algorithm just to cater to
the use case where legal documents are managed, punishing everybody else
for using Git in the intended way: to manage source code.

Ciao,
Johannes

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Use base32?
  2017-03-08 15:40       ` Johannes Schindelin
@ 2017-03-20  5:21         ` Jason Hennessey
  2017-03-20  5:58           ` Michael Steuer
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Hennessey @ 2017-03-20  5:21 UTC (permalink / raw)
  To: johannes.schindelin; +Cc: bmwill, git, ijackson, jonathantanmy, jrnieder, peff, sbeller, torvalds


On Wed, 8 Mar 2017, Johannes Schindelin wrote:
> > Linus Torvalds writes ("Re: RFC: Another proposed hash function transition plan"): > > Of course, having written that, I now realize how it would cause
> > > problems for the usual shit-for-brains case-insensitive
> filesystems. > > So I guess base64 encoding doesn't work well for that
> reason.
> Given that the idea was to encode the new hash in base64 or base85, we
> *are* talking about an encoding. In that respect, yes, it can be whatever
> encoding we like, and Linus just made a good point (with unnecessary foul
> language) of explaining why base64/base85 is not that encoding.

Since the hash format is switching anyway, how about using base32
instead of hex?

Still get a 20% space savings over hex (minus a little for padding), and
it's guaranteed to be a single case.
Jason


^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: Use base32?
  2017-03-20  5:21         ` Use base32? Jason Hennessey
@ 2017-03-20  5:58           ` Michael Steuer
  2017-03-20  8:05             ` Jacob Keller
  0 siblings, 1 reply; 33+ messages in thread
From: Michael Steuer @ 2017-03-20  5:58 UTC (permalink / raw)
  To: Jason Hennessey, johannes.schindelin; +Cc: bmwill, git, ijackson, jonathantanmy, jrnieder, peff, sbeller, torvalds


On 20/03/2017 16:21, Jason Hennessey wrote:
> On Wed, 8 Mar 2017, Johannes Schindelin wrote:
>>> Linus Torvalds writes ("Re: RFC: Another proposed hash function transition plan"): > > Of course, having written that, I now realize how it would cause
>>>> problems for the usual shit-for-brains case-insensitive
>> filesystems. > > So I guess base64 encoding doesn't work well for that
>> reason.
>> Given that the idea was to encode the new hash in base64 or base85, we
>> *are* talking about an encoding. In that respect, yes, it can be whatever
>> encoding we like, and Linus just made a good point (with unnecessary foul
>> language) of explaining why base64/base85 is not that encoding.
> Since the hash format is switching anyway, how about using base32
> instead of hex?
>
> Still get a 20% space savings over hex (minus a little for padding), and
> it's guaranteed to be a single case.
> Jason
>

If base32 is being considered, I'd suggest the "base32hex" variant, 
which uses the same amount of space.

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: Use base32?
  2017-03-20  5:58           ` Michael Steuer
@ 2017-03-20  8:05             ` Jacob Keller
  2017-03-21  3:07               ` Michael Steuer
  0 siblings, 1 reply; 33+ messages in thread
From: Jacob Keller @ 2017-03-20  8:05 UTC (permalink / raw)
  To: Michael Steuer; +Cc: Jason Hennessey, Johannes Schindelin, Brandon Williams, Git mailing list, Ian Jackson, Jonathan Tan, Jonathan Nieder, Jeff King, Stefan Beller, Linus Torvalds

On Sun, Mar 19, 2017 at 10:58 PM, Michael Steuer
<Michael.Steuer@constrainttec.com> wrote:
>
> On 20/03/2017 16:21, Jason Hennessey wrote:
>>
>> On Wed, 8 Mar 2017, Johannes Schindelin wrote:
>>>>
>>>> Linus Torvalds writes ("Re: RFC: Another proposed hash function
>>>> transition plan"): > > Of course, having written that, I now realize how it
>>>> would cause
>>>>>
>>>>> problems for the usual shit-for-brains case-insensitive
>>>
>>> filesystems. > > So I guess base64 encoding doesn't work well for that
>>> reason.
>>> Given that the idea was to encode the new hash in base64 or base85, we
>>> *are* talking about an encoding. In that respect, yes, it can be whatever
>>> encoding we like, and Linus just made a good point (with unnecessary foul
>>> language) of explaining why base64/base85 is not that encoding.
>>
>> Since the hash format is switching anyway, how about using base32
>> instead of hex?
>>
>> Still get a 20% space savings over hex (minus a little for padding), and
>> it's guaranteed to be a single case.
>> Jason
>>
>
> If base32 is being considered, I'd suggest the "base32hex" variant, which
> uses the same amount of space.

I don't see the benefit of adding characters like 0 and 1 which
conflict with some of the letters? Since there's no need for a human
to decode the base32 output, it's easier to use the one that's less
likely to get screwed up when typing if that ever happens. It's not
like we actually need to know what value each character represents.
(sure the program does, but the human does not).

Thanks,
Jake

^ permalink raw reply	[flat|threaded] 33+ messages in thread

* Re: Use base32?
  2017-03-20  8:05             ` Jacob Keller
@ 2017-03-21  3:07               ` Michael Steuer
  0 siblings, 0 replies; 33+ messages in thread
From: Michael Steuer @ 2017-03-21  3:07 UTC (permalink / raw)
  To: Jacob Keller; +Cc: Jason Hennessey, Johannes Schindelin, Brandon Williams, Git mailing list, Ian Jackson, Jonathan Tan, Jonathan Nieder, Jeff King, Stefan Beller, Linus Torvalds


On 20/03/2017 19:05, Jacob Keller wrote:
> On Sun, Mar 19, 2017 at 10:58 PM, Michael Steuer
> <Michael.Steuer@constrainttec.com> wrote:
>> [..]
>> If base32 is being considered, I'd suggest the "base32hex" variant, which
>> uses the same amount of space.
> I don't see the benefit of adding characters like 0 and 1 which
> conflict with some of the letters? Since there's no need for a human
> to decode the base32 output, it's easier to use the one that's less
> likely to get screwed up when typing if that ever happens. It's not
> like we actually need to know what value each character represents.
> (sure the program does, but the human does not).
>
> Thanks,
> Jake

Fair enough and good point. We definitely wouldn't want 0 and O and 1 
and I mixed together.

Cheers,
Mike.

^ permalink raw reply	[flat|threaded] 33+ messages in thread

end of thread, back to index

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-04  1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
2017-03-05  2:35 ` Linus Torvalds
2017-03-06  0:26   ` brian m. carlson
2017-03-06 18:24     ` Brandon Williams
2017-03-07  0:17   ` RFC v3: Another proposed hash function transition plan Jonathan Nieder
2017-03-09 19:14     ` Shawn Pearce
2017-03-09 20:24       ` Jonathan Nieder
2017-03-10 19:38         ` Jeff King
2017-03-10 19:55           ` Jonathan Nieder
2017-03-05 11:02 ` David Lang
     [not found]   ` <CA+dhYEXHbQfJ6KUB1tWS9u1MLEOJL81fTYkbxu4XO-i+379LPw@mail.gmail.com>
2017-03-06  9:43     ` Jeff King
2017-03-06 23:40   ` Jonathan Nieder
2017-03-07  0:03     ` Mike Hommey
2017-03-06  8:43 ` Jeff King
2017-03-06 18:39   ` Jonathan Tan
2017-03-06 19:22     ` Linus Torvalds
2017-03-06 19:59       ` Brandon Williams
2017-03-06 21:53       ` Junio C Hamano
2017-03-07  8:59     ` Jeff King
2017-03-06 18:43   ` Junio C Hamano
2017-03-07 18:57 ` Ian Jackson
2017-03-07 19:15   ` Linus Torvalds
2017-03-08 11:20     ` Ian Jackson
2017-03-08 15:37       ` Johannes Schindelin
2017-03-08 15:40       ` Johannes Schindelin
2017-03-20  5:21         ` Use base32? Jason Hennessey
2017-03-20  5:58           ` Michael Steuer
2017-03-20  8:05             ` Jacob Keller
2017-03-21  3:07               ` Michael Steuer
2017-03-13  9:24 ` The Keccak Team
2017-03-13 17:48   ` Jonathan Nieder
2017-03-13 18:34     ` ankostis
2017-03-17 11:07       ` Johannes Schindelin

git@vger.kernel.org mailing list mirror (one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox