RFC: Another proposed hash function transition plan


* RFC: Another proposed hash function transition plan
@ 2017-03-04  1:12 Jonathan Nieder
  2017-03-05  2:35 ` Linus Torvalds
                   ` (4 more replies)
  0 siblings, 5 replies; 113+ messages in thread
From: Jonathan Nieder @ 2017-03-04  1:12 UTC (permalink / raw)
  To: git; +Cc: sbeller, bmwill, jonathantanmy, peff, Linus Torvalds

Hi,

This past week we came up with this idea for what a transition to a new
hash function for Git would look like.  I'd be interested in your
thoughts (especially if you can make them as comments on the document,
which makes it easier to address them and update the document).

This document is still in flux but I thought it best to send it out
early to start getting feedback.

We tried to incorporate some thoughts from the thread
http://public-inbox.org/git/20170223164306.spg2avxzukkggrpb@kitenet.net
but it is a little long so it is easy to imagine we've missed
some things already discussed there.

You can use the doc URL

 https://goo.gl/gh2Mzc

to view the latest version and comment.

Thoughts welcome, as always.

Git hash function transition
============================
Status: Draft
Last Updated: 2017-03-03

Objective
---------
Migrate Git from SHA-1 to a stronger hash function.

Background
----------
The Git version control system can be thought of as a content
addressable filesystem. It uses the SHA-1 hash function to name
content. For example, files, trees, and commits are referred to by hash
values, unlike in other traditional version control systems where files
or versions are referred to via sequential numbers. The use of a hash
function to address its content delivers a few advantages:

* Integrity checking is easy. Bit flips, for example, are easily
  detected, as the hash of corrupted content does not match its name.
* Lookup of objects is fast.

Using a cryptographically secure hash function brings additional advantages:

* Object names can be signed and third parties can trust the hash to
  address the signed object and all objects it references.
* Communication using Git protocol and out of band communication
  methods have a short reliable string that can be used to reliably
  address stored content.

Over time some flaws in SHA-1 have been discovered by security
researchers. https://shattered.io demonstrated a practical SHA-1 hash
collision. As a result, SHA-1 cannot be considered cryptographically
secure any more. This impacts the communication of hash values because
we cannot trust that a given hash value represents the known good
version of content that the speaker intended.

SHA-1 still possesses the other properties such as fast object lookup
and safe error checking, but other hash functions that are believed to
be cryptographically secure are equally suitable.

Goals
-----
1. The transition to SHA256 can be done one local repository at a time.
   a. Requiring no action by any other party.
   b. A SHA256 repository can communicate with SHA-1 Git servers and
      clients (push/fetch).
   c. Users can use SHA-1 and SHA256 identifiers for objects
      interchangeably.
   d. New signed objects make use of a stronger hash function than
      SHA-1 for their security guarantees.
2. Allow a complete transition away from SHA-1.
   a. Local metadata for SHA-1 compatibility can be dropped in a
      repository if compatibility with SHA-1 is no longer needed.
3. Maintainability throughout the process.
   a. The object format is kept simple and consistent.
   b. Creation of a generalized repository conversion tool.

Non-Goals
---------
1. Add SHA256 support to Git protocol. This is valuable and the
   logical next step but it is out of scope for this initial design.
2. Transparently improving the security of existing SHA-1 signed
   objects.
3. Intermixing objects using multiple hash functions in a single
   repository.
4. Taking the opportunity to fix other bugs in git's formats and
   protocols.
5. Shallow clones and fetches into a SHA256 repository. (This will
   change when we add SHA256 support to Git protocol.)
6. Skip fetching some submodules of a project into a SHA256
   repository. (This also depends on SHA256 support in Git protocol.)

Overview
--------
We introduce a new repository format extension `sha256`. Repositories
with this extension enabled use SHA256 instead of SHA-1 to name their
objects. This affects both object names and object content --- both
the names of objects and all references to other objects within an
object are switched to the new hash function.

sha256 repositories cannot be read by older versions of Git.

Alongside the packfile, a sha256 repository stores a bidirectional
mapping between sha256 and sha1 object names. The mapping is generated
locally and can be verified using "git fsck". Object lookups use this
mapping to allow naming objects using either their sha1 or sha256
names interchangeably.

"git cat-file" and "git hash-object" gain options to display a sha256
object in its sha1 form and write a sha256 object given its sha1 form.
This requires all objects referenced by that object to be present in
the object database so that they can be named using the appropriate
name (using the bidirectional hash mapping).

Fetches from a SHA-1 based server convert the fetched objects into
sha256 form and record the mapping in the bidirectional mapping table
(see below for details). Pushes to a SHA-1 based server convert the
objects being pushed into sha1 form so the server does not have to be
aware of the hash function the client is using.

Detailed Design
---------------
Object names
~~~~~~~~~~~~
Objects can be named by their 40 hexadecimal digit sha1-name or 64
hexadecimal digit sha256-name, plus names derived from those (see
gitrevisions(7)).

The sha1-name of an object is the SHA-1 of the concatenation of its
type, length, a nul byte, and the object's sha1-content. This is the
traditional <sha1> used in Git to name objects.

The sha256-name of an object is the SHA-256 of the concatenation of
its type, length, a nul byte, and the object's sha256-content.
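
As a minimal sketch of the two naming rules above (Python is used only
for illustration, the function name object_name is made up for this
example, and the header is assumed to be "<type> <length>" followed by
a nul byte, as in Git's existing object encoding):

```python
import hashlib


def object_name(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Hash "<type> <length>", a nul byte, and the content with algo."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.new(algo, header + content).hexdigest()


# A blob references no other objects, so its sha1-content and
# sha256-content are the same bytes; only the hash function differs.
blob = b"hello world\n"
sha1_name = object_name("blob", blob, "sha1")
sha256_name = object_name("blob", blob, "sha256")
print(sha1_name)    # 40 hexadecimal digits
print(sha256_name)  # 64 hexadecimal digits
```

For trees, commits, and tags the two contents differ, because the
names of referenced objects embedded in the content change as well.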

Object format
~~~~~~~~~~~~~
Objects are stored using a compressed representation of their
sha256-content. The sha256-content of an object is the same as its
sha1-content, except that:
* objects referenced by the object are named using their sha256-names
  instead of sha1-names
* signed tags, commits, and merges of signed tags get some additional
  fields (see below)

The format allows round-trip conversion between sha256-content and
sha1-content.

Loose objects use zlib compression and packed objects use the packed
format described in Documentation/technical/pack-format.txt, just like
today.

Translation table
~~~~~~~~~~~~~~~~~
A fast bidirectional mapping between sha1-names and sha256-names of
all local objects in the repository is kept on disk. The exact format
of that mapping is to be determined.

All operations that make new objects (e.g., "git commit") add the new
objects to the translation table.
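
Since the on-disk format is still to be determined, the following is
only an in-memory stand-in that shows the interface such a table might
offer. The class and method names are hypothetical, and a real
implementation would need a persistent, searchable file format rather
than two Python dictionaries:

```python
class TranslationTable:
    """In-memory stand-in for the sha1<->sha256 mapping (hypothetical API)."""

    def __init__(self) -> None:
        self._to_sha256: dict[str, str] = {}
        self._to_sha1: dict[str, str] = {}

    def add(self, sha1_name: str, sha256_name: str) -> None:
        """Record one object under both of its names."""
        if len(sha1_name) != 40 or len(sha256_name) != 64:
            raise ValueError("expected 40- and 64-digit hexadecimal names")
        self._to_sha256[sha1_name] = sha256_name
        self._to_sha1[sha256_name] = sha1_name

    def lookup(self, name: str) -> tuple[str, str]:
        """Return (sha1-name, sha256-name) given either full name."""
        if name in self._to_sha256:
            return name, self._to_sha256[name]
        if name in self._to_sha1:
            return self._to_sha1[name], name
        raise KeyError(name)


table = TranslationTable()
# Tree names taken from the signed commit examples in this document.
table.add(
    "c7b1cff039a93f3600a1d18b82d26688668c7dea",
    "98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4",
)
print(table.lookup("c7b1cff039a93f3600a1d18b82d26688668c7dea"))
```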

Reading an object's sha1-content
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The sha1-content of an object can be read by converting all
sha256-names its sha256-content references to sha1-names using the
translation table. There is an additional minor transformation needed
for signed tags, commits, and merges (see below).
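
For an unsigned commit without a mergetag, the conversion could look
roughly like the sketch below. It assumes the commit is handled as
text, that the translation table is available as a plain dictionary
from sha256-names to sha1-names, and the function name is invented for
this example:

```python
def commit_to_sha1_content(sha256_content: str, to_sha1: dict[str, str]) -> str:
    """Rewrite the "tree" and "parent" header lines of an unsigned commit."""
    headers, sep, message = sha256_content.partition("\n\n")
    out = []
    for line in headers.split("\n"):
        key, _, value = line.partition(" ")
        if key in ("tree", "parent"):
            # A missing entry raises KeyError: every referenced object must
            # already be in the translation table.
            line = f"{key} {to_sha1[value]}"
        out.append(line)
    return "\n".join(out) + sep + message


to_sha1 = {
    "98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4":
        "c7b1cff039a93f3600a1d18b82d26688668c7dea",
    "e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980":
        "c33429be94b5f2d3ee9b0adad223f877f174b05d",
}
sha256_commit = (
    "tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4\n"
    "parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980\n"
    "author A U Thor <author@example.com> 1465982009 +0000\n"
    "committer C O Mitter <committer@example.com> 1465982009 +0000\n"
    "\n"
    "Example message\n"
)
print(commit_to_sha1_content(sha256_commit, to_sha1))
```

Trees and tags would need their own parsers, since they embed object
names in different places.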

Fetch
~~~~~
Fetching from a SHA-1 based server requires translating between SHA-1
and SHA-256 based representations on the fly.

SHA-1s named in the ref advertisement can be translated to SHA-256 and
looked up as local objects using the translation table.

Negotiation proceeds as today. Any "have"s or "want"s generated
locally are converted to SHA-1 before being sent to the server, and
SHA-1s mentioned by the server are converted to SHA-256 when looking
them up locally.

After negotiation, the server sends a packfile containing the
requested objects. We convert the packfile to SHA-256 format using the
following steps:

1. index-pack: inflate each object in the packfile and compute its
   SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
   objects the client has locally. These objects can be looked up using
   the translation table and their sha1-content read as described above
   to resolve the deltas.
2. topological sort: starting at the "want"s from the negotiation
   phase, walk through objects in the pack and emit a list of them in
   topologically sorted order. (This list only contains objects
   reachable from the "wants". If the pack from the server contained
   additional extraneous objects, then they will be discarded.)
3. convert to sha256: open a new (sha256) packfile. Read the
   topologically sorted list just generated in reverse order. For each
   object, inflate its sha1-content, convert to sha256-content, and
   write it to the sha256 pack. Write an idx file for this pack and
   include the new sha1<->sha256 mapping entry in the translation
   table.
4. clean up: remove the SHA-1 based pack file, index, and
   topologically sorted list obtained from the server and steps 1 and 2.

Step 3 requires every object referenced by the new object to be in the
translation table. This is why the topological sort step is necessary.

As an optimization, step 1 can write a file describing what objects
each object it has inflated from the packfile references. This makes
the topological sort in step 2 possible without inflating the objects
in the packfile for a second time. The objects need to be inflated
again in step 3, for a total of two inflations.
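
The ordering requirement of steps 2 and 3 can be sketched as follows.
This assumes the per-object reference list from step 1 is available as
a dictionary (here called refs, a made-up name) and uses an iterative
depth-first walk; it illustrates the ordering only, not the pack
handling:

```python
def topo_order(wants, refs):
    """Return pack objects reachable from wants, each before what it references.

    refs maps an object name to the names it references. Names absent
    from refs are treated as objects the client already has locally.
    """
    postorder, seen = [], set()
    for want in wants:
        if want in seen or want not in refs:
            continue
        seen.add(want)
        stack = [(want, iter(refs[want]))]
        while stack:
            name, children = stack[-1]
            for child in children:
                if child in refs and child not in seen:
                    seen.add(child)
                    stack.append((child, iter(refs[child])))
                    break
            else:
                # All references of this object have been emitted.
                postorder.append(name)
                stack.pop()
    # Post-order lists referenced objects first; reverse it so that each
    # object precedes everything it references.
    return postorder[::-1]


refs = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": ["blob1", "blob2"],
    "tree1": ["blob1"],
    "blob1": [],
    "blob2": [],
    "stray": [],  # not reachable from the wants, so it is discarded
}
order = topo_order(["commit2"], refs)
# Step 3 reads the list in reverse, so referenced objects are converted
# (and entered into the translation table) before their referrers.
conversion_order = list(reversed(order))
print(conversion_order)
```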

Push
~~~~
Push is simpler than fetch because the objects referenced by the
pushed objects are already in the translation table. The sha1-content
of each object being pushed can be read as described in the "Reading
an object's sha1-content" section to generate the pack written by git
send-pack.

Signed Objects
~~~~~~~~~~~~~~
Commits
^^^^^^^
Commits currently have the following sequence of header lines:

	"tree" SP object-name
	("parent" SP object-name)*
	"author" SP ident
	"committer" SP ident
	("mergetag" SP object-content)?
	("gpgsig" SP pgp-signature)?

We introduce new header lines "hash" and "nohash" that come after the
"gpgsig" field. No "hash" lines may appear unless the "gpgsig" field
is present.

Hash lines have the form

	"hash" SP hash-function SP field SP alternate-object-name

Nohash lines have the form

	"nohash" SP hash-function

There are only two recognized values of hash-function: "sha1" and
"sha256". "git fsck" will tolerate values of hash-function it does not
recognize, as long as they do not come before either of those two. All
"nohash" lines come before all "hash" lines. Any "hash sha1" lines
must come before all "hash sha256" lines, and likewise for nohash. The
Git project determines any future supported hash-functions that can
come after those two and their order.

There can be at most one "nohash <hash-function>" for one hash
function, indicating that this hash function should not be used when
checking the commit's signature.

There is one "hash <hash-function>" line for each tree or parent field
in the commit object header. The hash lines record object names for
those trees and parents using the indicated hash function, to be used
when checking the commit's signature.

TODO: simplify signature rules, handle the mergetag field better.

sha256-content of signed commits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The sha256-content of a commit with a "gpgsig" header can include no
hash and nohash lines, a "nohash sha256" line and "hash sha1", or just
a "hash sha1" line.

Examples:
1. tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980
   parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   gpgsig ...
2. tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980
   parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   gpgsig ...
   nohash sha256
   hash sha1 tree c7b1cff039a93f3600a1d18b82d26688668c7dea
   hash sha1 parent c33429be94b5f2d3ee9b0adad223f877f174b05d
   hash sha1 parent 04b871796dc0420f8e7561a895b52484b701d51a
3. tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980
   parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   gpgsig ...
   hash sha1 tree c7b1cff039a93f3600a1d18b82d26688668c7dea
   hash sha1 parent c33429be94b5f2d3ee9b0adad223f877f174b05d
   hash sha1 parent 04b871796dc0420f8e7561a895b52484b701d51a

sha1-content of signed commits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The sha1-content of a commit with a "gpgsig" header can contain a
"nohash sha1" and "hash sha256" line, no hash or nohash lines, or just
a "hash sha256" line.

Examples:
1. tree c7b1cff039a93f3600a1d18b82d26688668c7dea
   parent c33429be94b5f2d3ee9b0adad223f877f174b05d
   parent 04b871796dc0420f8e7561a895b52484b701d51a
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   gpgsig ...
   nohash sha1
   hash sha256 tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   hash sha256 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980
   hash sha256 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315
2. tree c7b1cff039a93f3600a1d18b82d26688668c7dea
   parent c33429be94b5f2d3ee9b0adad223f877f174b05d
   parent 04b871796dc0420f8e7561a895b52484b701d51a
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   gpgsig ...
3. tree c7b1cff039a93f3600a1d18b82d26688668c7dea
   parent c33429be94b5f2d3ee9b0adad223f877f174b05d
   parent 04b871796dc0420f8e7561a895b52484b701d51a
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   gpgsig ...
   hash sha256 tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   hash sha256 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980
   hash sha256 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315

Converting signed commits
^^^^^^^^^^^^^^^^^^^^^^^^^
To convert the sha1-content of a signed commit to its sha256-content:

1. Change "tree" and "parent" lines to use the sha256-names of
   referenced objects, as with unsigned commits.
2. If there is a "mergetag" field, convert it from sha1-content to
   sha256-content, as with unsigned commits with a mergetag (see the
   "Mergetag" section below).
3. Unless there is a "nohash sha1" line, add a full set of "hash sha1
   <field> <sha1>" lines indicating the sha1-names of the tree and
   parents.
4. Remove any "hash sha256 <field> <sha256>" lines. If no such lines
   were present, add a "nohash sha256" line.

Converting the sha256-content of a signed commit to sha1-content uses
the same process with sha1 and sha256 switched.
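
A rough sketch of these steps, under several simplifying assumptions:
the commit header is a list of lines, the signature is collapsed into
a single "gpgsig" line as in the examples, there is no mergetag field
(step 2 is not handled), and a "nohash" line for the source hash
function is not carried over (inferred from the examples, not stated in
the steps). The function name is invented for this example:

```python
def convert_signed_commit(lines, names, src="sha1", dst="sha256"):
    """Convert signed-commit header lines from src-content to dst-content.

    lines: header lines, with the signature collapsed to one "gpgsig" line.
    names: dict mapping src object names to dst object names.
    """
    body, old_refs = [], []
    had_nohash_src = had_hash_dst = False
    for line in lines:
        words = line.split(" ")
        if words[0] in ("tree", "parent"):
            old_refs.append((words[0], words[1]))         # remember src name
            body.append(f"{words[0]} {names[words[1]]}")  # step 1
        elif words[:2] == ["nohash", src]:
            had_nohash_src = True  # assumption: dropped from the output
        elif words[:2] == ["hash", dst]:
            had_hash_dst = True    # step 4: remove "hash <dst>" lines
        else:
            body.append(line)
    trailer = []
    if not had_hash_dst:
        trailer.append(f"nohash {dst}")                   # step 4
    if not had_nohash_src:                                # step 3
        trailer += [f"hash {src} {field} {name}" for field, name in old_refs]
    return body + trailer


# Object names taken from the examples in this document.
TREE = ("c7b1cff039a93f3600a1d18b82d26688668c7dea",
        "98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4")
PARENT = ("c33429be94b5f2d3ee9b0adad223f877f174b05d",
          "e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980")
to_sha256 = dict([TREE, PARENT])

sha1_commit = [
    f"tree {TREE[0]}",
    f"parent {PARENT[0]}",
    "author A U Thor <author@example.com> 1465982009 +0000",
    "committer C O Mitter <committer@example.com> 1465982009 +0000",
    "gpgsig ...",
]
for line in convert_signed_commit(sha1_commit, to_sha256):
    print(line)
```

Calling the same function with src="sha256", dst="sha1", and the
inverse mapping performs the conversion in the other direction.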

Verifying signed commit signatures
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If the commit has a "hash sha1" line (or is sha1-content without a
"nohash sha1" line): check that the signature matches the sha1-content
with gpgsig field stripped out.

Otherwise: check that the signature matches the sha1-content with
gpgsig, nohash, tree, and parents fields stripped out.

With the examples above, the signed payloads are
1. author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   hash sha256 tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   hash sha256 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980
   hash sha256 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315
2. tree c7b1cff039a93f3600a1d18b82d26688668c7dea
   parent c33429be94b5f2d3ee9b0adad223f877f174b05d
   parent 04b871796dc0420f8e7561a895b52484b701d51a
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
3. tree c7b1cff039a93f3600a1d18b82d26688668c7dea
   parent c33429be94b5f2d3ee9b0adad223f877f174b05d
   parent 04b871796dc0420f8e7561a895b52484b701d51a
   author A U Thor <author@example.com> 1465982009 +0000
   committer C O Mitter <committer@example.com> 1465982009 +0000
   hash sha256 tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   hash sha256 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980
   hash sha256 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315

Current versions of "git verify-commit" can verify examples (2) and (3)
(but not (1)).
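
The payload selection can be sketched as below, assuming the input is
already the sha1-content of the commit as a list of header lines with
the signature collapsed into one "gpgsig" line. For sha1-content the
first rule reduces to "there is no nohash sha1 line", which is what the
sketch tests. The function name is made up for illustration:

```python
def signed_payload(sha1_lines):
    """Return the header lines of a sha1-content commit covered by the signature."""
    words = [line.split(" ") for line in sha1_lines]
    if any(w[:2] == ["nohash", "sha1"] for w in words):
        # The signature does not rely on SHA-1 names.
        stripped = {"gpgsig", "nohash", "tree", "parent"}
    else:
        stripped = {"gpgsig"}
    return [line for line, w in zip(sha1_lines, words) if w[0] not in stripped]


example_1 = [
    "tree c7b1cff039a93f3600a1d18b82d26688668c7dea",
    "parent c33429be94b5f2d3ee9b0adad223f877f174b05d",
    "author A U Thor <author@example.com> 1465982009 +0000",
    "committer C O Mitter <committer@example.com> 1465982009 +0000",
    "gpgsig ...",
    "nohash sha1",
    "hash sha256 tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4",
    "hash sha256 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980",
]
for line in signed_payload(example_1):
    print(line)
```

For an input like example (1) this keeps only the author, committer,
and "hash sha256" lines; without a "nohash sha1" line only the gpgsig
line is removed.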

Tags
~~~~
Tags currently have the following sequence of header lines:

	"object" SP object-name
	"type" SP type
	"tag" SP identifier
	"tagger" SP ident

A tag's signature, if it exists, is in the message body.

We introduce new header lines "nohash" and "hash" that come after the
"tagger" field. No "nohash" or "hash" lines may appear unless the
message body contains a PGP signature.

As with commits, "nohash" lines have the form "nohash
<hash-function>", indicating that this hash function should not be
used when checking the tag's signature.

"hash" lines have the form

	"hash" SP hash-function SP alternate-object-name

This records the pointed-to object name using the indicated hash
function, to be used when checking the tag's signature.

As with commits, "sha1" and "sha256" are the only permitted values of
hash-function and can only appear in that order for a field when they
appear. There can be at most one "nohash" line, and it comes before
any "hash" lines. There can be only one "hash" line for a given hash
function.

sha256-content of signed tags
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The sha256-content of a signed tag can include no "hash" or "nohash"
lines, a "nohash sha256" and "hash sha1 <sha1>" line, or just a "hash
sha1 <sha1>" line.

Examples:
1. object 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   type tree
   tag v1.0
   tagger C O Mitter <committer@example.com> 1465981006 +0000

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   Version: GnuPG v1

   iQEcBAABAgAGBQJXYRhOAAoJEGEJLoW3InGJklkIAIcnhL7RwEb/+QeX9enkXhxn
   ...
2. object 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   type tree
   tag v1.0
   tagger C O Mitter <committer@example.com> 1465981006 +0000
   nohash sha256
   hash sha1 c7b1cff039a93f3600a1d18b82d26688668c7dea

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   ...
3. object 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   type tree
   tag v1.0
   tagger C O Mitter <committer@example.com> 1465981006 +0000
   hash sha1 c7b1cff039a93f3600a1d18b82d26688668c7dea

   Tag Demo v1.0
   ...

sha1-content of signed tags
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The sha1-content of a signed tag can include a "nohash sha1" and "hash
sha256" line, no "nohash" or "hash" lines, or just a "hash sha256
<sha256>" line.

Examples:
1. object c7b1cff039a93f3600a1d18b82d26688668c7dea
   ...
   tagger C O Mitter <committer@example.com> 1465981006 +0000
   nohash sha1
   hash sha256 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   ...
2. object c7b1cff039a93f3600a1d18b82d26688668c7dea
   ...
   tagger C O Mitter <committer@example.com> 1465981006 +0000

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   ...
3. object c7b1cff039a93f3600a1d18b82d26688668c7dea
   ...
   tagger C O Mitter <committer@example.com> 1465981006 +0000
   hash sha256 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   ...

Signed tags can be converted between sha1-content and sha256-content
using the same process as signed commits.

Verifying signed tags
^^^^^^^^^^^^^^^^^^^^^
As with commits, if the tag has a "hash sha1" line (or is sha1-content
without a "nohash sha1" line): check that the signature matches the
sha1-content with PGP signature stripped out.

Otherwise: check that the signature matches the sha1-content with
nohash and object fields and PGP signature stripped out.

Mergetag signatures
~~~~~~~~~~~~~~~~~~~
The mergetag field in the sha1-content of a commit contains the
sha1-content of a tag that was merged by that commit.

The mergetag field in the sha256-content of the same commit contains
the sha256-content of the same tag.

Submodules
~~~~~~~~~~
To convert recorded submodule pointers, you need to have the converted
submodule repository in place. The bidirectional mapping of the
submodule can be used to look up the new hash.

Caveats
-------
Shallow clone and submodules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because this requires all referenced objects to be available in the
locally generated translation table, this design does not support
shallow clone or unfetched submodules.

Protocol improvements might allow lifting this restriction.

Alternatives considered
-----------------------
Upgrading everyone working on a particular project on a flag day
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Projects like the Linux kernel are large and complex enough that
flipping the switch for all projects based on the repository at once
is infeasible.

Not only would all developers and server operators supporting
developers have to switch on the same flag day, but supporting tooling
(continuous integration, code review, bug trackers, etc) would have to
be adapted as well. This also makes it difficult to get early feedback
from some project participants testing before it is time for mass
adoption.

Using hash functions in parallel 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ )
Objects newly created would be addressed by the new hash, but inside
such an object (e.g. commit) it is still possible to address objects
using the old hash function.

* You cannot trust its history (needed for bisectability) in the
  future without further work 
* Maintenance burden as the number of supported hash functions grows
  (they will never go away, so they accumulate). In this proposal, by
  comparison, converted objects lose all references to SHA-1 except
  where needed to verify signatures.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: RFC: Another proposed hash function transition plan
  2017-03-04  1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
@ 2017-03-05  2:35 ` Linus Torvalds
  2017-03-06  0:26   ` brian m. carlson
  2017-03-07  0:17   ` RFC v3: " Jonathan Nieder
  2017-03-05 11:02 ` RFC: " David Lang
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 113+ messages in thread
From: Linus Torvalds @ 2017-03-05  2:35 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Git Mailing List, Stefan Beller, bmwill, jonathantanmy, Jeff King

On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> This document is still in flux but I thought it best to send it out
> early to start getting feedback.

This actually looks very reasonable if you can implement it cleanly
enough. In many ways the "convert entirely to a new 256-bit hash" is
the cleanest model, and interoperability was at least my personal
concern. Maybe your model solves it (devil in the details), in which
case I really like it.

I do think that if you end up essentially converting the objects
without really having any true backwards compatibility at the object
layer (just the translation code), you should seriously look at doing
some other changes at the same time. Like not using zlib compression,
it really is very slow.

Btw, I do think the particular choice of hash should still be on the
table. sha-256 may be the obvious first choice, but there are
definitely a few reasons to consider alternatives, especially if it's
a complete switch-over like this.

One is large-file behavior - a parallel (or tree) mode could improve
on that noticeably. BLAKE2 does have special support for that, for
example. And SHA-256 does have known attacks compared to SHA-3-256 or
BLAKE2 - whether that is due to age or due to more effort, I can't
really judge. But if we're switching away from SHA1 due to known
attacks, it does feel like we should be careful.

                Linus

* Re: RFC: Another proposed hash function transition plan
  2017-03-04  1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
  2017-03-05  2:35 ` Linus Torvalds
@ 2017-03-05 11:02 ` David Lang
       [not found]   ` <CA+dhYEXHbQfJ6KUB1tWS9u1MLEOJL81fTYkbxu4XO-i+379LPw@mail.gmail.com>
  2017-03-06 23:40   ` Jonathan Nieder
  2017-03-06  8:43 ` Jeff King
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 113+ messages in thread
From: David Lang @ 2017-03-05 11:02 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, sbeller, bmwill, jonathantanmy, peff, Linus Torvalds

> Translation table
> ~~~~~~~~~~~~~~~~~
> A fast bidirectional mapping between sha1-names and sha256-names of
> all local objects in the repository is kept on disk. The exact format
> of that mapping is to be determined.
>
> All operations that make new objects (e.g., "git commit") add the new
> objects to the translation table.

This seems like a rather nontrivial thing to design. It will need to hold
millions of mappings, and be quickly searchable from either direction (sha1->new
and new->sha1) while still being fairly fast to insert new records into.

For Linux, just the list of hashes recording the commits is going to be in the
millions, while the list of hashes of individual files for all those commits is
going to be substantially larger.

David Lang

* Re: RFC: Another proposed hash function transition plan
  2017-03-05  2:35 ` Linus Torvalds
@ 2017-03-06  0:26   ` brian m. carlson
  2017-03-06 18:24     ` Brandon Williams
  2017-03-07  0:17   ` RFC v3: " Jonathan Nieder
  1 sibling, 1 reply; 113+ messages in thread
From: brian m. carlson @ 2017-03-06  0:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jonathan Nieder, Git Mailing List, Stefan Beller, bmwill,
	jonathantanmy, Jeff King

On Sat, Mar 04, 2017 at 06:35:38PM -0800, Linus Torvalds wrote:
> On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> >
> > This document is still in flux but I thought it best to send it out
> > early to start getting feedback.
> 
> This actually looks very reasonable if you can implement it cleanly
> enough. In many ways the "convert entirely to a new 256-bit hash" is
> the cleanest model, and interoperability was at least my personal
> concern. Maybe your model solves it (devil in the details), in which
> case I really like it.

If you think you can do it, I'm all for it.

> Btw, I do think the particular choice of hash should still be on the
> table. sha-256 may be the obvious first choice, but there are
> definitely a few reasons to consider alternatives, especially if it's
> a complete switch-over like this.
> 
> One is large-file behavior - a parallel (or tree) mode could improve
> on that noticeably. BLAKE2 does have special support for that, for
> example. And SHA-256 does have known attacks compared to SHA-3-256 or
> BLAKE2 - whether that is due to age or due to more effort, I can't
> really judge. But if we're switching away from SHA1 due to known
> attacks, it does feel like we should be careful.

I agree with Linus on this.  SHA-256 is the slowest option, and it's the
one with the most advanced cryptanalysis.  SHA-3-256 is faster on 64-bit
machines (which, as we've seen on the list, is the overwhelming majority
of machines using Git), and even BLAKE2b-256 is stronger.

Doing this all over again in another couple years should also be a
non-goal.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204


* Re: RFC: Another proposed hash function transition plan
  2017-03-04  1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
  2017-03-05  2:35 ` Linus Torvalds
  2017-03-05 11:02 ` RFC: " David Lang
@ 2017-03-06  8:43 ` Jeff King
  2017-03-06 18:39   ` Jonathan Tan
  2017-03-06 18:43   ` Junio C Hamano
  2017-03-07 18:57 ` Ian Jackson
  2017-03-13  9:24 ` RFC: Another proposed hash function transition plan The Keccak Team
  4 siblings, 2 replies; 113+ messages in thread
From: Jeff King @ 2017-03-06  8:43 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, sbeller, bmwill, jonathantanmy, Linus Torvalds

On Fri, Mar 03, 2017 at 05:12:51PM -0800, Jonathan Nieder wrote:

> This past week we came up with this idea for what a transition to a new
> hash function for Git would look like.  I'd be interested in your
> thoughts (especially if you can make them as comments on the document,
> which makes it easier to address them and update the document).

Overall it's an interesting idea. I thought at first that you were
suggesting servers do on-the-fly conversion, but after a more careful
reading that isn't the case. And I don't think that would work, because
the conversion is expensive.

So this pushes the conversion cost onto the clients who decide to move
to SHA-256. That may be a problem for sites which have a lot of clients
(like CI hosts). But I guess they would just stick with SHA-1 as long as
possible, until the upstream repo switches (and that _is_ a per-repo
flag day, because the upstream host isn't going to convert back to SHA-1
on the fly to serve the old clients).

> You can use the doc URL
> 
>  https://goo.gl/gh2Mzc

I'd encourage anybody following along to follow that link. I almost
didn't, but there are a ton of comments there (I'm not sure how I feel
about splitting the discussion off the list, though).

> Goals
> -----
> 1. The transition to SHA256 can be done one local repository at a time.
>    a. Requiring no action by any other party.
>    b. A SHA256 repository can communicate with SHA-1 Git servers and
>       clients (push/fetch).
>    c. Users can use SHA-1 and SHA256 identifiers for objects
>       interchangeably.
>    d. New signed objects make use of a stronger hash function than
>       SHA-1 for their security guarantees.
> 2. Allow a complete transition away from SHA-1.
>    a. Local metadata for SHA-1 compatibility can be dropped in a
>       repository if compatibility with SHA-1 is no longer needed.

I suspect we'll never get away from keeping the mapping table. You'll
need at least the sha1->sha256 table if you want to look up names found
in historic commit messages, mailing list posts, etc.

And you'll need the sha256->sha1 table if you want to verify the gpg
signatures on old tags and commits. That might be something people are
willing to drop, though.

> After negotiation, the server sends a packfile containing the
> requested objects. We convert the packfile to SHA-256 format using the
> following steps:
> 
> 1. index-pack: inflate each object in the packfile and compute its
>    SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
>    objects the client has locally. These objects can be looked up using
>    the translation table and their sha1-content read as described above
>    to resolve the deltas.
> 2. topological sort: starting at the "want"s from the negotiation
>    phase, walk through objects in the pack and emit a list of them in
>    topologically sorted order. (This list only contains objects
>    reachable from the "wants". If the pack from the server contained
>    additional extraneous objects, then they will be discarded.)

I don't think we do this right now, but you can actually find the entry
(and exit) points of a pack during the index-pack step. Basically:

  1. Keep a hashmap of objects mentioned in the pack.

  2. When we process an object's content (i.e., compute its hash), also
     parse it for any object references. Add entries in the hashmap for
     any object mentioned this way. Mark the entry for the object we
     processed with a "HAVE" bit, and mark any referenced object with a
     "REF" bit.

  3. After processing all objects, anything with a "HAVE" but no "REF"
     is an entry point to the pack (i.e., something that we should have
     asked for with a want). Anything with a "REF" but not a "HAVE" is
     an exit point (i.e., an object that we are expected to already have
     in our repo).

     (I've thought about this before because we could possibly shortcut
     the connectivity check using the exit points. It's complicated by
     the fact that we don't assume the transitive presence of objects
     unless they are reachable).
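
The HAVE/REF bookkeeping above can be sketched roughly as follows. This is only an illustration in Python, not index-pack code: the function name, the flag values, and the dictionary-based "pack" are all hypothetical, and the pack is assumed to be already parsed into a map from each object name to the names it references.

```python
HAVE = 1  # the object's own content was seen in the pack
REF = 2   # the object was referenced by some object in the pack


def pack_entry_exit_points(pack_objects):
    """pack_objects: hypothetical map of object name -> list of referenced names."""
    flags = {}
    for name, refs in pack_objects.items():
        # Step 2: mark the processed object, then everything it mentions.
        flags[name] = flags.get(name, 0) | HAVE
        for ref in refs:
            flags[ref] = flags.get(ref, 0) | REF
    # Step 3: HAVE without REF is an entry point; REF without HAVE is an exit point.
    entries = sorted(n for n, f in flags.items() if f == HAVE)
    exits = sorted(n for n, f in flags.items() if f == REF)
    return entries, exits


# Toy pack: one new commit whose parent ("commit1") is not in the pack.
pack = {
    "commit2": ["tree2", "commit1"],
    "tree2": ["blob2"],
    "blob2": [],
}
entry_points, exit_points = pack_entry_exit_points(pack)
```

Here "commit2" comes out as the only entry point and "commit1" as the only exit point.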

I don't think using the "want"s as the entry points is unreasonable,
though. The server _shouldn't_ generally be sending us other cruft.

I do wonder if you might be able to omit the extra object-graph walk
from your step 2, if you could assign "depths" to each object during
step 1 instead of HAVE/REF bits. The trouble, of course, is that you're
not visiting the nodes in the right order (so given two trees, you're
not sure if one might eventually be a child of the other; how do you
assign their depths?). I have a feeling there's a proof that it's
impossible, but I might just not be clever enough.

Overall the basics of the conversion seem sound to me. The "nohash"
thing seems more complicated than I think it ought to be, which
probably just means I'm missing something.  I left a few related
comments on the google doc, so I won't repeat them here.

-Peff


* Re: RFC: Another proposed hash function transition plan
       [not found]   ` <CA+dhYEXHbQfJ6KUB1tWS9u1MLEOJL81fTYkbxu4XO-i+379LPw@mail.gmail.com>
@ 2017-03-06  9:43     ` Jeff King
  0 siblings, 0 replies; 113+ messages in thread
From: Jeff King @ 2017-03-06  9:43 UTC (permalink / raw)
  To: ankostis
  Cc: David Lang, Jonathan Nieder, Git Mailing List, Stefan Beller,
	bmwill, jonathantanmy, Linus Torvalds

On Mon, Mar 06, 2017 at 10:29:33AM +0100, ankostis wrote:

> On 5 March 2017 at 12:02, David Lang <david@lang.hm> wrote:
> >> Translation table
> >> ~~~~~~~~~~~~~~~~~
> >> A fast bidirectional mapping between sha1-names and sha256-names of
> >> all local objects in the repository is kept on disk. The exact format
> >> of that mapping is to be determined.
> >>
> >> All operations that make new objects (e.g., "git commit") add the new
> >> objects to the translation table.
> >
> >
> > This seems like a rather nontrivial thing to design. It will need to hold
> > millions of mappings, and be quickly searchable from either direction
> > (sha1->new and new->sha1) while still being fairly fast to insert new records
> > into.
> >
> > For Linux, just the list of hashes recording the commits is going to be in
> > the millions, while the list of hashes of individual files for all those
> > commits is going to be substantially larger.
> 
> Apologies if it is a stupid idea, but could we avoid the mappings-table
> just by
> hard-linking to the same object from both (or more) hashes?
> So instead of creating a text-db format, just use the filesystem.

No, for a few reasons:

  1. Most of these objects will not be in the filesystem at all, but
     rather in a packfile.

  2. It's not just a different hash over the same bytes. The sha256-name
     is taken over the sha256-content (which refers to other objects
     using sha256). So they really are different objects. You probably
     wouldn't keep the sha1 version around separately, but rather
     generate it on the fly during a push to a sha1 server.

  3. You really need to be able to take a sha256 name and convert it to
     a sha1 and vice versa. Hardlinks don't help with that, because they
     only point in one direction. That gets you to the same _content_,
     but not the other name (and I guess this is where your "look up the
     name and then compute the other digest" comes in, but that's
     probably too expensive to be workable).

I do think updating the mapping could potentially be deferred until
interacting with a sha1 server. But because it needs to be generated in
reverse-topological order, it's conceptually easier to do it one object
at a time.

-Peff


* Re: RFC: Another proposed hash function transition plan
  2017-03-06  0:26   ` brian m. carlson
@ 2017-03-06 18:24     ` Brandon Williams
  2017-06-15 10:30       ` Which hash function to use, was " Johannes Schindelin
  0 siblings, 1 reply; 113+ messages in thread
From: Brandon Williams @ 2017-03-06 18:24 UTC (permalink / raw)
  To: brian m. carlson, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King

On 03/06, brian m. carlson wrote:
> On Sat, Mar 04, 2017 at 06:35:38PM -0800, Linus Torvalds wrote:
> > On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> > >
> > > This document is still in flux but I thought it best to send it out
> > > early to start getting feedback.
> > 
> > This actually looks very reasonable if you can implement it cleanly
> > enough. In many ways the "convert entirely to a new 256-bit hash" is
> > the cleanest model, and interoperability was at least my personal
> > concern. Maybe your model solves it (devil in the details), in which
> > case I really like it.
> 
> If you think you can do it, I'm all for it.
> 
> > Btw, I do think the particular choice of hash should still be on the
> > table. sha-256 may be the obvious first choice, but there are
> > definitely a few reasons to consider alternatives, especially if it's
> > a complete switch-over like this.
> > 
> > One is large-file behavior - a parallel (or tree) mode could improve
> > on that noticeably. BLAKE2 does have special support for that, for
> > example. And SHA-256 does have known attacks compared to SHA-3-256 or
> > BLAKE2 - whether that is due to age or due to more effort, I can't
> > really judge. But if we're switching away from SHA1 due to known
> > attacks, it does feel like we should be careful.
> 
> I agree with Linus on this.  SHA-256 is the slowest option, and it's the
> one with the most advanced cryptanalysis.  SHA-3-256 is faster on 64-bit
> machines (which, as we've seen on the list, is the overwhelming majority
> of machines using Git), and even BLAKE2b-256 is stronger.
> 
> Doing this all over again in another couple years should also be a
> non-goal.

I agree that when we decide to move to a new algorithm we should
select one which we plan on using for as long as possible (much longer
than a couple years).  While writing the document we simply used
"sha256" because it was more tangible and easier to reference.

> -- 
> brian m. carlson / brian with sandals: Houston, Texas, US
> +1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
> OpenPGP: https://keybase.io/bk2204



-- 
Brandon Williams


* Re: RFC: Another proposed hash function transition plan
  2017-03-06  8:43 ` Jeff King
@ 2017-03-06 18:39   ` Jonathan Tan
  2017-03-06 19:22     ` Linus Torvalds
  2017-03-07  8:59     ` Jeff King
  2017-03-06 18:43   ` Junio C Hamano
  1 sibling, 2 replies; 113+ messages in thread
From: Jonathan Tan @ 2017-03-06 18:39 UTC (permalink / raw)
  To: Jeff King, Jonathan Nieder; +Cc: git, sbeller, bmwill, Linus Torvalds

On 03/06/2017 12:43 AM, Jeff King wrote:
> Overall the basics of the conversion seem sound to me. The "nohash"
> thing seems more complicated than I think it ought to be, which
> probably just means I'm missing something.  I left a few related
> comments on the google doc, so I won't repeat them here.

I think "nohash" can be explained in 2 points:
  1. When creating signed objects, "nohash" is almost never written. Just
     create the object as usual and add "hash" lines for every other hash
     function that you want the signature to cover.
  2. When converting from function A to function B, add "nohash B" if
     there were no "hash B" lines in the original object.

The "nohash" thing was in the hope of requiring only one signature to 
sign all the hashes (in all the functions) that the user wants, while 
preserving round-tripping ability.

Maybe some examples would help to address the apparent complexity. These 
examples are the same as those in the document. I'll also show future 
compatibility with a hypothetical NEW hash function, and extend the rule 
about signing/verification to 'sign in the earliest supported hash 
function in ({object's hash function} + {functions in "hash" lines} - 
{function in "nohash" line})'.

Example 1 (existing signed commit)
<sha-1 object stuff>  <sha256 object stuff>  <NEW object stuff>
                       nohash sha256          nohash new
                       hash sha1 ...          hash sha1 ...

This object was probably created in a SHA-1 repository with no knowledge 
that we were going to transition to SHA256 (but there is nothing 
preventing us from creating the middle or right object and then 
translating it to the other functions).

Example 2 (recommended way to sign a commit in a SHA256 repo)
<sha-1 object stuff>  <sha256 object stuff>  <NEW object stuff>
hash sha256 ...       hash sha1 ...          nohash new
                                              hash sha1 ...
                                              hash sha256 ...

This is the recommended way to create a SHA256 object in a SHA256 repo. 
The rule about signing/verification (as stated above) is to sign in 
SHA-1, so when signing or verifying, we convert the object to SHA-1 and 
use that as the payload. Note that the signature covers both the SHA-1 
and SHA256 hashes, and that existing Git implementations can verify the 
signature.

Example 3 (a signer that does not care about SHA-1 anymore)
<sha-1 object stuff>  <sha256 object stuff>  <NEW object stuff>
nohash sha1                                  nohash new
hash sha256 ...                              hash sha256 ...

If we were to create a SHA256 object without any mentions of SHA-1, the 
rule about signing/verification (as stated above) states that the 
signature payload is the SHA256 object. This means that existing Git 
implementations cannot verify the signature, but we can still round-trip 
to SHA-1 and back without losing any information (as far as I can tell).
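
The signing/verification rule applied in the three examples can be sketched like this. It is a hedged illustration only: the function name, the `HASH_ORDER` list (oldest function first), and the `supported` parameter are assumptions made for the sketch and are not part of the proposal.

```python
# Hypothetical age ordering of hash functions, oldest first.
HASH_ORDER = ["sha1", "sha256", "new"]


def signing_function(object_function, hash_lines, nohash_line=None,
                     supported=HASH_ORDER):
    """Pick the earliest supported function in
    {object's function} + {"hash" lines} - {"nohash" line}."""
    candidates = {object_function} | set(hash_lines)
    candidates.discard(nohash_line)
    for fn in HASH_ORDER:
        if fn in candidates and fn in supported:
            return fn
    raise ValueError("no supported hash function covers this object")


# Example 1, sha256 column: "nohash sha256" plus "hash sha1".
example1 = signing_function("sha256", ["sha1"], nohash_line="sha256")
# Example 2, sha256 column: a single "hash sha1" line.
example2 = signing_function("sha256", ["sha1"])
# Example 3, sha1 column: "nohash sha1" plus "hash sha256".
example3 = signing_function("sha1", ["sha256"], nohash_line="sha1")
```

Under these assumptions, examples 1 and 2 select SHA-1 as the signature payload and example 3 selects SHA256, matching the descriptions above.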


* Re: RFC: Another proposed hash function transition plan
  2017-03-06  8:43 ` Jeff King
  2017-03-06 18:39   ` Jonathan Tan
@ 2017-03-06 18:43   ` Junio C Hamano
  1 sibling, 0 replies; 113+ messages in thread
From: Junio C Hamano @ 2017-03-06 18:43 UTC (permalink / raw)
  To: Jeff King
  Cc: Jonathan Nieder, git, sbeller, bmwill, jonathantanmy,
	Linus Torvalds

Jeff King <peff@peff.net> writes:

>> You can use the doc URL
>> 
>>  https://goo.gl/gh2Mzc
>
> I'd encourage anybody following along to follow that link. I almost
> didn't, but there are a ton of comments there (I'm not sure how I feel
> about splitting the discussion off the list, though).

I am sure how I feel about it---we should really discourage it,
unless it is an effort to help polishing an early draft for wider
distribution and discussion.

> I don't think we do this right now, but you can actually find the entry
> (and exit) points of a pack during the index-pack step. Basically:

We have code to do the "entry point" computation in index-pack
already, I think, in 81a04b01 ("index-pack: --clone-bundle option",
2016-03-03).

> I don't think using the "want"s as the entry points is unreasonable,
> though. The server _shouldn't_ generally be sending us other cruft.

That's true.


* Re: RFC: Another proposed hash function transition plan
  2017-03-06 18:39   ` Jonathan Tan
@ 2017-03-06 19:22     ` Linus Torvalds
  2017-03-06 19:59       ` Brandon Williams
  2017-03-06 21:53       ` Junio C Hamano
  2017-03-07  8:59     ` Jeff King
  1 sibling, 2 replies; 113+ messages in thread
From: Linus Torvalds @ 2017-03-06 19:22 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: Jeff King, Jonathan Nieder, Git Mailing List, Stefan Beller,
	bmwill

On Mon, Mar 6, 2017 at 10:39 AM, Jonathan Tan <jonathantanmy@google.com> wrote:
>
> I think "nohash" can be explained in 2 points:

I do think that that was my least favorite part of the suggestion. Not
just "nohash", but all the special "hash" lines too.

I would honestly hope that the design should not be about "other
hashes". If you plan your expectations around the new hash being
broken, something is wrong to begin with.

I do wonder if things wouldn't be simpler if the new format just
included the SHA1 object name in the new object. Put it in the
"header" line of the object, so that every time you look up an object,
you just _see_ the SHA1 of that object. You can even think of it as an
additional protection.

Btw, the multi-collision attack referenced earlier does _not_ work for
an iterated hash that has a bigger internal state than the final hash.
Which is actually a real argument against sha-256: the internal state
of sha-256 is 256 bits, so if an attack can find collisions due to
some weakness, you really can then generate exponential collisions by
chaining a linear collision search together.

But for sha3-256 or blake2, the internal hash state is larger than the
final hash, so now you need to generate collisions not in the 256
bits, but in the much larger search space of the internal hash space
if you want to generate those exponential collisions.

So *if* the new object format uses a git header line like

    "blob <size> <sha1>\0"

then it would inherently contain that mapping from 256-bit hash to the
SHA1, but it would actually also protect against attacks on the new
hash. In fact, in particular for objects with internal format that
differs between the two hashing models (ie trees and commits which to
some degree are higher-value targets), it would make attacks really
quite complicated, I suspect.

And you wouldn't need those "hash" or "nohash" things at all. The old
SHA1 would simply always be there, and cheap to look up (ie you
wouldn't have to unpack the whole object).
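
As a rough sketch of that header idea for the blob case only (where the content is identical under both hashes): the Python below is illustrative, the function names are hypothetical, writing the SHA1 in hex inside the header is an assumption, and SHA3-256 is used merely as a stand-in for "the new 256-bit hash".

```python
import hashlib


def sha1_blob_name(content):
    # Traditional Git blob name: SHA-1 over "blob <size>\0" + content.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def new_blob_name(content):
    # Hypothetical new-format name: the header also carries the SHA1 name,
    # so the new hash covers the old name as well as the content.
    old = sha1_blob_name(content)
    header = b"blob %d %s\0" % (len(content), old.encode("ascii"))
    return hashlib.sha3_256(header + content).hexdigest(), old


new_name, old_name = new_blob_name(b"")
```

For trees and commits the embedded SHA1 would have to be computed over the SHA1 form of the content, which this sketch does not attempt.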

Hmm?

                   Linus


* Re: RFC: Another proposed hash function transition plan
  2017-03-06 19:22     ` Linus Torvalds
@ 2017-03-06 19:59       ` Brandon Williams
  2017-03-06 21:53       ` Junio C Hamano
  1 sibling, 0 replies; 113+ messages in thread
From: Brandon Williams @ 2017-03-06 19:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jonathan Tan, Jeff King, Jonathan Nieder, Git Mailing List,
	Stefan Beller

On 03/06, Linus Torvalds wrote:
> On Mon, Mar 6, 2017 at 10:39 AM, Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > I think "nohash" can be explained in 2 points:
> 
> I do think that that was my least favorite part of the suggestion. Not
> just "nohash", but all the special "hash" lines too.
> 
> I would honestly hope that the design should not be about "other
> hashes". If you plan your expectations around the new hash being
> broken, something is wrong to begin with.
> 
> I do wonder if things wouldn't be simpler if the new format just
> included the SHA1 object name in the new object. Put it in the
> "header" line of the object, so that every time you look up an object,
> you just _see_ the SHA1 of that object. You can even think of it as an
> additional protection.
> 
> Btw, the multi-collision attack referenced earlier does _not_ work for
> an iterated hash that has a bigger internal state than the final hash.
> Which is actually a real argument against sha-256: the internal state
> of sha-256 is 256 bits, so if an attack can find collisions due to
> some weakness, you really can then generate exponential collisions by
> chaining a linear collision search together.
> 
> But for sha3-256 or blake2, the internal hash state is larger than the
> final hash, so now you need to generate collisions not in the 256
> bits, but in the much larger search space of the internal hash space
> if you want to generate those exponential collisions.
> 
> So *if* the new object format uses a git header line like
> 
>     "blob <size> <sha1>\0"
> 
> then it would inherently contain that mapping from 256-bit hash to the
> SHA1, but it would actually also protect against attacks on the new
> hash. In fact, in particular for objects with internal format that
> differs between the two hashing models (ie trees and commits which to
> some degree are higher-value targets), it would make attacks really
> quite complicated, I suspect.
> 
> And you wouldn't need those "hash" or "nohash" things at all. The old
> SHA1 would simply always be there, and cheap to look up (ie you
> wouldn't have to unpack the whole object).
> 
> Hmm?

I'll agree that the "hash" "nohash" bit isn't my favorite and is really
only there to address the signing of tags/commits in this new non-sha1
world.  I'm inclined to take a closer look at Jeff's suggestion which
simply has a signature for the hash that the signer cares about.

I don't know if keeping around the SHA1 for every object buys you all
that much.  It would add an additional layer of protection but you would
also need to compute the SHA1 for each object indefinitely (assuming you
include the SHA1 in new objects and not just converted objects).  The
hope would be that at some point you could not worry about SHA1 at all.
That may be difficult for projects with long history with commit msgs
which reference SHA1's of other commits (if you wanted to look up the
referenced commit, for example), but projects started in the new
non-sha1 world shouldn't have to ever compute a sha1.

Also, during this transition phase you would still need to maintain the
sha1<->sha256 translation table to make looking up objects by their sha1
name in a sha256 repo fast.  Otherwise I think it would take a
non-trivial amount of time to search a sha256 repo for a sha1 name.  So
if you do include the sha1 in the new object format then you would end
up with some duplicate information, which isn't the end of the world.

-- 
Brandon Williams


* Re: RFC: Another proposed hash function transition plan
  2017-03-06 19:22     ` Linus Torvalds
  2017-03-06 19:59       ` Brandon Williams
@ 2017-03-06 21:53       ` Junio C Hamano
  1 sibling, 0 replies; 113+ messages in thread
From: Junio C Hamano @ 2017-03-06 21:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jonathan Tan, Jeff King, Jonathan Nieder, Git Mailing List,
	Stefan Beller, bmwill

Linus Torvalds <torvalds@linux-foundation.org> writes:

> So *if* the new object format uses a git header line like
>
>     "blob <size> <sha1>\0"
>
> then it would inherently contain that mapping from 256-bit hash to the
> SHA1, but it would actually also protect against attacks on the new
> hash.

This is easy for blobs as you only need to hash twice.  I am not
sure if you can do the same for trees, though.  For that <sha1> to
be useful, the hash needs to be over the tree contents whose
references are expressed in <sha1>, which in turn would mean...

... ah, you would read these <sha1> off of the object header in the
new world and you do not need to expand the whole thing.  OK, I see
how it could work.

> In fact, in particular for objects with internal format that
> differs between the two hashing models (ie trees and commits which to
> some degree are higher-value targets), it would make attacks really
> quite complicated, I suspect.
>
> And you wouldn't need those "hash" or "nohash" things at all. The old
> SHA1 would simply always be there, and cheap to look up (ie you
> wouldn't have to unpack the whole object).


* Re: RFC: Another proposed hash function transition plan
  2017-03-05 11:02 ` RFC: " David Lang
       [not found]   ` <CA+dhYEXHbQfJ6KUB1tWS9u1MLEOJL81fTYkbxu4XO-i+379LPw@mail.gmail.com>
@ 2017-03-06 23:40   ` Jonathan Nieder
  2017-03-07  0:03     ` Mike Hommey
  1 sibling, 1 reply; 113+ messages in thread
From: Jonathan Nieder @ 2017-03-06 23:40 UTC (permalink / raw)
  To: David Lang; +Cc: git, sbeller, bmwill, jonathantanmy, peff, Linus Torvalds

David Lang wrote:

>> Translation table
>> ~~~~~~~~~~~~~~~~~
>> A fast bidirectional mapping between sha1-names and sha256-names of
>> all local objects in the repository is kept on disk. The exact format
>> of that mapping is to be determined.
>>
>> All operations that make new objects (e.g., "git commit") add the new
>> objects to the translation table.
>
> This seems like a rather nontrivial thing to design. It will need to
> hold millions of mappings, and be quickly searchable from either
> direction (sha1->new and new->sha1) while still being fairly fast to
> insert new records into.

I am currently thinking of using LevelDB, since it has the advantages of
being simple, already existing, and having already been ported to Java
(allowing JGit to read and write the same format).

If that doesn't work, we'd try some other key-value store like Samba's
tdb or Kyoto Cabinet.

Jonathan


* Re: RFC: Another proposed hash function transition plan
  2017-03-06 23:40   ` Jonathan Nieder
@ 2017-03-07  0:03     ` Mike Hommey
  0 siblings, 0 replies; 113+ messages in thread
From: Mike Hommey @ 2017-03-07  0:03 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: David Lang, git, sbeller, bmwill, jonathantanmy, peff,
	Linus Torvalds

On Mon, Mar 06, 2017 at 03:40:30PM -0800, Jonathan Nieder wrote:
> David Lang wrote:
> 
> >> Translation table
> >> ~~~~~~~~~~~~~~~~~
> >> A fast bidirectional mapping between sha1-names and sha256-names of
> >> all local objects in the repository is kept on disk. The exact format
> >> of that mapping is to be determined.
> >>
> >> All operations that make new objects (e.g., "git commit") add the new
> >> objects to the translation table.
> >
> > This seems like a rather nontrivial thing to design. It will need to
> > hold millions of mappings, and be quickly searchable from either
> > direction (sha1->new and new->sha1) while still being fairly fast to
> > insert new records into.
> 
> I am currently thinking of using LevelDB, since it has the advantages of
> being simple, already existing, and having already been ported to Java
> (allowing JGit to read and write the same format).
> 
> If that doesn't work, we'd try some other key-value store like Samba's
> tdb or Kyoto Cabinet.

FWIW, I'm using notes-like data to store mercurial->git mappings in
git-cinnabar, (ab)using the commit type in tree items. It's fast enough.

Mike


* RFC v3: Another proposed hash function transition plan
  2017-03-05  2:35 ` Linus Torvalds
  2017-03-06  0:26   ` brian m. carlson
@ 2017-03-07  0:17   ` Jonathan Nieder
  2017-03-09 19:14     ` Shawn Pearce
  2017-09-06  6:28     ` RFC v3: Another proposed hash function transition plan Junio C Hamano
  1 sibling, 2 replies; 113+ messages in thread
From: Jonathan Nieder @ 2017-03-07  0:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Git Mailing List, Stefan Beller, bmwill, jonathantanmy, Jeff King,
	David Lang, brian m. carlson

Linus Torvalds wrote:
> On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:

>> This document is still in flux but I thought it best to send it out
>> early to start getting feedback.
>
> This actually looks very reasonable if you can implement it cleanly
> enough.

Thanks for the kind words on what had quite a few flaws still.  Here's
a new draft.  I think the next version will be a patch against
Documentation/technical/.

As before, comments welcome, both here and inline at

  https://goo.gl/gh2Mzc

Changes since v2:

Use SHA3-256 instead of SHA2 (thanks, Linus and brian m.
carlson).[1][2]

Make sha3-based signatures a separate field, avoiding the need for
"hash" and "nohash" fields (thanks to peff[3]).

Add a sorting phase to fetch (thanks to Junio for noticing the need
for this).

Omit blobs from the topological sort during fetch (thanks to peff).

Discuss alternates, git notes, and git servers in the caveats section
(thanks to Junio Hamano, brian m. carlson[4], and Shawn Pearce).

Clarify language throughout (thanks to various commenters, especially
Junio).

Sincerely,
Jonathan

Git hash function transition
============================
Status: Draft
Last Updated: 2017-03-06

Objective
---------
Migrate Git from SHA-1 to a stronger hash function.

Background
----------
At its core, the Git version control system is a content addressable
filesystem. It uses the SHA-1 hash function to name content. For
example, files, directories, and revisions are referred to by hash
values unlike in other traditional version control systems where files
or versions are referred to via sequential numbers. The use of a hash
function to address its content delivers a few advantages:

* Integrity checking is easy. Bit flips, for example, are easily
  detected, as the hash of corrupted content does not match its name.
* Lookup of objects is fast.

Using a cryptographically secure hash function brings additional
advantages:

* Object names can be signed and third parties can trust the hash to
  address the signed object and all objects it references.
* Communication using Git protocol and out of band communication
  methods have a short reliable string that can be used to reliably
  address stored content.

Over time some flaws in SHA-1 have been discovered by security
researchers. https://shattered.io demonstrated a practical SHA-1 hash
collision. As a result, SHA-1 cannot be considered cryptographically
secure any more. This impacts the communication of hash values because
we cannot trust that a given hash value represents the known good
version of content that the speaker intended.

SHA-1 still possesses the other properties such as fast object lookup
and safe error checking, but other hash functions that are believed to
be cryptographically secure are equally suitable.

Goals
-----
1. The transition to SHA3-256 can be done one local repository at a time.
   a. Requiring no action by any other party.
   b. A SHA3-256 repository can communicate with SHA-1 Git servers
      (push/fetch).
   c. Users can use SHA-1 and SHA3-256 identifiers for objects
      interchangeably.
   d. New signed objects make use of a stronger hash function than
      SHA-1 for their security guarantees.
2. Allow a complete transition away from SHA-1.
   a. Local metadata for SHA-1 compatibility can be removed from a
      repository if compatibility with SHA-1 is no longer needed.
3. Maintainability throughout the process.
   a. The object format is kept simple and consistent.
   b. Creation of a generalized repository conversion tool.

Non-Goals
---------
1. Add SHA3-256 support to Git protocol. This is valuable and the
   logical next step but it is out of scope for this initial design.
2. Transparently improving the security of existing SHA-1 signed
   objects.
3. Intermixing objects using multiple hash functions in a single
   repository.
4. Taking the opportunity to fix other bugs in git's formats and
   protocols.
5. Shallow clones and fetches into a SHA3-256 repository. (This will
   change when we add SHA3-256 support to Git protocol.)
6. Skip fetching some submodules of a project into a SHA3-256
   repository. (This also depends on SHA3-256 support in Git
   protocol.)

Overview
--------
We introduce a new repository format extension `sha3`. Repositories
with this extension enabled use SHA3-256 instead of SHA-1 to name
their objects. This affects both object names and object content ---
both the names of objects and all references to other objects within
an object are switched to the new hash function.

sha3 repositories cannot be read by older versions of Git.

Alongside the packfile, a sha3 repository stores a bidirectional
mapping between sha3 and sha1 object names. The mapping is generated
locally and can be verified using "git fsck". Object lookups use this
mapping to allow naming objects using either their sha1 or sha3 names
interchangeably.

"git cat-file" and "git hash-object" gain options to display an object
in its sha1 form and write an object given its sha1 form. This
requires all objects referenced by that object to be present in the
object database so that they can be named using the appropriate name
(using the bidirectional hash mapping).

Fetches from a SHA-1 based server convert the fetched objects into
sha3 form and record the mapping in the bidirectional mapping table
(see below for details). Pushes to a SHA-1 based server convert the
objects being pushed into sha1 form so the server does not have to be
aware of the hash function the client is using.

Detailed Design
---------------
Object names
~~~~~~~~~~~~
Objects can be named by their 40 hexadecimal digit sha1-name or 64
hexadecimal digit sha3-name, plus names derived from those (see
gitrevisions(7)).

The sha1-name of an object is the SHA-1 of the concatenation of its
type, length, a nul byte, and the object's sha1-content. This is the
traditional <sha1> used in Git to name objects.

The sha3-name of an object is the SHA3-256 of the concatenation of its
type, length, a nul byte, and the object's sha3-content.
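
A minimal sketch of both name computations, assuming the header layout Git uses today ("<type> <length>" followed by a nul byte) is kept unchanged for sha3-names. The helper name `object_name` is hypothetical.

```python
import hashlib


def object_name(hash_name, obj_type, content):
    """Hash "<type> <length>\\0" followed by the object's content."""
    header = obj_type + b" " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.new(hash_name, header + content).hexdigest()


# A blob has the same sha1-content and sha3-content, so one byte string
# can be fed to both functions.
empty_blob_sha1 = object_name("sha1", b"blob", b"")
empty_blob_sha3 = object_name("sha3_256", b"blob", b"")
```

The sha1-name comes out as the familiar 40 hex digits and the sha3-name as 64 hex digits.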

Object format
~~~~~~~~~~~~~
The content as a byte sequence of a tag, commit, or tree object named
by sha1 and by sha3 differs because an object named by sha3-name refers
to other objects by their sha3-names and an object named by sha1-name
refers to other objects by their sha1-names.

The sha3-content of an object is the same as its sha1-content, except
that objects referenced by the object are named using their sha3-names
instead of sha1-names. Because a blob object does not refer to any
other object, its sha1-content and sha3-content are the same.

The format allows round-trip conversion between sha3-content and
sha1-content.

Object storage
~~~~~~~~~~~~~~
Loose objects use zlib compression and packed objects use the packed
format described in Documentation/technical/pack-format.txt, just like
today. The content that is compressed and stored uses sha3-content
instead of sha1-content.

Translation table
~~~~~~~~~~~~~~~~~
A fast bidirectional mapping between sha1-names and sha3-names of all
local objects in the repository is kept on disk. The exact format of
that mapping is to be determined.

All operations that make new objects (e.g., "git commit") add the new
objects to the translation table.

(This work could have been deferred to push time, but that would
significantly complicate and slow down pushes. Calculating the
sha1-name at object creation time, while the object is being streamed
to disk and its sha3-name is being calculated, should be an acceptable
cost.)
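
Since the on-disk format is still to be determined, the following is only an in-memory sketch of the interface such a table might offer. The class and method names are hypothetical, and telling the two kinds of name apart by length (40 versus 64 hex digits) is an assumption that only holds for full, unabbreviated names.

```python
class TranslationTable:
    """In-memory stand-in for the bidirectional sha1 <-> sha3 mapping."""

    def __init__(self):
        self._sha1_to_sha3 = {}
        self._sha3_to_sha1 = {}

    def add(self, sha1_name, sha3_name):
        # Called whenever a new object is created or converted.
        self._sha1_to_sha3[sha1_name] = sha3_name
        self._sha3_to_sha1[sha3_name] = sha1_name

    def to_sha3(self, sha1_name):
        return self._sha1_to_sha3[sha1_name]

    def to_sha1(self, sha3_name):
        return self._sha3_to_sha1[sha3_name]

    def resolve(self, name):
        """Accept either kind of full hex name and return the sha3-name."""
        if len(name) == 40:
            return self.to_sha3(name)
        if name in self._sha3_to_sha1:
            return name
        raise KeyError(name)


table = TranslationTable()
table.add("a" * 40, "b" * 64)
```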

Reading an object's sha1-content
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The sha1-content of an object can be read by converting all sha3-names
its sha3-content references to sha1-names using the translation table.
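
For a tree object this conversion might look like the sketch below. It assumes sha3 trees keep today's entry layout ("<mode> <name>", a nul byte, then the raw binary object name) with 32-byte names in place of 20-byte ones. The function name and the dictionary standing in for the translation table are hypothetical.

```python
SHA3_RAW_LEN = 32  # bytes in a raw SHA3-256 name


def tree_sha3_to_sha1(sha3_content, sha3_to_sha1):
    """Rewrite every raw sha3-name in a tree to the mapped raw sha1-name.

    sha3_to_sha1 maps raw 32-byte names to raw 20-byte names.
    """
    out = bytearray()
    pos = 0
    while pos < len(sha3_content):
        nul = sha3_content.index(b"\0", pos)
        name_end = nul + 1 + SHA3_RAW_LEN
        out += sha3_content[pos:nul + 1]                 # "<mode> <name>\0"
        out += sha3_to_sha1[sha3_content[nul + 1:name_end]]
        pos = name_end
    return bytes(out)


mapping = {b"\x11" * 32: b"\x22" * 20, b"\x33" * 32: b"\x44" * 20}
sha3_tree = (b"100644 a.txt\0" + b"\x11" * 32 +
             b"40000 dir\0" + b"\x33" * 32)
sha1_tree = tree_sha3_to_sha1(sha3_tree, mapping)
```

A missing mapping entry raises KeyError, which reflects the requirement that every referenced object be present in the translation table.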

Fetch
~~~~~
Fetching from a SHA-1 based server requires translating between SHA-1
and SHA3-256 based representations on the fly.

SHA-1s named in the ref advertisement that are present on the client
can be translated to SHA3-256 and looked up as local objects using the
translation table.

Negotiation proceeds as today. Any "have"s generated locally are
converted to SHA-1 before being sent to the server, and SHA-1s
mentioned by the server are converted to SHA3-256 when looking them up
locally.

After negotiation, the server sends a packfile containing the
requested objects. We convert the packfile to SHA3-256 format using
the following steps:

1. index-pack: inflate each object in the packfile and compute its
   SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
   objects the client has locally. These objects can be looked up
   using the translation table and their sha1-content read as
   described above to resolve the deltas.
2. topological sort: starting at the "want"s from the negotiation
   phase, walk through objects in the pack and emit a list of them,
   excluding blobs, in reverse topologically sorted order, with each
   object coming later in the list than all objects it references.
   (This list only contains objects reachable from the "wants". If the
   pack from the server contained additional extraneous objects, then
   they will be discarded.)
3. convert to sha3: open a new (sha3) packfile. Read the topologically
   sorted list just generated. For each object, inflate its
   sha1-content, convert to sha3-content, and write it to the sha3
   pack. Include the new sha1<->sha3 mapping entry in the translation
   table.
4. sort: reorder entries in the new pack to match the order of objects
   in the pack the server generated and include blobs. Write a sha3 idx
   file.
5. clean up: remove the SHA-1 based pack file, index, and
   topologically sorted list obtained from the server and steps 1
   and 2.

Step 3 requires every object referenced by the new object to be in the
translation table. This is why the topological sort step is necessary.
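
The ordering requirement of step 2 can be sketched as a post-order
depth-first walk. This is an illustration under assumptions: refs is a
made-up dict mapping each non-blob object in the pack to the names it
references, and any referenced name that is not a key of refs is
treated as a blob or an object the client already has.

```python
def topo_order(wants, refs):
    """List pack objects reachable from `wants`, dependencies first."""
    order, seen = [], set()
    for want in wants:
        if want not in refs or want in seen:
            continue
        seen.add(want)
        # Explicit stack instead of recursion: history can be deep.
        stack = [(want, iter(refs[want]))]
        while stack:
            name, children = stack[-1]
            for child in children:
                if child in refs and child not in seen:
                    seen.add(child)
                    stack.append((child, iter(refs[child])))
                    break
            else:
                # Every reference has been emitted; emit the object.
                stack.pop()
                order.append(name)
    return order


refs = {"c2": ["t2", "c1"], "c1": ["t1", "local"],
        "t1": [], "t2": ["t1"], "extra": []}
print(topo_order(["c2"], refs))
# ['t1', 't2', 'c1', 'c2']
```

Objects such as "extra" that are not reachable from a want never
appear in the list, matching the note in step 2.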

As an optimization, step 1 could write a file describing what non-blob
objects each object it has inflated from the packfile references. This
makes the topological sort in step 2 possible without inflating the
objects in the packfile for a second time. The objects need to be
inflated again in step 3, for a total of two inflations.

Step 4 is probably necessary for good read-time performance. "git
pack-objects" on the server optimizes the pack file for good data
locality (see Documentation/technical/pack-heuristics.txt).

Details of this process are likely to change. It will take some
experimenting to get this to perform well.

Push
~~~~
Push is simpler than fetch because the objects referenced by the
pushed objects are already in the translation table. The sha1-content
of each object being pushed can be read as described in the "Reading
an object's sha1-content" section to generate the pack written by git
send-pack.

Signed Commits
~~~~~~~~~~~~~~
We add a new field "gpgsig-sha3" to the commit object format to allow
signing commits without relying on SHA-1. It is similar to the
existing "gpgsig" field. Its signed payload is the sha3-content of the
commit object with any "gpgsig" and "gpgsig-sha3" fields removed.

This means commits can be signed
1. using SHA-1 only, as in existing signed commit objects
2. using both SHA-1 and SHA3-256, by using both gpgsig-sha3 and gpgsig
   fields.
3. using only SHA3-256, by only using the gpgsig-sha3 field.

Old versions of "git verify-commit" can verify the gpgsig signature in
cases (1) and (2) without modifications and view case (3) as an
ordinary unsigned commit.
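
For illustration, the signed payload could be derived roughly as in
the Python sketch below. It assumes the sha3-content of the commit is
passed in and relies on the existing header convention that
continuation lines of a multi-line field begin with a space; the
function name signed_payload is made up for this example.

```python
def signed_payload(commit, drop=(b"gpgsig", b"gpgsig-sha3")):
    """Return `commit` with the signature header fields removed."""
    header, sep, body = commit.partition(b"\n\n")
    kept, skipping = [], False
    for line in header.split(b"\n"):
        if line.startswith(b" "):
            # Continuation line: shares the fate of the field it extends.
            if not skipping:
                kept.append(line)
            continue
        skipping = line.split(b" ", 1)[0] in drop
        if not skipping:
            kept.append(line)
    return b"\n".join(kept) + sep + body
```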

Signed Tags
~~~~~~~~~~~
We add a new field "gpgsig-sha3" to the tag object format to allow
signing tags without relying on SHA-1. Its signed payload is the
sha3-content of the tag with its gpgsig-sha3 field and "-----BEGIN PGP
SIGNATURE-----" delimited in-body signature removed.

This means tags can be signed
1. using SHA-1 only, as in existing signed tag objects
2. using both SHA-1 and SHA3-256, by using gpgsig-sha3 and an in-body
   signature.
3. using only SHA3-256, by only using the gpgsig-sha3 field.

Mergetag embedding
~~~~~~~~~~~~~~~~~~
The mergetag field in the sha1-content of a commit contains the
sha1-content of a tag that was merged by that commit.

The mergetag field in the sha3-content of the same commit contains the
sha3-content of the same tag.

Submodules
~~~~~~~~~~
To convert recorded submodule pointers, you need to have the converted
submodule repository in place. The translation table of the submodule
can be used to look up the new hash.

Caveats
-------
Invalid objects
~~~~~~~~~~~~~~~
The conversion from sha1-content to sha3-content retains any
brokenness in the original object (e.g., tree entry modes encoded with
leading 0, tree objects whose paths are not sorted correctly, and
commit objects without an author or committer). This is a deliberate
feature of the design to allow the conversion to round-trip.

More profoundly broken objects (e.g., a commit with a truncated "tree"
header line) cannot be converted but were not usable by current Git
anyway.

Shallow clone and submodules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because it requires all referenced objects to be available in the
locally generated translation table, this design does not support
shallow clone or unfetched submodules. Protocol improvements might
allow lifting this restriction.

Alternates
~~~~~~~~~~
For the same reason, a sha3 repository cannot borrow objects from a
sha1 repository using objects/info/alternates or
$GIT_ALTERNATE_OBJECT_REPOSITORIES.

git notes
~~~~~~~~~
The "git notes" tool annotates objects using their sha1-name as key.
This design does not describe a way to migrate notes trees to use
sha3-names. That migration is expected to happen separately (for
example using a file at the root of the notes tree to describe which
hash it uses).

Server-side cost
~~~~~~~~~~~~~~~~
Until Git protocol gains SHA3-256 support, using sha3 based storage on
public-facing Git servers is strongly discouraged. Once Git protocol
gains SHA3-256 support, sha3 based servers are likely not to support
sha1 compatibility, to avoid what may be a very expensive hash
reencode during clone and to encourage peers to modernize.

The design described here allows fetches by SHA-1 clients of a
personal SHA3-256 repository because it's not much more difficult than
allowing pushes from that repository. This support needs to be guarded
by a configuration option --- servers like git.kernel.org that serve a
large number of clients would not be expected to bear that cost.

Meaning of signatures
~~~~~~~~~~~~~~~~~~~~~
The signed payload for signed commits and tags does not explicitly
name the hash used to identify objects. If some day Git adopts a new
hash function whose names have the same length as the current SHA-1
(40 hexadecimal digits) or SHA3-256 (64 hexadecimal digits) object
names, then the intent behind the PGP signed payload in an object
signature is unclear:

	object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
	type commit
	tag v2.12.0
	tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800

	Git 2.12

Does this mean Git v2.12.0 is the commit with sha1-name
e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?

Fortunately SHA3-256 and SHA-1 have different lengths. If Git starts
using another hash with the same length to name objects, then it will
need to change the format of signed payloads using that hash to
address this issue.
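
As a small illustration of why length is enough for now, a reader of a
signed payload could classify an object name as in the sketch below.
The function name hash_of_name is hypothetical and the rule only works
while no two supported hashes share a length.

```python
HEX_DIGITS = set("0123456789abcdef")


def hash_of_name(name):
    """Guess which hash function a full hex object name belongs to."""
    if not name or any(c not in HEX_DIGITS for c in name):
        raise ValueError("not a lowercase hexadecimal object name")
    if len(name) == 40:
        return "sha1"
    if len(name) == 64:
        return "sha3-256"
    raise ValueError("unrecognized object name length")


print(hash_of_name("e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7"))
# sha1
```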

Alternatives considered
-----------------------
Upgrading everyone working on a particular project on a flag day
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Projects like the Linux kernel are large and complex enough that
flipping the switch for all projects based on the repository at once
is infeasible.

Not only would all developers and server operators supporting
developers have to switch on the same flag day, but supporting tooling
(continuous integration, code review, bug trackers, etc) would have to
be adapted as well. This also makes it difficult to get early feedback
from some project participants testing before it is time for mass
adoption.

Using hash functions in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ )
Objects newly created would be addressed by the new hash, but inside
such an object (e.g. commit) it is still possible to address objects
using the old hash function.
* You cannot trust its history (needed for bisectability) in the
  future without further work
* Maintenance burden as the number of supported hash functions grows
  (they will never go away, so they accumulate). In this proposal, by
  comparison, converted objects lose all references to SHA-1.

Signed objects with multiple hashes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Instead of introducing the gpgsig-sha3 field in commit and tag objects
for sha3-content based signatures, an earlier version of this design
added "hash sha3 <sha3-name>" fields to strengthen the existing
sha1-content based signatures.

In other words, a single signature was used to attest to the object
content using both hash functions. This had some advantages:
* Using one signature instead of two speeds up the signing process.
* Having one signed payload with both hashes allows the signer to
  attest to the sha1-name and sha3-name referring to the same object.
* All users consume the same signature. Broken signatures are likely
  to be detected quickly using current versions of git.

However, it also came with disadvantages:
* Verifying a signed object requires access to the sha1-names of all
  objects it references, even after the transition is complete and the
  translation table is no longer needed for anything else. To support
  this, the design added fields such as "hash sha1 tree <sha1-name>"
  and "hash sha1 parent <sha1-name>" to the sha3-content of a signed
  commit, complicating the conversion process.
* Allowing signed objects without a sha1 (for after the transition is
  complete) complicated the design further, requiring a "nohash sha1"
  field to suppress including "hash sha1" fields in the sha3-content
  and signed payload.

Document History
----------------

2017-03-03
bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com,
sbeller@google.com

Initial version sent to
http://public-inbox.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com

2017-03-03 jrnieder@gmail.com
Incorporated suggestions from jonathantanmy and sbeller:
* describe purpose of signed objects with each hash type
* redefine signed object verification using object content under the
  first hash function

2017-03-06 jrnieder@gmail.com
* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2]
* Make sha3-based signatures a separate field, avoiding the need for
  "hash" and "nohash" fields (thanks to peff[3]).
* Add a sorting phase to fetch (thanks to Junio for noticing the need
  for this).
* Omit blobs from the topological sort during fetch (thanks to peff).
* Discuss alternates, git notes, and git servers in the caveats
  section (thanks to Junio Hamano, brian m. carlson[4], and Shawn
  Pearce).
* Clarify language throughout (thanks to various commenters,
  especially Junio).

[1] http://public-inbox.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/
[2] http://public-inbox.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/
[3] http://public-inbox.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/
[4] http://public-inbox.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net


* Re: RFC: Another proposed hash function transition plan
  2017-03-06 18:39   ` Jonathan Tan
  2017-03-06 19:22     ` Linus Torvalds
@ 2017-03-07  8:59     ` Jeff King
  1 sibling, 0 replies; 113+ messages in thread
From: Jeff King @ 2017-03-07  8:59 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Jonathan Nieder, git, sbeller, bmwill, Linus Torvalds

On Mon, Mar 06, 2017 at 10:39:49AM -0800, Jonathan Tan wrote:

> The "nohash" thing was in the hope of requiring only one signature to sign
> all the hashes (in all the functions) that the user wants, while preserving
> round-tripping ability.

Thanks, this explained it very well.

I understand the tradeoff now, though I am still of the opinion that
simplicity is probably a more important goal.

In practice I'd imagine that anybody doing commit-signing would just
sign the more-secure hash, and people doing tag releases would probably
do a dual-sign to be verifiable by both old and new clients. Those are
infrequent enough that the extra computation probably doesn't matter.
But that's just my gut feeling.

-Peff


* Re: RFC: Another proposed hash function transition plan
  2017-03-04  1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
                   ` (2 preceding siblings ...)
  2017-03-06  8:43 ` Jeff King
@ 2017-03-07 18:57 ` Ian Jackson
  2017-03-07 19:15   ` Linus Torvalds
  2017-03-13  9:24 ` RFC: Another proposed hash function transition plan The Keccak Team
  4 siblings, 1 reply; 113+ messages in thread
From: Ian Jackson @ 2017-03-07 18:57 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, sbeller, bmwill, jonathantanmy, peff, Linus Torvalds

Jonathan Nieder writes ("RFC: Another proposed hash function transition plan"):
> This past week we came up with this idea for what a transition to a new
> hash function for Git would look like.  I'd be interested in your
> thoughts (especially if you can make them as comments on the document,
> which makes it easier to address them and update the document).

Thanks for this.

This is a reasonable plan.  It corresponds to approaches (2) and (B)
of my survey mail from the other day.  Ie, two parallel homogeneous
hash trees, rather than a unified but heterogeneous hash tree, with
old vs new object names distinguished by length.

I still prefer my proposal with the mixed hash tree, mostly because
the handling of signatures here is very awkward, and because my
proposal does not involve altering object ids stored other than in the
git object graph (eg CI system databases, etc.)

One thing you've missed, I think, is notes: notes have to be dealt
with in a more complicated way.  Do you intend to rewrite the tree
objects for notes commits so that the notes are annotations for the
new names for the annotated objects ?  And if so, when ?

Also I think you need to specify how abbreviated object names are
interpreted.

Regards,
Ian.


* Re: RFC: Another proposed hash function transition plan
  2017-03-07 18:57 ` Ian Jackson
@ 2017-03-07 19:15   ` Linus Torvalds
  2017-03-08 11:20     ` Ian Jackson
  0 siblings, 1 reply; 113+ messages in thread
From: Linus Torvalds @ 2017-03-07 19:15 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Jonathan Nieder, Git Mailing List, Stefan Beller, bmwill,
	Jonathan Tan, Jeff King

On Tue, Mar 7, 2017 at 10:57 AM, Ian Jackson
<ijackson@chiark.greenend.org.uk> wrote:
>
> Also I think you need to specify how abbreviated object names are
> interpreted.

One option might be to not use hex for the new hash, but base64 encoding.

That would make the full size ASCII hash encoding length roughly
similar (43 base64 characters rather than 40), which would offset some
of the new costs (longer filenames in the loose format, for example).

Also, since 256 isn't evenly divisible by 6, and because you'd want
some way to explicitly disambiguate the new hashes, the rule *could* be
that the ASCII representation of a new hash is the base64 encoding of
the 258-bit value that has "10" prepended to it as padding.

That way the first character of the hash would be guaranteed to not be
a hex digit, because it would be in the range [g-v] (indexes 32..47).

Of course, the downside is that base64 encoded hashes can also end up
looking very much like real words, and now case would matter too.

The "use base64 with a "10" two-bit padding prepended" also means that
the natural loose format radix format would remain the first 2
characters of the hash, but due to the first character containing the
padding, it would be a fan-out of 2**10 rather than 2**12.

Of course, having written that, I now realize how it would cause
problems for the usual shit-for-brains case-insensitive filesystems.
So I guess base64 encoding doesn't work well for that reason.
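
To make the arithmetic of the "10"-prefix idea concrete, here is a
short Python sketch. It assumes the standard base64 alphabet (which is
what the [g-v], indexes 32..47 remark implies) and a 32-byte digest;
the function name encode_prefixed is made up for this example.

```python
B64 = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
       "abcdefghijklmnopqrstuvwxyz"
       "0123456789+/")


def encode_prefixed(digest):
    """Encode a 256-bit digest as 43 base64 characters with '10' prepended."""
    if len(digest) != 32:
        raise ValueError("expected a 256-bit digest")
    # 2 padding bits + 256 digest bits = 258 bits = 43 groups of 6 bits.
    value = (0b10 << 256) | int.from_bytes(digest, "big")
    return "".join(B64[(value >> shift) & 0x3F]
                   for shift in range(252, -1, -6))


low = encode_prefixed(bytes(32))        # all-zero digest
high = encode_prefixed(b"\xff" * 32)    # all-one digest
print(len(low), low[0], high[0])
# 43 g v
```

The first character can take only 16 values and the second 64, which
gives the 16 * 64 = 2**10 fan-out for a two-character directory prefix.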

                Linus


* Re: RFC: Another proposed hash function transition plan
  2017-03-07 19:15   ` Linus Torvalds
@ 2017-03-08 11:20     ` Ian Jackson
  2017-03-08 15:37       ` Johannes Schindelin
  2017-03-08 15:40       ` Johannes Schindelin
  0 siblings, 2 replies; 113+ messages in thread
From: Ian Jackson @ 2017-03-08 11:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jonathan Nieder, Git Mailing List, Stefan Beller, bmwill,
	Jonathan Tan, Jeff King

Linus Torvalds writes ("Re: RFC: Another proposed hash function transition plan"):
> Also, since 256 isn't evenly divisible by 6, and because you'd want
> some way to explicitly disambiguate the new hashes, the rule *could* be
> that the ASCII representation of a new hash is the base64 encoding of
> the 258-bit value that has "10" prepended to it as padding.
> 
> That way the first character of the hash would be guaranteed to not be
> a hex digit, because it would be in the range [g-v] (indexes 32..47).

We should arrange for this to be an uppercase, not a lowercase,
letter, for the reasons I explained in my own proposal.  To summarise:
It would be undesirable to further increase the overlap between object
names and ref names.  Few people use uppercase in ref names because of
the case-insensitive filesystem problem; so object names starting with
uppercase ascii are distinct from most ref names.

> Of course, having written that, I now realize how it would cause
> problems for the usual shit-for-brains case-insensitive filesystems.
> So I guess base64 encoding doesn't work well for that reason.

AFAIAA object names occur in publicly-visible filenames only in notes
tree objects, which are manipulated by git internally and do not
necessarily need to appear in the filesystem.

The filenames in .git/objects/ can be in whatever encoding we like, so
are not an obstacle.

Ian.

-- 
Ian Jackson <ijackson@chiark.greenend.org.uk>   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.


* Re: RFC: Another proposed hash function transition plan
  2017-03-08 11:20     ` Ian Jackson
@ 2017-03-08 15:37       ` Johannes Schindelin
  2017-03-08 15:40       ` Johannes Schindelin
  1 sibling, 0 replies; 113+ messages in thread
From: Johannes Schindelin @ 2017-03-08 15:37 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller,
	bmwill, Jonathan Tan, Jeff King

Hi Ian,

On Wed, 8 Mar 2017, Ian Jackson wrote:

> Few people use uppercase in ref names because of the case-insensitive
> filesystem problem;

Not true.

Ciao,
Johannes


* Re: RFC: Another proposed hash function transition plan
  2017-03-08 11:20     ` Ian Jackson
  2017-03-08 15:37       ` Johannes Schindelin
@ 2017-03-08 15:40       ` Johannes Schindelin
  2017-03-20  5:21         ` Use base32? Jason Hennessey
  1 sibling, 1 reply; 113+ messages in thread
From: Johannes Schindelin @ 2017-03-08 15:40 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller,
	bmwill, Jonathan Tan, Jeff King

Hi Ian,

On Wed, 8 Mar 2017, Ian Jackson wrote:

> Linus Torvalds writes ("Re: RFC: Another proposed hash function transition plan"):
> > Of course, having written that, I now realize how it would cause
> > problems for the usual shit-for-brains case-insensitive filesystems.
> > So I guess base64 encoding doesn't work well for that reason.
> 
> AFAIAA object names occur in publicly-visible filenames only in notes
> tree objects, which are manipulated by git internally and do not
> necessarily need to appear in the filesystem.
> 
> The filenames in .git/objects/ can be in whatever encoding we like, so
> are not an obstacle.

Given that the idea was to encode the new hash in base64 or base85, we
*are* talking about an encoding. In that respect, yes, it can be whatever
encoding we like, and Linus just made a good point (with unnecessary foul
language) of explaining why base64/base85 is not that encoding.

Ciao,
Johannes


* Re: RFC v3: Another proposed hash function transition plan
  2017-03-07  0:17   ` RFC v3: " Jonathan Nieder
@ 2017-03-09 19:14     ` Shawn Pearce
  2017-03-09 20:24       ` Jonathan Nieder
  2017-09-28  4:43       ` [PATCH v4] technical doc: add a design doc for hash function transition Jonathan Nieder
  2017-09-06  6:28     ` RFC v3: Another proposed hash function transition plan Junio C Hamano
  1 sibling, 2 replies; 113+ messages in thread
From: Shawn Pearce @ 2017-03-09 19:14 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Linus Torvalds, Git Mailing List, Stefan Beller, bmwill,
	Jonathan Tan, Jeff King, David Lang, brian m. carlson

On Mon, Mar 6, 2017 at 4:17 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Linus Torvalds wrote:
>> On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
>
>>> This document is still in flux but I thought it best to send it out
>>> early to start getting feedback.
>>
>> This actually looks very reasonable if you can implement it cleanly
>> enough.
>
> Thanks for the kind words on what had quite a few flaws still.  Here's
> a new draft.  I think the next version will be a patch against
> Documentation/technical/.

FWIW, I like this approach.

> Alongside the packfile, a sha3 repository stores a bidirectional
> mapping between sha3 and sha1 object names. The mapping is generated
> locally and can be verified using "git fsck". Object lookups use this
> mapping to allow naming objects using their sha1 and sha3 names
> interchangeably.

I saw some discussion about using LevelDB for this mapping table. I
think any existing database may be overkill.

For packs, you may be able to simplify by having only one file
(pack-*.msha1) that maps SHA-1 to pack offset; idx v2. The CRC32 table
in v2 is unnecessary, but you need the 64 bit offset support.

SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to
read the SHA-3.
SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to
translate offset to SHA-1.

For loose objects, the loose object directories should have only
O(4000) entries before auto gc is strongly encouraging
packing/pruning. With 256 shards, each given directory has O(16) loose
objects in it. When writing a SHA-3 loose object, Git could also
append a line "$sha3 $sha1\n" to objects/${first_byte}/sha1, which
GC/prune rewrites to remove entries. With O(16) objects in a
directory, these files should only have O(16) entries in them.

SHA-3 to SHA-1: open objects/${sha3_first_byte}/sha1 and scan until a
match is found.
SHA-1 to SHA-3: brute force read 256 files. Callers performing this
mapping may load all 256 files into a table in memory.
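
A rough Python sketch of the loose-object scheme follows, under the
assumption that "first_byte" means the first two hex digits of the
sha3-name, as with today's loose object directories. The function
names are hypothetical, and the GC/prune rewrite step is left out.

```python
import os
import tempfile


def record_loose(objects_dir, sha3_name, sha1_name):
    """Append '$sha3 $sha1' to objects/<first byte>/sha1."""
    shard = os.path.join(objects_dir, sha3_name[:2])
    os.makedirs(shard, exist_ok=True)
    with open(os.path.join(shard, "sha1"), "a") as f:
        f.write(f"{sha3_name} {sha1_name}\n")


def sha3_to_sha1(objects_dir, sha3_name):
    """Scan the one shard file that can contain sha3_name."""
    path = os.path.join(objects_dir, sha3_name[:2], "sha1")
    if not os.path.isfile(path):
        return None
    with open(path) as f:
        for line in f:
            s3, s1 = line.split()
            if s3 == sha3_name:
                return s1
    return None


def load_sha1_to_sha3(objects_dir):
    """Brute force: read every shard file into an in-memory table."""
    table = {}
    for shard in os.listdir(objects_dir):
        path = os.path.join(objects_dir, shard, "sha1")
        if os.path.isfile(path):
            with open(path) as f:
                for line in f:
                    s3, s1 = line.split()
                    table[s1] = s3
    return table


with tempfile.TemporaryDirectory() as objects_dir:
    record_loose(objects_dir, "ab" + "0" * 62, "1" * 40)
    print(sha3_to_sha1(objects_dir, "ab" + "0" * 62) == "1" * 40)
    # True
```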


* Re: RFC v3: Another proposed hash function transition plan
  2017-03-09 19:14     ` Shawn Pearce
@ 2017-03-09 20:24       ` Jonathan Nieder
  2017-03-10 19:38         ` Jeff King
  2017-09-28  4:43       ` [PATCH v4] technical doc: add a design doc for hash function transition Jonathan Nieder
  1 sibling, 1 reply; 113+ messages in thread
From: Jonathan Nieder @ 2017-03-09 20:24 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Linus Torvalds, Git Mailing List, Stefan Beller, bmwill,
	Jonathan Tan, Jeff King, David Lang, brian m. carlson

Hi,

Shawn Pearce wrote:
> On Mon, Mar 6, 2017 at 4:17 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:

>> Alongside the packfile, a sha3 repository stores a bidirectional
>> mapping between sha3 and sha1 object names. The mapping is generated
>> locally and can be verified using "git fsck". Object lookups use this
>> mapping to allow naming objects using their sha1 and sha3 names
>> interchangeably.
>
> I saw some discussion about using LevelDB for this mapping table. I
> think any existing database may be overkill.
>
> For packs, you may be able to simplify by having only one file
> (pack-*.msha1) that maps SHA-1 to pack offset; idx v2. The CRC32 table
> in v2 is unnecessary, but you need the 64 bit offset support.
>
> SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to
> read the SHA-3.
> SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to
> translate offset to SHA-1.

Thanks for this suggestion.  I was initially vaguely nervous about
lookup times in an idx-style file, but as you say, object reads from a
packfile already have to deal with this kind of lookup and work fine.

> For loose objects, the loose object directories should have only
> O(4000) entries before auto gc is strongly encouraging
> packing/pruning. With 256 shards, each given directory has O(16) loose
> objects in it. When writing a SHA-3 loose object, Git could also
> append a line "$sha3 $sha1\n" to objects/${first_byte}/sha1, which
> GC/prune rewrites to remove entries. With O(16) objects in a
> directory, these files should only have O(16) entries in them.

Insertion time is what worries me.  When writing a small number of
objects using a command like "git commit", I don't want to have to
regenerate an entire idx file.  I don't want to move the pain to
O(loose objects) work at read time, either --- some people disable
auto gc, and others have a large number of loose objects due to gc
ejecting unreachable objects.

But some kind of simplification along these lines should be possible.
I'll experiment.

Jonathan


* Re: RFC v3: Another proposed hash function transition plan
  2017-03-09 20:24       ` Jonathan Nieder
@ 2017-03-10 19:38         ` Jeff King
  2017-03-10 19:55           ` Jonathan Nieder
  0 siblings, 1 reply; 113+ messages in thread
From: Jeff King @ 2017-03-10 19:38 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Shawn Pearce, Linus Torvalds, Git Mailing List, Stefan Beller,
	bmwill, Jonathan Tan, David Lang, brian m. carlson

On Thu, Mar 09, 2017 at 12:24:08PM -0800, Jonathan Nieder wrote:

> > SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to
> > read the SHA-3.
> > SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to
> > translate offset to SHA-1.
> 
> Thanks for this suggestion.  I was initially vaguely nervous about
> lookup times in an idx-style file, but as you say, object reads from a
> packfile already have to deal with this kind of lookup and work fine.

Not exactly. The "reverse .idx" step has to build the reverse mapping on
the fly, and it's non-trivial. For instance, try:

  sha1=$(git rev-parse HEAD)
  time echo $sha1 | git cat-file --batch-check='%(objectsize)'
  time echo $sha1 | git cat-file --batch-check='%(objectsize:disk)'

on a large repo (where HEAD is in a big pack). The on-disk size is
conceptually simpler, as we only need to look at the offset of the
object versus the offset of the object after it. But in practice it
takes much longer, because it has to build the revindex on the fly (I
get 7ms versus 179ms on linux.git).

The effort is linear in the number of objects (we create the revindex
with a radix sort).

The reachability bitmaps suffer from this, too, as they need the
revindex to know which object is at which bit position. At GitHub we
added an extension to the .bitmap files that stores this "bit cache".
Here are timings before and after on linux.git:

  $ time git rev-list --use-bitmap-index --count master
  659371

  real	0m0.182s
  user	0m0.136s
  sys	0m0.044s

  $ time git.gh rev-list --use-bitmap-index --count master
  659371

  real	0m0.016s
  user	0m0.008s
  sys	0m0.004s

It's not a full revindex, but it's enough for bitmap use. You can also
use it to generate the revindex slightly more quickly, because you can
skip the sorting step (you just insert the entries in the correct order
by walking the bit cache and dereferencing the offsets from the .idx
portion). So it's still linear, but with a smaller constant factor.

I think for the purposes here, though, we don't actually care about the
offsets. For the cost of one uint32_t per object, you can keep a list
mapping positions in the sha1 index into the sha3 index. So then you do
the log-n binary search to find the sha1, a constant-time lookup in the
mapping array, and that gives you the position in the sha3 index, from
which you can then access the sha3 (or the actual pack offset, for that
matter).

So I think it's solvable, but I suspect we would want an extension to
the .idx format to store the mapping array, in order to keep it log-n.
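
The mapping-array lookup can be sketched in a few lines of Python.
This is only an illustration of the idea, not a proposed file format:
the two sorted lists stand in for the sha1 and sha3 index tables,
array("I") stands in for the one-uint32_t-per-object array, and the
function names are made up.

```python
from array import array
from bisect import bisect_left


def build_index(pairs):
    """Build sorted name tables plus a sha1-position -> sha3-position array."""
    sha3_index = sorted(s3 for _, s3 in pairs)
    sha3_pos = {s3: i for i, s3 in enumerate(sha3_index)}
    by_sha1 = sorted(pairs)
    sha1_index = [s1 for s1, _ in by_sha1]
    # One unsigned integer per object.
    mapping = array("I", (sha3_pos[s3] for _, s3 in by_sha1))
    return sha1_index, sha3_index, mapping


def sha1_to_sha3(sha1, sha1_index, sha3_index, mapping):
    """log-n binary search, then a constant-time hop through the array."""
    i = bisect_left(sha1_index, sha1)
    if i == len(sha1_index) or sha1_index[i] != sha1:
        return None
    return sha3_index[mapping[i]]
```

The reverse direction would need a second array of the same size, or a
search of the sha3 table followed by a scan, which is one of the
tradeoffs such an extension would have to settle.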

-Peff


* Re: RFC v3: Another proposed hash function transition plan
  2017-03-10 19:38         ` Jeff King
@ 2017-03-10 19:55           ` Jonathan Nieder
  0 siblings, 0 replies; 113+ messages in thread
From: Jonathan Nieder @ 2017-03-10 19:55 UTC (permalink / raw)
  To: Jeff King
  Cc: Shawn Pearce, Linus Torvalds, Git Mailing List, Stefan Beller,
	bmwill, Jonathan Tan, David Lang, brian m. carlson

Jeff King wrote:
> On Thu, Mar 09, 2017 at 12:24:08PM -0800, Jonathan Nieder wrote:

>>> SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to
>>> read the SHA-3.
>>> SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to
>>> translate offset to SHA-1.
>>
>> Thanks for this suggestion.  I was initially vaguely nervous about
>> lookup times in an idx-style file, but as you say, object reads from a
>> packfile already have to deal with this kind of lookup and work fine.
>
> Not exactly. The "reverse .idx" step has to build the reverse mapping on
> the fly, and it's non-trivial.

Sure.  To be clear, I was handwaving over that since adding an on-disk
reverse .idx is a relatively small change.

[...]
> So I think it's solvable, but I suspect we would want an extension to
> the .idx format to store the mapping array, in order to keep it log-n.

i.e., this.

The loose object side is the more worrying bit, since we currently don't
have any practical bound on the number of loose objects.

One way to deal with that is to disallow loose objects completely.
Use packfiles for new objects, batching the objects produced by a
single process into a single packfile.  Teach "git gc --auto" a
behavior similar to Martin Fick's "git exproll" to combine packfiles
between full gcs to maintain reasonable performance.  For unreachable
objects, instead of using loose objects, use "unreachable garbage"
packs explicitly labeled as such, with similar semantics to what
JGit's DfsRepository backend uses (described in the discussion at
https://git.eclipse.org/r/89455).

That's a direction that I want in the long term anyway.  I was hoping
not to couple such changes with the hash transition but it might be
one of the simpler ways to go.

Jonathan


* Re: RFC: Another proposed hash function transition plan
  2017-03-04  1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
                   ` (3 preceding siblings ...)
  2017-03-07 18:57 ` Ian Jackson
@ 2017-03-13  9:24 ` The Keccak Team
  2017-03-13 17:48   ` Jonathan Nieder
  4 siblings, 1 reply; 113+ messages in thread
From: The Keccak Team @ 2017-03-13  9:24 UTC (permalink / raw)
  To: Jonathan Nieder, git; +Cc: sbeller, bmwill, jonathantanmy, peff, Linus Torvalds

Hello,

We have read your transition plan to move away from SHA-1 and noticed
your intent to use SHA3-256 as the new hash function in the new Git
repository format and protocol. Although this is a valid choice, we
think that the new SHA-3 standard proposes alternatives that may also be
interesting for your use cases.  As designers of the Keccak function
family, we thought we could jump in the mail thread and present these
alternatives.

SHA3-256, standardized in FIPS 202 [1], is a fixed-length hash function
that provides the same interface and security level as SHA-256 (FIPS
180-4). SHA3-256's primary goal is to be drop-in compatible with the
previous standard, and to allow a fast transition for applications that
would already use SHA-256.

Since your application did not use SHA-256, you are free to choose one
of the alternatives listed below.

* SHAKE128

  SHAKE128, defined in FIPS 202, is an eXtendable-Output Function (XOF)
  that generates digests of any size. In your case, you would use
  SHAKE128 the same way you would use SHA3-256, just truncating the
  output at 256 bits. In that case, SHAKE128 provides a security level
  of 128 bits against all generic attacks, including collisions,
  preimages, etc. We think this security level is appropriate for your
  application since this is the maximum you can get with 256-bit tags in
  the case of collision attacks, and this level is beyond computation
  reach for any adversary in the foreseeable future.

  The immediate benefit of using SHAKE128 versus SHA3-256 is a
  performance gain of roughly 20%, both for SW and HW implementations.
  On Intel Core i5-6500, SHAKE128 throughput is 430MiB/s.

* ParallelHash128

  ParallelHash128 (PH128), defined in NIST Special Publication 800-185
  (SP800-185, SHA-3 Derived Functions [2]), is a XOF implementing a tree
  hash mode on top of SHAKE128 (in fact cSHAKE128) to provide higher
  performance for large-file hashing. The tree mode is designed to
  exploit any available parallelism on the CPU, either through vector
  instructions or availability of multiple cores. Note that the chosen
  level of parallelism does not impact the final result, which improves
  interoperability.

  PH128 offers the same security level and interface as SHAKE128. So
  likewise, you just truncate the output at 256 bits.

  The net advantage of using PH128 over SHAKE128 is a huge performance
  boost when hashing big files.  The advantage depends of course on the
  number of cores used for hashing and their architecture. On an Intel
  Core i5-6500 (Skylake), on a single core, PH128 is faster than
  SHAKE128 by a factor of 3 and than SHA-1 by a factor of 1.5 over long
  messages, with a throughput of 1320MiB/s.

* KangarooTwelve

  KangarooTwelve (K12) [3] is a very fast parallel and secure XOF we
  defined for applications that require higher performance than the FIPS
  202 and SP800-185 functions provide, while retaining the same
  flexibility and basis of security.

  K12 is very similar to PH128. It uses the same cryptographic primitive
  (Keccak-p, defined in FIPS 202), the same sponge construction, a
  similar tree hashing mode, and targets the same generic security level
  (128 bits). The main differences are the number of rounds for the
  inner permutation, which is reduced to 12, and the tree mode
  parameters, which are optimized for both small and long messages.

  Again, the benefit of using K12 over PH128 is performance. K12 is
  twice as fast as SHAKE128 for short messages, i.e. 820MiB/s on Intel
  Core i5-6500, and twice as fast as PH128 over long messages, i.e.
  2500MiB/s on the same platform.
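
To make the "truncate the output at 256 bits" suggestion concrete, here
is a small sketch using Python's standard hashlib, which provides both
SHA3-256 and SHAKE128.  The function names are hypothetical, and the
input is hashed as-is (no Git object header).

```python
import hashlib

def name_sha3_256(data: bytes) -> bytes:
    """Fixed-length SHA3-256 digest (always 32 bytes)."""
    return hashlib.sha3_256(data).digest()

def name_shake128(data: bytes, nbytes: int = 32) -> bytes:
    """SHAKE128 is an XOF: the caller chooses the output length."""
    return hashlib.shake_128(data).digest(nbytes)

data = b"hello, world\n"
a = name_sha3_256(data)
b = name_shake128(data)

# Both names are 256 bits long, but they come from different functions.
print(len(a) * 8, len(b) * 8, a != b)

# Asking an XOF for more output only extends the stream, so the
# 32-byte name is a prefix of any longer output for the same input.
print(hashlib.shake_128(data).digest(64)[:32] == b)
```

The prefix property is why truncation is well defined for an XOF: the
256-bit name is simply the first 32 bytes of the output stream.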

If performance is not your primary concern, we suggest using SHAKE128
as the default hash function, and optionally ParallelHash128 for
hashing big files. Both functions offer a considerable security margin
and are standardized algorithms. In the longer term, given HW
acceleration, SHAKE128 alone would easily outperform SHA-1 thanks to its
design.
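
The idea behind a tree hash mode can be sketched as follows.  This is
NOT ParallelHash128 (which is built on cSHAKE128 with its own encoding
rules) and not a vetted construction; it is only a simplified two-level
illustration, with a hypothetical leaf size, of why the degree of
parallelism does not change the result.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # hypothetical leaf size: 1 MiB

def _leaf(chunk: bytes) -> bytes:
    """Hash one leaf independently of all the others."""
    return hashlib.shake_128(chunk).digest(32)

def tree_hash(data: bytes, workers: int = 1) -> bytes:
    """Two-level hash: leaves in parallel, then one pass over the leaves."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        leaves = list(pool.map(_leaf, chunks))  # map() preserves chunk order
    root = hashlib.shake_128()
    root.update(len(chunks).to_bytes(8, "big"))  # bind the number of leaves
    for leaf in leaves:
        root.update(leaf)
    return root.digest(32)
```

Because the chunk boundaries are fixed by the leaf size rather than by
the number of workers, tree_hash(data, 1) and tree_hash(data, 4) return
the same value; only the wall-clock time differs.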

If, however, you value performance first, or if you would like to promote
adoption of the new repository format by offering higher performance,
then KangarooTwelve is the right candidate. On modern CPUs, K12 offers
performance equal to SHA-1 for small messages and outperforms it by a
factor of 3 for long messages.  Regarding security, although K12 of
course offers a smaller security margin than the other alternatives, it
inherits the security assurance built up for Keccak and the FIPS 202
functions. As of today, the best practical attack broke 6 rounds of
Keccak-p, with 2^50 computation effort. The 12 rounds of K12 thus offer
a comfortable security margin [4].

Recently, we gave a presentation at FOSDEM covering the latest
developments in the Keccak family [5].  You can find reference and optimized
implementations of the algorithms listed above in the Keccak Code
Package [6]. Also, if you have questions, don't hesitate to contact us.

Kind regards,
The Keccak Team

Links
 [1]   FIPS 202,
       http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf.
 [2]   NIST SP 800-185,
       http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-185.pdf.
 [3]   KangarooTwelve, http://keccak.noekeon.org/kangarootwelve.html.
 [4]   Keccak Crunchy Crypto Collision and Pre-image Contest,
       http://keccak.noekeon.org/crunchy_contest.html.
 [5]   FOSDEM 2017, Portfolio of optimized cryptographic functions based
       on Keccak, https://fosdem.org/2017/schedule/event/keccak/.
 [6]   Keccak Code Package, https://github.com/gvanas/KeccakCodePackage.

* Re: RFC: Another proposed hash function transition plan
  2017-03-13  9:24 ` RFC: Another proposed hash function transition plan The Keccak Team
@ 2017-03-13 17:48   ` Jonathan Nieder
  2017-03-13 18:34     ` ankostis
  0 siblings, 1 reply; 113+ messages in thread
From: Jonathan Nieder @ 2017-03-13 17:48 UTC (permalink / raw)
  To: The Keccak Team; +Cc: git, sbeller, bmwill, jonathantanmy, peff, Linus Torvalds

Hi,

The Keccak Team wrote:

> We have read your transition plan to move away from SHA-1 and noticed
> your intent to use SHA3-256 as the new hash function in the new Git
> repository format and protocol. Although this is a valid choice, we
> think that the new SHA-3 standard proposes alternatives that may also be
> interesting for your use cases.  As designers of the Keccak function
> family, we thought we could jump in the mail thread and present these
> alternatives.

I indeed had some reservations about SHA3-256's performance.  The main
hash function we had in mind to compare against is blake2bp-256.  This
overview of other functions to compare against should end up being
very helpful.

Thanks for this.  When I have more questions (which I most likely
will) I'll keep you posted.

Sincerely,
Jonathan

* Re: RFC: Another proposed hash function transition plan
  2017-03-13 17:48   ` Jonathan Nieder
@ 2017-03-13 18:34     ` ankostis
  2017-03-17 11:07       ` Johannes Schindelin
  0 siblings, 1 reply; 113+ messages in thread
From: ankostis @ 2017-03-13 18:34 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: The Keccak Team, Git Mailing List, Stefan Beller, bmwill,
	Jonathan Tan, Jeff King, Linus Torvalds

On 13 March 2017 at 18:48, Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Hi,
>
> The Keccak Team wrote:
>
> > We have read your transition plan to move away from SHA-1 and noticed
> > your intent to use SHA3-256 as the new hash function in the new Git
> > repository format and protocol. Although this is a valid choice, we
> > think that the new SHA-3 standard proposes alternatives that may also be
> > interesting for your use cases.  As designers of the Keccak function
> > family, we thought we could jump in the mail thread and present these
> > alternatives.
>
> I indeed had some reservations about SHA3-256's performance.  The main
> hash function we had in mind to compare against is blake2bp-256.  This
> overview of other functions to compare against should end up being
> very helpful.

What if some of us need this extra difficulty, and don't mind about
the performance tax,
because we need to refer to hashes 10 or 30 years from now,
or even in the Post Quantum era?

Thanks,
  Kostis

* Re: RFC: Another proposed hash function transition plan
  2017-03-13 18:34     ` ankostis
@ 2017-03-17 11:07       ` Johannes Schindelin
  0 siblings, 0 replies; 113+ messages in thread
From: Johannes Schindelin @ 2017-03-17 11:07 UTC (permalink / raw)
  To: ankostis
  Cc: Jonathan Nieder, The Keccak Team, Git Mailing List, Stefan Beller,
	bmwill, Jonathan Tan, Jeff King, Linus Torvalds

Hi Kostis,

On Mon, 13 Mar 2017, ankostis wrote:

> On 13 March 2017 at 18:48, Jonathan Nieder <jrnieder@gmail.com> wrote:
> >
> > The Keccak Team wrote:
> >
> > > We have read your transition plan to move away from SHA-1 and
> > > noticed your intent to use SHA3-256 as the new hash function in the
> > > new Git repository format and protocol. Although this is a valid
> > > choice, we think that the new SHA-3 standard proposes alternatives
> > > that may also be interesting for your use cases.  As designers of
> > > the Keccak function family, we thought we could jump in the mail
> > > thread and present these alternatives.
> >
> > I indeed had some reservations about SHA3-256's performance.  The main
> > hash function we had in mind to compare against is blake2bp-256.  This
> > overview of other functions to compare against should end up being
> > very helpful.
> 
> What if some of us need this extra difficulty, and don't mind about the
> performance tax, because we need to refer to hashes 10 or 30 years from
> now, or even in the Post Quantum era?

If you need this extra difficulty, and if this extra difficulty would
imply a huge penalty for everybody else, it is safe to assume that that
extra difficulty would need to be an extra switch, off by default.

It simply shows that we put too much of a burden on SHA-1: we used it for
three separate purposes: to verify data integrity, to allow addressing
objects by their own content, and for signing entire commit histories
cryptographically (more as an afterthought, as I see it: the Linux project
provides the context where you never fetch from any untrusted source,
therefore cryptographically secure signatures are not quite as important
as the trust between maintainer and lieutenants).

We *will* have to separate those concerns, and maybe even switch to
different algorithms for the different concerns. There are much better
algorithms for validating data integrity, for example, including error
correction (which SHA-1 never wanted to do anyway).

In your case, I could imagine that you would simply require verifiable
cryptographic signatures (.asc files) to be committed together with the
documents; it would be much harder to find a collision where those
signatures still match (or a double collision where the forged document's
signature would collide with the non-forged document's signature, in
addition to the two documents colliding).

Another idea would be to use Jonathan Nieder's proposed transition plan
and simply extend it. That transition plan details how the objects would
be hashed with two algorithms locally and how to maintain a bidirectional
mapping between the two. You could simply piggyback on that code and
provide patches that allow for a third, configurable algorithm, and that
algorithm's hashes would simply be added to the commit objects and fsck
would then know to verify those, too. That would be an opt-in feature, of
course, so that only those who need the extra long term security have to
pay the price of a substantially slower hashing.
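
A minimal sketch of that opt-in idea, under stated assumptions: SHA-1
and SHA-256 stand in for the old and new hash, the extra algorithm is
whatever name Python's hashlib accepts, and the function names and the
"recorded" mapping are hypothetical rather than any existing Git
interface.

```python
import hashlib

BASE_ALGOS = ("sha1", "sha256")  # stand-ins for the old and new hash

def object_names(obj_type, body, extra_algos=()):
    """Name one object under every configured algorithm."""
    # Git hashes "<type> <size>\0" followed by the object body.
    payload = f"{obj_type} {len(body)}".encode("ascii") + b"\0" + body
    return {algo: hashlib.new(algo, payload).hexdigest()
            for algo in BASE_ALGOS + tuple(extra_algos)}

def fsck_extra(obj_type, body, recorded):
    """Check recorded extra hashes (algo -> hex) against the content."""
    actual = object_names(obj_type, body, extra_algos=recorded)
    return all(actual[algo] == value for algo, value in recorded.items())

names = object_names("blob", b"", extra_algos=("sha3_512",))
print(names["sha1"])  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(fsck_extra("blob", b"", {"sha3_512": names["sha3_512"]}))
```

Only repositories that list an extra algorithm pay for the additional
hashing; everybody else computes just the two base names.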

What we cannot do is to pick a super slow hash algorithm just to cater to
the use case where legal documents are managed, punishing everybody else
for using Git in the intended way: to manage source code.

Ciao,
Johannes

* Use base32?
  2017-03-08 15:40       ` Johannes Schindelin
@ 2017-03-20  5:21         ` Jason Hennessey
  2017-03-20  5:58           ` Michael Steuer
  0 siblings, 1 reply; 113+ messages in thread
From: Jason Hennessey @ 2017-03-20  5:21 UTC (permalink / raw)
  To: johannes.schindelin
  Cc: bmwill, git, ijackson, jonathantanmy, jrnieder, peff, sbeller,
	torvalds


On Wed, 8 Mar 2017, Johannes Schindelin wrote:
> > Linus Torvalds writes ("Re: RFC: Another proposed hash function transition plan"):
> > > Of course, having written that, I now realize how it would cause
> > > problems for the usual shit-for-brains case-insensitive
> > > filesystems.
> > > So I guess base64 encoding doesn't work well for that reason.
> Given that the idea was to encode the new hash in base64 or base85, we
> *are* talking about an encoding. In that respect, yes, it can be whatever
> encoding we like, and Linus just made a good point (with unnecessary foul
> language) of explaining why base64/base85 is not that encoding.

Since the hash format is switching anyway, how about using base32
instead of hex?

Still get a 20% space savings over hex (minus a little for padding), and
it's guaranteed to be a single case.
Jason
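
The size arithmetic can be checked with a short sketch using Python's
standard base64 module.  SHA-256 and SHA-1 serve only as examples of a
32-byte and a 20-byte digest, and stripping the padding and lowercasing
are illustrative assumptions, not something proposed in this thread.

```python
import base64
import hashlib

digest = hashlib.sha256(b"example").digest()        # 32 bytes
hex_name = digest.hex()
b32_padded = base64.b32encode(digest).decode("ascii")
b32_name = b32_padded.rstrip("=").lower()           # drop padding, single case

print(len(hex_name), len(b32_padded), len(b32_name))  # 64 56 52

old = hashlib.sha1(b"example").digest()             # 20 bytes
print(len(old.hex()), len(base64.b32encode(old)))   # 40 32
```

For a 20-byte digest the saving is exactly 20% (32 vs. 40 characters).
For a 32-byte digest it is 52 vs. 64 characters without padding, about
19%, and 56 vs. 64 if the padding is kept.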


* Re: Use base32?
  2017-03-20  5:21         ` Use base32? Jason Hennessey
@ 2017-03-20  5:58           ` Michael Steuer
  2017-03-20  8:05             ` Jacob Keller
  0 siblings, 1 reply; 113+ messages in thread
From: Michael Steuer @ 2017-03-20  5:58 UTC (permalink / raw)
  To: Jason Hennessey, johannes.schindelin
  Cc: bmwill, git, ijackson, jonathantanmy, jrnieder, peff, sbeller,
	torvalds


On 20/03/2017 16:21, Jason Hennessey wrote:
> On Wed, 8 Mar 2017, Johannes Schindelin wrote:
>>> Linus Torvalds writes ("Re: RFC: Another proposed hash function transition plan"):
>>>> Of course, having written that, I now realize how it would cause
>>>> problems for the usual shit-for-brains case-insensitive
>>>> filesystems.
>>>> So I guess base64 encoding doesn't work well for that reason.
>> Given that the idea was to encode the new hash in base64 or base85, we
>> *are* talking about an encoding. In that respect, yes, it can be whatever
>> encoding we like, and Linus just made a good point (with unnecessary foul
>> language) of explaining why base64/base85 is not that encoding.
> Since the hash format is switching anyway, how about using base32
> instead of hex?
>
> Still get a 20% space savings over hex (minus a little for padding), and
> it's guaranteed to be a single case.
> Jason
>

If base32 is being considered, I'd suggest the "base32hex" variant, 
which uses the same amount of space.

* Re: Use base32?
  2017-03-20  5:58           ` Michael Steuer
@ 2017-03-20  8:05             ` Jacob Keller
  2017-03-21  3:07               ` Michael Steuer
  0 siblings, 1 reply; 113+ messages in thread
From: Jacob Keller @ 2017-03-20  8:05 UTC (permalink / raw)
  To: Michael Steuer
  Cc: Jason Hennessey, Johannes Schindelin, Brandon Williams,
	Git mailing list, Ian Jackson, Jonathan Tan, Jonathan Nieder,
	Jeff King, Stefan Beller, Linus Torvalds

On Sun, Mar 19, 2017 at 10:58 PM, Michael Steuer
<Michael.Steuer@constrainttec.com> wrote:
>
> On 20/03/2017 16:21, Jason Hennessey wrote:
>>
>> On Wed, 8 Mar 2017, Johannes Schindelin wrote:
>>>>
>>>> Linus Torvalds writes ("Re: RFC: Another proposed hash function
>>>> transition plan"):
>>>>> Of course, having written that, I now realize how it would cause
>>>>> problems for the usual shit-for-brains case-insensitive
>>>>> filesystems.
>>>>> So I guess base64 encoding doesn't work well for that reason.
>>> Given that the idea was to encode the new hash in base64 or base85, we
>>> *are* talking about an encoding. In that respect, yes, it can be whatever
>>> encoding we like, and Linus just made a good point (with unnecessary foul
>>> language) of explaining why base64/base85 is not that encoding.
>>
>> Since the hash format is switching anyway, how about using base32
>> instead of hex?
>>
>> Still get a 20% space savings over hex (minus a little for padding), and
>> it's guaranteed to be a single case.
>> Jason
>>
>
> If base32 is being considered, I'd suggest the "base32hex" variant, which
> uses the same amount of space.

I don't see the benefit of adding characters like 0 and 1 which
conflict with some of the letters? Since there's no need for a human
to decode the base32 output, it's easier to use the one that's less
likely to get screwed up when typing if that ever happens. It's not
like we actually need to know what value each character represents.
(sure the program does, but the human does not).

Thanks,
Jake
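
For reference, the two alphabets defined in RFC 4648 can be compared
directly.  The list of confusable pairs below is only an illustrative
assumption about which characters people tend to mix up, and the helper
name is hypothetical.

```python
# RFC 4648 section 6: standard base32 (letters first, no 0 or 1)
BASE32 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
# RFC 4648 section 7: "base32hex" (digits first, then A-V)
BASE32HEX = "0123456789ABCDEFGHIJKLMNOPQRSTUV"

# Illustrative assumption: pairs that are easy to confuse when typing.
CONFUSABLE = [("0", "O"), ("1", "I")]

def confusable_pairs(alphabet):
    """Return the confusable pairs that are both present in an alphabet."""
    return [pair for pair in CONFUSABLE
            if pair[0] in alphabet and pair[1] in alphabet]

print(confusable_pairs(BASE32))     # []
print(confusable_pairs(BASE32HEX))  # [('0', 'O'), ('1', 'I')]
```

The standard alphabet avoids both pairs because it contains neither 0
nor 1, while base32hex contains 0, 1, I and O together.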

* Re: Use base32?
  2017-03-20  8:05             ` Jacob Keller
@ 2017-03-21  3:07               ` Michael Steuer
  0 siblings, 0 replies; 113+ messages in thread
From: Michael Steuer @ 2017-03-21  3:07 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Jason Hennessey, Johannes Schindelin, Brandon Williams,
	Git mailing list, Ian Jackson, Jonathan Tan, Jonathan Nieder,
	Jeff King, Stefan Beller, Linus Torvalds


On 20/03/2017 19:05, Jacob Keller wrote:
> On Sun, Mar 19, 2017 at 10:58 PM, Michael Steuer
> <Michael.Steuer@constrainttec.com> wrote:
>> [..]
>> If base32 is being considered, I'd suggest the "base32hex" variant, which
>> uses the same amount of space.
> I don't see the benefit of adding characters like 0 and 1 which
> conflict with some of the letters? Since there's no need for a human
> to decode the base32 output, it's easier to use the one that's less
> likely to get screwed up when typing if that ever happens. It's not
> like we actually need to know what value each character represents.
> (sure the program does, but the human does not).
>
> Thanks,
> Jake

Fair enough and good point. We definitely wouldn't want 0 and O and 1 
and I mixed together.

Cheers,
Mike.

* Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-03-06 18:24     ` Brandon Williams
@ 2017-06-15 10:30       ` Johannes Schindelin
  2017-06-15 11:05         ` Mike Hommey
                           ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: Johannes Schindelin @ 2017-06-15 10:30 UTC (permalink / raw)
  To: Brandon Williams
  Cc: brian m. carlson, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	Junio Hamano

Hi,

I thought it better to revive this old thread rather than start a new
thread, so as to automatically reach everybody who chimed in originally.

On Mon, 6 Mar 2017, Brandon Williams wrote:

> On 03/06, brian m. carlson wrote:
>
> > On Sat, Mar 04, 2017 at 06:35:38PM -0800, Linus Torvalds wrote:
> >
> > > Btw, I do think the particular choice of hash should still be on the
> > > table. sha-256 may be the obvious first choice, but there are
> > > definitely a few reasons to consider alternatives, especially if
> > > it's a complete switch-over like this.
> > > 
> > > One is large-file behavior - a parallel (or tree) mode could improve
> > > on that noticeably. BLAKE2 does have special support for that, for
> > > example. And SHA-256 does have known attacks compared to SHA-3-256
> > > or BLAKE2 - whether that is due to age or due to more effort, I
> > > can't really judge. But if we're switching away from SHA1 due to
> > > known attacks, it does feel like we should be careful.
> > 
> > I agree with Linus on this.  SHA-256 is the slowest option, and it's
> > the one with the most advanced cryptanalysis.  SHA-3-256 is faster on
> > 64-bit machines (which, as we've seen on the list, is the overwhelming
> > majority of machines using Git), and even BLAKE2b-256 is stronger.
> > 
> > Doing this all over again in another couple years should also be a
> > non-goal.
> 
> I agree that when we decide to move to a new algorithm that we should
> select one which we plan on using for as long as possible (much longer
> than a couple years).  While writing the document we simply used
> "sha256" because it was more tangible and easier to reference.

The SHA-1 transition *requires* a knob telling Git that the current
repository uses a hash function different from SHA-1.

It would make *a whole lot of sense* to make that knob *not* Boolean,
but to specify *which* hash function is in use.

That way, it will be easier to switch another time when it becomes
necessary.

And it will also make it easier for interested parties to use a different
hash function in their infrastructure if they want.
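
A sketch of what a non-Boolean knob could look like, assuming a
hypothetical configuration key ("hashalgorithm") and a hypothetical
registry of supported names; neither is an existing Git setting.

```python
import hashlib

# Hypothetical registry: adding another hash later means adding one
# entry here instead of introducing yet another on/off switch.
HASH_ALGOS = {
    "sha1": hashlib.sha1,
    "sha256": hashlib.sha256,
    "sha3-256": hashlib.sha3_256,
}

def repo_hasher(config: dict):
    """Return the hash constructor named by the repository config."""
    name = config.get("hashalgorithm", "sha1")  # hypothetical key
    try:
        return HASH_ALGOS[name]
    except KeyError:
        raise ValueError(f"unsupported hash function: {name}") from None

legacy = repo_hasher({})                           # old repos default to SHA-1
modern = repo_hasher({"hashalgorithm": "sha256"})
print(legacy(b"x").digest_size, modern(b"x").digest_size)  # 20 32
```

An unknown name fails loudly, so an older Git that meets a repository
using a hash it does not know can refuse to touch it.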

And it lifts part of the burden of having to consider *very carefully*
which function to pick. We still should be more careful than in 2005, when
Git was born and when, incidentally, the first attacks on SHA-1
became known, of course. We were just lucky for almost 12 years.

Now, with Dunning-Kruger in mind, I feel that my degree in mathematics
equips me with *just enough* competence to know just how little *even I*
know about cryptography.

The smart thing to do, hence, was to get involved in this discussion and
act as Lt Tawney Madison between us Git developers and experts in
cryptography.

It just so happens that I work at a company with access to excellent
cryptographers, and as we own the largest Git repository on the planet, we
have a vested interest in ensuring Git's continued success.

After a couple of conversations with a couple of experts who I cannot
thank enough for their time and patience, let alone their knowledge about
this matter, it would appear that we may not have had a complete enough
picture yet to even start to make the decision on the hash function to
use.

From what I read, pretty much everybody who participated in the discussion
was aware that the essential question is: performance vs security.

It turns out that we can have essentially both.

SHA-256 is most likely the best-studied hash function we currently know
about (*maybe* SHA3-256 has been studied slightly more, but only
slightly). All the experts in the field banged on it with multiple sticks
and other weapons. And so far, they only found one weakness that does not
even apply to Git's usage [*1*]. For cryptography experts, this is the
ultimate measure of security: if something has been attacked that
intensely, by that many experts, for that long, with that little effect,
it is the best we got at the time.

And since SHA-256 has become the standard, and more importantly: since
SHA-256 was explicitly designed to allow for relatively inexpensive
hardware acceleration, this is what we will soon have: hardware support in
the form of, say, special CPU instructions. (That is what I meant by: we
can have performance *and* security.)

This is a rather important point to stress, by the way: BLAKE's design is
apparently *not* friendly to CPU instruction implementations. Meaning that
SHA-256 will be faster than BLAKE (and even than BLAKE2) once the Intel
and AMD CPUs with hardware support for SHA-256 become common.

I also heard something really worrisome about BLAKE2 that makes me want to
stay away from it (in addition to the difficulty it poses for hardware
acceleration): to compete in the SHA-3 contest, BLAKE added complexity so
that it would be roughly on par with its competitors. To allow for faster
execution in software, this complexity was *removed* from BLAKE to create
BLAKE2, making it weaker than SHA-256.

Another important point to consider is that SHA-256 implementations are
everywhere. Think e.g. how difficult we would make it on, say, JGit or
go-git if we chose a less common hash function.

As to KangarooTwelve: it has seen substantially less cryptanalysis than
SHA-256 and SHA3-256. That does not necessarily mean that it is weaker,
but it means that we simply cannot know whether it is as strong. On that
basis alone, I would already reject it, and then there are far fewer
implementations, too.

When it comes to choosing SHA-256 vs SHA3-256, I would like to point out
that hardware acceleration for SHA3-256 is a lot farther in the future
than hardware support for SHA-256. And according to the experts I asked,
they are roughly equally secure as far as Git's usage is concerned, even
if the SHA-3 contest provided SHA3-256 with even fiercer cryptanalysis
than SHA-256.

In short: my takeaway from the conversations with cryptography experts was
that SHA-256 would be the best choice for now, and that we should make
sure that the next switch is not as painful as this one (read: we should
not repeat the mistake of hard-coding the new hash function into Git as
much as we hard-coded SHA-1 into it).

Ciao,
Johannes

Footnote *1*: SHA-256, like all hash functions whose output is essentially
the entire internal state, is susceptible to a so-called "length
extension attack", where the hash of a secret+message can be used to
generate the hash of secret+message+piggyback without knowing the secret.
This is not the case for Git: only visible data are hashed. The type of
attacks Git has to worry about is very different from the length extension
attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
say, a collision attack.

* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 10:30       ` Which hash function to use, was " Johannes Schindelin
@ 2017-06-15 11:05         ` Mike Hommey
  2017-06-15 13:01           ` Jeff King
  2017-06-15 17:36         ` Brandon Williams
  2017-06-15 19:13         ` Jonathan Nieder
  2 siblings, 1 reply; 113+ messages in thread
From: Mike Hommey @ 2017-06-15 11:05 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Brandon Williams, brian m. carlson, Linus Torvalds,
	Jonathan Nieder, Git Mailing List, Stefan Beller, jonathantanmy,
	Jeff King, Junio Hamano

On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
> Footnote *1*: SHA-256, as all hash functions whose output is essentially
> the entire internal state, are susceptible to a so-called "length
> extension attack", where the hash of a secret+message can be used to
> generate the hash of secret+message+piggyback without knowing the secret.
> This is not the case for Git: only visible data are hashed. The type of
> attacks Git has to worry about is very different from the length extension
> attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
> say, a collision attack.

What do the experts think of SHA512/256, which completely removes the
concerns over length extension attack? (which I'd argue is better than
sweeping them under the carpet)

Mike

* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 11:05         ` Mike Hommey
@ 2017-06-15 13:01           ` Jeff King
  2017-06-15 16:30             ` Ævar Arnfjörð Bjarmason
  2017-06-15 21:10             ` Mike Hommey
  0 siblings, 2 replies; 113+ messages in thread
From: Jeff King @ 2017-06-15 13:01 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Johannes Schindelin, Brandon Williams, brian m. carlson,
	Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller,
	jonathantanmy, Junio Hamano

On Thu, Jun 15, 2017 at 08:05:18PM +0900, Mike Hommey wrote:

> On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
> > Footnote *1*: SHA-256, as all hash functions whose output is essentially
> > the entire internal state, are susceptible to a so-called "length
> > extension attack", where the hash of a secret+message can be used to
> > generate the hash of secret+message+piggyback without knowing the secret.
> > This is not the case for Git: only visible data are hashed. The type of
> > attacks Git has to worry about is very different from the length extension
> > attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
> > say, a collision attack.
> 
> What do the experts think or SHA512/256, which completely removes the
> concerns over length extension attack? (which I'd argue is better than
> sweeping them under the carpet)

I don't think it's sweeping them under the carpet. Git does not use the
hash as a MAC, so length extension attacks aren't a thing (and even if
we later wanted to use the same algorithm as a MAC, the HMAC
construction is a well-studied technique for dealing with it).
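
To illustrate the distinction, here is a small sketch using Python's
standard hashlib and hmac modules; the function names are hypothetical.
Length extension matters only for the first function, where a secret is
simply prepended to the message; plain content addressing, as in the
last function, involves no secret at all.

```python
import hashlib
import hmac

def naive_mac(secret: bytes, message: bytes) -> bytes:
    """Secret-prefix MAC: the pattern length extension can attack."""
    return hashlib.sha256(secret + message).digest()

def hmac_mac(secret: bytes, message: bytes) -> bytes:
    """HMAC-SHA-256: the standard construction that avoids the problem."""
    return hmac.new(secret, message, hashlib.sha256).digest()

def object_id(data: bytes) -> str:
    """Content addressing: only visible data is hashed, no secret."""
    return hashlib.sha256(data).hexdigest()

key, msg = b"secret key", b"message"
tag = hmac_mac(key, msg)
# Compare tags in constant time when verifying.
print(hmac.compare_digest(tag, hmac_mac(key, msg)))
print(tag != naive_mac(key, msg))
```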

That said, SHA-512 is typically a little faster than SHA-256 on 64-bit
platforms. I don't know if that will change with the advent of hardware
instructions oriented towards SHA-256.

-Peff

* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 13:01           ` Jeff King
@ 2017-06-15 16:30             ` Ævar Arnfjörð Bjarmason
  2017-06-15 19:34               ` Johannes Schindelin
  2017-06-15 21:10             ` Mike Hommey
  1 sibling, 1 reply; 113+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-06-15 16:30 UTC (permalink / raw)
  To: Jeff King
  Cc: Mike Hommey, Johannes Schindelin, Brandon Williams,
	brian m. carlson, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, jonathantanmy, Junio Hamano


On Thu, Jun 15 2017, Jeff King jotted:

> On Thu, Jun 15, 2017 at 08:05:18PM +0900, Mike Hommey wrote:
>
>> On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
>> > Footnote *1*: SHA-256, as all hash functions whose output is essentially
>> > the entire internal state, are susceptible to a so-called "length
>> > extension attack", where the hash of a secret+message can be used to
>> > generate the hash of secret+message+piggyback without knowing the secret.
>> > This is not the case for Git: only visible data are hashed. The type of
>> > attacks Git has to worry about is very different from the length extension
>> > attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
>> > say, a collision attack.
>>
>> What do the experts think or SHA512/256, which completely removes the
>> concerns over length extension attack? (which I'd argue is better than
>> sweeping them under the carpet)
>
> I don't think it's sweeping them under the carpet. Git does not use the
> hash as a MAC, so length extension attacks aren't a thing (and even if
> we later wanted to use the same algorithm as a MAC, the HMAC
> construction is a well-studied technique for dealing with it).
>
> That said, SHA-512 is typically a little faster than SHA-256 on 64-bit
> platforms. I don't know if that will change with the advent of hardware
> instructions oriented towards SHA-256.

Quoting my own
CACBZZX7JRA2niwt9wsGAxnzS+gWS8hTUgzWm8NaY1gs87o8xVQ@mail.gmail.com sent
~2 weeks ago to the list:

    On Fri, Jun 2, 2017 at 7:54 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
    [...]
    > 4. When choosing a hash function, people may argue about performance.
    >    It would be useful to run some benchmarks for git (running
    >    the test suite, t/perf tests, etc) using a variety of hash
    >    functions as input to such a discussion.

    To the extent that such benchmarks matter, it seems prudent to heavily
    weigh them in favor of whatever seems to be likely to be the more
    common hash function going forward, since those are likely to get
    faster through future hardware acceleration.

    E.g. Intel announced Goldmont last year which according to one SHA-1
    implementation improved from 9.5 cycles per byte to 2.7 cpb[1]. They
    only have acceleration for SHA-1 and SHA-256[2]

    1. https://github.com/weidai11/cryptopp/issues/139#issuecomment-264283385

    2. https://en.wikipedia.org/wiki/Goldmont

Maybe someone else knows of better numbers / benchmarks, but such a
reduction in cpb likely makes it faster than SHA-512.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 10:30       ` Which hash function to use, was " Johannes Schindelin
  2017-06-15 11:05         ` Mike Hommey
@ 2017-06-15 17:36         ` Brandon Williams
  2017-06-15 19:20           ` Junio C Hamano
  2017-06-15 19:13         ` Jonathan Nieder
  2 siblings, 1 reply; 113+ messages in thread
From: Brandon Williams @ 2017-06-15 17:36 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: brian m. carlson, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	Junio Hamano

On 06/15, Johannes Schindelin wrote:
> Hi,
> 
> I thought it better to revive this old thread rather than start a new
> thread, so as to automatically reach everybody who chimed in originally.
> 
> On Mon, 6 Mar 2017, Brandon Williams wrote:
> 
> > On 03/06, brian m. carlson wrote:
> >
> > > On Sat, Mar 04, 2017 at 06:35:38PM -0800, Linus Torvalds wrote:
> > >
> > > > Btw, I do think the particular choice of hash should still be on the
> > > > table. sha-256 may be the obvious first choice, but there are
> > > > definitely a few reasons to consider alternatives, especially if
> > > > it's a complete switch-over like this.
> > > > 
> > > > One is large-file behavior - a parallel (or tree) mode could improve
> > > > on that noticeably. BLAKE2 does have special support for that, for
> > > > example. And SHA-256 does have known attacks compared to SHA-3-256
> > > > or BLAKE2 - whether that is due to age or due to more effort, I
> > > > can't really judge. But if we're switching away from SHA1 due to
> > > > known attacks, it does feel like we should be careful.
> > > 
> > > I agree with Linus on this.  SHA-256 is the slowest option, and it's
> > > the one with the most advanced cryptanalysis.  SHA-3-256 is faster on
> > > 64-bit machines (which, as we've seen on the list, is the overwhelming
> > > majority of machines using Git), and even BLAKE2b-256 is stronger.
> > > 
> > > Doing this all over again in another couple years should also be a
> > > non-goal.
> > 
> > I agree that when we decide to move to a new algorithm that we should
> > select one which we plan on using for as long as possible (much longer
> > than a couple years).  While writing the document we simply used
> > "sha256" because it was more tangible and easier to reference.
> 
> The SHA-1 transition *requires* a knob telling Git that the current
> repository uses a hash function different from SHA-1.
> 
> It would make *a whole lot of sense* to make that knob *not* Boolean,
> but to specify *which* hash function is in use.

100% agree on this point.  I believe the current plan is to have the
hashing function used for a repository be a repository format extension
which would be a value (most likely a string like 'sha1', 'sha256',
'blake2', etc) stored in a repository's .git/config.  This way, upon
startup git will die or ignore a repository which uses a hashing
function which it does not recognize or was not compiled to handle.

I hope (and expect) that the end product of this transition is a nice,
clean hashing API and interface with sufficient abstractions such that
if I wanted to switch to a different hashing function I would just need
to implement the interface with the new hashing function and ensure that
'verify_repository_format' allows the new function.
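
To make the shape of that concrete, here is a minimal sketch in Python
rather than Git's C. It is not taken from any patch: the HASH_ALGOS
table, get_hash_algo() and hash_hex() are hypothetical names, and the
strings are only the example values mentioned above.

```python
import hashlib

# Hypothetical registry: maps the string stored in the repository
# config to a constructor for the matching hash object.
HASH_ALGOS = {
    "sha1": hashlib.sha1,
    "sha256": hashlib.sha256,
    "blake2": lambda: hashlib.blake2b(digest_size=32),
}


def get_hash_algo(name):
    """Return a fresh hash object, or refuse an unrecognized algorithm."""
    try:
        return HASH_ALGOS[name]()
    except KeyError:
        raise ValueError("unknown hash function: %r" % name) from None


def hash_hex(name, data):
    """Hash data with the algorithm the repository is configured for."""
    h = get_hash_algo(name)
    h.update(data)
    return h.hexdigest()
```

With a table like this, supporting another function would mean adding
one entry, and a repository naming an algorithm that is missing from
the table is rejected up front instead of being misread.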

> 
> That way, it will be easier to switch another time when it becomes
> necessary.
> 
> And it will also make it easier for interested parties to use a different
> hash function in their infrastructure if they want.
> 
> And it lifts part of that burden that we have to consider *very carefully*
> which function to pick. We still should be more careful than in 2005, when
> Git was born, and when, incidentally, the first attacks on SHA-1
> became known, of course. We were just lucky for almost 12 years.
> 
> Now, with Dunning-Kruger in mind, I feel that my degree in mathematics
> equips me with *just enough* competence to know just how little *even I*
> know about cryptography.
> 
> The smart thing to do, hence, was to get involved in this discussion and
> act as Lt Tawney Madison between us Git developers and experts in
> cryptography.
> 
> It just so happens that I work at a company with access to excellent
> cryptographers, and as we own the largest Git repository on the planet, we
> have a vested interest in ensuring Git's continued success.
> 
> After a couple of conversations with a couple of experts who I cannot
> thank enough for their time and patience, let alone their knowledge about
> this matter, it would appear that we may not have had a complete enough
> picture yet to even start to make the decision on the hash function to
> use.
> 

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 10:30       ` Which hash function to use, was " Johannes Schindelin
  2017-06-15 11:05         ` Mike Hommey
  2017-06-15 17:36         ` Brandon Williams
@ 2017-06-15 19:13         ` Jonathan Nieder
  2 siblings, 0 replies; 113+ messages in thread
From: Jonathan Nieder @ 2017-06-15 19:13 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Brandon Williams, brian m. carlson, Linus Torvalds,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	Junio Hamano

Hi Dscho,

Johannes Schindelin wrote:

> From what I read, pretty much everybody who participated in the discussion
> was aware that the essential question is: performance vs security.

I don't completely agree with this framing.  The essential question is:
how to get the right security properties without abysmal performance.

> It turns out that we can have essentially both.
>
> SHA-256 is most likely the best-studied hash function we currently know
[... etc ...]

Thanks for a thoughtful restart to the discussion.  This is much more
concrete than your previous objections about process, and that is very
helpful.

In the interest of transparency: here are my current questions for
cryptographers to whom I have forwarded this thread.  Several of these
questions involve predictions or opinions, so in my ideal world we'd
want multiple, well reasoned answers to them.  Please feel free to
forward them to appropriate people or add more.

 1. Now it sounds like SHA-512/256 is the safest choice (see also Mike
    Hommey's response to Dscho's message).  Please poke holes in my
    understanding.

 2. Would you be willing to weigh in publicly on the mailing list? I
    think that would be the most straightforward way to move this
    forward (and it would give you a chance to ask relevant questions,
    etc).  Feel free to contact me privately if you have any questions
    about how this particular mailing list works.

 3. On the speed side, Dscho states "SHA-256 will be faster than BLAKE
    (and even than BLAKE2) once the Intel and AMD CPUs with hardware
    support for SHA-256 become common."  Do you agree?

 4. On the security side, Dscho states "to compete in the SHA-3
    contest, BLAKE added complexity so that it would be roughly on par
    with its competitors.  To allow for faster execution in software,
    this complexity was *removed* from BLAKE to create BLAKE2, making
    it weaker than SHA-256."  Putting aside the historical questions,
    do you agree with this "weaker than" claim?

 5. On the security side, Dscho states, "The type of attacks Git has to
    worry about is very different from the length extension attacks,
    and it is highly unlikely that that weakness of SHA-256 leads to,
    say, a collision attack", and Jeff King states, "Git does not use
    the hash as a MAC, so length extension attacks aren't a thing (and
    even if we later wanted to use the same algorithm as a MAC, the
    HMAC construction is a well-studied technique for dealing with
    it)."  Is this correct in spirit?  Is SHA-256 equally strong to
    SHA-512/256 for Git's purposes, or are the increased bits of
    internal state (or other differences) relevant?  How would you
    compare the two functions' properties?

 6. On the speed side, Jeff King states "That said, SHA-512 is
    typically a little faster than SHA-256 on 64-bit platforms. I
    don't know if that will change with the advent of hardware
    instructions oriented towards SHA-256."  Thoughts?

 7. If the answer to (2) is "no", do I have permission to quote or
    paraphrase your replies that were given here?

Thanks, sincerely,
Jonathan

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 17:36         ` Brandon Williams
@ 2017-06-15 19:20           ` Junio C Hamano
  0 siblings, 0 replies; 113+ messages in thread
From: Junio C Hamano @ 2017-06-15 19:20 UTC (permalink / raw)
  To: Brandon Williams
  Cc: Johannes Schindelin, brian m. carlson, Linus Torvalds,
	Jonathan Nieder, Git Mailing List, Stefan Beller, jonathantanmy,
	Jeff King

Brandon Williams <bmwill@google.com> writes:

>> It would make a whole lot of sense to make that knob not Boolean,
>> but to specify which hash function is in use.
>
> 100% agree on this point.  I believe the current plan is to have the
> hashing function used for a repository be a repository format extension
> which would be a value (most likely a string like 'sha1', 'sha256',
> 'blake2', etc) stored in a repository's .git/config.  This way, upon
> startup git will die or ignore a repository which uses a hashing
> function which it does not recognize or was not compiled to handle.
>
> I hope (and expect) that the end product of this transition is a nice,
> clean hashing API and interface with sufficient abstractions such that
> if I wanted to switch to a different hashing function I would just need
> to implement the interface with the new hashing function and ensure that
> 'verify_repository_format' allows the new function.

Yup.  I thought that part has already been agreed upon, but it is a
good thing that somebody is writing it down (perhaps "again", if not
"for the first time").

Thanks.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 16:30             ` Ævar Arnfjörð Bjarmason
@ 2017-06-15 19:34               ` Johannes Schindelin
  2017-06-15 21:59                 ` Adam Langley
  0 siblings, 1 reply; 113+ messages in thread
From: Johannes Schindelin @ 2017-06-15 19:34 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, Mike Hommey, Brandon Williams, brian m. carlson,
	Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller,
	jonathantanmy, Junio Hamano

[-- Attachment #1: Type: text/plain, Size: 4489 bytes --]

Hi,

On Thu, 15 Jun 2017, Ævar Arnfjörð Bjarmason wrote:

> On Thu, Jun 15 2017, Jeff King jotted:
> 
> > On Thu, Jun 15, 2017 at 08:05:18PM +0900, Mike Hommey wrote:
> >
> >> On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
> >>
> >> > Footnote *1*: SHA-256, as all hash functions whose output is
> >> > essentially the entire internal state, are susceptible to a
> >> > so-called "length extension attack", where the hash of a
> >> > secret+message can be used to generate the hash of
> >> > secret+message+piggyback without knowing the secret.  This is not
> >> > the case for Git: only visible data are hashed. The type of attacks
> >> > Git has to worry about is very different from the length extension
> >> > attacks, and it is highly unlikely that that weakness of SHA-256
> >> > leads to, say, a collision attack.
> >>
> >> What do the experts think of SHA512/256, which completely removes the
> >> concerns over length extension attack? (which I'd argue is better than
> >> sweeping them under the carpet)
> >
> > I don't think it's sweeping them under the carpet. Git does not use the
> > hash as a MAC, so length extension attacks aren't a thing (and even if
> > we later wanted to use the same algorithm as a MAC, the HMAC
> > construction is a well-studied technique for dealing with it).

I really tried to drive that point home, as it had been made very clear to
me that the length extension attack is something that Git need not concern
itself.

The length extension attack *only* comes into play when there are secrets
that are hashed. In that case, one would not want others to be able to
produce a valid hash *without* knowing the secrets. And SHA-256 allows one to
"reconstruct" the internal state (which is the hash value) in order to
continue at any point, i.e. if the hash for secret+message is known, it is
easy to calculate the hash for secret+message+addition, without knowing
the secret at all.
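
For readers who have not met the MAC setting, here is a small Python
sketch of the two constructions being contrasted. It does not carry
out a length extension attack; it only shows the naive secret-prefix
tag next to the HMAC construction mentioned above. The key and message
values are made up for illustration.

```python
import hashlib
import hmac

secret = b"server-side key"      # illustrative value only
message = b"amount=10"

# Naive MAC: hash(secret + message).  With SHA-256 the digest is the
# full internal state, which is what length extension exploits.
naive_tag = hashlib.sha256(secret + message).hexdigest()

# HMAC wraps the hash in two keyed passes, so knowing the tag does not
# let an attacker resume hashing where the sender stopped.
hmac_tag = hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(tag, key, msg):
    """Constant-time check of an HMAC-SHA-256 tag."""
    expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)
```

Both tags depend on a key that the verifier keeps private, which is
exactly the ingredient Git's object naming does not have.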

That is exactly *not* the case with Git. In Git, what we want to hash is
known in its entirety. If the hash value were not identical to the
internal state, it would be easy enough to reconstruct, because *there are
no secrets*.
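
As an illustration of "known in its entirety", this Python sketch
computes the name Git currently gives a blob: the SHA-1 of a short
header followed by the file content. Everything fed to the hash is
visible to anyone holding the object. The function name git_blob_id
is mine, not Git's.

```python
import hashlib


def git_blob_id(content):
    """Object name Git (with SHA-1) gives a blob holding content."""
    # The header is "blob", a space, the decimal length, and a NUL byte.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()
```

The result for a given file should match what `git hash-object`
prints for it in a SHA-1 repository.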

So please understand that even the direction that the length extension
attack takes is completely different than the direction any attack would
have to take that weakens SHA-256 for Git's purposes. As far as Git's
usage is concerned, SHA-256 has no known weaknesses.

It is *really, really, really* important to understand this before going
on to suggest another hash function such as SHA-512/256 (i.e. SHA-512
with different initial values, truncated to 256 bits), based only on that
perceived weakness of SHA-256.

> > That said, SHA-512 is typically a little faster than SHA-256 on 64-bit
> > platforms. I don't know if that will change with the advent of
> > hardware instructions oriented towards SHA-256.
> 
> Quoting my own
> CACBZZX7JRA2niwt9wsGAxnzS+gWS8hTUgzWm8NaY1gs87o8xVQ@mail.gmail.com sent
> ~2 weeks ago to the list:
> 
>     On Fri, Jun 2, 2017 at 7:54 PM, Jonathan Nieder <jrnieder@gmail.com>
>     wrote:
>     [...]
>     > 4. When choosing a hash function, people may argue about performance.
>     >    It would be useful to run some benchmarks for git (running
>     >    the test suite, t/perf tests, etc) using a variety of hash
>     >    functions as input to such a discussion.
> 
>     To the extent that such benchmarks matter, it seems prudent to heavily
>     weigh them in favor of whatever seems to be likely to be the more
>     common hash function going forward, since those are likely to get
>     faster through future hardware acceleration.
> 
>     E.g. Intel announced Goldmont last year which according to one SHA-1
>     implementation improved from 9.5 cycles per byte to 2.7 cpb[1]. They
>     only have acceleration for SHA-1 and SHA-256[2]
> 
>     1. https://github.com/weidai11/cryptopp/issues/139#issuecomment-264283385
> 
>     2. https://en.wikipedia.org/wiki/Goldmont
> 
> Maybe someone else knows of better numbers / benchmarks, but such a
> reduction in cpb likely makes it faster than SHA-512.

Very, very likely faster than SHA-512.

I'd like to stress explicitly that the Intel SHA extensions do *not* cover
SHA-512:

	https://en.wikipedia.org/wiki/Intel_SHA_extensions

In other words, once those extensions become commonplace, SHA-256 will be
faster than SHA-512, hands down.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 13:01           ` Jeff King
  2017-06-15 16:30             ` Ævar Arnfjörð Bjarmason
@ 2017-06-15 21:10             ` Mike Hommey
  2017-06-16  4:30               ` Jeff King
  1 sibling, 1 reply; 113+ messages in thread
From: Mike Hommey @ 2017-06-15 21:10 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Schindelin, Brandon Williams, brian m. carlson,
	Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller,
	jonathantanmy, Junio Hamano

On Thu, Jun 15, 2017 at 09:01:45AM -0400, Jeff King wrote:
> On Thu, Jun 15, 2017 at 08:05:18PM +0900, Mike Hommey wrote:
> 
> > On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
> > > Footnote *1*: SHA-256, as all hash functions whose output is essentially
> > > the entire internal state, are susceptible to a so-called "length
> > > extension attack", where the hash of a secret+message can be used to
> > > generate the hash of secret+message+piggyback without knowing the secret.
> > > This is not the case for Git: only visible data are hashed. The type of
> > > attacks Git has to worry about is very different from the length extension
> > > attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
> > > say, a collision attack.
> > 
> > What do the experts think of SHA512/256, which completely removes the
> > concerns over length extension attack? (which I'd argue is better than
> > sweeping them under the carpet)
> 
> I don't think it's sweeping them under the carpet. Git does not use the
> hash as a MAC, so length extension attacks aren't a thing (and even if
> we later wanted to use the same algorithm as a MAC, the HMAC
> construction is a well-studied technique for dealing with it).

AIUI, length extension does make brute force collision attacks (which,
really Shattered was) cheaper by allowing one to create the collision
with a small message and extend it later.

This might not be a credible threat against git, but if we go by that
standard, post-Shattered SHA-1 is still fine for git. As a matter of
fact, MD5 would also be fine: there is still, to this day, no preimage
attack against them.

Mike

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 19:34               ` Johannes Schindelin
@ 2017-06-15 21:59                 ` Adam Langley
  2017-06-15 22:41                   ` brian m. carlson
  0 siblings, 1 reply; 113+ messages in thread
From: Adam Langley @ 2017-06-15 21:59 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, Jeff King, Mike Hommey,
	Brandon Williams, brian m. carlson, Linus Torvalds,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Junio Hamano

(I was asked to comment on a few points in public by Jonathan.)

I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
K12, etc are all secure to the extent that I don't believe that making
comparisons between them on that axis is meaningful. Thus I think the
question is primarily concerned with performance and implementation
availability.

I think any of the above would be reasonable choices. I don't believe
that length-extension is a concern here.

SHA-512/256 will be faster than SHA-256 on 64-bit systems in software.
The graph at https://blake2.net/ suggests a 50% speedup on Skylake. On
my Ivy Bridge system, it's about 20%.

(SHA-512/256 does not enjoy the same availability in common libraries however.)

Both Intel and ARM have SHA-256 instructions defined. I've not seen
good benchmarks of them yet, but they will make SHA-256 faster than
SHA-512 when available. However, it's very possible that something
like BLAKE2bp will still be faster. Of course, BLAKE2bp does not enjoy
the ubiquity of SHA-256, but nor do you have to wait years for the CPU
population to advance for high performance.

So, overall, none of these choices should obviously be excluded. The
considerations at this point are not cryptographic and the tradeoff
between implementation ease and performance is one that the git
community would have to make.

Cheers

AGL

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 21:59                 ` Adam Langley
@ 2017-06-15 22:41                   ` brian m. carlson
  2017-06-15 23:36                     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 113+ messages in thread
From: brian m. carlson @ 2017-06-15 22:41 UTC (permalink / raw)
  To: Adam Langley
  Cc: Johannes Schindelin, Ævar Arnfjörð Bjarmason,
	Jeff King, Mike Hommey, Brandon Williams, Linus Torvalds,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Junio Hamano

[-- Attachment #1: Type: text/plain, Size: 2755 bytes --]

On Thu, Jun 15, 2017 at 02:59:57PM -0700, Adam Langley wrote:
> (I was asked to comment on a few points in public by Jonathan.)
> 
> I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
> K12, etc are all secure to the extent that I don't believe that making
> comparisons between them on that axis is meaningful. Thus I think the
> question is primarily concerned with performance and implementation
> availability.
> 
> I think any of the above would be reasonable choices. I don't believe
> that length-extension is a concern here.
> 
> SHA-512/256 will be faster than SHA-256 on 64-bit systems in software.
> The graph at https://blake2.net/ suggests a 50% speedup on Skylake. On
> my Ivy Bridge system, it's about 20%.
> 
> (SHA-512/256 does not enjoy the same availability in common libraries however.)
> 
> Both Intel and ARM have SHA-256 instructions defined. I've not seen
> good benchmarks of them yet, but they will make SHA-256 faster than
> SHA-512 when available. However, it's very possible that something
> like BLAKE2bp will still be faster. Of course, BLAKE2bp does not enjoy
> the ubiquity of SHA-256, but nor do you have to wait years for the CPU
> population to advance for high performance.

SHA-256 acceleration exists for some existing Intel platforms already.
However, it's not practically present on anything but servers at the
moment, and so I don't think the acceleration of SHA-256 is
something we should consider.

The SUPERCOP benchmarks tell me that generally, on 64-bit systems where
acceleration is not available, SHA-256 is the slowest, followed by
SHA3-256.  BLAKE2b is the fastest.

If our goal is performance, then I would argue BLAKE2b-256 is the best
choice.  It is secure and extremely fast.  It does have the benefit that
we get to tell people that by moving away from SHA-1, they will get a
performance boost, pretty much no matter what the system.

BLAKE2bp may be faster, but it introduces additional implementation
complexity.  I'm not sure crypto libraries will implement it, but then
again, OpenSSL only implements BLAKE2b-512 at the moment.  I don't care
much either way, but we should add good tests to exercise the
implementation thoroughly.  We're generally going to need to ship our
own implementation anyway.

I've argued that SHA3-256 probably has the longest life and good
unaccelerated performance, and for that reason, I've preferred it.  But
if AGL says that they're all secure (and I generally think he knows
what he's talking about), we could consider performance more.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 22:41                   ` brian m. carlson
@ 2017-06-15 23:36                     ` Ævar Arnfjörð Bjarmason
  2017-06-16  0:17                       ` brian m. carlson
  0 siblings, 1 reply; 113+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-06-15 23:36 UTC (permalink / raw)
  To: brian m. carlson, Adam Langley, Johannes Schindelin,
	Ævar Arnfjörð Bjarmason, Jeff King, Mike Hommey,
	Brandon Williams, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, Jonathan Tan, Junio Hamano

On Fri, Jun 16, 2017 at 12:41 AM, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> On Thu, Jun 15, 2017 at 02:59:57PM -0700, Adam Langley wrote:
>> (I was asked to comment on a few points in public by Jonathan.)
>>
>> I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
>> K12, etc are all secure to the extent that I don't believe that making
>> comparisons between them on that axis is meaningful. Thus I think the
>> question is primarily concerned with performance and implementation
>> availability.
>>
>> I think any of the above would be reasonable choices. I don't believe
>> that length-extension is a concern here.
>>
>> SHA-512/256 will be faster than SHA-256 on 64-bit systems in software.
>> The graph at https://blake2.net/ suggests a 50% speedup on Skylake. On
>> my Ivy Bridge system, it's about 20%.
>>
>> (SHA-512/256 does not enjoy the same availability in common libraries however.)
>>
>> Both Intel and ARM have SHA-256 instructions defined. I've not seen
>> good benchmarks of them yet, but they will make SHA-256 faster than
>> SHA-512 when available. However, it's very possible that something
>> like BLAKE2bp will still be faster. Of course, BLAKE2bp does not enjoy
>> the ubiquity of SHA-256, but nor do you have to wait years for the CPU
>> population to advance for high performance.
>
> SHA-256 acceleration exists for some existing Intel platforms already.
> However, it's not practically present on anything but servers at the
> moment, and so I don't think the acceleration of SHA-256 is
> something we should consider.

Whatever next-gen hash Git ends up with is going to be in use for
decades, so what hardware acceleration exists in consumer products
right now is practically irrelevant, but what acceleration is likely
to exist for the lifetime of the hash existing *is* relevant.

So I don't follow the argument that we shouldn't weigh future HW
acceleration highly just because you can't easily buy a laptop today
with these features.

Aside from that I think you've got this backwards, it's AMD that's
adding SHA acceleration to their high-end Ryzen chips[1] but Intel is
starting at the lower end this year with Goldmont which'll be in
lower-end consumer devices[2]. If you read the github issue I linked
to upthread[3] you can see that the cryptopp devs already tested their
SHA accelerated code on a consumer Celeron[4] recently.

I don't think Intel has announced the SHA extensions for future Xeon
releases, but it seems a given that they're going to have it there as
well. Have there ever been x86 extensions that aren't eventually
portable across the entire line, or that they've ended up removing
from x86 once introduced?

In any case, I think by the time we're ready to follow-up the current
hash refactoring efforts with actually changing the hash
implementation many of us are likely to have laptops with these
extensions, making this easy to test.

1. https://en.wikipedia.org/wiki/Intel_SHA_extensions
2. https://en.wikipedia.org/wiki/Goldmont
3. https://github.com/weidai11/cryptopp/issues/139#issuecomment-264283385
4. https://ark.intel.com/products/95594/Intel-Celeron-Processor-J3455-2M-Cache-up-to-2_3-GHz

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 23:36                     ` Ævar Arnfjörð Bjarmason
@ 2017-06-16  0:17                       ` brian m. carlson
  2017-06-16  6:25                         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 113+ messages in thread
From: brian m. carlson @ 2017-06-16  0:17 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Adam Langley, Johannes Schindelin, Jeff King, Mike Hommey,
	Brandon Williams, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, Jonathan Tan, Junio Hamano

[-- Attachment #1: Type: text/plain, Size: 3810 bytes --]

On Fri, Jun 16, 2017 at 01:36:13AM +0200, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Jun 16, 2017 at 12:41 AM, brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> > SHA-256 acceleration exists for some existing Intel platforms already.
> > However, it's not practically present on anything but servers at the
> > moment, and so I don't think the acceleration of SHA-256 is
> > something we should consider.
> 
> Whatever next-gen hash Git ends up with is going to be in use for
> decades, so what hardware acceleration exists in consumer products
> right now is practically irrelevant, but what acceleration is likely
> to exist for the lifetime of the hash existing *is* relevant.

The life of MD5 was about 23 years (introduction to first document
collision).  SHA-1 had about 22.  Decades, yes, but just barely.  SHA-2
was introduced in 2001, and by the same estimate, we're a little over
halfway through its life.

> So I don't follow the argument that we shouldn't weigh future HW
> acceleration highly just because you can't easily buy a laptop today
> with these features.
> 
> Aside from that I think you've got this backwards, it's AMD that's
> adding SHA acceleration to their high-end Ryzen chips[1] but Intel is
> starting at the lower end this year with Goldmont which'll be in
> lower-end consumer devices[2]. If you read the github issue I linked
> to upthread[3] you can see that the cryptopp devs already tested their
> SHA accelerated code on a consumer Celeron[4] recently.
> 
> I don't think Intel has announced the SHA extensions for future Xeon
> releases, but it seems a given that they're going to have it there as
> well. Have there ever been x86 extensions that aren't eventually
> portable across the entire line, or that they've ended up removing
> from x86 once introduced?
> 
> In any case, I think by the time we're ready to follow-up the current
> hash refactoring efforts with actually changing the hash
> implementation many of us are likely to have laptops with these
> extensions, making this easy to test.

I think you underestimate the life of hardware and software.  I have
servers running KVM development instances that have been running since
at least 2012.  Those machines are not scheduled for replacement anytime
soon.

Whatever we deploy within the next year is going to run on existing
hardware for probably a decade, whether we want it to or not.  Most of
those machines don't have acceleration.

Furthermore, you need a reasonably modern crypto library to get hardware
acceleration.  OpenSSL has only recently gained support for it.  RHEL 7
does not currently support it, and probably never will.  That OS is
going to be around for the next 6 years.

If we're optimizing for performance, I don't want to optimize for the
latest, greatest machines.  Those machines are going to outperform
everything else either way.  I'd rather optimize for something which
performs well on the whole everywhere.  There are a lot of developers
who have older machines, for cost reasons or otherwise.

Here are some stats (cycles/byte for long messages):

                   SHA-256    BLAKE2b
Ryzen                 1.89       3.06
Knight's Landing     19.00       5.65
Cortex-A72            1.99       5.48
Cortex-A57           11.81       5.47
Cortex-A7            28.19      15.16

In other words, BLAKE2b performs well uniformly across a wide variety of
architectures even without acceleration.  I'd rather tell people that
upgrading to a new hash algorithm is a performance win either way, not
just if they have the latest hardware.
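
For anyone who wants rough numbers from their own machine, here is a
small Python sketch. It is an assumption-laden stand-in, not the
SUPERCOP methodology: it reports MB/s through Python's hashlib
(whatever backend that happens to use) rather than cycles per byte,
and blake2b runs at its default 512-bit output.

```python
import hashlib
import time


def throughput_mb_per_s(name, size=1 << 20, rounds=20):
    """Rough MB/s for one hashlib algorithm on an in-memory buffer."""
    data = b"\x00" * size
    constructor = getattr(hashlib, name)
    start = time.perf_counter()
    for _ in range(rounds):
        constructor(data).digest()
    elapsed = time.perf_counter() - start
    return (size * rounds) / elapsed / 1e6


def compare(names=("sha1", "sha256", "sha512", "sha3_256", "blake2b"),
            size=1 << 20, rounds=20):
    """Measure each named algorithm with the same buffer size."""
    return {name: throughput_mb_per_s(name, size, rounds) for name in names}
```

Calling compare() and printing the dictionary gives a quick relative
ranking; the absolute figures will vary with the CPU and with how the
underlying library was built.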
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 21:10             ` Mike Hommey
@ 2017-06-16  4:30               ` Jeff King
  0 siblings, 0 replies; 113+ messages in thread
From: Jeff King @ 2017-06-16  4:30 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Johannes Schindelin, Brandon Williams, brian m. carlson,
	Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller,
	jonathantanmy, Junio Hamano

On Fri, Jun 16, 2017 at 06:10:22AM +0900, Mike Hommey wrote:

> > > What do the experts think of SHA512/256, which completely removes the
> > > concerns over length extension attack? (which I'd argue is better than
> > > sweeping them under the carpet)
> > 
> > I don't think it's sweeping them under the carpet. Git does not use the
> > hash as a MAC, so length extension attacks aren't a thing (and even if
> > we later wanted to use the same algorithm as a MAC, the HMAC
> > construction is a well-studied technique for dealing with it).
> 
> AIUI, length extension does make brute force collision attacks (which,
> really Shattered was) cheaper by allowing one to create the collision
> with a small message and extend it later.
> 
> This might not be a credible threat against git, but if we go by that
> standard, post-Shattered SHA-1 is still fine for git. As a matter of
> fact, MD5 would also be fine: there is still, to this day, no preimage
> attack against them.

I think collision attacks are of interest to Git. But I would think
2^128 would be enough (TBH, 2^80 probably would have been enough for
SHA-1; it was the weaknesses that brought that down by a factor of a
million that made it a problem).
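
As a rough back-of-the-envelope check of those figures, here is a sketch
that assumes the generic birthday bound (about 2^(n/2) work to find a
collision in an ideal n-bit hash) and takes "a factor of a million"
literally.  The function and variable names are only illustrative.

```python
import math


def generic_collision_bits(digest_bits):
    """Birthday bound: an ideal n-bit hash needs about 2**(n/2) work to collide."""
    return digest_bits / 2


sha1_ideal = generic_collision_bits(160)    # 80.0
sha256_ideal = generic_collision_bits(256)  # 128.0

# "A factor of a million" is roughly 2**20, so the weaknesses leave SHA-1
# at about 2**60 of work rather than 2**80.
sha1_weakened = sha1_ideal - math.log2(1_000_000)

print(sha1_ideal, sha256_ideal, round(sha1_weakened, 1))
```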

-Peff


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16  0:17                       ` brian m. carlson
@ 2017-06-16  6:25                         ` Ævar Arnfjörð Bjarmason
  2017-06-16 13:24                           ` Johannes Schindelin
  0 siblings, 1 reply; 113+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-06-16  6:25 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Adam Langley, Johannes Schindelin, Jeff King, Mike Hommey,
	Brandon Williams, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, Jonathan Tan, Junio Hamano


On Fri, Jun 16 2017, brian m. carlson jotted:

> On Fri, Jun 16, 2017 at 01:36:13AM +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Jun 16, 2017 at 12:41 AM, brian m. carlson
>> <sandals@crustytoothpaste.net> wrote:
>> > SHA-256 acceleration exists for some existing Intel platforms already.
>> > However, they're not practically present on anything but servers at the
>> > moment, and so I don't think the acceleration of SHA-256 is
>> > something we should consider.
>>
>> Whatever next-gen hash Git ends up with is going to be in use for
>> decades, so what hardware acceleration exists in consumer products
>> right now is practically irrelevant, but what acceleration is likely
>> to exist for the lifetime of the hash existing *is* relevant.
>
> The life of MD5 was about 23 years (introduction to first document
> collision).  SHA-1 had about 22.  Decades, yes, but just barely.  SHA-2
> was introduced in 2001, and by the same estimate, we're a little over
> halfway through its life.

I'm talking about the lifetime of SHA-1 or $newhash's use in Git. As our
continued use of SHA-1 demonstrates the window of practical hash
function use extends well beyond the window from introduction to
published breakage.

It's also telling that SHA-1, which any cryptographer would have waved
you off from since around 2011, is just getting widely deployed HW
acceleration now in 2017. The practical use of hash functions far
exceeds their recommended use in new projects.

>> So I don't follow the argument that we shouldn't weigh future HW
>> acceleration highly just because you can't easily buy a laptop today
>> with these features.
>>
>> Aside from that I think you've got this backwards, it's AMD that's
>> adding SHA acceleration to their high-end Ryzen chips[1] but Intel is
>> starting at the lower end this year with Goldmont which'll be in
>> lower-end consumer devices[2]. If you read the github issue I linked
>> to upthread[3] you can see that the cryptopp devs already tested their
>> SHA accelerated code on a consumer Celeron[4] recently.
>>
>> I don't think Intel has announced the SHA extensions for future Xeon
>> releases, but it seems given that they're going to have it there as
>> well. Have there ever been x86 extensions that aren't eventually
>> portable across the entire line, or that they've ended up removing
>> from x86 once introduced?
>>
>> In any case, I think by the time we're ready to follow-up the current
>> hash refactoring efforts with actually changing the hash
>> implementation many of us are likely to have laptops with these
>> extensions, making this easy to test.
>
> I think you underestimate the life of hardware and software.  I have
> servers running KVM development instances that have been running since
> at least 2012.  Those machines are not scheduled for replacement anytime
> soon.
>
> Whatever we deploy within the next year is going to run on existing
> hardware for probably a decade, whether we want it to or not.  Most of
> those machines don't have acceleration.

To clarify, I'm not dismissing the need to consider existing hardware
without these acceleration functions or future processors without
them. I don't think that makes any sense, we need to keep those in mind.

I was replying to a bit in your comment where you (it seems to me) were
making the claim that we shouldn't consider the HW acceleration of
certain hash functions either.

Clearly both need to be considered.

> Furthermore, you need a reasonably modern crypto library to get hardware
> acceleration.  OpenSSL has only recently gained support for it.  RHEL 7
> does not currently support it, and probably never will.  That OS is
> going to be around for the next 6 years.
>
> If we're optimizing for performance, I don't want to optimize for the
> latest, greatest machines.  Those machines are going to outperform
> everything else either way.  I'd rather optimize for something which
> performs well on the whole everywhere.  There are a lot of developers
> who have older machines, for cost reasons or otherwise.

We have real data showing that the intersection between people who care
about the hash slowing down and those who can't afford the latest
hardware is pretty much nil.

I.e. in 2.13.0 SHA-1 got slower, and pretty much nobody noticed or cared
except Johannes Schindelin, myself & Christian Couder. This is because
in practice hashing only becomes a bottleneck on huge monorepos that
need to e.g. re-hash the contents of a huge index.

> Here are some stats (cycles/byte for long messages):
>
>                    SHA-256    BLAKE2b
> Ryzen                 1.89       3.06
> Knight's Landing     19.00       5.65
> Cortex-A72            1.99       5.48
> Cortex-A57           11.81       5.47
> Cortex-A7            28.19      15.16
>
> In other words, BLAKE2b performs well uniformly across a wide variety of
> architectures even without acceleration.  I'd rather tell people that
> upgrading to a new hash algorithm is a performance win either way, not
> just if they have the latest hardware.

Yup, all of those need to be considered, although given my comment above
about big repos a 40% improvement on Ryzen (a processor likely to be
used for big repos) stands out, where are those numbers from, and is
that with or without HW accel for SHA-256 on Ryzen?


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16  6:25                         ` Ævar Arnfjörð Bjarmason
@ 2017-06-16 13:24                           ` Johannes Schindelin
  2017-06-16 17:38                             ` Adam Langley
  2017-06-16 20:42                             ` Jeff King
  0 siblings, 2 replies; 113+ messages in thread
From: Johannes Schindelin @ 2017-06-16 13:24 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: brian m. carlson, Adam Langley, Jeff King, Mike Hommey,
	Brandon Williams, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, Jonathan Tan, Junio Hamano


Hi,

On Fri, 16 Jun 2017, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Jun 16 2017, brian m. carlson jotted:
> 
> > On Fri, Jun 16, 2017 at 01:36:13AM +0200, Ævar Arnfjörð Bjarmason wrote:
> >
> >> So I don't follow the argument that we shouldn't weigh future HW
> >> acceleration highly just because you can't easily buy a laptop today
> >> with these features.
> >>
> >> Aside from that I think you've got this backwards, it's AMD that's
> >> adding SHA acceleration to their high-end Ryzen chips[1] but Intel is
> >> starting at the lower end this year with Goldmont which'll be in
> >> lower-end consumer devices[2]. If you read the github issue I linked
> >> to upthread[3] you can see that the cryptopp devs already tested
> >> their SHA accelerated code on a consumer Celeron[4] recently.
> >>
> >> I don't think Intel has announced the SHA extensions for future Xeon
> >> releases, but it seems given that they're going to have it there as
> >> well. Have there ever been x86 extensions that aren't eventually
> >> portable across the entire line, or that they've ended up removing
> >> from x86 once introduced?
> >>
> >> In any case, I think by the time we're ready to follow-up the current
> >> hash refactoring efforts with actually changing the hash
> >> implementation many of us are likely to have laptops with these
> >> extensions, making this easy to test.
> >
> > I think you underestimate the life of hardware and software.  I have
> > servers running KVM development instances that have been running since
> > at least 2012.  Those machines are not scheduled for replacement
> > anytime soon.
> >
> > Whatever we deploy within the next year is going to run on existing
> > hardware for probably a decade, whether we want it to or not.  Most of
> > those machines don't have acceleration.
> 
> To clarify, I'm not dismissing the need to consider existing hardware
> without these acceleration functions or future processors without them.
> I don't think that makes any sense, we need to keep those in mind.
> 
> I was replying to a bit in your comment where you (it seems to me) were
> making the claim that we shouldn't consider the HW acceleration of
> certain hash functions either.

Yes, I also had the impression that it stressed the status quo quite a bit
too much.

We know for a fact that SHA-256 acceleration is coming to consumer CPUs.
We know of no plans to hardware-accelerate any of the other mentioned
hash functions in consumer CPUs.

And remember: for those who are affected most (humongous monorepos, source
code hosters), upgrading hardware is less of an issue than having a secure
hash function for the rest of us.

And while I am really thankful that Adam chimed in, I think he would agree
that BLAKE2 is a purposefully weakened version of BLAKE, for the benefit
of speed (with the caveat that one of my experts disagrees that BLAKE2b
would be faster than hardware-accelerated SHA-256). And while BLAKE has
seen roughly as much cryptanalysis as Keccak (which became SHA-3),
BLAKE2 has not.

That makes me *very* uneasy about choosing BLAKE2.

> > Furthermore, you need a reasonably modern crypto library to get hardware
> > acceleration.  OpenSSL has only recently gained support for it.  RHEL 7
> > does not currently support it, and probably never will.  That OS is
> > going to be around for the next 6 years.
> >
> > If we're optimizing for performance, I don't want to optimize for the
> > latest, greatest machines.  Those machines are going to outperform
> > everything else either way.  I'd rather optimize for something which
> > performs well on the whole everywhere.  There are a lot of developers
> > who have older machines, for cost reasons or otherwise.
> 
> We have real data showing that the intersection between people who care
> about the hash slowing down and those who can't afford the latest
> hardware is pretty much nil.
> 
> I.e. in 2.13.0 SHA-1 got slower, and pretty much nobody noticed or cared
> except Johannes Schindelin, myself & Christian Couder. This is because
> in practice hashing only becomes a bottleneck on huge monorepos that
> need to e.g. re-hash the contents of a huge index.

Indeed. I am still concerned about that. As you mention, though, it really
only affects users of ginormous monorepos, and of course source code
hosters.

The jury's still out on how much it impacts my colleagues, by the way.

I have no doubt that Visual Studio Team Services, GitHub and Atlassian
will eventually end up with FPGAs for hash computation. So that's that.

Side note: BLAKE is actually *not* friendly to hardware acceleration, I
have been told by one cryptography expert. In contrast, the Keccak team
claims SHA3-256 to be the easiest to hardware-accelerate, making it "a
green cryptographic primitive":
http://keccak.noekeon.org/is_sha3_slow.html

> > Here are some stats (cycles/byte for long messages):
> >
> >                    SHA-256    BLAKE2b
> > Ryzen                 1.89       3.06
> > Knight's Landing     19.00       5.65
> > Cortex-A72            1.99       5.48
> > Cortex-A57           11.81       5.47
> > Cortex-A7            28.19      15.16
> >
> > In other words, BLAKE2b performs well uniformly across a wide variety of
> > architectures even without acceleration.  I'd rather tell people that
> > upgrading to a new hash algorithm is a performance win either way, not
> > just if they have the latest hardware.
> 
> Yup, all of those need to be considered, although given my comment above
> about big repos a 40% improvement on Ryzen (a processor likely to be
> used for big repos) stands out, where are those numbers from, and is
> that with or without HW accel for SHA-256 on Ryzen?

When it comes to BLAKE2, I would actually strongly suggest considering
the number of attempts to break it. Or rather, how much less attention it
got than, say, SHA-256.

In any case, I have been encouraged to stress the importance of
"crypto-agility", i.e. the ability to switch to another algorithm when the
current one gets broken "enough".

And I am delighted that that is exactly the direction we are going. In
other words, even if I still think (backed up by the experts on whose
knowledge I lean heavily to form my opinions) that SHA-256 would be the
best choice for now, it should be relatively easy to offer BLAKE2b support
for (and by [*1*]) those who want it.

Ciao,
Dscho

Footnote *1*: I say that the support for BLAKE2b should come from those
parties who desire it also because it is not as ubiquitous as SHA-256.
Hence, it would add the burden of having a performant and reasonably
bug-free implementation in Git's source tree. IIUC OpenSSL added BLAKE2b
support only in OpenSSL 1.1.0, the 1.0.2 line (which is still in use in
many places, e.g. Git for Windows' SDK) does not, meaning: Git's
implementation would be the one *everybody* relies on, with *no*
fall-back.


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16 13:24                           ` Johannes Schindelin
@ 2017-06-16 17:38                             ` Adam Langley
  2017-06-16 20:52                               ` Junio C Hamano
  2017-06-16 20:42                             ` Jeff King
  1 sibling, 1 reply; 113+ messages in thread
From: Adam Langley @ 2017-06-16 17:38 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Jeff King, Mike Hommey, Brandon Williams, Linus Torvalds,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Junio Hamano

On Fri, Jun 16, 2017 at 6:24 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> And while I am really thankful that Adam chimed in, I think he would agree
> that BLAKE2 is a purposefully weakened version of BLAKE, for the benefit
> of speed

That is correct.

Although worth keeping in mind that the analysis results from the
SHA-3 process informed this rebalancing. Indeed, NIST proposed[1] to
do the same with Keccak before stamping it as SHA-3 (although
ultimately did not in the context of public feeling in late 2013). The
Keccak team have essentially done the same with K12. Thus there is
evidence of a fairly widespread belief that the SHA-3 parameters were
excessively cautious.

[1] https://docs.google.com/file/d/0BzRYQSHuuMYOQXdHWkRiZXlURVE/edit, slide 48

> (with the caveat that one of my experts disagrees that BLAKE2b
> would be faster than hardware-accelerated SHA-256).

The numbers given above for SHA-256 on Ryzen and Cortex-A72 must be
with hardware acceleration and I thank Brian Carlson for digging them
up as I hadn't seen them before.

I suggested above that BLAKE2bp (note the p at the end) might be
faster than hardware SHA-256 and that appears to be plausible based on
benchmarks[2] of that function. (With the caveat those numbers are for
Haswell and Skylake and so cannot be directly compared with Ryzen.)

K12 reports similar speeds on Skylake[3] and thus is also plausibly
faster than hardware SHA-256.

[2] https://github.com/sneves/blake2-avx2
[3] http://keccak.noekeon.org/KangarooTwelve.pdf

However, as I'm not a git developer, I've no opinion on whether the
cost of carrying implementations of these functions is worth the speed
vs using SHA-256, which can be assumed to be supported everywhere
already.

Cheers

AGL


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16 13:24                           ` Johannes Schindelin
  2017-06-16 17:38                             ` Adam Langley
@ 2017-06-16 20:42                             ` Jeff King
  2017-06-19  9:26                               ` Johannes Schindelin
  1 sibling, 1 reply; 113+ messages in thread
From: Jeff King @ 2017-06-16 20:42 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Adam Langley, Mike Hommey, Brandon Williams, Linus Torvalds,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Junio Hamano

On Fri, Jun 16, 2017 at 03:24:19PM +0200, Johannes Schindelin wrote:

> I have no doubt that Visual Studio Team Services, GitHub and Atlassian
> will eventually end up with FPGAs for hash computation. So that's that.

I actually doubt this from the GitHub side. Hash performance is not even
on our radar as a bottleneck. In most cases the problem is touching
uncompressed data _at all_, not computing the hash over it (so things
like reusing on-disk deltas are really important).

-Peff


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16 17:38                             ` Adam Langley
@ 2017-06-16 20:52                               ` Junio C Hamano
  2017-06-16 21:12                                 ` Junio C Hamano
  0 siblings, 1 reply; 113+ messages in thread
From: Junio C Hamano @ 2017-06-16 20:52 UTC (permalink / raw)
  To: Adam Langley
  Cc: Johannes Schindelin, Ævar Arnfjörð Bjarmason,
	brian m. carlson, Jeff King, Mike Hommey, Brandon Williams,
	Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller,
	Jonathan Tan

Adam Langley <agl@google.com> writes:

> However, as I'm not a git developer, I've no opinion on whether the
> cost of carrying implementations of these functions is worth the speed
> vs using SHA-256, which can be assumed to be supported everywhere
> already.

Thanks.

My impression from this thread is that even though fast may be
better than slow, ubiquity trumps it for our use case, as long as
the thing is not absurdly and unusably slow, of course.  Which makes
me lean towards something older/more established like SHA-256, and
it would be a very nice bonus if it gets hardware acceleration more
widely than others ;-)


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16 20:52                               ` Junio C Hamano
@ 2017-06-16 21:12                                 ` Junio C Hamano
  2017-06-16 21:24                                   ` Jonathan Nieder
  0 siblings, 1 reply; 113+ messages in thread
From: Junio C Hamano @ 2017-06-16 21:12 UTC (permalink / raw)
  To: Adam Langley
  Cc: Johannes Schindelin, Ævar Arnfjörð Bjarmason,
	brian m. carlson, Jeff King, Mike Hommey, Brandon Williams,
	Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller,
	Jonathan Tan

Junio C Hamano <gitster@pobox.com> writes:

> Adam Langley <agl@google.com> writes:
>
>> However, as I'm not a git developer, I've no opinion on whether the
>> cost of carrying implementations of these functions is worth the speed
>> vs using SHA-256, which can be assumed to be supported everywhere
>> already.
>
> Thanks.
>
> My impression from this thread is that even though fast may be
> better than slow, ubiquity trumps it for our use case, as long as
> the thing is not absurdly and unusably slow, of course.  Which makes
> me lean towards something older/more established like SHA-256, and
> it would be a very nice bonus if it gets hardware acceleration more
> widely than others ;-)

Ah, I recall one thing that was mentioned but not discussed much in
the thread: possible use of tree-hashing to exploit multiple cores
hashing a large-ish payload.  As long as it is OK to pick a sound
tree hash coding on top of any (secure) underlying hash function,
I do not think the use of tree-hashing should affect which exact
underlying hash function is to be used, and I also am not convinced
that we really want tree hashing (some codepaths that deal with a large
payload want to stream the data in a single pass from head to tail)
in the context of Git, but I am not a crypto person, so ...


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16 21:12                                 ` Junio C Hamano
@ 2017-06-16 21:24                                   ` Jonathan Nieder
  2017-06-16 21:39                                     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 113+ messages in thread
From: Jonathan Nieder @ 2017-06-16 21:24 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Adam Langley, Johannes Schindelin,
	Ævar Arnfjörð Bjarmason, brian m. carlson,
	Jeff King, Mike Hommey, Brandon Williams, Linus Torvalds,
	Git Mailing List, Stefan Beller, Jonathan Tan

Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>> Adam Langley <agl@google.com> writes:

>>> However, as I'm not a git developer, I've no opinion on whether the
>>> cost of carrying implementations of these functions is worth the speed
>>> vs using SHA-256, which can be assumed to be supported everywhere
>>> already.
>>
>> Thanks.
>>
>> My impression from this thread is that even though fast may be
>> better than slow, ubiquity trumps it for our use case, as long as
>> the thing is not absurdly and unusably slow, of course.  Which makes
>> me lean towards something older/more established like SHA-256, and
>> it would be a very nice bonus if it gets hardware acceleration more
>> widely than others ;-)
>
> Ah, I recall one thing that was mentioned but not discussed much in
> the thread: possible use of tree-hashing to exploit multiple cores
> hashing a large-ish payload.  As long as it is OK to pick a sound
> tree hash coding on top of any (secure) underlying hash function,
> I do not think the use of tree-hashing should affect which exact
> underlying hash function is to be used, and I also am not convinced
> that we really want tree hashing (some codepaths that deal with a large
> payload want to stream the data in a single pass from head to tail)
> in the context of Git, but I am not a crypto person, so ...

Tree hashing also affects single-core performance because of the
availability of SIMD instructions.

That is how software implementations of e.g. blake2bp-256 and
SHA-256x16[1] are able to have competitive performance with (slightly
better performance than, at least in some cases) hardware
implementations of SHA-256.
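
To illustrate the general shape of tree hashing, here is a toy two-level
construction in Python on top of SHA-256.  It is only a sketch: it is
neither blake2bp nor SHA-256x16, the leaf size and the 0x00/0x01 node
prefixes are assumptions made up for this example, and the construction
has not been vetted.  The point is that the leaf digests do not depend on
each other, so they can be computed in SIMD lanes or on separate cores.

```python
import hashlib

CHUNK_SIZE = 1 << 16  # hypothetical leaf size, chosen only for illustration


def leaf_digests(data, chunk_size=CHUNK_SIZE):
    """Hash each chunk independently; no leaf depends on any other leaf."""
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]
    # A 0x00 prefix marks leaf nodes so a leaf is never mistaken for the root.
    return [hashlib.sha256(b"\x00" + chunk).digest() for chunk in chunks]


def tree_hash(data, chunk_size=CHUNK_SIZE):
    """Two-level tree hash: the root covers the length and all leaf digests."""
    root = hashlib.sha256(b"\x01")
    root.update(len(data).to_bytes(8, "big"))
    for digest in leaf_digests(data, chunk_size):
        root.update(digest)
    return root.hexdigest()
```

Note that the root cannot be finished until every leaf is hashed, and the
result differs from a plain SHA-256 of the same bytes, so a tree mode is
effectively a different hash function from the caller's point of view.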

It is also satisfying that we have options like these that are faster
than SHA-1.

All that said, SHA-256 seems like a fine choice, despite its worse
performance.  The wide availability of reasonable-quality
implementations (e.g. in Java you can use
'MessageDigest.getInstance("SHA-256")') makes it a very tempting one.
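
The same holds in other languages.  As a sketch, Python's standard
hashlib can name a blob using the "<type> <size>\0<payload>" framing that
Git uses today with SHA-1, and swapping in SHA-256 is a one-argument
change.  Whether a new object format would keep exactly this framing is
an assumption of the sketch, and the helper name is made up for
illustration.

```python
import hashlib


def git_object_name(payload, obj_type="blob", algo="sha1"):
    """Hash '<type> <size>\\0<payload>', the framing Git uses for objects."""
    header = f"{obj_type} {len(payload)}".encode() + b"\x00"
    return hashlib.new(algo, header + payload).hexdigest()


# The empty blob under SHA-1 is the familiar e69de29b... name.
print(git_object_name(b""))
# The same framing hashed with SHA-256 yields a 64-hex-digit name.
print(git_object_name(b"", algo="sha256"))
```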

Part of the reason I suggested previously that it would be helpful to
try to benchmark Git with various hash functions (which didn't go over
well, for some reason) is that it makes these comparisons more
concrete.  Without measuring, it is hard to get a sense of the
distribution of input sizes and how much practical effect the
differences we are talking about have.

Thanks,
Jonathan

[1] https://eprint.iacr.org/2012/476.pdf


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16 21:24                                   ` Jonathan Nieder
@ 2017-06-16 21:39                                     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 113+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-06-16 21:39 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Junio C Hamano, Adam Langley, Johannes Schindelin,
	brian m. carlson, Jeff King, Mike Hommey, Brandon Williams,
	Linus Torvalds, Git Mailing List, Stefan Beller, Jonathan Tan

On Fri, Jun 16 2017, Jonathan Nieder jotted:
> Part of the reason I suggested previously that it would be helpful to
> try to benchmark Git with various hash functions (which didn't go over
> well, for some reason) is that it makes these comparisons more
> concrete.  Without measuring, it is hard to get a sense of the
> distribution of input sizes and how much practical effect the
> differences we are talking about have.

It would be great to have such benchmarks (I probably missed the "didn't
go over well" part), but FWIW you can get pretty close to this right now
in git by running various t/perf benchmarks with
BLKSHA1/OPENSSL/SHA1DC.

Between the three of those (particularly SHA1DC being slower than
OpenSSL) you get a performance difference similar to some SHA-1
vs. SHA-256 benchmarks I've seen, so to the extent that we have
existing performance tests it's revealing to see what's slower & faster.

It makes a particularly big difference for e.g. p3400-rebase.sh.


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16 20:42                             ` Jeff King
@ 2017-06-19  9:26                               ` Johannes Schindelin
  0 siblings, 0 replies; 113+ messages in thread
From: Johannes Schindelin @ 2017-06-19  9:26 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Adam Langley, Mike Hommey, Brandon Williams, Linus Torvalds,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Junio Hamano

Hi Peff,

On Fri, 16 Jun 2017, Jeff King wrote:

> On Fri, Jun 16, 2017 at 03:24:19PM +0200, Johannes Schindelin wrote:
> 
> > I have no doubt that Visual Studio Team Services, GitHub and Atlassian
> > will eventually end up with FPGAs for hash computation. So that's
> > that.
> 
> I actually doubt this from the GitHub side. Hash performance is not even
> on our radar as a bottleneck. In most cases the problem is touching
> uncompressed data _at all_, not computing the hash over it (so things
> like reusing on-disk deltas are really important).

Thanks for pointing that out! As a mainly client-side person, I rarely get
insights into the server side...

Ciao,
Dscho


* Re: RFC v3: Another proposed hash function transition plan
  2017-03-07  0:17   ` RFC v3: " Jonathan Nieder
  2017-03-09 19:14     ` Shawn Pearce
@ 2017-09-06  6:28     ` Junio C Hamano
  2017-09-08  2:40       ` Junio C Hamano
  1 sibling, 1 reply; 113+ messages in thread
From: Junio C Hamano @ 2017-09-06  6:28 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Linus Torvalds, Git Mailing List, Stefan Beller, bmwill,
	jonathantanmy, Jeff King, David Lang, brian m. carlson

Jonathan Nieder <jrnieder@gmail.com> writes:

> Linus Torvalds wrote:
>> On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
>
>>> This document is still in flux but I thought it best to send it out
>>> early to start getting feedback.
>>
>> This actually looks very reasonable if you can implement it cleanly
>> enough.
>
> Thanks for the kind words on what had quite a few flaws still.  Here's
> a new draft.  I think the next version will be a patch against
> Documentation/technical/.

Can we reboot the discussion and advance this to v4 state?

> As before, comments welcome, both here and inline at
>
>   https://goo.gl/gh2Mzc

I think what you have over there looks pretty-much ready as the
final outline.

One thing I still do not know how I feel about after re-reading the
thread, and that I didn't find in the above doc, is Linus's suggestion
to use the objects themselves as NewHash-to-SHA-1 mapper [*1*].

It does not help the reverse mapping that is needed while pushing
things out (the SHA-1 receiver tells us what they have in terms of
SHA-1 names; we need to figure out where we stop sending based on
that).  While it does help maintaining itself (while constructing
SHA3-content, we'd be required to find out its SHA1 name but the
SHA3 objects that we refer to all know their SHA-1 names), if it is
not useful otherwise, then that does not count as a plus.  Also
having to bake corresponding SHA-1 name in the object would mean
mistakes can easily propagate and cannot be corrected without
rewriting the history, which would be a huge downside.  So perhaps
we are better off without it, I guess.

[Reference]

*1* <CA+55aFxj7Vtwac64RfAz_u=U4tob4Xg+2pDBDFNpJdmgaTCmxA@mail.gmail.com>


* Re: RFC v3: Another proposed hash function transition plan
  2017-09-06  6:28     ` RFC v3: Another proposed hash function transition plan Junio C Hamano
@ 2017-09-08  2:40       ` Junio C Hamano
  2017-09-08  3:34         ` Jeff King
  2017-09-11 18:59         ` Brandon Williams
  0 siblings, 2 replies; 113+ messages in thread
From: Junio C Hamano @ 2017-09-08  2:40 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Linus Torvalds, Git Mailing List, Stefan Beller, bmwill,
	jonathantanmy, Jeff King, David Lang, brian m. carlson

Junio C Hamano <gitster@pobox.com> writes:

> One thing I still do not know how I feel about after re-reading the
> thread, and that I didn't find in the above doc, is Linus's suggestion
> to use the objects themselves as NewHash-to-SHA-1 mapper [*1*].
> ...
> [Reference]
>
> *1* <CA+55aFxj7Vtwac64RfAz_u=U4tob4Xg+2pDBDFNpJdmgaTCmxA@mail.gmail.com>

I think this falls into the same category as the often-talked-about
addition of the "generation number" field.  It is very tempting to
add these "mechanically derivable but expensive to compute" pieces
of information to the sha3-content while converting from
sha1-content and creating anew.  

Because the "sha1-name" or the "generation number" can mechanically
be computed, as long as everybody agrees to _always_ place them in
the sha3-content, the same sha1-content will be converted into
exactly the same sha3-content without ambiguity, and converting them
back to sha1-content while pushing to an older repository will
correctly produce the original sha1-content, as it would just be the
matter of simply stripping these extra pieces of information.

The reason why I still feel a bit uneasy about adding these things
(aside from the fact that the sha1-name thing will be baggage we would
need to carry forever even after we completely wean ourselves off of
the old hash) is because I am not sure what we should do when we
encounter sha3-content in the wild that has these things _wrong_.
An object that exists today in the SHA-1 world is fetched into the
new repository and converted to SHA-3 contents, and Linus's extra
"original SHA-1 name" field is added to the object's header while
recording the SHA-3 content.  But for whatever reason, the original
SHA-1 name is recorded incorrectly in the resulting SHA-3 object.

The same thing could happen if we decide to bake "generation number"
in the SHA-3 commit objects.  One possible definition would be that
a root commit will have gen #0; a commit with 1 or more parents will
get max(parents' gen numbers) + 1 as its gen number.  But somebody
may botch the counting and record sum(parents' gen numbers) as its
gen number.
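
For illustration, the two counting rules can be compared on a toy
history.  This is only a sketch in Python, not Git code: the graph, the
commit names and the `generation` helper are all made up, and the
"botched" rule is taken here to be sum(parents' gen numbers) + 1 so
that it does not collapse to zero everywhere.

```python
# Toy commit graph: commit name -> list of parent names (hypothetical data).
PARENTS = {
    "A": [],            # root commit
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],    # merge of B and C
    "E": ["D", "C"],    # merge of D and C
}

def generation(commit, combine, memo=None):
    """Root commits get 0; others get combine(parents' numbers) + 1."""
    memo = {} if memo is None else memo
    if commit not in memo:
        parents = PARENTS[commit]
        if not parents:
            memo[commit] = 0
        else:
            memo[commit] = combine(generation(p, combine, memo) for p in parents) + 1
    return memo[commit]

# The intended rule (max) and the botched variant (sum), side by side.
correct = {c: generation(c, max) for c in PARENTS}
botched = {c: generation(c, sum) for c in PARENTS}

for c in PARENTS:
    print(c, correct[c], botched[c])
```

On a linear history the two rules agree; they diverge at the first
merge (D gets 2 under the max rule and 3 under the sum rule), so two
writers following different rules would produce different commit
contents for the same history.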

In these cases, not just the SHA3-content but also the resulting
SHA-3 object name would be different from the name of the object
that would have recorded the same contents correctly.  So converting
back to SHA-1 world from these botched SHA-3 contents may produce
the original contents, but we may end up with multiple
"plausible-looking" SHA-3 objects that (claim to) correspond to a
single SHA-1 object, only one of which is valid.
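
A rough sketch of that situation, in Python.  Everything here is an
assumption made for illustration: the "original-sha1" header line, the
toy commit body, and the helper names are invented, and SHA3-256 only
stands in for whatever new hash is chosen.  The one real convention
used is that an object name is the hash of "type size\0body".

```python
import hashlib

def git_name(hash_name, obj_type, body):
    """Name an object by hashing 'type size\\0body'."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.new(hash_name, header + body).hexdigest()

# A toy commit body as it exists in the SHA-1 world (not a valid commit).
sha1_content = b"author A U Thor <author@example.com> 0 +0000\n\ntoy commit\n"
sha1_name = git_name("sha1", "commit", sha1_content)

def to_sha3_content(content, recorded_sha1):
    # Hypothetical encoding: prepend an "original-sha1" header line.
    return b"original-sha1 " + recorded_sha1.encode() + b"\n" + content

def to_sha1_content(sha3_content):
    # Converting back just strips the extra header line.
    return sha3_content.split(b"\n", 1)[1]

good = to_sha3_content(sha1_content, sha1_name)
bad = to_sha3_content(sha1_content, "0" * 40)   # wrongly recorded name

# Both strip back to the same SHA-1 content ...
same_sha1 = to_sha1_content(good) == to_sha1_content(bad) == sha1_content
# ... yet they are two distinct objects in the new-hash world.
distinct_sha3 = git_name("sha3_256", "commit", good) != git_name("sha3_256", "commit", bad)
print(same_sha1, distinct_sha3)
```

Both objects convert back to the same SHA-1 content, so nothing on the
SHA-1 side reveals which of the two new-hash objects is the valid one.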

Our "git fsck" already treats certain brokenness (like a tree whose
entry has a mode that is 0-padded to the left) as broken but still
tolerates it.  I am not sure if it is sufficient to diagnose
sha3-content that records these "mechanically derivable but
expensive to compute" pieces of information incorrectly and to
declare it broken and invalid.

I am leaning towards saying "yes, catching in fsck is enough" and
suggesting that we add the generation number to the sha3-content of
the commit objects, and even add the "original sha1 name" thing if we
find a good use for it.  But I cannot shake off this nagging feeling
that I am missing some huge problems that adding these fields, and
opening ourselves to more classes of broken objects, would cause.

Thoughts?


* Re: RFC v3: Another proposed hash function transition plan
  2017-09-08  2:40       ` Junio C Hamano
@ 2017-09-08  3:34         ` Jeff King
  2017-09-11 18:59         ` Brandon Williams
  1 sibling, 0 replies; 113+ messages in thread
From: Jeff King @ 2017-09-08  3:34 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jonathan Nieder, Linus Torvalds, Git Mailing List, Stefan Beller,
	bmwill, jonathantanmy, David Lang, brian m. carlson

On Fri, Sep 08, 2017 at 11:40:21AM +0900, Junio C Hamano wrote:

> Our "git fsck" already treats certain brokenness (like a tree whose
> entry has a mode that is 0-padded to the left) as broken but still
> tolerates it.  I am not sure if it is sufficient to diagnose
> sha3-content that records these "mechanically derivable but
> expensive to compute" pieces of information incorrectly and to
> declare it broken and invalid.
> 
> I am leaning towards saying "yes, catching in fsck is enough" and
> suggesting that we add the generation number to the sha3-content of
> the commit objects, and even add the "original sha1 name" thing if we
> find a good use for it.  But I cannot shake off this nagging feeling
> that I am missing some huge problems that adding these fields, and
> opening ourselves to more classes of broken objects, would cause.

I share your nagging feeling.

I have two thoughts on the "fsck can catch it" line of reasoning.

  1. It's harder to fsck generation numbers than other syntactic
     elements of an object, because they inherently depend on the links.
     So I can't fsck a commit object in isolation. I have to open its
     parents and check _their_ generation numbers.

     In some sense that isn't a big deal. A real fsck wants to know that
     we _have_ the parents in the first place. But traditionally we've
     separated "is this syntactically valid" from "do we have full
     connectivity". And features like shallow clones rely on us fudging
     the latter but not the former. A shallow history could never
     properly fsck the generation numbers.

     A multiple-hash field doesn't have this problem. It's purely a
     function of the bytes in the object.

  2. I wouldn't classify the current fsck checks as a wild success in
     containing breakages. If a buggy implementation produces invalid
     objects, the same buggy implementation generally lets people (and
     their colleagues) unwittingly build on top of those objects. It's
     only later (sometimes much later) that they interact with a
     non-buggy implementation whose fsck complains.

     And what happens then? If they're lucky, the invalid objects
     haven't spread far, and the worst thing is that they have to learn
     to use filter-branch (which itself is punishment enough). But
     sometimes a significant bit of history has been built on top, and
     it's awkward or impossible to rewrite it.

     That puts the burden on whoever is running the non-buggy
     implementation that wants to reject the objects. Do they accept
     these broken objects? If so, what do they do to mitigate the wrong
     answers that Git will return?
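
The difference described in point 1 can be sketched in Python as two
hypothetical checks (neither is real fsck code; the function names, the
dictionaries and the `have` set are invented for this example).  The
first needs only the bytes of one object, the second cannot give an
answer unless the parents are present.

```python
import hashlib

def check_embedded_sha1(sha1_content, recorded_sha1):
    """Needs only this object's bytes: recompute and compare."""
    return hashlib.sha1(sha1_content).hexdigest() == recorded_sha1

def check_generation(commit, recorded, parents, have):
    """Needs the parents: returns None when any parent is missing."""
    if any(p not in have for p in parents[commit]):
        return None   # e.g. a shallow clone: cannot decide
    # A root commit has no parents, so max() falls back to -1 and expects 0.
    expected = 1 + max((recorded[p] for p in parents[commit]), default=-1)
    return recorded[commit] == expected

parents = {"A": [], "B": ["A"]}
recorded = {"A": 0, "B": 1}

full = check_generation("B", recorded, parents, have={"A", "B"})
shallow = check_generation("B", recorded, parents, have={"B"})
print(full, shallow)
```

With the full history the generation check can pass or fail; with only
"B" present it has to return "don't know", which is the shallow-clone
case.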

I'm much more in favor of keeping that data outside the object-hash
computation, and caching the pre-computed results as necessary. Those
caches can disagree with the objects, of course, but the cost of dropping
and re-building them is much lower than a history rewrite.

I'm speaking primarily to the generation-number thing, where I really
don't think there's any benefit to embedding it in the object beyond the
obvious "well, it has to go _somewhere_, and this saves us implementing
a local cache layer".  I haven't thought hard enough on the
multiple-hash thing to know if there's some other benefit to having it
inside the objects.

-Peff


* Re: RFC v3: Another proposed hash function transition plan
  2017-09-08  2:40       ` Junio C Hamano
  2017-09-08  3:34         ` Jeff King
@ 2017-09-11 18:59         ` Brandon Williams
  2017-09-13 12:05           ` Johannes Schindelin
  1 sibling, 1 reply; 113+ messages in thread
From: Brandon Williams @ 2017-09-11 18:59 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jonathan Nieder, Linus Torvalds, Git Mailing List, Stefan Beller,
	jonathantanmy, Jeff King, David Lang, brian m. carlson

On 09/08, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > One thing I still do not know how I feel about after re-reading the
> > thread, and didn't find in the above doc, is Linus's suggestion to
> > use the objects themselves as a NewHash-to-SHA-1 mapper [*1*].
> > ...
> > [Reference]
> >
> > *1* <CA+55aFxj7Vtwac64RfAz_u=U4tob4Xg+2pDBDFNpJdmgaTCmxA@mail.gmail.com>
> 
> I think this falls into the same category as the often-talked-about
> addition of the "generation number" field.  It is very tempting to
> add these "mechanically derivable but expensive to compute" pieces
> of information to the sha3-content while converting from
> sha1-content and creating anew.  

We didn't discuss that in the doc since this particular transition plan
we made uses an external NewHash-to-SHA1 map instead of an internal one
because we believe that at some point we would be able to drop
compatibility with SHA1.  Now I suspect that won't happen for a long time
but I think it would be preferable over carrying the SHA1 luggage
indefinitely.  At some point, then, we would be able to stop hashing
objects twice (once with SHA1 and once with NewHash) instead of always
requiring that we hash them with each hash function which was used
historically.

> 
> Because the "sha1-name" or the "generation number" can mechanically
> be computed, as long as everybody agrees to _always_ place them in
> the sha3-content, the same sha1-content will be converted into
> exactly the same sha3-content without ambiguity, and converting them
> back to sha1-content while pushing to an older repository will
> correctly produce the original sha1-content, as it would just be the
> matter of simply stripping these extra pieces of information.
> 
> The reason why I still feel a bit uneasy about adding these things
> (aside from the fact that the sha1-name thing will be baggage we would
> need to carry forever even after we completely wean ourselves off of
> the old hash) is because I am not sure what we should do when we
> encounter sha3-content in the wild that has these things _wrong_.
> An object that exists today in the SHA-1 world is fetched into the
> new repository and converted to SHA-3 contents, and Linus's extra
> "original SHA-1 name" field is added to the object's header while
> recording the SHA-3 content.  But for whatever reason, the original
> SHA-1 name is recorded incorrectly in the resulting SHA-3 object.

This wasn't one of the issues that I thought of but it just makes the
argument against adding sha1's to the sha3 content stronger.

> 
> The same thing could happen if we decide to bake "generation number"
> in the SHA-3 commit objects.  One possible definition would be that
> a root commit will have gen #0; a commit with 1 or more parents will
> get max(parents' gen numbers) + 1 as its gen number.  But somebody
> may botch the counting and record sum(parents' gen numbers) as its
> gen number.
> 
> In these cases, not just the SHA3-content but also the resulting
> SHA-3 object name would be different from the name of the object
> that would have recorded the same contents correctly.  So converting
> back to SHA-1 world from these botched SHA-3 contents may produce
> the original contents, but we may end up with multiple
> "plausible-looking" SHA-3 objects that (claim to) correspond to a
> single SHA-1 object, only one of which is valid.
> 
> Our "git fsck" already treats certain brokenness (like a tree whose
> entry has a mode that is 0-padded to the left) as broken but still
> tolerates it.  I am not sure if it is sufficient to diagnose
> sha3-content that records these "mechanically derivable but
> expensive to compute" pieces of information incorrectly and to
> declare it broken and invalid.
> 
> I am leaning towards saying "yes, catching in fsck is enough" and
> suggesting that we add the generation number to the sha3-content of
> the commit objects, and even add the "original sha1 name" thing if we
> find a good use for it.  But I cannot shake off this nagging feeling
> that I am missing some huge problems that adding these fields, and
> opening ourselves to more classes of broken objects, would cause.
> 
> Thoughts?
> 
> 

-- 
Brandon Williams


* Re: RFC v3: Another proposed hash function transition plan
  2017-09-11 18:59         ` Brandon Williams
@ 2017-09-13 12:05           ` Johannes Schindelin
  2017-09-13 13:43             ` demerphq
  2017-09-13 16:30             ` Jonathan Nieder
  0 siblings, 2 replies; 113+ messages in thread
From: Johannes Schindelin @ 2017-09-13 12:05 UTC (permalink / raw)
  To: Brandon Williams
  Cc: Junio C Hamano, Jonathan Nieder, Linus Torvalds, Git Mailing List,
	Stefan Beller, jonathantanmy, Jeff King, David Lang,
	brian m. carlson

Hi Brandon,

On Mon, 11 Sep 2017, Brandon Williams wrote:

> On 09/08, Junio C Hamano wrote:
> > Junio C Hamano <gitster@pobox.com> writes:
> > 
> > > One thing I still do not know how I feel about after re-reading the
> > > thread, and didn't find in the above doc, is Linus's suggestion to
> > > use the objects themselves as a NewHash-to-SHA-1 mapper [*1*].
> > > ...
> > > [Reference]
> > >
> > > *1* <CA+55aFxj7Vtwac64RfAz_u=U4tob4Xg+2pDBDFNpJdmgaTCmxA@mail.gmail.com>
> > 
> > I think this falls into the same category as the often-talked-about
> > addition of the "generation number" field.  It is very tempting to add
> > these "mechanically derivable but expensive to compute" pieces of
> > information to the sha3-content while converting from sha1-content and
> > creating anew.  
> 
> We didn't discuss that in the doc since this particular transition plan
> we made uses an external NewHash-to-SHA1 map instead of an internal one
> because we believe that at some point we would be able to drop
> compatibility with SHA1.

Is there even a question about that? I mean, why would *any* project that
switches entirely to SHA-256 want to carry the SHA-1 baggage around?

So even if the code to generate a bidirectional old <-> new hash mapping
might be with us forever, it *definitely* should be optional ("optional"
at least as in "config setting"), allowing developers who only work with
new-hash repositories to save the time and electrons.

> Now I suspect that won't happen for a long time but I think it would be
> preferable over carrying the SHA1 luggage indefinitely.

It should be possible to push back the SHA-1 ginny into a small gin bottle
inside Git's source code, so to say, i.e. encapsulate it to the point
where it is a compile-time option, in addition to a runtime option.

Of course, that's only unless the SHA-1 calculation is made mandatory as
suggested above. I really shudder at the idea of SHA-1 being
required forever. We ignored advice in 2005 against making ourselves too
dependent on SHA-1, and I would hope that we would learn from this.

> At some point, then, we would be able to stop hashing objects twice
> (once with SHA1 and once with NewHash) instead of always requiring that
> we hash them with each hash function which was used historically.

Yes, please.

> > Because the "sha1-name" or the "generation number" can mechanically
> > be computed,

... as long as a shallow clone you do not have, of course...

> > as long as everybody agrees to _always_ place them in the
> > sha3-content, the same sha1-content will be converted into exactly the
> > same sha3-content without ambiguity, and converting them back to
> > sha1-content while pushing to an older repository will correctly
> > produce the original sha1-content, as it would just be the matter of
> > simply stripping these extra pieces of information.

... or Git would simply handle the absence of the generation number header
gracefully, so that sha1-content == sha3-content...

> > The same thing could happen if we decide to bake "generation number"
> > in the SHA-3 commit objects.  One possible definition would be that a
> > root commit will have gen #0; a commit with 1 or more parents will get
> > max(parents' gen numbers) + 1 as its gen number.  But somebody may
> > botch the counting and record sum(parents' gen numbers) as its gen
> > number.
> > 
> > In these cases, not just the SHA3-content but also the resulting SHA-3
> > object name would be different from the name of the object that would
> > have recorded the same contents correctly.  So converting back to
> > SHA-1 world from these botched SHA-3 contents may produce the original
> > contents, but we may end up with multiple "plausible-looking"
> > SHA-3 objects that (claim to) correspond to a single SHA-1 object,
> > only one of which is valid.
> > 
> > Our "git fsck" already treats certain brokenness (like a tree whose
> > entry has a mode that is 0-padded to the left) as broken but still
> > tolerates it.  I am not sure if it is sufficient to diagnose
> > sha3-content that records these "mechanically derivable but
> > expensive to compute" pieces of information incorrectly and to
> > declare it broken and invalid.
> > 
> > I am leaning towards saying "yes, catching in fsck is enough" and
> > suggesting that we add the generation number to the sha3-content of
> > the commit objects, and even add the "original sha1 name" thing if we
> > find a good use for it.  But I cannot shake off this nagging feeling
> > that I am missing some huge problems that adding these fields, and
> > opening ourselves to more classes of broken objects, would cause.
> > 
> > Thoughts?

Seeing as current Git versions would always ignore the generation number
(and therefore work perfectly even with erroneous baked-in generation
numbers), and seeing as it would be easy to add a config option to force
Git to ignore the embedded generation numbers, I would consider `fsck`
catching those problems the best idea.

It seems that every major Git hoster already has some sort of fsck on the
fly for newly-pushed objects, so that would be another "line of defense".

Taking a step back, though, it may be a good idea to leave the generation
number business for later, as much fun as it is to get side tracked and
focus on relatively trivial stuff instead of the far more difficult and
complex task to get the transition plan to a new hash ironed out.

For example, I am still in favor of SHA-256 over SHA3-256, after learning
some background details from in-house cryptographers: it provides
essentially the same level of security, according to my sources, while
hardware support seems to be coming to SHA-256 a lot sooner than to
SHA3-256.

Which hash algorithm to choose is a tough question to answer, and
discussing generation numbers will sadly not help us answer it any quicker.

Ciao,
Dscho


* Re: RFC v3: Another proposed hash function transition plan
  2017-09-13 12:05           ` Johannes Schindelin
@ 2017-09-13 13:43             ` demerphq
  2017-09-13 22:51               ` Jonathan Nieder
  2017-09-13 23:30               ` Linus Torvalds
  2017-09-13 16:30             ` Jonathan Nieder
  1 sibling, 2 replies; 113+ messages in thread
From: demerphq @ 2017-09-13 13:43 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Brandon Williams, Junio C Hamano, Jonathan Nieder, Linus Torvalds,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	David Lang, brian m. carlson

On 13 September 2017 at 14:05, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> For example, I am still in favor of SHA-256 over SHA3-256, after learning
> some background details from in-house cryptographers: it provides
> essentially the same level of security, according to my sources, while
> hardware support seems to be coming to SHA-256 a lot sooner than to
> SHA3-256.

FWIW, and I know it is not worth much, as far as I can tell there is
at least some security/math basis to prefer SHA3-256 to SHA-256.

The SHA1 and SHA-256 hash functions (iirc along with their older
cousins MD5 and MD2) all have a common design feature where they mix a
relatively large block size into a much smaller state *each block*. So
for instance SHA-256 mixes a 512 bit block into a 256 bit state with a
2:1 "leverage" between the block being read and the state. In SHA1
this was worse, mixing a 512 bit block into a 160 bit state, closer to
3:1 leverage.

SHA3 however uses a completely different design where it mixes a 1088
bit block into a 1600 bit state, for a leverage of 2:3, and the excess
is *preserved between each block*.

Assuming everything else is equal between SHA-256 and SHA3 this
difference alone would seem to justify choosing SHA3 over SHA-256. We
know that there MUST be collisions when compressing a 512 bit block
into a 256 bit space, however one cannot say the same about mixing
1088 bits into a 1600 bit state. The excess state which is not
directly modified by the input block makes a big difference when
reading the next block.

Of course in both cases we end up compressing the entire source
document down to the same number of bits, however SHA3 does that
*once*, in finalization only, whereas SHA-256 does it *every* block
read. So it seems to me that the opportunity for collisions is *much*
higher in SHA-256 than it is in SHA3-256. (Even if they should be
vanishingly rare regardless.)

For this reason if I had a vote I would definitely vote SHA3-256, or
even for SHA3-512. The latter has an impressive 1:2 leverage between
block and state, and much better theoretical security levels.
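
The block-to-state ratios quoted above can be recomputed with a few
lines of Python.  This is only arithmetic on the published parameters,
not a security argument.  The 576-bit rate for SHA3-512 does not appear
in the text above; it is taken from the SHA-3 standard (FIPS 202), and
with it the ratio comes out nearer 1:3 than 1:2.

```python
from fractions import Fraction

# (input block bits, internal state bits); for SHA-3 the "block" is the rate.
PARAMS = {
    "SHA-1":    (512, 160),
    "SHA-256":  (512, 256),
    "SHA3-256": (1088, 1600),
    "SHA3-512": (576, 1600),   # rate from FIPS 202, not from the text above
}

leverage = {name: Fraction(block, state) for name, (block, state) in PARAMS.items()}

for name, ratio in leverage.items():
    print(f"{name:9} block:state = {ratio} (~{float(ratio):.2f})")
```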

cheers,
Yves
Note: I am not a cryptographer, although I am probably pretty well
informed as far as hobby-hash-function-enthusiasts go.
-- 
perl -Mre=debug -e "/just|another|perl|hacker/"


* Re: RFC v3: Another proposed hash function transition plan
  2017-09-13 12:05           ` Johannes Schindelin
  2017-09-13 13:43             ` demerphq
@ 2017-09-13 16:30             ` Jonathan Nieder
  2017-09-13 21:52               ` Junio C Hamano
  2017-09-14 12:39               ` Johannes Schindelin
  1 sibling, 2 replies; 113+ messages in thread
From: Jonathan Nieder @ 2017-09-13 16:30 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Brandon Williams, Junio C Hamano, Linus Torvalds,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	David Lang, brian m. carlson

Hi Dscho,

Johannes Schindelin wrote:

> So even if the code to generate a bidirectional old <-> new hash mapping
> might be with us forever, it *definitely* should be optional ("optional"
> at least as in "config setting"), allowing developers who only work with
> new-hash repositories to save the time and electrons.

Agreed.  This is a good reason not to store the sha1 inside the
sha256-encoded objects.  I think that is exactly what Brandon was saying
in response to Junio --- did you read it differently?

[...]
> ... or Git would simply handle the absence of the generation number header
> gracefully, so that sha1-content == sha3-content...

Part of the sha1-content is references to other objects using their
sha1-name, so it is not possible to have sha1-content == sha3-content.
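
A small Python sketch of why that is.  The tree and commit encodings
below are simplified text stand-ins invented for this example (real
trees store binary object names, and the helper names are made up), and
SHA3-256 is again only a placeholder for the new hash.

```python
import hashlib

def git_name(hash_name, obj_type, body):
    """Name an object by hashing 'type size\\0body'."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.new(hash_name, header + body).hexdigest()

blob = b"hello\n"   # blobs reference nothing, so their content is identical

def tree_for(hash_name):
    # Toy text tree: the entry names the blob using the given hash.
    return f"100644 blob {git_name(hash_name, 'blob', blob)}\thello.txt\n".encode()

def commit_for(hash_name):
    # The commit names its tree using the same hash.
    tree_name = git_name(hash_name, "tree", tree_for(hash_name))
    return f"tree {tree_name}\n\ntoy commit\n".encode()

sha1_commit = commit_for("sha1")
sha3_commit = commit_for("sha3_256")
print(sha1_commit.split(b"\n")[0])
print(sha3_commit.split(b"\n")[0])
```

The blob bytes are the same in both worlds, but the tree and commit
bytes differ because the names they embed differ, even before any extra
header is added.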

That said, I am also leaning against including generation numbers as
part of this design.

There is an argument for including generation numbers.  It is much
simpler to have generation numbers in *all* commit objects than only in
some, since it means the slop-based heuristics for faking generation
numbers using commit timestamp can be completely avoided for a
repository using such a format.  Including generation numbers in all
commit objects is a painless thing to do during a format change, since
it can happen without harming round-tripping.

Treating generation numbers as derived data (as in Jeff King's
preferred design, if I have understood his replies correctly) would
also be possible but it does not interact well with shallow clone or
narrow clone.

All that said, for simplicity I still lean against including
generation numbers as part of a hash function transition.  Nothing
stops us from having another format change later.

This is a particularly hard decision because I don't have a strong
preference.  That leads me to err on the side of simplicity.

I will make sure to discuss this issue in my patch to
Documentation/technical/, so we don't have to repeat the same
conversations again and again.

[...]
> Taking a step back, though, it may be a good idea to leave the generation
> number business for later, as much fun as it is to get side tracked and
> focus on relatively trivial stuff instead of the far more difficult and
> complex task to get the transition plan to a new hash ironed out.
>
> For example, I am still in favor of SHA-256 over SHA3-256, after learning
> some background details from in-house cryptographers: it provides
> essentially the same level of security, according to my sources, while
> hardware support seems to be coming to SHA-256 a lot sooner than to
> SHA3-256.
>
> Which hash algorithm to choose is a tough question to answer, and
> discussing generation numbers will sadly not help us answer it any quicker.

This is unrelated to Brandon's message, except for his use of SHA3 as
a placeholder for "the next hash function".

My assumption based on previous conversations (and other external
conversations like [1]) is that we are going to use SHA2-256 and have
a pretty strong consensus for that.  Don't worry!

As a side note, I am probably misreading, but I found this set of
paragraphs a bit condescending.  It sounds to me like you are saying
"You are making the wrong choice of hash function and everything else
you are describing is irrelevant when compared to that monumental
mistake.  Please stop working on things I don't consider important".
With that reading it is quite demotivating to read.

An alternative reading is that you are saying that the transition plan
described in this thread is not ironed out.  Can you spell that out
more?  What particular aspect of the transition plan (which is of
course orthogonal to the choice of hash function) are you discontent
with?

Thanks and hope that helps,
Jonathan

[1] https://www.imperialviolet.org/2017/05/31/skipsha3.html


* Re: RFC v3: Another proposed hash function transition plan
  2017-09-13 16:30             ` Jonathan Nieder
@ 2017-09-13 21:52               ` Junio C Hamano
  2017-09-13 22:07                 ` Stefan Beller
  2017-09-13 22:15                 ` Junio C Hamano
  2017-09-14 12:39               ` Johannes Schindelin
  1 sibling, 2 replies; 113+ messages in thread
From: Junio C Hamano @ 2017-09-13 21:52 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Johannes Schindelin, Brandon Williams, Linus Torvalds,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	David Lang, brian m. carlson

Jonathan Nieder <jrnieder@gmail.com> writes:

> Treating generation numbers as derived data (as in Jeff King's
> preferred design, if I have understood his replies correctly) would
> also be possible but it does not interact well with shallow clone or
> narrow clone.

Just like we have skewed committer timestamps, there is no reason to
believe that generation numbers embedded in objects are trustable,
and there is no way for narrow clients to even verify their correctness.

So I agree with Peff that having generation numbers in objects is
pointless; I agree that any other derivable, like the corresponding
sha-1 name, is also pointless to have.

This is a tangent, but it may be fine for a shallow clone to treat
the cut-off points in the history as if they are root commits and
compute generation numbers locally, just like everybody else does.
As generation numbers won't have to be global (because we will not
be embedding them in objects), nobody gets hurt if they do not match
across repositories---just like the often-mentioned rename detection
cache, it can be kept as a mere local performance aid and does not
have to participate in the object model.
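
As a sketch of that idea in Python (the graph, the commit names and the
`local_generations` helper are hypothetical; the only borrowed
convention is that a shallow commit is one whose parents are not
present locally):

```python
# Toy history: commit name -> list of parent names (hypothetical data).
PARENTS = {
    "A": [], "B": ["A"], "C": ["B"],
    "D": ["C"], "E": ["C"], "F": ["D", "E"],
}

def local_generations(parents, tips, shallow=frozenset()):
    """Generation numbers computed locally; shallow commits act as roots."""
    gens = {}

    def visit(commit):
        if commit not in gens:
            # Parents of a shallow commit are absent, so ignore them.
            present = [] if commit in shallow else parents[commit]
            gens[commit] = 1 + max((visit(p) for p in present), default=-1)
        return gens[commit]

    for tip in tips:
        visit(tip)
    return gens

full_clone = local_generations(PARENTS, ["F"])
shallow_clone = local_generations(PARENTS, ["F"], shallow={"C"})
print(full_clone)
print(shallow_clone)
```

The numbers differ between the two repositories (F is 4 in the full
clone and 2 in the shallow one), but within each repository every
commit still has a larger number than any parent it can see, which is
all a local performance aid needs.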

> All that said, for simplicity I still lean against including
> generation numbers as part of a hash function transition.

Good.

> This is unrelated to Brandon's message, except for his use of SHA3 as
> a placeholder for "the next hash function".
>
> My assumption based on previous conversations (and other external
> conversations like [1]) is that we are going to use SHA2-256 and have
> a pretty strong consensus for that.  Don't worry!

Hmph, I actually re-read the thread recently, and my impression was
that we didn't quite have a consensus but were leaning towards
SHA3-256.

I do not personally have a strong preference myself and I would say
that anything will do as long as it has good longevity and
availability.  SHA2 family would be a fine choice due to its age on
both counts, being scrutinized longer and having a chance to be
implemented in many places, even though its age itself may have to
be subtracted from the longevity factor.

Thanks.


* Re: RFC v3: Another proposed hash function transition plan
  2017-09-13 21:52               ` Junio C Hamano
@ 2017-09-13 22:07                 ` Stefan Beller
  2017-09-13 22:18                   ` Jonathan Nieder
  2017-09-13 22:15                 ` Junio C Hamano
  1 sibling, 1 reply; 113+ messages in thread
From: Stefan Beller @ 2017-09-13 22:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jonathan Nieder, Johannes Schindelin, Brandon Williams,
	Linus Torvalds, Git Mailing List, Jonathan Tan, Jeff King,
	David Lang, brian m. carlson

On Wed, Sep 13, 2017 at 2:52 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Jonathan Nieder <jrnieder@gmail.com> writes:
>
>> Treating generation numbers as derived data (as in Jeff King's
>> preferred design, if I have understood his replies correctly) would
>> also be possible but it does not interact well with shallow clone or
>> narrow clone.
>
> Just like we have skewed committer timestamps, there is no reason to
> believe that generation numbers embedded in objects are trustable,
> and there is no way for narrow clients to even verify their correctness.
>
> So I agree with Peff that having generation numbers in objects is
> pointless; I agree that any other derivable, like the corresponding
> sha-1 name, is also pointless to have.
>
> This is a tangent, but it may be fine for a shallow clone to treat
> the cut-off points in the history as if they are root commits and
> compute generation numbers locally, just like everybody else does.
> As generation numbers won't have to be global (because we will not
> be embedding them in objects), nobody gets hurt if they do not match
> across repositories---just like the often-mentioned rename detection
> cache, it can be kept as a mere local performance aid and does not
> have to participate in the object model.

Locally it helps for some operations such as correct walks.
For the network case however, it doesn't really help either.

If we had global generation numbers, one could imagine that they
are used in the pack negotiation (server advertises the maximum
generation number or even gen number per branch; client
could binary search in there for the fork point)

I wonder if locally generated generation numbers (for the shallow
case) could be used somehow to still improve network operations.



>> My assumption based on previous conversations (and other external
>> conversations like [1]) is that we are going to use SHA2-256 and have
>> a pretty strong consensus for that.  Don't worry!
>
> Hmph, I actually re-read the thread recently, and my impression was
> that we didn't quite have a consensus but were leaning towards
> SHA3-256.
>
> I do not personally have a strong preference myself and I would say
> that anything will do as long as it has good longevity and
> availability.  SHA2 family would be a fine choice due to its age on
> both counts, being scrutinized longer and having a chance to be
> implemented in many places, even though its age itself may have to
> be subtracted from the longevity factor.

If we get the transition somewhat right, the next transition will
be easier than the current transition, such that I am not that concerned
about longevity. I am rather concerned about the complexity that is added
to the code base (whilst accumulating technical debt instead of clearer
abstraction layers).


* Re: RFC v3: Another proposed hash function transition plan
  2017-09-13 21:52               ` Junio C Hamano
  2017-09-13 22:07                 ` Stefan Beller
@ 2017-09-13 22:15                 ` Junio C Hamano
  2017-09-13 22:27                   ` Jonathan Nieder
  1 sibling, 1 reply; 113+ messages in thread
From: Junio C Hamano @ 2017-09-13 22:15 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Johannes Schindelin, Brandon Williams, Linus Torvalds,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	David Lang, brian m. carlson

Junio C Hamano <gitster@pobox.com> writes:

> Jonathan Nieder <jrnieder@gmail.com> writes:
>
>> Treating generation numbers as derived data (as in Jeff King's
>> preferred design, if I have understood his replies correctly) would
>> also be possible but it does not interact well with shallow clone or
>> narrow clone.
>
> Just like we have skewed committer timestamps, there is no reason to
> believe that generation numbers embedded in objects are trustable,
> and there is no way for narrow clients to even verify their correctness.
>
> So I agree with Peff that having generation numbers in object is
> pointless; I agree any other derivables like corresponding sha-1
> name is also pointless to have.
>
> This is a tangent, but it may be fine for a shallow clone to treat
> the cut-off points in the history as if they are root commits and
> compute generation numbers locally, just like everybody else does.
> As generation numbers won't have to be global (because we will not
> be embedding them in objects), nobody gets hurt if they do not match
> across repositories---just like often-mentioned rename detection
> cache, it can be kept as a mere local performance aid and does not
> have to participate in the object model.
>
>> All that said, for simplicity I still lean against including
>> generation numbers as part of a hash function transition.
>
> Good.

In the proposed transition plan, the treatment of various signatures
(deliberately) makes the conversion not quite roundtrip.

When existing SHA-1 history in individual clones is converted to
NewHash, we obviously cannot re-sign the corresponding NewHash
contents with the same PGP key, so these converted objects will
carry only signature on SHA-1 contents.  They can still be validated
when they are exported back to SHA-1 world via the fetch/push
protocol, and can be validated locally by converting them back to
SHA-1 contents and then passing the result to gpgv.

The plan also states, if I remember what I read correctly, that
newly created and signed objects (this includes signed commits and
signed tags; mergetags merely carry over what the tag object that
was merged was signed with, so we do not have to worry about them
unless the resulting commit that has mergetag is signed itself, but
that is already covered by how we handle signed commits) would be
signed both for NewHash contents and its corresponding SHA-1
contents (after internally converting it to SHA-1 contents).  That
would allow us to strip the signature over NewHash contents and
derive the SHA-1 contents to be shown to the outside world while
migration is going on and I'd imagine it would be a good practice;
it would allow us to sign something that allows everybody to verify,
when some participants of the project are not yet NewHash capable.

But the signing over SHA-1 contents has to stop at some point, when
everybody's Git becomes completely unaware of SHA-1.  We may want to
have guidelines in the transition plan that (1) encourage signing
for both for quite some time, and (2) spell out the criteria for us to
decide when to stop.

Thanks.

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-13 22:07                 ` Stefan Beller
@ 2017-09-13 22:18                   ` Jonathan Nieder
  2017-09-14  2:13                     ` Junio C Hamano
  0 siblings, 1 reply; 113+ messages in thread
From: Jonathan Nieder @ 2017-09-13 22:18 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Junio C Hamano, Johannes Schindelin, Brandon Williams,
	Linus Torvalds, Git Mailing List, Jonathan Tan, Jeff King,
	David Lang, brian m. carlson

Hi,

Stefan Beller wrote:
> On Wed, Sep 13, 2017 at 2:52 PM, Junio C Hamano <gitster@pobox.com> wrote:

>> This is a tangent, but it may be fine for a shallow clone to treat
>> the cut-off points in the history as if they are root commits and
>> compute generation numbers locally, just like everybody else does.
[...]
> Locally it helps for some operations such as correct walks.
> For the network case however, it doesn't really help either.
>
> If we had global generation numbers, one could imagine that they
> are used in the pack negotiation (server advertises the maximum
> generation number or even gen number per branch; client
> could binary search in there for the fork point)
>
> I wonder if locally generated generation numbers (for the shallow
> case) could be used somehow to still improve network operations.

I have a different concern about locally generated generation numbers in
a shallow clone.  My concern is that it is slow to recompute them when
deepening the shallow clone.

However:

 1. That only affects performance and for some use cases could be
    mitigated e.g. by introducing some laziness, and, more
    convincingly,

 2. With a small protocol change, the server could communicate the
    generation numbers for commit objects at the edge of a shallow
    clone, avoiding this trouble.

So I am not too concerned.
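
To make the cost of deepening concrete, here is a minimal sketch of
locally computed generation numbers in which the cut-off points of a
shallow clone are treated as root commits.  The function name and the
dict-based history are assumptions for illustration only; this is not
how Git stores or walks history.

```python
def generation_numbers(parents):
    """Compute generation numbers for a commit graph.

    `parents` maps each locally present commit id to its list of parent ids.
    A parent id that is not a key of `parents` lies beyond the shallow
    boundary and is ignored, so the child is treated like a root commit.
    """
    gen = {}
    for start in parents:
        stack = [start]
        while stack:
            commit = stack[-1]
            if commit in gen:
                stack.pop()
                continue
            # Visit locally present parents first (iterative depth-first walk).
            pending = [p for p in parents[commit] if p in parents and p not in gen]
            if pending:
                stack.extend(pending)
                continue
            known = [gen[p] for p in parents[commit] if p in parents]
            gen[commit] = 1 + max(known, default=0)
            stack.pop()
    return gen


# Shallow clone: commit "X" (the parent of "A") was not fetched.
shallow = {"D": ["B", "C"], "B": ["A"], "C": ["A"], "A": ["X"]}
print(generation_numbers(shallow))

# Deepening by one commit shifts every number above the old boundary,
# which is why a naive implementation has to recompute them.
deepened = dict(shallow, X=[])
print(generation_numbers(deepened))
```

In this toy history every commit above the old boundary gets a new
number after deepening, which is the recomputation cost described
above.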

More generally, unless there is a very very compelling reason to, I
don't want to couple other changes into the hash function transition.
If they're worthwhile enough to do, they're worthwhile enough to do
whether we're transitioning to a new hash function or not: I have not
heard a convincing example yet of a "while at it" that is worth the
complexity of such coupling.

(That said, if two format changes are worth doing and happen to be
implemented at the same time, then we can save users the trouble of
experiencing two format change transitions.  That is a kind of
coupling from the end user's point of view.  But from the perspective
of someone writing the code, there is no need to count on that, and it
is not likely to happen anyway.)

> If we'd get the transition somewhat right, the next transition will
> be easier than the current transition, such that I am not that concerned
> about longevity. I am rather concerned about the complexity that is added
> to the code base (whilst accumulating technical debt instead of clearer
> abstraction layers)

During the transition, users have to suffer reencoding overhead, so it
is not good for such transitions to need to happen very often.  If the
new hash function breaks early, then we have to cope with it and as
you say, having the framework in place means we'd be ready for that.
But I still don't want the chosen hash function to break early.

In other words, a long lifetime for the hash absolutely is a design
goal.  Coping well with an unexpectedly short lifetime for the hash is
also a design goal.

If the hash function lasts 10 years then I am happy.

Thanks,
Jonathan

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-13 22:15                 ` Junio C Hamano
@ 2017-09-13 22:27                   ` Jonathan Nieder
  2017-09-14  2:10                     ` Junio C Hamano
  0 siblings, 1 reply; 113+ messages in thread
From: Jonathan Nieder @ 2017-09-13 22:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin, Brandon Williams, Linus Torvalds,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	David Lang, brian m. carlson

Junio C Hamano wrote:

> In the proposed transition plan, the treatment of various signatures
> (deliberately) makes the conversion not quite roundtrip.

That's not precisely true.  Details below.

> When existing SHA-1 history in individual clones is converted to
> NewHash, we obviously cannot re-sign the corresponding NewHash
> contents with the same PGP key, so these converted objects will
> carry only signature on SHA-1 contents.  They can still be validated
> when they are exported back to SHA-1 world via the fetch/push
> protocol, and can be validated locally by converting them back to
> SHA-1 contents and then passing the result to gpgv.

Correct.

> The plan also states, if I remember what I read correctly, that
> newly created and signed objects (this includes signed commits and
> signed tags; mergetags merely carry over what the tag object that
> was merged was signed with, so we do not have to worry about them
> unless the resulting commit that has mergetag is signed itself, but
> that is already covered by how we handle signed commits) would be
> signed both for NewHash contents and its corresponding SHA-1
> contents (after internally converting it to SHA-1 contents).

Also correct.

> That would allow us to strip the signature over NewHash contents and
> derive the SHA-1 contents to be shown to the outside world while
> migration is going on and I'd imagine it would be a good practice;
> it would allow us to sign something that allows everybody to verify,
> when some participants of the project are not yet NewHash capable.

The NewHash-based signature is included in the SHA-1 content as well,
for the sake of round-tripping.  It is not stripped out.

> But the signing over SHA-1 contents has to stop at some point, when
> everybody's Git becomes completely unaware of SHA-1.  We may want to
> have guidelines in the transition plan that (1) encourage signing
> for both for quite some time, and (2) spell out the criteria for us to
> decide when to stop.

Yes, spelling out a rough schedule is a good idea.  I'll add that.

A version of Git that is aware of NewHash should be able to verify
NewHash signatures even for users that are using SHA-1 locally for the
sake of faster fetches and pushes to SHA-1 based peers.

In addition to a new enough Git, this requires that the translation
table mapping SHA-1 names to NewHash names be present.

So the criterion (2) is largely based on how up-to-date the Git used
by users wanting to verify signatures is and whether they are willing
to tolerate the performance implications of having a translation
table.  My hope is that when communicating with peers using the same
hash function, the translation table will not add too much performance
overhead.
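
As a rough illustration of what such a translation table has to
provide, here is a minimal in-memory sketch.  It uses SHA-256 purely as
a stand-in for NewHash, and the class and method names are hypothetical.
It only handles blobs, whose content is identical under both hashes;
trees, commits and tags would first need their embedded object names
rewritten, which this sketch does not attempt.

```python
import hashlib


def object_name(algo, obj_type, body):
    """Hash a Git-style object: '<type> <size>' + NUL + body."""
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.new(algo, header + body).hexdigest()


class TranslationTable:
    """Bidirectional map between SHA-1 names and stand-in NewHash names."""

    def __init__(self):
        self._to_newhash = {}
        self._to_sha1 = {}

    def record_blob(self, body):
        sha1 = object_name("sha1", "blob", body)
        newhash = object_name("sha256", "blob", body)  # assumption: NewHash = SHA-256
        self._to_newhash[sha1] = newhash
        self._to_sha1[newhash] = sha1
        return sha1, newhash

    def to_newhash(self, sha1):
        return self._to_newhash[sha1]

    def to_sha1(self, newhash):
        return self._to_sha1[newhash]


table = TranslationTable()
sha1, newhash = table.record_blob(b"")
print(sha1)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)
print(table.to_sha1(newhash) == sha1)
```

A real table would live on disk next to the pack index and be consulted
only when talking to a peer that uses the other hash, which is why
same-hash communication should see little overhead.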

Thank you,
Jonathan

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-13 13:43             ` demerphq
@ 2017-09-13 22:51               ` Jonathan Nieder
  2017-09-14 18:26                 ` Johannes Schindelin
  2017-09-13 23:30               ` Linus Torvalds
  1 sibling, 1 reply; 113+ messages in thread
From: Jonathan Nieder @ 2017-09-13 22:51 UTC (permalink / raw)
  To: demerphq
  Cc: Johannes Schindelin, Brandon Williams, Junio C Hamano,
	Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy,
	Jeff King, David Lang, brian m. carlson

Hi,

Yves wrote:
> On 13 September 2017 at 14:05, Johannes Schindelin

>> For example, I am still in favor of SHA-256 over SHA3-256, after learning
>> some background details from in-house cryptographers: it provides
>> essentially the same level of security, according to my sources, while
>> hardware support seems to be coming to SHA-256 a lot sooner than to
>> SHA3-256.
>
> FWIW, and I know it is not worth much, as far as I can tell there is
> at least some security/math basis to prefer SHA3-256 to SHA-256.

Thanks for spelling this out.  From my (very cursory) understanding of
the math, what you are saying makes sense.  I think there were some
hints of this topic on-list before, but nothing so explicit.

Here's my summary of the discussion of other aspects of the choice of
hash functions so far:

My understanding from asking cryptographers matches what Dscho said.
One of the lessons of the history of hash functions is that some kinds
of attempts to improve the security margin of a hash function do not
help as much as expected once a function is broken.

In practice, what we are looking for is

- is the algorithm broken, or likely to be broken soon
- do the algorithm's guarantees match the application
- is the algorithm fast enough
- are high quality implementations widely available

On that first question, every well informed person I have asked has
assured me that SHA-256, SHA-512, SHA-512/256, SHA-256x16, SHA3-256,
K12, BLAKE2bp-256, etc are equally likely to be broken in the next 10
years.  The main difference for the longevity question is that some of
those algorithms have had more scrutiny than others, but all have had
significant scrutiny.  See [1] and the surrounding thread for more
discussion on that.

On the second question, SHA-256 is vulnerable to length extension
attacks, which means it would not be usable as a MAC directly (instead
of using the HMAC construction).  Fortunately Git doesn't use its hash
function that way.
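
For readers unfamiliar with the distinction, here is a small sketch of
the two usages.  The function names are made up for illustration.  The
first function is the pattern that length extension breaks, the second
is the HMAC construction, and the third shows that a Git-style object
name is an unkeyed hash with no secret involved at all.

```python
import hashlib
import hmac


def naive_mac(key, message):
    """hash(key || message): the usage that length extension attacks."""
    return hashlib.sha256(key + message).hexdigest()


def proper_mac(key, message):
    """HMAC-SHA-256: the construction to use when a MAC is needed."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def blob_name(body):
    """Git-style object naming: a plain hash of header + content, no key."""
    return hashlib.sha256(b"blob %d\0" % len(body) + body).hexdigest()


key = b"secret"
message = b"hello\n"
print(naive_mac(key, message) != proper_mac(key, message))
print(len(blob_name(message)))
```

Since object naming never hashes a secret prefix, there is nothing for
an attacker to extend, so the weakness does not apply.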

On the third question, SHA-256 is one of the slower ones, even with
hardware acceleration, but it should be fast enough.

On the fourth question, SHA-256 shines.  See [2].  That is where I had
thought the conversation ended up.

For what it's worth, I'm pretty happy both with the level of scrutiny
we've given to this question and SHA-256 as an answer.  Luckily even
if at the last minute we learn something that changes the choice of
hash function, that would not significantly affect the transition
plan, so we have a chance to learn more.

See also [3].

Thanks,
Jonathan

[1] https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidT5P8eFEBbSgX_dS88knz2q_LSw@mail.gmail.com/#t
[2] https://public-inbox.org/git/xmqq37azy7ru.fsf@gitster.mtv.corp.google.com/
[3] https://www.imperialviolet.org/2017/05/31/skipsha3.html,
    https://news.ycombinator.com/item?id=14453622

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-13 13:43             ` demerphq
  2017-09-13 22:51               ` Jonathan Nieder
@ 2017-09-13 23:30               ` Linus Torvalds
  2017-09-14 18:45                 ` Johannes Schindelin
  1 sibling, 1 reply; 113+ messages in thread
From: Linus Torvalds @ 2017-09-13 23:30 UTC (permalink / raw)
  To: demerphq
  Cc: Johannes Schindelin, Brandon Williams, Junio C Hamano,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Jeff King, David Lang, brian m. carlson

On Wed, Sep 13, 2017 at 6:43 AM, demerphq <demerphq@gmail.com> wrote:
>
> SHA3 however uses a completely different design where it mixes a 1088
> bit block into a 1600 bit state, for a leverage of 2:3, and the excess
> is *preserved between each block*.

Yes. And considering that the SHA1 attack was actually predicated on
the fact that each block was independent (no extra state between), I
do think SHA3 is a better model.

So I'd rather see SHA3-256 than SHA256.

              Linus

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-13 22:27                   ` Jonathan Nieder
@ 2017-09-14  2:10                     ` Junio C Hamano
  0 siblings, 0 replies; 113+ messages in thread
From: Junio C Hamano @ 2017-09-14  2:10 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Johannes Schindelin, Brandon Williams, Linus Torvalds,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	David Lang, brian m. carlson

Jonathan Nieder <jrnieder@gmail.com> writes:

> The NewHash-based signature is included in the SHA-1 content as well,
> for the sake of round-tripping.  It is not stripped out.

Ah, OK, that allays my worries.  We rely on the fact that unknown
object headers from the future are ignored.  We use something other
than "gpgsig" header (say, "gpgsigN") to store NewHash based
signature on a commit object created in the NewHash world, so that
SHA-1 clients will ignore it but still include in the signature
computation---is that the idea?
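
A minimal sketch of that idea, assuming (as I understand current
behavior) that the payload checked against a "gpgsig" signature is the
commit with only the "gpgsig" header and its continuation lines
removed.  The "gpgsigN" name is the hypothetical one from this message,
and the helper below is for illustration, not Git's implementation.

```python
def strip_header(raw, name):
    """Return commit text `raw` without header `name` and its continuation lines."""
    head, sep, message = raw.partition("\n\n")
    kept = []
    skipping = False
    for line in head.split("\n"):
        if line.startswith(" "):
            # Continuation line: belongs to the header that precedes it.
            if not skipping:
                kept.append(line)
            continue
        skipping = line.split(" ", 1)[0] == name
        if not skipping:
            kept.append(line)
    return "\n".join(kept) + sep + message


commit = (
    "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    "author A U Thor <author@example.com> 1505340000 +0000\n"
    "committer A U Thor <author@example.com> 1505340000 +0000\n"
    "gpgsig -----BEGIN PGP SIGNATURE-----\n"
    " (signature over SHA-1 contents)\n"
    " -----END PGP SIGNATURE-----\n"
    "gpgsigN -----BEGIN PGP SIGNATURE-----\n"
    " (signature over NewHash contents)\n"
    " -----END PGP SIGNATURE-----\n"
    "\n"
    "Example commit\n"
)

# What a SHA-1-only client would verify: "gpgsigN" stays in the payload.
payload = strip_header(commit, "gpgsig")
print("gpgsigN " in payload)
print("\ngpgsig " in payload)
```

Because the unknown header stays in the payload, the SHA-1 signature
also covers the NewHash signature, and the object round-trips.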

Existing versions of Git that live in the SHA-1 world may still need
to learn to ignore/drop "gpgsigN" while amending a commit that
originally was created in the NewHash world.  Or to force upgrade we
may freeze the SHA-1 only versions of Git and stop updating them
altogether.  I dunno.

We also need to use something other than "mergetag" when carrying
over the contents of a tag being merged in the NewHash world, but
I'd imagine that you've thought about this already.

Thanks.

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-13 22:18                   ` Jonathan Nieder
@ 2017-09-14  2:13                     ` Junio C Hamano
  2017-09-14 15:23                       ` Johannes Schindelin
  0 siblings, 1 reply; 113+ messages in thread
From: Junio C Hamano @ 2017-09-14  2:13 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Stefan Beller, Johannes Schindelin, Brandon Williams,
	Linus Torvalds, Git Mailing List, Jonathan Tan, Jeff King,
	David Lang, brian m. carlson

Jonathan Nieder <jrnieder@gmail.com> writes:

> In other words, a long lifetime for the hash absolutely is a design
> goal.  Coping well with an unexpectedly short lifetime for the hash is
> also a design goal.
>
> If the hash function lasts 10 years then I am happy.

Absolutely.  When two functions have similar expected remaining life
and are equally widely supported, then faster is better than slower.
Otherwise our primary goal when picking the function from candidates
should be to optimize for its remaining life and wider availability.

Thanks.

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-13 16:30             ` Jonathan Nieder
  2017-09-13 21:52               ` Junio C Hamano
@ 2017-09-14 12:39               ` Johannes Schindelin
  2017-09-14 16:36                 ` Brandon Williams
  2017-09-14 18:49                 ` Jonathan Nieder
  1 sibling, 2 replies; 113+ messages in thread
From: Johannes Schindelin @ 2017-09-14 12:39 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Brandon Williams, Junio C Hamano, Linus Torvalds,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	David Lang, brian m. carlson

Hi Jonathan,

On Wed, 13 Sep 2017, Jonathan Nieder wrote:

> As a side note, I am probably misreading, but I found this set of
> paragraphs a bit condescending.  It sounds to me like you are saying
> "You are making the wrong choice of hash function and everything else
> you are describing is irrelevant when compared to that monumental
> mistake.  Please stop working on things I don't consider important".
> With that reading it is quite demotivating to read.

I am sorry you read it that way. I did not feel condescending when I wrote
that mail, I felt annoyed by the side track, and anxious. In my mind, the
transition is too important for side tracking, and I worry that we are not
fast enough (imagine what would happen if a better attack was discovered
that is not as easily detected as the one we know about?).

> An alternative reading is that you are saying that the transition plan
> described in this thread is not ironed out.  Can you spell that out
> more?  What particular aspect of the transition plan (which is of
> course orthogonal to the choice of hash function) are you discontent
> with?

My impression from reading Junio's mail was that he does not consider the
transition plan ironed out yet, and that he wants to spend time on
discussing generation numbers right now.

I was in particular frightened by the suggestion to "reboot" [*1*].
Hopefully I misunderstand and he meant "finishing touches" instead.

As to *my* opinion: after reading https://goo.gl/gh2Mzc (is it really
correct that its last update has been on March 6th?), my only concern is
really that it still talks about SHA3-256 when I think that the
performance benefits of SHA-256 (think: "Git at scale", and also hardware
support) really make the latter a better choice.

In order to be "ironed out", I think we need to talk about the
implementation detail "Translation table". This is important. It needs to
be *fast*.

Speaking of *fast*, I could imagine that it would make sense to store the
SHA-1 objects on disk, still, instead of converting them on the fly. I am
not sure whether this is something we need to define in the document,
though, as it may very well be premature optimization; Maybe mention that
we could do this if necessary?

Apart from that, I would *love* to see this document as The Official Plan
that I can Show To The Manager so that I can ask to Allocate Time.

Ciao,
Dscho

Footnote *1*:
https://public-inbox.org/git/xmqqa828733s.fsf@gitster.mtv.corp.google.com/

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-14  2:13                     ` Junio C Hamano
@ 2017-09-14 15:23                       ` Johannes Schindelin
  2017-09-14 15:45                         ` demerphq
  0 siblings, 1 reply; 113+ messages in thread
From: Johannes Schindelin @ 2017-09-14 15:23 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jonathan Nieder, Stefan Beller, Brandon Williams, Linus Torvalds,
	Git Mailing List, Jonathan Tan, Jeff King, David Lang,
	brian m. carlson

Hi Junio,

On Thu, 14 Sep 2017, Junio C Hamano wrote:

> Jonathan Nieder <jrnieder@gmail.com> writes:
> 
> > In other words, a long lifetime for the hash absolutely is a design
> > goal.  Coping well with an unexpectedly short lifetime for the hash is
> > also a design goal.
> >
> > If the hash function lasts 10 years then I am happy.
> 
> Absolutely.  When two functions have similar expected remaining life
> and are equally widely supported, then faster is better than slower.
> Otherwise our primary goal when picking the function from candidates
> should be to optimize for its remaining life and wider availability.

SHA-256 has been hammered on a lot more than SHA3-256.

That would be a strong point in favor of SHA2.

Ciao,
Dscho

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-14 15:23                       ` Johannes Schindelin
@ 2017-09-14 15:45                         ` demerphq
  2017-09-14 22:06                           ` Johannes Schindelin
  0 siblings, 1 reply; 113+ messages in thread
From: demerphq @ 2017-09-14 15:45 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Jonathan Nieder, Stefan Beller, Brandon Williams,
	Linus Torvalds, Git Mailing List, Jonathan Tan, Jeff King,
	David Lang, brian m. carlson

On 14 September 2017 at 17:23, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Hi Junio,
>
> On Thu, 14 Sep 2017, Junio C Hamano wrote:
>
>> Jonathan Nieder <jrnieder@gmail.com> writes:
>>
>> > In other words, a long lifetime for the hash absolutely is a design
>> > goal.  Coping well with an unexpectedly short lifetime for the hash is
>> > also a design goal.
>> >
>> > If the hash function lasts 10 years then I am happy.
>>
>> Absolutely.  When two functions have similar expected remaining life
>> and are equally widely supported, then faster is better than slower.
>> Otherwise our primary goal when picking the function from candidates
>> should be to optimize for its remaining life and wider availability.
>
> SHA-256 has been hammered on a lot more than SHA3-256.

Last year that was even more true of SHA1 than it is true of SHA-256 today.

Anyway,
Yves
-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-14 12:39               ` Johannes Schindelin
@ 2017-09-14 16:36                 ` Brandon Williams
  2017-09-14 18:49                 ` Jonathan Nieder
  1 sibling, 0 replies; 113+ messages in thread
From: Brandon Williams @ 2017-09-14 16:36 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Jonathan Nieder, Junio C Hamano, Linus Torvalds, Git Mailing List,
	Stefan Beller, jonathantanmy, Jeff King, David Lang,
	brian m. carlson

On 09/14, Johannes Schindelin wrote:
> Hi Jonathan,
> 
> On Wed, 13 Sep 2017, Jonathan Nieder wrote:
> 
> > As a side note, I am probably misreading, but I found this set of
> > paragraphs a bit condescending.  It sounds to me like you are saying
> > "You are making the wrong choice of hash function and everything else
> > you are describing is irrelevant when compared to that monumental
> > mistake.  Please stop working on things I don't consider important".
> > With that reading it is quite demotivating to read.
> 
> I am sorry you read it that way. I did not feel condescending when I wrote
> that mail, I felt annoyed by the side track, and anxious. In my mind, the
> transition is too important for side tracking, and I worry that we are not
> fast enough (imagine what would happen if a better attack was discovered
> that is not as easily detected as the one we know about?).
> 
> > An alternative reading is that you are saying that the transition plan
> > described in this thread is not ironed out.  Can you spell that out
> > more?  What particular aspect of the transition plan (which is of
> > course orthogonal to the choice of hash function) are you discontent
> > with?
> 
> My impression from reading Junio's mail was that he does not consider the
> transition plan ironed out yet, and that he wants to spend time on
> discussing generation numbers right now.
> 
> I was in particular frightened by the suggestion to "reboot" [*1*].
> Hopefully I misunderstand and he meant "finishing touches" instead.
> 
> As to *my* opinion: after reading https://goo.gl/gh2Mzc (is it really
> correct that its last update has been on March 6th?), my only concern is
> really that it still talks about SHA3-256 when I think that the
> performance benefits of SHA-256 (think: "Git at scale", and also hardware
> support) really make the latter a better choice.
> 
> In order to be "ironed out", I think we need to talk about the
> implementation detail "Translation table". This is important. It needs to
> be *fast*.

Agreed.  When that document was written, this was hand-waved as an
implementation detail, but we should probably start ironing out those
details soon so that we have a concrete plan in place.

> 
> Speaking of *fast*, I could imagine that it would make sense to store the
> SHA-1 objects on disk, still, instead of converting them on the fly. I am
> not sure whether this is something we need to define in the document,
> though, as it may very well be premature optimization; Maybe mention that
> we could do this if necessary?
> 
> Apart from that, I would *love* to see this document as The Official Plan
> that I can Show To The Manager so that I can ask to Allocate Time.

Speaking of having a concrete plan, we talked in the office the other day
about finally converting the doc into a Documentation patch.  That was
always our intention, but after writing up the doc we got busy working on
other projects.  Getting it in as a patch (with a more concrete road map)
is probably the next step we'd need to take.

I do want to echo what Jonathan has said in other parts of this thread:
the transition plan itself doesn't depend on which hash function we
end up going with in the end.  I fully expect that, for the transition
plan to succeed, we'll have infrastructure for dropping in different
hash functions so that we can do some sort of benchmarking before
selecting one to use.  This would also give us the ability to more
easily transition to another hash function when the time comes.
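
To give a feel for what "dropping in" could look like, here is a toy
sketch using Python's hashlib.  The table of candidates and the
function names are assumptions for illustration.  Note that BLAKE2b
with a 32-byte digest stands in here only because it is in the standard
library; it is not the BLAKE2bp-256 variant mentioned elsewhere in this
thread.  Git's C implementation would look nothing like this.

```python
import hashlib
import time

# Candidate algorithms, each a zero-argument factory for a hash object.
CANDIDATES = {
    "sha1": hashlib.sha1,
    "sha256": hashlib.sha256,
    "sha3-256": hashlib.sha3_256,
    "blake2b-256": lambda: hashlib.blake2b(digest_size=32),
}


def hash_object(algo, obj_type, body):
    """Name a Git-style object with the selected algorithm."""
    h = CANDIDATES[algo]()
    h.update(f"{obj_type} {len(body)}\0".encode())
    h.update(body)
    return h.hexdigest()


def benchmark(body, repeat=5):
    """Best-of-`repeat` wall-clock seconds per algorithm for one blob."""
    results = {}
    for name in CANDIDATES:
        best = float("inf")
        for _ in range(repeat):
            start = time.perf_counter()
            hash_object(name, "blob", body)
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


for name, seconds in benchmark(b"x" * (1 << 20)).items():
    print(f"{name:12s} {seconds * 1000:.3f} ms")
```

Timings from a script like this depend heavily on the library build and
on hardware support, so they are only a starting point for comparing
candidates.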

-- 
Brandon Williams

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-13 22:51               ` Jonathan Nieder
@ 2017-09-14 18:26                 ` Johannes Schindelin
  2017-09-14 18:40                   ` Jonathan Nieder
  0 siblings, 1 reply; 113+ messages in thread
From: Johannes Schindelin @ 2017-09-14 18:26 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: demerphq, Brandon Williams, Junio C Hamano, Linus Torvalds,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	David Lang, brian m. carlson

Hi Jonathan,

On Wed, 13 Sep 2017, Jonathan Nieder wrote:

> [3] https://www.imperialviolet.org/2017/05/31/skipsha3.html,

I had read this shortly after it was published, and had missed the updates.
One link in particular caught my eye:

	https://eprint.iacr.org/2012/476

Essentially, the authors demonstrate that using SIMD technology can speed
up computation by a factor of 2 for longer messages (2kB being considered
"long" already). It is a little bit unclear to me from a cursory look
whether their fast algorithm computes SHA-256, or something similar.

As the author of that paper is also known to have contributed to OpenSSL,
I had a quick look and it would appear that a comment in
crypto/sha/asm/sha256-mb-x86_64.pl speaking about "lanes" suggests that
OpenSSL uses the ideas from the paper, even if b783858654 (x86_64 assembly
pack: add multi-block AES-NI, SHA1 and SHA256., 2013-10-03) does not talk
about the paper specifically.

The numbers shown in
https://github.com/openssl/openssl/blob/master/crypto/sha/asm/keccak1600-x86_64.pl#L28
and in
https://github.com/openssl/openssl/blob/master/crypto/sha/asm/sha256-mb-x86_64.pl#L17
are sufficiently satisfying.

Ciao,
Dscho

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-14 18:26                 ` Johannes Schindelin
@ 2017-09-14 18:40                   ` Jonathan Nieder
  2017-09-14 22:09                     ` Johannes Schindelin
  0 siblings, 1 reply; 113+ messages in thread
From: Jonathan Nieder @ 2017-09-14 18:40 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: demerphq, Brandon Williams, Junio C Hamano, Linus Torvalds,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	David Lang, brian m. carlson

Hi,

Johannes Schindelin wrote:
> On Wed, 13 Sep 2017, Jonathan Nieder wrote:

>> [3] https://www.imperialviolet.org/2017/05/31/skipsha3.html,
>
> I had read this shortly after it was published, and had missed the updates.
> One link in particular caught my eye:
>
> 	https://eprint.iacr.org/2012/476
>
> Essentially, the authors demonstrate that using SIMD technology can speed
> up computation by a factor of 2 for longer messages (2kB being considered
> "long" already). It is a little bit unclear to me from a cursory look
> whether their fast algorithm computes SHA-256, or something similar.

The latter: that paper is about a variant on SHA-256 called SHA-256x4
(or SHA-256x16 to take advantage of newer instructions).  It's a
different hash function.  This is what I was alluding to at [1].

> As the author of that paper is also known to have contributed to OpenSSL,
> I had a quick look and it would appear that a comment in
> crypto/sha/asm/sha256-mb-x86_64.pl speaking about "lanes" suggests that
> OpenSSL uses the ideas from the paper, even if b783858654 (x86_64 assembly
> pack: add multi-block AES-NI, SHA1 and SHA256., 2013-10-03) does not talk
> about the paper specifically.
>
> The numbers shown in
> https://github.com/openssl/openssl/blob/master/crypto/sha/asm/keccak1600-x86_64.pl#L28
> and in
> https://github.com/openssl/openssl/blob/master/crypto/sha/asm/sha256-mb-x86_64.pl#L17
>
> are sufficiently satisfying.

This one is about actual SHA-256, but computing the hash of multiple
streams in a single function call.  The paper to read is [2].  We could
probably take advantage of it for e.g. bulk-checkin and index-pack.
Most other code paths that compute hashes wouldn't be able to benefit
from it.

Thanks,
Jonathan

[1] https://public-inbox.org/git/20170616212414.GC133952@aiede.mtv.corp.google.com/
[2] https://eprint.iacr.org/2012/371

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-13 23:30               ` Linus Torvalds
@ 2017-09-14 18:45                 ` Johannes Schindelin
  2017-09-18 12:17                   ` Gilles Van Assche
  2017-09-26 17:05                   ` Jason Cooper
  0 siblings, 2 replies; 113+ messages in thread
From: Johannes Schindelin @ 2017-09-14 18:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder,
	Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King,
	David Lang, brian m. carlson

Hi Linus,

On Wed, 13 Sep 2017, Linus Torvalds wrote:

> On Wed, Sep 13, 2017 at 6:43 AM, demerphq <demerphq@gmail.com> wrote:
> >
> > SHA3 however uses a completely different design where it mixes a 1088
> > bit block into a 1600 bit state, for a leverage of 2:3, and the excess
> > is *preserved between each block*.
> 
> Yes. And considering that the SHA1 attack was actually predicated on
> the fact that each block was independent (no extra state between), I
> do think SHA3 is a better model.
> 
> So I'd rather see SHA3-256 than SHA256.

SHA-256 got much more cryptanalysis than SHA3-256, and apart from the
length-extension problem that does not affect Git's usage, there are no
known weaknesses so far.

It would seem that the experts I talked to were much more concerned about
that amount of attention than the particulars of the algorithm. My
impression was that the new features of SHA3 were less studied than the
well-known features of SHA2, and that the new-ness of SHA3 is not
necessarily a good thing.

You will have to deal with the fact that I trust the crypto experts'
opinion on this a lot more than your opinion. Sure, you learned from the
fact that you had been warned about SHA-1 already seeing theoretical
attacks in 2005 and still choosing to hard-wire it into Git. And yet, you
are still no more of a cryptography expert than I am.

Ciao,
Dscho

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-14 12:39               ` Johannes Schindelin
  2017-09-14 16:36                 ` Brandon Williams
@ 2017-09-14 18:49                 ` Jonathan Nieder
  2017-09-15 20:42                   ` Philip Oakley
  1 sibling, 1 reply; 113+ messages in thread
From: Jonathan Nieder @ 2017-09-14 18:49 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Brandon Williams, Junio C Hamano, Linus Torvalds,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	David Lang, brian m. carlson

Johannes Schindelin wrote:
> On Wed, 13 Sep 2017, Jonathan Nieder wrote:

>> As a side note, I am probably misreading, but I found this set of
>> paragraphs a bit condescending.  It sounds to me like you are saying
>> "You are making the wrong choice of hash function and everything else
>> you are describing is irrelevant when compared to that monumental
>> mistake.  Please stop working on things I don't consider important".
>> With that reading it is quite demotivating to read.
>
> I am sorry you read it that way. I did not feel condescending when I wrote
> that mail, I felt annoyed by the side track, and anxious. In my mind, the
> transition is too important for side tracking, and I worry that we are not
> fast enough (imagine what would happen if a better attack was discovered
> that is not as easily detected as the one we know about?).

Thanks for clarifying.  That makes sense.

[...]
> As to *my* opinion: after reading https://goo.gl/gh2Mzc (is it really
> correct that its last update has been on March 6th?), my only concern is
> really that it still talks about SHA3-256 when I think that the
> performance benefits of SHA-256 (think: "Git at scale", and also hardware
> support) really make the latter a better choice.
>
> In order to be "ironed out", I think we need to talk about the
> implementation detail "Translation table". This is important. It needs to
> be *fast*.
>
> Speaking of *fast*, I could imagine that it would make sense to store the
> SHA-1 objects on disk, still, instead of converting them on the fly. I am
> not sure whether this is something we need to define in the document,
> though, as it may very well be premature optimization; Maybe mention that
> we could do this if necessary?
>
> Apart from that, I would *love* to see this document as The Official Plan
> that I can Show To The Manager so that I can ask to Allocate Time.

Sounds promising!

Thanks much for this feedback.  This is very helpful for knowing what
v4 of the doc needs.

The discussion of the translation table in [1] didn't make it to the
doc.  You're right that it needs to.

Caching SHA-1 objects (and the pros and cons involved) makes sense to
mention in an "ideas for future work" section.

An implementation plan with well-defined pieces for people to take on
and estimates of how much work each involves may be useful for Showing
To The Manager.  So I'll include a sketch of that for reviewers to
poke holes in, too.

Another thing the doc doesn't currently describe is how Git protocol
would work.  That's worth sketching in a "future work" section as
well.

Sorry it has been taking so long to get this out.  I think we should
have something ready to send on Monday.

Thanks,
Jonathan

[1] https://public-inbox.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-14 15:45                         ` demerphq
@ 2017-09-14 22:06                           ` Johannes Schindelin
  0 siblings, 0 replies; 113+ messages in thread
From: Johannes Schindelin @ 2017-09-14 22:06 UTC (permalink / raw)
  To: demerphq
  Cc: Junio C Hamano, Jonathan Nieder, Stefan Beller, Brandon Williams,
	Linus Torvalds, Git Mailing List, Jonathan Tan, Jeff King,
	David Lang, brian m. carlson

Hi,

On Thu, 14 Sep 2017, demerphq wrote:

> On 14 September 2017 at 17:23, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> >
> > SHA-256 has been hammered on a lot more than SHA3-256.
> 
> Last year that was even more true of SHA1 than it is true of SHA-256
> today.

I hope you are not deliberately trying to annoy me. I say that because you
seemed to be interested enough in cryptography to know that the known
attacks on SHA-256 *today* are unlikely to extend to Git's use case,
whereas the known attacks on SHA-1 *in 2005* were already raising doubts.

So while SHA-1 has been hammered on for longer than SHA-256, the latter
came out a lot less scathed than the former.

Besides, you are totally missing the point here that the choice is *not*
between SHA-1 and SHA-256, but between SHA-256 and SHA3-256.

After all, we would not consider any hash algorithm with known problems
(as far as Git's usage is concerned). The amount of scrutiny with which
the algorithm was investigated would only be a deciding factor among the
remaining choices, yes?

In any case, don't trust me on cryptography (just like I do not trust you
on that matter). Trust the cryptographers. I contacted some of my
colleagues who are responsible for crypto, and the two who seem to
disagree on pretty much everything agreed on this one thing: that SHA-256
would be a good choice for Git (and one of them suggested that it would be
much better than SHA3-256, because SHA-256 saw more cryptanalysis).

Ciao,
Johannes

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-14 18:40                   ` Jonathan Nieder
@ 2017-09-14 22:09                     ` Johannes Schindelin
  0 siblings, 0 replies; 113+ messages in thread
From: Johannes Schindelin @ 2017-09-14 22:09 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: demerphq, Brandon Williams, Junio C Hamano, Linus Torvalds,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	David Lang, brian m. carlson

Hi Jonathan,

On Thu, 14 Sep 2017, Jonathan Nieder wrote:

> Johannes Schindelin wrote:
> > On Wed, 13 Sep 2017, Jonathan Nieder wrote:
> 
> >> [3] https://www.imperialviolet.org/2017/05/31/skipsha3.html,
> >
> > I had read this shortly after it was published, and had missed the updates.
> > One link in particular caught my eye:
> >
> > 	https://eprint.iacr.org/2012/476
> >
> > Essentially, the authors demonstrate that using SIMD technology can speed
> > up computation by a factor of 2 for longer messages (2kB being considered
> > "long" already). It is a little bit unclear to me from a cursory look
> > whether their fast algorithm computes SHA-256, or something similar.
> 
> The latter: that paper is about a variant on SHA-256 called SHA-256x4
> (or SHA-256x16 to take advantage of newer instructions).  It's a
> different hash function.  This is what I was alluding to at [1].

Thanks for the explanation!

> > As the author of that paper is also known to have contributed to OpenSSL,
> > I had a quick look and it would appear that a comment in
> > crypto/sha/asm/sha256-mb-x86_64.pl speaking about "lanes" suggests that
> > OpenSSL uses the ideas from the paper, even if b783858654 (x86_64 assembly
> > pack: add multi-block AES-NI, SHA1 and SHA256., 2013-10-03) does not talk
> > about the paper specifically.
> >
> > The numbers shown in
> > https://github.com/openssl/openssl/blob/master/crypto/sha/asm/keccak1600-x86_64.pl#L28
> > and in
> > https://github.com/openssl/openssl/blob/master/crypto/sha/asm/sha256-mb-x86_64.pl#L17
> >
> > are sufficiently satisfying.
> 
> This one is about actual SHA-256, but computing the hash of multiple
> streams in a single function call.  The paper to read is [2].  We could
> probably take advantage of it for e.g. bulk-checkin and index-pack.
> Most other code paths that compute hashes wouldn't be able to benefit
> from it.

Again, thanks for the explanation.

Ciao,
Dscho

> [1] https://public-inbox.org/git/20170616212414.GC133952@aiede.mtv.corp.google.com/
> [2] https://eprint.iacr.org/2012/371
> 

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-14 18:49                 ` Jonathan Nieder
@ 2017-09-15 20:42                   ` Philip Oakley
  0 siblings, 0 replies; 113+ messages in thread
From: Philip Oakley @ 2017-09-15 20:42 UTC (permalink / raw)
  To: Jonathan Nieder, Johannes Schindelin
  Cc: Brandon Williams, Junio C Hamano, Linus Torvalds,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	David Lang, brian m. carlson

Hi Jonathan,

"Jonathan Nieder" <jrnieder@gmail.com> wrote:
> Johannes Schindelin wrote:
>> On Wed, 13 Sep 2017, Jonathan Nieder wrote:
>
>>> As a side note, I am probably misreading, but I found this set of
>>> paragraphs a bit condescending.  It sounds to me like you are saying
>>> "You are making the wrong choice of hash function and everything else
>>> you are describing is irrelevant when compared to that monumental
>>> mistake.  Please stop working on things I don't consider important".
>>> With that reading it is quite demotivating to read.
>>
>> I am sorry you read it that way. I did not feel condescending when I wrote
>> that mail, I felt annoyed by the side track, and anxious. In my mind, the
>> transition is too important for side tracking, and I worry that we are not
>> fast enough (imagine what would happen if a better attack was discovered
>> that is not as easily detected as the one we know about?).
>
> Thanks for clarifying.  That makes sense.
>
> [...]
>> As to *my* opinion: after reading https://goo.gl/gh2Mzc (is it really
>> correct that its last update has been on March 6th?), my only concern is
>> really that it still talks about SHA3-256 when I think that the
>> performance benefits of SHA-256 (think: "Git at scale", and also hardware
>> support) really make the latter a better choice.
>>
>> In order to be "ironed out", I think we need to talk about the
>> implementation detail "Translation table". This is important. It needs to
>> be *fast*.
>>
>> Speaking of *fast*, I could imagine that it would make sense to store the
>> SHA-1 objects on disk, still, instead of converting them on the fly. I am
>> not sure whether this is something we need to define in the document,
>> though, as it may very well be premature optimization; Maybe mention that
>> we could do this if necessary?
>>
>> Apart from that, I would *love* to see this document as The Official Plan
>> that I can Show To The Manager so that I can ask to Allocate Time.
>
> Sounds promising!
>
> Thanks much for this feedback.  This is very helpful for knowing what
> v4 of the doc needs.
>
> The discussion of the translation table in [1] didn't make it to the
> doc.  You're right that it needs to.
>
> Caching SHA-1 objects (and the pros and cons involved) makes sense to
> mention in an "ideas for future work" section.
>
> An implementation plan with well-defined pieces for people to take on
> and estimates of how much work each involves may be useful for Showing
> To The Manager.  So I'll include a sketch of that for reviewers to
> poke holes in, too.
>
> Another thing the doc doesn't currently describe is how Git protocol
> would work.  That's worth sketching in a "future work" section as
> well.
>
> Sorry it has been taking so long to get this out.  I think we should
> have something ready to send on Monday.

I had a look at the current doc https://goo.gl/gh2Mzc and thought that the
selection of the "NewHash" should be separated out into a section of its
own as a 'separation of concerns', so that the general transition plan only
refers to the "NewHash", so as not to accidentally pre-judge that selection.

I did look up the arguments regarding sha2 (sha256) versus sha3-256 and 
found these two Q&A items

https://security.stackexchange.com/questions/152360/should-we-be-using-sha3-2017

https://security.stackexchange.com/questions/86283/how-does-sha3-keccak-shake-compare-to-sha2-should-i-use-non-shake-parameter

with an onward link to this:
 https://www.imperialviolet.org/2012/10/21/nist.html

"NIST may not have you in mind (21 Oct 2012)"

"A couple of weeks back, NIST announced that Keccak would be SHA-3. Keccak 
has somewhat disappointing software performance but is a gift to hardware 
implementations."

which does appear to cover some of the concerns that dscho had noted, and 
speed does appear to be a core Git selling point.

It would be worth at least covering these trade-offs in the "select a 
NewHash" section of the document, as at the end of the day it will be a 
political judgement about what the future might hold regarding the 
contenders.

What may also be worth noting is the fall back plan should the chosen 
NewHash be the first to fail, perhaps spectacularly, as having a ready plan 
could support the choice at risk.

>
> Thanks,
> Jonathan
>
> [1] https://public-inbox.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/

--
Philip 


* Re: RFC v3: Another proposed hash function transition plan
  2017-09-14 18:45                 ` Johannes Schindelin
@ 2017-09-18 12:17                   ` Gilles Van Assche
  2017-09-18 22:16                     ` Johannes Schindelin
  2017-09-18 22:25                     ` Jonathan Nieder
  2017-09-26 17:05                   ` Jason Cooper
  1 sibling, 2 replies; 113+ messages in thread
From: Gilles Van Assche @ 2017-09-18 12:17 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Jeff King, David Lang, brian m. carlson, Keccak Team

Hi Johannes,

> SHA-256 got much more cryptanalysis than SHA3-256 […].

I do not think this is true. Keccak/SHA-3 actually got (and is still
getting) a lot of cryptanalysis, with papers published at renowned
crypto conferences [1].

Keccak/SHA-3 is recognized to have a significant safety margin. E.g.,
one can cut the number of rounds in half (as in Keyak or KangarooTwelve)
and still get a very strong function. I don't think we could say the
same for SHA-256 or SHA-512…

Kind regards,
Gilles, for the Keccak team

[1] https://keccak.team/third_party.html

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-18 12:17                   ` Gilles Van Assche
@ 2017-09-18 22:16                     ` Johannes Schindelin
  2017-09-19 16:45                       ` Gilles Van Assche
  2017-09-18 22:25                     ` Jonathan Nieder
  1 sibling, 1 reply; 113+ messages in thread
From: Johannes Schindelin @ 2017-09-18 22:16 UTC (permalink / raw)
  To: Gilles Van Assche
  Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Jeff King, David Lang, brian m. carlson, Keccak Team

Hi Gilles,

On Mon, 18 Sep 2017, Gilles Van Assche wrote:

> > SHA-256 got much more cryptanalysis than SHA3-256 […].
> 
> I do not think this is true.

Please read what I said again: SHA-256 got much more cryptanalysis than
SHA3-256.

I never said that SHA3-256 got little cryptanalysis. Personally, I think
that SHA3-256 got a ton more cryptanalysis than SHA-1, and that SHA-256
*still* got more cryptanalysis. But my opinion does not count, really.
However, the two experts I pestered with questions over questions left me
with that strong impression, and their opinion does count.

> Keccak/SHA-3 actually got (and is still getting) a lot of cryptanalysis,
> with papers published at renowned crypto conferences [1].
> 
> Keccak/SHA-3 is recognized to have a significant safety margin. E.g.,
> one can cut the number of rounds in half (as in Keyak or KangarooTwelve)
> and still get a very strong function. I don't think we could say the
> same for SHA-256 or SHA-512…

Again, I do not want to criticize SHA3/Keccak. Personally, I have a lot of
respect for Keccak.

I also have a lot of respect for everybody who scrutinized the SHA2 family
of algorithms.

I also respect the fact that there are more implementations of SHA-256,
and thanks to everybody seeming to demand SHA-256 checksums instead of
SHA-1 or MD5 for downloads, bugs in those implementations are probably
discovered relatively quickly, and I also cannot ignore the prospect of
hardware support for SHA-256.

In any case, having SHA3 as a fallback in case SHA-256 gets broken seems
like a very good safety net to me.

Ciao,
Johannes

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-18 12:17                   ` Gilles Van Assche
  2017-09-18 22:16                     ` Johannes Schindelin
@ 2017-09-18 22:25                     ` Jonathan Nieder
  1 sibling, 0 replies; 113+ messages in thread
From: Jonathan Nieder @ 2017-09-18 22:25 UTC (permalink / raw)
  To: Gilles Van Assche
  Cc: Johannes Schindelin, Linus Torvalds, demerphq, Brandon Williams,
	Junio C Hamano, Git Mailing List, Stefan Beller, Jonathan Tan,
	Jeff King, David Lang, brian m. carlson, Keccak Team

Hi,

Gilles Van Assche wrote:
> Hi Johannes,

>> SHA-256 got much more cryptanalysis than SHA3-256 […].
>
> I do not think this is true. Keccak/SHA-3 actually got (and is still
> getting) a lot of cryptanalysis, with papers published at renowned
> crypto conferences [1].
>
> Keccak/SHA-3 is recognized to have a significant safety margin. E.g.,
> one can cut the number of rounds in half (as in Keyak or KangarooTwelve)
> and still get a very strong function. I don't think we could say the
> same for SHA-256 or SHA-512…

I just wanted to thank you for paying attention to this conversation
and weighing in.

Most of the regulars in the git project are not crypto experts.  This
kind of extra information (and e.g. [2]) is very useful to us.

Thanks,
Jonathan

> Kind regards,
> Gilles, for the Keccak team
>
> [1] https://keccak.team/third_party.html
[2] https://public-inbox.org/git/91a34c5b-7844-3db2-cf29-411df5bcf886@noekeon.org/

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-18 22:16                     ` Johannes Schindelin
@ 2017-09-19 16:45                       ` Gilles Van Assche
  2017-09-29 13:17                         ` Johannes Schindelin
  0 siblings, 1 reply; 113+ messages in thread
From: Gilles Van Assche @ 2017-09-19 16:45 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Jeff King, David Lang, brian m. carlson, Keccak Team

Hi Johannes,

Thanks for your feedback.

On 19/09/17 00:16, Johannes Schindelin wrote:
>>> SHA-256 got much more cryptanalysis than SHA3-256 […]. 
>>
>> I do not think this is true. 
>
> Please read what I said again: SHA-256 got much more cryptanalysis
> than SHA3-256.

Indeed. What I meant is that SHA3-256 got at least as much cryptanalysis
as SHA-256. :-)

> I never said that SHA3-256 got little cryptanalysis. Personally, I
> think that SHA3-256 got a ton more cryptanalysis than SHA-1, and that
> SHA-256 *still* got more cryptanalysis. But my opinion does not count,
> really. However, the two experts I pestered with questions over
> questions left me with that strong impression, and their opinion does
> count.

OK, I respect your opinion and that of your two experts. Yet, the "much
more" part of your statement, in particular, is something that may
require a bit more explanation.

Kind regards,
Gilles


* Re: RFC v3: Another proposed hash function transition plan
  2017-09-14 18:45                 ` Johannes Schindelin
  2017-09-18 12:17                   ` Gilles Van Assche
@ 2017-09-26 17:05                   ` Jason Cooper
  2017-09-26 22:11                     ` Johannes Schindelin
  1 sibling, 1 reply; 113+ messages in thread
From: Jason Cooper @ 2017-09-26 17:05 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Jeff King, David Lang, brian m. carlson

Hi all,

Sorry for late commentary...

On Thu, Sep 14, 2017 at 08:45:35PM +0200, Johannes Schindelin wrote:
> On Wed, 13 Sep 2017, Linus Torvalds wrote:
> > On Wed, Sep 13, 2017 at 6:43 AM, demerphq <demerphq@gmail.com> wrote:
> > > SHA3 however uses a completely different design where it mixes a 1088
> > > bit block into a 1600 bit state, for a leverage of 2:3, and the excess
> > > is *preserved between each block*.
> > 
> > Yes. And considering that the SHA1 attack was actually predicated on
> > the fact that each block was independent (no extra state between), I
> > do think SHA3 is a better model.
> > 
> > So I'd rather see SHA3-256 than SHA256.

Well, for what it's worth, we need to be aware that SHA3 is *different*.
In crypto, "different" = "bugs haven't been found yet".  :-P

And SHA2 is *known*.  So we have a pretty good handle on how it'll
weaken over time.

> SHA-256 got much more cryptanalysis than SHA3-256, and apart from the
> length-extension problem that does not affect Git's usage, there are no
> known weaknesses so far.

While I think that statement is true on its face (particularly when
including post-competition analysis), I don't think it's sufficient
justification to choose one over the other.

> It would seem that the experts I talked to were much more concerned about
> that amount of attention than the particulars of the algorithm. My
> impression was that the new features of SHA3 were less studied than the
> well-known features of SHA2, and that the new-ness of SHA3 is not
> necessarily a good thing.

The only thing I really object to here is the abstract "experts".  We're
talking about cryptography and integrity here.  It's no longer
sufficient to cite anonymous experts.  Either they can put their
thoughts, opinions and analysis on record here, or it shouldn't be
considered.  Sorry.

Other than their anonymity, though, I do agree with your experts'
assessments.

However, whether we choose SHA2 or SHA3 doesn't matter.  Moving away from
SHA1 does.  Once the object_id code is in place to facilitate that
transition, the problem is solved from git's perspective.

If SHA3 is chosen as the successor, it's going to get a *lot* more
adoption, and thus, a lot more analysis.  If cracks start to show, the
hard work of making git flexible is already done.  We can migrate to
SHA4/5/whatever in an orderly fashion with far less effort than the
transition away from SHA1.

For my use cases, as a user of git, I have a plan to maintain provable
integrity of existing objects stored in git under sha1 while migrating
away from sha1.  The same plan works for migrating away from SHA2 or
SHA3 when the time comes.

thx,

Jason.

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-26 17:05                   ` Jason Cooper
@ 2017-09-26 22:11                     ` Johannes Schindelin
  2017-09-26 22:25                       ` [PATCH] technical doc: add a design doc for hash function transition Stefan Beller
                                         ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: Johannes Schindelin @ 2017-09-26 22:11 UTC (permalink / raw)
  To: Jason Cooper
  Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Jeff King, David Lang, brian m. carlson

Hi Jason,

On Tue, 26 Sep 2017, Jason Cooper wrote:

> On Thu, Sep 14, 2017 at 08:45:35PM +0200, Johannes Schindelin wrote:
> > On Wed, 13 Sep 2017, Linus Torvalds wrote:
> > > On Wed, Sep 13, 2017 at 6:43 AM, demerphq <demerphq@gmail.com> wrote:
> > > > SHA3 however uses a completely different design where it mixes a 1088
> > > > bit block into a 1600 bit state, for a leverage of 2:3, and the excess
> > > > is *preserved between each block*.
> > > 
> > > Yes. And considering that the SHA1 attack was actually predicated on
> > > the fact that each block was independent (no extra state between), I
> > > do think SHA3 is a better model.
> > > 
> > > So I'd rather see SHA3-256 than SHA256.
> 
> Well, for what it's worth, we need to be aware that SHA3 is *different*.
> In crypto, "different" = "bugs haven't been found yet".  :-P
> 
> And SHA2 is *known*.  So we have a pretty good handle on how it'll
> weaken over time.

Here, you seem to agree with me.

> > SHA-256 got much more cryptanalysis than SHA3-256, and apart from the
> > length-extension problem that does not affect Git's usage, there are no
> > known weaknesses so far.
> 
> While I think that statement is true on its face (particularly when
> including post-competition analysis), I don't think it's sufficient
> justification to choose one over the other.

And here you don't.

I find that very confusing.

> > It would seem that the experts I talked to were much more concerned about
> > that amount of attention than the particulars of the algorithm. My
> > impression was that the new features of SHA3 were less studied than the
> > well-known features of SHA2, and that the new-ness of SHA3 is not
> > necessarily a good thing.
> 
> The only thing I really object to here is the abstract "experts".  We're
> talking about cryptography and integrity here.  It's no longer
> sufficient to cite anonymous experts.  Either they can put their
> thoughts, opinions and analysis on record here, or it shouldn't be
> considered.  Sorry.

Sorry, you are asking cryptography experts to spend their time on the Git
mailing list. I tried to get them to speak out on the Git mailing list.
They respectfully declined.

I can't fault them, they have real jobs to do, and none of their managers
would be happy for them to educate the Git mailing list on matters of
cryptography, not after what happened in 2005.

> Other than their anonymity, though, I do agree with your experts'
> assessments.

I know what our in-house cryptography experts have to prove to start
working at Microsoft. Forgive me, but you are not a known entity to me.

> However, whether we choose SHA2 or SHA3 doesn't matter.

To you, it does not matter.

To me, it matters. To the several thousand developers working on Windows,
probably the largest Git repository in active use, it matters. It matters
because the speed difference that has little impact on you has a lot more
impact on us.

> Moving away from SHA1 does.  Once the object_id code is in place to
> facilitate that transition, the problem is solved from git's
> perspective.

Uh oh. You forgot the mapping. And the protocol. And pretty much
everything except the oid.

> If SHA3 is chosen as the successor, it's going to get a *lot* more
> adoption, and thus, a lot more analysis.  If cracks start to show, the
> hard work of making git flexible is already done.  We can migrate to
> SHA4/5/whatever in an orderly fashion with far less effort than the
> transition away from SHA1.

Sure. And if XYZ789 is chosen, it's going to get a *lot* more adoption,
too.

We think.

Let's be realistic. Git is pretty important to us, but it is not important
enough to sway, say, Intel into announcing hardware support for SHA3.

And if you try to force through *any* hash function only so that it gets
more adoption and hence more support, in the short run you will make life
harder for developers on more obscure platforms, who may not easily get
high-quality, high-speed implementations of anything but the very
mainstream (which is, let's face it, MD5, SHA-1 and SHA-256). I know I
would have cursed you for such a decision back when I had to work on AIX
and IRIX.

> For my use cases, as a user of git, I have a plan to maintain provable
> integrity of existing objects stored in git under sha1 while migrating
> away from sha1.  The same plan works for migrating away from SHA2 or
> SHA3 when the time comes.

Please do not make the mistake of taking your use case to be a template
for everybody's use case.

Migrating a large team away from any hash function to another one *will*
be painful, and costly.

Migrating will be very costly for hosting companies like GitHub, Microsoft
and BitBucket, too.

Ciao,
Johannes

* [PATCH] technical doc: add a design doc for hash function transition
  2017-09-26 22:11                     ` Johannes Schindelin
@ 2017-09-26 22:25                       ` Stefan Beller
  2017-09-26 23:38                         ` Jonathan Nieder
  2017-09-26 23:51                       ` RFC v3: Another proposed hash function transition plan Jonathan Nieder
  2017-10-02 14:00                       ` Jason Cooper
  2 siblings, 1 reply; 113+ messages in thread
From: Stefan Beller @ 2017-09-26 22:25 UTC (permalink / raw)
  To: johannes.schindelin
  Cc: bmwill, david, demerphq, git, gitster, jason, jonathantanmy,
	jrnieder, peff, sandals, sbeller, torvalds, Jonathan Nieder

From: Jonathan Nieder <jrn@google.com>

This is "RFC v3: Another proposed hash function transition plan" from
the git mailing list.

Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Brandon Williams <bmwill@google.com>
Signed-off-by: Stefan Beller <sbeller@google.com>
---

 This takes the original Google Doc[1] and adds it to our history,
 such that the discussion can be on list and in the commit messages.
 
 * replaced SHA3-256 with NEWHASH, sha3 with newhash
 * added section 'Implementation plan'
 * added section 'Future work'
 * added section 'Agreed-upon criteria for selecting NewHash'
 
 As the discussion is restarting, here is our attempt to add value to it.
 We had planned to polish the document more, but given the timing we are
 posting it as-is.
  
 Thanks.

[1] https://docs.google.com/document/d/18hYAQCTsDgaFUo-VJGhT0UqyetL2LbAzkWNK1fYS8R0/edit

 Documentation/Makefile                             |   1 +
 .../technical/hash-function-transition.txt         | 571 +++++++++++++++++++++
 2 files changed, 572 insertions(+)
 create mode 100644 Documentation/technical/hash-function-transition.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 2415e0d657..471bb29725 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -67,6 +67,7 @@ SP_ARTICLES += howto/maintain-git
 API_DOCS = $(patsubst %.txt,%,$(filter-out technical/api-index-skel.txt technical/api-index.txt, $(wildcard technical/api-*.txt)))
 SP_ARTICLES += $(API_DOCS)
 
+TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
 TECH_DOCS += technical/pack-format
diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
new file mode 100644
index 0000000000..0ac751d600
--- /dev/null
+++ b/Documentation/technical/hash-function-transition.txt
@@ -0,0 +1,571 @@
+Git hash function transition
+============================
+
+Objective
+---------
+Migrate Git from SHA-1 to a stronger hash function.
+
+Background
+----------
+At its core, the Git version control system is a content addressable
+filesystem. It uses the SHA-1 hash function to name content. For
+example, files, directories, and revisions are referred to by hash
+values unlike in other traditional version control systems where files
+or versions are referred to via sequential numbers. The use of a hash
+function to address its content delivers a few advantages:
+
+* Integrity checking is easy. Bit flips, for example, are easily
+  detected, as the hash of corrupted content does not match its name.
+* Lookup of objects is fast.
+
+Using a cryptographically secure hash function brings additional
+advantages:
+
+* Object names can be signed and third parties can trust the hash to
+  address the signed object and all objects it references.
+* Communication using Git protocol and out of band communication
+  methods have a short reliable string that can be used to reliably
+  address stored content.
+
+Over time some flaws in SHA-1 have been discovered by security
+researchers. https://shattered.io demonstrated a practical SHA-1 hash
+collision. As a result, SHA-1 cannot be considered cryptographically
+secure any more. This impacts the communication of hash values because
+we cannot trust that a given hash value represents the known good
+version of content that the speaker intended.
+
+SHA-1 still possesses the other properties such as fast object lookup
+and safe error checking, but other hash functions that are believed to
+be cryptographically secure are equally suitable.
+
+Goals
+-----
+1. The transition to NEWHASH can be done one local repository at a time.
+   a. Requiring no action by any other party.
+   b. A NEWHASH repository can communicate with SHA-1 Git servers
+      (push/fetch).
+   c. Users can use SHA-1 and NEWHASH identifiers for objects
+      interchangeably.
+   d. New signed objects make use of a stronger hash function than
+      SHA-1 for their security guarantees.
+2. Allow a complete transition away from SHA-1.
+   a. Local metadata for SHA-1 compatibility can be removed from a
+      repository if compatibility with SHA-1 is no longer needed.
+3. Maintainability throughout the process.
+   a. The object format is kept simple and consistent.
+   b. Creation of a generalized repository conversion tool.
+
+Non-Goals
+---------
+1. Add NEWHASH support to Git protocol. This is valuable and the
+   logical next step but it is out of scope for this initial design.
+2. Transparently improving the security of existing SHA-1 signed
+   objects.
+3. Intermixing objects using multiple hash functions in a single
+   repository.
+4. Taking the opportunity to fix other bugs in git's formats and
+   protocols.
+5. Shallow clones and fetches into a NEWHASH repository. (This will
+   change when we add NEWHASH support to Git protocol.)
+6. Skip fetching some submodules of a project into a NEWHASH
+   repository. (This also depends on NEWHASH support in Git
+   protocol.)
+
+Overview
+--------
+We introduce a new repository format extension `newhash`. Repositories
+with this extension enabled use NEWHASH instead of SHA-1 to name
+their objects. This affects both object names and object content ---
+both the names of objects and all references to other objects within
+an object are switched to the new hash function.
+
+newhash repositories cannot be read by older versions of Git.
+
+Alongside the packfile, a newhash repository stores a bidirectional
+mapping between newhash and sha1 object names in a new format of .idx files.
+The mapping is generated locally and can be verified using "git fsck".
+Object lookups use this mapping to allow naming objects using either
+their sha1 or newhash names interchangeably.
+
+"git cat-file" and "git hash-object" gain options to display an object
+in its sha1 form and write an object given its sha1 form. This
+requires all objects referenced by that object to be present in the
+object database so that they can be named using the appropriate name
+(using the bidirectional hash mapping).
+
+Fetches from a SHA-1 based server convert the fetched objects into
+newhash form and record the mapping in the bidirectional mapping table
+(see below for details). Pushes to a SHA-1 based server convert the
+objects being pushed into sha1 form so the server does not have to be
+aware of the hash function the client is using.
+
+Detailed Design
+---------------
+Object names
+~~~~~~~~~~~~
+Objects can be named by their 40 hexadecimal digit sha1-name or <n>
+hexadecimal digit newhash-name, plus names derived from those (see
+gitrevisions(7)).
+
+The sha1-name of an object is the SHA-1 of the concatenation of its
+type, length, a nul byte, and the object's sha1-content. This is the
+traditional <sha1> used in Git to name objects.
+
+The newhash-name of an object is the NEWHASH of the concatenation of its
+type, length, a nul byte, and the object's newhash-content.
+
+Object format
+~~~~~~~~~~~~~
+The byte-sequence content of a tag, commit, or tree object differs
+between its sha1 and newhash forms because an object named by newhash-name
+refers to other objects by their newhash-names and an object named by
+sha1-name refers to other objects by their sha1-names.
+
+The newhash-content of an object is the same as its sha1-content, except
+that objects referenced by the object are named using their newhash-names
+instead of sha1-names. Because a blob object does not refer to any
+other object, its sha1-content and newhash-content are the same.
+
+The format allows round-trip conversion between newhash-content and
+sha1-content.
+
+Object storage
+~~~~~~~~~~~~~~
+Loose objects use zlib compression and packed objects use the packed
+format described in Documentation/technical/pack-format.txt, just like
+today. The content that is compressed and stored uses newhash-content
+instead of sha1-content.
+
+Translation table
+~~~~~~~~~~~~~~~~~
+A fast bidirectional mapping between sha1-names and newhash-names of all
+local objects in the repository is kept on disk.
+
+For pack files, upgrade the .idx file to be as follows:
+
+  4 magic bytes
+  header, containing pointers to the 3 lists below
+
+  list of
+  abbrev sha1 -> ordinal, sorted by sha1
+
+  list of
+  abbrev newhash -> ordinal, sorted by newhash
+
+  list of
+  ordinal, complete sha1, complete newhash,
+  sorted by ordinal, such that a lookup can be computed after looking into
+  one of the first lists.
+
+For unpacked objects, keep a simple list
+  sha1 -> newhash
+around at $OBJECT_DIR/loose-lookup
+
+All operations that make new objects (e.g., "git commit") add the new
+objects to the translation table.
+
+(This work could have been deferred to push time, but that would
+significantly complicate and slow down pushes. Calculating the
+sha1-name at object creation time at the same time it is being
+streamed to disk and having its newhash-name calculated should be an
+acceptable cost.)
+
+Reading an object's sha1-content
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The sha1-content of an object can be read by converting all newhash-names
+its newhash-content references to sha1-names using the translation table.
+
+Fetch
+~~~~~
+Fetching from a SHA-1 based server requires translating between SHA-1
+and NEWHASH based representations on the fly.
+
+SHA-1s named in the ref advertisement that are present on the client
+can be translated to NEWHASH and looked up as local objects using the
+translation table.
+
+Negotiation proceeds as today. Any "have"s generated locally are
+converted to SHA-1 before being sent to the server, and SHA-1s
+mentioned by the server are converted to NEWHASH when looking them up
+locally.
+
+After negotiation, the server sends a packfile containing the
+requested objects. We convert the packfile to NEWHASH format using
+the following steps:
+
+1. index-pack: inflate each object in the packfile and compute its
+   SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
+   objects the client has locally. These objects can be looked up
+   using the translation table and their sha1-content read as
+   described above to resolve the deltas.
+2. topological sort: starting at the "want"s from the negotiation
+   phase, walk through objects in the pack and emit a list of them,
+   excluding blobs, in reverse topologically sorted order, with each
+   object coming later in the list than all objects it references.
+   (This list only contains objects reachable from the "wants". If the
+   pack from the server contained additional extraneous objects, then
+   they will be discarded.)
+3. convert to newhash: open a new (newhash) packfile. Read the topologically
+   sorted list just generated. For each object, inflate its
+   sha1-content, convert to newhash-content, and write it to the newhash
+   pack. Include the new sha1<->newhash mapping entry in the translation
+   table.
+4. sort: reorder entries in the new pack to match the order of objects
+   in the pack the server generated and include blobs. Write a newhash idx
+   file.
+5. clean up: remove the SHA-1 based pack file, index, and
+   topologically sorted list obtained from the server and steps 1
+   and 2.
+
+Step 3 requires every object referenced by the new object to be in the
+translation table. This is why the topological sort step is necessary.
+
+As an optimization, step 1 could write a file describing what non-blob
+objects each object it has inflated from the packfile references. This
+makes the topological sort in step 2 possible without inflating the
+objects in the packfile for a second time. The objects need to be
+inflated again in step 3, for a total of two inflations.
+
+Step 4 is probably necessary for good read-time performance. "git
+pack-objects" on the server optimizes the pack file for good data
+locality (see Documentation/technical/pack-heuristics.txt).
+
+Details of this process are likely to change. It will take some
+experimenting to get this to perform well.
+
+Push
+~~~~
+Push is simpler than fetch because the objects referenced by the
+pushed objects are already in the translation table. The sha1-content
+of each object being pushed can be read as described in the "Reading
+an object's sha1-content" section to generate the pack written by git
+send-pack.
+
+Signed Commits
+~~~~~~~~~~~~~~
+We add a new field "gpgsig-newhash" to the commit object format to allow
+signing commits without relying on SHA-1. It is similar to the
+existing "gpgsig" field. Its signed payload is the newhash-content of the
+commit object with any "gpgsig" and "gpgsig-newhash" fields removed.
+
+This means commits can be signed
+1. using SHA-1 only, as in existing signed commit objects
+2. using both SHA-1 and NEWHASH, by using both gpgsig-newhash and gpgsig
+   fields.
+3. using only NEWHASH, by only using the gpgsig-newhash field.
+
+Old versions of "git verify-commit" can verify the gpgsig signature in
+cases (1) and (2) without modifications and view case (3) as an
+ordinary unsigned commit.
+
+Signed Tags
+~~~~~~~~~~~
+We add a new field "gpgsig-newhash" to the tag object format to allow
+signing tags without relying on SHA-1. Its signed payload is the
+newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP
+SIGNATURE-----" delimited in-body signature removed.
+
+This means tags can be signed
+1. using SHA-1 only, as in existing signed tag objects
+2. using both SHA-1 and NEWHASH, by using gpgsig-newhash and an in-body
+   signature.
+3. using only NEWHASH, by only using the gpgsig-newhash field.
+
+Mergetag embedding
+~~~~~~~~~~~~~~~~~~
+The mergetag field in the sha1-content of a commit contains the
+sha1-content of a tag that was merged by that commit.
+
+The mergetag field in the newhash-content of the same commit contains the
+newhash-content of the same tag.
+
+Submodules
+~~~~~~~~~~
+To convert recorded submodule pointers, you need to have the converted
+submodule repository in place. The translation table of the submodule
+can be used to look up the new hash.
+
+Caveats
+-------
+Invalid objects
+~~~~~~~~~~~~~~~
+The conversion from sha1-content to newhash-content retains any
+brokenness in the original object (e.g., tree entry modes encoded with
+leading 0, tree objects whose paths are not sorted correctly, and
+commit objects without an author or committer). This is a deliberate
+feature of the design to allow the conversion to round-trip.
+
+More profoundly broken objects (e.g., a commit with a truncated "tree"
+header line) cannot be converted but were not usable by current Git
+anyway.
+
+Shallow clone and submodules
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Because it requires all referenced objects to be available in the
+locally generated translation table, this design does not support
+shallow clone or unfetched submodules. Protocol improvements might
+allow lifting this restriction.
+
+Alternates
+~~~~~~~~~~
+For the same reason, a newhash repository cannot borrow objects from a
+sha1 repository using objects/info/alternates or
+$GIT_ALTERNATE_OBJECT_REPOSITORIES.
+
+git notes
+~~~~~~~~~
+The "git notes" tool annotates objects using their sha1-name as key.
+This design does not describe a way to migrate notes trees to use
+newhash-names. That migration is expected to happen separately (for
+example using a file at the root of the notes tree to describe which
+hash it uses).
+
+Server-side cost
+~~~~~~~~~~~~~~~~
+Until Git protocol gains NEWHASH support, using newhash based storage on
+public-facing Git servers is strongly discouraged. Once Git protocol
+gains NEWHASH support, newhash based servers are likely not to support
+sha1 compatibility, to avoid what may be a very expensive hash
+reencode during clone and to encourage peers to modernize.
+
+The design described here allows fetches by SHA-1 clients of a
+personal NEWHASH repository because it's not much more difficult than
+allowing pushes from that repository. This support needs to be guarded
+by a configuration option --- servers like git.kernel.org that serve a
+large number of clients would not be expected to bear that cost.
+
+Meaning of signatures
+~~~~~~~~~~~~~~~~~~~~~
+The signed payload for signed commits and tags does not explicitly
+name the hash used to identify objects. If some day Git adopts a new
+hash function with the same length as the current SHA-1 (40
+hexadecimal digits) or NEWHASH (64 hexadecimal digits) object names, then
+the intent behind the PGP signed payload in an object signature is
+unclear:
+
+	object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
+	type commit
+	tag v2.12.0
+	tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800
+
+	Git 2.12
+
+Does this mean Git v2.12.0 is the commit with sha1-name
+e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
+new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?
+
+Fortunately NEWHASH and SHA-1 have different lengths. If Git starts
+using another hash with the same length to name objects, then it will
+need to change the format of signed payloads using that hash to
+address this issue.
+
+Alternatives considered
+-----------------------
+Upgrading everyone working on a particular project on a flag day
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Projects like the Linux kernel are large and complex enough that
+flipping the switch for all projects based on the repository at once
+is infeasible.
+
+Not only would all developers and server operators supporting
+developers have to switch on the same flag day, but supporting tooling
+(continuous integration, code review, bug trackers, etc) would have to
+be adapted as well. This also makes it difficult to get early feedback
+from some project participants testing before it is time for mass
+adoption.
+
+Using hash functions in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+(e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ )
+Objects newly created would be addressed by the new hash, but inside
+such an object (e.g. commit) it is still possible to address objects
+using the old hash function.
+* You cannot trust its history (needed for bisectability) in the
+  future without further work
+* Maintenance burden as the number of supported hash functions grows
+  (they will never go away, so they accumulate). In this proposal, by
+  comparison, converted objects lose all references to SHA-1.
+
+Signed objects with multiple hashes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Instead of introducing the gpgsig-newhash field in commit and tag objects
+for newhash-content based signatures, an earlier version of this design
+added "hash newhash <newhash-name>" fields to strengthen the existing
+sha1-content based signatures.
+
+In other words, a single signature was used to attest to the object
+content using both hash functions. This had some advantages:
+* Using one signature instead of two speeds up the signing process.
+* Having one signed payload with both hashes allows the signer to
+  attest to the sha1-name and newhash-name referring to the same object.
+* All users consume the same signature. Broken signatures are likely
+  to be detected quickly using current versions of git.
+
+However, it also came with disadvantages:
+* Verifying a signed object requires access to the sha1-names of all
+  objects it references, even after the transition is complete and
+  translation table is no longer needed for anything else. To support
+  this, the design added fields such as "hash sha1 tree <sha1-name>"
+  and "hash sha1 parent <sha1-name>" to the newhash-content of a signed
+  commit, complicating the conversion process.
+* Allowing signed objects without a sha1 (for after the transition is
+  complete) complicated the design further, requiring a "nohash sha1"
+  field to suppress including "hash sha1" fields in the newhash-content
+  and signed payload.
+
+
+Implementation plan
+-------------------
+
+Here's a rough list of some useful tasks, in no particular order:
+
+1. bc/object-id: This patch series continues, eliminating assumptions
+   about the size of object ids by encapsulating them in a struct.
+   One straightforward way to find code that still needs to be
+   converted is to grep for "sha" --- often the conversion patches
+   change function and variable names to refer to oid_ where they used
+   to use sha1_, making the stragglers easier to spot.
+
+2. Hard-coded object ids in tests: Many tests beyond t00* make assumptions
+   about the exact values of object ids.  That's bad for maintainability
+   for other reasons beyond the hash function transition, too.
+
+   It should be possible to suss them out by patching git's sha1
+   routine to use the ones-complement of sha1 (~sha1) instead and
+   seeing which tests fail.
+
+3. Repository format extension to use a different hash function: we
+   want git to be able to work with two hash functions: sha1 and
+   something else.  For interoperability and simplicity, it is useful
+   for a single git binary to support both hash functions.
+
+   That means a repository needs to be able to specify what hash
+   function is used for the objects in that repository.  This can be
+   configured by setting '[core] repositoryformatversion=1' (to avoid
+   confusing old versions of git) and
+   '[extensions] experimentalNewHashFunction = true'.
+   Documentation/technical/repository-version.txt has more details.
+
+   We can start experimenting with this using e.g. the ~sha1 function
+   described at (2), or the 160-bit hash of the patch author's choice
+   (e.g. truncated blake2bp-256).
+
+4. When choosing a hash function, people may argue about performance.
+   It would be useful to run some benchmarks for git (running
+   the test suite, t/perf tests, etc) using a variety of hash
+   functions as input to such a discussion.
+
+5. Longer hash: Even once all object id references in git use struct
+   object_id (see (1)), we need to tackle other assumptions about
+   object id size in git and its tests.
+
+   It should be possible to suss them out by replacing git's sha1
+   routine with a 40-byte hash: sha1 with each byte repeated (sha1+sha1)
+   and seeing what fails.
+
+6. Repository format extension for longer hash: As in (3), we could
+   add a repository format extension to experiment with using the
+   sha1+sha1 function.
+
+7. Avoiding wasted memory from unused hash functions: struct object_id
+   has definition 'unsigned char hash[GIT_MAX_RAWSZ]', where
+   GIT_MAX_RAWSZ is the size of the largest supported hash function.
+   When operating on a repository that only uses sha1, this wastes
+   memory.
+
+   Avoid that by making object identifiers variable-sized.  That is,
+   something like
+
+     struct object_id {
+        union {
+           unsigned char hash20[20];
+           unsigned char hash32[32];
+        } *hash;
+     };
+
+   or
+
+     struct object_id {
+       unsigned char *hash;
+     };
+
+   The hard part is that allocation and destruction have to be
+   explicit instead of happening automatically when an object_id is an
+   automatic variable.
+
+8. Implementation of this plan (roughly in order):
+   - abstract the hash computation to be able to plug in another hash
+   - make the choice of hash dependent on repository extension
+   - implement the new .idx format
+   - implement cat-file's flag to show things in old/new hash
+   - convert fetch, push
+
+9. We can use help from security experts in all of this.  Fuzzing,
+   analysis of how we use cryptography, security review of other parts
+   of the design, and information to help choose a hash function are
+   all appreciated.
+
+Agreed-upon criteria for selecting NewHash
+------------------------------------------
+
+The discussion about which hash function to use is going in circles, so
+let's first agree on criteria for selecting the new hash function. These
+could include:
+* cryptographic strength
+* performance
+* other cryptographic aspects(?)
+* portability / availability of properly licensed implementations
+
+Future work
+-----------
+
+* other compression instead of zlib (this is a stated non goal, though!)
+* rehash discussion whether to include generation numbers natively
+  (this is a stated non goal, though!)
+* describing (1) the possibility of caching translated objects and
+  (2) protocol changes.
+* other format changes
+
+Document History
+----------------
+
+2017-03-03
+bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com,
+sbeller@google.com
+
+Initial version sent to
+http://public-inbox.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com
+
+2017-03-03 jrnieder@gmail.com
+Incorporated suggestions from jonathantanmy and sbeller:
+* describe purpose of signed objects with each hash type
+* redefine signed object verification using object content under the
+  first hash function
+
+2017-03-06 jrnieder@gmail.com
+* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2]
+* Make sha3-based signatures a separate field, avoiding the need for
+  "hash" and "nohash" fields (thanks to peff[3]).
+* Add a sorting phase to fetch (thanks to Junio for noticing the need
+  for this).
+* Omit blobs from the topological sort during fetch (thanks to peff).
+* Discuss alternates, git notes, and git servers in the caveats
+  section (thanks to Junio Hamano, brian m. carlson[4], and Shawn
+  Pearce).
+* Clarify language throughout (thanks to various commenters,
+  especially Junio).
+
+[1] http://public-inbox.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/
+[2] http://public-inbox.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/
+[3] http://public-inbox.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/
+[4] http://public-inbox.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net
+
+2017-09-25
+* replaced SHA3-256 with NEWHASH, sha3 with newhash
+* added section 'Implementation plan'
+* added section 'Future work'
+
+* This version is sent to the list; to be incorporated into git.git, such
+  that further document history is found using git-log.
+
+
-- 
2.14.0.rc0.3.g6c2e499285
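
As an illustration of the "Object names" and "Object format" sections of the patch above, here is a minimal Python sketch; it is not part of the patch. It assumes SHA-256 as a stand-in for the NEWHASH placeholder, because the document only says that NEWHASH names are 64 hexadecimal digits. The helper names `object_name` and `translate_commit_content` are hypothetical, as is the in-memory dict used in place of the on-disk translation table. Only the `tree` and `parent` header lines of a commit are handled; mergetag fields, tags, and tree objects are left out.

```python
import hashlib


def object_name(obj_type, content, algo="sha1"):
    # Git hashes "<type> <length>\0" followed by the object content.
    header = obj_type.encode() + b" " + str(len(content)).encode() + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


def translate_commit_content(content, translation):
    # Rewrite the object names on "tree" and "parent" header lines using
    # a name -> name dict; every other line is copied unchanged, which is
    # what lets the conversion round-trip.
    headers, sep, body = content.partition(b"\n\n")
    out = []
    for line in headers.split(b"\n"):
        key, _, value = line.partition(b" ")
        if key in (b"tree", b"parent"):
            line = key + b" " + translation[value.decode()].encode()
        out.append(line)
    return b"\n".join(out) + sep + body


# The empty tree has the same (empty) content in both forms; only its
# name differs between the two hash functions.
EMPTY_TREE_SHA1 = object_name("tree", b"")
EMPTY_TREE_NEW = object_name("tree", b"", algo="sha256")  # NEWHASH stand-in

# Toy bidirectional translation table (assumption: a plain dict pair).
sha1_to_new = {EMPTY_TREE_SHA1: EMPTY_TREE_NEW}
new_to_sha1 = {new: old for old, new in sha1_to_new.items()}

commit_sha1_content = (
    b"tree " + EMPTY_TREE_SHA1.encode() + b"\n"
    b"author A U Thor <author@example.com> 0 +0000\n"
    b"committer A U Thor <author@example.com> 0 +0000\n"
    b"\n"
    b"initial\n"
)

commit_newhash_content = translate_commit_content(commit_sha1_content, sha1_to_new)
commit_sha1_name = object_name("commit", commit_sha1_content)
commit_newhash_name = object_name("commit", commit_newhash_content, algo="sha256")
```

Because the referenced tree must already be in the table before the commit can be translated, the sketch also shows why the fetch design needs a topological sort before conversion.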



* Re: [PATCH] technical doc: add a design doc for hash function transition
  2017-09-26 22:25                       ` [PATCH] technical doc: add a design doc for hash function transition Stefan Beller
@ 2017-09-26 23:38                         ` Jonathan Nieder
  0 siblings, 0 replies; 113+ messages in thread
From: Jonathan Nieder @ 2017-09-26 23:38 UTC (permalink / raw)
  To: Stefan Beller
  Cc: johannes.schindelin, bmwill, david, demerphq, git, gitster, jason,
	jonathantanmy, peff, sandals, torvalds, Jonathan Nieder

Hi,

Stefan Beller wrote:

> From: Jonathan Nieder <jrn@google.com>

I go by jrnieder@gmail.com upstream. :)

> This is "RFC v3: Another proposed hash function transition plan" from
> the git mailing list.
>
> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> Signed-off-by: Brandon Williams <bmwill@google.com>
> Signed-off-by: Stefan Beller <sbeller@google.com>

I hadn't signed-off on this version, but it's not a big deal.

[...]
> ---
>
>  This takes the original Google Doc[1] and adds it to our history,
>  such that the discussion can be on list and in the commit messages.
>
>  * replaced SHA3-256 with NEWHASH, sha3 with newhash
>  * added section 'Implementation plan'
>  * added section 'Future work'
>  * added section 'Agreed-upon criteria for selecting NewHash'

Thanks for sending this out.  I had let it stall too long.

As a tiny nit, I think NewHash is easier to read than NEWHASH.  Not a
big deal.  More importantly, we need some text describing it and
saying it's a placeholder.

The implementation plan included here is out of date.  It comes from
an email where I was answering a question about what people can do to
make progress, before this design had been agreed on.  In the context
of this design there are other steps we'd want to describe (having to
do with implementing the translation table, etc).

I also planned to add a description of the translation table based on
what was discussed previously in this thread.

Jonathan


* Re: RFC v3: Another proposed hash function transition plan
  2017-09-26 22:11                     ` Johannes Schindelin
  2017-09-26 22:25                       ` [PATCH] technical doc: add a design doc for hash function transition Stefan Beller
@ 2017-09-26 23:51                       ` Jonathan Nieder
  2017-10-02 14:54                         ` Jason Cooper
  2017-10-02 14:00                       ` Jason Cooper
  2 siblings, 1 reply; 113+ messages in thread
From: Jonathan Nieder @ 2017-09-26 23:51 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Jason Cooper, Linus Torvalds, demerphq, Brandon Williams,
	Junio C Hamano, Git Mailing List, Stefan Beller, Jonathan Tan,
	Jeff King, David Lang, brian m. carlson

Hi,

Johannes Schindelin wrote:

> Sorry, you are asking cryptography experts to spend their time on the Git
> mailing list. I tried to get them to speak out on the Git mailing list.
> They respectfully declined.
>
> I can't fault them, they have real jobs to do, and none of their managers
> would be happy for them to educate the Git mailing list on matters of
> cryptography, not after what happened in 2005.

Fortunately we have had a few public comments from crypto specialists:

https://public-inbox.org/git/91a34c5b-7844-3db2-cf29-411df5bcf886@noekeon.org/
https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidT5P8eFEBbSgX_dS88knz2q_LSw@mail.gmail.com/
https://public-inbox.org/git/CAL9PXLxMHG1nP5_GQaK_WSJTNKs=_qbaL6V5v2GzVG=9VU2+gA@mail.gmail.com/
https://public-inbox.org/git/59BFB95D.1030903@st.com/
https://public-inbox.org/git/59C149A3.6080506@st.com/

[...]
> Let's be realistic. Git is pretty important to us, but it is not important
> enough to sway, say, Intel into announcing hardware support for SHA3.

Yes, I agree with this.  (Adoption by Git could lead to adoption by
some other projects, leading to more work on high quality software
implementations in projects like OpenSSL, but I am not convinced that
that would be a good thing for the world anyway.  There are downsides
to a proliferation of too many crypto primitives.  This is the basic
argument described in more detail at [1].)

[...]
> On Tue, 26 Sep 2017, Jason Cooper wrote:

>> For my use cases, as a user of git, I have a plan to maintain provable
>> integrity of existing objects stored in git under sha1 while migrating
>> away from sha1.  The same plan works for migrating away from SHA2 or
>> SHA3 when the time comes.
>
> Please do not make the mistake of taking your use case to be a template
> for everybody's use case.

That said, I'm curious about what plan you are alluding to.  Is it
something that could benefit others on the list?

Thanks,
Jonathan

[1] https://www.imperialviolet.org/2017/05/31/skipsha3.html


* [PATCH v4] technical doc: add a design doc for hash function transition
  2017-03-09 19:14     ` Shawn Pearce
  2017-03-09 20:24       ` Jonathan Nieder
@ 2017-09-28  4:43       ` Jonathan Nieder
  2017-09-29  6:06         ` Junio C Hamano
                           ` (4 more replies)
  1 sibling, 5 replies; 113+ messages in thread
From: Jonathan Nieder @ 2017-09-28  4:43 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Linus Torvalds, Git Mailing List, Stefan Beller, bmwill,
	Jonathan Tan, Jeff King, David Lang, brian m. carlson,
	Masaya Suzuki, demerphq, The Keccak Team, Johannes Schindelin

This document describes what a transition to a new hash function for
Git would look like.  Add it to Documentation/technical/ as the plan
of record so that future changes can be recorded as patches.

Also-by: Brandon Williams <bmwill@google.com>
Also-by: Jonathan Tan <jonathantanmy@google.com>
Also-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
On Thu, Mar 09, 2017 at 11:14 AM, Shawn Pearce wrote:
> On Mon, Mar 6, 2017 at 4:17 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:

>> Thanks for the kind words on what had quite a few flaws still.  Here's
>> a new draft.  I think the next version will be a patch against
>> Documentation/technical/.
>
> FWIW, I like this approach.

Okay, here goes.

Instead of sharding the loose object translation tables by first byte,
we went for a single table.  It simplifies the design and we need to
keep the number of loose objects under control anyway.

We also included a description of the transition plan and tried to
include a summary of what has been agreed upon so far about the choice
of hash function.

Thanks to Junio for reviving the discussion and in particular to Dscho
for pushing this forward and making the missing pieces clearer.

Thoughts of all kinds welcome, as always.

 Documentation/Makefile                             |   1 +
 .../technical/hash-function-transition.txt         | 797 +++++++++++++++++++++
 2 files changed, 798 insertions(+)
 create mode 100644 Documentation/technical/hash-function-transition.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 2415e0d657..471bb29725 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -67,6 +67,7 @@ SP_ARTICLES += howto/maintain-git
 API_DOCS = $(patsubst %.txt,%,$(filter-out technical/api-index-skel.txt technical/api-index.txt, $(wildcard technical/api-*.txt)))
 SP_ARTICLES += $(API_DOCS)
 
+TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
 TECH_DOCS += technical/pack-format
diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
new file mode 100644
index 0000000000..417ba491d0
--- /dev/null
+++ b/Documentation/technical/hash-function-transition.txt
@@ -0,0 +1,797 @@
+Git hash function transition
+============================
+
+Objective
+---------
+Migrate Git from SHA-1 to a stronger hash function.
+
+Background
+----------
+At its core, the Git version control system is a content addressable
+filesystem. It uses the SHA-1 hash function to name content. For
+example, files, directories, and revisions are referred to by hash
+values, unlike in other traditional version control systems where files
+or versions are referred to via sequential numbers. The use of a hash
+function to address its content delivers a few advantages:
+
+* Integrity checking is easy. Bit flips, for example, are easily
+  detected, as the hash of corrupted content does not match its name.
+* Lookup of objects is fast.
+
+Using a cryptographically secure hash function brings additional
+advantages:
+
+* Object names can be signed and third parties can trust the hash to
+  address the signed object and all objects it references.
+* Communication using Git protocol and out of band communication
+  methods have a short string that can be used to reliably address
+  stored content.
+
+Over time some flaws in SHA-1 have been discovered by security
+researchers. https://shattered.io demonstrated a practical SHA-1 hash
+collision. As a result, SHA-1 cannot be considered cryptographically
+secure any more. This impacts the communication of hash values because
+we cannot trust that a given hash value represents the known good
+version of content that the speaker intended.
+
+SHA-1 still possesses the other properties such as fast object lookup
+and safe error checking, but other hash functions that are believed to
+be cryptographically secure are equally suitable.
+
+Goals
+-----
+Where NewHash is a strong 256-bit hash function to replace SHA-1 (see
+"Selection of a New Hash", below):
+
+1. The transition to NewHash can be done one local repository at a time.
+   a. Requiring no action by any other party.
+   b. A NewHash repository can communicate with SHA-1 Git servers
+      (push/fetch).
+   c. Users can use SHA-1 and NewHash identifiers for objects
+      interchangeably (see "Object names on the command line", below).
+   d. New signed objects make use of a stronger hash function than
+      SHA-1 for their security guarantees.
+2. Allow a complete transition away from SHA-1.
+   a. Local metadata for SHA-1 compatibility can be removed from a
+      repository if compatibility with SHA-1 is no longer needed.
+3. Maintainability throughout the process.
+   a. The object format is kept simple and consistent.
+   b. Creation of a generalized repository conversion tool.
+
+Non-Goals
+---------
+1. Add NewHash support to Git protocol. This is valuable and the
+   logical next step but it is out of scope for this initial design.
+2. Transparently improving the security of existing SHA-1 signed
+   objects.
+3. Intermixing objects using multiple hash functions in a single
+   repository.
+4. Taking the opportunity to fix other bugs in Git's formats and
+   protocols.
+5. Shallow clones and fetches into a NewHash repository. (This will
+   change when we add NewHash support to Git protocol.)
+6. Skip fetching some submodules of a project into a NewHash
+   repository. (This also depends on NewHash support in Git
+   protocol.)
+
+Overview
+--------
+We introduce a new repository format extension. Repositories with this
+extension enabled use NewHash instead of SHA-1 to name their objects.
+This affects both object names and object content --- both the names
+of objects and all references to other objects within an object are
+switched to the new hash function.
+
+NewHash repositories cannot be read by older versions of Git.
+
+Alongside the packfile, a NewHash repository stores a bidirectional
+mapping between NewHash and SHA-1 object names. The mapping is generated
+locally and can be verified using "git fsck". Object lookups use this
+mapping to allow naming objects using their SHA-1 and NewHash names
+interchangeably.
+
+"git cat-file" and "git hash-object" gain options to display an object
+in its sha1 form and write an object given its sha1 form. This
+requires all objects referenced by that object to be present in the
+object database so that they can be named using the appropriate name
+(using the bidirectional hash mapping).
+
+Fetches from a SHA-1 based server convert the fetched objects into
+NewHash form and record the mapping in the bidirectional mapping table
+(see below for details). Pushes to a SHA-1 based server convert the
+objects being pushed into sha1 form so the server does not have to be
+aware of the hash function the client is using.
+
+Detailed Design
+---------------
+Repository format extension
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+A NewHash repository uses repository format version `1` (see
+Documentation/technical/repository-version.txt) with extensions
+`objectFormat` and `compatObjectFormat`:
+
+	[core]
+		repositoryFormatVersion = 1
+	[extensions]
+		objectFormat = newhash
+		compatObjectFormat = sha1
+
+Specifying a repository format extension ensures that versions of Git
+not aware of NewHash do not try to operate on these repositories,
+instead producing an error message:
+
+	$ git status
+	fatal: unknown repository extensions found:
+		objectformat
+		compatobjectformat
+
+See the "Transition plan" section below for more details on these
+repository extensions.
+
+Object names
+~~~~~~~~~~~~
+Objects can be named by their 40 hexadecimal digit sha1-name or 64
+hexadecimal digit newhash-name, plus names derived from those (see
+gitrevisions(7)).
+
+The sha1-name of an object is the SHA-1 of the concatenation of its
+type, length, a nul byte, and the object's sha1-content. This is the
+traditional <sha1> used in Git to name objects.
+
+The newhash-name of an object is the NewHash of the concatenation of its
+type, length, a nul byte, and the object's newhash-content.
+
+Object format
+~~~~~~~~~~~~~
+The content as a byte sequence of a tag, commit, or tree object named
+by sha1 and by newhash differs because an object named by newhash-name
+refers to other objects by their newhash-names and an object named by
+sha1-name refers to other objects by their sha1-names.
+
+The newhash-content of an object is the same as its sha1-content, except
+that objects referenced by the object are named using their newhash-names
+instead of sha1-names. Because a blob object does not refer to any
+other object, its sha1-content and newhash-content are the same.
+
+The format allows round-trip conversion between newhash-content and
+sha1-content.
+
+Object storage
+~~~~~~~~~~~~~~
+Loose objects use zlib compression and packed objects use the packed
+format described in Documentation/technical/pack-format.txt, just like
+today. The content that is compressed and stored uses newhash-content
+instead of sha1-content.
+
+Pack index
+~~~~~~~~~~
+Pack index (.idx) files use a new v3 format that supports multiple
+hash functions. They have the following format (all integers are in
+network byte order):
+
+- A header appears at the beginning and consists of the following:
+  - The 4-byte pack index signature: '\377tOc'
+  - 4-byte version number: 3
+  - 4-byte length of the header section, including the signature and
+    version number
+  - 4-byte number of objects contained in the pack
+  - 4-byte number of object formats in this pack index: 2
+  - For each object format:
+    - 4-byte format identifier (e.g., 'sha1' for SHA-1)
+    - 4-byte length in bytes of shortened object names. This is the
+      shortest possible length needed to make names in the shortened
+      object name table unambiguous.
+    - 4-byte integer, recording where tables relating to this format
+      are stored in this index file, as an offset from the beginning.
+  - 4-byte offset to the trailer from the beginning of this file.
+  - Zero or more additional key/value pairs (4-byte key, 4-byte
+    value). Only one key is supported: 'PSRC'. See the "Loose objects
+    and unreachable objects" section for supported values and how this
+    is used.  All other keys are reserved. Readers must ignore
+    unrecognized keys.
+- Zero or more NUL bytes. This can optionally be used to improve the
+  alignment of the full object name table below.
+- Tables for the first object format:
+  - A sorted table of shortened object names.  These are prefixes of
+    the names of all objects in this pack file, packed together
+    without offset values to reduce the cache footprint of the binary
+    search for a specific object name.
+
+  - A table of full object names in pack order. This allows resolving
+    a reference to "the nth object in the pack file" (from a
+    reachability bitmap or from the next table of another object
+    format) to its object name.
+
+  - A table of 4-byte values mapping object name order to pack order.
+    For an object in the table of sorted shortened object names, the
+    value at the corresponding index in this table is the index in the
+    previous table for that same object.
+
+    This can be used to look up the object in reachability bitmaps or
+    to look up its name in another object format.
+
+  - A table of 4-byte CRC32 values of the packed object data, in the
+    order that the objects appear in the pack file. This is to allow
+    compressed data to be copied directly from pack to pack during
+    repacking without undetected data corruption.
+
+  - A table of 4-byte offset values. For an object in the table of
+    sorted shortened object names, the value at the corresponding
+    index in this table indicates where that object can be found in
+    the pack file. These are usually 31-bit pack file offsets, but
+    large offsets are encoded as an index into the next table with the
+    most significant bit set.
+
+  - A table of 8-byte offset entries (empty for pack files less than
+    2 GiB). Pack files are organized with heavily used objects toward
+    the front, so most object references should not need to refer to
+    this table.
+- Zero or more NUL bytes.
+- Tables for the second object format, with the same layout as above,
+  up to and not including the table of CRC32 values.
+- Zero or more NUL bytes.
+- The trailer consists of the following:
+  - A copy of the 20-byte NewHash checksum at the end of the
+    corresponding packfile.
+
+  - 20-byte NewHash checksum of all of the above.
+
+Loose object index
+~~~~~~~~~~~~~~~~~~
+A new file $GIT_OBJECT_DIR/loose-object-idx contains information about
+all loose objects. Its format is
+
+  # loose-object-idx
+  (newhash-name SP sha1-name LF)*
+
+where the object names are in hexadecimal format. The file is not
+sorted.
+
+The loose object index is protected against concurrent writes by a
+lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose
+object:
+
+1. Write the loose object to a temporary file, like today.
+2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock.
+3. Rename the loose object into place.
+4. Open loose-object-idx with O_APPEND and write the new object's entry.
+5. Unlink loose-object-idx.lock to release the lock.
+
+To remove entries (e.g. in "git pack-refs" or "git prune"):
+
+1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the
+   lock.
+2. Write the new content to loose-object-idx.lock.
+3. Unlink any loose objects being removed.
+4. Rename to replace loose-object-idx, releasing the lock.
+
+Translation table
+~~~~~~~~~~~~~~~~~
+The index files support a bidirectional mapping between sha1-names
+and newhash-names. The lookup proceeds similarly to ordinary object
+lookups. For example, to convert a sha1-name to a newhash-name:
+
+ 1. Look for the object in idx files. If a match is present in the
+    idx's sorted list of truncated sha1-names, then:
+    a. Read the corresponding entry in the sha1-name order to pack
+       name order mapping.
+    b. Read the corresponding entry in the full sha1-name table to
+       verify we found the right object. If it is, then
+    c. Read the corresponding entry in the full newhash-name table.
+       That is the object's newhash-name.
+ 2. Check for a loose object. Read lines from loose-object-idx until
+    we find a match.
+
+Step (1) takes the same amount of time as an ordinary object lookup:
+O(number of packs * log(objects per pack)). Step (2) takes O(number of
+loose objects) time. To maintain good performance it will be necessary
+to keep the number of loose objects low. See the "Loose objects and
+unreachable objects" section below for more details.
+
+Since all operations that make new objects (e.g., "git commit") add
+the new objects to the corresponding index, this mapping is possible
+for all objects in the object store.
+
+Reading an object's sha1-content
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The sha1-content of an object can be read by converting all newhash-names
+its newhash-content references to sha1-names using the translation table.
+
+Fetch
+~~~~~
+Fetching from a SHA-1 based server requires translating between SHA-1
+and NewHash based representations on the fly.
+
+SHA-1s named in the ref advertisement that are present on the client
+can be translated to NewHash and looked up as local objects using the
+translation table.
+
+Negotiation proceeds as today. Any "have"s generated locally are
+converted to SHA-1 before being sent to the server, and SHA-1s
+mentioned by the server are converted to NewHash when looking them up
+locally.
+
+After negotiation, the server sends a packfile containing the
+requested objects. We convert the packfile to NewHash format using
+the following steps:
+
+1. index-pack: inflate each object in the packfile and compute its
+   SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
+   objects the client has locally. These objects can be looked up
+   using the translation table and their sha1-content read as
+   described above to resolve the deltas.
+2. topological sort: starting at the "want"s from the negotiation
+   phase, walk through objects in the pack and emit a list of them,
+   excluding blobs, in reverse topologically sorted order, with each
+   object coming later in the list than all objects it references.
+   (This list only contains objects reachable from the "wants". If the
+   pack from the server contained additional extraneous objects, then
+   they will be discarded.)
+3. convert to newhash: open a new (newhash) packfile. Read the topologically
+   sorted list just generated. For each object, inflate its
+   sha1-content, convert to newhash-content, and write it to the newhash
+   pack. Record the new sha1<->newhash mapping entry for use in the idx.
+4. sort: reorder entries in the new pack to match the order of objects
+   in the pack the server generated and include blobs. Write a newhash idx
+   file.
+5. clean up: remove the SHA-1 based pack file, index, and
+   topologically sorted list obtained from the server in steps 1
+   and 2.
+
+Step 3 requires every object referenced by the new object to be in the
+translation table. This is why the topological sort step is necessary.
+
+As an optimization, step 1 could write a file describing what non-blob
+objects each object it has inflated from the packfile references. This
+makes the topological sort in step 2 possible without inflating the
+objects in the packfile for a second time. The objects need to be
+inflated again in step 3, for a total of two inflations.
+
+Step 4 is probably necessary for good read-time performance. "git
+pack-objects" on the server optimizes the pack file for good data
+locality (see Documentation/technical/pack-heuristics.txt).
+
+Details of this process are likely to change. It will take some
+experimenting to get this to perform well.
+
+Push
+~~~~
+Push is simpler than fetch because the objects referenced by the
+pushed objects are already in the translation table. The sha1-content
+of each object being pushed can be read as described in the "Reading
+an object's sha1-content" section to generate the pack written by git
+send-pack.
+
+Signed Commits
+~~~~~~~~~~~~~~
+We add a new field "gpgsig-newhash" to the commit object format to allow
+signing commits without relying on SHA-1. It is similar to the
+existing "gpgsig" field. Its signed payload is the newhash-content of the
+commit object with any "gpgsig" and "gpgsig-newhash" fields removed.
+
+This means commits can be signed
+1. using SHA-1 only, as in existing signed commit objects
+2. using both SHA-1 and NewHash, by using both gpgsig-newhash and gpgsig
+   fields.
+3. using only NewHash, by only using the gpgsig-newhash field.
+
+Old versions of "git verify-commit" can verify the gpgsig signature in
+cases (1) and (2) without modifications and view case (3) as an
+ordinary unsigned commit.
+
+Signed Tags
+~~~~~~~~~~~
+We add a new field "gpgsig-newhash" to the tag object format to allow
+signing tags without relying on SHA-1. Its signed payload is the
+newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP
+SIGNATURE-----" delimited in-body signature removed.
+
+This means tags can be signed
+1. using SHA-1 only, as in existing signed tag objects
+2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body
+   signature.
+3. using only NewHash, by only using the gpgsig-newhash field.
+
+Mergetag embedding
+~~~~~~~~~~~~~~~~~~
+The mergetag field in the sha1-content of a commit contains the
+sha1-content of a tag that was merged by that commit.
+
+The mergetag field in the newhash-content of the same commit contains the
+newhash-content of the same tag.
+
+Submodules
+~~~~~~~~~~
+To convert recorded submodule pointers, you need to have the converted
+submodule repository in place. The translation table of the submodule
+can be used to look up the new hash.
+
+Loose objects and unreachable objects
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Fast lookups in the loose-object-idx require that the number of loose
+objects not grow too high.
+
+"git gc --auto" currently waits for there to be 6700 loose objects
+present before consolidating them into a packfile. We will need to
+measure to find a more appropriate threshold for it to use.
+
+"git gc --auto" currently waits for there to be 50 packs present
+before combining packfiles. Packing loose objects more aggressively
+may cause the number of pack files to grow too quickly. This can be
+mitigated by using a strategy similar to Martin Fick's exponential
+rolling garbage collection script:
+https://gerrit-review.googlesource.com/c/gerrit/+/35215
+
+"git gc" currently expels any unreachable objects it encounters in
+pack files to loose objects in an attempt to prevent a race when
+pruning them (in case another process is simultaneously writing a new
+object that refers to the about-to-be-deleted object). This leads to
+an explosion in the number of loose objects present and disk space
+usage due to the objects in delta form being replaced with independent
+loose objects.  Worse, the race is still present for loose objects.
+
+Instead, "git gc" will need to move unreachable objects to a new
+packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see
+below). To avoid the race when writing new objects referring to an
+about-to-be-deleted object, code paths that write new objects will
+need to copy any objects they refer to from UNREACHABLE_GARBAGE packs
+into new, non-UNREACHABLE_GARBAGE packs (or loose objects).
+UNREACHABLE_GARBAGE packs are then safe to delete if their creation
+time (as indicated by the file's mtime) is long enough ago.
+
+To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be
+combined under certain circumstances. If "gc.garbageTtl" is set to
+greater than one day, then packs created within a single calendar day,
+UTC, can be coalesced together. The resulting packfile would have an
+mtime before midnight on that day, so this makes the effective maximum
+ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day,
+then we divide the calendar day into intervals one-third of that ttl
+in duration. Packs created within the same interval can be coalesced
+together. The resulting packfile would have an mtime before the end of
+the interval, so this makes the effective maximum ttl equal to the
+garbageTtl * 4/3.
+
+This rule comes from Thirumala Reddy Mutchukota's JGit change
+https://git.eclipse.org/r/90465.
+
+The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack
+index. More generally, that field indicates where a pack came from:
+
+ - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network
+ - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight
+   "gc --auto" operation
+ - 3 (PACK_SOURCE_GC) for a pack created by a full gc
+ - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage
+   discovered by gc
+ - 5 (PACK_SOURCE_INSERT) for locally created objects that were
+   written directly to a pack file, e.g. from "git add ."
+
+This information can be useful for debugging and for "gc --auto" to
+make appropriate choices about which packs to coalesce.
+
+Caveats
+-------
+Invalid objects
+~~~~~~~~~~~~~~~
+The conversion from sha1-content to newhash-content retains any
+brokenness in the original object (e.g., tree entry modes encoded with
+leading 0, tree objects whose paths are not sorted correctly, and
+commit objects without an author or committer). This is a deliberate
+feature of the design to allow the conversion to round-trip.
+
+More profoundly broken objects (e.g., a commit with a truncated "tree"
+header line) cannot be converted but were not usable by current Git
+anyway.
+
+Shallow clone and submodules
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Because it requires all referenced objects to be available in the
+locally generated translation table, this design does not support
+shallow clone or unfetched submodules. Protocol improvements might
+allow lifting this restriction.
+
+Alternates
+~~~~~~~~~~
+For the same reason, a newhash repository cannot borrow objects from a
+sha1 repository using objects/info/alternates or
+$GIT_ALTERNATE_OBJECT_REPOSITORIES.
+
+git notes
+~~~~~~~~~
+The "git notes" tool annotates objects using their sha1-name as key.
+This design does not describe a way to migrate notes trees to use
+newhash-names. That migration is expected to happen separately (for
+example using a file at the root of the notes tree to describe which
+hash it uses).
+
+Server-side cost
+~~~~~~~~~~~~~~~~
+Until Git protocol gains NewHash support, using NewHash based storage
+on public-facing Git servers is strongly discouraged. Once Git
+protocol gains NewHash support, NewHash based servers are likely not
+to support SHA-1 compatibility, to avoid what may be a very expensive
+hash reencode during clone and to encourage peers to modernize.
+
+The design described here allows fetches by SHA-1 clients of a
+personal NewHash repository because it's not much more difficult than
+allowing pushes from that repository. This support needs to be guarded
+by a configuration option --- servers like git.kernel.org that serve a
+large number of clients would not be expected to bear that cost.
+
+Meaning of signatures
+~~~~~~~~~~~~~~~~~~~~~
+The signed payload for signed commits and tags does not explicitly
+name the hash used to identify objects. If some day Git adopts a new
+hash function with the same length as the current SHA-1 (40
+hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the
+intent behind the PGP signed payload in an object signature is
+unclear:
+
+	object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
+	type commit
+	tag v2.12.0
+	tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800
+
+	Git 2.12
+
+Does this mean Git v2.12.0 is the commit with sha1-name
+e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
+new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?
+
+Fortunately NewHash and SHA-1 have different lengths. If Git starts
+using another hash with the same length to name objects, then it will
+need to change the format of signed payloads using that hash to
+address this issue.
+
+Object names on the command line
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+To support the transition (see Transition plan below), this design
+supports four different modes of operation:
+
+ 1. ("dark launch") Treat object names input by the user as SHA-1 and
+    convert any object names written to output to SHA-1, but store
+    objects using NewHash.  This allows users to test the code with no
+    visible behavior change except for performance.  It also allows
+    running even tests that assume the SHA-1 hash function, to
+    sanity-check the behavior of the new mode.
+
+ 2. ("early transition") Allow both SHA-1 and NewHash object names in
+    input. Any object names written to output use SHA-1. This allows
+    users to continue to make use of SHA-1 to communicate with peers
+    (e.g. by email) that have not migrated yet and prepares for mode 3.
+
+ 3. ("late transition") Allow both SHA-1 and NewHash object names in
+    input. Any object names written to output use NewHash. In this
+    mode, users are using a more secure object naming method by
+    default.  The disruption is minimal as long as most of their peers
+    are in mode 2 or mode 3.
+
+ 4. ("post-transition") Treat object names input by the user as
+    NewHash and write output using NewHash. This is safer than mode 3
+    because there is less risk that input is incorrectly interpreted
+    using the wrong hash function.
+
+The mode is specified in configuration.
+
+The user can also explicitly specify which format to use for a
+particular revision specifier and for output, overriding the mode. For
+example:
+
+	git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash}
+
+Selection of a New Hash
+-----------------------
+In early 2005, around the time that Git was written, Xiaoyun Wang,
+Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1
+collisions in 2^69 operations. In August they published details.
+Luckily, no practical demonstrations of a collision in full SHA-1 were
+published until 12 years later, in 2017.
+
+The hash function NewHash to replace SHA-1 should be stronger than
+SHA-1 was: we would like it to be trustworthy and useful in practice
+for at least 10 years.
+
+Some other relevant properties:
+
+1. A 256-bit hash (long enough to match common security practice; not
+   so long as to hurt performance and disk usage).
+
+2. High quality implementations should be widely available (e.g. in
+   OpenSSL).
+
+3. The hash function's properties should match Git's needs (e.g. Git
+   requires collision and 2nd preimage resistance and does not require
+   length extension resistance).
+
+4. As a tiebreaker, the hash should be fast to compute (fortunately
+   many contenders are faster than SHA-1).
+
+Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16,
+K12, and BLAKE2bp-256.
+
+Transition plan
+---------------
+Some initial steps can be implemented independently of one another:
+- adding a hash function API (vtable)
+- teaching fsck to tolerate the gpgsig-newhash field
+- excluding gpgsig-* from the fields copied by "git commit --amend"
+- annotating tests that depend on SHA-1 values with a SHA1 test
+  prerequisite
+- using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ
+  consistently instead of "unsigned char *" and the hardcoded
+  constants 20 and 40.
+- introducing index v3
+- adding support for the PSRC field and safer object pruning
+
+
+The first user-visible change is the introduction of the objectFormat
+extension (without compatObjectFormat). This requires:
+- implementing the loose-object-idx
+- teaching fsck about this mode of operation
+- using the hash function API (vtable) when computing object names
+- signing objects and verifying signatures
+- rejecting attempts to fetch from or push to an incompatible
+  repository
+
+Next comes introduction of compatObjectFormat:
+- translating object names between object formats
+- translating object content between object formats
+- generating and verifying signatures in the compat format
+- adding appropriate index entries when adding a new object to the
+  object store
+- --output-format option
+- ^{sha1} and ^{newhash} revision notation
+- configuration to specify default input and output format (see
+  "Object names on the command line" above)
+
+The next step is supporting fetches and pushes to SHA-1 repositories:
+- allow pushes to a repository using the compat format
+- generate a topologically sorted list of the SHA-1 names of fetched
+  objects
+- convert the fetched packfile to newhash format and generate an idx
+  file
+- re-sort to match the order of objects in the fetched packfile
+
+The infrastructure supporting fetch also allows converting an existing
+repository. In converted repositories and new clones, end users can
+gain support for the new hash function without any visible change in
+behavior (see "dark launch" in the "Object names on the command line"
+section). In particular this allows users to verify NewHash signatures
+on objects in the repository, and it should ensure the transition code
+is stable in production in preparation for using it more widely.
+
+Over time projects would encourage their users to adopt the "early
+transition" and then "late transition" modes to take advantage of the
+new, more futureproof NewHash object names.
+
+When objectFormat and compatObjectFormat are both set, commands
+generating signatures would generate both SHA-1 and NewHash signatures
+by default to support both new and old users.
+
+In projects using NewHash heavily, users could be encouraged to adopt
+the "post-transition" mode to avoid accidentally making implicit use
+of SHA-1 object names.
+
+Once a critical mass of users have upgraded to a version of Git that
+can verify NewHash signatures and have converted their existing
+repositories to support verifying them, we can add support for a
+setting to generate only NewHash signatures. This is expected to be at
+least a year later.
+
+That is also a good moment to advertise the ability to convert
+repositories to use NewHash only, stripping out all SHA-1 related
+metadata. This improves performance by eliminating translation
+overhead and security by avoiding the possibility of accidentally
+relying on the safety of SHA-1.
+
+Updating Git's protocols to allow a server to specify which hash
+functions it supports is also an important part of this transition. It
+is not discussed in detail in this document but this transition plan
+assumes it happens. :)
+
+Alternatives considered
+-----------------------
+Upgrading everyone working on a particular project on a flag day
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Projects like the Linux kernel are large and complex enough that
+flipping the switch for all projects based on the repository at once
+is infeasible.
+
+Not only would all developers and server operators supporting
+developers have to switch on the same flag day, but supporting tooling
+(continuous integration, code review, bug trackers, etc) would have to
+be adapted as well. This also makes it difficult to get early feedback
+from some project participants testing before it is time for mass
+adoption.
+
+Using hash functions in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+(e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ )
+Objects newly created would be addressed by the new hash, but inside
+such an object (e.g. commit) it is still possible to address objects
+using the old hash function.
+* You cannot trust its history (needed for bisectability) in the
+  future without further work
+* Maintenance burden as the number of supported hash functions grows
+  (they will never go away, so they accumulate). In this proposal, by
+  comparison, converted objects lose all references to SHA-1.
+
+Signed objects with multiple hashes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Instead of introducing the gpgsig-newhash field in commit and tag objects
+for newhash-content based signatures, an earlier version of this design
+added "hash newhash <newhash-name>" fields to strengthen the existing
+sha1-content based signatures.
+
+In other words, a single signature was used to attest to the object
+content using both hash functions. This had some advantages:
+* Using one signature instead of two speeds up the signing process.
+* Having one signed payload with both hashes allows the signer to
+  attest to the sha1-name and newhash-name referring to the same object.
+* All users consume the same signature. Broken signatures are likely
+  to be detected quickly using current versions of git.
+
+However, it also came with disadvantages:
+* Verifying a signed object requires access to the sha1-names of all
+  objects it references, even after the transition is complete and
+  translation table is no longer needed for anything else. To support
+  this, the design added fields such as "hash sha1 tree <sha1-name>"
+  and "hash sha1 parent <sha1-name>" to the newhash-content of a signed
+  commit, complicating the conversion process.
+* Allowing signed objects without a sha1 (for after the transition is
+  complete) complicated the design further, requiring a "nohash sha1"
+  field to suppress including "hash sha1" fields in the newhash-content
+  and signed payload.
+
+Lazily populated translation table
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Some of the work of building the translation table could be deferred to
+push time, but that would significantly complicate and slow down pushes.
+Calculating the sha1-name at object creation time at the same time it is
+being streamed to disk and having its newhash-name calculated should be
+an acceptable cost.
+
+Document History
+----------------
+
+2017-03-03
+bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com,
+sbeller@google.com
+
+Initial version sent to
+http://public-inbox.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com
+
+2017-03-03 jrnieder@gmail.com
+Incorporated suggestions from jonathantanmy and sbeller:
+* describe purpose of signed objects with each hash type
+* redefine signed object verification using object content under the
+  first hash function
+
+2017-03-06 jrnieder@gmail.com
+* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2]
+* Make sha3-based signatures a separate field, avoiding the need for
+  "hash" and "nohash" fields (thanks to peff[3]).
+* Add a sorting phase to fetch (thanks to Junio for noticing the need
+  for this).
+* Omit blobs from the topological sort during fetch (thanks to peff).
+* Discuss alternates, git notes, and git servers in the caveats
+  section (thanks to Junio Hamano, brian m. carlson[4], and Shawn
+  Pearce).
+* Clarify language throughout (thanks to various commenters,
+  especially Junio).
+
+2017-09-27 jrnieder@gmail.com, sbeller@google.com
+* use placeholder NewHash instead of SHA3-256
+* describe criteria for picking a hash function.
+* include a transition plan (thanks especially to Brandon Williams
+  for fleshing these ideas out)
+* define the translation table (thanks, Shawn Pearce[5], Jonathan
+  Tan, and Masaya Suzuki)
+* avoid loose object overhead by packing more aggressively in
+  "git gc --auto"
+
+[1] http://public-inbox.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/
+[2] http://public-inbox.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/
+[3] http://public-inbox.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/
+[4] http://public-inbox.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net
+[5] https://public-inbox.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/
-- 
2.14.2.822.g60be5d43e6-goog



* Re: [PATCH v4] technical doc: add a design doc for hash function transition
  2017-09-28  4:43       ` [PATCH v4] technical doc: add a design doc for hash function transition Jonathan Nieder
@ 2017-09-29  6:06         ` Junio C Hamano
  2017-09-29  8:09           ` Junio C Hamano
  2017-09-29 17:34           ` Jonathan Nieder
  2017-10-02  9:02         ` Junio C Hamano
                           ` (3 subsequent siblings)
  4 siblings, 2 replies; 113+ messages in thread
From: Junio C Hamano @ 2017-09-29  6:06 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Shawn Pearce, Linus Torvalds, Git Mailing List, Stefan Beller,
	bmwill, Jonathan Tan, Jeff King, David Lang, brian m. carlson,
	Masaya Suzuki, demerphq, The Keccak Team, Johannes Schindelin

Jonathan Nieder <jrnieder@gmail.com> writes:

> This document describes what a transition to a new hash function for
> Git would look like.  Add it to Documentation/technical/ as the plan
> of record so that future changes can be recorded as patches.
>
> Also-by: Brandon Williams <bmwill@google.com>
> Also-by: Jonathan Tan <jonathantanmy@google.com>
> Also-by: Stefan Beller <sbeller@google.com>
> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
> ---

Shouldn't these all be s-o-b: (with a note immediately before that
to say all four contributed equally or something)?

> +Background
> +----------
> +At its core, the Git version control system is a content addressable
> +filesystem. It uses the SHA-1 hash function to name content. For
> +example, files, directories, and revisions are referred to by hash
> +values unlike in other traditional version control systems where files
> +or versions are referred to via sequential numbers. The use of a hash

Traditional systems refer to files via numbers???  Perhaps "where
versions of files are referred to via sequential numbers" or
something?

> +function to address its content delivers a few advantages:
> +
> +* Integrity checking is easy. Bit flips, for example, are easily
> +  detected, as the hash of corrupted content does not match its name.
> +* Lookup of objects is fast.

* There is no ambiguity what the object's name should be, given its
  content.

* Deduping the same content copied across versions and paths is
  automatic.

> +SHA-1 still possesses the other properties such as fast object lookup
> +and safe error checking, but other hash functions are equally suitable
> +that are believed to be cryptographically secure.

s/secure/more &/, perhaps?

> +Goals
> +-----
> +...
> +   c. Users can use SHA-1 and NewHash identifiers for objects
> +      interchangeably (see "Object names on the command line", below).

Mental note.  This needs to extend to the "index X..Y" lines in the
patch output, which is used by "apply -3" and "am -3".

> +2. Allow a complete transition away from SHA-1.
> +   a. Local metadata for SHA-1 compatibility can be removed from a
> +      repository if compatibility with SHA-1 is no longer needed.

I like the emphasis on "Local" here.  Metadata for compatibility that
is embedded in the objects obviously cannot be removed.

From that point of view, one of the goals ought to be "make sure
that as much SHA-1 compatibility metadata as possible is local and
outside the object".  This goal may not be able to say more than "as
much as possible", as signed objects that came from the SHA-1 world
need to carry the compatibility metadata somewhere somehow.

Or perhaps we could.  There is nothing that says a signed tag
created in the SHA-1 world must have the PGP/SHA-1 signature in the
NewHash payload---it could be split off of the object data and
stored in a local metadata cache, to be used only when we need to
convert it back to the SHA-1 world.

But I am getting ahead of myself before reading the proposal
through.

> +Non-Goals
> +---------
> ...
> +6. Skip fetching some submodules of a project into a NewHash
> +   repository. (This also depends on NewHash support in Git
> +   protocol.)

It is unclear what this means.  Around submodule support, one thing
I can think of is that a NewHash tree in a superproject would record
a gitlink that is a NewHash commit object name in it, therefore it
cannot refer to an unconverted SHA-1 submodule repository.  But it
is unclear if the above description refers to the same issue, or
something else.

> +Overview
> +--------
> +We introduce a new repository format extension. Repositories with this
> +extension enabled use NewHash instead of SHA-1 to name their objects.
> +This affects both object names and object content --- both the names
> +of objects and all references to other objects within an object are
> +switched to the new hash function.
> +
> +NewHash repositories cannot be read by older versions of Git.
> +
> +Alongside the packfile, a NewHash repository stores a bidirectional
> +mapping between NewHash and SHA-1 object names. The mapping is generated
> +locally and can be verified using "git fsck". Object lookups use this
> +mapping to allow naming objects using their SHA-1 and NewHash names
> +interchangeably.
> +
> +"git cat-file" and "git hash-object" gain options to display an object
> +in its sha1 form and write an object given its sha1 form.

Both of these are somewhat unclear.  I am guessing that "git
cat-file --convert-to=sha1 <type> <NewHashName>" would emit the
object contents converted from their NewHash payload to SHA-1
payload (blobs are unchanged, trees, commits and tags get their
outgoing references converted from NewHash to their SHA-1
counterparts), and that is what you mean by "options to display an
object in its sha1 form".  
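
To make that guess concrete, here is a rough Python sketch of the conversion for a commit object.  It is only an illustration under stated assumptions: convert_commit() and the in-memory mapping dict are hypothetical names (the dict stands in for the translation table), and only "tree" and "parent" headers are rewritten; tags, trees and signature fields are left out.

```python
def convert_commit(content: bytes, mapping: dict) -> bytes:
    """Rewrite the tree/parent references of a commit using `mapping`.

    `mapping` is a hypothetical stand-in for the translation table:
    it maps hex names under one hash to hex names under the other.
    """
    headers, sep, body = content.partition(b"\n\n")
    out = []
    for line in headers.split(b"\n"):
        key, space, value = line.partition(b" ")
        if key in (b"tree", b"parent"):
            # Outgoing reference: swap in the name under the other hash.
            out.append(key + space + mapping[value])
        else:
            # Author, committer, continuation lines, etc. are kept as-is.
            out.append(line)
    return b"\n".join(out) + sep + body


# Made-up names: 64 hex digits on the NewHash side, 40 on the SHA-1 side.
mapping = {b"a" * 64: b"1" * 40, b"b" * 64: b"2" * 40}
newhash_commit = (
    b"tree " + b"a" * 64 + b"\n"
    b"parent " + b"b" * 64 + b"\n"
    b"author A U Thor <author@example.com> 0 +0000\n"
    b"\n"
    b"subject\n"
)
sha1_commit = convert_commit(newhash_commit, mapping)
```

Blobs would pass through unchanged, since they carry no references to other objects.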

I am not sure how "git hash-object" with the option would work,
though.  Do you give an option "--hash=sha1 --stdout --stdin -t
<type>" to feed a NewHash contents (file, tree, commit or tag) to
the command, convert it to the SHA-1 content (hmm, how's that
different from the cat-file's new option???) and then write out its
loose object representation suitable to be used in the SHA-1 world?
Where do you write it to?  It won't be in the repository, as we
rejected mixed repository in our Non-Goals section.

> +Object names
> +~~~~~~~~~~~~
> +Objects can be named by their 40 hexadecimal digit sha1-name or 64
> +hexadecimal digit newhash-name, plus names derived from those (see
> +gitrevisions(7)).
> +
> +The sha1-name of an object is the SHA-1 of the concatenation of its
> +type, length, a nul byte, and the object's sha1-content. This is the
> +traditional <sha1> used in Git to name objects.
> +
> +The newhash-name of an object is the NewHash of the concatenation of its
> +type, length, a nul byte, and the object's newhash-content.

It makes me wonder if we want to add the hashname in this object
header.  "length" would be different for non-blob objects anyway,
and it is not "compat metadata" we want to avoid baked in, yet it
would help diagnose a mistake of attempting to use a "mixed" objects
in a single repository.  Not a big issue, though.
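
For reference, the naming rule quoted above can be sketched in a few lines of Python.  The sha1 half matches what Git does today (the header is "<type> <length>" followed by a NUL byte); using SHA-256 for the NewHash half is purely an assumption for illustration, since the document deliberately leaves NewHash unnamed.

```python
import hashlib


def object_name(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Hash "<type> <length>\\0<content>" with the given hashlib algorithm."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.new(algo, header + content).hexdigest()


sha1_name = object_name("blob", b"hello\n")
# SHA-256 is only a stand-in for the unnamed NewHash here.
newhash_name = object_name("blob", b"hello\n", algo="sha256")
print(sha1_name)
print(len(sha1_name), len(newhash_name))
```

For a blob the content is identical under both hashes, so only the hash function differs; for trees, commits and tags the content itself differs too, because the embedded references change.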

> +The format allows round-trip conversion between newhash-content and
> +sha1-content.

If it is a goal to eventually be able to lose SHA-1 compatibility
metadata from the objects, then we might want to remove SHA-1 based
signature bits (e.g. PGP trailer in signed tag, gpgsig header in the
commit object) from NewHash contents, and instead have them stored
in a side "metadata" table, only to be used while converting back.
I dunno if that is desirable.

> +Pack index
> +~~~~~~~~~~
> +Pack index (.idx) files use a new v3 format that supports multiple
> +hash functions. They have the following format (all integers are in
> +network byte order):
> +
> +- A header appears at the beginning and consists of the following:
> +  - The 4-byte pack index signature: '\377t0c'
> +  - 4-byte version number: 3
> +  - 4-byte length of the header section, including the signature and
> +    version number
> +  - 4-byte number of objects contained in the pack
> +  - 4-byte number of object formats in this pack index: 2
> +  - For each object format:
> +    - 4-byte format identifier (e.g., 'sha1' for SHA-1)
> +    - 4-byte length in bytes of shortened object names. This is the
> +      shortest possible length needed to make names in the shortened
> +      object name table unambiguous.
> +    - 4-byte integer, recording where tables relating to this format
> +      are stored in this index file, as an offset from the beginning.
> +  - 4-byte offset to the trailer from the beginning of this file.
> +  - Zero or more additional key/value pairs (4-byte key, 4-byte
> +    value). Only one key is supported: 'PSRC'. See the "Loose objects
> +    and unreachable objects" section for supported values and how this
> +    is used.  All other keys are reserved. Readers must ignore
> +    unrecognized keys.
> +- Zero or more NUL bytes. This can optionally be used to improve the
> +  alignment of the full object name table below.
> +- Tables for the first object format:
> +  - A sorted table of shortened object names.  These are prefixes of
> +    the names of all objects in this pack file, packed together
> +    without offset values to reduce the cache footprint of the binary
> +    search for a specific object name.

I take it to mean that the stride is defined in the "length in bytes
of shortened object names" in the file header.  If so, I can see how
this would work.  This "sorted table", unlike the next one, does not
say how it is sorted, but I assume this is just the object name
order (as opposed to the pack location order the next table uses)?
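
To check my reading of the header layout, here is a minimal Python sketch that packs and unpacks the fixed part of the header.  Several things are assumptions and not from the proposal: the signature bytes are taken to be the magic of the existing v2 .idx format, the format identifiers are treated as 4 raw ASCII bytes, b"newh" is a made-up identifier for NewHash, and the optional key/value pairs are left out.

```python
import struct

SIGNATURE = b"\xfftOc"  # assumption: same 4-byte magic as existing v2 .idx files
VERSION = 3


def pack_header(num_objects, formats, trailer_offset):
    """formats: list of (4-byte id, shortened-name length, table offset)."""
    # signature + 4 integers, 12 bytes per format, then the trailer offset
    header_len = 20 + 12 * len(formats) + 4
    out = SIGNATURE + struct.pack(">IIII", VERSION, header_len,
                                  num_objects, len(formats))
    for fmt_id, short_len, table_offset in formats:
        out += fmt_id + struct.pack(">II", short_len, table_offset)
    return out + struct.pack(">I", trailer_offset)


def unpack_header(buf):
    version, header_len, num_objects, num_formats = struct.unpack_from(">IIII", buf, 4)
    pos, formats = 20, []
    for _ in range(num_formats):
        short_len, table_offset = struct.unpack_from(">II", buf, pos + 4)
        formats.append((buf[pos:pos + 4], short_len, table_offset))
        pos += 12
    (trailer_offset,) = struct.unpack_from(">I", buf, pos)
    return version, header_len, num_objects, formats, trailer_offset


# b"newh" is a hypothetical identifier; the offsets are arbitrary sample values.
hdr = pack_header(3, [(b"sha1", 4, 48), (b"newh", 5, 200)], 400)
```

With two object formats and no extra key/value pairs the header comes out at 48 bytes, and the "shortened-name length" field is what gives the binary search its stride.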

> +  - A table of full object names in pack order. This allows resolving
> +    a reference to "the nth object in the pack file" (from a
> +    reachability bitmap or from the next table of another object
> +    format) to its object name.
> +
> +  - A table of 4-byte values mapping object name order to pack order.
> +    For an object in the table of sorted shortened object names, the
> +    value at the corresponding index in this table is the index in the
> +    previous table for that same object.
> +
> +    This can be used to look up the object in reachability bitmaps or
> +    to look up its name in another object format.

And this is a separate table because the short-name table wants to
be as compact as possible for binary search?  Otherwise an entry in
the short-name table could be <pack order number, n-bytes that is
short unique prefix>.

> +  - A table of 4-byte CRC32 values of the packed object data, in the
> +    order that the objects appear in the pack file. This is to allow
> +    compressed data to be copied directly from pack to pack during
> +    repacking without undetected data corruption.

An obvious alternative would be to have the CRC32 checksum near
(e.g. immediately before) the object data in the packfile (as
opposed to the .idx file like this document specifies).  I am not
sure what the pros and cons are between the two, though, and that is
why I mention the possibility here.

Hmm, as the corresponding packfile stores object data only in
NewHash content format, it is somewhat curious that this table that
stores CRC32 of the data appears in the "Tables for each object
format" section, as they would be identical, no?  Unless I am
grossly misreading the spec, the checksum should either go outside
the "Tables for each object format" section but still in .idx, or
should be eliminated and become part of the packdata stream instead,
perhaps?

> +  - A table of 4-byte offset values. For an object in the table of
> +    sorted shortened object names, the value at the corresponding
> +    index in this table indicates where that object can be found in
> +    the pack file. These are usually 31-bit pack file offsets, but
> +    large offsets are encoded as an index into the next table with the
> +    most significant bit set.

Oy.  So we can go from a short prefix to the pack location by first
finding it via binsearch in the short-name table, realize that it is
nth object in the object name order, and consulting this table.
When we know the pack-order of an object, there is no direct way to
go to its location (short of reversing the name-order-to-pack-order
table)?

> +  - A table of 8-byte offset entries (empty for pack files less than
> +    2 GiB). Pack files are organized with heavily used objects toward
> +    the front, so most object references should not need to refer to
> +    this table.
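
The two offset tables quoted above work the same way as in the existing v2 .idx format, so decoding an entry can be sketched as below.  resolve_offset() is a hypothetical helper name, and the large-offset table is passed in as raw bytes for simplicity.

```python
import struct

MSB = 0x80000000


def resolve_offset(entry: int, large_offsets: bytes) -> int:
    """Turn a 4-byte offset-table entry into a pack file offset."""
    if entry & MSB:
        # Most significant bit set: the low 31 bits index the 8-byte table.
        index = entry & 0x7FFFFFFF
        (offset,) = struct.unpack_from(">Q", large_offsets, 8 * index)
        return offset
    # Otherwise the entry is itself a 31-bit pack file offset.
    return entry


# Two sample entries for the 8-byte table, in network byte order.
large = struct.pack(">QQ", 2**31, 2**33 + 5)
small_offset = resolve_offset(12, large)
big_offset = resolve_offset(0x80000001, large)
```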

> +- Zero or more NUL bytes.

... for padding/aligning.

> +- Tables for the second object format, with the same layout as above,
> +  up to and not including the table of CRC32 values.
> +- Zero or more NUL bytes.
> +- The trailer consists of the following:
> +  - A copy of the 20-byte NewHash checksum at the end of the
> +    corresponding packfile.
> +
> +  - 20-byte NewHash checksum of all of the above.

When did NewHash shrink to 20-byte suddenly?  I think the above two
are both "32-byte"?

> +Loose object index
> +~~~~~~~~~~~~~~~~~~
> +A new file $GIT_OBJECT_DIR/loose-object-idx contains information about
> +all loose objects. Its format is
> +
> +  # loose-object-idx
> +  (newhash-name SP sha1-name LF)*
> +
> +where the object names are in hexadecimal format. The file is not
> +sorted.

Shouldn't the file somehow say what hashes are involved to allow us
match it with extension.{objectFormat,compatObjectFormat}, perhaps
at the end of the "# loose-object-idx" line?

> +The loose object index is protected against concurrent writes by a
> +lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose
> +object:
> +
> +1. Write the loose object to a temporary file, like today.
> +2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock.
> +3. Rename the loose object into place.
> +4. Open loose-object-idx with O_APPEND and write the new object

"write the new entry, fsync and close"?
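
To illustrate the locking protocol and the file format together, here is a minimal Python sketch.  The function names are hypothetical, step 3 (renaming the loose object into place) is only marked by a comment, and removing the lock file at the end is my assumption, since the steps quoted above stop at step 4.

```python
import os
import tempfile


def add_loose_index_entry(objdir, newhash_hex, sha1_hex):
    idx = os.path.join(objdir, "loose-object-idx")
    lock = idx + ".lock"
    # Step 2: O_CREAT | O_EXCL fails if another writer holds the lock.
    fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)
    try:
        # Step 3 (rename the loose object into place) would happen here.
        is_new = not os.path.exists(idx)
        # Step 4: append the new entry, then fsync before closing.
        with open(idx, "a") as f:
            if is_new:
                f.write("# loose-object-idx\n")
            f.write(f"{newhash_hex} {sha1_hex}\n")
            f.flush()
            os.fsync(f.fileno())
    finally:
        os.unlink(lock)  # assumption: the lock is released by removing the file


def read_loose_index(objdir):
    pairs = []
    with open(os.path.join(objdir, "loose-object-idx")) as f:
        for line in f:
            if line.startswith("#"):
                continue  # header line
            newhash_hex, sha1_hex = line.split()
            pairs.append((newhash_hex, sha1_hex))
    return pairs


with tempfile.TemporaryDirectory() as d:
    add_loose_index_entry(d, "ab" * 32, "cd" * 20)
    add_loose_index_entry(d, "ef" * 32, "01" * 20)
    entries = read_loose_index(d)
    lock_left_behind = os.path.exists(os.path.join(d, "loose-object-idx.lock"))
```

Because the file is unsorted and append-only, a reader has to scan it linearly, which is presumably tolerable only while the number of loose objects stays small.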

> +Translation table
> +~~~~~~~~~~~~~~~~~
> +The index files support a bidirectional mapping between sha1-names
> +and newhash-names. The lookup proceeds similarly to ordinary object
> +lookups. For example, to convert a sha1-name to a newhash-name:
> +
> + 1. Look for the object in idx files. If a match is present in the
> +    idx's sorted list of truncated sha1-names, then:
> +    a. Read the corresponding entry in the sha1-name order to pack
> +       name order mapping.
> +    b. Read the corresponding entry in the full sha1-name table to
> +       verify we found the right object. If it is, then
> +    c. Read the corresponding entry in the full newhash-name table.
> +       That is the object's newhash-name.

c. is possible because b. and c. are sorted the same way, i.e. the
index used to consult the full sha1-name table, which is the pack
order number, can be used to find its full newhash in the "full
newhash sorted by pack order" table?
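
If that reading is right, the lookup can be sketched as follows.  This is a toy in-memory model in Python, not the on-disk format: FormatTables and translate() are hypothetical names, and short_len is assumed to be long enough that the shortened names are unambiguous.

```python
import bisect


class FormatTables:
    """Per-format tables of one .idx file, modelled as Python lists."""

    def __init__(self, names_in_pack_order, short_len):
        self.full = list(names_in_pack_order)  # full names, pack order
        order = sorted(range(len(self.full)), key=lambda i: self.full[i])
        self.name_to_pack = order              # name order -> pack order
        self.short = [self.full[i][:short_len] for i in order]  # sorted prefixes
        self.short_len = short_len


def translate(name, src, dst):
    """Map a full object name in the `src` format to the `dst` format."""
    prefix = name[:src.short_len]
    i = bisect.bisect_left(src.short, prefix)
    if i == len(src.short) or src.short[i] != prefix:
        return None                        # not in this idx file
    pack_pos = src.name_to_pack[i]         # step a
    if src.full[pack_pos] != name:         # step b: verify the full name
        return None
    return dst.full[pack_pos]              # step c: same pack position


# Made-up object names, listed in pack order.
sha1 = FormatTables([b"\x03" * 20, b"\x01" * 20, b"\x02" * 20], short_len=4)
newhash = FormatTables([b"\xaa" * 32, b"\xbb" * 32, b"\xcc" * 32], short_len=4)
found = translate(b"\x01" * 20, sha1, newhash)
```

The pack order position is the join key between the two formats, which is why both full-name tables have to be stored in pack order.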

> +Reading an object's sha1-content
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I'd stop here and continue in a separate message.  Thanks for a
detailed write-up.


* Re: [PATCH v4] technical doc: add a design doc for hash function transition
  2017-09-29  6:06         ` Junio C Hamano
@ 2017-09-29  8:09           ` Junio C Hamano
  2017-09-29 17:34           ` Jonathan Nieder
  1 sibling, 0 replies; 113+ messages in thread
From: Junio C Hamano @ 2017-09-29  8:09 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Shawn Pearce, Linus Torvalds, Git Mailing List, Stefan Beller,
	bmwill, Jonathan Tan, Jeff King, David Lang, brian m. carlson,
	Masaya Suzuki, demerphq, The Keccak Team, Johannes Schindelin

Junio C Hamano <gitster@pobox.com> writes:

> Or perhaps we could.  There is nothing that says a signed tag
> created in the SHA-1 world must have the PGP/SHA-1 signature in the
> NewHash payload---it could be split off of the object data and
> stored in a local metadata cache, to be used only when we need to
> convert it back to the SHA-1 world.
> ...
>> +The format allows round-trip conversion between newhash-content and
>> +sha1-content.
>
> If it is a goal to eventually be able to lose SHA-1 compatibility
> metadata from the objects, then we might want to remove SHA-1 based
> signature bits (e.g. PGP trailer in signed tag, gpgsig header in the
> commit object) from NewHash contents, and instead have them stored
> in a side "metadata" table, only to be used while converting back.
> I dunno if that is desirable.

Let's keep it simple by ignoring all of the above.  Even though
leaving the sha1-gpgsig and other cruft would etch this
compatibility metadata in objects forever, it remains only in
objects that originate from the SHA-1 world, or in objects created in
the NewHash world while the project participants still care
about SHA-1 compatibility.  Strictly speaking, it would be super
nice if we could avoid contaminating these newly created objects
with SHA-1 compatibility headers, just like we wish to be able to
drop the SHA-1 vs NewHash mapping table after project participants
stop caring about SHA-1 compatibility, but it may not be worth it.  Of
course, if we decide to spend a few more brain cycles to design how
we push these out of the object proper, the same solution would
automatically allow us to omit SHA-1 compatibility headers from the
objects that were converted from the SHA-1 world.
>
>> +  - A table of 4-byte CRC32 values of the packed object data, in the
>> +    order that the objects appear in the pack file. This is to allow
>> +    compressed data to be copied directly from pack to pack during
>> +    repacking without undetected data corruption.
>
> An obvious alternative would be to have the CRC32 checksum near
> (e.g. immediately before) the object data in the packfile (as
> opposed to the .idx file like this document specifies).  I am not
> sure what the pros and cons are between the two, though, and that is
> why I mention the possibility here.
>
> Hmm, as the corresponding packfile stores object data only in
> NewHash content format, it is somewhat curious that this table that
> stores CRC32 of the data appears in the "Tables for each object
> format" section, as they would be identical, no?  Unless I am
> grossly misreading the spec, the checksum should either go outside
> the "Tables for each object format" section but still in .idx, or
> should be eliminated and become part of the packdata stream instead,
> perhaps?

Thinking about this a bit more, I think a single table per .idx file
would be the right way to go, not a checksum immediately after or
before the object data that is embedded in the pack stream.  In the
NewHash world (after this initial migration), we would want to be
able to stream NewHash packstream that comes from the network
straight to disk, which would mean these in-line CRC32 data would
need to be sent over the wire (i.e. 4-byte per object sent); that is
an unneeded overhead, as the packstream has its trailing checksum to
protect the whole thing anyway.
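
As a small illustration of what the per-object CRC32 in the .idx buys us during repacking, here is a Python sketch.  copy_packed_object() is a hypothetical helper, and the "pack" is just a byte string with one zlib-compressed object in it; the real pack entry layout (type and size header, deltas) is ignored.

```python
import zlib


def copy_packed_object(src_pack: bytes, start: int, end: int, expected_crc: int) -> bytes:
    """Return the raw packed bytes for reuse, after checking their CRC32."""
    raw = src_pack[start:end]
    if zlib.crc32(raw) & 0xFFFFFFFF != expected_crc:
        raise ValueError("packed object data is corrupt")
    # The compressed bytes can go into the new pack without inflating them.
    return raw


# Toy pack: 4 bytes of padding, then one compressed object.
packed = zlib.compress(b"some object data")
pack = b"PACK" + packed
crc = zlib.crc32(packed) & 0xFFFFFFFF  # what the idx table would record

copied = copy_packed_object(pack, 4, 4 + len(packed), crc)

# A flipped bit in the source pack is caught before it propagates.
damaged = pack[:6] + bytes([pack[6] ^ 0x01]) + pack[7:]
try:
    copy_packed_object(damaged, 4, 4 + len(packed), crc)
    corruption_detected = False
except ValueError:
    corruption_detected = True
```

Since the check is against the bytes as stored in the pack, one table per .idx file is enough regardless of how many object formats the index describes.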


* Re: RFC v3: Another proposed hash function transition plan
  2017-09-19 16:45                       ` Gilles Van Assche
@ 2017-09-29 13:17                         ` Johannes Schindelin
  2017-09-29 14:54                           ` Joan Daemen
  0 siblings, 1 reply; 113+ messages in thread
From: Johannes Schindelin @ 2017-09-29 13:17 UTC (permalink / raw)
  To: Gilles Van Assche
  Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Jeff King, David Lang, brian m. carlson, Keccak Team

[-- Attachment #1: Type: text/plain, Size: 2234 bytes --]

Hi Gilles,

On Tue, 19 Sep 2017, Gilles Van Assche wrote:

> On 19/09/17 00:16, Johannes Schindelin wrote:
> >>> SHA-256 got much more cryptanalysis than SHA3-256 […].
> >>
> >> I do not think this is true.
> >
> > Please read what I said again: SHA-256 got much more cryptanalysis
> > than SHA3-256.
> 
> Indeed. What I meant is that SHA3-256 got at least as much cryptanalysis
> as SHA-256. :-)

Oh? I got the opposite impression... I got the impression that *everybody*
in the field banged on all the SHA-2 candidates because everybody was
worried that SHA-1 would be utterly broken soon, and I got the impression
that after this SHA-2 competition, people were less worried?

Besides, I would expect the difference in age (at *least* 7 years by
my humble arithmetic skills) to make a difference...

> > I never said that SHA3-256 got little cryptanalysis. Personally, I
> > think that SHA3-256 got a ton more cryptanalysis than SHA-1, and that
> > SHA-256 *still* got more cryptanalysis. But my opinion does not count,
> > really. However, the two experts I pestered with questions over
> > questions left me with that strong impression, and their opinion does
> > count.
> 
> OK, I respect your opinion and that of your two experts. Yet, the "much
> more" part of your statement, in particular, is something that may
> require a bit more explanations.

I would also like to point out the ubiquitousness of SHA-256. I have been
asked to provide SHA-256 checksums for the downloads of Git for Windows,
but not SHA3-256...

And this is a practically-relevant thing: the more users of an algorithm
there are, the more high-quality implementations you can choose from. And
this becomes relevant, say, when you have to switch implementations due to
license changes (*cough, cough looking in OpenSSL's direction*). Or when
you have to support the biggest Git repository on this planet and have to
eke out 5-10% more performance using the latest hardware. All of a sudden,
your consideration cannot only be "security of the algorithm" any longer.

Having said that, I am *really* happy to have SHA3-256 as a valid fallback
option in case SHA-256 should be broken.

Ciao,
Johannes


* Re: RFC v3: Another proposed hash function transition plan
  2017-09-29 13:17                         ` Johannes Schindelin
@ 2017-09-29 14:54                           ` Joan Daemen
  2017-09-29 22:33                             ` Johannes Schindelin
  0 siblings, 1 reply; 113+ messages in thread
From: Joan Daemen @ 2017-09-29 14:54 UTC (permalink / raw)
  To: Johannes Schindelin, Gilles Van Assche
  Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Jeff King, David Lang, brian m. carlson, Keccak Team

Dear Johannes,

if ever there was a SHA-2 competition, it must have been held inside
the NSA :-) But maybe you are confusing it with the SHA-3 competition. In
any case, when considering SHA-2 vs SHA-3 for usage in git, you may have a
look at the arguments we give in the following blog post:

https://keccak.team/2017/open_source_crypto.html

Kind regards,

Joan Daemen

On 29/09/17 15:17, Johannes Schindelin wrote:
> Hi Gilles,
>
> On Tue, 19 Sep 2017, Gilles Van Assche wrote:
>
>> On 19/09/17 00:16, Johannes Schindelin wrote:
>>>>> SHA-256 got much more cryptanalysis than SHA3-256 […].
>>>> I do not think this is true.
>>> Please read what I said again: SHA-256 got much more cryptanalysis
>>> than SHA3-256.
>> Indeed. What I meant is that SHA3-256 got at least as much cryptanalysis
>> as SHA-256. :-)
> Oh? I got the opposite impression... I got the impression that *everybody*
> in the field banged on all the SHA-2 candidates because everybody was
> worried that SHA-1 would be utterly broken soon, and I got the impression
> that after this SHA-2 competition, people were less worried?
>
> Besides, I would expect the difference in age (at *least* 7 years by
> my humble arithmetic skills) to make a difference...
>
>>> I never said that SHA3-256 got little cryptanalysis. Personally, I
>>> think that SHA3-256 got a ton more cryptanalysis than SHA-1, and that
>>> SHA-256 *still* got more cryptanalysis. But my opinion does not count,
>>> really. However, the two experts I pestered with questions over
>>> questions left me with that strong impression, and their opinion does
>>> count.
>> OK, I respect your opinion and that of your two experts. Yet, the "much
>> more" part of your statement, in particular, is something that may
>> require a bit more explanations.
> I would also like to point out the ubiquitousness of SHA-256. I have been
> asked to provide SHA-256 checksums for the downloads of Git for Windows,
> but not SHA3-256...
>
> And this is a practically-relevant thing: the more users of an algorithm
> there are, the more high-quality implementations you can choose from. And
> this becomes relevant, say, when you have to switch implementations due to
> license changes (*cough, cough looking in OpenSSL's direction*). Or when
> you have to support the biggest Git repository on this planet and have to
> eke out 5-10% more performance using the latest hardware. All of a sudden,
> your consideration cannot only be "security of the algorithm" any longer.
>
> Having said that, I am *really* happy to have SHA3-256 as a valid fallback
> option in case SHA-256 should be broken.
>
> Ciao,
> Johannes



* Re: [PATCH v4] technical doc: add a design doc for hash function transition
  2017-09-29  6:06         ` Junio C Hamano
  2017-09-29  8:09           ` Junio C Hamano
@ 2017-09-29 17:34           ` Jonathan Nieder
  2017-10-02  8:25             ` Junio C Hamano
  2017-10-02 19:41             ` Jason Cooper
  1 sibling, 2 replies; 113+ messages in thread
From: Jonathan Nieder @ 2017-09-29 17:34 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Shawn Pearce, Linus Torvalds, Git Mailing List, Stefan Beller,
	bmwill, Jonathan Tan, Jeff King, David Lang, brian m. carlson,
	Masaya Suzuki, demerphq, The Keccak Team, Johannes Schindelin

Junio C Hamano wrote:
> Jonathan Nieder <jrnieder@gmail.com> writes:

>> This document describes what a transition to a new hash function for
>> Git would look like.  Add it to Documentation/technical/ as the plan
>> of record so that future changes can be recorded as patches.
>>
>> Also-by: Brandon Williams <bmwill@google.com>
>> Also-by: Jonathan Tan <jonathantanmy@google.com>
>> Also-by: Stefan Beller <sbeller@google.com>
>> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
>> ---
>
> Shouldn't these all be s-o-b: (with a note immediately before that
> to say all four contributed equally or something)?

I don't want to get lost in the weeds on the question of how to
represent such a collaborative effort in git's metadata.

You're right that I should collect their sign-offs!  Your approach of
using text instead of machine-readable data for common authorship also
seems okay.

In any event, this is indeed

Signed-off-by: Brandon Williams <bmwill@google.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Stefan Beller <sbeller@google.com>

(I just checked :)).

>> +Background
>> +----------
>> +At its core, the Git version control system is a content addressable
>> +filesystem. It uses the SHA-1 hash function to name content. For
>> +example, files, directories, and revisions are referred to by hash
>> +values unlike in other traditional version control systems where files
>> +or versions are referred to via sequential numbers. The use of a hash
>
> Traditional systems refer to files via numbers???  Perhaps "where
> versions of files are referred to via sequential numbers" or
> something?

Good point.  The wording you suggested will work well.

>> +function to address its content delivers a few advantages:
>> +
>> +* Integrity checking is easy. Bit flips, for example, are easily
>> +  detected, as the hash of corrupted content does not match its name.
>> +* Lookup of objects is fast.
>
> * There is no ambiguity what the object's name should be, given its
>   content.
>
> * Deduping the same content copied across versions and paths is
>   automatic.

:)  Yep, these are nice too, especially that second one.

It also is how we make diff-ing fast.

>> +SHA-1 still possesses the other properties such as fast object lookup
>> +and safe error checking, but other hash functions are equally suitable
>> +that are believed to be cryptographically secure.
>
> s/secure/more &/, perhaps?

We were looking for a phrase meaning that it should be a cryptographic
hash function in good standing, which SHA-1 is at least approaching
not being.

"more secure" should work fine.  Let's go with that.

>> +Goals
>> +-----
>> +...
>> +   c. Users can use SHA-1 and NewHash identifiers for objects
>> +      interchangeably (see "Object names on the command line", below).
>
> Mental note.  This needs to extend to the "index X..Y" lines in the
> patch output, which is used by "apply -3" and "am -3".

Will add a note about this to "Object names on the command line".  Stefan
had already pointed out that that section should really be renamed to
something like "Object names in input and output".

>> +2. Allow a complete transition away from SHA-1.
>> +   a. Local metadata for SHA-1 compatibility can be removed from a
>> +      repository if compatibility with SHA-1 is no longer needed.
>
> I like the emphasis on "Local" here.  Metadata for compatibility that
> is embedded in the objects obviously cannot be removed.
>
> From that point of view, one of the goals ought to be "make sure
> that as much SHA-1 compatibility metadata as possible is local and
> outside the object".  This goal may not be able to say more than "as
> much as possible", as signed objects that came from the SHA-1 world
> need to carry the compatibility metadata somewhere somehow.
>
> Or perhaps we could.  There is nothing that says a signed tag
> created in the SHA-1 world must have the PGP/SHA-1 signature in the
> NewHash payload---it could be split off of the object data and
> stored in a local metadata cache, to be used only when we need to
> convert it back to the SHA-1 world.

That would break round-tripping and would mean that multiple SHA-1
objects could have the same NewHash name.  In other words, from
my point of view there is something that says that such data must
be preserved.

Another way to put it: even after removing all SHA-1 compatibility
metadata, one nice feature of this design is that it can be recovered
if I change my mind, from data in the NewHash based repository alone.

[...]
>> +Non-Goals
>> +---------
>> ...
>> +6. Skip fetching some submodules of a project into a NewHash
>> +   repository. (This also depends on NewHash support in Git
>> +   protocol.)
>
> It is unclear what this means.  Around submodule support, one thing
> I can think of is that a NewHash tree in a superproject would record
> a gitlink that is a NewHash commit object name in it, therefore it
> cannot refer to an unconverted SHA-1 submodule repository.  But it
> is unclear if the above description refers to the same issue, or
> something else.

It refers to that issue.

[...]
>> +Overview
>> +--------
>> +We introduce a new repository format extension. Repositories with this
>> +extension enabled use NewHash instead of SHA-1 to name their objects.
>> +This affects both object names and object content --- both the names
>> +of objects and all references to other objects within an object are
>> +switched to the new hash function.
>> +
>> +NewHash repositories cannot be read by older versions of Git.
>> +
>> +Alongside the packfile, a NewHash repository stores a bidirectional
>> +mapping between NewHash and SHA-1 object names. The mapping is generated
>> +locally and can be verified using "git fsck". Object lookups use this
>> +mapping to allow naming objects using their SHA-1 and NewHash names
>> +interchangeably.
>> +
>> +"git cat-file" and "git hash-object" gain options to display an object
>> +in its sha1 form and write an object given its sha1 form.
>
> Both of these are somewhat unclear.

I think we can delete this paragraph.  It was written before the
"Object names on the command line" section that goes into such issues
in more detail.

[...]
>> +Object names
>> +~~~~~~~~~~~~
>> +Objects can be named by their 40 hexadecimal digit sha1-name or 64
>> +hexadecimal digit newhash-name, plus names derived from those (see
>> +gitrevisions(7)).
>> +
>> +The sha1-name of an object is the SHA-1 of the concatenation of its
>> +type, length, a nul byte, and the object's sha1-content. This is the
>> +traditional <sha1> used in Git to name objects.
>> +
>> +The newhash-name of an object is the NewHash of the concatenation of its
>> +type, length, a nul byte, and the object's newhash-content.
>
> It makes me wonder if we want to add the hashname in this object
> header.  "length" would be different for non-blob objects anyway,
> and it is not "compat metadata" we want to avoid baked in, yet it
> would help diagnose a mistake of attempting to use "mixed" objects
> in a single repository.  Not a big issue, though.

Do you mean that adding the hashname into the computation that
produces the object name would help in some use case?

Or do you mean storing the hashname on disk somewhere, even if it
doesn't enter into the object name?  For the latter, we store the
hashname in the .git/config extensions.* configuration and the pack
index files.  You also suggested storing the hash name in
.git/objects/loose-object-idx, which seems to me like a good idea.

We didn't touch on the .pack format but we probably need to (if only
because of the size of REF_DELTAs and the cksum trailer), and it would
also need to name what object format it is using.

For loose objects, it would be nice to name the hash in the file, so
that "file" can understand what is happening if someone accidentally
mixes types using "cp".  The only downside is losing the ability to
copy blobs (which have the same content despite being named using
different hashes) between repositories after determining their new
names.  That doesn't seem like a strong downside --- it's pretty
harmless to include the hash type in loose object files, too.  I think
I would prefer this to be a "magic number" instead of part of the
zlib-deflated payload, since this way "file" can discover it more
easily.
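
For reference, the naming rule quoted above can be sketched in a few
lines of Python. This is only an illustration, not the proposed
implementation: the single space between type and length follows Git's
existing object header, and SHA-256 stands in for the still-unnamed
NewHash purely as an assumption. The function name is hypothetical.

```python
import hashlib


def object_name(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Hash "<type> <length>\\0<content>" and return the hex object name.

    `algo` is any name accepted by hashlib.new(); "sha256" is used here
    only as a placeholder for NewHash.
    """
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


# The same blob gets a 40-digit sha1-name and a 64-digit newhash-name.
sha1_name = object_name("blob", b"hello world\n", "sha1")
newhash_name = object_name("blob", b"hello world\n", "sha256")
```

For blobs the hashed content is the same under both algorithms; for
trees, commits and tags the content itself differs because embedded
references use the other hash, which is why the lengths can differ too.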

>> +The format allows round-trip conversion between newhash-content and
>> +sha1-content.
>
> If it is a goal to eventually be able to lose SHA-1 compatibility
> metadata from the objects, then we might want to remove SHA-1 based
> signature bits (e.g. PGP trailer in signed tag, gpgsig header in the
> commit object) from NewHash contents, and instead have them stored
> in a side "metadata" table, only to be used while converting back.
> I dunno if that is desirable.

I don't consider that desirable.

A SHA-1 based signature is still of historical interest even if my
centuries-newer version of Git is not able to verify it.

[...]
> I take it to mean that the stride is defined in the "length in bytes
> of shortened object names" in the file header.  If so, I can see how
> this would work.  This "sorted table", unlike the next one, does not
> say how it is sorted, but I assume this is just the object name
> order (as opposed to the pack location order the next table uses)?

Yes.  Will clarify.

>> +  - A table of full object names in pack order. This allows resolving
>> +    a reference to "the nth object in the pack file" (from a
>> +    reachability bitmap or from the next table of another object
>> +    format) to its object name.
>> +
>> +  - A table of 4-byte values mapping object name order to pack order.
>> +    For an object in the table of sorted shortened object names, the
>> +    value at the corresponding index in this table is the index in the
>> +    previous table for that same object.
>> +
>> +    This can be used to look up the object in reachability bitmaps or
>> +    to look up its name in another object format.
>
> And this is a separate table because the short-name table wants to
> be as compact as possible for binary search?  Otherwise an entry in
> the short-name table could be <pack order number, n-bytes that is
> short unique prefix>.

Yes.  The idx v2 format has a similar design.

>> +  - A table of 4-byte CRC32 values of the packed object data, in the
>> +    order that the objects appear in the pack file. This is to allow
>> +    compressed data to be copied directly from pack to pack during
>> +    repacking without undetected data corruption.
>
> An obvious alternative would be to have the CRC32 checksum near
> (e.g. immediately before) the object data in the packfile (as
> opposed to the .idx file like this document specifies).  I am not
> sure what the pros and cons are between the two, though, and that is
> why I mention the possibility here.

As you mentioned under separate cover, it is useful for derived data
like this to be outside the packfile.

> Hmm, as the corresponding packfile stores object data only in
> NewHash content format, it is somewhat curious that this table that
> stores CRC32 of the data appears in the "Tables for each object
> format" section, as they would be identical, no?  Unless I am
> grossly misreading the spec, the checksum should either go outside
> the "Tables for each object format" section but still in .idx, or
> should be eliminated and become part of the packdata stream instead,
> perhaps?

It's actually only present for the first object format.  Will find a
better way to describe this.

>> +  - A table of 4-byte offset values. For an object in the table of
>> +    sorted shortened object names, the value at the corresponding
>> +    index in this table indicates where that object can be found in
>> +    the pack file. These are usually 31-bit pack file offsets, but
>> +    large offsets are encoded as an index into the next table with the
>> +    most significant bit set.
>
> Oy.  So we can go from a short prefix to the pack location by first
> finding it via binsearch in the short-name table, realize that it is
> nth object in the object name order, and consulting this table.
> When we know the pack-order of an object, there is no direct way to
> go to its location (short of reversing the name-order-to-pack-order
> table)?

An earlier version of the design also had a pack-order-to-pack-offset
table, but we weren't able to think of any cases where that would be
used without also looking up the object name that can be used to
verify the integrity of the inflated object.

Do you have an application in mind?
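
To make the offset encoding discussed above concrete, here is a small
sketch of how a reader might decode one entry. The quoted text only
says that large offsets are "an index into the next table with the
most significant bit set"; the big-endian byte order and the 8-byte
entry size of that next table are assumptions borrowed from the
existing idx v2 format, and the function name is hypothetical.

```python
import struct


def pack_offset(offsets: bytes, large_offsets: bytes, n: int) -> int:
    """Return the pack file offset of the n-th object in name order."""
    (value,) = struct.unpack_from(">I", offsets, 4 * n)
    if value & 0x80000000 == 0:
        # Common case: the entry is itself a 31-bit pack file offset.
        return value
    # MSB set: the low 31 bits index into the table of large offsets.
    index = value & 0x7FFFFFFF
    (large,) = struct.unpack_from(">Q", large_offsets, 8 * index)
    return large


# Two objects: one at offset 12, one beyond the 31-bit range.
offsets = struct.pack(">II", 12, 0x80000000)
large_offsets = struct.pack(">Q", 2**33)
```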

[...]
>> +- Tables for the second object format, with the same layout as above,
>> +  up to and not including the table of CRC32 values.
>> +- Zero or more NUL bytes.
>> +- The trailer consists of the following:
>> +  - A copy of the 20-byte NewHash checksum at the end of the
>> +    corresponding packfile.
>> +
>> +  - 20-byte NewHash checksum of all of the above.
>
> When did NewHash shrink to 20-byte suddenly?  I think the above two
> are both "32-byte"?

Yes, good catch.

[...]
>> +Loose object index
>> +~~~~~~~~~~~~~~~~~~
>> +A new file $GIT_OBJECT_DIR/loose-object-idx contains information about
>> +all loose objects. Its format is
>> +
>> +  # loose-object-idx
>> +  (newhash-name SP sha1-name LF)*
>> +
>> +where the object names are in hexadecimal format. The file is not
>> +sorted.
>
> Shouldn't the file somehow say what hashes are involved to allow us
> to match it with extensions.{objectFormat,compatObjectFormat}, perhaps
> at the end of the "# loose-object-idx" line?

Good idea!
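
As an illustration of the format as quoted (before any hash names are
added to the header line), a reader could look roughly like this. It
is a sketch only: the function name is made up, and the 64/40 digit
lengths are taken from the "Object names" section of the document.

```python
import re
from typing import Dict

# One entry per line: newhash-name SP sha1-name, both in lowercase hex.
ENTRY = re.compile(r"([0-9a-f]{64}) ([0-9a-f]{40})")


def parse_loose_object_idx(text: str) -> Dict[str, str]:
    """Parse loose-object-idx text into a newhash-name -> sha1-name map."""
    lines = text.split("\n")
    if lines[0] != "# loose-object-idx":
        raise ValueError("missing loose-object-idx header")
    mapping = {}
    for line in lines[1:]:
        if not line:
            continue  # tolerate the empty string after the final LF
        match = ENTRY.fullmatch(line)
        if match is None:
            raise ValueError("malformed entry: %r" % line)
        mapping[match.group(1)] = match.group(2)
    return mapping
```

Because the file is unsorted, a reader has to load it whole (for
example into a dictionary like this) before it can answer lookups.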

[...]
>> +The loose object index is protected against concurrent writes by a
>> +lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose
>> +object:
>> +
>> +1. Write the loose object to a temporary file, like today.
>> +2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock.
>> +3. Rename the loose object into place.
>> +4. Open loose-object-idx with O_APPEND and write the new object
>
> "write the new entry, fsync and close"?

Yes, I think we do need to fsync. :/
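
The sequence quoted above, with the fsync added, might look like the
following sketch. Several things here are assumptions rather than part
of the proposal: the object is stored as raw bytes instead of a
zlib-deflated loose object, the lock is released by unlinking the lock
file (the quoted steps are cut off before that point), and the header
line is written when the index is still empty. The function name is
hypothetical.

```python
import os
import tempfile


def add_loose_object(objdir, newhash_name, sha1_name, data):
    # 1. Write the loose object to a temporary file.
    fd, tmp = tempfile.mkstemp(dir=objdir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # 2. Acquire the lock; O_EXCL makes this fail if someone else holds it.
    lock = os.path.join(objdir, "loose-object-idx.lock")
    try:
        lock_fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        os.unlink(tmp)
        raise
    try:
        # 3. Rename the loose object into place (xx/yyyy... fan-out).
        dest_dir = os.path.join(objdir, newhash_name[:2])
        os.makedirs(dest_dir, exist_ok=True)
        os.rename(tmp, os.path.join(dest_dir, newhash_name[2:]))

        # 4. Append the new entry, fsync and close.
        idx = os.path.join(objdir, "loose-object-idx")
        idx_fd = os.open(idx, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            if os.fstat(idx_fd).st_size == 0:
                os.write(idx_fd, b"# loose-object-idx\n")
            entry = "%s %s\n" % (newhash_name, sha1_name)
            os.write(idx_fd, entry.encode("ascii"))
            os.fsync(idx_fd)
        finally:
            os.close(idx_fd)
    finally:
        # Assumed release step: drop the lock file.
        os.close(lock_fd)
        os.unlink(lock)
```

A real implementation would also need a policy for what to do when the
lock is already held (retry, wait, or give up); this sketch simply
propagates the error.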

[...]
>> +Translation table
>> +~~~~~~~~~~~~~~~~~
>> +The index files support a bidirectional mapping between sha1-names
>> +and newhash-names. The lookup proceeds similarly to ordinary object
>> +lookups. For example, to convert a sha1-name to a newhash-name:
>> +
>> + 1. Look for the object in idx files. If a match is present in the
>> +    idx's sorted list of truncated sha1-names, then:
>> +    a. Read the corresponding entry in the sha1-name order to pack
>> +       name order mapping.
>> +    b. Read the corresponding entry in the full sha1-name table to
>> +       verify we found the right object. If it is, then
>> +    c. Read the corresponding entry in the full newhash-name table.
>> +       That is the object's newhash-name.
>
> c. is possible because b. and c. are sorted the same way, i.e. the
> index used to consult the full sha1-name table, which is the pack
> order number, can be used to find its full newhash in the "full
> newhash sorted by pack order" table?

Yes.
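
Steps a. to c. can be modelled with plain in-memory lists to show why
the pack order position is the common key between the two formats.
This is a toy model, not the on-disk layout: names are hex strings
rather than raw bytes, the shortened-name length is passed in by hand
and assumed to give unique prefixes, and the class and function names
are invented for the example.

```python
import bisect


class FormatTables:
    """The per-format tables of one pack index, held in memory."""

    def __init__(self, full_names_in_pack_order, prefix_len):
        self.prefix_len = prefix_len
        # Table of full object names in pack order.
        self.full_names = list(full_names_in_pack_order)
        order = sorted(range(len(self.full_names)),
                       key=lambda i: self.full_names[i])
        # Sorted table of shortened object names.
        self.short_names = [self.full_names[i][:prefix_len] for i in order]
        # Table mapping object name order to pack order.
        self.name_to_pack_order = order


def translate(src, dst, name):
    """Map `name` in format `src` to its name in format `dst`, or None."""
    short = name[:src.prefix_len]
    i = bisect.bisect_left(src.short_names, short)
    if i == len(src.short_names) or src.short_names[i] != short:
        return None
    pack_pos = src.name_to_pack_order[i]      # step a
    if src.full_names[pack_pos] != name:      # step b
        return None
    return dst.full_names[pack_pos]           # step c


sha1 = FormatTables(["bb" * 20, "aa" * 20, "cc" * 20], prefix_len=4)
newhash = FormatTables(["11" * 32, "22" * 32, "33" * 32], prefix_len=4)
```

The same function works in the other direction by swapping the two
arguments, since both sets of tables share the pack order.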

>> +Reading an object's sha1-content
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> I'd stop here and continue in a separate message.  Thanks for a
> detailed write-up.

Thanks for looking it over.

Jonathan

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-29 14:54                           ` Joan Daemen
@ 2017-09-29 22:33                             ` Johannes Schindelin
  2017-09-30 22:02                               ` Joan Daemen
  0 siblings, 1 reply; 113+ messages in thread
From: Johannes Schindelin @ 2017-09-29 22:33 UTC (permalink / raw)
  To: Joan Daemen
  Cc: Gilles Van Assche, Linus Torvalds, demerphq, Brandon Williams,
	Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller,
	Jonathan Tan, Jeff King, David Lang, brian m. carlson,
	Keccak Team

Hi Joan,

On Fri, 29 Sep 2017, Joan Daemen wrote:

> if ever there was a SHA-2 competition, it must have been held inside NSA:-)

Oops. My bad, I indeed got confused about that, as you suggest below (I
actually thought of the AES competition, but that was obviously not about
SHA-2). Sorry.

> But maybe you are confusing with the SHA-3 competition. In any case,
> when considering SHA-2 vs SHA-3 for usage in git, you may have a look at
> arguments we give in the following blogpost:
> 
> https://keccak.team/2017/open_source_crypto.html

Thanks for the pointer!

Small nit: the post uses "its" in place of "it's", twice.

It does have a good point, of course: the scientific exchange (which you
call "open-source" in spirit) makes tons of sense.

As far as Git is concerned, we not only care about the source code of the
hash algorithm we use, we need to care even more about what you call
"executable": ready-to-use, high quality, well-tested implementations.

We carry source code for SHA-1 as part of Git's source code, which was
hand-tuned to be as fast as Linus could get it, which was tricky given
that the tuning should be general enough to apply to all common Intel
CPUs.

This hand-crafted code was blown out of the water by OpenSSL's SHA-1 in
our tests here at Microsoft, thanks to the fact that OpenSSL does
vectorized SHA-1 computation now.

To me, this illustrates why it is not good enough to have only a reference
implementation available at our fingertips. Of course, above-mentioned
OpenSSL supports SHA-256 and SHA3-256, too, and at least recent versions
vectorize those, too.

Also, ARM processors have become a lot more popular, so we'll want to have
high-quality implementations of the hash algorithm also for those
processors.

Likewise, in contrast to 2005, nowadays implementations of Git in
languages as obscure as JavaScript are not only theoretical but do exist
in practice (https://github.com/creationix/js-git). I had a *very* quick
look for libraries providing crypto in JavaScript and immediately found
the Stanford Javascript Crypto library
(https://github.com/bitwiseshiftleft/sjcl/) which seems to offer SHA-256
but not SHA3-256 computation.

Back to Intel processors: I read some vague hints about extensions
accelerating SHA-256 computation on future Intel processors, but not
SHA3-256.

It would make sense, of course, that more crypto libraries and more
hardware support would be available for SHA-256 than for SHA3-256 given
the time since publication: 16 vs 5 years (I am playing it loose here,
taking just the year into account, not the exact date, so please treat
that merely as a ballpark figure).

So from a practical point of view, I wonder what your take is on, say,
hardware support for SHA3-256. Do you think this will become a focus soon?

Also, what is your take on the question whether SHA-256 is good enough?
SHA-1 was broken theoretically already 10 years after it was published
(which unfortunately did not prevent us from baking it into Git), after
all, while SHA-256 is 16 years old and the only known weakness does not
apply to Git's usage?

Also, while I have the attention of somebody who knows a heck more about
cryptography than Git's top 10 committers combined: how soon do you expect
practical SHA-1 attacks that are much worse than what we already have
seen? I am concerned that if we do not move fast enough to a new hash
algorithm, and somebody finds a way in the meantime to craft arbitrary
messages given a prefix and an SHA-1, then we have a huge problem on
our hands.

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-29 22:33                             ` Johannes Schindelin
@ 2017-09-30 22:02                               ` Joan Daemen
  2017-10-02 14:26                                 ` Johannes Schindelin
  0 siblings, 1 reply; 113+ messages in thread
From: Joan Daemen @ 2017-09-30 22:02 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Gilles Van Assche, Linus Torvalds, demerphq, Brandon Williams,
	Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller,
	Jonathan Tan, Jeff King, David Lang, brian m. carlson,
	Keccak Team

Dear Johannes,

thanks for your response and taking the effort to express your concerns. 
Please see below for some feedback.

On 30/09/17 00:33, Johannes Schindelin wrote:
> Hi Joan,
> 
> On Fri, 29 Sep 2017, Joan Daemen wrote:
> 
>> if ever there was a SHA-2 competition, it must have been held inside
>> NSA:-)
> Oops. My bad, I indeed got confused about that, as you suggest below (I
> actually thought of the AES competition, but that was obviously not about
> SHA-2). Sorry.
>
>> But maybe you are confusing with the SHA-3 competition. In any case,
>> when considering SHA-2 vs SHA-3 for usage in git, you may have a look at
>> arguments we give in the following blogpost:
>> 
>> https://keccak.team/2017/open_source_crypto.html
> Thanks for the pointer!
> 
> Small nit: the post uses "its" in place of "it's", twice.

Thanks, we'll correct that.

> It does have a good point, of course: the scientific exchange (which you
> call "open-source" in spirit) makes tons of sense.
>
> As far as Git is concerned, we not only care about the source code of the
> hash algorithm we use, we need to care even more about what you call
> "executable": ready-to-use, high quality, well-tested implementations.
>
> We carry source code for SHA-1 as part of Git's source code, which was
> hand-tuned to be as fast as Linus could get it, which was tricky given
> that the tuning should be general enough to apply to all common Intel
> CPUs.
>
> This hand-crafted code was blown out of the water by OpenSSL's SHA-1 in
> our tests here at Microsoft, thanks to the fact that OpenSSL does
> vectorized SHA-1 computation now.
>
> To me, this illustrates why it is not good enough to have only a reference
> implementation available at our fingertips. Of course, above-mentioned
> OpenSSL supports SHA-256 and SHA3-256, too, and at least recent versions
> vectorize those, too.

There is a lot of high-quality optimized code for all SHA-3 functions 
and many CPUs in the Keccak code package 
https://github.com/gvanas/KeccakCodePackage but also OpenSSL contains 
some good SHA-3 code and then there are all those related to Ethereum.

By the way, you speak about SHA3-256, but the right choice would be to 
use SHAKE128. Well, what is exactly the right choice depends on what you 
want. If you want to have a function in the SHA3 standard (FIPS 202), it 
is SHAKE128. You can boost performance on high-end CPUs by adopting 
ParallelHash from NIST SP 800-185, still a NIST standard. You can
multiply that performance again by a factor of 2 by adopting 
KangarooTwelve. This is our (Keccak team) proposal for a parallelizable 
Keccak-based hash function that has a safety margin comparable to that 
of the SHA-2 functions. See https://keccak.team/kangarootwelve.html
May I also suggest you read https://keccak.team/2017/is_sha3_slow.html
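
For readers who want to compare the candidates named so far, the three
that ship with Python's standard hashlib module can be put side by
side as below. This is only an illustration: the 32-byte SHAKE128
output length is an assumption chosen to match the 64 hexadecimal
digit names in the transition document, the function name is made up,
and ParallelHash and KangarooTwelve are left out because hashlib does
not provide them.

```python
import hashlib


def candidate_names(data: bytes) -> dict:
    """Hash `data` with each candidate and return the hex digests."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "sha3_256": hashlib.sha3_256(data).hexdigest(),
        # SHAKE128 is an extendable-output function, so the caller
        # picks the digest length; 32 bytes gives 64 hex digits.
        "shake128": hashlib.shake_128(data).hexdigest(32),
    }


names = candidate_names(b"blob 0\0")
```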

> Also, ARM processors have become a lot more popular, so we'll want to have
> high-quality implementations of the hash algorithm also for those
> processors.
>
> Likewise, in contrast to 2005, nowadays implementations of Git in
> languages as obscure as JavaScript are not only theoretical but do exist
> in practice (https://github.com/creationix/js-git). I had a *very* quick
> look for libraries providing crypto in JavaScript and immediately found
> the Stanford Javascript Crypto library
> (https://github.com/bitwiseshiftleft/sjcl/) which seems to offer SHA-256
> but not SHA3-256 computation.
>
> Back to Intel processors: I read some vague hints about extensions
> accelerating SHA-256 computation on future Intel processors, but not
> SHA3-256.
>
> It would make sense, of course, that more crypto libraries and more
> hardware support would be available for SHA-256 than for SHA3-256 given
> the time since publication: 16 vs 5 years (I am playing it loose here,
> taking just the year into account, not the exact date, so please treat
> that merely as a ballpark figure).
>
> So from a practical point of view, I wonder what your take is on, say,
> hardware support for SHA3-256. Do you think this will become a focus soon?

I think this is a chicken-and-egg problem. In any case, hardware support 
for SHA3-256 will also work for the other SHA3 and SHAKE functions
as they all use the same underlying primitive: the Keccak-f permutation. 
This is not the case for SHA2 because SHA224 and SHA256 use a different 
compression function than SHA384, SHA512, SHA512/224 and SHA512/256.

> Also, what is your take on the question whether SHA-256 is good enough?
> SHA-1 was broken theoretically already 10 years after it was published
> (which unfortunately did not prevent us from baking it into Git), after
> all, while SHA-256 is 16 years old and the only known weakness does not
> apply to Git's usage?

SHA-256 is more conservative than SHA-1 and I don't expect it to be 
broken in the coming decades (unless NSA inserted a backdoor but I don't 
think that is likely). But looking at the existing cryptanalysis, I 
think it is even less likely that SHAKE128, ParallelHash or
KangarooTwelve will be broken anytime soon.

> Also, while I have the attention of somebody who knows a heck more about
> cryptography than Git's top 10 committers combined: how soon do you expect
> practical SHA-1 attacks that are much worse than what we already have
> seen? I am concerned that if we do not move fast enough to a new hash
> algorithm, and somebody finds a way in the meantime to craft arbitrary
> messages given a prefix and an SHA-1, then we have a huge problem on
> our hands.

This is hard to say. To be honest, when witnessing the first MD5 
collisions I did not expect them to lead to some real world attacks and 
just a few years later we saw real-world forged certificates based on 
MD5 collisions. And SHA-1 has a lot in common with MD5...

But let me end with a philosophical note. Independent of all the 
arguments for and against, I think this is ultimately about doing the 
right thing. The choice is here between SHA1/SHA2 on the one hand and 
SHA3/Keccak on the other. The former standards are imposed on us by NSA 
and the latter are the best that came out of an open competition 
involving all experts in the field worldwide. What would be closest to 
the philosophy of Git (and by extension Linux or open-source in 
general)?

Kind regards,

Joan


Begin forwarded message:

 From: Gilles Van Assche <gilles.van.assche@noekeon.org>
Subject: Re: RFC v3: Another proposed hash function transition plan
Date: 30 Sep 2017 22:20:42 CEST
To: Joan Daemen <joan@cs.ru.nl>, keccak@noekeon.org

Hi Joan,

About the implementations, there are many high-quality implementations 
of Keccak besides the KCP that you could also mention. E.g., those in 
OpenSSL are very good. And there are all those related to Ethereum.

I tend to agree with Guido regarding SHA-1: even if you are right, there
is no need to play down the impact of collisions too much, as there
could be unexpected use cases. And it's not clean. (And don't
underestimate the probability of being quoted on this.)

Finally, just to say that I like your last paragraph.

Kind regards,
Gilles

Joan Daemen <joan@cs.ru.nl> wrote:
what about replying with something like this (please have a critical 
look). I sent this from my Radboud account as I have problems with my 
Thunderbird settings. When trying to send a mail, it sometimes works and 
sometimes it says "An error occurred while sending mail: Outgoing server
(SMTP) error. The server responded: 4.7.1 <joans-mbp.home>: Helo
command rejected: Host not found."
Dear Johannes,
thanks for your response and taking the effort to express your concerns. 
Please see below for some feedback.
On 30/09/17 00:33, Johannes Schindelin wrote:
Hi Joan,

On Fri, 29 Sep 2017, Joan Daemen wrote:

if ever there was a SHA-2 competition, it must have been held inside 
NSA:-)
Oops. My bad, I indeed got confused about that, as you suggest below (I
actually thought of the AES competition, but that was obviously not 
about
SHA-2). Sorry.

But maybe you are confusing with the SHA-3 competition. In any case,
when considering SHA-2 vs SHA-3 for usage in git, you may have a look at
arguments we give in the following blogpost:

https://keccak.team/2017/open_source_crypto.html
Thanks for the pointer!

Small nit: the post uses "its" in place of "it's", twice.
Thanks, we'll correct that.

It does have a good point, of course: the scientific exchange (which you
call "open-source" in spirit) makes tons of sense.

As far as Git is concerned, we not only care about the source code of 
the
hash algorithm we use, we need to care even more about what you call
"executable": ready-to-use, high quality, well-tested implementations.

We carry source code for SHA-1 as part of Git's source code, which was
hand-tuned to be as fast as Linus could get it, which was tricky given
that the tuning should be general enough to apply to all common Intel
CPUs.

This hand-crafted code was blown out of the water by OpenSSL's SHA-1 in
our tests here at Microsoft, thanks to the fact that OpenSSL does
vectorized SHA-1 computation now.

To me, this illustrates why it is not good enough to have only a 
reference
implementation available at our fingertips. Of course, above-mentioned
OpenSSL supports SHA-256 and SHA3-256, too, and at least recent versions
vectorize those, too.
There is a lot of high-quality optimized code for all SHA-3 functions 
and many CPUs in the Keccak code package 
https://github.com/gvanas/KeccakCodePackage

By the way, you speak about SHA3-256, but the right choice would be to 
use SHAKE128. Well, what is exactly the right choice depends on what you 
want. If you want to have a function in the SHA3 standard (FIPS 202), it 
is SHAKE128. You can boost performance on high-end CPUs by adopting 
ParallelHash from NIST SP 800-185, still a NIST standard. You can
multiply that performance again by a factor of 2 by adopting 
KangarooTwelve. This is our (Keccak team) proposal for a parallelizable 
Keccak-based hash function that has a safety margin comparable to that 
of the SHA-2 functions. See https://keccak.team/kangarootwelve.html
May I also suggest you read
https://keccak.team/2017/is_sha3_slow.html
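
To keep the function names above apart, here is a minimal Python sketch
that computes a 256-bit digest with SHA-256, SHA3-256 and SHAKE128 using
the standard hashlib module. Asking SHAKE128 for 32 bytes of output is an
assumption made here so the three results have the same length; it is not
something the thread settles. ParallelHash and KangarooTwelve are not in
hashlib and are left out.

```python
import hashlib


def digests_256(data: bytes) -> dict:
    """Return 256-bit hex digests of `data` under three functions."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "sha3_256": hashlib.sha3_256(data).hexdigest(),
        # SHAKE128 is an extendable-output function, so the caller picks
        # the output length. 32 bytes is an assumption for comparison.
        "shake128": hashlib.shake_128(data).hexdigest(32),
    }


if __name__ == "__main__":
    for name, value in digests_256(b"hello\n").items():
        print(f"{name:9s} {value}")
```

SHA3-256 and SHAKE128 are both built on the Keccak-f permutation, while
SHA-256 uses an unrelated compression function.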

Also, ARM processors have become a lot more popular, so we'll want to have
high-quality implementations of the hash algorithm also for those
processors.

Likewise, in contrast to 2005, nowadays implementations of Git in
languages as obscure as Javascript are not only theoretical but do exist
in practice (https://github.com/creationix/js-git). I had a *very* quick
look for libraries providing crypto in Javascript and immediately found
the Stanford Javascript Crypto library
(https://github.com/bitwiseshiftleft/sjcl/) which seems to offer SHA-256
but not SHA3-256 computation.

Back to Intel processors: I read some vague hints about extensions
accelerating SHA-256 computation on future Intel processors, but not
SHA3-256.

It would make sense, of course, that more crypto libraries and more
hardware support would be available for SHA-256 than for SHA3-256 given
the time since publication: 16 vs 5 years (I am playing it loose here,
taking just the year into account, not the exact date, so please treat
that merely as a ballpark figure).

So from a practical point of view, I wonder what your take is on, say,
hardware support for SHA3-256. Do you think this will become a focus soon?
I think this is a chicken-and-egg problem. In any case, hardware support 
for one SHA3-256 will also work for the other SHA3 and SHAKE functions 
as they all use the same underlying primitive: the Keccak-f permutation. 
This is not the case for SHA2 because SHA224 and SHA256 use a different 
compression function than SHA384, SHA512, SHA512/224 and SHA512/256.

Also, what is your take on the question whether SHA-256 is good enough?
SHA-1 was broken theoretically already 10 years after it was published
(which unfortunately did not prevent us from baking it into Git), after
all, while SHA-256 is 16 years old and the only known weakness does not
apply to Git's usage?
I think even the weakness of SHA-1 will be hard to exploit to do 
something bad in Git. SHA-256 is more conservative than SHA-1 and I 
don't expect it to be broken (unless NSA inserted a backdoor but I don't 
think that is likely). But I also don't expect SHAKE128, ParallelHash or 
KangarooTwelve to be broken, looking at the existing cryptanalysis.
Also, while I have the attention of somebody who knows a heck more about
cryptography than Git's top 10 committers combined: how soon do you expect
practical SHA-1 attacks that are much worse than what we already have
seen? I am concerned that if we do not move fast enough to a new hash
algorithm, and somebody finds a way in the meantime to craft arbitrary
messages given a prefix and an SHA-1, then we have a huge problem on
our hands.
As said, I don't expect practical SHA-1 attacks soon. But let me end 
with a philosophical note. Independent of all the arguments for and 
against, I think this is about doing the right thing. The choice is here 
between SHA1/SHA2 on the one hand and SHA3/Keccak on the other. The 
former standards are imposed on us by NSA and the latter are the best 
that came out of an open competition involving all experts worldwide. 
What would be closest to the philosophy of Git (and by extension Linux 
or open-source in general)?

Kind regards,

Joan

* Re: [PATCH v4] technical doc: add a design doc for hash function transition
  2017-09-29 17:34           ` Jonathan Nieder
@ 2017-10-02  8:25             ` Junio C Hamano
  2017-10-02 19:41             ` Jason Cooper
  1 sibling, 0 replies; 113+ messages in thread
From: Junio C Hamano @ 2017-10-02  8:25 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Shawn Pearce, Linus Torvalds, Git Mailing List, Stefan Beller,
	bmwill, Jonathan Tan, Jeff King, David Lang, brian m. carlson,
	Masaya Suzuki, demerphq, The Keccak Team, Johannes Schindelin

Jonathan Nieder <jrnieder@gmail.com> writes:

>>> +6. Skip fetching some submodules of a project into a NewHash
>>> +   repository. (This also depends on NewHash support in Git
>>> +   protocol.)
>>
>> It is unclear what this means.  Around submodule support, one thing
>> I can think of is that a NewHash tree in a superproject would record
>> a gitlink that is a NewHash commit object name in it, therefore it
>> cannot refer to an unconverted SHA-1 submodule repository.  But it
>> is unclear if the above description refers to the same issue, or
>> something else.
>
> It refers to that issue.

We may want to find a way to make it clear, then.

>> It makes me wonder if we want to add the hashname in this object
>> header.  "length" would be different for non-blob objects anyway,
>> and it is not "compat metadata" we want to avoid baked in, yet it
>> would help diagnose a mistake of attempting to use a "mixed" objects
>> in a single repository.  Not a big issue, though.
>
> Do you mean that adding the hashname into the computation that
> produces the object name would help in some use case?

What I mean is that for SHA-1 objects we keep the object header to
be "<type> <length> NUL".  For objects in the newer world, change the
object header to "<type> <hash> <length> NUL", and include the
hashname in the object name computation.
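
A minimal Python sketch of the two header layouts being compared. The
first function follows the existing "<type> <length> NUL" scheme. The
second is hypothetical: it only illustrates the "<type> <hash> <length>
NUL" idea floated here, and uses SHA-256 as a stand-in because NewHash has
not been chosen. The function names are made up for illustration.

```python
import hashlib


def sha1_object_name(obj_type: str, body: bytes) -> str:
    """Existing scheme: SHA-1 over '<type> <length> NUL' followed by the body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def newhash_object_name(obj_type: str, body: bytes,
                        hash_name: str = "sha256") -> str:
    """Hypothetical scheme: the hash name is part of the hashed header."""
    header = f"{obj_type} {hash_name} {len(body)}".encode() + b"\0"
    return hashlib.new(hash_name, header + body).hexdigest()


if __name__ == "__main__":
    print(sha1_object_name("blob", b""))
    print(newhash_object_name("blob", b""))
```

With the hash name inside the header, the same blob hashed as "sha256"
and as some later function would never share a header, which is the
diagnostic benefit discussed above.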

> For loose objects, it would be nice to name the hash in the file, so
> that "file" can understand what is happening if someone accidentally
> mixes types using "cp".  The only downside is losing the ability to
> copy blobs (which have the same content despite being named using
> different hashes) between repositories after determining their new
> names.  That doesn't seem like a strong downside --- it's pretty
> harmless to include the hash type in loose object files, too.  I think
> I would prefer this to be a "magic number" instead of part of the
> zlib-deflated payload, since this way "file" can discover it more
> easily.

Yeah, thanks for doing pros-and-cons for me ;-)

>> If it is a goal to eventually be able to lose SHA-1 compatibility
>> metadata from the objects, then we might want to remove SHA-1 based
>> signature bits (e.g. PGP trailer in signed tag, gpgsig header in the
>> commit object) from NewHash contents, and instead have them stored
>> in a side "metadata" table, only to be used while converting back.
>> I dunno if that is desirable.
>
> I don't consider that desirable.

Agreed.  Let's not go there.

>> Hmm, as the corresponding packfile stores object data only in
>> NewHash content format, it is somewhat curious that this table that
>> stores CRC32 of the data appears in the "Tables for each object
>> format" section, as they would be identical, no?  Unless I am
>> grossly misreading the spec, the checksum should either go outside
>> the "Tables for each object format" section but still in .idx, or
>> should be eliminated and become part of the packdata stream instead,
>> perhaps?
>
> It's actually only present for the first object format.  Will find a
> better way to describe this.

I see.  One way to do so is to have it upfront before the "after
this point, these tables repeat for each of the hashes" part of the
file.

>> Oy.  So we can go from a short prefix to the pack location by first
>> finding it via binsearch in the short-name table, realize that it is
>> nth object in the object name order, and consulting this table.
>> When we know the pack-order of an object, there is no direct way to
>> go to its location (short of reversing the name-order-to-pack-order
>> table)?
>
> An earlier version of the design also had a pack-order-to-pack-offset
> table, but we weren't able to think of any cases where that would be
> used without also looking up the object name that can be used to
> verify the integrity of the inflated object.

The primary thing I was interested in knowing was if we tried to
think of any case where it may be useful and then didn't think of
any---I couldn't but I know I am not imaginative enough, and I
wanted to know you guys didn't, either.
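
For illustration, a toy Python model of the lookup path described above.
It assumes three in-memory tables: sorted names, a name-order to
pack-order table, and pack offsets stored in name order. The class and
method names are hypothetical and the real .idx layout is binary; this
only shows why going from pack order to an offset needs the
name-order-to-pack-order table inverted.

```python
import bisect


class ToyIndex:
    def __init__(self, objects):
        # objects: list of (hex_name, pack_offset) given in pack order.
        by_name = sorted(range(len(objects)), key=lambda i: objects[i][0])
        self.names = [objects[i][0] for i in by_name]    # name order
        self.name_to_pack_order = by_name                # name order -> pack order
        self.offsets = [objects[i][1] for i in by_name]  # name order -> offset

    def offset_for_prefix(self, prefix):
        """Binary-search the sorted names, then read the offset table."""
        n = bisect.bisect_left(self.names, prefix)
        if n == len(self.names) or not self.names[n].startswith(prefix):
            raise KeyError(prefix)
        if n + 1 < len(self.names) and self.names[n + 1].startswith(prefix):
            raise ValueError(f"ambiguous prefix {prefix}")
        return self.offsets[n]

    def offset_for_pack_order(self, k):
        """No direct table: invert name_to_pack_order (linear scan here)."""
        n = self.name_to_pack_order.index(k)
        return self.offsets[n]
```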

* Re: [PATCH v4] technical doc: add a design doc for hash function transition
  2017-09-28  4:43       ` [PATCH v4] technical doc: add a design doc for hash function transition Jonathan Nieder
  2017-09-29  6:06         ` Junio C Hamano
@ 2017-10-02  9:02         ` Junio C Hamano
  2017-10-02 19:23         ` Jason Cooper
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 113+ messages in thread
From: Junio C Hamano @ 2017-10-02  9:02 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Shawn Pearce, Linus Torvalds, Git Mailing List, Stefan Beller,
	bmwill, Jonathan Tan, Jeff King, David Lang, brian m. carlson,
	Masaya Suzuki, demerphq, The Keccak Team, Johannes Schindelin

Jonathan Nieder <jrnieder@gmail.com> writes:

> +Reading an object's sha1-content
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +The sha1-content of an object can be read by converting all newhash-names
> +its newhash-content references to sha1-names using the translation table.

Sure.

> +Fetch
> +~~~~~
> +Fetching from a SHA-1 based server requires translating between SHA-1
> +and NewHash based representations on the fly.
> +
> +SHA-1s named in the ref advertisement that are present on the client
> +can be translated to NewHash and looked up as local objects using the
> +translation table.
> +
> +Negotiation proceeds as today. Any "have"s generated locally are
> +converted to SHA-1 before being sent to the server, and SHA-1s
> +mentioned by the server are converted to NewHash when looking them up
> +locally.

Any of our alternate object stores by definition is a NewHash
repository--otherwise we'd violate "no mixing" rule.  It may or may
not have the translation table for its objects.  If it no longer
has the translation table (because it migrated to NewHash only world
before we did), then we can still use it as our alternate but we
cannot use it for the purpose of common ancestor discovery.
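
A minimal Python sketch of the name translation during negotiation,
assuming a simple bidirectional in-memory table. The class and function
names are hypothetical. Objects without a table entry (for example those
reachable only through an alternate that dropped its translation table)
simply produce no "have".

```python
class TranslationTable:
    """Toy bidirectional sha1-name <-> newhash-name mapping."""

    def __init__(self):
        self._to_sha1 = {}
        self._to_newhash = {}

    def add(self, sha1_name, newhash_name):
        self._to_sha1[newhash_name] = sha1_name
        self._to_newhash[sha1_name] = newhash_name

    def to_sha1(self, newhash_name):
        return self._to_sha1.get(newhash_name)

    def to_newhash(self, sha1_name):
        return self._to_newhash.get(sha1_name)


def haves_for_sha1_server(local_newhash_names, table):
    """Convert local names to SHA-1 for 'have' lines; skip unmapped ones."""
    haves = []
    for name in local_newhash_names:
        sha1_name = table.to_sha1(name)
        if sha1_name is not None:
            haves.append(sha1_name)
    return haves
```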

> +After negotiation, the server sends a packfile containing the
> +requested objects.

s/objects.$/& These are all SHA-1 contents./

> +We convert the packfile to NewHash format using
> +the following steps:
> +
> +1. index-pack: inflate each object in the packfile and compute its
> +   SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
> +   objects the client has locally. These objects can be looked up
> +   using the translation table and their sha1-content read as
> +   described above to resolve the deltas.

That procedure would give us the object's SHA-1 contents for
ref-delta objects.  For an ofs-delta object, by definition, its base
object should appear in the same packstream, so we should eventually
be able to get to the SHA-1 contents of the delta base, and from
there we can apply the delta to obtain the SHA-1 contents.  For a
non-delta object, we already have its SHA-1 contents in the
packstream.

So we can get SHA-1 names and SHA-1 contents of each and every
object in the packstream in this step.

Are we actually writing out a .pack/.idx pair that is usable in the
SHA-1 world at this stage?  Or are we going to read from something
we keep in-core in the step #3 below?

> +2. topological sort: starting at the "want"s from the negotiation
> +   phase, walk through objects in the pack and emit a list of them,
> +   excluding blobs, in reverse topologically sorted order, with each
> +   object coming later in the list than all objects it references.
> +   (This list only contains objects reachable from the "wants". If the
> +   pack from the server contained additional extraneous objects, then
> +   they will be discarded.)

Presumably this is a list of SHA-1 names, as we do not have
enough information to compute NewHash names at this point.  May
want to spell it out here.

Would it discard the auto-followed tags if we do the "traverse from
wants only"?  Traversing the objects in the packfile to find the
"tips" that are not referenced from any other object in the pack
might be necessary, and it shouldn't be too costly, I'd guess.
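
A small Python sketch of both ideas, assuming the pack has already been
parsed into a dict mapping each SHA-1 name to (type, list of referenced
names). That representation and the function names are hypothetical. The
first function emits non-blob objects reachable from the wants so that
each object comes after everything it references; the second finds pack
"tips", i.e. objects no other object in the pack refers to.

```python
def topo_order(pack, wants):
    """Non-blob objects reachable from `wants`, referenced objects first."""
    order, seen = [], set()
    for want in wants:
        stack = [(want, False)]
        while stack:
            name, expanded = stack.pop()
            if expanded:
                order.append(name)  # all references were emitted already
                continue
            if name in seen or name not in pack:
                continue  # visited, or an object the client has locally
            kind, refs = pack[name]
            if kind == "blob":
                continue
            seen.add(name)
            stack.append((name, True))
            for ref in refs:
                stack.append((ref, False))
    return order


def pack_tips(pack):
    """Objects in the pack that no other object in the pack references."""
    referenced = {ref for _, refs in pack.values() for ref in refs}
    return sorted(name for name in pack if name not in referenced)
```

In a pack that also carries a tag pointing at an older commit, walking
from the wants alone misses the tag, while pack_tips() reports it.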

> +3. convert to newhash: open a new (newhash) packfile. Read the topologically
> +   sorted list just generated. For each object, inflate its
> +   sha1-content, convert to newhash-content, and write it to the newhash
> +   pack. Record the new sha1<->newhash mapping entry for use in the idx.

Are we doing any deltification here?  If we are computing .pack/.idx
pair that can be usable in the SHA-1 world in step #1, then reusing
blob deltas should be trivial (a good delta-base in the SHA-1 world
is a good delta-base in the NewHash world, too).  For things that have
outgoing references, like trees, such a heuristic may not give us the
absolute best delta-base, but I guess it would still be a good
approximation to reuse the delta/base object relationship from the
SHA-1 world in the NewHash world, assuming that the server did a good
job choosing the bases.
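
To illustrate why step 3 needs the topologically sorted list, here is a
minimal Python sketch that converts commit objects only. It assumes
commits are plain text with "tree" and "parent" header lines, uses
SHA-256 as a stand-in for the undecided NewHash, and ignores trees, tags
and deltification. All function names are hypothetical.

```python
import hashlib

REF_HEADERS = ("tree ", "parent ")


def object_name(hash_name, obj_type, body):
    """Hash '<type> <length> NUL' + body with the named function."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.new(hash_name, header + body).hexdigest()


def convert_commit(sha1_content, mapping):
    """Rewrite referenced SHA-1 names using `mapping` (sha1 -> newhash)."""
    head, sep, message = sha1_content.partition(b"\n\n")
    out = []
    for line in head.decode().split("\n"):
        for key in REF_HEADERS:
            if line.startswith(key):
                # KeyError here means a referenced object was not
                # converted yet, i.e. the input order was wrong.
                line = key + mapping[line[len(key):]]
                break
        out.append(line)
    return "\n".join(out).encode() + sep + message


def convert_in_order(commits, mapping):
    """Convert commits (parents first) and record each new mapping entry."""
    for content in commits:
        converted = convert_commit(content, mapping)
        mapping[object_name("sha1", "commit", content)] = object_name(
            "sha256", "commit", converted)
    return mapping
```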

> +4. sort: reorder entries in the new pack to match the order of objects
> +   in the pack the server generated and include blobs. Write a newhash idx
> +   file

OK.

> +5. clean up: remove the SHA-1 based pack file, index, and
> +   topologically sorted list obtained from the server in steps 1
> +   and 2.

Ah, OK, so we do write the SHA-1 pack/idx in the first step.  OK.

> +Push
> +~~~~
> +Push is simpler than fetch because the objects referenced by the
> +pushed objects are already in the translation table. The sha1-content
> +of each object being pushed can be read as described in the "Reading
> +an object's sha1-content" section to generate the pack written by git
> +send-pack.

OK.

> +Signed Commits
> +~~~~~~~~~~~~~~
> +We add a new field "gpgsig-newhash" to the commit object format to allow
> +signing commits without relying on SHA-1. It is similar to the
> +existing "gpgsig" field. Its signed payload is the newhash-content of the
> +commit object with any "gpgsig" and "gpgsig-newhash" fields removed.

Do we prepare for newerhash, too?  IOW, should the signed payload be
the newhash-contents with any field whose name is "gpgsig" or begins
with "gpgsig-" followed by anything?

> +This means commits can be signed
> +1. using SHA-1 only, as in existing signed commit objects
> +2. using both SHA-1 and NewHash, by using both gpgsig-newhash and gpgsig
> +   fields.
> +3. using only NewHash, by only using the gpgsig-newhash field.
> +
> +Old versions of "git verify-commit" can verify the gpgsig signature in
> +cases (1) and (2) without modifications and view case (3) as an
> +ordinary unsigned commit.

For old clients to be able to verify (2), signed payload for SHA-1
is everything in SHA-1 contents minus "gpgsig"; "gpgsig-newhash"
should not get excluded from the computation.  Am I correct?

I am primarily finding it a bit disturbing that there is a bit of
asymmetry here.
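
A Python sketch of the payload variants under discussion, assuming commit
headers are "name value" lines with continuation lines that start with a
space. The helper names are hypothetical. The first variant strips exactly
the two fields named in the patch, the second strips "gpgsig" and anything
beginning with "gpgsig-", and the third is what an old client would sign
over (only "gpgsig" removed).

```python
def strip_headers(content, should_strip):
    """Drop header fields (and their continuation lines) selected by name."""
    head, sep, body = content.partition(b"\n\n")
    kept, skipping = [], False
    for line in head.split(b"\n"):
        if line.startswith(b" "):
            # Continuation line: shares the fate of the preceding header.
            if not skipping:
                kept.append(line)
            continue
        name = line.split(b" ", 1)[0]
        skipping = should_strip(name)
        if not skipping:
            kept.append(line)
    return b"\n".join(kept) + sep + body


def payload_strip_known(content):
    return strip_headers(
        content, lambda n: n in (b"gpgsig", b"gpgsig-newhash"))


def payload_strip_any_gpgsig(content):
    return strip_headers(
        content, lambda n: n == b"gpgsig" or n.startswith(b"gpgsig-"))


def sha1_payload(content):
    return strip_headers(content, lambda n: n == b"gpgsig")
```

The asymmetry shows up directly: sha1_payload() keeps the gpgsig-newhash
field in what gets signed, while the NewHash payload drops both.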

> +Signed Tags
> +~~~~~~~~~~~

This message stops here for now.

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-26 22:11                     ` Johannes Schindelin
  2017-09-26 22:25                       ` [PATCH] technical doc: add a design doc for hash function transition Stefan Beller
  2017-09-26 23:51                       ` RFC v3: Another proposed hash function transition plan Jonathan Nieder
@ 2017-10-02 14:00                       ` Jason Cooper
  2017-10-02 17:18                         ` Linus Torvalds
  2 siblings, 1 reply; 113+ messages in thread
From: Jason Cooper @ 2017-10-02 14:00 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Jeff King, David Lang, brian m. carlson

Hi Johannes,

Thanks for the response.  Sorry for the delay.  Had a large deadline for
$dayjob.

On Wed, Sep 27, 2017 at 12:11:14AM +0200, Johannes Schindelin wrote:
> On Tue, 26 Sep 2017, Jason Cooper wrote:
> > On Thu, Sep 14, 2017 at 08:45:35PM +0200, Johannes Schindelin wrote:
> > > On Wed, 13 Sep 2017, Linus Torvalds wrote:
> > > > On Wed, Sep 13, 2017 at 6:43 AM, demerphq <demerphq@gmail.com> wrote:
> > > > > SHA3 however uses a completely different design where it mixes a 1088
> > > > > bit block into a 1600 bit state, for a leverage of 2:3, and the excess
> > > > > is *preserved between each block*.
> > > > 
> > > > Yes. And considering that the SHA1 attack was actually predicated on
> > > > the fact that each block was independent (no extra state between), I
> > > > do think SHA3 is a better model.
> > > > 
> > > > So I'd rather see SHA3-256 than SHA256.
> > 
> > Well, for what it's worth, we need to be aware that SHA3 is *different*.
> > In crypto, "different" = "bugs haven't been found yet".  :-P
> > 
> > And SHA2 is *known*.  So we have a pretty good handle on how it'll
> > weaken over time.
> 
> Here, you seem to agree with me.

Yep.

> > > SHA-256 got much more cryptanalysis than SHA3-256, and apart from the
> > > length-extension problem that does not affect Git's usage, there are no
> > > known weaknesses so far.
> > 
> > While I think that statement is true on its face (particularly when
> > including post-competition analysis), I don't think it's sufficient
> > justification to choose one over the other.
> 
> And here you don't.
> 
> I find that very confusing.

What I'm saying is that there is more to selecting a hash function for
git than just the cryptographic assessment.  In fact I would argue that
the primary cryptographic concern for git is "What is the likelihood
that we'll wake up one day to full collisions with no warning?"

To that, I'd argue that SHA-256's time in the field and SHA3-256's
competition give them both passing marks in that regard.  fwiw, I'd also
put Blake and Skein in there as well.

The chance that any of those will suffer sudden, catastrophic failure is
minimal.  IOW, we'll have warnings, and time to migrate to the next
function.

None of us can predict the future, but having a significant amount of
vetting reduces the chances of catastrophic failure.

> > > It would seem that the experts I talked to were much more concerned about
> > > that amount of attention than the particulars of the algorithm. My
> > > impression was that the new features of SHA3 were less studied than the
> > > well-known features of SHA2, and that the new-ness of SHA3 is not
> > > necessarily a good thing.
> > 
> > The only thing I really object to here is the abstract "experts".  We're
> > talking about cryptography and integrity here.  It's no longer
> > sufficient to cite anonymous experts.  Either they can put their
> > thoughts, opinions and analysis on record here, or it shouldn't be
> > considered.  Sorry.
> 
> Sorry, you are asking cryptography experts to spend their time on the Git
> mailing list. I tried to get them to speak out on the Git mailing list.
> They respectfully declined.

Ok, fair enough.  Just please understand that it's difficult to place
much weight on statements that we can't discuss with the person who made
them.

> > However, whether we choose SHA2 or SHA3 doesn't matter.
> 
> To you, it does not matter.

Well, I'd say it does not matter for *most* users.

> To me, it matters. To the several thousand developers working on Windows,
> probably the largest Git repository in active use, it matters. It matters
> because the speed difference that has little impact on you has a lot more
> impact on us.

Ahhh, so if I understand you correctly, you'd prefer SHA-256 over
SHA3-256 because it's more performant for your usecase?  Well, that's a
completely different animal than cryptographic suitability.

Have you been able to crunch numbers yet?  Will you be able to share
some empirical data?  I'd love to see some comparisons between SHA1,
SHA-256, SHA512-256, and SHA3-256 for different git operations under
your work load.
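
As a starting point for that kind of comparison, a rough Python sketch
that measures raw hashing throughput with hashlib. It says nothing about
real git operations or any particular workload; the payload size, round
count and function names are assumptions. "sha512_256" depends on the
local OpenSSL build, so unavailable functions are reported as None.

```python
import hashlib
import time

CANDIDATES = ("sha1", "sha256", "sha512_256", "sha3_256")


def throughput_mb_per_s(hash_name, payload, rounds=20):
    """Hash `payload` `rounds` times and return MiB per second."""
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.new(hash_name, payload).digest()
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(payload) * rounds / (1024 * 1024) / elapsed


def compare(payload, rounds=20):
    """Return {hash name: MiB/s, or None if not available locally}."""
    results = {}
    for name in CANDIDATES:
        try:
            results[name] = throughput_mb_per_s(name, payload, rounds)
        except ValueError:
            results[name] = None  # not provided by this Python/OpenSSL
    return results


if __name__ == "__main__":
    for name, speed in compare(b"\0" * (8 * 1024 * 1024)).items():
        print(name, "unavailable" if speed is None else f"{speed:.0f} MiB/s")
```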

> > If SHA3 is chosen as the successor, it's going to get a *lot* more
> > adoption, and thus, a lot more analysis.  If cracks start to show, the
> > hard work of making git flexible is already done.  We can migrate to
> > SHA4/5/whatever in an orderly fashion with far less effort than the
> > transition away from SHA1.
> 
> Sure. And if XYZ789 is chosen, it's going to get a *lot* more adoption,
> too.
> 
> We think.
> 
> Let's be realistic. Git is pretty important to us, but it is not important
> enough to sway, say, Intel into announcing hardware support for SHA3.
> And if you try to force through *any* hash function only so that it gets
> more adoption and hence more support,

That's quite a jump from what I was saying.  I would never advise using
code in a production setting just to increase adoption.

What I /was/ saying: Let's say you don't get what you want, and SHA3-256
is chosen.  It's not the end of the world from a cryptographic PoV.
The hard work of making the git (and libgit2) codebases hash-flexible is
already done.  So, if you're correct, and SHA3 was too immature, the
increased visibility will help us discover that more quickly.  And, the
code will already be in a position to conduct an orderly migration.

Will it still be costly?  Yes.  But I would argue that it's naive to
think that we will be using git/sha3-256 or git/sha-256 10 to 15 years
from now.  It might be git, it might not.  But there *will* be another
migration of existing data (code, history, etc) from one object storage
model to another.  It might be git/SHA4-512, or hg/sha4-384.

So, we aren't trying to find the perfect hash function so that we
naively think we'll never have to change again.  Rather, we're choosing
the next hash function so that we can hold off another migration for as
long as possible.  After all, SHA4-512 doesn't exist yet. ;-)

> in the short run you will make life
> harder for developers on more obscure platforms, who may not easily get
> high-quality, high-speed implementations of anything but the very
> mainstream (which is, let's face it, MD5, SHA-1 and SHA-256). I know I
> would have cursed you for such a decision back when I had to work on AIX
> and IRIX.

I think you're assuming that all developers on obscure platforms have
a similar git usecase to your current one.  I've not heard of that being
the case.

> > For my use cases, as a user of git, I have a plan to maintain provable
> > integrity of existing objects stored in git under sha1 while migrating
> > away from sha1.  The same plan works for migrating away from SHA2 or
> > SHA3 when the time comes.
> 
> Please do not make the mistake of taking your use case to be a template
> for everybody's use case.

I wasn't.  But I will argue that my usecase is valid.  Just as yours is.

> Migrating a large team away from any hash function to another one *will*
> be painful, and costly.

Assuming that it will never happen again would make that doubly costly.

> Migrating will be very costly for hosting companies like GitHub, Microsoft
> and BitBucket, too.

<with_my_business_hat_on>
GitHub and BitBucket have git as the core of their business model.  If
they aren't keeping an eye on the future path of git and maintaining
migration plans, shame on them.
</with_my_business_hat_on>

Thanks,

Jason.

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-30 22:02                               ` Joan Daemen
@ 2017-10-02 14:26                                 ` Johannes Schindelin
  0 siblings, 0 replies; 113+ messages in thread
From: Johannes Schindelin @ 2017-10-02 14:26 UTC (permalink / raw)
  To: Joan Daemen
  Cc: Gilles Van Assche, Linus Torvalds, demerphq, Brandon Williams,
	Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller,
	Jonathan Tan, Jeff King, David Lang, brian m. carlson,
	Keccak Team

Hi Joan,

On Sun, 1 Oct 2017, Joan Daemen wrote:

> On 30/09/17 00:33, Johannes Schindelin wrote:
> 
> > As far as Git is concerned, we not only care about the source code of
> > the hash algorithm we use, we need to care even more about what you
> > call "executable": ready-to-use, high quality, well-tested
> > implementations.
> > 
> > We carry source code for SHA-1 as part of Git's source code, which was
> > hand-tuned to be as fast as Linus could get it, which was tricky given
> > that the tuning should be general enough to apply to all common intel
> > CPUs.
> > 
> > This hand-crafted code was blown out of the water by OpenSSL's SHA-1
> > in our tests here at Microsoft, thanks to the fact that OpenSSL does
> > vectorized SHA-1 computation now.
> > 
> > To me, this illustrates why it is not good enough to have only a
> > reference implementation available at our finger tips. Of course,
> > above-mentioned OpenSSL supports SHA-256 and SHA3-256, too, and at
> > least recent versions vectorize those, too.
> 
> There is a lot of high-quality optimized code for all SHA-3 functions
> and many CPUs in the Keccak code package
> https://github.com/gvanas/KeccakCodePackage but also OpenSSL contains
> some good SHA-3 code and then there are all those related to Ethereum.
> 
> By the way, you speak about SHA3-256, but the right choice would be to
> use SHAKE128. Well, what is exactly the right choice depends on what you
> want. If you want to have a function in the SHA3 standard (FIPS 202), it
> is SHAKE128.  You can boost performance on high-end CPUs by adopting
> Parallelhash from NIST SP 800-185, still a NIST standard. You can
> multiply that performance again by a factor of 2 by adopting
> KangarooTwelve. This is our (Keccak team) proposal for a parallelizable
> Keccak-based hash function that has a safety margin comparable to that
> of the SHA-2 functions. See https://keccak.team/kangarootwelve.html May
> I also suggest you read https://keccak.team/2017/is_sha3_slow.html

Thanks.

I have to admit that all those names that do not start with SHA and do not
end in 256 make me a bit dizzy.

> > Back to Intel processors: I read some vague hints about extensions
> > accelerating SHA-256 computation on future Intel processors, but not
> > SHA3-256.
> > 
> > It would make sense, of course, that more crypto libraries and more
> > hardware support would be available for SHA-256 than for SHA3-256
> > given the time since publication: 16 vs 5 years (I am playing it loose
> > here, taking just the year into account, not the exact date, so please
> > treat that merely as a ballpark figure).
> > 
> > So from a practical point of view, I wonder what your take is on, say,
> > hardware support for SHA3-256. Do you think this will become a focus
> > soon?
> 
> I think this is a chicken-and-egg problem. In any case, hardware support
> for one SHA3-256 will also work for the other SHA3 and SHAKE functions
> as they all use the same underlying primitive: the Keccak-f permutation.
> This is not the case for SHA2 because SHA224 and SHA256 use a different
> compression function than SHA384, SHA512, SHA512/224 and SHA512/256.

Okay.

So given that Git does not exactly have a big sway on hardware vendors, we
would have to hope that some other chicken lays that egg.

> > Also, what is your take on the question whether SHA-256 is good
> > enough?  SHA-1 was broken theoretically already 10 years after it was
> > published (which unfortunately did not prevent us from baking it into
> > Git), after all, while SHA-256 is 16 years old and the only known
> > weakness does not apply to Git's usage?
> 
> SHA-256 is more conservative than SHA-1 and I don't expect it to be
> broken in the coming decades (unless NSA inserted a backdoor but I don't
> think that is likely). But looking at the existing cryptanalysis, I
> think it is even less likely that SHAKE128, ParallelHash or
> KangarooTwelve will be broken anytime.

That's reassuring! ;-)

> > Also, while I have the attention of somebody who knows a heck more
> > about cryptography than Git's top 10 committers combined: how soon do
> > you expect practical SHA-1 attacks that are much worse than what we
> > already have seen? I am concerned that if we do not move fast enough
> > to a new hash algorithm, and somebody finds a way in the meantime to
> > craft arbitrary messages given a prefix and an SHA-1, then we have a
> > huge problem on our hands.
> 
> This is hard to say. To be honest, when witnessing the first MD5
> collisions I did not expect them to lead to some real world attacks and
> just a few years later we saw real-world forged certificates based on
> MD5 collisions. And SHA-1 has a lot in common with MD5...

Oh, okay. I did not realize that MD5 and SHA-1 are so similar in design,
thank you for educating me!

> But let me end with a philosophical note. Independent of all the
> arguments for and against, I think this is ultimately about doing the
> right thing. The choice is here between SHA1/SHA2 on the one hand and
> SHA3/Keccak on the other.  The former standards are imposed on us by NSA
> and the latter are the best that came out of an open competition
> involving all experts in the field worldwide.  What would be closest to
> the philosophy of Git (and by extension Linux or open-source in
> general)?

Heh. Do you realize that you are talking to a Microsoftie, i.e. one of the
"evil company"? ;-)

So philosophically, I am much more pragmatic. Or maybe I am not, after
all, I joined a company at a time when it is arguably going through one of
the most dramatic cultural changes any company has seen lately (a year
ago, we became #1 contributor on GitHub according to Business Insider, and
as far as I can tell, we're not willing to pass that belt to anyone else).

But when it comes to the philosophy of Git, I fear I have to disappoint
you: Git's fundamental concepts were not developed in an open process. Git
even so much as rejected professional advice *not* to bake SHA-1 into
everything.

Of course, we are undoing this damage right now, and your input helps
greatly, I would think.

While I feel reassured by your response that SHA-256 would be "good
enough" and would have some real-life benefits of announced hardware
support, I would now also feel comfortable if my preference was overruled
in the end, in favor of a hash from the Keccak family. I would understand,
for example, if the parallel option turned out to be enticing enough for
other core Git contributors to aim for, say, K12.

Again, thank you very much for chiming in,
Johannes

* Re: RFC v3: Another proposed hash function transition plan
  2017-09-26 23:51                       ` RFC v3: Another proposed hash function transition plan Jonathan Nieder
@ 2017-10-02 14:54                         ` Jason Cooper
  2017-10-02 16:50                           ` Brandon Williams
  0 siblings, 1 reply; 113+ messages in thread
From: Jason Cooper @ 2017-10-02 14:54 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Johannes Schindelin, Linus Torvalds, demerphq, Brandon Williams,
	Junio C Hamano, Git Mailing List, Stefan Beller, Jonathan Tan,
	Jeff King, David Lang, brian m. carlson

Hi Jonathan,

On Tue, Sep 26, 2017 at 04:51:58PM -0700, Jonathan Nieder wrote:
> Johannes Schindelin wrote:
> > On Tue, 26 Sep 2017, Jason Cooper wrote:
> >> For my use cases, as a user of git, I have a plan to maintain provable
> >> integrity of existing objects stored in git under sha1 while migrating
> >> away from sha1.  The same plan works for migrating away from SHA2 or
> >> SHA3 when the time comes.
> >
> > Please do not make the mistake of taking your use case to be a template
> > for everybody's use case.
> 
> That said, I'm curious at what plan you are alluding to.  Is it
> something that could benefit others on the list?

Well, it's just a plan at this point, as there's a lot of other work to
do in the meantime, and there's no possibility of transitioning until
the dust has settled on NEWHASH.  :-)

Given an existing repository that needs to migrate from SHA1 to NEWHASH,
and maintain backwards compatibility with clients that haven't migrated
yet, how do we

  a) perform that migration,
  b) allow non-updated clients to use the data prior to the switch, and
  c) maintain provable integrity of the old objects as well as the new.

The primary method is counter-hashing, which re-uses the blobs and
creates parallel, deterministic tree, commit, and tag objects using
NEWHASH for everything up to flag day.  After flag day, only NEWHASH is
used.  A PGP "transition" key is used to counter-sign the NEWHASH version
of the old signed tags.  The transition key is not required to be
different from the existing maintainer's key.

A critical feature is the ability of entities other than the maintainer
to migrate to NEWHASH.  For example, let's say that git has fully
implemented and tested NEWHASH.  linux.git intends to migrate, but it's
going to take several months (get all the developers herded up).

In the interim, a security company relying on Linux for its products
can counter-hash Linus' repo, and continue to do so every time he
updates his tree.  This shrinks the attack window for an entity (with an
undisclosed break of SHA1) down to a few minutes to an hour.  Any
substitution made after that window would be revealed by a later check
of the counter hashes.

The deterministic feature is critical here because there is valuable
integrity and trust built by counter-hashing quickly after publication.
So once Linux migrates to NEWHASH, the hashes calculated by the security
company should be identical.  IOW, use the timestamps that are in the
SHA1 commit objects for the NEWHASH objects.  Which should be obvious,
but it's worth explicitly mentioning that determinism provides great
value.
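
A minimal sketch of why counter-hashing can be deterministic: a Git
object name is the hash of a short "<type> <length>\0" header followed
by the object content, so two parties hashing the same bytes get the
same name.  SHA-256 stands in for NEWHASH here purely as an assumption
(the choice had not been settled in this thread), and object_name is a
hypothetical helper, not a Git API.

```python
import hashlib


def object_name(obj_type: str, content: bytes, algo: str) -> str:
    """Hash the Git object header plus content with the given algorithm."""
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.new(algo, header + content).hexdigest()


blob = b"hello world\n"
sha1_name = object_name("blob", blob, "sha1")
# Assumption: NEWHASH is modeled as SHA-256 over the same header + content.
newhash_name = object_name("blob", blob, "sha256")

# Anyone who counter-hashes the same blob independently gets the same pair.
assert (sha1_name, newhash_name) == (
    object_name("blob", blob, "sha1"),
    object_name("blob", blob, "sha256"),
)
```

Blobs need no rewriting because they reference no other objects; trees,
commits, and tags would additionally need their embedded object names
translated before hashing.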

We're in the process of writing this up formally, which will provide a
lot more detail and rationale than this quick stream of thought.  :-)

I'm sure a lot of this has already been discussed on the list.  If so, I
apologize for being repetitive.  Unfortunately, I'm not able to keep up
with the MLs like I used to.

thx,

Jason.


* Re: RFC v3: Another proposed hash function transition plan
  2017-10-02 14:54                         ` Jason Cooper
@ 2017-10-02 16:50                           ` Brandon Williams
  0 siblings, 0 replies; 113+ messages in thread
From: Brandon Williams @ 2017-10-02 16:50 UTC (permalink / raw)
  To: Jason Cooper
  Cc: Jonathan Nieder, Johannes Schindelin, Linus Torvalds, demerphq,
	Junio C Hamano, Git Mailing List, Stefan Beller, Jonathan Tan,
	Jeff King, David Lang, brian m. carlson

On 10/02, Jason Cooper wrote:
> Hi Jonathan,
> 
> On Tue, Sep 26, 2017 at 04:51:58PM -0700, Jonathan Nieder wrote:
> > Johannes Schindelin wrote:
> > > On Tue, 26 Sep 2017, Jason Cooper wrote:
> > >> For my use cases, as a user of git, I have a plan to maintain provable
> > >> integrity of existing objects stored in git under sha1 while migrating
> > >> away from sha1.  The same plan works for migrating away from SHA2 or
> > >> SHA3 when the time comes.
> > >
> > > Please do not make the mistake of taking your use case to be a template
> > > for everybody's use case.
> > 
> > That said, I'm curious at what plan you are alluding to.  Is it
> > something that could benefit others on the list?
> 
> Well, it's just a plan at this point.  There's a lot of other work to
> do in the meantime, and there's no possibility of transitioning until
> the dust has settled on NEWHASH.  :-)
> 
> Given an existing repository that needs to migrate from SHA1 to NEWHASH,
> and maintain backwards compatibility with clients that haven't migrated
> yet, how do we
> 
>   a) perform that migration,
>   b) allow non-updated clients to use the data prior to the switch, and
>   c) maintain provable integrity of the old objects as well as the new.
> 
> The primary method is counter-hashing, which re-uses the blobs and
> creates parallel, deterministic tree, commit, and tag objects using
> NEWHASH for everything up to flag day.  After flag day, only NEWHASH is
> used.  A PGP "transition" key is used to counter-sign the NEWHASH version
> of the old signed tags.  The transition key is not required to be
> different from the existing maintainer's key.
> 
> A critical feature is the ability of entities other than the maintainer
> to migrate to NEWHASH.  For example, let's say that git has fully
> implemented and tested NEWHASH.  linux.git intends to migrate, but it's
> going to take several months (get all the developers herded up).
> 
> In the interim, a security company relying on Linux for its products
> can counter-hash Linus' repo, and continue to do so every time he
> updates his tree.  This shrinks the attack window for an entity (with an
> undisclosed break of SHA1) down to a few minutes to an hour.  Any
> substitution made after that window would be revealed by a later check
> of the counter hashes.
> 
> The deterministic feature is critical here because there is valuable
> integrity and trust built by counter-hashing quickly after publication.
> So once Linux migrates to NEWHASH, the hashes calculated by the security
> company should be identical.  IOW, use the timestamps that are in the
> SHA1 commit objects for the NEWHASH objects.  Which should be obvious,
> but it's worth explicitly mentioning that determinism provides great
> value.
> 
> We're in the process of writing this up formally, which will provide a
> lot more detail and rationale than this quick stream of thought.  :-)
> 
> I'm sure a lot of this has already been discussed on the list.  If so, I
> apologize for being repetitive.  Unfortunately, I'm not able to keep up
> with the MLs like I used to.
> 
> thx,
> 
> Jason.

Given the interests that you've expressed here I'd recommend taking a
look at
https://public-inbox.org/git/20170928044320.GA84719@aiede.mtv.corp.google.com/
which is the current version of the transition plan that the community
has settled on
(https://public-inbox.org/git/xmqqlgkyxgvq.fsf@gitster.mtv.corp.google.com/
shows that it should be merged to 'next' soon).  One neat aspect of
this transition plan is that it doesn't require a flag day but rather
anyone can migrate to the new hash function and still interact with
repositories (via the wire) which are still running SHA1.

-- 
Brandon Williams


* Re: RFC v3: Another proposed hash function transition plan
  2017-10-02 14:00                       ` Jason Cooper
@ 2017-10-02 17:18                         ` Linus Torvalds
  2017-10-02 19:37                           ` Jeff King
  0 siblings, 1 reply; 113+ messages in thread
From: Linus Torvalds @ 2017-10-02 17:18 UTC (permalink / raw)
  To: Jason Cooper
  Cc: Johannes Schindelin, demerphq, Brandon Williams, Junio C Hamano,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Jeff King, David Lang, brian m. carlson

On Mon, Oct 2, 2017 at 7:00 AM, Jason Cooper <jason@lakedaemon.net> wrote:
>
> Ahhh, so if I understand you correctly, you'd prefer SHA-256 over
> SHA3-256 because it's more performant for your usecase?  Well, that's a
> completely different animal than cryptographic suitability.

In almost all loads I've seen, zlib inflate() cost is a bigger deal
than the crypto load. The crypto people talk about cycles per byte,
but the deflate code is what usually takes the page faults and cache
misses etc, and has bad branch prediction. That ends up easily being
tens or thousands of cycles, even for small data.

But it does obviously depend on exactly what you do. The Windows
people saw SHA1 as costly mainly due to the index file (which is just
a "fancy crc", and not even cryptographically important, and where the
cache misses actually happen when doing crypto, not decompressing the
data).

And fsck and big initial checkins can have a very different profile
than most "regular use" profiles. Again, there the crypto happens
first, and takes the cache misses. And the crypto is almost certainly
_much_ cheaper than just the act of loading the index file contents in
the first place. It may show up on profiles fairly clearly, but that's
mostly because crypto is *intensive*, not because crypto takes up most
of the cycles.

End result: honestly, the real cost on almost any load is not crypto
or necessarily even (de)compression, even if those are the things that
show up. It's the cache misses and the "get data into user space"
(whether using "read()" or page faulting). Worrying about cycles per
byte of compression speed is almost certainly missing the real issue.

The people who benchmark cryptography tend to intentionally avoid the
actual real work, because they just want to know the crypto costs. So
when you see numbers like "9 cycles per byte" vs "12 cycles per byte"
and think that it's a big deal - 30% performance difference! -  it's
almost certainly complete garbage. It may be 30%, but it is likely 30%
out of 10% total, meaning that it's almost in the noise for any but
some very special case.
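
The arithmetic behind that last point can be sketched as follows.  The
10% share and 30% slowdown are the illustrative figures from the
paragraph above, not measurements, and overall_slowdown is a
hypothetical helper.

```python
def overall_slowdown(hash_share: float, hash_slowdown: float) -> float:
    """Fractional increase in total runtime when only the hashing part
    (hash_share of the total) becomes hash_slowdown slower."""
    return hash_share * hash_slowdown


# Hashing is 10% of the runtime and the new hash is 30% slower:
# the whole operation gets about 3% slower, not 30%.
total = overall_slowdown(hash_share=0.10, hash_slowdown=0.30)
print(f"{total:.1%}")
```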

                 Linus


* Re: [PATCH v4] technical doc: add a design doc for hash function transition
  2017-09-28  4:43       ` [PATCH v4] technical doc: add a design doc for hash function transition Jonathan Nieder
  2017-09-29  6:06         ` Junio C Hamano
  2017-10-02  9:02         ` Junio C Hamano
@ 2017-10-02 19:23         ` Jason Cooper
  2017-10-03  5:40         ` Junio C Hamano
  2017-10-04  1:44         ` Junio C Hamano
  4 siblings, 0 replies; 113+ messages in thread
From: Jason Cooper @ 2017-10-02 19:23 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Shawn Pearce, Linus Torvalds, Git Mailing List, Stefan Beller,
	bmwill, Jonathan Tan, Jeff King, David Lang, brian m. carlson,
	Masaya Suzuki, demerphq, The Keccak Team, Johannes Schindelin

Hi Jonathan,

On Wed, Sep 27, 2017 at 09:43:21PM -0700, Jonathan Nieder wrote:
> This document describes what a transition to a new hash function for
> Git would look like.  Add it to Documentation/technical/ as the plan
> of record so that future changes can be recorded as patches.
> 
> Also-by: Brandon Williams <bmwill@google.com>
> Also-by: Jonathan Tan <jonathantanmy@google.com>
> Also-by: Stefan Beller <sbeller@google.com>
> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
> ---
> On Thu, Mar 09, 2017 at 11:14 AM, Shawn Pearce wrote:
> > On Mon, Mar 6, 2017 at 4:17 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> 
> >> Thanks for the kind words on what had quite a few flaws still.  Here's
> >> a new draft.  I think the next version will be a patch against
> >> Documentation/technical/.
> >
> > FWIW, I like this approach.
> 
> Okay, here goes.
> 
> Instead of sharding the loose object translation tables by first byte,
> we went for a single table.  It simplifies the design and we need to
> keep the number of loose objects under control anyway.
> 
> We also included a description of the transition plan and tried to
> include a summary of what has been agreed upon so far about the choice
> of hash function.
> 
> Thanks to Junio for reviving the discussion and in particular to Dscho
> for pushing this forward and making the missing pieces clearer.
> 
> Thoughts of all kinds welcome, as always.
> 
>  Documentation/Makefile                             |   1 +
>  .../technical/hash-function-transition.txt         | 797 +++++++++++++++++++++
>  2 files changed, 798 insertions(+)
>  create mode 100644 Documentation/technical/hash-function-transition.txt
> 
...
> diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
> new file mode 100644
> index 0000000000..417ba491d0
> --- /dev/null
> +++ b/Documentation/technical/hash-function-transition.txt
> @@ -0,0 +1,797 @@
> +Git hash function transition
> +============================
> +
> +Objective
> +---------
> +Migrate Git from SHA-1 to a stronger hash function.
> +
...
> +Goals
> +-----
> +Where NewHash is a strong 256-bit hash function to replace SHA-1 (see
> +"Selection of a New Hash", below):

Could we clarify and say "a strong hash function with 256-bit output"?

...
> +Overview
> +--------
> +We introduce a new repository format extension. Repositories with this
> +extension enabled use NewHash instead of SHA-1 to name their objects.
> +This affects both object names and object content --- both the names
> +of objects and all references to other objects within an object are
> +switched to the new hash function.
> +
> +NewHash repositories cannot be read by older versions of Git.
> +
> +Alongside the packfile, a NewHash repository stores a bidirectional
> +mapping between NewHash and SHA-1 object names. The mapping is generated
> +locally and can be verified using "git fsck". Object lookups use this
> +mapping to allow naming objects using their SHA-1 and NewHash names
> +interchangeably.

nit: Are we presuming that abbreviated hashes won't collide?  Or does the
user need to specify which hash type?

> +Object format
> +~~~~~~~~~~~~~
> +The content as a byte sequence of a tag, commit, or tree object named
> +by sha1 and newhash differ because an object named by newhash-name refers to
> +other objects by their newhash-names and an object named by sha1-name
> +refers to other objects by their sha1-names.
> +
> +The newhash-content of an object is the same as its sha1-content, except
> +that objects referenced by the object are named using their newhash-names
> +instead of sha1-names. Because a blob object does not refer to any
> +other object, its sha1-content and newhash-content are the same.
> +
> +The format allows round-trip conversion between newhash-content and
> +sha1-content.

It would be nice here to explicitly mention deterministic hashing,
meaning that anyone who converts a commit from sha1 to newhash shall get
the same newhash.
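
To illustrate the round-trip property, here is a rough sketch that
rewrites only the "tree" and "parent" header lines of a commit using a
translation table and leaves every other byte (author, committer,
timestamps, message) untouched.  rewrite_commit_refs and the table
layout are hypothetical; a real converter would also have to handle
tags, trees, and embedded mergetag content, which this sketch ignores.

```python
def rewrite_commit_refs(content: bytes, table: dict[bytes, bytes]) -> bytes:
    """Replace object names on 'tree' and 'parent' header lines using table.

    The same function converts in either direction depending on which
    mapping (sha1 -> newhash or newhash -> sha1) is passed in.
    """
    headers, sep, body = content.partition(b"\n\n")
    out = []
    for line in headers.split(b"\n"):
        key, _, value = line.partition(b" ")
        if key in (b"tree", b"parent"):
            line = key + b" " + table[value]
        out.append(line)  # all other header lines are kept byte-for-byte
    return b"\n".join(out) + sep + body


# Fake names: 40 hex digits for sha1, 64 for the assumed 256-bit newhash.
to_new = {b"a" * 40: b"b" * 64, b"c" * 40: b"d" * 64}
to_old = {new: old for old, new in to_new.items()}

sha1_commit = (
    b"tree " + b"a" * 40 + b"\n"
    b"parent " + b"c" * 40 + b"\n"
    b"author A U Thor <author@example.com> 1506919200 +0000\n"
    b"\n"
    b"example message\n"
)
new_commit = rewrite_commit_refs(sha1_commit, to_new)
assert rewrite_commit_refs(new_commit, to_old) == sha1_commit
```

Because nothing but the referenced names changes, two people converting
the same commit with the same table produce identical bytes and hence
the same newhash.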

> +
> +Object storage
> +~~~~~~~~~~~~~~
> +Loose objects use zlib compression and packed objects use the packed
> +format described in Documentation/technical/pack-format.txt, just like
> +today. The content that is compressed and stored uses newhash-content
> +instead of sha1-content.
> +
> +Pack index
> +~~~~~~~~~~
> +Pack index (.idx) files use a new v3 format that supports multiple
> +hash functions. They have the following format (all integers are in
> +network byte order):
> +
> +- A header appears at the beginning and consists of the following:
> +  - The 4-byte pack index signature: '\377t0c'
> +  - 4-byte version number: 3
> +  - 4-byte length of the header section, including the signature and
> +    version number
> +  - 4-byte number of objects contained in the pack
> +  - 4-byte number of object formats in this pack index: 2
> +  - For each object format:
> +    - 4-byte format identifier (e.g., 'sha1' for SHA-1)

This seems a little rough to me.  Maybe it would be better to have a 4
byte field where 0x01 = SHA-1, 0x02 = NEWHASH?
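
For comparison, a small sketch of the two encodings being discussed for
the 4-byte format identifier: the four ASCII characters from the draft
versus a numeric code.  The numeric values 1 and 2 are only the ones
floated in the comment above, and both helper names are hypothetical.

```python
import struct

# Hypothetical numeric codes, taken from the suggestion above.
NUMERIC_IDS = {"sha1": 1, "newhash": 2}


def encode_format_id_ascii(name: str) -> bytes:
    """Draft style: the identifier is four ASCII bytes, e.g. b'sha1'."""
    raw = name.encode("ascii")
    if len(raw) != 4:
        raise ValueError("ASCII format identifiers must be exactly 4 bytes")
    return struct.pack(">4s", raw)


def encode_format_id_numeric(name: str) -> bytes:
    """Alternative style: a 4-byte big-endian (network order) integer."""
    return struct.pack(">I", NUMERIC_IDS[name])


print(encode_format_id_ascii("sha1"))    # four readable bytes
print(encode_format_id_numeric("sha1"))  # four bytes holding the integer 1
```

Either way the field is a fixed 4 bytes; the ASCII form is
self-describing in a hex dump, while the numeric form needs a registry
of assigned values.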

> +Reading an object's sha1-content
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +The sha1-content of an object can be read by converting all newhash-names
> +its newhash-content references to sha1-names using the translation table.
> +
> +Fetch
> +~~~~~
> +Fetching from a SHA-1 based server requires translating between SHA-1
> +and NewHash based representations on the fly.
> +
> +SHA-1s named in the ref advertisement that are present on the client
> +can be translated to NewHash and looked up as local objects using the
> +translation table.
> +
> +Negotiation proceeds as today. Any "have"s generated locally are
> +converted to SHA-1 before being sent to the server, and SHA-1s
> +mentioned by the server are converted to NewHash when looking them up
> +locally.

By "converted", do you mean "looked up in the table" or "look up
newhash, re-calculate sha1, send" ?  I presume you mean the former, but
it would be good to clarify.

> +
> +After negotiation, the server sends a packfile containing the
> +requested objects. We convert the packfile to NewHash format using
> +the following steps:
> +
> +1. index-pack: inflate each object in the packfile and compute its
> +   SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
> +   objects the client has locally. These objects can be looked up
> +   using the translation table and their sha1-content read as
> +   described above to resolve the deltas.
> +2. topological sort: starting at the "want"s from the negotiation
> +   phase, walk through objects in the pack and emit a list of them,
> +   excluding blobs, in reverse topologically sorted order, with each
> +   object coming later in the list than all objects it references.
> +   (This list only contains objects reachable from the "wants". If the
> +   pack from the server contained additional extraneous objects, then
> +   they will be discarded.)
> +3. convert to newhash: open a new (newhash) packfile. Read the topologically
> +   sorted list just generated. For each object, inflate its
> +   sha1-content, convert to newhash-content, and write it to the newhash
> +   pack. Record the new sha1<->newhash mapping entry for use in the idx.
> +4. sort: reorder entries in the new pack to match the order of objects
> +   in the pack the server generated and include blobs. Write a newhash idx
> +   file
> +5. clean up: remove the SHA-1 based pack file, index, and
> +   topologically sorted list obtained from the server in steps 1
> +   and 2.

How are signed tags (against sha1 commits) to be handled?  See below for
further thoughts.
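
Step 2 of the quoted list can be sketched as a post-order walk from the
"want"s, so that every object is emitted after everything it
references.  This is only an illustration: referenced_first_order, the
refs mapping (object name to the names it references), and is_blob are
hypothetical stand-ins for whatever index-pack would actually provide.

```python
from typing import Callable, Iterable


def referenced_first_order(
    wants: Iterable[str],
    refs: dict[str, list[str]],
    is_blob: Callable[[str], bool],
) -> list[str]:
    """Return non-blob objects reachable from wants, each one listed
    after all the objects it references (iterative post-order DFS)."""
    order: list[str] = []
    done: set[str] = set()
    stack = [(name, False) for name in reversed(list(wants))]
    while stack:
        name, expanded = stack.pop()
        if name in done or is_blob(name):
            continue  # blobs are excluded; finished objects are not repeated
        if expanded:
            done.add(name)
            order.append(name)  # all references have been emitted by now
            continue
        stack.append((name, True))
        for child in refs.get(name, ()):
            if child not in done:
                stack.append((child, False))
    return order


refs = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": ["blobA"],
    "tree1": ["blobA"],
    "extra": [],  # in the pack but not reachable from the wants
}
order = referenced_first_order(["commit2"], refs, lambda n: n.startswith("blob"))
assert order.index("tree1") < order.index("commit1") < order.index("commit2")
assert "extra" not in order
```

Objects that are in the pack but unreachable from the "want"s never
enter the list, which matches the note that extraneous objects are
discarded.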

> +Signed Tags
> +~~~~~~~~~~~
> +We add a new field "gpgsig-newhash" to the tag object format to allow
> +signing tags without relying on SHA-1. Its signed payload is the
> +newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP
> +SIGNATURE-----" delimited in-body signature removed.
> +
> +This means tags can be signed
> +1. using SHA-1 only, as in existing signed tag objects
> +2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body
> +   signature.
> +3. using only NewHash, by only using the gpgsig-newhash field.

To be clear here, "gpgsig" = SHA-1, "gpgsig-SHA-256" = SHA-256?

> +Caveats
> +-------
> +Invalid objects
> +~~~~~~~~~~~~~~~
> +The conversion from sha1-content to newhash-content retains any
> +brokenness in the original object (e.g., tree entry modes encoded with
> +leading 0, tree objects whose paths are not sorted correctly, and
> +commit objects without an author or committer). This is a deliberate
> +feature of the design to allow the conversion to round-trip.

Ah, so this is part of the deterministic hashing.

> +Object names on the command line
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +To support the transition (see Transition plan below), this design
> +supports four different modes of operation:
> +
> + 1. ("dark launch") Treat object names input by the user as SHA-1 and
> +    convert any object names written to output to SHA-1, but store
> +    objects using NewHash.  This allows users to test the code with no
> +    visible behavior change except for performance.  This allows
> +    allows running even tests that assume the SHA-1 hash function, to

nit:  s/allows allows/allows/

> +    sanity-check the behavior of the new mode.
> +
> + 2. ("early transition") Allow both SHA-1 and NewHash object names in
> +    input. Any object names written to output use SHA-1. This allows
> +    users to continue to make use of SHA-1 to communicate with peers
> +    (e.g. by email) that have not migrated yet and prepares for mode 3.
> +
> + 3. ("late transition") Allow both SHA-1 and NewHash object names in
> +    input. Any object names written to output use NewHash. In this
> +    mode, users are using a more secure object naming method by
> +    default.  The disruption is minimal as long as most of their peers
> +    are in mode 2 or mode 3.
> +
> + 4. ("post-transition") Treat object names input by the user as
> +    NewHash and write output using NewHash. This is safer than mode 3
> +    because there is less risk that input is incorrectly interpreted
> +    using the wrong hash function.

Surely we can error-out if the provided object name is ambiguous?
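
A sketch of what erroring out on ambiguity could look like when both
SHA-1 and NewHash names are accepted as input (modes 2 and 3): a prefix
is looked up against both columns of the translation table and rejected
if it identifies more than one object.  resolve_object_name and the
table layout are hypothetical, and the short names are fake.

```python
class AmbiguousObjectName(Exception):
    """Raised when a (possibly abbreviated) name matches several objects."""


def resolve_object_name(prefix: str, sha1_to_newhash: dict[str, str]) -> str:
    """Return the newhash name of the single object whose sha1 name or
    newhash name starts with prefix."""
    candidates = set()
    for sha1_name, newhash_name in sha1_to_newhash.items():
        if sha1_name.startswith(prefix) or newhash_name.startswith(prefix):
            candidates.add(newhash_name)  # identify objects by newhash name
    if not candidates:
        raise KeyError(prefix)
    if len(candidates) > 1:
        raise AmbiguousObjectName(f"{prefix!r} matches {len(candidates)} objects")
    return candidates.pop()


table = {"aaaa1111": "bbbb2222", "aabb3333": "cccc4444"}
print(resolve_object_name("aaaa", table))  # found via its sha1 name
print(resolve_object_name("bbbb", table))  # same object via its newhash name
```

A prefix that matches both names of one object is not treated as
ambiguous here, since it still identifies a single object.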

> +Selection of a New Hash
> +-----------------------
> +In early 2005, around the time that Git was written,  Xiaoyun Wang,
> +Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1
> +collisions in 2^69 operations. In August they published details.
> +Luckily, no practical demonstrations of a collision in full SHA-1 were
> +published until 10 years later, in 2017.
> +
> +The hash function NewHash to replace SHA-1 should be stronger than
> +SHA-1 was: we would like it to be trustworthy and useful in practice
> +for at least 10 years.
> +
> +Some other relevant properties:
> +
> +1. A 256-bit hash (long enough to match common security practice; not
> +   excessively long to hurt performance and disk usage).
> +
> +2. High quality implementations should be widely available (e.g. in
> +   OpenSSL).
> +
> +3. The hash function's properties should match Git's needs (e.g. Git
> +   requires collision and 2nd preimage resistance and does not require
> +   length extension resistance).

Based on recent discussion, I would add here that the candidate hash
should have had sufficient review, such that the likelihood of overnight
catastrophic failure is greatly reduced.  This gives git and git users
time to migrate away from the now weakening hash function.

> +
> +4. As a tiebreaker, the hash should be fast to compute (fortunately
> +   many contenders are faster than SHA-1).
> +
> +Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16,
> +K12, and BLAKE2bp-256.

If anyone is counting votes, I prefer either SHA-512/256 or
BLAKE2bp-256.  But as I've mentioned elsewhere, it's only a preference.

> +
> +Transition plan
> +---------------
...
> +Once a critical mass of users have upgraded to a version of Git that
> +can verify NewHash signatures and have converted their existing
> +repositories to support verifying them, we can add support for a
> +setting to generate only NewHash signatures. This is expected to be at
> +least a year later.
> +
> +That is also a good moment to advertise the ability to convert
> +repositories to use NewHash only, stripping out all SHA-1 related
> +metadata. This improves performance by eliminating translation
> +overhead and security by avoiding the possibility of accidentally
> +relying on the safety of SHA-1.

There is a caveat here regarding old signatures.  Those have value and
shouldn't be lost.  Repos needing to prove the validity of the old
sha1-only signatures should counter-hash all objects, and then
counter-sign the corresponding newhash version of the original sha1-only
tags.

Reviewed-by: Jason Cooper <jason@lakedaemon.net>

thx,

Jason.


* Re: RFC v3: Another proposed hash function transition plan
  2017-10-02 17:18                         ` Linus Torvalds
@ 2017-10-02 19:37                           ` Jeff King
  0 siblings, 0 replies; 113+ messages in thread
From: Jeff King @ 2017-10-02 19:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason Cooper, Johannes Schindelin, demerphq, Brandon Williams,
	Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller,
	Jonathan Tan, David Lang, brian m. carlson

On Mon, Oct 02, 2017 at 10:18:02AM -0700, Linus Torvalds wrote:

> On Mon, Oct 2, 2017 at 7:00 AM, Jason Cooper <jason@lakedaemon.net> wrote:
> >
> > Ahhh, so if I understand you correctly, you'd prefer SHA-256 over
> > SHA3-256 because it's more performant for your usecase?  Well, that's a
> > completely different animal than cryptographic suitability.
> 
> In almost all loads I've seen, zlib inflate() cost is a bigger deal
> than the crypto load. The crypto people talk about cycles per byte,
> but the deflate code is what usually takes the page faults and cache
> misses etc, and has bad branch prediction. That ends up easily being
> tens or thousands of cycles, even for small data.

If anyone is interested in the user-visible effects of slower crypto, I
think there are some numbers in 8325e43b82 (Makefile: add DC_SHA1 knob,
2017-03-16). I don't know how SHA-256 compares to sha1dc exactly, but
certainly the latter is a lot slower than normal sha1.

The only real-world case I found with a noticeable slowdown was
index-pack.  Which in the worst case is roughly the same operation as
"git fsck" (inflate and compute the sha1 on every byte), but people tend
to actually do it a lot more often.

And it really _is_ slower for real-world operations; the CPU for
computing the sha1 of an incoming clone of linux.git jumped from ~3
minutes to ~6 minutes.  But I don't think we've seen a lot of
complaints, probably because that time is lumped in with "time to
transfer a gigabyte of data", so unless you're on a slow machine with a
fast connection, you don't even really notice.

For day-to-day operations in a repository, I never came up with a good
example where the speed difference mattered. I think Dscho's giant-index
example is an outlier and the right answer there is not "pick a fast
crypto algorithm" but "stop using a slow crypto algorithm as a
checksum" (and also, stop routinely reading and writing 400MB for
day-to-day operations).

-Peff


* Re: [PATCH v4] technical doc: add a design doc for hash function transition
  2017-09-29 17:34           ` Jonathan Nieder
  2017-10-02  8:25             ` Junio C Hamano
@ 2017-10-02 19:41             ` Jason Cooper
  1 sibling, 0 replies; 113+ messages in thread
From: Jason Cooper @ 2017-10-02 19:41 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Junio C Hamano, Shawn Pearce, Linus Torvalds, Git Mailing List,
	Stefan Beller, bmwill, Jonathan Tan, Jeff King, David Lang,
	brian m. carlson, Masaya Suzuki, demerphq, The Keccak Team,
	Johannes Schindelin

On Fri, Sep 29, 2017 at 10:34:13AM -0700, Jonathan Nieder wrote:
> Junio C Hamano wrote:
> > Jonathan Nieder <jrnieder@gmail.com> writes:
...
> > If it is a goal to eventually be able to lose SHA-1 compatibility
> > metadata from the objects, then we might want to remove SHA-1 based
> > signature bits (e.g. PGP trailer in signed tag, gpgsig header in the
> > commit object) from NewHash contents, and instead have them stored
> > in a side "metadata" table, only to be used while converting back.
> > I dunno if that is desirable.
> 
> I don't consider that desirable.
> 
> A SHA-1 based signature is still of historical interest even if my
> centuries-newer version of Git is not able to verify it.

Agreed, even a signature made by a now exposed and revoked key still has
validity.  Especially in a commit or merge.  We know it was made prior
to the key being compromised / revoked.

This is assuming that the keyholder can definitively say "Don't trust
signatures from this key after this date/time+0000".  And the signature
in question is in the git history prior to that cut off.

Tags are a different animal because they can be added at any time and
aren't directly incorporated into the history.

thx,

Jason.


* Re: [PATCH v4] technical doc: add a design doc for hash function transition
  2017-09-28  4:43       ` [PATCH v4] technical doc: add a design doc for hash function transition Jonathan Nieder
                           ` (2 preceding siblings ...)
  2017-10-02 19:23         ` Jason Cooper
@ 2017-10-03  5:40         ` Junio C Hamano
  2017-10-03 13:08           ` Jason Cooper
  2017-10-04  1:44         ` Junio C Hamano
  4 siblings, 1 reply; 113+ messages in thread
From: Junio C Hamano @ 2017-10-03  5:40 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Shawn Pearce, Linus Torvalds, Git Mailing List, Stefan Beller,
	bmwill, Jonathan Tan, Jeff King, David Lang, brian m. carlson,
	Masaya Suzuki, demerphq, The Keccak Team, Johannes Schindelin

Jonathan Nieder <jrnieder@gmail.com> writes:

> +Signed Tags
> +~~~~~~~~~~~
> +We add a new field "gpgsig-newhash" to the tag object format to allow
> +signing tags without relying on SHA-1. Its signed payload is the
> +newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP
> +SIGNATURE-----" delimited in-body signature removed.
> +
> +This means tags can be signed
> +1. using SHA-1 only, as in existing signed tag objects
> +2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body
> +   signature.
> +3. using only NewHash, by only using the gpgsig-newhash field.

I have the same issue with signed commit.

The signed parts for SHA-1 contents exclude the in-body signature
(obviously), and all the headers, including gpgsig-newhash, which is not
known to our old clients, are included.  The signed parts for NewHash
contents exclude the in-body signature and the gpgsig-newhash header,
but include all other headers.  I somehow feel that we should just reserve
gpgsig-* to prepare for the day when we introduce newhash2 and later,
and exclude all of them from the computation.  Treat the difference
between how SHA-1 contents exclude _only_ what they know about and how
NewHash contents exclude _all_ possible signatures just like the
difference between where SHA-1 and NewHash contents have the
signature.  That is, yes, we didn't know better when we designed
SHA-1 contents, but now we know better and are correcting the
mistakes by moving the signature from in-body tail to a header, and
by excluding anything gpgsig-*, not just the known ones.
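
A rough sketch of the payload computation suggested above for NewHash
contents: drop every header whose name starts with "gpgsig-" (including
its continuation lines, which begin with a space) and cut the in-body
signature off the end.  newhash_signed_payload is a hypothetical helper
and the tag below is fabricated for illustration only.

```python
PGP_BEGIN = b"-----BEGIN PGP SIGNATURE-----"


def newhash_signed_payload(content: bytes) -> bytes:
    """Strip all gpgsig-* headers and any in-body PGP signature."""
    headers, sep, body = content.partition(b"\n\n")
    kept = []
    skipping = False
    for line in headers.split(b"\n"):
        if line.startswith(b" "):
            # Continuation line: it belongs to the preceding header.
            if not skipping:
                kept.append(line)
            continue
        skipping = line.startswith(b"gpgsig-")
        if not skipping:
            kept.append(line)
    cut = body.find(b"\n" + PGP_BEGIN)
    if cut != -1:
        body = body[: cut + 1]  # keep the newline that ends the message
    return b"\n".join(kept) + sep + body


tag = (
    b"object " + b"b" * 64 + b"\n"
    b"type commit\n"
    b"tag v1.0\n"
    b"tagger T Agger <tagger@example.com> 1507000000 +0000\n"
    b"gpgsig-newhash -----BEGIN PGP SIGNATURE-----\n"
    b" (signature data)\n"
    b" -----END PGP SIGNATURE-----\n"
    b"\n"
    b"release v1.0\n"
    b"-----BEGIN PGP SIGNATURE-----\n"
    b"(signature data)\n"
    b"-----END PGP SIGNATURE-----\n"
)
payload = newhash_signed_payload(tag)
assert b"gpgsig-" not in payload and PGP_BEGIN not in payload
```

Matching on the "gpgsig-" prefix rather than on one known field name is
what would let a future gpgsig-newhash2 header be excluded without
changing the payload rule.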

> +Mergetag embedding
> +~~~~~~~~~~~~~~~~~~
> +The mergetag field in the sha1-content of a commit contains the
> +sha1-content of a tag that was merged by that commit.
> +
> +The mergetag field in the newhash-content of the same commit contains the
> +newhash-content of the same tag.

OK.  

We do not have a tool that extracts them and creates a tag object,
but if such a tool is invented in the future, it would only have to
worry about newhash content, as it would be a local operation.
Makes sense.

> +Submodules
> +~~~~~~~~~~
> +To convert recorded submodule pointers, you need to have the converted
> +submodule repository in place. The translation table of the submodule
> +can be used to look up the new hash.

OK, I earlier commented on a paragraph that I couldn't tell what it
was talking about, but this is a lot more understandable.  Perhaps
the earlier one can be removed?

We saw earlier what happens during "fetch".  This seems to hint that
we would need to do a "recursive" fetch in the bottom-up direction,
but without fetching the superproject, you wouldn't know what submodules
are needed and from where, so there is a bit of chicken-and-egg problem
we need to address, as we further make the design more detailed.

> +Loose objects and unreachable objects
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ...
> +"git gc --auto" currently waits for there to be 50 packs present
> +before combining packfiles. Packing loose objects more aggressively
> +may cause the number of pack files to grow too quickly. This can be
> +mitigated by using a strategy similar to Martin Fick's exponential
> +rolling garbage collection script:
> +https://gerrit-review.googlesource.com/c/gerrit/+/35215

Yes, concatenating into the latest pack that is still small may be a
reasonable way to do a "quick GC whose only purpose is to keep the
number of loose objects down", as there won't be many good chances to
create good deltas anyway until you have blobs and trees in sufficient
numbers of different versions.

> +To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be
> +combined under certain circumstances. If "gc.garbageTtl" is set to
> +greater than one day, then packs created within a single calendar day,
> +UTC, can be coalesced together. The resulting packfile would have an
> +mtime before midnight on that day, so this makes the effective maximum
> +ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day,
> +then we divide the calendar day into intervals one-third of that ttl
> +in duration. Packs created within the same interval can be coalesced
> +together. The resulting packfile would have an mtime before the end of
> +the interval, so this makes the effective maximum ttl equal to the
> +garbageTtl * 4/3.

OK.  

Is the use of mtime essential?  Because packs are "write once,
read-only from then on", would a timestamp written somewhere in the
header or the trailer of the file, if one existed, work equally
well?  Not a strong objection, but a mild suggestion that not
relying on mtime may be a good idea (it will keep an accidental /
unintended "touch" from keeping garbage alive longer than you want).
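
To make the coalescing rule quoted above concrete, here is a minimal
sketch of the interval arithmetic.  The function names are
hypothetical, and treating a ttl of exactly one day like the "greater
than one day" case is an assumption, since the quoted text does not
cover it.  The sketch only computes which window a pack falls in and
the resulting worst-case lifetime; it does not touch any pack files.

```python
from datetime import datetime, timedelta, timezone

ONE_DAY = timedelta(days=1)


def coalesce_window(created, garbage_ttl):
    """Return (start, end) of the UTC window whose garbage packs may be merged.

    `created` must be a timezone-aware datetime.
    """
    created = created.astimezone(timezone.utc)
    midnight = created.replace(hour=0, minute=0, second=0, microsecond=0)
    if garbage_ttl >= ONE_DAY:  # assumption: exactly one day counts as "long"
        return midnight, midnight + ONE_DAY
    # Short ttl: split the calendar day into intervals of ttl / 3.
    width = garbage_ttl / 3
    index = (created - midnight) // width  # whole intervals since midnight
    start = midnight + index * width
    # The last interval of a day is cut off at midnight.
    return start, min(start + width, midnight + ONE_DAY)


def effective_max_ttl(garbage_ttl):
    """Worst-case lifetime of garbage once packs in one window are merged."""
    if garbage_ttl >= ONE_DAY:
        return garbage_ttl + ONE_DAY
    return garbage_ttl * 4 / 3
```

With a 6 hour ttl the windows are 2 hours wide, so a pack created at
13:08 UTC lands in the 12:00 to 14:00 window and garbage can live for
up to 8 hours.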

> +The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack
> +index. More generally, that field indicates where a pack came from:
> +
> + - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network
> + - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight
> +   "gc --auto" operation
> + - 3 (PACK_SOURCE_GC) for a pack created by a full gc
> + - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage
> +   discovered by gc
> + - 5 (PACK_SOURCE_INSERT) for locally created objects that were
> +   written directly to a pack file, e.g. from "git add ."
> +
> +This information can be useful for debugging and for "gc --auto" to
> +make appropriate choices about which packs to coalesce.

Would this be the direction we want to take to reduce the number of
auxiliary files like *.keep, *.promised, etc., or do we not envision
these to be useful for anything other than "gc"?
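
For reference, the PSRC values listed in the quoted text could be
modelled as below.  This is only an illustration: the numeric values
and names come from the quoted list, while the Python class, the
describe() helper and its handling of unknown values are assumptions
of this sketch, not part of the proposed index format.

```python
from enum import IntEnum


class PackSource(IntEnum):
    """Values of the proposed PSRC field, as listed in the quoted text."""
    RECEIVE = 1              # pack received over the network
    AUTO = 2                 # pack created by a lightweight "gc --auto"
    GC = 3                   # pack created by a full gc
    UNREACHABLE_GARBAGE = 4  # potential garbage discovered by gc
    INSERT = 5               # locally created objects written directly to a pack


def describe(psrc):
    """Return a printable name for a PSRC value (hypothetical helper)."""
    try:
        return "PACK_SOURCE_" + PackSource(psrc).name
    except ValueError:
        # Assumption: unknown values are reported rather than rejected.
        return "unknown (%d)" % psrc
```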

> +Caveats
> +-------
> +Invalid objects
> +...
> +More profoundly broken objects (e.g., a commit with a truncated "tree"
> +header line) cannot be converted but were not usable by current Git
> +anyway.

Fair enough.

> +Shallow clone and submodules
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Because it requires all referenced objects to be available in the
> +locally generated translation table, this design does not support
> +shallow clone or unfetched submodules. Protocol improvements might
> +allow lifting this restriction.

OK, I think it is sensible to leave them outside the scope at the
moment.  All we need is a reliable way to learn the NewHash name of
the objects immediately beyond the cut-off points, but how to ensure
that reliability without trusting the remote too much will have to
become a huge discussion.

> +Alternates
> +~~~~~~~~~~
> +For the same reason, a newhash repository cannot borrow objects from a
> +sha1 repository using objects/info/alternates or
> +$GIT_ALTERNATE_OBJECT_REPOSITORIES.

Correct.  In addition, if the alternate has already fully migrated
away from SHA-1 compatibility, we can only use it for local operation.

    ... goes back and thinks

No, we cannot use such an alternate even for local operation.  So a
newhash repository cannot borrow objects from a SHA-1 repository,
nor, if it wants to retain SHA-1 compatibility itself, from a
newhash repository that has lost SHA-1 compatibility.

Which again is "fair enough", I'd say.
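
The borrowing rule as restated above can be written down as a small
predicate.  This is a sketch under assumptions: the Repo class and
can_borrow_from() are hypothetical, the field names merely echo the
objectFormat and compatObjectFormat settings mentioned in the design,
and the SHA-1-borrows-from-newhash direction (rejected here by the
same storage-format check) is not discussed in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Repo:
    object_format: str                          # hash used for storage
    compat_object_format: Optional[str] = None  # extra hash kept for compatibility


def can_borrow_from(borrower, alternate):
    """Can `borrower` list `alternate` in objects/info/alternates?"""
    # Objects are stored under different names, so nothing can be shared.
    if borrower.object_format != alternate.object_format:
        return False
    # A borrower that still wants SHA-1 names needs an alternate that
    # still carries the same compatibility data.
    if (borrower.compat_object_format is not None
            and alternate.compat_object_format != borrower.compat_object_format):
        return False
    return True
```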

> +git notes
> +~~~~~~~~~
> +The "git notes" tool annotates objects using their sha1-name as key.
> +This design does not describe a way to migrate notes trees to use
> +newhash-names. That migration is expected to happen separately (for
> +example using a file at the root of the notes tree to describe which
> +hash it uses).

To be consistent with the remainder of the design, I think they
should also be translated to NewHash, but punting it is OK to limit
the scope of the initial migration.

> +Server-side cost
> +~~~~~~~~~~~~~~~~
> +Until Git protocol gains NewHash support, using NewHash based storage
> +on public-facing Git servers is strongly discouraged. Once Git
> +protocol gains NewHash support, NewHash based servers are likely not
> +to support SHA-1 compatibility, to avoid what may be a very expensive
> +hash reencode during clone and to encourage peers to modernize.

I doubt that the first sentence is needed.  We as git-core community
will not help people to run Git service backed by NewHash storage
that talks SHA-1 over the wire, by limiting the scope to "NewHash
Git fetching from SHA-1 Git" and "NewHash Git pushing to SHA-1 Git"
and not including the other two combinations.  That may be worth
saying here.  Masochist server operators are still welcome to build
and operate such a service and we don't really care.  It's not our
business.

> +The design described here allows fetches by SHA-1 clients of a
> +personal NewHash repository because it's not much more difficult than
> +allowing pushes from that repository.

Does the design described here really allow that?

I thought what I read was "everybody talks SHA-1 over the wire, and
those who want to use NewHash convert".  So a user may be able to
push from a personal NewHash repository to a personal SHA-1
repository (to simulate a fetch going in the reverse direction).

In any case, I do not think I saw conversion issues discussed for a
fetch from a NewHash repository earlier in the document, where
conversion considerations for the other two modes (fetch to NewHash,
and push from NewHash) were reasonably well described.  If we are to
allow this third mode, we'd need to make sure "because it's not much
more difficult" is true.

> This support needs to be guarded
> +by a configuration option --- servers like git.kernel.org that serve a
> +large number of clients would not be expected to bear that cost.

Yes, of course.  And if these 6 lines are not an unintended leftover
from an earlier round of the design that we wanted to remove but
forgot to, then the first paragraph, whose validity I doubted,
starts to make sense.

> +Meaning of signatures
> +~~~~~~~~~~~~~~~~~~~~~
> +The signed payload for signed commits and tags does not explicitly
> +name the hash used to identify objects. If some day Git adopts a new
> +hash function with the same length as the current SHA-1 (40
> +hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the
> +intent behind the PGP signed payload in an object signature is
> +unclear:
> +
> +	object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
> +	type commit
> +	tag v2.12.0
> +	tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800
> +
> +	Git 2.12
> +
> +Does this mean Git v2.12.0 is the commit with sha1-name
> +e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
> +new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?
> +
> +Fortunately NewHash and SHA-1 have different lengths. If Git starts
> +using another hash with the same length to name objects, then it will
> +need to change the format of signed payloads using that hash to
> +address this issue.

This is not just signatures, is it?  The reference to parent commits
and its tree in a commit object would also have ambiguity between
SHA-1 and new-40-digit-hash.  And the "no mixed repository" rule
resolved that for us---isn't that sufficient for the signed tag (or
commit), too?  If such a signed-tag appears in a SHA-1 content of a
tag, then the "object" reference is made with SHA-1.  If the tag is
in NewHash40 content, "object" reference is made with NewHash40, no?
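
The length argument from the quoted section can be illustrated with a
short sketch.  It guesses the hash from the length of the name on the
"object" line alone; the function name and the "newhash" label are
placeholders.  It also shows the limitation under discussion: two
hashes of the same length cannot be told apart this way, so the
format of the containing content has to decide.

```python
import re

# Hex lengths named in the quoted text; "newhash" is a placeholder label.
HASH_BY_HEX_LENGTH = {40: "sha1", 64: "newhash"}


def hash_of_object_line(payload):
    """Guess which hash names the object in a tag payload, by length only."""
    match = re.match(r"object ([0-9a-f]+)\n", payload)
    if match is None:
        raise ValueError("payload does not start with an object line")
    name = match.group(1)
    if len(name) not in HASH_BY_HEX_LENGTH:
        raise ValueError("unrecognized object name length: %d" % len(name))
    return HASH_BY_HEX_LENGTH[len(name)]


payload = (
    "object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7\n"
    "type commit\n"
    "tag v2.12.0\n"
)
print(hash_of_object_line(payload))  # prints "sha1"
```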

> +Object names on the command line
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +To support the transition (see Transition plan below), this design
> +supports four different modes of operation:
> +
> + 1. ("dark launch") Treat object names input by the user as SHA-1 and
> +    convert any object names written to output to SHA-1, but store
> +    objects using NewHash.  This allows users to test the code with no
> +    visible behavior change except for performance.  This allows
> +    running even tests that assume the SHA-1 hash function, to
> +    sanity-check the behavior of the new mode.

Oooooh.  That's ambitious.

> + 2. ("early transition") Allow both SHA-1 and NewHash object names in
> +    input. Any object names written to output use SHA-1. This allows
> +    users to continue to make use of SHA-1 to communicate with peers
> +    (e.g. by email) that have not migrated yet and prepares for mode 3.

This and others also make sense.

> +Transition plan
> +---------------
> +Some initial steps can be implemented independently of one another:
> +...
> +- introducing index v3

Just making sure; this is pack .idx v3?

> +The infrastructure supporting fetch also allows converting an existing
> +repository. In converted repositories and new clones, end users can
> +gain support for the new hash function without any visible change in
> +behavior (see "dark launch" in the "Object names on the command line"
> +section). In particular this allows users to verify NewHash signatures
> +on objects in the repository, and it should ensure the transition code
> +is stable in production in preparation for using it more widely.
> +
> +Over time projects would encourage their users to adopt the "early
> +transition" and then "late transition" modes to take advantage of the
> +new, more futureproof NewHash object names.
> +
> +When objectFormat and compatObjectFormat are both set, commands
> +generating signatures would generate both SHA-1 and NewHash signatures
> +by default to support both new and old users.
> +
> +In projects using NewHash heavily, users could be encouraged to adopt
> +the "post-transition" mode to avoid accidentally making implicit use
> +of SHA-1 object names.
> +
> +Once a critical mass of users have upgraded to a version of Git that
> +can verify NewHash signatures and have converted their existing
> +repositories to support verifying them, we can add support for a
> +setting to generate only NewHash signatures. This is expected to be at
> +least a year later.
> +
> +That is also a good moment to advertise the ability to convert
> +repositories to use NewHash only, stripping out all SHA-1 related
> +metadata. This improves performance by eliminating translation
> +overhead and security by avoiding the possibility of accidentally
> +relying on the safety of SHA-1.
> +
> +Updating Git's protocols to allow a server to specify which hash
> +functions it supports is also an important part of this transition. It
> +is not discussed in detail in this document but this transition plan
> +assumes it happens. :)

All of the above sounds sensible to me.

> +Alternatives considered
> +-----------------------

This message stops here...


* Re: [PATCH v4] technical doc: add a design doc for hash function transition
  2017-10-03  5:40         ` Junio C Hamano
@ 2017-10-03 13:08           ` Jason Cooper
  0 siblings, 0 replies; 113+ messages in thread
From: Jason Cooper @ 2017-10-03 13:08 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jonathan Nieder, Shawn Pearce, Linus Torvalds, Git Mailing List,
	Stefan Beller, bmwill, Jonathan Tan, Jeff King, David Lang,
	brian m. carlson, Masaya Suzuki, demerphq, The Keccak Team,
	Johannes Schindelin

On Tue, Oct 03, 2017 at 02:40:26PM +0900, Junio C Hamano wrote:
> Jonathan Nieder <jrnieder@gmail.com> writes:
...
> > +Meaning of signatures
> > +~~~~~~~~~~~~~~~~~~~~~
> > +The signed payload for signed commits and tags does not explicitly
> > +name the hash used to identify objects. If some day Git adopts a new
> > +hash function with the same length as the current SHA-1 (40
> > +hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the
> > +intent behind the PGP signed payload in an object signature is
> > +unclear:
> > +
> > +	object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
> > +	type commit
> > +	tag v2.12.0
> > +	tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800
> > +
> > +	Git 2.12
> > +
> > +Does this mean Git v2.12.0 is the commit with sha1-name
> > +e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
> > +new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?
> > +
> > +Fortunately NewHash and SHA-1 have different lengths. If Git starts
> > +using another hash with the same length to name objects, then it will
> > +need to change the format of signed payloads using that hash to
> > +address this issue.
> 
> This is not just signatures, is it?  The reference to parent commits
> and its tree in a commit object would also have ambiguity between
> SHA-1 and new-40-digit-hash.  And the "no mixed repository" rule
> resolved that for us---isn't that sufficient for the signed tag (or
> commit), too?  If such a signed-tag appears in a SHA-1 content of a
> tag, then the "object" reference is made with SHA-1.  If the tag is
> in NewHash40 content, "object" reference is made with NewHash40, no?

I do hope we adhere to the "no mixed repository" rule.  Or, at least,
"no mixing of hash types".  Ambiguity opens cracks for uncertainty to
creep in.

For our case, where we counter-hash the sha1 commits, and counter-sign
the sha1-based signatures, we intend to include the relevant
sha1<->newhash lookups in the newhash signature body.  afaict, the git
sha1<->newhash table is not cryptographically secured underneath
signatures, and thus can't be used in the verification of objects.

The advantage to this approach is that we can be as explicit as
necessary with "SHA-1 -> SHA-512/256" or "SHA-1 -> SHA3-256" in the body
of the message.

thx,

Jason.


* Re: [PATCH v4] technical doc: add a design doc for hash function transition
  2017-09-28  4:43       ` [PATCH v4] technical doc: add a design doc for hash function transition Jonathan Nieder
                           ` (3 preceding siblings ...)
  2017-10-03  5:40         ` Junio C Hamano
@ 2017-10-04  1:44         ` Junio C Hamano
  4 siblings, 0 replies; 113+ messages in thread
From: Junio C Hamano @ 2017-10-04  1:44 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Shawn Pearce, Linus Torvalds, Git Mailing List, Stefan Beller,
	bmwill, Jonathan Tan, Jeff King, David Lang, brian m. carlson,
	Masaya Suzuki, demerphq, The Keccak Team, Johannes Schindelin

Jonathan Nieder <jrnieder@gmail.com> writes:

> +Alternatives considered
> +-----------------------
> +Upgrading everyone working on a particular project on a flag day
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ...
> +Using hash functions in parallel
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ...

Good that we are not doing these ;-)

> +Lazily populated translation table
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Some of the work of building the translation table could be deferred to
> +push time, but that would significantly complicate and slow down pushes.
> +Calculating the sha1-name at object creation time at the same time it is
> +being streamed to disk and having its newhash-name calculated should be
> +an acceptable cost.

And the version described in the body of the document hopefully
would be simpler.  It certainly would be, when SHA-1 content and
NewHash content are the same (i.e. blob).
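
For the blob case, computing both names in one pass while the content
is streamed might look like the sketch below.  Assumptions: SHA-256
stands in for the not-yet-chosen NewHash, the function name is
hypothetical, and the "blob <size>\0" header is assumed to be the
same under both hashes.  Trees, commits and tags are not covered,
because their content differs between the two formats and has to be
translated first.

```python
import hashlib
import io


def hash_blob_both(stream, size, chunk_size=65536):
    """Return (sha1_name, newhash_name) for a blob read from `stream`."""
    header = b"blob %d\0" % size
    sha1 = hashlib.sha1(header)
    newhash = hashlib.sha256(header)  # stand-in for NewHash (assumption)
    seen = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Feed the same bytes to both hashers; the content is read only once.
        sha1.update(chunk)
        newhash.update(chunk)
        seen += len(chunk)
    if seen != size:
        raise ValueError("expected %d bytes, read %d" % (size, seen))
    return sha1.hexdigest(), newhash.hexdigest()


data = b"hello\n"
sha1_name, newhash_name = hash_blob_both(io.BytesIO(data), len(data))
```

The extra cost per object is one more hash update per chunk, which is
the "acceptable cost" the quoted paragraph refers to.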

Thanks.


Thread overview: 113+ messages
2017-03-04  1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
2017-03-05  2:35 ` Linus Torvalds
2017-03-06  0:26   ` brian m. carlson
2017-03-06 18:24     ` Brandon Williams
2017-06-15 10:30       ` Which hash function to use, was " Johannes Schindelin
2017-06-15 11:05         ` Mike Hommey
2017-06-15 13:01           ` Jeff King
2017-06-15 16:30             ` Ævar Arnfjörð Bjarmason
2017-06-15 19:34               ` Johannes Schindelin
2017-06-15 21:59                 ` Adam Langley
2017-06-15 22:41                   ` brian m. carlson
2017-06-15 23:36                     ` Ævar Arnfjörð Bjarmason
2017-06-16  0:17                       ` brian m. carlson
2017-06-16  6:25                         ` Ævar Arnfjörð Bjarmason
2017-06-16 13:24                           ` Johannes Schindelin
2017-06-16 17:38                             ` Adam Langley
2017-06-16 20:52                               ` Junio C Hamano
2017-06-16 21:12                                 ` Junio C Hamano
2017-06-16 21:24                                   ` Jonathan Nieder
2017-06-16 21:39                                     ` Ævar Arnfjörð Bjarmason
2017-06-16 20:42                             ` Jeff King
2017-06-19  9:26                               ` Johannes Schindelin
2017-06-15 21:10             ` Mike Hommey
2017-06-16  4:30               ` Jeff King
2017-06-15 17:36         ` Brandon Williams
2017-06-15 19:20           ` Junio C Hamano
2017-06-15 19:13         ` Jonathan Nieder
2017-03-07  0:17   ` RFC v3: " Jonathan Nieder
2017-03-09 19:14     ` Shawn Pearce
2017-03-09 20:24       ` Jonathan Nieder
2017-03-10 19:38         ` Jeff King
2017-03-10 19:55           ` Jonathan Nieder
2017-09-28  4:43       ` [PATCH v4] technical doc: add a design doc for hash function transition Jonathan Nieder
2017-09-29  6:06         ` Junio C Hamano
2017-09-29  8:09           ` Junio C Hamano
2017-09-29 17:34           ` Jonathan Nieder
2017-10-02  8:25             ` Junio C Hamano
2017-10-02 19:41             ` Jason Cooper
2017-10-02  9:02         ` Junio C Hamano
2017-10-02 19:23         ` Jason Cooper
2017-10-03  5:40         ` Junio C Hamano
2017-10-03 13:08           ` Jason Cooper
2017-10-04  1:44         ` Junio C Hamano
2017-09-06  6:28     ` RFC v3: Another proposed hash function transition plan Junio C Hamano
2017-09-08  2:40       ` Junio C Hamano
2017-09-08  3:34         ` Jeff King
2017-09-11 18:59         ` Brandon Williams
2017-09-13 12:05           ` Johannes Schindelin
2017-09-13 13:43             ` demerphq
2017-09-13 22:51               ` Jonathan Nieder
2017-09-14 18:26                 ` Johannes Schindelin
2017-09-14 18:40                   ` Jonathan Nieder
2017-09-14 22:09                     ` Johannes Schindelin
2017-09-13 23:30               ` Linus Torvalds
2017-09-14 18:45                 ` Johannes Schindelin
2017-09-18 12:17                   ` Gilles Van Assche
2017-09-18 22:16                     ` Johannes Schindelin
2017-09-19 16:45                       ` Gilles Van Assche
2017-09-29 13:17                         ` Johannes Schindelin
2017-09-29 14:54                           ` Joan Daemen
2017-09-29 22:33                             ` Johannes Schindelin
2017-09-30 22:02                               ` Joan Daemen
2017-10-02 14:26                                 ` Johannes Schindelin
2017-09-18 22:25                     ` Jonathan Nieder
2017-09-26 17:05                   ` Jason Cooper
2017-09-26 22:11                     ` Johannes Schindelin
2017-09-26 22:25                       ` [PATCH] technical doc: add a design doc for hash function transition Stefan Beller
2017-09-26 23:38                         ` Jonathan Nieder
2017-09-26 23:51                       ` RFC v3: Another proposed hash function transition plan Jonathan Nieder
2017-10-02 14:54                         ` Jason Cooper
2017-10-02 16:50                           ` Brandon Williams
2017-10-02 14:00                       ` Jason Cooper
2017-10-02 17:18                         ` Linus Torvalds
2017-10-02 19:37                           ` Jeff King
2017-09-13 16:30             ` Jonathan Nieder
2017-09-13 21:52               ` Junio C Hamano
2017-09-13 22:07                 ` Stefan Beller
2017-09-13 22:18                   ` Jonathan Nieder
2017-09-14  2:13                     ` Junio C Hamano
2017-09-14 15:23                       ` Johannes Schindelin
2017-09-14 15:45                         ` demerphq
2017-09-14 22:06                           ` Johannes Schindelin
2017-09-13 22:15                 ` Junio C Hamano
2017-09-13 22:27                   ` Jonathan Nieder
2017-09-14  2:10                     ` Junio C Hamano
2017-09-14 12:39               ` Johannes Schindelin
2017-09-14 16:36                 ` Brandon Williams
2017-09-14 18:49                 ` Jonathan Nieder
2017-09-15 20:42                   ` Philip Oakley
2017-03-05 11:02 ` RFC: " David Lang
     [not found]   ` <CA+dhYEXHbQfJ6KUB1tWS9u1MLEOJL81fTYkbxu4XO-i+379LPw@mail.gmail.com>
2017-03-06  9:43     ` Jeff King
2017-03-06 23:40   ` Jonathan Nieder
2017-03-07  0:03     ` Mike Hommey
2017-03-06  8:43 ` Jeff King
2017-03-06 18:39   ` Jonathan Tan
2017-03-06 19:22     ` Linus Torvalds
2017-03-06 19:59       ` Brandon Williams
2017-03-06 21:53       ` Junio C Hamano
2017-03-07  8:59     ` Jeff King
2017-03-06 18:43   ` Junio C Hamano
2017-03-07 18:57 ` Ian Jackson
2017-03-07 19:15   ` Linus Torvalds
2017-03-08 11:20     ` Ian Jackson
2017-03-08 15:37       ` Johannes Schindelin
2017-03-08 15:40       ` Johannes Schindelin
2017-03-20  5:21         ` Use base32? Jason Hennessey
2017-03-20  5:58           ` Michael Steuer
2017-03-20  8:05             ` Jacob Keller
2017-03-21  3:07               ` Michael Steuer
2017-03-13  9:24 ` RFC: Another proposed hash function transition plan The Keccak Team
2017-03-13 17:48   ` Jonathan Nieder
2017-03-13 18:34     ` ankostis
2017-03-17 11:07       ` Johannes Schindelin
