git@vger.kernel.org mailing list mirror (one of many)
From: "brian m. carlson" <sandals@crustytoothpaste.net>
To: git@vger.kernel.org
Subject: State of NewHash work, future directions, and discussion
Date: Sat, 9 Jun 2018 20:56:28 +0000
Message-ID: <20180609205628.GB38834@genre.crustytoothpaste.net> (raw)

Since there have been a lot of questions recently about the state of
the NewHash work, I thought I'd send out a summary.

== Status

I have patches to make the entire codebase work, including passing all
tests, when Git is converted to use a 256-bit hash algorithm.
Obviously, such a Git is incompatible with the current version, but it
means that we've fixed essentially all of the hard-coded 20 and 40
constants (and therefore Git doesn't segfault).
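
To illustrate what removing a hard-coded 20 or 40 looks like, here is a
minimal sketch (not code from the patches).  All names prefixed with
sketch_ or SKETCH_ are hypothetical stand-ins; the idea is only that
sizes come from an algorithm descriptor, while buffers are sized for the
largest supported hash.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor; stands in for whatever the real code uses. */
struct sketch_hash_algo {
	const char *name;
	size_t rawsz; /* size of the binary hash in bytes */
	size_t hexsz; /* size of the hex form in characters */
};

static const struct sketch_hash_algo sketch_sha1 = { "sha1", 20, 40 };
static const struct sketch_hash_algo sketch_hash256 = { "hash256", 32, 64 };

/* Upper bounds, used only for sizing buffers (assumed values). */
#define SKETCH_MAX_RAWSZ 32
#define SKETCH_MAX_HEXSZ 64

/*
 * Write the hex form of hash into out and NUL-terminate it.  The loop
 * bound comes from the descriptor rather than from a literal 20.
 * Returns the number of hex characters written.
 */
static size_t sketch_hash_to_hex(const struct sketch_hash_algo *algo,
				 const unsigned char *hash,
				 char *out, size_t outlen)
{
	static const char digits[] = "0123456789abcdef";
	size_t i;

	assert(outlen > algo->hexsz); /* room for the trailing NUL */
	for (i = 0; i < algo->rawsz; i++) {
		out[2 * i] = digits[hash[i] >> 4];
		out[2 * i + 1] = digits[hash[i] & 0xf];
	}
	out[algo->hexsz] = '\0';
	return algo->hexsz;
}
```

A caller would declare its buffer as char hex[SKETCH_MAX_HEXSZ + 1] and
pass whichever descriptor the repository uses, so the same code path
handles both hash sizes.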

I'm working on getting a 256-bit Git to work with SHA-1 being the
default.  Currently, this involves doing things like writing transport
code, since in order to clone a repository, you need to be able to set
up the hash algorithm correctly.  I know that this was a non-goal in the
transition plan, but since the testsuite doesn't pass without it, it's
become necessary.

Some of these patches will be making their way to the list soon.
They're hanging out in the normal places in the object-id-part14 branch
(which may be rebased).

== Future Design

The work I've done necessarily involves porting everything to use
the_hash_algo.  Essentially, when the piece I'm currently working on is
complete, we'll have a transition stage 4 implementation (all NewHash).
Stage 2 and 3 will be implemented next.

My vision of how data is stored is that the .git directory is, except
for pack indices and the loose object lookup table, entirely in one
format.  It will be all SHA-1 or all NewHash.  This algorithm will be
stored in the_hash_algo.

I plan on introducing an array of hash algorithms into struct repository
(and wrapper macros) which stores, in order, the output hash, and if
used, the additional input hash.
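
One possible shape for such an array, as a rough sketch only: the
struct, macro, and function names below are hypothetical, and the
assumption that the output hash is always also accepted on input is an
interpretation of "additional input hash", not something the plan
states in these words.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor with just the fields this sketch needs. */
struct sketch_hash_algo {
	const char *name;
	size_t rawsz;
};

static const struct sketch_hash_algo sketch_sha1 = { "sha1", 20 };
static const struct sketch_hash_algo sketch_newhash = { "newhash", 32 };

struct sketch_repository {
	/*
	 * [0] is the output hash and is always set.
	 * [1] is the additional input hash, or NULL if none is used.
	 */
	const struct sketch_hash_algo *hash_algos[2];
};

/* Wrapper macros so callers do not index the array directly. */
#define sketch_output_algo(r) ((r)->hash_algos[0])
#define sketch_input_algo(r) ((r)->hash_algos[1])

/* Return nonzero if an object ID of rawsz bytes is acceptable input. */
static int sketch_accepts_rawsz(const struct sketch_repository *r,
				size_t rawsz)
{
	size_t i;

	assert(sketch_output_algo(r) != NULL);
	for (i = 0; i < 2; i++)
		if (r->hash_algos[i] && r->hash_algos[i]->rawsz == rawsz)
			return 1;
	return 0;
}
```

With only slot 0 filled, the repository accepts a single hash size; with
both slots filled, it accepts either size on input but still has one
well-defined output hash.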

Functions like get_oid_hex and parse_oid_hex will each gain two
versions: an internal one, which parses things (like refs) in the
internal format, and an external one, which parses the UI formats.
Similarly, oid_to_hex will have an internal version that handles data in
the .git directory, and an external version that produces data in the
output format.  Translation will take place at the outer edges of the
program.
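
The internal/external split could look roughly like the sketch below.
Everything here is hypothetical (the sketch_ names, the two-field
repository struct, lowercase-only hex); it shows only that the same
parser can be bound to either the storage algorithm or the UI algorithm.
The actual translation between the two, which would need an object ID
mapping, is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_hash_algo {
	const char *name;
	size_t rawsz;
};

static const struct sketch_hash_algo sketch_sha1 = { "sha1", 20 };
static const struct sketch_hash_algo sketch_newhash = { "newhash", 32 };

struct sketch_oid {
	unsigned char hash[32]; /* large enough for either algorithm */
	const struct sketch_hash_algo *algo;
};

struct sketch_repo {
	const struct sketch_hash_algo *storage; /* format inside .git */
	const struct sketch_hash_algo *ui;      /* format shown to users */
};

static int sketch_hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/* Parse exactly 2 * rawsz lowercase hex digits; 0 on success, -1 on error. */
static int sketch_parse_hex_algo(const char *hex,
				 const struct sketch_hash_algo *algo,
				 struct sketch_oid *oid)
{
	size_t i;

	assert(algo->rawsz <= sizeof(oid->hash));
	if (strlen(hex) != 2 * algo->rawsz)
		return -1;
	for (i = 0; i < algo->rawsz; i++) {
		int hi = sketch_hexval(hex[2 * i]);
		int lo = sketch_hexval(hex[2 * i + 1]);
		if (hi < 0 || lo < 0)
			return -1;
		oid->hash[i] = (unsigned char)((hi << 4) | lo);
	}
	oid->algo = algo;
	return 0;
}

/* Internal version: data read from the .git directory. */
static int sketch_parse_internal(const struct sketch_repo *r,
				 const char *hex, struct sketch_oid *oid)
{
	return sketch_parse_hex_algo(hex, r->storage, oid);
}

/* External version: data typed by or shown to the user. */
static int sketch_parse_ui(const struct sketch_repo *r,
			   const char *hex, struct sketch_oid *oid)
{
	return sketch_parse_hex_algo(hex, r->ui, oid);
}
```

In a repository that stores a 256-bit hash but presents SHA-1 in the UI,
a 40-character ID would parse through the external entry point and be
rejected by the internal one.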

The transition plan anticipates a stage 1 where we accept only SHA-1 on
input and produce only SHA-1 on output, but store in NewHash.  As I've
worked with our tests, I've realized such an implementation is not
entirely possible.  We have various tools that expect to accept invalid
object IDs, and obviously there's no way to have those continue to work.
We'd have to either reject invalid data in such a case or combine stages
1 and 2.

== Compatibility with this Work

If you're working on new features and you'd like to implement the best
possible compatibility with this work, here are some recommendations:

* Assume everything in the .git directory but pack indices and the loose
  object index will be in the same algorithm and that that algorithm is
  the_hash_algo.
* For the moment, use the_hash_algo to look up the size of all
  hash-related constants.  Use GIT_MAX_* for allocations.
* If you are writing a new data format, add a version number.
* If you need to serialize an algorithm identifier into your data
  format, use the format_id field of struct git_hash_algo.  It's
  designed specifically for that purpose.
* You can safely assume that the_hash_algo will be suitably initialized
  to the correct algorithm for your repository.
* Keep using the object ID functions and struct object_id.
* Try not to use mmap'd structs for reading and writing formats on disk,
  since these are hard to make hash size agnostic.
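
As a rough sketch of several of these recommendations together (a
version number, a serialized algorithm identifier, and explicit
byte-wise parsing instead of an mmap'd struct), consider the following.
The header layout, the sketch_ names, and the 256-bit algorithm's
identifier are made up for illustration; only the SHA-1 identifier
0x73686131 ("sha1") matches the existing format_id value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_hash_algo {
	const char *name;
	uint32_t format_id; /* value serialized into on-disk formats */
	size_t rawsz;
};

static const struct sketch_hash_algo sketch_algos[] = {
	{ "sha1", 0x73686131, 20 },    /* "sha1" */
	{ "newhash", 0x6e657768, 32 }, /* "newh": placeholder, not real */
};

#define SKETCH_MAX_RAWSZ 32
#define SKETCH_HEADER_SIZE 8 /* 4-byte version + 4-byte format id */

static void sketch_put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

static uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Serialize the header field by field, in big-endian byte order. */
static void sketch_write_header(unsigned char *buf, uint32_t version,
				const struct sketch_hash_algo *algo)
{
	assert(algo != NULL);
	sketch_put_be32(buf, version);
	sketch_put_be32(buf + 4, algo->format_id);
}

/*
 * Parse the header and look up the algorithm by its identifier.
 * Returns NULL if the buffer is too short or the identifier is unknown.
 */
static const struct sketch_hash_algo *
sketch_read_header(const unsigned char *buf, size_t len, uint32_t *version)
{
	size_t i;
	uint32_t id;

	if (len < SKETCH_HEADER_SIZE)
		return NULL;
	*version = sketch_get_be32(buf);
	id = sketch_get_be32(buf + 4);
	for (i = 0; i < sizeof(sketch_algos) / sizeof(sketch_algos[0]); i++)
		if (sketch_algos[i].format_id == id)
			return &sketch_algos[i];
	return NULL;
}
```

A reader of such a format would take every hash length from the returned
descriptor's rawsz and size any fixed buffers with SKETCH_MAX_RAWSZ, so
nothing in the parser depends on one particular hash size.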

== Discussion about an Actual NewHash

Since I'll be writing new code, I'll be writing tests for this code.
However, writing tests for creating and initializing repositories
requires that I be able to test that objects are being serialized
correctly, and therefore requires that I actually know what the hash
algorithm is going to be.  I also can't submit code for multi-hash packs
when we officially only support one hash algorithm.

I know that we have long tried to avoid discussing the specific
algorithm to use, in part because the last discussion generated more
heat than light, and settled on referring to it as NewHash for the time
being.  However, I think it's time to pick this topic back up, since I
can't really continue work in this direction without us picking a
NewHash.

If people are interested, I've done some analysis on availability of
implementations, performance, and other attributes described in the
transition plan and can send that to the list.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


