From: Jeff Hostetler <git@jeffhostetler.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"Alex Vandiver" <alexmv@dropbox.com>
Cc: git@vger.kernel.org, jonathantanmy@google.com, bmwill@google.com,
stolee@gmail.com, sbeller@google.com, peff@peff.net,
johannes.schindelin@gmx.de, Jonathan Nieder <jrnieder@gmail.com>,
Michael Haggerty <mhagger@alum.mit.edu>
Subject: Re: Git Merge contributor summit notes
Date: Mon, 26 Mar 2018 13:33:46 -0400
Message-ID: <0c3bb65f-d418-b39e-34c7-c2f3efec7e50@jeffhostetler.com>
In-Reply-To: <874ll3yd75.fsf@evledraar.gmail.com>

On 3/25/2018 6:58 PM, Ævar Arnfjörð Bjarmason wrote:
>
> On Sat, Mar 10 2018, Alex Vandiver wrote:
>
>> New hash (Stefan, etc)
>> ----------------------
>> - discussed on the mailing list
>> - actual plan checked in to Documentation/technical/hash-function-transition.txt
>> - lots of work renaming
>> - any actual work with the transition plan?
>> - local conversion first; fetch/push have translation table
>> - like git-svn
>> - also modified pack and index format to have lookup/translation efficiently
>> - brian's series to eliminate SHA1 strings from the codebase
>> - testsuite is not working well because hardcoded SHA1 values
>> - flip a bit in the sha1 computation and see what breaks in the testsuite
>> - will also need a way to do the conversion itself; traverse and write out new version
>> - without that, can start new repos, but not work on old ones
>> - on-disk formats will need to change -- something to keep in mind with new index work
>> - documentation describes packfile and index formats
>> - what time frame are we talking?
>> - public perception question
>> - signing commits doesn't help (just signs commit object) unless you "recursive sign"
>> - switched to SHA1dc; we detect and reject known collision technique
>> - do it now because it takes too long if we start when the collision drops
>> - always call it "new hash" to reduce bikeshedding
>> - is translation table a backdoor? has it been reviewed by crypto folks?
>> - no, but everything gets translated
>> - meant to avoid a flag day for entire repositories
>> - linus can decide to upgrade to newhash; if pushes to server that is not newhash aware, that's fine
>> - will need a wire protocol change
>> - v2 might add a capability for newhash
>> - "now that you mention md5, it's a good idea"
>> - can use md5 to test the conversion
>> - is there a technical reason for why not /n/ hashes?
>> - the slow step goes away as people converge to the new hash
>> - beneficial to make up some fake hash function for testing
>> - is there a plan on how we decide which hash function?
>> - trust junio to merge commits when appropriate
>> - conservancy committee explicitly does not make code decisions
>> - waiting will just give better data
>> - some hash functions are in silicon (e.g. microsoft cares)
>> - any movement in libgit2 / jgit?
>> - basic stuff for libgit2; same testsuite problems
>> - no work in jgit
>> - most optimistic forecast?
>> - could be done in 1-2y
>> - submodules with one hash function?
>> - unable to convert project unless all submodules are converted
>> - OO-ing is not a prereq
>
> Late reply, but one thing I brought up at the time is that we'll want to
> keep this code around even after the NewHash migration at least for
> testing purposes, should we ever need to move to NewNewHash.
>
> It occurred to me recently that once we have such a layer it could be
> (ab)used with some relatively minor changes to do any arbitrary
> local-to-remote object content translation, unless I've missed something
> (but I just re-read hash-function-transition.txt now...).
>
> E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
> remote server so that you upload a GPG encrypted version of all your
> blobs, and have your trees reference those blobs.
>
> Because we'd be doing arbitrary translations for all of
> commits/trees/blobs this could go further than other bolted-on
> encryption solutions for Git. E.g. paths in trees could be encrypted
> too, as well as all the content of the commit object that isn't parent
> info & the like (but that would have different hashes).
>
> Basically clean/smudge filters on steroids, but for every object in the
> repo. Anyone who got a hold of it would still see the shape of the repo
> & approximate content size, but other than that it wouldn't be more info
> than they'd get via `fast-export --anonymize` now.
>
> I mainly find it interesting because it presents an intersection between a
> feature we might want to offer anyway, and something that would stress
> the hash transition codepath going forward, to make sure it hasn't all
> bitrotted by the time we'll need NewHash->NewNewHash.
>
> Git hosting providers would hate it, but they should probably be
> charging users by how much Michael Haggerty's git-sizer tool hates their
> repo anyway :)
>
While we are converting to a new hash function, it would be nice
if we could add a couple of fields to the end of the OID: the object
type and the raw uncompressed object size.

Concretely, we could extend the OID to include 6 bytes of data
(4 or 8 bits for the type and the rest for the raw object size), and
just say that an OID is a {hash,type,size} tuple.

There are lots of places where we open an object to see what type it is
or how big it is. This requires uncompressing/undeltafying the object
(or at least decoding enough to get the header). In the case of missing
objects (partial clone or a gvfs-like projection) it requires either
dynamically fetching the object or asking an object-size-server for the
data.

All of these cases could be eliminated if the type/size were available
in the OID.
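
To make the {hash,type,size} idea concrete, here is a minimal sketch of
one possible layout. It is an illustration only, not a proposed format:
it assumes the 8-bit type variant (leaving 40 bits for the size), a
big-endian size field, and SHA-1 standing in for the not-yet-chosen new
hash. The names make_extended_oid and split_extended_oid are
hypothetical. The type codes are the ones Git already uses for its
object types.

```python
import hashlib

# Git's existing object type codes.
OBJ_TYPES = {"commit": 1, "tree": 2, "blob": 3, "tag": 4}
TYPE_NAMES = {code: name for name, code in OBJ_TYPES.items()}

# Assumed layout of the 6-byte suffix: 1 byte of type, 5 bytes of size.
SIZE_BYTES = 5
SUFFIX_BYTES = 1 + SIZE_BYTES
MAX_SIZE = (1 << (8 * SIZE_BYTES)) - 1


def object_hash(obj_type: str, content: bytes) -> bytes:
    # Git hashes "<type> <size>" + NUL + content; SHA-1 is a stand-in here.
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).digest()


def make_extended_oid(obj_type: str, content: bytes) -> bytes:
    """Return hash || type || size for the given object."""
    size = len(content)
    if size > MAX_SIZE:
        raise ValueError("object too large for a 40-bit size field")
    suffix = bytes([OBJ_TYPES[obj_type]]) + size.to_bytes(SIZE_BYTES, "big")
    return object_hash(obj_type, content) + suffix


def split_extended_oid(oid: bytes):
    """Recover (hash, type, size) from the OID alone, without the object."""
    hash_part, suffix = oid[:-SUFFIX_BYTES], oid[-SUFFIX_BYTES:]
    return hash_part, TYPE_NAMES[suffix[0]], int.from_bytes(suffix[1:], "big")


if __name__ == "__main__":
    oid = make_extended_oid("blob", b"hello\n")
    _, obj_type, size = split_extended_oid(oid)
    print(oid.hex(), obj_type, size)
```

With this layout the type and size can be read back from the OID
without inflating or fetching the object. Under the 8-bit-type
assumption the size field tops out just below 1 TiB; the 4-bit variant
would leave 44 bits for the size.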

Just a thought. While we are converting to a new hash it seems like
this would be a good time to at least discuss it.

Jeff