Re: Git Merge contributor summit notes

From: Jeff Hostetler <git@jeffhostetler.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Alex Vandiver" <alexmv@dropbox.com>
Cc: git@vger.kernel.org, jonathantanmy@google.com, bmwill@google.com,
	stolee@gmail.com, sbeller@google.com, peff@peff.net,
	johannes.schindelin@gmx.de, Jonathan Nieder <jrnieder@gmail.com>,
	Michael Haggerty <mhagger@alum.mit.edu>
Subject: Re: Git Merge contributor summit notes
Date: Mon, 26 Mar 2018 13:33:46 -0400
Message-ID: <0c3bb65f-d418-b39e-34c7-c2f3efec7e50@jeffhostetler.com>
In-Reply-To: <874ll3yd75.fsf@evledraar.gmail.com>



On 3/25/2018 6:58 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Sat, Mar 10 2018, Alex Vandiver wrote:
> 
>> New hash (Stefan, etc)
>> ----------------------
>>   - discussed on the mailing list
>>   - actual plan checked in to Documentation/technical/hash-function-transition.txt
>>   - lots of work renaming
>>   - any actual work with the transition plan?
>>   - local conversion first; fetch/push have translation table
>>   - like git-svn
>>   - also modified pack and index format to have lookup/translation efficiently
>>   - brian's series to eliminate SHA1 strings from the codebase
>>   - testsuite is not working well because hardcoded SHA1 values
>>   - flip a bit in the sha1 computation and see what breaks in the testsuite
>>   - will also need a way to do the conversion itself; traverse and write out new version
>>   - without that, can start new repos, but not work on old ones
>>   - on-disk formats will need to change -- something to keep in mind with new index work
>>   - documentation describes packfile and index formats
>>   - what time frame are we talking?
>>   - public perception question
>>   - signing commits doesn't help (just signs commit object) unless you "recursive sign"
>>   - switched to SHA1dc; we detect and reject known collision technique
>>   - do it now because it takes too long if we start when the collision drops
>>   - always call it "new hash" to reduce bikeshedding
>>   - is translation table a backdoor? has it been reviewed by crypto folks?
>>     - no, but everything gets translated
>>   - meant to avoid a flag day for entire repositories
>>   - linus can decide to upgrade to newhash; if pushes to server that is not newhash aware, that's fine
>>   - will need a wire protocol change
>>   - v2 might add a capability for newhash
>>   - "now that you mention md5, it's a good idea"
>>   - can use md5 to test the conversion
>>   - is there a technical reason for why not /n/ hashes?
>>   - the slow step goes away as people converge to the new hash
>>   - beneficial to make up some fake hash function for testing
>>   - is there a plan on how we decide which hash function?
>>   - trust junio to merge commits when appropriate
>>   - conservancy committee explicitly does not make code decisions
>>   - waiting will just give better data
>>   - some hash functions are in silicon (e.g. microsoft cares)
>>   - any movement in libgit2 / jgit?
>>     - basic stuff for libgit2; same testsuite problems
>>     - no work in jgit
>>   - most optimistic forecast?
>>     - could be done in 1-2y
>>   - submodules with one hash function?
>>     - unable to convert project unless all submodules are converted
>>     - OO-ing is not a prereq
> 
> Late reply, but one thing I brought up at the time is that we'll want to
> keep this code around even after the NewHash migration at least for
> testing purposes, should we ever need to move to NewNewHash.
> 
> It occurred to me recently that once we have such a layer it could be
> (ab)used with some relatively minor changes to do any arbitrary
> local-to-remote object content translation, unless I've missed something
> (but I just re-read hash-function-transition.txt now...).
> 
> E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
> remote server so that you upload a GPG encrypted version of all your
> blobs, and have your trees reference those blobs.
> 
> Because we'd be doing arbitrary translations for all of
> commits/trees/blobs this could go further than other bolted-on
> encryption solutions for Git. E.g. paths in trees could be encrypted
> too, as well as all the content of the commit object that isn't parent
> info & the like (but that would have different hashes).
> 
> Basically clean/smudge filters on steroids, but for every object in the
> repo. Anyone who got a hold of it would still see the shape of the repo
> & approximate content size, but other than that it wouldn't be more info
> than they'd get via `fast-export --anonymize` now.
> 
> I mainly find it interesting because it presents an intersection between a
> feature we might want to offer anyway, and something that would stress
> the hash transition codepath going forward, to make sure it hasn't all
> bitrotted by the time we'll need NewHash->NewNewHash.
> 
> Git hosting providers would hate it, but they should probably be
> charging users by how much Michael Haggerty's git-sizer tool hates their
> repo anyway :)
> 

While we are converting to a new hash function, it would be nice
if we could add a couple of fields to the end of the OID:  the object
type and the raw uncompressed object size.

It would be nice if we could extend the OID to include 6 bytes of data
(4 or 8 bits for the type and the rest for the raw object size), and
just say that an OID is a {hash,type,size} tuple.

There are lots of places where we open an object to see what type it is
or how big it is.  This requires uncompressing/undeltafying the object
(or at least decoding enough to get the header).  In the case of missing
objects (partial clone or a gvfs-like projection) it requires either
dynamically fetching the object or asking an object-size-server for the
data.
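To illustrate the cost for the simplest case, here is a minimal Python
sketch (not Git's actual code) of reading the type and size of a loose
object. It relies only on the loose object layout, a zlib stream whose
inflated form begins with "<type> <size>" followed by a NUL byte. The
helper names make_loose_object and read_type_and_size are hypothetical,
and packed or deltified objects need more work than this shows.

```python
import zlib


def make_loose_object(obj_type: str, content: bytes) -> bytes:
    # A loose object is zlib-compressed "<type> <size>" + NUL + content.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return zlib.compress(header + content)


def read_type_and_size(compressed: bytes, chunk: int = 64):
    # Hypothetical helper: even to learn only the type and size we must
    # start inflating the stream until the NUL that ends the header.
    inflater = zlib.decompressobj()
    buf = b""
    for start in range(0, len(compressed), chunk):
        buf += inflater.decompress(compressed[start:start + chunk])
        if b"\0" in buf:
            break
    else:
        raise ValueError("no header terminator found")
    header, _, _ = buf.partition(b"\0")
    obj_type, size = header.split(b" ")
    return obj_type.decode("ascii"), int(size)


loose = make_loose_object("blob", b"hello\n")
print(read_type_and_size(loose))
```

The point is that the object data has to be present locally and at
least partly inflated before either field is known.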

All of these cases could be eliminated if the type/size were available
in the OID.
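
As a rough sketch of what such a tuple could look like, the Python
below appends a 6-byte trailer to the hash, using the 8-bit type
variant and leaving 40 bits for the size. Everything here is an
assumption for illustration: SHA-256 merely stands in for the
not-yet-chosen new hash, the big-endian trailer layout is one of
several possible encodings, and ExtendedOid, encode_oid and decode_oid
are hypothetical names. The type codes follow Git's existing object
type numbering.

```python
import hashlib
from typing import NamedTuple

# Illustrative type codes (commit=1, tree=2, blob=3, tag=4).
TYPE_CODES = {"commit": 1, "tree": 2, "blob": 3, "tag": 4}
CODE_TYPES = {code: name for name, code in TYPE_CODES.items()}

TRAILER_BYTES = 6  # 48 bits appended to the hash
SIZE_BITS = 40     # 8 bits for the type, 40 bits for the raw size


class ExtendedOid(NamedTuple):
    digest: bytes
    obj_type: str
    size: int


def make_oid(obj_type: str, content: bytes) -> ExtendedOid:
    # Hash "<type> <size>" + NUL + content; SHA-256 is only a stand-in.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    digest = hashlib.sha256(header + content).digest()
    return ExtendedOid(digest, obj_type, len(content))


def encode_oid(oid: ExtendedOid) -> bytes:
    if oid.size >= 1 << SIZE_BITS:
        raise ValueError("size does not fit in 40 bits")
    trailer = (TYPE_CODES[oid.obj_type] << SIZE_BITS) | oid.size
    return oid.digest + trailer.to_bytes(TRAILER_BYTES, "big")


def decode_oid(raw: bytes) -> ExtendedOid:
    # Type and size come straight from the trailer; no object is opened.
    digest, tail = raw[:-TRAILER_BYTES], raw[-TRAILER_BYTES:]
    trailer = int.from_bytes(tail, "big")
    obj_type = CODE_TYPES[trailer >> SIZE_BITS]
    size = trailer & ((1 << SIZE_BITS) - 1)
    return ExtendedOid(digest, obj_type, size)


oid = make_oid("blob", b"hello\n")
raw = encode_oid(oid)
print(len(raw), decode_oid(raw).obj_type, decode_oid(raw).size)
```

With this particular split, sizes are capped just below 1 TiB; the
4-bit type variant would leave 44 bits for the size instead.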

Just a thought.  While we are converting to a new hash it seems like
this would be a good time to at least discuss it.

Jeff
