Subject: Re: Git Merge contributor summit notes
To: Ævar Arnfjörð Bjarmason, Alex Vandiver
Cc: git@vger.kernel.org, jonathantanmy@google.com, bmwill@google.com, stolee@gmail.com, sbeller@google.com, peff@peff.net, johannes.schindelin@gmx.de, Jonathan Nieder, Michael Haggerty
References: <874ll3yd75.fsf@evledraar.gmail.com>
From: Jeff Hostetler
Message-ID: <0c3bb65f-d418-b39e-34c7-c2f3efec7e50@jeffhostetler.com>
Date: Mon, 26 Mar 2018 13:33:46 -0400
In-Reply-To: <874ll3yd75.fsf@evledraar.gmail.com>
X-Mailing-List: git@vger.kernel.org

On 3/25/2018 6:58 PM, Ævar Arnfjörð Bjarmason wrote:
>
> On Sat, Mar 10 2018, Alex Vandiver wrote:
>
>> New hash (Stefan, etc)
>> ----------------------
>> - discussed on the mailing list
>> - actual plan checked in to Documentation/technical/hash-function-transition.txt
>> - lots of work renaming
>> - any actual work with the transition plan?
>> - local conversion first; fetch/push have translation table
>> - like git-svn
>> - also modified pack and index format to have lookup/translation efficiently
>> - brian's series to eliminate SHA1 strings from the codebase
>> - testsuite is not working well because hardcoded SHA1 values
>> - flip a bit in the sha1 computation and see what breaks in the testsuite
>> - will also need a way to do the conversion itself; traverse and write out new version
>> - without that, can start new repos, but not work on old ones
>> - on-disk formats will need to change -- something to keep in mind with new index work
>> - documentation describes packfile and index formats
>> - what time frame are we talking?
>> - public perception question
>> - signing commits doesn't help (just signs commit object) unless you "recursive sign"
>> - switched to SHA1dc; we detect and reject known collision technique
>> - do it now because it takes too long if we start when the collision drops
>> - always call it "new hash" to reduce bikeshedding
>> - is translation table a backdoor? has it been reviewed by crypto folks?
>> - no, but everything gets translated
>> - meant to avoid a flag day for entire repositories
>> - linus can decide to upgrade to newhash; if pushes to server that is not newhash aware, that's fine
>> - will need a wire protocol change
>> - v2 might add a capability for newhash
>> - "now that you mention md5, it's a good idea"
>> - can use md5 to test the conversion
>> - is there a technical reason for why not /n/ hashes?
>> - the slow step goes away as people converge to the new hash
>> - beneficial to make up some fake hash function for testing
>> - is there a plan on how we decide which hash function?
>> - trust junio to merge commits when appropriate
>> - conservancy committee explicitly does not make code decisions
>> - waiting will just give better data
>> - some hash functions are in silicon (e.g. microsoft cares)
>> - any movement in libgit2 / jgit?
>> - basic stuff for libgit2; same testsuite problems
>> - no work in jgit
>> - most optimistic forecast?
>> - could be done in 1-2y
>> - submodules with one hash function?
>> - unable to convert project unless all submodules are converted
>> - OO-ing is not a prereq
>
> Late reply, but one thing I brought up at the time is that we'll want to
> keep this code around even after the NewHash migration at least for
> testing purposes, should we ever need to move to NewNewHash.
>
> It occurred to me recently that once we have such a layer it could be
> (ab)used with some relatively minor changes to do any arbitrary
> local-to-remote object content translation, unless I've missed something
> (but I just re-read hash-function-transition.txt now...).
>
> E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
> remote server so that you upload a GPG encrypted version of all your
> blobs, and have your trees reference those blobs.
>
> Because we'd be doing arbitrary translations for all of
> commits/trees/blobs this could go further than other bolted-on
> encryption solutions for Git. E.g. paths in trees could be encrypted
> too, as well as all the content of the commit object that isn't parent
> info & the like (but that would have different hashes).
>
> Basically clean/smudge filters on steroids, but for every object in the
> repo. Anyone who got a hold of it would still see the shape of the repo
> & approximate content size, but other than that it wouldn't be more info
> than they'd get via `fast-export --anonymize` now.
>
> I mainly find it interesting because presents an intersection between a
> feature we might want to offer anyway, and something that would stress
> the hash transition codepath going forward, to make sure it hasn't all
> bitrotted by the time we'll need NewHash->NewNewHash.
>
> Git hosting providers would hate it, but they should probably be
> charging users by how much Michael Haggerty's git-sizer tool hates their
> repo anyway :)
>

While we are converting to a new hash function, it would be nice
if we could add a couple of fields to the end of the OID: the object
type and the raw uncompressed object size.

Specifically, we could extend the OID to include 6 bytes of data
(4 or 8 bits for the type and the rest for the raw object size),
and just say that an OID is a {hash,type,size} tuple.

There are lots of places where we open an object to see what type
it is or how big it is. This requires uncompressing/undeltafying
the object (or at least decoding enough to get the header). In the
case of missing objects (partial clone or a gvfs-like projection)
it requires either dynamically fetching the object or asking an
object-size-server for the data.

All of these cases could be eliminated if the type/size were
available in the OID.

Just a thought. While we are converting to a new hash, it seems
like this would be a good time to at least discuss it.

Jeff
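
For illustration only, here is a rough sketch of what a {hash,type,size}
OID with a 6-byte trailer might look like. Nothing here comes from an
actual patch. It assumes SHA-256 as a stand-in for the not-yet-chosen new
hash, the 8-bit variant for the type field (leaving 40 bits for the
size), and big-endian packing. The function and constant names are made
up for this example; only the type codes (commit=1, tree=2, blob=3,
tag=4) follow git's existing object type numbering.

```python
import hashlib
from typing import NamedTuple

# Type codes follow git's existing object type numbering.
# Everything else in this sketch is hypothetical.
OBJ_TYPES = {"commit": 1, "tree": 2, "blob": 3, "tag": 4}
TYPE_NAMES = {code: name for name, code in OBJ_TYPES.items()}

TRAILER_LEN = 6                   # bytes appended after the hash
SIZE_BITS = 40                    # 48 trailer bits minus 8 bits of type
MAX_SIZE = (1 << SIZE_BITS) - 1   # largest size the trailer can hold


class ExtendedOid(NamedTuple):
    digest: bytes
    obj_type: str
    size: int


def make_extended_oid(obj_type: str, data: bytes) -> bytes:
    """Hash an object the usual way, then append a type/size trailer."""
    if len(data) > MAX_SIZE:
        raise ValueError("object too large for a 40-bit size field")
    # Same "<type> <size>\0<content>" framing git hashes today.
    header = f"{obj_type} {len(data)}".encode() + b"\0"
    digest = hashlib.sha256(header + data).digest()  # stand-in hash
    packed = (OBJ_TYPES[obj_type] << SIZE_BITS) | len(data)
    return digest + packed.to_bytes(TRAILER_LEN, "big")


def parse_extended_oid(oid: bytes) -> ExtendedOid:
    """Recover type and size from the OID without opening the object."""
    digest, trailer = oid[:-TRAILER_LEN], oid[-TRAILER_LEN:]
    packed = int.from_bytes(trailer, "big")
    return ExtendedOid(digest, TYPE_NAMES[packed >> SIZE_BITS],
                       packed & MAX_SIZE)


oid = make_extended_oid("blob", b"hello\n")
info = parse_extended_oid(oid)
print(info.obj_type, info.size, len(oid))
```

One consequence of this particular split: a 40-bit size field tops out
just under 1 TiB, while the 4-bit type variant would leave 44 bits for
the size. The sketch picks the 8-bit variant only for simplicity.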