From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-4.1 required=3.0 tests=AWL,BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_HI,RP_MATCHES_RCVD shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id BE5CC201A9 for ; Sat, 25 Feb 2017 01:16:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751445AbdBYBQk (ORCPT ); Fri, 24 Feb 2017 20:16:40 -0500 Received: from cloud.peff.net ([104.130.231.41]:33783 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751291AbdBYBQj (ORCPT ); Fri, 24 Feb 2017 20:16:39 -0500 Received: (qmail 28398 invoked by uid 109); 25 Feb 2017 01:16:39 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.84) with SMTP; Sat, 25 Feb 2017 01:16:39 +0000 Received: (qmail 29284 invoked by uid 111); 25 Feb 2017 01:16:42 -0000 Received: from sigill.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.7) by peff.net (qpsmtpd/0.84) with SMTP; Fri, 24 Feb 2017 20:16:42 -0500 Received: by sigill.intra.peff.net (sSMTP sendmail emulation); Fri, 24 Feb 2017 20:16:36 -0500 Date: Fri, 24 Feb 2017 20:16:36 -0500 From: Jeff King To: Linus Torvalds Cc: Junio C Hamano , Ian Jackson , Joey Hess , Git Mailing List Subject: Re: SHA1 collisions found Message-ID: <20170225011636.qjlv2luj3zefmrpz@sigill.intra.peff.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Fri, Feb 24, 2017 at 04:39:45PM -0800, Linus Torvalds wrote: > For example, what I would suggest the rules be is something like this: > > - introduce new tag2/commit2/tree2/blob2 object type tags that imply > that they were hashed using the new hash > > - an old type obviously can never contain a pointer to a new type (ie > you can't have a "tree" object that contains a tree2 object or a blob2 > object. > > - but also make the rule that a *new* type can never contain a > pointer to an old type, with the *very* specific exception that a > commit2 can have a parent that is of type "commit". Yeah, this is exactly what I had in mind. That way everybody in "newhash" mode has no decisions to make. They follow the same rules and it's as if sha1 never existed, except when you follow links in historical objects. > [in reply...] > Actually, I take that back. I think it might be easier to keep > "object->type" as-is, and it would only show the current OBJ_xyz > fields. Then writing the SHA ends up deciding whether a OBJ_COMMIT > gets written as "commit" or "commit2". Yeah, I think there are some data structures with limited bits for the "type" fields (e.g., the pack format). So sticking with OBJ_COMMIT might be nice. For commits and tags, it would be nice to have an "I'm v2" header at the start so there's no confusion about how they are meant to be interpreted. Trees are more difficult, as they don't have any such field. But a valid tree does need to start with a mode, so sticking some non-numeric flag at the front of the object would work (it breaks backwards compatibility, but that's kind of the point). I dunno. Maybe we do not need those markers at all, and could get by purely on object-length, or annotating the headers in some way (like "parent sha256:1234abcd"). It might just be nice if we could very easily identify objects as one type or the other without having to parse them in detail. > So you will end up with duplicate objects, and that's not good (think > of what it does to all our full-tree "diff" optimizations, for example > - you no longer get the "these sub-trees are identical" across a > format change), but realistically you'll have a very limited time of > that kind of duplication. Yeah, cross-flag-day diffs will be more expensive. I think that's something we have to live with. I was thinking originally that the sha1->newhash mapping might solve that, but it only works at the blob level. I.e., you can compare a sha1 and a newhash like: if (!hashcmp(sha1_to_newhash(a), b)) without having to look at the contents. But it doesn't work recursively, because the tree-pointing-to-newhash will have different content. -Peff