From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN: AS3215 2.6.0.0/16
X-Spam-Status: No, score=-3.8 required=3.0 tests=AWL,BAYES_00,
	HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,
	SPF_PASS,T_SCC_BODY_TEXT_LINE shortcircuit=no autolearn=ham
	autolearn_force=no version=3.4.2
Received: from out1.vger.email (out1.vger.email [IPv6:2620:137:e000::1:20])
	by dcvr.yhbt.net (Postfix) with ESMTP id 0CFC81F54E
	for ; Tue, 6 Sep 2022 23:05:49 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S229668AbiIFXFq (ORCPT );
	Tue, 6 Sep 2022 19:05:46 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59004 "EHLO
	lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S229498AbiIFXFp (ORCPT );
	Tue, 6 Sep 2022 19:05:45 -0400
Received: from cloud.peff.net (cloud.peff.net [104.130.231.41])
	by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A134320F40
	for ; Tue, 6 Sep 2022 16:05:43 -0700 (PDT)
Received: (qmail 19896 invoked by uid 109); 6 Sep 2022 23:05:43 -0000
Received: from Unknown (HELO peff.net) (10.0.1.2)
	by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Tue, 06 Sep 2022 23:05:43 +0000
Authentication-Results: cloud.peff.net; auth=none
Received: (qmail 6693 invoked by uid 111); 6 Sep 2022 23:05:43 -0000
Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2)
	by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Tue, 06 Sep 2022 19:05:43 -0400
Authentication-Results: peff.net; auth=none
Date: Tue, 6 Sep 2022 19:05:42 -0400
From: Jeff King
To: =?utf-8?B?56iL5rSL?=
Cc: Derrick Stolee , "git@vger.kernel.org" ,
	=?utf-8?B?5L2V5rWp?= , Xin7 Ma =?utf-8?B?6ams6ZGr?= ,
	=?utf-8?B?55+z5aWJ5YW1?= , =?utf-8?B?5Yeh5Yab6L6J?= ,
	=?utf-8?B?546L5rGJ5Z+6?=
Subject: [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want"
	objects
Message-ID:
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
Precedence: bulk
List-ID:
X-Mailing-List: git@vger.kernel.org

Imagine we have a history with commit C pointing to a large blob B. If a
client asks us for C, we can generally serve both objects to them
without accessing the uncompressed contents of B. In upload-pack, we
figure out which commits we have and what the client has, and feed those
tips to pack-objects. In pack-objects, we traverse the commits and trees
(or use bitmaps!) to find the set of objects needed, but we never open
up B. When we serve it to the client, we can often pass the compressed
bytes directly from the on-disk packfile over the wire.

But if a client asks us directly for B, perhaps because they are doing
an on-demand fetch to fill in the missing blob of a partial clone, we
end up much slower. Upload-pack calls parse_object() on the oid we
receive, which opens up the object and re-checks its hash (even though
if it were a commit, we might skip this parse entirely in favor of the
commit graph!). And then we feed the oid directly to pack-objects, which
again calls parse_object() and opens the object. And then finally, when
we write out the result, we may send bytes straight from disk, but only
after having unnecessarily uncompressed and computed the sha1 of the
object twice!

This patch teaches both code paths to use the new SKIP_HASH_CHECK flag
for parse_object(). You can see the speed-up in p5600, which does a
blob:none clone followed by a checkout. The savings for git.git are
modest:

  Test                          HEAD^             HEAD
  ----------------------------------------------------------------------
  5600.3: checkout of result    2.23(4.19+0.24)   1.72(3.79+0.18) -22.9%

But the savings scale with the number of bytes.
So on a repository like linux.git with more files, we see more
improvement (in both absolute and relative numbers):

  Test                          HEAD^               HEAD
  ----------------------------------------------------------------------------
  5600.3: checkout of result    51.62(77.26+2.76)   34.86(61.41+2.63) -32.5%

And here's an even more extreme case. This is the android gradle-plugin
repository, whose tip checkout has ~3.7GB of files:

  Test                          HEAD^               HEAD
  --------------------------------------------------------------------------
  5600.3: checkout of result    79.51(90.84+5.55)   40.28(51.88+5.67) -49.3%

Keep in mind that these timings are of the whole checkout operation. So
they count the client indexing the pack and actually writing out the
files. If we want to see just the server's view, we can hack up the
GIT_TRACE_PACKET output from those operations and replay it via
upload-pack. For the gradle example, that gives me:

  Benchmark 1: GIT_PROTOCOL=version=2 git.old upload-pack ../gradle-plugin

---
 revision.c      |  4 +++-
 t/t1450-fsck.sh | 20 ++++++++++++++++++++
 upload-pack.c   |  3 ++-
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/revision.c b/revision.c
index ee702e498a..786e090785 100644
--- a/revision.c
+++ b/revision.c
@@ -384,7 +384,9 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
 	if (commit)
 		object = &commit->object;
 	else
-		object = parse_object(revs->repo, oid);
+		object = parse_object_with_flags(revs->repo, oid,
+						 revs->verify_objects ? 0 :
+						 PARSE_OBJECT_SKIP_HASH_CHECK);
 
 	if (!object) {
 		if (revs->ignore_missing)
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 53c2aa10b7..6410eff4e0 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -507,6 +507,26 @@ test_expect_success 'rev-list --verify-objects with bad sha1' '
 	test_i18ngrep -q "error: hash mismatch $(dirname $new)$(test_oid ff_2)" out
 '
 
+# An actual bit corruption is more likely than swapped commits, but
+# this provides an easy way to have commits which don't match their purported
+# hashes, but which aren't so broken we can't read them at all.
+test_expect_success 'rev-list --verify-objects notices swapped commits' '
+	git init swapped-commits &&
+	(
+		cd swapped-commits &&
+		test_commit one &&
+		test_commit two &&
+		one_oid=$(git rev-parse HEAD) &&
+		two_oid=$(git rev-parse HEAD^) &&
+		one=.git/objects/$(test_oid_to_path $one_oid) &&
+		two=.git/objects/$(test_oid_to_path $two_oid) &&
+		mv $one tmp &&
+		mv $two $one &&
+		mv tmp $two &&
+		test_must_fail git rev-list --verify-objects HEAD
+	)
+'
+
 test_expect_success 'force fsck to ignore double author' '
 	git cat-file commit HEAD >basis &&
 	sed "s/^author .*/&,&/" multiple-authors &&
diff --git a/upload-pack.c b/upload-pack.c
index b217a1f469..4bacdf087c 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -1420,7 +1420,8 @@ static int parse_want(struct packet_writer *writer, const char *line,
 		if (commit)
 			o = &commit->object;
 		else
-			o = parse_object(the_repository, &oid);
+			o = parse_object_with_flags(the_repository, &oid,
+						    PARSE_OBJECT_SKIP_HASH_CHECK);
 
 		if (!o) {
 			packet_writer_error(writer,
-- 
2.37.3.1134.gfd534b3986
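
For readers who want the trade-off in miniature, here is a rough,
standalone sketch of what a "skip the hash check" flag buys and what it
gives up. It is not git's code: toy_object, toy_hash(),
toy_parse_object_with_flags() and TOY_SKIP_HASH_CHECK are all
hypothetical names invented for this illustration, and FNV-1a stands in
for SHA-1 only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag, loosely modeled on PARSE_OBJECT_SKIP_HASH_CHECK. */
#define TOY_SKIP_HASH_CHECK (1u << 0)

/* A toy "object": the name it is stored under, plus its contents. */
struct toy_object {
	uint32_t name;    /* hash the object claims to have */
	const char *data; /* object contents */
	size_t len;       /* number of content bytes */
};

/* FNV-1a, an assumption standing in for SHA-1; git does not use this. */
uint32_t toy_hash(const char *data, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= (unsigned char)data[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Return the object. Unless TOY_SKIP_HASH_CHECK is set, first re-hash
 * the contents (cost proportional to len) and return NULL when they do
 * not match the name the object is stored under.
 */
const struct toy_object *toy_parse_object_with_flags(
		const struct toy_object *obj, unsigned flags)
{
	assert(obj != NULL);

	if (!(flags & TOY_SKIP_HASH_CHECK) &&
	    toy_hash(obj->data, obj->len) != obj->name)
		return NULL; /* contents do not match the purported name */
	return obj;
}
```

Passing 0 as the flags re-reads every byte and rejects an object whose
contents do not match its name; passing TOY_SKIP_HASH_CHECK returns the
object without touching its contents, so a mismatch goes unnoticed. That
is the same shape as the revision.c hunk above, which keeps the check
only when revs->verify_objects is set.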