From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS3215 2.6.0.0/16 X-Spam-Status: No, score=-3.6 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI, SPF_HELO_NONE,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from out1.vger.email (out1.vger.email [IPv6:2620:137:e000::1:20]) by dcvr.yhbt.net (Postfix) with ESMTP id 683B11F93C for ; Mon, 7 Nov 2022 18:36:27 +0000 (UTC) Authentication-Results: dcvr.yhbt.net; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.b="SQeyBlH+"; dkim-atps=neutral Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233093AbiKGSgL (ORCPT ); Mon, 7 Nov 2022 13:36:11 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49588 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232992AbiKGSgK (ORCPT ); Mon, 7 Nov 2022 13:36:10 -0500 Received: from mail-wr1-x431.google.com (mail-wr1-x431.google.com [IPv6:2a00:1450:4864:20::431]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 63BBA240A3 for ; Mon, 7 Nov 2022 10:36:09 -0800 (PST) Received: by mail-wr1-x431.google.com with SMTP id y16so17507795wrt.12 for ; Mon, 07 Nov 2022 10:36:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:from:to:cc:subject:date :message-id:reply-to; bh=hwC5rdCBH8oAZ28SC1bjINPERkNTBY5/ApyQoQryQpE=; b=SQeyBlH+F3IEKx0DdJhEp0kc6GA7huuXSGyIt9SNC78aCM+Wy+a+qsbCy+WvhGJdPu tvtHlpcKVVSDF0WWxwcqxLsyDmcGQ6VVBm7IOfydwBji5dGPUqzv+jN0t1EahuPmxgK5 7is0ttJnGz1Fnvr3X3WScPtr3fYituL1XNpmUH6GzL8atFE+KJTTC7GmKZBJZ+NZf4/d leqXe26YHBopA9DWwd+aXmK8s9jU8tWQfR4tWgT2qO4KJ7986qXoIssCNPrR93e8viGJ XvHOKRoNCYsa+O1x8Z+jH/fT7Th03Li0HonmAnfL5O/44Hqlk6HjrUoraZTGM4bqU97h +K+A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:mime-version:content-transfer-encoding:fcc:subject:date:from :references:in-reply-to:message-id:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=hwC5rdCBH8oAZ28SC1bjINPERkNTBY5/ApyQoQryQpE=; b=c7anbCaAeTOPEkCm6xExFYr0cz5obHQEgNrjSwLxY2nU9hgl7fv9bRjTVCBSqgaa5K CKnT9eo5kqg0aDXZU4KrjNV5TkpmENou5fXafZkHCi7vG9wINOJkrVa/yEg0fc0VIkWh iE4Ktnf/dM3Cf1hytbQZoItzh0/juC7ol3FtKCN5gXvZ6DZpBA+s9wvaCld1gI2eGeyV Gkj4pSBv+bYfY9NqE93yMfxa8aE9txAGxpHgyGwIAOxxl/cBRfJHnqVwQbGZCLIjBfEi sqfohqcijNrdbgFf//3nhlHF3omzI6EhgwT+JCJlMpI187hu+HC5jSxdSMZewry/6ltO 8ZeQ== X-Gm-Message-State: ACrzQf3en++lWUNx7gMXrRE9gqa2JC7s7eQM2/pWArkeAxbMj/QEeIsg hbpqkK8tsbXbfhV/uXaSfrUnlalXnrw= X-Google-Smtp-Source: AMsMyM7xZUl0ylNMZ41dn/zopx24e4xnTYld6JcNmdql4E2RjwPDG+F4GzPcDrtHwQSfjeTlTu8Bvw== X-Received: by 2002:a5d:52ca:0:b0:236:e9ac:3c3f with SMTP id r10-20020a5d52ca000000b00236e9ac3c3fmr22325891wrv.53.1667846167626; Mon, 07 Nov 2022 10:36:07 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id l7-20020a7bc447000000b003cf7928e731sm8795380wmi.9.2022.11.07.10.36.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 07 Nov 2022 10:36:06 -0800 (PST) Message-Id: <71c76d4ccbe577f82e820fb08fe93e5177177804.1667846164.git.gitgitgadget@gmail.com> In-Reply-To: References: From: "Derrick Stolee via GitGitGadget" Date: Mon, 07 Nov 2022 18:35:35 +0000 Subject: [PATCH 01/30] hashfile: allow skipping the hash function Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: jrnieder@gmail.com, Derrick Stolee , Derrick Stolee Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Derrick Stolee The hashfile API is useful for generating files that include a trailing hash of the file's contents up to that point. Using such a hash is helpful for verifying the file for corruption-at-rest, such as a faulty drive causing flipped bits. Since the commit-graph and multi-pack-index files both use this trailing hash, the chunk-format API uses a 'struct hashfile' to handle the I/O to the file. This was very convenient to allow using the hashfile methods during these operations. However, hashing the file contents during write comes at a performance penalty. It's slower to hash the bytes on their way to the disk than without that step. If we wish to use the chunk-format API to upgrade other file types, then this hashing is a performance penalty that might not be worth the benefit of a trailing hash. For example, if we create a chunk-format version of the packed-refs file, then the file format could shrink by using raw object IDs instead of hexadecimal representations in ASCII. That reduction in size is not enough to counteract the performance penalty of hashing the file contents. In cases such as deleting a reference that appears in the packed-refs file, that write-time performance is critical. This is in contrast to the commit-graph and multi-pack-index files which are mainly updated in non-critical paths such as background maintenance. One way to allow future chunked formats to not suffer this penalty would be to create an abstraction layer around the 'struct hashfile' using a vtable of function pointers. This would allow placing a different representation in place of the hashfile. This option would be cumbersome for a few reasons. First, the hashfile's buffered writes are already highly optimized and would need to be duplicated in another code path. The second is that the chunk-format API calls the chunk_write_fn pointers using a hashfile. If we change that to an abstraction layer, then those that _do_ use the hashfile API would need to change all of their instances of hashwrite(), hashwrite_be32(), and others to use the new abstraction layer. Instead, this change opts for a simpler change. Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the update_fn and final_fn members of the_hash_algo are skipped. When finalizing the hashfile, the trailing hash is replaced with the null hash. This use of a trailing null hash would be desireable in either case, since we do not want to special case a file format to have a different length depending on whether it was hashed or not. When the final bytes of a file are all zero, we can infer that it was written without hashing, and thus that verification is not available as a check for file consistency. This also means that we could easily toggle hashing for any file format we desire. For the commit-graph and multi-pack-index file, it may be possible to allow the null hash without incrementing the file format version, since it technically fits the structure of the file format. The only issue is that older versions would trigger a failure during 'git fsck'. For these file formats, we may want to delay such a change until it is justified. However, the index file is written in critical paths. It is also frequently updated, so corruption at rest is less likely to be an issue than in those other file formats. This could be a good candidate to create an option that skips the hashing operation. A version of this patch has existed in the microsoft/git fork since 2017 [1] (the linked commit was rebased in 2018, but the original dates back to January 2017). Here, the change to make the index use this fast path is delayed until a later change. [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 Co-authored-by: Kevin Willford Signed-off-by: Kevin Willford Signed-off-by: Derrick Stolee --- csum-file.c | 14 +++++++++++--- csum-file.h | 7 +++++++ 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/csum-file.c b/csum-file.c index 59ef3398ca2..3243473c3d7 100644 --- a/csum-file.c +++ b/csum-file.c @@ -45,7 +45,8 @@ void hashflush(struct hashfile *f) unsigned offset = f->offset; if (offset) { - the_hash_algo->update_fn(&f->ctx, f->buffer, offset); + if (!f->skip_hash) + the_hash_algo->update_fn(&f->ctx, f->buffer, offset); flush(f, f->buffer, offset); f->offset = 0; } @@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, int fd; hashflush(f); - the_hash_algo->final_fn(f->buffer, &f->ctx); + + if (f->skip_hash) + memset(f->buffer, 0, the_hash_algo->rawsz); + else + the_hash_algo->final_fn(f->buffer, &f->ctx); + if (result) hashcpy(result, f->buffer); if (flags & CSUM_HASH_IN_STREAM) @@ -108,7 +114,8 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count) * the hashfile's buffer. In this block, * f->offset is necessarily zero. */ - the_hash_algo->update_fn(&f->ctx, buf, nr); + if (!f->skip_hash) + the_hash_algo->update_fn(&f->ctx, buf, nr); flush(f, buf, nr); } else { /* @@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name, f->tp = tp; f->name = name; f->do_crc = 0; + f->skip_hash = 0; the_hash_algo->init_fn(&f->ctx); f->buffer_len = buffer_len; diff --git a/csum-file.h b/csum-file.h index 0d29f528fbc..29468067f81 100644 --- a/csum-file.h +++ b/csum-file.h @@ -20,6 +20,13 @@ struct hashfile { size_t buffer_len; unsigned char *buffer; unsigned char *check_buffer; + + /** + * If set to 1, skip_hash indicates that we should + * not actually compute the hash for this hashfile and + * instead only use it as a buffered write. + */ + unsigned int skip_hash; }; /* Checkpoint */ -- gitgitgadget