* [PATCH 0/4] Optionally skip hashing index on write @ 2022-12-07 17:25 Derrick Stolee via GitGitGadget 2022-12-07 17:25 ` [PATCH 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget ` (5 more replies) 0 siblings, 6 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-07 17:25 UTC (permalink / raw) To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee Writing the index is a critical action that takes place in multiple Git commands. The recent performance improvements available with the sparse index show how often the I/O costs around the index can affect different Git commands, although reading the index takes place more often than a write. The index is written through the hashfile API, both as a buffered write and as a computation of its trailing hash. This trailing hash is an expectation of the file format: several optional index extensions are provided using indicators immediately preceding the trailing hash. 'git fsck' will complain if the trailing hash does not match the contents of the file up to that point. However, computing the hash is expensive. This is even more expensive now that we've moved to sha1dc instead of hardware-accelerated SHA1 computations. This series provides a new config option, index.skipHash, that allows users to disable computing the hash as Git writes the index. In this case, the trailing hash is stored as the null hash (all bytes are zero). The implementation is rather simple, but the patches organize different aspects in a way that we could split things out: * Patch 1 adds a 'skip_hash' option to 'struct hashfile' that creates this as a general option to the hashwrite API. * The motivation in Patch 1 avoids the talk about the chunk-format API and instead focuses on the index as the first example that could make use of this. * Patch 2 creates the index.skipHash config option and updates 'git fsck' to ignore a null hash on the index. This is demonstrated by a very simple test. * Patch 3 creates a test helper to get the trailing hash of a file, and adds a concrete check on the trailing hash when index.skipHash=true. * Patch 4 (which could be deferred until later) adds index.skipHash=true as an implication of feature.manyFiles and as a setting in Scalar. The end result is that index writes speed up significantly, enough that 'git update-index --force-write' improves by 1.75x. Similar patches appeared in my earlier refs RFC [1]. [1] https://lore.kernel.org/git/pull.1408.git.1667846164.gitgitgadget@gmail.com/ Updates since RFC ================= * The config name is now index.skipHash to make it more clear that the default is to not skip the hash, and choosing this option requires a true value. * Use oidread() instead of an ad-hoc loop. * Patches 3 and 4 are new. 3 adds some extra test verification that wasn't in the RFC. * I think that patch 4 is helpful to see now, but I could understand if we wanted to defer that patch until after index.skipHash has had more time to bake. Thanks, -Stolee Derrick Stolee (4): hashfile: allow skipping the hash function read-cache: add index.skipHash config option test-lib-functions: add helper for trailing hash features: feature.manyFiles implies fast index writes Documentation/config/feature.txt | 3 +++ Documentation/config/index.txt | 8 ++++++++ csum-file.c | 14 +++++++++++--- csum-file.h | 7 +++++++ read-cache.c | 15 ++++++++++++++- repo-settings.c | 2 ++ repository.h | 1 + scalar.c | 1 + t/t1600-index.sh | 22 ++++++++++++++++++++++ t/test-lib-functions.sh | 8 ++++++++ 10 files changed, 77 insertions(+), 4 deletions(-) base-commit: 805265fcf7a737664a8321aaf4a0587b78435184 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1439%2Fderrickstolee%2Fhashfile-skip-hash-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1439/derrickstolee/hashfile-skip-hash-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/1439 -- gitgitgadget ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH 1/4] hashfile: allow skipping the hash function 2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget @ 2022-12-07 17:25 ` Derrick Stolee via GitGitGadget 2022-12-07 22:13 ` Ævar Arnfjörð Bjarmason 2022-12-07 17:25 ` [PATCH 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget ` (4 subsequent siblings) 5 siblings, 1 reply; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-07 17:25 UTC (permalink / raw) To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The hashfile API is useful for generating files that include a trailing hash of the file's contents up to that point. Using such a hash is helpful for verifying the file for corruption-at-rest, such as a faulty drive causing flipped bits. Git's index file includes this trailing hash, so it uses a 'struct hashfile' to handle the I/O to the file. This was very convenient to allow using the hashfile methods during these operations. However, hashing the file contents during write comes at a performance penalty. It's slower to hash the bytes on their way to the disk than without that step. This problem is made worse by the replacement of hardware-accelerated SHA1 computations with the software-based sha1dc computation. This write cost is significant, and the checksum capability is likely not worth that cost for such a short-lived file. The index is rewritten frequently and the only time the checksum is checked is during 'git fsck'. Thus, it would be helpful to allow a user to opt-out of the hash computation. We first need to allow Git to opt-out of the hash computation in the hashfile API. The buffered writes of the API are still helpful, so it makes sense to make the change here. Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the update_fn and final_fn members of the_hash_algo are skipped. When finalizing the hashfile, the trailing hash is replaced with the null hash. This use of a trailing null hash would be desireable in either case, since we do not want to special case a file format to have a different length depending on whether it was hashed or not. When the final bytes of a file are all zero, we can infer that it was written without hashing, and thus that verification is not available as a check for file consistency. This also means that we could easily toggle hashing for any file format we desire. A version of this patch has existed in the microsoft/git fork since 2017 [1] (the linked commit was rebased in 2018, but the original dates back to January 2017). Here, the change to make the index use this fast path is delayed until a later change. [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 Co-authored-by: Kevin Willford <kewillf@microsoft.com> Signed-off-by: Kevin Willford <kewillf@microsoft.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- csum-file.c | 14 +++++++++++--- csum-file.h | 7 +++++++ 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/csum-file.c b/csum-file.c index 59ef3398ca2..3243473c3d7 100644 --- a/csum-file.c +++ b/csum-file.c @@ -45,7 +45,8 @@ void hashflush(struct hashfile *f) unsigned offset = f->offset; if (offset) { - the_hash_algo->update_fn(&f->ctx, f->buffer, offset); + if (!f->skip_hash) + the_hash_algo->update_fn(&f->ctx, f->buffer, offset); flush(f, f->buffer, offset); f->offset = 0; } @@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, int fd; hashflush(f); - the_hash_algo->final_fn(f->buffer, &f->ctx); + + if (f->skip_hash) + memset(f->buffer, 0, the_hash_algo->rawsz); + else + the_hash_algo->final_fn(f->buffer, &f->ctx); + if (result) hashcpy(result, f->buffer); if (flags & CSUM_HASH_IN_STREAM) @@ -108,7 +114,8 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count) * the hashfile's buffer. In this block, * f->offset is necessarily zero. */ - the_hash_algo->update_fn(&f->ctx, buf, nr); + if (!f->skip_hash) + the_hash_algo->update_fn(&f->ctx, buf, nr); flush(f, buf, nr); } else { /* @@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name, f->tp = tp; f->name = name; f->do_crc = 0; + f->skip_hash = 0; the_hash_algo->init_fn(&f->ctx); f->buffer_len = buffer_len; diff --git a/csum-file.h b/csum-file.h index 0d29f528fbc..29468067f81 100644 --- a/csum-file.h +++ b/csum-file.h @@ -20,6 +20,13 @@ struct hashfile { size_t buffer_len; unsigned char *buffer; unsigned char *check_buffer; + + /** + * If set to 1, skip_hash indicates that we should + * not actually compute the hash for this hashfile and + * instead only use it as a buffered write. + */ + unsigned int skip_hash; }; /* Checkpoint */ -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH 1/4] hashfile: allow skipping the hash function 2022-12-07 17:25 ` [PATCH 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget @ 2022-12-07 22:13 ` Ævar Arnfjörð Bjarmason 2022-12-08 7:32 ` Jeff King 0 siblings, 1 reply; 59+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-12-07 22:13 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, newren, Derrick Stolee On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <derrickstolee@github.com> > [...] > However, hashing the file contents during write comes at a performance > penalty. It's slower to hash the bytes on their way to the disk than > without that step. This problem is made worse by the replacement of > hardware-accelerated SHA1 computations with the software-based sha1dc > computation. More on that lack of HW accel later... > This write cost is significant Don't you mean hashing cost, or do we also do additional writes if we do the hashing? > , and the checksum capability is likely > not worth that cost for such a short-lived file. The index is rewritten > frequently and the only time the checksum is checked is during 'git > fsck'. Thus, it would be helpful to allow a user to opt-out of the hash > computation. I didn't know that, and had assumed that we at least checked it on the full read (and I found this bit of the commit message after writing the last paragraphs here at the end, so maybe skipping this is fine...). > [...] > @@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, > int fd; > > hashflush(f); > - the_hash_algo->final_fn(f->buffer, &f->ctx); > + > + if (f->skip_hash) > + memset(f->buffer, 0, the_hash_algo->rawsz); Here you're hardcoding a new version of null_oid(), but we can use it instead. Perhaps: diff --git a/csum-file.c b/csum-file.c index 3243473c3d7..b54c4f66cbb 100644 --- a/csum-file.c +++ b/csum-file.c @@ -63,11 +63,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, enum fsync_component component, unsigned int flags) { int fd; + const struct object_id *const noid = null_oid(); hashflush(f); if (f->skip_hash) - memset(f->buffer, 0, the_hash_algo->rawsz); + memcpy(f->buffer, noid, sizeof(*noid)); else the_hash_algo->final_fn(f->buffer, &f->ctx); > @@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name, > f->tp = tp; > f->name = name; > f->do_crc = 0; > + f->skip_hash = 0; > the_hash_algo->init_fn(&f->ctx); > > f->buffer_len = buffer_len; I think I pointed out in the RFC that we'd be much faster with non-sha1collisiondetection, and that maybe this would get us partway to the performance you desired (or maybe we'd think that was a more acceptable trade-off, as it didn't make the format backwards-incompatible). But just from seeing "do_crc" here in the context, did you benchmark it against that? How does it perform? There's no place to put that in the index, but we *could* encode it in the hash, just with a lot of leading zeros. Maybe that would give us some/most of the performance benefits, with the benefit of a checksum? Or maybe not, but I think it's worth exploring & supporting a different & faster SHA-1 implementation before making (even opt-in) backwards incompatible format changes for performance reasons, and if even that's too slow maybe crc32 would be sufficient (but not compatible), but still safer? ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 1/4] hashfile: allow skipping the hash function 2022-12-07 22:13 ` Ævar Arnfjörð Bjarmason @ 2022-12-08 7:32 ` Jeff King 0 siblings, 0 replies; 59+ messages in thread From: Jeff King @ 2022-12-08 7:32 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Derrick Stolee via GitGitGadget, git, gitster, vdye, newren, Derrick Stolee On Wed, Dec 07, 2022 at 11:13:15PM +0100, Ævar Arnfjörð Bjarmason wrote: > > + if (f->skip_hash) > > + memset(f->buffer, 0, the_hash_algo->rawsz); > > Here you're hardcoding a new version of null_oid(), but we can use it > instead. Perhaps: > > diff --git a/csum-file.c b/csum-file.c > index 3243473c3d7..b54c4f66cbb 100644 > --- a/csum-file.c > +++ b/csum-file.c > @@ -63,11 +63,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, > enum fsync_component component, unsigned int flags) > { > int fd; > + const struct object_id *const noid = null_oid(); > > hashflush(f); > > if (f->skip_hash) > - memset(f->buffer, 0, the_hash_algo->rawsz); > + memcpy(f->buffer, noid, sizeof(*noid)); > else > the_hash_algo->final_fn(f->buffer, &f->ctx); I don't think that's quite right; the object_id struct has other stuff in it. You'd probably want: hashcpy(f->buffer, null_oid()); But I think we already have a helper to just do it directly: hashclr(f->buffer); -Peff ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH 2/4] read-cache: add index.skipHash config option 2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget 2022-12-07 17:25 ` [PATCH 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget @ 2022-12-07 17:25 ` Derrick Stolee via GitGitGadget 2022-12-07 18:59 ` Eric Sunshine ` (2 more replies) 2022-12-07 17:25 ` [PATCH 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget ` (3 subsequent siblings) 5 siblings, 3 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-07 17:25 UTC (permalink / raw) To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The previous change allowed skipping the hashing portion of the hashwrite API, using it instead as a buffered write API. Disabling the hashwrite can be particularly helpful when the write operation is in a critical path. One such critical path is the writing of the index. This operation is so critical that the sparse index was created specifically to reduce the size of the index to make these writes (and reads) faster. Following a similar approach to one used in the microsoft/git fork [1], add a new config option (index.skipHash) that allows disabling this hashing during the index write. The cost is that we can no longer validate the contents for corruption-at-rest using the trailing hash. [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 While older Git versions will not recognize the null hash as a special case, the file format itself is still being met in terms of its structure. Using this null hash will still allow Git operations to function across older versions. The one exception is 'git fsck' which checks the hash of the index file. This used to be a check on every index read, but was split out to just the index in a33fc72fe91 (read-cache: force_verify_index_checksum, 2017-04-14). Here, we disable this check if the trailing hash is all zeroes. We add a warning to the config option that this may cause undesirable behavior with older Git versions. As a quick comparison, I tested 'git update-index --force-write' with and without index.computeHash=false on a copy of the Linux kernel repository. Benchmark 1: with hash Time (mean ± σ): 46.3 ms ± 13.8 ms [User: 34.3 ms, System: 11.9 ms] Range (min … max): 34.3 ms … 79.1 ms 82 runs Benchmark 2: without hash Time (mean ± σ): 26.0 ms ± 7.9 ms [User: 11.8 ms, System: 14.2 ms] Range (min … max): 16.3 ms … 42.0 ms 69 runs Summary 'without hash' ran 1.78 ± 0.76 times faster than 'with hash' These performance benefits are substantial enough to allow users the ability to opt-in to this feature, even with the potential confusion with older 'git fsck' versions. It is critical that this test is placed before the test_index_version tests, since those tests obliterate the .git/config file and hence lose the setting from GIT_TEST_DEFAULT_HASH, if set. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- Documentation/config/index.txt | 8 ++++++++ read-cache.c | 14 +++++++++++++- t/t1600-index.sh | 8 ++++++++ 3 files changed, 29 insertions(+), 1 deletion(-) diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt index 75f3a2d1054..3ea0962631d 100644 --- a/Documentation/config/index.txt +++ b/Documentation/config/index.txt @@ -30,3 +30,11 @@ index.version:: Specify the version with which new index files should be initialized. This does not affect existing repositories. If `feature.manyFiles` is enabled, then the default is 4. + +index.skipHash:: + When enabled, do not compute the trailing hash for the index file. + Instead, write a trailing set of bytes with value zero, indicating + that the computation was skipped. ++ +If you enable `index.skipHash`, then older Git clients may report that +your index is corrupt during `git fsck`. diff --git a/read-cache.c b/read-cache.c index 46f5e497b14..fb4d6fb6387 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) git_hash_ctx c; unsigned char hash[GIT_MAX_RAWSZ]; int hdr_version; + unsigned char *start, *end; + struct object_id oid; if (hdr->hdr_signature != htonl(CACHE_SIGNATURE)) return error(_("bad signature 0x%08x"), hdr->hdr_signature); @@ -1827,10 +1829,16 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) if (!verify_index_checksum) return 0; + end = (unsigned char *)hdr + size; + start = end - the_hash_algo->rawsz; + oidread(&oid, start); + if (oideq(&oid, null_oid())) + return 0; + the_hash_algo->init_fn(&c); the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz); the_hash_algo->final_fn(hash, &c); - if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz)) + if (!hasheq(hash, end - the_hash_algo->rawsz)) return error(_("bad index file sha1 signature")); return 0; } @@ -2915,9 +2923,13 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, int ieot_entries = 1; struct index_entry_offset_table *ieot = NULL; int nr, nr_threads; + int skip_hash; f = hashfd(tempfile->fd, tempfile->filename.buf); + if (!git_config_get_maybe_bool("index.skiphash", &skip_hash)) + f->skip_hash = skip_hash; + for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) removed++; diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 010989f90e6..df07c587e0e 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -65,6 +65,14 @@ test_expect_success 'out of bounds index.version issues warning' ' ) ' +test_expect_success 'index.skipHash config option' ' + ( + rm -f .git/index && + git -c index.skipHash=true add a && + git fsck + ) +' + test_index_version () { INDEX_VERSION_CONFIG=$1 && FEATURE_MANY_FILES=$2 && -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH 2/4] read-cache: add index.skipHash config option 2022-12-07 17:25 ` [PATCH 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget @ 2022-12-07 18:59 ` Eric Sunshine 2022-12-12 13:59 ` Derrick Stolee 2022-12-07 22:25 ` Ævar Arnfjörð Bjarmason 2022-12-07 23:06 ` Ævar Arnfjörð Bjarmason 2 siblings, 1 reply; 59+ messages in thread From: Eric Sunshine @ 2022-12-07 18:59 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, avarab, newren, Derrick Stolee On Wed, Dec 7, 2022 at 12:27 PM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > The previous change allowed skipping the hashing portion of the > hashwrite API, using it instead as a buffered write API. Disabling the > hashwrite can be particularly helpful when the write operation is in a > critical path. > > One such critical path is the writing of the index. This operation is so > critical that the sparse index was created specifically to reduce the > size of the index to make these writes (and reads) faster. > > Following a similar approach to one used in the microsoft/git fork [1], > add a new config option (index.skipHash) that allows disabling this > hashing during the index write. The cost is that we can no longer > validate the contents for corruption-at-rest using the trailing hash. > [...] > Signed-off-by: Derrick Stolee <derrickstolee@github.com> > --- > diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt > @@ -30,3 +30,11 @@ index.version:: > +index.skipHash:: > + When enabled, do not compute the trailing hash for the index file. > + Instead, write a trailing set of bytes with value zero, indicating > + that the computation was skipped. > ++ > +If you enable `index.skipHash`, then older Git clients may report that > +your index is corrupt during `git fsck`. This documentation is rather minimal. Given this description, are readers going to understand the purpose of the option, when they should use it, what the impact will be, when and why they should avoid it, etc.? > diff --git a/t/t1600-index.sh b/t/t1600-index.sh > @@ -65,6 +65,14 @@ test_expect_success 'out of bounds index.version issues warning' ' > +test_expect_success 'index.skipHash config option' ' > + ( > + rm -f .git/index && > + git -c index.skipHash=true add a && > + git fsck > + ) > +' What is the purpose of the subshell here? ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/4] read-cache: add index.skipHash config option 2022-12-07 18:59 ` Eric Sunshine @ 2022-12-12 13:59 ` Derrick Stolee 2022-12-12 18:55 ` Eric Sunshine 0 siblings, 1 reply; 59+ messages in thread From: Derrick Stolee @ 2022-12-12 13:59 UTC (permalink / raw) To: Eric Sunshine, Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, avarab, newren On 12/7/2022 1:59 PM, Eric Sunshine wrote: > On Wed, Dec 7, 2022 at 12:27 PM Derrick Stolee via GitGitGadget > <gitgitgadget@gmail.com> wrote: >> +If you enable `index.skipHash`, then older Git clients may report that >> +your index is corrupt during `git fsck`. > > This documentation is rather minimal. Given this description, are > readers going to understand the purpose of the option, when they > should use it, what the impact will be, when and why they should avoid > it, etc.? I will expand this with explicit version numbers for older Git versions. >> diff --git a/t/t1600-index.sh b/t/t1600-index.sh >> @@ -65,6 +65,14 @@ test_expect_success 'out of bounds index.version issues warning' ' >> +test_expect_success 'index.skipHash config option' ' >> + ( >> + rm -f .git/index && >> + git -c index.skipHash=true add a && >> + git fsck >> + ) >> +' > > What is the purpose of the subshell here? I was matching the style of the nearby tests, but they are all modifying GIT_INDEX_VERSION, which isn't necessary here. Thanks, -Stolee ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/4] read-cache: add index.skipHash config option 2022-12-12 13:59 ` Derrick Stolee @ 2022-12-12 18:55 ` Eric Sunshine 0 siblings, 0 replies; 59+ messages in thread From: Eric Sunshine @ 2022-12-12 18:55 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee via GitGitGadget, git, gitster, vdye, avarab, newren On Mon, Dec 12, 2022 at 8:59 AM Derrick Stolee <derrickstolee@github.com> wrote: > On 12/7/2022 1:59 PM, Eric Sunshine wrote: > > On Wed, Dec 7, 2022 at 12:27 PM Derrick Stolee via GitGitGadget > > <gitgitgadget@gmail.com> wrote: > >> +index.skipHash:: > >> + When enabled, do not compute the trailing hash for the index file. > >> + Instead, write a trailing set of bytes with value zero, indicating > >> + that the computation was skipped. > >> ++ > >> +If you enable `index.skipHash`, then older Git clients may report that > >> +your index is corrupt during `git fsck`. > > > > This documentation is rather minimal. Given this description, are > > readers going to understand the purpose of the option, when they > > should use it, what the impact will be, when and why they should avoid > > it, etc.? > > I will expand this with explicit version numbers for older Git versions. Okay, but that doesn't address the larger questions I asked. The documentation, as written, gives no explanation of the purpose of this option. Since you conceived of the option and implemented it, you implicitly understand its use-case and repercussions which might arise from using it, but is the typical reader going to understand all that? Namely, is the reader going to understand: * why this option exists * what problem it is trying to solve * when to use it * when not to use it * what the repercussions are of not computing a hash for the index * etc. Are the answers to those questions documented somewhere? If so, then the documentation for this option should link to that discussion (and vice-versa). If not, then those questions should be answered by this documentation. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/4] read-cache: add index.skipHash config option 2022-12-07 17:25 ` [PATCH 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget 2022-12-07 18:59 ` Eric Sunshine @ 2022-12-07 22:25 ` Ævar Arnfjörð Bjarmason 2022-12-07 23:06 ` Ævar Arnfjörð Bjarmason 2 siblings, 0 replies; 59+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-12-07 22:25 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, newren, Derrick Stolee On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <derrickstolee@github.com> > [...] > diff --git a/read-cache.c b/read-cache.c > index 46f5e497b14..fb4d6fb6387 100644 > --- a/read-cache.c > +++ b/read-cache.c > @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) > git_hash_ctx c; > unsigned char hash[GIT_MAX_RAWSZ]; > int hdr_version; > + unsigned char *start, *end; > + struct object_id oid; > > if (hdr->hdr_signature != htonl(CACHE_SIGNATURE)) > return error(_("bad signature 0x%08x"), hdr->hdr_signature); > @@ -1827,10 +1829,16 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) > if (!verify_index_checksum) > return 0; > > + end = (unsigned char *)hdr + size; > + start = end - the_hash_algo->rawsz; > + oidread(&oid, start); > + if (oideq(&oid, null_oid())) > + return 0; It's good to see this use the existing hashing support, as I suggested in the RFC comments. Glad it helped. > int ieot_entries = 1; > struct index_entry_offset_table *ieot = NULL; > int nr, nr_threads; > + int skip_hash; You don't need this variable. > > f = hashfd(tempfile->fd, tempfile->filename.buf); > > + if (!git_config_get_maybe_bool("index.skiphash", &skip_hash)) > + f->skip_hash = skip_hash; Because this can just be: git_config_get_maybe_bool("index.skiphash", &f->skip_hash); I.e. the config API guarantees that it won't touch the variable if the key doesn't exist, so no need for the intermediate variable. In a later commit you convert this to that very API use when moving this to the repo-settings. Can we maybe skip straight to that step? > +test_expect_success 'index.skipHash config option' ' > + ( > + rm -f .git/index && > + git -c index.skipHash=true add a && > + git fsck > + ) Why the subshell? ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/4] read-cache: add index.skipHash config option 2022-12-07 17:25 ` [PATCH 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget 2022-12-07 18:59 ` Eric Sunshine 2022-12-07 22:25 ` Ævar Arnfjörð Bjarmason @ 2022-12-07 23:06 ` Ævar Arnfjörð Bjarmason 2022-12-08 0:05 ` Junio C Hamano 2022-12-12 14:05 ` Derrick Stolee 2 siblings, 2 replies; 59+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-12-07 23:06 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, newren, Derrick Stolee On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <derrickstolee@github.com> > [...] > While older Git versions will not recognize the null hash as a special > case, the file format itself is still being met in terms of its > structure. Using this null hash will still allow Git operations to > function across older versions. That's good news, but... > The one exception is 'git fsck' which checks the hash of the index file. > This used to be a check on every index read, but was split out to just > the index in a33fc72fe91 (read-cache: force_verify_index_checksum, > 2017-04-14). ...uh, what? Is there an implied claim here that versions before v2.13.0 don't count as "older versions"? I.e. doesn't v2.12.0 hard fail the verification for all index writing? It's only after v2.13.0 that we do it only for the fsck. That seems like a rather significant caveat that we should be noting prominently in the docs added in 4/4. > As a quick comparison, I tested 'git update-index --force-write' with > and without index.computeHash=false on a copy of the Linux kernel > repository. It took me a bit to see why I was failing to reproduce this, before finding that it's because you mention index.computeHash here, but it's index.skipHash now. > > Benchmark 1: with hash > Time (mean ± σ): 46.3 ms ± 13.8 ms [User: 34.3 ms, System: 11.9 ms] > Range (min … max): 34.3 ms … 79.1 ms 82 runs > > Benchmark 2: without hash > Time (mean ± σ): 26.0 ms ± 7.9 ms [User: 11.8 ms, System: 14.2 ms] > Range (min … max): 16.3 ms … 42.0 ms 69 runs > > Summary > 'without hash' ran > 1.78 ± 0.76 times faster than 'with hash' I suggested in https://lore.kernel.org/git/221207.868rjiam86.gmgdl@evledraar.gmail.com/ earlier to benchmark this against not-sha1collisiondetection. I don't think I get HW-accelerated SHA-1 on that box with OPENSSL (how do I check...?). The results on my main development box are: $ hyperfine -L g sha1dc,openssl -L v false,true -n '{g} with index.skipHash={v}' './git.{g} -c index.skipHash={v} -C /dev/shm/linux-mem update-index --force-write' -w 5 -r 10 Benchmark 1: sha1dc with index.skipHash=false Time (mean ± σ): 37.0 ms ± 2.3 ms [User: 30.8 ms, System: 6.0 ms] Range (min … max): 35.1 ms … 41.4 ms 10 runs Benchmark 2: openssl with index.skipHash=false Time (mean ± σ): 21.5 ms ± 0.4 ms [User: 15.0 ms, System: 6.3 ms] Range (min … max): 20.7 ms … 22.0 ms 10 runs Benchmark 3: sha1dc with index.skipHash=true Time (mean ± σ): 13.5 ms ± 0.4 ms [User: 7.9 ms, System: 5.4 ms] Range (min … max): 13.0 ms … 14.2 ms 10 runs Benchmark 4: openssl with index.skipHash=true Time (mean ± σ): 14.2 ms ± 0.3 ms [User: 9.5 ms, System: 4.6 ms] Range (min … max): 13.6 ms … 14.6 ms 10 runs Summary 'sha1dc with index.skipHash=true' ran 1.05 ± 0.04 times faster than 'openssl with index.skipHash=true' 1.60 ± 0.05 times faster than 'openssl with index.skipHash=false' 2.75 ± 0.19 times faster than 'sha1dc with index.skipHash=false' So, curiously it's proportionally much slower for me with the hash checking, and skipping it with openssl is on the order of the results you see. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/4] read-cache: add index.skipHash config option 2022-12-07 23:06 ` Ævar Arnfjörð Bjarmason @ 2022-12-08 0:05 ` Junio C Hamano 2022-12-12 14:05 ` Derrick Stolee 1 sibling, 0 replies; 59+ messages in thread From: Junio C Hamano @ 2022-12-08 0:05 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Derrick Stolee via GitGitGadget, git, vdye, newren, Derrick Stolee Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > Is there an implied claim here that versions before v2.13.0 don't count > as "older versions"? > > I.e. doesn't v2.12.0 hard fail the verification for all index writing? > It's only after v2.13.0 that we do it only for the fsck. > > That seems like a rather significant caveat that we should be noting > prominently in the docs added in 4/4. True enough. It seems that we only did security releases from 2.30.x track and upwards for the past year or so, and anything older may not matter anymore. Documenting it should be sufficient. I actually was wondering what impact it would have if we made this change unconditionally. Thanks. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/4] read-cache: add index.skipHash config option 2022-12-07 23:06 ` Ævar Arnfjörð Bjarmason 2022-12-08 0:05 ` Junio C Hamano @ 2022-12-12 14:05 ` Derrick Stolee 2022-12-12 18:01 ` Ævar Arnfjörð Bjarmason 1 sibling, 1 reply; 59+ messages in thread From: Derrick Stolee @ 2022-12-12 14:05 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, newren On 12/7/2022 6:06 PM, Ævar Arnfjörð Bjarmason wrote: > > On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote: > >> From: Derrick Stolee <derrickstolee@github.com> >> [...] >> While older Git versions will not recognize the null hash as a special >> case, the file format itself is still being met in terms of its >> structure. Using this null hash will still allow Git operations to >> function across older versions. > > That's good news, but... > >> The one exception is 'git fsck' which checks the hash of the index file. >> This used to be a check on every index read, but was split out to just >> the index in a33fc72fe91 (read-cache: force_verify_index_checksum, >> 2017-04-14). > > ...uh, what? > > Is there an implied claim here that versions before v2.13.0 don't count > as "older versions"? > > I.e. doesn't v2.12.0 hard fail the verification for all index writing? > It's only after v2.13.0 that we do it only for the fsck. > > That seems like a rather significant caveat that we should be noting > prominently in the docs added in 4/4. I can add those details. >> As a quick comparison, I tested 'git update-index --force-write' with >> and without index.computeHash=false on a copy of the Linux kernel >> repository. > > It took me a bit to see why I was failing to reproduce this, before > finding that it's because you mention index.computeHash here, but it's > index.skipHash now. >> >> Benchmark 1: with hash >> Time (mean ± σ): 46.3 ms ± 13.8 ms [User: 34.3 ms, System: 11.9 ms] >> Range (min … max): 34.3 ms … 79.1 ms 82 runs >> >> Benchmark 2: without hash >> Time (mean ± σ): 26.0 ms ± 7.9 ms [User: 11.8 ms, System: 14.2 ms] >> Range (min … max): 16.3 ms … 42.0 ms 69 runs >> >> Summary >> 'without hash' ran >> 1.78 ± 0.76 times faster than 'with hash' > > I suggested in > https://lore.kernel.org/git/221207.868rjiam86.gmgdl@evledraar.gmail.com/ > earlier to benchmark this against not-sha1collisiondetection. Generally, I'm avoiding that benchmark because sha1dc is here to stay. If users want to go through the trouble of compiling to use the non-dc version, then I would expect the difference to be less noticeable, but still significant. However, I would strongly avoid considering compiling both into the client by default, letting certain paths use sha1dc and others using non-dc. Certain secure environments currently only use Git under exceptions that allow SHA1 for "non-cryptographic" reasons, but also with the understanding that sha1dc is used as a safety measure. Adding the non-dc version back in would put that understanding at risk. Thanks, -Stolee ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 2/4] read-cache: add index.skipHash config option 2022-12-12 14:05 ` Derrick Stolee @ 2022-12-12 18:01 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 59+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-12-12 18:01 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee via GitGitGadget, git, gitster, vdye, newren On Mon, Dec 12 2022, Derrick Stolee wrote: > On 12/7/2022 6:06 PM, Ævar Arnfjörð Bjarmason wrote: >> >> On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote: >> >>> From: Derrick Stolee <derrickstolee@github.com> >>> [...] >>> While older Git versions will not recognize the null hash as a special >>> case, the file format itself is still being met in terms of its >>> structure. Using this null hash will still allow Git operations to >>> function across older versions. >> >> That's good news, but... >> >>> The one exception is 'git fsck' which checks the hash of the index file. >>> This used to be a check on every index read, but was split out to just >>> the index in a33fc72fe91 (read-cache: force_verify_index_checksum, >>> 2017-04-14). >> >> ...uh, what? >> >> Is there an implied claim here that versions before v2.13.0 don't count >> as "older versions"? >> >> I.e. doesn't v2.12.0 hard fail the verification for all index writing? >> It's only after v2.13.0 that we do it only for the fsck. >> >> That seems like a rather significant caveat that we should be noting >> prominently in the docs added in 4/4. > > I can add those details. > >>> As a quick comparison, I tested 'git update-index --force-write' with >>> and without index.computeHash=false on a copy of the Linux kernel >>> repository. >> >> It took me a bit to see why I was failing to reproduce this, before >> finding that it's because you mention index.computeHash here, but it's >> index.skipHash now. >>> >>> Benchmark 1: with hash >>> Time (mean ± σ): 46.3 ms ± 13.8 ms [User: 34.3 ms, System: 11.9 ms] >>> Range (min … max): 34.3 ms … 79.1 ms 82 runs >>> >>> Benchmark 2: without hash >>> Time (mean ± σ): 26.0 ms ± 7.9 ms [User: 11.8 ms, System: 14.2 ms] >>> Range (min … max): 16.3 ms … 42.0 ms 69 runs >>> >>> Summary >>> 'without hash' ran >>> 1.78 ± 0.76 times faster than 'with hash' >> >> I suggested in >> https://lore.kernel.org/git/221207.868rjiam86.gmgdl@evledraar.gmail.com/ >> earlier to benchmark this against not-sha1collisiondetection. > > Generally, I'm avoiding that benchmark because sha1dc is here to stay. > > If users want to go through the trouble of compiling to use the non-dc > version, then I would expect the difference to be less noticeable, but > still significant. However, I would strongly avoid considering compiling > both into the client by default, letting certain paths use sha1dc and > others using non-dc. Certain secure environments currently only use Git > under exceptions that allow SHA1 for "non-cryptographic" reasons, but > also with the understanding that sha1dc is used as a safety measure. > Adding the non-dc version back in would put that understanding at risk. Doesn't using a checksum for our own index count as a "non-cryptographic reason"? I.e. we control the .git/index file, and the context is that we're checking if bytes we wrote to disk are corrupt since we last saw them. Even if hypothetically an attacker could craft files to go into the index (knowing our envelope) in such a way as to craft a collision between that index file and some other index file I don't see how that would give the attacker anything. We'd still have a valid index, and we'd probably be replacing that crafted index with a new one anyway. I understand that some organizations have SHA-1 on some naughty list, and using it again in non-SHA1DC contexts might trigger some audit. So it wouldn't be something for everyone, and it's orthagonal to the benefits of a new ref format or index format. But if we're considering new formats, I think it's worth considering a non-format change which doesn't get us all of the way of no checksumming, but more than halfway there. Maybe we'll still want the "don't do any checksumming", but maybe some would find that enough (particularly if SHA-1 HW acceleration is available). ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH 3/4] test-lib-functions: add helper for trailing hash 2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget 2022-12-07 17:25 ` [PATCH 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget 2022-12-07 17:25 ` [PATCH 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget @ 2022-12-07 17:25 ` Derrick Stolee via GitGitGadget 2022-12-07 22:27 ` Ævar Arnfjörð Bjarmason 2022-12-07 17:25 ` [PATCH 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget ` (2 subsequent siblings) 5 siblings, 1 reply; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-07 17:25 UTC (permalink / raw) To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> It can be helpful to check that a file format with a trailing hash has a specific hash in the final bytes of a written file. This is made more apparent by recent changes that allow skipping the hash algorithm and writing a null hash at the end of the file instead. Add a new test_trailing_hash helper and use it in t1600 to verify that index.skipHash=true really does skip the hash computation, since 'git fsck' does not actually verify the hash. Keep the 'git fsck' call to ensure that any potential future change to check the index hash does not cause an error in this case. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- t/t1600-index.sh | 3 +++ t/test-lib-functions.sh | 8 ++++++++ 2 files changed, 11 insertions(+) diff --git a/t/t1600-index.sh b/t/t1600-index.sh index df07c587e0e..55816756607 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -69,6 +69,9 @@ test_expect_success 'index.skipHash config option' ' ( rm -f .git/index && git -c index.skipHash=true add a && + test_trailing_hash .git/index >hash && + echo $(test_oid zero) >expect && + test_cmp expect hash && git fsck ) ' diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh index 796093a7b32..e88acfdb68a 100644 --- a/t/test-lib-functions.sh +++ b/t/test-lib-functions.sh @@ -1875,3 +1875,11 @@ test_cmp_config_output () { sort config-actual >sorted-actual && test_cmp sorted-expect sorted-actual } + +# Given a filename, extract its trailing hash as a hex string +test_trailing_hash () { + local file="$1" && + tail -c $(test_oid rawsz) "$file" | \ + test-tool hexdump | \ + sed "s/ //g" +} -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH 3/4] test-lib-functions: add helper for trailing hash 2022-12-07 17:25 ` [PATCH 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget @ 2022-12-07 22:27 ` Ævar Arnfjörð Bjarmason 2022-12-12 14:10 ` Derrick Stolee 0 siblings, 1 reply; 59+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-12-07 22:27 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, newren, Derrick Stolee On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <derrickstolee@github.com> > [...] > +# Given a filename, extract its trailing hash as a hex string > +test_trailing_hash () { > + local file="$1" && > + tail -c $(test_oid rawsz) "$file" | \ > + test-tool hexdump | \ You don't need the "\"'s here. > + sed "s/ //g" > +} At the end of this series we have one test file using this test_trailing_hash, can't we just add it to t1600-index.sh? ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 3/4] test-lib-functions: add helper for trailing hash 2022-12-07 22:27 ` Ævar Arnfjörð Bjarmason @ 2022-12-12 14:10 ` Derrick Stolee 0 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee @ 2022-12-12 14:10 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, newren On 12/7/2022 5:27 PM, Ævar Arnfjörð Bjarmason wrote: > > On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote: > >> From: Derrick Stolee <derrickstolee@github.com> >> [...] >> +# Given a filename, extract its trailing hash as a hex string >> +test_trailing_hash () { >> + local file="$1" && >> + tail -c $(test_oid rawsz) "$file" | \ >> + test-tool hexdump | \ > > You don't need the "\"'s here. Thanks. >> + sed "s/ //g" >> +} > > At the end of this series we have one test file using this > test_trailing_hash, can't we just add it to t1600-index.sh? I imagine that it would be helpful to access the trailing hash for testing other file formats in the future. It certainly could have helped the incremental commit-graph work, so we could check that the filenames do match the trailing hashes. Thanks, -Stolee ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH 4/4] features: feature.manyFiles implies fast index writes 2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget ` (2 preceding siblings ...) 2022-12-07 17:25 ` [PATCH 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget @ 2022-12-07 17:25 ` Derrick Stolee via GitGitGadget 2022-12-07 22:30 ` Ævar Arnfjörð Bjarmason 2022-12-07 23:27 ` [PATCH 0/4] Optionally skip hashing index on write Junio C Hamano 2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget 5 siblings, 1 reply; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-07 17:25 UTC (permalink / raw) To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The recent addition of the index.skipHash config option allows index writes to speed up by skipping the hash computation for the trailing checksum. This is particularly critical for repositories with many files at HEAD, so add this config option to two cases where users in that scenario may opt-in to such behavior: 1. The feature.manyFiles config option enables some options that are helpful for repositories with many files at HEAD. 2. 'scalar register' and 'scalar reconfigure' set config options that optimize for large repositories. In both of these cases, set index.skipHash=true to gain this speedup. Add tests that demonstrate the proper way that index.skipHash=true can override feature.manyFiles=true. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- Documentation/config/feature.txt | 3 +++ read-cache.c | 7 ++++--- repo-settings.c | 2 ++ repository.h | 1 + scalar.c | 1 + t/t1600-index.sh | 13 ++++++++++++- 6 files changed, 23 insertions(+), 4 deletions(-) diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt index 95975e50912..f0e1d4cb2be 100644 --- a/Documentation/config/feature.txt +++ b/Documentation/config/feature.txt @@ -23,6 +23,9 @@ feature.manyFiles:: working directory. With many files, commands such as `git status` and `git checkout` may be slow and these new defaults improve performance: + +* `index.skipHash=true` speeds up index writes by not computing a trailing + checksum. ++ * `index.version=4` enables path-prefix compression in the index. + * `core.untrackedCache=true` enables the untracked cache. This setting assumes diff --git a/read-cache.c b/read-cache.c index fb4d6fb6387..1844953fba7 100644 --- a/read-cache.c +++ b/read-cache.c @@ -2923,12 +2923,13 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, int ieot_entries = 1; struct index_entry_offset_table *ieot = NULL; int nr, nr_threads; - int skip_hash; f = hashfd(tempfile->fd, tempfile->filename.buf); - if (!git_config_get_maybe_bool("index.skiphash", &skip_hash)) - f->skip_hash = skip_hash; + if (istate->repo) { + prepare_repo_settings(istate->repo); + f->skip_hash = istate->repo->settings.index_skip_hash; + } for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) diff --git a/repo-settings.c b/repo-settings.c index 3021921c53d..3dbd3f0e2ec 100644 --- a/repo-settings.c +++ b/repo-settings.c @@ -47,6 +47,7 @@ void prepare_repo_settings(struct repository *r) } if (manyfiles) { r->settings.index_version = 4; + r->settings.index_skip_hash = 1; r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE; } @@ -61,6 +62,7 @@ void prepare_repo_settings(struct repository *r) repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1); repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1); repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0); + repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash); /* * The GIT_TEST_MULTI_PACK_INDEX variable is special in that diff --git a/repository.h b/repository.h index 6c461c5b9de..e8c67ffe165 100644 --- a/repository.h +++ b/repository.h @@ -42,6 +42,7 @@ struct repo_settings { struct fsmonitor_settings *fsmonitor; /* lazily loaded */ int index_version; + int index_skip_hash; enum untracked_cache_setting core_untracked_cache; int pack_use_sparse; diff --git a/scalar.c b/scalar.c index 6c52243cdf1..b49bb8c24ec 100644 --- a/scalar.c +++ b/scalar.c @@ -143,6 +143,7 @@ static int set_recommended_config(int reconfigure) { "credential.validate", "false", 1 }, /* GCM4W-only */ { "gc.auto", "0", 1 }, { "gui.GCWarning", "false", 1 }, + { "index.skipHash", "false", 1 }, { "index.threads", "true", 1 }, { "index.version", "4", 1 }, { "merge.stat", "false", 1 }, diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 55816756607..be0a0a8a008 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -72,7 +72,18 @@ test_expect_success 'index.skipHash config option' ' test_trailing_hash .git/index >hash && echo $(test_oid zero) >expect && test_cmp expect hash && - git fsck + git fsck && + + rm -f .git/index && + git -c feature.manyFiles=true add a && + test_trailing_hash .git/index >hash && + test_cmp expect hash && + + rm -f .git/index && + git -c feature.manyFiles=true \ + -c index.skipHash=false add a && + test_trailing_hash .git/index >hash && + ! test_cmp expect hash ) ' -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH 4/4] features: feature.manyFiles implies fast index writes 2022-12-07 17:25 ` [PATCH 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget @ 2022-12-07 22:30 ` Ævar Arnfjörð Bjarmason 2022-12-12 14:18 ` Derrick Stolee 0 siblings, 1 reply; 59+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-12-07 22:30 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, newren, Derrick Stolee On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <derrickstolee@github.com> > [...] > diff --git a/read-cache.c b/read-cache.c > index fb4d6fb6387..1844953fba7 100644 > --- a/read-cache.c > +++ b/read-cache.c > @@ -2923,12 +2923,13 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, > int ieot_entries = 1; > struct index_entry_offset_table *ieot = NULL; > int nr, nr_threads; > - int skip_hash; > > f = hashfd(tempfile->fd, tempfile->filename.buf); > > - if (!git_config_get_maybe_bool("index.skiphash", &skip_hash)) > - f->skip_hash = skip_hash; > + if (istate->repo) { > + prepare_repo_settings(istate->repo); > + f->skip_hash = istate->repo->settings.index_skip_hash; > + } Urm, are we ever going to find ourselves in a situation where: * We have read the settings for the_repository * We have an index we're about to write out as our "main index", but the istate->repo *isn't* the_repository. * Even then, wouldn't the two copies of the repos have read the same repo settings? But maybe there's a really obvious submodule / worktree / whatever edge case I'm missing. But if not, shouldn't we just always read/write this from the_repository? > + rm -f .git/index && > + git -c feature.manyFiles=true \ > + -c index.skipHash=false add a && > + test_trailing_hash .git/index >hash && > + ! test_cmp expect hash We had a parallel thread where we discussed "! test_cmp" being an anti-pattern, i.e. you want them not to be the same, but you want it to still show a diff, Maybe just "! cmp" ? I.e. either the diff will be meaningless, or we really should be asserting the actual value we want, not what it shouldn't be. so in this case, shouldn't we assert that it's the 0000... value, or the actual hash (depending on which way around we're testing this)? ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 4/4] features: feature.manyFiles implies fast index writes 2022-12-07 22:30 ` Ævar Arnfjörð Bjarmason @ 2022-12-12 14:18 ` Derrick Stolee 2022-12-12 18:27 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 59+ messages in thread From: Derrick Stolee @ 2022-12-12 14:18 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, newren On 12/7/2022 5:30 PM, Ævar Arnfjörð Bjarmason wrote: > > On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote: > >> From: Derrick Stolee <derrickstolee@github.com> >> [...] >> diff --git a/read-cache.c b/read-cache.c >> index fb4d6fb6387..1844953fba7 100644 >> --- a/read-cache.c >> +++ b/read-cache.c >> @@ -2923,12 +2923,13 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, >> int ieot_entries = 1; >> struct index_entry_offset_table *ieot = NULL; >> int nr, nr_threads; >> - int skip_hash; >> >> f = hashfd(tempfile->fd, tempfile->filename.buf); >> >> - if (!git_config_get_maybe_bool("index.skiphash", &skip_hash)) >> - f->skip_hash = skip_hash; >> + if (istate->repo) { >> + prepare_repo_settings(istate->repo); >> + f->skip_hash = istate->repo->settings.index_skip_hash; >> + } > > Urm, are we ever going to find ourselves in a situation where: > > * We have read the settings for the_repository > * We have an index we're about to write out as our "main index", but > the istate->repo *isn't* the_repository. > * Even then, wouldn't the two copies of the repos have read the same > repo settings? > > But maybe there's a really obvious submodule / worktree / whatever edge > case I'm missing. > > But if not, shouldn't we just always read/write this from > the_repository? I don't understand your concern. We call prepare_repo_settings(istate->repo) just before using these settings, so we are using whatever repository-local config we have available to us. If you're thinking that we could be writing an index but istate->repo is somehow the "wrong" repo, then that is a larger problem. This patch is doing the best thing it can with the information it is given. >> + rm -f .git/index && >> + git -c feature.manyFiles=true \ >> + -c index.skipHash=false add a && >> + test_trailing_hash .git/index >hash && >> + ! test_cmp expect hash > > We had a parallel thread where we discussed "! test_cmp" being an > anti-pattern, i.e. you want them not to be the same, but you want it to > still show a diff, Maybe just "! cmp" ? I couldn't tell from this sentence whether test_cmp or cmp would show the diff, but from testing I see that test_cmp shows the diff (for debugging purposes, I'm sure) while cmp shows the position of the first difference. "! cmp" would work here, since we don't care about what the real hash is. > I.e. either the diff will be meaningless, or we really should be > asserting the actual value we want, not what it shouldn't be. > > so in this case, shouldn't we assert that it's the 0000... value, or the > actual hash (depending on which way around we're testing this)? When it should be the null hash, we assert that it is that value. When it isn't, we do not assert the exact hash because we do not want other modifications to the index (or surrounding tests) to cause that hash to change, causing toil for future contributors. "! cmp" suffices for this case to show that the config inheritance is working correctly. Thanks, -Stolee ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 4/4] features: feature.manyFiles implies fast index writes 2022-12-12 14:18 ` Derrick Stolee @ 2022-12-12 18:27 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 59+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-12-12 18:27 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee via GitGitGadget, git, gitster, vdye, newren On Mon, Dec 12 2022, Derrick Stolee wrote: > On 12/7/2022 5:30 PM, Ævar Arnfjörð Bjarmason wrote: >> >> On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote: >> >>> From: Derrick Stolee <derrickstolee@github.com> >>> [...] >>> diff --git a/read-cache.c b/read-cache.c >>> index fb4d6fb6387..1844953fba7 100644 >>> --- a/read-cache.c >>> +++ b/read-cache.c >>> @@ -2923,12 +2923,13 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, >>> int ieot_entries = 1; >>> struct index_entry_offset_table *ieot = NULL; >>> int nr, nr_threads; >>> - int skip_hash; >>> >>> f = hashfd(tempfile->fd, tempfile->filename.buf); >>> >>> - if (!git_config_get_maybe_bool("index.skiphash", &skip_hash)) >>> - f->skip_hash = skip_hash; >>> + if (istate->repo) { >>> + prepare_repo_settings(istate->repo); >>> + f->skip_hash = istate->repo->settings.index_skip_hash; >>> + } >> >> Urm, are we ever going to find ourselves in a situation where: >> >> * We have read the settings for the_repository >> * We have an index we're about to write out as our "main index", but >> the istate->repo *isn't* the_repository. >> * Even then, wouldn't the two copies of the repos have read the same >> repo settings? >> >> But maybe there's a really obvious submodule / worktree / whatever edge >> case I'm missing. >> >> But if not, shouldn't we just always read/write this from >> the_repository? > > I don't understand your concern. We call prepare_repo_settings(istate->repo) > just before using these settings, so we are using whatever repository-local > config we have available to us. > > If you're thinking that we could be writing an index but istate->repo is > somehow the "wrong" repo, then that is a larger problem. This patch is > doing the best thing it can with the information it is given. It's not a concern, just confusion :) In the preceding step (and this is still the case in your v2) we used git_config_get_maybe_bool(), if we meant to use istate->repo shouldn't we have used repo_config_get_maybe_bool() to begin with? And will we ever get !istate->repo? If not should we BUG() here? Otherwise the 4/4 changes this to a state where we'll no longer read the index.skipHash setting if that "repo" is NULL, but our previous the_repository was non-NULL... ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 0/4] Optionally skip hashing index on write 2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget ` (3 preceding siblings ...) 2022-12-07 17:25 ` [PATCH 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget @ 2022-12-07 23:27 ` Junio C Hamano 2022-12-07 23:42 ` Ævar Arnfjörð Bjarmason 2022-12-08 16:38 ` Derrick Stolee 2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget 5 siblings, 2 replies; 59+ messages in thread From: Junio C Hamano @ 2022-12-07 23:27 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget; +Cc: git, vdye, avarab, newren, Derrick Stolee "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > Writing the index is a critical action that takes place in multiple Git > commands. The recent performance improvements available with the sparse > index show how often the I/O costs around the index can affect different Git > commands, although reading the index takes place more often than a write. The sparse-index work is great in that it offers correctness while taking advantage of the knowledge of which part of the tree is quiescent and unused to boost performance. I am not sure a change to reduce file safety can be compared with it, in that one is pure improvement, while the other is trade-off. As long as we will keep the "create into a new file, write it fully and fsync + rename to the final" pattern, we do not need the trailing checksum to protect us from a truncated output due to index-writing process dying in the middle, so I do not mind that trade-off, though. Protecting files from bit flipping filesystem corruption is a different matter. Folks at hosting sites like GitHub would know how often they detect object corruption (I presume they do not have to deal with the index file on the server end that often, but loose and pack object files have the trailing checksums the same way) thanks to the trailing checksum, and what the consequences are if we lost that safety (I am guessing it would be minimum, though). Thanks. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 0/4] Optionally skip hashing index on write 2022-12-07 23:27 ` [PATCH 0/4] Optionally skip hashing index on write Junio C Hamano @ 2022-12-07 23:42 ` Ævar Arnfjörð Bjarmason 2022-12-08 16:38 ` Derrick Stolee 1 sibling, 0 replies; 59+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-12-07 23:42 UTC (permalink / raw) To: Junio C Hamano Cc: Derrick Stolee via GitGitGadget, git, vdye, newren, Derrick Stolee On Thu, Dec 08 2022, Junio C Hamano wrote: > Protecting files from bit flipping filesystem corruption is a > different matter. Folks at hosting sites like GitHub would know how > often they detect object corruption (I presume they do not have to > deal with the index file on the server end that often, but loose and > pack object files have the trailing checksums the same way) thanks > to the trailing checksum, and what the consequences are if we lost > that safety (I am guessing it would be minimum, though). I don't think this checksum does much for us in practice, but just on this point in general: Extrapolating results at <hosting site> when it comes to making general decisions about git's data safety isn't a good idea. I don't know about GitHub's hardware, but servers almost universally use ECC ram, and tend to use things like error-correcting filesystem RAID etc. Data in that area is really interesting when it comes to running git in that sort of setup, but it really shouldn't be extrapolated to git's userbase in general. A lot of those users will be using cheap memory and/or storage devices without any error correction. They're also likely to stress our reliability guarantees in other ways, e.g. yanking their power cord (or equivalent), which a server typically won't need to deal with. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 0/4] Optionally skip hashing index on write 2022-12-07 23:27 ` [PATCH 0/4] Optionally skip hashing index on write Junio C Hamano 2022-12-07 23:42 ` Ævar Arnfjörð Bjarmason @ 2022-12-08 16:38 ` Derrick Stolee 2022-12-12 22:22 ` Jacob Keller 1 sibling, 1 reply; 59+ messages in thread From: Derrick Stolee @ 2022-12-08 16:38 UTC (permalink / raw) To: Junio C Hamano, Derrick Stolee via GitGitGadget; +Cc: git, vdye, avarab, newren On 12/7/2022 6:27 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> Writing the index is a critical action that takes place in multiple Git >> commands. The recent performance improvements available with the sparse >> index show how often the I/O costs around the index can affect different Git >> commands, although reading the index takes place more often than a write. > > The sparse-index work is great in that it offers correctness while > taking advantage of the knowledge of which part of the tree is > quiescent and unused to boost performance. I am not sure a change > to reduce file safety can be compared with it, in that one is pure > improvement, while the other is trade-off. I agree that this is a trade-off, and we should both be careful about whether or not we even make this a possibility for certain file formats. The index is an interesting case for a couple reasons: 1. Writes block users. Writing the index takes place in many user- blocking foreground operations. The speed improvement directly impacts their use. Other file formats are typically written in the background (commit-graph, multi-pack-index) or are super- critical to correctness (pack-files). 2. Index files are short lived. It is rare that a user leaves an index for a long time with many staged changes. That's the condition that's required for losing an index file to cause a loss of work (or maybe I'm missing something). Outside of staged changes, the index can be completely destroyed and rewritten with minimal impact to the user. > As long as we will keep the "create into a new file, write it fully > and fsync + rename to the final" pattern, we do not need the trailing > checksum to protect us from a truncated output due to index-writing > process dying in the middle, so I do not mind that trade-off, though. > > Protecting files from bit flipping filesystem corruption is a > different matter. Folks at hosting sites like GitHub would know how > often they detect object corruption (I presume they do not have to > deal with the index file on the server end that often, but loose and > pack object files have the trailing checksums the same way) thanks > to the trailing checksum, and what the consequences are if we lost > that safety (I am guessing it would be minimum, though). I agree that we need to be careful about which files get this treatement. But I also want to point out that I'm not using hosting servers as evidence that this has worked in practice, but instead many developer machines in large monorepos who have had this enabled (via the microsoft/git fork) for years. We've not come across an instance where this loss of a trailing hash has been an issue. Thanks, -Stolee ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 0/4] Optionally skip hashing index on write 2022-12-08 16:38 ` Derrick Stolee @ 2022-12-12 22:22 ` Jacob Keller 0 siblings, 0 replies; 59+ messages in thread From: Jacob Keller @ 2022-12-12 22:22 UTC (permalink / raw) To: Derrick Stolee Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git, vdye, avarab, newren On Thu, Dec 8, 2022 at 8:50 AM Derrick Stolee <derrickstolee@github.com> wrote: > > On 12/7/2022 6:27 PM, Junio C Hamano wrote: > > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > >> Writing the index is a critical action that takes place in multiple Git > >> commands. The recent performance improvements available with the sparse > >> index show how often the I/O costs around the index can affect different Git > >> commands, although reading the index takes place more often than a write. > > > > The sparse-index work is great in that it offers correctness while > > taking advantage of the knowledge of which part of the tree is > > quiescent and unused to boost performance. I am not sure a change > > to reduce file safety can be compared with it, in that one is pure > > improvement, while the other is trade-off. > > I agree that this is a trade-off, and we should both be careful about > whether or not we even make this a possibility for certain file > formats. The index is an interesting case for a couple reasons: > > 1. Writes block users. Writing the index takes place in many user- > blocking foreground operations. The speed improvement directly > impacts their use. Other file formats are typically written in > the background (commit-graph, multi-pack-index) or are super- > critical to correctness (pack-files). > > 2. Index files are short lived. It is rare that a user leaves an > index for a long time with many staged changes. That's the condition > that's required for losing an index file to cause a loss of work > (or maybe I'm missing something). Outside of staged changes, the > index can be completely destroyed and rewritten with minimal impact > to the user. > Is this information in the commit messages somewhere? I didn't see it in the cover letter. Nor did I see any other explanation in the cover letter besides "this makes it faster". I would expect such trade off or analysis of "what do we lose" to be in the cover letter, as it may not be clear otherwise. I do agree these reasons are good, but it can be confusing to later reviewers when looking back at the code for an option like this and wondering why it exists. Thanks, Jake ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v2 0/4] Optionally skip hashing index on write 2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget ` (4 preceding siblings ...) 2022-12-07 23:27 ` [PATCH 0/4] Optionally skip hashing index on write Junio C Hamano @ 2022-12-12 16:31 ` Derrick Stolee via GitGitGadget 2022-12-12 16:31 ` [PATCH v2 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget ` (4 more replies) 5 siblings, 5 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-12 16:31 UTC (permalink / raw) To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee Writing the index is a critical action that takes place in multiple Git commands. The recent performance improvements available with the sparse index show how often the I/O costs around the index can affect different Git commands, although reading the index takes place more often than a write. The index is written through the hashfile API, both as a buffered write and as a computation of its trailing hash. This trailing hash is an expectation of the file format: several optional index extensions are provided using indicators immediately preceding the trailing hash. 'git fsck' will complain if the trailing hash does not match the contents of the file up to that point. However, computing the hash is expensive. This is even more expensive now that we've moved to sha1dc instead of hardware-accelerated SHA1 computations. This series provides a new config option, index.skipHash, that allows users to disable computing the hash as Git writes the index. In this case, the trailing hash is stored as the null hash (all bytes are zero). The implementation is rather simple, but the patches organize different aspects in a way that we could split things out: * Patch 1 adds a 'skip_hash' option to 'struct hashfile' that creates this as a general option to the hashwrite API. * The motivation in Patch 1 avoids the talk about the chunk-format API and instead focuses on the index as the first example that could make use of this. * Patch 2 creates the index.skipHash config option and updates 'git fsck' to ignore a null hash on the index. This is demonstrated by a very simple test. * Patch 3 creates a test helper to get the trailing hash of a file, and adds a concrete check on the trailing hash when index.skipHash=true. * Patch 4 (which could be deferred until later) adds index.skipHash=true as an implication of feature.manyFiles and as a setting in Scalar. The end result is that index writes speed up significantly, enough that 'git update-index --force-write' improves by 1.75x. Similar patches appeared in my earlier refs RFC [1]. [1] https://lore.kernel.org/git/pull.1408.git.1667846164.gitgitgadget@gmail.com/ Updates since v1 ================ * In Patch 1, use hashcler() instead of memset(). * In Patches 2 and 4, be explicit about which Git versions will exhibit different behavior when reading an index with a null trailing hash. * In Patch 2, used a more compact way of loading from config. * In Patch 2 (and following) dropped the sub-shell in t1600-index.sh. * In Patch 3, dropped back-ticks when using a multi-line pipe. * In Patch 4, use "! cmp" instead of "! test_cmp" Updates since RFC ================= * The config name is now index.skipHash to make it more clear that the default is to not skip the hash, and choosing this option requires a true value. * Use oidread() instead of an ad-hoc loop. * Patches 3 and 4 are new. 3 adds some extra test verification that wasn't in the RFC. * I think that patch 4 is helpful to see now, but I could understand if we wanted to defer that patch until after index.skipHash has had more time to bake. Thanks, -Stolee Derrick Stolee (4): hashfile: allow skipping the hash function read-cache: add index.skipHash config option test-lib-functions: add helper for trailing hash features: feature.manyFiles implies fast index writes Documentation/config/feature.txt | 5 +++++ Documentation/config/index.txt | 9 +++++++++ csum-file.c | 14 +++++++++++--- csum-file.h | 7 +++++++ read-cache.c | 15 ++++++++++++++- repo-settings.c | 2 ++ repository.h | 1 + scalar.c | 1 + t/t1600-index.sh | 20 ++++++++++++++++++++ t/test-lib-functions.sh | 8 ++++++++ 10 files changed, 78 insertions(+), 4 deletions(-) base-commit: 805265fcf7a737664a8321aaf4a0587b78435184 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1439%2Fderrickstolee%2Fhashfile-skip-hash-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1439/derrickstolee/hashfile-skip-hash-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/1439 Range-diff vs v1: 1: 40ee8dbaef0 ! 1: b582d681581 hashfile: allow skipping the hash function @@ csum-file.c: int finalize_hashfile(struct hashfile *f, unsigned char *result, - the_hash_algo->final_fn(f->buffer, &f->ctx); + + if (f->skip_hash) -+ memset(f->buffer, 0, the_hash_algo->rawsz); ++ hashclr(f->buffer); + else + the_hash_algo->final_fn(f->buffer, &f->ctx); + 2: 5fb4b5a36ac ! 2: c8a2fd3a470 read-cache: add index.skipHash config option @@ Commit message The one exception is 'git fsck' which checks the hash of the index file. This used to be a check on every index read, but was split out to just the index in a33fc72fe91 (read-cache: force_verify_index_checksum, - 2017-04-14). + 2017-04-14) and released first in Git 2.13.0. Document the versions that + relaxed these restrictions, with the optimistic expectation that this + change will be included in Git 2.40.0. Here, we disable this check if the trailing hash is all zeroes. We add a warning to the config option that this may cause undesirable behavior with older Git versions. As a quick comparison, I tested 'git update-index --force-write' with - and without index.computeHash=false on a copy of the Linux kernel + and without index.skipHash=true on a copy of the Linux kernel repository. Benchmark 1: with hash @@ Documentation/config/index.txt: index.version:: + Instead, write a trailing set of bytes with value zero, indicating + that the computation was skipped. ++ -+If you enable `index.skipHash`, then older Git clients may report that -+your index is corrupt during `git fsck`. ++If you enable `index.skipHash`, then Git clients older than 2.13.0 will ++refuse to parse the index and Git clients older than 2.40.0 will report an ++error during `git fsck`. ## read-cache.c ## @@ read-cache.c: static int verify_hdr(const struct cache_header *hdr, unsigned long size) @@ read-cache.c: static int verify_hdr(const struct cache_header *hdr, unsigned lon return 0; } @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempfile *tempfile, - int ieot_entries = 1; - struct index_entry_offset_table *ieot = NULL; - int nr, nr_threads; -+ int skip_hash; f = hashfd(tempfile->fd, tempfile->filename.buf); -+ if (!git_config_get_maybe_bool("index.skiphash", &skip_hash)) -+ f->skip_hash = skip_hash; ++ git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); + for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) @@ t/t1600-index.sh: test_expect_success 'out of bounds index.version issues warnin ' +test_expect_success 'index.skipHash config option' ' -+ ( -+ rm -f .git/index && -+ git -c index.skipHash=true add a && -+ git fsck -+ ) ++ rm -f .git/index && ++ git -c index.skipHash=true add a && ++ git fsck +' + test_index_version () { 3: a20bf8de864 ! 3: 813e81a0582 test-lib-functions: add helper for trailing hash @@ Commit message Signed-off-by: Derrick Stolee <derrickstolee@github.com> ## t/t1600-index.sh ## -@@ t/t1600-index.sh: test_expect_success 'index.skipHash config option' ' - ( - rm -f .git/index && - git -c index.skipHash=true add a && -+ test_trailing_hash .git/index >hash && -+ echo $(test_oid zero) >expect && -+ test_cmp expect hash && - git fsck - ) +@@ t/t1600-index.sh: test_expect_success 'out of bounds index.version issues warning' ' + test_expect_success 'index.skipHash config option' ' + rm -f .git/index && + git -c index.skipHash=true add a && ++ test_trailing_hash .git/index >hash && ++ echo $(test_oid zero) >expect && ++ test_cmp expect hash && + git fsck ' + ## t/test-lib-functions.sh ## @@ t/test-lib-functions.sh: test_cmp_config_output () { @@ t/test-lib-functions.sh: test_cmp_config_output () { +# Given a filename, extract its trailing hash as a hex string +test_trailing_hash () { + local file="$1" && -+ tail -c $(test_oid rawsz) "$file" | \ -+ test-tool hexdump | \ ++ tail -c $(test_oid rawsz) "$file" | ++ test-tool hexdump | + sed "s/ //g" +} 4: 77bf5d5ff27 ! 4: e640dff53dd features: feature.manyFiles implies fast index writes @@ Documentation/config/feature.txt: feature.manyFiles:: `git checkout` may be slow and these new defaults improve performance: + +* `index.skipHash=true` speeds up index writes by not computing a trailing -+ checksum. ++ checksum. Note that this will cause Git versions earlier than 2.13.0 to ++ refuse to parse the index and Git versions earlier than 2.40.0 will report ++ a corrupted index during `git fsck`. ++ * `index.version=4` enables path-prefix compression in the index. + @@ Documentation/config/feature.txt: feature.manyFiles:: ## read-cache.c ## @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempfile *tempfile, - int ieot_entries = 1; - struct index_entry_offset_table *ieot = NULL; - int nr, nr_threads; -- int skip_hash; f = hashfd(tempfile->fd, tempfile->filename.buf); -- if (!git_config_get_maybe_bool("index.skiphash", &skip_hash)) -- f->skip_hash = skip_hash; +- git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); + if (istate->repo) { + prepare_repo_settings(istate->repo); + f->skip_hash = istate->repo->settings.index_skip_hash; @@ scalar.c: static int set_recommended_config(int reconfigure) ## t/t1600-index.sh ## @@ t/t1600-index.sh: test_expect_success 'index.skipHash config option' ' - test_trailing_hash .git/index >hash && - echo $(test_oid zero) >expect && - test_cmp expect hash && -- git fsck -+ git fsck && + test_trailing_hash .git/index >hash && + echo $(test_oid zero) >expect && + test_cmp expect hash && +- git fsck ++ git fsck && + -+ rm -f .git/index && -+ git -c feature.manyFiles=true add a && -+ test_trailing_hash .git/index >hash && -+ test_cmp expect hash && ++ rm -f .git/index && ++ git -c feature.manyFiles=true add a && ++ test_trailing_hash .git/index >hash && ++ test_cmp expect hash && + -+ rm -f .git/index && -+ git -c feature.manyFiles=true \ -+ -c index.skipHash=false add a && -+ test_trailing_hash .git/index >hash && -+ ! test_cmp expect hash - ) ++ rm -f .git/index && ++ git -c feature.manyFiles=true \ ++ -c index.skipHash=false add a && ++ test_trailing_hash .git/index >hash && ++ ! cmp expect hash ' + test_index_version () { -- gitgitgadget ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v2 1/4] hashfile: allow skipping the hash function 2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget @ 2022-12-12 16:31 ` Derrick Stolee via GitGitGadget 2022-12-12 16:31 ` [PATCH v2 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget ` (3 subsequent siblings) 4 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-12 16:31 UTC (permalink / raw) To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The hashfile API is useful for generating files that include a trailing hash of the file's contents up to that point. Using such a hash is helpful for verifying the file for corruption-at-rest, such as a faulty drive causing flipped bits. Git's index file includes this trailing hash, so it uses a 'struct hashfile' to handle the I/O to the file. This was very convenient to allow using the hashfile methods during these operations. However, hashing the file contents during write comes at a performance penalty. It's slower to hash the bytes on their way to the disk than without that step. This problem is made worse by the replacement of hardware-accelerated SHA1 computations with the software-based sha1dc computation. This write cost is significant, and the checksum capability is likely not worth that cost for such a short-lived file. The index is rewritten frequently and the only time the checksum is checked is during 'git fsck'. Thus, it would be helpful to allow a user to opt-out of the hash computation. We first need to allow Git to opt-out of the hash computation in the hashfile API. The buffered writes of the API are still helpful, so it makes sense to make the change here. Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the update_fn and final_fn members of the_hash_algo are skipped. When finalizing the hashfile, the trailing hash is replaced with the null hash. This use of a trailing null hash would be desireable in either case, since we do not want to special case a file format to have a different length depending on whether it was hashed or not. When the final bytes of a file are all zero, we can infer that it was written without hashing, and thus that verification is not available as a check for file consistency. This also means that we could easily toggle hashing for any file format we desire. A version of this patch has existed in the microsoft/git fork since 2017 [1] (the linked commit was rebased in 2018, but the original dates back to January 2017). Here, the change to make the index use this fast path is delayed until a later change. [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 Co-authored-by: Kevin Willford <kewillf@microsoft.com> Signed-off-by: Kevin Willford <kewillf@microsoft.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- csum-file.c | 14 +++++++++++--- csum-file.h | 7 +++++++ 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/csum-file.c b/csum-file.c index 59ef3398ca2..cce13c0f047 100644 --- a/csum-file.c +++ b/csum-file.c @@ -45,7 +45,8 @@ void hashflush(struct hashfile *f) unsigned offset = f->offset; if (offset) { - the_hash_algo->update_fn(&f->ctx, f->buffer, offset); + if (!f->skip_hash) + the_hash_algo->update_fn(&f->ctx, f->buffer, offset); flush(f, f->buffer, offset); f->offset = 0; } @@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, int fd; hashflush(f); - the_hash_algo->final_fn(f->buffer, &f->ctx); + + if (f->skip_hash) + hashclr(f->buffer); + else + the_hash_algo->final_fn(f->buffer, &f->ctx); + if (result) hashcpy(result, f->buffer); if (flags & CSUM_HASH_IN_STREAM) @@ -108,7 +114,8 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count) * the hashfile's buffer. In this block, * f->offset is necessarily zero. */ - the_hash_algo->update_fn(&f->ctx, buf, nr); + if (!f->skip_hash) + the_hash_algo->update_fn(&f->ctx, buf, nr); flush(f, buf, nr); } else { /* @@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name, f->tp = tp; f->name = name; f->do_crc = 0; + f->skip_hash = 0; the_hash_algo->init_fn(&f->ctx); f->buffer_len = buffer_len; diff --git a/csum-file.h b/csum-file.h index 0d29f528fbc..29468067f81 100644 --- a/csum-file.h +++ b/csum-file.h @@ -20,6 +20,13 @@ struct hashfile { size_t buffer_len; unsigned char *buffer; unsigned char *check_buffer; + + /** + * If set to 1, skip_hash indicates that we should + * not actually compute the hash for this hashfile and + * instead only use it as a buffered write. + */ + unsigned int skip_hash; }; /* Checkpoint */ -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH v2 2/4] read-cache: add index.skipHash config option 2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget 2022-12-12 16:31 ` [PATCH v2 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget @ 2022-12-12 16:31 ` Derrick Stolee via GitGitGadget 2022-12-12 16:31 ` [PATCH v2 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget ` (2 subsequent siblings) 4 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-12 16:31 UTC (permalink / raw) To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The previous change allowed skipping the hashing portion of the hashwrite API, using it instead as a buffered write API. Disabling the hashwrite can be particularly helpful when the write operation is in a critical path. One such critical path is the writing of the index. This operation is so critical that the sparse index was created specifically to reduce the size of the index to make these writes (and reads) faster. Following a similar approach to one used in the microsoft/git fork [1], add a new config option (index.skipHash) that allows disabling this hashing during the index write. The cost is that we can no longer validate the contents for corruption-at-rest using the trailing hash. [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 While older Git versions will not recognize the null hash as a special case, the file format itself is still being met in terms of its structure. Using this null hash will still allow Git operations to function across older versions. The one exception is 'git fsck' which checks the hash of the index file. This used to be a check on every index read, but was split out to just the index in a33fc72fe91 (read-cache: force_verify_index_checksum, 2017-04-14) and released first in Git 2.13.0. Document the versions that relaxed these restrictions, with the optimistic expectation that this change will be included in Git 2.40.0. Here, we disable this check if the trailing hash is all zeroes. We add a warning to the config option that this may cause undesirable behavior with older Git versions. As a quick comparison, I tested 'git update-index --force-write' with and without index.skipHash=true on a copy of the Linux kernel repository. Benchmark 1: with hash Time (mean ± σ): 46.3 ms ± 13.8 ms [User: 34.3 ms, System: 11.9 ms] Range (min … max): 34.3 ms … 79.1 ms 82 runs Benchmark 2: without hash Time (mean ± σ): 26.0 ms ± 7.9 ms [User: 11.8 ms, System: 14.2 ms] Range (min … max): 16.3 ms … 42.0 ms 69 runs Summary 'without hash' ran 1.78 ± 0.76 times faster than 'with hash' These performance benefits are substantial enough to allow users the ability to opt-in to this feature, even with the potential confusion with older 'git fsck' versions. It is critical that this test is placed before the test_index_version tests, since those tests obliterate the .git/config file and hence lose the setting from GIT_TEST_DEFAULT_HASH, if set. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- Documentation/config/index.txt | 9 +++++++++ read-cache.c | 12 +++++++++++- t/t1600-index.sh | 6 ++++++ 3 files changed, 26 insertions(+), 1 deletion(-) diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt index 75f3a2d1054..5d62489c302 100644 --- a/Documentation/config/index.txt +++ b/Documentation/config/index.txt @@ -30,3 +30,12 @@ index.version:: Specify the version with which new index files should be initialized. This does not affect existing repositories. If `feature.manyFiles` is enabled, then the default is 4. + +index.skipHash:: + When enabled, do not compute the trailing hash for the index file. + Instead, write a trailing set of bytes with value zero, indicating + that the computation was skipped. ++ +If you enable `index.skipHash`, then Git clients older than 2.13.0 will +refuse to parse the index and Git clients older than 2.40.0 will report an +error during `git fsck`. diff --git a/read-cache.c b/read-cache.c index 46f5e497b14..3f7de8b2e20 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) git_hash_ctx c; unsigned char hash[GIT_MAX_RAWSZ]; int hdr_version; + unsigned char *start, *end; + struct object_id oid; if (hdr->hdr_signature != htonl(CACHE_SIGNATURE)) return error(_("bad signature 0x%08x"), hdr->hdr_signature); @@ -1827,10 +1829,16 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) if (!verify_index_checksum) return 0; + end = (unsigned char *)hdr + size; + start = end - the_hash_algo->rawsz; + oidread(&oid, start); + if (oideq(&oid, null_oid())) + return 0; + the_hash_algo->init_fn(&c); the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz); the_hash_algo->final_fn(hash, &c); - if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz)) + if (!hasheq(hash, end - the_hash_algo->rawsz)) return error(_("bad index file sha1 signature")); return 0; } @@ -2918,6 +2926,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, f = hashfd(tempfile->fd, tempfile->filename.buf); + git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); + for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) removed++; diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 010989f90e6..45feb0fc5d8 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -65,6 +65,12 @@ test_expect_success 'out of bounds index.version issues warning' ' ) ' +test_expect_success 'index.skipHash config option' ' + rm -f .git/index && + git -c index.skipHash=true add a && + git fsck +' + test_index_version () { INDEX_VERSION_CONFIG=$1 && FEATURE_MANY_FILES=$2 && -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH v2 3/4] test-lib-functions: add helper for trailing hash 2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget 2022-12-12 16:31 ` [PATCH v2 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget 2022-12-12 16:31 ` [PATCH v2 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget @ 2022-12-12 16:31 ` Derrick Stolee via GitGitGadget 2022-12-12 18:14 ` SZEDER Gábor 2022-12-12 16:31 ` [PATCH v2 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget 2022-12-15 15:06 ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget 4 siblings, 1 reply; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-12 16:31 UTC (permalink / raw) To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> It can be helpful to check that a file format with a trailing hash has a specific hash in the final bytes of a written file. This is made more apparent by recent changes that allow skipping the hash algorithm and writing a null hash at the end of the file instead. Add a new test_trailing_hash helper and use it in t1600 to verify that index.skipHash=true really does skip the hash computation, since 'git fsck' does not actually verify the hash. Keep the 'git fsck' call to ensure that any potential future change to check the index hash does not cause an error in this case. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- t/t1600-index.sh | 3 +++ t/test-lib-functions.sh | 8 ++++++++ 2 files changed, 11 insertions(+) diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 45feb0fc5d8..55914bc3506 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -68,6 +68,9 @@ test_expect_success 'out of bounds index.version issues warning' ' test_expect_success 'index.skipHash config option' ' rm -f .git/index && git -c index.skipHash=true add a && + test_trailing_hash .git/index >hash && + echo $(test_oid zero) >expect && + test_cmp expect hash && git fsck ' diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh index 796093a7b32..60308843f8f 100644 --- a/t/test-lib-functions.sh +++ b/t/test-lib-functions.sh @@ -1875,3 +1875,11 @@ test_cmp_config_output () { sort config-actual >sorted-actual && test_cmp sorted-expect sorted-actual } + +# Given a filename, extract its trailing hash as a hex string +test_trailing_hash () { + local file="$1" && + tail -c $(test_oid rawsz) "$file" | + test-tool hexdump | + sed "s/ //g" +} -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v2 3/4] test-lib-functions: add helper for trailing hash 2022-12-12 16:31 ` [PATCH v2 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget @ 2022-12-12 18:14 ` SZEDER Gábor 2022-12-13 0:55 ` Junio C Hamano 0 siblings, 1 reply; 59+ messages in thread From: SZEDER Gábor @ 2022-12-12 18:14 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, avarab, newren, Derrick Stolee On Mon, Dec 12, 2022 at 04:31:16PM +0000, Derrick Stolee via GitGitGadget wrote: > diff --git a/t/t1600-index.sh b/t/t1600-index.sh > index 45feb0fc5d8..55914bc3506 100755 > --- a/t/t1600-index.sh > +++ b/t/t1600-index.sh > @@ -68,6 +68,9 @@ test_expect_success 'out of bounds index.version issues warning' ' > test_expect_success 'index.skipHash config option' ' > rm -f .git/index && > git -c index.skipHash=true add a && > + test_trailing_hash .git/index >hash && > + echo $(test_oid zero) >expect && Nit: test_oid zero >expect > + test_cmp expect hash && > git fsck > ' ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v2 3/4] test-lib-functions: add helper for trailing hash 2022-12-12 18:14 ` SZEDER Gábor @ 2022-12-13 0:55 ` Junio C Hamano 2022-12-17 17:37 ` SZEDER Gábor 0 siblings, 1 reply; 59+ messages in thread From: Junio C Hamano @ 2022-12-13 0:55 UTC (permalink / raw) To: SZEDER Gábor Cc: Derrick Stolee via GitGitGadget, git, vdye, avarab, newren, Derrick Stolee SZEDER Gábor <szeder.dev@gmail.com> writes: >> + test_trailing_hash .git/index >hash && >> + echo $(test_oid zero) >expect && > > Nit: test_oid zero >expect > >> + test_cmp expect hash && Unfortunately they are not equivalent. Usually we write these helpers to terminate their output with LF, relying on the fact that terminating LF will be dropped when used in a command substition, e.g. VAR=$(HELPER), but test_oid deviates from the pattern and does not give the terminating LF to its output. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v2 3/4] test-lib-functions: add helper for trailing hash 2022-12-13 0:55 ` Junio C Hamano @ 2022-12-17 17:37 ` SZEDER Gábor 0 siblings, 0 replies; 59+ messages in thread From: SZEDER Gábor @ 2022-12-17 17:37 UTC (permalink / raw) To: Junio C Hamano Cc: brian m. carlson, Derrick Stolee via GitGitGadget, git, vdye, avarab, newren, Derrick Stolee On Tue, Dec 13, 2022 at 09:55:47AM +0900, Junio C Hamano wrote: > SZEDER Gábor <szeder.dev@gmail.com> writes: > > >> + test_trailing_hash .git/index >hash && > >> + echo $(test_oid zero) >expect && > > > > Nit: test_oid zero >expect > > > >> + test_cmp expect hash && > > Unfortunately they are not equivalent. > > Usually we write these helpers to terminate their output with LF, > relying on the fact that terminating LF will be dropped when used in > a command substition, e.g. VAR=$(HELPER), but test_oid deviates from > the pattern and does not give the terminating LF to its output. Oh, indeed. But why does it omit the trailing LF?! Alas, 2c02b110da (t: add test functions to translate hash-related values, 2018-09-13) doesn't seem to mention anything about it. However, skimming through the output of git grep 'test_oid ' -- ':/t/*.sh' it appears that all but three uses of 'test_oid' are in command substitutions, and those three exceptions are in t0000 checking that 'test_oid' actually works. So I don't see any benefit of omitting that trailing LF, but this and similar cases show its drawbacks. $ git grep 'echo $(test_oid ' -- ':/t/*.sh' t/t1302-repo-version.sh: echo $(test_oid version) >expect && t/t5313-pack-bounds-checks.sh: echo $(test_oid oidfff) >file && ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v2 4/4] features: feature.manyFiles implies fast index writes 2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget ` (2 preceding siblings ...) 2022-12-12 16:31 ` [PATCH v2 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget @ 2022-12-12 16:31 ` Derrick Stolee via GitGitGadget 2022-12-15 15:06 ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget 4 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-12 16:31 UTC (permalink / raw) To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The recent addition of the index.skipHash config option allows index writes to speed up by skipping the hash computation for the trailing checksum. This is particularly critical for repositories with many files at HEAD, so add this config option to two cases where users in that scenario may opt-in to such behavior: 1. The feature.manyFiles config option enables some options that are helpful for repositories with many files at HEAD. 2. 'scalar register' and 'scalar reconfigure' set config options that optimize for large repositories. In both of these cases, set index.skipHash=true to gain this speedup. Add tests that demonstrate the proper way that index.skipHash=true can override feature.manyFiles=true. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- Documentation/config/feature.txt | 5 +++++ read-cache.c | 5 ++++- repo-settings.c | 2 ++ repository.h | 1 + scalar.c | 1 + t/t1600-index.sh | 13 ++++++++++++- 6 files changed, 25 insertions(+), 2 deletions(-) diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt index 95975e50912..e52bc6b8584 100644 --- a/Documentation/config/feature.txt +++ b/Documentation/config/feature.txt @@ -23,6 +23,11 @@ feature.manyFiles:: working directory. With many files, commands such as `git status` and `git checkout` may be slow and these new defaults improve performance: + +* `index.skipHash=true` speeds up index writes by not computing a trailing + checksum. Note that this will cause Git versions earlier than 2.13.0 to + refuse to parse the index and Git versions earlier than 2.40.0 will report + a corrupted index during `git fsck`. ++ * `index.version=4` enables path-prefix compression in the index. + * `core.untrackedCache=true` enables the untracked cache. This setting assumes diff --git a/read-cache.c b/read-cache.c index 3f7de8b2e20..1844953fba7 100644 --- a/read-cache.c +++ b/read-cache.c @@ -2926,7 +2926,10 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, f = hashfd(tempfile->fd, tempfile->filename.buf); - git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); + if (istate->repo) { + prepare_repo_settings(istate->repo); + f->skip_hash = istate->repo->settings.index_skip_hash; + } for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) diff --git a/repo-settings.c b/repo-settings.c index 3021921c53d..3dbd3f0e2ec 100644 --- a/repo-settings.c +++ b/repo-settings.c @@ -47,6 +47,7 @@ void prepare_repo_settings(struct repository *r) } if (manyfiles) { r->settings.index_version = 4; + r->settings.index_skip_hash = 1; r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE; } @@ -61,6 +62,7 @@ void prepare_repo_settings(struct repository *r) repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1); repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1); repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0); + repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash); /* * The GIT_TEST_MULTI_PACK_INDEX variable is special in that diff --git a/repository.h b/repository.h index 6c461c5b9de..e8c67ffe165 100644 --- a/repository.h +++ b/repository.h @@ -42,6 +42,7 @@ struct repo_settings { struct fsmonitor_settings *fsmonitor; /* lazily loaded */ int index_version; + int index_skip_hash; enum untracked_cache_setting core_untracked_cache; int pack_use_sparse; diff --git a/scalar.c b/scalar.c index 6c52243cdf1..b49bb8c24ec 100644 --- a/scalar.c +++ b/scalar.c @@ -143,6 +143,7 @@ static int set_recommended_config(int reconfigure) { "credential.validate", "false", 1 }, /* GCM4W-only */ { "gc.auto", "0", 1 }, { "gui.GCWarning", "false", 1 }, + { "index.skipHash", "false", 1 }, { "index.threads", "true", 1 }, { "index.version", "4", 1 }, { "merge.stat", "false", 1 }, diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 55914bc3506..103743a1c7d 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -71,7 +71,18 @@ test_expect_success 'index.skipHash config option' ' test_trailing_hash .git/index >hash && echo $(test_oid zero) >expect && test_cmp expect hash && - git fsck + git fsck && + + rm -f .git/index && + git -c feature.manyFiles=true add a && + test_trailing_hash .git/index >hash && + test_cmp expect hash && + + rm -f .git/index && + git -c feature.manyFiles=true \ + -c index.skipHash=false add a && + test_trailing_hash .git/index >hash && + ! cmp expect hash ' test_index_version () { -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH v3 0/4] Optionally skip hashing index on write 2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget ` (3 preceding siblings ...) 2022-12-12 16:31 ` [PATCH v2 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget @ 2022-12-15 15:06 ` Derrick Stolee via GitGitGadget 2022-12-15 15:06 ` [PATCH v3 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget ` (5 more replies) 4 siblings, 6 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-15 15:06 UTC (permalink / raw) To: git; +Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee Writing the index is a critical action that takes place in multiple Git commands. The recent performance improvements available with the sparse index show how often the I/O costs around the index can affect different Git commands, although reading the index takes place more often than a write. The index is written through the hashfile API, both as a buffered write and as a computation of its trailing hash. This trailing hash is an expectation of the file format: several optional index extensions are provided using indicators immediately preceding the trailing hash. 'git fsck' will complain if the trailing hash does not match the contents of the file up to that point. However, computing the hash is expensive. This is even more expensive now that we've moved to sha1dc instead of hardware-accelerated SHA1 computations. This series provides a new config option, index.skipHash, that allows users to disable computing the hash as Git writes the index. In this case, the trailing hash is stored as the null hash (all bytes are zero). The implementation is rather simple, but the patches organize different aspects in a way that we could split things out: * Patch 1 adds a 'skip_hash' option to 'struct hashfile' that creates this as a general option to the hashwrite API. * The motivation in Patch 1 avoids the talk about the chunk-format API and instead focuses on the index as the first example that could make use of this. * Patch 2 creates the index.skipHash config option and updates 'git fsck' to ignore a null hash on the index. This is demonstrated by a very simple test. * Patch 3 creates a test helper to get the trailing hash of a file, and adds a concrete check on the trailing hash when index.skipHash=true. * Patch 4 (which could be deferred until later) adds index.skipHash=true as an implication of feature.manyFiles and as a setting in Scalar. The end result is that index writes speed up significantly, enough that 'git update-index --force-write' improves by 1.75x. Similar patches appeared in my earlier refs RFC [1]. [1] https://lore.kernel.org/git/pull.1408.git.1667846164.gitgitgadget@gmail.com/ Updates since v2 ================ * Patch 2 now has additional details about why to use the config option in its message. The discussion about why the index is special is included in the commit message. Updates since v1 ================ * In Patch 1, use hashcler() instead of memset(). * In Patches 2 and 4, be explicit about which Git versions will exhibit different behavior when reading an index with a null trailing hash. * In Patch 2, used a more compact way of loading from config. * In Patch 2 (and following) dropped the sub-shell in t1600-index.sh. * In Patch 3, dropped back-ticks when using a multi-line pipe. * In Patch 4, use "! cmp" instead of "! test_cmp" Updates since RFC ================= * The config name is now index.skipHash to make it more clear that the default is to not skip the hash, and choosing this option requires a true value. * Use oidread() instead of an ad-hoc loop. * Patches 3 and 4 are new. 3 adds some extra test verification that wasn't in the RFC. * I think that patch 4 is helpful to see now, but I could understand if we wanted to defer that patch until after index.skipHash has had more time to bake. Thanks, -Stolee Derrick Stolee (4): hashfile: allow skipping the hash function read-cache: add index.skipHash config option test-lib-functions: add helper for trailing hash features: feature.manyFiles implies fast index writes Documentation/config/feature.txt | 5 +++++ Documentation/config/index.txt | 11 +++++++++++ csum-file.c | 14 +++++++++++--- csum-file.h | 7 +++++++ read-cache.c | 15 ++++++++++++++- repo-settings.c | 2 ++ repository.h | 1 + scalar.c | 1 + t/t1600-index.sh | 20 ++++++++++++++++++++ t/test-lib-functions.sh | 8 ++++++++ 10 files changed, 80 insertions(+), 4 deletions(-) base-commit: 805265fcf7a737664a8321aaf4a0587b78435184 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1439%2Fderrickstolee%2Fhashfile-skip-hash-v3 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1439/derrickstolee/hashfile-skip-hash-v3 Pull-Request: https://github.com/gitgitgadget/git/pull/1439 Range-diff vs v2: 1: b582d681581 = 1: b582d681581 hashfile: allow skipping the hash function 2: c8a2fd3a470 ! 2: 00738c81a12 read-cache: add index.skipHash config option @@ Commit message critical that the sparse index was created specifically to reduce the size of the index to make these writes (and reads) faster. + This trade-off between file stability at rest and write-time performance + is not easy to balance. The index is an interesting case for a couple + reasons: + + 1. Writes block users. Writing the index takes place in many user- + blocking foreground operations. The speed improvement directly + impacts their use. Other file formats are typically written in the + background (commit-graph, multi-pack-index) or are super-critical to + correctness (pack-files). + + 2. Index files are short lived. It is rare that a user leaves an index + for a long time with many staged changes. Outside of staged changes, + the index can be completely destroyed and rewritten with minimal + impact to the user. + Following a similar approach to one used in the microsoft/git fork [1], add a new config option (index.skipHash) that allows disabling this hashing during the index write. The cost is that we can no longer @@ Documentation/config/index.txt: index.version:: + +index.skipHash:: + When enabled, do not compute the trailing hash for the index file. -+ Instead, write a trailing set of bytes with value zero, indicating ++ This accelerates Git commands that manipulate the index, such as ++ `git add`, `git commit`, or `git status`. Instead of storing the ++ checksum, write a trailing set of bytes with value zero, indicating + that the computation was skipped. ++ +If you enable `index.skipHash`, then Git clients older than 2.13.0 will 3: 813e81a0582 = 3: 86370af1351 test-lib-functions: add helper for trailing hash 4: e640dff53dd = 4: 6490bd445eb features: feature.manyFiles implies fast index writes -- gitgitgadget ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v3 1/4] hashfile: allow skipping the hash function 2022-12-15 15:06 ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget @ 2022-12-15 15:06 ` Derrick Stolee via GitGitGadget 2022-12-15 15:06 ` [PATCH v3 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget ` (4 subsequent siblings) 5 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-15 15:06 UTC (permalink / raw) To: git Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The hashfile API is useful for generating files that include a trailing hash of the file's contents up to that point. Using such a hash is helpful for verifying the file for corruption-at-rest, such as a faulty drive causing flipped bits. Git's index file includes this trailing hash, so it uses a 'struct hashfile' to handle the I/O to the file. This was very convenient to allow using the hashfile methods during these operations. However, hashing the file contents during write comes at a performance penalty. It's slower to hash the bytes on their way to the disk than without that step. This problem is made worse by the replacement of hardware-accelerated SHA1 computations with the software-based sha1dc computation. This write cost is significant, and the checksum capability is likely not worth that cost for such a short-lived file. The index is rewritten frequently and the only time the checksum is checked is during 'git fsck'. Thus, it would be helpful to allow a user to opt-out of the hash computation. We first need to allow Git to opt-out of the hash computation in the hashfile API. The buffered writes of the API are still helpful, so it makes sense to make the change here. Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the update_fn and final_fn members of the_hash_algo are skipped. When finalizing the hashfile, the trailing hash is replaced with the null hash. This use of a trailing null hash would be desireable in either case, since we do not want to special case a file format to have a different length depending on whether it was hashed or not. When the final bytes of a file are all zero, we can infer that it was written without hashing, and thus that verification is not available as a check for file consistency. This also means that we could easily toggle hashing for any file format we desire. A version of this patch has existed in the microsoft/git fork since 2017 [1] (the linked commit was rebased in 2018, but the original dates back to January 2017). Here, the change to make the index use this fast path is delayed until a later change. [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 Co-authored-by: Kevin Willford <kewillf@microsoft.com> Signed-off-by: Kevin Willford <kewillf@microsoft.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- csum-file.c | 14 +++++++++++--- csum-file.h | 7 +++++++ 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/csum-file.c b/csum-file.c index 59ef3398ca2..cce13c0f047 100644 --- a/csum-file.c +++ b/csum-file.c @@ -45,7 +45,8 @@ void hashflush(struct hashfile *f) unsigned offset = f->offset; if (offset) { - the_hash_algo->update_fn(&f->ctx, f->buffer, offset); + if (!f->skip_hash) + the_hash_algo->update_fn(&f->ctx, f->buffer, offset); flush(f, f->buffer, offset); f->offset = 0; } @@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, int fd; hashflush(f); - the_hash_algo->final_fn(f->buffer, &f->ctx); + + if (f->skip_hash) + hashclr(f->buffer); + else + the_hash_algo->final_fn(f->buffer, &f->ctx); + if (result) hashcpy(result, f->buffer); if (flags & CSUM_HASH_IN_STREAM) @@ -108,7 +114,8 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count) * the hashfile's buffer. In this block, * f->offset is necessarily zero. */ - the_hash_algo->update_fn(&f->ctx, buf, nr); + if (!f->skip_hash) + the_hash_algo->update_fn(&f->ctx, buf, nr); flush(f, buf, nr); } else { /* @@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name, f->tp = tp; f->name = name; f->do_crc = 0; + f->skip_hash = 0; the_hash_algo->init_fn(&f->ctx); f->buffer_len = buffer_len; diff --git a/csum-file.h b/csum-file.h index 0d29f528fbc..29468067f81 100644 --- a/csum-file.h +++ b/csum-file.h @@ -20,6 +20,13 @@ struct hashfile { size_t buffer_len; unsigned char *buffer; unsigned char *check_buffer; + + /** + * If set to 1, skip_hash indicates that we should + * not actually compute the hash for this hashfile and + * instead only use it as a buffered write. + */ + unsigned int skip_hash; }; /* Checkpoint */ -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH v3 2/4] read-cache: add index.skipHash config option 2022-12-15 15:06 ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget 2022-12-15 15:06 ` [PATCH v3 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget @ 2022-12-15 15:06 ` Derrick Stolee via GitGitGadget 2022-12-15 16:12 ` Ævar Arnfjörð Bjarmason 2022-12-15 15:06 ` [PATCH v3 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget ` (3 subsequent siblings) 5 siblings, 1 reply; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-15 15:06 UTC (permalink / raw) To: git Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The previous change allowed skipping the hashing portion of the hashwrite API, using it instead as a buffered write API. Disabling the hashwrite can be particularly helpful when the write operation is in a critical path. One such critical path is the writing of the index. This operation is so critical that the sparse index was created specifically to reduce the size of the index to make these writes (and reads) faster. This trade-off between file stability at rest and write-time performance is not easy to balance. The index is an interesting case for a couple reasons: 1. Writes block users. Writing the index takes place in many user- blocking foreground operations. The speed improvement directly impacts their use. Other file formats are typically written in the background (commit-graph, multi-pack-index) or are super-critical to correctness (pack-files). 2. Index files are short lived. It is rare that a user leaves an index for a long time with many staged changes. Outside of staged changes, the index can be completely destroyed and rewritten with minimal impact to the user. Following a similar approach to one used in the microsoft/git fork [1], add a new config option (index.skipHash) that allows disabling this hashing during the index write. The cost is that we can no longer validate the contents for corruption-at-rest using the trailing hash. [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 While older Git versions will not recognize the null hash as a special case, the file format itself is still being met in terms of its structure. Using this null hash will still allow Git operations to function across older versions. The one exception is 'git fsck' which checks the hash of the index file. This used to be a check on every index read, but was split out to just the index in a33fc72fe91 (read-cache: force_verify_index_checksum, 2017-04-14) and released first in Git 2.13.0. Document the versions that relaxed these restrictions, with the optimistic expectation that this change will be included in Git 2.40.0. Here, we disable this check if the trailing hash is all zeroes. We add a warning to the config option that this may cause undesirable behavior with older Git versions. As a quick comparison, I tested 'git update-index --force-write' with and without index.skipHash=true on a copy of the Linux kernel repository. Benchmark 1: with hash Time (mean ± σ): 46.3 ms ± 13.8 ms [User: 34.3 ms, System: 11.9 ms] Range (min … max): 34.3 ms … 79.1 ms 82 runs Benchmark 2: without hash Time (mean ± σ): 26.0 ms ± 7.9 ms [User: 11.8 ms, System: 14.2 ms] Range (min … max): 16.3 ms … 42.0 ms 69 runs Summary 'without hash' ran 1.78 ± 0.76 times faster than 'with hash' These performance benefits are substantial enough to allow users the ability to opt-in to this feature, even with the potential confusion with older 'git fsck' versions. It is critical that this test is placed before the test_index_version tests, since those tests obliterate the .git/config file and hence lose the setting from GIT_TEST_DEFAULT_HASH, if set. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- Documentation/config/index.txt | 11 +++++++++++ read-cache.c | 12 +++++++++++- t/t1600-index.sh | 6 ++++++ 3 files changed, 28 insertions(+), 1 deletion(-) diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt index 75f3a2d1054..23c7985eb40 100644 --- a/Documentation/config/index.txt +++ b/Documentation/config/index.txt @@ -30,3 +30,14 @@ index.version:: Specify the version with which new index files should be initialized. This does not affect existing repositories. If `feature.manyFiles` is enabled, then the default is 4. + +index.skipHash:: + When enabled, do not compute the trailing hash for the index file. + This accelerates Git commands that manipulate the index, such as + `git add`, `git commit`, or `git status`. Instead of storing the + checksum, write a trailing set of bytes with value zero, indicating + that the computation was skipped. ++ +If you enable `index.skipHash`, then Git clients older than 2.13.0 will +refuse to parse the index and Git clients older than 2.40.0 will report an +error during `git fsck`. diff --git a/read-cache.c b/read-cache.c index 46f5e497b14..3f7de8b2e20 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) git_hash_ctx c; unsigned char hash[GIT_MAX_RAWSZ]; int hdr_version; + unsigned char *start, *end; + struct object_id oid; if (hdr->hdr_signature != htonl(CACHE_SIGNATURE)) return error(_("bad signature 0x%08x"), hdr->hdr_signature); @@ -1827,10 +1829,16 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) if (!verify_index_checksum) return 0; + end = (unsigned char *)hdr + size; + start = end - the_hash_algo->rawsz; + oidread(&oid, start); + if (oideq(&oid, null_oid())) + return 0; + the_hash_algo->init_fn(&c); the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz); the_hash_algo->final_fn(hash, &c); - if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz)) + if (!hasheq(hash, end - the_hash_algo->rawsz)) return error(_("bad index file sha1 signature")); return 0; } @@ -2918,6 +2926,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, f = hashfd(tempfile->fd, tempfile->filename.buf); + git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); + for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) removed++; diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 010989f90e6..45feb0fc5d8 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -65,6 +65,12 @@ test_expect_success 'out of bounds index.version issues warning' ' ) ' +test_expect_success 'index.skipHash config option' ' + rm -f .git/index && + git -c index.skipHash=true add a && + git fsck +' + test_index_version () { INDEX_VERSION_CONFIG=$1 && FEATURE_MANY_FILES=$2 && -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v3 2/4] read-cache: add index.skipHash config option 2022-12-15 15:06 ` [PATCH v3 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget @ 2022-12-15 16:12 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 59+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-12-15 16:12 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, newren, Jacob Keller, Derrick Stolee On Thu, Dec 15 2022, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <derrickstolee@github.com> > + end = (unsigned char *)hdr + size; Commentary: The "unsigned char *" cast has moved up here from the below, that's needed (we're casting from the struct), and we should get rid of it from below, good. > + start = end - the_hash_algo->rawsz; Okey, so here we mark the start, which is the end minus the rawsz, but... > + oidread(&oid, start); > + if (oideq(&oid, null_oid())) > + return 0; > + > the_hash_algo->init_fn(&c); > the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz); > the_hash_algo->final_fn(hash, &c); > - if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz)) > + if (!hasheq(hash, end - the_hash_algo->rawsz)) ...here we got rid of the cast, which is good, but let's not use "end - the_hash_algo->rawsz" here, let's use "start", which you already computed as "end - the_hash_algo->rawsz". This is just repeating it. I wondered if I just missed it being modified in the interim before carefully re-reading this, but we pass your tests with: - if (!hasheq(hash, end - the_hash_algo->rawsz)) + assert((end - the_hash_algo->rawsz) == start); + if (!hasheq(hash, start)) So, we can indeed juse the simpler "start" here, and it makes it easier to read, as we're assured that it didn't move in the interim. > + git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); Aside from the question of whether we use the repo_*() variant here, which I noted in my reply to the CL. The cast is suspicious. So, in the 1/4 we added this as *unsigned*: + * If set to 1, skip_hash indicates that we should + * not actually compute the hash for this hashfile and + * instead only use it as a buffered write. + */ + unsigned int skip_hash; But you need the cast here since the config API can and will set the "dest" to -1. See the "*dest == -1" test in git_configset_get_value(). So, here we're relying on a "unsigned int" cast'd to "int" correctly doing the right thing on a "-1" assignment. I'm not sure if that's portable or leads to undefined behavior, but in any case, won't such a -1 value be read back as ~0 from that "unsigned int" variable on most modern platforms? Just bypassing that entirely and making it "int" seems better here, or having an intermediate variable. I also wondered if this was all my fault, in your original version you were doing: int skip_hash; [...] if (!git_config_get_maybe_bool("index.skiphash", &skip_hash)) f->skip_hash = skip_hash; And I suggested that this was redundant, and that you could just write to "f->skip_hash" directly. But I didn't notice it was "unsigned", and in any case your original version had the same issue of assigning a -1 to the unsigned variable... ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v3 3/4] test-lib-functions: add helper for trailing hash 2022-12-15 15:06 ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget 2022-12-15 15:06 ` [PATCH v3 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget 2022-12-15 15:06 ` [PATCH v3 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget @ 2022-12-15 15:06 ` Derrick Stolee via GitGitGadget 2022-12-15 15:07 ` [PATCH v3 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget ` (2 subsequent siblings) 5 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-15 15:06 UTC (permalink / raw) To: git Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> It can be helpful to check that a file format with a trailing hash has a specific hash in the final bytes of a written file. This is made more apparent by recent changes that allow skipping the hash algorithm and writing a null hash at the end of the file instead. Add a new test_trailing_hash helper and use it in t1600 to verify that index.skipHash=true really does skip the hash computation, since 'git fsck' does not actually verify the hash. Keep the 'git fsck' call to ensure that any potential future change to check the index hash does not cause an error in this case. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- t/t1600-index.sh | 3 +++ t/test-lib-functions.sh | 8 ++++++++ 2 files changed, 11 insertions(+) diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 45feb0fc5d8..55914bc3506 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -68,6 +68,9 @@ test_expect_success 'out of bounds index.version issues warning' ' test_expect_success 'index.skipHash config option' ' rm -f .git/index && git -c index.skipHash=true add a && + test_trailing_hash .git/index >hash && + echo $(test_oid zero) >expect && + test_cmp expect hash && git fsck ' diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh index 796093a7b32..60308843f8f 100644 --- a/t/test-lib-functions.sh +++ b/t/test-lib-functions.sh @@ -1875,3 +1875,11 @@ test_cmp_config_output () { sort config-actual >sorted-actual && test_cmp sorted-expect sorted-actual } + +# Given a filename, extract its trailing hash as a hex string +test_trailing_hash () { + local file="$1" && + tail -c $(test_oid rawsz) "$file" | + test-tool hexdump | + sed "s/ //g" +} -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH v3 4/4] features: feature.manyFiles implies fast index writes 2022-12-15 15:06 ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget ` (2 preceding siblings ...) 2022-12-15 15:06 ` [PATCH v3 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget @ 2022-12-15 15:07 ` Derrick Stolee via GitGitGadget 2022-12-15 15:56 ` [PATCH v3 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason 2022-12-16 15:31 ` [PATCH v4 " Derrick Stolee via GitGitGadget 5 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-15 15:07 UTC (permalink / raw) To: git Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The recent addition of the index.skipHash config option allows index writes to speed up by skipping the hash computation for the trailing checksum. This is particularly critical for repositories with many files at HEAD, so add this config option to two cases where users in that scenario may opt-in to such behavior: 1. The feature.manyFiles config option enables some options that are helpful for repositories with many files at HEAD. 2. 'scalar register' and 'scalar reconfigure' set config options that optimize for large repositories. In both of these cases, set index.skipHash=true to gain this speedup. Add tests that demonstrate the proper way that index.skipHash=true can override feature.manyFiles=true. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- Documentation/config/feature.txt | 5 +++++ read-cache.c | 5 ++++- repo-settings.c | 2 ++ repository.h | 1 + scalar.c | 1 + t/t1600-index.sh | 13 ++++++++++++- 6 files changed, 25 insertions(+), 2 deletions(-) diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt index 95975e50912..e52bc6b8584 100644 --- a/Documentation/config/feature.txt +++ b/Documentation/config/feature.txt @@ -23,6 +23,11 @@ feature.manyFiles:: working directory. With many files, commands such as `git status` and `git checkout` may be slow and these new defaults improve performance: + +* `index.skipHash=true` speeds up index writes by not computing a trailing + checksum. Note that this will cause Git versions earlier than 2.13.0 to + refuse to parse the index and Git versions earlier than 2.40.0 will report + a corrupted index during `git fsck`. ++ * `index.version=4` enables path-prefix compression in the index. + * `core.untrackedCache=true` enables the untracked cache. This setting assumes diff --git a/read-cache.c b/read-cache.c index 3f7de8b2e20..1844953fba7 100644 --- a/read-cache.c +++ b/read-cache.c @@ -2926,7 +2926,10 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, f = hashfd(tempfile->fd, tempfile->filename.buf); - git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); + if (istate->repo) { + prepare_repo_settings(istate->repo); + f->skip_hash = istate->repo->settings.index_skip_hash; + } for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) diff --git a/repo-settings.c b/repo-settings.c index 3021921c53d..3dbd3f0e2ec 100644 --- a/repo-settings.c +++ b/repo-settings.c @@ -47,6 +47,7 @@ void prepare_repo_settings(struct repository *r) } if (manyfiles) { r->settings.index_version = 4; + r->settings.index_skip_hash = 1; r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE; } @@ -61,6 +62,7 @@ void prepare_repo_settings(struct repository *r) repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1); repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1); repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0); + repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash); /* * The GIT_TEST_MULTI_PACK_INDEX variable is special in that diff --git a/repository.h b/repository.h index 6c461c5b9de..e8c67ffe165 100644 --- a/repository.h +++ b/repository.h @@ -42,6 +42,7 @@ struct repo_settings { struct fsmonitor_settings *fsmonitor; /* lazily loaded */ int index_version; + int index_skip_hash; enum untracked_cache_setting core_untracked_cache; int pack_use_sparse; diff --git a/scalar.c b/scalar.c index 6c52243cdf1..b49bb8c24ec 100644 --- a/scalar.c +++ b/scalar.c @@ -143,6 +143,7 @@ static int set_recommended_config(int reconfigure) { "credential.validate", "false", 1 }, /* GCM4W-only */ { "gc.auto", "0", 1 }, { "gui.GCWarning", "false", 1 }, + { "index.skipHash", "false", 1 }, { "index.threads", "true", 1 }, { "index.version", "4", 1 }, { "merge.stat", "false", 1 }, diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 55914bc3506..103743a1c7d 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -71,7 +71,18 @@ test_expect_success 'index.skipHash config option' ' test_trailing_hash .git/index >hash && echo $(test_oid zero) >expect && test_cmp expect hash && - git fsck + git fsck && + + rm -f .git/index && + git -c feature.manyFiles=true add a && + test_trailing_hash .git/index >hash && + test_cmp expect hash && + + rm -f .git/index && + git -c feature.manyFiles=true \ + -c index.skipHash=false add a && + test_trailing_hash .git/index >hash && + ! cmp expect hash ' test_index_version () { -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v3 0/4] Optionally skip hashing index on write 2022-12-15 15:06 ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget ` (3 preceding siblings ...) 2022-12-15 15:07 ` [PATCH v3 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget @ 2022-12-15 15:56 ` Ævar Arnfjörð Bjarmason 2022-12-16 13:41 ` Derrick Stolee 2022-12-16 15:31 ` [PATCH v4 " Derrick Stolee via GitGitGadget 5 siblings, 1 reply; 59+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-12-15 15:56 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, newren, Jacob Keller, Derrick Stolee On Thu, Dec 15 2022, Derrick Stolee via GitGitGadget wrote: > Updates since v2 > ================ > > * Patch 2 now has additional details about why to use the config option in > its message. The discussion about why the index is special is included in > the commit message. Re the comments I had on an earlier version[1] I tried this series with these changes squashed in: 1: b582d681581 = 1: 53d289cf82c hashfile: allow skipping the hash function 2: 00738c81a12 ! 2: db487f3e76d read-cache: add index.skipHash config option @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf f = hashfd(tempfile->fd, tempfile->filename.buf); -+ git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); ++ repo_config_get_maybe_bool(istate->repo, "index.skiphash", (int *)&f->skip_hash); + for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) 3: 86370af1351 = 3: 35188bbcfb5 test-lib-functions: add helper for trailing hash 4: 6490bd445eb ! 4: 4354328e8fc features: feature.manyFiles implies fast index writes @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf f = hashfd(tempfile->fd, tempfile->filename.buf); -- git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); +- repo_config_get_maybe_bool(istate->repo, "index.skiphash", (int *)&f->skip_hash); + if (istate->repo) { ++ int ours, theirs; ++ + prepare_repo_settings(istate->repo); -+ f->skip_hash = istate->repo->settings.index_skip_hash; ++ ++ ours = istate->repo->settings.index_skip_hash; ++ theirs = the_repository->settings.index_skip_hash; ++ ++ if (ours != theirs) ++ BUG("ours %d theirs %d", ours, theirs); ++ f->skip_hash = ours; ++ } else { ++ f->skip_hash = the_repository->settings.index_skip_hash; ++ /*BUG("no repo???");*/ + } for (i = removed = extended = 0; i < entries; i++) { With those all of your tests still pass. Which leads to these outstanding questions: a) If we're doing to use "istate->repo" in 4/4 why not do so in 2/4, neither patch discusses *that* part of the change. On an unrelated topic there's this[2] fix-up of yours, 2/2 still seems like it needs the same treatment. b) In 2/4 we'd get the config if "istate->repo" was NULL, if you uncomment that BUG() in the squashed 4/4 we'll get test failures all over the place. Aren't these places where we'd get the config before, but istate->repo isn't set up properly, so with 4/4 we won't get it? Maybe that's desired, but again, neither commit discusses this change. Now, I *think* it's a good idea to defer to the already set-up 'istate->repo', but I also know (and have some local patches to fix...) about the cases where we don't set it up (but should), so it can't be fully relied on. So even if we can't produce a behavior difference, just doing e.g.: struct repository *r = istate->repo ? istate->repo : the_repository; And then using: prepare_repo_settings(r); f->skip_hash = r->->settings.index_skip_hash; Seems sensible. I just don't get why 4/4 has that seeming fix-up of 2/4 after-the-fact. Isn't it better to carve that bit out of 4/4, just do the config setup in repo-settings.c to begin with, and have 4/4 do what its $subject says, i.e. to have "feature.manyFiles" imply this new config? The rest of this all looks sensible to me, sans some isolated things I'll comment on on individual patches. 1. https://lore.kernel.org/git/221212.86v8mg4gr2.gmgdl@evledraar.gmail.com/ 2. https://lore.kernel.org/git/857d1abec4cf124e011c7f84276ce105cb5b3a96.1670866407.git.gitgitgadget@gmail.com/ ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v3 0/4] Optionally skip hashing index on write 2022-12-15 15:56 ` [PATCH v3 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason @ 2022-12-16 13:41 ` Derrick Stolee 0 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee @ 2022-12-16 13:41 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, newren, Jacob Keller On 12/15/22 10:56 AM, Ævar Arnfjörð Bjarmason wrote: > > On Thu, Dec 15 2022, Derrick Stolee via GitGitGadget wrote: > >> Updates since v2 >> ================ >> >> * Patch 2 now has additional details about why to use the config option in >> its message. The discussion about why the index is special is included in >> the commit message. > So even if we can't produce a behavior difference, just doing e.g.: > > struct repository *r = istate->repo ? istate->repo : the_repository; > > And then using: > > prepare_repo_settings(r); > f->skip_hash = r->->settings.index_skip_hash; This, and other comments are sensible and will be reflected in v4. > Seems sensible. I just don't get why 4/4 has that seeming fix-up of 2/4 > after-the-fact. Isn't it better to carve that bit out of 4/4, just do > the config setup in repo-settings.c to begin with, and have 4/4 do what > its $subject says, i.e. to have "feature.manyFiles" imply this new > config? The point is to make patch 4 completely independent, and be able to pull it out if necessary. There's no reason to load the config inside repo-settings.c unless it's part of something like feature.manyFiles. Thanks, -Stolee ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v4 0/4] Optionally skip hashing index on write 2022-12-15 15:06 ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget ` (4 preceding siblings ...) 2022-12-15 15:56 ` [PATCH v3 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason @ 2022-12-16 15:31 ` Derrick Stolee via GitGitGadget 2022-12-16 15:31 ` [PATCH v4 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget ` (5 more replies) 5 siblings, 6 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-16 15:31 UTC (permalink / raw) To: git; +Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee Writing the index is a critical action that takes place in multiple Git commands. The recent performance improvements available with the sparse index show how often the I/O costs around the index can affect different Git commands, although reading the index takes place more often than a write. The index is written through the hashfile API, both as a buffered write and as a computation of its trailing hash. This trailing hash is an expectation of the file format: several optional index extensions are provided using indicators immediately preceding the trailing hash. 'git fsck' will complain if the trailing hash does not match the contents of the file up to that point. However, computing the hash is expensive. This is even more expensive now that we've moved to sha1dc instead of hardware-accelerated SHA1 computations. This series provides a new config option, index.skipHash, that allows users to disable computing the hash as Git writes the index. In this case, the trailing hash is stored as the null hash (all bytes are zero). The implementation is rather simple, but the patches organize different aspects in a way that we could split things out: * Patch 1 adds a 'skip_hash' option to 'struct hashfile' that creates this as a general option to the hashwrite API. * The motivation in Patch 1 avoids the talk about the chunk-format API and instead focuses on the index as the first example that could make use of this. * Patch 2 creates the index.skipHash config option and updates 'git fsck' to ignore a null hash on the index. This is demonstrated by a very simple test. * Patch 3 creates a test helper to get the trailing hash of a file, and adds a concrete check on the trailing hash when index.skipHash=true. * Patch 4 (which could be deferred until later) adds index.skipHash=true as an implication of feature.manyFiles and as a setting in Scalar. The end result is that index writes speed up significantly, enough that 'git update-index --force-write' improves by 1.75x. Similar patches appeared in my earlier refs RFC [1]. [1] https://lore.kernel.org/git/pull.1408.git.1667846164.gitgitgadget@gmail.com/ Updates since v3 ================ * Patch 1 uses an "int" instead of "unsigned int". This was a carry-over from Kevin's patch in microsoft/git, but our use of it in patch 2 is different and thus is better with a signed int. * Patch 2 uses a fallback to the_repository if istate->repo is NULL. This allows the setting to be applied to more cases where istate->repo is not set. * Patch 2 also uses 'start' in more places, since it already computed the beginning of the hash. * Patch 4 slightly modifies its use of prepare_repo_settings() due to Patch 2's fallback to the_repository. Updates since v2 ================ * Patch 2 now has additional details about why to use the config option in its message. The discussion about why the index is special is included in the commit message. Updates since v1 ================ * In Patch 1, use hashcler() instead of memset(). * In Patches 2 and 4, be explicit about which Git versions will exhibit different behavior when reading an index with a null trailing hash. * In Patch 2, used a more compact way of loading from config. * In Patch 2 (and following) dropped the sub-shell in t1600-index.sh. * In Patch 3, dropped back-ticks when using a multi-line pipe. * In Patch 4, use "! cmp" instead of "! test_cmp" Updates since RFC ================= * The config name is now index.skipHash to make it more clear that the default is to not skip the hash, and choosing this option requires a true value. * Use oidread() instead of an ad-hoc loop. * Patches 3 and 4 are new. 3 adds some extra test verification that wasn't in the RFC. * I think that patch 4 is helpful to see now, but I could understand if we wanted to defer that patch until after index.skipHash has had more time to bake. Thanks, -Stolee Derrick Stolee (4): hashfile: allow skipping the hash function read-cache: add index.skipHash config option test-lib-functions: add helper for trailing hash features: feature.manyFiles implies fast index writes Documentation/config/feature.txt | 5 +++++ Documentation/config/index.txt | 11 +++++++++++ csum-file.c | 14 +++++++++++--- csum-file.h | 7 +++++++ read-cache.c | 14 +++++++++++++- repo-settings.c | 2 ++ repository.h | 1 + scalar.c | 1 + t/t1600-index.sh | 20 ++++++++++++++++++++ t/test-lib-functions.sh | 8 ++++++++ 10 files changed, 79 insertions(+), 4 deletions(-) base-commit: 805265fcf7a737664a8321aaf4a0587b78435184 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1439%2Fderrickstolee%2Fhashfile-skip-hash-v4 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1439/derrickstolee/hashfile-skip-hash-v4 Pull-Request: https://github.com/gitgitgadget/git/pull/1439 Range-diff vs v3: 1: b582d681581 ! 1: c99470d4676 hashfile: allow skipping the hash function @@ csum-file.h: struct hashfile { unsigned char *check_buffer; + + /** -+ * If set to 1, skip_hash indicates that we should ++ * If non-zero, skip_hash indicates that we should + * not actually compute the hash for this hashfile and + * instead only use it as a buffered write. + */ -+ unsigned int skip_hash; ++ int skip_hash; }; /* Checkpoint */ 2: 00738c81a12 ! 2: aae047cbc9f read-cache: add index.skipHash config option @@ Commit message [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 + We load this config from the repository config given by istate->repo, + with a fallback to the_repository if it is not set. + While older Git versions will not recognize the null hash as a special case, the file format itself is still being met in terms of its structure. Using this null hash will still allow Git operations to @@ read-cache.c: static int verify_hdr(const struct cache_header *hdr, unsigned lon the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz); the_hash_algo->final_fn(hash, &c); - if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz)) -+ if (!hasheq(hash, end - the_hash_algo->rawsz)) ++ if (!hasheq(hash, start)) return error(_("bad index file sha1 signature")); return 0; } @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempfile *tempfile, + int ieot_entries = 1; + struct index_entry_offset_table *ieot = NULL; + int nr, nr_threads; ++ struct repository *r = istate->repo ? istate->repo : the_repository; f = hashfd(tempfile->fd, tempfile->filename.buf); -+ git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); ++ repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash); + for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) 3: 86370af1351 = 3: 27fbe52e748 test-lib-functions: add helper for trailing hash 4: 6490bd445eb ! 4: 075921514f2 features: feature.manyFiles implies fast index writes @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf f = hashfd(tempfile->fd, tempfile->filename.buf); -- git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); -+ if (istate->repo) { -+ prepare_repo_settings(istate->repo); -+ f->skip_hash = istate->repo->settings.index_skip_hash; -+ } +- repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash); ++ prepare_repo_settings(r); ++ f->skip_hash = r->settings.index_skip_hash; for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) -- gitgitgadget ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v4 1/4] hashfile: allow skipping the hash function 2022-12-16 15:31 ` [PATCH v4 " Derrick Stolee via GitGitGadget @ 2022-12-16 15:31 ` Derrick Stolee via GitGitGadget 2022-12-16 15:31 ` [PATCH v4 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget ` (4 subsequent siblings) 5 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-16 15:31 UTC (permalink / raw) To: git Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The hashfile API is useful for generating files that include a trailing hash of the file's contents up to that point. Using such a hash is helpful for verifying the file for corruption-at-rest, such as a faulty drive causing flipped bits. Git's index file includes this trailing hash, so it uses a 'struct hashfile' to handle the I/O to the file. This was very convenient to allow using the hashfile methods during these operations. However, hashing the file contents during write comes at a performance penalty. It's slower to hash the bytes on their way to the disk than without that step. This problem is made worse by the replacement of hardware-accelerated SHA1 computations with the software-based sha1dc computation. This write cost is significant, and the checksum capability is likely not worth that cost for such a short-lived file. The index is rewritten frequently and the only time the checksum is checked is during 'git fsck'. Thus, it would be helpful to allow a user to opt-out of the hash computation. We first need to allow Git to opt-out of the hash computation in the hashfile API. The buffered writes of the API are still helpful, so it makes sense to make the change here. Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the update_fn and final_fn members of the_hash_algo are skipped. When finalizing the hashfile, the trailing hash is replaced with the null hash. This use of a trailing null hash would be desireable in either case, since we do not want to special case a file format to have a different length depending on whether it was hashed or not. When the final bytes of a file are all zero, we can infer that it was written without hashing, and thus that verification is not available as a check for file consistency. This also means that we could easily toggle hashing for any file format we desire. A version of this patch has existed in the microsoft/git fork since 2017 [1] (the linked commit was rebased in 2018, but the original dates back to January 2017). Here, the change to make the index use this fast path is delayed until a later change. [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 Co-authored-by: Kevin Willford <kewillf@microsoft.com> Signed-off-by: Kevin Willford <kewillf@microsoft.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- csum-file.c | 14 +++++++++++--- csum-file.h | 7 +++++++ 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/csum-file.c b/csum-file.c index 59ef3398ca2..cce13c0f047 100644 --- a/csum-file.c +++ b/csum-file.c @@ -45,7 +45,8 @@ void hashflush(struct hashfile *f) unsigned offset = f->offset; if (offset) { - the_hash_algo->update_fn(&f->ctx, f->buffer, offset); + if (!f->skip_hash) + the_hash_algo->update_fn(&f->ctx, f->buffer, offset); flush(f, f->buffer, offset); f->offset = 0; } @@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, int fd; hashflush(f); - the_hash_algo->final_fn(f->buffer, &f->ctx); + + if (f->skip_hash) + hashclr(f->buffer); + else + the_hash_algo->final_fn(f->buffer, &f->ctx); + if (result) hashcpy(result, f->buffer); if (flags & CSUM_HASH_IN_STREAM) @@ -108,7 +114,8 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count) * the hashfile's buffer. In this block, * f->offset is necessarily zero. */ - the_hash_algo->update_fn(&f->ctx, buf, nr); + if (!f->skip_hash) + the_hash_algo->update_fn(&f->ctx, buf, nr); flush(f, buf, nr); } else { /* @@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name, f->tp = tp; f->name = name; f->do_crc = 0; + f->skip_hash = 0; the_hash_algo->init_fn(&f->ctx); f->buffer_len = buffer_len; diff --git a/csum-file.h b/csum-file.h index 0d29f528fbc..793a59da12b 100644 --- a/csum-file.h +++ b/csum-file.h @@ -20,6 +20,13 @@ struct hashfile { size_t buffer_len; unsigned char *buffer; unsigned char *check_buffer; + + /** + * If non-zero, skip_hash indicates that we should + * not actually compute the hash for this hashfile and + * instead only use it as a buffered write. + */ + int skip_hash; }; /* Checkpoint */ -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH v4 2/4] read-cache: add index.skipHash config option 2022-12-16 15:31 ` [PATCH v4 " Derrick Stolee via GitGitGadget 2022-12-16 15:31 ` [PATCH v4 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget @ 2022-12-16 15:31 ` Derrick Stolee via GitGitGadget 2022-12-16 15:31 ` [PATCH v4 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget ` (3 subsequent siblings) 5 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-16 15:31 UTC (permalink / raw) To: git Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The previous change allowed skipping the hashing portion of the hashwrite API, using it instead as a buffered write API. Disabling the hashwrite can be particularly helpful when the write operation is in a critical path. One such critical path is the writing of the index. This operation is so critical that the sparse index was created specifically to reduce the size of the index to make these writes (and reads) faster. This trade-off between file stability at rest and write-time performance is not easy to balance. The index is an interesting case for a couple reasons: 1. Writes block users. Writing the index takes place in many user- blocking foreground operations. The speed improvement directly impacts their use. Other file formats are typically written in the background (commit-graph, multi-pack-index) or are super-critical to correctness (pack-files). 2. Index files are short lived. It is rare that a user leaves an index for a long time with many staged changes. Outside of staged changes, the index can be completely destroyed and rewritten with minimal impact to the user. Following a similar approach to one used in the microsoft/git fork [1], add a new config option (index.skipHash) that allows disabling this hashing during the index write. The cost is that we can no longer validate the contents for corruption-at-rest using the trailing hash. [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 We load this config from the repository config given by istate->repo, with a fallback to the_repository if it is not set. While older Git versions will not recognize the null hash as a special case, the file format itself is still being met in terms of its structure. Using this null hash will still allow Git operations to function across older versions. The one exception is 'git fsck' which checks the hash of the index file. This used to be a check on every index read, but was split out to just the index in a33fc72fe91 (read-cache: force_verify_index_checksum, 2017-04-14) and released first in Git 2.13.0. Document the versions that relaxed these restrictions, with the optimistic expectation that this change will be included in Git 2.40.0. Here, we disable this check if the trailing hash is all zeroes. We add a warning to the config option that this may cause undesirable behavior with older Git versions. As a quick comparison, I tested 'git update-index --force-write' with and without index.skipHash=true on a copy of the Linux kernel repository. Benchmark 1: with hash Time (mean ± σ): 46.3 ms ± 13.8 ms [User: 34.3 ms, System: 11.9 ms] Range (min … max): 34.3 ms … 79.1 ms 82 runs Benchmark 2: without hash Time (mean ± σ): 26.0 ms ± 7.9 ms [User: 11.8 ms, System: 14.2 ms] Range (min … max): 16.3 ms … 42.0 ms 69 runs Summary 'without hash' ran 1.78 ± 0.76 times faster than 'with hash' These performance benefits are substantial enough to allow users the ability to opt-in to this feature, even with the potential confusion with older 'git fsck' versions. It is critical that this test is placed before the test_index_version tests, since those tests obliterate the .git/config file and hence lose the setting from GIT_TEST_DEFAULT_HASH, if set. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- Documentation/config/index.txt | 11 +++++++++++ read-cache.c | 13 ++++++++++++- t/t1600-index.sh | 6 ++++++ 3 files changed, 29 insertions(+), 1 deletion(-) diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt index 75f3a2d1054..23c7985eb40 100644 --- a/Documentation/config/index.txt +++ b/Documentation/config/index.txt @@ -30,3 +30,14 @@ index.version:: Specify the version with which new index files should be initialized. This does not affect existing repositories. If `feature.manyFiles` is enabled, then the default is 4. + +index.skipHash:: + When enabled, do not compute the trailing hash for the index file. + This accelerates Git commands that manipulate the index, such as + `git add`, `git commit`, or `git status`. Instead of storing the + checksum, write a trailing set of bytes with value zero, indicating + that the computation was skipped. ++ +If you enable `index.skipHash`, then Git clients older than 2.13.0 will +refuse to parse the index and Git clients older than 2.40.0 will report an +error during `git fsck`. diff --git a/read-cache.c b/read-cache.c index 46f5e497b14..fde0697873c 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) git_hash_ctx c; unsigned char hash[GIT_MAX_RAWSZ]; int hdr_version; + unsigned char *start, *end; + struct object_id oid; if (hdr->hdr_signature != htonl(CACHE_SIGNATURE)) return error(_("bad signature 0x%08x"), hdr->hdr_signature); @@ -1827,10 +1829,16 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) if (!verify_index_checksum) return 0; + end = (unsigned char *)hdr + size; + start = end - the_hash_algo->rawsz; + oidread(&oid, start); + if (oideq(&oid, null_oid())) + return 0; + the_hash_algo->init_fn(&c); the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz); the_hash_algo->final_fn(hash, &c); - if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz)) + if (!hasheq(hash, start)) return error(_("bad index file sha1 signature")); return 0; } @@ -2915,9 +2923,12 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, int ieot_entries = 1; struct index_entry_offset_table *ieot = NULL; int nr, nr_threads; + struct repository *r = istate->repo ? istate->repo : the_repository; f = hashfd(tempfile->fd, tempfile->filename.buf); + repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash); + for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) removed++; diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 010989f90e6..45feb0fc5d8 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -65,6 +65,12 @@ test_expect_success 'out of bounds index.version issues warning' ' ) ' +test_expect_success 'index.skipHash config option' ' + rm -f .git/index && + git -c index.skipHash=true add a && + git fsck +' + test_index_version () { INDEX_VERSION_CONFIG=$1 && FEATURE_MANY_FILES=$2 && -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH v4 3/4] test-lib-functions: add helper for trailing hash 2022-12-16 15:31 ` [PATCH v4 " Derrick Stolee via GitGitGadget 2022-12-16 15:31 ` [PATCH v4 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget 2022-12-16 15:31 ` [PATCH v4 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget @ 2022-12-16 15:31 ` Derrick Stolee via GitGitGadget 2022-12-16 15:31 ` [PATCH v4 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget ` (2 subsequent siblings) 5 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-16 15:31 UTC (permalink / raw) To: git Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> It can be helpful to check that a file format with a trailing hash has a specific hash in the final bytes of a written file. This is made more apparent by recent changes that allow skipping the hash algorithm and writing a null hash at the end of the file instead. Add a new test_trailing_hash helper and use it in t1600 to verify that index.skipHash=true really does skip the hash computation, since 'git fsck' does not actually verify the hash. Keep the 'git fsck' call to ensure that any potential future change to check the index hash does not cause an error in this case. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- t/t1600-index.sh | 3 +++ t/test-lib-functions.sh | 8 ++++++++ 2 files changed, 11 insertions(+) diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 45feb0fc5d8..55914bc3506 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -68,6 +68,9 @@ test_expect_success 'out of bounds index.version issues warning' ' test_expect_success 'index.skipHash config option' ' rm -f .git/index && git -c index.skipHash=true add a && + test_trailing_hash .git/index >hash && + echo $(test_oid zero) >expect && + test_cmp expect hash && git fsck ' diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh index 796093a7b32..60308843f8f 100644 --- a/t/test-lib-functions.sh +++ b/t/test-lib-functions.sh @@ -1875,3 +1875,11 @@ test_cmp_config_output () { sort config-actual >sorted-actual && test_cmp sorted-expect sorted-actual } + +# Given a filename, extract its trailing hash as a hex string +test_trailing_hash () { + local file="$1" && + tail -c $(test_oid rawsz) "$file" | + test-tool hexdump | + sed "s/ //g" +} -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH v4 4/4] features: feature.manyFiles implies fast index writes 2022-12-16 15:31 ` [PATCH v4 " Derrick Stolee via GitGitGadget ` (2 preceding siblings ...) 2022-12-16 15:31 ` [PATCH v4 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget @ 2022-12-16 15:31 ` Derrick Stolee via GitGitGadget 2022-12-16 15:43 ` [PATCH v4 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason 2023-01-06 16:31 ` [PATCH v5 " Derrick Stolee via GitGitGadget 5 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2022-12-16 15:31 UTC (permalink / raw) To: git Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The recent addition of the index.skipHash config option allows index writes to speed up by skipping the hash computation for the trailing checksum. This is particularly critical for repositories with many files at HEAD, so add this config option to two cases where users in that scenario may opt-in to such behavior: 1. The feature.manyFiles config option enables some options that are helpful for repositories with many files at HEAD. 2. 'scalar register' and 'scalar reconfigure' set config options that optimize for large repositories. In both of these cases, set index.skipHash=true to gain this speedup. Add tests that demonstrate the proper way that index.skipHash=true can override feature.manyFiles=true. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- Documentation/config/feature.txt | 5 +++++ read-cache.c | 3 ++- repo-settings.c | 2 ++ repository.h | 1 + scalar.c | 1 + t/t1600-index.sh | 13 ++++++++++++- 6 files changed, 23 insertions(+), 2 deletions(-) diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt index 95975e50912..e52bc6b8584 100644 --- a/Documentation/config/feature.txt +++ b/Documentation/config/feature.txt @@ -23,6 +23,11 @@ feature.manyFiles:: working directory. With many files, commands such as `git status` and `git checkout` may be slow and these new defaults improve performance: + +* `index.skipHash=true` speeds up index writes by not computing a trailing + checksum. Note that this will cause Git versions earlier than 2.13.0 to + refuse to parse the index and Git versions earlier than 2.40.0 will report + a corrupted index during `git fsck`. ++ * `index.version=4` enables path-prefix compression in the index. + * `core.untrackedCache=true` enables the untracked cache. This setting assumes diff --git a/read-cache.c b/read-cache.c index fde0697873c..feefa0f68ba 100644 --- a/read-cache.c +++ b/read-cache.c @@ -2927,7 +2927,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, f = hashfd(tempfile->fd, tempfile->filename.buf); - repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash); + prepare_repo_settings(r); + f->skip_hash = r->settings.index_skip_hash; for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) diff --git a/repo-settings.c b/repo-settings.c index 3021921c53d..3dbd3f0e2ec 100644 --- a/repo-settings.c +++ b/repo-settings.c @@ -47,6 +47,7 @@ void prepare_repo_settings(struct repository *r) } if (manyfiles) { r->settings.index_version = 4; + r->settings.index_skip_hash = 1; r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE; } @@ -61,6 +62,7 @@ void prepare_repo_settings(struct repository *r) repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1); repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1); repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0); + repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash); /* * The GIT_TEST_MULTI_PACK_INDEX variable is special in that diff --git a/repository.h b/repository.h index 6c461c5b9de..e8c67ffe165 100644 --- a/repository.h +++ b/repository.h @@ -42,6 +42,7 @@ struct repo_settings { struct fsmonitor_settings *fsmonitor; /* lazily loaded */ int index_version; + int index_skip_hash; enum untracked_cache_setting core_untracked_cache; int pack_use_sparse; diff --git a/scalar.c b/scalar.c index 6c52243cdf1..b49bb8c24ec 100644 --- a/scalar.c +++ b/scalar.c @@ -143,6 +143,7 @@ static int set_recommended_config(int reconfigure) { "credential.validate", "false", 1 }, /* GCM4W-only */ { "gc.auto", "0", 1 }, { "gui.GCWarning", "false", 1 }, + { "index.skipHash", "false", 1 }, { "index.threads", "true", 1 }, { "index.version", "4", 1 }, { "merge.stat", "false", 1 }, diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 55914bc3506..103743a1c7d 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -71,7 +71,18 @@ test_expect_success 'index.skipHash config option' ' test_trailing_hash .git/index >hash && echo $(test_oid zero) >expect && test_cmp expect hash && - git fsck + git fsck && + + rm -f .git/index && + git -c feature.manyFiles=true add a && + test_trailing_hash .git/index >hash && + test_cmp expect hash && + + rm -f .git/index && + git -c feature.manyFiles=true \ + -c index.skipHash=false add a && + test_trailing_hash .git/index >hash && + ! cmp expect hash ' test_index_version () { -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v4 0/4] Optionally skip hashing index on write 2022-12-16 15:31 ` [PATCH v4 " Derrick Stolee via GitGitGadget ` (3 preceding siblings ...) 2022-12-16 15:31 ` [PATCH v4 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget @ 2022-12-16 15:43 ` Ævar Arnfjörð Bjarmason 2023-01-06 15:33 ` Derrick Stolee 2023-01-06 16:31 ` [PATCH v5 " Derrick Stolee via GitGitGadget 5 siblings, 1 reply; 59+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-12-16 15:43 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, newren, Jacob Keller, Derrick Stolee On Fri, Dec 16 2022, Derrick Stolee via GitGitGadget wrote: Thanks for the update! > Updates since v3 > ================ > > * Patch 1 uses an "int" instead of "unsigned int". This was a carry-over > from Kevin's patch in microsoft/git, but our use of it in patch 2 is > different and thus is better with a signed int. > * Patch 2 uses a fallback to the_repository if istate->repo is NULL. This > allows the setting to be applied to more cases where istate->repo is not > set. > * Patch 2 also uses 'start' in more places, since it already computed the > beginning of the hash. > * Patch 4 slightly modifies its use of prepare_repo_settings() due to Patch > 2's fallback to the_repository. It's good that it's "int" now instead of "unsigned int", but just doing that search-replacement I think misses the implied (which I really should have made more explicit) question in my v3 feedback (https://lore.kernel.org/git/221215.861qp03agm.gmgdl@evledraar.gmail.com/): What should we do about the -1 state then? With your 2/4 here we'll accept e.g.: ./git -c index.skipHash=12345 status As well as: ./git -c index.skipHash=blahblah status But with the migration to repo-settings.c we start refusing the latter of those, as you inherit a stricture in repo-settings.c. Aside: I think this series would be much more readable if it were just squashed into one patch. 1/4 introduces code, but no way to reach it, with 2/4 we have config, 3/4 adds a test 4/4 tweaks how to read/parse/interpret/combine the config. But as it is split up the individual steps should make sense. The 2/4 here should really just use "bool", not "maybe_bool" to begin with, no? And part of this in 4/4 is inheriting a non-stricture in repo-settings.c, but for this new config variable that we're introducing as only a boolean from day one can't we just die() on anything that's not a boolean? On other implied feedback, in https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/ I noted the switching between istate->repo & the_repository, and that you could hit a BUG() (when uncommenting a line in my testing patch on top) if we didn't have istate->repo: There was a commit message update for that: > 2: 00738c81a12 ! 2: aae047cbc9f read-cache: add index.skipHash config option > @@ Commit message > > [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 > > + We load this config from the repository config given by istate->repo, > + with a fallback to the_repository if it is not set. But I think that really sweeps a potential issue under the rug. What are these cases where we don't get istate->repo, are those expected? Is preferring istate->repo always known to be redundant now, and just done for good measure, or are there subtle cases (e.g. when reading submodules) where we pick the wrong repo's config? In that context I'd really like to see some testing of submodules in the added tests, i.e. does this act as we'd like it to when you set the config as "true" in the parent, but "false" in the submodule, & the other way around? That's a case that should stress this "the_repository" v.s. "istate->repo". > While older Git versions will not recognize the null hash as a special > case, the file format itself is still being met in terms of its > structure. Using this null hash will still allow Git operations to > @@ read-cache.c: static int verify_hdr(const struct cache_header *hdr, unsigned lon > the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz); > the_hash_algo->final_fn(hash, &c); > - if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz)) > -+ if (!hasheq(hash, end - the_hash_algo->rawsz)) > ++ if (!hasheq(hash, start)) > return error(_("bad index file sha1 signature")); > return 0; Thanks, good to have this suggested simplification. > @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempfile *tempfile, > + int ieot_entries = 1; > + struct index_entry_offset_table *ieot = NULL; > + int nr, nr_threads; > ++ struct repository *r = istate->repo ? istate->repo : the_repository; > > f = hashfd(tempfile->fd, tempfile->filename.buf); > > -+ git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); > ++ repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash); > + > for (i = removed = extended = 0; i < entries; i++) { Re the above this looks good, but did we introduce an untested behavior change here or not? > if (cache[i]->ce_flags & CE_REMOVE) > 3: 86370af1351 = 3: 27fbe52e748 test-lib-functions: add helper for trailing hash > 4: 6490bd445eb ! 4: 075921514f2 features: feature.manyFiles implies fast index writes > @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf > > f = hashfd(tempfile->fd, tempfile->filename.buf); > > -- git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); > -+ if (istate->repo) { > -+ prepare_repo_settings(istate->repo); > -+ f->skip_hash = istate->repo->settings.index_skip_hash; > -+ } > +- repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash); > ++ prepare_repo_settings(r); > ++ f->skip_hash = r->settings.index_skip_hash; > > for (i = removed = extended = 0; i < entries; i++) { > if (cache[i]->ce_flags & CE_REMOVE) I really don't care per-se where we read the config, as long as it's doing what we expect from the UX point of view. But in your https://lore.kernel.org/git/9e754af8-ecd3-6aed-60e8-2fc09a6a8759@github.com/ you noted "There's no reason to load the config inside repo-settings.c unless it's part of something like feature.manyFiles.". I think that's true, but on the flip side of that: Is there a reason to move the reading of such localized config (only needed in read-cache.c, as 2/4 shows) to repo-settings.c, just because it's now relying on the "manyfiles"? I think the unstated reason for that is that while we read the "manyfiles" config we didn't stick it in the "struct repo_settings", therefore the reading of it is conveniently done in repo-settings.c, which the 4/4 here does. But I think it makes more sense to just stick the "manyfiles" in that struct, not stick "skip_hash" there, and then just check the "manyfiles" in 4/4, but read "index.skiphash" locally. Maybe not worth the churn at this point, but I do wonder if we're going in the wrong direction here, i.e. if all config that happens to rely on features must be added to repo-settings.c, or whether we should reserve that for truly "global" (or mostly global) config. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v4 0/4] Optionally skip hashing index on write 2022-12-16 15:43 ` [PATCH v4 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason @ 2023-01-06 15:33 ` Derrick Stolee 2023-01-06 22:45 ` Junio C Hamano 0 siblings, 1 reply; 59+ messages in thread From: Derrick Stolee @ 2023-01-06 15:33 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget Cc: git, gitster, vdye, newren, Jacob Keller On 12/16/2022 10:43 AM, Ævar Arnfjörð Bjarmason wrote: > But as it is split up the individual steps should make sense. The 2/4 > here should really just use "bool", not "maybe_bool" to begin with, no? > And part of this in 4/4 is inheriting a non-stricture in > repo-settings.c, but for this new config variable that we're introducing > as only a boolean from day one can't we just die() on anything that's > not a boolean? > > On other implied feedback, in > https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/ > I noted the switching between istate->repo & the_repository, and that > you could hit a BUG() (when uncommenting a line in my testing patch on > top) if we didn't have istate->repo: > > There was a commit message update for that: > >> 2: 00738c81a12 ! 2: aae047cbc9f read-cache: add index.skipHash config option >> @@ Commit message >> >> [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 >> >> + We load this config from the repository config given by istate->repo, >> + with a fallback to the_repository if it is not set. > > But I think that really sweeps a potential issue under the rug. What are > these cases where we don't get istate->repo, are those expected? Is > preferring istate->repo always known to be redundant now, and just done > for good measure, or are there subtle cases (e.g. when reading > submodules) where we pick the wrong repo's config? After investigating some of the failures from creating a BUG() statement when istate->repo is NULL I see several problems, and they are not related to submodules for the most part. The first issues I've found are due to code paths that operate on the_index without actually initializing it with a do_read_index(), or otherwise initialize an index using a memset() instead of a common initializer. This looks to be a frequent enough problem that solving it would require a significant effort. It's not a quick fix. > In that context I'd really like to see some testing of submodules in the > added tests, i.e. does this act as we'd like it to when you set the > config as "true" in the parent, but "false" in the submodule, & the > other way around? That's a case that should stress this "the_repository" > v.s. "istate->repo". I can add a test for this, though we do not do that for every new config option. Further, we can only rely on commands like 'git add' that correctly initialize the index from a do_read_index(). It should be sufficient to use the index-local repository (when available) and trust that the repo_config_...() method is doing the right thing. > I really don't care per-se where we read the config, as long as it's > doing what we expect from the UX point of view. > > But in your > https://lore.kernel.org/git/9e754af8-ecd3-6aed-60e8-2fc09a6a8759@github.com/ > you noted "There's no reason to load the config inside repo-settings.c > unless it's part of something like feature.manyFiles.". > > I think that's true, but on the flip side of that: Is there a reason to > move the reading of such localized config (only needed in read-cache.c, > as 2/4 shows) to repo-settings.c, just because it's now relying on the > "manyfiles"? Yes, because it makes it clear what options are part of these meta-config keys. It is better to centralize the parsing instead of having several different ad-hoc parsing strategies. Thanks, -Stolee ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v4 0/4] Optionally skip hashing index on write 2023-01-06 15:33 ` Derrick Stolee @ 2023-01-06 22:45 ` Junio C Hamano 2023-01-06 23:40 ` Derrick Stolee 0 siblings, 1 reply; 59+ messages in thread From: Junio C Hamano @ 2023-01-06 22:45 UTC (permalink / raw) To: Derrick Stolee Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget, git, vdye, newren, Jacob Keller Derrick Stolee <derrickstolee@github.com> writes: > After investigating some of the failures from creating a BUG() statement > when istate->repo is NULL I see several problems, and they are not related > to submodules for the most part. > > The first issues I've found are due to code paths that operate on the_index > without actually initializing it with a do_read_index(), or otherwise > initialize an index using a memset() instead of a common initializer. This > looks to be a frequent enough problem that solving it would require a > significant effort. It's not a quick fix. Thanks. It is not entirely unexpected X-<, considering how the connection from the in-core repository and the in-core index has been developed and evolved. So in short, istate->repo is not yet ready to be used, until the problems you identified are resolved? We probably should start paying down such technical debts. We've punted on them too many times, saying "yes this is klunky but what we have is good enough for adding this feature", I suspect? Thanks. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v4 0/4] Optionally skip hashing index on write 2023-01-06 22:45 ` Junio C Hamano @ 2023-01-06 23:40 ` Derrick Stolee 2023-01-09 17:15 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 59+ messages in thread From: Derrick Stolee @ 2023-01-06 23:40 UTC (permalink / raw) To: Junio C Hamano Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget, git, vdye, newren, Jacob Keller On 1/6/2023 5:45 PM, Junio C Hamano wrote: > Derrick Stolee <derrickstolee@github.com> writes: > >> After investigating some of the failures from creating a BUG() statement >> when istate->repo is NULL I see several problems, and they are not related >> to submodules for the most part. >> >> The first issues I've found are due to code paths that operate on the_index >> without actually initializing it with a do_read_index(), or otherwise >> initialize an index using a memset() instead of a common initializer. This >> looks to be a frequent enough problem that solving it would require a >> significant effort. It's not a quick fix. > > Thanks. It is not entirely unexpected X-<, considering how the > connection from the in-core repository and the in-core index has > been developed and evolved. So in short, istate->repo is not yet > ready to be used, until the problems you identified are resolved? > > We probably should start paying down such technical debts. We've > punted on them too many times, saying "yes this is klunky but what > we have is good enough for adding this feature", I suspect? Yes, I'll make note to prioritize this soon. Thanks, -Stolee ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v4 0/4] Optionally skip hashing index on write 2023-01-06 23:40 ` Derrick Stolee @ 2023-01-09 17:15 ` Ævar Arnfjörð Bjarmason 2023-01-09 18:00 ` Derrick Stolee 0 siblings, 1 reply; 59+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-01-09 17:15 UTC (permalink / raw) To: Derrick Stolee Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git, vdye, newren, Jacob Keller On Fri, Jan 06 2023, Derrick Stolee wrote: > On 1/6/2023 5:45 PM, Junio C Hamano wrote: >> Derrick Stolee <derrickstolee@github.com> writes: >> >>> After investigating some of the failures from creating a BUG() statement >>> when istate->repo is NULL I see several problems, and they are not related >>> to submodules for the most part. >>> >>> The first issues I've found are due to code paths that operate on the_index >>> without actually initializing it with a do_read_index(), or otherwise >>> initialize an index using a memset() instead of a common initializer. This >>> looks to be a frequent enough problem that solving it would require a >>> significant effort. It's not a quick fix. >> >> Thanks. It is not entirely unexpected X-<, considering how the >> connection from the in-core repository and the in-core index has >> been developed and evolved. So in short, istate->repo is not yet >> ready to be used, until the problems you identified are resolved? >> >> We probably should start paying down such technical debts. We've >> punted on them too many times, saying "yes this is klunky but what >> we have is good enough for adding this feature", I suspect? > > Yes, I'll make note to prioritize this soon. I noted in passing in [1] that I had those patches locally, if I'm understanding you correctly and you're planning to work on changes that'll make "istate->repo" always non-NULL. I've rebased those on top of your "ds/omit-trailing-hash-in-index". I'm CI-ing those now & hoping to submit them soon (I've had it working for a while, but there was some interaction with your patches). Preview at [2]. 1. https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/ 2. https://github.com/avar/git/compare/avar-ds/omit-trailing-hash-in-index...avar/do-not-lazy-populate-istate-repo ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v4 0/4] Optionally skip hashing index on write 2023-01-09 17:15 ` Ævar Arnfjörð Bjarmason @ 2023-01-09 18:00 ` Derrick Stolee 2023-01-09 19:22 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 59+ messages in thread From: Derrick Stolee @ 2023-01-09 18:00 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git, vdye, newren, Jacob Keller On 1/9/2023 12:15 PM, Ævar Arnfjörð Bjarmason wrote: > On Fri, Jan 06 2023, Derrick Stolee wrote: >> On 1/6/2023 5:45 PM, Junio C Hamano wrote: >>> We probably should start paying down such technical debts. We've >>> punted on them too many times, saying "yes this is klunky but what >>> we have is good enough for adding this feature", I suspect? >> >> Yes, I'll make note to prioritize this soon. > > I noted in passing in [1] that I had those patches locally, if I'm > understanding you correctly and you're planning to work on changes > that'll make "istate->repo" always non-NULL. > > I've rebased those on top of your "ds/omit-trailing-hash-in-index". I'm > CI-ing those now & hoping to submit them soon (I've had it working for a > while, but there was some interaction with your patches). Preview at > [2]. > > 1. https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/ > 2. https://github.com/avar/git/compare/avar-ds/omit-trailing-hash-in-index...avar/do-not-lazy-populate-istate-repo Thanks for doing this so I don't need to. Some quick pre-submission feedback: your "treewide" patch [1] is far too big and doing too much all at once. It's difficult to know why you're doing things when you're doing them, especially the choices for the validate_cache_repo() calls. In particular, I noticed that you used "the_repository" in some cases where a local "struct repository *" would be better (even if it is currently pointing to the_repository as in builtin/sparse-checkout.c in update_working_directory()). These would be easier to catch in smaller diffs. [1] https://github.com/avar/git/commit/7732a41800dfc8f5dbf909560615d6048d583ed9 Looking forward to your series. -Stolee ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v4 0/4] Optionally skip hashing index on write 2023-01-09 18:00 ` Derrick Stolee @ 2023-01-09 19:22 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 59+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-01-09 19:22 UTC (permalink / raw) To: Derrick Stolee Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git, vdye, newren, Jacob Keller On Mon, Jan 09 2023, Derrick Stolee wrote: > On 1/9/2023 12:15 PM, Ævar Arnfjörð Bjarmason wrote: >> On Fri, Jan 06 2023, Derrick Stolee wrote: >>> On 1/6/2023 5:45 PM, Junio C Hamano wrote: > >>>> We probably should start paying down such technical debts. We've >>>> punted on them too many times, saying "yes this is klunky but what >>>> we have is good enough for adding this feature", I suspect? >>> >>> Yes, I'll make note to prioritize this soon. >> >> I noted in passing in [1] that I had those patches locally, if I'm >> understanding you correctly and you're planning to work on changes >> that'll make "istate->repo" always non-NULL. >> >> I've rebased those on top of your "ds/omit-trailing-hash-in-index". I'm >> CI-ing those now & hoping to submit them soon (I've had it working for a >> while, but there was some interaction with your patches). Preview at >> [2]. >> >> 1. https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/ >> 2. https://github.com/avar/git/compare/avar-ds/omit-trailing-hash-in-index...avar/do-not-lazy-populate-istate-repo > > Thanks for doing this so I don't need to. > > Some quick pre-submission feedback: your "treewide" patch [1] is far > too big and doing too much all at once. It's difficult to know why > you're doing things when you're doing them, especially the choices > for the validate_cache_repo() calls. Yeah, I don't like it either :) I'm mulling over whether to keep it at all. I.e. the fix to pass the data is relatively small, but most of that patch is validation paranoia. It was helpful to write that to validate it for my own sanity, but maybe it's not worth keeping it. Ultimately it's just making things BUG(...) during testing that would otherwise be a NULL-pointer dereference, so I'm currently leaning towards (especially with this early feedback) just skipping it. Now that we have a SANITIZE=address CI job the error feedback if/when that would happen is arguably worse than the BUG(...) output. > In particular, I noticed that you used "the_repository" in some cases > where a local "struct repository *" would be better (even if it is > currently pointing to the_repository as in builtin/sparse-checkout.c > in update_working_directory()). These would be easier to catch in > smaller diffs. Yes, definitely. I was careful to use an existing "struct repository *" in favor of "the_repository", but clearly not careful enough! I'll fix that, and look carefully over the rest before submission in case I've missed other similar cases. > [1] https://github.com/avar/git/commit/7732a41800dfc8f5dbf909560615d6048d583ed9 > > Looking forward to your series. Thanks, since I sent the above CI passed, so it'll be sooner than later, but I'll massage it a bit per the above before submission. ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v5 0/4] Optionally skip hashing index on write 2022-12-16 15:31 ` [PATCH v4 " Derrick Stolee via GitGitGadget ` (4 preceding siblings ...) 2022-12-16 15:43 ` [PATCH v4 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason @ 2023-01-06 16:31 ` Derrick Stolee via GitGitGadget 2023-01-06 16:31 ` [PATCH v5 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget ` (4 more replies) 5 siblings, 5 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2023-01-06 16:31 UTC (permalink / raw) To: git; +Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee Writing the index is a critical action that takes place in multiple Git commands. The recent performance improvements available with the sparse index show how often the I/O costs around the index can affect different Git commands, although reading the index takes place more often than a write. The index is written through the hashfile API, both as a buffered write and as a computation of its trailing hash. This trailing hash is an expectation of the file format: several optional index extensions are provided using indicators immediately preceding the trailing hash. 'git fsck' will complain if the trailing hash does not match the contents of the file up to that point. However, computing the hash is expensive. This is even more expensive now that we've moved to sha1dc instead of hardware-accelerated SHA1 computations. This series provides a new config option, index.skipHash, that allows users to disable computing the hash as Git writes the index. In this case, the trailing hash is stored as the null hash (all bytes are zero). The implementation is rather simple, but the patches organize different aspects in a way that we could split things out: * Patch 1 adds a 'skip_hash' option to 'struct hashfile' that creates this as a general option to the hashwrite API. * The motivation in Patch 1 avoids the talk about the chunk-format API and instead focuses on the index as the first example that could make use of this. * Patch 2 creates the index.skipHash config option and updates 'git fsck' to ignore a null hash on the index. This is demonstrated by a very simple test. * Patch 3 creates a test helper to get the trailing hash of a file, and adds a concrete check on the trailing hash when index.skipHash=true. * Patch 4 (which could be deferred until later) adds index.skipHash=true as an implication of feature.manyFiles and as a setting in Scalar. The end result is that index writes speed up significantly, enough that 'git update-index --force-write' improves by 1.75x. Similar patches appeared in my earlier refs RFC [1]. [1] https://lore.kernel.org/git/pull.1408.git.1667846164.gitgitgadget@gmail.com/ Updates since v4 ================ * Patch 2 now uses repo_config_get_bool() instead of repo_config_get_maybe_bool(). * A test is added for submodule-level config. Updates since v3 ================ * Patch 1 uses an "int" instead of "unsigned int". This was a carry-over from Kevin's patch in microsoft/git, but our use of it in patch 2 is different and thus is better with a signed int. * Patch 2 uses a fallback to the_repository if istate->repo is NULL. This allows the setting to be applied to more cases where istate->repo is not set. * Patch 2 also uses 'start' in more places, since it already computed the beginning of the hash. * Patch 4 slightly modifies its use of prepare_repo_settings() due to Patch 2's fallback to the_repository. Updates since v2 ================ * Patch 2 now has additional details about why to use the config option in its message. The discussion about why the index is special is included in the commit message. Updates since v1 ================ * In Patch 1, use hashcler() instead of memset(). * In Patches 2 and 4, be explicit about which Git versions will exhibit different behavior when reading an index with a null trailing hash. * In Patch 2, used a more compact way of loading from config. * In Patch 2 (and following) dropped the sub-shell in t1600-index.sh. * In Patch 3, dropped back-ticks when using a multi-line pipe. * In Patch 4, use "! cmp" instead of "! test_cmp" Updates since RFC ================= * The config name is now index.skipHash to make it more clear that the default is to not skip the hash, and choosing this option requires a true value. * Use oidread() instead of an ad-hoc loop. * Patches 3 and 4 are new. 3 adds some extra test verification that wasn't in the RFC. * I think that patch 4 is helpful to see now, but I could understand if we wanted to defer that patch until after index.skipHash has had more time to bake. Thanks, -Stolee Derrick Stolee (4): hashfile: allow skipping the hash function read-cache: add index.skipHash config option test-lib-functions: add helper for trailing hash features: feature.manyFiles implies fast index writes Documentation/config/feature.txt | 5 +++++ Documentation/config/index.txt | 11 +++++++++++ csum-file.c | 14 +++++++++++--- csum-file.h | 7 +++++++ read-cache.c | 14 +++++++++++++- repo-settings.c | 2 ++ repository.h | 1 + scalar.c | 1 + t/t1600-index.sh | 30 ++++++++++++++++++++++++++++++ t/test-lib-functions.sh | 8 ++++++++ 10 files changed, 89 insertions(+), 4 deletions(-) base-commit: 805265fcf7a737664a8321aaf4a0587b78435184 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1439%2Fderrickstolee%2Fhashfile-skip-hash-v5 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1439/derrickstolee/hashfile-skip-hash-v5 Pull-Request: https://github.com/gitgitgadget/git/pull/1439 Range-diff vs v4: 1: c99470d4676 = 1: c99470d4676 hashfile: allow skipping the hash function 2: aae047cbc9f ! 2: a3c79308171 read-cache: add index.skipHash config option @@ Commit message ability to opt-in to this feature, even with the potential confusion with older 'git fsck' versions. + Test this new config option, both at a command-line level and within a + submodule. The confirmation is currently limited to confirm that 'git + fsck' does not complain about the index. Future updates will make this + test more robust. + It is critical that this test is placed before the test_index_version tests, since those tests obliterate the .git/config file and hence lose the setting from GIT_TEST_DEFAULT_HASH, if set. @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf f = hashfd(tempfile->fd, tempfile->filename.buf); -+ repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash); ++ repo_config_get_bool(r, "index.skiphash", &f->skip_hash); + for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) @@ t/t1600-index.sh: test_expect_success 'out of bounds index.version issues warnin +test_expect_success 'index.skipHash config option' ' + rm -f .git/index && + git -c index.skipHash=true add a && -+ git fsck ++ git fsck && ++ ++ test_commit start && ++ git -c protocol.file.allow=always submodule add ./ sub && ++ git config index.skipHash false && ++ git -C sub config index.skipHash true && ++ >sub/file && ++ git -C sub add a && ++ git -C sub fsck +' + test_index_version () { 3: 27fbe52e748 ! 3: dab9b15de47 test-lib-functions: add helper for trailing hash @@ Commit message Add a new test_trailing_hash helper and use it in t1600 to verify that index.skipHash=true really does skip the hash computation, since - 'git fsck' does not actually verify the hash. + 'git fsck' does not actually verify the hash. This confirms that when + the config is disabled explicitly in a super project but enabled in a + submodule, then the use of repo_config_get_bool() loads config from the + correct repository in the case of 'git add'. There are other cases where + istate->repo is NULL and thus this config is loaded instead from + the_repository, but that's due to many different code paths initializing + index_state structs in their own way. Keep the 'git fsck' call to ensure that any potential future change to check the index hash does not cause an error in this case. @@ t/t1600-index.sh: test_expect_success 'out of bounds index.version issues warnin + test_trailing_hash .git/index >hash && + echo $(test_oid zero) >expect && + test_cmp expect hash && - git fsck + git fsck && + + test_commit start && +@@ t/t1600-index.sh: test_expect_success 'index.skipHash config option' ' + git -C sub config index.skipHash true && + >sub/file && + git -C sub add a && ++ test_trailing_hash .git/modules/sub/index >hash && ++ test_cmp expect hash && + git -C sub fsck ' 4: 075921514f2 ! 4: 1beedcb5ba1 features: feature.manyFiles implies fast index writes @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf f = hashfd(tempfile->fd, tempfile->filename.buf); -- repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash); +- repo_config_get_bool(r, "index.skiphash", &f->skip_hash); + prepare_repo_settings(r); + f->skip_hash = r->settings.index_skip_hash; @@ scalar.c: static int set_recommended_config(int reconfigure) ## t/t1600-index.sh ## @@ t/t1600-index.sh: test_expect_success 'index.skipHash config option' ' - test_trailing_hash .git/index >hash && - echo $(test_oid zero) >expect && test_cmp expect hash && -- git fsck -+ git fsck && -+ + git fsck && + + rm -f .git/index && + git -c feature.manyFiles=true add a && + test_trailing_hash .git/index >hash && -+ test_cmp expect hash && ++ cmp expect hash && + + rm -f .git/index && + git -c feature.manyFiles=true \ -+ -c index.skipHash=false add a && ++ -c index.skipHash=false add a && + test_trailing_hash .git/index >hash && -+ ! cmp expect hash - ' - - test_index_version () { ++ ! cmp expect hash && ++ + test_commit start && + git -c protocol.file.allow=always submodule add ./ sub && + git config index.skipHash false && -- gitgitgadget ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v5 1/4] hashfile: allow skipping the hash function 2023-01-06 16:31 ` [PATCH v5 " Derrick Stolee via GitGitGadget @ 2023-01-06 16:31 ` Derrick Stolee via GitGitGadget 2023-01-06 16:31 ` [PATCH v5 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget ` (3 subsequent siblings) 4 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2023-01-06 16:31 UTC (permalink / raw) To: git Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The hashfile API is useful for generating files that include a trailing hash of the file's contents up to that point. Using such a hash is helpful for verifying the file for corruption-at-rest, such as a faulty drive causing flipped bits. Git's index file includes this trailing hash, so it uses a 'struct hashfile' to handle the I/O to the file. This was very convenient to allow using the hashfile methods during these operations. However, hashing the file contents during write comes at a performance penalty. It's slower to hash the bytes on their way to the disk than without that step. This problem is made worse by the replacement of hardware-accelerated SHA1 computations with the software-based sha1dc computation. This write cost is significant, and the checksum capability is likely not worth that cost for such a short-lived file. The index is rewritten frequently and the only time the checksum is checked is during 'git fsck'. Thus, it would be helpful to allow a user to opt-out of the hash computation. We first need to allow Git to opt-out of the hash computation in the hashfile API. The buffered writes of the API are still helpful, so it makes sense to make the change here. Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the update_fn and final_fn members of the_hash_algo are skipped. When finalizing the hashfile, the trailing hash is replaced with the null hash. This use of a trailing null hash would be desireable in either case, since we do not want to special case a file format to have a different length depending on whether it was hashed or not. When the final bytes of a file are all zero, we can infer that it was written without hashing, and thus that verification is not available as a check for file consistency. This also means that we could easily toggle hashing for any file format we desire. A version of this patch has existed in the microsoft/git fork since 2017 [1] (the linked commit was rebased in 2018, but the original dates back to January 2017). Here, the change to make the index use this fast path is delayed until a later change. [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 Co-authored-by: Kevin Willford <kewillf@microsoft.com> Signed-off-by: Kevin Willford <kewillf@microsoft.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- csum-file.c | 14 +++++++++++--- csum-file.h | 7 +++++++ 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/csum-file.c b/csum-file.c index 59ef3398ca2..cce13c0f047 100644 --- a/csum-file.c +++ b/csum-file.c @@ -45,7 +45,8 @@ void hashflush(struct hashfile *f) unsigned offset = f->offset; if (offset) { - the_hash_algo->update_fn(&f->ctx, f->buffer, offset); + if (!f->skip_hash) + the_hash_algo->update_fn(&f->ctx, f->buffer, offset); flush(f, f->buffer, offset); f->offset = 0; } @@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, int fd; hashflush(f); - the_hash_algo->final_fn(f->buffer, &f->ctx); + + if (f->skip_hash) + hashclr(f->buffer); + else + the_hash_algo->final_fn(f->buffer, &f->ctx); + if (result) hashcpy(result, f->buffer); if (flags & CSUM_HASH_IN_STREAM) @@ -108,7 +114,8 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count) * the hashfile's buffer. In this block, * f->offset is necessarily zero. */ - the_hash_algo->update_fn(&f->ctx, buf, nr); + if (!f->skip_hash) + the_hash_algo->update_fn(&f->ctx, buf, nr); flush(f, buf, nr); } else { /* @@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name, f->tp = tp; f->name = name; f->do_crc = 0; + f->skip_hash = 0; the_hash_algo->init_fn(&f->ctx); f->buffer_len = buffer_len; diff --git a/csum-file.h b/csum-file.h index 0d29f528fbc..793a59da12b 100644 --- a/csum-file.h +++ b/csum-file.h @@ -20,6 +20,13 @@ struct hashfile { size_t buffer_len; unsigned char *buffer; unsigned char *check_buffer; + + /** + * If non-zero, skip_hash indicates that we should + * not actually compute the hash for this hashfile and + * instead only use it as a buffered write. + */ + int skip_hash; }; /* Checkpoint */ -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH v5 2/4] read-cache: add index.skipHash config option 2023-01-06 16:31 ` [PATCH v5 " Derrick Stolee via GitGitGadget 2023-01-06 16:31 ` [PATCH v5 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget @ 2023-01-06 16:31 ` Derrick Stolee via GitGitGadget 2023-01-06 16:31 ` [PATCH v5 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget ` (2 subsequent siblings) 4 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2023-01-06 16:31 UTC (permalink / raw) To: git Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The previous change allowed skipping the hashing portion of the hashwrite API, using it instead as a buffered write API. Disabling the hashwrite can be particularly helpful when the write operation is in a critical path. One such critical path is the writing of the index. This operation is so critical that the sparse index was created specifically to reduce the size of the index to make these writes (and reads) faster. This trade-off between file stability at rest and write-time performance is not easy to balance. The index is an interesting case for a couple reasons: 1. Writes block users. Writing the index takes place in many user- blocking foreground operations. The speed improvement directly impacts their use. Other file formats are typically written in the background (commit-graph, multi-pack-index) or are super-critical to correctness (pack-files). 2. Index files are short lived. It is rare that a user leaves an index for a long time with many staged changes. Outside of staged changes, the index can be completely destroyed and rewritten with minimal impact to the user. Following a similar approach to one used in the microsoft/git fork [1], add a new config option (index.skipHash) that allows disabling this hashing during the index write. The cost is that we can no longer validate the contents for corruption-at-rest using the trailing hash. [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 We load this config from the repository config given by istate->repo, with a fallback to the_repository if it is not set. While older Git versions will not recognize the null hash as a special case, the file format itself is still being met in terms of its structure. Using this null hash will still allow Git operations to function across older versions. The one exception is 'git fsck' which checks the hash of the index file. This used to be a check on every index read, but was split out to just the index in a33fc72fe91 (read-cache: force_verify_index_checksum, 2017-04-14) and released first in Git 2.13.0. Document the versions that relaxed these restrictions, with the optimistic expectation that this change will be included in Git 2.40.0. Here, we disable this check if the trailing hash is all zeroes. We add a warning to the config option that this may cause undesirable behavior with older Git versions. As a quick comparison, I tested 'git update-index --force-write' with and without index.skipHash=true on a copy of the Linux kernel repository. Benchmark 1: with hash Time (mean ± σ): 46.3 ms ± 13.8 ms [User: 34.3 ms, System: 11.9 ms] Range (min … max): 34.3 ms … 79.1 ms 82 runs Benchmark 2: without hash Time (mean ± σ): 26.0 ms ± 7.9 ms [User: 11.8 ms, System: 14.2 ms] Range (min … max): 16.3 ms … 42.0 ms 69 runs Summary 'without hash' ran 1.78 ± 0.76 times faster than 'with hash' These performance benefits are substantial enough to allow users the ability to opt-in to this feature, even with the potential confusion with older 'git fsck' versions. Test this new config option, both at a command-line level and within a submodule. The confirmation is currently limited to confirm that 'git fsck' does not complain about the index. Future updates will make this test more robust. It is critical that this test is placed before the test_index_version tests, since those tests obliterate the .git/config file and hence lose the setting from GIT_TEST_DEFAULT_HASH, if set. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- Documentation/config/index.txt | 11 +++++++++++ read-cache.c | 13 ++++++++++++- t/t1600-index.sh | 14 ++++++++++++++ 3 files changed, 37 insertions(+), 1 deletion(-) diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt index 75f3a2d1054..23c7985eb40 100644 --- a/Documentation/config/index.txt +++ b/Documentation/config/index.txt @@ -30,3 +30,14 @@ index.version:: Specify the version with which new index files should be initialized. This does not affect existing repositories. If `feature.manyFiles` is enabled, then the default is 4. + +index.skipHash:: + When enabled, do not compute the trailing hash for the index file. + This accelerates Git commands that manipulate the index, such as + `git add`, `git commit`, or `git status`. Instead of storing the + checksum, write a trailing set of bytes with value zero, indicating + that the computation was skipped. ++ +If you enable `index.skipHash`, then Git clients older than 2.13.0 will +refuse to parse the index and Git clients older than 2.40.0 will report an +error during `git fsck`. diff --git a/read-cache.c b/read-cache.c index 46f5e497b14..d73a81e41ae 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) git_hash_ctx c; unsigned char hash[GIT_MAX_RAWSZ]; int hdr_version; + unsigned char *start, *end; + struct object_id oid; if (hdr->hdr_signature != htonl(CACHE_SIGNATURE)) return error(_("bad signature 0x%08x"), hdr->hdr_signature); @@ -1827,10 +1829,16 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size) if (!verify_index_checksum) return 0; + end = (unsigned char *)hdr + size; + start = end - the_hash_algo->rawsz; + oidread(&oid, start); + if (oideq(&oid, null_oid())) + return 0; + the_hash_algo->init_fn(&c); the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz); the_hash_algo->final_fn(hash, &c); - if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz)) + if (!hasheq(hash, start)) return error(_("bad index file sha1 signature")); return 0; } @@ -2915,9 +2923,12 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, int ieot_entries = 1; struct index_entry_offset_table *ieot = NULL; int nr, nr_threads; + struct repository *r = istate->repo ? istate->repo : the_repository; f = hashfd(tempfile->fd, tempfile->filename.buf); + repo_config_get_bool(r, "index.skiphash", &f->skip_hash); + for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) removed++; diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 010989f90e6..98c5a83db73 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -65,6 +65,20 @@ test_expect_success 'out of bounds index.version issues warning' ' ) ' +test_expect_success 'index.skipHash config option' ' + rm -f .git/index && + git -c index.skipHash=true add a && + git fsck && + + test_commit start && + git -c protocol.file.allow=always submodule add ./ sub && + git config index.skipHash false && + git -C sub config index.skipHash true && + >sub/file && + git -C sub add a && + git -C sub fsck +' + test_index_version () { INDEX_VERSION_CONFIG=$1 && FEATURE_MANY_FILES=$2 && -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH v5 3/4] test-lib-functions: add helper for trailing hash 2023-01-06 16:31 ` [PATCH v5 " Derrick Stolee via GitGitGadget 2023-01-06 16:31 ` [PATCH v5 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget 2023-01-06 16:31 ` [PATCH v5 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget @ 2023-01-06 16:31 ` Derrick Stolee via GitGitGadget 2023-01-06 16:31 ` [PATCH v5 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget 2023-01-15 9:31 ` [PATCH v5 0/4] Optionally skip hashing index on write Junio C Hamano 4 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2023-01-06 16:31 UTC (permalink / raw) To: git Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> It can be helpful to check that a file format with a trailing hash has a specific hash in the final bytes of a written file. This is made more apparent by recent changes that allow skipping the hash algorithm and writing a null hash at the end of the file instead. Add a new test_trailing_hash helper and use it in t1600 to verify that index.skipHash=true really does skip the hash computation, since 'git fsck' does not actually verify the hash. This confirms that when the config is disabled explicitly in a super project but enabled in a submodule, then the use of repo_config_get_bool() loads config from the correct repository in the case of 'git add'. There are other cases where istate->repo is NULL and thus this config is loaded instead from the_repository, but that's due to many different code paths initializing index_state structs in their own way. Keep the 'git fsck' call to ensure that any potential future change to check the index hash does not cause an error in this case. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- t/t1600-index.sh | 5 +++++ t/test-lib-functions.sh | 8 ++++++++ 2 files changed, 13 insertions(+) diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 98c5a83db73..2f792bb8ffa 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -68,6 +68,9 @@ test_expect_success 'out of bounds index.version issues warning' ' test_expect_success 'index.skipHash config option' ' rm -f .git/index && git -c index.skipHash=true add a && + test_trailing_hash .git/index >hash && + echo $(test_oid zero) >expect && + test_cmp expect hash && git fsck && test_commit start && @@ -76,6 +79,8 @@ test_expect_success 'index.skipHash config option' ' git -C sub config index.skipHash true && >sub/file && git -C sub add a && + test_trailing_hash .git/modules/sub/index >hash && + test_cmp expect hash && git -C sub fsck ' diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh index 796093a7b32..60308843f8f 100644 --- a/t/test-lib-functions.sh +++ b/t/test-lib-functions.sh @@ -1875,3 +1875,11 @@ test_cmp_config_output () { sort config-actual >sorted-actual && test_cmp sorted-expect sorted-actual } + +# Given a filename, extract its trailing hash as a hex string +test_trailing_hash () { + local file="$1" && + tail -c $(test_oid rawsz) "$file" | + test-tool hexdump | + sed "s/ //g" +} -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH v5 4/4] features: feature.manyFiles implies fast index writes 2023-01-06 16:31 ` [PATCH v5 " Derrick Stolee via GitGitGadget ` (2 preceding siblings ...) 2023-01-06 16:31 ` [PATCH v5 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget @ 2023-01-06 16:31 ` Derrick Stolee via GitGitGadget 2023-01-15 9:31 ` [PATCH v5 0/4] Optionally skip hashing index on write Junio C Hamano 4 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee via GitGitGadget @ 2023-01-06 16:31 UTC (permalink / raw) To: git Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> The recent addition of the index.skipHash config option allows index writes to speed up by skipping the hash computation for the trailing checksum. This is particularly critical for repositories with many files at HEAD, so add this config option to two cases where users in that scenario may opt-in to such behavior: 1. The feature.manyFiles config option enables some options that are helpful for repositories with many files at HEAD. 2. 'scalar register' and 'scalar reconfigure' set config options that optimize for large repositories. In both of these cases, set index.skipHash=true to gain this speedup. Add tests that demonstrate the proper way that index.skipHash=true can override feature.manyFiles=true. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- Documentation/config/feature.txt | 5 +++++ read-cache.c | 3 ++- repo-settings.c | 2 ++ repository.h | 1 + scalar.c | 1 + t/t1600-index.sh | 11 +++++++++++ 6 files changed, 22 insertions(+), 1 deletion(-) diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt index 95975e50912..e52bc6b8584 100644 --- a/Documentation/config/feature.txt +++ b/Documentation/config/feature.txt @@ -23,6 +23,11 @@ feature.manyFiles:: working directory. With many files, commands such as `git status` and `git checkout` may be slow and these new defaults improve performance: + +* `index.skipHash=true` speeds up index writes by not computing a trailing + checksum. Note that this will cause Git versions earlier than 2.13.0 to + refuse to parse the index and Git versions earlier than 2.40.0 will report + a corrupted index during `git fsck`. ++ * `index.version=4` enables path-prefix compression in the index. + * `core.untrackedCache=true` enables the untracked cache. This setting assumes diff --git a/read-cache.c b/read-cache.c index d73a81e41ae..feefa0f68ba 100644 --- a/read-cache.c +++ b/read-cache.c @@ -2927,7 +2927,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile, f = hashfd(tempfile->fd, tempfile->filename.buf); - repo_config_get_bool(r, "index.skiphash", &f->skip_hash); + prepare_repo_settings(r); + f->skip_hash = r->settings.index_skip_hash; for (i = removed = extended = 0; i < entries; i++) { if (cache[i]->ce_flags & CE_REMOVE) diff --git a/repo-settings.c b/repo-settings.c index 3021921c53d..3dbd3f0e2ec 100644 --- a/repo-settings.c +++ b/repo-settings.c @@ -47,6 +47,7 @@ void prepare_repo_settings(struct repository *r) } if (manyfiles) { r->settings.index_version = 4; + r->settings.index_skip_hash = 1; r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE; } @@ -61,6 +62,7 @@ void prepare_repo_settings(struct repository *r) repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1); repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1); repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0); + repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash); /* * The GIT_TEST_MULTI_PACK_INDEX variable is special in that diff --git a/repository.h b/repository.h index 6c461c5b9de..e8c67ffe165 100644 --- a/repository.h +++ b/repository.h @@ -42,6 +42,7 @@ struct repo_settings { struct fsmonitor_settings *fsmonitor; /* lazily loaded */ int index_version; + int index_skip_hash; enum untracked_cache_setting core_untracked_cache; int pack_use_sparse; diff --git a/scalar.c b/scalar.c index 6c52243cdf1..b49bb8c24ec 100644 --- a/scalar.c +++ b/scalar.c @@ -143,6 +143,7 @@ static int set_recommended_config(int reconfigure) { "credential.validate", "false", 1 }, /* GCM4W-only */ { "gc.auto", "0", 1 }, { "gui.GCWarning", "false", 1 }, + { "index.skipHash", "false", 1 }, { "index.threads", "true", 1 }, { "index.version", "4", 1 }, { "merge.stat", "false", 1 }, diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 2f792bb8ffa..0ebbae13058 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -73,6 +73,17 @@ test_expect_success 'index.skipHash config option' ' test_cmp expect hash && git fsck && + rm -f .git/index && + git -c feature.manyFiles=true add a && + test_trailing_hash .git/index >hash && + cmp expect hash && + + rm -f .git/index && + git -c feature.manyFiles=true \ + -c index.skipHash=false add a && + test_trailing_hash .git/index >hash && + ! cmp expect hash && + test_commit start && git -c protocol.file.allow=always submodule add ./ sub && git config index.skipHash false && -- gitgitgadget ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v5 0/4] Optionally skip hashing index on write 2023-01-06 16:31 ` [PATCH v5 " Derrick Stolee via GitGitGadget ` (3 preceding siblings ...) 2023-01-06 16:31 ` [PATCH v5 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget @ 2023-01-15 9:31 ` Junio C Hamano 2023-01-17 14:49 ` Derrick Stolee 4 siblings, 1 reply; 59+ messages in thread From: Junio C Hamano @ 2023-01-15 9:31 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, vdye, avarab, newren, Jacob Keller, Derrick Stolee "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > Writing the index is a critical action that takes place in multiple Git > commands. The recent performance improvements available with the sparse > index show how often the I/O costs around the index can affect different Git > commands, although reading the index takes place more often than a write. https://github.com/git/git/actions/runs/3922482355/jobs/6705454550#step:4:2006 shows that all other platforms passed but osx-gcc failed one of the tests this series added. Could it be that there is something racy about the test (1006.6 if I am not mistaken)? ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 0/4] Optionally skip hashing index on write 2023-01-15 9:31 ` [PATCH v5 0/4] Optionally skip hashing index on write Junio C Hamano @ 2023-01-17 14:49 ` Derrick Stolee 0 siblings, 0 replies; 59+ messages in thread From: Derrick Stolee @ 2023-01-17 14:49 UTC (permalink / raw) To: Junio C Hamano, Derrick Stolee via GitGitGadget Cc: git, vdye, avarab, newren, Jacob Keller On 1/15/2023 4:31 AM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> Writing the index is a critical action that takes place in multiple Git >> commands. The recent performance improvements available with the sparse >> index show how often the I/O costs around the index can affect different Git >> commands, although reading the index takes place more often than a write. > > https://github.com/git/git/actions/runs/3922482355/jobs/6705454550#step:4:2006 > > shows that all other platforms passed but osx-gcc failed one of the > tests this series added. Could it be that there is something racy > about the test (1006.6 if I am not mistaken)? Yes, you are right. The patch below fixes the problem. Sorry about that! -Stolee -- >8 -- From 6827205f979ecfa66f4f9edabfb247cff05f0605 Mon Sep 17 00:00:00 2001 From: Derrick Stolee <derrickstolee@github.com> Date: Tue, 17 Jan 2023 09:46:04 -0500 Subject: [PATCH] t1600: fix racy index.skipHash test The test 1600.6 can fail under --stress due to mtime collisions. Most of the tests include a removal of the index file to guarantee that the index is updated. However, the submodule test addded in ee1f0c242ef (read-cache: add index.skipHash config option, 2023-01-06) did not include this removal. Thus, on rare occasions, the test can fail because the index still has a non-null trailing hash, as detected by the helper added in da9acde14ed (test-lib-functions: add helper for trailing hash, 2023-01-06). By removing the submodule's index before the 'git -C sub add a' command, we guarantee that the index is rewritten with the new index.skipHash config option. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- t/t1600-index.sh | 1 + 1 file changed, 1 insertion(+) diff --git a/t/t1600-index.sh b/t/t1600-index.sh index 0ebbae13058..9368d82f7d7 100755 --- a/t/t1600-index.sh +++ b/t/t1600-index.sh @@ -88,6 +88,7 @@ test_expect_success 'index.skipHash config option' ' git -c protocol.file.allow=always submodule add ./ sub && git config index.skipHash false && git -C sub config index.skipHash true && + rm -f .git/modules/sub/index && >sub/file && git -C sub add a && test_trailing_hash .git/modules/sub/index >hash && -- 2.38.0.vfs.0.0 ^ permalink raw reply related [flat|nested] 59+ messages in thread
end of thread, other threads:[~2023-01-17 14:50 UTC | newest] Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget 2022-12-07 17:25 ` [PATCH 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget 2022-12-07 22:13 ` Ævar Arnfjörð Bjarmason 2022-12-08 7:32 ` Jeff King 2022-12-07 17:25 ` [PATCH 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget 2022-12-07 18:59 ` Eric Sunshine 2022-12-12 13:59 ` Derrick Stolee 2022-12-12 18:55 ` Eric Sunshine 2022-12-07 22:25 ` Ævar Arnfjörð Bjarmason 2022-12-07 23:06 ` Ævar Arnfjörð Bjarmason 2022-12-08 0:05 ` Junio C Hamano 2022-12-12 14:05 ` Derrick Stolee 2022-12-12 18:01 ` Ævar Arnfjörð Bjarmason 2022-12-07 17:25 ` [PATCH 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget 2022-12-07 22:27 ` Ævar Arnfjörð Bjarmason 2022-12-12 14:10 ` Derrick Stolee 2022-12-07 17:25 ` [PATCH 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget 2022-12-07 22:30 ` Ævar Arnfjörð Bjarmason 2022-12-12 14:18 ` Derrick Stolee 2022-12-12 18:27 ` Ævar Arnfjörð Bjarmason 2022-12-07 23:27 ` [PATCH 0/4] Optionally skip hashing index on write Junio C Hamano 2022-12-07 23:42 ` Ævar Arnfjörð Bjarmason 2022-12-08 16:38 ` Derrick Stolee 2022-12-12 22:22 ` Jacob Keller 2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget 2022-12-12 16:31 ` [PATCH v2 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget 2022-12-12 16:31 ` [PATCH v2 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget 2022-12-12 16:31 ` [PATCH v2 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget 2022-12-12 18:14 ` SZEDER Gábor 2022-12-13 0:55 ` Junio C Hamano 2022-12-17 17:37 ` SZEDER Gábor 2022-12-12 16:31 ` [PATCH v2 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget 2022-12-15 15:06 ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget 2022-12-15 15:06 ` [PATCH v3 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget 2022-12-15 15:06 ` [PATCH v3 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget 2022-12-15 16:12 ` Ævar Arnfjörð Bjarmason 2022-12-15 15:06 ` [PATCH v3 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget 2022-12-15 15:07 ` [PATCH v3 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget 2022-12-15 15:56 ` [PATCH v3 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason 2022-12-16 13:41 ` Derrick Stolee 2022-12-16 15:31 ` [PATCH v4 " Derrick Stolee via GitGitGadget 2022-12-16 15:31 ` [PATCH v4 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget 2022-12-16 15:31 ` [PATCH v4 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget 2022-12-16 15:31 ` [PATCH v4 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget 2022-12-16 15:31 ` [PATCH v4 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget 2022-12-16 15:43 ` [PATCH v4 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason 2023-01-06 15:33 ` Derrick Stolee 2023-01-06 22:45 ` Junio C Hamano 2023-01-06 23:40 ` Derrick Stolee 2023-01-09 17:15 ` Ævar Arnfjörð Bjarmason 2023-01-09 18:00 ` Derrick Stolee 2023-01-09 19:22 ` Ævar Arnfjörð Bjarmason 2023-01-06 16:31 ` [PATCH v5 " Derrick Stolee via GitGitGadget 2023-01-06 16:31 ` [PATCH v5 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget 2023-01-06 16:31 ` [PATCH v5 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget 2023-01-06 16:31 ` [PATCH v5 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget 2023-01-06 16:31 ` [PATCH v5 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget 2023-01-15 9:31 ` [PATCH v5 0/4] Optionally skip hashing index on write Junio C Hamano 2023-01-17 14:49 ` Derrick Stolee
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).