git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/4] Optionally skip hashing index on write
@ 2022-12-07 17:25 Derrick Stolee via GitGitGadget
  2022-12-07 17:25 ` [PATCH 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
                   ` (5 more replies)
  0 siblings, 6 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-07 17:25 UTC (permalink / raw)
  To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee

Writing the index is a critical action that takes place in multiple Git
commands. The recent performance improvements available with the sparse
index show how often the I/O costs around the index can affect different Git
commands, although reading the index takes place more often than a write.

The index is written through the hashfile API, both as a buffered write and
as a computation of its trailing hash. This trailing hash is an expectation
of the file format: several optional index extensions are provided using
indicators immediately preceding the trailing hash. 'git fsck' will complain
if the trailing hash does not match the contents of the file up to that
point.

However, computing the hash is expensive. This is even more expensive now
that we've moved to sha1dc instead of hardware-accelerated SHA1
computations.

This series provides a new config option, index.skipHash, that allows users
to disable computing the hash as Git writes the index. In this case, the
trailing hash is stored as the null hash (all bytes are zero).

The implementation is rather simple, but the patches organize different
aspects in a way that we could split things out:

 * Patch 1 adds a 'skip_hash' option to 'struct hashfile' that creates this
   as a general option to the hashwrite API.
 * The motivation in Patch 1 avoids the talk about the chunk-format API and
   instead focuses on the index as the first example that could make use of
   this.
 * Patch 2 creates the index.skipHash config option and updates 'git fsck'
   to ignore a null hash on the index. This is demonstrated by a very simple
   test.
 * Patch 3 creates a test helper to get the trailing hash of a file, and
   adds a concrete check on the trailing hash when index.skipHash=true.
 * Patch 4 (which could be deferred until later) adds index.skipHash=true as
   an implication of feature.manyFiles and as a setting in Scalar.

The end result is that index writes speed up significantly, enough that 'git
update-index --force-write' improves by 1.75x.

Similar patches appeared in my earlier refs RFC [1].

[1]
https://lore.kernel.org/git/pull.1408.git.1667846164.gitgitgadget@gmail.com/


Updates since RFC
=================

 * The config name is now index.skipHash to make it more clear that the
   default is to not skip the hash, and choosing this option requires a true
   value.
 * Use oidread() instead of an ad-hoc loop.
 * Patches 3 and 4 are new. 3 adds some extra test verification that wasn't
   in the RFC.
 * I think that patch 4 is helpful to see now, but I could understand if we
   wanted to defer that patch until after index.skipHash has had more time
   to bake.

Thanks, -Stolee

Derrick Stolee (4):
  hashfile: allow skipping the hash function
  read-cache: add index.skipHash config option
  test-lib-functions: add helper for trailing hash
  features: feature.manyFiles implies fast index writes

 Documentation/config/feature.txt |  3 +++
 Documentation/config/index.txt   |  8 ++++++++
 csum-file.c                      | 14 +++++++++++---
 csum-file.h                      |  7 +++++++
 read-cache.c                     | 15 ++++++++++++++-
 repo-settings.c                  |  2 ++
 repository.h                     |  1 +
 scalar.c                         |  1 +
 t/t1600-index.sh                 | 22 ++++++++++++++++++++++
 t/test-lib-functions.sh          |  8 ++++++++
 10 files changed, 77 insertions(+), 4 deletions(-)


base-commit: 805265fcf7a737664a8321aaf4a0587b78435184
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1439%2Fderrickstolee%2Fhashfile-skip-hash-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1439/derrickstolee/hashfile-skip-hash-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1439
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 1/4] hashfile: allow skipping the hash function
  2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
@ 2022-12-07 17:25 ` Derrick Stolee via GitGitGadget
  2022-12-07 22:13   ` Ævar Arnfjörð Bjarmason
  2022-12-07 17:25 ` [PATCH 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-07 17:25 UTC (permalink / raw)
  To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The hashfile API is useful for generating files that include a trailing
hash of the file's contents up to that point. Using such a hash is
helpful for verifying the file for corruption-at-rest, such as a faulty
drive causing flipped bits.

Git's index file includes this trailing hash, so it uses a 'struct
hashfile' to handle the I/O to the file. This was very convenient to
allow using the hashfile methods during these operations.

However, hashing the file contents during write comes at a performance
penalty. It's slower to hash the bytes on their way to the disk than
without that step. This problem is made worse by the replacement of
hardware-accelerated SHA1 computations with the software-based sha1dc
computation.

This write cost is significant, and the checksum capability is likely
not worth that cost for such a short-lived file. The index is rewritten
frequently and the only time the checksum is checked is during 'git
fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
computation.

We first need to allow Git to opt-out of the hash computation in the
hashfile API. The buffered writes of the API are still helpful, so it
makes sense to make the change here.

Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the
update_fn and final_fn members of the_hash_algo are skipped. When
finalizing the hashfile, the trailing hash is replaced with the null
hash.

This use of a trailing null hash would be desireable in either case,
since we do not want to special case a file format to have a different
length depending on whether it was hashed or not. When the final bytes
of a file are all zero, we can infer that it was written without
hashing, and thus that verification is not available as a check for file
consistency. This also means that we could easily toggle hashing for any
file format we desire.

A version of this patch has existed in the microsoft/git fork since
2017 [1] (the linked commit was rebased in 2018, but the original dates
back to January 2017). Here, the change to make the index use this fast
path is delayed until a later change.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

Co-authored-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 csum-file.c | 14 +++++++++++---
 csum-file.h |  7 +++++++
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 59ef3398ca2..3243473c3d7 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -45,7 +45,8 @@ void hashflush(struct hashfile *f)
 	unsigned offset = f->offset;
 
 	if (offset) {
-		the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
+		if (!f->skip_hash)
+			the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	int fd;
 
 	hashflush(f);
-	the_hash_algo->final_fn(f->buffer, &f->ctx);
+
+	if (f->skip_hash)
+		memset(f->buffer, 0, the_hash_algo->rawsz);
+	else
+		the_hash_algo->final_fn(f->buffer, &f->ctx);
+
 	if (result)
 		hashcpy(result, f->buffer);
 	if (flags & CSUM_HASH_IN_STREAM)
@@ -108,7 +114,8 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * the hashfile's buffer. In this block,
 			 * f->offset is necessarily zero.
 			 */
-			the_hash_algo->update_fn(&f->ctx, buf, nr);
+			if (!f->skip_hash)
+				the_hash_algo->update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->tp = tp;
 	f->name = name;
 	f->do_crc = 0;
+	f->skip_hash = 0;
 	the_hash_algo->init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
diff --git a/csum-file.h b/csum-file.h
index 0d29f528fbc..29468067f81 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -20,6 +20,13 @@ struct hashfile {
 	size_t buffer_len;
 	unsigned char *buffer;
 	unsigned char *check_buffer;
+
+	/**
+	 * If set to 1, skip_hash indicates that we should
+	 * not actually compute the hash for this hashfile and
+	 * instead only use it as a buffered write.
+	 */
+	unsigned int skip_hash;
 };
 
 /* Checkpoint */
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 2/4] read-cache: add index.skipHash config option
  2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
  2022-12-07 17:25 ` [PATCH 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
@ 2022-12-07 17:25 ` Derrick Stolee via GitGitGadget
  2022-12-07 18:59   ` Eric Sunshine
                     ` (2 more replies)
  2022-12-07 17:25 ` [PATCH 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-07 17:25 UTC (permalink / raw)
  To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change allowed skipping the hashing portion of the
hashwrite API, using it instead as a buffered write API. Disabling the
hashwrite can be particularly helpful when the write operation is in a
critical path.

One such critical path is the writing of the index. This operation is so
critical that the sparse index was created specifically to reduce the
size of the index to make these writes (and reads) faster.

Following a similar approach to one used in the microsoft/git fork [1],
add a new config option (index.skipHash) that allows disabling this
hashing during the index write. The cost is that we can no longer
validate the contents for corruption-at-rest using the trailing hash.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

While older Git versions will not recognize the null hash as a special
case, the file format itself is still being met in terms of its
structure. Using this null hash will still allow Git operations to
function across older versions.

The one exception is 'git fsck' which checks the hash of the index file.
This used to be a check on every index read, but was split out to just
the index in a33fc72fe91 (read-cache: force_verify_index_checksum,
2017-04-14).

Here, we disable this check if the trailing hash is all zeroes. We add a
warning to the config option that this may cause undesirable behavior
with older Git versions.

As a quick comparison, I tested 'git update-index --force-write' with
and without index.computeHash=false on a copy of the Linux kernel
repository.

Benchmark 1: with hash
  Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
  Range (min … max):    34.3 ms …  79.1 ms    82 runs

Benchmark 2: without hash
  Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
  Range (min … max):    16.3 ms …  42.0 ms    69 runs

Summary
  'without hash' ran
    1.78 ± 0.76 times faster than 'with hash'

These performance benefits are substantial enough to allow users the
ability to opt-in to this feature, even with the potential confusion
with older 'git fsck' versions.

It is critical that this test is placed before the test_index_version
tests, since those tests obliterate the .git/config file and hence lose
the setting from GIT_TEST_DEFAULT_HASH, if set.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/index.txt |  8 ++++++++
 read-cache.c                   | 14 +++++++++++++-
 t/t1600-index.sh               |  8 ++++++++
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt
index 75f3a2d1054..3ea0962631d 100644
--- a/Documentation/config/index.txt
+++ b/Documentation/config/index.txt
@@ -30,3 +30,11 @@ index.version::
 	Specify the version with which new index files should be
 	initialized.  This does not affect existing repositories.
 	If `feature.manyFiles` is enabled, then the default is 4.
+
+index.skipHash::
+	When enabled, do not compute the trailing hash for the index file.
+	Instead, write a trailing set of bytes with value zero, indicating
+	that the computation was skipped.
++
+If you enable `index.skipHash`, then older Git clients may report that
+your index is corrupt during `git fsck`.
diff --git a/read-cache.c b/read-cache.c
index 46f5e497b14..fb4d6fb6387 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	git_hash_ctx c;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	int hdr_version;
+	unsigned char *start, *end;
+	struct object_id oid;
 
 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
 		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
@@ -1827,10 +1829,16 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	if (!verify_index_checksum)
 		return 0;
 
+	end = (unsigned char *)hdr + size;
+	start = end - the_hash_algo->rawsz;
+	oidread(&oid, start);
+	if (oideq(&oid, null_oid()))
+		return 0;
+
 	the_hash_algo->init_fn(&c);
 	the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
 	the_hash_algo->final_fn(hash, &c);
-	if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
+	if (!hasheq(hash, end - the_hash_algo->rawsz))
 		return error(_("bad index file sha1 signature"));
 	return 0;
 }
@@ -2915,9 +2923,13 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	int ieot_entries = 1;
 	struct index_entry_offset_table *ieot = NULL;
 	int nr, nr_threads;
+	int skip_hash;
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
+	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
+		f->skip_hash = skip_hash;
+
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
 			removed++;
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 010989f90e6..df07c587e0e 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -65,6 +65,14 @@ test_expect_success 'out of bounds index.version issues warning' '
 	)
 '
 
+test_expect_success 'index.skipHash config option' '
+	(
+		rm -f .git/index &&
+		git -c index.skipHash=true add a &&
+		git fsck
+	)
+'
+
 test_index_version () {
 	INDEX_VERSION_CONFIG=$1 &&
 	FEATURE_MANY_FILES=$2 &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 3/4] test-lib-functions: add helper for trailing hash
  2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
  2022-12-07 17:25 ` [PATCH 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
  2022-12-07 17:25 ` [PATCH 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
@ 2022-12-07 17:25 ` Derrick Stolee via GitGitGadget
  2022-12-07 22:27   ` Ævar Arnfjörð Bjarmason
  2022-12-07 17:25 ` [PATCH 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-07 17:25 UTC (permalink / raw)
  To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

It can be helpful to check that a file format with a trailing hash has a
specific hash in the final bytes of a written file. This is made more
apparent by recent changes that allow skipping the hash algorithm and
writing a null hash at the end of the file instead.

Add a new test_trailing_hash helper and use it in t1600 to verify that
index.skipHash=true really does skip the hash computation, since
'git fsck' does not actually verify the hash.

Keep the 'git fsck' call to ensure that any potential future change to
check the index hash does not cause an error in this case.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/t1600-index.sh        | 3 +++
 t/test-lib-functions.sh | 8 ++++++++
 2 files changed, 11 insertions(+)

diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index df07c587e0e..55816756607 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -69,6 +69,9 @@ test_expect_success 'index.skipHash config option' '
 	(
 		rm -f .git/index &&
 		git -c index.skipHash=true add a &&
+		test_trailing_hash .git/index >hash &&
+		echo $(test_oid zero) >expect &&
+		test_cmp expect hash &&
 		git fsck
 	)
 '
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 796093a7b32..e88acfdb68a 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1875,3 +1875,11 @@ test_cmp_config_output () {
 	sort config-actual >sorted-actual &&
 	test_cmp sorted-expect sorted-actual
 }
+
+# Given a filename, extract its trailing hash as a hex string
+test_trailing_hash () {
+	local file="$1" &&
+	tail -c $(test_oid rawsz) "$file" | \
+		test-tool hexdump | \
+		sed "s/ //g"
+}
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 4/4] features: feature.manyFiles implies fast index writes
  2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2022-12-07 17:25 ` [PATCH 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
@ 2022-12-07 17:25 ` Derrick Stolee via GitGitGadget
  2022-12-07 22:30   ` Ævar Arnfjörð Bjarmason
  2022-12-07 23:27 ` [PATCH 0/4] Optionally skip hashing index on write Junio C Hamano
  2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  5 siblings, 1 reply; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-07 17:25 UTC (permalink / raw)
  To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The recent addition of the index.skipHash config option allows index
writes to speed up by skipping the hash computation for the trailing
checksum. This is particularly critical for repositories with many files
at HEAD, so add this config option to two cases where users in that
scenario may opt-in to such behavior:

 1. The feature.manyFiles config option enables some options that are
    helpful for repositories with many files at HEAD.

 2. 'scalar register' and 'scalar reconfigure' set config options that
    optimize for large repositories.

In both of these cases, set index.skipHash=true to gain this
speedup. Add tests that demonstrate the proper way that
index.skipHash=true can override feature.manyFiles=true.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/feature.txt |  3 +++
 read-cache.c                     |  7 ++++---
 repo-settings.c                  |  2 ++
 repository.h                     |  1 +
 scalar.c                         |  1 +
 t/t1600-index.sh                 | 13 ++++++++++++-
 6 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt
index 95975e50912..f0e1d4cb2be 100644
--- a/Documentation/config/feature.txt
+++ b/Documentation/config/feature.txt
@@ -23,6 +23,9 @@ feature.manyFiles::
 	working directory. With many files, commands such as `git status` and
 	`git checkout` may be slow and these new defaults improve performance:
 +
+* `index.skipHash=true` speeds up index writes by not computing a trailing
+  checksum.
++
 * `index.version=4` enables path-prefix compression in the index.
 +
 * `core.untrackedCache=true` enables the untracked cache. This setting assumes
diff --git a/read-cache.c b/read-cache.c
index fb4d6fb6387..1844953fba7 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2923,12 +2923,13 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	int ieot_entries = 1;
 	struct index_entry_offset_table *ieot = NULL;
 	int nr, nr_threads;
-	int skip_hash;
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
-	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
-		f->skip_hash = skip_hash;
+	if (istate->repo) {
+		prepare_repo_settings(istate->repo);
+		f->skip_hash = istate->repo->settings.index_skip_hash;
+	}
 
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
diff --git a/repo-settings.c b/repo-settings.c
index 3021921c53d..3dbd3f0e2ec 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -47,6 +47,7 @@ void prepare_repo_settings(struct repository *r)
 	}
 	if (manyfiles) {
 		r->settings.index_version = 4;
+		r->settings.index_skip_hash = 1;
 		r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE;
 	}
 
@@ -61,6 +62,7 @@ void prepare_repo_settings(struct repository *r)
 	repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1);
 	repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1);
 	repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
+	repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
 
 	/*
 	 * The GIT_TEST_MULTI_PACK_INDEX variable is special in that
diff --git a/repository.h b/repository.h
index 6c461c5b9de..e8c67ffe165 100644
--- a/repository.h
+++ b/repository.h
@@ -42,6 +42,7 @@ struct repo_settings {
 	struct fsmonitor_settings *fsmonitor; /* lazily loaded */
 
 	int index_version;
+	int index_skip_hash;
 	enum untracked_cache_setting core_untracked_cache;
 
 	int pack_use_sparse;
diff --git a/scalar.c b/scalar.c
index 6c52243cdf1..b49bb8c24ec 100644
--- a/scalar.c
+++ b/scalar.c
@@ -143,6 +143,7 @@ static int set_recommended_config(int reconfigure)
 		{ "credential.validate", "false", 1 }, /* GCM4W-only */
 		{ "gc.auto", "0", 1 },
 		{ "gui.GCWarning", "false", 1 },
+		{ "index.skipHash", "false", 1 },
 		{ "index.threads", "true", 1 },
 		{ "index.version", "4", 1 },
 		{ "merge.stat", "false", 1 },
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 55816756607..be0a0a8a008 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -72,7 +72,18 @@ test_expect_success 'index.skipHash config option' '
 		test_trailing_hash .git/index >hash &&
 		echo $(test_oid zero) >expect &&
 		test_cmp expect hash &&
-		git fsck
+		git fsck &&
+
+		rm -f .git/index &&
+		git -c feature.manyFiles=true add a &&
+		test_trailing_hash .git/index >hash &&
+		test_cmp expect hash &&
+
+		rm -f .git/index &&
+		git -c feature.manyFiles=true \
+		    -c index.skipHash=false add a &&
+		test_trailing_hash .git/index >hash &&
+		! test_cmp expect hash
 	)
 '
 
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/4] read-cache: add index.skipHash config option
  2022-12-07 17:25 ` [PATCH 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
@ 2022-12-07 18:59   ` Eric Sunshine
  2022-12-12 13:59     ` Derrick Stolee
  2022-12-07 22:25   ` Ævar Arnfjörð Bjarmason
  2022-12-07 23:06   ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 59+ messages in thread
From: Eric Sunshine @ 2022-12-07 18:59 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, avarab, newren, Derrick Stolee

On Wed, Dec 7, 2022 at 12:27 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> The previous change allowed skipping the hashing portion of the
> hashwrite API, using it instead as a buffered write API. Disabling the
> hashwrite can be particularly helpful when the write operation is in a
> critical path.
>
> One such critical path is the writing of the index. This operation is so
> critical that the sparse index was created specifically to reduce the
> size of the index to make these writes (and reads) faster.
>
> Following a similar approach to one used in the microsoft/git fork [1],
> add a new config option (index.skipHash) that allows disabling this
> hashing during the index write. The cost is that we can no longer
> validate the contents for corruption-at-rest using the trailing hash.
> [...]
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
> diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt
> @@ -30,3 +30,11 @@ index.version::
> +index.skipHash::
> +       When enabled, do not compute the trailing hash for the index file.
> +       Instead, write a trailing set of bytes with value zero, indicating
> +       that the computation was skipped.
> ++
> +If you enable `index.skipHash`, then older Git clients may report that
> +your index is corrupt during `git fsck`.

This documentation is rather minimal. Given this description, are
readers going to understand the purpose of the option, when they
should use it, what the impact will be, when and why they should avoid
it, etc.?

> diff --git a/t/t1600-index.sh b/t/t1600-index.sh
> @@ -65,6 +65,14 @@ test_expect_success 'out of bounds index.version issues warning' '
> +test_expect_success 'index.skipHash config option' '
> +       (
> +               rm -f .git/index &&
> +               git -c index.skipHash=true add a &&
> +               git fsck
> +       )
> +'

What is the purpose of the subshell here?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 1/4] hashfile: allow skipping the hash function
  2022-12-07 17:25 ` [PATCH 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
@ 2022-12-07 22:13   ` Ævar Arnfjörð Bjarmason
  2022-12-08  7:32     ` Jeff King
  0 siblings, 1 reply; 59+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-07 22:13 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, newren, Derrick Stolee


On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> However, hashing the file contents during write comes at a performance
> penalty. It's slower to hash the bytes on their way to the disk than
> without that step. This problem is made worse by the replacement of
> hardware-accelerated SHA1 computations with the software-based sha1dc
> computation.

More on that lack of HW accel later...

> This write cost is significant

Don't you mean hashing cost, or do we also do additional writes if we do
the hashing?

> , and the checksum capability is likely
> not worth that cost for such a short-lived file. The index is rewritten
> frequently and the only time the checksum is checked is during 'git
> fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
> computation.

I didn't know that, and had assumed that we at least checked it on the
full read (and I found this bit of the commit message after writing the
last paragraphs here at the end, so maybe skipping this is fine...).

> [...]
> @@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
>  	int fd;
>  
>  	hashflush(f);
> -	the_hash_algo->final_fn(f->buffer, &f->ctx);
> +
> +	if (f->skip_hash)
> +		memset(f->buffer, 0, the_hash_algo->rawsz);

Here you're hardcoding a new version of null_oid(), but we can use it
instead. Perhaps:
	
	diff --git a/csum-file.c b/csum-file.c
	index 3243473c3d7..b54c4f66cbb 100644
	--- a/csum-file.c
	+++ b/csum-file.c
	@@ -63,11 +63,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
	 		      enum fsync_component component, unsigned int flags)
	 {
	 	int fd;
	+	const struct object_id *const noid = null_oid();
	 
	 	hashflush(f);
	 
	 	if (f->skip_hash)
	-		memset(f->buffer, 0, the_hash_algo->rawsz);
	+		memcpy(f->buffer, noid, sizeof(*noid));
	 	else
	 		the_hash_algo->final_fn(f->buffer, &f->ctx);
	 
> @@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
>  	f->tp = tp;
>  	f->name = name;
>  	f->do_crc = 0;
> +	f->skip_hash = 0;
>  	the_hash_algo->init_fn(&f->ctx);
>  
>  	f->buffer_len = buffer_len;

I think I pointed out in the RFC that we'd be much faster with
non-sha1collisiondetection, and that maybe this would get us partway to
the performance you desired (or maybe we'd think that was a more
acceptable trade-off, as it didn't make the format
backwards-incompatible).

But just from seeing "do_crc" here in the context, did you benchmark it
against that? How does it perform?

There's no place to put that in the index, but we *could* encode it in
the hash, just with a lot of leading zeros.

Maybe that would give us some/most of the performance benefits, with the
benefit of a checksum?

Or maybe not, but I think it's worth exploring & supporting a different
& faster SHA-1 implementation before making (even opt-in) backwards
incompatible format changes for performance reasons, and if even that's
too slow maybe crc32 would be sufficient (but not compatible), but still
safer?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/4] read-cache: add index.skipHash config option
  2022-12-07 17:25 ` [PATCH 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
  2022-12-07 18:59   ` Eric Sunshine
@ 2022-12-07 22:25   ` Ævar Arnfjörð Bjarmason
  2022-12-07 23:06   ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 59+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-07 22:25 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, newren, Derrick Stolee


On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> diff --git a/read-cache.c b/read-cache.c
> index 46f5e497b14..fb4d6fb6387 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
>  	git_hash_ctx c;
>  	unsigned char hash[GIT_MAX_RAWSZ];
>  	int hdr_version;
> +	unsigned char *start, *end;
> +	struct object_id oid;
>  
>  	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
>  		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
> @@ -1827,10 +1829,16 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
>  	if (!verify_index_checksum)
>  		return 0;
>  
> +	end = (unsigned char *)hdr + size;
> +	start = end - the_hash_algo->rawsz;
> +	oidread(&oid, start);
> +	if (oideq(&oid, null_oid()))
> +		return 0;

It's good to see this use the existing hashing support, as I suggested
in the RFC comments. Glad it helped.

>  	int ieot_entries = 1;
>  	struct index_entry_offset_table *ieot = NULL;
>  	int nr, nr_threads;
> +	int skip_hash;

You don't need this variable.
>  
>  	f = hashfd(tempfile->fd, tempfile->filename.buf);
>  
> +	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
> +		f->skip_hash = skip_hash;

Because this can just be:

	git_config_get_maybe_bool("index.skiphash", &f->skip_hash);

I.e. the config API guarantees that it won't touch the variable if the
key doesn't exist, so no need for the intermediate variable.

In a later commit you convert this to that very API use when moving this
to the repo-settings.

Can we maybe skip straight to that step?

> +test_expect_success 'index.skipHash config option' '
> +	(
> +		rm -f .git/index &&
> +		git -c index.skipHash=true add a &&
> +		git fsck
> +	)

Why the subshell?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/4] test-lib-functions: add helper for trailing hash
  2022-12-07 17:25 ` [PATCH 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
@ 2022-12-07 22:27   ` Ævar Arnfjörð Bjarmason
  2022-12-12 14:10     ` Derrick Stolee
  0 siblings, 1 reply; 59+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-07 22:27 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, newren, Derrick Stolee


On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> +# Given a filename, extract its trailing hash as a hex string
> +test_trailing_hash () {
> +	local file="$1" &&
> +	tail -c $(test_oid rawsz) "$file" | \
> +		test-tool hexdump | \

You don't need the "\"'s here.

> +		sed "s/ //g"
> +}

At the end of this series we have one test file using this
test_trailing_hash, can't we just add it to t1600-index.sh?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/4] features: feature.manyFiles implies fast index writes
  2022-12-07 17:25 ` [PATCH 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
@ 2022-12-07 22:30   ` Ævar Arnfjörð Bjarmason
  2022-12-12 14:18     ` Derrick Stolee
  0 siblings, 1 reply; 59+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-07 22:30 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, newren, Derrick Stolee


On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> diff --git a/read-cache.c b/read-cache.c
> index fb4d6fb6387..1844953fba7 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -2923,12 +2923,13 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
>  	int ieot_entries = 1;
>  	struct index_entry_offset_table *ieot = NULL;
>  	int nr, nr_threads;
> -	int skip_hash;
>  
>  	f = hashfd(tempfile->fd, tempfile->filename.buf);
>  
> -	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
> -		f->skip_hash = skip_hash;
> +	if (istate->repo) {
> +		prepare_repo_settings(istate->repo);
> +		f->skip_hash = istate->repo->settings.index_skip_hash;
> +	}

Urm, are we ever going to find ourselves in a situation where:

 * We have read the settings for the_repository
 * We have an index we're about to write out as our "main index", but
   the istate->repo *isn't* the_repository.
 * Even then, wouldn't the two copies of the repos have read the same
   repo settings?

But maybe there's a really obvious submodule / worktree / whatever edge
case I'm missing.

But if not, shouldn't we just always read/write this from
the_repository?

> +		rm -f .git/index &&
> +		git -c feature.manyFiles=true \
> +		    -c index.skipHash=false add a &&
> +		test_trailing_hash .git/index >hash &&
> +		! test_cmp expect hash

We had a parallel thread where we discussed "! test_cmp" being an
anti-pattern, i.e. you want them not to be the same, but you want it to
still show a diff, Maybe just "! cmp" ?

I.e. either the diff will be meaningless, or we really should be
asserting the actual value we want, not what it shouldn't be.

so in this case, shouldn't we assert that it's the 0000... value, or the
actual hash (depending on which way around we're testing this)?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/4] read-cache: add index.skipHash config option
  2022-12-07 17:25 ` [PATCH 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
  2022-12-07 18:59   ` Eric Sunshine
  2022-12-07 22:25   ` Ævar Arnfjörð Bjarmason
@ 2022-12-07 23:06   ` Ævar Arnfjörð Bjarmason
  2022-12-08  0:05     ` Junio C Hamano
  2022-12-12 14:05     ` Derrick Stolee
  2 siblings, 2 replies; 59+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-07 23:06 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, newren, Derrick Stolee


On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> While older Git versions will not recognize the null hash as a special
> case, the file format itself is still being met in terms of its
> structure. Using this null hash will still allow Git operations to
> function across older versions.

That's good news, but...

> The one exception is 'git fsck' which checks the hash of the index file.
> This used to be a check on every index read, but was split out to just
> the index in a33fc72fe91 (read-cache: force_verify_index_checksum,
> 2017-04-14).

...uh, what?

Is there an implied claim here that versions before v2.13.0 don't count
as "older versions"?

I.e. doesn't v2.12.0 hard fail the verification for all index writing?
It's only after v2.13.0 that we do it only for the fsck.

That seems like a rather significant caveat that we should be noting
prominently in the docs added in 4/4.

> As a quick comparison, I tested 'git update-index --force-write' with
> and without index.computeHash=false on a copy of the Linux kernel
> repository.

It took me a bit to see why I was failing to reproduce this, before
finding that it's because you mention index.computeHash here, but it's
index.skipHash now.
>
> Benchmark 1: with hash
>   Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
>   Range (min … max):    34.3 ms …  79.1 ms    82 runs
>
> Benchmark 2: without hash
>   Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
>   Range (min … max):    16.3 ms …  42.0 ms    69 runs
>
> Summary
>   'without hash' ran
>     1.78 ± 0.76 times faster than 'with hash'

I suggested in
https://lore.kernel.org/git/221207.868rjiam86.gmgdl@evledraar.gmail.com/
earlier to benchmark this against not-sha1collisiondetection.

I don't think I get HW-accelerated SHA-1 on that box with OPENSSL (how
do I check...?). The results on my main development box are:
	
	$ hyperfine -L g sha1dc,openssl -L v false,true -n '{g} with index.skipHash={v}' './git.{g} -c index.skipHash={v} -C /dev/shm/linux-mem update-index --force-write' -w 5 -r 10
	Benchmark 1: sha1dc with index.skipHash=false
	  Time (mean ± σ):      37.0 ms ±   2.3 ms    [User: 30.8 ms, System: 6.0 ms]
	  Range (min … max):    35.1 ms …  41.4 ms    10 runs
	 
	Benchmark 2: openssl with index.skipHash=false
	  Time (mean ± σ):      21.5 ms ±   0.4 ms    [User: 15.0 ms, System: 6.3 ms]
	  Range (min … max):    20.7 ms …  22.0 ms    10 runs
	 
	Benchmark 3: sha1dc with index.skipHash=true
	  Time (mean ± σ):      13.5 ms ±   0.4 ms    [User: 7.9 ms, System: 5.4 ms]
	  Range (min … max):    13.0 ms …  14.2 ms    10 runs
	 
	Benchmark 4: openssl with index.skipHash=true
	  Time (mean ± σ):      14.2 ms ±   0.3 ms    [User: 9.5 ms, System: 4.6 ms]
	  Range (min … max):    13.6 ms …  14.6 ms    10 runs
	 
	Summary
	  'sha1dc with index.skipHash=true' ran
	    1.05 ± 0.04 times faster than 'openssl with index.skipHash=true'
	    1.60 ± 0.05 times faster than 'openssl with index.skipHash=false'
	    2.75 ± 0.19 times faster than 'sha1dc with index.skipHash=false'

So, curiously it's proportionally much slower for me with the hash
checking, and skipping it with openssl is on the order of the results
you see.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/4] Optionally skip hashing index on write
  2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2022-12-07 17:25 ` [PATCH 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
@ 2022-12-07 23:27 ` Junio C Hamano
  2022-12-07 23:42   ` Ævar Arnfjörð Bjarmason
  2022-12-08 16:38   ` Derrick Stolee
  2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  5 siblings, 2 replies; 59+ messages in thread
From: Junio C Hamano @ 2022-12-07 23:27 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, vdye, avarab, newren, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Writing the index is a critical action that takes place in multiple Git
> commands. The recent performance improvements available with the sparse
> index show how often the I/O costs around the index can affect different Git
> commands, although reading the index takes place more often than a write.

The sparse-index work is great in that it offers correctness while
taking advantage of the knowledge of which part of the tree is
quiescent and unused to boost performance.  I am not sure a change
to reduce file safety can be compared with it, in that one is pure
improvement, while the other is trade-off.

As long as we will keep the "create into a new file, write it fully
and fsync + rename to the final" pattern, we do not need the trailing
checksum to protect us from a truncated output due to index-writing
process dying in the middle, so I do not mind that trade-off, though.

Protecting files from bit flipping filesystem corruption is a
different matter.  Folks at hosting sites like GitHub would know how
often they detect object corruption (I presume they do not have to
deal with the index file on the server end that often, but loose and
pack object files have the trailing checksums the same way) thanks
to the trailing checksum, and what the consequences are if we lost
that safety (I am guessing it would be minimum, though).

Thanks.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/4] Optionally skip hashing index on write
  2022-12-07 23:27 ` [PATCH 0/4] Optionally skip hashing index on write Junio C Hamano
@ 2022-12-07 23:42   ` Ævar Arnfjörð Bjarmason
  2022-12-08 16:38   ` Derrick Stolee
  1 sibling, 0 replies; 59+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-07 23:42 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, vdye, newren,
	Derrick Stolee


On Thu, Dec 08 2022, Junio C Hamano wrote:

> Protecting files from bit flipping filesystem corruption is a
> different matter.  Folks at hosting sites like GitHub would know how
> often they detect object corruption (I presume they do not have to
> deal with the index file on the server end that often, but loose and
> pack object files have the trailing checksums the same way) thanks
> to the trailing checksum, and what the consequences are if we lost
> that safety (I am guessing it would be minimum, though).

I don't think this checksum does much for us in practice, but just on
this point in general: Extrapolating results at <hosting site> when it
comes to making general decisions about git's data safety isn't a good
idea.

I don't know about GitHub's hardware, but servers almost universally use
ECC ram, and tend to use things like error-correcting filesystem RAID
etc.

Data in that area is really interesting when it comes to running git in
that sort of setup, but it really shouldn't be extrapolated to git's
userbase in general.

A lot of those users will be using cheap memory and/or storage devices
without any error correction.

They're also likely to stress our reliability guarantees in other ways,
e.g. yanking their power cord (or equivalent), which a server typically
won't need to deal with.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/4] read-cache: add index.skipHash config option
  2022-12-07 23:06   ` Ævar Arnfjörð Bjarmason
@ 2022-12-08  0:05     ` Junio C Hamano
  2022-12-12 14:05     ` Derrick Stolee
  1 sibling, 0 replies; 59+ messages in thread
From: Junio C Hamano @ 2022-12-08  0:05 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee via GitGitGadget, git, vdye, newren,
	Derrick Stolee

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> Is there an implied claim here that versions before v2.13.0 don't count
> as "older versions"?
>
> I.e. doesn't v2.12.0 hard fail the verification for all index writing?
> It's only after v2.13.0 that we do it only for the fsck.
>
> That seems like a rather significant caveat that we should be noting
> prominently in the docs added in 4/4.

True enough.

It seems that we only did security releases from 2.30.x track and
upwards for the past year or so, and anything older may not matter
anymore.  Documenting it should be sufficient.

I actually was wondering what impact it would have if we made this
change unconditionally.

Thanks.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 1/4] hashfile: allow skipping the hash function
  2022-12-07 22:13   ` Ævar Arnfjörð Bjarmason
@ 2022-12-08  7:32     ` Jeff King
  0 siblings, 0 replies; 59+ messages in thread
From: Jeff King @ 2022-12-08  7:32 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee via GitGitGadget, git, gitster, vdye, newren,
	Derrick Stolee

On Wed, Dec 07, 2022 at 11:13:15PM +0100, Ævar Arnfjörð Bjarmason wrote:

> > +	if (f->skip_hash)
> > +		memset(f->buffer, 0, the_hash_algo->rawsz);
> 
> Here you're hardcoding a new version of null_oid(), but we can use it
> instead. Perhaps:
> 	
> 	diff --git a/csum-file.c b/csum-file.c
> 	index 3243473c3d7..b54c4f66cbb 100644
> 	--- a/csum-file.c
> 	+++ b/csum-file.c
> 	@@ -63,11 +63,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
> 	 		      enum fsync_component component, unsigned int flags)
> 	 {
> 	 	int fd;
> 	+	const struct object_id *const noid = null_oid();
> 	 
> 	 	hashflush(f);
> 	 
> 	 	if (f->skip_hash)
> 	-		memset(f->buffer, 0, the_hash_algo->rawsz);
> 	+		memcpy(f->buffer, noid, sizeof(*noid));
> 	 	else
> 	 		the_hash_algo->final_fn(f->buffer, &f->ctx);

I don't think that's quite right; the object_id struct has other stuff
in it.  You'd probably want:

  hashcpy(f->buffer, null_oid());

But I think we already have a helper to just do it directly:

  hashclr(f->buffer);

-Peff

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/4] Optionally skip hashing index on write
  2022-12-07 23:27 ` [PATCH 0/4] Optionally skip hashing index on write Junio C Hamano
  2022-12-07 23:42   ` Ævar Arnfjörð Bjarmason
@ 2022-12-08 16:38   ` Derrick Stolee
  2022-12-12 22:22     ` Jacob Keller
  1 sibling, 1 reply; 59+ messages in thread
From: Derrick Stolee @ 2022-12-08 16:38 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget; +Cc: git, vdye, avarab, newren

On 12/7/2022 6:27 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> Writing the index is a critical action that takes place in multiple Git
>> commands. The recent performance improvements available with the sparse
>> index show how often the I/O costs around the index can affect different Git
>> commands, although reading the index takes place more often than a write.
> 
> The sparse-index work is great in that it offers correctness while
> taking advantage of the knowledge of which part of the tree is
> quiescent and unused to boost performance.  I am not sure a change
> to reduce file safety can be compared with it, in that one is pure
> improvement, while the other is trade-off.

I agree that this is a trade-off, and we should both be careful about
whether or not we even make this a possibility for certain file
formats. The index is an interesting case for a couple reasons:

1. Writes block users. Writing the index takes place in many user-
   blocking foreground operations. The speed improvement directly
   impacts their use. Other file formats are typically written in
   the background (commit-graph, multi-pack-index) or are super-
   critical to correctness (pack-files).

2. Index files are short lived. It is rare that a user leaves an
   index for a long time with many staged changes. That's the condition
   that's required for losing an index file to cause a loss of work
   (or maybe I'm missing something). Outside of staged changes, the
   index can be completely destroyed and rewritten with minimal impact
   to the user.

> As long as we will keep the "create into a new file, write it fully
> and fsync + rename to the final" pattern, we do not need the trailing
> checksum to protect us from a truncated output due to index-writing
> process dying in the middle, so I do not mind that trade-off, though.
> 
> Protecting files from bit flipping filesystem corruption is a
> different matter.  Folks at hosting sites like GitHub would know how
> often they detect object corruption (I presume they do not have to
> deal with the index file on the server end that often, but loose and
> pack object files have the trailing checksums the same way) thanks
> to the trailing checksum, and what the consequences are if we lost
> that safety (I am guessing it would be minimum, though).

I agree that we need to be careful about which files get this
treatement.

But I also want to point out that I'm not using hosting servers as
evidence that this has worked in practice, but instead many developer
machines in large monorepos who have had this enabled (via the
microsoft/git fork) for years. We've not come across an instance where
this loss of a trailing hash has been an issue.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/4] read-cache: add index.skipHash config option
  2022-12-07 18:59   ` Eric Sunshine
@ 2022-12-12 13:59     ` Derrick Stolee
  2022-12-12 18:55       ` Eric Sunshine
  0 siblings, 1 reply; 59+ messages in thread
From: Derrick Stolee @ 2022-12-12 13:59 UTC (permalink / raw)
  To: Eric Sunshine, Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, avarab, newren

On 12/7/2022 1:59 PM, Eric Sunshine wrote:
> On Wed, Dec 7, 2022 at 12:27 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> +If you enable `index.skipHash`, then older Git clients may report that
>> +your index is corrupt during `git fsck`.
> 
> This documentation is rather minimal. Given this description, are
> readers going to understand the purpose of the option, when they
> should use it, what the impact will be, when and why they should avoid
> it, etc.?

I will expand this with explicit version numbers for older Git versions.

>> diff --git a/t/t1600-index.sh b/t/t1600-index.sh
>> @@ -65,6 +65,14 @@ test_expect_success 'out of bounds index.version issues warning' '
>> +test_expect_success 'index.skipHash config option' '
>> +       (
>> +               rm -f .git/index &&
>> +               git -c index.skipHash=true add a &&
>> +               git fsck
>> +       )
>> +'
> 
> What is the purpose of the subshell here?

I was matching the style of the nearby tests, but they are all
modifying GIT_INDEX_VERSION, which isn't necessary here.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/4] read-cache: add index.skipHash config option
  2022-12-07 23:06   ` Ævar Arnfjörð Bjarmason
  2022-12-08  0:05     ` Junio C Hamano
@ 2022-12-12 14:05     ` Derrick Stolee
  2022-12-12 18:01       ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 59+ messages in thread
From: Derrick Stolee @ 2022-12-12 14:05 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason,
	Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, newren

On 12/7/2022 6:06 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <derrickstolee@github.com>
>> [...]
>> While older Git versions will not recognize the null hash as a special
>> case, the file format itself is still being met in terms of its
>> structure. Using this null hash will still allow Git operations to
>> function across older versions.
> 
> That's good news, but...
> 
>> The one exception is 'git fsck' which checks the hash of the index file.
>> This used to be a check on every index read, but was split out to just
>> the index in a33fc72fe91 (read-cache: force_verify_index_checksum,
>> 2017-04-14).
> 
> ...uh, what?
> 
> Is there an implied claim here that versions before v2.13.0 don't count
> as "older versions"?
> 
> I.e. doesn't v2.12.0 hard fail the verification for all index writing?
> It's only after v2.13.0 that we do it only for the fsck.
> 
> That seems like a rather significant caveat that we should be noting
> prominently in the docs added in 4/4.

I can add those details.
 
>> As a quick comparison, I tested 'git update-index --force-write' with
>> and without index.computeHash=false on a copy of the Linux kernel
>> repository.
> 
> It took me a bit to see why I was failing to reproduce this, before
> finding that it's because you mention index.computeHash here, but it's
> index.skipHash now.
>>
>> Benchmark 1: with hash
>>   Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
>>   Range (min … max):    34.3 ms …  79.1 ms    82 runs
>>
>> Benchmark 2: without hash
>>   Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
>>   Range (min … max):    16.3 ms …  42.0 ms    69 runs
>>
>> Summary
>>   'without hash' ran
>>     1.78 ± 0.76 times faster than 'with hash'
> 
> I suggested in
> https://lore.kernel.org/git/221207.868rjiam86.gmgdl@evledraar.gmail.com/
> earlier to benchmark this against not-sha1collisiondetection.

Generally, I'm avoiding that benchmark because sha1dc is here to stay.

If users want to go through the trouble of compiling to use the non-dc
version, then I would expect the difference to be less noticeable, but
still significant. However, I would strongly avoid considering compiling
both into the client by default, letting certain paths use sha1dc and
others using non-dc. Certain secure environments currently only use Git
under exceptions that allow SHA1 for "non-cryptographic" reasons, but
also with the understanding that sha1dc is used as a safety measure.
Adding the non-dc version back in would put that understanding at risk.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/4] test-lib-functions: add helper for trailing hash
  2022-12-07 22:27   ` Ævar Arnfjörð Bjarmason
@ 2022-12-12 14:10     ` Derrick Stolee
  0 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee @ 2022-12-12 14:10 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason,
	Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, newren

On 12/7/2022 5:27 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <derrickstolee@github.com>
>> [...]
>> +# Given a filename, extract its trailing hash as a hex string
>> +test_trailing_hash () {
>> +	local file="$1" &&
>> +	tail -c $(test_oid rawsz) "$file" | \
>> +		test-tool hexdump | \
> 
> You don't need the "\"'s here.

Thanks.
 
>> +		sed "s/ //g"
>> +}
> 
> At the end of this series we have one test file using this
> test_trailing_hash, can't we just add it to t1600-index.sh?

I imagine that it would be helpful to access the trailing hash
for testing other file formats in the future. It certainly
could have helped the incremental commit-graph work, so we
could check that the filenames do match the trailing hashes.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/4] features: feature.manyFiles implies fast index writes
  2022-12-07 22:30   ` Ævar Arnfjörð Bjarmason
@ 2022-12-12 14:18     ` Derrick Stolee
  2022-12-12 18:27       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 59+ messages in thread
From: Derrick Stolee @ 2022-12-12 14:18 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason,
	Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, newren

On 12/7/2022 5:30 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <derrickstolee@github.com>
>> [...]
>> diff --git a/read-cache.c b/read-cache.c
>> index fb4d6fb6387..1844953fba7 100644
>> --- a/read-cache.c
>> +++ b/read-cache.c
>> @@ -2923,12 +2923,13 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
>>  	int ieot_entries = 1;
>>  	struct index_entry_offset_table *ieot = NULL;
>>  	int nr, nr_threads;
>> -	int skip_hash;
>>  
>>  	f = hashfd(tempfile->fd, tempfile->filename.buf);
>>  
>> -	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
>> -		f->skip_hash = skip_hash;
>> +	if (istate->repo) {
>> +		prepare_repo_settings(istate->repo);
>> +		f->skip_hash = istate->repo->settings.index_skip_hash;
>> +	}
> 
> Urm, are we ever going to find ourselves in a situation where:
> 
>  * We have read the settings for the_repository
>  * We have an index we're about to write out as our "main index", but
>    the istate->repo *isn't* the_repository.
>  * Even then, wouldn't the two copies of the repos have read the same
>    repo settings?
> 
> But maybe there's a really obvious submodule / worktree / whatever edge
> case I'm missing.
> 
> But if not, shouldn't we just always read/write this from
> the_repository?

I don't understand your concern. We call prepare_repo_settings(istate->repo)
just before using these settings, so we are using whatever repository-local
config we have available to us.

If you're thinking that we could be writing an index but istate->repo is
somehow the "wrong" repo, then that is a larger problem. This patch is
doing the best thing it can with the information it is given.

>> +		rm -f .git/index &&
>> +		git -c feature.manyFiles=true \
>> +		    -c index.skipHash=false add a &&
>> +		test_trailing_hash .git/index >hash &&
>> +		! test_cmp expect hash
> 
> We had a parallel thread where we discussed "! test_cmp" being an
> anti-pattern, i.e. you want them not to be the same, but you want it to
> still show a diff, Maybe just "! cmp" ?

I couldn't tell from this sentence whether test_cmp or cmp would show
the diff, but from testing I see that test_cmp shows the diff (for
debugging purposes, I'm sure) while cmp shows the position of the first
difference.

"! cmp" would work here, since we don't care about what the real hash is.
 
> I.e. either the diff will be meaningless, or we really should be
> asserting the actual value we want, not what it shouldn't be.
> 
> so in this case, shouldn't we assert that it's the 0000... value, or the
> actual hash (depending on which way around we're testing this)?

When it should be the null hash, we assert that it is that value.

When it isn't, we do not assert the exact hash because we do not want
other modifications to the index (or surrounding tests) to cause that
hash to change, causing toil for future contributors. "! cmp" suffices
for this case to show that the config inheritance is working correctly.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 0/4] Optionally skip hashing index on write
  2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2022-12-07 23:27 ` [PATCH 0/4] Optionally skip hashing index on write Junio C Hamano
@ 2022-12-12 16:31 ` Derrick Stolee via GitGitGadget
  2022-12-12 16:31   ` [PATCH v2 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
                     ` (4 more replies)
  5 siblings, 5 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-12 16:31 UTC (permalink / raw)
  To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee

Writing the index is a critical action that takes place in multiple Git
commands. The recent performance improvements available with the sparse
index show how often the I/O costs around the index can affect different Git
commands, although reading the index takes place more often than a write.

The index is written through the hashfile API, both as a buffered write and
as a computation of its trailing hash. This trailing hash is an expectation
of the file format: several optional index extensions are provided using
indicators immediately preceding the trailing hash. 'git fsck' will complain
if the trailing hash does not match the contents of the file up to that
point.

However, computing the hash is expensive. This is even more expensive now
that we've moved to sha1dc instead of hardware-accelerated SHA1
computations.

This series provides a new config option, index.skipHash, that allows users
to disable computing the hash as Git writes the index. In this case, the
trailing hash is stored as the null hash (all bytes are zero).

The implementation is rather simple, but the patches organize different
aspects in a way that we could split things out:

 * Patch 1 adds a 'skip_hash' option to 'struct hashfile' that creates this
   as a general option to the hashwrite API.
 * The motivation in Patch 1 avoids the talk about the chunk-format API and
   instead focuses on the index as the first example that could make use of
   this.
 * Patch 2 creates the index.skipHash config option and updates 'git fsck'
   to ignore a null hash on the index. This is demonstrated by a very simple
   test.
 * Patch 3 creates a test helper to get the trailing hash of a file, and
   adds a concrete check on the trailing hash when index.skipHash=true.
 * Patch 4 (which could be deferred until later) adds index.skipHash=true as
   an implication of feature.manyFiles and as a setting in Scalar.

The end result is that index writes speed up significantly, enough that 'git
update-index --force-write' improves by 1.75x.

Similar patches appeared in my earlier refs RFC [1].

[1]
https://lore.kernel.org/git/pull.1408.git.1667846164.gitgitgadget@gmail.com/


Updates since v1
================

 * In Patch 1, use hashcler() instead of memset().
 * In Patches 2 and 4, be explicit about which Git versions will exhibit
   different behavior when reading an index with a null trailing hash.
 * In Patch 2, used a more compact way of loading from config.
 * In Patch 2 (and following) dropped the sub-shell in t1600-index.sh.
 * In Patch 3, dropped back-ticks when using a multi-line pipe.
 * In Patch 4, use "! cmp" instead of "! test_cmp"


Updates since RFC
=================

 * The config name is now index.skipHash to make it more clear that the
   default is to not skip the hash, and choosing this option requires a true
   value.
 * Use oidread() instead of an ad-hoc loop.
 * Patches 3 and 4 are new. 3 adds some extra test verification that wasn't
   in the RFC.
 * I think that patch 4 is helpful to see now, but I could understand if we
   wanted to defer that patch until after index.skipHash has had more time
   to bake.

Thanks, -Stolee

Derrick Stolee (4):
  hashfile: allow skipping the hash function
  read-cache: add index.skipHash config option
  test-lib-functions: add helper for trailing hash
  features: feature.manyFiles implies fast index writes

 Documentation/config/feature.txt |  5 +++++
 Documentation/config/index.txt   |  9 +++++++++
 csum-file.c                      | 14 +++++++++++---
 csum-file.h                      |  7 +++++++
 read-cache.c                     | 15 ++++++++++++++-
 repo-settings.c                  |  2 ++
 repository.h                     |  1 +
 scalar.c                         |  1 +
 t/t1600-index.sh                 | 20 ++++++++++++++++++++
 t/test-lib-functions.sh          |  8 ++++++++
 10 files changed, 78 insertions(+), 4 deletions(-)


base-commit: 805265fcf7a737664a8321aaf4a0587b78435184
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1439%2Fderrickstolee%2Fhashfile-skip-hash-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1439/derrickstolee/hashfile-skip-hash-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1439

Range-diff vs v1:

 1:  40ee8dbaef0 ! 1:  b582d681581 hashfile: allow skipping the hash function
     @@ csum-file.c: int finalize_hashfile(struct hashfile *f, unsigned char *result,
      -	the_hash_algo->final_fn(f->buffer, &f->ctx);
      +
      +	if (f->skip_hash)
     -+		memset(f->buffer, 0, the_hash_algo->rawsz);
     ++		hashclr(f->buffer);
      +	else
      +		the_hash_algo->final_fn(f->buffer, &f->ctx);
      +
 2:  5fb4b5a36ac ! 2:  c8a2fd3a470 read-cache: add index.skipHash config option
     @@ Commit message
          The one exception is 'git fsck' which checks the hash of the index file.
          This used to be a check on every index read, but was split out to just
          the index in a33fc72fe91 (read-cache: force_verify_index_checksum,
     -    2017-04-14).
     +    2017-04-14) and released first in Git 2.13.0. Document the versions that
     +    relaxed these restrictions, with the optimistic expectation that this
     +    change will be included in Git 2.40.0.
      
          Here, we disable this check if the trailing hash is all zeroes. We add a
          warning to the config option that this may cause undesirable behavior
          with older Git versions.
      
          As a quick comparison, I tested 'git update-index --force-write' with
     -    and without index.computeHash=false on a copy of the Linux kernel
     +    and without index.skipHash=true on a copy of the Linux kernel
          repository.
      
          Benchmark 1: with hash
     @@ Documentation/config/index.txt: index.version::
      +	Instead, write a trailing set of bytes with value zero, indicating
      +	that the computation was skipped.
      ++
     -+If you enable `index.skipHash`, then older Git clients may report that
     -+your index is corrupt during `git fsck`.
     ++If you enable `index.skipHash`, then Git clients older than 2.13.0 will
     ++refuse to parse the index and Git clients older than 2.40.0 will report an
     ++error during `git fsck`.
      
       ## read-cache.c ##
      @@ read-cache.c: static int verify_hdr(const struct cache_header *hdr, unsigned long size)
     @@ read-cache.c: static int verify_hdr(const struct cache_header *hdr, unsigned lon
       	return 0;
       }
      @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
     - 	int ieot_entries = 1;
     - 	struct index_entry_offset_table *ieot = NULL;
     - 	int nr, nr_threads;
     -+	int skip_hash;
       
       	f = hashfd(tempfile->fd, tempfile->filename.buf);
       
     -+	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
     -+		f->skip_hash = skip_hash;
     ++	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
      +
       	for (i = removed = extended = 0; i < entries; i++) {
       		if (cache[i]->ce_flags & CE_REMOVE)
     @@ t/t1600-index.sh: test_expect_success 'out of bounds index.version issues warnin
       '
       
      +test_expect_success 'index.skipHash config option' '
     -+	(
     -+		rm -f .git/index &&
     -+		git -c index.skipHash=true add a &&
     -+		git fsck
     -+	)
     ++	rm -f .git/index &&
     ++	git -c index.skipHash=true add a &&
     ++	git fsck
      +'
      +
       test_index_version () {
 3:  a20bf8de864 ! 3:  813e81a0582 test-lib-functions: add helper for trailing hash
     @@ Commit message
          Signed-off-by: Derrick Stolee <derrickstolee@github.com>
      
       ## t/t1600-index.sh ##
     -@@ t/t1600-index.sh: test_expect_success 'index.skipHash config option' '
     - 	(
     - 		rm -f .git/index &&
     - 		git -c index.skipHash=true add a &&
     -+		test_trailing_hash .git/index >hash &&
     -+		echo $(test_oid zero) >expect &&
     -+		test_cmp expect hash &&
     - 		git fsck
     - 	)
     +@@ t/t1600-index.sh: test_expect_success 'out of bounds index.version issues warning' '
     + test_expect_success 'index.skipHash config option' '
     + 	rm -f .git/index &&
     + 	git -c index.skipHash=true add a &&
     ++	test_trailing_hash .git/index >hash &&
     ++	echo $(test_oid zero) >expect &&
     ++	test_cmp expect hash &&
     + 	git fsck
       '
     + 
      
       ## t/test-lib-functions.sh ##
      @@ t/test-lib-functions.sh: test_cmp_config_output () {
     @@ t/test-lib-functions.sh: test_cmp_config_output () {
      +# Given a filename, extract its trailing hash as a hex string
      +test_trailing_hash () {
      +	local file="$1" &&
     -+	tail -c $(test_oid rawsz) "$file" | \
     -+		test-tool hexdump | \
     ++	tail -c $(test_oid rawsz) "$file" |
     ++		test-tool hexdump |
      +		sed "s/ //g"
      +}
 4:  77bf5d5ff27 ! 4:  e640dff53dd features: feature.manyFiles implies fast index writes
     @@ Documentation/config/feature.txt: feature.manyFiles::
       	`git checkout` may be slow and these new defaults improve performance:
       +
      +* `index.skipHash=true` speeds up index writes by not computing a trailing
     -+  checksum.
     ++  checksum. Note that this will cause Git versions earlier than 2.13.0 to
     ++  refuse to parse the index and Git versions earlier than 2.40.0 will report
     ++  a corrupted index during `git fsck`.
      ++
       * `index.version=4` enables path-prefix compression in the index.
       +
     @@ Documentation/config/feature.txt: feature.manyFiles::
      
       ## read-cache.c ##
      @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
     - 	int ieot_entries = 1;
     - 	struct index_entry_offset_table *ieot = NULL;
     - 	int nr, nr_threads;
     --	int skip_hash;
       
       	f = hashfd(tempfile->fd, tempfile->filename.buf);
       
     --	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
     --		f->skip_hash = skip_hash;
     +-	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
      +	if (istate->repo) {
      +		prepare_repo_settings(istate->repo);
      +		f->skip_hash = istate->repo->settings.index_skip_hash;
     @@ scalar.c: static int set_recommended_config(int reconfigure)
      
       ## t/t1600-index.sh ##
      @@ t/t1600-index.sh: test_expect_success 'index.skipHash config option' '
     - 		test_trailing_hash .git/index >hash &&
     - 		echo $(test_oid zero) >expect &&
     - 		test_cmp expect hash &&
     --		git fsck
     -+		git fsck &&
     + 	test_trailing_hash .git/index >hash &&
     + 	echo $(test_oid zero) >expect &&
     + 	test_cmp expect hash &&
     +-	git fsck
     ++	git fsck &&
      +
     -+		rm -f .git/index &&
     -+		git -c feature.manyFiles=true add a &&
     -+		test_trailing_hash .git/index >hash &&
     -+		test_cmp expect hash &&
     ++	rm -f .git/index &&
     ++	git -c feature.manyFiles=true add a &&
     ++	test_trailing_hash .git/index >hash &&
     ++	test_cmp expect hash &&
      +
     -+		rm -f .git/index &&
     -+		git -c feature.manyFiles=true \
     -+		    -c index.skipHash=false add a &&
     -+		test_trailing_hash .git/index >hash &&
     -+		! test_cmp expect hash
     - 	)
     ++	rm -f .git/index &&
     ++	git -c feature.manyFiles=true \
     ++		-c index.skipHash=false add a &&
     ++	test_trailing_hash .git/index >hash &&
     ++	! cmp expect hash
       '
       
     + test_index_version () {

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 1/4] hashfile: allow skipping the hash function
  2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget
@ 2022-12-12 16:31   ` Derrick Stolee via GitGitGadget
  2022-12-12 16:31   ` [PATCH v2 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-12 16:31 UTC (permalink / raw)
  To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The hashfile API is useful for generating files that include a trailing
hash of the file's contents up to that point. Using such a hash is
helpful for verifying the file for corruption-at-rest, such as a faulty
drive causing flipped bits.

Git's index file includes this trailing hash, so it uses a 'struct
hashfile' to handle the I/O to the file. This was very convenient to
allow using the hashfile methods during these operations.

However, hashing the file contents during write comes at a performance
penalty. It's slower to hash the bytes on their way to the disk than
without that step. This problem is made worse by the replacement of
hardware-accelerated SHA1 computations with the software-based sha1dc
computation.

This write cost is significant, and the checksum capability is likely
not worth that cost for such a short-lived file. The index is rewritten
frequently and the only time the checksum is checked is during 'git
fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
computation.

We first need to allow Git to opt-out of the hash computation in the
hashfile API. The buffered writes of the API are still helpful, so it
makes sense to make the change here.

Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the
update_fn and final_fn members of the_hash_algo are skipped. When
finalizing the hashfile, the trailing hash is replaced with the null
hash.

This use of a trailing null hash would be desireable in either case,
since we do not want to special case a file format to have a different
length depending on whether it was hashed or not. When the final bytes
of a file are all zero, we can infer that it was written without
hashing, and thus that verification is not available as a check for file
consistency. This also means that we could easily toggle hashing for any
file format we desire.

A version of this patch has existed in the microsoft/git fork since
2017 [1] (the linked commit was rebased in 2018, but the original dates
back to January 2017). Here, the change to make the index use this fast
path is delayed until a later change.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

Co-authored-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 csum-file.c | 14 +++++++++++---
 csum-file.h |  7 +++++++
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 59ef3398ca2..cce13c0f047 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -45,7 +45,8 @@ void hashflush(struct hashfile *f)
 	unsigned offset = f->offset;
 
 	if (offset) {
-		the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
+		if (!f->skip_hash)
+			the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	int fd;
 
 	hashflush(f);
-	the_hash_algo->final_fn(f->buffer, &f->ctx);
+
+	if (f->skip_hash)
+		hashclr(f->buffer);
+	else
+		the_hash_algo->final_fn(f->buffer, &f->ctx);
+
 	if (result)
 		hashcpy(result, f->buffer);
 	if (flags & CSUM_HASH_IN_STREAM)
@@ -108,7 +114,8 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * the hashfile's buffer. In this block,
 			 * f->offset is necessarily zero.
 			 */
-			the_hash_algo->update_fn(&f->ctx, buf, nr);
+			if (!f->skip_hash)
+				the_hash_algo->update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->tp = tp;
 	f->name = name;
 	f->do_crc = 0;
+	f->skip_hash = 0;
 	the_hash_algo->init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
diff --git a/csum-file.h b/csum-file.h
index 0d29f528fbc..29468067f81 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -20,6 +20,13 @@ struct hashfile {
 	size_t buffer_len;
 	unsigned char *buffer;
 	unsigned char *check_buffer;
+
+	/**
+	 * If set to 1, skip_hash indicates that we should
+	 * not actually compute the hash for this hashfile and
+	 * instead only use it as a buffered write.
+	 */
+	unsigned int skip_hash;
 };
 
 /* Checkpoint */
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v2 2/4] read-cache: add index.skipHash config option
  2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2022-12-12 16:31   ` [PATCH v2 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
@ 2022-12-12 16:31   ` Derrick Stolee via GitGitGadget
  2022-12-12 16:31   ` [PATCH v2 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-12 16:31 UTC (permalink / raw)
  To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change allowed skipping the hashing portion of the
hashwrite API, using it instead as a buffered write API. Disabling the
hashwrite can be particularly helpful when the write operation is in a
critical path.

One such critical path is the writing of the index. This operation is so
critical that the sparse index was created specifically to reduce the
size of the index to make these writes (and reads) faster.

Following a similar approach to one used in the microsoft/git fork [1],
add a new config option (index.skipHash) that allows disabling this
hashing during the index write. The cost is that we can no longer
validate the contents for corruption-at-rest using the trailing hash.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

While older Git versions will not recognize the null hash as a special
case, the file format itself is still being met in terms of its
structure. Using this null hash will still allow Git operations to
function across older versions.

The one exception is 'git fsck' which checks the hash of the index file.
This used to be a check on every index read, but was split out to just
the index in a33fc72fe91 (read-cache: force_verify_index_checksum,
2017-04-14) and released first in Git 2.13.0. Document the versions that
relaxed these restrictions, with the optimistic expectation that this
change will be included in Git 2.40.0.

Here, we disable this check if the trailing hash is all zeroes. We add a
warning to the config option that this may cause undesirable behavior
with older Git versions.

As a quick comparison, I tested 'git update-index --force-write' with
and without index.skipHash=true on a copy of the Linux kernel
repository.

Benchmark 1: with hash
  Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
  Range (min … max):    34.3 ms …  79.1 ms    82 runs

Benchmark 2: without hash
  Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
  Range (min … max):    16.3 ms …  42.0 ms    69 runs

Summary
  'without hash' ran
    1.78 ± 0.76 times faster than 'with hash'

These performance benefits are substantial enough to allow users the
ability to opt-in to this feature, even with the potential confusion
with older 'git fsck' versions.

It is critical that this test is placed before the test_index_version
tests, since those tests obliterate the .git/config file and hence lose
the setting from GIT_TEST_DEFAULT_HASH, if set.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/index.txt |  9 +++++++++
 read-cache.c                   | 12 +++++++++++-
 t/t1600-index.sh               |  6 ++++++
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt
index 75f3a2d1054..5d62489c302 100644
--- a/Documentation/config/index.txt
+++ b/Documentation/config/index.txt
@@ -30,3 +30,12 @@ index.version::
 	Specify the version with which new index files should be
 	initialized.  This does not affect existing repositories.
 	If `feature.manyFiles` is enabled, then the default is 4.
+
+index.skipHash::
+	When enabled, do not compute the trailing hash for the index file.
+	Instead, write a trailing set of bytes with value zero, indicating
+	that the computation was skipped.
++
+If you enable `index.skipHash`, then Git clients older than 2.13.0 will
+refuse to parse the index and Git clients older than 2.40.0 will report an
+error during `git fsck`.
diff --git a/read-cache.c b/read-cache.c
index 46f5e497b14..3f7de8b2e20 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	git_hash_ctx c;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	int hdr_version;
+	unsigned char *start, *end;
+	struct object_id oid;
 
 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
 		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
@@ -1827,10 +1829,16 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	if (!verify_index_checksum)
 		return 0;
 
+	end = (unsigned char *)hdr + size;
+	start = end - the_hash_algo->rawsz;
+	oidread(&oid, start);
+	if (oideq(&oid, null_oid()))
+		return 0;
+
 	the_hash_algo->init_fn(&c);
 	the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
 	the_hash_algo->final_fn(hash, &c);
-	if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
+	if (!hasheq(hash, end - the_hash_algo->rawsz))
 		return error(_("bad index file sha1 signature"));
 	return 0;
 }
@@ -2918,6 +2926,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
+	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
+
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
 			removed++;
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 010989f90e6..45feb0fc5d8 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -65,6 +65,12 @@ test_expect_success 'out of bounds index.version issues warning' '
 	)
 '
 
+test_expect_success 'index.skipHash config option' '
+	rm -f .git/index &&
+	git -c index.skipHash=true add a &&
+	git fsck
+'
+
 test_index_version () {
 	INDEX_VERSION_CONFIG=$1 &&
 	FEATURE_MANY_FILES=$2 &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v2 3/4] test-lib-functions: add helper for trailing hash
  2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2022-12-12 16:31   ` [PATCH v2 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
  2022-12-12 16:31   ` [PATCH v2 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
@ 2022-12-12 16:31   ` Derrick Stolee via GitGitGadget
  2022-12-12 18:14     ` SZEDER Gábor
  2022-12-12 16:31   ` [PATCH v2 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
  2022-12-15 15:06   ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
  4 siblings, 1 reply; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-12 16:31 UTC (permalink / raw)
  To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

It can be helpful to check that a file format with a trailing hash has a
specific hash in the final bytes of a written file. This is made more
apparent by recent changes that allow skipping the hash algorithm and
writing a null hash at the end of the file instead.

Add a new test_trailing_hash helper and use it in t1600 to verify that
index.skipHash=true really does skip the hash computation, since
'git fsck' does not actually verify the hash.

Keep the 'git fsck' call to ensure that any potential future change to
check the index hash does not cause an error in this case.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/t1600-index.sh        | 3 +++
 t/test-lib-functions.sh | 8 ++++++++
 2 files changed, 11 insertions(+)

diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 45feb0fc5d8..55914bc3506 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -68,6 +68,9 @@ test_expect_success 'out of bounds index.version issues warning' '
 test_expect_success 'index.skipHash config option' '
 	rm -f .git/index &&
 	git -c index.skipHash=true add a &&
+	test_trailing_hash .git/index >hash &&
+	echo $(test_oid zero) >expect &&
+	test_cmp expect hash &&
 	git fsck
 '
 
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 796093a7b32..60308843f8f 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1875,3 +1875,11 @@ test_cmp_config_output () {
 	sort config-actual >sorted-actual &&
 	test_cmp sorted-expect sorted-actual
 }
+
+# Given a filename, extract its trailing hash as a hex string
+test_trailing_hash () {
+	local file="$1" &&
+	tail -c $(test_oid rawsz) "$file" |
+		test-tool hexdump |
+		sed "s/ //g"
+}
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v2 4/4] features: feature.manyFiles implies fast index writes
  2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2022-12-12 16:31   ` [PATCH v2 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
@ 2022-12-12 16:31   ` Derrick Stolee via GitGitGadget
  2022-12-15 15:06   ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-12 16:31 UTC (permalink / raw)
  To: git; +Cc: gitster, vdye, avarab, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The recent addition of the index.skipHash config option allows index
writes to speed up by skipping the hash computation for the trailing
checksum. This is particularly critical for repositories with many files
at HEAD, so add this config option to two cases where users in that
scenario may opt-in to such behavior:

 1. The feature.manyFiles config option enables some options that are
    helpful for repositories with many files at HEAD.

 2. 'scalar register' and 'scalar reconfigure' set config options that
    optimize for large repositories.

In both of these cases, set index.skipHash=true to gain this
speedup. Add tests that demonstrate the proper way that
index.skipHash=true can override feature.manyFiles=true.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/feature.txt |  5 +++++
 read-cache.c                     |  5 ++++-
 repo-settings.c                  |  2 ++
 repository.h                     |  1 +
 scalar.c                         |  1 +
 t/t1600-index.sh                 | 13 ++++++++++++-
 6 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt
index 95975e50912..e52bc6b8584 100644
--- a/Documentation/config/feature.txt
+++ b/Documentation/config/feature.txt
@@ -23,6 +23,11 @@ feature.manyFiles::
 	working directory. With many files, commands such as `git status` and
 	`git checkout` may be slow and these new defaults improve performance:
 +
+* `index.skipHash=true` speeds up index writes by not computing a trailing
+  checksum. Note that this will cause Git versions earlier than 2.13.0 to
+  refuse to parse the index and Git versions earlier than 2.40.0 will report
+  a corrupted index during `git fsck`.
++
 * `index.version=4` enables path-prefix compression in the index.
 +
 * `core.untrackedCache=true` enables the untracked cache. This setting assumes
diff --git a/read-cache.c b/read-cache.c
index 3f7de8b2e20..1844953fba7 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2926,7 +2926,10 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
-	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
+	if (istate->repo) {
+		prepare_repo_settings(istate->repo);
+		f->skip_hash = istate->repo->settings.index_skip_hash;
+	}
 
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
diff --git a/repo-settings.c b/repo-settings.c
index 3021921c53d..3dbd3f0e2ec 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -47,6 +47,7 @@ void prepare_repo_settings(struct repository *r)
 	}
 	if (manyfiles) {
 		r->settings.index_version = 4;
+		r->settings.index_skip_hash = 1;
 		r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE;
 	}
 
@@ -61,6 +62,7 @@ void prepare_repo_settings(struct repository *r)
 	repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1);
 	repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1);
 	repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
+	repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
 
 	/*
 	 * The GIT_TEST_MULTI_PACK_INDEX variable is special in that
diff --git a/repository.h b/repository.h
index 6c461c5b9de..e8c67ffe165 100644
--- a/repository.h
+++ b/repository.h
@@ -42,6 +42,7 @@ struct repo_settings {
 	struct fsmonitor_settings *fsmonitor; /* lazily loaded */
 
 	int index_version;
+	int index_skip_hash;
 	enum untracked_cache_setting core_untracked_cache;
 
 	int pack_use_sparse;
diff --git a/scalar.c b/scalar.c
index 6c52243cdf1..b49bb8c24ec 100644
--- a/scalar.c
+++ b/scalar.c
@@ -143,6 +143,7 @@ static int set_recommended_config(int reconfigure)
 		{ "credential.validate", "false", 1 }, /* GCM4W-only */
 		{ "gc.auto", "0", 1 },
 		{ "gui.GCWarning", "false", 1 },
+		{ "index.skipHash", "false", 1 },
 		{ "index.threads", "true", 1 },
 		{ "index.version", "4", 1 },
 		{ "merge.stat", "false", 1 },
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 55914bc3506..103743a1c7d 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -71,7 +71,18 @@ test_expect_success 'index.skipHash config option' '
 	test_trailing_hash .git/index >hash &&
 	echo $(test_oid zero) >expect &&
 	test_cmp expect hash &&
-	git fsck
+	git fsck &&
+
+	rm -f .git/index &&
+	git -c feature.manyFiles=true add a &&
+	test_trailing_hash .git/index >hash &&
+	test_cmp expect hash &&
+
+	rm -f .git/index &&
+	git -c feature.manyFiles=true \
+		-c index.skipHash=false add a &&
+	test_trailing_hash .git/index >hash &&
+	! cmp expect hash
 '
 
 test_index_version () {
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/4] read-cache: add index.skipHash config option
  2022-12-12 14:05     ` Derrick Stolee
@ 2022-12-12 18:01       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 59+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-12 18:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, gitster, vdye, newren


On Mon, Dec 12 2022, Derrick Stolee wrote:

> On 12/7/2022 6:06 PM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:
>> 
>>> From: Derrick Stolee <derrickstolee@github.com>
>>> [...]
>>> While older Git versions will not recognize the null hash as a special
>>> case, the file format itself is still being met in terms of its
>>> structure. Using this null hash will still allow Git operations to
>>> function across older versions.
>> 
>> That's good news, but...
>> 
>>> The one exception is 'git fsck' which checks the hash of the index file.
>>> This used to be a check on every index read, but was split out to just
>>> the index in a33fc72fe91 (read-cache: force_verify_index_checksum,
>>> 2017-04-14).
>> 
>> ...uh, what?
>> 
>> Is there an implied claim here that versions before v2.13.0 don't count
>> as "older versions"?
>> 
>> I.e. doesn't v2.12.0 hard fail the verification for all index writing?
>> It's only after v2.13.0 that we do it only for the fsck.
>> 
>> That seems like a rather significant caveat that we should be noting
>> prominently in the docs added in 4/4.
>
> I can add those details.
>  
>>> As a quick comparison, I tested 'git update-index --force-write' with
>>> and without index.computeHash=false on a copy of the Linux kernel
>>> repository.
>> 
>> It took me a bit to see why I was failing to reproduce this, before
>> finding that it's because you mention index.computeHash here, but it's
>> index.skipHash now.
>>>
>>> Benchmark 1: with hash
>>>   Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
>>>   Range (min … max):    34.3 ms …  79.1 ms    82 runs
>>>
>>> Benchmark 2: without hash
>>>   Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
>>>   Range (min … max):    16.3 ms …  42.0 ms    69 runs
>>>
>>> Summary
>>>   'without hash' ran
>>>     1.78 ± 0.76 times faster than 'with hash'
>> 
>> I suggested in
>> https://lore.kernel.org/git/221207.868rjiam86.gmgdl@evledraar.gmail.com/
>> earlier to benchmark this against not-sha1collisiondetection.
>
> Generally, I'm avoiding that benchmark because sha1dc is here to stay.
>
> If users want to go through the trouble of compiling to use the non-dc
> version, then I would expect the difference to be less noticeable, but
> still significant. However, I would strongly avoid considering compiling
> both into the client by default, letting certain paths use sha1dc and
> others using non-dc. Certain secure environments currently only use Git
> under exceptions that allow SHA1 for "non-cryptographic" reasons, but
> also with the understanding that sha1dc is used as a safety measure.
> Adding the non-dc version back in would put that understanding at risk.

Doesn't using a checksum for our own index count as a "non-cryptographic
reason"? I.e. we control the .git/index file, and the context is that
we're checking if bytes we wrote to disk are corrupt since we last saw
them.

Even if hypothetically an attacker could craft files to go into the
index (knowing our envelope) in such a way as to craft a collision
between that index file and some other index file I don't see how that
would give the attacker anything. We'd still have a valid index, and
we'd probably be replacing that crafted index with a new one anyway.

I understand that some organizations have SHA-1 on some naughty list,
and using it again in non-SHA1DC contexts might trigger some audit.

So it wouldn't be something for everyone, and it's orthagonal to the
benefits of a new ref format or index format.

But if we're considering new formats, I think it's worth considering a
non-format change which doesn't get us all of the way of no
checksumming, but more than halfway there.

Maybe we'll still want the "don't do any checksumming", but maybe some
would find that enough (particularly if SHA-1 HW acceleration is
available).




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/4] test-lib-functions: add helper for trailing hash
  2022-12-12 16:31   ` [PATCH v2 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
@ 2022-12-12 18:14     ` SZEDER Gábor
  2022-12-13  0:55       ` Junio C Hamano
  0 siblings, 1 reply; 59+ messages in thread
From: SZEDER Gábor @ 2022-12-12 18:14 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, avarab, newren, Derrick Stolee

On Mon, Dec 12, 2022 at 04:31:16PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/t/t1600-index.sh b/t/t1600-index.sh
> index 45feb0fc5d8..55914bc3506 100755
> --- a/t/t1600-index.sh
> +++ b/t/t1600-index.sh
> @@ -68,6 +68,9 @@ test_expect_success 'out of bounds index.version issues warning' '
>  test_expect_success 'index.skipHash config option' '
>  	rm -f .git/index &&
>  	git -c index.skipHash=true add a &&
> +	test_trailing_hash .git/index >hash &&
> +	echo $(test_oid zero) >expect &&

Nit: test_oid zero >expect

> +	test_cmp expect hash &&
>  	git fsck
>  '

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/4] features: feature.manyFiles implies fast index writes
  2022-12-12 14:18     ` Derrick Stolee
@ 2022-12-12 18:27       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 59+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-12 18:27 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, gitster, vdye, newren


On Mon, Dec 12 2022, Derrick Stolee wrote:

> On 12/7/2022 5:30 PM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Wed, Dec 07 2022, Derrick Stolee via GitGitGadget wrote:
>> 
>>> From: Derrick Stolee <derrickstolee@github.com>
>>> [...]
>>> diff --git a/read-cache.c b/read-cache.c
>>> index fb4d6fb6387..1844953fba7 100644
>>> --- a/read-cache.c
>>> +++ b/read-cache.c
>>> @@ -2923,12 +2923,13 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
>>>  	int ieot_entries = 1;
>>>  	struct index_entry_offset_table *ieot = NULL;
>>>  	int nr, nr_threads;
>>> -	int skip_hash;
>>>  
>>>  	f = hashfd(tempfile->fd, tempfile->filename.buf);
>>>  
>>> -	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
>>> -		f->skip_hash = skip_hash;
>>> +	if (istate->repo) {
>>> +		prepare_repo_settings(istate->repo);
>>> +		f->skip_hash = istate->repo->settings.index_skip_hash;
>>> +	}
>> 
>> Urm, are we ever going to find ourselves in a situation where:
>> 
>>  * We have read the settings for the_repository
>>  * We have an index we're about to write out as our "main index", but
>>    the istate->repo *isn't* the_repository.
>>  * Even then, wouldn't the two copies of the repos have read the same
>>    repo settings?
>> 
>> But maybe there's a really obvious submodule / worktree / whatever edge
>> case I'm missing.
>> 
>> But if not, shouldn't we just always read/write this from
>> the_repository?
>
> I don't understand your concern. We call prepare_repo_settings(istate->repo)
> just before using these settings, so we are using whatever repository-local
> config we have available to us.
>
> If you're thinking that we could be writing an index but istate->repo is
> somehow the "wrong" repo, then that is a larger problem. This patch is
> doing the best thing it can with the information it is given.

It's not a concern, just confusion :)

In the preceding step (and this is still the case in your v2) we used
git_config_get_maybe_bool(), if we meant to use istate->repo shouldn't
we have used repo_config_get_maybe_bool() to begin with?

And will we ever get !istate->repo? If not should we BUG() here?
Otherwise the 4/4 changes this to a state where we'll no longer read the
index.skipHash setting if that "repo" is NULL, but our previous
the_repository was non-NULL...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/4] read-cache: add index.skipHash config option
  2022-12-12 13:59     ` Derrick Stolee
@ 2022-12-12 18:55       ` Eric Sunshine
  0 siblings, 0 replies; 59+ messages in thread
From: Eric Sunshine @ 2022-12-12 18:55 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, gitster, vdye, avarab,
	newren

On Mon, Dec 12, 2022 at 8:59 AM Derrick Stolee <derrickstolee@github.com> wrote:
> On 12/7/2022 1:59 PM, Eric Sunshine wrote:
> > On Wed, Dec 7, 2022 at 12:27 PM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >> +index.skipHash::
> >> +       When enabled, do not compute the trailing hash for the index file.
> >> +       Instead, write a trailing set of bytes with value zero, indicating
> >> +       that the computation was skipped.
> >> ++
> >> +If you enable `index.skipHash`, then older Git clients may report that
> >> +your index is corrupt during `git fsck`.
> >
> > This documentation is rather minimal. Given this description, are
> > readers going to understand the purpose of the option, when they
> > should use it, what the impact will be, when and why they should avoid
> > it, etc.?
>
> I will expand this with explicit version numbers for older Git versions.

Okay, but that doesn't address the larger questions I asked. The
documentation, as written, gives no explanation of the purpose of this
option. Since you conceived of the option and implemented it, you
implicitly understand its use-case and repercussions which might arise
from using it, but is the typical reader going to understand all that?
Namely, is the reader going to understand:

* why this option exists
* what problem it is trying to solve
* when to use it
* when not to use it
* what the repercussions are of not computing a hash for the index
* etc.

Are the answers to those questions documented somewhere? If so, then
the documentation for this option should link to that discussion (and
vice-versa). If not, then those questions should be answered by this
documentation.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/4] Optionally skip hashing index on write
  2022-12-08 16:38   ` Derrick Stolee
@ 2022-12-12 22:22     ` Jacob Keller
  0 siblings, 0 replies; 59+ messages in thread
From: Jacob Keller @ 2022-12-12 22:22 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git, vdye,
	avarab, newren

On Thu, Dec 8, 2022 at 8:50 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 12/7/2022 6:27 PM, Junio C Hamano wrote:
> > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >
> >> Writing the index is a critical action that takes place in multiple Git
> >> commands. The recent performance improvements available with the sparse
> >> index show how often the I/O costs around the index can affect different Git
> >> commands, although reading the index takes place more often than a write.
> >
> > The sparse-index work is great in that it offers correctness while
> > taking advantage of the knowledge of which part of the tree is
> > quiescent and unused to boost performance.  I am not sure a change
> > to reduce file safety can be compared with it, in that one is pure
> > improvement, while the other is trade-off.
>
> I agree that this is a trade-off, and we should both be careful about
> whether or not we even make this a possibility for certain file
> formats. The index is an interesting case for a couple reasons:
>
> 1. Writes block users. Writing the index takes place in many user-
>    blocking foreground operations. The speed improvement directly
>    impacts their use. Other file formats are typically written in
>    the background (commit-graph, multi-pack-index) or are super-
>    critical to correctness (pack-files).
>
> 2. Index files are short lived. It is rare that a user leaves an
>    index for a long time with many staged changes. That's the condition
>    that's required for losing an index file to cause a loss of work
>    (or maybe I'm missing something). Outside of staged changes, the
>    index can be completely destroyed and rewritten with minimal impact
>    to the user.
>

Is this information in the commit messages somewhere? I didn't see it
in the cover letter. Nor did I see any other explanation in the cover
letter besides "this makes it faster". I would expect such trade off
or analysis of "what do we lose" to be in the cover letter, as it may
not be clear otherwise.

I do agree these reasons are good, but it can be confusing to later
reviewers when looking back at the code for an option like this and
wondering why it exists.

Thanks,
Jake

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/4] test-lib-functions: add helper for trailing hash
  2022-12-12 18:14     ` SZEDER Gábor
@ 2022-12-13  0:55       ` Junio C Hamano
  2022-12-17 17:37         ` SZEDER Gábor
  0 siblings, 1 reply; 59+ messages in thread
From: Junio C Hamano @ 2022-12-13  0:55 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Derrick Stolee via GitGitGadget, git, vdye, avarab, newren,
	Derrick Stolee

SZEDER Gábor <szeder.dev@gmail.com> writes:

>> +	test_trailing_hash .git/index >hash &&
>> +	echo $(test_oid zero) >expect &&
>
> Nit: test_oid zero >expect
>
>> +	test_cmp expect hash &&

Unfortunately they are not equivalent.

Usually we write these helpers to terminate their output with LF,
relying on the fact that terminating LF will be dropped when used in
a command substition, e.g. VAR=$(HELPER), but test_oid deviates from
the pattern and does not give the terminating LF to its output.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 0/4] Optionally skip hashing index on write
  2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2022-12-12 16:31   ` [PATCH v2 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
@ 2022-12-15 15:06   ` Derrick Stolee via GitGitGadget
  2022-12-15 15:06     ` [PATCH v3 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
                       ` (5 more replies)
  4 siblings, 6 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-15 15:06 UTC (permalink / raw)
  To: git; +Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee

Writing the index is a critical action that takes place in multiple Git
commands. The recent performance improvements available with the sparse
index show how often the I/O costs around the index can affect different Git
commands, although reading the index takes place more often than a write.

The index is written through the hashfile API, both as a buffered write and
as a computation of its trailing hash. This trailing hash is an expectation
of the file format: several optional index extensions are provided using
indicators immediately preceding the trailing hash. 'git fsck' will complain
if the trailing hash does not match the contents of the file up to that
point.

However, computing the hash is expensive. This is even more expensive now
that we've moved to sha1dc instead of hardware-accelerated SHA1
computations.

This series provides a new config option, index.skipHash, that allows users
to disable computing the hash as Git writes the index. In this case, the
trailing hash is stored as the null hash (all bytes are zero).

The implementation is rather simple, but the patches organize different
aspects in a way that we could split things out:

 * Patch 1 adds a 'skip_hash' option to 'struct hashfile' that creates this
   as a general option to the hashwrite API.
 * The motivation in Patch 1 avoids the talk about the chunk-format API and
   instead focuses on the index as the first example that could make use of
   this.
 * Patch 2 creates the index.skipHash config option and updates 'git fsck'
   to ignore a null hash on the index. This is demonstrated by a very simple
   test.
 * Patch 3 creates a test helper to get the trailing hash of a file, and
   adds a concrete check on the trailing hash when index.skipHash=true.
 * Patch 4 (which could be deferred until later) adds index.skipHash=true as
   an implication of feature.manyFiles and as a setting in Scalar.

The end result is that index writes speed up significantly, enough that 'git
update-index --force-write' improves by 1.75x.

Similar patches appeared in my earlier refs RFC [1].

[1]
https://lore.kernel.org/git/pull.1408.git.1667846164.gitgitgadget@gmail.com/


Updates since v2
================

 * Patch 2 now has additional details about why to use the config option in
   its message. The discussion about why the index is special is included in
   the commit message.


Updates since v1
================

 * In Patch 1, use hashcler() instead of memset().
 * In Patches 2 and 4, be explicit about which Git versions will exhibit
   different behavior when reading an index with a null trailing hash.
 * In Patch 2, used a more compact way of loading from config.
 * In Patch 2 (and following) dropped the sub-shell in t1600-index.sh.
 * In Patch 3, dropped back-ticks when using a multi-line pipe.
 * In Patch 4, use "! cmp" instead of "! test_cmp"


Updates since RFC
=================

 * The config name is now index.skipHash to make it more clear that the
   default is to not skip the hash, and choosing this option requires a true
   value.
 * Use oidread() instead of an ad-hoc loop.
 * Patches 3 and 4 are new. 3 adds some extra test verification that wasn't
   in the RFC.
 * I think that patch 4 is helpful to see now, but I could understand if we
   wanted to defer that patch until after index.skipHash has had more time
   to bake.

Thanks, -Stolee

Derrick Stolee (4):
  hashfile: allow skipping the hash function
  read-cache: add index.skipHash config option
  test-lib-functions: add helper for trailing hash
  features: feature.manyFiles implies fast index writes

 Documentation/config/feature.txt |  5 +++++
 Documentation/config/index.txt   | 11 +++++++++++
 csum-file.c                      | 14 +++++++++++---
 csum-file.h                      |  7 +++++++
 read-cache.c                     | 15 ++++++++++++++-
 repo-settings.c                  |  2 ++
 repository.h                     |  1 +
 scalar.c                         |  1 +
 t/t1600-index.sh                 | 20 ++++++++++++++++++++
 t/test-lib-functions.sh          |  8 ++++++++
 10 files changed, 80 insertions(+), 4 deletions(-)


base-commit: 805265fcf7a737664a8321aaf4a0587b78435184
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1439%2Fderrickstolee%2Fhashfile-skip-hash-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1439/derrickstolee/hashfile-skip-hash-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1439

Range-diff vs v2:

 1:  b582d681581 = 1:  b582d681581 hashfile: allow skipping the hash function
 2:  c8a2fd3a470 ! 2:  00738c81a12 read-cache: add index.skipHash config option
     @@ Commit message
          critical that the sparse index was created specifically to reduce the
          size of the index to make these writes (and reads) faster.
      
     +    This trade-off between file stability at rest and write-time performance
     +    is not easy to balance. The index is an interesting case for a couple
     +    reasons:
     +
     +    1. Writes block users. Writing the index takes place in many user-
     +       blocking foreground operations. The speed improvement directly
     +       impacts their use. Other file formats are typically written in the
     +       background (commit-graph, multi-pack-index) or are super-critical to
     +       correctness (pack-files).
     +
     +    2. Index files are short lived. It is rare that a user leaves an index
     +       for a long time with many staged changes. Outside of staged changes,
     +       the index can be completely destroyed and rewritten with minimal
     +       impact to the user.
     +
          Following a similar approach to one used in the microsoft/git fork [1],
          add a new config option (index.skipHash) that allows disabling this
          hashing during the index write. The cost is that we can no longer
     @@ Documentation/config/index.txt: index.version::
      +
      +index.skipHash::
      +	When enabled, do not compute the trailing hash for the index file.
     -+	Instead, write a trailing set of bytes with value zero, indicating
     ++	This accelerates Git commands that manipulate the index, such as
     ++	`git add`, `git commit`, or `git status`. Instead of storing the
     ++	checksum, write a trailing set of bytes with value zero, indicating
      +	that the computation was skipped.
      ++
      +If you enable `index.skipHash`, then Git clients older than 2.13.0 will
 3:  813e81a0582 = 3:  86370af1351 test-lib-functions: add helper for trailing hash
 4:  e640dff53dd = 4:  6490bd445eb features: feature.manyFiles implies fast index writes

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 1/4] hashfile: allow skipping the hash function
  2022-12-15 15:06   ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
@ 2022-12-15 15:06     ` Derrick Stolee via GitGitGadget
  2022-12-15 15:06     ` [PATCH v3 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-15 15:06 UTC (permalink / raw)
  To: git
  Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The hashfile API is useful for generating files that include a trailing
hash of the file's contents up to that point. Using such a hash is
helpful for verifying the file for corruption-at-rest, such as a faulty
drive causing flipped bits.

Git's index file includes this trailing hash, so it uses a 'struct
hashfile' to handle the I/O to the file. This was very convenient to
allow using the hashfile methods during these operations.

However, hashing the file contents during write comes at a performance
penalty. It's slower to hash the bytes on their way to the disk than
without that step. This problem is made worse by the replacement of
hardware-accelerated SHA1 computations with the software-based sha1dc
computation.

This write cost is significant, and the checksum capability is likely
not worth that cost for such a short-lived file. The index is rewritten
frequently and the only time the checksum is checked is during 'git
fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
computation.

We first need to allow Git to opt-out of the hash computation in the
hashfile API. The buffered writes of the API are still helpful, so it
makes sense to make the change here.

Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the
update_fn and final_fn members of the_hash_algo are skipped. When
finalizing the hashfile, the trailing hash is replaced with the null
hash.

This use of a trailing null hash would be desireable in either case,
since we do not want to special case a file format to have a different
length depending on whether it was hashed or not. When the final bytes
of a file are all zero, we can infer that it was written without
hashing, and thus that verification is not available as a check for file
consistency. This also means that we could easily toggle hashing for any
file format we desire.

A version of this patch has existed in the microsoft/git fork since
2017 [1] (the linked commit was rebased in 2018, but the original dates
back to January 2017). Here, the change to make the index use this fast
path is delayed until a later change.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

Co-authored-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 csum-file.c | 14 +++++++++++---
 csum-file.h |  7 +++++++
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 59ef3398ca2..cce13c0f047 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -45,7 +45,8 @@ void hashflush(struct hashfile *f)
 	unsigned offset = f->offset;
 
 	if (offset) {
-		the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
+		if (!f->skip_hash)
+			the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	int fd;
 
 	hashflush(f);
-	the_hash_algo->final_fn(f->buffer, &f->ctx);
+
+	if (f->skip_hash)
+		hashclr(f->buffer);
+	else
+		the_hash_algo->final_fn(f->buffer, &f->ctx);
+
 	if (result)
 		hashcpy(result, f->buffer);
 	if (flags & CSUM_HASH_IN_STREAM)
@@ -108,7 +114,8 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * the hashfile's buffer. In this block,
 			 * f->offset is necessarily zero.
 			 */
-			the_hash_algo->update_fn(&f->ctx, buf, nr);
+			if (!f->skip_hash)
+				the_hash_algo->update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->tp = tp;
 	f->name = name;
 	f->do_crc = 0;
+	f->skip_hash = 0;
 	the_hash_algo->init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
diff --git a/csum-file.h b/csum-file.h
index 0d29f528fbc..29468067f81 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -20,6 +20,13 @@ struct hashfile {
 	size_t buffer_len;
 	unsigned char *buffer;
 	unsigned char *check_buffer;
+
+	/**
+	 * If set to 1, skip_hash indicates that we should
+	 * not actually compute the hash for this hashfile and
+	 * instead only use it as a buffered write.
+	 */
+	unsigned int skip_hash;
 };
 
 /* Checkpoint */
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 2/4] read-cache: add index.skipHash config option
  2022-12-15 15:06   ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
  2022-12-15 15:06     ` [PATCH v3 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
@ 2022-12-15 15:06     ` Derrick Stolee via GitGitGadget
  2022-12-15 16:12       ` Ævar Arnfjörð Bjarmason
  2022-12-15 15:06     ` [PATCH v3 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
                       ` (3 subsequent siblings)
  5 siblings, 1 reply; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-15 15:06 UTC (permalink / raw)
  To: git
  Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change allowed skipping the hashing portion of the
hashwrite API, using it instead as a buffered write API. Disabling the
hashwrite can be particularly helpful when the write operation is in a
critical path.

One such critical path is the writing of the index. This operation is so
critical that the sparse index was created specifically to reduce the
size of the index to make these writes (and reads) faster.

This trade-off between file stability at rest and write-time performance
is not easy to balance. The index is an interesting case for a couple
reasons:

1. Writes block users. Writing the index takes place in many user-
   blocking foreground operations. The speed improvement directly
   impacts their use. Other file formats are typically written in the
   background (commit-graph, multi-pack-index) or are super-critical to
   correctness (pack-files).

2. Index files are short lived. It is rare that a user leaves an index
   for a long time with many staged changes. Outside of staged changes,
   the index can be completely destroyed and rewritten with minimal
   impact to the user.

Following a similar approach to one used in the microsoft/git fork [1],
add a new config option (index.skipHash) that allows disabling this
hashing during the index write. The cost is that we can no longer
validate the contents for corruption-at-rest using the trailing hash.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

While older Git versions will not recognize the null hash as a special
case, the file format itself is still being met in terms of its
structure. Using this null hash will still allow Git operations to
function across older versions.

The one exception is 'git fsck' which checks the hash of the index file.
This used to be a check on every index read, but was split out to just
the index in a33fc72fe91 (read-cache: force_verify_index_checksum,
2017-04-14) and released first in Git 2.13.0. Document the versions that
relaxed these restrictions, with the optimistic expectation that this
change will be included in Git 2.40.0.

Here, we disable this check if the trailing hash is all zeroes. We add a
warning to the config option that this may cause undesirable behavior
with older Git versions.

As a quick comparison, I tested 'git update-index --force-write' with
and without index.skipHash=true on a copy of the Linux kernel
repository.

Benchmark 1: with hash
  Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
  Range (min … max):    34.3 ms …  79.1 ms    82 runs

Benchmark 2: without hash
  Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
  Range (min … max):    16.3 ms …  42.0 ms    69 runs

Summary
  'without hash' ran
    1.78 ± 0.76 times faster than 'with hash'

These performance benefits are substantial enough to allow users the
ability to opt-in to this feature, even with the potential confusion
with older 'git fsck' versions.

It is critical that this test is placed before the test_index_version
tests, since those tests obliterate the .git/config file and hence lose
the setting from GIT_TEST_DEFAULT_HASH, if set.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/index.txt | 11 +++++++++++
 read-cache.c                   | 12 +++++++++++-
 t/t1600-index.sh               |  6 ++++++
 3 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt
index 75f3a2d1054..23c7985eb40 100644
--- a/Documentation/config/index.txt
+++ b/Documentation/config/index.txt
@@ -30,3 +30,14 @@ index.version::
 	Specify the version with which new index files should be
 	initialized.  This does not affect existing repositories.
 	If `feature.manyFiles` is enabled, then the default is 4.
+
+index.skipHash::
+	When enabled, do not compute the trailing hash for the index file.
+	This accelerates Git commands that manipulate the index, such as
+	`git add`, `git commit`, or `git status`. Instead of storing the
+	checksum, write a trailing set of bytes with value zero, indicating
+	that the computation was skipped.
++
+If you enable `index.skipHash`, then Git clients older than 2.13.0 will
+refuse to parse the index and Git clients older than 2.40.0 will report an
+error during `git fsck`.
diff --git a/read-cache.c b/read-cache.c
index 46f5e497b14..3f7de8b2e20 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	git_hash_ctx c;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	int hdr_version;
+	unsigned char *start, *end;
+	struct object_id oid;
 
 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
 		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
@@ -1827,10 +1829,16 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	if (!verify_index_checksum)
 		return 0;
 
+	end = (unsigned char *)hdr + size;
+	start = end - the_hash_algo->rawsz;
+	oidread(&oid, start);
+	if (oideq(&oid, null_oid()))
+		return 0;
+
 	the_hash_algo->init_fn(&c);
 	the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
 	the_hash_algo->final_fn(hash, &c);
-	if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
+	if (!hasheq(hash, end - the_hash_algo->rawsz))
 		return error(_("bad index file sha1 signature"));
 	return 0;
 }
@@ -2918,6 +2926,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
+	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
+
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
 			removed++;
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 010989f90e6..45feb0fc5d8 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -65,6 +65,12 @@ test_expect_success 'out of bounds index.version issues warning' '
 	)
 '
 
+test_expect_success 'index.skipHash config option' '
+	rm -f .git/index &&
+	git -c index.skipHash=true add a &&
+	git fsck
+'
+
 test_index_version () {
 	INDEX_VERSION_CONFIG=$1 &&
 	FEATURE_MANY_FILES=$2 &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 3/4] test-lib-functions: add helper for trailing hash
  2022-12-15 15:06   ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
  2022-12-15 15:06     ` [PATCH v3 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
  2022-12-15 15:06     ` [PATCH v3 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
@ 2022-12-15 15:06     ` Derrick Stolee via GitGitGadget
  2022-12-15 15:07     ` [PATCH v3 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-15 15:06 UTC (permalink / raw)
  To: git
  Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

It can be helpful to check that a file format with a trailing hash has a
specific hash in the final bytes of a written file. This is made more
apparent by recent changes that allow skipping the hash algorithm and
writing a null hash at the end of the file instead.

Add a new test_trailing_hash helper and use it in t1600 to verify that
index.skipHash=true really does skip the hash computation, since
'git fsck' does not actually verify the hash.

Keep the 'git fsck' call to ensure that any potential future change to
check the index hash does not cause an error in this case.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/t1600-index.sh        | 3 +++
 t/test-lib-functions.sh | 8 ++++++++
 2 files changed, 11 insertions(+)

diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 45feb0fc5d8..55914bc3506 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -68,6 +68,9 @@ test_expect_success 'out of bounds index.version issues warning' '
 test_expect_success 'index.skipHash config option' '
 	rm -f .git/index &&
 	git -c index.skipHash=true add a &&
+	test_trailing_hash .git/index >hash &&
+	echo $(test_oid zero) >expect &&
+	test_cmp expect hash &&
 	git fsck
 '
 
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 796093a7b32..60308843f8f 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1875,3 +1875,11 @@ test_cmp_config_output () {
 	sort config-actual >sorted-actual &&
 	test_cmp sorted-expect sorted-actual
 }
+
+# Given a filename, extract its trailing hash as a hex string
+test_trailing_hash () {
+	local file="$1" &&
+	tail -c $(test_oid rawsz) "$file" |
+		test-tool hexdump |
+		sed "s/ //g"
+}
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 4/4] features: feature.manyFiles implies fast index writes
  2022-12-15 15:06   ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2022-12-15 15:06     ` [PATCH v3 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
@ 2022-12-15 15:07     ` Derrick Stolee via GitGitGadget
  2022-12-15 15:56     ` [PATCH v3 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason
  2022-12-16 15:31     ` [PATCH v4 " Derrick Stolee via GitGitGadget
  5 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-15 15:07 UTC (permalink / raw)
  To: git
  Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The recent addition of the index.skipHash config option allows index
writes to speed up by skipping the hash computation for the trailing
checksum. This is particularly critical for repositories with many files
at HEAD, so add this config option to two cases where users in that
scenario may opt-in to such behavior:

 1. The feature.manyFiles config option enables some options that are
    helpful for repositories with many files at HEAD.

 2. 'scalar register' and 'scalar reconfigure' set config options that
    optimize for large repositories.

In both of these cases, set index.skipHash=true to gain this
speedup. Add tests that demonstrate the proper way that
index.skipHash=true can override feature.manyFiles=true.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/feature.txt |  5 +++++
 read-cache.c                     |  5 ++++-
 repo-settings.c                  |  2 ++
 repository.h                     |  1 +
 scalar.c                         |  1 +
 t/t1600-index.sh                 | 13 ++++++++++++-
 6 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt
index 95975e50912..e52bc6b8584 100644
--- a/Documentation/config/feature.txt
+++ b/Documentation/config/feature.txt
@@ -23,6 +23,11 @@ feature.manyFiles::
 	working directory. With many files, commands such as `git status` and
 	`git checkout` may be slow and these new defaults improve performance:
 +
+* `index.skipHash=true` speeds up index writes by not computing a trailing
+  checksum. Note that this will cause Git versions earlier than 2.13.0 to
+  refuse to parse the index and Git versions earlier than 2.40.0 will report
+  a corrupted index during `git fsck`.
++
 * `index.version=4` enables path-prefix compression in the index.
 +
 * `core.untrackedCache=true` enables the untracked cache. This setting assumes
diff --git a/read-cache.c b/read-cache.c
index 3f7de8b2e20..1844953fba7 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2926,7 +2926,10 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
-	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
+	if (istate->repo) {
+		prepare_repo_settings(istate->repo);
+		f->skip_hash = istate->repo->settings.index_skip_hash;
+	}
 
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
diff --git a/repo-settings.c b/repo-settings.c
index 3021921c53d..3dbd3f0e2ec 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -47,6 +47,7 @@ void prepare_repo_settings(struct repository *r)
 	}
 	if (manyfiles) {
 		r->settings.index_version = 4;
+		r->settings.index_skip_hash = 1;
 		r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE;
 	}
 
@@ -61,6 +62,7 @@ void prepare_repo_settings(struct repository *r)
 	repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1);
 	repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1);
 	repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
+	repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
 
 	/*
 	 * The GIT_TEST_MULTI_PACK_INDEX variable is special in that
diff --git a/repository.h b/repository.h
index 6c461c5b9de..e8c67ffe165 100644
--- a/repository.h
+++ b/repository.h
@@ -42,6 +42,7 @@ struct repo_settings {
 	struct fsmonitor_settings *fsmonitor; /* lazily loaded */
 
 	int index_version;
+	int index_skip_hash;
 	enum untracked_cache_setting core_untracked_cache;
 
 	int pack_use_sparse;
diff --git a/scalar.c b/scalar.c
index 6c52243cdf1..b49bb8c24ec 100644
--- a/scalar.c
+++ b/scalar.c
@@ -143,6 +143,7 @@ static int set_recommended_config(int reconfigure)
 		{ "credential.validate", "false", 1 }, /* GCM4W-only */
 		{ "gc.auto", "0", 1 },
 		{ "gui.GCWarning", "false", 1 },
+		{ "index.skipHash", "false", 1 },
 		{ "index.threads", "true", 1 },
 		{ "index.version", "4", 1 },
 		{ "merge.stat", "false", 1 },
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 55914bc3506..103743a1c7d 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -71,7 +71,18 @@ test_expect_success 'index.skipHash config option' '
 	test_trailing_hash .git/index >hash &&
 	echo $(test_oid zero) >expect &&
 	test_cmp expect hash &&
-	git fsck
+	git fsck &&
+
+	rm -f .git/index &&
+	git -c feature.manyFiles=true add a &&
+	test_trailing_hash .git/index >hash &&
+	test_cmp expect hash &&
+
+	rm -f .git/index &&
+	git -c feature.manyFiles=true \
+		-c index.skipHash=false add a &&
+	test_trailing_hash .git/index >hash &&
+	! cmp expect hash
 '
 
 test_index_version () {
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 0/4] Optionally skip hashing index on write
  2022-12-15 15:06   ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
                       ` (3 preceding siblings ...)
  2022-12-15 15:07     ` [PATCH v3 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
@ 2022-12-15 15:56     ` Ævar Arnfjörð Bjarmason
  2022-12-16 13:41       ` Derrick Stolee
  2022-12-16 15:31     ` [PATCH v4 " Derrick Stolee via GitGitGadget
  5 siblings, 1 reply; 59+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-15 15:56 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, newren, Jacob Keller, Derrick Stolee


On Thu, Dec 15 2022, Derrick Stolee via GitGitGadget wrote:

> Updates since v2
> ================
>
>  * Patch 2 now has additional details about why to use the config option in
>    its message. The discussion about why the index is special is included in
>    the commit message.

Re the comments I had on an earlier version[1] I tried this series with
these changes squashed in:
	
	1:  b582d681581 = 1:  53d289cf82c hashfile: allow skipping the hash function
	2:  00738c81a12 ! 2:  db487f3e76d read-cache: add index.skipHash config option
	    @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
	      
	      	f = hashfd(tempfile->fd, tempfile->filename.buf);
	      
	    -+	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
	    ++	repo_config_get_maybe_bool(istate->repo, "index.skiphash", (int *)&f->skip_hash);
	     +
	      	for (i = removed = extended = 0; i < entries; i++) {
	      		if (cache[i]->ce_flags & CE_REMOVE)
	3:  86370af1351 = 3:  35188bbcfb5 test-lib-functions: add helper for trailing hash
	4:  6490bd445eb ! 4:  4354328e8fc features: feature.manyFiles implies fast index writes
	    @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
	      
	      	f = hashfd(tempfile->fd, tempfile->filename.buf);
	      
	    --	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
	    +-	repo_config_get_maybe_bool(istate->repo, "index.skiphash", (int *)&f->skip_hash);
	     +	if (istate->repo) {
	    ++		int ours, theirs;
	    ++
	     +		prepare_repo_settings(istate->repo);
	    -+		f->skip_hash = istate->repo->settings.index_skip_hash;
	    ++
	    ++		ours = istate->repo->settings.index_skip_hash;
	    ++		theirs = the_repository->settings.index_skip_hash;
	    ++
	    ++		if (ours != theirs)
	    ++			BUG("ours %d theirs %d", ours, theirs);
	    ++		f->skip_hash = ours;
	    ++	} else {
	    ++		f->skip_hash = the_repository->settings.index_skip_hash;
	    ++		/*BUG("no repo???");*/
	     +	}
	      
	      	for (i = removed = extended = 0; i < entries; i++) {

With those all of your tests still pass. Which leads to these outstanding questions:

a) If we're doing to use "istate->repo" in 4/4 why not do so in 2/4,
   neither patch discusses *that* part of the change.

   On an unrelated topic there's this[2] fix-up of yours, 2/2 still
   seems like it needs the same treatment.

b) In 2/4 we'd get the config if "istate->repo" was NULL, if you
   uncomment that BUG() in the squashed 4/4 we'll get test failures all
   over the place. Aren't these places where we'd get the config before,
   but istate->repo isn't set up properly, so with 4/4 we won't get it?

   Maybe that's desired, but again, neither commit discusses this
   change.

Now, I *think* it's a good idea to defer to the already set-up
'istate->repo', but I also know (and have some local patches to fix...)
about the cases where we don't set it up (but should), so it can't be
fully relied on.

So even if we can't produce a behavior difference, just doing e.g.:

	struct repository *r = istate->repo ? istate->repo : the_repository;

And then using:

	prepare_repo_settings(r);
        f->skip_hash = r->->settings.index_skip_hash;

Seems sensible. I just don't get why 4/4 has that seeming fix-up of 2/4
after-the-fact. Isn't it better to carve that bit out of 4/4, just do
the config setup in repo-settings.c to begin with, and have 4/4 do what
its $subject says, i.e. to have "feature.manyFiles" imply this new
config?

The rest of this all looks sensible to me, sans some isolated things
I'll comment on on individual patches.

1. https://lore.kernel.org/git/221212.86v8mg4gr2.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/857d1abec4cf124e011c7f84276ce105cb5b3a96.1670866407.git.gitgitgadget@gmail.com/

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 2/4] read-cache: add index.skipHash config option
  2022-12-15 15:06     ` [PATCH v3 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
@ 2022-12-15 16:12       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 59+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-15 16:12 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, newren, Jacob Keller, Derrick Stolee


On Thu, Dec 15 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>

> +	end = (unsigned char *)hdr + size;

Commentary: The "unsigned char *" cast has moved up here from the below,
that's needed (we're casting from the struct), and we should get rid of
it from below, good.

> +	start = end - the_hash_algo->rawsz;

Okey, so here we mark the start, which is the end minus the rawsz,
but...

> +	oidread(&oid, start);
> +	if (oideq(&oid, null_oid()))
> +		return 0;
> +
>  	the_hash_algo->init_fn(&c);
>  	the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
>  	the_hash_algo->final_fn(hash, &c);
> -	if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
> +	if (!hasheq(hash, end - the_hash_algo->rawsz))

...here we got rid of the cast, which is good, but let's not use "end -
the_hash_algo->rawsz" here, let's use "start", which you already
computed as "end - the_hash_algo->rawsz". This is just repeating it.

I wondered if I just missed it being modified in the interim before
carefully re-reading this, but we pass your tests with:

	-       if (!hasheq(hash, end - the_hash_algo->rawsz))
	+       assert((end - the_hash_algo->rawsz) == start);
	+       if (!hasheq(hash, start))

So, we can indeed juse the simpler "start" here, and it makes it easier
to read, as we're assured that it didn't move in the interim.

> +	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);

Aside from the question of whether we use the repo_*() variant here,
which I noted in my reply to the CL. The cast is suspicious.

So, in the 1/4 we added this as *unsigned*:
	
	+	 * If set to 1, skip_hash indicates that we should
	+	 * not actually compute the hash for this hashfile and
	+	 * instead only use it as a buffered write.
	+	 */
	+	unsigned int skip_hash;

But you need the cast here since the config API can and will set the
"dest" to -1. See the "*dest == -1" test in git_configset_get_value().

So, here we're relying on a "unsigned int" cast'd to "int" correctly
doing the right thing on a "-1" assignment.

I'm not sure if that's portable or leads to undefined behavior, but in
any case, won't such a -1 value be read back as ~0 from that "unsigned
int" variable on most modern platforms?

Just bypassing that entirely and making it "int" seems better here, or
having an intermediate variable.

I also wondered if this was all my fault, in your original version you
were doing:

	int skip_hash;
	[...]
	if (!git_config_get_maybe_bool("index.skiphash", &skip_hash))
		f->skip_hash = skip_hash;

And I suggested that this was redundant, and that you could just write
to "f->skip_hash" directly.

But I didn't notice it was "unsigned", and in any case your original
version had the same issue of assigning a -1 to the unsigned variable...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 0/4] Optionally skip hashing index on write
  2022-12-15 15:56     ` [PATCH v3 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason
@ 2022-12-16 13:41       ` Derrick Stolee
  0 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee @ 2022-12-16 13:41 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason,
	Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, newren, Jacob Keller

On 12/15/22 10:56 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Dec 15 2022, Derrick Stolee via GitGitGadget wrote:
> 
>> Updates since v2
>> ================
>>
>>  * Patch 2 now has additional details about why to use the config option in
>>    its message. The discussion about why the index is special is included in
>>    the commit message.


> So even if we can't produce a behavior difference, just doing e.g.:
> 
> 	struct repository *r = istate->repo ? istate->repo : the_repository;
> 
> And then using:
> 
> 	prepare_repo_settings(r);
>         f->skip_hash = r->->settings.index_skip_hash;

This, and other comments are sensible and will be reflected in v4.
 
> Seems sensible. I just don't get why 4/4 has that seeming fix-up of 2/4
> after-the-fact. Isn't it better to carve that bit out of 4/4, just do
> the config setup in repo-settings.c to begin with, and have 4/4 do what
> its $subject says, i.e. to have "feature.manyFiles" imply this new
> config?

The point is to make patch 4 completely independent, and be able to
pull it out if necessary. There's no reason to load the config inside
repo-settings.c unless it's part of something like feature.manyFiles.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v4 0/4] Optionally skip hashing index on write
  2022-12-15 15:06   ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
                       ` (4 preceding siblings ...)
  2022-12-15 15:56     ` [PATCH v3 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason
@ 2022-12-16 15:31     ` Derrick Stolee via GitGitGadget
  2022-12-16 15:31       ` [PATCH v4 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
                         ` (5 more replies)
  5 siblings, 6 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-16 15:31 UTC (permalink / raw)
  To: git; +Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee

Writing the index is a critical action that takes place in multiple Git
commands. The recent performance improvements available with the sparse
index show how often the I/O costs around the index can affect different Git
commands, although reading the index takes place more often than a write.

The index is written through the hashfile API, both as a buffered write and
as a computation of its trailing hash. This trailing hash is an expectation
of the file format: several optional index extensions are provided using
indicators immediately preceding the trailing hash. 'git fsck' will complain
if the trailing hash does not match the contents of the file up to that
point.

However, computing the hash is expensive. This is even more expensive now
that we've moved to sha1dc instead of hardware-accelerated SHA1
computations.

This series provides a new config option, index.skipHash, that allows users
to disable computing the hash as Git writes the index. In this case, the
trailing hash is stored as the null hash (all bytes are zero).

The implementation is rather simple, but the patches organize different
aspects in a way that we could split things out:

 * Patch 1 adds a 'skip_hash' option to 'struct hashfile' that creates this
   as a general option to the hashwrite API.
 * The motivation in Patch 1 avoids the talk about the chunk-format API and
   instead focuses on the index as the first example that could make use of
   this.
 * Patch 2 creates the index.skipHash config option and updates 'git fsck'
   to ignore a null hash on the index. This is demonstrated by a very simple
   test.
 * Patch 3 creates a test helper to get the trailing hash of a file, and
   adds a concrete check on the trailing hash when index.skipHash=true.
 * Patch 4 (which could be deferred until later) adds index.skipHash=true as
   an implication of feature.manyFiles and as a setting in Scalar.

The end result is that index writes speed up significantly, enough that 'git
update-index --force-write' improves by 1.75x.

Similar patches appeared in my earlier refs RFC [1].

[1]
https://lore.kernel.org/git/pull.1408.git.1667846164.gitgitgadget@gmail.com/


Updates since v3
================

 * Patch 1 uses an "int" instead of "unsigned int". This was a carry-over
   from Kevin's patch in microsoft/git, but our use of it in patch 2 is
   different and thus is better with a signed int.
 * Patch 2 uses a fallback to the_repository if istate->repo is NULL. This
   allows the setting to be applied to more cases where istate->repo is not
   set.
 * Patch 2 also uses 'start' in more places, since it already computed the
   beginning of the hash.
 * Patch 4 slightly modifies its use of prepare_repo_settings() due to Patch
   2's fallback to the_repository.


Updates since v2
================

 * Patch 2 now has additional details about why to use the config option in
   its message. The discussion about why the index is special is included in
   the commit message.


Updates since v1
================

 * In Patch 1, use hashcler() instead of memset().
 * In Patches 2 and 4, be explicit about which Git versions will exhibit
   different behavior when reading an index with a null trailing hash.
 * In Patch 2, used a more compact way of loading from config.
 * In Patch 2 (and following) dropped the sub-shell in t1600-index.sh.
 * In Patch 3, dropped back-ticks when using a multi-line pipe.
 * In Patch 4, use "! cmp" instead of "! test_cmp"


Updates since RFC
=================

 * The config name is now index.skipHash to make it more clear that the
   default is to not skip the hash, and choosing this option requires a true
   value.
 * Use oidread() instead of an ad-hoc loop.
 * Patches 3 and 4 are new. 3 adds some extra test verification that wasn't
   in the RFC.
 * I think that patch 4 is helpful to see now, but I could understand if we
   wanted to defer that patch until after index.skipHash has had more time
   to bake.

Thanks, -Stolee

Derrick Stolee (4):
  hashfile: allow skipping the hash function
  read-cache: add index.skipHash config option
  test-lib-functions: add helper for trailing hash
  features: feature.manyFiles implies fast index writes

 Documentation/config/feature.txt |  5 +++++
 Documentation/config/index.txt   | 11 +++++++++++
 csum-file.c                      | 14 +++++++++++---
 csum-file.h                      |  7 +++++++
 read-cache.c                     | 14 +++++++++++++-
 repo-settings.c                  |  2 ++
 repository.h                     |  1 +
 scalar.c                         |  1 +
 t/t1600-index.sh                 | 20 ++++++++++++++++++++
 t/test-lib-functions.sh          |  8 ++++++++
 10 files changed, 79 insertions(+), 4 deletions(-)


base-commit: 805265fcf7a737664a8321aaf4a0587b78435184
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1439%2Fderrickstolee%2Fhashfile-skip-hash-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1439/derrickstolee/hashfile-skip-hash-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/1439

Range-diff vs v3:

 1:  b582d681581 ! 1:  c99470d4676 hashfile: allow skipping the hash function
     @@ csum-file.h: struct hashfile {
       	unsigned char *check_buffer;
      +
      +	/**
     -+	 * If set to 1, skip_hash indicates that we should
     ++	 * If non-zero, skip_hash indicates that we should
      +	 * not actually compute the hash for this hashfile and
      +	 * instead only use it as a buffered write.
      +	 */
     -+	unsigned int skip_hash;
     ++	int skip_hash;
       };
       
       /* Checkpoint */
 2:  00738c81a12 ! 2:  aae047cbc9f read-cache: add index.skipHash config option
     @@ Commit message
      
          [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201
      
     +    We load this config from the repository config given by istate->repo,
     +    with a fallback to the_repository if it is not set.
     +
          While older Git versions will not recognize the null hash as a special
          case, the file format itself is still being met in terms of its
          structure. Using this null hash will still allow Git operations to
     @@ read-cache.c: static int verify_hdr(const struct cache_header *hdr, unsigned lon
       	the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
       	the_hash_algo->final_fn(hash, &c);
      -	if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
     -+	if (!hasheq(hash, end - the_hash_algo->rawsz))
     ++	if (!hasheq(hash, start))
       		return error(_("bad index file sha1 signature"));
       	return 0;
       }
      @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
     + 	int ieot_entries = 1;
     + 	struct index_entry_offset_table *ieot = NULL;
     + 	int nr, nr_threads;
     ++	struct repository *r = istate->repo ? istate->repo : the_repository;
       
       	f = hashfd(tempfile->fd, tempfile->filename.buf);
       
     -+	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
     ++	repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash);
      +
       	for (i = removed = extended = 0; i < entries; i++) {
       		if (cache[i]->ce_flags & CE_REMOVE)
 3:  86370af1351 = 3:  27fbe52e748 test-lib-functions: add helper for trailing hash
 4:  6490bd445eb ! 4:  075921514f2 features: feature.manyFiles implies fast index writes
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
       
       	f = hashfd(tempfile->fd, tempfile->filename.buf);
       
     --	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
     -+	if (istate->repo) {
     -+		prepare_repo_settings(istate->repo);
     -+		f->skip_hash = istate->repo->settings.index_skip_hash;
     -+	}
     +-	repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash);
     ++	prepare_repo_settings(r);
     ++	f->skip_hash = r->settings.index_skip_hash;
       
       	for (i = removed = extended = 0; i < entries; i++) {
       		if (cache[i]->ce_flags & CE_REMOVE)

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v4 1/4] hashfile: allow skipping the hash function
  2022-12-16 15:31     ` [PATCH v4 " Derrick Stolee via GitGitGadget
@ 2022-12-16 15:31       ` Derrick Stolee via GitGitGadget
  2022-12-16 15:31       ` [PATCH v4 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
                         ` (4 subsequent siblings)
  5 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-16 15:31 UTC (permalink / raw)
  To: git
  Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The hashfile API is useful for generating files that include a trailing
hash of the file's contents up to that point. Using such a hash is
helpful for verifying the file for corruption-at-rest, such as a faulty
drive causing flipped bits.

Git's index file includes this trailing hash, so it uses a 'struct
hashfile' to handle the I/O to the file. This was very convenient to
allow using the hashfile methods during these operations.

However, hashing the file contents during write comes at a performance
penalty. It's slower to hash the bytes on their way to the disk than
without that step. This problem is made worse by the replacement of
hardware-accelerated SHA1 computations with the software-based sha1dc
computation.

This write cost is significant, and the checksum capability is likely
not worth that cost for such a short-lived file. The index is rewritten
frequently and the only time the checksum is checked is during 'git
fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
computation.

We first need to allow Git to opt-out of the hash computation in the
hashfile API. The buffered writes of the API are still helpful, so it
makes sense to make the change here.

Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the
update_fn and final_fn members of the_hash_algo are skipped. When
finalizing the hashfile, the trailing hash is replaced with the null
hash.

This use of a trailing null hash would be desireable in either case,
since we do not want to special case a file format to have a different
length depending on whether it was hashed or not. When the final bytes
of a file are all zero, we can infer that it was written without
hashing, and thus that verification is not available as a check for file
consistency. This also means that we could easily toggle hashing for any
file format we desire.

A version of this patch has existed in the microsoft/git fork since
2017 [1] (the linked commit was rebased in 2018, but the original dates
back to January 2017). Here, the change to make the index use this fast
path is delayed until a later change.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

Co-authored-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 csum-file.c | 14 +++++++++++---
 csum-file.h |  7 +++++++
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 59ef3398ca2..cce13c0f047 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -45,7 +45,8 @@ void hashflush(struct hashfile *f)
 	unsigned offset = f->offset;
 
 	if (offset) {
-		the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
+		if (!f->skip_hash)
+			the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	int fd;
 
 	hashflush(f);
-	the_hash_algo->final_fn(f->buffer, &f->ctx);
+
+	if (f->skip_hash)
+		hashclr(f->buffer);
+	else
+		the_hash_algo->final_fn(f->buffer, &f->ctx);
+
 	if (result)
 		hashcpy(result, f->buffer);
 	if (flags & CSUM_HASH_IN_STREAM)
@@ -108,7 +114,8 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * the hashfile's buffer. In this block,
 			 * f->offset is necessarily zero.
 			 */
-			the_hash_algo->update_fn(&f->ctx, buf, nr);
+			if (!f->skip_hash)
+				the_hash_algo->update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->tp = tp;
 	f->name = name;
 	f->do_crc = 0;
+	f->skip_hash = 0;
 	the_hash_algo->init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
diff --git a/csum-file.h b/csum-file.h
index 0d29f528fbc..793a59da12b 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -20,6 +20,13 @@ struct hashfile {
 	size_t buffer_len;
 	unsigned char *buffer;
 	unsigned char *check_buffer;
+
+	/**
+	 * If non-zero, skip_hash indicates that we should
+	 * not actually compute the hash for this hashfile and
+	 * instead only use it as a buffered write.
+	 */
+	int skip_hash;
 };
 
 /* Checkpoint */
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 2/4] read-cache: add index.skipHash config option
  2022-12-16 15:31     ` [PATCH v4 " Derrick Stolee via GitGitGadget
  2022-12-16 15:31       ` [PATCH v4 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
@ 2022-12-16 15:31       ` Derrick Stolee via GitGitGadget
  2022-12-16 15:31       ` [PATCH v4 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
                         ` (3 subsequent siblings)
  5 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-16 15:31 UTC (permalink / raw)
  To: git
  Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change allowed skipping the hashing portion of the
hashwrite API, using it instead as a buffered write API. Disabling the
hashwrite can be particularly helpful when the write operation is in a
critical path.

One such critical path is the writing of the index. This operation is so
critical that the sparse index was created specifically to reduce the
size of the index to make these writes (and reads) faster.

This trade-off between file stability at rest and write-time performance
is not easy to balance. The index is an interesting case for a couple
reasons:

1. Writes block users. Writing the index takes place in many user-
   blocking foreground operations. The speed improvement directly
   impacts their use. Other file formats are typically written in the
   background (commit-graph, multi-pack-index) or are super-critical to
   correctness (pack-files).

2. Index files are short lived. It is rare that a user leaves an index
   for a long time with many staged changes. Outside of staged changes,
   the index can be completely destroyed and rewritten with minimal
   impact to the user.

Following a similar approach to one used in the microsoft/git fork [1],
add a new config option (index.skipHash) that allows disabling this
hashing during the index write. The cost is that we can no longer
validate the contents for corruption-at-rest using the trailing hash.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

We load this config from the repository config given by istate->repo,
with a fallback to the_repository if it is not set.

While older Git versions will not recognize the null hash as a special
case, the file format itself is still being met in terms of its
structure. Using this null hash will still allow Git operations to
function across older versions.

The one exception is 'git fsck' which checks the hash of the index file.
This used to be a check on every index read, but was split out to just
the index in a33fc72fe91 (read-cache: force_verify_index_checksum,
2017-04-14) and released first in Git 2.13.0. Document the versions that
relaxed these restrictions, with the optimistic expectation that this
change will be included in Git 2.40.0.

Here, we disable this check if the trailing hash is all zeroes. We add a
warning to the config option that this may cause undesirable behavior
with older Git versions.

As a quick comparison, I tested 'git update-index --force-write' with
and without index.skipHash=true on a copy of the Linux kernel
repository.

Benchmark 1: with hash
  Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
  Range (min … max):    34.3 ms …  79.1 ms    82 runs

Benchmark 2: without hash
  Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
  Range (min … max):    16.3 ms …  42.0 ms    69 runs

Summary
  'without hash' ran
    1.78 ± 0.76 times faster than 'with hash'

These performance benefits are substantial enough to allow users the
ability to opt-in to this feature, even with the potential confusion
with older 'git fsck' versions.

It is critical that this test is placed before the test_index_version
tests, since those tests obliterate the .git/config file and hence lose
the setting from GIT_TEST_DEFAULT_HASH, if set.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/index.txt | 11 +++++++++++
 read-cache.c                   | 13 ++++++++++++-
 t/t1600-index.sh               |  6 ++++++
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt
index 75f3a2d1054..23c7985eb40 100644
--- a/Documentation/config/index.txt
+++ b/Documentation/config/index.txt
@@ -30,3 +30,14 @@ index.version::
 	Specify the version with which new index files should be
 	initialized.  This does not affect existing repositories.
 	If `feature.manyFiles` is enabled, then the default is 4.
+
+index.skipHash::
+	When enabled, do not compute the trailing hash for the index file.
+	This accelerates Git commands that manipulate the index, such as
+	`git add`, `git commit`, or `git status`. Instead of storing the
+	checksum, write a trailing set of bytes with value zero, indicating
+	that the computation was skipped.
++
+If you enable `index.skipHash`, then Git clients older than 2.13.0 will
+refuse to parse the index and Git clients older than 2.40.0 will report an
+error during `git fsck`.
diff --git a/read-cache.c b/read-cache.c
index 46f5e497b14..fde0697873c 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	git_hash_ctx c;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	int hdr_version;
+	unsigned char *start, *end;
+	struct object_id oid;
 
 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
 		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
@@ -1827,10 +1829,16 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	if (!verify_index_checksum)
 		return 0;
 
+	end = (unsigned char *)hdr + size;
+	start = end - the_hash_algo->rawsz;
+	oidread(&oid, start);
+	if (oideq(&oid, null_oid()))
+		return 0;
+
 	the_hash_algo->init_fn(&c);
 	the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
 	the_hash_algo->final_fn(hash, &c);
-	if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
+	if (!hasheq(hash, start))
 		return error(_("bad index file sha1 signature"));
 	return 0;
 }
@@ -2915,9 +2923,12 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	int ieot_entries = 1;
 	struct index_entry_offset_table *ieot = NULL;
 	int nr, nr_threads;
+	struct repository *r = istate->repo ? istate->repo : the_repository;
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
+	repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash);
+
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
 			removed++;
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 010989f90e6..45feb0fc5d8 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -65,6 +65,12 @@ test_expect_success 'out of bounds index.version issues warning' '
 	)
 '
 
+test_expect_success 'index.skipHash config option' '
+	rm -f .git/index &&
+	git -c index.skipHash=true add a &&
+	git fsck
+'
+
 test_index_version () {
 	INDEX_VERSION_CONFIG=$1 &&
 	FEATURE_MANY_FILES=$2 &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 3/4] test-lib-functions: add helper for trailing hash
  2022-12-16 15:31     ` [PATCH v4 " Derrick Stolee via GitGitGadget
  2022-12-16 15:31       ` [PATCH v4 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
  2022-12-16 15:31       ` [PATCH v4 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
@ 2022-12-16 15:31       ` Derrick Stolee via GitGitGadget
  2022-12-16 15:31       ` [PATCH v4 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
                         ` (2 subsequent siblings)
  5 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-16 15:31 UTC (permalink / raw)
  To: git
  Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

It can be helpful to check that a file format with a trailing hash has a
specific hash in the final bytes of a written file. This is made more
apparent by recent changes that allow skipping the hash algorithm and
writing a null hash at the end of the file instead.

Add a new test_trailing_hash helper and use it in t1600 to verify that
index.skipHash=true really does skip the hash computation, since
'git fsck' does not actually verify the hash.

Keep the 'git fsck' call to ensure that any potential future change to
check the index hash does not cause an error in this case.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/t1600-index.sh        | 3 +++
 t/test-lib-functions.sh | 8 ++++++++
 2 files changed, 11 insertions(+)

diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 45feb0fc5d8..55914bc3506 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -68,6 +68,9 @@ test_expect_success 'out of bounds index.version issues warning' '
 test_expect_success 'index.skipHash config option' '
 	rm -f .git/index &&
 	git -c index.skipHash=true add a &&
+	test_trailing_hash .git/index >hash &&
+	echo $(test_oid zero) >expect &&
+	test_cmp expect hash &&
 	git fsck
 '
 
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 796093a7b32..60308843f8f 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1875,3 +1875,11 @@ test_cmp_config_output () {
 	sort config-actual >sorted-actual &&
 	test_cmp sorted-expect sorted-actual
 }
+
+# Given a filename, extract its trailing hash as a hex string
+test_trailing_hash () {
+	local file="$1" &&
+	tail -c $(test_oid rawsz) "$file" |
+		test-tool hexdump |
+		sed "s/ //g"
+}
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 4/4] features: feature.manyFiles implies fast index writes
  2022-12-16 15:31     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (2 preceding siblings ...)
  2022-12-16 15:31       ` [PATCH v4 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
@ 2022-12-16 15:31       ` Derrick Stolee via GitGitGadget
  2022-12-16 15:43       ` [PATCH v4 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason
  2023-01-06 16:31       ` [PATCH v5 " Derrick Stolee via GitGitGadget
  5 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-12-16 15:31 UTC (permalink / raw)
  To: git
  Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The recent addition of the index.skipHash config option allows index
writes to speed up by skipping the hash computation for the trailing
checksum. This is particularly critical for repositories with many files
at HEAD, so add this config option to two cases where users in that
scenario may opt-in to such behavior:

 1. The feature.manyFiles config option enables some options that are
    helpful for repositories with many files at HEAD.

 2. 'scalar register' and 'scalar reconfigure' set config options that
    optimize for large repositories.

In both of these cases, set index.skipHash=true to gain this
speedup. Add tests that demonstrate the proper way that
index.skipHash=true can override feature.manyFiles=true.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/feature.txt |  5 +++++
 read-cache.c                     |  3 ++-
 repo-settings.c                  |  2 ++
 repository.h                     |  1 +
 scalar.c                         |  1 +
 t/t1600-index.sh                 | 13 ++++++++++++-
 6 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt
index 95975e50912..e52bc6b8584 100644
--- a/Documentation/config/feature.txt
+++ b/Documentation/config/feature.txt
@@ -23,6 +23,11 @@ feature.manyFiles::
 	working directory. With many files, commands such as `git status` and
 	`git checkout` may be slow and these new defaults improve performance:
 +
+* `index.skipHash=true` speeds up index writes by not computing a trailing
+  checksum. Note that this will cause Git versions earlier than 2.13.0 to
+  refuse to parse the index and Git versions earlier than 2.40.0 will report
+  a corrupted index during `git fsck`.
++
 * `index.version=4` enables path-prefix compression in the index.
 +
 * `core.untrackedCache=true` enables the untracked cache. This setting assumes
diff --git a/read-cache.c b/read-cache.c
index fde0697873c..feefa0f68ba 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2927,7 +2927,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
-	repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash);
+	prepare_repo_settings(r);
+	f->skip_hash = r->settings.index_skip_hash;
 
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
diff --git a/repo-settings.c b/repo-settings.c
index 3021921c53d..3dbd3f0e2ec 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -47,6 +47,7 @@ void prepare_repo_settings(struct repository *r)
 	}
 	if (manyfiles) {
 		r->settings.index_version = 4;
+		r->settings.index_skip_hash = 1;
 		r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE;
 	}
 
@@ -61,6 +62,7 @@ void prepare_repo_settings(struct repository *r)
 	repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1);
 	repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1);
 	repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
+	repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
 
 	/*
 	 * The GIT_TEST_MULTI_PACK_INDEX variable is special in that
diff --git a/repository.h b/repository.h
index 6c461c5b9de..e8c67ffe165 100644
--- a/repository.h
+++ b/repository.h
@@ -42,6 +42,7 @@ struct repo_settings {
 	struct fsmonitor_settings *fsmonitor; /* lazily loaded */
 
 	int index_version;
+	int index_skip_hash;
 	enum untracked_cache_setting core_untracked_cache;
 
 	int pack_use_sparse;
diff --git a/scalar.c b/scalar.c
index 6c52243cdf1..b49bb8c24ec 100644
--- a/scalar.c
+++ b/scalar.c
@@ -143,6 +143,7 @@ static int set_recommended_config(int reconfigure)
 		{ "credential.validate", "false", 1 }, /* GCM4W-only */
 		{ "gc.auto", "0", 1 },
 		{ "gui.GCWarning", "false", 1 },
+		{ "index.skipHash", "false", 1 },
 		{ "index.threads", "true", 1 },
 		{ "index.version", "4", 1 },
 		{ "merge.stat", "false", 1 },
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 55914bc3506..103743a1c7d 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -71,7 +71,18 @@ test_expect_success 'index.skipHash config option' '
 	test_trailing_hash .git/index >hash &&
 	echo $(test_oid zero) >expect &&
 	test_cmp expect hash &&
-	git fsck
+	git fsck &&
+
+	rm -f .git/index &&
+	git -c feature.manyFiles=true add a &&
+	test_trailing_hash .git/index >hash &&
+	test_cmp expect hash &&
+
+	rm -f .git/index &&
+	git -c feature.manyFiles=true \
+		-c index.skipHash=false add a &&
+	test_trailing_hash .git/index >hash &&
+	! cmp expect hash
 '
 
 test_index_version () {
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 0/4] Optionally skip hashing index on write
  2022-12-16 15:31     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (3 preceding siblings ...)
  2022-12-16 15:31       ` [PATCH v4 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
@ 2022-12-16 15:43       ` Ævar Arnfjörð Bjarmason
  2023-01-06 15:33         ` Derrick Stolee
  2023-01-06 16:31       ` [PATCH v5 " Derrick Stolee via GitGitGadget
  5 siblings, 1 reply; 59+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-16 15:43 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, newren, Jacob Keller, Derrick Stolee


On Fri, Dec 16 2022, Derrick Stolee via GitGitGadget wrote:

Thanks for the update!

> Updates since v3
> ================
>
>  * Patch 1 uses an "int" instead of "unsigned int". This was a carry-over
>    from Kevin's patch in microsoft/git, but our use of it in patch 2 is
>    different and thus is better with a signed int.
>  * Patch 2 uses a fallback to the_repository if istate->repo is NULL. This
>    allows the setting to be applied to more cases where istate->repo is not
>    set.
>  * Patch 2 also uses 'start' in more places, since it already computed the
>    beginning of the hash.
>  * Patch 4 slightly modifies its use of prepare_repo_settings() due to Patch
>    2's fallback to the_repository.

It's good that it's "int" now instead of "unsigned int", but just doing
that search-replacement I think misses the implied (which I really
should have made more explicit) question in my v3 feedback
(https://lore.kernel.org/git/221215.861qp03agm.gmgdl@evledraar.gmail.com/):
What should we do about the -1 state then?

With your 2/4 here we'll accept e.g.:

	./git -c index.skipHash=12345 status

As well as:

	./git -c index.skipHash=blahblah status

But with the migration to repo-settings.c we start refusing the latter
of those, as you inherit a stricture in repo-settings.c.

Aside: I think this series would be much more readable if it were just
squashed into one patch. 1/4 introduces code, but no way to reach it,
with 2/4 we have config, 3/4 adds a test 4/4 tweaks how to
read/parse/interpret/combine the config.

But as it is split up the individual steps should make sense. The 2/4
here should really just use "bool", not "maybe_bool" to begin with, no?

And part of this in 4/4 is inheriting a non-stricture in
repo-settings.c, but for this new config variable that we're introducing
as only a boolean from day one can't we just die() on anything that's
not a boolean?

On other implied feedback, in
https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/
I noted the switching between istate->repo & the_repository, and that
you could hit a BUG() (when uncommenting a line in my testing patch on
top) if we didn't have istate->repo:

There was a commit message update for that:

>  2:  00738c81a12 ! 2:  aae047cbc9f read-cache: add index.skipHash config option
>      @@ Commit message
>       
>           [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201
>       
>      +    We load this config from the repository config given by istate->repo,
>      +    with a fallback to the_repository if it is not set.

But I think that really sweeps a potential issue under the rug. What are
these cases where we don't get istate->repo, are those expected? Is
preferring istate->repo always known to be redundant now, and just done
for good measure, or are there subtle cases (e.g. when reading
submodules) where we pick the wrong repo's config?

In that context I'd really like to see some testing of submodules in the
added tests, i.e. does this act as we'd like it to when you set the
config as "true" in the parent, but "false" in the submodule, & the
other way around? That's a case that should stress this "the_repository"
v.s. "istate->repo".

>           While older Git versions will not recognize the null hash as a special
>           case, the file format itself is still being met in terms of its
>           structure. Using this null hash will still allow Git operations to
>      @@ read-cache.c: static int verify_hdr(const struct cache_header *hdr, unsigned lon
>        	the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
>        	the_hash_algo->final_fn(hash, &c);
>       -	if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
>      -+	if (!hasheq(hash, end - the_hash_algo->rawsz))
>      ++	if (!hasheq(hash, start))
>        		return error(_("bad index file sha1 signature"));
>        	return 0;

Thanks, good to have this suggested simplification.

>       @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
>      + 	int ieot_entries = 1;
>      + 	struct index_entry_offset_table *ieot = NULL;
>      + 	int nr, nr_threads;
>      ++	struct repository *r = istate->repo ? istate->repo : the_repository;
>        
>        	f = hashfd(tempfile->fd, tempfile->filename.buf);
>        
>      -+	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
>      ++	repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash);
>       +
>        	for (i = removed = extended = 0; i < entries; i++) {

Re the above this looks good, but did we introduce an untested behavior
change here or not?

>        		if (cache[i]->ce_flags & CE_REMOVE)
>  3:  86370af1351 = 3:  27fbe52e748 test-lib-functions: add helper for trailing hash
>  4:  6490bd445eb ! 4:  075921514f2 features: feature.manyFiles implies fast index writes
>      @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
>        
>        	f = hashfd(tempfile->fd, tempfile->filename.buf);
>        
>      --	git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash);
>      -+	if (istate->repo) {
>      -+		prepare_repo_settings(istate->repo);
>      -+		f->skip_hash = istate->repo->settings.index_skip_hash;
>      -+	}
>      +-	repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash);
>      ++	prepare_repo_settings(r);
>      ++	f->skip_hash = r->settings.index_skip_hash;
>        
>        	for (i = removed = extended = 0; i < entries; i++) {
>        		if (cache[i]->ce_flags & CE_REMOVE)

I really don't care per-se where we read the config, as long as it's
doing what we expect from the UX point of view.

But in your
https://lore.kernel.org/git/9e754af8-ecd3-6aed-60e8-2fc09a6a8759@github.com/
you noted "There's no reason to load the config inside repo-settings.c
unless it's part of something like feature.manyFiles.".

I think that's true, but on the flip side of that: Is there a reason to
move the reading of such localized config (only needed in read-cache.c,
as 2/4 shows) to repo-settings.c, just because it's now relying on the
"manyfiles"?

I think the unstated reason for that is that while we read the
"manyfiles" config we didn't stick it in the "struct repo_settings",
therefore the reading of it is conveniently done in repo-settings.c,
which the 4/4 here does.

But I think it makes more sense to just stick the "manyfiles" in that
struct, not stick "skip_hash" there, and then just check the "manyfiles"
in 4/4, but read "index.skiphash" locally.

Maybe not worth the churn at this point, but I do wonder if we're going
in the wrong direction here, i.e. if all config that happens to rely on
features must be added to repo-settings.c, or whether we should reserve
that for truly "global" (or mostly global) config.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/4] test-lib-functions: add helper for trailing hash
  2022-12-13  0:55       ` Junio C Hamano
@ 2022-12-17 17:37         ` SZEDER Gábor
  0 siblings, 0 replies; 59+ messages in thread
From: SZEDER Gábor @ 2022-12-17 17:37 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: brian m. carlson, Derrick Stolee via GitGitGadget, git, vdye,
	avarab, newren, Derrick Stolee

On Tue, Dec 13, 2022 at 09:55:47AM +0900, Junio C Hamano wrote:
> SZEDER Gábor <szeder.dev@gmail.com> writes:
> 
> >> +	test_trailing_hash .git/index >hash &&
> >> +	echo $(test_oid zero) >expect &&
> >
> > Nit: test_oid zero >expect
> >
> >> +	test_cmp expect hash &&
> 
> Unfortunately they are not equivalent.
> 
> Usually we write these helpers to terminate their output with LF,
> relying on the fact that terminating LF will be dropped when used in
> a command substition, e.g. VAR=$(HELPER), but test_oid deviates from
> the pattern and does not give the terminating LF to its output.

Oh, indeed.  But why does it omit the trailing LF?!  Alas, 2c02b110da
(t: add test functions to translate hash-related values, 2018-09-13)
doesn't seem to mention anything about it.  However, skimming through
the output of

  git grep 'test_oid ' -- ':/t/*.sh'

it appears that all but three uses of 'test_oid' are in command
substitutions, and those three exceptions are in t0000 checking that
'test_oid' actually works.  So I don't see any benefit of omitting
that trailing LF, but this and similar cases show its drawbacks.

  $ git grep 'echo $(test_oid ' -- ':/t/*.sh'
  t/t1302-repo-version.sh:        echo $(test_oid version) >expect &&
  t/t5313-pack-bounds-checks.sh:  echo $(test_oid oidfff) >file &&



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 0/4] Optionally skip hashing index on write
  2022-12-16 15:43       ` [PATCH v4 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason
@ 2023-01-06 15:33         ` Derrick Stolee
  2023-01-06 22:45           ` Junio C Hamano
  0 siblings, 1 reply; 59+ messages in thread
From: Derrick Stolee @ 2023-01-06 15:33 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason,
	Derrick Stolee via GitGitGadget
  Cc: git, gitster, vdye, newren, Jacob Keller

On 12/16/2022 10:43 AM, Ævar Arnfjörð Bjarmason wrote:

> But as it is split up the individual steps should make sense. The 2/4
> here should really just use "bool", not "maybe_bool" to begin with, no?
 
> And part of this in 4/4 is inheriting a non-stricture in
> repo-settings.c, but for this new config variable that we're introducing
> as only a boolean from day one can't we just die() on anything that's
> not a boolean?
> 
> On other implied feedback, in
> https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/
> I noted the switching between istate->repo & the_repository, and that
> you could hit a BUG() (when uncommenting a line in my testing patch on
> top) if we didn't have istate->repo:
> 
> There was a commit message update for that:
> 
>>  2:  00738c81a12 ! 2:  aae047cbc9f read-cache: add index.skipHash config option
>>      @@ Commit message
>>       
>>           [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201
>>       
>>      +    We load this config from the repository config given by istate->repo,
>>      +    with a fallback to the_repository if it is not set.
> 
> But I think that really sweeps a potential issue under the rug. What are
> these cases where we don't get istate->repo, are those expected? Is
> preferring istate->repo always known to be redundant now, and just done
> for good measure, or are there subtle cases (e.g. when reading
> submodules) where we pick the wrong repo's config?

After investigating some of the failures from creating a BUG() statement
when istate->repo is NULL I see several problems, and they are not related
to submodules for the most part.

The first issues I've found are due to code paths that operate on the_index
without actually initializing it with a do_read_index(), or otherwise
initialize an index using a memset() instead of a common initializer. This
looks to be a frequent enough problem that solving it would require a
significant effort. It's not a quick fix.

> In that context I'd really like to see some testing of submodules in the
> added tests, i.e. does this act as we'd like it to when you set the
> config as "true" in the parent, but "false" in the submodule, & the
> other way around? That's a case that should stress this "the_repository"
> v.s. "istate->repo".

I can add a test for this, though we do not do that for every new config
option. Further, we can only rely on commands like 'git add' that correctly
initialize the index from a do_read_index(). It should be sufficient to
use the index-local repository (when available) and trust that the
repo_config_...() method is doing the right thing.

> I really don't care per-se where we read the config, as long as it's
> doing what we expect from the UX point of view.
> 
> But in your
> https://lore.kernel.org/git/9e754af8-ecd3-6aed-60e8-2fc09a6a8759@github.com/
> you noted "There's no reason to load the config inside repo-settings.c
> unless it's part of something like feature.manyFiles.".
> 
> I think that's true, but on the flip side of that: Is there a reason to
> move the reading of such localized config (only needed in read-cache.c,
> as 2/4 shows) to repo-settings.c, just because it's now relying on the
> "manyfiles"?

Yes, because it makes it clear what options are part of these meta-config
keys. It is better to centralize the parsing instead of having several
different ad-hoc parsing strategies.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v5 0/4] Optionally skip hashing index on write
  2022-12-16 15:31     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (4 preceding siblings ...)
  2022-12-16 15:43       ` [PATCH v4 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason
@ 2023-01-06 16:31       ` Derrick Stolee via GitGitGadget
  2023-01-06 16:31         ` [PATCH v5 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
                           ` (4 more replies)
  5 siblings, 5 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2023-01-06 16:31 UTC (permalink / raw)
  To: git; +Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee

Writing the index is a critical action that takes place in multiple Git
commands. The recent performance improvements available with the sparse
index show how often the I/O costs around the index can affect different Git
commands, although reading the index takes place more often than a write.

The index is written through the hashfile API, both as a buffered write and
as a computation of its trailing hash. This trailing hash is an expectation
of the file format: several optional index extensions are provided using
indicators immediately preceding the trailing hash. 'git fsck' will complain
if the trailing hash does not match the contents of the file up to that
point.

However, computing the hash is expensive. This is even more expensive now
that we've moved to sha1dc instead of hardware-accelerated SHA1
computations.

This series provides a new config option, index.skipHash, that allows users
to disable computing the hash as Git writes the index. In this case, the
trailing hash is stored as the null hash (all bytes are zero).

The implementation is rather simple, but the patches organize different
aspects in a way that we could split things out:

 * Patch 1 adds a 'skip_hash' option to 'struct hashfile' that creates this
   as a general option to the hashwrite API.
 * The motivation in Patch 1 avoids the talk about the chunk-format API and
   instead focuses on the index as the first example that could make use of
   this.
 * Patch 2 creates the index.skipHash config option and updates 'git fsck'
   to ignore a null hash on the index. This is demonstrated by a very simple
   test.
 * Patch 3 creates a test helper to get the trailing hash of a file, and
   adds a concrete check on the trailing hash when index.skipHash=true.
 * Patch 4 (which could be deferred until later) adds index.skipHash=true as
   an implication of feature.manyFiles and as a setting in Scalar.

The end result is that index writes speed up significantly, enough that 'git
update-index --force-write' improves by 1.75x.

Similar patches appeared in my earlier refs RFC [1].

[1]
https://lore.kernel.org/git/pull.1408.git.1667846164.gitgitgadget@gmail.com/


Updates since v4
================

 * Patch 2 now uses repo_config_get_bool() instead of
   repo_config_get_maybe_bool().
 * A test is added for submodule-level config.


Updates since v3
================

 * Patch 1 uses an "int" instead of "unsigned int". This was a carry-over
   from Kevin's patch in microsoft/git, but our use of it in patch 2 is
   different and thus is better with a signed int.
 * Patch 2 uses a fallback to the_repository if istate->repo is NULL. This
   allows the setting to be applied to more cases where istate->repo is not
   set.
 * Patch 2 also uses 'start' in more places, since it already computed the
   beginning of the hash.
 * Patch 4 slightly modifies its use of prepare_repo_settings() due to Patch
   2's fallback to the_repository.


Updates since v2
================

 * Patch 2 now has additional details about why to use the config option in
   its message. The discussion about why the index is special is included in
   the commit message.


Updates since v1
================

 * In Patch 1, use hashcler() instead of memset().
 * In Patches 2 and 4, be explicit about which Git versions will exhibit
   different behavior when reading an index with a null trailing hash.
 * In Patch 2, used a more compact way of loading from config.
 * In Patch 2 (and following) dropped the sub-shell in t1600-index.sh.
 * In Patch 3, dropped back-ticks when using a multi-line pipe.
 * In Patch 4, use "! cmp" instead of "! test_cmp"


Updates since RFC
=================

 * The config name is now index.skipHash to make it more clear that the
   default is to not skip the hash, and choosing this option requires a true
   value.
 * Use oidread() instead of an ad-hoc loop.
 * Patches 3 and 4 are new. 3 adds some extra test verification that wasn't
   in the RFC.
 * I think that patch 4 is helpful to see now, but I could understand if we
   wanted to defer that patch until after index.skipHash has had more time
   to bake.

Thanks, -Stolee

Derrick Stolee (4):
  hashfile: allow skipping the hash function
  read-cache: add index.skipHash config option
  test-lib-functions: add helper for trailing hash
  features: feature.manyFiles implies fast index writes

 Documentation/config/feature.txt |  5 +++++
 Documentation/config/index.txt   | 11 +++++++++++
 csum-file.c                      | 14 +++++++++++---
 csum-file.h                      |  7 +++++++
 read-cache.c                     | 14 +++++++++++++-
 repo-settings.c                  |  2 ++
 repository.h                     |  1 +
 scalar.c                         |  1 +
 t/t1600-index.sh                 | 30 ++++++++++++++++++++++++++++++
 t/test-lib-functions.sh          |  8 ++++++++
 10 files changed, 89 insertions(+), 4 deletions(-)


base-commit: 805265fcf7a737664a8321aaf4a0587b78435184
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1439%2Fderrickstolee%2Fhashfile-skip-hash-v5
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1439/derrickstolee/hashfile-skip-hash-v5
Pull-Request: https://github.com/gitgitgadget/git/pull/1439

Range-diff vs v4:

 1:  c99470d4676 = 1:  c99470d4676 hashfile: allow skipping the hash function
 2:  aae047cbc9f ! 2:  a3c79308171 read-cache: add index.skipHash config option
     @@ Commit message
          ability to opt-in to this feature, even with the potential confusion
          with older 'git fsck' versions.
      
     +    Test this new config option, both at a command-line level and within a
     +    submodule. The confirmation is currently limited to confirm that 'git
     +    fsck' does not complain about the index. Future updates will make this
     +    test more robust.
     +
          It is critical that this test is placed before the test_index_version
          tests, since those tests obliterate the .git/config file and hence lose
          the setting from GIT_TEST_DEFAULT_HASH, if set.
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
       
       	f = hashfd(tempfile->fd, tempfile->filename.buf);
       
     -+	repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash);
     ++	repo_config_get_bool(r, "index.skiphash", &f->skip_hash);
      +
       	for (i = removed = extended = 0; i < entries; i++) {
       		if (cache[i]->ce_flags & CE_REMOVE)
     @@ t/t1600-index.sh: test_expect_success 'out of bounds index.version issues warnin
      +test_expect_success 'index.skipHash config option' '
      +	rm -f .git/index &&
      +	git -c index.skipHash=true add a &&
     -+	git fsck
     ++	git fsck &&
     ++
     ++	test_commit start &&
     ++	git -c protocol.file.allow=always submodule add ./ sub &&
     ++	git config index.skipHash false &&
     ++	git -C sub config index.skipHash true &&
     ++	>sub/file &&
     ++	git -C sub add a &&
     ++	git -C sub fsck
      +'
      +
       test_index_version () {
 3:  27fbe52e748 ! 3:  dab9b15de47 test-lib-functions: add helper for trailing hash
     @@ Commit message
      
          Add a new test_trailing_hash helper and use it in t1600 to verify that
          index.skipHash=true really does skip the hash computation, since
     -    'git fsck' does not actually verify the hash.
     +    'git fsck' does not actually verify the hash. This confirms that when
     +    the config is disabled explicitly in a super project but enabled in a
     +    submodule, then the use of repo_config_get_bool() loads config from the
     +    correct repository in the case of 'git add'. There are other cases where
     +    istate->repo is NULL and thus this config is loaded instead from
     +    the_repository, but that's due to many different code paths initializing
     +    index_state structs in their own way.
      
          Keep the 'git fsck' call to ensure that any potential future change to
          check the index hash does not cause an error in this case.
     @@ t/t1600-index.sh: test_expect_success 'out of bounds index.version issues warnin
      +	test_trailing_hash .git/index >hash &&
      +	echo $(test_oid zero) >expect &&
      +	test_cmp expect hash &&
     - 	git fsck
     + 	git fsck &&
     + 
     + 	test_commit start &&
     +@@ t/t1600-index.sh: test_expect_success 'index.skipHash config option' '
     + 	git -C sub config index.skipHash true &&
     + 	>sub/file &&
     + 	git -C sub add a &&
     ++	test_trailing_hash .git/modules/sub/index >hash &&
     ++	test_cmp expect hash &&
     + 	git -C sub fsck
       '
       
      
 4:  075921514f2 ! 4:  1beedcb5ba1 features: feature.manyFiles implies fast index writes
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
       
       	f = hashfd(tempfile->fd, tempfile->filename.buf);
       
     --	repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash);
     +-	repo_config_get_bool(r, "index.skiphash", &f->skip_hash);
      +	prepare_repo_settings(r);
      +	f->skip_hash = r->settings.index_skip_hash;
       
     @@ scalar.c: static int set_recommended_config(int reconfigure)
      
       ## t/t1600-index.sh ##
      @@ t/t1600-index.sh: test_expect_success 'index.skipHash config option' '
     - 	test_trailing_hash .git/index >hash &&
     - 	echo $(test_oid zero) >expect &&
       	test_cmp expect hash &&
     --	git fsck
     -+	git fsck &&
     -+
     + 	git fsck &&
     + 
      +	rm -f .git/index &&
      +	git -c feature.manyFiles=true add a &&
      +	test_trailing_hash .git/index >hash &&
     -+	test_cmp expect hash &&
     ++	cmp expect hash &&
      +
      +	rm -f .git/index &&
      +	git -c feature.manyFiles=true \
     -+		-c index.skipHash=false add a &&
     ++	    -c index.skipHash=false add a &&
      +	test_trailing_hash .git/index >hash &&
     -+	! cmp expect hash
     - '
     - 
     - test_index_version () {
     ++	! cmp expect hash &&
     ++
     + 	test_commit start &&
     + 	git -c protocol.file.allow=always submodule add ./ sub &&
     + 	git config index.skipHash false &&

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v5 1/4] hashfile: allow skipping the hash function
  2023-01-06 16:31       ` [PATCH v5 " Derrick Stolee via GitGitGadget
@ 2023-01-06 16:31         ` Derrick Stolee via GitGitGadget
  2023-01-06 16:31         ` [PATCH v5 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2023-01-06 16:31 UTC (permalink / raw)
  To: git
  Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The hashfile API is useful for generating files that include a trailing
hash of the file's contents up to that point. Using such a hash is
helpful for verifying the file for corruption-at-rest, such as a faulty
drive causing flipped bits.

Git's index file includes this trailing hash, so it uses a 'struct
hashfile' to handle the I/O to the file. This was very convenient to
allow using the hashfile methods during these operations.

However, hashing the file contents during write comes at a performance
penalty. It's slower to hash the bytes on their way to the disk than
without that step. This problem is made worse by the replacement of
hardware-accelerated SHA1 computations with the software-based sha1dc
computation.

This write cost is significant, and the checksum capability is likely
not worth that cost for such a short-lived file. The index is rewritten
frequently and the only time the checksum is checked is during 'git
fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
computation.

We first need to allow Git to opt-out of the hash computation in the
hashfile API. The buffered writes of the API are still helpful, so it
makes sense to make the change here.

Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the
update_fn and final_fn members of the_hash_algo are skipped. When
finalizing the hashfile, the trailing hash is replaced with the null
hash.

This use of a trailing null hash would be desireable in either case,
since we do not want to special case a file format to have a different
length depending on whether it was hashed or not. When the final bytes
of a file are all zero, we can infer that it was written without
hashing, and thus that verification is not available as a check for file
consistency. This also means that we could easily toggle hashing for any
file format we desire.

A version of this patch has existed in the microsoft/git fork since
2017 [1] (the linked commit was rebased in 2018, but the original dates
back to January 2017). Here, the change to make the index use this fast
path is delayed until a later change.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

Co-authored-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 csum-file.c | 14 +++++++++++---
 csum-file.h |  7 +++++++
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 59ef3398ca2..cce13c0f047 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -45,7 +45,8 @@ void hashflush(struct hashfile *f)
 	unsigned offset = f->offset;
 
 	if (offset) {
-		the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
+		if (!f->skip_hash)
+			the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	int fd;
 
 	hashflush(f);
-	the_hash_algo->final_fn(f->buffer, &f->ctx);
+
+	if (f->skip_hash)
+		hashclr(f->buffer);
+	else
+		the_hash_algo->final_fn(f->buffer, &f->ctx);
+
 	if (result)
 		hashcpy(result, f->buffer);
 	if (flags & CSUM_HASH_IN_STREAM)
@@ -108,7 +114,8 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * the hashfile's buffer. In this block,
 			 * f->offset is necessarily zero.
 			 */
-			the_hash_algo->update_fn(&f->ctx, buf, nr);
+			if (!f->skip_hash)
+				the_hash_algo->update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->tp = tp;
 	f->name = name;
 	f->do_crc = 0;
+	f->skip_hash = 0;
 	the_hash_algo->init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
diff --git a/csum-file.h b/csum-file.h
index 0d29f528fbc..793a59da12b 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -20,6 +20,13 @@ struct hashfile {
 	size_t buffer_len;
 	unsigned char *buffer;
 	unsigned char *check_buffer;
+
+	/**
+	 * If non-zero, skip_hash indicates that we should
+	 * not actually compute the hash for this hashfile and
+	 * instead only use it as a buffered write.
+	 */
+	int skip_hash;
 };
 
 /* Checkpoint */
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v5 2/4] read-cache: add index.skipHash config option
  2023-01-06 16:31       ` [PATCH v5 " Derrick Stolee via GitGitGadget
  2023-01-06 16:31         ` [PATCH v5 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
@ 2023-01-06 16:31         ` Derrick Stolee via GitGitGadget
  2023-01-06 16:31         ` [PATCH v5 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2023-01-06 16:31 UTC (permalink / raw)
  To: git
  Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change allowed skipping the hashing portion of the
hashwrite API, using it instead as a buffered write API. Disabling the
hashwrite can be particularly helpful when the write operation is in a
critical path.

One such critical path is the writing of the index. This operation is so
critical that the sparse index was created specifically to reduce the
size of the index to make these writes (and reads) faster.

This trade-off between file stability at rest and write-time performance
is not easy to balance. The index is an interesting case for a couple
reasons:

1. Writes block users. Writing the index takes place in many user-
   blocking foreground operations. The speed improvement directly
   impacts their use. Other file formats are typically written in the
   background (commit-graph, multi-pack-index) or are super-critical to
   correctness (pack-files).

2. Index files are short lived. It is rare that a user leaves an index
   for a long time with many staged changes. Outside of staged changes,
   the index can be completely destroyed and rewritten with minimal
   impact to the user.

Following a similar approach to one used in the microsoft/git fork [1],
add a new config option (index.skipHash) that allows disabling this
hashing during the index write. The cost is that we can no longer
validate the contents for corruption-at-rest using the trailing hash.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

We load this config from the repository config given by istate->repo,
with a fallback to the_repository if it is not set.

While older Git versions will not recognize the null hash as a special
case, the file format itself is still being met in terms of its
structure. Using this null hash will still allow Git operations to
function across older versions.

The one exception is 'git fsck' which checks the hash of the index file.
This used to be a check on every index read, but was split out to just
the index in a33fc72fe91 (read-cache: force_verify_index_checksum,
2017-04-14) and released first in Git 2.13.0. Document the versions that
relaxed these restrictions, with the optimistic expectation that this
change will be included in Git 2.40.0.

Here, we disable this check if the trailing hash is all zeroes. We add a
warning to the config option that this may cause undesirable behavior
with older Git versions.

As a quick comparison, I tested 'git update-index --force-write' with
and without index.skipHash=true on a copy of the Linux kernel
repository.

Benchmark 1: with hash
  Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
  Range (min … max):    34.3 ms …  79.1 ms    82 runs

Benchmark 2: without hash
  Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
  Range (min … max):    16.3 ms …  42.0 ms    69 runs

Summary
  'without hash' ran
    1.78 ± 0.76 times faster than 'with hash'

These performance benefits are substantial enough to allow users the
ability to opt-in to this feature, even with the potential confusion
with older 'git fsck' versions.

Test this new config option, both at a command-line level and within a
submodule. The confirmation is currently limited to confirm that 'git
fsck' does not complain about the index. Future updates will make this
test more robust.

It is critical that this test is placed before the test_index_version
tests, since those tests obliterate the .git/config file and hence lose
the setting from GIT_TEST_DEFAULT_HASH, if set.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/index.txt | 11 +++++++++++
 read-cache.c                   | 13 ++++++++++++-
 t/t1600-index.sh               | 14 ++++++++++++++
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt
index 75f3a2d1054..23c7985eb40 100644
--- a/Documentation/config/index.txt
+++ b/Documentation/config/index.txt
@@ -30,3 +30,14 @@ index.version::
 	Specify the version with which new index files should be
 	initialized.  This does not affect existing repositories.
 	If `feature.manyFiles` is enabled, then the default is 4.
+
+index.skipHash::
+	When enabled, do not compute the trailing hash for the index file.
+	This accelerates Git commands that manipulate the index, such as
+	`git add`, `git commit`, or `git status`. Instead of storing the
+	checksum, write a trailing set of bytes with value zero, indicating
+	that the computation was skipped.
++
+If you enable `index.skipHash`, then Git clients older than 2.13.0 will
+refuse to parse the index and Git clients older than 2.40.0 will report an
+error during `git fsck`.
diff --git a/read-cache.c b/read-cache.c
index 46f5e497b14..d73a81e41ae 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	git_hash_ctx c;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	int hdr_version;
+	unsigned char *start, *end;
+	struct object_id oid;
 
 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
 		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
@@ -1827,10 +1829,16 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	if (!verify_index_checksum)
 		return 0;
 
+	end = (unsigned char *)hdr + size;
+	start = end - the_hash_algo->rawsz;
+	oidread(&oid, start);
+	if (oideq(&oid, null_oid()))
+		return 0;
+
 	the_hash_algo->init_fn(&c);
 	the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
 	the_hash_algo->final_fn(hash, &c);
-	if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
+	if (!hasheq(hash, start))
 		return error(_("bad index file sha1 signature"));
 	return 0;
 }
@@ -2915,9 +2923,12 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	int ieot_entries = 1;
 	struct index_entry_offset_table *ieot = NULL;
 	int nr, nr_threads;
+	struct repository *r = istate->repo ? istate->repo : the_repository;
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
+	repo_config_get_bool(r, "index.skiphash", &f->skip_hash);
+
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
 			removed++;
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 010989f90e6..98c5a83db73 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -65,6 +65,20 @@ test_expect_success 'out of bounds index.version issues warning' '
 	)
 '
 
+test_expect_success 'index.skipHash config option' '
+	rm -f .git/index &&
+	git -c index.skipHash=true add a &&
+	git fsck &&
+
+	test_commit start &&
+	git -c protocol.file.allow=always submodule add ./ sub &&
+	git config index.skipHash false &&
+	git -C sub config index.skipHash true &&
+	>sub/file &&
+	git -C sub add a &&
+	git -C sub fsck
+'
+
 test_index_version () {
 	INDEX_VERSION_CONFIG=$1 &&
 	FEATURE_MANY_FILES=$2 &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v5 3/4] test-lib-functions: add helper for trailing hash
  2023-01-06 16:31       ` [PATCH v5 " Derrick Stolee via GitGitGadget
  2023-01-06 16:31         ` [PATCH v5 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
  2023-01-06 16:31         ` [PATCH v5 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
@ 2023-01-06 16:31         ` Derrick Stolee via GitGitGadget
  2023-01-06 16:31         ` [PATCH v5 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
  2023-01-15  9:31         ` [PATCH v5 0/4] Optionally skip hashing index on write Junio C Hamano
  4 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2023-01-06 16:31 UTC (permalink / raw)
  To: git
  Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

It can be helpful to check that a file format with a trailing hash has a
specific hash in the final bytes of a written file. This is made more
apparent by recent changes that allow skipping the hash algorithm and
writing a null hash at the end of the file instead.

Add a new test_trailing_hash helper and use it in t1600 to verify that
index.skipHash=true really does skip the hash computation, since
'git fsck' does not actually verify the hash. This confirms that when
the config is disabled explicitly in a super project but enabled in a
submodule, then the use of repo_config_get_bool() loads config from the
correct repository in the case of 'git add'. There are other cases where
istate->repo is NULL and thus this config is loaded instead from
the_repository, but that's due to many different code paths initializing
index_state structs in their own way.

Keep the 'git fsck' call to ensure that any potential future change to
check the index hash does not cause an error in this case.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/t1600-index.sh        | 5 +++++
 t/test-lib-functions.sh | 8 ++++++++
 2 files changed, 13 insertions(+)

diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 98c5a83db73..2f792bb8ffa 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -68,6 +68,9 @@ test_expect_success 'out of bounds index.version issues warning' '
 test_expect_success 'index.skipHash config option' '
 	rm -f .git/index &&
 	git -c index.skipHash=true add a &&
+	test_trailing_hash .git/index >hash &&
+	echo $(test_oid zero) >expect &&
+	test_cmp expect hash &&
 	git fsck &&
 
 	test_commit start &&
@@ -76,6 +79,8 @@ test_expect_success 'index.skipHash config option' '
 	git -C sub config index.skipHash true &&
 	>sub/file &&
 	git -C sub add a &&
+	test_trailing_hash .git/modules/sub/index >hash &&
+	test_cmp expect hash &&
 	git -C sub fsck
 '
 
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 796093a7b32..60308843f8f 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1875,3 +1875,11 @@ test_cmp_config_output () {
 	sort config-actual >sorted-actual &&
 	test_cmp sorted-expect sorted-actual
 }
+
+# Given a filename, extract its trailing hash as a hex string
+test_trailing_hash () {
+	local file="$1" &&
+	tail -c $(test_oid rawsz) "$file" |
+		test-tool hexdump |
+		sed "s/ //g"
+}
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v5 4/4] features: feature.manyFiles implies fast index writes
  2023-01-06 16:31       ` [PATCH v5 " Derrick Stolee via GitGitGadget
                           ` (2 preceding siblings ...)
  2023-01-06 16:31         ` [PATCH v5 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
@ 2023-01-06 16:31         ` Derrick Stolee via GitGitGadget
  2023-01-15  9:31         ` [PATCH v5 0/4] Optionally skip hashing index on write Junio C Hamano
  4 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2023-01-06 16:31 UTC (permalink / raw)
  To: git
  Cc: gitster, vdye, avarab, newren, Jacob Keller, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The recent addition of the index.skipHash config option allows index
writes to speed up by skipping the hash computation for the trailing
checksum. This is particularly critical for repositories with many files
at HEAD, so add this config option to two cases where users in that
scenario may opt-in to such behavior:

 1. The feature.manyFiles config option enables some options that are
    helpful for repositories with many files at HEAD.

 2. 'scalar register' and 'scalar reconfigure' set config options that
    optimize for large repositories.

In both of these cases, set index.skipHash=true to gain this
speedup. Add tests that demonstrate the proper way that
index.skipHash=true can override feature.manyFiles=true.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/feature.txt |  5 +++++
 read-cache.c                     |  3 ++-
 repo-settings.c                  |  2 ++
 repository.h                     |  1 +
 scalar.c                         |  1 +
 t/t1600-index.sh                 | 11 +++++++++++
 6 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt
index 95975e50912..e52bc6b8584 100644
--- a/Documentation/config/feature.txt
+++ b/Documentation/config/feature.txt
@@ -23,6 +23,11 @@ feature.manyFiles::
 	working directory. With many files, commands such as `git status` and
 	`git checkout` may be slow and these new defaults improve performance:
 +
+* `index.skipHash=true` speeds up index writes by not computing a trailing
+  checksum. Note that this will cause Git versions earlier than 2.13.0 to
+  refuse to parse the index and Git versions earlier than 2.40.0 will report
+  a corrupted index during `git fsck`.
++
 * `index.version=4` enables path-prefix compression in the index.
 +
 * `core.untrackedCache=true` enables the untracked cache. This setting assumes
diff --git a/read-cache.c b/read-cache.c
index d73a81e41ae..feefa0f68ba 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2927,7 +2927,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
-	repo_config_get_bool(r, "index.skiphash", &f->skip_hash);
+	prepare_repo_settings(r);
+	f->skip_hash = r->settings.index_skip_hash;
 
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
diff --git a/repo-settings.c b/repo-settings.c
index 3021921c53d..3dbd3f0e2ec 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -47,6 +47,7 @@ void prepare_repo_settings(struct repository *r)
 	}
 	if (manyfiles) {
 		r->settings.index_version = 4;
+		r->settings.index_skip_hash = 1;
 		r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE;
 	}
 
@@ -61,6 +62,7 @@ void prepare_repo_settings(struct repository *r)
 	repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1);
 	repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1);
 	repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
+	repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
 
 	/*
 	 * The GIT_TEST_MULTI_PACK_INDEX variable is special in that
diff --git a/repository.h b/repository.h
index 6c461c5b9de..e8c67ffe165 100644
--- a/repository.h
+++ b/repository.h
@@ -42,6 +42,7 @@ struct repo_settings {
 	struct fsmonitor_settings *fsmonitor; /* lazily loaded */
 
 	int index_version;
+	int index_skip_hash;
 	enum untracked_cache_setting core_untracked_cache;
 
 	int pack_use_sparse;
diff --git a/scalar.c b/scalar.c
index 6c52243cdf1..b49bb8c24ec 100644
--- a/scalar.c
+++ b/scalar.c
@@ -143,6 +143,7 @@ static int set_recommended_config(int reconfigure)
 		{ "credential.validate", "false", 1 }, /* GCM4W-only */
 		{ "gc.auto", "0", 1 },
 		{ "gui.GCWarning", "false", 1 },
+		{ "index.skipHash", "false", 1 },
 		{ "index.threads", "true", 1 },
 		{ "index.version", "4", 1 },
 		{ "merge.stat", "false", 1 },
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 2f792bb8ffa..0ebbae13058 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -73,6 +73,17 @@ test_expect_success 'index.skipHash config option' '
 	test_cmp expect hash &&
 	git fsck &&
 
+	rm -f .git/index &&
+	git -c feature.manyFiles=true add a &&
+	test_trailing_hash .git/index >hash &&
+	cmp expect hash &&
+
+	rm -f .git/index &&
+	git -c feature.manyFiles=true \
+	    -c index.skipHash=false add a &&
+	test_trailing_hash .git/index >hash &&
+	! cmp expect hash &&
+
 	test_commit start &&
 	git -c protocol.file.allow=always submodule add ./ sub &&
 	git config index.skipHash false &&
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 0/4] Optionally skip hashing index on write
  2023-01-06 15:33         ` Derrick Stolee
@ 2023-01-06 22:45           ` Junio C Hamano
  2023-01-06 23:40             ` Derrick Stolee
  0 siblings, 1 reply; 59+ messages in thread
From: Junio C Hamano @ 2023-01-06 22:45 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason,
	Derrick Stolee via GitGitGadget, git, vdye, newren, Jacob Keller

Derrick Stolee <derrickstolee@github.com> writes:

> After investigating some of the failures from creating a BUG() statement
> when istate->repo is NULL I see several problems, and they are not related
> to submodules for the most part.
>
> The first issues I've found are due to code paths that operate on the_index
> without actually initializing it with a do_read_index(), or otherwise
> initialize an index using a memset() instead of a common initializer. This
> looks to be a frequent enough problem that solving it would require a
> significant effort. It's not a quick fix.

Thanks.  It is not entirely unexpected X-<, considering how the
connection from the in-core repository and the in-core index has
been developed and evolved.  So in short, istate->repo is not yet
ready to be used, until the problems you identified are resolved?

We probably should start paying down such technical debts.  We've
punted on them too many times, saying "yes this is klunky but what
we have is good enough for adding this feature", I suspect?

Thanks.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 0/4] Optionally skip hashing index on write
  2023-01-06 22:45           ` Junio C Hamano
@ 2023-01-06 23:40             ` Derrick Stolee
  2023-01-09 17:15               ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 59+ messages in thread
From: Derrick Stolee @ 2023-01-06 23:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason,
	Derrick Stolee via GitGitGadget, git, vdye, newren, Jacob Keller

On 1/6/2023 5:45 PM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:
> 
>> After investigating some of the failures from creating a BUG() statement
>> when istate->repo is NULL I see several problems, and they are not related
>> to submodules for the most part.
>>
>> The first issues I've found are due to code paths that operate on the_index
>> without actually initializing it with a do_read_index(), or otherwise
>> initialize an index using a memset() instead of a common initializer. This
>> looks to be a frequent enough problem that solving it would require a
>> significant effort. It's not a quick fix.
> 
> Thanks.  It is not entirely unexpected X-<, considering how the
> connection from the in-core repository and the in-core index has
> been developed and evolved.  So in short, istate->repo is not yet
> ready to be used, until the problems you identified are resolved?
> 
> We probably should start paying down such technical debts.  We've
> punted on them too many times, saying "yes this is klunky but what
> we have is good enough for adding this feature", I suspect?

Yes, I'll make note to prioritize this soon.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 0/4] Optionally skip hashing index on write
  2023-01-06 23:40             ` Derrick Stolee
@ 2023-01-09 17:15               ` Ævar Arnfjörð Bjarmason
  2023-01-09 18:00                 ` Derrick Stolee
  0 siblings, 1 reply; 59+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-01-09 17:15 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git, vdye,
	newren, Jacob Keller


On Fri, Jan 06 2023, Derrick Stolee wrote:

> On 1/6/2023 5:45 PM, Junio C Hamano wrote:
>> Derrick Stolee <derrickstolee@github.com> writes:
>> 
>>> After investigating some of the failures from creating a BUG() statement
>>> when istate->repo is NULL I see several problems, and they are not related
>>> to submodules for the most part.
>>>
>>> The first issues I've found are due to code paths that operate on the_index
>>> without actually initializing it with a do_read_index(), or otherwise
>>> initialize an index using a memset() instead of a common initializer. This
>>> looks to be a frequent enough problem that solving it would require a
>>> significant effort. It's not a quick fix.
>> 
>> Thanks.  It is not entirely unexpected X-<, considering how the
>> connection from the in-core repository and the in-core index has
>> been developed and evolved.  So in short, istate->repo is not yet
>> ready to be used, until the problems you identified are resolved?
>> 
>> We probably should start paying down such technical debts.  We've
>> punted on them too many times, saying "yes this is klunky but what
>> we have is good enough for adding this feature", I suspect?
>
> Yes, I'll make note to prioritize this soon.

I noted in passing in [1] that I had those patches locally, if I'm
understanding you correctly and you're planning to work on changes
that'll make "istate->repo" always non-NULL.

I've rebased those on top of your "ds/omit-trailing-hash-in-index". I'm
CI-ing those now & hoping to submit them soon (I've had it working for a
while, but there was some interaction with your patches). Preview at
[2].

1. https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/
2. https://github.com/avar/git/compare/avar-ds/omit-trailing-hash-in-index...avar/do-not-lazy-populate-istate-repo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 0/4] Optionally skip hashing index on write
  2023-01-09 17:15               ` Ævar Arnfjörð Bjarmason
@ 2023-01-09 18:00                 ` Derrick Stolee
  2023-01-09 19:22                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 59+ messages in thread
From: Derrick Stolee @ 2023-01-09 18:00 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git, vdye,
	newren, Jacob Keller

On 1/9/2023 12:15 PM, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Jan 06 2023, Derrick Stolee wrote:
>> On 1/6/2023 5:45 PM, Junio C Hamano wrote:

>>> We probably should start paying down such technical debts.  We've
>>> punted on them too many times, saying "yes this is klunky but what
>>> we have is good enough for adding this feature", I suspect?
>>
>> Yes, I'll make note to prioritize this soon.
> 
> I noted in passing in [1] that I had those patches locally, if I'm
> understanding you correctly and you're planning to work on changes
> that'll make "istate->repo" always non-NULL.
> 
> I've rebased those on top of your "ds/omit-trailing-hash-in-index". I'm
> CI-ing those now & hoping to submit them soon (I've had it working for a
> while, but there was some interaction with your patches). Preview at
> [2].
> 
> 1. https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/
> 2. https://github.com/avar/git/compare/avar-ds/omit-trailing-hash-in-index...avar/do-not-lazy-populate-istate-repo

Thanks for doing this so I don't need to.

Some quick pre-submission feedback: your "treewide" patch [1] is far
too big and doing too much all at once. It's difficult to know why
you're doing things when you're doing them, especially the choices
for the validate_cache_repo() calls.

In particular, I noticed that you used "the_repository" in some cases
where a local "struct repository *" would be better (even if it is
currently pointing to the_repository as in builtin/sparse-checkout.c
in update_working_directory()). These would be easier to catch in
smaller diffs.

[1] https://github.com/avar/git/commit/7732a41800dfc8f5dbf909560615d6048d583ed9

Looking forward to your series.

-Stolee

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 0/4] Optionally skip hashing index on write
  2023-01-09 18:00                 ` Derrick Stolee
@ 2023-01-09 19:22                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 59+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-01-09 19:22 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git, vdye,
	newren, Jacob Keller


On Mon, Jan 09 2023, Derrick Stolee wrote:

> On 1/9/2023 12:15 PM, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Jan 06 2023, Derrick Stolee wrote:
>>> On 1/6/2023 5:45 PM, Junio C Hamano wrote:
>
>>>> We probably should start paying down such technical debts.  We've
>>>> punted on them too many times, saying "yes this is klunky but what
>>>> we have is good enough for adding this feature", I suspect?
>>>
>>> Yes, I'll make note to prioritize this soon.
>> 
>> I noted in passing in [1] that I had those patches locally, if I'm
>> understanding you correctly and you're planning to work on changes
>> that'll make "istate->repo" always non-NULL.
>> 
>> I've rebased those on top of your "ds/omit-trailing-hash-in-index". I'm
>> CI-ing those now & hoping to submit them soon (I've had it working for a
>> while, but there was some interaction with your patches). Preview at
>> [2].
>> 
>> 1. https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/
>> 2. https://github.com/avar/git/compare/avar-ds/omit-trailing-hash-in-index...avar/do-not-lazy-populate-istate-repo
>
> Thanks for doing this so I don't need to.
>
> Some quick pre-submission feedback: your "treewide" patch [1] is far
> too big and doing too much all at once. It's difficult to know why
> you're doing things when you're doing them, especially the choices
> for the validate_cache_repo() calls.

Yeah, I don't like it either :)

I'm mulling over whether to keep it at all. I.e. the fix to pass the
data is relatively small, but most of that patch is validation paranoia.

It was helpful to write that to validate it for my own sanity, but maybe
it's not worth keeping it.

Ultimately it's just making things BUG(...)  during testing that would
otherwise be a NULL-pointer dereference, so I'm currently leaning
towards (especially with this early feedback) just skipping it.

Now that we have a SANITIZE=address CI job the error feedback if/when
that would happen is arguably worse than the BUG(...) output.

> In particular, I noticed that you used "the_repository" in some cases
> where a local "struct repository *" would be better (even if it is
> currently pointing to the_repository as in builtin/sparse-checkout.c
> in update_working_directory()). These would be easier to catch in
> smaller diffs.

Yes, definitely. I was careful to use an existing "struct repository *"
in favor of "the_repository", but clearly not careful enough!

I'll fix that, and look carefully over the rest before submission in
case I've missed other similar cases.

> [1] https://github.com/avar/git/commit/7732a41800dfc8f5dbf909560615d6048d583ed9
>
> Looking forward to your series.

Thanks, since I sent the above CI passed, so it'll be sooner than later,
but I'll massage it a bit per the above before submission.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v5 0/4] Optionally skip hashing index on write
  2023-01-06 16:31       ` [PATCH v5 " Derrick Stolee via GitGitGadget
                           ` (3 preceding siblings ...)
  2023-01-06 16:31         ` [PATCH v5 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
@ 2023-01-15  9:31         ` Junio C Hamano
  2023-01-17 14:49           ` Derrick Stolee
  4 siblings, 1 reply; 59+ messages in thread
From: Junio C Hamano @ 2023-01-15  9:31 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, vdye, avarab, newren, Jacob Keller, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Writing the index is a critical action that takes place in multiple Git
> commands. The recent performance improvements available with the sparse
> index show how often the I/O costs around the index can affect different Git
> commands, although reading the index takes place more often than a write.

https://github.com/git/git/actions/runs/3922482355/jobs/6705454550#step:4:2006

shows that all other platforms passed but osx-gcc failed one of the
tests this series added.  Could it be that there is something racy
about the test (1006.6 if I am not mistaken)?


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v5 0/4] Optionally skip hashing index on write
  2023-01-15  9:31         ` [PATCH v5 0/4] Optionally skip hashing index on write Junio C Hamano
@ 2023-01-17 14:49           ` Derrick Stolee
  0 siblings, 0 replies; 59+ messages in thread
From: Derrick Stolee @ 2023-01-17 14:49 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, vdye, avarab, newren, Jacob Keller

On 1/15/2023 4:31 AM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> Writing the index is a critical action that takes place in multiple Git
>> commands. The recent performance improvements available with the sparse
>> index show how often the I/O costs around the index can affect different Git
>> commands, although reading the index takes place more often than a write.
> 
> https://github.com/git/git/actions/runs/3922482355/jobs/6705454550#step:4:2006
> 
> shows that all other platforms passed but osx-gcc failed one of the
> tests this series added.  Could it be that there is something racy
> about the test (1006.6 if I am not mistaken)?

Yes, you are right. The patch below fixes the problem. Sorry about that!

-Stolee
 
-- >8 --

From 6827205f979ecfa66f4f9edabfb247cff05f0605 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <derrickstolee@github.com>
Date: Tue, 17 Jan 2023 09:46:04 -0500
Subject: [PATCH] t1600: fix racy index.skipHash test

The test 1600.6 can fail under --stress due to mtime collisions. Most of
the tests include a removal of the index file to guarantee that the
index is updated. However, the submodule test addded in ee1f0c242ef
(read-cache: add index.skipHash config option, 2023-01-06) did not
include this removal. Thus, on rare occasions, the test can fail because
the index still has a non-null trailing hash, as detected by the helper
added in da9acde14ed (test-lib-functions: add helper for trailing hash,
2023-01-06).

By removing the submodule's index before the 'git -C sub add a' command,
we guarantee that the index is rewritten with the new index.skipHash
config option.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/t1600-index.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 0ebbae13058..9368d82f7d7 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -88,6 +88,7 @@ test_expect_success 'index.skipHash config option' '
 	git -c protocol.file.allow=always submodule add ./ sub &&
 	git config index.skipHash false &&
 	git -C sub config index.skipHash true &&
+	rm -f .git/modules/sub/index &&
 	>sub/file &&
 	git -C sub add a &&
 	test_trailing_hash .git/modules/sub/index >hash &&
-- 
2.38.0.vfs.0.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2023-01-17 14:50 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
2022-12-07 17:25 ` [PATCH 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-12-07 22:13   ` Ævar Arnfjörð Bjarmason
2022-12-08  7:32     ` Jeff King
2022-12-07 17:25 ` [PATCH 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2022-12-07 18:59   ` Eric Sunshine
2022-12-12 13:59     ` Derrick Stolee
2022-12-12 18:55       ` Eric Sunshine
2022-12-07 22:25   ` Ævar Arnfjörð Bjarmason
2022-12-07 23:06   ` Ævar Arnfjörð Bjarmason
2022-12-08  0:05     ` Junio C Hamano
2022-12-12 14:05     ` Derrick Stolee
2022-12-12 18:01       ` Ævar Arnfjörð Bjarmason
2022-12-07 17:25 ` [PATCH 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2022-12-07 22:27   ` Ævar Arnfjörð Bjarmason
2022-12-12 14:10     ` Derrick Stolee
2022-12-07 17:25 ` [PATCH 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2022-12-07 22:30   ` Ævar Arnfjörð Bjarmason
2022-12-12 14:18     ` Derrick Stolee
2022-12-12 18:27       ` Ævar Arnfjörð Bjarmason
2022-12-07 23:27 ` [PATCH 0/4] Optionally skip hashing index on write Junio C Hamano
2022-12-07 23:42   ` Ævar Arnfjörð Bjarmason
2022-12-08 16:38   ` Derrick Stolee
2022-12-12 22:22     ` Jacob Keller
2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2022-12-12 16:31   ` [PATCH v2 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-12-12 16:31   ` [PATCH v2 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2022-12-12 16:31   ` [PATCH v2 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2022-12-12 18:14     ` SZEDER Gábor
2022-12-13  0:55       ` Junio C Hamano
2022-12-17 17:37         ` SZEDER Gábor
2022-12-12 16:31   ` [PATCH v2 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2022-12-15 15:06   ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
2022-12-15 15:06     ` [PATCH v3 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-12-15 15:06     ` [PATCH v3 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2022-12-15 16:12       ` Ævar Arnfjörð Bjarmason
2022-12-15 15:06     ` [PATCH v3 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2022-12-15 15:07     ` [PATCH v3 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2022-12-15 15:56     ` [PATCH v3 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason
2022-12-16 13:41       ` Derrick Stolee
2022-12-16 15:31     ` [PATCH v4 " Derrick Stolee via GitGitGadget
2022-12-16 15:31       ` [PATCH v4 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-12-16 15:31       ` [PATCH v4 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2022-12-16 15:31       ` [PATCH v4 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2022-12-16 15:31       ` [PATCH v4 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2022-12-16 15:43       ` [PATCH v4 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason
2023-01-06 15:33         ` Derrick Stolee
2023-01-06 22:45           ` Junio C Hamano
2023-01-06 23:40             ` Derrick Stolee
2023-01-09 17:15               ` Ævar Arnfjörð Bjarmason
2023-01-09 18:00                 ` Derrick Stolee
2023-01-09 19:22                   ` Ævar Arnfjörð Bjarmason
2023-01-06 16:31       ` [PATCH v5 " Derrick Stolee via GitGitGadget
2023-01-06 16:31         ` [PATCH v5 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2023-01-06 16:31         ` [PATCH v5 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2023-01-06 16:31         ` [PATCH v5 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2023-01-06 16:31         ` [PATCH v5 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2023-01-15  9:31         ` [PATCH v5 0/4] Optionally skip hashing index on write Junio C Hamano
2023-01-17 14:49           ` Derrick Stolee

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).