* [PATCH 0/5] handling 4GB .idx files
@ 2020-11-13  5:06 Jeff King
  2020-11-13  5:06 ` [PATCH 1/5] compute pack .idx byte offsets using size_t Jeff King
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Jeff King @ 2020-11-13  5:06 UTC (permalink / raw)
  To: git

I recently ran into a case where Git could not read the pack it had
produced via running "git repack". The culprit turned out to be an .idx
file which crossed the 4GB barrier (in bytes, not number of objects).
This series fixes the problems I saw, along with similar ones I couldn't
trigger in practice, and protects the .idx loading code against integer
overflows that would fool the size checks.

  [1/5]: compute pack .idx byte offsets using size_t
  [2/5]: use size_t to store pack .idx byte offsets
  [3/5]: fsck: correctly compute checksums on idx files larger than 4GB
  [4/5]: block-sha1: take a size_t length parameter
  [5/5]: packfile: detect overflow in .idx file size checks

 block-sha1/sha1.c        |  2 +-
 block-sha1/sha1.h        |  2 +-
 builtin/index-pack.c     |  2 +-
 builtin/pack-redundant.c |  6 +++---
 pack-check.c             | 10 +++++-----
 pack-revindex.c          |  2 +-
 packfile.c               | 14 +++++++-------
 7 files changed, 19 insertions(+), 19 deletions(-)

-Peff


* [PATCH 1/5] compute pack .idx byte offsets using size_t
  2020-11-13  5:06 [PATCH 0/5] handling 4GB .idx files Jeff King
@ 2020-11-13  5:06 ` Jeff King
  2020-11-13  5:07 ` [PATCH 2/5] use size_t to store pack .idx byte offsets Jeff King
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2020-11-13  5:06 UTC (permalink / raw)
  To: git

A pack and its matching .idx file are limited to 2^32 objects, because
the pack format contains a 32-bit field to store the number of objects.
Hence we use uint32_t in the code.

But the byte count of even a .idx file can be much larger than what a
32-bit integer can represent, because it stores at least a hash and an
offset for each object. So
using SHA-1, a v2 .idx file will cross the 4GB boundary at 153,391,650
objects. This confuses load_idx(), which computes the minimum size like
this:

  unsigned long min_size = 8 + 4*256 + nr*(hashsz + 4 + 4) + hashsz + hashsz;

Even though min_size itself is an unsigned long, and thus big enough on
most 64-bit platforms, the arithmetic on the right-hand side is done
entirely in 32 bits, so the result is truncated. Our real file size does
exceed that truncated min_size, but then we do:

  unsigned long max_size = min_size;
  if (nr)
          max_size += (nr - 1)*8;

to account for the variable-sized table. That computation doesn't
overflow at quite so low an object count, but because min_size was
already truncated, we end up with a max_size that is much smaller than
our actual file size. So we complain that the idx is invalid, and can't
find any of its objects.

We can fix this case by casting "nr" to a size_t, which will do the
multiplication in 64 bits (assuming you're on a 64-bit platform; this
will never work on a 32-bit system since we couldn't map the whole .idx
anyway). Likewise, we don't have to worry about further additions,
because adding a smaller number to a size_t will convert the other side
to a size_t.
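
To make the truncation concrete, here is a minimal standalone sketch
(not part of the patch; it assumes a 64-bit platform and borrows the
object count of the test repo described below):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          uint32_t nr = 153725110;     /* objects in the test repo */
          unsigned int hashsz = 20;    /* SHA-1 */

          /* all operands are 32-bit, so this wraps mod 2^32 down to ~9MB */
          unsigned long broken = 8 + 4*256 + nr*(hashsz + 4 + 4) + hashsz + hashsz;

          /* the cast widens the multiplication; a bit over 4GB, as expected */
          unsigned long fixed = 8 + 4*256 + (size_t)nr*(hashsz + 4 + 4) + hashsz + hashsz;

          printf("broken=%lu fixed=%lu\n", broken, fixed);
          return 0;
  }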

A few notes:

  - obviously we could just declare "nr" as a size_t in the first place
    (and likewise, packed_git.num_objects).  But it's conceptually a
    uint32_t because of the on-disk format, and we correctly treat it
    that way in other contexts that don't need to compute byte offsets
    (e.g., iterating over the set of objects should and generally does
    use a uint32_t). Switching to size_t would make all of those other
    cases look wrong.

  - it could be argued that the proper type is off_t to represent the
    file offset. But in practice the .idx file must fit within memory,
    because we mmap the whole thing. And the rest of the code (including
    the idx_size variable we're comparing against) uses size_t.

  - we'll add the same cast to the max_size arithmetic line. Even though
    we're adding to a larger type, which will convert our result, the
    multiplication is still done as a 32-bit value and can itself
    overflow. I didn't check this with my test case, since it would need
    an even larger pack (~530M objects), but looking at compiler output
    shows that it works this way (see the sketch after this list). The
    standard should agree, but I couldn't find anything explicit in
    6.3.1.8 ("usual arithmetic conversions").

The case in load_idx() was the most immediate one that I was able to
trigger. After fixing it, looking up actual objects (including the very
last one in sha1 order) works in a test repo with 153,725,110 objects.
That's because bsearch_hash() works with uint32_t entry indices, and the
actual byte access:

  int cmp = hashcmp(table + mi * stride, sha1);

is done with "stride" as a size_t, causing the uint32_t "mi" to be
promoted to a size_t. This is the way most code will access the index
data.

However, I audited all of the other byte-wise accesses of
packed_git.index_data, and many of the others are suspect (they are
similar to the max_size one, where we are adding to a properly sized
offset or directly to a pointer, but the multiplication in the
sub-expression can overflow). I didn't trigger any of these in practice,
but I believe they're potential problems, and certainly adding in the
cast is not going to hurt anything here.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/index-pack.c |  2 +-
 pack-check.c         |  2 +-
 pack-revindex.c      |  2 +-
 packfile.c           | 12 ++++++------
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 0d03cb442d..4b8d86e0ad 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1597,7 +1597,7 @@ static void read_v2_anomalous_offsets(struct packed_git *p,
 
 	/* The address of the 4-byte offset table */
 	idx1 = (((const uint32_t *)((const uint8_t *)p->index_data + p->crc_offset))
-		+ p->num_objects /* CRC32 table */
+		+ (size_t)p->num_objects /* CRC32 table */
 		);
 
 	/* The address of the 8-byte offset table */
diff --git a/pack-check.c b/pack-check.c
index dad6d8ae7f..db3adf8781 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -39,7 +39,7 @@ int check_pack_crc(struct packed_git *p, struct pack_window **w_curs,
 	} while (len);
 
 	index_crc = p->index_data;
-	index_crc += 2 + 256 + p->num_objects * (the_hash_algo->rawsz/4) + nr;
+	index_crc += 2 + 256 + (size_t)p->num_objects * (the_hash_algo->rawsz/4) + nr;
 
 	return data_crc != ntohl(*index_crc);
 }
diff --git a/pack-revindex.c b/pack-revindex.c
index d28a7e43d0..ecdde39cf4 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -130,7 +130,7 @@ static void create_pack_revindex(struct packed_git *p)
 
 	if (p->index_version > 1) {
 		const uint32_t *off_32 =
-			(uint32_t *)(index + 8 + p->num_objects * (hashsz + 4));
+			(uint32_t *)(index + 8 + (size_t)p->num_objects * (hashsz + 4));
 		const uint32_t *off_64 = off_32 + p->num_objects;
 		for (i = 0; i < num_ent; i++) {
 			const uint32_t off = ntohl(*off_32++);
diff --git a/packfile.c b/packfile.c
index 0929ebe4fc..a72c2a261f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -148,7 +148,7 @@ int load_idx(const char *path, const unsigned int hashsz, void *idx_map,
 		 *  - hash of the packfile
 		 *  - file checksum
 		 */
-		if (idx_size != 4 * 256 + nr * (hashsz + 4) + hashsz + hashsz)
+		if (idx_size != 4 * 256 + (size_t)nr * (hashsz + 4) + hashsz + hashsz)
 			return error("wrong index v1 file size in %s", path);
 	} else if (version == 2) {
 		/*
@@ -164,10 +164,10 @@ int load_idx(const char *path, const unsigned int hashsz, void *idx_map,
 		 * variable sized table containing 8-byte entries
 		 * for offsets larger than 2^31.
 		 */
-		unsigned long min_size = 8 + 4*256 + nr*(hashsz + 4 + 4) + hashsz + hashsz;
+		unsigned long min_size = 8 + 4*256 + (size_t)nr*(hashsz + 4 + 4) + hashsz + hashsz;
 		unsigned long max_size = min_size;
 		if (nr)
-			max_size += (nr - 1)*8;
+			max_size += ((size_t)nr - 1)*8;
 		if (idx_size < min_size || idx_size > max_size)
 			return error("wrong index v2 file size in %s", path);
 		if (idx_size != min_size &&
@@ -1933,14 +1933,14 @@ off_t nth_packed_object_offset(const struct packed_git *p, uint32_t n)
 	const unsigned int hashsz = the_hash_algo->rawsz;
 	index += 4 * 256;
 	if (p->index_version == 1) {
-		return ntohl(*((uint32_t *)(index + (hashsz + 4) * n)));
+		return ntohl(*((uint32_t *)(index + (hashsz + 4) * (size_t)n)));
 	} else {
 		uint32_t off;
-		index += 8 + p->num_objects * (hashsz + 4);
+		index += 8 + (size_t)p->num_objects * (hashsz + 4);
 		off = ntohl(*((uint32_t *)(index + 4 * n)));
 		if (!(off & 0x80000000))
 			return off;
-		index += p->num_objects * 4 + (off & 0x7fffffff) * 8;
+		index += (size_t)p->num_objects * 4 + (off & 0x7fffffff) * 8;
 		check_pack_index_ptr(p, index);
 		return get_be64(index);
 	}
-- 
2.29.2.705.g306f91dc4e



* [PATCH 2/5] use size_t to store pack .idx byte offsets
  2020-11-13  5:06 [PATCH 0/5] handling 4GB .idx files Jeff King
  2020-11-13  5:06 ` [PATCH 1/5] compute pack .idx byte offsets using size_t Jeff King
@ 2020-11-13  5:07 ` Jeff King
  2020-11-13  5:07 ` [PATCH 3/5] fsck: correctly compute checksums on idx files larger than 4GB Jeff King
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2020-11-13  5:07 UTC (permalink / raw)
  To: git

We sometimes store the offset into a pack .idx file as an "unsigned
long", but the mmap'd size of a pack .idx file can exceed 4GB. This is
sufficient on LP64 systems like Linux, but will be too small on LLP64
systems like Windows, where "unsigned long" is still only 32 bits. Let's
use size_t, which is a better type for an offset into a memory buffer.
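
For illustration, here is a minimal standalone check of the widths
involved (a sketch, not part of the patch): LP64 systems print 8/8,
while 64-bit Windows prints 4/8, so an offset past 4GB stored in an
"unsigned long" silently wraps there.

  #include <stdio.h>

  int main(void)
  {
          printf("sizeof(unsigned long) = %zu\n", sizeof(unsigned long));
          printf("sizeof(size_t)        = %zu\n", sizeof(size_t));
          return 0;
  }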

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/pack-redundant.c | 6 +++---
 packfile.c               | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/builtin/pack-redundant.c b/builtin/pack-redundant.c
index 178e3409b7..3e70f2a4c1 100644
--- a/builtin/pack-redundant.c
+++ b/builtin/pack-redundant.c
@@ -236,7 +236,7 @@ static struct pack_list * pack_list_difference(const struct pack_list *A,
 
 static void cmp_two_packs(struct pack_list *p1, struct pack_list *p2)
 {
-	unsigned long p1_off = 0, p2_off = 0, p1_step, p2_step;
+	size_t p1_off = 0, p2_off = 0, p1_step, p2_step;
 	const unsigned char *p1_base, *p2_base;
 	struct llist_item *p1_hint = NULL, *p2_hint = NULL;
 	const unsigned int hashsz = the_hash_algo->rawsz;
@@ -280,7 +280,7 @@ static void cmp_two_packs(struct pack_list *p1, struct pack_list *p2)
 static size_t sizeof_union(struct packed_git *p1, struct packed_git *p2)
 {
 	size_t ret = 0;
-	unsigned long p1_off = 0, p2_off = 0, p1_step, p2_step;
+	size_t p1_off = 0, p2_off = 0, p1_step, p2_step;
 	const unsigned char *p1_base, *p2_base;
 	const unsigned int hashsz = the_hash_algo->rawsz;
 
@@ -499,7 +499,7 @@ static void scan_alt_odb_packs(void)
 static struct pack_list * add_pack(struct packed_git *p)
 {
 	struct pack_list l;
-	unsigned long off = 0, step;
+	size_t off = 0, step;
 	const unsigned char *base;
 
 	if (!p->pack_local && !(alt_odb || verbose))
diff --git a/packfile.c b/packfile.c
index a72c2a261f..63fe9ee8be 100644
--- a/packfile.c
+++ b/packfile.c
@@ -164,8 +164,8 @@ int load_idx(const char *path, const unsigned int hashsz, void *idx_map,
 		 * variable sized table containing 8-byte entries
 		 * for offsets larger than 2^31.
 		 */
-		unsigned long min_size = 8 + 4*256 + (size_t)nr*(hashsz + 4 + 4) + hashsz + hashsz;
-		unsigned long max_size = min_size;
+		size_t min_size = 8 + 4*256 + (size_t)nr*(hashsz + 4 + 4) + hashsz + hashsz;
+		size_t max_size = min_size;
 		if (nr)
 			max_size += ((size_t)nr - 1)*8;
 		if (idx_size < min_size || idx_size > max_size)
-- 
2.29.2.705.g306f91dc4e



* [PATCH 3/5] fsck: correctly compute checksums on idx files larger than 4GB
  2020-11-13  5:06 [PATCH 0/5] handling 4GB .idx files Jeff King
  2020-11-13  5:06 ` [PATCH 1/5] compute pack .idx byte offsets using size_t Jeff King
  2020-11-13  5:07 ` [PATCH 2/5] use size_t to store pack .idx byte offsets Jeff King
@ 2020-11-13  5:07 ` Jeff King
  2020-11-13  5:07 ` [PATCH 4/5] block-sha1: take a size_t length parameter Jeff King
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2020-11-13  5:07 UTC (permalink / raw)
  To: git

When checking the trailing checksum hash of a .idx file, we pass the
whole buffer (minus the trailing hash) into a single call to
the_hash_algo->update_fn(), but we cast the length to an "unsigned int".
This comes from c4001d92be (Use off_t when we really mean a file offset.,
2007-03-06). That commit started storing the index_size variable as an
off_t, but our mozilla-sha1 implementation from the time was limited to
a smaller size. Presumably the cast was a way of annotating that we
expected .idx files to be small, and so we didn't need to loop (as we do
for arbitrarily-large .pack files). Though as an aside it was still
wrong, because the mozilla function actually took a signed int.

These days our hash-update functions are defined to take a size_t, so we
can pass the whole buffer in directly. On a .idx file larger than 4GB,
the cast actually truncates the length, so we hash only part of the
buffer and report a bogus checksum mismatch.

While we're here, though, let's drop the confusing off_t variable
entirely. We're getting the size not from the filesystem anyway, but
from p->index_size, which is a size_t. In fact, we can make the code a
bit more readable by dropping our local variable duplicating
p->index_size, and instead have one that stores the size of the actual
index data, minus the trailing hash.

Signed-off-by: Jeff King <peff@peff.net>
---
 pack-check.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/pack-check.c b/pack-check.c
index db3adf8781..4b089fe8ec 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -164,22 +164,22 @@ static int verify_packfile(struct repository *r,
 
 int verify_pack_index(struct packed_git *p)
 {
-	off_t index_size;
+	size_t len;
 	const unsigned char *index_base;
 	git_hash_ctx ctx;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	int err = 0;
 
 	if (open_pack_index(p))
 		return error("packfile %s index not opened", p->pack_name);
-	index_size = p->index_size;
 	index_base = p->index_data;
+	len = p->index_size - the_hash_algo->rawsz;
 
 	/* Verify SHA1 sum of the index file */
 	the_hash_algo->init_fn(&ctx);
-	the_hash_algo->update_fn(&ctx, index_base, (unsigned int)(index_size - the_hash_algo->rawsz));
+	the_hash_algo->update_fn(&ctx, index_base, len);
 	the_hash_algo->final_fn(hash, &ctx);
-	if (!hasheq(hash, index_base + index_size - the_hash_algo->rawsz))
+	if (!hasheq(hash, index_base + len))
 		err = error("Packfile index for %s hash mismatch",
 			    p->pack_name);
 	return err;
-- 
2.29.2.705.g306f91dc4e



* [PATCH 4/5] block-sha1: take a size_t length parameter
  2020-11-13  5:06 [PATCH 0/5] handling 4GB .idx files Jeff King
                   ` (2 preceding siblings ...)
  2020-11-13  5:07 ` [PATCH 3/5] fsck: correctly compute checksums on idx files larger than 4GB Jeff King
@ 2020-11-13  5:07 ` Jeff King
  2020-11-13  5:07 ` [PATCH 5/5] packfile: detect overflow in .idx file size checks Jeff King
  2020-11-15 14:43 ` [PATCH 0/5] handling 4GB .idx files Thomas Braun
  5 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2020-11-13  5:07 UTC (permalink / raw)
  To: git

The block-sha1 implementation takes an "unsigned long" for the length of
a buffer to hash, but our hash algorithm wrappers take a size_t, as do
other implementations we support like openssl or sha1dc. On many
systems, including Linux, these two are equivalent, but they are not on
Windows (where only a "long long" is 64 bits). As a result, passing
large chunks to a single the_hash_algo->update_fn() would produce wrong
answers there.

Note that we don't need to update any other sizes outside of the
function interface. We store the cumulative size in a "long long" (which
we must do since we hash things bigger than 4GB, like packfiles, even on
32-bit platforms). And internally, we break that size_t len down into
64-byte blocks to feed into the guts of the algorithm.

Signed-off-by: Jeff King <peff@peff.net>
---
 block-sha1/sha1.c | 2 +-
 block-sha1/sha1.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 22b125cf8c..8681031402 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -203,7 +203,7 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx)
 	ctx->H[4] = 0xc3d2e1f0;
 }
 
-void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, size_t len)
 {
 	unsigned int lenW = ctx->size & 63;
 
diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h
index 4df6747752..9fb0441b98 100644
--- a/block-sha1/sha1.h
+++ b/block-sha1/sha1.h
@@ -13,7 +13,7 @@ typedef struct {
 } blk_SHA_CTX;
 
 void blk_SHA1_Init(blk_SHA_CTX *ctx);
-void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, unsigned long len);
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, size_t len);
 void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx);
 
 #define platform_SHA_CTX	blk_SHA_CTX
-- 
2.29.2.705.g306f91dc4e



* [PATCH 5/5] packfile: detect overflow in .idx file size checks
  2020-11-13  5:06 [PATCH 0/5] handling 4GB .idx files Jeff King
                   ` (3 preceding siblings ...)
  2020-11-13  5:07 ` [PATCH 4/5] block-sha1: take a size_t length parameter Jeff King
@ 2020-11-13  5:07 ` Jeff King
  2020-11-13 11:02   ` Johannes Schindelin
  2020-11-15 14:43 ` [PATCH 0/5] handling 4GB .idx files Thomas Braun
  5 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2020-11-13  5:07 UTC (permalink / raw)
  To: git

In load_idx(), we check that the .idx file is sized appropriately for
the number of objects it claims to have. We recently fixed the case
where the number of objects caused our expected size to overflow a
32-bit unsigned int, and we switched to size_t.

On a 64-bit system, this is fine; our size_t covers any expected size.
On a 32-bit system, though, it won't. The file may claim to have 2^31
objects, which will overflow even a size_t.

This doesn't hurt us at all for a well-formed idx file. A 32-bit system
would already have failed to mmap such a file, since it would be too
big. But an .idx file which _claims_ to have 2^31 objects but is
actually much smaller would fool our check.

This is a broken file, and for the most part we don't care that much
what happens. But:

  - it's a little friendlier to notice up front "woah, this file is
    broken" than it is to get nonsense results

  - later access of the data assumes that the loading function
    sanity-checked that we have at least enough bytes for the regular
    object-id table. A malformed .idx file could lead to an
    out-of-bounds read.

So let's use our overflow-checking functions to make sure that we're not
fooled by a malformed file.
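
For readers unfamiliar with those helpers: st_add() and st_mult() are
small wrappers in git-compat-util.h that die() when a result would not
fit in a size_t. A checked multiply of that shape looks roughly like
the following sketch (illustrative only, not the verbatim
implementation):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* sketch of an overflow-checked size_t multiply in the spirit of
   * st_mult(); the real helper reports the failing operands via die() */
  static size_t checked_mult(size_t a, size_t b)
  {
          if (b && a > SIZE_MAX / b) {
                  fprintf(stderr, "size_t overflow: %zu * %zu\n", a, b);
                  exit(128);
          }
          return a * b;
  }

  int main(void)
  {
          /* fine with a 64-bit size_t; dies with a 32-bit one:
           * 2^31 claimed objects at 28 bytes per v2 index entry */
          printf("%zu\n", checked_mult((size_t)1 << 31, 20 + 4 + 4));
          return 0;
  }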

Signed-off-by: Jeff King <peff@peff.net>
---
 packfile.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/packfile.c b/packfile.c
index 63fe9ee8be..9702b1218b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -148,7 +148,7 @@ int load_idx(const char *path, const unsigned int hashsz, void *idx_map,
 		 *  - hash of the packfile
 		 *  - file checksum
 		 */
-		if (idx_size != 4 * 256 + (size_t)nr * (hashsz + 4) + hashsz + hashsz)
+		if (idx_size != st_add(4 * 256 + hashsz + hashsz, st_mult(nr, hashsz + 4)))
 			return error("wrong index v1 file size in %s", path);
 	} else if (version == 2) {
 		/*
@@ -164,10 +164,10 @@ int load_idx(const char *path, const unsigned int hashsz, void *idx_map,
 		 * variable sized table containing 8-byte entries
 		 * for offsets larger than 2^31.
 		 */
-		size_t min_size = 8 + 4*256 + (size_t)nr*(hashsz + 4 + 4) + hashsz + hashsz;
+		size_t min_size = st_add(8 + 4*256 + hashsz + hashsz, st_mult(nr, hashsz + 4 + 4));
 		size_t max_size = min_size;
 		if (nr)
-			max_size += ((size_t)nr - 1)*8;
+			max_size = st_add(max_size, st_mult(nr - 1, 8));
 		if (idx_size < min_size || idx_size > max_size)
 			return error("wrong index v2 file size in %s", path);
 		if (idx_size != min_size &&
-- 
2.29.2.705.g306f91dc4e


* Re: [PATCH 5/5] packfile: detect overflow in .idx file size checks
  2020-11-13  5:07 ` [PATCH 5/5] packfile: detect overflow in .idx file size checks Jeff King
@ 2020-11-13 11:02   ` Johannes Schindelin
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Schindelin @ 2020-11-13 11:02 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Hi Peff,

On Fri, 13 Nov 2020, Jeff King wrote:

> diff --git a/packfile.c b/packfile.c
> index 63fe9ee8be..9702b1218b 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -148,7 +148,7 @@ int load_idx(const char *path, const unsigned int hashsz, void *idx_map,
>  		 *  - hash of the packfile
>  		 *  - file checksum
>  		 */
> -		if (idx_size != 4 * 256 + (size_t)nr * (hashsz + 4) + hashsz + hashsz)
> +		if (idx_size != st_add(4 * 256 + hashsz + hashsz, st_mult(nr, hashsz + 4)))
>  			return error("wrong index v1 file size in %s", path);
>  	} else if (version == 2) {
>  		/*
> @@ -164,10 +164,10 @@ int load_idx(const char *path, const unsigned int hashsz, void *idx_map,
>  		 * variable sized table containing 8-byte entries
>  		 * for offsets larger than 2^31.
>  		 */
> -		size_t min_size = 8 + 4*256 + (size_t)nr*(hashsz + 4 + 4) + hashsz + hashsz;
> +		size_t min_size = st_add(8 + 4*256 + hashsz + hashsz, st_mult(nr, hashsz + 4 + 4));
>  		size_t max_size = min_size;
>  		if (nr)
> -			max_size += ((size_t)nr - 1)*8;
> +			max_size = st_add(max_size, st_mult(nr - 1, 8));

I wondered about these multiplications and whether we should use the
`st_*()` helpers, when reading 1/5. And I am glad I read on!

FWIW I like all five patches.

Thanks,
Dscho

>  		if (idx_size < min_size || idx_size > max_size)
>  			return error("wrong index v2 file size in %s", path);
>  		if (idx_size != min_size &&
> --
> 2.29.2.705.g306f91dc4e
>


* Re: [PATCH 0/5] handling 4GB .idx files
  2020-11-13  5:06 [PATCH 0/5] handling 4GB .idx files Jeff King
                   ` (4 preceding siblings ...)
  2020-11-13  5:07 ` [PATCH 5/5] packfile: detect overflow in .idx file size checks Jeff King
@ 2020-11-15 14:43 ` Thomas Braun
  2020-11-16  4:10   ` Jeff King
  5 siblings, 1 reply; 19+ messages in thread
From: Thomas Braun @ 2020-11-15 14:43 UTC (permalink / raw)
  To: Jeff King, git

On 13.11.2020 06:06, Jeff King wrote:
> I recently ran into a case where Git could not read the pack it had
> produced via running "git repack". The culprit turned out to be an .idx
> file which crossed the 4GB barrier (in bytes, not number of objects).
> This series fixes the problems I saw, along with similar ones I couldn't
> trigger in practice, and protects the .idx loading code against integer
> overflows that would fool the size checks.

Would it be feasible to have a test case for this large index case? It
would almost certainly need an EXPENSIVE tag, and might not even work on
Windows yet. But hopefully someday I'll find some more time to push
large object support on Windows forward, and this kind of test would
really help then.


* Re: [PATCH 0/5] handling 4GB .idx files
  2020-11-15 14:43 ` [PATCH 0/5] handling 4GB .idx files Thomas Braun
@ 2020-11-16  4:10   ` Jeff King
  2020-11-16 13:30     ` Derrick Stolee
  2020-11-30 22:57     ` Thomas Braun
  0 siblings, 2 replies; 19+ messages in thread
From: Jeff King @ 2020-11-16  4:10 UTC (permalink / raw)
  To: Thomas Braun; +Cc: git

On Sun, Nov 15, 2020 at 03:43:39PM +0100, Thomas Braun wrote:

> On 13.11.2020 06:06, Jeff King wrote:
> > I recently ran into a case where Git could not read the pack it had
> > produced via running "git repack". The culprit turned out to be an .idx
> > file which crossed the 4GB barrier (in bytes, not number of objects).
> > This series fixes the problems I saw, along with similar ones I couldn't
> > trigger in practice, and protects the .idx loading code against integer
> > overflows that would fool the size checks.
> 
> Would it be feasible to have a test case for this large index case? It
> would almost certainly need an EXPENSIVE tag, and might not even work on
> Windows yet. But hopefully someday I'll find some more time to push
> large object support on Windows forward, and this kind of test would
> really help then.

I think it would be a level beyond what we usually consider even for
EXPENSIVE. The cheapest I could come up with to generate the case is:

  perl -e '
	for (0..154_000_000) {
		print "blob\n";
		print "data <<EOF\n";
		print "$_\n";
		print "EOF\n";
	}
  ' |
  git fast-import

which took almost 13 minutes of CPU to run, and peaked around 15GB of
RAM (and takes about 6.7GB on disk).

In the resulting repo, the old code barfed on lookups:

  $ blob=$(echo 0 | git hash-object --stdin)
  $ git cat-file blob $blob
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  fatal: git cat-file 573541ac9702dd3969c9bc859d2b91ec1f7e6e56: bad file

whereas now it works:

  $ git cat-file blob $blob
  0

That's the most basic test I think you could do. More interesting is
looking at entries that are actually after the 4GB mark. That requires
dumping the whole index:

  final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
  git cat-file blob $final

That takes ~35s to run. Curiously, it also allocates 5GB of heap. For
some reason it decides to make an internal copy of the entries table. I
guess because it reads the file sequentially rather than mmap-ing it,
and 64-bit offsets in v2 idx files can't be resolved until we've read
the whole entry table (and it wants to output the entries in sha1
order).

The checksum bug requires running git-fsck on the repo. That's another 5
minutes of CPU (and even higher peak memory; I think we create a "struct
blob" for each one, and it seems to hit 20GB).

Hitting the other cases that I fixed but never triggered in practice
would need a repo about 4x as large. So figure an hour of CPU and 60GB
of RAM.

So I dunno. I wouldn't be opposed to codifying some of that in a script,
but I can't imagine anybody ever running it unless they were working on
this specific problem.

-Peff


* Re: [PATCH 0/5] handling 4GB .idx files
  2020-11-16  4:10   ` Jeff King
@ 2020-11-16 13:30     ` Derrick Stolee
  2020-11-16 23:49       ` Jeff King
  2020-11-30 22:57     ` Thomas Braun
  1 sibling, 1 reply; 19+ messages in thread
From: Derrick Stolee @ 2020-11-16 13:30 UTC (permalink / raw)
  To: Jeff King, Thomas Braun; +Cc: git

On 11/15/2020 11:10 PM, Jeff King wrote:
> On Sun, Nov 15, 2020 at 03:43:39PM +0100, Thomas Braun wrote:
> 
>> On 13.11.2020 06:06, Jeff King wrote:
>>> I recently ran into a case where Git could not read the pack it had
>>> produced via running "git repack". The culprit turned out to be an .idx
>>> file which crossed the 4GB barrier (in bytes, not number of objects).
>>> This series fixes the problems I saw, along with similar ones I couldn't
>>> trigger in practice, and protects the .idx loading code against integer
>>> overflows that would fool the size checks.
>>
>> Would it be feasible to have a test case for this large index case? It
>> would almost certainly need an EXPENSIVE tag, and might not even work on
>> Windows yet. But hopefully someday I'll find some more time to push
>> large object support on Windows forward, and this kind of test would
>> really help then.
> 
> I think it would be a level beyond what we usually consider even for
> EXPENSIVE. The cheapest I could come up with to generate the case is:

I agree that the cost of this test is more than I would expect for
EXPENSIVE.

>   perl -e '
> 	for (0..154_000_000) {
> 		print "blob\n";
> 		print "data <<EOF\n";
> 		print "$_\n";
> 		print "EOF\n";
> 	}
>   ' |
>   git fast-import
> 
> which took almost 13 minutes of CPU to run, and peaked around 15GB of
> RAM (and takes about 6.7GB on disk).

I was thinking that maybe the RAM requirements would be lower
if we batched the fast-import calls and then repacked, but then
the repack would probably be just as expensive.

> In the resulting repo, the old code barfed on lookups:
> 
>   $ blob=$(echo 0 | git hash-object --stdin)
>   $ git cat-file blob $blob
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   fatal: git cat-file 573541ac9702dd3969c9bc859d2b91ec1f7e6e56: bad file
> 
> whereas now it works:
> 
>   $ git cat-file blob $blob
>   0
> 
> That's the most basic test I think you could do. More interesting is
> looking at entries that are actually after the 4GB mark. That requires
> dumping the whole index:
> 
>   final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
>   git cat-file blob $final

Could you also (after running the test once) determine the largest
SHA-1, at least up to unique short-SHA? Then run something like

	git cat-file blob fffffe

Since your loop is hard-coded, you could even use the largest full
SHA-1.

Naturally, nothing short of a full .idx verification would be
completely sound, and we are already generating an enormous repo.

> So I dunno. I wouldn't be opposed to codifying some of that in a script,
> but I can't imagine anybody ever running it unless they were working on
> this specific problem.

It would be good to have this available somewhere in the codebase to
run whenever testing .idx changes. Perhaps create a new prerequisite
specifically for EXPENSIVE_IDX tests, triggered only by a GIT_TEST_*
environment variable?

It would be helpful to also write a multi-pack-index on top of this
.idx to ensure we can handle that case, too.

Thanks,
-Stolee


* Re: [PATCH 0/5] handling 4GB .idx files
  2020-11-16 13:30     ` Derrick Stolee
@ 2020-11-16 23:49       ` Jeff King
  0 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2020-11-16 23:49 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Thomas Braun, git

On Mon, Nov 16, 2020 at 08:30:34AM -0500, Derrick Stolee wrote:

> > which took almost 13 minutes of CPU to run, and peaked around 15GB of
> > RAM (and takes about 6.7GB on disk).
> 
> I was thinking that maybe the RAM requirements would be lower
> if we batched the fast-import calls and then repacked, but then
> the repack would probably be just as expensive.

I think it's even worse. Fast-import just holds enough data to create
the index (sha1, etc), but pack-objects is also holding data to support
the delta search, etc. A quick (well, quick to invoke, not to run):

   git show-index <.git/objects/pack/pack-*.idx |
   awk '{print $2}' |
   git pack-objects foo --all-progress

on the fast-import pack seems to cap out around 27GB.

I doubt you could do much better overall than fast-import in terms of
CPU. The trick is really that you need to have a matching content/sha1
pair for 154M objects, and that's where most of the time goes. If we
lied about what's in each object (just generating an index with sha1
...0001, ...0002, etc), we could go much faster. But it's a much less
interesting test then.

> > That's the most basic test I think you could do. More interesting is
> > looking at entries that are actually after the 4GB mark. That requires
> > dumping the whole index:
> > 
> >   final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
> >   git cat-file blob $final
> 
> Could you also (after running the test once) determine the largest
> SHA-1, at least up to unique short-SHA? Then run something like
> 
> 	git cat-file blob fffffe
> 
> Since your loop is hard-coded, you could even use the largest full
> SHA-1.

That $final is the highest sha1. We could hard-code it, yes (and the
resulting lookup via cat-file is quite fast; it's the linear index dump
that's slow). We'd need the matching sha256 version, too. But it's
really the generation of the data that's the main issue.

> Naturally, nothing short of a full .idx verification would be
> completely sound, and we are already generating an enormous repo.

Yep.

> > So I dunno. I wouldn't be opposed to codifying some of that in a script,
> > but I can't imagine anybody ever running it unless they were working on
> > this specific problem.
> 
> It would be good to have this available somewhere in the codebase to
> run whenever testing .idx changes. Perhaps create a new prerequisite
> specifically for EXPENSIVE_IDX tests, triggered only by a GIT_TEST_*
> environment variable?

My feeling is that anybody who's really interested in playing with this
topic can find this thread in the archive. I don't think they're really
any worse off there than with a bit-rotting script in the repo that
nobody ever runs.

But if somebody wants to write up a test script, I'm happy to review it.

> It would be helpful to also write a multi-pack-index on top of this
> .idx to ensure we can handle that case, too.

I did run "git multi-pack-index write" on the resulting repo, which
completed in a reasonable amount of time (maybe 30-60s). And then
confirmed that lookups in the midx work just fine.

-Peff


* Re: [PATCH 0/5] handling 4GB .idx files
  2020-11-16  4:10   ` Jeff King
  2020-11-16 13:30     ` Derrick Stolee
@ 2020-11-30 22:57     ` Thomas Braun
  2020-12-01 11:23       ` Jeff King
  1 sibling, 1 reply; 19+ messages in thread
From: Thomas Braun @ 2020-11-30 22:57 UTC (permalink / raw)
  To: Jeff King, Derrick Stolee; +Cc: git

> Jeff King <peff@peff.net> hat am 16.11.2020 05:10 geschrieben:

[...]

> So I dunno. I wouldn't be opposed to codifying some of that in 
> a script, but I can't imagine anybody ever running it unless they 
> were working on this specific problem.

Thanks for the pointers.

Below is what I came up with. It passes here. I've replaced awk with cut
from the original draft, and also moved the perl script out of the test
as I think the quoting is getting way too messy otherwise. And I've added
--no-dangling to git fsck as otherwise it takes forever to output the
obvious dangling blobs. The unpack limit is mostly for testing the test
itself with a smaller number of blobs. But I still think it is worthwhile
to force everything into a pack.

--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -97,4 +97,34 @@ test_expect_success 'index version config precedence' '
 	test_index_version 0 true 2 2
 '
 
+{
+	echo "#!$SHELL_PATH"
+	cat <<'EOF'
+	   "$PERL_PATH" -e '
+		for (0..154_000_000) {
+			print "blob\n";
+			print "data <<EOF\n";
+			print "$_\n";
+			print "EOF\n";
+		} '
+EOF
+
+} >dump
+chmod +x dump
+
+test_expect_success EXPENSIVE,PERL 'Test 4GB boundary for the index' '
+	test_config fastimport.unpacklimit 0 &&
+	./dump | git fast-import &&
+	blob=$(echo 0 | git hash-object --stdin) &&
+	git cat-file blob $blob >actual &&
+	echo 0 >expect &&
+	test_cmp expect actual &&
+	idx_pack=$(ls .git/objects/pack/*.idx) &&
+	test_file_not_empty $idx_pack &&
+	final=$(git show-index <$idx_pack | tail -1 | cut -d " " -f2) &&
+	git cat-file blob $final &&
+	git cat-file blob fffffff &&
+	git fsck --strict --no-dangling
+'
+
 test_done
--


* Re: [PATCH 0/5] handling 4GB .idx files
  2020-11-30 22:57     ` Thomas Braun
@ 2020-12-01 11:23       ` Jeff King
  2020-12-01 11:39         ` t7900's new expensive test Jeff King
  2020-12-01 18:27         ` [PATCH 0/5] handling 4GB .idx files Taylor Blau
  0 siblings, 2 replies; 19+ messages in thread
From: Jeff King @ 2020-12-01 11:23 UTC (permalink / raw)
  To: Thomas Braun; +Cc: Derrick Stolee, git

On Mon, Nov 30, 2020 at 11:57:27PM +0100, Thomas Braun wrote:

> Below is what I came up with. It passes here. I've replaced awk with
> cut from the original draft, and also moved the perl script out of the
> test as I think the quoting is getting way too messy otherwise. And
> I've added --no-dangling to git fsck as otherwise it takes forever to
> output the obvious dangling blobs. The unpack limit is mostly for
> testing the test itself with a smaller number of blobs. But I still
> think it is worthwhile to force everything into a pack.

I think you can get rid of some of the quoting by using perl directly as
the interpreter, rather than a shell script that only invokes it with
-e. See below.

> --- a/t/t1600-index.sh
> +++ b/t/t1600-index.sh

I don't think this should go in t1600; that's about testing the
.git/index file, not a pack .idx. Probably t5302 would be more
appropriate.

> @@ -97,4 +97,34 @@ test_expect_success 'index version config precedence' '
>  	test_index_version 0 true 2 2
>  '
>  
> +{
> +	echo "#!$SHELL_PATH"
> +	cat <<'EOF'
> +	   "$PERL_PATH" -e '
> +		for (0..154_000_000) {
> +			print "blob\n";
> +			print "data <<EOF\n";
> +			print "$_\n";
> +			print "EOF\n";
> +		} '
> +EOF
> +
> +} >dump
> +chmod +x dump

You can simplify this a bit with write_script, as well. And we do prefer
to put this stuff in a test block, so verbosity, etc, is handled
correctly.

I didn't let it run to completion, but something like this seems to
work:

diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 6d83aaf8a4..a4c1dc0f0a 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -97,23 +97,16 @@ test_expect_success 'index version config precedence' '
 	test_index_version 0 true 2 2
 '
 
-{
-	echo "#!$SHELL_PATH"
-	cat <<'EOF'
-	   "$PERL_PATH" -e '
-		for (0..154_000_000) {
-			print "blob\n";
-			print "data <<EOF\n";
-			print "$_\n";
-			print "EOF\n";
-		} '
-EOF
-
-} >dump
-chmod +x dump
-
 test_expect_success EXPENSIVE,PERL 'Test 4GB boundary for the index' '
 	test_config fastimport.unpacklimit 0 &&
+	write_script dump "$PERL_PATH" <<-\EOF &&
+	for (0..154_000_000) {
+		print "blob\n";
+		print "data <<EOF\n";
+		print "$_\n";
+		print "EOF\n";
+	}
+	EOF
 	./dump | git fast-import &&
 	blob=$(echo 0 | git hash-object --stdin) &&
 	git cat-file blob $blob >actual &&

> +test_expect_success EXPENSIVE,PERL 'Test 4GB boundary for the index' '

You can drop the PERL prereq. Even without it set, we assume that we can
do basic perl one-liners that would work even in old versions of perl.

I'm not sure if EXPENSIVE is the right ballpark, or if we'd want a
VERY_EXPENSIVE. On my machine, the whole test suite for v2.29.0 takes 64
seconds to run, and setting GIT_TEST_LONG=1 bumps that to 103s. It got a
bit worse since then, as t7900 adds an EXPENSIVE test that takes ~200s
(it's not strictly additive, since we can work in parallel on other
tests for the first bit, but still, yuck).

So we're looking at 2-3x to run the expensive tests now. This new one
would be 20x or more. I'm not sure if anybody would care or not (i.e.,
whether anyone actually runs the whole suite with this flag). I thought
we did for some CI job, but it looks like it's just the one-off in
t5608.

> +	git cat-file blob $final &&
> +	git cat-file blob fffffff &&

This final cat-file may be a problem when tested with SHA-256. You are
relying on the fact that there is exactly one object that matches seven
f's as its prefix. That may be true for SHA-1, but if so it's mostly
luck.  Seven hex digits is only 28 bits, which is ~260M. For 154M
objects, we'd expect an average of 0.57 objects per 7-digit prefix. So I
wouldn't be at all surprised if there are two of them for SHA-256.

I'm also not sure what it's testing that the $final one isn't.

-Peff


* t7900's new expensive test
  2020-12-01 11:23       ` Jeff King
@ 2020-12-01 11:39         ` Jeff King
  2020-12-01 20:55           ` Derrick Stolee
  2020-12-01 18:27         ` [PATCH 0/5] handling 4GB .idx files Taylor Blau
  1 sibling, 1 reply; 19+ messages in thread
From: Jeff King @ 2020-12-01 11:39 UTC (permalink / raw)
  To: Thomas Braun; +Cc: Derrick Stolee, git

On Tue, Dec 01, 2020 at 06:23:28AM -0500, Jeff King wrote:

> I'm not sure if EXPENSIVE is the right ballpark, or if we'd want a
> VERY_EXPENSIVE. On my machine, the whole test suite for v2.29.0 takes 64
> seconds to run, and setting GIT_TEST_LONG=1 bumps that to 103s. It got a
> bit worse since then, as t7900 adds an EXPENSIVE test that takes ~200s
> (it's not strictly additive, since we can work in parallel on other
> tests for the first bit, but still, yuck).

Since Stolee is on the cc and has already seen me complaining about his
test, I guess I should expand a bit. ;)

There are some small wins possible (e.g., using "commit --quiet" seems
to shave off ~8s when we don't even think about writing a diff), but
fundamentally the issue is that it just takes a long time to "git add"
the 5.2GB worth of random data. I almost wonder if it would be worth it
to hard-code the known sha1 and sha256 names of the blobs, and write
them straight into the appropriate loose object file. I guess that is
tricky, though, because it actually needs to be a zlib stream, not just
the output of "test-tool genrandom".

Though speaking of which, another easy win might be setting
core.compression to "0". We know the random data won't compress anyway,
so there's no point in spending cycles on zlib.

Doing this:

diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index d9e68bb2bf..849c6d1361 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -239,6 +239,8 @@ test_expect_success 'incremental-repack task' '
 '
 
 test_expect_success EXPENSIVE 'incremental-repack 2g limit' '
+	test_config core.compression 0 &&
+
 	for i in $(test_seq 1 5)
 	do
 		test-tool genrandom foo$i $((512 * 1024 * 1024 + 1)) >>big ||
@@ -257,7 +259,7 @@ test_expect_success EXPENSIVE 'incremental-repack 2g limit' '
 		return 1
 	done &&
 	git add big &&
-	git commit -m "Add big file (2)" &&
+	git commit -qm "Add big file (2)" &&
 
 	# ensure any possible loose objects are in a pack-file
 	git maintenance run --task=loose-objects &&

seems to shave off ~140s from the test. I think we could get a little
more by cleaning up the enormous objects, too (they end up causing the
subsequent test to run slower, too, though perhaps it was intentional to
impact downstream tests).

-Peff


* Re: [PATCH 0/5] handling 4GB .idx files
  2020-12-01 11:23       ` Jeff King
  2020-12-01 11:39         ` t7900's new expensive test Jeff King
@ 2020-12-01 18:27         ` Taylor Blau
  2020-12-02 13:12           ` Jeff King
  1 sibling, 1 reply; 19+ messages in thread
From: Taylor Blau @ 2020-12-01 18:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Braun, Derrick Stolee, git

On Tue, Dec 01, 2020 at 06:23:28AM -0500, Jeff King wrote:
> I'm not sure if EXPENSIVE is the right ballpark, or if we'd want a
> VERY_EXPENSIVE. On my machine, the whole test suite for v2.29.0 takes 64
> seconds to run, and setting GIT_TEST_LONG=1 bumps that to 103s. It got a
> bit worse since then, as t7900 adds an EXPENSIVE test that takes ~200s
> (it's not strictly additive, since we can work in parallel on other
> tests for the first bit, but still, yuck).
>
> So we're looking at 2-3x to run the expensive tests now. This new one
> would be 20x or more. I'm not sure if anybody would care or not (i.e.,
> whether anyone actually runs the whole suite with this flag). I thought
> we did for some CI job, but it looks like it's just the one-off in
> t5608.

I had written something similar yesterday before mutt crashed and I
decided to stop work for the day.

I have a sense that probably very few people actually run GIT_TEST_LONG
regularly, and that that group may vanish entirely if we added a test
which increased the runtime of the suite by 20x in this mode.

I have mixed feelings about VERY_EXPENSIVE. On one hand, having this
test checked in so that we can quickly refer back to it in the case of a
regression is useful. On the other hand, what is it worth to have this
in-tree if nobody ever runs it? I'm speculating about whether or not
people would run this, of course.

My hunch is that anybody who is interested enough to fix regressions in
this area would be able to refer back to the list archive to dig up this
thread and recover the script.

I don't feel strongly, really, but just noting some light objections to
checking this test into the suite.

Thanks,
Taylor


* Re: t7900's new expensive test
  2020-12-01 11:39         ` t7900's new expensive test Jeff King
@ 2020-12-01 20:55           ` Derrick Stolee
  2020-12-02  2:47             ` [PATCH] t7900: speed up " Jeff King
  0 siblings, 1 reply; 19+ messages in thread
From: Derrick Stolee @ 2020-12-01 20:55 UTC (permalink / raw)
  To: Jeff King, Thomas Braun; +Cc: Derrick Stolee, git

On 12/1/2020 6:39 AM, Jeff King wrote:
> On Tue, Dec 01, 2020 at 06:23:28AM -0500, Jeff King wrote:
> 
>> I'm not sure if EXPENSIVE is the right ballpark, or if we'd want a
>> VERY_EXPENSIVE. On my machine, the whole test suite for v2.29.0 takes 64
>> seconds to run, and setting GIT_TEST_LONG=1 bumps that to 103s. It got a
>> bit worse since then, as t7900 adds an EXPENSIVE test that takes ~200s
>> (it's not strictly additive, since we can work in parallel on other
>> tests for the first bit, but still, yuck).
> 
> Since Stolee is on the cc and has already seen me complaining about his
> test, I guess I should expand a bit. ;)

Ha. I apologize for causing pain here. My thought was that GIT_TEST_LONG=1
was only used by someone really willing to wait, or someone specifically
trying to investigate a problem that only triggers on very large cases.

In that sense, it's not so much intended as a frequently-run regression
test, but a "run this if you are messing with this area" kind of thing.
Perhaps there is a different pattern to use here?

> There are some small wins possible (e.g., using "commit --quiet" seems
> to shave off ~8s when we don't even think about writing a diff), but
> fundamentally the issue is that it just takes a long time to "git add"
> the 5.2GB worth of random data. I almost wonder if it would be worth it
> to hard-code the known sha1 and sha256 names of the blobs, and write
> them straight into the appropriate loose object file. I guess that is
> tricky, though, because it actually needs to be a zlib stream, not just
> the output of "test-tool genrandom".
>
> Though speaking of which, another easy win might be setting
> core.compression to "0". We know the random data won't compress anyway,
> so there's no point in spending cycles on zlib.

The intention is mostly to expand the data beyond two gigabytes, so
dropping compression to get there seems like a good idea. If we are
not compressing at all, then perhaps we can reliably cut ourselves
closer to the 2GB limit instead of overshooting as a precaution.
 
> Doing this:
> 
> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index d9e68bb2bf..849c6d1361 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -239,6 +239,8 @@ test_expect_success 'incremental-repack task' '
>  '
>  
>  test_expect_success EXPENSIVE 'incremental-repack 2g limit' '
> +	test_config core.compression 0 &&
> +
>  	for i in $(test_seq 1 5)
>  	do
>  		test-tool genrandom foo$i $((512 * 1024 * 1024 + 1)) >>big ||
> @@ -257,7 +259,7 @@ test_expect_success EXPENSIVE 'incremental-repack 2g limit' '
>  		return 1
>  	done &&
>  	git add big &&
> -	git commit -m "Add big file (2)" &&
> +	git commit -qm "Add big file (2)" &&
>  
>  	# ensure any possible loose objects are in a pack-file
>  	git maintenance run --task=loose-objects &&
> 
> seems to shave off ~140s from the test. I think we could get a little
> more by cleaning up the enormous objects, too (they end up causing the
> subsequent test to run slower, too, though perhaps it was intentional to
> impact downstream tests).

Cutting 70% out seems like a great idea. I don't think it was super
intentional to slow down those tests.

Thanks,
-Stolee



* [PATCH] t7900: speed up expensive test
  2020-12-01 20:55           ` Derrick Stolee
@ 2020-12-02  2:47             ` Jeff King
  2020-12-03 15:23               ` Derrick Stolee
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2020-12-02  2:47 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Thomas Braun, Derrick Stolee, git

On Tue, Dec 01, 2020 at 03:55:00PM -0500, Derrick Stolee wrote:

> > Since Stolee is on the cc and has already seen me complaining about his
> > test, I guess I should expand a bit. ;)
> 
> Ha. I apologize for causing pain here. My thought was that GIT_TEST_LONG=1
> was only used by someone really willing to wait, or someone specifically
> trying to investigate a problem that only triggers on very large cases.
> 
> In that sense, it's not so much intended as a frequently-run regression
> test, but a "run this if you are messing with this area" kind of thing.
> Perhaps there is a different pattern to use here?

No, I think your interpretation is pretty reasonable. I definitely do
not run with GIT_TEST_LONG normally, but was only poking at it because
of the other discussion.

> > Though speaking of which, another easy win might be setting
> > core.compression to "0". We know the random data won't compress anyway,
> > so there's no point in spending cycles on zlib.
> 
> The intention is mostly to expand the data beyond two gigabytes, so
> dropping compression to get there seems like a good idea. If we are
> not compressing at all, then perhaps we can reliably cut ourselves
> closer to the 2GB limit instead of overshooting as a precaution.

Probably, though I think at best we could save 20% of the test cost. I
didn't play much with it.

> Cutting 70% out seems like a great idea. I don't think it was super
> intentional to slow down those tests.

Here it is as a patch (the numbers are slightly different this time
because I used my usual ramdisk, but the overall improvement is roughly
the same).

I didn't look into whether it should be cleaning up (test 16 takes
longer, too, because of the extra on-disk bytes). I also noticed while
running with "-v" that it complains of corrupted refs. I assume this is
leftover cruft from earlier tests, and I didn't dig into it. But it may
be something worth cleaning up for somebody more familiar with these
tests.

-- >8 --
Subject: [PATCH] t7900: speed up expensive test

A test marked with EXPENSIVE creates two 2.5GB files and adds them to
the repository. This takes 194s to run on my machine, versus 2s when the
EXPENSIVE prereq isn't set. We can trim this down a bit by doing two
things:

  - use "git commit --quiet" to avoid spending time generating a diff
    summary (this actually only helps for the second commit, but I've
    added it here to both for consistency). This shaves off 8s.

  - set core.compression to 0. We know these files are full of random
    bytes, and so won't compress (that's the point of the test!).
    Spending cycles on zlib is pointless. This shaves off 122s.

After this, my total time to run the script is 64s. That won't help
normal runs without GIT_TEST_LONG set, of course, but it's easy enough
to do.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t7900-maintenance.sh | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index d9e68bb2bf..d9a02df686 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -239,13 +239,15 @@ test_expect_success 'incremental-repack task' '
 '
 
 test_expect_success EXPENSIVE 'incremental-repack 2g limit' '
+	test_config core.compression 0 &&
+
 	for i in $(test_seq 1 5)
 	do
 		test-tool genrandom foo$i $((512 * 1024 * 1024 + 1)) >>big ||
 		return 1
 	done &&
 	git add big &&
-	git commit -m "Add big file (1)" &&
+	git commit -qm "Add big file (1)" &&
 
 	# ensure any possible loose objects are in a pack-file
 	git maintenance run --task=loose-objects &&
@@ -257,7 +259,7 @@ test_expect_success EXPENSIVE 'incremental-repack 2g limit' '
 		return 1
 	done &&
 	git add big &&
-	git commit -m "Add big file (2)" &&
+	git commit -qm "Add big file (2)" &&
 
 	# ensure any possible loose objects are in a pack-file
 	git maintenance run --task=loose-objects &&
-- 
2.29.2.894.g2dadb8c6b8



* Re: [PATCH 0/5] handling 4GB .idx files
  2020-12-01 18:27         ` [PATCH 0/5] handling 4GB .idx files Taylor Blau
@ 2020-12-02 13:12           ` Jeff King
  0 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2020-12-02 13:12 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Thomas Braun, Derrick Stolee, git

On Tue, Dec 01, 2020 at 01:27:26PM -0500, Taylor Blau wrote:

> I have a sense that probably very few people actually run GIT_TEST_LONG
> regularly, and that that group may vanish entirely if we added a test
> which increased the runtime of the suite by 20x in this mode.

Yeah, and if that's so, then I'm OK with calling it EXPENSIVE and moving
on with our lives. And I guess merging this is one way to find out, if
anybody screams. The stakes are low enough that I don't mind doing that.

> I have mixed feelings about VERY_EXPENSIVE. On one hand, having this
> test checked in so that we can quickly refer back to it in the case of a
> regression is useful. On the other hand, what is it worth to have this
> in-tree if nobody ever runs it? I'm speculating about whether or not
> people would run this, of course.
> 
> My hunch is that anybody who is interested enough to fix regressions in
> this area would be able to refer back to the list archive to dig up this
> thread and recover the script.
> 
> I don't feel strongly, really, but just noting some light objections to
> checking this test into the suite.

Yeah, that about matches my feelings. But if Thomas wants to wrap it up
as a patch, I don't mind seeing what happens (and I think with the
suggestions I gave earlier, it would be in good shape).

-Peff


* Re: [PATCH] t7900: speed up expensive test
  2020-12-02  2:47             ` [PATCH] t7900: speed up " Jeff King
@ 2020-12-03 15:23               ` Derrick Stolee
  0 siblings, 0 replies; 19+ messages in thread
From: Derrick Stolee @ 2020-12-03 15:23 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Braun, Derrick Stolee, git

On 12/1/2020 9:47 PM, Jeff King wrote:
> Subject: [PATCH] t7900: speed up expensive test
> 
> A test marked with EXPENSIVE creates two 2.5GB files and adds them to
> the repository. This takes 194s to run on my machine, versus 2s when the
> EXPENSIVE prereq isn't set. We can trim this down a bit by doing two
> things:
> 
>   - use "git commit --quiet" to avoid spending time generating a diff
>     summary (this actually only helps for the second commit, but I've
>     added it here to both for consistency). This shaves off 8s.
> 
>   - set core.compression to 0. We know these files are full of random
>     bytes, and so won't compress (that's the point of the test!).
>     Spending cycles on zlib is pointless. This shaves off 122s.
> 
> After this, my total time to run the script is 64s. That won't help
> normal runs without GIT_TEST_LONG set, of course, but it's easy enough
> to do.

I'm happy with these easy fixes to make the test faster without
changing any of the important behavior. Thanks!

> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  t/t7900-maintenance.sh | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index d9e68bb2bf..d9a02df686 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -239,13 +239,15 @@ test_expect_success 'incremental-repack task' '
>  '
>  
>  test_expect_success EXPENSIVE 'incremental-repack 2g limit' '
> +	test_config core.compression 0 &&
> +
>  	for i in $(test_seq 1 5)
>  	do
>  		test-tool genrandom foo$i $((512 * 1024 * 1024 + 1)) >>big ||
>  		return 1
>  	done &&
>  	git add big &&
> -	git commit -m "Add big file (1)" &&
> +	git commit -qm "Add big file (1)" &&
>  
>  	# ensure any possible loose objects are in a pack-file
>  	git maintenance run --task=loose-objects &&
> @@ -257,7 +259,7 @@ test_expect_success EXPENSIVE 'incremental-repack 2g limit' '
>  		return 1
>  	done &&
>  	git add big &&
> -	git commit -m "Add big file (2)" &&
> +	git commit -qm "Add big file (2)" &&
>  
>  	# ensure any possible loose objects are in a pack-file
>  	git maintenance run --task=loose-objects &&
> 



Thread overview: 19 messages
2020-11-13  5:06 [PATCH 0/5] handling 4GB .idx files Jeff King
2020-11-13  5:06 ` [PATCH 1/5] compute pack .idx byte offsets using size_t Jeff King
2020-11-13  5:07 ` [PATCH 2/5] use size_t to store pack .idx byte offsets Jeff King
2020-11-13  5:07 ` [PATCH 3/5] fsck: correctly compute checksums on idx files larger than 4GB Jeff King
2020-11-13  5:07 ` [PATCH 4/5] block-sha1: take a size_t length parameter Jeff King
2020-11-13  5:07 ` [PATCH 5/5] packfile: detect overflow in .idx file size checks Jeff King
2020-11-13 11:02   ` Johannes Schindelin
2020-11-15 14:43 ` [PATCH 0/5] handling 4GB .idx files Thomas Braun
2020-11-16  4:10   ` Jeff King
2020-11-16 13:30     ` Derrick Stolee
2020-11-16 23:49       ` Jeff King
2020-11-30 22:57     ` Thomas Braun
2020-12-01 11:23       ` Jeff King
2020-12-01 11:39         ` t7900's new expensive test Jeff King
2020-12-01 20:55           ` Derrick Stolee
2020-12-02  2:47             ` [PATCH] t7900: speed up " Jeff King
2020-12-03 15:23               ` Derrick Stolee
2020-12-01 18:27         ` [PATCH 0/5] handling 4GB .idx files Taylor Blau
2020-12-02 13:12           ` Jeff King
