git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/6] pack-bitmap: miscellaneous mmap read hardening
@ 2023-03-20 20:02 Taylor Blau
  2023-03-20 20:02 ` [PATCH 1/6] pack-bitmap.c: hide bitmap internals in `read_u8()` Taylor Blau
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Taylor Blau @ 2023-03-20 20:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty, Jeff King

This short series has a few minor refactorings and hardenings I noticed
while reading the pack-bitmap.c code recently.

The series is structured as follows:

  - The first three are cleanups to the read_be32() and read_u8()
    functions that could be taken on their own.

  - The last three replace manually reading from `bitmap_git->map` and
    adjusting `bitmap_git->map_pos` accordingly with a new
    `bitmap_index_seek()` wrapper that accomplishes the same with
    additional bounds- and overflow-checking.

Thanks in advance for your review.

Taylor Blau (6):
  pack-bitmap.c: hide bitmap internals in `read_u8()`
  pack-bitmap.c: hide bitmap internals in `read_be32()`
  pack-bitmap.c: drop unnecessary 'inline's
  pack-bitmap.c: factor out manual `map_pos` manipulation
  pack-bitmap.c: use `bitmap_index_seek()` where possible
  pack-bitmap.c: factor out `bitmap_index_seek_commit()`

 pack-bitmap.c | 84 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 59 insertions(+), 25 deletions(-)

-- 
2.40.0.77.gd564125b3f

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/6] pack-bitmap.c: hide bitmap internals in `read_u8()`
  2023-03-20 20:02 [PATCH 0/6] pack-bitmap: miscellaneous mmap read hardening Taylor Blau
@ 2023-03-20 20:02 ` Taylor Blau
  2023-03-21 17:35   ` Jeff King
  2023-03-20 20:02 ` [PATCH 2/6] pack-bitmap.c: hide bitmap internals in `read_be32()` Taylor Blau
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-03-20 20:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty, Jeff King

The `read_u8()` helper function internal to pack-bitmap.c was defined in
b5007211b6 (pack-bitmap: do not use gcc packed attribute, 2014-11-27).

Prior to b5007211b6, callers within pack-bitmap.c would read an
individual unsigned integer by doing something like:

    struct bitmap_disk_entry *e;

    e = (struct bitmap_disk_entry *)(index->map + index->map_pos);
    index->map_pos += sizeof(*e);

...which relied on the fact that the `bitmap_disk_entry` struct was
defined with `__attribute((packed))`, which b5007211b6 sought to get rid
of since the `__attribute__` flag is a noop on some compilers (which
makes the above code rely on the absence of padding to be correct).

So b5007211b6 got rid of the above convention and replaced it by reading
individual fields of that structure with a `read_u8()` helper that reads
from the region of memory pointed to by `->map`, and updates the
`->map_pos` pointer accordingly.

But this forces callers to be intimately aware of `bitmap_git->map` and
`bitmap_git->map_pos`. Instead, teach `read_u8()` to take a `struct
bitmap_index *` directly, and avoid having callers deal with the
internals themselves.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index ca7c81b5c9..d8ba252ba1 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -251,9 +251,9 @@ static inline uint32_t read_be32(const unsigned char *buffer, size_t *pos)
 	return result;
 }
 
-static inline uint8_t read_u8(const unsigned char *buffer, size_t *pos)
+static inline uint8_t read_u8(struct bitmap_index *bitmap_git)
 {
-	return buffer[(*pos)++];
+	return bitmap_git->map[bitmap_git->map_pos++];
 }
 
 #define MAX_XOR_OFFSET 160
@@ -283,8 +283,8 @@ static int load_bitmap_entries_v1(struct bitmap_index *index)
 			return error(_("corrupt ewah bitmap: truncated header for entry %d"), i);
 
 		commit_idx_pos = read_be32(index->map, &index->map_pos);
-		xor_offset = read_u8(index->map, &index->map_pos);
-		flags = read_u8(index->map, &index->map_pos);
+		xor_offset = read_u8(index);
+		flags = read_u8(index);
 
 		if (nth_bitmap_object_oid(index, &oid, commit_idx_pos) < 0)
 			return error(_("corrupt ewah bitmap: commit index %u out of range"),
@@ -780,7 +780,7 @@ static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_
 		}
 
 		bitmap_git->map_pos += sizeof(uint32_t) + sizeof(uint8_t);
-		xor_flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
+		xor_flags = read_u8(bitmap_git);
 		bitmap = read_bitmap_1(bitmap_git);
 
 		if (!bitmap)
@@ -821,7 +821,7 @@ static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_
 	 * ewah bitmap.
 	 */
 	bitmap_git->map_pos += sizeof(uint32_t) + sizeof(uint8_t);
-	flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
+	flags = read_u8(bitmap_git);
 	bitmap = read_bitmap_1(bitmap_git);
 
 	if (!bitmap)
-- 
2.40.0.77.gd564125b3f


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/6] pack-bitmap.c: hide bitmap internals in `read_be32()`
  2023-03-20 20:02 [PATCH 0/6] pack-bitmap: miscellaneous mmap read hardening Taylor Blau
  2023-03-20 20:02 ` [PATCH 1/6] pack-bitmap.c: hide bitmap internals in `read_u8()` Taylor Blau
@ 2023-03-20 20:02 ` Taylor Blau
  2023-03-20 20:02 ` [PATCH 3/6] pack-bitmap.c: drop unnecessary 'inline's Taylor Blau
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-03-20 20:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty, Jeff King

In a similar fashion as the previous commit, update `read_be32()` to
take a `struct bitmap_index *` instead of copies of and pointers to
variables within that structure.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index d8ba252ba1..794aaf5b02 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -244,10 +244,10 @@ static struct stored_bitmap *store_bitmap(struct bitmap_index *index,
 	return stored;
 }
 
-static inline uint32_t read_be32(const unsigned char *buffer, size_t *pos)
+static inline uint32_t read_be32(struct bitmap_index *bitmap_git)
 {
-	uint32_t result = get_be32(buffer + *pos);
-	(*pos) += sizeof(result);
+	uint32_t result = get_be32(bitmap_git->map + bitmap_git->map_pos);
+	bitmap_git->map_pos += sizeof(result);
 	return result;
 }
 
@@ -282,7 +282,7 @@ static int load_bitmap_entries_v1(struct bitmap_index *index)
 		if (index->map_size - index->map_pos < 6)
 			return error(_("corrupt ewah bitmap: truncated header for entry %d"), i);
 
-		commit_idx_pos = read_be32(index->map, &index->map_pos);
+		commit_idx_pos = read_be32(index);
 		xor_offset = read_u8(index);
 		flags = read_u8(index);
 
-- 
2.40.0.77.gd564125b3f


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/6] pack-bitmap.c: drop unnecessary 'inline's
  2023-03-20 20:02 [PATCH 0/6] pack-bitmap: miscellaneous mmap read hardening Taylor Blau
  2023-03-20 20:02 ` [PATCH 1/6] pack-bitmap.c: hide bitmap internals in `read_u8()` Taylor Blau
  2023-03-20 20:02 ` [PATCH 2/6] pack-bitmap.c: hide bitmap internals in `read_be32()` Taylor Blau
@ 2023-03-20 20:02 ` Taylor Blau
  2023-03-21 17:40   ` Jeff King
  2023-03-20 20:02 ` [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation Taylor Blau
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-03-20 20:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty, Jeff King

Both `read_be32()` and `read_u8()` are defined as inline dating back
to b5007211b6 (pack-bitmap: do not use gcc packed attribute,
2014-11-27), though that commit does not hint at why the functions were
defined with that attribute.

However (at least with GCC 12.2.0, at the time of writing), the
resulting pack-bitmap.o contains the same instructions with or without
the inline attribute applied to these functions:

    $ make O=3 pack-bitmap.o && mv pack-bitmap.o{,.before}
    [ apply this patch ]
    $ make O=3 pack-bitmap.o && mv pack-bitmap.o{,.after}
    $ objdump -d pack-bitmap.o.before >before
    $ objdump -d pack-bitmap.o.after >after
    $ diff -u pack-bitmap.o.{before,after}
    --- before	2023-03-15 18:54:17.021580095 -0400
    +++ after	2023-03-15 18:54:21.853552218 -0400
    @@ -1,5 +1,5 @@

    -pack-bitmap.o.before:     file format elf64-x86-64
    +pack-bitmap.o.after:     file format elf64-x86-64

     Disassembly of section .text:

So defining these functions as inline is at best a noop, and at worst
confuses the reader into thinking that there is some trickier reason
that they are defined as inline when there isn't.

Since this pair of functions does not need to be inlined, let's drop
that attribute from both `read_u8()` and `read_be32()`.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 794aaf5b02..1d12f90ff9 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -244,14 +244,14 @@ static struct stored_bitmap *store_bitmap(struct bitmap_index *index,
 	return stored;
 }
 
-static inline uint32_t read_be32(struct bitmap_index *bitmap_git)
+static uint32_t read_be32(struct bitmap_index *bitmap_git)
 {
 	uint32_t result = get_be32(bitmap_git->map + bitmap_git->map_pos);
 	bitmap_git->map_pos += sizeof(result);
 	return result;
 }
 
-static inline uint8_t read_u8(struct bitmap_index *bitmap_git)
+static uint8_t read_u8(struct bitmap_index *bitmap_git)
 {
 	return bitmap_git->map[bitmap_git->map_pos++];
 }
-- 
2.40.0.77.gd564125b3f


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation
  2023-03-20 20:02 [PATCH 0/6] pack-bitmap: miscellaneous mmap read hardening Taylor Blau
                   ` (2 preceding siblings ...)
  2023-03-20 20:02 ` [PATCH 3/6] pack-bitmap.c: drop unnecessary 'inline's Taylor Blau
@ 2023-03-20 20:02 ` Taylor Blau
  2023-03-21 17:56   ` Jeff King
  2023-03-20 20:02 ` [PATCH 5/6] pack-bitmap.c: use `bitmap_index_seek()` where possible Taylor Blau
  2023-03-20 20:02 ` [PATCH 6/6] pack-bitmap.c: factor out `bitmap_index_seek_commit()` Taylor Blau
  5 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-03-20 20:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty, Jeff King

The pack-bitmap internals use a combination of a `map` and `map_pos`
variable to keep a pointer to a memory mapped region of the `.bitmap`
file, as well as the position within that file, respectively.

Reads within the memory mapped region are meant to mimic file reads,
where each read adjusts the offset of the file descriptor accordingly.
This functionality is implemented by adjusting the `map_pos` variable
after each read.

Factor out a function similar to seek() which adjusts the `map_pos`
variable for us. Since the bitmap code only needs to adjust relative to
the beginning of the file as well as the current position, we do not
need to support any "whence" values outside of SEEK_SET and SEEK_CUR.

Extracting out this operation into a separate function allows us to add
some additional safety checks, such as ensuring that adding to `map_pos`
does not overflow `size_t`, and that the resulting `map_pos` value is
in bounds of the mapped region.

The subsequent commit will rewrite all manual manipulation of the
`map_pos` variable in terms of the new `bitmap_index_seek()`.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 1d12f90ff9..fabcf01c14 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -134,6 +134,28 @@ static struct ewah_bitmap *lookup_stored_bitmap(struct stored_bitmap *st)
 	return composed;
 }
 
+static size_t bitmap_index_seek(struct bitmap_index *bitmap_git, size_t offset,
+				int whence)
+{
+	switch (whence) {
+	case SEEK_SET:
+		bitmap_git->map_pos = offset;
+		break;
+	case SEEK_CUR:
+		bitmap_git->map_pos = st_add(bitmap_git->map_pos, offset);
+		break;
+	default:
+		BUG("unhandled seek whence: %d", whence);
+	}
+
+	if (bitmap_git->map_pos >= bitmap_git->map_size)
+		BUG("bitmap position exceeds size (%"PRIuMAX" >= %"PRIuMAX")",
+		    (uintmax_t)bitmap_git->map_pos,
+		    (uintmax_t)bitmap_git->map_size);
+
+	return bitmap_git->map_pos;
+}
+
 /*
  * Read a bitmap from the current read position on the mmaped
  * index, and increase the read position accordingly
-- 
2.40.0.77.gd564125b3f


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 5/6] pack-bitmap.c: use `bitmap_index_seek()` where possible
  2023-03-20 20:02 [PATCH 0/6] pack-bitmap: miscellaneous mmap read hardening Taylor Blau
                   ` (3 preceding siblings ...)
  2023-03-20 20:02 ` [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation Taylor Blau
@ 2023-03-20 20:02 ` Taylor Blau
  2023-03-21 18:05   ` Jeff King
  2023-03-20 20:02 ` [PATCH 6/6] pack-bitmap.c: factor out `bitmap_index_seek_commit()` Taylor Blau
  5 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-03-20 20:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty, Jeff King

As described in the previous commit, now that we have a functional
`bitmap_index_seek()`, rewrite all callers that manually manipulate the
`map_pos` variable with calls to `bitmap_index_seek()`.

This means that all callers that adjust the value of `map_pos` have
those changes automatically bounds- and overflow-checked.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index fabcf01c14..38a3c6a3f9 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -174,7 +174,7 @@ static struct ewah_bitmap *read_bitmap_1(struct bitmap_index *index)
 		return NULL;
 	}
 
-	index->map_pos += bitmap_size;
+	bitmap_index_seek(index, bitmap_size, SEEK_CUR);
 	return b;
 }
 
@@ -230,7 +230,7 @@ static int load_bitmap_header(struct bitmap_index *index)
 
 	index->entry_count = ntohl(header->entry_count);
 	index->checksum = header->checksum;
-	index->map_pos += header_size;
+	bitmap_index_seek(index, header_size, SEEK_CUR);
 	return 0;
 }
 
@@ -269,13 +269,15 @@ static struct stored_bitmap *store_bitmap(struct bitmap_index *index,
 static uint32_t read_be32(struct bitmap_index *bitmap_git)
 {
 	uint32_t result = get_be32(bitmap_git->map + bitmap_git->map_pos);
-	bitmap_git->map_pos += sizeof(result);
+	bitmap_index_seek(bitmap_git, sizeof(uint32_t), SEEK_CUR);
 	return result;
 }
 
 static uint8_t read_u8(struct bitmap_index *bitmap_git)
 {
-	return bitmap_git->map[bitmap_git->map_pos++];
+	uint8_t result = bitmap_git->map[bitmap_git->map_pos];
+	bitmap_index_seek(bitmap_git, sizeof(uint8_t), SEEK_CUR);
+	return result;
 }
 
 #define MAX_XOR_OFFSET 160
@@ -794,14 +796,16 @@ static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_
 
 	while (xor_items_nr) {
 		xor_item = &xor_items[xor_items_nr - 1];
-		bitmap_git->map_pos = xor_item->offset;
+		bitmap_index_seek(bitmap_git, xor_item->offset, SEEK_SET);
+
 		if (bitmap_git->map_size - bitmap_git->map_pos < bitmap_header_size) {
 			error(_("corrupt ewah bitmap: truncated header for bitmap of commit \"%s\""),
 				oid_to_hex(&xor_item->oid));
 			goto corrupt;
 		}
 
-		bitmap_git->map_pos += sizeof(uint32_t) + sizeof(uint8_t);
+		bitmap_index_seek(bitmap_git,
+				  sizeof(uint32_t) + sizeof(uint8_t), SEEK_CUR);
 		xor_flags = read_u8(bitmap_git);
 		bitmap = read_bitmap_1(bitmap_git);
 
@@ -812,7 +816,7 @@ static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_
 		xor_items_nr--;
 	}
 
-	bitmap_git->map_pos = offset;
+	bitmap_index_seek(bitmap_git, offset, SEEK_SET);
 	if (bitmap_git->map_size - bitmap_git->map_pos < bitmap_header_size) {
 		error(_("corrupt ewah bitmap: truncated header for bitmap of commit \"%s\""),
 			oid_to_hex(oid));
@@ -842,7 +846,8 @@ static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_
 	 * Instead, we can skip ahead and immediately read the flags and
 	 * ewah bitmap.
 	 */
-	bitmap_git->map_pos += sizeof(uint32_t) + sizeof(uint8_t);
+	bitmap_index_seek(bitmap_git, sizeof(uint32_t) + sizeof(uint8_t),
+			  SEEK_CUR);
 	flags = read_u8(bitmap_git);
 	bitmap = read_bitmap_1(bitmap_git);
 
-- 
2.40.0.77.gd564125b3f


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 6/6] pack-bitmap.c: factor out `bitmap_index_seek_commit()`
  2023-03-20 20:02 [PATCH 0/6] pack-bitmap: miscellaneous mmap read hardening Taylor Blau
                   ` (4 preceding siblings ...)
  2023-03-20 20:02 ` [PATCH 5/6] pack-bitmap.c: use `bitmap_index_seek()` where possible Taylor Blau
@ 2023-03-20 20:02 ` Taylor Blau
  2023-03-21 18:13   ` Jeff King
  5 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-03-20 20:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty, Jeff King

Factor out a common pattern within `lazy_bitmap_for_commit()` where we
seek to a given position (expecting to read the start of an individual
bitmap entry).

Both spots within `lazy_bitmap_for_commit()` emit a common error, so
factor out the whole routine into its own function to DRY things up a
little.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 38a3c6a3f9..9859f61a5a 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -156,6 +156,21 @@ static size_t bitmap_index_seek(struct bitmap_index *bitmap_git, size_t offset,
 	return bitmap_git->map_pos;
 }
 
+static int bitmap_index_seek_commit(struct bitmap_index *bitmap_git,
+				     struct object_id *oid,
+				     size_t pos)
+{
+	const int bitmap_header_size = 6;
+
+	bitmap_index_seek(bitmap_git, pos, SEEK_SET);
+
+	if (bitmap_git->map_size - bitmap_git->map_pos < bitmap_header_size)
+		return error(_("corrupt ewah bitmap: truncated header for "
+			       "bitmap of commit \"%s\""),
+			oid_to_hex(oid));
+	return 0;
+}
+
 /*
  * Read a bitmap from the current read position on the mmaped
  * index, and increase the read position accordingly
@@ -737,7 +752,6 @@ static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_
 	struct object_id *oid = &commit->object.oid;
 	struct ewah_bitmap *bitmap;
 	struct stored_bitmap *xor_bitmap = NULL;
-	const int bitmap_header_size = 6;
 	static struct bitmap_lookup_table_xor_item *xor_items = NULL;
 	static size_t xor_items_nr = 0, xor_items_alloc = 0;
 	static int is_corrupt = 0;
@@ -796,13 +810,10 @@ static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_
 
 	while (xor_items_nr) {
 		xor_item = &xor_items[xor_items_nr - 1];
-		bitmap_index_seek(bitmap_git, xor_item->offset, SEEK_SET);
 
-		if (bitmap_git->map_size - bitmap_git->map_pos < bitmap_header_size) {
-			error(_("corrupt ewah bitmap: truncated header for bitmap of commit \"%s\""),
-				oid_to_hex(&xor_item->oid));
+		if (bitmap_index_seek_commit(bitmap_git, &xor_item->oid,
+					     xor_item->offset) < 0)
 			goto corrupt;
-		}
 
 		bitmap_index_seek(bitmap_git,
 				  sizeof(uint32_t) + sizeof(uint8_t), SEEK_CUR);
@@ -816,12 +827,8 @@ static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_
 		xor_items_nr--;
 	}
 
-	bitmap_index_seek(bitmap_git, offset, SEEK_SET);
-	if (bitmap_git->map_size - bitmap_git->map_pos < bitmap_header_size) {
-		error(_("corrupt ewah bitmap: truncated header for bitmap of commit \"%s\""),
-			oid_to_hex(oid));
+	if (bitmap_index_seek_commit(bitmap_git, oid, offset) < 0)
 		goto corrupt;
-	}
 
 	/*
 	 * Don't bother reading the commit's index position or its xor
-- 
2.40.0.77.gd564125b3f

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/6] pack-bitmap.c: hide bitmap internals in `read_u8()`
  2023-03-20 20:02 ` [PATCH 1/6] pack-bitmap.c: hide bitmap internals in `read_u8()` Taylor Blau
@ 2023-03-21 17:35   ` Jeff King
  2023-03-24 17:52     ` Derrick Stolee
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff King @ 2023-03-21 17:35 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty

On Mon, Mar 20, 2023 at 04:02:40PM -0400, Taylor Blau wrote:

> So b5007211b6 got rid of the above convention and replaced it by reading
> individual fields of that structure with a `read_u8()` helper that reads
> from the region of memory pointed to by `->map`, and updates the
> `->map_pos` pointer accordingly.
> 
> But this forces callers to be intimately aware of `bitmap_git->map` and
> `bitmap_git->map_pos`. Instead, teach `read_u8()` to take a `struct
> bitmap_index *` directly, and avoid having callers deal with the
> internals themselves.

OK. I always felt like this read_u8() and read_be32() were trying to
match get_be32(), etc, just with an auto-incrementing "pos" pointer. And
this patch makes them a lot less generic.

But that is probably OK. They are static-local to the bitmap file, and
we have not found another caller who wanted them in the intervening
years. Arguably they could be given more descriptive names, like
read_bitmap_u8() or something, but again, being static-local to the
file, the generic names are fine.

-Peff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] pack-bitmap.c: drop unnecessary 'inline's
  2023-03-20 20:02 ` [PATCH 3/6] pack-bitmap.c: drop unnecessary 'inline's Taylor Blau
@ 2023-03-21 17:40   ` Jeff King
  0 siblings, 0 replies; 29+ messages in thread
From: Jeff King @ 2023-03-21 17:40 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty

On Mon, Mar 20, 2023 at 04:02:46PM -0400, Taylor Blau wrote:

> Both `read_be32()` and `read_u8()` are defined as inline dating back
> to b5007211b6 (pack-bitmap: do not use gcc packed attribute,
> 2014-11-27), though that commit does not hint at why the functions were
> defined with that attribute.

I think any non-header inline like this can implicitly be assumed to be
"I thought it would make things faster". ;)

> However (at least with GCC 12.2.0, at the time of writing), the
> resulting pack-bitmap.o contains the same instructions with or without
> the inline attribute applied to these functions:
> 
>     $ make O=3 pack-bitmap.o && mv pack-bitmap.o{,.before}
>     [ apply this patch ]
>     $ make O=3 pack-bitmap.o && mv pack-bitmap.o{,.after}
>     $ objdump -d pack-bitmap.o.before >before
>     $ objdump -d pack-bitmap.o.after >after
>     $ diff -u pack-bitmap.o.{before,after}
>     --- before	2023-03-15 18:54:17.021580095 -0400
>     +++ after	2023-03-15 18:54:21.853552218 -0400
>     @@ -1,5 +1,5 @@
> 
>     -pack-bitmap.o.before:     file format elf64-x86-64
>     +pack-bitmap.o.after:     file format elf64-x86-64
> 
>      Disassembly of section .text:
> 
> So defining these functions as inline is at best a noop, and at worst
> confuses the reader into thinking that there is some trickier reason
> that they are defined as inline when there isn't.

Nice digging. The "inline" is really just a hint to the compiler here,
and obviously it does not need that hint. I do wonder if that is still
true after you make them more complicated in a later patch in the
series.

On the other hand, I doubt that these need to be very optimized at all.
If there were a tight loop of single-byte reads, the function overhead
might be noticeable. But generally we're reading only a few items from
the beginning of each entry, and then reading the bulk of the data via
ewah_read_mmap().

So I think the overall argument is "let the compiler decide what is good
to inline and what is not".

-Peff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation
  2023-03-20 20:02 ` [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation Taylor Blau
@ 2023-03-21 17:56   ` Jeff King
  2023-03-24 18:04     ` Derrick Stolee
  2023-03-24 23:08     ` Taylor Blau
  0 siblings, 2 replies; 29+ messages in thread
From: Jeff King @ 2023-03-21 17:56 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty

On Mon, Mar 20, 2023 at 04:02:49PM -0400, Taylor Blau wrote:

> The pack-bitmap internals use a combination of a `map` and `map_pos`
> variable to keep a pointer to a memory mapped region of the `.bitmap`
> file, as well as the position within that file, respectively.
> 
> Reads within the memory mapped region are meant to mimic file reads,
> where each read adjusts the offset of the file descriptor accordingly.
> This functionality is implemented by adjusting the `map_pos` variable
> after each read.
> 
> Factor out a function similar to seek() which adjusts the `map_pos`
> variable for us. Since the bitmap code only needs to adjust relative to
> the beginning of the file as well as the current position, we do not
> need to support any "whence" values outside of SEEK_SET and SEEK_CUR.

I like the idea of centralizing the bounds checks here.

I do think copying lseek() is a little awkward, though. As you note, we
only care about SEEK_SET and SEEK_CUR, and we have to BUG() on anything
else. Which means we have a run-time check, rather than a compile time
check. If we had two functions like:

  void bitmap_index_seek_to(struct bitmap_index *bitmap_git, size_t pos)
  {
	bitmap_git->map_pos = pos;
	if (bitmap_git->map_pos >= bitmap_git->map_size)
	   ...etc...
  }

  void bitmap_index_incr(struct bitmap_index *bitmap_git, size_t offset)
  {
	bitmap_index_seek_to(bitmap_git, st_add(bitmap_git->map_pos, offset));
  }

then the compiler can see all cases directly. I dunno how much it
matters.

The other thing that's interesting from a bounds-checking perspective is
that you're checking the seek _after_ you've read an item. Which has two
implications:

  - I don't think we could ever overflow size_t here. We are working
    with a buffer that we got from mmap(), so it's already probably
    bounded to some much smaller subset of size_t. And even if we
    imagine that you really could get a buffer that stretches for the
    whole of the memory space, and that incrementing it by 4 bytes would
    overflow, we'd by definition have just overflowed the memory space
    itself by reading 4 bytes (and presumably segfaulted). So I really
    doubt this st_add() is doing anything.

  - The more interesting case is that we are not overflowing size_t, but
    simply walking past bitmap_git->map_size. But again, we are reading
    first and then seeking. So if our seek does go past, then by
    definition we just read garbage bytes, which is undefined behavior.

    For a bounds-check, wouldn't we want it the other way around? To
    make sure we have 4 bytes available, and then if so read them and
    increment the offset?

> +	if (bitmap_git->map_pos >= bitmap_git->map_size)
> +		BUG("bitmap position exceeds size (%"PRIuMAX" >= %"PRIuMAX")",
> +		    (uintmax_t)bitmap_git->map_pos,
> +		    (uintmax_t)bitmap_git->map_size);

This uses ">=", which is good, because it is valid to walk the pointer
to one past the end of the map, which is effectively EOF. But as above,
in that condition the callers should be checking for this EOF state
before reading the bytes.

-Peff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] pack-bitmap.c: use `bitmap_index_seek()` where possible
  2023-03-20 20:02 ` [PATCH 5/6] pack-bitmap.c: use `bitmap_index_seek()` where possible Taylor Blau
@ 2023-03-21 18:05   ` Jeff King
  2023-03-24 18:06     ` Derrick Stolee
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff King @ 2023-03-21 18:05 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty

On Mon, Mar 20, 2023 at 04:02:52PM -0400, Taylor Blau wrote:

> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -174,7 +174,7 @@ static struct ewah_bitmap *read_bitmap_1(struct bitmap_index *index)
>  		return NULL;
>  	}
>  
> -	index->map_pos += bitmap_size;
> +	bitmap_index_seek(index, bitmap_size, SEEK_CUR);
>  	return b;

As an aside, I notice none of the callers here or in the next patch
check the return value of bitmap_index_seek(). I guess you included it
to match the return value of lseek(), but maybe it is better to return
void if nobody is looking at it.

But getting back to the bounds-checking: I think we are already
correctly bounds-checked here, because ewah_read_mmap() will make sure
that it doesn't read too far (and will return an error if there's
truncation). And if it didn't, this check-on-seek doesn't help us,
because we'll already have done out-of-bounds reads.

> @@ -230,7 +230,7 @@ static int load_bitmap_header(struct bitmap_index *index)
>  
>  	index->entry_count = ntohl(header->entry_count);
>  	index->checksum = header->checksum;
> -	index->map_pos += header_size;
> +	bitmap_index_seek(index, header_size, SEEK_CUR);
>  	return 0;
>  }

Likewise this function already has bounds checks at the top:

	if (index->map_size < header_size + the_hash_algo->rawsz)
		return error(_("corrupted bitmap index (too small)"));

I'd be perfectly happy if we swapped that our for checking the bounds on
individual reads, but the extra checking in the seek step here just
seems redundant (and again, too late).

> @@ -269,13 +269,15 @@ static struct stored_bitmap *store_bitmap(struct bitmap_index *index,
>  static uint32_t read_be32(struct bitmap_index *bitmap_git)
>  {
>  	uint32_t result = get_be32(bitmap_git->map + bitmap_git->map_pos);
> -	bitmap_git->map_pos += sizeof(result);
> +	bitmap_index_seek(bitmap_git, sizeof(uint32_t), SEEK_CUR);
>  	return result;
>  }

The function doesn't do bounds-checks itself, but the callers do, like:

                if (index->map_size - index->map_pos < 6)
                        return error(_("corrupt ewah bitmap: truncated header for entry %d"), i);

                commit_idx_pos = read_be32(index->map, &index->map_pos);
                xor_offset = read_u8(index->map, &index->map_pos);
                flags = read_u8(index->map, &index->map_pos);

(and again, I'd be happy to see this magic "6" go away in favor of
checking as we read each item).

Maybe I'm misunderstanding the purpose of your series. I thought it was
to avoid reading out of bounds. But since bitmap_index_seek() triggers a
BUG(), it's not good for detecting truncated files, etc. So is it really
just meant to be a belt-and-suspenders check on the existing
bounds-checks? I guess that makes more sense with the way it is written,
but I'm just a little skeptical that it's really useful.

-Peff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/6] pack-bitmap.c: factor out `bitmap_index_seek_commit()`
  2023-03-20 20:02 ` [PATCH 6/6] pack-bitmap.c: factor out `bitmap_index_seek_commit()` Taylor Blau
@ 2023-03-21 18:13   ` Jeff King
  2023-03-21 18:16     ` Taylor Blau
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff King @ 2023-03-21 18:13 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty

On Mon, Mar 20, 2023 at 04:02:55PM -0400, Taylor Blau wrote:

> Factor out a common pattern within `lazy_bitmap_for_commit()` where we
> seek to a given position (expecting to read the start of an individual
> bitmap entry).
> 
> Both spots within `lazy_bitmap_for_commit()` emit a common error, so
> factor out the whole routine into its own function to DRY things up a
> little.

OK, so this case makes more sense to me as a bounds-check, because we
are seeking to an arbitrary offset.

But...

> +static int bitmap_index_seek_commit(struct bitmap_index *bitmap_git,
> +				     struct object_id *oid,
> +				     size_t pos)
> +{
> +	const int bitmap_header_size = 6;
> +
> +	bitmap_index_seek(bitmap_git, pos, SEEK_SET);
> +
> +	if (bitmap_git->map_size - bitmap_git->map_pos < bitmap_header_size)
> +		return error(_("corrupt ewah bitmap: truncated header for "
> +			       "bitmap of commit \"%s\""),
> +			oid_to_hex(oid));
> +	return 0;
> +}

So here we seek to the position, and then make sure we have 6 bytes to
read, which makes sense. But in the seek step, are we worried that we
will seek to somewhere bogus? If so, we'll get a BUG(). But is that the
right thing if somebody feeds a bogus "pos" that moves past truncation?

I'm not 100% sure on where these offsets come from. But it looks like
they're coming from the bitmap lookup table. In which case a bogus value
there should be an error(), and not a BUG(), I would think.

-Peff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/6] pack-bitmap.c: factor out `bitmap_index_seek_commit()`
  2023-03-21 18:13   ` Jeff King
@ 2023-03-21 18:16     ` Taylor Blau
  2023-03-21 18:27       ` Jeff King
  0 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-03-21 18:16 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty

On Tue, Mar 21, 2023 at 02:13:15PM -0400, Jeff King wrote:
> I'm not 100% sure on where these offsets come from. But it looks like
> they're coming from the bitmap lookup table. In which case a bogus value
> there should be an error(), and not a BUG(), I would think.

They do come from the lookup table, yes. I'm not sure that I agree that
bogus values here should be an error() or a BUG(), or if I even have a
strong preference between one and the other.

But I do think that trying to make it an error() makes it awkward for
all of the other callers that want it to be a BUG(), since the detail of
whether to call one or the other is private to bitmap_index_seek().

We *could* open-code it, introduce a variant of bitmap_index_seek(),
make it take an additional parameter specifying whether to call one over
the other, *or* check the bounds ourselves before even calling
bitmap_index_seek().

But none of those seem like great options to me, TBH. I would be just as
happy to leave this as a BUG(), to be honest.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/6] pack-bitmap.c: factor out `bitmap_index_seek_commit()`
  2023-03-21 18:16     ` Taylor Blau
@ 2023-03-21 18:27       ` Jeff King
  2023-03-24 18:09         ` Derrick Stolee
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff King @ 2023-03-21 18:27 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty

On Tue, Mar 21, 2023 at 02:16:40PM -0400, Taylor Blau wrote:

> On Tue, Mar 21, 2023 at 02:13:15PM -0400, Jeff King wrote:
> > I'm not 100% sure on where these offsets come from. But it looks like
> > they're coming from the bitmap lookup table. In which case a bogus value
> > there should be an error(), and not a BUG(), I would think.
> 
> They do come from the lookup table, yes. I'm not sure that I agree that
> bogus values here should be an error() or a BUG(), or if I even have a
> strong preference between one and the other.

The usual philosophy we've applied is: a BUG() should not be
trigger-able, even if Git is fed bad data. A BUG() should indicate an
error in the program logic, and if we see one, there should be a code
fix that handles the case.

Whereas if I understand this correctly, if I corrupt the bitmap file on
disk, we'd trigger this BUG().

In many cases I think one could argue that it's kind of academic. But in
this case we should be able to say "oops, the bitmap file seems corrupt"
and skip using it, rather than bailing completely from the process.

> But I do think that trying to make it an error() makes it awkward for
> all of the other callers that want it to be a BUG(), since the detail of
> whether to call one or the other is private to bitmap_index_seek().
>
> We *could* open-code it, introduce a variant of bitmap_index_seek(),
> make it take an additional parameter specifying whether to call one over
> the other, *or* check the bounds ourselves before even calling
> bitmap_index_seek().

I'm mostly unconvinced of the value of bitmap_index_seek() doing
checking at all, because it is too late in most of the cases. In fact it
is only in this case that it is doing something useful, which makes me
think that the check should be open-coded here.

-Peff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/6] pack-bitmap.c: hide bitmap internals in `read_u8()`
  2023-03-21 17:35   ` Jeff King
@ 2023-03-24 17:52     ` Derrick Stolee
  0 siblings, 0 replies; 29+ messages in thread
From: Derrick Stolee @ 2023-03-24 17:52 UTC (permalink / raw)
  To: Jeff King, Taylor Blau; +Cc: git, Junio C Hamano, Abhradeep Chakraborty

On 3/21/2023 1:35 PM, Jeff King wrote:
> On Mon, Mar 20, 2023 at 04:02:40PM -0400, Taylor Blau wrote:
> 
>> So b5007211b6 got rid of the above convention and replaced it by reading
>> individual fields of that structure with a `read_u8()` helper that reads
>> from the region of memory pointed to by `->map`, and updates the
>> `->map_pos` pointer accordingly.
>>
>> But this forces callers to be intimately aware of `bitmap_git->map` and
>> `bitmap_git->map_pos`. Instead, teach `read_u8()` to take a `struct
>> bitmap_index *` directly, and avoid having callers deal with the
>> internals themselves.
> 
> OK. I always felt like this read_u8() and read_be32() were trying to
> match get_be32(), etc, just with an auto-incrementing "pos" pointer. And
> this patch makes them a lot less generic.
> 
> But that is probably OK. They are static-local to the bitmap file, and
> we have not found another caller who wanted them in the intervening
> years. Arguably they could be given more descriptive names, like
> read_bitmap_u8() or something, but again, being static-local to the
> file, the generic names are fine.

Since the names don't match the "get" of get_be32(), we don't run the
risk of a future public get_be8() colliding with these static-local
methods, so I agree.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation
  2023-03-21 17:56   ` Jeff King
@ 2023-03-24 18:04     ` Derrick Stolee
  2023-03-24 18:29       ` Jeff King
  2023-03-24 23:13       ` Taylor Blau
  2023-03-24 23:08     ` Taylor Blau
  1 sibling, 2 replies; 29+ messages in thread
From: Derrick Stolee @ 2023-03-24 18:04 UTC (permalink / raw)
  To: Jeff King, Taylor Blau; +Cc: git, Junio C Hamano, Abhradeep Chakraborty

On 3/21/2023 1:56 PM, Jeff King wrote:
> On Mon, Mar 20, 2023 at 04:02:49PM -0400, Taylor Blau wrote:
> 
>> The pack-bitmap internals use a combination of a `map` and `map_pos`
>> variable to keep a pointer to a memory mapped region of the `.bitmap`
>> file, as well as the position within that file, respectively.
>>
>> Reads within the memory mapped region are meant to mimic file reads,
>> where each read adjusts the offset of the file descriptor accordingly.
>> This functionality is implemented by adjusting the `map_pos` variable
>> after each read.
>>
>> Factor out a function similar to seek() which adjusts the `map_pos`
>> variable for us. Since the bitmap code only needs to adjust relative to
>> the beginning of the file as well as the current position, we do not
>> need to support any "whence" values outside of SEEK_SET and SEEK_CUR.
> 
> I like the idea of centralizing the bounds checks here.
> 
> I do think copying lseek() is a little awkward, though. As you note, we
> only care about SEEK_SET and SEEK_CUR, and we have to BUG() on anything
> else. Which means we have a run-time check, rather than a compile time
> check. If we had two functions like:
> 
>   void bitmap_index_seek_to(struct bitmap_index *bitmap_git, size_t pos)
>   {
> 	bitmap_git->map_pos = pos;
> 	if (bitmap_git->map_pos >= bitmap_git->map_size)
> 	   ...etc...
>   }
> 
>   void bitmap_index_incr(struct bitmap_index *bitmap_git, size_t offset)
>   {
> 	bitmap_index_seek_to(bitmap_git, st_add(bitmap_git->map_pos, offset));
>   }
> 
> then the compiler can see all cases directly. I dunno how much it
> matters.

Whenever the compiler can help us, I'm usually in favor.

In this case, I'd call them bitmap_index_seek() and bitmap_index_increment(),
which should be clear enough.

The other alternative would be to use an enum instead of an arbitrary int.
The compiler can warn to some extent, but it's not as helpful as a method
name distinction.

>> +	if (bitmap_git->map_pos >= bitmap_git->map_size)
>> +		BUG("bitmap position exceeds size (%"PRIuMAX" >= %"PRIuMAX")",
>> +		    (uintmax_t)bitmap_git->map_pos,
>> +		    (uintmax_t)bitmap_git->map_size);
> 
> This uses ">=", which is good, because it is valid to walk the pointer
> to one past the end of the map, which is effectively EOF. But as above,
> in that condition the callers should be checking for this EOF state
> before reading the bytes.

In other words, it would be too easy for a strange data shape to trigger
this BUG() statement, which should be reserved for the most-extreme cases
of developer mistake. Things like "this is an unacceptable 'whence' value"
or "this should never be NULL" make sense, but this is too likely to be
hit due to a runtime error.

Would it make sense to return an 'int' instead of the size_t of map_pos?
That way we could return in error if this is exceeded, and then all
callers can respond "oh wait, that move would exceed the file size, so
I should fail in my own way..."?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] pack-bitmap.c: use `bitmap_index_seek()` where possible
  2023-03-21 18:05   ` Jeff King
@ 2023-03-24 18:06     ` Derrick Stolee
  2023-03-24 18:35       ` Jeff King
  0 siblings, 1 reply; 29+ messages in thread
From: Derrick Stolee @ 2023-03-24 18:06 UTC (permalink / raw)
  To: Jeff King, Taylor Blau; +Cc: git, Junio C Hamano, Abhradeep Chakraborty

On 3/21/2023 2:05 PM, Jeff King wrote:
> On Mon, Mar 20, 2023 at 04:02:52PM -0400, Taylor Blau wrote:
> 
>> --- a/pack-bitmap.c
>> +++ b/pack-bitmap.c
>> @@ -174,7 +174,7 @@ static struct ewah_bitmap *read_bitmap_1(struct bitmap_index *index)
>>  		return NULL;
>>  	}
>>  
>> -	index->map_pos += bitmap_size;
>> +	bitmap_index_seek(index, bitmap_size, SEEK_CUR);
>>  	return b;
> 
> As an aside, I notice none of the callers here or in the next patch
> check the return value of bitmap_index_seek(). I guess you included it
> to match the return value of lseek(), but maybe it is better to return
> void if nobody is looking at it.
> 
> But getting back to the bounds-checking: I think we are already
> correctly bounds-checked here, because ewah_read_mmap() will make sure
> that it doesn't read too far (and will return an error if there's
> truncation). And if it didn't, this check-on-seek doesn't help us,
> because we'll already have done out-of-bounds reads.

>> +	bitmap_index_seek(index, header_size, SEEK_CUR);
>>  	return 0;
>>  }
> 
> Likewise this function already has bounds checks at the top:
> 
> 	if (index->map_size < header_size + the_hash_algo->rawsz)
> 		return error(_("corrupted bitmap index (too small)"));
> 
> I'd be perfectly happy if we swapped that our for checking the bounds on
> individual reads, but the extra checking in the seek step here just
> seems redundant (and again, too late).

I think it would be nice to replace all of these custom bounds
checks with a check within bitmap_index_seek() and error conditions
done in response to an error code returned by that method. It keeps
the code more consistent in the potential future of changing the
amount to move the map_pos and the amount checked in these conditions.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/6] pack-bitmap.c: factor out `bitmap_index_seek_commit()`
  2023-03-21 18:27       ` Jeff King
@ 2023-03-24 18:09         ` Derrick Stolee
  0 siblings, 0 replies; 29+ messages in thread
From: Derrick Stolee @ 2023-03-24 18:09 UTC (permalink / raw)
  To: Jeff King, Taylor Blau; +Cc: git, Junio C Hamano, Abhradeep Chakraborty

On 3/21/2023 2:27 PM, Jeff King wrote:
> On Tue, Mar 21, 2023 at 02:16:40PM -0400, Taylor Blau wrote:
> 
>> On Tue, Mar 21, 2023 at 02:13:15PM -0400, Jeff King wrote:
>>> I'm not 100% sure on where these offsets come from. But it looks like
>>> they're coming from the bitmap lookup table. In which case a bogus value
>>> there should be an error(), and not a BUG(), I would think.
>>
>> They do come from the lookup table, yes. I'm not sure that I agree that
>> bogus values here should be an error() or a BUG(), or if I even have a
>> strong preference between one and the other.
> 
> The usual philosophy we've applied is: a BUG() should not be
> trigger-able, even if Git is fed bad data. A BUG() should indicate an
> error in the program logic, and if we see one, there should be a code
> fix that handles the case.
> 
> Whereas if I understand this correctly, if I corrupt the bitmap file on
> disk, we'd trigger this BUG().
> 
> In many cases I think one could argue that it's kind of academic. But in
> this case we should be able to say "oops, the bitmap file seems corrupt"
> and skip using it, rather than bailing completely from the process.

It's not just academic. BUG() statements kill the process without running
important cleanup steps like deleting open .lock files or outputting the
final traces. This can be especially problematic when we count on those
operations in order to recover a repository from such errors.
 
>> But I do think that trying to make it an error() makes it awkward for
>> all of the other callers that want it to be a BUG(), since the detail of
>> whether to call one or the other is private to bitmap_index_seek().
>>
>> We *could* open-code it, introduce a variant of bitmap_index_seek(),
>> make it take an additional parameter specifying whether to call one over
>> the other, *or* check the bounds ourselves before even calling
>> bitmap_index_seek().
> 
> I'm mostly unconvinced of the value of bitmap_index_seek() doing
> checking at all, because it is too late in most of the cases. In fact it
> is only in this case that it is doing something useful, which makes me
> think that the check should be open-coded here.

If we universally check whether bitmap_index_seek() works, then there
is value. It avoids the existing ad-hoc checks in favor of always-on
checks (as well as avoiding potential disconnects between the check
and the seeked position in the future).

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation
  2023-03-24 18:04     ` Derrick Stolee
@ 2023-03-24 18:29       ` Jeff King
  2023-03-24 23:23         ` Taylor Blau
  2023-03-24 23:13       ` Taylor Blau
  1 sibling, 1 reply; 29+ messages in thread
From: Jeff King @ 2023-03-24 18:29 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, Junio C Hamano, Abhradeep Chakraborty

On Fri, Mar 24, 2023 at 02:04:05PM -0400, Derrick Stolee wrote:

> >> +	if (bitmap_git->map_pos >= bitmap_git->map_size)
> >> +		BUG("bitmap position exceeds size (%"PRIuMAX" >= %"PRIuMAX")",
> >> +		    (uintmax_t)bitmap_git->map_pos,
> >> +		    (uintmax_t)bitmap_git->map_size);
> > 
> > This uses ">=", which is good, because it is valid to walk the pointer
> > to one past the end of the map, which is effectively EOF. But as above,
> > in that condition the callers should be checking for this EOF state
> > before reading the bytes.
> 
> In other words, it would be too easy for a strange data shape to trigger
> this BUG() statement, which should be reserved for the most-extreme cases
> of developer mistake. Things like "this is an unacceptable 'whence' value"
> or "this should never be NULL" make sense, but this is too likely to be
> hit due to a runtime error.

Sort of. AFAICT in all of the "increment" cases we'll already have done
bounds-checks, so this really is a BUG() condition. But in that case I
question whether it's even worthwhile. The calling code ends up being
something like:

  /* check that we have enough bytes */
  if (total - pos < 4)
	return error(...);

  /* read those bytes */
  get_be32() or whatever...

  /* now advance pos, making sure we...had enough bytes? */
  bitmap_index_seek(..., 4);

We know the advance will succeed because we checked ahead of time that
we had enough bytes. So it really is a BUG() if we don't, as it would
indicate somebody missed the earlier check. On the other hand, it is a
weird spot for an extra check, because by definition we'll have just
read off the array just before the seek.

> Would it make sense to return an 'int' instead of the size_t of map_pos?
> That way we could return in error if this is exceeded, and then all
> callers can respond "oh wait, that move would exceed the file size, so
> I should fail in my own way..."?

You can die() to avoid returning an error. But given that this is bitmap
code, and we can generally complete the operation even if it is broken
(albeit slower), usually we'd try to return the error up the call chain
(like bitmap_index_seek_commit() does later in the series). Plus that
follows our libification trend of not killing the process in low-level
code.

It does make the callers more complicated, though. If this were
_replacing_ the existing bounds-checks that might be worth it, but
coming after like this, I guess I just don't see it as adding much.

The case where we _do_ seek directly to a file-provided offset, rather
than incrementing, is an important check that this series adds, but that
one should be a die() and not a BUG().

-Peff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] pack-bitmap.c: use `bitmap_index_seek()` where possible
  2023-03-24 18:06     ` Derrick Stolee
@ 2023-03-24 18:35       ` Jeff King
  2023-03-24 19:43         ` Junio C Hamano
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff King @ 2023-03-24 18:35 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, Junio C Hamano, Abhradeep Chakraborty

On Fri, Mar 24, 2023 at 02:06:43PM -0400, Derrick Stolee wrote:

> >> +	bitmap_index_seek(index, header_size, SEEK_CUR);
> >>  	return 0;
> >>  }
> > 
> > Likewise this function already has bounds checks at the top:
> > 
> > 	if (index->map_size < header_size + the_hash_algo->rawsz)
> > 		return error(_("corrupted bitmap index (too small)"));
> > 
> > I'd be perfectly happy if we swapped that our for checking the bounds on
> > individual reads, but the extra checking in the seek step here just
> > seems redundant (and again, too late).
> 
> I think it would be nice to replace all of these custom bounds
> checks with a check within bitmap_index_seek() and error conditions
> done in response to an error code returned by that method. It keeps
> the code more consistent in the potential future of changing the
> amount to move the map_pos and the amount checked in these conditions.

Yeah, that's what I was getting at. But doing it at seek time is too
late. We'll have just read off the end of the array.

You really need an interface more like "make sure there are N bytes for
me to read at offset X". Then you can read and advance past them.

For individual items where you want to copy the bytes out anyway (like a
be32) you can have a nice interface like:

  if (read_be32(bitmap_git, &commit_idx_pos) < 0 ||
      read_u8(bitmap_git, &xor_offset) < 0 ||
      read_u8(bitmap_git, &flags) < 0)
	return error("truncated bitmap entry");

But given that there is only one spot that calls these, that kind of
refactoring might not be worth it (right now it just uses the magic
number "6" right before grabbing the data).

-Peff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] pack-bitmap.c: use `bitmap_index_seek()` where possible
  2023-03-24 18:35       ` Jeff King
@ 2023-03-24 19:43         ` Junio C Hamano
  2023-03-24 20:37           ` Jeff King
  0 siblings, 1 reply; 29+ messages in thread
From: Junio C Hamano @ 2023-03-24 19:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, Taylor Blau, git, Abhradeep Chakraborty

Jeff King <peff@peff.net> writes:

> On Fri, Mar 24, 2023 at 02:06:43PM -0400, Derrick Stolee wrote:
>
>> >> +	bitmap_index_seek(index, header_size, SEEK_CUR);
>> >>  	return 0;
>> >>  }
>> > 
>> > Likewise this function already has bounds checks at the top:
>> > 
>> > 	if (index->map_size < header_size + the_hash_algo->rawsz)
>> > 		return error(_("corrupted bitmap index (too small)"));
>> > 
>> > I'd be perfectly happy if we swapped that our for checking the bounds on
>> > individual reads, but the extra checking in the seek step here just
>> > seems redundant (and again, too late).
>> 
>> I think it would be nice to replace all of these custom bounds
>> checks with a check within bitmap_index_seek() and error conditions
>> done in response to an error code returned by that method. It keeps
>> the code more consistent in the potential future of changing the
>> amount to move the map_pos and the amount checked in these conditions.
>
> Yeah, that's what I was getting at. But doing it at seek time is too
> late. We'll have just read off the end of the array.

Yup.  You illustrated it nicely in your response for the previous
step of the series.  If the typical access pattern is to check, read
and then advance to the next position, and by the time you are ready
to advance to the next position, you'd better have done the checking.

> You really need an interface more like "make sure there are N bytes for
> me to read at offset X". Then you can read and advance past them.
>
> For individual items where you want to copy the bytes out anyway (like a
> be32) you can have a nice interface like:
>
>   if (read_be32(bitmap_git, &commit_idx_pos) < 0 ||
>       read_u8(bitmap_git, &xor_offset) < 0 ||
>       read_u8(bitmap_git, &flags) < 0)
> 	return error("truncated bitmap entry");

Yeah, that kind of flow reads really well.

> But given that there is only one spot that calls these, that kind of
> refactoring might not be worth it (right now it just uses the magic
> number "6" right before grabbing the data).

Yeah, it seems most of the callers with SEEK_SET are "I find the
next offset from a table and jump there in preparation for doing
something".  I suspect callers with SEEK_CUR would fit in the
read_X() pattern better?  From that angle, it smells that the two
kinds of seek functions may want to be split into two different
helpers.

Thanks.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] pack-bitmap.c: use `bitmap_index_seek()` where possible
  2023-03-24 19:43         ` Junio C Hamano
@ 2023-03-24 20:37           ` Jeff King
  2023-03-24 21:38             ` Junio C Hamano
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff King @ 2023-03-24 20:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Derrick Stolee, Taylor Blau, git, Abhradeep Chakraborty

On Fri, Mar 24, 2023 at 12:43:46PM -0700, Junio C Hamano wrote:

> > But given that there is only one spot that calls these, that kind of
> > refactoring might not be worth it (right now it just uses the magic
> > number "6" right before grabbing the data).
> 
> Yeah, it seems most of the callers with SEEK_SET are "I find the
> next offset from a table and jump there in preparation for doing
> something".  I suspect callers with SEEK_CUR would fit in the
> read_X() pattern better?  From that angle, it smells that the two
> kinds of seek functions may want to be split into two different
> helpers.

Yes, I think the SEEK_SET cases really do need to be doing more
checking. AFAICT they are blindly trusting the offsets in the file
(which is locally generated, so it's more of a corruption problem than a
security one, but still). And this series improves that, which is good
(but I still think it should be a die() and not a BUG()).

The SEEK_CUR cases in theory could all look like the nice read_be32() I
showed earlier, but I think in practice there are a lot of variants
(skipping read of index_pos, advancing past size given by
ewah_read_mmap(), and so on). And the current code, while ugly, does
give more specific error messages (e.g., telling on _which_ commit we
found the truncated data). So I dunno.

Certainly there could be more consistency in the magic numbers. E.g., in
this code:

                if (bitmap_git->map_size - bitmap_git->map_pos < bitmap_header_size) {
                        error(_("corrupt ewah bitmap: truncated header for bitmap of commit \"%s\""),
                                oid_to_hex(&xor_item->oid));
                        goto corrupt;
                }

                bitmap_git->map_pos += sizeof(uint32_t) + sizeof(uint8_t);
                xor_flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
                bitmap = read_bitmap_1(bitmap_git);

There is an assumption that sizeof(uint32_t) + sizeof(uint8_t) is equal
to bitmap_header_size - 1. That's not wrong, but it's hard to verify
that it's doing the right thing, and it's potentially fragile to changes
(though such changes seem unlikely).

-Peff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] pack-bitmap.c: use `bitmap_index_seek()` where possible
  2023-03-24 20:37           ` Jeff King
@ 2023-03-24 21:38             ` Junio C Hamano
  2023-03-24 22:57               ` Taylor Blau
  0 siblings, 1 reply; 29+ messages in thread
From: Junio C Hamano @ 2023-03-24 21:38 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, Taylor Blau, git, Abhradeep Chakraborty

Jeff King <peff@peff.net> writes:

> Yes, I think the SEEK_SET cases really do need to be doing more
> checking. AFAICT they are blindly trusting the offsets in the file
> (which is locally generated, so it's more of a corruption problem than a
> security one, but still). And this series improves that, which is good
> (but I still think it should be a die() and not a BUG()).

Yes, I think by mistake I merged the topic way too early than it has
been discussed sufficiently.  I haven't reverted the merge into 'next'
but it may not be a bad idea if the concensus is that the seek-like
whence interface is too ugly to live.  BUG() that triggers on data
errors should be updated to die(), whether we do it as a follow-on
patch or with a replacement iteration.

> Certainly there could be more consistency in the magic numbers. E.g., in
> this code:
>
>                 if (bitmap_git->map_size - bitmap_git->map_pos < bitmap_header_size) {
>                         error(_("corrupt ewah bitmap: truncated header for bitmap of commit \"%s\""),
>                                 oid_to_hex(&xor_item->oid));
>                         goto corrupt;
>                 }
>
>                 bitmap_git->map_pos += sizeof(uint32_t) + sizeof(uint8_t);
>                 xor_flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
>                 bitmap = read_bitmap_1(bitmap_git);
>
> There is an assumption that sizeof(uint32_t) + sizeof(uint8_t) is equal
> to bitmap_header_size - 1. That's not wrong, but it's hard to verify
> that it's doing the right thing, and it's potentially fragile to changes
> (though such changes seem unlikely).

Yeah, that was the part of the code I was looking at when I wrote
the message you are responding to X-<.

Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] pack-bitmap.c: use `bitmap_index_seek()` where possible
  2023-03-24 21:38             ` Junio C Hamano
@ 2023-03-24 22:57               ` Taylor Blau
  0 siblings, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-03-24 22:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Derrick Stolee, git, Abhradeep Chakraborty

On Fri, Mar 24, 2023 at 02:38:17PM -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > Yes, I think the SEEK_SET cases really do need to be doing more
> > checking. AFAICT they are blindly trusting the offsets in the file
> > (which is locally generated, so it's more of a corruption problem than a
> > security one, but still). And this series improves that, which is good
> > (but I still think it should be a die() and not a BUG()).
>
> Yes, I think by mistake I merged the topic way too early than it has
> been discussed sufficiently.  I haven't reverted the merge into 'next'
> but it may not be a bad idea if the concensus is that the seek-like
> whence interface is too ugly to live.  BUG() that triggers on data
> errors should be updated to die(), whether we do it as a follow-on
> patch or with a replacement iteration.

Yeah, I was a little surprised to see it merged down so quickly ;-).

It's fine with me if you want to hold it in 'next' while I send a
replacement. Otherwise, if you want to revert it out of 'next', that's
fine with me too.

I doubt it should take that long to reroll and address the concerns in
this thread.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation
  2023-03-21 17:56   ` Jeff King
  2023-03-24 18:04     ` Derrick Stolee
@ 2023-03-24 23:08     ` Taylor Blau
  1 sibling, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-03-24 23:08 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Derrick Stolee, Abhradeep Chakraborty

On Tue, Mar 21, 2023 at 01:56:12PM -0400, Jeff King wrote:
> I like the idea of centralizing the bounds checks here.
>
> I do think copying lseek() is a little awkward, though. As you note, we
> only care about SEEK_SET and SEEK_CUR, and we have to BUG() on anything
> else. Which means we have a run-time check, rather than a compile time
> check. If we had two functions like:
>
>   void bitmap_index_seek_to(struct bitmap_index *bitmap_git, size_t pos)
>   {
> 	bitmap_git->map_pos = pos;
> 	if (bitmap_git->map_pos >= bitmap_git->map_size)
> 	   ...etc...
>   }
>
>   void bitmap_index_incr(struct bitmap_index *bitmap_git, size_t offset)
>   {
> 	bitmap_index_seek_to(bitmap_git, st_add(bitmap_git->map_pos, offset));
>   }
>
> then the compiler can see all cases directly. I dunno how much it
> matters.

Yeah, I think that trying to copy the interface of lseek() was a
mistake, since it ended up creating more awkwardness than anything else.
I like the combination of bitmap_index_seek_to() and _incr(), though
though I called the latter bitmap_index_seek_set() instead.

We've got to be a little reminiscent of lseek() after all ;-).

> The other thing that's interesting from a bounds-checking perspective is
> that you're checking the seek _after_ you've read an item. Which has two
> implications:
>
>   - I don't think we could ever overflow size_t here.

Yep, agreed. Let's drop the st_add() call there entirely, since it's not
doing anything useful and just serving to confuse the reader.

>   - The more interesting case is that we are not overflowing size_t, but
>     simply walking past bitmap_git->map_size. But again, we are reading
>     first and then seeking. So if our seek does go past, then by
>     definition we just read garbage bytes, which is undefined behavior.
>
>     For a bounds-check, wouldn't we want it the other way around? To
>     make sure we have 4 bytes available, and then if so read them and
>     increment the offset?

I thought on first blush that this would be a much larger change, but
actually it's quite small. The EWAH code already checks its reads ahead
of time, so we don't have to worry about those. It's really that
read_be32() and read_u8() do the read-then-seek.

So I think that the change here is limited to making sure that we can
read enough bytes in those functions before actually executing the read.

> > +	if (bitmap_git->map_pos >= bitmap_git->map_size)
> > +		BUG("bitmap position exceeds size (%"PRIuMAX" >= %"PRIuMAX")",
> > +		    (uintmax_t)bitmap_git->map_pos,
> > +		    (uintmax_t)bitmap_git->map_size);
>
> This uses ">=", which is good, because it is valid to walk the pointer
> to one past the end of the map, which is effectively EOF. But as above,
> in that condition the callers should be checking for this EOF state
> before reading the bytes.

I think that having both is useful. Checking before you read avoids
undefined behavior. Checking after you seek ensures that we never have a
bitmap_index struct with a bogus map_pos value.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation
  2023-03-24 18:04     ` Derrick Stolee
  2023-03-24 18:29       ` Jeff King
@ 2023-03-24 23:13       ` Taylor Blau
  2023-03-24 23:24         ` Taylor Blau
  1 sibling, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-03-24 23:13 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Jeff King, git, Junio C Hamano, Abhradeep Chakraborty

On Fri, Mar 24, 2023 at 02:04:05PM -0400, Derrick Stolee wrote:
> The other alternative would be to use an enum instead of an arbitrary int.
> The compiler can warn to some extent, but it's not as helpful as a method
> name distinction.

Yeah, I think that another enum here just to distinguish between seeking
to an absolute position versus a relative one is probably overkill in
this instance.

> >> +	if (bitmap_git->map_pos >= bitmap_git->map_size)
> >> +		BUG("bitmap position exceeds size (%"PRIuMAX" >= %"PRIuMAX")",
> >> +		    (uintmax_t)bitmap_git->map_pos,
> >> +		    (uintmax_t)bitmap_git->map_size);
> >
> > This uses ">=", which is good, because it is valid to walk the pointer
> > to one past the end of the map, which is effectively EOF. But as above,
> > in that condition the callers should be checking for this EOF state
> > before reading the bytes.
>
> In other words, it would be too easy for a strange data shape to trigger
> this BUG() statement, which should be reserved for the most-extreme cases
> of developer mistake. Things like "this is an unacceptable 'whence' value"
> or "this should never be NULL" make sense, but this is too likely to be
> hit due to a runtime error.

> Would it make sense to return an 'int' instead of the size_t of map_pos?
> That way we could return in error if this is exceeded, and then all
> callers can respond "oh wait, that move would exceed the file size, so
> I should fail in my own way..."?

Works for me. I think bitmap_index_seek_to() would probably return the
error() itself, since I don't think it makes sense to require each
caller to come up with the same "bitmap position exceeds size" thing.

We want the message from that error() to appear regardless, but each
caller can decide what it wants to do in the presence of an error (e.g.,
continue on, propagate the error, abort the program, etc).

Something like this:

    static int bitmap_index_seek_to(struct bitmap_index *bitmap_git, size_t pos)
    {
      bitmap_git->map_pos = pos;

      if (bitmap_git->map_pos >= bitmap_git->map_size)
        return error(_("bitmap position exceeds size "
                       "(%"PRIuMAX" >= %"PRIuMAX")"),
                     (uintmax_t)bitmap_git->map_pos,
                     (uintmax_t)bitmap_git->map_size);
      return 0;
    }

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation
  2023-03-24 18:29       ` Jeff King
@ 2023-03-24 23:23         ` Taylor Blau
  2023-03-25  4:57           ` Jeff King
  0 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-03-24 23:23 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, git, Junio C Hamano, Abhradeep Chakraborty

On Fri, Mar 24, 2023 at 02:29:29PM -0400, Jeff King wrote:
> We know the advance will succeed because we checked ahead of time that
> we had enough bytes. So it really is a BUG() if we don't, as it would
> indicate somebody missed the earlier check. On the other hand, it is a
> weird spot for an extra check, because by definition we'll have just
> read off the array just before the seek.

Here you claim that we want bitmap_index_seek_to() to call BUG() if we
end up with map_pos >= map_size. But...

> The case where we _do_ seek directly to a file-provided offset, rather
> than incrementing, is an important check that this series adds, but that
> one should be a die() and not a BUG().

...here you say that it should be a die().

I think it does depend on the context. When seeking directly to a
position before reading something, die()-ing is appropriate. The case
where you seek to a relative position to reflect that you just read
something, a BUG() is appropriate.

So really, I think you want something like this:

    static void bitmap_index_seek_set(struct bitmap_index *bitmap_git, size_t pos)
    {
      if (pos >= bitmap_git->map_size)
        die(_("bitmap position exceeds size (%"PRIuMAX" >= %"PRIuMAX")"),
            (uintmax_t)bitmap_git->map_pos,
            (uintmax_t)bitmap_git->map_size);

      bitmap_git->map_pos = pos;
    }

    static void bitmap_index_seek_ahead(struct bitmap_index *bitmap_git,
                                        size_t offset)
    {
      if (bitmap_git->map_pos + offset >= bitmap_git->map_size)
        BUG("cannot seek %"PRIuMAX" byte(s) ahead of %"PRIuMAX" "
            "(%"PRIuMAX" >= %"PRIuMAX")",
            (uintmax_t)offset,
            (uintmax_t)bitmap_git->map_pos,
            (uintmax_t)(bitmap_git->map_pos + offset),
            (uintmax_t)bitmap_git->map_size);

      bitmap_git->map_pos += offset;
    }

Does that match what you were thinking?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation
  2023-03-24 23:13       ` Taylor Blau
@ 2023-03-24 23:24         ` Taylor Blau
  0 siblings, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-03-24 23:24 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Jeff King, git, Junio C Hamano, Abhradeep Chakraborty

On Fri, Mar 24, 2023 at 07:13:16PM -0400, Taylor Blau wrote:
> > Would it make sense to return an 'int' instead of the size_t of map_pos?
> > That way we could return in error if this is exceeded, and then all
> > callers can respond "oh wait, that move would exceed the file size, so
> > I should fail in my own way..."?
>
> Works for me. I think bitmap_index_seek_to() would probably return the
> error() itself, since I don't think it makes sense to require each
> caller to come up with the same "bitmap position exceeds size" thing.

Actually, I take that back. See my response lower in this thread:

    https://lore.kernel.org/git/ZB4w2gCo/qPCmWkz@nand.local/

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation
  2023-03-24 23:23         ` Taylor Blau
@ 2023-03-25  4:57           ` Jeff King
  0 siblings, 0 replies; 29+ messages in thread
From: Jeff King @ 2023-03-25  4:57 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Derrick Stolee, git, Junio C Hamano, Abhradeep Chakraborty

On Fri, Mar 24, 2023 at 07:23:06PM -0400, Taylor Blau wrote:

> On Fri, Mar 24, 2023 at 02:29:29PM -0400, Jeff King wrote:
> > We know the advance will succeed because we checked ahead of time that
> > we had enough bytes. So it really is a BUG() if we don't, as it would
> > indicate somebody missed the earlier check. On the other hand, it is a
> > weird spot for an extra check, because by definition we'll have just
> > read off the array just before the seek.
> 
> Here you claim that we want bitmap_index_seek_to() to call BUG() if we
> end up with map_pos >= map_size. But...

I think the paragraph above doesn't have enough context. I meant
incrementing the pos here (which is why "we checked ahead of time that
we had enough bytes"), in which case it is a BUG() (double-checking the
earlier check).

In a seek_to(), there is no previous check. We have to make sure the
requested offset is within bounds.

> > The case where we _do_ seek directly to a file-provided offset, rather
> > than incrementing, is an important check that this series adds, but that
> > one should be a die() and not a BUG().
> 
> ...here you say that it should be a die().

Right, so that one would be a die(). Or better still, an error().

> I think it does depend on the context. When seeking directly to a
> position before reading something, die()-ing is appropriate. The case
> where you seek to a relative position to reflect that you just read
> something, a BUG() is appropriate.

Right, exactly. We are agreeing, I think.

> So really, I think you want something like this:
> 
>     static void bitmap_index_seek_set(struct bitmap_index *bitmap_git, size_t pos)
>     {
>       if (pos >= bitmap_git->map_size)
>         die(_("bitmap position exceeds size (%"PRIuMAX" >= %"PRIuMAX")"),
>             (uintmax_t)bitmap_git->map_pos,
>             (uintmax_t)bitmap_git->map_size);
> 
>       bitmap_git->map_pos = pos;
>     }
> 
>     static void bitmap_index_seek_ahead(struct bitmap_index *bitmap_git,
>                                         size_t offset)
>     {
>       if (bitmap_git->map_pos + offset >= bitmap_git->map_size)
>         BUG("cannot seek %"PRIuMAX" byte(s) ahead of %"PRIuMAX" "
>             "(%"PRIuMAX" >= %"PRIuMAX")",
>             (uintmax_t)offset,
>             (uintmax_t)bitmap_git->map_pos,
>             (uintmax_t)(bitmap_git->map_pos + offset),
>             (uintmax_t)bitmap_git->map_size);
> 
>       bitmap_git->map_pos += offset;
>     }
> 
> Does that match what you were thinking?

Yes, though I am of the opinion that the assertion in seek_ahead() is
largely pointless, simply because if it ever triggered we would already
have triggered undefined behavior. I'm not opposed to adding it if you
feel strongly, I just wouldn't bother myself (and instead would focus on
making the "do we have enough bytes to read" checks more consistent and
harder-to-get-wrong).

Seeking to exactly map_size in the seek_set() case (i.e., the "=" in
">=") is a little funny, but not illegal. Either way, you'd want to
check "and do we have N bytes to read from this offset" immediately
afterwards (and your series does), so that would catch any non-zero
reads there.

-Peff

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-03-25  4:57 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-20 20:02 [PATCH 0/6] pack-bitmap: miscellaneous mmap read hardening Taylor Blau
2023-03-20 20:02 ` [PATCH 1/6] pack-bitmap.c: hide bitmap internals in `read_u8()` Taylor Blau
2023-03-21 17:35   ` Jeff King
2023-03-24 17:52     ` Derrick Stolee
2023-03-20 20:02 ` [PATCH 2/6] pack-bitmap.c: hide bitmap internals in `read_be32()` Taylor Blau
2023-03-20 20:02 ` [PATCH 3/6] pack-bitmap.c: drop unnecessary 'inline's Taylor Blau
2023-03-21 17:40   ` Jeff King
2023-03-20 20:02 ` [PATCH 4/6] pack-bitmap.c: factor out manual `map_pos` manipulation Taylor Blau
2023-03-21 17:56   ` Jeff King
2023-03-24 18:04     ` Derrick Stolee
2023-03-24 18:29       ` Jeff King
2023-03-24 23:23         ` Taylor Blau
2023-03-25  4:57           ` Jeff King
2023-03-24 23:13       ` Taylor Blau
2023-03-24 23:24         ` Taylor Blau
2023-03-24 23:08     ` Taylor Blau
2023-03-20 20:02 ` [PATCH 5/6] pack-bitmap.c: use `bitmap_index_seek()` where possible Taylor Blau
2023-03-21 18:05   ` Jeff King
2023-03-24 18:06     ` Derrick Stolee
2023-03-24 18:35       ` Jeff King
2023-03-24 19:43         ` Junio C Hamano
2023-03-24 20:37           ` Jeff King
2023-03-24 21:38             ` Junio C Hamano
2023-03-24 22:57               ` Taylor Blau
2023-03-20 20:02 ` [PATCH 6/6] pack-bitmap.c: factor out `bitmap_index_seek_commit()` Taylor Blau
2023-03-21 18:13   ` Jeff King
2023-03-21 18:16     ` Taylor Blau
2023-03-21 18:27       ` Jeff King
2023-03-24 18:09         ` Derrick Stolee

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).