git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
@ 2021-01-08 18:16 Taylor Blau
  2021-01-08 18:16 ` [PATCH 01/20] pack-revindex: introduce a new API Taylor Blau
                   ` (23 more replies)
  0 siblings, 24 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:16 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Hi,

This is the first of two series to introduce an on-disk alternative to the
reverse index which is currently generated once per-process and stored in
memory.

As a reminder, the reverse index is used to translate an object's position
relative to other objects within the same packfile to that object's position
relative to the pack's index.

Generating the reverse index in memory for repositories with large packs has two
significant drawbacks:

  - It requires allocating sizeof(struct revindex_entry) per packed object.

  - It requires us to sort the entries by their pack offset. This is implemented
    in sort_revindex() using a radix sort, but still takes considerable time (as
    benchmarks found in the second series demonstrate).

Both of these can be addressed by storing the reverse index in a new '.rev' file
alongside the packs. This file is written once (during pack creation), and does
not require sorting when accessed, since it is stored in a sorted order.

This series lays the groundwork necessary to introduce the file format (which is
implemented in the second series). I'm splitting these two up since I think the
changes in this series can be queued independently from those in the latter
series.

The goal of this series is to remove direct access of the `struct
revindex_entry` type, as well as `struct packed_git`'s `revindex` field. The
on-disk format will be mmap'd and accessed directly, but the format is
sufficiently different that the whole `revindex` array can't be written as-is.

Separately from our main goal, the new API that is introduced in this series is
IMHO easier to read, and more clearly describes what order the given and
returned values are relative to.

The changes are structured as follows:

  - First, a new API is proposed.

  - Then, uses of the old API are removed one by one and replaced with their
    new counterparts.

  - Finally, without any callers remaining, the old API is removed.

Thanks in advance for your review.

Taylor Blau (20):
  pack-revindex: introduce a new API
  write_reuse_object(): convert to new revindex API
  write_reused_pack_one(): convert to new revindex API
  write_reused_pack_verbatim(): convert to new revindex API
  check_object(): convert to new revindex API
  bitmap_position_packfile(): convert to new revindex API
  show_objects_for_type(): convert to new revindex API
  get_size_by_pos(): convert to new revindex API
  try_partial_reuse(): convert to new revindex API
  rebuild_existing_bitmaps(): convert to new revindex API
  get_delta_base_oid(): convert to new revindex API
  retry_bad_packed_offset(): convert to new revindex API
  packed_object_info(): convert to new revindex API
  unpack_entry(): convert to new revindex API
  for_each_object_in_pack(): convert to new revindex API
  builtin/gc.c: guess the size of the revindex
  pack-revindex: remove unused 'find_pack_revindex()'
  pack-revindex: remove unused 'find_revindex_position()'
  pack-revindex: hide the definition of 'revindex_entry'
  pack-revindex.c: avoid direct revindex access in
    'offset_to_pack_pos()'

 builtin/gc.c           |  2 +-
 builtin/pack-objects.c | 33 +++++++++++++++++-----------
 pack-bitmap.c          | 44 ++++++++++++++++++-------------------
 pack-revindex.c        | 49 +++++++++++++++++++++++++++--------------
 pack-revindex.h        | 10 +++------
 packfile.c             | 50 ++++++++++++++++++++++++++----------------
 6 files changed, 109 insertions(+), 79 deletions(-)

-- 
2.30.0.138.g6d7191ea01

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH 01/20] pack-revindex: introduce a new API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
@ 2021-01-08 18:16 ` Taylor Blau
  2021-01-12  8:41   ` Jeff King
  2021-01-13  8:06   ` Junio C Hamano
  2021-01-08 18:16 ` [PATCH 02/20] write_reuse_object(): convert to new revindex API Taylor Blau
                   ` (22 subsequent siblings)
  23 siblings, 2 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:16 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

In the next several patches, we will prepare for loading a reverse index
either in memory, or from a yet-to-be-introduced on-disk format. To do
that, we'll introduce an API that avoids the caller explicitly indexing
the revindex pointer in the packed_git structure.

There are four ways to interact with the reverse index. Accordingly,
four functions will be exported from 'pack-revindex.h' by the time that
the existing API is removed. A caller may:

 1. Load the pack's reverse index. This involves opening up the index,
    generating an array, and then sorting it. Since opening the index
    can fail, this function ('load_pack_revindex()') returns an int.
    Accordingly, it takes only a single argument: the 'struct
    packed_git' the caller wants to build a reverse index for.

    This function is well-suited for both the current and new API.
    Callers will have to continue to open the reverse index explicitly,
    but this function will eventually learn how to detect and load a
    reverse index from the on-disk format, if one exists. Otherwise, it
    will fallback to generating one in memory from scratch.

 2. Convert a pack position into an offset. This operation is now
    called `pack_pos_to_offset()`. It takes a pack and a position, and
    returns the corresponding off_t.

 3. Convert a pack position into an index position. Same as above; this
    takes a pack and a position, and returns a uint32_t. This operation
    is known as `pack_pos_to_index()`.

 4. Find the pack position for a given offset. This operation is now
    known as `offset_to_pack_pos()`. It takes a pack, an offset, and a
    pointer to a uint32_t where the position is written, if an object
    exists at that offset. Otherwise, -1 is returned to indicate
    failure.

    Unlike some of the callers that used to access '->offset' and '->nr'
    directly, the error checking around this call is somewhat more
    robust. This is important since callers can pass an offset which
    does not contain an object.

    This will become important in a subsequent patch where a caller
    which does not but could check the return value treats the signed
    `-1` from `find_revindex_position()` as an index into the 'revindex'
    array.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 32 ++++++++++++++++++++++++++++++++
 pack-revindex.h |  4 ++++
 2 files changed, 36 insertions(+)

diff --git a/pack-revindex.c b/pack-revindex.c
index ecdde39cf4..6d86a85208 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -203,3 +203,35 @@ struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs)
 
 	return p->revindex + pos;
 }
+
+int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
+{
+	int ret;
+
+	if (load_pack_revindex(p) < 0)
+		return -1;
+
+	ret = find_revindex_position(p, ofs);
+	if (ret < 0)
+		return -1;
+	*pos = ret;
+	return 0;
+}
+
+uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
+{
+	if (!p->revindex)
+		BUG("pack_pos_to_index: reverse index not yet loaded");
+	if (pos >= p->num_objects)
+		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
+	return p->revindex[pos].nr;
+}
+
+off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos)
+{
+	if (!p->revindex)
+		BUG("pack_pos_to_index: reverse index not yet loaded");
+	if (pos > p->num_objects)
+		BUG("pack_pos_to_offset: out-of-bounds object at %"PRIu32, pos);
+	return p->revindex[pos].offset;
+}
diff --git a/pack-revindex.h b/pack-revindex.h
index 848331d5d6..256c0a9106 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -13,4 +13,8 @@ int find_revindex_position(struct packed_git *p, off_t ofs);
 
 struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs);
 
+int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos);
+uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
+off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos);
+
 #endif
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 02/20] write_reuse_object(): convert to new revindex API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
  2021-01-08 18:16 ` [PATCH 01/20] pack-revindex: introduce a new API Taylor Blau
@ 2021-01-08 18:16 ` Taylor Blau
  2021-01-12  8:47   ` Jeff King
  2021-01-08 18:16 ` [PATCH 03/20] write_reused_pack_one(): " Taylor Blau
                   ` (21 subsequent siblings)
  23 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:16 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

First replace 'find_pack_revindex()' with its replacement
'offset_to_pack_pos()'. This prevents any bogus OFS_DELTA that may make
its way through until 'write_reuse_object()' from causing a bad memory
read (if 'revidx' is 'NULL')

Next, replace a direct access of '->nr' with the wrapper function
'pack_pos_to_index()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2a00358f34..03d25db442 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -419,7 +419,7 @@ static off_t write_reuse_object(struct hashfile *f, struct object_entry *entry,
 {
 	struct packed_git *p = IN_PACK(entry);
 	struct pack_window *w_curs = NULL;
-	struct revindex_entry *revidx;
+	uint32_t pos;
 	off_t offset;
 	enum object_type type = oe_type(entry);
 	off_t datalen;
@@ -436,10 +436,13 @@ static off_t write_reuse_object(struct hashfile *f, struct object_entry *entry,
 					      type, entry_size);
 
 	offset = entry->in_pack_offset;
-	revidx = find_pack_revindex(p, offset);
-	datalen = revidx[1].offset - offset;
+	if (offset_to_pack_pos(p, offset, &pos) < 0)
+		die(_("write_reuse_object: could not locate %s"),
+		    oid_to_hex(&entry->idx.oid));
+	datalen = pack_pos_to_offset(p, pos + 1) - offset;
 	if (!pack_to_stdout && p->index_version > 1 &&
-	    check_pack_crc(p, &w_curs, offset, datalen, revidx->nr)) {
+	    check_pack_crc(p, &w_curs, offset, datalen,
+			   pack_pos_to_index(p, pos))) {
 		error(_("bad packed object CRC for %s"),
 		      oid_to_hex(&entry->idx.oid));
 		unuse_pack(&w_curs);
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 03/20] write_reused_pack_one(): convert to new revindex API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
  2021-01-08 18:16 ` [PATCH 01/20] pack-revindex: introduce a new API Taylor Blau
  2021-01-08 18:16 ` [PATCH 02/20] write_reuse_object(): convert to new revindex API Taylor Blau
@ 2021-01-08 18:16 ` Taylor Blau
  2021-01-12  8:50   ` Jeff King
  2021-01-08 18:16 ` [PATCH 04/20] write_reused_pack_verbatim(): " Taylor Blau
                   ` (20 subsequent siblings)
  23 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:16 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Replace direct revindex accesses with calls to 'pack_pos_to_offset()'
and 'pack_pos_to_index()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 03d25db442..ea7df9270f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -866,8 +866,8 @@ static void write_reused_pack_one(size_t pos, struct hashfile *out,
 	enum object_type type;
 	unsigned long size;
 
-	offset = reuse_packfile->revindex[pos].offset;
-	next = reuse_packfile->revindex[pos + 1].offset;
+	offset = pack_pos_to_offset(reuse_packfile, pos);
+	next = pack_pos_to_offset(reuse_packfile, pos + 1);
 
 	record_reused_object(offset, offset - hashfile_total(out));
 
@@ -887,11 +887,15 @@ static void write_reused_pack_one(size_t pos, struct hashfile *out,
 
 		/* Convert to REF_DELTA if we must... */
 		if (!allow_ofs_delta) {
-			int base_pos = find_revindex_position(reuse_packfile, base_offset);
+			uint32_t base_pos;
 			struct object_id base_oid;
 
+			if (offset_to_pack_pos(reuse_packfile, base_offset, &base_pos) < 0)
+				die(_("expected object at offset %"PRIuMAX),
+				    (uintmax_t)base_offset);
+
 			nth_packed_object_id(&base_oid, reuse_packfile,
-					     reuse_packfile->revindex[base_pos].nr);
+					     pack_pos_to_index(reuse_packfile, base_pos));
 
 			len = encode_in_pack_object_header(header, sizeof(header),
 							   OBJ_REF_DELTA, size);
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 04/20] write_reused_pack_verbatim(): convert to new revindex API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (2 preceding siblings ...)
  2021-01-08 18:16 ` [PATCH 03/20] write_reused_pack_one(): " Taylor Blau
@ 2021-01-08 18:16 ` Taylor Blau
  2021-01-12  8:50   ` Jeff King
  2021-01-08 18:17 ` [PATCH 05/20] check_object(): " Taylor Blau
                   ` (19 subsequent siblings)
  23 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:16 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Replace a direct access to the revindex array with
'pack_pos_to_offset()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index ea7df9270f..4341bc27b4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -948,7 +948,7 @@ static size_t write_reused_pack_verbatim(struct hashfile *out,
 		off_t to_write;
 
 		written = (pos * BITS_IN_EWORD);
-		to_write = reuse_packfile->revindex[written].offset
+		to_write = pack_pos_to_offset(reuse_packfile, written)
 			- sizeof(struct pack_header);
 
 		/* We're recording one chunk, not one object. */
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 05/20] check_object(): convert to new revindex API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (3 preceding siblings ...)
  2021-01-08 18:16 ` [PATCH 04/20] write_reused_pack_verbatim(): " Taylor Blau
@ 2021-01-08 18:17 ` Taylor Blau
  2021-01-11 11:43   ` Derrick Stolee
  2021-01-12  8:51   ` Jeff King
  2021-01-08 18:17 ` [PATCH 06/20] bitmap_position_packfile(): " Taylor Blau
                   ` (18 subsequent siblings)
  23 siblings, 2 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:17 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Replace direct accesses to the revindex with calls to
'offset_to_pack_pos()' and 'pack_pos_to_index()'.

Since this caller already had some error checking (it can jump to the
'give_up' label if it encounters an error), we can easily check whether
or not the provided offset points to an object in the given pack. This
error checking existed prior to this patch, too, since the caller checks
whether the return value from 'find_pack_revindex()' was NULL or not.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4341bc27b4..a193cdaf2f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1813,11 +1813,11 @@ static void check_object(struct object_entry *entry, uint32_t object_index)
 				goto give_up;
 			}
 			if (reuse_delta && !entry->preferred_base) {
-				struct revindex_entry *revidx;
-				revidx = find_pack_revindex(p, ofs);
-				if (!revidx)
+				uint32_t pos;
+				if (offset_to_pack_pos(p, ofs, &pos) < 0)
 					goto give_up;
-				if (!nth_packed_object_id(&base_ref, p, revidx->nr))
+				if (!nth_packed_object_id(&base_ref, p,
+							  pack_pos_to_index(p, pos)))
 					have_base = 1;
 			}
 			entry->in_pack_header_size = used + used_0;
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 06/20] bitmap_position_packfile(): convert to new revindex API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (4 preceding siblings ...)
  2021-01-08 18:17 ` [PATCH 05/20] check_object(): " Taylor Blau
@ 2021-01-08 18:17 ` Taylor Blau
  2021-01-08 18:17 ` [PATCH 07/20] show_objects_for_type(): " Taylor Blau
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:17 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Replace find_revindex_position() with its counterpart in the new API,
offset_to_pack_pos().

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index d88745fb02..d6861ddd4d 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -407,11 +407,14 @@ static inline int bitmap_position_extended(struct bitmap_index *bitmap_git,
 static inline int bitmap_position_packfile(struct bitmap_index *bitmap_git,
 					   const struct object_id *oid)
 {
+	uint32_t pos;
 	off_t offset = find_pack_entry_one(oid->hash, bitmap_git->pack);
 	if (!offset)
 		return -1;
 
-	return find_revindex_position(bitmap_git->pack, offset);
+	if (offset_to_pack_pos(bitmap_git->pack, offset, &pos) < 0)
+		return -1;
+	return pos;
 }
 
 static int bitmap_position(struct bitmap_index *bitmap_git,
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 07/20] show_objects_for_type(): convert to new revindex API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (5 preceding siblings ...)
  2021-01-08 18:17 ` [PATCH 06/20] bitmap_position_packfile(): " Taylor Blau
@ 2021-01-08 18:17 ` Taylor Blau
  2021-01-12  8:57   ` Jeff King
  2021-01-08 18:17 ` [PATCH 08/20] get_size_by_pos(): " Taylor Blau
                   ` (16 subsequent siblings)
  23 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:17 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Avoid storing the revindex entry directly, since this structure will
soon be removed from the public interface. Instead, store the offset and
index position by calling 'pack_pos_to_offset()' and
'pack_pos_to_index()', respectively.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index d6861ddd4d..80c57bde73 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -711,21 +711,22 @@ static void show_objects_for_type(
 
 		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
 			struct object_id oid;
-			struct revindex_entry *entry;
-			uint32_t hash = 0;
+			uint32_t hash = 0, n;
+			off_t ofs;
 
 			if ((word >> offset) == 0)
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
 
-			entry = &bitmap_git->pack->revindex[pos + offset];
-			nth_packed_object_id(&oid, bitmap_git->pack, entry->nr);
+			n = pack_pos_to_index(bitmap_git->pack, pos + offset);
+			ofs = pack_pos_to_offset(bitmap_git->pack, pos + offset);
+			nth_packed_object_id(&oid, bitmap_git->pack, n);
 
 			if (bitmap_git->hashes)
-				hash = get_be32(bitmap_git->hashes + entry->nr);
+				hash = get_be32(bitmap_git->hashes + n);
 
-			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, entry->offset);
+			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, ofs);
 		}
 	}
 }
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 08/20] get_size_by_pos(): convert to new revindex API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (6 preceding siblings ...)
  2021-01-08 18:17 ` [PATCH 07/20] show_objects_for_type(): " Taylor Blau
@ 2021-01-08 18:17 ` Taylor Blau
  2021-01-08 18:17 ` [PATCH 09/20] try_partial_reuse(): " Taylor Blau
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:17 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Remove another caller that holds onto a 'struct revindex_entry' by
replacing the direct indexing with calls to 'pack_pos_to_offset()' and
'pack_pos_to_index()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 80c57bde73..422505d7af 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -835,11 +835,11 @@ static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
 	oi.sizep = &size;
 
 	if (pos < pack->num_objects) {
-		struct revindex_entry *entry = &pack->revindex[pos];
-		if (packed_object_info(the_repository, pack,
-				       entry->offset, &oi) < 0) {
+		off_t ofs = pack_pos_to_offset(pack, pos);
+		if (packed_object_info(the_repository, pack, ofs, &oi) < 0) {
 			struct object_id oid;
-			nth_packed_object_id(&oid, pack, entry->nr);
+			nth_packed_object_id(&oid, pack,
+					     pack_pos_to_index(pack, pos));
 			die(_("unable to get size of %s"), oid_to_hex(&oid));
 		}
 	} else {
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 09/20] try_partial_reuse(): convert to new revindex API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (7 preceding siblings ...)
  2021-01-08 18:17 ` [PATCH 08/20] get_size_by_pos(): " Taylor Blau
@ 2021-01-08 18:17 ` Taylor Blau
  2021-01-12  9:06   ` Jeff King
  2021-01-08 18:17 ` [PATCH 10/20] rebuild_existing_bitmaps(): " Taylor Blau
                   ` (14 subsequent siblings)
  23 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:17 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Remove another instance of direct revindex manipulation by calling
'pack_pos_to_offset()' instead (the caller here does not care about the
index position of the object at position 'pos').

Somewhat confusingly, the subsequent call to unpack_object_header()
takes a pointer to &offset and then updates it with a new value. But,
try_partial_reuse() cares about the offset of both the base's header and
contents. The existing code made a copy of the offset field, and only
addresses and manipulates one of them.

Instead, store the return of pack_pos_to_offset twice: once in header
and another in offset. Header will be left untouched, but offset will be
addressed and modified by unpack_object_header().

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 422505d7af..3f5cd4e77d 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1069,23 +1069,21 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 			      struct bitmap *reuse,
 			      struct pack_window **w_curs)
 {
-	struct revindex_entry *revidx;
-	off_t offset;
+	off_t offset, header;
 	enum object_type type;
 	unsigned long size;
 
 	if (pos >= bitmap_git->pack->num_objects)
 		return; /* not actually in the pack */
 
-	revidx = &bitmap_git->pack->revindex[pos];
-	offset = revidx->offset;
+	offset = header = pack_pos_to_offset(bitmap_git->pack, pos);
 	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
 	if (type < 0)
 		return; /* broken packfile, punt */
 
 	if (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA) {
 		off_t base_offset;
-		int base_pos;
+		uint32_t base_pos;
 
 		/*
 		 * Find the position of the base object so we can look it up
@@ -1096,11 +1094,10 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 		 * more detail.
 		 */
 		base_offset = get_delta_base(bitmap_git->pack, w_curs,
-					     &offset, type, revidx->offset);
+					     &offset, type, header);
 		if (!base_offset)
 			return;
-		base_pos = find_revindex_position(bitmap_git->pack, base_offset);
-		if (base_pos < 0)
+		if (offset_to_pack_pos(bitmap_git->pack, base_offset, &base_pos) < 0)
 			return;
 
 		/*
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 10/20] rebuild_existing_bitmaps(): convert to new revindex API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (8 preceding siblings ...)
  2021-01-08 18:17 ` [PATCH 09/20] try_partial_reuse(): " Taylor Blau
@ 2021-01-08 18:17 ` Taylor Blau
  2021-01-08 18:17 ` [PATCH 11/20] get_delta_base_oid(): " Taylor Blau
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:17 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Remove another instance of looking at the revindex directly by instead
calling 'pack_pos_to_index()'. Unlike other patches, this caller only
cares about the index position of each object in the loop.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 3f5cd4e77d..24b9628dca 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1392,11 +1392,10 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
 
 	for (i = 0; i < num_objects; ++i) {
 		struct object_id oid;
-		struct revindex_entry *entry;
 		struct object_entry *oe;
 
-		entry = &bitmap_git->pack->revindex[i];
-		nth_packed_object_id(&oid, bitmap_git->pack, entry->nr);
+		nth_packed_object_id(&oid, bitmap_git->pack,
+				     pack_pos_to_index(bitmap_git->pack, i));
 		oe = packlist_find(mapping, &oid);
 
 		if (oe)
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 11/20] get_delta_base_oid(): convert to new revindex API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (9 preceding siblings ...)
  2021-01-08 18:17 ` [PATCH 10/20] rebuild_existing_bitmaps(): " Taylor Blau
@ 2021-01-08 18:17 ` Taylor Blau
  2021-01-08 18:17 ` [PATCH 12/20] retry_bad_packed_offset(): " Taylor Blau
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:17 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Replace direct accesses to the 'struct revindex' type with a call to
'pack_pos_to_index()'.

Likewise drop the old-style 'find_pack_revindex()' with its replacement
'offset_to_pack_pos()' (while continuing to perform the same error
checking).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/packfile.c b/packfile.c
index 86f5c8dbf6..3e3f391949 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1235,18 +1235,18 @@ static int get_delta_base_oid(struct packed_git *p,
 		oidread(oid, base);
 		return 0;
 	} else if (type == OBJ_OFS_DELTA) {
-		struct revindex_entry *revidx;
+		uint32_t base_pos;
 		off_t base_offset = get_delta_base(p, w_curs, &curpos,
 						   type, delta_obj_offset);
 
 		if (!base_offset)
 			return -1;
 
-		revidx = find_pack_revindex(p, base_offset);
-		if (!revidx)
+		if (offset_to_pack_pos(p, base_offset, &base_pos) < 0)
 			return -1;
 
-		return nth_packed_object_id(oid, p, revidx->nr);
+		return nth_packed_object_id(oid, p,
+					    pack_pos_to_index(p, base_pos));
 	} else
 		return -1;
 }
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 12/20] retry_bad_packed_offset(): convert to new revindex API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (10 preceding siblings ...)
  2021-01-08 18:17 ` [PATCH 11/20] get_delta_base_oid(): " Taylor Blau
@ 2021-01-08 18:17 ` Taylor Blau
  2021-01-08 18:17 ` [PATCH 13/20] packed_object_info(): " Taylor Blau
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:17 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Perform exactly the same conversion as in the previous commit to another
caller within 'packfile.c'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/packfile.c b/packfile.c
index 3e3f391949..7c37f9ec5c 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1256,12 +1256,11 @@ static int retry_bad_packed_offset(struct repository *r,
 				   off_t obj_offset)
 {
 	int type;
-	struct revindex_entry *revidx;
+	uint32_t pos;
 	struct object_id oid;
-	revidx = find_pack_revindex(p, obj_offset);
-	if (!revidx)
+	if (offset_to_pack_pos(p, obj_offset, &pos) < 0)
 		return OBJ_BAD;
-	nth_packed_object_id(&oid, p, revidx->nr);
+	nth_packed_object_id(&oid, p, pack_pos_to_index(p, pos));
 	mark_bad_packed_object(p, oid.hash);
 	type = oid_object_info(r, &oid, NULL);
 	if (type <= OBJ_NONE)
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 13/20] packed_object_info(): convert to new revindex API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (11 preceding siblings ...)
  2021-01-08 18:17 ` [PATCH 12/20] retry_bad_packed_offset(): " Taylor Blau
@ 2021-01-08 18:17 ` Taylor Blau
  2021-01-12  9:11   ` Jeff King
  2021-01-08 18:17 ` [PATCH 14/20] unpack_entry(): " Taylor Blau
                   ` (10 subsequent siblings)
  23 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:17 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Convert another call of 'find_pack_revindex()' to its replacement
'pack_pos_to_offset()'. Likewise:

  - Avoid manipulating `struct packed_git`'s `revindex` pointer directly
    by removing the pointer-as-array indexing.

  - Add an additional guard to check that the offset 'obj_offset()'
    points to a real object. This should be the case with well-behaved
    callers to 'packed_object_info()', but isn't guarenteed.

    Other blocks that fill in various other values from the 'struct
    object_info' request handle bad inputs by setting the type to
    'OBJ_BAD' and jumping to 'out'. Do the same when given a bad offset
    here.

    The previous code would have segfaulted when given a bad
    'obj_offset' value, since 'find_pack_revindex()' would return
    'NULL', and then the line that fills 'oi->disk_sizep' would try to
    access 'NULL[1]' with a stride of 16 bytes (the width of 'struct
    revindex_entry)'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/packfile.c b/packfile.c
index 7c37f9ec5c..469c8d4f57 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1537,8 +1537,13 @@ int packed_object_info(struct repository *r, struct packed_git *p,
 	}
 
 	if (oi->disk_sizep) {
-		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
-		*oi->disk_sizep = revidx[1].offset - obj_offset;
+		uint32_t pos;
+		if (offset_to_pack_pos(p, obj_offset, &pos) < 0) {
+			type = OBJ_BAD;
+			goto out;
+		}
+
+		*oi->disk_sizep = pack_pos_to_offset(p, pos + 1) - obj_offset;
 	}
 
 	if (oi->typep || oi->type_name) {
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 14/20] unpack_entry(): convert to new revindex API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (12 preceding siblings ...)
  2021-01-08 18:17 ` [PATCH 13/20] packed_object_info(): " Taylor Blau
@ 2021-01-08 18:17 ` Taylor Blau
  2021-01-12  9:22   ` Jeff King
  2021-01-08 18:17 ` [PATCH 15/20] for_each_object_in_pack(): " Taylor Blau
                   ` (9 subsequent siblings)
  23 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:17 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Remove direct manipulation of the 'struct revindex_entry' type as well
as calls to the deprecated API in 'packfile.c:unpack_entry()'. Usual
clean-up is performed (replacing '->nr' with calls to
'pack_pos_to_index()' and so on). Add an additional check to make
sure that 'obj_offset()' points at a valid object.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/packfile.c b/packfile.c
index 469c8d4f57..34467ea4a3 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1692,11 +1692,19 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
 		}
 
 		if (do_check_packed_object_crc && p->index_version > 1) {
-			struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
-			off_t len = revidx[1].offset - obj_offset;
-			if (check_pack_crc(p, &w_curs, obj_offset, len, revidx->nr)) {
+			uint32_t pos, nr;
+			off_t len;
+
+			if (offset_to_pack_pos(p, obj_offset, &pos) < 0) {
+				data = NULL;
+				goto out;
+			}
+
+			len = pack_pos_to_offset(p, pos + 1) - obj_offset;
+			nr = pack_pos_to_index(p, pos);
+			if (check_pack_crc(p, &w_curs, obj_offset, len, nr)) {
 				struct object_id oid;
-				nth_packed_object_id(&oid, p, revidx->nr);
+				nth_packed_object_id(&oid, p, nr);
 				error("bad packed object CRC for %s",
 				      oid_to_hex(&oid));
 				mark_bad_packed_object(p, oid.hash);
@@ -1779,11 +1787,11 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
 			 * This is costly but should happen only in the presence
 			 * of a corrupted pack, and is better than failing outright.
 			 */
-			struct revindex_entry *revidx;
+			uint32_t pos;
 			struct object_id base_oid;
-			revidx = find_pack_revindex(p, obj_offset);
-			if (revidx) {
-				nth_packed_object_id(&base_oid, p, revidx->nr);
+			if (!(offset_to_pack_pos(p, obj_offset, &pos))) {
+				nth_packed_object_id(&base_oid, p,
+						     pack_pos_to_index(p, pos));
 				error("failed to read delta base object %s"
 				      " at offset %"PRIuMAX" from %s",
 				      oid_to_hex(&base_oid), (uintmax_t)obj_offset,
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 15/20] for_each_object_in_pack(): convert to new revindex API
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (13 preceding siblings ...)
  2021-01-08 18:17 ` [PATCH 14/20] unpack_entry(): " Taylor Blau
@ 2021-01-08 18:17 ` Taylor Blau
  2021-01-08 18:17 ` [PATCH 16/20] builtin/gc.c: guess the size of the revindex Taylor Blau
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:17 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Avoid looking at the 'revindex' pointer directly and instead call
'pack_pos_to_index()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packfile.c b/packfile.c
index 34467ea4a3..46c9c7ea3c 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2082,7 +2082,7 @@ int for_each_object_in_pack(struct packed_git *p,
 		struct object_id oid;
 
 		if (flags & FOR_EACH_OBJECT_PACK_ORDER)
-			pos = p->revindex[i].nr;
+			pos = pack_pos_to_index(p, i);
 		else
 			pos = i;
 
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 16/20] builtin/gc.c: guess the size of the revindex
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (14 preceding siblings ...)
  2021-01-08 18:17 ` [PATCH 15/20] for_each_object_in_pack(): " Taylor Blau
@ 2021-01-08 18:17 ` Taylor Blau
  2021-01-11 11:52   ` Derrick Stolee
  2021-01-08 18:17 ` [PATCH 17/20] pack-revindex: remove unused 'find_pack_revindex()' Taylor Blau
                   ` (7 subsequent siblings)
  23 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:17 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

'estimate_repack_memory()' takes into account the amount of memory
required to load the reverse index in memory by multiplying the assumed
number of objects by the size of the 'revindex_entry' struct.

Prepare for hiding the definition of 'struct revindex_entry' by removing
a 'sizeof()' of that type from outside of pack-revindex.c. Instead,
guess that one off_t and one uint32_t are required per object. Strictly
speaking, this is a worse guess than asking for 'sizeof(struct
revindex_entry)' directly, since the true size of this struct is 16
bytes with padding on the end of the struct in order to align the offset
field.

But, this is an approximation anyway, and it does remove a use of the
'struct revindex_entry' from outside of pack-revindex internals.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/gc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 4c24f41852..c60811f212 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -301,7 +301,7 @@ static uint64_t estimate_repack_memory(struct packed_git *pack)
 	/* and then obj_hash[], underestimated in fact */
 	heap += sizeof(struct object *) * nr_objects;
 	/* revindex is used also */
-	heap += sizeof(struct revindex_entry) * nr_objects;
+	heap += (sizeof(off_t) + sizeof(uint32_t)) * nr_objects;
 	/*
 	 * read_sha1_file() (either at delta calculation phase, or
 	 * writing phase) also fills up the delta base cache
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 17/20] pack-revindex: remove unused 'find_pack_revindex()'
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (15 preceding siblings ...)
  2021-01-08 18:17 ` [PATCH 16/20] builtin/gc.c: guess the size of the revindex Taylor Blau
@ 2021-01-08 18:17 ` Taylor Blau
  2021-01-08 18:17 ` [PATCH 18/20] pack-revindex: remove unused 'find_revindex_position()' Taylor Blau
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:17 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Now that no callers of 'find_pack_revindex()' remain, remove the
function's declaration and implementation entirely.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 15 ---------------
 pack-revindex.h |  2 --
 2 files changed, 17 deletions(-)

diff --git a/pack-revindex.c b/pack-revindex.c
index 6d86a85208..4e42238906 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -189,21 +189,6 @@ int find_revindex_position(struct packed_git *p, off_t ofs)
 	return -1;
 }
 
-struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs)
-{
-	int pos;
-
-	if (load_pack_revindex(p))
-		return NULL;
-
-	pos = find_revindex_position(p, ofs);
-
-	if (pos < 0)
-		return NULL;
-
-	return p->revindex + pos;
-}
-
 int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
 {
 	int ret;
diff --git a/pack-revindex.h b/pack-revindex.h
index 256c0a9106..07c1a7a3c8 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -11,8 +11,6 @@ struct revindex_entry {
 int load_pack_revindex(struct packed_git *p);
 int find_revindex_position(struct packed_git *p, off_t ofs);
 
-struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs);
-
 int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos);
 uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
 off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos);
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 18/20] pack-revindex: remove unused 'find_revindex_position()'
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (16 preceding siblings ...)
  2021-01-08 18:17 ` [PATCH 17/20] pack-revindex: remove unused 'find_pack_revindex()' Taylor Blau
@ 2021-01-08 18:17 ` Taylor Blau
  2021-01-11 11:57   ` Derrick Stolee
  2021-01-12  9:32   ` Jeff King
  2021-01-08 18:18 ` [PATCH 19/20] pack-revindex: hide the definition of 'revindex_entry' Taylor Blau
                   ` (5 subsequent siblings)
  23 siblings, 2 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:17 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Now that all 'find_revindex_position()' callers have been removed (and
converted to the more descriptive 'offset_to_pack_pos()'), it is almost
safe to get rid of 'find_revindex_position()' entirely. Almost, except
for the fact that 'offset_to_pack_pos()' calls
'find_revindex_position()'.

Inline 'find_revindex_position()' into 'offset_to_pack_pos()', and
then remove 'find_revindex_position()' entirely.

This is a straightforward refactoring with one minor snag.
'offset_to_pack_pos()' used to load the index before calling
'find_revindex_position()'. That means that by the time
'find_revindex_position()' starts executing, 'p->num_objects' can be
safely read. After inlining, be careful to not read 'p->num_objects'
until _after_ 'load_pack_revindex()' (which loads the index as a
side-effect) has been called.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 29 +++++++++++------------------
 pack-revindex.h |  1 -
 2 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/pack-revindex.c b/pack-revindex.c
index 4e42238906..9392c4be73 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -169,16 +169,23 @@ int load_pack_revindex(struct packed_git *p)
 	return 0;
 }
 
-int find_revindex_position(struct packed_git *p, off_t ofs)
+int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
 {
 	int lo = 0;
-	int hi = p->num_objects + 1;
-	const struct revindex_entry *revindex = p->revindex;
+	int hi;
+	const struct revindex_entry *revindex;
+
+	if (load_pack_revindex(p) < 0)
+		return -1;
+
+	hi = p->num_objects + 1;
+	revindex = p->revindex;
 
 	do {
 		const unsigned mi = lo + (hi - lo) / 2;
 		if (revindex[mi].offset == ofs) {
-			return mi;
+			*pos = mi;
+			return 0;
 		} else if (ofs < revindex[mi].offset)
 			hi = mi;
 		else
@@ -189,20 +196,6 @@ int find_revindex_position(struct packed_git *p, off_t ofs)
 	return -1;
 }
 
-int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
-{
-	int ret;
-
-	if (load_pack_revindex(p) < 0)
-		return -1;
-
-	ret = find_revindex_position(p, ofs);
-	if (ret < 0)
-		return -1;
-	*pos = ret;
-	return 0;
-}
-
 uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
 {
 	if (!p->revindex)
diff --git a/pack-revindex.h b/pack-revindex.h
index 07c1a7a3c8..b5dd114fd5 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -9,7 +9,6 @@ struct revindex_entry {
 };
 
 int load_pack_revindex(struct packed_git *p);
-int find_revindex_position(struct packed_git *p, off_t ofs);
 
 int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos);
 uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 19/20] pack-revindex: hide the definition of 'revindex_entry'
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (17 preceding siblings ...)
  2021-01-08 18:17 ` [PATCH 18/20] pack-revindex: remove unused 'find_revindex_position()' Taylor Blau
@ 2021-01-08 18:18 ` Taylor Blau
  2021-01-11 11:57   ` Derrick Stolee
  2021-01-12  9:34   ` Jeff King
  2021-01-08 18:18 ` [PATCH 20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()' Taylor Blau
                   ` (4 subsequent siblings)
  23 siblings, 2 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:18 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Now that all spots outside of pack-revindex.c that reference 'struct
revindex_entry' directly have been removed, it is safe to hide the
implementation by moving it from pack-revindex.h to pack-revindex.c.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 5 +++++
 pack-revindex.h | 5 -----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/pack-revindex.c b/pack-revindex.c
index 9392c4be73..36ef276378 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -3,6 +3,11 @@
 #include "object-store.h"
 #include "packfile.h"
 
+struct revindex_entry {
+	off_t offset;
+	unsigned int nr;
+};
+
 /*
  * Pack index for existing packs give us easy access to the offsets into
  * corresponding pack file where each object's data starts, but the entries
diff --git a/pack-revindex.h b/pack-revindex.h
index b5dd114fd5..b501a7cd62 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -3,11 +3,6 @@
 
 struct packed_git;
 
-struct revindex_entry {
-	off_t offset;
-	unsigned int nr;
-};
-
 int load_pack_revindex(struct packed_git *p);
 
 int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos);
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH 20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()'
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (18 preceding siblings ...)
  2021-01-08 18:18 ` [PATCH 19/20] pack-revindex: hide the definition of 'revindex_entry' Taylor Blau
@ 2021-01-08 18:18 ` Taylor Blau
  2021-01-12  9:37   ` Jeff King
  2021-01-11 12:07 ` [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Derrick Stolee
                   ` (3 subsequent siblings)
  23 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-08 18:18 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

To prepare for on-disk reverse indexes, remove a spot in
'offset_to_pack_pos()' that looks at the 'revindex' array in 'struct
packed_git'.

Even though this use of the revindex pointer is within pack-revindex.c,
this clean up is still worth doing. Since the 'revindex' pointer will be
NULL when reading from an on-disk reverse index (instead the
'revindex_data' pointer will be mmaped to the 'pack-*.rev' file), this
call-site would have to include a conditional to lookup the offset for
position 'mi' each iteration through the search.

So instead of open-coding 'pack_pos_to_offset()', call it directly from
within 'offset_to_pack_pos()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/pack-revindex.c b/pack-revindex.c
index 36ef276378..2cd9d632f1 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -178,20 +178,20 @@ int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
 {
 	int lo = 0;
 	int hi;
-	const struct revindex_entry *revindex;
 
 	if (load_pack_revindex(p) < 0)
 		return -1;
 
 	hi = p->num_objects + 1;
-	revindex = p->revindex;
 
 	do {
 		const unsigned mi = lo + (hi - lo) / 2;
-		if (revindex[mi].offset == ofs) {
+		off_t got = pack_pos_to_offset(p, mi);
+
+		if (got == ofs) {
 			*pos = mi;
 			return 0;
-		} else if (ofs < revindex[mi].offset)
+		} else if (ofs < got)
 			hi = mi;
 		else
 			lo = mi + 1;
-- 
2.30.0.138.g6d7191ea01

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [PATCH 05/20] check_object(): convert to new revindex API
  2021-01-08 18:17 ` [PATCH 05/20] check_object(): " Taylor Blau
@ 2021-01-11 11:43   ` Derrick Stolee
  2021-01-11 16:15     ` Taylor Blau
  2021-01-12  8:51   ` Jeff King
  1 sibling, 1 reply; 121+ messages in thread
From: Derrick Stolee @ 2021-01-11 11:43 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, jrnieder

On 1/8/2021 1:17 PM, Taylor Blau wrote:
> Replace direct accesses to the revindex with calls to
> 'offset_to_pack_pos()' and 'pack_pos_to_index()'.
> 
> Since this caller already had some error checking (it can jump to the
> 'give_up' label if it encounters an error), we can easily check whether
> or not the provided offset points to an object in the given pack. This
> error checking existed prior to this patch, too, since the caller checks
> whether the return value from 'find_pack_revindex()' was NULL or not.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  builtin/pack-objects.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 4341bc27b4..a193cdaf2f 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1813,11 +1813,11 @@ static void check_object(struct object_entry *entry, uint32_t object_index)
>  				goto give_up;
>  			}
>  			if (reuse_delta && !entry->preferred_base) {
> -				struct revindex_entry *revidx;
> -				revidx = find_pack_revindex(p, ofs);
> -				if (!revidx)
> +				uint32_t pos;
> +				if (offset_to_pack_pos(p, ofs, &pos) < 0)

The current implementation does not return a positive value. Only
-1 on error and 0 on success. Is this "< 0" doing anything important?
Seems like it would be easiest to do

	if (offset_to_pack_pos(p, ofs, &pos))

>  					goto give_up;
> -				if (!nth_packed_object_id(&base_ref, p, revidx->nr))
> +				if (!nth_packed_object_id(&base_ref, p,
> +							  pack_pos_to_index(p, pos)))
>  					have_base = 1;
>  			}

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 16/20] builtin/gc.c: guess the size of the revindex
  2021-01-08 18:17 ` [PATCH 16/20] builtin/gc.c: guess the size of the revindex Taylor Blau
@ 2021-01-11 11:52   ` Derrick Stolee
  2021-01-11 16:23     ` Taylor Blau
  0 siblings, 1 reply; 121+ messages in thread
From: Derrick Stolee @ 2021-01-11 11:52 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, jrnieder

On 1/8/2021 1:17 PM, Taylor Blau wrote:
> 'estimate_repack_memory()' takes into account the amount of memory
> required to load the reverse index in memory by multiplying the assumed
> number of objects by the size of the 'revindex_entry' struct.
> 
> Prepare for hiding the definition of 'struct revindex_entry' by removing
> a 'sizeof()' of that type from outside of pack-revindex.c. Instead,
> guess that one off_t and one uint32_t are required per object. Strictly
> speaking, this is a worse guess than asking for 'sizeof(struct
> revindex_entry)' directly, since the true size of this struct is 16
> bytes with padding on the end of the struct in order to align the offset
> field.

This is so far the only not-completely-obvious change.
 
> But, this is an approximation anyway, and it does remove a use of the
> 'struct revindex_entry' from outside of pack-revindex internals.

And this might be enough justification for it, but...

> -	heap += sizeof(struct revindex_entry) * nr_objects;
> +	heap += (sizeof(off_t) + sizeof(uint32_t)) * nr_objects;
...outside of the estimation change, will this need another change
when the rev-index is mmap'd? Should this instead be an API call,
such as estimate_rev_index_memory(nr_objects)? That would
centralize the estimate to be next to the code that currently
interacts with 'struct revindex_entry' and will later interact with
the mmap region.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 18/20] pack-revindex: remove unused 'find_revindex_position()'
  2021-01-08 18:17 ` [PATCH 18/20] pack-revindex: remove unused 'find_revindex_position()' Taylor Blau
@ 2021-01-11 11:57   ` Derrick Stolee
  2021-01-11 16:27     ` Taylor Blau
  2021-01-12  9:32   ` Jeff King
  1 sibling, 1 reply; 121+ messages in thread
From: Derrick Stolee @ 2021-01-11 11:57 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, jrnieder

On 1/8/2021 1:17 PM, Taylor Blau wrote:
> -int find_revindex_position(struct packed_git *p, off_t ofs)
> +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
>  {
>  	int lo = 0;
> -	int hi = p->num_objects + 1;
> -	const struct revindex_entry *revindex = p->revindex;
> +	int hi;
> +	const struct revindex_entry *revindex;
> +
> +	if (load_pack_revindex(p) < 0)
> +		return -1;
> +
> +	hi = p->num_objects + 1;
> +	revindex = p->revindex;
>  
>  	do {
>  		const unsigned mi = lo + (hi - lo) / 2;
>  		if (revindex[mi].offset == ofs) {
> -			return mi;
> +			*pos = mi;
> +			return 0;
>  		} else if (ofs < revindex[mi].offset)
>  			hi = mi;
>  		else
> @@ -189,20 +196,6 @@ int find_revindex_position(struct packed_git *p, off_t ofs)
>  	return -1;
>  }
>  
> -int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> -{
> -	int ret;
> -
> -	if (load_pack_revindex(p) < 0)
> -		return -1;
> -
> -	ret = find_revindex_position(p, ofs);
> -	if (ret < 0)
> -		return -1;
> -	*pos = ret;
> -	return 0;
> -}
> -

Not that this is new to the current patch, but this patch made me
wonder if we should initialize *pos = -1 in the case of a failure
to find the position? A correct caller should not use the value
if they are checking for the fail-to-find case properly. But, I
could see someone making a mistake and having trouble diagnosing
the problem because their position variable was initialized to
zero or a previous successful case.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 19/20] pack-revindex: hide the definition of 'revindex_entry'
  2021-01-08 18:18 ` [PATCH 19/20] pack-revindex: hide the definition of 'revindex_entry' Taylor Blau
@ 2021-01-11 11:57   ` Derrick Stolee
  2021-01-12  9:34   ` Jeff King
  1 sibling, 0 replies; 121+ messages in thread
From: Derrick Stolee @ 2021-01-11 11:57 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, jrnieder

On 1/8/2021 1:18 PM, Taylor Blau wrote:
> diff --git a/pack-revindex.h b/pack-revindex.h
> index b5dd114fd5..b501a7cd62 100644
> --- a/pack-revindex.h
> +++ b/pack-revindex.h
> @@ -3,11 +3,6 @@
>  
>  struct packed_git;
>  
> -struct revindex_entry {
> -	off_t offset;
> -	unsigned int nr;
> -};
> -

Very nice. A successful anonymization of a struct.

-Stolee

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (19 preceding siblings ...)
  2021-01-08 18:18 ` [PATCH 20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()' Taylor Blau
@ 2021-01-11 12:07 ` Derrick Stolee
  2021-01-11 16:30   ` Taylor Blau
  2021-01-12  9:45 ` Jeff King
                   ` (2 subsequent siblings)
  23 siblings, 1 reply; 121+ messages in thread
From: Derrick Stolee @ 2021-01-11 12:07 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, jrnieder

On 1/8/2021 1:16 PM, Taylor Blau wrote:
...
>   - First, a new API is proposed.
> 
>   - Then, uses of the old API are removed one by one and replaced with their
>     new counterparts.
> 
>   - Finally, without any callers remaining, the old API is removed.

This patch series has a clear layout that was easy to follow.

In a vacuum, the conversions from immediate struct member lookups
to API calls seems like adding overhead to something that is done
frequently in a loop. However, you do justify it:

> Generating the reverse index in memory for repositories with large packs has two
> significant drawbacks:
> 
>   - It requires allocating sizeof(struct revindex_entry) per packed object.
> 
>   - It requires us to sort the entries by their pack offset. This is implemented
>     in sort_revindex() using a radix sort, but still takes considerable time (as
>     benchmarks found in the second series demonstrate).
> 
> Both of these can be addressed by storing the reverse index in a new '.rev' file
> alongside the packs. This file is written once (during pack creation), and does
> not require sorting when accessed, since it is stored in a sorted order.

Even if these method calls do add a bit of overhead to each
access, it helps to not compute the table from scratch before
any access is possible.

This will be particularly valuable for operations that use only
a few position lookups, such as "is object A reachable from
commit C?"

Operations that iterate through every object in a bitmap are
more likely to notice a difference, but that will probably be
visible in the next series.

My comments on this series are very minor.

I made only one comment about "if (method() < 0)" versus
"if (method())" but that pattern appears in multiple patches.
_If_ you decide to change that pattern, then I'm sure you can
find all uses.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 05/20] check_object(): convert to new revindex API
  2021-01-11 11:43   ` Derrick Stolee
@ 2021-01-11 16:15     ` Taylor Blau
  2021-01-12  8:54       ` Jeff King
  0 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-11 16:15 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, peff, jrnieder

On Mon, Jan 11, 2021 at 06:43:23AM -0500, Derrick Stolee wrote:
> > @@ -1813,11 +1813,11 @@ static void check_object(struct object_entry *entry, uint32_t object_index)
> >  				goto give_up;
> >  			}
> >  			if (reuse_delta && !entry->preferred_base) {
> > -				struct revindex_entry *revidx;
> > -				revidx = find_pack_revindex(p, ofs);
> > -				if (!revidx)
> > +				uint32_t pos;
> > +				if (offset_to_pack_pos(p, ofs, &pos) < 0)
>
> The current implementation does not return a positive value. Only
> -1 on error and 0 on success. Is this "< 0" doing anything important?
> Seems like it would be easiest to do
>
> 	if (offset_to_pack_pos(p, ofs, &pos))
>
> [snip]

Either would work, of course. I tend to find the '< 0' form easier to
read, but I may be in the minority there. For me, the negative return
value makes clear that the function encountered an error.

A secondary benefit is that if the function ever were to return a
positive value that _didn't_ indicate an error, we would already be
protected against it. That is probably a pretty weak argument, though,
since any such refactoring would probably require the callers to change,
too.

Anyway, that's all to say that I'm happy to leave it as-is, but I'm
equally happy to change it, too.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 16/20] builtin/gc.c: guess the size of the revindex
  2021-01-11 11:52   ` Derrick Stolee
@ 2021-01-11 16:23     ` Taylor Blau
  2021-01-11 17:09       ` Derrick Stolee
  0 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-11 16:23 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, peff, jrnieder

On Mon, Jan 11, 2021 at 06:52:24AM -0500, Derrick Stolee wrote:
> This is so far the only not-completely-obvious change.
>
> > But, this is an approximation anyway, and it does remove a use of the
> > 'struct revindex_entry' from outside of pack-revindex internals.
>
> And this might be enough justification for it, but...
>
> > -	heap += sizeof(struct revindex_entry) * nr_objects;
> > +	heap += (sizeof(off_t) + sizeof(uint32_t)) * nr_objects;
>
> ...outside of the estimation change, will this need another change
> when the rev-index is mmap'd? Should this instead be an API call,
> such as estimate_rev_index_memory(nr_objects)? That would
> centralize the estimate to be next to the code that currently
> interacts with 'struct revindex_entry' and will later interact with
> the mmap region.

I definitely did consider this, and it seems that I made a mistake in
not documenting my consideration (since I assumed that it was so benign
nobody would notice / care ;-)).

The reason I didn't pursue it here was that we haven't yet loaded the
reverse index by this point. So, you'd want a function that at least
stats the '*.rev' file (and either does or doesn't parse it [1]), or
aborts early to indicate otherwise.

One would hope that 'load_pack_revindex()' would do just that, but it
falls back to load a reverse index in memory, which involves exactly the
slow sort that we're trying to avoid. (Of course, we're going to have to
do it later anyway, but allocating many GB of heap just to provide an
estimation seems ill-advised to me ;-).)

So, we'd have to expand the API in some way or another, and to me it
didn't seem worth it. As I mentioned in the commit message, I'm
skeptical of the value of being accurate here, since this is (after all)
an estimation.

Perhaps a longer response than you were bargaining for, but... :-).

Thanks,
Taylor

[1]: Likely negligible, since all "parsing" really does is verify the
internal checksum, and then assign a pointer into it.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 18/20] pack-revindex: remove unused 'find_revindex_position()'
  2021-01-11 11:57   ` Derrick Stolee
@ 2021-01-11 16:27     ` Taylor Blau
  2021-01-11 17:11       ` Derrick Stolee
  0 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-11 16:27 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, peff, jrnieder

On Mon, Jan 11, 2021 at 06:57:00AM -0500, Derrick Stolee wrote:
> On 1/8/2021 1:17 PM, Taylor Blau wrote:
> > -int find_revindex_position(struct packed_git *p, off_t ofs)
> > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> >  {
> >  	int lo = 0;
> > -	int hi = p->num_objects + 1;
> > -	const struct revindex_entry *revindex = p->revindex;
> > +	int hi;
> > +	const struct revindex_entry *revindex;
> > +
> > +	if (load_pack_revindex(p) < 0)
> > +		return -1;
> > +
> > +	hi = p->num_objects + 1;
> > +	revindex = p->revindex;
> >
> >  	do {
> >  		const unsigned mi = lo + (hi - lo) / 2;
> >  		if (revindex[mi].offset == ofs) {
> > -			return mi;
> > +			*pos = mi;
> > +			return 0;
> >  		} else if (ofs < revindex[mi].offset)
> >  			hi = mi;
> >  		else
> > @@ -189,20 +196,6 @@ int find_revindex_position(struct packed_git *p, off_t ofs)
> >  	return -1;
> >  }
> >
> > -int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> > -{
> > -	int ret;
> > -
> > -	if (load_pack_revindex(p) < 0)
> > -		return -1;
> > -
> > -	ret = find_revindex_position(p, ofs);
> > -	if (ret < 0)
> > -		return -1;
> > -	*pos = ret;
> > -	return 0;
> > -}
> > -
>
> Not that this is new to the current patch, but this patch made me
> wonder if we should initialize *pos = -1 in the case of a failure
> to find the position? A correct caller should not use the value
> if they are checking for the fail-to-find case properly. But, I
> could see someone making a mistake and having trouble diagnosing
> the problem because their position variable was initialized to
> zero or a previous successful case.

*pos = -1 may be more confusing than clarifying since pos is unsigned.

It would be nice if there was a clear signal beyond returning a negative
value. I guess you could take a double pointer here which would allow
you to assign NULL, but that feels rather cumbersome as a means to catch
callers who failed to check the return value.

It does raise the argument of whether or not we should allow the
program to continue at all if 'ret < 0' (i.e., 'offset_to_pack_pos()'
either 'die()'s or returns a usable uint32_t), but I'm OK with the
current behavior.

> Thanks,
> -Stolee

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-11 12:07 ` [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Derrick Stolee
@ 2021-01-11 16:30   ` Taylor Blau
  2021-01-11 17:15     ` Derrick Stolee
  0 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-11 16:30 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, peff, jrnieder

On Mon, Jan 11, 2021 at 07:07:17AM -0500, Derrick Stolee wrote:
> My comments on this series are very minor.
>
> I made only one comment about "if (method() < 0)" versus
> "if (method())" but that pattern appears in multiple patches.
> _If_ you decide to change that pattern, then I'm sure you can
> find all uses.

I have no strong opinion here, so I'm happy to defer to your or others'
judgement. My very weak opinion is that I'd just as soon leave it as-is,
but that if I'm rerolling and others would like to see it changed, then
I'm happy to do it.

> Reviewed-by: Derrick Stolee <dstolee@microsoft.com>

Thank you for your review. I think that I owe you some as well,
somewhere in the near-2,000 emails that I still have :-/.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 16/20] builtin/gc.c: guess the size of the revindex
  2021-01-11 16:23     ` Taylor Blau
@ 2021-01-11 17:09       ` Derrick Stolee
  2021-01-12  9:28         ` Jeff King
  0 siblings, 1 reply; 121+ messages in thread
From: Derrick Stolee @ 2021-01-11 17:09 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, jrnieder

On 1/11/2021 11:23 AM, Taylor Blau wrote:
> On Mon, Jan 11, 2021 at 06:52:24AM -0500, Derrick Stolee wrote:
>> This is so far the only not-completely-obvious change.
>>
>>> But, this is an approximation anyway, and it does remove a use of the
>>> 'struct revindex_entry' from outside of pack-revindex internals.
>>
>> And this might be enough justification for it, but...
>>
>>> -	heap += sizeof(struct revindex_entry) * nr_objects;
>>> +	heap += (sizeof(off_t) + sizeof(uint32_t)) * nr_objects;
>>
>> ...outside of the estimation change, will this need another change
>> when the rev-index is mmap'd? Should this instead be an API call,
>> such as estimate_rev_index_memory(nr_objects)? That would
>> centralize the estimate to be next to the code that currently
>> interacts with 'struct revindex_entry' and will later interact with
>> the mmap region.
> 
> I definitely did consider this, and it seems that I made a mistake in
> not documenting my consideration (since I assumed that it was so benign
> nobody would notice / care ;-)).
> 
> The reason I didn't pursue it here was that we haven't yet loaded the
> reverse index by this point. So, you'd want a function that at least
> stats the '*.rev' file (and either does or doesn't parse it [1]), or
> aborts early to indicate otherwise.

In this patch, I would expect it to use sizeof(struct revindex_entry).
Later, the method would know if a .rev file exists and do the right
thing instead. (Also, should mmap'd data count towards this estimate?)

> One would hope that 'load_pack_revindex()' would do just that, but it
> falls back to load a reverse index in memory, which involves exactly the
> slow sort that we're trying to avoid. (Of course, we're going to have to
> do it later anyway, but allocating many GB of heap just to provide an
> estimation seems ill-advised to me ;-).)
> 
> So, we'd have to expand the API in some way or another, and to me it
> didn't seem worth it. As I mentioned in the commit message, I'm
> skeptical of the value of being accurate here, since this is (after all)
> an estimation.

Yes, I'm probably just poking somewhere it was easy to poke. This is
probably not worth the time I'm spending asking about it.

Feel free to disregard.

-Stolee

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 18/20] pack-revindex: remove unused 'find_revindex_position()'
  2021-01-11 16:27     ` Taylor Blau
@ 2021-01-11 17:11       ` Derrick Stolee
  0 siblings, 0 replies; 121+ messages in thread
From: Derrick Stolee @ 2021-01-11 17:11 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, jrnieder

On 1/11/2021 11:27 AM, Taylor Blau wrote:
> On Mon, Jan 11, 2021 at 06:57:00AM -0500, Derrick Stolee wrote:
>> Not that this is new to the current patch, but this patch made me
>> wonder if we should initialize *pos = -1 in the case of a failure
>> to find the position? A correct caller should not use the value
>> if they are checking for the fail-to-find case properly. But, I
>> could see someone making a mistake and having trouble diagnosing
>> the problem because their position variable was initialized to
>> zero or a previous successful case.
> 
> *pos = -1 may be more confusing than clarifying since pos is unsigned.

RIGHT. My bad.

> It would be nice if there was a clear signal beyond returning a negative
> value. I guess you could take a double pointer here which would allow
> you to assign NULL, but that feels rather cumbersome as a means to catch
> callers who failed to check the return value.
> 
> It does raise the argument of whether or not we should allow the
> program to continue at all if 'ret < 0' (i.e., 'offset_to_pack_pos()'
> either 'die()'s or returns a usable uint32_t), but I'm OK with the
> current behavior.

I was thinking "*pos = -1" was a free way to "help" a developer
who uses the API incorrectly, but it's _not_ free. Ignore me.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-11 16:30   ` Taylor Blau
@ 2021-01-11 17:15     ` Derrick Stolee
  2021-01-11 17:29       ` Taylor Blau
  2021-01-11 18:40       ` Junio C Hamano
  0 siblings, 2 replies; 121+ messages in thread
From: Derrick Stolee @ 2021-01-11 17:15 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, jrnieder

On 1/11/2021 11:30 AM, Taylor Blau wrote:
> On Mon, Jan 11, 2021 at 07:07:17AM -0500, Derrick Stolee wrote:
>> My comments on this series are very minor.
>>
>> I made only one comment about "if (method() < 0)" versus
>> "if (method())" but that pattern appears in multiple patches.
>> _If_ you decide to change that pattern, then I'm sure you can
>> find all uses.
> 
> I have no strong opinion here, so I'm happy to defer to your or others'
> judgement. My very weak opinion is that I'd just as soon leave it as-is,
> but that if I'm rerolling and others would like to see it changed, then
> I'm happy to do it.

Well, I found 782 instances of ") < 0)" in the codebase, and my initial
scan of these shows they are doing exactly what you are asking. So as
far as code style goes, there is plenty of precedent.

The thing that makes me react to this is that it _looks_ like an extra
comparison. However, I'm sure the assembly instructions have the same
performance characteristics between "!= 0" and "< 0".

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-11 17:15     ` Derrick Stolee
@ 2021-01-11 17:29       ` Taylor Blau
  2021-01-11 18:40       ` Junio C Hamano
  1 sibling, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-11 17:29 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, peff, jrnieder

On Mon, Jan 11, 2021 at 12:15:58PM -0500, Derrick Stolee wrote:
> On 1/11/2021 11:30 AM, Taylor Blau wrote:
> > On Mon, Jan 11, 2021 at 07:07:17AM -0500, Derrick Stolee wrote:
> >> My comments on this series are very minor.
> >>
> >> I made only one comment about "if (method() < 0)" versus
> >> "if (method())" but that pattern appears in multiple patches.
> >> _If_ you decide to change that pattern, then I'm sure you can
> >> find all uses.
> >
> > I have no strong opinion here, so I'm happy to defer to your or others'
> > judgement. My very weak opinion is that I'd just as soon leave it as-is,
> > but that if I'm rerolling and others would like to see it changed, then
> > I'm happy to do it.
>
> Well, I found 782 instances of ") < 0)" in the codebase, and my initial
> scan of these shows they are doing exactly what you are asking. So as
> far as code style goes, there is plenty of precedent.

Thanks for looking, I was curious about that myself after our thread,
but I hadn't yet bothered to look.

> The thing that makes me react to this is that it _looks_ like an extra
> comparison. However, I'm sure the assembly instructions have the same
> performance characteristics between "!= 0" and "< 0".

It should make no difference. Both comparisons will do a 'cmp $0 ...'
where '...' is probably an indirect into the current frame. The '!= 0'
will use je, and the '< 0' comparison will use 'jns'. Both conditional
jumps should be implemented by checking a CPU flag only (ZF and SF,
respectively).

Not that any of this matters, it's just fun to look.

> Thanks,
> -Stolee
>
Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-11 17:15     ` Derrick Stolee
  2021-01-11 17:29       ` Taylor Blau
@ 2021-01-11 18:40       ` Junio C Hamano
  1 sibling, 0 replies; 121+ messages in thread
From: Junio C Hamano @ 2021-01-11 18:40 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, peff, jrnieder

Derrick Stolee <stolee@gmail.com> writes:

> Well, I found 782 instances of ") < 0)" in the codebase, and my initial
> scan of these shows they are doing exactly what you are asking. So as
> far as code style goes, there is plenty of precedent.
>
> The thing that makes me react to this is that it _looks_ like an extra
> comparison. However, I'm sure the assembly instructions have the same
> performance characteristics between "!= 0" and "< 0".

I'd prefer to keep it that way for the human cost's point of view.

Perhaps it could be subjective but

	if (func_that_signals_error_with_return_value() < 0)

is immediately recognizable as checking for an error to folks who
were trained to write C in POSIX environment, as "on error, return
negative" is a convention that they are familiar with.  At least to
me, your "if (func_that_signals_error_with_return_value())" looks
unnatural and makes me look at the function to see what its return
value means.

If there are helper functions that use "non-zero is an error and
zero is success" convention, we should look at them to see why they
do not do the usual "a negative is an error and a non-negative is
success".  And if the *only* reason they do so is because their
normal return do not have to give more than one kind of "success",
we should see if we can fix them to follow the usual "a negative is
an error" convention, I would think.

Thanks.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 01/20] pack-revindex: introduce a new API
  2021-01-08 18:16 ` [PATCH 01/20] pack-revindex: introduce a new API Taylor Blau
@ 2021-01-12  8:41   ` Jeff King
  2021-01-12  9:41     ` Jeff King
  2021-01-12 16:27     ` Taylor Blau
  2021-01-13  8:06   ` Junio C Hamano
  1 sibling, 2 replies; 121+ messages in thread
From: Jeff King @ 2021-01-12  8:41 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Fri, Jan 08, 2021 at 01:16:43PM -0500, Taylor Blau wrote:

> In the next several patches, we will prepare for loading a reverse index
> either in memory, or from a yet-to-be-introduced on-disk format. To do
> that, we'll introduce an API that avoids the caller explicitly indexing
> the revindex pointer in the packed_git structure.

This API looks good to me. Here are a few extra thoughts:

> There are four ways to interact with the reverse index. Accordingly,
> four functions will be exported from 'pack-revindex.h' by the time that
> the existing API is removed. A caller may:

This tells us what the new API functions do. That's useful, but should
it be in the header file itself, documenting each function?

Likewise, I think we'd want to define the concepts in that
documentation. Something like:


  /*
   * A revindex allows converting efficiently between three properties
   * of an object within a pack:
   *
   *  - index position: the numeric position within the list of
   *    sorted object ids found in the .idx file
   *
   *  - pack position: the numeric position within the list of objects
   *    in their order within the actual .pack file (i.e., 0 is the
   *    first object in the .pack, 1 is the second, and so on)
   *
   *  - offset: the byte offset within the .pack file at which the
   *    object contents can be found
   */

And then above each function we can just say that it converts X to Y
(like you have in the commit message). It may also be worth indicating
the run-time of each (some of them are constant-time once you have a
revindex, and some are log(n)).

> +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)

The types here make sense. off_t is clearly needed for a pack offset,
and uint32_t is correct for the position fields, because packs have a
4-byte object count.

Separating the error return from the out-parameter makes the interface
slightly more awkward, but is needed to use the properly-sized types.
Makes sense.

> +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> +{
> +	int ret;
> +
> +	if (load_pack_revindex(p) < 0)
> +		return -1;

This one lazy-loads the revindex for us, which seems handy...

> +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
> +{
> +	if (!p->revindex)
> +		BUG("pack_pos_to_index: reverse index not yet loaded");
> +	if (pos >= p->num_objects)
> +		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
> +	return p->revindex[pos].nr;
> +}

But these ones don't. I'm glad we at least catch it with a BUG(), but it
makes the API a little funny. Returning an error here would require a
similarly awkward out-parameter, I guess.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 02/20] write_reuse_object(): convert to new revindex API
  2021-01-08 18:16 ` [PATCH 02/20] write_reuse_object(): convert to new revindex API Taylor Blau
@ 2021-01-12  8:47   ` Jeff King
  2021-01-12 16:31     ` Taylor Blau
  0 siblings, 1 reply; 121+ messages in thread
From: Jeff King @ 2021-01-12  8:47 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Fri, Jan 08, 2021 at 01:16:48PM -0500, Taylor Blau wrote:

> First replace 'find_pack_revindex()' with its replacement
> 'offset_to_pack_pos()'. This prevents any bogus OFS_DELTA that may make
> its way through until 'write_reuse_object()' from causing a bad memory
> read (if 'revidx' is 'NULL')

Nice. Better abstraction, and we're catching more errors.

> @@ -436,10 +436,13 @@ static off_t write_reuse_object(struct hashfile *f, struct object_entry *entry,
>  					      type, entry_size);
>  
>  	offset = entry->in_pack_offset;
> -	revidx = find_pack_revindex(p, offset);
> -	datalen = revidx[1].offset - offset;
> +	if (offset_to_pack_pos(p, offset, &pos) < 0)
> +		die(_("write_reuse_object: could not locate %s"),
> +		    oid_to_hex(&entry->idx.oid));

If we believe the offset is bogus, should we print that in the error
message, too? Something like:

  die("could not locate %s, expected at offset %"PRIuMAX" in pack %s",
      oid_to_hex(&entry->idx.oid), (uintmax_t)offset, p->pack_name);

> +	datalen = pack_pos_to_offset(p, pos + 1) - offset;

This "pos + 1" means we may be looking one past the end of the array.
That's OK (at least for now), because our revindex always puts in an
extra dummy value exactly for computing these kinds of byte-distances.
That might be worth documenting in the API header.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 03/20] write_reused_pack_one(): convert to new revindex API
  2021-01-08 18:16 ` [PATCH 03/20] write_reused_pack_one(): " Taylor Blau
@ 2021-01-12  8:50   ` Jeff King
  2021-01-12 16:34     ` Taylor Blau
  0 siblings, 1 reply; 121+ messages in thread
From: Jeff King @ 2021-01-12  8:50 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Fri, Jan 08, 2021 at 01:16:53PM -0500, Taylor Blau wrote:

> -	offset = reuse_packfile->revindex[pos].offset;
> -	next = reuse_packfile->revindex[pos + 1].offset;
> +	offset = pack_pos_to_offset(reuse_packfile, pos);
> +	next = pack_pos_to_offset(reuse_packfile, pos + 1);

Makes sense.

> @@ -887,11 +887,15 @@ static void write_reused_pack_one(size_t pos, struct hashfile *out,
>  
>  		/* Convert to REF_DELTA if we must... */
>  		if (!allow_ofs_delta) {
> -			int base_pos = find_revindex_position(reuse_packfile, base_offset);
> +			uint32_t base_pos;
>  			struct object_id base_oid;
>  
> +			if (offset_to_pack_pos(reuse_packfile, base_offset, &base_pos) < 0)
> +				die(_("expected object at offset %"PRIuMAX),
> +				    (uintmax_t)base_offset);

This error does mention the offset, which is good. But not the pack name
(nor the object name, but we don't have it!).

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 04/20] write_reused_pack_verbatim(): convert to new revindex API
  2021-01-08 18:16 ` [PATCH 04/20] write_reused_pack_verbatim(): " Taylor Blau
@ 2021-01-12  8:50   ` Jeff King
  0 siblings, 0 replies; 121+ messages in thread
From: Jeff King @ 2021-01-12  8:50 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Fri, Jan 08, 2021 at 01:16:56PM -0500, Taylor Blau wrote:

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index ea7df9270f..4341bc27b4 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -948,7 +948,7 @@ static size_t write_reused_pack_verbatim(struct hashfile *out,
>  		off_t to_write;
>  
>  		written = (pos * BITS_IN_EWORD);
> -		to_write = reuse_packfile->revindex[written].offset
> +		to_write = pack_pos_to_offset(reuse_packfile, written)
>  			- sizeof(struct pack_header);

This one is obviously correct.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 05/20] check_object(): convert to new revindex API
  2021-01-08 18:17 ` [PATCH 05/20] check_object(): " Taylor Blau
  2021-01-11 11:43   ` Derrick Stolee
@ 2021-01-12  8:51   ` Jeff King
  1 sibling, 0 replies; 121+ messages in thread
From: Jeff King @ 2021-01-12  8:51 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Fri, Jan 08, 2021 at 01:17:00PM -0500, Taylor Blau wrote:

> Replace direct accesses to the revindex with calls to
> 'offset_to_pack_pos()' and 'pack_pos_to_index()'.
> 
> Since this caller already had some error checking (it can jump to the
> 'give_up' label if it encounters an error), we can easily check whether
> or not the provided offset points to an object in the given pack. This
> error checking existed prior to this patch, too, since the caller checks
> whether the return value from 'find_pack_revindex()' was NULL or not.

Yay. Happy again to see things getting more robust.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 05/20] check_object(): convert to new revindex API
  2021-01-11 16:15     ` Taylor Blau
@ 2021-01-12  8:54       ` Jeff King
  0 siblings, 0 replies; 121+ messages in thread
From: Jeff King @ 2021-01-12  8:54 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Derrick Stolee, git, jrnieder

On Mon, Jan 11, 2021 at 11:15:49AM -0500, Taylor Blau wrote:

> On Mon, Jan 11, 2021 at 06:43:23AM -0500, Derrick Stolee wrote:
> > > @@ -1813,11 +1813,11 @@ static void check_object(struct object_entry *entry, uint32_t object_index)
> > >  				goto give_up;
> > >  			}
> > >  			if (reuse_delta && !entry->preferred_base) {
> > > -				struct revindex_entry *revidx;
> > > -				revidx = find_pack_revindex(p, ofs);
> > > -				if (!revidx)
> > > +				uint32_t pos;
> > > +				if (offset_to_pack_pos(p, ofs, &pos) < 0)
> >
> > The current implementation does not return a positive value. Only
> > -1 on error and 0 on success. Is this "< 0" doing anything important?
> > Seems like it would be easiest to do
> >
> > 	if (offset_to_pack_pos(p, ofs, &pos))
> >
> > [snip]
> 
> Either would work, of course. I tend to find the '< 0' form easier to
> read, but I may be in the minority there. For me, the negative return
> value makes clear that the function encountered an error.

I'll throw in my opinion that "< 0" to me much more clearly signals "did
an error occur". And that same form can be used consistently with
functions which _do_ have a positive return value on success, too. So I
prefer it for readability.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 07/20] show_objects_for_type(): convert to new revindex API
  2021-01-08 18:17 ` [PATCH 07/20] show_objects_for_type(): " Taylor Blau
@ 2021-01-12  8:57   ` Jeff King
  2021-01-12 16:35     ` Taylor Blau
  0 siblings, 1 reply; 121+ messages in thread
From: Jeff King @ 2021-01-12  8:57 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Fri, Jan 08, 2021 at 01:17:09PM -0500, Taylor Blau wrote:

> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index d6861ddd4d..80c57bde73 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -711,21 +711,22 @@ static void show_objects_for_type(
>  
>  		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
>  			struct object_id oid;
> -			struct revindex_entry *entry;
> -			uint32_t hash = 0;
> +			uint32_t hash = 0, n;
> +			off_t ofs;

A minor nit, but "n" isn't very descriptive. It's not in scope for very
long, so that's not too bad, but there are two positions at work in this
function: the pos/offset bit position, and the index position. Maybe
"index_pos" would be better than "n" to keep the two clear?

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 09/20] try_partial_reuse(): convert to new revindex API
  2021-01-08 18:17 ` [PATCH 09/20] try_partial_reuse(): " Taylor Blau
@ 2021-01-12  9:06   ` Jeff King
  2021-01-12 16:47     ` Taylor Blau
  0 siblings, 1 reply; 121+ messages in thread
From: Jeff King @ 2021-01-12  9:06 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Fri, Jan 08, 2021 at 01:17:17PM -0500, Taylor Blau wrote:

> Remove another instance of direct revindex manipulation by calling
> 'pack_pos_to_offset()' instead (the caller here does not care about the
> index position of the object at position 'pos').
> 
> Somewhat confusingly, the subsequent call to unpack_object_header()
> takes a pointer to &offset and then updates it with a new value. But,
> try_partial_reuse() cares about the offset of both the base's header and
> contents. The existing code made a copy of the offset field, and only
> addresses and manipulates one of them.
> 
> Instead, store the return of pack_pos_to_offset twice: once in header
> and another in offset. Header will be left untouched, but offset will be
> addressed and modified by unpack_object_header().

I had to read these second two paragraphs a few times to parse them.
Really we are just replacing revidx->offset with "header", and "offset"
retains its same role within the function.

So it's definitely doing the right thing, but it makes more sense to me
as:

  Note that we cannot just use the existing "offset" variable to store
  the value we get from pack_pos_to_offset(). It is incremented by
  unpack_object_header(), but we later need the original value. Since
  we'll no longer have revindex->offset to read it from, we'll store
  that in a separate variable ("header" since it points to the entry's
  header bytes).

Another option would be to just call pack_pos_to_offset() again for the
later call. Like the code it's replacing, it's constant-time anyway. But
I think the "header" variable actually makes things more readable.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 13/20] packed_object_info(): convert to new revindex API
  2021-01-08 18:17 ` [PATCH 13/20] packed_object_info(): " Taylor Blau
@ 2021-01-12  9:11   ` Jeff King
  2021-01-12 16:51     ` Taylor Blau
  0 siblings, 1 reply; 121+ messages in thread
From: Jeff King @ 2021-01-12  9:11 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Fri, Jan 08, 2021 at 01:17:34PM -0500, Taylor Blau wrote:

> Convert another call of 'find_pack_revindex()' to its replacement
> 'pack_pos_to_offset()'. Likewise:
> 
>   - Avoid manipulating `struct packed_git`'s `revindex` pointer directly
>     by removing the pointer-as-array indexing.

Good.

>   - Add an additional guard to check that the offset 'obj_offset()'
>     points to a real object. This should be the case with well-behaved
>     callers to 'packed_object_info()', but isn't guarenteed.
>
>     Other blocks that fill in various other values from the 'struct
>     object_info' request handle bad inputs by setting the type to
>     'OBJ_BAD' and jumping to 'out'. Do the same when given a bad offset
>     here.

Also good. I wonder if we need to call error() here, too. The caller
will probably say something like "bad object" or whatever, but the user
will have no clue that it's related to the revindex.

That would match other parts of the function (e.g., calling into
unpack_entry() can generate lots of descriptive errors about exactly
what went wrong).

>     The previous code would have segfaulted when given a bad
>     'obj_offset' value, since 'find_pack_revindex()' would return
>     'NULL', and then the line that fills 'oi->disk_sizep' would try to
>     access 'NULL[1]' with a stride of 16 bytes (the width of 'struct
>     revindex_entry)'.

Yep. Again, I'm really happy to see these "should never happen" cases
converted to real errors or even BUG()s.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 14/20] unpack_entry(): convert to new revindex API
  2021-01-08 18:17 ` [PATCH 14/20] unpack_entry(): " Taylor Blau
@ 2021-01-12  9:22   ` Jeff King
  2021-01-12 16:56     ` Taylor Blau
  0 siblings, 1 reply; 121+ messages in thread
From: Jeff King @ 2021-01-12  9:22 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Fri, Jan 08, 2021 at 01:17:39PM -0500, Taylor Blau wrote:

> diff --git a/packfile.c b/packfile.c
> index 469c8d4f57..34467ea4a3 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -1692,11 +1692,19 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
>  		}
>  
>  		if (do_check_packed_object_crc && p->index_version > 1) {
> -			struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
> -			off_t len = revidx[1].offset - obj_offset;
> -			if (check_pack_crc(p, &w_curs, obj_offset, len, revidx->nr)) {
> +			uint32_t pos, nr;

We have "pos" and "nr". What's the difference? :)

I think pack_pos and index_pos might be harder to get confused.

> +			off_t len;
> +
> +			if (offset_to_pack_pos(p, obj_offset, &pos) < 0) {
> +				data = NULL;
> +				goto out;
> +			}

Nice to see the error check here. As with the previous commit, we
probably want to error(), just as we would for errors below.

Do we also need to call mark_bad_packed_object()? I guess we can't,
because we only have the offset, and not the oid (the code below uses
nth_packed_object_id(), but it is relying on the revindex, which we know
just failed to work).

I'm just wondering if an error here is going to put us into an infinite
loop of retrying the lookup in the same pack over and over. Let's
see...our caller is ultimately packed_object_info(), but it too does not
have the oid. It returns an error up to do_oid_object_info_extended().
Which yes, does mark_bad_packed_object() itself. Good. So I think we are
fine, and arguably these lower-level calls to mark_bad_packed_object()
are not necessary. But they do not hurt either.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 16/20] builtin/gc.c: guess the size of the revindex
  2021-01-11 17:09       ` Derrick Stolee
@ 2021-01-12  9:28         ` Jeff King
  0 siblings, 0 replies; 121+ messages in thread
From: Jeff King @ 2021-01-12  9:28 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, jrnieder

On Mon, Jan 11, 2021 at 12:09:27PM -0500, Derrick Stolee wrote:

> > The reason I didn't pursue it here was that we haven't yet loaded the
> > reverse index by this point. So, you'd want a function that at least
> > stats the '*.rev' file (and either does or doesn't parse it [1]), or
> > aborts early to indicate otherwise.
> 
> In this patch, I would expect it to use sizeof(struct revindex_entry).
> Later, the method would know if a .rev file exists and do the right
> thing instead. (Also, should mmap'd data count towards this estimate?)

Yeah, I think if we care about memory pressure, then the mmap would
count anyway. I agree that letting the revindex code decide which to use
would be the most accurate thing, but given that this whole chunk of
code is an estimate (that does not even seem to take into account the
memory used for the delta search!), I don't think it's worth trying to
get to accurate.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 18/20] pack-revindex: remove unused 'find_revindex_position()'
  2021-01-08 18:17 ` [PATCH 18/20] pack-revindex: remove unused 'find_revindex_position()' Taylor Blau
  2021-01-11 11:57   ` Derrick Stolee
@ 2021-01-12  9:32   ` Jeff King
  2021-01-12 16:59     ` Taylor Blau
  1 sibling, 1 reply; 121+ messages in thread
From: Jeff King @ 2021-01-12  9:32 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Fri, Jan 08, 2021 at 01:17:57PM -0500, Taylor Blau wrote:

> Now that all 'find_revindex_position()' callers have been removed (and
> converted to the more descriptive 'offset_to_pack_pos()'), it is almost
> safe to get rid of 'find_revindex_position()' entirely. Almost, except
> for the fact that 'offset_to_pack_pos()' calls
> 'find_revindex_position()'.
> 
> Inline 'find_revindex_position()' into 'offset_to_pack_pos()', and
> then remove 'find_revindex_position()' entirely.

Sounds good.

> This is a straightforward refactoring with one minor snag.
> 'offset_to_pack_pos()' used to load the index before calling
> 'find_revindex_position()'. That means that by the time
> 'find_revindex_position()' starts executing, 'p->num_objects' can be
> safely read. After inlining, be careful to not read 'p->num_objects'
> until _after_ 'load_pack_revindex()' (which loads the index as a
> side-effect) has been called.

Good catch. We might want to drop the initialization of "lo":

>  	int lo = 0;
> -	int hi = p->num_objects + 1;

down to here:

> +	hi = p->num_objects + 1;

to maintain symmetry (though it's quite a minor point).

I notice these are signed ints, but we've taken care to use uint32_t
elsewhere for positions. Shouldn't these be uint32_t, also (or at least
unsigned)?

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 19/20] pack-revindex: hide the definition of 'revindex_entry'
  2021-01-08 18:18 ` [PATCH 19/20] pack-revindex: hide the definition of 'revindex_entry' Taylor Blau
  2021-01-11 11:57   ` Derrick Stolee
@ 2021-01-12  9:34   ` Jeff King
  1 sibling, 0 replies; 121+ messages in thread
From: Jeff King @ 2021-01-12  9:34 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Fri, Jan 08, 2021 at 01:18:01PM -0500, Taylor Blau wrote:

> Now that all spots outside of pack-revindex.c that reference 'struct
> revindex_entry' directly have been removed, it is safe to hide the
> implementation by moving it from pack-revindex.h to pack-revindex.c.

That was a lot of patches to get here, but this is a very nice outcome. :)

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()'
  2021-01-08 18:18 ` [PATCH 20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()' Taylor Blau
@ 2021-01-12  9:37   ` Jeff King
  2021-01-12 17:02     ` Taylor Blau
  0 siblings, 1 reply; 121+ messages in thread
From: Jeff King @ 2021-01-12  9:37 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Fri, Jan 08, 2021 at 01:18:06PM -0500, Taylor Blau wrote:

> To prepare for on-disk reverse indexes, remove a spot in
> 'offset_to_pack_pos()' that looks at the 'revindex' array in 'struct
> packed_git'.
> 
> Even though this use of the revindex pointer is within pack-revindex.c,
> this clean up is still worth doing. Since the 'revindex' pointer will be
> NULL when reading from an on-disk reverse index (instead the
> 'revindex_data' pointer will be mmaped to the 'pack-*.rev' file), this
> call-site would have to include a conditional to lookup the offset for
> position 'mi' each iteration through the search.
> 
> So instead of open-coding 'pack_pos_to_offset()', call it directly from
> within 'offset_to_pack_pos()'.

This definitely makes sense in the long run. I could take or leave it as
a final patch in _this_ series (as opposed to the first patch in a
subsequent series adding the rev files).

>  	do {
>  		const unsigned mi = lo + (hi - lo) / 2;
> -		if (revindex[mi].offset == ofs) {
> +		off_t got = pack_pos_to_offset(p, mi);


They're both constant-time, so performance should be the same big-O. The
function has extra BUG() checks. I doubt those are measurable in
practice, though.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 01/20] pack-revindex: introduce a new API
  2021-01-12  8:41   ` Jeff King
@ 2021-01-12  9:41     ` Jeff King
  2021-01-12 16:27     ` Taylor Blau
  1 sibling, 0 replies; 121+ messages in thread
From: Jeff King @ 2021-01-12  9:41 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Tue, Jan 12, 2021 at 03:41:28AM -0500, Jeff King wrote:

> > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> > +{
> > +	int ret;
> > +
> > +	if (load_pack_revindex(p) < 0)
> > +		return -1;
> 
> This one lazy-loads the revindex for us, which seems handy...
> 
> > +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
> > +{
> > +	if (!p->revindex)
> > +		BUG("pack_pos_to_index: reverse index not yet loaded");
> > +	if (pos >= p->num_objects)
> > +		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
> > +	return p->revindex[pos].nr;
> > +}
> 
> But these ones don't. I'm glad we at least catch it with a BUG(), but it
> makes the API a little funny. Returning an error here would require a
> similarly awkward out-parameter, I guess.

Having now looked at the callers through the series, I think adding an
error return to pack_pos_to_index() would be really awkward (since it
cannot currently fail).

We _could_ insist that callers of offset_to_pack_pos() also make sure
the revindex is loaded themselves. But it would be annoying and
error-prone to check the existing callers. So I'm OK with leaving this
asymmetry in the API.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (20 preceding siblings ...)
  2021-01-11 12:07 ` [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Derrick Stolee
@ 2021-01-12  9:45 ` Jeff King
  2021-01-12 17:07   ` Taylor Blau
  2021-01-13  0:23 ` Junio C Hamano
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
  23 siblings, 1 reply; 121+ messages in thread
From: Jeff King @ 2021-01-12  9:45 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Fri, Jan 08, 2021 at 01:16:39PM -0500, Taylor Blau wrote:

> Generating the reverse index in memory for repositories with large packs has two
> significant drawbacks:
> 
>   - It requires allocating sizeof(struct revindex_entry) per packed object.
> 
>   - It requires us to sort the entries by their pack offset. This is implemented
>     in sort_revindex() using a radix sort, but still takes considerable time (as
>     benchmarks found in the second series demonstrate).

Or thinking about it more fundamentally: any operation which touches the
revindex is now O(nr_objects_in_repo), even if it only cares about a few
objects. Ideally this will eventually be this O(log nr_objects_in_repo);
we can't do much better than that because of object lookups (unless we
replace the .idx with a perfect hash or something).

> The goal of this series is to remove direct access of the `struct
> revindex_entry` type, as well as `struct packed_git`'s `revindex` field. The
> on-disk format will be mmap'd and accessed directly, but the format is
> sufficiently different that the whole `revindex` array can't be written as-is.

It looks good overall to me. I left a few nits around documentation and
integer types that I think are worth a re-roll, but I think after
addressing those it should be good.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 01/20] pack-revindex: introduce a new API
  2021-01-12  8:41   ` Jeff King
  2021-01-12  9:41     ` Jeff King
@ 2021-01-12 16:27     ` Taylor Blau
  1 sibling, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-12 16:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, jrnieder

On Tue, Jan 12, 2021 at 03:41:28AM -0500, Jeff King wrote:
> > There are four ways to interact with the reverse index. Accordingly,
> > four functions will be exported from 'pack-revindex.h' by the time that
> > the existing API is removed. A caller may:
>
> This tells us what the new API functions do. That's useful, but should
> it be in the header file itself, documenting each function?

Mm, that's a good idea. I took your suggestion for a comment at the top
of pack-revindex.h directly, and then added some of my own commentary
above each function. I avoided documenting the functions we're about to
remove for obvious reasons.

I think that the commit message is OK as-is, since it provides more of a
rationale of what operations need to exist, rather than the specifics of
each implementation.

We'll have to update these again when the on-disk format exists (i.e.,
because all of the runtimes become constant with the exception of going
to the index), but that's a topic for another series.

> > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
>
> The types here make sense. off_t is clearly needed for a pack offset,
> and uint32_t is correct for the position fields, because packs have a
> 4-byte object count.
>
> Separating the error return from the out-parameter makes the interface
> slightly more awkward, but is needed to use the properly-sized types.
> Makes sense.

Yep, exactly.

> > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> > +{
> > +	int ret;
> > +
> > +	if (load_pack_revindex(p) < 0)
> > +		return -1;
>
> This one lazy-loads the revindex for us, which seems handy...
>
> > +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
> > +{
> > +	if (!p->revindex)
> > +		BUG("pack_pos_to_index: reverse index not yet loaded");
> > +	if (pos >= p->num_objects)
> > +		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
> > +	return p->revindex[pos].nr;
> > +}
>
> But these ones don't. I'm glad we at least catch it with a BUG(), but it
> makes the API a little funny. Returning an error here would require a
> similarly awkward out-parameter, I guess.

It is awkward, but (as you note downthread) the callers of
pack_pos_to_index() make it difficult to do so, so I didn't pursue it
here.

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 02/20] write_reuse_object(): convert to new revindex API
  2021-01-12  8:47   ` Jeff King
@ 2021-01-12 16:31     ` Taylor Blau
  2021-01-13 13:02       ` Jeff King
  0 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-12 16:31 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, jrnieder

On Tue, Jan 12, 2021 at 03:47:47AM -0500, Jeff King wrote:
> > @@ -436,10 +436,13 @@ static off_t write_reuse_object(struct hashfile *f, struct object_entry *entry,
> >  					      type, entry_size);
> >
> >  	offset = entry->in_pack_offset;
> > -	revidx = find_pack_revindex(p, offset);
> > -	datalen = revidx[1].offset - offset;
> > +	if (offset_to_pack_pos(p, offset, &pos) < 0)
> > +		die(_("write_reuse_object: could not locate %s"),
> > +		    oid_to_hex(&entry->idx.oid));
>
> If we believe the offset is bogus, should we print that in the error
> message, too? Something like:
>
>   die("could not locate %s, expected at offset %"PRIuMAX" in pack %s",
>       oid_to_hex(&entry->idx.oid), (uintmax_t)offset, p->pack_name);

Good idea, thanks.

> > +	datalen = pack_pos_to_offset(p, pos + 1) - offset;
>
> This "pos + 1" means we may be looking one past the end of the array.
> That's OK (at least for now), because our revindex always puts in an
> extra dummy value exactly for computing these kinds of byte-distances.
> That might be worth documenting in the API header.

Yeah, I made sure to document that when I was touching up the last
patch. FWIW, that's a behavior that we're going to carry over even when
the reverse index is stored on-disk (not by writing four extra bytes
into the .rev file, but by handling queries for pos == p->num_objects
separately.)

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 03/20] write_reused_pack_one(): convert to new revindex API
  2021-01-12  8:50   ` Jeff King
@ 2021-01-12 16:34     ` Taylor Blau
  0 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-12 16:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, jrnieder

On Tue, Jan 12, 2021 at 03:50:05AM -0500, Jeff King wrote:
> > @@ -887,11 +887,15 @@ static void write_reused_pack_one(size_t pos, struct hashfile *out,
> >
> >  		/* Convert to REF_DELTA if we must... */
> >  		if (!allow_ofs_delta) {
> > -			int base_pos = find_revindex_position(reuse_packfile, base_offset);
> > +			uint32_t base_pos;
> >  			struct object_id base_oid;
> >
> > +			if (offset_to_pack_pos(reuse_packfile, base_offset, &base_pos) < 0)
> > +				die(_("expected object at offset %"PRIuMAX),
> > +				    (uintmax_t)base_offset);
>
> This error does mention the offset, which is good. But not the pack name
> (nor the object name, but we don't have it!).

Indeed we don't have the object name, but the pack is reuse_packfile
(which is statistically initialized), so we could do something like:

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 10a16ced1e..8e40b19ee8 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -893,8 +893,10 @@ static void write_reused_pack_one(size_t pos, struct hashfile *out,
 			struct object_id base_oid;

 			if (offset_to_pack_pos(reuse_packfile, base_offset, &base_pos) < 0)
-				die(_("expected object at offset %"PRIuMAX),
-				    (uintmax_t)base_offset);
+				die(_("expected object at offset %"PRIuMAX" "
+				      "in pack %s"),
+				    (uintmax_t)base_offset,
+				    reuse_packfile->pack_name);

 			nth_packed_object_id(&base_oid, reuse_packfile,
 					     pack_pos_to_index(reuse_packfile, base_pos));

Which I think would be clearer.

> -Peff

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [PATCH 07/20] show_objects_for_type(): convert to new revindex API
  2021-01-12  8:57   ` Jeff King
@ 2021-01-12 16:35     ` Taylor Blau
  0 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-12 16:35 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, jrnieder

On Tue, Jan 12, 2021 at 03:57:59AM -0500, Jeff King wrote:
> A minor nit, but "n" isn't very descriptive. It's not in scope for very
> long, so that's not too bad, but there are two positions at work in this
> function: the pos/offset bit position, and the index position. Maybe
> "index_pos" would be better than "n" to keep the two clear?

Much clearer, thank you.

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 09/20] try_partial_reuse(): convert to new revindex API
  2021-01-12  9:06   ` Jeff King
@ 2021-01-12 16:47     ` Taylor Blau
  0 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-12 16:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, jrnieder

On Tue, Jan 12, 2021 at 04:06:46AM -0500, Jeff King wrote:
> On Fri, Jan 08, 2021 at 01:17:17PM -0500, Taylor Blau wrote:
> > Somewhat confusingly, the subsequent call to unpack_object_header()
> > takes a pointer to &offset and then updates it with a new value. But,
> > try_partial_reuse() cares about the offset of both the base's header and
> > contents. The existing code made a copy of the offset field, and only
> > addresses and manipulates one of them.
> >
> > Instead, store the return of pack_pos_to_offset twice: once in header
> > and another in offset. Header will be left untouched, but offset will be
> > addressed and modified by unpack_object_header().
>
> I had to read these second two paragraphs a few times to parse them.
> Really we are just replacing revidx->offset with "header", and "offset"
> retains its same role within the function.
>
> So it's definitely doing the right thing, but it makes more sense to me
> as: [...]

Thanks. Reading these both back-to-back, I agree that yours is much
clearer in describing what is actually going on.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 13/20] packed_object_info(): convert to new revindex API
  2021-01-12  9:11   ` Jeff King
@ 2021-01-12 16:51     ` Taylor Blau
  0 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-12 16:51 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, jrnieder

On Tue, Jan 12, 2021 at 04:11:59AM -0500, Jeff King wrote:
> >   - Add an additional guard to check that the offset 'obj_offset()'
> >     points to a real object. This should be the case with well-behaved
> >     callers to 'packed_object_info()', but isn't guarenteed.
> >
> >     Other blocks that fill in various other values from the 'struct
> >     object_info' request handle bad inputs by setting the type to
> >     'OBJ_BAD' and jumping to 'out'. Do the same when given a bad offset
> >     here.
>
> Also good. I wonder if we need to call error() here, too. The caller
> will probably say something like "bad object" or whatever, but the user
> will have no clue that it's related to the revindex.
>
> That would match other parts of the function (e.g., calling into
> unpack_entry() can generate lots of descriptive errors about exactly
> what went wrong).

Indeed, and we have the object's offset and pack name to include, too,
so our error can be quite descriptive.

> Yep. Again, I'm really happy to see these "should never happen" cases
> converted to real errors or even BUG()s.

:-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 14/20] unpack_entry(): convert to new revindex API
  2021-01-12  9:22   ` Jeff King
@ 2021-01-12 16:56     ` Taylor Blau
  0 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-12 16:56 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, jrnieder

On Tue, Jan 12, 2021 at 04:22:26AM -0500, Jeff King wrote:
> We have "pos" and "nr". What's the difference? :)
>
> I think pack_pos and index_pos might be harder to get confused.

Much clearer, thanks.

> > +			off_t len;
> > +
> > +			if (offset_to_pack_pos(p, obj_offset, &pos) < 0) {
> > +				data = NULL;
> > +				goto out;
> > +			}
>
> Nice to see the error check here. As with the previous commit, we
> probably want to error(), just as we would for errors below.

Yup, good call.

> Do we also need to call mark_bad_packed_object()? I guess we can't,
> because we only have the offset, and not the oid (the code below uses
> nth_packed_object_id(), but it is relying on the revindex, which we know
> just failed to work).
>
> I'm just wondering if an error here is going to put us into an infinite
> loop of retrying the lookup in the same pack over and over. Let's
> see...our caller is ultimately packed_object_info(), but it too does not
> have the oid. It returns an error up to do_oid_object_info_extended().
> Which yes, does mark_bad_packed_object() itself. Good. So I think we are
> fine, and arguably these lower-level calls to mark_bad_packed_object()
> are not necessary. But they do not hurt either.

Thanks, this rationale is helpful to have. I included an abridged
version of it in the patch message.

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 18/20] pack-revindex: remove unused 'find_revindex_position()'
  2021-01-12  9:32   ` Jeff King
@ 2021-01-12 16:59     ` Taylor Blau
  2021-01-13 13:05       ` Jeff King
  0 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-12 16:59 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, jrnieder

On Tue, Jan 12, 2021 at 04:32:39AM -0500, Jeff King wrote:
> Good catch. We might want to drop the initialization of "lo":
>
> >  	int lo = 0;
> > -	int hi = p->num_objects + 1;
>
> down to here:
>
> > +	hi = p->num_objects + 1;

> to maintain symmetry (though it's quite a minor point).

:-). I agree it's a minor point, but I think that the symmetry is nice,
so it's worth doing.

> I notice these are signed ints, but we've taken care to use uint32_t
> elsewhere for positions. Shouldn't these be uint32_t, also (or at least
> unsigned)?

I'll let both of these be an raw unsigned, since the midpoint is already
labeled as an unsigned.

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()'
  2021-01-12  9:37   ` Jeff King
@ 2021-01-12 17:02     ` Taylor Blau
  0 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-12 17:02 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, jrnieder

On Tue, Jan 12, 2021 at 04:37:09AM -0500, Jeff King wrote:
> This definitely makes sense in the long run. I could take or leave it as
> a final patch in _this_ series (as opposed to the first patch in a
> subsequent series adding the rev files).
>
> >  	do {
> >  		const unsigned mi = lo + (hi - lo) / 2;
> > -		if (revindex[mi].offset == ofs) {
> > +		off_t got = pack_pos_to_offset(p, mi);
>
>
> They're both constant-time, so performance should be the same big-O. The
> function has extra BUG() checks. I doubt those are measurable in
> practice, though.

Funny enough, I have moved this patch between the two so many times
before submitting this. I tend to agree that I don't think it makes a
difference in which series this patch goes, so I'm just as happy to
leave it where it is and stop thinking about it ;-).

If others have strong feelings, this can be dropped when queuing and
I'll send it along as the first commit in the second series (which will
have to be updated along with this one).

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-12  9:45 ` Jeff King
@ 2021-01-12 17:07   ` Taylor Blau
  0 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-12 17:07 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, jrnieder

On Tue, Jan 12, 2021 at 04:45:47AM -0500, Jeff King wrote:
> > The goal of this series is to remove direct access of the `struct
> > revindex_entry` type, as well as `struct packed_git`'s `revindex` field. The
> > on-disk format will be mmap'd and accessed directly, but the format is
> > sufficiently different that the whole `revindex` array can't be written as-is.
>
> It looks good overall to me. I left a few nits around documentation and
> integer types that I think are worth a re-roll, but I think after
> addressing those it should be good.

Thanks so much for your review. I think I addressed all of your
feedback, but I'll sit on the revised patches for another day or so in
case other reviewers would like to chime in before v2.

Hopefully the this re-roll is the only one we'll need, since the next
series is much more interesting!

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (21 preceding siblings ...)
  2021-01-12  9:45 ` Jeff King
@ 2021-01-13  0:23 ` Junio C Hamano
  2021-01-13  0:52   ` Taylor Blau
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
  23 siblings, 1 reply; 121+ messages in thread
From: Junio C Hamano @ 2021-01-13  0:23 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, jrnieder

Taylor Blau <me@ttaylorr.com> writes:

> This is the first of two series to introduce an on-disk alternative to the
> reverse index which is currently generated once per-process and stored in
> memory.

Queued; seems to be killed on Windows but otherwise looking good.
  
  https://github.com/git/git/runs/1691849288?check_suite_focus=true#step:6:164

Thanks.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-13  0:23 ` Junio C Hamano
@ 2021-01-13  0:52   ` Taylor Blau
  2021-01-13  2:15     ` Junio C Hamano
  0 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-13  0:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, peff, jrnieder

On Tue, Jan 12, 2021 at 04:23:00PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > This is the first of two series to introduce an on-disk alternative to the
> > reverse index which is currently generated once per-process and stored in
> > memory.
>
> Queued; seems to be killed on Windows but otherwise looking good.

Thanks.

>   https://github.com/git/git/runs/1691849288?check_suite_focus=true#step:6:164

I will look into those failures. It looks like these are all due to
running 'git index-pack' when the '*.idx' file already exists (resulting
in Windows being unable to write over the file).

Did you want to queue these two topics separately? It looks like you
picked them up in 'seen' as one big topic in a8f367dae2 (Merge branch
'tb/pack-revindex-api' into jch, 2021-01-12), but I thought you may want
to pick parts one and two up separately to allow them to progress
independently of one another.

In either case, I am planning on sending a new version of the first
series tomorrow to address Peff's feedback.

> Thanks.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-13  0:52   ` Taylor Blau
@ 2021-01-13  2:15     ` Junio C Hamano
  2021-01-13  3:23       ` Taylor Blau
  0 siblings, 1 reply; 121+ messages in thread
From: Junio C Hamano @ 2021-01-13  2:15 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, jrnieder

Taylor Blau <me@ttaylorr.com> writes:

> Did you want to queue these two topics separately?

Unless it is a reasonable amount of certainty that the bottom topic
will not have to be rerolled, I'd rather not to have separate topics
on top of each other.

It is tedious and error prone, having to rebase the old iteration of
the top topic on top when a new iteration of the bottom topic comes
out.  I'd rather see that the top one get rerolled whenever the
bottom one gets rerolled to make life simpler.

Even in a single topic, I would encourage people to put the
foundational and preparatory work early so that we can make them
solidify before review really gets to later parts of the series.

And when a series is structured like so, it is perfectly fine to say
something like:

    Here is a new iteration of the last 7 patches---the early 13
    patches are the same as the previous round, so reset the topic
    to bb6414ab (packfile: prepare for the existence of '*.rev'
    files, 2021-01-08) and then apply these 7.

if you do not want to send all patches in a nontrivial series when
it gets updated.

Thanks.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-13  2:15     ` Junio C Hamano
@ 2021-01-13  3:23       ` Taylor Blau
  2021-01-13  8:21         ` Junio C Hamano
  0 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-13  3:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, peff, jrnieder

On Tue, Jan 12, 2021 at 06:15:11PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Did you want to queue these two topics separately?
>
> Unless it is a reasonable amount of certainty that the bottom topic
> will not have to be rerolled, I'd rather not to have separate topics
> on top of each other.

Hmm. I certainly do not want to create any tedious work for you, but I
do think that we're in a situation where the bottom topic is stable and
the interesting discussion will happen in the newer topic.

> It is tedious and error prone, having to rebase the old iteration of
> the top topic on top when a new iteration of the bottom topic comes
> out.  I'd rather see that the top one get rerolled whenever the
> bottom one gets rerolled to make life simpler.

Indeed, my local v2 of the bottom topic requires the top topic to be
rebased onto it, and so I'll plan to send a v2 of each tomorrow morning.

> Even in a single topic, I would encourage people to put thet
> foundational and preparatory work early so that we can make them
> solidify before review really gets to later parts of the series.
>
> And when a series is structured like so, it is perfectly fine to say
> something like:
>
>     Here is a new iteration of the last 7 patches---the early 13
>     patches are the same as the previous round, so reset the topic
>     to bb6414ab (packfile: prepare for the existence of '*.rev'
>     files, 2021-01-08) and then apply these 7.
>
> if you do not want to send all patches in a nontrivial series when
> it gets updated.

I am happy to do it this way if you would prefer. If so, you may want to
rename this topic while queuing, since it is not just about a new
revindex API, but rather about introducing the on-disk format. (I would
suggest tb/on-disk-revindex, if you were looking for alternate names).

If you agree that the bottom topic is stable, I'd prefer to send the top
one separately. Otherwise, I can send both together. Let me know.

> Thanks.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 01/20] pack-revindex: introduce a new API
  2021-01-08 18:16 ` [PATCH 01/20] pack-revindex: introduce a new API Taylor Blau
  2021-01-12  8:41   ` Jeff King
@ 2021-01-13  8:06   ` Junio C Hamano
  2021-01-13  8:54     ` Junio C Hamano
                       ` (2 more replies)
  1 sibling, 3 replies; 121+ messages in thread
From: Junio C Hamano @ 2021-01-13  8:06 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, jrnieder

Taylor Blau <me@ttaylorr.com> writes:

> In the next several patches, we will prepare for loading a reverse index
> either in memory, or from a yet-to-be-introduced on-disk format. To do

Does "load revindex in memory" (as opposed to "from on-disk file")
mean the good old "read the forward index and make inverse map
in-core", or something else?  

IOW, is "We will prepare a reverse index either by computing in
memory from forward index, or loading from on-disk file" what we
want to say here?

> There are four ways to interact with the reverse index. Accordingly,
> four functions will be exported from 'pack-revindex.h' by the time that
> the existing API is removed. A caller may:
>
>  1. Load the pack's reverse index. This involves opening up the index,
>     generating an array, and then sorting it. Since opening the index
>     can fail, this function ('load_pack_revindex()') returns an int.
>     Accordingly, it takes only a single argument: the 'struct
>     packed_git' the caller wants to build a reverse index for.
>
>     This function is well-suited for both the current and new API.
>     Callers will have to continue to open the reverse index explicitly,
>     but this function will eventually learn how to detect and load a
>     reverse index from the on-disk format, if one exists. Otherwise, it
>     will fallback to generating one in memory from scratch.

OK.

>  2. Convert a pack position into an offset. This operation is now
>     called `pack_pos_to_offset()`. It takes a pack and a position, and
>     returns the corresponding off_t.
>
>  3. Convert a pack position into an index position. Same as above; this
>     takes a pack and a position, and returns a uint32_t. This operation
>     is known as `pack_pos_to_index()`.
>
>  4. Find the pack position for a given offset. This operation is now
>     known as `offset_to_pack_pos()`. It takes a pack, an offset, and a
>     pointer to a uint32_t where the position is written, if an object
>     exists at that offset. Otherwise, -1 is returned to indicate
>     failure.

Without knowing what exactly "pack position", "offset" and "index
position" refer to, the above three are almost impossible to grok.
Can we have one paragraph description for each?  Something along the
lines of...

 - Pack position: a packstream consists of series of byte ranges,
   each of which represents an object, so the objects can be
   numbered from 0 (the object whose data is stored at the earliest
   part in the packfile) to N (the object whose data is stored at
   the tail end of the packfile).  The number corresponding to an
   object in this order in the packfile is called the "pack
   position" of the object.

 - Offset: The ofs_t distance between the beginning of a pack stream
   and the beginning of data that represents an object is called the
   "offset" of the object in the packfile.

 - Index position: for a single pack stream, there is a table that
   maps object name to its offset and the entries in this table are
   sorted by the object name (this is what pack ".idx" file is).
   The location (counting from 0) of an object in this table is
   called the "index position" of the object in the packfile.

I am not sure if the above correctly reflects what you meant by
"position", though.

>     Unlike some of the callers that used to access '->offset' and '->nr'
>     directly, the error checking around this call is somewhat more
>     robust. This is important since callers can pass an offset which
>     does not contain an object.

Meaning "offset ought to point at the boundary between objects in
the pack stream, and the API, unlike the direct access, makes sure
that is the case"?  That is a good thing.

>     This will become important in a subsequent patch where a caller
>     which does not but could check the return value treats the signed
>     `-1` from `find_revindex_position()` as an index into the 'revindex'
>     array.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  pack-revindex.c | 32 ++++++++++++++++++++++++++++++++
>  pack-revindex.h |  4 ++++
>  2 files changed, 36 insertions(+)
>
> diff --git a/pack-revindex.c b/pack-revindex.c
> index ecdde39cf4..6d86a85208 100644
> --- a/pack-revindex.c
> +++ b/pack-revindex.c
> @@ -203,3 +203,35 @@ struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs)
>  
>  	return p->revindex + pos;
>  }
> +
> +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> +{
> +	int ret;
> +
> +	if (load_pack_revindex(p) < 0)
> +		return -1;
> +
> +	ret = find_revindex_position(p, ofs);
> +	if (ret < 0)
> +		return -1;

Why not "return ret"?  We know that find_revindex_position() would
signal an error by returning -1, but is there a reason why we want
to prevent it from returning richer errors in the future?

> +	*pos = ret;

The untold assumption is that uint32_t can fit the maximum returned
value from find_revindex_position() and "signed int" can also big
enough.  I guess it is OK to be limited to up-to 2 billion objects
on 32-bit systems.

> +	return 0;
> +}
> +
> +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
> +{
> +	if (!p->revindex)
> +		BUG("pack_pos_to_index: reverse index not yet loaded");

The previous function lazy loaded the revindex, but this one and the
next one refuses to work without revindex.  Intended?

> +	if (pos >= p->num_objects)
> +		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);

Personally I find it easier to place items on a single line in an
ascending order of magnitude, i.e.

	if (p->num_objects <= pos)
		BUG("...");

The assertion requires pos to be strictly lower than p->num_objects,
which is in line with how we usually count elements of an array of
size p->num_objects, but the next one allows pos == p->num_objects;
intended?

p->revindex[] is an array of two-member struct, so if an element of
the array is invalid for its .nr member here because pos is exactly
at p->num_objects, I would imagine it is also invalid for its .offset
member, too, no?

Ah, perhaps the "offset beyond the end of the pack positions" is a
sentinel element to give the in-pack-stream size of the object at
the last pack position?  If that is the case, it deserves a comment,
I would think.

> +	return p->revindex[pos].nr;
> +}
> +
> +off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos)
> +{
> +	if (!p->revindex)
> +		BUG("pack_pos_to_index: reverse index not yet loaded");
> +	if (pos > p->num_objects)
> +		BUG("pack_pos_to_offset: out-of-bounds object at %"PRIu32, pos);
> +	return p->revindex[pos].offset;
> +}
> diff --git a/pack-revindex.h b/pack-revindex.h
> index 848331d5d6..256c0a9106 100644
> --- a/pack-revindex.h
> +++ b/pack-revindex.h
> @@ -13,4 +13,8 @@ int find_revindex_position(struct packed_git *p, off_t ofs);
>  
>  struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs);
>  
> +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos);
> +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
> +off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos);
> +
>  #endif

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-13  3:23       ` Taylor Blau
@ 2021-01-13  8:21         ` Junio C Hamano
  2021-01-13 13:13           ` Jeff King
  0 siblings, 1 reply; 121+ messages in thread
From: Junio C Hamano @ 2021-01-13  8:21 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, jrnieder

Taylor Blau <me@ttaylorr.com> writes:

> If you agree that the bottom topic is stable, I'd prefer to send the top
> one separately. Otherwise, I can send both together. Let me know.

I do not expect the first 20 of the 20+8 patches to be stable from
the beginning---in fact, after reading 01/20 myself, and seeing a
few of Peff's reviews, I expect that you'll be redoing at least some
of them.

When we deal with this kind of situation (not limited to these
topics), let's make it a tradition to first pretend that we have a
long single topic, and expect that the early rounds of review focus
on two things:

 (1) to identify the best ordering of the commits (topics from
     experienced contributors like you are likely to be already
     structured in a good order, so reviewers may only have to say
     "the ordering looks good", but sometimes they may want to say
     thinks like "it would be better to leave patches 5, 8 and 11 to
     much later in the series".

 (2) to find good points to divide the series into two (or more)
     pieces, and spend more effort on helping the bottom part to
     solidify faster.

That way, the bottom part can be merged sooner to 'next' than the
rest.  It always is cumbersome to have some part of the series in
'next' and remainder in 'seen', so at that point, the lower half
would naturally gain a different name before it gets merged to
'next', I would think.

Thanks.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 01/20] pack-revindex: introduce a new API
  2021-01-13  8:06   ` Junio C Hamano
@ 2021-01-13  8:54     ` Junio C Hamano
  2021-01-13 13:17     ` Jeff King
  2021-01-13 16:23     ` Taylor Blau
  2 siblings, 0 replies; 121+ messages in thread
From: Junio C Hamano @ 2021-01-13  8:54 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, jrnieder

Junio C Hamano <gitster@pobox.com> writes:

> Without knowing what exactly "pack position", "offset" and "index
> position" refer to, the above three are almost impossible to grok.
> Can we have one paragraph description for each?  Something along the
> lines of...
>
>  - Pack position: a packstream consists of series of byte ranges,
>    each of which represents an object, so the objects can be
>    numbered from 0 (the object whose data is stored at the earliest
>    part in the packfile) to N (the object whose data is stored at
>    the tail end of the packfile).  The number corresponding to an
>    object in this order in the packfile is called the "pack
>    position" of the object.
>
>  - Offset: The ofs_t distance between the beginning of a pack stream
>    and the beginning of data that represents an object is called the
>    "offset" of the object in the packfile.
>
>  - Index position: for a single pack stream, there is a table that
>    maps object name to its offset and the entries in this table are
>    sorted by the object name (this is what pack ".idx" file is).
>    The location (counting from 0) of an object in this table is
>    called the "index position" of the object in the packfile.
>
> I am not sure if the above correctly reflects what you meant by
> "position", though.

Please scratch the above.  I see Peff already pointed out pretty
much the same, and his phrasing looked a lot cleaner than my
attempt.

Thanks.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 02/20] write_reuse_object(): convert to new revindex API
  2021-01-12 16:31     ` Taylor Blau
@ 2021-01-13 13:02       ` Jeff King
  0 siblings, 0 replies; 121+ messages in thread
From: Jeff King @ 2021-01-13 13:02 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Tue, Jan 12, 2021 at 11:31:40AM -0500, Taylor Blau wrote:

> > > +	datalen = pack_pos_to_offset(p, pos + 1) - offset;
> >
> > This "pos + 1" means we may be looking one past the end of the array.
> > That's OK (at least for now), because our revindex always puts in an
> > extra dummy value exactly for computing these kinds of byte-distances.
> > That might be worth documenting in the API header.
> 
> Yeah, I made sure to document that when I was touching up the last
> patch. FWIW, that's a behavior that we're going to carry over even when
> the reverse index is stored on-disk (not by writing four extra bytes
> into the .rev file, but by handling queries for pos == p->num_objects
> separately.)

I think the only reason to look past the end like that is to compute the
size of the final entry. So we _could_ abstract that away from the
callers with a separate function like:

  off_t pack_pos_to_size(struct packed_git *p, uint32_t pos);

But as long as the behavior of passing p->num_objects is documented, I
do not mind overly mind spelling it one way or the other.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 18/20] pack-revindex: remove unused 'find_revindex_position()'
  2021-01-12 16:59     ` Taylor Blau
@ 2021-01-13 13:05       ` Jeff King
  0 siblings, 0 replies; 121+ messages in thread
From: Jeff King @ 2021-01-13 13:05 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, jrnieder

On Tue, Jan 12, 2021 at 11:59:42AM -0500, Taylor Blau wrote:

> > I notice these are signed ints, but we've taken care to use uint32_t
> > elsewhere for positions. Shouldn't these be uint32_t, also (or at least
> > unsigned)?
> 
> I'll let both of these be an raw unsigned, since the midpoint is already
> labeled as an unsigned.

Yeah, I found that existing code doubly weird, since "mi" must by
definition be bounded by "lo" and "hi". :)

Looks like the bug was introduced in 92e5c77c37 (revindex: export new
APIs, 2013-10-24), which hacked up the existing loop into a new
function. We should have caught it then (back then we were a lot less
careful about types, IMHO).

Also, the subject line of that commit is giving me deja vu. ;)

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-13  8:21         ` Junio C Hamano
@ 2021-01-13 13:13           ` Jeff King
  2021-01-13 15:34             ` Taylor Blau
  0 siblings, 1 reply; 121+ messages in thread
From: Jeff King @ 2021-01-13 13:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, jrnieder

On Wed, Jan 13, 2021 at 12:21:12AM -0800, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > If you agree that the bottom topic is stable, I'd prefer to send the top
> > one separately. Otherwise, I can send both together. Let me know.
> 
> I do not expect the first 20 of the 20+8 patches to be stable from
> the beginning---in fact, after reading 01/20 myself, and seeing a
> few of Peff's reviews, I expect that you'll be redoing at least some
> of them.

They'll definitely need at least one re-roll. But I think Taylor is
expecting (and I do too) that the second half will probably have a lot
more back-and-forth over the on-disk format, and hence need more
re-rolls.

My main concern is reviewer fatigue. 28 patches is a lot. If we can
solidify the first 20 and then let people focus on the final 8
separately, that helps. If you're OK with splitting a topic and saying
"this is a re-roll of just the last 8 patches", then that problem is
solved. But IMHO it is easier to just point out that split from the
start than it is to come up with it after the fact. It tells reviewers
what to expect from the get-go.

>  (2) to find good points to divide the series into two (or more)
>      pieces, and spend more effort on helping the bottom part to
>      solidify faster.

I think we just did that preemptively. ;) In these two particular
series, the first 20 (or at least the first 19) are an improvement by
themselves. I think they would be worth doing even without the final 8,
both to make calling code more readable and to add new error checks and
assertions to revindex access.

> That way, the bottom part can be merged sooner to 'next' than the
> rest.  It always is cumbersome to have some part of the series in
> 'next' and remainder in 'seen', so at that point, the lower half
> would naturally gain a different name before it gets merged to
> 'next', I would think.

That seems to me like it ends up being _more_ work than just making them
into two branches in the first place.

So I guess I remain skeptical that ad-hoc splitting of longer series is
easier than doing so up front. But you're the one who does all of the
branch shuffling in the end, so if you really prefer longer series, I'm
not what I think matters that much. ;)

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 01/20] pack-revindex: introduce a new API
  2021-01-13  8:06   ` Junio C Hamano
  2021-01-13  8:54     ` Junio C Hamano
@ 2021-01-13 13:17     ` Jeff King
  2021-01-13 16:23     ` Taylor Blau
  2 siblings, 0 replies; 121+ messages in thread
From: Jeff King @ 2021-01-13 13:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, jrnieder

On Wed, Jan 13, 2021 at 12:06:03AM -0800, Junio C Hamano wrote:

> > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> > +{
> > +	int ret;
> [...]
> > +	*pos = ret;
> 
> The untold assumption is that uint32_t can fit the maximum returned
> value from find_revindex_position() and "signed int" can also big
> enough.  I guess it is OK to be limited to up-to 2 billion objects
> on 32-bit systems.

Thanks for pointing this out. I recalled there being an "int" problem
somewhere in the revindex code, but I didn't notice it on my
read-through. This bug already exists (the problem is actually in the
find_revindex_position() interface), and is fixed when we inline that
into offset_to_pack_pos() in patch 18.

It might be worth mentioning the fix there.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-13 13:13           ` Jeff King
@ 2021-01-13 15:34             ` Taylor Blau
  2021-01-13 20:06               ` Junio C Hamano
  0 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 15:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Taylor Blau, git, jrnieder

On Wed, Jan 13, 2021 at 08:13:57AM -0500, Jeff King wrote:
> On Wed, Jan 13, 2021 at 12:21:12AM -0800, Junio C Hamano wrote:
>
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > > If you agree that the bottom topic is stable, I'd prefer to send the top
> > > one separately. Otherwise, I can send both together. Let me know.
> >
> > I do not expect the first 20 of the 20+8 patches to be stable from
> > the beginning---in fact, after reading 01/20 myself, and seeing a
> > few of Peff's reviews, I expect that you'll be redoing at least some
> > of them.
>
> They'll definitely need at least one re-roll. But I think Taylor is
> expecting (and I do too) that the second half will probably have a lot
> more back-and-forth over the on-disk format, and hence need more
> re-rolls.

Indeed, I find the first ~20 patches fairly benign, and I think that the
interesting discussion will and should take place over the final 8
patches.

For what it's worth, I was referring to the pending re-roll I have of
the first 20 patches as the stable one. (Peff notes in [1] that he
thought there wasn't much else to consider beyond his comments.)

> My main concern is reviewer fatigue. 28 patches is a lot. If we can
> solidify the first 20 and then let people focus on the final 8
> separately, that helps. If you're OK with splitting a topic and saying
> "this is a re-roll of just the last 8 patches", then that problem is
> solved. But IMHO it is easier to just point out that split from the
> start than it is to come up with it after the fact. It tells reviewers
> what to expect from the get-go.

Yes, exactly.

> > That way, the bottom part can be merged sooner to 'next' than the
> > rest.  It always is cumbersome to have some part of the series in
> > 'next' and remainder in 'seen', so at that point, the lower half
> > would naturally gain a different name before it gets merged to
> > 'next', I would think.
>
> That seems to me like it ends up being _more_ work than just making them
> into two branches in the first place.

I agree, but I also wasn't aware that you would consider queuing part of
a series. If that's the route you want to take, I'm OK with that. But I
tend to agree with Peff that (in this case since a clear deliniation
already exists) it may save us time to just send two separate series
from the get-go.

> So I guess I remain skeptical that ad-hoc splitting of longer series is
> easier than doing so up front. But you're the one who does all of the
> branch shuffling in the end, so if you really prefer longer series, I'm
> not what I think matters that much. ;)

Agreed. Junio, let me know which you'd prefer in this case (I'm not sure
if the additional context has changed your mind or not).

> -Peff

Thanks,
Taylor

[1]: https://lore.kernel.org/git/X%2F1vy3D10wDEZNva@coredump.intra.peff.net/

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 01/20] pack-revindex: introduce a new API
  2021-01-13  8:06   ` Junio C Hamano
  2021-01-13  8:54     ` Junio C Hamano
  2021-01-13 13:17     ` Jeff King
@ 2021-01-13 16:23     ` Taylor Blau
  2 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 16:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, peff, jrnieder

On Wed, Jan 13, 2021 at 12:06:03AM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > In the next several patches, we will prepare for loading a reverse index
> > either in memory, or from a yet-to-be-introduced on-disk format. To do
>
> Does "load revindex in memory" (as opposed to "from on-disk file")
> mean the good old "read the forward index and make inverse map
> in-core", or something else?

Indeed, that's what it means. I've made that clearer in the patch
message by saying it explicitly.

> IOW, is "We will prepare a reverse index either by computing in
> memory from forward index, or loading from on-disk file" what we
> want to say here?

Yep.

> Without knowing what exactly "pack position", "offset" and "index
> position" refer to, the above three are almost impossible to grok.
> Can we have one paragraph description for each?  Something along the
> lines of...

Yep, and I see later on in the thread that you want to discard this
suggestion since Peff has suggested similar changes in pack-revindex.h,
which I've applied.

> >     Unlike some of the callers that used to access '->offset' and '->nr'
> >     directly, the error checking around this call is somewhat more
> >     robust. This is important since callers can pass an offset which
> >     does not contain an object.
>
> Meaning "offset ought to point at the boundary between objects in
> the pack stream, and the API, unlike the direct access, makes sure
> that is the case"?  That is a good thing.

Indeed, and I've called that out directly in the patch message to
highlight it.

> > diff --git a/pack-revindex.c b/pack-revindex.c
> > index ecdde39cf4..6d86a85208 100644
> > --- a/pack-revindex.c
> > +++ b/pack-revindex.c
> > @@ -203,3 +203,35 @@ struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs)
> >
> >  	return p->revindex + pos;
> >  }
> > +
> > +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
> > +{
> > +	int ret;
> > +
> > +	if (load_pack_revindex(p) < 0)
> > +		return -1;
> > +
> > +	ret = find_revindex_position(p, ofs);
> > +	if (ret < 0)
> > +		return -1;
>
> Why not "return ret"?  We know that find_revindex_position() would
> signal an error by returning -1, but is there a reason why we want
> to prevent it from returning richer errors in the future?

No reason, I've changed it to 'return ret' instead.

> > +	*pos = ret;
>
> The untold assumption is that uint32_t can fit the maximum returned
> value from find_revindex_position() and "signed int" can also big
> enough.  I guess it is OK to be limited to up-to 2 billion objects
> on 32-bit systems.

Indeed, and that is fixed in a later on patch. I'll make sure to call it
out there.

> > +	return 0;
> > +}
> > +
> > +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
> > +{
> > +	if (!p->revindex)
> > +		BUG("pack_pos_to_index: reverse index not yet loaded");
>
> The previous function lazy loaded the revindex, but this one and the
> next one refuses to work without revindex.  Intended?

Yes, and discussed a little bit more here [1]. Obviously that discussion
doesn't do any good to those reading the git log in the future, so I've
summarized the important detail (that some callers are equipped to deal
with errors but others aren't) in the patch.

> > +	if (pos >= p->num_objects)
> > +		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
>
> Personally I find it easier to place items on a single line in an
> ascending order of magnitude, i.e.
>
> 	if (p->num_objects <= pos)
> 		BUG("...");

More readable, thanks.

> The assertion requires pos to be strictly lower than p->num_objects,
> which is in line with how we usually count elements of an array of
> size p->num_objects, but the next one allows pos == p->num_objects;
> intended?
>
> p->revindex[] is an array of two-member struct, so if an element of
> the array is invalid for its .nr member here because pos is exactly
> at p->num_objects, I would imagine it is also invalid for its .offset
> member, too, no?
>
> Ah, perhaps the "offset beyond the end of the pack positions" is a
> sentinel element to give the in-pack-stream size of the object at
> the last pack position?  If that is the case, it deserves a comment,
> I would think.

Exactly. I added a detail about that in the patch, too.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/X%2F3ODgaa9wr65M09@nand.local/

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-13 15:34             ` Taylor Blau
@ 2021-01-13 20:06               ` Junio C Hamano
  2021-01-13 20:13                 ` Taylor Blau
  2021-01-13 20:22                 ` Jeff King
  0 siblings, 2 replies; 121+ messages in thread
From: Junio C Hamano @ 2021-01-13 20:06 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jeff King, git, jrnieder

Taylor Blau <me@ttaylorr.com> writes:

>> > That way, the bottom part can be merged sooner to 'next' than the
>> > rest.  It always is cumbersome to have some part of the series in
>> > 'next' and remainder in 'seen', so at that point, the lower half
>> > would naturally gain a different name before it gets merged to
>> > 'next', I would think.
>>
>> That seems to me like it ends up being _more_ work than just making them
>> into two branches in the first place.

More work to contributors?  How?

As long as each of 20-patch and 8-patch series is marked clearly to
manage the risk of mistakes and confusion down to the same level as
a single long series, I am perfectly OK.

Examples of help contributors could have made, which would have
avoided past confusion (these are not "potential" ones, but I had to
redo day's intergration in the past because of one long topic
building on top of another) are:

 - When sending either topic, not limited to the initial round but
   in all the subsequent rounds, remind that the top topic is to be
   applied on top of the bottom topic.

 - When updating the bottom topic (e.g. 20-patch one in this case),
   send out the top one (e.g. 8-patch one), too (or instruct me to
   discard the top one tentatively).

The worst case that happened in the past was that a quite minor
tweak was made to a bottom topic that was depended on another topic,
so I just queued the new iteration of the bottom topic again,
without realizing that the other one needed to be rebased.  We ended
up two copies of the bottom topic commits in 'pu' (these days we
call it 'seen') as the tweak was so minor that the two topics
cleanly merged into 'pu' without causing conflict.  The next bad
case was a similar situation with larger rewrite of the bottom
topic, which caused me to look at quite a big conflict and waste my
time until I realized that I was missing an updated top half.

If the inter-dependent topics that caused me trouble were managed as
a single long patch series, either with "this is a full replacement
of the new iteration" or "these are to update only the last 8
patches; apply them after rewinding the topic to commit f0e1d2c3
(gostak: distim doshes, 2021-01-08)", would have had a lot less risk
to introduce human error on this end.

> I agree, but I also wasn't aware that you would consider queuing part of
> a series. If that's the route you want to take, I'm OK with that.

Discarding broken part of a series and only queuing a good part can
happen with or without multiple topics.  Merging one topic to 'next'
but not the other also happens.  Merging early part of a topic to
'next' while leaving the rest to 'seen' is possible but I'd prefer
to avoid it.  Because of the last one, a single long topic, when a
bottom part stabilizes enough, would likely to gain a separate name
and its tip would be merged to 'next'.

> But I
> tend to agree with Peff that (in this case since a clear deliniation
> already exists) it may save us time to just send two separate series
> from the get-go.

As long as the two serieses are marked as such clearly, not just in
the initial round but in all subsequent rounds, it is OK.  But in an
unproven initial round, you may regret having to move a patch across
topics, from the bottom one to the top one or vice versa, instead of
just reordering inside a single topic.

>> So I guess I remain skeptical that ad-hoc splitting of longer series is
>> easier than doing so up front.

Nobody suggested ad-hoc splitting.  I was saying that splitting
would naturally grow out of reviews toward stabilization.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-13 20:06               ` Junio C Hamano
@ 2021-01-13 20:13                 ` Taylor Blau
  2021-01-13 20:22                 ` Jeff King
  1 sibling, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 20:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, Jeff King, git, jrnieder

On Wed, Jan 13, 2021 at 12:06:08PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > But I tend to agree with Peff that (in this case since a clear
> > deliniation already exists) it may save us time to just send two
> > separate series from the get-go.
>
> As long as the two serieses are marked as such clearly, not just in
> the initial round but in all subsequent rounds, it is OK.  But in an
> unproven initial round, you may regret having to move a patch across
> topics, from the bottom one to the top one or vice versa, instead of
> just reordering inside a single topic.

Sounds good. Like I said, the last thing I want to do is create undue
burden on your or anybody else when queuing or reviewing these topics.

So, I'll try to err on the side of stating the dependency between topics
too often rather than not often enough. Hopefully the first ~20 patches
are boring enough that they will make their way to master soon enough
and we don't have to worry about it.

Ordinarily, I would have held off on the second series until more the
first one had graduated, but I felt that the pure refactorings didn't
make much sense on their own without the new file-based backend to
motivate them.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-13 20:06               ` Junio C Hamano
  2021-01-13 20:13                 ` Taylor Blau
@ 2021-01-13 20:22                 ` Jeff King
  1 sibling, 0 replies; 121+ messages in thread
From: Jeff King @ 2021-01-13 20:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, jrnieder

On Wed, Jan 13, 2021 at 12:06:08PM -0800, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> >> > That way, the bottom part can be merged sooner to 'next' than the
> >> > rest.  It always is cumbersome to have some part of the series in
> >> > 'next' and remainder in 'seen', so at that point, the lower half
> >> > would naturally gain a different name before it gets merged to
> >> > 'next', I would think.
> >>
> >> That seems to me like it ends up being _more_ work than just making them
> >> into two branches in the first place.
> 
> More work to contributors?  How?

The quoted part is from me, so I'll respond: I didn't mean contributors,
but it seems like more work to you. I.e., you are ending up with the
same multi-branch config _and_ you have to split the branches yourself
later after seeing review.

But reading what you wrote below, the advantage is that if this does not
happen until the first part hits "next", then there is no chance of it
being rebased at that point (and thus getting rewritten out from under
the second topic).

> The worst case that happened in the past was that a quite minor
> tweak was made to a bottom topic that was depended on another topic,
> so I just queued the new iteration of the bottom topic again,
> without realizing that the other one needed to be rebased.  We ended
> up two copies of the bottom topic commits in 'pu' (these days we
> call it 'seen') as the tweak was so minor that the two topics
> cleanly merged into 'pu' without causing conflict.  The next bad
> case was a similar situation with larger rewrite of the bottom
> topic, which caused me to look at quite a big conflict and waste my
> time until I realized that I was missing an updated top half.

I somehow assumed you had more automation there. On my end, since I
rebase my topics aggressively, it is just a matter of pointing the
branch upstream in the right place. But of course that is not your
workflow at all.

I know you do have the "this branches uses" logic in your what's cooking
emails. In theory it could remind you of the situation, but I'm not sure
where in the workflow you'd insert it (by the time you run the WC
script, it is hard to realize the rebasing that _should_ have been done
earlier, unless you collate patch-ids, and even that is not 100%).

I do wonder if setting the dependent branch's @{upstream} would be
helpful here. You do not rebase all of your topics, but the ones with a
local-branch @{u} would be candidates for doing so.

All that said, I am also sensitive that my armchair "you could do it
like this..." suggesting may not be fully informed. So take it as idle
thoughts, not necessarily arguments. :)

> >> So I guess I remain skeptical that ad-hoc splitting of longer series is
> >> easier than doing so up front.
> 
> Nobody suggested ad-hoc splitting.  I was saying that splitting
> would naturally grow out of reviews toward stabilization.

This quote is me again. By "ad-hoc" I meant exactly this "after we see
some reviews" (as opposed to drawing a line up front).

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v2 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
                   ` (22 preceding siblings ...)
  2021-01-13  0:23 ` Junio C Hamano
@ 2021-01-13 22:23 ` Taylor Blau
  2021-01-13 22:23   ` [PATCH v2 01/20] pack-revindex: introduce a new API Taylor Blau
                     ` (20 more replies)
  23 siblings, 21 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Hi,

Here is a revision of the first of two series to prepare for and introduce an
on-disk alternative for storing the reverse index.

In this revision, I addressed feedback from Junio, Peff, and Stolee. A
range-diff is included below, but the main changes are:

  - Error messages are improved to include the pack and offset when applicable.
  - Variable names were made clearer (e.g., n -> index_pos).
  - Comments were added in pack-revindex.h to introduce relevant terminology,
    and which methods convert between what orderings.
  - int-sized lower- and upper-bounds were converted to be unsigned.

I believe that this revision should be ready for queueing. I'll send a v2 of the
corresponding latter series shortly.

Thanks in advance for your review.

Taylor Blau (20):
  pack-revindex: introduce a new API
  write_reuse_object(): convert to new revindex API
  write_reused_pack_one(): convert to new revindex API
  write_reused_pack_verbatim(): convert to new revindex API
  check_object(): convert to new revindex API
  bitmap_position_packfile(): convert to new revindex API
  show_objects_for_type(): convert to new revindex API
  get_size_by_pos(): convert to new revindex API
  try_partial_reuse(): convert to new revindex API
  rebuild_existing_bitmaps(): convert to new revindex API
  get_delta_base_oid(): convert to new revindex API
  retry_bad_packed_offset(): convert to new revindex API
  packed_object_info(): convert to new revindex API
  unpack_entry(): convert to new revindex API
  for_each_object_in_pack(): convert to new revindex API
  builtin/gc.c: guess the size of the revindex
  pack-revindex: remove unused 'find_pack_revindex()'
  pack-revindex: remove unused 'find_revindex_position()'
  pack-revindex: hide the definition of 'revindex_entry'
  pack-revindex.c: avoid direct revindex access in
    'offset_to_pack_pos()'

 builtin/gc.c           |  2 +-
 builtin/pack-objects.c | 37 +++++++++++++++++---------
 pack-bitmap.c          | 44 +++++++++++++++----------------
 pack-revindex.c        | 51 ++++++++++++++++++++++-------------
 pack-revindex.h        | 60 +++++++++++++++++++++++++++++++++++++-----
 packfile.c             | 54 ++++++++++++++++++++++++-------------
 6 files changed, 168 insertions(+), 80 deletions(-)

Range-diff against v1:
 1:  fa6b830908 <  -:  ---------- pack-revindex: introduce a new API
 -:  ---------- >  1:  e1aa89244a pack-revindex: introduce a new API
 2:  00668523e1 !  2:  0fca7d5812 write_reuse_object(): convert to new revindex API
    @@ builtin/pack-objects.c: static off_t write_reuse_object(struct hashfile *f, stru
     -	revidx = find_pack_revindex(p, offset);
     -	datalen = revidx[1].offset - offset;
     +	if (offset_to_pack_pos(p, offset, &pos) < 0)
    -+		die(_("write_reuse_object: could not locate %s"),
    -+		    oid_to_hex(&entry->idx.oid));
    ++		die(_("write_reuse_object: could not locate %s, expected at "
    ++		      "offset %"PRIuMAX" in pack %s"),
    ++		    oid_to_hex(&entry->idx.oid), (uintmax_t)offset,
    ++		    p->pack_name);
     +	datalen = pack_pos_to_offset(p, pos + 1) - offset;
      	if (!pack_to_stdout && p->index_version > 1 &&
     -	    check_pack_crc(p, &w_curs, offset, datalen, revidx->nr)) {
 3:  81ab11e18c !  3:  7676822a54 write_reused_pack_one(): convert to new revindex API
    @@ builtin/pack-objects.c: static void write_reused_pack_one(size_t pos, struct has
      			struct object_id base_oid;
      
     +			if (offset_to_pack_pos(reuse_packfile, base_offset, &base_pos) < 0)
    -+				die(_("expected object at offset %"PRIuMAX),
    -+				    (uintmax_t)base_offset);
    ++				die(_("expected object at offset %"PRIuMAX" "
    ++				      "in pack %s"),
    ++				    (uintmax_t)base_offset,
    ++				    reuse_packfile->pack_name);
     +
      			nth_packed_object_id(&base_oid, reuse_packfile,
     -					     reuse_packfile->revindex[base_pos].nr);
 4:  14b35d01a0 =  4:  dd7133fdb7 write_reused_pack_verbatim(): convert to new revindex API
 5:  c47e77a30e =  5:  8e93ca3886 check_object(): convert to new revindex API
 6:  3b170663dd =  6:  084bbf2145 bitmap_position_packfile(): convert to new revindex API
 7:  bc67bb462a !  7:  68794e9484 show_objects_for_type(): convert to new revindex API
    @@ pack-bitmap.c: static void show_objects_for_type(
      			struct object_id oid;
     -			struct revindex_entry *entry;
     -			uint32_t hash = 0;
    -+			uint32_t hash = 0, n;
    ++			uint32_t hash = 0, index_pos;
     +			off_t ofs;
      
      			if ((word >> offset) == 0)
    @@ pack-bitmap.c: static void show_objects_for_type(
      
     -			entry = &bitmap_git->pack->revindex[pos + offset];
     -			nth_packed_object_id(&oid, bitmap_git->pack, entry->nr);
    -+			n = pack_pos_to_index(bitmap_git->pack, pos + offset);
    ++			index_pos = pack_pos_to_index(bitmap_git->pack, pos + offset);
     +			ofs = pack_pos_to_offset(bitmap_git->pack, pos + offset);
    -+			nth_packed_object_id(&oid, bitmap_git->pack, n);
    ++			nth_packed_object_id(&oid, bitmap_git->pack, index_pos);
      
      			if (bitmap_git->hashes)
     -				hash = get_be32(bitmap_git->hashes + entry->nr);
    -+				hash = get_be32(bitmap_git->hashes + n);
    ++				hash = get_be32(bitmap_git->hashes + index_pos);
      
     -			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, entry->offset);
     +			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, ofs);
 8:  541fe679f3 =  8:  31ac6f5703 get_size_by_pos(): convert to new revindex API
 9:  54f4ad329f !  9:  acd80069a2 try_partial_reuse(): convert to new revindex API
    @@ Commit message
         'pack_pos_to_offset()' instead (the caller here does not care about the
         index position of the object at position 'pos').
     
    -    Somewhat confusingly, the subsequent call to unpack_object_header()
    -    takes a pointer to &offset and then updates it with a new value. But,
    -    try_partial_reuse() cares about the offset of both the base's header and
    -    contents. The existing code made a copy of the offset field, and only
    -    addresses and manipulates one of them.
    -
    -    Instead, store the return of pack_pos_to_offset twice: once in header
    -    and another in offset. Header will be left untouched, but offset will be
    -    addressed and modified by unpack_object_header().
    +    Note that we cannot just use the existing "offset" variable to store the
    +    value we get from pack_pos_to_offset(). It is incremented by
    +    unpack_object_header(), but we later need the original value. Since
    +    we'll no longer have revindex->offset to read it from, we'll store that
    +    in a separate variable ("header" since it points to the entry's header
    +    bytes).
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
10:  97eaa7b2d6 = 10:  569acdca7f rebuild_existing_bitmaps(): convert to new revindex API
11:  e00c434ab2 = 11:  9881637724 get_delta_base_oid(): convert to new revindex API
12:  aae01d7029 = 12:  df8bb571a5 retry_bad_packed_offset(): convert to new revindex API
13:  eab7ab1f35 ! 13:  41b2e00947 packed_object_info(): convert to new revindex API
    @@ packfile.c: int packed_object_info(struct repository *r, struct packed_git *p,
     -		*oi->disk_sizep = revidx[1].offset - obj_offset;
     +		uint32_t pos;
     +		if (offset_to_pack_pos(p, obj_offset, &pos) < 0) {
    ++			error("could not find object at offset %"PRIuMAX" "
    ++			      "in pack %s", (uintmax_t)obj_offset, p->pack_name);
     +			type = OBJ_BAD;
     +			goto out;
     +		}
14:  13c49ed40c ! 14:  8ad49d231f unpack_entry(): convert to new revindex API
    @@ Commit message
         Remove direct manipulation of the 'struct revindex_entry' type as well
         as calls to the deprecated API in 'packfile.c:unpack_entry()'. Usual
         clean-up is performed (replacing '->nr' with calls to
    -    'pack_pos_to_index()' and so on). Add an additional check to make
    -    sure that 'obj_offset()' points at a valid object.
    +    'pack_pos_to_index()' and so on).
    +
    +    Add an additional check to make sure that 'obj_offset()' points at a
    +    valid object. In the case this check is violated, we cannot call
    +    'mark_bad_packed_object()' because we don't know the OID. At the top of
    +    the call stack is do_oid_object_info_extended() (via
    +    packed_object_info()), which does mark the object.
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    @@ packfile.c: void *unpack_entry(struct repository *r, struct packed_git *p, off_t
     -			struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
     -			off_t len = revidx[1].offset - obj_offset;
     -			if (check_pack_crc(p, &w_curs, obj_offset, len, revidx->nr)) {
    -+			uint32_t pos, nr;
    ++			uint32_t pack_pos, index_pos;
     +			off_t len;
     +
    -+			if (offset_to_pack_pos(p, obj_offset, &pos) < 0) {
    ++			if (offset_to_pack_pos(p, obj_offset, &pack_pos) < 0) {
    ++				error("could not find object at offset %"PRIuMAX" in pack %s",
    ++				      (uintmax_t)obj_offset, p->pack_name);
     +				data = NULL;
     +				goto out;
     +			}
     +
    -+			len = pack_pos_to_offset(p, pos + 1) - obj_offset;
    -+			nr = pack_pos_to_index(p, pos);
    -+			if (check_pack_crc(p, &w_curs, obj_offset, len, nr)) {
    ++			len = pack_pos_to_offset(p, pack_pos + 1) - obj_offset;
    ++			index_pos = pack_pos_to_index(p, pack_pos);
    ++			if (check_pack_crc(p, &w_curs, obj_offset, len, index_pos)) {
      				struct object_id oid;
     -				nth_packed_object_id(&oid, p, revidx->nr);
    -+				nth_packed_object_id(&oid, p, nr);
    ++				nth_packed_object_id(&oid, p, index_pos);
      				error("bad packed object CRC for %s",
      				      oid_to_hex(&oid));
      				mark_bad_packed_object(p, oid.hash);
15:  a3249986f9 = 15:  e757476351 for_each_object_in_pack(): convert to new revindex API
16:  7c17db7a7d = 16:  a500311e33 builtin/gc.c: guess the size of the revindex
17:  c4c88bcc3d ! 17:  67d14da04a pack-revindex: remove unused 'find_pack_revindex()'
    @@ pack-revindex.h: struct revindex_entry {
      
     -struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs);
     -
    - int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos);
    - uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
    - off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos);
    + /*
    +  * offset_to_pack_pos converts an object offset to a pack position. This
    +  * function returns zero on success, and a negative number otherwise. The
18:  d60411d524 ! 18:  3b5c92be68 pack-revindex: remove unused 'find_revindex_position()'
    @@ Commit message
         until _after_ 'load_pack_revindex()' (which loads the index as a
         side-effect) has been called.
     
    +    Another small fix that is included is converting the upper- and
    +    lower-bounds to be unsigned's instead of ints. This dates back to
    +    92e5c77c37 (revindex: export new APIs, 2013-10-24)--ironically, the last
    +    time we introduced new APIs here--but this unifies the types.
    +
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## pack-revindex.c ##
    @@ pack-revindex.c: int load_pack_revindex(struct packed_git *p)
     -int find_revindex_position(struct packed_git *p, off_t ofs)
     +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
      {
    - 	int lo = 0;
    +-	int lo = 0;
     -	int hi = p->num_objects + 1;
     -	const struct revindex_entry *revindex = p->revindex;
    -+	int hi;
    ++	unsigned lo, hi;
     +	const struct revindex_entry *revindex;
     +
     +	if (load_pack_revindex(p) < 0)
     +		return -1;
     +
    ++	lo = 0;
     +	hi = p->num_objects + 1;
     +	revindex = p->revindex;
      
    @@ pack-revindex.c: int find_revindex_position(struct packed_git *p, off_t ofs)
     -
     -	ret = find_revindex_position(p, ofs);
     -	if (ret < 0)
    --		return -1;
    +-		return ret;
     -	*pos = ret;
     -	return 0;
     -}
    @@ pack-revindex.c: int find_revindex_position(struct packed_git *p, off_t ofs)
     
      ## pack-revindex.h ##
     @@ pack-revindex.h: struct revindex_entry {
    - };
    - 
    +  * given pack, returning zero on success and a negative value otherwise.
    +  */
      int load_pack_revindex(struct packed_git *p);
     -int find_revindex_position(struct packed_git *p, off_t ofs);
      
    - int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos);
    - uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
    + /*
    +  * offset_to_pack_pos converts an object offset to a pack position. This
19:  7c0e4acc84 ! 19:  cabafce4a1 pack-revindex: hide the definition of 'revindex_entry'
    @@ pack-revindex.h
     -	unsigned int nr;
     -};
     -
    - int load_pack_revindex(struct packed_git *p);
    - 
    - int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos);
    + /*
    +  * load_pack_revindex populates the revindex's internal data-structures for the
    +  * given pack, returning zero on success and a negative value otherwise.
20:  eada1ffcfa ! 20:  8400ff6c96 pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()'
    @@ Commit message
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## pack-revindex.c ##
    -@@ pack-revindex.c: int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
    +@@ pack-revindex.c: int load_pack_revindex(struct packed_git *p)
    + int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
      {
    - 	int lo = 0;
    - 	int hi;
    + 	unsigned lo, hi;
     -	const struct revindex_entry *revindex;
      
      	if (load_pack_revindex(p) < 0)
      		return -1;
      
    + 	lo = 0;
      	hi = p->num_objects + 1;
     -	revindex = p->revindex;
      
-- 
2.30.0.138.g6d7191ea01

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH v2 01/20] pack-revindex: introduce a new API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
@ 2021-01-13 22:23   ` Taylor Blau
  2021-01-14  6:46     ` Junio C Hamano
  2021-01-13 22:23   ` [PATCH v2 02/20] write_reuse_object(): convert to new revindex API Taylor Blau
                     ` (19 subsequent siblings)
  20 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

In the next several patches, we will prepare for loading a reverse index
either in memory (mapping the inverse of the .idx's contents in-core),
or directly from a yet-to-be-introduced on-disk format. To prepare for
that, we'll introduce an API that avoids the caller explicitly indexing
the revindex pointer in the packed_git structure.

There are four ways to interact with the reverse index. Accordingly,
four functions will be exported from 'pack-revindex.h' by the time that
the existing API is removed. A caller may:

 1. Load the pack's reverse index. This involves opening up the index,
    generating an array, and then sorting it. Since opening the index
    can fail, this function ('load_pack_revindex()') returns an int.
    Accordingly, it takes only a single argument: the 'struct
    packed_git' the caller wants to build a reverse index for.

    This function is well-suited for both the current and new API.
    Callers will have to continue to open the reverse index explicitly,
    but this function will eventually learn how to detect and load a
    reverse index from the on-disk format, if one exists. Otherwise, it
    will fallback to generating one in memory from scratch.

 2. Convert a pack position into an offset. This operation is now
    called `pack_pos_to_offset()`. It takes a pack and a position, and
    returns the corresponding off_t.

    Any error simply calls BUG(), since the callers are not well-suited
    to handle a failure and keep going.

 3. Convert a pack position into an index position. Same as above; this
    takes a pack and a position, and returns a uint32_t. This operation
    is known as `pack_pos_to_index()`. The same thinking about error
    conditions applies here as well.

 4. Find the pack position for a given offset. This operation is now
    known as `offset_to_pack_pos()`. It takes a pack, an offset, and a
    pointer to a uint32_t where the position is written, if an object
    exists at that offset. Otherwise, -1 is returned to indicate
    failure.

    Unlike some of the callers that used to access '->offset' and '->nr'
    directly, the error checking around this call is somewhat more
    robust. This is important since callers should always pass an offset
    which points at the boundary of two objects. The API, unlike direct
    access, enforces that that is the case.

    This will become important in a subsequent patch where a caller
    which does not but could check the return value treats the signed
    `-1` from `find_revindex_position()` as an index into the 'revindex'
    array.

Two design warts are carried over into the new API:

  - Asking for the index position of an out-of-bounds object will result
    in a BUG() (since no such object exists), but asking for the offset
    of the non-existent object at the end of the pack returns the total
    size of the pack.

    This makes it convenient for callers who always want to take the
    difference of two adjacent object's offsets (to compute the on-disk
    size) but don't want to worry about boundaries at the end of the
    pack.

  - offset_to_pack_pos() lazily loads the reverse index, but
    pack_pos_to_index() doesn't (callers of the former are well-suited
    to handle errors, but callers of the latter are not).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 32 +++++++++++++++++++++++++++++
 pack-revindex.h | 54 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+)

diff --git a/pack-revindex.c b/pack-revindex.c
index ecdde39cf4..0ca3b54b45 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -203,3 +203,35 @@ struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs)
 
 	return p->revindex + pos;
 }
+
+int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
+{
+	int ret;
+
+	if (load_pack_revindex(p) < 0)
+		return -1;
+
+	ret = find_revindex_position(p, ofs);
+	if (ret < 0)
+		return ret;
+	*pos = ret;
+	return 0;
+}
+
+uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
+{
+	if (!p->revindex)
+		BUG("pack_pos_to_index: reverse index not yet loaded");
+	if (p->num_objects <= pos)
+		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
+	return p->revindex[pos].nr;
+}
+
+off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos)
+{
+	if (!p->revindex)
+		BUG("pack_pos_to_index: reverse index not yet loaded");
+	if (p->num_objects < pos)
+		BUG("pack_pos_to_offset: out-of-bounds object at %"PRIu32, pos);
+	return p->revindex[pos].offset;
+}
diff --git a/pack-revindex.h b/pack-revindex.h
index 848331d5d6..5a218aaa66 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -1,6 +1,21 @@
 #ifndef PACK_REVINDEX_H
 #define PACK_REVINDEX_H
 
+/**
+ * A revindex allows converting efficiently between three properties
+ * of an object within a pack:
+ *
+ * - index position: the numeric position within the list of sorted object ids
+ *   found in the .idx file
+ *
+ * - pack position: the numeric position within the list of objects in their
+ *   order within the actual .pack file (i.e., 0 is the first object in the
+ *   .pack, 1 is the second, and so on)
+ *
+ * - offset: the byte offset within the .pack file at which the object contents
+ *   can be found
+ */
+
 struct packed_git;
 
 struct revindex_entry {
@@ -8,9 +23,48 @@ struct revindex_entry {
 	unsigned int nr;
 };
 
+/*
+ * load_pack_revindex populates the revindex's internal data-structures for the
+ * given pack, returning zero on success and a negative value otherwise.
+ */
 int load_pack_revindex(struct packed_git *p);
 int find_revindex_position(struct packed_git *p, off_t ofs);
 
 struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs);
 
+/*
+ * offset_to_pack_pos converts an object offset to a pack position. This
+ * function returns zero on success, and a negative number otherwise. The
+ * parameter 'pos' is usable only on success.
+ *
+ * If the reverse index has not yet been loaded, this function loads it lazily,
+ * and returns an negative number if an error was encountered.
+ *
+ * This function runs in time O(log N) with the number of objects in the pack.
+ */
+int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos);
+
+/*
+ * pack_pos_to_index converts the given pack-relative position 'pos' by
+ * returning an index-relative position.
+ *
+ * If the reverse index has not yet been loaded, or the position is out of
+ * bounds, this function aborts.
+ *
+ * This function runs in constant time.
+ */
+uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
+
+/*
+ * pack_pos_to_offset converts the given pack-relative position 'pos' into a
+ * pack offset. For a pack with 'N' objects, asking for position 'N' will return
+ * the total size (in bytes) of the pack.
+ *
+ * If the reverse index has not yet been loaded, or the position is out of
+ * bounds, this function aborts.
+ *
+ * This function runs in constant time.
+ */
+off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos);
+
 #endif
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 02/20] write_reuse_object(): convert to new revindex API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
  2021-01-13 22:23   ` [PATCH v2 01/20] pack-revindex: introduce a new API Taylor Blau
@ 2021-01-13 22:23   ` Taylor Blau
  2021-01-13 22:23   ` [PATCH v2 03/20] write_reused_pack_one(): " Taylor Blau
                     ` (18 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

First replace 'find_pack_revindex()' with its replacement
'offset_to_pack_pos()'. This prevents any bogus OFS_DELTA that may make
its way through until 'write_reuse_object()' from causing a bad memory
read (if 'revidx' is 'NULL')

Next, replace a direct access of '->nr' with the wrapper function
'pack_pos_to_index()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2a00358f34..ab1fd853f1 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -419,7 +419,7 @@ static off_t write_reuse_object(struct hashfile *f, struct object_entry *entry,
 {
 	struct packed_git *p = IN_PACK(entry);
 	struct pack_window *w_curs = NULL;
-	struct revindex_entry *revidx;
+	uint32_t pos;
 	off_t offset;
 	enum object_type type = oe_type(entry);
 	off_t datalen;
@@ -436,10 +436,15 @@ static off_t write_reuse_object(struct hashfile *f, struct object_entry *entry,
 					      type, entry_size);
 
 	offset = entry->in_pack_offset;
-	revidx = find_pack_revindex(p, offset);
-	datalen = revidx[1].offset - offset;
+	if (offset_to_pack_pos(p, offset, &pos) < 0)
+		die(_("write_reuse_object: could not locate %s, expected at "
+		      "offset %"PRIuMAX" in pack %s"),
+		    oid_to_hex(&entry->idx.oid), (uintmax_t)offset,
+		    p->pack_name);
+	datalen = pack_pos_to_offset(p, pos + 1) - offset;
 	if (!pack_to_stdout && p->index_version > 1 &&
-	    check_pack_crc(p, &w_curs, offset, datalen, revidx->nr)) {
+	    check_pack_crc(p, &w_curs, offset, datalen,
+			   pack_pos_to_index(p, pos))) {
 		error(_("bad packed object CRC for %s"),
 		      oid_to_hex(&entry->idx.oid));
 		unuse_pack(&w_curs);
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 03/20] write_reused_pack_one(): convert to new revindex API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
  2021-01-13 22:23   ` [PATCH v2 01/20] pack-revindex: introduce a new API Taylor Blau
  2021-01-13 22:23   ` [PATCH v2 02/20] write_reuse_object(): convert to new revindex API Taylor Blau
@ 2021-01-13 22:23   ` Taylor Blau
  2021-01-13 22:23   ` [PATCH v2 04/20] write_reused_pack_verbatim(): " Taylor Blau
                     ` (17 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Replace direct revindex accesses with calls to 'pack_pos_to_offset()'
and 'pack_pos_to_index()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index ab1fd853f1..8e40b19ee8 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -868,8 +868,8 @@ static void write_reused_pack_one(size_t pos, struct hashfile *out,
 	enum object_type type;
 	unsigned long size;
 
-	offset = reuse_packfile->revindex[pos].offset;
-	next = reuse_packfile->revindex[pos + 1].offset;
+	offset = pack_pos_to_offset(reuse_packfile, pos);
+	next = pack_pos_to_offset(reuse_packfile, pos + 1);
 
 	record_reused_object(offset, offset - hashfile_total(out));
 
@@ -889,11 +889,17 @@ static void write_reused_pack_one(size_t pos, struct hashfile *out,
 
 		/* Convert to REF_DELTA if we must... */
 		if (!allow_ofs_delta) {
-			int base_pos = find_revindex_position(reuse_packfile, base_offset);
+			uint32_t base_pos;
 			struct object_id base_oid;
 
+			if (offset_to_pack_pos(reuse_packfile, base_offset, &base_pos) < 0)
+				die(_("expected object at offset %"PRIuMAX" "
+				      "in pack %s"),
+				    (uintmax_t)base_offset,
+				    reuse_packfile->pack_name);
+
 			nth_packed_object_id(&base_oid, reuse_packfile,
-					     reuse_packfile->revindex[base_pos].nr);
+					     pack_pos_to_index(reuse_packfile, base_pos));
 
 			len = encode_in_pack_object_header(header, sizeof(header),
 							   OBJ_REF_DELTA, size);
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 04/20] write_reused_pack_verbatim(): convert to new revindex API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (2 preceding siblings ...)
  2021-01-13 22:23   ` [PATCH v2 03/20] write_reused_pack_one(): " Taylor Blau
@ 2021-01-13 22:23   ` Taylor Blau
  2021-01-13 22:23   ` [PATCH v2 05/20] check_object(): " Taylor Blau
                     ` (16 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Replace a direct access to the revindex array with
'pack_pos_to_offset()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 8e40b19ee8..77ce5583a2 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -952,7 +952,7 @@ static size_t write_reused_pack_verbatim(struct hashfile *out,
 		off_t to_write;
 
 		written = (pos * BITS_IN_EWORD);
-		to_write = reuse_packfile->revindex[written].offset
+		to_write = pack_pos_to_offset(reuse_packfile, written)
 			- sizeof(struct pack_header);
 
 		/* We're recording one chunk, not one object. */
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 05/20] check_object(): convert to new revindex API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (3 preceding siblings ...)
  2021-01-13 22:23   ` [PATCH v2 04/20] write_reused_pack_verbatim(): " Taylor Blau
@ 2021-01-13 22:23   ` Taylor Blau
  2021-01-13 22:23   ` [PATCH v2 06/20] bitmap_position_packfile(): " Taylor Blau
                     ` (15 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Replace direct accesses to the revindex with calls to
'offset_to_pack_pos()' and 'pack_pos_to_index()'.

Since this caller already had some error checking (it can jump to the
'give_up' label if it encounters an error), we can easily check whether
or not the provided offset points to an object in the given pack. This
error checking existed prior to this patch, too, since the caller checks
whether the return value from 'find_pack_revindex()' was NULL or not.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 77ce5583a2..5b0c4489e2 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1817,11 +1817,11 @@ static void check_object(struct object_entry *entry, uint32_t object_index)
 				goto give_up;
 			}
 			if (reuse_delta && !entry->preferred_base) {
-				struct revindex_entry *revidx;
-				revidx = find_pack_revindex(p, ofs);
-				if (!revidx)
+				uint32_t pos;
+				if (offset_to_pack_pos(p, ofs, &pos) < 0)
 					goto give_up;
-				if (!nth_packed_object_id(&base_ref, p, revidx->nr))
+				if (!nth_packed_object_id(&base_ref, p,
+							  pack_pos_to_index(p, pos)))
 					have_base = 1;
 			}
 			entry->in_pack_header_size = used + used_0;
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 06/20] bitmap_position_packfile(): convert to new revindex API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (4 preceding siblings ...)
  2021-01-13 22:23   ` [PATCH v2 05/20] check_object(): " Taylor Blau
@ 2021-01-13 22:23   ` Taylor Blau
  2021-01-13 22:23   ` [PATCH v2 07/20] show_objects_for_type(): " Taylor Blau
                     ` (14 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Replace find_revindex_position() with its counterpart in the new API,
offset_to_pack_pos().

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index d88745fb02..d6861ddd4d 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -407,11 +407,14 @@ static inline int bitmap_position_extended(struct bitmap_index *bitmap_git,
 static inline int bitmap_position_packfile(struct bitmap_index *bitmap_git,
 					   const struct object_id *oid)
 {
+	uint32_t pos;
 	off_t offset = find_pack_entry_one(oid->hash, bitmap_git->pack);
 	if (!offset)
 		return -1;
 
-	return find_revindex_position(bitmap_git->pack, offset);
+	if (offset_to_pack_pos(bitmap_git->pack, offset, &pos) < 0)
+		return -1;
+	return pos;
 }
 
 static int bitmap_position(struct bitmap_index *bitmap_git,
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 07/20] show_objects_for_type(): convert to new revindex API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (5 preceding siblings ...)
  2021-01-13 22:23   ` [PATCH v2 06/20] bitmap_position_packfile(): " Taylor Blau
@ 2021-01-13 22:23   ` Taylor Blau
  2021-01-13 22:24   ` [PATCH v2 08/20] get_size_by_pos(): " Taylor Blau
                     ` (13 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:23 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Avoid storing the revindex entry directly, since this structure will
soon be removed from the public interface. Instead, store the offset and
index position by calling 'pack_pos_to_offset()' and
'pack_pos_to_index()', respectively.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index d6861ddd4d..27a7a8ac4c 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -711,21 +711,22 @@ static void show_objects_for_type(
 
 		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
 			struct object_id oid;
-			struct revindex_entry *entry;
-			uint32_t hash = 0;
+			uint32_t hash = 0, index_pos;
+			off_t ofs;
 
 			if ((word >> offset) == 0)
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
 
-			entry = &bitmap_git->pack->revindex[pos + offset];
-			nth_packed_object_id(&oid, bitmap_git->pack, entry->nr);
+			index_pos = pack_pos_to_index(bitmap_git->pack, pos + offset);
+			ofs = pack_pos_to_offset(bitmap_git->pack, pos + offset);
+			nth_packed_object_id(&oid, bitmap_git->pack, index_pos);
 
 			if (bitmap_git->hashes)
-				hash = get_be32(bitmap_git->hashes + entry->nr);
+				hash = get_be32(bitmap_git->hashes + index_pos);
 
-			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, entry->offset);
+			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, ofs);
 		}
 	}
 }
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 08/20] get_size_by_pos(): convert to new revindex API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (6 preceding siblings ...)
  2021-01-13 22:23   ` [PATCH v2 07/20] show_objects_for_type(): " Taylor Blau
@ 2021-01-13 22:24   ` Taylor Blau
  2021-01-13 22:24   ` [PATCH v2 09/20] try_partial_reuse(): " Taylor Blau
                     ` (12 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:24 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Remove another caller that holds onto a 'struct revindex_entry' by
replacing the direct indexing with calls to 'pack_pos_to_offset()' and
'pack_pos_to_index()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 27a7a8ac4c..89a528a91b 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -835,11 +835,11 @@ static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
 	oi.sizep = &size;
 
 	if (pos < pack->num_objects) {
-		struct revindex_entry *entry = &pack->revindex[pos];
-		if (packed_object_info(the_repository, pack,
-				       entry->offset, &oi) < 0) {
+		off_t ofs = pack_pos_to_offset(pack, pos);
+		if (packed_object_info(the_repository, pack, ofs, &oi) < 0) {
 			struct object_id oid;
-			nth_packed_object_id(&oid, pack, entry->nr);
+			nth_packed_object_id(&oid, pack,
+					     pack_pos_to_index(pack, pos));
 			die(_("unable to get size of %s"), oid_to_hex(&oid));
 		}
 	} else {
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 09/20] try_partial_reuse(): convert to new revindex API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (7 preceding siblings ...)
  2021-01-13 22:24   ` [PATCH v2 08/20] get_size_by_pos(): " Taylor Blau
@ 2021-01-13 22:24   ` Taylor Blau
  2021-01-13 22:24   ` [PATCH v2 10/20] rebuild_existing_bitmaps(): " Taylor Blau
                     ` (11 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:24 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Remove another instance of direct revindex manipulation by calling
'pack_pos_to_offset()' instead (the caller here does not care about the
index position of the object at position 'pos').

Note that we cannot just use the existing "offset" variable to store the
value we get from pack_pos_to_offset(). It is incremented by
unpack_object_header(), but we later need the original value. Since
we'll no longer have revindex->offset to read it from, we'll store that
in a separate variable ("header" since it points to the entry's header
bytes).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 89a528a91b..1fdf7ce20a 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1069,23 +1069,21 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 			      struct bitmap *reuse,
 			      struct pack_window **w_curs)
 {
-	struct revindex_entry *revidx;
-	off_t offset;
+	off_t offset, header;
 	enum object_type type;
 	unsigned long size;
 
 	if (pos >= bitmap_git->pack->num_objects)
 		return; /* not actually in the pack */
 
-	revidx = &bitmap_git->pack->revindex[pos];
-	offset = revidx->offset;
+	offset = header = pack_pos_to_offset(bitmap_git->pack, pos);
 	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
 	if (type < 0)
 		return; /* broken packfile, punt */
 
 	if (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA) {
 		off_t base_offset;
-		int base_pos;
+		uint32_t base_pos;
 
 		/*
 		 * Find the position of the base object so we can look it up
@@ -1096,11 +1094,10 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 		 * more detail.
 		 */
 		base_offset = get_delta_base(bitmap_git->pack, w_curs,
-					     &offset, type, revidx->offset);
+					     &offset, type, header);
 		if (!base_offset)
 			return;
-		base_pos = find_revindex_position(bitmap_git->pack, base_offset);
-		if (base_pos < 0)
+		if (offset_to_pack_pos(bitmap_git->pack, base_offset, &base_pos) < 0)
 			return;
 
 		/*
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 10/20] rebuild_existing_bitmaps(): convert to new revindex API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (8 preceding siblings ...)
  2021-01-13 22:24   ` [PATCH v2 09/20] try_partial_reuse(): " Taylor Blau
@ 2021-01-13 22:24   ` Taylor Blau
  2021-01-13 22:24   ` [PATCH v2 11/20] get_delta_base_oid(): " Taylor Blau
                     ` (10 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:24 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Remove another instance of looking at the revindex directly by instead
calling 'pack_pos_to_index()'. Unlike other patches, this caller only
cares about the index position of each object in the loop.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 1fdf7ce20a..60fe20fb87 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1392,11 +1392,10 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
 
 	for (i = 0; i < num_objects; ++i) {
 		struct object_id oid;
-		struct revindex_entry *entry;
 		struct object_entry *oe;
 
-		entry = &bitmap_git->pack->revindex[i];
-		nth_packed_object_id(&oid, bitmap_git->pack, entry->nr);
+		nth_packed_object_id(&oid, bitmap_git->pack,
+				     pack_pos_to_index(bitmap_git->pack, i));
 		oe = packlist_find(mapping, &oid);
 
 		if (oe)
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 11/20] get_delta_base_oid(): convert to new revindex API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (9 preceding siblings ...)
  2021-01-13 22:24   ` [PATCH v2 10/20] rebuild_existing_bitmaps(): " Taylor Blau
@ 2021-01-13 22:24   ` Taylor Blau
  2021-01-13 22:24   ` [PATCH v2 12/20] retry_bad_packed_offset(): " Taylor Blau
                     ` (9 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:24 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Replace direct accesses to the 'struct revindex' type with a call to
'pack_pos_to_index()'.

Likewise drop the old-style 'find_pack_revindex()' with its replacement
'offset_to_pack_pos()' (while continuing to perform the same error
checking).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/packfile.c b/packfile.c
index 86f5c8dbf6..3e3f391949 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1235,18 +1235,18 @@ static int get_delta_base_oid(struct packed_git *p,
 		oidread(oid, base);
 		return 0;
 	} else if (type == OBJ_OFS_DELTA) {
-		struct revindex_entry *revidx;
+		uint32_t base_pos;
 		off_t base_offset = get_delta_base(p, w_curs, &curpos,
 						   type, delta_obj_offset);
 
 		if (!base_offset)
 			return -1;
 
-		revidx = find_pack_revindex(p, base_offset);
-		if (!revidx)
+		if (offset_to_pack_pos(p, base_offset, &base_pos) < 0)
 			return -1;
 
-		return nth_packed_object_id(oid, p, revidx->nr);
+		return nth_packed_object_id(oid, p,
+					    pack_pos_to_index(p, base_pos));
 	} else
 		return -1;
 }
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 12/20] retry_bad_packed_offset(): convert to new revindex API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (10 preceding siblings ...)
  2021-01-13 22:24   ` [PATCH v2 11/20] get_delta_base_oid(): " Taylor Blau
@ 2021-01-13 22:24   ` Taylor Blau
  2021-01-13 22:24   ` [PATCH v2 13/20] packed_object_info(): " Taylor Blau
                     ` (8 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:24 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Perform exactly the same conversion as in the previous commit to another
caller within 'packfile.c'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/packfile.c b/packfile.c
index 3e3f391949..7c37f9ec5c 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1256,12 +1256,11 @@ static int retry_bad_packed_offset(struct repository *r,
 				   off_t obj_offset)
 {
 	int type;
-	struct revindex_entry *revidx;
+	uint32_t pos;
 	struct object_id oid;
-	revidx = find_pack_revindex(p, obj_offset);
-	if (!revidx)
+	if (offset_to_pack_pos(p, obj_offset, &pos) < 0)
 		return OBJ_BAD;
-	nth_packed_object_id(&oid, p, revidx->nr);
+	nth_packed_object_id(&oid, p, pack_pos_to_index(p, pos));
 	mark_bad_packed_object(p, oid.hash);
 	type = oid_object_info(r, &oid, NULL);
 	if (type <= OBJ_NONE)
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 13/20] packed_object_info(): convert to new revindex API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (11 preceding siblings ...)
  2021-01-13 22:24   ` [PATCH v2 12/20] retry_bad_packed_offset(): " Taylor Blau
@ 2021-01-13 22:24   ` Taylor Blau
  2021-01-13 22:24   ` [PATCH v2 14/20] unpack_entry(): " Taylor Blau
                     ` (7 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:24 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Convert another call of 'find_pack_revindex()' to its replacement
'pack_pos_to_offset()'. Likewise:

  - Avoid manipulating `struct packed_git`'s `revindex` pointer directly
    by removing the pointer-as-array indexing.

  - Add an additional guard to check that the offset 'obj_offset()'
    points to a real object. This should be the case with well-behaved
    callers to 'packed_object_info()', but isn't guarenteed.

    Other blocks that fill in various other values from the 'struct
    object_info' request handle bad inputs by setting the type to
    'OBJ_BAD' and jumping to 'out'. Do the same when given a bad offset
    here.

    The previous code would have segfaulted when given a bad
    'obj_offset' value, since 'find_pack_revindex()' would return
    'NULL', and then the line that fills 'oi->disk_sizep' would try to
    access 'NULL[1]' with a stride of 16 bytes (the width of 'struct
    revindex_entry)'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/packfile.c b/packfile.c
index 7c37f9ec5c..bb4bb14671 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1537,8 +1537,15 @@ int packed_object_info(struct repository *r, struct packed_git *p,
 	}
 
 	if (oi->disk_sizep) {
-		struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
-		*oi->disk_sizep = revidx[1].offset - obj_offset;
+		uint32_t pos;
+		if (offset_to_pack_pos(p, obj_offset, &pos) < 0) {
+			error("could not find object at offset %"PRIuMAX" "
+			      "in pack %s", (uintmax_t)obj_offset, p->pack_name);
+			type = OBJ_BAD;
+			goto out;
+		}
+
+		*oi->disk_sizep = pack_pos_to_offset(p, pos + 1) - obj_offset;
 	}
 
 	if (oi->typep || oi->type_name) {
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 14/20] unpack_entry(): convert to new revindex API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (12 preceding siblings ...)
  2021-01-13 22:24   ` [PATCH v2 13/20] packed_object_info(): " Taylor Blau
@ 2021-01-13 22:24   ` Taylor Blau
  2021-01-13 22:24   ` [PATCH v2 15/20] for_each_object_in_pack(): " Taylor Blau
                     ` (6 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:24 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Remove direct manipulation of the 'struct revindex_entry' type as well
as calls to the deprecated API in 'packfile.c:unpack_entry()'. Usual
clean-up is performed (replacing '->nr' with calls to
'pack_pos_to_index()' and so on).

Add an additional check to make sure that 'obj_offset()' points at a
valid object. In the case this check is violated, we cannot call
'mark_bad_packed_object()' because we don't know the OID. At the top of
the call stack is do_oid_object_info_extended() (via
packed_object_info()), which does mark the object.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/packfile.c b/packfile.c
index bb4bb14671..936ab3def5 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1694,11 +1694,21 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
 		}
 
 		if (do_check_packed_object_crc && p->index_version > 1) {
-			struct revindex_entry *revidx = find_pack_revindex(p, obj_offset);
-			off_t len = revidx[1].offset - obj_offset;
-			if (check_pack_crc(p, &w_curs, obj_offset, len, revidx->nr)) {
+			uint32_t pack_pos, index_pos;
+			off_t len;
+
+			if (offset_to_pack_pos(p, obj_offset, &pack_pos) < 0) {
+				error("could not find object at offset %"PRIuMAX" in pack %s",
+				      (uintmax_t)obj_offset, p->pack_name);
+				data = NULL;
+				goto out;
+			}
+
+			len = pack_pos_to_offset(p, pack_pos + 1) - obj_offset;
+			index_pos = pack_pos_to_index(p, pack_pos);
+			if (check_pack_crc(p, &w_curs, obj_offset, len, index_pos)) {
 				struct object_id oid;
-				nth_packed_object_id(&oid, p, revidx->nr);
+				nth_packed_object_id(&oid, p, index_pos);
 				error("bad packed object CRC for %s",
 				      oid_to_hex(&oid));
 				mark_bad_packed_object(p, oid.hash);
@@ -1781,11 +1791,11 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
 			 * This is costly but should happen only in the presence
 			 * of a corrupted pack, and is better than failing outright.
 			 */
-			struct revindex_entry *revidx;
+			uint32_t pos;
 			struct object_id base_oid;
-			revidx = find_pack_revindex(p, obj_offset);
-			if (revidx) {
-				nth_packed_object_id(&base_oid, p, revidx->nr);
+			if (!(offset_to_pack_pos(p, obj_offset, &pos))) {
+				nth_packed_object_id(&base_oid, p,
+						     pack_pos_to_index(p, pos));
 				error("failed to read delta base object %s"
 				      " at offset %"PRIuMAX" from %s",
 				      oid_to_hex(&base_oid), (uintmax_t)obj_offset,
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 15/20] for_each_object_in_pack(): convert to new revindex API
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (13 preceding siblings ...)
  2021-01-13 22:24   ` [PATCH v2 14/20] unpack_entry(): " Taylor Blau
@ 2021-01-13 22:24   ` Taylor Blau
  2021-01-14  6:43     ` Junio C Hamano
  2021-01-13 22:24   ` [PATCH v2 16/20] builtin/gc.c: guess the size of the revindex Taylor Blau
                     ` (5 subsequent siblings)
  20 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:24 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Avoid looking at the 'revindex' pointer directly and instead call
'pack_pos_to_index()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packfile.c b/packfile.c
index 936ab3def5..7bb1750934 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2086,7 +2086,7 @@ int for_each_object_in_pack(struct packed_git *p,
 		struct object_id oid;
 
 		if (flags & FOR_EACH_OBJECT_PACK_ORDER)
-			pos = p->revindex[i].nr;
+			pos = pack_pos_to_index(p, i);
 		else
 			pos = i;
 
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 16/20] builtin/gc.c: guess the size of the revindex
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (14 preceding siblings ...)
  2021-01-13 22:24   ` [PATCH v2 15/20] for_each_object_in_pack(): " Taylor Blau
@ 2021-01-13 22:24   ` Taylor Blau
  2021-01-14  6:33     ` Junio C Hamano
  2021-01-13 22:24   ` [PATCH v2 17/20] pack-revindex: remove unused 'find_pack_revindex()' Taylor Blau
                     ` (4 subsequent siblings)
  20 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:24 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

'estimate_repack_memory()' takes into account the amount of memory
required to load the reverse index in memory by multiplying the assumed
number of objects by the size of the 'revindex_entry' struct.

Prepare for hiding the definition of 'struct revindex_entry' by removing
a 'sizeof()' of that type from outside of pack-revindex.c. Instead,
guess that one off_t and one uint32_t are required per object. Strictly
speaking, this is a worse guess than asking for 'sizeof(struct
revindex_entry)' directly, since the true size of this struct is 16
bytes with padding on the end of the struct in order to align the offset
field.

But, this is an approximation anyway, and it does remove a use of the
'struct revindex_entry' from outside of pack-revindex internals.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/gc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 4c24f41852..c60811f212 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -301,7 +301,7 @@ static uint64_t estimate_repack_memory(struct packed_git *pack)
 	/* and then obj_hash[], underestimated in fact */
 	heap += sizeof(struct object *) * nr_objects;
 	/* revindex is used also */
-	heap += sizeof(struct revindex_entry) * nr_objects;
+	heap += (sizeof(off_t) + sizeof(uint32_t)) * nr_objects;
 	/*
 	 * read_sha1_file() (either at delta calculation phase, or
 	 * writing phase) also fills up the delta base cache
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 17/20] pack-revindex: remove unused 'find_pack_revindex()'
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (15 preceding siblings ...)
  2021-01-13 22:24   ` [PATCH v2 16/20] builtin/gc.c: guess the size of the revindex Taylor Blau
@ 2021-01-13 22:24   ` Taylor Blau
  2021-01-13 22:25   ` [PATCH v2 18/20] pack-revindex: remove unused 'find_revindex_position()' Taylor Blau
                     ` (3 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:24 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Now that no callers of 'find_pack_revindex()' remain, remove the
function's declaration and implementation entirely.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 15 ---------------
 pack-revindex.h |  2 --
 2 files changed, 17 deletions(-)

diff --git a/pack-revindex.c b/pack-revindex.c
index 0ca3b54b45..16baafb281 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -189,21 +189,6 @@ int find_revindex_position(struct packed_git *p, off_t ofs)
 	return -1;
 }
 
-struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs)
-{
-	int pos;
-
-	if (load_pack_revindex(p))
-		return NULL;
-
-	pos = find_revindex_position(p, ofs);
-
-	if (pos < 0)
-		return NULL;
-
-	return p->revindex + pos;
-}
-
 int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
 {
 	int ret;
diff --git a/pack-revindex.h b/pack-revindex.h
index 5a218aaa66..f7094ba9a5 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -30,8 +30,6 @@ struct revindex_entry {
 int load_pack_revindex(struct packed_git *p);
 int find_revindex_position(struct packed_git *p, off_t ofs);
 
-struct revindex_entry *find_pack_revindex(struct packed_git *p, off_t ofs);
-
 /*
  * offset_to_pack_pos converts an object offset to a pack position. This
  * function returns zero on success, and a negative number otherwise. The
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 18/20] pack-revindex: remove unused 'find_revindex_position()'
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (16 preceding siblings ...)
  2021-01-13 22:24   ` [PATCH v2 17/20] pack-revindex: remove unused 'find_pack_revindex()' Taylor Blau
@ 2021-01-13 22:25   ` Taylor Blau
  2021-01-14  6:42     ` Junio C Hamano
  2021-01-13 22:25   ` [PATCH v2 19/20] pack-revindex: hide the definition of 'revindex_entry' Taylor Blau
                     ` (2 subsequent siblings)
  20 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:25 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Now that all 'find_revindex_position()' callers have been removed (and
converted to the more descriptive 'offset_to_pack_pos()'), it is almost
safe to get rid of 'find_revindex_position()' entirely. Almost, except
for the fact that 'offset_to_pack_pos()' calls
'find_revindex_position()'.

Inline 'find_revindex_position()' into 'offset_to_pack_pos()', and
then remove 'find_revindex_position()' entirely.

This is a straightforward refactoring with one minor snag.
'offset_to_pack_pos()' used to load the index before calling
'find_revindex_position()'. That means that by the time
'find_revindex_position()' starts executing, 'p->num_objects' can be
safely read. After inlining, be careful to not read 'p->num_objects'
until _after_ 'load_pack_revindex()' (which loads the index as a
side-effect) has been called.

Another small fix that is included is converting the upper- and
lower-bounds to be unsigned's instead of ints. This dates back to
92e5c77c37 (revindex: export new APIs, 2013-10-24)--ironically, the last
time we introduced new APIs here--but this unifies the types.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 31 ++++++++++++-------------------
 pack-revindex.h |  1 -
 2 files changed, 12 insertions(+), 20 deletions(-)

diff --git a/pack-revindex.c b/pack-revindex.c
index 16baafb281..282fe92640 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -169,16 +169,23 @@ int load_pack_revindex(struct packed_git *p)
 	return 0;
 }
 
-int find_revindex_position(struct packed_git *p, off_t ofs)
+int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
 {
-	int lo = 0;
-	int hi = p->num_objects + 1;
-	const struct revindex_entry *revindex = p->revindex;
+	unsigned lo, hi;
+	const struct revindex_entry *revindex;
+
+	if (load_pack_revindex(p) < 0)
+		return -1;
+
+	lo = 0;
+	hi = p->num_objects + 1;
+	revindex = p->revindex;
 
 	do {
 		const unsigned mi = lo + (hi - lo) / 2;
 		if (revindex[mi].offset == ofs) {
-			return mi;
+			*pos = mi;
+			return 0;
 		} else if (ofs < revindex[mi].offset)
 			hi = mi;
 		else
@@ -189,20 +196,6 @@ int find_revindex_position(struct packed_git *p, off_t ofs)
 	return -1;
 }
 
-int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
-{
-	int ret;
-
-	if (load_pack_revindex(p) < 0)
-		return -1;
-
-	ret = find_revindex_position(p, ofs);
-	if (ret < 0)
-		return ret;
-	*pos = ret;
-	return 0;
-}
-
 uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
 {
 	if (!p->revindex)
diff --git a/pack-revindex.h b/pack-revindex.h
index f7094ba9a5..746776be7f 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -28,7 +28,6 @@ struct revindex_entry {
  * given pack, returning zero on success and a negative value otherwise.
  */
 int load_pack_revindex(struct packed_git *p);
-int find_revindex_position(struct packed_git *p, off_t ofs);
 
 /*
  * offset_to_pack_pos converts an object offset to a pack position. This
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 19/20] pack-revindex: hide the definition of 'revindex_entry'
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (17 preceding siblings ...)
  2021-01-13 22:25   ` [PATCH v2 18/20] pack-revindex: remove unused 'find_revindex_position()' Taylor Blau
@ 2021-01-13 22:25   ` Taylor Blau
  2021-01-13 22:25   ` [PATCH v2 20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()' Taylor Blau
  2021-01-14 19:51   ` [PATCH v2 00/20] pack-revindex: prepare for on-disk reverse index Jeff King
  20 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:25 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Now that all spots outside of pack-revindex.c that reference 'struct
revindex_entry' directly have been removed, it is safe to hide the
implementation by moving it from pack-revindex.h to pack-revindex.c.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 5 +++++
 pack-revindex.h | 5 -----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/pack-revindex.c b/pack-revindex.c
index 282fe92640..a508d5f0a4 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -3,6 +3,11 @@
 #include "object-store.h"
 #include "packfile.h"
 
+struct revindex_entry {
+	off_t offset;
+	unsigned int nr;
+};
+
 /*
  * Pack index for existing packs give us easy access to the offsets into
  * corresponding pack file where each object's data starts, but the entries
diff --git a/pack-revindex.h b/pack-revindex.h
index 746776be7f..6e0320b08b 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -18,11 +18,6 @@
 
 struct packed_git;
 
-struct revindex_entry {
-	off_t offset;
-	unsigned int nr;
-};
-
 /*
  * load_pack_revindex populates the revindex's internal data-structures for the
  * given pack, returning zero on success and a negative value otherwise.
-- 
2.30.0.138.g6d7191ea01


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH v2 20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()'
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (18 preceding siblings ...)
  2021-01-13 22:25   ` [PATCH v2 19/20] pack-revindex: hide the definition of 'revindex_entry' Taylor Blau
@ 2021-01-13 22:25   ` Taylor Blau
  2021-01-14  6:42     ` Junio C Hamano
  2021-01-14 19:51   ` [PATCH v2 00/20] pack-revindex: prepare for on-disk reverse index Jeff King
  20 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-13 22:25 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

To prepare for on-disk reverse indexes, remove a spot in
'offset_to_pack_pos()' that looks at the 'revindex' array in 'struct
packed_git'.

Even though this use of the revindex pointer is within pack-revindex.c,
this clean up is still worth doing. Since the 'revindex' pointer will be
NULL when reading from an on-disk reverse index (instead the
'revindex_data' pointer will be mmaped to the 'pack-*.rev' file), this
call-site would have to include a conditional to lookup the offset for
position 'mi' each iteration through the search.

So instead of open-coding 'pack_pos_to_offset()', call it directly from
within 'offset_to_pack_pos()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/pack-revindex.c b/pack-revindex.c
index a508d5f0a4..5e69bc7372 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -177,21 +177,21 @@ int load_pack_revindex(struct packed_git *p)
 int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
 {
 	unsigned lo, hi;
-	const struct revindex_entry *revindex;
 
 	if (load_pack_revindex(p) < 0)
 		return -1;
 
 	lo = 0;
 	hi = p->num_objects + 1;
-	revindex = p->revindex;
 
 	do {
 		const unsigned mi = lo + (hi - lo) / 2;
-		if (revindex[mi].offset == ofs) {
+		off_t got = pack_pos_to_offset(p, mi);
+
+		if (got == ofs) {
 			*pos = mi;
 			return 0;
-		} else if (ofs < revindex[mi].offset)
+		} else if (ofs < got)
 			hi = mi;
 		else
 			lo = mi + 1;
-- 
2.30.0.138.g6d7191ea01

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 16/20] builtin/gc.c: guess the size of the revindex
  2021-01-13 22:24   ` [PATCH v2 16/20] builtin/gc.c: guess the size of the revindex Taylor Blau
@ 2021-01-14  6:33     ` Junio C Hamano
  2021-01-14 16:53       ` Taylor Blau
  0 siblings, 1 reply; 121+ messages in thread
From: Junio C Hamano @ 2021-01-14  6:33 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, jrnieder, peff

Taylor Blau <me@ttaylorr.com> writes:

> 'estimate_repack_memory()' takes into account the amount of memory
> required to load the reverse index in memory by multiplying the assumed
> number of objects by the size of the 'revindex_entry' struct.
>
> Prepare for hiding the definition of 'struct revindex_entry' by removing
> a 'sizeof()' of that type from outside of pack-revindex.c. Instead,
> guess that one off_t and one uint32_t are required per object. Strictly
> speaking, this is a worse guess than asking for 'sizeof(struct
> revindex_entry)' directly, since the true size of this struct is 16
> bytes with padding on the end of the struct in order to align the offset
> field.

Meaning that we under-estimate by 25%?

> But, this is an approximation anyway, and it does remove a use of the
> 'struct revindex_entry' from outside of pack-revindex internals.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  builtin/gc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 4c24f41852..c60811f212 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -301,7 +301,7 @@ static uint64_t estimate_repack_memory(struct packed_git *pack)
>  	/* and then obj_hash[], underestimated in fact */
>  	heap += sizeof(struct object *) * nr_objects;
>  	/* revindex is used also */
> -	heap += sizeof(struct revindex_entry) * nr_objects;
> +	heap += (sizeof(off_t) + sizeof(uint32_t)) * nr_objects;
>  	/*
>  	 * read_sha1_file() (either at delta calculation phase, or
>  	 * writing phase) also fills up the delta base cache

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()'
  2021-01-13 22:25   ` [PATCH v2 20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()' Taylor Blau
@ 2021-01-14  6:42     ` Junio C Hamano
  2021-01-14 16:56       ` Taylor Blau
  0 siblings, 1 reply; 121+ messages in thread
From: Junio C Hamano @ 2021-01-14  6:42 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, jrnieder, peff

Taylor Blau <me@ttaylorr.com> writes:

> To prepare for on-disk reverse indexes, remove a spot in
> 'offset_to_pack_pos()' that looks at the 'revindex' array in 'struct
> packed_git'.

Hmph, I somehow would have expected that this clean-up would be done
before step [18/20], but that does not matter in the end.  The end
result looks fairly clean.

I wonder if the call overhead to pack_pos_to_offset(), relative to
the direct indexing of an in-core array revindex[] followed by an
access to a member .offset that we used to do, makes a measurable
difference in this tight loop, though.

> diff --git a/pack-revindex.c b/pack-revindex.c
> index a508d5f0a4..5e69bc7372 100644
> --- a/pack-revindex.c
> +++ b/pack-revindex.c
> @@ -177,21 +177,21 @@ int load_pack_revindex(struct packed_git *p)
>  int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
>  {
>  	unsigned lo, hi;
> -	const struct revindex_entry *revindex;
>  
>  	if (load_pack_revindex(p) < 0)
>  		return -1;
>  
>  	lo = 0;
>  	hi = p->num_objects + 1;
> -	revindex = p->revindex;
>  
>  	do {
>  		const unsigned mi = lo + (hi - lo) / 2;
> -		if (revindex[mi].offset == ofs) {
> +		off_t got = pack_pos_to_offset(p, mi);
> +
> +		if (got == ofs) {
>  			*pos = mi;
>  			return 0;
> -		} else if (ofs < revindex[mi].offset)
> +		} else if (ofs < got)
>  			hi = mi;
>  		else
>  			lo = mi + 1;

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 18/20] pack-revindex: remove unused 'find_revindex_position()'
  2021-01-13 22:25   ` [PATCH v2 18/20] pack-revindex: remove unused 'find_revindex_position()' Taylor Blau
@ 2021-01-14  6:42     ` Junio C Hamano
  0 siblings, 0 replies; 121+ messages in thread
From: Junio C Hamano @ 2021-01-14  6:42 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, jrnieder, peff

Taylor Blau <me@ttaylorr.com> writes:

> -int find_revindex_position(struct packed_git *p, off_t ofs)
> +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
>  {
> -	int lo = 0;
> -	int hi = p->num_objects + 1;
> -	const struct revindex_entry *revindex = p->revindex;
> +	unsigned lo, hi;
> +	const struct revindex_entry *revindex;
> +
> +	if (load_pack_revindex(p) < 0)
> +		return -1;
> +
> +	lo = 0;
> +	hi = p->num_objects + 1;
> +	revindex = p->revindex;
>  	do {
>  		const unsigned mi = lo + (hi - lo) / 2;
>  		if (revindex[mi].offset == ofs) {
> -			return mi;
> +			*pos = mi;
> +			return 0;
>  		} else if (ofs < revindex[mi].offset)
>  			hi = mi;
>  		else

OK, we can safely depend on "unsigned int" at least as wide as
"uint32_t"; unlike the original that used "int", we won't risk
losing the upper half of 4G range.

Nice.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 15/20] for_each_object_in_pack(): convert to new revindex API
  2021-01-13 22:24   ` [PATCH v2 15/20] for_each_object_in_pack(): " Taylor Blau
@ 2021-01-14  6:43     ` Junio C Hamano
  2021-01-14 17:00       ` Taylor Blau
  2021-01-14 19:33       ` Jeff King
  0 siblings, 2 replies; 121+ messages in thread
From: Junio C Hamano @ 2021-01-14  6:43 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, jrnieder, peff

Taylor Blau <me@ttaylorr.com> writes:

> Avoid looking at the 'revindex' pointer directly and instead call
> 'pack_pos_to_index()'.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  packfile.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/packfile.c b/packfile.c
> index 936ab3def5..7bb1750934 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -2086,7 +2086,7 @@ int for_each_object_in_pack(struct packed_git *p,
>  		struct object_id oid;
>  
>  		if (flags & FOR_EACH_OBJECT_PACK_ORDER)
> -			pos = p->revindex[i].nr;
> +			pos = pack_pos_to_index(p, i);

It wasn't too bad before this series formally defined what
"position", "index" and "offset" mean, but now this has become
highly misleading. The variable "pos" here holds what we consider
"index" while "i" holds what we call "position" [*1*].

>  		else
>  			pos = i;

Perhaps renaming "uint32_t pos" to "nth" would avoid confusion?

-	if (nth_packed_object_id(&oid, p, pos) < 0)
+	if (nth_packed_object_id(&oid, p, nth) < 0)
		return error(...);


[Footnote]

*1* The nth_packed_object_id() call we make later using the value we
obtain here should be documented to take "index" as its last
parameter, now that is what we call the location in the index, which
is in object name order.




^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 01/20] pack-revindex: introduce a new API
  2021-01-13 22:23   ` [PATCH v2 01/20] pack-revindex: introduce a new API Taylor Blau
@ 2021-01-14  6:46     ` Junio C Hamano
  2021-01-14 12:00       ` Derrick Stolee
  2021-01-14 17:06       ` Taylor Blau
  0 siblings, 2 replies; 121+ messages in thread
From: Junio C Hamano @ 2021-01-14  6:46 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, jrnieder, peff

Taylor Blau <me@ttaylorr.com> writes:

> +/*
> + * offset_to_pack_pos converts an object offset to a pack position. This
> + * function returns zero on success, and a negative number otherwise. The
> + * parameter 'pos' is usable only on success.
> + *
> + * If the reverse index has not yet been loaded, this function loads it lazily,
> + * and returns an negative number if an error was encountered.

It is somewhat strange to see a function that yields a non-negative
"position" on success and a negative value to signal a failure to
have a separate pointer to the location to receive the true return
value.  Do we truly care the upper half of "uint32_t" (in other
words, do we seriously want to support more than 2G positions in a
pack)?

What I'm trying to get at is that

		int pos = offset_to_pack_pos(...);
		if (pos < 0)
			error();
		else
			use(pos);

is more natural than

		uint32_t pos;
                if (offset_to_pack_pos(..., &pos) < 0)
			error();
		else
			use(pos);

but now I wrote it down and laid it out in front of my eyes, the
latter does not look too bad.

	... later comes back after reading through the series ...

	The new callers all looked quite nice to eyes.  Because we
	discourage assignment inside if() condition, the converted
	result does not make the code more verbose than the
	original.  In fact, it makes it even clearer that we are
	checking for an error return from a function call.  

	Quite nice.

> + * This function runs in time O(log N) with the number of objects in the pack.

Is it a good idea to commit to such performance characteristics as a
promise to callers like this (the comment applies to all three
functions)?

It depends on how a developer is helped by this comment when
deciding whether to use this function, or find other ways, to
implement what s/he wants to do.

> + */
> +int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos);

> +/*
> + * pack_pos_to_index converts the given pack-relative position 'pos' by
> + * returning an index-relative position.
> + *
> + * If the reverse index has not yet been loaded, or the position is out of
> + * bounds, this function aborts.
> + *
> + * This function runs in constant time.
> + */
> +uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
> +
> +/*
> + * pack_pos_to_offset converts the given pack-relative position 'pos' into a
> + * pack offset. For a pack with 'N' objects, asking for position 'N' will return
> + * the total size (in bytes) of the pack.

If we talk about "asking for 'N'" and want it to mean "one beyond
the last position", it is better to clarify that we count from 0.
But see below.

> + * If the reverse index has not yet been loaded, or the position is out of
> + * bounds, this function aborts.

I think it is easier to read if the "unlike the above function, this
allows pos that is one beyond the last object" is explained next to
"if out of bounds, it is an error", not as a part of the previous
paragraph.

> + * This function runs in constant time.
> + */
> +off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos);
> +
>  #endif

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 01/20] pack-revindex: introduce a new API
  2021-01-14  6:46     ` Junio C Hamano
@ 2021-01-14 12:00       ` Derrick Stolee
  2021-01-14 17:06       ` Taylor Blau
  1 sibling, 0 replies; 121+ messages in thread
From: Derrick Stolee @ 2021-01-14 12:00 UTC (permalink / raw)
  To: Junio C Hamano, Taylor Blau; +Cc: git, dstolee, jrnieder, peff

On 1/14/2021 1:46 AM, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> 
>> +/*
>> + * offset_to_pack_pos converts an object offset to a pack position. This
>> + * function returns zero on success, and a negative number otherwise. The
>> + * parameter 'pos' is usable only on success.
>> + *
>> + * If the reverse index has not yet been loaded, this function loads it lazily,
>> + * and returns an negative number if an error was encountered.
> 
> It is somewhat strange to see a function that yields a non-negative
> "position" on success and a negative value to signal a failure to
> have a separate pointer to the location to receive the true return
> value.  Do we truly care the upper half of "uint32_t" (in other
> words, do we seriously want to support more than 2G positions in a
> pack)?
> 
> What I'm trying to get at is that
> 
> 		int pos = offset_to_pack_pos(...);
> 		if (pos < 0)
> 			error();
> 		else
> 			use(pos);
> 
> is more natural than

I agree that this is used commonly, but usually in the case that
we are finding a position in the list _or where such an item would
be inserted_. For example:

	pos = index_name_pos(istate, dirname, len);
	if (pos < 0)
		pos = -pos-1;
	while (pos < istate->cache_nr) {
		...

But that does not apply in this case. Knowing that the requested
offset lies between object 'i' and object 'i + 1' isn't helpful,
since the offset still does not correspond to the start of an
object.

> 		uint32_t pos;
>                 if (offset_to_pack_pos(..., &pos) < 0)
> 			error();
> 		else
> 			use(pos);
> 
> but now I wrote it down and laid it out in front of my eyes, the
> latter does not look too bad.
> 
> 	... later comes back after reading through the series ...
> 
> 	The new callers all looked quite nice to eyes.  Because we
> 	discourage assignment inside if() condition, the converted
> 	result does not make the code more verbose than the
> 	original.  In fact, it makes it even clearer that we are
> 	checking for an error return from a function call.  
> 
> 	Quite nice.

As someone who spends a decent amount of time working in C#, I
also like this pattern. The APIs in C# work this way, too, such
as:

	if (!set.TryGetValue(key, out value))
		return false;

	// Use 'value', which is initialized now.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 16/20] builtin/gc.c: guess the size of the revindex
  2021-01-14  6:33     ` Junio C Hamano
@ 2021-01-14 16:53       ` Taylor Blau
  2021-01-14 19:43         ` Jeff King
  0 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-14 16:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee, jrnieder, peff

On Wed, Jan 13, 2021 at 10:33:01PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > 'estimate_repack_memory()' takes into account the amount of memory
> > required to load the reverse index in memory by multiplying the assumed
> > number of objects by the size of the 'revindex_entry' struct.
> >
> > Prepare for hiding the definition of 'struct revindex_entry' by removing
> > a 'sizeof()' of that type from outside of pack-revindex.c. Instead,
> > guess that one off_t and one uint32_t are required per object. Strictly
> > speaking, this is a worse guess than asking for 'sizeof(struct
> > revindex_entry)' directly, since the true size of this struct is 16
> > bytes with padding on the end of the struct in order to align the offset
> > field.
>
> Meaning that we under-estimate by 25%?

In this area, yes. I'm skeptical that this estimate is all that
important, since it doesn't seem to take into account the memory
required to select delta/base candidates [1].

Thanks,
Taylor

[1]: https://lore.kernel.org/git/X%2F1roycRbYPjnI3l@coredump.intra.peff.net/

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()'
  2021-01-14  6:42     ` Junio C Hamano
@ 2021-01-14 16:56       ` Taylor Blau
  0 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-14 16:56 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee, jrnieder, peff

On Wed, Jan 13, 2021 at 10:42:29PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > To prepare for on-disk reverse indexes, remove a spot in
> > 'offset_to_pack_pos()' that looks at the 'revindex' array in 'struct
> > packed_git'.
>
> Hmph, I somehow would have expected that this clean-up would be done
> before step [18/20], but that does not matter in the end.  The end
> result looks fairly clean.
>
> I wonder if the call overhead to pack_pos_to_offset(), relative to
> the direct indexing of an in-core array revindex[] followed by an
> access to a member .offset that we used to do, makes a measurable
> difference in this tight loop, though.

I'm skeptical that it does (take that with a grain of salt, since I
haven't done any per-function tests with perf, only "how long does it
take to run 'git cat-file --batch-check=%(objectsize:disk)' and so on").

But even if it were to make a difference, it'll get dwarfed in the next
series by the time that we now _don't_ have to spend building and
sorting the reverse index in memory for each new process.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 15/20] for_each_object_in_pack(): convert to new revindex API
  2021-01-14  6:43     ` Junio C Hamano
@ 2021-01-14 17:00       ` Taylor Blau
  2021-01-14 19:33       ` Jeff King
  1 sibling, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-14 17:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee, jrnieder, peff

On Wed, Jan 13, 2021 at 10:43:37PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Avoid looking at the 'revindex' pointer directly and instead call
> > 'pack_pos_to_index()'.
> >
> > Signed-off-by: Taylor Blau <me@ttaylorr.com>
> > ---
> >  packfile.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/packfile.c b/packfile.c
> > index 936ab3def5..7bb1750934 100644
> > --- a/packfile.c
> > +++ b/packfile.c
> > @@ -2086,7 +2086,7 @@ int for_each_object_in_pack(struct packed_git *p,
> >  		struct object_id oid;
> >
> >  		if (flags & FOR_EACH_OBJECT_PACK_ORDER)
> > -			pos = p->revindex[i].nr;
> > +			pos = pack_pos_to_index(p, i);
>
> It wasn't too bad before this series formally defined what
> "position", "index" and "offset" mean, but now this has become
> highly misleading. The variable "pos" here holds what we consider
> "index" while "i" holds what we call "position" [*1*].
>
> >  		else
> >  			pos = i;
>
> Perhaps renaming "uint32_t pos" to "nth" would avoid confusion?

I agree that it can be confusing. Unfortunately in this spot, this
variable really does mean two things. If we set the
FOR_EACH_OBJECT_PACK_ORDER bit in our flags, then the caller really
wants the index position (and the objects to be delivered in pack
order). But if we didn't set it, then the caller wants it in index
order.

> -	if (nth_packed_object_id(&oid, p, pos) < 0)
> +	if (nth_packed_object_id(&oid, p, nth) < 0)
> 		return error(...);

This suggested diff makes me think that you understand all of that, so
I'm mostly saying this for the benefit of others that haven't looked at
this code closely in the recent past.

I'd be happy to send a replacement patch if you would like [1], but I'm
hopeful that this is clear enough since there isn't much code between
the declaration, assignment(s), and use of 'pos'.

Thanks,
Taylor

[1]: I understand your general disdain for single replacement patches,
but I'd like to avoid sending the other 19 patches if possible to avoid
delivering more mail to list subscribers than is necessary.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 01/20] pack-revindex: introduce a new API
  2021-01-14  6:46     ` Junio C Hamano
  2021-01-14 12:00       ` Derrick Stolee
@ 2021-01-14 17:06       ` Taylor Blau
  2021-01-14 19:19         ` Jeff King
  1 sibling, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-14 17:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee, jrnieder, peff

On Wed, Jan 13, 2021 at 10:46:57PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > +/*
> > + * offset_to_pack_pos converts an object offset to a pack position. This
> > + * function returns zero on success, and a negative number otherwise. The
> > + * parameter 'pos' is usable only on success.
> > + *
> > + * If the reverse index has not yet been loaded, this function loads it lazily,
> > + * and returns an negative number if an error was encountered.
>
> It is somewhat strange to see a function that yields a non-negative
> "position" on success and a negative value to signal a failure to
> have a separate pointer to the location to receive the true return
> value.  Do we truly care the upper half of "uint32_t" (in other
> words, do we seriously want to support more than 2G positions in a
> pack)?

I don't think that we care about that as much as we do about potential
misuse of a signed return value. There are indeed a couple of spots
where a potential negative return value is ignored, and then used to
lookup an object in a pack, or some such.

And that's part of the goal of this API: we have strict guidelines about
when the output parameter is and isn't usable. That makes it more
difficult to accidentally use an uninitialized value / negative number.

> What I'm trying to get at is that [...] is more natural than [...] but
> now I wrote it down and laid it out in front of my eyes, the latter
> does not look too bad.

OK, good :-).

> 	... later comes back after reading through the series ...
>
> 	The new callers all looked quite nice to eyes.  Because we
> 	discourage assignment inside if() condition, the converted
> 	result does not make the code more verbose than the
> 	original.  In fact, it makes it even clearer that we are
> 	checking for an error return from a function call.
>
> 	Quite nice.

Thank you :-D.

> > + * This function runs in time O(log N) with the number of objects in the pack.
>
> Is it a good idea to commit to such performance characteristics as a
> promise to callers like this (the comment applies to all three
> functions)?
>
> It depends on how a developer is helped by this comment when
> deciding whether to use this function, or find other ways, to
> implement what s/he wants to do.

I don't mind it. If they all had the same performance characteristics, I
wouldn't be for it, but since they don't, I think that it's good to
know. Peff suggested this back in [1].

Thanks,
Taylor

[1]: https://lore.kernel.org/git/X%2F1guCOGWybOzIS7@coredump.intra.peff.net/

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 01/20] pack-revindex: introduce a new API
  2021-01-14 17:06       ` Taylor Blau
@ 2021-01-14 19:19         ` Jeff King
  2021-01-14 20:47           ` Junio C Hamano
  0 siblings, 1 reply; 121+ messages in thread
From: Jeff King @ 2021-01-14 19:19 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, dstolee, jrnieder

On Thu, Jan 14, 2021 at 12:06:20PM -0500, Taylor Blau wrote:

> > > + * This function runs in time O(log N) with the number of objects in the pack.
> >
> > Is it a good idea to commit to such performance characteristics as a
> > promise to callers like this (the comment applies to all three
> > functions)?
> >
> > It depends on how a developer is helped by this comment when
> > deciding whether to use this function, or find other ways, to
> > implement what s/he wants to do.
> 
> I don't mind it. If they all had the same performance characteristics, I
> wouldn't be for it, but since they don't, I think that it's good to
> know. Peff suggested this back in [1].

Yeah, I asked for this. As somebody who has frequently worked on the
code which accesses the revindex (mostly bitmap stuff), I found it
useful to understand how expensive the operations were.  However, I also
know what their runtimes are at this point, and it is not like somebody
interested cannot look at the implementation. So it may not be that
important.

So I do still think it is useful, but if somebody feels strongly against
it, I don't mind it being removed.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 15/20] for_each_object_in_pack(): convert to new revindex API
  2021-01-14  6:43     ` Junio C Hamano
  2021-01-14 17:00       ` Taylor Blau
@ 2021-01-14 19:33       ` Jeff King
  2021-01-14 20:11         ` Jeff King
  2021-01-14 20:51         ` Junio C Hamano
  1 sibling, 2 replies; 121+ messages in thread
From: Jeff King @ 2021-01-14 19:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee, jrnieder

On Wed, Jan 13, 2021 at 10:43:37PM -0800, Junio C Hamano wrote:

> >  		if (flags & FOR_EACH_OBJECT_PACK_ORDER)
> > -			pos = p->revindex[i].nr;
> > +			pos = pack_pos_to_index(p, i);
> 
> It wasn't too bad before this series formally defined what
> "position", "index" and "offset" mean, but now this has become
> highly misleading. The variable "pos" here holds what we consider
> "index" while "i" holds what we call "position" [*1*].

I don't think "position" is a meaningful term by itself. I would say the
useful terms are "pack position", "index position", and "offset" (or
"pack offset" if you like). I don't think anything in the definitions
added by earlier patches contradicts that, but perhaps we can make it
more clear.

So "pos" in this case is not wrong. But I agree that it could stand to
be more clear. Saying "nth" does not help things IMHO (there is an "nth"
pack position, as well).

But maybe this makes it more clear (or possibly just the name change
without the comment):

diff --git a/packfile.c b/packfile.c
index de47c9f4f8..6035b80466 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2078,19 +2078,30 @@ int for_each_object_in_pack(struct packed_git *p,
 	}
 
 	for (i = 0; i < p->num_objects; i++) {
-		uint32_t pos;
+		uint32_t index_pos;
 		struct object_id oid;
 
+		/*
+		 * We are iterating "i" from 0 up to num_objects, but its
+		 * meaning may be different:
+		 *
+		 *   - in object-name order, it is the same as the index order
+		 *     given to us by nth_packed_object_id(), and we can use it
+		 *     directly
+		 *
+		 *   - in pack-order, it is pack position, which we must
+		 *     convert to an index position in order to get the oid.
+		 */
 		if (flags & FOR_EACH_OBJECT_PACK_ORDER)
-			pos = p->revindex[i].nr;
+			index_pos = p->revindex[i].nr;
 		else
-			pos = i;
+			index_pos = i;
 
-		if (nth_packed_object_id(&oid, p, pos) < 0)
+		if (nth_packed_object_id(&oid, p, index_pos) < 0)
 			return error("unable to get sha1 of object %u in %s",
-				     pos, p->pack_name);
+				     index_pos, p->pack_name);
 
-		r = cb(&oid, p, pos, data);
+		r = cb(&oid, p, index_pos, data);
 		if (r)
 			break;
 	}


> *1* The nth_packed_object_id() call we make later using the value we
> obtain here should be documented to take "index" as its last
> parameter, now that is what we call the location in the index, which
> is in object name order.

I would love to see the function given a more descriptive name. Having
worked on the bitmap code a lot, where the norm is pack-order, saying
"nth" is confusing and error-prone.

But I think that's out of scope for this series.

-Peff

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 16/20] builtin/gc.c: guess the size of the revindex
  2021-01-14 16:53       ` Taylor Blau
@ 2021-01-14 19:43         ` Jeff King
  0 siblings, 0 replies; 121+ messages in thread
From: Jeff King @ 2021-01-14 19:43 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, dstolee, jrnieder

On Thu, Jan 14, 2021 at 11:53:39AM -0500, Taylor Blau wrote:

> On Wed, Jan 13, 2021 at 10:33:01PM -0800, Junio C Hamano wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > > 'estimate_repack_memory()' takes into account the amount of memory
> > > required to load the reverse index in memory by multiplying the assumed
> > > number of objects by the size of the 'revindex_entry' struct.
> > >
> > > Prepare for hiding the definition of 'struct revindex_entry' by removing
> > > a 'sizeof()' of that type from outside of pack-revindex.c. Instead,
> > > guess that one off_t and one uint32_t are required per object. Strictly
> > > speaking, this is a worse guess than asking for 'sizeof(struct
> > > revindex_entry)' directly, since the true size of this struct is 16
> > > bytes with padding on the end of the struct in order to align the offset
> > > field.
> >
> > Meaning that we under-estimate by 25%?
> 
> In this area, yes. I'm skeptical that this estimate is all that
> important, since it doesn't seem to take into account the memory
> required to select delta/base candidates [1].

It has many other inaccuracies:

  - it assumes half of all objects are blobs, which is not really
    accurate (linux.git is more like 60% trees, 12% commits, 28% blobs).
    This underestimates because blobs are the smallest struct.

  - since we moved a bunch of stuff out of "struct object_entry" into
    lazily-initialized auxiliary structures, we are under-counting the
    per-object cost when we have to spill into this structures

So I'm rather skeptical that this number is close to accurate. But
since there's a bunch of leeway (we are looking to use half of the
system memory) I suspect it doesn't matter all that much. But I
definitely don't think it's worth trying to micro-optimize its accuracy.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
                     ` (19 preceding siblings ...)
  2021-01-13 22:25   ` [PATCH v2 20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()' Taylor Blau
@ 2021-01-14 19:51   ` Jeff King
  2021-01-14 20:53     ` Junio C Hamano
  20 siblings, 1 reply; 121+ messages in thread
From: Jeff King @ 2021-01-14 19:51 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jrnieder

On Wed, Jan 13, 2021 at 05:23:25PM -0500, Taylor Blau wrote:

> In this revision, I addressed feedback from Junio, Peff, and Stolee. A
> range-diff is included below, but the main changes are:
> 
>   - Error messages are improved to include the pack and offset when applicable.
>   - Variable names were made clearer (e.g., n -> index_pos).
>   - Comments were added in pack-revindex.h to introduce relevant terminology,
>     and which methods convert between what orderings.
>   - int-sized lower- and upper-bounds were converted to be unsigned.

Thanks, this addresses all of my nits. I responded to a few of Junio's
reviews with some further comments/suggestions; the most interesting one
is using "index_pos" to indicate the ordering more clearly in patch 15.
I'm happy with or without including that.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 15/20] for_each_object_in_pack(): convert to new revindex API
  2021-01-14 19:33       ` Jeff King
@ 2021-01-14 20:11         ` Jeff King
  2021-01-14 20:15           ` Taylor Blau
  2021-01-14 20:51         ` Junio C Hamano
  1 sibling, 1 reply; 121+ messages in thread
From: Jeff King @ 2021-01-14 20:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee, jrnieder

On Thu, Jan 14, 2021 at 02:33:53PM -0500, Jeff King wrote:

> So "pos" in this case is not wrong. But I agree that it could stand to
> be more clear. Saying "nth" does not help things IMHO (there is an "nth"
> pack position, as well).
> 
> But maybe this makes it more clear (or possibly just the name change
> without the comment):

Here it is again, but with a signoff and commit message, and done on top
of your series (so if we agree this is a good resolution, it can just be
picked up on top, but I am also happy for it to be squashed into patch
15).

-- >8 --
Subject: [PATCH] for_each_object_in_pack(): clarify pack vs index ordering

We may return objects in one of two orders: how they appear in the .idx
(sorted by object id) or how they appear in the packfile itself. To
further complicate matters, we have two ordering variables, "i" and
"pos", and it is not clear to which order they apply.

Let's clarify this by using an unambiguous name where possible, and
leaving a comment for the variable that does double-duty.

Signed-off-by: Jeff King <peff@peff.net>
---
 packfile.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/packfile.c b/packfile.c
index 7bb1750934..35d50e2c38 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2082,19 +2082,31 @@ int for_each_object_in_pack(struct packed_git *p,
 	}
 
 	for (i = 0; i < p->num_objects; i++) {
-		uint32_t pos;
+		uint32_t index_pos;
 		struct object_id oid;
 
+		/*
+		 * We are iterating "i" from 0 up to num_objects, but its
+		 * meaning may be different, depending on the requested output
+		 * order:
+		 *
+		 *   - in object-name order, it is the same as the index order
+		 *     used by nth_packed_object_id(), so we can pass it
+		 *     directly
+		 *
+		 *   - in pack-order, it is pack position, which we must
+		 *     convert to an index position in order to get the oid.
+		 */
 		if (flags & FOR_EACH_OBJECT_PACK_ORDER)
-			pos = pack_pos_to_index(p, i);
+			index_pos = pack_pos_to_index(p, i);
 		else
-			pos = i;
+			index_pos = i;
 
-		if (nth_packed_object_id(&oid, p, pos) < 0)
+		if (nth_packed_object_id(&oid, p, index_pos) < 0)
 			return error("unable to get sha1 of object %u in %s",
-				     pos, p->pack_name);
+				     index_pos, p->pack_name);
 
-		r = cb(&oid, p, pos, data);
+		r = cb(&oid, p, index_pos, data);
 		if (r)
 			break;
 	}
-- 
2.30.0.578.g0a9fb12091


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 15/20] for_each_object_in_pack(): convert to new revindex API
  2021-01-14 20:11         ` Jeff King
@ 2021-01-14 20:15           ` Taylor Blau
  2021-01-15  2:22             ` Junio C Hamano
  0 siblings, 1 reply; 121+ messages in thread
From: Taylor Blau @ 2021-01-14 20:15 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Taylor Blau, git, dstolee, jrnieder

On Thu, Jan 14, 2021 at 03:11:10PM -0500, Jeff King wrote:
> On Thu, Jan 14, 2021 at 02:33:53PM -0500, Jeff King wrote:
>
> > So "pos" in this case is not wrong. But I agree that it could stand to
> > be more clear. Saying "nth" does not help things IMHO (there is an "nth"
> > pack position, as well).
> >
> > But maybe this makes it more clear (or possibly just the name change
> > without the comment):
>
> Here it is again, but with a signoff and commit message, and done on top
> of your series (so if we agree this is a good resolution, it can just be
> picked up on top, but I am also happy for it to be squashed into patch
> 15).

Much appreciated. This looks good to me (and I have no opinion whether
it is picked up on top, or squashed into patch 15).

  Acked-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 01/20] pack-revindex: introduce a new API
  2021-01-14 19:19         ` Jeff King
@ 2021-01-14 20:47           ` Junio C Hamano
  0 siblings, 0 replies; 121+ messages in thread
From: Junio C Hamano @ 2021-01-14 20:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, dstolee, jrnieder

Jeff King <peff@peff.net> writes:

> On Thu, Jan 14, 2021 at 12:06:20PM -0500, Taylor Blau wrote:
>
>> > > + * This function runs in time O(log N) with the number of objects in the pack.
>> >
>> > Is it a good idea to commit to such performance characteristics as a
>> > promise to callers like this (the comment applies to all three
>> > functions)?
>> >
>> > It depends on how a developer is helped by this comment when
>> > deciding whether to use this function, or find other ways, to
>> > implement what s/he wants to do.
>> 
>> I don't mind it. If they all had the same performance characteristics, I
>> wouldn't be for it, but since they don't, I think that it's good to
>> know. Peff suggested this back in [1].
>
> Yeah, I asked for this. As somebody who has frequently worked on the
> code which accesses the revindex (mostly bitmap stuff), I found it
> useful to understand how expensive the operations were.  However, I also
> know what their runtimes are at this point, and it is not like somebody
> interested cannot look at the implementation. So it may not be that
> important.
>
> So I do still think it is useful, but if somebody feels strongly against
> it, I don't mind it being removed.

That won't be me.  It's not like you'd use pack_pos_to_index()
combined with pack_pos_to_offset() instead of offset_to_pack_pos()
because the latter is more expensive than using the other two
functions; the comment does not help those who want to know relative
performance of these functions for such a purpose.

I was just curious who the comments were meant to help, that's all.

Thanks.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 15/20] for_each_object_in_pack(): convert to new revindex API
  2021-01-14 19:33       ` Jeff King
  2021-01-14 20:11         ` Jeff King
@ 2021-01-14 20:51         ` Junio C Hamano
  1 sibling, 0 replies; 121+ messages in thread
From: Junio C Hamano @ 2021-01-14 20:51 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, dstolee, jrnieder

Jeff King <peff@peff.net> writes:

>  	for (i = 0; i < p->num_objects; i++) {
> -		uint32_t pos;
> +		uint32_t index_pos;
> ...
>> *1* The nth_packed_object_id() call we make later using the value we
>> obtain here should be documented to take "index" as its last
>> parameter, now that is what we call the location in the index, which
>> is in object name order.
>
> I would love to see the function given a more descriptive name. Having
> worked on the bitmap code a lot, where the norm is pack-order, saying
> "nth" is confusing and error-prone.
>
> But I think that's out of scope for this series.

Yeah, an explicit index_pos (vs pack_order_pos) would be good names
to use, and nth_packed_object_id() can also use somewhere in its
name to hint that it is about the object name order, but I agree
that both are outside the scope of this series.

Thanks.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-14 19:51   ` [PATCH v2 00/20] pack-revindex: prepare for on-disk reverse index Jeff King
@ 2021-01-14 20:53     ` Junio C Hamano
  2021-01-15  9:32       ` Jeff King
  0 siblings, 1 reply; 121+ messages in thread
From: Junio C Hamano @ 2021-01-14 20:53 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, dstolee, jrnieder

Jeff King <peff@peff.net> writes:

> On Wed, Jan 13, 2021 at 05:23:25PM -0500, Taylor Blau wrote:
>
>> In this revision, I addressed feedback from Junio, Peff, and Stolee. A
>> range-diff is included below, but the main changes are:
>> 
>>   - Error messages are improved to include the pack and offset when applicable.
>>   - Variable names were made clearer (e.g., n -> index_pos).
>>   - Comments were added in pack-revindex.h to introduce relevant terminology,
>>     and which methods convert between what orderings.
>>   - int-sized lower- and upper-bounds were converted to be unsigned.
>
> Thanks, this addresses all of my nits. I responded to a few of Junio's
> reviews with some further comments/suggestions; the most interesting one
> is using "index_pos" to indicate the ordering more clearly in patch 15.
> I'm happy with or without including that.

I think it is better to leave it for future "clean-up" outside the
series, to be done after the series hits a released version.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 15/20] for_each_object_in_pack(): convert to new revindex API
  2021-01-14 20:15           ` Taylor Blau
@ 2021-01-15  2:22             ` Junio C Hamano
  2021-01-15  2:29               ` Taylor Blau
  0 siblings, 1 reply; 121+ messages in thread
From: Junio C Hamano @ 2021-01-15  2:22 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jeff King, git, dstolee, jrnieder

Taylor Blau <me@ttaylorr.com> writes:

> On Thu, Jan 14, 2021 at 03:11:10PM -0500, Jeff King wrote:
>> On Thu, Jan 14, 2021 at 02:33:53PM -0500, Jeff King wrote:
>>
>> > So "pos" in this case is not wrong. But I agree that it could stand to
>> > be more clear. Saying "nth" does not help things IMHO (there is an "nth"
>> > pack position, as well).
>> >
>> > But maybe this makes it more clear (or possibly just the name change
>> > without the comment):
>>
>> Here it is again, but with a signoff and commit message, and done on top
>> of your series (so if we agree this is a good resolution, it can just be
>> picked up on top, but I am also happy for it to be squashed into patch
>> 15).
>
> Much appreciated. This looks good to me (and I have no opinion whether
> it is picked up on top, or squashed into patch 15).
>
>   Acked-by: Taylor Blau <me@ttaylorr.com>

OK, so I'll make this 21/20 and rebase the other one...

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 15/20] for_each_object_in_pack(): convert to new revindex API
  2021-01-15  2:22             ` Junio C Hamano
@ 2021-01-15  2:29               ` Taylor Blau
  0 siblings, 0 replies; 121+ messages in thread
From: Taylor Blau @ 2021-01-15  2:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, Jeff King, git, dstolee, jrnieder

On Thu, Jan 14, 2021 at 06:22:56PM -0800, Junio C Hamano wrote:
> > Much appreciated. This looks good to me (and I have no opinion whether
> > it is picked up on top, or squashed into patch 15).
> >
> >   Acked-by: Taylor Blau <me@ttaylorr.com>
>
> OK, so I'll make this 21/20 and rebase the other one...

I'm happy if you want to apply this separately on top of both series,
since I think we all agree that this isn't strictly necessary to go
forward.

That said, if you do make this 21/20 and the rebase of the second series
requires any intervention, please let me know and I'll be happy to send
out a version myself.

In the meantime, I don't mind if you want to eject the latter topic from
seen while it picks up review. The alternative (keeping it in and
rebasing it forward), of course, is much appreciated.


Thanks,
Taylor

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-14 20:53     ` Junio C Hamano
@ 2021-01-15  9:32       ` Jeff King
  2021-01-15  9:33         ` Jeff King
  0 siblings, 1 reply; 121+ messages in thread
From: Jeff King @ 2021-01-15  9:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee, jrnieder

On Thu, Jan 14, 2021 at 12:53:19PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > On Wed, Jan 13, 2021 at 05:23:25PM -0500, Taylor Blau wrote:
> >
> >> In this revision, I addressed feedback from Junio, Peff, and Stolee. A
> >> range-diff is included below, but the main changes are:
> >> 
> >>   - Error messages are improved to include the pack and offset when applicable.
> >>   - Variable names were made clearer (e.g., n -> index_pos).
> >>   - Comments were added in pack-revindex.h to introduce relevant terminology,
> >>     and which methods convert between what orderings.
> >>   - int-sized lower- and upper-bounds were converted to be unsigned.
> >
> > Thanks, this addresses all of my nits. I responded to a few of Junio's
> > reviews with some further comments/suggestions; the most interesting one
> > is using "index_pos" to indicate the ordering more clearly in patch 15.
> > I'm happy with or without including that.
> 
> I think it is better to leave it for future "clean-up" outside the
> series, to be done after the series hits a released version.

OK. I'll hold on to it and re-send it after the area has settled.

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH v2 00/20] pack-revindex: prepare for on-disk reverse index
  2021-01-15  9:32       ` Jeff King
@ 2021-01-15  9:33         ` Jeff King
  0 siblings, 0 replies; 121+ messages in thread
From: Jeff King @ 2021-01-15  9:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee, jrnieder

On Fri, Jan 15, 2021 at 04:32:34AM -0500, Jeff King wrote:

> On Thu, Jan 14, 2021 at 12:53:19PM -0800, Junio C Hamano wrote:
> 
> > Jeff King <peff@peff.net> writes:
> > 
> > > On Wed, Jan 13, 2021 at 05:23:25PM -0500, Taylor Blau wrote:
> > >
> > >> In this revision, I addressed feedback from Junio, Peff, and Stolee. A
> > >> range-diff is included below, but the main changes are:
> > >> 
> > >>   - Error messages are improved to include the pack and offset when applicable.
> > >>   - Variable names were made clearer (e.g., n -> index_pos).
> > >>   - Comments were added in pack-revindex.h to introduce relevant terminology,
> > >>     and which methods convert between what orderings.
> > >>   - int-sized lower- and upper-bounds were converted to be unsigned.
> > >
> > > Thanks, this addresses all of my nits. I responded to a few of Junio's
> > > reviews with some further comments/suggestions; the most interesting one
> > > is using "index_pos" to indicate the ordering more clearly in patch 15.
> > > I'm happy with or without including that.
> > 
> > I think it is better to leave it for future "clean-up" outside the
> > series, to be done after the series hits a released version.
> 
> OK. I'll hold on to it and re-send it after the area has settled.

Oh, nevermind, you said elsewhere you'd pick it up as 21/20. Assuming
you do that, I won't resend it. :)

-Peff

^ permalink raw reply	[flat|nested] 121+ messages in thread

end of thread, other threads:[~2021-01-15  9:35 UTC | newest]

Thread overview: 121+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-08 18:16 [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Taylor Blau
2021-01-08 18:16 ` [PATCH 01/20] pack-revindex: introduce a new API Taylor Blau
2021-01-12  8:41   ` Jeff King
2021-01-12  9:41     ` Jeff King
2021-01-12 16:27     ` Taylor Blau
2021-01-13  8:06   ` Junio C Hamano
2021-01-13  8:54     ` Junio C Hamano
2021-01-13 13:17     ` Jeff King
2021-01-13 16:23     ` Taylor Blau
2021-01-08 18:16 ` [PATCH 02/20] write_reuse_object(): convert to new revindex API Taylor Blau
2021-01-12  8:47   ` Jeff King
2021-01-12 16:31     ` Taylor Blau
2021-01-13 13:02       ` Jeff King
2021-01-08 18:16 ` [PATCH 03/20] write_reused_pack_one(): " Taylor Blau
2021-01-12  8:50   ` Jeff King
2021-01-12 16:34     ` Taylor Blau
2021-01-08 18:16 ` [PATCH 04/20] write_reused_pack_verbatim(): " Taylor Blau
2021-01-12  8:50   ` Jeff King
2021-01-08 18:17 ` [PATCH 05/20] check_object(): " Taylor Blau
2021-01-11 11:43   ` Derrick Stolee
2021-01-11 16:15     ` Taylor Blau
2021-01-12  8:54       ` Jeff King
2021-01-12  8:51   ` Jeff King
2021-01-08 18:17 ` [PATCH 06/20] bitmap_position_packfile(): " Taylor Blau
2021-01-08 18:17 ` [PATCH 07/20] show_objects_for_type(): " Taylor Blau
2021-01-12  8:57   ` Jeff King
2021-01-12 16:35     ` Taylor Blau
2021-01-08 18:17 ` [PATCH 08/20] get_size_by_pos(): " Taylor Blau
2021-01-08 18:17 ` [PATCH 09/20] try_partial_reuse(): " Taylor Blau
2021-01-12  9:06   ` Jeff King
2021-01-12 16:47     ` Taylor Blau
2021-01-08 18:17 ` [PATCH 10/20] rebuild_existing_bitmaps(): " Taylor Blau
2021-01-08 18:17 ` [PATCH 11/20] get_delta_base_oid(): " Taylor Blau
2021-01-08 18:17 ` [PATCH 12/20] retry_bad_packed_offset(): " Taylor Blau
2021-01-08 18:17 ` [PATCH 13/20] packed_object_info(): " Taylor Blau
2021-01-12  9:11   ` Jeff King
2021-01-12 16:51     ` Taylor Blau
2021-01-08 18:17 ` [PATCH 14/20] unpack_entry(): " Taylor Blau
2021-01-12  9:22   ` Jeff King
2021-01-12 16:56     ` Taylor Blau
2021-01-08 18:17 ` [PATCH 15/20] for_each_object_in_pack(): " Taylor Blau
2021-01-08 18:17 ` [PATCH 16/20] builtin/gc.c: guess the size of the revindex Taylor Blau
2021-01-11 11:52   ` Derrick Stolee
2021-01-11 16:23     ` Taylor Blau
2021-01-11 17:09       ` Derrick Stolee
2021-01-12  9:28         ` Jeff King
2021-01-08 18:17 ` [PATCH 17/20] pack-revindex: remove unused 'find_pack_revindex()' Taylor Blau
2021-01-08 18:17 ` [PATCH 18/20] pack-revindex: remove unused 'find_revindex_position()' Taylor Blau
2021-01-11 11:57   ` Derrick Stolee
2021-01-11 16:27     ` Taylor Blau
2021-01-11 17:11       ` Derrick Stolee
2021-01-12  9:32   ` Jeff King
2021-01-12 16:59     ` Taylor Blau
2021-01-13 13:05       ` Jeff King
2021-01-08 18:18 ` [PATCH 19/20] pack-revindex: hide the definition of 'revindex_entry' Taylor Blau
2021-01-11 11:57   ` Derrick Stolee
2021-01-12  9:34   ` Jeff King
2021-01-08 18:18 ` [PATCH 20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()' Taylor Blau
2021-01-12  9:37   ` Jeff King
2021-01-12 17:02     ` Taylor Blau
2021-01-11 12:07 ` [PATCH 00/20] pack-revindex: prepare for on-disk reverse index Derrick Stolee
2021-01-11 16:30   ` Taylor Blau
2021-01-11 17:15     ` Derrick Stolee
2021-01-11 17:29       ` Taylor Blau
2021-01-11 18:40       ` Junio C Hamano
2021-01-12  9:45 ` Jeff King
2021-01-12 17:07   ` Taylor Blau
2021-01-13  0:23 ` Junio C Hamano
2021-01-13  0:52   ` Taylor Blau
2021-01-13  2:15     ` Junio C Hamano
2021-01-13  3:23       ` Taylor Blau
2021-01-13  8:21         ` Junio C Hamano
2021-01-13 13:13           ` Jeff King
2021-01-13 15:34             ` Taylor Blau
2021-01-13 20:06               ` Junio C Hamano
2021-01-13 20:13                 ` Taylor Blau
2021-01-13 20:22                 ` Jeff King
2021-01-13 22:23 ` [PATCH v2 " Taylor Blau
2021-01-13 22:23   ` [PATCH v2 01/20] pack-revindex: introduce a new API Taylor Blau
2021-01-14  6:46     ` Junio C Hamano
2021-01-14 12:00       ` Derrick Stolee
2021-01-14 17:06       ` Taylor Blau
2021-01-14 19:19         ` Jeff King
2021-01-14 20:47           ` Junio C Hamano
2021-01-13 22:23   ` [PATCH v2 02/20] write_reuse_object(): convert to new revindex API Taylor Blau
2021-01-13 22:23   ` [PATCH v2 03/20] write_reused_pack_one(): " Taylor Blau
2021-01-13 22:23   ` [PATCH v2 04/20] write_reused_pack_verbatim(): " Taylor Blau
2021-01-13 22:23   ` [PATCH v2 05/20] check_object(): " Taylor Blau
2021-01-13 22:23   ` [PATCH v2 06/20] bitmap_position_packfile(): " Taylor Blau
2021-01-13 22:23   ` [PATCH v2 07/20] show_objects_for_type(): " Taylor Blau
2021-01-13 22:24   ` [PATCH v2 08/20] get_size_by_pos(): " Taylor Blau
2021-01-13 22:24   ` [PATCH v2 09/20] try_partial_reuse(): " Taylor Blau
2021-01-13 22:24   ` [PATCH v2 10/20] rebuild_existing_bitmaps(): " Taylor Blau
2021-01-13 22:24   ` [PATCH v2 11/20] get_delta_base_oid(): " Taylor Blau
2021-01-13 22:24   ` [PATCH v2 12/20] retry_bad_packed_offset(): " Taylor Blau
2021-01-13 22:24   ` [PATCH v2 13/20] packed_object_info(): " Taylor Blau
2021-01-13 22:24   ` [PATCH v2 14/20] unpack_entry(): " Taylor Blau
2021-01-13 22:24   ` [PATCH v2 15/20] for_each_object_in_pack(): " Taylor Blau
2021-01-14  6:43     ` Junio C Hamano
2021-01-14 17:00       ` Taylor Blau
2021-01-14 19:33       ` Jeff King
2021-01-14 20:11         ` Jeff King
2021-01-14 20:15           ` Taylor Blau
2021-01-15  2:22             ` Junio C Hamano
2021-01-15  2:29               ` Taylor Blau
2021-01-14 20:51         ` Junio C Hamano
2021-01-13 22:24   ` [PATCH v2 16/20] builtin/gc.c: guess the size of the revindex Taylor Blau
2021-01-14  6:33     ` Junio C Hamano
2021-01-14 16:53       ` Taylor Blau
2021-01-14 19:43         ` Jeff King
2021-01-13 22:24   ` [PATCH v2 17/20] pack-revindex: remove unused 'find_pack_revindex()' Taylor Blau
2021-01-13 22:25   ` [PATCH v2 18/20] pack-revindex: remove unused 'find_revindex_position()' Taylor Blau
2021-01-14  6:42     ` Junio C Hamano
2021-01-13 22:25   ` [PATCH v2 19/20] pack-revindex: hide the definition of 'revindex_entry' Taylor Blau
2021-01-13 22:25   ` [PATCH v2 20/20] pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()' Taylor Blau
2021-01-14  6:42     ` Junio C Hamano
2021-01-14 16:56       ` Taylor Blau
2021-01-14 19:51   ` [PATCH v2 00/20] pack-revindex: prepare for on-disk reverse index Jeff King
2021-01-14 20:53     ` Junio C Hamano
2021-01-15  9:32       ` Jeff King
2021-01-15  9:33         ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).