git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH 00/22] multi-pack reachability bitmaps
@ 2021-04-09 18:10 Taylor Blau
  2021-04-09 18:10 ` [PATCH 01/22] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps Taylor Blau
                   ` (25 more replies)
  0 siblings, 26 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

This series implements multi-pack reachability bitmaps. It is based on
'master' after merging 'tb/pack-preferred-tips-to-give-bitmap'.

This is an extension of the classic single-pack bitmaps. Instead of
mapping between objects and bit positions according to each object's
pack-relative position, multi-pack bitmaps use each object's position in
a kind of "pseudo pack".

The pseudo pack doesn't refer to a physical packfile, but instead a
conceptual ordering of objects in a multi-pack index. This ordering is
reflected in the MIDX's .rev file, which is used extensively to power
multi-pack bitmaps.

This somewhat lengthy series is organized as follows:

  - The first eight patches are cleanup and preparation.

  - The next three patches factor out functions which have different
    implementations based on whether a bitmap is tied to a pack or MIDX.

  - The next two patches implement support for reading and writing
    multi-pack bitmaps.

  - The remaining tests prepare for a new
    GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP mode of running the test
    suite, and add new tests covering the new multi-pack bitmap
    behavior.

You can experiment with the new functionality by running "git
multi-pack-index write --bitmap", which updates the multi-pack index (if
necessary), and writes out a corresponding .bitmap file. Eventually,
support for invoking the above during "git repack" will be introduced,
but this is done in a separate series.

These patches have been extracted from a version which has been running
on every repository on GitHub for the past few weeks.

Thanks in advance for your review (including on all of the many series leading
up to this one).

Jeff King (1):
  t5310: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP

Taylor Blau (21):
  pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps
  pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  pack-bitmap-write.c: free existing bitmaps
  Documentation: build 'technical/bitmap-format' by default
  Documentation: describe MIDX-based bitmaps
  midx: make a number of functions non-static
  midx: clear auxiliary .rev after replacing the MIDX
  midx: respect 'core.multiPackIndex' when writing
  pack-bitmap.c: introduce 'bitmap_num_objects()'
  pack-bitmap.c: introduce 'nth_bitmap_object_oid()'
  pack-bitmap.c: introduce 'bitmap_is_preferred_refname()'
  pack-bitmap: read multi-pack bitmaps
  pack-bitmap: write multi-pack bitmaps
  t5310: move some tests to lib-bitmap.sh
  t/helper/test-read-midx.c: add --checksum mode
  t5326: test multi-pack bitmap behavior
  t5319: don't write MIDX bitmaps in t5319
  t7700: update to work with MIDX bitmap test knob
  midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP'
  p5310: extract full and partial bitmap tests
  p5326: perf tests for MIDX bitmaps

 Documentation/Makefile                       |   1 +
 Documentation/git-multi-pack-index.txt       |  12 +-
 Documentation/technical/bitmap-format.txt    |  72 ++-
 Documentation/technical/multi-pack-index.txt |  10 +-
 builtin/multi-pack-index.c                   |   2 +
 builtin/pack-objects.c                       |  15 +-
 builtin/repack.c                             |  13 +-
 ci/run-build-and-tests.sh                    |   1 +
 midx.c                                       | 216 ++++++++-
 midx.h                                       |   5 +
 pack-bitmap-write.c                          |  79 +++-
 pack-bitmap.c                                | 463 +++++++++++++++++--
 pack-bitmap.h                                |   8 +-
 packfile.c                                   |   2 +-
 t/README                                     |   4 +
 t/helper/test-read-midx.c                    |  16 +-
 t/lib-bitmap.sh                              | 216 +++++++++
 t/perf/lib-bitmap.sh                         |  69 +++
 t/perf/p5310-pack-bitmaps.sh                 |  65 +--
 t/perf/p5326-multi-pack-bitmaps.sh           |  43 ++
 t/t5310-pack-bitmaps.sh                      | 208 +--------
 t/t5319-multi-pack-index.sh                  |   3 +-
 t/t5326-multi-pack-bitmaps.sh                | 278 +++++++++++
 t/t7700-repack.sh                            |  18 +-
 24 files changed, 1435 insertions(+), 384 deletions(-)
 create mode 100644 t/perf/lib-bitmap.sh
 create mode 100755 t/perf/p5326-multi-pack-bitmaps.sh
 create mode 100755 t/t5326-multi-pack-bitmaps.sh

-- 
2.31.1.163.ga65ce7f831

^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 01/22] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
@ 2021-04-09 18:10 ` Taylor Blau
  2021-04-09 18:10 ` [PATCH 02/22] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps Taylor Blau
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

The special `--test-bitmap` mode of `git rev-list` is used to compare
the result of an object traversal with a bitmap to check its integrity.
This mode does not, however, assert that the types of reachable objects
are stored correctly.

Harden this mode by teaching it to also check that each time an object's
bit is marked, the corresponding bit should be set in exactly one of the
type bitmaps (whose type matches the object's true type).

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 3ed15431cd..d45e91db1e 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1263,10 +1263,52 @@ void count_bitmap_commit_list(struct bitmap_index *bitmap_git,
 struct bitmap_test_data {
 	struct bitmap_index *bitmap_git;
 	struct bitmap *base;
+	struct bitmap *commits;
+	struct bitmap *trees;
+	struct bitmap *blobs;
+	struct bitmap *tags;
 	struct progress *prg;
 	size_t seen;
 };
 
+static void test_bitmap_type(struct bitmap_test_data *tdata,
+			     struct object *obj, int pos)
+{
+	enum object_type bitmap_type = OBJ_NONE;
+	int bitmaps_nr = 0;
+
+	if (bitmap_get(tdata->commits, pos)) {
+		bitmap_type = OBJ_COMMIT;
+		bitmaps_nr++;
+	}
+	if (bitmap_get(tdata->trees, pos)) {
+		bitmap_type = OBJ_TREE;
+		bitmaps_nr++;
+	}
+	if (bitmap_get(tdata->blobs, pos)) {
+		bitmap_type = OBJ_BLOB;
+		bitmaps_nr++;
+	}
+	if (bitmap_get(tdata->tags, pos)) {
+		bitmap_type = OBJ_TAG;
+		bitmaps_nr++;
+	}
+
+	if (!bitmap_type)
+		die("object %s not found in type bitmaps",
+		    oid_to_hex(&obj->oid));
+
+	if (bitmaps_nr > 1)
+		die("object %s does not have a unique type",
+		    oid_to_hex(&obj->oid));
+
+	if (bitmap_type != obj->type)
+		die("object %s: real type %s, expected: %s",
+		    oid_to_hex(&obj->oid),
+		    type_name(obj->type),
+		    type_name(bitmap_type));
+}
+
 static void test_show_object(struct object *object, const char *name,
 			     void *data)
 {
@@ -1276,6 +1318,7 @@ static void test_show_object(struct object *object, const char *name,
 	bitmap_pos = bitmap_position(tdata->bitmap_git, &object->oid);
 	if (bitmap_pos < 0)
 		die("Object not in bitmap: %s\n", oid_to_hex(&object->oid));
+	test_bitmap_type(tdata, object, bitmap_pos);
 
 	bitmap_set(tdata->base, bitmap_pos);
 	display_progress(tdata->prg, ++tdata->seen);
@@ -1290,6 +1333,7 @@ static void test_show_commit(struct commit *commit, void *data)
 				     &commit->object.oid);
 	if (bitmap_pos < 0)
 		die("Object not in bitmap: %s\n", oid_to_hex(&commit->object.oid));
+	test_bitmap_type(tdata, &commit->object, bitmap_pos);
 
 	bitmap_set(tdata->base, bitmap_pos);
 	display_progress(tdata->prg, ++tdata->seen);
@@ -1337,6 +1381,10 @@ void test_bitmap_walk(struct rev_info *revs)
 
 	tdata.bitmap_git = bitmap_git;
 	tdata.base = bitmap_new();
+	tdata.commits = ewah_to_bitmap(bitmap_git->commits);
+	tdata.trees = ewah_to_bitmap(bitmap_git->trees);
+	tdata.blobs = ewah_to_bitmap(bitmap_git->blobs);
+	tdata.tags = ewah_to_bitmap(bitmap_git->tags);
 	tdata.prg = start_progress("Verifying bitmap entries", result_popcnt);
 	tdata.seen = 0;
 
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 02/22] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
  2021-04-09 18:10 ` [PATCH 01/22] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps Taylor Blau
@ 2021-04-09 18:10 ` Taylor Blau
  2021-04-16  2:46   ` Jonathan Tan
  2021-04-09 18:10 ` [PATCH 03/22] pack-bitmap-write.c: free existing bitmaps Taylor Blau
                   ` (23 subsequent siblings)
  25 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

The set of objects covered by a bitmap must be closed under
reachability, since it must be the case that there is a valid bit
position assigned for every possible reachable object (otherwise the
bitmaps would be incomplete).

Pack bitmaps are never written from 'git repack' unless repacking
all-into-one, and so we never write non-closed bitmaps.

But multi-pack bitmaps change this, since it isn't known whether the
set of objects in the MIDX is closed under reachability until walking
them. Plumb through a bit that is set when a reachable object isn't
found.

As soon as a reachable object isn't found in the set of objects to
include in the bitmap, bitmap_writer_build() knows that the set is not
closed, and so it now fails gracefully.

(The new conditional in builtin/pack-objects.c:bitmap_writer_build()
guards against other failure modes, but is never triggered here, because
of the all-into-one detail above. This return value will be important to
check from the multi-pack index caller.)

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |  3 +-
 pack-bitmap-write.c    | 76 +++++++++++++++++++++++++++++-------------
 pack-bitmap.h          |  2 +-
 3 files changed, 56 insertions(+), 25 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a1e33d7507..5205dde2e1 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1116,7 +1116,8 @@ static void write_pack_file(void)
 
 				bitmap_writer_show_progress(progress);
 				bitmap_writer_select_commits(indexed_commits, indexed_commits_nr, -1);
-				bitmap_writer_build(&to_pack);
+				if (bitmap_writer_build(&to_pack) < 0)
+					die(_("failed to write bitmap index"));
 				bitmap_writer_finish(written_list, nr_written,
 						     tmpname.buf, write_bitmap_options);
 				write_bitmap_index = 0;
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index 88d9e696a5..e829c46649 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -125,15 +125,20 @@ static inline void push_bitmapped_commit(struct commit *commit)
 	writer.selected_nr++;
 }
 
-static uint32_t find_object_pos(const struct object_id *oid)
+static uint32_t find_object_pos(const struct object_id *oid, int *found)
 {
 	struct object_entry *entry = packlist_find(writer.to_pack, oid);
 
 	if (!entry) {
-		die("Failed to write bitmap index. Packfile doesn't have full closure "
+		if (found)
+			*found = 0;
+		warning("Failed to write bitmap index. Packfile doesn't have full closure "
 			"(object %s is missing)", oid_to_hex(oid));
+		return 0;
 	}
 
+	if (found)
+		*found = 1;
 	return oe_in_pack_pos(writer.to_pack, entry);
 }
 
@@ -331,9 +336,10 @@ static void bitmap_builder_clear(struct bitmap_builder *bb)
 	bb->commits_nr = bb->commits_alloc = 0;
 }
 
-static void fill_bitmap_tree(struct bitmap *bitmap,
-			     struct tree *tree)
+static int fill_bitmap_tree(struct bitmap *bitmap,
+			    struct tree *tree)
 {
+	int found;
 	uint32_t pos;
 	struct tree_desc desc;
 	struct name_entry entry;
@@ -342,9 +348,11 @@ static void fill_bitmap_tree(struct bitmap *bitmap,
 	 * If our bit is already set, then there is nothing to do. Both this
 	 * tree and all of its children will be set.
 	 */
-	pos = find_object_pos(&tree->object.oid);
+	pos = find_object_pos(&tree->object.oid, &found);
+	if (!found)
+		return -1;
 	if (bitmap_get(bitmap, pos))
-		return;
+		return 0;
 	bitmap_set(bitmap, pos);
 
 	if (parse_tree(tree) < 0)
@@ -355,11 +363,15 @@ static void fill_bitmap_tree(struct bitmap *bitmap,
 	while (tree_entry(&desc, &entry)) {
 		switch (object_type(entry.mode)) {
 		case OBJ_TREE:
-			fill_bitmap_tree(bitmap,
-					 lookup_tree(the_repository, &entry.oid));
+			if (fill_bitmap_tree(bitmap,
+					     lookup_tree(the_repository, &entry.oid)) < 0)
+				return -1;
 			break;
 		case OBJ_BLOB:
-			bitmap_set(bitmap, find_object_pos(&entry.oid));
+			pos = find_object_pos(&entry.oid, &found);
+			if (!found)
+				return -1;
+			bitmap_set(bitmap, pos);
 			break;
 		default:
 			/* Gitlink, etc; not reachable */
@@ -368,15 +380,18 @@ static void fill_bitmap_tree(struct bitmap *bitmap,
 	}
 
 	free_tree_buffer(tree);
+	return 0;
 }
 
-static void fill_bitmap_commit(struct bb_commit *ent,
-			       struct commit *commit,
-			       struct prio_queue *queue,
-			       struct prio_queue *tree_queue,
-			       struct bitmap_index *old_bitmap,
-			       const uint32_t *mapping)
+static int fill_bitmap_commit(struct bb_commit *ent,
+			      struct commit *commit,
+			      struct prio_queue *queue,
+			      struct prio_queue *tree_queue,
+			      struct bitmap_index *old_bitmap,
+			      const uint32_t *mapping)
 {
+	int found;
+	uint32_t pos;
 	if (!ent->bitmap)
 		ent->bitmap = bitmap_new();
 
@@ -401,11 +416,16 @@ static void fill_bitmap_commit(struct bb_commit *ent,
 		 * Mark ourselves and queue our tree. The commit
 		 * walk ensures we cover all parents.
 		 */
-		bitmap_set(ent->bitmap, find_object_pos(&c->object.oid));
+		pos = find_object_pos(&c->object.oid, &found);
+		if (!found)
+			return -1;
+		bitmap_set(ent->bitmap, pos);
 		prio_queue_put(tree_queue, get_commit_tree(c));
 
 		for (p = c->parents; p; p = p->next) {
-			int pos = find_object_pos(&p->item->object.oid);
+			pos = find_object_pos(&p->item->object.oid, &found);
+			if (!found)
+				return -1;
 			if (!bitmap_get(ent->bitmap, pos)) {
 				bitmap_set(ent->bitmap, pos);
 				prio_queue_put(queue, p->item);
@@ -413,8 +433,12 @@ static void fill_bitmap_commit(struct bb_commit *ent,
 		}
 	}
 
-	while (tree_queue->nr)
-		fill_bitmap_tree(ent->bitmap, prio_queue_get(tree_queue));
+	while (tree_queue->nr) {
+		if (fill_bitmap_tree(ent->bitmap,
+				     prio_queue_get(tree_queue)) < 0)
+			return -1;
+	}
+	return 0;
 }
 
 static void store_selected(struct bb_commit *ent, struct commit *commit)
@@ -432,7 +456,7 @@ static void store_selected(struct bb_commit *ent, struct commit *commit)
 	kh_value(writer.bitmaps, hash_pos) = stored;
 }
 
-void bitmap_writer_build(struct packing_data *to_pack)
+int bitmap_writer_build(struct packing_data *to_pack)
 {
 	struct bitmap_builder bb;
 	size_t i;
@@ -441,6 +465,7 @@ void bitmap_writer_build(struct packing_data *to_pack)
 	struct prio_queue tree_queue = { NULL };
 	struct bitmap_index *old_bitmap;
 	uint32_t *mapping;
+	int closed = 1; /* until proven otherwise */
 
 	writer.bitmaps = kh_init_oid_map();
 	writer.to_pack = to_pack;
@@ -463,8 +488,11 @@ void bitmap_writer_build(struct packing_data *to_pack)
 		struct commit *child;
 		int reused = 0;
 
-		fill_bitmap_commit(ent, commit, &queue, &tree_queue,
-				   old_bitmap, mapping);
+		if (fill_bitmap_commit(ent, commit, &queue, &tree_queue,
+				       old_bitmap, mapping) < 0) {
+			closed = 0;
+			break;
+		}
 
 		if (ent->selected) {
 			store_selected(ent, commit);
@@ -499,7 +527,9 @@ void bitmap_writer_build(struct packing_data *to_pack)
 
 	stop_progress(&writer.progress);
 
-	compute_xor_offsets();
+	if (closed)
+		compute_xor_offsets();
+	return closed;
 }
 
 /**
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 78f2b3ff79..988ed3a30d 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -86,7 +86,7 @@ struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
 				      struct commit *commit);
 void bitmap_writer_select_commits(struct commit **indexed_commits,
 		unsigned int indexed_commits_nr, int max_bitmaps);
-void bitmap_writer_build(struct packing_data *to_pack);
+int bitmap_writer_build(struct packing_data *to_pack);
 void bitmap_writer_finish(struct pack_idx_entry **index,
 			  uint32_t index_nr,
 			  const char *filename,
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 03/22] pack-bitmap-write.c: free existing bitmaps
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
  2021-04-09 18:10 ` [PATCH 01/22] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps Taylor Blau
  2021-04-09 18:10 ` [PATCH 02/22] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps Taylor Blau
@ 2021-04-09 18:10 ` Taylor Blau
  2021-04-09 18:10 ` [PATCH 04/22] Documentation: build 'technical/bitmap-format' by default Taylor Blau
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

When writing a new bitmap, the bitmap writer code attempts to read the
existing bitmap (if one is present). This is done in order to quickly
permute the bits of any bitmaps for commits which appear in the existing
bitmap, and were also selected for the new bitmap.

But since this code was added in 341fa34887 (pack-bitmap-write: use
existing bitmaps, 2020-12-08), the resources associated with opening an
existing bitmap were never released.

It's fine to ignore this, but it's bad hygiene. It will also cause a
problem for the multi-pack-index builtin, which will be responsible not
only for writing bitmaps, but also for expiring any old multi-pack
bitmaps.

If an existing bitmap was reused here, it will also be expired. That
will cause a problem on platforms which require file resources to be
closed before unlinking them, like Windows. Avoid this by ensuring we
close reused bitmaps with free_bitmap_index() before removing them.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap-write.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index e829c46649..f90e100e3e 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -520,6 +520,7 @@ int bitmap_writer_build(struct packing_data *to_pack)
 	clear_prio_queue(&queue);
 	clear_prio_queue(&tree_queue);
 	bitmap_builder_clear(&bb);
+	free_bitmap_index(old_bitmap);
 	free(mapping);
 
 	trace2_region_leave("pack-bitmap-write", "building_bitmaps_total",
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 04/22] Documentation: build 'technical/bitmap-format' by default
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (2 preceding siblings ...)
  2021-04-09 18:10 ` [PATCH 03/22] pack-bitmap-write.c: free existing bitmaps Taylor Blau
@ 2021-04-09 18:10 ` Taylor Blau
  2021-04-09 18:11 ` [PATCH 05/22] Documentation: describe MIDX-based bitmaps Taylor Blau
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Even though the 'TECH_DOCS' variable was introduced all the way back in
5e00439f0a (Documentation: build html for all files in technical and
howto, 2012-10-23), the 'bitmap-format' document was never added to that
list when it was created.

Prepare for changes to this file by including it in the list of
technical documentation that 'make doc' will build by default.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 874a01d7a8..6d60c8c165 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -83,6 +83,7 @@ SP_ARTICLES += $(API_DOCS)
 TECH_DOCS += MyFirstContribution
 TECH_DOCS += MyFirstObjectWalk
 TECH_DOCS += SubmittingPatches
+TECH_DOCS += technical/bitmap-format
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 05/22] Documentation: describe MIDX-based bitmaps
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (3 preceding siblings ...)
  2021-04-09 18:10 ` [PATCH 04/22] Documentation: build 'technical/bitmap-format' by default Taylor Blau
@ 2021-04-09 18:11 ` Taylor Blau
  2021-04-09 18:11 ` [PATCH 06/22] midx: make a number of functions non-static Taylor Blau
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:11 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Update the technical documentation to describe the multi-pack bitmap
format. This patch merely introduces the new format, and describes its
high-level ideas. Git does not yet know how to read nor write these
multi-pack variants, and so the subsequent patches will:

  - Introduce code to interpret multi-pack bitmaps, according to this
    document.

  - Then, introduce code to write multi-pack bitmaps from the 'git
    multi-pack-index write' sub-command.

Finally, the implementation will gain tests in subsequent patches (as
opposed to inline with the patch teaching Git how to write multi-pack
bitmaps) to avoid a cyclic dependency.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/bitmap-format.txt    | 72 ++++++++++++++++----
 Documentation/technical/multi-pack-index.txt | 10 +--
 2 files changed, 61 insertions(+), 21 deletions(-)

diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
index f8c18a0f7a..25221c7ec8 100644
--- a/Documentation/technical/bitmap-format.txt
+++ b/Documentation/technical/bitmap-format.txt
@@ -1,6 +1,45 @@
 GIT bitmap v1 format
 ====================
 
+== Pack and multi-pack bitmaps
+
+Bitmaps store reachability information about the set of objects in a packfile,
+or a multi-pack index (MIDX). The former is defined obviously, and the latter is
+defined as the union of objects in packs contained in the MIDX.
+
+A bitmap may belong to either one pack, or the repository's multi-pack index (if
+it exists). A repository may have at most one bitmap.
+
+An object is uniquely described by its bit position within a bitmap:
+
+	- If the bitmap belongs to a packfile, the __n__th bit corresponds to
+	the __n__th object in pack order. For a function `offset` which maps
+	objects to their byte offset within a pack, pack order is defined as
+	follows:
+
+		o1 <= o2 <==> offset(o1) <= offset(o2)
+
+	- If the bitmap belongs to a MIDX, the __n__th bit corresponds to the
+	__n__th object in MIDX order. With an additional function `pack` which
+	maps objects to the pack they were selected from by the MIDX, MIDX order
+	is defined as follows:
+
+		o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)
+
+	The ordering between packs is done lexicographically by the pack name,
+	with the exception of the preferred pack, which sorts ahead of all other
+	packs.
+
+The on-disk representation (described below) of a bitmap is the same regardless
+of whether or not that bitmap belongs to a packfile or a MIDX. The only
+difference is the interpretation of the bits, which is described above.
+
+Certain bitmap extensions are supported (see: Appendix B). No extensions are
+required for bitmaps corresponding to packfiles. For bitmaps that correspond to
+MIDXs, both the bit-cache and rev-cache extensions are required.
+
+== On-disk format
+
 	- A header appears at the beginning:
 
 		4-byte signature: {'B', 'I', 'T', 'M'}
@@ -14,17 +53,19 @@ GIT bitmap v1 format
 			The following flags are supported:
 
 			- BITMAP_OPT_FULL_DAG (0x1) REQUIRED
-			This flag must always be present. It implies that the bitmap
-			index has been generated for a packfile with full closure
-			(i.e. where every single object in the packfile can find
-			 its parent links inside the same packfile). This is a
-			requirement for the bitmap index format, also present in JGit,
-			that greatly reduces the complexity of the implementation.
+			This flag must always be present. It implies that the
+			bitmap index has been generated for a packfile or
+			multi-pack index (MIDX) with full closure (i.e. where
+			every single object in the packfile/MIDX can find its
+			parent links inside the same packfile/MIDX). This is a
+			requirement for the bitmap index format, also present in
+			JGit, that greatly reduces the complexity of the
+			implementation.
 
 			- BITMAP_OPT_HASH_CACHE (0x4)
 			If present, the end of the bitmap file contains
 			`N` 32-bit name-hash values, one per object in the
-			pack. The format and meaning of the name-hash is
+			pack/MIDX. The format and meaning of the name-hash is
 			described below.
 
 		4-byte entry count (network byte order)
@@ -33,7 +74,8 @@ GIT bitmap v1 format
 
 		20-byte checksum
 
-			The SHA1 checksum of the pack this bitmap index belongs to.
+			The SHA1 checksum of the pack/MIDX this bitmap index
+			belongs to.
 
 	- 4 EWAH bitmaps that act as type indexes
 
@@ -50,7 +92,7 @@ GIT bitmap v1 format
 			- Tags
 
 		In each bitmap, the `n`th bit is set to true if the `n`th object
-		in the packfile is of that type.
+		in the packfile or multi-pack index is of that type.
 
 		The obvious consequence is that the OR of all 4 bitmaps will result
 		in a full set (all bits set), and the AND of all 4 bitmaps will
@@ -62,8 +104,9 @@ GIT bitmap v1 format
 		Each entry contains the following:
 
 		- 4-byte object position (network byte order)
-			The position **in the index for the packfile** where the
-			bitmap for this commit is found.
+			The position **in the index for the packfile or
+			multi-pack index** where the bitmap for this commit is
+			found.
 
 		- 1-byte XOR-offset
 			The xor offset used to compress this bitmap. For an entry
@@ -146,10 +189,11 @@ Name-hash cache
 ---------------
 
 If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains
-a cache of 32-bit values, one per object in the pack. The value at
+a cache of 32-bit values, one per object in the pack/MIDX. The value at
 position `i` is the hash of the pathname at which the `i`th object
-(counting in index order) in the pack can be found.  This can be fed
-into the delta heuristics to compare objects with similar pathnames.
+(counting in index or multi-pack index order) in the pack/MIDX can be found.
+This can be fed into the delta heuristics to compare objects with similar
+pathnames.
 
 The hash algorithm used is:
 
diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
index fb688976c4..1a73c3ee20 100644
--- a/Documentation/technical/multi-pack-index.txt
+++ b/Documentation/technical/multi-pack-index.txt
@@ -71,14 +71,10 @@ Future Work
   still reducing the number of binary searches required for object
   lookups.
 
-- The reachability bitmap is currently paired directly with a single
-  packfile, using the pack-order as the object order to hopefully
-  compress the bitmaps well using run-length encoding. This could be
-  extended to pair a reachability bitmap with a multi-pack-index. If
-  the multi-pack-index is extended to store a "stable object order"
+- If the multi-pack-index is extended to store a "stable object order"
   (a function Order(hash) = integer that is constant for a given hash,
-  even as the multi-pack-index is updated) then a reachability bitmap
-  could point to a multi-pack-index and be updated independently.
+  even as the multi-pack-index is updated) then MIDX bitmaps could be
+  updated independently of the MIDX.
 
 - Packfiles can be marked as "special" using empty files that share
   the initial name but replace ".pack" with ".keep" or ".promisor".
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 06/22] midx: make a number of functions non-static
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (4 preceding siblings ...)
  2021-04-09 18:11 ` [PATCH 05/22] Documentation: describe MIDX-based bitmaps Taylor Blau
@ 2021-04-09 18:11 ` Taylor Blau
  2021-04-09 18:11 ` [PATCH 07/22] midx: clear auxiliary .rev after replacing the MIDX Taylor Blau
                   ` (19 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:11 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

These functions will be called from outside of midx.c in a subsequent
patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 4 ++--
 midx.h | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index 9e86583172..5249802326 100644
--- a/midx.c
+++ b/midx.c
@@ -48,12 +48,12 @@ static uint8_t oid_version(void)
 	}
 }
 
-static const unsigned char *get_midx_checksum(struct multi_pack_index *m)
+const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
 }
 
-static char *get_midx_filename(const char *object_dir)
+char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
diff --git a/midx.h b/midx.h
index 8684cf0fef..1172df1a71 100644
--- a/midx.h
+++ b/midx.h
@@ -42,6 +42,8 @@ struct multi_pack_index {
 #define MIDX_PROGRESS     (1 << 0)
 #define MIDX_WRITE_REV_INDEX (1 << 1)
 
+const unsigned char *get_midx_checksum(struct multi_pack_index *m);
+char *get_midx_filename(const char *object_dir);
 char *get_midx_rev_filename(struct multi_pack_index *m);
 
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 07/22] midx: clear auxiliary .rev after replacing the MIDX
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (5 preceding siblings ...)
  2021-04-09 18:11 ` [PATCH 06/22] midx: make a number of functions non-static Taylor Blau
@ 2021-04-09 18:11 ` Taylor Blau
  2021-04-09 18:11 ` [PATCH 08/22] midx: respect 'core.multiPackIndex' when writing Taylor Blau
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:11 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

When writing a new multi-pack index, write_midx_internal() attempts to
clean up any auxiliary files (currently just the MIDX's `.rev` file, but
soon to include a `.bitmap`, too) corresponding to the MIDX it's
replacing.

This step should happen after the new MIDX is written into place, since
doing so beforehand means that the old MIDX could be read without its
corresponding .rev file.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 5249802326..a24c36968d 100644
--- a/midx.c
+++ b/midx.c
@@ -1076,10 +1076,11 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	if (flags & MIDX_WRITE_REV_INDEX)
 		write_midx_reverse_index(midx_name, midx_hash, &ctx);
-	clear_midx_files_ext(the_repository, ".rev", midx_hash);
 
 	commit_lock_file(&lk);
 
+	clear_midx_files_ext(the_repository, ".rev", midx_hash);
+
 cleanup:
 	for (i = 0; i < ctx.nr; i++) {
 		if (ctx.info[i].p) {
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 08/22] midx: respect 'core.multiPackIndex' when writing
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (6 preceding siblings ...)
  2021-04-09 18:11 ` [PATCH 07/22] midx: clear auxiliary .rev after replacing the MIDX Taylor Blau
@ 2021-04-09 18:11 ` Taylor Blau
  2021-04-09 18:11 ` [PATCH 09/22] pack-bitmap.c: introduce 'bitmap_num_objects()' Taylor Blau
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:11 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

When writing a new multi-pack index, write_midx_internal() attempts to
load any existing one to fill in some pieces of information. But it uses
load_multi_pack_index(), which ignores the configuration
"core.multiPackIndex", which indicates whether or not Git is allowed to
read an existing multi-pack-index.

Replace this with a routine that does respect that setting, to avoid
reading multi-pack-index files when told not to.

This avoids a problem that would arise in subsequent patches due to the
combination of 'git repack' reopening the object store in-process and
the multi-pack index code not checking whether a pack already exists in
the object store when calling add_pack_to_midx().

This would ultimately lead to a cycle being created along the
'packed_git' struct's '->next' pointer. That is obviously bad, but it
has hard-to-debug downstream effects like saying a bitmap can't be
loaded for a pack because one already exists (for the same pack).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index a24c36968d..567cdf0fcf 100644
--- a/midx.c
+++ b/midx.c
@@ -908,8 +908,18 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	if (m)
 		ctx.m = m;
-	else
-		ctx.m = load_multi_pack_index(object_dir, 1);
+	else {
+		struct multi_pack_index *cur;
+
+		prepare_multi_pack_index_one(the_repository, object_dir, 1);
+
+		ctx.m = NULL;
+		for (cur = the_repository->objects->multi_pack_index; cur;
+		     cur = cur->next) {
+			if (!strcmp(object_dir, cur->object_dir))
+				ctx.m = cur;
+		}
+	}
 
 	ctx.nr = 0;
 	ctx.alloc = ctx.m ? ctx.m->num_packs : 16;
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 09/22] pack-bitmap.c: introduce 'bitmap_num_objects()'
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (7 preceding siblings ...)
  2021-04-09 18:11 ` [PATCH 08/22] midx: respect 'core.multiPackIndex' when writing Taylor Blau
@ 2021-04-09 18:11 ` Taylor Blau
  2021-04-09 18:11 ` [PATCH 10/22] pack-bitmap.c: introduce 'nth_bitmap_object_oid()' Taylor Blau
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:11 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

A subsequent patch to support reading MIDX bitmaps will be less noisy
after extracting a generic function to return how many objects are
contained in a bitmap.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 37 +++++++++++++++++++++----------------
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index d45e91db1e..a6c616aa3e 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -136,6 +136,11 @@ static struct ewah_bitmap *read_bitmap_1(struct bitmap_index *index)
 	return b;
 }
 
+static uint32_t bitmap_num_objects(struct bitmap_index *index)
+{
+	return index->pack->num_objects;
+}
+
 static int load_bitmap_header(struct bitmap_index *index)
 {
 	struct bitmap_disk_header *header = (void *)index->map;
@@ -154,7 +159,7 @@ static int load_bitmap_header(struct bitmap_index *index)
 	/* Parse known bitmap format options */
 	{
 		uint32_t flags = ntohs(header->options);
-		size_t cache_size = st_mult(index->pack->num_objects, sizeof(uint32_t));
+		size_t cache_size = st_mult(bitmap_num_objects(index), sizeof(uint32_t));
 		unsigned char *index_end = index->map + index->map_size - the_hash_algo->rawsz;
 
 		if ((flags & BITMAP_OPT_FULL_DAG) == 0)
@@ -399,7 +404,7 @@ static inline int bitmap_position_extended(struct bitmap_index *bitmap_git,
 
 	if (pos < kh_end(positions)) {
 		int bitmap_pos = kh_value(positions, pos);
-		return bitmap_pos + bitmap_git->pack->num_objects;
+		return bitmap_pos + bitmap_num_objects(bitmap_git);
 	}
 
 	return -1;
@@ -451,7 +456,7 @@ static int ext_index_add_object(struct bitmap_index *bitmap_git,
 		bitmap_pos = kh_value(eindex->positions, hash_pos);
 	}
 
-	return bitmap_pos + bitmap_git->pack->num_objects;
+	return bitmap_pos + bitmap_num_objects(bitmap_git);
 }
 
 struct bitmap_show_data {
@@ -647,7 +652,7 @@ static void show_extended_objects(struct bitmap_index *bitmap_git,
 	for (i = 0; i < eindex->count; ++i) {
 		struct object *obj;
 
-		if (!bitmap_get(objects, bitmap_git->pack->num_objects + i))
+		if (!bitmap_get(objects, bitmap_num_objects(bitmap_git) + i))
 			continue;
 
 		obj = eindex->objects[i];
@@ -808,7 +813,7 @@ static void filter_bitmap_exclude_type(struct bitmap_index *bitmap_git,
 	 * individually.
 	 */
 	for (i = 0; i < eindex->count; i++) {
-		uint32_t pos = i + bitmap_git->pack->num_objects;
+		uint32_t pos = i + bitmap_num_objects(bitmap_git);
 		if (eindex->objects[i]->type == type &&
 		    bitmap_get(to_filter, pos) &&
 		    !bitmap_get(tips, pos))
@@ -835,7 +840,7 @@ static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
 
 	oi.sizep = &size;
 
-	if (pos < pack->num_objects) {
+	if (pos < bitmap_num_objects(bitmap_git)) {
 		off_t ofs = pack_pos_to_offset(pack, pos);
 		if (packed_object_info(the_repository, pack, ofs, &oi) < 0) {
 			struct object_id oid;
@@ -845,7 +850,7 @@ static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
 		}
 	} else {
 		struct eindex *eindex = &bitmap_git->ext_index;
-		struct object *obj = eindex->objects[pos - pack->num_objects];
+		struct object *obj = eindex->objects[pos - bitmap_num_objects(bitmap_git)];
 		if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
 			die(_("unable to get size of %s"), oid_to_hex(&obj->oid));
 	}
@@ -887,7 +892,7 @@ static void filter_bitmap_blob_limit(struct bitmap_index *bitmap_git,
 	}
 
 	for (i = 0; i < eindex->count; i++) {
-		uint32_t pos = i + bitmap_git->pack->num_objects;
+		uint32_t pos = i + bitmap_num_objects(bitmap_git);
 		if (eindex->objects[i]->type == OBJ_BLOB &&
 		    bitmap_get(to_filter, pos) &&
 		    !bitmap_get(tips, pos) &&
@@ -1075,8 +1080,8 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 	enum object_type type;
 	unsigned long size;
 
-	if (pos >= bitmap_git->pack->num_objects)
-		return; /* not actually in the pack */
+	if (pos >= bitmap_num_objects(bitmap_git))
+		return; /* not actually in the pack or MIDX */
 
 	offset = header = pack_pos_to_offset(bitmap_git->pack, pos);
 	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
@@ -1142,6 +1147,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	struct pack_window *w_curs = NULL;
 	size_t i = 0;
 	uint32_t offset;
+	uint32_t objects_nr = bitmap_num_objects(bitmap_git);
 
 	assert(result);
 
@@ -1149,8 +1155,8 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 		i++;
 
 	/* Don't mark objects not in the packfile */
-	if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
-		i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
+	if (i > objects_nr / BITS_IN_EWORD)
+		i = objects_nr / BITS_IN_EWORD;
 
 	reuse = bitmap_word_alloc(i);
 	memset(reuse->words, 0xFF, i * sizeof(eword_t));
@@ -1234,7 +1240,7 @@ static uint32_t count_object_type(struct bitmap_index *bitmap_git,
 
 	for (i = 0; i < eindex->count; ++i) {
 		if (eindex->objects[i]->type == type &&
-			bitmap_get(objects, bitmap_git->pack->num_objects + i))
+			bitmap_get(objects, bitmap_num_objects(bitmap_git) + i))
 			count++;
 	}
 
@@ -1455,7 +1461,7 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
 	uint32_t i, num_objects;
 	uint32_t *reposition;
 
-	num_objects = bitmap_git->pack->num_objects;
+	num_objects = bitmap_num_objects(bitmap_git);
 	CALLOC_ARRAY(reposition, num_objects);
 
 	for (i = 0; i < num_objects; ++i) {
@@ -1538,7 +1544,6 @@ static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
 static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git)
 {
 	struct bitmap *result = bitmap_git->result;
-	struct packed_git *pack = bitmap_git->pack;
 	struct eindex *eindex = &bitmap_git->ext_index;
 	off_t total = 0;
 	struct object_info oi = OBJECT_INFO_INIT;
@@ -1550,7 +1555,7 @@ static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git)
 	for (i = 0; i < eindex->count; i++) {
 		struct object *obj = eindex->objects[i];
 
-		if (!bitmap_get(result, pack->num_objects + i))
+		if (!bitmap_get(result, bitmap_num_objects(bitmap_git) + i))
 			continue;
 
 		if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 10/22] pack-bitmap.c: introduce 'nth_bitmap_object_oid()'
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (8 preceding siblings ...)
  2021-04-09 18:11 ` [PATCH 09/22] pack-bitmap.c: introduce 'bitmap_num_objects()' Taylor Blau
@ 2021-04-09 18:11 ` Taylor Blau
  2021-04-09 18:11 ` [PATCH 11/22] pack-bitmap.c: introduce 'bitmap_is_preferred_refname()' Taylor Blau
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:11 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

A subsequent patch to support reading MIDX bitmaps will be less noisy
after extracting a generic function to fetch the nth OID contained in
the bitmap.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index a6c616aa3e..97ee2d331d 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -223,6 +223,13 @@ static inline uint8_t read_u8(const unsigned char *buffer, size_t *pos)
 
 #define MAX_XOR_OFFSET 160
 
+static void nth_bitmap_object_oid(struct bitmap_index *index,
+				  struct object_id *oid,
+				  uint32_t n)
+{
+	nth_packed_object_id(oid, index->pack, n);
+}
+
 static int load_bitmap_entries_v1(struct bitmap_index *index)
 {
 	uint32_t i;
@@ -242,9 +249,7 @@ static int load_bitmap_entries_v1(struct bitmap_index *index)
 		xor_offset = read_u8(index->map, &index->map_pos);
 		flags = read_u8(index->map, &index->map_pos);
 
-		if (nth_packed_object_id(&oid, index->pack, commit_idx_pos) < 0)
-			return error("corrupt ewah bitmap: commit index %u out of range",
-				     (unsigned)commit_idx_pos);
+		nth_bitmap_object_oid(index, &oid, commit_idx_pos);
 
 		bitmap = read_bitmap_1(index);
 		if (!bitmap)
@@ -844,8 +849,8 @@ static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
 		off_t ofs = pack_pos_to_offset(pack, pos);
 		if (packed_object_info(the_repository, pack, ofs, &oi) < 0) {
 			struct object_id oid;
-			nth_packed_object_id(&oid, pack,
-					     pack_pos_to_index(pack, pos));
+			nth_bitmap_object_oid(bitmap_git, &oid,
+					      pack_pos_to_index(pack, pos));
 			die(_("unable to get size of %s"), oid_to_hex(&oid));
 		}
 	} else {
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 11/22] pack-bitmap.c: introduce 'bitmap_is_preferred_refname()'
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (9 preceding siblings ...)
  2021-04-09 18:11 ` [PATCH 10/22] pack-bitmap.c: introduce 'nth_bitmap_object_oid()' Taylor Blau
@ 2021-04-09 18:11 ` Taylor Blau
  2021-04-09 18:11 ` [PATCH 12/22] pack-bitmap: read multi-pack bitmaps Taylor Blau
                   ` (14 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:11 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

In a recent commit, pack-objects learned support for the
'pack.preferBitmapTips' configuration. This patch prepares the
multi-pack bitmap code to respect this configuration, too.

Since the multi-pack bitmap code already does a traversal of all
references (in order to discover the set of reachable commits in the
multi-pack index), it is more efficient to check whether or not each
reference is a suffix of any value of 'pack.preferBitmapTips' rather
than do an additional traversal.

Implement a function 'bitmap_is_preferred_refname()' which does just
that. The caller will be added in a subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 16 ++++++++++++++++
 pack-bitmap.h |  1 +
 2 files changed, 17 insertions(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 97ee2d331d..be52570b0f 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1594,3 +1594,19 @@ const struct string_list *bitmap_preferred_tips(struct repository *r)
 {
 	return repo_config_get_value_multi(r, "pack.preferbitmaptips");
 }
+
+int bitmap_is_preferred_refname(struct repository *r, const char *refname)
+{
+	const struct string_list *preferred_tips = bitmap_preferred_tips(r);
+	struct string_list_item *item;
+
+	if (!preferred_tips)
+		return 0;
+
+	for_each_string_list_item(item, preferred_tips) {
+		if (starts_with(refname, item->string))
+			return 1;
+	}
+
+	return 0;
+}
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 988ed3a30d..0bf75ff2a7 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -93,5 +93,6 @@ void bitmap_writer_finish(struct pack_idx_entry **index,
 			  uint16_t options);
 
 const struct string_list *bitmap_preferred_tips(struct repository *r);
+int bitmap_is_preferred_refname(struct repository *r, const char *refname);
 
 #endif
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 12/22] pack-bitmap: read multi-pack bitmaps
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (10 preceding siblings ...)
  2021-04-09 18:11 ` [PATCH 11/22] pack-bitmap.c: introduce 'bitmap_is_preferred_refname()' Taylor Blau
@ 2021-04-09 18:11 ` Taylor Blau
  2021-04-16  2:39   ` Jonathan Tan
  2021-04-09 18:11 ` [PATCH 13/22] pack-bitmap: write " Taylor Blau
                   ` (13 subsequent siblings)
  25 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:11 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

This prepares the code in pack-bitmap to interpret the new multi-pack
bitmaps described in Documentation/technical/bitmap-format.txt, which
mostly involves converting bit positions to accommodate looking them up
in a MIDX.

Note that there are currently no writers who write multi-pack bitmaps,
and that this will be implemented in the subsequent commit.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |  12 +-
 pack-bitmap-write.c    |   2 +-
 pack-bitmap.c          | 349 +++++++++++++++++++++++++++++++++++++----
 pack-bitmap.h          |   5 +
 packfile.c             |   2 +-
 5 files changed, 338 insertions(+), 32 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5205dde2e1..a4e4e4ebcc 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -984,7 +984,17 @@ static void write_reused_pack(struct hashfile *f)
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
-			write_reused_pack_one(pos + offset, f, &w_curs);
+			if (bitmap_is_midx(bitmap_git)) {
+				off_t pack_offs = bitmap_pack_offset(bitmap_git,
+								     pos + offset);
+				uint32_t pos;
+
+				if (offset_to_pack_pos(reuse_packfile, pack_offs, &pos) < 0)
+					die(_("write_reused_pack: could not locate %"PRIdMAX),
+					    (intmax_t)pack_offs);
+				write_reused_pack_one(pos, f, &w_curs);
+			} else
+				write_reused_pack_one(pos + offset, f, &w_curs);
 			display_progress(progress_state, ++written);
 		}
 	}
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index f90e100e3e..020c1774c8 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -48,7 +48,7 @@ void bitmap_writer_show_progress(int show)
 }
 
 /**
- * Build the initial type index for the packfile
+ * Build the initial type index for the packfile or multi-pack-index
  */
 void bitmap_writer_build_type_index(struct packing_data *to_pack,
 				    struct pack_idx_entry **index,
diff --git a/pack-bitmap.c b/pack-bitmap.c
index be52570b0f..e41fce9675 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -13,6 +13,7 @@
 #include "repository.h"
 #include "object-store.h"
 #include "list-objects-filter-options.h"
+#include "midx.h"
 #include "config.h"
 
 /*
@@ -35,8 +36,15 @@ struct stored_bitmap {
  * the active bitmap index is the largest one.
  */
 struct bitmap_index {
-	/* Packfile to which this bitmap index belongs to */
+	/*
+	 * The pack or multi-pack index (MIDX) that this bitmap index belongs
+	 * to.
+	 *
+	 * Exactly one of these must be non-NULL; this specifies the object
+	 * order used to interpret this bitmap.
+	 */
 	struct packed_git *pack;
+	struct multi_pack_index *midx;
 
 	/*
 	 * Mark the first `reuse_objects` in the packfile as reused:
@@ -71,6 +79,8 @@ struct bitmap_index {
 	/* If not NULL, this is a name-hash cache pointing into map. */
 	uint32_t *hashes;
 
+	const unsigned char *checksum;
+
 	/*
 	 * Extended index.
 	 *
@@ -138,6 +148,8 @@ static struct ewah_bitmap *read_bitmap_1(struct bitmap_index *index)
 
 static uint32_t bitmap_num_objects(struct bitmap_index *index)
 {
+	if (index->midx)
+		return index->midx->num_objects;
 	return index->pack->num_objects;
 }
 
@@ -175,6 +187,7 @@ static int load_bitmap_header(struct bitmap_index *index)
 	}
 
 	index->entry_count = ntohl(header->entry_count);
+	index->checksum = header->checksum;
 	index->map_pos += header_size;
 	return 0;
 }
@@ -227,7 +240,10 @@ static void nth_bitmap_object_oid(struct bitmap_index *index,
 				  struct object_id *oid,
 				  uint32_t n)
 {
-	nth_packed_object_id(oid, index->pack, n);
+	if (index->midx)
+		nth_midxed_object_oid(oid, index->midx, n);
+	else
+		nth_packed_object_id(oid, index->pack, n);
 }
 
 static int load_bitmap_entries_v1(struct bitmap_index *index)
@@ -272,7 +288,14 @@ static int load_bitmap_entries_v1(struct bitmap_index *index)
 	return 0;
 }
 
-static char *pack_bitmap_filename(struct packed_git *p)
+char *midx_bitmap_filename(struct multi_pack_index *midx)
+{
+	return xstrfmt("%s-%s.bitmap",
+		       get_midx_filename(midx->object_dir),
+		       hash_to_hex(get_midx_checksum(midx)));
+}
+
+char *pack_bitmap_filename(struct packed_git *p)
 {
 	size_t len;
 
@@ -281,6 +304,54 @@ static char *pack_bitmap_filename(struct packed_git *p)
 	return xstrfmt("%.*s.bitmap", (int)len, p->pack_name);
 }
 
+static int open_midx_bitmap_1(struct bitmap_index *bitmap_git,
+			      struct multi_pack_index *midx)
+{
+	struct stat st;
+	char *idx_name = midx_bitmap_filename(midx);
+	int fd = git_open(idx_name);
+
+	free(idx_name);
+
+	if (fd < 0)
+		return -1;
+
+	if (fstat(fd, &st)) {
+		close(fd);
+		return -1;
+	}
+
+	if (bitmap_git->pack || bitmap_git->midx) {
+		/* ignore extra bitmap file; we can only handle one */
+		return -1;
+	}
+
+	bitmap_git->midx = midx;
+	bitmap_git->map_size = xsize_t(st.st_size);
+	bitmap_git->map_pos = 0;
+	bitmap_git->map = xmmap(NULL, bitmap_git->map_size, PROT_READ,
+				MAP_PRIVATE, fd, 0);
+	close(fd);
+
+	if (load_bitmap_header(bitmap_git) < 0)
+		goto cleanup;
+
+	if (!hasheq(get_midx_checksum(bitmap_git->midx), bitmap_git->checksum))
+		goto cleanup;
+
+	if (load_midx_revindex(bitmap_git->midx) < 0) {
+		warning(_("multi-pack bitmap is missing required reverse index"));
+		goto cleanup;
+	}
+	return 0;
+
+cleanup:
+	munmap(bitmap_git->map, bitmap_git->map_size);
+	bitmap_git->map_size = 0;
+	bitmap_git->map = NULL;
+	return -1;
+}
+
 static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git *packfile)
 {
 	int fd;
@@ -302,12 +373,18 @@ static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git
 		return -1;
 	}
 
-	if (bitmap_git->pack) {
+	if (bitmap_git->pack || bitmap_git->midx) {
+		/* ignore extra bitmap file; we can only handle one */
 		warning("ignoring extra bitmap file: %s", packfile->pack_name);
 		close(fd);
 		return -1;
 	}
 
+	if (!is_pack_valid(packfile)) {
+		close(fd);
+		return -1;
+	}
+
 	bitmap_git->pack = packfile;
 	bitmap_git->map_size = xsize_t(st.st_size);
 	bitmap_git->map = xmmap(NULL, bitmap_git->map_size, PROT_READ, MAP_PRIVATE, fd, 0);
@@ -324,13 +401,36 @@ static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git
 	return 0;
 }
 
-static int load_pack_bitmap(struct bitmap_index *bitmap_git)
+static int load_reverse_index(struct bitmap_index *bitmap_git)
+{
+	if (bitmap_is_midx(bitmap_git)) {
+		uint32_t i;
+		int ret;
+
+		ret = load_midx_revindex(bitmap_git->midx);
+		if (ret)
+			return ret;
+
+		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
+			if (prepare_midx_pack(the_repository, bitmap_git->midx, i))
+				die(_("load_reverse_index: could not open pack"));
+			ret = load_pack_revindex(bitmap_git->midx->packs[i]);
+			if (ret)
+				return ret;
+		}
+		return 0;
+	}
+	return load_pack_revindex(bitmap_git->pack);
+}
+
+static int load_bitmap(struct bitmap_index *bitmap_git)
 {
 	assert(bitmap_git->map);
 
 	bitmap_git->bitmaps = kh_init_oid_map();
 	bitmap_git->ext_index.positions = kh_init_oid_pos();
-	if (load_pack_revindex(bitmap_git->pack))
+
+	if (load_reverse_index(bitmap_git))
 		goto failed;
 
 	if (!(bitmap_git->commits = read_bitmap_1(bitmap_git)) ||
@@ -374,11 +474,35 @@ static int open_pack_bitmap(struct repository *r,
 	return ret;
 }
 
+static int open_midx_bitmap(struct repository *r,
+			    struct bitmap_index *bitmap_git)
+{
+	struct multi_pack_index *midx;
+
+	assert(!bitmap_git->map);
+
+	for (midx = get_multi_pack_index(r); midx; midx = midx->next) {
+		if (!open_midx_bitmap_1(bitmap_git, midx))
+			return 0;
+	}
+	return -1;
+}
+
+static int open_bitmap(struct repository *r,
+		       struct bitmap_index *bitmap_git)
+{
+	assert(!bitmap_git->map);
+
+	if (!open_midx_bitmap(r, bitmap_git))
+		return 0;
+	return open_pack_bitmap(r, bitmap_git);
+}
+
 struct bitmap_index *prepare_bitmap_git(struct repository *r)
 {
 	struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));
 
-	if (!open_pack_bitmap(r, bitmap_git) && !load_pack_bitmap(bitmap_git))
+	if (!open_bitmap(r, bitmap_git) && !load_bitmap(bitmap_git))
 		return bitmap_git;
 
 	free_bitmap_index(bitmap_git);
@@ -428,10 +552,26 @@ static inline int bitmap_position_packfile(struct bitmap_index *bitmap_git,
 	return pos;
 }
 
+static int bitmap_position_midx(struct bitmap_index *bitmap_git,
+				const struct object_id *oid)
+{
+	uint32_t want, got;
+	if (!bsearch_midx(oid, bitmap_git->midx, &want))
+		return -1;
+
+	if (midx_to_pack_pos(bitmap_git->midx, want, &got) < 0)
+		return -1;
+	return got;
+}
+
 static int bitmap_position(struct bitmap_index *bitmap_git,
 			   const struct object_id *oid)
 {
-	int pos = bitmap_position_packfile(bitmap_git, oid);
+	int pos;
+	if (bitmap_is_midx(bitmap_git))
+		pos = bitmap_position_midx(bitmap_git, oid);
+	else
+		pos = bitmap_position_packfile(bitmap_git, oid);
 	return (pos >= 0) ? pos : bitmap_position_extended(bitmap_git, oid);
 }
 
@@ -721,6 +861,7 @@ static void show_objects_for_type(
 			continue;
 
 		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
+			struct packed_git *pack;
 			struct object_id oid;
 			uint32_t hash = 0, index_pos;
 			off_t ofs;
@@ -730,14 +871,28 @@ static void show_objects_for_type(
 
 			offset += ewah_bit_ctz64(word >> offset);
 
-			index_pos = pack_pos_to_index(bitmap_git->pack, pos + offset);
-			ofs = pack_pos_to_offset(bitmap_git->pack, pos + offset);
-			nth_packed_object_id(&oid, bitmap_git->pack, index_pos);
+			if (bitmap_is_midx(bitmap_git)) {
+				struct multi_pack_index *m = bitmap_git->midx;
+				uint32_t pack_id;
+
+				index_pos = pack_pos_to_midx(m, pos + offset);
+				ofs = nth_midxed_offset(m, index_pos);
+				nth_midxed_object_oid(&oid, m, index_pos);
+
+				pack_id = nth_midxed_pack_int_id(m, index_pos);
+				pack = bitmap_git->midx->packs[pack_id];
+			} else {
+				index_pos = pack_pos_to_index(bitmap_git->pack, pos + offset);
+				ofs = pack_pos_to_offset(bitmap_git->pack, pos + offset);
+				nth_bitmap_object_oid(bitmap_git, &oid, index_pos);
+
+				pack = bitmap_git->pack;
+			}
 
 			if (bitmap_git->hashes)
 				hash = get_be32(bitmap_git->hashes + index_pos);
 
-			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, ofs);
+			show_reach(&oid, object_type, 0, hash, pack, ofs);
 		}
 	}
 }
@@ -749,8 +904,13 @@ static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
 		struct object *object = roots->item;
 		roots = roots->next;
 
-		if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
-			return 1;
+		if (bitmap_is_midx(bitmap_git)) {
+			if (bsearch_midx(&object->oid, bitmap_git->midx, NULL))
+				return 1;
+		} else {
+			if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
+				return 1;
+		}
 	}
 
 	return 0;
@@ -839,14 +999,26 @@ static void filter_bitmap_blob_none(struct bitmap_index *bitmap_git,
 static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
 				     uint32_t pos)
 {
-	struct packed_git *pack = bitmap_git->pack;
 	unsigned long size;
 	struct object_info oi = OBJECT_INFO_INIT;
 
 	oi.sizep = &size;
 
 	if (pos < bitmap_num_objects(bitmap_git)) {
-		off_t ofs = pack_pos_to_offset(pack, pos);
+		struct packed_git *pack;
+		off_t ofs;
+
+		if (bitmap_is_midx(bitmap_git)) {
+			uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, pos);
+			uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
+
+			pack = bitmap_git->midx->packs[pack_id];
+			ofs = nth_midxed_offset(bitmap_git->midx, midx_pos);
+		} else {
+			pack = bitmap_git->pack;
+			ofs = pack_pos_to_offset(pack, pos);
+		}
+
 		if (packed_object_info(the_repository, pack, ofs, &oi) < 0) {
 			struct object_id oid;
 			nth_bitmap_object_oid(bitmap_git, &oid,
@@ -990,7 +1162,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 	/* try to open a bitmapped pack, but don't parse it yet
 	 * because we may not need to use it */
 	CALLOC_ARRAY(bitmap_git, 1);
-	if (open_pack_bitmap(revs->repo, bitmap_git) < 0)
+	if (open_bitmap(revs->repo, bitmap_git) < 0)
 		goto cleanup;
 
 	for (i = 0; i < revs->pending.nr; ++i) {
@@ -1034,7 +1206,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 	 * from disk. this is the point of no return; after this the rev_list
 	 * becomes invalidated and we must perform the revwalk through bitmaps
 	 */
-	if (load_pack_bitmap(bitmap_git) < 0)
+	if (load_bitmap(bitmap_git) < 0)
 		goto cleanup;
 
 	object_array_clear(&revs->pending);
@@ -1081,15 +1253,29 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 			      struct bitmap *reuse,
 			      struct pack_window **w_curs)
 {
-	off_t offset, header;
+	struct packed_git *pack;
+	off_t offset, delta_obj_offset;
 	enum object_type type;
 	unsigned long size;
 
 	if (pos >= bitmap_num_objects(bitmap_git))
 		return; /* not actually in the pack or MIDX */
 
-	offset = header = pack_pos_to_offset(bitmap_git->pack, pos);
-	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
+	if (bitmap_is_midx(bitmap_git)) {
+		uint32_t pack_id, midx_pos;
+
+		midx_pos = pack_pos_to_midx(bitmap_git->midx, pos);
+		pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
+
+		pack = bitmap_git->midx->packs[pack_id];
+		offset = nth_midxed_offset(bitmap_git->midx, midx_pos);
+	} else {
+		pack = bitmap_git->pack;
+		offset = pack_pos_to_offset(bitmap_git->pack, pos);
+	}
+
+	delta_obj_offset = offset;
+	type = unpack_object_header(pack, w_curs, &offset, &size);
 	if (type < 0)
 		return; /* broken packfile, punt */
 
@@ -1105,11 +1291,11 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 		 * and the normal slow path will complain about it in
 		 * more detail.
 		 */
-		base_offset = get_delta_base(bitmap_git->pack, w_curs,
-					     &offset, type, header);
+		base_offset = get_delta_base(pack, w_curs, &offset, type,
+					     delta_obj_offset);
 		if (!base_offset)
 			return;
-		if (offset_to_pack_pos(bitmap_git->pack, base_offset, &base_pos) < 0)
+		if (offset_to_pack_pos(pack, base_offset, &base_pos) < 0)
 			return;
 
 		/*
@@ -1120,6 +1306,16 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 		 * packs we write fresh, and OFS_DELTA is the default). But
 		 * let's double check to make sure the pack wasn't written with
 		 * odd parameters.
+		 *
+		 * Note that the base does not need to be repositioned, i.e.,
+		 * the MIDX is guaranteed to have selected the copy of "base"
+		 * from the same pack, since this function is only ever called
+		 * on the preferred pack (and all duplicate objects are resolved
+		 * in favor of the preferred pack).
+		 *
+		 * This means that we can reuse base_pos when looking up the bit
+		 * in the reuse bitmap, too, since bits corresponding to the
+		 * preferred pack precede all bits from other packs.
 		 */
 		if (base_pos >= pos)
 			return;
@@ -1142,6 +1338,14 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 	bitmap_set(reuse, pos);
 }
 
+static uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
+{
+	struct multi_pack_index *m = bitmap_git->midx;
+	if (!m)
+		BUG("midx_preferred_pack: requires non-empty MIDX");
+	return nth_midxed_pack_int_id(m, pack_pos_to_midx(bitmap_git->midx, 0));
+}
+
 int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				       struct packed_git **packfile_out,
 				       uint32_t *entries,
@@ -1153,13 +1357,29 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	size_t i = 0;
 	uint32_t offset;
 	uint32_t objects_nr = bitmap_num_objects(bitmap_git);
+	uint32_t preferred_pack = 0;
 
 	assert(result);
 
+	load_reverse_index(bitmap_git);
+
+	if (bitmap_is_midx(bitmap_git)) {
+		preferred_pack = midx_preferred_pack(bitmap_git);
+		objects_nr = bitmap_git->midx->packs[preferred_pack]->num_objects;
+	} else
+		objects_nr = bitmap_git->pack->num_objects;
+
 	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
 		i++;
 
-	/* Don't mark objects not in the packfile */
+	/*
+	 * Don't mark objects not in the packfile or preferred pack. This bitmap
+	 * marks objects eligible for reuse, but the pack-reuse code only
+	 * understands how to reuse a single pack. Since the preferred pack is
+	 * guaranteed to have all bases for its deltas (in a multi-pack bitmap),
+	 * we use it instead of another pack. In single-pack bitmaps, the choice
+	 * is made for us.
+	 */
 	if (i > objects_nr / BITS_IN_EWORD)
 		i = objects_nr / BITS_IN_EWORD;
 
@@ -1175,6 +1395,14 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
+			if (bitmap_is_midx(bitmap_git)) {
+				/*
+				 * Can't reuse from a non-preferred pack (see
+				 * above).
+				 */
+				if (pos + offset >= objects_nr)
+					continue;
+			}
 			try_partial_reuse(bitmap_git, pos + offset, reuse, &w_curs);
 		}
 	}
@@ -1192,7 +1420,9 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	 * need to be handled separately.
 	 */
 	bitmap_and_not(result, reuse);
-	*packfile_out = bitmap_git->pack;
+	*packfile_out = bitmap_git->pack ?
+		bitmap_git->pack :
+		bitmap_git->midx->packs[preferred_pack];
 	*reuse_out = reuse;
 	return 0;
 }
@@ -1466,6 +1696,12 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
 	uint32_t i, num_objects;
 	uint32_t *reposition;
 
+	if (!bitmap_is_midx(bitmap_git))
+		load_reverse_index(bitmap_git);
+	else if (load_midx_revindex(bitmap_git->midx) < 0)
+		BUG("rebuild_existing_bitmaps: missing required rev-cache "
+		    "extension");
+
 	num_objects = bitmap_num_objects(bitmap_git);
 	CALLOC_ARRAY(reposition, num_objects);
 
@@ -1473,8 +1709,13 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
 		struct object_id oid;
 		struct object_entry *oe;
 
-		nth_packed_object_id(&oid, bitmap_git->pack,
-				     pack_pos_to_index(bitmap_git->pack, i));
+		if (bitmap_is_midx(bitmap_git))
+			nth_midxed_object_oid(&oid,
+					      bitmap_git->midx,
+					      pack_pos_to_midx(bitmap_git->midx, i));
+		else
+			nth_packed_object_id(&oid, bitmap_git->pack,
+					     pack_pos_to_index(bitmap_git->pack, i));
 		oe = packlist_find(mapping, &oid);
 
 		if (oe)
@@ -1500,6 +1741,19 @@ void free_bitmap_index(struct bitmap_index *b)
 	free(b->ext_index.hashes);
 	bitmap_free(b->result);
 	bitmap_free(b->haves);
+	if (bitmap_is_midx(b)) {
+		/*
+		 * Multi-pack bitmaps need to have resources associated with
+		 * their on-disk reverse indexes unmapped so that stale .rev and
+		 * .bitmap files can be removed.
+		 *
+		 * Unlike pack-based bitmaps, multi-pack bitmaps can be read and
+		 * written in the same 'git multi-pack-index write --bitmap'
+		 * process. Close resources so they can be removed safely on
+		 * platforms like Windows.
+		 */
+		close_midx_revindex(b->midx);
+	}
 	free(b);
 }
 
@@ -1514,7 +1768,7 @@ static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
 				     enum object_type object_type)
 {
 	struct bitmap *result = bitmap_git->result;
-	struct packed_git *pack = bitmap_git->pack;
+	struct packed_git *pack;
 	off_t total = 0;
 	struct ewah_iterator it;
 	eword_t filter;
@@ -1538,6 +1792,29 @@ static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
 
 			offset += ewah_bit_ctz64(word >> offset);
 			pos = base + offset;
+
+			if (bitmap_is_midx(bitmap_git)) {
+				uint32_t pack_pos;
+				uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, pos);
+				uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
+				off_t offset = nth_midxed_offset(bitmap_git->midx, midx_pos);
+
+				pack = bitmap_git->midx->packs[pack_id];
+
+				if (offset_to_pack_pos(pack, offset, &pack_pos) < 0) {
+					struct object_id oid;
+					nth_midxed_object_oid(&oid, bitmap_git->midx, midx_pos);
+
+					die(_("could not find %s in pack #%"PRIu32" at offset %"PRIuMAX),
+					    oid_to_hex(&oid),
+					    pack_id,
+					    (uintmax_t)offset);
+				}
+
+				pos = pack_pos;
+			} else
+				pack = bitmap_git->pack;
+
 			total += pack_pos_to_offset(pack, pos + 1) -
 				 pack_pos_to_offset(pack, pos);
 		}
@@ -1590,6 +1867,20 @@ off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git,
 	return total;
 }
 
+int bitmap_is_midx(struct bitmap_index *bitmap_git)
+{
+	return !!bitmap_git->midx;
+}
+
+off_t bitmap_pack_offset(struct bitmap_index *bitmap_git, uint32_t pos)
+{
+	if (bitmap_is_midx(bitmap_git))
+		return nth_midxed_offset(bitmap_git->midx,
+					 pack_pos_to_midx(bitmap_git->midx, pos));
+	return nth_packed_object_offset(bitmap_git->pack,
+					pack_pos_to_index(bitmap_git->pack, pos));
+}
+
 const struct string_list *bitmap_preferred_tips(struct repository *r)
 {
 	return repo_config_get_value_multi(r, "pack.preferbitmaptips");
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 0bf75ff2a7..0dc6f7a7e4 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -91,6 +91,11 @@ void bitmap_writer_finish(struct pack_idx_entry **index,
 			  uint32_t index_nr,
 			  const char *filename,
 			  uint16_t options);
+char *midx_bitmap_filename(struct multi_pack_index *midx);
+char *pack_bitmap_filename(struct packed_git *p);
+
+int bitmap_is_midx(struct bitmap_index *bitmap_git);
+off_t bitmap_pack_offset(struct bitmap_index *bitmap_git, uint32_t pos);
 
 const struct string_list *bitmap_preferred_tips(struct repository *r);
 int bitmap_is_preferred_refname(struct repository *r, const char *refname);
diff --git a/packfile.c b/packfile.c
index 8668345d93..c444e365a3 100644
--- a/packfile.c
+++ b/packfile.c
@@ -863,7 +863,7 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	if (!strcmp(file_name, "multi-pack-index"))
 		return;
 	if (starts_with(file_name, "multi-pack-index") &&
-	    ends_with(file_name, ".rev"))
+	    (ends_with(file_name, ".bitmap") || ends_with(file_name, ".rev")))
 		return;
 	if (ends_with(file_name, ".idx") ||
 	    ends_with(file_name, ".rev") ||
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 13/22] pack-bitmap: write multi-pack bitmaps
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (11 preceding siblings ...)
  2021-04-09 18:11 ` [PATCH 12/22] pack-bitmap: read multi-pack bitmaps Taylor Blau
@ 2021-04-09 18:11 ` Taylor Blau
  2021-05-04  5:02   ` Jonathan Tan
  2021-04-09 18:11 ` [PATCH 14/22] t5310: move some tests to lib-bitmap.sh Taylor Blau
                   ` (12 subsequent siblings)
  25 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:11 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Write multi-pack bitmaps in the format described by
Documentation/technical/bitmap-format.txt, inferring their presence with
the absence of '--bitmap'.

To write a multi-pack bitmap, this patch attempts to reuse as much of
the existing machinery from pack-objects as possible. Specifically, the
MIDX code prepares a packing_data struct that pretends as if a single
packfile has been generated containing all of the objects contained
within the MIDX.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-multi-pack-index.txt |  12 +-
 builtin/multi-pack-index.c             |   2 +
 midx.c                                 | 195 ++++++++++++++++++++++++-
 midx.h                                 |   1 +
 4 files changed, 202 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index ffd601bc17..ada14deb2c 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -10,7 +10,7 @@ SYNOPSIS
 --------
 [verse]
 'git multi-pack-index' [--object-dir=<dir>] [--[no-]progress]
-	[--preferred-pack=<pack>] <subcommand>
+	[--preferred-pack=<pack>] [--[no-]bitmap] <subcommand>
 
 DESCRIPTION
 -----------
@@ -40,6 +40,9 @@ write::
 		multiple packs contain the same object. If not given,
 		ties are broken in favor of the pack with the lowest
 		mtime.
+
+	--[no-]bitmap::
+		Control whether or not a multi-pack bitmap is written.
 --
 
 verify::
@@ -81,6 +84,13 @@ EXAMPLES
 $ git multi-pack-index write
 -----------------------------------------------
 
+* Write a MIDX file for the packfiles in the current .git folder with a
+corresponding bitmap.
++
+-------------------------------------------------------------
+$ git multi-pack-index write --preferred-pack <pack> --bitmap
+-------------------------------------------------------------
+
 * Write a MIDX file for the packfiles in an alternate object store.
 +
 -----------------------------------------------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 5d3ea445fd..bf6fa982e3 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -68,6 +68,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 		OPT_STRING(0, "preferred-pack", &opts.preferred_pack,
 			   N_("preferred-pack"),
 			   N_("pack for reuse when computing a multi-pack bitmap")),
+		OPT_BIT(0, "bitmap", &opts.flags, N_("write multi-pack bitmap"),
+			MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX),
 		OPT_END(),
 	};
 
diff --git a/midx.c b/midx.c
index 567cdf0fcf..32d7d184c0 100644
--- a/midx.c
+++ b/midx.c
@@ -13,6 +13,10 @@
 #include "repository.h"
 #include "chunk-format.h"
 #include "pack.h"
+#include "pack-bitmap.h"
+#include "refs.h"
+#include "revision.h"
+#include "list-objects.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -885,6 +889,145 @@ static void write_midx_reverse_index(char *midx_name, unsigned char *midx_hash,
 static void clear_midx_files_ext(struct repository *r, const char *ext,
 				 unsigned char *keep_hash);
 
+static void prepare_midx_packing_data(struct packing_data *pdata,
+				      struct write_midx_context *ctx)
+{
+	uint32_t i;
+
+	memset(pdata, 0, sizeof(struct packing_data));
+	prepare_packing_data(the_repository, pdata);
+
+	for (i = 0; i < ctx->entries_nr; i++) {
+		struct pack_midx_entry *from = &ctx->entries[ctx->pack_order[i]];
+		struct object_entry *to = packlist_alloc(pdata, &from->oid);
+
+		oe_set_in_pack(pdata, to,
+			       ctx->info[ctx->pack_perm[from->pack_int_id]].p);
+	}
+}
+
+static int add_ref_to_pending(const char *refname,
+			      const struct object_id *oid,
+			      int flag, void *cb_data)
+{
+	struct rev_info *revs = (struct rev_info*)cb_data;
+	struct object *object;
+
+	if ((flag & REF_ISSYMREF) && (flag & REF_ISBROKEN)) {
+		warning("symbolic ref is dangling: %s", refname);
+		return 0;
+	}
+
+	object = parse_object_or_die(oid, refname);
+	if (object->type != OBJ_COMMIT)
+		return 0;
+
+	add_pending_object(revs, object, "");
+	if (bitmap_is_preferred_refname(revs->repo, refname))
+		object->flags |= NEEDS_BITMAP;
+	return 0;
+}
+
+struct bitmap_commit_cb {
+	struct commit **commits;
+	size_t commits_nr, commits_alloc;
+
+	struct write_midx_context *ctx;
+};
+
+static const struct object_id *bitmap_oid_access(size_t index,
+						 const void *_entries)
+{
+	const struct pack_midx_entry *entries = _entries;
+	return &entries[index].oid;
+}
+
+static void bitmap_show_commit(struct commit *commit, void *_data)
+{
+	struct bitmap_commit_cb *data = _data;
+	if (oid_pos(&commit->object.oid, data->ctx->entries,
+		    data->ctx->entries_nr,
+		    bitmap_oid_access) > -1) {
+		ALLOC_GROW(data->commits, data->commits_nr + 1,
+			   data->commits_alloc);
+		data->commits[data->commits_nr++] = commit;
+	}
+}
+
+static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr_p,
+						    struct write_midx_context *ctx)
+{
+	struct rev_info revs;
+	struct bitmap_commit_cb cb;
+
+	memset(&cb, 0, sizeof(struct bitmap_commit_cb));
+	cb.ctx = ctx;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+	for_each_ref(add_ref_to_pending, &revs);
+
+	fetch_if_missing = 0;
+	revs.exclude_promisor_objects = 1;
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+
+	traverse_commit_list(&revs, bitmap_show_commit, NULL, &cb);
+	if (indexed_commits_nr_p)
+		*indexed_commits_nr_p = cb.commits_nr;
+
+	return cb.commits;
+}
+
+static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
+			     struct write_midx_context *ctx,
+			     unsigned flags)
+{
+	struct packing_data pdata;
+	struct pack_idx_entry **index;
+	struct commit **commits = NULL;
+	uint32_t i, commits_nr;
+	char *bitmap_name = xstrfmt("%s-%s.bitmap", midx_name, hash_to_hex(midx_hash));
+	int ret;
+
+	prepare_midx_packing_data(&pdata, ctx);
+
+	commits = find_commits_for_midx_bitmap(&commits_nr, ctx);
+
+	/*
+	 * Build the MIDX-order index based on pdata.objects (which is already
+	 * in MIDX order; c.f., 'midx_pack_order_cmp()' for the definition of
+	 * this order).
+	 */
+	ALLOC_ARRAY(index, pdata.nr_objects);
+	for (i = 0; i < pdata.nr_objects; i++)
+		index[i] = (struct pack_idx_entry *)&pdata.objects[i];
+
+	bitmap_writer_show_progress(flags & MIDX_PROGRESS);
+	bitmap_writer_build_type_index(&pdata, index, pdata.nr_objects);
+
+	/*
+	 * bitmap_writer_select_commits expects objects in lex order, but
+	 * pack_order gives us exactly that. use it directly instead of
+	 * re-sorting the array
+	 */
+	for (i = 0; i < pdata.nr_objects; i++)
+		index[ctx->pack_order[i]] = (struct pack_idx_entry *)&pdata.objects[i];
+
+	bitmap_writer_select_commits(commits, commits_nr, -1);
+	ret = bitmap_writer_build(&pdata);
+	if (!ret)
+		goto cleanup;
+
+	bitmap_writer_set_checksum(midx_hash);
+	bitmap_writer_finish(index, pdata.nr_objects, bitmap_name, 0);
+
+cleanup:
+	free(index);
+	free(bitmap_name);
+	return ret;
+}
+
 static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
 			       struct string_list *packs_to_drop,
 			       const char *preferred_pack_name,
@@ -930,9 +1073,16 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		for (i = 0; i < ctx.m->num_packs; i++) {
 			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
 
+			if (prepare_midx_pack(the_repository, ctx.m, i)) {
+				error(_("could not load pack %s"),
+				      ctx.m->pack_names[i]);
+				result = 1;
+				goto cleanup;
+			}
+
 			ctx.info[ctx.nr].orig_pack_int_id = i;
 			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
-			ctx.info[ctx.nr].p = NULL;
+			ctx.info[ctx.nr].p = ctx.m->packs[i];
 			ctx.info[ctx.nr].expired = 0;
 			ctx.nr++;
 		}
@@ -947,8 +1097,26 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
 	stop_progress(&ctx.progress);
 
-	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
-		goto cleanup;
+	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop) {
+		struct bitmap_index *bitmap_git;
+		int bitmap_exists;
+		int want_bitmap = flags & MIDX_WRITE_BITMAP;
+
+		bitmap_git = prepare_bitmap_git(the_repository);
+		bitmap_exists = bitmap_git && bitmap_is_midx(bitmap_git);
+		free_bitmap_index(bitmap_git);
+
+		if (bitmap_exists || !want_bitmap) {
+			/*
+			 * The correct MIDX already exists, and so does a
+			 * corresponding bitmap (or one wasn't requested).
+			 */
+			if (!want_bitmap)
+				clear_midx_files_ext(the_repository, ".bitmap",
+						     NULL);
+			goto cleanup;
+		}
+	}
 
 	ctx.preferred_pack_idx = -1;
 	if (preferred_pack_name) {
@@ -1048,9 +1216,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(get_lock_file_fd(&lk), get_lock_file_path(&lk));
 
-	if (ctx.m)
-		close_midx(ctx.m);
-
 	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
 		result = 1;
@@ -1081,14 +1246,17 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	finalize_hashfile(f, midx_hash, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	free_chunkfile(cf);
 
-	if (flags & MIDX_WRITE_REV_INDEX)
+	if (flags & (MIDX_WRITE_REV_INDEX | MIDX_WRITE_BITMAP))
 		ctx.pack_order = midx_pack_order(&ctx);
 
 	if (flags & MIDX_WRITE_REV_INDEX)
 		write_midx_reverse_index(midx_name, midx_hash, &ctx);
+	if (flags & MIDX_WRITE_BITMAP)
+		write_midx_bitmap(midx_name, midx_hash, &ctx, flags);
 
 	commit_lock_file(&lk);
 
+	clear_midx_files_ext(the_repository, ".bitmap", midx_hash);
 	clear_midx_files_ext(the_repository, ".rev", midx_hash);
 
 cleanup:
@@ -1096,6 +1264,15 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		if (ctx.info[i].p) {
 			close_pack(ctx.info[i].p);
 			free(ctx.info[i].p);
+			if (ctx.m) {
+				/*
+				 * Destroy a stale reference to the pack in
+				 * 'ctx.m'.
+				 */
+				uint32_t orig = ctx.info[i].orig_pack_int_id;
+				if (orig < ctx.m->num_packs)
+					ctx.m->packs[orig] = NULL;
+			}
 		}
 		free(ctx.info[i].pack_name);
 	}
@@ -1105,6 +1282,9 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	free(ctx.pack_perm);
 	free(ctx.pack_order);
 	free(midx_name);
+	if (ctx.m)
+		close_midx(ctx.m);
+
 	return result;
 }
 
@@ -1166,6 +1346,7 @@ void clear_midx_file(struct repository *r)
 	if (remove_path(midx))
 		die(_("failed to clear multi-pack-index at %s"), midx);
 
+	clear_midx_files_ext(r, ".bitmap", NULL);
 	clear_midx_files_ext(r, ".rev", NULL);
 
 	free(midx);
diff --git a/midx.h b/midx.h
index 1172df1a71..350f4d0a7b 100644
--- a/midx.h
+++ b/midx.h
@@ -41,6 +41,7 @@ struct multi_pack_index {
 
 #define MIDX_PROGRESS     (1 << 0)
 #define MIDX_WRITE_REV_INDEX (1 << 1)
+#define MIDX_WRITE_BITMAP (1 << 2)
 
 const unsigned char *get_midx_checksum(struct multi_pack_index *m);
 char *get_midx_filename(const char *object_dir);
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 14/22] t5310: move some tests to lib-bitmap.sh
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (12 preceding siblings ...)
  2021-04-09 18:11 ` [PATCH 13/22] pack-bitmap: write " Taylor Blau
@ 2021-04-09 18:11 ` Taylor Blau
  2021-04-09 18:11 ` [PATCH 15/22] t/helper/test-read-midx.c: add --checksum mode Taylor Blau
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:11 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

We'll soon be adding a test script that will cover many of the same
bitmap concepts as t5310, but for MIDX bitmaps. Let's pull out as many
of the applicable tests as we can so we don't have to rewrite them.

There should be no functional change to t5310; we still run the same
operations in the same order.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/lib-bitmap.sh         | 212 ++++++++++++++++++++++++++++++++++++++++
 t/t5310-pack-bitmaps.sh | 204 +-------------------------------------
 2 files changed, 216 insertions(+), 200 deletions(-)

diff --git a/t/lib-bitmap.sh b/t/lib-bitmap.sh
index fe3f98be24..c655a9bf35 100644
--- a/t/lib-bitmap.sh
+++ b/t/lib-bitmap.sh
@@ -1,3 +1,6 @@
+# Helpers for scripts testing bitamp functionality; see t5310 for
+# example usage.
+
 # Compare a file containing rev-list bitmap traversal output to its non-bitmap
 # counterpart. You can't just use test_cmp for this, because the two produce
 # subtly different output:
@@ -24,3 +27,212 @@ test_bitmap_traversal () {
 	test_cmp "$1.normalized" "$2.normalized" &&
 	rm -f "$1.normalized" "$2.normalized"
 }
+
+# To ensure the logic for "maximal commits" is exercised, make
+# the repository a bit more complicated.
+#
+#    other                         second
+#      *                             *
+# (99 commits)                  (99 commits)
+#      *                             *
+#      |\                           /|
+#      | * octo-other  octo-second * |
+#      |/|\_________  ____________/|\|
+#      | \          \/  __________/  |
+#      |  | ________/\ /             |
+#      *  |/          * merge-right  *
+#      | _|__________/ \____________ |
+#      |/ |                         \|
+# (l1) *  * merge-left               * (r1)
+#      | / \________________________ |
+#      |/                           \|
+# (l2) *                             * (r2)
+#       \___________________________ |
+#                                   \|
+#                                    * (base)
+#
+# We only push bits down the first-parent history, which
+# makes some of these commits unimportant!
+#
+# The important part for the maximal commit algorithm is how
+# the bitmasks are extended. Assuming starting bit positions
+# for second (bit 0) and other (bit 1), the bitmasks at the
+# end should be:
+#
+#      second: 1       (maximal, selected)
+#       other: 01      (maximal, selected)
+#      (base): 11 (maximal)
+#
+# This complicated history was important for a previous
+# version of the walk that guarantees never walking a
+# commit multiple times. That goal might be important
+# again, so preserve this complicated case. For now, this
+# test will guarantee that the bitmaps are computed
+# correctly, even with the repeat calculations.
+setup_bitmap_history() {
+	test_expect_success 'setup repo with moderate-sized history' '
+		test_commit_bulk --id=file 10 &&
+		git branch -M second &&
+		git checkout -b other HEAD~5 &&
+		test_commit_bulk --id=side 10 &&
+
+		# add complicated history setup, including merges and
+		# ambiguous merge-bases
+
+		git checkout -b merge-left other~2 &&
+		git merge second~2 -m "merge-left" &&
+
+		git checkout -b merge-right second~1 &&
+		git merge other~1 -m "merge-right" &&
+
+		git checkout -b octo-second second &&
+		git merge merge-left merge-right -m "octopus-second" &&
+
+		git checkout -b octo-other other &&
+		git merge merge-left merge-right -m "octopus-other" &&
+
+		git checkout other &&
+		git merge octo-other -m "pull octopus" &&
+
+		git checkout second &&
+		git merge octo-second -m "pull octopus" &&
+
+		# Remove these branches so they are not selected
+		# as bitmap tips
+		git branch -D merge-left &&
+		git branch -D merge-right &&
+		git branch -D octo-other &&
+		git branch -D octo-second &&
+
+		# add padding to make these merges less interesting
+		# and avoid having them selected for bitmaps
+		test_commit_bulk --id=file 100 &&
+		git checkout other &&
+		test_commit_bulk --id=side 100 &&
+		git checkout second &&
+
+		bitmaptip=$(git rev-parse second) &&
+		blob=$(echo tagged-blob | git hash-object -w --stdin) &&
+		git tag tagged-blob $blob
+	'
+}
+
+rev_list_tests_head () {
+	test_expect_success "counting commits via bitmap ($state, $branch)" '
+		git rev-list --count $branch >expect &&
+		git rev-list --use-bitmap-index --count $branch >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "counting partial commits via bitmap ($state, $branch)" '
+		git rev-list --count $branch~5..$branch >expect &&
+		git rev-list --use-bitmap-index --count $branch~5..$branch >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "counting commits with limit ($state, $branch)" '
+		git rev-list --count -n 1 $branch >expect &&
+		git rev-list --use-bitmap-index --count -n 1 $branch >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "counting non-linear history ($state, $branch)" '
+		git rev-list --count other...second >expect &&
+		git rev-list --use-bitmap-index --count other...second >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "counting commits with limiting ($state, $branch)" '
+		git rev-list --count $branch -- 1.t >expect &&
+		git rev-list --use-bitmap-index --count $branch -- 1.t >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "counting objects via bitmap ($state, $branch)" '
+		git rev-list --count --objects $branch >expect &&
+		git rev-list --use-bitmap-index --count --objects $branch >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "enumerate commits ($state, $branch)" '
+		git rev-list --use-bitmap-index $branch >actual &&
+		git rev-list $branch >expect &&
+		test_bitmap_traversal --no-confirm-bitmaps expect actual
+	'
+
+	test_expect_success "enumerate --objects ($state, $branch)" '
+		git rev-list --objects --use-bitmap-index $branch >actual &&
+		git rev-list --objects $branch >expect &&
+		test_bitmap_traversal expect actual
+	'
+
+	test_expect_success "bitmap --objects handles non-commit objects ($state, $branch)" '
+		git rev-list --objects --use-bitmap-index $branch tagged-blob >actual &&
+		grep $blob actual
+	'
+}
+
+rev_list_tests () {
+	state=$1
+
+	for branch in "second" "other"
+	do
+		rev_list_tests_head
+	done
+}
+
+basic_bitmap_tests () {
+	tip="$1"
+	test_expect_success 'rev-list --test-bitmap verifies bitmaps' "
+		git rev-list --test-bitmap "${tip:-HEAD}"
+	"
+
+	rev_list_tests 'full bitmap'
+
+	test_expect_success 'clone from bitmapped repository' '
+		rm -fr clone.git &&
+		git clone --no-local --bare . clone.git &&
+		git rev-parse HEAD >expect &&
+		git --git-dir=clone.git rev-parse HEAD >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success 'partial clone from bitmapped repository' '
+		test_config uploadpack.allowfilter true &&
+		rm -fr partial-clone.git &&
+		git clone --no-local --bare --filter=blob:none . partial-clone.git &&
+		(
+			cd partial-clone.git &&
+			pack=$(echo objects/pack/*.pack) &&
+			git verify-pack -v "$pack" >have &&
+			awk "/blob/ { print \$1 }" <have >blobs &&
+			# we expect this single blob because of the direct ref
+			git rev-parse refs/tags/tagged-blob >expect &&
+			test_cmp expect blobs
+		)
+	'
+
+	test_expect_success 'setup further non-bitmapped commits' '
+		test_commit_bulk --id=further 10
+	'
+
+	rev_list_tests 'partial bitmap'
+
+	test_expect_success 'fetch (partial bitmap)' '
+		git --git-dir=clone.git fetch origin second:second &&
+		git rev-parse HEAD >expect &&
+		git --git-dir=clone.git rev-parse HEAD >actual &&
+		test_cmp expect actual
+	'
+}
+
+# have_delta <obj> <expected_base>
+#
+# Note that because this relies on cat-file, it might find _any_ copy of an
+# object in the repository. The caller is responsible for making sure
+# there's only one (e.g., via "repack -ad", or having just fetched a copy).
+have_delta () {
+	echo $2 >expect &&
+	echo $1 | git cat-file --batch-check="%(deltabase)" >actual &&
+	test_cmp expect actual
+}
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index f53efc8229..4318f84d53 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -25,93 +25,10 @@ has_any () {
 	grep -Ff "$1" "$2"
 }
 
-# To ensure the logic for "maximal commits" is exercised, make
-# the repository a bit more complicated.
-#
-#    other                         second
-#      *                             *
-# (99 commits)                  (99 commits)
-#      *                             *
-#      |\                           /|
-#      | * octo-other  octo-second * |
-#      |/|\_________  ____________/|\|
-#      | \          \/  __________/  |
-#      |  | ________/\ /             |
-#      *  |/          * merge-right  *
-#      | _|__________/ \____________ |
-#      |/ |                         \|
-# (l1) *  * merge-left               * (r1)
-#      | / \________________________ |
-#      |/                           \|
-# (l2) *                             * (r2)
-#       \___________________________ |
-#                                   \|
-#                                    * (base)
-#
-# We only push bits down the first-parent history, which
-# makes some of these commits unimportant!
-#
-# The important part for the maximal commit algorithm is how
-# the bitmasks are extended. Assuming starting bit positions
-# for second (bit 0) and other (bit 1), the bitmasks at the
-# end should be:
-#
-#      second: 1       (maximal, selected)
-#       other: 01      (maximal, selected)
-#      (base): 11 (maximal)
-#
-# This complicated history was important for a previous
-# version of the walk that guarantees never walking a
-# commit multiple times. That goal might be important
-# again, so preserve this complicated case. For now, this
-# test will guarantee that the bitmaps are computed
-# correctly, even with the repeat calculations.
+setup_bitmap_history
 
-test_expect_success 'setup repo with moderate-sized history' '
-	test_commit_bulk --id=file 10 &&
-	git branch -M second &&
-	git checkout -b other HEAD~5 &&
-	test_commit_bulk --id=side 10 &&
-
-	# add complicated history setup, including merges and
-	# ambiguous merge-bases
-
-	git checkout -b merge-left other~2 &&
-	git merge second~2 -m "merge-left" &&
-
-	git checkout -b merge-right second~1 &&
-	git merge other~1 -m "merge-right" &&
-
-	git checkout -b octo-second second &&
-	git merge merge-left merge-right -m "octopus-second" &&
-
-	git checkout -b octo-other other &&
-	git merge merge-left merge-right -m "octopus-other" &&
-
-	git checkout other &&
-	git merge octo-other -m "pull octopus" &&
-
-	git checkout second &&
-	git merge octo-second -m "pull octopus" &&
-
-	# Remove these branches so they are not selected
-	# as bitmap tips
-	git branch -D merge-left &&
-	git branch -D merge-right &&
-	git branch -D octo-other &&
-	git branch -D octo-second &&
-
-	# add padding to make these merges less interesting
-	# and avoid having them selected for bitmaps
-	test_commit_bulk --id=file 100 &&
-	git checkout other &&
-	test_commit_bulk --id=side 100 &&
-	git checkout second &&
-
-	bitmaptip=$(git rev-parse second) &&
-	blob=$(echo tagged-blob | git hash-object -w --stdin) &&
-	git tag tagged-blob $blob &&
-	git config repack.writebitmaps true
+test_expect_success 'setup writing bitmaps during repack' '
+	git config repack.writeBitmaps true
 '
 
 test_expect_success 'full repack creates bitmaps' '
@@ -123,109 +40,7 @@ test_expect_success 'full repack creates bitmaps' '
 	grep "\"key\":\"num_maximal_commits\",\"value\":\"107\"" trace
 '
 
-test_expect_success 'rev-list --test-bitmap verifies bitmaps' '
-	git rev-list --test-bitmap HEAD
-'
-
-rev_list_tests_head () {
-	test_expect_success "counting commits via bitmap ($state, $branch)" '
-		git rev-list --count $branch >expect &&
-		git rev-list --use-bitmap-index --count $branch >actual &&
-		test_cmp expect actual
-	'
-
-	test_expect_success "counting partial commits via bitmap ($state, $branch)" '
-		git rev-list --count $branch~5..$branch >expect &&
-		git rev-list --use-bitmap-index --count $branch~5..$branch >actual &&
-		test_cmp expect actual
-	'
-
-	test_expect_success "counting commits with limit ($state, $branch)" '
-		git rev-list --count -n 1 $branch >expect &&
-		git rev-list --use-bitmap-index --count -n 1 $branch >actual &&
-		test_cmp expect actual
-	'
-
-	test_expect_success "counting non-linear history ($state, $branch)" '
-		git rev-list --count other...second >expect &&
-		git rev-list --use-bitmap-index --count other...second >actual &&
-		test_cmp expect actual
-	'
-
-	test_expect_success "counting commits with limiting ($state, $branch)" '
-		git rev-list --count $branch -- 1.t >expect &&
-		git rev-list --use-bitmap-index --count $branch -- 1.t >actual &&
-		test_cmp expect actual
-	'
-
-	test_expect_success "counting objects via bitmap ($state, $branch)" '
-		git rev-list --count --objects $branch >expect &&
-		git rev-list --use-bitmap-index --count --objects $branch >actual &&
-		test_cmp expect actual
-	'
-
-	test_expect_success "enumerate commits ($state, $branch)" '
-		git rev-list --use-bitmap-index $branch >actual &&
-		git rev-list $branch >expect &&
-		test_bitmap_traversal --no-confirm-bitmaps expect actual
-	'
-
-	test_expect_success "enumerate --objects ($state, $branch)" '
-		git rev-list --objects --use-bitmap-index $branch >actual &&
-		git rev-list --objects $branch >expect &&
-		test_bitmap_traversal expect actual
-	'
-
-	test_expect_success "bitmap --objects handles non-commit objects ($state, $branch)" '
-		git rev-list --objects --use-bitmap-index $branch tagged-blob >actual &&
-		grep $blob actual
-	'
-}
-
-rev_list_tests () {
-	state=$1
-
-	for branch in "second" "other"
-	do
-		rev_list_tests_head
-	done
-}
-
-rev_list_tests 'full bitmap'
-
-test_expect_success 'clone from bitmapped repository' '
-	git clone --no-local --bare . clone.git &&
-	git rev-parse HEAD >expect &&
-	git --git-dir=clone.git rev-parse HEAD >actual &&
-	test_cmp expect actual
-'
-
-test_expect_success 'partial clone from bitmapped repository' '
-	test_config uploadpack.allowfilter true &&
-	git clone --no-local --bare --filter=blob:none . partial-clone.git &&
-	(
-		cd partial-clone.git &&
-		pack=$(echo objects/pack/*.pack) &&
-		git verify-pack -v "$pack" >have &&
-		awk "/blob/ { print \$1 }" <have >blobs &&
-		# we expect this single blob because of the direct ref
-		git rev-parse refs/tags/tagged-blob >expect &&
-		test_cmp expect blobs
-	)
-'
-
-test_expect_success 'setup further non-bitmapped commits' '
-	test_commit_bulk --id=further 10
-'
-
-rev_list_tests 'partial bitmap'
-
-test_expect_success 'fetch (partial bitmap)' '
-	git --git-dir=clone.git fetch origin second:second &&
-	git rev-parse HEAD >expect &&
-	git --git-dir=clone.git rev-parse HEAD >actual &&
-	test_cmp expect actual
-'
+basic_bitmap_tests
 
 test_expect_success 'incremental repack fails when bitmaps are requested' '
 	test_commit more-1 &&
@@ -461,17 +276,6 @@ test_expect_success 'truncated bitmap fails gracefully (cache)' '
 	test_i18ngrep corrupted.bitmap.index stderr
 '
 
-# have_delta <obj> <expected_base>
-#
-# Note that because this relies on cat-file, it might find _any_ copy of an
-# object in the repository. The caller is responsible for making sure
-# there's only one (e.g., via "repack -ad", or having just fetched a copy).
-have_delta () {
-	echo $2 >expect &&
-	echo $1 | git cat-file --batch-check="%(deltabase)" >actual &&
-	test_cmp expect actual
-}
-
 # Create a state of history with these properties:
 #
 #  - refs that allow a client to fetch some new history, while sharing some old
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 15/22] t/helper/test-read-midx.c: add --checksum mode
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (13 preceding siblings ...)
  2021-04-09 18:11 ` [PATCH 14/22] t5310: move some tests to lib-bitmap.sh Taylor Blau
@ 2021-04-09 18:11 ` Taylor Blau
  2021-04-09 18:12 ` [PATCH 16/22] t5326: test multi-pack bitmap behavior Taylor Blau
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:11 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Subsequent tests will want to check for the existence of a multi-pack
bitmap which matches the multi-pack-index stored in the pack directory.

The multi-pack bitmap includes the hex checksum of the MIDX it
corresponds to in its filename (for example,
'$packdir/multi-pack-index-<checksum>.bitmap'). As a result, some tests
want a way to learn what '<checksum>' is.

This helper addresses that need by printing the checksum of the
repository's multi-pack-index.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-read-midx.c | 16 +++++++++++++++-
 t/lib-bitmap.sh           |  4 ++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index 7c2eb11a8e..cb0d27049a 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -60,12 +60,26 @@ static int read_midx_file(const char *object_dir, int show_objects)
 	return 0;
 }
 
+static int read_midx_checksum(const char *object_dir)
+{
+	struct multi_pack_index *m;
+
+	setup_git_directory();
+	m = load_multi_pack_index(object_dir, 1);
+	if (!m)
+		return 1;
+	printf("%s\n", hash_to_hex(get_midx_checksum(m)));
+	return 0;
+}
+
 int cmd__read_midx(int argc, const char **argv)
 {
 	if (!(argc == 2 || argc == 3))
-		usage("read-midx [--show-objects] <object-dir>");
+		usage("read-midx [--show-objects|--checksum] <object-dir>");
 
 	if (!strcmp(argv[1], "--show-objects"))
 		return read_midx_file(argv[2], 1);
+	else if (!strcmp(argv[1], "--checksum"))
+		return read_midx_checksum(argv[2]);
 	return read_midx_file(argv[1], 0);
 }
diff --git a/t/lib-bitmap.sh b/t/lib-bitmap.sh
index c655a9bf35..a5ac8b41a7 100644
--- a/t/lib-bitmap.sh
+++ b/t/lib-bitmap.sh
@@ -236,3 +236,7 @@ have_delta () {
 	echo $1 | git cat-file --batch-check="%(deltabase)" >actual &&
 	test_cmp expect actual
 }
+
+midx_checksum () {
+	test-tool read-midx --checksum "${1:-.git/objects}"
+}
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 16/22] t5326: test multi-pack bitmap behavior
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (14 preceding siblings ...)
  2021-04-09 18:11 ` [PATCH 15/22] t/helper/test-read-midx.c: add --checksum mode Taylor Blau
@ 2021-04-09 18:12 ` Taylor Blau
  2021-05-04 17:51   ` Jonathan Tan
  2021-04-09 18:12 ` [PATCH 17/22] t5310: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP Taylor Blau
                   ` (9 subsequent siblings)
  25 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:12 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

This patch introduces a new test, t5326, which tests the basic
functionality of multi-pack bitmaps.

Some trivial behavior is tested, such as:

  - Whether bitmaps can be generated with more than one pack.
  - Whether clones can be served with all objects in the bitmap.
  - Whether follow-up fetches can be served with some objects outside of
    the server's bitmap

These use lib-bitmap's tests (which in turn were pulled from t5310), and
we cover cases where the MIDX represents both a single pack and multiple
packs.

In addition, some non-trivial and MIDX-specific behavior is tested, too,
including:

  - Whether multi-pack bitmaps behave correctly with respect to the
    pack-reuse machinery when the base for some object is selected from
    a different pack than the delta.
  - Whether multi-pack bitmaps correctly respect the
    pack.preferBitmapTips configuration.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t5326-multi-pack-bitmaps.sh | 278 ++++++++++++++++++++++++++++++++++
 1 file changed, 278 insertions(+)
 create mode 100755 t/t5326-multi-pack-bitmaps.sh

diff --git a/t/t5326-multi-pack-bitmaps.sh b/t/t5326-multi-pack-bitmaps.sh
new file mode 100755
index 0000000000..51c4f9ad78
--- /dev/null
+++ b/t/t5326-multi-pack-bitmaps.sh
@@ -0,0 +1,278 @@
+#!/bin/sh
+
+test_description='exercise basic multi-pack bitmap functionality'
+. ./test-lib.sh
+. "${TEST_DIRECTORY}/lib-bitmap.sh"
+
+# We'll be writing our own midx and bitmaps, so avoid getting confused by the
+# automatic ones.
+GIT_TEST_MULTI_PACK_INDEX=0
+GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0
+
+objdir=.git/objects
+midx=$objdir/pack/multi-pack-index
+
+# midx_pack_source <obj>
+midx_pack_source () {
+	test-tool read-midx --show-objects .git/objects | grep "^$1 " | cut -f2
+}
+
+setup_bitmap_history
+
+test_expect_success 'enable core.multiPackIndex' '
+	git config core.multiPackIndex true
+'
+
+test_expect_success 'create single-pack midx with bitmaps' '
+	git repack -ad &&
+	git multi-pack-index write --bitmap &&
+	test_path_is_file $midx &&
+	test_path_is_file $midx-$(midx_checksum $objdir).bitmap
+'
+
+basic_bitmap_tests
+
+test_expect_success 'create new additional packs' '
+	for i in $(test_seq 1 16)
+	do
+		test_commit "$i" &&
+		git repack -d
+	done &&
+
+	git checkout -b other2 HEAD~8 &&
+	for i in $(test_seq 1 8)
+	do
+		test_commit "side-$i" &&
+		git repack -d
+	done &&
+	git checkout second
+'
+
+test_expect_success 'create multi-pack midx with bitmaps' '
+	git multi-pack-index write --bitmap &&
+
+	ls $objdir/pack/pack-*.pack >packs &&
+	test_line_count = 26 packs &&
+
+	test_path_is_file $midx &&
+	test_path_is_file $midx-$(midx_checksum $objdir).bitmap
+'
+
+basic_bitmap_tests
+
+test_expect_success '--no-bitmap is respected when bitmaps exist' '
+	git multi-pack-index write --bitmap &&
+
+	test_commit respect--no-bitmap &&
+	GIT_TEST_MULTI_PACK_INDEX=0 git repack -d &&
+
+	test_path_is_file $midx &&
+	test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+
+	git multi-pack-index write --no-bitmap &&
+
+	test_path_is_file $midx &&
+	test_path_is_missing $midx-$(midx_checksum $objdir).bitmap
+'
+
+test_expect_success 'setup midx with base from later pack' '
+	# Write a and b so that "a" is a delta on top of base "b", since Git
+	# prefers to delete contents out of a base rather than add to a shorter
+	# object.
+	test_seq 1 128 >a &&
+	test_seq 1 130 >b &&
+
+	git add a b &&
+	git commit -m "initial commit" &&
+
+	a=$(git rev-parse HEAD:a) &&
+	b=$(git rev-parse HEAD:b) &&
+
+	# In the first pack, "a" is stored as a delta to "b".
+	p1=$(git pack-objects .git/objects/pack/pack <<-EOF
+	$a
+	$b
+	EOF
+	) &&
+
+	# In the second pack, "a" is missing, and "b" is not a delta nor base to
+	# any other object.
+	p2=$(git pack-objects .git/objects/pack/pack <<-EOF
+	$b
+	$(git rev-parse HEAD)
+	$(git rev-parse HEAD^{tree})
+	EOF
+	) &&
+
+	git prune-packed &&
+	# Use the second pack as the preferred source, so that "b" occurs
+	# earlier in the MIDX object order, rendering "a" unusable for pack
+	# reuse.
+	git multi-pack-index write --bitmap --preferred-pack=pack-$p2.idx &&
+
+	have_delta $a $b &&
+	test $(midx_pack_source $a) != $(midx_pack_source $b)
+'
+
+rev_list_tests 'full bitmap with backwards delta'
+
+test_expect_success 'clone with bitmaps enabled' '
+	git clone --no-local --bare . clone-reverse-delta.git &&
+	test_when_finished "rm -fr clone-reverse-delta.git" &&
+
+	git rev-parse HEAD >expect &&
+	git --git-dir=clone-reverse-delta.git rev-parse HEAD >actual &&
+	test_cmp expect actual
+'
+
+bitmap_reuse_tests() {
+	from=$1
+	to=$2
+
+	test_expect_success "setup pack reuse tests ($from -> $to)" '
+		rm -fr repo &&
+		git init repo &&
+		(
+			cd repo &&
+			test_commit_bulk 16 &&
+			git tag old-tip &&
+
+			git config core.multiPackIndex true &&
+			if test "MIDX" = "$from"
+			then
+				GIT_TEST_MULTI_PACK_INDEX=0 git repack -Ad &&
+				git multi-pack-index write --bitmap
+			else
+				GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb
+			fi
+		)
+	'
+
+	test_expect_success "build bitmap from existing ($from -> $to)" '
+		(
+			cd repo &&
+			test_commit_bulk --id=further 16 &&
+			git tag new-tip &&
+
+			if test "MIDX" = "$to"
+			then
+				GIT_TEST_MULTI_PACK_INDEX=0 git repack -d &&
+				git multi-pack-index write --bitmap
+			else
+				GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb
+			fi
+		)
+	'
+
+	test_expect_success "verify resulting bitmaps ($from -> $to)" '
+		(
+			cd repo &&
+			git for-each-ref &&
+			git rev-list --test-bitmap refs/tags/old-tip &&
+			git rev-list --test-bitmap refs/tags/new-tip
+		)
+	'
+}
+
+bitmap_reuse_tests 'pack' 'MIDX'
+bitmap_reuse_tests 'MIDX' 'pack'
+bitmap_reuse_tests 'MIDX' 'MIDX'
+
+test_expect_success 'missing object closure fails gracefully' '
+	rm -fr repo &&
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit loose &&
+		test_commit packed &&
+
+		# Do not pass "--revs"; we want a pack without the "loose"
+		# commit.
+		git pack-objects $objdir/pack/pack <<-EOF &&
+		$(git rev-parse packed)
+		EOF
+
+		git multi-pack-index write --bitmap 2>err &&
+		grep "doesn.t have full closure" err &&
+		test_path_is_file $midx &&
+		test_path_is_missing $midx-$(midx_checksum $objdir).bitmap
+	)
+'
+
+test_expect_success 'setup partial bitmaps' '
+	test_commit packed &&
+	git repack &&
+	test_commit loose &&
+	git multi-pack-index write --bitmap 2>err &&
+	test_path_is_file $midx &&
+	test_path_is_file $midx-$(midx_checksum $objdir).bitmap
+'
+
+basic_bitmap_tests HEAD~
+
+test_expect_success 'removing a MIDX clears stale bitmaps' '
+	rm -fr repo &&
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+		test_commit base &&
+		git repack &&
+		git multi-pack-index write --bitmap &&
+
+		# Write a MIDX and bitmap; remove the MIDX but leave the bitmap.
+		stale_bitmap=$midx-$(midx_checksum $objdir).bitmap &&
+		rm $midx &&
+
+		# Then write a new MIDX.
+		test_commit new &&
+		git repack &&
+		git multi-pack-index write --bitmap &&
+
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+		test_path_is_missing $stale_bitmap
+	)
+'
+
+test_expect_success 'pack.preferBitmapTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit_bulk --message="%s" 103 &&
+
+		git log --format="%H" >commits.raw &&
+		sort <commits.raw >commits &&
+
+		git log --format="create refs/tags/%s %H" HEAD >refs &&
+		git update-ref --stdin <refs &&
+
+		git multi-pack-index write --bitmap &&
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+
+		test-tool bitmap list-commits | sort >bitmaps &&
+		comm -13 bitmaps commits >before &&
+		test_line_count = 1 before &&
+
+		perl -ne "printf(\"create refs/tags/include/%d \", $.); print" \
+			<before | git update-ref --stdin &&
+
+		rm -fr $midx-$(midx_checksum $objdir).bitmap &&
+		rm -fr $midx-$(midx_checksum $objdir).rev &&
+		rm -fr $midx &&
+
+		git -c pack.preferBitmapTips=refs/tags/include \
+			multi-pack-index write --bitmap &&
+		test-tool bitmap list-commits | sort >bitmaps &&
+		comm -13 bitmaps commits >after &&
+
+		! test_cmp before after
+	)
+'
+
+test_done
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 17/22] t5310: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (15 preceding siblings ...)
  2021-04-09 18:12 ` [PATCH 16/22] t5326: test multi-pack bitmap behavior Taylor Blau
@ 2021-04-09 18:12 ` Taylor Blau
  2021-04-09 18:12 ` [PATCH 18/22] t5319: don't write MIDX bitmaps in t5319 Taylor Blau
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:12 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

From: Jeff King <peff@peff.net>

Generating a MIDX bitmap confuses many of the tests in t5310, which
expect to control whether and how bitmaps are written. Since the
relevant MIDX-bitmap tests here are covered already in t5326, let's just
disable the flag for the whole t5310 script.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t5310-pack-bitmaps.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 4318f84d53..673baa5c3c 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -8,6 +8,10 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
 . "$TEST_DIRECTORY"/lib-bundle.sh
 . "$TEST_DIRECTORY"/lib-bitmap.sh
 
+# t5310 deals only with single-pack bitmaps, so don't write MIDX bitmaps in
+# their place.
+GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0
+
 objpath () {
 	echo ".git/objects/$(echo "$1" | sed -e 's|\(..\)|\1/|')"
 }
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 18/22] t5319: don't write MIDX bitmaps in t5319
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (16 preceding siblings ...)
  2021-04-09 18:12 ` [PATCH 17/22] t5310: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP Taylor Blau
@ 2021-04-09 18:12 ` Taylor Blau
  2021-04-09 18:12 ` [PATCH 19/22] t7700: update to work with MIDX bitmap test knob Taylor Blau
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:12 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

This test is specifically about generating a midx still respecting a
pack-based bitmap file. Generating a MIDX bitmap would confuse the test.
Let's override the 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' variable to
make sure we don't do so.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t5319-multi-pack-index.sh | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 5641d158df..69f1c815aa 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -474,7 +474,8 @@ test_expect_success 'repack preserves multi-pack-index when creating packs' '
 compare_results_with_midx "after repack"
 
 test_expect_success 'multi-pack-index and pack-bitmap' '
-	git -c repack.writeBitmaps=true repack -ad &&
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -c repack.writeBitmaps=true repack -ad &&
 	git multi-pack-index write &&
 	git rev-list --test-bitmap HEAD
 '
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 19/22] t7700: update to work with MIDX bitmap test knob
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (17 preceding siblings ...)
  2021-04-09 18:12 ` [PATCH 18/22] t5319: don't write MIDX bitmaps in t5319 Taylor Blau
@ 2021-04-09 18:12 ` Taylor Blau
  2021-04-09 18:12 ` [PATCH 20/22] midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' Taylor Blau
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:12 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

A number of these tests are focused only on pack-based bitmaps and need
to be updated to disable 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' where
necessary.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t7700-repack.sh | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 25b235c063..98eda3bfeb 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -63,13 +63,14 @@ test_expect_success 'objects in packs marked .keep are not repacked' '
 
 test_expect_success 'writing bitmaps via command-line can duplicate .keep objects' '
 	# build on $oid, $packid, and .keep state from previous
-	git repack -Adbl &&
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 git repack -Adbl &&
 	test_has_duplicate_object true
 '
 
 test_expect_success 'writing bitmaps via config can duplicate .keep objects' '
 	# build on $oid, $packid, and .keep state from previous
-	git -c repack.writebitmaps=true repack -Adl &&
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -c repack.writebitmaps=true repack -Adl &&
 	test_has_duplicate_object true
 '
 
@@ -189,7 +190,9 @@ test_expect_success 'repack --keep-pack' '
 
 test_expect_success 'bitmaps are created by default in bare repos' '
 	git clone --bare .git bare.git &&
-	git -C bare.git repack -ad &&
+	rm -f bare.git/objects/pack/*.bitmap &&
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -C bare.git repack -ad &&
 	bitmap=$(ls bare.git/objects/pack/*.bitmap) &&
 	test_path_is_file "$bitmap"
 '
@@ -200,7 +203,8 @@ test_expect_success 'incremental repack does not complain' '
 '
 
 test_expect_success 'bitmaps can be disabled on bare repos' '
-	git -c repack.writeBitmaps=false -C bare.git repack -ad &&
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -c repack.writeBitmaps=false -C bare.git repack -ad &&
 	bitmap=$(ls bare.git/objects/pack/*.bitmap || :) &&
 	test -z "$bitmap"
 '
@@ -211,7 +215,8 @@ test_expect_success 'no bitmaps created if .keep files present' '
 	keep=${pack%.pack}.keep &&
 	test_when_finished "rm -f \"\$keep\"" &&
 	>"$keep" &&
-	git -C bare.git repack -ad 2>stderr &&
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -C bare.git repack -ad 2>stderr &&
 	test_must_be_empty stderr &&
 	find bare.git/objects/pack/ -type f -name "*.bitmap" >actual &&
 	test_must_be_empty actual
@@ -222,7 +227,8 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	blob=$(test-tool genrandom big $((1024*1024)) |
 	       git -C bare.git hash-object -w --stdin) &&
 	git -C bare.git update-ref refs/tags/big $blob &&
-	git -C bare.git repack -ad 2>stderr &&
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -C bare.git repack -ad 2>stderr &&
 	test_must_be_empty stderr &&
 	find bare.git/objects/pack -type f -name "*.bitmap" >actual &&
 	test_must_be_empty actual
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 20/22] midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP'
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (18 preceding siblings ...)
  2021-04-09 18:12 ` [PATCH 19/22] t7700: update to work with MIDX bitmap test knob Taylor Blau
@ 2021-04-09 18:12 ` Taylor Blau
  2021-04-09 18:12 ` [PATCH 21/22] p5310: extract full and partial bitmap tests Taylor Blau
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:12 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Introduce a new 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' environment
variable to also write a multi-pack bitmap when
'GIT_TEST_MULTI_PACK_INDEX' is set.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c          | 13 ++++++++++---
 ci/run-build-and-tests.sh |  1 +
 midx.h                    |  2 ++
 t/README                  |  4 ++++
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 2847fdfbab..3cb843fb59 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -515,7 +515,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (!(pack_everything & ALL_INTO_ONE) ||
 		    !is_bare_repository())
 			write_bitmaps = 0;
-	}
+	} else if (write_bitmaps &&
+		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0) &&
+		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP, 0))
+		write_bitmaps = 0;
 	if (pack_kept_objects < 0)
 		pack_kept_objects = write_bitmaps > 0;
 
@@ -720,8 +723,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		update_server_info(0);
 	remove_temporary_files();
 
-	if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0))
-		write_midx_file(get_object_directory(), NULL, 0);
+	if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0)) {
+		unsigned flags = 0;
+		if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP, 0))
+			flags |= MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX;
+		write_midx_file(get_object_directory(), NULL, flags);
+	}
 
 	string_list_clear(&names, 0);
 	string_list_clear(&rollback, 0);
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index a66b5e8c75..7c55a5033e 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -22,6 +22,7 @@ linux-gcc)
 	export GIT_TEST_COMMIT_GRAPH=1
 	export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1
 	export GIT_TEST_MULTI_PACK_INDEX=1
+	export GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=1
 	export GIT_TEST_ADD_I_USE_BUILTIN=1
 	export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
 	export GIT_TEST_WRITE_REV_INDEX=1
diff --git a/midx.h b/midx.h
index 350f4d0a7b..aa3da557bb 100644
--- a/midx.h
+++ b/midx.h
@@ -8,6 +8,8 @@ struct pack_entry;
 struct repository;
 
 #define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX"
+#define GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP \
+	"GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP"
 
 struct multi_pack_index {
 	struct multi_pack_index *next;
diff --git a/t/README b/t/README
index fd9375b146..956731da44 100644
--- a/t/README
+++ b/t/README
@@ -420,6 +420,10 @@ GIT_TEST_MULTI_PACK_INDEX=<boolean>, when true, forces the multi-pack-
 index to be written after every 'git repack' command, and overrides the
 'core.multiPackIndex' setting to true.
 
+GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=<boolean>, when true, sets the
+'--bitmap' option on all invocations of 'git multi-pack-index write',
+and ignores pack-objects' '--write-bitmap-index'.
+
 GIT_TEST_SIDEBAND_ALL=<boolean>, when true, overrides the
 'uploadpack.allowSidebandAll' setting to true, and when false, forces
 fetch-pack to not request sideband-all (even if the server advertises
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 21/22] p5310: extract full and partial bitmap tests
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (19 preceding siblings ...)
  2021-04-09 18:12 ` [PATCH 20/22] midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' Taylor Blau
@ 2021-04-09 18:12 ` Taylor Blau
  2021-04-09 18:12 ` [PATCH 22/22] p5326: perf tests for MIDX bitmaps Taylor Blau
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:12 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

A new p5326 introduced by the next patch will want these same tests,
interjecting its own setup in between. Move them out so that both perf
tests can reuse them.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/lib-bitmap.sh         | 69 ++++++++++++++++++++++++++++++++++++
 t/perf/p5310-pack-bitmaps.sh | 65 ++-------------------------------
 2 files changed, 72 insertions(+), 62 deletions(-)
 create mode 100644 t/perf/lib-bitmap.sh

diff --git a/t/perf/lib-bitmap.sh b/t/perf/lib-bitmap.sh
new file mode 100644
index 0000000000..63d3bc7cec
--- /dev/null
+++ b/t/perf/lib-bitmap.sh
@@ -0,0 +1,69 @@
+# Helper functions for testing bitmap performance; see p5310.
+
+test_full_bitmap () {
+	test_perf 'simulated clone' '
+		git pack-objects --stdout --all </dev/null >/dev/null
+	'
+
+	test_perf 'simulated fetch' '
+		have=$(git rev-list HEAD~100 -1) &&
+		{
+			echo HEAD &&
+			echo ^$have
+		} | git pack-objects --revs --stdout >/dev/null
+	'
+
+	test_perf 'pack to file (bitmap)' '
+		git pack-objects --use-bitmap-index --all pack1b </dev/null >/dev/null
+	'
+
+	test_perf 'rev-list (commits)' '
+		git rev-list --all --use-bitmap-index >/dev/null
+	'
+
+	test_perf 'rev-list (objects)' '
+		git rev-list --all --use-bitmap-index --objects >/dev/null
+	'
+
+	test_perf 'rev-list with tag negated via --not --all (objects)' '
+		git rev-list perf-tag --not --all --use-bitmap-index --objects >/dev/null
+	'
+
+	test_perf 'rev-list with negative tag (objects)' '
+		git rev-list HEAD --not perf-tag --use-bitmap-index --objects >/dev/null
+	'
+
+	test_perf 'rev-list count with blob:none' '
+		git rev-list --use-bitmap-index --count --objects --all \
+			--filter=blob:none >/dev/null
+	'
+
+	test_perf 'rev-list count with blob:limit=1k' '
+		git rev-list --use-bitmap-index --count --objects --all \
+			--filter=blob:limit=1k >/dev/null
+	'
+
+	test_perf 'rev-list count with tree:0' '
+		git rev-list --use-bitmap-index --count --objects --all \
+			--filter=tree:0 >/dev/null
+	'
+
+	test_perf 'simulated partial clone' '
+		git pack-objects --stdout --all --filter=blob:none </dev/null >/dev/null
+	'
+}
+
+test_partial_bitmap () {
+	test_perf 'clone (partial bitmap)' '
+		git pack-objects --stdout --all </dev/null >/dev/null
+	'
+
+	test_perf 'pack to file (partial bitmap)' '
+		git pack-objects --use-bitmap-index --all pack2b </dev/null >/dev/null
+	'
+
+	test_perf 'rev-list with tree filter (partial bitmap)' '
+		git rev-list --use-bitmap-index --count --objects --all \
+			--filter=tree:0 >/dev/null
+	'
+}
diff --git a/t/perf/p5310-pack-bitmaps.sh b/t/perf/p5310-pack-bitmaps.sh
index 452be01056..7ad4f237bc 100755
--- a/t/perf/p5310-pack-bitmaps.sh
+++ b/t/perf/p5310-pack-bitmaps.sh
@@ -2,6 +2,7 @@
 
 test_description='Tests pack performance using bitmaps'
 . ./perf-lib.sh
+. "${TEST_DIRECTORY}/perf/lib-bitmap.sh"
 
 test_perf_large_repo
 
@@ -25,56 +26,7 @@ test_perf 'repack to disk' '
 	git repack -ad
 '
 
-test_perf 'simulated clone' '
-	git pack-objects --stdout --all </dev/null >/dev/null
-'
-
-test_perf 'simulated fetch' '
-	have=$(git rev-list HEAD~100 -1) &&
-	{
-		echo HEAD &&
-		echo ^$have
-	} | git pack-objects --revs --stdout >/dev/null
-'
-
-test_perf 'pack to file (bitmap)' '
-	git pack-objects --use-bitmap-index --all pack1b </dev/null >/dev/null
-'
-
-test_perf 'rev-list (commits)' '
-	git rev-list --all --use-bitmap-index >/dev/null
-'
-
-test_perf 'rev-list (objects)' '
-	git rev-list --all --use-bitmap-index --objects >/dev/null
-'
-
-test_perf 'rev-list with tag negated via --not --all (objects)' '
-	git rev-list perf-tag --not --all --use-bitmap-index --objects >/dev/null
-'
-
-test_perf 'rev-list with negative tag (objects)' '
-	git rev-list HEAD --not perf-tag --use-bitmap-index --objects >/dev/null
-'
-
-test_perf 'rev-list count with blob:none' '
-	git rev-list --use-bitmap-index --count --objects --all \
-		--filter=blob:none >/dev/null
-'
-
-test_perf 'rev-list count with blob:limit=1k' '
-	git rev-list --use-bitmap-index --count --objects --all \
-		--filter=blob:limit=1k >/dev/null
-'
-
-test_perf 'rev-list count with tree:0' '
-	git rev-list --use-bitmap-index --count --objects --all \
-		--filter=tree:0 >/dev/null
-'
-
-test_perf 'simulated partial clone' '
-	git pack-objects --stdout --all --filter=blob:none </dev/null >/dev/null
-'
+test_full_bitmap
 
 test_expect_success 'create partial bitmap state' '
 	# pick a commit to represent the repo tip in the past
@@ -97,17 +49,6 @@ test_expect_success 'create partial bitmap state' '
 	git update-ref HEAD $orig_tip
 '
 
-test_perf 'clone (partial bitmap)' '
-	git pack-objects --stdout --all </dev/null >/dev/null
-'
-
-test_perf 'pack to file (partial bitmap)' '
-	git pack-objects --use-bitmap-index --all pack2b </dev/null >/dev/null
-'
-
-test_perf 'rev-list with tree filter (partial bitmap)' '
-	git rev-list --use-bitmap-index --count --objects --all \
-		--filter=tree:0 >/dev/null
-'
+test_partial_bitmap
 
 test_done
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH 22/22] p5326: perf tests for MIDX bitmaps
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (20 preceding siblings ...)
  2021-04-09 18:12 ` [PATCH 21/22] p5310: extract full and partial bitmap tests Taylor Blau
@ 2021-04-09 18:12 ` Taylor Blau
  2021-05-04 18:00   ` Jonathan Tan
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                   ` (3 subsequent siblings)
  25 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-04-09 18:12 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

These new performance tests demonstrate effectively the same behavior as
p5310, but use a multi-pack bitmap instead of a single-pack one.

Notably, p5326 does not create a MIDX bitmap with multiple packs. This
is so we can measure a direct comparison between it and p5310. Any
difference between the two is measuring just the overhead of using MIDX
bitmaps.

Here are the results of p5310 and p5326 together, measured at the same
time and on the same machine (using a Xenon W-2255 CPU):

    Test                                                  HEAD
    ------------------------------------------------------------------------
    5310.2: repack to disk                                96.78(93.39+11.33)
    5310.3: simulated clone                               9.98(9.79+0.19)
    5310.4: simulated fetch                               1.75(4.26+0.19)
    5310.5: pack to file (bitmap)                         28.20(27.87+8.70)
    5310.6: rev-list (commits)                            0.41(0.36+0.05)
    5310.7: rev-list (objects)                            1.61(1.54+0.07)
    5310.8: rev-list count with blob:none                 0.25(0.21+0.04)
    5310.9: rev-list count with blob:limit=1k             2.65(2.54+0.10)
    5310.10: rev-list count with tree:0                   0.23(0.19+0.04)
    5310.11: simulated partial clone                      4.34(4.21+0.12)
    5310.13: clone (partial bitmap)                       11.05(12.21+0.48)
    5310.14: pack to file (partial bitmap)                31.25(34.22+3.70)
    5310.15: rev-list with tree filter (partial bitmap)   0.26(0.22+0.04)

versus the same tests (this time using a multi-pack index):

    Test                                                  HEAD
    ------------------------------------------------------------------------
    5326.2: setup multi-pack index                        78.99(75.29+11.58)
    5326.3: simulated clone                               11.78(11.56+0.22)
    5326.4: simulated fetch                               1.70(4.49+0.13)
    5326.5: pack to file (bitmap)                         28.02(27.72+8.76)
    5326.6: rev-list (commits)                            0.42(0.36+0.06)
    5326.7: rev-list (objects)                            1.65(1.58+0.06)
    5326.8: rev-list count with blob:none                 0.26(0.21+0.05)
    5326.9: rev-list count with blob:limit=1k             2.97(2.86+0.10)
    5326.10: rev-list count with tree:0                   0.25(0.20+0.04)
    5326.11: simulated partial clone                      5.65(5.49+0.16)
    5326.13: clone (partial bitmap)                       12.22(13.43+0.38)
    5326.14: pack to file (partial bitmap)                30.05(31.57+7.25)
    5326.15: rev-list with tree filter (partial bitmap)   0.24(0.20+0.04)

There is slight overhead in "simulated clone", "simulated partial
clone", and "clone (partial bitmap)". Unsurprisingly, that overhead is
due to using the MIDX's reverse index to map between bit positions and
MIDX positions.

This can be reproduced by running "git repack -adb" along with "git
multi-pack-index write --bitmap" in a large-ish repository. Then run:

    $ perf record -o pack.perf git -c core.multiPackIndex=false \
      pack-objects --all --stdout >/dev/null </dev/null
    $ perf record -o midx.perf git -c core.multiPackIndex=true \
      pack-objects --all --stdout >/dev/null </dev/null

and compare the two with "perf diff -c delta -o 1 pack.perf midx.perf".
The most notable results are below (the next largest positive delta is
+0.14%):

    # Event 'cycles'
    #
    # Baseline    Delta  Shared Object       Symbol
    # ........  .......  ..................  ..........................
    #
                 +5.86%  git                 [.] nth_midxed_offset
                 +5.24%  git                 [.] nth_midxed_pack_int_id
         3.45%   +0.97%  git                 [.] offset_to_pack_pos
         3.30%   +0.57%  git                 [.] pack_pos_to_offset
                 +0.30%  git                 [.] pack_pos_to_midx

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/p5326-multi-pack-bitmaps.sh | 43 ++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
 create mode 100755 t/perf/p5326-multi-pack-bitmaps.sh

diff --git a/t/perf/p5326-multi-pack-bitmaps.sh b/t/perf/p5326-multi-pack-bitmaps.sh
new file mode 100755
index 0000000000..5845109ac7
--- /dev/null
+++ b/t/perf/p5326-multi-pack-bitmaps.sh
@@ -0,0 +1,43 @@
+#!/bin/sh
+
+test_description='Tests performance using midx bitmaps'
+. ./perf-lib.sh
+. "${TEST_DIRECTORY}/perf/lib-bitmap.sh"
+
+test_perf_large_repo
+
+test_expect_success 'enable multi-pack index' '
+	git config core.multiPackIndex true
+'
+
+test_perf 'setup multi-pack index' '
+	git repack -ad &&
+	git multi-pack-index write --bitmap
+'
+
+test_full_bitmap
+
+test_expect_success 'create partial bitmap state' '
+	# pick a commit to represent the repo tip in the past
+	cutoff=$(git rev-list HEAD~100 -1) &&
+	orig_tip=$(git rev-parse HEAD) &&
+
+	# now pretend we have just one tip
+	rm -rf .git/logs .git/refs/* .git/packed-refs &&
+	git update-ref HEAD $cutoff &&
+
+	# and then repack, which will leave us with a nice
+	# big bitmap pack of the "old" history, and all of
+	# the new history will be loose, as if it had been pushed
+	# up incrementally and exploded via unpack-objects
+	git repack -Ad &&
+	git multi-pack-index write --bitmap &&
+
+	# and now restore our original tip, as if the pushes
+	# had happened
+	git update-ref HEAD $orig_tip
+'
+
+test_partial_bitmap
+
+test_done
-- 
2.31.1.163.ga65ce7f831

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH 12/22] pack-bitmap: read multi-pack bitmaps
  2021-04-09 18:11 ` [PATCH 12/22] pack-bitmap: read multi-pack bitmaps Taylor Blau
@ 2021-04-16  2:39   ` Jonathan Tan
  2021-04-16  3:13     ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Jonathan Tan @ 2021-04-16  2:39 UTC (permalink / raw)
  To: me; +Cc: git, peff, dstolee, gitster, jonathantanmy

I'll review until this patch for now. Hopefully I'll get to the rest
soon.

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 5205dde2e1..a4e4e4ebcc 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -984,7 +984,17 @@ static void write_reused_pack(struct hashfile *f)
>  				break;
>  
>  			offset += ewah_bit_ctz64(word >> offset);
> -			write_reused_pack_one(pos + offset, f, &w_curs);
> +			if (bitmap_is_midx(bitmap_git)) {
> +				off_t pack_offs = bitmap_pack_offset(bitmap_git,
> +								     pos + offset);
> +				uint32_t pos;
> +
> +				if (offset_to_pack_pos(reuse_packfile, pack_offs, &pos) < 0)
> +					die(_("write_reused_pack: could not locate %"PRIdMAX),
> +					    (intmax_t)pack_offs);
> +				write_reused_pack_one(pos, f, &w_curs);
> +			} else
> +				write_reused_pack_one(pos + offset, f, &w_curs);
>  			display_progress(progress_state, ++written);
>  		}
>  	}

When bitmaps are used, pos + offset is the pseudo-pack (a virtual
concatenation of all packfiles in the MIDX) position (as in, first
object is 0, second object is 1, and so on), not a position in
a single packfile. From it, we obtain a pack offset, and from it, we
obtain a position in the reused packfile (reuse_packfile). In this way,
the code is equivalent to the non-MIDX case. Looks good.

(There is no need to select a packfile here in the case of MIDX because,
as the code later shows, we always reuse only one packfile - assigned to
reuse_packfile.)

> @@ -35,8 +36,15 @@ struct stored_bitmap {
>   * the active bitmap index is the largest one.
>   */
>  struct bitmap_index {
> -	/* Packfile to which this bitmap index belongs to */
> +	/*
> +	 * The pack or multi-pack index (MIDX) that this bitmap index belongs
> +	 * to.
> +	 *
> +	 * Exactly one of these must be non-NULL; this specifies the object
> +	 * order used to interpret this bitmap.
> +	 */
>  	struct packed_git *pack;
> +	struct multi_pack_index *midx;

Makes sense.

> @@ -71,6 +79,8 @@ struct bitmap_index {
>  	/* If not NULL, this is a name-hash cache pointing into map. */
>  	uint32_t *hashes;
>  
> +	const unsigned char *checksum;
> +
>  	/*
>  	 * Extended index.
>  	 *

I see later that this checksum is used, OK. Maybe comment that this
points into map (just like "hashes", as quoted above).

> @@ -281,6 +304,54 @@ static char *pack_bitmap_filename(struct packed_git *p)
>  	return xstrfmt("%.*s.bitmap", (int)len, p->pack_name);
>  }
>  
> +static int open_midx_bitmap_1(struct bitmap_index *bitmap_git,
> +			      struct multi_pack_index *midx)
> +{
> +	struct stat st;
> +	char *idx_name = midx_bitmap_filename(midx);
> +	int fd = git_open(idx_name);
> +
> +	free(idx_name);
> +
> +	if (fd < 0)
> +		return -1;
> +
> +	if (fstat(fd, &st)) {
> +		close(fd);
> +		return -1;
> +	}
> +
> +	if (bitmap_git->pack || bitmap_git->midx) {
> +		/* ignore extra bitmap file; we can only handle one */
> +		return -1;

Here, fd is not closed? Maybe better to have multiple cleanup stages
(one when the mmap has been built, and one when not).

> @@ -302,12 +373,18 @@ static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git
>  		return -1;
>  	}
>  
> -	if (bitmap_git->pack) {
> +	if (bitmap_git->pack || bitmap_git->midx) {
> +		/* ignore extra bitmap file; we can only handle one */
>  		warning("ignoring extra bitmap file: %s", packfile->pack_name);
>  		close(fd);
>  		return -1;
>  	}
>  
> +	if (!is_pack_valid(packfile)) {
> +		close(fd);
> +		return -1;
> +	}

Why is this needed now (and presumably, not before)?

> -static int load_pack_bitmap(struct bitmap_index *bitmap_git)
> +static int load_reverse_index(struct bitmap_index *bitmap_git)
> +{
> +	if (bitmap_is_midx(bitmap_git)) {
> +		uint32_t i;
> +		int ret;
> +
> +		ret = load_midx_revindex(bitmap_git->midx);
> +		if (ret)
> +			return ret;
> +
> +		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
> +			if (prepare_midx_pack(the_repository, bitmap_git->midx, i))
> +				die(_("load_reverse_index: could not open pack"));
> +			ret = load_pack_revindex(bitmap_git->midx->packs[i]);

I was thinking about why we still need per-pack revindex, but I think
the answer is that we still need to convert pack offsets (roughly
speaking, 0 to size of packfile in bytes) to pack positions (0 to number
of objects) (and one such conversion is in the quoted section of
builtin/pack-objects.c above), and MIDX does not provide this. OK, makes
sense.

> +			if (ret)
> +				return ret;
> +		}
> +		return 0;
> +	}
> +	return load_pack_revindex(bitmap_git->pack);
> +}

[snip]

> @@ -428,10 +552,26 @@ static inline int bitmap_position_packfile(struct bitmap_index *bitmap_git,
>  	return pos;
>  }
>  
> +static int bitmap_position_midx(struct bitmap_index *bitmap_git,
> +				const struct object_id *oid)
> +{
> +	uint32_t want, got;
> +	if (!bsearch_midx(oid, bitmap_git->midx, &want))
> +		return -1;
> +
> +	if (midx_to_pack_pos(bitmap_git->midx, want, &got) < 0)
> +		return -1;
> +	return got;
> +}

bsearch_midx() gives us the position in the MIDX (e.g. if we had an
object with the name 00...00, "want" will be 0, and if we had an
object with the name ff...ff, "want" will be the number of objects
minus 1). midx_to_pack_pos() converts that into the position in the
pseudo-pack, which is what we want. OK.

> @@ -730,14 +871,28 @@ static void show_objects_for_type(
>  
>  			offset += ewah_bit_ctz64(word >> offset);
>  
> -			index_pos = pack_pos_to_index(bitmap_git->pack, pos + offset);
> -			ofs = pack_pos_to_offset(bitmap_git->pack, pos + offset);
> -			nth_packed_object_id(&oid, bitmap_git->pack, index_pos);
> +			if (bitmap_is_midx(bitmap_git)) {
> +				struct multi_pack_index *m = bitmap_git->midx;
> +				uint32_t pack_id;
> +
> +				index_pos = pack_pos_to_midx(m, pos + offset);
> +				ofs = nth_midxed_offset(m, index_pos);
> +				nth_midxed_object_oid(&oid, m, index_pos);
> +
> +				pack_id = nth_midxed_pack_int_id(m, index_pos);
> +				pack = bitmap_git->midx->packs[pack_id];

This is similar to the builtin/pack-objects.c case right at the start of
this patch. (bitmap_pack_offset(), used in builtin/pack-objects.c, is
pack_pos_to_midx() and nth_midx_offset() chained.) OK.

> +			} else {
> +				index_pos = pack_pos_to_index(bitmap_git->pack, pos + offset);
> +				ofs = pack_pos_to_offset(bitmap_git->pack, pos + offset);
> +				nth_bitmap_object_oid(bitmap_git, &oid, index_pos);
> +
> +				pack = bitmap_git->pack;
> +			}
>  
>  			if (bitmap_git->hashes)
>  				hash = get_be32(bitmap_git->hashes + index_pos);
>  
> -			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, ofs);
> +			show_reach(&oid, object_type, 0, hash, pack, ofs);
>  		}
>  	}
>  }
> @@ -749,8 +904,13 @@ static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
>  		struct object *object = roots->item;
>  		roots = roots->next;
>  
> -		if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
> -			return 1;
> +		if (bitmap_is_midx(bitmap_git)) {
> +			if (bsearch_midx(&object->oid, bitmap_git->midx, NULL))
> +				return 1;
> +		} else {
> +			if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
> +				return 1;
> +		}
>  	}
>  
>  	return 0;

OK - we don't actually care about the position, just that it exists,
which is why we can pass NULL as the last argument to bsearch_midx().

> @@ -839,14 +999,26 @@ static void filter_bitmap_blob_none(struct bitmap_index *bitmap_git,
>  static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
>  				     uint32_t pos)
>  {
> -	struct packed_git *pack = bitmap_git->pack;
>  	unsigned long size;
>  	struct object_info oi = OBJECT_INFO_INIT;
>  
>  	oi.sizep = &size;
>  
>  	if (pos < bitmap_num_objects(bitmap_git)) {
> -		off_t ofs = pack_pos_to_offset(pack, pos);
> +		struct packed_git *pack;
> +		off_t ofs;
> +
> +		if (bitmap_is_midx(bitmap_git)) {
> +			uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, pos);
> +			uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
> +
> +			pack = bitmap_git->midx->packs[pack_id];
> +			ofs = nth_midxed_offset(bitmap_git->midx, midx_pos);
> +		} else {
> +			pack = bitmap_git->pack;
> +			ofs = pack_pos_to_offset(pack, pos);
> +		}
> +
>  		if (packed_object_info(the_repository, pack, ofs, &oi) < 0) {
>  			struct object_id oid;
>  			nth_bitmap_object_oid(bitmap_git, &oid,

Makes sense - "pos" is the position in the pseudo-pack. From it we get
the MIDX position, and then we can get the pack ID and pack offset as
usual.

> @@ -1081,15 +1253,29 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
>  			      struct bitmap *reuse,
>  			      struct pack_window **w_curs)
>  {
> -	off_t offset, header;
> +	struct packed_git *pack;
> +	off_t offset, delta_obj_offset;
>  	enum object_type type;
>  	unsigned long size;
>  
>  	if (pos >= bitmap_num_objects(bitmap_git))
>  		return; /* not actually in the pack or MIDX */
>  
> -	offset = header = pack_pos_to_offset(bitmap_git->pack, pos);
> -	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
> +	if (bitmap_is_midx(bitmap_git)) {
> +		uint32_t pack_id, midx_pos;
> +
> +		midx_pos = pack_pos_to_midx(bitmap_git->midx, pos);
> +		pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
> +
> +		pack = bitmap_git->midx->packs[pack_id];
> +		offset = nth_midxed_offset(bitmap_git->midx, midx_pos);

Would it be useful to assert somewhere here that "pack" is the preferred
pack?

Going further, is it reasonable to say that positions 0..n in the
preferred pack (where n is the number of objects in the preferred pack)
match positions 0..n in the pseudo-pack exactly? If yes, maybe we can
simplify things by explaining that we can operate in the MIDX case
exactly (or as similarly as possible) like we operate on a single
packfile because of this, instead of always needing to consider if a
delta base could appear in the MIDX as belonging to another packfile.

> @@ -1538,6 +1792,29 @@ static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
>  
>  			offset += ewah_bit_ctz64(word >> offset);
>  			pos = base + offset;
> +
> +			if (bitmap_is_midx(bitmap_git)) {
> +				uint32_t pack_pos;
> +				uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, pos);
> +				uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
> +				off_t offset = nth_midxed_offset(bitmap_git->midx, midx_pos);
> +
> +				pack = bitmap_git->midx->packs[pack_id];
> +
> +				if (offset_to_pack_pos(pack, offset, &pack_pos) < 0) {
> +					struct object_id oid;
> +					nth_midxed_object_oid(&oid, bitmap_git->midx, midx_pos);
> +
> +					die(_("could not find %s in pack #%"PRIu32" at offset %"PRIuMAX),
> +					    oid_to_hex(&oid),
> +					    pack_id,
> +					    (uintmax_t)offset);
> +				}
> +
> +				pos = pack_pos;
> +			} else
> +				pack = bitmap_git->pack;
> +
>  			total += pack_pos_to_offset(pack, pos + 1) -
>  				 pack_pos_to_offset(pack, pos);
>  		}

"pos" is assigned to twice in the MIDX case (with different semantics).
I think it's better to do it like in the rest of the patch - use "base +
offset" as the argument to pack_pos_to_midx, and then you wouldn't need
to assign to "pos" twice.

> diff --git a/packfile.c b/packfile.c
> index 8668345d93..c444e365a3 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -863,7 +863,7 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
>  	if (!strcmp(file_name, "multi-pack-index"))
>  		return;
>  	if (starts_with(file_name, "multi-pack-index") &&
> -	    ends_with(file_name, ".rev"))
> +	    (ends_with(file_name, ".bitmap") || ends_with(file_name, ".rev")))
>  		return;
>  	if (ends_with(file_name, ".idx") ||
>  	    ends_with(file_name, ".rev") ||

I guess this will come into play when we start writing MIDX bitmaps?

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH 02/22] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  2021-04-09 18:10 ` [PATCH 02/22] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps Taylor Blau
@ 2021-04-16  2:46   ` Jonathan Tan
  0 siblings, 0 replies; 273+ messages in thread
From: Jonathan Tan @ 2021-04-16  2:46 UTC (permalink / raw)
  To: me; +Cc: git, peff, dstolee, gitster, jonathantanmy

> @@ -125,15 +125,20 @@ static inline void push_bitmapped_commit(struct commit *commit)
>  	writer.selected_nr++;
>  }
>  
> -static uint32_t find_object_pos(const struct object_id *oid)
> +static uint32_t find_object_pos(const struct object_id *oid, int *found)

find_object_pos() is only called by fill_bitmap_tree() and
fill_bitmap_commit(). fill_bitmap_tree() is only called by itself and
fill_bitmap_commit(). fill_bitmap_commit() is only called by
bitmap_writer_build(). And bitmap_writer_build() is only called by
write_pack_file(), which has been changed to die when
bitmap_writer_build() fails. So looks like everything is accounted for.

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH 12/22] pack-bitmap: read multi-pack bitmaps
  2021-04-16  2:39   ` Jonathan Tan
@ 2021-04-16  3:13     ` Taylor Blau
  0 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-04-16  3:13 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: me, git, peff, dstolee, gitster

On Thu, Apr 15, 2021 at 07:39:25PM -0700, Jonathan Tan wrote:
> I'll review until this patch for now. Hopefully I'll get to the rest
> soon.

Thanks in advance. I always find that you leave insightful comments, so
I appreciate you taking the time to review my patches.

> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index 5205dde2e1..a4e4e4ebcc 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -984,7 +984,17 @@ static void write_reused_pack(struct hashfile *f)
> >  				break;
> >
> >  			offset += ewah_bit_ctz64(word >> offset);
> > -			write_reused_pack_one(pos + offset, f, &w_curs);
> > +			if (bitmap_is_midx(bitmap_git)) {
> > +				off_t pack_offs = bitmap_pack_offset(bitmap_git,
> > +								     pos + offset);
> > +				uint32_t pos;
> > +
> > +				if (offset_to_pack_pos(reuse_packfile, pack_offs, &pos) < 0)
> > +					die(_("write_reused_pack: could not locate %"PRIdMAX),
> > +					    (intmax_t)pack_offs);
> > +				write_reused_pack_one(pos, f, &w_curs);
> > +			} else
> > +				write_reused_pack_one(pos + offset, f, &w_curs);
> >  			display_progress(progress_state, ++written);
> >  		}
> >  	}
>
> When bitmaps are used, pos + offset is the pseudo-pack (a virtual
> concatenation of all packfiles in the MIDX) position (as in, first
> object is 0, second object is 1, and so on), not a position in
> a single packfile. From it, we obtain a pack offset, and from it, we
> obtain a position in the reused packfile (reuse_packfile). In this way,
> the code is equivalent to the non-MIDX case. Looks good.
>
> (There is no need to select a packfile here in the case of MIDX because,
> as the code later shows, we always reuse only one packfile - assigned to
> reuse_packfile.)

You're exactly right here on both points.

It's worth noting that the "reuse" you're describing here is only about
reusing sections of the original packfile byte-for-byte (with the
exception of fixing the offsets in any OFS_DELTAs). That's not to be
confused with delta reuse, which is entirely different.

I think that both Peff and I are dubious that the pack-reuse stuff is
kicking in all that much, since there are some heuristics in place about
when it is allowed to take over and when it isn't, but that's a topic
for another thread.

> > @@ -35,8 +36,15 @@ struct stored_bitmap {
> >   * the active bitmap index is the largest one.
> >   */
> >  struct bitmap_index {
> > -	/* Packfile to which this bitmap index belongs to */
> > +	/*
> > +	 * The pack or multi-pack index (MIDX) that this bitmap index belongs
> > +	 * to.
> > +	 *
> > +	 * Exactly one of these must be non-NULL; this specifies the object
> > +	 * order used to interpret this bitmap.
> > +	 */
> >  	struct packed_git *pack;
> > +	struct multi_pack_index *midx;
>
> Makes sense.
>
> > @@ -71,6 +79,8 @@ struct bitmap_index {
> >  	/* If not NULL, this is a name-hash cache pointing into map. */
> >  	uint32_t *hashes;
> >
> > +	const unsigned char *checksum;
> > +
> >  	/*
> >  	 * Extended index.
> >  	 *
>
> I see later that this checksum is used, OK. Maybe comment that this
> points into map (just like "hashes", as quoted above).

Yep, quite fair.

> > +	if (bitmap_git->pack || bitmap_git->midx) {
> > +		/* ignore extra bitmap file; we can only handle one */
> > +		return -1;
>
> Here, fd is not closed? Maybe better to have multiple cleanup stages
> (one when the mmap has been built, and one when not).

Good eyes. That's an oversight, and we should be closing fd there, too.
It looks like we're also missing a warning(), although I am skeptical
that the warning would ever kick in. The pack-based version of this
function is run in a loop over all packs, but the loop doesn't terminate
once a pack bitmap is opened, since we make sure that no *other* packs
have bitmaps, too.

But we don't do the same for multi-pack bitmaps, i.e., once we find a
MIDX that has a bitmap, we terminate immediately. It may be worth
scanning through the list of all MIDXs to make sure that only one has a
bitmap, but to be honest I could go either way on that point, too, since
any MIDX bitmap is worth loading. But the warning doesn't hurt, so I'll
add that, too.

> > +	if (!is_pack_valid(packfile)) {
> > +		close(fd);
> > +		return -1;
> > +	}
>
> Why is this needed now (and presumably, not before)?

It does appear as a stray hunk, and I'm sure that it probably could be
extracted into its own patch. I can't recall anything about this
particular patch that makes it necessary, but maybe Peff remembers
something I don't.

> > @@ -1081,15 +1253,29 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
> >  			      struct bitmap *reuse,
> >  			      struct pack_window **w_curs)
> >  {
> > -	off_t offset, header;
> > +	struct packed_git *pack;
> > +	off_t offset, delta_obj_offset;
> >  	enum object_type type;
> >  	unsigned long size;
> >
> >  	if (pos >= bitmap_num_objects(bitmap_git))
> >  		return; /* not actually in the pack or MIDX */
> >
> > -	offset = header = pack_pos_to_offset(bitmap_git->pack, pos);
> > -	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
> > +	if (bitmap_is_midx(bitmap_git)) {
> > +		uint32_t pack_id, midx_pos;
> > +
> > +		midx_pos = pack_pos_to_midx(bitmap_git->midx, pos);
> > +		pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
> > +
> > +		pack = bitmap_git->midx->packs[pack_id];
> > +		offset = nth_midxed_offset(bitmap_git->midx, midx_pos);
>
> Would it be useful to assert somewhere here that "pack" is the preferred
> pack?

An assertion like that may hurt this function's cache performance, since
the way we determine the preferred pack is by looking at which pack is
the donor for the 0th object in the MIDX's .rev file. And this function
is rather hot, since it is invoked once per-bit. So it may cause us to
hit more page faults than we currently do.

That all said, the assertion may not be helping much since we only call
this method on objects from a single pack (the bitmapped pack in the
single-pack case, or the preferred pack in the MIDX case). There's a
comment in reuse_partial_packfile_from_bitmap() to this effect, which
may or may not be good enough ;).

> Going further, is it reasonable to say that positions 0..n in the
> preferred pack (where n is the number of objects in the preferred pack)
> match positions 0..n in the pseudo-pack exactly? If yes, maybe we can
> simplify things by explaining that we can operate in the MIDX case
> exactly (or as similarly as possible) like we operate on a single
> packfile because of this, instead of always needing to consider if a
> delta base could appear in the MIDX as belonging to another packfile.

You're right, and there are two things going on here which allow us to
make that assumption:

  - The preferred pack sorts ahead of all other packs in the MIDX when
    assembling the pseudo-pack order, so bits 0..n (where 'n' is the
    number of objects in the preferred pack) of the pseudo pack are
    designated to the preferred pack.

  - When duplicates of objects exist, the MIDX *always* breaks ties in
    favor of the preferred pack, so it's never the case that a delta'd
    object from the preferred pack will find its base in another pack
    (if it asked the MIDX to locate a copy of the base object).

So we can safely remove the conditional on bitmap_is_midx() in the first
part of this function for exactly the reasons above, which is good. That
probably merits moving the comment beginning with "Note that the base
does not need to be repositioned ..." earlier in this function, to make
clear that we really can treat bits from the preferred pack as if they
don't have anything to do with the MIDX at all.

So long as we determine the preferred pack ahead of time (and not once
per-call), I think that it would be a win.

> > @@ -1538,6 +1792,29 @@ static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
> >
> >  			offset += ewah_bit_ctz64(word >> offset);
> >  			pos = base + offset;
> > +
> > +			if (bitmap_is_midx(bitmap_git)) {
> > +				uint32_t pack_pos;
> > +				uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, pos);
> > +				uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
> > +				off_t offset = nth_midxed_offset(bitmap_git->midx, midx_pos);
> > +
> > +				pack = bitmap_git->midx->packs[pack_id];
> > +
> > +				if (offset_to_pack_pos(pack, offset, &pack_pos) < 0) {
> > +					struct object_id oid;
> > +					nth_midxed_object_oid(&oid, bitmap_git->midx, midx_pos);
> > +
> > +					die(_("could not find %s in pack #%"PRIu32" at offset %"PRIuMAX),
> > +					    oid_to_hex(&oid),
> > +					    pack_id,
> > +					    (uintmax_t)offset);
> > +				}
> > +
> > +				pos = pack_pos;
> > +			} else
> > +				pack = bitmap_git->pack;
> > +
> >  			total += pack_pos_to_offset(pack, pos + 1) -
> >  				 pack_pos_to_offset(pack, pos);
> >  		}
>
> "pos" is assigned to twice in the MIDX case (with different semantics).
> I think it's better to do it like in the rest of the patch - use "base +
> offset" as the argument to pack_pos_to_midx, and then you wouldn't need
> to assign to "pos" twice.

Good idea, thanks. Skimming again over the patch, this is the only place
that I could find where I double-assign pos like this.

> > diff --git a/packfile.c b/packfile.c
> > index 8668345d93..c444e365a3 100644
> > --- a/packfile.c
> > +++ b/packfile.c
> > @@ -863,7 +863,7 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
> >  	if (!strcmp(file_name, "multi-pack-index"))
> >  		return;
> >  	if (starts_with(file_name, "multi-pack-index") &&
> > -	    ends_with(file_name, ".rev"))
> > +	    (ends_with(file_name, ".bitmap") || ends_with(file_name, ".rev")))
> >  		return;
> >  	if (ends_with(file_name, ".idx") ||
> >  	    ends_with(file_name, ".rev") ||
>
> I guess this will come into play when we start writing MIDX bitmaps?

Yep, that's right. Since this patch is about making sure we can handle
the MIDX bitmap as described in
Documentation/technical/bitmap-format.txt, this is part of that.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH 13/22] pack-bitmap: write multi-pack bitmaps
  2021-04-09 18:11 ` [PATCH 13/22] pack-bitmap: write " Taylor Blau
@ 2021-05-04  5:02   ` Jonathan Tan
  2021-05-06 20:18     ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Jonathan Tan @ 2021-05-04  5:02 UTC (permalink / raw)
  To: me; +Cc: git, peff, dstolee, gitster, jonathantanmy

> Write multi-pack bitmaps in the format described by
> Documentation/technical/bitmap-format.txt, inferring their presence with
> the absence of '--bitmap'.
> 
> To write a multi-pack bitmap, this patch attempts to reuse as much of
> the existing machinery from pack-objects as possible. Specifically, the
> MIDX code prepares a packing_data struct that pretends as if a single
> packfile has been generated containing all of the objects contained
> within the MIDX.

Sounds good, and makes sense. Conceptually, the MIDX bitmap is the same
as a regular packfile bitmap, just that the order of objects in the
bitmap is defined differently.

> +static void prepare_midx_packing_data(struct packing_data *pdata,
> +				      struct write_midx_context *ctx)
> +{
> +	uint32_t i;
> +
> +	memset(pdata, 0, sizeof(struct packing_data));
> +	prepare_packing_data(the_repository, pdata);
> +
> +	for (i = 0; i < ctx->entries_nr; i++) {
> +		struct pack_midx_entry *from = &ctx->entries[ctx->pack_order[i]];
> +		struct object_entry *to = packlist_alloc(pdata, &from->oid);
> +
> +		oe_set_in_pack(pdata, to,
> +			       ctx->info[ctx->pack_perm[from->pack_int_id]].p);
> +	}
> +}

It is surprising to see this right at the top. Scrolling down, I guess
that there is more information needed than just the packing_data struct.

> +static int add_ref_to_pending(const char *refname,
> +			      const struct object_id *oid,
> +			      int flag, void *cb_data)
> +{
> +	struct rev_info *revs = (struct rev_info*)cb_data;
> +	struct object *object;
> +
> +	if ((flag & REF_ISSYMREF) && (flag & REF_ISBROKEN)) {
> +		warning("symbolic ref is dangling: %s", refname);
> +		return 0;
> +	}
> +
> +	object = parse_object_or_die(oid, refname);
> +	if (object->type != OBJ_COMMIT)
> +		return 0;
> +
> +	add_pending_object(revs, object, "");
> +	if (bitmap_is_preferred_refname(revs->repo, refname))
> +		object->flags |= NEEDS_BITMAP;
> +	return 0;
> +}

Makes sense. We need to flag certain commits as NEEDS_BITMAP because
bitmaps are not made for all commits but only certain ones.

> +struct bitmap_commit_cb {
> +	struct commit **commits;
> +	size_t commits_nr, commits_alloc;
> +
> +	struct write_midx_context *ctx;
> +};
> +
> +static const struct object_id *bitmap_oid_access(size_t index,
> +						 const void *_entries)
> +{
> +	const struct pack_midx_entry *entries = _entries;
> +	return &entries[index].oid;
> +}
> +
> +static void bitmap_show_commit(struct commit *commit, void *_data)
> +{
> +	struct bitmap_commit_cb *data = _data;
> +	if (oid_pos(&commit->object.oid, data->ctx->entries,
> +		    data->ctx->entries_nr,
> +		    bitmap_oid_access) > -1) {
> +		ALLOC_GROW(data->commits, data->commits_nr + 1,
> +			   data->commits_alloc);
> +		data->commits[data->commits_nr++] = commit;
> +	}
> +}
> +
> +static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr_p,
> +						    struct write_midx_context *ctx)
> +{
> +	struct rev_info revs;
> +	struct bitmap_commit_cb cb;
> +
> +	memset(&cb, 0, sizeof(struct bitmap_commit_cb));
> +	cb.ctx = ctx;
> +
> +	repo_init_revisions(the_repository, &revs, NULL);
> +	for_each_ref(add_ref_to_pending, &revs);
> +
> +	fetch_if_missing = 0;
> +	revs.exclude_promisor_objects = 1;

I think that the MIDX bitmap requires all objects be present? If yes, we
should omit these 2 lines.

> +
> +	if (prepare_revision_walk(&revs))
> +		die(_("revision walk setup failed"));
> +
> +	traverse_commit_list(&revs, bitmap_show_commit, NULL, &cb);
> +	if (indexed_commits_nr_p)
> +		*indexed_commits_nr_p = cb.commits_nr;
> +
> +	return cb.commits;
> +}

Hmm...I might be missing something obvious, but this function and its
callbacks seem to be written like this in order to put the returned
commits in a certain order. But later on in write_midx_bitmap(), the
return value of this function is passed to
bitmap_writer_select_commits(), which resorts the list anyway?

> +static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
> +			     struct write_midx_context *ctx,
> +			     unsigned flags)
> +{
> +	struct packing_data pdata;
> +	struct pack_idx_entry **index;
> +	struct commit **commits = NULL;
> +	uint32_t i, commits_nr;
> +	char *bitmap_name = xstrfmt("%s-%s.bitmap", midx_name, hash_to_hex(midx_hash));
> +	int ret;
> +
> +	prepare_midx_packing_data(&pdata, ctx);
> +
> +	commits = find_commits_for_midx_bitmap(&commits_nr, ctx);
> +
> +	/*
> +	 * Build the MIDX-order index based on pdata.objects (which is already
> +	 * in MIDX order; c.f., 'midx_pack_order_cmp()' for the definition of
> +	 * this order).
> +	 */
> +	ALLOC_ARRAY(index, pdata.nr_objects);
> +	for (i = 0; i < pdata.nr_objects; i++)
> +		index[i] = (struct pack_idx_entry *)&pdata.objects[i];
> +
> +	bitmap_writer_show_progress(flags & MIDX_PROGRESS);
> +	bitmap_writer_build_type_index(&pdata, index, pdata.nr_objects);
> +
> +	/*
> +	 * bitmap_writer_select_commits expects objects in lex order, but
> +	 * pack_order gives us exactly that. use it directly instead of
> +	 * re-sorting the array
> +	 */
> +	for (i = 0; i < pdata.nr_objects; i++)
> +		index[ctx->pack_order[i]] = (struct pack_idx_entry *)&pdata.objects[i];
> +
> +	bitmap_writer_select_commits(commits, commits_nr, -1);

The comment above says bitmap_writer_select_commits() expects objects in
lex order, but (1) you're putting "index" in lex order, not "commits",
and (2) the first thing in bitmap_writer_select_commits() is a QSORT.
Did you mean another function?

> +	ret = bitmap_writer_build(&pdata);
> +	if (!ret)
> +		goto cleanup;
> +
> +	bitmap_writer_set_checksum(midx_hash);
> +	bitmap_writer_finish(index, pdata.nr_objects, bitmap_name, 0);

So bitmap_writer_build_type_index() and bitmap_writer_finish() are
called with 2 different orders of commits. Is this expected? If yes,
maybe this is worth a comment.

> +
> +cleanup:
> +	free(index);
> +	free(bitmap_name);
> +	return ret;
> +}

[snip]

> @@ -930,9 +1073,16 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  		for (i = 0; i < ctx.m->num_packs; i++) {
>  			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
>  
> +			if (prepare_midx_pack(the_repository, ctx.m, i)) {
> +				error(_("could not load pack %s"),
> +				      ctx.m->pack_names[i]);
> +				result = 1;
> +				goto cleanup;
> +			}
> +
>  			ctx.info[ctx.nr].orig_pack_int_id = i;
>  			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
> -			ctx.info[ctx.nr].p = NULL;
> +			ctx.info[ctx.nr].p = ctx.m->packs[i];
>  			ctx.info[ctx.nr].expired = 0;
>  			ctx.nr++;
>  		}

Why is this needed now and not before? From what I see in this function,
nothing seems to happen to this .p pack except that they are closed
later.

> @@ -1096,6 +1264,15 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  		if (ctx.info[i].p) {
>  			close_pack(ctx.info[i].p);
>  			free(ctx.info[i].p);
> +			if (ctx.m) {
> +				/*
> +				 * Destroy a stale reference to the pack in
> +				 * 'ctx.m'.
> +				 */
> +				uint32_t orig = ctx.info[i].orig_pack_int_id;
> +				if (orig < ctx.m->num_packs)
> +					ctx.m->packs[orig] = NULL;
> +			}
>  		}
>  		free(ctx.info[i].pack_name);
>  	}

Is this hunk needed? "ctx" is a local variable and will not outlast this
function.

I'll review the rest tomorrow. It seems like I've gotten over the most
difficult patches.

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH 16/22] t5326: test multi-pack bitmap behavior
  2021-04-09 18:12 ` [PATCH 16/22] t5326: test multi-pack bitmap behavior Taylor Blau
@ 2021-05-04 17:51   ` Jonathan Tan
  0 siblings, 0 replies; 273+ messages in thread
From: Jonathan Tan @ 2021-05-04 17:51 UTC (permalink / raw)
  To: me; +Cc: git, peff, dstolee, gitster, jonathantanmy

> +test_expect_success 'clone with bitmaps enabled' '
> +	git clone --no-local --bare . clone-reverse-delta.git &&
> +	test_when_finished "rm -fr clone-reverse-delta.git" &&
> +
> +	git rev-parse HEAD >expect &&
> +	git --git-dir=clone-reverse-delta.git rev-parse HEAD >actual &&
> +	test_cmp expect actual
> +'

What is this test testing? That bitmaps are used? (I'm not sure how to
verify that though - we seem to have tracing for bitmap writing but not
for reading, for example.)

> +bitmap_reuse_tests() {
> +	from=$1
> +	to=$2
> +
> +	test_expect_success "setup pack reuse tests ($from -> $to)" '
> +		rm -fr repo &&
> +		git init repo &&
> +		(
> +			cd repo &&
> +			test_commit_bulk 16 &&
> +			git tag old-tip &&
> +
> +			git config core.multiPackIndex true &&
> +			if test "MIDX" = "$from"
> +			then
> +				GIT_TEST_MULTI_PACK_INDEX=0 git repack -Ad &&
> +				git multi-pack-index write --bitmap
> +			else
> +				GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb
> +			fi
> +		)
> +	'
> +
> +	test_expect_success "build bitmap from existing ($from -> $to)" '
> +		(
> +			cd repo &&
> +			test_commit_bulk --id=further 16 &&
> +			git tag new-tip &&
> +
> +			if test "MIDX" = "$to"
> +			then
> +				GIT_TEST_MULTI_PACK_INDEX=0 git repack -d &&
> +				git multi-pack-index write --bitmap
> +			else
> +				GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb
> +			fi
> +		)
> +	'
> +
> +	test_expect_success "verify resulting bitmaps ($from -> $to)" '
> +		(
> +			cd repo &&
> +			git for-each-ref &&
> +			git rev-list --test-bitmap refs/tags/old-tip &&
> +			git rev-list --test-bitmap refs/tags/new-tip
> +		)
> +	'
> +}
> +
> +bitmap_reuse_tests 'pack' 'MIDX'
> +bitmap_reuse_tests 'MIDX' 'pack'
> +bitmap_reuse_tests 'MIDX' 'MIDX'

Is it possible to verify that the bitmaps have truly been reused (and
not, say, created from scratch)? (E.g. is there any nature of the
bitmap created - for example, the order of commits?)

> +test_expect_success 'pack.preferBitmapTips' '
> +	git init repo &&
> +	test_when_finished "rm -fr repo" &&
> +	(
> +		cd repo &&
> +
> +		test_commit_bulk --message="%s" 103 &&
> +
> +		git log --format="%H" >commits.raw &&
> +		sort <commits.raw >commits &&
> +
> +		git log --format="create refs/tags/%s %H" HEAD >refs &&
> +		git update-ref --stdin <refs &&
> +
> +		git multi-pack-index write --bitmap &&
> +		test_path_is_file $midx &&
> +		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
> +
> +		test-tool bitmap list-commits | sort >bitmaps &&
> +		comm -13 bitmaps commits >before &&
> +		test_line_count = 1 before &&
> +
> +		perl -ne "printf(\"create refs/tags/include/%d \", $.); print" \
> +			<before | git update-ref --stdin &&
> +
> +		rm -fr $midx-$(midx_checksum $objdir).bitmap &&
> +		rm -fr $midx-$(midx_checksum $objdir).rev &&
> +		rm -fr $midx &&
> +
> +		git -c pack.preferBitmapTips=refs/tags/include \
> +			multi-pack-index write --bitmap &&
> +		test-tool bitmap list-commits | sort >bitmaps &&
> +		comm -13 bitmaps commits >after &&
> +
> +		! test_cmp before after
> +	)
> +'

Could we have a more precise comparison of "before" and "after" (besides
the fact that they're different)?

Besides that, all the patches up to this one look good (including patch
14, verified with "--color-moved
--color-moved-ws=allow-indentation-change").

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH 22/22] p5326: perf tests for MIDX bitmaps
  2021-04-09 18:12 ` [PATCH 22/22] p5326: perf tests for MIDX bitmaps Taylor Blau
@ 2021-05-04 18:00   ` Jonathan Tan
  2021-05-05  0:55     ` Junio C Hamano
  0 siblings, 1 reply; 273+ messages in thread
From: Jonathan Tan @ 2021-05-04 18:00 UTC (permalink / raw)
  To: me; +Cc: git, peff, dstolee, gitster, jonathantanmy

> There is slight overhead in "simulated clone", "simulated partial
> clone", and "clone (partial bitmap)". Unsurprisingly, that overhead is
> due to using the MIDX's reverse index to map between bit positions and
> MIDX positions.

Thanks - it's great to see that accessing a MIDX bitmap doesn't add much
overhead (as compared to accessing a single-pack bitmap of the same
size).

All the remaining patches up to and including this one look good.
Overall, I did have some comments here and there, but I am happy with
the overall design.

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH 22/22] p5326: perf tests for MIDX bitmaps
  2021-05-04 18:00   ` Jonathan Tan
@ 2021-05-05  0:55     ` Junio C Hamano
  0 siblings, 0 replies; 273+ messages in thread
From: Junio C Hamano @ 2021-05-05  0:55 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: me, git, peff, dstolee

Jonathan Tan <jonathantanmy@google.com> writes:

>> There is slight overhead in "simulated clone", "simulated partial
>> clone", and "clone (partial bitmap)". Unsurprisingly, that overhead is
>> due to using the MIDX's reverse index to map between bit positions and
>> MIDX positions.
>
> Thanks - it's great to see that accessing a MIDX bitmap doesn't add much
> overhead (as compared to accessing a single-pack bitmap of the same
> size).
>
> All the remaining patches up to and including this one look good.
> Overall, I did have some comments here and there, but I am happy with
> the overall design.

Thanks for a review.

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH 13/22] pack-bitmap: write multi-pack bitmaps
  2021-05-04  5:02   ` Jonathan Tan
@ 2021-05-06 20:18     ` Taylor Blau
  2021-05-06 22:00       ` Jonathan Tan
  0 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-05-06 20:18 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff, dstolee, gitster

On Mon, May 03, 2021 at 10:02:30PM -0700, Jonathan Tan wrote:
> > +static void prepare_midx_packing_data(struct packing_data *pdata,
> > +				      struct write_midx_context *ctx)
> > +{
> > +	uint32_t i;
> > +
> > +	memset(pdata, 0, sizeof(struct packing_data));
> > +	prepare_packing_data(the_repository, pdata);
> > +
> > +	for (i = 0; i < ctx->entries_nr; i++) {
> > +		struct pack_midx_entry *from = &ctx->entries[ctx->pack_order[i]];
> > +		struct object_entry *to = packlist_alloc(pdata, &from->oid);
> > +
> > +		oe_set_in_pack(pdata, to,
> > +			       ctx->info[ctx->pack_perm[from->pack_int_id]].p);
> > +	}
> > +}
>
> It is surprising to see this right at the top. Scrolling down, I guess
> that there is more information needed than just the packing_data struct.

Hmm, which part is surprising to you? This function is setting up the
packing_data structure that I mentioned in the commit message, which
happens in two steps. First, we allocate and call
prepare_packing_data(). And then we call packlist_alloc() for each
object in the MIDX, setting up some information about each object
(like its OID and which physical pack it came from).

But if any of this is unclear, let me know which part and I'd be happy
to add a clarifying comment.

> > +static int add_ref_to_pending(const char *refname,
> > +			      const struct object_id *oid,
> > +			      int flag, void *cb_data)
> > +{
> > +	struct rev_info *revs = (struct rev_info*)cb_data;
> > +	struct object *object;
> > +
> > +	if ((flag & REF_ISSYMREF) && (flag & REF_ISBROKEN)) {
> > +		warning("symbolic ref is dangling: %s", refname);
> > +		return 0;
> > +	}
> > +
> > +	object = parse_object_or_die(oid, refname);
> > +	if (object->type != OBJ_COMMIT)
> > +		return 0;
> > +
> > +	add_pending_object(revs, object, "");
> > +	if (bitmap_is_preferred_refname(revs->repo, refname))
> > +		object->flags |= NEEDS_BITMAP;
> > +	return 0;
> > +}
>
> Makes sense. We need to flag certain commits as NEEDS_BITMAP because
> bitmaps are not made for all commits but only certain ones.

Right, and the NEEDS_BITMAP is a bit of a misnomer. It's true meaning is
more like BITMAPPING_THIS_WOULD_BE_A_GOOD_IDEA, since it roughly
translates to "bitmap this commit before any others in its window". More
details are in bitmap_writer_select_commits(), but in all honesty I find
the implementation there somewhat confusing.

> > +static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr_p,
> > +						    struct write_midx_context *ctx)
> > +{
> > +	struct rev_info revs;
> > +	struct bitmap_commit_cb cb;
> > +
> > +	memset(&cb, 0, sizeof(struct bitmap_commit_cb));
> > +	cb.ctx = ctx;
> > +
> > +	repo_init_revisions(the_repository, &revs, NULL);
> > +	for_each_ref(add_ref_to_pending, &revs);
> > +
> > +	fetch_if_missing = 0;
> > +	revs.exclude_promisor_objects = 1;
>
> I think that the MIDX bitmap requires all objects be present? If yes, we
> should omit these 2 lines.

It does require that all objects are present, but if we fetched any
promisor objects at this stage it would be too late. That's because by
the time we're in this function, all of the packs that are to be
included in the MIDX should already exist on disk.

Skipping promisor objects here is intentional, since it only excludes
them from the list of reachable commits that we want to select from when
computing the selection of MIDX'd commits to receive bitmaps.

But, if one of those promisor objects is reachable from another object
that is included in the bitmap, then we will complain later on that we
couldn't find a reachability closure (and fail appropriately).

That said, I'm not sure any of that is obvious from reading this code,
so I'll add a comment to that effect around these lines.

> > +
> > +	if (prepare_revision_walk(&revs))
> > +		die(_("revision walk setup failed"));
> > +
> > +	traverse_commit_list(&revs, bitmap_show_commit, NULL, &cb);
> > +	if (indexed_commits_nr_p)
> > +		*indexed_commits_nr_p = cb.commits_nr;
> > +
> > +	return cb.commits;
> > +}
>
> Hmm...I might be missing something obvious, but this function and its
> callbacks seem to be written like this in order to put the returned
> commits in a certain order. But later on in write_midx_bitmap(), the
> return value of this function is passed to
> bitmap_writer_select_commits(), which resorts the list anyway?

It isn't intentional, but rather just to build up the list in topo
order. In fact, the order we build it up in isn't quite the same as how
the pack bitmap code generates it (it is in true topo order, at least on
GitHub's servers, as a side effect of using delta islands).

The fact that we resort according to date_compare makes me wonder why
changing that seemed to make such a difference for us. The whole
selection code is a mystery to me.

But no, the order shouldn't matter since we QSORT it later. Any code
here that looks like it's putting it in a certain order has much more to
do with convenience than anything else.

>
> > +static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
> > +			     struct write_midx_context *ctx,
> > +			     unsigned flags)
> > +{
> > +	struct packing_data pdata;
> > +	struct pack_idx_entry **index;
> > +	struct commit **commits = NULL;
> > +	uint32_t i, commits_nr;
> > +	char *bitmap_name = xstrfmt("%s-%s.bitmap", midx_name, hash_to_hex(midx_hash));
> > +	int ret;
> > +
> > +	prepare_midx_packing_data(&pdata, ctx);
> > +
> > +	commits = find_commits_for_midx_bitmap(&commits_nr, ctx);
> > +
> > +	/*
> > +	 * Build the MIDX-order index based on pdata.objects (which is already
> > +	 * in MIDX order; c.f., 'midx_pack_order_cmp()' for the definition of
> > +	 * this order).
> > +	 */
> > +	ALLOC_ARRAY(index, pdata.nr_objects);
> > +	for (i = 0; i < pdata.nr_objects; i++)
> > +		index[i] = (struct pack_idx_entry *)&pdata.objects[i];
> > +
> > +	bitmap_writer_show_progress(flags & MIDX_PROGRESS);
> > +	bitmap_writer_build_type_index(&pdata, index, pdata.nr_objects);
> > +
> > +	/*
> > +	 * bitmap_writer_select_commits expects objects in lex order, but
> > +	 * pack_order gives us exactly that. use it directly instead of
> > +	 * re-sorting the array
> > +	 */
> > +	for (i = 0; i < pdata.nr_objects; i++)
> > +		index[ctx->pack_order[i]] = (struct pack_idx_entry *)&pdata.objects[i];
> > +
> > +	bitmap_writer_select_commits(commits, commits_nr, -1);
>
> The comment above says bitmap_writer_select_commits() expects objects in
> lex order, but (1) you're putting "index" in lex order, not "commits",
> and (2) the first thing in bitmap_writer_select_commits() is a QSORT.
> Did you mean another function?

Ack, I definitely meant bitmap_writer_build(). Thanks for catching.

> > +	ret = bitmap_writer_build(&pdata);
> > +	if (!ret)
> > +		goto cleanup;
> > +
> > +	bitmap_writer_set_checksum(midx_hash);
> > +	bitmap_writer_finish(index, pdata.nr_objects, bitmap_name, 0);
>
> So bitmap_writer_build_type_index() and bitmap_writer_finish() are
> called with 2 different orders of commits. Is this expected? If yes,
> maybe this is worth a comment.

Confusingly so, but yes, these two do expect different orders. You can
see the same re-sorting going on much more subtly in
pack-write.c:write_idx_file(), which is called by
builtin/pack-objects.c:finish_tmp_packfile(), which happens between
bitmap_writer_build_type_index() and bitmap_writer_finish().

Definitely worth adding a comment.

> > @@ -930,9 +1073,16 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
> >  		for (i = 0; i < ctx.m->num_packs; i++) {
> >  			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
> >
> > +			if (prepare_midx_pack(the_repository, ctx.m, i)) {
> > +				error(_("could not load pack %s"),
> > +				      ctx.m->pack_names[i]);
> > +				result = 1;
> > +				goto cleanup;
> > +			}
> > +
> >  			ctx.info[ctx.nr].orig_pack_int_id = i;
> >  			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
> > -			ctx.info[ctx.nr].p = NULL;
> > +			ctx.info[ctx.nr].p = ctx.m->packs[i];
> >  			ctx.info[ctx.nr].expired = 0;
> >  			ctx.nr++;
> >  		}
>
> Why is this needed now and not before? From what I see in this function,
> nothing seems to happen to this .p pack except that they are closed
> later.

These are used by prepare_midx_packing_data().

> > @@ -1096,6 +1264,15 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
> >  		if (ctx.info[i].p) {
> >  			close_pack(ctx.info[i].p);
> >  			free(ctx.info[i].p);
> > +			if (ctx.m) {
> > +				/*
> > +				 * Destroy a stale reference to the pack in
> > +				 * 'ctx.m'.
> > +				 */
> > +				uint32_t orig = ctx.info[i].orig_pack_int_id;
> > +				if (orig < ctx.m->num_packs)
> > +					ctx.m->packs[orig] = NULL;
> > +			}
> >  		}
> >  		free(ctx.info[i].pack_name);
> >  	}
>
> Is this hunk needed? "ctx" is a local variable and will not outlast this
> function.

I can't remember exactly why I added this. I'll play around with it and
either remove it or add a comment why it's necessary before the next
reroll.

> I'll review the rest tomorrow. It seems like I've gotten over the most
> difficult patches.

Thanks, and sorry that this took me a few days to get back to. I
appreciate your review immensely.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH 13/22] pack-bitmap: write multi-pack bitmaps
  2021-05-06 20:18     ` Taylor Blau
@ 2021-05-06 22:00       ` Jonathan Tan
  0 siblings, 0 replies; 273+ messages in thread
From: Jonathan Tan @ 2021-05-06 22:00 UTC (permalink / raw)
  To: me; +Cc: jonathantanmy, git, peff, dstolee, gitster

> On Mon, May 03, 2021 at 10:02:30PM -0700, Jonathan Tan wrote:
> > > +static void prepare_midx_packing_data(struct packing_data *pdata,
> > > +				      struct write_midx_context *ctx)
> > > +{
> > > +	uint32_t i;
> > > +
> > > +	memset(pdata, 0, sizeof(struct packing_data));
> > > +	prepare_packing_data(the_repository, pdata);
> > > +
> > > +	for (i = 0; i < ctx->entries_nr; i++) {
> > > +		struct pack_midx_entry *from = &ctx->entries[ctx->pack_order[i]];
> > > +		struct object_entry *to = packlist_alloc(pdata, &from->oid);
> > > +
> > > +		oe_set_in_pack(pdata, to,
> > > +			       ctx->info[ctx->pack_perm[from->pack_int_id]].p);
> > > +	}
> > > +}
> >
> > It is surprising to see this right at the top. Scrolling down, I guess
> > that there is more information needed than just the packing_data struct.
> 
> Hmm, which part is surprising to you? This function is setting up the
> packing_data structure that I mentioned in the commit message, which
> happens in two steps. First, we allocate and call
> prepare_packing_data(). And then we call packlist_alloc() for each
> object in the MIDX, setting up some information about each object
> (like its OID and which physical pack it came from).
> 
> But if any of this is unclear, let me know which part and I'd be happy
> to add a clarifying comment.

Ah, I think I was unclear. I was thinking that the commit message led me
to believe that all information needed for creating a bitmap lies in the
packing_data struct, so I would have expected several helper functions
followed by a function that actually writes the packing_data struct.
Maybe the commit message could be rewritten to avoid that confusion, but
it's probably not a big deal.

> > > +static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr_p,
> > > +						    struct write_midx_context *ctx)
> > > +{
> > > +	struct rev_info revs;
> > > +	struct bitmap_commit_cb cb;
> > > +
> > > +	memset(&cb, 0, sizeof(struct bitmap_commit_cb));
> > > +	cb.ctx = ctx;
> > > +
> > > +	repo_init_revisions(the_repository, &revs, NULL);
> > > +	for_each_ref(add_ref_to_pending, &revs);
> > > +
> > > +	fetch_if_missing = 0;
> > > +	revs.exclude_promisor_objects = 1;
> >
> > I think that the MIDX bitmap requires all objects be present? If yes, we
> > should omit these 2 lines.
> 
> It does require that all objects are present, but if we fetched any
> promisor objects at this stage it would be too late. That's because by
> the time we're in this function, all of the packs that are to be
> included in the MIDX should already exist on disk.
> 
> Skipping promisor objects here is intentional, since it only excludes
> them from the list of reachable commits that we want to select from when
> computing the selection of MIDX'd commits to receive bitmaps.
> 
> But, if one of those promisor objects is reachable from another object
> that is included in the bitmap, then we will complain later on that we
> couldn't find a reachability closure (and fail appropriately).
> 
> That said, I'm not sure any of that is obvious from reading this code,
> so I'll add a comment to that effect around these lines.

So you're saying that if we have missing promisor commits as in the
following graph:

   A
  / \
 B   C
 |   |
 .   .
 .   .
 .   .

where B is missing but promised, and only C is NEEDS_BITMAP, then the
MIDX bitmap write will still work? (So the rev walk is intended to walk
through A and C but not B, and because we are only building bitmaps for
C and potentially its ancestors, we only need the objects in C's
transitive closure.) Even if this is true, "exclude_promisor_objects" is
the wrong option here, because it will exclude all commits that came
from a promisor remote (regardless of whether it is present locally).
(That's how "promisor object" is defined in partial-clone.txt.) What we
need would be an option that permits missing links.

And even if we go with that option that permits missing links, it still
remains that we have very little support for missing promisor commits in
Git right now.

It might be better to just assume that MIDX will only be used for full
clones. If you want, you can add a NEEDSWORK explaining the above case.

> > > +
> > > +	if (prepare_revision_walk(&revs))
> > > +		die(_("revision walk setup failed"));
> > > +
> > > +	traverse_commit_list(&revs, bitmap_show_commit, NULL, &cb);
> > > +	if (indexed_commits_nr_p)
> > > +		*indexed_commits_nr_p = cb.commits_nr;
> > > +
> > > +	return cb.commits;
> > > +}
> >
> > Hmm...I might be missing something obvious, but this function and its
> > callbacks seem to be written like this in order to put the returned
> > commits in a certain order. But later on in write_midx_bitmap(), the
> > return value of this function is passed to
> > bitmap_writer_select_commits(), which resorts the list anyway?
> 
> It isn't intentional, but rather just to build up the list in topo
> order. In fact, the order we build it up in isn't quite the same as how
> the pack bitmap code generates it (it is in true topo order, at least on
> GitHub's servers, as a side effect of using delta islands).
> 
> The fact that we resort according to date_compare makes me wonder why
> changing that seemed to make such a difference for us. The whole
> selection code is a mystery to me.
> 
> But no, the order shouldn't matter since we QSORT it later. Any code
> here that looks like it's putting it in a certain order has much more to
> do with convenience than anything else.

If the order doesn't matter, why don't you just copy one-by-one from
data->ctx->entries into data->commits? (Unless data->ctx->entries has
extra commits?)

> > > +	ret = bitmap_writer_build(&pdata);
> > > +	if (!ret)
> > > +		goto cleanup;
> > > +
> > > +	bitmap_writer_set_checksum(midx_hash);
> > > +	bitmap_writer_finish(index, pdata.nr_objects, bitmap_name, 0);
> >
> > So bitmap_writer_build_type_index() and bitmap_writer_finish() are
> > called with 2 different orders of commits. Is this expected? If yes,
> > maybe this is worth a comment.
> 
> Confusingly so, but yes, these two do expect different orders. You can
> see the same re-sorting going on much more subtly in
> pack-write.c:write_idx_file(), which is called by
> builtin/pack-objects.c:finish_tmp_packfile(), which happens between
> bitmap_writer_build_type_index() and bitmap_writer_finish().
> 
> Definitely worth adding a comment.

Ah, I see. Thanks for your explanation.

> > > @@ -930,9 +1073,16 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
> > >  		for (i = 0; i < ctx.m->num_packs; i++) {
> > >  			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
> > >
> > > +			if (prepare_midx_pack(the_repository, ctx.m, i)) {
> > > +				error(_("could not load pack %s"),
> > > +				      ctx.m->pack_names[i]);
> > > +				result = 1;
> > > +				goto cleanup;
> > > +			}
> > > +
> > >  			ctx.info[ctx.nr].orig_pack_int_id = i;
> > >  			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
> > > -			ctx.info[ctx.nr].p = NULL;
> > > +			ctx.info[ctx.nr].p = ctx.m->packs[i];
> > >  			ctx.info[ctx.nr].expired = 0;
> > >  			ctx.nr++;
> > >  		}
> >
> > Why is this needed now and not before? From what I see in this function,
> > nothing seems to happen to this .p pack except that they are closed
> > later.
> 
> These are used by prepare_midx_packing_data().

Ah, thanks.

> > > @@ -1096,6 +1264,15 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
> > >  		if (ctx.info[i].p) {
> > >  			close_pack(ctx.info[i].p);
> > >  			free(ctx.info[i].p);
> > > +			if (ctx.m) {
> > > +				/*
> > > +				 * Destroy a stale reference to the pack in
> > > +				 * 'ctx.m'.
> > > +				 */
> > > +				uint32_t orig = ctx.info[i].orig_pack_int_id;
> > > +				if (orig < ctx.m->num_packs)
> > > +					ctx.m->packs[orig] = NULL;
> > > +			}
> > >  		}
> > >  		free(ctx.info[i].pack_name);
> > >  	}
> >
> > Is this hunk needed? "ctx" is a local variable and will not outlast this
> > function.
> 
> I can't remember exactly why I added this. I'll play around with it and
> either remove it or add a comment why it's necessary before the next
> reroll.

OK.

> 
> > I'll review the rest tomorrow. It seems like I've gotten over the most
> > difficult patches.
> 
> Thanks, and sorry that this took me a few days to get back to. I
> appreciate your review immensely.

No worries, and thanks for these patches.

^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 00/24] multi-pack reachability bitmaps
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (21 preceding siblings ...)
  2021-04-09 18:12 ` [PATCH 22/22] p5326: perf tests for MIDX bitmaps Taylor Blau
@ 2021-06-21 22:24 ` Taylor Blau
  2021-06-21 22:24   ` [PATCH v2 01/24] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps Taylor Blau
                     ` (24 more replies)
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
                   ` (2 subsequent siblings)
  25 siblings, 25 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:24 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Here is a reroll of my series to implement multi-pack reachability bitmaps. It
is based on 'master', and incorporates a handful of changes from an earlier
round of review from Jonathan Tan, as well as a handful of tweaks useful to us
at GitHub that we've picked up over a few months of running these patches in
production.

I have been quite behind in sending this to the list because a number of
non-work things that have kept me busy. But those seem to have settled down, so
here is a second reroll.

Notable changes since last time are summarized here (though a complete
range-diff is below):

  - A preferred pack is inferred when not otherwise specified. This fixes a
    nasty bug dependent on readdir() ordering which can cause bitmap corruption.
    See the new ninth patch for the gory details.
  - A bug which broke CI on 'seen' is fixed where t0410.27 would fail when
    writing a bitmap (as is the case when
    GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=1 is set).
  - Comments in unclear portions of the code having to do with promisor objects,
    and object order when fed to bitmap writing routines are added.
  - A number of spots dealing with pack reuse were simplified to avoid using the
    MIDX's .rev file where unnecessary (along with a comment explaining why the
    optimization is possible in the first place).

Thanks in advance for your review, and sorry for the wait.

Jeff King (2):
  t0410: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP
  t5310: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP

Taylor Blau (22):
  pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps
  pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  pack-bitmap-write.c: free existing bitmaps
  Documentation: build 'technical/bitmap-format' by default
  Documentation: describe MIDX-based bitmaps
  midx: make a number of functions non-static
  midx: clear auxiliary .rev after replacing the MIDX
  midx: respect 'core.multiPackIndex' when writing
  midx: infer preferred pack when not given one
  pack-bitmap.c: introduce 'bitmap_num_objects()'
  pack-bitmap.c: introduce 'nth_bitmap_object_oid()'
  pack-bitmap.c: introduce 'bitmap_is_preferred_refname()'
  pack-bitmap: read multi-pack bitmaps
  pack-bitmap: write multi-pack bitmaps
  t5310: move some tests to lib-bitmap.sh
  t/helper/test-read-midx.c: add --checksum mode
  t5326: test multi-pack bitmap behavior
  t5319: don't write MIDX bitmaps in t5319
  t7700: update to work with MIDX bitmap test knob
  midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP'
  p5310: extract full and partial bitmap tests
  p5326: perf tests for MIDX bitmaps

 Documentation/Makefile                       |   1 +
 Documentation/git-multi-pack-index.txt       |  12 +-
 Documentation/technical/bitmap-format.txt    |  72 ++-
 Documentation/technical/multi-pack-index.txt |  10 +-
 builtin/multi-pack-index.c                   |   2 +
 builtin/pack-objects.c                       |   8 +-
 builtin/repack.c                             |  13 +-
 ci/run-build-and-tests.sh                    |   1 +
 midx.c                                       | 288 +++++++++++-
 midx.h                                       |   5 +
 pack-bitmap-write.c                          |  79 +++-
 pack-bitmap.c                                | 470 +++++++++++++++++--
 pack-bitmap.h                                |   8 +-
 packfile.c                                   |   2 +-
 t/README                                     |   4 +
 t/helper/test-read-midx.c                    |  16 +-
 t/lib-bitmap.sh                              | 240 ++++++++++
 t/perf/lib-bitmap.sh                         |  69 +++
 t/perf/p5310-pack-bitmaps.sh                 |  65 +--
 t/perf/p5326-multi-pack-bitmaps.sh           |  43 ++
 t/t0410-partial-clone.sh                     |  12 +-
 t/t5310-pack-bitmaps.sh                      | 231 +--------
 t/t5319-multi-pack-index.sh                  |   3 +-
 t/t5326-multi-pack-bitmaps.sh                | 277 +++++++++++
 t/t7700-repack.sh                            |  18 +-
 25 files changed, 1534 insertions(+), 415 deletions(-)
 create mode 100644 t/perf/lib-bitmap.sh
 create mode 100755 t/perf/p5326-multi-pack-bitmaps.sh
 create mode 100755 t/t5326-multi-pack-bitmaps.sh

Range-diff against v1:
 1:  2d1c6ccab5 =  1:  a18baeb0b4 pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps
 2:  d199954ef2 !  2:  3e637d9ec8 pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
    @@ Commit message
         bitmaps would be incomplete).
     
         Pack bitmaps are never written from 'git repack' unless repacking
    -    all-into-one, and so we never write non-closed bitmaps.
    +    all-into-one, and so we never write non-closed bitmaps (except in the
    +    case of partial clones where we aren't guaranteed to have all objects).
     
         But multi-pack bitmaps change this, since it isn't known whether the
         set of objects in the MIDX is closed under reachability until walking
    @@ Commit message
         include in the bitmap, bitmap_writer_build() knows that the set is not
         closed, and so it now fails gracefully.
     
    -    (The new conditional in builtin/pack-objects.c:bitmap_writer_build()
    -    guards against other failure modes, but is never triggered here, because
    -    of the all-into-one detail above. This return value will be important to
    -    check from the multi-pack index caller.)
    +    A test is added in t0410 to trigger a bitmap write without full
    +    reachability closure by removing local copies of some reachable objects
    +    from a promisor remote.
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    @@ pack-bitmap-write.c: void bitmap_writer_build(struct packing_data *to_pack)
     -	compute_xor_offsets();
     +	if (closed)
     +		compute_xor_offsets();
    -+	return closed;
    ++	return closed ? 0 : -1;
      }
      
      /**
    @@ pack-bitmap.h: struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap
      void bitmap_writer_finish(struct pack_idx_entry **index,
      			  uint32_t index_nr,
      			  const char *filename,
    +
    + ## t/t0410-partial-clone.sh ##
    +@@ t/t0410-partial-clone.sh: test_expect_success 'gc does not repack promisor objects if there are none' '
    + repack_and_check () {
    + 	rm -rf repo2 &&
    + 	cp -r repo repo2 &&
    +-	git -C repo2 repack $1 -d &&
    ++	if test x"$1" = "x--must-fail"
    ++	then
    ++		shift
    ++		test_must_fail git -C repo2 repack $1 -d
    ++	else
    ++		git -C repo2 repack $1 -d
    ++	fi &&
    + 	git -C repo2 fsck &&
    + 
    + 	git -C repo2 cat-file -e $2 &&
    +@@ t/t0410-partial-clone.sh: test_expect_success 'repack -d does not irreversibly delete promisor objects' '
    + 	printf "$THREE\n" | pack_as_from_promisor &&
    + 	delete_object repo "$ONE" &&
    + 
    ++	repack_and_check --must-fail -ab "$TWO" "$THREE" &&
    + 	repack_and_check -a "$TWO" "$THREE" &&
    + 	repack_and_check -A "$TWO" "$THREE" &&
    + 	repack_and_check -l "$TWO" "$THREE"
 3:  014c18b896 =  3:  490d733d12 pack-bitmap-write.c: free existing bitmaps
 4:  46de889cd2 =  4:  b0bb2e8051 Documentation: build 'technical/bitmap-format' by default
 5:  0d4822a64e =  5:  64a260e0c6 Documentation: describe MIDX-based bitmaps
 6:  c76dfc198e =  6:  b3a12424d7 midx: make a number of functions non-static
 7:  26c3a312f9 =  7:  1448ca0d2b midx: clear auxiliary .rev after replacing the MIDX
 8:  8643174a67 =  8:  dfd1daacc5 midx: respect 'core.multiPackIndex' when writing
 -:  ---------- >  9:  9495f6869d midx: infer preferred pack when not given one
 9:  af507f4b29 = 10:  373aa47528 pack-bitmap.c: introduce 'bitmap_num_objects()'
10:  a6fdf7234a = 11:  ac1f46aa1f pack-bitmap.c: introduce 'nth_bitmap_object_oid()'
11:  a78f83a127 = 12:  c474d2eda5 pack-bitmap.c: introduce 'bitmap_is_preferred_refname()'
12:  d5eeca4f11 ! 13:  7d44ba6299 pack-bitmap: read multi-pack bitmaps
    @@ builtin/pack-objects.c: static void write_reused_pack(struct hashfile *f)
      				break;
      
      			offset += ewah_bit_ctz64(word >> offset);
    --			write_reused_pack_one(pos + offset, f, &w_curs);
    -+			if (bitmap_is_midx(bitmap_git)) {
    -+				off_t pack_offs = bitmap_pack_offset(bitmap_git,
    -+								     pos + offset);
    -+				uint32_t pos;
    -+
    -+				if (offset_to_pack_pos(reuse_packfile, pack_offs, &pos) < 0)
    -+					die(_("write_reused_pack: could not locate %"PRIdMAX),
    -+					    (intmax_t)pack_offs);
    -+				write_reused_pack_one(pos, f, &w_curs);
    -+			} else
    -+				write_reused_pack_one(pos + offset, f, &w_curs);
    ++			/*
    ++			 * Can use bit positions directly, even for MIDX
    ++			 * bitmaps. See comment in try_partial_reuse()
    ++			 * for why.
    ++			 */
    + 			write_reused_pack_one(pos + offset, f, &w_curs);
      			display_progress(progress_state, ++written);
      		}
    - 	}
     
      ## pack-bitmap-write.c ##
     @@ pack-bitmap-write.c: void bitmap_writer_show_progress(int show)
    @@ pack-bitmap.c: struct bitmap_index {
      	/* If not NULL, this is a name-hash cache pointing into map. */
      	uint32_t *hashes;
      
    ++	/* The checksum of the packfile or MIDX; points into map. */
     +	const unsigned char *checksum;
     +
      	/*
    @@ pack-bitmap.c: static char *pack_bitmap_filename(struct packed_git *p)
     +
     +	if (bitmap_git->pack || bitmap_git->midx) {
     +		/* ignore extra bitmap file; we can only handle one */
    ++		warning("ignoring extra bitmap file: %s",
    ++			get_midx_filename(midx->object_dir));
    ++		close(fd);
     +		return -1;
     +	}
     +
    @@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
      		goto cleanup;
      
      	object_array_clear(&revs->pending);
    -@@ pack-bitmap.c: static void try_partial_reuse(struct bitmap_index *bitmap_git,
    +@@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
    + }
    + 
    + static void try_partial_reuse(struct bitmap_index *bitmap_git,
    ++			      struct packed_git *pack,
    + 			      size_t pos,
      			      struct bitmap *reuse,
      			      struct pack_window **w_curs)
      {
     -	off_t offset, header;
    -+	struct packed_git *pack;
     +	off_t offset, delta_obj_offset;
      	enum object_type type;
      	unsigned long size;
      
    - 	if (pos >= bitmap_num_objects(bitmap_git))
    - 		return; /* not actually in the pack or MIDX */
    +-	if (pos >= bitmap_num_objects(bitmap_git))
    +-		return; /* not actually in the pack or MIDX */
    ++	/*
    ++	 * try_partial_reuse() is called either on (a) objects in the
    ++	 * bitmapped pack (in the case of a single-pack bitmap) or (b)
    ++	 * objects in the preferred pack of a multi-pack bitmap.
    ++	 * Importantly, the latter can pretend as if only a single pack
    ++	 * exists because:
    ++	 *
    ++	 *   - The first pack->num_objects bits of a MIDX bitmap are
    ++	 *     reserved for the preferred pack, and
    ++	 *
    ++	 *   - Ties due to duplicate objects are always resolved in
    ++	 *     favor of the preferred pack.
    ++	 *
    ++	 * Therefore we do not need to ever ask the MIDX for its copy of
    ++	 * an object by OID, since it will always select it from the
    ++	 * preferred pack. Likewise, the selected copy of the base
    ++	 * object for any deltas will reside in the same pack.
    ++	 *
    ++	 * This means that we can reuse pos when looking up the bit in
    ++	 * the reuse bitmap, too, since bits corresponding to the
    ++	 * preferred pack precede all bits from other packs.
    ++	 */
      
     -	offset = header = pack_pos_to_offset(bitmap_git->pack, pos);
     -	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
    -+	if (bitmap_is_midx(bitmap_git)) {
    -+		uint32_t pack_id, midx_pos;
    ++	if (pos >= pack->num_objects)
    ++		return; /* not actually in the pack or MIDX preferred pack */
     +
    -+		midx_pos = pack_pos_to_midx(bitmap_git->midx, pos);
    -+		pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
    -+
    -+		pack = bitmap_git->midx->packs[pack_id];
    -+		offset = nth_midxed_offset(bitmap_git->midx, midx_pos);
    -+	} else {
    -+		pack = bitmap_git->pack;
    -+		offset = pack_pos_to_offset(bitmap_git->pack, pos);
    -+	}
    -+
    -+	delta_obj_offset = offset;
    ++	offset = delta_obj_offset = pack_pos_to_offset(pack, pos);
     +	type = unpack_object_header(pack, w_curs, &offset, &size);
      	if (type < 0)
      		return; /* broken packfile, punt */
    @@ pack-bitmap.c: static void try_partial_reuse(struct bitmap_index *bitmap_git,
      			return;
      
      		/*
    -@@ pack-bitmap.c: static void try_partial_reuse(struct bitmap_index *bitmap_git,
    - 		 * packs we write fresh, and OFS_DELTA is the default). But
    - 		 * let's double check to make sure the pack wasn't written with
    - 		 * odd parameters.
    -+		 *
    -+		 * Note that the base does not need to be repositioned, i.e.,
    -+		 * the MIDX is guaranteed to have selected the copy of "base"
    -+		 * from the same pack, since this function is only ever called
    -+		 * on the preferred pack (and all duplicate objects are resolved
    -+		 * in favor of the preferred pack).
    -+		 *
    -+		 * This means that we can reuse base_pos when looking up the bit
    -+		 * in the reuse bitmap, too, since bits corresponding to the
    -+		 * preferred pack precede all bits from other packs.
    - 		 */
    - 		if (base_pos >= pos)
    - 			return;
     @@ pack-bitmap.c: static void try_partial_reuse(struct bitmap_index *bitmap_git,
      	bitmap_set(reuse, pos);
      }
    @@ pack-bitmap.c: static void try_partial_reuse(struct bitmap_index *bitmap_git,
      int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
      				       struct packed_git **packfile_out,
      				       uint32_t *entries,
    -@@ pack-bitmap.c: int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
    + 				       struct bitmap **reuse_out)
    + {
    ++	struct packed_git *pack;
    + 	struct bitmap *result = bitmap_git->result;
    + 	struct bitmap *reuse;
    + 	struct pack_window *w_curs = NULL;
      	size_t i = 0;
      	uint32_t offset;
    - 	uint32_t objects_nr = bitmap_num_objects(bitmap_git);
    -+	uint32_t preferred_pack = 0;
    +-	uint32_t objects_nr = bitmap_num_objects(bitmap_git);
    ++	uint32_t objects_nr;
      
      	assert(result);
      
     +	load_reverse_index(bitmap_git);
     +
    -+	if (bitmap_is_midx(bitmap_git)) {
    -+		preferred_pack = midx_preferred_pack(bitmap_git);
    -+		objects_nr = bitmap_git->midx->packs[preferred_pack]->num_objects;
    -+	} else
    -+		objects_nr = bitmap_git->pack->num_objects;
    ++	if (bitmap_is_midx(bitmap_git))
    ++		pack = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
    ++	else
    ++		pack = bitmap_git->pack;
    ++	objects_nr = pack->num_objects;
     +
      	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
      		i++;
    @@ pack-bitmap.c: int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitma
      				break;
      
      			offset += ewah_bit_ctz64(word >> offset);
    +-			try_partial_reuse(bitmap_git, pos + offset, reuse, &w_curs);
     +			if (bitmap_is_midx(bitmap_git)) {
     +				/*
     +				 * Can't reuse from a non-preferred pack (see
    @@ pack-bitmap.c: int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitma
     +				if (pos + offset >= objects_nr)
     +					continue;
     +			}
    - 			try_partial_reuse(bitmap_git, pos + offset, reuse, &w_curs);
    ++			try_partial_reuse(bitmap_git, pack, pos + offset, reuse, &w_curs);
      		}
      	}
    + 
     @@ pack-bitmap.c: int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
      	 * need to be handled separately.
      	 */
      	bitmap_and_not(result, reuse);
     -	*packfile_out = bitmap_git->pack;
    -+	*packfile_out = bitmap_git->pack ?
    -+		bitmap_git->pack :
    -+		bitmap_git->midx->packs[preferred_pack];
    ++	*packfile_out = pack;
      	*reuse_out = reuse;
      	return 0;
      }
    @@ pack-bitmap.c: static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_
      	struct ewah_iterator it;
      	eword_t filter;
     @@ pack-bitmap.c: static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
    + 				break;
      
      			offset += ewah_bit_ctz64(word >> offset);
    - 			pos = base + offset;
    +-			pos = base + offset;
     +
     +			if (bitmap_is_midx(bitmap_git)) {
     +				uint32_t pack_pos;
    -+				uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, pos);
    ++				uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, base + offset);
     +				uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
     +				off_t offset = nth_midxed_offset(bitmap_git->midx, midx_pos);
     +
    @@ pack-bitmap.c: static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_
     +				}
     +
     +				pos = pack_pos;
    -+			} else
    ++			} else {
     +				pack = bitmap_git->pack;
    ++				pos = base + offset;
    ++			}
     +
      			total += pack_pos_to_offset(pack, pos + 1) -
      				 pack_pos_to_offset(pack, pos);
13:  fd320c5ed4 ! 14:  a8cec2463d pack-bitmap: write multi-pack bitmaps
    @@ midx.c: static void write_midx_reverse_index(char *midx_name, unsigned char *mid
     +	repo_init_revisions(the_repository, &revs, NULL);
     +	for_each_ref(add_ref_to_pending, &revs);
     +
    ++	/*
    ++	 * Skipping promisor objects here is intentional, since it only excludes
    ++	 * them from the list of reachable commits that we want to select from
    ++	 * when computing the selection of MIDX'd commits to receive bitmaps.
    ++	 *
    ++	 * Reachability bitmaps do require that their objects be closed under
    ++	 * reachability, but fetching any objects missing from promisors at this
    ++	 * point is too late. But, if one of those objects can be reached from
    ++	 * an another object that is included in the bitmap, then we will
    ++	 * complain later that we don't have reachability closure (and fail
    ++	 * appropriately).
    ++	 */
     +	fetch_if_missing = 0;
     +	revs.exclude_promisor_objects = 1;
     +
    ++	/*
    ++	 * Pass selected commits in topo order to match the behavior of
    ++	 * pack-bitmaps when configured with delta islands.
    ++	 */
    ++	revs.topo_order = 1;
    ++	revs.sort_order = REV_SORT_IN_GRAPH_ORDER;
    ++
     +	if (prepare_revision_walk(&revs))
     +		die(_("revision walk setup failed"));
     +
    @@ midx.c: static void write_midx_reverse_index(char *midx_name, unsigned char *mid
     +	bitmap_writer_build_type_index(&pdata, index, pdata.nr_objects);
     +
     +	/*
    -+	 * bitmap_writer_select_commits expects objects in lex order, but
    -+	 * pack_order gives us exactly that. use it directly instead of
    -+	 * re-sorting the array
    ++	 * bitmap_writer_finish expects objects in lex order, but pack_order
    ++	 * gives us exactly that. use it directly instead of re-sorting the
    ++	 * array.
    ++	 *
    ++	 * This changes the order of objects in 'index' between
    ++	 * bitmap_writer_build_type_index and bitmap_writer_finish.
    ++	 *
    ++	 * The same re-ordering takes place in the single-pack bitmap code via
    ++	 * write_idx_file(), which is called by finish_tmp_packfile(), which
    ++	 * happens between bitmap_writer_build_type_index() and
    ++	 * bitmap_writer_finish().
     +	 */
     +	for (i = 0; i < pdata.nr_objects; i++)
     +		index[ctx->pack_order[i]] = (struct pack_idx_entry *)&pdata.objects[i];
     +
     +	bitmap_writer_select_commits(commits, commits_nr, -1);
     +	ret = bitmap_writer_build(&pdata);
    -+	if (!ret)
    ++	if (ret < 0)
     +		goto cleanup;
     +
     +	bitmap_writer_set_checksum(midx_hash);
    @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
     +		}
     +	}
      
    - 	ctx.preferred_pack_idx = -1;
      	if (preferred_pack_name) {
    + 		int found = 0;
    +@@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
    + 		if (!found)
    + 			warning(_("unknown preferred pack: '%s'"),
    + 				preferred_pack_name);
    +-	} else if (ctx.nr && (flags & MIDX_WRITE_REV_INDEX)) {
    ++	} else if (ctx.nr &&
    ++		   (flags & (MIDX_WRITE_REV_INDEX | MIDX_WRITE_BITMAP))) {
    + 		time_t oldest = ctx.info[0].p->mtime;
    + 		ctx.preferred_pack_idx = 0;
    + 
     @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
      	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
      	f = hashfd(get_lock_file_fd(&lk), get_lock_file_path(&lk));
    @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
      
      	if (flags & MIDX_WRITE_REV_INDEX)
      		write_midx_reverse_index(midx_name, midx_hash, &ctx);
    -+	if (flags & MIDX_WRITE_BITMAP)
    -+		write_midx_bitmap(midx_name, midx_hash, &ctx, flags);
    ++	if (flags & MIDX_WRITE_BITMAP) {
    ++		if (write_midx_bitmap(midx_name, midx_hash, &ctx, flags) < 0) {
    ++			error(_("could not write multi-pack bitmap"));
    ++			result = 1;
    ++			goto cleanup;
    ++		}
    ++	}
      
      	commit_lock_file(&lk);
      
14:  570e3de9ed ! 15:  c63eb637c8 t5310: move some tests to lib-bitmap.sh
    @@ t/lib-bitmap.sh: test_bitmap_traversal () {
     +		git --git-dir=clone.git rev-parse HEAD >actual &&
     +		test_cmp expect actual
     +	'
    ++
    ++	test_expect_success 'enumerating progress counts pack-reused objects' '
    ++		count=$(git rev-list --objects --all --count) &&
    ++		git repack -adb &&
    ++
    ++		# check first with only reused objects; confirm that our
    ++		# progress showed the right number, and also that we did
    ++		# pack-reuse as expected.  Check only the final "done"
    ++		# line of the meter (there may be an arbitrary number of
    ++		# intermediate lines ending with CR).
    ++		GIT_PROGRESS_DELAY=0 \
    ++			git pack-objects --all --stdout --progress \
    ++			</dev/null >/dev/null 2>stderr &&
    ++		grep "Enumerating objects: $count, done" stderr &&
    ++		grep "pack-reused $count" stderr &&
    ++
    ++		# now the same but with one non-reused object
    ++		git commit --allow-empty -m "an extra commit object" &&
    ++		GIT_PROGRESS_DELAY=0 \
    ++			git pack-objects --all --stdout --progress \
    ++			</dev/null >/dev/null 2>stderr &&
    ++		grep "Enumerating objects: $((count+1)), done" stderr &&
    ++		grep "pack-reused $count" stderr
    ++	'
     +}
     +
     +# have_delta <obj> <expected_base>
    @@ t/t5310-pack-bitmaps.sh: test_expect_success 'truncated bitmap fails gracefully
      	test_i18ngrep corrupted.bitmap.index stderr
      '
      
    +-test_expect_success 'enumerating progress counts pack-reused objects' '
    +-	count=$(git rev-list --objects --all --count) &&
    +-	git repack -adb &&
    +-
    +-	# check first with only reused objects; confirm that our progress
    +-	# showed the right number, and also that we did pack-reuse as expected.
    +-	# Check only the final "done" line of the meter (there may be an
    +-	# arbitrary number of intermediate lines ending with CR).
    +-	GIT_PROGRESS_DELAY=0 \
    +-		git pack-objects --all --stdout --progress \
    +-		</dev/null >/dev/null 2>stderr &&
    +-	grep "Enumerating objects: $count, done" stderr &&
    +-	grep "pack-reused $count" stderr &&
    +-
    +-	# now the same but with one non-reused object
    +-	git commit --allow-empty -m "an extra commit object" &&
    +-	GIT_PROGRESS_DELAY=0 \
    +-		git pack-objects --all --stdout --progress \
    +-		</dev/null >/dev/null 2>stderr &&
    +-	grep "Enumerating objects: $((count+1)), done" stderr &&
    +-	grep "pack-reused $count" stderr
    +-'
    +-
     -# have_delta <obj> <expected_base>
     -#
     -# Note that because this relies on cat-file, it might find _any_ copy of an
15:  060ee427be = 16:  bedb7afb37 t/helper/test-read-midx.c: add --checksum mode
16:  ff74181e85 ! 17:  fbfac4ae8e t5326: test multi-pack bitmap behavior
    @@ t/t5326-multi-pack-bitmaps.sh (new)
     +	git multi-pack-index write --bitmap &&
     +
     +	ls $objdir/pack/pack-*.pack >packs &&
    -+	test_line_count = 26 packs &&
    ++	test_line_count = 25 packs &&
     +
     +	test_path_is_file $midx &&
     +	test_path_is_file $midx-$(midx_checksum $objdir).bitmap
    @@ t/t5326-multi-pack-bitmaps.sh (new)
     +		$(git rev-parse packed)
     +		EOF
     +
    -+		git multi-pack-index write --bitmap 2>err &&
    ++		test_must_fail git multi-pack-index write --bitmap 2>err &&
     +		grep "doesn.t have full closure" err &&
    -+		test_path_is_file $midx &&
    -+		test_path_is_missing $midx-$(midx_checksum $objdir).bitmap
    ++		test_path_is_missing $midx
     +	)
     +'
     +
 -:  ---------- > 18:  2a5df1832a t0410: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP
17:  8f328bb5bc = 19:  2d24c5b7ad t5310: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP
18:  dbd953815b = 20:  4cbfaa0e97 t5319: don't write MIDX bitmaps in t5319
19:  ee952e4300 = 21:  839a7a79eb t7700: update to work with MIDX bitmap test knob
20:  83614f9284 = 22:  00418d5b09 midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP'
21:  2f3836bf2e = 23:  98fa73a76a p5310: extract full and partial bitmap tests
22:  b75b534446 = 24:  ec0f53b424 p5326: perf tests for MIDX bitmaps
-- 
2.31.1.163.ga65ce7f831

^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 01/24] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
@ 2021-06-21 22:24   ` Taylor Blau
  2021-06-24 23:02     ` Ævar Arnfjörð Bjarmason
  2021-07-21  9:45     ` Jeff King
  2021-06-21 22:25   ` [PATCH v2 02/24] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps Taylor Blau
                     ` (23 subsequent siblings)
  24 siblings, 2 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:24 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

The special `--test-bitmap` mode of `git rev-list` is used to compare
the result of an object traversal with a bitmap to check its integrity.
This mode does not, however, assert that the types of reachable objects
are stored correctly.

Harden this mode by teaching it to also check that each time an object's
bit is marked, the corresponding bit should be set in exactly one of the
type bitmaps (whose type matches the object's true type).

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index d90e1d9d8c..368fa59a42 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1301,10 +1301,52 @@ void count_bitmap_commit_list(struct bitmap_index *bitmap_git,
 struct bitmap_test_data {
 	struct bitmap_index *bitmap_git;
 	struct bitmap *base;
+	struct bitmap *commits;
+	struct bitmap *trees;
+	struct bitmap *blobs;
+	struct bitmap *tags;
 	struct progress *prg;
 	size_t seen;
 };
 
+static void test_bitmap_type(struct bitmap_test_data *tdata,
+			     struct object *obj, int pos)
+{
+	enum object_type bitmap_type = OBJ_NONE;
+	int bitmaps_nr = 0;
+
+	if (bitmap_get(tdata->commits, pos)) {
+		bitmap_type = OBJ_COMMIT;
+		bitmaps_nr++;
+	}
+	if (bitmap_get(tdata->trees, pos)) {
+		bitmap_type = OBJ_TREE;
+		bitmaps_nr++;
+	}
+	if (bitmap_get(tdata->blobs, pos)) {
+		bitmap_type = OBJ_BLOB;
+		bitmaps_nr++;
+	}
+	if (bitmap_get(tdata->tags, pos)) {
+		bitmap_type = OBJ_TAG;
+		bitmaps_nr++;
+	}
+
+	if (!bitmap_type)
+		die("object %s not found in type bitmaps",
+		    oid_to_hex(&obj->oid));
+
+	if (bitmaps_nr > 1)
+		die("object %s does not have a unique type",
+		    oid_to_hex(&obj->oid));
+
+	if (bitmap_type != obj->type)
+		die("object %s: real type %s, expected: %s",
+		    oid_to_hex(&obj->oid),
+		    type_name(obj->type),
+		    type_name(bitmap_type));
+}
+
 static void test_show_object(struct object *object, const char *name,
 			     void *data)
 {
@@ -1314,6 +1356,7 @@ static void test_show_object(struct object *object, const char *name,
 	bitmap_pos = bitmap_position(tdata->bitmap_git, &object->oid);
 	if (bitmap_pos < 0)
 		die("Object not in bitmap: %s\n", oid_to_hex(&object->oid));
+	test_bitmap_type(tdata, object, bitmap_pos);
 
 	bitmap_set(tdata->base, bitmap_pos);
 	display_progress(tdata->prg, ++tdata->seen);
@@ -1328,6 +1371,7 @@ static void test_show_commit(struct commit *commit, void *data)
 				     &commit->object.oid);
 	if (bitmap_pos < 0)
 		die("Object not in bitmap: %s\n", oid_to_hex(&commit->object.oid));
+	test_bitmap_type(tdata, &commit->object, bitmap_pos);
 
 	bitmap_set(tdata->base, bitmap_pos);
 	display_progress(tdata->prg, ++tdata->seen);
@@ -1375,6 +1419,10 @@ void test_bitmap_walk(struct rev_info *revs)
 
 	tdata.bitmap_git = bitmap_git;
 	tdata.base = bitmap_new();
+	tdata.commits = ewah_to_bitmap(bitmap_git->commits);
+	tdata.trees = ewah_to_bitmap(bitmap_git->trees);
+	tdata.blobs = ewah_to_bitmap(bitmap_git->blobs);
+	tdata.tags = ewah_to_bitmap(bitmap_git->tags);
 	tdata.prg = start_progress("Verifying bitmap entries", result_popcnt);
 	tdata.seen = 0;
 
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 02/24] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
  2021-06-21 22:24   ` [PATCH v2 01/24] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-24 23:23     ` Ævar Arnfjörð Bjarmason
  2021-07-21  9:50     ` Jeff King
  2021-06-21 22:25   ` [PATCH v2 03/24] pack-bitmap-write.c: free existing bitmaps Taylor Blau
                     ` (22 subsequent siblings)
  24 siblings, 2 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

The set of objects covered by a bitmap must be closed under
reachability, since it must be the case that there is a valid bit
position assigned for every possible reachable object (otherwise the
bitmaps would be incomplete).

Pack bitmaps are never written from 'git repack' unless repacking
all-into-one, and so we never write non-closed bitmaps (except in the
case of partial clones where we aren't guaranteed to have all objects).

But multi-pack bitmaps change this, since it isn't known whether the
set of objects in the MIDX is closed under reachability until walking
them. Plumb through a bit that is set when a reachable object isn't
found.

As soon as a reachable object isn't found in the set of objects to
include in the bitmap, bitmap_writer_build() knows that the set is not
closed, and so it now fails gracefully.

A test is added in t0410 to trigger a bitmap write without full
reachability closure by removing local copies of some reachable objects
from a promisor remote.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c   |  3 +-
 pack-bitmap-write.c      | 76 ++++++++++++++++++++++++++++------------
 pack-bitmap.h            |  2 +-
 t/t0410-partial-clone.sh |  9 ++++-
 4 files changed, 64 insertions(+), 26 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index de00adbb9e..8a523624a1 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1256,7 +1256,8 @@ static void write_pack_file(void)
 
 				bitmap_writer_show_progress(progress);
 				bitmap_writer_select_commits(indexed_commits, indexed_commits_nr, -1);
-				bitmap_writer_build(&to_pack);
+				if (bitmap_writer_build(&to_pack) < 0)
+					die(_("failed to write bitmap index"));
 				bitmap_writer_finish(written_list, nr_written,
 						     tmpname.buf, write_bitmap_options);
 				write_bitmap_index = 0;
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index 88d9e696a5..d374f7884b 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -125,15 +125,20 @@ static inline void push_bitmapped_commit(struct commit *commit)
 	writer.selected_nr++;
 }
 
-static uint32_t find_object_pos(const struct object_id *oid)
+static uint32_t find_object_pos(const struct object_id *oid, int *found)
 {
 	struct object_entry *entry = packlist_find(writer.to_pack, oid);
 
 	if (!entry) {
-		die("Failed to write bitmap index. Packfile doesn't have full closure "
+		if (found)
+			*found = 0;
+		warning("Failed to write bitmap index. Packfile doesn't have full closure "
 			"(object %s is missing)", oid_to_hex(oid));
+		return 0;
 	}
 
+	if (found)
+		*found = 1;
 	return oe_in_pack_pos(writer.to_pack, entry);
 }
 
@@ -331,9 +336,10 @@ static void bitmap_builder_clear(struct bitmap_builder *bb)
 	bb->commits_nr = bb->commits_alloc = 0;
 }
 
-static void fill_bitmap_tree(struct bitmap *bitmap,
-			     struct tree *tree)
+static int fill_bitmap_tree(struct bitmap *bitmap,
+			    struct tree *tree)
 {
+	int found;
 	uint32_t pos;
 	struct tree_desc desc;
 	struct name_entry entry;
@@ -342,9 +348,11 @@ static void fill_bitmap_tree(struct bitmap *bitmap,
 	 * If our bit is already set, then there is nothing to do. Both this
 	 * tree and all of its children will be set.
 	 */
-	pos = find_object_pos(&tree->object.oid);
+	pos = find_object_pos(&tree->object.oid, &found);
+	if (!found)
+		return -1;
 	if (bitmap_get(bitmap, pos))
-		return;
+		return 0;
 	bitmap_set(bitmap, pos);
 
 	if (parse_tree(tree) < 0)
@@ -355,11 +363,15 @@ static void fill_bitmap_tree(struct bitmap *bitmap,
 	while (tree_entry(&desc, &entry)) {
 		switch (object_type(entry.mode)) {
 		case OBJ_TREE:
-			fill_bitmap_tree(bitmap,
-					 lookup_tree(the_repository, &entry.oid));
+			if (fill_bitmap_tree(bitmap,
+					     lookup_tree(the_repository, &entry.oid)) < 0)
+				return -1;
 			break;
 		case OBJ_BLOB:
-			bitmap_set(bitmap, find_object_pos(&entry.oid));
+			pos = find_object_pos(&entry.oid, &found);
+			if (!found)
+				return -1;
+			bitmap_set(bitmap, pos);
 			break;
 		default:
 			/* Gitlink, etc; not reachable */
@@ -368,15 +380,18 @@ static void fill_bitmap_tree(struct bitmap *bitmap,
 	}
 
 	free_tree_buffer(tree);
+	return 0;
 }
 
-static void fill_bitmap_commit(struct bb_commit *ent,
-			       struct commit *commit,
-			       struct prio_queue *queue,
-			       struct prio_queue *tree_queue,
-			       struct bitmap_index *old_bitmap,
-			       const uint32_t *mapping)
+static int fill_bitmap_commit(struct bb_commit *ent,
+			      struct commit *commit,
+			      struct prio_queue *queue,
+			      struct prio_queue *tree_queue,
+			      struct bitmap_index *old_bitmap,
+			      const uint32_t *mapping)
 {
+	int found;
+	uint32_t pos;
 	if (!ent->bitmap)
 		ent->bitmap = bitmap_new();
 
@@ -401,11 +416,16 @@ static void fill_bitmap_commit(struct bb_commit *ent,
 		 * Mark ourselves and queue our tree. The commit
 		 * walk ensures we cover all parents.
 		 */
-		bitmap_set(ent->bitmap, find_object_pos(&c->object.oid));
+		pos = find_object_pos(&c->object.oid, &found);
+		if (!found)
+			return -1;
+		bitmap_set(ent->bitmap, pos);
 		prio_queue_put(tree_queue, get_commit_tree(c));
 
 		for (p = c->parents; p; p = p->next) {
-			int pos = find_object_pos(&p->item->object.oid);
+			pos = find_object_pos(&p->item->object.oid, &found);
+			if (!found)
+				return -1;
 			if (!bitmap_get(ent->bitmap, pos)) {
 				bitmap_set(ent->bitmap, pos);
 				prio_queue_put(queue, p->item);
@@ -413,8 +433,12 @@ static void fill_bitmap_commit(struct bb_commit *ent,
 		}
 	}
 
-	while (tree_queue->nr)
-		fill_bitmap_tree(ent->bitmap, prio_queue_get(tree_queue));
+	while (tree_queue->nr) {
+		if (fill_bitmap_tree(ent->bitmap,
+				     prio_queue_get(tree_queue)) < 0)
+			return -1;
+	}
+	return 0;
 }
 
 static void store_selected(struct bb_commit *ent, struct commit *commit)
@@ -432,7 +456,7 @@ static void store_selected(struct bb_commit *ent, struct commit *commit)
 	kh_value(writer.bitmaps, hash_pos) = stored;
 }
 
-void bitmap_writer_build(struct packing_data *to_pack)
+int bitmap_writer_build(struct packing_data *to_pack)
 {
 	struct bitmap_builder bb;
 	size_t i;
@@ -441,6 +465,7 @@ void bitmap_writer_build(struct packing_data *to_pack)
 	struct prio_queue tree_queue = { NULL };
 	struct bitmap_index *old_bitmap;
 	uint32_t *mapping;
+	int closed = 1; /* until proven otherwise */
 
 	writer.bitmaps = kh_init_oid_map();
 	writer.to_pack = to_pack;
@@ -463,8 +488,11 @@ void bitmap_writer_build(struct packing_data *to_pack)
 		struct commit *child;
 		int reused = 0;
 
-		fill_bitmap_commit(ent, commit, &queue, &tree_queue,
-				   old_bitmap, mapping);
+		if (fill_bitmap_commit(ent, commit, &queue, &tree_queue,
+				       old_bitmap, mapping) < 0) {
+			closed = 0;
+			break;
+		}
 
 		if (ent->selected) {
 			store_selected(ent, commit);
@@ -499,7 +527,9 @@ void bitmap_writer_build(struct packing_data *to_pack)
 
 	stop_progress(&writer.progress);
 
-	compute_xor_offsets();
+	if (closed)
+		compute_xor_offsets();
+	return closed ? 0 : -1;
 }
 
 /**
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 99d733eb26..020cd8d868 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -87,7 +87,7 @@ struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
 				      struct commit *commit);
 void bitmap_writer_select_commits(struct commit **indexed_commits,
 		unsigned int indexed_commits_nr, int max_bitmaps);
-void bitmap_writer_build(struct packing_data *to_pack);
+int bitmap_writer_build(struct packing_data *to_pack);
 void bitmap_writer_finish(struct pack_idx_entry **index,
 			  uint32_t index_nr,
 			  const char *filename,
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 584a039b85..1667450917 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -536,7 +536,13 @@ test_expect_success 'gc does not repack promisor objects if there are none' '
 repack_and_check () {
 	rm -rf repo2 &&
 	cp -r repo repo2 &&
-	git -C repo2 repack $1 -d &&
+	if test x"$1" = "x--must-fail"
+	then
+		shift
+		test_must_fail git -C repo2 repack $1 -d
+	else
+		git -C repo2 repack $1 -d
+	fi &&
 	git -C repo2 fsck &&
 
 	git -C repo2 cat-file -e $2 &&
@@ -561,6 +567,7 @@ test_expect_success 'repack -d does not irreversibly delete promisor objects' '
 	printf "$THREE\n" | pack_as_from_promisor &&
 	delete_object repo "$ONE" &&
 
+	repack_and_check --must-fail -ab "$TWO" "$THREE" &&
 	repack_and_check -a "$TWO" "$THREE" &&
 	repack_and_check -A "$TWO" "$THREE" &&
 	repack_and_check -l "$TWO" "$THREE"
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 03/24] pack-bitmap-write.c: free existing bitmaps
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
  2021-06-21 22:24   ` [PATCH v2 01/24] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps Taylor Blau
  2021-06-21 22:25   ` [PATCH v2 02/24] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-07-21  9:54     ` Jeff King
  2021-06-21 22:25   ` [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default Taylor Blau
                     ` (21 subsequent siblings)
  24 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

When writing a new bitmap, the bitmap writer code attempts to read the
existing bitmap (if one is present). This is done in order to quickly
permute the bits of any bitmaps for commits which appear in the existing
bitmap, and were also selected for the new bitmap.

But since this code was added in 341fa34887 (pack-bitmap-write: use
existing bitmaps, 2020-12-08), the resources associated with opening an
existing bitmap were never released.

It's fine to ignore this, but it's bad hygiene. It will also cause a
problem for the multi-pack-index builtin, which will be responsible not
only for writing bitmaps, but also for expiring any old multi-pack
bitmaps.

If an existing bitmap was reused here, it will also be expired. That
will cause a problem on platforms which require file resources to be
closed before unlinking them, like Windows. Avoid this by ensuring we
close reused bitmaps with free_bitmap_index() before removing them.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap-write.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index d374f7884b..142fd0adb8 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -520,6 +520,7 @@ int bitmap_writer_build(struct packing_data *to_pack)
 	clear_prio_queue(&queue);
 	clear_prio_queue(&tree_queue);
 	bitmap_builder_clear(&bb);
+	free_bitmap_index(old_bitmap);
 	free(mapping);
 
 	trace2_region_leave("pack-bitmap-write", "building_bitmaps_total",
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (2 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 03/24] pack-bitmap-write.c: free existing bitmaps Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-24 23:35     ` Ævar Arnfjörð Bjarmason
  2021-07-21  9:58     ` Jeff King
  2021-06-21 22:25   ` [PATCH v2 05/24] Documentation: describe MIDX-based bitmaps Taylor Blau
                     ` (20 subsequent siblings)
  24 siblings, 2 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Even though the 'TECH_DOCS' variable was introduced all the way back in
5e00439f0a (Documentation: build html for all files in technical and
howto, 2012-10-23), the 'bitmap-format' document was never added to that
list when it was created.

Prepare for changes to this file by including it in the list of
technical documentation that 'make doc' will build by default.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/Makefile b/Documentation/Makefile
index f5605b7767..7d7b778b28 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -90,6 +90,7 @@ SP_ARTICLES += $(API_DOCS)
 TECH_DOCS += MyFirstContribution
 TECH_DOCS += MyFirstObjectWalk
 TECH_DOCS += SubmittingPatches
+TECH_DOCS += technical/bitmap-format
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 05/24] Documentation: describe MIDX-based bitmaps
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (3 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-07-21 10:18     ` Jeff King
  2021-06-21 22:25   ` [PATCH v2 06/24] midx: make a number of functions non-static Taylor Blau
                     ` (19 subsequent siblings)
  24 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Update the technical documentation to describe the multi-pack bitmap
format. This patch merely introduces the new format, and describes its
high-level ideas. Git does not yet know how to read nor write these
multi-pack variants, and so the subsequent patches will:

  - Introduce code to interpret multi-pack bitmaps, according to this
    document.

  - Then, introduce code to write multi-pack bitmaps from the 'git
    multi-pack-index write' sub-command.

Finally, the implementation will gain tests in subsequent patches (as
opposed to inline with the patch teaching Git how to write multi-pack
bitmaps) to avoid a cyclic dependency.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/bitmap-format.txt    | 72 ++++++++++++++++----
 Documentation/technical/multi-pack-index.txt | 10 +--
 2 files changed, 61 insertions(+), 21 deletions(-)

diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
index f8c18a0f7a..25221c7ec8 100644
--- a/Documentation/technical/bitmap-format.txt
+++ b/Documentation/technical/bitmap-format.txt
@@ -1,6 +1,45 @@
 GIT bitmap v1 format
 ====================
 
+== Pack and multi-pack bitmaps
+
+Bitmaps store reachability information about the set of objects in a packfile,
+or a multi-pack index (MIDX). The former is defined obviously, and the latter is
+defined as the union of objects in packs contained in the MIDX.
+
+A bitmap may belong to either one pack, or the repository's multi-pack index (if
+it exists). A repository may have at most one bitmap.
+
+An object is uniquely described by its bit position within a bitmap:
+
+	- If the bitmap belongs to a packfile, the __n__th bit corresponds to
+	the __n__th object in pack order. For a function `offset` which maps
+	objects to their byte offset within a pack, pack order is defined as
+	follows:
+
+		o1 <= o2 <==> offset(o1) <= offset(o2)
+
+	- If the bitmap belongs to a MIDX, the __n__th bit corresponds to the
+	__n__th object in MIDX order. With an additional function `pack` which
+	maps objects to the pack they were selected from by the MIDX, MIDX order
+	is defined as follows:
+
+		o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)
+
+	The ordering between packs is done lexicographically by the pack name,
+	with the exception of the preferred pack, which sorts ahead of all other
+	packs.
+
+The on-disk representation (described below) of a bitmap is the same regardless
+of whether or not that bitmap belongs to a packfile or a MIDX. The only
+difference is the interpretation of the bits, which is described above.
+
+Certain bitmap extensions are supported (see: Appendix B). No extensions are
+required for bitmaps corresponding to packfiles. For bitmaps that correspond to
+MIDXs, both the bit-cache and rev-cache extensions are required.
+
+== On-disk format
+
 	- A header appears at the beginning:
 
 		4-byte signature: {'B', 'I', 'T', 'M'}
@@ -14,17 +53,19 @@ GIT bitmap v1 format
 			The following flags are supported:
 
 			- BITMAP_OPT_FULL_DAG (0x1) REQUIRED
-			This flag must always be present. It implies that the bitmap
-			index has been generated for a packfile with full closure
-			(i.e. where every single object in the packfile can find
-			 its parent links inside the same packfile). This is a
-			requirement for the bitmap index format, also present in JGit,
-			that greatly reduces the complexity of the implementation.
+			This flag must always be present. It implies that the
+			bitmap index has been generated for a packfile or
+			multi-pack index (MIDX) with full closure (i.e. where
+			every single object in the packfile/MIDX can find its
+			parent links inside the same packfile/MIDX). This is a
+			requirement for the bitmap index format, also present in
+			JGit, that greatly reduces the complexity of the
+			implementation.
 
 			- BITMAP_OPT_HASH_CACHE (0x4)
 			If present, the end of the bitmap file contains
 			`N` 32-bit name-hash values, one per object in the
-			pack. The format and meaning of the name-hash is
+			pack/MIDX. The format and meaning of the name-hash is
 			described below.
 
 		4-byte entry count (network byte order)
@@ -33,7 +74,8 @@ GIT bitmap v1 format
 
 		20-byte checksum
 
-			The SHA1 checksum of the pack this bitmap index belongs to.
+			The SHA1 checksum of the pack/MIDX this bitmap index
+			belongs to.
 
 	- 4 EWAH bitmaps that act as type indexes
 
@@ -50,7 +92,7 @@ GIT bitmap v1 format
 			- Tags
 
 		In each bitmap, the `n`th bit is set to true if the `n`th object
-		in the packfile is of that type.
+		in the packfile or multi-pack index is of that type.
 
 		The obvious consequence is that the OR of all 4 bitmaps will result
 		in a full set (all bits set), and the AND of all 4 bitmaps will
@@ -62,8 +104,9 @@ GIT bitmap v1 format
 		Each entry contains the following:
 
 		- 4-byte object position (network byte order)
-			The position **in the index for the packfile** where the
-			bitmap for this commit is found.
+			The position **in the index for the packfile or
+			multi-pack index** where the bitmap for this commit is
+			found.
 
 		- 1-byte XOR-offset
 			The xor offset used to compress this bitmap. For an entry
@@ -146,10 +189,11 @@ Name-hash cache
 ---------------
 
 If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains
-a cache of 32-bit values, one per object in the pack. The value at
+a cache of 32-bit values, one per object in the pack/MIDX. The value at
 position `i` is the hash of the pathname at which the `i`th object
-(counting in index order) in the pack can be found.  This can be fed
-into the delta heuristics to compare objects with similar pathnames.
+(counting in index or multi-pack index order) in the pack/MIDX can be found.
+This can be fed into the delta heuristics to compare objects with similar
+pathnames.
 
 The hash algorithm used is:
 
diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
index fb688976c4..1a73c3ee20 100644
--- a/Documentation/technical/multi-pack-index.txt
+++ b/Documentation/technical/multi-pack-index.txt
@@ -71,14 +71,10 @@ Future Work
   still reducing the number of binary searches required for object
   lookups.
 
-- The reachability bitmap is currently paired directly with a single
-  packfile, using the pack-order as the object order to hopefully
-  compress the bitmaps well using run-length encoding. This could be
-  extended to pair a reachability bitmap with a multi-pack-index. If
-  the multi-pack-index is extended to store a "stable object order"
+- If the multi-pack-index is extended to store a "stable object order"
   (a function Order(hash) = integer that is constant for a given hash,
-  even as the multi-pack-index is updated) then a reachability bitmap
-  could point to a multi-pack-index and be updated independently.
+  even as the multi-pack-index is updated) then MIDX bitmaps could be
+  updated independently of the MIDX.
 
 - Packfiles can be marked as "special" using empty files that share
   the initial name but replace ".pack" with ".keep" or ".promisor".
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 06/24] midx: make a number of functions non-static
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (4 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 05/24] Documentation: describe MIDX-based bitmaps Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-24 23:42     ` Ævar Arnfjörð Bjarmason
  2021-06-21 22:25   ` [PATCH v2 07/24] midx: clear auxiliary .rev after replacing the MIDX Taylor Blau
                     ` (18 subsequent siblings)
  24 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

These functions will be called from outside of midx.c in a subsequent
patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 4 ++--
 midx.h | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index 21d6a05e88..fa23d57a24 100644
--- a/midx.c
+++ b/midx.c
@@ -48,12 +48,12 @@ static uint8_t oid_version(void)
 	}
 }
 
-static const unsigned char *get_midx_checksum(struct multi_pack_index *m)
+const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
 }
 
-static char *get_midx_filename(const char *object_dir)
+char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
diff --git a/midx.h b/midx.h
index 8684cf0fef..1172df1a71 100644
--- a/midx.h
+++ b/midx.h
@@ -42,6 +42,8 @@ struct multi_pack_index {
 #define MIDX_PROGRESS     (1 << 0)
 #define MIDX_WRITE_REV_INDEX (1 << 1)
 
+const unsigned char *get_midx_checksum(struct multi_pack_index *m);
+char *get_midx_filename(const char *object_dir);
 char *get_midx_rev_filename(struct multi_pack_index *m);
 
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 07/24] midx: clear auxiliary .rev after replacing the MIDX
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (5 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 06/24] midx: make a number of functions non-static Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-07-21 10:19     ` Jeff King
  2021-06-21 22:25   ` [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing Taylor Blau
                     ` (17 subsequent siblings)
  24 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

When writing a new multi-pack index, write_midx_internal() attempts to
clean up any auxiliary files (currently just the MIDX's `.rev` file, but
soon to include a `.bitmap`, too) corresponding to the MIDX it's
replacing.

This step should happen after the new MIDX is written into place, since
doing so beforehand means that the old MIDX could be read without its
corresponding .rev file.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index fa23d57a24..40eb7974ba 100644
--- a/midx.c
+++ b/midx.c
@@ -1076,10 +1076,11 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	if (flags & MIDX_WRITE_REV_INDEX)
 		write_midx_reverse_index(midx_name, midx_hash, &ctx);
-	clear_midx_files_ext(the_repository, ".rev", midx_hash);
 
 	commit_lock_file(&lk);
 
+	clear_midx_files_ext(the_repository, ".rev", midx_hash);
+
 cleanup:
 	for (i = 0; i < ctx.nr; i++) {
 		if (ctx.info[i].p) {
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (6 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 07/24] midx: clear auxiliary .rev after replacing the MIDX Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-24 23:43     ` Ævar Arnfjörð Bjarmason
  2021-07-21 10:23     ` Jeff King
  2021-06-21 22:25   ` [PATCH v2 09/24] midx: infer preferred pack when not given one Taylor Blau
                     ` (16 subsequent siblings)
  24 siblings, 2 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

When writing a new multi-pack index, write_midx_internal() attempts to
load any existing one to fill in some pieces of information. But it uses
load_multi_pack_index(), which ignores the configuration
"core.multiPackIndex", which indicates whether or not Git is allowed to
read an existing multi-pack-index.

Replace this with a routine that does respect that setting, to avoid
reading multi-pack-index files when told not to.

This avoids a problem that would arise in subsequent patches due to the
combination of 'git repack' reopening the object store in-process and
the multi-pack index code not checking whether a pack already exists in
the object store when calling add_pack_to_midx().

This would ultimately lead to a cycle being created along the
'packed_git' struct's '->next' pointer. That is obviously bad, but it
has hard-to-debug downstream effects like saying a bitmap can't be
loaded for a pack because one already exists (for the same pack).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index 40eb7974ba..759007d5a8 100644
--- a/midx.c
+++ b/midx.c
@@ -908,8 +908,18 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	if (m)
 		ctx.m = m;
-	else
-		ctx.m = load_multi_pack_index(object_dir, 1);
+	else {
+		struct multi_pack_index *cur;
+
+		prepare_multi_pack_index_one(the_repository, object_dir, 1);
+
+		ctx.m = NULL;
+		for (cur = the_repository->objects->multi_pack_index; cur;
+		     cur = cur->next) {
+			if (!strcmp(object_dir, cur->object_dir))
+				ctx.m = cur;
+		}
+	}
 
 	ctx.nr = 0;
 	ctx.alloc = ctx.m ? ctx.m->num_packs : 16;
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 09/24] midx: infer preferred pack when not given one
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (7 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-07-21 10:34     ` Jeff King
  2021-06-21 22:25   ` [PATCH v2 10/24] pack-bitmap.c: introduce 'bitmap_num_objects()' Taylor Blau
                     ` (15 subsequent siblings)
  24 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

In 9218c6a40c (midx: allow marking a pack as preferred, 2021-03-30), the
multi-pack index code learned how to select a pack which all duplicate
objects are selected from. That is, if an object appears in multiple
packs, select the copy in the preferred pack before breaking ties
according to the other rules like pack mtime and readdir() order.

Not specifying a preferred pack can cause serious problems with
multi-pack reachability bitmaps, because these bitmaps rely on having at
least one pack from which all duplicates are selected. Not having such a
pack causes problems with the pack reuse code (e.g., like assuming that
a base object was sent from that pack via reuse when in fact the base
was selected from a different pack).

So why does not marking a pack preferred cause problems here? The reason
is roughly as follows:

  - Ties are broken (when handling duplicate objects) by sorting
    according to midx_oid_compare(), which sorts objects by OID,
    preferred-ness, pack mtime, and finally pack ID (more on that
    later).

  - The psuedo pack-order (described in
    Documentation/technical/bitmap-format.txt) is computed by
    midx_pack_order(), and sorts by pack ID and pack offset, with
    preferred packs sorting first.

  - But! Pack IDs come from incrementing the pack count in
    add_pack_to_midx(), which is a callback to
    for_each_file_in_pack_dir(), meaning that pack IDs are assigned in
    readdir() order.

When specifying a preferred pack, all of that works fine, because
duplicate objects are correctly resolved in favor of the copy in the
preferred pack, and the preferred pack sorts first in the object order.

"Sorting first" is critical, because the bitmap code relies on finding
out which pack holds the first object in the MIDX's pseudo pack-order to
determine which pack is preferred.

But if we didn't specify a preferred pack, and the pack which comes
first in readdir() order does not also have the lowest timestamp, then
it's possible that that pack (the one that sorts first in pseudo-pack
order, which the bitmap code will treat as the preferred one) did *not*
have all duplicate objects resolved in its favor, resulting in breakage.

The fix is simple: pick a (semi-arbitrary) preferred pack when none was
specified. This forces that pack to have duplicates resolved in its
favor, and (critically) to sort first in pseudo-pack order.
Unfortunately, testing this behavior portably isn't possible, since it
depends on readdir() order which isn't guaranteed by POSIX.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 39 +++++++++++++++++++++++++++++++++------
 1 file changed, 33 insertions(+), 6 deletions(-)

diff --git a/midx.c b/midx.c
index 759007d5a8..752d36c57f 100644
--- a/midx.c
+++ b/midx.c
@@ -950,15 +950,46 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
-	ctx.preferred_pack_idx = -1;
 	if (preferred_pack_name) {
+		int found = 0;
 		for (i = 0; i < ctx.nr; i++) {
 			if (!cmp_idx_or_pack_name(preferred_pack_name,
 						  ctx.info[i].pack_name)) {
 				ctx.preferred_pack_idx = i;
+				found = 1;
 				break;
 			}
 		}
+
+		if (!found)
+			warning(_("unknown preferred pack: '%s'"),
+				preferred_pack_name);
+	} else if (ctx.nr && (flags & MIDX_WRITE_REV_INDEX)) {
+		time_t oldest = ctx.info[0].p->mtime;
+		ctx.preferred_pack_idx = 0;
+
+		if (packs_to_drop && packs_to_drop->nr)
+			BUG("cannot write a MIDX bitmap during expiration");
+
+		/*
+		 * set a preferred pack when writing a bitmap to ensure that
+		 * the pack from which the first object is selected in pseudo
+		 * pack-order has all of its objects selected from that pack
+		 * (and not another pack containing a duplicate)
+		 */
+		for (i = 1; i < ctx.nr; i++) {
+			time_t mtime = ctx.info[i].p->mtime;
+			if (mtime < oldest) {
+				oldest = mtime;
+				ctx.preferred_pack_idx = i;
+			}
+		}
+	} else {
+		/*
+		 * otherwise don't mark any pack as preferred to avoid
+		 * interfering with expiration logic below
+		 */
+		ctx.preferred_pack_idx = -1;
 	}
 
 	ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr,
@@ -1029,11 +1060,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 						      ctx.info, ctx.nr,
 						      sizeof(*ctx.info),
 						      idx_or_pack_name_cmp);
-
-		if (!preferred)
-			warning(_("unknown preferred pack: '%s'"),
-				preferred_pack_name);
-		else {
+		if (preferred) {
 			uint32_t perm = ctx.pack_perm[preferred->orig_pack_int_id];
 			if (perm == PACK_EXPIRED)
 				warning(_("preferred pack '%s' is expired"),
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 10/24] pack-bitmap.c: introduce 'bitmap_num_objects()'
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (8 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 09/24] midx: infer preferred pack when not given one Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-07-21 10:35     ` Jeff King
  2021-06-21 22:25   ` [PATCH v2 11/24] pack-bitmap.c: introduce 'nth_bitmap_object_oid()' Taylor Blau
                     ` (14 subsequent siblings)
  24 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

A subsequent patch to support reading MIDX bitmaps will be less noisy
after extracting a generic function to return how many objects are
contained in a bitmap.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 37 +++++++++++++++++++++----------------
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 368fa59a42..2dc135d34a 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -136,6 +136,11 @@ static struct ewah_bitmap *read_bitmap_1(struct bitmap_index *index)
 	return b;
 }
 
+static uint32_t bitmap_num_objects(struct bitmap_index *index)
+{
+	return index->pack->num_objects;
+}
+
 static int load_bitmap_header(struct bitmap_index *index)
 {
 	struct bitmap_disk_header *header = (void *)index->map;
@@ -154,7 +159,7 @@ static int load_bitmap_header(struct bitmap_index *index)
 	/* Parse known bitmap format options */
 	{
 		uint32_t flags = ntohs(header->options);
-		size_t cache_size = st_mult(index->pack->num_objects, sizeof(uint32_t));
+		size_t cache_size = st_mult(bitmap_num_objects(index), sizeof(uint32_t));
 		unsigned char *index_end = index->map + index->map_size - the_hash_algo->rawsz;
 
 		if ((flags & BITMAP_OPT_FULL_DAG) == 0)
@@ -399,7 +404,7 @@ static inline int bitmap_position_extended(struct bitmap_index *bitmap_git,
 
 	if (pos < kh_end(positions)) {
 		int bitmap_pos = kh_value(positions, pos);
-		return bitmap_pos + bitmap_git->pack->num_objects;
+		return bitmap_pos + bitmap_num_objects(bitmap_git);
 	}
 
 	return -1;
@@ -451,7 +456,7 @@ static int ext_index_add_object(struct bitmap_index *bitmap_git,
 		bitmap_pos = kh_value(eindex->positions, hash_pos);
 	}
 
-	return bitmap_pos + bitmap_git->pack->num_objects;
+	return bitmap_pos + bitmap_num_objects(bitmap_git);
 }
 
 struct bitmap_show_data {
@@ -650,7 +655,7 @@ static void show_extended_objects(struct bitmap_index *bitmap_git,
 	for (i = 0; i < eindex->count; ++i) {
 		struct object *obj;
 
-		if (!bitmap_get(objects, bitmap_git->pack->num_objects + i))
+		if (!bitmap_get(objects, bitmap_num_objects(bitmap_git) + i))
 			continue;
 
 		obj = eindex->objects[i];
@@ -808,7 +813,7 @@ static void filter_bitmap_exclude_type(struct bitmap_index *bitmap_git,
 	 * individually.
 	 */
 	for (i = 0; i < eindex->count; i++) {
-		uint32_t pos = i + bitmap_git->pack->num_objects;
+		uint32_t pos = i + bitmap_num_objects(bitmap_git);
 		if (eindex->objects[i]->type == type &&
 		    bitmap_get(to_filter, pos) &&
 		    !bitmap_get(tips, pos))
@@ -835,7 +840,7 @@ static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
 
 	oi.sizep = &size;
 
-	if (pos < pack->num_objects) {
+	if (pos < bitmap_num_objects(bitmap_git)) {
 		off_t ofs = pack_pos_to_offset(pack, pos);
 		if (packed_object_info(the_repository, pack, ofs, &oi) < 0) {
 			struct object_id oid;
@@ -845,7 +850,7 @@ static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
 		}
 	} else {
 		struct eindex *eindex = &bitmap_git->ext_index;
-		struct object *obj = eindex->objects[pos - pack->num_objects];
+		struct object *obj = eindex->objects[pos - bitmap_num_objects(bitmap_git)];
 		if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
 			die(_("unable to get size of %s"), oid_to_hex(&obj->oid));
 	}
@@ -887,7 +892,7 @@ static void filter_bitmap_blob_limit(struct bitmap_index *bitmap_git,
 	}
 
 	for (i = 0; i < eindex->count; i++) {
-		uint32_t pos = i + bitmap_git->pack->num_objects;
+		uint32_t pos = i + bitmap_num_objects(bitmap_git);
 		if (eindex->objects[i]->type == OBJ_BLOB &&
 		    bitmap_get(to_filter, pos) &&
 		    !bitmap_get(tips, pos) &&
@@ -1113,8 +1118,8 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 	enum object_type type;
 	unsigned long size;
 
-	if (pos >= bitmap_git->pack->num_objects)
-		return; /* not actually in the pack */
+	if (pos >= bitmap_num_objects(bitmap_git))
+		return; /* not actually in the pack or MIDX */
 
 	offset = header = pack_pos_to_offset(bitmap_git->pack, pos);
 	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
@@ -1180,6 +1185,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	struct pack_window *w_curs = NULL;
 	size_t i = 0;
 	uint32_t offset;
+	uint32_t objects_nr = bitmap_num_objects(bitmap_git);
 
 	assert(result);
 
@@ -1187,8 +1193,8 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 		i++;
 
 	/* Don't mark objects not in the packfile */
-	if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
-		i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
+	if (i > objects_nr / BITS_IN_EWORD)
+		i = objects_nr / BITS_IN_EWORD;
 
 	reuse = bitmap_word_alloc(i);
 	memset(reuse->words, 0xFF, i * sizeof(eword_t));
@@ -1272,7 +1278,7 @@ static uint32_t count_object_type(struct bitmap_index *bitmap_git,
 
 	for (i = 0; i < eindex->count; ++i) {
 		if (eindex->objects[i]->type == type &&
-			bitmap_get(objects, bitmap_git->pack->num_objects + i))
+			bitmap_get(objects, bitmap_num_objects(bitmap_git) + i))
 			count++;
 	}
 
@@ -1493,7 +1499,7 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
 	uint32_t i, num_objects;
 	uint32_t *reposition;
 
-	num_objects = bitmap_git->pack->num_objects;
+	num_objects = bitmap_num_objects(bitmap_git);
 	CALLOC_ARRAY(reposition, num_objects);
 
 	for (i = 0; i < num_objects; ++i) {
@@ -1576,7 +1582,6 @@ static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
 static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git)
 {
 	struct bitmap *result = bitmap_git->result;
-	struct packed_git *pack = bitmap_git->pack;
 	struct eindex *eindex = &bitmap_git->ext_index;
 	off_t total = 0;
 	struct object_info oi = OBJECT_INFO_INIT;
@@ -1588,7 +1593,7 @@ static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git)
 	for (i = 0; i < eindex->count; i++) {
 		struct object *obj = eindex->objects[i];
 
-		if (!bitmap_get(result, pack->num_objects + i))
+		if (!bitmap_get(result, bitmap_num_objects(bitmap_git) + i))
 			continue;
 
 		if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 11/24] pack-bitmap.c: introduce 'nth_bitmap_object_oid()'
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (9 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 10/24] pack-bitmap.c: introduce 'bitmap_num_objects()' Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-24 14:59     ` Taylor Blau
  2021-07-21 10:37     ` Jeff King
  2021-06-21 22:25   ` [PATCH v2 12/24] pack-bitmap.c: introduce 'bitmap_is_preferred_refname()' Taylor Blau
                     ` (13 subsequent siblings)
  24 siblings, 2 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

A subsequent patch to support reading MIDX bitmaps will be less noisy
after extracting a generic function to fetch the nth OID contained in
the bitmap.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 2dc135d34a..9757cd0fbb 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -223,6 +223,13 @@ static inline uint8_t read_u8(const unsigned char *buffer, size_t *pos)
 
 #define MAX_XOR_OFFSET 160
 
+static void nth_bitmap_object_oid(struct bitmap_index *index,
+				  struct object_id *oid,
+				  uint32_t n)
+{
+	nth_packed_object_id(oid, index->pack, n);
+}
+
 static int load_bitmap_entries_v1(struct bitmap_index *index)
 {
 	uint32_t i;
@@ -242,9 +249,7 @@ static int load_bitmap_entries_v1(struct bitmap_index *index)
 		xor_offset = read_u8(index->map, &index->map_pos);
 		flags = read_u8(index->map, &index->map_pos);
 
-		if (nth_packed_object_id(&oid, index->pack, commit_idx_pos) < 0)
-			return error("corrupt ewah bitmap: commit index %u out of range",
-				     (unsigned)commit_idx_pos);
+		nth_bitmap_object_oid(index, &oid, commit_idx_pos);
 
 		bitmap = read_bitmap_1(index);
 		if (!bitmap)
@@ -844,8 +849,8 @@ static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
 		off_t ofs = pack_pos_to_offset(pack, pos);
 		if (packed_object_info(the_repository, pack, ofs, &oi) < 0) {
 			struct object_id oid;
-			nth_packed_object_id(&oid, pack,
-					     pack_pos_to_index(pack, pos));
+			nth_bitmap_object_oid(bitmap_git, &oid,
+					      pack_pos_to_index(pack, pos));
 			die(_("unable to get size of %s"), oid_to_hex(&oid));
 		}
 	} else {
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 12/24] pack-bitmap.c: introduce 'bitmap_is_preferred_refname()'
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (10 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 11/24] pack-bitmap.c: introduce 'nth_bitmap_object_oid()' Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-07-21 10:39     ` Jeff King
  2021-06-21 22:25   ` [PATCH v2 13/24] pack-bitmap: read multi-pack bitmaps Taylor Blau
                     ` (12 subsequent siblings)
  24 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

In a recent commit, pack-objects learned support for the
'pack.preferBitmapTips' configuration. This patch prepares the
multi-pack bitmap code to respect this configuration, too.

Since the multi-pack bitmap code already does a traversal of all
references (in order to discover the set of reachable commits in the
multi-pack index), it is more efficient to check whether or not each
reference is a suffix of any value of 'pack.preferBitmapTips' rather
than do an additional traversal.

Implement a function 'bitmap_is_preferred_refname()' which does just
that. The caller will be added in a subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 16 ++++++++++++++++
 pack-bitmap.h |  1 +
 2 files changed, 17 insertions(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 9757cd0fbb..d882bf7ce1 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1632,3 +1632,19 @@ const struct string_list *bitmap_preferred_tips(struct repository *r)
 {
 	return repo_config_get_value_multi(r, "pack.preferbitmaptips");
 }
+
+int bitmap_is_preferred_refname(struct repository *r, const char *refname)
+{
+	const struct string_list *preferred_tips = bitmap_preferred_tips(r);
+	struct string_list_item *item;
+
+	if (!preferred_tips)
+		return 0;
+
+	for_each_string_list_item(item, preferred_tips) {
+		if (starts_with(refname, item->string))
+			return 1;
+	}
+
+	return 0;
+}
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 020cd8d868..52ea10de51 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -94,5 +94,6 @@ void bitmap_writer_finish(struct pack_idx_entry **index,
 			  uint16_t options);
 
 const struct string_list *bitmap_preferred_tips(struct repository *r);
+int bitmap_is_preferred_refname(struct repository *r, const char *refname);
 
 #endif
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 13/24] pack-bitmap: read multi-pack bitmaps
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (11 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 12/24] pack-bitmap.c: introduce 'bitmap_is_preferred_refname()' Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-07-21 11:32     ` Jeff King
  2021-06-21 22:25   ` [PATCH v2 14/24] pack-bitmap: write " Taylor Blau
                     ` (11 subsequent siblings)
  24 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

This prepares the code in pack-bitmap to interpret the new multi-pack
bitmaps described in Documentation/technical/bitmap-format.txt, which
mostly involves converting bit positions to accommodate looking them up
in a MIDX.

Note that there are currently no writers who write multi-pack bitmaps,
and that this will be implemented in the subsequent commit.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |   5 +
 pack-bitmap-write.c    |   2 +-
 pack-bitmap.c          | 362 +++++++++++++++++++++++++++++++++++++----
 pack-bitmap.h          |   5 +
 packfile.c             |   2 +-
 5 files changed, 340 insertions(+), 36 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 8a523624a1..e11d3ac2e5 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1124,6 +1124,11 @@ static void write_reused_pack(struct hashfile *f)
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
+			/*
+			 * Can use bit positions directly, even for MIDX
+			 * bitmaps. See comment in try_partial_reuse()
+			 * for why.
+			 */
 			write_reused_pack_one(pos + offset, f, &w_curs);
 			display_progress(progress_state, ++written);
 		}
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index 142fd0adb8..9c55c1531e 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -48,7 +48,7 @@ void bitmap_writer_show_progress(int show)
 }
 
 /**
- * Build the initial type index for the packfile
+ * Build the initial type index for the packfile or multi-pack-index
  */
 void bitmap_writer_build_type_index(struct packing_data *to_pack,
 				    struct pack_idx_entry **index,
diff --git a/pack-bitmap.c b/pack-bitmap.c
index d882bf7ce1..4110d23ca1 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -13,6 +13,7 @@
 #include "repository.h"
 #include "object-store.h"
 #include "list-objects-filter-options.h"
+#include "midx.h"
 #include "config.h"
 
 /*
@@ -35,8 +36,15 @@ struct stored_bitmap {
  * the active bitmap index is the largest one.
  */
 struct bitmap_index {
-	/* Packfile to which this bitmap index belongs to */
+	/*
+	 * The pack or multi-pack index (MIDX) that this bitmap index belongs
+	 * to.
+	 *
+	 * Exactly one of these must be non-NULL; this specifies the object
+	 * order used to interpret this bitmap.
+	 */
 	struct packed_git *pack;
+	struct multi_pack_index *midx;
 
 	/*
 	 * Mark the first `reuse_objects` in the packfile as reused:
@@ -71,6 +79,9 @@ struct bitmap_index {
 	/* If not NULL, this is a name-hash cache pointing into map. */
 	uint32_t *hashes;
 
+	/* The checksum of the packfile or MIDX; points into map. */
+	const unsigned char *checksum;
+
 	/*
 	 * Extended index.
 	 *
@@ -138,6 +149,8 @@ static struct ewah_bitmap *read_bitmap_1(struct bitmap_index *index)
 
 static uint32_t bitmap_num_objects(struct bitmap_index *index)
 {
+	if (index->midx)
+		return index->midx->num_objects;
 	return index->pack->num_objects;
 }
 
@@ -175,6 +188,7 @@ static int load_bitmap_header(struct bitmap_index *index)
 	}
 
 	index->entry_count = ntohl(header->entry_count);
+	index->checksum = header->checksum;
 	index->map_pos += header_size;
 	return 0;
 }
@@ -227,7 +241,10 @@ static void nth_bitmap_object_oid(struct bitmap_index *index,
 				  struct object_id *oid,
 				  uint32_t n)
 {
-	nth_packed_object_id(oid, index->pack, n);
+	if (index->midx)
+		nth_midxed_object_oid(oid, index->midx, n);
+	else
+		nth_packed_object_id(oid, index->pack, n);
 }
 
 static int load_bitmap_entries_v1(struct bitmap_index *index)
@@ -272,7 +289,14 @@ static int load_bitmap_entries_v1(struct bitmap_index *index)
 	return 0;
 }
 
-static char *pack_bitmap_filename(struct packed_git *p)
+char *midx_bitmap_filename(struct multi_pack_index *midx)
+{
+	return xstrfmt("%s-%s.bitmap",
+		       get_midx_filename(midx->object_dir),
+		       hash_to_hex(get_midx_checksum(midx)));
+}
+
+char *pack_bitmap_filename(struct packed_git *p)
 {
 	size_t len;
 
@@ -281,6 +305,57 @@ static char *pack_bitmap_filename(struct packed_git *p)
 	return xstrfmt("%.*s.bitmap", (int)len, p->pack_name);
 }
 
+static int open_midx_bitmap_1(struct bitmap_index *bitmap_git,
+			      struct multi_pack_index *midx)
+{
+	struct stat st;
+	char *idx_name = midx_bitmap_filename(midx);
+	int fd = git_open(idx_name);
+
+	free(idx_name);
+
+	if (fd < 0)
+		return -1;
+
+	if (fstat(fd, &st)) {
+		close(fd);
+		return -1;
+	}
+
+	if (bitmap_git->pack || bitmap_git->midx) {
+		/* ignore extra bitmap file; we can only handle one */
+		warning("ignoring extra bitmap file: %s",
+			get_midx_filename(midx->object_dir));
+		close(fd);
+		return -1;
+	}
+
+	bitmap_git->midx = midx;
+	bitmap_git->map_size = xsize_t(st.st_size);
+	bitmap_git->map_pos = 0;
+	bitmap_git->map = xmmap(NULL, bitmap_git->map_size, PROT_READ,
+				MAP_PRIVATE, fd, 0);
+	close(fd);
+
+	if (load_bitmap_header(bitmap_git) < 0)
+		goto cleanup;
+
+	if (!hasheq(get_midx_checksum(bitmap_git->midx), bitmap_git->checksum))
+		goto cleanup;
+
+	if (load_midx_revindex(bitmap_git->midx) < 0) {
+		warning(_("multi-pack bitmap is missing required reverse index"));
+		goto cleanup;
+	}
+	return 0;
+
+cleanup:
+	munmap(bitmap_git->map, bitmap_git->map_size);
+	bitmap_git->map_size = 0;
+	bitmap_git->map = NULL;
+	return -1;
+}
+
 static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git *packfile)
 {
 	int fd;
@@ -302,12 +377,18 @@ static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git
 		return -1;
 	}
 
-	if (bitmap_git->pack) {
+	if (bitmap_git->pack || bitmap_git->midx) {
+		/* ignore extra bitmap file; we can only handle one */
 		warning("ignoring extra bitmap file: %s", packfile->pack_name);
 		close(fd);
 		return -1;
 	}
 
+	if (!is_pack_valid(packfile)) {
+		close(fd);
+		return -1;
+	}
+
 	bitmap_git->pack = packfile;
 	bitmap_git->map_size = xsize_t(st.st_size);
 	bitmap_git->map = xmmap(NULL, bitmap_git->map_size, PROT_READ, MAP_PRIVATE, fd, 0);
@@ -324,13 +405,36 @@ static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git
 	return 0;
 }
 
-static int load_pack_bitmap(struct bitmap_index *bitmap_git)
+static int load_reverse_index(struct bitmap_index *bitmap_git)
+{
+	if (bitmap_is_midx(bitmap_git)) {
+		uint32_t i;
+		int ret;
+
+		ret = load_midx_revindex(bitmap_git->midx);
+		if (ret)
+			return ret;
+
+		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
+			if (prepare_midx_pack(the_repository, bitmap_git->midx, i))
+				die(_("load_reverse_index: could not open pack"));
+			ret = load_pack_revindex(bitmap_git->midx->packs[i]);
+			if (ret)
+				return ret;
+		}
+		return 0;
+	}
+	return load_pack_revindex(bitmap_git->pack);
+}
+
+static int load_bitmap(struct bitmap_index *bitmap_git)
 {
 	assert(bitmap_git->map);
 
 	bitmap_git->bitmaps = kh_init_oid_map();
 	bitmap_git->ext_index.positions = kh_init_oid_pos();
-	if (load_pack_revindex(bitmap_git->pack))
+
+	if (load_reverse_index(bitmap_git))
 		goto failed;
 
 	if (!(bitmap_git->commits = read_bitmap_1(bitmap_git)) ||
@@ -374,11 +478,35 @@ static int open_pack_bitmap(struct repository *r,
 	return ret;
 }
 
+static int open_midx_bitmap(struct repository *r,
+			    struct bitmap_index *bitmap_git)
+{
+	struct multi_pack_index *midx;
+
+	assert(!bitmap_git->map);
+
+	for (midx = get_multi_pack_index(r); midx; midx = midx->next) {
+		if (!open_midx_bitmap_1(bitmap_git, midx))
+			return 0;
+	}
+	return -1;
+}
+
+static int open_bitmap(struct repository *r,
+		       struct bitmap_index *bitmap_git)
+{
+	assert(!bitmap_git->map);
+
+	if (!open_midx_bitmap(r, bitmap_git))
+		return 0;
+	return open_pack_bitmap(r, bitmap_git);
+}
+
 struct bitmap_index *prepare_bitmap_git(struct repository *r)
 {
 	struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));
 
-	if (!open_pack_bitmap(r, bitmap_git) && !load_pack_bitmap(bitmap_git))
+	if (!open_bitmap(r, bitmap_git) && !load_bitmap(bitmap_git))
 		return bitmap_git;
 
 	free_bitmap_index(bitmap_git);
@@ -428,10 +556,26 @@ static inline int bitmap_position_packfile(struct bitmap_index *bitmap_git,
 	return pos;
 }
 
+static int bitmap_position_midx(struct bitmap_index *bitmap_git,
+				const struct object_id *oid)
+{
+	uint32_t want, got;
+	if (!bsearch_midx(oid, bitmap_git->midx, &want))
+		return -1;
+
+	if (midx_to_pack_pos(bitmap_git->midx, want, &got) < 0)
+		return -1;
+	return got;
+}
+
 static int bitmap_position(struct bitmap_index *bitmap_git,
 			   const struct object_id *oid)
 {
-	int pos = bitmap_position_packfile(bitmap_git, oid);
+	int pos;
+	if (bitmap_is_midx(bitmap_git))
+		pos = bitmap_position_midx(bitmap_git, oid);
+	else
+		pos = bitmap_position_packfile(bitmap_git, oid);
 	return (pos >= 0) ? pos : bitmap_position_extended(bitmap_git, oid);
 }
 
@@ -724,6 +868,7 @@ static void show_objects_for_type(
 			continue;
 
 		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
+			struct packed_git *pack;
 			struct object_id oid;
 			uint32_t hash = 0, index_pos;
 			off_t ofs;
@@ -733,14 +878,28 @@ static void show_objects_for_type(
 
 			offset += ewah_bit_ctz64(word >> offset);
 
-			index_pos = pack_pos_to_index(bitmap_git->pack, pos + offset);
-			ofs = pack_pos_to_offset(bitmap_git->pack, pos + offset);
-			nth_packed_object_id(&oid, bitmap_git->pack, index_pos);
+			if (bitmap_is_midx(bitmap_git)) {
+				struct multi_pack_index *m = bitmap_git->midx;
+				uint32_t pack_id;
+
+				index_pos = pack_pos_to_midx(m, pos + offset);
+				ofs = nth_midxed_offset(m, index_pos);
+				nth_midxed_object_oid(&oid, m, index_pos);
+
+				pack_id = nth_midxed_pack_int_id(m, index_pos);
+				pack = bitmap_git->midx->packs[pack_id];
+			} else {
+				index_pos = pack_pos_to_index(bitmap_git->pack, pos + offset);
+				ofs = pack_pos_to_offset(bitmap_git->pack, pos + offset);
+				nth_bitmap_object_oid(bitmap_git, &oid, index_pos);
+
+				pack = bitmap_git->pack;
+			}
 
 			if (bitmap_git->hashes)
 				hash = get_be32(bitmap_git->hashes + index_pos);
 
-			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, ofs);
+			show_reach(&oid, object_type, 0, hash, pack, ofs);
 		}
 	}
 }
@@ -752,8 +911,13 @@ static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
 		struct object *object = roots->item;
 		roots = roots->next;
 
-		if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
-			return 1;
+		if (bitmap_is_midx(bitmap_git)) {
+			if (bsearch_midx(&object->oid, bitmap_git->midx, NULL))
+				return 1;
+		} else {
+			if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
+				return 1;
+		}
 	}
 
 	return 0;
@@ -839,14 +1003,26 @@ static void filter_bitmap_blob_none(struct bitmap_index *bitmap_git,
 static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
 				     uint32_t pos)
 {
-	struct packed_git *pack = bitmap_git->pack;
 	unsigned long size;
 	struct object_info oi = OBJECT_INFO_INIT;
 
 	oi.sizep = &size;
 
 	if (pos < bitmap_num_objects(bitmap_git)) {
-		off_t ofs = pack_pos_to_offset(pack, pos);
+		struct packed_git *pack;
+		off_t ofs;
+
+		if (bitmap_is_midx(bitmap_git)) {
+			uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, pos);
+			uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
+
+			pack = bitmap_git->midx->packs[pack_id];
+			ofs = nth_midxed_offset(bitmap_git->midx, midx_pos);
+		} else {
+			pack = bitmap_git->pack;
+			ofs = pack_pos_to_offset(pack, pos);
+		}
+
 		if (packed_object_info(the_repository, pack, ofs, &oi) < 0) {
 			struct object_id oid;
 			nth_bitmap_object_oid(bitmap_git, &oid,
@@ -1027,7 +1203,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 	/* try to open a bitmapped pack, but don't parse it yet
 	 * because we may not need to use it */
 	CALLOC_ARRAY(bitmap_git, 1);
-	if (open_pack_bitmap(revs->repo, bitmap_git) < 0)
+	if (open_bitmap(revs->repo, bitmap_git) < 0)
 		goto cleanup;
 
 	for (i = 0; i < revs->pending.nr; ++i) {
@@ -1071,7 +1247,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 	 * from disk. this is the point of no return; after this the rev_list
 	 * becomes invalidated and we must perform the revwalk through bitmaps
 	 */
-	if (load_pack_bitmap(bitmap_git) < 0)
+	if (load_bitmap(bitmap_git) < 0)
 		goto cleanup;
 
 	object_array_clear(&revs->pending);
@@ -1115,19 +1291,43 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 }
 
 static void try_partial_reuse(struct bitmap_index *bitmap_git,
+			      struct packed_git *pack,
 			      size_t pos,
 			      struct bitmap *reuse,
 			      struct pack_window **w_curs)
 {
-	off_t offset, header;
+	off_t offset, delta_obj_offset;
 	enum object_type type;
 	unsigned long size;
 
-	if (pos >= bitmap_num_objects(bitmap_git))
-		return; /* not actually in the pack or MIDX */
+	/*
+	 * try_partial_reuse() is called either on (a) objects in the
+	 * bitmapped pack (in the case of a single-pack bitmap) or (b)
+	 * objects in the preferred pack of a multi-pack bitmap.
+	 * Importantly, the latter can pretend as if only a single pack
+	 * exists because:
+	 *
+	 *   - The first pack->num_objects bits of a MIDX bitmap are
+	 *     reserved for the preferred pack, and
+	 *
+	 *   - Ties due to duplicate objects are always resolved in
+	 *     favor of the preferred pack.
+	 *
+	 * Therefore we do not need to ever ask the MIDX for its copy of
+	 * an object by OID, since it will always select it from the
+	 * preferred pack. Likewise, the selected copy of the base
+	 * object for any deltas will reside in the same pack.
+	 *
+	 * This means that we can reuse pos when looking up the bit in
+	 * the reuse bitmap, too, since bits corresponding to the
+	 * preferred pack precede all bits from other packs.
+	 */
 
-	offset = header = pack_pos_to_offset(bitmap_git->pack, pos);
-	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
+	if (pos >= pack->num_objects)
+		return; /* not actually in the pack or MIDX preferred pack */
+
+	offset = delta_obj_offset = pack_pos_to_offset(pack, pos);
+	type = unpack_object_header(pack, w_curs, &offset, &size);
 	if (type < 0)
 		return; /* broken packfile, punt */
 
@@ -1143,11 +1343,11 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 		 * and the normal slow path will complain about it in
 		 * more detail.
 		 */
-		base_offset = get_delta_base(bitmap_git->pack, w_curs,
-					     &offset, type, header);
+		base_offset = get_delta_base(pack, w_curs, &offset, type,
+					     delta_obj_offset);
 		if (!base_offset)
 			return;
-		if (offset_to_pack_pos(bitmap_git->pack, base_offset, &base_pos) < 0)
+		if (offset_to_pack_pos(pack, base_offset, &base_pos) < 0)
 			return;
 
 		/*
@@ -1180,24 +1380,48 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 	bitmap_set(reuse, pos);
 }
 
+static uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
+{
+	struct multi_pack_index *m = bitmap_git->midx;
+	if (!m)
+		BUG("midx_preferred_pack: requires non-empty MIDX");
+	return nth_midxed_pack_int_id(m, pack_pos_to_midx(bitmap_git->midx, 0));
+}
+
 int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				       struct packed_git **packfile_out,
 				       uint32_t *entries,
 				       struct bitmap **reuse_out)
 {
+	struct packed_git *pack;
 	struct bitmap *result = bitmap_git->result;
 	struct bitmap *reuse;
 	struct pack_window *w_curs = NULL;
 	size_t i = 0;
 	uint32_t offset;
-	uint32_t objects_nr = bitmap_num_objects(bitmap_git);
+	uint32_t objects_nr;
 
 	assert(result);
 
+	load_reverse_index(bitmap_git);
+
+	if (bitmap_is_midx(bitmap_git))
+		pack = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
+	else
+		pack = bitmap_git->pack;
+	objects_nr = pack->num_objects;
+
 	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
 		i++;
 
-	/* Don't mark objects not in the packfile */
+	/*
+	 * Don't mark objects not in the packfile or preferred pack. This bitmap
+	 * marks objects eligible for reuse, but the pack-reuse code only
+	 * understands how to reuse a single pack. Since the preferred pack is
+	 * guaranteed to have all bases for its deltas (in a multi-pack bitmap),
+	 * we use it instead of another pack. In single-pack bitmaps, the choice
+	 * is made for us.
+	 */
 	if (i > objects_nr / BITS_IN_EWORD)
 		i = objects_nr / BITS_IN_EWORD;
 
@@ -1213,7 +1437,15 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
-			try_partial_reuse(bitmap_git, pos + offset, reuse, &w_curs);
+			if (bitmap_is_midx(bitmap_git)) {
+				/*
+				 * Can't reuse from a non-preferred pack (see
+				 * above).
+				 */
+				if (pos + offset >= objects_nr)
+					continue;
+			}
+			try_partial_reuse(bitmap_git, pack, pos + offset, reuse, &w_curs);
 		}
 	}
 
@@ -1230,7 +1462,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	 * need to be handled separately.
 	 */
 	bitmap_and_not(result, reuse);
-	*packfile_out = bitmap_git->pack;
+	*packfile_out = pack;
 	*reuse_out = reuse;
 	return 0;
 }
@@ -1504,6 +1736,12 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
 	uint32_t i, num_objects;
 	uint32_t *reposition;
 
+	if (!bitmap_is_midx(bitmap_git))
+		load_reverse_index(bitmap_git);
+	else if (load_midx_revindex(bitmap_git->midx) < 0)
+		BUG("rebuild_existing_bitmaps: missing required rev-cache "
+		    "extension");
+
 	num_objects = bitmap_num_objects(bitmap_git);
 	CALLOC_ARRAY(reposition, num_objects);
 
@@ -1511,8 +1749,13 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
 		struct object_id oid;
 		struct object_entry *oe;
 
-		nth_packed_object_id(&oid, bitmap_git->pack,
-				     pack_pos_to_index(bitmap_git->pack, i));
+		if (bitmap_is_midx(bitmap_git))
+			nth_midxed_object_oid(&oid,
+					      bitmap_git->midx,
+					      pack_pos_to_midx(bitmap_git->midx, i));
+		else
+			nth_packed_object_id(&oid, bitmap_git->pack,
+					     pack_pos_to_index(bitmap_git->pack, i));
 		oe = packlist_find(mapping, &oid);
 
 		if (oe)
@@ -1538,6 +1781,19 @@ void free_bitmap_index(struct bitmap_index *b)
 	free(b->ext_index.hashes);
 	bitmap_free(b->result);
 	bitmap_free(b->haves);
+	if (bitmap_is_midx(b)) {
+		/*
+		 * Multi-pack bitmaps need to have resources associated with
+		 * their on-disk reverse indexes unmapped so that stale .rev and
+		 * .bitmap files can be removed.
+		 *
+		 * Unlike pack-based bitmaps, multi-pack bitmaps can be read and
+		 * written in the same 'git multi-pack-index write --bitmap'
+		 * process. Close resources so they can be removed safely on
+		 * platforms like Windows.
+		 */
+		close_midx_revindex(b->midx);
+	}
 	free(b);
 }
 
@@ -1552,7 +1808,7 @@ static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
 				     enum object_type object_type)
 {
 	struct bitmap *result = bitmap_git->result;
-	struct packed_git *pack = bitmap_git->pack;
+	struct packed_git *pack;
 	off_t total = 0;
 	struct ewah_iterator it;
 	eword_t filter;
@@ -1575,7 +1831,31 @@ static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
-			pos = base + offset;
+
+			if (bitmap_is_midx(bitmap_git)) {
+				uint32_t pack_pos;
+				uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, base + offset);
+				uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
+				off_t offset = nth_midxed_offset(bitmap_git->midx, midx_pos);
+
+				pack = bitmap_git->midx->packs[pack_id];
+
+				if (offset_to_pack_pos(pack, offset, &pack_pos) < 0) {
+					struct object_id oid;
+					nth_midxed_object_oid(&oid, bitmap_git->midx, midx_pos);
+
+					die(_("could not find %s in pack #%"PRIu32" at offset %"PRIuMAX),
+					    oid_to_hex(&oid),
+					    pack_id,
+					    (uintmax_t)offset);
+				}
+
+				pos = pack_pos;
+			} else {
+				pack = bitmap_git->pack;
+				pos = base + offset;
+			}
+
 			total += pack_pos_to_offset(pack, pos + 1) -
 				 pack_pos_to_offset(pack, pos);
 		}
@@ -1628,6 +1908,20 @@ off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git,
 	return total;
 }
 
+int bitmap_is_midx(struct bitmap_index *bitmap_git)
+{
+	return !!bitmap_git->midx;
+}
+
+off_t bitmap_pack_offset(struct bitmap_index *bitmap_git, uint32_t pos)
+{
+	if (bitmap_is_midx(bitmap_git))
+		return nth_midxed_offset(bitmap_git->midx,
+					 pack_pos_to_midx(bitmap_git->midx, pos));
+	return nth_packed_object_offset(bitmap_git->pack,
+					pack_pos_to_index(bitmap_git->pack, pos));
+}
+
 const struct string_list *bitmap_preferred_tips(struct repository *r)
 {
 	return repo_config_get_value_multi(r, "pack.preferbitmaptips");
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 52ea10de51..30396a7a4a 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -92,6 +92,11 @@ void bitmap_writer_finish(struct pack_idx_entry **index,
 			  uint32_t index_nr,
 			  const char *filename,
 			  uint16_t options);
+char *midx_bitmap_filename(struct multi_pack_index *midx);
+char *pack_bitmap_filename(struct packed_git *p);
+
+int bitmap_is_midx(struct bitmap_index *bitmap_git);
+off_t bitmap_pack_offset(struct bitmap_index *bitmap_git, uint32_t pos);
 
 const struct string_list *bitmap_preferred_tips(struct repository *r);
 int bitmap_is_preferred_refname(struct repository *r, const char *refname);
diff --git a/packfile.c b/packfile.c
index 755aa7aec5..e855b93208 100644
--- a/packfile.c
+++ b/packfile.c
@@ -860,7 +860,7 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	if (!strcmp(file_name, "multi-pack-index"))
 		return;
 	if (starts_with(file_name, "multi-pack-index") &&
-	    ends_with(file_name, ".rev"))
+	    (ends_with(file_name, ".bitmap") || ends_with(file_name, ".rev")))
 		return;
 	if (ends_with(file_name, ".idx") ||
 	    ends_with(file_name, ".rev") ||
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 14/24] pack-bitmap: write multi-pack bitmaps
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (12 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 13/24] pack-bitmap: read multi-pack bitmaps Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-24 23:45     ` Ævar Arnfjörð Bjarmason
  2021-07-21 12:09     ` Jeff King
  2021-06-21 22:25   ` [PATCH v2 15/24] t5310: move some tests to lib-bitmap.sh Taylor Blau
                     ` (10 subsequent siblings)
  24 siblings, 2 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Write multi-pack bitmaps in the format described by
Documentation/technical/bitmap-format.txt, inferring their presence with
the absence of '--bitmap'.

To write a multi-pack bitmap, this patch attempts to reuse as much of
the existing machinery from pack-objects as possible. Specifically, the
MIDX code prepares a packing_data struct that pretends as if a single
packfile has been generated containing all of the objects contained
within the MIDX.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-multi-pack-index.txt |  12 +-
 builtin/multi-pack-index.c             |   2 +
 midx.c                                 | 230 ++++++++++++++++++++++++-
 midx.h                                 |   1 +
 4 files changed, 236 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index ffd601bc17..ada14deb2c 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -10,7 +10,7 @@ SYNOPSIS
 --------
 [verse]
 'git multi-pack-index' [--object-dir=<dir>] [--[no-]progress]
-	[--preferred-pack=<pack>] <subcommand>
+	[--preferred-pack=<pack>] [--[no-]bitmap] <subcommand>
 
 DESCRIPTION
 -----------
@@ -40,6 +40,9 @@ write::
 		multiple packs contain the same object. If not given,
 		ties are broken in favor of the pack with the lowest
 		mtime.
+
+	--[no-]bitmap::
+		Control whether or not a multi-pack bitmap is written.
 --
 
 verify::
@@ -81,6 +84,13 @@ EXAMPLES
 $ git multi-pack-index write
 -----------------------------------------------
 
+* Write a MIDX file for the packfiles in the current .git folder with a
+corresponding bitmap.
++
+-------------------------------------------------------------
+$ git multi-pack-index write --preferred-pack <pack> --bitmap
+-------------------------------------------------------------
+
 * Write a MIDX file for the packfiles in an alternate object store.
 +
 -----------------------------------------------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 5d3ea445fd..bf6fa982e3 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -68,6 +68,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 		OPT_STRING(0, "preferred-pack", &opts.preferred_pack,
 			   N_("preferred-pack"),
 			   N_("pack for reuse when computing a multi-pack bitmap")),
+		OPT_BIT(0, "bitmap", &opts.flags, N_("write multi-pack bitmap"),
+			MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX),
 		OPT_END(),
 	};
 
diff --git a/midx.c b/midx.c
index 752d36c57f..a58cca707b 100644
--- a/midx.c
+++ b/midx.c
@@ -13,6 +13,10 @@
 #include "repository.h"
 #include "chunk-format.h"
 #include "pack.h"
+#include "pack-bitmap.h"
+#include "refs.h"
+#include "revision.h"
+#include "list-objects.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -885,6 +889,172 @@ static void write_midx_reverse_index(char *midx_name, unsigned char *midx_hash,
 static void clear_midx_files_ext(struct repository *r, const char *ext,
 				 unsigned char *keep_hash);
 
+static void prepare_midx_packing_data(struct packing_data *pdata,
+				      struct write_midx_context *ctx)
+{
+	uint32_t i;
+
+	memset(pdata, 0, sizeof(struct packing_data));
+	prepare_packing_data(the_repository, pdata);
+
+	for (i = 0; i < ctx->entries_nr; i++) {
+		struct pack_midx_entry *from = &ctx->entries[ctx->pack_order[i]];
+		struct object_entry *to = packlist_alloc(pdata, &from->oid);
+
+		oe_set_in_pack(pdata, to,
+			       ctx->info[ctx->pack_perm[from->pack_int_id]].p);
+	}
+}
+
+static int add_ref_to_pending(const char *refname,
+			      const struct object_id *oid,
+			      int flag, void *cb_data)
+{
+	struct rev_info *revs = (struct rev_info*)cb_data;
+	struct object *object;
+
+	if ((flag & REF_ISSYMREF) && (flag & REF_ISBROKEN)) {
+		warning("symbolic ref is dangling: %s", refname);
+		return 0;
+	}
+
+	object = parse_object_or_die(oid, refname);
+	if (object->type != OBJ_COMMIT)
+		return 0;
+
+	add_pending_object(revs, object, "");
+	if (bitmap_is_preferred_refname(revs->repo, refname))
+		object->flags |= NEEDS_BITMAP;
+	return 0;
+}
+
+struct bitmap_commit_cb {
+	struct commit **commits;
+	size_t commits_nr, commits_alloc;
+
+	struct write_midx_context *ctx;
+};
+
+static const struct object_id *bitmap_oid_access(size_t index,
+						 const void *_entries)
+{
+	const struct pack_midx_entry *entries = _entries;
+	return &entries[index].oid;
+}
+
+static void bitmap_show_commit(struct commit *commit, void *_data)
+{
+	struct bitmap_commit_cb *data = _data;
+	if (oid_pos(&commit->object.oid, data->ctx->entries,
+		    data->ctx->entries_nr,
+		    bitmap_oid_access) > -1) {
+		ALLOC_GROW(data->commits, data->commits_nr + 1,
+			   data->commits_alloc);
+		data->commits[data->commits_nr++] = commit;
+	}
+}
+
+static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr_p,
+						    struct write_midx_context *ctx)
+{
+	struct rev_info revs;
+	struct bitmap_commit_cb cb;
+
+	memset(&cb, 0, sizeof(struct bitmap_commit_cb));
+	cb.ctx = ctx;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+	for_each_ref(add_ref_to_pending, &revs);
+
+	/*
+	 * Skipping promisor objects here is intentional, since it only excludes
+	 * them from the list of reachable commits that we want to select from
+	 * when computing the selection of MIDX'd commits to receive bitmaps.
+	 *
+	 * Reachability bitmaps do require that their objects be closed under
+	 * reachability, but fetching any objects missing from promisors at this
+	 * point is too late. But, if one of those objects can be reached from
+	 * an another object that is included in the bitmap, then we will
+	 * complain later that we don't have reachability closure (and fail
+	 * appropriately).
+	 */
+	fetch_if_missing = 0;
+	revs.exclude_promisor_objects = 1;
+
+	/*
+	 * Pass selected commits in topo order to match the behavior of
+	 * pack-bitmaps when configured with delta islands.
+	 */
+	revs.topo_order = 1;
+	revs.sort_order = REV_SORT_IN_GRAPH_ORDER;
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+
+	traverse_commit_list(&revs, bitmap_show_commit, NULL, &cb);
+	if (indexed_commits_nr_p)
+		*indexed_commits_nr_p = cb.commits_nr;
+
+	return cb.commits;
+}
+
+static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
+			     struct write_midx_context *ctx,
+			     unsigned flags)
+{
+	struct packing_data pdata;
+	struct pack_idx_entry **index;
+	struct commit **commits = NULL;
+	uint32_t i, commits_nr;
+	char *bitmap_name = xstrfmt("%s-%s.bitmap", midx_name, hash_to_hex(midx_hash));
+	int ret;
+
+	prepare_midx_packing_data(&pdata, ctx);
+
+	commits = find_commits_for_midx_bitmap(&commits_nr, ctx);
+
+	/*
+	 * Build the MIDX-order index based on pdata.objects (which is already
+	 * in MIDX order; c.f., 'midx_pack_order_cmp()' for the definition of
+	 * this order).
+	 */
+	ALLOC_ARRAY(index, pdata.nr_objects);
+	for (i = 0; i < pdata.nr_objects; i++)
+		index[i] = (struct pack_idx_entry *)&pdata.objects[i];
+
+	bitmap_writer_show_progress(flags & MIDX_PROGRESS);
+	bitmap_writer_build_type_index(&pdata, index, pdata.nr_objects);
+
+	/*
+	 * bitmap_writer_finish expects objects in lex order, but pack_order
+	 * gives us exactly that. use it directly instead of re-sorting the
+	 * array.
+	 *
+	 * This changes the order of objects in 'index' between
+	 * bitmap_writer_build_type_index and bitmap_writer_finish.
+	 *
+	 * The same re-ordering takes place in the single-pack bitmap code via
+	 * write_idx_file(), which is called by finish_tmp_packfile(), which
+	 * happens between bitmap_writer_build_type_index() and
+	 * bitmap_writer_finish().
+	 */
+	for (i = 0; i < pdata.nr_objects; i++)
+		index[ctx->pack_order[i]] = (struct pack_idx_entry *)&pdata.objects[i];
+
+	bitmap_writer_select_commits(commits, commits_nr, -1);
+	ret = bitmap_writer_build(&pdata);
+	if (ret < 0)
+		goto cleanup;
+
+	bitmap_writer_set_checksum(midx_hash);
+	bitmap_writer_finish(index, pdata.nr_objects, bitmap_name, 0);
+
+cleanup:
+	free(index);
+	free(bitmap_name);
+	return ret;
+}
+
 static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
 			       struct string_list *packs_to_drop,
 			       const char *preferred_pack_name,
@@ -930,9 +1100,16 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		for (i = 0; i < ctx.m->num_packs; i++) {
 			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
 
+			if (prepare_midx_pack(the_repository, ctx.m, i)) {
+				error(_("could not load pack %s"),
+				      ctx.m->pack_names[i]);
+				result = 1;
+				goto cleanup;
+			}
+
 			ctx.info[ctx.nr].orig_pack_int_id = i;
 			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
-			ctx.info[ctx.nr].p = NULL;
+			ctx.info[ctx.nr].p = ctx.m->packs[i];
 			ctx.info[ctx.nr].expired = 0;
 			ctx.nr++;
 		}
@@ -947,8 +1124,26 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
 	stop_progress(&ctx.progress);
 
-	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
-		goto cleanup;
+	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop) {
+		struct bitmap_index *bitmap_git;
+		int bitmap_exists;
+		int want_bitmap = flags & MIDX_WRITE_BITMAP;
+
+		bitmap_git = prepare_bitmap_git(the_repository);
+		bitmap_exists = bitmap_git && bitmap_is_midx(bitmap_git);
+		free_bitmap_index(bitmap_git);
+
+		if (bitmap_exists || !want_bitmap) {
+			/*
+			 * The correct MIDX already exists, and so does a
+			 * corresponding bitmap (or one wasn't requested).
+			 */
+			if (!want_bitmap)
+				clear_midx_files_ext(the_repository, ".bitmap",
+						     NULL);
+			goto cleanup;
+		}
+	}
 
 	if (preferred_pack_name) {
 		int found = 0;
@@ -964,7 +1159,8 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		if (!found)
 			warning(_("unknown preferred pack: '%s'"),
 				preferred_pack_name);
-	} else if (ctx.nr && (flags & MIDX_WRITE_REV_INDEX)) {
+	} else if (ctx.nr &&
+		   (flags & (MIDX_WRITE_REV_INDEX | MIDX_WRITE_BITMAP))) {
 		time_t oldest = ctx.info[0].p->mtime;
 		ctx.preferred_pack_idx = 0;
 
@@ -1075,9 +1271,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(get_lock_file_fd(&lk), get_lock_file_path(&lk));
 
-	if (ctx.m)
-		close_midx(ctx.m);
-
 	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
 		result = 1;
@@ -1108,14 +1301,22 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	finalize_hashfile(f, midx_hash, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	free_chunkfile(cf);
 
-	if (flags & MIDX_WRITE_REV_INDEX)
+	if (flags & (MIDX_WRITE_REV_INDEX | MIDX_WRITE_BITMAP))
 		ctx.pack_order = midx_pack_order(&ctx);
 
 	if (flags & MIDX_WRITE_REV_INDEX)
 		write_midx_reverse_index(midx_name, midx_hash, &ctx);
+	if (flags & MIDX_WRITE_BITMAP) {
+		if (write_midx_bitmap(midx_name, midx_hash, &ctx, flags) < 0) {
+			error(_("could not write multi-pack bitmap"));
+			result = 1;
+			goto cleanup;
+		}
+	}
 
 	commit_lock_file(&lk);
 
+	clear_midx_files_ext(the_repository, ".bitmap", midx_hash);
 	clear_midx_files_ext(the_repository, ".rev", midx_hash);
 
 cleanup:
@@ -1123,6 +1324,15 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		if (ctx.info[i].p) {
 			close_pack(ctx.info[i].p);
 			free(ctx.info[i].p);
+			if (ctx.m) {
+				/*
+				 * Destroy a stale reference to the pack in
+				 * 'ctx.m'.
+				 */
+				uint32_t orig = ctx.info[i].orig_pack_int_id;
+				if (orig < ctx.m->num_packs)
+					ctx.m->packs[orig] = NULL;
+			}
 		}
 		free(ctx.info[i].pack_name);
 	}
@@ -1132,6 +1342,9 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	free(ctx.pack_perm);
 	free(ctx.pack_order);
 	free(midx_name);
+	if (ctx.m)
+		close_midx(ctx.m);
+
 	return result;
 }
 
@@ -1193,6 +1406,7 @@ void clear_midx_file(struct repository *r)
 	if (remove_path(midx))
 		die(_("failed to clear multi-pack-index at %s"), midx);
 
+	clear_midx_files_ext(r, ".bitmap", NULL);
 	clear_midx_files_ext(r, ".rev", NULL);
 
 	free(midx);
diff --git a/midx.h b/midx.h
index 1172df1a71..350f4d0a7b 100644
--- a/midx.h
+++ b/midx.h
@@ -41,6 +41,7 @@ struct multi_pack_index {
 
 #define MIDX_PROGRESS     (1 << 0)
 #define MIDX_WRITE_REV_INDEX (1 << 1)
+#define MIDX_WRITE_BITMAP (1 << 2)
 
 const unsigned char *get_midx_checksum(struct multi_pack_index *m);
 char *get_midx_filename(const char *object_dir);
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 15/24] t5310: move some tests to lib-bitmap.sh
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (13 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 14/24] pack-bitmap: write " Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-21 22:25   ` [PATCH v2 16/24] t/helper/test-read-midx.c: add --checksum mode Taylor Blau
                     ` (9 subsequent siblings)
  24 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

We'll soon be adding a test script that will cover many of the same
bitmap concepts as t5310, but for MIDX bitmaps. Let's pull out as many
of the applicable tests as we can so we don't have to rewrite them.

There should be no functional change to t5310; we still run the same
operations in the same order.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/lib-bitmap.sh         | 236 ++++++++++++++++++++++++++++++++++++++++
 t/t5310-pack-bitmaps.sh | 227 +-------------------------------------
 2 files changed, 240 insertions(+), 223 deletions(-)

diff --git a/t/lib-bitmap.sh b/t/lib-bitmap.sh
index fe3f98be24..ecb5d0e05d 100644
--- a/t/lib-bitmap.sh
+++ b/t/lib-bitmap.sh
@@ -1,3 +1,6 @@
+# Helpers for scripts testing bitamp functionality; see t5310 for
+# example usage.
+
 # Compare a file containing rev-list bitmap traversal output to its non-bitmap
 # counterpart. You can't just use test_cmp for this, because the two produce
 # subtly different output:
@@ -24,3 +27,236 @@ test_bitmap_traversal () {
 	test_cmp "$1.normalized" "$2.normalized" &&
 	rm -f "$1.normalized" "$2.normalized"
 }
+
+# To ensure the logic for "maximal commits" is exercised, make
+# the repository a bit more complicated.
+#
+#    other                         second
+#      *                             *
+# (99 commits)                  (99 commits)
+#      *                             *
+#      |\                           /|
+#      | * octo-other  octo-second * |
+#      |/|\_________  ____________/|\|
+#      | \          \/  __________/  |
+#      |  | ________/\ /             |
+#      *  |/          * merge-right  *
+#      | _|__________/ \____________ |
+#      |/ |                         \|
+# (l1) *  * merge-left               * (r1)
+#      | / \________________________ |
+#      |/                           \|
+# (l2) *                             * (r2)
+#       \___________________________ |
+#                                   \|
+#                                    * (base)
+#
+# We only push bits down the first-parent history, which
+# makes some of these commits unimportant!
+#
+# The important part for the maximal commit algorithm is how
+# the bitmasks are extended. Assuming starting bit positions
+# for second (bit 0) and other (bit 1), the bitmasks at the
+# end should be:
+#
+#      second: 1       (maximal, selected)
+#       other: 01      (maximal, selected)
+#      (base): 11 (maximal)
+#
+# This complicated history was important for a previous
+# version of the walk that guarantees never walking a
+# commit multiple times. That goal might be important
+# again, so preserve this complicated case. For now, this
+# test will guarantee that the bitmaps are computed
+# correctly, even with the repeat calculations.
+setup_bitmap_history() {
+	test_expect_success 'setup repo with moderate-sized history' '
+		test_commit_bulk --id=file 10 &&
+		git branch -M second &&
+		git checkout -b other HEAD~5 &&
+		test_commit_bulk --id=side 10 &&
+
+		# add complicated history setup, including merges and
+		# ambiguous merge-bases
+
+		git checkout -b merge-left other~2 &&
+		git merge second~2 -m "merge-left" &&
+
+		git checkout -b merge-right second~1 &&
+		git merge other~1 -m "merge-right" &&
+
+		git checkout -b octo-second second &&
+		git merge merge-left merge-right -m "octopus-second" &&
+
+		git checkout -b octo-other other &&
+		git merge merge-left merge-right -m "octopus-other" &&
+
+		git checkout other &&
+		git merge octo-other -m "pull octopus" &&
+
+		git checkout second &&
+		git merge octo-second -m "pull octopus" &&
+
+		# Remove these branches so they are not selected
+		# as bitmap tips
+		git branch -D merge-left &&
+		git branch -D merge-right &&
+		git branch -D octo-other &&
+		git branch -D octo-second &&
+
+		# add padding to make these merges less interesting
+		# and avoid having them selected for bitmaps
+		test_commit_bulk --id=file 100 &&
+		git checkout other &&
+		test_commit_bulk --id=side 100 &&
+		git checkout second &&
+
+		bitmaptip=$(git rev-parse second) &&
+		blob=$(echo tagged-blob | git hash-object -w --stdin) &&
+		git tag tagged-blob $blob
+	'
+}
+
+rev_list_tests_head () {
+	test_expect_success "counting commits via bitmap ($state, $branch)" '
+		git rev-list --count $branch >expect &&
+		git rev-list --use-bitmap-index --count $branch >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "counting partial commits via bitmap ($state, $branch)" '
+		git rev-list --count $branch~5..$branch >expect &&
+		git rev-list --use-bitmap-index --count $branch~5..$branch >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "counting commits with limit ($state, $branch)" '
+		git rev-list --count -n 1 $branch >expect &&
+		git rev-list --use-bitmap-index --count -n 1 $branch >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "counting non-linear history ($state, $branch)" '
+		git rev-list --count other...second >expect &&
+		git rev-list --use-bitmap-index --count other...second >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "counting commits with limiting ($state, $branch)" '
+		git rev-list --count $branch -- 1.t >expect &&
+		git rev-list --use-bitmap-index --count $branch -- 1.t >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "counting objects via bitmap ($state, $branch)" '
+		git rev-list --count --objects $branch >expect &&
+		git rev-list --use-bitmap-index --count --objects $branch >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "enumerate commits ($state, $branch)" '
+		git rev-list --use-bitmap-index $branch >actual &&
+		git rev-list $branch >expect &&
+		test_bitmap_traversal --no-confirm-bitmaps expect actual
+	'
+
+	test_expect_success "enumerate --objects ($state, $branch)" '
+		git rev-list --objects --use-bitmap-index $branch >actual &&
+		git rev-list --objects $branch >expect &&
+		test_bitmap_traversal expect actual
+	'
+
+	test_expect_success "bitmap --objects handles non-commit objects ($state, $branch)" '
+		git rev-list --objects --use-bitmap-index $branch tagged-blob >actual &&
+		grep $blob actual
+	'
+}
+
+rev_list_tests () {
+	state=$1
+
+	for branch in "second" "other"
+	do
+		rev_list_tests_head
+	done
+}
+
+basic_bitmap_tests () {
+	tip="$1"
+	test_expect_success 'rev-list --test-bitmap verifies bitmaps' "
+		git rev-list --test-bitmap "${tip:-HEAD}"
+	"
+
+	rev_list_tests 'full bitmap'
+
+	test_expect_success 'clone from bitmapped repository' '
+		rm -fr clone.git &&
+		git clone --no-local --bare . clone.git &&
+		git rev-parse HEAD >expect &&
+		git --git-dir=clone.git rev-parse HEAD >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success 'partial clone from bitmapped repository' '
+		test_config uploadpack.allowfilter true &&
+		rm -fr partial-clone.git &&
+		git clone --no-local --bare --filter=blob:none . partial-clone.git &&
+		(
+			cd partial-clone.git &&
+			pack=$(echo objects/pack/*.pack) &&
+			git verify-pack -v "$pack" >have &&
+			awk "/blob/ { print \$1 }" <have >blobs &&
+			# we expect this single blob because of the direct ref
+			git rev-parse refs/tags/tagged-blob >expect &&
+			test_cmp expect blobs
+		)
+	'
+
+	test_expect_success 'setup further non-bitmapped commits' '
+		test_commit_bulk --id=further 10
+	'
+
+	rev_list_tests 'partial bitmap'
+
+	test_expect_success 'fetch (partial bitmap)' '
+		git --git-dir=clone.git fetch origin second:second &&
+		git rev-parse HEAD >expect &&
+		git --git-dir=clone.git rev-parse HEAD >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success 'enumerating progress counts pack-reused objects' '
+		count=$(git rev-list --objects --all --count) &&
+		git repack -adb &&
+
+		# check first with only reused objects; confirm that our
+		# progress showed the right number, and also that we did
+		# pack-reuse as expected.  Check only the final "done"
+		# line of the meter (there may be an arbitrary number of
+		# intermediate lines ending with CR).
+		GIT_PROGRESS_DELAY=0 \
+			git pack-objects --all --stdout --progress \
+			</dev/null >/dev/null 2>stderr &&
+		grep "Enumerating objects: $count, done" stderr &&
+		grep "pack-reused $count" stderr &&
+
+		# now the same but with one non-reused object
+		git commit --allow-empty -m "an extra commit object" &&
+		GIT_PROGRESS_DELAY=0 \
+			git pack-objects --all --stdout --progress \
+			</dev/null >/dev/null 2>stderr &&
+		grep "Enumerating objects: $((count+1)), done" stderr &&
+		grep "pack-reused $count" stderr
+	'
+}
+
+# have_delta <obj> <expected_base>
+#
+# Note that because this relies on cat-file, it might find _any_ copy of an
+# object in the repository. The caller is responsible for making sure
+# there's only one (e.g., via "repack -ad", or having just fetched a copy).
+have_delta () {
+	echo $2 >expect &&
+	echo $1 | git cat-file --batch-check="%(deltabase)" >actual &&
+	test_cmp expect actual
+}
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index b02838750e..4318f84d53 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -25,93 +25,10 @@ has_any () {
 	grep -Ff "$1" "$2"
 }
 
-# To ensure the logic for "maximal commits" is exercised, make
-# the repository a bit more complicated.
-#
-#    other                         second
-#      *                             *
-# (99 commits)                  (99 commits)
-#      *                             *
-#      |\                           /|
-#      | * octo-other  octo-second * |
-#      |/|\_________  ____________/|\|
-#      | \          \/  __________/  |
-#      |  | ________/\ /             |
-#      *  |/          * merge-right  *
-#      | _|__________/ \____________ |
-#      |/ |                         \|
-# (l1) *  * merge-left               * (r1)
-#      | / \________________________ |
-#      |/                           \|
-# (l2) *                             * (r2)
-#       \___________________________ |
-#                                   \|
-#                                    * (base)
-#
-# We only push bits down the first-parent history, which
-# makes some of these commits unimportant!
-#
-# The important part for the maximal commit algorithm is how
-# the bitmasks are extended. Assuming starting bit positions
-# for second (bit 0) and other (bit 1), the bitmasks at the
-# end should be:
-#
-#      second: 1       (maximal, selected)
-#       other: 01      (maximal, selected)
-#      (base): 11 (maximal)
-#
-# This complicated history was important for a previous
-# version of the walk that guarantees never walking a
-# commit multiple times. That goal might be important
-# again, so preserve this complicated case. For now, this
-# test will guarantee that the bitmaps are computed
-# correctly, even with the repeat calculations.
+setup_bitmap_history
 
-test_expect_success 'setup repo with moderate-sized history' '
-	test_commit_bulk --id=file 10 &&
-	git branch -M second &&
-	git checkout -b other HEAD~5 &&
-	test_commit_bulk --id=side 10 &&
-
-	# add complicated history setup, including merges and
-	# ambiguous merge-bases
-
-	git checkout -b merge-left other~2 &&
-	git merge second~2 -m "merge-left" &&
-
-	git checkout -b merge-right second~1 &&
-	git merge other~1 -m "merge-right" &&
-
-	git checkout -b octo-second second &&
-	git merge merge-left merge-right -m "octopus-second" &&
-
-	git checkout -b octo-other other &&
-	git merge merge-left merge-right -m "octopus-other" &&
-
-	git checkout other &&
-	git merge octo-other -m "pull octopus" &&
-
-	git checkout second &&
-	git merge octo-second -m "pull octopus" &&
-
-	# Remove these branches so they are not selected
-	# as bitmap tips
-	git branch -D merge-left &&
-	git branch -D merge-right &&
-	git branch -D octo-other &&
-	git branch -D octo-second &&
-
-	# add padding to make these merges less interesting
-	# and avoid having them selected for bitmaps
-	test_commit_bulk --id=file 100 &&
-	git checkout other &&
-	test_commit_bulk --id=side 100 &&
-	git checkout second &&
-
-	bitmaptip=$(git rev-parse second) &&
-	blob=$(echo tagged-blob | git hash-object -w --stdin) &&
-	git tag tagged-blob $blob &&
-	git config repack.writebitmaps true
+test_expect_success 'setup writing bitmaps during repack' '
+	git config repack.writeBitmaps true
 '
 
 test_expect_success 'full repack creates bitmaps' '
@@ -123,109 +40,7 @@ test_expect_success 'full repack creates bitmaps' '
 	grep "\"key\":\"num_maximal_commits\",\"value\":\"107\"" trace
 '
 
-test_expect_success 'rev-list --test-bitmap verifies bitmaps' '
-	git rev-list --test-bitmap HEAD
-'
-
-rev_list_tests_head () {
-	test_expect_success "counting commits via bitmap ($state, $branch)" '
-		git rev-list --count $branch >expect &&
-		git rev-list --use-bitmap-index --count $branch >actual &&
-		test_cmp expect actual
-	'
-
-	test_expect_success "counting partial commits via bitmap ($state, $branch)" '
-		git rev-list --count $branch~5..$branch >expect &&
-		git rev-list --use-bitmap-index --count $branch~5..$branch >actual &&
-		test_cmp expect actual
-	'
-
-	test_expect_success "counting commits with limit ($state, $branch)" '
-		git rev-list --count -n 1 $branch >expect &&
-		git rev-list --use-bitmap-index --count -n 1 $branch >actual &&
-		test_cmp expect actual
-	'
-
-	test_expect_success "counting non-linear history ($state, $branch)" '
-		git rev-list --count other...second >expect &&
-		git rev-list --use-bitmap-index --count other...second >actual &&
-		test_cmp expect actual
-	'
-
-	test_expect_success "counting commits with limiting ($state, $branch)" '
-		git rev-list --count $branch -- 1.t >expect &&
-		git rev-list --use-bitmap-index --count $branch -- 1.t >actual &&
-		test_cmp expect actual
-	'
-
-	test_expect_success "counting objects via bitmap ($state, $branch)" '
-		git rev-list --count --objects $branch >expect &&
-		git rev-list --use-bitmap-index --count --objects $branch >actual &&
-		test_cmp expect actual
-	'
-
-	test_expect_success "enumerate commits ($state, $branch)" '
-		git rev-list --use-bitmap-index $branch >actual &&
-		git rev-list $branch >expect &&
-		test_bitmap_traversal --no-confirm-bitmaps expect actual
-	'
-
-	test_expect_success "enumerate --objects ($state, $branch)" '
-		git rev-list --objects --use-bitmap-index $branch >actual &&
-		git rev-list --objects $branch >expect &&
-		test_bitmap_traversal expect actual
-	'
-
-	test_expect_success "bitmap --objects handles non-commit objects ($state, $branch)" '
-		git rev-list --objects --use-bitmap-index $branch tagged-blob >actual &&
-		grep $blob actual
-	'
-}
-
-rev_list_tests () {
-	state=$1
-
-	for branch in "second" "other"
-	do
-		rev_list_tests_head
-	done
-}
-
-rev_list_tests 'full bitmap'
-
-test_expect_success 'clone from bitmapped repository' '
-	git clone --no-local --bare . clone.git &&
-	git rev-parse HEAD >expect &&
-	git --git-dir=clone.git rev-parse HEAD >actual &&
-	test_cmp expect actual
-'
-
-test_expect_success 'partial clone from bitmapped repository' '
-	test_config uploadpack.allowfilter true &&
-	git clone --no-local --bare --filter=blob:none . partial-clone.git &&
-	(
-		cd partial-clone.git &&
-		pack=$(echo objects/pack/*.pack) &&
-		git verify-pack -v "$pack" >have &&
-		awk "/blob/ { print \$1 }" <have >blobs &&
-		# we expect this single blob because of the direct ref
-		git rev-parse refs/tags/tagged-blob >expect &&
-		test_cmp expect blobs
-	)
-'
-
-test_expect_success 'setup further non-bitmapped commits' '
-	test_commit_bulk --id=further 10
-'
-
-rev_list_tests 'partial bitmap'
-
-test_expect_success 'fetch (partial bitmap)' '
-	git --git-dir=clone.git fetch origin second:second &&
-	git rev-parse HEAD >expect &&
-	git --git-dir=clone.git rev-parse HEAD >actual &&
-	test_cmp expect actual
-'
+basic_bitmap_tests
 
 test_expect_success 'incremental repack fails when bitmaps are requested' '
 	test_commit more-1 &&
@@ -461,40 +276,6 @@ test_expect_success 'truncated bitmap fails gracefully (cache)' '
 	test_i18ngrep corrupted.bitmap.index stderr
 '
 
-test_expect_success 'enumerating progress counts pack-reused objects' '
-	count=$(git rev-list --objects --all --count) &&
-	git repack -adb &&
-
-	# check first with only reused objects; confirm that our progress
-	# showed the right number, and also that we did pack-reuse as expected.
-	# Check only the final "done" line of the meter (there may be an
-	# arbitrary number of intermediate lines ending with CR).
-	GIT_PROGRESS_DELAY=0 \
-		git pack-objects --all --stdout --progress \
-		</dev/null >/dev/null 2>stderr &&
-	grep "Enumerating objects: $count, done" stderr &&
-	grep "pack-reused $count" stderr &&
-
-	# now the same but with one non-reused object
-	git commit --allow-empty -m "an extra commit object" &&
-	GIT_PROGRESS_DELAY=0 \
-		git pack-objects --all --stdout --progress \
-		</dev/null >/dev/null 2>stderr &&
-	grep "Enumerating objects: $((count+1)), done" stderr &&
-	grep "pack-reused $count" stderr
-'
-
-# have_delta <obj> <expected_base>
-#
-# Note that because this relies on cat-file, it might find _any_ copy of an
-# object in the repository. The caller is responsible for making sure
-# there's only one (e.g., via "repack -ad", or having just fetched a copy).
-have_delta () {
-	echo $2 >expect &&
-	echo $1 | git cat-file --batch-check="%(deltabase)" >actual &&
-	test_cmp expect actual
-}
-
 # Create a state of history with these properties:
 #
 #  - refs that allow a client to fetch some new history, while sharing some old
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 16/24] t/helper/test-read-midx.c: add --checksum mode
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (14 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 15/24] t5310: move some tests to lib-bitmap.sh Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-21 22:25   ` [PATCH v2 17/24] t5326: test multi-pack bitmap behavior Taylor Blau
                     ` (8 subsequent siblings)
  24 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Subsequent tests will want to check for the existence of a multi-pack
bitmap which matches the multi-pack-index stored in the pack directory.

The multi-pack bitmap includes the hex checksum of the MIDX it
corresponds to in its filename (for example,
'$packdir/multi-pack-index-<checksum>.bitmap'). As a result, some tests
want a way to learn what '<checksum>' is.

This helper addresses that need by printing the checksum of the
repository's multi-pack-index.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-read-midx.c | 16 +++++++++++++++-
 t/lib-bitmap.sh           |  4 ++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index 7c2eb11a8e..cb0d27049a 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -60,12 +60,26 @@ static int read_midx_file(const char *object_dir, int show_objects)
 	return 0;
 }
 
+static int read_midx_checksum(const char *object_dir)
+{
+	struct multi_pack_index *m;
+
+	setup_git_directory();
+	m = load_multi_pack_index(object_dir, 1);
+	if (!m)
+		return 1;
+	printf("%s\n", hash_to_hex(get_midx_checksum(m)));
+	return 0;
+}
+
 int cmd__read_midx(int argc, const char **argv)
 {
 	if (!(argc == 2 || argc == 3))
-		usage("read-midx [--show-objects] <object-dir>");
+		usage("read-midx [--show-objects|--checksum] <object-dir>");
 
 	if (!strcmp(argv[1], "--show-objects"))
 		return read_midx_file(argv[2], 1);
+	else if (!strcmp(argv[1], "--checksum"))
+		return read_midx_checksum(argv[2]);
 	return read_midx_file(argv[1], 0);
 }
diff --git a/t/lib-bitmap.sh b/t/lib-bitmap.sh
index ecb5d0e05d..09cd036f4d 100644
--- a/t/lib-bitmap.sh
+++ b/t/lib-bitmap.sh
@@ -260,3 +260,7 @@ have_delta () {
 	echo $1 | git cat-file --batch-check="%(deltabase)" >actual &&
 	test_cmp expect actual
 }
+
+midx_checksum () {
+	test-tool read-midx --checksum "${1:-.git/objects}"
+}
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 17/24] t5326: test multi-pack bitmap behavior
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (15 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 16/24] t/helper/test-read-midx.c: add --checksum mode Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-21 22:25   ` [PATCH v2 18/24] t0410: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP Taylor Blau
                     ` (7 subsequent siblings)
  24 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

This patch introduces a new test, t5326, which tests the basic
functionality of multi-pack bitmaps.

Some trivial behavior is tested, such as:

  - Whether bitmaps can be generated with more than one pack.
  - Whether clones can be served with all objects in the bitmap.
  - Whether follow-up fetches can be served with some objects outside of
    the server's bitmap

These use lib-bitmap's tests (which in turn were pulled from t5310), and
we cover cases where the MIDX represents both a single pack and multiple
packs.

In addition, some non-trivial and MIDX-specific behavior is tested, too,
including:

  - Whether multi-pack bitmaps behave correctly with respect to the
    pack-reuse machinery when the base for some object is selected from
    a different pack than the delta.
  - Whether multi-pack bitmaps correctly respect the
    pack.preferBitmapTips configuration.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t5326-multi-pack-bitmaps.sh | 277 ++++++++++++++++++++++++++++++++++
 1 file changed, 277 insertions(+)
 create mode 100755 t/t5326-multi-pack-bitmaps.sh

diff --git a/t/t5326-multi-pack-bitmaps.sh b/t/t5326-multi-pack-bitmaps.sh
new file mode 100755
index 0000000000..c1b7d633e2
--- /dev/null
+++ b/t/t5326-multi-pack-bitmaps.sh
@@ -0,0 +1,277 @@
+#!/bin/sh
+
+test_description='exercise basic multi-pack bitmap functionality'
+. ./test-lib.sh
+. "${TEST_DIRECTORY}/lib-bitmap.sh"
+
+# We'll be writing our own midx and bitmaps, so avoid getting confused by the
+# automatic ones.
+GIT_TEST_MULTI_PACK_INDEX=0
+GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0
+
+objdir=.git/objects
+midx=$objdir/pack/multi-pack-index
+
+# midx_pack_source <obj>
+midx_pack_source () {
+	test-tool read-midx --show-objects .git/objects | grep "^$1 " | cut -f2
+}
+
+setup_bitmap_history
+
+test_expect_success 'enable core.multiPackIndex' '
+	git config core.multiPackIndex true
+'
+
+test_expect_success 'create single-pack midx with bitmaps' '
+	git repack -ad &&
+	git multi-pack-index write --bitmap &&
+	test_path_is_file $midx &&
+	test_path_is_file $midx-$(midx_checksum $objdir).bitmap
+'
+
+basic_bitmap_tests
+
+test_expect_success 'create new additional packs' '
+	for i in $(test_seq 1 16)
+	do
+		test_commit "$i" &&
+		git repack -d
+	done &&
+
+	git checkout -b other2 HEAD~8 &&
+	for i in $(test_seq 1 8)
+	do
+		test_commit "side-$i" &&
+		git repack -d
+	done &&
+	git checkout second
+'
+
+test_expect_success 'create multi-pack midx with bitmaps' '
+	git multi-pack-index write --bitmap &&
+
+	ls $objdir/pack/pack-*.pack >packs &&
+	test_line_count = 25 packs &&
+
+	test_path_is_file $midx &&
+	test_path_is_file $midx-$(midx_checksum $objdir).bitmap
+'
+
+basic_bitmap_tests
+
+test_expect_success '--no-bitmap is respected when bitmaps exist' '
+	git multi-pack-index write --bitmap &&
+
+	test_commit respect--no-bitmap &&
+	GIT_TEST_MULTI_PACK_INDEX=0 git repack -d &&
+
+	test_path_is_file $midx &&
+	test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+
+	git multi-pack-index write --no-bitmap &&
+
+	test_path_is_file $midx &&
+	test_path_is_missing $midx-$(midx_checksum $objdir).bitmap
+'
+
+test_expect_success 'setup midx with base from later pack' '
+	# Write a and b so that "a" is a delta on top of base "b", since Git
+	# prefers to delete contents out of a base rather than add to a shorter
+	# object.
+	test_seq 1 128 >a &&
+	test_seq 1 130 >b &&
+
+	git add a b &&
+	git commit -m "initial commit" &&
+
+	a=$(git rev-parse HEAD:a) &&
+	b=$(git rev-parse HEAD:b) &&
+
+	# In the first pack, "a" is stored as a delta to "b".
+	p1=$(git pack-objects .git/objects/pack/pack <<-EOF
+	$a
+	$b
+	EOF
+	) &&
+
+	# In the second pack, "a" is missing, and "b" is not a delta nor base to
+	# any other object.
+	p2=$(git pack-objects .git/objects/pack/pack <<-EOF
+	$b
+	$(git rev-parse HEAD)
+	$(git rev-parse HEAD^{tree})
+	EOF
+	) &&
+
+	git prune-packed &&
+	# Use the second pack as the preferred source, so that "b" occurs
+	# earlier in the MIDX object order, rendering "a" unusable for pack
+	# reuse.
+	git multi-pack-index write --bitmap --preferred-pack=pack-$p2.idx &&
+
+	have_delta $a $b &&
+	test $(midx_pack_source $a) != $(midx_pack_source $b)
+'
+
+rev_list_tests 'full bitmap with backwards delta'
+
+test_expect_success 'clone with bitmaps enabled' '
+	git clone --no-local --bare . clone-reverse-delta.git &&
+	test_when_finished "rm -fr clone-reverse-delta.git" &&
+
+	git rev-parse HEAD >expect &&
+	git --git-dir=clone-reverse-delta.git rev-parse HEAD >actual &&
+	test_cmp expect actual
+'
+
+bitmap_reuse_tests() {
+	from=$1
+	to=$2
+
+	test_expect_success "setup pack reuse tests ($from -> $to)" '
+		rm -fr repo &&
+		git init repo &&
+		(
+			cd repo &&
+			test_commit_bulk 16 &&
+			git tag old-tip &&
+
+			git config core.multiPackIndex true &&
+			if test "MIDX" = "$from"
+			then
+				GIT_TEST_MULTI_PACK_INDEX=0 git repack -Ad &&
+				git multi-pack-index write --bitmap
+			else
+				GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb
+			fi
+		)
+	'
+
+	test_expect_success "build bitmap from existing ($from -> $to)" '
+		(
+			cd repo &&
+			test_commit_bulk --id=further 16 &&
+			git tag new-tip &&
+
+			if test "MIDX" = "$to"
+			then
+				GIT_TEST_MULTI_PACK_INDEX=0 git repack -d &&
+				git multi-pack-index write --bitmap
+			else
+				GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb
+			fi
+		)
+	'
+
+	test_expect_success "verify resulting bitmaps ($from -> $to)" '
+		(
+			cd repo &&
+			git for-each-ref &&
+			git rev-list --test-bitmap refs/tags/old-tip &&
+			git rev-list --test-bitmap refs/tags/new-tip
+		)
+	'
+}
+
+bitmap_reuse_tests 'pack' 'MIDX'
+bitmap_reuse_tests 'MIDX' 'pack'
+bitmap_reuse_tests 'MIDX' 'MIDX'
+
+test_expect_success 'missing object closure fails gracefully' '
+	rm -fr repo &&
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit loose &&
+		test_commit packed &&
+
+		# Do not pass "--revs"; we want a pack without the "loose"
+		# commit.
+		git pack-objects $objdir/pack/pack <<-EOF &&
+		$(git rev-parse packed)
+		EOF
+
+		test_must_fail git multi-pack-index write --bitmap 2>err &&
+		grep "doesn.t have full closure" err &&
+		test_path_is_missing $midx
+	)
+'
+
+test_expect_success 'setup partial bitmaps' '
+	test_commit packed &&
+	git repack &&
+	test_commit loose &&
+	git multi-pack-index write --bitmap 2>err &&
+	test_path_is_file $midx &&
+	test_path_is_file $midx-$(midx_checksum $objdir).bitmap
+'
+
+basic_bitmap_tests HEAD~
+
+test_expect_success 'removing a MIDX clears stale bitmaps' '
+	rm -fr repo &&
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+		test_commit base &&
+		git repack &&
+		git multi-pack-index write --bitmap &&
+
+		# Write a MIDX and bitmap; remove the MIDX but leave the bitmap.
+		stale_bitmap=$midx-$(midx_checksum $objdir).bitmap &&
+		rm $midx &&
+
+		# Then write a new MIDX.
+		test_commit new &&
+		git repack &&
+		git multi-pack-index write --bitmap &&
+
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+		test_path_is_missing $stale_bitmap
+	)
+'
+
+test_expect_success 'pack.preferBitmapTips' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit_bulk --message="%s" 103 &&
+
+		git log --format="%H" >commits.raw &&
+		sort <commits.raw >commits &&
+
+		git log --format="create refs/tags/%s %H" HEAD >refs &&
+		git update-ref --stdin <refs &&
+
+		git multi-pack-index write --bitmap &&
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+
+		test-tool bitmap list-commits | sort >bitmaps &&
+		comm -13 bitmaps commits >before &&
+		test_line_count = 1 before &&
+
+		perl -ne "printf(\"create refs/tags/include/%d \", $.); print" \
+			<before | git update-ref --stdin &&
+
+		rm -fr $midx-$(midx_checksum $objdir).bitmap &&
+		rm -fr $midx-$(midx_checksum $objdir).rev &&
+		rm -fr $midx &&
+
+		git -c pack.preferBitmapTips=refs/tags/include \
+			multi-pack-index write --bitmap &&
+		test-tool bitmap list-commits | sort >bitmaps &&
+		comm -13 bitmaps commits >after &&
+
+		! test_cmp before after
+	)
+'
+
+test_done
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 18/24] t0410: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (16 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 17/24] t5326: test multi-pack bitmap behavior Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-21 22:25   ` [PATCH v2 19/24] t5310: " Taylor Blau
                     ` (6 subsequent siblings)
  24 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

From: Jeff King <peff@peff.net>

Generating a MIDX bitmap causes tests which repack in a partial clone to
fail because they are missing objects. Missing objects is an expected
component of tests in t0410, so disable this knob altogether. Graceful
degradation when writing a bitmap with missing objects is tested in
t5326.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t0410-partial-clone.sh | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 1667450917..4fd8e83da1 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -4,6 +4,9 @@ test_description='partial clone'
 
 . ./test-lib.sh
 
+# missing promisor objects cause repacks which write bitmaps to fail
+GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0
+
 delete_object () {
 	rm $1/.git/objects/$(echo $2 | sed -e 's|^..|&/|')
 }
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 19/24] t5310: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (17 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 18/24] t0410: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-21 22:25   ` [PATCH v2 20/24] t5319: don't write MIDX bitmaps in t5319 Taylor Blau
                     ` (5 subsequent siblings)
  24 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

From: Jeff King <peff@peff.net>

Generating a MIDX bitmap confuses many of the tests in t5310, which
expect to control whether and how bitmaps are written. Since the
relevant MIDX-bitmap tests here are covered already in t5326, let's just
disable the flag for the whole t5310 script.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t5310-pack-bitmaps.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 4318f84d53..673baa5c3c 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -8,6 +8,10 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
 . "$TEST_DIRECTORY"/lib-bundle.sh
 . "$TEST_DIRECTORY"/lib-bitmap.sh
 
+# t5310 deals only with single-pack bitmaps, so don't write MIDX bitmaps in
+# their place.
+GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0
+
 objpath () {
 	echo ".git/objects/$(echo "$1" | sed -e 's|\(..\)|\1/|')"
 }
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 20/24] t5319: don't write MIDX bitmaps in t5319
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (18 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 19/24] t5310: " Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-21 22:25   ` [PATCH v2 21/24] t7700: update to work with MIDX bitmap test knob Taylor Blau
                     ` (4 subsequent siblings)
  24 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

This test is specifically about generating a midx still respecting a
pack-based bitmap file. Generating a MIDX bitmap would confuse the test.
Let's override the 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' variable to
make sure we don't do so.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t5319-multi-pack-index.sh | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 5641d158df..69f1c815aa 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -474,7 +474,8 @@ test_expect_success 'repack preserves multi-pack-index when creating packs' '
 compare_results_with_midx "after repack"
 
 test_expect_success 'multi-pack-index and pack-bitmap' '
-	git -c repack.writeBitmaps=true repack -ad &&
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -c repack.writeBitmaps=true repack -ad &&
 	git multi-pack-index write &&
 	git rev-list --test-bitmap HEAD
 '
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 21/24] t7700: update to work with MIDX bitmap test knob
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (19 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 20/24] t5319: don't write MIDX bitmaps in t5319 Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-21 22:25   ` [PATCH v2 22/24] midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' Taylor Blau
                     ` (3 subsequent siblings)
  24 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

A number of these tests are focused only on pack-based bitmaps and need
to be updated to disable 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' where
necessary.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t7700-repack.sh | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 25b235c063..98eda3bfeb 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -63,13 +63,14 @@ test_expect_success 'objects in packs marked .keep are not repacked' '
 
 test_expect_success 'writing bitmaps via command-line can duplicate .keep objects' '
 	# build on $oid, $packid, and .keep state from previous
-	git repack -Adbl &&
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 git repack -Adbl &&
 	test_has_duplicate_object true
 '
 
 test_expect_success 'writing bitmaps via config can duplicate .keep objects' '
 	# build on $oid, $packid, and .keep state from previous
-	git -c repack.writebitmaps=true repack -Adl &&
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -c repack.writebitmaps=true repack -Adl &&
 	test_has_duplicate_object true
 '
 
@@ -189,7 +190,9 @@ test_expect_success 'repack --keep-pack' '
 
 test_expect_success 'bitmaps are created by default in bare repos' '
 	git clone --bare .git bare.git &&
-	git -C bare.git repack -ad &&
+	rm -f bare.git/objects/pack/*.bitmap &&
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -C bare.git repack -ad &&
 	bitmap=$(ls bare.git/objects/pack/*.bitmap) &&
 	test_path_is_file "$bitmap"
 '
@@ -200,7 +203,8 @@ test_expect_success 'incremental repack does not complain' '
 '
 
 test_expect_success 'bitmaps can be disabled on bare repos' '
-	git -c repack.writeBitmaps=false -C bare.git repack -ad &&
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -c repack.writeBitmaps=false -C bare.git repack -ad &&
 	bitmap=$(ls bare.git/objects/pack/*.bitmap || :) &&
 	test -z "$bitmap"
 '
@@ -211,7 +215,8 @@ test_expect_success 'no bitmaps created if .keep files present' '
 	keep=${pack%.pack}.keep &&
 	test_when_finished "rm -f \"\$keep\"" &&
 	>"$keep" &&
-	git -C bare.git repack -ad 2>stderr &&
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -C bare.git repack -ad 2>stderr &&
 	test_must_be_empty stderr &&
 	find bare.git/objects/pack/ -type f -name "*.bitmap" >actual &&
 	test_must_be_empty actual
@@ -222,7 +227,8 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	blob=$(test-tool genrandom big $((1024*1024)) |
 	       git -C bare.git hash-object -w --stdin) &&
 	git -C bare.git update-ref refs/tags/big $blob &&
-	git -C bare.git repack -ad 2>stderr &&
+	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -C bare.git repack -ad 2>stderr &&
 	test_must_be_empty stderr &&
 	find bare.git/objects/pack -type f -name "*.bitmap" >actual &&
 	test_must_be_empty actual
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 22/24] midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP'
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (20 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 21/24] t7700: update to work with MIDX bitmap test knob Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-25  0:03     ` Ævar Arnfjörð Bjarmason
  2021-06-21 22:25   ` [PATCH v2 23/24] p5310: extract full and partial bitmap tests Taylor Blau
                     ` (2 subsequent siblings)
  24 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Introduce a new 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' environment
variable to also write a multi-pack bitmap when
'GIT_TEST_MULTI_PACK_INDEX' is set.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c          | 13 ++++++++++---
 ci/run-build-and-tests.sh |  1 +
 midx.h                    |  2 ++
 t/README                  |  4 ++++
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 5f9bc74adc..77f6f03057 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -515,7 +515,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (!(pack_everything & ALL_INTO_ONE) ||
 		    !is_bare_repository())
 			write_bitmaps = 0;
-	}
+	} else if (write_bitmaps &&
+		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0) &&
+		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP, 0))
+		write_bitmaps = 0;
 	if (pack_kept_objects < 0)
 		pack_kept_objects = write_bitmaps > 0;
 
@@ -725,8 +728,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		update_server_info(0);
 	remove_temporary_files();
 
-	if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0))
-		write_midx_file(get_object_directory(), NULL, 0);
+	if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0)) {
+		unsigned flags = 0;
+		if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP, 0))
+			flags |= MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX;
+		write_midx_file(get_object_directory(), NULL, flags);
+	}
 
 	string_list_clear(&names, 0);
 	string_list_clear(&rollback, 0);
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index 3ce81ffee9..7ee9ba9325 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -23,6 +23,7 @@ linux-gcc)
 	export GIT_TEST_COMMIT_GRAPH=1
 	export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1
 	export GIT_TEST_MULTI_PACK_INDEX=1
+	export GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=1
 	export GIT_TEST_ADD_I_USE_BUILTIN=1
 	export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
 	export GIT_TEST_WRITE_REV_INDEX=1
diff --git a/midx.h b/midx.h
index 350f4d0a7b..aa3da557bb 100644
--- a/midx.h
+++ b/midx.h
@@ -8,6 +8,8 @@ struct pack_entry;
 struct repository;
 
 #define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX"
+#define GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP \
+	"GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP"
 
 struct multi_pack_index {
 	struct multi_pack_index *next;
diff --git a/t/README b/t/README
index 1a2072b2c8..1311b8e17a 100644
--- a/t/README
+++ b/t/README
@@ -425,6 +425,10 @@ GIT_TEST_MULTI_PACK_INDEX=<boolean>, when true, forces the multi-pack-
 index to be written after every 'git repack' command, and overrides the
 'core.multiPackIndex' setting to true.
 
+GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=<boolean>, when true, sets the
+'--bitmap' option on all invocations of 'git multi-pack-index write',
+and ignores pack-objects' '--write-bitmap-index'.
+
 GIT_TEST_SIDEBAND_ALL=<boolean>, when true, overrides the
 'uploadpack.allowSidebandAll' setting to true, and when false, forces
 fetch-pack to not request sideband-all (even if the server advertises
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 23/24] p5310: extract full and partial bitmap tests
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (21 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 22/24] midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' Taylor Blau
@ 2021-06-21 22:25   ` Taylor Blau
  2021-06-21 22:26   ` [PATCH v2 24/24] p5326: perf tests for MIDX bitmaps Taylor Blau
  2021-06-25  9:06   ` [PATCH v2 00/24] multi-pack reachability bitmaps Ævar Arnfjörð Bjarmason
  24 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:25 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

A new p5326 introduced by the next patch will want these same tests,
interjecting its own setup in between. Move them out so that both perf
tests can reuse them.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/lib-bitmap.sh         | 69 ++++++++++++++++++++++++++++++++++++
 t/perf/p5310-pack-bitmaps.sh | 65 ++-------------------------------
 2 files changed, 72 insertions(+), 62 deletions(-)
 create mode 100644 t/perf/lib-bitmap.sh

diff --git a/t/perf/lib-bitmap.sh b/t/perf/lib-bitmap.sh
new file mode 100644
index 0000000000..63d3bc7cec
--- /dev/null
+++ b/t/perf/lib-bitmap.sh
@@ -0,0 +1,69 @@
+# Helper functions for testing bitmap performance; see p5310.
+
+test_full_bitmap () {
+	test_perf 'simulated clone' '
+		git pack-objects --stdout --all </dev/null >/dev/null
+	'
+
+	test_perf 'simulated fetch' '
+		have=$(git rev-list HEAD~100 -1) &&
+		{
+			echo HEAD &&
+			echo ^$have
+		} | git pack-objects --revs --stdout >/dev/null
+	'
+
+	test_perf 'pack to file (bitmap)' '
+		git pack-objects --use-bitmap-index --all pack1b </dev/null >/dev/null
+	'
+
+	test_perf 'rev-list (commits)' '
+		git rev-list --all --use-bitmap-index >/dev/null
+	'
+
+	test_perf 'rev-list (objects)' '
+		git rev-list --all --use-bitmap-index --objects >/dev/null
+	'
+
+	test_perf 'rev-list with tag negated via --not --all (objects)' '
+		git rev-list perf-tag --not --all --use-bitmap-index --objects >/dev/null
+	'
+
+	test_perf 'rev-list with negative tag (objects)' '
+		git rev-list HEAD --not perf-tag --use-bitmap-index --objects >/dev/null
+	'
+
+	test_perf 'rev-list count with blob:none' '
+		git rev-list --use-bitmap-index --count --objects --all \
+			--filter=blob:none >/dev/null
+	'
+
+	test_perf 'rev-list count with blob:limit=1k' '
+		git rev-list --use-bitmap-index --count --objects --all \
+			--filter=blob:limit=1k >/dev/null
+	'
+
+	test_perf 'rev-list count with tree:0' '
+		git rev-list --use-bitmap-index --count --objects --all \
+			--filter=tree:0 >/dev/null
+	'
+
+	test_perf 'simulated partial clone' '
+		git pack-objects --stdout --all --filter=blob:none </dev/null >/dev/null
+	'
+}
+
+test_partial_bitmap () {
+	test_perf 'clone (partial bitmap)' '
+		git pack-objects --stdout --all </dev/null >/dev/null
+	'
+
+	test_perf 'pack to file (partial bitmap)' '
+		git pack-objects --use-bitmap-index --all pack2b </dev/null >/dev/null
+	'
+
+	test_perf 'rev-list with tree filter (partial bitmap)' '
+		git rev-list --use-bitmap-index --count --objects --all \
+			--filter=tree:0 >/dev/null
+	'
+}
diff --git a/t/perf/p5310-pack-bitmaps.sh b/t/perf/p5310-pack-bitmaps.sh
index 452be01056..7ad4f237bc 100755
--- a/t/perf/p5310-pack-bitmaps.sh
+++ b/t/perf/p5310-pack-bitmaps.sh
@@ -2,6 +2,7 @@
 
 test_description='Tests pack performance using bitmaps'
 . ./perf-lib.sh
+. "${TEST_DIRECTORY}/perf/lib-bitmap.sh"
 
 test_perf_large_repo
 
@@ -25,56 +26,7 @@ test_perf 'repack to disk' '
 	git repack -ad
 '
 
-test_perf 'simulated clone' '
-	git pack-objects --stdout --all </dev/null >/dev/null
-'
-
-test_perf 'simulated fetch' '
-	have=$(git rev-list HEAD~100 -1) &&
-	{
-		echo HEAD &&
-		echo ^$have
-	} | git pack-objects --revs --stdout >/dev/null
-'
-
-test_perf 'pack to file (bitmap)' '
-	git pack-objects --use-bitmap-index --all pack1b </dev/null >/dev/null
-'
-
-test_perf 'rev-list (commits)' '
-	git rev-list --all --use-bitmap-index >/dev/null
-'
-
-test_perf 'rev-list (objects)' '
-	git rev-list --all --use-bitmap-index --objects >/dev/null
-'
-
-test_perf 'rev-list with tag negated via --not --all (objects)' '
-	git rev-list perf-tag --not --all --use-bitmap-index --objects >/dev/null
-'
-
-test_perf 'rev-list with negative tag (objects)' '
-	git rev-list HEAD --not perf-tag --use-bitmap-index --objects >/dev/null
-'
-
-test_perf 'rev-list count with blob:none' '
-	git rev-list --use-bitmap-index --count --objects --all \
-		--filter=blob:none >/dev/null
-'
-
-test_perf 'rev-list count with blob:limit=1k' '
-	git rev-list --use-bitmap-index --count --objects --all \
-		--filter=blob:limit=1k >/dev/null
-'
-
-test_perf 'rev-list count with tree:0' '
-	git rev-list --use-bitmap-index --count --objects --all \
-		--filter=tree:0 >/dev/null
-'
-
-test_perf 'simulated partial clone' '
-	git pack-objects --stdout --all --filter=blob:none </dev/null >/dev/null
-'
+test_full_bitmap
 
 test_expect_success 'create partial bitmap state' '
 	# pick a commit to represent the repo tip in the past
@@ -97,17 +49,6 @@ test_expect_success 'create partial bitmap state' '
 	git update-ref HEAD $orig_tip
 '
 
-test_perf 'clone (partial bitmap)' '
-	git pack-objects --stdout --all </dev/null >/dev/null
-'
-
-test_perf 'pack to file (partial bitmap)' '
-	git pack-objects --use-bitmap-index --all pack2b </dev/null >/dev/null
-'
-
-test_perf 'rev-list with tree filter (partial bitmap)' '
-	git rev-list --use-bitmap-index --count --objects --all \
-		--filter=tree:0 >/dev/null
-'
+test_partial_bitmap
 
 test_done
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v2 24/24] p5326: perf tests for MIDX bitmaps
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (22 preceding siblings ...)
  2021-06-21 22:25   ` [PATCH v2 23/24] p5310: extract full and partial bitmap tests Taylor Blau
@ 2021-06-21 22:26   ` Taylor Blau
  2021-06-25  9:06   ` [PATCH v2 00/24] multi-pack reachability bitmaps Ævar Arnfjörð Bjarmason
  24 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-21 22:26 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

These new performance tests demonstrate effectively the same behavior as
p5310, but use a multi-pack bitmap instead of a single-pack one.

Notably, p5326 does not create a MIDX bitmap with multiple packs. This
is so we can measure a direct comparison between it and p5310. Any
difference between the two is measuring just the overhead of using MIDX
bitmaps.

Here are the results of p5310 and p5326 together, measured at the same
time and on the same machine (using a Xenon W-2255 CPU):

    Test                                                  HEAD
    ------------------------------------------------------------------------
    5310.2: repack to disk                                96.78(93.39+11.33)
    5310.3: simulated clone                               9.98(9.79+0.19)
    5310.4: simulated fetch                               1.75(4.26+0.19)
    5310.5: pack to file (bitmap)                         28.20(27.87+8.70)
    5310.6: rev-list (commits)                            0.41(0.36+0.05)
    5310.7: rev-list (objects)                            1.61(1.54+0.07)
    5310.8: rev-list count with blob:none                 0.25(0.21+0.04)
    5310.9: rev-list count with blob:limit=1k             2.65(2.54+0.10)
    5310.10: rev-list count with tree:0                   0.23(0.19+0.04)
    5310.11: simulated partial clone                      4.34(4.21+0.12)
    5310.13: clone (partial bitmap)                       11.05(12.21+0.48)
    5310.14: pack to file (partial bitmap)                31.25(34.22+3.70)
    5310.15: rev-list with tree filter (partial bitmap)   0.26(0.22+0.04)

versus the same tests (this time using a multi-pack index):

    Test                                                  HEAD
    ------------------------------------------------------------------------
    5326.2: setup multi-pack index                        78.99(75.29+11.58)
    5326.3: simulated clone                               11.78(11.56+0.22)
    5326.4: simulated fetch                               1.70(4.49+0.13)
    5326.5: pack to file (bitmap)                         28.02(27.72+8.76)
    5326.6: rev-list (commits)                            0.42(0.36+0.06)
    5326.7: rev-list (objects)                            1.65(1.58+0.06)
    5326.8: rev-list count with blob:none                 0.26(0.21+0.05)
    5326.9: rev-list count with blob:limit=1k             2.97(2.86+0.10)
    5326.10: rev-list count with tree:0                   0.25(0.20+0.04)
    5326.11: simulated partial clone                      5.65(5.49+0.16)
    5326.13: clone (partial bitmap)                       12.22(13.43+0.38)
    5326.14: pack to file (partial bitmap)                30.05(31.57+7.25)
    5326.15: rev-list with tree filter (partial bitmap)   0.24(0.20+0.04)

There is slight overhead in "simulated clone", "simulated partial
clone", and "clone (partial bitmap)". Unsurprisingly, that overhead is
due to using the MIDX's reverse index to map between bit positions and
MIDX positions.

This can be reproduced by running "git repack -adb" along with "git
multi-pack-index write --bitmap" in a large-ish repository. Then run:

    $ perf record -o pack.perf git -c core.multiPackIndex=false \
      pack-objects --all --stdout >/dev/null </dev/null
    $ perf record -o midx.perf git -c core.multiPackIndex=true \
      pack-objects --all --stdout >/dev/null </dev/null

and compare the two with "perf diff -c delta -o 1 pack.perf midx.perf".
The most notable results are below (the next largest positive delta is
+0.14%):

    # Event 'cycles'
    #
    # Baseline    Delta  Shared Object       Symbol
    # ........  .......  ..................  ..........................
    #
                 +5.86%  git                 [.] nth_midxed_offset
                 +5.24%  git                 [.] nth_midxed_pack_int_id
         3.45%   +0.97%  git                 [.] offset_to_pack_pos
         3.30%   +0.57%  git                 [.] pack_pos_to_offset
                 +0.30%  git                 [.] pack_pos_to_midx

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/p5326-multi-pack-bitmaps.sh | 43 ++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
 create mode 100755 t/perf/p5326-multi-pack-bitmaps.sh

diff --git a/t/perf/p5326-multi-pack-bitmaps.sh b/t/perf/p5326-multi-pack-bitmaps.sh
new file mode 100755
index 0000000000..5845109ac7
--- /dev/null
+++ b/t/perf/p5326-multi-pack-bitmaps.sh
@@ -0,0 +1,43 @@
+#!/bin/sh
+
+test_description='Tests performance using midx bitmaps'
+. ./perf-lib.sh
+. "${TEST_DIRECTORY}/perf/lib-bitmap.sh"
+
+test_perf_large_repo
+
+test_expect_success 'enable multi-pack index' '
+	git config core.multiPackIndex true
+'
+
+test_perf 'setup multi-pack index' '
+	git repack -ad &&
+	git multi-pack-index write --bitmap
+'
+
+test_full_bitmap
+
+test_expect_success 'create partial bitmap state' '
+	# pick a commit to represent the repo tip in the past
+	cutoff=$(git rev-list HEAD~100 -1) &&
+	orig_tip=$(git rev-parse HEAD) &&
+
+	# now pretend we have just one tip
+	rm -rf .git/logs .git/refs/* .git/packed-refs &&
+	git update-ref HEAD $cutoff &&
+
+	# and then repack, which will leave us with a nice
+	# big bitmap pack of the "old" history, and all of
+	# the new history will be loose, as if it had been pushed
+	# up incrementally and exploded via unpack-objects
+	git repack -Ad &&
+	git multi-pack-index write --bitmap &&
+
+	# and now restore our original tip, as if the pushes
+	# had happened
+	git update-ref HEAD $orig_tip
+'
+
+test_partial_bitmap
+
+test_done
-- 
2.31.1.163.ga65ce7f831

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 11/24] pack-bitmap.c: introduce 'nth_bitmap_object_oid()'
  2021-06-21 22:25   ` [PATCH v2 11/24] pack-bitmap.c: introduce 'nth_bitmap_object_oid()' Taylor Blau
@ 2021-06-24 14:59     ` Taylor Blau
  2021-07-21 10:37     ` Jeff King
  1 sibling, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-06-24 14:59 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

On Mon, Jun 21, 2021 at 06:25:26PM -0400, Taylor Blau wrote:
>  static int load_bitmap_entries_v1(struct bitmap_index *index)
>  {
>  	uint32_t i;
> @@ -242,9 +249,7 @@ static int load_bitmap_entries_v1(struct bitmap_index *index)
>  		xor_offset = read_u8(index->map, &index->map_pos);
>  		flags = read_u8(index->map, &index->map_pos);
>
> -		if (nth_packed_object_id(&oid, index->pack, commit_idx_pos) < 0)
> -			return error("corrupt ewah bitmap: commit index %u out of range",
> -				     (unsigned)commit_idx_pos);
> +		nth_bitmap_object_oid(index, &oid, commit_idx_pos);

Oops. I was reading code in this area and noticed that this patch drops
the check introduced by c6b0c3910c (pack-bitmap.c: check reads more
aggressively when loading, 2020-12-08).

I fixed it up locally by restoring the check (but on the new function
nth_bitmap_object_oid() instead), and will send it in a reroll.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 01/24] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps
  2021-06-21 22:24   ` [PATCH v2 01/24] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps Taylor Blau
@ 2021-06-24 23:02     ` Ævar Arnfjörð Bjarmason
  2021-07-14 17:24       ` Taylor Blau
  2021-07-21  9:45     ` Jeff King
  1 sibling, 1 reply; 273+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-06-24 23:02 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster, jonathantanmy


On Mon, Jun 21 2021, Taylor Blau wrote:

> +	enum object_type bitmap_type = OBJ_NONE;
> +	int bitmaps_nr = 0;
> +
> +	if (bitmap_get(tdata->commits, pos)) {
> +		bitmap_type = OBJ_COMMIT;
> +		bitmaps_nr++;
> +	}
> +	if (bitmap_get(tdata->trees, pos)) {
> +		bitmap_type = OBJ_TREE;
> +		bitmaps_nr++;
> +	}
> +	if (bitmap_get(tdata->blobs, pos)) {
> +		bitmap_type = OBJ_BLOB;
> +		bitmaps_nr++;
> +	}
> +	if (bitmap_get(tdata->tags, pos)) {
> +		bitmap_type = OBJ_TAG;
> +		bitmaps_nr++;
> +	}

This made me wonder if this could be better with something like the
HAS_MULTI_BITS() macro, but that trick probably can't be applied here in
any shape or form :)

> +
> +	if (!bitmap_type)
> +		die("object %s not found in type bitmaps",
> +		    oid_to_hex(&obj->oid));

It feels a bit magical to use an enum and then assume to know the enum's
values, I know we do "type < 0" all over the place, but I'd think "if
(bitmap_type == OBJ_NONE)" would be better here....

> +
> +	if (bitmaps_nr > 1)
> +		die("object %s does not have a unique type",
> +		    oid_to_hex(&obj->oid));

Or just check the bitmaps_nr instead:

    if (!bitmaps_nr)
        die("found none");
    else if (bitmaps_nr > 1)
        ...;

Just bikeshedding...

> +
> +	if (bitmap_type != obj->type)
> +		die("object %s: real type %s, expected: %s",
> +		    oid_to_hex(&obj->oid),
> +		    type_name(obj->type),
> +		    type_name(bitmap_type));

To argue against myself (sort of) about that "== OBJ_NONE" above, if
we're not assuming that then it's sort of weird not to also assume that
type_name(type) won't return a NULL in the case of OBJ_NONE, which it
does (but this code guards against).

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 02/24] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  2021-06-21 22:25   ` [PATCH v2 02/24] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps Taylor Blau
@ 2021-06-24 23:23     ` Ævar Arnfjörð Bjarmason
  2021-07-14 17:32       ` Taylor Blau
  2021-07-21  9:50     ` Jeff King
  1 sibling, 1 reply; 273+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-06-24 23:23 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster, jonathantanmy


On Mon, Jun 21 2021, Taylor Blau wrote:

> -static uint32_t find_object_pos(const struct object_id *oid)
> +static uint32_t find_object_pos(const struct object_id *oid, int *found)
>  {
>  	struct object_entry *entry = packlist_find(writer.to_pack, oid);
>  
>  	if (!entry) {
> -		die("Failed to write bitmap index. Packfile doesn't have full closure "
> +		if (found)
> +			*found = 0;
> +		warning("Failed to write bitmap index. Packfile doesn't have full closure "
>  			"(object %s is missing)", oid_to_hex(oid));
> +		return 0;
>  	}
>  
> +	if (found)
> +		*found = 1;
>  	return oe_in_pack_pos(writer.to_pack, entry);
>  }
>  
> @@ -331,9 +336,10 @@ static void bitmap_builder_clear(struct bitmap_builder *bb)
>  	bb->commits_nr = bb->commits_alloc = 0;
>  }
>  
> -static void fill_bitmap_tree(struct bitmap *bitmap,
> -			     struct tree *tree)
> +static int fill_bitmap_tree(struct bitmap *bitmap,
> +			    struct tree *tree)
>  {
> +	int found;
>  	uint32_t pos;
>  	struct tree_desc desc;
>  	struct name_entry entry;
> @@ -342,9 +348,11 @@ static void fill_bitmap_tree(struct bitmap *bitmap,
>  	 * If our bit is already set, then there is nothing to do. Both this
>  	 * tree and all of its children will be set.
>  	 */
> -	pos = find_object_pos(&tree->object.oid);
> +	pos = find_object_pos(&tree->object.oid, &found);
> +	if (!found)
> +		return -1;

So, a function that returns an unsigned 32 bit int won't (presumably)
have enough space for an "is bad", but before it died so it didn't
matter.

Now it warns, so it needs a "is bad", so we add another "int" to pass
that information around.

So if we're already paying for that extra space (which, on some
platforms would already be a 64 bit int, and on some so would the
uint32_t, it's just "at least 32 bits").

Wouldn't it be more idiomatic to just have find_object_pos() return
int64_t now, if it's -1 it's an error, otherwise the "pos" is cast to
uint32_t:
	
	diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
	index 88d9e696a54..d71fa6f607a 100644
	--- a/pack-bitmap-write.c
	+++ b/pack-bitmap-write.c
	@@ -125,14 +125,12 @@ static inline void push_bitmapped_commit(struct commit *commit)
	 	writer.selected_nr++;
	 }
	 
	-static uint32_t find_object_pos(const struct object_id *oid)
	+static int64_t find_object_pos(const struct object_id *oid)
	 {
	 	struct object_entry *entry = packlist_find(writer.to_pack, oid);
	 
	-	if (!entry) {
	-		die("Failed to write bitmap index. Packfile doesn't have full closure "
	-			"(object %s is missing)", oid_to_hex(oid));
	-	}
	+	if (!entry)
	+		return -1;
	 
	 	return oe_in_pack_pos(writer.to_pack, entry);
	 }
	@@ -334,7 +332,7 @@ static void bitmap_builder_clear(struct bitmap_builder *bb)
	 static void fill_bitmap_tree(struct bitmap *bitmap,
	 			     struct tree *tree)
	 {
	-	uint32_t pos;
	+	int64_t pos;
	 	struct tree_desc desc;
	 	struct name_entry entry;
	 
	@@ -343,6 +341,9 @@ static void fill_bitmap_tree(struct bitmap *bitmap,
	 	 * tree and all of its children will be set.
	 	 */
	 	pos = find_object_pos(&tree->object.oid);
	+	if (pos < 0)
	+		die("unhappy: %s", oid_to_hex(&tree->object.oid));
	+
	 	if (bitmap_get(bitmap, pos))
	 		return;
	 	bitmap_set(bitmap, pos);

I mean, you don't want the die() part of that, but to me the rest looks
better.

> [...]
> diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
> index 584a039b85..1667450917 100755
> --- a/t/t0410-partial-clone.sh
> +++ b/t/t0410-partial-clone.sh
> @@ -536,7 +536,13 @@ test_expect_success 'gc does not repack promisor objects if there are none' '
>  repack_and_check () {
>  	rm -rf repo2 &&
>  	cp -r repo repo2 &&
> -	git -C repo2 repack $1 -d &&
> +	if test x"$1" = "x--must-fail"
> +	then
> +		shift
> +		test_must_fail git -C repo2 repack $1 -d
> +	else
> +		git -C repo2 repack $1 -d
> +	fi &&
>  	git -C repo2 fsck &&

This sent me down the rabbit hole of
https://lore.kernel.org/git/60c67bdf5b77e_f569220858@natae.notmuch/
again.

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default
  2021-06-21 22:25   ` [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default Taylor Blau
@ 2021-06-24 23:35     ` Ævar Arnfjörð Bjarmason
  2021-07-14 17:41       ` Taylor Blau
  2021-07-21  9:58     ` Jeff King
  1 sibling, 1 reply; 273+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-06-24 23:35 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster, jonathantanmy


On Mon, Jun 21 2021, Taylor Blau wrote:

> Even though the 'TECH_DOCS' variable was introduced all the way back in
> 5e00439f0a (Documentation: build html for all files in technical and
> howto, 2012-10-23), the 'bitmap-format' document was never added to that
> list when it was created.
>
> Prepare for changes to this file by including it in the list of
> technical documentation that 'make doc' will build by default.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  Documentation/Makefile | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index f5605b7767..7d7b778b28 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -90,6 +90,7 @@ SP_ARTICLES += $(API_DOCS)
>  TECH_DOCS += MyFirstContribution
>  TECH_DOCS += MyFirstObjectWalk
>  TECH_DOCS += SubmittingPatches
> +TECH_DOCS += technical/bitmap-format
>  TECH_DOCS += technical/hash-function-transition
>  TECH_DOCS += technical/http-protocol
>  TECH_DOCS += technical/index-format

As a mostly aside I've got a local series queued up to move all of these
"format" docs to e.g. gitformat-bitmap(5), i.e. to make them first-class
manpages, so other pages can link to them. Right now we mostly don't,
but when our manpages do they link to the generated HTML, which e.g. I
don't have installed by default.

So since you're linking to it: Does anyone prefer this state of a
affairs, and isn't it mainly useful for built docs such as
https://git-scm.com/docs/git-multi-pack-index.

But there's still (but maybe later in this series) a link to
bitmap-format anywhere from another manual page (but there is for
e.g. technical/pack-format.html).

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 06/24] midx: make a number of functions non-static
  2021-06-21 22:25   ` [PATCH v2 06/24] midx: make a number of functions non-static Taylor Blau
@ 2021-06-24 23:42     ` Ævar Arnfjörð Bjarmason
  2021-07-14 23:01       ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-06-24 23:42 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster, jonathantanmy


On Mon, Jun 21 2021, Taylor Blau wrote:

> These functions will be called from outside of midx.c in a subsequent
> patch.

So "a number" is "two" and "a subsequent patch" appears to be 13/24. I
think this would be clearer just squashed into whatever needs it, or at
least if it comes right before the new use in the series.

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing
  2021-06-21 22:25   ` [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing Taylor Blau
@ 2021-06-24 23:43     ` Ævar Arnfjörð Bjarmason
  2021-07-21 10:23     ` Jeff King
  1 sibling, 0 replies; 273+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-06-24 23:43 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster, jonathantanmy


On Mon, Jun 21 2021, Taylor Blau wrote:

> When writing a new multi-pack index, write_midx_internal() attempts to
> load any existing one to fill in some pieces of information. But it uses
> load_multi_pack_index(), which ignores the configuration
> "core.multiPackIndex", which indicates whether or not Git is allowed to
> read an existing multi-pack-index.
>
> Replace this with a routine that does respect that setting, to avoid
> reading multi-pack-index files when told not to.
>
> This avoids a problem that would arise in subsequent patches due to the
> combination of 'git repack' reopening the object store in-process and
> the multi-pack index code not checking whether a pack already exists in
> the object store when calling add_pack_to_midx().
>
> This would ultimately lead to a cycle being created along the
> 'packed_git' struct's '->next' pointer. That is obviously bad, but it
> has hard-to-debug downstream effects like saying a bitmap can't be
> loaded for a pack because one already exists (for the same pack).
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  midx.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 40eb7974ba..759007d5a8 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -908,8 +908,18 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  
>  	if (m)
>  		ctx.m = m;
> -	else
> -		ctx.m = load_multi_pack_index(object_dir, 1);
> +	else {

Style nit: leaves the initial "if" braceless now that the "else" gained
braces.

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 14/24] pack-bitmap: write multi-pack bitmaps
  2021-06-21 22:25   ` [PATCH v2 14/24] pack-bitmap: write " Taylor Blau
@ 2021-06-24 23:45     ` Ævar Arnfjörð Bjarmason
  2021-07-15 14:33       ` Taylor Blau
  2021-07-21 12:09     ` Jeff King
  1 sibling, 1 reply; 273+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-06-24 23:45 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster, jonathantanmy


On Mon, Jun 21 2021, Taylor Blau wrote:

> Write multi-pack bitmaps in the format described by
> Documentation/technical/bitmap-format.txt, inferring their presence with
> the absence of '--bitmap'.
>
> To write a multi-pack bitmap, this patch attempts to reuse as much of
> the existing machinery from pack-objects as possible. Specifically, the
> MIDX code prepares a packing_data struct that pretends as if a single
> packfile has been generated containing all of the objects contained
> within the MIDX.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  Documentation/git-multi-pack-index.txt |  12 +-
>  builtin/multi-pack-index.c             |   2 +
>  midx.c                                 | 230 ++++++++++++++++++++++++-
>  midx.h                                 |   1 +
>  4 files changed, 236 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> index ffd601bc17..ada14deb2c 100644
> --- a/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -10,7 +10,7 @@ SYNOPSIS
>  --------
>  [verse]
>  'git multi-pack-index' [--object-dir=<dir>] [--[no-]progress]
> -	[--preferred-pack=<pack>] <subcommand>
> +	[--preferred-pack=<pack>] [--[no-]bitmap] <subcommand>
>  
>  DESCRIPTION
>  -----------
> @@ -40,6 +40,9 @@ write::
>  		multiple packs contain the same object. If not given,
>  		ties are broken in favor of the pack with the lowest
>  		mtime.
> +
> +	--[no-]bitmap::
> +		Control whether or not a multi-pack bitmap is written.
>  --
>  
>  verify::
> @@ -81,6 +84,13 @@ EXAMPLES
>  $ git multi-pack-index write
>  -----------------------------------------------
>  
> +* Write a MIDX file for the packfiles in the current .git folder with a
> +corresponding bitmap.
> ++
> +-------------------------------------------------------------
> +$ git multi-pack-index write --preferred-pack <pack> --bitmap
> +-------------------------------------------------------------
> +

I wondered if this was a <pack> positional argument, but it's just the
argument for --preferred-pack, even though the synopsis uses the "="
style for it. Even if parse-options.c is loose about it, let's use one
or the other in examples consistently.

>  * Write a MIDX file for the packfiles in an alternate object store.
>  +
>  -----------------------------------------------
> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index 5d3ea445fd..bf6fa982e3 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c
> @@ -68,6 +68,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
>  		OPT_STRING(0, "preferred-pack", &opts.preferred_pack,
>  			   N_("preferred-pack"),
>  			   N_("pack for reuse when computing a multi-pack bitmap")),
> +		OPT_BIT(0, "bitmap", &opts.flags, N_("write multi-pack bitmap"),
> +			MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX),
>  		OPT_END(),
>  	};
>  
> diff --git a/midx.c b/midx.c
> index 752d36c57f..a58cca707b 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -13,6 +13,10 @@
>  #include "repository.h"
>  #include "chunk-format.h"
>  #include "pack.h"
> +#include "pack-bitmap.h"
> +#include "refs.h"
> +#include "revision.h"
> +#include "list-objects.h"
>  
>  #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
>  #define MIDX_VERSION 1
> @@ -885,6 +889,172 @@ static void write_midx_reverse_index(char *midx_name, unsigned char *midx_hash,
>  static void clear_midx_files_ext(struct repository *r, const char *ext,
>  				 unsigned char *keep_hash);
>  
> +static void prepare_midx_packing_data(struct packing_data *pdata,
> +				      struct write_midx_context *ctx)
> +{
> +	uint32_t i;
> +
> +	memset(pdata, 0, sizeof(struct packing_data));

We initialize this on the stack in write_midx_bitmap(), shouldn't we
just there do:

    struct packing_data pdata = {0}

Instead of:

    struct packing_data pdata;

And then doing this memset() here?

> +	prepare_packing_data(the_repository, pdata);
> +
> +	for (i = 0; i < ctx->entries_nr; i++) {
> +		struct pack_midx_entry *from = &ctx->entries[ctx->pack_order[i]];
> +		struct object_entry *to = packlist_alloc(pdata, &from->oid);
> +
> +		oe_set_in_pack(pdata, to,
> +			       ctx->info[ctx->pack_perm[from->pack_int_id]].p);
> +	}
> +}
> +
> +static int add_ref_to_pending(const char *refname,
> +			      const struct object_id *oid,
> +			      int flag, void *cb_data)
> +{
> +	struct rev_info *revs = (struct rev_info*)cb_data;
> +	struct object *object;
> +
> +	if ((flag & REF_ISSYMREF) && (flag & REF_ISBROKEN)) {

Just since I'd mentioned HAS_MULTI_BITS() offhand on another patch of
yours, it's for cases like this, so:

-    if ((flag & REF_ISSYMREF) && (flag & REF_ISBROKEN)) {
+    if (HAS_MULTI_BITS(flag & (REF_ISSYMREF|REF_ISBROKEN)) {

Saves you 3 bytes of code:) Anyway, you don't need to use it, just an
intresting function... :)

> +{
> +	struct rev_info revs;
> +	struct bitmap_commit_cb cb;
> +
> +	memset(&cb, 0, sizeof(struct bitmap_commit_cb));

Another case of s/memset/"= {0}"/g ?

> +	cb.ctx = ctx;
> +
> +	repo_init_revisions(the_repository, &revs, NULL);
> +	for_each_ref(add_ref_to_pending, &revs);
> +
> +	/*
> +	 * Skipping promisor objects here is intentional, since it only excludes
> +	 * them from the list of reachable commits that we want to select from
> +	 * when computing the selection of MIDX'd commits to receive bitmaps.
> +	 *
> +	 * Reachability bitmaps do require that their objects be closed under
> +	 * reachability, but fetching any objects missing from promisors at this
> +	 * point is too late. But, if one of those objects can be reached from
> +	 * an another object that is included in the bitmap, then we will
> +	 * complain later that we don't have reachability closure (and fail
> +	 * appropriately).
> +	 */
> +	fetch_if_missing = 0;
> +	revs.exclude_promisor_objects = 1;
> +
> +	/*
> +	 * Pass selected commits in topo order to match the behavior of
> +	 * pack-bitmaps when configured with delta islands.
> +	 */
> +	revs.topo_order = 1;
> +	revs.sort_order = REV_SORT_IN_GRAPH_ORDER;
> +
> +	if (prepare_revision_walk(&revs))
> +		die(_("revision walk setup failed"));
> +
> +	traverse_commit_list(&revs, bitmap_show_commit, NULL, &cb);
> +	if (indexed_commits_nr_p)
> +		*indexed_commits_nr_p = cb.commits_nr;
> +
> +	return cb.commits;
> +}
> +
> +static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
> +			     struct write_midx_context *ctx,
> +			     unsigned flags)
> +{
> +	struct packing_data pdata;
> +	struct pack_idx_entry **index;
> +	struct commit **commits = NULL;
> +	uint32_t i, commits_nr;
> +	char *bitmap_name = xstrfmt("%s-%s.bitmap", midx_name, hash_to_hex(midx_hash));
> +	int ret;
> +
> +	prepare_midx_packing_data(&pdata, ctx);
> +
> +	commits = find_commits_for_midx_bitmap(&commits_nr, ctx);
> +
> +	/*
> +	 * Build the MIDX-order index based on pdata.objects (which is already
> +	 * in MIDX order; c.f., 'midx_pack_order_cmp()' for the definition of
> +	 * this order).
> +	 */
> +	ALLOC_ARRAY(index, pdata.nr_objects);
> +	for (i = 0; i < pdata.nr_objects; i++)
> +		index[i] = (struct pack_idx_entry *)&pdata.objects[i];
> +
> +	bitmap_writer_show_progress(flags & MIDX_PROGRESS);
> +	bitmap_writer_build_type_index(&pdata, index, pdata.nr_objects);
> +
> +	/*
> +	 * bitmap_writer_finish expects objects in lex order, but pack_order
> +	 * gives us exactly that. use it directly instead of re-sorting the
> +	 * array.
> +	 *
> +	 * This changes the order of objects in 'index' between
> +	 * bitmap_writer_build_type_index and bitmap_writer_finish.
> +	 *
> +	 * The same re-ordering takes place in the single-pack bitmap code via
> +	 * write_idx_file(), which is called by finish_tmp_packfile(), which
> +	 * happens between bitmap_writer_build_type_index() and
> +	 * bitmap_writer_finish().
> +	 */
> +	for (i = 0; i < pdata.nr_objects; i++)
> +		index[ctx->pack_order[i]] = (struct pack_idx_entry *)&pdata.objects[i];
> +
> +	bitmap_writer_select_commits(commits, commits_nr, -1);
> +	ret = bitmap_writer_build(&pdata);
> +	if (ret < 0)
> +		goto cleanup;
> +
> +	bitmap_writer_set_checksum(midx_hash);
> +	bitmap_writer_finish(index, pdata.nr_objects, bitmap_name, 0);
> +
> +cleanup:
> +	free(index);
> +	free(bitmap_name);
> +	return ret;
> +}
> +
>  static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
>  			       struct string_list *packs_to_drop,
>  			       const char *preferred_pack_name,
> @@ -930,9 +1100,16 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  		for (i = 0; i < ctx.m->num_packs; i++) {
>  			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
>  
> +			if (prepare_midx_pack(the_repository, ctx.m, i)) {
> +				error(_("could not load pack %s"),
> +				      ctx.m->pack_names[i]);

Isn't the prepare_midx_pack() tasked with populating that pack_names[i]
that you can't load (the strbuf_addf() it does), but it can also exit
before that, do we get an empty string here then? Maybe I'm misreading
it (I haven't run this, just skimmed the code).

> @@ -1132,6 +1342,9 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  	free(ctx.pack_perm);
>  	free(ctx.pack_order);
>  	free(midx_name);
> +	if (ctx.m)
> +		close_midx(ctx.m);
> +

I see Stolee made close_midx() just return silently if !ctx.m in
1dcd9f2043a (midx: close multi-pack-index on repack, 2018-10-12), but
grepping the uses of it it seems calls to it are similarly guarded by
"if"'s.

Just a nit, weird to have a free-like function not invoked like
free. Perhaps (and maybe better for an unrelated cleanup) to either drop
the conditionals, or make it BUG() if it's called with NULL, but at
least we should pick one :)

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 22/24] midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP'
  2021-06-21 22:25   ` [PATCH v2 22/24] midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' Taylor Blau
@ 2021-06-25  0:03     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 273+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-06-25  0:03 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster, jonathantanmy


On Mon, Jun 21 2021, Taylor Blau wrote:

> Introduce a new 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' environment
> variable to also write a multi-pack bitmap when
> 'GIT_TEST_MULTI_PACK_INDEX' is set.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  builtin/repack.c          | 13 ++++++++++---
>  ci/run-build-and-tests.sh |  1 +
>  midx.h                    |  2 ++
>  t/README                  |  4 ++++
>  4 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 5f9bc74adc..77f6f03057 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -515,7 +515,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		if (!(pack_everything & ALL_INTO_ONE) ||
>  		    !is_bare_repository())
>  			write_bitmaps = 0;
> -	}
> +	} else if (write_bitmaps &&
> +		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0) &&
> +		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP, 0))
> +		write_bitmaps = 0;

Style nit: more if/else/elseif some with braces, some not...

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 00/24] multi-pack reachability bitmaps
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
                     ` (23 preceding siblings ...)
  2021-06-21 22:26   ` [PATCH v2 24/24] p5326: perf tests for MIDX bitmaps Taylor Blau
@ 2021-06-25  9:06   ` Ævar Arnfjörð Bjarmason
  2021-07-15 14:36     ` Taylor Blau
  24 siblings, 1 reply; 273+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-06-25  9:06 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster, jonathantanmy


On Mon, Jun 21 2021, Taylor Blau wrote:

> Thanks in advance for your review, and sorry for the wait.

Thanks for working on this, exciting feature!

Just a note on my comments on this. I left them after some light reading
and would describe them as some combination of "musings", "shallow",
"nit-y" and "bikesheddy".

I.e. I did not have time (or I feel, the familiarity) to give this
series the sort of review it actually deserves as far as the actual
important bits go, i.e. nits aside whether this feature works and
behaves as desired. Sorry, but hopefully at least some of comments were
somewhat useful anyway.

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 01/24] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps
  2021-06-24 23:02     ` Ævar Arnfjörð Bjarmason
@ 2021-07-14 17:24       ` Taylor Blau
  0 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-14 17:24 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, peff, dstolee, gitster, jonathantanmy

On Fri, Jun 25, 2021 at 01:02:56AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Mon, Jun 21 2021, Taylor Blau wrote:
>
> > +	enum object_type bitmap_type = OBJ_NONE;
> > +	int bitmaps_nr = 0;
> > +
> > +	if (bitmap_get(tdata->commits, pos)) {
> > +		bitmap_type = OBJ_COMMIT;
> > +		bitmaps_nr++;
> > +	}
> > +	if (bitmap_get(tdata->trees, pos)) {
> > +		bitmap_type = OBJ_TREE;
> > +		bitmaps_nr++;
> > +	}
> > +	if (bitmap_get(tdata->blobs, pos)) {
> > +		bitmap_type = OBJ_BLOB;
> > +		bitmaps_nr++;
> > +	}
> > +	if (bitmap_get(tdata->tags, pos)) {
> > +		bitmap_type = OBJ_TAG;
> > +		bitmaps_nr++;
> > +	}
>
> This made me wonder if this could be better with something like the
> HAS_MULTI_BITS() macro, but that trick probably can't be applied here in
> any shape or form :)

Right; since we're looking at the same bit position in each of the
type-level bitmaps, we can't just OR them together, since all of the
bits are in the same place.

And really, the object_type enum doesn't have values that tell us the
type of an object by looking at just a single bit. So
HAS_MULTI_BITS(OBJ_BLOB) would return "true", since OBJ_BLOB is 3.

> > +
> > +	if (bitmap_type != obj->type)
> > +		die("object %s: real type %s, expected: %s",
> > +		    oid_to_hex(&obj->oid),
> > +		    type_name(obj->type),
> > +		    type_name(bitmap_type));
>
> To argue against myself (sort of) about that "== OBJ_NONE" above, if
> we're not assuming that then it's sort of weird not to also assume that
> type_name(type) won't return a NULL in the case of OBJ_NONE, which it
> does (but this code guards against).

I tend to agree. To restate what you're saying: by the time we get to
the type_name(bitmap_type) we know that bitmap_type is non-zero, so we
assume it's OK to call type_name() on it.

Of course, the object_type_strings does handle the zero argument, so
this is probably a little academic, but good to think through
nonetheless.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 02/24] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  2021-06-24 23:23     ` Ævar Arnfjörð Bjarmason
@ 2021-07-14 17:32       ` Taylor Blau
  2021-07-14 18:44         ` Ævar Arnfjörð Bjarmason
  2021-07-21  9:53         ` Jeff King
  0 siblings, 2 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-14 17:32 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, peff, dstolee, gitster, jonathantanmy

On Fri, Jun 25, 2021 at 01:23:40AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Mon, Jun 21 2021, Taylor Blau wrote:
>
> > -static uint32_t find_object_pos(const struct object_id *oid)
> > +static uint32_t find_object_pos(const struct object_id *oid, int *found)
> >  {
> >  	struct object_entry *entry = packlist_find(writer.to_pack, oid);
> >
> >  	if (!entry) {
> > -		die("Failed to write bitmap index. Packfile doesn't have full closure "
> > +		if (found)
> > +			*found = 0;
> > +		warning("Failed to write bitmap index. Packfile doesn't have full closure "
> >  			"(object %s is missing)", oid_to_hex(oid));
> > +		return 0;
> >  	}
> >
> > +	if (found)
> > +		*found = 1;
> >  	return oe_in_pack_pos(writer.to_pack, entry);
> >  }
>
> So, a function that returns an unsigned 32 bit int won't (presumably)
> have enough space for an "is bad", but before it died so it didn't
> matter.
>
> Now it warns, so it needs a "is bad", so we add another "int" to pass
> that information around.

Right. You could imagine using the most-significant bit to indicate
"bad" (which in this case is "I couldn't find this object that I'm
supposed to be able to reach"), but of course it cuts our maximum number
of objects in a bitmap in half.

> So if we're already paying for that extra space (which, on some
> platforms would already be a 64 bit int, and on some so would the
> uint32_t, it's just "at least 32 bits").
>
> Wouldn't it be more idiomatic to just have find_object_pos() return
> int64_t now, if it's -1 it's an error, otherwise the "pos" is cast to
> uint32_t:

I'm not sure. It does save the extra argument, which is arguably more
convenient for callers, but the cost for doing so is a cast from a
signed integer type to an unsigned one (and a narrower destination type,
at that).

That seems easier to get wrong to me than passing a pointer to a pure
"int" and keeping the return type a uint32_t. So, I'm probably more
content to leave it as-is rather than change it.

I don't feel too strongly about it, though, so if you do I'd be happy to
hear more.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default
  2021-06-24 23:35     ` Ævar Arnfjörð Bjarmason
@ 2021-07-14 17:41       ` Taylor Blau
  2021-07-14 22:58         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-07-14 17:41 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, peff, dstolee, gitster, jonathantanmy

On Fri, Jun 25, 2021 at 01:35:48AM +0200, Ævar Arnfjörð Bjarmason wrote:
> As a mostly aside I've got a local series queued up to move all of these
> "format" docs to e.g. gitformat-bitmap(5), i.e. to make them first-class
> manpages, so other pages can link to them. Right now we mostly don't,
> but when our manpages do they link to the generated HTML, which e.g. I
> don't have installed by default.

It would be nice to be able to look it up with "man 5 gitformat-bitmap".
I actually don't have strong feelings about this particular patch
getting picked up or not, since it doesn't add the actual format changes
to the file itself.

This does pick up the bitmap-format document in "make -C Documentation
html", which is nice(r than "make -C Documentation
technical/bitmap-format.html" IMHO).

> But there's still (but maybe later in this series) a link to
> bitmap-format anywhere from another manual page (but there is for
> e.g. technical/pack-format.html).

No, I didn't add any links pointed at bitmap-format.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 02/24] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  2021-07-14 17:32       ` Taylor Blau
@ 2021-07-14 18:44         ` Ævar Arnfjörð Bjarmason
  2021-07-21  9:53         ` Jeff King
  1 sibling, 0 replies; 273+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-07-14 18:44 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster, jonathantanmy


On Wed, Jul 14 2021, Taylor Blau wrote:

> On Fri, Jun 25, 2021 at 01:23:40AM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Mon, Jun 21 2021, Taylor Blau wrote:
>>
>> > -static uint32_t find_object_pos(const struct object_id *oid)
>> > +static uint32_t find_object_pos(const struct object_id *oid, int *found)
>> >  {
>> >  	struct object_entry *entry = packlist_find(writer.to_pack, oid);
>> >
>> >  	if (!entry) {
>> > -		die("Failed to write bitmap index. Packfile doesn't have full closure "
>> > +		if (found)
>> > +			*found = 0;
>> > +		warning("Failed to write bitmap index. Packfile doesn't have full closure "
>> >  			"(object %s is missing)", oid_to_hex(oid));
>> > +		return 0;
>> >  	}
>> >
>> > +	if (found)
>> > +		*found = 1;
>> >  	return oe_in_pack_pos(writer.to_pack, entry);
>> >  }
>>
>> So, a function that returns an unsigned 32 bit int won't (presumably)
>> have enough space for an "is bad", but before it died so it didn't
>> matter.
>>
>> Now it warns, so it needs a "is bad", so we add another "int" to pass
>> that information around.
>
> Right. You could imagine using the most-significant bit to indicate
> "bad" (which in this case is "I couldn't find this object that I'm
> supposed to be able to reach"), but of course it cuts our maximum number
> of objects in a bitmap in half.
>
>> So if we're already paying for that extra space (which, on some
>> platforms would already be a 64 bit int, and on some so would the
>> uint32_t, it's just "at least 32 bits").
>>
>> Wouldn't it be more idiomatic to just have find_object_pos() return
>> int64_t now, if it's -1 it's an error, otherwise the "pos" is cast to
>> uint32_t:
>
> I'm not sure. It does save the extra argument, which is arguably more
> convenient for callers, but the cost for doing so is a cast from a
> signed integer type to an unsigned one (and a narrower destination type,
> at that).
>
> That seems easier to get wrong to me than passing a pointer to a pure
> "int" and keeping the return type a uint32_t. So, I'm probably more
> content to leave it as-is rather than change it.
>
> I don't feel too strongly about it, though, so if you do I'd be happy to
> hear more.

I don't really care, it just looked a bit weird at first, and I wondered
why it couldn't return -1.

Aside from this case do you mean that such a cast would be too expensive
in general, or fears abou going past the 32 bits? I assumed that there
would be checks here for that already (and if not, we'd have wrap-around
now...).

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default
  2021-07-14 17:41       ` Taylor Blau
@ 2021-07-14 22:58         ` Ævar Arnfjörð Bjarmason
  2021-07-21 10:04           ` Jeff King
  0 siblings, 1 reply; 273+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-07-14 22:58 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster, jonathantanmy


On Wed, Jul 14 2021, Taylor Blau wrote:

> On Fri, Jun 25, 2021 at 01:35:48AM +0200, Ævar Arnfjörð Bjarmason wrote:
>> As a mostly aside I've got a local series queued up to move all of these
>> "format" docs to e.g. gitformat-bitmap(5), i.e. to make them first-class
>> manpages, so other pages can link to them. Right now we mostly don't,
>> but when our manpages do they link to the generated HTML, which e.g. I
>> don't have installed by default.
>
> It would be nice to be able to look it up with "man 5 gitformat-bitmap".
> I actually don't have strong feelings about this particular patch
> getting picked up or not, since it doesn't add the actual format changes
> to the file itself.
>
> This does pick up the bitmap-format document in "make -C Documentation
> html", which is nice(r than "make -C Documentation
> technical/bitmap-format.html" IMHO).

Oh yes, I'm not saying don't add the target. Just a musing on how we
ended up with such a large set of things in "Documentation/technical/*"
as opposed to just man pages.

I guess if there's good reasons for it they'll come out if/when I submit
that series...

>> But there's still (but maybe later in this series) a link to
>> bitmap-format anywhere from another manual page (but there is for
>> e.g. technical/pack-format.html).
>
> No, I didn't add any links pointed at bitmap-format.

I see https://git-scm.com/docs/bitmap-format has somehow managed to get
indexed by Google, perhaps through some magic :)

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 06/24] midx: make a number of functions non-static
  2021-06-24 23:42     ` Ævar Arnfjörð Bjarmason
@ 2021-07-14 23:01       ` Taylor Blau
  0 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-14 23:01 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, peff, dstolee, gitster, jonathantanmy

On Fri, Jun 25, 2021 at 01:42:06AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Mon, Jun 21 2021, Taylor Blau wrote:
>
> > These functions will be called from outside of midx.c in a subsequent
> > patch.
>
> So "a number" is "two" and "a subsequent patch" appears to be 13/24. I
> think this would be clearer just squashed into whatever needs it, or at
> least if it comes right before the new use in the series.

Good suggestion, thanks. This was probably written at a time when the
number of functions I needed was larger (or perhaps when this and
patches 10-12 were all jumbled together).

In any case, I dropped this patch and squashed its contents into 13/24
where it is used.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 14/24] pack-bitmap: write multi-pack bitmaps
  2021-06-24 23:45     ` Ævar Arnfjörð Bjarmason
@ 2021-07-15 14:33       ` Taylor Blau
  0 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-15 14:33 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, peff, dstolee, gitster, jonathantanmy

On Fri, Jun 25, 2021 at 01:45:46AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Mon, Jun 21 2021, Taylor Blau wrote:
> > +* Write a MIDX file for the packfiles in the current .git folder with a
> > +corresponding bitmap.
> > ++
> > +-------------------------------------------------------------
> > +$ git multi-pack-index write --preferred-pack <pack> --bitmap
> > +-------------------------------------------------------------
> > +
>
> I wondered if this was a <pack> positional argument, but it's just the
> argument for --preferred-pack, even though the synopsis uses the "="
> style for it. Even if parse-options.c is loose about it, let's use one
> or the other in examples consistently.

The example below (for writing a MIDX in an alternate object store)
doesn't include the '=', but probably would be clearer if it did. I
think it's a good suggestion, though, so I'll fix up my addition here.

> > +	memset(pdata, 0, sizeof(struct packing_data));
>
> We initialize this on the stack in write_midx_bitmap(), shouldn't we
> just there do:
>
>     struct packing_data pdata = {0}
>
> Instead of:
>
>     struct packing_data pdata;
>
> And then doing this memset() here?

I could go either way. Part of me prefers the memset() since it lets
callers of prepare_midx_packing_data() pass in anything they want,
including a pointer to uninitialized memory. Of course, there is only
one such caller, so it probably doesn't really matter.

And the other caller of prepare_packing_data() which is in
builtin/pack-objects.c operates on a pointer to a statically allocated
variable, so its bytes are already zero'd.

I don't feel strongly about it, though, so I'd just as soon err on the
side of flexibility than changing the declaration.

> > +{
> > +	struct rev_info revs;
> > +	struct bitmap_commit_cb cb;
> > +
> > +	memset(&cb, 0, sizeof(struct bitmap_commit_cb));
>
> Another case of s/memset/"= {0}"/g ?

Ah, in this case I'd prefer the aggregate-style initialization, since
we're zero-ing it out in the same function.

> >  static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
> >  			       struct string_list *packs_to_drop,
> >  			       const char *preferred_pack_name,
> > @@ -930,9 +1100,16 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
> >  		for (i = 0; i < ctx.m->num_packs; i++) {
> >  			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
> >
> > +			if (prepare_midx_pack(the_repository, ctx.m, i)) {
> > +				error(_("could not load pack %s"),
> > +				      ctx.m->pack_names[i]);
>
> Isn't the prepare_midx_pack() tasked with populating that pack_names[i]
> that you can't load (the strbuf_addf() it does), but it can also exit
> before that, do we get an empty string here then? Maybe I'm misreading
> it (I haven't run this, just skimmed the code).

Nice catch, we can't rely on ctx->m.pack_names[i] being safe to read
(and at the same time know that we're going to get a non-empty string).

Since prepare_midx_pack() can fail because the pack itself couldn't be
loaded, I think the easiest thing to do here is to just opaquely say
"could not load pack" without adding any pack name that we may or may
not have.

> > @@ -1132,6 +1342,9 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
> >  	free(ctx.pack_perm);
> >  	free(ctx.pack_order);
> >  	free(midx_name);
> > +	if (ctx.m)
> > +		close_midx(ctx.m);
> > +
>
> I see Stolee made close_midx() just return silently if !ctx.m in
> 1dcd9f2043a (midx: close multi-pack-index on repack, 2018-10-12), but
> grepping the uses of it it seems calls to it are similarly guarded by
> "if"'s.
>
> Just a nit, weird to have a free-like function not invoked like
> free. Perhaps (and maybe better for an unrelated cleanup) to either drop
> the conditionals, or make it BUG() if it's called with NULL, but at
> least we should pick one :)

I agree with the direction of 1dcd9f2043a, so I'm happy to just drop the
conditional and call close_midx() with an argument that may or may not
be NULL.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 00/24] multi-pack reachability bitmaps
  2021-06-25  9:06   ` [PATCH v2 00/24] multi-pack reachability bitmaps Ævar Arnfjörð Bjarmason
@ 2021-07-15 14:36     ` Taylor Blau
  2021-07-21 12:12       ` Jeff King
  0 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-07-15 14:36 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, peff, dstolee, gitster, jonathantanmy

On Fri, Jun 25, 2021 at 11:06:21AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Mon, Jun 21 2021, Taylor Blau wrote:
>
> > Thanks in advance for your review, and sorry for the wait.
>
> Thanks for working on this, exciting feature!
>
> Just a note on my comments on this. I left them after some light reading
> and would describe them as some combination of "musings", "shallow",
> "nit-y" and "bikesheddy".

Thanks for your review. It did help me catch some important issues, like
referring to ctx->m.pack_names[i] when a read at "i" may have been
invalid.

> I.e. I did not have time (or I feel, the familiarity) to give this
> series the sort of review it actually deserves as far as the actual
> important bits go, i.e. nits aside whether this feature works and
> behaves as desired. Sorry, but hopefully at least some of comments were
> somewhat useful anyway.

They were useful indeed. I'll sit on the changes locally since most of
them are pretty benign and I think subsequent review could be done on
top of v2 without seeing my local changes.

I know that reviewing this is on Peff's list of things to do, but there
are competing priorities (and we have all-company meetings this week, so
I would not be surprised to see a lack of movement until at least next
week).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 01/24] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps
  2021-06-21 22:24   ` [PATCH v2 01/24] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps Taylor Blau
  2021-06-24 23:02     ` Ævar Arnfjörð Bjarmason
@ 2021-07-21  9:45     ` Jeff King
  2021-07-21 17:15       ` Taylor Blau
  1 sibling, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-21  9:45 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jun 21, 2021 at 06:24:59PM -0400, Taylor Blau wrote:

> The special `--test-bitmap` mode of `git rev-list` is used to compare
> the result of an object traversal with a bitmap to check its integrity.
> This mode does not, however, assert that the types of reachable objects
> are stored correctly.
> 
> Harden this mode by teaching it to also check that each time an object's
> bit is marked, the corresponding bit should be set in exactly one of the
> type bitmaps (whose type matches the object's true type).

Yep, makes sense, and the patch looks good.

> +{
> +	enum object_type bitmap_type = OBJ_NONE;
> [...]
> +
> +	if (!bitmap_type)
> +		die("object %s not found in type bitmaps",
> +		    oid_to_hex(&obj->oid));

I think the suggestion to do:

  if (bitmap_type == OBJ_NONE)

is reasonable here, as it assumes less about the enum. I do think
OBJ_BAD and OBJ_NONE were chosen with these kind of numeric comparisons
in mind, but there is no reason to rely on them in places we don't need
to.

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 02/24] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  2021-06-21 22:25   ` [PATCH v2 02/24] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps Taylor Blau
  2021-06-24 23:23     ` Ævar Arnfjörð Bjarmason
@ 2021-07-21  9:50     ` Jeff King
  2021-07-21 17:20       ` Taylor Blau
  1 sibling, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-21  9:50 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jun 21, 2021 at 06:25:01PM -0400, Taylor Blau wrote:

> The set of objects covered by a bitmap must be closed under
> reachability, since it must be the case that there is a valid bit
> position assigned for every possible reachable object (otherwise the
> bitmaps would be incomplete).
> 
> Pack bitmaps are never written from 'git repack' unless repacking
> all-into-one, and so we never write non-closed bitmaps (except in the
> case of partial clones where we aren't guaranteed to have all objects).
> 
> But multi-pack bitmaps change this, since it isn't known whether the
> set of objects in the MIDX is closed under reachability until walking
> them. Plumb through a bit that is set when a reachable object isn't
> found.
> 
> As soon as a reachable object isn't found in the set of objects to
> include in the bitmap, bitmap_writer_build() knows that the set is not
> closed, and so it now fails gracefully.

Leaving aside your intended use here, I think it's nice to get rid of a
deep-buried die() like this in general.

The amount of error-plumbing you had to do is a little unpleasant, but I
think is unavoidable. The only non-obvious part was this hunk:

> @@ -463,8 +488,11 @@ void bitmap_writer_build(struct packing_data *to_pack)
>  		struct commit *child;
>  		int reused = 0;
>  
> -		fill_bitmap_commit(ent, commit, &queue, &tree_queue,
> -				   old_bitmap, mapping);
> +		if (fill_bitmap_commit(ent, commit, &queue, &tree_queue,
> +				       old_bitmap, mapping) < 0) {
> +			closed = 0;
> +			break;
> +		}
>  
>  		if (ent->selected) {
>  			store_selected(ent, commit);

This is the right thing to do because we still want to free memory, stop
progress, etc. I gave a look over what will run after breaking out of
the loop, and compute_xor_offsets(), which you already handled, is the
only thing we'd want to avoid running. Good.

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 02/24] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  2021-07-14 17:32       ` Taylor Blau
  2021-07-14 18:44         ` Ævar Arnfjörð Bjarmason
@ 2021-07-21  9:53         ` Jeff King
  1 sibling, 0 replies; 273+ messages in thread
From: Jeff King @ 2021-07-21  9:53 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Ævar Arnfjörð Bjarmason, git, dstolee, gitster,
	jonathantanmy

On Wed, Jul 14, 2021 at 01:32:21PM -0400, Taylor Blau wrote:

> > So if we're already paying for that extra space (which, on some
> > platforms would already be a 64 bit int, and on some so would the
> > uint32_t, it's just "at least 32 bits").
> >
> > Wouldn't it be more idiomatic to just have find_object_pos() return
> > int64_t now, if it's -1 it's an error, otherwise the "pos" is cast to
> > uint32_t:
> 
> I'm not sure. It does save the extra argument, which is arguably more
> convenient for callers, but the cost for doing so is a cast from a
> signed integer type to an unsigned one (and a narrower destination type,
> at that).
> 
> That seems easier to get wrong to me than passing a pointer to a pure
> "int" and keeping the return type a uint32_t. So, I'm probably more
> content to leave it as-is rather than change it.

I agree that the separate "found" value makes things more obvious.
Casting to a smaller size means explaining why that is OK, whereas it is
quite clear that things are correct with two separate variables. And
having an int64_t makes one wonder whether a value like 2^35 is possible
(it isn't; this is a uint32_t because that is the limit of objects in
the pack format).

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 03/24] pack-bitmap-write.c: free existing bitmaps
  2021-06-21 22:25   ` [PATCH v2 03/24] pack-bitmap-write.c: free existing bitmaps Taylor Blau
@ 2021-07-21  9:54     ` Jeff King
  0 siblings, 0 replies; 273+ messages in thread
From: Jeff King @ 2021-07-21  9:54 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jun 21, 2021 at 06:25:04PM -0400, Taylor Blau wrote:

> When writing a new bitmap, the bitmap writer code attempts to read the
> existing bitmap (if one is present). This is done in order to quickly
> permute the bits of any bitmaps for commits which appear in the existing
> bitmap, and were also selected for the new bitmap.
> 
> But since this code was added in 341fa34887 (pack-bitmap-write: use
> existing bitmaps, 2020-12-08), the resources associated with opening an
> existing bitmap were never released.
> 
> It's fine to ignore this, but it's bad hygiene. It will also cause a
> problem for the multi-pack-index builtin, which will be responsible not
> only for writing bitmaps, but also for expiring any old multi-pack
> bitmaps.
> 
> If an existing bitmap was reused here, it will also be expired. That
> will cause a problem on platforms which require file resources to be
> closed before unlinking them, like Windows. Avoid this by ensuring we
> close reused bitmaps with free_bitmap_index() before removing them.

I agree with all of that. But just "it's a memory leak" would have
contented me, too. :)

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default
  2021-06-21 22:25   ` [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default Taylor Blau
  2021-06-24 23:35     ` Ævar Arnfjörð Bjarmason
@ 2021-07-21  9:58     ` Jeff King
  2021-07-21 10:08       ` Jeff King
  1 sibling, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-21  9:58 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jun 21, 2021 at 06:25:07PM -0400, Taylor Blau wrote:

> Even though the 'TECH_DOCS' variable was introduced all the way back in
> 5e00439f0a (Documentation: build html for all files in technical and
> howto, 2012-10-23), the 'bitmap-format' document was never added to that
> list when it was created.
> 
> Prepare for changes to this file by including it in the list of
> technical documentation that 'make doc' will build by default.

OK. I don't care that much about being able to format this as html, but
I agree it's good to be consistent with the other stuff in technical/.

The big question is whether it looks OK rendered by asciidoc, and the
answer seems to be "yes" (from a cursory look I gave it).

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default
  2021-07-14 22:58         ` Ævar Arnfjörð Bjarmason
@ 2021-07-21 10:04           ` Jeff King
  2021-07-21 10:10             ` Jeff King
  0 siblings, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-21 10:04 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, git, dstolee, gitster, jonathantanmy

On Thu, Jul 15, 2021 at 12:58:00AM +0200, Ævar Arnfjörð Bjarmason wrote:

> >> But there's still (but maybe later in this series) a link to
> >> bitmap-format anywhere from another manual page (but there is for
> >> e.g. technical/pack-format.html).
> >
> > No, I didn't add any links pointed at bitmap-format.
> 
> I see https://git-scm.com/docs/bitmap-format has somehow managed to get
> indexed by Google, perhaps through some magic :)

Presumably somebody somewhere mentioned it (if not, the list archive now
links to it ;) ).

The git-scm.com site doesn't use our Documentation/Makefile, so it has
been building the bitmap-format page for a while. It doesn't look
correct, though (it is missing everything before the "on-disk format"
section).

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default
  2021-07-21  9:58     ` Jeff King
@ 2021-07-21 10:08       ` Jeff King
  2021-07-21 17:23         ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-21 10:08 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 05:58:41AM -0400, Jeff King wrote:

> On Mon, Jun 21, 2021 at 06:25:07PM -0400, Taylor Blau wrote:
> 
> > Even though the 'TECH_DOCS' variable was introduced all the way back in
> > 5e00439f0a (Documentation: build html for all files in technical and
> > howto, 2012-10-23), the 'bitmap-format' document was never added to that
> > list when it was created.
> > 
> > Prepare for changes to this file by including it in the list of
> > technical documentation that 'make doc' will build by default.
> 
> OK. I don't care that much about being able to format this as html, but
> I agree it's good to be consistent with the other stuff in technical/.
> 
> The big question is whether it looks OK rendered by asciidoc, and the
> answer seems to be "yes" (from a cursory look I gave it).

Actually, I take it back. After looking more carefully, it renders quite
poorly. There's a lot of structural indentation that ends up being
confused as code blocks.

I don't know if it's better to have a poorly-formatted HTML file, or
none at all. :)

Personally, I would just read the source. And I have a slight concern
that if we start "cleaning it up" to render as asciidoc, the source
might end up a lot less readable (though I'd reserve judgement until
actually seeing it).

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default
  2021-07-21 10:04           ` Jeff King
@ 2021-07-21 10:10             ` Jeff King
  0 siblings, 0 replies; 273+ messages in thread
From: Jeff King @ 2021-07-21 10:10 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 06:04:15AM -0400, Jeff King wrote:

> The git-scm.com site doesn't use our Documentation/Makefile, so it has
> been building the bitmap-format page for a while. It doesn't look
> correct, though (it is missing everything before the "on-disk format"
> section).

Doh, nevermind. That nice header section does not exist before Taylor's
series. ;)

The git-scm.com page does seem to throw away the "GIT bitmap v1 format"
title entirely. And it suffers the same rendering problems I saw when
building with asciidoc locally.

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 05/24] Documentation: describe MIDX-based bitmaps
  2021-06-21 22:25   ` [PATCH v2 05/24] Documentation: describe MIDX-based bitmaps Taylor Blau
@ 2021-07-21 10:18     ` Jeff King
  2021-07-21 17:53       ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-21 10:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jun 21, 2021 at 06:25:10PM -0400, Taylor Blau wrote:

> +An object is uniquely described by its bit position within a bitmap:
> +
> +	- If the bitmap belongs to a packfile, the __n__th bit corresponds to
> +	the __n__th object in pack order. For a function `offset` which maps
> +	objects to their byte offset within a pack, pack order is defined as
> +	follows:
> +
> +		o1 <= o2 <==> offset(o1) <= offset(o2)
> +
> +	- If the bitmap belongs to a MIDX, the __n__th bit corresponds to the
> +	__n__th object in MIDX order. With an additional function `pack` which
> +	maps objects to the pack they were selected from by the MIDX, MIDX order
> +	is defined as follows:
> +
> +		o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)
> +
> +	The ordering between packs is done lexicographically by the pack name,
> +	with the exception of the preferred pack, which sorts ahead of all other
> +	packs.

This doesn't render well as asciidoc (the final paragraph is taken as
more of the code block). But that is a problem through the whole file. I
think we should ignore it for now, and worry about asciidoc-ifying the
whole thing later, if we choose to.

> +	The ordering between packs is done lexicographically by the pack name,
> +	with the exception of the preferred pack, which sorts ahead of all other
> +	packs.

Hmm, I'm not sure if this "lexicographically" part is true. Really we're
building on the midx .rev format here. And that says "defined by the
MIDX's pack list" (though I can't offhand remember if that is
lexicographic, or if it is in the reverse-mtime order).

At any rate, should we just be referencing the rev documentation?

> [...]

The rest of the changes to the document seemed quite sensible to me.

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 07/24] midx: clear auxiliary .rev after replacing the MIDX
  2021-06-21 22:25   ` [PATCH v2 07/24] midx: clear auxiliary .rev after replacing the MIDX Taylor Blau
@ 2021-07-21 10:19     ` Jeff King
  0 siblings, 0 replies; 273+ messages in thread
From: Jeff King @ 2021-07-21 10:19 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jun 21, 2021 at 06:25:15PM -0400, Taylor Blau wrote:

> When writing a new multi-pack index, write_midx_internal() attempts to
> clean up any auxiliary files (currently just the MIDX's `.rev` file, but
> soon to include a `.bitmap`, too) corresponding to the MIDX it's
> replacing.
> 
> This step should happen after the new MIDX is written into place, since
> doing so beforehand means that the old MIDX could be read without its
> corresponding .rev file.

Good catch. The patch looks obviously correct.

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing
  2021-06-21 22:25   ` [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing Taylor Blau
  2021-06-24 23:43     ` Ævar Arnfjörð Bjarmason
@ 2021-07-21 10:23     ` Jeff King
  2021-07-21 19:22       ` Taylor Blau
  1 sibling, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-21 10:23 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jun 21, 2021 at 06:25:18PM -0400, Taylor Blau wrote:

> When writing a new multi-pack index, write_midx_internal() attempts to
> load any existing one to fill in some pieces of information. But it uses
> load_multi_pack_index(), which ignores the configuration
> "core.multiPackIndex", which indicates whether or not Git is allowed to
> read an existing multi-pack-index.
> 
> Replace this with a routine that does respect that setting, to avoid
> reading multi-pack-index files when told not to.
> 
> This avoids a problem that would arise in subsequent patches due to the
> combination of 'git repack' reopening the object store in-process and
> the multi-pack index code not checking whether a pack already exists in
> the object store when calling add_pack_to_midx().
> 
> This would ultimately lead to a cycle being created along the
> 'packed_git' struct's '->next' pointer. That is obviously bad, but it
> has hard-to-debug downstream effects like saying a bitmap can't be
> loaded for a pack because one already exists (for the same pack).

I'm not sure I completely understand the bug that this causes.

But another question: does this impact how

  git -c core.multipackindex=false multi-pack-index write

behaves? I.e., do we still write, but just avoid reading the existing
midx? That itself seems like a more sensible behavior (e.g., trying to
recover from a broken midx state).

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 09/24] midx: infer preferred pack when not given one
  2021-06-21 22:25   ` [PATCH v2 09/24] midx: infer preferred pack when not given one Taylor Blau
@ 2021-07-21 10:34     ` Jeff King
  2021-07-21 20:16       ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-21 10:34 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jun 21, 2021 at 06:25:21PM -0400, Taylor Blau wrote:

> In 9218c6a40c (midx: allow marking a pack as preferred, 2021-03-30), the
> multi-pack index code learned how to select a pack which all duplicate
> objects are selected from. That is, if an object appears in multiple
> packs, select the copy in the preferred pack before breaking ties
> according to the other rules like pack mtime and readdir() order.
> 
> Not specifying a preferred pack can cause serious problems with
> multi-pack reachability bitmaps, because these bitmaps rely on having at
> least one pack from which all duplicates are selected. Not having such a
> pack causes problems with the pack reuse code (e.g., like assuming that
> a base object was sent from that pack via reuse when in fact the base
> was selected from a different pack).

It might be helpful to use a more descriptive name for "pack reuse code"
here, since it's kind of vague for people who have not been actively
working on bitmaps.

I don't have a short name for that chunk of code, but maybe:

  ...causes problems with the code in pack-objects to reuse packs
  verbatim (e.g., that code assumes that a delta object in a chunk of
  pack sent verbatim will have its base object sent from the same pack).

> So why does not marking a pack preferred cause problems here? The reason
> is roughly as follows:
> 
>   - Ties are broken (when handling duplicate objects) by sorting
>     according to midx_oid_compare(), which sorts objects by OID,
>     preferred-ness, pack mtime, and finally pack ID (more on that
>     later).
> 
>   - The psuedo pack-order (described in
>     Documentation/technical/bitmap-format.txt) is computed by
>     midx_pack_order(), and sorts by pack ID and pack offset, with
>     preferred packs sorting first.

I think the .rev description in pack-format.txt may be a better
reference here.

>   - But! Pack IDs come from incrementing the pack count in
>     add_pack_to_midx(), which is a callback to
>     for_each_file_in_pack_dir(), meaning that pack IDs are assigned in
>     readdir() order.
> 
> When specifying a preferred pack, all of that works fine, because
> duplicate objects are correctly resolved in favor of the copy in the
> preferred pack, and the preferred pack sorts first in the object order.
> 
> "Sorting first" is critical, because the bitmap code relies on finding
> out which pack holds the first object in the MIDX's pseudo pack-order to
> determine which pack is preferred.
> 
> But if we didn't specify a preferred pack, and the pack which comes
> first in readdir() order does not also have the lowest timestamp, then
> it's possible that that pack (the one that sorts first in pseudo-pack
> order, which the bitmap code will treat as the preferred one) did *not*
> have all duplicate objects resolved in its favor, resulting in breakage.
> 
> The fix is simple: pick a (semi-arbitrary) preferred pack when none was
> specified. This forces that pack to have duplicates resolved in its
> favor, and (critically) to sort first in pseudo-pack order.
> Unfortunately, testing this behavior portably isn't possible, since it
> depends on readdir() order which isn't guaranteed by POSIX.

This explanation is rather confusing, but I'm not sure if we can do much
better. I followed all of it, because I was there when we found the bug
that this is fixing. And of course that happened _after_ we implemented
midx bitmaps and in particular adapted the verbatim reuse stuff in
pack-objects to make use of it.

I see why you'd want to float the fix up before then, so we don't ever
have the broken state. But it's hard to understand what bug this is
fixing, because the bug does not even exist yet at this point in
the series!

I dunno. Like I said, I was able to follow it, so maybe it is
sufficient. I'm just not sure others would be able to.

> +
> +		if (!found)
> +			warning(_("unknown preferred pack: '%s'"),
> +				preferred_pack_name);
> +	} else if (ctx.nr && (flags & MIDX_WRITE_REV_INDEX)) {
> +		time_t oldest = ctx.info[0].p->mtime;
> +		ctx.preferred_pack_idx = 0;
> +
> +		if (packs_to_drop && packs_to_drop->nr)
> +			BUG("cannot write a MIDX bitmap during expiration");

Likewise, this BUG() feels somewhat out-of-place. At this point in the
series, we don't have bitmaps yet. :)

I can live with that, though. And I don't want to make a lot of work by
trying to re-order this patch within the series. Mostly I want to make
sure that if somebody stumbles on this commit via git-log or in a
bisection, that they can make some sense of what it's trying to do.

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 10/24] pack-bitmap.c: introduce 'bitmap_num_objects()'
  2021-06-21 22:25   ` [PATCH v2 10/24] pack-bitmap.c: introduce 'bitmap_num_objects()' Taylor Blau
@ 2021-07-21 10:35     ` Jeff King
  0 siblings, 0 replies; 273+ messages in thread
From: Jeff King @ 2021-07-21 10:35 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jun 21, 2021 at 06:25:23PM -0400, Taylor Blau wrote:

> A subsequent patch to support reading MIDX bitmaps will be less noisy
> after extracting a generic function to return how many objects are
> contained in a bitmap.

Thanks for this. These kinds of preparatory patches make reading the
actual tricky changes so much easier.

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 11/24] pack-bitmap.c: introduce 'nth_bitmap_object_oid()'
  2021-06-21 22:25   ` [PATCH v2 11/24] pack-bitmap.c: introduce 'nth_bitmap_object_oid()' Taylor Blau
  2021-06-24 14:59     ` Taylor Blau
@ 2021-07-21 10:37     ` Jeff King
  2021-07-21 10:38       ` Jeff King
  1 sibling, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-21 10:37 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jun 21, 2021 at 06:25:26PM -0400, Taylor Blau wrote:

> A subsequent patch to support reading MIDX bitmaps will be less noisy
> after extracting a generic function to fetch the nth OID contained in
> the bitmap.

Makes sense, but...

> +static void nth_bitmap_object_oid(struct bitmap_index *index,
> +				  struct object_id *oid,
> +				  uint32_t n)
> +{
> +	nth_packed_object_id(oid, index->pack, n);
> +}
> +
>  static int load_bitmap_entries_v1(struct bitmap_index *index)
>  {
>  	uint32_t i;
> @@ -242,9 +249,7 @@ static int load_bitmap_entries_v1(struct bitmap_index *index)
>  		xor_offset = read_u8(index->map, &index->map_pos);
>  		flags = read_u8(index->map, &index->map_pos);
>  
> -		if (nth_packed_object_id(&oid, index->pack, commit_idx_pos) < 0)
> -			return error("corrupt ewah bitmap: commit index %u out of range",
> -				     (unsigned)commit_idx_pos);
> +		nth_bitmap_object_oid(index, &oid, commit_idx_pos);

What happened to our error check here?

Should nth_bitmap_object_oid() be returning the value from
nth_packed_object_id()?

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 11/24] pack-bitmap.c: introduce 'nth_bitmap_object_oid()'
  2021-07-21 10:37     ` Jeff King
@ 2021-07-21 10:38       ` Jeff King
  0 siblings, 0 replies; 273+ messages in thread
From: Jeff King @ 2021-07-21 10:38 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 06:37:41AM -0400, Jeff King wrote:

> > @@ -242,9 +249,7 @@ static int load_bitmap_entries_v1(struct bitmap_index *index)
> >  		xor_offset = read_u8(index->map, &index->map_pos);
> >  		flags = read_u8(index->map, &index->map_pos);
> >  
> > -		if (nth_packed_object_id(&oid, index->pack, commit_idx_pos) < 0)
> > -			return error("corrupt ewah bitmap: commit index %u out of range",
> > -				     (unsigned)commit_idx_pos);
> > +		nth_bitmap_object_oid(index, &oid, commit_idx_pos);
> 
> What happened to our error check here?
> 
> Should nth_bitmap_object_oid() be returning the value from
> nth_packed_object_id()?

Ah, sorry, I just saw your followup message. I'll look for the fix in
the re-roll.

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 12/24] pack-bitmap.c: introduce 'bitmap_is_preferred_refname()'
  2021-06-21 22:25   ` [PATCH v2 12/24] pack-bitmap.c: introduce 'bitmap_is_preferred_refname()' Taylor Blau
@ 2021-07-21 10:39     ` Jeff King
  2021-07-21 20:18       ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-21 10:39 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jun 21, 2021 at 06:25:29PM -0400, Taylor Blau wrote:

> In a recent commit, pack-objects learned support for the
> 'pack.preferBitmapTips' configuration. This patch prepares the
> multi-pack bitmap code to respect this configuration, too.
> 
> Since the multi-pack bitmap code already does a traversal of all
> references (in order to discover the set of reachable commits in the
> multi-pack index), it is more efficient to check whether or not each
> reference is a suffix of any value of 'pack.preferBitmapTips' rather
> than do an additional traversal.
> 
> Implement a function 'bitmap_is_preferred_refname()' which does just
> that. The caller will be added in a subsequent patch.

I suspect there was some patch reordering here. We don't have any
multi-pack bitmap code yet. :)

Probably this needs to say something like "in preparation for adding
multi-pack bitmap code..." or similar?

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 13/24] pack-bitmap: read multi-pack bitmaps
  2021-06-21 22:25   ` [PATCH v2 13/24] pack-bitmap: read multi-pack bitmaps Taylor Blau
@ 2021-07-21 11:32     ` Jeff King
  2021-07-21 23:01       ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-21 11:32 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jun 21, 2021 at 06:25:31PM -0400, Taylor Blau wrote:

> This prepares the code in pack-bitmap to interpret the new multi-pack
> bitmaps described in Documentation/technical/bitmap-format.txt, which
> mostly involves converting bit positions to accommodate looking them up
> in a MIDX.
> 
> Note that there are currently no writers who write multi-pack bitmaps,
> and that this will be implemented in the subsequent commit.

There's quite a lot going on in this one, of course, but most of it
looks right. A few hunks did puzzle me:

> @@ -302,12 +377,18 @@ static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git
>  		return -1;
>  	}
>  
> -	if (bitmap_git->pack) {
> +	if (bitmap_git->pack || bitmap_git->midx) {
> +		/* ignore extra bitmap file; we can only handle one */
>  		warning("ignoring extra bitmap file: %s", packfile->pack_name);
>  		close(fd);
>  		return -1;
>  	}
>  
> +	if (!is_pack_valid(packfile)) {
> +		close(fd);
> +		return -1;
> +	}
> +

What's this extra is_pack_valid() doing? I wouldn't expect many changes
at all to this non-midx code path (aside from the "did we already load a
midx bitmap" in the earlier part of the hunk, which makes sense).

> -static int load_pack_bitmap(struct bitmap_index *bitmap_git)
> +static int load_reverse_index(struct bitmap_index *bitmap_git)
> +{
> +	if (bitmap_is_midx(bitmap_git)) {
> +		uint32_t i;
> +		int ret;
> +
> +		ret = load_midx_revindex(bitmap_git->midx);
> +		if (ret)
> +			return ret;
> +
> +		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
> +			if (prepare_midx_pack(the_repository, bitmap_git->midx, i))
> +				die(_("load_reverse_index: could not open pack"));
> +			ret = load_pack_revindex(bitmap_git->midx->packs[i]);
> +			if (ret)
> +				return ret;
> +		}
> +		return 0;
> +	}
> +	return load_pack_revindex(bitmap_git->pack);
> +}

OK, this new function is used in load_bitmap(), which is used for both
pack and midx bitmaps. So if we have a midx bitmap, we'll
unconditionally load the revindex here. But:

  - why do we then load individual pack revindexes? I can believe it may
    be necessary to meet the assumptions of some other part of the code,
    but it would be nice to have a comment giving us some clue.

  - in open_midx_bitmap_1(), we also unconditionally load the midx
    reverse index. I think that will always happen before us here (we
    cannot load_bitmap() a bitmap that has not been opened). So is this
    load_midx_revindex() call always a noop?

> +static int open_bitmap(struct repository *r,
> +		       struct bitmap_index *bitmap_git)
> +{
> +	assert(!bitmap_git->map);
> +
> +	if (!open_midx_bitmap(r, bitmap_git))
> +		return 0;
> +	return open_pack_bitmap(r, bitmap_git);
> +}

We always prefer a midx bitmap over a pack one. That makes sense, since
that means we can leave old pack bitmaps in place when generating midx
bitmaps, if we choose to.

>  static int bitmap_position(struct bitmap_index *bitmap_git,
>  			   const struct object_id *oid)
>  {
> -	int pos = bitmap_position_packfile(bitmap_git, oid);
> +	int pos;
> +	if (bitmap_is_midx(bitmap_git))
> +		pos = bitmap_position_midx(bitmap_git, oid);
> +	else
> +		pos = bitmap_position_packfile(bitmap_git, oid);
>  	return (pos >= 0) ? pos : bitmap_position_extended(bitmap_git, oid);
>  }

Makes sense. Not new in your patch, but this "int" return is fudging the
same 32-bit space we were talking about elsewhere (i.e., "pos" really
could be 2^32, or even more due to extended objects).

In practice I think even 2^31 objects is pretty out-of-reach, but it may
be worth changing the return type (and the callers), or even just
catching the overflow with an assertion.

> @@ -752,8 +911,13 @@ static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
>  		struct object *object = roots->item;
>  		roots = roots->next;
>  
> -		if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
> -			return 1;
> +		if (bitmap_is_midx(bitmap_git)) {
> +			if (bsearch_midx(&object->oid, bitmap_git->midx, NULL))
> +				return 1;
> +		} else {
> +			if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
> +				return 1;
> +		}
>  	}

Makes sense. TBH, I am not sure this in_bitmapped_pack() function is all
that useful. It is used only as part of a heuristic to avoid bitmaps
when we don't have coverage of any "have" commits. But I'm not sure that
heuristic is actually useful.

Anyway, we should definitely not get into ripping it out here. This
series is complicated enough. :) Just a note for possible future work.

>  	if (pos < bitmap_num_objects(bitmap_git)) {
> -		off_t ofs = pack_pos_to_offset(pack, pos);
> +		struct packed_git *pack;
> +		off_t ofs;
> +
> +		if (bitmap_is_midx(bitmap_git)) {
> +			uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, pos);
> +			uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
> +
> +			pack = bitmap_git->midx->packs[pack_id];
> +			ofs = nth_midxed_offset(bitmap_git->midx, midx_pos);
> +		} else {
> +			pack = bitmap_git->pack;
> +			ofs = pack_pos_to_offset(pack, pos);
> +		}
> +

All of the hunks like this make perfect sense. The big problem would be
if we _missed_ a place that needed conversion to handle midx. But the
nice thing is that it would segfault quickly in such an instance. So
there I'm mostly relying on test coverage, plus our experience running
with this code at scale.

>  static void try_partial_reuse(struct bitmap_index *bitmap_git,
> +			      struct packed_git *pack,
>  			      size_t pos,
>  			      struct bitmap *reuse,
>  			      struct pack_window **w_curs)
>  {
> -	off_t offset, header;
> +	off_t offset, delta_obj_offset;

I'm OK with all of this in one big patch. But I suspect you _could_
just put:

  if (bitmap_git->midx)
	return; /* partial reuse not implemented for midx yet */

to start with, and then actually implement it later. I call out this
code in particular just because it's got a lot of subtleties (the
"reuse" bits are much more intimate with the assumptions of packs and
bitmaps than most other code).

I'm not sure if it's worth the trouble at this point or not.

>  	enum object_type type;
>  	unsigned long size;
>  
> -	if (pos >= bitmap_num_objects(bitmap_git))
> -		return; /* not actually in the pack or MIDX */
> +	/*
> +	 * try_partial_reuse() is called either on (a) objects in the
> +	 * bitmapped pack (in the case of a single-pack bitmap) or (b)
> +	 * objects in the preferred pack of a multi-pack bitmap.
> +	 * Importantly, the latter can pretend as if only a single pack
> +	 * exists because:
> +	 *
> +	 *   - The first pack->num_objects bits of a MIDX bitmap are
> +	 *     reserved for the preferred pack, and
> +	 *
> +	 *   - Ties due to duplicate objects are always resolved in
> +	 *     favor of the preferred pack.
> +	 *
> +	 * Therefore we do not need to ever ask the MIDX for its copy of
> +	 * an object by OID, since it will always select it from the
> +	 * preferred pack. Likewise, the selected copy of the base
> +	 * object for any deltas will reside in the same pack.
> +	 *
> +	 * This means that we can reuse pos when looking up the bit in
> +	 * the reuse bitmap, too, since bits corresponding to the
> +	 * preferred pack precede all bits from other packs.
> +	 */
>  
> -	offset = header = pack_pos_to_offset(bitmap_git->pack, pos);
> -	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
> +	if (pos >= pack->num_objects)
> +		return; /* not actually in the pack or MIDX preferred pack */

It feels weird to go from bitmap_num_objects() back to
pack->num_objects. But I agree it's the right thing for the "pretend as
if only a single pack exists" reasons given above.

> +static uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
> +{
> +	struct multi_pack_index *m = bitmap_git->midx;
> +	if (!m)
> +		BUG("midx_preferred_pack: requires non-empty MIDX");
> +	return nth_midxed_pack_int_id(m, pack_pos_to_midx(bitmap_git->midx, 0));
> +}

This part is really subtle. We infer the preferred pack by looking at
the pack of the 0th bit position. In general that works, since that's
part of the definition of the preferred pack.

Could this ever be fooled if we had a preferred pack with 0 objects in
it? I don't know why we would have such a thing, but just trying to
think of cases where our assumptions might not hold (and what bad things
could happen).

> +	if (bitmap_is_midx(bitmap_git))
> +		pack = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
> +	else
> +		pack = bitmap_git->pack;
> +	objects_nr = pack->num_objects;
> +
>  	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
>  		i++;
>  
> -	/* Don't mark objects not in the packfile */
> +	/*
> +	 * Don't mark objects not in the packfile or preferred pack. This bitmap
> +	 * marks objects eligible for reuse, but the pack-reuse code only
> +	 * understands how to reuse a single pack. Since the preferred pack is
> +	 * guaranteed to have all bases for its deltas (in a multi-pack bitmap),
> +	 * we use it instead of another pack. In single-pack bitmaps, the choice
> +	 * is made for us.
> +	 */
>  	if (i > objects_nr / BITS_IN_EWORD)
>  		i = objects_nr / BITS_IN_EWORD;

OK, so this clamps our "quick" contiguous set of bits to the number of
objects in the preferred pack. Makes sense. And then we hit the
object-by-object loop below...

> @@ -1213,7 +1437,15 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
>  				break;
>  
>  			offset += ewah_bit_ctz64(word >> offset);
> -			try_partial_reuse(bitmap_git, pos + offset, reuse, &w_curs);
> +			if (bitmap_is_midx(bitmap_git)) {
> +				/*
> +				 * Can't reuse from a non-preferred pack (see
> +				 * above).
> +				 */
> +				if (pos + offset >= objects_nr)
> +					continue;
> +			}
> +			try_partial_reuse(bitmap_git, pack, pos + offset, reuse, &w_curs);

...and this likewise makes sure we never go past that first pack. Good.

I think this "continue" could actually be a "break", as the loop is
iterating over "offset" (and "pos + offset" always gets larger). In
fact, it could break out of the outer loop as well (which is
incrementing "pos"). It's probably a pretty small efficiency in
practice, though.

> @@ -1511,8 +1749,13 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
>  		struct object_id oid;
>  		struct object_entry *oe;
>  
> -		nth_packed_object_id(&oid, bitmap_git->pack,
> -				     pack_pos_to_index(bitmap_git->pack, i));
> +		if (bitmap_is_midx(bitmap_git))
> +			nth_midxed_object_oid(&oid,
> +					      bitmap_git->midx,
> +					      pack_pos_to_midx(bitmap_git->midx, i));
> +		else
> +			nth_packed_object_id(&oid, bitmap_git->pack,
> +					     pack_pos_to_index(bitmap_git->pack, i));
>  		oe = packlist_find(mapping, &oid);

Could this be using nth_bitmap_object_oid()? I guess not, because we are
feeding from pack_pos_to_*. I'm not sure if another helper function is
worth it (pack_pos_to_bitmap_index() or something?).

> @@ -1575,7 +1831,31 @@ static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
>  				break;
>  
>  			offset += ewah_bit_ctz64(word >> offset);
> -			pos = base + offset;
> +
> +			if (bitmap_is_midx(bitmap_git)) {
> +				uint32_t pack_pos;
> +				uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, base + offset);
> +				uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
> +				off_t offset = nth_midxed_offset(bitmap_git->midx, midx_pos);
> +
> +				pack = bitmap_git->midx->packs[pack_id];
> +
> +				if (offset_to_pack_pos(pack, offset, &pack_pos) < 0) {
> +					struct object_id oid;
> +					nth_midxed_object_oid(&oid, bitmap_git->midx, midx_pos);
> +
> +					die(_("could not find %s in pack #%"PRIu32" at offset %"PRIuMAX),
> +					    oid_to_hex(&oid),
> +					    pack_id,
> +					    (uintmax_t)offset);
> +				}
> +
> +				pos = pack_pos;
> +			} else {
> +				pack = bitmap_git->pack;
> +				pos = base + offset;
> +			}
> +
>  			total += pack_pos_to_offset(pack, pos + 1) -
>  				 pack_pos_to_offset(pack, pos);
>  		}

In the midx case, we have to go from midx-bitmap-pos to midx-index-pos,
to then get the pack/ofs combo, which then gives us a real "pos" in the
pack. I don't think there's a faster way to do it (and this is still
much faster than looking up objects in the pack only to check their
revindex).

But then with the result, we compare the offset of "pos" and "pos + 1".
We need to know "pos" to find "pos + 1". But in the midx case, don't we
already have the offset of "pos" (it is "offset" in the bitmap_is_midx()
conditional, which is shadowing the completely unrelated "offset" in the
outer loop).

We could reuse it, saving ourselves an extra round-trip of pack_pos to
index_pos to offset. It would just mean stuffing the "total +=" line
into the two sides of the conditional.

> +off_t bitmap_pack_offset(struct bitmap_index *bitmap_git, uint32_t pos)
> +{
> +	if (bitmap_is_midx(bitmap_git))
> +		return nth_midxed_offset(bitmap_git->midx,
> +					 pack_pos_to_midx(bitmap_git->midx, pos));
> +	return nth_packed_object_offset(bitmap_git->pack,
> +					pack_pos_to_index(bitmap_git->pack, pos));
> +}

Does anybody call this function? I don't see any users by the end of the
series.

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 14/24] pack-bitmap: write multi-pack bitmaps
  2021-06-21 22:25   ` [PATCH v2 14/24] pack-bitmap: write " Taylor Blau
  2021-06-24 23:45     ` Ævar Arnfjörð Bjarmason
@ 2021-07-21 12:09     ` Jeff King
  2021-07-26 18:12       ` Taylor Blau
  1 sibling, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-21 12:09 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jun 21, 2021 at 06:25:34PM -0400, Taylor Blau wrote:

> +static int add_ref_to_pending(const char *refname,
> +			      const struct object_id *oid,
> +			      int flag, void *cb_data)
> +{
> +	struct rev_info *revs = (struct rev_info*)cb_data;
> +	struct object *object;
> +
> +	if ((flag & REF_ISSYMREF) && (flag & REF_ISBROKEN)) {
> +		warning("symbolic ref is dangling: %s", refname);
> +		return 0;
> +	}
> +
> +	object = parse_object_or_die(oid, refname);
> +	if (object->type != OBJ_COMMIT)
> +		return 0;
> +
> +	add_pending_object(revs, object, "");
> +	if (bitmap_is_preferred_refname(revs->repo, refname))
> +		object->flags |= NEEDS_BITMAP;
> +	return 0;
> +}

OK, so we'll look at each ref to get the set of commits that we want to
traverse to put into the bitmap. Which is roughly the same as what the
pack bitmap does. We only generate bitmaps for all-into-one repacks, so
it is traversing all of the reachable objects. It is a little different
in that the pack version is probably hitting reflogs, but IMHO we are
better off to ignore reflogs for the purposes of bitmaps (I would
suggest to do so in the pack-bitmap case, too, except that it is
combined with the "what to pack" traversal there, and by the time we see
each commit we don't know how we got there).

> +struct bitmap_commit_cb {
> +	struct commit **commits;
> +	size_t commits_nr, commits_alloc;
> +
> +	struct write_midx_context *ctx;
> +};
> +
> +static const struct object_id *bitmap_oid_access(size_t index,
> +						 const void *_entries)
> +{
> +	const struct pack_midx_entry *entries = _entries;
> +	return &entries[index].oid;
> +}
> +
> +static void bitmap_show_commit(struct commit *commit, void *_data)
> +{
> +	struct bitmap_commit_cb *data = _data;
> +	if (oid_pos(&commit->object.oid, data->ctx->entries,
> +		    data->ctx->entries_nr,
> +		    bitmap_oid_access) > -1) {

This "> -1" struck me as a little bit funny. Perhaps ">= 0" would be a
more obvious way of saying "we found it"?

> +	/*
> +	 * Skipping promisor objects here is intentional, since it only excludes
> +	 * them from the list of reachable commits that we want to select from
> +	 * when computing the selection of MIDX'd commits to receive bitmaps.
> +	 *
> +	 * Reachability bitmaps do require that their objects be closed under
> +	 * reachability, but fetching any objects missing from promisors at this
> +	 * point is too late. But, if one of those objects can be reached from
> +	 * an another object that is included in the bitmap, then we will
> +	 * complain later that we don't have reachability closure (and fail
> +	 * appropriately).
> +	 */
> +	fetch_if_missing = 0;
> +	revs.exclude_promisor_objects = 1;

Makes sense.

> +	/*
> +	 * Pass selected commits in topo order to match the behavior of
> +	 * pack-bitmaps when configured with delta islands.
> +	 */
> +	revs.topo_order = 1;
> +	revs.sort_order = REV_SORT_IN_GRAPH_ORDER;

Hmm. Why do we want to match this side effect of delta islands here?

The only impact this has is on the order of commits we feed for bitmap
selection (and during the actual generation phase, it may impact
visitation order).

Now I'm of the opinion that topo order is probably the best thing for
bitmap generation (since the bitmaps themselves are connected to the
graph structure). But if it is the best thing, shouldn't we perhaps be
turning on topo-order for single-pack bitmaps, too?

And if it isn't the best thing, then why would we want it here?

> +	if (prepare_revision_walk(&revs))
> +		die(_("revision walk setup failed"));

We call init_revisions(), and then go straight to
prepare_revision_walk() with no call to setup_revisions() between. It
doesn't seem to be clearly documented, but I think you're supposed to,
as it finalizes some bits like diff_setup_done().

I suspect it works OK in practice, and I did find a few other spots that
do not call it (e.g., builtin/am.c:write_commit_patch). But most spots
do at least an empty setup_revisions(0, NULL, &rev, NULL).

> +	/*
> +	 * Build the MIDX-order index based on pdata.objects (which is already
> +	 * in MIDX order; c.f., 'midx_pack_order_cmp()' for the definition of
> +	 * this order).
> +	 */
> +	ALLOC_ARRAY(index, pdata.nr_objects);
> +	for (i = 0; i < pdata.nr_objects; i++)
> +		index[i] = (struct pack_idx_entry *)&pdata.objects[i];

This cast is correct because the pack_idx_entry is at the start of each
object_entry. But maybe:

  index[i] = &pdata.objects[i].idx;

would be less scary looking?

> +	/*
> +	 * bitmap_writer_finish expects objects in lex order, but pack_order
> +	 * gives us exactly that. use it directly instead of re-sorting the
> +	 * array.
> +	 *
> +	 * This changes the order of objects in 'index' between
> +	 * bitmap_writer_build_type_index and bitmap_writer_finish.
> +	 *
> +	 * The same re-ordering takes place in the single-pack bitmap code via
> +	 * write_idx_file(), which is called by finish_tmp_packfile(), which
> +	 * happens between bitmap_writer_build_type_index() and
> +	 * bitmap_writer_finish().
> +	 */
> +	for (i = 0; i < pdata.nr_objects; i++)
> +		index[ctx->pack_order[i]] = (struct pack_idx_entry *)&pdata.objects[i];

Ditto here.

> +	bitmap_writer_select_commits(commits, commits_nr, -1);

Not related to your patch, but I had to refresh my memory on what this
"-1" was for. It's "max_bitmaps", and is ignored if it's negative. But
the only callers pass "-1"! So we could get rid of it entirely.

It probably makes sense to leave that cleanup out of this
already-complicated series. But maybe worth doing later on top.

> @@ -930,9 +1100,16 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  		for (i = 0; i < ctx.m->num_packs; i++) {
>  			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
>  
> +			if (prepare_midx_pack(the_repository, ctx.m, i)) {
> +				error(_("could not load pack %s"),
> +				      ctx.m->pack_names[i]);
> +				result = 1;
> +				goto cleanup;
> +			}

It might be worth a comment here. I can easily believe that there is
some later part of the bitmap generation code that assumes the packs are
loaded. But somebody reading this is not likely to understand why it's
here.

Should this be done conditionally only if we're writing a bitmap? (That
might also make it obvious why we are doing it).

> @@ -947,8 +1124,26 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
>  	stop_progress(&ctx.progress);
>  
> -	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
> -		goto cleanup;
> +	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop) {
> +		struct bitmap_index *bitmap_git;
> +		int bitmap_exists;
> +		int want_bitmap = flags & MIDX_WRITE_BITMAP;
> +
> +		bitmap_git = prepare_bitmap_git(the_repository);
> +		bitmap_exists = bitmap_git && bitmap_is_midx(bitmap_git);
> +		free_bitmap_index(bitmap_git);
> +
> +		if (bitmap_exists || !want_bitmap) {
> +			/*
> +			 * The correct MIDX already exists, and so does a
> +			 * corresponding bitmap (or one wasn't requested).
> +			 */
> +			if (!want_bitmap)
> +				clear_midx_files_ext(the_repository, ".bitmap",
> +						     NULL);
> +			goto cleanup;
> +		}
> +	}

So this makes "git multi-pack-index write --write-bitmap" actually write
a bitmap, even if the midx itself didn't need updating? Sounds good.
Likewise, we'll delete a bitmap if one exists but we were not requested
to write one. Makes sense.

I do think nice-to-have bits like this could have come in a separate
patch with their own explanation and tests. It may not be worth trying
to extract it at this point, though.

> @@ -1075,9 +1271,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
>  	f = hashfd(get_lock_file_fd(&lk), get_lock_file_path(&lk));
>  
> -	if (ctx.m)
> -		close_midx(ctx.m);
> -
>  	if (ctx.nr - dropped_packs == 0) {
>  		error(_("no pack files to index."));
>  		result = 1;

I'm not sure what this hunk is doing. We do pick up the close_midx()
call at the end of the function, amidst the other cleanup.

I expect the answer is something like "we need it open when we generate
the bitmaps". But it makes me wonder if we could hit any cases where we
try to overwrite it while it's still open, which would cause problems on
Windows.

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 00/24] multi-pack reachability bitmaps
  2021-07-15 14:36     ` Taylor Blau
@ 2021-07-21 12:12       ` Jeff King
  0 siblings, 0 replies; 273+ messages in thread
From: Jeff King @ 2021-07-21 12:12 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Ævar Arnfjörð Bjarmason, git, dstolee, gitster,
	jonathantanmy

On Thu, Jul 15, 2021 at 10:36:53AM -0400, Taylor Blau wrote:

> I know that reviewing this is on Peff's list of things to do, but there
> are competing priorities (and we have all-company meetings this week, so
> I would not be surprised to see a lack of movement until at least next
> week).

I made it through a very careful read of all of the actual code changes,
through patch 14. After that, it looks like it's all tests, but my brain
is now effectively fried. I had some comments, but nothing
earth-shattering.

I'll pick up on 15-24 later, though I think between my comments and
Ævar's there's some light changes to be made. So don't hesitate to post
a new version in the meantime, which can save us a round-trip.

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 01/24] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps
  2021-07-21  9:45     ` Jeff King
@ 2021-07-21 17:15       ` Taylor Blau
  0 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-21 17:15 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 05:45:12AM -0400, Jeff King wrote:
> > +{
> > +	enum object_type bitmap_type = OBJ_NONE;
> > [...]
> > +
> > +	if (!bitmap_type)
> > +		die("object %s not found in type bitmaps",
> > +		    oid_to_hex(&obj->oid));
>
> I think the suggestion to do:
>
>   if (bitmap_type == OBJ_NONE)
>
> is reasonable here, as it assumes less about the enum. I do think
> OBJ_BAD and OBJ_NONE were chosen with these kind of numeric comparisons
> in mind, but there is no reason to rely on them in places we don't need
> to.

I had to double check your suggestion, because my first question was
"what if bitmap_type is OBJ_BAD?" We can call type_name() on OBJ_BAD, but
it will return NULL, and we use the return value in a format string
unconditionally.

So that would be a problem, but it's impossible for this to ever be
OBJ_BAD, because we only set it based on the type bitmaps; so it's
either a commit/tree/blob/tag, or none (but not bad).

I took your suggestion, thanks.

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 02/24] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  2021-07-21  9:50     ` Jeff King
@ 2021-07-21 17:20       ` Taylor Blau
  2021-07-23  7:37         ` Jeff King
  0 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-07-21 17:20 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 05:50:43AM -0400, Jeff King wrote:
> The amount of error-plumbing you had to do is a little unpleasant, but I
> think is unavoidable. The only non-obvious part was this hunk:

Agreed, at least on the amount of plumbing required to get this to work
;).

> > @@ -463,8 +488,11 @@ void bitmap_writer_build(struct packing_data *to_pack)
> >  		struct commit *child;
> >  		int reused = 0;
> >
> > -		fill_bitmap_commit(ent, commit, &queue, &tree_queue,
> > -				   old_bitmap, mapping);
> > +		if (fill_bitmap_commit(ent, commit, &queue, &tree_queue,
> > +				       old_bitmap, mapping) < 0) {
> > +			closed = 0;
> > +			break;
> > +		}
> >
> >  		if (ent->selected) {
> >  			store_selected(ent, commit);
>
> This is the right thing to do because we still want to free memory, stop
> progress, etc. I gave a look over what will run after breaking out of
> the loop, and compute_xor_offsets(), which you already handled, is the
> only thing we'd want to avoid running. Good.

Right. The key is that we return "closed ? 0 : -1" (of course, being
careful to invert "closed" where "1" OK into a suitable return value for
bitmap_writer_build, where "0" means OK, and a negative number means
"error").

While I'm thinking about that inversion, we *could* call this variable
"open" and set it to "0" until proven otherwise. Then the conditional
becomes "if (!open)", but the return value is still "return open ? -1 :
0" (since I assume we'd want to use 0/1 values for "open" instead of -1,
meaning we'd have to do some translation).

Anyway, this is definitely an annoying detail that doesn't really
matter (and just rambling on my part) ;).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default
  2021-07-21 10:08       ` Jeff King
@ 2021-07-21 17:23         ` Taylor Blau
  2021-07-23  7:39           ` Jeff King
  0 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-07-21 17:23 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 06:08:18AM -0400, Jeff King wrote:
> On Wed, Jul 21, 2021 at 05:58:41AM -0400, Jeff King wrote:
>
> > On Mon, Jun 21, 2021 at 06:25:07PM -0400, Taylor Blau wrote:
> >
> > > Even though the 'TECH_DOCS' variable was introduced all the way back in
> > > 5e00439f0a (Documentation: build html for all files in technical and
> > > howto, 2012-10-23), the 'bitmap-format' document was never added to that
> > > list when it was created.
> > >
> > > Prepare for changes to this file by including it in the list of
> > > technical documentation that 'make doc' will build by default.
> >
> > OK. I don't care that much about being able to format this as html, but
> > I agree it's good to be consistent with the other stuff in technical/.
> >
> > The big question is whether it looks OK rendered by asciidoc, and the
> > answer seems to be "yes" (from a cursory look I gave it).
>
> Actually, I take it back. After looking more carefully, it renders quite
> poorly. There's a lot of structural indentation that ends up being
> confused as code blocks.
>
> I don't know if it's better to have a poorly-formatted HTML file, or
> none at all. :)
>
> Personally, I would just read the source. And I have a slight concern
> that if we start "cleaning it up" to render as asciidoc, the source
> might end up a lot less readable (though I'd reserve judgement until
> actually seeing it).

Yeah, the actual source is pretty readable (and it's what I had been
looking at, although it is sometimes convenient to have a version I can
read in my web browser). But it's definitely not good Asciidoc.

I briefly considered cleaning it up, but decided against it. Usually I
would opt to clean it up, but this series is already so large that I
figured it would make a negative impact on the reviewer experience to
read a clean-up patch here.

I wouldn't be opposed to coming back to it in the future, once the dust
settles. I guess we can consider this #leftoverbits until then.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 05/24] Documentation: describe MIDX-based bitmaps
  2021-07-21 10:18     ` Jeff King
@ 2021-07-21 17:53       ` Taylor Blau
  2021-07-23  7:45         ` Jeff King
  0 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-07-21 17:53 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 06:18:46AM -0400, Jeff King wrote:
> On Mon, Jun 21, 2021 at 06:25:10PM -0400, Taylor Blau wrote:
>
> > +An object is uniquely described by its bit position within a bitmap:
> > +
> > +	- If the bitmap belongs to a packfile, the __n__th bit corresponds to
> > +	the __n__th object in pack order. For a function `offset` which maps
> > +	objects to their byte offset within a pack, pack order is defined as
> > +	follows:
> > +
> > +		o1 <= o2 <==> offset(o1) <= offset(o2)
> > +
> > +	- If the bitmap belongs to a MIDX, the __n__th bit corresponds to the
> > +	__n__th object in MIDX order. With an additional function `pack` which
> > +	maps objects to the pack they were selected from by the MIDX, MIDX order
> > +	is defined as follows:
> > +
> > +		o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)
> > +
> > +	The ordering between packs is done lexicographically by the pack name,
> > +	with the exception of the preferred pack, which sorts ahead of all other
> > +	packs.
>
> This doesn't render well as asciidoc (the final paragraph is taken as
> more of the code block). But that is a problem through the whole file. I
> think we should ignore it for now, and worry about asciidoc-ifying the
> whole thing later, if we choose to.

Agreed; let's ignore it for now.

> > +	The ordering between packs is done lexicographically by the pack name,
> > +	with the exception of the preferred pack, which sorts ahead of all other
> > +	packs.
>
> Hmm, I'm not sure if this "lexicographically" part is true. Really we're
> building on the midx .rev format here. And that says "defined by the
> MIDX's pack list" (though I can't offhand remember if that is
> lexicographic, or if it is in the reverse-mtime order).
>
> At any rate, should we just be referencing the rev documentation?

The packs are listed in lex order in the MIDX, but that is so we can
binary search that list to determine whether a pack is included in the
MIDX or not.

I had to check, but we do use the lex order to resolve duplicate
objects, too. See (at the tip of this branch):

    QSORT(ctx.info, ctx.nr, pack_info_compare);

from within midx.c:write_midx_internal(). Here, ctx.info contains the
list of packs, and pack_info_compare is a thin wrapper around
strcmp()-ing the pack_name values of two packed_git structures.

Arguably, you'd get better EWAH compression of the bits between packs
if we sorted packs in reverse order according to their mtime. But I
suspect that it doesn't matter much in practice, since the number of
objects vastly outpaces the number of packs (but I haven't measured to
be certain, so take that with a grain of salt).

In any case, I think that you're right that adding too much detail hurts
us here, so we should really be mentioning the MIDX's .rev-file
documentation (unfortunately, we can't linkgit it, so mentioning it by
name will have to suffice). I plan to reroll with something like this on
top:

diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
index 25221c7ec8..04b3ec2178 100644
--- a/Documentation/technical/bitmap-format.txt
+++ b/Documentation/technical/bitmap-format.txt
@@ -26,9 +26,8 @@ An object is uniquely described by its bit position within a bitmap:

 		o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)

-	The ordering between packs is done lexicographically by the pack name,
-	with the exception of the preferred pack, which sorts ahead of all other
-	packs.
+	The ordering between packs is done according to the MIDX's .rev file.
+	Notably, the preferred pack sorts ahead of all other packs.

 The on-disk representation (described below) of a bitmap is the same regardless
 of whether or not that bitmap belongs to a packfile or a MIDX. The only

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing
  2021-07-21 10:23     ` Jeff King
@ 2021-07-21 19:22       ` Taylor Blau
  2021-07-23  8:29         ` Jeff King
  0 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-07-21 19:22 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 06:23:23AM -0400, Jeff King wrote:
> On Mon, Jun 21, 2021 at 06:25:18PM -0400, Taylor Blau wrote:
>
> > When writing a new multi-pack index, write_midx_internal() attempts to
> > load any existing one to fill in some pieces of information. But it uses
> > load_multi_pack_index(), which ignores the configuration
> > "core.multiPackIndex", which indicates whether or not Git is allowed to
> > read an existing multi-pack-index.
> >
> > Replace this with a routine that does respect that setting, to avoid
> > reading multi-pack-index files when told not to.
> >
> > This avoids a problem that would arise in subsequent patches due to the
> > combination of 'git repack' reopening the object store in-process and
> > the multi-pack index code not checking whether a pack already exists in
> > the object store when calling add_pack_to_midx().
> >
> > This would ultimately lead to a cycle being created along the
> > 'packed_git' struct's '->next' pointer. That is obviously bad, but it
> > has hard-to-debug downstream effects like saying a bitmap can't be
> > loaded for a pack because one already exists (for the same pack).
>
> I'm not sure I completely understand the bug that this causes.

Off-hand, I can't quite remember either. But it is important; I do have
a distinct memory of dropping this patch and then watching a 'git repack
--write-midx' (that option will be introduced in a later series) fail
horribly.

If I remember correctly, the bug has to do with loading a MIDX twice in
the same process. When we call add_packed_git() from within
prepare_midx_pack(), we load the pack without caring whether or not it's
already loaded. So loading a MIDX twice in the same process will fail.

So really I think that this is papering over that bug: we're just
removing one of the times that we happened to load a MIDX from during
the writing phase.

What I do remember is that this bug was a huge pain to figure out ;).
I'm happy to look further if you aren't satisfied with my vague
explanation here (and I wouldn't blame you).

> But another question: does this impact how
>
>   git -c core.multipackindex=false multi-pack-index write
>
> behaves? I.e., do we still write, but just avoid reading the existing
> midx? That itself seems like a more sensible behavior (e.g., trying to
> recover from a broken midx state).

Yes. Before this patch, that invocation would still load and use any
existing MIDX to write a new one. Now we don't, because (unlike
load_multi_pack_index()) prepare_multi_pack_index_one() does check
core.multiPackIndex before returning anything.

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 09/24] midx: infer preferred pack when not given one
  2021-07-21 10:34     ` Jeff King
@ 2021-07-21 20:16       ` Taylor Blau
  2021-07-23  8:50         ` Jeff King
  0 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-07-21 20:16 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 06:34:25AM -0400, Jeff King wrote:
> On Mon, Jun 21, 2021 at 06:25:21PM -0400, Taylor Blau wrote:
>
> > In 9218c6a40c (midx: allow marking a pack as preferred, 2021-03-30), the
> > multi-pack index code learned how to select a pack which all duplicate
> > objects are selected from. That is, if an object appears in multiple
> > packs, select the copy in the preferred pack before breaking ties
> > according to the other rules like pack mtime and readdir() order.
> >
> > Not specifying a preferred pack can cause serious problems with
> > multi-pack reachability bitmaps, because these bitmaps rely on having at
> > least one pack from which all duplicates are selected. Not having such a
> > pack causes problems with the pack reuse code (e.g., like assuming that
> > a base object was sent from that pack via reuse when in fact the base
> > was selected from a different pack).
>
> It might be helpful to use a more descriptive name for "pack reuse code"
> here, since it's kind of vague for people who have not been actively
> working on bitmaps.
>
> I don't have a short name for that chunk of code, but maybe:
>
>   ...causes problems with the code in pack-objects to reuse packs
>   verbatim (e.g., that code assumes that a delta object in a chunk of
>   pack sent verbatim will have its base object sent from the same pack).

Thanks; I like what you wrote here.

> >   - The psuedo pack-order (described in
> >     Documentation/technical/bitmap-format.txt) is computed by
> >     midx_pack_order(), and sorts by pack ID and pack offset, with
> >     preferred packs sorting first.
>
> I think the .rev description in pack-format.txt may be a better
> reference here.

Ditto, I changed that, too.

> >   - But! Pack IDs come from incrementing the pack count in
> >     add_pack_to_midx(), which is a callback to
> >     for_each_file_in_pack_dir(), meaning that pack IDs are assigned in
> >     readdir() order.
> >
> > [ ... ]
>
> This explanation is rather confusing, but I'm not sure if we can do much
> better. I followed all of it, because I was there when we found the bug
> that this is fixing. And of course that happened _after_ we implemented
> midx bitmaps and in particular adapted the verbatim reuse stuff in
> pack-objects to make use of it.
>
> I see why you'd want to float the fix up before then, so we don't ever
> have the broken state. But it's hard to understand what bug this is
> fixing, because the bug does not even exist yet at this point in
> the series!
>
> I dunno. Like I said, I was able to follow it, so maybe it is
> sufficient. I'm just not sure others would be able to.

I think that others will follow it, too. But I agree that it is
confusing, since we're fixing a bug that doesn't yet exist. In reality,
I wrote this patch after sending v1, and then reordered its position to
come before the implementation of MIDX bitmaps for that reason.

So in one sense, I prefer it this way because we don't ever introduce
the bug.  But in another sense, it is very jarring to read about an
interaction that has no basis in the code (yet).

I think that the best thing we could do without adding any significant
reordering would be to just call out the situation we're in. I added
this onto the end of the commit message which I think makes things a
little clearer:

    (Note that multi-pack reachability bitmaps have yet to be
    implemented; so in that sense this patch is fixing a bug which does
    not yet exist.  But by having this patch beforehand, we can prevent
    the bug from ever materializing.)

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 12/24] pack-bitmap.c: introduce 'bitmap_is_preferred_refname()'
  2021-07-21 10:39     ` Jeff King
@ 2021-07-21 20:18       ` Taylor Blau
  0 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-21 20:18 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 06:39:39AM -0400, Jeff King wrote:
> On Mon, Jun 21, 2021 at 06:25:29PM -0400, Taylor Blau wrote:
> > [...]
> >
> > Implement a function 'bitmap_is_preferred_refname()' which does just
> > that. The caller will be added in a subsequent patch.
>
> I suspect there was some patch reordering here. We don't have any
> multi-pack bitmap code yet. :)
>
> Probably this needs to say something like "in preparation for adding
> multi-pack bitmap code..." or similar?

Oops. Good catch!

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 13/24] pack-bitmap: read multi-pack bitmaps
  2021-07-21 11:32     ` Jeff King
@ 2021-07-21 23:01       ` Taylor Blau
  2021-07-23  9:40         ` Jeff King
  2021-07-23 10:00         ` Jeff King
  0 siblings, 2 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-21 23:01 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 07:32:49AM -0400, Jeff King wrote:
> On Mon, Jun 21, 2021 at 06:25:31PM -0400, Taylor Blau wrote:
> > +	if (!is_pack_valid(packfile)) {
> > +		close(fd);
> > +		return -1;
> > +	}
> > +
>
> What's this extra is_pack_valid() doing? I wouldn't expect many changes
> at all to this non-midx code path (aside from the "did we already load a
> midx bitmap" in the earlier part of the hunk, which makes sense).

That looks like a mistake to me. I did a little digging and tried to
remember if it could have ever been useful, but I think that it's just a
stray change that has no value. Removed.

> > -static int load_pack_bitmap(struct bitmap_index *bitmap_git)
> > +static int load_reverse_index(struct bitmap_index *bitmap_git)
> > +{
> > +	if (bitmap_is_midx(bitmap_git)) {
> > +		uint32_t i;
> > +		int ret;
> > +
> > +		ret = load_midx_revindex(bitmap_git->midx);
> > +		if (ret)
> > +			return ret;
> > +
> > +		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
> > +			if (prepare_midx_pack(the_repository, bitmap_git->midx, i))
> > +				die(_("load_reverse_index: could not open pack"));
> > +			ret = load_pack_revindex(bitmap_git->midx->packs[i]);
> > +			if (ret)
> > +				return ret;
> > +		}
> > +		return 0;
> > +	}
> > +	return load_pack_revindex(bitmap_git->pack);
> > +}
>
> OK, this new function is used in load_bitmap(), which is used for both
> pack and midx bitmaps. So if we have a midx bitmap, we'll
> unconditionally load the revindex here. But:
>
>   - why do we then load individual pack revindexes? I can believe it may
>     be necessary to meet the assumptions of some other part of the code,
>     but it would be nice to have a comment giving us some clue.

Good suggestion. We will need to reference the reverse index belonging
to individual packs in a few locations in pack-objects (for e.g.,
write_reuse_object() calls offset_to_pack_pos(), and
pack_pos_to_offset(), both with arbitrary packs, not just the preferred
one).

I left the comment vague; something along the lines of "lots of routines
in pack-objects will need these structures to be ready to use".

I think there's room for improvement there, since for e.g., `git
rev-list --count --objects --use-bitmap-index` doesn't need to load the
reverse indexes. But that's already the case with classic bitmaps, too,
which eagerly call load_pack_revindex().

>   - in open_midx_bitmap_1(), we also unconditionally load the midx
>     reverse index. I think that will always happen before us here (we
>     cannot load_bitmap() a bitmap that has not been opened). So is this
>     load_midx_revindex() call always a noop?

Great catch. I removed the call to load_midx_revindex(), and replaced it
with a comment explaining why we don't need to call it (because we
already did).

> >  static int bitmap_position(struct bitmap_index *bitmap_git,
> >  			   const struct object_id *oid)
> >  {
> > -	int pos = bitmap_position_packfile(bitmap_git, oid);
> > +	int pos;
> > +	if (bitmap_is_midx(bitmap_git))
> > +		pos = bitmap_position_midx(bitmap_git, oid);
> > +	else
> > +		pos = bitmap_position_packfile(bitmap_git, oid);
> >  	return (pos >= 0) ? pos : bitmap_position_extended(bitmap_git, oid);
> >  }
>
> Makes sense. Not new in your patch, but this "int" return is fudging the
> same 32-bit space we were talking about elsewhere (i.e., "pos" really
> could be 2^32, or even more due to extended objects).

:-). It bothers me to no end, too, because of all of the recent effort
to improve the reverse-index APIs to avoid exactly this issue. But I
tend to agree that the concern is more theoretical than anything,
because we're only using the MSB, so the remaining 2^31 possible objects
still seems pretty generous.

> In practice I think even 2^31 objects is pretty out-of-reach, but it may
> be worth changing the return type (and the callers), or even just
> catching the overflow with an assertion.

Possibly, but keep in mind that the former is basically the same
refactor as we did with the "tell me whether this object was found via
this extra pointer". But bitmap_position() has a lot more callers than
that, so the plumbing required would be a little more prevalent.

So I'd be content to just punt on it for now, if you'd be OK with it.

> >  	if (pos < bitmap_num_objects(bitmap_git)) {
> > -		off_t ofs = pack_pos_to_offset(pack, pos);
> > +		struct packed_git *pack;
> > +		off_t ofs;
> > +
> > +		if (bitmap_is_midx(bitmap_git)) {
> > +			uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, pos);
> > +			uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
> > +
> > +			pack = bitmap_git->midx->packs[pack_id];
> > +			ofs = nth_midxed_offset(bitmap_git->midx, midx_pos);
> > +		} else {
> > +			pack = bitmap_git->pack;
> > +			ofs = pack_pos_to_offset(pack, pos);
> > +		}
> > +
>
> All of the hunks like this make perfect sense. The big problem would be
> if we _missed_ a place that needed conversion to handle midx. But the
> nice thing is that it would segfault quickly in such an instance. So
> there I'm mostly relying on test coverage, plus our experience running
> with this code at scale.

Yeah; I'm definitely happy to rely on our experience running this and
related patches at GitHub for several months to give us confidence that
we didn't miss anything here.

> >  static void try_partial_reuse(struct bitmap_index *bitmap_git,
> > +			      struct packed_git *pack,
> >  			      size_t pos,
> >  			      struct bitmap *reuse,
> >  			      struct pack_window **w_curs)
> >  {
> > -	off_t offset, header;
> > +	off_t offset, delta_obj_offset;
>
> I'm OK with all of this in one big patch. But I suspect you _could_
> just put:
>
>   if (bitmap_git->midx)
> 	return; /* partial reuse not implemented for midx yet */
>
> to start with, and then actually implement it later. I call out this
> code in particular just because it's got a lot of subtleties (the
> "reuse" bits are much more intimate with the assumptions of packs and
> bitmaps than most other code).
>
> I'm not sure if it's worth the trouble at this point or not.

Yeah, I'd definitely err on the side of not splitting this up now,
especially since you've already gone through the whole patch and
reviewed it. (Of course, if your response was "this patch is way too
big, please split it up so I can more easily review it", that would be a
different story).

But I appreciate the advice, since I have felt that a lot of these
format-level changes require a 3-patch arc where:

  - The first patch describes the new format in Documentation/technical.
  - The second patch implements support for reading files that are
    written in the new format.
  - And finally, the third patch implements support for writing such
    files.

...and it's usually the second of those three patches that is the most
complicated one by far. So this is a good way to split that patch up
into many pieces.

Of course, that only works if you delay adding tests until after support
is added for all parts of the new format, but that's more-or-less what
did here.

> > +static uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
> > +{
> > +	struct multi_pack_index *m = bitmap_git->midx;
> > +	if (!m)
> > +		BUG("midx_preferred_pack: requires non-empty MIDX");
> > +	return nth_midxed_pack_int_id(m, pack_pos_to_midx(bitmap_git->midx, 0));
> > +}
>
> This part is really subtle. We infer the preferred pack by looking at
> the pack of the 0th bit position. In general that works, since that's
> part of the definition of the preferred pack.
>
> Could this ever be fooled if we had a preferred pack with 0 objects in
> it? I don't know why we would have such a thing, but just trying to
> think of cases where our assumptions might not hold (and what bad things
> could happen).

An empty preferred pack would cause a problem, yes. The solution is
two-fold (and incorporated into the reroll that I plan on sending
shortly):

  - When the user specifies --preferred-pack, the MIDX code must make
    sure that the given pack is non-empty. That's a new patch, and
    basically adds a new conditional (to check the pack itself) and a
    test (to make sure that we catch the case we are trying to prevent).

  - When the user doesn't specify --preferred-pack (and instead asks us
    to infer one for them) we want to select not just the oldest pack,
    but the oldest *non-empty* pack. That is folded into the "midx:
    infer preferred pack when not given one" patch.

In that patch, I made a note, but I think that it's subtle enough to
merit sharing here again. In the loop over all packs, the conditional for swapping out the oldest pack for the current one was something like:

    if (p->mtime < oldest->mtime)
      oldest = p;

but now we want it to be:

    if (!oldest->num_objects || p->mtime < oldest->mtime)
      oldest = p;

to reject packs that have no objects. And we want to be extra careful in
the case where the only pack fed to the MIDX writer was empty. But we
don't have to do anything there, since there are no objects to write
anyway, so any "preferred_idx" would be fine.

> > +	if (bitmap_is_midx(bitmap_git))
> > +		pack = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
> > +	else
> > +		pack = bitmap_git->pack;
> > +	objects_nr = pack->num_objects;
> > +
> >  	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
> >  		i++;
> >
> > -	/* Don't mark objects not in the packfile */
> > +	/*
> > +	 * Don't mark objects not in the packfile or preferred pack. This bitmap
> > +	 * marks objects eligible for reuse, but the pack-reuse code only
> > +	 * understands how to reuse a single pack. Since the preferred pack is
> > +	 * guaranteed to have all bases for its deltas (in a multi-pack bitmap),
> > +	 * we use it instead of another pack. In single-pack bitmaps, the choice
> > +	 * is made for us.
> > +	 */
> >  	if (i > objects_nr / BITS_IN_EWORD)
> >  		i = objects_nr / BITS_IN_EWORD;
>
> OK, so this clamps our "quick" contiguous set of bits to the number of
> objects in the preferred pack. Makes sense. And then we hit the
> object-by-object loop below...
>
> > @@ -1213,7 +1437,15 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
> >  				break;
> >
> >  			offset += ewah_bit_ctz64(word >> offset);
> > -			try_partial_reuse(bitmap_git, pos + offset, reuse, &w_curs);
> > +			if (bitmap_is_midx(bitmap_git)) {
> > +				/*
> > +				 * Can't reuse from a non-preferred pack (see
> > +				 * above).
> > +				 */
> > +				if (pos + offset >= objects_nr)
> > +					continue;
> > +			}
> > +			try_partial_reuse(bitmap_git, pack, pos + offset, reuse, &w_curs);
>
> ...and this likewise makes sure we never go past that first pack. Good.
>
> I think this "continue" could actually be a "break", as the loop is
> iterating over "offset" (and "pos + offset" always gets larger). In
> fact, it could break out of the outer loop as well (which is
> incrementing "pos"). It's probably a pretty small efficiency in
> practice, though.

Yeah; you're right. And we'll save up to BITS_IN_EWORD cycles of this
loop. (I wonder if smart-enough compilers will realize the same
optimization that you did and turn that `continue` into a `break`
automatically, but that's neither here nor there).

> > @@ -1511,8 +1749,13 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
> >  		struct object_id oid;
> >  		struct object_entry *oe;
> >
> > -		nth_packed_object_id(&oid, bitmap_git->pack,
> > -				     pack_pos_to_index(bitmap_git->pack, i));
> > +		if (bitmap_is_midx(bitmap_git))
> > +			nth_midxed_object_oid(&oid,
> > +					      bitmap_git->midx,
> > +					      pack_pos_to_midx(bitmap_git->midx, i));
> > +		else
> > +			nth_packed_object_id(&oid, bitmap_git->pack,
> > +					     pack_pos_to_index(bitmap_git->pack, i));
> >  		oe = packlist_find(mapping, &oid);
>
> Could this be using nth_bitmap_object_oid()? I guess not, because we are
> feeding from pack_pos_to_*. I'm not sure if another helper function is
> worth it (pack_pos_to_bitmap_index() or something?).

You're right that we can't call nth_bitmap_object_oid here directly,
sadly. But I think your suggestion for pack_pos_to_bitmap_index() (or
similar) would only benefit this caller, since most places that dispatch
conditionally to either pack_pos_to_{midx,index} want to pass the result
to a different function depending on which branch they took.

Definitely possible that I missed another case that would help, but that
was what I came up with after just a quick glance.

> > @@ -1575,7 +1831,31 @@ static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
> >  				break;
> >
> >  			offset += ewah_bit_ctz64(word >> offset);
> > -			pos = base + offset;
> > +
> > +			if (bitmap_is_midx(bitmap_git)) {
> > +				uint32_t pack_pos;
> > +				uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, base + offset);
> > +				uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
> > +				off_t offset = nth_midxed_offset(bitmap_git->midx, midx_pos);
> > +
> > +				pack = bitmap_git->midx->packs[pack_id];
> > +
> > +				if (offset_to_pack_pos(pack, offset, &pack_pos) < 0) {
> > +					struct object_id oid;
> > +					nth_midxed_object_oid(&oid, bitmap_git->midx, midx_pos);
> > +
> > +					die(_("could not find %s in pack #%"PRIu32" at offset %"PRIuMAX),
> > +					    oid_to_hex(&oid),
> > +					    pack_id,
> > +					    (uintmax_t)offset);
> > +				}
> > +
> > +				pos = pack_pos;
> > +			} else {
> > +				pack = bitmap_git->pack;
> > +				pos = base + offset;
> > +			}
> > +
> >  			total += pack_pos_to_offset(pack, pos + 1) -
> >  				 pack_pos_to_offset(pack, pos);
> >  		}
>
> In the midx case, we have to go from midx-bitmap-pos to midx-index-pos,
> to then get the pack/ofs combo, which then gives us a real "pos" in the
> pack. I don't think there's a faster way to do it (and this is still
> much faster than looking up objects in the pack only to check their
> revindex).
>
> But then with the result, we compare the offset of "pos" and "pos + 1".
> We need to know "pos" to find "pos + 1". But in the midx case, don't we
> already have the offset of "pos" (it is "offset" in the bitmap_is_midx()
> conditional, which is shadowing the completely unrelated "offset" in the
> outer loop).
>
> We could reuse it, saving ourselves an extra round-trip of pack_pos to
> index_pos to offset. It would just mean stuffing the "total +=" line
> into the two sides of the conditional.

Yep; agreed. And it allows us to clean up a few little other things, so
I squashed it in. Thanks for the suggestion!

> > +off_t bitmap_pack_offset(struct bitmap_index *bitmap_git, uint32_t pos)
> > +{
> > +	if (bitmap_is_midx(bitmap_git))
> > +		return nth_midxed_offset(bitmap_git->midx,
> > +					 pack_pos_to_midx(bitmap_git->midx, pos));
> > +	return nth_packed_object_offset(bitmap_git->pack,
> > +					pack_pos_to_index(bitmap_git->pack, pos));
> > +}
>
> Does anybody call this function? I don't see any users by the end of the
> series.

Nope, great catch. I looked at callers of nth_midxed_offset and
nth_packed_object_offset to see if they could use this function and
weren't, but the spots I looked at didn't appear to be able that they
would be helped by the existence of this helper, so I just removed it.

Phew! This was quite the email to respond to, but I suppose that's my
fault for writing such a monstrous patch. Thank you for taking the time
to read through it all so carefully. I think you got the short end of
the stick between writing this email and responding to it, so thank you
:-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 02/24] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  2021-07-21 17:20       ` Taylor Blau
@ 2021-07-23  7:37         ` Jeff King
  2021-07-26 18:48           ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-23  7:37 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 01:20:47PM -0400, Taylor Blau wrote:

> > > @@ -463,8 +488,11 @@ void bitmap_writer_build(struct packing_data *to_pack)
> > >  		struct commit *child;
> > >  		int reused = 0;
> > >
> > > -		fill_bitmap_commit(ent, commit, &queue, &tree_queue,
> > > -				   old_bitmap, mapping);
> > > +		if (fill_bitmap_commit(ent, commit, &queue, &tree_queue,
> > > +				       old_bitmap, mapping) < 0) {
> > > +			closed = 0;
> > > +			break;
> > > +		}
> > >
> > >  		if (ent->selected) {
> > >  			store_selected(ent, commit);
> >
> > This is the right thing to do because we still want to free memory, stop
> > progress, etc. I gave a look over what will run after breaking out of
> > the loop, and compute_xor_offsets(), which you already handled, is the
> > only thing we'd want to avoid running. Good.
> 
> Right. The key is that we return "closed ? 0 : -1" (of course, being
> careful to invert "closed" where "1" OK into a suitable return value for
> bitmap_writer_build, where "0" means OK, and a negative number means
> "error").
> 
> While I'm thinking about that inversion, we *could* call this variable
> "open" and set it to "0" until proven otherwise. Then the conditional
> becomes "if (!open)", but the return value is still "return open ? -1 :
> 0" (since I assume we'd want to use 0/1 values for "open" instead of -1,
> meaning we'd have to do some translation).

I thought about suggesting that it be called "err" or "ret" or
something. And then we do not have to care that fill_bitmap_commit()
only returns an error in the non-closed state. We are simply propagating
its error-return back up the stack.

And then you can just write:

  ret = fill_bitmap_commit(...);
  if (ret < 0)
	break;

  ...
  return ret;

without an extra conversion. I don't care that much either way, though
(but if you like it and are re-rolling anyway... :) ).

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default
  2021-07-21 17:23         ` Taylor Blau
@ 2021-07-23  7:39           ` Jeff King
  2021-07-26 18:49             ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-23  7:39 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 01:23:34PM -0400, Taylor Blau wrote:

> > I don't know if it's better to have a poorly-formatted HTML file, or
> > none at all. :)
> >
> > Personally, I would just read the source. And I have a slight concern
> > that if we start "cleaning it up" to render as asciidoc, the source
> > might end up a lot less readable (though I'd reserve judgement until
> > actually seeing it).
> 
> Yeah, the actual source is pretty readable (and it's what I had been
> looking at, although it is sometimes convenient to have a version I can
> read in my web browser). But it's definitely not good Asciidoc.
> 
> I briefly considered cleaning it up, but decided against it. Usually I
> would opt to clean it up, but this series is already so large that I
> figured it would make a negative impact on the reviewer experience to
> read a clean-up patch here.
> 
> I wouldn't be opposed to coming back to it in the future, once the dust
> settles. I guess we can consider this #leftoverbits until then.

Yeah, I definitely don't want to see that cleanup as a dependency for
this series. It's already long enough as it is. Coming back to it later
is just fine with me.

The question here is: should we continue to omit it from the html build,
since it does not render well (i.e., should we simply drop this patch).

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 05/24] Documentation: describe MIDX-based bitmaps
  2021-07-21 17:53       ` Taylor Blau
@ 2021-07-23  7:45         ` Jeff King
  0 siblings, 0 replies; 273+ messages in thread
From: Jeff King @ 2021-07-23  7:45 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 01:53:40PM -0400, Taylor Blau wrote:

> > > +	The ordering between packs is done lexicographically by the pack name,
> > > +	with the exception of the preferred pack, which sorts ahead of all other
> > > +	packs.
> >
> > Hmm, I'm not sure if this "lexicographically" part is true. Really we're
> > building on the midx .rev format here. And that says "defined by the
> > MIDX's pack list" (though I can't offhand remember if that is
> > lexicographic, or if it is in the reverse-mtime order).
> >
> > At any rate, should we just be referencing the rev documentation?
> 
> The packs are listed in lex order in the MIDX, but that is so we can
> binary search that list to determine whether a pack is included in the
> MIDX or not.
> 
> I had to check, but we do use the lex order to resolve duplicate
> objects, too. See (at the tip of this branch):
> 
>     QSORT(ctx.info, ctx.nr, pack_info_compare);
> 
> from within midx.c:write_midx_internal(). Here, ctx.info contains the
> list of packs, and pack_info_compare is a thin wrapper around
> strcmp()-ing the pack_name values of two packed_git structures.

Ah, OK, thanks for checking.

> Arguably, you'd get better EWAH compression of the bits between packs
> if we sorted packs in reverse order according to their mtime. But I
> suspect that it doesn't matter much in practice, since the number of
> objects vastly outpaces the number of packs (but I haven't measured to
> be certain, so take that with a grain of salt).

Agreed, especially when the intended use is with geometric repacking to
keep reasonable-sized packs.

Either way, I think heuristics to optimize the pack ordering can easily
come on top later. Let's keep this series focused on the fundamentals of
having midx bitmaps at all.

> In any case, I think that you're right that adding too much detail hurts
> us here, so we should really be mentioning the MIDX's .rev-file
> documentation (unfortunately, we can't linkgit it, so mentioning it by
> name will have to suffice). I plan to reroll with something like this on
> top:
> 
> diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
> index 25221c7ec8..04b3ec2178 100644
> --- a/Documentation/technical/bitmap-format.txt
> +++ b/Documentation/technical/bitmap-format.txt
> @@ -26,9 +26,8 @@ An object is uniquely described by its bit position within a bitmap:
> 
>  		o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)
> 
> -	The ordering between packs is done lexicographically by the pack name,
> -	with the exception of the preferred pack, which sorts ahead of all other
> -	packs.
> +	The ordering between packs is done according to the MIDX's .rev file.
> +	Notably, the preferred pack sorts ahead of all other packs.
> 
>  The on-disk representation (described below) of a bitmap is the same regardless
>  of whether or not that bitmap belongs to a packfile or a MIDX. The only

Thanks, that looks much better. We can't linkgit, but we only build HTML
for these. So just a link to pack-format.html would work, as they'd
generally be found side-by-side in the filesystem. But since this
doesn't even really render as asciidoc, I'm not sure I care either way.
(Obviously we could also mention pack-format.txt by name, but it's
probably already obvious-ish to a human that this is where you'd find
information on the pack .rev format).

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing
  2021-07-21 19:22       ` Taylor Blau
@ 2021-07-23  8:29         ` Jeff King
  2021-07-26 18:59           ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-23  8:29 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 03:22:34PM -0400, Taylor Blau wrote:

> > > This avoids a problem that would arise in subsequent patches due to the
> > > combination of 'git repack' reopening the object store in-process and
> > > the multi-pack index code not checking whether a pack already exists in
> > > the object store when calling add_pack_to_midx().
> > >
> > > This would ultimately lead to a cycle being created along the
> > > 'packed_git' struct's '->next' pointer. That is obviously bad, but it
> > > has hard-to-debug downstream effects like saying a bitmap can't be
> > > loaded for a pack because one already exists (for the same pack).
> >
> > I'm not sure I completely understand the bug that this causes.
> 
> Off-hand, I can't quite remember either. But it is important; I do have
> a distinct memory of dropping this patch and then watching a 'git repack
> --write-midx' (that option will be introduced in a later series) fail
> horribly.
> 
> If I remember correctly, the bug has to do with loading a MIDX twice in
> the same process. When we call add_packed_git() from within
> prepare_midx_pack(), we load the pack without caring whether or not it's
> already loaded. So loading a MIDX twice in the same process will fail.
> 
> So really I think that this is papering over that bug: we're just
> removing one of the times that we happened to load a MIDX from during
> the writing phase.

Hmm, after staring at this for a bit, I've unconfused and re-confused
myself several times.

Here are some interesting bits:

  - calling load_multi_pack_index() directly creates a new midx object.
    None of its m->packs[] array will be filled in. Nor is it reachable
    as r->objects->multi_pack_index.

  - in using that midx, we end up calling prepare_midx_pack() for
    various packs, which creates a new packed_git struct and adds it to
    r->objects->packed_git (via install_packed_git()).

So that's a bit weird already, because we have packed_git structs in
r->objects that came from a midx that isn't r->objects->multi_pack_index.
And then if we later call prepare_multi_pack_index(), for example as
part of a pack reprepare, then we'd end up with duplicates.

Whereas normally, when a direct load_multi_pack_index() was not called,
our only midx would be r->objects->multi_pack_index, and so we'd avoid
re-loading it.

That seems wrong and wasteful, but I don't see how it results in a
circular linked list. And it seems like it would already be the case for
this write path, independent of your series. Either way, the solution is
probably for prepare_midx_pack() to check for duplicates (which we can
do pretty cheaply these days due to the hashmap; see prepare_pack).

But I'm worried there is something else going on. Your commit message
mentions add_pack_to_midx(). That's something we call as part of
write_midx_internal(), and it does create other packed_git structs. But
it never calls install_packed_git() on them; they just live in the
write_midx_context. So I'm not sure how they'd interfere with things.

And then there's one final oddity. Your patch assigns to ctx.m from
r->objects->multi_pack_index. But later in write_midx_internal(), we
call close_midx(). In the original, it's in the middle of the function,
but one of your patches puts it at the end of the function. But that
means we are closing r->objects->multi_pack_index.

Looking at close_midx(), it does not actually zero the struct. So we'd
still have r->objects->multi_pack_index->data pointed to memory which
has been unmapped. That seems like an accident waiting to happen. I
guess it doesn't usually cause problems because we'd typically write a
midx near the end of the process, and then not look up other objects?

So I'm concerned this is introducing a subtle bug that will bite us
later. And we should figure out what the actual thing it's fixing is, so
we can understand if there is a better way to fix it (e.g., by removing
duplicates in prepare_midx_pack(), or if it is some interaction with the
writing code).

I guess a good thing to try would be dropping this patch and seeing if
the tests break. ;)

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 09/24] midx: infer preferred pack when not given one
  2021-07-21 20:16       ` Taylor Blau
@ 2021-07-23  8:50         ` Jeff King
  2021-07-26 19:44           ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-23  8:50 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 04:16:07PM -0400, Taylor Blau wrote:

> > I dunno. Like I said, I was able to follow it, so maybe it is
> > sufficient. I'm just not sure others would be able to.
> 
> I think that others will follow it, too. But I agree that it is
> confusing, since we're fixing a bug that doesn't yet exist. In reality,
> I wrote this patch after sending v1, and then reordered its position to
> come before the implementation of MIDX bitmaps for that reason.
> 
> So in one sense, I prefer it this way because we don't ever introduce
> the bug.  But in another sense, it is very jarring to read about an
> interaction that has no basis in the code (yet).
> 
> I think that the best thing we could do without adding any significant
> reordering would be to just call out the situation we're in. I added
> this onto the end of the commit message which I think makes things a
> little clearer:
> 
>     (Note that multi-pack reachability bitmaps have yet to be
>     implemented; so in that sense this patch is fixing a bug which does
>     not yet exist.  But by having this patch beforehand, we can prevent
>     the bug from ever materializing.)

I do like fixing it up front. Here's my attempt at rewriting the commit
message. I tried to omit details about pack order, and instead refer to
the revindex code, and instead add more explanation of how this relates
to the pack-reuse code.

Something like:

  In 9218c6a40c (midx: allow marking a pack as preferred, 2021-03-30),
  the multi-pack index code learned how to select a pack which all
  duplicate objects are selected from. That is, if an object appears in
  multiple packs, select the copy in the preferred pack before using one
  from any other pack.

  Later in that same series, 38ff7cabb6 (pack-revindex: write multi-pack
  reverse indexes, 2021-03-30) learned to put the preferred pack at the
  start of the pack order when generating a midx ".rev" file. So far,
  that .rev ordering has not mattered. But it will be very important
  once we start using the .rev ordering for midx bitmaps.

  There is code in pack-objects to reuse pack bytes verbatim when
  bitmaps tell us a significant portion of the beginning of the code
  should be in the output. This code relies on the pack mentioned by the
  0th bit also being the pack that is preferred for duplicates (because
  we'd want to make sure both bases and deltas come from the same pack).
  For a pack .bitmap, this is trivially correct. For a midx bitmap, it
  is only true when some pack gets both duplicate-priority and is placed
  at the front of the .rev file. I.e., there must be _some_ preferred
  pack.

  So if the user did not specify a preferred pack, we pick one
  arbitrarily.

  There's no test here for a few reasons:

    - the midx bitmap feature does not yet exist; this is preemptively
      fixing a problem before introducing buggy code

    - whether things go wrong with the current rules depends on things
      like readdir() order, since that is used for some midx pack
      ordering. So any test might happen to succeed or fail based on
      factors outside of our control.

Thoughts?

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 13/24] pack-bitmap: read multi-pack bitmaps
  2021-07-21 23:01       ` Taylor Blau
@ 2021-07-23  9:40         ` Jeff King
  2021-07-23 10:00         ` Jeff King
  1 sibling, 0 replies; 273+ messages in thread
From: Jeff King @ 2021-07-23  9:40 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 07:01:16PM -0400, Taylor Blau wrote:

> On Wed, Jul 21, 2021 at 07:32:49AM -0400, Jeff King wrote:
> > On Mon, Jun 21, 2021 at 06:25:31PM -0400, Taylor Blau wrote:
> > > +	if (!is_pack_valid(packfile)) {
> > > +		close(fd);
> > > +		return -1;
> > > +	}
> > > +
> >
> > What's this extra is_pack_valid() doing? I wouldn't expect many changes
> > at all to this non-midx code path (aside from the "did we already load a
> > midx bitmap" in the earlier part of the hunk, which makes sense).
> 
> That looks like a mistake to me. I did a little digging and tried to
> remember if it could have ever been useful, but I think that it's just a
> stray change that has no value. Removed.

This turned out to be quite interesting. It _is_ a mistake to include it
in this series. But it turns out to be quite valuable on its own. :)

I just cleaned it up and sent it as its own separate patch:

  https://lore.kernel.org/git/YPqL%2FpZt6hNYN4hB@coredump.intra.peff.net/

So it's a happy accident that your series called attention to it. :)

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 13/24] pack-bitmap: read multi-pack bitmaps
  2021-07-21 23:01       ` Taylor Blau
  2021-07-23  9:40         ` Jeff King
@ 2021-07-23 10:00         ` Jeff King
  2021-07-26 20:36           ` Taylor Blau
  1 sibling, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-23 10:00 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 07:01:16PM -0400, Taylor Blau wrote:

> > OK, this new function is used in load_bitmap(), which is used for both
> > pack and midx bitmaps. So if we have a midx bitmap, we'll
> > unconditionally load the revindex here. But:
> >
> >   - why do we then load individual pack revindexes? I can believe it may
> >     be necessary to meet the assumptions of some other part of the code,
> >     but it would be nice to have a comment giving us some clue.
> 
> Good suggestion. We will need to reference the reverse index belonging
> to individual packs in a few locations in pack-objects (for e.g.,
> write_reuse_object() calls offset_to_pack_pos(), and
> pack_pos_to_offset(), both with arbitrary packs, not just the preferred
> one).
> 
> I left the comment vague; something along the lines of "lots of routines
> in pack-objects will need these structures to be ready to use".

Makes sense. I think we _could_ be lazy-loading them, but IIRC only some
of the revindex functions are happy to lazy-load. It's definitely fine
to punt on that for now with a comment.

> I think there's room for improvement there, since for e.g., `git
> rev-list --count --objects --use-bitmap-index` doesn't need to load the
> reverse indexes. But that's already the case with classic bitmaps, too,
> which eagerly call load_pack_revindex().

Right. I think our solution there was to make loading the revindex
really cheap (open+mmap, rather than the in-core generation). I'm
definitely happy to call that fast enough for now, and if somebody wants
to benchmark and micro-optimize cases where we can avoid loading them,
we can do that later.

> > In practice I think even 2^31 objects is pretty out-of-reach, but it may
> > be worth changing the return type (and the callers), or even just
> > catching the overflow with an assertion.
> 
> Possibly, but keep in mind that the former is basically the same
> refactor as we did with the "tell me whether this object was found via
> this extra pointer". But bitmap_position() has a lot more callers than
> that, so the plumbing required would be a little more prevalent.
> 
> So I'd be content to just punt on it for now, if you'd be OK with it.

Yeah, I think it's fine to leave it out of this series. It's not new,
and we can revisit it later.

> > Could this ever be fooled if we had a preferred pack with 0 objects in
> > it? I don't know why we would have such a thing, but just trying to
> > think of cases where our assumptions might not hold (and what bad things
> > could happen).
> 
> An empty preferred pack would cause a problem, yes. The solution is
> two-fold (and incorporated into the reroll that I plan on sending
> shortly):
> 
>   - When the user specifies --preferred-pack, the MIDX code must make
>     sure that the given pack is non-empty. That's a new patch, and
>     basically adds a new conditional (to check the pack itself) and a
>     test (to make sure that we catch the case we are trying to prevent).
> 
>   - When the user doesn't specify --preferred-pack (and instead asks us
>     to infer one for them) we want to select not just the oldest pack,
>     but the oldest *non-empty* pack. That is folded into the "midx:
>     infer preferred pack when not given one" patch.

Oh good, I said something useful. ;) The fix you outlined sounds
sensible.

> > > +			if (bitmap_is_midx(bitmap_git)) {
> > > +				/*
> > > +				 * Can't reuse from a non-preferred pack (see
> > > +				 * above).
> > > +				 */
> > > +				if (pos + offset >= objects_nr)
> > > +					continue;
> > > +			}
> > > +			try_partial_reuse(bitmap_git, pack, pos + offset, reuse, &w_curs);
> >
> > ...and this likewise makes sure we never go past that first pack. Good.
> >
> > I think this "continue" could actually be a "break", as the loop is
> > iterating over "offset" (and "pos + offset" always gets larger). In
> > fact, it could break out of the outer loop as well (which is
> > incrementing "pos"). It's probably a pretty small efficiency in
> > practice, though.
> 
> Yeah; you're right. And we'll save up to BITS_IN_EWORD cycles of this
> loop. (I wonder if smart-enough compilers will realize the same
> optimization that you did and turn that `continue` into a `break`
> automatically, but that's neither here nor there).

If you break all the way out, then it saves iterating over all of those
other words that are not in the first pack, too. I.e., if your bitmap
has 10 million bits (for a 10-million object clone), but your first pack
only has a million objects in it, we'll call try_partial_reuse() 9
million extra times.

Fortunately, each call is super cheap, because the first thing it does
is check if the requested bit is past the end of the pack. Which kind of
makes me wonder if we could simplify this further by just letting
try_partial_reuse() tell us when there's no point going further:

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 02948e8c78..b84b55c4f3 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1308,7 +1308,11 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 	return NULL;
 }
 
-static void try_partial_reuse(struct bitmap_index *bitmap_git,
+/*
+ * -1 means "stop trying further objects"; 0 means we may or may not have
+ * reused, but you can keep feeding bits.
+ */
+static int try_partial_reuse(struct bitmap_index *bitmap_git,
 			      struct packed_git *pack,
 			      size_t pos,
 			      struct bitmap *reuse,
@@ -1342,12 +1346,12 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 	 */
 
 	if (pos >= pack->num_objects)
-		return; /* not actually in the pack or MIDX preferred pack */
+		return -1; /* not actually in the pack or MIDX preferred pack */
 
 	offset = delta_obj_offset = pack_pos_to_offset(pack, pos);
 	type = unpack_object_header(pack, w_curs, &offset, &size);
 	if (type < 0)
-		return; /* broken packfile, punt */
+		return -1; /* broken packfile, punt */
 
 	if (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA) {
 		off_t base_offset;
@@ -1364,9 +1368,9 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 		base_offset = get_delta_base(pack, w_curs, &offset, type,
 					     delta_obj_offset);
 		if (!base_offset)
-			return;
+			return 0;
 		if (offset_to_pack_pos(pack, base_offset, &base_pos) < 0)
-			return;
+			return 0;
 
 		/*
 		 * We assume delta dependencies always point backwards. This
@@ -1378,7 +1382,7 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 		 * odd parameters.
 		 */
 		if (base_pos >= pos)
-			return;
+			return 0;
 
 		/*
 		 * And finally, if we're not sending the base as part of our
@@ -1389,13 +1393,14 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 		 * object_entry code path handle it.
 		 */
 		if (!bitmap_get(reuse, base_pos))
-			return;
+			return 0;
 	}
 
 	/*
 	 * If we got here, then the object is OK to reuse. Mark it.
 	 */
 	bitmap_set(reuse, pos);
+	return 0;
 }
 
 static uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
@@ -1449,22 +1454,20 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	for (; i < result->word_alloc; ++i) {
 		eword_t word = result->words[i];
 		size_t pos = (i * BITS_IN_EWORD);
+		int ret;
 
 		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
 			if ((word >> offset) == 0)
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
-			if (bitmap_is_midx(bitmap_git)) {
-				/*
-				 * Can't reuse from a non-preferred pack (see
-				 * above).
-				 */
-				if (pos + offset >= objects_nr)
-					continue;
-			}
-			try_partial_reuse(bitmap_git, pack, pos + offset, reuse, &w_curs);
+			ret = try_partial_reuse(bitmap_git, pack, pos + offset,
+						reuse, &w_curs);
+			if (ret < 0)
+				break;
 		}
+		if (ret < 0)
+			break;
 	}
 
 	unuse_pack(&w_curs);

The double-ret check is kind of ugly, though I suspect compilers
optimize it pretty well. The alternative is a "goto" to a label just
past the loop (also ugly, but easily explained with a comment).

> > > @@ -1511,8 +1749,13 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
> > >  		struct object_id oid;
> > >  		struct object_entry *oe;
> > >
> > > -		nth_packed_object_id(&oid, bitmap_git->pack,
> > > -				     pack_pos_to_index(bitmap_git->pack, i));
> > > +		if (bitmap_is_midx(bitmap_git))
> > > +			nth_midxed_object_oid(&oid,
> > > +					      bitmap_git->midx,
> > > +					      pack_pos_to_midx(bitmap_git->midx, i));
> > > +		else
> > > +			nth_packed_object_id(&oid, bitmap_git->pack,
> > > +					     pack_pos_to_index(bitmap_git->pack, i));
> > >  		oe = packlist_find(mapping, &oid);
> >
> > Could this be using nth_bitmap_object_oid()? I guess not, because we are
> > feeding from pack_pos_to_*. I'm not sure if another helper function is
> > worth it (pack_pos_to_bitmap_index() or something?).
> 
> You're right that we can't call nth_bitmap_object_oid here directly,
> sadly. But I think your suggestion for pack_pos_to_bitmap_index() (or
> similar) would only benefit this caller, since most places that dispatch
> conditionally to either pack_pos_to_{midx,index} want to pass the result
> to a different function depending on which branch they took.
> 
> Definitely possible that I missed another case that would help, but that
> was what I came up with after just a quick glance.

Yeah, looking around, I don't see another opportunity. So the benefits
are pretty minimal. We could do:

  index_pos = bitmap_is_midx(bitmap_git) ?
              pack_pos_to_midx(bitmap_git->midx, i) :
	      pack_pos_to_index(bitmap_git->pack, i);
  nth_bitmap_object_oid(&oid, bitmap_git, index_pos);

but that is not buying much. I'm content to leave it.

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 14/24] pack-bitmap: write multi-pack bitmaps
  2021-07-21 12:09     ` Jeff King
@ 2021-07-26 18:12       ` Taylor Blau
  2021-07-26 18:23         ` Taylor Blau
  2021-07-27 17:11         ` Jeff King
  0 siblings, 2 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-26 18:12 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Wed, Jul 21, 2021 at 08:09:19AM -0400, Jeff King wrote:
> On Mon, Jun 21, 2021 at 06:25:34PM -0400, Taylor Blau wrote:
>
> > +static int add_ref_to_pending(const char *refname,
> > +			      const struct object_id *oid,
> > +			      int flag, void *cb_data)
> > +{
> > +	struct rev_info *revs = (struct rev_info*)cb_data;
> > +	struct object *object;
> > +
> > +	if ((flag & REF_ISSYMREF) && (flag & REF_ISBROKEN)) {
> > +		warning("symbolic ref is dangling: %s", refname);
> > +		return 0;
> > +	}
> > +
> > +	object = parse_object_or_die(oid, refname);
> > +	if (object->type != OBJ_COMMIT)
> > +		return 0;
> > +
> > +	add_pending_object(revs, object, "");
> > +	if (bitmap_is_preferred_refname(revs->repo, refname))
> > +		object->flags |= NEEDS_BITMAP;
> > +	return 0;
> > +}
>
> OK, so we'll look at each ref to get the set of commits that we want to
> traverse to put into the bitmap. Which is roughly the same as what the
> pack bitmap does. We only generate bitmaps for all-into-one repacks, so
> it is traversing all of the reachable objects. It is a little different
> in that the pack version is probably hitting reflogs, but IMHO we are
> better off to ignore reflogs for the purposes of bitmaps (I would
> suggest to do so in the pack-bitmap case, too, except that it is
> combined with the "what to pack" traversal there, and by the time we see
> each commit we don't know how we got there).

Right. And we might end up ignoring a lot of these commits, too: the
for-each-ref is just a starting point to enumerate everything, but we
only care about parts of the object graph that are contained in a pack
which is included in the MIDX we are writing (hence the bare "return"
you're commenting around below).

> > +static void bitmap_show_commit(struct commit *commit, void *_data)
> > +{
> > +	struct bitmap_commit_cb *data = _data;
> > +	if (oid_pos(&commit->object.oid, data->ctx->entries,
> > +		    data->ctx->entries_nr,
> > +		    bitmap_oid_access) > -1) {
>
> This "> -1" struck me as a little bit funny. Perhaps ">= 0" would be a
> more obvious way of saying "we found it"?

Sure. (I looked for other uses of oid_pos() to see what is more
common, but there really are vanishingly few uses.) Easier to read might
even be:

    int pos = oid_pos(...);
    if (pos < 0)
      return;
    ALLOC_GROW(...);

which is what I ended up going for.

> > +	/*
> > +	 * Pass selected commits in topo order to match the behavior of
> > +	 * pack-bitmaps when configured with delta islands.
> > +	 */
> > +	revs.topo_order = 1;
> > +	revs.sort_order = REV_SORT_IN_GRAPH_ORDER;
>
> Hmm. Why do we want to match this side effect of delta islands here?
>
> The only impact this has is on the order of commits we feed for bitmap
> selection (and during the actual generation phase, it may impact
> visitation order).
>
> Now I'm of the opinion that topo order is probably the best thing for
> bitmap generation (since the bitmaps themselves are connected to the
> graph structure). But if it is the best thing, shouldn't we perhaps be
> turning on topo-order for single-pack bitmaps, too?
>
> And if it isn't the best thing, then why would we want it here?

Heh, you were the one that suggested I bring this over to MIDX-based
bitmaps in the first place ;).

This comes from an investigation into why bitmap coverage had worsened
for some repositories using MIDX bitmaps at GitHub. The real reason was
resolved and unrelated to this, but trying to match the behavior of MIDX
bitmaps to our existing pack bitmap setup (which uses delta-islands) was
one strategy we tried while debugging.

I actually suspect that it doesn't really matter what order we feed this
list to bitmap_writer_select_commits() in, because the first thing that
it does is QSORT() the incoming list of commits in date order.

But it does mirror the behavior of our previous bitmap generation
settings, which has been running for years.

So... we could probably drop this hunk? I'd probably rather err on the
safe side and leave this alone since it matches a system that we know to
work well in practice.

> > +	if (prepare_revision_walk(&revs))
> > +		die(_("revision walk setup failed"));
>
> We call init_revisions(), and then go straight to
> prepare_revision_walk() with no call to setup_revisions() between. It
> doesn't seem to be clearly documented, but I think you're supposed to,
> as it finalizes some bits like diff_setup_done().
>
> I suspect it works OK in practice, and I did find a few other spots that
> do not call it (e.g., builtin/am.c:write_commit_patch). But most spots
> do at least an empty setup_revisions(0, NULL, &rev, NULL).

Sure, thanks.

> > +	/*
> > +	 * Build the MIDX-order index based on pdata.objects (which is already
> > +	 * in MIDX order; c.f., 'midx_pack_order_cmp()' for the definition of
> > +	 * this order).
> > +	 */
> > +	ALLOC_ARRAY(index, pdata.nr_objects);
> > +	for (i = 0; i < pdata.nr_objects; i++)
> > +		index[i] = (struct pack_idx_entry *)&pdata.objects[i];
>
> This cast is correct because the pack_idx_entry is at the start of each
> object_entry. But maybe:
>
>   index[i] = &pdata.objects[i].idx;
>
> would be less scary looking?

Definitely, and thanks (for this spot and the other one you mentioned).

> > +	bitmap_writer_select_commits(commits, commits_nr, -1);
>
> Not related to your patch, but I had to refresh my memory on what this
> "-1" was for. It's "max_bitmaps", and is ignored if it's negative. But
> the only callers pass "-1"! So we could get rid of it entirely.
>
> It probably makes sense to leave that cleanup out of this
> already-complicated series. But maybe worth doing later on top.

Yeah, seems like an easy topic for somebody interested in any
#leftoverbits could pick up. Once this lands, I'll be happy to take care
of it myself, too.

> > @@ -930,9 +1100,16 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
> >  		for (i = 0; i < ctx.m->num_packs; i++) {
> >  			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
> >
> > +			if (prepare_midx_pack(the_repository, ctx.m, i)) {
> > +				error(_("could not load pack %s"),
> > +				      ctx.m->pack_names[i]);
> > +				result = 1;
> > +				goto cleanup;
> > +			}
>
> It might be worth a comment here. I can easily believe that there is
> some later part of the bitmap generation code that assumes the packs are
> loaded. But somebody reading this is not likely to understand why it's
> here.
>
> Should this be done conditionally only if we're writing a bitmap? (That
> might also make it obvious why we are doing it).

Ah. Actually, I don't think this was necessary before, but we *do* need
it now because we want to compare the pack mtime's for inferring a
preferred pack when one wasn't given. And we also need to open the pack
indexes, too, because we care about the object counts (to make sure that
we don't infer a preferred pack which has no objects).

Luckily, any new packs will be loaded (and likewise have their indexes
open, too), via the the add_pack_to_midx() callback that we pass as an
argument to for_each_file_in_pack_dir().

But we could do something like this instead:

--- 8< ---

diff --git a/midx.c b/midx.c
index 8426e1a0b1..a70a6bca81 100644
--- a/midx.c
+++ b/midx.c
@@ -1111,16 +1111,29 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		for (i = 0; i < ctx.m->num_packs; i++) {
 			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);

-			if (prepare_midx_pack(the_repository, ctx.m, i)) {
-				error(_("could not load pack"));
-				result = 1;
-				goto cleanup;
-			}
-
 			ctx.info[ctx.nr].orig_pack_int_id = i;
 			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
-			ctx.info[ctx.nr].p = ctx.m->packs[i];
+			ctx.info[ctx.nr].p = NULL;
 			ctx.info[ctx.nr].expired = 0;
+
+			if (flags & MIDX_WRITE_REV_INDEX) {
+				/*
+				 * If generating a reverse index, need to have
+				 * packed_git's loaded to compare their
+				 * mtimes and object count.
+				 */
+				if (prepare_midx_pack(the_repository, ctx.m, i)) {
+					error(_("could not load pack"));
+					result = 1;
+					goto cleanup;
+				}
+
+				if (open_pack_index(ctx.m->packs[i]))
+					die(_("could not open index for %s"),
+					    ctx.m->packs[i]->pack_name);
+				ctx.info[ctx.nr].p = ctx.m->packs[i];
+			}
+
 			ctx.nr++;
 		}
 	}

--- >8 ---

> > @@ -1075,9 +1271,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
> >  	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
> >  	f = hashfd(get_lock_file_fd(&lk), get_lock_file_path(&lk));
> >
> > -	if (ctx.m)
> > -		close_midx(ctx.m);
> > -
> >  	if (ctx.nr - dropped_packs == 0) {
> >  		error(_("no pack files to index."));
> >  		result = 1;
>
> I'm not sure what this hunk is doing. We do pick up the close_midx()
> call at the end of the function, amidst the other cleanup.
>
> I expect the answer is something like "we need it open when we generate
> the bitmaps". But it makes me wonder if we could hit any cases where we
> try to overwrite it while it's still open, which would cause problems on
> Windows.

The reason is kind of annoying. If we're building a MIDX bitmap
in-process (e.g., `git multi-pack-index write --bitmap`) then we'll call
prepare_packed_git() to build our pseudo-packing list which we pass to
the bitmap generation machinery.

But prepare_packed_git() calls prepare_packed_git_one() ->
for_each_file_in_pack_dir() with the prepare_pack() callback -> which
the wants to see if the MIDX we have open already knows about a given
pack so we avoid opening it twice.

But even though the MIDX would have gone away by this point (with the
previous close_midx() call that is removed above), we still hold onto
a pointer to it via the object_store's `multi_pack_index` pointer. And
then all the way down in packfile.c:prepare_pack() we try to pass a
now-defunct pointer as the first argument to midx_contains_pack(), and
crash.

And clearing out that `multi_pack_index` pointer is tricky, because the
MIDX would have to compare the odb's `object_dir` with its own (which is
brittle in its own right), but also would have to see if that object
store is pointing at *it*, and not some other MIDX.

So we do have to keep it open there. Which makes me wonder how this
could possibly work on Windows, because holding the MIDX open will make
the commit_lock_file() definitely fail. But it seems OK in the
Windows-based CI runs?

Puzzled.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 14/24] pack-bitmap: write multi-pack bitmaps
  2021-07-26 18:12       ` Taylor Blau
@ 2021-07-26 18:23         ` Taylor Blau
  2021-07-27 17:11         ` Jeff King
  1 sibling, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-26 18:23 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jul 26, 2021 at 02:12:30PM -0400, Taylor Blau wrote:
> So we do have to keep it open there. Which makes me wonder how this
> could possibly work on Windows, because holding the MIDX open will make
> the commit_lock_file() definitely fail. But it seems OK in the
> Windows-based CI runs?
>
> Puzzled.

The below should do the trick; it'll keep the MIDX open just long enough
to generate a bitmap (if one was requested), but will close any
handle(s) on an existing MIDX right before we move the temporary file
into place.

It has the added benefit of making that hunk about destroying stale
references to packs be unnecessary.

Watching the Actions run here to see how this runs on Windows:

    https://github.com/ttaylorr/git/actions/runs/1068457013

Below is the patch.

--- >8 ---

commit c7b7ce0ebc793e311072929772a2d352600f3d54
Author: Taylor Blau <me@ttaylorr.com>
Date:   Mon Jul 26 14:17:27 2021 -0400

    fixup! pack-bitmap: write multi-pack bitmaps

diff --git a/midx.c b/midx.c
index 76c94a0df2..297627f992 100644
--- a/midx.c
+++ b/midx.c
@@ -1358,6 +1358,8 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		}
 	}

+	close_midx(ctx.m);
+
 	commit_lock_file(&lk);

 	clear_midx_files_ext(the_repository, ".bitmap", midx_hash);
@@ -1368,15 +1370,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		if (ctx.info[i].p) {
 			close_pack(ctx.info[i].p);
 			free(ctx.info[i].p);
-			if (ctx.m) {
-				/*
-				 * Destroy a stale reference to the pack in
-				 * 'ctx.m'.
-				 */
-				uint32_t orig = ctx.info[i].orig_pack_int_id;
-				if (orig < ctx.m->num_packs)
-					ctx.m->packs[orig] = NULL;
-			}
 		}
 		free(ctx.info[i].pack_name);
 	}
@@ -1386,7 +1379,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	free(ctx.pack_perm);
 	free(ctx.pack_order);
 	free(midx_name);
-	close_midx(ctx.m);

 	return result;
 }

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 02/24] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  2021-07-23  7:37         ` Jeff King
@ 2021-07-26 18:48           ` Taylor Blau
  2021-07-27 17:11             ` Jeff King
  0 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-07-26 18:48 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Fri, Jul 23, 2021 at 03:37:52AM -0400, Jeff King wrote:
> I thought about suggesting that it be called "err" or "ret" or
> something. And then we do not have to care that fill_bitmap_commit()
> only returns an error in the non-closed state. We are simply propagating
> its error-return back up the stack.

Hmm. For whatever the inconvience costs us, I do like that the variable
can be named specifically like "open" or "closed" as opposed to the more
generic "err" or "ret".

So I'll probably keep it is unless you feel strongly (which I suspect
you do not).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 04/24] Documentation: build 'technical/bitmap-format' by default
  2021-07-23  7:39           ` Jeff King
@ 2021-07-26 18:49             ` Taylor Blau
  0 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-26 18:49 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Fri, Jul 23, 2021 at 03:39:25AM -0400, Jeff King wrote:
> The question here is: should we continue to omit it from the html build,
> since it does not render well (i.e., should we simply drop this patch).

I think that's a nice way of putting it. Since the HTML rendering is
terrible, let's just drop this patch and leave cleaning it up as
#leftoverbits.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing
  2021-07-23  8:29         ` Jeff King
@ 2021-07-26 18:59           ` Taylor Blau
  2021-07-26 22:14             ` Taylor Blau
  2021-07-27 17:17             ` Jeff King
  0 siblings, 2 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-26 18:59 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Fri, Jul 23, 2021 at 04:29:37AM -0400, Jeff King wrote:
> On Wed, Jul 21, 2021 at 03:22:34PM -0400, Taylor Blau wrote:
>
> > > > This avoids a problem that would arise in subsequent patches due to the
> > > > combination of 'git repack' reopening the object store in-process and
> > > > the multi-pack index code not checking whether a pack already exists in
> > > > the object store when calling add_pack_to_midx().
> > > >
> > > > This would ultimately lead to a cycle being created along the
> > > > 'packed_git' struct's '->next' pointer. That is obviously bad, but it
> > > > has hard-to-debug downstream effects like saying a bitmap can't be
> > > > loaded for a pack because one already exists (for the same pack).
> > >
> > > I'm not sure I completely understand the bug that this causes.
> >
> > Off-hand, I can't quite remember either. But it is important; I do have
> > a distinct memory of dropping this patch and then watching a 'git repack
> > --write-midx' (that option will be introduced in a later series) fail
> > horribly.
> >
> > If I remember correctly, the bug has to do with loading a MIDX twice in
> > the same process. When we call add_packed_git() from within
> > prepare_midx_pack(), we load the pack without caring whether or not it's
> > already loaded. So loading a MIDX twice in the same process will fail.
> >
> > So really I think that this is papering over that bug: we're just
> > removing one of the times that we happened to load a MIDX from during
> > the writing phase.
>
> Hmm, after staring at this for a bit, I've unconfused and re-confused
> myself several times.
>
> Here are some interesting bits:
>
>   - calling load_multi_pack_index() directly creates a new midx object.
>     None of its m->packs[] array will be filled in. Nor is it reachable
>     as r->objects->multi_pack_index.
>
>   - in using that midx, we end up calling prepare_midx_pack() for
>     various packs, which creates a new packed_git struct and adds it to
>     r->objects->packed_git (via install_packed_git()).
>
> So that's a bit weird already, because we have packed_git structs in
> r->objects that came from a midx that isn't r->objects->multi_pack_index.
> And then if we later call prepare_multi_pack_index(), for example as
> part of a pack reprepare, then we'd end up with duplicates.

Ah, this jogged my memory: this is a relic from when we generated MIDX
bitmaps in-process with the rest of the `repack` code. And when we did
that, we did have to call `reprepare_packed_git()` after writing the new
packs but before moving them into place.

So that's where the `reprepare_packed_git()` came from, but we don't
have any of that code anymore, since we now generate MIDX bitmaps by
invoking:

    git multi-pack-index write --bitmap --stdin-packs --refs-snapshot

as a sub-process of `git repack`; no need for any reprepare which is
what was triggering this bug.

To be sure, I reverted this patch out of GitHub's fork, and reran the
tests both in normal mode (just `make test`) and then once more with the
`GIT_TEST_MULTI_PACK_INDEX{,_WRITE_BITMAP}` environment variables set.
Unsurprisingly, it passed both times.

I'm happy to keep digging further, but I think that I'm 99% satisfied
here. Digging further involves resurrecting a much older version of this
series (and others adjacent to it), and there are probably other bugs
lurking that would be annoying to tease out.

In any case, let's drop this patch from the series. It's disappointing
that we can't run:

    git -c core.multiPackIndex= multi-pack-index write

anymore, but I guess that's no worse than the state we were in before
this patch, so I'm content to let it live on.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 09/24] midx: infer preferred pack when not given one
  2021-07-23  8:50         ` Jeff King
@ 2021-07-26 19:44           ` Taylor Blau
  0 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-26 19:44 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Fri, Jul 23, 2021 at 04:50:31AM -0400, Jeff King wrote:
> On Wed, Jul 21, 2021 at 04:16:07PM -0400, Taylor Blau wrote:
>
> > > I dunno. Like I said, I was able to follow it, so maybe it is
> > > sufficient. I'm just not sure others would be able to.
> >
> > I think that others will follow it, too. But I agree that it is
> > confusing, since we're fixing a bug that doesn't yet exist. In reality,
> > I wrote this patch after sending v1, and then reordered its position to
> > come before the implementation of MIDX bitmaps for that reason.
> >
> > So in one sense, I prefer it this way because we don't ever introduce
> > the bug.  But in another sense, it is very jarring to read about an
> > interaction that has no basis in the code (yet).
> >
> > I think that the best thing we could do without adding any significant
> > reordering would be to just call out the situation we're in. I added
> > this onto the end of the commit message which I think makes things a
> > little clearer:
> >
> >     (Note that multi-pack reachability bitmaps have yet to be
> >     implemented; so in that sense this patch is fixing a bug which does
> >     not yet exist.  But by having this patch beforehand, we can prevent
> >     the bug from ever materializing.)
>
> I do like fixing it up front. Here's my attempt at rewriting the commit
> message. I tried to omit details about pack order, and instead refer to
> the revindex code, and instead add more explanation of how this relates
> to the pack-reuse code.
>
> Something like:
>
> [...]
>
> Thoughts?

I like it, although reading it fresh I found the sentence beginning with
"So if the user did not specify a preferred pack" to be a little
confusing. To connect it back to the previous paragraph, I added:

  ... in order to avoid a situation where no pack is marked as preferred
  (breaking our assumption about the pack representing the object at the
  0th bit).

and that read out much clearer (to me at least).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 13/24] pack-bitmap: read multi-pack bitmaps
  2021-07-23 10:00         ` Jeff King
@ 2021-07-26 20:36           ` Taylor Blau
  0 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-26 20:36 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Fri, Jul 23, 2021 at 06:00:47AM -0400, Jeff King wrote:
> On Wed, Jul 21, 2021 at 07:01:16PM -0400, Taylor Blau wrote:
> > > > +			if (bitmap_is_midx(bitmap_git)) {
> > > > +				/*
> > > > +				 * Can't reuse from a non-preferred pack (see
> > > > +				 * above).
> > > > +				 */
> > > > +				if (pos + offset >= objects_nr)
> > > > +					continue;
> > > > +			}
> > > > +			try_partial_reuse(bitmap_git, pack, pos + offset, reuse, &w_curs);
> > >
> > > ...and this likewise makes sure we never go past that first pack. Good.
> > >
> > > I think this "continue" could actually be a "break", as the loop is
> > > iterating over "offset" (and "pos + offset" always gets larger). In
> > > fact, it could break out of the outer loop as well (which is
> > > incrementing "pos"). It's probably a pretty small efficiency in
> > > practice, though.
> >
> > Yeah; you're right. And we'll save up to BITS_IN_EWORD cycles of this
> > loop. (I wonder if smart-enough compilers will realize the same
> > optimization that you did and turn that `continue` into a `break`
> > automatically, but that's neither here nor there).
>
> If you break all the way out, then it saves iterating over all of those
> other words that are not in the first pack, too. I.e., if your bitmap
> has 10 million bits (for a 10-million object clone), but your first pack
> only has a million objects in it, we'll call try_partial_reuse() 9
> million extra times.
>
> Fortunately, each call is super cheap, because the first thing it does
> is check if the requested bit is past the end of the pack. Which kind of
> makes me wonder if we could simplify this further by just letting
> try_partial_reuse() tell us when there's no point going further:
>
> [snip suggested diff]

All looks pretty good to me. I think that a goto is a little easier to
read than two identical "if (ret < 0)" checks. And having a comment
makes it clearer to me than the double if statements. So I'm content do
to that instead.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing
  2021-07-26 18:59           ` Taylor Blau
@ 2021-07-26 22:14             ` Taylor Blau
  2021-07-27 17:29               ` Jeff King
  2021-07-27 17:17             ` Jeff King
  1 sibling, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-07-26 22:14 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jul 26, 2021 at 02:59:02PM -0400, Taylor Blau wrote:
> On Fri, Jul 23, 2021 at 04:29:37AM -0400, Jeff King wrote:
> > On Wed, Jul 21, 2021 at 03:22:34PM -0400, Taylor Blau wrote:
> >
> > > > > This avoids a problem that would arise in subsequent patches due to the
> > > > > combination of 'git repack' reopening the object store in-process and
> > > > > the multi-pack index code not checking whether a pack already exists in
> > > > > the object store when calling add_pack_to_midx().
> > > > >
> > > > > This would ultimately lead to a cycle being created along the
> > > > > 'packed_git' struct's '->next' pointer. That is obviously bad, but it
> > > > > has hard-to-debug downstream effects like saying a bitmap can't be
> > > > > loaded for a pack because one already exists (for the same pack).
> > > >
> > > > I'm not sure I completely understand the bug that this causes.
> > >
> > > Off-hand, I can't quite remember either. But it is important; I do have
> > > a distinct memory of dropping this patch and then watching a 'git repack
> > > --write-midx' (that option will be introduced in a later series) fail
> > > horribly.
> > >
> > > If I remember correctly, the bug has to do with loading a MIDX twice in
> > > the same process. When we call add_packed_git() from within
> > > prepare_midx_pack(), we load the pack without caring whether or not it's
> > > already loaded. So loading a MIDX twice in the same process will fail.
> > >
> > > So really I think that this is papering over that bug: we're just
> > > removing one of the times that we happened to load a MIDX from during
> > > the writing phase.
> >
> > Hmm, after staring at this for a bit, I've unconfused and re-confused
> > myself several times.
> >
> > Here are some interesting bits:
> >
> >   - calling load_multi_pack_index() directly creates a new midx object.
> >     None of its m->packs[] array will be filled in. Nor is it reachable
> >     as r->objects->multi_pack_index.
> >
> >   - in using that midx, we end up calling prepare_midx_pack() for
> >     various packs, which creates a new packed_git struct and adds it to
> >     r->objects->packed_git (via install_packed_git()).
> >
> > So that's a bit weird already, because we have packed_git structs in
> > r->objects that came from a midx that isn't r->objects->multi_pack_index.
> > And then if we later call prepare_multi_pack_index(), for example as
> > part of a pack reprepare, then we'd end up with duplicates.
>
> Ah, this jogged my memory: this is a relic from when we generated MIDX
> bitmaps in-process with the rest of the `repack` code. And when we did
> that, we did have to call `reprepare_packed_git()` after writing the new
> packs but before moving them into place.

Actually, I take that back. You were right from the start: the way the
code is written we *can* end up calling both:

  - load_multi_pack_index, from write_midx_internal(), which sets up a
    MIDX, but does not update r->objects->multi_pack_index to point at
    it.

  - ...and prepare_multi_pack_index_one (via prepare_bitmap_git ->
    open_bitmap -> open_midx_bitmap -> get_multi_pack_index ->
    prepare_packed_git) which *also* creates a new MIDX, *and*
    updates the_repository->objects->multi_pack_index to point at it.

(The latter codepath is from the check in write_midx_internal() to see
if we already have a MIDX bitmap when the MIDX we are trying to write
already exists on disk.)

So in this scenario, we have two copies of the same MIDX open, and the
repository's single pack is opened in one of the MIDXs, but not both.
One copy of the pack is pointed at via r->objects->packed_git. Then when
we fall back to open_pack_bitmap(), we call get_all_packs(), which calls
prepare_midx_pack(), which installs the second MIDX's copy of the same
pack into the r->objects->packed_git, and we have a cycle.

I think there are a few ways to fix this bug. The most obvious is to
make install_packed_git() check for the existence of the pack in the
hashmap of installed packs before (re-)installing it. But that could be
quadratic if the hashmap has too many collisions (and the lookup tends
towards being linear in the number of keys rather than constant).

But I think that a more straightforward way would be to open the MIDX we
use when generating the MIDX with prepare_multi_pack_index_one() instead
of load_multi_pack_index() so that the resulting MIDX is pointed at by
r->objects->multi_pack_index. That would prevent the latter call from
deep within the callstack of prepare_bitmap_git() from opening another
copy and then (mistakenly) re-installing the same pack twice.

Thoughts?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 14/24] pack-bitmap: write multi-pack bitmaps
  2021-07-26 18:12       ` Taylor Blau
  2021-07-26 18:23         ` Taylor Blau
@ 2021-07-27 17:11         ` Jeff King
  2021-07-27 20:33           ` Taylor Blau
  1 sibling, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-27 17:11 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jul 26, 2021 at 02:12:30PM -0400, Taylor Blau wrote:

> > This "> -1" struck me as a little bit funny. Perhaps ">= 0" would be a
> > more obvious way of saying "we found it"?
> 
> Sure. (I looked for other uses of oid_pos() to see what is more
> common, but there really are vanishingly few uses.) Easier to read might
> even be:
> 
>     int pos = oid_pos(...);
>     if (pos < 0)
>       return;
>     ALLOC_GROW(...);
> 
> which is what I ended up going for.

Sure, that's better still.

> [topo-sorting commits fed to bitmap writer]
>
> This comes from an investigation into why bitmap coverage had worsened
> for some repositories using MIDX bitmaps at GitHub. The real reason was
> resolved and unrelated to this, but trying to match the behavior of MIDX
> bitmaps to our existing pack bitmap setup (which uses delta-islands) was
> one strategy we tried while debugging.
>
> I actually suspect that it doesn't really matter what order we feed this
> list to bitmap_writer_select_commits() in, because the first thing that
> it does is QSORT() the incoming list of commits in date order.

Hmm, yes, I agree that it shouldn't matter for that reason (though
arguably topo order would still be better than a strict date order, it
does nothing now).

I remember looking into reasons why a single-pack bitmap versus a
midx-of-a-single-pack bitmap might not be identical, but the interesting
things turned out to be elsewhere. Did this actually change anything at
all? If so, then perhaps the "it doesn't matter" is not as true as we
are thinking. I could believe that it has a tiny impact when breaking
times between identical committer dates, though.

> But it does mirror the behavior of our previous bitmap generation
> settings, which has been running for years.
> 
> So... we could probably drop this hunk? I'd probably rather err on the
> safe side and leave this alone since it matches a system that we know to
> work well in practice.

I'd rather drop it, if we think it's doing nothing. While I do value
history in production as a sign of stability, upstream review is a good
time to make sure we understand all of the "why", and to clean things up
(e.g., another example is the questionable close_midx() stuff discussed
elsewhere).

And if we do suspect it is doing something, then IMHO the right thing is
probably still to drop it, and to introduce the feature identically to
both the midx and pack bitmap generation code paths. But that should be
a separate topic (and may actually involve fixing the QSORT to put
things in topo order rather than just date order).

> > > @@ -930,9 +1100,16 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
> > >  		for (i = 0; i < ctx.m->num_packs; i++) {
> > >  			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
> > >
> > > +			if (prepare_midx_pack(the_repository, ctx.m, i)) {
> > > +				error(_("could not load pack %s"),
> > > +				      ctx.m->pack_names[i]);
> > > +				result = 1;
> > > +				goto cleanup;
> > > +			}
> >
> > It might be worth a comment here. I can easily believe that there is
> > some later part of the bitmap generation code that assumes the packs are
> > loaded. But somebody reading this is not likely to understand why it's
> > here.
> >
> > Should this be done conditionally only if we're writing a bitmap? (That
> > might also make it obvious why we are doing it).
> 
> Ah. Actually, I don't think this was necessary before, but we *do* need
> it now because we want to compare the pack mtime's for inferring a
> preferred pack when one wasn't given. And we also need to open the pack
> indexes, too, because we care about the object counts (to make sure that
> we don't infer a preferred pack which has no objects).
> 
> Luckily, any new packs will be loaded (and likewise have their indexes
> open, too), via the the add_pack_to_midx() callback that we pass as an
> argument to for_each_file_in_pack_dir().

Hmm, OK. Your second paragraph makes it sound like we _don't_ need to do
this. But the key is "new packs". In add_pack_to_midx() we skip any
packs that are already in the existing midx, assuming they've already
been added. And we probably must do that, otherwise we end up with
duplicate structs that are not actually shared by ctx->m.

> But we could do something like this instead:
> 
> --- 8< ---
> 
> diff --git a/midx.c b/midx.c
> index 8426e1a0b1..a70a6bca81 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -1111,16 +1111,29 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  		for (i = 0; i < ctx.m->num_packs; i++) {
>  			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
> 
> -			if (prepare_midx_pack(the_repository, ctx.m, i)) {
> -				error(_("could not load pack"));
> -				result = 1;
> -				goto cleanup;
> -			}
> -
>  			ctx.info[ctx.nr].orig_pack_int_id = i;
>  			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
> -			ctx.info[ctx.nr].p = ctx.m->packs[i];
> +			ctx.info[ctx.nr].p = NULL;
>  			ctx.info[ctx.nr].expired = 0;
> +
> +			if (flags & MIDX_WRITE_REV_INDEX) {
> +				/*
> +				 * If generating a reverse index, need to have
> +				 * packed_git's loaded to compare their
> +				 * mtimes and object count.
> +				 */
> +				if (prepare_midx_pack(the_repository, ctx.m, i)) {
> +					error(_("could not load pack"));
> +					result = 1;
> +					goto cleanup;
> +				}
> +
> +				if (open_pack_index(ctx.m->packs[i]))
> +					die(_("could not open index for %s"),
> +					    ctx.m->packs[i]->pack_name);
> +				ctx.info[ctx.nr].p = ctx.m->packs[i];
> +			}
> +

Yeah, was what I was suggesting: make it conditional on bitmaps (well, a
.rev index, which is more precise), and put in comments. :)

It's interesting that your earlier iteration didn't call
open_pack_index(). Is it necessary, or not? From your description, it
seems like it should be. But maybe some later step lazy-loads it? Even
if so, I can see how prepare_midx_pack() would still be required
(because we want to make sure we are using the same struct).

> [closing the midx]
> 
> The reason is kind of annoying. If we're building a MIDX bitmap
> in-process (e.g., `git multi-pack-index write --bitmap`) then we'll call
> prepare_packed_git() to build our pseudo-packing list which we pass to
> the bitmap generation machinery.
> 
> But prepare_packed_git() calls prepare_packed_git_one() ->
> for_each_file_in_pack_dir() with the prepare_pack() callback -> which
> the wants to see if the MIDX we have open already knows about a given
> pack so we avoid opening it twice.
> 
> But even though the MIDX would have gone away by this point (with the
> previous close_midx() call that is removed above), we still hold onto
> a pointer to it via the object_store's `multi_pack_index` pointer. And
> then all the way down in packfile.c:prepare_pack() we try to pass a
> now-defunct pointer as the first argument to midx_contains_pack(), and
> crash.
> 
> And clearing out that `multi_pack_index` pointer is tricky, because the
> MIDX would have to compare the odb's `object_dir` with its own (which is
> brittle in its own right), but also would have to see if that object
> store is pointing at *it*, and not some other MIDX.
> 
> So we do have to keep it open there. Which makes me wonder how this
> could possibly work on Windows, because holding the MIDX open will make
> the commit_lock_file() definitely fail. But it seems OK in the
> Windows-based CI runs?

Forgetting Windows for a moment, it seems like the same "the pointer is
bogus and we crash" is now just pushed down to the end of the function,
rather than the middle. I.e., it is safe to close the midx we got from
load_multi_pack_index(), but not one we got from
prepare_multi_pack_index_one(). So it's hard to reason about this at all
until that problem from patch 08 is resolved.

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 02/24] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  2021-07-26 18:48           ` Taylor Blau
@ 2021-07-27 17:11             ` Jeff King
  0 siblings, 0 replies; 273+ messages in thread
From: Jeff King @ 2021-07-27 17:11 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jul 26, 2021 at 02:48:19PM -0400, Taylor Blau wrote:

> On Fri, Jul 23, 2021 at 03:37:52AM -0400, Jeff King wrote:
> > I thought about suggesting that it be called "err" or "ret" or
> > something. And then we do not have to care that fill_bitmap_commit()
> > only returns an error in the non-closed state. We are simply propagating
> > its error-return back up the stack.
> 
> Hmm. For whatever the inconvience costs us, I do like that the variable
> can be named specifically like "open" or "closed" as opposed to the more
> generic "err" or "ret".
> 
> So I'll probably keep it is unless you feel strongly (which I suspect
> you do not).

I don't.

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing
  2021-07-26 18:59           ` Taylor Blau
  2021-07-26 22:14             ` Taylor Blau
@ 2021-07-27 17:17             ` Jeff King
  1 sibling, 0 replies; 273+ messages in thread
From: Jeff King @ 2021-07-27 17:17 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jul 26, 2021 at 02:59:02PM -0400, Taylor Blau wrote:

> > Hmm, after staring at this for a bit, I've unconfused and re-confused
> > myself several times.
> >
> > Here are some interesting bits:
> >
> >   - calling load_multi_pack_index() directly creates a new midx object.
> >     None of its m->packs[] array will be filled in. Nor is it reachable
> >     as r->objects->multi_pack_index.
> >
> >   - in using that midx, we end up calling prepare_midx_pack() for
> >     various packs, which creates a new packed_git struct and adds it to
> >     r->objects->packed_git (via install_packed_git()).
> >
> > So that's a bit weird already, because we have packed_git structs in
> > r->objects that came from a midx that isn't r->objects->multi_pack_index.
> > And then if we later call prepare_multi_pack_index(), for example as
> > part of a pack reprepare, then we'd end up with duplicates.
> 
> Ah, this jogged my memory: this is a relic from when we generated MIDX
> bitmaps in-process with the rest of the `repack` code. And when we did
> that, we did have to call `reprepare_packed_git()` after writing the new
> packs but before moving them into place.
> 
> So that's where the `reprepare_packed_git()` came from, but we don't
> have any of that code anymore, since we now generate MIDX bitmaps by
> invoking:
> 
>     git multi-pack-index write --bitmap --stdin-packs --refs-snapshot
> 
> as a sub-process of `git repack`; no need for any reprepare which is
> what was triggering this bug.

OK, that makes sense, especially given the "close_midx() leaves the
pointer bogus" stuff discussed elsewhere.

> To be sure, I reverted this patch out of GitHub's fork, and reran the
> tests both in normal mode (just `make test`) and then once more with the
> `GIT_TEST_MULTI_PACK_INDEX{,_WRITE_BITMAP}` environment variables set.
> Unsurprisingly, it passed both times.
> 
> I'm happy to keep digging further, but I think that I'm 99% satisfied
> here. Digging further involves resurrecting a much older version of this
> series (and others adjacent to it), and there are probably other bugs
> lurking that would be annoying to tease out.
> 
> In any case, let's drop this patch from the series. It's disappointing
> that we can't run:
> 
>     git -c core.multiPackIndex= multi-pack-index write
> 
> anymore, but I guess that's no worse than the state we were in before
> this patch, so I'm content to let it live on.

Great. If we can drop it, I think that is the best path forward. I think
that may simplify things for the writing patch, too, then. It should not
matter if we move close_midx() anymore, because we will not be closing
the main r->objects->multi_pack_index struct.

I do suspect we could be skipping the load _and_ close of the midx
entirely in write_midx_internal(), and just using whatever the caller
has passed in (and arguably just having most callers pass in the regular
midx struct if they want us to reuse parts of it). That might be a
cleanup we can leave for later, but it might be necessary to touch these
bits anyway (if there is still some kind of close_midx() ordering gotcha
in the function).

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing
  2021-07-26 22:14             ` Taylor Blau
@ 2021-07-27 17:29               ` Jeff King
  2021-07-27 17:36                 ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-27 17:29 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Mon, Jul 26, 2021 at 06:14:09PM -0400, Taylor Blau wrote:

> > Ah, this jogged my memory: this is a relic from when we generated MIDX
> > bitmaps in-process with the rest of the `repack` code. And when we did
> > that, we did have to call `reprepare_packed_git()` after writing the new
> > packs but before moving them into place.
> 
> Actually, I take that back. You were right from the start: the way the
> code is written we *can* end up calling both:
> 
>   - load_multi_pack_index, from write_midx_internal(), which sets up a
>     MIDX, but does not update r->objects->multi_pack_index to point at
>     it.
> 
>   - ...and prepare_multi_pack_index_one (via prepare_bitmap_git ->
>     open_bitmap -> open_midx_bitmap -> get_multi_pack_index ->
>     prepare_packed_git) which *also* creates a new MIDX, *and*
>     updates the_repository->objects->multi_pack_index to point at it.
> 
> (The latter codepath is from the check in write_midx_internal() to see
> if we already have a MIDX bitmap when the MIDX we are trying to write
> already exists on disk.)
> 
> So in this scenario, we have two copies of the same MIDX open, and the
> repository's single pack is opened in one of the MIDXs, but not both.
> One copy of the pack is pointed at via r->objects->packed_git. Then when
> we fall back to open_pack_bitmap(), we call get_all_packs(), which calls
> prepare_midx_pack(), which installs the second MIDX's copy of the same
> pack into the r->objects->packed_git, and we have a cycle.

Right, I understand how that ends up with duplicate structs for each
pack. But how do we get a cycle out of that?

> I think there are a few ways to fix this bug. The most obvious is to
> make install_packed_git() check for the existence of the pack in the
> hashmap of installed packs before (re-)installing it. But that could be
> quadratic if the hashmap has too many collisions (and the lookup tends
> towards being linear in the number of keys rather than constant).

I think it may be worth doing that anyway. You can assume the hashmap
will behave reasonably.

But it would mean that the "multi_pack_index" flag in packed_git does
not specify _which_ midx is pointing to it. At the very least, it would
need to become a ref-count (so when one midx goes away, it does not lose
its "I am part of a midx" flag).  And possibly it would need to actually
know the complete list of midx structs it's associated with (I haven't
looked at all of the uses of that flag).

That makes things sufficiently tricky that I would prefer not to
untangle it as part of this series.

> But I think that a more straightforward way would be to open the MIDX we
> use when generating the MIDX with prepare_multi_pack_index_one() instead
> of load_multi_pack_index() so that the resulting MIDX is pointed at by
> r->objects->multi_pack_index. That would prevent the latter call from
> deep within the callstack of prepare_bitmap_git() from opening another
> copy and then (mistakenly) re-installing the same pack twice.

But now the internal midx writing code can never call close_midx() on
that, because it does not own it to close. Can we simply drop the
close_midx() call there?

This would all make much more sense to me if write_midx_internal()
simply took a conceptually read-only midx as a parameter, and the caller
passed in the appropriate one (probably even using
prepare_multi_pack_index_one() to get it).

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing
  2021-07-27 17:29               ` Jeff King
@ 2021-07-27 17:36                 ` Taylor Blau
  2021-07-27 17:42                   ` Jeff King
  0 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 17:36 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Tue, Jul 27, 2021 at 01:29:49PM -0400, Jeff King wrote:
> On Mon, Jul 26, 2021 at 06:14:09PM -0400, Taylor Blau wrote:
> > So in this scenario, we have two copies of the same MIDX open, and the
> > repository's single pack is opened in one of the MIDXs, but not both.
> > One copy of the pack is pointed at via r->objects->packed_git. Then when
> > we fall back to open_pack_bitmap(), we call get_all_packs(), which calls
> > prepare_midx_pack(), which installs the second MIDX's copy of the same
> > pack into the r->objects->packed_git, and we have a cycle.
>
> Right, I understand how that ends up with duplicate structs for each
> pack. But how do we get a cycle out of that?

Sorry, it isn't a true cycle where p->next == p.

> But now the internal midx writing code can never call close_midx() on
> that, because it does not own it to close. Can we simply drop the
> close_midx() call there?
>
> This would all make much more sense to me if write_midx_internal()
> simply took a conceptually read-only midx as a parameter, and the caller
> passed in the appropriate one (probably even using
> prepare_multi_pack_index_one() to get it).

No, we can't drop the close_midx() call there because we must close the
MIDX file on Windows before moving a new one into place. My feeling is
we should always be working on the r->objects->multi_pack_index pointer,
and calling close_object_store() there instead of close_midx().

Does that seem like a reasonable approach to you?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing
  2021-07-27 17:36                 ` Taylor Blau
@ 2021-07-27 17:42                   ` Jeff King
  2021-07-27 17:47                     ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-27 17:42 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Tue, Jul 27, 2021 at 01:36:34PM -0400, Taylor Blau wrote:

> > But now the internal midx writing code can never call close_midx() on
> > that, because it does not own it to close. Can we simply drop the
> > close_midx() call there?
> >
> > This would all make much more sense to me if write_midx_internal()
> > simply took a conceptually read-only midx as a parameter, and the caller
> > passed in the appropriate one (probably even using
> > prepare_multi_pack_index_one() to get it).
> 
> No, we can't drop the close_midx() call there because we must close the
> MIDX file on Windows before moving a new one into place. My feeling is
> we should always be working on the r->objects->multi_pack_index pointer,
> and calling close_object_store() there instead of close_midx().
> 
> Does that seem like a reasonable approach to you?

Yes, though I'd have said that it is the responsibility of the caller
(who knows we are operating with r->objects->multi_pack_index) to do
that closing. But maybe it's not possible if the rename-into-place
happens at too low a level.

BTW, yet another weirdness: close_object_store() will call close_midx()
on the outermost midx struct, ignoring o->multi_pack_index->next
entirely. So that's a leak, but also means we may not be closing the
midx we're interested in (since write_midx_internal() takes an
object-dir parameter, and we could be pointing to some other
object-dir's midx).

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing
  2021-07-27 17:42                   ` Jeff King
@ 2021-07-27 17:47                     ` Taylor Blau
  2021-07-27 17:55                       ` Jeff King
  0 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 17:47 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Tue, Jul 27, 2021 at 01:42:35PM -0400, Jeff King wrote:
> On Tue, Jul 27, 2021 at 01:36:34PM -0400, Taylor Blau wrote:
>
> > > But now the internal midx writing code can never call close_midx() on
> > > that, because it does not own it to close. Can we simply drop the
> > > close_midx() call there?
> > >
> > > This would all make much more sense to me if write_midx_internal()
> > > simply took a conceptually read-only midx as a parameter, and the caller
> > > passed in the appropriate one (probably even using
> > > prepare_multi_pack_index_one() to get it).
> >
> > No, we can't drop the close_midx() call there because we must close the
> > MIDX file on Windows before moving a new one into place. My feeling is
> > we should always be working on the r->objects->multi_pack_index pointer,
> > and calling close_object_store() there instead of close_midx().
> >
> > Does that seem like a reasonable approach to you?
>
> Yes, though I'd have said that it is the responsibility of the caller
> (who knows we are operating with r->objects->multi_pack_index) to do
> that closing. But maybe it's not possible if the rename-into-place
> happens at too low a level.

Right; write_midx_internal() needs to access the MIDX right up until the
point that we write the new one into place, so the only place to close
it is in write_midx_internal().

> BTW, yet another weirdness: close_object_store() will call close_midx()
> on the outermost midx struct, ignoring o->multi_pack_index->next
> entirely. So that's a leak, but also means we may not be closing the
> midx we're interested in (since write_midx_internal() takes an
> object-dir parameter, and we could be pointing to some other
> object-dir's midx).

Yuck, this is a mess. I'm tempted to say that we should be closing the
MIDX that we're operating on inside of write_midx_internal() so we can
write, but then declaring the whole object store to be bunk and calling
close_object_store() before leaving the function. Of course, one of
those steps should be closing the inner-most MIDX before closing the
next one and so on.

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing
  2021-07-27 17:47                     ` Taylor Blau
@ 2021-07-27 17:55                       ` Jeff King
  2021-07-27 20:05                         ` Taylor Blau
  0 siblings, 1 reply; 273+ messages in thread
From: Jeff King @ 2021-07-27 17:55 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jonathantanmy

On Tue, Jul 27, 2021 at 01:47:40PM -0400, Taylor Blau wrote:

> > BTW, yet another weirdness: close_object_store() will call close_midx()
> > on the outermost midx struct, ignoring o->multi_pack_index->next
> > entirely. So that's a leak, but also means we may not be closing the
> > midx we're interested in (since write_midx_internal() takes an
> > object-dir parameter, and we could be pointing to some other
> > object-dir's midx).
> 
> Yuck, this is a mess. I'm tempted to say that we should be closing the
> MIDX that we're operating on inside of write_midx_internal() so we can
> write, but then declaring the whole object store to be bunk and calling
> close_object_store() before leaving the function. Of course, one of
> those steps should be closing the inner-most MIDX before closing the
> next one and so on.

That gets even weirder when you look at other callers of
write_midx_internal(). For instance, expire_midx_packs() is calling
load_multi_pack_index() directly, and then passing it to
write_midx_internal().

So closing the whole object store there is likewise weird.

I actually think having write_midx_internal() open up a new midx is
reasonable-ish. It's just that:

  - it's weird when it stuffs duplicate packs into the
    r->objects->packed_git list. But AFAICT that's not actually hurting
    anything?

  - we do need to make sure that the midx is closed (not just our copy,
    but any other open copies that happen to be in the same process) in
    order for things to work on Windows.

So I guess because of the second point, the internal midx write probably
needs to be calling close_object_store(). But because other callers use
load_multi_pack_index(), it probably needs to be closing the one that is
passed in, too! But of course not double-closing it if it did come from
the regular object store. One easy easy way to avoid that is to just
open a separate one.

I have some spicy takes on how midx's should have been designed, but I
think it's probably not productive to rant about it at this point. ;)

-Peff

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 08/24] midx: respect 'core.multiPackIndex' when writing
  2021-07-27 17:55                       ` Jeff King
@ 2021-07-27 20:05                         ` Taylor Blau
  2021-07-28 17:46                           ` Jeff King
  0 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 20:05 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Tue, Jul 27, 2021 at 01:55:52PM -0400, Jeff King wrote:
> On Tue, Jul 27, 2021 at 01:47:40PM -0400, Taylor Blau wrote:
>
> > > BTW, yet another weirdness: close_object_store() will call close_midx()
> > > on the outermost midx struct, ignoring o->multi_pack_index->next
> > > entirely. So that's a leak, but also means we may not be closing the
> > > midx we're interested in (since write_midx_internal() takes an
> > > object-dir parameter, and we could be pointing to some other
> > > object-dir's midx).
> >
> > Yuck, this is a mess. I'm tempted to say that we should be closing the
> > MIDX that we're operating on inside of write_midx_internal() so we can
> > write, but then declaring the whole object store to be bunk and calling
> > close_object_store() before leaving the function. Of course, one of
> > those steps should be closing the inner-most MIDX before closing the
> > next one and so on.
>
> That gets even weirder when you look at other callers of
> write_midx_internal(). For instance, expire_midx_packs() is calling
> load_multi_pack_index() directly, and then passing it to
> write_midx_internal().
>
> So closing the whole object store there is likewise weird.
>
> I actually think having write_midx_internal() open up a new midx is
> reasonable-ish. It's just that:
>
>   - it's weird when it stuffs duplicate packs into the
>     r->objects->packed_git list. But AFAICT that's not actually hurting
>     anything?

It is hurting us when we try to write a MIDX bitmap, because we try to
see if one already exists. And to do that, we call prepare_bitmap_git(),
which tries to call open_pack_bitmap_1 on *each* pack in the packed_git
list. Critically, prepare_bitmap_git() errors out if it is called with a
bitmap_git that has a non-NULL `->pack` pointer.

So they aren't a cycle in the sense that `p->next == p`, but it does
cause problems for us nonetheless.

I stepped away from my computer for an hour or so and thought about
this, and I think that the solution is two-fold:

  - We should be more careful about freeing up the ->next pointers of a
    MIDX, and releasing the memory we allocated to hold each MIDX struct
    in the first place.

  - We should always be operating on the repository's
    r->objects->multi_pack_index, or any other MIDX that can be reached
    via walking the `->next` pointers. If we do that consistently, then
    we'll only have at most one instance of a MIDX struct corresponding
    to each MIDX file on disk.

In the reroll that I'll send shortly, those are:

  - https://github.com/ttaylorr/git/commit/61a617715f3827401522c7b08b50bb6866f2a4e9
  - https://github.com/ttaylorr/git/commit/fd15ecf47c57ce4ff0d31621c2c9f61ff7a74939

respectively. It resolves my issue locally which was that I was
previously unable to run:

    GIT_TEST_MULTI_PACK_INDEX=1 \
    GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=1 \
      t7700-repack.sh

without t7700.13 failing on me (it previously complained about a
"duplicate" .bitmap file, which is a side-effect of placing duplicates
in the packed_git list, not a true duplicate .bitmap on disk).

I'm waiting on a CI run [1], but I'm relatively happy with the result. I
think we could do something similar to the MIDX code like we did to the
commit-graph code in [2], but I'm reasonably happy with where we are
now.

Thanks,
Taylor

[1]: https://github.com/ttaylorr/git/actions/runs/1072513087
[2]: https://lore.kernel.org/git/cover.1580424766.git.me@ttaylorr.com/

^ permalink raw reply	[flat|nested] 273+ messages in thread

* Re: [PATCH v2 14/24] pack-bitmap: write multi-pack bitmaps
  2021-07-27 17:11         ` Jeff King
@ 2021-07-27 20:33           ` Taylor Blau
  2021-07-28 17:52             ` Jeff King
  0 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 20:33 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jonathantanmy

On Tue, Jul 27, 2021 at 01:11:25PM -0400, Jeff King wrote:
> On Mon, Jul 26, 2021 at 02:12:30PM -0400, Taylor Blau wrote:
> > But it does mirror the behavior of our previous bitmap generation
> > settings, which has been running for years.
> >
> > So... we could probably drop this hunk? I'd probably rather err on the
> > safe side and leave this alone since it matches a system that we know to
> > work well in practice.
>
> I'd rather drop it, if we think it's doing nothing. While I do value
> history in production as a sign of stability, upstream review is a good
> time to make sure we understand all of the "why", and to clean things up
> (e.g., another example is the questionable close_midx() stuff discussed
> elsewhere).

OK, I think that's a very reasonable way of thinking about it, so I'd
rather just get rid of it (not to mention that I really doubt it's doing
much of anything in the first place).

> > Luckily, any new packs will be loaded (and likewise have their indexes
> > open, too), via the the add_pack_to_midx() callback that we pass as an
> > argument to for_each_file_in_pack_dir().
>
> Hmm, OK. Your second paragraph makes it sound like we _don't_ need to do
> this. But the key is "new packs". In add_pack_to_midx() we skip any
> packs that are already in the existing midx, assuming they've already
> been added. And we probably must do that, otherwise we end up with
> duplicate structs that are not actually shared by ctx->m.

Exactly.

> It's interesting that your earlier iteration didn't call
> open_pack_index(). Is it necessary, or not? From your description, it
> seems like it should be. But maybe some later step lazy-loads it? Even
> if so, I can see how prepare_midx_pack() would still be required
> (because we want to make sure we are using the same struct).

It's only necessary now (at least for determining a preferred pack if
the caller didn't specify one with `--preferred-pack`) because we care
about reading the `num_objects` field, which the index must be loaded
for.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 00/25] multi-pack reachability bitmaps
  2021-04-09 18:10 [PATCH 00/22] multi-pack reachability bitmaps Taylor Blau
                   ` (22 preceding siblings ...)
  2021-06-21 22:24 ` [PATCH v2 00/24] multi-pack reachability bitmaps Taylor Blau
@ 2021-07-27 21:19 ` Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 01/25] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps Taylor Blau
                     ` (25 more replies)
  2021-08-24 16:15 ` [PATCH v4 " Taylor Blau
  2021-08-31 20:51 ` [PATCH v5 00/27] " Taylor Blau
  25 siblings, 26 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Here is another reroll of my series to implement multi-pack reachability
bitmaps, based on reviews from Ævar and Peff.

Notable changes since last time are summarized here (though a complete
range-diff is below as well):

  - Preventing multiple copies of the same MIDX from being opened, see the new
    patches 8-9 for details.
  - Ensuring preferred packs are non-empty.
  - Simplifying a handful of routines to read from MIDX bitmaps.
  - General code clean-up, removing a few stray hunks, some commit message
    tweaking.

This reroll also dropped three patches present in v2, namely:

  - A patch to build Documentation/technical/bitmap-format.txt (the document is
    poorly formatted and the generated HTML isn't readable).
  - A patch to make some MIDX-related functions non-static (the required
    functions are instead exposed in the patches that first make use of them).
  - A patch to respect `core.multiPackIndex` when writing MIDXs (see patches 8-9
    for the replacement).

Thanks in advance for your review. I think Peff still wanted to read through
patches 16-25, but that the first 15 or so should be in pretty good shape by
now.

Jeff King (2):
  t0410: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP
  t5310: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP

Taylor Blau (23):
  pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps
  pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  pack-bitmap-write.c: free existing bitmaps
  Documentation: describe MIDX-based bitmaps
  midx: clear auxiliary .rev after replacing the MIDX
  midx: reject empty `--preferred-pack`'s
  midx: infer preferred pack when not given one
  midx: close linked MIDXs, avoid leaking memory
  midx: avoid opening multiple MIDXs when writing
  pack-bitmap.c: introduce 'bitmap_num_objects()'
  pack-bitmap.c: introduce 'nth_bitmap_object_oid()'
  pack-bitmap.c: introduce 'bitmap_is_preferred_refname()'
  pack-bitmap.c: avoid redundant calls to try_partial_reuse
  pack-bitmap: read multi-pack bitmaps
  pack-bitmap: write multi-pack bitmaps
  t5310: move some tests to lib-bitmap.sh
  t/helper/test-read-midx.c: add --checksum mode
  t5326: test multi-pack bitmap behavior
  t5319: don't write MIDX bitmaps in t5319
  t7700: update to work with MIDX bitmap test knob
  midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP'
  p5310: extract full and partial bitmap tests
  p5326: perf tests for MIDX bitmaps

 Documentation/git-multi-pack-index.txt       |  18 +-
 Documentation/technical/bitmap-format.txt    |  71 ++-
 Documentation/technical/multi-pack-index.txt |  10 +-
 builtin/multi-pack-index.c                   |   2 +
 builtin/pack-objects.c                       |   8 +-
 builtin/repack.c                             |  12 +-
 ci/run-build-and-tests.sh                    |   1 +
 midx.c                                       | 321 +++++++++++-
 midx.h                                       |   5 +
 pack-bitmap-write.c                          |  79 ++-
 pack-bitmap.c                                | 499 ++++++++++++++++---
 pack-bitmap.h                                |   9 +-
 packfile.c                                   |   2 +-
 t/README                                     |   4 +
 t/helper/test-read-midx.c                    |  16 +-
 t/lib-bitmap.sh                              | 240 +++++++++
 t/perf/lib-bitmap.sh                         |  69 +++
 t/perf/p5310-pack-bitmaps.sh                 |  65 +--
 t/perf/p5326-multi-pack-bitmaps.sh           |  43 ++
 t/t0410-partial-clone.sh                     |  12 +-
 t/t5310-pack-bitmaps.sh                      | 231 +--------
 t/t5319-multi-pack-index.sh                  |  20 +-
 t/t5326-multi-pack-bitmaps.sh                | 277 ++++++++++
 t/t7700-repack.sh                            |  18 +-
 24 files changed, 1596 insertions(+), 436 deletions(-)
 create mode 100644 t/perf/lib-bitmap.sh
 create mode 100755 t/perf/p5326-multi-pack-bitmaps.sh
 create mode 100755 t/t5326-multi-pack-bitmaps.sh

Range-diff against v2:
 1:  a18baeb0b4 !  1:  fa4cbed48e pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps
    @@ pack-bitmap.c: void count_bitmap_commit_list(struct bitmap_index *bitmap_git,
     +		bitmaps_nr++;
     +	}
     +
    -+	if (!bitmap_type)
    ++	if (bitmap_type == OBJ_NONE)
     +		die("object %s not found in type bitmaps",
     +		    oid_to_hex(&obj->oid));
     +
 2:  3e637d9ec8 =  2:  2b15c1fc5c pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
 3:  490d733d12 =  3:  2ad513a230 pack-bitmap-write.c: free existing bitmaps
 4:  b0bb2e8051 <  -:  ---------- Documentation: build 'technical/bitmap-format' by default
 5:  64a260e0c6 !  4:  8da5de7c24 Documentation: describe MIDX-based bitmaps
    @@ Documentation/technical/bitmap-format.txt
     +
     +		o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)
     +
    -+	The ordering between packs is done lexicographically by the pack name,
    -+	with the exception of the preferred pack, which sorts ahead of all other
    -+	packs.
    ++	The ordering between packs is done according to the MIDX's .rev file.
    ++	Notably, the preferred pack sorts ahead of all other packs.
     +
     +The on-disk representation (described below) of a bitmap is the same regardless
     +of whether or not that bitmap belongs to a packfile or a MIDX. The only
 6:  b3a12424d7 <  -:  ---------- midx: make a number of functions non-static
 7:  1448ca0d2b =  5:  49297f57ed midx: clear auxiliary .rev after replacing the MIDX
 8:  dfd1daacc5 <  -:  ---------- midx: respect 'core.multiPackIndex' when writing
 -:  ---------- >  6:  c5513f2a75 midx: reject empty `--preferred-pack`'s
 9:  9495f6869d !  7:  53ef0a6d67 midx: infer preferred pack when not given one
    @@ Commit message
         Not specifying a preferred pack can cause serious problems with
         multi-pack reachability bitmaps, because these bitmaps rely on having at
         least one pack from which all duplicates are selected. Not having such a
    -    pack causes problems with the pack reuse code (e.g., like assuming that
    -    a base object was sent from that pack via reuse when in fact the base
    -    was selected from a different pack).
    +    pack causes problems with the code in pack-objects to reuse packs
    +    verbatim (e.g., that code assumes that a delta object in a chunk of pack
    +    sent verbatim will have its base object sent from the same pack).
     
         So why does not marking a pack preferred cause problems here? The reason
         is roughly as follows:
    @@ Commit message
             later).
     
           - The psuedo pack-order (described in
    -        Documentation/technical/bitmap-format.txt) is computed by
    +        Documentation/technical/pack-format.txt under the section
    +        "multi-pack-index reverse indexes") is computed by
             midx_pack_order(), and sorts by pack ID and pack offset, with
             preferred packs sorting first.
     
    @@ Commit message
         order, which the bitmap code will treat as the preferred one) did *not*
         have all duplicate objects resolved in its favor, resulting in breakage.
     
    -    The fix is simple: pick a (semi-arbitrary) preferred pack when none was
    -    specified. This forces that pack to have duplicates resolved in its
    -    favor, and (critically) to sort first in pseudo-pack order.
    -    Unfortunately, testing this behavior portably isn't possible, since it
    -    depends on readdir() order which isn't guaranteed by POSIX.
    +    The fix is simple: pick a (semi-arbitrary, non-empty) preferred pack
    +    when none was specified. This forces that pack to have duplicates
    +    resolved in its favor, and (critically) to sort first in pseudo-pack
    +    order.  Unfortunately, testing this behavior portably isn't possible,
    +    since it depends on readdir() order which isn't guaranteed by POSIX.
    +
    +    (Note that multi-pack reachability bitmaps have yet to be implemented;
    +    so in that sense this patch is fixing a bug which does not yet exist.
    +    But by having this patch beforehand, we can prevent the bug from ever
    +    materializing.)
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
     +			warning(_("unknown preferred pack: '%s'"),
     +				preferred_pack_name);
     +	} else if (ctx.nr && (flags & MIDX_WRITE_REV_INDEX)) {
    -+		time_t oldest = ctx.info[0].p->mtime;
    ++		struct packed_git *oldest = ctx.info[ctx.preferred_pack_idx].p;
     +		ctx.preferred_pack_idx = 0;
     +
     +		if (packs_to_drop && packs_to_drop->nr)
    @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
     +		 * (and not another pack containing a duplicate)
     +		 */
     +		for (i = 1; i < ctx.nr; i++) {
    -+			time_t mtime = ctx.info[i].p->mtime;
    -+			if (mtime < oldest) {
    -+				oldest = mtime;
    ++			struct packed_git *p = ctx.info[i].p;
    ++
    ++			if (!oldest->num_objects || p->mtime < oldest->mtime) {
    ++				oldest = p;
     +				ctx.preferred_pack_idx = i;
     +			}
     +		}
    ++
    ++		if (!oldest->num_objects) {
    ++			/*
    ++			 * If all packs are empty; unset the preferred index.
    ++			 * This is acceptable since there will be no duplicate
    ++			 * objects to resolve, so the preferred value doesn't
    ++			 * matter.
    ++			 */
    ++			ctx.preferred_pack_idx = -1;
    ++		}
     +	} else {
     +		/*
     +		 * otherwise don't mark any pack as preferred to avoid
    @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
     +		ctx.preferred_pack_idx = -1;
      	}
      
    - 	ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr,
    + 	if (ctx.preferred_pack_idx > -1) {
     @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
      						      ctx.info, ctx.nr,
      						      sizeof(*ctx.info),
 -:  ---------- >  8:  114773d9cd midx: close linked MIDXs, avoid leaking memory
 -:  ---------- >  9:  40cff5beb5 midx: avoid opening multiple MIDXs when writing
10:  373aa47528 = 10:  ca7f726abf pack-bitmap.c: introduce 'bitmap_num_objects()'
11:  ac1f46aa1f ! 11:  67e6897a34 pack-bitmap.c: introduce 'nth_bitmap_object_oid()'
    @@ pack-bitmap.c: static inline uint8_t read_u8(const unsigned char *buffer, size_t
      
      #define MAX_XOR_OFFSET 160
      
    -+static void nth_bitmap_object_oid(struct bitmap_index *index,
    -+				  struct object_id *oid,
    -+				  uint32_t n)
    ++static int nth_bitmap_object_oid(struct bitmap_index *index,
    ++				 struct object_id *oid,
    ++				 uint32_t n)
     +{
    -+	nth_packed_object_id(oid, index->pack, n);
    ++	return nth_packed_object_id(oid, index->pack, n);
     +}
     +
      static int load_bitmap_entries_v1(struct bitmap_index *index)
    @@ pack-bitmap.c: static int load_bitmap_entries_v1(struct bitmap_index *index)
      		flags = read_u8(index->map, &index->map_pos);
      
     -		if (nth_packed_object_id(&oid, index->pack, commit_idx_pos) < 0)
    --			return error("corrupt ewah bitmap: commit index %u out of range",
    --				     (unsigned)commit_idx_pos);
    -+		nth_bitmap_object_oid(index, &oid, commit_idx_pos);
    ++		if (nth_bitmap_object_oid(index, &oid, commit_idx_pos) < 0)
    + 			return error("corrupt ewah bitmap: commit index %u out of range",
    + 				     (unsigned)commit_idx_pos);
      
    - 		bitmap = read_bitmap_1(index);
    - 		if (!bitmap)
     @@ pack-bitmap.c: static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
      		off_t ofs = pack_pos_to_offset(pack, pos);
      		if (packed_object_info(the_repository, pack, ofs, &oi) < 0) {
12:  c474d2eda5 ! 12:  743a1a138e pack-bitmap.c: introduce 'bitmap_is_preferred_refname()'
    @@ Commit message
         'pack.preferBitmapTips' configuration. This patch prepares the
         multi-pack bitmap code to respect this configuration, too.
     
    -    Since the multi-pack bitmap code already does a traversal of all
    -    references (in order to discover the set of reachable commits in the
    -    multi-pack index), it is more efficient to check whether or not each
    -    reference is a suffix of any value of 'pack.preferBitmapTips' rather
    -    than do an additional traversal.
    +    The yet-to-be implemented code will find that it is more efficient to
    +    check whether each reference contains a prefix found in the configured
    +    set of values rather than doing an additional traversal.
     
    -    Implement a function 'bitmap_is_preferred_refname()' which does just
    -    that. The caller will be added in a subsequent patch.
    +    Implement a function 'bitmap_is_preferred_refname()' which will perform
    +    that check. Its caller will be added in a subsequent patch.
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
 -:  ---------- > 13:  a3b641b3e6 pack-bitmap.c: avoid redundant calls to try_partial_reuse
13:  7d44ba6299 ! 14:  141ff83275 pack-bitmap: read multi-pack bitmaps
    @@ Commit message
         in a MIDX.
     
         Note that there are currently no writers who write multi-pack bitmaps,
    -    and that this will be implemented in the subsequent commit.
    +    and that this will be implemented in the subsequent commit. Note also
    +    that get_midx_checksum() and get_midx_filename() are made non-static so
    +    they can be called from pack-bitmap.c.
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    @@ builtin/pack-objects.c: static void write_reused_pack(struct hashfile *f)
      			display_progress(progress_state, ++written);
      		}
     
    + ## midx.c ##
    +@@ midx.c: static uint8_t oid_version(void)
    + 	}
    + }
    + 
    +-static const unsigned char *get_midx_checksum(struct multi_pack_index *m)
    ++const unsigned char *get_midx_checksum(struct multi_pack_index *m)
    + {
    + 	return m->data + m->data_len - the_hash_algo->rawsz;
    + }
    + 
    +-static char *get_midx_filename(const char *object_dir)
    ++char *get_midx_filename(const char *object_dir)
    + {
    + 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
    + }
    +
    + ## midx.h ##
    +@@ midx.h: struct multi_pack_index {
    + #define MIDX_PROGRESS     (1 << 0)
    + #define MIDX_WRITE_REV_INDEX (1 << 1)
    + 
    ++const unsigned char *get_midx_checksum(struct multi_pack_index *m);
    ++char *get_midx_filename(const char *object_dir);
    + char *get_midx_rev_filename(struct multi_pack_index *m);
    + 
    + struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
    +
      ## pack-bitmap-write.c ##
     @@ pack-bitmap-write.c: void bitmap_writer_show_progress(int show)
      }
    @@ pack-bitmap.c: static int load_bitmap_header(struct bitmap_index *index)
      	index->map_pos += header_size;
      	return 0;
      }
    -@@ pack-bitmap.c: static void nth_bitmap_object_oid(struct bitmap_index *index,
    - 				  struct object_id *oid,
    - 				  uint32_t n)
    +@@ pack-bitmap.c: static int nth_bitmap_object_oid(struct bitmap_index *index,
    + 				 struct object_id *oid,
    + 				 uint32_t n)
      {
    --	nth_packed_object_id(oid, index->pack, n);
     +	if (index->midx)
    -+		nth_midxed_object_oid(oid, index->midx, n);
    -+	else
    -+		nth_packed_object_id(oid, index->pack, n);
    ++		return nth_midxed_object_oid(oid, index->midx, n) ? 0 : -1;
    + 	return nth_packed_object_id(oid, index->pack, n);
      }
      
    - static int load_bitmap_entries_v1(struct bitmap_index *index)
     @@ pack-bitmap.c: static int load_bitmap_entries_v1(struct bitmap_index *index)
      	return 0;
      }
    @@ pack-bitmap.c: static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, st
      		warning("ignoring extra bitmap file: %s", packfile->pack_name);
      		close(fd);
      		return -1;
    - 	}
    - 
    -+	if (!is_pack_valid(packfile)) {
    -+		close(fd);
    -+		return -1;
    -+	}
    -+
    - 	bitmap_git->pack = packfile;
    - 	bitmap_git->map_size = xsize_t(st.st_size);
    - 	bitmap_git->map = xmmap(NULL, bitmap_git->map_size, PROT_READ, MAP_PRIVATE, fd, 0);
     @@ pack-bitmap.c: static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git
      	return 0;
      }
    @@ pack-bitmap.c: static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, st
     +		uint32_t i;
     +		int ret;
     +
    -+		ret = load_midx_revindex(bitmap_git->midx);
    -+		if (ret)
    -+			return ret;
    -+
    ++		/*
    ++		 * The multi-pack-index's .rev file is already loaded via
    ++		 * open_pack_bitmap_1().
    ++		 *
    ++		 * But we still need to open the individual pack .rev files,
    ++		 * since we will need to make use of them in pack-objects.
    ++		 */
     +		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
     +			if (prepare_midx_pack(the_repository, bitmap_git->midx, i))
     +				die(_("load_reverse_index: could not open pack"));
    @@ pack-bitmap.c: static int open_pack_bitmap(struct repository *r,
      
     -	if (!open_pack_bitmap(r, bitmap_git) && !load_pack_bitmap(bitmap_git))
     +	if (!open_bitmap(r, bitmap_git) && !load_bitmap(bitmap_git))
    ++		return bitmap_git;
    ++
    ++	free_bitmap_index(bitmap_git);
    ++	return NULL;
    ++}
    ++
    ++struct bitmap_index *prepare_midx_bitmap_git(struct repository *r,
    ++					     struct multi_pack_index *midx)
    ++{
    ++	struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));
    ++
    ++	if (!open_midx_bitmap_1(bitmap_git, midx) && !load_bitmap(bitmap_git))
      		return bitmap_git;
      
      	free_bitmap_index(bitmap_git);
    @@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
      
      	object_array_clear(&revs->pending);
     @@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
    - }
    - 
    - static void try_partial_reuse(struct bitmap_index *bitmap_git,
    -+			      struct packed_git *pack,
    - 			      size_t pos,
    - 			      struct bitmap *reuse,
    - 			      struct pack_window **w_curs)
    +  * reused, but you can keep feeding bits.
    +  */
    + static int try_partial_reuse(struct bitmap_index *bitmap_git,
    ++			     struct packed_git *pack,
    + 			     size_t pos,
    + 			     struct bitmap *reuse,
    + 			     struct pack_window **w_curs)
      {
     -	off_t offset, header;
     +	off_t offset, delta_obj_offset;
    @@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
      	unsigned long size;
      
     -	if (pos >= bitmap_num_objects(bitmap_git))
    --		return; /* not actually in the pack or MIDX */
    +-		return -1; /* not actually in the pack or MIDX */
     +	/*
     +	 * try_partial_reuse() is called either on (a) objects in the
     +	 * bitmapped pack (in the case of a single-pack bitmap) or (b)
    @@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
     -	offset = header = pack_pos_to_offset(bitmap_git->pack, pos);
     -	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
     +	if (pos >= pack->num_objects)
    -+		return; /* not actually in the pack or MIDX preferred pack */
    ++		return -1; /* not actually in the pack or MIDX preferred pack */
     +
     +	offset = delta_obj_offset = pack_pos_to_offset(pack, pos);
     +	type = unpack_object_header(pack, w_curs, &offset, &size);
      	if (type < 0)
    - 		return; /* broken packfile, punt */
    + 		return -1; /* broken packfile, punt */
      
    -@@ pack-bitmap.c: static void try_partial_reuse(struct bitmap_index *bitmap_git,
    +@@ pack-bitmap.c: static int try_partial_reuse(struct bitmap_index *bitmap_git,
      		 * and the normal slow path will complain about it in
      		 * more detail.
      		 */
    @@ pack-bitmap.c: static void try_partial_reuse(struct bitmap_index *bitmap_git,
     +		base_offset = get_delta_base(pack, w_curs, &offset, type,
     +					     delta_obj_offset);
      		if (!base_offset)
    - 			return;
    + 			return 0;
     -		if (offset_to_pack_pos(bitmap_git->pack, base_offset, &base_pos) < 0)
     +		if (offset_to_pack_pos(pack, base_offset, &base_pos) < 0)
    - 			return;
    + 			return 0;
      
      		/*
    -@@ pack-bitmap.c: static void try_partial_reuse(struct bitmap_index *bitmap_git,
    - 	bitmap_set(reuse, pos);
    +@@ pack-bitmap.c: static int try_partial_reuse(struct bitmap_index *bitmap_git,
    + 	return 0;
      }
      
     +static uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
    @@ pack-bitmap.c: int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitma
      				break;
      
      			offset += ewah_bit_ctz64(word >> offset);
    --			try_partial_reuse(bitmap_git, pos + offset, reuse, &w_curs);
    -+			if (bitmap_is_midx(bitmap_git)) {
    -+				/*
    -+				 * Can't reuse from a non-preferred pack (see
    -+				 * above).
    -+				 */
    -+				if (pos + offset >= objects_nr)
    -+					continue;
    -+			}
    -+			try_partial_reuse(bitmap_git, pack, pos + offset, reuse, &w_curs);
    - 		}
    - 	}
    - 
    +-			if (try_partial_reuse(bitmap_git, pos + offset, reuse,
    +-					      &w_curs) < 0) {
    ++			if (try_partial_reuse(bitmap_git, pack, pos + offset,
    ++					      reuse, &w_curs) < 0) {
    + 				/*
    + 				 * try_partial_reuse indicated we couldn't reuse
    + 				 * any bits, so there is no point in trying more
     @@ pack-bitmap.c: int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
      	 * need to be handled separately.
      	 */
    @@ pack-bitmap.c: static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_
      {
      	struct bitmap *result = bitmap_git->result;
     -	struct packed_git *pack = bitmap_git->pack;
    -+	struct packed_git *pack;
      	off_t total = 0;
      	struct ewah_iterator it;
      	eword_t filter;
     @@ pack-bitmap.c: static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
    + 			continue;
    + 
    + 		for (offset = 0; offset < BITS_IN_EWORD; offset++) {
    +-			size_t pos;
    +-
    + 			if ((word >> offset) == 0)
      				break;
      
      			offset += ewah_bit_ctz64(word >> offset);
     -			pos = base + offset;
    +-			total += pack_pos_to_offset(pack, pos + 1) -
    +-				 pack_pos_to_offset(pack, pos);
     +
     +			if (bitmap_is_midx(bitmap_git)) {
     +				uint32_t pack_pos;
     +				uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, base + offset);
    -+				uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
     +				off_t offset = nth_midxed_offset(bitmap_git->midx, midx_pos);
     +
    -+				pack = bitmap_git->midx->packs[pack_id];
    ++				uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
    ++				struct packed_git *pack = bitmap_git->midx->packs[pack_id];
     +
     +				if (offset_to_pack_pos(pack, offset, &pack_pos) < 0) {
     +					struct object_id oid;
     +					nth_midxed_object_oid(&oid, bitmap_git->midx, midx_pos);
     +
    -+					die(_("could not find %s in pack #%"PRIu32" at offset %"PRIuMAX),
    ++					die(_("could not find %s in pack %s at offset %"PRIuMAX),
     +					    oid_to_hex(&oid),
    -+					    pack_id,
    ++					    pack->pack_name,
     +					    (uintmax_t)offset);
     +				}
     +
    -+				pos = pack_pos;
    ++				total += pack_pos_to_offset(pack, pack_pos + 1) - offset;
     +			} else {
    -+				pack = bitmap_git->pack;
    -+				pos = base + offset;
    ++				size_t pos = base + offset;
    ++				total += pack_pos_to_offset(bitmap_git->pack, pos + 1) -
    ++					 pack_pos_to_offset(bitmap_git->pack, pos);
     +			}
    -+
    - 			total += pack_pos_to_offset(pack, pos + 1) -
    - 				 pack_pos_to_offset(pack, pos);
      		}
    + 	}
    + 
     @@ pack-bitmap.c: off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git,
      	return total;
      }
    @@ pack-bitmap.c: off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git,
     +{
     +	return !!bitmap_git->midx;
     +}
    -+
    -+off_t bitmap_pack_offset(struct bitmap_index *bitmap_git, uint32_t pos)
    -+{
    -+	if (bitmap_is_midx(bitmap_git))
    -+		return nth_midxed_offset(bitmap_git->midx,
    -+					 pack_pos_to_midx(bitmap_git->midx, pos));
    -+	return nth_packed_object_offset(bitmap_git->pack,
    -+					pack_pos_to_index(bitmap_git->pack, pos));
    -+}
     +
      const struct string_list *bitmap_preferred_tips(struct repository *r)
      {
      	return repo_config_get_value_multi(r, "pack.preferbitmaptips");
     
      ## pack-bitmap.h ##
    +@@ pack-bitmap.h: typedef int (*show_reachable_fn)(
    + struct bitmap_index;
    + 
    + struct bitmap_index *prepare_bitmap_git(struct repository *r);
    ++struct bitmap_index *prepare_midx_bitmap_git(struct repository *r,
    ++					     struct multi_pack_index *midx);
    + void count_bitmap_commit_list(struct bitmap_index *, uint32_t *commits,
    + 			      uint32_t *trees, uint32_t *blobs, uint32_t *tags);
    + void traverse_bitmap_commit_list(struct bitmap_index *,
     @@ pack-bitmap.h: void bitmap_writer_finish(struct pack_idx_entry **index,
      			  uint32_t index_nr,
      			  const char *filename,
    @@ pack-bitmap.h: void bitmap_writer_finish(struct pack_idx_entry **index,
     +char *pack_bitmap_filename(struct packed_git *p);
     +
     +int bitmap_is_midx(struct bitmap_index *bitmap_git);
    -+off_t bitmap_pack_offset(struct bitmap_index *bitmap_git, uint32_t pos);
      
      const struct string_list *bitmap_preferred_tips(struct repository *r);
      int bitmap_is_preferred_refname(struct repository *r, const char *refname);
14:  a8cec2463d ! 15:  54600b5814 pack-bitmap: write multi-pack bitmaps
    @@ Documentation/git-multi-pack-index.txt: SYNOPSIS
      DESCRIPTION
      -----------
     @@ Documentation/git-multi-pack-index.txt: write::
    - 		multiple packs contain the same object. If not given,
    - 		ties are broken in favor of the pack with the lowest
    - 		mtime.
    + 		multiple packs contain the same object. `<pack>` must
    + 		contain at least one object. If not given, ties are
    + 		broken in favor of the pack with the lowest mtime.
     +
     +	--[no-]bitmap::
     +		Control whether or not a multi-pack bitmap is written.
    @@ Documentation/git-multi-pack-index.txt: EXAMPLES
     +corresponding bitmap.
     ++
     +-------------------------------------------------------------
    -+$ git multi-pack-index write --preferred-pack <pack> --bitmap
    ++$ git multi-pack-index write --preferred-pack=<pack> --bitmap
     +-------------------------------------------------------------
     +
      * Write a MIDX file for the packfiles in an alternate object store.
    @@ midx.c
      
      #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
      #define MIDX_VERSION 1
    -@@ midx.c: static void write_midx_reverse_index(char *midx_name, unsigned char *midx_hash,
    - static void clear_midx_files_ext(struct repository *r, const char *ext,
    - 				 unsigned char *keep_hash);
    +@@ midx.c: static int midx_checksum_valid(struct multi_pack_index *m)
    + 	return hashfile_checksum_valid(m->data, m->data_len);
    + }
      
     +static void prepare_midx_packing_data(struct packing_data *pdata,
     +				      struct write_midx_context *ctx)
    @@ midx.c: static void write_midx_reverse_index(char *midx_name, unsigned char *mid
     +static void bitmap_show_commit(struct commit *commit, void *_data)
     +{
     +	struct bitmap_commit_cb *data = _data;
    -+	if (oid_pos(&commit->object.oid, data->ctx->entries,
    -+		    data->ctx->entries_nr,
    -+		    bitmap_oid_access) > -1) {
    -+		ALLOC_GROW(data->commits, data->commits_nr + 1,
    -+			   data->commits_alloc);
    -+		data->commits[data->commits_nr++] = commit;
    -+	}
    ++	int pos = oid_pos(&commit->object.oid, data->ctx->entries,
    ++			  data->ctx->entries_nr,
    ++			  bitmap_oid_access);
    ++	if (pos < 0)
    ++		return;
    ++
    ++	ALLOC_GROW(data->commits, data->commits_nr + 1, data->commits_alloc);
    ++	data->commits[data->commits_nr++] = commit;
     +}
     +
     +static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr_p,
     +						    struct write_midx_context *ctx)
     +{
     +	struct rev_info revs;
    -+	struct bitmap_commit_cb cb;
    ++	struct bitmap_commit_cb cb = {0};
     +
    -+	memset(&cb, 0, sizeof(struct bitmap_commit_cb));
     +	cb.ctx = ctx;
     +
     +	repo_init_revisions(the_repository, &revs, NULL);
    ++	setup_revisions(0, NULL, &revs, NULL);
     +	for_each_ref(add_ref_to_pending, &revs);
     +
     +	/*
    @@ midx.c: static void write_midx_reverse_index(char *midx_name, unsigned char *mid
     +	fetch_if_missing = 0;
     +	revs.exclude_promisor_objects = 1;
     +
    -+	/*
    -+	 * Pass selected commits in topo order to match the behavior of
    -+	 * pack-bitmaps when configured with delta islands.
    -+	 */
    -+	revs.topo_order = 1;
    -+	revs.sort_order = REV_SORT_IN_GRAPH_ORDER;
    -+
     +	if (prepare_revision_walk(&revs))
     +		die(_("revision walk setup failed"));
     +
    @@ midx.c: static void write_midx_reverse_index(char *midx_name, unsigned char *mid
     +	 */
     +	ALLOC_ARRAY(index, pdata.nr_objects);
     +	for (i = 0; i < pdata.nr_objects; i++)
    -+		index[i] = (struct pack_idx_entry *)&pdata.objects[i];
    ++		index[i] = &pdata.objects[i].idx;
     +
     +	bitmap_writer_show_progress(flags & MIDX_PROGRESS);
     +	bitmap_writer_build_type_index(&pdata, index, pdata.nr_objects);
    @@ midx.c: static void write_midx_reverse_index(char *midx_name, unsigned char *mid
     +	 * bitmap_writer_finish().
     +	 */
     +	for (i = 0; i < pdata.nr_objects; i++)
    -+		index[ctx->pack_order[i]] = (struct pack_idx_entry *)&pdata.objects[i];
    ++		index[ctx->pack_order[i]] = &pdata.objects[i].idx;
     +
     +	bitmap_writer_select_commits(commits, commits_nr, -1);
     +	ret = bitmap_writer_build(&pdata);
    @@ midx.c: static void write_midx_reverse_index(char *midx_name, unsigned char *mid
     +	return ret;
     +}
     +
    - static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
    + static int write_midx_internal(const char *object_dir,
      			       struct string_list *packs_to_drop,
      			       const char *preferred_pack_name,
    -@@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
    - 		for (i = 0; i < ctx.m->num_packs; i++) {
    - 			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
    +@@ midx.c: static int write_midx_internal(const char *object_dir,
      
    -+			if (prepare_midx_pack(the_repository, ctx.m, i)) {
    -+				error(_("could not load pack %s"),
    -+				      ctx.m->pack_names[i]);
    -+				result = 1;
    -+				goto cleanup;
    -+			}
    -+
      			ctx.info[ctx.nr].orig_pack_int_id = i;
      			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
     -			ctx.info[ctx.nr].p = NULL;
     +			ctx.info[ctx.nr].p = ctx.m->packs[i];
      			ctx.info[ctx.nr].expired = 0;
    - 			ctx.nr++;
    - 		}
    -@@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
    + 
    + 			if (flags & MIDX_WRITE_REV_INDEX) {
    +@@ midx.c: static int write_midx_internal(const char *object_dir,
      	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
      	stop_progress(&ctx.progress);
      
    @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
     +		int bitmap_exists;
     +		int want_bitmap = flags & MIDX_WRITE_BITMAP;
     +
    -+		bitmap_git = prepare_bitmap_git(the_repository);
    ++		bitmap_git = prepare_midx_bitmap_git(the_repository, ctx.m);
     +		bitmap_exists = bitmap_git && bitmap_is_midx(bitmap_git);
     +		free_bitmap_index(bitmap_git);
     +
    @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
      
      	if (preferred_pack_name) {
      		int found = 0;
    -@@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
    +@@ midx.c: static int write_midx_internal(const char *object_dir,
      		if (!found)
      			warning(_("unknown preferred pack: '%s'"),
      				preferred_pack_name);
     -	} else if (ctx.nr && (flags & MIDX_WRITE_REV_INDEX)) {
     +	} else if (ctx.nr &&
     +		   (flags & (MIDX_WRITE_REV_INDEX | MIDX_WRITE_BITMAP))) {
    - 		time_t oldest = ctx.info[0].p->mtime;
    + 		struct packed_git *oldest = ctx.info[ctx.preferred_pack_idx].p;
      		ctx.preferred_pack_idx = 0;
      
    -@@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
    +@@ midx.c: static int write_midx_internal(const char *object_dir,
      	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
      	f = hashfd(get_lock_file_fd(&lk), get_lock_file_path(&lk));
      
    @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
      	if (ctx.nr - dropped_packs == 0) {
      		error(_("no pack files to index."));
      		result = 1;
    -@@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
    +@@ midx.c: static int write_midx_internal(const char *object_dir,
      	finalize_hashfile(f, midx_hash, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
      	free_chunkfile(cf);
      
    @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
     +			goto cleanup;
     +		}
     +	}
    ++
    ++	close_midx(ctx.m);
      
      	commit_lock_file(&lk);
      
    @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
      	clear_midx_files_ext(the_repository, ".rev", midx_hash);
      
      cleanup:
    -@@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
    - 		if (ctx.info[i].p) {
    - 			close_pack(ctx.info[i].p);
    - 			free(ctx.info[i].p);
    -+			if (ctx.m) {
    -+				/*
    -+				 * Destroy a stale reference to the pack in
    -+				 * 'ctx.m'.
    -+				 */
    -+				uint32_t orig = ctx.info[i].orig_pack_int_id;
    -+				if (orig < ctx.m->num_packs)
    -+					ctx.m->packs[orig] = NULL;
    -+			}
    - 		}
    - 		free(ctx.info[i].pack_name);
    - 	}
    -@@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
    +@@ midx.c: static int write_midx_internal(const char *object_dir,
      	free(ctx.pack_perm);
      	free(ctx.pack_order);
      	free(midx_name);
    -+	if (ctx.m)
    -+		close_midx(ctx.m);
     +
      	return result;
      }
15:  c63eb637c8 = 16:  168b7b0976 t5310: move some tests to lib-bitmap.sh
16:  bedb7afb37 = 17:  60ec8b3466 t/helper/test-read-midx.c: add --checksum mode
17:  fbfac4ae8e = 18:  3258ccfc1c t5326: test multi-pack bitmap behavior
18:  2a5df1832a = 19:  47c7e6bb9b t0410: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP
19:  2d24c5b7ad = 20:  6a708858b1 t5310: disable GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP
20:  4cbfaa0e97 = 21:  1eaa744b24 t5319: don't write MIDX bitmaps in t5319
21:  839a7a79eb = 22:  a4a899e31f t7700: update to work with MIDX bitmap test knob
22:  00418d5b09 ! 23:  50865e52a3 midx: respect 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP'
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      		if (!(pack_everything & ALL_INTO_ONE) ||
      		    !is_bare_repository())
      			write_bitmaps = 0;
    --	}
     +	} else if (write_bitmaps &&
     +		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0) &&
    -+		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP, 0))
    ++		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP, 0)) {
     +		write_bitmaps = 0;
    + 	}
      	if (pack_kept_objects < 0)
      		pack_kept_objects = write_bitmaps > 0;
    - 
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
      		update_server_info(0);
      	remove_temporary_files();
23:  98fa73a76a = 24:  0f1fd6e7d4 p5310: extract full and partial bitmap tests
24:  ec0f53b424 = 25:  82e8133bf4 p5326: perf tests for MIDX bitmaps
-- 
2.31.1.163.ga65ce7f831

^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 01/25] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
@ 2021-07-27 21:19   ` Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 02/25] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps Taylor Blau
                     ` (24 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

The special `--test-bitmap` mode of `git rev-list` is used to compare
the result of an object traversal with a bitmap to check its integrity.
This mode does not, however, assert that the types of reachable objects
are stored correctly.

Harden this mode by teaching it to also check that each time an object's
bit is marked, the corresponding bit should be set in exactly one of the
type bitmaps (whose type matches the object's true type).

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index bfc10148f5..a73960a55d 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1319,10 +1319,52 @@ void count_bitmap_commit_list(struct bitmap_index *bitmap_git,
 struct bitmap_test_data {
 	struct bitmap_index *bitmap_git;
 	struct bitmap *base;
+	struct bitmap *commits;
+	struct bitmap *trees;
+	struct bitmap *blobs;
+	struct bitmap *tags;
 	struct progress *prg;
 	size_t seen;
 };
 
+static void test_bitmap_type(struct bitmap_test_data *tdata,
+			     struct object *obj, int pos)
+{
+	enum object_type bitmap_type = OBJ_NONE;
+	int bitmaps_nr = 0;
+
+	if (bitmap_get(tdata->commits, pos)) {
+		bitmap_type = OBJ_COMMIT;
+		bitmaps_nr++;
+	}
+	if (bitmap_get(tdata->trees, pos)) {
+		bitmap_type = OBJ_TREE;
+		bitmaps_nr++;
+	}
+	if (bitmap_get(tdata->blobs, pos)) {
+		bitmap_type = OBJ_BLOB;
+		bitmaps_nr++;
+	}
+	if (bitmap_get(tdata->tags, pos)) {
+		bitmap_type = OBJ_TAG;
+		bitmaps_nr++;
+	}
+
+	if (bitmap_type == OBJ_NONE)
+		die("object %s not found in type bitmaps",
+		    oid_to_hex(&obj->oid));
+
+	if (bitmaps_nr > 1)
+		die("object %s does not have a unique type",
+		    oid_to_hex(&obj->oid));
+
+	if (bitmap_type != obj->type)
+		die("object %s: real type %s, expected: %s",
+		    oid_to_hex(&obj->oid),
+		    type_name(obj->type),
+		    type_name(bitmap_type));
+}
+
 static void test_show_object(struct object *object, const char *name,
 			     void *data)
 {
@@ -1332,6 +1374,7 @@ static void test_show_object(struct object *object, const char *name,
 	bitmap_pos = bitmap_position(tdata->bitmap_git, &object->oid);
 	if (bitmap_pos < 0)
 		die("Object not in bitmap: %s\n", oid_to_hex(&object->oid));
+	test_bitmap_type(tdata, object, bitmap_pos);
 
 	bitmap_set(tdata->base, bitmap_pos);
 	display_progress(tdata->prg, ++tdata->seen);
@@ -1346,6 +1389,7 @@ static void test_show_commit(struct commit *commit, void *data)
 				     &commit->object.oid);
 	if (bitmap_pos < 0)
 		die("Object not in bitmap: %s\n", oid_to_hex(&commit->object.oid));
+	test_bitmap_type(tdata, &commit->object, bitmap_pos);
 
 	bitmap_set(tdata->base, bitmap_pos);
 	display_progress(tdata->prg, ++tdata->seen);
@@ -1393,6 +1437,10 @@ void test_bitmap_walk(struct rev_info *revs)
 
 	tdata.bitmap_git = bitmap_git;
 	tdata.base = bitmap_new();
+	tdata.commits = ewah_to_bitmap(bitmap_git->commits);
+	tdata.trees = ewah_to_bitmap(bitmap_git->trees);
+	tdata.blobs = ewah_to_bitmap(bitmap_git->blobs);
+	tdata.tags = ewah_to_bitmap(bitmap_git->tags);
 	tdata.prg = start_progress("Verifying bitmap entries", result_popcnt);
 	tdata.seen = 0;
 
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 02/25] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 01/25] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps Taylor Blau
@ 2021-07-27 21:19   ` Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 03/25] pack-bitmap-write.c: free existing bitmaps Taylor Blau
                     ` (23 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

The set of objects covered by a bitmap must be closed under
reachability, since it must be the case that there is a valid bit
position assigned for every possible reachable object (otherwise the
bitmaps would be incomplete).

Pack bitmaps are never written from 'git repack' unless repacking
all-into-one, and so we never write non-closed bitmaps (except in the
case of partial clones where we aren't guaranteed to have all objects).

But multi-pack bitmaps change this, since it isn't known whether the
set of objects in the MIDX is closed under reachability until walking
them. Plumb through a bit that is set when a reachable object isn't
found.

As soon as a reachable object isn't found in the set of objects to
include in the bitmap, bitmap_writer_build() knows that the set is not
closed, and so it now fails gracefully.

A test is added in t0410 to trigger a bitmap write without full
reachability closure by removing local copies of some reachable objects
from a promisor remote.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c   |  3 +-
 pack-bitmap-write.c      | 76 ++++++++++++++++++++++++++++------------
 pack-bitmap.h            |  2 +-
 t/t0410-partial-clone.sh |  9 ++++-
 4 files changed, 64 insertions(+), 26 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index de00adbb9e..8a523624a1 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1256,7 +1256,8 @@ static void write_pack_file(void)
 
 				bitmap_writer_show_progress(progress);
 				bitmap_writer_select_commits(indexed_commits, indexed_commits_nr, -1);
-				bitmap_writer_build(&to_pack);
+				if (bitmap_writer_build(&to_pack) < 0)
+					die(_("failed to write bitmap index"));
 				bitmap_writer_finish(written_list, nr_written,
 						     tmpname.buf, write_bitmap_options);
 				write_bitmap_index = 0;
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index 88d9e696a5..d374f7884b 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -125,15 +125,20 @@ static inline void push_bitmapped_commit(struct commit *commit)
 	writer.selected_nr++;
 }
 
-static uint32_t find_object_pos(const struct object_id *oid)
+static uint32_t find_object_pos(const struct object_id *oid, int *found)
 {
 	struct object_entry *entry = packlist_find(writer.to_pack, oid);
 
 	if (!entry) {
-		die("Failed to write bitmap index. Packfile doesn't have full closure "
+		if (found)
+			*found = 0;
+		warning("Failed to write bitmap index. Packfile doesn't have full closure "
 			"(object %s is missing)", oid_to_hex(oid));
+		return 0;
 	}
 
+	if (found)
+		*found = 1;
 	return oe_in_pack_pos(writer.to_pack, entry);
 }
 
@@ -331,9 +336,10 @@ static void bitmap_builder_clear(struct bitmap_builder *bb)
 	bb->commits_nr = bb->commits_alloc = 0;
 }
 
-static void fill_bitmap_tree(struct bitmap *bitmap,
-			     struct tree *tree)
+static int fill_bitmap_tree(struct bitmap *bitmap,
+			    struct tree *tree)
 {
+	int found;
 	uint32_t pos;
 	struct tree_desc desc;
 	struct name_entry entry;
@@ -342,9 +348,11 @@ static void fill_bitmap_tree(struct bitmap *bitmap,
 	 * If our bit is already set, then there is nothing to do. Both this
 	 * tree and all of its children will be set.
 	 */
-	pos = find_object_pos(&tree->object.oid);
+	pos = find_object_pos(&tree->object.oid, &found);
+	if (!found)
+		return -1;
 	if (bitmap_get(bitmap, pos))
-		return;
+		return 0;
 	bitmap_set(bitmap, pos);
 
 	if (parse_tree(tree) < 0)
@@ -355,11 +363,15 @@ static void fill_bitmap_tree(struct bitmap *bitmap,
 	while (tree_entry(&desc, &entry)) {
 		switch (object_type(entry.mode)) {
 		case OBJ_TREE:
-			fill_bitmap_tree(bitmap,
-					 lookup_tree(the_repository, &entry.oid));
+			if (fill_bitmap_tree(bitmap,
+					     lookup_tree(the_repository, &entry.oid)) < 0)
+				return -1;
 			break;
 		case OBJ_BLOB:
-			bitmap_set(bitmap, find_object_pos(&entry.oid));
+			pos = find_object_pos(&entry.oid, &found);
+			if (!found)
+				return -1;
+			bitmap_set(bitmap, pos);
 			break;
 		default:
 			/* Gitlink, etc; not reachable */
@@ -368,15 +380,18 @@ static void fill_bitmap_tree(struct bitmap *bitmap,
 	}
 
 	free_tree_buffer(tree);
+	return 0;
 }
 
-static void fill_bitmap_commit(struct bb_commit *ent,
-			       struct commit *commit,
-			       struct prio_queue *queue,
-			       struct prio_queue *tree_queue,
-			       struct bitmap_index *old_bitmap,
-			       const uint32_t *mapping)
+static int fill_bitmap_commit(struct bb_commit *ent,
+			      struct commit *commit,
+			      struct prio_queue *queue,
+			      struct prio_queue *tree_queue,
+			      struct bitmap_index *old_bitmap,
+			      const uint32_t *mapping)
 {
+	int found;
+	uint32_t pos;
 	if (!ent->bitmap)
 		ent->bitmap = bitmap_new();
 
@@ -401,11 +416,16 @@ static void fill_bitmap_commit(struct bb_commit *ent,
 		 * Mark ourselves and queue our tree. The commit
 		 * walk ensures we cover all parents.
 		 */
-		bitmap_set(ent->bitmap, find_object_pos(&c->object.oid));
+		pos = find_object_pos(&c->object.oid, &found);
+		if (!found)
+			return -1;
+		bitmap_set(ent->bitmap, pos);
 		prio_queue_put(tree_queue, get_commit_tree(c));
 
 		for (p = c->parents; p; p = p->next) {
-			int pos = find_object_pos(&p->item->object.oid);
+			pos = find_object_pos(&p->item->object.oid, &found);
+			if (!found)
+				return -1;
 			if (!bitmap_get(ent->bitmap, pos)) {
 				bitmap_set(ent->bitmap, pos);
 				prio_queue_put(queue, p->item);
@@ -413,8 +433,12 @@ static void fill_bitmap_commit(struct bb_commit *ent,
 		}
 	}
 
-	while (tree_queue->nr)
-		fill_bitmap_tree(ent->bitmap, prio_queue_get(tree_queue));
+	while (tree_queue->nr) {
+		if (fill_bitmap_tree(ent->bitmap,
+				     prio_queue_get(tree_queue)) < 0)
+			return -1;
+	}
+	return 0;
 }
 
 static void store_selected(struct bb_commit *ent, struct commit *commit)
@@ -432,7 +456,7 @@ static void store_selected(struct bb_commit *ent, struct commit *commit)
 	kh_value(writer.bitmaps, hash_pos) = stored;
 }
 
-void bitmap_writer_build(struct packing_data *to_pack)
+int bitmap_writer_build(struct packing_data *to_pack)
 {
 	struct bitmap_builder bb;
 	size_t i;
@@ -441,6 +465,7 @@ void bitmap_writer_build(struct packing_data *to_pack)
 	struct prio_queue tree_queue = { NULL };
 	struct bitmap_index *old_bitmap;
 	uint32_t *mapping;
+	int closed = 1; /* until proven otherwise */
 
 	writer.bitmaps = kh_init_oid_map();
 	writer.to_pack = to_pack;
@@ -463,8 +488,11 @@ void bitmap_writer_build(struct packing_data *to_pack)
 		struct commit *child;
 		int reused = 0;
 
-		fill_bitmap_commit(ent, commit, &queue, &tree_queue,
-				   old_bitmap, mapping);
+		if (fill_bitmap_commit(ent, commit, &queue, &tree_queue,
+				       old_bitmap, mapping) < 0) {
+			closed = 0;
+			break;
+		}
 
 		if (ent->selected) {
 			store_selected(ent, commit);
@@ -499,7 +527,9 @@ void bitmap_writer_build(struct packing_data *to_pack)
 
 	stop_progress(&writer.progress);
 
-	compute_xor_offsets();
+	if (closed)
+		compute_xor_offsets();
+	return closed ? 0 : -1;
 }
 
 /**
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 99d733eb26..020cd8d868 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -87,7 +87,7 @@ struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
 				      struct commit *commit);
 void bitmap_writer_select_commits(struct commit **indexed_commits,
 		unsigned int indexed_commits_nr, int max_bitmaps);
-void bitmap_writer_build(struct packing_data *to_pack);
+int bitmap_writer_build(struct packing_data *to_pack);
 void bitmap_writer_finish(struct pack_idx_entry **index,
 			  uint32_t index_nr,
 			  const char *filename,
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index a211a66c67..bbcc51ee8e 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -536,7 +536,13 @@ test_expect_success 'gc does not repack promisor objects if there are none' '
 repack_and_check () {
 	rm -rf repo2 &&
 	cp -r repo repo2 &&
-	git -C repo2 repack $1 -d &&
+	if test x"$1" = "x--must-fail"
+	then
+		shift
+		test_must_fail git -C repo2 repack $1 -d
+	else
+		git -C repo2 repack $1 -d
+	fi &&
 	git -C repo2 fsck &&
 
 	git -C repo2 cat-file -e $2 &&
@@ -561,6 +567,7 @@ test_expect_success 'repack -d does not irreversibly delete promisor objects' '
 	printf "$THREE\n" | pack_as_from_promisor &&
 	delete_object repo "$ONE" &&
 
+	repack_and_check --must-fail -ab "$TWO" "$THREE" &&
 	repack_and_check -a "$TWO" "$THREE" &&
 	repack_and_check -A "$TWO" "$THREE" &&
 	repack_and_check -l "$TWO" "$THREE"
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 03/25] pack-bitmap-write.c: free existing bitmaps
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 01/25] pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 02/25] pack-bitmap-write.c: gracefully fail to write non-closed bitmaps Taylor Blau
@ 2021-07-27 21:19   ` Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 04/25] Documentation: describe MIDX-based bitmaps Taylor Blau
                     ` (22 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

When writing a new bitmap, the bitmap writer code attempts to read the
existing bitmap (if one is present). This is done in order to quickly
permute the bits of any bitmaps for commits which appear in the existing
bitmap, and were also selected for the new bitmap.

But since this code was added in 341fa34887 (pack-bitmap-write: use
existing bitmaps, 2020-12-08), the resources associated with opening an
existing bitmap were never released.

It's fine to ignore this, but it's bad hygiene. It will also cause a
problem for the multi-pack-index builtin, which will be responsible not
only for writing bitmaps, but also for expiring any old multi-pack
bitmaps.

If an existing bitmap was reused here, it will also be expired. That
will cause a problem on platforms which require file resources to be
closed before unlinking them, like Windows. Avoid this by ensuring we
close reused bitmaps with free_bitmap_index() before removing them.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap-write.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index d374f7884b..142fd0adb8 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -520,6 +520,7 @@ int bitmap_writer_build(struct packing_data *to_pack)
 	clear_prio_queue(&queue);
 	clear_prio_queue(&tree_queue);
 	bitmap_builder_clear(&bb);
+	free_bitmap_index(old_bitmap);
 	free(mapping);
 
 	trace2_region_leave("pack-bitmap-write", "building_bitmaps_total",
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 04/25] Documentation: describe MIDX-based bitmaps
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
                     ` (2 preceding siblings ...)
  2021-07-27 21:19   ` [PATCH v3 03/25] pack-bitmap-write.c: free existing bitmaps Taylor Blau
@ 2021-07-27 21:19   ` Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 05/25] midx: clear auxiliary .rev after replacing the MIDX Taylor Blau
                     ` (21 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Update the technical documentation to describe the multi-pack bitmap
format. This patch merely introduces the new format, and describes its
high-level ideas. Git does not yet know how to read nor write these
multi-pack variants, and so the subsequent patches will:

  - Introduce code to interpret multi-pack bitmaps, according to this
    document.

  - Then, introduce code to write multi-pack bitmaps from the 'git
    multi-pack-index write' sub-command.

Finally, the implementation will gain tests in subsequent patches (as
opposed to inline with the patch teaching Git how to write multi-pack
bitmaps) to avoid a cyclic dependency.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/bitmap-format.txt    | 71 ++++++++++++++++----
 Documentation/technical/multi-pack-index.txt | 10 +--
 2 files changed, 60 insertions(+), 21 deletions(-)

diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
index f8c18a0f7a..04b3ec2178 100644
--- a/Documentation/technical/bitmap-format.txt
+++ b/Documentation/technical/bitmap-format.txt
@@ -1,6 +1,44 @@
 GIT bitmap v1 format
 ====================
 
+== Pack and multi-pack bitmaps
+
+Bitmaps store reachability information about the set of objects in a packfile,
+or a multi-pack index (MIDX). The former is defined obviously, and the latter is
+defined as the union of objects in packs contained in the MIDX.
+
+A bitmap may belong to either one pack, or the repository's multi-pack index (if
+it exists). A repository may have at most one bitmap.
+
+An object is uniquely described by its bit position within a bitmap:
+
+	- If the bitmap belongs to a packfile, the __n__th bit corresponds to
+	the __n__th object in pack order. For a function `offset` which maps
+	objects to their byte offset within a pack, pack order is defined as
+	follows:
+
+		o1 <= o2 <==> offset(o1) <= offset(o2)
+
+	- If the bitmap belongs to a MIDX, the __n__th bit corresponds to the
+	__n__th object in MIDX order. With an additional function `pack` which
+	maps objects to the pack they were selected from by the MIDX, MIDX order
+	is defined as follows:
+
+		o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)
+
+	The ordering between packs is done according to the MIDX's .rev file.
+	Notably, the preferred pack sorts ahead of all other packs.
+
+The on-disk representation (described below) of a bitmap is the same regardless
+of whether or not that bitmap belongs to a packfile or a MIDX. The only
+difference is the interpretation of the bits, which is described above.
+
+Certain bitmap extensions are supported (see: Appendix B). No extensions are
+required for bitmaps corresponding to packfiles. For bitmaps that correspond to
+MIDXs, both the bit-cache and rev-cache extensions are required.
+
+== On-disk format
+
 	- A header appears at the beginning:
 
 		4-byte signature: {'B', 'I', 'T', 'M'}
@@ -14,17 +52,19 @@ GIT bitmap v1 format
 			The following flags are supported:
 
 			- BITMAP_OPT_FULL_DAG (0x1) REQUIRED
-			This flag must always be present. It implies that the bitmap
-			index has been generated for a packfile with full closure
-			(i.e. where every single object in the packfile can find
-			 its parent links inside the same packfile). This is a
-			requirement for the bitmap index format, also present in JGit,
-			that greatly reduces the complexity of the implementation.
+			This flag must always be present. It implies that the
+			bitmap index has been generated for a packfile or
+			multi-pack index (MIDX) with full closure (i.e. where
+			every single object in the packfile/MIDX can find its
+			parent links inside the same packfile/MIDX). This is a
+			requirement for the bitmap index format, also present in
+			JGit, that greatly reduces the complexity of the
+			implementation.
 
 			- BITMAP_OPT_HASH_CACHE (0x4)
 			If present, the end of the bitmap file contains
 			`N` 32-bit name-hash values, one per object in the
-			pack. The format and meaning of the name-hash is
+			pack/MIDX. The format and meaning of the name-hash is
 			described below.
 
 		4-byte entry count (network byte order)
@@ -33,7 +73,8 @@ GIT bitmap v1 format
 
 		20-byte checksum
 
-			The SHA1 checksum of the pack this bitmap index belongs to.
+			The SHA1 checksum of the pack/MIDX this bitmap index
+			belongs to.
 
 	- 4 EWAH bitmaps that act as type indexes
 
@@ -50,7 +91,7 @@ GIT bitmap v1 format
 			- Tags
 
 		In each bitmap, the `n`th bit is set to true if the `n`th object
-		in the packfile is of that type.
+		in the packfile or multi-pack index is of that type.
 
 		The obvious consequence is that the OR of all 4 bitmaps will result
 		in a full set (all bits set), and the AND of all 4 bitmaps will
@@ -62,8 +103,9 @@ GIT bitmap v1 format
 		Each entry contains the following:
 
 		- 4-byte object position (network byte order)
-			The position **in the index for the packfile** where the
-			bitmap for this commit is found.
+			The position **in the index for the packfile or
+			multi-pack index** where the bitmap for this commit is
+			found.
 
 		- 1-byte XOR-offset
 			The xor offset used to compress this bitmap. For an entry
@@ -146,10 +188,11 @@ Name-hash cache
 ---------------
 
 If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains
-a cache of 32-bit values, one per object in the pack. The value at
+a cache of 32-bit values, one per object in the pack/MIDX. The value at
 position `i` is the hash of the pathname at which the `i`th object
-(counting in index order) in the pack can be found.  This can be fed
-into the delta heuristics to compare objects with similar pathnames.
+(counting in index or multi-pack index order) in the pack/MIDX can be found.
+This can be fed into the delta heuristics to compare objects with similar
+pathnames.
 
 The hash algorithm used is:
 
diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
index fb688976c4..1a73c3ee20 100644
--- a/Documentation/technical/multi-pack-index.txt
+++ b/Documentation/technical/multi-pack-index.txt
@@ -71,14 +71,10 @@ Future Work
   still reducing the number of binary searches required for object
   lookups.
 
-- The reachability bitmap is currently paired directly with a single
-  packfile, using the pack-order as the object order to hopefully
-  compress the bitmaps well using run-length encoding. This could be
-  extended to pair a reachability bitmap with a multi-pack-index. If
-  the multi-pack-index is extended to store a "stable object order"
+- If the multi-pack-index is extended to store a "stable object order"
   (a function Order(hash) = integer that is constant for a given hash,
-  even as the multi-pack-index is updated) then a reachability bitmap
-  could point to a multi-pack-index and be updated independently.
+  even as the multi-pack-index is updated) then MIDX bitmaps could be
+  updated independently of the MIDX.
 
 - Packfiles can be marked as "special" using empty files that share
   the initial name but replace ".pack" with ".keep" or ".promisor".
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 05/25] midx: clear auxiliary .rev after replacing the MIDX
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
                     ` (3 preceding siblings ...)
  2021-07-27 21:19   ` [PATCH v3 04/25] Documentation: describe MIDX-based bitmaps Taylor Blau
@ 2021-07-27 21:19   ` Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 06/25] midx: reject empty `--preferred-pack`'s Taylor Blau
                     ` (20 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

When writing a new multi-pack index, write_midx_internal() attempts to
clean up any auxiliary files (currently just the MIDX's `.rev` file, but
soon to include a `.bitmap`, too) corresponding to the MIDX it's
replacing.

This step should happen after the new MIDX is written into place, since
doing so beforehand means that the old MIDX could be read without its
corresponding .rev file.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 9a35b0255d..3193426d24 100644
--- a/midx.c
+++ b/midx.c
@@ -1086,10 +1086,11 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	if (flags & MIDX_WRITE_REV_INDEX)
 		write_midx_reverse_index(midx_name, midx_hash, &ctx);
-	clear_midx_files_ext(the_repository, ".rev", midx_hash);
 
 	commit_lock_file(&lk);
 
+	clear_midx_files_ext(the_repository, ".rev", midx_hash);
+
 cleanup:
 	for (i = 0; i < ctx.nr; i++) {
 		if (ctx.info[i].p) {
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 06/25] midx: reject empty `--preferred-pack`'s
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
                     ` (4 preceding siblings ...)
  2021-07-27 21:19   ` [PATCH v3 05/25] midx: clear auxiliary .rev after replacing the MIDX Taylor Blau
@ 2021-07-27 21:19   ` Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 07/25] midx: infer preferred pack when not given one Taylor Blau
                     ` (19 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

The soon-to-be-implemented multi-pack bitmap treats object in the first
bit position specially by assuming that all objects in the pack it was
selected from are also represented from that pack in the MIDX. In other
words, the pack from which the first object was selected must also have
all of its other objects selected from that same pack in the MIDX in
case of any duplicates.

But this assumption relies on the fact that there is at least one object
in that pack to begin with; otherwise the object in the first bit
position isn't from a preferred pack, in which case we can no longer
assume that all objects in that pack were also selected from the same
pack.

Guard this assumption by checking the number of objects in the given
preferred pack, and failing if the given pack is empty.

To make sure we can safely perform this check, open any packs which are
contained in an existing MIDX via prepare_midx_pack(). The same is done
for new packs via the add_pack_to_midx() callback, but packs picked up
from a previous MIDX will not yet have these opened.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-multi-pack-index.txt |  6 +++---
 midx.c                                 | 29 ++++++++++++++++++++++++++
 t/t5319-multi-pack-index.sh            | 17 +++++++++++++++
 3 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index ffd601bc17..c9b063d31e 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -37,9 +37,9 @@ write::
 --
 	--preferred-pack=<pack>::
 		Optionally specify the tie-breaking pack used when
-		multiple packs contain the same object. If not given,
-		ties are broken in favor of the pack with the lowest
-		mtime.
+		multiple packs contain the same object. `<pack>` must
+		contain at least one object. If not given, ties are
+		broken in favor of the pack with the lowest mtime.
 --
 
 verify::
diff --git a/midx.c b/midx.c
index 3193426d24..092dbf45b6 100644
--- a/midx.c
+++ b/midx.c
@@ -934,6 +934,25 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
 			ctx.info[ctx.nr].p = NULL;
 			ctx.info[ctx.nr].expired = 0;
+
+			if (flags & MIDX_WRITE_REV_INDEX) {
+				/*
+				 * If generating a reverse index, need to have
+				 * packed_git's loaded to compare their
+				 * mtimes and object count.
+				 */
+				if (prepare_midx_pack(the_repository, ctx.m, i)) {
+					error(_("could not load pack"));
+					result = 1;
+					goto cleanup;
+				}
+
+				if (open_pack_index(ctx.m->packs[i]))
+					die(_("could not open index for %s"),
+					    ctx.m->packs[i]->pack_name);
+				ctx.info[ctx.nr].p = ctx.m->packs[i];
+			}
+
 			ctx.nr++;
 		}
 	}
@@ -961,6 +980,16 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		}
 	}
 
+	if (ctx.preferred_pack_idx > -1) {
+		struct packed_git *preferred = ctx.info[ctx.preferred_pack_idx].p;
+		if (!preferred->num_objects) {
+			error(_("cannot select preferred pack %s with no objects"),
+			      preferred->pack_name);
+			result = 1;
+			goto cleanup;
+		}
+	}
+
 	ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr,
 					 ctx.preferred_pack_idx);
 
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 7609f1ea64..1f0a2ae852 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -277,6 +277,23 @@ test_expect_success 'midx picks objects from preferred pack' '
 	)
 '
 
+test_expect_success 'preferred packs must be non-empty' '
+	test_when_finished rm -rf preferred.git &&
+	git init preferred.git &&
+	(
+		cd preferred.git &&
+
+		test_commit base &&
+		git repack -ad &&
+
+		empty="$(git pack-objects $objdir/pack/pack </dev/null)" &&
+
+		test_must_fail git multi-pack-index write \
+			--preferred-pack=pack-$empty.pack 2>err &&
+		grep "with no objects" err
+	)
+'
+
 test_expect_success 'verify multi-pack-index success' '
 	git multi-pack-index verify --object-dir=$objdir
 '
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 07/25] midx: infer preferred pack when not given one
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
                     ` (5 preceding siblings ...)
  2021-07-27 21:19   ` [PATCH v3 06/25] midx: reject empty `--preferred-pack`'s Taylor Blau
@ 2021-07-27 21:19   ` Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 08/25] midx: close linked MIDXs, avoid leaking memory Taylor Blau
                     ` (18 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

In 9218c6a40c (midx: allow marking a pack as preferred, 2021-03-30), the
multi-pack index code learned how to select a pack which all duplicate
objects are selected from. That is, if an object appears in multiple
packs, select the copy in the preferred pack before breaking ties
according to the other rules like pack mtime and readdir() order.

Not specifying a preferred pack can cause serious problems with
multi-pack reachability bitmaps, because these bitmaps rely on having at
least one pack from which all duplicates are selected. Not having such a
pack causes problems with the code in pack-objects to reuse packs
verbatim (e.g., that code assumes that a delta object in a chunk of pack
sent verbatim will have its base object sent from the same pack).

So why does not marking a pack preferred cause problems here? The reason
is roughly as follows:

  - Ties are broken (when handling duplicate objects) by sorting
    according to midx_oid_compare(), which sorts objects by OID,
    preferred-ness, pack mtime, and finally pack ID (more on that
    later).

  - The psuedo pack-order (described in
    Documentation/technical/pack-format.txt under the section
    "multi-pack-index reverse indexes") is computed by
    midx_pack_order(), and sorts by pack ID and pack offset, with
    preferred packs sorting first.

  - But! Pack IDs come from incrementing the pack count in
    add_pack_to_midx(), which is a callback to
    for_each_file_in_pack_dir(), meaning that pack IDs are assigned in
    readdir() order.

When specifying a preferred pack, all of that works fine, because
duplicate objects are correctly resolved in favor of the copy in the
preferred pack, and the preferred pack sorts first in the object order.

"Sorting first" is critical, because the bitmap code relies on finding
out which pack holds the first object in the MIDX's pseudo pack-order to
determine which pack is preferred.

But if we didn't specify a preferred pack, and the pack which comes
first in readdir() order does not also have the lowest timestamp, then
it's possible that that pack (the one that sorts first in pseudo-pack
order, which the bitmap code will treat as the preferred one) did *not*
have all duplicate objects resolved in its favor, resulting in breakage.

The fix is simple: pick a (semi-arbitrary, non-empty) preferred pack
when none was specified. This forces that pack to have duplicates
resolved in its favor, and (critically) to sort first in pseudo-pack
order.  Unfortunately, testing this behavior portably isn't possible,
since it depends on readdir() order which isn't guaranteed by POSIX.

(Note that multi-pack reachability bitmaps have yet to be implemented;
so in that sense this patch is fixing a bug which does not yet exist.
But by having this patch beforehand, we can prevent the bug from ever
materializing.)

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 44 insertions(+), 6 deletions(-)

diff --git a/midx.c b/midx.c
index 092dbf45b6..8956492b9c 100644
--- a/midx.c
+++ b/midx.c
@@ -969,15 +969,57 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
-	ctx.preferred_pack_idx = -1;
 	if (preferred_pack_name) {
+		int found = 0;
 		for (i = 0; i < ctx.nr; i++) {
 			if (!cmp_idx_or_pack_name(preferred_pack_name,
 						  ctx.info[i].pack_name)) {
 				ctx.preferred_pack_idx = i;
+				found = 1;
 				break;
 			}
 		}
+
+		if (!found)
+			warning(_("unknown preferred pack: '%s'"),
+				preferred_pack_name);
+	} else if (ctx.nr && (flags & MIDX_WRITE_REV_INDEX)) {
+		struct packed_git *oldest = ctx.info[ctx.preferred_pack_idx].p;
+		ctx.preferred_pack_idx = 0;
+
+		if (packs_to_drop && packs_to_drop->nr)
+			BUG("cannot write a MIDX bitmap during expiration");
+
+		/*
+		 * set a preferred pack when writing a bitmap to ensure that
+		 * the pack from which the first object is selected in pseudo
+		 * pack-order has all of its objects selected from that pack
+		 * (and not another pack containing a duplicate)
+		 */
+		for (i = 1; i < ctx.nr; i++) {
+			struct packed_git *p = ctx.info[i].p;
+
+			if (!oldest->num_objects || p->mtime < oldest->mtime) {
+				oldest = p;
+				ctx.preferred_pack_idx = i;
+			}
+		}
+
+		if (!oldest->num_objects) {
+			/*
+			 * If all packs are empty; unset the preferred index.
+			 * This is acceptable since there will be no duplicate
+			 * objects to resolve, so the preferred value doesn't
+			 * matter.
+			 */
+			ctx.preferred_pack_idx = -1;
+		}
+	} else {
+		/*
+		 * otherwise don't mark any pack as preferred to avoid
+		 * interfering with expiration logic below
+		 */
+		ctx.preferred_pack_idx = -1;
 	}
 
 	if (ctx.preferred_pack_idx > -1) {
@@ -1058,11 +1100,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 						      ctx.info, ctx.nr,
 						      sizeof(*ctx.info),
 						      idx_or_pack_name_cmp);
-
-		if (!preferred)
-			warning(_("unknown preferred pack: '%s'"),
-				preferred_pack_name);
-		else {
+		if (preferred) {
 			uint32_t perm = ctx.pack_perm[preferred->orig_pack_int_id];
 			if (perm == PACK_EXPIRED)
 				warning(_("preferred pack '%s' is expired"),
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 08/25] midx: close linked MIDXs, avoid leaking memory
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
                     ` (6 preceding siblings ...)
  2021-07-27 21:19   ` [PATCH v3 07/25] midx: infer preferred pack when not given one Taylor Blau
@ 2021-07-27 21:19   ` Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 09/25] midx: avoid opening multiple MIDXs when writing Taylor Blau
                     ` (17 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

When a repository has at least one alternate, the MIDX belonging to each
alternate is accessed through the `next` pointer on the main object
store's copy of the MIDX. close_midx() didn't bother to close any
of the linked MIDXs. It likewise didn't free the memory pointed to by
`m`, leaving uninitialized bytes with live pointers to them left around
in the heap.

Clean this up by closing linked MIDXs, and freeing up the memory pointed
to by each of them. When callers call close_midx(), then they can
discard the entire linked list of MIDXs and set their pointer to the
head of that list to NULL.

This isn't strictly required for the upcoming patches, but it makes it
much more difficult (though still possible, for e.g., by calling
`close_midx(m->next)` which leaves `m->next` pointing at uninitialized
bytes) to have pointers to uninitialized memory.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/midx.c b/midx.c
index 8956492b9c..18e1949613 100644
--- a/midx.c
+++ b/midx.c
@@ -195,6 +195,8 @@ void close_midx(struct multi_pack_index *m)
 	if (!m)
 		return;
 
+	close_midx(m->next);
+
 	munmap((unsigned char *)m->data, m->data_len);
 
 	for (i = 0; i < m->num_packs; i++) {
@@ -203,6 +205,7 @@ void close_midx(struct multi_pack_index *m)
 	}
 	FREE_AND_NULL(m->packs);
 	FREE_AND_NULL(m->pack_names);
+	free(m);
 }
 
 int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id)
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 09/25] midx: avoid opening multiple MIDXs when writing
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
                     ` (7 preceding siblings ...)
  2021-07-27 21:19   ` [PATCH v3 08/25] midx: close linked MIDXs, avoid leaking memory Taylor Blau
@ 2021-07-27 21:19   ` Taylor Blau
  2021-07-29 19:30     ` Taylor Blau
  2021-08-12 20:15     ` Jeff King
  2021-07-27 21:19   ` [PATCH v3 10/25] pack-bitmap.c: introduce 'bitmap_num_objects()' Taylor Blau
                     ` (16 subsequent siblings)
  25 siblings, 2 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Opening multiple instance of the same MIDX can lead to problems like two
separate packed_git structures which represent the same pack being added
to the repository's object store.

The above scenario can happen because prepare_midx_pack() checks if
`m->packs[pack_int_id]` is NULL in order to determine if a pack has been
opened and installed in the repository before. But a caller can
construct two copies of the same MIDX by calling get_multi_pack_index()
and load_multi_pack_index() since the former manipulates the
object store directly but the latter is a lower-level routine which
allocates a new MIDX for each call.

So if prepare_midx_pack() is called on multiple MIDXs with the same
pack_int_id, then that pack will be installed twice in the object
store's packed_git pointer.

This can lead to problems in, for e.g., the pack-bitmap code, which does
something like the following (in pack-bitmap.c:open_pack_bitmap()):

    struct bitmap_index *bitmap_git = ...;
    for (p = get_all_packs(r); p; p = p->next) {
      if (open_pack_bitmap_1(bitmap_git, p) == 0)
        ret = 0;
    }

which is a problem if two copies of the same pack exist in the
packed_git list because pack-bitmap.c:open_pack_bitmap_1() contains a
conditional like the following:

    if (bitmap_git->pack || bitmap_git->midx) {
      /* ignore extra bitmap file; we can only handle one */
      warning("ignoring extra bitmap file: %s", packfile->pack_name);
      close(fd);
      return -1;
    }

Avoid this scenario by not letting write_midx_internal() open a MIDX
that isn't also pointed at by the object store. So long as this is the
case, other routines should prefer to open MIDXs with
get_multi_pack_index() or reprepare_packed_git() instead of creating
instances on their own. Because get_multi_pack_index() returns
`r->object_store->multi_pack_index` if it is non-NULL, we'll only have
one instance of a MIDX open at one time, avoiding these problems.

To encourage this, drop the `struct multi_pack_index *` parameter from
`write_midx_internal()`, and rely instead on the `object_dir` to find
(or initialize) the correct MIDX instance.

Likewise, replace the call to `close_midx()` with
`close_object_store()`, since we're about to replace the MIDX with a new
one and should invalidate the object store's memory of any MIDX that
might have existed beforehand.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/midx.c b/midx.c
index 18e1949613..d67d7f383d 100644
--- a/midx.c
+++ b/midx.c
@@ -893,7 +893,7 @@ static int midx_checksum_valid(struct multi_pack_index *m)
 	return hashfile_checksum_valid(m->data, m->data_len);
 }
 
-static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
+static int write_midx_internal(const char *object_dir,
 			       struct string_list *packs_to_drop,
 			       const char *preferred_pack_name,
 			       unsigned flags)
@@ -904,6 +904,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
+	struct multi_pack_index *cur;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -914,10 +915,14 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		die_errno(_("unable to create leading directories of %s"),
 			  midx_name);
 
-	if (m)
-		ctx.m = m;
-	else
-		ctx.m = load_multi_pack_index(object_dir, 1);
+	for (cur = get_multi_pack_index(the_repository); cur; cur = cur->next) {
+		if (!strcmp(object_dir, cur->object_dir)) {
+			ctx.m = cur;
+			break;
+		}
+	}
+	if (!ctx.m)
+		ctx.m = get_local_multi_pack_index(the_repository);
 
 	if (ctx.m && !midx_checksum_valid(ctx.m)) {
 		warning(_("ignoring existing multi-pack-index; checksum mismatch"));
@@ -1182,8 +1187,7 @@ int write_midx_file(const char *object_dir,
 		    const char *preferred_pack_name,
 		    unsigned flags)
 {
-	return write_midx_internal(object_dir, NULL, NULL, preferred_pack_name,
-				   flags);
+	return write_midx_internal(object_dir, NULL, preferred_pack_name, flags);
 }
 
 struct clear_midx_data {
@@ -1460,8 +1464,10 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 
 	free(count);
 
-	if (packs_to_drop.nr)
-		result = write_midx_internal(object_dir, m, &packs_to_drop, NULL, flags);
+	if (packs_to_drop.nr) {
+		result = write_midx_internal(object_dir, &packs_to_drop, NULL, flags);
+		m = NULL;
+	}
 
 	string_list_clear(&packs_to_drop, 0);
 	return result;
@@ -1650,7 +1656,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 		goto cleanup;
 	}
 
-	result = write_midx_internal(object_dir, m, NULL, NULL, flags);
+	result = write_midx_internal(object_dir, NULL, NULL, flags);
 	m = NULL;
 
 cleanup:
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 10/25] pack-bitmap.c: introduce 'bitmap_num_objects()'
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
                     ` (8 preceding siblings ...)
  2021-07-27 21:19   ` [PATCH v3 09/25] midx: avoid opening multiple MIDXs when writing Taylor Blau
@ 2021-07-27 21:19   ` Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 11/25] pack-bitmap.c: introduce 'nth_bitmap_object_oid()' Taylor Blau
                     ` (15 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

A subsequent patch to support reading MIDX bitmaps will be less noisy
after extracting a generic function to return how many objects are
contained in a bitmap.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 37 +++++++++++++++++++++----------------
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index a73960a55d..54dc4f7915 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -136,6 +136,11 @@ static struct ewah_bitmap *read_bitmap_1(struct bitmap_index *index)
 	return b;
 }
 
+static uint32_t bitmap_num_objects(struct bitmap_index *index)
+{
+	return index->pack->num_objects;
+}
+
 static int load_bitmap_header(struct bitmap_index *index)
 {
 	struct bitmap_disk_header *header = (void *)index->map;
@@ -154,7 +159,7 @@ static int load_bitmap_header(struct bitmap_index *index)
 	/* Parse known bitmap format options */
 	{
 		uint32_t flags = ntohs(header->options);
-		size_t cache_size = st_mult(index->pack->num_objects, sizeof(uint32_t));
+		size_t cache_size = st_mult(bitmap_num_objects(index), sizeof(uint32_t));
 		unsigned char *index_end = index->map + index->map_size - the_hash_algo->rawsz;
 
 		if ((flags & BITMAP_OPT_FULL_DAG) == 0)
@@ -399,7 +404,7 @@ static inline int bitmap_position_extended(struct bitmap_index *bitmap_git,
 
 	if (pos < kh_end(positions)) {
 		int bitmap_pos = kh_value(positions, pos);
-		return bitmap_pos + bitmap_git->pack->num_objects;
+		return bitmap_pos + bitmap_num_objects(bitmap_git);
 	}
 
 	return -1;
@@ -451,7 +456,7 @@ static int ext_index_add_object(struct bitmap_index *bitmap_git,
 		bitmap_pos = kh_value(eindex->positions, hash_pos);
 	}
 
-	return bitmap_pos + bitmap_git->pack->num_objects;
+	return bitmap_pos + bitmap_num_objects(bitmap_git);
 }
 
 struct bitmap_show_data {
@@ -668,7 +673,7 @@ static void show_extended_objects(struct bitmap_index *bitmap_git,
 	for (i = 0; i < eindex->count; ++i) {
 		struct object *obj;
 
-		if (!bitmap_get(objects, bitmap_git->pack->num_objects + i))
+		if (!bitmap_get(objects, bitmap_num_objects(bitmap_git) + i))
 			continue;
 
 		obj = eindex->objects[i];
@@ -826,7 +831,7 @@ static void filter_bitmap_exclude_type(struct bitmap_index *bitmap_git,
 	 * individually.
 	 */
 	for (i = 0; i < eindex->count; i++) {
-		uint32_t pos = i + bitmap_git->pack->num_objects;
+		uint32_t pos = i + bitmap_num_objects(bitmap_git);
 		if (eindex->objects[i]->type == type &&
 		    bitmap_get(to_filter, pos) &&
 		    !bitmap_get(tips, pos))
@@ -853,7 +858,7 @@ static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
 
 	oi.sizep = &size;
 
-	if (pos < pack->num_objects) {
+	if (pos < bitmap_num_objects(bitmap_git)) {
 		off_t ofs = pack_pos_to_offset(pack, pos);
 		if (packed_object_info(the_repository, pack, ofs, &oi) < 0) {
 			struct object_id oid;
@@ -863,7 +868,7 @@ static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
 		}
 	} else {
 		struct eindex *eindex = &bitmap_git->ext_index;
-		struct object *obj = eindex->objects[pos - pack->num_objects];
+		struct object *obj = eindex->objects[pos - bitmap_num_objects(bitmap_git)];
 		if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
 			die(_("unable to get size of %s"), oid_to_hex(&obj->oid));
 	}
@@ -905,7 +910,7 @@ static void filter_bitmap_blob_limit(struct bitmap_index *bitmap_git,
 	}
 
 	for (i = 0; i < eindex->count; i++) {
-		uint32_t pos = i + bitmap_git->pack->num_objects;
+		uint32_t pos = i + bitmap_num_objects(bitmap_git);
 		if (eindex->objects[i]->type == OBJ_BLOB &&
 		    bitmap_get(to_filter, pos) &&
 		    !bitmap_get(tips, pos) &&
@@ -1131,8 +1136,8 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 	enum object_type type;
 	unsigned long size;
 
-	if (pos >= bitmap_git->pack->num_objects)
-		return; /* not actually in the pack */
+	if (pos >= bitmap_num_objects(bitmap_git))
+		return; /* not actually in the pack or MIDX */
 
 	offset = header = pack_pos_to_offset(bitmap_git->pack, pos);
 	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
@@ -1198,6 +1203,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	struct pack_window *w_curs = NULL;
 	size_t i = 0;
 	uint32_t offset;
+	uint32_t objects_nr = bitmap_num_objects(bitmap_git);
 
 	assert(result);
 
@@ -1205,8 +1211,8 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 		i++;
 
 	/* Don't mark objects not in the packfile */
-	if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
-		i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
+	if (i > objects_nr / BITS_IN_EWORD)
+		i = objects_nr / BITS_IN_EWORD;
 
 	reuse = bitmap_word_alloc(i);
 	memset(reuse->words, 0xFF, i * sizeof(eword_t));
@@ -1290,7 +1296,7 @@ static uint32_t count_object_type(struct bitmap_index *bitmap_git,
 
 	for (i = 0; i < eindex->count; ++i) {
 		if (eindex->objects[i]->type == type &&
-			bitmap_get(objects, bitmap_git->pack->num_objects + i))
+			bitmap_get(objects, bitmap_num_objects(bitmap_git) + i))
 			count++;
 	}
 
@@ -1511,7 +1517,7 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
 	uint32_t i, num_objects;
 	uint32_t *reposition;
 
-	num_objects = bitmap_git->pack->num_objects;
+	num_objects = bitmap_num_objects(bitmap_git);
 	CALLOC_ARRAY(reposition, num_objects);
 
 	for (i = 0; i < num_objects; ++i) {
@@ -1594,7 +1600,6 @@ static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
 static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git)
 {
 	struct bitmap *result = bitmap_git->result;
-	struct packed_git *pack = bitmap_git->pack;
 	struct eindex *eindex = &bitmap_git->ext_index;
 	off_t total = 0;
 	struct object_info oi = OBJECT_INFO_INIT;
@@ -1606,7 +1611,7 @@ static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git)
 	for (i = 0; i < eindex->count; i++) {
 		struct object *obj = eindex->objects[i];
 
-		if (!bitmap_get(result, pack->num_objects + i))
+		if (!bitmap_get(result, bitmap_num_objects(bitmap_git) + i))
 			continue;
 
 		if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 11/25] pack-bitmap.c: introduce 'nth_bitmap_object_oid()'
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
                     ` (9 preceding siblings ...)
  2021-07-27 21:19   ` [PATCH v3 10/25] pack-bitmap.c: introduce 'bitmap_num_objects()' Taylor Blau
@ 2021-07-27 21:19   ` Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 12/25] pack-bitmap.c: introduce 'bitmap_is_preferred_refname()' Taylor Blau
                     ` (14 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

A subsequent patch to support reading MIDX bitmaps will be less noisy
after extracting a generic function to fetch the nth OID contained in
the bitmap.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 54dc4f7915..9d0dd1cde7 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -223,6 +223,13 @@ static inline uint8_t read_u8(const unsigned char *buffer, size_t *pos)
 
 #define MAX_XOR_OFFSET 160
 
+static int nth_bitmap_object_oid(struct bitmap_index *index,
+				 struct object_id *oid,
+				 uint32_t n)
+{
+	return nth_packed_object_id(oid, index->pack, n);
+}
+
 static int load_bitmap_entries_v1(struct bitmap_index *index)
 {
 	uint32_t i;
@@ -242,7 +249,7 @@ static int load_bitmap_entries_v1(struct bitmap_index *index)
 		xor_offset = read_u8(index->map, &index->map_pos);
 		flags = read_u8(index->map, &index->map_pos);
 
-		if (nth_packed_object_id(&oid, index->pack, commit_idx_pos) < 0)
+		if (nth_bitmap_object_oid(index, &oid, commit_idx_pos) < 0)
 			return error("corrupt ewah bitmap: commit index %u out of range",
 				     (unsigned)commit_idx_pos);
 
@@ -862,8 +869,8 @@ static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
 		off_t ofs = pack_pos_to_offset(pack, pos);
 		if (packed_object_info(the_repository, pack, ofs, &oi) < 0) {
 			struct object_id oid;
-			nth_packed_object_id(&oid, pack,
-					     pack_pos_to_index(pack, pos));
+			nth_bitmap_object_oid(bitmap_git, &oid,
+					      pack_pos_to_index(pack, pos));
 			die(_("unable to get size of %s"), oid_to_hex(&oid));
 		}
 	} else {
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 12/25] pack-bitmap.c: introduce 'bitmap_is_preferred_refname()'
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
                     ` (10 preceding siblings ...)
  2021-07-27 21:19   ` [PATCH v3 11/25] pack-bitmap.c: introduce 'nth_bitmap_object_oid()' Taylor Blau
@ 2021-07-27 21:19   ` Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 13/25] pack-bitmap.c: avoid redundant calls to try_partial_reuse Taylor Blau
                     ` (13 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

In a recent commit, pack-objects learned support for the
'pack.preferBitmapTips' configuration. This patch prepares the
multi-pack bitmap code to respect this configuration, too.

The yet-to-be implemented code will find that it is more efficient to
check whether each reference contains a prefix found in the configured
set of values rather than doing an additional traversal.

Implement a function 'bitmap_is_preferred_refname()' which will perform
that check. Its caller will be added in a subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 16 ++++++++++++++++
 pack-bitmap.h |  1 +
 2 files changed, 17 insertions(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 9d0dd1cde7..6b12c96e32 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1652,3 +1652,19 @@ const struct string_list *bitmap_preferred_tips(struct repository *r)
 {
 	return repo_config_get_value_multi(r, "pack.preferbitmaptips");
 }
+
+int bitmap_is_preferred_refname(struct repository *r, const char *refname)
+{
+	const struct string_list *preferred_tips = bitmap_preferred_tips(r);
+	struct string_list_item *item;
+
+	if (!preferred_tips)
+		return 0;
+
+	for_each_string_list_item(item, preferred_tips) {
+		if (starts_with(refname, item->string))
+			return 1;
+	}
+
+	return 0;
+}
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 020cd8d868..52ea10de51 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -94,5 +94,6 @@ void bitmap_writer_finish(struct pack_idx_entry **index,
 			  uint16_t options);
 
 const struct string_list *bitmap_preferred_tips(struct repository *r);
+int bitmap_is_preferred_refname(struct repository *r, const char *refname);
 
 #endif
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 13/25] pack-bitmap.c: avoid redundant calls to try_partial_reuse
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
                     ` (11 preceding siblings ...)
  2021-07-27 21:19   ` [PATCH v3 12/25] pack-bitmap.c: introduce 'bitmap_is_preferred_refname()' Taylor Blau
@ 2021-07-27 21:19   ` Taylor Blau
  2021-07-27 21:19   ` [PATCH v3 14/25] pack-bitmap: read multi-pack bitmaps Taylor Blau
                     ` (12 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

try_partial_reuse() is used to mark any bits in the beginning of a
bitmap whose objects can be reused verbatim from the pack they came
from.

Currently this function returns void, and signals nothing to the caller
when bits could not be reused. But multi-pack bitmaps would benefit from
having such a signal, because they may try to pass objects which are in
bounds, but from a pack other than the preferred one.

Any extra calls are noops because of a conditional in
reuse_partial_packfile_from_bitmap(), but those loop iterations can be
avoided by letting try_partial_reuse() indicate when it can't accept any
more bits for reuse, and then listening to that signal.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 40 +++++++++++++++++++++++++++++-----------
 1 file changed, 29 insertions(+), 11 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 6b12c96e32..1442f0c8f2 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1134,22 +1134,26 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 	return NULL;
 }
 
-static void try_partial_reuse(struct bitmap_index *bitmap_git,
-			      size_t pos,
-			      struct bitmap *reuse,
-			      struct pack_window **w_curs)
+/*
+ * -1 means "stop trying further objects"; 0 means we may or may not have
+ * reused, but you can keep feeding bits.
+ */
+static int try_partial_reuse(struct bitmap_index *bitmap_git,
+			     size_t pos,
+			     struct bitmap *reuse,
+			     struct pack_window **w_curs)
 {
 	off_t offset, header;
 	enum object_type type;
 	unsigned long size;
 
 	if (pos >= bitmap_num_objects(bitmap_git))
-		return; /* not actually in the pack or MIDX */
+		return -1; /* not actually in the pack or MIDX */
 
 	offset = header = pack_pos_to_offset(bitmap_git->pack, pos);
 	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
 	if (type < 0)
-		return; /* broken packfile, punt */
+		return -1; /* broken packfile, punt */
 
 	if (type == OBJ_REF_DELTA || type == OBJ_OFS_DELTA) {
 		off_t base_offset;
@@ -1166,9 +1170,9 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 		base_offset = get_delta_base(bitmap_git->pack, w_curs,
 					     &offset, type, header);
 		if (!base_offset)
-			return;
+			return 0;
 		if (offset_to_pack_pos(bitmap_git->pack, base_offset, &base_pos) < 0)
-			return;
+			return 0;
 
 		/*
 		 * We assume delta dependencies always point backwards. This
@@ -1180,7 +1184,7 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 		 * odd parameters.
 		 */
 		if (base_pos >= pos)
-			return;
+			return 0;
 
 		/*
 		 * And finally, if we're not sending the base as part of our
@@ -1191,13 +1195,14 @@ static void try_partial_reuse(struct bitmap_index *bitmap_git,
 		 * object_entry code path handle it.
 		 */
 		if (!bitmap_get(reuse, base_pos))
-			return;
+			return 0;
 	}
 
 	/*
 	 * If we got here, then the object is OK to reuse. Mark it.
 	 */
 	bitmap_set(reuse, pos);
+	return 0;
 }
 
 int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
@@ -1233,10 +1238,23 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
-			try_partial_reuse(bitmap_git, pos + offset, reuse, &w_curs);
+			if (try_partial_reuse(bitmap_git, pos + offset, reuse,
+					      &w_curs) < 0) {
+				/*
+				 * try_partial_reuse indicated we couldn't reuse
+				 * any bits, so there is no point in trying more
+				 * bits in the current word, or any other words
+				 * in result.
+				 *
+				 * Jump out of both loops to avoid future
+				 * unnecessary calls to try_partial_reuse.
+				 */
+				goto done;
+			}
 		}
 	}
 
+done:
 	unuse_pack(&w_curs);
 
 	*entries = bitmap_popcount(reuse);
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 14/25] pack-bitmap: read multi-pack bitmaps
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
                     ` (12 preceding siblings ...)
  2021-07-27 21:19   ` [PATCH v3 13/25] pack-bitmap.c: avoid redundant calls to try_partial_reuse Taylor Blau
@ 2021-07-27 21:19   ` Taylor Blau
  2021-07-27 21:20   ` [PATCH v3 15/25] pack-bitmap: write " Taylor Blau
                     ` (11 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:19 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

This prepares the code in pack-bitmap to interpret the new multi-pack
bitmaps described in Documentation/technical/bitmap-format.txt, which
mostly involves converting bit positions to accommodate looking them up
in a MIDX.

Note that there are currently no writers who write multi-pack bitmaps,
and that this will be implemented in the subsequent commit. Note also
that get_midx_checksum() and get_midx_filename() are made non-static so
they can be called from pack-bitmap.c.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c |   5 +
 midx.c                 |   4 +-
 midx.h                 |   2 +
 pack-bitmap-write.c    |   2 +-
 pack-bitmap.c          | 357 ++++++++++++++++++++++++++++++++++++-----
 pack-bitmap.h          |   6 +
 packfile.c             |   2 +-
 7 files changed, 336 insertions(+), 42 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 8a523624a1..e11d3ac2e5 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1124,6 +1124,11 @@ static void write_reused_pack(struct hashfile *f)
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
+			/*
+			 * Can use bit positions directly, even for MIDX
+			 * bitmaps. See comment in try_partial_reuse()
+			 * for why.
+			 */
 			write_reused_pack_one(pos + offset, f, &w_curs);
 			display_progress(progress_state, ++written);
 		}
diff --git a/midx.c b/midx.c
index d67d7f383d..db21727c62 100644
--- a/midx.c
+++ b/midx.c
@@ -48,12 +48,12 @@ static uint8_t oid_version(void)
 	}
 }
 
-static const unsigned char *get_midx_checksum(struct multi_pack_index *m)
+const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
 }
 
-static char *get_midx_filename(const char *object_dir)
+char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
diff --git a/midx.h b/midx.h
index 8684cf0fef..1172df1a71 100644
--- a/midx.h
+++ b/midx.h
@@ -42,6 +42,8 @@ struct multi_pack_index {
 #define MIDX_PROGRESS     (1 << 0)
 #define MIDX_WRITE_REV_INDEX (1 << 1)
 
+const unsigned char *get_midx_checksum(struct multi_pack_index *m);
+char *get_midx_filename(const char *object_dir);
 char *get_midx_rev_filename(struct multi_pack_index *m);
 
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index 142fd0adb8..9c55c1531e 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -48,7 +48,7 @@ void bitmap_writer_show_progress(int show)
 }
 
 /**
- * Build the initial type index for the packfile
+ * Build the initial type index for the packfile or multi-pack-index
  */
 void bitmap_writer_build_type_index(struct packing_data *to_pack,
 				    struct pack_idx_entry **index,
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 1442f0c8f2..f599646e19 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -13,6 +13,7 @@
 #include "repository.h"
 #include "object-store.h"
 #include "list-objects-filter-options.h"
+#include "midx.h"
 #include "config.h"
 
 /*
@@ -35,8 +36,15 @@ struct stored_bitmap {
  * the active bitmap index is the largest one.
  */
 struct bitmap_index {
-	/* Packfile to which this bitmap index belongs to */
+	/*
+	 * The pack or multi-pack index (MIDX) that this bitmap index belongs
+	 * to.
+	 *
+	 * Exactly one of these must be non-NULL; this specifies the object
+	 * order used to interpret this bitmap.
+	 */
 	struct packed_git *pack;
+	struct multi_pack_index *midx;
 
 	/*
 	 * Mark the first `reuse_objects` in the packfile as reused:
@@ -71,6 +79,9 @@ struct bitmap_index {
 	/* If not NULL, this is a name-hash cache pointing into map. */
 	uint32_t *hashes;
 
+	/* The checksum of the packfile or MIDX; points into map. */
+	const unsigned char *checksum;
+
 	/*
 	 * Extended index.
 	 *
@@ -138,6 +149,8 @@ static struct ewah_bitmap *read_bitmap_1(struct bitmap_index *index)
 
 static uint32_t bitmap_num_objects(struct bitmap_index *index)
 {
+	if (index->midx)
+		return index->midx->num_objects;
 	return index->pack->num_objects;
 }
 
@@ -175,6 +188,7 @@ static int load_bitmap_header(struct bitmap_index *index)
 	}
 
 	index->entry_count = ntohl(header->entry_count);
+	index->checksum = header->checksum;
 	index->map_pos += header_size;
 	return 0;
 }
@@ -227,6 +241,8 @@ static int nth_bitmap_object_oid(struct bitmap_index *index,
 				 struct object_id *oid,
 				 uint32_t n)
 {
+	if (index->midx)
+		return nth_midxed_object_oid(oid, index->midx, n) ? 0 : -1;
 	return nth_packed_object_id(oid, index->pack, n);
 }
 
@@ -274,7 +290,14 @@ static int load_bitmap_entries_v1(struct bitmap_index *index)
 	return 0;
 }
 
-static char *pack_bitmap_filename(struct packed_git *p)
+char *midx_bitmap_filename(struct multi_pack_index *midx)
+{
+	return xstrfmt("%s-%s.bitmap",
+		       get_midx_filename(midx->object_dir),
+		       hash_to_hex(get_midx_checksum(midx)));
+}
+
+char *pack_bitmap_filename(struct packed_git *p)
 {
 	size_t len;
 
@@ -283,6 +306,57 @@ static char *pack_bitmap_filename(struct packed_git *p)
 	return xstrfmt("%.*s.bitmap", (int)len, p->pack_name);
 }
 
+static int open_midx_bitmap_1(struct bitmap_index *bitmap_git,
+			      struct multi_pack_index *midx)
+{
+	struct stat st;
+	char *idx_name = midx_bitmap_filename(midx);
+	int fd = git_open(idx_name);
+
+	free(idx_name);
+
+	if (fd < 0)
+		return -1;
+
+	if (fstat(fd, &st)) {
+		close(fd);
+		return -1;
+	}
+
+	if (bitmap_git->pack || bitmap_git->midx) {
+		/* ignore extra bitmap file; we can only handle one */
+		warning("ignoring extra bitmap file: %s",
+			get_midx_filename(midx->object_dir));
+		close(fd);
+		return -1;
+	}
+
+	bitmap_git->midx = midx;
+	bitmap_git->map_size = xsize_t(st.st_size);
+	bitmap_git->map_pos = 0;
+	bitmap_git->map = xmmap(NULL, bitmap_git->map_size, PROT_READ,
+				MAP_PRIVATE, fd, 0);
+	close(fd);
+
+	if (load_bitmap_header(bitmap_git) < 0)
+		goto cleanup;
+
+	if (!hasheq(get_midx_checksum(bitmap_git->midx), bitmap_git->checksum))
+		goto cleanup;
+
+	if (load_midx_revindex(bitmap_git->midx) < 0) {
+		warning(_("multi-pack bitmap is missing required reverse index"));
+		goto cleanup;
+	}
+	return 0;
+
+cleanup:
+	munmap(bitmap_git->map, bitmap_git->map_size);
+	bitmap_git->map_size = 0;
+	bitmap_git->map = NULL;
+	return -1;
+}
+
 static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git *packfile)
 {
 	int fd;
@@ -304,7 +378,8 @@ static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git
 		return -1;
 	}
 
-	if (bitmap_git->pack) {
+	if (bitmap_git->pack || bitmap_git->midx) {
+		/* ignore extra bitmap file; we can only handle one */
 		warning("ignoring extra bitmap file: %s", packfile->pack_name);
 		close(fd);
 		return -1;
@@ -326,13 +401,39 @@ static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git
 	return 0;
 }
 
-static int load_pack_bitmap(struct bitmap_index *bitmap_git)
+static int load_reverse_index(struct bitmap_index *bitmap_git)
+{
+	if (bitmap_is_midx(bitmap_git)) {
+		uint32_t i;
+		int ret;
+
+		/*
+		 * The multi-pack-index's .rev file is already loaded via
+		 * open_pack_bitmap_1().
+		 *
+		 * But we still need to open the individual pack .rev files,
+		 * since we will need to make use of them in pack-objects.
+		 */
+		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
+			if (prepare_midx_pack(the_repository, bitmap_git->midx, i))
+				die(_("load_reverse_index: could not open pack"));
+			ret = load_pack_revindex(bitmap_git->midx->packs[i]);
+			if (ret)
+				return ret;
+		}
+		return 0;
+	}
+	return load_pack_revindex(bitmap_git->pack);
+}
+
+static int load_bitmap(struct bitmap_index *bitmap_git)
 {
 	assert(bitmap_git->map);
 
 	bitmap_git->bitmaps = kh_init_oid_map();
 	bitmap_git->ext_index.positions = kh_init_oid_pos();
-	if (load_pack_revindex(bitmap_git->pack))
+
+	if (load_reverse_index(bitmap_git))
 		goto failed;
 
 	if (!(bitmap_git->commits = read_bitmap_1(bitmap_git)) ||
@@ -376,11 +477,47 @@ static int open_pack_bitmap(struct repository *r,
 	return ret;
 }
 
+static int open_midx_bitmap(struct repository *r,
+			    struct bitmap_index *bitmap_git)
+{
+	struct multi_pack_index *midx;
+
+	assert(!bitmap_git->map);
+
+	for (midx = get_multi_pack_index(r); midx; midx = midx->next) {
+		if (!open_midx_bitmap_1(bitmap_git, midx))
+			return 0;
+	}
+	return -1;
+}
+
+static int open_bitmap(struct repository *r,
+		       struct bitmap_index *bitmap_git)
+{
+	assert(!bitmap_git->map);
+
+	if (!open_midx_bitmap(r, bitmap_git))
+		return 0;
+	return open_pack_bitmap(r, bitmap_git);
+}
+
 struct bitmap_index *prepare_bitmap_git(struct repository *r)
 {
 	struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));
 
-	if (!open_pack_bitmap(r, bitmap_git) && !load_pack_bitmap(bitmap_git))
+	if (!open_bitmap(r, bitmap_git) && !load_bitmap(bitmap_git))
+		return bitmap_git;
+
+	free_bitmap_index(bitmap_git);
+	return NULL;
+}
+
+struct bitmap_index *prepare_midx_bitmap_git(struct repository *r,
+					     struct multi_pack_index *midx)
+{
+	struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));
+
+	if (!open_midx_bitmap_1(bitmap_git, midx) && !load_bitmap(bitmap_git))
 		return bitmap_git;
 
 	free_bitmap_index(bitmap_git);
@@ -430,10 +567,26 @@ static inline int bitmap_position_packfile(struct bitmap_index *bitmap_git,
 	return pos;
 }
 
+static int bitmap_position_midx(struct bitmap_index *bitmap_git,
+				const struct object_id *oid)
+{
+	uint32_t want, got;
+	if (!bsearch_midx(oid, bitmap_git->midx, &want))
+		return -1;
+
+	if (midx_to_pack_pos(bitmap_git->midx, want, &got) < 0)
+		return -1;
+	return got;
+}
+
 static int bitmap_position(struct bitmap_index *bitmap_git,
 			   const struct object_id *oid)
 {
-	int pos = bitmap_position_packfile(bitmap_git, oid);
+	int pos;
+	if (bitmap_is_midx(bitmap_git))
+		pos = bitmap_position_midx(bitmap_git, oid);
+	else
+		pos = bitmap_position_packfile(bitmap_git, oid);
 	return (pos >= 0) ? pos : bitmap_position_extended(bitmap_git, oid);
 }
 
@@ -744,6 +897,7 @@ static void show_objects_for_type(
 			continue;
 
 		for (offset = 0; offset < BITS_IN_EWORD; ++offset) {
+			struct packed_git *pack;
 			struct object_id oid;
 			uint32_t hash = 0, index_pos;
 			off_t ofs;
@@ -753,14 +907,28 @@ static void show_objects_for_type(
 
 			offset += ewah_bit_ctz64(word >> offset);
 
-			index_pos = pack_pos_to_index(bitmap_git->pack, pos + offset);
-			ofs = pack_pos_to_offset(bitmap_git->pack, pos + offset);
-			nth_packed_object_id(&oid, bitmap_git->pack, index_pos);
+			if (bitmap_is_midx(bitmap_git)) {
+				struct multi_pack_index *m = bitmap_git->midx;
+				uint32_t pack_id;
+
+				index_pos = pack_pos_to_midx(m, pos + offset);
+				ofs = nth_midxed_offset(m, index_pos);
+				nth_midxed_object_oid(&oid, m, index_pos);
+
+				pack_id = nth_midxed_pack_int_id(m, index_pos);
+				pack = bitmap_git->midx->packs[pack_id];
+			} else {
+				index_pos = pack_pos_to_index(bitmap_git->pack, pos + offset);
+				ofs = pack_pos_to_offset(bitmap_git->pack, pos + offset);
+				nth_bitmap_object_oid(bitmap_git, &oid, index_pos);
+
+				pack = bitmap_git->pack;
+			}
 
 			if (bitmap_git->hashes)
 				hash = get_be32(bitmap_git->hashes + index_pos);
 
-			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, ofs);
+			show_reach(&oid, object_type, 0, hash, pack, ofs);
 		}
 	}
 }
@@ -772,8 +940,13 @@ static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
 		struct object *object = roots->item;
 		roots = roots->next;
 
-		if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
-			return 1;
+		if (bitmap_is_midx(bitmap_git)) {
+			if (bsearch_midx(&object->oid, bitmap_git->midx, NULL))
+				return 1;
+		} else {
+			if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
+				return 1;
+		}
 	}
 
 	return 0;
@@ -859,14 +1032,26 @@ static void filter_bitmap_blob_none(struct bitmap_index *bitmap_git,
 static unsigned long get_size_by_pos(struct bitmap_index *bitmap_git,
 				     uint32_t pos)
 {
-	struct packed_git *pack = bitmap_git->pack;
 	unsigned long size;
 	struct object_info oi = OBJECT_INFO_INIT;
 
 	oi.sizep = &size;
 
 	if (pos < bitmap_num_objects(bitmap_git)) {
-		off_t ofs = pack_pos_to_offset(pack, pos);
+		struct packed_git *pack;
+		off_t ofs;
+
+		if (bitmap_is_midx(bitmap_git)) {
+			uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, pos);
+			uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
+
+			pack = bitmap_git->midx->packs[pack_id];
+			ofs = nth_midxed_offset(bitmap_git->midx, midx_pos);
+		} else {
+			pack = bitmap_git->pack;
+			ofs = pack_pos_to_offset(pack, pos);
+		}
+
 		if (packed_object_info(the_repository, pack, ofs, &oi) < 0) {
 			struct object_id oid;
 			nth_bitmap_object_oid(bitmap_git, &oid,
@@ -1047,7 +1232,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 	/* try to open a bitmapped pack, but don't parse it yet
 	 * because we may not need to use it */
 	CALLOC_ARRAY(bitmap_git, 1);
-	if (open_pack_bitmap(revs->repo, bitmap_git) < 0)
+	if (open_bitmap(revs->repo, bitmap_git) < 0)
 		goto cleanup;
 
 	for (i = 0; i < revs->pending.nr; ++i) {
@@ -1091,7 +1276,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 	 * from disk. this is the point of no return; after this the rev_list
 	 * becomes invalidated and we must perform the revwalk through bitmaps
 	 */
-	if (load_pack_bitmap(bitmap_git) < 0)
+	if (load_bitmap(bitmap_git) < 0)
 		goto cleanup;
 
 	object_array_clear(&revs->pending);
@@ -1139,19 +1324,43 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
  * reused, but you can keep feeding bits.
  */
 static int try_partial_reuse(struct bitmap_index *bitmap_git,
+			     struct packed_git *pack,
 			     size_t pos,
 			     struct bitmap *reuse,
 			     struct pack_window **w_curs)
 {
-	off_t offset, header;
+	off_t offset, delta_obj_offset;
 	enum object_type type;
 	unsigned long size;
 
-	if (pos >= bitmap_num_objects(bitmap_git))
-		return -1; /* not actually in the pack or MIDX */
+	/*
+	 * try_partial_reuse() is called either on (a) objects in the
+	 * bitmapped pack (in the case of a single-pack bitmap) or (b)
+	 * objects in the preferred pack of a multi-pack bitmap.
+	 * Importantly, the latter can pretend as if only a single pack
+	 * exists because:
+	 *
+	 *   - The first pack->num_objects bits of a MIDX bitmap are
+	 *     reserved for the preferred pack, and
+	 *
+	 *   - Ties due to duplicate objects are always resolved in
+	 *     favor of the preferred pack.
+	 *
+	 * Therefore we do not need to ever ask the MIDX for its copy of
+	 * an object by OID, since it will always select it from the
+	 * preferred pack. Likewise, the selected copy of the base
+	 * object for any deltas will reside in the same pack.
+	 *
+	 * This means that we can reuse pos when looking up the bit in
+	 * the reuse bitmap, too, since bits corresponding to the
+	 * preferred pack precede all bits from other packs.
+	 */
 
-	offset = header = pack_pos_to_offset(bitmap_git->pack, pos);
-	type = unpack_object_header(bitmap_git->pack, w_curs, &offset, &size);
+	if (pos >= pack->num_objects)
+		return -1; /* not actually in the pack or MIDX preferred pack */
+
+	offset = delta_obj_offset = pack_pos_to_offset(pack, pos);
+	type = unpack_object_header(pack, w_curs, &offset, &size);
 	if (type < 0)
 		return -1; /* broken packfile, punt */
 
@@ -1167,11 +1376,11 @@ static int try_partial_reuse(struct bitmap_index *bitmap_git,
 		 * and the normal slow path will complain about it in
 		 * more detail.
 		 */
-		base_offset = get_delta_base(bitmap_git->pack, w_curs,
-					     &offset, type, header);
+		base_offset = get_delta_base(pack, w_curs, &offset, type,
+					     delta_obj_offset);
 		if (!base_offset)
 			return 0;
-		if (offset_to_pack_pos(bitmap_git->pack, base_offset, &base_pos) < 0)
+		if (offset_to_pack_pos(pack, base_offset, &base_pos) < 0)
 			return 0;
 
 		/*
@@ -1205,24 +1414,48 @@ static int try_partial_reuse(struct bitmap_index *bitmap_git,
 	return 0;
 }
 
+static uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
+{
+	struct multi_pack_index *m = bitmap_git->midx;
+	if (!m)
+		BUG("midx_preferred_pack: requires non-empty MIDX");
+	return nth_midxed_pack_int_id(m, pack_pos_to_midx(bitmap_git->midx, 0));
+}
+
 int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				       struct packed_git **packfile_out,
 				       uint32_t *entries,
 				       struct bitmap **reuse_out)
 {
+	struct packed_git *pack;
 	struct bitmap *result = bitmap_git->result;
 	struct bitmap *reuse;
 	struct pack_window *w_curs = NULL;
 	size_t i = 0;
 	uint32_t offset;
-	uint32_t objects_nr = bitmap_num_objects(bitmap_git);
+	uint32_t objects_nr;
 
 	assert(result);
 
+	load_reverse_index(bitmap_git);
+
+	if (bitmap_is_midx(bitmap_git))
+		pack = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
+	else
+		pack = bitmap_git->pack;
+	objects_nr = pack->num_objects;
+
 	while (i < result->word_alloc && result->words[i] == (eword_t)~0)
 		i++;
 
-	/* Don't mark objects not in the packfile */
+	/*
+	 * Don't mark objects not in the packfile or preferred pack. This bitmap
+	 * marks objects eligible for reuse, but the pack-reuse code only
+	 * understands how to reuse a single pack. Since the preferred pack is
+	 * guaranteed to have all bases for its deltas (in a multi-pack bitmap),
+	 * we use it instead of another pack. In single-pack bitmaps, the choice
+	 * is made for us.
+	 */
 	if (i > objects_nr / BITS_IN_EWORD)
 		i = objects_nr / BITS_IN_EWORD;
 
@@ -1238,8 +1471,8 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
-			if (try_partial_reuse(bitmap_git, pos + offset, reuse,
-					      &w_curs) < 0) {
+			if (try_partial_reuse(bitmap_git, pack, pos + offset,
+					      reuse, &w_curs) < 0) {
 				/*
 				 * try_partial_reuse indicated we couldn't reuse
 				 * any bits, so there is no point in trying more
@@ -1268,7 +1501,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 	 * need to be handled separately.
 	 */
 	bitmap_and_not(result, reuse);
-	*packfile_out = bitmap_git->pack;
+	*packfile_out = pack;
 	*reuse_out = reuse;
 	return 0;
 }
@@ -1542,6 +1775,12 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
 	uint32_t i, num_objects;
 	uint32_t *reposition;
 
+	if (!bitmap_is_midx(bitmap_git))
+		load_reverse_index(bitmap_git);
+	else if (load_midx_revindex(bitmap_git->midx) < 0)
+		BUG("rebuild_existing_bitmaps: missing required rev-cache "
+		    "extension");
+
 	num_objects = bitmap_num_objects(bitmap_git);
 	CALLOC_ARRAY(reposition, num_objects);
 
@@ -1549,8 +1788,13 @@ uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
 		struct object_id oid;
 		struct object_entry *oe;
 
-		nth_packed_object_id(&oid, bitmap_git->pack,
-				     pack_pos_to_index(bitmap_git->pack, i));
+		if (bitmap_is_midx(bitmap_git))
+			nth_midxed_object_oid(&oid,
+					      bitmap_git->midx,
+					      pack_pos_to_midx(bitmap_git->midx, i));
+		else
+			nth_packed_object_id(&oid, bitmap_git->pack,
+					     pack_pos_to_index(bitmap_git->pack, i));
 		oe = packlist_find(mapping, &oid);
 
 		if (oe)
@@ -1576,6 +1820,19 @@ void free_bitmap_index(struct bitmap_index *b)
 	free(b->ext_index.hashes);
 	bitmap_free(b->result);
 	bitmap_free(b->haves);
+	if (bitmap_is_midx(b)) {
+		/*
+		 * Multi-pack bitmaps need to have resources associated with
+		 * their on-disk reverse indexes unmapped so that stale .rev and
+		 * .bitmap files can be removed.
+		 *
+		 * Unlike pack-based bitmaps, multi-pack bitmaps can be read and
+		 * written in the same 'git multi-pack-index write --bitmap'
+		 * process. Close resources so they can be removed safely on
+		 * platforms like Windows.
+		 */
+		close_midx_revindex(b->midx);
+	}
 	free(b);
 }
 
@@ -1590,7 +1847,6 @@ static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
 				     enum object_type object_type)
 {
 	struct bitmap *result = bitmap_git->result;
-	struct packed_git *pack = bitmap_git->pack;
 	off_t total = 0;
 	struct ewah_iterator it;
 	eword_t filter;
@@ -1607,15 +1863,35 @@ static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
 			continue;
 
 		for (offset = 0; offset < BITS_IN_EWORD; offset++) {
-			size_t pos;
-
 			if ((word >> offset) == 0)
 				break;
 
 			offset += ewah_bit_ctz64(word >> offset);
-			pos = base + offset;
-			total += pack_pos_to_offset(pack, pos + 1) -
-				 pack_pos_to_offset(pack, pos);
+
+			if (bitmap_is_midx(bitmap_git)) {
+				uint32_t pack_pos;
+				uint32_t midx_pos = pack_pos_to_midx(bitmap_git->midx, base + offset);
+				off_t offset = nth_midxed_offset(bitmap_git->midx, midx_pos);
+
+				uint32_t pack_id = nth_midxed_pack_int_id(bitmap_git->midx, midx_pos);
+				struct packed_git *pack = bitmap_git->midx->packs[pack_id];
+
+				if (offset_to_pack_pos(pack, offset, &pack_pos) < 0) {
+					struct object_id oid;
+					nth_midxed_object_oid(&oid, bitmap_git->midx, midx_pos);
+
+					die(_("could not find %s in pack %s at offset %"PRIuMAX),
+					    oid_to_hex(&oid),
+					    pack->pack_name,
+					    (uintmax_t)offset);
+				}
+
+				total += pack_pos_to_offset(pack, pack_pos + 1) - offset;
+			} else {
+				size_t pos = base + offset;
+				total += pack_pos_to_offset(bitmap_git->pack, pos + 1) -
+					 pack_pos_to_offset(bitmap_git->pack, pos);
+			}
 		}
 	}
 
@@ -1666,6 +1942,11 @@ off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git,
 	return total;
 }
 
+int bitmap_is_midx(struct bitmap_index *bitmap_git)
+{
+	return !!bitmap_git->midx;
+}
+
 const struct string_list *bitmap_preferred_tips(struct repository *r)
 {
 	return repo_config_get_value_multi(r, "pack.preferbitmaptips");
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 52ea10de51..81664f933f 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -44,6 +44,8 @@ typedef int (*show_reachable_fn)(
 struct bitmap_index;
 
 struct bitmap_index *prepare_bitmap_git(struct repository *r);
+struct bitmap_index *prepare_midx_bitmap_git(struct repository *r,
+					     struct multi_pack_index *midx);
 void count_bitmap_commit_list(struct bitmap_index *, uint32_t *commits,
 			      uint32_t *trees, uint32_t *blobs, uint32_t *tags);
 void traverse_bitmap_commit_list(struct bitmap_index *,
@@ -92,6 +94,10 @@ void bitmap_writer_finish(struct pack_idx_entry **index,
 			  uint32_t index_nr,
 			  const char *filename,
 			  uint16_t options);
+char *midx_bitmap_filename(struct multi_pack_index *midx);
+char *pack_bitmap_filename(struct packed_git *p);
+
+int bitmap_is_midx(struct bitmap_index *bitmap_git);
 
 const struct string_list *bitmap_preferred_tips(struct repository *r);
 int bitmap_is_preferred_refname(struct repository *r, const char *refname);
diff --git a/packfile.c b/packfile.c
index 9ef6d98292..371f5488cf 100644
--- a/packfile.c
+++ b/packfile.c
@@ -860,7 +860,7 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	if (!strcmp(file_name, "multi-pack-index"))
 		return;
 	if (starts_with(file_name, "multi-pack-index") &&
-	    ends_with(file_name, ".rev"))
+	    (ends_with(file_name, ".bitmap") || ends_with(file_name, ".rev")))
 		return;
 	if (ends_with(file_name, ".idx") ||
 	    ends_with(file_name, ".rev") ||
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 15/25] pack-bitmap: write multi-pack bitmaps
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
                     ` (13 preceding siblings ...)
  2021-07-27 21:19   ` [PATCH v3 14/25] pack-bitmap: read multi-pack bitmaps Taylor Blau
@ 2021-07-27 21:20   ` Taylor Blau
  2021-07-27 21:20   ` [PATCH v3 16/25] t5310: move some tests to lib-bitmap.sh Taylor Blau
                     ` (10 subsequent siblings)
  25 siblings, 0 replies; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:20 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

Write multi-pack bitmaps in the format described by
Documentation/technical/bitmap-format.txt, inferring their presence with
the absence of '--bitmap'.

To write a multi-pack bitmap, this patch attempts to reuse as much of
the existing machinery from pack-objects as possible. Specifically, the
MIDX code prepares a packing_data struct that pretends as if a single
packfile has been generated containing all of the objects contained
within the MIDX.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-multi-pack-index.txt |  12 +-
 builtin/multi-pack-index.c             |   2 +
 midx.c                                 | 208 ++++++++++++++++++++++++-
 midx.h                                 |   1 +
 4 files changed, 214 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index c9b063d31e..ed52459a9d 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -10,7 +10,7 @@ SYNOPSIS
 --------
 [verse]
 'git multi-pack-index' [--object-dir=<dir>] [--[no-]progress]
-	[--preferred-pack=<pack>] <subcommand>
+	[--preferred-pack=<pack>] [--[no-]bitmap] <subcommand>
 
 DESCRIPTION
 -----------
@@ -40,6 +40,9 @@ write::
 		multiple packs contain the same object. `<pack>` must
 		contain at least one object. If not given, ties are
 		broken in favor of the pack with the lowest mtime.
+
+	--[no-]bitmap::
+		Control whether or not a multi-pack bitmap is written.
 --
 
 verify::
@@ -81,6 +84,13 @@ EXAMPLES
 $ git multi-pack-index write
 -----------------------------------------------
 
+* Write a MIDX file for the packfiles in the current .git folder with a
+corresponding bitmap.
++
+-------------------------------------------------------------
+$ git multi-pack-index write --preferred-pack=<pack> --bitmap
+-------------------------------------------------------------
+
 * Write a MIDX file for the packfiles in an alternate object store.
 +
 -----------------------------------------------
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 5d3ea445fd..bf6fa982e3 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -68,6 +68,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 		OPT_STRING(0, "preferred-pack", &opts.preferred_pack,
 			   N_("preferred-pack"),
 			   N_("pack for reuse when computing a multi-pack bitmap")),
+		OPT_BIT(0, "bitmap", &opts.flags, N_("write multi-pack bitmap"),
+			MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX),
 		OPT_END(),
 	};
 
diff --git a/midx.c b/midx.c
index db21727c62..ce6cc62c20 100644
--- a/midx.c
+++ b/midx.c
@@ -13,6 +13,10 @@
 #include "repository.h"
 #include "chunk-format.h"
 #include "pack.h"
+#include "pack-bitmap.h"
+#include "refs.h"
+#include "revision.h"
+#include "list-objects.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -893,6 +897,166 @@ static int midx_checksum_valid(struct multi_pack_index *m)
 	return hashfile_checksum_valid(m->data, m->data_len);
 }
 
+static void prepare_midx_packing_data(struct packing_data *pdata,
+				      struct write_midx_context *ctx)
+{
+	uint32_t i;
+
+	memset(pdata, 0, sizeof(struct packing_data));
+	prepare_packing_data(the_repository, pdata);
+
+	for (i = 0; i < ctx->entries_nr; i++) {
+		struct pack_midx_entry *from = &ctx->entries[ctx->pack_order[i]];
+		struct object_entry *to = packlist_alloc(pdata, &from->oid);
+
+		oe_set_in_pack(pdata, to,
+			       ctx->info[ctx->pack_perm[from->pack_int_id]].p);
+	}
+}
+
+static int add_ref_to_pending(const char *refname,
+			      const struct object_id *oid,
+			      int flag, void *cb_data)
+{
+	struct rev_info *revs = (struct rev_info*)cb_data;
+	struct object *object;
+
+	if ((flag & REF_ISSYMREF) && (flag & REF_ISBROKEN)) {
+		warning("symbolic ref is dangling: %s", refname);
+		return 0;
+	}
+
+	object = parse_object_or_die(oid, refname);
+	if (object->type != OBJ_COMMIT)
+		return 0;
+
+	add_pending_object(revs, object, "");
+	if (bitmap_is_preferred_refname(revs->repo, refname))
+		object->flags |= NEEDS_BITMAP;
+	return 0;
+}
+
+struct bitmap_commit_cb {
+	struct commit **commits;
+	size_t commits_nr, commits_alloc;
+
+	struct write_midx_context *ctx;
+};
+
+static const struct object_id *bitmap_oid_access(size_t index,
+						 const void *_entries)
+{
+	const struct pack_midx_entry *entries = _entries;
+	return &entries[index].oid;
+}
+
+static void bitmap_show_commit(struct commit *commit, void *_data)
+{
+	struct bitmap_commit_cb *data = _data;
+	int pos = oid_pos(&commit->object.oid, data->ctx->entries,
+			  data->ctx->entries_nr,
+			  bitmap_oid_access);
+	if (pos < 0)
+		return;
+
+	ALLOC_GROW(data->commits, data->commits_nr + 1, data->commits_alloc);
+	data->commits[data->commits_nr++] = commit;
+}
+
+static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr_p,
+						    struct write_midx_context *ctx)
+{
+	struct rev_info revs;
+	struct bitmap_commit_cb cb = {0};
+
+	cb.ctx = ctx;
+
+	repo_init_revisions(the_repository, &revs, NULL);
+	setup_revisions(0, NULL, &revs, NULL);
+	for_each_ref(add_ref_to_pending, &revs);
+
+	/*
+	 * Skipping promisor objects here is intentional, since it only excludes
+	 * them from the list of reachable commits that we want to select from
+	 * when computing the selection of MIDX'd commits to receive bitmaps.
+	 *
+	 * Reachability bitmaps do require that their objects be closed under
+	 * reachability, but fetching any objects missing from promisors at this
+	 * point is too late. But, if one of those objects can be reached from
+	 * an another object that is included in the bitmap, then we will
+	 * complain later that we don't have reachability closure (and fail
+	 * appropriately).
+	 */
+	fetch_if_missing = 0;
+	revs.exclude_promisor_objects = 1;
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+
+	traverse_commit_list(&revs, bitmap_show_commit, NULL, &cb);
+	if (indexed_commits_nr_p)
+		*indexed_commits_nr_p = cb.commits_nr;
+
+	return cb.commits;
+}
+
+static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
+			     struct write_midx_context *ctx,
+			     unsigned flags)
+{
+	struct packing_data pdata;
+	struct pack_idx_entry **index;
+	struct commit **commits = NULL;
+	uint32_t i, commits_nr;
+	char *bitmap_name = xstrfmt("%s-%s.bitmap", midx_name, hash_to_hex(midx_hash));
+	int ret;
+
+	prepare_midx_packing_data(&pdata, ctx);
+
+	commits = find_commits_for_midx_bitmap(&commits_nr, ctx);
+
+	/*
+	 * Build the MIDX-order index based on pdata.objects (which is already
+	 * in MIDX order; c.f., 'midx_pack_order_cmp()' for the definition of
+	 * this order).
+	 */
+	ALLOC_ARRAY(index, pdata.nr_objects);
+	for (i = 0; i < pdata.nr_objects; i++)
+		index[i] = &pdata.objects[i].idx;
+
+	bitmap_writer_show_progress(flags & MIDX_PROGRESS);
+	bitmap_writer_build_type_index(&pdata, index, pdata.nr_objects);
+
+	/*
+	 * bitmap_writer_finish expects objects in lex order, but pack_order
+	 * gives us exactly that. use it directly instead of re-sorting the
+	 * array.
+	 *
+	 * This changes the order of objects in 'index' between
+	 * bitmap_writer_build_type_index and bitmap_writer_finish.
+	 *
+	 * The same re-ordering takes place in the single-pack bitmap code via
+	 * write_idx_file(), which is called by finish_tmp_packfile(), which
+	 * happens between bitmap_writer_build_type_index() and
+	 * bitmap_writer_finish().
+	 */
+	for (i = 0; i < pdata.nr_objects; i++)
+		index[ctx->pack_order[i]] = &pdata.objects[i].idx;
+
+	bitmap_writer_select_commits(commits, commits_nr, -1);
+	ret = bitmap_writer_build(&pdata);
+	if (ret < 0)
+		goto cleanup;
+
+	bitmap_writer_set_checksum(midx_hash);
+	bitmap_writer_finish(index, pdata.nr_objects, bitmap_name, 0);
+
+cleanup:
+	free(index);
+	free(bitmap_name);
+	return ret;
+}
+
 static int write_midx_internal(const char *object_dir,
 			       struct string_list *packs_to_drop,
 			       const char *preferred_pack_name,
@@ -940,7 +1104,7 @@ static int write_midx_internal(const char *object_dir,
 
 			ctx.info[ctx.nr].orig_pack_int_id = i;
 			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
-			ctx.info[ctx.nr].p = NULL;
+			ctx.info[ctx.nr].p = ctx.m->packs[i];
 			ctx.info[ctx.nr].expired = 0;
 
 			if (flags & MIDX_WRITE_REV_INDEX) {
@@ -974,8 +1138,26 @@ static int write_midx_internal(const char *object_dir,
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
 	stop_progress(&ctx.progress);
 
-	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
-		goto cleanup;
+	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop) {
+		struct bitmap_index *bitmap_git;
+		int bitmap_exists;
+		int want_bitmap = flags & MIDX_WRITE_BITMAP;
+
+		bitmap_git = prepare_midx_bitmap_git(the_repository, ctx.m);
+		bitmap_exists = bitmap_git && bitmap_is_midx(bitmap_git);
+		free_bitmap_index(bitmap_git);
+
+		if (bitmap_exists || !want_bitmap) {
+			/*
+			 * The correct MIDX already exists, and so does a
+			 * corresponding bitmap (or one wasn't requested).
+			 */
+			if (!want_bitmap)
+				clear_midx_files_ext(the_repository, ".bitmap",
+						     NULL);
+			goto cleanup;
+		}
+	}
 
 	if (preferred_pack_name) {
 		int found = 0;
@@ -991,7 +1173,8 @@ static int write_midx_internal(const char *object_dir,
 		if (!found)
 			warning(_("unknown preferred pack: '%s'"),
 				preferred_pack_name);
-	} else if (ctx.nr && (flags & MIDX_WRITE_REV_INDEX)) {
+	} else if (ctx.nr &&
+		   (flags & (MIDX_WRITE_REV_INDEX | MIDX_WRITE_BITMAP))) {
 		struct packed_git *oldest = ctx.info[ctx.preferred_pack_idx].p;
 		ctx.preferred_pack_idx = 0;
 
@@ -1123,9 +1306,6 @@ static int write_midx_internal(const char *object_dir,
 	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(get_lock_file_fd(&lk), get_lock_file_path(&lk));
 
-	if (ctx.m)
-		close_midx(ctx.m);
-
 	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
 		result = 1;
@@ -1156,14 +1336,24 @@ static int write_midx_internal(const char *object_dir,
 	finalize_hashfile(f, midx_hash, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	free_chunkfile(cf);
 
-	if (flags & MIDX_WRITE_REV_INDEX)
+	if (flags & (MIDX_WRITE_REV_INDEX | MIDX_WRITE_BITMAP))
 		ctx.pack_order = midx_pack_order(&ctx);
 
 	if (flags & MIDX_WRITE_REV_INDEX)
 		write_midx_reverse_index(midx_name, midx_hash, &ctx);
+	if (flags & MIDX_WRITE_BITMAP) {
+		if (write_midx_bitmap(midx_name, midx_hash, &ctx, flags) < 0) {
+			error(_("could not write multi-pack bitmap"));
+			result = 1;
+			goto cleanup;
+		}
+	}
+
+	close_midx(ctx.m);
 
 	commit_lock_file(&lk);
 
+	clear_midx_files_ext(the_repository, ".bitmap", midx_hash);
 	clear_midx_files_ext(the_repository, ".rev", midx_hash);
 
 cleanup:
@@ -1180,6 +1370,7 @@ static int write_midx_internal(const char *object_dir,
 	free(ctx.pack_perm);
 	free(ctx.pack_order);
 	free(midx_name);
+
 	return result;
 }
 
@@ -1240,6 +1431,7 @@ void clear_midx_file(struct repository *r)
 	if (remove_path(midx))
 		die(_("failed to clear multi-pack-index at %s"), midx);
 
+	clear_midx_files_ext(r, ".bitmap", NULL);
 	clear_midx_files_ext(r, ".rev", NULL);
 
 	free(midx);
diff --git a/midx.h b/midx.h
index 1172df1a71..350f4d0a7b 100644
--- a/midx.h
+++ b/midx.h
@@ -41,6 +41,7 @@ struct multi_pack_index {
 
 #define MIDX_PROGRESS     (1 << 0)
 #define MIDX_WRITE_REV_INDEX (1 << 1)
+#define MIDX_WRITE_BITMAP (1 << 2)
 
 const unsigned char *get_midx_checksum(struct multi_pack_index *m);
 char *get_midx_filename(const char *object_dir);
-- 
2.31.1.163.ga65ce7f831


^ permalink raw reply	[flat|nested] 273+ messages in thread

* [PATCH v3 16/25] t5310: move some tests to lib-bitmap.sh
  2021-07-27 21:19 ` [PATCH v3 00/25] " Taylor Blau
                     ` (14 preceding siblings ...)
  2021-07-27 21:20   ` [PATCH v3 15/25] pack-bitmap: write " Taylor Blau
@ 2021-07-27 21:20   ` Taylor Blau
  2021-08-12 20:25     ` Jeff King
  2021-07-27 21:20   ` [PATCH v3 17/25] t/helper/test-read-midx.c: add --checksum mode Taylor Blau
                     ` (9 subsequent siblings)
  25 siblings, 1 reply; 273+ messages in thread
From: Taylor Blau @ 2021-07-27 21:20 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, gitster, jonathantanmy

We'll soon be adding a test script that will cover many of the same
bitmap concepts as t5310, but for MIDX bitmaps. Let's pull out as many
of the applicable tests as we can so we don't have to rewrite them.

There should be no functional change to t5310; we still run the same
operations in the same order.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/lib-bitmap.sh         | 236 ++++++++++++++++++++++++++++++++++++++++
 t/t5310-pack-bitmaps.sh | 227 +-------------------------------------
 2 files changed, 240 insertions(+), 223 deletions(-)

diff --git a/t/lib-bitmap.sh b/t/lib-bitmap.sh
index fe3f98be24..ecb5d0e05d 100644
--- a/t/lib-bitmap.sh
+++ b/t/lib-bitmap.sh
@@ -1,3 +1,6 @@
+# Helpers for scripts testing bitamp functionality; see t5310 for
+# example usage.
+
 # Compare a file containing rev-list bitmap traversal output to its non-bitmap
 # counterpart. You can't just use test_cmp for this, because the two produce
 # subtly different output:
@@ -24,3 +27,236 @@ test_bitmap_traversal () {
 	test_cmp "$1.normalized" "$2.normalized" &&
 	rm -f "$1.normalized" "$2.normalized"
 }
+
+# To ensure the logic for "maximal commits" is exercised, make
+# the repository a bit more complicated.
+#
+#    other                         second
+#      *                             *
+# (99 commits)                  (99 commits)
+#      *                             *
+#      |\                           /|
+#      | * octo-other  octo-second * |
+#      |/|\_________  ____________/|\|
+#      | \          \/  __________/  |
+#      |  | ________/\ /             |
+#      *  |/          * merge-right  *
+#      | _|__________/ \____________ |
+#      |/ |                         \|
+# (l1) *  * merge-left               * (r1)
+#      | / \________________________ |
+#      |/                           \|
+# (l2) *                             * (r2)
+#       \___________________________ |
+#                                   \|
+#                                    * (base)
+#
+# We only push bits down the first-parent history, which
+# makes some of these commits unimportant!
+#
+# The important part for the maximal commit algorithm is how
+# the bitmasks are extended. Assuming starting bit positions
+# for second (bit 0) and other (bit 1), the bitmasks at the
+# end should be:
+#
+#      second: 1       (maximal, selected)
+#       other: 01      (maximal, selected)
+#      (base): 11 (maximal)
+#
+# This complicated history was important for a previous
+# version of the walk that guarantees never walking a
+# commit multiple times. That goal might be important
+# again, so preserve this complicated case. For now, this
+# test will guarantee that the bitmaps are computed
+# correctly, even with the repeat calculations.
+setup_bitmap_history() {
+	test_expect_success 'setup repo with moderate-sized history' '
+		test_commit_bulk --id=file 10 &&
+		git branch -M second &&
+		git checkout -b other HEAD~5 &&
+		test_commit_bulk --id=side 10 &&
+
+		# add complicated history setup, including merges and
+		# ambiguous merge-bases
+
+		git checkout -b merge-left other~2 &&
+		git merge second~2 -m "merge-left" &&
+
+		git checkout -b merge-right second~1 &&
+		git merge other~1 -m "merge-right" &&
+
+		git checkout -b octo-second second &&
+		git merge merge-left merge-right -m "octopus-second" &&
+
+		git checkout -b octo-other other &&
+		git merge merge-left merge-right -m "octopus-other" &&
+
+		git checkout other &&
+		git merge octo-other -m "pull octopus" &&
+
+		git checkout second &&
+		git merge octo-second -m "pull octopus" &&
+
+		# Remove these branches so they are not selected
+		# as bitmap tips
+		git branch -D merge-left &&
+		git branch -D merge-right &&
+		git branch -D octo-other &&
+		git branch -D octo-second &&
+
+		# add padding to make these merges less interesting
+		# and avoid having them selected for bitmaps
+		test_commit_bulk --id=file 100 &&
+		git checkout other &&
+		test_commit_bulk --id=side 100 &&
+		git checkout second &&
+
+		bitmaptip=$(git rev-parse second) &&
+		blob=$(echo tagged-blob | git hash-object -w --stdin) &&
+		git tag tagged-blob $blob
+	'
+}
+
+rev_list_tests_head () {
+	test_expect_success "counting commits via bitmap ($state, $branch)" '
+		git rev-list --count $branch >expect &&
+		git rev-list --use-bit