git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format
@ 2022-06-20 12:33 Abhradeep Chakraborty via GitGitGadget
  2022-06-20 12:33 ` [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension Abhradeep Chakraborty via GitGitGadget
                   ` (6 more replies)
  0 siblings, 7 replies; 63+ messages in thread
From: Abhradeep Chakraborty via GitGitGadget @ 2022-06-20 12:33 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Kaartic Sivaram, Abhradeep Chakraborty

When parsing the .bitmap file, git loads all the bitmaps one by one even if
some of the bitmaps are not necessary. We can remove this overhead by
loading only the necessary bitmaps. A look up table extension can solve this
issue.

The proposed table has:

 * a list of nr_entries object ids. These objects are commits that has
   bitmaps. Ids are stored in lexicographic order (for better searching).
 * a list of <offset, xor-offset> pairs (4-byte integers, network-byte
   order). The i'th pair denotes the offset and xor-offset(respectively) of
   the bitmap of i'th commit in the previous list. These two informations
   are necessary because only in this way bitmaps can be found without
   parsing all the bitmap.
 * a 4-byte integer for table specific flags (none exists currently).

Whenever git want to parse the bitmap for a specific commit, it will first
refer to the table and will look for the offset and xor-offset for that
commit. Git will then try to parse the bitmap located at the offset
position. The xor-offset can be used to find the xor-bitmap for the
bitmap(if any). This process is recursive and will end if xor-offset is null
(i.e. there is no xor-bitmap left).

Abhradeep Chakraborty (5):
  Documentation/technical: describe bitmap lookup table extension
  pack-bitmap: prepare to read lookup table extension
  pack-bitmap-write.c: write lookup table extension
  bitmap-commit-table: add tests for the bitmap lookup table
  bitmap-lookup-table: add performance tests

Taylor Blau (1):
  builtin/pack-objects.c: learn pack.writeBitmapLookupTable

 Documentation/config/pack.txt             |   7 +
 Documentation/technical/bitmap-format.txt |  31 ++++
 builtin/pack-objects.c                    |   8 +
 pack-bitmap-write.c                       |  59 +++++++-
 pack-bitmap.c                             | 172 +++++++++++++++++++++-
 pack-bitmap.h                             |   1 +
 t/perf/p5310-pack-bitmaps.sh              |  60 +++++---
 t/perf/p5326-multi-pack-bitmaps.sh        |  55 ++++---
 t/t5310-pack-bitmaps.sh                   |  14 ++
 t/t5326-multi-pack-bitmaps.sh             |  19 +++
 10 files changed, 375 insertions(+), 51 deletions(-)


base-commit: 5699ec1b0aec51b9e9ba5a2785f65970c5a95d84
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1266%2FAbhra303%2Fbitmap-commit-table-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1266/Abhra303/bitmap-commit-table-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1266
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-20 12:33 [PATCH 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
@ 2022-06-20 12:33 ` Abhradeep Chakraborty via GitGitGadget
  2022-06-20 16:56   ` Derrick Stolee
                     ` (2 more replies)
  2022-06-20 12:33 ` [PATCH 2/6] pack-bitmap: prepare to read " Abhradeep Chakraborty via GitGitGadget
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 63+ messages in thread
From: Abhradeep Chakraborty via GitGitGadget @ 2022-06-20 12:33 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Kaartic Sivaram, Abhradeep Chakraborty,
	Abhradeep Chakraborty

From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

When reading bitmap file, git loads each and every bitmap one by one
even if all the bitmaps are not required. A "bitmap lookup table"
extension to the bitmap format can reduce the overhead of loading
bitmaps which stores a list of bitmapped commit oids, along with their
offset and xor offset. This way git can load only the neccesary bitmaps
without loading the previous bitmaps.

Add some information for the new "bitmap lookup table" extension in the
bitmap-format documentation.

Co-Authored-by: Taylor Blau <ttaylorr@github.com>
Mentored-by: Taylor Blau <ttaylorr@github.com>
Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
---
 Documentation/technical/bitmap-format.txt | 31 +++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
index 04b3ec21785..34e98787b78 100644
--- a/Documentation/technical/bitmap-format.txt
+++ b/Documentation/technical/bitmap-format.txt
@@ -67,6 +67,14 @@ MIDXs, both the bit-cache and rev-cache extensions are required.
 			pack/MIDX. The format and meaning of the name-hash is
 			described below.
 
+			** {empty}
+			BITMAP_OPT_LOOKUP_TABLE (0xf) : :::
+			If present, the end of the bitmap file contains a table
+			containing a list of `N` object ids, a list of pairs of
+			offset and xor offset of respective objects, and 4-byte
+			integer denoting the flags (currently none). The format
+			and meaning of the table is described below.
+
 		4-byte entry count (network byte order)
 
 			The total count of entries (bitmapped commits) in this bitmap index.
@@ -205,3 +213,26 @@ Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag.
 If implementations want to choose a different hashing scheme, they are
 free to do so, but MUST allocate a new header flag (because comparing
 hashes made under two different schemes would be pointless).
+
+Commit lookup table
+-------------------
+
+If the BITMAP_OPT_LOOKUP_TABLE flag is set, the end of the `.bitmap`
+contains a lookup table specifying the positions of commits which have a
+bitmap.
+
+For a `.bitmap` containing `nr_entries` reachability bitmaps, the format
+is as follows:
+
+	- `nr_entries` object names.
+
+	- `nr_entries` pairs of 4-byte integers, each in network order.
+	  The first holds the offset from which that commit's bitmap can
+	  be read. The second number holds the position of the commit
+	  whose bitmap the current bitmap is xor'd with in lexicographic
+	  order, or 0xffffffff if the current commit is not xor'd with
+	  anything.
+
+	- One 4-byte network byte order integer specifying
+	  table-specific flags. None exist currently, so this is always
+	  "0".
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 2/6] pack-bitmap: prepare to read lookup table extension
  2022-06-20 12:33 [PATCH 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
  2022-06-20 12:33 ` [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension Abhradeep Chakraborty via GitGitGadget
@ 2022-06-20 12:33 ` Abhradeep Chakraborty via GitGitGadget
  2022-06-20 20:49   ` Derrick Stolee
  2022-06-20 22:06   ` Taylor Blau
  2022-06-20 12:33 ` [PATCH 3/6] pack-bitmap-write.c: write " Abhradeep Chakraborty via GitGitGadget
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 63+ messages in thread
From: Abhradeep Chakraborty via GitGitGadget @ 2022-06-20 12:33 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Kaartic Sivaram, Abhradeep Chakraborty,
	Abhradeep Chakraborty

From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

Bitmap lookup table extension can let git to parse only the necessary
bitmaps without loading the previous bitmaps one by one.

Teach git to read and use the bitmap lookup table extension.

Co-Authored-by: Taylor Blau <ttaylorr@github.com>
Mentored-by: Taylor Blau <ttaylorr@github.com>
Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
---
 pack-bitmap.c | 172 ++++++++++++++++++++++++++++++++++++++++++++++++--
 pack-bitmap.h |   1 +
 2 files changed, 166 insertions(+), 7 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 36134222d7a..d5e5973a79f 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -15,6 +15,7 @@
 #include "list-objects-filter-options.h"
 #include "midx.h"
 #include "config.h"
+#include "hash-lookup.h"
 
 /*
  * An entry on the bitmap index, representing the bitmap for a given
@@ -82,6 +83,13 @@ struct bitmap_index {
 	/* The checksum of the packfile or MIDX; points into map. */
 	const unsigned char *checksum;
 
+	/*
+	 * If not NULL, these point into the various commit table sections
+	 * (within map).
+	 */
+	unsigned char *table_lookup;
+	unsigned char *table_offsets;
+
 	/*
 	 * Extended index.
 	 *
@@ -185,6 +193,24 @@ static int load_bitmap_header(struct bitmap_index *index)
 			index->hashes = (void *)(index_end - cache_size);
 			index_end -= cache_size;
 		}
+
+		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
+		    git_env_bool("GIT_READ_COMMIT_TABLE", 1)) {
+			uint32_t entry_count = ntohl(header->entry_count);
+			uint32_t table_size =
+				(entry_count * the_hash_algo->rawsz) /* oids */ +
+				(entry_count * sizeof(uint32_t)) /* offsets */ +
+				(entry_count * sizeof(uint32_t)) /* xor offsets */ +
+				(sizeof(uint32_t)) /* flags */;
+
+			if (table_size > index_end - index->map - header_size)
+				return error("corrupted bitmap index file (too short to fit commit table)");
+
+			index->table_lookup = (void *)(index_end - table_size);
+			index->table_offsets = index->table_lookup + the_hash_algo->rawsz * entry_count;
+
+			index_end -= table_size;
+		}
 	}
 
 	index->entry_count = ntohl(header->entry_count);
@@ -470,7 +496,7 @@ static int load_bitmap(struct bitmap_index *bitmap_git)
 		!(bitmap_git->tags = read_bitmap_1(bitmap_git)))
 		goto failed;
 
-	if (load_bitmap_entries_v1(bitmap_git) < 0)
+	if (!bitmap_git->table_lookup && load_bitmap_entries_v1(bitmap_git) < 0)
 		goto failed;
 
 	return 0;
@@ -557,14 +583,145 @@ struct include_data {
 	struct bitmap *seen;
 };
 
-struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
-				      struct commit *commit)
+static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
+						      struct commit *commit,
+						      uint32_t *pos_hint);
+
+static inline const unsigned char *bitmap_oid_pos(struct bitmap_index *bitmap_git,
+						  uint32_t pos)
+{
+	return bitmap_git->table_lookup + (pos * the_hash_algo->rawsz);
+}
+
+static inline const void *bitmap_offset_pos(struct bitmap_index *bitmap_git,
+					    uint32_t pos)
+{
+	return bitmap_git->table_offsets + (pos * 2 * sizeof(uint32_t));
+}
+
+static inline const void *xor_position_pos(struct bitmap_index *bitmap_git,
+					   uint32_t pos)
+{
+	return (unsigned char*) bitmap_offset_pos(bitmap_git, pos) + sizeof(uint32_t);
+}
+
+static int bitmap_lookup_cmp(const void *_va, const void *_vb)
+{
+	return hashcmp(_va, _vb);
+}
+
+static int bitmap_table_lookup(struct bitmap_index *bitmap_git,
+			       struct object_id *oid,
+			       uint32_t *commit_pos)
+{
+	unsigned char *found = bsearch(oid->hash, bitmap_git->table_lookup,
+				       bitmap_git->entry_count,
+				       the_hash_algo->rawsz, bitmap_lookup_cmp);
+	if (found)
+		*commit_pos = (found - bitmap_git->table_lookup) / the_hash_algo->rawsz;
+	return !!found;
+}
+
+static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
+						    struct object_id *oid,
+						    uint32_t commit_pos)
+{
+	uint32_t xor_pos;
+	off_t bitmap_ofs;
+
+	int flags;
+	struct ewah_bitmap *bitmap;
+	struct stored_bitmap *xor_bitmap;
+
+	bitmap_ofs = get_be32(bitmap_offset_pos(bitmap_git, commit_pos));
+	xor_pos = get_be32(xor_position_pos(bitmap_git, commit_pos));
+
+	/*
+	 * Lazily load the xor'd bitmap if required (and we haven't done so
+	 * already). Make sure to pass the xor'd bitmap's position along as a
+	 * hint to avoid an unnecessary binary search in
+	 * stored_bitmap_for_commit().
+	 */
+	if (xor_pos == 0xffffffff) {
+		xor_bitmap = NULL;
+	} else {
+		struct commit *xor_commit;
+		struct object_id xor_oid;
+
+		oidread(&xor_oid, bitmap_oid_pos(bitmap_git, xor_pos));
+
+		xor_commit = lookup_commit(the_repository, &xor_oid);
+		if (!xor_commit)
+			return NULL;
+
+		xor_bitmap = stored_bitmap_for_commit(bitmap_git, xor_commit,
+						      &xor_pos);
+	}
+
+	/*
+	 * Don't bother reading the commit's index position or its xor
+	 * offset:
+	 *
+	 *   - The commit's index position is irrelevant to us, since
+	 *     load_bitmap_entries_v1 only uses it to learn the object
+	 *     id which is used to compute the hashmap's key. We already
+	 *     have an object id, so no need to look it up again.
+	 *
+	 *   - The xor_offset is unusable for us, since it specifies how
+	 *     many entries previous to ours we should look at. This
+	 *     makes sense when reading the bitmaps sequentially (as in
+	 *     load_bitmap_entries_v1()), since we can keep track of
+	 *     each bitmap as we read them.
+	 *
+	 *     But it can't work for us, since the bitmap's don't have a
+	 *     fixed size. So we learn the position of the xor'd bitmap
+	 *     from the commit table (and resolve it to a bitmap in the
+	 *     above if-statement).
+	 *
+	 * Instead, we can skip ahead and immediately read the flags and
+	 * ewah bitmap.
+	 */
+	bitmap_git->map_pos = bitmap_ofs + sizeof(uint32_t) + sizeof(uint8_t);
+	flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
+	bitmap = read_bitmap_1(bitmap_git);
+	if (!bitmap)
+		return NULL;
+
+	return store_bitmap(bitmap_git, bitmap, oid, xor_bitmap, flags);
+}
+
+static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
+						      struct commit *commit,
+						      uint32_t *pos_hint)
 {
 	khiter_t hash_pos = kh_get_oid_map(bitmap_git->bitmaps,
 					   commit->object.oid);
-	if (hash_pos >= kh_end(bitmap_git->bitmaps))
+	if (hash_pos >= kh_end(bitmap_git->bitmaps)) {
+		uint32_t commit_pos;
+		if (!bitmap_git->table_lookup)
+			return NULL;
+
+		/* NEEDSWORK: cache misses aren't recorded. */
+		if (pos_hint)
+			commit_pos = *pos_hint;
+		else if (!bitmap_table_lookup(bitmap_git,
+					      &commit->object.oid,
+					      &commit_pos))
+			return NULL;
+		return lazy_bitmap_for_commit(bitmap_git, &commit->object.oid,
+					      commit_pos);
+	}
+	return kh_value(bitmap_git->bitmaps, hash_pos);
+}
+
+struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
+				      struct commit *commit)
+{
+	struct stored_bitmap *sb = stored_bitmap_for_commit(bitmap_git, commit,
+							    NULL);
+	if (!sb)
 		return NULL;
-	return lookup_stored_bitmap(kh_value(bitmap_git->bitmaps, hash_pos));
+	return lookup_stored_bitmap(sb);
 }
 
 static inline int bitmap_position_extended(struct bitmap_index *bitmap_git,
@@ -1699,8 +1856,9 @@ void test_bitmap_walk(struct rev_info *revs)
 	if (revs->pending.nr != 1)
 		die("you must specify exactly one commit to test");
 
-	fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
-		bitmap_git->version, bitmap_git->entry_count);
+	if (!bitmap_git->table_lookup)
+		fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
+			bitmap_git->version, bitmap_git->entry_count);
 
 	root = revs->pending.objects[0].item;
 	bm = bitmap_for_commit(bitmap_git, (struct commit *)root);
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 3d3ddd77345..37f86787a4d 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -26,6 +26,7 @@ struct bitmap_disk_header {
 enum pack_bitmap_opts {
 	BITMAP_OPT_FULL_DAG = 1,
 	BITMAP_OPT_HASH_CACHE = 4,
+	BITMAP_OPT_LOOKUP_TABLE = 16,
 };
 
 enum pack_bitmap_flags {
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 3/6] pack-bitmap-write.c: write lookup table extension
  2022-06-20 12:33 [PATCH 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
  2022-06-20 12:33 ` [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension Abhradeep Chakraborty via GitGitGadget
  2022-06-20 12:33 ` [PATCH 2/6] pack-bitmap: prepare to read " Abhradeep Chakraborty via GitGitGadget
@ 2022-06-20 12:33 ` Abhradeep Chakraborty via GitGitGadget
  2022-06-20 22:16   ` Taylor Blau
  2022-06-20 12:33 ` [PATCH 4/6] builtin/pack-objects.c: learn pack.writeBitmapLookupTable Taylor Blau via GitGitGadget
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 63+ messages in thread
From: Abhradeep Chakraborty via GitGitGadget @ 2022-06-20 12:33 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Kaartic Sivaram, Abhradeep Chakraborty,
	Abhradeep Chakraborty

From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

Teach git to write bitmap lookup table extension. The table has the
following information:

    - `N` no of Object ids of each bitmapped commits

    - A list of offset, xor-offset pair; the i'th pair denotes the
      offsets and xor-offsets of i'th commit in the previous list.

    - 4-byte integer denoting the flags

Co-authored-by: Taylor Blau <ttaylorr@github.com>
Mentored-by: Taylor Blau <ttaylorr@github.com>
Co-mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
---
 pack-bitmap-write.c | 59 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 57 insertions(+), 2 deletions(-)

diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index c43375bd344..9e88a64dd65 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -650,7 +650,8 @@ static const struct object_id *oid_access(size_t pos, const void *table)
 
 static void write_selected_commits_v1(struct hashfile *f,
 				      struct pack_idx_entry **index,
-				      uint32_t index_nr)
+				      uint32_t index_nr,
+				      off_t *offsets)
 {
 	int i;
 
@@ -663,6 +664,9 @@ static void write_selected_commits_v1(struct hashfile *f,
 		if (commit_pos < 0)
 			BUG("trying to write commit not in index");
 
+		if (offsets)
+			offsets[i] = hashfile_total(f);
+
 		hashwrite_be32(f, commit_pos);
 		hashwrite_u8(f, stored->xor_offset);
 		hashwrite_u8(f, stored->flags);
@@ -671,6 +675,49 @@ static void write_selected_commits_v1(struct hashfile *f,
 	}
 }
 
+static int table_cmp(const void *_va, const void *_vb)
+{
+	return oidcmp(&writer.selected[*(uint32_t*)_va].commit->object.oid,
+		      &writer.selected[*(uint32_t*)_vb].commit->object.oid);
+}
+
+static void write_lookup_table(struct hashfile *f,
+			       off_t *offsets)
+{
+	uint32_t i;
+	uint32_t flags = 0;
+	uint32_t *table, *table_inv;
+
+	ALLOC_ARRAY(table, writer.selected_nr);
+	ALLOC_ARRAY(table_inv, writer.selected_nr);
+
+	for (i = 0; i < writer.selected_nr; i++)
+		table[i] = i;
+	QSORT(table, writer.selected_nr, table_cmp);
+	for (i = 0; i < writer.selected_nr; i++)
+		table_inv[table[i]] = i;
+
+	for (i = 0; i < writer.selected_nr; i++) {
+		struct bitmapped_commit *selected = &writer.selected[table[i]];
+		struct object_id *oid = &selected->commit->object.oid;
+
+		hashwrite(f, oid->hash, the_hash_algo->rawsz);
+	}
+	for (i = 0; i < writer.selected_nr; i++) {
+		struct bitmapped_commit *selected = &writer.selected[table[i]];
+
+		hashwrite_be32(f, offsets[table[i]]);
+		hashwrite_be32(f, selected->xor_offset
+			       ? table_inv[table[i] - selected->xor_offset]
+			       : 0xffffffff);
+	}
+
+	hashwrite_be32(f, flags);
+
+	free(table);
+	free(table_inv);
+}
+
 static void write_hash_cache(struct hashfile *f,
 			     struct pack_idx_entry **index,
 			     uint32_t index_nr)
@@ -695,6 +742,7 @@ void bitmap_writer_finish(struct pack_idx_entry **index,
 {
 	static uint16_t default_version = 1;
 	static uint16_t flags = BITMAP_OPT_FULL_DAG;
+	off_t *offsets = NULL;
 	struct strbuf tmp_file = STRBUF_INIT;
 	struct hashfile *f;
 
@@ -715,8 +763,14 @@ void bitmap_writer_finish(struct pack_idx_entry **index,
 	dump_bitmap(f, writer.trees);
 	dump_bitmap(f, writer.blobs);
 	dump_bitmap(f, writer.tags);
-	write_selected_commits_v1(f, index, index_nr);
 
+	if (options & BITMAP_OPT_LOOKUP_TABLE)
+		CALLOC_ARRAY(offsets, index_nr);
+
+	write_selected_commits_v1(f, index, index_nr, offsets);
+
+	if (options & BITMAP_OPT_LOOKUP_TABLE)
+		write_lookup_table(f, offsets);
 	if (options & BITMAP_OPT_HASH_CACHE)
 		write_hash_cache(f, index, index_nr);
 
@@ -730,4 +784,5 @@ void bitmap_writer_finish(struct pack_idx_entry **index,
 		die_errno("unable to rename temporary bitmap file to '%s'", filename);
 
 	strbuf_release(&tmp_file);
+	free(offsets);
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 4/6] builtin/pack-objects.c: learn pack.writeBitmapLookupTable
  2022-06-20 12:33 [PATCH 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
                   ` (2 preceding siblings ...)
  2022-06-20 12:33 ` [PATCH 3/6] pack-bitmap-write.c: write " Abhradeep Chakraborty via GitGitGadget
@ 2022-06-20 12:33 ` Taylor Blau via GitGitGadget
  2022-06-20 22:18   ` Taylor Blau
  2022-06-20 12:33 ` [PATCH 5/6] bitmap-commit-table: add tests for the bitmap lookup table Abhradeep Chakraborty via GitGitGadget
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau via GitGitGadget @ 2022-06-20 12:33 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Kaartic Sivaram, Abhradeep Chakraborty, Taylor Blau

From: Taylor Blau <ttaylorr@github.com>

Teach git to provide a way for users to enable/disable bitmap lookup
table extension by providing a config option named 'writeBitmapLookupTable'.

Signed-off-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
---
 Documentation/config/pack.txt | 7 +++++++
 builtin/pack-objects.c        | 8 ++++++++
 2 files changed, 15 insertions(+)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index ad7f73a1ead..e12008d2415 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -164,6 +164,13 @@ When writing a multi-pack reachability bitmap, no new namehashes are
 computed; instead, any namehashes stored in an existing bitmap are
 permuted into their appropriate location when writing a new bitmap.
 
+pack.writeBitmapLookupTable::
+	When true, git will include a "lookup table" section in the
+	bitmap index (if one is written). This table is used to defer
+	loading individual bitmaps as late as possible. This can be
+	beneficial in repositories which have relatively large bitmap
+	indexes. Defaults to false.
+
 pack.writeReverseIndex::
 	When true, git will write a corresponding .rev file (see:
 	link:../technical/pack-format.html[Documentation/technical/pack-format.txt])
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index cc5f41086da..3ba20301980 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3148,6 +3148,14 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		else
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
+
+	if (!strcmp(k, "pack.writebitmaplookuptable")) {
+		if (git_config_bool(k, v))
+			write_bitmap_options |= BITMAP_OPT_LOOKUP_TABLE;
+		else
+			write_bitmap_options &= ~BITMAP_OPT_LOOKUP_TABLE;
+	}
+
 	if (!strcmp(k, "pack.usebitmaps")) {
 		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 5/6] bitmap-commit-table: add tests for the bitmap lookup table
  2022-06-20 12:33 [PATCH 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
                   ` (3 preceding siblings ...)
  2022-06-20 12:33 ` [PATCH 4/6] builtin/pack-objects.c: learn pack.writeBitmapLookupTable Taylor Blau via GitGitGadget
@ 2022-06-20 12:33 ` Abhradeep Chakraborty via GitGitGadget
  2022-06-22 16:54   ` Taylor Blau
  2022-06-20 12:33 ` [PATCH 6/6] bitmap-lookup-table: add performance tests Abhradeep Chakraborty via GitGitGadget
  2022-06-26 13:10 ` [PATCH v2 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
  6 siblings, 1 reply; 63+ messages in thread
From: Abhradeep Chakraborty via GitGitGadget @ 2022-06-20 12:33 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Kaartic Sivaram, Abhradeep Chakraborty,
	Abhradeep Chakraborty

From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

Add tests to check the working of the newly implemented lookup table.

Mentored-by: Taylor Blau <ttaylorr@github.com>
Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
---
 t/t5310-pack-bitmaps.sh       | 14 ++++++++++++++
 t/t5326-multi-pack-bitmaps.sh | 19 +++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index f775fc1ce69..f05d3e6ace7 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -43,6 +43,20 @@ test_expect_success 'full repack creates bitmaps' '
 
 basic_bitmap_tests
 
+test_expect_success 'using lookup table does not affect basic bitmap tests' '
+	test_config pack.writeBitmapLookupTable true &&
+	git repack -adb
+'
+basic_bitmap_tests
+
+test_expect_success 'using lookup table does not let each entries to be parsed one by one' '
+	test_config pack.writeBitmapLookupTable true &&
+	git repack -adb &&
+	git rev-list --test-bitmap HEAD 2>out &&
+	grep "Found bitmap for" out &&
+	! grep "Bitmap v1 test "
+'
+
 test_expect_success 'incremental repack fails when bitmaps are requested' '
 	test_commit more-1 &&
 	test_must_fail git repack -d 2>err &&
diff --git a/t/t5326-multi-pack-bitmaps.sh b/t/t5326-multi-pack-bitmaps.sh
index 4fe57414c13..85fbdf5e4bb 100755
--- a/t/t5326-multi-pack-bitmaps.sh
+++ b/t/t5326-multi-pack-bitmaps.sh
@@ -306,5 +306,24 @@ test_expect_success 'graceful fallback when missing reverse index' '
 		! grep "ignoring extra bitmap file" err
 	)
 '
+test_expect_success 'multi-pack-index write --bitmap writes lookup table if enabled' '
+	rm -fr repo &&
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+		test_commit_bulk 106 &&
+
+		git repack -d &&
+
+		git config pack.writeBitmapLookupTable true &&
+		git multi-pack-index write --bitmap &&
+
+		git rev-list --test-bitmap HEAD 2>out &&
+		grep "Found bitmap for" out &&
+		! grep "Bitmap v1 test "
+
+	)
+'
 
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 6/6] bitmap-lookup-table: add performance tests
  2022-06-20 12:33 [PATCH 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
                   ` (4 preceding siblings ...)
  2022-06-20 12:33 ` [PATCH 5/6] bitmap-commit-table: add tests for the bitmap lookup table Abhradeep Chakraborty via GitGitGadget
@ 2022-06-20 12:33 ` Abhradeep Chakraborty via GitGitGadget
  2022-06-22 17:14   ` Taylor Blau
  2022-06-26 13:10 ` [PATCH v2 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
  6 siblings, 1 reply; 63+ messages in thread
From: Abhradeep Chakraborty via GitGitGadget @ 2022-06-20 12:33 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Kaartic Sivaram, Abhradeep Chakraborty,
	Abhradeep Chakraborty

From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

Add performance tests for bitmap lookup table extension.

Mentored-by: Taylor Blau <ttaylorr@github.com>
Co-mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
---
 t/perf/p5310-pack-bitmaps.sh       | 60 +++++++++++++++++++-----------
 t/perf/p5326-multi-pack-bitmaps.sh | 55 +++++++++++++++++----------
 2 files changed, 73 insertions(+), 42 deletions(-)

diff --git a/t/perf/p5310-pack-bitmaps.sh b/t/perf/p5310-pack-bitmaps.sh
index 7ad4f237bc3..a8d9414de92 100755
--- a/t/perf/p5310-pack-bitmaps.sh
+++ b/t/perf/p5310-pack-bitmaps.sh
@@ -10,10 +10,11 @@ test_perf_large_repo
 # since we want to be able to compare bitmap-aware
 # git versus non-bitmap git
 #
-# We intentionally use the deprecated pack.writebitmaps
+# We intentionally use the deprecated pack.writeBitmaps
 # config so that we can test against older versions of git.
 test_expect_success 'setup bitmap config' '
-	git config pack.writebitmaps true
+	git config pack.writeBitmaps true &&
+	git config pack.writeReverseIndex true
 '
 
 # we need to create the tag up front such that it is covered by the repack and
@@ -28,27 +29,42 @@ test_perf 'repack to disk' '
 
 test_full_bitmap
 
-test_expect_success 'create partial bitmap state' '
-	# pick a commit to represent the repo tip in the past
-	cutoff=$(git rev-list HEAD~100 -1) &&
-	orig_tip=$(git rev-parse HEAD) &&
-
-	# now kill off all of the refs and pretend we had
-	# just the one tip
-	rm -rf .git/logs .git/refs/* .git/packed-refs &&
-	git update-ref HEAD $cutoff &&
-
-	# and then repack, which will leave us with a nice
-	# big bitmap pack of the "old" history, and all of
-	# the new history will be loose, as if it had been pushed
-	# up incrementally and exploded via unpack-objects
-	git repack -Ad &&
-
-	# and now restore our original tip, as if the pushes
-	# had happened
-	git update-ref HEAD $orig_tip
+test_perf 'use lookup table' '
+    git config pack.writeBitmapLookupTable true
 '
 
-test_partial_bitmap
+test_perf 'repack to disk (lookup table)' '
+    git repack -adb
+'
+
+test_full_bitmap
+
+for i in false true
+do
+	$i && lookup=" (lookup table)"
+	test_expect_success "create partial bitmap state$lookup" '
+		git config pack.writeBitmapLookupTable '"$i"' &&
+		# pick a commit to represent the repo tip in the past
+		cutoff=$(git rev-list HEAD~100 -1) &&
+		orig_tip=$(git rev-parse HEAD) &&
+
+		# now kill off all of the refs and pretend we had
+		# just the one tip
+		rm -rf .git/logs .git/refs/* .git/packed-refs &&
+		git update-ref HEAD $cutoff &&
+
+		# and then repack, which will leave us with a nice
+		# big bitmap pack of the "old" history, and all of
+		# the new history will be loose, as if it had been pushed
+		# up incrementally and exploded via unpack-objects
+		git repack -Ad &&
+
+		# and now restore our original tip, as if the pushes
+		# had happened
+		git update-ref HEAD $orig_tip
+	'
+
+	test_partial_bitmap
+done
 
 test_done
diff --git a/t/perf/p5326-multi-pack-bitmaps.sh b/t/perf/p5326-multi-pack-bitmaps.sh
index f2fa228f16a..9001eb4533e 100755
--- a/t/perf/p5326-multi-pack-bitmaps.sh
+++ b/t/perf/p5326-multi-pack-bitmaps.sh
@@ -26,27 +26,42 @@ test_expect_success 'drop pack bitmap' '
 
 test_full_bitmap
 
-test_expect_success 'create partial bitmap state' '
-	# pick a commit to represent the repo tip in the past
-	cutoff=$(git rev-list HEAD~100 -1) &&
-	orig_tip=$(git rev-parse HEAD) &&
-
-	# now pretend we have just one tip
-	rm -rf .git/logs .git/refs/* .git/packed-refs &&
-	git update-ref HEAD $cutoff &&
-
-	# and then repack, which will leave us with a nice
-	# big bitmap pack of the "old" history, and all of
-	# the new history will be loose, as if it had been pushed
-	# up incrementally and exploded via unpack-objects
-	git repack -Ad &&
-	git multi-pack-index write --bitmap &&
-
-	# and now restore our original tip, as if the pushes
-	# had happened
-	git update-ref HEAD $orig_tip
+test_expect_success 'use lookup table' '
+	git config pack.writeBitmapLookupTable true
 '
 
-test_partial_bitmap
+test_perf 'setup multi-pack-index (lookup table)' '
+	git multi-pack-index write --bitmap
+'
+
+test_full_bitmap
+
+for i in false true
+do
+	$i && lookup=" (lookup table)"
+	test_expect_success "create partial bitmap state$lookup" '
+		git config pack.writeBitmapLookupTable '"$i"' &&
+		# pick a commit to represent the repo tip in the past
+		cutoff=$(git rev-list HEAD~100 -1) &&
+		orig_tip=$(git rev-parse HEAD) &&
+
+		# now pretend we have just one tip
+		rm -rf .git/logs .git/refs/* .git/packed-refs &&
+		git update-ref HEAD $cutoff &&
+
+		# and then repack, which will leave us with a nice
+		# big bitmap pack of the "old" history, and all of
+		# the new history will be loose, as if it had been pushed
+		# up incrementally and exploded via unpack-objects
+		git repack -Ad &&
+		git multi-pack-index write --bitmap &&
+
+		# and now restore our original tip, as if the pushes
+		# had happened
+		git update-ref HEAD $orig_tip
+	'
+
+	test_partial_bitmap
+done
 
 test_done
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-20 12:33 ` [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension Abhradeep Chakraborty via GitGitGadget
@ 2022-06-20 16:56   ` Derrick Stolee
  2022-06-20 17:09     ` Taylor Blau
  2022-06-21  8:23     ` Abhradeep Chakraborty
  2022-06-20 17:21   ` Taylor Blau
  2022-06-20 20:21   ` Derrick Stolee
  2 siblings, 2 replies; 63+ messages in thread
From: Derrick Stolee @ 2022-06-20 16:56 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget, git
  Cc: Taylor Blau, Kaartic Sivaram, Abhradeep Chakraborty

On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> 
> When reading bitmap file, git loads each and every bitmap one by one
> even if all the bitmaps are not required. A "bitmap lookup table"
> extension to the bitmap format can reduce the overhead of loading
> bitmaps which stores a list of bitmapped commit oids, along with their
> offset and xor offset. This way git can load only the neccesary bitmaps
> without loading the previous bitmaps.
> 
> Add some information for the new "bitmap lookup table" extension in the
> bitmap-format documentation.


> @@ -67,6 +67,14 @@ MIDXs, both the bit-cache and rev-cache extensions are required.
>  			pack/MIDX. The format and meaning of the name-hash is
>  			described below.
>  
> +			** {empty}
> +			BITMAP_OPT_LOOKUP_TABLE (0xf) : :::
> +			If present, the end of the bitmap file contains a table
> +			containing a list of `N` object ids, a list of pairs of
> +			offset and xor offset of respective objects, and 4-byte
> +			integer denoting the flags (currently none). The format
> +			and meaning of the table is described below.
> +

Here, you are adding a new flag that indicates that the end of the file
contains this extra extension. This works because the size of the
extension is predictable. As long as any future extensions are also of
a predictable size, then we can continue adding them via flags in this
way.

This is better than updating the full file format to do something like
like use the chunk format API, especially because this format is shared
across other tools (JGit being mentioned frequently).

It might be worth mentioning in your commit message what happens when an
older version of Git (or JGit) notices this flag. Does it refuse to
operate on the .bitmap file? Does it give a warning or die? It would be
nice if this extension could be ignored (it seems like adding the extra
data at the end does not stop the bitmap data from being understood).

> +
> +Commit lookup table
> +-------------------
> +
> +If the BITMAP_OPT_LOOKUP_TABLE flag is set, the end of the `.bitmap`
> +contains a lookup table specifying the positions of commits which have a
> +bitmap.

Perhaps it would be better to say "the last N * (HASH_LEN + 8) + 4 bytes
preceding the trailing hash" or something? This gives us a concrete way
to compute the start of the table, while also being clear that the table
is included in the trailing hash.

> +For a `.bitmap` containing `nr_entries` reachability bitmaps, the format
> +is as follows:
> +
> +	- `nr_entries` object names.

Could you expand that these objects are commit OIDs, one for each bitmap
in the file. Are they sorted in lexicographical order for binary search,
or are we expecting to read the entire table into a hashtable in-memory?

> +	- `nr_entries` pairs of 4-byte integers, each in network order.
> +	  The first holds the offset from which that commit's bitmap can
> +	  be read. The second number holds the position of the commit
> +	  whose bitmap the current bitmap is xor'd with in lexicographic
> +	  order, or 0xffffffff if the current commit is not xor'd with
> +	  anything.

Interesting to give the xor chains directions here. You say "position"
here for the second commit: do you mean within the list of object names
as opposed to the offset? That would make the most sense so we can trace
the full list of XORs we need to make all at once.

Are .bitmap files already constrained to 4GB, so these 32-bit offsets
make sense? Using 64-bit offsets would be a small cost here, I think,
without needing to do any fancy "overflow" tables that could introduce
a variable-length extension.

> +	- One 4-byte network byte order integer specifying
> +	  table-specific flags. None exist currently, so this is always
> +	  "0".

I'm guessing this is at the end of the extension because a future flag
could modify the length of the extension, so we need the flags to be
in a predictable location. Could we make that clear somewhere?

How does Git react to seeing flags here that it does not recognize?
It seems that Git should ignore the lookup table but continue using the
rest of the .bitmap file as it did before, yes?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-20 16:56   ` Derrick Stolee
@ 2022-06-20 17:09     ` Taylor Blau
  2022-06-21  8:31       ` Abhradeep Chakraborty
  2022-06-21  8:23     ` Abhradeep Chakraborty
  1 sibling, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2022-06-20 17:09 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhradeep Chakraborty via GitGitGadget, git, Kaartic Sivaram,
	Abhradeep Chakraborty

On Mon, Jun 20, 2022 at 12:56:27PM -0400, Derrick Stolee wrote:
> On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> > From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> >
> > When reading bitmap file, git loads each and every bitmap one by one
> > even if all the bitmaps are not required. A "bitmap lookup table"
> > extension to the bitmap format can reduce the overhead of loading
> > bitmaps which stores a list of bitmapped commit oids, along with their
> > offset and xor offset. This way git can load only the neccesary bitmaps
> > without loading the previous bitmaps.
> >
> > Add some information for the new "bitmap lookup table" extension in the
> > bitmap-format documentation.
>
>
> > @@ -67,6 +67,14 @@ MIDXs, both the bit-cache and rev-cache extensions are required.
> >  			pack/MIDX. The format and meaning of the name-hash is
> >  			described below.
> >
> > +			** {empty}
> > +			BITMAP_OPT_LOOKUP_TABLE (0xf) : :::
> > +			If present, the end of the bitmap file contains a table
> > +			containing a list of `N` object ids, a list of pairs of
> > +			offset and xor offset of respective objects, and 4-byte
> > +			integer denoting the flags (currently none). The format
> > +			and meaning of the table is described below.
> > +
>
> Here, you are adding a new flag that indicates that the end of the file
> contains this extra extension. This works because the size of the
> extension is predictable. As long as any future extensions are also of
> a predictable size, then we can continue adding them via flags in this
> way.

Right; any extensions that are added to the existing .bitmap format must
have a size that is predictable in order for readers to locate the next
extension, if any.

> This is better than updating the full file format to do something like
> like use the chunk format API, especially because this format is shared
> across other tools (JGit being mentioned frequently).

Agreed. Abhradeep and I discussed whether or not it was worth exploring
a new .bitmap format, and the consensus we reached was that it may be
required in the future (if we explored a compression scheme other than
EWAH or made some other backwards-incompatible change), but as of yet it
isn't necessary. So we avoided it to eliminate unnecessary churn,
especially of on-disk formats.

> It might be worth mentioning in your commit message what happens when an
> older version of Git (or JGit) notices this flag. Does it refuse to
> operate on the .bitmap file? Does it give a warning or die? It would be
> nice if this extension could be ignored (it seems like adding the extra
> data at the end does not stop the bitmap data from being understood).

I agree. The bitmap reader does not warn or die when it sees
unrecognized extensions, that way new extensions can be added without
rendering all previously-written bitmaps useless. But in order to
understand an extension on bit N, the reader must also understand
extensions N-1, N-2, and so on (in order to locate the end of
extension N).

> > +	- `nr_entries` pairs of 4-byte integers, each in network order.
> > +	  The first holds the offset from which that commit's bitmap can
> > +	  be read. The second number holds the position of the commit
> > +	  whose bitmap the current bitmap is xor'd with in lexicographic
> > +	  order, or 0xffffffff if the current commit is not xor'd with
> > +	  anything.
>
> Interesting to give the xor chains directions here. You say "position"
> here for the second commit: do you mean within the list of object names
> as opposed to the offset? That would make the most sense so we can trace
> the full list of XORs we need to make all at once.
>
> Are .bitmap files already constrained to 4GB, so these 32-bit offsets
> make sense? Using 64-bit offsets would be a small cost here, I think,
> without needing to do any fancy "overflow" tables that could introduce
> a variable-length extension.

Yeah, we should support >4GB bitmaps here. An overflow table could work,
but I agree with Stolee that in practice it won't matter. Most .bitmap
files that I've looked at in the wild have around ~500 entries at most,
and are usually small. So the cost of widening this section isn't a big
deal.

But note that the entry count is only one component of the bitmap size:
the individual entry lengths obviously matter too. And in repositories
whose bitmaps exceed 500 entries, the entries themselves are often
several million bits long (before compression) already. So it is
certainly possible to exceed 4GB without having an astronomical entry
count.

So doubling the width of this extension might add an extra 250 KiB or
so, which is negligible.

I would much rather see us do that in cases where it makes sense (small
number of entries, minimal cost to wider records, etc.) than adding
unnecessary complexity via an extra lookup table for >4GB offsets.

> > +	- One 4-byte network byte order integer specifying
> > +	  table-specific flags. None exist currently, so this is always
> > +	  "0".
>
> I'm guessing this is at the end of the extension because a future flag
> could modify the length of the extension, so we need the flags to be
> in a predictable location. Could we make that clear somewhere?

I can't remember what I had on my mind when I wrote this ;-).

Abhradeep -- do you have any thoughts about what this might be used for?
I'll try to remember it myself, but I imagine that we could just as
easily remove this altogether and avoid the confusion.

> How does Git react to seeing flags here that it does not recognize?
> It seems that Git should ignore the lookup table but continue using the
> rest of the .bitmap file as it did before, yes?

(See above).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-20 12:33 ` [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension Abhradeep Chakraborty via GitGitGadget
  2022-06-20 16:56   ` Derrick Stolee
@ 2022-06-20 17:21   ` Taylor Blau
  2022-06-21  9:22     ` Abhradeep Chakraborty
  2022-06-20 20:21   ` Derrick Stolee
  2 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2022-06-20 17:21 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget
  Cc: git, Kaartic Sivaram, Abhradeep Chakraborty

On Mon, Jun 20, 2022 at 12:33:09PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
>
> When reading bitmap file, git loads each and every bitmap one by one
> even if all the bitmaps are not required. A "bitmap lookup table"
> extension to the bitmap format can reduce the overhead of loading
> bitmaps which stores a list of bitmapped commit oids, along with their
> offset and xor offset. This way git can load only the neccesary bitmaps
> without loading the previous bitmaps.

Well put. It might help to have a concrete example of where we expect
this to help and not help. I suspect that some of this will show up in
your work updating the perf suite to use this new table, but I imagine
that we'll find something like:

    In cases where the result can be read or computed without
    significant additional traversal (e.g., all commits of interest
    already have bitmaps computed), we can save some time loading and
    parsing a majority of the bitmap file that we will never read.

    But in cases where the bitmaps are out-of-date, or there is
    significant traversal required to go from the reference tips to
    what's contained in the .bitmap file, this table provides minimal
    benefit (or something).

Of course, you should verify that that is actually true before we insert
it into the commit message as such ;-). But that sort of information may
help readers understand what the purpose of this change is towards the
beinning of the series.

> Add some information for the new "bitmap lookup table" extension in the
> bitmap-format documentation.
>
> Co-Authored-by: Taylor Blau <ttaylorr@github.com>
> Mentored-by: Taylor Blau <ttaylorr@github.com>

Here and elsewhere: I typically use my <me@ttaylorr.com> address when
contributing to Git. So any trailers that mention my email or commits
that you send on my behalf should use that address, too.

> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> ---
>  Documentation/technical/bitmap-format.txt | 31 +++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
> index 04b3ec21785..34e98787b78 100644
> --- a/Documentation/technical/bitmap-format.txt
> +++ b/Documentation/technical/bitmap-format.txt
> @@ -67,6 +67,14 @@ MIDXs, both the bit-cache and rev-cache extensions are required.
>  			pack/MIDX. The format and meaning of the name-hash is
>  			described below.
>
> +			** {empty}
> +			BITMAP_OPT_LOOKUP_TABLE (0xf) : :::

It the space between "(0xf)" and the first ":" intentional? Similarly,
should there be two or three colons at the end (either "::" or ":::")?

> +			If present, the end of the bitmap file contains a table
> +			containing a list of `N` object ids, a list of pairs of
> +			offset and xor offset of respective objects, and 4-byte
> +			integer denoting the flags (currently none). The format
> +			and meaning of the table is described below.
> +

I remember we had a brief off-list discussion about whether we should
store the full object IDs in the offset table, or whether we could store
their pack- or index-relative ordering. Is there a reason to prefer one
or the other?

I don't think we need to explain the choice fully in the documentation
in this patch, but it may be worth thinking about separately
nonetheless. We can store either order and convert it to an object ID in
constant time.

To figure out which is best, I would recommend trying a few different
choices here and seeing how they do or don't impact your performance
testing.

>  		4-byte entry count (network byte order)
>
>  			The total count of entries (bitmapped commits) in this bitmap index.
> @@ -205,3 +213,26 @@ Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag.
>  If implementations want to choose a different hashing scheme, they are
>  free to do so, but MUST allocate a new header flag (because comparing
>  hashes made under two different schemes would be pointless).
> +
> +Commit lookup table
> +-------------------
> +
> +If the BITMAP_OPT_LOOKUP_TABLE flag is set, the end of the `.bitmap`
> +contains a lookup table specifying the positions of commits which have a
> +bitmap.
> +
> +For a `.bitmap` containing `nr_entries` reachability bitmaps, the format
> +is as follows:
> +
> +	- `nr_entries` object names.
> +
> +	- `nr_entries` pairs of 4-byte integers, each in network order.
> +	  The first holds the offset from which that commit's bitmap can
> +	  be read. The second number holds the position of the commit
> +	  whose bitmap the current bitmap is xor'd with in lexicographic
> +	  order, or 0xffffffff if the current commit is not xor'd with
> +	  anything.

A couple of small thoughts here. I wonder if we'd get better locality if
we made each record look something like:

    (object_id, offset, xor_pos)

Where object_id is either 20- or 4-bytes long (depending if we store the
full object ID, or some 4-byte identifier that allows us to discover
it), offset is 8 bytes long, and xor_pos is 4-bytes (since in practice
we don't support packs or MIDXs which have more than 2^32-1 objects).

In the event that this table doesn't fit into a single cache line, I
think we'll get better performance out of reading it by not forcing the
cache to evict itself whenever we need to refer back to the object_id.

> +	- One 4-byte network byte order integer specifying
> +	  table-specific flags. None exist currently, so this is always
> +	  "0".

I mentioned in my reply to Stolee earlier, but I think that we should
either (a) try to remember what this is for and document it, or (b)
remove it.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-20 12:33 ` [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension Abhradeep Chakraborty via GitGitGadget
  2022-06-20 16:56   ` Derrick Stolee
  2022-06-20 17:21   ` Taylor Blau
@ 2022-06-20 20:21   ` Derrick Stolee
  2022-06-21 10:08     ` Abhradeep Chakraborty
  2 siblings, 1 reply; 63+ messages in thread
From: Derrick Stolee @ 2022-06-20 20:21 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget, git
  Cc: Taylor Blau, Kaartic Sivaram, Abhradeep Chakraborty

On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

> +			** {empty}
> +			BITMAP_OPT_LOOKUP_TABLE (0xf) : :::

I think you mean 0x10 (b_1_0000) instead of 0xf (b_1111).

I noticed when looking at the constant in patch 2.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 2/6] pack-bitmap: prepare to read lookup table extension
  2022-06-20 12:33 ` [PATCH 2/6] pack-bitmap: prepare to read " Abhradeep Chakraborty via GitGitGadget
@ 2022-06-20 20:49   ` Derrick Stolee
  2022-06-21 10:28     ` Abhradeep Chakraborty
  2022-06-20 22:06   ` Taylor Blau
  1 sibling, 1 reply; 63+ messages in thread
From: Derrick Stolee @ 2022-06-20 20:49 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget, git
  Cc: Taylor Blau, Kaartic Sivaram, Abhradeep Chakraborty

On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> 
> Bitmap lookup table extension can let git to parse only the necessary
> bitmaps without loading the previous bitmaps one by one.

Here is an attempt to reword this a bit:

  The bitmap lookup table extension was documented by an earlier
  change, but Git does not yet know how to parse that information.
  The extension allows parsing a smaller portion of the bitmap
  file in order to find bitmaps for specific commits.

 
> Teach git to read and use the bitmap lookup table extension.

Normally, I don't mind doing the read portion after the write
portion, but it would be nice to have them in the opposite
order so we can test writing the extension _and Git ignoring
the extension_ before implementing the parsing. As it stands,
most of the code in this patch is untested until patch 5.

General outline attempt:

1. Document the format.
2. Write the extension if the flag is given.
3. Add pack.writeBitmapLookupTable and add tests that write
   the lookup table (and do other bitmap reads on that data).
4. Read the lookup table. The tests from step 3 already cover
   acting upon the lookup table. (Perhaps we add a mode here
   that disables GIT_READ_COMMIT_TABLE since that is not used
   anywhere else.)
5. Performance tests.

> +		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
> +		    git_env_bool("GIT_READ_COMMIT_TABLE", 1)) {

This environment variable does not appear to be used or
documented anywhere. Do we really want to use it as a way
to disable reading the lookup table in general? Or would it be
better to have a GIT_TEST_* variable for disabling the read
during testing?

> +			uint32_t entry_count = ntohl(header->entry_count);
> +			uint32_t table_size =
> +				(entry_count * the_hash_algo->rawsz) /* oids */ +
> +				(entry_count * sizeof(uint32_t)) /* offsets */ +
> +				(entry_count * sizeof(uint32_t)) /* xor offsets */ +
> +				(sizeof(uint32_t)) /* flags */;

Here, uint32_T is probably fine, but maybe we should just use
size_t instead? Should we use st_mult() and st_add() everywhere?

Note: you're using the_hash_algo->rawsz here, which makes sense
because the bitmap format doesn't specify which hash algorithm is
used. Just making this note to say that we should include the hash
algorithm as a value in the bitmap format when we increment the
format version (in the future).

> +			if (table_size > index_end - index->map - header_size)
> +				return error("corrupted bitmap index file (too short to fit commit table)");
> +
> +			index->table_lookup = (void *)(index_end - table_size);
> +			index->table_offsets = index->table_lookup + the_hash_algo->rawsz * entry_count;

st_mult(), st_add()? Or, should we assume safety now?

> +			index_end -= table_size;
> +		}

> -	if (load_bitmap_entries_v1(bitmap_git) < 0)
> +	if (!bitmap_git->table_lookup && load_bitmap_entries_v1(bitmap_git) < 0)
>  		goto failed;

Ok, don't load these entries pre-emptively if we have the lookup table.

> +static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						      struct commit *commit,
> +						      uint32_t *pos_hint);

I see that we have a two-method recursion loop. Please move this
declaration to immediately before lazy_bitmap_for_commit() so it
is declared as late as possible.

> +static inline const unsigned char *bitmap_oid_pos(struct bitmap_index *bitmap_git,
> +						  uint32_t pos)
> +{
> +	return bitmap_git->table_lookup + (pos * the_hash_algo->rawsz);
> +}

I would call this "bitmap_hash_pos()" because we are getting a raw
hash and not a 'struct object_id'. Do you want a helper that fills
a 'struct object_id', perhaps passed-by-reference?

> +static inline const void *bitmap_offset_pos(struct bitmap_index *bitmap_git,
> +					    uint32_t pos)

> +static inline const void *xor_position_pos(struct bitmap_index *bitmap_git,
> +					   uint32_t pos)

These two helpers should probably return a size_t and uint32_t
instead of a pointer. Let these do get_be[32|64]() on the computed
pointer.

> +static int bitmap_table_lookup(struct bitmap_index *bitmap_git,
> +			       struct object_id *oid,
> +			       uint32_t *commit_pos)
> +{
> +	unsigned char *found = bsearch(oid->hash, bitmap_git->table_lookup,
> +				       bitmap_git->entry_count,
> +				       the_hash_algo->rawsz, bitmap_lookup_cmp);
> +	if (found)
> +		*commit_pos = (found - bitmap_git->table_lookup) / the_hash_algo->rawsz;
> +	return !!found;

Ok, we are running binary search and converting the pointer into a
position.

Frequently, these kind of searches return an int, but use a negative
value to indicate that the value was not found. Using an int in this
way would restrict us to 2^31 bitmaps instead of 2^32, so maybe it
is not worth matching that practice.

> +static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						    struct object_id *oid,
> +						    uint32_t commit_pos)
> +{
> +	uint32_t xor_pos;
> +	off_t bitmap_ofs;
> +
> +	int flags;
> +	struct ewah_bitmap *bitmap;
> +	struct stored_bitmap *xor_bitmap;
> +
> +	bitmap_ofs = get_be32(bitmap_offset_pos(bitmap_git, commit_pos));
> +	xor_pos = get_be32(xor_position_pos(bitmap_git, commit_pos));

These lines become simpler with a change in the helper methods'
prototypes, as I recommended higher up.

> +	/*
> +	 * Lazily load the xor'd bitmap if required (and we haven't done so
> +	 * already). Make sure to pass the xor'd bitmap's position along as a
> +	 * hint to avoid an unnecessary binary search in
> +	 * stored_bitmap_for_commit().
> +	 */
> +	if (xor_pos == 0xffffffff) {
> +		xor_bitmap = NULL;
> +	} else {
> +		struct commit *xor_commit;
> +		struct object_id xor_oid;
> +
> +		oidread(&xor_oid, bitmap_oid_pos(bitmap_git, xor_pos));
> +
> +		xor_commit = lookup_commit(the_repository, &xor_oid);
> +		if (!xor_commit)
> +			return NULL;
> +
> +		xor_bitmap = stored_bitmap_for_commit(bitmap_git, xor_commit,
> +						      &xor_pos);
> +	}

This is using an interesting type of tail-recursion. We might be
better off using a loop with a stack: push to the stack the commit
positions of the XOR bitmaps. At the very bottom, we get a bitmap
without an XOR base. Then, pop off the stack, modifying the bitmap
with XOR operations as we go. (Perhaps we also store these bitmaps
in-memory along the way?) Finally, we have the necessary bitmap.

This iterative approach avoids possible stack exhaustion if there
are long XOR chains in the file.

> +
> +	/*
> +	 * Don't bother reading the commit's index position or its xor
> +	 * offset:
> +	 *
> +	 *   - The commit's index position is irrelevant to us, since
> +	 *     load_bitmap_entries_v1 only uses it to learn the object
> +	 *     id which is used to compute the hashmap's key. We already
> +	 *     have an object id, so no need to look it up again.
> +	 *
> +	 *   - The xor_offset is unusable for us, since it specifies how
> +	 *     many entries previous to ours we should look at. This
> +	 *     makes sense when reading the bitmaps sequentially (as in
> +	 *     load_bitmap_entries_v1()), since we can keep track of
> +	 *     each bitmap as we read them.
> +	 *
> +	 *     But it can't work for us, since the bitmap's don't have a
> +	 *     fixed size. So we learn the position of the xor'd bitmap
> +	 *     from the commit table (and resolve it to a bitmap in the
> +	 *     above if-statement).
> +	 *
> +	 * Instead, we can skip ahead and immediately read the flags and
> +	 * ewah bitmap.
> +	 */
> +	bitmap_git->map_pos = bitmap_ofs + sizeof(uint32_t) + sizeof(uint8_t);
> +	flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
> +	bitmap = read_bitmap_1(bitmap_git);
> +	if (!bitmap)
> +		return NULL;
> +
> +	return store_bitmap(bitmap_git, bitmap, oid, xor_bitmap, flags);

Looks like we'd want to call store_bitmap() while popping the stack
in the loop I recommended above.

> +}
> +
> +static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						      struct commit *commit,
> +						      uint32_t *pos_hint)
>  {
>  	khiter_t hash_pos = kh_get_oid_map(bitmap_git->bitmaps,
>  					   commit->object.oid);
> -	if (hash_pos >= kh_end(bitmap_git->bitmaps))
> +	if (hash_pos >= kh_end(bitmap_git->bitmaps)) {
> +		uint32_t commit_pos;
> +		if (!bitmap_git->table_lookup)
> +			return NULL;
> +
> +		/* NEEDSWORK: cache misses aren't recorded. */
> +		if (pos_hint)
> +			commit_pos = *pos_hint;
> +		else if (!bitmap_table_lookup(bitmap_git,
> +					      &commit->object.oid,
> +					      &commit_pos))
> +			return NULL;
> +		return lazy_bitmap_for_commit(bitmap_git, &commit->object.oid,
> +					      commit_pos);

The extra bonus of going incremental is that we don't have recursion
across two methods, which I always find difficult to reason about.

> +	}
> +	return kh_value(bitmap_git->bitmaps, hash_pos);
> +}

> @@ -26,6 +26,7 @@ struct bitmap_disk_header {
>  enum pack_bitmap_opts {
>  	BITMAP_OPT_FULL_DAG = 1,
>  	BITMAP_OPT_HASH_CACHE = 4,
> +	BITMAP_OPT_LOOKUP_TABLE = 16,

Perhaps it is time to use hexadecimal representation here to match the
file format document?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 2/6] pack-bitmap: prepare to read lookup table extension
  2022-06-20 12:33 ` [PATCH 2/6] pack-bitmap: prepare to read " Abhradeep Chakraborty via GitGitGadget
  2022-06-20 20:49   ` Derrick Stolee
@ 2022-06-20 22:06   ` Taylor Blau
  2022-06-21 11:52     ` Abhradeep Chakraborty
  1 sibling, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2022-06-20 22:06 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget
  Cc: git, Kaartic Sivaram, Abhradeep Chakraborty

On Mon, Jun 20, 2022 at 12:33:10PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
>
> Bitmap lookup table extension can let git to parse only the necessary
> bitmaps without loading the previous bitmaps one by one.
>
> Teach git to read and use the bitmap lookup table extension.
>
> Co-Authored-by: Taylor Blau <ttaylorr@github.com>
> Mentored-by: Taylor Blau <ttaylorr@github.com>
> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> ---
>  pack-bitmap.c | 172 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  pack-bitmap.h |   1 +
>  2 files changed, 166 insertions(+), 7 deletions(-)
>
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 36134222d7a..d5e5973a79f 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -15,6 +15,7 @@
>  #include "list-objects-filter-options.h"
>  #include "midx.h"
>  #include "config.h"
> +#include "hash-lookup.h"
>
>  /*
>   * An entry on the bitmap index, representing the bitmap for a given
> @@ -82,6 +83,13 @@ struct bitmap_index {
>  	/* The checksum of the packfile or MIDX; points into map. */
>  	const unsigned char *checksum;
>
> +	/*
> +	 * If not NULL, these point into the various commit table sections
> +	 * (within map).
> +	 */
> +	unsigned char *table_lookup;
> +	unsigned char *table_offsets;
> +

If table_offsets ends up being a list of just offsets, we could assign
this to the appropriate type, e.g., 'uint64_t *'. We would want to
avoid using a type whose width is platform dependent, like off_t.

But if you end up taking my suggestion from a previous response (of
making each entry in the offset table a triple of commit, offset, and
xor position), make sure to _not_ get tempted to define a struct and
assign table_lookup to be a pointer of that structure type.

That's because even though the struct *should* be packed as you expect,
the packing is mostly up to the compiler, so you can't guarantee its
members won't have padding between them or at the end of the struct for
alignment purposes.

>  	/*
>  	 * Extended index.
>  	 *
> @@ -185,6 +193,24 @@ static int load_bitmap_header(struct bitmap_index *index)
>  			index->hashes = (void *)(index_end - cache_size);
>  			index_end -= cache_size;
>  		}
> +
> +		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
> +		    git_env_bool("GIT_READ_COMMIT_TABLE", 1)) {

What is the purpose of the GIT_READ_COMMIT_TABLE environment variable? I
assume that it's to make it easier to run tests (especially performance
ones) with and without access to the lookup table. If so, we should
document that (lightly) in the commit message, and rename this to be
GIT_TEST_READ_COMMIT_TABLE to indicate that it shouldn't be used outside
of tests.

> +			uint32_t entry_count = ntohl(header->entry_count);
> +			uint32_t table_size =
> +				(entry_count * the_hash_algo->rawsz) /* oids */ +
> +				(entry_count * sizeof(uint32_t)) /* offsets */ +
> +				(entry_count * sizeof(uint32_t)) /* xor offsets */ +
> +				(sizeof(uint32_t)) /* flags */;

entry_count is definitely a 4-byte integer, so uint32_t is the right
type. But I think table_size should be a size_t, and computations on it
should be more strictly checked. Perhaps something like;

    size_t table_size = sizeof(uint32_t); /* flags */
    table_size = st_add(table_size, st_mult(entry_count, the_hash_algo->rawsz)); /* oids */
    table_size = st_add(table_size, st_mult(entry_count, sizeof(uint32_t))); /* offsets */
    table_size = st_add(table_size, st_mult(entry_count, sizeof(uint32_t))); /* xor offsets */

or even:

    size_t table_size = sizeof(uint32_t); /* flags */
    table_size = st_add(table_size,
                        st_mult(entry_count,
                                the_hash_algo->rawsz + /* oids */
                                sizeof(uint32_t) + /* offsets*/
                                sizeof(uint32_t) /* xor offsets */
                               ));

> +			if (table_size > index_end - index->map - header_size)
> +				return error("corrupted bitmap index file (too short to fit commit table)");
> +
> +			index->table_lookup = (void *)(index_end - table_size);
> +			index->table_offsets = index->table_lookup + the_hash_algo->rawsz * entry_count;
> +
> +			index_end -= table_size;
> +		}

Looks good.

> @@ -470,7 +496,7 @@ static int load_bitmap(struct bitmap_index *bitmap_git)
>  		!(bitmap_git->tags = read_bitmap_1(bitmap_git)))
>  		goto failed;
>
> -	if (load_bitmap_entries_v1(bitmap_git) < 0)
> +	if (!bitmap_git->table_lookup && load_bitmap_entries_v1(bitmap_git) < 0)
>  		goto failed;

No need to load each of the bitmaps individually via
load_bitmap_entries_v1() if we have a lookup table. That function
doesn't do any other initialization that we depend on, so it's OK to
just avoid calling it altogether.

>  	return 0;
> @@ -557,14 +583,145 @@ struct include_data {
>  	struct bitmap *seen;
>  };
>
> -struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
> -				      struct commit *commit)
> +static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						      struct commit *commit,
> +						      uint32_t *pos_hint);
> +
> +static inline const unsigned char *bitmap_oid_pos(struct bitmap_index *bitmap_git,
> +						  uint32_t pos)
> +{
> +	return bitmap_git->table_lookup + (pos * the_hash_algo->rawsz);
> +}
> +
> +static inline const void *bitmap_offset_pos(struct bitmap_index *bitmap_git,
> +					    uint32_t pos)
> +{
> +	return bitmap_git->table_offsets + (pos * 2 * sizeof(uint32_t));
> +}
> +
> +static inline const void *xor_position_pos(struct bitmap_index *bitmap_git,
> +					   uint32_t pos)
> +{
> +	return (unsigned char*) bitmap_offset_pos(bitmap_git, pos) + sizeof(uint32_t);
> +}
> +
> +static int bitmap_lookup_cmp(const void *_va, const void *_vb)
> +{
> +	return hashcmp(_va, _vb);
> +}

All makes sense. Some light documentation might help explain what this
comparator function is used for (the bsearch() call below in
bitmap_table_lookup()), although I suspect that this function will get
slightly more complicated if you pack the table contents as I suggest,
in which case more documentation will definitely help.

> +
> +static int bitmap_table_lookup(struct bitmap_index *bitmap_git,
> +			       struct object_id *oid,
> +			       uint32_t *commit_pos)
> +{
> +	unsigned char *found = bsearch(oid->hash, bitmap_git->table_lookup,
> +				       bitmap_git->entry_count,
> +				       the_hash_algo->rawsz, bitmap_lookup_cmp);
> +	if (found)
> +		*commit_pos = (found - bitmap_git->table_lookup) / the_hash_algo->rawsz;

If you end up chaning the type of bitmap_git->table_lookup, make sure
that you scale the result of the pointer arithmetic accordingly, or cast
down to an 'unsigned char *' before you do any math.

> +	return !!found;
> +}
> +
> +static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						    struct object_id *oid,
> +						    uint32_t commit_pos)
> +{
> +	uint32_t xor_pos;
> +	off_t bitmap_ofs;
> +
> +	int flags;
> +	struct ewah_bitmap *bitmap;
> +	struct stored_bitmap *xor_bitmap;
> +
> +	bitmap_ofs = get_be32(bitmap_offset_pos(bitmap_git, commit_pos));
> +	xor_pos = get_be32(xor_position_pos(bitmap_git, commit_pos));
> +
> +	/*
> +	 * Lazily load the xor'd bitmap if required (and we haven't done so
> +	 * already). Make sure to pass the xor'd bitmap's position along as a
> +	 * hint to avoid an unnecessary binary search in
> +	 * stored_bitmap_for_commit().
> +	 */
> +	if (xor_pos == 0xffffffff) {
> +		xor_bitmap = NULL;
> +	} else {
> +		struct commit *xor_commit;
> +		struct object_id xor_oid;
> +
> +		oidread(&xor_oid, bitmap_oid_pos(bitmap_git, xor_pos));

Interesting; this is a point that I forgot about from the original
patch. xor_pos is an index (not an offset) into the list of commits in
the table of contents in the order appear in that table. We should be
clear about (a) what that order is, and (b) that xor_pos is an index
into that order.

The rest of this function looks good to me.

> +static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						      struct commit *commit,
> +						      uint32_t *pos_hint)
>  {
>  	khiter_t hash_pos = kh_get_oid_map(bitmap_git->bitmaps,
>  					   commit->object.oid);
> -	if (hash_pos >= kh_end(bitmap_git->bitmaps))
> +	if (hash_pos >= kh_end(bitmap_git->bitmaps)) {
> +		uint32_t commit_pos;
> +		if (!bitmap_git->table_lookup)
> +			return NULL;

I was going to suggest moving this check into the caller
bitmap_for_commit() and making it a BUG() to call
stored_bitmap_for_commit() with a NULL bitmap_git->table_lookup pointer.

And I think this makes sense... if we return NULL here, then we know
that we definitely don't have a stored bitmap, since there's no table to
look it up in and we have already loaded everything else. So we
propagate that NULL to the return value of bitmap_for_commit(), and that
makes sense. Good.

> +		/* NEEDSWORK: cache misses aren't recorded. */

Yeah. The problem here is that we can't record every commit that
_doesn't_ have a bitmap every time we return NULL from one of these
queries, since there are arbitrarily many such commits that don't have
bitmaps.

We could approximate it using a Bloom filter or something, and much of
that code is already written and could be interesting to try and reuse.

But I wonder if we could get by with something simpler, though, which
would cause us to load all bitmaps from the lookup table after a fixed
number of cache misses (at which point we should force ourselves to load
everything and just read everything out of a single O(1) lookup in the
stored bitmap table).

That may or may not be a good idea, and the threshold will probably be
highly dependent on the system. So it may not even be worth it, but I
think it's an interesting area to experiemnt in and think a little more
about.

> +		if (pos_hint)
> +			commit_pos = *pos_hint;

How does this commit_pos work again? I confess I have forgetten since I
wrote some of this code a while ago... :-).

> @@ -1699,8 +1856,9 @@ void test_bitmap_walk(struct rev_info *revs)
>  	if (revs->pending.nr != 1)
>  		die("you must specify exactly one commit to test");
>
> -	fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
> -		bitmap_git->version, bitmap_git->entry_count);
> +	if (!bitmap_git->table_lookup)
> +		fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
> +			bitmap_git->version, bitmap_git->entry_count);

Should we print this regardless of whether or not there is a lookup
table? We should be able to learn the entry count either way.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 3/6] pack-bitmap-write.c: write lookup table extension
  2022-06-20 12:33 ` [PATCH 3/6] pack-bitmap-write.c: write " Abhradeep Chakraborty via GitGitGadget
@ 2022-06-20 22:16   ` Taylor Blau
  2022-06-21 12:50     ` Abhradeep Chakraborty
  0 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2022-06-20 22:16 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget
  Cc: git, Kaartic Sivaram, Abhradeep Chakraborty

On Mon, Jun 20, 2022 at 12:33:11PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
>
> Teach git to write bitmap lookup table extension. The table has the
> following information:
>
>     - `N` no of Object ids of each bitmapped commits

s/no/number, s/Object/object, s/ids/IDs, and s/commits/commit

>     - A list of offset, xor-offset pair; the i'th pair denotes the
>       offsets and xor-offsets of i'th commit in the previous list.

s/pair/pairs

>     - 4-byte integer denoting the flags
>
> Co-authored-by: Taylor Blau <ttaylorr@github.com>
> Mentored-by: Taylor Blau <ttaylorr@github.com>
> Co-mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> ---
>  pack-bitmap-write.c | 59 +++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 57 insertions(+), 2 deletions(-)
>
> diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
> index c43375bd344..9e88a64dd65 100644
> --- a/pack-bitmap-write.c
> +++ b/pack-bitmap-write.c
> @@ -650,7 +650,8 @@ static const struct object_id *oid_access(size_t pos, const void *table)
>
>  static void write_selected_commits_v1(struct hashfile *f,
>  				      struct pack_idx_entry **index,
> -				      uint32_t index_nr)
> +				      uint32_t index_nr,
> +				      off_t *offsets)
>  {
>  	int i;
>
> @@ -663,6 +664,9 @@ static void write_selected_commits_v1(struct hashfile *f,
>  		if (commit_pos < 0)
>  			BUG("trying to write commit not in index");
>
> +		if (offsets)
> +			offsets[i] = hashfile_total(f);
> +

Makes sense; we record the offset for the ith commit as however many
bytes we've already written into the hashfile up to this point, since
the subsequent byte will begin the bitmap (well, the preceding few
bytes of it, anyways) itself.

>  		hashwrite_be32(f, commit_pos);
>  		hashwrite_u8(f, stored->xor_offset);
>  		hashwrite_u8(f, stored->flags);
> @@ -671,6 +675,49 @@ static void write_selected_commits_v1(struct hashfile *f,
>  	}
>  }
>
> +static int table_cmp(const void *_va, const void *_vb)
> +{
> +	return oidcmp(&writer.selected[*(uint32_t*)_va].commit->object.oid,
> +		      &writer.selected[*(uint32_t*)_vb].commit->object.oid);

This implementation looks right to me, but perhaps we should expand it
out from the one-liner here to make it more readable. Perhaps something
like:

    static int table_cmp(const void *_va, const void *_vb)
    {
      struct commit *c1 = &writer.selected[*(uint32_t*)_va];
      struct commit *c2 = &writer.selected[*(uint32_t*)_vb];

      return oidcmp(&c1->object.oid, &c2->object.oid);
    }

which is arguably slightly more readable than the one-liner (but I don't
feel that strongly about it.)

> +static void write_lookup_table(struct hashfile *f,
> +			       off_t *offsets)
> +{
> +	uint32_t i;
> +	uint32_t flags = 0;
> +	uint32_t *table, *table_inv;
> +
> +	ALLOC_ARRAY(table, writer.selected_nr);
> +	ALLOC_ARRAY(table_inv, writer.selected_nr);
> +
> +	for (i = 0; i < writer.selected_nr; i++)
> +		table[i] = i;
> +	QSORT(table, writer.selected_nr, table_cmp);
> +	for (i = 0; i < writer.selected_nr; i++)
> +		table_inv[table[i]] = i;

Right... so table[0] will give us the index into writer.selected of the
commit with the earliest OID in lexicographic order. And table_inv goes
the other way around: table_inv[i] will tell us the lexicographic
position of the commit at writer.selected[i].

> +	for (i = 0; i < writer.selected_nr; i++) {
> +		struct bitmapped_commit *selected = &writer.selected[table[i]];
> +		struct object_id *oid = &selected->commit->object.oid;
> +
> +		hashwrite(f, oid->hash, the_hash_algo->rawsz);
> +	}
> +	for (i = 0; i < writer.selected_nr; i++) {
> +		struct bitmapped_commit *selected = &writer.selected[table[i]];
> +
> +		hashwrite_be32(f, offsets[table[i]]);
> +		hashwrite_be32(f, selected->xor_offset
> +			       ? table_inv[table[i] - selected->xor_offset]

...which we need to discover the position of the XOR'd bitmap. Though
I'm not sure if I remember why `table[i] - selected->xor_offset` is
right and not `i - selected->xor_offset`.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 4/6] builtin/pack-objects.c: learn pack.writeBitmapLookupTable
  2022-06-20 12:33 ` [PATCH 4/6] builtin/pack-objects.c: learn pack.writeBitmapLookupTable Taylor Blau via GitGitGadget
@ 2022-06-20 22:18   ` Taylor Blau
  0 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2022-06-20 22:18 UTC (permalink / raw)
  To: Taylor Blau via GitGitGadget
  Cc: git, Kaartic Sivaram, Abhradeep Chakraborty, Taylor Blau

On Mon, Jun 20, 2022 at 12:33:12PM +0000, Taylor Blau via GitGitGadget wrote:
> From: Taylor Blau <ttaylorr@github.com>
>
> Teach git to provide a way for users to enable/disable bitmap lookup
> table extension by providing a config option named 'writeBitmapLookupTable'.
>
> Signed-off-by: Taylor Blau <ttaylorr@github.com>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> ---
>  Documentation/config/pack.txt | 7 +++++++
>  builtin/pack-objects.c        | 8 ++++++++
>  2 files changed, 15 insertions(+)
>
> diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
> index ad7f73a1ead..e12008d2415 100644
> --- a/Documentation/config/pack.txt
> +++ b/Documentation/config/pack.txt
> @@ -164,6 +164,13 @@ When writing a multi-pack reachability bitmap, no new namehashes are
>  computed; instead, any namehashes stored in an existing bitmap are
>  permuted into their appropriate location when writing a new bitmap.
>
> +pack.writeBitmapLookupTable::
> +	When true, git will include a "lookup table" section in the

s/git/Git (I typically use "git" when talking about the command-line
tool, and Git when talking about the project as a proper noun).

> +	bitmap index (if one is written). This table is used to defer
> +	loading individual bitmaps as late as possible. This can be
> +	beneficial in repositories which have relatively large bitmap
> +	indexes. Defaults to false.

Is there a reason that we would want to default to "false" here? Perhaps
in the first version of two we would want this to be an opt-in (since
there is no publicly documented way to opt-out of reading the extension
once it is written).

We should make sure to enable this by default at some point in the
future.

>  pack.writeReverseIndex::

...since it's easy to forget ;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-20 16:56   ` Derrick Stolee
  2022-06-20 17:09     ` Taylor Blau
@ 2022-06-21  8:23     ` Abhradeep Chakraborty
  1 sibling, 0 replies; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-21  8:23 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Taylor Blau

Derrick Stole <derrickstolee@github.com> wrote:

> It might be worth mentioning in your commit message what happens when an
> older version of Git (or JGit) notices this flag. Does it refuse to
> operate on the .bitmap file? Does it give a warning or die? It would be
> nice if this extension could be ignored (it seems like adding the extra
> data at the end does not stop the bitmap data from being understood).

No, it doesn't refuse to operate on the .bitmap file. It just ignores the
extension. Will update the commit message.

> Perhaps it would be better to say "the last N * (HASH_LEN + 8) + 4 bytes
> preceding the trailing hash" or something? This gives us a concrete way
> to compute the start of the table, while also being clear that the table
> is included in the trailing hash.

Hmm, well said. Will update it.

> Could you expand that these objects are commit OIDs, one for each bitmap
> in the file. Are they sorted in lexicographical order for binary search,
> or are we expecting to read the entire table into a hashtable in-memory?

Yeah, of course! They are sorted in lexicographical order for binary search.


> Interesting to give the xor chains directions here. You say "position"
> here for the second commit: do you mean within the list of object names
> as opposed to the offset? That would make the most sense so we can trace
> the full list of XORs we need to make all at once.

I think I blundered here. I forgot that the xor-offset is relative to the
current bitmap. The current proposed code takes it as ABSOLUTE value and
tries to find the commit on that position (in the list of commit ids). So,
there are two faults in my code - (1) As the xor-offset have an upper limit
(which is 10 probably; not sure), any of the first 10 commits is always
selected. (2) As xor-offsets are relative to the current bitmap, it depends
On the order of the bitmaps. These bitmaps are ordered by the date of their
corresponding commit and commit ids in the lookup table are ordered
lexicographically. So, we can't use that xor-offset to find the xor'd
commit position.

Will fix it.

> Are .bitmap files already constrained to 4GB, so these 32-bit offsets
> make sense? Using 64-bit offsets would be a small cost here, I think,
> without needing to do any fancy "overflow" tables that could introduce
> a variable-length extension.

I think you're right. I should use 64-bit types here.

> I'm guessing this is at the end of the extension because a future flag
> could modify the length of the extension, so we need the flags to be
> in a predictable location. Could we make that clear somewhere?

Flags are at the end of this extension.

Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-20 17:09     ` Taylor Blau
@ 2022-06-21  8:31       ` Abhradeep Chakraborty
  2022-06-22 16:26         ` Taylor Blau
  0 siblings, 1 reply; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-21  8:31 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Derrick Stolee

Taylor Blau <me@ttaylorr.com> wrote:

> Abhradeep -- do you have any thoughts about what this might be used for?
> I'll try to remember it myself, but I imagine that we could just as
> easily remove this altogether and avoid the confusion.

Honestly, I never understood the logic behind adding this flag option.
I thought you have a reason to do that. Even I was thinking of curving
it to 1 byte. I will remove it then.

Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-20 17:21   ` Taylor Blau
@ 2022-06-21  9:22     ` Abhradeep Chakraborty
  2022-06-22 16:29       ` Taylor Blau
  0 siblings, 1 reply; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-21  9:22 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Derrick Stolee

Taylor Blau <me@ttaylorr.com> wrote:

>     In cases where the result can be read or computed without
>     significant additional traversal (e.g., all commits of interest
>     already have bitmaps computed), we can save some time loading and
>     parsing a majority of the bitmap file that we will never read.
>
>     But in cases where the bitmaps are out-of-date, or there is
>     significant traversal required to go from the reference tips to
>     what's contained in the .bitmap file, this table provides minimal
>     benefit (or something).
>
> Of course, you should verify that that is actually true before we insert
> it into the commit message as such ;-). But that sort of information may
> help readers understand what the purpose of this change is towards the
> beinning of the series.

The performance tests cover tests for command like "git rev-list --count
--objects --all", "simulated clone", "simulated fetch" etc. And I tested
it with both the Git and Linux. In both cases, the average cost of
"Without lookup table" is bigger than "with lookup table". The margin of
difference is bigger for linux. Though, I need to fix the calculation
of xor-offset (see my reply to derrick), the fix will not affect the
performance too much. So, what you're saying is true. I think I didn't
write the bitmap out-of-date test though.

> Here and elsewhere: I typically use my <me@ttaylorr.com> address when
> contributing to Git. So any trailers that mention my email or commits
> that you send on my behalf should use that address, too.

Ohh, sorry! Will fix it.

> It the space between "(0xf)" and the first ":" intentional? Similarly,
> should there be two or three colons at the end (either "::" or ":::")?

Yes, it is intentional. My previous patch (formatting the bitmap-format.txt)
uses nested description lists. ":::" means it is the level 3 description list.
The space is required else asciidoc will assume that it is level 4 description
list.

> I remember we had a brief off-list discussion about whether we should
> store the full object IDs in the offset table, or whether we could store
> their pack- or index-relative ordering. Is there a reason to prefer one
> or the other?
>
> I don't think we need to explain the choice fully in the documentation
> in this patch, but it may be worth thinking about separately
> nonetheless. We can store either order and convert it to an object ID in
> constant time.
>
> To figure out which is best, I would recommend trying a few different
> choices here and seeing how they do or don't impact your performance
> testing.

I think at that time I thought it would add extra cost of computing
the actual commit ids from those index position. So, I didn't go 
further here.

I still have a feeling that there is some way to get rid of this
list of commit ids. But at the same time, I do not want to add
extra computation to the code.

> A couple of small thoughts here. I wonder if we'd get better locality if
> we made each record look something like:
>
>     (object_id, offset, xor_pos)
>
> Where object_id is either 20- or 4-bytes long (depending if we store the
> full object ID, or some 4-byte identifier that allows us to discover
> it), offset is 8 bytes long, and xor_pos is 4-bytes (since in practice
> we don't support packs or MIDXs which have more than 2^32-1 objects).
>
> In the event that this table doesn't fit into a single cache line, I
> think we'll get better performance out of reading it by not forcing the
> cache to evict itself whenever we need to refer back to the object_id.

Ok, will look into it.

> I mentioned in my reply to Stolee earlier, but I think that we should
> either (a) try to remember what this is for and document it, or (b)
> remove it.

Let us for now remove it.

Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-20 20:21   ` Derrick Stolee
@ 2022-06-21 10:08     ` Abhradeep Chakraborty
  2022-06-22 16:30       ` Taylor Blau
  0 siblings, 1 reply; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-21 10:08 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Taylor Blau

Derrick Stolee <derrickstolee@github.com> wrote:

> I think you mean 0x10 (b_1_0000) instead of 0xf (b_1111).
>
> I noticed when looking at the constant in patch 2.

Yes, you're right. It's kind of embarrassment for me :)

If the flag was Oxf it would enable all the extensions.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 2/6] pack-bitmap: prepare to read lookup table extension
  2022-06-20 20:49   ` Derrick Stolee
@ 2022-06-21 10:28     ` Abhradeep Chakraborty
  0 siblings, 0 replies; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-21 10:28 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Taylor Blau

Derrick Stolee <derrickstolee@github.com> wrote:

> Here is an attempt to reword this a bit:
>
>   The bitmap lookup table extension was documented by an earlier
>   change, but Git does not yet know how to parse that information.
>   The extension allows parsing a smaller portion of the bitmap
>   file in order to find bitmaps for specific commits.

Got it. Thanks.

> This environment variable does not appear to be used or
> documented anywhere. Do we really want to use it as a way
> to disable reading the lookup table in general? Or would it be
> better to have a GIT_TEST_* variable for disabling the read
> during testing?

GIT_TEST_* is perfect. This was mainly for testing purpose.

> Here, uint32_T is probably fine, but maybe we should just use
 size_t instead? Should we use st_mult() and st_add() everywhere?

Yeah, it would be better to use st_*().

> I see that we have a two-method recursion loop. Please move this
> declaration to immediately before lazy_bitmap_for_commit() so it
> is declared as late as possible.

Ok.

> These two helpers should probably return a size_t and uint32_t
> instead of a pointer. Let these do get_be[32|64]() on the computed
> pointer.

Ok.

> This is using an interesting type of tail-recursion. We might be
> better off using a loop with a stack: push to the stack the commit
> positions of the XOR bitmaps. At the very bottom, we get a bitmap
> without an XOR base. Then, pop off the stack, modifying the bitmap
> with XOR operations as we go. (Perhaps we also store these bitmaps
> in-memory along the way?) Finally, we have the necessary bitmap.

Hmm, got the point. I need to fix the xor-offset related issue first
(That I said earlier) before doing this.

> Perhaps it is time to use hexadecimal representation here to match the
> file format document?

Yeah, of course!

Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 2/6] pack-bitmap: prepare to read lookup table extension
  2022-06-20 22:06   ` Taylor Blau
@ 2022-06-21 11:52     ` Abhradeep Chakraborty
  2022-06-22 16:49       ` Taylor Blau
  0 siblings, 1 reply; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-21 11:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Derrick Stolee

Taylor Blau <me@ttaylorr.com> wrote:

> What is the purpose of the GIT_READ_COMMIT_TABLE environment variable? I
> assume that it's to make it easier to run tests (especially performance
> ones) with and without access to the lookup table. If so, we should
> document that (lightly) in the commit message, and rename this to be
> GIT_TEST_READ_COMMIT_TABLE to indicate that it shouldn't be used outside
> of tests.

This is mainly for testing, GIT_TEST_READ_COMMIT_TABLE is perfect.


> All makes sense. Some light documentation might help explain what this
> comparator function is used for (the bsearch() call below in
> bitmap_table_lookup()), although I suspect that this function will get
> slightly more complicated if you pack the table contents as I suggest,
> in which case more documentation will definitely help.

Ok.

> Interesting; this is a point that I forgot about from the original
> patch. xor_pos is an index (not an offset) into the list of commits in
> the table of contents in the order appear in that table. We should be
> clear about (a) what that order is, and (b) that xor_pos is an index
> into that order.

This is exactly what I said in my first reply. I made a mistake here.
(1) As xor_pos is relative to the current bitmap, it depends on the bitmap
entry order. These two order are not same. One is ordered by date, another
is lexicographically ordered. I will fix it.

> Yeah. The problem here is that we can't record every commit that
> _doesn't_ have a bitmap every time we return NULL from one of these
> queries, since there are arbitrarily many such commits that don't have
> bitmaps.
>
> We could approximate it using a Bloom filter or something, and much of
> that code is already written and could be interesting to try and reuse.
> But I wonder if we could get by with something simpler, though, which
> would cause us to load all bitmaps from the lookup table after a fixed
> number of cache misses (at which point we should force ourselves to load
> everything and just read everything out of a single O(1) lookup in the
> stored bitmap table).
>
> That may or may not be a good idea, and the threshold will probably be
> highly dependent on the system. So it may not even be worth it, but I
> think it's an interesting area to experiemnt in and think a little more
> about.

Now I got the point. I wonder what if we leave it as it is. How much will
it affect the code?

> How does this commit_pos work again? I confess I have forgetten since I
> wrote some of this code a while ago... :-).

It is using recursive strategy. The first call to `stored_bitmap_for_commit`
function do not have `pos_hint`. So, it uses `bitmap_table_lookup` to find
the commit position in the list and makes a call to `lazy_bitmap_for_commit`
function. This function gets the offset and xor-offset using the commit id's
position in the list. If xor-offset exists, it is using this xor-offset to
get the xor-bitmap by calling `stored_bitmap_for_commit` again. But this time
`pos_hint` is xor-offset. This goes on till the last non-xor bitmap has found.

As I said before, xor-offset should be an absolute value to make it work
correctly.

> Should we print this regardless of whether or not there is a lookup
> table? We should be able to learn the entry count either way.

No, this is necessary. "Bitmap v1 test (%d entries loaded)" means
all the bitmap entries has been loaded. It is basically for 
`load_bitmap_entries_bitmap_v1` function which loads all the bitmaps
One by one. But if there is a lookup table, `prepare_bitmap_git`
function will not load every entries and thus printing the above
line is wrong.

Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 3/6] pack-bitmap-write.c: write lookup table extension
  2022-06-20 22:16   ` Taylor Blau
@ 2022-06-21 12:50     ` Abhradeep Chakraborty
  2022-06-22 16:51       ` Taylor Blau
  0 siblings, 1 reply; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-21 12:50 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Derrick Stolee

Taylor Blau <me@ttaylorr.com> wrote:

> I'm not sure if I remember why `table[i] - selected->xor_offset` is
> right and not `i - selected->xor_offset`.

Even I myself got confused! Before sending the patch to the mailing
list, I was clear about that. That's why I didn't catch the so called
mistake I have been notifying till now. Thanks Taylor for asking
the question!

I should add a comment before the line so that people can understand it.
Let us parse `table_inv[table[i] - selected->xor_offset]` -

Suppose bitmap entries be like - 

Bitmap 0 (for commit 0)
Bitmap 1 (for commit 1)
Bitmap 2 (for commit 2)
Bitmap 3 (for commit 3)
.
.
.
Bitmap 20 (for commit 20)

These bitmaps are ordered by the date of their corresponding commit.
`table` array maps commit's lexicographic order to its bitmap order.
`table_inv` stores the reverse (i.e. it maps bitmap order to lexicographic
order). Say for example, if commit 4 is lexicographically first among all the
Commits then `table[0]` is 4. Similarly `table[1]`=2, table[2]=1 etc.
`table_inv[4]` is 0, table_inv[2]=1 etc.

Now suppose commit 4's bitmap has xor-relation with commit 2's bitmap.
So, xor-offset for bitmap 4 is 2. And `table[0] - selected->xor_offset`
is equal to 4-2 = 2. It is pointing to the commit 2. Now, 2 is in bitmap
Order. We need to convert it into lexicographic order. So, table_inv[2]
gives us the lexicographic order position of commit 2 I.e. 1.

Long story short, there is no issue regarding xor_offset. This xor_offset
is not relative to the current commit. It is absolute.

Sorry for the initial claim :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-21  8:31       ` Abhradeep Chakraborty
@ 2022-06-22 16:26         ` Taylor Blau
  0 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2022-06-22 16:26 UTC (permalink / raw)
  To: Abhradeep Chakraborty; +Cc: Git, Kaartic Sivaraam, Derrick Stolee

On Tue, Jun 21, 2022 at 02:01:14PM +0530, Abhradeep Chakraborty wrote:
> Taylor Blau <me@ttaylorr.com> wrote:
>
> > Abhradeep -- do you have any thoughts about what this might be used for?
> > I'll try to remember it myself, but I imagine that we could just as
> > easily remove this altogether and avoid the confusion.
>
> Honestly, I never understood the logic behind adding this flag option.
> I thought you have a reason to do that. Even I was thinking of curving
> it to 1 byte. I will remove it then.

I think removing it makes more sense. Since many of the other fields are
4-bytes wide, it's important for alignment purposes that those fields
have addresses which are a multiple of four (relative to the start of
the region, hence the 4-byte wide flags field).

But I'd just as soon get rid of it, so I think that makes sense to me.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-21  9:22     ` Abhradeep Chakraborty
@ 2022-06-22 16:29       ` Taylor Blau
  2022-06-22 16:45         ` Abhradeep Chakraborty
  0 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2022-06-22 16:29 UTC (permalink / raw)
  To: Abhradeep Chakraborty; +Cc: Git, Kaartic Sivaraam, Derrick Stolee

On Tue, Jun 21, 2022 at 02:52:53PM +0530, Abhradeep Chakraborty wrote:
> Taylor Blau <me@ttaylorr.com> wrote:
> > I remember we had a brief off-list discussion about whether we should
> > store the full object IDs in the offset table, or whether we could store
> > their pack- or index-relative ordering. Is there a reason to prefer one
> > or the other?
> >
> > I don't think we need to explain the choice fully in the documentation
> > in this patch, but it may be worth thinking about separately
> > nonetheless. We can store either order and convert it to an object ID in
> > constant time.
> >
> > To figure out which is best, I would recommend trying a few different
> > choices here and seeing how they do or don't impact your performance
> > testing.
>
> I think at that time I thought it would add extra cost of computing
> the actual commit ids from those index position. So, I didn't go
> further here.

It should be negligible relative to everything else, I would imagine.
The function that converts an index position into an object ID is
`nth_packed_object_id()`.

> I still have a feeling that there is some way to get rid of this
> list of commit ids. But at the same time, I do not want to add
> extra computation to the code.

I'm hoping that the additional complexity is minor. And if we can save
some extra bytes that aren't necessary in the first place without
compromising on performance, I think that's worthwhile to do.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-21 10:08     ` Abhradeep Chakraborty
@ 2022-06-22 16:30       ` Taylor Blau
  0 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2022-06-22 16:30 UTC (permalink / raw)
  To: Abhradeep Chakraborty; +Cc: Derrick Stolee, Git, Kaartic Sivaraam

On Tue, Jun 21, 2022 at 03:38:00PM +0530, Abhradeep Chakraborty wrote:
> Derrick Stolee <derrickstolee@github.com> wrote:
>
> > I think you mean 0x10 (b_1_0000) instead of 0xf (b_1111).
> >
> > I noticed when looking at the constant in patch 2.
>
> Yes, you're right. It's kind of embarrassment for me :)

It happens ;). Let's use 0x10 instead.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-22 16:29       ` Taylor Blau
@ 2022-06-22 16:45         ` Abhradeep Chakraborty
  0 siblings, 0 replies; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-22 16:45 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Derrick Stolee

Taylor Blau <me@ttaylorr.com> wrote:

> It should be negligible relative to everything else, I would imagine.
> The function that converts an index position into an object ID is
> `nth_packed_object_id()`.
>
> > I still have a feeling that there is some way to get rid of this
> > list of commit ids. But at the same time, I do not want to add
> > extra computation to the code.
>
> I'm hoping that the additional complexity is minor. And if we can save
> some extra bytes that aren't necessary in the first place without
> compromising on performance, I think that's worthwhile to do.

Ok. I will look into it then.

Most of the reviews has been addressed. Hope I will be able to submit
it soon.

Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 2/6] pack-bitmap: prepare to read lookup table extension
  2022-06-21 11:52     ` Abhradeep Chakraborty
@ 2022-06-22 16:49       ` Taylor Blau
  2022-06-22 17:18         ` Abhradeep Chakraborty
  0 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2022-06-22 16:49 UTC (permalink / raw)
  To: Abhradeep Chakraborty; +Cc: Git, Kaartic Sivaraam, Derrick Stolee

On Tue, Jun 21, 2022 at 05:22:12PM +0530, Abhradeep Chakraborty wrote:
> Taylor Blau <me@ttaylorr.com> wrote:
> > Yeah. The problem here is that we can't record every commit that
> > _doesn't_ have a bitmap every time we return NULL from one of these
> > queries, since there are arbitrarily many such commits that don't have
> > bitmaps.
> >
> > We could approximate it using a Bloom filter or something, and much of
> > that code is already written and could be interesting to try and reuse.
> > But I wonder if we could get by with something simpler, though, which
> > would cause us to load all bitmaps from the lookup table after a fixed
> > number of cache misses (at which point we should force ourselves to load
> > everything and just read everything out of a single O(1) lookup in the
> > stored bitmap table).
> >
> > That may or may not be a good idea, and the threshold will probably be
> > highly dependent on the system. So it may not even be worth it, but I
> > think it's an interesting area to experiemnt in and think a little more
> > about.
>
> Now I got the point. I wonder what if we leave it as it is. How much will
> it affect the code?

I'm not sure, and I think that it depends a lot on the repository and
query that we're running.

I'd imagine that the effect is probably measurable, but small. Each hash
lookup is cheap, but if there are many such lookups (a large proportion
of which end up resulting in "no, we haven't loaded this bitmap yet" and
then "...because no such bitmap exists for that commit") at some point
it is worth it to fault all of the commits that _do_ have bitmaps in and
answer authoritatively.

In other words, right now we have to do two queries when an commit
doesn't have a bitmap stored:

  - first, a lookup to see whether we have already loaded a bitmap for
    that commit

  - then, a subsequent lookup to see whether the .bitmap file itself has
    a bitmap for that commit, but we just haven't loaded it yet

If we knew that we had loaded all of the bitmaps in the file, then we
could simplify the above two queries into one, since whatever the first
one returns is enough to know whether or not a bitmap exists at all.

> > How does this commit_pos work again? I confess I have forgetten since I
> > wrote some of this code a while ago... :-).
>
> It is using recursive strategy. The first call to `stored_bitmap_for_commit`
> function do not have `pos_hint`. So, it uses `bitmap_table_lookup` to find
> the commit position in the list and makes a call to `lazy_bitmap_for_commit`
> function. This function gets the offset and xor-offset using the commit id's
> position in the list. If xor-offset exists, it is using this xor-offset to
> get the xor-bitmap by calling `stored_bitmap_for_commit` again. But this time
> `pos_hint` is xor-offset. This goes on till the last non-xor bitmap has found.

Ahhh. Thanks for refreshing my memory. I wonder if you think there is a
convenient way to work some of this into a short comment to help other
readers in the future, too.

> As I said before, xor-offset should be an absolute value to make it work
> correctly.

Yep, makes sense.

> > Should we print this regardless of whether or not there is a lookup
> > table? We should be able to learn the entry count either way.
>
> No, this is necessary. "Bitmap v1 test (%d entries loaded)" means
> all the bitmap entries has been loaded. It is basically for
> `load_bitmap_entries_bitmap_v1` function which loads all the bitmaps
> One by one. But if there is a lookup table, `prepare_bitmap_git`
> function will not load every entries and thus printing the above
> line is wrong.

Right, that part makes sense to me. But I wonder if we should still
print something, perhaps just "Bitmap v1 test" or "Bitmap v1 test (%d
entries)" omitting the "loaded" part.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 3/6] pack-bitmap-write.c: write lookup table extension
  2022-06-21 12:50     ` Abhradeep Chakraborty
@ 2022-06-22 16:51       ` Taylor Blau
  0 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2022-06-22 16:51 UTC (permalink / raw)
  To: Abhradeep Chakraborty; +Cc: Git, Kaartic Sivaraam, Derrick Stolee

On Tue, Jun 21, 2022 at 06:20:54PM +0530, Abhradeep Chakraborty wrote:
> Taylor Blau <me@ttaylorr.com> wrote:
>
> > I'm not sure if I remember why `table[i] - selected->xor_offset` is
> > right and not `i - selected->xor_offset`.
>
> Even I myself got confused! Before sending the patch to the mailing
> list, I was clear about that. That's why I didn't catch the so called
> mistake I have been notifying till now. Thanks Taylor for asking
> the question!
>
> I should add a comment before the line so that people can understand it.
> Let us parse `table_inv[table[i] - selected->xor_offset]` -
>
> Suppose bitmap entries be like -
>
> Bitmap 0 (for commit 0)
> Bitmap 1 (for commit 1)
> Bitmap 2 (for commit 2)
> Bitmap 3 (for commit 3)
> .
> .
> .
> Bitmap 20 (for commit 20)
>
> These bitmaps are ordered by the date of their corresponding commit.
> `table` array maps commit's lexicographic order to its bitmap order.
> `table_inv` stores the reverse (i.e. it maps bitmap order to lexicographic
> order). Say for example, if commit 4 is lexicographically first among all the
> Commits then `table[0]` is 4. Similarly `table[1]`=2, table[2]=1 etc.
> `table_inv[4]` is 0, table_inv[2]=1 etc.
>
> Now suppose commit 4's bitmap has xor-relation with commit 2's bitmap.
> So, xor-offset for bitmap 4 is 2. And `table[0] - selected->xor_offset`
> is equal to 4-2 = 2. It is pointing to the commit 2. Now, 2 is in bitmap
> Order. We need to convert it into lexicographic order. So, table_inv[2]
> gives us the lexicographic order position of commit 2 I.e. 1.
>
> Long story short, there is no issue regarding xor_offset. This xor_offset
> is not relative to the current commit. It is absolute.
>
> Sorry for the initial claim :)

Ahhhhh. Makes perfect sense. Thanks!

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 5/6] bitmap-commit-table: add tests for the bitmap lookup table
  2022-06-20 12:33 ` [PATCH 5/6] bitmap-commit-table: add tests for the bitmap lookup table Abhradeep Chakraborty via GitGitGadget
@ 2022-06-22 16:54   ` Taylor Blau
  0 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2022-06-22 16:54 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget
  Cc: git, Kaartic Sivaram, Abhradeep Chakraborty

On Mon, Jun 20, 2022 at 12:33:13PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
>
> Add tests to check the working of the newly implemented lookup table.
>
> Mentored-by: Taylor Blau <ttaylorr@github.com>
> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> ---
>  t/t5310-pack-bitmaps.sh       | 14 ++++++++++++++
>  t/t5326-multi-pack-bitmaps.sh | 19 +++++++++++++++++++
>  2 files changed, 33 insertions(+)
>
> diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
> index f775fc1ce69..f05d3e6ace7 100755
> --- a/t/t5310-pack-bitmaps.sh
> +++ b/t/t5310-pack-bitmaps.sh
> @@ -43,6 +43,20 @@ test_expect_success 'full repack creates bitmaps' '
>
>  basic_bitmap_tests
>
> +test_expect_success 'using lookup table does not affect basic bitmap tests' '
> +	test_config pack.writeBitmapLookupTable true &&
> +	git repack -adb
> +'

Whether or not we end up making pack.writeBitmapLookupTable be "true" by
default, I wonder if we should just set it to "true" whenever we write a
bitmap in this file, and then adjust whether or not we *read* the lookup
table with the GIT_TEST_ environment variable you introduced a few
commits back.

Thinking on it more, though, I don't think it makes a huge practical
difference for the code here in "t", since these repositories are tiny
and repacking them or rewriting their bitmaps is cheap.

But in the performance tests it probably makes a bigger difference.

> +basic_bitmap_tests
> +
> +test_expect_success 'using lookup table does not let each entries to be parsed one by one' '
> +	test_config pack.writeBitmapLookupTable true &&
> +	git repack -adb &&
> +	git rev-list --test-bitmap HEAD 2>out &&
> +	grep "Found bitmap for" out &&
> +	! grep "Bitmap v1 test "
> +'
> +
>  test_expect_success 'incremental repack fails when bitmaps are requested' '
>  	test_commit more-1 &&
>  	test_must_fail git repack -d 2>err &&
> diff --git a/t/t5326-multi-pack-bitmaps.sh b/t/t5326-multi-pack-bitmaps.sh
> index 4fe57414c13..85fbdf5e4bb 100755
> --- a/t/t5326-multi-pack-bitmaps.sh
> +++ b/t/t5326-multi-pack-bitmaps.sh
> @@ -306,5 +306,24 @@ test_expect_success 'graceful fallback when missing reverse index' '
>  		! grep "ignoring extra bitmap file" err
>  	)
>  '
> +test_expect_success 'multi-pack-index write --bitmap writes lookup table if enabled' '
> +	rm -fr repo &&
> +	git init repo &&
> +	test_when_finished "rm -fr repo" &&
> +	(
> +		cd repo &&
> +		test_commit_bulk 106 &&

Is there a reason we need to write this many commits? I think this is
copied from a test further up which deals explicitly with a case where
there are too many commits to write bitmaps for all of them (hence we
need to write more commits than 100 or so).

But I think for our purposes here we just need a single commit, written
into a single pack, which is covered with a MIDX.

So it should suffice to do something like:

    test_commit base &&
    git repack -ad &&
    git config pack.writeBitmapLookupTable true &&
    git multi-pack-index write --bitmap &&
    [...]

instead of what's written here.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 6/6] bitmap-lookup-table: add performance tests
  2022-06-20 12:33 ` [PATCH 6/6] bitmap-lookup-table: add performance tests Abhradeep Chakraborty via GitGitGadget
@ 2022-06-22 17:14   ` Taylor Blau
  0 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2022-06-22 17:14 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget
  Cc: git, Kaartic Sivaram, Abhradeep Chakraborty

On Mon, Jun 20, 2022 at 12:33:14PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
>
> Add performance tests for bitmap lookup table extension.

These tests look good, though I left a few notes below which boil down
to recommending a separate commit to set pack.writeReverseIndex=true,
and some suggestions for how to clean up the diff in the two performance
scripts you modified.

I would be interested to see the relevant results from running these
perf scripts on a reasonably large-sized repository, e.g. the kernel or
similar.

For the next version of this series, would you mind running these
scripts and including the results in this commit message?

> Mentored-by: Taylor Blau <ttaylorr@github.com>
> Co-mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> ---
>  t/perf/p5310-pack-bitmaps.sh       | 60 +++++++++++++++++++-----------
>  t/perf/p5326-multi-pack-bitmaps.sh | 55 +++++++++++++++++----------
>  2 files changed, 73 insertions(+), 42 deletions(-)
>
> diff --git a/t/perf/p5310-pack-bitmaps.sh b/t/perf/p5310-pack-bitmaps.sh
> index 7ad4f237bc3..a8d9414de92 100755
> --- a/t/perf/p5310-pack-bitmaps.sh
> +++ b/t/perf/p5310-pack-bitmaps.sh
> @@ -10,10 +10,11 @@ test_perf_large_repo
>  # since we want to be able to compare bitmap-aware
>  # git versus non-bitmap git
>  #
> -# We intentionally use the deprecated pack.writebitmaps
> +# We intentionally use the deprecated pack.writeBitmaps
>  # config so that we can test against older versions of git.
>  test_expect_success 'setup bitmap config' '
> -	git config pack.writebitmaps true
> +	git config pack.writeBitmaps true &&
> +	git config pack.writeReverseIndex true

I suspect that eliminating the overhead of generating the reverse index
in memory is important to see the effect of this test. We should make
sure that this is done in a separate step so when we compare two commits
that both have a reverse index written.

That being said, we should probably make reverse indexes be the default
anyways, since they help significantly with all kinds of things (really,
any operation which has to generate a reverse index in memory, like
preparing a pack to push, the '%(objectsize:disk)' cat-file formatting
atom, and so on.

So at a minimum I would suggest extracting a separate commit here which
sets pack.writeReverseIndex to true for this test. That way the commit
prior to this has reverse indexes written, and comparing "this commit"
to "the previous one" is isolating the effect of just the lookup table.

But as a useful sideproject, it would be worthwhile to investigate
setting this to true by default everywhere, perhaps after this series
has settled a little more (or if you are blocked / want something else
to do).

>  '
>
>  # we need to create the tag up front such that it is covered by the repack and
> @@ -28,27 +29,42 @@ test_perf 'repack to disk' '
>
>  test_full_bitmap
>
> -test_expect_success 'create partial bitmap state' '
> -	# pick a commit to represent the repo tip in the past
> -	cutoff=$(git rev-list HEAD~100 -1) &&
> -	orig_tip=$(git rev-parse HEAD) &&
> -
> -	# now kill off all of the refs and pretend we had
> -	# just the one tip
> -	rm -rf .git/logs .git/refs/* .git/packed-refs &&
> -	git update-ref HEAD $cutoff &&
> -
> -	# and then repack, which will leave us with a nice
> -	# big bitmap pack of the "old" history, and all of
> -	# the new history will be loose, as if it had been pushed
> -	# up incrementally and exploded via unpack-objects
> -	git repack -Ad &&
> -
> -	# and now restore our original tip, as if the pushes
> -	# had happened
> -	git update-ref HEAD $orig_tip
> +test_perf 'use lookup table' '
> +    git config pack.writeBitmapLookupTable true
>  '

This part doesn't need to use 'test_perf', since we don't care about the
performance of running "git config". Instead, using
`test_expect_success` is more appropriate here.

> -test_partial_bitmap
> +test_perf 'repack to disk (lookup table)' '
> +    git repack -adb
> +'
> +
> +test_full_bitmap
> +
> +for i in false true
> +do
> +	$i && lookup=" (lookup table)"
> +	test_expect_success "create partial bitmap state$lookup" '
> +		git config pack.writeBitmapLookupTable '"$i"' &&
> +		# pick a commit to represent the repo tip in the past
> +		cutoff=$(git rev-list HEAD~100 -1) &&
> +		orig_tip=$(git rev-parse HEAD) &&
> +
> +		# now kill off all of the refs and pretend we had
> +		# just the one tip
> +		rm -rf .git/logs .git/refs/* .git/packed-refs &&
> +		git update-ref HEAD $cutoff &&
> +
> +		# and then repack, which will leave us with a nice
> +		# big bitmap pack of the "old" history, and all of
> +		# the new history will be loose, as if it had been pushed
> +		# up incrementally and exploded via unpack-objects
> +		git repack -Ad &&
> +
> +		# and now restore our original tip, as if the pushes
> +		# had happened
> +		git update-ref HEAD $orig_tip
> +	'
> +
> +	test_partial_bitmap
> +done

Could we extract the body of this loop into a function whose first
argument is either true/false? I think that would improve readability
here, and potentially clean up the diff a little bit.

For what it's worth, I don't think we need to do anything fancier for
the test name other than:


    test_partial_bitmap () {
      local enabled="$1"
      test_expect_success "create partial bitmap state (lookup=$enabled)" '
        git config pack.writeBitmapLookupTable "$enabled" &&
        [...]
      '
    }

    test_partial_bitmap false
    test_partial_bitmap true

or something.

> +for i in false true
> +do
> +	$i && lookup=" (lookup table)"
> +	test_expect_success "create partial bitmap state$lookup" '
> +		git config pack.writeBitmapLookupTable '"$i"' &&
> +		# pick a commit to represent the repo tip in the past
> +		cutoff=$(git rev-list HEAD~100 -1) &&
> +		orig_tip=$(git rev-parse HEAD) &&
> +
> +		# now pretend we have just one tip
> +		rm -rf .git/logs .git/refs/* .git/packed-refs &&
> +		git update-ref HEAD $cutoff &&
> +
> +		# and then repack, which will leave us with a nice
> +		# big bitmap pack of the "old" history, and all of
> +		# the new history will be loose, as if it had been pushed
> +		# up incrementally and exploded via unpack-objects
> +		git repack -Ad &&
> +		git multi-pack-index write --bitmap &&
> +
> +		# and now restore our original tip, as if the pushes
> +		# had happened
> +		git update-ref HEAD $orig_tip
> +	'
> +
> +	test_partial_bitmap
> +done

Same note here.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 2/6] pack-bitmap: prepare to read lookup table extension
  2022-06-22 16:49       ` Taylor Blau
@ 2022-06-22 17:18         ` Abhradeep Chakraborty
  2022-06-22 21:34           ` Taylor Blau
  0 siblings, 1 reply; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-22 17:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Derrick Stolee

Taylor Blau <me@ttaylorr.com> wrote:

> In other words, right now we have to do two queries when an commit
> doesn't have a bitmap stored:
>
>   - first, a lookup to see whether we have already loaded a bitmap for
>     that commit
>
>   - then, a subsequent lookup to see whether the .bitmap file itself has
>     a bitmap for that commit, but we just haven't loaded it yet
>
> If we knew that we had loaded all of the bitmaps in the file, then we
> could simplify the above two queries into one, since whatever the first
> one returns is enough to know whether or not a bitmap exists at all.

Hmm, agreed.

> Ahhh. Thanks for refreshing my memory. I wonder if you think there is a
> convenient way to work some of this into a short comment to help other
> readers in the future, too.

Actually, Derrick has suggested to go with iterative approach[1] instead of
Recursive approach. What's your view on it?

> Right, that part makes sense to me. But I wonder if we should still
> print something, perhaps just "Bitmap v1 test" or "Bitmap v1 test (%d
> entries)" omitting the "loaded" part.

Yeah, of course we can!

Thanks :)

[1] https://lore.kernel.org/git/92dc6860-ff35-0989-5114-fe1e220ca10c@github.com/

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 2/6] pack-bitmap: prepare to read lookup table extension
  2022-06-22 17:18         ` Abhradeep Chakraborty
@ 2022-06-22 21:34           ` Taylor Blau
  0 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2022-06-22 21:34 UTC (permalink / raw)
  To: Abhradeep Chakraborty; +Cc: Git, Kaartic Sivaraam, Derrick Stolee

On Wed, Jun 22, 2022 at 10:48:14PM +0530, Abhradeep Chakraborty wrote:
> > Ahhh. Thanks for refreshing my memory. I wonder if you think there is a
> > convenient way to work some of this into a short comment to help other
> > readers in the future, too.
>
> Actually, Derrick has suggested to go with iterative approach[1] instead of
> Recursive approach. What's your view on it?

I don't have a strong feeling about it. In practice, we seem to top out
at ~500 bitmaps or so for large-ish repositories, so I would be
surprised to see this result in stack exhaustion even in the worst case
(every bitmap xor'd with the previous one, forming a long chain).

But it doesn't hurt to be defensive, so I think it's worth it as long as
you don't find the implementation too complex.

> [1] https://lore.kernel.org/git/92dc6860-ff35-0989-5114-fe1e220ca10c@github.com/

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v2 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format
  2022-06-20 12:33 [PATCH 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
                   ` (5 preceding siblings ...)
  2022-06-20 12:33 ` [PATCH 6/6] bitmap-lookup-table: add performance tests Abhradeep Chakraborty via GitGitGadget
@ 2022-06-26 13:10 ` Abhradeep Chakraborty via GitGitGadget
  2022-06-26 13:10   ` [PATCH v2 1/6] Documentation/technical: describe bitmap lookup table extension Abhradeep Chakraborty via GitGitGadget
                     ` (5 more replies)
  6 siblings, 6 replies; 63+ messages in thread
From: Abhradeep Chakraborty via GitGitGadget @ 2022-06-26 13:10 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Kaartic Sivaram, Derrick Stolee, Abhradeep Chakraborty

When parsing the .bitmap file, git loads all the bitmaps one by one even if
some of the bitmaps are not necessary. We can remove this overhead by
loading only the necessary bitmaps. A look up table extension can solve this
issue.

Changes since v1:

This is the second version which addressed all (I think) the reviews. Please
notify me if some reviews are not addressed :)

 * The table size is decreased and the format has also changed. It now
   contains nr_entries triplets of size 4+8+4 bytes. Each triplet contains
   the following things - (1) 4 byte commit position (in the pack-index or
   midx) (2) 8 byte offset and (3) 4 byte xor triplet (i.e. with whose
   bitmap the current triplet's bitmap has to xor) position.
 * Performance tests are splitted into two commits. First contains the
   actual performance tests and second enables the pack.writeReverseIndex
   (as suggested by Taylor).
 * st_*() functions are used.
 * commit order is changed according to Derrick's suggestion.
 * Iterative approach is used instead of recursive approach to parse xor
   bitmaps. (As suggested by Derrick).
 * Some minor bug fixes of previous version.

Initial version:

The proposed table has:

 * a list of nr_entries object ids. These objects are commits that has
   bitmaps. Ids are stored in lexicographic order (for better searching).
 * a list of <offset, xor-offset> pairs (4-byte integers, network-byte
   order). The i'th pair denotes the offset and xor-offset(respectively) of
   the bitmap of i'th commit in the previous list. These two informations
   are necessary because only in this way bitmaps can be found without
   parsing all the bitmap.
 * a 4-byte integer for table specific flags (none exists currently).

Whenever git want to parse the bitmap for a specific commit, it will first
refer to the table and will look for the offset and xor-offset for that
commit. Git will then try to parse the bitmap located at the offset
position. The xor-offset can be used to find the xor-bitmap for the
bitmap(if any).

Abhradeep Chakraborty (6):
  Documentation/technical: describe bitmap lookup table extension
  pack-bitmap-write.c: write lookup table extension
  pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests
  pack-bitmap: prepare to read lookup table extension
  bitmap-lookup-table: add performance tests for lookup table
  p5310-pack-bitmaps.sh: enable pack.writeReverseIndex for testing

 Documentation/config/pack.txt             |   7 +
 Documentation/technical/bitmap-format.txt |  41 +++++
 builtin/multi-pack-index.c                |   8 +
 builtin/pack-objects.c                    |  10 +-
 midx.c                                    |   3 +
 midx.h                                    |   1 +
 pack-bitmap-write.c                       |  74 ++++++++-
 pack-bitmap.c                             | 193 ++++++++++++++++++++--
 pack-bitmap.h                             |   5 +-
 t/perf/p5310-pack-bitmaps.sh              |  66 ++++----
 t/perf/p5326-multi-pack-bitmaps.sh        |  93 ++++++-----
 t/t5310-pack-bitmaps.sh                   |  10 +-
 t/t5326-multi-pack-bitmaps.sh             |  14 ++
 13 files changed, 439 insertions(+), 86 deletions(-)


base-commit: 39c15e485575089eb77c769f6da02f98a55905e0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1266%2FAbhra303%2Fbitmap-commit-table-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1266/Abhra303/bitmap-commit-table-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1266

Range-diff vs v1:

 1:  2e22ca5069a ! 1:  4d11be66cfa Documentation/technical: describe bitmap lookup table extension
     @@ Commit message
          When reading bitmap file, git loads each and every bitmap one by one
          even if all the bitmaps are not required. A "bitmap lookup table"
          extension to the bitmap format can reduce the overhead of loading
     -    bitmaps which stores a list of bitmapped commit oids, along with their
     -    offset and xor offset. This way git can load only the neccesary bitmaps
     -    without loading the previous bitmaps.
     +    bitmaps which stores a list of bitmapped commit id pos (in the midx
     +    or pack, along with their offset and xor offset. This way git can
     +    load only the neccesary bitmaps without loading the previous bitmaps.
     +
     +    The older version of Git ignores the lookup table extension and doesn't
     +    throw any kind of warning or error while parsing the bitmap file.
      
          Add some information for the new "bitmap lookup table" extension in the
          bitmap-format documentation.
      
     -    Co-Authored-by: Taylor Blau <ttaylorr@github.com>
     -    Mentored-by: Taylor Blau <ttaylorr@github.com>
     +    Co-Authored-by: Taylor Blau <me@ttaylorr.com>
     +    Mentored-by: Taylor Blau <me@ttaylorr.com>
          Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
          Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
      
     @@ Documentation/technical/bitmap-format.txt: MIDXs, both the bit-cache and rev-cac
       			described below.
       
      +			** {empty}
     -+			BITMAP_OPT_LOOKUP_TABLE (0xf) : :::
     ++			BITMAP_OPT_LOOKUP_TABLE (0x10): :::
      +			If present, the end of the bitmap file contains a table
     -+			containing a list of `N` object ids, a list of pairs of
     -+			offset and xor offset of respective objects, and 4-byte
     -+			integer denoting the flags (currently none). The format
     -+			and meaning of the table is described below.
     ++			containing a list of `N` <commit pos, offset, xor offset>
     ++			triplets. The format and meaning of the table is described
     ++			below.
     +++
     ++NOTE: This xor_offset is different from the bitmap's xor_offset.
     ++Bitmap's xor_offset is relative i.e. it tells how many bitmaps we have
     ++to go back from the current bitmap. Lookup table's xor_offset tells the
     ++position of the triplet in the list whose bitmap the current commit's
     ++bitmap have to xor with.
      +
       		4-byte entry count (network byte order)
       
     @@ Documentation/technical/bitmap-format.txt: Note that this hashing scheme is tied
      +Commit lookup table
      +-------------------
      +
     -+If the BITMAP_OPT_LOOKUP_TABLE flag is set, the end of the `.bitmap`
     -+contains a lookup table specifying the positions of commits which have a
     -+bitmap.
     ++If the BITMAP_OPT_LOOKUP_TABLE flag is set, the last `N * (4 + 8 + 4)`
     ++(preceding the name-hash cache and trailing hash) of the `.bitmap` file
     ++contains a lookup table specifying the information needed to get the
     ++desired bitmap from the entries without parsing previous unnecessary
     ++bitmaps.
      +
     -+For a `.bitmap` containing `nr_entries` reachability bitmaps, the format
     -+is as follows:
     ++For a `.bitmap` containing `nr_entries` reachability bitmaps, the table
     ++contains a list of `nr_entries` <commit pos, offset, xor offset> triplets.
     ++The content of i'th triplet is -
      +
     -+	- `nr_entries` object names.
     ++	* {empty}
     ++	commit pos (4 byte integer, network byte order): ::
     ++	It stores the object position of the commit (in the midx or pack index)
     ++	to which the i'th bitmap in the bitmap entries belongs.
      +
     -+	- `nr_entries` pairs of 4-byte integers, each in network order.
     -+	  The first holds the offset from which that commit's bitmap can
     -+	  be read. The second number holds the position of the commit
     -+	  whose bitmap the current bitmap is xor'd with in lexicographic
     -+	  order, or 0xffffffff if the current commit is not xor'd with
     -+	  anything.
     ++	* {empty}
     ++	offset (8 byte integer, network byte order): ::
     ++	The offset from which that commit's bitmap can be read.
      +
     -+	- One 4-byte network byte order integer specifying
     -+	  table-specific flags. None exist currently, so this is always
     -+	  "0".
     ++	* {empty}
     ++	xor offset (4 byte integer, network byte order): ::
     ++	It holds the position of the triplet with whose bitmap the
     ++	current bitmap need to xor. If the current triplet's bitmap
     ++	do not have any xor bitmap, it defaults to 0xffffffff.
 3:  ed91ebf69a8 ! 2:  d118f1d45e6 pack-bitmap-write.c: write lookup table extension
     @@ Metadata
       ## Commit message ##
          pack-bitmap-write.c: write lookup table extension
      
     -    Teach git to write bitmap lookup table extension. The table has the
     -    following information:
     +    The bitmap lookup table extension was documentated by an earlier
     +    change, but Git does not yet knowhow to write that extension.
      
     -        - `N` no of Object ids of each bitmapped commits
     +    Teach git to write bitmap lookup table extension. The table contains
     +    the list of `N` <commit pos, offset, xor offset>` triplets. These
     +    triplets are sorted according to their commit pos (ascending order).
     +    The meaning of each data in the i'th triplet is given below:
      
     -        - A list of offset, xor-offset pair; the i'th pair denotes the
     -          offsets and xor-offsets of i'th commit in the previous list.
     +      - Commit pos is the position of the commit in the pack-index
     +        (or midx) to which the i'th bitmap belongs. It is a 4 byte
     +        network byte order integer.
      
     -        - 4-byte integer denoting the flags
     +      - offset is the position of the i'th bitmap.
      
     -    Co-authored-by: Taylor Blau <ttaylorr@github.com>
     -    Mentored-by: Taylor Blau <ttaylorr@github.com>
     +      - xor offset denotes the position of the triplet with whose
     +        bitmap the current triplet's bitmap need to xor with.
     +
     +    Co-authored-by: Taylor Blau <me@ttaylorr.com>
     +    Mentored-by: Taylor Blau <me@ttaylorr.com>
          Co-mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
          Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
      
     @@ pack-bitmap-write.c: static const struct object_id *oid_access(size_t pos, const
       				      struct pack_idx_entry **index,
      -				      uint32_t index_nr)
      +				      uint32_t index_nr,
     -+				      off_t *offsets)
     ++				      uint64_t *offsets,
     ++				      uint32_t *commit_positions)
       {
       	int i;
       
     @@ pack-bitmap-write.c: static void write_selected_commits_v1(struct hashfile *f,
       
      +		if (offsets)
      +			offsets[i] = hashfile_total(f);
     ++		if (commit_positions)
     ++			commit_positions[i] = commit_pos;
      +
       		hashwrite_be32(f, commit_pos);
       		hashwrite_u8(f, stored->xor_offset);
     @@ pack-bitmap-write.c: static void write_selected_commits_v1(struct hashfile *f,
       	}
       }
       
     -+static int table_cmp(const void *_va, const void *_vb)
     ++static int table_cmp(const void *_va, const void *_vb, void *commit_positions)
      +{
     -+	return oidcmp(&writer.selected[*(uint32_t*)_va].commit->object.oid,
     -+		      &writer.selected[*(uint32_t*)_vb].commit->object.oid);
     ++	int8_t result = 0;
     ++	uint32_t *positions = (uint32_t *) commit_positions;
     ++	uint32_t a = positions[*(uint32_t *)_va];
     ++	uint32_t b = positions[*(uint32_t *)_vb];
     ++
     ++	if (a > b)
     ++		result = 1;
     ++	else if (a < b)
     ++		result = -1;
     ++	else
     ++		result = 0;
     ++
     ++	return result;
      +}
      +
      +static void write_lookup_table(struct hashfile *f,
     -+			       off_t *offsets)
     ++			       uint64_t *offsets,
     ++			       uint32_t *commit_positions)
      +{
      +	uint32_t i;
     -+	uint32_t flags = 0;
      +	uint32_t *table, *table_inv;
      +
      +	ALLOC_ARRAY(table, writer.selected_nr);
     @@ pack-bitmap-write.c: static void write_selected_commits_v1(struct hashfile *f,
      +
      +	for (i = 0; i < writer.selected_nr; i++)
      +		table[i] = i;
     -+	QSORT(table, writer.selected_nr, table_cmp);
     ++
     ++	QSORT_S(table, writer.selected_nr, table_cmp, commit_positions);
     ++
      +	for (i = 0; i < writer.selected_nr; i++)
      +		table_inv[table[i]] = i;
      +
      +	for (i = 0; i < writer.selected_nr; i++) {
      +		struct bitmapped_commit *selected = &writer.selected[table[i]];
     -+		struct object_id *oid = &selected->commit->object.oid;
     ++		uint32_t xor_offset = selected->xor_offset;
      +
     -+		hashwrite(f, oid->hash, the_hash_algo->rawsz);
     ++		hashwrite_be32(f, commit_positions[table[i]]);
     ++		hashwrite_be64(f, offsets[table[i]]);
     ++		hashwrite_be32(f, xor_offset ?
     ++				table_inv[table[i] - xor_offset]: 0xffffffff);
      +	}
     -+	for (i = 0; i < writer.selected_nr; i++) {
     -+		struct bitmapped_commit *selected = &writer.selected[table[i]];
     -+
     -+		hashwrite_be32(f, offsets[table[i]]);
     -+		hashwrite_be32(f, selected->xor_offset
     -+			       ? table_inv[table[i] - selected->xor_offset]
     -+			       : 0xffffffff);
     -+	}
     -+
     -+	hashwrite_be32(f, flags);
      +
      +	free(table);
      +	free(table_inv);
     @@ pack-bitmap-write.c: void bitmap_writer_finish(struct pack_idx_entry **index,
       {
       	static uint16_t default_version = 1;
       	static uint16_t flags = BITMAP_OPT_FULL_DAG;
     -+	off_t *offsets = NULL;
     ++	uint64_t *offsets = NULL;
     ++	uint32_t *commit_positions = NULL;
       	struct strbuf tmp_file = STRBUF_INIT;
       	struct hashfile *f;
       
     @@ pack-bitmap-write.c: void bitmap_writer_finish(struct pack_idx_entry **index,
       	dump_bitmap(f, writer.tags);
      -	write_selected_commits_v1(f, index, index_nr);
       
     -+	if (options & BITMAP_OPT_LOOKUP_TABLE)
     ++	if (options & BITMAP_OPT_LOOKUP_TABLE) {
      +		CALLOC_ARRAY(offsets, index_nr);
     ++		CALLOC_ARRAY(commit_positions, index_nr);
     ++	}
      +
     -+	write_selected_commits_v1(f, index, index_nr, offsets);
     ++	write_selected_commits_v1(f, index, index_nr, offsets, commit_positions);
      +
      +	if (options & BITMAP_OPT_LOOKUP_TABLE)
     -+		write_lookup_table(f, offsets);
     ++		write_lookup_table(f, offsets, commit_positions);
       	if (options & BITMAP_OPT_HASH_CACHE)
       		write_hash_cache(f, index, index_nr);
       
     @@ pack-bitmap-write.c: void bitmap_writer_finish(struct pack_idx_entry **index,
       
       	strbuf_release(&tmp_file);
      +	free(offsets);
     ++	free(commit_positions);
       }
     +
     + ## pack-bitmap.h ##
     +@@ pack-bitmap.h: struct bitmap_disk_header {
     + #define NEEDS_BITMAP (1u<<22)
     + 
     + enum pack_bitmap_opts {
     +-	BITMAP_OPT_FULL_DAG = 1,
     +-	BITMAP_OPT_HASH_CACHE = 4,
     ++	BITMAP_OPT_FULL_DAG = 0x1,
     ++	BITMAP_OPT_HASH_CACHE = 0x4,
     ++	BITMAP_OPT_LOOKUP_TABLE = 0x10,
     + };
     + 
     + enum pack_bitmap_flags {
 4:  661c1137e1c ! 3:  7786dc879f0 builtin/pack-objects.c: learn pack.writeBitmapLookupTable
     @@
       ## Metadata ##
     -Author: Taylor Blau <ttaylorr@github.com>
     +Author: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
      
       ## Commit message ##
     -    builtin/pack-objects.c: learn pack.writeBitmapLookupTable
     +    pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests
      
          Teach git to provide a way for users to enable/disable bitmap lookup
          table extension by providing a config option named 'writeBitmapLookupTable'.
     +    Default is true.
      
     -    Signed-off-by: Taylor Blau <ttaylorr@github.com>
     +    Also add test to verify writting of lookup table.
     +
     +    Co-Authored-by: Taylor Blau <me@ttaylorr.com>
          Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
     +    Mentored-by: Taylor Blau <me@ttaylorr.com>
     +    Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
      
       ## Documentation/config/pack.txt ##
      @@ Documentation/config/pack.txt: When writing a multi-pack reachability bitmap, no new namehashes are
     @@ Documentation/config/pack.txt: When writing a multi-pack reachability bitmap, no
      +	bitmap index (if one is written). This table is used to defer
      +	loading individual bitmaps as late as possible. This can be
      +	beneficial in repositories which have relatively large bitmap
     -+	indexes. Defaults to false.
     ++	indexes. Defaults to true.
      +
       pack.writeReverseIndex::
       	When true, git will write a corresponding .rev file (see:
       	link:../technical/pack-format.html[Documentation/technical/pack-format.txt])
      
     + ## builtin/multi-pack-index.c ##
     +@@ builtin/multi-pack-index.c: static int git_multi_pack_index_write_config(const char *var, const char *value,
     + 			opts.flags &= ~MIDX_WRITE_BITMAP_HASH_CACHE;
     + 	}
     + 
     ++	if (!strcmp(var, "pack.writebitmaplookuptable")) {
     ++		if (git_config_bool(var, value))
     ++			opts.flags |= MIDX_WRITE_BITMAP_LOOKUP_TABLE;
     ++		else
     ++			opts.flags &= ~MIDX_WRITE_BITMAP_LOOKUP_TABLE;
     ++	}
     ++
     + 	/*
     + 	 * We should never make a fall-back call to 'git_default_config', since
     + 	 * this was already called in 'cmd_multi_pack_index()'.
     +@@ builtin/multi-pack-index.c: static int cmd_multi_pack_index_write(int argc, const char **argv)
     + 	};
     + 
     + 	opts.flags |= MIDX_WRITE_BITMAP_HASH_CACHE;
     ++	opts.flags |= MIDX_WRITE_BITMAP_LOOKUP_TABLE;
     + 
     + 	git_config(git_multi_pack_index_write_config, NULL);
     + 
     +
       ## builtin/pack-objects.c ##
     +@@ builtin/pack-objects.c: static enum {
     + 	WRITE_BITMAP_QUIET,
     + 	WRITE_BITMAP_TRUE,
     + } write_bitmap_index;
     +-static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE;
     ++static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE | BITMAP_OPT_LOOKUP_TABLE;
     + 
     + static int exclude_promisor_objects;
     + 
      @@ builtin/pack-objects.c: static int git_pack_config(const char *k, const char *v, void *cb)
       		else
       			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
     @@ builtin/pack-objects.c: static int git_pack_config(const char *k, const char *v,
       	if (!strcmp(k, "pack.usebitmaps")) {
       		use_bitmap_index_default = git_config_bool(k, v);
       		return 0;
     +
     + ## midx.c ##
     +@@ midx.c: static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
     + 	if (flags & MIDX_WRITE_BITMAP_HASH_CACHE)
     + 		options |= BITMAP_OPT_HASH_CACHE;
     + 
     ++	if (flags & MIDX_WRITE_BITMAP_LOOKUP_TABLE)
     ++		options |= BITMAP_OPT_LOOKUP_TABLE;
     ++
     + 	prepare_midx_packing_data(&pdata, ctx);
     + 
     + 	commits = find_commits_for_midx_bitmap(&commits_nr, refs_snapshot, ctx);
     +
     + ## midx.h ##
     +@@ midx.h: struct multi_pack_index {
     + #define MIDX_WRITE_REV_INDEX (1 << 1)
     + #define MIDX_WRITE_BITMAP (1 << 2)
     + #define MIDX_WRITE_BITMAP_HASH_CACHE (1 << 3)
     ++#define MIDX_WRITE_BITMAP_LOOKUP_TABLE (1 << 4)
     + 
     + const unsigned char *get_midx_checksum(struct multi_pack_index *m);
     + void get_midx_filename(struct strbuf *out, const char *object_dir);
     +
     + ## pack-bitmap-write.c ##
     +@@ pack-bitmap-write.c: static void write_lookup_table(struct hashfile *f,
     + 	for (i = 0; i < writer.selected_nr; i++)
     + 		table_inv[table[i]] = i;
     + 
     ++	trace2_region_enter("pack-bitmap-write", "writing_lookup_table", the_repository);
     + 	for (i = 0; i < writer.selected_nr; i++) {
     + 		struct bitmapped_commit *selected = &writer.selected[table[i]];
     + 		uint32_t xor_offset = selected->xor_offset;
     +@@ pack-bitmap-write.c: static void write_lookup_table(struct hashfile *f,
     + 
     + 	free(table);
     + 	free(table_inv);
     ++	trace2_region_leave("pack-bitmap-write", "writing_lookup_table", the_repository);
     + }
     + 
     + static void write_hash_cache(struct hashfile *f,
     +
     + ## t/t5310-pack-bitmaps.sh ##
     +@@ t/t5310-pack-bitmaps.sh: test_expect_success 'full repack creates bitmaps' '
     + 	ls .git/objects/pack/ | grep bitmap >output &&
     + 	test_line_count = 1 output &&
     + 	grep "\"key\":\"num_selected_commits\",\"value\":\"106\"" trace &&
     +-	grep "\"key\":\"num_maximal_commits\",\"value\":\"107\"" trace
     ++	grep "\"key\":\"num_maximal_commits\",\"value\":\"107\"" trace &&
     ++	grep "\"label\":\"writing_lookup_table\"" trace
     + '
     + 
     + basic_bitmap_tests
     +
     + ## t/t5326-multi-pack-bitmaps.sh ##
     +@@ t/t5326-multi-pack-bitmaps.sh: test_expect_success 'graceful fallback when missing reverse index' '
     + 	)
     + '
     + 
     ++test_expect_success 'multi-pack-index write writes lookup table if enabled' '
     ++	rm -fr repo &&
     ++	git init repo &&
     ++	test_when_finished "rm -fr repo" &&
     ++	(
     ++		cd repo &&
     ++		test_commit base &&
     ++		git repack -ad &&
     ++		GIT_TRACE2_EVENT="$(pwd)/trace" \
     ++			git multi-pack-index write --bitmap &&
     ++		grep "\"label\":\"writing_lookup_table\"" trace
     ++	)
     ++'
     + test_done
 2:  d139a4c48aa ! 4:  4fbfcff8a20 pack-bitmap: prepare to read lookup table extension
     @@ Metadata
       ## Commit message ##
          pack-bitmap: prepare to read lookup table extension
      
     -    Bitmap lookup table extension can let git to parse only the necessary
     -    bitmaps without loading the previous bitmaps one by one.
     +    Earlier change teaches Git to write bitmap lookup table. But Git
     +    does not know how to parse them.
      
     -    Teach git to read and use the bitmap lookup table extension.
     +    Teach Git to parse the existing bitmap lookup table. The older
     +    versions of git are not affected by it. Those versions ignore the
     +    lookup table.
      
     -    Co-Authored-by: Taylor Blau <ttaylorr@github.com>
     -    Mentored-by: Taylor Blau <ttaylorr@github.com>
     -    Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
          Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
     +    Mentored-by: Taylor Blau <me@ttaylorr.com>
     +    Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
      
       ## pack-bitmap.c ##
     -@@
     - #include "list-objects-filter-options.h"
     - #include "midx.h"
     - #include "config.h"
     -+#include "hash-lookup.h"
     - 
     - /*
     -  * An entry on the bitmap index, representing the bitmap for a given
      @@ pack-bitmap.c: struct bitmap_index {
       	/* The checksum of the packfile or MIDX; points into map. */
       	const unsigned char *checksum;
       
      +	/*
     -+	 * If not NULL, these point into the various commit table sections
     ++	 * If not NULL, this point into the commit table extension
      +	 * (within map).
      +	 */
      +	unsigned char *table_lookup;
     -+	unsigned char *table_offsets;
      +
       	/*
       	 * Extended index.
     @@ pack-bitmap.c: static int load_bitmap_header(struct bitmap_index *index)
       		}
      +
      +		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
     -+		    git_env_bool("GIT_READ_COMMIT_TABLE", 1)) {
     -+			uint32_t entry_count = ntohl(header->entry_count);
     -+			uint32_t table_size =
     -+				(entry_count * the_hash_algo->rawsz) /* oids */ +
     -+				(entry_count * sizeof(uint32_t)) /* offsets */ +
     -+				(entry_count * sizeof(uint32_t)) /* xor offsets */ +
     -+				(sizeof(uint32_t)) /* flags */;
     ++			git_env_bool("GIT_TEST_READ_COMMIT_TABLE", 1)) {
     ++			size_t table_size = 0;
     ++			size_t triplet_sz = st_add3(sizeof(uint32_t),    /* commit position */
     ++							sizeof(uint64_t),    /* offset */
     ++							sizeof(uint32_t));    /* xor offset */
      +
     ++			table_size = st_add(table_size,
     ++					st_mult(ntohl(header->entry_count),
     ++						triplet_sz));
      +			if (table_size > index_end - index->map - header_size)
     -+				return error("corrupted bitmap index file (too short to fit commit table)");
     -+
     ++				return error("corrupted bitmap index file (too short to fit lookup table)");
      +			index->table_lookup = (void *)(index_end - table_size);
     -+			index->table_offsets = index->table_lookup + the_hash_algo->rawsz * entry_count;
     -+
      +			index_end -= table_size;
      +		}
       	}
       
       	index->entry_count = ntohl(header->entry_count);
     +@@ pack-bitmap.c: static struct stored_bitmap *store_bitmap(struct bitmap_index *index,
     + 
     + 	hash_pos = kh_put_oid_map(index->bitmaps, stored->oid, &ret);
     + 
     +-	/* a 0 return code means the insertion succeeded with no changes,
     +-	 * because the SHA1 already existed on the map. this is bad, there
     +-	 * shouldn't be duplicated commits in the index */
     ++	/* A 0 return code means the insertion succeeded with no changes,
     ++	 * because the SHA1 already existed on the map. If lookup table
     ++	 * is NULL, this is bad, there shouldn't be duplicated commits
     ++	 * in the index.
     ++	 *
     ++	 * If table_lookup exists, that means the desired bitmap is already
     ++	 * loaded. Either this bitmap has been stored directly or another
     ++	 * bitmap has a direct or indirect xor relation with it. */
     + 	if (ret == 0) {
     +-		error("Duplicate entry in bitmap index: %s", oid_to_hex(oid));
     +-		return NULL;
     ++		if (!index->table_lookup) {
     ++			error("Duplicate entry in bitmap index: %s", oid_to_hex(oid));
     ++			return NULL;
     ++		}
     ++		return kh_value(index->bitmaps, hash_pos);
     + 	}
     + 
     + 	kh_value(index->bitmaps, hash_pos) = stored;
      @@ pack-bitmap.c: static int load_bitmap(struct bitmap_index *bitmap_git)
       		!(bitmap_git->tags = read_bitmap_1(bitmap_git)))
       		goto failed;
     @@ pack-bitmap.c: struct include_data {
       	struct bitmap *seen;
       };
       
     --struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
     --				      struct commit *commit)
     -+static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
     -+						      struct commit *commit,
     -+						      uint32_t *pos_hint);
     -+
     -+static inline const unsigned char *bitmap_oid_pos(struct bitmap_index *bitmap_git,
     -+						  uint32_t pos)
     ++static inline const void *bitmap_get_triplet(struct bitmap_index *bitmap_git, uint32_t xor_pos)
      +{
     -+	return bitmap_git->table_lookup + (pos * the_hash_algo->rawsz);
     ++	size_t triplet_sz = st_add3(sizeof(uint32_t), sizeof(uint64_t), sizeof(uint32_t));
     ++	const void *p = bitmap_git->table_lookup + st_mult(xor_pos, triplet_sz);
     ++	return p;
      +}
      +
     -+static inline const void *bitmap_offset_pos(struct bitmap_index *bitmap_git,
     -+					    uint32_t pos)
     ++static uint64_t triplet_get_offset(const void *triplet)
      +{
     -+	return bitmap_git->table_offsets + (pos * 2 * sizeof(uint32_t));
     ++	const void *p = (unsigned char*) triplet + sizeof(uint32_t);
     ++	return get_be64(p);
      +}
      +
     -+static inline const void *xor_position_pos(struct bitmap_index *bitmap_git,
     -+					   uint32_t pos)
     ++static uint32_t triplet_get_xor_pos(const void *triplet)
      +{
     -+	return (unsigned char*) bitmap_offset_pos(bitmap_git, pos) + sizeof(uint32_t);
     ++	const void *p = (unsigned char*) triplet + st_add(sizeof(uint32_t), sizeof(uint64_t));
     ++	return get_be32(p);
      +}
      +
     -+static int bitmap_lookup_cmp(const void *_va, const void *_vb)
     ++static int triplet_cmp(const void *va, const void *vb)
      +{
     -+	return hashcmp(_va, _vb);
     ++	int result = 0;
     ++	uint32_t *a = (uint32_t *) va;
     ++	uint32_t b = get_be32(vb);
     ++	if (*a > b)
     ++		result = 1;
     ++	else if (*a < b)
     ++		result = -1;
     ++	else
     ++		result = 0;
     ++
     ++	return result;
      +}
      +
     -+static int bitmap_table_lookup(struct bitmap_index *bitmap_git,
     -+			       struct object_id *oid,
     -+			       uint32_t *commit_pos)
     ++static uint32_t bsearch_pos(struct bitmap_index *bitmap_git, struct object_id *oid,
     ++						uint32_t *result)
      +{
     -+	unsigned char *found = bsearch(oid->hash, bitmap_git->table_lookup,
     -+				       bitmap_git->entry_count,
     -+				       the_hash_algo->rawsz, bitmap_lookup_cmp);
     -+	if (found)
     -+		*commit_pos = (found - bitmap_git->table_lookup) / the_hash_algo->rawsz;
     -+	return !!found;
     ++	int found;
     ++
     ++	if (bitmap_git->midx)
     ++		found = bsearch_midx(oid, bitmap_git->midx, result);
     ++	else
     ++		found = bsearch_pack(oid, bitmap_git->pack, result);
     ++
     ++	return found;
      +}
      +
      +static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
     -+						    struct object_id *oid,
     -+						    uint32_t commit_pos)
     ++					  struct commit *commit)
      +{
     -+	uint32_t xor_pos;
     -+	off_t bitmap_ofs;
     -+
     ++	uint32_t commit_pos, xor_pos;
     ++	uint64_t offset;
      +	int flags;
     ++	const void *triplet = NULL;
     ++	struct object_id *oid = &commit->object.oid;
      +	struct ewah_bitmap *bitmap;
     -+	struct stored_bitmap *xor_bitmap;
     ++	struct stored_bitmap *xor_bitmap = NULL;
     ++	size_t triplet_sz = st_add3(sizeof(uint32_t), sizeof(uint64_t), sizeof(uint32_t));
      +
     -+	bitmap_ofs = get_be32(bitmap_offset_pos(bitmap_git, commit_pos));
     -+	xor_pos = get_be32(xor_position_pos(bitmap_git, commit_pos));
     ++	int found = bsearch_pos(bitmap_git, oid, &commit_pos);
      +
     -+	/*
     -+	 * Lazily load the xor'd bitmap if required (and we haven't done so
     -+	 * already). Make sure to pass the xor'd bitmap's position along as a
     -+	 * hint to avoid an unnecessary binary search in
     -+	 * stored_bitmap_for_commit().
     -+	 */
     -+	if (xor_pos == 0xffffffff) {
     -+		xor_bitmap = NULL;
     -+	} else {
     -+		struct commit *xor_commit;
     ++	if (!found)
     ++		return NULL;
     ++
     ++	triplet = bsearch(&commit_pos, bitmap_git->table_lookup, bitmap_git->entry_count,
     ++						triplet_sz, triplet_cmp);
     ++	if (!triplet)
     ++		return NULL;
     ++
     ++	offset = triplet_get_offset(triplet);
     ++	xor_pos = triplet_get_xor_pos(triplet);
     ++
     ++	if (xor_pos != 0xffffffff) {
     ++		int xor_flags;
     ++		uint64_t offset_xor;
     ++		uint32_t *xor_positions;
      +		struct object_id xor_oid;
     ++		size_t size = 0;
      +
     -+		oidread(&xor_oid, bitmap_oid_pos(bitmap_git, xor_pos));
     ++		ALLOC_ARRAY(xor_positions, bitmap_git->entry_count);
     ++		while (xor_pos != 0xffffffff) {
     ++			xor_positions[size++] = xor_pos;
     ++			triplet = bitmap_get_triplet(bitmap_git, xor_pos);
     ++			xor_pos = triplet_get_xor_pos(triplet);
     ++		}
      +
     -+		xor_commit = lookup_commit(the_repository, &xor_oid);
     -+		if (!xor_commit)
     -+			return NULL;
     ++		while (size){
     ++			xor_pos = xor_positions[size - 1];
     ++			triplet = bitmap_get_triplet(bitmap_git, xor_pos);
     ++			commit_pos = get_be32(triplet);
     ++			offset_xor = triplet_get_offset(triplet);
     ++
     ++			if (nth_bitmap_object_oid(bitmap_git, &xor_oid, commit_pos) < 0) {
     ++				free(xor_positions);
     ++				return NULL;
     ++			}
     ++
     ++			bitmap_git->map_pos = offset_xor + sizeof(uint32_t) + sizeof(uint8_t);
     ++			xor_flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
     ++			bitmap = read_bitmap_1(bitmap_git);
     ++
     ++			if (!bitmap){
     ++				free(xor_positions);
     ++				return NULL;
     ++			}
     ++
     ++			xor_bitmap = store_bitmap(bitmap_git, bitmap, &xor_oid, xor_bitmap, xor_flags);
     ++			size--;
     ++		}
      +
     -+		xor_bitmap = stored_bitmap_for_commit(bitmap_git, xor_commit,
     -+						      &xor_pos);
     ++		free(xor_positions);
      +	}
      +
     -+	/*
     -+	 * Don't bother reading the commit's index position or its xor
     -+	 * offset:
     -+	 *
     -+	 *   - The commit's index position is irrelevant to us, since
     -+	 *     load_bitmap_entries_v1 only uses it to learn the object
     -+	 *     id which is used to compute the hashmap's key. We already
     -+	 *     have an object id, so no need to look it up again.
     -+	 *
     -+	 *   - The xor_offset is unusable for us, since it specifies how
     -+	 *     many entries previous to ours we should look at. This
     -+	 *     makes sense when reading the bitmaps sequentially (as in
     -+	 *     load_bitmap_entries_v1()), since we can keep track of
     -+	 *     each bitmap as we read them.
     -+	 *
     -+	 *     But it can't work for us, since the bitmap's don't have a
     -+	 *     fixed size. So we learn the position of the xor'd bitmap
     -+	 *     from the commit table (and resolve it to a bitmap in the
     -+	 *     above if-statement).
     -+	 *
     -+	 * Instead, we can skip ahead and immediately read the flags and
     -+	 * ewah bitmap.
     -+	 */
     -+	bitmap_git->map_pos = bitmap_ofs + sizeof(uint32_t) + sizeof(uint8_t);
     ++	bitmap_git->map_pos = offset + sizeof(uint32_t) + sizeof(uint8_t);
      +	flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
      +	bitmap = read_bitmap_1(bitmap_git);
     ++
      +	if (!bitmap)
      +		return NULL;
      +
      +	return store_bitmap(bitmap_git, bitmap, oid, xor_bitmap, flags);
      +}
      +
     -+static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
     -+						      struct commit *commit,
     -+						      uint32_t *pos_hint)
     + struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
     + 				      struct commit *commit)
       {
       	khiter_t hash_pos = kh_get_oid_map(bitmap_git->bitmaps,
       					   commit->object.oid);
      -	if (hash_pos >= kh_end(bitmap_git->bitmaps))
     +-		return NULL;
      +	if (hash_pos >= kh_end(bitmap_git->bitmaps)) {
     -+		uint32_t commit_pos;
     ++		struct stored_bitmap *bitmap = NULL;
      +		if (!bitmap_git->table_lookup)
      +			return NULL;
      +
     -+		/* NEEDSWORK: cache misses aren't recorded. */
     -+		if (pos_hint)
     -+			commit_pos = *pos_hint;
     -+		else if (!bitmap_table_lookup(bitmap_git,
     -+					      &commit->object.oid,
     -+					      &commit_pos))
     ++		/* NEEDSWORK: cache misses aren't recorded */
     ++		bitmap = lazy_bitmap_for_commit(bitmap_git, commit);
     ++		if(!bitmap)
      +			return NULL;
     -+		return lazy_bitmap_for_commit(bitmap_git, &commit->object.oid,
     -+					      commit_pos);
     ++		return lookup_stored_bitmap(bitmap);
      +	}
     -+	return kh_value(bitmap_git->bitmaps, hash_pos);
     -+}
     -+
     -+struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
     -+				      struct commit *commit)
     -+{
     -+	struct stored_bitmap *sb = stored_bitmap_for_commit(bitmap_git, commit,
     -+							    NULL);
     -+	if (!sb)
     - 		return NULL;
     --	return lookup_stored_bitmap(kh_value(bitmap_git->bitmaps, hash_pos));
     -+	return lookup_stored_bitmap(sb);
     + 	return lookup_stored_bitmap(kh_value(bitmap_git->bitmaps, hash_pos));
       }
       
     - static inline int bitmap_position_extended(struct bitmap_index *bitmap_git,
      @@ pack-bitmap.c: void test_bitmap_walk(struct rev_info *revs)
       	if (revs->pending.nr != 1)
       		die("you must specify exactly one commit to test");
       
      -	fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
     --		bitmap_git->version, bitmap_git->entry_count);
     ++	fprintf(stderr, "Bitmap v%d test (%d entries)\n",
     + 		bitmap_git->version, bitmap_git->entry_count);
     + 
      +	if (!bitmap_git->table_lookup)
      +		fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
      +			bitmap_git->version, bitmap_git->entry_count);
     - 
     ++
       	root = revs->pending.objects[0].item;
       	bm = bitmap_for_commit(bitmap_git, (struct commit *)root);
     + 
     +@@ pack-bitmap.c: void test_bitmap_walk(struct rev_info *revs)
     + 
     + int test_bitmap_commits(struct repository *r)
     + {
     +-	struct bitmap_index *bitmap_git = prepare_bitmap_git(r);
     ++	struct bitmap_index *bitmap_git = NULL;
     + 	struct object_id oid;
     + 	MAYBE_UNUSED void *value;
     + 
     ++	/* As this function is only used to print bitmap selected
     ++	 * commits, we don't have to read the commit table.
     ++	 */
     ++	setenv("GIT_TEST_READ_COMMIT_TABLE", "0", 1);
     ++
     ++	bitmap_git = prepare_bitmap_git(r);
     + 	if (!bitmap_git)
     + 		die("failed to load bitmap indexes");
     + 
     +@@ pack-bitmap.c: int test_bitmap_commits(struct repository *r)
     + 		printf("%s\n", oid_to_hex(&oid));
     + 	});
     + 
     ++	setenv("GIT_TEST_READ_COMMIT_TABLE", "1", 1);
     + 	free_bitmap_index(bitmap_git);
     + 
     + 	return 0;
      
     - ## pack-bitmap.h ##
     -@@ pack-bitmap.h: struct bitmap_disk_header {
     - enum pack_bitmap_opts {
     - 	BITMAP_OPT_FULL_DAG = 1,
     - 	BITMAP_OPT_HASH_CACHE = 4,
     -+	BITMAP_OPT_LOOKUP_TABLE = 16,
     - };
     + ## t/t5310-pack-bitmaps.sh ##
     +@@ t/t5310-pack-bitmaps.sh: test_expect_success 'full repack creates bitmaps' '
     + 	grep "\"label\":\"writing_lookup_table\"" trace
     + '
     + 
     ++test_expect_success 'using lookup table loads only necessary bitmaps' '
     ++	git rev-list --test-bitmap HEAD 2>out &&
     ++	! grep "Bitmap v1 test (106 entries loaded)" out &&
     ++	grep "Found bitmap for" out
     ++'
     ++
     + basic_bitmap_tests
       
     - enum pack_bitmap_flags {
     + test_expect_success 'incremental repack fails when bitmaps are requested' '
     +@@ t/t5310-pack-bitmaps.sh: test_expect_success 'pack reuse respects --incremental' '
     + 
     + test_expect_success 'truncated bitmap fails gracefully (ewah)' '
     + 	test_config pack.writebitmaphashcache false &&
     ++	test_config pack.writebitmaplookuptable false &&
     + 	git repack -ad &&
     + 	git rev-list --use-bitmap-index --count --all >expect &&
     + 	bitmap=$(ls .git/objects/pack/*.bitmap) &&
     +
     + ## t/t5326-multi-pack-bitmaps.sh ##
     +@@ t/t5326-multi-pack-bitmaps.sh: test_expect_success 'multi-pack-index write writes lookup table if enabled' '
     + 		grep "\"label\":\"writing_lookup_table\"" trace
     + 	)
     + '
     ++
     + test_done
 5:  a404779a30f < -:  ----------- bitmap-commit-table: add tests for the bitmap lookup table
 6:  f5f725a3fe2 ! 5:  96c0041688f bitmap-lookup-table: add performance tests
     @@ Metadata
      Author: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
      
       ## Commit message ##
     -    bitmap-lookup-table: add performance tests
     +    bitmap-lookup-table: add performance tests for lookup table
      
     -    Add performance tests for bitmap lookup table extension.
     +    Add performance tests to verify the performance of lookup table.
     +
     +    Lookup table makes Git run faster in most of the cases. Below is the
     +    result of `t/perf/p5310-pack-bitmaps.sh`.`perf/p5326-multi-pack-bitmaps.sh`
     +    gives similar result. The repository used in the test is linux kernel.
     +
     +    Test                                                      this tree
     +    --------------------------------------------------------------------------
     +    5310.4: repack to disk (lookup=false)                   295.94(250.45+15.24)
     +    5310.5: simulated clone                                 12.52(5.07+1.40)
     +    5310.6: simulated fetch                                 1.89(2.94+0.24)
     +    5310.7: pack to file (bitmap)                           41.39(20.33+7.20)
     +    5310.8: rev-list (commits)                              0.98(0.59+0.12)
     +    5310.9: rev-list (objects)                              3.40(3.27+0.10)
     +    5310.10: rev-list with tag negated via --not            0.07(0.02+0.04)
     +             --all (objects)
     +    5310.11: rev-list with negative tag (objects)           0.23(0.16+0.06)
     +    5310.12: rev-list count with blob:none                  0.26(0.18+0.07)
     +    5310.13: rev-list count with blob:limit=1k              6.45(5.94+0.37)
     +    5310.14: rev-list count with tree:0                     0.26(0.18+0.07)
     +    5310.15: simulated partial clone                        4.99(3.19+0.45)
     +    5310.19: repack to disk (lookup=true)                   269.67(174.70+21.33)
     +    5310.20: simulated clone                                11.03(5.07+1.11)
     +    5310.21: simulated fetch                                0.79(0.79+0.17)
     +    5310.22: pack to file (bitmap)                          43.03(20.28+7.43)
     +    5310.23: rev-list (commits)                             0.86(0.54+0.09)
     +    5310.24: rev-list (objects)                             3.35(3.26+0.07)
     +    5310.25: rev-list with tag negated via --not            0.05(0.00+0.03)
     +             --all (objects)
     +    5310.26: rev-list with negative tag (objects)           0.22(0.16+0.05)
     +    5310.27: rev-list count with blob:none                  0.22(0.16+0.05)
     +    5310.28: rev-list count with blob:limit=1k              6.45(5.87+0.31)
     +    5310.29: rev-list count with tree:0                     0.22(0.16+0.05)
     +    5310.30: simulated partial clone                        5.17(3.12+0.48)
     +
     +    Test 4-15 are tested without using lookup table. Same tests are
     +    repeated in 16-30 (using lookup table).
      
     -    Mentored-by: Taylor Blau <ttaylorr@github.com>
     -    Co-mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
          Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
     +    Mentored-by: Taylor Blau <me@ttaylorr.com>
     +    Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
      
       ## t/perf/p5310-pack-bitmaps.sh ##
     -@@ t/perf/p5310-pack-bitmaps.sh: test_perf_large_repo
     - # since we want to be able to compare bitmap-aware
     - # git versus non-bitmap git
     - #
     --# We intentionally use the deprecated pack.writebitmaps
     -+# We intentionally use the deprecated pack.writeBitmaps
     - # config so that we can test against older versions of git.
     - test_expect_success 'setup bitmap config' '
     --	git config pack.writebitmaps true
     -+	git config pack.writeBitmaps true &&
     -+	git config pack.writeReverseIndex true
     +@@ t/perf/p5310-pack-bitmaps.sh: test_expect_success 'setup bitmap config' '
     + 	git config pack.writebitmaps true
       '
       
     - # we need to create the tag up front such that it is covered by the repack and
     -@@ t/perf/p5310-pack-bitmaps.sh: test_perf 'repack to disk' '
     - 
     - test_full_bitmap
     - 
     +-# we need to create the tag up front such that it is covered by the repack and
     +-# thus by generated bitmaps.
     +-test_expect_success 'create tags' '
     +-	git tag --message="tag pointing to HEAD" perf-tag HEAD
     +-'
     +-
     +-test_perf 'repack to disk' '
     +-	git repack -ad
     +-'
     +-
     +-test_full_bitmap
     +-
      -test_expect_success 'create partial bitmap state' '
      -	# pick a commit to represent the repo tip in the past
      -	cutoff=$(git rev-list HEAD~100 -1) &&
     @@ t/perf/p5310-pack-bitmaps.sh: test_perf 'repack to disk' '
      -	# and now restore our original tip, as if the pushes
      -	# had happened
      -	git update-ref HEAD $orig_tip
     -+test_perf 'use lookup table' '
     -+    git config pack.writeBitmapLookupTable true
     - '
     - 
     +-'
     +-
      -test_partial_bitmap
     -+test_perf 'repack to disk (lookup table)' '
     -+    git repack -adb
     -+'
     ++test_bitmap () {
     ++    local enabled="$1"
      +
     -+test_full_bitmap
     ++	# we need to create the tag up front such that it is covered by the repack and
     ++	# thus by generated bitmaps.
     ++	test_expect_success 'create tags' '
     ++		git tag --message="tag pointing to HEAD" perf-tag HEAD
     ++	'
      +
     -+for i in false true
     -+do
     -+	$i && lookup=" (lookup table)"
     -+	test_expect_success "create partial bitmap state$lookup" '
     -+		git config pack.writeBitmapLookupTable '"$i"' &&
     ++	test_expect_success "use lookup table: $enabled" '
     ++		git config pack.writeBitmapLookupTable '"$enabled"'
     ++	'
     ++
     ++	test_perf "repack to disk (lookup=$enabled)" '
     ++		git repack -ad
     ++	'
     ++
     ++	test_full_bitmap
     ++
     ++    test_expect_success "create partial bitmap state (lookup=$enabled)" '
      +		# pick a commit to represent the repo tip in the past
      +		cutoff=$(git rev-list HEAD~100 -1) &&
      +		orig_tip=$(git rev-parse HEAD) &&
     @@ t/perf/p5310-pack-bitmaps.sh: test_perf 'repack to disk' '
      +		# and now restore our original tip, as if the pushes
      +		# had happened
      +		git update-ref HEAD $orig_tip
     -+	'
     ++    '
     ++}
      +
     -+	test_partial_bitmap
     -+done
     ++test_bitmap false
     ++test_bitmap true
       
       test_done
      
       ## t/perf/p5326-multi-pack-bitmaps.sh ##
     -@@ t/perf/p5326-multi-pack-bitmaps.sh: test_expect_success 'drop pack bitmap' '
     +@@ t/perf/p5326-multi-pack-bitmaps.sh: test_description='Tests performance using midx bitmaps'
       
     - test_full_bitmap
     + test_perf_large_repo
       
     +-# we need to create the tag up front such that it is covered by the repack and
     +-# thus by generated bitmaps.
     +-test_expect_success 'create tags' '
     +-	git tag --message="tag pointing to HEAD" perf-tag HEAD
     +-'
     +-
     +-test_expect_success 'start with bitmapped pack' '
     +-	git repack -adb
     +-'
     +-
     +-test_perf 'setup multi-pack index' '
     +-	git multi-pack-index write --bitmap
     +-'
     +-
     +-test_expect_success 'drop pack bitmap' '
     +-	rm -f .git/objects/pack/pack-*.bitmap
     +-'
     +-
     +-test_full_bitmap
     +-
      -test_expect_success 'create partial bitmap state' '
      -	# pick a commit to represent the repo tip in the past
      -	cutoff=$(git rev-list HEAD~100 -1) &&
     @@ t/perf/p5326-multi-pack-bitmaps.sh: test_expect_success 'drop pack bitmap' '
      -	# and now restore our original tip, as if the pushes
      -	# had happened
      -	git update-ref HEAD $orig_tip
     -+test_expect_success 'use lookup table' '
     -+	git config pack.writeBitmapLookupTable true
     - '
     - 
     +-'
     +-
      -test_partial_bitmap
     -+test_perf 'setup multi-pack-index (lookup table)' '
     -+	git multi-pack-index write --bitmap
     -+'
     ++test_bitmap () {
     ++    local enabled="$1"
     ++
     ++	# we need to create the tag up front such that it is covered by the repack and
     ++	# thus by generated bitmaps.
     ++	test_expect_success 'create tags' '
     ++		git tag --message="tag pointing to HEAD" perf-tag HEAD
     ++	'
      +
     -+test_full_bitmap
     ++	test_expect_success "use lookup table: $enabled" '
     ++		git config pack.writeBitmapLookupTable '"$enabled"'
     ++	'
     ++
     ++	test_expect_success "start with bitmapped pack (lookup=$enabled)" '
     ++		git repack -adb
     ++	'
     ++
     ++	test_perf "setup multi-pack index (lookup=$enabled)" '
     ++		git multi-pack-index write --bitmap
     ++	'
      +
     -+for i in false true
     -+do
     -+	$i && lookup=" (lookup table)"
     -+	test_expect_success "create partial bitmap state$lookup" '
     -+		git config pack.writeBitmapLookupTable '"$i"' &&
     ++	test_expect_success "drop pack bitmap (lookup=$enabled)" '
     ++		rm -f .git/objects/pack/pack-*.bitmap
     ++	'
     ++
     ++	test_full_bitmap
     ++
     ++    test_expect_success "create partial bitmap state (lookup=$enabled)" '
      +		# pick a commit to represent the repo tip in the past
      +		cutoff=$(git rev-list HEAD~100 -1) &&
      +		orig_tip=$(git rev-parse HEAD) &&
     @@ t/perf/p5326-multi-pack-bitmaps.sh: test_expect_success 'drop pack bitmap' '
      +		# and now restore our original tip, as if the pushes
      +		# had happened
      +		git update-ref HEAD $orig_tip
     -+	'
     ++    '
     ++}
      +
     -+	test_partial_bitmap
     -+done
     ++test_bitmap false
     ++test_bitmap true
       
       test_done
 -:  ----------- > 6:  fe556b58814 p5310-pack-bitmaps.sh: enable pack.writeReverseIndex for testing

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v2 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-26 13:10 ` [PATCH v2 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
@ 2022-06-26 13:10   ` Abhradeep Chakraborty via GitGitGadget
  2022-06-27 14:18     ` Derrick Stolee
  2022-06-26 13:10   ` [PATCH v2 2/6] pack-bitmap-write.c: write " Abhradeep Chakraborty via GitGitGadget
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 63+ messages in thread
From: Abhradeep Chakraborty via GitGitGadget @ 2022-06-26 13:10 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Kaartic Sivaram, Derrick Stolee,
	Abhradeep Chakraborty, Abhradeep Chakraborty

From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

When reading bitmap file, git loads each and every bitmap one by one
even if all the bitmaps are not required. A "bitmap lookup table"
extension to the bitmap format can reduce the overhead of loading
bitmaps which stores a list of bitmapped commit id pos (in the midx
or pack, along with their offset and xor offset. This way git can
load only the neccesary bitmaps without loading the previous bitmaps.

The older version of Git ignores the lookup table extension and doesn't
throw any kind of warning or error while parsing the bitmap file.

Add some information for the new "bitmap lookup table" extension in the
bitmap-format documentation.

Co-Authored-by: Taylor Blau <me@ttaylorr.com>
Mentored-by: Taylor Blau <me@ttaylorr.com>
Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
---
 Documentation/technical/bitmap-format.txt | 41 +++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
index 04b3ec21785..7d4e450d3d8 100644
--- a/Documentation/technical/bitmap-format.txt
+++ b/Documentation/technical/bitmap-format.txt
@@ -67,6 +67,19 @@ MIDXs, both the bit-cache and rev-cache extensions are required.
 			pack/MIDX. The format and meaning of the name-hash is
 			described below.
 
+			** {empty}
+			BITMAP_OPT_LOOKUP_TABLE (0x10): :::
+			If present, the end of the bitmap file contains a table
+			containing a list of `N` <commit pos, offset, xor offset>
+			triplets. The format and meaning of the table is described
+			below.
++
+NOTE: This xor_offset is different from the bitmap's xor_offset.
+Bitmap's xor_offset is relative i.e. it tells how many bitmaps we have
+to go back from the current bitmap. Lookup table's xor_offset tells the
+position of the triplet in the list whose bitmap the current commit's
+bitmap have to xor with.
+
 		4-byte entry count (network byte order)
 
 			The total count of entries (bitmapped commits) in this bitmap index.
@@ -205,3 +218,31 @@ Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag.
 If implementations want to choose a different hashing scheme, they are
 free to do so, but MUST allocate a new header flag (because comparing
 hashes made under two different schemes would be pointless).
+
+Commit lookup table
+-------------------
+
+If the BITMAP_OPT_LOOKUP_TABLE flag is set, the last `N * (4 + 8 + 4)`
+(preceding the name-hash cache and trailing hash) of the `.bitmap` file
+contains a lookup table specifying the information needed to get the
+desired bitmap from the entries without parsing previous unnecessary
+bitmaps.
+
+For a `.bitmap` containing `nr_entries` reachability bitmaps, the table
+contains a list of `nr_entries` <commit pos, offset, xor offset> triplets.
+The content of i'th triplet is -
+
+	* {empty}
+	commit pos (4 byte integer, network byte order): ::
+	It stores the object position of the commit (in the midx or pack index)
+	to which the i'th bitmap in the bitmap entries belongs.
+
+	* {empty}
+	offset (8 byte integer, network byte order): ::
+	The offset from which that commit's bitmap can be read.
+
+	* {empty}
+	xor offset (4 byte integer, network byte order): ::
+	It holds the position of the triplet with whose bitmap the
+	current bitmap need to xor. If the current triplet's bitmap
+	do not have any xor bitmap, it defaults to 0xffffffff.
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v2 2/6] pack-bitmap-write.c: write lookup table extension
  2022-06-26 13:10 ` [PATCH v2 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
  2022-06-26 13:10   ` [PATCH v2 1/6] Documentation/technical: describe bitmap lookup table extension Abhradeep Chakraborty via GitGitGadget
@ 2022-06-26 13:10   ` Abhradeep Chakraborty via GitGitGadget
  2022-06-27 14:35     ` Derrick Stolee
  2022-06-27 16:05     ` Taylor Blau
  2022-06-26 13:10   ` [PATCH v2 3/6] pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests Abhradeep Chakraborty via GitGitGadget
                     ` (3 subsequent siblings)
  5 siblings, 2 replies; 63+ messages in thread
From: Abhradeep Chakraborty via GitGitGadget @ 2022-06-26 13:10 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Kaartic Sivaram, Derrick Stolee,
	Abhradeep Chakraborty, Abhradeep Chakraborty

From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

The bitmap lookup table extension was documentated by an earlier
change, but Git does not yet knowhow to write that extension.

Teach git to write bitmap lookup table extension. The table contains
the list of `N` <commit pos, offset, xor offset>` triplets. These
triplets are sorted according to their commit pos (ascending order).
The meaning of each data in the i'th triplet is given below:

  - Commit pos is the position of the commit in the pack-index
    (or midx) to which the i'th bitmap belongs. It is a 4 byte
    network byte order integer.

  - offset is the position of the i'th bitmap.

  - xor offset denotes the position of the triplet with whose
    bitmap the current triplet's bitmap need to xor with.

Co-authored-by: Taylor Blau <me@ttaylorr.com>
Mentored-by: Taylor Blau <me@ttaylorr.com>
Co-mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
---
 pack-bitmap-write.c | 72 +++++++++++++++++++++++++++++++++++++++++++--
 pack-bitmap.h       |  5 ++--
 2 files changed, 73 insertions(+), 4 deletions(-)

diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index c43375bd344..899a4a941e1 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -650,7 +650,9 @@ static const struct object_id *oid_access(size_t pos, const void *table)
 
 static void write_selected_commits_v1(struct hashfile *f,
 				      struct pack_idx_entry **index,
-				      uint32_t index_nr)
+				      uint32_t index_nr,
+				      uint64_t *offsets,
+				      uint32_t *commit_positions)
 {
 	int i;
 
@@ -663,6 +665,11 @@ static void write_selected_commits_v1(struct hashfile *f,
 		if (commit_pos < 0)
 			BUG("trying to write commit not in index");
 
+		if (offsets)
+			offsets[i] = hashfile_total(f);
+		if (commit_positions)
+			commit_positions[i] = commit_pos;
+
 		hashwrite_be32(f, commit_pos);
 		hashwrite_u8(f, stored->xor_offset);
 		hashwrite_u8(f, stored->flags);
@@ -671,6 +678,55 @@ static void write_selected_commits_v1(struct hashfile *f,
 	}
 }
 
+static int table_cmp(const void *_va, const void *_vb, void *commit_positions)
+{
+	int8_t result = 0;
+	uint32_t *positions = (uint32_t *) commit_positions;
+	uint32_t a = positions[*(uint32_t *)_va];
+	uint32_t b = positions[*(uint32_t *)_vb];
+
+	if (a > b)
+		result = 1;
+	else if (a < b)
+		result = -1;
+	else
+		result = 0;
+
+	return result;
+}
+
+static void write_lookup_table(struct hashfile *f,
+			       uint64_t *offsets,
+			       uint32_t *commit_positions)
+{
+	uint32_t i;
+	uint32_t *table, *table_inv;
+
+	ALLOC_ARRAY(table, writer.selected_nr);
+	ALLOC_ARRAY(table_inv, writer.selected_nr);
+
+	for (i = 0; i < writer.selected_nr; i++)
+		table[i] = i;
+
+	QSORT_S(table, writer.selected_nr, table_cmp, commit_positions);
+
+	for (i = 0; i < writer.selected_nr; i++)
+		table_inv[table[i]] = i;
+
+	for (i = 0; i < writer.selected_nr; i++) {
+		struct bitmapped_commit *selected = &writer.selected[table[i]];
+		uint32_t xor_offset = selected->xor_offset;
+
+		hashwrite_be32(f, commit_positions[table[i]]);
+		hashwrite_be64(f, offsets[table[i]]);
+		hashwrite_be32(f, xor_offset ?
+				table_inv[table[i] - xor_offset]: 0xffffffff);
+	}
+
+	free(table);
+	free(table_inv);
+}
+
 static void write_hash_cache(struct hashfile *f,
 			     struct pack_idx_entry **index,
 			     uint32_t index_nr)
@@ -695,6 +751,8 @@ void bitmap_writer_finish(struct pack_idx_entry **index,
 {
 	static uint16_t default_version = 1;
 	static uint16_t flags = BITMAP_OPT_FULL_DAG;
+	uint64_t *offsets = NULL;
+	uint32_t *commit_positions = NULL;
 	struct strbuf tmp_file = STRBUF_INIT;
 	struct hashfile *f;
 
@@ -715,8 +773,16 @@ void bitmap_writer_finish(struct pack_idx_entry **index,
 	dump_bitmap(f, writer.trees);
 	dump_bitmap(f, writer.blobs);
 	dump_bitmap(f, writer.tags);
-	write_selected_commits_v1(f, index, index_nr);
 
+	if (options & BITMAP_OPT_LOOKUP_TABLE) {
+		CALLOC_ARRAY(offsets, index_nr);
+		CALLOC_ARRAY(commit_positions, index_nr);
+	}
+
+	write_selected_commits_v1(f, index, index_nr, offsets, commit_positions);
+
+	if (options & BITMAP_OPT_LOOKUP_TABLE)
+		write_lookup_table(f, offsets, commit_positions);
 	if (options & BITMAP_OPT_HASH_CACHE)
 		write_hash_cache(f, index, index_nr);
 
@@ -730,4 +796,6 @@ void bitmap_writer_finish(struct pack_idx_entry **index,
 		die_errno("unable to rename temporary bitmap file to '%s'", filename);
 
 	strbuf_release(&tmp_file);
+	free(offsets);
+	free(commit_positions);
 }
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 3d3ddd77345..67a9d0fc303 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -24,8 +24,9 @@ struct bitmap_disk_header {
 #define NEEDS_BITMAP (1u<<22)
 
 enum pack_bitmap_opts {
-	BITMAP_OPT_FULL_DAG = 1,
-	BITMAP_OPT_HASH_CACHE = 4,
+	BITMAP_OPT_FULL_DAG = 0x1,
+	BITMAP_OPT_HASH_CACHE = 0x4,
+	BITMAP_OPT_LOOKUP_TABLE = 0x10,
 };
 
 enum pack_bitmap_flags {
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v2 3/6] pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests
  2022-06-26 13:10 ` [PATCH v2 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
  2022-06-26 13:10   ` [PATCH v2 1/6] Documentation/technical: describe bitmap lookup table extension Abhradeep Chakraborty via GitGitGadget
  2022-06-26 13:10   ` [PATCH v2 2/6] pack-bitmap-write.c: write " Abhradeep Chakraborty via GitGitGadget
@ 2022-06-26 13:10   ` Abhradeep Chakraborty via GitGitGadget
  2022-06-27 14:43     ` Derrick Stolee
  2022-06-27 17:47     ` Taylor Blau
  2022-06-26 13:10   ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension Abhradeep Chakraborty via GitGitGadget
                     ` (2 subsequent siblings)
  5 siblings, 2 replies; 63+ messages in thread
From: Abhradeep Chakraborty via GitGitGadget @ 2022-06-26 13:10 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Kaartic Sivaram, Derrick Stolee,
	Abhradeep Chakraborty, Abhradeep Chakraborty

From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

Teach git to provide a way for users to enable/disable bitmap lookup
table extension by providing a config option named 'writeBitmapLookupTable'.
Default is true.

Also add test to verify writting of lookup table.

Co-Authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
Mentored-by: Taylor Blau <me@ttaylorr.com>
Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
---
 Documentation/config/pack.txt |  7 +++++++
 builtin/multi-pack-index.c    |  8 ++++++++
 builtin/pack-objects.c        | 10 +++++++++-
 midx.c                        |  3 +++
 midx.h                        |  1 +
 pack-bitmap-write.c           |  2 ++
 t/t5310-pack-bitmaps.sh       |  3 ++-
 t/t5326-multi-pack-bitmaps.sh | 13 +++++++++++++
 8 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index ad7f73a1ead..6e1f454c4d6 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -164,6 +164,13 @@ When writing a multi-pack reachability bitmap, no new namehashes are
 computed; instead, any namehashes stored in an existing bitmap are
 permuted into their appropriate location when writing a new bitmap.
 
+pack.writeBitmapLookupTable::
+	When true, git will include a "lookup table" section in the
+	bitmap index (if one is written). This table is used to defer
+	loading individual bitmaps as late as possible. This can be
+	beneficial in repositories which have relatively large bitmap
+	indexes. Defaults to true.
+
 pack.writeReverseIndex::
 	When true, git will write a corresponding .rev file (see:
 	link:../technical/pack-format.html[Documentation/technical/pack-format.txt])
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 5edbb7fe86e..3757616f09c 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -87,6 +87,13 @@ static int git_multi_pack_index_write_config(const char *var, const char *value,
 			opts.flags &= ~MIDX_WRITE_BITMAP_HASH_CACHE;
 	}
 
+	if (!strcmp(var, "pack.writebitmaplookuptable")) {
+		if (git_config_bool(var, value))
+			opts.flags |= MIDX_WRITE_BITMAP_LOOKUP_TABLE;
+		else
+			opts.flags &= ~MIDX_WRITE_BITMAP_LOOKUP_TABLE;
+	}
+
 	/*
 	 * We should never make a fall-back call to 'git_default_config', since
 	 * this was already called in 'cmd_multi_pack_index()'.
@@ -123,6 +130,7 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 	};
 
 	opts.flags |= MIDX_WRITE_BITMAP_HASH_CACHE;
+	opts.flags |= MIDX_WRITE_BITMAP_LOOKUP_TABLE;
 
 	git_config(git_multi_pack_index_write_config, NULL);
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 39e28cfcafc..d6a33fd486c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -228,7 +228,7 @@ static enum {
 	WRITE_BITMAP_QUIET,
 	WRITE_BITMAP_TRUE,
 } write_bitmap_index;
-static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE;
+static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE | BITMAP_OPT_LOOKUP_TABLE;
 
 static int exclude_promisor_objects;
 
@@ -3148,6 +3148,14 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		else
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
+
+	if (!strcmp(k, "pack.writebitmaplookuptable")) {
+		if (git_config_bool(k, v))
+			write_bitmap_options |= BITMAP_OPT_LOOKUP_TABLE;
+		else
+			write_bitmap_options &= ~BITMAP_OPT_LOOKUP_TABLE;
+	}
+
 	if (!strcmp(k, "pack.usebitmaps")) {
 		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
diff --git a/midx.c b/midx.c
index 5f0dd386b02..9c26d04bfde 100644
--- a/midx.c
+++ b/midx.c
@@ -1072,6 +1072,9 @@ static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
 	if (flags & MIDX_WRITE_BITMAP_HASH_CACHE)
 		options |= BITMAP_OPT_HASH_CACHE;
 
+	if (flags & MIDX_WRITE_BITMAP_LOOKUP_TABLE)
+		options |= BITMAP_OPT_LOOKUP_TABLE;
+
 	prepare_midx_packing_data(&pdata, ctx);
 
 	commits = find_commits_for_midx_bitmap(&commits_nr, refs_snapshot, ctx);
diff --git a/midx.h b/midx.h
index 22e8e53288e..5578cd7b835 100644
--- a/midx.h
+++ b/midx.h
@@ -47,6 +47,7 @@ struct multi_pack_index {
 #define MIDX_WRITE_REV_INDEX (1 << 1)
 #define MIDX_WRITE_BITMAP (1 << 2)
 #define MIDX_WRITE_BITMAP_HASH_CACHE (1 << 3)
+#define MIDX_WRITE_BITMAP_LOOKUP_TABLE (1 << 4)
 
 const unsigned char *get_midx_checksum(struct multi_pack_index *m);
 void get_midx_filename(struct strbuf *out, const char *object_dir);
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index 899a4a941e1..79be0cf80e6 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -713,6 +713,7 @@ static void write_lookup_table(struct hashfile *f,
 	for (i = 0; i < writer.selected_nr; i++)
 		table_inv[table[i]] = i;
 
+	trace2_region_enter("pack-bitmap-write", "writing_lookup_table", the_repository);
 	for (i = 0; i < writer.selected_nr; i++) {
 		struct bitmapped_commit *selected = &writer.selected[table[i]];
 		uint32_t xor_offset = selected->xor_offset;
@@ -725,6 +726,7 @@ static void write_lookup_table(struct hashfile *f,
 
 	free(table);
 	free(table_inv);
+	trace2_region_leave("pack-bitmap-write", "writing_lookup_table", the_repository);
 }
 
 static void write_hash_cache(struct hashfile *f,
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index f775fc1ce69..c669ed959e9 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -38,7 +38,8 @@ test_expect_success 'full repack creates bitmaps' '
 	ls .git/objects/pack/ | grep bitmap >output &&
 	test_line_count = 1 output &&
 	grep "\"key\":\"num_selected_commits\",\"value\":\"106\"" trace &&
-	grep "\"key\":\"num_maximal_commits\",\"value\":\"107\"" trace
+	grep "\"key\":\"num_maximal_commits\",\"value\":\"107\"" trace &&
+	grep "\"label\":\"writing_lookup_table\"" trace
 '
 
 basic_bitmap_tests
diff --git a/t/t5326-multi-pack-bitmaps.sh b/t/t5326-multi-pack-bitmaps.sh
index 4fe57414c13..43be49617b8 100755
--- a/t/t5326-multi-pack-bitmaps.sh
+++ b/t/t5326-multi-pack-bitmaps.sh
@@ -307,4 +307,17 @@ test_expect_success 'graceful fallback when missing reverse index' '
 	)
 '
 
+test_expect_success 'multi-pack-index write writes lookup table if enabled' '
+	rm -fr repo &&
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+		test_commit base &&
+		git repack -ad &&
+		GIT_TRACE2_EVENT="$(pwd)/trace" \
+			git multi-pack-index write --bitmap &&
+		grep "\"label\":\"writing_lookup_table\"" trace
+	)
+'
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension
  2022-06-26 13:10 ` [PATCH v2 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
                     ` (2 preceding siblings ...)
  2022-06-26 13:10   ` [PATCH v2 3/6] pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests Abhradeep Chakraborty via GitGitGadget
@ 2022-06-26 13:10   ` Abhradeep Chakraborty via GitGitGadget
  2022-06-27 15:12     ` Derrick Stolee
  2022-06-27 21:38     ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension Taylor Blau
  2022-06-26 13:10   ` [PATCH v2 5/6] bitmap-lookup-table: add performance tests for lookup table Abhradeep Chakraborty via GitGitGadget
  2022-06-26 13:10   ` [PATCH v2 6/6] p5310-pack-bitmaps.sh: enable pack.writeReverseIndex for testing Abhradeep Chakraborty via GitGitGadget
  5 siblings, 2 replies; 63+ messages in thread
From: Abhradeep Chakraborty via GitGitGadget @ 2022-06-26 13:10 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Kaartic Sivaram, Derrick Stolee,
	Abhradeep Chakraborty, Abhradeep Chakraborty

From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

Earlier change teaches Git to write bitmap lookup table. But Git
does not know how to parse them.

Teach Git to parse the existing bitmap lookup table. The older
versions of git are not affected by it. Those versions ignore the
lookup table.

Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
Mentored-by: Taylor Blau <me@ttaylorr.com>
Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
---
 pack-bitmap.c                 | 193 ++++++++++++++++++++++++++++++++--
 t/t5310-pack-bitmaps.sh       |   7 ++
 t/t5326-multi-pack-bitmaps.sh |   1 +
 3 files changed, 191 insertions(+), 10 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 36134222d7a..9e09c5824fc 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -82,6 +82,12 @@ struct bitmap_index {
 	/* The checksum of the packfile or MIDX; points into map. */
 	const unsigned char *checksum;
 
+	/*
+	 * If not NULL, this point into the commit table extension
+	 * (within map).
+	 */
+	unsigned char *table_lookup;
+
 	/*
 	 * Extended index.
 	 *
@@ -185,6 +191,22 @@ static int load_bitmap_header(struct bitmap_index *index)
 			index->hashes = (void *)(index_end - cache_size);
 			index_end -= cache_size;
 		}
+
+		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
+			git_env_bool("GIT_TEST_READ_COMMIT_TABLE", 1)) {
+			size_t table_size = 0;
+			size_t triplet_sz = st_add3(sizeof(uint32_t),    /* commit position */
+							sizeof(uint64_t),    /* offset */
+							sizeof(uint32_t));    /* xor offset */
+
+			table_size = st_add(table_size,
+					st_mult(ntohl(header->entry_count),
+						triplet_sz));
+			if (table_size > index_end - index->map - header_size)
+				return error("corrupted bitmap index file (too short to fit lookup table)");
+			index->table_lookup = (void *)(index_end - table_size);
+			index_end -= table_size;
+		}
 	}
 
 	index->entry_count = ntohl(header->entry_count);
@@ -211,12 +233,20 @@ static struct stored_bitmap *store_bitmap(struct bitmap_index *index,
 
 	hash_pos = kh_put_oid_map(index->bitmaps, stored->oid, &ret);
 
-	/* a 0 return code means the insertion succeeded with no changes,
-	 * because the SHA1 already existed on the map. this is bad, there
-	 * shouldn't be duplicated commits in the index */
+	/* A 0 return code means the insertion succeeded with no changes,
+	 * because the SHA1 already existed on the map. If lookup table
+	 * is NULL, this is bad, there shouldn't be duplicated commits
+	 * in the index.
+	 *
+	 * If table_lookup exists, that means the desired bitmap is already
+	 * loaded. Either this bitmap has been stored directly or another
+	 * bitmap has a direct or indirect xor relation with it. */
 	if (ret == 0) {
-		error("Duplicate entry in bitmap index: %s", oid_to_hex(oid));
-		return NULL;
+		if (!index->table_lookup) {
+			error("Duplicate entry in bitmap index: %s", oid_to_hex(oid));
+			return NULL;
+		}
+		return kh_value(index->bitmaps, hash_pos);
 	}
 
 	kh_value(index->bitmaps, hash_pos) = stored;
@@ -470,7 +500,7 @@ static int load_bitmap(struct bitmap_index *bitmap_git)
 		!(bitmap_git->tags = read_bitmap_1(bitmap_git)))
 		goto failed;
 
-	if (load_bitmap_entries_v1(bitmap_git) < 0)
+	if (!bitmap_git->table_lookup && load_bitmap_entries_v1(bitmap_git) < 0)
 		goto failed;
 
 	return 0;
@@ -557,13 +587,145 @@ struct include_data {
 	struct bitmap *seen;
 };
 
+static inline const void *bitmap_get_triplet(struct bitmap_index *bitmap_git, uint32_t xor_pos)
+{
+	size_t triplet_sz = st_add3(sizeof(uint32_t), sizeof(uint64_t), sizeof(uint32_t));
+	const void *p = bitmap_git->table_lookup + st_mult(xor_pos, triplet_sz);
+	return p;
+}
+
+static uint64_t triplet_get_offset(const void *triplet)
+{
+	const void *p = (unsigned char*) triplet + sizeof(uint32_t);
+	return get_be64(p);
+}
+
+static uint32_t triplet_get_xor_pos(const void *triplet)
+{
+	const void *p = (unsigned char*) triplet + st_add(sizeof(uint32_t), sizeof(uint64_t));
+	return get_be32(p);
+}
+
+static int triplet_cmp(const void *va, const void *vb)
+{
+	int result = 0;
+	uint32_t *a = (uint32_t *) va;
+	uint32_t b = get_be32(vb);
+	if (*a > b)
+		result = 1;
+	else if (*a < b)
+		result = -1;
+	else
+		result = 0;
+
+	return result;
+}
+
+static uint32_t bsearch_pos(struct bitmap_index *bitmap_git, struct object_id *oid,
+						uint32_t *result)
+{
+	int found;
+
+	if (bitmap_git->midx)
+		found = bsearch_midx(oid, bitmap_git->midx, result);
+	else
+		found = bsearch_pack(oid, bitmap_git->pack, result);
+
+	return found;
+}
+
+static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
+					  struct commit *commit)
+{
+	uint32_t commit_pos, xor_pos;
+	uint64_t offset;
+	int flags;
+	const void *triplet = NULL;
+	struct object_id *oid = &commit->object.oid;
+	struct ewah_bitmap *bitmap;
+	struct stored_bitmap *xor_bitmap = NULL;
+	size_t triplet_sz = st_add3(sizeof(uint32_t), sizeof(uint64_t), sizeof(uint32_t));
+
+	int found = bsearch_pos(bitmap_git, oid, &commit_pos);
+
+	if (!found)
+		return NULL;
+
+	triplet = bsearch(&commit_pos, bitmap_git->table_lookup, bitmap_git->entry_count,
+						triplet_sz, triplet_cmp);
+	if (!triplet)
+		return NULL;
+
+	offset = triplet_get_offset(triplet);
+	xor_pos = triplet_get_xor_pos(triplet);
+
+	if (xor_pos != 0xffffffff) {
+		int xor_flags;
+		uint64_t offset_xor;
+		uint32_t *xor_positions;
+		struct object_id xor_oid;
+		size_t size = 0;
+
+		ALLOC_ARRAY(xor_positions, bitmap_git->entry_count);
+		while (xor_pos != 0xffffffff) {
+			xor_positions[size++] = xor_pos;
+			triplet = bitmap_get_triplet(bitmap_git, xor_pos);
+			xor_pos = triplet_get_xor_pos(triplet);
+		}
+
+		while (size){
+			xor_pos = xor_positions[size - 1];
+			triplet = bitmap_get_triplet(bitmap_git, xor_pos);
+			commit_pos = get_be32(triplet);
+			offset_xor = triplet_get_offset(triplet);
+
+			if (nth_bitmap_object_oid(bitmap_git, &xor_oid, commit_pos) < 0) {
+				free(xor_positions);
+				return NULL;
+			}
+
+			bitmap_git->map_pos = offset_xor + sizeof(uint32_t) + sizeof(uint8_t);
+			xor_flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
+			bitmap = read_bitmap_1(bitmap_git);
+
+			if (!bitmap){
+				free(xor_positions);
+				return NULL;
+			}
+
+			xor_bitmap = store_bitmap(bitmap_git, bitmap, &xor_oid, xor_bitmap, xor_flags);
+			size--;
+		}
+
+		free(xor_positions);
+	}
+
+	bitmap_git->map_pos = offset + sizeof(uint32_t) + sizeof(uint8_t);
+	flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
+	bitmap = read_bitmap_1(bitmap_git);
+
+	if (!bitmap)
+		return NULL;
+
+	return store_bitmap(bitmap_git, bitmap, oid, xor_bitmap, flags);
+}
+
 struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
 				      struct commit *commit)
 {
 	khiter_t hash_pos = kh_get_oid_map(bitmap_git->bitmaps,
 					   commit->object.oid);
-	if (hash_pos >= kh_end(bitmap_git->bitmaps))
-		return NULL;
+	if (hash_pos >= kh_end(bitmap_git->bitmaps)) {
+		struct stored_bitmap *bitmap = NULL;
+		if (!bitmap_git->table_lookup)
+			return NULL;
+
+		/* NEEDSWORK: cache misses aren't recorded */
+		bitmap = lazy_bitmap_for_commit(bitmap_git, commit);
+		if(!bitmap)
+			return NULL;
+		return lookup_stored_bitmap(bitmap);
+	}
 	return lookup_stored_bitmap(kh_value(bitmap_git->bitmaps, hash_pos));
 }
 
@@ -1699,9 +1861,13 @@ void test_bitmap_walk(struct rev_info *revs)
 	if (revs->pending.nr != 1)
 		die("you must specify exactly one commit to test");
 
-	fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
+	fprintf(stderr, "Bitmap v%d test (%d entries)\n",
 		bitmap_git->version, bitmap_git->entry_count);
 
+	if (!bitmap_git->table_lookup)
+		fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
+			bitmap_git->version, bitmap_git->entry_count);
+
 	root = revs->pending.objects[0].item;
 	bm = bitmap_for_commit(bitmap_git, (struct commit *)root);
 
@@ -1753,10 +1919,16 @@ void test_bitmap_walk(struct rev_info *revs)
 
 int test_bitmap_commits(struct repository *r)
 {
-	struct bitmap_index *bitmap_git = prepare_bitmap_git(r);
+	struct bitmap_index *bitmap_git = NULL;
 	struct object_id oid;
 	MAYBE_UNUSED void *value;
 
+	/* As this function is only used to print bitmap selected
+	 * commits, we don't have to read the commit table.
+	 */
+	setenv("GIT_TEST_READ_COMMIT_TABLE", "0", 1);
+
+	bitmap_git = prepare_bitmap_git(r);
 	if (!bitmap_git)
 		die("failed to load bitmap indexes");
 
@@ -1764,6 +1936,7 @@ int test_bitmap_commits(struct repository *r)
 		printf("%s\n", oid_to_hex(&oid));
 	});
 
+	setenv("GIT_TEST_READ_COMMIT_TABLE", "1", 1);
 	free_bitmap_index(bitmap_git);
 
 	return 0;
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index c669ed959e9..10d7691d973 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -42,6 +42,12 @@ test_expect_success 'full repack creates bitmaps' '
 	grep "\"label\":\"writing_lookup_table\"" trace
 '
 
+test_expect_success 'using lookup table loads only necessary bitmaps' '
+	git rev-list --test-bitmap HEAD 2>out &&
+	! grep "Bitmap v1 test (106 entries loaded)" out &&
+	grep "Found bitmap for" out
+'
+
 basic_bitmap_tests
 
 test_expect_success 'incremental repack fails when bitmaps are requested' '
@@ -255,6 +261,7 @@ test_expect_success 'pack reuse respects --incremental' '
 
 test_expect_success 'truncated bitmap fails gracefully (ewah)' '
 	test_config pack.writebitmaphashcache false &&
+	test_config pack.writebitmaplookuptable false &&
 	git repack -ad &&
 	git rev-list --use-bitmap-index --count --all >expect &&
 	bitmap=$(ls .git/objects/pack/*.bitmap) &&
diff --git a/t/t5326-multi-pack-bitmaps.sh b/t/t5326-multi-pack-bitmaps.sh
index 43be49617b8..7d36dbcf722 100755
--- a/t/t5326-multi-pack-bitmaps.sh
+++ b/t/t5326-multi-pack-bitmaps.sh
@@ -320,4 +320,5 @@ test_expect_success 'multi-pack-index write writes lookup table if enabled' '
 		grep "\"label\":\"writing_lookup_table\"" trace
 	)
 '
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v2 5/6] bitmap-lookup-table: add performance tests for lookup table
  2022-06-26 13:10 ` [PATCH v2 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
                     ` (3 preceding siblings ...)
  2022-06-26 13:10   ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension Abhradeep Chakraborty via GitGitGadget
@ 2022-06-26 13:10   ` Abhradeep Chakraborty via GitGitGadget
  2022-06-27 21:53     ` Taylor Blau
  2022-06-26 13:10   ` [PATCH v2 6/6] p5310-pack-bitmaps.sh: enable pack.writeReverseIndex for testing Abhradeep Chakraborty via GitGitGadget
  5 siblings, 1 reply; 63+ messages in thread
From: Abhradeep Chakraborty via GitGitGadget @ 2022-06-26 13:10 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Kaartic Sivaram, Derrick Stolee,
	Abhradeep Chakraborty, Abhradeep Chakraborty

From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

Add performance tests to verify the performance of lookup table.

Lookup table makes Git run faster in most of the cases. Below is the
result of `t/perf/p5310-pack-bitmaps.sh`.`perf/p5326-multi-pack-bitmaps.sh`
gives similar result. The repository used in the test is linux kernel.

Test                                                      this tree
--------------------------------------------------------------------------
5310.4: repack to disk (lookup=false)                   295.94(250.45+15.24)
5310.5: simulated clone                                 12.52(5.07+1.40)
5310.6: simulated fetch                                 1.89(2.94+0.24)
5310.7: pack to file (bitmap)                           41.39(20.33+7.20)
5310.8: rev-list (commits)                              0.98(0.59+0.12)
5310.9: rev-list (objects)                              3.40(3.27+0.10)
5310.10: rev-list with tag negated via --not		0.07(0.02+0.04)
         --all (objects)
5310.11: rev-list with negative tag (objects)           0.23(0.16+0.06)
5310.12: rev-list count with blob:none                  0.26(0.18+0.07)
5310.13: rev-list count with blob:limit=1k              6.45(5.94+0.37)
5310.14: rev-list count with tree:0                     0.26(0.18+0.07)
5310.15: simulated partial clone                        4.99(3.19+0.45)
5310.19: repack to disk (lookup=true)                   269.67(174.70+21.33)
5310.20: simulated clone                                11.03(5.07+1.11)
5310.21: simulated fetch                                0.79(0.79+0.17)
5310.22: pack to file (bitmap)                          43.03(20.28+7.43)
5310.23: rev-list (commits)                             0.86(0.54+0.09)
5310.24: rev-list (objects)                             3.35(3.26+0.07)
5310.25: rev-list with tag negated via --not		0.05(0.00+0.03)
	 --all (objects)
5310.26: rev-list with negative tag (objects)           0.22(0.16+0.05)
5310.27: rev-list count with blob:none                  0.22(0.16+0.05)
5310.28: rev-list count with blob:limit=1k              6.45(5.87+0.31)
5310.29: rev-list count with tree:0                     0.22(0.16+0.05)
5310.30: simulated partial clone                        5.17(3.12+0.48)

Test 4-15 are tested without using lookup table. Same tests are
repeated in 16-30 (using lookup table).

Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
Mentored-by: Taylor Blau <me@ttaylorr.com>
Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
---
 t/perf/p5310-pack-bitmaps.sh       | 77 ++++++++++++++-----------
 t/perf/p5326-multi-pack-bitmaps.sh | 93 ++++++++++++++++--------------
 2 files changed, 94 insertions(+), 76 deletions(-)

diff --git a/t/perf/p5310-pack-bitmaps.sh b/t/perf/p5310-pack-bitmaps.sh
index 7ad4f237bc3..6ff42bdd391 100755
--- a/t/perf/p5310-pack-bitmaps.sh
+++ b/t/perf/p5310-pack-bitmaps.sh
@@ -16,39 +16,48 @@ test_expect_success 'setup bitmap config' '
 	git config pack.writebitmaps true
 '
 
-# we need to create the tag up front such that it is covered by the repack and
-# thus by generated bitmaps.
-test_expect_success 'create tags' '
-	git tag --message="tag pointing to HEAD" perf-tag HEAD
-'
-
-test_perf 'repack to disk' '
-	git repack -ad
-'
-
-test_full_bitmap
-
-test_expect_success 'create partial bitmap state' '
-	# pick a commit to represent the repo tip in the past
-	cutoff=$(git rev-list HEAD~100 -1) &&
-	orig_tip=$(git rev-parse HEAD) &&
-
-	# now kill off all of the refs and pretend we had
-	# just the one tip
-	rm -rf .git/logs .git/refs/* .git/packed-refs &&
-	git update-ref HEAD $cutoff &&
-
-	# and then repack, which will leave us with a nice
-	# big bitmap pack of the "old" history, and all of
-	# the new history will be loose, as if it had been pushed
-	# up incrementally and exploded via unpack-objects
-	git repack -Ad &&
-
-	# and now restore our original tip, as if the pushes
-	# had happened
-	git update-ref HEAD $orig_tip
-'
-
-test_partial_bitmap
+test_bitmap () {
+    local enabled="$1"
+
+	# we need to create the tag up front such that it is covered by the repack and
+	# thus by generated bitmaps.
+	test_expect_success 'create tags' '
+		git tag --message="tag pointing to HEAD" perf-tag HEAD
+	'
+
+	test_expect_success "use lookup table: $enabled" '
+		git config pack.writeBitmapLookupTable '"$enabled"'
+	'
+
+	test_perf "repack to disk (lookup=$enabled)" '
+		git repack -ad
+	'
+
+	test_full_bitmap
+
+    test_expect_success "create partial bitmap state (lookup=$enabled)" '
+		# pick a commit to represent the repo tip in the past
+		cutoff=$(git rev-list HEAD~100 -1) &&
+		orig_tip=$(git rev-parse HEAD) &&
+
+		# now kill off all of the refs and pretend we had
+		# just the one tip
+		rm -rf .git/logs .git/refs/* .git/packed-refs &&
+		git update-ref HEAD $cutoff &&
+
+		# and then repack, which will leave us with a nice
+		# big bitmap pack of the "old" history, and all of
+		# the new history will be loose, as if it had been pushed
+		# up incrementally and exploded via unpack-objects
+		git repack -Ad &&
+
+		# and now restore our original tip, as if the pushes
+		# had happened
+		git update-ref HEAD $orig_tip
+    '
+}
+
+test_bitmap false
+test_bitmap true
 
 test_done
diff --git a/t/perf/p5326-multi-pack-bitmaps.sh b/t/perf/p5326-multi-pack-bitmaps.sh
index f2fa228f16a..d67e7437493 100755
--- a/t/perf/p5326-multi-pack-bitmaps.sh
+++ b/t/perf/p5326-multi-pack-bitmaps.sh
@@ -6,47 +6,56 @@ test_description='Tests performance using midx bitmaps'
 
 test_perf_large_repo
 
-# we need to create the tag up front such that it is covered by the repack and
-# thus by generated bitmaps.
-test_expect_success 'create tags' '
-	git tag --message="tag pointing to HEAD" perf-tag HEAD
-'
-
-test_expect_success 'start with bitmapped pack' '
-	git repack -adb
-'
-
-test_perf 'setup multi-pack index' '
-	git multi-pack-index write --bitmap
-'
-
-test_expect_success 'drop pack bitmap' '
-	rm -f .git/objects/pack/pack-*.bitmap
-'
-
-test_full_bitmap
-
-test_expect_success 'create partial bitmap state' '
-	# pick a commit to represent the repo tip in the past
-	cutoff=$(git rev-list HEAD~100 -1) &&
-	orig_tip=$(git rev-parse HEAD) &&
-
-	# now pretend we have just one tip
-	rm -rf .git/logs .git/refs/* .git/packed-refs &&
-	git update-ref HEAD $cutoff &&
-
-	# and then repack, which will leave us with a nice
-	# big bitmap pack of the "old" history, and all of
-	# the new history will be loose, as if it had been pushed
-	# up incrementally and exploded via unpack-objects
-	git repack -Ad &&
-	git multi-pack-index write --bitmap &&
-
-	# and now restore our original tip, as if the pushes
-	# had happened
-	git update-ref HEAD $orig_tip
-'
-
-test_partial_bitmap
+test_bitmap () {
+    local enabled="$1"
+
+	# we need to create the tag up front such that it is covered by the repack and
+	# thus by generated bitmaps.
+	test_expect_success 'create tags' '
+		git tag --message="tag pointing to HEAD" perf-tag HEAD
+	'
+
+	test_expect_success "use lookup table: $enabled" '
+		git config pack.writeBitmapLookupTable '"$enabled"'
+	'
+
+	test_expect_success "start with bitmapped pack (lookup=$enabled)" '
+		git repack -adb
+	'
+
+	test_perf "setup multi-pack index (lookup=$enabled)" '
+		git multi-pack-index write --bitmap
+	'
+
+	test_expect_success "drop pack bitmap (lookup=$enabled)" '
+		rm -f .git/objects/pack/pack-*.bitmap
+	'
+
+	test_full_bitmap
+
+    test_expect_success "create partial bitmap state (lookup=$enabled)" '
+		# pick a commit to represent the repo tip in the past
+		cutoff=$(git rev-list HEAD~100 -1) &&
+		orig_tip=$(git rev-parse HEAD) &&
+
+		# now pretend we have just one tip
+		rm -rf .git/logs .git/refs/* .git/packed-refs &&
+		git update-ref HEAD $cutoff &&
+
+		# and then repack, which will leave us with a nice
+		# big bitmap pack of the "old" history, and all of
+		# the new history will be loose, as if it had been pushed
+		# up incrementally and exploded via unpack-objects
+		git repack -Ad &&
+		git multi-pack-index write --bitmap &&
+
+		# and now restore our original tip, as if the pushes
+		# had happened
+		git update-ref HEAD $orig_tip
+    '
+}
+
+test_bitmap false
+test_bitmap true
 
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v2 6/6] p5310-pack-bitmaps.sh: enable pack.writeReverseIndex for testing
  2022-06-26 13:10 ` [PATCH v2 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
                     ` (4 preceding siblings ...)
  2022-06-26 13:10   ` [PATCH v2 5/6] bitmap-lookup-table: add performance tests for lookup table Abhradeep Chakraborty via GitGitGadget
@ 2022-06-26 13:10   ` Abhradeep Chakraborty via GitGitGadget
  2022-06-27 21:50     ` Taylor Blau
  5 siblings, 1 reply; 63+ messages in thread
From: Abhradeep Chakraborty via GitGitGadget @ 2022-06-26 13:10 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Kaartic Sivaram, Derrick Stolee,
	Abhradeep Chakraborty, Abhradeep Chakraborty

From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

Enable pack.writeReverseIndex to true to see the effect of writing
the reverse index in the existing bitmap tests (with and without
lookup table).

Below is the result of performance test. Output format is in
seconds.

Test                                             this tree
-------------------------------------------------------------------
5310.4: repack to disk (lookup=false)           294.92(257.60+14.29)
5310.5: simulated clone                         14.97(8.95+1.31)
5310.6: simulated fetch                         1.64(2.77+0.20)
5310.7: pack to file (bitmap)                   41.76(29.33+6.77)
5310.8: rev-list (commits)                      0.71(0.49+0.09)
5310.9: rev-list (objects)                      4.65(4.55+0.09)
5310.10: rev-list with tag negated via --not	0.08(0.02+0.05)
	 --all (objects)
5310.11: rev-list with negative tag (objects)   0.06(0.01+0.04)
5310.12: rev-list count with blob:none          0.09(0.03+0.05)
5310.13: rev-list count with blob:limit=1k      7.58(7.06+0.33)
5310.14: rev-list count with tree:0             0.09(0.03+0.06)
5310.15: simulated partial clone                8.64(8.04+0.35)
5310.19: repack to disk (lookup=true)           249.86(191.57+19.50)
5310.20: simulated clone                        13.67(8.83+1.06)
5310.21: simulated fetch                        0.50(0.63+0.13)
5310.22: pack to file (bitmap)                  41.24(28.99+6.67)
5310.23: rev-list (commits)                     0.67(0.50+0.07)
5310.24: rev-list (objects)                     4.88(4.79+0.08)
5310.25: rev-list with tag negated via --not    0.04(0.00+0.03)
	 --all (objects)
5310.26: rev-list with negative tag (objects)   0.05(0.00+0.04)
5310.27: rev-list count with blob:none          0.05(0.01+0.03)
5310.28: rev-list count with blob:limit=1k      8.02(7.16+0.34)
5310.29: rev-list count with tree:0             0.05(0.01+0.04)
5310.30: simulated partial clone                8.57(8.16+0.32)

Tests 4-15 are without the use of lookup table. The rests are
repeatation of the previous tests but using lookup table.

Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
Mentored-by: Taylor Blau <me@ttaylorr.com>
Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
---
 t/perf/p5310-pack-bitmaps.sh | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/t/perf/p5310-pack-bitmaps.sh b/t/perf/p5310-pack-bitmaps.sh
index 6ff42bdd391..9848c5d5040 100755
--- a/t/perf/p5310-pack-bitmaps.sh
+++ b/t/perf/p5310-pack-bitmaps.sh
@@ -13,7 +13,8 @@ test_perf_large_repo
 # We intentionally use the deprecated pack.writebitmaps
 # config so that we can test against older versions of git.
 test_expect_success 'setup bitmap config' '
-	git config pack.writebitmaps true
+	git config pack.writebitmaps true &&
+	git config pack.writeReverseIndex true
 '
 
 test_bitmap () {
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-26 13:10   ` [PATCH v2 1/6] Documentation/technical: describe bitmap lookup table extension Abhradeep Chakraborty via GitGitGadget
@ 2022-06-27 14:18     ` Derrick Stolee
  2022-06-27 15:48       ` Taylor Blau
  2022-06-27 16:51       ` Abhradeep Chakraborty
  0 siblings, 2 replies; 63+ messages in thread
From: Derrick Stolee @ 2022-06-27 14:18 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget, git
  Cc: Taylor Blau, Kaartic Sivaram, Abhradeep Chakraborty

On 6/26/2022 9:10 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> 
> When reading bitmap file, git loads each and every bitmap one by one
> even if all the bitmaps are not required. A "bitmap lookup table"
> extension to the bitmap format can reduce the overhead of loading
> bitmaps which stores a list of bitmapped commit id pos (in the midx
> or pack, along with their offset and xor offset. This way git can
> load only the neccesary bitmaps without loading the previous bitmaps.

s/neccesary/necessary/

> +			** {empty}
> +			BITMAP_OPT_LOOKUP_TABLE (0x10): :::
> +			If present, the end of the bitmap file contains a table
> +			containing a list of `N` <commit pos, offset, xor offset>

(Note that "commit pos" and "xor offset" here don't have underscores, but
your discussion below does use "xor_offset" with underscores.)

> +			triplets. The format and meaning of the table is described
> +			below.
> ++
> +NOTE: This xor_offset is different from the bitmap's xor_offset.
> +Bitmap's xor_offset is relative i.e. it tells how many bitmaps we have
> +to go back from the current bitmap. Lookup table's xor_offset tells the
> +position of the triplet in the list whose bitmap the current commit's
> +bitmap have to xor with.

I found this difficult to parse. Here is an attempt at a rewording. Please
let me know if I misunderstood something when reading your version:

  NOTE: The xor_offset stored in the BITMAP_OPT_LOOKUP_TABLE is different
  from the xor_offset used in the bitmap data table. The xor_offset in this
  table indicates the row number within this table of the commit whose
  bitmap is used for the XOR computation with the current commit's stored
  bitmap to create the proper logical reachability bitmap.

This does make me think that "xor_offset" should really be "xor_row" or
something like that.

>  		4-byte entry count (network byte order)
>  
>  			The total count of entries (bitmapped commits) in this bitmap index.
> @@ -205,3 +218,31 @@ Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag.
>  If implementations want to choose a different hashing scheme, they are
>  free to do so, but MUST allocate a new header flag (because comparing
>  hashes made under two different schemes would be pointless).
> +
> +Commit lookup table
> +-------------------
> +
> +If the BITMAP_OPT_LOOKUP_TABLE flag is set, the last `N * (4 + 8 + 4)`
> +(preceding the name-hash cache and trailing hash) of the `.bitmap` file
> +contains a lookup table specifying the information needed to get the
> +desired bitmap from the entries without parsing previous unnecessary
> +bitmaps.
> +
> +For a `.bitmap` containing `nr_entries` reachability bitmaps, the table
> +contains a list of `nr_entries` <commit pos, offset, xor offset> triplets.
> +The content of i'th triplet is -
> +
> +	* {empty}
> +	commit pos (4 byte integer, network byte order): ::
> +	It stores the object position of the commit (in the midx or pack index)
> +	to which the i'th bitmap in the bitmap entries belongs.

Ok, we are saving some space here, but relying on looking into the pack-index
or multi-pack-index to get the actual commit OID.

Since this is sorted by the order that stores the bitmaps, binary search will
no longer work on this list (unless we enforce that on the rest of the bitmap
file). I am going to expect that you parse this table into a hashmap in order
to allow fast commit lookups. I'll keep an eye out for that implementation.

> +	* {empty}
> +	offset (8 byte integer, network byte order): ::
> +	The offset from which that commit's bitmap can be read.
> +
> +	* {empty}
> +	xor offset (4 byte integer, network byte order): ::
> +	It holds the position of the triplet with whose bitmap the
> +	current bitmap need to xor. If the current triplet's bitmap
> +	do not have any xor bitmap, it defaults to 0xffffffff.

This last sentence seems backward. Perhaps:

  If the value is 0xffffffff, then the current bitmap has no xor bitmap.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/6] pack-bitmap-write.c: write lookup table extension
  2022-06-26 13:10   ` [PATCH v2 2/6] pack-bitmap-write.c: write " Abhradeep Chakraborty via GitGitGadget
@ 2022-06-27 14:35     ` Derrick Stolee
  2022-06-27 16:12       ` Taylor Blau
  2022-06-27 17:10       ` Abhradeep Chakraborty
  2022-06-27 16:05     ` Taylor Blau
  1 sibling, 2 replies; 63+ messages in thread
From: Derrick Stolee @ 2022-06-27 14:35 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget, git
  Cc: Taylor Blau, Kaartic Sivaram, Abhradeep Chakraborty

On 6/26/2022 9:10 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> 
> The bitmap lookup table extension was documentated by an earlier

s/documentated/documented/

> change, but Git does not yet knowhow to write that extension.

s/knowhow/know how/

> +static int table_cmp(const void *_va, const void *_vb, void *commit_positions)
> +{
> +	int8_t result = 0;
> +	uint32_t *positions = (uint32_t *) commit_positions;

nit: drop the space between the cast and commit_positions.

> +	uint32_t a = positions[*(uint32_t *)_va];
> +	uint32_t b = positions[*(uint32_t *)_vb];
> +
> +	if (a > b)
> +		result = 1;
> +	else if (a < b)
> +		result = -1;
> +	else
> +		result = 0;
> +
> +	return result;
> +}

Ok, here you are sorting by commit OID (indirectly by the order in the
[multi-]pack-index). I suppose that I misunderstood in the previous
patch, so that could use some more specific language, maybe.

> +static void write_lookup_table(struct hashfile *f,
> +			       uint64_t *offsets,
> +			       uint32_t *commit_positions)
> +{
> +	uint32_t i;
> +	uint32_t *table, *table_inv;
> +
> +	ALLOC_ARRAY(table, writer.selected_nr);
> +	ALLOC_ARRAY(table_inv, writer.selected_nr);
> +
> +	for (i = 0; i < writer.selected_nr; i++)
> +		table[i] = i;
> +
> +	QSORT_S(table, writer.selected_nr, table_cmp, commit_positions);

At the end of this sort, table[j] = i means that the ith bitmap corresponds
to the jth bitmapped commit in lex order of OIDs.

> +	for (i = 0; i < writer.selected_nr; i++)
> +		table_inv[table[i]] = i;

And table_inv helps us discover that relationship (ith bitmap to jth commit
by j = table_inv[i]).

> +	for (i = 0; i < writer.selected_nr; i++) {
> +		struct bitmapped_commit *selected = &writer.selected[table[i]];
> +		uint32_t xor_offset = selected->xor_offset;

Here, xor_offset is "number of bitmaps in relationship to the current bitmap"

> +		hashwrite_be32(f, commit_positions[table[i]]);
> +		hashwrite_be64(f, offsets[table[i]]);
> +		hashwrite_be32(f, xor_offset ?
> +				table_inv[table[i] - xor_offset]: 0xffffffff);

Which means that if "k = table[i] - xor_offset" that the xor base is the kth
bitmap. table_inv[k] gets us the position in this table of that bitmap's
commit.

(It's also strange to me that the offset is being _subtracted_, but I guess
the bitmap format requires the xor base to appear first so the offset does
not need to be a negative number ever.)

This last line is a bit complex.

	uint32_t xor_offset = selected->xor_offset;
	uint32_t xor_row = 0xffffffff;

	if (xor_offset) {
		uint32_t xor_order = table[i] - xor_offset;
		xor_row = table_inf[xor_order];
	}

...then we can "hashwrite_be32(f, xor_row);" when necessary. I'm not sure
that we need the "uint32_t xor_order" inside the "if (xor_offset)" block,
but splitting it helps add clarity to the multi-step computation.

>  enum pack_bitmap_opts {
> -	BITMAP_OPT_FULL_DAG = 1,
> -	BITMAP_OPT_HASH_CACHE = 4,
> +	BITMAP_OPT_FULL_DAG = 0x1,
> +	BITMAP_OPT_HASH_CACHE = 0x4,
> +	BITMAP_OPT_LOOKUP_TABLE = 0x10,
>  };

Excellent.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 3/6] pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests
  2022-06-26 13:10   ` [PATCH v2 3/6] pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests Abhradeep Chakraborty via GitGitGadget
@ 2022-06-27 14:43     ` Derrick Stolee
  2022-06-27 17:42       ` Abhradeep Chakraborty
  2022-06-27 17:47     ` Taylor Blau
  1 sibling, 1 reply; 63+ messages in thread
From: Derrick Stolee @ 2022-06-27 14:43 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget, git
  Cc: Taylor Blau, Kaartic Sivaram, Abhradeep Chakraborty

On 6/26/2022 9:10 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> 
> Teach git to provide a way for users to enable/disable bitmap lookup
> table extension by providing a config option named 'writeBitmapLookupTable'.
> Default is true.

I wonder if it makes sense to have it default to 'false' for now, but to
change that default after the feature has been shipped and running in
production for a while.

> Also add test to verify writting of lookup table.

s/writting/writing/

> +pack.writeBitmapLookupTable::
> +	When true, git will include a "lookup table" section in the

I think you should either use "Git" when talking about the software
generally, OR use "`git repack --write-bitmap-index` will include..."

> +	bitmap index (if one is written). This table is used to defer
> +	loading individual bitmaps as late as possible. This can be
> +	beneficial in repositories which have relatively large bitmap

s/which/that/

(I'm pretty sure that "that" is better. We're trying to restrict the set
of repositories we are talking about, not implying that all repositories
have this property.)

> +	indexes. Defaults to true.
> +

> --- a/pack-bitmap-write.c
> +++ b/pack-bitmap-write.c
> @@ -713,6 +713,7 @@ static void write_lookup_table(struct hashfile *f,
>  	for (i = 0; i < writer.selected_nr; i++)
>  		table_inv[table[i]] = i;
>  
> +	trace2_region_enter("pack-bitmap-write", "writing_lookup_table", the_repository);
>  	for (i = 0; i < writer.selected_nr; i++) {
>  		struct bitmapped_commit *selected = &writer.selected[table[i]];
>  		uint32_t xor_offset = selected->xor_offset;
> @@ -725,6 +726,7 @@ static void write_lookup_table(struct hashfile *f,
>  
>  	free(table);
>  	free(table_inv);
> +	trace2_region_leave("pack-bitmap-write", "writing_lookup_table", the_repository);
>  }

These lines seem misplaced. Maybe they were meant for the previous
patch?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension
  2022-06-26 13:10   ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension Abhradeep Chakraborty via GitGitGadget
@ 2022-06-27 15:12     ` Derrick Stolee
  2022-06-27 18:06       ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table Abhradeep Chakraborty
  2022-06-27 21:49       ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension Taylor Blau
  2022-06-27 21:38     ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension Taylor Blau
  1 sibling, 2 replies; 63+ messages in thread
From: Derrick Stolee @ 2022-06-27 15:12 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget, git
  Cc: Taylor Blau, Kaartic Sivaram, Abhradeep Chakraborty

On 6/26/2022 9:10 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> 
> Earlier change teaches Git to write bitmap lookup table. But Git
> does not know how to parse them.
> 
> Teach Git to parse the existing bitmap lookup table. The older
> versions of git are not affected by it. Those versions ignore the
> lookup table.
> 
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> Mentored-by: Taylor Blau <me@ttaylorr.com>
> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>

I didn't check the previous patches, but your sign-off should be the
last line of the message. (You are singing off on all previous content,
and any later content is not covered by your sign-off.)

> +
> +		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
> +			git_env_bool("GIT_TEST_READ_COMMIT_TABLE", 1)) {

nit: This alignment should use four spaces at the end so the second phrase
matches the start of the previous phrase. Like this:

		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
		    git_env_bool("GIT_TEST_READ_COMMIT_TABLE", 1)) {

Perhaps it looked right in your editor because it renders tabs as 4 spaces
instead of 8 spaces.

> +			size_t table_size = 0;
> +			size_t triplet_sz = st_add3(sizeof(uint32_t),    /* commit position */
> +							sizeof(uint64_t),    /* offset */
> +							sizeof(uint32_t));    /* xor offset */

The 4- vs 8-space tab view would also explain the alignment here:

			size_t triplet_sz = st_add3(sizeof(uint32_t),  /* commit position */
						    sizeof(uint64_t),  /* offset */
						    sizeof(uint32_t)); /* xor offset */

(I also modified the comment alignment.)

Of course, since these values are constants and have no risk of overflowing,
perhaps we can drop st_add3() here:


			size_t triplet_sz = sizeof(uint32_t) + /* commit position */
					    sizeof(uint64_t) +  /* offset */
					    sizeof(uint32_t); /* xor offset */

> +			table_size = st_add(table_size,
> +					st_mult(ntohl(header->entry_count),
> +						triplet_sz));

Here, we _do_ want to keep the st_mult(). Is the st_add() still necessary? It
seems this is a leftover from the previous version that had the 4-byte flag
data.

We set table_size to zero above. We could drop that initialization and instead
have this after the "size_t triplet_sz" definition:

			size_t table_size = st_mult(ntohl(header->entry_count),
						    triplet_sz));

> +			if (table_size > index_end - index->map - header_size)
> +				return error("corrupted bitmap index file (too short to fit lookup table)");

Please add "_(...)" around the error message so it can be translated.

> +			index->table_lookup = (void *)(index_end - table_size);
> +			index_end -= table_size;
> +		}

> -	/* a 0 return code means the insertion succeeded with no changes,
> -	 * because the SHA1 already existed on the map. this is bad, there
> -	 * shouldn't be duplicated commits in the index */
> +	/* A 0 return code means the insertion succeeded with no changes,
> +	 * because the SHA1 already existed on the map. If lookup table
> +	 * is NULL, this is bad, there shouldn't be duplicated commits
> +	 * in the index.
> +	 *
> +	 * If table_lookup exists, that means the desired bitmap is already
> +	 * loaded. Either this bitmap has been stored directly or another
> +	 * bitmap has a direct or indirect xor relation with it. */

If we are modifying this multi-line comment, then we should reformat it to
match convention:

	/*
	 * The first sentence starts after the comment start
	 * so it has symmetry with the comment end which is on
	 * its own line.
	 */

>  	if (ret == 0) {
> -		error("Duplicate entry in bitmap index: %s", oid_to_hex(oid));
> -		return NULL;
> +		if (!index->table_lookup) {
> +			error("Duplicate entry in bitmap index: %s", oid_to_hex(oid));

Errors start with lowercase letters. Please add translation markers "_(...)"

> +static uint32_t triplet_get_xor_pos(const void *triplet)
> +{
> +	const void *p = (unsigned char*) triplet + st_add(sizeof(uint32_t), sizeof(uint64_t));

This st_add() is not necessary since the constants will not overflow.

> +	return get_be32(p);
> +}
> +
> +static int triplet_cmp(const void *va, const void *vb)
> +{
> +	int result = 0;
> +	uint32_t *a = (uint32_t *) va;
> +	uint32_t b = get_be32(vb);
> +	if (*a > b)
> +		result = 1;
> +	else if (*a < b)
> +		result = -1;
> +	else
> +		result = 0;
> +
> +	return result;
> +}
> +
> +static uint32_t bsearch_pos(struct bitmap_index *bitmap_git, struct object_id *oid,
> +						uint32_t *result)

Strange wrapping. Perhaps

static uint32_t bsearch_pos(struct bitmap_index *bitmap_git,
			    struct object_id *oid,
			    uint32_t *result)

> +{
> +	int found;
> +
> +	if (bitmap_git->midx)
> +		found = bsearch_midx(oid, bitmap_git->midx, result);
> +	else
> +		found = bsearch_pack(oid, bitmap_git->pack, result);
> +
> +	return found;

Here, we are doing a binary search on the entire list of packed objects, which could
use quite a few more hops than a binary search on the bitmapped commits.

> +static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +					  struct commit *commit)
...
> +	int found = bsearch_pos(bitmap_git, oid, &commit_pos);
> +
> +	if (!found)
> +		return NULL;
> +
> +	triplet = bsearch(&commit_pos, bitmap_git->table_lookup, bitmap_git->entry_count,
> +						triplet_sz, triplet_cmp);

But I see, you are searching the pack-index for the position in the index, and _then_
searching the bitmap lookup table based on that position value.

I expected something different: binary search on the triplets where the comparison is
made by looking up the OID from the [multi-]pack-index and comparing that OID to the
commit OID we are looking for.

I'm not convinced that the binary search I had in mind is meaningfully faster than
what you've implemented here, so I'm happy to leave it as you have it. We can investigate
if that full search on the pack-index matters at all (it probably doesn't).

> +	if (!triplet)
> +		return NULL;
> +
> +	offset = triplet_get_offset(triplet);
> +	xor_pos = triplet_get_xor_pos(triplet);
> +
> +	if (xor_pos != 0xffffffff) {
> +		int xor_flags;
> +		uint64_t offset_xor;
> +		uint32_t *xor_positions;
> +		struct object_id xor_oid;
> +		size_t size = 0;
> +
> +		ALLOC_ARRAY(xor_positions, bitmap_git->entry_count);

While there is potential that this is wasteful, it's probably not that huge,
so we can start with the "maximum XOR depth" and then reconsider a smaller
allocation in the future.

> +		while (xor_pos != 0xffffffff) {

We should consider ensuring that also "size < bitmap_git->entry_count".
Better yet, create an xor_positions_alloc variable that is initialized
to the entry_count value.

"size" should probably be xor_positions_nr.

> +			xor_positions[size++] = xor_pos;
> +			triplet = bitmap_get_triplet(bitmap_git, xor_pos);
> +			xor_pos = triplet_get_xor_pos(triplet);
> +		}

(at this point, "if (xor_positions_nr >= xor_positions_alloc)", then error
out since the file must be malformed with an XOR loop.)

> +		while (size){

nit: ") {"

> +			xor_pos = xor_positions[size - 1];
> +			triplet = bitmap_get_triplet(bitmap_git, xor_pos);
> +			commit_pos = get_be32(triplet);
> +			offset_xor = triplet_get_offset(triplet);
> +
> +			if (nth_bitmap_object_oid(bitmap_git, &xor_oid, commit_pos) < 0) {
> +				free(xor_positions);
> +				return NULL;
> +			}
> +
> +			bitmap_git->map_pos = offset_xor + sizeof(uint32_t) + sizeof(uint8_t);
> +			xor_flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
> +			bitmap = read_bitmap_1(bitmap_git);
> +
> +			if (!bitmap){

nit: ") {"

> +				free(xor_positions);
> +				return NULL;
> +			}
> +
> +			xor_bitmap = store_bitmap(bitmap_git, bitmap, &xor_oid, xor_bitmap, xor_flags);

Since we are storing the bitmap here as we "pop" the stack, should we be
looking for a stored bitmap while pushing to the stack in the previous loop?
That would save time when using multiple bitmaps with common XOR bases.

(Of course, we want to be careful that we do not create a recursive loop,
but instead _only_ look at the in-memory bitmaps that already exist.)

> +			size--;
> +		}
> +
> +		free(xor_positions);
> +	}
> +
> +	bitmap_git->map_pos = offset + sizeof(uint32_t) + sizeof(uint8_t);
> +	flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
> +	bitmap = read_bitmap_1(bitmap_git);
> +
> +	if (!bitmap)
> +		return NULL;
> +
> +	return store_bitmap(bitmap_git, bitmap, oid, xor_bitmap, flags);
> +}
> +

I'm happy with the structure of this iterative algorithm!

I'll pause my review here for now.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-27 14:18     ` Derrick Stolee
@ 2022-06-27 15:48       ` Taylor Blau
  2022-06-27 16:51       ` Abhradeep Chakraborty
  1 sibling, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2022-06-27 15:48 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhradeep Chakraborty via GitGitGadget, git, Kaartic Sivaram,
	Abhradeep Chakraborty

On Mon, Jun 27, 2022 at 10:18:51AM -0400, Derrick Stolee wrote:
> On 6/26/2022 9:10 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> > +			triplets. The format and meaning of the table is described
> > +			below.
> > ++
> > +NOTE: This xor_offset is different from the bitmap's xor_offset.
> > +Bitmap's xor_offset is relative i.e. it tells how many bitmaps we have
> > +to go back from the current bitmap. Lookup table's xor_offset tells the
> > +position of the triplet in the list whose bitmap the current commit's
> > +bitmap have to xor with.
>
> I found this difficult to parse. Here is an attempt at a rewording. Please
> let me know if I misunderstood something when reading your version:
>
>   NOTE: The xor_offset stored in the BITMAP_OPT_LOOKUP_TABLE is different
>   from the xor_offset used in the bitmap data table. The xor_offset in this
>   table indicates the row number within this table of the commit whose
>   bitmap is used for the XOR computation with the current commit's stored
>   bitmap to create the proper logical reachability bitmap.
>
> This does make me think that "xor_offset" should really be "xor_row" or
> something like that.

To be fair, I found Stolee's version equally difficult to parse. I
wonder if something like the following would be clearer:

    NOTE: Unlike the xor_offset used to compress an individual bitmap,
    this value stores an *absolute* index into the lookup table, not a
    location relative to the current entry.

> > +For a `.bitmap` containing `nr_entries` reachability bitmaps, the table
> > +contains a list of `nr_entries` <commit pos, offset, xor offset> triplets.
> > +The content of i'th triplet is -
> > +
> > +	* {empty}
> > +	commit pos (4 byte integer, network byte order): ::
> > +	It stores the object position of the commit (in the midx or pack index)
> > +	to which the i'th bitmap in the bitmap entries belongs.
>
> Ok, we are saving some space here, but relying on looking into the pack-index
> or multi-pack-index to get the actual commit OID.
>
> Since this is sorted by the order that stores the bitmaps, binary search will
> no longer work on this list (unless we enforce that on the rest of the bitmap
> file). I am going to expect that you parse this table into a hashmap in order
> to allow fast commit lookups. I'll keep an eye out for that implementation.

The main purpose of this series is to avoid having to construct such a
table ahead of time. This is more or less akin to what the existing
implementation already does in load_bitmap_entries_v1(), though that
function has to read (but not decompress!) all bitmaps.

But I disagree that this isn't binary searchable. The object positions
are in MIDX or pack .idx order, so they are sorted lexicographically.
The comparator implementation could either take as its key an object_id,
and then convert each of the "commit pos" fields themselves to
object_ids and call oidcmp().

Or we could go the other way (as it looks like Abhradeep did in a later
patch) and convert the key's object_id into the index or MIDX-relative
position, and search for that.

> > +	* {empty}
> > +	offset (8 byte integer, network byte order): ::
> > +	The offset from which that commit's bitmap can be read.
> > +
> > +	* {empty}
> > +	xor offset (4 byte integer, network byte order): ::
> > +	It holds the position of the triplet with whose bitmap the
> > +	current bitmap need to xor. If the current triplet's bitmap
> > +	do not have any xor bitmap, it defaults to 0xffffffff.
>
> This last sentence seems backward. Perhaps:
>
>   If the value is 0xffffffff, then the current bitmap has no xor bitmap.

Perhaps even more concisely:

    The position of a triplet whose bitmap is used to compress this one,
    or 0xffffffff if no such bitmap exists.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/6] pack-bitmap-write.c: write lookup table extension
  2022-06-26 13:10   ` [PATCH v2 2/6] pack-bitmap-write.c: write " Abhradeep Chakraborty via GitGitGadget
  2022-06-27 14:35     ` Derrick Stolee
@ 2022-06-27 16:05     ` Taylor Blau
  2022-06-27 18:29       ` Abhradeep Chakraborty
  1 sibling, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2022-06-27 16:05 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget
  Cc: git, Kaartic Sivaram, Derrick Stolee, Abhradeep Chakraborty

On Sun, Jun 26, 2022 at 01:10:13PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
>
> The bitmap lookup table extension was documentated by an earlier
> change, but Git does not yet knowhow to write that extension.
>
> Teach git to write bitmap lookup table extension. The table contains
> the list of `N` <commit pos, offset, xor offset>` triplets. These
> triplets are sorted according to their commit pos (ascending order).
> The meaning of each data in the i'th triplet is given below:
>
>   - Commit pos is the position of the commit in the pack-index
>     (or midx) to which the i'th bitmap belongs. It is a 4 byte
>     network byte order integer.
>
>   - offset is the position of the i'th bitmap.
>
>   - xor offset denotes the position of the triplet with whose
>     bitmap the current triplet's bitmap need to xor with.
>
> Co-authored-by: Taylor Blau <me@ttaylorr.com>
> Mentored-by: Taylor Blau <me@ttaylorr.com>
> Co-mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> ---
>  pack-bitmap-write.c | 72 +++++++++++++++++++++++++++++++++++++++++++--
>  pack-bitmap.h       |  5 ++--
>  2 files changed, 73 insertions(+), 4 deletions(-)
>
> diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
> index c43375bd344..899a4a941e1 100644
> --- a/pack-bitmap-write.c
> +++ b/pack-bitmap-write.c
> @@ -650,7 +650,9 @@ static const struct object_id *oid_access(size_t pos, const void *table)
>
>  static void write_selected_commits_v1(struct hashfile *f,
>  				      struct pack_idx_entry **index,
> -				      uint32_t index_nr)
> +				      uint32_t index_nr,
> +				      uint64_t *offsets,

We should probably leave this as a pointer to an off_t, since that is a
more appropriate type for keeping track of an offset within a file (and
indeed it is the return type of hashfile_total()).

But since it's platform-dependent, we should make sure to cast it to a
uint64_t before writing it as part of the lookup table.

> +				      uint32_t *commit_positions)
>  {
>  	int i;
>
> @@ -663,6 +665,11 @@ static void write_selected_commits_v1(struct hashfile *f,
>  		if (commit_pos < 0)
>  			BUG("trying to write commit not in index");
>
> +		if (offsets)
> +			offsets[i] = hashfile_total(f);

This makes sense to store here, since we can't easily recover this
information later on.

> +		if (commit_positions)
> +			commit_positions[i] = commit_pos;

This one I'm not as sure about. It would be nice to not have
write_selected_commits_v1() be responsible for writing this down, too.
And I think it's easy enough to recover later on, since we're just doing
a search over "index" (see above the "oid_pos" call).

I think that oid_pos() call could be hidden behind a function that takes
an object_id pointer, an index (double pointer) of pack_idx_entry
structs, and a length.

Its implementation would be something like:

    static int commit_bitmap_writer_pos(struct object_id *oid,
                                        struct pack_idx_entry **index,
                                        uint32_t index_nr)
    {
        return oid_pos(oid, index, index_nr, oid_access);
    }

and then we could replace any calls like commit_positions[i] with one
that first takes `i` to the appropriate object_id in selected commit
order.

That would be strictly less efficient, but not in a way that I think
matters, and it would definitely be cleaner to not rely on a side-effect
of write_selected_commits_v1().

Something in the middle there would be to have write_lookup_table()
assemble that list of commit_positions itself, something like:

    uint32_t *commit_positions;

    ALLOC_ARRAY(commit_positions, writer.selected_nr);

    for (i = 0; i < writer.selected_nr; i++) {
        int pos = oid_pos(&writer.selected[i].commit->object.oid,
                          index, index_nr);
        if (pos < 0)
            BUG("trying to write commit not in index");
        commit_positions[i] = pos;
    }

    ...

    free(commit_positions);

That at least removes a side-effect from the implementation of
write_selected_commits_v1() and brings the creation of the
commit_positions array closer to where it's being used, while still
maintaining the constant-time lookups. So that may be a good
alternative, but I'm curious of your thoughts.

> +static int table_cmp(const void *_va, const void *_vb, void *commit_positions)

OK, so this is sorting the table in order of the commit positions. I
would rename the commit_positions parameter to something like "void
*_data", and then have commit_positions be the result of the cast, like
"uint32_t *commit_positions = _data";

> +{
> +	int8_t result = 0;

int8_t isn't an often used type in Git's codebase, but we can get rid of
this variable altogether and just return immediately from each case,
e.g.:

    if (a < b)
        return -1;
    else if (a > b)
        return 1;
    return 0;

or similar.

> +	uint32_t *positions = (uint32_t *) commit_positions;

Explicit cast isn't need here since you're going up from void*.

> +static void write_lookup_table(struct hashfile *f,
> +			       uint64_t *offsets,
> +			       uint32_t *commit_positions)
> +{
> +	uint32_t i;
> +	uint32_t *table, *table_inv;
> +
> +	ALLOC_ARRAY(table, writer.selected_nr);
> +	ALLOC_ARRAY(table_inv, writer.selected_nr);
> +
> +	for (i = 0; i < writer.selected_nr; i++)
> +		table[i] = i;
> +
> +	QSORT_S(table, writer.selected_nr, table_cmp, commit_positions);

I think the construction of table and table_inv could definitely benefit
from a comment here indicating what they're used for and what they
contain (e.g., "table maps abc to xyz").

> +	for (i = 0; i < writer.selected_nr; i++)
> +		table_inv[table[i]] = i;
> +
> +	for (i = 0; i < writer.selected_nr; i++) {
> +		struct bitmapped_commit *selected = &writer.selected[table[i]];
> +		uint32_t xor_offset = selected->xor_offset;
> +
> +		hashwrite_be32(f, commit_positions[table[i]]);
> +		hashwrite_be64(f, offsets[table[i]]);
> +		hashwrite_be32(f, xor_offset ?
> +				table_inv[table[i] - xor_offset]: 0xffffffff);

Nit: missing space before ':'.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/6] pack-bitmap-write.c: write lookup table extension
  2022-06-27 14:35     ` Derrick Stolee
@ 2022-06-27 16:12       ` Taylor Blau
  2022-06-27 17:10       ` Abhradeep Chakraborty
  1 sibling, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2022-06-27 16:12 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhradeep Chakraborty via GitGitGadget, git, Kaartic Sivaram,
	Abhradeep Chakraborty

On Mon, Jun 27, 2022 at 10:35:25AM -0400, Derrick Stolee wrote:
> On 6/26/2022 9:10 AM, Abhradeep Chakraborty via GitGitGadget wrote:
>
> > +	uint32_t a = positions[*(uint32_t *)_va];
> > +	uint32_t b = positions[*(uint32_t *)_vb];
> > +
> > +	if (a > b)
> > +		result = 1;
> > +	else if (a < b)
> > +		result = -1;
> > +	else
> > +		result = 0;
> > +
> > +	return result;
> > +}
>
> Ok, here you are sorting by commit OID (indirectly by the order in the
> [multi-]pack-index). I suppose that I misunderstood in the previous
> patch, so that could use some more specific language, maybe.

Yeah, I agree that some more specific language could be used, with the
main idea being there that we make it clearer that the list of tuples is
still sorted (and can be binary searched).

> > +	for (i = 0; i < writer.selected_nr; i++)
> > +		table[i] = i;
> > +
> > +	QSORT_S(table, writer.selected_nr, table_cmp, commit_positions);
>
> At the end of this sort, table[j] = i means that the ith bitmap corresponds
> to the jth bitmapped commit in lex order of OIDs.
>
> > +	for (i = 0; i < writer.selected_nr; i++)
> > +		table_inv[table[i]] = i;
>
> And table_inv helps us discover that relationship (ith bitmap to jth commit
> by j = table_inv[i]).

These are both great descriptions and should give an idea of what sort
of information is worth putting into a comment.
>
> > +	for (i = 0; i < writer.selected_nr; i++) {
> > +		struct bitmapped_commit *selected = &writer.selected[table[i]];
> > +		uint32_t xor_offset = selected->xor_offset;
>
> Here, xor_offset is "number of bitmaps in relationship to the current bitmap"

It's an offset to an earlier commit which must be used to XOR-decompress the
current one (if any).

> > +		hashwrite_be32(f, commit_positions[table[i]]);
> > +		hashwrite_be64(f, offsets[table[i]]);
> > +		hashwrite_be32(f, xor_offset ?
> > +				table_inv[table[i] - xor_offset]: 0xffffffff);
>
> Which means that if "k = table[i] - xor_offset" that the xor base is the kth
> bitmap. table_inv[k] gets us the position in this table of that bitmap's
> commit.

Yes, exactly. Abhradeep: this is also worth commenting ;-).

> (It's also strange to me that the offset is being _subtracted_, but I guess
> the bitmap format requires the xor base to appear first so the offset does
> not need to be a negative number ever.)

You're right, this follows from the fact that the XOR bases must come
before the commits who must use them to decompress themselves. From
Documentation/technical/bitmap-format.txt:

    This number is always positive, and hence entries are always xor'ed
    with **previous** bitmaps, not bitmaps that will come afterwards in
    the index.

> This last line is a bit complex.
>
> 	uint32_t xor_offset = selected->xor_offset;
> 	uint32_t xor_row = 0xffffffff;
>
> 	if (xor_offset) {
> 		uint32_t xor_order = table[i] - xor_offset;
> 		xor_row = table_inf[xor_order];
> 	}
>
> ...then we can "hashwrite_be32(f, xor_row);" when necessary. I'm not sure
> that we need the "uint32_t xor_order" inside the "if (xor_offset)" block,
> but splitting it helps add clarity to the multi-step computation.

I had the same thought, though I would also say that xor_row should be
declared, not initialized, and the "else" block of "if (xor_offset)"
should set it to 0xffffffff to make the relationship between xor_offset
and the value written a little clearer.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 1/6] Documentation/technical: describe bitmap lookup table extension
  2022-06-27 14:18     ` Derrick Stolee
  2022-06-27 15:48       ` Taylor Blau
@ 2022-06-27 16:51       ` Abhradeep Chakraborty
  1 sibling, 0 replies; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-27 16:51 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Taylor Blau

Derrick Stolee <derrickstolee@github.com> wrote:

> I found this difficult to parse. Here is an attempt at a rewording. Please
> let me know if I misunderstood something when reading your version:
>
>   NOTE: The xor_offset stored in the BITMAP_OPT_LOOKUP_TABLE is different
>   from the xor_offset used in the bitmap data table. The xor_offset in this
>   table indicates the row number within this table of the commit whose
>   bitmap is used for the XOR computation with the current commit's stored
>   bitmap to create the proper logical reachability bitmap.
>
> This does make me think that "xor_offset" should really be "xor_row" or
> something like that.

Thanks. `xor_row` seems nice to me.

> > +	* {empty}
> > +	commit pos (4 byte integer, network byte order): ::
> > +	It stores the object position of the commit (in the midx or pack index)
> > +	to which the i'th bitmap in the bitmap entries belongs.
>
> Ok, we are saving some space here, but relying on looking into the pack-index
> or multi-pack-index to get the actual commit OID.

Seems like I didn't update this particular part. At the time of writing this
patch, I was clear that I would store these triplets in the bitmap's order.
But when I started to implement the "read" part, I realised that these triplets
need to be ordered in ascending order. So I did update the "write extension"
patch but somehow missed this particular part.

Just to be clear, bitmaps are sorted by their commit's date (as far as I know).
Bitmaps for recent commits comes before bitmaps for older commits. So these
two orders are not same. Thus hashmap would not work here.

Will update this portion.

Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/6] pack-bitmap-write.c: write lookup table extension
  2022-06-27 14:35     ` Derrick Stolee
  2022-06-27 16:12       ` Taylor Blau
@ 2022-06-27 17:10       ` Abhradeep Chakraborty
  1 sibling, 0 replies; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-27 17:10 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Taylor Blau

Derrick Stolee <derrickstolee@github.com> wrote:

> Which means that if "k = table[i] - xor_offset" that the xor base is the kth
> bitmap. table_inv[k] gets us the position in this table of that bitmap's
> commit.
>
> (It's also strange to me that the offset is being _subtracted_, but I guess
> the bitmap format requires the xor base to appear first so the offset does
> not need to be a negative number ever.)
>
> This last line is a bit complex.
>
> 	uint32_t xor_offset = selected->xor_offset;
> 	uint32_t xor_row = 0xffffffff;
>
>	if (xor_offset) {
>		uint32_t xor_order = table[i] - xor_offset;
>		xor_row = table_inf[xor_order];
>	}
>
> ...then we can "hashwrite_be32(f, xor_row);" when necessary. I'm not sure
> that we need the "uint32_t xor_order" inside the "if (xor_offset)" block,
> but splitting it helps add clarity to the multi-step computation.

Got it. Will add comments too.

Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v2 3/6] pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests
  2022-06-27 14:43     ` Derrick Stolee
@ 2022-06-27 17:42       ` Abhradeep Chakraborty
  2022-06-27 17:49         ` Taylor Blau
  0 siblings, 1 reply; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-27 17:42 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Taylor Blau

Derrick Stolee <derrickstolee@github.com> wrote:

> I wonder if it makes sense to have it default to 'false' for now, but to
> change that default after the feature has been shipped and running in
> production for a while.

I do not have any opinion. If most reviewers agree on it, I will surely
Set it to false.

> I think you should either use "Git" when talking about the software
> generally, OR use "`git repack --write-bitmap-index` will include..."

Ohh, yeah! Thanks for pointing out.

> s/which/that/
>
> (I'm pretty sure that "that" is better. We're trying to restrict the set
> of repositories we are talking about, not implying that all repositories
> have this property.)

Ok.

> These lines seem misplaced. Maybe they were meant for the previous
> patch?

I mainly used it for testing purpose. That's why I included it in
This patch. But I got your point and will move it to the previous
patch.

Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 3/6] pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests
  2022-06-26 13:10   ` [PATCH v2 3/6] pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests Abhradeep Chakraborty via GitGitGadget
  2022-06-27 14:43     ` Derrick Stolee
@ 2022-06-27 17:47     ` Taylor Blau
  2022-06-27 18:39       ` Abhradeep Chakraborty
  1 sibling, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2022-06-27 17:47 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget
  Cc: git, Kaartic Sivaram, Derrick Stolee, Abhradeep Chakraborty

On Sun, Jun 26, 2022 at 01:10:14PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
>
> Teach git to provide a way for users to enable/disable bitmap lookup
> table extension by providing a config option named 'writeBitmapLookupTable'.
> Default is true.
>
> Also add test to verify writting of lookup table.
>
> Co-Authored-by: Taylor Blau <me@ttaylorr.com>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> Mentored-by: Taylor Blau <me@ttaylorr.com>
> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>

I think that this was covered earlier in the review of this round, but
in general your Signed-off-by (often abbreviated as "S-o-b") should come
last. The order should be chronological, so I'd probably suggest
something like:

    Mentored-by: Taylor Blau <me@ttaylorr.com>
    Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
    Co-Authored-by: Taylor Blau <me@ttaylorr.com>
    Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index 5edbb7fe86e..3757616f09c 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c
> @@ -87,6 +87,13 @@ static int git_multi_pack_index_write_config(const char *var, const char *value,
>  			opts.flags &= ~MIDX_WRITE_BITMAP_HASH_CACHE;
>  	}
>
> +	if (!strcmp(var, "pack.writebitmaplookuptable")) {
> +		if (git_config_bool(var, value))
> +			opts.flags |= MIDX_WRITE_BITMAP_LOOKUP_TABLE;
> +		else
> +			opts.flags &= ~MIDX_WRITE_BITMAP_LOOKUP_TABLE;
> +	}
> +
>  	/*
>  	 * We should never make a fall-back call to 'git_default_config', since
>  	 * this was already called in 'cmd_multi_pack_index()'.
> @@ -123,6 +130,7 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
>  	};
>
>  	opts.flags |= MIDX_WRITE_BITMAP_HASH_CACHE;
> +	opts.flags |= MIDX_WRITE_BITMAP_LOOKUP_TABLE;

I wonder if this should respect pack.writeBitmapLookupTable, too.
Probably both of them should take into account their separate
configuration values, but cleaning up the hashcache one can be done
separately outside of this series.

Everything else looks good.

> diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
> index 899a4a941e1..79be0cf80e6 100644
> --- a/pack-bitmap-write.c
> +++ b/pack-bitmap-write.c
> @@ -713,6 +713,7 @@ static void write_lookup_table(struct hashfile *f,
>  	for (i = 0; i < writer.selected_nr; i++)
>  		table_inv[table[i]] = i;
>
> +	trace2_region_enter("pack-bitmap-write", "writing_lookup_table", the_repository);
>  	for (i = 0; i < writer.selected_nr; i++) {
>  		struct bitmapped_commit *selected = &writer.selected[table[i]];
>  		uint32_t xor_offset = selected->xor_offset;
> @@ -725,6 +726,7 @@ static void write_lookup_table(struct hashfile *f,
>
>  	free(table);
>  	free(table_inv);
> +	trace2_region_leave("pack-bitmap-write", "writing_lookup_table", the_repository);

This region may make more sense to include in the previous commit,
though I don't have a strong feeling about it.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 3/6] pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests
  2022-06-27 17:42       ` Abhradeep Chakraborty
@ 2022-06-27 17:49         ` Taylor Blau
  0 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2022-06-27 17:49 UTC (permalink / raw)
  To: Abhradeep Chakraborty; +Cc: Derrick Stolee, Git, Kaartic Sivaraam

On Mon, Jun 27, 2022 at 11:12:30PM +0530, Abhradeep Chakraborty wrote:
> Derrick Stolee <derrickstolee@github.com> wrote:
>
> > I wonder if it makes sense to have it default to 'false' for now, but to
> > change that default after the feature has been shipped and running in
> > production for a while.
>
> I do not have any opinion. If most reviewers agree on it, I will surely
> Set it to false.

I think it's definitely a safe approach. I don't have a huge concern
about enabling it earlier, but I don't think we're in a huge rush to add
a new feature here, either.

So I'd be fine to ship this with the default being disabled (IOW, *not*
writing the lookup table). That should give us a window where we can
shake out whatever bugs there are, as is often the case when working
with the bitmap code ;).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 4/6] pack-bitmap: prepare to read lookup table
  2022-06-27 15:12     ` Derrick Stolee
@ 2022-06-27 18:06       ` Abhradeep Chakraborty
  2022-06-27 18:32         ` Derrick Stolee
  2022-06-27 21:49       ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension Taylor Blau
  1 sibling, 1 reply; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-27 18:06 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Taylor Blau

Derrick Stolee <derrickstolee@github.com> wrote:

> I didn't check the previous patches, but your sign-off should be the
> last line of the message. (You are singing off on all previous content,
> and any later content is not covered by your sign-off.)

Ohhh, got it. I didn't know about it before.

> nit: This alignment should use four spaces at the end so the second phrase
> matches the start of the previous phrase. Like this:
>
>		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
>		    git_env_bool("GIT_TEST_READ_COMMIT_TABLE", 1)) {
>
> Perhaps it looked right in your editor because it renders tabs as 4 spaces
> instead of 8 spaces.

I don't know why but my editor sometimes do some weird things for alignments.
I generally use VS Code. But for alignment related problems, sometimes I have
to use vi editor.

> Here, we _do_ want to keep the st_mult(). Is the st_add() still necessary? It
> seems this is a leftover from the previous version that had the 4-byte flag
> data.
>
> We set table_size to zero above. We could drop that initialization and instead
> have this after the "size_t triplet_sz" definition:
>
>			size_t table_size = st_mult(ntohl(header->entry_count),
>						    triplet_sz));

Yes, you're right. Will update.

> I expected something different: binary search on the triplets where the comparison is
> made by looking up the OID from the [multi-]pack-index and comparing that OID to the
> commit OID we are looking for.
>
> I'm not convinced that the binary search I had in mind is meaningfully faster than
> what you've implemented here, so I'm happy to leave it as you have it. We can investigate
> if that full search on the pack-index matters at all (it probably doesn't).

Good idea! Thanks!

> While there is potential that this is wasteful, it's probably not that huge,
> so we can start with the "maximum XOR depth" and then reconsider a smaller
> allocation in the future.

Ok.

> We should consider ensuring that also "size < bitmap_git->entry_count".
> Better yet, create an xor_positions_alloc variable that is initialized
> to the entry_count value.
>
> "size" should probably be xor_positions_nr.
> 
> > +			xor_positions[size++] = xor_pos;
> > +			triplet = bitmap_get_triplet(bitmap_git, xor_pos);
> > +			xor_pos = triplet_get_xor_pos(triplet);
> > +		}
> 
> (at this point, "if (xor_positions_nr >= xor_positions_alloc)", then error
> out since the file must be malformed with an XOR loop.)

Got it.

> Since we are storing the bitmap here as we "pop" the stack, should we be
> looking for a stored bitmap while pushing to the stack in the previous loop?
> That would save time when using multiple bitmaps with common XOR bases.

Yeah, I also am thinking about it. Will make a try.

Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 2/6] pack-bitmap-write.c: write lookup table extension
  2022-06-27 16:05     ` Taylor Blau
@ 2022-06-27 18:29       ` Abhradeep Chakraborty
  0 siblings, 0 replies; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-27 18:29 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Derrick Stolee

Taylor Blau <me@ttaylorr.com> wrote:

> We should probably leave this as a pointer to an off_t, since that is a
> more appropriate type for keeping track of an offset within a file (and
> indeed it is the return type of hashfile_total()).
> 
> But since it's platform-dependent, we should make sure to cast it to a
> uint64_t before writing it as part of the lookup table.

Hmm, will make the necessary changes.

> That at least removes a side-effect from the implementation of
> write_selected_commits_v1() and brings the creation of the
> commit_positions array closer to where it's being used, while still
> maintaining the constant-time lookups. So that may be a good
> alternative, but I'm curious of your thoughts.

Sounds good to me :)

> I think the construction of table and table_inv could definitely benefit
> from a comment here indicating what they're used for and what they
> contain (e.g., "table maps abc to xyz").

Yeah, true. Will add comments.

Thanks for the other suggestions also :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 4/6] pack-bitmap: prepare to read lookup table
  2022-06-27 18:06       ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table Abhradeep Chakraborty
@ 2022-06-27 18:32         ` Derrick Stolee
  0 siblings, 0 replies; 63+ messages in thread
From: Derrick Stolee @ 2022-06-27 18:32 UTC (permalink / raw)
  To: Abhradeep Chakraborty; +Cc: Git, Kaartic Sivaraam, Taylor Blau

On 6/27/2022 2:06 PM, Abhradeep Chakraborty wrote:
> Derrick Stolee <derrickstolee@github.com> wrote:
>> nit: This alignment should use four spaces at the end so the second phrase
>> matches the start of the previous phrase. Like this:
>>
>> 		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
>> 		    git_env_bool("GIT_TEST_READ_COMMIT_TABLE", 1)) {
>>
>> Perhaps it looked right in your editor because it renders tabs as 4 spaces
>> instead of 8 spaces.
> 
> I don't know why but my editor sometimes do some weird things for alignments.
> I generally use VS Code. But for alignment related problems, sometimes I have
> to use vi editor.

I also use VS Code, and I noticed a few spacing issues recently, especially
in .txt files.

I submitted a patch [1] to improve the contrib/vscode/init.sh script, which
adds some helpful config settings to your Git workspace. Please take a look
and see how it works for you.

Thanks,
-Stolee

[1] https://lore.kernel.org/git/pull.1271.git.1656354587496.gitgitgadget@gmail.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 3/6] pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests
  2022-06-27 17:47     ` Taylor Blau
@ 2022-06-27 18:39       ` Abhradeep Chakraborty
  0 siblings, 0 replies; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-27 18:39 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Derrick Stolee

Taylor Blau <me@ttaylorr.com> wrote:

> Probably both of them should take into account their separate
> configuration values, but cleaning up the hashcache one can be done
> separately outside of this series.

Actually, it does respect the `pack.writebitmaplookuptable` config.
As pack.writebitmaplookuptable is by default true (for this patch
Series), this line enables it by default. If `pack.writebitmaplookuptable`
Set to false, the proposed change in the `git_multi_pack_index_write_config`
function disables this flag.

> This region may make more sense to include in the previous commit,
> though I don't have a strong feeling about it.

Ok. Will move it to the previous patch.

Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension
  2022-06-26 13:10   ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension Abhradeep Chakraborty via GitGitGadget
  2022-06-27 15:12     ` Derrick Stolee
@ 2022-06-27 21:38     ` Taylor Blau
  2022-06-28 19:25       ` Abhradeep Chakraborty
  1 sibling, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2022-06-27 21:38 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget
  Cc: git, Kaartic Sivaram, Derrick Stolee, Abhradeep Chakraborty

On Sun, Jun 26, 2022 at 01:10:15PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
>
> Earlier change teaches Git to write bitmap lookup table. But Git
> does not know how to parse them.
>
> Teach Git to parse the existing bitmap lookup table. The older
> versions of git are not affected by it. Those versions ignore the

s/git/Git

> lookup table.
>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> Mentored-by: Taylor Blau <me@ttaylorr.com>
> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
> ---
>  pack-bitmap.c                 | 193 ++++++++++++++++++++++++++++++++--
>  t/t5310-pack-bitmaps.sh       |   7 ++
>  t/t5326-multi-pack-bitmaps.sh |   1 +
>  3 files changed, 191 insertions(+), 10 deletions(-)
>
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 36134222d7a..9e09c5824fc 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -82,6 +82,12 @@ struct bitmap_index {
>  	/* The checksum of the packfile or MIDX; points into map. */
>  	const unsigned char *checksum;
>
> +	/*
> +	 * If not NULL, this point into the commit table extension
> +	 * (within map).

It may be worth replacing "within map" to "within the memory mapped
region `map`" to make clear that this points somewhere within the mmap.

> +	 */
> +	unsigned char *table_lookup;
> +


> @@ -185,6 +191,22 @@ static int load_bitmap_header(struct bitmap_index *index)
>  			index->hashes = (void *)(index_end - cache_size);
>  			index_end -= cache_size;
>  		}
> +
> +		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
> +			git_env_bool("GIT_TEST_READ_COMMIT_TABLE", 1)) {

I should have commented on this in an earlier round, but I wonder what
the behavior should be when we have BITMAP_OPT_LOOKUP_TABLE in our
flags, but GIT_TEST_READ_COMMIT_TABLE is disabled.

Right now, it doesn't matter, since there aren't any flags in bits above
BITMAP_OPT_LOOKUP_TABLE. But in the future, if there was some
BITMAP_OPT_FOO that was newer than BITMAP_OPT_LOOKUP_TABLE, we would
want to be able to read it without needing to read the lookup table.

At least, I think that should be true, though I would be interested to
hear if anybody has a differing opinion there.

> +			size_t table_size = 0;
> +			size_t triplet_sz = st_add3(sizeof(uint32_t),    /* commit position */
> +							sizeof(uint64_t),    /* offset */
> +							sizeof(uint32_t));    /* xor offset */

I don't think we need a st_add3() call here, since the size of these
three types is known to be small and thus won't overflow the available
range of size_t.

> +			table_size = st_add(table_size,
> +					st_mult(ntohl(header->entry_count),
> +						triplet_sz));

And table_size here is going to start off at zero, so the outer st_add()
call isn't necessary, either. This should instead be:

    size_t table_size = st_mult(ntohl(header->entry_count),
                                sizeof(uint32_t) + sizeof(uint64_t) + sizeof(uint32_t));

It might be nice to have triplet_sz #define'd somewhere else, since
there are a handful of declarations in this patch that are all
identical. Probably something like:

    #define BITMAP_LOOKUP_TABLE_RECORD_WIDTH (sizeof(uint32_t) + sizeof(uint64_t) + sizeof(uin32_t))

or even:

    /*
     * The width in bytes of a single record in the lookup table
     * extension:
     *
     *   (commit_pos, offset, xor_pos)
     *
     * whose fields are 32-, 64-, and 32-bits wide, respectively.
     */
    #define BITMAP_LOOKUP_TABLE_RECORD_WIDTH (16)

> +			if (table_size > index_end - index->map - header_size)
> +				return error("corrupted bitmap index file (too short to fit lookup table)");

if we decide to still recognize the lookup table extension without
*reading* from it when GIT_TEST_READ_COMMIT_TABLE is unset, I think we
should do something like:

    if (git_env_bool("GIT_TEST_READ_COMMIT_TABLE", 1))
        index->table_lookup = (void *)(index_end - table_size);
    index_end -= table_size;

...where the subtraction on index_end happens unconditionally.

> +static inline const void *bitmap_get_triplet(struct bitmap_index *bitmap_git, uint32_t xor_pos)
> +{
> +	size_t triplet_sz = st_add3(sizeof(uint32_t), sizeof(uint64_t), sizeof(uint32_t));

Same note about the #define constant here.

> +	const void *p = bitmap_git->table_lookup + st_mult(xor_pos, triplet_sz);

And this can be returned directly. Just:

    return bitmap_git->table_lookup + st_mult(xor_pos, BITMAP_LOOKUP_TABLE_RECORD_WIDTH);

although I wonder: why "xor_pos" and not just "pos" here?

> +static uint64_t triplet_get_offset(const void *triplet)
> +{
> +	const void *p = (unsigned char*) triplet + sizeof(uint32_t);
> +	return get_be64(p);
> +}
> +
> +static uint32_t triplet_get_xor_pos(const void *triplet)
> +{
> +	const void *p = (unsigned char*) triplet + st_add(sizeof(uint32_t), sizeof(uint64_t));
> +	return get_be32(p);
> +}

I wonder if we could get rid of these functions altogether and return a
small structure like:

    struct bitmap_lookup_table_record {
        uint32_t commit_pos;
        uint64_t offset;
        uint32_t xor_pos;
    };

or similar.

> +static int triplet_cmp(const void *va, const void *vb)
> +{
> +	int result = 0;
> +	uint32_t *a = (uint32_t *) va;
> +	uint32_t b = get_be32(vb);

Hmm. This is a little tricky to read. Here we're expecting "va" to hold
commit_pos from below, and "vb" to be a pointer at a lookup record.
Everything here is right, though I wonder if a comment or two might
clarify why one is "*(uint32_t *)va" and the other is "get_be32(vb)".

> +	if (*a > b)
> +		result = 1;
> +	else if (*a < b)
> +		result = -1;
> +	else
> +		result = 0;

Let's just return the result of the comparison directly here. And while
I'm looking at it, I think we can avoid dereferencing "a" on each use,
and instead just dereference va on assignment after casting, e.g.:

    uint32_t a = *(uint32_t*)va;

> +static uint32_t bsearch_pos(struct bitmap_index *bitmap_git, struct object_id *oid,
> +						uint32_t *result)
> +{
> +	int found;
> +
> +	if (bitmap_git->midx)

Nit: let's use the bitmap_is_midx() helper here instead of looking at
bitamp_git->midx directly.

> +		found = bsearch_midx(oid, bitmap_git->midx, result);
> +	else
> +		found = bsearch_pack(oid, bitmap_git->pack, result);
> +
> +	return found;
> +}

Makes sense.

> +static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +					  struct commit *commit)
> +{
> +	uint32_t commit_pos, xor_pos;
> +	uint64_t offset;
> +	int flags;
> +	const void *triplet = NULL;
> +	struct object_id *oid = &commit->object.oid;
> +	struct ewah_bitmap *bitmap;
> +	struct stored_bitmap *xor_bitmap = NULL;
> +	size_t triplet_sz = st_add3(sizeof(uint32_t), sizeof(uint64_t), sizeof(uint32_t));
> +
> +	int found = bsearch_pos(bitmap_git, oid, &commit_pos);
> +
> +	if (!found)
> +		return NULL;
> +
> +	triplet = bsearch(&commit_pos, bitmap_git->table_lookup, bitmap_git->entry_count,
> +						triplet_sz, triplet_cmp);
> +	if (!triplet)
> +		return NULL;

OK. If you don't mind, I'm going to "think aloud" while I read through
this function to make sure that we're on the same page.

First thing is to convert the commit OID we're looking for into its
position within the corresponding pack index or MIDX file so that we can
use it as a search key to locate in the lookup table. If we didn't find
anything, or the commit doesn't exist in our pack / MIDX, nothing to do.

> +
> +	offset = triplet_get_offset(triplet);
> +	xor_pos = triplet_get_xor_pos(triplet);

Otherwise, record its offset and XOR "offset".

> +
> +	if (xor_pos != 0xffffffff) {
> +		int xor_flags;
> +		uint64_t offset_xor;
> +		uint32_t *xor_positions;
> +		struct object_id xor_oid;
> +		size_t size = 0;
> +
> +		ALLOC_ARRAY(xor_positions, bitmap_git->entry_count);

If we are XOR'd with another bitmap, make a stack of those bitmaps so
that we can decompress ourself.

I'm a little surprised that we're allocating an array as large as
bitmap_git->entry_count. It's not wrong, but it does waste some bytes
since we likely don't often have these long chains of XOR'd bitmaps.

We should instead allocate a smaller array and grow it over time (search
for examples of ALLOC_GROW() to see the canonical way to do this in
Git's codebase).

> +		while (xor_pos != 0xffffffff) {
> +			xor_positions[size++] = xor_pos;
> +			triplet = bitmap_get_triplet(bitmap_git, xor_pos);
> +			xor_pos = triplet_get_xor_pos(triplet);
> +		}
> +
> +		while (size){

Nit: missing space after ")" and before "{".

> +			xor_pos = xor_positions[size - 1];
> +			triplet = bitmap_get_triplet(bitmap_git, xor_pos);

We already have to get the triplets in the loop above, and then we dig
them back out here. Would it be easier to keep track of a list of
pointers into the mmaped region instead of looking up these triplets
each time?

> +			commit_pos = get_be32(triplet);
> +			offset_xor = triplet_get_offset(triplet);
> +
> +			if (nth_bitmap_object_oid(bitmap_git, &xor_oid, commit_pos) < 0) {

Should it be an error if we can't look up the object's ID here? I'd
think so.

> +				free(xor_positions);
> +				return NULL;
> +			}
> +
> +			bitmap_git->map_pos = offset_xor + sizeof(uint32_t) + sizeof(uint8_t);
> +			xor_flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
> +			bitmap = read_bitmap_1(bitmap_git);
> +
> +			if (!bitmap){

Nit: missing space between ")" and "{".

> +				free(xor_positions);
> +				return NULL;
> +			}
> +
> +			xor_bitmap = store_bitmap(bitmap_git, bitmap, &xor_oid, xor_bitmap, xor_flags);
> +			size--;

Makes sense. Nicely done!

> +		}
> +
> +		free(xor_positions);
> +	}
> +
> +	bitmap_git->map_pos = offset + sizeof(uint32_t) + sizeof(uint8_t);
> +	flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
> +	bitmap = read_bitmap_1(bitmap_git);

Great, and now we can finally read the original bitmap that we wanted
to...

> +	if (!bitmap)
> +		return NULL;
> +
> +	return store_bitmap(bitmap_git, bitmap, oid, xor_bitmap, flags);

...and XOR it with the thing we built up in the loop. Very nicely done.
Do we have a good way to make sure that we're testing this code in CI?
It *seems* correct to me, but of course, we should have a computer check
that this produces OK results, not a human ;).

> +}
> +
>  struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
>  				      struct commit *commit)
>  {
>  	khiter_t hash_pos = kh_get_oid_map(bitmap_git->bitmaps,
>  					   commit->object.oid);
> -	if (hash_pos >= kh_end(bitmap_git->bitmaps))
> -		return NULL;
> +	if (hash_pos >= kh_end(bitmap_git->bitmaps)) {
> +		struct stored_bitmap *bitmap = NULL;
> +		if (!bitmap_git->table_lookup)
> +			return NULL;
> +
> +		/* NEEDSWORK: cache misses aren't recorded */

For what it's worth, I think that it's completely fine to leave this as
a NEEDSWORK for the purposes of this series. I think we plausibly could
improve this in certain scenarios by finding some threshold on cache
misses when we should just fault in all bitmaps, but that can easily be
done on top.

> +		bitmap = lazy_bitmap_for_commit(bitmap_git, commit);
> +		if(!bitmap)

Nit: missing space between "if" and "(".

> +			return NULL;
> +		return lookup_stored_bitmap(bitmap);
> +	}
>  	return lookup_stored_bitmap(kh_value(bitmap_git->bitmaps, hash_pos));
>  }
>
> @@ -1699,9 +1861,13 @@ void test_bitmap_walk(struct rev_info *revs)
>  	if (revs->pending.nr != 1)
>  		die("you must specify exactly one commit to test");
>
> -	fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
> +	fprintf(stderr, "Bitmap v%d test (%d entries)\n",
>  		bitmap_git->version, bitmap_git->entry_count);
>
> +	if (!bitmap_git->table_lookup)
> +		fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
> +			bitmap_git->version, bitmap_git->entry_count);
> +

I think we should probably print just one or the other here, perhaps
like:

    fprintf(stderr, "Bitmap v%d test (%d entries%s)",
            bitmap_git->version,
            bitmap_git->entry_count,
            bitmap_git->table_lookup ? "" : " loaded");

>  	root = revs->pending.objects[0].item;
>  	bm = bitmap_for_commit(bitmap_git, (struct commit *)root);
>
> @@ -1753,10 +1919,16 @@ void test_bitmap_walk(struct rev_info *revs)
>
>  int test_bitmap_commits(struct repository *r)
>  {
> -	struct bitmap_index *bitmap_git = prepare_bitmap_git(r);
> +	struct bitmap_index *bitmap_git = NULL;
>  	struct object_id oid;
>  	MAYBE_UNUSED void *value;
>
> +	/* As this function is only used to print bitmap selected
> +	 * commits, we don't have to read the commit table.
> +	 */
> +	setenv("GIT_TEST_READ_COMMIT_TABLE", "0", 1);
> +
> +	bitmap_git = prepare_bitmap_git(r);
>  	if (!bitmap_git)
>  		die("failed to load bitmap indexes");
>
> @@ -1764,6 +1936,7 @@ int test_bitmap_commits(struct repository *r)
>  		printf("%s\n", oid_to_hex(&oid));
>  	});
>
> +	setenv("GIT_TEST_READ_COMMIT_TABLE", "1", 1);
>  	free_bitmap_index(bitmap_git);

Hmm. I'm not sure I follow the purpose of tweaking
GIT_TEST_READ_COMMIT_TABLE like this with setenv(). Are we trying to
avoid reading the lookup table? If so, why? I'd rather avoid
manipulating the environment directly like this, and instead have a
function we could call to fault in all of the bitmaps (when a lookup
table exists, otherwise do nothing).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension
  2022-06-27 15:12     ` Derrick Stolee
  2022-06-27 18:06       ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table Abhradeep Chakraborty
@ 2022-06-27 21:49       ` Taylor Blau
  2022-06-28  8:59         ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table Abhradeep Chakraborty
  1 sibling, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2022-06-27 21:49 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Abhradeep Chakraborty via GitGitGadget, git, Kaartic Sivaram,
	Abhradeep Chakraborty

On Mon, Jun 27, 2022 at 11:12:09AM -0400, Derrick Stolee wrote:
> On 6/26/2022 9:10 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> > +			table_size = st_add(table_size,
> > +					st_mult(ntohl(header->entry_count),
> > +						triplet_sz));
>
> Here, we _do_ want to keep the st_mult(). Is the st_add() still necessary? It
> seems this is a leftover from the previous version that had the 4-byte flag
> data.
>
> We set table_size to zero above. We could drop that initialization and instead
> have this after the "size_t triplet_sz" definition:
>
> 			size_t table_size = st_mult(ntohl(header->entry_count),
> 						    triplet_sz));

Well put, thank you.

> > +			if (table_size > index_end - index->map - header_size)
> > +				return error("corrupted bitmap index file (too short to fit lookup table)");
>
> Please add "_(...)" around the error message so it can be translated.

I missed this in my own review, but yes: this is a good practice.

> > +	if (bitmap_git->midx)
> > +		found = bsearch_midx(oid, bitmap_git->midx, result);
> > +	else
> > +		found = bsearch_pack(oid, bitmap_git->pack, result);
> > +
> > +	return found;
>
> Here, we are doing a binary search on the entire list of packed objects, which could
> use quite a few more hops than a binary search on the bitmapped commits.

I think this is the best we can do if we make the key to our bsearch
through the lookup table be an index into the pack index / MIDX. But...

> > +static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
> > +					  struct commit *commit)
> ...
> > +	int found = bsearch_pos(bitmap_git, oid, &commit_pos);
> > +
> > +	if (!found)
> > +		return NULL;
> > +
> > +	triplet = bsearch(&commit_pos, bitmap_git->table_lookup, bitmap_git->entry_count,
> > +						triplet_sz, triplet_cmp);
>
> But I see, you are searching the pack-index for the position in the index, and _then_
> searching the bitmap lookup table based on that position value.
>
> I expected something different: binary search on the triplets where the comparison is
> made by looking up the OID from the [multi-]pack-index and comparing that OID to the
> commit OID we are looking for.
>
> I'm not convinced that the binary search I had in mind is meaningfully faster than
> what you've implemented here, so I'm happy to leave it as you have it. We can investigate
> if that full search on the pack-index matters at all (it probably doesn't).

...exactly my thoughts, too. It's possible that it would be faster to
key this search on the object_id "oid" above, and then convert each of
the entries in the lookup table from a uint32_t into an object_id by
calling nth_bitmap_object_oid() repeatedly.

I *think* that what Abhradeep wrote here is going to be faster more
often than not since it makes more efficient use of the page cache
rather than switching between reads across different memory mapped
regions at each point in the binary search.

But of course that depends on a number of factors. Abhradeep: if you're
up for it, I think it would be worth trying it both ways and seeing if
one produces a meaningful speed-up or slow-down over the other. Like I
said: my guess is that what you have now will be faster, but I don't
have a clear sense that that is true without trying it both ways ;-).

> > +	if (!triplet)
> > +		return NULL;
> > +
> > +	offset = triplet_get_offset(triplet);
> > +	xor_pos = triplet_get_xor_pos(triplet);
> > +
> > +	if (xor_pos != 0xffffffff) {
> > +		int xor_flags;
> > +		uint64_t offset_xor;
> > +		uint32_t *xor_positions;
> > +		struct object_id xor_oid;
> > +		size_t size = 0;
> > +
> > +		ALLOC_ARRAY(xor_positions, bitmap_git->entry_count);
>
> While there is potential that this is wasteful, it's probably not that huge,
> so we can start with the "maximum XOR depth" and then reconsider a smaller
> allocation in the future.

There is no maximum XOR depth, to my knowledge. We do have a maximum XOR
*offset*, which says we cannot XOR-compress a bitmap with an entry more
than 160 entries away from the current one. But in theory every commit
could be XOR compressed with the one immediately proceeding it, so the
maximum depth could be as long as the entry_count itself.

I think starting off with a small array and then letting it grow
according to alloc_nr() would be fine here, since it will grow more and
more each time, so the amount of times we have to reallocate the buffer
will tail off over time.

If we were really concerned about it, we could treat the buffer as a
static pointer and reuse it over time (making sure to clear out the
portions of it we're going to reuse, or otherwise ensuring that we don't
read old data). But I doubt it matters much either way in practice: the
individual records are small (at just 4 bytes each) and entry_count is
often less than 1,000, so I think this probably has a vanishingly small
impact.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 6/6] p5310-pack-bitmaps.sh: enable pack.writeReverseIndex for testing
  2022-06-26 13:10   ` [PATCH v2 6/6] p5310-pack-bitmaps.sh: enable pack.writeReverseIndex for testing Abhradeep Chakraborty via GitGitGadget
@ 2022-06-27 21:50     ` Taylor Blau
  2022-06-28  8:01       ` Abhradeep Chakraborty
  0 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2022-06-27 21:50 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget
  Cc: git, Kaartic Sivaram, Derrick Stolee, Abhradeep Chakraborty

On Sun, Jun 26, 2022 at 01:10:17PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
>
> Enable pack.writeReverseIndex to true to see the effect of writing
> the reverse index in the existing bitmap tests (with and without
> lookup table).

I think we should swap the order of these final two patches, since we're
primarily interested in the difference between using a reverse index
with and without the lookup table.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/6] bitmap-lookup-table: add performance tests for lookup table
  2022-06-26 13:10   ` [PATCH v2 5/6] bitmap-lookup-table: add performance tests for lookup table Abhradeep Chakraborty via GitGitGadget
@ 2022-06-27 21:53     ` Taylor Blau
  2022-06-28  7:58       ` Abhradeep Chakraborty
  0 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2022-06-27 21:53 UTC (permalink / raw)
  To: Abhradeep Chakraborty via GitGitGadget
  Cc: git, Kaartic Sivaram, Derrick Stolee, Abhradeep Chakraborty

On Sun, Jun 26, 2022 at 01:10:16PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
>
> Add performance tests to verify the performance of lookup table.
>
> Lookup table makes Git run faster in most of the cases. Below is the
> result of `t/perf/p5310-pack-bitmaps.sh`.`perf/p5326-multi-pack-bitmaps.sh`
> gives similar result. The repository used in the test is linux kernel.
>
> Test                                                      this tree
> --------------------------------------------------------------------------
> 5310.4: repack to disk (lookup=false)                   295.94(250.45+15.24)
> 5310.5: simulated clone                                 12.52(5.07+1.40)
> 5310.6: simulated fetch                                 1.89(2.94+0.24)
> 5310.7: pack to file (bitmap)                           41.39(20.33+7.20)
> 5310.8: rev-list (commits)                              0.98(0.59+0.12)
> 5310.9: rev-list (objects)                              3.40(3.27+0.10)
> 5310.10: rev-list with tag negated via --not		0.07(0.02+0.04)
>          --all (objects)
> 5310.11: rev-list with negative tag (objects)           0.23(0.16+0.06)
> 5310.12: rev-list count with blob:none                  0.26(0.18+0.07)
> 5310.13: rev-list count with blob:limit=1k              6.45(5.94+0.37)
> 5310.14: rev-list count with tree:0                     0.26(0.18+0.07)
> 5310.15: simulated partial clone                        4.99(3.19+0.45)
> 5310.19: repack to disk (lookup=true)                   269.67(174.70+21.33)
> 5310.20: simulated clone                                11.03(5.07+1.11)
> 5310.21: simulated fetch                                0.79(0.79+0.17)
> 5310.22: pack to file (bitmap)                          43.03(20.28+7.43)
> 5310.23: rev-list (commits)                             0.86(0.54+0.09)
> 5310.24: rev-list (objects)                             3.35(3.26+0.07)
> 5310.25: rev-list with tag negated via --not		0.05(0.00+0.03)
> 	 --all (objects)
> 5310.26: rev-list with negative tag (objects)           0.22(0.16+0.05)
> 5310.27: rev-list count with blob:none                  0.22(0.16+0.05)
> 5310.28: rev-list count with blob:limit=1k              6.45(5.87+0.31)
> 5310.29: rev-list count with tree:0                     0.22(0.16+0.05)
> 5310.30: simulated partial clone                        5.17(3.12+0.48)
>
> Test 4-15 are tested without using lookup table. Same tests are
> repeated in 16-30 (using lookup table).
>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> Mentored-by: Taylor Blau <me@ttaylorr.com>
> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
> ---
>  t/perf/p5310-pack-bitmaps.sh       | 77 ++++++++++++++-----------
>  t/perf/p5326-multi-pack-bitmaps.sh | 93 ++++++++++++++++--------------
>  2 files changed, 94 insertions(+), 76 deletions(-)
>
> diff --git a/t/perf/p5310-pack-bitmaps.sh b/t/perf/p5310-pack-bitmaps.sh
> index 7ad4f237bc3..6ff42bdd391 100755
> --- a/t/perf/p5310-pack-bitmaps.sh
> +++ b/t/perf/p5310-pack-bitmaps.sh
> @@ -16,39 +16,48 @@ test_expect_success 'setup bitmap config' '
>  	git config pack.writebitmaps true
>  '
>
> -# we need to create the tag up front such that it is covered by the repack and
> -# thus by generated bitmaps.
> -test_expect_success 'create tags' '
> -	git tag --message="tag pointing to HEAD" perf-tag HEAD
> -'
> -
> -test_perf 'repack to disk' '
> -	git repack -ad
> -'
> -
> -test_full_bitmap
> -
> -test_expect_success 'create partial bitmap state' '
> -	# pick a commit to represent the repo tip in the past
> -	cutoff=$(git rev-list HEAD~100 -1) &&
> -	orig_tip=$(git rev-parse HEAD) &&
> -
> -	# now kill off all of the refs and pretend we had
> -	# just the one tip
> -	rm -rf .git/logs .git/refs/* .git/packed-refs &&
> -	git update-ref HEAD $cutoff &&
> -
> -	# and then repack, which will leave us with a nice
> -	# big bitmap pack of the "old" history, and all of
> -	# the new history will be loose, as if it had been pushed
> -	# up incrementally and exploded via unpack-objects
> -	git repack -Ad &&
> -
> -	# and now restore our original tip, as if the pushes
> -	# had happened
> -	git update-ref HEAD $orig_tip
> -'
> -
> -test_partial_bitmap
> +test_bitmap () {
> +    local enabled="$1"
> +
> +	# we need to create the tag up front such that it is covered by the repack and
> +	# thus by generated bitmaps.
> +	test_expect_success 'create tags' '
> +		git tag --message="tag pointing to HEAD" perf-tag HEAD
> +	'

I think this "create tags" step can happen outside of the test_bitmap()
function, since it should only need to be done once, right?

> +	test_expect_success "use lookup table: $enabled" '
> +		git config pack.writeBitmapLookupTable '"$enabled"'
> +	'
> +
> +	test_perf "repack to disk (lookup=$enabled)" '
> +		git repack -ad
> +	'

And I think these two tests could be combined, since this could just
become:

    git -c pack.writeBitmapLookupTable "$enabled" repack -ad

right?

> +	test_full_bitmap
> +
> +    test_expect_success "create partial bitmap state (lookup=$enabled)" '

There is some funky spacing going on here, at least in my email client.
Could you double check that tabs are used consistently here?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 5/6] bitmap-lookup-table: add performance tests for lookup table
  2022-06-27 21:53     ` Taylor Blau
@ 2022-06-28  7:58       ` Abhradeep Chakraborty
  0 siblings, 0 replies; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-28  7:58 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Derrick Stolee

Taylor Blau <me@ttaylorr.com> wrote:

> I think this "create tags" step can happen outside of the test_bitmap()
> function, since it should only need to be done once, right?

Yeah, I also think the same. That's why I tried to not include in the
Function but for some reason, one test is failing -

  perf 24 - rev-list with tag negated via --not --all (objects):
  running: 
  		git rev-list perf-tag --not --all --use-bitmap-index --objects >/dev/null
	
  fatal: ambiguous argument 'perf-tag': unknown revision or path not in the working tree.
  Use '--' to separate paths from revisions, like this:
  'git <command> [<revision>...] -- [<file>...]'
  not ok 24 - rev-list with tag negated via --not --all (objects)

One thing to note here is that the first `test_bitmap` call always
Passes. But the second `test_bitmap` call fails due to above error.
It throws error irrespective of any parameters for second `test_bitmap`.

If I put it inside the function it doesn't throw any error! 

For this reason, I put it into the function. Do you have any idea
why this happend?

> And I think these two tests could be combined, since this could just
> become:
>
>    git -c pack.writeBitmapLookupTable "$enabled" repack -ad
>
> right?

Yeah, sure.

> There is some funky spacing going on here, at least in my email client.
> Could you double check that tabs are used consistently here?

This is due to my editor's spacing issues. All seems fine when I look at
it in my editor. But actually it is not. Fixing it.

Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 6/6] p5310-pack-bitmaps.sh: enable pack.writeReverseIndex for testing
  2022-06-27 21:50     ` Taylor Blau
@ 2022-06-28  8:01       ` Abhradeep Chakraborty
  0 siblings, 0 replies; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-28  8:01 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Derrick Stolee

Taylor Blau <me@ttaylorr.com> wrote:

> I think we should swap the order of these final two patches, since we're
> primarily interested in the difference between using a reverse index
> with and without the lookup table.

Ok. Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 4/6] pack-bitmap: prepare to read lookup table
  2022-06-27 21:49       ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension Taylor Blau
@ 2022-06-28  8:59         ` Abhradeep Chakraborty
  0 siblings, 0 replies; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-28  8:59 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Derrick Stolee

Taylor Blau <me@ttaylorr.com> wrote:

> ...exactly my thoughts, too. It's possible that it would be faster to
> key this search on the object_id "oid" above, and then convert each of
> the entries in the lookup table from a uint32_t into an object_id by
> calling nth_bitmap_object_oid() repeatedly.
>
> I *think* that what Abhradeep wrote here is going to be faster more
> often than not since it makes more efficient use of the page cache
> rather than switching between reads across different memory mapped
> regions at each point in the binary search.
>
> But of course that depends on a number of factors. Abhradeep: if you're
> up for it, I think it would be worth trying it both ways and seeing if
> one produces a meaningful speed-up or slow-down over the other. Like I
> said: my guess is that what you have now will be faster, but I don't
> have a clear sense that that is true without trying it both ways ;-).

Ok. Let me try both the ways. In my opinion, I think my version has
less searching and less computation. So, I want to stick with this
version. But I also like to try the other one once so that we can
get the best out of these two.

> I think starting off with a small array and then letting it grow
> according to alloc_nr() would be fine here, since it will grow more and
> more each time, so the amount of times we have to reallocate the buffer
> will tail off over time.

What should be the size of that array?

> If we were really concerned about it, we could treat the buffer as a
> static pointer and reuse it over time (making sure to clear out the
> portions of it we're going to reuse, or otherwise ensuring that we don't
> read old data). But I doubt it matters much either way in practice: the
> individual records are small (at just 4 bytes each) and entry_count is
> often less than 1,000, so I think this probably has a vanishingly small
> impact.

Before submitting it to the mailing list, I did use the ALLOC_GROW macro
function. But my version was worse than yours. For every iteration I was
reallocating the array to support `size+1` positions. But later I drop
the code as this might be very much expensive.

Then I wrote this code. As `table` array and `table_inv` array allocate
this size of arrays (though all the indices are used), I thought it
would not be a problem if I use an array of this size for a small amount
of time.

Honestly, I don't like to realloc arrays. Because as far as I can remember,
realloc allocates a new array internally and copies the items from the old
array to the new array. This irritates me.

But at the same time, it is also true that in most cases we might not need
this amount of space.

Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension
  2022-06-27 21:38     ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension Taylor Blau
@ 2022-06-28 19:25       ` Abhradeep Chakraborty
  0 siblings, 0 replies; 63+ messages in thread
From: Abhradeep Chakraborty @ 2022-06-28 19:25 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam, Derrick Stolee


Ohh, sorry! Looks like I missed this comment!

Taylor Blau <me@ttaylorr.com> wrote:

> It may be worth replacing "within map" to "within the memory mapped
> region `map`" to make clear that this points somewhere within the mmap.

Ok.

> I should have commented on this in an earlier round, but I wonder what
> the behavior should be when we have BITMAP_OPT_LOOKUP_TABLE in our
> flags, but GIT_TEST_READ_COMMIT_TABLE is disabled.
>
> Right now, it doesn't matter, since there aren't any flags in bits above
> BITMAP_OPT_LOOKUP_TABLE. But in the future, if there was some
> BITMAP_OPT_FOO that was newer than BITMAP_OPT_LOOKUP_TABLE, we would
> want to be able to read it without needing to read the lookup table.
>
> At least, I think that should be true, though I would be interested to
> hear if anybody has a differing opinion there.

Oh right! I didn't think about it. In that case, we should still subtract
The table size from the last index_size. In that way, These sections will
Not be overlapped.

> And table_size here is going to start off at zero, so the outer st_add()
> call isn't necessary, either. This should instead be:
>
>     size_t table_size = st_mult(ntohl(header->entry_count),
>                                 sizeof(uint32_t) + sizeof(uint64_t) + sizeof(uint32_t));
>
> It might be nice to have triplet_sz #define'd somewhere else, since
> there are a handful of declarations in this patch that are all
> identical. Probably something like:
>
>     #define BITMAP_LOOKUP_TABLE_RECORD_WIDTH (sizeof(uint32_t) + sizeof(uint64_t) + sizeof(uin32_t))
>
> or even:
>
>     /*
>      * The width in bytes of a single record in the lookup table
>      * extension:
>      *
>      *   (commit_pos, offset, xor_pos)
>      *
>      * whose fields are 32-, 64-, and 32-bits wide, respectively.
>      */
>      #define BITMAP_LOOKUP_TABLE_RECORD_WIDTH (16)

Seems perfect to me.

> if we decide to still recognize the lookup table extension without
> *reading* from it when GIT_TEST_READ_COMMIT_TABLE is unset, I think we
> should do something like:
>
>     if (git_env_bool("GIT_TEST_READ_COMMIT_TABLE", 1))
>         index->table_lookup = (void *)(index_end - table_size);
>     index_end -= table_size;
>
> ...where the subtraction on index_end happens unconditionally.

Right. Thanks!

> I wonder if we could get rid of these functions altogether and return a
> small structure like:
>
>     struct bitmap_lookup_table_record {
>         uint32_t commit_pos;
>         uint64_t offset;
>         uint32_t xor_pos;
>     };
>
> or similar.

Ok.

> Hmm. This is a little tricky to read. Here we're expecting "va" to hold
> commit_pos from below, and "vb" to be a pointer at a lookup record.
> Everything here is right, though I wonder if a comment or two might
> clarify why one is "*(uint32_t *)va" and the other is "get_be32(vb)".

Sure. Will add comments.

> Nit: let's use the bitmap_is_midx() helper here instead of looking at
> bitamp_git->midx directly.

Ok.

> First thing is to convert the commit OID we're looking for into its
> position within the corresponding pack index or MIDX file so that we can
> use it as a search key to locate in the lookup table. If we didn't find
> anything, or the commit doesn't exist in our pack / MIDX, nothing to do.
>
> > +
> > +	offset = triplet_get_offset(triplet);
> > +	xor_pos = triplet_get_xor_pos(triplet);
>
> Otherwise, record its offset and XOR "offset".

Exactly!

> We already have to get the triplets in the loop above, and then we dig
> them back out here. Would it be easier to keep track of a list of
> pointers into the mmaped region instead of looking up these triplets
> each time?

Sure. It might be a good idea. Thanks.

> > +			commit_pos = get_be32(triplet);
> > +			offset_xor = triplet_get_offset(triplet);
> > +
> > +			if (nth_bitmap_object_oid(bitmap_git, &xor_oid, commit_pos) < 0) {
>
> Should it be an error if we can't look up the object's ID here? I'd
> think so.

I also am not sure about it. Morally, I think it is better to throw
An error here.

> Do we have a good way to make sure that we're testing this code in CI?
> It *seems* correct to me, but of course, we should have a computer check
> that this produces OK results, not a human ;).

My current test file changes should test this code. As for now, the lookup
Table is enabled by default, all the existing tests that include write and
read bitmaps uses this lookup table. So, all the test case scenarios should
Pass. So, I think it is being tested in CI. Do you have a good idea to test
It better?

> Hmm. I'm not sure I follow the purpose of tweaking
> GIT_TEST_READ_COMMIT_TABLE like this with setenv(). Are we trying to
> avoid reading the lookup table? If so, why? I'd rather avoid
> manipulating the environment directly like this, and instead have a
> function we could call to fault in all of the bitmaps (when a lookup
> table exists, otherwise do nothing).

The problem was that the `test-tool bitmap list-commit` command was
Not printing any commits (the error that I notified you before). It
is because of this function. As lookup table is enabled by default,
`prepare_bitmap_git` function doesn't load each bitmap entries and
thus the below code in this function doesn't provide the bitmapped
commit list (because Hashtable didn't generated).

        kh_foreach(bitmap_git->bitmaps, oid, value, {
		printf("%s\n", oid_to_hex(&oid));
	});

So, the simplest fix I found was this. Should I make a function then
(Which you suggested here)?

Thanks :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2022-06-28 19:30 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-20 12:33 [PATCH 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
2022-06-20 12:33 ` [PATCH 1/6] Documentation/technical: describe bitmap lookup table extension Abhradeep Chakraborty via GitGitGadget
2022-06-20 16:56   ` Derrick Stolee
2022-06-20 17:09     ` Taylor Blau
2022-06-21  8:31       ` Abhradeep Chakraborty
2022-06-22 16:26         ` Taylor Blau
2022-06-21  8:23     ` Abhradeep Chakraborty
2022-06-20 17:21   ` Taylor Blau
2022-06-21  9:22     ` Abhradeep Chakraborty
2022-06-22 16:29       ` Taylor Blau
2022-06-22 16:45         ` Abhradeep Chakraborty
2022-06-20 20:21   ` Derrick Stolee
2022-06-21 10:08     ` Abhradeep Chakraborty
2022-06-22 16:30       ` Taylor Blau
2022-06-20 12:33 ` [PATCH 2/6] pack-bitmap: prepare to read " Abhradeep Chakraborty via GitGitGadget
2022-06-20 20:49   ` Derrick Stolee
2022-06-21 10:28     ` Abhradeep Chakraborty
2022-06-20 22:06   ` Taylor Blau
2022-06-21 11:52     ` Abhradeep Chakraborty
2022-06-22 16:49       ` Taylor Blau
2022-06-22 17:18         ` Abhradeep Chakraborty
2022-06-22 21:34           ` Taylor Blau
2022-06-20 12:33 ` [PATCH 3/6] pack-bitmap-write.c: write " Abhradeep Chakraborty via GitGitGadget
2022-06-20 22:16   ` Taylor Blau
2022-06-21 12:50     ` Abhradeep Chakraborty
2022-06-22 16:51       ` Taylor Blau
2022-06-20 12:33 ` [PATCH 4/6] builtin/pack-objects.c: learn pack.writeBitmapLookupTable Taylor Blau via GitGitGadget
2022-06-20 22:18   ` Taylor Blau
2022-06-20 12:33 ` [PATCH 5/6] bitmap-commit-table: add tests for the bitmap lookup table Abhradeep Chakraborty via GitGitGadget
2022-06-22 16:54   ` Taylor Blau
2022-06-20 12:33 ` [PATCH 6/6] bitmap-lookup-table: add performance tests Abhradeep Chakraborty via GitGitGadget
2022-06-22 17:14   ` Taylor Blau
2022-06-26 13:10 ` [PATCH v2 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format Abhradeep Chakraborty via GitGitGadget
2022-06-26 13:10   ` [PATCH v2 1/6] Documentation/technical: describe bitmap lookup table extension Abhradeep Chakraborty via GitGitGadget
2022-06-27 14:18     ` Derrick Stolee
2022-06-27 15:48       ` Taylor Blau
2022-06-27 16:51       ` Abhradeep Chakraborty
2022-06-26 13:10   ` [PATCH v2 2/6] pack-bitmap-write.c: write " Abhradeep Chakraborty via GitGitGadget
2022-06-27 14:35     ` Derrick Stolee
2022-06-27 16:12       ` Taylor Blau
2022-06-27 17:10       ` Abhradeep Chakraborty
2022-06-27 16:05     ` Taylor Blau
2022-06-27 18:29       ` Abhradeep Chakraborty
2022-06-26 13:10   ` [PATCH v2 3/6] pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests Abhradeep Chakraborty via GitGitGadget
2022-06-27 14:43     ` Derrick Stolee
2022-06-27 17:42       ` Abhradeep Chakraborty
2022-06-27 17:49         ` Taylor Blau
2022-06-27 17:47     ` Taylor Blau
2022-06-27 18:39       ` Abhradeep Chakraborty
2022-06-26 13:10   ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension Abhradeep Chakraborty via GitGitGadget
2022-06-27 15:12     ` Derrick Stolee
2022-06-27 18:06       ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table Abhradeep Chakraborty
2022-06-27 18:32         ` Derrick Stolee
2022-06-27 21:49       ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension Taylor Blau
2022-06-28  8:59         ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table Abhradeep Chakraborty
2022-06-27 21:38     ` [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension Taylor Blau
2022-06-28 19:25       ` Abhradeep Chakraborty
2022-06-26 13:10   ` [PATCH v2 5/6] bitmap-lookup-table: add performance tests for lookup table Abhradeep Chakraborty via GitGitGadget
2022-06-27 21:53     ` Taylor Blau
2022-06-28  7:58       ` Abhradeep Chakraborty
2022-06-26 13:10   ` [PATCH v2 6/6] p5310-pack-bitmaps.sh: enable pack.writeReverseIndex for testing Abhradeep Chakraborty via GitGitGadget
2022-06-27 21:50     ` Taylor Blau
2022-06-28  8:01       ` Abhradeep Chakraborty

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).