git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal
@ 2023-04-25  0:00 Taylor Blau
  2023-04-25  0:00 ` [PATCH 1/3] revision: support tracking uninteresting commits Taylor Blau
                   ` (8 more replies)
  0 siblings, 9 replies; 51+ messages in thread
From: Taylor Blau @ 2023-04-25  0:00 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Derrick Stolee, Junio C Hamano

Here is a short (but dense) series that I worked on towards the end of
2021 with Peff.

Its goal is to improve the bitmap traversal we implement in
prepare_bitmap_walk(). Specifically, it avoids building up a complete
bitmap of the "haves" side, and instead uses a combination of (a)
UNINTERESTING commits between the tips and boundary, and (b) the
boundary itself.

The gory details are laid out in 3/3, but the high-level overview of the
new algorithm to compute the set of objects between "haves" and "wants"
using bitmaps is:

  1. Build a (partial) bitmap of the haves side by first OR-ing any
     bitmap(s) that already exist for UNINTERESTING commits between the
     haves and the boundary.

  2. For each commit along the boundary, add it as a fill-in traversal
     tip (where the traversal terminates once an existing bitmap is
     found), and perform fill-in traversal.

  3. Build up a complete bitmap of the wants side as usual, stopping any
     time we intersect the (partial) haves side.

This improves many cases where using bitmaps was significantly *slower*
than regular, non-bitmap traversal. In some instances, using bitmaps is
still slower than the non-bitmap case, but it is a substantial
improvement over the classic bitmap walk.

    $ ours="$(git branch --show-current)"
      argv="--count --objects $ours --not --exclude=$ours --branches"
      hyperfine \
        -n 'no bitmaps' "git.compile rev-list $argv" \
        -n 'existing bitmap traversal' "git rev-list --use-bitmap-index $argv" \
        -n 'this commit' "git.compile rev-list --use-bitmap-index $argv"
    Benchmark 1: no bitmaps
      Time (mean ± σ):      15.1 ms ±   4.1 ms    [User: 8.1 ms, System: 6.9 ms]
      Range (min … max):     7.4 ms …  21.8 ms    131 runs

    Benchmark 2: existing bitmap traversal
      Time (mean ± σ):      82.6 ms ±   9.2 ms    [User: 63.6 ms, System: 19.0 ms]
      Range (min … max):    73.8 ms … 105.4 ms    28 runs

    Benchmark 3: this commit
      Time (mean ± σ):      19.8 ms ±   3.1 ms    [User: 13.0 ms, System: 6.8 ms]
      Range (min … max):    17.7 ms …  38.6 ms    158 runs

    Summary
      'no bitmaps' ran
        1.31 ± 0.41 times faster than 'this commit'
        5.49 ± 1.62 times faster than 'existing bitmap traversal'

In a large repository on GitHub.com, the timings to compute the objects
unique to "master", as in:

    $ git rev-list --count --objects master --not --exclude=master --branches

improve as follows:

  - 1.012 sec (without bitmaps)
  - 29.571 sec (with the existing bitmap walk)
  - 2.279 sec (with these patches)

The first couple of patches are preparatory, and the third patch is the
substantive one.

Please have a look and poke any holes you can find in the new approach
:-). Thanks in advance for your review.

Taylor Blau (3):
  revision: support tracking uninteresting commits
  pack-bitmap.c: extract `fill_in_bitmap()`
  pack-bitmap.c: use commit boundary during bitmap traversal

 pack-bitmap.c | 252 ++++++++++++++++++++++++++++++++++++++------------
 revision.c    |  10 +-
 revision.h    |   5 +
 3 files changed, 205 insertions(+), 62 deletions(-)

-- 
2.40.0.380.gda896aa358.dirty

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 1/3] revision: support tracking uninteresting commits
  2023-04-25  0:00 [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
@ 2023-04-25  0:00 ` Taylor Blau
  2023-04-25 18:15   ` Derrick Stolee
  2023-04-25 18:22   ` Junio C Hamano
  2023-04-25  0:00 ` [PATCH 2/3] pack-bitmap.c: extract `fill_in_bitmap()` Taylor Blau
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 51+ messages in thread
From: Taylor Blau @ 2023-04-25  0:00 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Derrick Stolee, Junio C Hamano

The boundary-based bitmap walk will want to know which commits were
marked UNINTERESTING in the walk used to discover the boundary.

Track which commits are marked as such during list limitation as well as
the topo walk step, though the latter is not used.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 revision.c | 10 +++++++++-
 revision.h |  5 +++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/revision.c b/revision.c
index 106ca1ce6c..7177244e7a 100644
--- a/revision.c
+++ b/revision.c
@@ -1446,6 +1446,9 @@ static int limit_list(struct rev_info *revs)
 		if (process_parents(revs, commit, &original_list, NULL) < 0)
 			return -1;
 		if (obj->flags & UNINTERESTING) {
+			if (revs->collect_uninteresting)
+				add_object_array(obj, NULL,
+						 &revs->uninteresting_commits);
 			mark_parents_uninteresting(revs, commit);
 			slop = still_interesting(original_list, date, slop, &interesting_cache);
 			if (slop)
@@ -3072,6 +3075,7 @@ void release_revisions(struct rev_info *revs)
 	diff_free(&revs->pruning);
 	reflog_walk_info_release(revs->reflog_info);
 	release_revisions_topo_walk_info(revs->topo_walk_info);
+	object_array_clear(&revs->uninteresting_commits);
 }
 
 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
@@ -3522,8 +3526,12 @@ static void explore_walk_step(struct rev_info *revs)
 	if (process_parents(revs, c, NULL, NULL) < 0)
 		return;
 
-	if (c->object.flags & UNINTERESTING)
+	if (c->object.flags & UNINTERESTING) {
+		if (revs->collect_uninteresting)
+			add_object_array(&c->object, NULL,
+					 &revs->uninteresting_commits);
 		mark_parents_uninteresting(revs, c);
+	}
 
 	for (p = c->parents; p; p = p->next)
 		test_flag_and_insert(&info->explore_queue, p->item, TOPO_WALK_EXPLORED);
diff --git a/revision.h b/revision.h
index e8f6de9684..1385b29c76 100644
--- a/revision.h
+++ b/revision.h
@@ -124,6 +124,9 @@ struct rev_info {
 	/* Parents of shown commits */
 	struct object_array boundary_commits;
 
+	/* UNINTERESTING commits between the tips and boundary */
+	struct object_array uninteresting_commits;
+
 	/* The end-points specified by the end user */
 	struct rev_cmdline_info cmdline;
 
@@ -183,6 +186,7 @@ struct rev_info {
 			unpacked:1,
 			no_kept_objects:1,
 			boundary:2,
+			collect_uninteresting:1,
 			count:1,
 			left_right:1,
 			left_only:1,
@@ -404,6 +408,7 @@ struct rev_info {
 	.expand_tabs_in_log = -1, \
 	.commit_format = CMIT_FMT_DEFAULT, \
 	.expand_tabs_in_log_default = 8, \
+	.uninteresting_commits = OBJECT_ARRAY_INIT, \
 }
 
 /**
-- 
2.40.0.380.gda896aa358.dirty


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 2/3] pack-bitmap.c: extract `fill_in_bitmap()`
  2023-04-25  0:00 [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
  2023-04-25  0:00 ` [PATCH 1/3] revision: support tracking uninteresting commits Taylor Blau
@ 2023-04-25  0:00 ` Taylor Blau
  2023-04-25 18:32   ` Junio C Hamano
  2023-04-25  0:00 ` [PATCH 3/3] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 51+ messages in thread
From: Taylor Blau @ 2023-04-25  0:00 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Derrick Stolee, Junio C Hamano

To prepare for the boundary-based bitmap walk to perform a fill-in
traversal using the boundary of either side as the tips, extract routine
used to perform fill-in traversal by `find_objects()` so that it can be
used in both places.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 66 +++++++++++++++++++++++++++++----------------------
 1 file changed, 37 insertions(+), 29 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index b2e7d06d60..046240d072 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1040,6 +1040,41 @@ static int add_commit_to_bitmap(struct bitmap_index *bitmap_git,
 	return 1;
 }
 
+static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_git,
+				     struct rev_info *revs,
+				     struct bitmap *base,
+				     struct bitmap *seen)
+{
+	struct include_data incdata;
+	struct bitmap_show_data show_data;
+
+	if (base == NULL)
+		base = bitmap_new();
+
+	incdata.bitmap_git = bitmap_git;
+	incdata.base = base;
+	incdata.seen = seen;
+
+	revs->include_check = should_include;
+	revs->include_check_obj = should_include_obj;
+	revs->include_check_data = &incdata;
+
+	if (prepare_revision_walk(revs))
+		die("revision walk setup failed");
+
+	show_data.bitmap_git = bitmap_git;
+	show_data.base = base;
+
+	traverse_commit_list_filtered(revs, show_commit, show_object,
+				      &show_data, NULL);
+
+	revs->include_check = NULL;
+	revs->include_check_obj = NULL;
+	revs->include_check_data = NULL;
+
+	return base;
+}
+
 static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 				   struct rev_info *revs,
 				   struct object_list *roots,
@@ -1105,35 +1140,8 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 		}
 	}
 
-	if (needs_walk) {
-		struct include_data incdata;
-		struct bitmap_show_data show_data;
-
-		if (!base)
-			base = bitmap_new();
-
-		incdata.bitmap_git = bitmap_git;
-		incdata.base = base;
-		incdata.seen = seen;
-
-		revs->include_check = should_include;
-		revs->include_check_obj = should_include_obj;
-		revs->include_check_data = &incdata;
-
-		if (prepare_revision_walk(revs))
-			die(_("revision walk setup failed"));
-
-		show_data.bitmap_git = bitmap_git;
-		show_data.base = base;
-
-		traverse_commit_list(revs,
-				     show_commit, show_object,
-				     &show_data);
-
-		revs->include_check = NULL;
-		revs->include_check_obj = NULL;
-		revs->include_check_data = NULL;
-	}
+	if (needs_walk)
+		base = fill_in_bitmap(bitmap_git, revs, base, seen);
 
 	return base;
 }
-- 
2.40.0.380.gda896aa358.dirty


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 3/3] pack-bitmap.c: use commit boundary during bitmap traversal
  2023-04-25  0:00 [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
  2023-04-25  0:00 ` [PATCH 1/3] revision: support tracking uninteresting commits Taylor Blau
  2023-04-25  0:00 ` [PATCH 2/3] pack-bitmap.c: extract `fill_in_bitmap()` Taylor Blau
@ 2023-04-25  0:00 ` Taylor Blau
  2023-04-25 18:52   ` Junio C Hamano
  2023-04-25 18:53   ` Derrick Stolee
  2023-04-25 18:02 ` [PATCH 0/3] pack-bitmap: boundary-based " Junio C Hamano
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 51+ messages in thread
From: Taylor Blau @ 2023-04-25  0:00 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Derrick Stolee, Junio C Hamano

When reachability bitmap coverage exists in a repository, Git will use a
different (and hopefully faster) traversal to compute revision walks.

Consider a set of positive and negative tips (which we'll refer to with
their standard bitmap parlance by, "wants", and "haves"). In order to
figure out what objects exist between the tips, the existing traversal
in prepare_bitmap_walk() looks something like:

  1. Consider if we can even compute the set of objects with bitmaps,
     and fall back to the usual traversal if we cannot. For example,
     pathspec limiting traversals can't be computed using bitmaps (since
     they don't know which objects are at which paths). The same is true
     of certain kinds of non-trivial object filters.

  2. If we can compute the traversal with bitmaps, partition the
     (dereferenced) tips into two object lists, "haves", and "wants",
     based on whether or not the objects have the UNINTERESTING flag,
     respectively.

  3. Fall back to the ordinary object traversal if either (a) there are
     no haves, (b) none of the haves are in the bitmapped pack or MIDX,
     or (c) there are no wants.

  4. Construct a reachability bitmap for the "haves" side by walking
     from the revision tips down to any existing bitmaps, OR-ing in any
     bitmaps as they are found.

  5. Then do the same for the "wants" side, stopping at any objects that
     appear in the "haves" bitmap.

  6. Filter the results if any object filter (that can be easily
     computed with bitmaps alone) was given, and then return back to the
     caller.

When there is good bitmap coverage relative to the traversal tips, this
walk is often significantly faster than an ordinary object traversal
because it can visit far fewer objects.

But in certain cases, it can be significantly *slower* than the usual
object traversal. Why? Because we need to compute complete bitmaps on
either side of the walk. If either one (or both) of the sides require
walking many (or all!) objects before they get to an existing bitmap,
the extra bitmap machinery is mostly or all overhead.

One of the benefits, however, is that even if the walk is slower, bitmap
traversals are guaranteed to provide an *exact* answer. Unlike the
traditional object traversal algorithm, which can over-count the results
by not opening trees for older commits, the bitmap walk builds an exact
reachability bitmap for either side, meaning the results are never
over-counted.

But producing non-exact results is OK for our traversal here (both in
the bitmap case and not), as long as the results are over-counted, not
under.

Relaxing the bitmap traversal to allow it to produce over-counted
results gives us the opportunity to make some significant improvements.
Instead of the above, the new algorithm only has to walk from the
*boundary* down to the nearest bitmap, instead of from each of the
UNINTERESTING tips.

The boundary-based approach still has degenerate cases, but we'll show
in a moment that it is often a significant improvement.

The new algorithm works as follows:

  1. Build a (partial) bitmap of the haves side by first OR-ing any
     bitmap(s) that already exist for UNINTERESTING commits between the
     haves and the boundary.

  2. For each commit along the boundary, add it as a fill-in traversal
     tip (where the traversal terminates once an existing bitmap is
     found), and perform fill-in traversal.

  3. Build up a complete bitmap of the wants side as usual, stopping any
     time we intersect the (partial) haves side.

  4. Return the results.

And is more-or-less equivalent to using the *old* algorithm with this
invocation:

    $ git rev-list --objects --boundary $WANTS --not $HAVES |
        perl -lne 'print $1 if /^-(.*)/' |
        xargs git rev-list --objects --use-bitmap-index $WANTS --not

The new result performs significantly better in many cases, particularly
when the distance from the boundary commit(s) to an existing bitmap is
shorter than the distance from (all of) the have tips to the nearest
bitmapped commit.

Note that when using the old bitmap traversal algorithm, the results can
be *slower* than without bitmaps! Under the new algorithm, the result is
computed faster with bitmaps than without (at the cost of over-counting
the true number of objects in a similar fashion as the non-bitmap
traversal):

    # (Computing objects unique to 'master' among all branches, not
    # using bitmaps).
    $ time git rev-list --count --objects master --not --exclude=master --branches
    54

    real	0m1.012s
    user	0m0.796s
    sys	0m0.214s

    # (Computing the same uniqueness query using the old bitmap
    # traversal algorithm.)
    $ time git rev-list --use-bitmap-index --count --objects master --not --exclude=master --branches
    42

    real	0m29.571s
    user	0m28.404s
    sys	0m1.164s

    # (Computing the same uniqueness query using the new bitmap
    # traversal algorithm.)
    $ time git.compile rev-list --count --objects --use-bitmap-index master --not --exclude=master --branches
    54

    real	0m2.279s
    user	0m2.023s
    sys	0m0.256s

The new algorithm is still slower than not using bitmaps at all, but it
is nearly a 13-fold improvement over the existing traversal.

In a more realistic setting (using my local copy of git.git), I can
observe a similar speed-up:

    $ ours="$(git branch --show-current)"
      argv="--count --objects $ours --not --exclude=$ours --branches"
      hyperfine \
        -n 'no bitmaps' "git.compile rev-list $argv" \
        -n 'existing bitmap traversal' "git rev-list --use-bitmap-index $argv" \
        -n 'this commit' "git.compile rev-list --use-bitmap-index $argv"
    Benchmark 1: no bitmaps
      Time (mean ± σ):      15.1 ms ±   4.1 ms    [User: 8.1 ms, System: 6.9 ms]
      Range (min … max):     7.4 ms …  21.8 ms    131 runs

    Benchmark 2: existing bitmap traversal
      Time (mean ± σ):      82.6 ms ±   9.2 ms    [User: 63.6 ms, System: 19.0 ms]
      Range (min … max):    73.8 ms … 105.4 ms    28 runs

    Benchmark 3: this commit
      Time (mean ± σ):      19.8 ms ±   3.1 ms    [User: 13.0 ms, System: 6.8 ms]
      Range (min … max):    17.7 ms …  38.6 ms    158 runs

    Summary
      'no bitmaps' ran
        1.31 ± 0.41 times faster than 'this commit'
        5.49 ± 1.62 times faster than 'existing bitmap traversal'

Here, the new algorithm is also still slower than not using bitmaps, but
represents a 4-fold improvement over the existing traversal in a more
modest example.

Since this algorithm was originally written (nearly a year and a half
ago, at the time of writing), the bitmap lookup table shipped, making
the new algorithm's result more competitive. A few other future
directions for improving bitmap traversal times beyond not using bitmaps
at all:

  - Decrease the cost to decompress and OR together many bitmaps
    together (particularly when enumerating the uninteresting side of
    the walk). Here we could explore more efficient bitmap storage
    techniques, like Roaring+Run and/or use SIMD instructions to speed
    up ORing them together.

  - Store pseudo-merge bitmaps, which could allow us to OR together
    fewer "summary" bitmaps (which would also help with the above).

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 194 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 158 insertions(+), 36 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 046240d072..c47c97f35b 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1048,7 +1048,7 @@ static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_git,
 	struct include_data incdata;
 	struct bitmap_show_data show_data;
 
-	if (base == NULL)
+	if (!base)
 		base = bitmap_new();
 
 	incdata.bitmap_git = bitmap_git;
@@ -1075,6 +1075,145 @@ static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_git,
 	return base;
 }
 
+static int obj_in_bitmap(struct bitmap_index *bitmap_git,
+			 struct object *obj, struct bitmap *bitmap)
+{
+	int pos;
+
+	if (!bitmap)
+		return 0;
+	pos = bitmap_position(bitmap_git, &obj->oid);
+	if (pos < 0)
+		return 0;
+	return bitmap_get(bitmap, pos);
+}
+
+static void show_boundary_commit(struct commit *commit, void *data)
+{
+	struct object_array *boundary = data;
+	if (!(commit->object.flags & BOUNDARY))
+		return;
+
+	add_object_array(&commit->object, "", boundary);
+}
+
+static void show_boundary_object(struct object *object,
+				 const char *name, void *data)
+{
+	BUG("should not be called");
+}
+
+static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git,
+					    struct rev_info *revs,
+					    struct object_list *roots)
+{
+	struct bitmap *base = NULL;
+	struct object_array boundary = OBJECT_ARRAY_INIT;
+	int any_missing = 0;
+	unsigned int i;
+	int tmp_blobs, tmp_trees, tmp_tags;
+
+	revs->ignore_missing_links = 1;
+	revs->collect_uninteresting = 1;
+
+	/*
+	 * OR in any existing reachability bitmaps among `roots` into `base`.
+	 */
+	while (roots) {
+		struct object *object = roots->item;
+		roots = roots->next;
+
+		if (object->type == OBJ_COMMIT &&
+		    !obj_in_bitmap(bitmap_git, object, base) &&
+		    add_commit_to_bitmap(bitmap_git, &base,
+					 (struct commit *)object)) {
+			object->flags |= SEEN;
+			continue;
+		}
+
+		any_missing = 1;
+	}
+
+	if (!any_missing)
+		goto cleanup;
+
+	tmp_blobs = revs->blob_objects;
+	tmp_trees = revs->tree_objects;
+	tmp_tags = revs->blob_objects;
+	revs->blob_objects = 0;
+	revs->tree_objects = 0;
+	revs->tag_objects = 0;
+
+	/*
+	 * We didn't have complete coverage of the roots. First OR in any
+	 * bitmaps that are UNINTERESTING between the tips and boundary.
+	 */
+	trace2_region_enter("pack-bitmap", "boundary-prepare", the_repository);
+	if (prepare_revision_walk(revs))
+		die("revision walk setup failed");
+	trace2_region_leave("pack-bitmap", "boundary-prepare", the_repository);
+
+	trace2_region_enter("pack-bitmap", "boundary-load-bitmaps", the_repository);
+	for (i = 0; i < revs->uninteresting_commits.nr; i++) {
+		struct object *obj = revs->uninteresting_commits.objects[i].item;
+		if (obj->type != OBJ_COMMIT)
+			BUG("unexpected non-commit %s marked uninteresting",
+			    oid_to_hex(&obj->oid));
+
+		if (obj_in_bitmap(bitmap_git, obj, base))
+			continue;
+
+		add_commit_to_bitmap(bitmap_git, &base, (struct commit *)obj);
+	}
+	trace2_region_leave("pack-bitmap", "boundary-load-bitmaps", the_repository);
+
+	/*
+	 * Then add the boundary commit(s) as fill-in traversal tips.
+	 */
+	trace2_region_enter("pack-bitmap", "boundary-traverse", the_repository);
+	revs->boundary = 1;
+	traverse_commit_list_filtered(revs,
+				      show_boundary_commit,
+				      show_boundary_object,
+				      &boundary, NULL);
+	revs->boundary = 0;
+	revs->blob_objects = tmp_blobs;
+	revs->tree_objects = tmp_trees;
+	revs->tag_objects = tmp_tags;
+
+	reset_revision_walk();
+	clear_object_flags(UNINTERESTING);
+	trace2_region_leave("pack-bitmap", "boundary-traverse", the_repository);
+
+	trace2_region_enter("pack-bitmap", "boundary-fill-in", the_repository);
+	if (boundary.nr) {
+		struct object *obj;
+		int needs_walk = 0;
+
+		for (i = 0; i < boundary.nr; i++) {
+			obj = boundary.objects[i].item;
+
+			if (obj_in_bitmap(bitmap_git, obj, base)) {
+				obj->flags |= SEEN;
+			} else {
+				add_pending_object(revs, obj, "");
+				needs_walk = 1;
+			}
+		}
+
+		if (needs_walk)
+			base = fill_in_bitmap(bitmap_git, revs, base, NULL);
+	}
+	trace2_region_leave("pack-bitmap", "boundary-fill-in", the_repository);
+
+
+cleanup:
+	revs->ignore_missing_links = 0;
+	revs->collect_uninteresting = 0;
+
+	return base;
+}
+
 static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 				   struct rev_info *revs,
 				   struct object_list *roots,
@@ -1140,8 +1279,21 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 		}
 	}
 
-	if (needs_walk)
+	if (needs_walk) {
+		/*
+		 * This fill-in traversal may walk over some objects
+		 * again, since we have already traversed in order to
+		 * find the boundary.
+		 *
+		 * But this extra walk should be extremely cheap, since
+		 * all commit objects are loaded into memory, and
+		 * because we skip walking to parents that are
+		 * UNINTERESTING, since it will be marked in the haves
+		 * bitmap already (or it has an on-disk bitmap, since
+		 * OR-ing it in covers all of its ancestors).
+		 */
 		base = fill_in_bitmap(bitmap_git, revs, base, seen);
+	}
 
 	return base;
 }
@@ -1257,25 +1409,6 @@ static void show_objects_for_type(
 	}
 }
 
-static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
-			     struct object_list *roots)
-{
-	while (roots) {
-		struct object *object = roots->item;
-		roots = roots->next;
-
-		if (bitmap_is_midx(bitmap_git)) {
-			if (bsearch_midx(&object->oid, bitmap_git->midx, NULL))
-				return 1;
-		} else {
-			if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
-				return 1;
-		}
-	}
-
-	return 0;
-}
-
 static struct bitmap *find_tip_objects(struct bitmap_index *bitmap_git,
 				       struct object_list *tip_objects,
 				       enum object_type type)
@@ -1583,14 +1716,6 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 			object_list_insert(object, &wants);
 	}
 
-	/*
-	 * if we have a HAVES list, but none of those haves is contained
-	 * in the packfile that has a bitmap, we don't have anything to
-	 * optimize here
-	 */
-	if (haves && !in_bitmapped_pack(bitmap_git, haves))
-		goto cleanup;
-
 	/* if we don't want anything, we're done here */
 	if (!wants)
 		goto cleanup;
@@ -1603,18 +1728,15 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 	if (load_bitmap(bitmap_git) < 0)
 		goto cleanup;
 
-	object_array_clear(&revs->pending);
-
 	if (haves) {
-		revs->ignore_missing_links = 1;
-		haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
-		reset_revision_walk();
-		revs->ignore_missing_links = 0;
-
+		haves_bitmap = find_boundary_objects(bitmap_git, revs, haves);
 		if (!haves_bitmap)
 			BUG("failed to perform bitmap walk");
 	}
 
+	object_array_clear(&revs->pending);
+	reset_revision_walk();
+
 	wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap);
 
 	if (!wants_bitmap)
-- 
2.40.0.380.gda896aa358.dirty

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-04-25  0:00 [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
                   ` (2 preceding siblings ...)
  2023-04-25  0:00 ` [PATCH 3/3] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
@ 2023-04-25 18:02 ` Junio C Hamano
  2023-04-25 18:30   ` Derrick Stolee
  2023-04-25 18:06 ` Derrick Stolee
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 51+ messages in thread
From: Junio C Hamano @ 2023-04-25 18:02 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

> This improves many cases where using bitmaps was significantly *slower*
> than regular, non-bitmap traversal. In some instances, using bitmaps is
> still slower than the non-bitmap case, but it is a substantial
> improvement over the classic bitmap walk.
> ...
> In a large repository on GitHub.com, the timings to compute the objects
> unique to "master", as in:
>
>     $ git rev-list --count --objects master --not --exclude=master --branches
>
> improve as follows:

Curious---when would it be significantly faster to use bitmaps?
"Most of the time"?  "In not-too-large repository?"


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-04-25  0:00 [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
                   ` (3 preceding siblings ...)
  2023-04-25 18:02 ` [PATCH 0/3] pack-bitmap: boundary-based " Junio C Hamano
@ 2023-04-25 18:06 ` Derrick Stolee
  2023-04-25 19:01   ` Taylor Blau
  2023-05-01 22:08 ` Junio C Hamano
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 51+ messages in thread
From: Derrick Stolee @ 2023-04-25 18:06 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

On 4/24/2023 8:00 PM, Taylor Blau wrote:
> Here is a short (but dense) series that I worked on towards the end of
> 2021 with Peff.
> 
> Its goal is to improve the bitmap traversal we implement in
> prepare_bitmap_walk(). Specifically, it avoids building up a complete
> bitmap of the "haves" side, and instead uses a combination of (a)
> UNINTERESTING commits between the tips and boundary, and (b) the
> boundary itself.
> 
> The gory details are laid out in 3/3, but the high-level overview of the
> new algorithm to compute the set of objects between "haves" and "wants"
> using bitmaps is:
> 
>   1. Build a (partial) bitmap of the haves side by first OR-ing any
>      bitmap(s) that already exist for UNINTERESTING commits between the
>      haves and the boundary.
> 
>   2. For each commit along the boundary, add it as a fill-in traversal
>      tip (where the traversal terminates once an existing bitmap is
>      found), and perform fill-in traversal.
> 
>   3. Build up a complete bitmap of the wants side as usual, stopping any
>      time we intersect the (partial) haves side.

In other words: this generates something closer to the object set in the
non-bitmapped object walk. The only difference is that the new bitmapped
algorithm will see objects that were re-introduced across the boundary
(say, a blob was reverted to its older mode).
 
> This improves many cases where using bitmaps was significantly *slower*
> than regular, non-bitmap traversal. In some instances, using bitmaps is
> still slower than the non-bitmap case, but it is a substantial
> improvement over the classic bitmap walk.> 
>     $ ours="$(git branch --show-current)"
>       argv="--count --objects $ours --not --exclude=$ours --branches"
>       hyperfine \
>         -n 'no bitmaps' "git.compile rev-list $argv" \
>         -n 'existing bitmap traversal' "git rev-list --use-bitmap-index $argv" \
>         -n 'this commit' "git.compile rev-list --use-bitmap-index $argv"
>     Benchmark 1: no bitmaps
>       Time (mean ± σ):      15.1 ms ±   4.1 ms    [User: 8.1 ms, System: 6.9 ms]
>       Range (min … max):     7.4 ms …  21.8 ms    131 runs
> 
>     Benchmark 2: existing bitmap traversal
>       Time (mean ± σ):      82.6 ms ±   9.2 ms    [User: 63.6 ms, System: 19.0 ms]
>       Range (min … max):    73.8 ms … 105.4 ms    28 runs
> 
>     Benchmark 3: this commit
>       Time (mean ± σ):      19.8 ms ±   3.1 ms    [User: 13.0 ms, System: 6.8 ms]
>       Range (min … max):    17.7 ms …  38.6 ms    158 runs
> 
>     Summary
>       'no bitmaps' ran
>         1.31 ± 0.41 times faster than 'this commit'
>         5.49 ± 1.62 times faster than 'existing bitmap traversal'
> 
> In a large repository on GitHub.com, the timings to compute the objects
> unique to "master", as in:
> 
>     $ git rev-list --count --objects master --not --exclude=master --branches
> 
> improve as follows:
> 
>   - 1.012 sec (without bitmaps)
>   - 29.571 sec (with the existing bitmap walk)
>   - 2.279 sec (with these patches)

For my curiosity, and since you already have a test environment set up,
could you redo the "without bitmaps" case with pack.useSparse true and
false? When the option was created and defaulted to true, we never
really explored comparing it to the bitmap case. In fact, I assumed the
bitmap case was faster in important cases like this (where there is a
substantial difference in object counts), but your data is surprising me
that the sparse algorithm is outperforming bitmaps, even with this new
algorithm.

The main question I'd like to ask is: is pack.useSparse contributing
in an important way to the performance here?
 
I'll go poking into the patches now.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] revision: support tracking uninteresting commits
  2023-04-25  0:00 ` [PATCH 1/3] revision: support tracking uninteresting commits Taylor Blau
@ 2023-04-25 18:15   ` Derrick Stolee
  2023-05-03 21:48     ` Taylor Blau
  2023-05-03 22:08     ` Taylor Blau
  2023-04-25 18:22   ` Junio C Hamano
  1 sibling, 2 replies; 51+ messages in thread
From: Derrick Stolee @ 2023-04-25 18:15 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

On 4/24/2023 8:00 PM, Taylor Blau wrote:
> The boundary-based bitmap walk will want to know which commits were
> marked UNINTERESTING in the walk used to discover the boundary.
> 
> Track which commits are marked as such during list limitation as well as
> the topo walk step, though the latter is not used.

I was surprised to see this as an addition to revision.c, and
only specific to limit_list() (and its iterative copy). I would
expect this to work for non-topo-order, too. I suppose we couldn't
completely trust that the first visit to a commit includes the
UNINTERESTING flag if there is clock skew in the commit history.

But could we also do this at the caller's end by passing a
callback that checks each walked commit if it is UNINTERESTING
and adds it to a set?

I know that the walking code in builtin/pack-objects.c does
this same computation, but it's muddled with other checks and
the trees are marked as UNINTERESTING at the same time. It
doesn't seem like we can reuse anything directly out of there,
but it did give me the idea to try a callback.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] revision: support tracking uninteresting commits
  2023-04-25  0:00 ` [PATCH 1/3] revision: support tracking uninteresting commits Taylor Blau
  2023-04-25 18:15   ` Derrick Stolee
@ 2023-04-25 18:22   ` Junio C Hamano
  2023-04-25 18:48     ` Taylor Blau
  1 sibling, 1 reply; 51+ messages in thread
From: Junio C Hamano @ 2023-04-25 18:22 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

> The boundary-based bitmap walk will want to know which commits were
> marked UNINTERESTING in the walk used to discover the boundary.
>
> Track which commits are marked as such during list limitation as well as
> the topo walk step, though the latter is not used.

There are many more places, other than these two that are touched by
this patch, where a commit is marked with the UNINTERESTING bit
(e.g.  mark_parents_uninteresting(), process_parents()) and it is
not quite clear if the commits smudged with the UNINTERESTING bit in
these other places are eventually seen by these two places that
somebody else smudged them and get collected.

Or am I wrong to understand that the idea is to collect all
UNINTERESTING commits?

Also don't these two places see the same commit more than once, say
when traversing from two branch tips with UNINTERESTING bit set and
the traversals meet at the fork point of these two branches?  Would
such a commit get added to the array in duplicates?

Thanks.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-04-25 18:02 ` [PATCH 0/3] pack-bitmap: boundary-based " Junio C Hamano
@ 2023-04-25 18:30   ` Derrick Stolee
  2023-04-25 18:57     ` Taylor Blau
  0 siblings, 1 reply; 51+ messages in thread
From: Derrick Stolee @ 2023-04-25 18:30 UTC (permalink / raw)
  To: Junio C Hamano, Taylor Blau; +Cc: git, Jeff King

On 4/25/2023 2:02 PM, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> 
>> This improves many cases where using bitmaps was significantly *slower*
>> than regular, non-bitmap traversal. In some instances, using bitmaps is
>> still slower than the non-bitmap case, but it is a substantial
>> improvement over the classic bitmap walk.
>> ...
>> In a large repository on GitHub.com, the timings to compute the objects
>> unique to "master", as in:
>>
>>     $ git rev-list --count --objects master --not --exclude=master --branches
>>
>> improve as follows:
> 
> Curious---when would it be significantly faster to use bitmaps?
> "Most of the time"?  "In not-too-large repository?"

This is an interesting question that is very difficult to predict
in advance.

When building an independent implementation of reachability bitmaps
for Azure Repos, we never managed to make the bitmap method faster
for incremental fetches, on average, and so only enabled them during
full clones (no haves).

Of course, it's relatively easy to construct something that looks
like an incremental fetch but is as expensive as a clone (include
one have being a root commit), but it is very rare in practice.

It would be interesting if we used the initial commit walk to the
boundary as a predictor of whether bitmaps would be good, or if
we should use the tree-parsing walk instead.

But perhaps this series gets the bitmap walk "close enough" in
the typical case that it's better to use bitmaps in the chance that
we have accidentally hit a case where we'd normally need to walk a
lot of trees.

Thanks,
-Stolee



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/3] pack-bitmap.c: extract `fill_in_bitmap()`
  2023-04-25  0:00 ` [PATCH 2/3] pack-bitmap.c: extract `fill_in_bitmap()` Taylor Blau
@ 2023-04-25 18:32   ` Junio C Hamano
  2023-04-25 18:51     ` [PATCH 2/3] pack-bitmap.c: extract `fill_in_bitmap()`t Taylor Blau
  0 siblings, 1 reply; 51+ messages in thread
From: Junio C Hamano @ 2023-04-25 18:32 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

> To prepare for the boundary-based bitmap walk to perform a fill-in
> traversal using the boundary of either side as the tips, extract routine
> used to perform fill-in traversal by `find_objects()` so that it can be
> used in both places.

What is done is not a literal "extract", though.  Worth mentioning here?

> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  pack-bitmap.c | 66 +++++++++++++++++++++++++++++----------------------
>  1 file changed, 37 insertions(+), 29 deletions(-)
>
> +	if (base == NULL)
> +		base = bitmap_new();

Style?

> +	incdata.bitmap_git = bitmap_git;
> +	incdata.base = base;
> +	incdata.seen = seen;
> +
> +	revs->include_check = should_include;
> +	revs->include_check_obj = should_include_obj;
> +	revs->include_check_data = &incdata;
> +
> +	if (prepare_revision_walk(revs))
> +		die("revision walk setup failed");
> +
> +	show_data.bitmap_git = bitmap_git;
> +	show_data.base = base;
> +
> +	traverse_commit_list_filtered(revs, show_commit, show_object,
> +				      &show_data, NULL);

The original uses the variant without "_filtered" suffix.  In a
later update when the second caller is added, the new caller may
need to extend it, but shouldn't that wait until that step?

Thanks.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] revision: support tracking uninteresting commits
  2023-04-25 18:22   ` Junio C Hamano
@ 2023-04-25 18:48     ` Taylor Blau
  0 siblings, 0 replies; 51+ messages in thread
From: Taylor Blau @ 2023-04-25 18:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Derrick Stolee

On Tue, Apr 25, 2023 at 11:22:25AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> Or am I wrong to understand that the idea is to collect all
> UNINTERESTING commits?

The idea is to collect UNINTERESTING commits between the tips and their
commit boundary. This field should probably be renamed to indicate that.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/3] pack-bitmap.c: extract `fill_in_bitmap()`t
  2023-04-25 18:32   ` Junio C Hamano
@ 2023-04-25 18:51     ` Taylor Blau
  0 siblings, 0 replies; 51+ messages in thread
From: Taylor Blau @ 2023-04-25 18:51 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Derrick Stolee

On Tue, Apr 25, 2023 at 11:32:25AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > To prepare for the boundary-based bitmap walk to perform a fill-in
> > traversal using the boundary of either side as the tips, extract routine
> > used to perform fill-in traversal by `find_objects()` so that it can be
> > used in both places.
>
> What is done is not a literal "extract", though.  Worth mentioning here?

Oops. It is supposed to be a literal extraction, but this patch was
originally written before 3e0370a8d2 (list-objects: consolidate
traverse_commit_list[_filtered], 2022-03-09).

> > ---
> >  pack-bitmap.c | 66 +++++++++++++++++++++++++++++----------------------
> >  1 file changed, 37 insertions(+), 29 deletions(-)
> >
> > +	if (base == NULL)
> > +		base = bitmap_new();
>
> Style?

...It was also written before afe8a9070b (tree-wide: apply
equals-null.cocci, 2022-05-02), which is where this "if (base == NULL)"
thing comes from. I fixed both to be a clean extraction with no
functional changes.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/3] pack-bitmap.c: use commit boundary during bitmap traversal
  2023-04-25  0:00 ` [PATCH 3/3] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
@ 2023-04-25 18:52   ` Junio C Hamano
  2023-05-02 21:31     ` Felipe Contreras
  2023-05-03 21:42     ` Taylor Blau
  2023-04-25 18:53   ` Derrick Stolee
  1 sibling, 2 replies; 51+ messages in thread
From: Junio C Hamano @ 2023-04-25 18:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

>   3. Fall back to the ordinary object traversal if either (a) there are
>      no haves, (b) none of the haves are in the bitmapped pack or MIDX,
>      or (c) there are no wants.

Tangents.  If there are no haves, wouldn't the answer by definition
everything that are reachable from all wants?  Is there a reason why
a bitmapped walk cannot fulfill such a request?  If there are no
wants, wouldn't the answer by definition an empty set?  Again, I am
not sure what the ordinary object traversal is expected to do in
such a case.

These questions are not at all important, as this part is an
explation of what happened in the old code.

> Relaxing the bitmap traversal to allow it to produce over-counted
> results gives us the opportunity to make some significant improvements.
> Instead of the above, the new algorithm only has to walk from the
> *boundary* down to the nearest bitmap, instead of from each of the
> UNINTERESTING tips.

Is it only me, or are all readers equally confused by the use of
the term "boundary" that hasn't been given any definition?

> And is more-or-less equivalent to using the *old* algorithm with this
> invocation:
>
>     $ git rev-list --objects --boundary $WANTS --not $HAVES |

It is especially confusing because the "--boundary" (at least as I
originally had invented it), i.e. a commit that is smudged with
UNINTERESTING bit, whose parent we did not bother to dig further to
paint with the same UNINTERESTING bit, is not something we start our
computation with; it is something we learn _after_ completing the
history traversing.  So I am utterly confused X-<.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/3] pack-bitmap.c: use commit boundary during bitmap traversal
  2023-04-25  0:00 ` [PATCH 3/3] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
  2023-04-25 18:52   ` Junio C Hamano
@ 2023-04-25 18:53   ` Derrick Stolee
  1 sibling, 0 replies; 51+ messages in thread
From: Derrick Stolee @ 2023-04-25 18:53 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

On 4/24/2023 8:00 PM, Taylor Blau wrote:
> When reachability bitmap coverage exists in a repository, Git will use a
> different (and hopefully faster) traversal to compute revision walks.

Before getting into the code...

> Consider a set of positive and negative tips (which we'll refer to with
> their standard bitmap parlance by, "wants", and "haves"). In order to
> figure out what objects exist between the tips, the existing traversal
> in prepare_bitmap_walk() looks something like:

>   4. Construct a reachability bitmap for the "haves" side by walking
>      from the revision tips down to any existing bitmaps, OR-ing in any
>      bitmaps as they are found.
> 
>   5. Then do the same for the "wants" side, stopping at any objects that
>      appear in the "haves" bitmap.

One important thing about this older algorithm is that it will avoid
including trees and blobs that are reachable from the "haves" that
are not reachable from the boundary between "haves" and "wants". The
most obvious case is that someone force-pushed a branch after rewording
the commit messages but keeping the trees and blobs identical.

This case is rare, and usually the objects that would be repeated are
low enough in count and size (with deltas) that it's not a big deal, but
it is worth bringing up. I didn't see a reference to this side of the
difference in your message.

> One of the benefits, however, is that even if the walk is slower, bitmap
> traversals are guaranteed to provide an *exact* answer. Unlike the
> traditional object traversal algorithm, which can over-count the results
> by not opening trees for older commits, the bitmap walk builds an exact
> reachability bitmap for either side, meaning the results are never
> over-counted.

So the new algorithm is a new, third state of "not overcounting things
reachable beyond the commit-walk boundary" and "overcounting things
reachable from the wants..haves commit region".

> But producing non-exact results is OK for our traversal here (both in
> the bitmap case and not), as long as the results are over-counted, not
> under.

True, our response will not create a pack-file that cannot be applied
locally, but it might be larger than required.

For these reasons, I'm surprised that this patch completely replaces
the old algorithm for the new one. I would prefer to see a config
option that enables this new algorithm. It would be safer to deploy
that way, too.

> The new result performs significantly better in many cases, particularly
> when the distance from the boundary commit(s) to an existing bitmap is
> shorter than the distance from (all of) the have tips to the nearest
> bitmapped commit.

Aside: we may even gain benefits by adjusting the commits selected for
bitmaps by trying for this data shape. One way to attempt this is to
not focus on refs but on history common to "most refs" using first-
parent history as a good heuristic. If a commit appears in a lot of
first-parent histories, then it is more likely to be helpful to these
kinds of walks.
 
> Note that when using the old bitmap traversal algorithm, the results can
> be *slower* than without bitmaps! Under the new algorithm, the result is
> computed faster with bitmaps than without (at the cost of over-counting
> the true number of objects in a similar fashion as the non-bitmap
> traversal):
>
>     # (Computing objects unique to 'master' among all branches, not
>     # using bitmaps).
>     $ time git rev-list --count --objects master --not --exclude=master --branches
>     54

I like how you included the output of --count here. It might be interesting
to demonstrate the different forms of overcounting in a carefully constructed
test case, which would help us be sure that we are using the desired
algorithm during a test case.

>     real	0m1.012s  (no bitmaps)
>     real	0m29.571s (bitmaps, old)
>     real	0m2.279s  (bitmaps, new)

> The new algorithm is still slower than not using bitmaps at all, but it
> is nearly a 13-fold improvement over the existing traversal.
 
> In a more realistic setting (using my local copy of git.git), I can
> observe a similar speed-up:
...
> Here, the new algorithm is also still slower than not using bitmaps, but
> represents a 4-fold improvement over the existing traversal in a more
> modest example.

Interesting that the new bitmap algorithm is worse in all presented cases.

How "atypical" do we need to be in order for bitmaps to outperform the
tree-parsing algorithm? Do we need to try "--branches --not master~1000"
to replicate a stale single-branch clone getting every latest commit?

(I'll leave code review to a separate reply.)

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-04-25 18:30   ` Derrick Stolee
@ 2023-04-25 18:57     ` Taylor Blau
  2023-04-25 19:52       ` Derrick Stolee
  0 siblings, 1 reply; 51+ messages in thread
From: Taylor Blau @ 2023-04-25 18:57 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Junio C Hamano, git, Jeff King

On Tue, Apr 25, 2023 at 02:30:01PM -0400, Derrick Stolee wrote:
> On 4/25/2023 2:02 PM, Junio C Hamano wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> >> This improves many cases where using bitmaps was significantly *slower*
> >> than regular, non-bitmap traversal. In some instances, using bitmaps is
> >> still slower than the non-bitmap case, but it is a substantial
> >> improvement over the classic bitmap walk.
> >> ...
> >> In a large repository on GitHub.com, the timings to compute the objects
> >> unique to "master", as in:
> >>
> >>     $ git rev-list --count --objects master --not --exclude=master --branches
> >>
> >> improve as follows:
> >
> > Curious---when would it be significantly faster to use bitmaps?
> > "Most of the time"?  "In not-too-large repository?"
>
> This is an interesting question that is very difficult to predict
> in advance.

Indeed ;-).

> It would be interesting if we used the initial commit walk to the
> boundary as a predictor of whether bitmaps would be good, or if
> we should use the tree-parsing walk instead.

I think that this is tough, too. When making a prediction, it would be
good if we could avoid walking down to the nearest bitmap, since by that
point, you might as well have used bitmaps (if you found one before
reaching the root(s) of history).

I think the best heuristic we have available is if we have is something
like "# of UNINTERESTING commits between the tips and boundary", since
we OR those in ahead of time, and then the fill-in traversal will
(usually, not always) be cheap.

But I think that heuristic probably isn't too predictive, since it would
often give you the wrong prediction, say, when there are no
UNINTERESTING commits between the tips and boundary, but the boundary is
bitmapped.

Another approach you could take is to cut off the fill-in traversal at
some point, or start with whichever is presumed to be faster and then
dynamically switch to the other at some point during the walk. But that
also depends on repository shape/size, and machine configuration.

So I think it's extremely difficult to get "right", but if others have
suggestions, I would certainly be receptive to them :-).

> But perhaps this series gets the bitmap walk "close enough" in
> the typical case that it's better to use bitmaps in the chance that
> we have accidentally hit a case where we'd normally need to walk a
> lot of trees.

Yeah, that is definitely part of this series's goal. Another framing is,
for configurations that are using bitmaps everywhere, this should reduce
the number of cases where using bitmaps is *significantly* slower than
not.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-04-25 18:06 ` Derrick Stolee
@ 2023-04-25 19:01   ` Taylor Blau
  2023-04-25 20:27     ` Derrick Stolee
  0 siblings, 1 reply; 51+ messages in thread
From: Taylor Blau @ 2023-04-25 19:01 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Junio C Hamano

On Tue, Apr 25, 2023 at 02:06:25PM -0400, Derrick Stolee wrote:
> In other words: this generates something closer to the object set in the
> non-bitmapped object walk. The only difference is that the new bitmapped
> algorithm will see objects that were re-introduced across the boundary
> (say, a blob was reverted to its older mode).

Very well put, thank you.

> For my curiosity, and since you already have a test environment set up,
> could you redo the "without bitmaps" case with pack.useSparse true and
> false? When the option was created and defaulted to true, we never
> really explored comparing it to the bitmap case. In fact, I assumed the
> bitmap case was faster in important cases like this (where there is a
> substantial difference in object counts), but your data is surprising me
> that the sparse algorithm is outperforming bitmaps, even with this new
> algorithm.
>
> The main question I'd like to ask is: is pack.useSparse contributing
> in an important way to the performance here?

I don't know enough about pack.useSparse to say with certainty, but
trying again on the same repository (which is reasonably well-packed at
the moment), they appear about the same:

    $ time git -c pack.useSparse=false rev-list --count --objects master \
      --not --exclude=master --branches
    14

    real	0m0.986s
    user	0m0.599s
    sys	0m0.387s

    $ time git -c pack.useSparse=true rev-list --count --objects master \
      --not --exclude=master --branches
    14

    real	0m0.985s
    user	0m0.600s
    sys	0m0.385s

> I'll go poking into the patches now.

Thanks in advance for your review :-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-04-25 18:57     ` Taylor Blau
@ 2023-04-25 19:52       ` Derrick Stolee
  2023-05-03 21:43         ` Taylor Blau
  0 siblings, 1 reply; 51+ messages in thread
From: Derrick Stolee @ 2023-04-25 19:52 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, Jeff King

On 4/25/2023 2:57 PM, Taylor Blau wrote:
> On Tue, Apr 25, 2023 at 02:30:01PM -0400, Derrick Stolee wrote:
>> It would be interesting if we used the initial commit walk to the
>> boundary as a predictor of whether bitmaps would be good, or if
>> we should use the tree-parsing walk instead.
> 
> I think that this is tough, too. When making a prediction, it would be
> good if we could avoid walking down to the nearest bitmap, since by that
> point, you might as well have used bitmaps (if you found one before
> reaching the root(s) of history).

This isn't quite as cut-and-dry, because walking commits is much
faster than walking trees. I agree that we shouldn't walk too far
to avoid wasting time, but somehow we'd want to think about the
distance between the commits we want to cover and the bitmap(s)
we found in their history.

>> But perhaps this series gets the bitmap walk "close enough" in
>> the typical case that it's better to use bitmaps in the chance that
>> we have accidentally hit a case where we'd normally need to walk a
>> lot of trees.
> 
> Yeah, that is definitely part of this series's goal. Another framing is,
> for configurations that are using bitmaps everywhere, this should reduce
> the number of cases where using bitmaps is *significantly* slower than
> not.

It's a good opportunity to re-think if we should be using bitmaps in
these contexts at all, or if there is a simpler thing to do by
disabling them when 'haves' exist. I don't think that's the right
thing to do in all cases, but the experiments you've demonstrated here
seem to point in that direction.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-04-25 19:01   ` Taylor Blau
@ 2023-04-25 20:27     ` Derrick Stolee
  0 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee @ 2023-04-25 20:27 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On 4/25/2023 3:01 PM, Taylor Blau wrote:
> On Tue, Apr 25, 2023 at 02:06:25PM -0400, Derrick Stolee wrote:
>> For my curiosity, and since you already have a test environment set up,
>> could you redo the "without bitmaps" case with pack.useSparse true and
>> false? When the option was created and defaulted to true, we never
>> really explored comparing it to the bitmap case. In fact, I assumed the
>> bitmap case was faster in important cases like this (where there is a
>> substantial difference in object counts), but your data is surprising me
>> that the sparse algorithm is outperforming bitmaps, even with this new
>> algorithm.
>>
>> The main question I'd like to ask is: is pack.useSparse contributing
>> in an important way to the performance here?
> 
> I don't know enough about pack.useSparse to say with certainty, but
> trying again on the same repository (which is reasonably well-packed at
> the moment), they appear about the same:

Thanks for running that. I apologize, but I didn't sync in my head
that you're purposefully using 'git rev-list' to do the walk and the
config only makes a difference for 'git pack-objects'. The suspiciously
close timings is indeed too suspicious. Thus, the performance numbers
you are considering are not identical to what would happen in 'git
pack-objects' by default.

So, I created my own example using git.git, which included this input:

	refs/remotes/origin/master
	refs/remotes/origin/maint
	refs/remotes/origin/next
	refs/remotes/origin/seen
	^refs/remotes/origin/master~1000

And there is a clear preference for pack.useSparse=false in this
case:

Benchmark 1: GIT_TRACE2_PERF="/home/stolee/_git/git/git-review/trace-true.txt" ~/_git/git/git-review/git -c pack.useSparse=true -c pack.useBitmaps=false pack-objects --stdout --revs <in >/dev/null
  Time (mean ± σ):      2.857 s ±  0.031 s    [User: 6.850 s, System: 0.378 s]
  Range (min … max):    2.814 s …  2.922 s    10 runs
 
Benchmark 2: GIT_TRACE2_PERF="/home/stolee/_git/git/git-review/trace-false.txt" ~/_git/git/git-review/git -c pack.useSparse=false -c pack.useBitmaps=false pack-objects --stdout --revs <in >/dev/null
  Time (mean ± σ):      2.682 s ±  0.029 s    [User: 6.678 s, System: 0.388 s]
  Range (min … max):    2.653 s …  2.753 s    10 runs
 
and I used the perf traces to extract that the enumerate objects
phase took ~550ms for the sparse walk and ~375ms for the non-sparse
version.

But using this same input, I'm able to construct an example where
your modified bitmap walk is faster than the non-sparse object walk:

Benchmark 1: GIT_TRACE2_PERF="/home/stolee/_git/git/git-review/trace-true.txt" ~/_git/git/git-review/git -c pack.useSparse=false -c pack.useBitmaps=true pack-objects --stdout --revs <in >/dev/null
  Time (mean ± σ):      2.045 s ±  0.039 s    [User: 6.464 s, System: 0.334 s]
  Range (min … max):    1.966 s …  2.089 s    10 runs
 
Benchmark 2: GIT_TRACE2_PERF="/home/stolee/_git/git/git-review/trace-false.txt" ~/_git/git/git-review/git -c pack.useSparse=false -c pack.useBitmaps=false pack-objects --stdout --revs <in >/dev/null
  Time (mean ± σ):      2.138 s ±  0.034 s    [User: 6.070 s, System: 0.333 s]
  Range (min … max):    2.058 s …  2.182 s    10 runs

(Here, the enumerate objects phase takes ~230ms on average.)

I can easily swap this by changing my negative ref to

  ^revs/remotes/origin/master

Benchmark 1: GIT_TRACE2_PERF="/home/stolee/_git/git/git-review/trace-true.txt" ~/_git/git/git-review/git -c pack.useSparse=false -c pack.useBitmaps=true pack-objects --stdout --revs <in >/dev/null
  Time (mean ± σ):     521.0 ms ±  28.8 ms    [User: 1190.9 ms, System: 103.6 ms]
  Range (min … max):   496.5 ms … 589.8 ms    10 runs
 
Benchmark 2: GIT_TRACE2_PERF="/home/stolee/_git/git/git-review/trace-false.txt" ~/_git/git/git-review/git -c pack.useSparse=false -c pack.useBitmaps=false pack-objects --stdout --revs <in >/dev/null
  Time (mean ± σ):     498.5 ms ±  13.2 ms    [User: 1089.1 ms, System: 123.3 ms]
  Range (min … max):   471.3 ms … 513.6 ms    10 runs

which gives us reason to think about "small fetches" don't need
bitmaps, but "big fetches" do. But that can be done separately
from this patch.

Context about the sparse walk:

The sparse walk uses the names of the paths with respect to the root 
trees as a way to walk only the "new" objects without walking the
entire object set at the boundary. This makes a big difference in
repos with many files at HEAD, but the risk is that those path names
become significant overhead if there are enough new changes spread
across a large fraction of the tree. For clients pushing the work
of a single developer, the sparse algorithm outperforms the default
walk.

I wrote about this years ago in [1] and almost brought it up earlier
as a description of the commit boundary (called the frontier in [1]),
but the sparse object walk is helpful to visualize.

[1] https://devblogs.microsoft.com/devops/exploring-new-frontiers-for-git-push-performance/

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-04-25  0:00 [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
                   ` (4 preceding siblings ...)
  2023-04-25 18:06 ` Derrick Stolee
@ 2023-05-01 22:08 ` Junio C Hamano
  2023-05-02 23:52   ` Taylor Blau
  2023-05-05 17:27 ` [PATCH v2 0/2] " Taylor Blau
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 51+ messages in thread
From: Junio C Hamano @ 2023-05-01 22:08 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

> Here is a short (but dense) series that I worked on towards the end of
> 2021 with Peff.

The topic gathered interest on the day it was posted but after 24
hours or so activities seem to have quieted down.  Are we expecting
a polished version?  Or did it open a can of worms that is a bit
larger than we want to deal with and the topic is left on backburner?

Thanks.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/3] pack-bitmap.c: use commit boundary during bitmap traversal
  2023-04-25 18:52   ` Junio C Hamano
@ 2023-05-02 21:31     ` Felipe Contreras
  2023-05-03 21:42     ` Taylor Blau
  1 sibling, 0 replies; 51+ messages in thread
From: Felipe Contreras @ 2023-05-02 21:31 UTC (permalink / raw)
  To: Junio C Hamano, Taylor Blau; +Cc: git, Jeff King, Derrick Stolee

Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:

> > Relaxing the bitmap traversal to allow it to produce over-counted
> > results gives us the opportunity to make some significant improvements.
> > Instead of the above, the new algorithm only has to walk from the
> > *boundary* down to the nearest bitmap, instead of from each of the
> > UNINTERESTING tips.
> 
> Is it only me, or are all readers equally confused by the use of
> the term "boundary" that hasn't been given any definition?
> 
> > And is more-or-less equivalent to using the *old* algorithm with this
> > invocation:
> >
> >     $ git rev-list --objects --boundary $WANTS --not $HAVES |
> 
> It is especially confusing because the "--boundary" (at least as I
> originally had invented it), i.e. a commit that is smudged with
> UNINTERESTING bit, whose parent we did not bother to dig further to
> paint with the same UNINTERESTING bit, is not something we start our
> computation with; it is something we learn _after_ completing the
> history traversing.  So I am utterly confused X-<.

I don't know if it's the case, but in my mind all the `--not $HAVES` are
boundaries. Some of these might be overshadowed by another boundary higher up
in the topology and thus not shown, so in a sense are "uninteresting
boundaries".

Perhaps because you invented `--boundary` you think a "boundary" is only an
interesting boundary which must be computed, and all the other are not true
boundaries.

Cheers.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-05-01 22:08 ` Junio C Hamano
@ 2023-05-02 23:52   ` Taylor Blau
  0 siblings, 0 replies; 51+ messages in thread
From: Taylor Blau @ 2023-05-02 23:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Derrick Stolee

On Mon, May 01, 2023 at 03:08:30PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Here is a short (but dense) series that I worked on towards the end of
> > 2021 with Peff.
>
> The topic gathered interest on the day it was posted but after 24
> hours or so activities seem to have quieted down.  Are we expecting
> a polished version?  Or did it open a can of worms that is a bit
> larger than we want to deal with and the topic is left on backburner?

We are expecting a polished version ;-). Though I haven't gotten to it
as soon as I had hoped. I should have something on the list tomorrow,
though.

But, no, the can of worms is as expected here, and I need to provide
some more thorough explanation (e.g., we can collect UNINTERESTING
commits in limit_list() only because we only need to walk down to the
first UNINTERESTING commit common to both sides, and no further).

Sorry for the delay, more soon.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/3] pack-bitmap.c: use commit boundary during bitmap traversal
  2023-04-25 18:52   ` Junio C Hamano
  2023-05-02 21:31     ` Felipe Contreras
@ 2023-05-03 21:42     ` Taylor Blau
  2023-05-03 21:58       ` Junio C Hamano
  1 sibling, 1 reply; 51+ messages in thread
From: Taylor Blau @ 2023-05-03 21:42 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano; +Cc: git, Jeff King, Derrick Stolee

On Tue, Apr 25, 2023 at 11:52:25AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> >   3. Fall back to the ordinary object traversal if either (a) there are
> >      no haves, (b) none of the haves are in the bitmapped pack or MIDX,
> >      or (c) there are no wants.
>
> Tangents.  If there are no haves, wouldn't the answer by definition
> everything that are reachable from all wants?  Is there a reason why
> a bitmapped walk cannot fulfill such a request?  If there are no
> wants, wouldn't the answer by definition an empty set?  Again, I am
> not sure what the ordinary object traversal is expected to do in
> such a case.

Oops, (a) is just wrong. The old code is:

    /*
     * if we have a HAVES list, but none of those haves is contained
     * in the packfile that has a bitmap, we don't have anything to
     * optimize here
     */
    if (haves && !in_bitmapped_pack(bitmap_git, haves))
      goto cleanup;

which is itself an iffy optimization that I have toyed with the idea of
getting rid of over the years, since I am skeptical that it is doing
much even in the classic bitmap traversal.

But we only do that in_bitmapped_pack() if there are haves to begin
with, so (a) from my commit message is incorrect. Sorry about that.

> > And is more-or-less equivalent to using the *old* algorithm with this
> > invocation:
> >
> >     $ git rev-list --objects --boundary $WANTS --not $HAVES |
>
> It is especially confusing because the "--boundary" (at least as I
> originally had invented it), i.e. a commit that is smudged with
> UNINTERESTING bit, whose parent we did not bother to dig further to
> paint with the same UNINTERESTING bit, is not something we start our
> computation with; it is something we learn _after_ completing the
> history traversing.  So I am utterly confused X-<.

Hmm. This is from code that Peff and I wrote together a long time ago.
My recollection is that we only care about the UNINTERESTING commits
between the tips and boundary (as you described it), but no further, and
that we could discover that set during limit_list() alone.

But maybe not? I honestly cannot remember. Perhaps Peff does?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-04-25 19:52       ` Derrick Stolee
@ 2023-05-03 21:43         ` Taylor Blau
  0 siblings, 0 replies; 51+ messages in thread
From: Taylor Blau @ 2023-05-03 21:43 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Junio C Hamano, git, Jeff King

On Tue, Apr 25, 2023 at 03:52:29PM -0400, Derrick Stolee wrote:
> It's a good opportunity to re-think if we should be using bitmaps in
> these contexts at all, or if there is a simpler thing to do by
> disabling them when 'haves' exist. I don't think that's the right
> thing to do in all cases, but the experiments you've demonstrated here
> seem to point in that direction.

Yes, I agree that it's a worthwhile direction to consider in the future.
I fear that it will be hard to come up with absolute rules here, but
that doesn't mean it isn't worth trying :-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] revision: support tracking uninteresting commits
  2023-04-25 18:15   ` Derrick Stolee
@ 2023-05-03 21:48     ` Taylor Blau
  2023-05-04 13:46       ` Derrick Stolee
  2023-05-03 22:08     ` Taylor Blau
  1 sibling, 1 reply; 51+ messages in thread
From: Taylor Blau @ 2023-05-03 21:48 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Junio C Hamano

On Tue, Apr 25, 2023 at 02:15:49PM -0400, Derrick Stolee wrote:
> On 4/24/2023 8:00 PM, Taylor Blau wrote:
> > The boundary-based bitmap walk will want to know which commits were
> > marked UNINTERESTING in the walk used to discover the boundary.
> >
> > Track which commits are marked as such during list limitation as well as
> > the topo walk step, though the latter is not used.
>
> I was surprised to see this as an addition to revision.c, and
> only specific to limit_list() (and its iterative copy). I would
> expect this to work for non-topo-order, too. I suppose we couldn't
> completely trust that the first visit to a commit includes the
> UNINTERESTING flag if there is clock skew in the commit history.

Yeah, the distinction between limit_list() and the --topo-order code
makes things a little funky here. But I think that's OK, since
`--topo-order` is not likely to be used here, since this is all
bitmap-based traversal. IOW, I think that it would be reasonable to ban
the revision args combination of --use-bitmap-index with --topo-order.

> But could we also do this at the caller's end by passing a
> callback that checks each walked commit if it is UNINTERESTING
> and adds it to a set?

I think I remember trying this when I originally wrote this code, and
ended up discarding the idea because it walked over more commits than we
needed to consider.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/3] pack-bitmap.c: use commit boundary during bitmap traversal
  2023-05-03 21:42     ` Taylor Blau
@ 2023-05-03 21:58       ` Junio C Hamano
  0 siblings, 0 replies; 51+ messages in thread
From: Junio C Hamano @ 2023-05-03 21:58 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jeff King, git, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

> Hmm. This is from code that Peff and I wrote together a long time ago.
> My recollection is that we only care about the UNINTERESTING commits
> between the tips and boundary (as you described it), but no further, and
> that we could discover that set during limit_list() alone.
>
> But maybe not? I honestly cannot remember. Perhaps Peff does?

Yes, that matches my understanding.

By the time you finish limit_list(), from the "revision traversal"
point of view, you have done all the hard work already, sifting
commits between the ones that are and are not interesting.  I am
surprised that whatever you learn only after reaching that point is
still helpful to optimize some operations.




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] revision: support tracking uninteresting commits
  2023-04-25 18:15   ` Derrick Stolee
  2023-05-03 21:48     ` Taylor Blau
@ 2023-05-03 22:08     ` Taylor Blau
  2023-05-04 13:59       ` Derrick Stolee
  1 sibling, 1 reply; 51+ messages in thread
From: Taylor Blau @ 2023-05-03 22:08 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Junio C Hamano

On Tue, Apr 25, 2023 at 02:15:49PM -0400, Derrick Stolee wrote:
> I know that the walking code in builtin/pack-objects.c does
> this same computation, but it's muddled with other checks and
> the trees are marked as UNINTERESTING at the same time. It
> doesn't seem like we can reuse anything directly out of there,
> but it did give me the idea to try a callback.

Interesting idea. When you say callback, do you mean a function that we
invoke in place of where we currently call `add_object_array()`? Or do
you mean something else? Curious.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] revision: support tracking uninteresting commits
  2023-05-03 21:48     ` Taylor Blau
@ 2023-05-04 13:46       ` Derrick Stolee
  0 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee @ 2023-05-04 13:46 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On 5/3/2023 5:48 PM, Taylor Blau wrote:
> On Tue, Apr 25, 2023 at 02:15:49PM -0400, Derrick Stolee wrote:
>> On 4/24/2023 8:00 PM, Taylor Blau wrote:
>>> The boundary-based bitmap walk will want to know which commits were
>>> marked UNINTERESTING in the walk used to discover the boundary.
>>>
>>> Track which commits are marked as such during list limitation as well as
>>> the topo walk step, though the latter is not used.
>>
>> I was surprised to see this as an addition to revision.c, and
>> only specific to limit_list() (and its iterative copy). I would
>> expect this to work for non-topo-order, too. I suppose we couldn't
>> completely trust that the first visit to a commit includes the
>> UNINTERESTING flag if there is clock skew in the commit history.
> 
> Yeah, the distinction between limit_list() and the --topo-order code
> makes things a little funky here. But I think that's OK, since
> `--topo-order` is not likely to be used here, since this is all
> bitmap-based traversal. IOW, I think that it would be reasonable to ban
> the revision args combination of --use-bitmap-index with --topo-order.
 
I think there is a miscommunication here, since the limit_list()
code will only be run if there is a need to "walk to the end" before
outputting results. --topo-order is an example of something that
triggers this need, but it is disabled by default.

It seems that revs->collect_uninteresting would be a condition that
should signal to enable revs->limited so this code is actually
executed.

Taking a look at how this works with your current patch, the only
thing I can think is that since your initial commits include some
UNINTERESTING commits, that actually triggers revs->limited = 1
in handle_commit(). It's an indirect use, and without such a
commit the boundary would be empty, anyway.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] revision: support tracking uninteresting commits
  2023-05-03 22:08     ` Taylor Blau
@ 2023-05-04 13:59       ` Derrick Stolee
  2023-05-05 17:30         ` Taylor Blau
  0 siblings, 1 reply; 51+ messages in thread
From: Derrick Stolee @ 2023-05-04 13:59 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On 5/3/2023 6:08 PM, Taylor Blau wrote:
> On Tue, Apr 25, 2023 at 02:15:49PM -0400, Derrick Stolee wrote:
>> I know that the walking code in builtin/pack-objects.c does
>> this same computation, but it's muddled with other checks and
>> the trees are marked as UNINTERESTING at the same time. It
>> doesn't seem like we can reuse anything directly out of there,
>> but it did give me the idea to try a callback.
> 
> Interesting idea. When you say callback, do you mean a function that we
> invoke in place of where we currently call `add_object_array()`? Or do
> you mean something else? Curious.

I was talking about passing a callback into the revision walk
instead of having the revision code create a list, as you
replied to in this response:

>> But could we also do this at the caller's end by passing a
>> callback that checks each walked commit if it is UNINTERESTING
>> and adds it to a set?
> 
> I think I remember trying this when I originally wrote this code, and
> ended up discarding the idea because it walked over more commits than we
> needed to consider.
 
It's interesting that it walked more commits than you wanted. I
suppose it's somehow related to the boundary condition you're
implying by enabling the construction of this list.

Could you describe the situation where more commits are walked
than you want? I imagine we can't actually stop at the boundary
because we need to know that certain commits are actually reachable
from those boundary commits.

Here's an example (assume horizontal levels are equal generation
number for the sake of the walk's stopping condition).

  1 (UNINTERESTING)  2 (INTERESTING)
  |\_____       ____/|
  |      \     /     |
  3       4    5     6
  |       |\__ |     |
  |       |   \|     |
  7       8   [9]    10
  |       |    |     |
  11      12   13    14
  |       |    | ___/|
  |       |    |/    |
  15      16  [17]   18
  |       | ___|____/
  |       |/   |
 (19)    [20]   21

Here, 9, 17, and 20 are the boundary, as they are UNINITERESTING
but reachable from an interesting commit (5, 14, and 18,
respectively). This is sufficient to provide a stopping point
for this single-directional difference "2 --not 1".

It's important that we didn't stop walking at 9, since its
UNINTERESTING bit needed to propagate through 13 to get to 17.

Note that 19 is reachable from 1 but not 2, so we would need to
keep walking if we were looking for the boundary of the symmetric
difference 1...2. But we don't need it here since everything at
that level is marked UNINTERESTING so the walk can stop.

Is this example sufficiently complex for you to describe what
causes the extra commit walking? In such a case, what does the
object array approach add to prevent the extra walking?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v2 0/2] pack-bitmap: boundary-based bitmap traversal
  2023-04-25  0:00 [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
                   ` (5 preceding siblings ...)
  2023-05-01 22:08 ` Junio C Hamano
@ 2023-05-05 17:27 ` Taylor Blau
  2023-05-05 17:27   ` [PATCH v2 1/2] pack-bitmap.c: extract `fill_in_bitmap()` Taylor Blau
                     ` (3 more replies)
  2023-05-08 17:38 ` [PATCH v3 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
  2023-06-08 16:25 ` [PATCH v4 " Taylor Blau
  8 siblings, 4 replies; 51+ messages in thread
From: Taylor Blau @ 2023-05-05 17:27 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Derrick Stolee, Junio C Hamano

Here is a reroll of my series to implement a boundary-based bitmap
traversal algorithm that I worked on towards the end of 2021 with Peff.

The algorithm is unchanged from the last version, but the implementation
has been made much more straightforward, thanks to a very helpful
suggestion from Stolee.

Instead of hackily trying to write down all of the UNINTERESTING commits
between the tips and boundary within limit_list(), we can just perform a
commit-only walk, which will give us the set of commits that we need.

Thanks in advance for your review.

Taylor Blau (2):
  pack-bitmap.c: extract `fill_in_bitmap()`
  pack-bitmap.c: use commit boundary during bitmap traversal

 pack-bitmap.c | 249 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 188 insertions(+), 61 deletions(-)

Range-diff against v1:
1:  a643678c0f < -:  ---------- revision: support tracking uninteresting commits
2:  7624d3b5ba ! 1:  a3a1522439 pack-bitmap.c: extract `fill_in_bitmap()`
    @@ pack-bitmap.c: static int add_commit_to_bitmap(struct bitmap_index *bitmap_git,
     +	struct include_data incdata;
     +	struct bitmap_show_data show_data;
     +
    -+	if (base == NULL)
    ++	if (!base)
     +		base = bitmap_new();
     +
     +	incdata.bitmap_git = bitmap_git;
    @@ pack-bitmap.c: static int add_commit_to_bitmap(struct bitmap_index *bitmap_git,
     +	revs->include_check_data = &incdata;
     +
     +	if (prepare_revision_walk(revs))
    -+		die("revision walk setup failed");
    ++		die(_("revision walk setup failed"));
     +
     +	show_data.bitmap_git = bitmap_git;
     +	show_data.base = base;
     +
    -+	traverse_commit_list_filtered(revs, show_commit, show_object,
    -+				      &show_data, NULL);
    ++	traverse_commit_list(revs, show_commit, show_object, &show_data);
     +
     +	revs->include_check = NULL;
     +	revs->include_check_obj = NULL;
3:  91ed8c70f2 ! 2:  1993af00cb pack-bitmap.c: use commit boundary during bitmap traversal
    @@ Commit message
         different (and hopefully faster) traversal to compute revision walks.
     
         Consider a set of positive and negative tips (which we'll refer to with
    -    their standard bitmap parlance by, "wants", and "haves"). In order to
    +    their standard bitmap parlance by "wants", and "haves"). In order to
         figure out what objects exist between the tips, the existing traversal
    -    in prepare_bitmap_walk() looks something like:
    +    in `prepare_bitmap_walk()` does something like:
     
           1. Consider if we can even compute the set of objects with bitmaps,
              and fall back to the usual traversal if we cannot. For example,
    @@ Commit message
              respectively.
     
           3. Fall back to the ordinary object traversal if either (a) there are
    -         no haves, (b) none of the haves are in the bitmapped pack or MIDX,
    -         or (c) there are no wants.
    +         more than zero haves, none of which are in the bitmapped pack or
    +         MIDX, or (b) there are no wants.
     
           4. Construct a reachability bitmap for the "haves" side by walking
              from the revision tips down to any existing bitmaps, OR-ing in any
    @@ Commit message
         And is more-or-less equivalent to using the *old* algorithm with this
         invocation:
     
    -        $ git rev-list --objects --boundary $WANTS --not $HAVES |
    -            perl -lne 'print $1 if /^-(.*)/' |
    -            xargs git rev-list --objects --use-bitmap-index $WANTS --not
    +        $ git rev-list --objects --use-bitmap-index $WANTS --not \
    +            $(git rev-list --objects --boundary $WANTS --not $HAVES |
    +              perl -lne 'print $1 if /^-(.*)/')
     
         The new result performs significantly better in many cases, particularly
         when the distance from the boundary commit(s) to an existing bitmap is
    @@ Commit message
             # (Computing objects unique to 'master' among all branches, not
             # using bitmaps).
             $ time git rev-list --count --objects master --not --exclude=master --branches
    -        54
    +        43
     
    -        real        0m1.012s
    -        user        0m0.796s
    -        sys 0m0.214s
    +        real        0m0.346s
    +        user        0m0.231s
    +        sys 0m0.115s
     
             # (Computing the same uniqueness query using the old bitmap
             # traversal algorithm.)
             $ time git rev-list --use-bitmap-index --count --objects master --not --exclude=master --branches
    -        42
    +        41
     
    -        real        0m29.571s
    -        user        0m28.404s
    -        sys 0m1.164s
    +        real        0m20.007s
    +        user        0m19.045s
    +        sys 0m0.960s
     
             # (Computing the same uniqueness query using the new bitmap
             # traversal algorithm.)
    -        $ time git.compile rev-list --count --objects --use-bitmap-index master --not --exclude=master --branches
    -        54
    +        $ time git.compile rev-list --use-bitmap-index --count --objects master --not --exclude=master --branches
    +        41
     
    -        real        0m2.279s
    -        user        0m2.023s
    -        sys 0m0.256s
    +        real        0m1.748s
    +        user        0m1.428s
    +        sys 0m0.320s
     
         The new algorithm is still slower than not using bitmaps at all, but it
    -    is nearly a 13-fold improvement over the existing traversal.
    +    is nearly a 11-fold improvement over the existing traversal.
     
         In a more realistic setting (using my local copy of git.git), I can
    -    observe a similar speed-up:
    +    observe a similar (if more modest) speed-up:
     
             $ ours="$(git branch --show-current)"
               argv="--count --objects $ours --not --exclude=$ours --branches"
    @@ Commit message
                 -n 'existing bitmap traversal' "git rev-list --use-bitmap-index $argv" \
                 -n 'this commit' "git.compile rev-list --use-bitmap-index $argv"
             Benchmark 1: no bitmaps
    -          Time (mean ± σ):      15.1 ms ±   4.1 ms    [User: 8.1 ms, System: 6.9 ms]
    -          Range (min … max):     7.4 ms …  21.8 ms    131 runs
    +          Time (mean ± σ):       5.8 ms ±   0.2 ms    [User: 3.3 ms, System: 2.4 ms]
    +          Range (min … max):     5.4 ms …   7.0 ms    409 runs
     
             Benchmark 2: existing bitmap traversal
    -          Time (mean ± σ):      82.6 ms ±   9.2 ms    [User: 63.6 ms, System: 19.0 ms]
    -          Range (min … max):    73.8 ms … 105.4 ms    28 runs
    +          Time (mean ± σ):      65.3 ms ±   0.6 ms    [User: 49.3 ms, System: 15.8 ms]
    +          Range (min … max):    64.1 ms …  66.9 ms    45 runs
     
             Benchmark 3: this commit
    -          Time (mean ± σ):      19.8 ms ±   3.1 ms    [User: 13.0 ms, System: 6.8 ms]
    -          Range (min … max):    17.7 ms …  38.6 ms    158 runs
    +          Time (mean ± σ):      19.8 ms ±   0.3 ms    [User: 12.9 ms, System: 6.8 ms]
    +          Range (min … max):    19.1 ms …  20.8 ms    150 runs
     
             Summary
               'no bitmaps' ran
    -            1.31 ± 0.41 times faster than 'this commit'
    -            5.49 ± 1.62 times faster than 'existing bitmap traversal'
    +            3.43 ± 0.14 times faster than 'this commit'
    +           11.34 ± 0.45 times faster than 'existing bitmap traversal'
     
         Here, the new algorithm is also still slower than not using bitmaps, but
    -    represents a 4-fold improvement over the existing traversal in a more
    -    modest example.
    +    represents a more than 3-fold improvement over the existing traversal in
    +    a more modest example.
     
         Since this algorithm was originally written (nearly a year and a half
         ago, at the time of writing), the bitmap lookup table shipped, making
    @@ Commit message
             fewer "summary" bitmaps (which would also help with the above).
     
         Helped-by: Jeff King <peff@peff.net>
    +    Helped-by: Derrick Stolee <derrickstolee@github.com>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## pack-bitmap.c ##
    -@@ pack-bitmap.c: static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_git,
    - 	struct include_data incdata;
    - 	struct bitmap_show_data show_data;
    - 
    --	if (base == NULL)
    -+	if (!base)
    - 		base = bitmap_new();
    - 
    - 	incdata.bitmap_git = bitmap_git;
     @@ pack-bitmap.c: static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_git,
      	return base;
      }
    @@ pack-bitmap.c: static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_
     +	return bitmap_get(bitmap, pos);
     +}
     +
    -+static void show_boundary_commit(struct commit *commit, void *data)
    ++struct bitmap_boundary_cb {
    ++	struct bitmap_index *bitmap_git;
    ++	struct bitmap *base;
    ++
    ++	struct object_array boundary;
    ++};
    ++
    ++static void show_boundary_commit(struct commit *commit, void *_data)
     +{
    -+	struct object_array *boundary = data;
    -+	if (!(commit->object.flags & BOUNDARY))
    -+		return;
    ++	struct bitmap_boundary_cb *data = _data;
     +
    -+	add_object_array(&commit->object, "", boundary);
    ++	if (commit->object.flags & BOUNDARY)
    ++		add_object_array(&commit->object, "", &data->boundary);
    ++
    ++	if (commit->object.flags & UNINTERESTING) {
    ++		if (obj_in_bitmap(data->bitmap_git, &commit->object,
    ++				  data->base))
    ++			return;
    ++
    ++		add_commit_to_bitmap(data->bitmap_git, &data->base, commit);
    ++	}
     +}
     +
     +static void show_boundary_object(struct object *object,
    @@ pack-bitmap.c: static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_
     +					    struct rev_info *revs,
     +					    struct object_list *roots)
     +{
    -+	struct bitmap *base = NULL;
    -+	struct object_array boundary = OBJECT_ARRAY_INIT;
    -+	int any_missing = 0;
    ++	struct bitmap_boundary_cb cb = { .bitmap_git = bitmap_git };
     +	unsigned int i;
    -+	int tmp_blobs, tmp_trees, tmp_tags;
    ++	unsigned int tmp_blobs, tmp_trees, tmp_tags;
    ++	int any_missing = 0;
     +
     +	revs->ignore_missing_links = 1;
    -+	revs->collect_uninteresting = 1;
     +
     +	/*
     +	 * OR in any existing reachability bitmaps among `roots` into `base`.
    @@ pack-bitmap.c: static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_
     +		roots = roots->next;
     +
     +		if (object->type == OBJ_COMMIT &&
    -+		    !obj_in_bitmap(bitmap_git, object, base) &&
    -+		    add_commit_to_bitmap(bitmap_git, &base,
    ++		    !obj_in_bitmap(bitmap_git, object, cb.base) &&
    ++		    add_commit_to_bitmap(bitmap_git, &cb.base,
     +					 (struct commit *)object)) {
     +			object->flags |= SEEN;
     +			continue;
    @@ pack-bitmap.c: static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_
     +	revs->tag_objects = 0;
     +
     +	/*
    -+	 * We didn't have complete coverage of the roots. First OR in any
    -+	 * bitmaps that are UNINTERESTING between the tips and boundary.
    ++	 * We didn't have complete coverage of the roots. First setup a
    ++	 * revision walk to (a) OR in any bitmaps that are UNINTERESTING
    ++	 * between the tips and boundary, and (b) record the boundary.
     +	 */
     +	trace2_region_enter("pack-bitmap", "boundary-prepare", the_repository);
     +	if (prepare_revision_walk(revs))
     +		die("revision walk setup failed");
     +	trace2_region_leave("pack-bitmap", "boundary-prepare", the_repository);
     +
    -+	trace2_region_enter("pack-bitmap", "boundary-load-bitmaps", the_repository);
    -+	for (i = 0; i < revs->uninteresting_commits.nr; i++) {
    -+		struct object *obj = revs->uninteresting_commits.objects[i].item;
    -+		if (obj->type != OBJ_COMMIT)
    -+			BUG("unexpected non-commit %s marked uninteresting",
    -+			    oid_to_hex(&obj->oid));
    -+
    -+		if (obj_in_bitmap(bitmap_git, obj, base))
    -+			continue;
    -+
    -+		add_commit_to_bitmap(bitmap_git, &base, (struct commit *)obj);
    -+	}
    -+	trace2_region_leave("pack-bitmap", "boundary-load-bitmaps", the_repository);
    -+
    -+	/*
    -+	 * Then add the boundary commit(s) as fill-in traversal tips.
    -+	 */
     +	trace2_region_enter("pack-bitmap", "boundary-traverse", the_repository);
     +	revs->boundary = 1;
     +	traverse_commit_list_filtered(revs,
     +				      show_boundary_commit,
     +				      show_boundary_object,
    -+				      &boundary, NULL);
    ++				      &cb, NULL);
     +	revs->boundary = 0;
     +	revs->blob_objects = tmp_blobs;
     +	revs->tree_objects = tmp_trees;
    @@ pack-bitmap.c: static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_
     +	clear_object_flags(UNINTERESTING);
     +	trace2_region_leave("pack-bitmap", "boundary-traverse", the_repository);
     +
    ++	/*
    ++	 * Then add the boundary commit(s) as fill-in traversal tips.
    ++	 */
     +	trace2_region_enter("pack-bitmap", "boundary-fill-in", the_repository);
    -+	if (boundary.nr) {
    ++	if (cb.boundary.nr) {
     +		struct object *obj;
     +		int needs_walk = 0;
     +
    -+		for (i = 0; i < boundary.nr; i++) {
    -+			obj = boundary.objects[i].item;
    ++		for (i = 0; i < cb.boundary.nr; i++) {
    ++			obj = cb.boundary.objects[i].item;
     +
    -+			if (obj_in_bitmap(bitmap_git, obj, base)) {
    ++			if (obj_in_bitmap(bitmap_git, obj, cb.base)) {
     +				obj->flags |= SEEN;
     +			} else {
     +				add_pending_object(revs, obj, "");
    @@ pack-bitmap.c: static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_
     +		}
     +
     +		if (needs_walk)
    -+			base = fill_in_bitmap(bitmap_git, revs, base, NULL);
    ++			cb.base = fill_in_bitmap(bitmap_git, revs, cb.base, NULL);
     +	}
     +	trace2_region_leave("pack-bitmap", "boundary-fill-in", the_repository);
     +
     +
     +cleanup:
     +	revs->ignore_missing_links = 0;
    -+	revs->collect_uninteresting = 0;
     +
    -+	return base;
    ++	return cb.base;
     +}
     +
      static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
    @@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
      	if (!wants)
      		goto cleanup;
     @@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
    - 	if (load_bitmap(bitmap_git) < 0)
    + 	if (load_bitmap(revs->repo, bitmap_git) < 0)
      		goto cleanup;
      
     -	object_array_clear(&revs->pending);
-- 
2.40.1.478.gcab26587e8

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v2 1/2] pack-bitmap.c: extract `fill_in_bitmap()`
  2023-05-05 17:27 ` [PATCH v2 0/2] " Taylor Blau
@ 2023-05-05 17:27   ` Taylor Blau
  2023-05-05 17:27   ` [PATCH v2 2/2] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 51+ messages in thread
From: Taylor Blau @ 2023-05-05 17:27 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Derrick Stolee, Junio C Hamano

To prepare for the boundary-based bitmap walk to perform a fill-in
traversal using the boundary of either side as the tips, extract routine
used to perform fill-in traversal by `find_objects()` so that it can be
used in both places.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 65 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 36 insertions(+), 29 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index e0fad723bf..5d2cc6ac96 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1043,6 +1043,40 @@ static int add_commit_to_bitmap(struct bitmap_index *bitmap_git,
 	return 1;
 }
 
+static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_git,
+				     struct rev_info *revs,
+				     struct bitmap *base,
+				     struct bitmap *seen)
+{
+	struct include_data incdata;
+	struct bitmap_show_data show_data;
+
+	if (!base)
+		base = bitmap_new();
+
+	incdata.bitmap_git = bitmap_git;
+	incdata.base = base;
+	incdata.seen = seen;
+
+	revs->include_check = should_include;
+	revs->include_check_obj = should_include_obj;
+	revs->include_check_data = &incdata;
+
+	if (prepare_revision_walk(revs))
+		die(_("revision walk setup failed"));
+
+	show_data.bitmap_git = bitmap_git;
+	show_data.base = base;
+
+	traverse_commit_list(revs, show_commit, show_object, &show_data);
+
+	revs->include_check = NULL;
+	revs->include_check_obj = NULL;
+	revs->include_check_data = NULL;
+
+	return base;
+}
+
 static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 				   struct rev_info *revs,
 				   struct object_list *roots,
@@ -1108,35 +1142,8 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 		}
 	}
 
-	if (needs_walk) {
-		struct include_data incdata;
-		struct bitmap_show_data show_data;
-
-		if (!base)
-			base = bitmap_new();
-
-		incdata.bitmap_git = bitmap_git;
-		incdata.base = base;
-		incdata.seen = seen;
-
-		revs->include_check = should_include;
-		revs->include_check_obj = should_include_obj;
-		revs->include_check_data = &incdata;
-
-		if (prepare_revision_walk(revs))
-			die(_("revision walk setup failed"));
-
-		show_data.bitmap_git = bitmap_git;
-		show_data.base = base;
-
-		traverse_commit_list(revs,
-				     show_commit, show_object,
-				     &show_data);
-
-		revs->include_check = NULL;
-		revs->include_check_obj = NULL;
-		revs->include_check_data = NULL;
-	}
+	if (needs_walk)
+		base = fill_in_bitmap(bitmap_git, revs, base, seen);
 
 	return base;
 }
-- 
2.40.1.478.gcab26587e8


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 2/2] pack-bitmap.c: use commit boundary during bitmap traversal
  2023-05-05 17:27 ` [PATCH v2 0/2] " Taylor Blau
  2023-05-05 17:27   ` [PATCH v2 1/2] pack-bitmap.c: extract `fill_in_bitmap()` Taylor Blau
@ 2023-05-05 17:27   ` Taylor Blau
  2023-05-05 18:13     ` Derrick Stolee
  2023-05-05 17:33   ` [PATCH v2 0/2] pack-bitmap: boundary-based " Junio C Hamano
  2023-05-05 17:59   ` Derrick Stolee
  3 siblings, 1 reply; 51+ messages in thread
From: Taylor Blau @ 2023-05-05 17:27 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Derrick Stolee, Junio C Hamano

When reachability bitmap coverage exists in a repository, Git will use a
different (and hopefully faster) traversal to compute revision walks.

Consider a set of positive and negative tips (which we'll refer to with
their standard bitmap parlance by "wants", and "haves"). In order to
figure out what objects exist between the tips, the existing traversal
in `prepare_bitmap_walk()` does something like:

  1. Consider if we can even compute the set of objects with bitmaps,
     and fall back to the usual traversal if we cannot. For example,
     pathspec limiting traversals can't be computed using bitmaps (since
     they don't know which objects are at which paths). The same is true
     of certain kinds of non-trivial object filters.

  2. If we can compute the traversal with bitmaps, partition the
     (dereferenced) tips into two object lists, "haves", and "wants",
     based on whether or not the objects have the UNINTERESTING flag,
     respectively.

  3. Fall back to the ordinary object traversal if either (a) there are
     more than zero haves, none of which are in the bitmapped pack or
     MIDX, or (b) there are no wants.

  4. Construct a reachability bitmap for the "haves" side by walking
     from the revision tips down to any existing bitmaps, OR-ing in any
     bitmaps as they are found.

  5. Then do the same for the "wants" side, stopping at any objects that
     appear in the "haves" bitmap.

  6. Filter the results if any object filter (that can be easily
     computed with bitmaps alone) was given, and then return back to the
     caller.

When there is good bitmap coverage relative to the traversal tips, this
walk is often significantly faster than an ordinary object traversal
because it can visit far fewer objects.

But in certain cases, it can be significantly *slower* than the usual
object traversal. Why? Because we need to compute complete bitmaps on
either side of the walk. If either one (or both) of the sides require
walking many (or all!) objects before they get to an existing bitmap,
the extra bitmap machinery is mostly or all overhead.

One of the benefits, however, is that even if the walk is slower, bitmap
traversals are guaranteed to provide an *exact* answer. Unlike the
traditional object traversal algorithm, which can over-count the results
by not opening trees for older commits, the bitmap walk builds an exact
reachability bitmap for either side, meaning the results are never
over-counted.

But producing non-exact results is OK for our traversal here (both in
the bitmap case and not), as long as the results are over-counted, not
under.

Relaxing the bitmap traversal to allow it to produce over-counted
results gives us the opportunity to make some significant improvements.
Instead of the above, the new algorithm only has to walk from the
*boundary* down to the nearest bitmap, instead of from each of the
UNINTERESTING tips.

The boundary-based approach still has degenerate cases, but we'll show
in a moment that it is often a significant improvement.

The new algorithm works as follows:

  1. Build a (partial) bitmap of the haves side by first OR-ing any
     bitmap(s) that already exist for UNINTERESTING commits between the
     haves and the boundary.

  2. For each commit along the boundary, add it as a fill-in traversal
     tip (where the traversal terminates once an existing bitmap is
     found), and perform fill-in traversal.

  3. Build up a complete bitmap of the wants side as usual, stopping any
     time we intersect the (partial) haves side.

  4. Return the results.

And is more-or-less equivalent to using the *old* algorithm with this
invocation:

    $ git rev-list --objects --use-bitmap-index $WANTS --not \
        $(git rev-list --objects --boundary $WANTS --not $HAVES |
          perl -lne 'print $1 if /^-(.*)/')

The new result performs significantly better in many cases, particularly
when the distance from the boundary commit(s) to an existing bitmap is
shorter than the distance from (all of) the have tips to the nearest
bitmapped commit.

Note that when using the old bitmap traversal algorithm, the results can
be *slower* than without bitmaps! Under the new algorithm, the result is
computed faster with bitmaps than without (at the cost of over-counting
the true number of objects in a similar fashion as the non-bitmap
traversal):

    # (Computing objects unique to 'master' among all branches, not
    # using bitmaps).
    $ time git rev-list --count --objects master --not --exclude=master --branches
    43

    real	0m0.346s
    user	0m0.231s
    sys	0m0.115s

    # (Computing the same uniqueness query using the old bitmap
    # traversal algorithm.)
    $ time git rev-list --use-bitmap-index --count --objects master --not --exclude=master --branches
    41

    real	0m20.007s
    user	0m19.045s
    sys	0m0.960s

    # (Computing the same uniqueness query using the new bitmap
    # traversal algorithm.)
    $ time git.compile rev-list --use-bitmap-index --count --objects master --not --exclude=master --branches
    41

    real	0m1.748s
    user	0m1.428s
    sys	0m0.320s

The new algorithm is still slower than not using bitmaps at all, but it
is nearly a 11-fold improvement over the existing traversal.

In a more realistic setting (using my local copy of git.git), I can
observe a similar (if more modest) speed-up:

    $ ours="$(git branch --show-current)"
      argv="--count --objects $ours --not --exclude=$ours --branches"
      hyperfine \
        -n 'no bitmaps' "git.compile rev-list $argv" \
        -n 'existing bitmap traversal' "git rev-list --use-bitmap-index $argv" \
        -n 'this commit' "git.compile rev-list --use-bitmap-index $argv"
    Benchmark 1: no bitmaps
      Time (mean ± σ):       5.8 ms ±   0.2 ms    [User: 3.3 ms, System: 2.4 ms]
      Range (min … max):     5.4 ms …   7.0 ms    409 runs

    Benchmark 2: existing bitmap traversal
      Time (mean ± σ):      65.3 ms ±   0.6 ms    [User: 49.3 ms, System: 15.8 ms]
      Range (min … max):    64.1 ms …  66.9 ms    45 runs

    Benchmark 3: this commit
      Time (mean ± σ):      19.8 ms ±   0.3 ms    [User: 12.9 ms, System: 6.8 ms]
      Range (min … max):    19.1 ms …  20.8 ms    150 runs

    Summary
      'no bitmaps' ran
        3.43 ± 0.14 times faster than 'this commit'
       11.34 ± 0.45 times faster than 'existing bitmap traversal'

Here, the new algorithm is also still slower than not using bitmaps, but
represents a more than 3-fold improvement over the existing traversal in
a more modest example.

Since this algorithm was originally written (nearly a year and a half
ago, at the time of writing), the bitmap lookup table shipped, making
the new algorithm's result more competitive. A few other future
directions for improving bitmap traversal times beyond not using bitmaps
at all:

  - Decrease the cost to decompress and OR together many bitmaps
    together (particularly when enumerating the uninteresting side of
    the walk). Here we could explore more efficient bitmap storage
    techniques, like Roaring+Run and/or use SIMD instructions to speed
    up ORing them together.

  - Store pseudo-merge bitmaps, which could allow us to OR together
    fewer "summary" bitmaps (which would also help with the above).

Helped-by: Jeff King <peff@peff.net>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 190 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 155 insertions(+), 35 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 5d2cc6ac96..634fc10156 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1077,6 +1077,143 @@ static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_git,
 	return base;
 }
 
+static int obj_in_bitmap(struct bitmap_index *bitmap_git,
+			 struct object *obj, struct bitmap *bitmap)
+{
+	int pos;
+
+	if (!bitmap)
+		return 0;
+	pos = bitmap_position(bitmap_git, &obj->oid);
+	if (pos < 0)
+		return 0;
+	return bitmap_get(bitmap, pos);
+}
+
+struct bitmap_boundary_cb {
+	struct bitmap_index *bitmap_git;
+	struct bitmap *base;
+
+	struct object_array boundary;
+};
+
+static void show_boundary_commit(struct commit *commit, void *_data)
+{
+	struct bitmap_boundary_cb *data = _data;
+
+	if (commit->object.flags & BOUNDARY)
+		add_object_array(&commit->object, "", &data->boundary);
+
+	if (commit->object.flags & UNINTERESTING) {
+		if (obj_in_bitmap(data->bitmap_git, &commit->object,
+				  data->base))
+			return;
+
+		add_commit_to_bitmap(data->bitmap_git, &data->base, commit);
+	}
+}
+
+static void show_boundary_object(struct object *object,
+				 const char *name, void *data)
+{
+	BUG("should not be called");
+}
+
+static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git,
+					    struct rev_info *revs,
+					    struct object_list *roots)
+{
+	struct bitmap_boundary_cb cb = { .bitmap_git = bitmap_git };
+	unsigned int i;
+	unsigned int tmp_blobs, tmp_trees, tmp_tags;
+	int any_missing = 0;
+
+	revs->ignore_missing_links = 1;
+
+	/*
+	 * OR in any existing reachability bitmaps among `roots` into `base`.
+	 */
+	while (roots) {
+		struct object *object = roots->item;
+		roots = roots->next;
+
+		if (object->type == OBJ_COMMIT &&
+		    !obj_in_bitmap(bitmap_git, object, cb.base) &&
+		    add_commit_to_bitmap(bitmap_git, &cb.base,
+					 (struct commit *)object)) {
+			object->flags |= SEEN;
+			continue;
+		}
+
+		any_missing = 1;
+	}
+
+	if (!any_missing)
+		goto cleanup;
+
+	tmp_blobs = revs->blob_objects;
+	tmp_trees = revs->tree_objects;
+	tmp_tags = revs->blob_objects;
+	revs->blob_objects = 0;
+	revs->tree_objects = 0;
+	revs->tag_objects = 0;
+
+	/*
+	 * We didn't have complete coverage of the roots. First setup a
+	 * revision walk to (a) OR in any bitmaps that are UNINTERESTING
+	 * between the tips and boundary, and (b) record the boundary.
+	 */
+	trace2_region_enter("pack-bitmap", "boundary-prepare", the_repository);
+	if (prepare_revision_walk(revs))
+		die("revision walk setup failed");
+	trace2_region_leave("pack-bitmap", "boundary-prepare", the_repository);
+
+	trace2_region_enter("pack-bitmap", "boundary-traverse", the_repository);
+	revs->boundary = 1;
+	traverse_commit_list_filtered(revs,
+				      show_boundary_commit,
+				      show_boundary_object,
+				      &cb, NULL);
+	revs->boundary = 0;
+	revs->blob_objects = tmp_blobs;
+	revs->tree_objects = tmp_trees;
+	revs->tag_objects = tmp_tags;
+
+	reset_revision_walk();
+	clear_object_flags(UNINTERESTING);
+	trace2_region_leave("pack-bitmap", "boundary-traverse", the_repository);
+
+	/*
+	 * Then add the boundary commit(s) as fill-in traversal tips.
+	 */
+	trace2_region_enter("pack-bitmap", "boundary-fill-in", the_repository);
+	if (cb.boundary.nr) {
+		struct object *obj;
+		int needs_walk = 0;
+
+		for (i = 0; i < cb.boundary.nr; i++) {
+			obj = cb.boundary.objects[i].item;
+
+			if (obj_in_bitmap(bitmap_git, obj, cb.base)) {
+				obj->flags |= SEEN;
+			} else {
+				add_pending_object(revs, obj, "");
+				needs_walk = 1;
+			}
+		}
+
+		if (needs_walk)
+			cb.base = fill_in_bitmap(bitmap_git, revs, cb.base, NULL);
+	}
+	trace2_region_leave("pack-bitmap", "boundary-fill-in", the_repository);
+
+
+cleanup:
+	revs->ignore_missing_links = 0;
+
+	return cb.base;
+}
+
 static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 				   struct rev_info *revs,
 				   struct object_list *roots,
@@ -1142,8 +1279,21 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 		}
 	}
 
-	if (needs_walk)
+	if (needs_walk) {
+		/*
+		 * This fill-in traversal may walk over some objects
+		 * again, since we have already traversed in order to
+		 * find the boundary.
+		 *
+		 * But this extra walk should be extremely cheap, since
+		 * all commit objects are loaded into memory, and
+		 * because we skip walking to parents that are
+		 * UNINTERESTING, since it will be marked in the haves
+		 * bitmap already (or it has an on-disk bitmap, since
+		 * OR-ing it in covers all of its ancestors).
+		 */
 		base = fill_in_bitmap(bitmap_git, revs, base, seen);
+	}
 
 	return base;
 }
@@ -1259,25 +1409,6 @@ static void show_objects_for_type(
 	}
 }
 
-static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
-			     struct object_list *roots)
-{
-	while (roots) {
-		struct object *object = roots->item;
-		roots = roots->next;
-
-		if (bitmap_is_midx(bitmap_git)) {
-			if (bsearch_midx(&object->oid, bitmap_git->midx, NULL))
-				return 1;
-		} else {
-			if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
-				return 1;
-		}
-	}
-
-	return 0;
-}
-
 static struct bitmap *find_tip_objects(struct bitmap_index *bitmap_git,
 				       struct object_list *tip_objects,
 				       enum object_type type)
@@ -1585,14 +1716,6 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 			object_list_insert(object, &wants);
 	}
 
-	/*
-	 * if we have a HAVES list, but none of those haves is contained
-	 * in the packfile that has a bitmap, we don't have anything to
-	 * optimize here
-	 */
-	if (haves && !in_bitmapped_pack(bitmap_git, haves))
-		goto cleanup;
-
 	/* if we don't want anything, we're done here */
 	if (!wants)
 		goto cleanup;
@@ -1605,18 +1728,15 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 	if (load_bitmap(revs->repo, bitmap_git) < 0)
 		goto cleanup;
 
-	object_array_clear(&revs->pending);
-
 	if (haves) {
-		revs->ignore_missing_links = 1;
-		haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
-		reset_revision_walk();
-		revs->ignore_missing_links = 0;
-
+		haves_bitmap = find_boundary_objects(bitmap_git, revs, haves);
 		if (!haves_bitmap)
 			BUG("failed to perform bitmap walk");
 	}
 
+	object_array_clear(&revs->pending);
+	reset_revision_walk();
+
 	wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap);
 
 	if (!wants_bitmap)
-- 
2.40.1.478.gcab26587e8

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] revision: support tracking uninteresting commits
  2023-05-04 13:59       ` Derrick Stolee
@ 2023-05-05 17:30         ` Taylor Blau
  0 siblings, 0 replies; 51+ messages in thread
From: Taylor Blau @ 2023-05-05 17:30 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Junio C Hamano

On Thu, May 04, 2023 at 09:59:32AM -0400, Derrick Stolee wrote:
> It's interesting that it walked more commits than you wanted. I
> suppose it's somehow related to the boundary condition you're
> implying by enabling the construction of this list.
>
> Could you describe the situation where more commits are walked
> than you want? I imagine we can't actually stop at the boundary
> because we need to know that certain commits are actually reachable
> from those boundary commits.

I honestly cannot remember, and was unable to reproduce it when I
reworked the substantive portion of this series last night.

For posterity, Stolee and I had an off-list discussion yesterday where
he walked me through his suggestion to implement the boundary search via
a straightforward revision walk, instead of grafting onto cherry-picked
components of the revision internals.

It works great, and I have been unable to trick it into "walking too
much".

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 0/2] pack-bitmap: boundary-based bitmap traversal
  2023-05-05 17:27 ` [PATCH v2 0/2] " Taylor Blau
  2023-05-05 17:27   ` [PATCH v2 1/2] pack-bitmap.c: extract `fill_in_bitmap()` Taylor Blau
  2023-05-05 17:27   ` [PATCH v2 2/2] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
@ 2023-05-05 17:33   ` Junio C Hamano
  2023-05-05 17:59   ` Derrick Stolee
  3 siblings, 0 replies; 51+ messages in thread
From: Junio C Hamano @ 2023-05-05 17:33 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

> Instead of hackily trying to write down all of the UNINTERESTING commits
> between the tips and boundary within limit_list(), we can just perform a
> commit-only walk, which will give us the set of commits that we need.

That does make a lot of sense.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 0/2] pack-bitmap: boundary-based bitmap traversal
  2023-05-05 17:27 ` [PATCH v2 0/2] " Taylor Blau
                     ` (2 preceding siblings ...)
  2023-05-05 17:33   ` [PATCH v2 0/2] pack-bitmap: boundary-based " Junio C Hamano
@ 2023-05-05 17:59   ` Derrick Stolee
  2023-05-05 18:46     ` [PATCH v2 0/2] pack-bitmap: boundary-based bitmap traversalt Taylor Blau
  3 siblings, 1 reply; 51+ messages in thread
From: Derrick Stolee @ 2023-05-05 17:59 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

On 5/5/2023 1:27 PM, Taylor Blau wrote:
> Here is a reroll of my series to implement a boundary-based bitmap
> traversal algorithm that I worked on towards the end of 2021 with Peff.
> 
> The algorithm is unchanged from the last version, but the implementation
> has been made much more straightforward, thanks to a very helpful
> suggestion from Stolee.
> 
> Instead of hackily trying to write down all of the UNINTERESTING commits
> between the tips and boundary within limit_list(), we can just perform a
> commit-only walk, which will give us the set of commits that we need.

Something that didn't seem to get attention in this version was buried
deep in the commentary of my high-level review [1]:

> For these reasons, I'm surprised that this patch completely replaces
> the old algorithm for the new one. I would prefer to see a config
> option that enables this new algorithm. It would be safer to deploy
> that way, too.

I still think it would be nice to keep the two algorithms for at least
a little while instead of completely removing the old one. Let's gather
some more evidence and get more reps in with the new algorithm before
making it the new default. I could imagine a scenario where someone
really wants to spend the extra time to make sure none of the objects
reachable from the UNINTERESTING commits are included in the output of
this diff.

[1] https://lore.kernel.org/git/a143150d-7cf5-c697-0e48-0f7af1c6de8f@github.com/

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 2/2] pack-bitmap.c: use commit boundary during bitmap traversal
  2023-05-05 17:27   ` [PATCH v2 2/2] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
@ 2023-05-05 18:13     ` Derrick Stolee
  2023-05-05 18:43       ` Taylor Blau
  0 siblings, 1 reply; 51+ messages in thread
From: Derrick Stolee @ 2023-05-05 18:13 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

On 5/5/2023 1:27 PM, Taylor Blau wrote:

> +static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git,
> +					    struct rev_info *revs,
> +					    struct object_list *roots)
> +{
> +	struct bitmap_boundary_cb cb = { .bitmap_git = bitmap_git };
> +	unsigned int i;
> +	unsigned int tmp_blobs, tmp_trees, tmp_tags;
> +	int any_missing = 0;
> +
> +	revs->ignore_missing_links = 1;
> +
> +	/*
> +	 * OR in any existing reachability bitmaps among `roots` into `base`.
> +	 */
> +	while (roots) {
> +		struct object *object = roots->item;
> +		roots = roots->next;
> +
> +		if (object->type == OBJ_COMMIT &&
> +		    !obj_in_bitmap(bitmap_git, object, cb.base) &&
> +		    add_commit_to_bitmap(bitmap_git, &cb.base,
> +					 (struct commit *)object)) {
> +			object->flags |= SEEN;
> +			continue;
> +		}
> +
> +		any_missing = 1;
> +	}
> +
> +	if (!any_missing)
> +		goto cleanup;

Here, we check if the list of roots are completely covered by existing
bitmaps. This prevents doing the commit-only walk as well as the boundary
checks.

There's another confusing bit here: if we have a bitmap for the commit,
then we mark it as SEEN. Does that mean that the later commit walk will
skip walking its history? Would we then get a boundary that is actually
further in history than necessary? ("A --not B C" would walk all of
B..A if C has a bitmap, even if a lot of that region is reachable from C.)

My initial thought here was that this is an unlikely case, so the
optimization isn't worth the code complexity. But now, I'm a little
concerned that it actually hurts the later walk in the case of multiple
roots.

> +	tmp_blobs = revs->blob_objects;
> +	tmp_trees = revs->tree_objects;
> +	tmp_tags = revs->blob_objects;
> +	revs->blob_objects = 0;
> +	revs->tree_objects = 0;
> +	revs->tag_objects = 0;
> +
> +	/*
> +	 * We didn't have complete coverage of the roots. First setup a
> +	 * revision walk to (a) OR in any bitmaps that are UNINTERESTING
> +	 * between the tips and boundary, and (b) record the boundary.
> +	 */
> +	trace2_region_enter("pack-bitmap", "boundary-prepare", the_repository);
> +	if (prepare_revision_walk(revs))
> +		die("revision walk setup failed");
> +	trace2_region_leave("pack-bitmap", "boundary-prepare", the_repository);
> +
> +	trace2_region_enter("pack-bitmap", "boundary-traverse", the_repository);
> +	revs->boundary = 1;
> +	traverse_commit_list_filtered(revs,
> +				      show_boundary_commit,
> +				      show_boundary_object,
> +				      &cb, NULL);

Looks good. Callbacks were clear when I read them.

> +	revs->boundary = 0;
> +	revs->blob_objects = tmp_blobs;
> +	revs->tree_objects = tmp_trees;
> +	revs->tag_objects = tmp_tags;

These would seem more natural if they were after the trace2_region_leave().

> +	reset_revision_walk();
> +	clear_object_flags(UNINTERESTING);
> +	trace2_region_leave("pack-bitmap", "boundary-traverse", the_repository);
> +
> +	/*
> +	 * Then add the boundary commit(s) as fill-in traversal tips.
> +	 */
> +	trace2_region_enter("pack-bitmap", "boundary-fill-in", the_repository);
> +	if (cb.boundary.nr) {

Making this an if block does help keep its variables more local. It does
make me think about whether the trace2 regions should be within the block,
but it's easier to analyze things when the regions are expected to be present.

> +		struct object *obj;
> +		int needs_walk = 0;
> +
> +		for (i = 0; i < cb.boundary.nr; i++) {
> +			obj = cb.boundary.objects[i].item;
> +
> +			if (obj_in_bitmap(bitmap_git, obj, cb.base)) {
> +				obj->flags |= SEEN;

This SEEN makes sense: we want to terminate the walk here as everything
reachable is already in the bitmap.

> +			} else {
> +				add_pending_object(revs, obj, "");
> +				needs_walk = 1;
> +			}
> +		}
> +
> +		if (needs_walk)
> +			cb.base = fill_in_bitmap(bitmap_git, revs, cb.base, NULL);

I wonder if fill_in_bitmap() already does "the right thing" when there
are no pending objects or when all pending objects are already in the
bitmap. Do we need to do these checks, or should we just put all boundary
commits in the pending set?

> +	}
> +	trace2_region_leave("pack-bitmap", "boundary-fill-in", the_repository);
> +
> +

nit: extra empty line

> +cleanup:
> +	revs->ignore_missing_links = 0;
> +
> +	return cb.base;

Do we need to clean up the cb.boundary object array?

> +}
> +

...

> @@ -1605,18 +1728,15 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
>  	if (load_bitmap(revs->repo, bitmap_git) < 0)
>  		goto cleanup;
>  
> -	object_array_clear(&revs->pending);
> -
>  	if (haves) {
> -		revs->ignore_missing_links = 1;
> -		haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
> -		reset_revision_walk();
> -		revs->ignore_missing_links = 0;
> -
> +		haves_bitmap = find_boundary_objects(bitmap_git, revs, haves);
>  		if (!haves_bitmap)
>  			BUG("failed to perform bitmap walk");
>  	}

I mentioned in reply to the cover letter that deleting this older algorithm
is likely premature at this point, and we might want to keep it in general
as an option.

> +	object_array_clear(&revs->pending);
> +	reset_revision_walk();
> +

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 2/2] pack-bitmap.c: use commit boundary during bitmap traversal
  2023-05-05 18:13     ` Derrick Stolee
@ 2023-05-05 18:43       ` Taylor Blau
  0 siblings, 0 replies; 51+ messages in thread
From: Taylor Blau @ 2023-05-05 18:43 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Junio C Hamano

On Fri, May 05, 2023 at 02:13:45PM -0400, Derrick Stolee wrote:
> On 5/5/2023 1:27 PM, Taylor Blau wrote:
>
> > +static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git,
> > +					    struct rev_info *revs,
> > +					    struct object_list *roots)
> > +{
> > +	struct bitmap_boundary_cb cb = { .bitmap_git = bitmap_git };
> > +	unsigned int i;
> > +	unsigned int tmp_blobs, tmp_trees, tmp_tags;
> > +	int any_missing = 0;
> > +
> > +	revs->ignore_missing_links = 1;
> > +
> > +	/*
> > +	 * OR in any existing reachability bitmaps among `roots` into `base`.
> > +	 */
> > +	while (roots) {
> > +		struct object *object = roots->item;
> > +		roots = roots->next;
> > +
> > +		if (object->type == OBJ_COMMIT &&
> > +		    !obj_in_bitmap(bitmap_git, object, cb.base) &&
> > +		    add_commit_to_bitmap(bitmap_git, &cb.base,
> > +					 (struct commit *)object)) {
> > +			object->flags |= SEEN;
> > +			continue;
> > +		}
> > +
> > +		any_missing = 1;
> > +	}
> > +
> > +	if (!any_missing)
> > +		goto cleanup;
>
> Here, we check if the list of roots are completely covered by existing
> bitmaps. This prevents doing the commit-only walk as well as the boundary
> checks.
>
> There's another confusing bit here: if we have a bitmap for the commit,
> then we mark it as SEEN. Does that mean that the later commit walk will
> skip walking its history? Would we then get a boundary that is actually
> further in history than necessary? ("A --not B C" would walk all of
> B..A if C has a bitmap, even if a lot of that region is reachable from C.)
>
> My initial thought here was that this is an unlikely case, so the
> optimization isn't worth the code complexity. But now, I'm a little
> concerned that it actually hurts the later walk in the case of multiple
> roots.

The point about commits with existing bitmaps as SEEN hurting the
boundary traversal is a good one that I hadn't considered. I agree with
what you're saying here, though I think that the optimization still
makes sense (without touching the SEEN bit).

If all of the haves have existing bitmaps, we don't want to bother
starting the boundary traversal at all, since we can just OR those
together and be done. So I think we'd just want to drop the SEEN bits
there, but let me know if I am missing something.

> > +	revs->boundary = 0;
> > +	revs->blob_objects = tmp_blobs;
> > +	revs->tree_objects = tmp_trees;
> > +	revs->tag_objects = tmp_tags;
>
> These would seem more natural if they were after the trace2_region_leave().

Good suggestion, thanks.

> > +			} else {
> > +				add_pending_object(revs, obj, "");
> > +				needs_walk = 1;
> > +			}
> > +		}
> > +
> > +		if (needs_walk)
> > +			cb.base = fill_in_bitmap(bitmap_git, revs, cb.base, NULL);
>
> I wonder if fill_in_bitmap() already does "the right thing" when there
> are no pending objects or when all pending objects are already in the
> bitmap. Do we need to do these checks, or should we just put all boundary
> commits in the pending set?

Great suggestion. It does do the right thing via its `should_include`
checks, which see if the commit we're trying to process is already in
the bitmap, and if so, propagates SEEN to it and its parents.

Here's a diff that I generated after reading through this message, let
me know if you think I'm missing anything:

--- >8 ---
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 634fc10156..f57b74bcc0 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1141,7 +1141,6 @@ static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git,
 		    !obj_in_bitmap(bitmap_git, object, cb.base) &&
 		    add_commit_to_bitmap(bitmap_git, &cb.base,
 					 (struct commit *)object)) {
-			object->flags |= SEEN;
 			continue;
 		}

@@ -1175,13 +1174,14 @@ static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git,
 				      show_boundary_object,
 				      &cb, NULL);
 	revs->boundary = 0;
+	trace2_region_leave("pack-bitmap", "boundary-traverse", the_repository);
+
 	revs->blob_objects = tmp_blobs;
 	revs->tree_objects = tmp_trees;
 	revs->tag_objects = tmp_tags;

 	reset_revision_walk();
 	clear_object_flags(UNINTERESTING);
-	trace2_region_leave("pack-bitmap", "boundary-traverse", the_repository);

 	/*
 	 * Then add the boundary commit(s) as fill-in traversal tips.
@@ -1189,26 +1189,21 @@ static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git,
 	trace2_region_enter("pack-bitmap", "boundary-fill-in", the_repository);
 	if (cb.boundary.nr) {
 		struct object *obj;
-		int needs_walk = 0;
-
 		for (i = 0; i < cb.boundary.nr; i++) {
 			obj = cb.boundary.objects[i].item;

-			if (obj_in_bitmap(bitmap_git, obj, cb.base)) {
+			if (obj_in_bitmap(bitmap_git, obj, cb.base))
 				obj->flags |= SEEN;
-			} else {
+			else
 				add_pending_object(revs, obj, "");
-				needs_walk = 1;
-			}
 		}

-		if (needs_walk)
-			cb.base = fill_in_bitmap(bitmap_git, revs, cb.base, NULL);
+		cb.base = fill_in_bitmap(bitmap_git, revs, cb.base, NULL);
 	}
 	trace2_region_leave("pack-bitmap", "boundary-fill-in", the_repository);

-
 cleanup:
+	object_array_clear(&cb.boundary);
 	revs->ignore_missing_links = 0;

 	return cb.base;
--- 8< ---

Thanks,
Taylor

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 0/2] pack-bitmap: boundary-based bitmap traversalt
  2023-05-05 17:59   ` Derrick Stolee
@ 2023-05-05 18:46     ` Taylor Blau
  2023-05-05 20:45       ` Derrick Stolee
  0 siblings, 1 reply; 51+ messages in thread
From: Taylor Blau @ 2023-05-05 18:46 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Junio C Hamano

On Fri, May 05, 2023 at 01:59:31PM -0400, Derrick Stolee wrote:
> On 5/5/2023 1:27 PM, Taylor Blau wrote:
> > Here is a reroll of my series to implement a boundary-based bitmap
> > traversal algorithm that I worked on towards the end of 2021 with Peff.
> >
> > The algorithm is unchanged from the last version, but the implementation
> > has been made much more straightforward, thanks to a very helpful
> > suggestion from Stolee.
> >
> > Instead of hackily trying to write down all of the UNINTERESTING commits
> > between the tips and boundary within limit_list(), we can just perform a
> > commit-only walk, which will give us the set of commits that we need.
>
> Something that didn't seem to get attention in this version was buried
> deep in the commentary of my high-level review [1]:

Oops, sorry, I definitely missed this unintentionally and did not mean
to ignore it.

> > For these reasons, I'm surprised that this patch completely replaces
> > the old algorithm for the new one. I would prefer to see a config
> > option that enables this new algorithm. It would be safer to deploy
> > that way, too.
>
> I still think it would be nice to keep the two algorithms for at least
> a little while instead of completely removing the old one. Let's gather
> some more evidence and get more reps in with the new algorithm before
> making it the new default. I could imagine a scenario where someone
> really wants to spend the extra time to make sure none of the objects
> reachable from the UNINTERESTING commits are included in the output of
> this diff.
>
> [1] https://lore.kernel.org/git/a143150d-7cf5-c697-0e48-0f7af1c6de8f@github.com/

Hmm. I'm not opposed to keeping the old algorithm around, though I
wonder what the configuration option would look like here. I imagine
that we don't want to support the old algorithm indefinitely, though.

Perhaps something like `pack.useBoundaryBitmapTraversal` (implied by
`feature.experimental`), defaulting to "false", and then eventually
"true"?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 0/2] pack-bitmap: boundary-based bitmap traversalt
  2023-05-05 18:46     ` [PATCH v2 0/2] pack-bitmap: boundary-based bitmap traversalt Taylor Blau
@ 2023-05-05 20:45       ` Derrick Stolee
  0 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee @ 2023-05-05 20:45 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano



On 5/5/2023 2:46 PM, Taylor Blau wrote:
> On Fri, May 05, 2023 at 01:59:31PM -0400, Derrick Stolee wrote:
>> On 5/5/2023 1:27 PM, Taylor Blau wrote:
>>> Here is a reroll of my series to implement a boundary-based bitmap
>>> traversal algorithm that I worked on towards the end of 2021 with Peff.
>>>
>>> The algorithm is unchanged from the last version, but the implementation
>>> has been made much more straightforward, thanks to a very helpful
>>> suggestion from Stolee.
>>>
>>> Instead of hackily trying to write down all of the UNINTERESTING commits
>>> between the tips and boundary within limit_list(), we can just perform a
>>> commit-only walk, which will give us the set of commits that we need.
>>
>> Something that didn't seem to get attention in this version was buried
>> deep in the commentary of my high-level review [1]:
> 
> Oops, sorry, I definitely missed this unintentionally and did not mean
> to ignore it.

I had to dig deep to find it, even after knowing it was in there
somewhere, so I'm not upset it didn't happen this version.

>>> For these reasons, I'm surprised that this patch completely replaces
>>> the old algorithm for the new one. I would prefer to see a config
>>> option that enables this new algorithm. It would be safer to deploy
>>> that way, too.
>>
>> I still think it would be nice to keep the two algorithms for at least
>> a little while instead of completely removing the old one. Let's gather
>> some more evidence and get more reps in with the new algorithm before
>> making it the new default. I could imagine a scenario where someone
>> really wants to spend the extra time to make sure none of the objects
>> reachable from the UNINTERESTING commits are included in the output of
>> this diff.
>>
>> [1] https://lore.kernel.org/git/a143150d-7cf5-c697-0e48-0f7af1c6de8f@github.com/
> 
> Hmm. I'm not opposed to keeping the old algorithm around, though I
> wonder what the configuration option would look like here. I imagine
> that we don't want to support the old algorithm indefinitely, though.
> 
> Perhaps something like `pack.useBoundaryBitmapTraversal` (implied by
> `feature.experimental`), defaulting to "false", and then eventually
> "true"?

This name makes sense to me. Including it in feature.experimental right
away seems like a good idea. Incrementing it to "true" by default after
a single release would make sense, too, since the performance benefits
are so clear. Just important to have that emergency toggle.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v3 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-04-25  0:00 [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
                   ` (6 preceding siblings ...)
  2023-05-05 17:27 ` [PATCH v2 0/2] " Taylor Blau
@ 2023-05-08 17:38 ` Taylor Blau
  2023-05-08 17:38   ` [PATCH v3 1/3] object: add object_array initializer helper function Taylor Blau
                     ` (3 more replies)
  2023-06-08 16:25 ` [PATCH v4 " Taylor Blau
  8 siblings, 4 replies; 51+ messages in thread
From: Taylor Blau @ 2023-05-08 17:38 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Derrick Stolee, Junio C Hamano

Here is another small reroll of my series to implement a boundary-based
bitmap traversal algorithm that I worked on towards the end of 2021 with
Peff.

The algorithm is the same as in the last round, though I cleaned up a
few things throughout thanks to another very helpful review from Stolee.

One significant change from last time is that the new algorithm is
opt-in behind feature.experimental, giving users a way to try out the
new implementation while still being able to switch back to the
known-good traversal in case anything goes wrong.

Hopefully after a version or two we can safely remove the conditional
and drop the existing algorithm.

Thanks in advance for your review.

Taylor Blau (3):
  object: add object_array initializer helper function
  pack-bitmap.c: extract `fill_in_bitmap()`
  pack-bitmap.c: use commit boundary during bitmap traversal

 Documentation/config/feature.txt |   3 +
 Documentation/config/pack.txt    |  17 +++
 ci/run-build-and-tests.sh        |   1 +
 object.c                         |   6 +
 object.h                         |   2 +
 pack-bitmap.c                    | 241 ++++++++++++++++++++++++++-----
 pack-bitmap.h                    |   4 +
 repo-settings.c                  |   7 +-
 repository.h                     |   1 +
 t/README                         |   4 +
 t/t5310-pack-bitmaps.sh          |  38 +++++
 11 files changed, 284 insertions(+), 40 deletions(-)

Range-diff against v2:
-:  ---------- > 1:  c31508ac4a object: add object_array initializer helper function
1:  a3a1522439 = 2:  e7b30490da pack-bitmap.c: extract `fill_in_bitmap()`
2:  1993af00cb ! 3:  8a8f41e0c4 pack-bitmap.c: use commit boundary during bitmap traversal
    @@ Commit message
         the true number of objects in a similar fashion as the non-bitmap
         traversal):
     
    -        # (Computing objects unique to 'master' among all branches, not
    -        # using bitmaps).
    -        $ time git rev-list --count --objects master --not --exclude=master --branches
    -        43
    +        # (Computing the number of tagged objects not on any branches
    +        # without bitmaps).
    +        $ time git rev-list --count --objects --tags --not --branches
    +        20
     
    -        real        0m0.346s
    -        user        0m0.231s
    -        sys 0m0.115s
    +        real        0m1.388s
    +        user        0m1.092s
    +        sys 0m0.296s
     
    -        # (Computing the same uniqueness query using the old bitmap
    -        # traversal algorithm.)
    -        $ time git rev-list --use-bitmap-index --count --objects master --not --exclude=master --branches
    -        41
    +        # (Computing the same query using the old bitmap traversal).
    +        $ time git rev-list --count --objects --tags --not --branches --use-bitmap-index
    +        19
     
    -        real        0m20.007s
    -        user        0m19.045s
    -        sys 0m0.960s
    +        real        0m22.709s
    +        user        0m21.628s
    +        sys 0m1.076s
     
    -        # (Computing the same uniqueness query using the new bitmap
    -        # traversal algorithm.)
    -        $ time git.compile rev-list --use-bitmap-index --count --objects master --not --exclude=master --branches
    -        41
    +        # (this commit)
    +        $ time git.compile rev-list --count --objects --tags --not --branches --use-bitmap-index
    +        19
     
    -        real        0m1.748s
    -        user        0m1.428s
    -        sys 0m0.320s
    +        real        0m1.518s
    +        user        0m1.234s
    +        sys 0m0.284s
     
         The new algorithm is still slower than not using bitmaps at all, but it
    -    is nearly a 11-fold improvement over the existing traversal.
    +    is nearly a 15-fold improvement over the existing traversal.
     
         In a more realistic setting (using my local copy of git.git), I can
         observe a similar (if more modest) speed-up:
     
    -        $ ours="$(git branch --show-current)"
    -          argv="--count --objects $ours --not --exclude=$ours --branches"
    -          hyperfine \
    -            -n 'no bitmaps' "git.compile rev-list $argv" \
    -            -n 'existing bitmap traversal' "git rev-list --use-bitmap-index $argv" \
    -            -n 'this commit' "git.compile rev-list --use-bitmap-index $argv"
    +        $ argv="--count --objects --branches --not --tags"
    +        hyperfine \
    +          -n 'no bitmaps' "git.compile rev-list $argv" \
    +          -n 'existing traversal' "git.compile rev-list --use-bitmap-index $argv" \
    +          -n 'boundary traversal' "git.compile -c pack.useBitmapBoundaryTraversal=true rev-list --use-bitmap-index $argv"
             Benchmark 1: no bitmaps
    -          Time (mean ± σ):       5.8 ms ±   0.2 ms    [User: 3.3 ms, System: 2.4 ms]
    -          Range (min … max):     5.4 ms …   7.0 ms    409 runs
    +          Time (mean ± σ):     124.6 ms ±   2.1 ms    [User: 103.7 ms, System: 20.8 ms]
    +          Range (min … max):   122.6 ms … 133.1 ms    22 runs
     
    -        Benchmark 2: existing bitmap traversal
    -          Time (mean ± σ):      65.3 ms ±   0.6 ms    [User: 49.3 ms, System: 15.8 ms]
    -          Range (min … max):    64.1 ms …  66.9 ms    45 runs
    +        Benchmark 2: existing traversal
    +          Time (mean ± σ):     368.6 ms ±   3.0 ms    [User: 325.3 ms, System: 43.1 ms]
    +          Range (min … max):   365.1 ms … 374.8 ms    10 runs
     
    -        Benchmark 3: this commit
    -          Time (mean ± σ):      19.8 ms ±   0.3 ms    [User: 12.9 ms, System: 6.8 ms]
    -          Range (min … max):    19.1 ms …  20.8 ms    150 runs
    +        Benchmark 3: boundary traversal
    +          Time (mean ± σ):     167.6 ms ±   0.9 ms    [User: 139.5 ms, System: 27.9 ms]
    +          Range (min … max):   166.1 ms … 169.2 ms    17 runs
     
             Summary
               'no bitmaps' ran
    -            3.43 ± 0.14 times faster than 'this commit'
    -           11.34 ± 0.45 times faster than 'existing bitmap traversal'
    +            1.34 ± 0.02 times faster than 'boundary traversal'
    +            2.96 ± 0.05 times faster than 'existing traversal'
     
         Here, the new algorithm is also still slower than not using bitmaps, but
    -    represents a more than 3-fold improvement over the existing traversal in
    +    represents a more than 2-fold improvement over the existing traversal in
         a more modest example.
     
         Since this algorithm was originally written (nearly a year and a half
    @@ Commit message
         Helped-by: Derrick Stolee <derrickstolee@github.com>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    + ## Documentation/config/feature.txt ##
    +@@ Documentation/config/feature.txt: feature.experimental::
    + +
    + * `fetch.negotiationAlgorithm=skipping` may improve fetch negotiation times by
    + skipping more commits at a time, reducing the number of round trips.
    +++
    ++* `pack.useBitmapBoundaryTraversal=true` may improve bitmap traversal times by
    ++walking fewer objects.
    + 
    + feature.manyFiles::
    + 	Enable config options that optimize for repos with many files in the
    +
    + ## Documentation/config/pack.txt ##
    +@@ Documentation/config/pack.txt: pack.useBitmaps::
    + 	true. You should not generally need to turn this off unless
    + 	you are debugging pack bitmaps.
    + 
    ++pack.useBitmapBoundaryTraversal::
    ++	When true, Git will use an experimental algorithm for computing
    ++	reachability queries with bitmaps. Instead of building up
    ++	complete bitmaps for all of the negated tips and then OR-ing
    ++	them together, consider negated tips with existing bitmaps as
    ++	additive (i.e. OR-ing them into the result if they exist,
    ++	ignoring them otherwise), and build up a bitmap at the boundary
    ++	instead.
    +++
    ++When using this algorithm, Git may include too many objects as a result
    ++of not opening up trees belonging to certain UNINTERESTING commits. This
    ++inexactness matches the non-bitmap traversal algorithm.
    +++
    ++In many cases, this can provide a speed-up over the exact algorithm,
    ++particularly when there is poor bitmap coverage of the negated side of
    ++the query.
    ++
    + pack.useSparse::
    + 	When true, git will default to using the '--sparse' option in
    + 	'git pack-objects' when the '--revs' option is present. This
    +
    + ## ci/run-build-and-tests.sh ##
    +@@ ci/run-build-and-tests.sh: linux-TEST-vars)
    + 	export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
    + 	export GIT_TEST_NO_WRITE_REV_INDEX=1
    + 	export GIT_TEST_CHECKOUT_WORKERS=2
    ++	export GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=1
    + 	;;
    + linux-clang)
    + 	export GIT_TEST_DEFAULT_HASH=sha1
    +
      ## pack-bitmap.c ##
     @@ pack-bitmap.c: static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_git,
      	return base;
      }
      
    -+static int obj_in_bitmap(struct bitmap_index *bitmap_git,
    -+			 struct object *obj, struct bitmap *bitmap)
    -+{
    -+	int pos;
    -+
    -+	if (!bitmap)
    -+		return 0;
    -+	pos = bitmap_position(bitmap_git, &obj->oid);
    -+	if (pos < 0)
    -+		return 0;
    -+	return bitmap_get(bitmap, pos);
    -+}
    -+
     +struct bitmap_boundary_cb {
     +	struct bitmap_index *bitmap_git;
     +	struct bitmap *base;
    @@ pack-bitmap.c: static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_
     +		add_object_array(&commit->object, "", &data->boundary);
     +
     +	if (commit->object.flags & UNINTERESTING) {
    -+		if (obj_in_bitmap(data->bitmap_git, &commit->object,
    -+				  data->base))
    ++		if (bitmap_walk_contains(data->bitmap_git, data->base,
    ++					 &commit->object.oid))
     +			return;
     +
     +		add_commit_to_bitmap(data->bitmap_git, &data->base, commit);
    @@ pack-bitmap.c: static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_
     +					    struct rev_info *revs,
     +					    struct object_list *roots)
     +{
    -+	struct bitmap_boundary_cb cb = { .bitmap_git = bitmap_git };
    ++	struct bitmap_boundary_cb cb;
    ++	struct object_list *root;
     +	unsigned int i;
     +	unsigned int tmp_blobs, tmp_trees, tmp_tags;
     +	int any_missing = 0;
     +
    ++	cb.bitmap_git = bitmap_git;
    ++	cb.base = bitmap_new();
    ++	object_array_init(&cb.boundary);
    ++
     +	revs->ignore_missing_links = 1;
     +
     +	/*
    -+	 * OR in any existing reachability bitmaps among `roots` into `base`.
    ++	 * OR in any existing reachability bitmaps among `roots` into
    ++	 * `cb.base`.
     +	 */
    -+	while (roots) {
    -+		struct object *object = roots->item;
    -+		roots = roots->next;
    ++	for (root = roots; root; root = root->next) {
    ++		struct object *object = root->item;
    ++		if (object->type != OBJ_COMMIT ||
    ++		    bitmap_walk_contains(bitmap_git, cb.base, &object->oid))
    ++			continue;
     +
    -+		if (object->type == OBJ_COMMIT &&
    -+		    !obj_in_bitmap(bitmap_git, object, cb.base) &&
    -+		    add_commit_to_bitmap(bitmap_git, &cb.base,
    -+					 (struct commit *)object)) {
    -+			object->flags |= SEEN;
    ++		if (add_commit_to_bitmap(bitmap_git, &cb.base,
    ++					 (struct commit *)object))
     +			continue;
    -+		}
     +
     +		any_missing = 1;
     +	}
    @@ pack-bitmap.c: static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_
     +				      show_boundary_object,
     +				      &cb, NULL);
     +	revs->boundary = 0;
    ++	trace2_region_leave("pack-bitmap", "boundary-traverse", the_repository);
    ++
     +	revs->blob_objects = tmp_blobs;
     +	revs->tree_objects = tmp_trees;
     +	revs->tag_objects = tmp_tags;
     +
     +	reset_revision_walk();
     +	clear_object_flags(UNINTERESTING);
    -+	trace2_region_leave("pack-bitmap", "boundary-traverse", the_repository);
     +
     +	/*
     +	 * Then add the boundary commit(s) as fill-in traversal tips.
     +	 */
     +	trace2_region_enter("pack-bitmap", "boundary-fill-in", the_repository);
    -+	if (cb.boundary.nr) {
    -+		struct object *obj;
    -+		int needs_walk = 0;
    -+
    -+		for (i = 0; i < cb.boundary.nr; i++) {
    -+			obj = cb.boundary.objects[i].item;
    -+
    -+			if (obj_in_bitmap(bitmap_git, obj, cb.base)) {
    -+				obj->flags |= SEEN;
    -+			} else {
    -+				add_pending_object(revs, obj, "");
    -+				needs_walk = 1;
    -+			}
    -+		}
    -+
    -+		if (needs_walk)
    -+			cb.base = fill_in_bitmap(bitmap_git, revs, cb.base, NULL);
    ++	for (i = 0; i < cb.boundary.nr; i++) {
    ++		struct object *obj = cb.boundary.objects[i].item;
    ++		if (bitmap_walk_contains(bitmap_git, cb.base, &obj->oid))
    ++			obj->flags |= SEEN;
    ++		else
    ++			add_pending_object(revs, obj, "");
     +	}
    ++	if (revs->pending.nr)
    ++		cb.base = fill_in_bitmap(bitmap_git, revs, cb.base, NULL);
     +	trace2_region_leave("pack-bitmap", "boundary-fill-in", the_repository);
     +
    -+
     +cleanup:
    ++	object_array_clear(&cb.boundary);
     +	revs->ignore_missing_links = 0;
     +
     +	return cb.base;
    @@ pack-bitmap.c: static struct bitmap *find_objects(struct bitmap_index *bitmap_gi
      
      	return base;
      }
    -@@ pack-bitmap.c: static void show_objects_for_type(
    - 	}
    - }
    +@@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
    + 					 int filter_provided_objects)
    + {
    + 	unsigned int i;
    ++	int use_boundary_traversal;
      
    --static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
    --			     struct object_list *roots)
    --{
    --	while (roots) {
    --		struct object *object = roots->item;
    --		roots = roots->next;
    --
    --		if (bitmap_is_midx(bitmap_git)) {
    --			if (bsearch_midx(&object->oid, bitmap_git->midx, NULL))
    --				return 1;
    --		} else {
    --			if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
    --				return 1;
    --		}
    --	}
    --
    --	return 0;
    --}
    --
    - static struct bitmap *find_tip_objects(struct bitmap_index *bitmap_git,
    - 				       struct object_list *tip_objects,
    - 				       enum object_type type)
    + 	struct object_list *wants = NULL;
    + 	struct object_list *haves = NULL;
     @@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
      			object_list_insert(object, &wants);
      	}
    @@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
     -	 */
     -	if (haves && !in_bitmapped_pack(bitmap_git, haves))
     -		goto cleanup;
    --
    ++	use_boundary_traversal = git_env_bool(GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL, -1);
    ++	if (use_boundary_traversal < 0) {
    ++		prepare_repo_settings(revs->repo);
    ++		use_boundary_traversal = revs->repo->settings.pack_use_bitmap_boundary_traversal;
    ++	}
    ++
    ++	if (!use_boundary_traversal) {
    ++		/*
    ++		 * if we have a HAVES list, but none of those haves is contained
    ++		 * in the packfile that has a bitmap, we don't have anything to
    ++		 * optimize here
    ++		 */
    ++		if (haves && !in_bitmapped_pack(bitmap_git, haves))
    ++			goto cleanup;
    ++	}
    + 
      	/* if we don't want anything, we're done here */
      	if (!wants)
    - 		goto cleanup;
     @@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
      	if (load_bitmap(revs->repo, bitmap_git) < 0)
      		goto cleanup;
      
     -	object_array_clear(&revs->pending);
    --
    ++	if (!use_boundary_traversal)
    ++		object_array_clear(&revs->pending);
    + 
      	if (haves) {
     -		revs->ignore_missing_links = 1;
     -		haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
     -		reset_revision_walk();
     -		revs->ignore_missing_links = 0;
    --
    -+		haves_bitmap = find_boundary_objects(bitmap_git, revs, haves);
    ++		if (use_boundary_traversal) {
    ++			trace2_region_enter("pack-bitmap", "haves/boundary", the_repository);
    ++			haves_bitmap = find_boundary_objects(bitmap_git, revs, haves);
    ++			trace2_region_leave("pack-bitmap", "haves/boundary", the_repository);
    ++		} else {
    ++			trace2_region_enter("pack-bitmap", "haves/classic", the_repository);
    ++			revs->ignore_missing_links = 1;
    ++			haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
    ++			reset_revision_walk();
    ++			revs->ignore_missing_links = 0;
    ++			trace2_region_leave("pack-bitmap", "haves/classic", the_repository);
    ++		}
    + 
      		if (!haves_bitmap)
      			BUG("failed to perform bitmap walk");
      	}
      
    -+	object_array_clear(&revs->pending);
    -+	reset_revision_walk();
    ++	if (use_boundary_traversal) {
    ++		object_array_clear(&revs->pending);
    ++		reset_revision_walk();
    ++	}
     +
      	wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap);
      
      	if (!wants_bitmap)
    +
    + ## pack-bitmap.h ##
    +@@ pack-bitmap.h: void traverse_bitmap_commit_list(struct bitmap_index *,
    + void test_bitmap_walk(struct rev_info *revs);
    + int test_bitmap_commits(struct repository *r);
    + int test_bitmap_hashes(struct repository *r);
    ++
    ++#define GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL \
    ++	"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL"
    ++
    + struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
    + 					 int filter_provided_objects);
    + uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
    +
    + ## repo-settings.c ##
    +@@ repo-settings.c: void prepare_repo_settings(struct repository *r)
    + 	repo_cfg_bool(r, "feature.experimental", &experimental, 0);
    + 
    + 	/* Defaults modified by feature.* */
    +-	if (experimental)
    ++	if (experimental) {
    + 		r->settings.fetch_negotiation_algorithm = FETCH_NEGOTIATION_SKIPPING;
    ++		r->settings.pack_use_bitmap_boundary_traversal = 1;
    ++	}
    + 	if (manyfiles) {
    + 		r->settings.index_version = 4;
    + 		r->settings.index_skip_hash = 1;
    +@@ repo-settings.c: void prepare_repo_settings(struct repository *r)
    + 	repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
    + 	repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
    + 	repo_cfg_bool(r, "pack.readreverseindex", &r->settings.pack_read_reverse_index, 1);
    ++	repo_cfg_bool(r, "pack.usebitmapboundarytraversal",
    ++		      &r->settings.pack_use_bitmap_boundary_traversal,
    ++		      r->settings.pack_use_bitmap_boundary_traversal);
    + 
    + 	/*
    + 	 * The GIT_TEST_MULTI_PACK_INDEX variable is special in that
    +
    + ## repository.h ##
    +@@ repository.h: struct repo_settings {
    + 	int command_requires_full_index;
    + 	int sparse_index;
    + 	int pack_read_reverse_index;
    ++	int pack_use_bitmap_boundary_traversal;
    + 
    + 	struct fsmonitor_settings *fsmonitor; /* lazily loaded */
    + 
    +
    + ## t/README ##
    +@@ t/README: GIT_TEST_INDEX_VERSION=<n> exercises the index read/write code path
    + for the index version specified.  Can be set to any valid version
    + (currently 2, 3, or 4).
    + 
    ++GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=<boolean> if enabled will
    ++use the boundary-based bitmap traversal algorithm. See the documentation
    ++of `pack.useBitmapBoundaryTraversal` for more details.
    ++
    + GIT_TEST_PACK_SPARSE=<boolean> if disabled will default the pack-objects
    + builtin to use the non-sparse object walk. This can still be overridden by
    + the --sparse command-line argument.
    +
    + ## t/t5310-pack-bitmaps.sh ##
    +@@ t/t5310-pack-bitmaps.sh: test_description='exercise basic bitmap functionality'
    + # their place.
    + GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0
    + 
    ++# Likewise, allow individual tests to control whether or not they use
    ++# the boundary-based traversal.
    ++sane_unset GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL
    ++
    + objpath () {
    + 	echo ".git/objects/$(echo "$1" | sed -e 's|\(..\)|\1/|')"
    + }
    +@@ t/t5310-pack-bitmaps.sh: test_bitmap_cases () {
    + 
    + test_bitmap_cases
    + 
    ++GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=1
    ++export GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL
    ++
    ++test_bitmap_cases
    ++
    ++sane_unset GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL
    ++
    + test_expect_success 'incremental repack fails when bitmaps are requested' '
    + 	test_commit more-1 &&
    + 	test_must_fail git repack -d 2>err &&
    +@@ t/t5310-pack-bitmaps.sh: test_expect_success 'incremental repack can disable bitmaps' '
    + 	git repack -d --no-write-bitmap-index
    + '
    + 
    ++test_expect_success 'boundary-based traversal is used when requested' '
    ++	git repack -a -d --write-bitmap-index &&
    ++
    ++	for argv in \
    ++		"git -c pack.useBitmapBoundaryTraversal=true" \
    ++		"git -c feature.experimental=true" \
    ++		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=1 git"
    ++	do
    ++		eval "GIT_TRACE2_EVENT=1 $argv rev-list --objects \
    ++			--use-bitmap-index second..other 2>perf" &&
    ++		grep "\"region_enter\".*\"label\":\"haves/boundary\"" perf ||
    ++			return 1
    ++	done &&
    ++
    ++	for argv in \
    ++		"git -c pack.useBitmapBoundaryTraversal=false" \
    ++		"git -c feature.experimental=true -c pack.useBitmapBoundaryTraversal=false" \
    ++		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=0 git -c pack.useBitmapBoundaryTraversal=true" \
    ++		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=0 git -c feature.experimental=true"
    ++	do
    ++		eval "GIT_TRACE2_EVENT=1 $argv rev-list --objects \
    ++			--use-bitmap-index second..other 2>perf" &&
    ++		grep "\"region_enter\".*\"label\":\"haves/classic\"" perf ||
    ++			return 1
    ++	done
    ++'
    ++
    + test_bitmap_cases "pack.writeBitmapLookupTable"
    + 
    + test_expect_success 'verify writing bitmap lookup table when enabled' '
-- 
2.40.1.479.g8f2b93ade7

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v3 1/3] object: add object_array initializer helper function
  2023-05-08 17:38 ` [PATCH v3 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
@ 2023-05-08 17:38   ` Taylor Blau
  2023-05-08 17:38   ` [PATCH v3 2/3] pack-bitmap.c: extract `fill_in_bitmap()` Taylor Blau
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 51+ messages in thread
From: Taylor Blau @ 2023-05-08 17:38 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Derrick Stolee, Junio C Hamano

The object_array API has an OBJECT_ARRAY_INIT macro, but lacks a
function to initialize an object_array at a given location in memory.

Introduce `object_array_init()` to implement such a function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object.c | 6 ++++++
 object.h | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/object.c b/object.c
index 6d4ef1524d..f1adb458b6 100644
--- a/object.c
+++ b/object.c
@@ -356,6 +356,12 @@ void object_list_free(struct object_list **list)
  */
 static char object_array_slopbuf[1];
 
+void object_array_init(struct object_array *array)
+{
+	struct object_array blank = OBJECT_ARRAY_INIT;
+	memcpy(array, &blank, sizeof(*array));
+}
+
 void add_object_array_with_path(struct object *obj, const char *name,
 				struct object_array *array,
 				unsigned mode, const char *path)
diff --git a/object.h b/object.h
index 96e52e24fb..c335435f9c 100644
--- a/object.h
+++ b/object.h
@@ -57,6 +57,8 @@ struct object_array {
 
 #define OBJECT_ARRAY_INIT { 0 }
 
+void object_array_init(struct object_array *array);
+
 /*
  * object flag allocation:
  * revision.h:               0---------10         15             23------27
-- 
2.40.1.479.g8f2b93ade7


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 2/3] pack-bitmap.c: extract `fill_in_bitmap()`
  2023-05-08 17:38 ` [PATCH v3 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
  2023-05-08 17:38   ` [PATCH v3 1/3] object: add object_array initializer helper function Taylor Blau
@ 2023-05-08 17:38   ` Taylor Blau
  2023-05-08 17:38   ` [PATCH v3 3/3] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
  2023-05-10 22:55   ` [PATCH v3 0/3] pack-bitmap: boundary-based " Junio C Hamano
  3 siblings, 0 replies; 51+ messages in thread
From: Taylor Blau @ 2023-05-08 17:38 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Derrick Stolee, Junio C Hamano

To prepare for the boundary-based bitmap walk to perform a fill-in
traversal using the boundary of either side as the tips, extract routine
used to perform fill-in traversal by `find_objects()` so that it can be
used in both places.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 65 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 36 insertions(+), 29 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index e0fad723bf..5d2cc6ac96 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1043,6 +1043,40 @@ static int add_commit_to_bitmap(struct bitmap_index *bitmap_git,
 	return 1;
 }
 
+static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_git,
+				     struct rev_info *revs,
+				     struct bitmap *base,
+				     struct bitmap *seen)
+{
+	struct include_data incdata;
+	struct bitmap_show_data show_data;
+
+	if (!base)
+		base = bitmap_new();
+
+	incdata.bitmap_git = bitmap_git;
+	incdata.base = base;
+	incdata.seen = seen;
+
+	revs->include_check = should_include;
+	revs->include_check_obj = should_include_obj;
+	revs->include_check_data = &incdata;
+
+	if (prepare_revision_walk(revs))
+		die(_("revision walk setup failed"));
+
+	show_data.bitmap_git = bitmap_git;
+	show_data.base = base;
+
+	traverse_commit_list(revs, show_commit, show_object, &show_data);
+
+	revs->include_check = NULL;
+	revs->include_check_obj = NULL;
+	revs->include_check_data = NULL;
+
+	return base;
+}
+
 static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 				   struct rev_info *revs,
 				   struct object_list *roots,
@@ -1108,35 +1142,8 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 		}
 	}
 
-	if (needs_walk) {
-		struct include_data incdata;
-		struct bitmap_show_data show_data;
-
-		if (!base)
-			base = bitmap_new();
-
-		incdata.bitmap_git = bitmap_git;
-		incdata.base = base;
-		incdata.seen = seen;
-
-		revs->include_check = should_include;
-		revs->include_check_obj = should_include_obj;
-		revs->include_check_data = &incdata;
-
-		if (prepare_revision_walk(revs))
-			die(_("revision walk setup failed"));
-
-		show_data.bitmap_git = bitmap_git;
-		show_data.base = base;
-
-		traverse_commit_list(revs,
-				     show_commit, show_object,
-				     &show_data);
-
-		revs->include_check = NULL;
-		revs->include_check_obj = NULL;
-		revs->include_check_data = NULL;
-	}
+	if (needs_walk)
+		base = fill_in_bitmap(bitmap_git, revs, base, seen);
 
 	return base;
 }
-- 
2.40.1.479.g8f2b93ade7


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 3/3] pack-bitmap.c: use commit boundary during bitmap traversal
  2023-05-08 17:38 ` [PATCH v3 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
  2023-05-08 17:38   ` [PATCH v3 1/3] object: add object_array initializer helper function Taylor Blau
  2023-05-08 17:38   ` [PATCH v3 2/3] pack-bitmap.c: extract `fill_in_bitmap()` Taylor Blau
@ 2023-05-08 17:38   ` Taylor Blau
  2023-05-08 20:53     ` Derrick Stolee
  2023-05-10 22:55   ` [PATCH v3 0/3] pack-bitmap: boundary-based " Junio C Hamano
  3 siblings, 1 reply; 51+ messages in thread
From: Taylor Blau @ 2023-05-08 17:38 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Derrick Stolee, Junio C Hamano

When reachability bitmap coverage exists in a repository, Git will use a
different (and hopefully faster) traversal to compute revision walks.

Consider a set of positive and negative tips (which we'll refer to with
their standard bitmap parlance by "wants", and "haves"). In order to
figure out what objects exist between the tips, the existing traversal
in `prepare_bitmap_walk()` does something like:

  1. Consider if we can even compute the set of objects with bitmaps,
     and fall back to the usual traversal if we cannot. For example,
     pathspec limiting traversals can't be computed using bitmaps (since
     they don't know which objects are at which paths). The same is true
     of certain kinds of non-trivial object filters.

  2. If we can compute the traversal with bitmaps, partition the
     (dereferenced) tips into two object lists, "haves", and "wants",
     based on whether or not the objects have the UNINTERESTING flag,
     respectively.

  3. Fall back to the ordinary object traversal if either (a) there are
     more than zero haves, none of which are in the bitmapped pack or
     MIDX, or (b) there are no wants.

  4. Construct a reachability bitmap for the "haves" side by walking
     from the revision tips down to any existing bitmaps, OR-ing in any
     bitmaps as they are found.

  5. Then do the same for the "wants" side, stopping at any objects that
     appear in the "haves" bitmap.

  6. Filter the results if any object filter (that can be easily
     computed with bitmaps alone) was given, and then return back to the
     caller.

When there is good bitmap coverage relative to the traversal tips, this
walk is often significantly faster than an ordinary object traversal
because it can visit far fewer objects.

But in certain cases, it can be significantly *slower* than the usual
object traversal. Why? Because we need to compute complete bitmaps on
either side of the walk. If either one (or both) of the sides require
walking many (or all!) objects before they get to an existing bitmap,
the extra bitmap machinery is mostly or all overhead.

One of the benefits, however, is that even if the walk is slower, bitmap
traversals are guaranteed to provide an *exact* answer. Unlike the
traditional object traversal algorithm, which can over-count the results
by not opening trees for older commits, the bitmap walk builds an exact
reachability bitmap for either side, meaning the results are never
over-counted.

But producing non-exact results is OK for our traversal here (both in
the bitmap case and not), as long as the results are over-counted, not
under.

Relaxing the bitmap traversal to allow it to produce over-counted
results gives us the opportunity to make some significant improvements.
Instead of the above, the new algorithm only has to walk from the
*boundary* down to the nearest bitmap, instead of from each of the
UNINTERESTING tips.

The boundary-based approach still has degenerate cases, but we'll show
in a moment that it is often a significant improvement.

The new algorithm works as follows:

  1. Build a (partial) bitmap of the haves side by first OR-ing any
     bitmap(s) that already exist for UNINTERESTING commits between the
     haves and the boundary.

  2. For each commit along the boundary, add it as a fill-in traversal
     tip (where the traversal terminates once an existing bitmap is
     found), and perform fill-in traversal.

  3. Build up a complete bitmap of the wants side as usual, stopping any
     time we intersect the (partial) haves side.

  4. Return the results.

And is more-or-less equivalent to using the *old* algorithm with this
invocation:

    $ git rev-list --objects --use-bitmap-index $WANTS --not \
        $(git rev-list --objects --boundary $WANTS --not $HAVES |
          perl -lne 'print $1 if /^-(.*)/')

The new result performs significantly better in many cases, particularly
when the distance from the boundary commit(s) to an existing bitmap is
shorter than the distance from (all of) the have tips to the nearest
bitmapped commit.

Note that when using the old bitmap traversal algorithm, the results can
be *slower* than without bitmaps! Under the new algorithm, the result is
computed faster with bitmaps than without (at the cost of over-counting
the true number of objects in a similar fashion as the non-bitmap
traversal):

    # (Computing the number of tagged objects not on any branches
    # without bitmaps).
    $ time git rev-list --count --objects --tags --not --branches
    20

    real	0m1.388s
    user	0m1.092s
    sys	0m0.296s

    # (Computing the same query using the old bitmap traversal).
    $ time git rev-list --count --objects --tags --not --branches --use-bitmap-index
    19

    real	0m22.709s
    user	0m21.628s
    sys	0m1.076s

    # (this commit)
    $ time git.compile rev-list --count --objects --tags --not --branches --use-bitmap-index
    19

    real	0m1.518s
    user	0m1.234s
    sys	0m0.284s

The new algorithm is still slower than not using bitmaps at all, but it
is nearly a 15-fold improvement over the existing traversal.

In a more realistic setting (using my local copy of git.git), I can
observe a similar (if more modest) speed-up:

    $ argv="--count --objects --branches --not --tags"
    hyperfine \
      -n 'no bitmaps' "git.compile rev-list $argv" \
      -n 'existing traversal' "git.compile rev-list --use-bitmap-index $argv" \
      -n 'boundary traversal' "git.compile -c pack.useBitmapBoundaryTraversal=true rev-list --use-bitmap-index $argv"
    Benchmark 1: no bitmaps
      Time (mean ± σ):     124.6 ms ±   2.1 ms    [User: 103.7 ms, System: 20.8 ms]
      Range (min … max):   122.6 ms … 133.1 ms    22 runs

    Benchmark 2: existing traversal
      Time (mean ± σ):     368.6 ms ±   3.0 ms    [User: 325.3 ms, System: 43.1 ms]
      Range (min … max):   365.1 ms … 374.8 ms    10 runs

    Benchmark 3: boundary traversal
      Time (mean ± σ):     167.6 ms ±   0.9 ms    [User: 139.5 ms, System: 27.9 ms]
      Range (min … max):   166.1 ms … 169.2 ms    17 runs

    Summary
      'no bitmaps' ran
        1.34 ± 0.02 times faster than 'boundary traversal'
        2.96 ± 0.05 times faster than 'existing traversal'

Here, the new algorithm is also still slower than not using bitmaps, but
represents a more than 2-fold improvement over the existing traversal in
a more modest example.

Since this algorithm was originally written (nearly a year and a half
ago, at the time of writing), the bitmap lookup table shipped, making
the new algorithm's result more competitive. A few other future
directions for improving bitmap traversal times beyond not using bitmaps
at all:

  - Decrease the cost to decompress and OR together many bitmaps
    together (particularly when enumerating the uninteresting side of
    the walk). Here we could explore more efficient bitmap storage
    techniques, like Roaring+Run and/or use SIMD instructions to speed
    up ORing them together.

  - Store pseudo-merge bitmaps, which could allow us to OR together
    fewer "summary" bitmaps (which would also help with the above).

Helped-by: Jeff King <peff@peff.net>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/feature.txt |   3 +
 Documentation/config/pack.txt    |  17 +++
 ci/run-build-and-tests.sh        |   1 +
 pack-bitmap.c                    | 182 ++++++++++++++++++++++++++++---
 pack-bitmap.h                    |   4 +
 repo-settings.c                  |   7 +-
 repository.h                     |   1 +
 t/README                         |   4 +
 t/t5310-pack-bitmaps.sh          |  38 +++++++
 9 files changed, 243 insertions(+), 14 deletions(-)

diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt
index 17b4d39f89..bf9546fca4 100644
--- a/Documentation/config/feature.txt
+++ b/Documentation/config/feature.txt
@@ -14,6 +14,9 @@ feature.experimental::
 +
 * `fetch.negotiationAlgorithm=skipping` may improve fetch negotiation times by
 skipping more commits at a time, reducing the number of round trips.
++
+* `pack.useBitmapBoundaryTraversal=true` may improve bitmap traversal times by
+walking fewer objects.
 
 feature.manyFiles::
 	Enable config options that optimize for repos with many files in the
diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index d4c7c9d4e4..3748136d14 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -123,6 +123,23 @@ pack.useBitmaps::
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
 
+pack.useBitmapBoundaryTraversal::
+	When true, Git will use an experimental algorithm for computing
+	reachability queries with bitmaps. Instead of building up
+	complete bitmaps for all of the negated tips and then OR-ing
+	them together, consider negated tips with existing bitmaps as
+	additive (i.e. OR-ing them into the result if they exist,
+	ignoring them otherwise), and build up a bitmap at the boundary
+	instead.
++
+When using this algorithm, Git may include too many objects as a result
+of not opening up trees belonging to certain UNINTERESTING commits. This
+inexactness matches the non-bitmap traversal algorithm.
++
+In many cases, this can provide a speed-up over the exact algorithm,
+particularly when there is poor bitmap coverage of the negated side of
+the query.
+
 pack.useSparse::
 	When true, git will default to using the '--sparse' option in
 	'git pack-objects' when the '--revs' option is present. This
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index a18b13a41d..2528f25e31 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -29,6 +29,7 @@ linux-TEST-vars)
 	export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
 	export GIT_TEST_NO_WRITE_REV_INDEX=1
 	export GIT_TEST_CHECKOUT_WORKERS=2
+	export GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=1
 	;;
 linux-clang)
 	export GIT_TEST_DEFAULT_HASH=sha1
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 5d2cc6ac96..e8a1579b16 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1077,6 +1077,126 @@ static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_git,
 	return base;
 }
 
+struct bitmap_boundary_cb {
+	struct bitmap_index *bitmap_git;
+	struct bitmap *base;
+
+	struct object_array boundary;
+};
+
+static void show_boundary_commit(struct commit *commit, void *_data)
+{
+	struct bitmap_boundary_cb *data = _data;
+
+	if (commit->object.flags & BOUNDARY)
+		add_object_array(&commit->object, "", &data->boundary);
+
+	if (commit->object.flags & UNINTERESTING) {
+		if (bitmap_walk_contains(data->bitmap_git, data->base,
+					 &commit->object.oid))
+			return;
+
+		add_commit_to_bitmap(data->bitmap_git, &data->base, commit);
+	}
+}
+
+static void show_boundary_object(struct object *object,
+				 const char *name, void *data)
+{
+	BUG("should not be called");
+}
+
+static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git,
+					    struct rev_info *revs,
+					    struct object_list *roots)
+{
+	struct bitmap_boundary_cb cb;
+	struct object_list *root;
+	unsigned int i;
+	unsigned int tmp_blobs, tmp_trees, tmp_tags;
+	int any_missing = 0;
+
+	cb.bitmap_git = bitmap_git;
+	cb.base = bitmap_new();
+	object_array_init(&cb.boundary);
+
+	revs->ignore_missing_links = 1;
+
+	/*
+	 * OR in any existing reachability bitmaps among `roots` into
+	 * `cb.base`.
+	 */
+	for (root = roots; root; root = root->next) {
+		struct object *object = root->item;
+		if (object->type != OBJ_COMMIT ||
+		    bitmap_walk_contains(bitmap_git, cb.base, &object->oid))
+			continue;
+
+		if (add_commit_to_bitmap(bitmap_git, &cb.base,
+					 (struct commit *)object))
+			continue;
+
+		any_missing = 1;
+	}
+
+	if (!any_missing)
+		goto cleanup;
+
+	tmp_blobs = revs->blob_objects;
+	tmp_trees = revs->tree_objects;
+	tmp_tags = revs->blob_objects;
+	revs->blob_objects = 0;
+	revs->tree_objects = 0;
+	revs->tag_objects = 0;
+
+	/*
+	 * We didn't have complete coverage of the roots. First setup a
+	 * revision walk to (a) OR in any bitmaps that are UNINTERESTING
+	 * between the tips and boundary, and (b) record the boundary.
+	 */
+	trace2_region_enter("pack-bitmap", "boundary-prepare", the_repository);
+	if (prepare_revision_walk(revs))
+		die("revision walk setup failed");
+	trace2_region_leave("pack-bitmap", "boundary-prepare", the_repository);
+
+	trace2_region_enter("pack-bitmap", "boundary-traverse", the_repository);
+	revs->boundary = 1;
+	traverse_commit_list_filtered(revs,
+				      show_boundary_commit,
+				      show_boundary_object,
+				      &cb, NULL);
+	revs->boundary = 0;
+	trace2_region_leave("pack-bitmap", "boundary-traverse", the_repository);
+
+	revs->blob_objects = tmp_blobs;
+	revs->tree_objects = tmp_trees;
+	revs->tag_objects = tmp_tags;
+
+	reset_revision_walk();
+	clear_object_flags(UNINTERESTING);
+
+	/*
+	 * Then add the boundary commit(s) as fill-in traversal tips.
+	 */
+	trace2_region_enter("pack-bitmap", "boundary-fill-in", the_repository);
+	for (i = 0; i < cb.boundary.nr; i++) {
+		struct object *obj = cb.boundary.objects[i].item;
+		if (bitmap_walk_contains(bitmap_git, cb.base, &obj->oid))
+			obj->flags |= SEEN;
+		else
+			add_pending_object(revs, obj, "");
+	}
+	if (revs->pending.nr)
+		cb.base = fill_in_bitmap(bitmap_git, revs, cb.base, NULL);
+	trace2_region_leave("pack-bitmap", "boundary-fill-in", the_repository);
+
+cleanup:
+	object_array_clear(&cb.boundary);
+	revs->ignore_missing_links = 0;
+
+	return cb.base;
+}
+
 static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 				   struct rev_info *revs,
 				   struct object_list *roots,
@@ -1142,8 +1262,21 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 		}
 	}
 
-	if (needs_walk)
+	if (needs_walk) {
+		/*
+		 * This fill-in traversal may walk over some objects
+		 * again, since we have already traversed in order to
+		 * find the boundary.
+		 *
+		 * But this extra walk should be extremely cheap, since
+		 * all commit objects are loaded into memory, and
+		 * because we skip walking to parents that are
+		 * UNINTERESTING, since it will be marked in the haves
+		 * bitmap already (or it has an on-disk bitmap, since
+		 * OR-ing it in covers all of its ancestors).
+		 */
 		base = fill_in_bitmap(bitmap_git, revs, base, seen);
+	}
 
 	return base;
 }
@@ -1535,6 +1668,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 int filter_provided_objects)
 {
 	unsigned int i;
+	int use_boundary_traversal;
 
 	struct object_list *wants = NULL;
 	struct object_list *haves = NULL;
@@ -1585,13 +1719,21 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 			object_list_insert(object, &wants);
 	}
 
-	/*
-	 * if we have a HAVES list, but none of those haves is contained
-	 * in the packfile that has a bitmap, we don't have anything to
-	 * optimize here
-	 */
-	if (haves && !in_bitmapped_pack(bitmap_git, haves))
-		goto cleanup;
+	use_boundary_traversal = git_env_bool(GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL, -1);
+	if (use_boundary_traversal < 0) {
+		prepare_repo_settings(revs->repo);
+		use_boundary_traversal = revs->repo->settings.pack_use_bitmap_boundary_traversal;
+	}
+
+	if (!use_boundary_traversal) {
+		/*
+		 * if we have a HAVES list, but none of those haves is contained
+		 * in the packfile that has a bitmap, we don't have anything to
+		 * optimize here
+		 */
+		if (haves && !in_bitmapped_pack(bitmap_git, haves))
+			goto cleanup;
+	}
 
 	/* if we don't want anything, we're done here */
 	if (!wants)
@@ -1605,18 +1747,32 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 	if (load_bitmap(revs->repo, bitmap_git) < 0)
 		goto cleanup;
 
-	object_array_clear(&revs->pending);
+	if (!use_boundary_traversal)
+		object_array_clear(&revs->pending);
 
 	if (haves) {
-		revs->ignore_missing_links = 1;
-		haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
-		reset_revision_walk();
-		revs->ignore_missing_links = 0;
+		if (use_boundary_traversal) {
+			trace2_region_enter("pack-bitmap", "haves/boundary", the_repository);
+			haves_bitmap = find_boundary_objects(bitmap_git, revs, haves);
+			trace2_region_leave("pack-bitmap", "haves/boundary", the_repository);
+		} else {
+			trace2_region_enter("pack-bitmap", "haves/classic", the_repository);
+			revs->ignore_missing_links = 1;
+			haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
+			reset_revision_walk();
+			revs->ignore_missing_links = 0;
+			trace2_region_leave("pack-bitmap", "haves/classic", the_repository);
+		}
 
 		if (!haves_bitmap)
 			BUG("failed to perform bitmap walk");
 	}
 
+	if (use_boundary_traversal) {
+		object_array_clear(&revs->pending);
+		reset_revision_walk();
+	}
+
 	wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap);
 
 	if (!wants_bitmap)
diff --git a/pack-bitmap.h b/pack-bitmap.h
index f0180b5276..daa72b6331 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -62,6 +62,10 @@ void traverse_bitmap_commit_list(struct bitmap_index *,
 void test_bitmap_walk(struct rev_info *revs);
 int test_bitmap_commits(struct repository *r);
 int test_bitmap_hashes(struct repository *r);
+
+#define GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL \
+	"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL"
+
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 int filter_provided_objects);
 uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
diff --git a/repo-settings.c b/repo-settings.c
index d220c5dd9f..7b566d729d 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -41,8 +41,10 @@ void prepare_repo_settings(struct repository *r)
 	repo_cfg_bool(r, "feature.experimental", &experimental, 0);
 
 	/* Defaults modified by feature.* */
-	if (experimental)
+	if (experimental) {
 		r->settings.fetch_negotiation_algorithm = FETCH_NEGOTIATION_SKIPPING;
+		r->settings.pack_use_bitmap_boundary_traversal = 1;
+	}
 	if (manyfiles) {
 		r->settings.index_version = 4;
 		r->settings.index_skip_hash = 1;
@@ -62,6 +64,9 @@ void prepare_repo_settings(struct repository *r)
 	repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
 	repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
 	repo_cfg_bool(r, "pack.readreverseindex", &r->settings.pack_read_reverse_index, 1);
+	repo_cfg_bool(r, "pack.usebitmapboundarytraversal",
+		      &r->settings.pack_use_bitmap_boundary_traversal,
+		      r->settings.pack_use_bitmap_boundary_traversal);
 
 	/*
 	 * The GIT_TEST_MULTI_PACK_INDEX variable is special in that
diff --git a/repository.h b/repository.h
index 1a13ff2867..c42f7ab6bd 100644
--- a/repository.h
+++ b/repository.h
@@ -37,6 +37,7 @@ struct repo_settings {
 	int command_requires_full_index;
 	int sparse_index;
 	int pack_read_reverse_index;
+	int pack_use_bitmap_boundary_traversal;
 
 	struct fsmonitor_settings *fsmonitor; /* lazily loaded */
 
diff --git a/t/README b/t/README
index bdfac4cceb..b71a065e4a 100644
--- a/t/README
+++ b/t/README
@@ -442,6 +442,10 @@ GIT_TEST_INDEX_VERSION=<n> exercises the index read/write code path
 for the index version specified.  Can be set to any valid version
 (currently 2, 3, or 4).
 
+GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=<boolean> if enabled will
+use the boundary-based bitmap traversal algorithm. See the documentation
+of `pack.useBitmapBoundaryTraversal` for more details.
+
 GIT_TEST_PACK_SPARSE=<boolean> if disabled will default the pack-objects
 builtin to use the non-sparse object walk. This can still be overridden by
 the --sparse command-line argument.
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 526a5a506e..78c1c6c923 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -9,6 +9,10 @@ test_description='exercise basic bitmap functionality'
 # their place.
 GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0
 
+# Likewise, allow individual tests to control whether or not they use
+# the boundary-based traversal.
+sane_unset GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL
+
 objpath () {
 	echo ".git/objects/$(echo "$1" | sed -e 's|\(..\)|\1/|')"
 }
@@ -457,6 +461,13 @@ test_bitmap_cases () {
 
 test_bitmap_cases
 
+GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=1
+export GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL
+
+test_bitmap_cases
+
+sane_unset GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL
+
 test_expect_success 'incremental repack fails when bitmaps are requested' '
 	test_commit more-1 &&
 	test_must_fail git repack -d 2>err &&
@@ -468,6 +479,33 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'boundary-based traversal is used when requested' '
+	git repack -a -d --write-bitmap-index &&
+
+	for argv in \
+		"git -c pack.useBitmapBoundaryTraversal=true" \
+		"git -c feature.experimental=true" \
+		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=1 git"
+	do
+		eval "GIT_TRACE2_EVENT=1 $argv rev-list --objects \
+			--use-bitmap-index second..other 2>perf" &&
+		grep "\"region_enter\".*\"label\":\"haves/boundary\"" perf ||
+			return 1
+	done &&
+
+	for argv in \
+		"git -c pack.useBitmapBoundaryTraversal=false" \
+		"git -c feature.experimental=true -c pack.useBitmapBoundaryTraversal=false" \
+		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=0 git -c pack.useBitmapBoundaryTraversal=true" \
+		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=0 git -c feature.experimental=true"
+	do
+		eval "GIT_TRACE2_EVENT=1 $argv rev-list --objects \
+			--use-bitmap-index second..other 2>perf" &&
+		grep "\"region_enter\".*\"label\":\"haves/classic\"" perf ||
+			return 1
+	done
+'
+
 test_bitmap_cases "pack.writeBitmapLookupTable"
 
 test_expect_success 'verify writing bitmap lookup table when enabled' '
-- 
2.40.1.479.g8f2b93ade7

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 3/3] pack-bitmap.c: use commit boundary during bitmap traversal
  2023-05-08 17:38   ` [PATCH v3 3/3] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
@ 2023-05-08 20:53     ` Derrick Stolee
  2023-05-08 22:12       ` Taylor Blau
  0 siblings, 1 reply; 51+ messages in thread
From: Derrick Stolee @ 2023-05-08 20:53 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

On 5/8/2023 1:38 PM, Taylor Blau wrote:

> -	/*
> -	 * if we have a HAVES list, but none of those haves is contained
> -	 * in the packfile that has a bitmap, we don't have anything to
> -	 * optimize here
> -	 */
> -	if (haves && !in_bitmapped_pack(bitmap_git, haves))
> -		goto cleanup;
> +	use_boundary_traversal = git_env_bool(GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL, -1);
> +	if (use_boundary_traversal < 0) {
> +		prepare_repo_settings(revs->repo);
> +		use_boundary_traversal = revs->repo->settings.pack_use_bitmap_boundary_traversal;
> +	}
> +
> +	if (!use_boundary_traversal) {
> +		/*
> +		 * if we have a HAVES list, but none of those haves is contained
> +		 * in the packfile that has a bitmap, we don't have anything to
> +		 * optimize here
> +		 */
> +		if (haves && !in_bitmapped_pack(bitmap_git, haves))
> +			goto cleanup;
> +	}

I was reading along, nodding my head, until I came across this comment.
I recognize that it's moved from an existing place, but this seems very
strange and incorrect.

Is this implying that if the 'haves' are not in the bitmapped pack, then
we _can't_ construct a bitmap representing the objects they can reach?

Do we not have the ability to extend the object order beyond the bitmapped
pack in-memory in a way that allows us to provide bit positions for the
objects not in the bitmapped pack?

I can understand saying "it might be more expensive to construct a bitmap
here, because we need to walk into the bitmapped pack before we have a
hope of hitting a bitmap." However, that's far from "we don't have anything
to optimize".

This comment is from fff42755efc (pack-bitmap: add support for bitmap
indexes, 2013-12-21), and perhaps at that time we didn't have the ability
to construct a reachability bitmap across the non-bitmapped pack.

Something to think about removing in the future, but not a blocker for
this series. Getting it out of the way for the boundary-based walk makes
even more sense because the commits to check are those in the boundary,
not the haves themselves.
 
> +test_expect_success 'boundary-based traversal is used when requested' '
> +	git repack -a -d --write-bitmap-index &&
> +
> +	for argv in \
> +		"git -c pack.useBitmapBoundaryTraversal=true" \
> +		"git -c feature.experimental=true" \
> +		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=1 git"
> +	do
> +		eval "GIT_TRACE2_EVENT=1 $argv rev-list --objects \
> +			--use-bitmap-index second..other 2>perf" &&
> +		grep "\"region_enter\".*\"label\":\"haves/boundary\"" perf ||
> +			return 1
> +	done &&
> +
> +	for argv in \
> +		"git -c pack.useBitmapBoundaryTraversal=false" \
> +		"git -c feature.experimental=true -c pack.useBitmapBoundaryTraversal=false" \
> +		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=0 git -c pack.useBitmapBoundaryTraversal=true" \
> +		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=0 git -c feature.experimental=true"

This ordering (GIT_TEST_*=0 overrides config) seems backwards to me, but
it doesn't really matter since it's a GIT_TEST_* variable. Thanks for
including tests so the order is documented.

> +	do
> +		eval "GIT_TRACE2_EVENT=1 $argv rev-list --objects \
> +			--use-bitmap-index second..other 2>perf" &&
> +		grep "\"region_enter\".*\"label\":\"haves/classic\"" perf ||
> +			return 1

nit: you should be able to use 'test_region' here. Probably not worth
a re-roll, as everything else looks good to me.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 3/3] pack-bitmap.c: use commit boundary during bitmap traversal
  2023-05-08 20:53     ` Derrick Stolee
@ 2023-05-08 22:12       ` Taylor Blau
  0 siblings, 0 replies; 51+ messages in thread
From: Taylor Blau @ 2023-05-08 22:12 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Junio C Hamano

On Mon, May 08, 2023 at 04:53:35PM -0400, Derrick Stolee wrote:
> On 5/8/2023 1:38 PM, Taylor Blau wrote:
>
> > -	/*
> > -	 * if we have a HAVES list, but none of those haves is contained
> > -	 * in the packfile that has a bitmap, we don't have anything to
> > -	 * optimize here
> > -	 */
> > -	if (haves && !in_bitmapped_pack(bitmap_git, haves))
> > -		goto cleanup;
> > +	use_boundary_traversal = git_env_bool(GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL, -1);
> > +	if (use_boundary_traversal < 0) {
> > +		prepare_repo_settings(revs->repo);
> > +		use_boundary_traversal = revs->repo->settings.pack_use_bitmap_boundary_traversal;
> > +	}
> > +
> > +	if (!use_boundary_traversal) {
> > +		/*
> > +		 * if we have a HAVES list, but none of those haves is contained
> > +		 * in the packfile that has a bitmap, we don't have anything to
> > +		 * optimize here
> > +		 */
> > +		if (haves && !in_bitmapped_pack(bitmap_git, haves))
> > +			goto cleanup;
> > +	}
>
> I was reading along, nodding my head, until I came across this comment.
> I recognize that it's moved from an existing place, but this seems very
> strange and incorrect.
>
> Is this implying that if the 'haves' are not in the bitmapped pack, then
> we _can't_ construct a bitmap representing the objects they can reach?
>
> Do we not have the ability to extend the object order beyond the bitmapped
> pack in-memory in a way that allows us to provide bit positions for the
> objects not in the bitmapped pack?

It is a strange heuristic indeed, and I had the same thought as you the
first time I remember encountering this part of the code.

in_bitmapped_pack() asks whether there is at least one of the given set
of objects which appears in the pack/MIDX corresponding with our bitmap.
In other words, this heuristic causes us to bail when none of the
objects listed as haves appear in the corresponding pack/MIDX.

Strictly speaking, there isn't a fundamental limitation preventing us
from using the bitmap traversal in cases where none of the tips exist in
the bitmapped pack/MIDX. That's what the extended index is for (see
`bitmap_position_extended()` and `ext_index_add_object()` for more
details).

AFAICT, the heuristic here is another way of saying, "if none of these
haves appear in the corresponding pack or MIDX, my bitmap coverage is
likely bad enough that it's not worth trying to construct a full bitmap
on the haves side (and we should fall back to the ordinary object
traversal)".

> I can understand saying "it might be more expensive to construct a bitmap
> here, because we need to walk into the bitmapped pack before we have a
> hope of hitting a bitmap." However, that's far from "we don't have anything
> to optimize".

Agreed.

> Something to think about removing in the future, but not a blocker for
> this series. Getting it out of the way for the boundary-based walk makes
> even more sense because the commits to check are those in the boundary,
> not the haves themselves.

Yeah. I think that what we do here will end up depending on the
performance of the boundary based approach. If it's a clear enough win
in a majority of cases, I'd just as soon drop the classic traversal (and
with it, this heuristic).

But if we do end up keeping the classic traversal around for a while, I
absolutely agree that we should investigate and consider dropping this
optimization, which I have long-suspected as not being very useful.

> > +test_expect_success 'boundary-based traversal is used when requested' '
> > +	git repack -a -d --write-bitmap-index &&
> > +
> > +	for argv in \
> > +		"git -c pack.useBitmapBoundaryTraversal=true" \
> > +		"git -c feature.experimental=true" \
> > +		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=1 git"
> > +	do
> > +		eval "GIT_TRACE2_EVENT=1 $argv rev-list --objects \
> > +			--use-bitmap-index second..other 2>perf" &&
> > +		grep "\"region_enter\".*\"label\":\"haves/boundary\"" perf ||
> > +			return 1
> > +	done &&
> > +
> > +	for argv in \
> > +		"git -c pack.useBitmapBoundaryTraversal=false" \
> > +		"git -c feature.experimental=true -c pack.useBitmapBoundaryTraversal=false" \
> > +		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=0 git -c pack.useBitmapBoundaryTraversal=true" \
> > +		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=0 git -c feature.experimental=true"
>
> This ordering (GIT_TEST_*=0 overrides config) seems backwards to me, but
> it doesn't really matter since it's a GIT_TEST_* variable. Thanks for
> including tests so the order is documented.

Now that you mention it, I think it's backwards too ;-). But I agree
that documenting it here is sufficient (and ideally, we don't have to
live with this GIT_TEST variable for more than a release or two,
anyway).

> > +	do
> > +		eval "GIT_TRACE2_EVENT=1 $argv rev-list --objects \
> > +			--use-bitmap-index second..other 2>perf" &&
> > +		grep "\"region_enter\".*\"label\":\"haves/classic\"" perf ||
> > +			return 1
>
> nit: you should be able to use 'test_region' here. Probably not worth
> a re-roll, as everything else looks good to me.

Ah, I didn't know of `test_region` before. I'll keep it in mind, thanks!

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-05-08 17:38 ` [PATCH v3 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
                     ` (2 preceding siblings ...)
  2023-05-08 17:38   ` [PATCH v3 3/3] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
@ 2023-05-10 22:55   ` Junio C Hamano
  2023-05-10 23:10     ` Taylor Blau
  3 siblings, 1 reply; 51+ messages in thread
From: Junio C Hamano @ 2023-05-10 22:55 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

> Here is another small reroll of my series to implement a boundary-based
> bitmap traversal algorithm that I worked on towards the end of 2021 with
> Peff.
>
> The algorithm is the same as in the last round, though I cleaned up a
> few things throughout thanks to another very helpful review from Stolee.
>
> One significant change from last time is that the new algorithm is
> opt-in behind feature.experimental, giving users a way to try out the
> new implementation while still being able to switch back to the
> known-good traversal in case anything goes wrong.

It seems that the comments the topic received on the previous round
have all been addressed adequately?  Does everybody feel comfortable
merging this topic to 'next'?

Thanks.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-05-10 22:55   ` [PATCH v3 0/3] pack-bitmap: boundary-based " Junio C Hamano
@ 2023-05-10 23:10     ` Taylor Blau
  2023-05-11 15:23       ` Derrick Stolee
  0 siblings, 1 reply; 51+ messages in thread
From: Taylor Blau @ 2023-05-10 23:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Derrick Stolee

On Wed, May 10, 2023 at 03:55:54PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Here is another small reroll of my series to implement a boundary-based
> > bitmap traversal algorithm that I worked on towards the end of 2021 with
> > Peff.
> >
> > The algorithm is the same as in the last round, though I cleaned up a
> > few things throughout thanks to another very helpful review from Stolee.
> >
> > One significant change from last time is that the new algorithm is
> > opt-in behind feature.experimental, giving users a way to try out the
> > new implementation while still being able to switch back to the
> > known-good traversal in case anything goes wrong.
>
> It seems that the comments the topic received on the previous round
> have all been addressed adequately?  Does everybody feel comfortable
> merging this topic to 'next'?

Thanks. I agree and think that this is ready to merge down. Even though
it is changing the traversal algorithm, it should be relatively safe to
merge since the new behavior is hidden behind an opt-in configuration
option.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-05-10 23:10     ` Taylor Blau
@ 2023-05-11 15:23       ` Derrick Stolee
  0 siblings, 0 replies; 51+ messages in thread
From: Derrick Stolee @ 2023-05-11 15:23 UTC (permalink / raw)
  To: Taylor Blau, Junio C Hamano; +Cc: git, Jeff King

On 5/10/2023 7:10 PM, Taylor Blau wrote:
> On Wed, May 10, 2023 at 03:55:54PM -0700, Junio C Hamano wrote:
>> Taylor Blau <me@ttaylorr.com> writes:
>>
>>> Here is another small reroll of my series to implement a boundary-based
>>> bitmap traversal algorithm that I worked on towards the end of 2021 with
>>> Peff.
>>>
>>> The algorithm is the same as in the last round, though I cleaned up a
>>> few things throughout thanks to another very helpful review from Stolee.
>>>
>>> One significant change from last time is that the new algorithm is
>>> opt-in behind feature.experimental, giving users a way to try out the
>>> new implementation while still being able to switch back to the
>>> known-good traversal in case anything goes wrong.
>>
>> It seems that the comments the topic received on the previous round
>> have all been addressed adequately?  Does everybody feel comfortable
>> merging this topic to 'next'?
> 
> Thanks. I agree and think that this is ready to merge down. Even though
> it is changing the traversal algorithm, it should be relatively safe to
> merge since the new behavior is hidden behind an opt-in configuration
> option.

Yes, I'm happy with this version.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v4 0/3] pack-bitmap: boundary-based bitmap traversal
  2023-04-25  0:00 [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
                   ` (7 preceding siblings ...)
  2023-05-08 17:38 ` [PATCH v3 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
@ 2023-06-08 16:25 ` Taylor Blau
  2023-06-08 16:25   ` [PATCH v4 1/3] object: add object_array initializer helper function Taylor Blau
                     ` (2 more replies)
  8 siblings, 3 replies; 51+ messages in thread
From: Taylor Blau @ 2023-06-08 16:25 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

Here is another copy of the reroll I sent a few weeks ago[1] of the
boundary-based bitmap traversal topic.

It is identical to the previous version, except rebased onto the current
tip of 'master', which thankfully produced no conflicts.

My plan remains to drop the conditional after a version or two, and
hopefully having this be in 'master'/'next' sooner rather than later
will help us build additional confidence with folks who use those
branches as the base for their daily driver.

Thanks in advance for your review.

[1]: https://lore.kernel.org/git/cover.1683567065.git.me@ttaylorr.com/

Taylor Blau (3):
  object: add object_array initializer helper function
  pack-bitmap.c: extract `fill_in_bitmap()`
  pack-bitmap.c: use commit boundary during bitmap traversal

 Documentation/config/feature.txt |   3 +
 Documentation/config/pack.txt    |  17 +++
 ci/run-build-and-tests.sh        |   1 +
 object.c                         |   6 +
 object.h                         |   2 +
 pack-bitmap.c                    | 241 ++++++++++++++++++++++++++-----
 pack-bitmap.h                    |   4 +
 repo-settings.c                  |   7 +-
 repository.h                     |   1 +
 t/README                         |   4 +
 t/t5310-pack-bitmaps.sh          |  38 +++++
 11 files changed, 284 insertions(+), 40 deletions(-)

Range-diff against v3:
1:  c31508ac4a = 1:  1fff820874 object: add object_array initializer helper function
2:  e7b30490da = 2:  fa67382e21 pack-bitmap.c: extract `fill_in_bitmap()`
3:  8a8f41e0c4 = 3:  b80d17c938 pack-bitmap.c: use commit boundary during bitmap traversal
-- 
2.41.0.3.gb80d17c938

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v4 1/3] object: add object_array initializer helper function
  2023-06-08 16:25 ` [PATCH v4 " Taylor Blau
@ 2023-06-08 16:25   ` Taylor Blau
  2023-06-08 16:25   ` [PATCH v4 2/3] pack-bitmap.c: extract `fill_in_bitmap()` Taylor Blau
  2023-06-08 16:25   ` [PATCH v4 3/3] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
  2 siblings, 0 replies; 51+ messages in thread
From: Taylor Blau @ 2023-06-08 16:25 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

The object_array API has an OBJECT_ARRAY_INIT macro, but lacks a
function to initialize an object_array at a given location in memory.

Introduce `object_array_init()` to implement such a function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object.c | 6 ++++++
 object.h | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/object.c b/object.c
index 6d4ef1524d..f1adb458b6 100644
--- a/object.c
+++ b/object.c
@@ -356,6 +356,12 @@ void object_list_free(struct object_list **list)
  */
 static char object_array_slopbuf[1];
 
+void object_array_init(struct object_array *array)
+{
+	struct object_array blank = OBJECT_ARRAY_INIT;
+	memcpy(array, &blank, sizeof(*array));
+}
+
 void add_object_array_with_path(struct object *obj, const char *name,
 				struct object_array *array,
 				unsigned mode, const char *path)
diff --git a/object.h b/object.h
index 5871615fee..114d45954d 100644
--- a/object.h
+++ b/object.h
@@ -58,6 +58,8 @@ struct object_array {
 
 #define OBJECT_ARRAY_INIT { 0 }
 
+void object_array_init(struct object_array *array);
+
 /*
  * object flag allocation:
  * revision.h:               0---------10         15             23------27
-- 
2.41.0.3.gb80d17c938


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v4 2/3] pack-bitmap.c: extract `fill_in_bitmap()`
  2023-06-08 16:25 ` [PATCH v4 " Taylor Blau
  2023-06-08 16:25   ` [PATCH v4 1/3] object: add object_array initializer helper function Taylor Blau
@ 2023-06-08 16:25   ` Taylor Blau
  2023-06-08 16:25   ` [PATCH v4 3/3] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
  2 siblings, 0 replies; 51+ messages in thread
From: Taylor Blau @ 2023-06-08 16:25 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

To prepare for the boundary-based bitmap walk to perform a fill-in
traversal using the boundary of either side as the tips, extract routine
used to perform fill-in traversal by `find_objects()` so that it can be
used in both places.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 65 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 36 insertions(+), 29 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 999f962602..c1a9c70061 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1043,6 +1043,40 @@ static int add_commit_to_bitmap(struct bitmap_index *bitmap_git,
 	return 1;
 }
 
+static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_git,
+				     struct rev_info *revs,
+				     struct bitmap *base,
+				     struct bitmap *seen)
+{
+	struct include_data incdata;
+	struct bitmap_show_data show_data;
+
+	if (!base)
+		base = bitmap_new();
+
+	incdata.bitmap_git = bitmap_git;
+	incdata.base = base;
+	incdata.seen = seen;
+
+	revs->include_check = should_include;
+	revs->include_check_obj = should_include_obj;
+	revs->include_check_data = &incdata;
+
+	if (prepare_revision_walk(revs))
+		die(_("revision walk setup failed"));
+
+	show_data.bitmap_git = bitmap_git;
+	show_data.base = base;
+
+	traverse_commit_list(revs, show_commit, show_object, &show_data);
+
+	revs->include_check = NULL;
+	revs->include_check_obj = NULL;
+	revs->include_check_data = NULL;
+
+	return base;
+}
+
 static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 				   struct rev_info *revs,
 				   struct object_list *roots,
@@ -1108,35 +1142,8 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 		}
 	}
 
-	if (needs_walk) {
-		struct include_data incdata;
-		struct bitmap_show_data show_data;
-
-		if (!base)
-			base = bitmap_new();
-
-		incdata.bitmap_git = bitmap_git;
-		incdata.base = base;
-		incdata.seen = seen;
-
-		revs->include_check = should_include;
-		revs->include_check_obj = should_include_obj;
-		revs->include_check_data = &incdata;
-
-		if (prepare_revision_walk(revs))
-			die(_("revision walk setup failed"));
-
-		show_data.bitmap_git = bitmap_git;
-		show_data.base = base;
-
-		traverse_commit_list(revs,
-				     show_commit, show_object,
-				     &show_data);
-
-		revs->include_check = NULL;
-		revs->include_check_obj = NULL;
-		revs->include_check_data = NULL;
-	}
+	if (needs_walk)
+		base = fill_in_bitmap(bitmap_git, revs, base, seen);
 
 	return base;
 }
-- 
2.41.0.3.gb80d17c938


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v4 3/3] pack-bitmap.c: use commit boundary during bitmap traversal
  2023-06-08 16:25 ` [PATCH v4 " Taylor Blau
  2023-06-08 16:25   ` [PATCH v4 1/3] object: add object_array initializer helper function Taylor Blau
  2023-06-08 16:25   ` [PATCH v4 2/3] pack-bitmap.c: extract `fill_in_bitmap()` Taylor Blau
@ 2023-06-08 16:25   ` Taylor Blau
  2 siblings, 0 replies; 51+ messages in thread
From: Taylor Blau @ 2023-06-08 16:25 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

When reachability bitmap coverage exists in a repository, Git will use a
different (and hopefully faster) traversal to compute revision walks.

Consider a set of positive and negative tips (which we'll refer to with
their standard bitmap parlance by "wants", and "haves"). In order to
figure out what objects exist between the tips, the existing traversal
in `prepare_bitmap_walk()` does something like:

  1. Consider if we can even compute the set of objects with bitmaps,
     and fall back to the usual traversal if we cannot. For example,
     pathspec limiting traversals can't be computed using bitmaps (since
     they don't know which objects are at which paths). The same is true
     of certain kinds of non-trivial object filters.

  2. If we can compute the traversal with bitmaps, partition the
     (dereferenced) tips into two object lists, "haves", and "wants",
     based on whether or not the objects have the UNINTERESTING flag,
     respectively.

  3. Fall back to the ordinary object traversal if either (a) there are
     more than zero haves, none of which are in the bitmapped pack or
     MIDX, or (b) there are no wants.

  4. Construct a reachability bitmap for the "haves" side by walking
     from the revision tips down to any existing bitmaps, OR-ing in any
     bitmaps as they are found.

  5. Then do the same for the "wants" side, stopping at any objects that
     appear in the "haves" bitmap.

  6. Filter the results if any object filter (that can be easily
     computed with bitmaps alone) was given, and then return back to the
     caller.

When there is good bitmap coverage relative to the traversal tips, this
walk is often significantly faster than an ordinary object traversal
because it can visit far fewer objects.

But in certain cases, it can be significantly *slower* than the usual
object traversal. Why? Because we need to compute complete bitmaps on
either side of the walk. If either one (or both) of the sides require
walking many (or all!) objects before they get to an existing bitmap,
the extra bitmap machinery is mostly or all overhead.

One of the benefits, however, is that even if the walk is slower, bitmap
traversals are guaranteed to provide an *exact* answer. Unlike the
traditional object traversal algorithm, which can over-count the results
by not opening trees for older commits, the bitmap walk builds an exact
reachability bitmap for either side, meaning the results are never
over-counted.

But producing non-exact results is OK for our traversal here (both in
the bitmap case and not), as long as the results are over-counted, not
under.

Relaxing the bitmap traversal to allow it to produce over-counted
results gives us the opportunity to make some significant improvements.
Instead of the above, the new algorithm only has to walk from the
*boundary* down to the nearest bitmap, instead of from each of the
UNINTERESTING tips.

The boundary-based approach still has degenerate cases, but we'll show
in a moment that it is often a significant improvement.

The new algorithm works as follows:

  1. Build a (partial) bitmap of the haves side by first OR-ing any
     bitmap(s) that already exist for UNINTERESTING commits between the
     haves and the boundary.

  2. For each commit along the boundary, add it as a fill-in traversal
     tip (where the traversal terminates once an existing bitmap is
     found), and perform fill-in traversal.

  3. Build up a complete bitmap of the wants side as usual, stopping any
     time we intersect the (partial) haves side.

  4. Return the results.

And is more-or-less equivalent to using the *old* algorithm with this
invocation:

    $ git rev-list --objects --use-bitmap-index $WANTS --not \
        $(git rev-list --objects --boundary $WANTS --not $HAVES |
          perl -lne 'print $1 if /^-(.*)/')

The new result performs significantly better in many cases, particularly
when the distance from the boundary commit(s) to an existing bitmap is
shorter than the distance from (all of) the have tips to the nearest
bitmapped commit.

Note that when using the old bitmap traversal algorithm, the results can
be *slower* than without bitmaps! Under the new algorithm, the result is
computed faster with bitmaps than without (at the cost of over-counting
the true number of objects in a similar fashion as the non-bitmap
traversal):

    # (Computing the number of tagged objects not on any branches
    # without bitmaps).
    $ time git rev-list --count --objects --tags --not --branches
    20

    real	0m1.388s
    user	0m1.092s
    sys	0m0.296s

    # (Computing the same query using the old bitmap traversal).
    $ time git rev-list --count --objects --tags --not --branches --use-bitmap-index
    19

    real	0m22.709s
    user	0m21.628s
    sys	0m1.076s

    # (this commit)
    $ time git.compile rev-list --count --objects --tags --not --branches --use-bitmap-index
    19

    real	0m1.518s
    user	0m1.234s
    sys	0m0.284s

The new algorithm is still slower than not using bitmaps at all, but it
is nearly a 15-fold improvement over the existing traversal.

In a more realistic setting (using my local copy of git.git), I can
observe a similar (if more modest) speed-up:

    $ argv="--count --objects --branches --not --tags"
    hyperfine \
      -n 'no bitmaps' "git.compile rev-list $argv" \
      -n 'existing traversal' "git.compile rev-list --use-bitmap-index $argv" \
      -n 'boundary traversal' "git.compile -c pack.useBitmapBoundaryTraversal=true rev-list --use-bitmap-index $argv"
    Benchmark 1: no bitmaps
      Time (mean ± σ):     124.6 ms ±   2.1 ms    [User: 103.7 ms, System: 20.8 ms]
      Range (min … max):   122.6 ms … 133.1 ms    22 runs

    Benchmark 2: existing traversal
      Time (mean ± σ):     368.6 ms ±   3.0 ms    [User: 325.3 ms, System: 43.1 ms]
      Range (min … max):   365.1 ms … 374.8 ms    10 runs

    Benchmark 3: boundary traversal
      Time (mean ± σ):     167.6 ms ±   0.9 ms    [User: 139.5 ms, System: 27.9 ms]
      Range (min … max):   166.1 ms … 169.2 ms    17 runs

    Summary
      'no bitmaps' ran
        1.34 ± 0.02 times faster than 'boundary traversal'
        2.96 ± 0.05 times faster than 'existing traversal'

Here, the new algorithm is also still slower than not using bitmaps, but
represents a more than 2-fold improvement over the existing traversal in
a more modest example.

Since this algorithm was originally written (nearly a year and a half
ago, at the time of writing), the bitmap lookup table shipped, making
the new algorithm's result more competitive. A few other future
directions for improving bitmap traversal times beyond not using bitmaps
at all:

  - Decrease the cost to decompress and OR together many bitmaps
    together (particularly when enumerating the uninteresting side of
    the walk). Here we could explore more efficient bitmap storage
    techniques, like Roaring+Run and/or use SIMD instructions to speed
    up ORing them together.

  - Store pseudo-merge bitmaps, which could allow us to OR together
    fewer "summary" bitmaps (which would also help with the above).

Helped-by: Jeff King <peff@peff.net>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/feature.txt |   3 +
 Documentation/config/pack.txt    |  17 +++
 ci/run-build-and-tests.sh        |   1 +
 pack-bitmap.c                    | 182 ++++++++++++++++++++++++++++---
 pack-bitmap.h                    |   4 +
 repo-settings.c                  |   7 +-
 repository.h                     |   1 +
 t/README                         |   4 +
 t/t5310-pack-bitmaps.sh          |  38 +++++++
 9 files changed, 243 insertions(+), 14 deletions(-)

diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt
index 17b4d39f89..bf9546fca4 100644
--- a/Documentation/config/feature.txt
+++ b/Documentation/config/feature.txt
@@ -14,6 +14,9 @@ feature.experimental::
 +
 * `fetch.negotiationAlgorithm=skipping` may improve fetch negotiation times by
 skipping more commits at a time, reducing the number of round trips.
++
+* `pack.useBitmapBoundaryTraversal=true` may improve bitmap traversal times by
+walking fewer objects.
 
 feature.manyFiles::
 	Enable config options that optimize for repos with many files in the
diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index d4c7c9d4e4..3748136d14 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -123,6 +123,23 @@ pack.useBitmaps::
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
 
+pack.useBitmapBoundaryTraversal::
+	When true, Git will use an experimental algorithm for computing
+	reachability queries with bitmaps. Instead of building up
+	complete bitmaps for all of the negated tips and then OR-ing
+	them together, consider negated tips with existing bitmaps as
+	additive (i.e. OR-ing them into the result if they exist,
+	ignoring them otherwise), and build up a bitmap at the boundary
+	instead.
++
+When using this algorithm, Git may include too many objects as a result
+of not opening up trees belonging to certain UNINTERESTING commits. This
+inexactness matches the non-bitmap traversal algorithm.
++
+In many cases, this can provide a speed-up over the exact algorithm,
+particularly when there is poor bitmap coverage of the negated side of
+the query.
+
 pack.useSparse::
 	When true, git will default to using the '--sparse' option in
 	'git pack-objects' when the '--revs' option is present. This
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index a18b13a41d..2528f25e31 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -29,6 +29,7 @@ linux-TEST-vars)
 	export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
 	export GIT_TEST_NO_WRITE_REV_INDEX=1
 	export GIT_TEST_CHECKOUT_WORKERS=2
+	export GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=1
 	;;
 linux-clang)
 	export GIT_TEST_DEFAULT_HASH=sha1
diff --git a/pack-bitmap.c b/pack-bitmap.c
index c1a9c70061..894bff02c5 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1077,6 +1077,126 @@ static struct bitmap *fill_in_bitmap(struct bitmap_index *bitmap_git,
 	return base;
 }
 
+struct bitmap_boundary_cb {
+	struct bitmap_index *bitmap_git;
+	struct bitmap *base;
+
+	struct object_array boundary;
+};
+
+static void show_boundary_commit(struct commit *commit, void *_data)
+{
+	struct bitmap_boundary_cb *data = _data;
+
+	if (commit->object.flags & BOUNDARY)
+		add_object_array(&commit->object, "", &data->boundary);
+
+	if (commit->object.flags & UNINTERESTING) {
+		if (bitmap_walk_contains(data->bitmap_git, data->base,
+					 &commit->object.oid))
+			return;
+
+		add_commit_to_bitmap(data->bitmap_git, &data->base, commit);
+	}
+}
+
+static void show_boundary_object(struct object *object,
+				 const char *name, void *data)
+{
+	BUG("should not be called");
+}
+
+static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git,
+					    struct rev_info *revs,
+					    struct object_list *roots)
+{
+	struct bitmap_boundary_cb cb;
+	struct object_list *root;
+	unsigned int i;
+	unsigned int tmp_blobs, tmp_trees, tmp_tags;
+	int any_missing = 0;
+
+	cb.bitmap_git = bitmap_git;
+	cb.base = bitmap_new();
+	object_array_init(&cb.boundary);
+
+	revs->ignore_missing_links = 1;
+
+	/*
+	 * OR in any existing reachability bitmaps among `roots` into
+	 * `cb.base`.
+	 */
+	for (root = roots; root; root = root->next) {
+		struct object *object = root->item;
+		if (object->type != OBJ_COMMIT ||
+		    bitmap_walk_contains(bitmap_git, cb.base, &object->oid))
+			continue;
+
+		if (add_commit_to_bitmap(bitmap_git, &cb.base,
+					 (struct commit *)object))
+			continue;
+
+		any_missing = 1;
+	}
+
+	if (!any_missing)
+		goto cleanup;
+
+	tmp_blobs = revs->blob_objects;
+	tmp_trees = revs->tree_objects;
+	tmp_tags = revs->blob_objects;
+	revs->blob_objects = 0;
+	revs->tree_objects = 0;
+	revs->tag_objects = 0;
+
+	/*
+	 * We didn't have complete coverage of the roots. First setup a
+	 * revision walk to (a) OR in any bitmaps that are UNINTERESTING
+	 * between the tips and boundary, and (b) record the boundary.
+	 */
+	trace2_region_enter("pack-bitmap", "boundary-prepare", the_repository);
+	if (prepare_revision_walk(revs))
+		die("revision walk setup failed");
+	trace2_region_leave("pack-bitmap", "boundary-prepare", the_repository);
+
+	trace2_region_enter("pack-bitmap", "boundary-traverse", the_repository);
+	revs->boundary = 1;
+	traverse_commit_list_filtered(revs,
+				      show_boundary_commit,
+				      show_boundary_object,
+				      &cb, NULL);
+	revs->boundary = 0;
+	trace2_region_leave("pack-bitmap", "boundary-traverse", the_repository);
+
+	revs->blob_objects = tmp_blobs;
+	revs->tree_objects = tmp_trees;
+	revs->tag_objects = tmp_tags;
+
+	reset_revision_walk();
+	clear_object_flags(UNINTERESTING);
+
+	/*
+	 * Then add the boundary commit(s) as fill-in traversal tips.
+	 */
+	trace2_region_enter("pack-bitmap", "boundary-fill-in", the_repository);
+	for (i = 0; i < cb.boundary.nr; i++) {
+		struct object *obj = cb.boundary.objects[i].item;
+		if (bitmap_walk_contains(bitmap_git, cb.base, &obj->oid))
+			obj->flags |= SEEN;
+		else
+			add_pending_object(revs, obj, "");
+	}
+	if (revs->pending.nr)
+		cb.base = fill_in_bitmap(bitmap_git, revs, cb.base, NULL);
+	trace2_region_leave("pack-bitmap", "boundary-fill-in", the_repository);
+
+cleanup:
+	object_array_clear(&cb.boundary);
+	revs->ignore_missing_links = 0;
+
+	return cb.base;
+}
+
 static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 				   struct rev_info *revs,
 				   struct object_list *roots,
@@ -1142,8 +1262,21 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
 		}
 	}
 
-	if (needs_walk)
+	if (needs_walk) {
+		/*
+		 * This fill-in traversal may walk over some objects
+		 * again, since we have already traversed in order to
+		 * find the boundary.
+		 *
+		 * But this extra walk should be extremely cheap, since
+		 * all commit objects are loaded into memory, and
+		 * because we skip walking to parents that are
+		 * UNINTERESTING, since it will be marked in the haves
+		 * bitmap already (or it has an on-disk bitmap, since
+		 * OR-ing it in covers all of its ancestors).
+		 */
 		base = fill_in_bitmap(bitmap_git, revs, base, seen);
+	}
 
 	return base;
 }
@@ -1535,6 +1668,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 int filter_provided_objects)
 {
 	unsigned int i;
+	int use_boundary_traversal;
 
 	struct object_list *wants = NULL;
 	struct object_list *haves = NULL;
@@ -1585,13 +1719,21 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 			object_list_insert(object, &wants);
 	}
 
-	/*
-	 * if we have a HAVES list, but none of those haves is contained
-	 * in the packfile that has a bitmap, we don't have anything to
-	 * optimize here
-	 */
-	if (haves && !in_bitmapped_pack(bitmap_git, haves))
-		goto cleanup;
+	use_boundary_traversal = git_env_bool(GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL, -1);
+	if (use_boundary_traversal < 0) {
+		prepare_repo_settings(revs->repo);
+		use_boundary_traversal = revs->repo->settings.pack_use_bitmap_boundary_traversal;
+	}
+
+	if (!use_boundary_traversal) {
+		/*
+		 * if we have a HAVES list, but none of those haves is contained
+		 * in the packfile that has a bitmap, we don't have anything to
+		 * optimize here
+		 */
+		if (haves && !in_bitmapped_pack(bitmap_git, haves))
+			goto cleanup;
+	}
 
 	/* if we don't want anything, we're done here */
 	if (!wants)
@@ -1605,18 +1747,32 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 	if (load_bitmap(revs->repo, bitmap_git) < 0)
 		goto cleanup;
 
-	object_array_clear(&revs->pending);
+	if (!use_boundary_traversal)
+		object_array_clear(&revs->pending);
 
 	if (haves) {
-		revs->ignore_missing_links = 1;
-		haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
-		reset_revision_walk();
-		revs->ignore_missing_links = 0;
+		if (use_boundary_traversal) {
+			trace2_region_enter("pack-bitmap", "haves/boundary", the_repository);
+			haves_bitmap = find_boundary_objects(bitmap_git, revs, haves);
+			trace2_region_leave("pack-bitmap", "haves/boundary", the_repository);
+		} else {
+			trace2_region_enter("pack-bitmap", "haves/classic", the_repository);
+			revs->ignore_missing_links = 1;
+			haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
+			reset_revision_walk();
+			revs->ignore_missing_links = 0;
+			trace2_region_leave("pack-bitmap", "haves/classic", the_repository);
+		}
 
 		if (!haves_bitmap)
 			BUG("failed to perform bitmap walk");
 	}
 
+	if (use_boundary_traversal) {
+		object_array_clear(&revs->pending);
+		reset_revision_walk();
+	}
+
 	wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap);
 
 	if (!wants_bitmap)
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 84591f041b..5273a6a019 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -62,6 +62,10 @@ void traverse_bitmap_commit_list(struct bitmap_index *,
 void test_bitmap_walk(struct rev_info *revs);
 int test_bitmap_commits(struct repository *r);
 int test_bitmap_hashes(struct repository *r);
+
+#define GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL \
+	"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL"
+
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 int filter_provided_objects);
 uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
diff --git a/repo-settings.c b/repo-settings.c
index d220c5dd9f..7b566d729d 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -41,8 +41,10 @@ void prepare_repo_settings(struct repository *r)
 	repo_cfg_bool(r, "feature.experimental", &experimental, 0);
 
 	/* Defaults modified by feature.* */
-	if (experimental)
+	if (experimental) {
 		r->settings.fetch_negotiation_algorithm = FETCH_NEGOTIATION_SKIPPING;
+		r->settings.pack_use_bitmap_boundary_traversal = 1;
+	}
 	if (manyfiles) {
 		r->settings.index_version = 4;
 		r->settings.index_skip_hash = 1;
@@ -62,6 +64,9 @@ void prepare_repo_settings(struct repository *r)
 	repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
 	repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
 	repo_cfg_bool(r, "pack.readreverseindex", &r->settings.pack_read_reverse_index, 1);
+	repo_cfg_bool(r, "pack.usebitmapboundarytraversal",
+		      &r->settings.pack_use_bitmap_boundary_traversal,
+		      r->settings.pack_use_bitmap_boundary_traversal);
 
 	/*
 	 * The GIT_TEST_MULTI_PACK_INDEX variable is special in that
diff --git a/repository.h b/repository.h
index 1a13ff2867..c42f7ab6bd 100644
--- a/repository.h
+++ b/repository.h
@@ -37,6 +37,7 @@ struct repo_settings {
 	int command_requires_full_index;
 	int sparse_index;
 	int pack_read_reverse_index;
+	int pack_use_bitmap_boundary_traversal;
 
 	struct fsmonitor_settings *fsmonitor; /* lazily loaded */
 
diff --git a/t/README b/t/README
index bdfac4cceb..b71a065e4a 100644
--- a/t/README
+++ b/t/README
@@ -442,6 +442,10 @@ GIT_TEST_INDEX_VERSION=<n> exercises the index read/write code path
 for the index version specified.  Can be set to any valid version
 (currently 2, 3, or 4).
 
+GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=<boolean> if enabled will
+use the boundary-based bitmap traversal algorithm. See the documentation
+of `pack.useBitmapBoundaryTraversal` for more details.
+
 GIT_TEST_PACK_SPARSE=<boolean> if disabled will default the pack-objects
 builtin to use the non-sparse object walk. This can still be overridden by
 the --sparse command-line argument.
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 526a5a506e..78c1c6c923 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -9,6 +9,10 @@ test_description='exercise basic bitmap functionality'
 # their place.
 GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0
 
+# Likewise, allow individual tests to control whether or not they use
+# the boundary-based traversal.
+sane_unset GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL
+
 objpath () {
 	echo ".git/objects/$(echo "$1" | sed -e 's|\(..\)|\1/|')"
 }
@@ -457,6 +461,13 @@ test_bitmap_cases () {
 
 test_bitmap_cases
 
+GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=1
+export GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL
+
+test_bitmap_cases
+
+sane_unset GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL
+
 test_expect_success 'incremental repack fails when bitmaps are requested' '
 	test_commit more-1 &&
 	test_must_fail git repack -d 2>err &&
@@ -468,6 +479,33 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'boundary-based traversal is used when requested' '
+	git repack -a -d --write-bitmap-index &&
+
+	for argv in \
+		"git -c pack.useBitmapBoundaryTraversal=true" \
+		"git -c feature.experimental=true" \
+		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=1 git"
+	do
+		eval "GIT_TRACE2_EVENT=1 $argv rev-list --objects \
+			--use-bitmap-index second..other 2>perf" &&
+		grep "\"region_enter\".*\"label\":\"haves/boundary\"" perf ||
+			return 1
+	done &&
+
+	for argv in \
+		"git -c pack.useBitmapBoundaryTraversal=false" \
+		"git -c feature.experimental=true -c pack.useBitmapBoundaryTraversal=false" \
+		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=0 git -c pack.useBitmapBoundaryTraversal=true" \
+		"GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=0 git -c feature.experimental=true"
+	do
+		eval "GIT_TRACE2_EVENT=1 $argv rev-list --objects \
+			--use-bitmap-index second..other 2>perf" &&
+		grep "\"region_enter\".*\"label\":\"haves/classic\"" perf ||
+			return 1
+	done
+'
+
 test_bitmap_cases "pack.writeBitmapLookupTable"
 
 test_expect_success 'verify writing bitmap lookup table when enabled' '
-- 
2.41.0.3.gb80d17c938

^ permalink raw reply related	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2023-06-08 16:25 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-25  0:00 [PATCH 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
2023-04-25  0:00 ` [PATCH 1/3] revision: support tracking uninteresting commits Taylor Blau
2023-04-25 18:15   ` Derrick Stolee
2023-05-03 21:48     ` Taylor Blau
2023-05-04 13:46       ` Derrick Stolee
2023-05-03 22:08     ` Taylor Blau
2023-05-04 13:59       ` Derrick Stolee
2023-05-05 17:30         ` Taylor Blau
2023-04-25 18:22   ` Junio C Hamano
2023-04-25 18:48     ` Taylor Blau
2023-04-25  0:00 ` [PATCH 2/3] pack-bitmap.c: extract `fill_in_bitmap()` Taylor Blau
2023-04-25 18:32   ` Junio C Hamano
2023-04-25 18:51     ` [PATCH 2/3] pack-bitmap.c: extract `fill_in_bitmap()`t Taylor Blau
2023-04-25  0:00 ` [PATCH 3/3] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
2023-04-25 18:52   ` Junio C Hamano
2023-05-02 21:31     ` Felipe Contreras
2023-05-03 21:42     ` Taylor Blau
2023-05-03 21:58       ` Junio C Hamano
2023-04-25 18:53   ` Derrick Stolee
2023-04-25 18:02 ` [PATCH 0/3] pack-bitmap: boundary-based " Junio C Hamano
2023-04-25 18:30   ` Derrick Stolee
2023-04-25 18:57     ` Taylor Blau
2023-04-25 19:52       ` Derrick Stolee
2023-05-03 21:43         ` Taylor Blau
2023-04-25 18:06 ` Derrick Stolee
2023-04-25 19:01   ` Taylor Blau
2023-04-25 20:27     ` Derrick Stolee
2023-05-01 22:08 ` Junio C Hamano
2023-05-02 23:52   ` Taylor Blau
2023-05-05 17:27 ` [PATCH v2 0/2] " Taylor Blau
2023-05-05 17:27   ` [PATCH v2 1/2] pack-bitmap.c: extract `fill_in_bitmap()` Taylor Blau
2023-05-05 17:27   ` [PATCH v2 2/2] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
2023-05-05 18:13     ` Derrick Stolee
2023-05-05 18:43       ` Taylor Blau
2023-05-05 17:33   ` [PATCH v2 0/2] pack-bitmap: boundary-based " Junio C Hamano
2023-05-05 17:59   ` Derrick Stolee
2023-05-05 18:46     ` [PATCH v2 0/2] pack-bitmap: boundary-based bitmap traversalt Taylor Blau
2023-05-05 20:45       ` Derrick Stolee
2023-05-08 17:38 ` [PATCH v3 0/3] pack-bitmap: boundary-based bitmap traversal Taylor Blau
2023-05-08 17:38   ` [PATCH v3 1/3] object: add object_array initializer helper function Taylor Blau
2023-05-08 17:38   ` [PATCH v3 2/3] pack-bitmap.c: extract `fill_in_bitmap()` Taylor Blau
2023-05-08 17:38   ` [PATCH v3 3/3] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau
2023-05-08 20:53     ` Derrick Stolee
2023-05-08 22:12       ` Taylor Blau
2023-05-10 22:55   ` [PATCH v3 0/3] pack-bitmap: boundary-based " Junio C Hamano
2023-05-10 23:10     ` Taylor Blau
2023-05-11 15:23       ` Derrick Stolee
2023-06-08 16:25 ` [PATCH v4 " Taylor Blau
2023-06-08 16:25   ` [PATCH v4 1/3] object: add object_array initializer helper function Taylor Blau
2023-06-08 16:25   ` [PATCH v4 2/3] pack-bitmap.c: extract `fill_in_bitmap()` Taylor Blau
2023-06-08 16:25   ` [PATCH v4 3/3] pack-bitmap.c: use commit boundary during bitmap traversal Taylor Blau

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).