git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Patrick Steinhardt <ps@pks.im>
To: git@vger.kernel.org
Cc: "Jeff King" <peff@peff.net>,
	"Felipe Contreras" <felipe.contreras@gmail.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>,
	"Chris Torek" <chris.torek@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>
Subject: [PATCH v2 3/3] connected: implement connectivity check using bitmaps
Date: Mon, 28 Jun 2021 07:33:15 +0200	[thread overview]
Message-ID: <7687dedd4722c39b5ecef2c2165147c25d16b8d9.1624858240.git.ps@pks.im> (raw)
In-Reply-To: <cover.1624858240.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 10482 bytes --]

The connectivity checks are currently implemented via git-rev-list(1):
we simply ignore any objects which are reachable from preexisting refs
via `--not --all`, and pass all new refs which are to be checked via its
stdin. While this works well enough for repositories which do not have a
lot of references, it's clear that `--not --all` will linearly scale
with the number of refs which do exist: for each reference, we'll walk
through its commit as well as its five parent commits (defined via
`SLOP`). Given that many major hosting sites which use a pull/merge
request workflow keep refs to the request's HEAD, this effectively means
that `--not --all` will do a nearly complete graph walk.

This commit implements an alternate strategy if the target repository
has bitmaps. Objects referenced by a bitmap are by definition always
fully connected, so they form a natural boundary between old revisions
and new revisions. With this alternate strategy, we walk all given new
object IDs. Whenever we hit any object which is covered by the bitmap,
we stop the walk.

The logic only kicks in in case we have a bitmap in the repository. If
not, we wouldn't be able to efficiently abort the walk because we cannot
easily tell whether an object already exists in the target repository
and, if it does, whether it's fully connected. As a result, we'd have to
perform a always do graph walk in this case. Naturally, we could do the
same thing the previous git-rev-list(1) implementation did in that case
and just use preexisting branch tips as boundaries. But for now, we just
keep the old implementation for simplicity's sake given that performance
characteristics are likely not significantly different.

An easier solution may have been to simply add `--use-bitmap-index` to
the git-rev-list(1) invocation, but benchmarks have shown that this is
not effective: performance characteristics remain the same except for
some cases where the bitmap walks performs much worse compared to the
non-bitmap walk

The following benchmarks have been performed in linux.git:

Test                                                  origin/master           HEAD
---------------------------------------------------------------------------------------------------------
5400.4: empty receive-pack updated:new                176.02(387.28+175.12)   176.86(388.75+175.51) +0.5%
5400.7: clone receive-pack updated:new                0.10(0.09+0.01)         0.08(0.06+0.03) -20.0%
5400.9: clone receive-pack updated:main               0.09(0.08+0.01)         0.09(0.06+0.03) +0.0%
5400.11: clone receive-pack main~10:main              0.14(0.11+0.03)         0.13(0.11+0.02) -7.1%
5400.13: clone receive-pack :main                     0.01(0.01+0.00)         0.02(0.01+0.00) +100.0%
5400.16: clone_bitmap receive-pack updated:new        0.10(0.09+0.01)         0.28(0.13+0.15) +180.0%
5400.18: clone_bitmap receive-pack updated:main       0.10(0.08+0.02)         0.28(0.12+0.16) +180.0%
5400.20: clone_bitmap receive-pack main~10:main       0.13(0.11+0.02)         0.26(0.12+0.14) +100.0%
5400.22: clone_bitmap receive-pack :main              0.01(0.01+0.00)         0.01(0.01+0.00) +0.0%
5400.25: extrarefs receive-pack updated:new           32.14(20.76+11.59)      32.35(20.52+12.03) +0.7%
5400.27: extrarefs receive-pack updated:main          32.08(20.54+11.75)      32.10(20.78+11.53) +0.1%
5400.29: extrarefs receive-pack main~10:main          32.14(20.66+11.68)      32.27(20.65+11.83) +0.4%
5400.31: extrarefs receive-pack :main                 7.09(3.56+3.53)         7.10(3.70+3.40) +0.1%
5400.34: extrarefs_bitmap receive-pack updated:new    32.41(20.59+12.02)      7.36(3.76+3.60) -77.3%
5400.36: extrarefs_bitmap receive-pack updated:main   32.26(20.53+11.95)      7.34(3.56+3.78) -77.2%
5400.38: extrarefs_bitmap receive-pack main~10:main   32.44(20.77+11.90)      7.40(3.70+3.70) -77.2%
5400.40: extrarefs_bitmap receive-pack :main          7.09(3.62+3.46)         7.17(3.79+3.38) +1.1%

As expected, performance doesn't change in cases where we do not have a
bitmap available given that the old code path still kicks in. In case we
do have bitmaps, this is kind of a mixed bag: while git-receive-pack(1)
is slower in a "normal" clone of linux.git, it is significantly faster
for a clone with lots of references. The slowness can potentially be
explained by the overhead of loading the bitmap. On the other hand, the
new code is faster as expected in repos which have lots of references
given that we do not have to mark all negative references anymore.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 connected.c   | 136 ++++++++++++++++++++++++++++++++++++++++++++++++++
 pack-bitmap.c |   4 +-
 pack-bitmap.h |   2 +
 3 files changed, 140 insertions(+), 2 deletions(-)

diff --git a/connected.c b/connected.c
index b18299fdf0..474275a52f 100644
--- a/connected.c
+++ b/connected.c
@@ -6,6 +6,127 @@
 #include "transport.h"
 #include "packfile.h"
 #include "promisor-remote.h"
+#include "object.h"
+#include "tree-walk.h"
+#include "commit.h"
+#include "tag.h"
+#include "progress.h"
+#include "oidset.h"
+#include "pack-bitmap.h"
+
+#define QUEUED (1u<<0)
+
+static int queue_object(struct repository *repo,
+			struct bitmap_index *bitmap,
+			struct object_array *pending,
+			const struct object_id *oid)
+{
+	struct object *object;
+
+	/*
+	 * If the object ID is part of the bitmap, then we know that it must by
+	 * definition be reachable in the target repository and be fully
+	 * connected. We can thus skip checking the objects' referenced
+	 * objects.
+	 */
+	if (bitmap_position(bitmap, oid) >= 0)
+		return 0;
+
+	/* Otherwise the object is queued up for a connectivity check. */
+	object = parse_object(repo, oid);
+	if (!object) {
+		/* Promisor objects do not need to be traversed. */
+		if (is_promisor_object(oid))
+			return 0;
+		return -1;
+	}
+
+	/*
+	 * If the object has been queued before already, then we don't need to
+	 * do so again.
+	 */
+	if (object->flags & QUEUED)
+		return 0;
+	object->flags |= QUEUED;
+
+	/*
+	 * We do not need to queue up blobs given that they don't reference any
+	 * other objects anyway.
+	 */
+	if (object->type == OBJ_BLOB)
+		return 0;
+
+	add_object_array(object, NULL, pending);
+
+	return 0;
+}
+
+static int check_object(
+	struct repository *repo,
+	struct bitmap_index *bitmap,
+	struct object_array *pending,
+	const struct object *object)
+{
+	int ret = 0;
+
+	if (object->type == OBJ_TREE) {
+		struct tree *tree = (struct tree *)object;
+		struct tree_desc tree_desc;
+		struct name_entry entry;
+
+		if (init_tree_desc_gently(&tree_desc, tree->buffer, tree->size))
+			return error(_("cannot parse tree entry"));
+		while (tree_entry_gently(&tree_desc, &entry))
+			ret |= queue_object(repo, bitmap, pending, &entry.oid);
+
+		free_tree_buffer(tree);
+	} else if (object->type == OBJ_COMMIT) {
+		struct commit *commit = (struct commit *) object;
+		struct commit_list *parents;
+
+		for (parents = commit->parents; parents; parents = parents->next)
+			ret |= queue_object(repo, bitmap, pending, &parents->item->object.oid);
+
+		free_commit_buffer(repo->parsed_objects, commit);
+	} else if (object->type == OBJ_TAG) {
+		ret |= queue_object(repo, bitmap, pending, get_tagged_oid((struct tag *) object));
+	} else {
+		return error(_("unexpected object type"));
+	}
+
+	return ret;
+}
+
+/*
+ * If the target repository has a bitmap, then we treat all objects reachable
+ * via the bitmap as fully connected. Bitmapped objects thus serve as the
+ * boundary between old and new objects.
+ */
+static int check_connected_with_bitmap(struct repository *repo,
+				       struct bitmap_index *bitmap,
+				       oid_iterate_fn fn, void *cb_data,
+				       struct check_connected_options *opt)
+{
+	struct object_array pending = OBJECT_ARRAY_INIT;
+	struct progress *progress = NULL;
+	size_t checked_objects = 0;
+	struct object_id oid;
+	int ret = 0;
+
+	if (opt->progress)
+		progress = start_delayed_progress("Checking connectivity", 0);
+
+	while (!fn(cb_data, &oid))
+		ret |= queue_object(repo, bitmap, &pending, &oid);
+	while (pending.nr) {
+		ret |= check_object(repo, bitmap, &pending, object_array_pop(&pending));
+		display_progress(progress, ++checked_objects);
+	}
+
+	stop_progress(&progress);
+	object_array_clear(&pending);
+	return ret;
+}
 
 /*
  * If we feed all the commits we want to verify to this command
@@ -28,12 +149,27 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 	int err = 0;
 	struct packed_git *new_pack = NULL;
 	struct transport *transport;
+	struct bitmap_index *bitmap;
 	size_t base_len;
 
 	if (!opt)
 		opt = &defaults;
 	transport = opt->transport;
 
+	bitmap = prepare_bitmap_git(the_repository);
+	if (bitmap) {
+		/*
+		 * If we have a bitmap, then we can reuse the bitmap as
+		 * boundary between old and new objects.
+		 */
+		err = check_connected_with_bitmap(the_repository, bitmap,
+						  fn, cb_data, opt);
+		if (opt->err_fd)
+			close(opt->err_fd);
+		free_bitmap_index(bitmap);
+		return err;
+	}
+
 	if (fn(cb_data, &oid)) {
 		if (opt->err_fd)
 			close(opt->err_fd);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index d90e1d9d8c..d88a882ee1 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -418,8 +418,8 @@ static inline int bitmap_position_packfile(struct bitmap_index *bitmap_git,
 	return pos;
 }
 
-static int bitmap_position(struct bitmap_index *bitmap_git,
-			   const struct object_id *oid)
+int bitmap_position(struct bitmap_index *bitmap_git,
+		    const struct object_id *oid)
 {
 	int pos = bitmap_position_packfile(bitmap_git, oid);
 	return (pos >= 0) ? pos : bitmap_position_extended(bitmap_git, oid);
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 99d733eb26..7b4b386107 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -63,6 +63,8 @@ int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping
 void free_bitmap_index(struct bitmap_index *);
 int bitmap_walk_contains(struct bitmap_index *,
 			 struct bitmap *bitmap, const struct object_id *oid);
+int bitmap_position(struct bitmap_index *bitmap_git,
+		    const struct object_id *oid);
 
 /*
  * After a traversal has been performed by prepare_bitmap_walk(), this can be
-- 
2.32.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  parent reply	other threads:[~2021-06-28  5:33 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-28  5:33 [PATCH v2 0/3] Speed up connectivity checks via bitmaps Patrick Steinhardt
2021-06-28  5:33 ` [PATCH v2 1/3] p5400: add perf tests for git-receive-pack(1) Patrick Steinhardt
2021-06-28  7:49   ` Ævar Arnfjörð Bjarmason
2021-06-29  6:18     ` Patrick Steinhardt
2021-06-29 12:09       ` Ævar Arnfjörð Bjarmason
2021-06-28  5:33 ` [PATCH v2 2/3] receive-pack: skip connectivity checks on delete-only commands Patrick Steinhardt
2021-06-28  8:00   ` Ævar Arnfjörð Bjarmason
2021-06-28  8:06     ` Ævar Arnfjörð Bjarmason
2021-06-29  6:26     ` Patrick Steinhardt
2021-06-30  1:31   ` Jeff King
2021-06-30  1:35     ` Jeff King
2021-06-30 13:52     ` Patrick Steinhardt
2021-06-28  5:33 ` Patrick Steinhardt [this message]
2021-06-28 20:23   ` [PATCH v2 3/3] connected: implement connectivity check using bitmaps Taylor Blau
2021-06-29 22:44     ` Taylor Blau
2021-06-30  2:04       ` Jeff King
2021-06-30  3:07         ` Taylor Blau
2021-06-30  5:45           ` Jeff King
2021-07-02 17:44             ` Taylor Blau
2021-07-02 21:21               ` Jeff King
2021-06-30  1:51   ` Jeff King
2021-07-20 14:26     ` Patrick Steinhardt
2021-08-02  9:37 ` [PATCH v3 0/4] Speed up connectivity checks Patrick Steinhardt
2021-08-02  9:38   ` [PATCH v3 1/4] connected: do not sort input revisions Patrick Steinhardt
2021-08-02 12:49     ` Ævar Arnfjörð Bjarmason
2021-08-03  8:50       ` Patrick Steinhardt
2021-08-04 11:01         ` Ævar Arnfjörð Bjarmason
2021-08-02 19:00     ` Junio C Hamano
2021-08-03  8:55       ` Patrick Steinhardt
2021-08-03 21:47         ` Junio C Hamano
2021-08-02  9:38   ` [PATCH v3 2/4] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-02 12:53     ` Ævar Arnfjörð Bjarmason
2021-08-02  9:38   ` [PATCH v3 3/4] revision: avoid loading object headers multiple times Patrick Steinhardt
2021-08-02 12:55     ` Ævar Arnfjörð Bjarmason
2021-08-05 10:12       ` Patrick Steinhardt
2021-08-02 19:40     ` Junio C Hamano
2021-08-03  9:07       ` Patrick Steinhardt
2021-08-06 14:17         ` Patrick Steinhardt
2021-08-02  9:38   ` [PATCH v3 4/4] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt
2021-08-02 20:01     ` Junio C Hamano
2021-08-03  9:16       ` Patrick Steinhardt
2021-08-03 21:56         ` Junio C Hamano
2021-08-05 11:01           ` Patrick Steinhardt
2021-08-05 16:16             ` Junio C Hamano
2021-08-04 10:51         ` Ævar Arnfjörð Bjarmason
2021-08-05 11:25   ` [PATCH v4 0/6] Speed up connectivity checks Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 1/6] revision: separate walk and unsorted flags Patrick Steinhardt
2021-08-05 18:47       ` Junio C Hamano
2021-08-05 11:25     ` [PATCH v4 2/6] connected: do not sort input revisions Patrick Steinhardt
2021-08-05 18:44       ` Junio C Hamano
2021-08-06  6:00         ` Patrick Steinhardt
2021-08-06 16:50           ` Junio C Hamano
2021-08-05 11:25     ` [PATCH v4 3/6] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 4/6] revision: avoid loading object headers multiple times Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 5/6] commit-graph: split out function to search commit position Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 6/6] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt
2021-08-09  8:00 ` [PATCH v5 0/5] Speed up connectivity checks Patrick Steinhardt
2021-08-09  8:02   ` Patrick Steinhardt
2021-08-09  8:11 ` Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 1/5] revision: separate walk and unsorted flags Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 2/5] connected: do not sort input revisions Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 3/5] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 4/5] commit-graph: split out function to search commit position Patrick Steinhardt
2021-08-09  8:12   ` [PATCH v5 5/5] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7687dedd4722c39b5ecef2c2165147c25d16b8d9.1624858240.git.ps@pks.im \
    --to=ps@pks.im \
    --cc=avarab@gmail.com \
    --cc=chris.torek@gmail.com \
    --cc=felipe.contreras@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).