git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Patrick Steinhardt <ps@pks.im>
To: git@vger.kernel.org
Cc: "Jeff King" <peff@peff.net>,
	"Felipe Contreras" <felipe.contreras@gmail.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>,
	"Chris Torek" <chris.torek@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Taylor Blau" <me@ttaylorr.com>
Subject: [PATCH v5 0/5] Speed up connectivity checks
Date: Mon, 9 Aug 2021 10:11:42 +0200	[thread overview]
Message-ID: <cover.1628496538.git.ps@pks.im> (raw)
In-Reply-To: <cover.1624858240.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 15551 bytes --]

Hi,

this is the fifth version of my series to speed up connectivity checks
in the context of repos with many refs. While the original goal has been
to speed up connectivity checks only, the series is now optimizing
git-rev-list(1) in general to be able to more efficiently load
references. Like this, `--not --all` is a lot faster in the context of
many refs, but other usecases benefit, too.

Changes compared to v4:

    - I've changed the interface to load commits via the commit-graph.
      Instead of the previous version where you'd had to pass in a
      `struct object`, which forced us to use `lookup_unknown_object()`,
      one now only passes in an object ID. If the object ID is found in
      the commit graph and if the corresponding object exists in the
      ODB, then we return the parsed commit object.

      This also avoids a previous pitfal: we'd have parsed the commit
      via the graph and thus had allocated the object even if the
      corresponding object didn't exist. While we knew to handle this in
      `get_reference()` by asserting object existence, any other caller
      which executes `lookup_commit()` would get the parsed commit and
      assume that it exists. This now cannot happen anymore given that
      we only create the commit object in case we know the ID exists in
      the ODB.

    - With this change, I could now drop the patch which avoided loading
      objects multiple times: we don't need `lookup_unknown_object()`
      anymore and thus don't thave the memory/perf tradeoff. And with
      the new interface to load commits via the graph, the deduplication
      only resulted in a ~1% speedup.

    - `--unsorted-input` and `--no-walk` are now mutually exclusive
      regardless of whether `--no-walk` is sorted or unsorted,
      simplifying the logic.

Patrick

Patrick Steinhardt (5):
  revision: separate walk and unsorted flags
  connected: do not sort input revisions
  revision: stop retrieving reference twice
  commit-graph: split out function to search commit position
  revision: avoid hitting packfiles when commits are in commit-graph

 Documentation/rev-list-options.txt |  8 ++-
 builtin/log.c                      |  2 +-
 builtin/revert.c                   |  3 +-
 commit-graph.c                     | 79 ++++++++++++++++++++----------
 commit-graph.h                     |  8 +++
 connected.c                        |  1 +
 revision.c                         | 38 ++++++++------
 revision.h                         |  7 +--
 t/t6000-rev-list-misc.sh           | 31 ++++++++++++
 9 files changed, 129 insertions(+), 48 deletions(-)

Range-diff against v4:
1:  67232910ac = 1:  67232910ac revision: separate walk and unsorted flags
2:  9d7f484907 ! 2:  d3f498a563 connected: do not sort input revisions
    @@ revision.c: static int handle_revision_opt(struct rev_info *revs, int argc, cons
      		revs->sort_order = REV_SORT_BY_AUTHOR_DATE;
      		revs->topo_order = 1;
     +	} else if (!strcmp(arg, "--unsorted-input")) {
    -+		if (revs->no_walk && !revs->unsorted_input)
    -+			die(_("--unsorted-input is incompatible with --no-walk and --no-walk=sorted"));
    ++		if (revs->no_walk)
    ++			die(_("--unsorted-input is incompatible with --no-walk"));
     +		revs->unsorted_input = 1;
      	} else if (!strcmp(arg, "--early-output")) {
      		revs->early_output = 100;
    @@ revision.c: static int handle_revision_pseudo_opt(const char *submodule,
      	} else if (!strcmp(arg, "--not")) {
      		*flags ^= UNINTERESTING | BOTTOM;
      	} else if (!strcmp(arg, "--no-walk")) {
    -+		if (revs->unsorted_input)
    -+			die(_("--no-walk is incompatible with --no-walk=unsorted and --unsorted-input"));
    ++		if (!revs->no_walk && revs->unsorted_input)
    ++			die(_("--no-walk is incompatible with --unsorted-input"));
      		revs->no_walk = 1;
      	} else if (skip_prefix(arg, "--no-walk=", &optarg)) {
    ++		if (!revs->no_walk && revs->unsorted_input)
    ++			die(_("--no-walk is incompatible with --unsorted-input"));
    ++
      		/*
    -@@ revision.c: static int handle_revision_pseudo_opt(const char *submodule,
    + 		 * Detached form ("--no-walk X" as opposed to "--no-walk=X")
      		 * not allowed, since the argument is optional.
    - 		 */
    - 		revs->no_walk = 1;
    --		if (!strcmp(optarg, "sorted"))
    -+		if (!strcmp(optarg, "sorted")) {
    -+			if (revs->unsorted_input)
    -+				die(_("--no-walk=sorted is incompatible with --no-walk=unsorted "
    -+				    "and --unsorted-input"));
    - 			revs->unsorted_input = 0;
    --		else if (!strcmp(optarg, "unsorted"))
    -+		} else if (!strcmp(optarg, "unsorted"))
    - 			revs->unsorted_input = 1;
    - 		else
    - 			return error("invalid argument to --no-walk");
     
      ## t/t6000-rev-list-misc.sh ##
     @@ t/t6000-rev-list-misc.sh: test_expect_success 'rev-list --count --objects' '
    @@ t/t6000-rev-list-misc.sh: test_expect_success 'rev-list --count --objects' '
     +	test_cmp first.sorted second.sorted
     +'
     +
    -+test_expect_success 'rev-list --unsorted-input compatible with --no-walk=unsorted' '
    -+	git rev-list --unsorted-input --no-walk=unsorted HEAD HEAD~ >actual &&
    -+	git rev-parse HEAD >expect &&
    -+	git rev-parse HEAD~ >>expect &&
    -+	test_cmp expect actual
    -+'
    -+
    -+test_expect_success 'rev-list --unsorted-input incompatible with --no-walk=sorted' '
    ++test_expect_success 'rev-list --unsorted-input incompatible with --no-walk' '
     +	cat >expect <<-EOF &&
    -+		fatal: --no-walk is incompatible with --no-walk=unsorted and --unsorted-input
    ++		fatal: --no-walk is incompatible with --unsorted-input
     +	EOF
     +	test_must_fail git rev-list --unsorted-input --no-walk HEAD 2>error &&
     +	test_cmp expect error &&
    -+
    -+	cat >expect <<-EOF &&
    -+		fatal: --no-walk=sorted is incompatible with --no-walk=unsorted and --unsorted-input
    -+	EOF
     +	test_must_fail git rev-list --unsorted-input --no-walk=sorted HEAD 2>error &&
     +	test_cmp expect error &&
    ++	test_must_fail git rev-list --unsorted-input --no-walk=unsorted HEAD 2>error &&
    ++	test_cmp expect error &&
     +
     +	cat >expect <<-EOF &&
    -+		fatal: --unsorted-input is incompatible with --no-walk and --no-walk=sorted
    ++		fatal: --unsorted-input is incompatible with --no-walk
     +	EOF
     +	test_must_fail git rev-list --no-walk --unsorted-input HEAD 2>error &&
     +	test_cmp expect error &&
     +	test_must_fail git rev-list --no-walk=sorted --unsorted-input HEAD 2>error &&
    ++	test_cmp expect error &&
    ++	test_must_fail git rev-list --no-walk=unsorted --unsorted-input HEAD 2>error &&
     +	test_cmp expect error
     +'
     +
3:  d8e63d0943 = 3:  c9a630927b revision: stop retrieving reference twice
4:  ba8df5cad0 < -:  ---------- revision: avoid loading object headers multiple times
5:  e33cd51ebf = 4:  bc89325fdf commit-graph: split out function to search commit position
6:  900c5a9c60 ! 5:  fdb1fa9d57 revision: avoid hitting packfiles when commits are in commit-graph
    @@ Metadata
      ## Commit message ##
         revision: avoid hitting packfiles when commits are in commit-graph
     
    -    When queueing references in git-rev-list(1), we try to either reuse an
    -    already parsed object or alternatively we load the object header from
    -    disk in order to determine its type. This is inefficient for commits
    -    though in cases where we have a commit graph available: instead of
    -    hitting the real object on disk to determine its type, we may instead
    -    search the object graph for the object ID. In case it's found, we can
    -    directly fill in the commit object, otherwise we can still hit the disk
    -    to determine the object's type.
    +    When queueing references in git-rev-list(1), we try to optimize parsing
    +    of commits via the commit-graph. To do so, we first look up the object's
    +    type, and if it is a commit we call `repo_parse_commit()` instead of
    +    `parse_object()`. This is quite inefficient though given that we're
    +    always uncompressing the object header in order to determine the type.
    +    Instead, we can opportunistically search the commit-graph for the object
    +    ID: in case it's found, we know it's a commit and can directly fill in
    +    the commit object without having to uncompress the object header.
     
    -    Expose a new function `parse_commit_in_graph_gently()`, which fills in
    -    an object of unknown type in case we find its object ID in the graph.
    -    This provides a big performance win in cases where there is a
    -    commit-graph available in the repository in case we load lots of
    -    references. The following has been executed in a real-world repository
    -    with about 2.2 million refs:
    +    Expose a new function `lookup_commit_in_graph()`, which tries to find a
    +    commit in the commit-graph by ID, and convert `get_reference()` to use
    +    this function. This provides a big performance win in cases where we
    +    load references in a repository with lots of references pointing to
    +    commits. The following has been executed in a real-world repository with
    +    about 2.2 million refs:
     
             Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
    -          Time (mean ± σ):      4.508 s ±  0.039 s    [User: 4.131 s, System: 0.377 s]
    -          Range (min … max):    4.455 s …  4.576 s    10 runs
    +          Time (mean ± σ):      4.458 s ±  0.044 s    [User: 4.115 s, System: 0.342 s]
    +          Range (min … max):    4.409 s …  4.534 s    10 runs
     
             Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
    -          Time (mean ± σ):      3.072 s ±  0.031 s    [User: 2.707 s, System: 0.365 s]
    -          Range (min … max):    3.040 s …  3.144 s    10 runs
    +          Time (mean ± σ):      3.089 s ±  0.015 s    [User: 2.768 s, System: 0.321 s]
    +          Range (min … max):    3.061 s …  3.105 s    10 runs
     
             Summary
               'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran
    -            1.47 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'
    +            1.44 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'
     
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
    @@ commit-graph.c: static int find_commit_pos_in_graph(struct commit *item, struct
      	}
      }
      
    -+int parse_commit_in_graph_gently(struct repository *repo, struct object *object)
    ++struct commit *lookup_commit_in_graph(struct repository *repo, const struct object_id *id)
     +{
     +	struct commit *commit;
     +	uint32_t pos;
     +
    -+	if (object->parsed) {
    -+		if (object->type != OBJ_COMMIT)
    -+			return -1;
    -+		return 0;
    -+	}
    -+
     +	if (!repo->objects->commit_graph)
    -+		return -1;
    ++		return NULL;
    ++	if (!search_commit_pos_in_graph(id, repo->objects->commit_graph, &pos))
    ++		return NULL;
    ++	if (!repo_has_object_file(repo, id))
    ++		return NULL;
     +
    -+	if (!search_commit_pos_in_graph(&object->oid, repo->objects->commit_graph, &pos))
    -+		return -1;
    -+
    -+	commit = object_as_type(object, OBJ_COMMIT, 1);
    ++	commit = lookup_commit(repo, id);
     +	if (!commit)
    -+		return -1;
    -+	if (!fill_commit_in_graph(repo, commit, repo->objects->commit_graph, pos))
    -+		return -1;
    ++		return NULL;
    ++	if (commit->object.parsed)
    ++		return commit;
     +
    -+	return 0;
    ++	if (!fill_commit_in_graph(repo, commit, repo->objects->commit_graph, pos))
    ++		return NULL;
    ++
    ++	return commit;
     +}
     +
      static int parse_commit_in_graph_one(struct repository *r,
    @@ commit-graph.h: int open_commit_graph(const char *graph_file, int *fd, struct st
      int parse_commit_in_graph(struct repository *r, struct commit *item);
      
     +/*
    -+ * Given an object of unknown type, try to fill in the object in case it is a
    -+ * commit part of the commit-graph. Returns 0 if the object is a parsed commit
    -+ * or if it could be filled in via the commit graph, otherwise it returns -1.
    ++ * Look up the given commit ID in the commit-graph. This will only return a
    ++ * commit if the ID exists both in the graph and in the object database such
    ++ * that we don't return commits whose object has been pruned. Otherwise, this
    ++ * function returns `NULL`.
     + */
    -+int parse_commit_in_graph_gently(struct repository *repo, struct object *object);
    ++struct commit *lookup_commit_in_graph(struct repository *repo, const struct object_id *id);
     +
      /*
       * It is possible that we loaded commit contents from the commit buffer,
    @@ commit-graph.h: int open_commit_graph(const char *graph_file, int *fd, struct st
     
      ## revision.c ##
     @@ revision.c: static struct object *get_reference(struct rev_info *revs, const char *name,
    - 	struct object *object = lookup_unknown_object(revs->repo, oid);
    + 				    unsigned int flags)
    + {
    + 	struct object *object;
    ++	struct commit *commit;
      
    - 	if (object->type == OBJ_NONE) {
    --		int type = oid_object_info(revs->repo, oid, NULL);
    --		if (type < 0 || !object_as_type(object, type, 1)) {
    -+		/*
    -+		 * It's likely that the reference points to a commit, so we
    -+		 * first try to look it up via the commit-graph. If successful,
    -+		 * then we know it's a commit and don't have to unpack the
    -+		 * object header. We still need to assert that the object
    -+		 * exists, but given that we don't request any info about the
    -+		 * object this is a lot faster than `oid_object_info()`.
    -+		 */
    -+		if (parse_commit_in_graph_gently(revs->repo, object) < 0) {
    -+			int type = oid_object_info(revs->repo, oid, NULL);
    -+			if (type < 0 || !object_as_type(object, type, 1)) {
    -+				object = NULL;
    -+				goto out;
    -+			}
    -+		} else if (!repo_has_object_file(revs->repo, oid)) {
    - 			object = NULL;
    - 			goto out;
    - 		}
    + 	/*
    +-	 * If the repository has commit graphs, repo_parse_commit() avoids
    +-	 * reading the object buffer, so use it whenever possible.
    ++	 * If the repository has commit graphs, we try to opportunistically
    ++	 * look up the object ID in those graphs. Like this, we can avoid
    ++	 * parsing commit data from disk.
    + 	 */
    +-	if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
    +-		struct commit *c = lookup_commit(revs->repo, oid);
    +-		if (!repo_parse_commit(revs->repo, c))
    +-			object = (struct object *) c;
    +-		else
    +-			object = NULL;
    +-	} else {
    ++	commit = lookup_commit_in_graph(revs->repo, oid);
    ++	if (commit)
    ++		object = &commit->object;
    ++	else
    + 		object = parse_object(revs->repo, oid);
    +-	}
    + 
    + 	if (!object) {
    + 		if (revs->ignore_missing)
-- 
2.32.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  parent reply	other threads:[~2021-08-09  8:11 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-28  5:33 [PATCH v2 0/3] Speed up connectivity checks via bitmaps Patrick Steinhardt
2021-06-28  5:33 ` [PATCH v2 1/3] p5400: add perf tests for git-receive-pack(1) Patrick Steinhardt
2021-06-28  7:49   ` Ævar Arnfjörð Bjarmason
2021-06-29  6:18     ` Patrick Steinhardt
2021-06-29 12:09       ` Ævar Arnfjörð Bjarmason
2021-06-28  5:33 ` [PATCH v2 2/3] receive-pack: skip connectivity checks on delete-only commands Patrick Steinhardt
2021-06-28  8:00   ` Ævar Arnfjörð Bjarmason
2021-06-28  8:06     ` Ævar Arnfjörð Bjarmason
2021-06-29  6:26     ` Patrick Steinhardt
2021-06-30  1:31   ` Jeff King
2021-06-30  1:35     ` Jeff King
2021-06-30 13:52     ` Patrick Steinhardt
2021-06-28  5:33 ` [PATCH v2 3/3] connected: implement connectivity check using bitmaps Patrick Steinhardt
2021-06-28 20:23   ` Taylor Blau
2021-06-29 22:44     ` Taylor Blau
2021-06-30  2:04       ` Jeff King
2021-06-30  3:07         ` Taylor Blau
2021-06-30  5:45           ` Jeff King
2021-07-02 17:44             ` Taylor Blau
2021-07-02 21:21               ` Jeff King
2021-06-30  1:51   ` Jeff King
2021-07-20 14:26     ` Patrick Steinhardt
2021-08-02  9:37 ` [PATCH v3 0/4] Speed up connectivity checks Patrick Steinhardt
2021-08-02  9:38   ` [PATCH v3 1/4] connected: do not sort input revisions Patrick Steinhardt
2021-08-02 12:49     ` Ævar Arnfjörð Bjarmason
2021-08-03  8:50       ` Patrick Steinhardt
2021-08-04 11:01         ` Ævar Arnfjörð Bjarmason
2021-08-02 19:00     ` Junio C Hamano
2021-08-03  8:55       ` Patrick Steinhardt
2021-08-03 21:47         ` Junio C Hamano
2021-08-02  9:38   ` [PATCH v3 2/4] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-02 12:53     ` Ævar Arnfjörð Bjarmason
2021-08-02  9:38   ` [PATCH v3 3/4] revision: avoid loading object headers multiple times Patrick Steinhardt
2021-08-02 12:55     ` Ævar Arnfjörð Bjarmason
2021-08-05 10:12       ` Patrick Steinhardt
2021-08-02 19:40     ` Junio C Hamano
2021-08-03  9:07       ` Patrick Steinhardt
2021-08-06 14:17         ` Patrick Steinhardt
2021-08-02  9:38   ` [PATCH v3 4/4] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt
2021-08-02 20:01     ` Junio C Hamano
2021-08-03  9:16       ` Patrick Steinhardt
2021-08-03 21:56         ` Junio C Hamano
2021-08-05 11:01           ` Patrick Steinhardt
2021-08-05 16:16             ` Junio C Hamano
2021-08-04 10:51         ` Ævar Arnfjörð Bjarmason
2021-08-05 11:25   ` [PATCH v4 0/6] Speed up connectivity checks Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 1/6] revision: separate walk and unsorted flags Patrick Steinhardt
2021-08-05 18:47       ` Junio C Hamano
2021-08-05 11:25     ` [PATCH v4 2/6] connected: do not sort input revisions Patrick Steinhardt
2021-08-05 18:44       ` Junio C Hamano
2021-08-06  6:00         ` Patrick Steinhardt
2021-08-06 16:50           ` Junio C Hamano
2021-08-05 11:25     ` [PATCH v4 3/6] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 4/6] revision: avoid loading object headers multiple times Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 5/6] commit-graph: split out function to search commit position Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 6/6] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt
2021-08-09  8:00 ` [PATCH v5 0/5] Speed up connectivity checks Patrick Steinhardt
2021-08-09  8:02   ` Patrick Steinhardt
2021-08-09  8:11 ` Patrick Steinhardt [this message]
2021-08-09  8:11   ` [PATCH v5 1/5] revision: separate walk and unsorted flags Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 2/5] connected: do not sort input revisions Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 3/5] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 4/5] commit-graph: split out function to search commit position Patrick Steinhardt
2021-08-09  8:12   ` [PATCH v5 5/5] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cover.1628496538.git.ps@pks.im \
    --to=ps@pks.im \
    --cc=avarab@gmail.com \
    --cc=chris.torek@gmail.com \
    --cc=felipe.contreras@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).