git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Patrick Steinhardt <ps@pks.im>
To: git@vger.kernel.org
Cc: "Jeff King" <peff@peff.net>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Derrick Stolee" <stolee@gmail.com>,
	"René Scharfe" <l.s.r@web.de>
Subject: [PATCH v2 0/7] Speed up mirror-fetches with many refs
Date: Tue, 24 Aug 2021 12:36:51 +0200	[thread overview]
Message-ID: <cover.1629800774.git.ps@pks.im> (raw)
In-Reply-To: <cover.1629452412.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 13857 bytes --]

Hi,

this is the second version of my patch series to speed up mirror-fetches
with many refs. This topic applies on top of Junio's 9d5700f60b (Merge
branch 'ps/connectivity-optim' into jch, 2021-08-23).

Changes compared to v1:

    - Patch 1/7: I've applied Stolee's proposal to only
      opportunistically load objects via the commit-graph in case the
      reference is not in refs/tags/ such that we don't regress repos
      with many annotated tags.

    - Patch 3/7: The return parameter of the iterator is now const to
      allow further optimizations by the compiler, as suggested by
      René. I've also re-benchmarked this, and one can now see a very
      slight performance improvement of ~1%.

    - Patch 4/7: Added my missing DCO, as pointed out by Junio.

    - Patch 5, 6, 7: I've redone these to make it clearer that the
      refactoring I'm doing doesn't cause us to miss any object
      connectivity checks. Most importantly, I've merged `fetch_refs()`
      and `consume_refs()` into `fetch_and_consume_refs()` in 6/7, which
      makes the optimization where we elide the second connectivity
      check in 7/7 trivial.

Thanks for your feedback!

Patrick

Patrick Steinhardt (7):
  fetch: speed up lookup of want refs via commit-graph
  fetch: avoid unpacking headers in object existence check
  connected: refactor iterator to return next object ID directly
  fetch-pack: optimize loading of refs via commit graph
  fetch: refactor fetch refs to be more extendable
  fetch: merge fetching and consuming refs
  fetch: avoid second connectivity check if we already have all objects

 builtin/clone.c        |  8 ++---
 builtin/fetch.c        | 74 +++++++++++++++++++++++-------------------
 builtin/receive-pack.c | 17 ++++------
 connected.c            | 15 +++++----
 connected.h            |  2 +-
 fetch-pack.c           | 14 +++++---
 6 files changed, 68 insertions(+), 62 deletions(-)

Range-diff against v1:
1:  6872979c45 ! 1:  4a819a6830 fetch: speed up lookup of want refs via commit-graph
    @@ Commit message
         referenced objects.
     
         Speed this up by opportunistcally trying to resolve object IDs via the
    -    commit graph: more likely than not, they're going to be a commit anyway,
    -    and this lets us avoid having to unpack object headers completely in
    -    case the object is a commit that is part of the commit-graph. This
    -    significantly speeds up mirror-fetches in a real-world repository with
    +    commit graph. We only do so for any refs which are not in "refs/tags":
    +    more likely than not, these are going to be a commit anyway, and this
    +    lets us avoid having to unpack object headers completely in case the
    +    object is a commit that is part of the commit-graph. This significantly
    +    speeds up mirror-fetches in a real-world repository with
         2.3M refs:
     
             Benchmark #1: HEAD~: git-fetch
    -          Time (mean ± σ):     56.942 s ±  0.449 s    [User: 53.360 s, System: 5.356 s]
    -          Range (min … max):   56.372 s … 57.533 s    5 runs
    +          Time (mean ± σ):     56.482 s ±  0.384 s    [User: 53.340 s, System: 5.365 s]
    +          Range (min … max):   56.050 s … 57.045 s    5 runs
     
             Benchmark #2: HEAD: git-fetch
    -          Time (mean ± σ):     33.657 s ±  0.167 s    [User: 30.302 s, System: 5.181 s]
    -          Range (min … max):   33.454 s … 33.844 s    5 runs
    +          Time (mean ± σ):     33.727 s ±  0.170 s    [User: 30.252 s, System: 5.194 s]
    +          Range (min … max):   33.452 s … 33.871 s    5 runs
     
             Summary
               'HEAD: git-fetch' ran
    -            1.69 ± 0.02 times faster than 'HEAD~: git-fetch'
    +            1.67 ± 0.01 times faster than 'HEAD~: git-fetch'
     
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
      ## builtin/fetch.c ##
    +@@ builtin/fetch.c: static int store_updated_refs(const char *raw_url, const char *remote_name,
    + 			      int connectivity_checked, struct ref *ref_map)
    + {
    + 	struct fetch_head fetch_head;
    +-	struct commit *commit;
    + 	int url_len, i, rc = 0;
    + 	struct strbuf note = STRBUF_INIT, err = STRBUF_INIT;
    + 	struct ref_transaction *transaction = NULL;
    +@@ builtin/fetch.c: static int store_updated_refs(const char *raw_url, const char *remote_name,
    + 	     want_status <= FETCH_HEAD_IGNORE;
    + 	     want_status++) {
    + 		for (rm = ref_map; rm; rm = rm->next) {
    ++			struct commit *commit = NULL;
    + 			struct ref *ref = NULL;
    + 
    + 			if (rm->status == REF_STATUS_REJECT_SHALLOW) {
     @@ builtin/fetch.c: static int store_updated_refs(const char *raw_url, const char *remote_name,
      				continue;
      			}
    @@ builtin/fetch.c: static int store_updated_refs(const char *raw_url, const char *
     -								1);
     -			if (!commit)
     -				rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
    -+			commit = lookup_commit_in_graph(the_repository, &rm->old_oid);
    ++			/*
    ++			 * References in "refs/tags/" are often going to point
    ++			 * to annotated tags, which are not part of the
    ++			 * commit-graph. We thus only try to look up refs in
    ++			 * the graph which are not in that namespace to not
    ++			 * regress performance in repositories with many
    ++			 * annotated tags.
    ++			 */
    ++			if (!starts_with(rm->name, "refs/tags/"))
    ++				commit = lookup_commit_in_graph(the_repository, &rm->old_oid);
     +			if (!commit) {
     +				commit = lookup_commit_reference_gently(the_repository,
     +									&rm->old_oid,
2:  d3dac607f2 = 2:  81ebadabe8 fetch: avoid unpacking headers in object existence check
3:  3bdad7bc8b ! 3:  98e981ced9 connected: refactor iterator to return next object ID directly
    @@ Commit message
         The object ID iterator used by the connectivity checks returns the next
         object ID via an out-parameter and then uses a return code to indicate
         whether an item was found. This is a bit roundabout: instead of a
    -    separate error code, we can just retrun the next object ID directly and
    +    separate error code, we can just return the next object ID directly and
         use `NULL` pointers as indicator that the iterator got no items left.
         Furthermore, this avoids a copy of the object ID.
     
         Refactor the iterator and all its implementations to return object IDs
    -    directly. While I was honestly hoping for a small speedup given that we
    -    can now avoid a copy, both versions perform the same. Still, the end
    -    result is easier to understand and thus it makes sense to keep this
    -    refactoring regardless.
    +    directly. This brings a tiny performance improvement when doing a mirror-fetch of a repository with about 2.3M refs:
    +
    +        Benchmark #1: 328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch
    +          Time (mean ± σ):     30.110 s ±  0.148 s    [User: 27.161 s, System: 5.075 s]
    +          Range (min … max):   29.934 s … 30.406 s    10 runs
    +
    +        Benchmark #2: 328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch
    +          Time (mean ± σ):     29.899 s ±  0.109 s    [User: 26.916 s, System: 5.104 s]
    +          Range (min … max):   29.696 s … 29.996 s    10 runs
    +
    +        Summary
    +          '328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch' ran
    +            1.01 ± 0.01 times faster than '328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch'
    +
    +    While this 1% speedup could be labelled as statistically insignificant,
    +    the speedup is consistent on my machine. Furthermore, this is an end to
    +    end test, so it is expected that the improvement in the connectivity
    +    check itself is more significant.
     
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
    @@ builtin/clone.c: static void write_followtags(const struct ref *refs, const char
      }
      
     -static int iterate_ref_map(void *cb_data, struct object_id *oid)
    -+static struct object_id *iterate_ref_map(void *cb_data)
    ++static const struct object_id *iterate_ref_map(void *cb_data)
      {
      	struct ref **rm = cb_data;
      	struct ref *ref = *rm;
    @@ builtin/fetch.c: static int update_local_ref(struct ref *ref,
      }
      
     -static int iterate_ref_map(void *cb_data, struct object_id *oid)
    -+static struct object_id *iterate_ref_map(void *cb_data)
    ++static const struct object_id *iterate_ref_map(void *cb_data)
      {
      	struct ref **rm = cb_data;
      	struct ref *ref = *rm;
    @@ builtin/receive-pack.c: static void refuse_unconfigured_deny_delete_current(void
      }
      
     -static int command_singleton_iterator(void *cb_data, struct object_id *oid);
    -+static struct object_id *command_singleton_iterator(void *cb_data);
    ++static const struct object_id *command_singleton_iterator(void *cb_data);
      static int update_shallow_ref(struct command *cmd, struct shallow_info *si)
      {
      	struct shallow_lock shallow_lock = SHALLOW_LOCK_INIT;
    @@ builtin/receive-pack.c: static void check_aliased_updates(struct command *comman
      }
      
     -static int command_singleton_iterator(void *cb_data, struct object_id *oid)
    -+static struct object_id *command_singleton_iterator(void *cb_data)
    ++static const struct object_id *command_singleton_iterator(void *cb_data)
      {
      	struct command **cmd_list = cb_data;
      	struct command *cmd = *cmd_list;
    @@ builtin/receive-pack.c: struct iterate_data {
      };
      
     -static int iterate_receive_command_list(void *cb_data, struct object_id *oid)
    -+static struct object_id *iterate_receive_command_list(void *cb_data)
    ++static const struct object_id *iterate_receive_command_list(void *cb_data)
      {
      	struct iterate_data *data = cb_data;
      	struct command **cmd_list = &data->cmds;
    @@ connected.c: int check_connected(oid_iterate_fn fn, void *cb_data,
      	FILE *rev_list_in;
      	struct check_connected_options defaults = CHECK_CONNECTED_INIT;
     -	struct object_id oid;
    -+	struct object_id *oid;
    ++	const struct object_id *oid;
      	int err = 0;
      	struct packed_git *new_pack = NULL;
      	struct transport *transport;
    @@ connected.h: struct transport;
       * to signal EOF, otherwise return 0.
       */
     -typedef int (*oid_iterate_fn)(void *, struct object_id *oid);
    -+typedef struct object_id *(*oid_iterate_fn)(void *);
    ++typedef const struct object_id *(*oid_iterate_fn)(void *);
      
      /*
       * Named-arguments struct for check_connected. All arguments are
    @@ fetch-pack.c: static void update_shallow(struct fetch_pack_args *args,
      }
      
     -static int iterate_ref_map(void *cb_data, struct object_id *oid)
    -+static struct object_id *iterate_ref_map(void *cb_data)
    ++static const struct object_id *iterate_ref_map(void *cb_data)
      {
      	struct ref **rm = cb_data;
      	struct ref *ref = *rm;
4:  67917af7ce ! 4:  6311203f08 fetch-pack: optimize loading of refs via commit graph
    @@ Commit message
         In case this fails, we fall back to the old code which peels the
         objects to a commit.
     
    +    Signed-off-by: Patrick Steinhardt <ps@pks.im>
    +
      ## fetch-pack.c ##
     @@ fetch-pack.c: static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
      {
5:  7653f8eabc ! 5:  56a9158ac3 fetch: refactor fetch refs to be more extendable
    @@ builtin/fetch.c: static int check_exist_and_connected(struct ref *ref_map)
      static int fetch_refs(struct transport *transport, struct ref *ref_map)
      {
     -	int ret = check_exist_and_connected(ref_map);
    --	if (ret) {
    --		trace2_region_enter("fetch", "fetch_refs", the_repository);
    --		ret = transport_fetch_refs(transport, ref_map);
    --		trace2_region_leave("fetch", "fetch_refs", the_repository);
    --	}
     +	int ret;
     +
     +	/*
    @@ builtin/fetch.c: static int check_exist_and_connected(struct ref *ref_map)
     +	 * refs.
     +	 */
     +	ret = check_exist_and_connected(ref_map);
    - 	if (!ret)
    + 	if (ret) {
    + 		trace2_region_enter("fetch", "fetch_refs", the_repository);
    + 		ret = transport_fetch_refs(transport, ref_map);
    + 		trace2_region_leave("fetch", "fetch_refs", the_repository);
    ++		if (ret) {
    ++			transport_unlock_pack(transport);
    ++			return ret;
    ++		}
    + 	}
    +-	if (!ret)
     -		/*
     -		 * Keep the new pack's ".keep" file around to allow the caller
     -		 * time to update refs to reference the new objects.
     -		 */
    - 		return 0;
    +-		return 0;
     -	transport_unlock_pack(transport);
     -	return ret;
     +
    -+	trace2_region_enter("fetch", "fetch_refs", the_repository);
    -+	ret = transport_fetch_refs(transport, ref_map);
    -+	trace2_region_leave("fetch", "fetch_refs", the_repository);
    -+	if (ret) {
    -+		transport_unlock_pack(transport);
    -+		return ret;
    -+	}
    -+
     +	/*
     +	 * Keep the new pack's ".keep" file around to allow the caller
     +	 * time to update refs to reference the new objects.
6:  646ac90e62 < -:  ---------- fetch: avoid second connectivity check if we already have all objects
-:  ---------- > 6:  31d9f72edf fetch: merge fetching and consuming refs
-:  ---------- > 7:  84e39c847f fetch: avoid second connectivity check if we already have all objects
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  parent reply	other threads:[~2021-08-24 10:37 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-20 10:08 [PATCH 0/6] Speed up mirror-fetches with many refs Patrick Steinhardt
2021-08-20 10:08 ` [PATCH 1/6] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
2021-08-20 14:27   ` Derrick Stolee
2021-08-20 17:18     ` Junio C Hamano
2021-08-23  6:46       ` Patrick Steinhardt
2021-08-25 14:12         ` Derrick Stolee
2021-08-20 10:08 ` [PATCH 2/6] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
2021-08-25 23:44   ` Ævar Arnfjörð Bjarmason
2021-08-20 10:08 ` [PATCH 3/6] connected: refactor iterator to return next object ID directly Patrick Steinhardt
2021-08-20 14:32   ` Derrick Stolee
2021-08-20 17:43     ` Junio C Hamano
2021-08-20 17:43   ` René Scharfe
2021-08-23  6:47     ` Patrick Steinhardt
2021-08-20 10:08 ` [PATCH 4/6] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
2021-08-20 14:37   ` Derrick Stolee
2021-08-20 10:08 ` [PATCH 5/6] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
2021-08-20 14:41   ` Derrick Stolee
2021-08-20 10:08 ` [PATCH 6/6] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
2021-08-20 14:47   ` Derrick Stolee
2021-08-23  6:52     ` Patrick Steinhardt
2021-08-20 14:50 ` [PATCH 0/6] Speed up mirror-fetches with many refs Derrick Stolee
2021-08-21  0:09 ` Junio C Hamano
2021-08-24 10:36 ` Patrick Steinhardt [this message]
2021-08-24 10:36   ` [PATCH v2 1/7] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
2021-08-25 14:16     ` Derrick Stolee
2021-08-24 10:37   ` [PATCH v2 2/7] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 3/7] connected: refactor iterator to return next object ID directly Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 4/7] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 5/7] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
2021-08-25 14:19     ` Derrick Stolee
2021-09-01 12:48       ` Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 6/7] fetch: merge fetching and consuming refs Patrick Steinhardt
2021-08-25 14:26     ` Derrick Stolee
2021-09-01 12:49       ` Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 7/7] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
2021-08-24 22:48   ` [PATCH v2 0/7] Speed up mirror-fetches with many refs Junio C Hamano
2021-08-25  6:04     ` Patrick Steinhardt
2021-08-25 14:27   ` Derrick Stolee
2021-09-01 13:09 ` [PATCH v3 " Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 1/7] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 2/7] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 3/7] connected: refactor iterator to return next object ID directly Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 4/7] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 5/7] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
2021-09-01 13:10   ` [PATCH v3 6/7] fetch: merge fetching and consuming refs Patrick Steinhardt
2021-09-01 13:10   ` [PATCH v3 7/7] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
2021-09-01 19:58   ` [PATCH v3 0/7] Speed up mirror-fetches with many refs Junio C Hamano
2021-09-08  0:08     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cover.1629800774.git.ps@pks.im \
    --to=ps@pks.im \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).