From: Junio C Hamano <gitster@pobox.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org, sluongng@gmail.com
Subject: Re: [PATCH 2/2] pack-objects: prefetch objects to be packed
Date: Mon, 20 Jul 2020 18:00:50 -0700
Message-ID: <xmqqd04p8ywt.fsf@gitster.c.googlers.com>
In-Reply-To: <b87764b711621ea3c614dbc8f9e49a8598a25cb1.1595290841.git.jonathantanmy@google.com> (Jonathan Tan's message of "Mon, 20 Jul 2020 17:21:44 -0700")
Jonathan Tan <jonathantanmy@google.com> writes:
> When an object to be packed is noticed to be missing, prefetch all
> to-be-packed objects in one batch.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
Hmph, the resulting code flow feels somewhat iffy.  Perhaps I am
not reading the code correctly, but

 * There is a loop that scans from 0..to_pack.nr_objects and calls
   check_object() for each and every one of them;

 * The called check_object(), when it notices that a missing and
   promised (i.e. to be lazily fetched) object is in the to_pack
   array, asks prefetch_to_pack() to scan from that point to the end
   of that array and grab all of them that are missing.

It feels like the resulting code would be a lot easier to follow if,
instead of the way the new "loop" was added, a new loop were added
_before_ the loop that calls check_object() on all objects in the
to_pack array, as a pre-processing phase when there is a promisor
remote.  That is, after reverting all the changes this patch makes
to check_object(), add a new loop in get_object_details() that
looks more or less like so:

	QSORT(sorted_by_offset, to_pack.nr_objects, pack_offset_sort);

+	if (has_promisor_remote())
+		prefetch_to_pack(0);
+
	for (i = 0; i < to_pack.nr_objects; i++) {

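To make the suggested two-phase structure concrete, here is a minimal toy model in plain C. It is only a sketch: `toy_entry`, `toy_prefetch()`, `toy_check()`, `toy_get_details()` and the `batches_requested` counter are hypothetical stand-ins invented for illustration, not Git's types or functions, and "fetching" is simulated by flipping a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are Git's real types or functions. */
struct toy_entry {
	int id;
	bool present;	/* false: missing locally but promised by the remote */
};

static int batches_requested;	/* simulated round trips to the "remote" */

/* Stand-in for prefetch_to_pack(): collect what is missing, fetch it once. */
static void toy_prefetch(struct toy_entry *objs, size_t nr, size_t start)
{
	size_t i, nr_missing = 0;

	assert(start <= nr);
	for (i = start; i < nr; i++) {
		if (objs[i].present)
			continue;
		objs[i].present = true;	/* pretend the batch delivered it */
		nr_missing++;
	}
	if (nr_missing)
		batches_requested++;
}

/* Stand-in for check_object(): it only inspects, it never fetches. */
static int toy_check(const struct toy_entry *e)
{
	return e->present ? 0 : -1;
}

/* Pre-processing phase first, then the unchanged per-object loop. */
static int toy_get_details(struct toy_entry *objs, size_t nr, bool has_promisor)
{
	size_t i;
	int failures = 0;

	if (has_promisor)
		toy_prefetch(objs, nr, 0);
	for (i = 0; i < nr; i++)
		if (toy_check(&objs[i]) < 0)
			failures++;
	return failures;
}
```

In this model, calling `toy_get_details()` with a promisor on four entries of which two are missing reports no failures and increments `batches_requested` exactly once, while the per-object check stays free of any fetching logic.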
Was the patch done this way because scanning the entire array twice
is expensive?  The optimization makes sense to me if certain
conditions are met, like...

 - Most of the time there is no missing object due to promisor, even
   if has_promisor_remote() is true;

 - When there are missing objects due to promisor, pack_offset_sort
   will keep them near the end of the array; and

 - Given the oid, oid_object_info_extended() on it with
   OBJECT_INFO_FOR_PREFETCH is expensive.

Only when all these conditions are met would it avoid unnecessary
overhead, by delaying the point in the array where
prefetch_to_pack() starts scanning so that only a much later part of
the array is scanned.

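The trade-off above can be put in numbers with a small cost model. This is a sketch under stated assumptions, not Git code: `probes_upfront()` and `probes_lazy()` are hypothetical helpers that count only the presence probes made by the prefetch scan itself (the lookups that the per-object loop performs anyway are the same in both designs and are left out), and the lazy scan is assumed to be triggered at most once because it fetches everything from that point on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical cost model: how many "is this object present?" probes
 * does the prefetch scan make under each design?
 */

/* Upfront design: with a promisor remote, the scan covers the whole array. */
static size_t probes_upfront(size_t nr, bool has_promisor)
{
	return has_promisor ? nr : 0;
}

/*
 * Lazy design: the scan starts at the first missing entry and runs to
 * the end of the array; if nothing is missing, it never runs at all.
 */
static size_t probes_lazy(const bool *present, size_t nr)
{
	size_t i;

	assert(present != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		if (!present[i])
			return nr - i;
	return 0;
}
```

Under this model the lazy design saves the full scan when nothing is missing and most of it when the first missing object sorts near the end of the array, but it saves nothing when the first missing object sits at index 0. The saving only matters if each probe is costly, which is the third condition listed above.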
Thanks.
> There have been recent discussions about using QUICK whenever we use
> SKIP_FETCH_OBJECT. I don't think it fully applies here, since here we
> fully expect the object to be present in the non-partial-clone case.
> Having said that, I wouldn't be opposed to adding QUICK and then, if the
> object read fails and if the repo is not a partial clone, to retry the
> object load (before setting the type to -1).
> ---
> builtin/pack-objects.c | 36 ++++++++++++++++++++++++++++++++----
> t/t5300-pack-object.sh | 36 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 68 insertions(+), 4 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index e09d140eed..ecef5cda44 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -35,6 +35,7 @@
> #include "midx.h"
> #include "trace2.h"
> #include "shallow.h"
> +#include "promisor-remote.h"
>
> #define IN_PACK(obj) oe_in_pack(&to_pack, obj)
> #define SIZE(obj) oe_size(&to_pack, obj)
> @@ -1704,7 +1705,26 @@ static int can_reuse_delta(const struct object_id *base_oid,
> return 0;
> }
>
> -static void check_object(struct object_entry *entry)
> +static void prefetch_to_pack(uint32_t object_index_start) {
> +	struct oid_array to_fetch = OID_ARRAY_INIT;
> +	uint32_t i;
> +
> +	for (i = object_index_start; i < to_pack.nr_objects; i++) {
> +		struct object_entry *entry = to_pack.objects + i;
> +
> +		if (!oid_object_info_extended(the_repository,
> +					      &entry->idx.oid,
> +					      NULL,
> +					      OBJECT_INFO_FOR_PREFETCH))
> +			continue;
> +		oid_array_append(&to_fetch, &entry->idx.oid);
> +	}
> +	promisor_remote_get_direct(the_repository,
> +				   to_fetch.oid, to_fetch.nr);
> +	oid_array_clear(&to_fetch);
> +}
> +
> +static void check_object(struct object_entry *entry, uint32_t object_index)
> {
> unsigned long canonical_size;
> enum object_type type;
> @@ -1843,8 +1863,16 @@ static void check_object(struct object_entry *entry)
> }
>
>  	if (oid_object_info_extended(the_repository, &entry->idx.oid, &oi,
> -				     OBJECT_INFO_LOOKUP_REPLACE) < 0)
> -		type = -1;
> +				     OBJECT_INFO_SKIP_FETCH_OBJECT | OBJECT_INFO_LOOKUP_REPLACE) < 0) {
> +		if (has_promisor_remote()) {
> +			prefetch_to_pack(object_index);
> +			if (oid_object_info_extended(the_repository, &entry->idx.oid, &oi,
> +						     OBJECT_INFO_SKIP_FETCH_OBJECT | OBJECT_INFO_LOOKUP_REPLACE) < 0)
> +				type = -1;
> +		} else {
> +			type = -1;
> +		}
> +	}
> oe_set_type(entry, type);
> if (entry->type_valid) {
> SET_SIZE(entry, canonical_size);
> @@ -2065,7 +2093,7 @@ static void get_object_details(void)
>
>  	for (i = 0; i < to_pack.nr_objects; i++) {
>  		struct object_entry *entry = sorted_by_offset[i];
> -		check_object(entry);
> +		check_object(entry, i);
>  		if (entry->type_valid &&
>  		    oe_size_greater_than(&to_pack, entry, big_file_threshold))
>  			entry->no_try_delta = 1;
> diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
> index 746cdb626e..d553d0ca46 100755
> --- a/t/t5300-pack-object.sh
> +++ b/t/t5300-pack-object.sh
> @@ -497,4 +497,40 @@ test_expect_success 'make sure index-pack detects the SHA1 collision (large blob
> )
> '
>
> +test_expect_success 'prefetch objects' '
> +	rm -rf server client &&
> +
> +	git init server &&
> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
> +	test_config -C server uploadpack.allowfilter 1 &&
> +	test_config -C server protocol.version 2 &&
> +
> +	echo one >server/one &&
> +	git -C server add one &&
> +	git -C server commit -m one &&
> +	git -C server branch one_branch &&
> +
> +	echo two_a >server/two_a &&
> +	echo two_b >server/two_b &&
> +	git -C server add two_a two_b &&
> +	git -C server commit -m two &&
> +
> +	echo three >server/three &&
> +	git -C server add three &&
> +	git -C server commit -m three &&
> +	git -C server branch three_branch &&
> +
> +	# Clone, fetch "two" with blobs excluded, and re-push it. This requires
> +	# the client to have the blobs of "two" - verify that these are
> +	# prefetched in one batch.
> +	git clone --filter=blob:none --single-branch -b one_branch \
> +		"file://$(pwd)/server" client &&
> +	test_config -C client protocol.version 2 &&
> +	TWO=$(git -C server rev-parse three_branch^) &&
> +	git -C client fetch --filter=blob:none origin "$TWO" &&
> +	GIT_TRACE_PACKET=$(pwd)/trace git -C client push origin "$TWO":refs/heads/two_branch &&
> +	grep "git> done" trace >donelines &&
> +	test_line_count = 1 donelines
> +'
> +
> test_done
Thread overview: 13+ messages
2020-07-21 0:21 [PATCH 0/2] Prefetch objects in pack-objects Jonathan Tan
2020-07-21 0:21 ` [PATCH 1/2] pack-objects: refactor to oid_object_info_extended Jonathan Tan
2020-07-21 0:21 ` [PATCH 2/2] pack-objects: prefetch objects to be packed Jonathan Tan
2020-07-21 1:00 ` Junio C Hamano [this message]
2020-07-21 16:37 ` Jonathan Tan
2020-07-21 19:23 ` Junio C Hamano
2020-07-21 21:27 ` Junio C Hamano
2020-07-21 23:37 ` Jonathan Tan
2020-07-21 23:56 ` Junio C Hamano
2020-07-21 23:20 ` Jonathan Tan
2020-07-21 23:51 ` Junio C Hamano
2020-07-22 21:30 ` Jonathan Tan
2020-07-22 21:45 ` Junio C Hamano