git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jonathan Tan <jonathantanmy@google.com>
To: gitster@pobox.com
Cc: jonathantanmy@google.com, git@vger.kernel.org, sluongng@gmail.com
Subject: Re: [PATCH 2/2] pack-objects: prefetch objects to be packed
Date: Tue, 21 Jul 2020 09:37:36 -0700	[thread overview]
Message-ID: <20200721163736.69610-1-jonathantanmy@google.com> (raw)
In-Reply-To: <xmqqd04p8ywt.fsf@gitster.c.googlers.com>

> Hmph, the resulting codeflow structure feels somewhat iffy.  Perhaps
> I am not reading the code correctly, but
> 
>  * There is a loop that scans from 0..to_pack.nr_objects and calls
>    check_object() for each and every one of them;
> 
>  * The called check_object(), when it notices that a missing and
>    promised (i.e. to be lazily fetched) object is in the to_pack
>    array, asks prefetch_to_pack() to scan from that point to the end
>    of that array and grabs all of them that are missing.
> 
> It almost feels a lot cleaner to see what is going on in the
> resulting code, instead of the way the new "loop" was added, if a
> new loop is added _before_ the loop to call check_object() on all
> objects in to_pack array as a pre-processing phase when there is a
> promisor remote.  That is, after reverting all the change this patch
> makes to check_object(), add a new loop in get_object_details() that
> looks more or less like so:
> 
> 	QSORT(sorted_by_offset, to_pack.nr_objects, pack_offset_sort);
> 
> +	if (has_promisor_remote())
> +		prefetch_to_pack(0);
> +
> 	for (i = 0; i < to_pack.nr_objects; i++) {
> 
> 
> Was the patch done this way because scanning the entire array twice
> is expensive?

Yes. If we called prefetch_to_pack(0) first (as you suggest), this first
scan involves checking the existence of all objects in the array, so I
thought it would be expensive. (Checking the existence of an object
probably brings the corresponding pack index into disk cache on
platforms like Linux, so 2 object reads might not take much more time
than 1 object read, but I didn't want to rely on this when I could just
avoid the extra read.)

> The optimization makes sense to me if certain
> conditions are met, like...
> 
>  - Most of the time there is no missing object due to promisor, even
>    if has_promissor_to_remote() is true;

I think that optimizing for this condition makes sense - most pushes (I
would think) are pushes of objects we create locally, and thus no
objects are missing.

>  - When there are missing objects due to promisor, pack_offset_sort
>    will keep them near the end of the array; and
> 
>  - Given the oid, oid_object_info_extended() on it with
>    OBJECT_INFO_FOR_PREFETCH is expensive.

I see this as expensive since it involves checking of object existence.

> Only when all these conditions are met, it would avoid unnecessary
> overhead by scanning only a very later part of the array by delaying
> the point in the array where prefetch_to_pack() starts scanning.

Yes (and when there are no missing objects at all, there is no
double-scanning).

  reply	other threads:[~2020-07-21 16:37 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-21  0:21 [PATCH 0/2] Prefetch objects in pack-objects Jonathan Tan
2020-07-21  0:21 ` [PATCH 1/2] pack-objects: refactor to oid_object_info_extended Jonathan Tan
2020-07-21  0:21 ` [PATCH 2/2] pack-objects: prefetch objects to be packed Jonathan Tan
2020-07-21  1:00   ` Junio C Hamano
2020-07-21 16:37     ` Jonathan Tan [this message]
2020-07-21 19:23       ` Junio C Hamano
2020-07-21 21:27         ` Junio C Hamano
2020-07-21 23:37           ` Jonathan Tan
2020-07-21 23:56             ` Junio C Hamano
2020-07-21 23:20         ` Jonathan Tan
2020-07-21 23:51           ` Junio C Hamano
2020-07-22 21:30             ` Jonathan Tan
2020-07-22 21:45               ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200721163736.69610-1-jonathantanmy@google.com \
    --to=jonathantanmy@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=sluongng@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).