git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff Hostetler <git@jeffhostetler.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, dstolee@microsoft.com,
	Junio C Hamano <gitster@pobox.com>,
	Jeff Hostetler <jeffhost@microsoft.com>
Subject: Re: [PATCH 2/3] midx: verify: group objects by packfile to speed up object verification
Date: Mon, 18 Mar 2019 17:39:44 -0400	[thread overview]
Message-ID: <9c4da72d-6066-8d05-f181-a93c3926705f@jeffhostetler.com> (raw)
In-Reply-To: <87pnqodvr9.fsf@evledraar.gmail.com>



On 3/18/2019 11:53 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:
> 
>> +static int compare_pair_pos_vs_id(const void *_a, const void *_b)
>> +{
>> +	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
>> +	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
>> +
>> +	if (a->pack_int_id < b->pack_int_id)
>> +		return -1;
>> +	if (a->pack_int_id > b->pack_int_id)
>> +		return 1;
>> +
>> +	return 0;
>> +}
> 
> Not a suggestion for a change, just a note that this sent me down the
> rabbit hole of looking at the different idioms we use for QSORT() in
> different places. Some use this form, some a ternary nest, and some the
> succinct subtraction idiom of e.g. (in this case):
> 
>      return b->pack_int_id - a->pack_int_id;

Yeah, I'm not sure which way is better or worse here.
An earlier draft of this function sorted by packfile id
and then by OID (thinking we might benefit from some
locality later when we do the verify), hence the independent
if statements.  But it didn't help, so I removed the other
lines.

On 43+M objects, your version is a hair faster, so I might
as well take it instead.

> 
>> +
>>   int verify_midx_file(const char *object_dir)
>>   {
>> -	uint32_t i;
>> +	struct pair_pos_vs_id *pairs = NULL;
>> +	uint32_t i, k;
>>   	struct progress *progress;
>>   	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
>>   	verify_midx_error = 0;
>> @@ -997,15 +1017,36 @@ int verify_midx_file(const char *object_dir)
>>   	}
>>
>>   	progress = start_progress(_("Verifying object offsets"), m->num_objects);
>> +
>> +	/*
>> +	 * Create an array mapping each object to its packfile id.  Sort it
>> +	 * to group the objects by packfile.  Use this permutation to visit
>> +	 * each of the objects and only require 1 packfile to be open at a
>> +	 * time.
>> +	 */
>> +	ALLOC_ARRAY(pairs, m->num_objects);
>>   	for (i = 0; i < m->num_objects; i++) {
>> +		pairs[i].pos = i;
>> +		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
>> +	}
>> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
>> +
>> +	for (k = 0; k < m->num_objects; k++) {
>> [...]
> 
> I have not tested this (or midx in general), but isn't this new QSORT()
> introducing the same sort of progress stalling that I fixed for
> commit-graph in 890226ccb57 ("commit-graph write: add itermediate
> progress", 2019-01-19)? I.e. something you can work around with a
> "display_progress(progress, 0)" before the QSORT().
> 

I wasn't tracking your commit-graph changes, but yes, I think it is.

Tinkering with how to display progress, I found a couple of problems.
On my 3599 packfile, 43M object example, QSORT() takes about 5 seconds.
But there's about 2 seconds of setup before the sort starts.  The final
verify loops takes about 17 seconds.

Here I put trace2 regions around the main loops and used the
GIT_TR2_PERF stream.

> | cmd_name     |     |           |           |            | multi-pack-index (multi-pack-index)
> | cmd_mode     |     |           |           |            | verify
> | data         | r0  |  0.031295 |  0.031295 | midx       | load/num_packs:3599
> | data         | r0  |  0.031330 |  0.031330 | midx       | load/num_objects:42704807
> | region_enter | r0  |  0.031352 |           | midx       | label:verify/prepare 
> | region_leave | r0  |  0.626547 |  0.595195 | midx       | label:verify/prepare 
> | region_enter | r0  |  0.626602 |           | midx       | label:verify/oid_order 
> | region_leave | r0  |  1.570195 |  0.943593 | midx       | label:verify/oid_order 
> | region_enter | r0  |  1.570253 |           | midx       | label:verify/sort_setup 
> | region_leave | r0  |  1.809723 |  0.239470 | midx       | label:verify/sort_setup 
> | region_enter | r0  |  1.809803 |           | midx       | label:verify/sort 
> | region_leave | r0  |  6.950595 |  5.140792 | midx       | label:verify/sort 
> | region_enter | r0  |  6.950651 |           | midx       | label:verify/offsets 
> | region_leave | r0  | 24.059736 | 17.109085 | midx       | label:verify/offsets 
> | exit         |     | 24.101434 |           |            | code:0

So just adding a delay progress block by itself around the sort doesn't
help much.  It just sits there for 7 seconds before the actual progress
starts.

If I add a non-delay progress block around the "verify/prepare",
"verify/oid_order" and the "verify/offsets" loops, we get a pretty good
experience.

There is the dead time while the sort() itself is running, but at least
there is isn't a 5+ second frozen at 0% message on screen.

I'll re-roll shortly.

Thanks,
Jeff

  reply	other threads:[~2019-03-18 21:39 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-18 14:29 [PATCH 0/3] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
2019-03-18 14:29 ` [PATCH 2/3] midx: verify: group objects by packfile to speed up object verification Jeff Hostetler via GitGitGadget
2019-03-18 15:53   ` Ævar Arnfjörð Bjarmason
2019-03-18 21:39     ` Jeff Hostetler [this message]
2019-03-18 22:02       ` Ævar Arnfjörð Bjarmason
2019-03-19  4:08         ` Junio C Hamano
2019-03-19 14:00         ` Jeff Hostetler
2019-03-19 14:06           ` Ævar Arnfjörð Bjarmason
2019-03-18 14:29 ` [PATCH 1/3] midx: verify: add midx packfiles to the packed_git list Jeff Hostetler via GitGitGadget
2019-03-18 14:29 ` [PATCH 3/3] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
2019-03-19 14:37 ` [PATCH v2 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
2019-03-19 14:37   ` [PATCH v2 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
2019-03-19 16:42     ` Eric Sunshine
2019-03-19 18:33       ` Jeff Hostetler
2019-03-19 18:46         ` Eric Sunshine
2019-03-19 14:37   ` [PATCH v2 2/4] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
2019-03-19 14:37   ` [PATCH v2 3/4] midx: verify: add midx packfiles to the packed_git list Jeff Hostetler via GitGitGadget
2019-03-19 19:53     ` Jeff Hostetler
2019-03-19 14:37   ` [PATCH v2 4/4] midx: verify: group objects by packfile to speed up object verification Jeff Hostetler via GitGitGadget
2019-03-21 19:36   ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
2019-03-21 19:36     ` [PATCH v3 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
2019-03-21 19:36     ` [PATCH v3 2/4] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
2019-03-21 19:36     ` [PATCH v3 3/4] midx: add progress indicators in multi-pack-index verify Jeff Hostetler via GitGitGadget
2019-03-21 19:36     ` [PATCH v3 4/4] midx: during verify group objects by packfile to speed verification Jeff Hostetler via GitGitGadget
2019-03-22  5:37     ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Junio C Hamano
2019-03-25 17:18       ` Jeff Hostetler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9c4da72d-6066-8d05-f181-a93c3926705f@jeffhostetler.com \
    --to=git@jeffhostetler.com \
    --cc=avarab@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jeffhost@microsoft.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).