git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/3] multi-pack-index: fix verify on large repos
@ 2019-03-18 14:29 Jeff Hostetler via GitGitGadget
  2019-03-18 14:29 ` [PATCH 1/3] midx: verify: add midx packfiles to the packed_git list Jeff Hostetler via GitGitGadget
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-18 14:29 UTC (permalink / raw)
  To: git; +Cc: dstolee, Junio C Hamano

Teach "multi-pack-index verify" to handle cases where the number of
packfiles exceeds the open file handle limit.

The first commit fixes a problem that prevented the LRU-style
close_one_pack() mechanism from working which caused midx verify to run out
of file descriptors.

The second commit teaches midx verify to sort the set of objects to verify
by packfile rather than verifying them in OID order. This eliminates the
need to have more than one packfile/idx open at the same time.

With the second commit, runtime on 3600 packfiles went from 12 minutes to 25
seconds.

Thanks, Jeff

Jeff Hostetler (3):
  midx: verify: add midx packfiles to the packed_git list
  midx: verify: group objects by packfile to speed up object
    verification
  trace2:data: add trace2 data to midx

 builtin/multi-pack-index.c |  3 ++
 midx.c                     | 60 ++++++++++++++++++++++++++++++++++----
 packfile.c                 |  2 +-
 packfile.h                 |  2 ++
 4 files changed, 61 insertions(+), 6 deletions(-)


base-commit: e902e9bcae2010bc42648c80ab6adc6c5a16a4a5
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-166%2Fjeffhostetler%2Fupstream-midx-verify-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-166/jeffhostetler/upstream-midx-verify-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/166
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/3] midx: verify: add midx packfiles to the packed_git list
  2019-03-18 14:29 [PATCH 0/3] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
@ 2019-03-18 14:29 ` Jeff Hostetler via GitGitGadget
  2019-03-18 14:29 ` [PATCH 2/3] midx: verify: group objects by packfile to speed up object verification Jeff Hostetler via GitGitGadget
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-18 14:29 UTC (permalink / raw)
  To: git; +Cc: dstolee, Junio C Hamano, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Fix "git multi-pack-index verify" to handle repos with thousands
of packfiles.

Midx verify adds the individual "packed_git" structures to the
multi_pack_index.packs array, but it does not add them to the
"repository.objects.packed_git" list.  During the verification
code, each packfile is opened and scanned.  And "pack_open_fds"
is incremented.  If "pack_open_fds" equals the "pack_max_fds"
open_packed_git_1() calls close_one_pack() to LRU-style close
an already open packfile.  But because the packfiles were never
added to the "packed_git" list, close_one_pack() does nothing.
If there are very many packfiles, Git runs out of file descriptors
and fails.

Note that this was observed on Windows when build with GCC and
in a repository with more than (2048-25) packfiles.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 midx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/midx.c b/midx.c
index 8a505fd423..d2c39b6d37 100644
--- a/midx.c
+++ b/midx.c
@@ -971,6 +971,9 @@ int verify_midx_file(const char *object_dir)
 	for (i = 0; i < m->num_packs; i++) {
 		if (prepare_midx_pack(m, i))
 			midx_report("failed to load pack in position %d", i);
+
+		if (m->packs[i])
+			install_packed_git(the_repository, m->packs[i]);
 	}
 
 	for (i = 0; i < 255; i++) {
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/3] midx: verify: group objects by packfile to speed up object verification
  2019-03-18 14:29 [PATCH 0/3] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
  2019-03-18 14:29 ` [PATCH 1/3] midx: verify: add midx packfiles to the packed_git list Jeff Hostetler via GitGitGadget
@ 2019-03-18 14:29 ` Jeff Hostetler via GitGitGadget
  2019-03-18 15:53   ` Ævar Arnfjörð Bjarmason
  2019-03-18 14:29 ` [PATCH 3/3] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
  2019-03-19 14:37 ` [PATCH v2 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
  3 siblings, 1 reply; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-18 14:29 UTC (permalink / raw)
  To: git; +Cc: dstolee, Junio C Hamano, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Teach `multi-pack-index verify` to sort the set of objects by
packfile so that only one packfile needs to be open at a time.

This is a performance improvement.  Previously, objects were
verified in OID order.  This essentially requires all packfiles
to be open at the same time.  If the number of packfiles exceeds
the open file limit, packfiles would be LRU-closed and re-opened
many times.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 midx.c     | 53 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 packfile.c |  2 +-
 packfile.h |  2 ++
 3 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/midx.c b/midx.c
index d2c39b6d37..0376757390 100644
--- a/midx.c
+++ b/midx.c
@@ -958,9 +958,29 @@ static void midx_report(const char *fmt, ...)
 	va_end(ap);
 }
 
+struct pair_pos_vs_id
+{
+	uint32_t pos;
+	uint32_t pack_int_id;
+};
+
+static int compare_pair_pos_vs_id(const void *_a, const void *_b)
+{
+	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
+	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
+
+	if (a->pack_int_id < b->pack_int_id)
+		return -1;
+	if (a->pack_int_id > b->pack_int_id)
+		return 1;
+
+	return 0;
+}
+
 int verify_midx_file(const char *object_dir)
 {
-	uint32_t i;
+	struct pair_pos_vs_id *pairs = NULL;
+	uint32_t i, k;
 	struct progress *progress;
 	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
 	verify_midx_error = 0;
@@ -997,15 +1017,36 @@ int verify_midx_file(const char *object_dir)
 	}
 
 	progress = start_progress(_("Verifying object offsets"), m->num_objects);
+
+	/*
+	 * Create an array mapping each object to its packfile id.  Sort it
+	 * to group the objects by packfile.  Use this permutation to visit
+	 * each of the objects and only require 1 packfile to be open at a
+	 * time.
+	 */
+	ALLOC_ARRAY(pairs, m->num_objects);
 	for (i = 0; i < m->num_objects; i++) {
+		pairs[i].pos = i;
+		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
+	}
+	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
+
+	for (k = 0; k < m->num_objects; k++) {
 		struct object_id oid;
 		struct pack_entry e;
 		off_t m_offset, p_offset;
 
-		nth_midxed_object_oid(&oid, m, i);
+		if (k > 0 && pairs[k-1].pack_int_id != pairs[k].pack_int_id &&
+		    m->packs[pairs[k-1].pack_int_id])
+		{
+			close_pack_fd(m->packs[pairs[k-1].pack_int_id]);
+			close_pack_index(m->packs[pairs[k-1].pack_int_id]);
+		}
+
+		nth_midxed_object_oid(&oid, m, pairs[k].pos);
 		if (!fill_midx_entry(&oid, &e, m)) {
 			midx_report(_("failed to load pack entry for oid[%d] = %s"),
-				    i, oid_to_hex(&oid));
+				    pairs[k].pos, oid_to_hex(&oid));
 			continue;
 		}
 
@@ -1020,11 +1061,13 @@ int verify_midx_file(const char *object_dir)
 
 		if (m_offset != p_offset)
 			midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64),
-				    i, oid_to_hex(&oid), m_offset, p_offset);
+				    pairs[k].pos, oid_to_hex(&oid), m_offset, p_offset);
 
-		display_progress(progress, i + 1);
+		display_progress(progress, k + 1);
 	}
 	stop_progress(&progress);
 
+	free(pairs);
+
 	return verify_midx_error;
 }
diff --git a/packfile.c b/packfile.c
index 16bcb75262..d2bcb2f860 100644
--- a/packfile.c
+++ b/packfile.c
@@ -309,7 +309,7 @@ void close_pack_windows(struct packed_git *p)
 	}
 }
 
-static int close_pack_fd(struct packed_git *p)
+int close_pack_fd(struct packed_git *p)
 {
 	if (p->pack_fd < 0)
 		return 0;
diff --git a/packfile.h b/packfile.h
index d70c6d9afb..b1c18504eb 100644
--- a/packfile.h
+++ b/packfile.h
@@ -76,6 +76,8 @@ extern int open_pack_index(struct packed_git *);
  */
 extern void close_pack_index(struct packed_git *);
 
+int close_pack_fd(struct packed_git *p);
+
 extern uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
 
 extern unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/3] trace2:data: add trace2 data to midx
  2019-03-18 14:29 [PATCH 0/3] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
  2019-03-18 14:29 ` [PATCH 1/3] midx: verify: add midx packfiles to the packed_git list Jeff Hostetler via GitGitGadget
  2019-03-18 14:29 ` [PATCH 2/3] midx: verify: group objects by packfile to speed up object verification Jeff Hostetler via GitGitGadget
@ 2019-03-18 14:29 ` Jeff Hostetler via GitGitGadget
  2019-03-19 14:37 ` [PATCH v2 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
  3 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-18 14:29 UTC (permalink / raw)
  To: git; +Cc: dstolee, Junio C Hamano, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Log multi-pack-index command mode.
Log number of objects and packfiles in the midx.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 builtin/multi-pack-index.c | 3 +++
 midx.c                     | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index fca70f8e4f..ae6e476ad5 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -3,6 +3,7 @@
 #include "config.h"
 #include "parse-options.h"
 #include "midx.h"
+#include "trace2.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
 	N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
@@ -40,6 +41,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		return 1;
 	}
 
+	trace2_cmd_mode(argv[0]);
+
 	if (!strcmp(argv[0], "write"))
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
diff --git a/midx.c b/midx.c
index 0376757390..df1df20a66 100644
--- a/midx.c
+++ b/midx.c
@@ -8,6 +8,7 @@
 #include "sha1-lookup.h"
 #include "midx.h"
 #include "progress.h"
+#include "trace2.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -164,6 +165,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 			      m->pack_names[i]);
 	}
 
+	trace2_data_intmax("midx", the_repository, "load/num_packs", m->num_packs);
+	trace2_data_intmax("midx", the_repository, "load/num_objects", m->num_objects);
+
 	return m;
 
 cleanup_fail:
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] midx: verify: group objects by packfile to speed up object verification
  2019-03-18 14:29 ` [PATCH 2/3] midx: verify: group objects by packfile to speed up object verification Jeff Hostetler via GitGitGadget
@ 2019-03-18 15:53   ` Ævar Arnfjörð Bjarmason
  2019-03-18 21:39     ` Jeff Hostetler
  0 siblings, 1 reply; 26+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-03-18 15:53 UTC (permalink / raw)
  To: Jeff Hostetler via GitGitGadget
  Cc: git, dstolee, Junio C Hamano, Jeff Hostetler


On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:

> +static int compare_pair_pos_vs_id(const void *_a, const void *_b)
> +{
> +	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
> +	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
> +
> +	if (a->pack_int_id < b->pack_int_id)
> +		return -1;
> +	if (a->pack_int_id > b->pack_int_id)
> +		return 1;
> +
> +	return 0;
> +}

Not a suggestion for a change, just a note that this sent me down the
rabbit hole of looking at the different idioms we use for QSORT() in
different places. Some use this form, some a ternary nest, and some the
succinct subtraction idiom of e.g. (in this case):

    return b->pack_int_id - a->pack_int_id;

> +
>  int verify_midx_file(const char *object_dir)
>  {
> -	uint32_t i;
> +	struct pair_pos_vs_id *pairs = NULL;
> +	uint32_t i, k;
>  	struct progress *progress;
>  	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
>  	verify_midx_error = 0;
> @@ -997,15 +1017,36 @@ int verify_midx_file(const char *object_dir)
>  	}
>
>  	progress = start_progress(_("Verifying object offsets"), m->num_objects);
> +
> +	/*
> +	 * Create an array mapping each object to its packfile id.  Sort it
> +	 * to group the objects by packfile.  Use this permutation to visit
> +	 * each of the objects and only require 1 packfile to be open at a
> +	 * time.
> +	 */
> +	ALLOC_ARRAY(pairs, m->num_objects);
>  	for (i = 0; i < m->num_objects; i++) {
> +		pairs[i].pos = i;
> +		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
> +	}
> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
> +
> +	for (k = 0; k < m->num_objects; k++) {
> [...]

I have not tested this (or midx in general), but isn't this new QSORT()
introducing the same sort of progress stalling that I fixed for
commit-graph in 890226ccb57 ("commit-graph write: add itermediate
progress", 2019-01-19)? I.e. something you can work around with a
"display_progress(progress, 0)" before the QSORT().

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] midx: verify: group objects by packfile to speed up object verification
  2019-03-18 15:53   ` Ævar Arnfjörð Bjarmason
@ 2019-03-18 21:39     ` Jeff Hostetler
  2019-03-18 22:02       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff Hostetler @ 2019-03-18 21:39 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason,
	Jeff Hostetler via GitGitGadget
  Cc: git, dstolee, Junio C Hamano, Jeff Hostetler



On 3/18/2019 11:53 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:
> 
>> +static int compare_pair_pos_vs_id(const void *_a, const void *_b)
>> +{
>> +	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
>> +	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
>> +
>> +	if (a->pack_int_id < b->pack_int_id)
>> +		return -1;
>> +	if (a->pack_int_id > b->pack_int_id)
>> +		return 1;
>> +
>> +	return 0;
>> +}
> 
> Not a suggestion for a change, just a note that this sent me down the
> rabbit hole of looking at the different idioms we use for QSORT() in
> different places. Some use this form, some a ternary nest, and some the
> succinct subtraction idiom of e.g. (in this case):
> 
>      return b->pack_int_id - a->pack_int_id;

Yeah, I'm not sure which way is better or worse here.
An earlier draft of this function sorted by packfile id
and then by OID (thinking we might benefit from some
locality later when we do the verify), hence the independent
if statements.  But it didn't help, so I removed the other
lines.

On 43+M objects, your version is a hair faster, so I might
as well take it instead.

> 
>> +
>>   int verify_midx_file(const char *object_dir)
>>   {
>> -	uint32_t i;
>> +	struct pair_pos_vs_id *pairs = NULL;
>> +	uint32_t i, k;
>>   	struct progress *progress;
>>   	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
>>   	verify_midx_error = 0;
>> @@ -997,15 +1017,36 @@ int verify_midx_file(const char *object_dir)
>>   	}
>>
>>   	progress = start_progress(_("Verifying object offsets"), m->num_objects);
>> +
>> +	/*
>> +	 * Create an array mapping each object to its packfile id.  Sort it
>> +	 * to group the objects by packfile.  Use this permutation to visit
>> +	 * each of the objects and only require 1 packfile to be open at a
>> +	 * time.
>> +	 */
>> +	ALLOC_ARRAY(pairs, m->num_objects);
>>   	for (i = 0; i < m->num_objects; i++) {
>> +		pairs[i].pos = i;
>> +		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
>> +	}
>> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
>> +
>> +	for (k = 0; k < m->num_objects; k++) {
>> [...]
> 
> I have not tested this (or midx in general), but isn't this new QSORT()
> introducing the same sort of progress stalling that I fixed for
> commit-graph in 890226ccb57 ("commit-graph write: add itermediate
> progress", 2019-01-19)? I.e. something you can work around with a
> "display_progress(progress, 0)" before the QSORT().
> 

I wasn't tracking your commit-graph changes, but yes, I think it is.

Tinkering with how to display progress, I found a couple of problems.
On my 3599 packfile, 43M object example, QSORT() takes about 5 seconds.
But there's about 2 seconds of setup before the sort starts.  The final
verify loops takes about 17 seconds.

Here I put trace2 regions around the main loops and used the
GIT_TR2_PERF stream.

> | cmd_name     |     |           |           |            | multi-pack-index (multi-pack-index)
> | cmd_mode     |     |           |           |            | verify
> | data         | r0  |  0.031295 |  0.031295 | midx       | load/num_packs:3599
> | data         | r0  |  0.031330 |  0.031330 | midx       | load/num_objects:42704807
> | region_enter | r0  |  0.031352 |           | midx       | label:verify/prepare 
> | region_leave | r0  |  0.626547 |  0.595195 | midx       | label:verify/prepare 
> | region_enter | r0  |  0.626602 |           | midx       | label:verify/oid_order 
> | region_leave | r0  |  1.570195 |  0.943593 | midx       | label:verify/oid_order 
> | region_enter | r0  |  1.570253 |           | midx       | label:verify/sort_setup 
> | region_leave | r0  |  1.809723 |  0.239470 | midx       | label:verify/sort_setup 
> | region_enter | r0  |  1.809803 |           | midx       | label:verify/sort 
> | region_leave | r0  |  6.950595 |  5.140792 | midx       | label:verify/sort 
> | region_enter | r0  |  6.950651 |           | midx       | label:verify/offsets 
> | region_leave | r0  | 24.059736 | 17.109085 | midx       | label:verify/offsets 
> | exit         |     | 24.101434 |           |            | code:0

So just adding a delay progress block by itself around the sort doesn't
help much.  It just sits there for 7 seconds before the actual progress
starts.

If I add a non-delay progress block around the "verify/prepare",
"verify/oid_order" and the "verify/offsets" loops, we get a pretty good
experience.

There is the dead time while the sort() itself is running, but at least
there is isn't a 5+ second frozen at 0% message on screen.

I'll re-roll shortly.

Thanks,
Jeff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] midx: verify: group objects by packfile to speed up object verification
  2019-03-18 21:39     ` Jeff Hostetler
@ 2019-03-18 22:02       ` Ævar Arnfjörð Bjarmason
  2019-03-19  4:08         ` Junio C Hamano
  2019-03-19 14:00         ` Jeff Hostetler
  0 siblings, 2 replies; 26+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-03-18 22:02 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Jeff Hostetler via GitGitGadget, git, dstolee, Junio C Hamano,
	Jeff Hostetler


On Mon, Mar 18 2019, Jeff Hostetler wrote:

> On 3/18/2019 11:53 AM, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:
>>
>>> +static int compare_pair_pos_vs_id(const void *_a, const void *_b)
>>> +{
>>> +	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
>>> +	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
>>> +
>>> +	if (a->pack_int_id < b->pack_int_id)
>>> +		return -1;
>>> +	if (a->pack_int_id > b->pack_int_id)
>>> +		return 1;
>>> +
>>> +	return 0;
>>> +}
>>
>> Not a suggestion for a change, just a note that this sent me down the
>> rabbit hole of looking at the different idioms we use for QSORT() in
>> different places. Some use this form, some a ternary nest, and some the
>> succinct subtraction idiom of e.g. (in this case):
>>
>>      return b->pack_int_id - a->pack_int_id;
>
> Yeah, I'm not sure which way is better or worse here.
> An earlier draft of this function sorted by packfile id
> and then by OID (thinking we might benefit from some
> locality later when we do the verify), hence the independent
> if statements.  But it didn't help, so I removed the other
> lines.
>
> On 43+M objects, your version is a hair faster, so I might
> as well take it instead.

Cool!

>>
>>> +
>>>   int verify_midx_file(const char *object_dir)
>>>   {
>>> -	uint32_t i;
>>> +	struct pair_pos_vs_id *pairs = NULL;
>>> +	uint32_t i, k;
>>>   	struct progress *progress;
>>>   	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
>>>   	verify_midx_error = 0;
>>> @@ -997,15 +1017,36 @@ int verify_midx_file(const char *object_dir)
>>>   	}
>>>
>>>   	progress = start_progress(_("Verifying object offsets"), m->num_objects);
>>> +
>>> +	/*
>>> +	 * Create an array mapping each object to its packfile id.  Sort it
>>> +	 * to group the objects by packfile.  Use this permutation to visit
>>> +	 * each of the objects and only require 1 packfile to be open at a
>>> +	 * time.
>>> +	 */
>>> +	ALLOC_ARRAY(pairs, m->num_objects);
>>>   	for (i = 0; i < m->num_objects; i++) {
>>> +		pairs[i].pos = i;
>>> +		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
>>> +	}
>>> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
>>> +
>>> +	for (k = 0; k < m->num_objects; k++) {
>>> [...]
>>
>> I have not tested this (or midx in general), but isn't this new QSORT()
>> introducing the same sort of progress stalling that I fixed for
>> commit-graph in 890226ccb57 ("commit-graph write: add itermediate
>> progress", 2019-01-19)? I.e. something you can work around with a
>> "display_progress(progress, 0)" before the QSORT().
>>
>
> I wasn't tracking your commit-graph changes, but yes, I think it is.
>
> Tinkering with how to display progress, I found a couple of problems.
> On my 3599 packfile, 43M object example, QSORT() takes about 5 seconds.
> But there's about 2 seconds of setup before the sort starts.  The final
> verify loops takes about 17 seconds.
>
> Here I put trace2 regions around the main loops and used the
> GIT_TR2_PERF stream.
>
>> | cmd_name     |     |           |           |            | multi-pack-index (multi-pack-index)
>> | cmd_mode     |     |           |           |            | verify
>> | data         | r0  |  0.031295 |  0.031295 | midx       | load/num_packs:3599
>> | data         | r0  |  0.031330 |  0.031330 | midx       | load/num_objects:42704807
>> | region_enter | r0  |  0.031352 |           | midx       |
>> label:verify/prepare | region_leave | r0  |  0.626547 |  0.595195 |
>> midx       | label:verify/prepare | region_enter | r0  |  0.626602 |
>> | midx       | label:verify/oid_order | region_leave | r0  |
>> 1.570195 |  0.943593 | midx       | label:verify/oid_order |
>> region_enter | r0  |  1.570253 |           | midx       |
>> label:verify/sort_setup | region_leave | r0  |  1.809723 |  0.239470
>> | midx       | label:verify/sort_setup | region_enter | r0  |
>> 1.809803 |           | midx       | label:verify/sort | region_leave
>> | r0  |  6.950595 |  5.140792 | midx       | label:verify/sort |
>> region_enter | r0  |  6.950651 |           | midx       |
>> label:verify/offsets | region_leave | r0  | 24.059736 | 17.109085 |
>> midx       | label:verify/offsets | exit         |     | 24.101434 |
>> |            | code:0
>
> So just adding a delay progress block by itself around the sort doesn't
> help much.  It just sits there for 7 seconds before the actual progress
> starts.
>
> If I add a non-delay progress block around the "verify/prepare",
> "verify/oid_order" and the "verify/offsets" loops, we get a pretty good
> experience.
>
> There is the dead time while the sort() itself is running, but at least
> there is isn't a 5+ second frozen at 0% message on screen.

Yeah, the same with the commit-graph with my hack. I.e. it'll sit there,
but at least it sits like this:

    What I was doing before 100% (X/Y)
    What I'm about to start doing 0% (0/Z) [hanging]

Instead of:

    What I was doing before 100% (X/Y)
    [hanging]

So that's an improvement, i.e. you know it's started that next phase at
least instead of just having a non-descriptive hang.

Ideally there would be some way to reach into the QSORT() and display
progress there, but that's all sorts of nasty, so as the TODO comment in
commit-graph.c notes I punted it.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] midx: verify: group objects by packfile to speed up object verification
  2019-03-18 22:02       ` Ævar Arnfjörð Bjarmason
@ 2019-03-19  4:08         ` Junio C Hamano
  2019-03-19 14:00         ` Jeff Hostetler
  1 sibling, 0 replies; 26+ messages in thread
From: Junio C Hamano @ 2019-03-19  4:08 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff Hostetler, Jeff Hostetler via GitGitGadget, git, dstolee,
	Jeff Hostetler

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>>>> +	if (a->pack_int_id < b->pack_int_id)
>>>> +		return -1;
>>>> +	if (a->pack_int_id > b->pack_int_id)
>>>> +		return 1;
>>>> +
>>>> +	return 0;
>>>> +}
>>> ...
>>> succinct subtraction idiom of e.g. (in this case):
>>>
>>>      return b->pack_int_id - a->pack_int_id;
>> ...
>> On 43+M objects, your version is a hair faster, so I might
>> as well take it instead.
>
> Cool!

Yup, following a well-known short idiom lets readers' eyes coast
over without forcing them to make sure the two if statements have
comparison going in the right direction and return the constant with
the right sign, etc.  Even if the idiomatic version weren't faster,
it is worth using it.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] midx: verify: group objects by packfile to speed up object verification
  2019-03-18 22:02       ` Ævar Arnfjörð Bjarmason
  2019-03-19  4:08         ` Junio C Hamano
@ 2019-03-19 14:00         ` Jeff Hostetler
  2019-03-19 14:06           ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff Hostetler @ 2019-03-19 14:00 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff Hostetler via GitGitGadget, git, dstolee, Junio C Hamano,
	Jeff Hostetler



On 3/18/2019 6:02 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Mar 18 2019, Jeff Hostetler wrote:
> 
>> On 3/18/2019 11:53 AM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:
>>>
[...]
>>>> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
>>>> +
>>>> +	for (k = 0; k < m->num_objects; k++) {
>>>> [...]
>>>
>>> I have not tested this (or midx in general), but isn't this new QSORT()
>>> introducing the same sort of progress stalling that I fixed for
>>> commit-graph in 890226ccb57 ("commit-graph write: add itermediate
>>> progress", 2019-01-19)? I.e. something you can work around with a
>>> "display_progress(progress, 0)" before the QSORT().
>>>
>>
>> I wasn't tracking your commit-graph changes, but yes, I think it is.
>>
[...]
>>
>> There is the dead time while the sort() itself is running, but at least
>> there is isn't a 5+ second frozen at 0% message on screen.
> 
> Yeah, the same with the commit-graph with my hack. I.e. it'll sit there,
> but at least it sits like this:
> 
>      What I was doing before 100% (X/Y)
>      What I'm about to start doing 0% (0/Z) [hanging]
> 
> Instead of:
> 
>      What I was doing before 100% (X/Y)
>      [hanging]
> 
> So that's an improvement, i.e. you know it's started that next phase at
> least instead of just having a non-descriptive hang.
> 
> Ideally there would be some way to reach into the QSORT() and display
> progress there, but that's all sorts of nasty, so as the TODO comment in
> commit-graph.c notes I punted it.

Perhaps I'm confused or this is a Windows issue, but when I do:

	progress = start_delayed_progress("sorting", n);
	display_progress(progress, 0);
	QSORT(...);
	stop_progress(&progress);

I never see the 0% message.  It always does the hang with the cursor in
column 0 on a blank line.  If I make this a regular start_progress(),
I do see the 0% message for the duration of the qsort hang.

I did some similar testing around your QSORT() in commit-graph.c
and got the same result.  It looks like start_delayed_processing()
wants to wait at least 2 seconds before displaying anything and has
an interval timer to notify it that another message should be printed,
but the display_progress(0) prior to the QSORT() arrives before the 2
seconds are up and so nothing is printed.  It's not until we get into
the loop below the QSORT that one of the display_progress(i+1) calls
could cause a message to appear.

Right?
Jeff







^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] midx: verify: group objects by packfile to speed up object verification
  2019-03-19 14:00         ` Jeff Hostetler
@ 2019-03-19 14:06           ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 26+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-03-19 14:06 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Jeff Hostetler via GitGitGadget, git, dstolee, Junio C Hamano,
	Jeff Hostetler


On Tue, Mar 19 2019, Jeff Hostetler wrote:

> On 3/18/2019 6:02 PM, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Mon, Mar 18 2019, Jeff Hostetler wrote:
>>
>>> On 3/18/2019 11:53 AM, Ævar Arnfjörð Bjarmason wrote:
>>>>
>>>> On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:
>>>>
> [...]
>>>>> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
>>>>> +
>>>>> +	for (k = 0; k < m->num_objects; k++) {
>>>>> [...]
>>>>
>>>> I have not tested this (or midx in general), but isn't this new QSORT()
>>>> introducing the same sort of progress stalling that I fixed for
>>>> commit-graph in 890226ccb57 ("commit-graph write: add itermediate
>>>> progress", 2019-01-19)? I.e. something you can work around with a
>>>> "display_progress(progress, 0)" before the QSORT().
>>>>
>>>
>>> I wasn't tracking your commit-graph changes, but yes, I think it is.
>>>
> [...]
>>>
>>> There is the dead time while the sort() itself is running, but at least
>>> there is isn't a 5+ second frozen at 0% message on screen.
>>
>> Yeah, the same with the commit-graph with my hack. I.e. it'll sit there,
>> but at least it sits like this:
>>
>>      What I was doing before 100% (X/Y)
>>      What I'm about to start doing 0% (0/Z) [hanging]
>>
>> Instead of:
>>
>>      What I was doing before 100% (X/Y)
>>      [hanging]
>>
>> So that's an improvement, i.e. you know it's started that next phase at
>> least instead of just having a non-descriptive hang.
>>
>> Ideally there would be some way to reach into the QSORT() and display
>> progress there, but that's all sorts of nasty, so as the TODO comment in
>> commit-graph.c notes I punted it.
>
> Perhaps I'm confused or this is a Windows issue, but when I do:
>
> 	progress = start_delayed_progress("sorting", n);
> 	display_progress(progress, 0);
> 	QSORT(...);
> 	stop_progress(&progress);
>
> I never see the 0% message.  It always does the hang with the cursor in
> column 0 on a blank line.  If I make this a regular start_progress(),
> I do see the 0% message for the duration of the qsort hang.
>
> I did some similar testing around your QSORT() in commit-graph.c
> and got the same result.  It looks like start_delayed_processing()
> wants to wait at least 2 seconds before displaying anything and has
> an interval timer to notify it that another message should be printed,
> but the display_progress(0) prior to the QSORT() arrives before the 2
> seconds are up and so nothing is printed.  It's not until we get into
> the loop below the QSORT that one of the display_progress(i+1) calls
> could cause a message to appear.

Correct, it'll still hang the N seconds that start_delayed_progress()
imposes.

In the commit-graph code this would sometimes take longer than that, so
I'd see the 0% earlier.

But maybe in this case even on humongous repositories it's faster than
that.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 0/4] multi-pack-index: fix verify on large repos
  2019-03-18 14:29 [PATCH 0/3] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
                   ` (2 preceding siblings ...)
  2019-03-18 14:29 ` [PATCH 3/3] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
@ 2019-03-19 14:37 ` Jeff Hostetler via GitGitGadget
  2019-03-19 14:37   ` [PATCH v2 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
                     ` (4 more replies)
  3 siblings, 5 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-19 14:37 UTC (permalink / raw)
  To: git; +Cc: avarab, Junio C Hamano

Version 2 addresses progress-related concerns raised in the previous version
of the midx verify code.

This version also extends the existing progress.[ch] code and adds a
"sparse" mode that automatically ensures the 100% message is issued.


----------------------------------------------------------------------------

Teach "multi-pack-index verify" to handle cases where the number of
packfiles exceeds the open file handle limit.

The first commit fixes a problem that prevented the LRU-style
close_one_pack() mechanism from working which caused midx verify to run out
of file descriptors.

The second commit teaches midx verify to sort the set of objects to verify
by packfile rather than verifying them in OID order. This eliminates the
need to have more than one packfile/idx open at the same time.

With the second commit, runtime on 3600 packfiles went from 12 minutes to 25
seconds.

Thanks, Jeff

Cc: dstolee@microsoft.com

Jeff Hostetler (4):
  progress: add sparse mode to force 100% complete message
  trace2:data: add trace2 data to midx
  midx: verify: add midx packfiles to the packed_git list
  midx: verify: group objects by packfile to speed up object
    verification

 builtin/multi-pack-index.c |  3 ++
 midx.c                     | 84 +++++++++++++++++++++++++++++++++++---
 packfile.c                 |  2 +-
 packfile.h                 |  2 +
 progress.c                 | 40 ++++++++++++++++--
 progress.h                 |  3 ++
 6 files changed, 124 insertions(+), 10 deletions(-)


base-commit: e902e9bcae2010bc42648c80ab6adc6c5a16a4a5
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-166%2Fjeffhostetler%2Fupstream-midx-verify-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-166/jeffhostetler/upstream-midx-verify-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/166

Range-diff vs v1:

 -:  ---------- > 1:  e1da1f84a8 progress: add sparse mode to force 100% complete message
 3:  2d23bc24b7 = 2:  11c88845e7 trace2:data: add trace2 data to midx
 1:  d1a730df94 = 3:  ced7f1cb34 midx: verify: add midx packfiles to the packed_git list
 2:  86f6b03258 ! 4:  e2dd99911f midx: verify: group objects by packfile to speed up object verification
     @@ -31,13 +31,20 @@
      +	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
      +	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
      +
     -+	if (a->pack_int_id < b->pack_int_id)
     -+		return -1;
     -+	if (a->pack_int_id > b->pack_int_id)
     -+		return 1;
     -+
     -+	return 0;
     ++	return b->pack_int_id - a->pack_int_id;
      +}
     ++
     ++/*
     ++ * Limit calls to display_progress() for performance reasons.
     ++ * The interval here was arbitrarily chosen.
     ++ */
     ++#define SPARSE_PROGRESS_INTERVAL (1 << 12)
     ++#define midx_display_sparse_progress(progress, n) \
     ++	do { \
     ++		uint64_t _n = (n); \
     ++		if ((_n & (SPARSE_PROGRESS_INTERVAL - 1)) == 0)	\
     ++			display_progress(progress, _n); \
     ++	} while (0)
      +
       int verify_midx_file(const char *object_dir)
       {
     @@ -48,10 +55,43 @@
       	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
       	verify_midx_error = 0;
      @@
     + 	if (!m)
     + 		return 0;
     + 
     ++	progress = start_progress(_("Looking for referenced packfiles"),
     ++				  m->num_packs);
     + 	for (i = 0; i < m->num_packs; i++) {
     + 		if (prepare_midx_pack(m, i))
     + 			midx_report("failed to load pack in position %d", i);
     + 
     + 		if (m->packs[i])
     + 			install_packed_git(the_repository, m->packs[i]);
     ++
     ++		display_progress(progress, i + 1);
     + 	}
     ++	stop_progress(&progress);
     + 
     + 	for (i = 0; i < 255; i++) {
     + 		uint32_t oid_fanout1 = ntohl(m->chunk_oid_fanout[i]);
     +@@
     + 				    i, oid_fanout1, oid_fanout2, i + 1);
       	}
       
     - 	progress = start_progress(_("Verifying object offsets"), m->num_objects);
     ++	progress = start_sparse_progress(_("Verifying OID order in MIDX"),
     ++					 m->num_objects - 1);
     + 	for (i = 0; i < m->num_objects - 1; i++) {
     + 		struct object_id oid1, oid2;
     + 
     +@@
     + 		if (oidcmp(&oid1, &oid2) >= 0)
     + 			midx_report(_("oid lookup out of order: oid[%d] = %s >= %s = oid[%d]"),
     + 				    i, oid_to_hex(&oid1), oid_to_hex(&oid2), i + 1);
      +
     ++		midx_display_sparse_progress(progress, i + 1);
     + 	}
     ++	stop_progress(&progress);
     + 
     +-	progress = start_progress(_("Verifying object offsets"), m->num_objects);
      +	/*
      +	 * Create an array mapping each object to its packfile id.  Sort it
      +	 * to group the objects by packfile.  Use this permutation to visit
     @@ -63,8 +103,15 @@
      +		pairs[i].pos = i;
      +		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
      +	}
     ++
     ++	progress = start_sparse_progress(
     ++		_("Sorting objects by packfile"), m->num_objects);
     ++	display_progress(progress, 0); /* TODO: Measure QSORT() progress */
      +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
     ++	stop_progress(&progress);
      +
     ++	progress = start_sparse_progress(_("Verifying object offsets"),
     ++					 m->num_objects);
      +	for (k = 0; k < m->num_objects; k++) {
       		struct object_id oid;
       		struct pack_entry e;
     @@ -94,7 +141,7 @@
      +				    pairs[k].pos, oid_to_hex(&oid), m_offset, p_offset);
       
      -		display_progress(progress, i + 1);
     -+		display_progress(progress, k + 1);
     ++		midx_display_sparse_progress(progress, k + 1);
       	}
       	stop_progress(&progress);
       

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 1/4] progress: add sparse mode to force 100% complete message
  2019-03-19 14:37 ` [PATCH v2 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
@ 2019-03-19 14:37   ` Jeff Hostetler via GitGitGadget
  2019-03-19 16:42     ` Eric Sunshine
  2019-03-19 14:37   ` [PATCH v2 2/4] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-19 14:37 UTC (permalink / raw)
  To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Add new start_sparse_progress() and start_delayed_sparse_progress()
constructors and "sparse" flag to struct progress.

Teach stop_progress() to force a 100% complete progress message before
printing the final "done" message when "sparse" is set.

Calling display_progress() for every item in a large set can
be expensive.  If callers try to filter this for performance
reasons, such as emitting every k-th item, progress would
not reach 100% unless they made a final call to display_progress()
with the item count before calling stop_progress().

Now this is automatic when "sparse" is set.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 progress.c | 40 +++++++++++++++++++++++++++++++++++++---
 progress.h |  3 +++
 2 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/progress.c b/progress.c
index 5a99c9fbf0..3d75376c96 100644
--- a/progress.c
+++ b/progress.c
@@ -34,6 +34,7 @@ struct progress {
 	uint64_t total;
 	unsigned last_percent;
 	unsigned delay;
+	unsigned sparse;
 	struct throughput *throughput;
 	uint64_t start_ns;
 };
@@ -194,7 +195,7 @@ int display_progress(struct progress *progress, uint64_t n)
 }
 
 static struct progress *start_progress_delay(const char *title, uint64_t total,
-					     unsigned delay)
+					     unsigned delay, unsigned sparse)
 {
 	struct progress *progress = malloc(sizeof(*progress));
 	if (!progress) {
@@ -208,6 +209,7 @@ static struct progress *start_progress_delay(const char *title, uint64_t total,
 	progress->last_value = -1;
 	progress->last_percent = -1;
 	progress->delay = delay;
+	progress->sparse = sparse;
 	progress->throughput = NULL;
 	progress->start_ns = getnanotime();
 	set_progress_signal();
@@ -216,16 +218,48 @@ static struct progress *start_progress_delay(const char *title, uint64_t total,
 
 struct progress *start_delayed_progress(const char *title, uint64_t total)
 {
-	return start_progress_delay(title, total, 2);
+	return start_progress_delay(title, total, 2, 0);
 }
 
 struct progress *start_progress(const char *title, uint64_t total)
 {
-	return start_progress_delay(title, total, 0);
+	return start_progress_delay(title, total, 0, 0);
+}
+
+/*
+ * Here "sparse" means that the caller might use some sampling criteria to
+ * decide when to call display_progress() rather than calling it for every
+ * integer value in[0 .. total).  In particular, the caller might not call
+ * display_progress() for the last value in the range.
+ *
+ * When "sparse" is set, stop_progress() will automatically force the done
+ * message to show 100%.
+ */
+struct progress *start_sparse_progress(const char *title, uint64_t total)
+{
+	return start_progress_delay(title, total, 0, 1);
+}
+
+struct progress *start_delayed_sparse_progress(const char *title,
+					       uint64_t total)
+{
+	return start_progress_delay(title, total, 2, 1);
+}
+
+static void finish_if_sparse(struct progress **p_progress)
+{
+	struct progress *progress = *p_progress;
+
+	if (progress &&
+	    progress->sparse &&
+	    progress->last_value != progress->total)
+		display_progress(progress, progress->total);
 }
 
 void stop_progress(struct progress **p_progress)
 {
+	finish_if_sparse(p_progress);
+
 	stop_progress_msg(p_progress, _("done"));
 }
 
diff --git a/progress.h b/progress.h
index 70a4d4a0d6..7b725acc8d 100644
--- a/progress.h
+++ b/progress.h
@@ -6,7 +6,10 @@ struct progress;
 void display_throughput(struct progress *progress, uint64_t total);
 int display_progress(struct progress *progress, uint64_t n);
 struct progress *start_progress(const char *title, uint64_t total);
+struct progress *start_sparse_progress(const char *title, uint64_t total);
 struct progress *start_delayed_progress(const char *title, uint64_t total);
+struct progress *start_delayed_sparse_progress(const char *title,
+					       uint64_t total);
 void stop_progress(struct progress **progress);
 void stop_progress_msg(struct progress **progress, const char *msg);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 2/4] trace2:data: add trace2 data to midx
  2019-03-19 14:37 ` [PATCH v2 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
  2019-03-19 14:37   ` [PATCH v2 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
@ 2019-03-19 14:37   ` Jeff Hostetler via GitGitGadget
  2019-03-19 14:37   ` [PATCH v2 3/4] midx: verify: add midx packfiles to the packed_git list Jeff Hostetler via GitGitGadget
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-19 14:37 UTC (permalink / raw)
  To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Log multi-pack-index command mode.
Log number of objects and packfiles in the midx.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 builtin/multi-pack-index.c | 3 +++
 midx.c                     | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index fca70f8e4f..ae6e476ad5 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -3,6 +3,7 @@
 #include "config.h"
 #include "parse-options.h"
 #include "midx.h"
+#include "trace2.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
 	N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
@@ -40,6 +41,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		return 1;
 	}
 
+	trace2_cmd_mode(argv[0]);
+
 	if (!strcmp(argv[0], "write"))
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
diff --git a/midx.c b/midx.c
index 8a505fd423..24141a7c62 100644
--- a/midx.c
+++ b/midx.c
@@ -8,6 +8,7 @@
 #include "sha1-lookup.h"
 #include "midx.h"
 #include "progress.h"
+#include "trace2.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -164,6 +165,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 			      m->pack_names[i]);
 	}
 
+	trace2_data_intmax("midx", the_repository, "load/num_packs", m->num_packs);
+	trace2_data_intmax("midx", the_repository, "load/num_objects", m->num_objects);
+
 	return m;
 
 cleanup_fail:
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 3/4] midx: verify: add midx packfiles to the packed_git list
  2019-03-19 14:37 ` [PATCH v2 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
  2019-03-19 14:37   ` [PATCH v2 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
  2019-03-19 14:37   ` [PATCH v2 2/4] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
@ 2019-03-19 14:37   ` Jeff Hostetler via GitGitGadget
  2019-03-19 19:53     ` Jeff Hostetler
  2019-03-19 14:37   ` [PATCH v2 4/4] midx: verify: group objects by packfile to speed up object verification Jeff Hostetler via GitGitGadget
  2019-03-21 19:36   ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
  4 siblings, 1 reply; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-19 14:37 UTC (permalink / raw)
  To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Fix "git multi-pack-index verify" to handle repos with thousands
of packfiles.

Midx verify adds the individual "packed_git" structures to the
multi_pack_index.packs array, but it does not add them to the
"repository.objects.packed_git" list.  During the verification
code, each packfile is opened and scanned.  And "pack_open_fds"
is incremented.  If "pack_open_fds" equals the "pack_max_fds"
open_packed_git_1() calls close_one_pack() to LRU-style close
an already open packfile.  But because the packfiles were never
added to the "packed_git" list, close_one_pack() does nothing.
If there are very many packfiles, Git runs out of file descriptors
and fails.

Note that this was observed on Windows when build with GCC and
in a repository with more than (2048-25) packfiles.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 midx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/midx.c b/midx.c
index 24141a7c62..b2f33bcd90 100644
--- a/midx.c
+++ b/midx.c
@@ -975,6 +975,9 @@ int verify_midx_file(const char *object_dir)
 	for (i = 0; i < m->num_packs; i++) {
 		if (prepare_midx_pack(m, i))
 			midx_report("failed to load pack in position %d", i);
+
+		if (m->packs[i])
+			install_packed_git(the_repository, m->packs[i]);
 	}
 
 	for (i = 0; i < 255; i++) {
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 4/4] midx: verify: group objects by packfile to speed up object verification
  2019-03-19 14:37 ` [PATCH v2 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
                     ` (2 preceding siblings ...)
  2019-03-19 14:37   ` [PATCH v2 3/4] midx: verify: add midx packfiles to the packed_git list Jeff Hostetler via GitGitGadget
@ 2019-03-19 14:37   ` Jeff Hostetler via GitGitGadget
  2019-03-21 19:36   ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
  4 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-19 14:37 UTC (permalink / raw)
  To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Teach `multi-pack-index verify` to sort the set of objects by
packfile so that only one packfile needs to be open at a time.

This is a performance improvement.  Previously, objects were
verified in OID order.  This essentially requires all packfiles
to be open at the same time.  If the number of packfiles exceeds
the open file limit, packfiles would be LRU-closed and re-opened
many times.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 midx.c     | 77 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 packfile.c |  2 +-
 packfile.h |  2 ++
 3 files changed, 74 insertions(+), 7 deletions(-)

diff --git a/midx.c b/midx.c
index b2f33bcd90..a37c09f2a5 100644
--- a/midx.c
+++ b/midx.c
@@ -962,9 +962,36 @@ static void midx_report(const char *fmt, ...)
 	va_end(ap);
 }
 
+struct pair_pos_vs_id
+{
+	uint32_t pos;
+	uint32_t pack_int_id;
+};
+
+static int compare_pair_pos_vs_id(const void *_a, const void *_b)
+{
+	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
+	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
+
+	return b->pack_int_id - a->pack_int_id;
+}
+
+/*
+ * Limit calls to display_progress() for performance reasons.
+ * The interval here was arbitrarily chosen.
+ */
+#define SPARSE_PROGRESS_INTERVAL (1 << 12)
+#define midx_display_sparse_progress(progress, n) \
+	do { \
+		uint64_t _n = (n); \
+		if ((_n & (SPARSE_PROGRESS_INTERVAL - 1)) == 0)	\
+			display_progress(progress, _n); \
+	} while (0)
+
 int verify_midx_file(const char *object_dir)
 {
-	uint32_t i;
+	struct pair_pos_vs_id *pairs = NULL;
+	uint32_t i, k;
 	struct progress *progress;
 	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
 	verify_midx_error = 0;
@@ -972,13 +999,18 @@ int verify_midx_file(const char *object_dir)
 	if (!m)
 		return 0;
 
+	progress = start_progress(_("Looking for referenced packfiles"),
+				  m->num_packs);
 	for (i = 0; i < m->num_packs; i++) {
 		if (prepare_midx_pack(m, i))
 			midx_report("failed to load pack in position %d", i);
 
 		if (m->packs[i])
 			install_packed_git(the_repository, m->packs[i]);
+
+		display_progress(progress, i + 1);
 	}
+	stop_progress(&progress);
 
 	for (i = 0; i < 255; i++) {
 		uint32_t oid_fanout1 = ntohl(m->chunk_oid_fanout[i]);
@@ -989,6 +1021,8 @@ int verify_midx_file(const char *object_dir)
 				    i, oid_fanout1, oid_fanout2, i + 1);
 	}
 
+	progress = start_sparse_progress(_("Verifying OID order in MIDX"),
+					 m->num_objects - 1);
 	for (i = 0; i < m->num_objects - 1; i++) {
 		struct object_id oid1, oid2;
 
@@ -998,18 +1032,47 @@ int verify_midx_file(const char *object_dir)
 		if (oidcmp(&oid1, &oid2) >= 0)
 			midx_report(_("oid lookup out of order: oid[%d] = %s >= %s = oid[%d]"),
 				    i, oid_to_hex(&oid1), oid_to_hex(&oid2), i + 1);
+
+		midx_display_sparse_progress(progress, i + 1);
 	}
+	stop_progress(&progress);
 
-	progress = start_progress(_("Verifying object offsets"), m->num_objects);
+	/*
+	 * Create an array mapping each object to its packfile id.  Sort it
+	 * to group the objects by packfile.  Use this permutation to visit
+	 * each of the objects and only require 1 packfile to be open at a
+	 * time.
+	 */
+	ALLOC_ARRAY(pairs, m->num_objects);
 	for (i = 0; i < m->num_objects; i++) {
+		pairs[i].pos = i;
+		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
+	}
+
+	progress = start_sparse_progress(
+		_("Sorting objects by packfile"), m->num_objects);
+	display_progress(progress, 0); /* TODO: Measure QSORT() progress */
+	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
+	stop_progress(&progress);
+
+	progress = start_sparse_progress(_("Verifying object offsets"),
+					 m->num_objects);
+	for (k = 0; k < m->num_objects; k++) {
 		struct object_id oid;
 		struct pack_entry e;
 		off_t m_offset, p_offset;
 
-		nth_midxed_object_oid(&oid, m, i);
+		if (k > 0 && pairs[k-1].pack_int_id != pairs[k].pack_int_id &&
+		    m->packs[pairs[k-1].pack_int_id])
+		{
+			close_pack_fd(m->packs[pairs[k-1].pack_int_id]);
+			close_pack_index(m->packs[pairs[k-1].pack_int_id]);
+		}
+
+		nth_midxed_object_oid(&oid, m, pairs[k].pos);
 		if (!fill_midx_entry(&oid, &e, m)) {
 			midx_report(_("failed to load pack entry for oid[%d] = %s"),
-				    i, oid_to_hex(&oid));
+				    pairs[k].pos, oid_to_hex(&oid));
 			continue;
 		}
 
@@ -1024,11 +1087,13 @@ int verify_midx_file(const char *object_dir)
 
 		if (m_offset != p_offset)
 			midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64),
-				    i, oid_to_hex(&oid), m_offset, p_offset);
+				    pairs[k].pos, oid_to_hex(&oid), m_offset, p_offset);
 
-		display_progress(progress, i + 1);
+		midx_display_sparse_progress(progress, k + 1);
 	}
 	stop_progress(&progress);
 
+	free(pairs);
+
 	return verify_midx_error;
 }
diff --git a/packfile.c b/packfile.c
index 16bcb75262..d2bcb2f860 100644
--- a/packfile.c
+++ b/packfile.c
@@ -309,7 +309,7 @@ void close_pack_windows(struct packed_git *p)
 	}
 }
 
-static int close_pack_fd(struct packed_git *p)
+int close_pack_fd(struct packed_git *p)
 {
 	if (p->pack_fd < 0)
 		return 0;
diff --git a/packfile.h b/packfile.h
index d70c6d9afb..b1c18504eb 100644
--- a/packfile.h
+++ b/packfile.h
@@ -76,6 +76,8 @@ extern int open_pack_index(struct packed_git *);
  */
 extern void close_pack_index(struct packed_git *);
 
+int close_pack_fd(struct packed_git *p);
+
 extern uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
 
 extern unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/4] progress: add sparse mode to force 100% complete message
  2019-03-19 14:37   ` [PATCH v2 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
@ 2019-03-19 16:42     ` Eric Sunshine
  2019-03-19 18:33       ` Jeff Hostetler
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Sunshine @ 2019-03-19 16:42 UTC (permalink / raw)
  To: Jeff Hostetler via GitGitGadget
  Cc: Git List, Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Jeff Hostetler

On Tue, Mar 19, 2019 at 10:37 AM Jeff Hostetler via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> Add new start_sparse_progress() and start_delayed_sparse_progress()
> constructors and "sparse" flag to struct progress.
>
> Teach stop_progress() to force a 100% complete progress message before
> printing the final "done" message when "sparse" is set.
>
> Calling display_progress() for every item in a large set can
> be expensive.  If callers try to filter this for performance
> reasons, such as emitting every k-th item, progress would
> not reach 100% unless they made a final call to display_progress()
> with the item count before calling stop_progress().
>
> Now this is automatic when "sparse" is set.
>
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> ---
> diff --git a/progress.c b/progress.c
> +/*
> + * Here "sparse" means that the caller might use some sampling criteria to
> + * decide when to call display_progress() rather than calling it for every
> + * integer value in[0 .. total).  In particular, the caller might not call
> + * display_progress() for the last value in the range.
> + *
> + * When "sparse" is set, stop_progress() will automatically force the done
> + * message to show 100%.
> + */
> +static void finish_if_sparse(struct progress **p_progress)
> +{
> +       struct progress *progress = *p_progress;
> +
> +       if (progress &&
> +           progress->sparse &&
> +           progress->last_value != progress->total)
> +               display_progress(progress, progress->total);
>  }

There's no reason for this function to take a 'struct progress **'
rather than the simpler 'struct progress *', and doing so just
confuses the reader.

>  void stop_progress(struct progress **p_progress)
>  {
> +       finish_if_sparse(p_progress);
> +
>         stop_progress_msg(p_progress, _("done"));
>  }

Rather than adding a new "sparse" mode, I wonder if this entire
concept can be boiled down to a single new function:

    void finish_progress(struct progress **p_progress, const char *msg)
    {
        struct progress *progress = *p_progress;
        if (progress && progress->last_value != progress->total)
            display_progress(progress, progress->total);
        if (msg)
            stop_progress_msg(p_progress, msg);
        else
            stop_progress(p_progress);
    }

without having to make any other changes to the implementation or add
a new field to the structure. It would mean, though, that the caller
would have to remember to invoke finish_progress() rather than
stop_progress(). Which leads one to wonder why this functionality is
needed at all since it's easy enough for a caller to make the one
extra call to simulate this:

    /* client code */
    if (progress)
        display_progress(progress, progress->total);
    stop_progress(&progress);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/4] progress: add sparse mode to force 100% complete message
  2019-03-19 16:42     ` Eric Sunshine
@ 2019-03-19 18:33       ` Jeff Hostetler
  2019-03-19 18:46         ` Eric Sunshine
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff Hostetler @ 2019-03-19 18:33 UTC (permalink / raw)
  To: Eric Sunshine, Jeff Hostetler via GitGitGadget
  Cc: Git List, Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Jeff Hostetler



On 3/19/2019 12:42 PM, Eric Sunshine wrote:
> On Tue, Mar 19, 2019 at 10:37 AM Jeff Hostetler via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> Add new start_sparse_progress() and start_delayed_sparse_progress()
>> constructors and "sparse" flag to struct progress.
>>
>> Teach stop_progress() to force a 100% complete progress message before
>> printing the final "done" message when "sparse" is set.
>>
>> Calling display_progress() for every item in a large set can
>> be expensive.  If callers try to filter this for performance
>> reasons, such as emitting every k-th item, progress would
>> not reach 100% unless they made a final call to display_progress()
>> with the item count before calling stop_progress().
>>
>> Now this is automatic when "sparse" is set.
>>
>> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
>> ---
>> diff --git a/progress.c b/progress.c
>> +/*
>> + * Here "sparse" means that the caller might use some sampling criteria to
>> + * decide when to call display_progress() rather than calling it for every
>> + * integer value in[0 .. total).  In particular, the caller might not call
>> + * display_progress() for the last value in the range.
>> + *
>> + * When "sparse" is set, stop_progress() will automatically force the done
>> + * message to show 100%.
>> + */
>> +static void finish_if_sparse(struct progress **p_progress)
>> +{
>> +       struct progress *progress = *p_progress;
>> +
>> +       if (progress &&
>> +           progress->sparse &&
>> +           progress->last_value != progress->total)
>> +               display_progress(progress, progress->total);
>>   }
> 
> There's no reason for this function to take a 'struct progress **'
> rather than the simpler 'struct progress *', and doing so just
> confuses the reader.

I was just trying to match the existing API in the stop_progress()
and stop_progress_msg() routines.  But yes, I can simplify this.

> 
>>   void stop_progress(struct progress **p_progress)
>>   {
>> +       finish_if_sparse(p_progress);
>> +
>>          stop_progress_msg(p_progress, _("done"));
>>   }
> 
> Rather than adding a new "sparse" mode, I wonder if this entire
> concept can be boiled down to a single new function:
> 
>      void finish_progress(struct progress **p_progress, const char *msg)
>      {
>          struct progress *progress = *p_progress;
>          if (progress && progress->last_value != progress->total)
>              display_progress(progress, progress->total);
>          if (msg)
>              stop_progress_msg(p_progress, msg);
>          else
>              stop_progress(p_progress);
>      }
> 
> without having to make any other changes to the implementation or add
> a new field to the structure.

The existing model has start_progress() and start_delayed_progress().
I was following that model and added the new option at the start.
I'm not planning on adding any additional flags, but if we had some
we'd want them available on the startup next to this one.


>                                It would mean, though, that the caller
> would have to remember to invoke finish_progress() rather than
> stop_progress().

Right, I was trying to isolate the change to the setup, so that loop
bottoms and any in-loop return points don't all have to worry about
whether to call stop_ or finish_.


>                   Which leads one to wonder why this functionality is
> needed at all since it's easy enough for a caller to make the one
> extra call to simulate this:
> 
>      /* client code */
>      if (progress)
>          display_progress(progress, progress->total);
>      stop_progress(&progress);
> 

"struct progress" is private and I didn't want to publish it just for
this.  And again as with the finish_ remarks, that leads to more
places that would need to be updated or maintained.

And you're right, callers could always just call:
	x = ...whatever...
	progress = start_progress(..., x);
	...loop or whatever...
	display_progress(progress, x);
	stop_progress(...);

but that puts the burden on the caller to keep a copy of "x"
that matches the original value so that we get the 100% message.
I was doing that for a while, but after 3 or 4 progress loops,
I found myself wanting the progress routines to handle that
bookkeeping for me.

Jeff


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/4] progress: add sparse mode to force 100% complete message
  2019-03-19 18:33       ` Jeff Hostetler
@ 2019-03-19 18:46         ` Eric Sunshine
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Sunshine @ 2019-03-19 18:46 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Jeff Hostetler via GitGitGadget, Git List,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Jeff Hostetler

On Tue, Mar 19, 2019 at 2:33 PM Jeff Hostetler <git@jeffhostetler.com> wrote:
> On 3/19/2019 12:42 PM, Eric Sunshine wrote:
> > Rather than adding a new "sparse" mode, I wonder if this entire
> > concept can be boiled down to a single new function:
> >
> >      void finish_progress(struct progress **p_progress, const char *msg)
> >      {
> >          [...]
> >      }
> >
> > without having to make any other changes to the implementation or add
> > a new field to the structure.
>
> The existing model has start_progress() and start_delayed_progress().
> I was following that model and added the new option at the start.
> I'm not planning on adding any additional flags, but if we had some
> we'd want them available on the startup next to this one.

Perhaps it makes sense to just take a 'flags' argument instead of
'sparse' so we don't have to worry about this going forward. In other
words:

    #define PROGRESS_DELAYED (1 << 0)
    #define PROGRESS_SPARSE (1 << 1)

    struct progress *start_progress_x(const char *title, uint64_t total,
        unsigned flags);

which covers "delayed" start and "sparse". ("_x" is a placeholder
since I can't think of a good name).

> >                                It would mean, though, that the caller
> > would have to remember to invoke finish_progress() rather than
> > stop_progress().
>
> Right, I was trying to isolate the change to the setup, so that loop
> bottoms and any in-loop return points don't all have to worry about
> whether to call stop_ or finish_.

Yes, I understand the benefit.

Anyhow, my comments are akin to bikeshedding, thus not necessarily actionable.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/4] midx: verify: add midx packfiles to the packed_git list
  2019-03-19 14:37   ` [PATCH v2 3/4] midx: verify: add midx packfiles to the packed_git list Jeff Hostetler via GitGitGadget
@ 2019-03-19 19:53     ` Jeff Hostetler
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler @ 2019-03-19 19:53 UTC (permalink / raw)
  To: Jeff Hostetler via GitGitGadget, git
  Cc: avarab, Junio C Hamano, Jeff Hostetler



On 3/19/2019 10:37 AM, Jeff Hostetler via GitGitGadget wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
> 
> Fix "git multi-pack-index verify" to handle repos with thousands
> of packfiles.
> 
> Midx verify adds the individual "packed_git" structures to the
> multi_pack_index.packs array, but it does not add them to the
> "repository.objects.packed_git" list.  During the verification
> code, each packfile is opened and scanned.  And "pack_open_fds"
> is incremented.  If "pack_open_fds" equals the "pack_max_fds"
> open_packed_git_1() calls close_one_pack() to LRU-style close
> an already open packfile.  But because the packfiles were never
> added to the "packed_git" list, close_one_pack() does nothing.
> If there are very many packfiles, Git runs out of file descriptors
> and fails.
> 
> Note that this was observed on Windows when build with GCC and
> in a repository with more than (2048-25) packfiles.
> 
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> ---
>   midx.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/midx.c b/midx.c
> index 24141a7c62..b2f33bcd90 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -975,6 +975,9 @@ int verify_midx_file(const char *object_dir)
>   	for (i = 0; i < m->num_packs; i++) {
>   		if (prepare_midx_pack(m, i))
>   			midx_report("failed to load pack in position %d", i);
> +
> +		if (m->packs[i])
> +			install_packed_git(the_repository, m->packs[i]);
>   	}
>   
>   	for (i = 0; i < 255; i++) {
> 

I'd like to drop this commit from this series.

I talked with Stolee offline about this.  It does
address the problem in this one instance, but it leads
us to think about other places where there may be a
similar problem.

Also, the sort by packfile in the next commit in this
series means we'll only have 1 packfile open at a time
and so this commit isn't needed in this particular case.

Thanks,
Jeff


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v3 0/4] multi-pack-index: fix verify on large repos
  2019-03-19 14:37 ` [PATCH v2 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
                     ` (3 preceding siblings ...)
  2019-03-19 14:37   ` [PATCH v2 4/4] midx: verify: group objects by packfile to speed up object verification Jeff Hostetler via GitGitGadget
@ 2019-03-21 19:36   ` Jeff Hostetler via GitGitGadget
  2019-03-21 19:36     ` [PATCH v3 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
                       ` (4 more replies)
  4 siblings, 5 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-21 19:36 UTC (permalink / raw)
  To: git; +Cc: avarab, Junio C Hamano

Version 3 drops the packed-git commit I mentioned earlier, simplifies the
finish_if_sparse API as suggested by Eric, and splits the object sort commit
into 2 commits: one to add the additional progress indicators and one to
sort the objects by packfile. This makes it easier to follow.


----------------------------------------------------------------------------

Version 2 addresses progress-related concerns raised in the previous version
of the midx verify code.

This version also extends the existing progress.[ch] code and adds a
"sparse" mode that automatically ensures the 100% message is issued.


----------------------------------------------------------------------------

Teach "multi-pack-index verify" to handle cases where the number of
packfiles exceeds the open file handle limit.

The first commit fixes a problem that prevented the LRU-style
close_one_pack() mechanism from working which caused midx verify to run out
of file descriptors.

The second commit teaches midx verify to sort the set of objects to verify
by packfile rather than verifying them in OID order. This eliminates the
need to have more than one packfile/idx open at the same time.

With the second commit, runtime on 3600 packfiles went from 12 minutes to 25
seconds.

Thanks, Jeff

Cc: dstolee@microsoft.com

Jeff Hostetler (4):
  progress: add sparse mode to force 100% complete message
  trace2:data: add trace2 data to midx
  midx: add progress indicators in multi-pack-index verify
  midx: during verify group objects by packfile to speed verification

 builtin/multi-pack-index.c |  3 ++
 midx.c                     | 79 +++++++++++++++++++++++++++++++++++---
 packfile.c                 |  2 +-
 packfile.h                 |  2 +
 progress.c                 | 38 ++++++++++++++++--
 progress.h                 |  3 ++
 6 files changed, 118 insertions(+), 9 deletions(-)


base-commit: e902e9bcae2010bc42648c80ab6adc6c5a16a4a5
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-166%2Fjeffhostetler%2Fupstream-midx-verify-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-166/jeffhostetler/upstream-midx-verify-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/166

Range-diff vs v2:

 1:  e1da1f84a8 ! 1:  5595e019c8 progress: add sparse mode to force 100% complete message
     @@ -80,10 +80,8 @@
      +	return start_progress_delay(title, total, 2, 1);
      +}
      +
     -+static void finish_if_sparse(struct progress **p_progress)
     ++static void finish_if_sparse(struct progress *progress)
      +{
     -+	struct progress *progress = *p_progress;
     -+
      +	if (progress &&
      +	    progress->sparse &&
      +	    progress->last_value != progress->total)
     @@ -92,7 +90,7 @@
       
       void stop_progress(struct progress **p_progress)
       {
     -+	finish_if_sparse(p_progress);
     ++	finish_if_sparse(*p_progress);
      +
       	stop_progress_msg(p_progress, _("done"));
       }
 2:  11c88845e7 = 2:  498258b120 trace2:data: add trace2 data to midx
 3:  ced7f1cb34 < -:  ---------- midx: verify: add midx packfiles to the packed_git list
 -:  ---------- > 3:  8a60902d65 midx: add progress indicators in multi-pack-index verify
 4:  e2dd99911f ! 4:  7e98ea005a midx: verify: group objects by packfile to speed up object verification
     @@ -1,8 +1,8 @@
      Author: Jeff Hostetler <jeffhost@microsoft.com>
      
     -    midx: verify: group objects by packfile to speed up object verification
     +    midx: during verify group objects by packfile to speed verification
      
     -    Teach `multi-pack-index verify` to sort the set of objects by
     +    Teach `multi-pack-index verify` to sort the set of object by
          packfile so that only one packfile needs to be open at a time.
      
          This is a performance improvement.  Previously, objects were
     @@ -34,64 +34,21 @@
      +	return b->pack_int_id - a->pack_int_id;
      +}
      +
     -+/*
     -+ * Limit calls to display_progress() for performance reasons.
     -+ * The interval here was arbitrarily chosen.
     -+ */
     -+#define SPARSE_PROGRESS_INTERVAL (1 << 12)
     -+#define midx_display_sparse_progress(progress, n) \
     -+	do { \
     -+		uint64_t _n = (n); \
     -+		if ((_n & (SPARSE_PROGRESS_INTERVAL - 1)) == 0)	\
     -+			display_progress(progress, _n); \
     -+	} while (0)
     -+
     + /*
     +  * Limit calls to display_progress() for performance reasons.
     +  * The interval here was arbitrarily chosen.
     +@@
     + 
       int verify_midx_file(const char *object_dir)
       {
     --	uint32_t i;
      +	struct pair_pos_vs_id *pairs = NULL;
     -+	uint32_t i, k;
     + 	uint32_t i;
       	struct progress *progress;
       	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
     - 	verify_midx_error = 0;
     -@@
     - 	if (!m)
     - 		return 0;
     - 
     -+	progress = start_progress(_("Looking for referenced packfiles"),
     -+				  m->num_packs);
     - 	for (i = 0; i < m->num_packs; i++) {
     - 		if (prepare_midx_pack(m, i))
     - 			midx_report("failed to load pack in position %d", i);
     - 
     - 		if (m->packs[i])
     - 			install_packed_git(the_repository, m->packs[i]);
     -+
     -+		display_progress(progress, i + 1);
     - 	}
     -+	stop_progress(&progress);
     - 
     - 	for (i = 0; i < 255; i++) {
     - 		uint32_t oid_fanout1 = ntohl(m->chunk_oid_fanout[i]);
      @@
     - 				    i, oid_fanout1, oid_fanout2, i + 1);
       	}
     + 	stop_progress(&progress);
       
     -+	progress = start_sparse_progress(_("Verifying OID order in MIDX"),
     -+					 m->num_objects - 1);
     - 	for (i = 0; i < m->num_objects - 1; i++) {
     - 		struct object_id oid1, oid2;
     - 
     -@@
     - 		if (oidcmp(&oid1, &oid2) >= 0)
     - 			midx_report(_("oid lookup out of order: oid[%d] = %s >= %s = oid[%d]"),
     - 				    i, oid_to_hex(&oid1), oid_to_hex(&oid2), i + 1);
     -+
     -+		midx_display_sparse_progress(progress, i + 1);
     - 	}
     -+	stop_progress(&progress);
     - 
     --	progress = start_progress(_("Verifying object offsets"), m->num_objects);
      +	/*
      +	 * Create an array mapping each object to its packfile id.  Sort it
      +	 * to group the objects by packfile.  Use this permutation to visit
     @@ -99,37 +56,37 @@
      +	 * time.
      +	 */
      +	ALLOC_ARRAY(pairs, m->num_objects);
     - 	for (i = 0; i < m->num_objects; i++) {
     ++	for (i = 0; i < m->num_objects; i++) {
      +		pairs[i].pos = i;
      +		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
      +	}
      +
     -+	progress = start_sparse_progress(
     -+		_("Sorting objects by packfile"), m->num_objects);
     ++	progress = start_sparse_progress(_("Sorting objects by packfile"),
     ++					 m->num_objects);
      +	display_progress(progress, 0); /* TODO: Measure QSORT() progress */
      +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
      +	stop_progress(&progress);
      +
     -+	progress = start_sparse_progress(_("Verifying object offsets"),
     -+					 m->num_objects);
     -+	for (k = 0; k < m->num_objects; k++) {
     + 	progress = start_sparse_progress(_("Verifying object offsets"), m->num_objects);
     + 	for (i = 0; i < m->num_objects; i++) {
       		struct object_id oid;
       		struct pack_entry e;
       		off_t m_offset, p_offset;
       
      -		nth_midxed_object_oid(&oid, m, i);
     -+		if (k > 0 && pairs[k-1].pack_int_id != pairs[k].pack_int_id &&
     -+		    m->packs[pairs[k-1].pack_int_id])
     ++		if (i > 0 && pairs[i-1].pack_int_id != pairs[i].pack_int_id &&
     ++		    m->packs[pairs[i-1].pack_int_id])
      +		{
     -+			close_pack_fd(m->packs[pairs[k-1].pack_int_id]);
     -+			close_pack_index(m->packs[pairs[k-1].pack_int_id]);
     ++			close_pack_fd(m->packs[pairs[i-1].pack_int_id]);
     ++			close_pack_index(m->packs[pairs[i-1].pack_int_id]);
      +		}
      +
     -+		nth_midxed_object_oid(&oid, m, pairs[k].pos);
     ++		nth_midxed_object_oid(&oid, m, pairs[i].pos);
     ++
       		if (!fill_midx_entry(&oid, &e, m)) {
       			midx_report(_("failed to load pack entry for oid[%d] = %s"),
      -				    i, oid_to_hex(&oid));
     -+				    pairs[k].pos, oid_to_hex(&oid));
     ++				    pairs[i].pos, oid_to_hex(&oid));
       			continue;
       		}
       
     @@ -138,10 +95,9 @@
       		if (m_offset != p_offset)
       			midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64),
      -				    i, oid_to_hex(&oid), m_offset, p_offset);
     -+				    pairs[k].pos, oid_to_hex(&oid), m_offset, p_offset);
     ++				    pairs[i].pos, oid_to_hex(&oid), m_offset, p_offset);
       
     --		display_progress(progress, i + 1);
     -+		midx_display_sparse_progress(progress, k + 1);
     + 		midx_display_sparse_progress(progress, i + 1);
       	}
       	stop_progress(&progress);
       

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v3 1/4] progress: add sparse mode to force 100% complete message
  2019-03-21 19:36   ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
@ 2019-03-21 19:36     ` Jeff Hostetler via GitGitGadget
  2019-03-21 19:36     ` [PATCH v3 2/4] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-21 19:36 UTC (permalink / raw)
  To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Add new start_sparse_progress() and start_delayed_sparse_progress()
constructors and "sparse" flag to struct progress.

Teach stop_progress() to force a 100% complete progress message before
printing the final "done" message when "sparse" is set.

Calling display_progress() for every item in a large set can
be expensive.  If callers try to filter this for performance
reasons, such as emitting every k-th item, progress would
not reach 100% unless they made a final call to display_progress()
with the item count before calling stop_progress().

Now this is automatic when "sparse" is set.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 progress.c | 38 +++++++++++++++++++++++++++++++++++---
 progress.h |  3 +++
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/progress.c b/progress.c
index 5a99c9fbf0..212d00e524 100644
--- a/progress.c
+++ b/progress.c
@@ -34,6 +34,7 @@ struct progress {
 	uint64_t total;
 	unsigned last_percent;
 	unsigned delay;
+	unsigned sparse;
 	struct throughput *throughput;
 	uint64_t start_ns;
 };
@@ -194,7 +195,7 @@ int display_progress(struct progress *progress, uint64_t n)
 }
 
 static struct progress *start_progress_delay(const char *title, uint64_t total,
-					     unsigned delay)
+					     unsigned delay, unsigned sparse)
 {
 	struct progress *progress = malloc(sizeof(*progress));
 	if (!progress) {
@@ -208,6 +209,7 @@ static struct progress *start_progress_delay(const char *title, uint64_t total,
 	progress->last_value = -1;
 	progress->last_percent = -1;
 	progress->delay = delay;
+	progress->sparse = sparse;
 	progress->throughput = NULL;
 	progress->start_ns = getnanotime();
 	set_progress_signal();
@@ -216,16 +218,46 @@ static struct progress *start_progress_delay(const char *title, uint64_t total,
 
 struct progress *start_delayed_progress(const char *title, uint64_t total)
 {
-	return start_progress_delay(title, total, 2);
+	return start_progress_delay(title, total, 2, 0);
 }
 
 struct progress *start_progress(const char *title, uint64_t total)
 {
-	return start_progress_delay(title, total, 0);
+	return start_progress_delay(title, total, 0, 0);
+}
+
+/*
+ * Here "sparse" means that the caller might use some sampling criteria to
+ * decide when to call display_progress() rather than calling it for every
+ * integer value in[0 .. total).  In particular, the caller might not call
+ * display_progress() for the last value in the range.
+ *
+ * When "sparse" is set, stop_progress() will automatically force the done
+ * message to show 100%.
+ */
+struct progress *start_sparse_progress(const char *title, uint64_t total)
+{
+	return start_progress_delay(title, total, 0, 1);
+}
+
+struct progress *start_delayed_sparse_progress(const char *title,
+					       uint64_t total)
+{
+	return start_progress_delay(title, total, 2, 1);
+}
+
+static void finish_if_sparse(struct progress *progress)
+{
+	if (progress &&
+	    progress->sparse &&
+	    progress->last_value != progress->total)
+		display_progress(progress, progress->total);
 }
 
 void stop_progress(struct progress **p_progress)
 {
+	finish_if_sparse(*p_progress);
+
 	stop_progress_msg(p_progress, _("done"));
 }
 
diff --git a/progress.h b/progress.h
index 70a4d4a0d6..7b725acc8d 100644
--- a/progress.h
+++ b/progress.h
@@ -6,7 +6,10 @@ struct progress;
 void display_throughput(struct progress *progress, uint64_t total);
 int display_progress(struct progress *progress, uint64_t n);
 struct progress *start_progress(const char *title, uint64_t total);
+struct progress *start_sparse_progress(const char *title, uint64_t total);
 struct progress *start_delayed_progress(const char *title, uint64_t total);
+struct progress *start_delayed_sparse_progress(const char *title,
+					       uint64_t total);
 void stop_progress(struct progress **progress);
 void stop_progress_msg(struct progress **progress, const char *msg);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 2/4] trace2:data: add trace2 data to midx
  2019-03-21 19:36   ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
  2019-03-21 19:36     ` [PATCH v3 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
@ 2019-03-21 19:36     ` Jeff Hostetler via GitGitGadget
  2019-03-21 19:36     ` [PATCH v3 3/4] midx: add progress indicators in multi-pack-index verify Jeff Hostetler via GitGitGadget
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-21 19:36 UTC (permalink / raw)
  To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Log multi-pack-index command mode.
Log number of objects and packfiles in the midx.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 builtin/multi-pack-index.c | 3 +++
 midx.c                     | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index fca70f8e4f..ae6e476ad5 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -3,6 +3,7 @@
 #include "config.h"
 #include "parse-options.h"
 #include "midx.h"
+#include "trace2.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
 	N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
@@ -40,6 +41,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 		return 1;
 	}
 
+	trace2_cmd_mode(argv[0]);
+
 	if (!strcmp(argv[0], "write"))
 		return write_midx_file(opts.object_dir);
 	if (!strcmp(argv[0], "verify"))
diff --git a/midx.c b/midx.c
index 8a505fd423..24141a7c62 100644
--- a/midx.c
+++ b/midx.c
@@ -8,6 +8,7 @@
 #include "sha1-lookup.h"
 #include "midx.h"
 #include "progress.h"
+#include "trace2.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -164,6 +165,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 			      m->pack_names[i]);
 	}
 
+	trace2_data_intmax("midx", the_repository, "load/num_packs", m->num_packs);
+	trace2_data_intmax("midx", the_repository, "load/num_objects", m->num_objects);
+
 	return m;
 
 cleanup_fail:
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 3/4] midx: add progress indicators in multi-pack-index verify
  2019-03-21 19:36   ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
  2019-03-21 19:36     ` [PATCH v3 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
  2019-03-21 19:36     ` [PATCH v3 2/4] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
@ 2019-03-21 19:36     ` Jeff Hostetler via GitGitGadget
  2019-03-21 19:36     ` [PATCH v3 4/4] midx: during verify group objects by packfile to speed verification Jeff Hostetler via GitGitGadget
  2019-03-22  5:37     ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Junio C Hamano
  4 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-21 19:36 UTC (permalink / raw)
  To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Add progress indicators to more parts of midx verify.
Use sparse progress indicator for object iteration.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 midx.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index 24141a7c62..e3919387d9 100644
--- a/midx.c
+++ b/midx.c
@@ -962,6 +962,18 @@ static void midx_report(const char *fmt, ...)
 	va_end(ap);
 }
 
+/*
+ * Limit calls to display_progress() for performance reasons.
+ * The interval here was arbitrarily chosen.
+ */
+#define SPARSE_PROGRESS_INTERVAL (1 << 12)
+#define midx_display_sparse_progress(progress, n) \
+	do { \
+		uint64_t _n = (n); \
+		if ((_n & (SPARSE_PROGRESS_INTERVAL - 1)) == 0) \
+			display_progress(progress, _n); \
+	} while (0)
+
 int verify_midx_file(const char *object_dir)
 {
 	uint32_t i;
@@ -972,10 +984,15 @@ int verify_midx_file(const char *object_dir)
 	if (!m)
 		return 0;
 
+	progress = start_progress(_("Looking for referenced packfiles"),
+				  m->num_packs);
 	for (i = 0; i < m->num_packs; i++) {
 		if (prepare_midx_pack(m, i))
 			midx_report("failed to load pack in position %d", i);
+
+		display_progress(progress, i + 1);
 	}
+	stop_progress(&progress);
 
 	for (i = 0; i < 255; i++) {
 		uint32_t oid_fanout1 = ntohl(m->chunk_oid_fanout[i]);
@@ -986,6 +1003,8 @@ int verify_midx_file(const char *object_dir)
 				    i, oid_fanout1, oid_fanout2, i + 1);
 	}
 
+	progress = start_sparse_progress(_("Verifying OID order in MIDX"),
+					 m->num_objects - 1);
 	for (i = 0; i < m->num_objects - 1; i++) {
 		struct object_id oid1, oid2;
 
@@ -995,9 +1014,12 @@ int verify_midx_file(const char *object_dir)
 		if (oidcmp(&oid1, &oid2) >= 0)
 			midx_report(_("oid lookup out of order: oid[%d] = %s >= %s = oid[%d]"),
 				    i, oid_to_hex(&oid1), oid_to_hex(&oid2), i + 1);
+
+		midx_display_sparse_progress(progress, i + 1);
 	}
+	stop_progress(&progress);
 
-	progress = start_progress(_("Verifying object offsets"), m->num_objects);
+	progress = start_sparse_progress(_("Verifying object offsets"), m->num_objects);
 	for (i = 0; i < m->num_objects; i++) {
 		struct object_id oid;
 		struct pack_entry e;
@@ -1023,7 +1045,7 @@ int verify_midx_file(const char *object_dir)
 			midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64),
 				    i, oid_to_hex(&oid), m_offset, p_offset);
 
-		display_progress(progress, i + 1);
+		midx_display_sparse_progress(progress, i + 1);
 	}
 	stop_progress(&progress);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 4/4] midx: during verify group objects by packfile to speed verification
  2019-03-21 19:36   ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
                       ` (2 preceding siblings ...)
  2019-03-21 19:36     ` [PATCH v3 3/4] midx: add progress indicators in multi-pack-index verify Jeff Hostetler via GitGitGadget
@ 2019-03-21 19:36     ` Jeff Hostetler via GitGitGadget
  2019-03-22  5:37     ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Junio C Hamano
  4 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-21 19:36 UTC (permalink / raw)
  To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Teach `multi-pack-index verify` to sort the set of object by
packfile so that only one packfile needs to be open at a time.

This is a performance improvement.  Previously, objects were
verified in OID order.  This essentially requires all packfiles
to be open at the same time.  If the number of packfiles exceeds
the open file limit, packfiles would be LRU-closed and re-opened
many times.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 midx.c     | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
 packfile.c |  2 +-
 packfile.h |  2 ++
 3 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/midx.c b/midx.c
index e3919387d9..f1cd868f8c 100644
--- a/midx.c
+++ b/midx.c
@@ -962,6 +962,20 @@ static void midx_report(const char *fmt, ...)
 	va_end(ap);
 }
 
+struct pair_pos_vs_id
+{
+	uint32_t pos;
+	uint32_t pack_int_id;
+};
+
+static int compare_pair_pos_vs_id(const void *_a, const void *_b)
+{
+	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
+	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
+
+	return b->pack_int_id - a->pack_int_id;
+}
+
 /*
  * Limit calls to display_progress() for performance reasons.
  * The interval here was arbitrarily chosen.
@@ -976,6 +990,7 @@ static void midx_report(const char *fmt, ...)
 
 int verify_midx_file(const char *object_dir)
 {
+	struct pair_pos_vs_id *pairs = NULL;
 	uint32_t i;
 	struct progress *progress;
 	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
@@ -1019,16 +1034,42 @@ int verify_midx_file(const char *object_dir)
 	}
 	stop_progress(&progress);
 
+	/*
+	 * Create an array mapping each object to its packfile id.  Sort it
+	 * to group the objects by packfile.  Use this permutation to visit
+	 * each of the objects and only require 1 packfile to be open at a
+	 * time.
+	 */
+	ALLOC_ARRAY(pairs, m->num_objects);
+	for (i = 0; i < m->num_objects; i++) {
+		pairs[i].pos = i;
+		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
+	}
+
+	progress = start_sparse_progress(_("Sorting objects by packfile"),
+					 m->num_objects);
+	display_progress(progress, 0); /* TODO: Measure QSORT() progress */
+	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
+	stop_progress(&progress);
+
 	progress = start_sparse_progress(_("Verifying object offsets"), m->num_objects);
 	for (i = 0; i < m->num_objects; i++) {
 		struct object_id oid;
 		struct pack_entry e;
 		off_t m_offset, p_offset;
 
-		nth_midxed_object_oid(&oid, m, i);
+		if (i > 0 && pairs[i-1].pack_int_id != pairs[i].pack_int_id &&
+		    m->packs[pairs[i-1].pack_int_id])
+		{
+			close_pack_fd(m->packs[pairs[i-1].pack_int_id]);
+			close_pack_index(m->packs[pairs[i-1].pack_int_id]);
+		}
+
+		nth_midxed_object_oid(&oid, m, pairs[i].pos);
+
 		if (!fill_midx_entry(&oid, &e, m)) {
 			midx_report(_("failed to load pack entry for oid[%d] = %s"),
-				    i, oid_to_hex(&oid));
+				    pairs[i].pos, oid_to_hex(&oid));
 			continue;
 		}
 
@@ -1043,11 +1084,13 @@ int verify_midx_file(const char *object_dir)
 
 		if (m_offset != p_offset)
 			midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64),
-				    i, oid_to_hex(&oid), m_offset, p_offset);
+				    pairs[i].pos, oid_to_hex(&oid), m_offset, p_offset);
 
 		midx_display_sparse_progress(progress, i + 1);
 	}
 	stop_progress(&progress);
 
+	free(pairs);
+
 	return verify_midx_error;
 }
diff --git a/packfile.c b/packfile.c
index 16bcb75262..d2bcb2f860 100644
--- a/packfile.c
+++ b/packfile.c
@@ -309,7 +309,7 @@ void close_pack_windows(struct packed_git *p)
 	}
 }
 
-static int close_pack_fd(struct packed_git *p)
+int close_pack_fd(struct packed_git *p)
 {
 	if (p->pack_fd < 0)
 		return 0;
diff --git a/packfile.h b/packfile.h
index d70c6d9afb..b1c18504eb 100644
--- a/packfile.h
+++ b/packfile.h
@@ -76,6 +76,8 @@ extern int open_pack_index(struct packed_git *);
  */
 extern void close_pack_index(struct packed_git *);
 
+int close_pack_fd(struct packed_git *p);
+
 extern uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
 
 extern unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 0/4] multi-pack-index: fix verify on large repos
  2019-03-21 19:36   ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
                       ` (3 preceding siblings ...)
  2019-03-21 19:36     ` [PATCH v3 4/4] midx: during verify group objects by packfile to speed verification Jeff Hostetler via GitGitGadget
@ 2019-03-22  5:37     ` Junio C Hamano
  2019-03-25 17:18       ` Jeff Hostetler
  4 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2019-03-22  5:37 UTC (permalink / raw)
  To: Jeff Hostetler via GitGitGadget; +Cc: git, avarab

"Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Teach "multi-pack-index verify" to handle cases where the number of
> packfiles exceeds the open file handle limit.
>
> The first commit fixes a problem that prevented the LRU-style
> close_one_pack() mechanism from working which caused midx verify to run out
> of file descriptors.
>
> The second commit teaches midx verify to sort the set of objects to verify
> by packfile rather than verifying them in OID order. This eliminates the
> need to have more than one packfile/idx open at the same time.
>
> With the second commit, runtime on 3600 packfiles went from 12 minutes to 25
> seconds.

These reference to the first and second commit might have become
stale across interations, but logically it makes sense---the first
point is about correctness (i.e. do not die by running out of fds)
and the second one is about usable-performance.

But in this round (possibly in the previous one, too?) the "group
objects by packfile" one addresses both points?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 0/4] multi-pack-index: fix verify on large repos
  2019-03-22  5:37     ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Junio C Hamano
@ 2019-03-25 17:18       ` Jeff Hostetler
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler @ 2019-03-25 17:18 UTC (permalink / raw)
  To: Junio C Hamano, Jeff Hostetler via GitGitGadget; +Cc: git, avarab



On 3/22/2019 1:37 AM, Junio C Hamano wrote:
> "Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> Teach "multi-pack-index verify" to handle cases where the number of
>> packfiles exceeds the open file handle limit.
>>
>> The first commit fixes a problem that prevented the LRU-style
>> close_one_pack() mechanism from working which caused midx verify to run out
>> of file descriptors.
>>
>> The second commit teaches midx verify to sort the set of objects to verify
>> by packfile rather than verifying them in OID order. This eliminates the
>> need to have more than one packfile/idx open at the same time.
>>
>> With the second commit, runtime on 3600 packfiles went from 12 minutes to 25
>> seconds.
> 
> These reference to the first and second commit might have become
> stale across interations, but logically it makes sense---the first
> point is about correctness (i.e. do not die by running out of fds)
> and the second one is about usable-performance.
> 
> But in this round (possibly in the previous one, too?) the "group
> objects by packfile" one addresses both points?
> 


Sorry, I forgot to remote the stale content in the cover letter for
the V3 version.

This version just has the sorting by packfile commit and because it only
keeps 1 packfile open at a time, it does not need the change to add 
packfiles to the packed-git list because it does not trigger the
close_one_pack() problem.

We suspect there are other places (not-yet-observed) where the design of
the all-packs and packed-git lists will lead to similar fd exhaustion
errors and want to fix it properly in the packfile and/or midx code.
We'll address this potential problem in a future patch series.

Jeff

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2019-03-25 17:18 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-18 14:29 [PATCH 0/3] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
2019-03-18 14:29 ` [PATCH 1/3] midx: verify: add midx packfiles to the packed_git list Jeff Hostetler via GitGitGadget
2019-03-18 14:29 ` [PATCH 2/3] midx: verify: group objects by packfile to speed up object verification Jeff Hostetler via GitGitGadget
2019-03-18 15:53   ` Ævar Arnfjörð Bjarmason
2019-03-18 21:39     ` Jeff Hostetler
2019-03-18 22:02       ` Ævar Arnfjörð Bjarmason
2019-03-19  4:08         ` Junio C Hamano
2019-03-19 14:00         ` Jeff Hostetler
2019-03-19 14:06           ` Ævar Arnfjörð Bjarmason
2019-03-18 14:29 ` [PATCH 3/3] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
2019-03-19 14:37 ` [PATCH v2 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
2019-03-19 14:37   ` [PATCH v2 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
2019-03-19 16:42     ` Eric Sunshine
2019-03-19 18:33       ` Jeff Hostetler
2019-03-19 18:46         ` Eric Sunshine
2019-03-19 14:37   ` [PATCH v2 2/4] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
2019-03-19 14:37   ` [PATCH v2 3/4] midx: verify: add midx packfiles to the packed_git list Jeff Hostetler via GitGitGadget
2019-03-19 19:53     ` Jeff Hostetler
2019-03-19 14:37   ` [PATCH v2 4/4] midx: verify: group objects by packfile to speed up object verification Jeff Hostetler via GitGitGadget
2019-03-21 19:36   ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
2019-03-21 19:36     ` [PATCH v3 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
2019-03-21 19:36     ` [PATCH v3 2/4] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
2019-03-21 19:36     ` [PATCH v3 3/4] midx: add progress indicators in multi-pack-index verify Jeff Hostetler via GitGitGadget
2019-03-21 19:36     ` [PATCH v3 4/4] midx: during verify group objects by packfile to speed verification Jeff Hostetler via GitGitGadget
2019-03-22  5:37     ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Junio C Hamano
2019-03-25 17:18       ` Jeff Hostetler

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).