* [PATCH v2 1/4] progress: add sparse mode to force 100% complete message
2019-03-19 14:37 ` [PATCH v2 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
@ 2019-03-19 14:37 ` Jeff Hostetler via GitGitGadget
2019-03-19 16:42 ` Eric Sunshine
2019-03-19 14:37 ` [PATCH v2 2/4] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
` (3 subsequent siblings)
4 siblings, 1 reply; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-19 14:37 UTC (permalink / raw)
To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler
From: Jeff Hostetler <jeffhost@microsoft.com>
Add new start_sparse_progress() and start_delayed_sparse_progress()
constructors and "sparse" flag to struct progress.
Teach stop_progress() to force a 100% complete progress message before
printing the final "done" message when "sparse" is set.
Calling display_progress() for every item in a large set can
be expensive. If callers try to filter this for performance
reasons, such as emitting every k-th item, progress would
not reach 100% unless they made a final call to display_progress()
with the item count before calling stop_progress().
Now this is automatic when "sparse" is set.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
progress.c | 40 +++++++++++++++++++++++++++++++++++++---
progress.h | 3 +++
2 files changed, 40 insertions(+), 3 deletions(-)
diff --git a/progress.c b/progress.c
index 5a99c9fbf0..3d75376c96 100644
--- a/progress.c
+++ b/progress.c
@@ -34,6 +34,7 @@ struct progress {
uint64_t total;
unsigned last_percent;
unsigned delay;
+ unsigned sparse;
struct throughput *throughput;
uint64_t start_ns;
};
@@ -194,7 +195,7 @@ int display_progress(struct progress *progress, uint64_t n)
}
static struct progress *start_progress_delay(const char *title, uint64_t total,
- unsigned delay)
+ unsigned delay, unsigned sparse)
{
struct progress *progress = malloc(sizeof(*progress));
if (!progress) {
@@ -208,6 +209,7 @@ static struct progress *start_progress_delay(const char *title, uint64_t total,
progress->last_value = -1;
progress->last_percent = -1;
progress->delay = delay;
+ progress->sparse = sparse;
progress->throughput = NULL;
progress->start_ns = getnanotime();
set_progress_signal();
@@ -216,16 +218,48 @@ static struct progress *start_progress_delay(const char *title, uint64_t total,
struct progress *start_delayed_progress(const char *title, uint64_t total)
{
- return start_progress_delay(title, total, 2);
+ return start_progress_delay(title, total, 2, 0);
}
struct progress *start_progress(const char *title, uint64_t total)
{
- return start_progress_delay(title, total, 0);
+ return start_progress_delay(title, total, 0, 0);
+}
+
+/*
+ * Here "sparse" means that the caller might use some sampling criteria to
+ * decide when to call display_progress() rather than calling it for every
+ * integer value in[0 .. total). In particular, the caller might not call
+ * display_progress() for the last value in the range.
+ *
+ * When "sparse" is set, stop_progress() will automatically force the done
+ * message to show 100%.
+ */
+struct progress *start_sparse_progress(const char *title, uint64_t total)
+{
+ return start_progress_delay(title, total, 0, 1);
+}
+
+struct progress *start_delayed_sparse_progress(const char *title,
+ uint64_t total)
+{
+ return start_progress_delay(title, total, 2, 1);
+}
+
+static void finish_if_sparse(struct progress **p_progress)
+{
+ struct progress *progress = *p_progress;
+
+ if (progress &&
+ progress->sparse &&
+ progress->last_value != progress->total)
+ display_progress(progress, progress->total);
}
void stop_progress(struct progress **p_progress)
{
+ finish_if_sparse(p_progress);
+
stop_progress_msg(p_progress, _("done"));
}
diff --git a/progress.h b/progress.h
index 70a4d4a0d6..7b725acc8d 100644
--- a/progress.h
+++ b/progress.h
@@ -6,7 +6,10 @@ struct progress;
void display_throughput(struct progress *progress, uint64_t total);
int display_progress(struct progress *progress, uint64_t n);
struct progress *start_progress(const char *title, uint64_t total);
+struct progress *start_sparse_progress(const char *title, uint64_t total);
struct progress *start_delayed_progress(const char *title, uint64_t total);
+struct progress *start_delayed_sparse_progress(const char *title,
+ uint64_t total);
void stop_progress(struct progress **progress);
void stop_progress_msg(struct progress **progress, const char *msg);
--
gitgitgadget
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v2 1/4] progress: add sparse mode to force 100% complete message
2019-03-19 14:37 ` [PATCH v2 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
@ 2019-03-19 16:42 ` Eric Sunshine
2019-03-19 18:33 ` Jeff Hostetler
0 siblings, 1 reply; 26+ messages in thread
From: Eric Sunshine @ 2019-03-19 16:42 UTC (permalink / raw)
To: Jeff Hostetler via GitGitGadget
Cc: Git List, Ævar Arnfjörð Bjarmason, Junio C Hamano,
Jeff Hostetler
On Tue, Mar 19, 2019 at 10:37 AM Jeff Hostetler via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> Add new start_sparse_progress() and start_delayed_sparse_progress()
> constructors and "sparse" flag to struct progress.
>
> Teach stop_progress() to force a 100% complete progress message before
> printing the final "done" message when "sparse" is set.
>
> Calling display_progress() for every item in a large set can
> be expensive. If callers try to filter this for performance
> reasons, such as emitting every k-th item, progress would
> not reach 100% unless they made a final call to display_progress()
> with the item count before calling stop_progress().
>
> Now this is automatic when "sparse" is set.
>
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> ---
> diff --git a/progress.c b/progress.c
> +/*
> + * Here "sparse" means that the caller might use some sampling criteria to
> + * decide when to call display_progress() rather than calling it for every
> + * integer value in[0 .. total). In particular, the caller might not call
> + * display_progress() for the last value in the range.
> + *
> + * When "sparse" is set, stop_progress() will automatically force the done
> + * message to show 100%.
> + */
> +static void finish_if_sparse(struct progress **p_progress)
> +{
> + struct progress *progress = *p_progress;
> +
> + if (progress &&
> + progress->sparse &&
> + progress->last_value != progress->total)
> + display_progress(progress, progress->total);
> }
There's no reason for this function to take a 'struct progress **'
rather than the simpler 'struct progress *', and doing so just
confuses the reader.
> void stop_progress(struct progress **p_progress)
> {
> + finish_if_sparse(p_progress);
> +
> stop_progress_msg(p_progress, _("done"));
> }
Rather than adding a new "sparse" mode, I wonder if this entire
concept can be boiled down to a single new function:
void finish_progress(struct progress **p_progress, const char *msg)
{
struct progress *progress = *p_progress;
if (progress && progress->last_value != progress->total)
display_progress(progress, progress->total);
if (msg)
stop_progress_msg(p_progress, msg);
else
stop_progress(p_progress);
}
without having to make any other changes to the implementation or add
a new field to the structure. It would mean, though, that the caller
would have to remember to invoke finish_progress() rather than
stop_progress(). Which leads one to wonder why this functionality is
needed at all since it's easy enough for a caller to make the one
extra call to simulate this:
/* client code */
if (progress)
display_progress(progress, progress->total);
stop_progress(&progress);
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 1/4] progress: add sparse mode to force 100% complete message
2019-03-19 16:42 ` Eric Sunshine
@ 2019-03-19 18:33 ` Jeff Hostetler
2019-03-19 18:46 ` Eric Sunshine
0 siblings, 1 reply; 26+ messages in thread
From: Jeff Hostetler @ 2019-03-19 18:33 UTC (permalink / raw)
To: Eric Sunshine, Jeff Hostetler via GitGitGadget
Cc: Git List, Ævar Arnfjörð Bjarmason, Junio C Hamano,
Jeff Hostetler
On 3/19/2019 12:42 PM, Eric Sunshine wrote:
> On Tue, Mar 19, 2019 at 10:37 AM Jeff Hostetler via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> Add new start_sparse_progress() and start_delayed_sparse_progress()
>> constructors and "sparse" flag to struct progress.
>>
>> Teach stop_progress() to force a 100% complete progress message before
>> printing the final "done" message when "sparse" is set.
>>
>> Calling display_progress() for every item in a large set can
>> be expensive. If callers try to filter this for performance
>> reasons, such as emitting every k-th item, progress would
>> not reach 100% unless they made a final call to display_progress()
>> with the item count before calling stop_progress().
>>
>> Now this is automatic when "sparse" is set.
>>
>> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
>> ---
>> diff --git a/progress.c b/progress.c
>> +/*
>> + * Here "sparse" means that the caller might use some sampling criteria to
>> + * decide when to call display_progress() rather than calling it for every
>> + * integer value in[0 .. total). In particular, the caller might not call
>> + * display_progress() for the last value in the range.
>> + *
>> + * When "sparse" is set, stop_progress() will automatically force the done
>> + * message to show 100%.
>> + */
>> +static void finish_if_sparse(struct progress **p_progress)
>> +{
>> + struct progress *progress = *p_progress;
>> +
>> + if (progress &&
>> + progress->sparse &&
>> + progress->last_value != progress->total)
>> + display_progress(progress, progress->total);
>> }
>
> There's no reason for this function to take a 'struct progress **'
> rather than the simpler 'struct progress *', and doing so just
> confuses the reader.
I was just trying to match the existing API in the stop_progress()
and stop_progress_msg() routines. But yes, I can simplify this.
>
>> void stop_progress(struct progress **p_progress)
>> {
>> + finish_if_sparse(p_progress);
>> +
>> stop_progress_msg(p_progress, _("done"));
>> }
>
> Rather than adding a new "sparse" mode, I wonder if this entire
> concept can be boiled down to a single new function:
>
> void finish_progress(struct progress **p_progress, const char *msg)
> {
> struct progress *progress = *p_progress;
> if (progress && progress->last_value != progress->total)
> display_progress(progress, progress->total);
> if (msg)
> stop_progress_msg(p_progress, msg);
> else
> stop_progress(p_progress);
> }
>
> without having to make any other changes to the implementation or add
> a new field to the structure.
The existing model has start_progress() and start_delayed_progress().
I was following that model and added the new option at the start.
I'm not planning on adding any additional flags, but if we had some
we'd want them available on the startup next to this one.
> It would mean, though, that the caller
> would have to remember to invoke finish_progress() rather than
> stop_progress().
Right, I was trying to isolate the change to the setup, so that loop
bottoms and any in-loop return points don't all have to worry about
whether to call stop_ or finish_.
> Which leads one to wonder why this functionality is
> needed at all since it's easy enough for a caller to make the one
> extra call to simulate this:
>
> /* client code */
> if (progress)
> display_progress(progress, progress->total);
> stop_progress(&progress);
>
"struct progress" is private and I didn't want to publish it just for
this. And again as with the finish_ remarks, that leads to more
places that would need to be updated or maintained.
And you're right, callers could always just call:
x = ...whatever...
progress = start_progress(..., x);
...loop or whatever...
display_progress(progress, x);
stop_progress(...);
but that puts the burden on the caller to keep a copy of "x"
that matches the original value so that we get the 100% message.
I was doing that for a while, but after 3 or 4 progress loops,
I found myself wanting the progress routines to handle that
bookkeeping for me.
Jeff
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 1/4] progress: add sparse mode to force 100% complete message
2019-03-19 18:33 ` Jeff Hostetler
@ 2019-03-19 18:46 ` Eric Sunshine
0 siblings, 0 replies; 26+ messages in thread
From: Eric Sunshine @ 2019-03-19 18:46 UTC (permalink / raw)
To: Jeff Hostetler
Cc: Jeff Hostetler via GitGitGadget, Git List,
Ævar Arnfjörð Bjarmason, Junio C Hamano,
Jeff Hostetler
On Tue, Mar 19, 2019 at 2:33 PM Jeff Hostetler <git@jeffhostetler.com> wrote:
> On 3/19/2019 12:42 PM, Eric Sunshine wrote:
> > Rather than adding a new "sparse" mode, I wonder if this entire
> > concept can be boiled down to a single new function:
> >
> > void finish_progress(struct progress **p_progress, const char *msg)
> > {
> > [...]
> > }
> >
> > without having to make any other changes to the implementation or add
> > a new field to the structure.
>
> The existing model has start_progress() and start_delayed_progress().
> I was following that model and added the new option at the start.
> I'm not planning on adding any additional flags, but if we had some
> we'd want them available on the startup next to this one.
Perhaps it makes sense to just take a 'flags' argument instead of
'sparse' so we don't have to worry about this going forward. In other
words:
#define PROGRESS_DELAYED (1 << 0)
#define PROGRESS_SPARSE (1 << 1)
struct progress *start_progress_x(const char *title, uint64_t total,
unsigned flags);
which covers "delayed" start and "sparse". ("_x" is a placeholder
since I can't think of a good name).
> > It would mean, though, that the caller
> > would have to remember to invoke finish_progress() rather than
> > stop_progress().
>
> Right, I was trying to isolate the change to the setup, so that loop
> bottoms and any in-loop return points don't all have to worry about
> whether to call stop_ or finish_.
Yes, I understand the benefit.
Anyhow, my comments are akin to bikeshedding, thus not necessarily actionable.
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v2 2/4] trace2:data: add trace2 data to midx
2019-03-19 14:37 ` [PATCH v2 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
2019-03-19 14:37 ` [PATCH v2 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
@ 2019-03-19 14:37 ` Jeff Hostetler via GitGitGadget
2019-03-19 14:37 ` [PATCH v2 3/4] midx: verify: add midx packfiles to the packed_git list Jeff Hostetler via GitGitGadget
` (2 subsequent siblings)
4 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-19 14:37 UTC (permalink / raw)
To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler
From: Jeff Hostetler <jeffhost@microsoft.com>
Log multi-pack-index command mode.
Log number of objects and packfiles in the midx.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
builtin/multi-pack-index.c | 3 +++
midx.c | 4 ++++
2 files changed, 7 insertions(+)
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index fca70f8e4f..ae6e476ad5 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -3,6 +3,7 @@
#include "config.h"
#include "parse-options.h"
#include "midx.h"
+#include "trace2.h"
static char const * const builtin_multi_pack_index_usage[] = {
N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
@@ -40,6 +41,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
return 1;
}
+ trace2_cmd_mode(argv[0]);
+
if (!strcmp(argv[0], "write"))
return write_midx_file(opts.object_dir);
if (!strcmp(argv[0], "verify"))
diff --git a/midx.c b/midx.c
index 8a505fd423..24141a7c62 100644
--- a/midx.c
+++ b/midx.c
@@ -8,6 +8,7 @@
#include "sha1-lookup.h"
#include "midx.h"
#include "progress.h"
+#include "trace2.h"
#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
#define MIDX_VERSION 1
@@ -164,6 +165,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
m->pack_names[i]);
}
+ trace2_data_intmax("midx", the_repository, "load/num_packs", m->num_packs);
+ trace2_data_intmax("midx", the_repository, "load/num_objects", m->num_objects);
+
return m;
cleanup_fail:
--
gitgitgadget
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v2 3/4] midx: verify: add midx packfiles to the packed_git list
2019-03-19 14:37 ` [PATCH v2 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
2019-03-19 14:37 ` [PATCH v2 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
2019-03-19 14:37 ` [PATCH v2 2/4] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
@ 2019-03-19 14:37 ` Jeff Hostetler via GitGitGadget
2019-03-19 19:53 ` Jeff Hostetler
2019-03-19 14:37 ` [PATCH v2 4/4] midx: verify: group objects by packfile to speed up object verification Jeff Hostetler via GitGitGadget
2019-03-21 19:36 ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
4 siblings, 1 reply; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-19 14:37 UTC (permalink / raw)
To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler
From: Jeff Hostetler <jeffhost@microsoft.com>
Fix "git multi-pack-index verify" to handle repos with thousands
of packfiles.
Midx verify adds the individual "packed_git" structures to the
multi_pack_index.packs array, but it does not add them to the
"repository.objects.packed_git" list. During the verification
code, each packfile is opened and scanned. And "pack_open_fds"
is incremented. If "pack_open_fds" equals the "pack_max_fds"
open_packed_git_1() calls close_one_pack() to LRU-style close
an already open packfile. But because the packfiles were never
added to the "packed_git" list, close_one_pack() does nothing.
If there are very many packfiles, Git runs out of file descriptors
and fails.
Note that this was observed on Windows when build with GCC and
in a repository with more than (2048-25) packfiles.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
midx.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/midx.c b/midx.c
index 24141a7c62..b2f33bcd90 100644
--- a/midx.c
+++ b/midx.c
@@ -975,6 +975,9 @@ int verify_midx_file(const char *object_dir)
for (i = 0; i < m->num_packs; i++) {
if (prepare_midx_pack(m, i))
midx_report("failed to load pack in position %d", i);
+
+ if (m->packs[i])
+ install_packed_git(the_repository, m->packs[i]);
}
for (i = 0; i < 255; i++) {
--
gitgitgadget
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v2 3/4] midx: verify: add midx packfiles to the packed_git list
2019-03-19 14:37 ` [PATCH v2 3/4] midx: verify: add midx packfiles to the packed_git list Jeff Hostetler via GitGitGadget
@ 2019-03-19 19:53 ` Jeff Hostetler
0 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler @ 2019-03-19 19:53 UTC (permalink / raw)
To: Jeff Hostetler via GitGitGadget, git
Cc: avarab, Junio C Hamano, Jeff Hostetler
On 3/19/2019 10:37 AM, Jeff Hostetler via GitGitGadget wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
>
> Fix "git multi-pack-index verify" to handle repos with thousands
> of packfiles.
>
> Midx verify adds the individual "packed_git" structures to the
> multi_pack_index.packs array, but it does not add them to the
> "repository.objects.packed_git" list. During the verification
> code, each packfile is opened and scanned. And "pack_open_fds"
> is incremented. If "pack_open_fds" equals the "pack_max_fds"
> open_packed_git_1() calls close_one_pack() to LRU-style close
> an already open packfile. But because the packfiles were never
> added to the "packed_git" list, close_one_pack() does nothing.
> If there are very many packfiles, Git runs out of file descriptors
> and fails.
>
> Note that this was observed on Windows when build with GCC and
> in a repository with more than (2048-25) packfiles.
>
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> ---
> midx.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/midx.c b/midx.c
> index 24141a7c62..b2f33bcd90 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -975,6 +975,9 @@ int verify_midx_file(const char *object_dir)
> for (i = 0; i < m->num_packs; i++) {
> if (prepare_midx_pack(m, i))
> midx_report("failed to load pack in position %d", i);
> +
> + if (m->packs[i])
> + install_packed_git(the_repository, m->packs[i]);
> }
>
> for (i = 0; i < 255; i++) {
>
I'd like to drop this commit from this series.
I talked with Stolee offline about this. It does
address the problem in this one instance, but it leads
us to think about other places where there may be a
similar problem.
Also, the sort by packfile in the next commit in this
series means we'll only have 1 packfile open at a time
and so this commit isn't needed in this particular case.
Thanks,
Jeff
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v2 4/4] midx: verify: group objects by packfile to speed up object verification
2019-03-19 14:37 ` [PATCH v2 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
` (2 preceding siblings ...)
2019-03-19 14:37 ` [PATCH v2 3/4] midx: verify: add midx packfiles to the packed_git list Jeff Hostetler via GitGitGadget
@ 2019-03-19 14:37 ` Jeff Hostetler via GitGitGadget
2019-03-21 19:36 ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
4 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-19 14:37 UTC (permalink / raw)
To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler
From: Jeff Hostetler <jeffhost@microsoft.com>
Teach `multi-pack-index verify` to sort the set of objects by
packfile so that only one packfile needs to be open at a time.
This is a performance improvement. Previously, objects were
verified in OID order. This essentially requires all packfiles
to be open at the same time. If the number of packfiles exceeds
the open file limit, packfiles would be LRU-closed and re-opened
many times.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
midx.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++-----
packfile.c | 2 +-
packfile.h | 2 ++
3 files changed, 74 insertions(+), 7 deletions(-)
diff --git a/midx.c b/midx.c
index b2f33bcd90..a37c09f2a5 100644
--- a/midx.c
+++ b/midx.c
@@ -962,9 +962,36 @@ static void midx_report(const char *fmt, ...)
va_end(ap);
}
+struct pair_pos_vs_id
+{
+ uint32_t pos;
+ uint32_t pack_int_id;
+};
+
+static int compare_pair_pos_vs_id(const void *_a, const void *_b)
+{
+ struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
+ struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
+
+ return b->pack_int_id - a->pack_int_id;
+}
+
+/*
+ * Limit calls to display_progress() for performance reasons.
+ * The interval here was arbitrarily chosen.
+ */
+#define SPARSE_PROGRESS_INTERVAL (1 << 12)
+#define midx_display_sparse_progress(progress, n) \
+ do { \
+ uint64_t _n = (n); \
+ if ((_n & (SPARSE_PROGRESS_INTERVAL - 1)) == 0) \
+ display_progress(progress, _n); \
+ } while (0)
+
int verify_midx_file(const char *object_dir)
{
- uint32_t i;
+ struct pair_pos_vs_id *pairs = NULL;
+ uint32_t i, k;
struct progress *progress;
struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
verify_midx_error = 0;
@@ -972,13 +999,18 @@ int verify_midx_file(const char *object_dir)
if (!m)
return 0;
+ progress = start_progress(_("Looking for referenced packfiles"),
+ m->num_packs);
for (i = 0; i < m->num_packs; i++) {
if (prepare_midx_pack(m, i))
midx_report("failed to load pack in position %d", i);
if (m->packs[i])
install_packed_git(the_repository, m->packs[i]);
+
+ display_progress(progress, i + 1);
}
+ stop_progress(&progress);
for (i = 0; i < 255; i++) {
uint32_t oid_fanout1 = ntohl(m->chunk_oid_fanout[i]);
@@ -989,6 +1021,8 @@ int verify_midx_file(const char *object_dir)
i, oid_fanout1, oid_fanout2, i + 1);
}
+ progress = start_sparse_progress(_("Verifying OID order in MIDX"),
+ m->num_objects - 1);
for (i = 0; i < m->num_objects - 1; i++) {
struct object_id oid1, oid2;
@@ -998,18 +1032,47 @@ int verify_midx_file(const char *object_dir)
if (oidcmp(&oid1, &oid2) >= 0)
midx_report(_("oid lookup out of order: oid[%d] = %s >= %s = oid[%d]"),
i, oid_to_hex(&oid1), oid_to_hex(&oid2), i + 1);
+
+ midx_display_sparse_progress(progress, i + 1);
}
+ stop_progress(&progress);
- progress = start_progress(_("Verifying object offsets"), m->num_objects);
+ /*
+ * Create an array mapping each object to its packfile id. Sort it
+ * to group the objects by packfile. Use this permutation to visit
+ * each of the objects and only require 1 packfile to be open at a
+ * time.
+ */
+ ALLOC_ARRAY(pairs, m->num_objects);
for (i = 0; i < m->num_objects; i++) {
+ pairs[i].pos = i;
+ pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
+ }
+
+ progress = start_sparse_progress(
+ _("Sorting objects by packfile"), m->num_objects);
+ display_progress(progress, 0); /* TODO: Measure QSORT() progress */
+ QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
+ stop_progress(&progress);
+
+ progress = start_sparse_progress(_("Verifying object offsets"),
+ m->num_objects);
+ for (k = 0; k < m->num_objects; k++) {
struct object_id oid;
struct pack_entry e;
off_t m_offset, p_offset;
- nth_midxed_object_oid(&oid, m, i);
+ if (k > 0 && pairs[k-1].pack_int_id != pairs[k].pack_int_id &&
+ m->packs[pairs[k-1].pack_int_id])
+ {
+ close_pack_fd(m->packs[pairs[k-1].pack_int_id]);
+ close_pack_index(m->packs[pairs[k-1].pack_int_id]);
+ }
+
+ nth_midxed_object_oid(&oid, m, pairs[k].pos);
if (!fill_midx_entry(&oid, &e, m)) {
midx_report(_("failed to load pack entry for oid[%d] = %s"),
- i, oid_to_hex(&oid));
+ pairs[k].pos, oid_to_hex(&oid));
continue;
}
@@ -1024,11 +1087,13 @@ int verify_midx_file(const char *object_dir)
if (m_offset != p_offset)
midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64),
- i, oid_to_hex(&oid), m_offset, p_offset);
+ pairs[k].pos, oid_to_hex(&oid), m_offset, p_offset);
- display_progress(progress, i + 1);
+ midx_display_sparse_progress(progress, k + 1);
}
stop_progress(&progress);
+ free(pairs);
+
return verify_midx_error;
}
diff --git a/packfile.c b/packfile.c
index 16bcb75262..d2bcb2f860 100644
--- a/packfile.c
+++ b/packfile.c
@@ -309,7 +309,7 @@ void close_pack_windows(struct packed_git *p)
}
}
-static int close_pack_fd(struct packed_git *p)
+int close_pack_fd(struct packed_git *p)
{
if (p->pack_fd < 0)
return 0;
diff --git a/packfile.h b/packfile.h
index d70c6d9afb..b1c18504eb 100644
--- a/packfile.h
+++ b/packfile.h
@@ -76,6 +76,8 @@ extern int open_pack_index(struct packed_git *);
*/
extern void close_pack_index(struct packed_git *);
+int close_pack_fd(struct packed_git *p);
+
extern uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
extern unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
--
gitgitgadget
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v3 0/4] multi-pack-index: fix verify on large repos
2019-03-19 14:37 ` [PATCH v2 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
` (3 preceding siblings ...)
2019-03-19 14:37 ` [PATCH v2 4/4] midx: verify: group objects by packfile to speed up object verification Jeff Hostetler via GitGitGadget
@ 2019-03-21 19:36 ` Jeff Hostetler via GitGitGadget
2019-03-21 19:36 ` [PATCH v3 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
` (4 more replies)
4 siblings, 5 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-21 19:36 UTC (permalink / raw)
To: git; +Cc: avarab, Junio C Hamano
Version 3 drops the packed-git commit I mentioned earlier, simplifies the
finish_if_sparse API as suggested by Eric, and splits the object sort commit
into 2 commits: one to add the additional progress indicators and one to
sort the objects by packfile. This makes it easier to follow.
----------------------------------------------------------------------------
Version 2 addresses progress-related concerns raised in the previous version
of the midx verify code.
This version also extends the existing progress.[ch] code and adds a
"sparse" mode that automatically ensures the 100% message is issued.
----------------------------------------------------------------------------
Teach "multi-pack-index verify" to handle cases where the number of
packfiles exceeds the open file handle limit.
The first commit fixes a problem that prevented the LRU-style
close_one_pack() mechanism from working which caused midx verify to run out
of file descriptors.
The second commit teaches midx verify to sort the set of objects to verify
by packfile rather than verifying them in OID order. This eliminates the
need to have more than one packfile/idx open at the same time.
With the second commit, runtime on 3600 packfiles went from 12 minutes to 25
seconds.
Thanks, Jeff
Cc: dstolee@microsoft.com
Jeff Hostetler (4):
progress: add sparse mode to force 100% complete message
trace2:data: add trace2 data to midx
midx: add progress indicators in multi-pack-index verify
midx: during verify group objects by packfile to speed verification
builtin/multi-pack-index.c | 3 ++
midx.c | 79 +++++++++++++++++++++++++++++++++++---
packfile.c | 2 +-
packfile.h | 2 +
progress.c | 38 ++++++++++++++++--
progress.h | 3 ++
6 files changed, 118 insertions(+), 9 deletions(-)
base-commit: e902e9bcae2010bc42648c80ab6adc6c5a16a4a5
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-166%2Fjeffhostetler%2Fupstream-midx-verify-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-166/jeffhostetler/upstream-midx-verify-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/166
Range-diff vs v2:
1: e1da1f84a8 ! 1: 5595e019c8 progress: add sparse mode to force 100% complete message
@@ -80,10 +80,8 @@
+ return start_progress_delay(title, total, 2, 1);
+}
+
-+static void finish_if_sparse(struct progress **p_progress)
++static void finish_if_sparse(struct progress *progress)
+{
-+ struct progress *progress = *p_progress;
-+
+ if (progress &&
+ progress->sparse &&
+ progress->last_value != progress->total)
@@ -92,7 +90,7 @@
void stop_progress(struct progress **p_progress)
{
-+ finish_if_sparse(p_progress);
++ finish_if_sparse(*p_progress);
+
stop_progress_msg(p_progress, _("done"));
}
2: 11c88845e7 = 2: 498258b120 trace2:data: add trace2 data to midx
3: ced7f1cb34 < -: ---------- midx: verify: add midx packfiles to the packed_git list
-: ---------- > 3: 8a60902d65 midx: add progress indicators in multi-pack-index verify
4: e2dd99911f ! 4: 7e98ea005a midx: verify: group objects by packfile to speed up object verification
@@ -1,8 +1,8 @@
Author: Jeff Hostetler <jeffhost@microsoft.com>
- midx: verify: group objects by packfile to speed up object verification
+ midx: during verify group objects by packfile to speed verification
- Teach `multi-pack-index verify` to sort the set of objects by
+ Teach `multi-pack-index verify` to sort the set of object by
packfile so that only one packfile needs to be open at a time.
This is a performance improvement. Previously, objects were
@@ -34,64 +34,21 @@
+ return b->pack_int_id - a->pack_int_id;
+}
+
-+/*
-+ * Limit calls to display_progress() for performance reasons.
-+ * The interval here was arbitrarily chosen.
-+ */
-+#define SPARSE_PROGRESS_INTERVAL (1 << 12)
-+#define midx_display_sparse_progress(progress, n) \
-+ do { \
-+ uint64_t _n = (n); \
-+ if ((_n & (SPARSE_PROGRESS_INTERVAL - 1)) == 0) \
-+ display_progress(progress, _n); \
-+ } while (0)
-+
+ /*
+ * Limit calls to display_progress() for performance reasons.
+ * The interval here was arbitrarily chosen.
+@@
+
int verify_midx_file(const char *object_dir)
{
-- uint32_t i;
+ struct pair_pos_vs_id *pairs = NULL;
-+ uint32_t i, k;
+ uint32_t i;
struct progress *progress;
struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
- verify_midx_error = 0;
-@@
- if (!m)
- return 0;
-
-+ progress = start_progress(_("Looking for referenced packfiles"),
-+ m->num_packs);
- for (i = 0; i < m->num_packs; i++) {
- if (prepare_midx_pack(m, i))
- midx_report("failed to load pack in position %d", i);
-
- if (m->packs[i])
- install_packed_git(the_repository, m->packs[i]);
-+
-+ display_progress(progress, i + 1);
- }
-+ stop_progress(&progress);
-
- for (i = 0; i < 255; i++) {
- uint32_t oid_fanout1 = ntohl(m->chunk_oid_fanout[i]);
@@
- i, oid_fanout1, oid_fanout2, i + 1);
}
+ stop_progress(&progress);
-+ progress = start_sparse_progress(_("Verifying OID order in MIDX"),
-+ m->num_objects - 1);
- for (i = 0; i < m->num_objects - 1; i++) {
- struct object_id oid1, oid2;
-
-@@
- if (oidcmp(&oid1, &oid2) >= 0)
- midx_report(_("oid lookup out of order: oid[%d] = %s >= %s = oid[%d]"),
- i, oid_to_hex(&oid1), oid_to_hex(&oid2), i + 1);
-+
-+ midx_display_sparse_progress(progress, i + 1);
- }
-+ stop_progress(&progress);
-
-- progress = start_progress(_("Verifying object offsets"), m->num_objects);
+ /*
+ * Create an array mapping each object to its packfile id. Sort it
+ * to group the objects by packfile. Use this permutation to visit
@@ -99,37 +56,37 @@
+ * time.
+ */
+ ALLOC_ARRAY(pairs, m->num_objects);
- for (i = 0; i < m->num_objects; i++) {
++ for (i = 0; i < m->num_objects; i++) {
+ pairs[i].pos = i;
+ pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
+ }
+
-+ progress = start_sparse_progress(
-+ _("Sorting objects by packfile"), m->num_objects);
++ progress = start_sparse_progress(_("Sorting objects by packfile"),
++ m->num_objects);
+ display_progress(progress, 0); /* TODO: Measure QSORT() progress */
+ QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
+ stop_progress(&progress);
+
-+ progress = start_sparse_progress(_("Verifying object offsets"),
-+ m->num_objects);
-+ for (k = 0; k < m->num_objects; k++) {
+ progress = start_sparse_progress(_("Verifying object offsets"), m->num_objects);
+ for (i = 0; i < m->num_objects; i++) {
struct object_id oid;
struct pack_entry e;
off_t m_offset, p_offset;
- nth_midxed_object_oid(&oid, m, i);
-+ if (k > 0 && pairs[k-1].pack_int_id != pairs[k].pack_int_id &&
-+ m->packs[pairs[k-1].pack_int_id])
++ if (i > 0 && pairs[i-1].pack_int_id != pairs[i].pack_int_id &&
++ m->packs[pairs[i-1].pack_int_id])
+ {
-+ close_pack_fd(m->packs[pairs[k-1].pack_int_id]);
-+ close_pack_index(m->packs[pairs[k-1].pack_int_id]);
++ close_pack_fd(m->packs[pairs[i-1].pack_int_id]);
++ close_pack_index(m->packs[pairs[i-1].pack_int_id]);
+ }
+
-+ nth_midxed_object_oid(&oid, m, pairs[k].pos);
++ nth_midxed_object_oid(&oid, m, pairs[i].pos);
++
if (!fill_midx_entry(&oid, &e, m)) {
midx_report(_("failed to load pack entry for oid[%d] = %s"),
- i, oid_to_hex(&oid));
-+ pairs[k].pos, oid_to_hex(&oid));
++ pairs[i].pos, oid_to_hex(&oid));
continue;
}
@@ -138,10 +95,9 @@
if (m_offset != p_offset)
midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64),
- i, oid_to_hex(&oid), m_offset, p_offset);
-+ pairs[k].pos, oid_to_hex(&oid), m_offset, p_offset);
++ pairs[i].pos, oid_to_hex(&oid), m_offset, p_offset);
-- display_progress(progress, i + 1);
-+ midx_display_sparse_progress(progress, k + 1);
+ midx_display_sparse_progress(progress, i + 1);
}
stop_progress(&progress);
--
gitgitgadget
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v3 1/4] progress: add sparse mode to force 100% complete message
2019-03-21 19:36 ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
@ 2019-03-21 19:36 ` Jeff Hostetler via GitGitGadget
2019-03-21 19:36 ` [PATCH v3 2/4] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
` (3 subsequent siblings)
4 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-21 19:36 UTC (permalink / raw)
To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler
From: Jeff Hostetler <jeffhost@microsoft.com>
Add new start_sparse_progress() and start_delayed_sparse_progress()
constructors and "sparse" flag to struct progress.
Teach stop_progress() to force a 100% complete progress message before
printing the final "done" message when "sparse" is set.
Calling display_progress() for every item in a large set can
be expensive. If callers try to filter this for performance
reasons, such as emitting every k-th item, progress would
not reach 100% unless they made a final call to display_progress()
with the item count before calling stop_progress().
Now this is automatic when "sparse" is set.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
progress.c | 38 +++++++++++++++++++++++++++++++++++---
progress.h | 3 +++
2 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/progress.c b/progress.c
index 5a99c9fbf0..212d00e524 100644
--- a/progress.c
+++ b/progress.c
@@ -34,6 +34,7 @@ struct progress {
uint64_t total;
unsigned last_percent;
unsigned delay;
+ unsigned sparse;
struct throughput *throughput;
uint64_t start_ns;
};
@@ -194,7 +195,7 @@ int display_progress(struct progress *progress, uint64_t n)
}
static struct progress *start_progress_delay(const char *title, uint64_t total,
- unsigned delay)
+ unsigned delay, unsigned sparse)
{
struct progress *progress = malloc(sizeof(*progress));
if (!progress) {
@@ -208,6 +209,7 @@ static struct progress *start_progress_delay(const char *title, uint64_t total,
progress->last_value = -1;
progress->last_percent = -1;
progress->delay = delay;
+ progress->sparse = sparse;
progress->throughput = NULL;
progress->start_ns = getnanotime();
set_progress_signal();
@@ -216,16 +218,46 @@ static struct progress *start_progress_delay(const char *title, uint64_t total,
struct progress *start_delayed_progress(const char *title, uint64_t total)
{
- return start_progress_delay(title, total, 2);
+ return start_progress_delay(title, total, 2, 0);
}
struct progress *start_progress(const char *title, uint64_t total)
{
- return start_progress_delay(title, total, 0);
+ return start_progress_delay(title, total, 0, 0);
+}
+
+/*
+ * Here "sparse" means that the caller might use some sampling criteria to
+ * decide when to call display_progress() rather than calling it for every
+ * integer value in[0 .. total). In particular, the caller might not call
+ * display_progress() for the last value in the range.
+ *
+ * When "sparse" is set, stop_progress() will automatically force the done
+ * message to show 100%.
+ */
+struct progress *start_sparse_progress(const char *title, uint64_t total)
+{
+ return start_progress_delay(title, total, 0, 1);
+}
+
+struct progress *start_delayed_sparse_progress(const char *title,
+ uint64_t total)
+{
+ return start_progress_delay(title, total, 2, 1);
+}
+
+static void finish_if_sparse(struct progress *progress)
+{
+ if (progress &&
+ progress->sparse &&
+ progress->last_value != progress->total)
+ display_progress(progress, progress->total);
}
void stop_progress(struct progress **p_progress)
{
+ finish_if_sparse(*p_progress);
+
stop_progress_msg(p_progress, _("done"));
}
diff --git a/progress.h b/progress.h
index 70a4d4a0d6..7b725acc8d 100644
--- a/progress.h
+++ b/progress.h
@@ -6,7 +6,10 @@ struct progress;
void display_throughput(struct progress *progress, uint64_t total);
int display_progress(struct progress *progress, uint64_t n);
struct progress *start_progress(const char *title, uint64_t total);
+struct progress *start_sparse_progress(const char *title, uint64_t total);
struct progress *start_delayed_progress(const char *title, uint64_t total);
+struct progress *start_delayed_sparse_progress(const char *title,
+ uint64_t total);
void stop_progress(struct progress **progress);
void stop_progress_msg(struct progress **progress, const char *msg);
--
gitgitgadget
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v3 2/4] trace2:data: add trace2 data to midx
2019-03-21 19:36 ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
2019-03-21 19:36 ` [PATCH v3 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
@ 2019-03-21 19:36 ` Jeff Hostetler via GitGitGadget
2019-03-21 19:36 ` [PATCH v3 3/4] midx: add progress indicators in multi-pack-index verify Jeff Hostetler via GitGitGadget
` (2 subsequent siblings)
4 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-21 19:36 UTC (permalink / raw)
To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler
From: Jeff Hostetler <jeffhost@microsoft.com>
Log multi-pack-index command mode.
Log number of objects and packfiles in the midx.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
builtin/multi-pack-index.c | 3 +++
midx.c | 4 ++++
2 files changed, 7 insertions(+)
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index fca70f8e4f..ae6e476ad5 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -3,6 +3,7 @@
#include "config.h"
#include "parse-options.h"
#include "midx.h"
+#include "trace2.h"
static char const * const builtin_multi_pack_index_usage[] = {
N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
@@ -40,6 +41,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
return 1;
}
+ trace2_cmd_mode(argv[0]);
+
if (!strcmp(argv[0], "write"))
return write_midx_file(opts.object_dir);
if (!strcmp(argv[0], "verify"))
diff --git a/midx.c b/midx.c
index 8a505fd423..24141a7c62 100644
--- a/midx.c
+++ b/midx.c
@@ -8,6 +8,7 @@
#include "sha1-lookup.h"
#include "midx.h"
#include "progress.h"
+#include "trace2.h"
#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
#define MIDX_VERSION 1
@@ -164,6 +165,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
m->pack_names[i]);
}
+ trace2_data_intmax("midx", the_repository, "load/num_packs", m->num_packs);
+ trace2_data_intmax("midx", the_repository, "load/num_objects", m->num_objects);
+
return m;
cleanup_fail:
--
gitgitgadget
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v3 3/4] midx: add progress indicators in multi-pack-index verify
2019-03-21 19:36 ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
2019-03-21 19:36 ` [PATCH v3 1/4] progress: add sparse mode to force 100% complete message Jeff Hostetler via GitGitGadget
2019-03-21 19:36 ` [PATCH v3 2/4] trace2:data: add trace2 data to midx Jeff Hostetler via GitGitGadget
@ 2019-03-21 19:36 ` Jeff Hostetler via GitGitGadget
2019-03-21 19:36 ` [PATCH v3 4/4] midx: during verify group objects by packfile to speed verification Jeff Hostetler via GitGitGadget
2019-03-22 5:37 ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Junio C Hamano
4 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-21 19:36 UTC (permalink / raw)
To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler
From: Jeff Hostetler <jeffhost@microsoft.com>
Add progress indicators to more parts of midx verify.
Use sparse progress indicator for object iteration.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
midx.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/midx.c b/midx.c
index 24141a7c62..e3919387d9 100644
--- a/midx.c
+++ b/midx.c
@@ -962,6 +962,18 @@ static void midx_report(const char *fmt, ...)
va_end(ap);
}
+/*
+ * Limit calls to display_progress() for performance reasons.
+ * The interval here was arbitrarily chosen.
+ */
+#define SPARSE_PROGRESS_INTERVAL (1 << 12)
+#define midx_display_sparse_progress(progress, n) \
+ do { \
+ uint64_t _n = (n); \
+ if ((_n & (SPARSE_PROGRESS_INTERVAL - 1)) == 0) \
+ display_progress(progress, _n); \
+ } while (0)
+
int verify_midx_file(const char *object_dir)
{
uint32_t i;
@@ -972,10 +984,15 @@ int verify_midx_file(const char *object_dir)
if (!m)
return 0;
+ progress = start_progress(_("Looking for referenced packfiles"),
+ m->num_packs);
for (i = 0; i < m->num_packs; i++) {
if (prepare_midx_pack(m, i))
midx_report("failed to load pack in position %d", i);
+
+ display_progress(progress, i + 1);
}
+ stop_progress(&progress);
for (i = 0; i < 255; i++) {
uint32_t oid_fanout1 = ntohl(m->chunk_oid_fanout[i]);
@@ -986,6 +1003,8 @@ int verify_midx_file(const char *object_dir)
i, oid_fanout1, oid_fanout2, i + 1);
}
+ progress = start_sparse_progress(_("Verifying OID order in MIDX"),
+ m->num_objects - 1);
for (i = 0; i < m->num_objects - 1; i++) {
struct object_id oid1, oid2;
@@ -995,9 +1014,12 @@ int verify_midx_file(const char *object_dir)
if (oidcmp(&oid1, &oid2) >= 0)
midx_report(_("oid lookup out of order: oid[%d] = %s >= %s = oid[%d]"),
i, oid_to_hex(&oid1), oid_to_hex(&oid2), i + 1);
+
+ midx_display_sparse_progress(progress, i + 1);
}
+ stop_progress(&progress);
- progress = start_progress(_("Verifying object offsets"), m->num_objects);
+ progress = start_sparse_progress(_("Verifying object offsets"), m->num_objects);
for (i = 0; i < m->num_objects; i++) {
struct object_id oid;
struct pack_entry e;
@@ -1023,7 +1045,7 @@ int verify_midx_file(const char *object_dir)
midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64),
i, oid_to_hex(&oid), m_offset, p_offset);
- display_progress(progress, i + 1);
+ midx_display_sparse_progress(progress, i + 1);
}
stop_progress(&progress);
--
gitgitgadget
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v3 4/4] midx: during verify group objects by packfile to speed verification
2019-03-21 19:36 ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
` (2 preceding siblings ...)
2019-03-21 19:36 ` [PATCH v3 3/4] midx: add progress indicators in multi-pack-index verify Jeff Hostetler via GitGitGadget
@ 2019-03-21 19:36 ` Jeff Hostetler via GitGitGadget
2019-03-22 5:37 ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Junio C Hamano
4 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler via GitGitGadget @ 2019-03-21 19:36 UTC (permalink / raw)
To: git; +Cc: avarab, Junio C Hamano, Jeff Hostetler
From: Jeff Hostetler <jeffhost@microsoft.com>
Teach `multi-pack-index verify` to sort the set of object by
packfile so that only one packfile needs to be open at a time.
This is a performance improvement. Previously, objects were
verified in OID order. This essentially requires all packfiles
to be open at the same time. If the number of packfiles exceeds
the open file limit, packfiles would be LRU-closed and re-opened
many times.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
midx.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
packfile.c | 2 +-
packfile.h | 2 ++
3 files changed, 49 insertions(+), 4 deletions(-)
diff --git a/midx.c b/midx.c
index e3919387d9..f1cd868f8c 100644
--- a/midx.c
+++ b/midx.c
@@ -962,6 +962,20 @@ static void midx_report(const char *fmt, ...)
va_end(ap);
}
+struct pair_pos_vs_id
+{
+ uint32_t pos;
+ uint32_t pack_int_id;
+};
+
+static int compare_pair_pos_vs_id(const void *_a, const void *_b)
+{
+ struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
+ struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
+
+ return b->pack_int_id - a->pack_int_id;
+}
+
/*
* Limit calls to display_progress() for performance reasons.
* The interval here was arbitrarily chosen.
@@ -976,6 +990,7 @@ static void midx_report(const char *fmt, ...)
int verify_midx_file(const char *object_dir)
{
+ struct pair_pos_vs_id *pairs = NULL;
uint32_t i;
struct progress *progress;
struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
@@ -1019,16 +1034,42 @@ int verify_midx_file(const char *object_dir)
}
stop_progress(&progress);
+ /*
+ * Create an array mapping each object to its packfile id. Sort it
+ * to group the objects by packfile. Use this permutation to visit
+ * each of the objects and only require 1 packfile to be open at a
+ * time.
+ */
+ ALLOC_ARRAY(pairs, m->num_objects);
+ for (i = 0; i < m->num_objects; i++) {
+ pairs[i].pos = i;
+ pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
+ }
+
+ progress = start_sparse_progress(_("Sorting objects by packfile"),
+ m->num_objects);
+ display_progress(progress, 0); /* TODO: Measure QSORT() progress */
+ QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
+ stop_progress(&progress);
+
progress = start_sparse_progress(_("Verifying object offsets"), m->num_objects);
for (i = 0; i < m->num_objects; i++) {
struct object_id oid;
struct pack_entry e;
off_t m_offset, p_offset;
- nth_midxed_object_oid(&oid, m, i);
+ if (i > 0 && pairs[i-1].pack_int_id != pairs[i].pack_int_id &&
+ m->packs[pairs[i-1].pack_int_id])
+ {
+ close_pack_fd(m->packs[pairs[i-1].pack_int_id]);
+ close_pack_index(m->packs[pairs[i-1].pack_int_id]);
+ }
+
+ nth_midxed_object_oid(&oid, m, pairs[i].pos);
+
if (!fill_midx_entry(&oid, &e, m)) {
midx_report(_("failed to load pack entry for oid[%d] = %s"),
- i, oid_to_hex(&oid));
+ pairs[i].pos, oid_to_hex(&oid));
continue;
}
@@ -1043,11 +1084,13 @@ int verify_midx_file(const char *object_dir)
if (m_offset != p_offset)
midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64),
- i, oid_to_hex(&oid), m_offset, p_offset);
+ pairs[i].pos, oid_to_hex(&oid), m_offset, p_offset);
midx_display_sparse_progress(progress, i + 1);
}
stop_progress(&progress);
+ free(pairs);
+
return verify_midx_error;
}
diff --git a/packfile.c b/packfile.c
index 16bcb75262..d2bcb2f860 100644
--- a/packfile.c
+++ b/packfile.c
@@ -309,7 +309,7 @@ void close_pack_windows(struct packed_git *p)
}
}
-static int close_pack_fd(struct packed_git *p)
+int close_pack_fd(struct packed_git *p)
{
if (p->pack_fd < 0)
return 0;
diff --git a/packfile.h b/packfile.h
index d70c6d9afb..b1c18504eb 100644
--- a/packfile.h
+++ b/packfile.h
@@ -76,6 +76,8 @@ extern int open_pack_index(struct packed_git *);
*/
extern void close_pack_index(struct packed_git *);
+int close_pack_fd(struct packed_git *p);
+
extern uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
extern unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
--
gitgitgadget
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v3 0/4] multi-pack-index: fix verify on large repos
2019-03-21 19:36 ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Jeff Hostetler via GitGitGadget
` (3 preceding siblings ...)
2019-03-21 19:36 ` [PATCH v3 4/4] midx: during verify group objects by packfile to speed verification Jeff Hostetler via GitGitGadget
@ 2019-03-22 5:37 ` Junio C Hamano
2019-03-25 17:18 ` Jeff Hostetler
4 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2019-03-22 5:37 UTC (permalink / raw)
To: Jeff Hostetler via GitGitGadget; +Cc: git, avarab
"Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> writes:
> Teach "multi-pack-index verify" to handle cases where the number of
> packfiles exceeds the open file handle limit.
>
> The first commit fixes a problem that prevented the LRU-style
> close_one_pack() mechanism from working which caused midx verify to run out
> of file descriptors.
>
> The second commit teaches midx verify to sort the set of objects to verify
> by packfile rather than verifying them in OID order. This eliminates the
> need to have more than one packfile/idx open at the same time.
>
> With the second commit, runtime on 3600 packfiles went from 12 minutes to 25
> seconds.
These reference to the first and second commit might have become
stale across interations, but logically it makes sense---the first
point is about correctness (i.e. do not die by running out of fds)
and the second one is about usable-performance.
But in this round (possibly in the previous one, too?) the "group
objects by packfile" one addresses both points?
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 0/4] multi-pack-index: fix verify on large repos
2019-03-22 5:37 ` [PATCH v3 0/4] multi-pack-index: fix verify on large repos Junio C Hamano
@ 2019-03-25 17:18 ` Jeff Hostetler
0 siblings, 0 replies; 26+ messages in thread
From: Jeff Hostetler @ 2019-03-25 17:18 UTC (permalink / raw)
To: Junio C Hamano, Jeff Hostetler via GitGitGadget; +Cc: git, avarab
On 3/22/2019 1:37 AM, Junio C Hamano wrote:
> "Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> Teach "multi-pack-index verify" to handle cases where the number of
>> packfiles exceeds the open file handle limit.
>>
>> The first commit fixes a problem that prevented the LRU-style
>> close_one_pack() mechanism from working which caused midx verify to run out
>> of file descriptors.
>>
>> The second commit teaches midx verify to sort the set of objects to verify
>> by packfile rather than verifying them in OID order. This eliminates the
>> need to have more than one packfile/idx open at the same time.
>>
>> With the second commit, runtime on 3600 packfiles went from 12 minutes to 25
>> seconds.
>
> These reference to the first and second commit might have become
> stale across interations, but logically it makes sense---the first
> point is about correctness (i.e. do not die by running out of fds)
> and the second one is about usable-performance.
>
> But in this round (possibly in the previous one, too?) the "group
> objects by packfile" one addresses both points?
>
Sorry, I forgot to remote the stale content in the cover letter for
the V3 version.
This version just has the sorting by packfile commit and because it only
keeps 1 packfile open at a time, it does not need the change to add
packfiles to the packed-git list because it does not trigger the
close_one_pack() problem.
We suspect there are other places (not-yet-observed) where the design of
the all-packs and packed-git lists will lead to similar fd exhaustion
errors and want to fix it properly in the packfile and/or midx code.
We'll address this potential problem in a future patch series.
Jeff
^ permalink raw reply [flat|nested] 26+ messages in thread