git@vger.kernel.org mailing list mirror (one of many)
* non-smooth progress  indication for git fsck and git gc
@ 2018-08-16  6:54 Ulrich Windl
  2018-08-16 15:18 ` Duy Nguyen
  2018-08-16 15:57 ` Jeff King
  0 siblings, 2 replies; 25+ messages in thread
From: Ulrich Windl @ 2018-08-16  6:54 UTC (permalink / raw)
  To: git

Hi!

I'd like to point out some minor issue observed while processing some 50000-object repository with many binary objects, but most are rather small:

Between the two phases of "git fsck" (checking directories and checking objects) there was a break of several seconds where no progress was indicated.

During "git gc" the writing objects phase did not update for some seconds, but then the percentage counter jumped like from 15% to 42%.

I understand that updating the progress output too often can be a performance bottleneck, while updating it too rarely might only bore the user... ;-)

But maybe something can be done. My git version is 2.13.7 (openSUSE 42.3).

Regards,
Ulrich




* Re: non-smooth progress indication for git fsck and git gc
  2018-08-16  6:54 non-smooth progress indication for git fsck and git gc Ulrich Windl
@ 2018-08-16 15:18 ` Duy Nguyen
  2018-08-16 16:05   ` Jeff King
  2018-08-20  8:27   ` Antw: " Ulrich Windl
  2018-08-16 15:57 ` Jeff King
  1 sibling, 2 replies; 25+ messages in thread
From: Duy Nguyen @ 2018-08-16 15:18 UTC (permalink / raw)
  To: Ulrich.Windl; +Cc: Git Mailing List

On Thu, Aug 16, 2018 at 1:10 PM Ulrich Windl
<Ulrich.Windl@rz.uni-regensburg.de> wrote:
>
> Hi!
>
> I'd like to point out some minor issue observed while processing some 50000-object repository with many binary objects, but most are rather small:
>
> Between the two phases of "git fsck" (checking directories and checking objects) there was a break of several seconds where no progress was indicated.
>
> During "git gc" the writing objects phase did not update for some seconds, but then the percentage counter jumped like from 15% to 42%.
>
> I understand that updating the progress output too often can be a performance bottleneck, while updating it too rarely might only bore the user... ;-)
>
> But maybe something can be done. My git version is 2.13.7 (openSUSE 42.3).

Is it possible to make this repository public? You can also use "git
fast-export --anonymize" to make a repo with the same structure but no
real content (but read the man page about that option first).
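
A rough sketch of that round-trip (untested; the paths and the "anon-copy"
name are just placeholders):

  git -C /path/to/repo fast-export --anonymize --all >anon-stream
  git init anon-copy
  git -C anon-copy fast-import <anon-stream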

> Regards,
> Ulrich
>
>


-- 
Duy


* Re: non-smooth progress  indication for git fsck and git gc
  2018-08-16  6:54 non-smooth progress indication for git fsck and git gc Ulrich Windl
  2018-08-16 15:18 ` Duy Nguyen
@ 2018-08-16 15:57 ` Jeff King
  2018-08-16 20:02   ` Jeff King
  2018-08-16 20:35   ` Ævar Arnfjörð Bjarmason
  1 sibling, 2 replies; 25+ messages in thread
From: Jeff King @ 2018-08-16 15:57 UTC (permalink / raw)
  To: Ulrich Windl; +Cc: git

On Thu, Aug 16, 2018 at 08:54:25AM +0200, Ulrich Windl wrote:

> I'd like to point out some minor issue observed while processing some
> 50000-object repository with many binary objects, but most are rather
> small:
> 
> Between the two phases of "git fsck" (checking directories and
> checking objects) there was a break of several seconds where no
> progress was indicated.
> 
> During "git gc" the writing objects phase did not update for some
> seconds, but then the percentage counter jumped like from 15% to 42%.
> 
> I understand that updating the progress output too often can be a
> performance bottleneck, while updating it too rarely might only bore
> the user... ;-)

We update the counter integer for every object we process, and then
actually update the display whenever the percentage increases or a
second has elapsed, whichever comes first.

What you're seeing is likely an artifact of _what_ we're counting:
written objects. Not all objects are the same size, so it's not uncommon
to see thousands/sec when dealing with small ones, and then several
seconds for one giant blob.

The only way to solve that is to count bytes. We don't have a total byte
count in most cases, and it wouldn't always make sense (e.g., the
"Compressing objects" meter can show the same issue, but it's not really
putting through bytes in a linear way).  In some cases we do show
transmitted size and throughput, but that's just for network operations.
We could do the same for "gc" with the patch below. But usually
throughput isn't all that interesting for a filesystem write, because
bandwidth isn't the bottleneck.

Possibly we could have a "half throughput" mode that counts up the total
size written, but omits the speed indicator. That's not an unreasonable
thing to show for a local pack, since you end up with the final pack
size. The object counter would still be jumpy, but you'd at least have
one number updated at least once per second as you put through a large
blob.

If you really want a smooth percentage, then we'd have to start counting
bytes instead of objects. Two reasons we don't do that are:

  - we often don't know the total byte size exactly. E.g., for a
    packfile write, it depends on the result of deflating each object.
    You can make an approximation and just pretend at the end that you
    hit 100%.  Or you can count the pre-deflate sizes, but then your
    meter doesn't match the bytes from the throughput counter.

  - for something like fsck, we're not actually writing out bytes.  So I
    guess you'd be measuring "here's how many bytes of objects I
    fsck-ed". But is that on-disk compressed bytes, or decompressed
    bytes?

    If the former, that's only marginally better as a measure of effort,
    since delta compression means that a small number of on-disk bytes
    may require a big effort (imagine processing a 100 byte blob versus
    a 100 byte delta off of a 100MB blob).

    The latter is probably more accurate. But it's also not free to
    pre-generate the total. We can get the number of objects or the size
    of the packfile in constant-time, but totaling up the uncompressed
    size of all objects is O(n) (a rough sketch of that pre-pass follows
    this list). So that's extra computation, but it also means a potential
    lag before we can start the progress meter.

    I'm also not sure how meaningful a byte count is for a user there.
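
As a rough illustration of that O(n) pre-pass (an untested sketch; it just
sums the decompressed and on-disk sizes that cat-file reports):

  git cat-file --batch-all-objects --batch-check='%(objectsize) %(objectsize:disk)' |
  awk '{ content += $1; disk += $2 } END { print content, disk }'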

So there. That's probably more than you wanted to know about Git's
progress code. I think it probably _could_ be improved by counting
more/different things, but I also think it can be a bit of a rabbit
hole. Which is why AFAIK nobody's really looked too seriously into
changing it.

-Peff


* Re: non-smooth progress indication for git fsck and git gc
  2018-08-16 15:18 ` Duy Nguyen
@ 2018-08-16 16:05   ` Jeff King
  2018-08-20  8:27   ` Antw: " Ulrich Windl
  1 sibling, 0 replies; 25+ messages in thread
From: Jeff King @ 2018-08-16 16:05 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Ulrich.Windl, Git Mailing List

On Thu, Aug 16, 2018 at 05:18:51PM +0200, Duy Nguyen wrote:

> > During "git gc" the writing objects phase did not update for some
> > seconds, but then the percentage counter jumped like from 15% to
> > 42%.
> [...]
> 
> Is it possible to make this repository public? You can also use "git
> fast-export --anonymize" to make a repo with same structure but no
> real content (but read the man page about that option first)

Try this:

-- >8 --
git init repo
cd repo

# one big blob
dd if=/dev/urandom of=big bs=1M count=100
git add big
git commit -m big

# several small blobs
for i in $(seq 10); do
  echo $i >file
  git add file
  git commit -m $i
done

git gc
-- >8 --

It "stalls" at 33%, and then jumps straight to 100%.

-Peff


* Re: non-smooth progress  indication for git fsck and git gc
  2018-08-16 15:57 ` Jeff King
@ 2018-08-16 20:02   ` Jeff King
  2018-08-16 22:10     ` Junio C Hamano
  2018-08-16 20:35   ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 25+ messages in thread
From: Jeff King @ 2018-08-16 20:02 UTC (permalink / raw)
  To: Ulrich Windl; +Cc: git

On Thu, Aug 16, 2018 at 11:57:14AM -0400, Jeff King wrote:

> The only way to solve that is to count bytes. We don't have a total byte
> count in most cases, and it wouldn't always make sense (e.g., the
> "Compressing objects" meter can show the same issue, but it's not really
> putting through bytes in a linear way).  In some cases we do show
> transmitted size and throughput, but that's just for network operations.
> We could do the same for "gc" with the patch below. But usually
> throughput isn't all that interesting for a filesystem write, because
> bandwidth isn't the bottleneck.

Just realized I forgot to include the patch. Here it is, for reference.

Doing something similar for fsck would be quite a bit more invasive.

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 80c880e9ad..e1130b959d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -837,7 +837,7 @@ static void write_pack_file(void)
 		if (pack_to_stdout)
 			f = hashfd_throughput(1, "<stdout>", progress_state);
 		else
-			f = create_tmp_packfile(&pack_tmp_name);
+			f = create_tmp_packfile(&pack_tmp_name, progress_state);
 
 		offset = write_pack_header(f, nr_remaining);
 
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 9f3b644811..0df45b8f55 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -178,7 +178,7 @@ static void prepare_to_stream(struct bulk_checkin_state *state,
 	if (!(flags & HASH_WRITE_OBJECT) || state->f)
 		return;
 
-	state->f = create_tmp_packfile(&state->pack_tmp_name);
+	state->f = create_tmp_packfile(&state->pack_tmp_name, NULL);
 	reset_pack_idx_option(&state->pack_idx_opts);
 
 	/* Pretend we are going to write only one object */
diff --git a/pack-write.c b/pack-write.c
index a9d46bc03f..b72480b440 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -334,14 +334,15 @@ int encode_in_pack_object_header(unsigned char *hdr, int hdr_len,
 	return n;
 }
 
-struct hashfile *create_tmp_packfile(char **pack_tmp_name)
+struct hashfile *create_tmp_packfile(char **pack_tmp_name,
+				     struct progress *progress)
 {
 	struct strbuf tmpname = STRBUF_INIT;
 	int fd;
 
 	fd = odb_mkstemp(&tmpname, "pack/tmp_pack_XXXXXX");
 	*pack_tmp_name = strbuf_detach(&tmpname, NULL);
-	return hashfd(fd, *pack_tmp_name);
+	return hashfd_throughput(fd, *pack_tmp_name, progress);
 }
 
 void finish_tmp_packfile(struct strbuf *name_buffer,
diff --git a/pack.h b/pack.h
index 34a9d458b4..c87628b093 100644
--- a/pack.h
+++ b/pack.h
@@ -98,7 +98,8 @@ extern int encode_in_pack_object_header(unsigned char *hdr, int hdr_len,
 #define PH_ERROR_PROTOCOL	(-3)
 extern int read_pack_header(int fd, struct pack_header *);
 
-extern struct hashfile *create_tmp_packfile(char **pack_tmp_name);
+extern struct hashfile *create_tmp_packfile(char **pack_tmp_name,
+					    struct progress *progress);
 extern void finish_tmp_packfile(struct strbuf *name_buffer, const char *pack_tmp_name, struct pack_idx_entry **written_list, uint32_t nr_written, struct pack_idx_option *pack_idx_opts, unsigned char sha1[]);
 
 #endif


* Re: non-smooth progress  indication for git fsck and git gc
  2018-08-16 15:57 ` Jeff King
  2018-08-16 20:02   ` Jeff King
@ 2018-08-16 20:35   ` Ævar Arnfjörð Bjarmason
  2018-08-16 20:55     ` Jeff King
  2018-09-01 12:53     ` Ævar Arnfjörð Bjarmason
  1 sibling, 2 replies; 25+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-08-16 20:35 UTC (permalink / raw)
  To: Jeff King; +Cc: Ulrich Windl, git


On Thu, Aug 16 2018, Jeff King wrote:

> On Thu, Aug 16, 2018 at 08:54:25AM +0200, Ulrich Windl wrote:
>
>> I'd like to point out some minor issue observed while processing some
>> 50000-object repository with many binary objects, but most are rather
>> small:
>>
>> Between the two phases of "git fsck" (checking directories and
>> checking objects) there was a break of several seconds where no
>> progress was indicated.
>>
>> During "git gc" the writing objects phase did not update for some
>> seconds, but then the percentage counter jumped like from 15% to 42%.
>>
>> I understand that updating the progress output too often can be a
>> performance bottleneck, while updating it too rarely might only bore
>> the user... ;-)
>
> We update the counter integer for every object we process, and then
> actually update the display whenever the percentage increases or a
> second has elapsed, whichever comes first.
>
> What you're seeing is likely an artifact of _what_ we're counting:
> written objects. Not all objects are the same size, so it's not uncommon
> to see thousands/sec when dealing with small ones, and then several
> seconds for one giant blob.
>
> The only way to solve that is to count bytes. We don't have a total byte
> count in most cases, and it wouldn't always make sense (e.g., the
> "Compressing objects" meter can show the same issue, but it's not really
> putting through bytes in a linear way).  In some cases we do show
> transmitted size and throughput, but that's just for network operations.
> We could do the same for "gc" with the patch below. But usually
> throughput isn't all that interesting for a filesystem write, because
> bandwidth isn't the bottleneck.
>
> Possibly we could have a "half throughput" mode that counts up the total
> size written, but omits the speed indicator. That's not an unreasonable
> thing to show for a local pack, since you end up with the final pack
> size. The object counter would still be jumpy, but you'd at least have
> one number updated at least once per second as you put through a large
> blob.
>
> If you really want a smooth percentage, then we'd have to start counting
> bytes instead of objects. Two reasons we don't do that are:
>
>   - we often don't know the total byte size exactly. E.g., for a
>     packfile write, it depends on the result of deflating each object.
>     You can make an approximation and just pretend at the end that you
>     hit 100%.  Or you can count the pre-deflate sizes, but then your
>     meter doesn't match the bytes from the throughput counter.
>
>   - for something like fsck, we're not actually writing out bytes.  So I
>     guess you'd be measuring "here's how many bytes of objects I
>     fsck-ed". But is that on-disk compressed bytes, or decompressed
>     bytes?
>
>     If the former, that's only marginally better as a measure of effort,
>     since delta compression means that a small number of on-disk bytes
>     may require a big effort (imagine processing a 100 byte blob versus
>     a 100 byte delta off of a 100MB blob).
>
>     The latter is probably more accurate. But it's also not free to
>     pre-generate the total. We can get the number of objects or the size
>     of the packfile in constant-time, but totaling up the uncompressed
>     size of all objects is O(n). So that's extra computation, but it
>     also means a potential lag before we can start the progress meter.
>
>     I'm also not sure how meaningful a byte count is for a user there.
>
> So there. That's probably more than you wanted to know about Git's
> progress code. I think it probably _could_ be improved by counting
> more/different things, but I also think it can be a bit of a rabbit
> hole. Which is why AFAIK nobody's really looked too seriously into
> changing it.
>
> -Peff

This is all interesting, but I think unrelated to what Ulrich is talking
about. Quote:

    Between the two phases of "git fsck" (checking directories and
    checking objects) there was a break of several seconds where no
    progress was indicated

I.e. it's not about the pause you get with your testcase (which is
certainly another issue) but the break between the two progress bars.

Here's a test case you can clone:
https://github.com/avar/2015-04-03-1M-git (or you might already have it
"locally" :)

If you fsck this repository it'll take around 30 seconds (on my spinning
rust server) between "Checking object directories" reaching 100% and the
first output from "Checking objects".

The breakdown of that is (this is from approximate eyeballing):

 * We spend 1-3 seconds just on this:
   https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181

 * We spend the majority of the ~30s on this:
   https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79

 * We spend another 3-5 seconds on this QSORT:
   https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L105

I.e. it's not about objects vs. bytes, but a structural problem in the
code: we pass a progress bar down to verify_pack(), which does a lot of
work in verify_pack_index() and verify_packfile() before we even get to
iterating over the objects in the file, and only then do we start
displaying progress.
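
(For anyone who wants to confirm that breakdown with something better than
eyeballing, a quick profile helps; this assumes Linux perf and a git built
with symbols:

  perf record -g -o fsck.data -- git fsck
  perf report -i fsck.data --stdio | head -30

The hashing and the qsort should show up as separate hot spots.)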


* Re: non-smooth progress  indication for git fsck and git gc
  2018-08-16 20:35   ` Ævar Arnfjörð Bjarmason
@ 2018-08-16 20:55     ` Jeff King
  2018-08-16 21:06       ` Jeff King
  2018-08-20  8:33       ` Antw: " Ulrich Windl
  2018-09-01 12:53     ` Ævar Arnfjörð Bjarmason
  1 sibling, 2 replies; 25+ messages in thread
From: Jeff King @ 2018-08-16 20:55 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Ulrich Windl, git

On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote:

> This is all interesting, but I think unrelated to what Ulrich is talking
> about. Quote:
> 
>     Between the two phases of "git fsck" (checking directories and
>     checking objects) there was a break of several seconds where no
>     progress was indicated
> 
> I.e. it's not about the pause you get with your testcase (which is
> certainly another issue) but the break between the two progress bars.

I think he's talking about both. What I said responds to this:

> >> During "git gc" the writing objects phase did not update for some
> >> seconds, but then the percentage counter jumped like from 15% to 42%.

But yeah, I missed that the fsck thing was specifically about a break
between two meters. That's a separate problem, but also worth
discussing (and hopefully much easier to address).

> If you fsck this repository it'll take around (on my spinning rust
> server) 30 seconds between 100% of "Checking object directories" before
> you get any output from "Checking objects".
> 
> The breakdown of that is (this is from approximate eyeballing):
> 
>  * We spend 1-3 seconds just on this:
>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181

OK, so that's checking the sha1 over the .idx file. We could put a meter
on that. I wouldn't expect it to generally be all that slow outside of
pathological cases, since it scales with the number of objects (and 1s
is our minimum update anyway, so that might be OK as-is). Your case has
13M objects, which is quite large.

>  * We spend the majority of the ~30s on this:
>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79

This is hashing the actual packfile. This is potentially quite long,
especially if you have a ton of big objects.

I wonder if we need to do this as a separate step anyway, though. Our
verification is based on index-pack these days, which means it's going
to walk over the whole content as part of the "Indexing objects" step to
expand base objects and mark deltas for later. Could we feed this hash
as part of that walk over the data? It's not going to save us 30s, but
it's likely to be more efficient. And it would fold the effort naturally
into the existing progress meter.

>  * We spend another 3-5 seconds on this QSORT:
>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L105

That's a tough one. I'm not sure how we'd count it (how many compares we
do?). And each item is doing so little work that hitting the progress
code may make things noticeably slower.

Again, your case is pretty big. Just based on the number of objects,
linux.git should be 1.5-2.5 seconds on your machine for the same
operation. Which I think may be small enough to ignore (or even just
print a generic before/after). It's really the 30s packfile hash that's
making the whole thing so terrible.
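
(A rough way to see that hashing cost in isolation, assuming a single big
pack and running from the top of the repository:

  time sha1sum .git/objects/pack/pack-*.pack

which is more or less the same amount of hashing work as the verify step
does.)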

-Peff


* Re: non-smooth progress  indication for git fsck and git gc
  2018-08-16 20:55     ` Jeff King
@ 2018-08-16 21:06       ` Jeff King
  2018-08-17 14:39         ` Duy Nguyen
  2018-08-20  8:33       ` Antw: " Ulrich Windl
  1 sibling, 1 reply; 25+ messages in thread
From: Jeff King @ 2018-08-16 21:06 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Ulrich Windl, git

On Thu, Aug 16, 2018 at 04:55:56PM -0400, Jeff King wrote:

> >  * We spend the majority of the ~30s on this:
> >    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79
> 
> This is hashing the actual packfile. This is potentially quite long,
> especially if you have a ton of big objects.
> 
> I wonder if we need to do this as a separate step anyway, though. Our
> verification is based on index-pack these days, which means it's going
> to walk over the whole content as part of the "Indexing objects" step to
> expand base objects and mark deltas for later. Could we feed this hash
> as part of that walk over the data? It's not going to save us 30s, but
> it's likely to be more efficient. And it would fold the effort naturally
> into the existing progress meter.

Actually, I take it back. That's the nice, modern way we do it in
git-verify-pack. But git-fsck uses the ancient "just walk over all of
the idx entries" method. It at least sorts in pack order, which is good,
but:

  - it's not multi-threaded, like index-pack/verify-pack

  - the index-pack way is actually more efficient than pack-ordering for
    the delta-base cache, because it actually walks the delta-graph in
    the optimal order

Once upon a time verify-pack used this same pack-check code, and we
switched in 3de89c9d42 (verify-pack: use index-pack --verify,
2011-06-03). So I suspect the right thing to do here is for fsck to
switch to that, too, and then delete pack-check.c entirely.

That may well involve us switching the progress to a per-pack view
(e.g., "checking pack 1/10 (10%)", followed by its own progress meter).
But I don't think that would be a bad thing. It's a more realistic view
of the work we're actually doing. Although perhaps it's misleading about
the total work remaining, because not all packs are the same size (so
you see that you're halfway through pack 2/10, but that's quite likely
to be half of the total work if it's the one gigantic pack).

-Peff


* Re: non-smooth progress  indication for git fsck and git gc
  2018-08-16 20:02   ` Jeff King
@ 2018-08-16 22:10     ` Junio C Hamano
  0 siblings, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2018-08-16 22:10 UTC (permalink / raw)
  To: Jeff King; +Cc: Ulrich Windl, git

Jeff King <peff@peff.net> writes:

> On Thu, Aug 16, 2018 at 11:57:14AM -0400, Jeff King wrote:
>
>> The only way to solve that is to count bytes. We don't have a total byte
>> count in most cases, and it wouldn't always make sense (e.g., the
>> "Compressing objects" meter can show the same issue, but it's not really
>> putting through bytes in a linear way).  In some cases we do show
>> transmitted size and throughput, but that's just for network operations.
>> We could do the same for "gc" with the patch below. But usually
>> throughput isn't all that interesting for a filesystem write, because
>> bandwidth isn't the bottleneck.
>
> Just realized I forgot to include the patch. Here it is, for reference.

I've been wondering when you'd realize the omission ;-)

> Doing something similar for fsck would be quite a bit more invasive.

Yeah, on that codepath there is no streaming write passing through a
single chokepoint where you can count bytes X-<.

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 80c880e9ad..e1130b959d 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -837,7 +837,7 @@ static void write_pack_file(void)
>  		if (pack_to_stdout)
>  			f = hashfd_throughput(1, "<stdout>", progress_state);
>  		else
> -			f = create_tmp_packfile(&pack_tmp_name);
> +			f = create_tmp_packfile(&pack_tmp_name, progress_state);
>  
>  		offset = write_pack_header(f, nr_remaining);
>  
> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 9f3b644811..0df45b8f55 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -178,7 +178,7 @@ static void prepare_to_stream(struct bulk_checkin_state *state,
>  	if (!(flags & HASH_WRITE_OBJECT) || state->f)
>  		return;
>  
> -	state->f = create_tmp_packfile(&state->pack_tmp_name);
> +	state->f = create_tmp_packfile(&state->pack_tmp_name, NULL);
>  	reset_pack_idx_option(&state->pack_idx_opts);
>  
>  	/* Pretend we are going to write only one object */
> diff --git a/pack-write.c b/pack-write.c
> index a9d46bc03f..b72480b440 100644
> --- a/pack-write.c
> +++ b/pack-write.c
> @@ -334,14 +334,15 @@ int encode_in_pack_object_header(unsigned char *hdr, int hdr_len,
>  	return n;
>  }
>  
> -struct hashfile *create_tmp_packfile(char **pack_tmp_name)
> +struct hashfile *create_tmp_packfile(char **pack_tmp_name,
> +				     struct progress *progress)
>  {
>  	struct strbuf tmpname = STRBUF_INIT;
>  	int fd;
>  
>  	fd = odb_mkstemp(&tmpname, "pack/tmp_pack_XXXXXX");
>  	*pack_tmp_name = strbuf_detach(&tmpname, NULL);
> -	return hashfd(fd, *pack_tmp_name);
> +	return hashfd_throughput(fd, *pack_tmp_name, progress);
>  }
>  
>  void finish_tmp_packfile(struct strbuf *name_buffer,
> diff --git a/pack.h b/pack.h
> index 34a9d458b4..c87628b093 100644
> --- a/pack.h
> +++ b/pack.h
> @@ -98,7 +98,8 @@ extern int encode_in_pack_object_header(unsigned char *hdr, int hdr_len,
>  #define PH_ERROR_PROTOCOL	(-3)
>  extern int read_pack_header(int fd, struct pack_header *);
>  
> -extern struct hashfile *create_tmp_packfile(char **pack_tmp_name);
> +extern struct hashfile *create_tmp_packfile(char **pack_tmp_name,
> +					    struct progress *progress);
>  extern void finish_tmp_packfile(struct strbuf *name_buffer, const char *pack_tmp_name, struct pack_idx_entry **written_list, uint32_t nr_written, struct pack_idx_option *pack_idx_opts, unsigned char sha1[]);
>  
>  #endif


* Re: non-smooth progress indication for git fsck and git gc
  2018-08-16 21:06       ` Jeff King
@ 2018-08-17 14:39         ` Duy Nguyen
  0 siblings, 0 replies; 25+ messages in thread
From: Duy Nguyen @ 2018-08-17 14:39 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Ulrich.Windl,
	Git Mailing List

On Thu, Aug 16, 2018 at 11:08 PM Jeff King <peff@peff.net> wrote:
>
> On Thu, Aug 16, 2018 at 04:55:56PM -0400, Jeff King wrote:
>
> > >  * We spend the majority of the ~30s on this:
> > >    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79
> >
> > This is hashing the actual packfile. This is potentially quite long,
> > especially if you have a ton of big objects.
> >
> > I wonder if we need to do this as a separate step anyway, though. Our
> > verification is based on index-pack these days, which means it's going
> > to walk over the whole content as part of the "Indexing objects" step to
> > expand base objects and mark deltas for later. Could we feed this hash
> > as part of that walk over the data? It's not going to save us 30s, but
> > it's likely to be more efficient. And it would fold the effort naturally
> > into the existing progress meter.
>
> Actually, I take it back. That's the nice, modern way we do it in
> git-verify-pack. But git-fsck uses the ancient "just walk over all of
> the idx entries method". It at least sorts in pack order, which is good,
> but:
>
>   - it's not multi-threaded, like index-pack/verify-pack
>
>   - the index-pack way is actually more efficient than pack-ordering for
>     the delta-base cache, because it actually walks the delta-graph in
>     the optimal order
>

I actually tried to make git-fsck use index-pack --verify at one
point. The only thing that stopped it from working was that index-pack
automatically wrote the newer index version, if I remember correctly,
and that would fail the final hash check. fsck performance was not a
big deal so I dropped it. Just saying it should be possible, if
someone's interested in that direction.
-- 
Duy


* Antw: Re: non-smooth progress indication for git fsck and git gc
  2018-08-16 15:18 ` Duy Nguyen
  2018-08-16 16:05   ` Jeff King
@ 2018-08-20  8:27   ` Ulrich Windl
  1 sibling, 0 replies; 25+ messages in thread
From: Ulrich Windl @ 2018-08-20  8:27 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: git

>>> Duy Nguyen <pclouds@gmail.com> wrote on 16.08.2018 at 17:18 in message
<CACsJy8Dukjw_PKQXMTxwd_C3juA_0cqZSjb=1L2wKqtJoC3rkQ@mail.gmail.com>:
> On Thu, Aug 16, 2018 at 1:10 PM Ulrich Windl
> <Ulrich.Windl@rz.uni-regensburg.de> wrote:
>>
>> Hi!
>>
>> I'd like to point out some minor issue observed while processing some 
> 50000-object repository with many binary objects, but most are rather small:
>>
>> Between the two phases of "git fsck" (checking directories and checking 
> objects) there was a break of several seconds where no progress was 
> indicated.
>>
>> During "git gc" the writing objects phase did not update for some seconds, 
> but then the percentage counter jumped like from 15% to 42%.
>>
>> I understand that updating the progress output too often can be a 
> performance bottleneck, while updating it too rarely might only bore the 
> user... ;-)
>>
>> But maybe something can be done. My git version is 2.13.7 (openSUSE 42.3).
> 
> Is it possible to make this repository public? You can also use "git
> fast-export --anonymize" to make a repo with same structure but no
> real content (but read the man page about that option first)

Hi!

Actually I tried that locally, but with the resulting repository both fsck and gc are very fast. So I guess it won't be very useful. Also, the original .git directory uses 5.3G, while the anonymous .git uses just 4.3M...

I tried to capture the behavior as a screencast, but it seems the screencast optimized the little changes away, and in the result git had almost no delay on any operation 8-(

Regards,
Ulrich

> 
>> Regards,
>> Ulrich
>>
>>
> 
> 
> -- 
> Duy





* Antw: Re: non-smooth progress  indication for git fsck and git gc
  2018-08-16 20:55     ` Jeff King
  2018-08-16 21:06       ` Jeff King
@ 2018-08-20  8:33       ` Ulrich Windl
  2018-08-20  8:57         ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 25+ messages in thread
From: Ulrich Windl @ 2018-08-20  8:33 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, peff; +Cc: git

>>> Jeff King <peff@peff.net> wrote on 16.08.2018 at 22:55 in message
<20180816205556.GA8257@sigill.intra.peff.net>:
> On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
>> This is all interesting, but I think unrelated to what Ulrich is talking
>> about. Quote:
>> 
>>     Between the two phases of "git fsck" (checking directories and
>>     checking objects) there was a break of several seconds where no
>>     progress was indicated
>> 
>> I.e. it's not about the pause you get with your testcase (which is
>> certainly another issue) but the break between the two progress bars.
> 
> I think he's talking about both. What I said responds to this:

Hi guys!

Yes, I was wondering what git does between the two visible phases, and between
the lines I was suggesting another progress message between those phases. At
least the maximum unspecific three-dot-message "Thinking..." could be displayed
;-) Of course anything more appropriate would be welcome.
Also that message should only be displayed if it's foreseeable that the
operation will take significant time. In my case (I just repeated it a few
minutes ago) the delay is significant (at least 10 seconds). As noted earlier I
was hoping to capture the timing in a screencast, but it seems all the delays
were just optimized away in the recording.

> 
>> >> During "git gc" the writing objects phase did not update for some
>> >> seconds, but then the percentage counter jumped like from 15% to 42%.
> 
> But yeah, I missed that the fsck thing was specifically about a break
> between two meters. That's a separate problem, but also worth
> discussing (and hopefully much easier to address).
> 
>> If you fsck this repository it'll take around (on my spinning rust
>> server) 30 seconds between 100% of "Checking object directories" before
>> you get any output from "Checking objects".
>> 
>> The breakdown of that is (this is from approximate eyeballing):
>> 
>>  * We spend 1-3 seconds just on this:
>>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181
> 
> OK, so that's checking the sha1 over the .idx file. We could put a meter
> on that. I wouldn't expect it to generally be all that slow outside of
> pathological cases, since it scales with the number of objects (and 1s
> is our minimum update anyway, so that might be OK as-is). Your case has
> 13M objects, which is quite large.

Sometimes an oldish CPU could bring performance surprises, maybe. Anyway my
CPU in question is an AMD Phenom2 quad-core with 3.2GHz nominal, and there is a
classic spinning disk with 5400RPM built in...

> 
>>  * We spend the majority of the ~30s on this:
>>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79
> 
> This is hashing the actual packfile. This is potentially quite long,
> especially if you have a ton of big objects.

That seems to apply. BTW: Is there a way to get some repository statistics
like a histogram of object sizes (or whatever else might be useful to help
making decisions)?

> 
> I wonder if we need to do this as a separate step anyway, though. Our
> verification is based on index-pack these days, which means it's going
> to walk over the whole content as part of the "Indexing objects" step to
> expand base objects and mark deltas for later. Could we feed this hash
> as part of that walk over the data? It's not going to save us 30s, but
> it's likely to be more efficient. And it would fold the effort naturally
> into the existing progress meter.
> 
>>  * We spend another 3-5 seconds on this QSORT:
>>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L105
> 
> That's a tough one. I'm not sure how we'd count it (how many compares we
> do?). And each item is doing so little work that hitting the progress
> code may make things noticeably slower.

If it's sorting, maybe add some code like (wild guess):

if (objects_to_sort > magic_number)
   message("Sorting something...");

> 
> Again, your case is pretty big. Just based on the number of objects,
> linux.git should be 1.5-2.5 seconds on your machine for the same
> operation. Which I think may be small enough to ignore (or even just
> print a generic before/after). It's really the 30s packfile hash that's
> making the whole thing so terrible.
> 
> -Peff





* Re: Antw: Re: non-smooth progress  indication for git fsck and git gc
  2018-08-20  8:33       ` Antw: " Ulrich Windl
@ 2018-08-20  8:57         ` Ævar Arnfjörð Bjarmason
  2018-08-20  9:37           ` Ulrich Windl
  2018-08-21  1:07           ` Jeff King
  0 siblings, 2 replies; 25+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-08-20  8:57 UTC (permalink / raw)
  To: Ulrich Windl; +Cc: peff, git


On Mon, Aug 20 2018, Ulrich Windl wrote:

>>>> Jeff King <peff@peff.net> wrote on 16.08.2018 at 22:55 in message
> <20180816205556.GA8257@sigill.intra.peff.net>:
>> On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>>> This is all interesting, but I think unrelated to what Ulrich is talking
>>> about. Quote:
>>>
>>>     Between the two phases of "git fsck" (checking directories and
>>>     checking objects) there was a break of several seconds where no
>>>     progress was indicated
>>>
>>> I.e. it's not about the pause you get with your testcase (which is
>>> certainly another issue) but the break between the two progress bars.
>>
>> I think he's talking about both. What I said responds to this:
>
> Hi guys!
>
> Yes, I was wondering what git does between the two visible phases, and between
> the lines I was suggesting another progress message between those phases. At
> least the maximum unspecific three-dot-message "Thinking..." could be displayed
> ;-) Of course anything more appropriate would be welcome.
> Also that message should only be displayed if it's foreseeable that the
> operation will take significant time. In my case (I just repeated it a few
> minutes ago) the delay is significant (at least 10 seconds). As noted earlier I
> was hoping to capture the timing in a screencast, but it seems all the delays
> were just optimized away in the recording.
>
>>
>>> >> During "git gc" the writing objects phase did not update for some
>>> >> seconds, but then the percentage counter jumped like from 15% to 42%.
>>
>> But yeah, I missed that the fsck thing was specifically about a break
>> between two meters. That's a separate problem, but also worth
>> discussing (and hopefully much easier to address).
>>
>>> If you fsck this repository it'll take around (on my spinning rust
>>> server) 30 seconds between 100% of "Checking object directories" before
>>> you get any output from "Checking objects".
>>>
>>> The breakdown of that is (this is from approximate eyeballing):
>>>
>>>  * We spend 1-3 seconds just on this:
>>>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181
>>
>> OK, so that's checking the sha1 over the .idx file. We could put a meter
>> on that. I wouldn't expect it to generally be all that slow outside of
>> pathological cases, since it scales with the number of objects (and 1s
>> is our minimum update anyway, so that might be OK as-is). Your case has
>> 13M objects, which is quite large.
>
> Sometimes an oldish CPU could bring performance surprises, maybe. Anyway my
>> CPU in question is an AMD Phenom2 quad-core with 3.2GHz nominal, and there is a
> classic spinning disk with 5400RPM built in...
>
>>
>>>  * We spend the majority of the ~30s on this:
>>>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79
>>
>> This is hashing the actual packfile. This is potentially quite long,
>> especially if you have a ton of big objects.
>
>> That seems to apply. BTW: Is there a way to get some repository statistics
> like a histogram of object sizes (or whatever that might be useful to help
> making decisions)?

The git-sizer program is really helpful in this regard:
https://github.com/github/git-sizer

>>
>> I wonder if we need to do this as a separate step anyway, though. Our
>> verification is based on index-pack these days, which means it's going
>> to walk over the whole content as part of the "Indexing objects" step to
>> expand base objects and mark deltas for later. Could we feed this hash
>> as part of that walk over the data? It's not going to save us 30s, but
>> it's likely to be more efficient. And it would fold the effort naturally
>> into the existing progress meter.
>>
>>>  * We spend another 3-5 seconds on this QSORT:
>>>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L105
>>
>> That's a tough one. I'm not sure how we'd count it (how many compares we
>> do?). And each item is doing so little work that hitting the progress
>> code may make things noticeably slower.
>
> If it's sorting, maybe add some code like (wild guess):
>
> if (objects_to_sort > magic_number)
>    message("Sorting something...");

I think a good solution to these cases is to just introduce something to
the progress.c code where it learns how to display a counter when we
don't know what the end state will be. Something like your proposed
magic_number can just be covered under the more general case where we
don't show the progress bar unless it's been 1 second (which I believe
is the default).

>>
>> Again, your case is pretty big. Just based on the number of objects,
>> linux.git should be 1.5-2.5 seconds on your machine for the same
>> operation. Which I think may be small enough to ignore (or even just
>> print a generic before/after). It's really the 30s packfile hash that's
>> making the whole thing so terrible.
>>
>> -Peff


* Re: Antw: Re: non-smooth progress  indication for git fsck and git gc
  2018-08-20  8:57         ` Ævar Arnfjörð Bjarmason
@ 2018-08-20  9:37           ` Ulrich Windl
  2018-08-21  1:07           ` Jeff King
  1 sibling, 0 replies; 25+ messages in thread
From: Ulrich Windl @ 2018-08-20  9:37 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: peff, git

Hi!

Here are some stats from the repository. First the fast-import ones (which had
good performance, but probably everything was cached, too):

% git fast-import <../git-stream
/usr/lib/git/git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects:      55000
Total objects:        51959 (        56 duplicates                  )
      blobs  :        47991 (         0 duplicates          0 deltas of          0 attempts)
      trees  :         3946 (        56 duplicates        994 deltas of       3929 attempts)
      commits:           22 (         0 duplicates          0 deltas of          0 attempts)
      tags   :            0 (         0 duplicates          0 deltas of          0 attempts)
Total branches:          15 (        15 loads     )
      marks:        1048576 (     48013 unique    )
      atoms:          43335
Memory total:          9611 KiB
       pools:          7033 KiB
     objects:          2578 KiB
---------------------------------------------------------------------
pack_report: getpagesize()            =       4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit      = 8589934592
pack_report: pack_used_ctr            =       1780
pack_report: pack_mmap_calls          =         23
pack_report: pack_open_windows        =          1 /          1
pack_report: pack_mapped              =    2905751 /    2905751
---------------------------------------------------------------------

Then the output from git-sizer:

Processing blobs: 47991                        
Processing trees: 3946                        
Processing commits: 22                        
Matching commits to trees: 22                        
Processing annotated tags: 0                        
Processing references: 15                        
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Blobs                      |           |                                |
|   * Total size               |  13.7 GiB | *                              |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Trees                      |           |                                |
|   * Maximum entries      [1] |  13.4 k   | *************                  |
| * Blobs                      |           |                                |
|   * Maximum size         [2] |   279 MiB | *****************************  |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Maximum path depth     [3] |    10     | *                              |
| * Maximum path length    [3] |   130 B   | *                              |
| * Total size of files    [3] |  8.63 GiB | *********                      |

[1]  b701345cbd4317276568b9d9890fd77f38933bdc
(refs/heads/master:Resources/default/CIFP)
[2]  19f54c4a7595667329c1be23200234f0fe50ac56
(refs/heads/master:Resources/default/apt.dat)
[3]  b0e3d3a2b7f2504117408f13486c905a8eb8fb1e (refs/heads/master^{tree})

Some notes:
[1] is a directory with many short (typically < 1kB) text files
[2] is a very large text file with many changes
[3] Yes, some paths are long

Regards,
Ulrich


>>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote on 20.08.2018 at 10:57 in
message <87woslpg9i.fsf@evledraar.gmail.com>:

> On Mon, Aug 20 2018, Ulrich Windl wrote:
> 
>>>>> Jeff King <peff@peff.net> wrote on 16.08.2018 at 22:55 in message
>> <20180816205556.GA8257@sigill.intra.peff.net>:
>>> On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>>
>>>> This is all interesting, but I think unrelated to what Ulrich is talking
>>>> about. Quote:
>>>>
>>>>     Between the two phases of "git fsck" (checking directories and
>>>>     checking objects) there was a break of several seconds where no
>>>>     progress was indicated
>>>>
>>>> I.e. it's not about the pause you get with your testcase (which is
>>>> certainly another issue) but the break between the two progress bars.
>>>
>>> I think he's talking about both. What I said responds to this:
>>
>> Hi guys!
>>
>> Yes, I was wondering what git does between the two visible phases, and
>> between the lines I was suggesting another progress message between those
>> phases. At least the maximum unspecific three-dot-message "Thinking..."
>> could be displayed ;-) Of course anything more appropriate would be welcome.
>> Also that message should only be displayed if it's foreseeable that the
>> operation will take significant time. In my case (I just repeated it a few
>> minutes ago) the delay is significant (at least 10 seconds). As noted
>> earlier I was hoping to capture the timing in a screencast, but it seems
>> all the delays were just optimized away in the recording.
>>
>>>
>>>> >> During "git gc" the writing objects phase did not update for some
>>>> >> seconds, but then the percentage counter jumped like from 15% to 42%.
>>>
>>> But yeah, I missed that the fsck thing was specifically about a break
>>> between two meters. That's a separate problem, but also worth
>>> discussing (and hopefully much easier to address).
>>>
>>>> If you fsck this repository it'll take around (on my spinning rust
>>>> server) 30 seconds between 100% of "Checking object directories" before
>>>> you get any output from "Checking objects".
>>>>
>>>> The breakdown of that is (this is from approximate eyeballing):
>>>>
>>>>  * We spend 1-3 seconds just on this:
>>>>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181
>>>
>>> OK, so that's checking the sha1 over the .idx file. We could put a meter
>>> on that. I wouldn't expect it to generally be all that slow outside of
>>> pathological cases, since it scales with the number of objects (and 1s
>>> is our minimum update anyway, so that might be OK as-is). Your case has
>>> 13M objects, which is quite large.
>>
>> Sometimes an oldish CPU could bring performance surprises, maybe. Anyway my
>> CPU in question is an AMD Phenom2 quad-core with 3.2GHz nominal, and there
>> is a classic spinning disk with 5400RPM built in...
>>
>>>
>>>>  * We spend the majority of the ~30s on this:
>>>>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79
>>>
>>> This is hashing the actual packfile. This is potentially quite long,
>>> especially if you have a ton of big objects.
>>
>> That seems to apply. BTW: Is there a way to get some repository statistics
>> like a histogram of object sizes (or whatever that might be useful to help
>> making decisions)?
> 
> The git-sizer program is really helpful in this regard:
> https://github.com/github/git-sizer 
> 
>>>
>>> I wonder if we need to do this as a separate step anyway, though. Our
>>> verification is based on index-pack these days, which means it's going
>>> to walk over the whole content as part of the "Indexing objects" step to
>>> expand base objects and mark deltas for later. Could we feed this hash
>>> as part of that walk over the data? It's not going to save us 30s, but
>>> it's likely to be more efficient. And it would fold the effort naturally
>>> into the existing progress meter.
>>>
>>>>  * We spend another 3-5 seconds on this QSORT:
>>>>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L105
>>>
>>> That's a tough one. I'm not sure how we'd count it (how many compares we
>>> do?). And each item is doing so little work that hitting the progress
>>> code may make things noticeably slower.
>>
>> If it's sorting, maybe add some code like (wild guess):
>>
>> if (objects_to_sort > magic_number)
>>    message("Sorting something...");
> 
> I think a good solution to these cases is to just introduce something to
> the progress.c mode where it learns how to display a counter where we
> don't know what the end-state will be. Something like your proposed
> magic_number can just be covered under the more general case where we
> don't show the progress bar unless it's been 1 second (which I believe
> is the default).
> 
>>>
>>> Again, your case is pretty big. Just based on the number of objects,
>>> linux.git should be 1.5-2.5 seconds on your machine for the same
>>> operation. Which I think may be small enough to ignore (or even just
>>> print a generic before/after). It's really the 30s packfile hash that's
>>> making the whole thing so terrible.
>>>
>>> -Peff





* Re: Antw: Re: non-smooth progress  indication for git fsck and git gc
  2018-08-20  8:57         ` Ævar Arnfjörð Bjarmason
  2018-08-20  9:37           ` Ulrich Windl
@ 2018-08-21  1:07           ` Jeff King
  2018-08-21  6:20             ` Ulrich Windl
  2018-08-21 15:21             ` Duy Nguyen
  1 sibling, 2 replies; 25+ messages in thread
From: Jeff King @ 2018-08-21  1:07 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Ulrich Windl, git

On Mon, Aug 20, 2018 at 10:57:13AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > That seems to apply. BTW: Is there a way to get some repository statistics
> > like a histogram of object sizes (or whatever that might be useful to help
> > making decisions)?
> 
> The git-sizer program is really helpful in this regard:
> https://github.com/github/git-sizer

Yeah, I'd very much agree that's the best tool for a general overview of
the repository stats.

Ulrich, if you want to more analysis (like a histogram), probably
something like:

  git cat-file --batch-all-objects --batch-check='%(objectsize)'

might be a good starting place to dump information (see the "BATCH
OUTPUT" section of "git help cat-file" for more format items).

> > If it's sorting, maybe add some code like (wild guess):
> >
> > if (objects_to_sort > magic_number)
> >    message("Sorting something...");
> 
> I think a good solution to these cases is to just introduce something to
> the progress.c mode where it learns how to display a counter where we
> don't know what the end-state will be. Something like your proposed
> magic_number can just be covered under the more general case where we
> don't show the progress bar unless it's been 1 second (which I believe
> is the default).

Yeah.  We already have open-ended counters (e.g., "counting objects"),
and delayed meters (we actually normalized the default to 2s recently).

I _think_ they should work together OK without further modification.
Once upon a time the caller had to say "don't show if we're past N%
after M seconds", but I think with the current code we'd just show it if
we're not completely finished after 2 seconds.

So it really should just be a simple:

  progress = start_delayed_progress("Hashing packfile", 0);

That said, counting bytes would probably be ugly (just because the
numbers get really big). We'd probably prefer to show a throughput or
something. And as you noted, there's some refactoring to be done with
pack-check for it to show multiple progress meters.

(I still think in the long run we would want to scrap that code, but
that's a much bigger job; I'm fine if somebody wants to do incremental
improvements in the meantime).

-Peff


* Re: Antw: Re: non-smooth progress  indication for git fsck and git gc
  2018-08-21  1:07           ` Jeff King
@ 2018-08-21  6:20             ` Ulrich Windl
  2018-08-21 15:21             ` Duy Nguyen
  1 sibling, 0 replies; 25+ messages in thread
From: Ulrich Windl @ 2018-08-21  6:20 UTC (permalink / raw)
  To: peff; +Cc: git

>>> Jeff King <peff@peff.net> wrote on 21.08.2018 at 03:07 in message
<20180821010712.GA32126@sigill.intra.peff.net>:
> On Mon, Aug 20, 2018 at 10:57:13AM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
[...]
> So it really should just be a simple:
> 
>   progress = start_delayed_progress("Hashing packfile", 0);
> 
> That said, counting bytes would probably be ugly (just because the
> numbers get really big). We'd probably prefer to show a throughput or

Hi!

With big numbers you could apply scaling (the usual ISO suffixes), but after
having read GBs, there wouldn't be much change visible unless you add
significant fractional digits.
So scaling seems good for absolute numbers, but not for growing numbers. Of
course one could recursively scale the fractional parts again, but then the
result becomes lengthy again.
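
(Something like coreutils' numfmt shows the kind of scaling I mean, e.g.

  numfmt --to=iec-i --suffix=B 2905751   # prints something like 2.8MiB

so a meter could show one scaled value, or a throughput figure, rather than
an ever-growing raw byte count.)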

Regards,
Ulrich

> something. And as you noted, there's some refactoring to be done with
> pack-check for it to show multiple progress meters.
> 
> (I still think in the long run we would want to scrap that code, but
> that's a much bigger job; I'm fine if somebody wants to do incremental
> improvements in the meantime).
> 
> -Peff





* Re: Antw: Re: non-smooth progress indication for git fsck and git gc
  2018-08-21  1:07           ` Jeff King
  2018-08-21  6:20             ` Ulrich Windl
@ 2018-08-21 15:21             ` Duy Nguyen
  1 sibling, 0 replies; 25+ messages in thread
From: Duy Nguyen @ 2018-08-21 15:21 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Ulrich Windl,
	Git Mailing List

On Tue, Aug 21, 2018 at 3:13 AM Jeff King <peff@peff.net> wrote:
> I _think_ they should work together OK without further modification.
> Once upon a time the caller had to say "don't show if we're past N%
> after M seconds", but I think with the current code we'd just show it if
> we're not completely finished after 2 seconds.
>
> So it really should just be a simple:
>
>   progress = start_delayed_progress("Hashing packfile", 0);
>
> That said, counting bytes would probably be ugly (just because the
> numbers get really big). We'd probably prefer to show a throughput or
> something.

Or just an ASCII throbber. I think the important thing is to communicate
"I'm still doing something, not hanging up". "Hashing packfile"
satisfies the "something" part, the throbber the "hanging".
-- 
Duy


* Re: non-smooth progress  indication for git fsck and git gc
  2018-08-16 20:35   ` Ævar Arnfjörð Bjarmason
  2018-08-16 20:55     ` Jeff King
@ 2018-09-01 12:53     ` Ævar Arnfjörð Bjarmason
  2018-09-01 13:52       ` Ævar Arnfjörð Bjarmason
  2018-09-02  7:46       ` Jeff King
  1 sibling, 2 replies; 25+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-09-01 12:53 UTC (permalink / raw)
  To: Jeff King; +Cc: Ulrich Windl, git


On Thu, Aug 16 2018, Ævar Arnfjörð Bjarmason wrote:

> On Thu, Aug 16 2018, Jeff King wrote:
>
>> On Thu, Aug 16, 2018 at 08:54:25AM +0200, Ulrich Windl wrote:
>>
>>> I'd like to point out some minor issue observed while processing some
>>> 50000-object repository with many binary objects, but most are rather
>>> small:
>>>
>>> Between the two phases of "git fsck" (checking directories and
>>> checking objects) there was a break of several seconds where no
>>> progress was indicated.
>>>
>>> During "git gc" the writing objects phase did not update for some
>>> seconds, but then the percentage counter jumped like from 15% to 42%.
>>>
>>> I understand that updating the progress output too often can be a
>>> performance bottleneck, while upating it too rarely might only bore
>>> the user... ;-)
>>
>> We update the counter integer for every object we process, and then
>> actually update the display whenever the percentage increases or a
>> second has elapsed, whichever comes first.
>>
>> What you're seeing is likely an artifact of _what_ we're counting:
>> written objects. Not all objects are the same size, so it's not uncommon
>> to see thousands/sec when dealing with small ones, and then several
>> seconds for one giant blob.
>>
>> The only way to solve that is to count bytes. We don't have a total byte
>> count in most cases, and it wouldn't always make sense (e.g., the
>> "Compressing objects" meter can show the same issue, but it's not really
>> putting through bytes in a linear way).  In some cases we do show
>> transmitted size and throughput, but that's just for network operations.
>> We could do the same for "gc" with the patch below. But usually
>> throughput isn't all that interesting for a filesystem write, because
>> bandwidth isn't the bottleneck.
>>
>> Possibly we could have a "half throughput" mode that counts up the total
>> size written, but omits the speed indicator. That's not an unreasonable
>> thing to show for a local pack, since you end up with the final pack
>> size. The object counter would still be jumpy, but you'd at least have
>> one number updated at least once per second as you put through a large
>> blob.
>>
>> If you really want a smooth percentage, then we'd have to start counting
>> bytes instead of objects. Two reasons we don't do that are:
>>
>>   - we often don't know the total byte size exactly. E.g., for a
>>     packfile write, it depends on the result of deflating each object.
>>     You can make an approximation and just pretend at the end that you
>>     hit 100%.  Or you can count the pre-deflate sizes, but then your
>>     meter doesn't match the bytes from the throughput counter.
>>
>>   - for something like fsck, we're not actually writing out bytes.  So I
>>     guess you'd be measuring "here's how many bytes of objects I
>>     fsck-ed". But is that on-disk compressed bytes, or decompressed
>>     bytes?
>>
>>     If the former, that's only marginally better as a measure of effort,
>>     since delta compression means that a small number of on-disk bytes
>>     may require a big effort (imagine processing a 100 byte blob versus
>>     a 100 byte delta off of a 100MB blob).
>>
>>     The latter is probably more accurate. But it's also not free to
>>     pre-generate the total. We can get the number of objects or the size
>>     of the packfile in constant-time, but totaling up the uncompressed
>>     size of all objects is O(n). So that's extra computation, but it
>>     also means a potential lag before we can start the progress meter.
>>
>>     I'm also not sure how meaningful a byte count is for a user there.
>>
>> So there. That's probably more than you wanted to know about Git's
>> progress code. I think it probably _could_ be improved by counting
>> more/different things, but I also think it can be a bit of a rabbit
>> hole. Which is why AFAIK nobody's really looked too seriously into
>> changing it.
>>
>> -Peff
>
> This is all interesting, but I think unrelated to what Ulrich is talking
> about. Quote:
>
>     Between the two phases of "git fsck" (checking directories and
>     checking objects) there was a break of several seconds where no
>     progress was indicated
>
> I.e. it's not about the pause you get with your testcase (which is
> certainly another issue) but the break between the two progress bars.
>
> Here's a test case you can clone:
> https://github.com/avar/2015-04-03-1M-git (or might already have
> "locally" :)
>
> If you fsck this repository it'll take around (on my spinning rust
> server) 30 seconds between 100% of "Checking object directories" before
> you get any output from "Checking objects".
>
> The breakdown of that is (this is from approximate eyeballing):
>
>  * We spend 1-3 seconds just on this:
>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181
>
>  * We spend the majority of the ~30s on this:
>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79
>
>  * Wes spend another 3-5 seconds on this QSORT:
>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L105
>
> I.e. it's not about objects v.s. bytes, but the structural problem with
> the code that we pass a progress bar down to verify_pack() which does a
> lot of work in verify_pack_index() and verify_packfile() before we even
> get to iterating over the objects in the file, and only then do we start
> displaying progress.

This patch almost entirely fixes the problem, except for the 1-3 seconds
we take on the qsort:

    diff --git a/sha1dc/sha1.c b/sha1dc/sha1.c
    index df0630bc6d..6d6b9b54c4 100644
    --- a/sha1dc/sha1.c
    +++ b/sha1dc/sha1.c
    @@ -25,6 +25,7 @@

     #include "sha1.h"
     #include "ubc_check.h"
    +#include "../progress.h"

     #if (defined(__amd64__) || defined(__amd64) || defined(__x86_64__) || defined(__x86_64) || \
          defined(i386) || defined(__i386) || defined(__i386__) || defined(__i486__)  || \
    @@ -1820,6 +1821,8 @@ void SHA1DCSetCallback(SHA1_CTX* ctx, collision_block_callback callback)
     void SHA1DCUpdate(SHA1_CTX* ctx, const char* buf, size_t len)
     {
            unsigned left, fill;
    +       struct progress *progress;
    +       size_t len_cp = len, done = 0;

            if (len == 0)
                    return;
    @@ -1836,6 +1839,8 @@ void SHA1DCUpdate(SHA1_CTX* ctx, const char* buf, size_t len)
                    len -= fill;
                    left = 0;
            }
    +
    +       progress = start_delayed_progress(_("Hashing"), len_cp);
            while (len >= 64)
            {
                    ctx->total += 64;
    @@ -1848,7 +1853,12 @@ void SHA1DCUpdate(SHA1_CTX* ctx, const char* buf, size_t len)
     #endif /* defined(SHA1DC_ALLOW_UNALIGNED_ACCESS) */
                    buf += 64;
                    len -= 64;
    +               done += 64;
    +               display_progress(progress, done);
            }
    +       display_progress(progress, len_cp);
    +       stop_progress(&progress);
    +
            if (len > 0)
            {
                    ctx->total += len;

With this we'll get output like:

    $ ~/g/git/git  -C ~/g/2015-04-03-1M-git/  --exec-path=$PWD fsck
    Checking object directories: 100% (256/256), done.
    Hashing: 100% (452634108/452634108), done.
    Hashing: 100% (1073741824/1073741824), done.
    Hashing: 100% (1073741824/1073741824), done.
    Hashing: 100% (1008001572/1008001572), done.
    Checking objects:   2% (262144/13064614)
    ^C

All tests pass with this. Isn't it awesome? Except it's of course a
massive hack, we wouldn't want to just hook into SHA1DC like this.

The problem comes down to us needing to call git_hash_sha1_update() with
some really huge input, that function is going to take a *long* time,
and the only way we're getting incremental progress is:

 1) If we ourselves split the input into N chunks
 2) If we hack into the SHA1 library itself

This patch does #2, but for this to be acceptable we'd need to do
something like #1.

The least hacky way I can think of doing this is something like the
following: We'd define two git_hash_algo entries for SHA-1, one that could give you
progress reports, and that one would only ever hash in chunks of at most
size N, and would wrap git_SHA1_Update() in something similar to the
progress code above.

We'd also need to extend the git_hash_algo struct to e.g. have a char *
we'd use as a description, instead of just "Hashing".

Then the caller here could just switch out the hash algo if
show_progress was true, temporarily set some "int do_progress" and "char
*progress_description" entries in the struct, and off we go:

    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L69-L80

Still a hack, but I can't see another way of making it work without
compromising all hashing performance with some progress reporting.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: non-smooth progress  indication for git fsck and git gc
  2018-09-01 12:53     ` Ævar Arnfjörð Bjarmason
@ 2018-09-01 13:52       ` Ævar Arnfjörð Bjarmason
  2018-09-02  7:46       ` Jeff King
  1 sibling, 0 replies; 25+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-09-01 13:52 UTC (permalink / raw)
  To: Jeff King; +Cc: Ulrich Windl, git


On Sat, Sep 01 2018, Ævar Arnfjörð Bjarmason wrote:

> On Thu, Aug 16 2018, Ævar Arnfjörð Bjarmason wrote:
>
>> On Thu, Aug 16 2018, Jeff King wrote:
>>
>>> On Thu, Aug 16, 2018 at 08:54:25AM +0200, Ulrich Windl wrote:
>>>
>>>> I'd like to point out some minor issue observed while processing some
>>>> 50000-object repository with many binary objects, but most are rather
>>>> small:
>>>>
>>>> Between the two phases of "git fsck" (checking directories and
>>>> checking objects) there was a break of several seconds where no
>>>> progress was indicated.
>>>>
>>>> During "git gc" the writing objects phase did not update for some
>>>> seconds, but then the percentage counter jumped like from 15% to 42%.
>>>>
>>>> I understand that updating the progress output too often can be a
>>>> performance bottleneck, while upating it too rarely might only bore
>>>> the user... ;-)
>>>
>>> We update the counter integer for every object we process, and then
>>> actually update the display whenever the percentage increases or a
>>> second has elapsed, whichever comes first.
>>>
>>> What you're seeing is likely an artifact of _what_ we're counting:
>>> written objects. Not all objects are the same size, so it's not uncommon
>>> to see thousands/sec when dealing with small ones, and then several
>>> seconds for one giant blob.
>>>
>>> The only way to solve that is to count bytes. We don't have a total byte
>>> count in most cases, and it wouldn't always make sense (e.g., the
>>> "Compressing objects" meter can show the same issue, but it's not really
>>> putting through bytes in a linear way).  In some cases we do show
>>> transmitted size and throughput, but that's just for network operations.
>>> We could do the same for "gc" with the patch below. But usually
>>> throughput isn't all that interesting for a filesystem write, because
>>> bandwidth isn't the bottleneck.
>>>
>>> Possibly we could have a "half throughput" mode that counts up the total
>>> size written, but omits the speed indicator. That's not an unreasonable
>>> thing to show for a local pack, since you end up with the final pack
>>> size. The object counter would still be jumpy, but you'd at least have
>>> one number updated at least once per second as you put through a large
>>> blob.
>>>
>>> If you really want a smooth percentage, then we'd have to start counting
>>> bytes instead of objects. Two reasons we don't do that are:
>>>
>>>   - we often don't know the total byte size exactly. E.g., for a
>>>     packfile write, it depends on the result of deflating each object.
>>>     You can make an approximation and just pretend at the end that you
>>>     hit 100%.  Or you can count the pre-deflate sizes, but then your
>>>     meter doesn't match the bytes from the throughput counter.
>>>
>>>   - for something like fsck, we're not actually writing out bytes.  So I
>>>     guess you'd be measuring "here's how many bytes of objects I
>>>     fsck-ed". But is that on-disk compressed bytes, or decompressed
>>>     bytes?
>>>
>>>     If the former, that's only marginally better as a measure of effort,
>>>     since delta compression means that a small number of on-disk bytes
>>>     may require a big effort (imagine processing a 100 byte blob versus
>>>     a 100 byte delta off of a 100MB blob).
>>>
>>>     The latter is probably more accurate. But it's also not free to
>>>     pre-generate the total. We can get the number of objects or the size
>>>     of the packfile in constant-time, but totaling up the uncompressed
>>>     size of all objects is O(n). So that's extra computation, but it
>>>     also means a potential lag before we can start the progress meter.
>>>
>>>     I'm also not sure how meaningful a byte count is for a user there.
>>>
>>> So there. That's probably more than you wanted to know about Git's
>>> progress code. I think it probably _could_ be improved by counting
>>> more/different things, but I also think it can be a bit of a rabbit
>>> hole. Which is why AFAIK nobody's really looked too seriously into
>>> changing it.
>>>
>>> -Peff
>>
>> This is all interesting, but I think unrelated to what Ulrich is talking
>> about. Quote:
>>
>>     Between the two phases of "git fsck" (checking directories and
>>     checking objects) there was a break of several seconds where no
>>     progress was indicated
>>
>> I.e. it's not about the pause you get with your testcase (which is
>> certainly another issue) but the break between the two progress bars.
>>
>> Here's a test case you can clone:
>> https://github.com/avar/2015-04-03-1M-git (or might already have
>> "locally" :)
>>
>> If you fsck this repository it'll take around (on my spinning rust
>> server) 30 seconds between 100% of "Checking object directories" before
>> you get any output from "Checking objects".
>>
>> The breakdown of that is (this is from approximate eyeballing):
>>
>>  * We spend 1-3 seconds just on this:
>>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181
>>
>>  * We spend the majority of the ~30s on this:
>>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79
>>
>>  * Wes spend another 3-5 seconds on this QSORT:
>>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L105
>>
>> I.e. it's not about objects v.s. bytes, but the structural problem with
>> the code that we pass a progress bar down to verify_pack() which does a
>> lot of work in verify_pack_index() and verify_packfile() before we even
>> get to iterating over the objects in the file, and only then do we start
>> displaying progress.
>
> This patch almost entirely fixes the problem, except for the 1-3 seconds
> we take on the qsort:
>
>     diff --git a/sha1dc/sha1.c b/sha1dc/sha1.c
>     index df0630bc6d..6d6b9b54c4 100644
>     --- a/sha1dc/sha1.c
>     +++ b/sha1dc/sha1.c
>     @@ -25,6 +25,7 @@
>
>      #include "sha1.h"
>      #include "ubc_check.h"
>     +#include "../progress.h"
>
>      #if (defined(__amd64__) || defined(__amd64) || defined(__x86_64__) || defined(__x86_64) || \
>           defined(i386) || defined(__i386) || defined(__i386__) || defined(__i486__)  || \
>     @@ -1820,6 +1821,8 @@ void SHA1DCSetCallback(SHA1_CTX* ctx, collision_block_callback callback)
>      void SHA1DCUpdate(SHA1_CTX* ctx, const char* buf, size_t len)
>      {
>             unsigned left, fill;
>     +       struct progress *progress;
>     +       size_t len_cp = len, done = 0;
>
>             if (len == 0)
>                     return;
>     @@ -1836,6 +1839,8 @@ void SHA1DCUpdate(SHA1_CTX* ctx, const char* buf, size_t len)
>                     len -= fill;
>                     left = 0;
>             }
>     +
>     +       progress = start_delayed_progress(_("Hashing"), len_cp);
>             while (len >= 64)
>             {
>                     ctx->total += 64;
>     @@ -1848,7 +1853,12 @@ void SHA1DCUpdate(SHA1_CTX* ctx, const char* buf, size_t len)
>      #endif /* defined(SHA1DC_ALLOW_UNALIGNED_ACCESS) */
>                     buf += 64;
>                     len -= 64;
>     +               done += 64;
>     +               display_progress(progress, done);
>             }
>     +       display_progress(progress, len_cp);
>     +       stop_progress(&progress);
>     +
>             if (len > 0)
>             {
>                     ctx->total += len;
>
> With this we'll get output like:
>
>     $ ~/g/git/git  -C ~/g/2015-04-03-1M-git/  --exec-path=$PWD fsck
>     Checking object directories: 100% (256/256), done.
>     Hashing: 100% (452634108/452634108), done.
>     Hashing: 100% (1073741824/1073741824), done.
>     Hashing: 100% (1073741824/1073741824), done.
>     Hashing: 100% (1008001572/1008001572), done.
>     Checking objects:   2% (262144/13064614)
>     ^C
>
> All tests pass with this. Isn't it awesome? Except it's of course a
> massive hack, we wouldn't want to just hook into SHA1DC like this.
>
> The problem comes down to us needing to call git_hash_sha1_update() with
> some really huge input, that function is going to take a *long* time,
> and the only way we're getting incremental progress is:
>
>  1) If we ourselves split the input into N chunks
>  2) If we hack into the SHA1 library itself
>
> This patch does #2, but for this to be acceptable we'd need to do
> something like #1.
>
> The least hacky way I can think of doing this is something like the
> following: We'd define two git_hash_algo entries for SHA-1, one that could give you
> progress reports, and that one would only ever hash in chunks of at most
> size N, and would wrap git_SHA1_Update() in something similar to the
> progress code above.
>
> We'd also need to extend the git_hash_algo struct to e.g. have a char *
> we'd use as a description, instead of just "Hashing".
>
> Then the caller here could just switch out the hash algo if
> show_progress was true, temporarily some "int do_progress" and "char
> *progress_description" entries in the struct, and off we go:
>
>     https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L69-L80
>
> Still a hack, but I can't see another way of making it work without
> compromising all hashing performance with some progress reporting.

Here's a POC version which does #1. The chunk size is 1 byte (just for
simplicity), and this one is independent of the SHA-1 hash implementation:

    diff --git a/hash.h b/hash.h
    index 7c8238bc2e..f0cc96586d 100644
    --- a/hash.h
    +++ b/hash.h
    @@ -52,8 +52,9 @@
     #define GIT_HASH_UNKNOWN 0
     /* SHA-1 */
     #define GIT_HASH_SHA1 1
    +#define GIT_HASH_SHA1_WITH_PROGRESS 2
     /* Number of algorithms supported (including unknown). */
    -#define GIT_HASH_NALGOS (GIT_HASH_SHA1 + 1)
    +#define GIT_HASH_NALGOS (GIT_HASH_SHA1_WITH_PROGRESS + 1)

     /* A suitably aligned type for stack allocations of hash contexts. */
     union git_hash_ctx {
    @@ -95,7 +96,11 @@ struct git_hash_algo {

     	/* The OID of the empty blob. */
     	const struct object_id *empty_blob;
    +
    +	/* Are we showing hashing progress? And if so what are we doing? */
    +	int show_progress;
    +	char *progress_title;
     };
    -extern const struct git_hash_algo hash_algos[GIT_HASH_NALGOS];
    +extern struct git_hash_algo hash_algos[GIT_HASH_NALGOS];

     #endif
    diff --git a/pack-check.c b/pack-check.c
    index d3a57df34f..a96a8f4d10 100644
    --- a/pack-check.c
    +++ b/pack-check.c
    @@ -66,6 +66,9 @@ static int verify_packfile(struct packed_git *p,
     	if (!is_pack_valid(p))
     		return error("packfile %s cannot be accessed", p->pack_name);

    +	repo_set_hash_algo(the_repository, GIT_HASH_SHA1_WITH_PROGRESS);
    +	the_hash_algo->show_progress = 1;
    +	the_hash_algo->progress_title = xstrdup("Hashing");
     	the_hash_algo->init_fn(&ctx);
     	do {
     		unsigned long remaining;
    @@ -78,6 +81,9 @@ static int verify_packfile(struct packed_git *p,
     		the_hash_algo->update_fn(&ctx, in, remaining);
     	} while (offset < pack_sig_ofs);
     	the_hash_algo->final_fn(hash, &ctx);
    +	free(the_hash_algo->progress_title);
    +	repo_set_hash_algo(the_repository, GIT_HASH_SHA1);
    +
     	pack_sig = use_pack(p, w_curs, pack_sig_ofs, NULL);
     	if (hashcmp(hash, pack_sig))
     		err = error("%s pack checksum mismatch",
    diff --git a/repository.h b/repository.h
    index 9f16c42c1e..1d2954773c 100644
    --- a/repository.h
    +++ b/repository.h
    @@ -88,7 +88,7 @@ struct repository {
     	struct index_state *index;

     	/* Repository's current hash algorithm, as serialized on disk. */
    -	const struct git_hash_algo *hash_algo;
    +	struct git_hash_algo *hash_algo;

     	/* Configurations */

    diff --git a/sha1-file.c b/sha1-file.c
    index 97b7423848..9caf1f02e7 100644
    --- a/sha1-file.c
    +++ b/sha1-file.c
    @@ -32,6 +32,7 @@
     #include "packfile.h"
     #include "fetch-object.h"
     #include "object-store.h"
    +#include "progress.h"

     /* The maximum size for an object header. */
     #define MAX_HEADER_LEN 32
    @@ -61,7 +62,33 @@ static void git_hash_sha1_init(git_hash_ctx *ctx)

     static void git_hash_sha1_update(git_hash_ctx *ctx, const void *data, size_t len)
     {
    -	git_SHA1_Update(&ctx->sha1, data, len);
    +	void *chunk;
    +	int i;
    +
    +	for (i = 0; i < len; i++) {
    +		chunk = (void *)data + i;
    +		git_SHA1_Update(&ctx->sha1, chunk, 1);
    +	}
    +}
    +
    +static void git_hash_sha1_update_progress(git_hash_ctx *ctx, const void *data, size_t len)
    +{
    +	struct progress *progress;
    +	void *chunk;
    +	int i;
    +
    +	if (len >= 100000) {
    +		progress = start_delayed_progress(_(the_hash_algo->progress_title), len);
    +		for (i = 0; i < len; i++) {
    +			chunk = (void *)data + i;
    +			git_SHA1_Update(&ctx->sha1, chunk, 1);
    +			display_progress(progress, i);
    +		}
    +		display_progress(progress, i);
    +		stop_progress(&progress);
    +	} else {
    +		git_SHA1_Update(&ctx->sha1, data, len);
    +	}
     }

     static void git_hash_sha1_final(unsigned char *hash, git_hash_ctx *ctx)
    @@ -84,7 +111,7 @@ static void git_hash_unknown_final(unsigned char *hash, git_hash_ctx *ctx)
     	BUG("trying to finalize unknown hash");
     }

    -const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
    +struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
     	{
     		NULL,
     		0x00000000,
    @@ -95,6 +122,8 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
     		git_hash_unknown_final,
     		NULL,
     		NULL,
    +		0,
    +		NULL,
     	},
     	{
     		"sha-1",
    @@ -107,6 +136,22 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
     		git_hash_sha1_final,
     		&empty_tree_oid,
     		&empty_blob_oid,
    +		0,
    +		NULL,
    +	},
    +	{
    +		"sha-1",
    +		/* "sha1", big-endian */
    +		0x73686131,
    +		GIT_SHA1_RAWSZ,
    +		GIT_SHA1_HEXSZ,
    +		git_hash_sha1_init,
    +		git_hash_sha1_update_progress,
    +		git_hash_sha1_final,
    +		&empty_tree_oid,
    +		&empty_blob_oid,
    +		0,
    +		NULL,
     	},
     };

Now, doing s/const // on hash_algos is scary, and the whole notion of
needing to store the "do progress" and "title" info there is a bit
nasty.

I don't actually make use of the "show_progress" member; instead I just
hardcode a large input size above which we start showing progress.

Consider this an RFC. I'd like to submit some patch like this (with the
obvious bugs fixed, e.g. chunking in some sane size). It's nasty, but very
useful: we'll get progress output during fsck where currently we just hang
for 30s on this large repo.
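
For what it's worth, the "chunk in some sane size" part could look roughly
like the sketch below. It reuses the progress.h calls and git_SHA1_Update()
seen in the patches above, but the 8 MiB chunk size and the exact shape are
assumptions, not a finished patch:

    /*
     * Sketch only: same idea as git_hash_sha1_update_progress() above,
     * but feeding the underlying SHA-1 in fixed-size chunks instead of
     * one byte at a time. The chunk size is an arbitrary choice.
     */
    #define HASH_PROGRESS_CHUNK (8 * 1024 * 1024)

    static void git_hash_sha1_update_progress(git_hash_ctx *ctx,
                                              const void *data, size_t len)
    {
            const unsigned char *p = data;
            struct progress *progress = NULL;
            size_t done = 0;

            if (len >= HASH_PROGRESS_CHUNK)
                    progress = start_delayed_progress(_("Hashing"), len);

            while (done < len) {
                    size_t n = len - done;
                    if (n > HASH_PROGRESS_CHUNK)
                            n = HASH_PROGRESS_CHUNK;
                    git_SHA1_Update(&ctx->sha1, p + done, n);
                    done += n;
                    display_progress(progress, done);
            }
            stop_progress(&progress);
    }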

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: non-smooth progress  indication for git fsck and git gc
  2018-09-01 12:53     ` Ævar Arnfjörð Bjarmason
  2018-09-01 13:52       ` Ævar Arnfjörð Bjarmason
@ 2018-09-02  7:46       ` Jeff King
  2018-09-02  7:55         ` Jeff King
  1 sibling, 1 reply; 25+ messages in thread
From: Jeff King @ 2018-09-02  7:46 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Ulrich Windl, git

On Sat, Sep 01, 2018 at 02:53:28PM +0200, Ævar Arnfjörð Bjarmason wrote:

> With this we'll get output like:
> 
>     $ ~/g/git/git  -C ~/g/2015-04-03-1M-git/  --exec-path=$PWD fsck
>     Checking object directories: 100% (256/256), done.
>     Hashing: 100% (452634108/452634108), done.
>     Hashing: 100% (1073741824/1073741824), done.
>     Hashing: 100% (1073741824/1073741824), done.
>     Hashing: 100% (1008001572/1008001572), done.
>     Checking objects:   2% (262144/13064614)
>     ^C
> 
> All tests pass with this. Isn't it awesome? Except it's of course a
> massive hack, we wouldn't want to just hook into SHA1DC like this.

I still consider that output so-so; the byte counts are big and there's
no indication how many "hashing" lines we're going to see. It's also
broken up in a weird way (it's not one per file; it's one per giant
chunk we fed to sha1).

> The problem comes down to us needing to call git_hash_sha1_update() with
> some really huge input, that function is going to take a *long* time,
> and the only way we're getting incremental progress is:
> 
>  1) If we ourselves split the input into N chunks
>  2) If we hack into the SHA1 library itself
> 
> This patch does #2, but for this to be acceptable we'd need to do
> something like #1.

I think we could just do the chunking in verify_packfile(), couldn't we?
(And the .idx hash, if we really want to cover that case, but IMHO
that's way less interesting).

Something like this, which chunks it there, uses a per-packfile meter
(though still does not give any clue how many packfiles there are), and
shows a throughput meter.

diff --git a/pack-check.c b/pack-check.c
index d3a57df34f..c94223664f 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -62,10 +62,25 @@ static int verify_packfile(struct packed_git *p,
 	uint32_t nr_objects, i;
 	int err = 0;
 	struct idx_entry *entries;
+	struct progress *hashing_progress;
+	char *title;
+	off_t total_hashed = 0;
 
 	if (!is_pack_valid(p))
 		return error("packfile %s cannot be accessed", p->pack_name);
 
+	if (progress) {
+		/* Probably too long... */
+		title = xstrfmt("Hashing %s", p->pack_name);
+
+		/*
+		 * I don't think it actually works to have two progresses going
+		 * at the same time, because when the first one ends, we'll
+		 * cancel the alarm. But hey, this is a hacky proof of concept.
+		 */
+		hashing_progress = start_progress(title, 0);
+	}
+
 	the_hash_algo->init_fn(&ctx);
 	do {
 		unsigned long remaining;
@@ -75,9 +90,25 @@ static int verify_packfile(struct packed_git *p,
 			pack_sig_ofs = p->pack_size - the_hash_algo->rawsz;
 		if (offset > pack_sig_ofs)
 			remaining -= (unsigned int)(offset - pack_sig_ofs);
-		the_hash_algo->update_fn(&ctx, in, remaining);
+		while (remaining) {
+			int chunk = remaining < 4096 ? remaining : 4096;
+			the_hash_algo->update_fn(&ctx, in, chunk);
+			in += chunk;
+			remaining -= chunk;
+			total_hashed += chunk;
+			/*
+			 * The progress code needs tweaking to show throughputs
+			 * better for open-ended meters.
+			 */
+			display_throughput(hashing_progress, total_hashed);
+			display_progress(hashing_progress, 0);
+		}
 	} while (offset < pack_sig_ofs);
+
 	the_hash_algo->final_fn(hash, &ctx);
+	stop_progress(&hashing_progress);
+	free(title);
+
 	pack_sig = use_pack(p, w_curs, pack_sig_ofs, NULL);
 	if (hashcmp(hash, pack_sig))
 		err = error("%s pack checksum mismatch",

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: non-smooth progress  indication for git fsck and git gc
  2018-09-02  7:46       ` Jeff King
@ 2018-09-02  7:55         ` Jeff King
  2018-09-02  8:55           ` Jeff King
  2018-09-04 15:53           ` Junio C Hamano
  0 siblings, 2 replies; 25+ messages in thread
From: Jeff King @ 2018-09-02  7:55 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Ulrich Windl, git

On Sun, Sep 02, 2018 at 03:46:57AM -0400, Jeff King wrote:

> Something like this, which chunks it there, uses a per-packfile meter
> (though still does not give any clue how many packfiles there are), and
> shows a throughput meter.

Actually, in typical cases it would not matter how many packfiles there
are, because there's generally one big one, and then none of the small
ones would take long enough to trigger the delayed meter. So you'd only
see one such meter usually, and it would cover the majority of the time.
In theory, anyway.

I still think the more interesting long-term thing here is to reuse the
pack verification from index-pack, which actually hashes as it does the
per-object countup.

That code isn't lib-ified enough to be run in process, but I think the
patch below should give similar behavior to what fsck currently does.
We'd need to tell index-pack to use our fsck.* config for its checks, I
imagine. The progress here is still per-pack, but I think we could pass
in sufficient information to have it do one continuous meter across all
of the packs (see the in-code comment).

And it makes the result multi-threaded, and lets us drop a bunch of
duplicate code.

---
 builtin/fsck.c |  53 +++++++------
 pack-check.c   | 142 -----------------------------------
 pack.h         |   1 -
 3 files changed, 32 insertions(+), 164 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 250f5af118..643e16125b 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -386,25 +386,6 @@ static int fsck_obj(struct object *obj, void *buffer, unsigned long size)
 	return err;
 }
 
-static int fsck_obj_buffer(const struct object_id *oid, enum object_type type,
-			   unsigned long size, void *buffer, int *eaten)
-{
-	/*
-	 * Note, buffer may be NULL if type is OBJ_BLOB. See
-	 * verify_packfile(), data_valid variable for details.
-	 */
-	struct object *obj;
-	obj = parse_object_buffer(the_repository, oid, type, size, buffer,
-				  eaten);
-	if (!obj) {
-		errors_found |= ERROR_OBJECT;
-		return error("%s: object corrupt or missing", oid_to_hex(oid));
-	}
-	obj->flags &= ~(REACHABLE | SEEN);
-	obj->flags |= HAS_OBJ;
-	return fsck_obj(obj, buffer, size);
-}
-
 static int default_refs;
 
 static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
@@ -662,6 +643,37 @@ static int mark_packed_for_connectivity(const struct object_id *oid,
 	return 0;
 }
 
+static int verify_pack(struct packed_git *p)
+{
+	struct child_process index_pack = CHILD_PROCESS_INIT;
+
+	for_each_object_in_pack(p, mark_packed_for_connectivity, NULL, 0);
+
+	/*
+	 * This should probably be replaced with a command-line option
+	 * that teaches "index-pack --verify" to show a more compact and
+	 * fsck-oriented progress (see also the "-v" below).
+	 */
+	if (show_progress)
+		fprintf(stderr, "Checking packfile %s...\n",
+			p->pack_name);
+
+	index_pack.git_cmd = 1;
+	argv_array_pushl(&index_pack.args,
+			 "index-pack",
+			 "--verify",
+			 "--strict",
+			 NULL);
+	if (show_progress)
+		argv_array_push(&index_pack.args, "-v");
+	argv_array_push(&index_pack.args, p->pack_name);
+
+	if (run_command(&index_pack))
+		return -1;
+
+	return 0;
+}
+
 static char const * const fsck_usage[] = {
 	N_("git fsck [<options>] [<object>...]"),
 	NULL
@@ -752,8 +764,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 			for (p = get_packed_git(the_repository); p;
 			     p = p->next) {
 				/* verify gives error messages itself */
-				if (verify_pack(p, fsck_obj_buffer,
-						progress, count))
+				if (verify_pack(p))
 					errors_found |= ERROR_PACK;
 				count += p->num_objects;
 			}
diff --git a/pack-check.c b/pack-check.c
index d3a57df34f..ea1457ce53 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -15,17 +15,6 @@ struct idx_entry {
 	unsigned int nr;
 };
 
-static int compare_entries(const void *e1, const void *e2)
-{
-	const struct idx_entry *entry1 = e1;
-	const struct idx_entry *entry2 = e2;
-	if (entry1->offset < entry2->offset)
-		return -1;
-	if (entry1->offset > entry2->offset)
-		return 1;
-	return 0;
-}
-
 int check_pack_crc(struct packed_git *p, struct pack_window **w_curs,
 		   off_t offset, off_t len, unsigned int nr)
 {
@@ -48,121 +37,6 @@ int check_pack_crc(struct packed_git *p, struct pack_window **w_curs,
 	return data_crc != ntohl(*index_crc);
 }
 
-static int verify_packfile(struct packed_git *p,
-			   struct pack_window **w_curs,
-			   verify_fn fn,
-			   struct progress *progress, uint32_t base_count)
-
-{
-	off_t index_size = p->index_size;
-	const unsigned char *index_base = p->index_data;
-	git_hash_ctx ctx;
-	unsigned char hash[GIT_MAX_RAWSZ], *pack_sig;
-	off_t offset = 0, pack_sig_ofs = 0;
-	uint32_t nr_objects, i;
-	int err = 0;
-	struct idx_entry *entries;
-
-	if (!is_pack_valid(p))
-		return error("packfile %s cannot be accessed", p->pack_name);
-
-	the_hash_algo->init_fn(&ctx);
-	do {
-		unsigned long remaining;
-		unsigned char *in = use_pack(p, w_curs, offset, &remaining);
-		offset += remaining;
-		if (!pack_sig_ofs)
-			pack_sig_ofs = p->pack_size - the_hash_algo->rawsz;
-		if (offset > pack_sig_ofs)
-			remaining -= (unsigned int)(offset - pack_sig_ofs);
-		the_hash_algo->update_fn(&ctx, in, remaining);
-	} while (offset < pack_sig_ofs);
-	the_hash_algo->final_fn(hash, &ctx);
-	pack_sig = use_pack(p, w_curs, pack_sig_ofs, NULL);
-	if (hashcmp(hash, pack_sig))
-		err = error("%s pack checksum mismatch",
-			    p->pack_name);
-	if (hashcmp(index_base + index_size - the_hash_algo->hexsz, pack_sig))
-		err = error("%s pack checksum does not match its index",
-			    p->pack_name);
-	unuse_pack(w_curs);
-
-	/* Make sure everything reachable from idx is valid.  Since we
-	 * have verified that nr_objects matches between idx and pack,
-	 * we do not do scan-streaming check on the pack file.
-	 */
-	nr_objects = p->num_objects;
-	ALLOC_ARRAY(entries, nr_objects + 1);
-	entries[nr_objects].offset = pack_sig_ofs;
-	/* first sort entries by pack offset, since unpacking them is more efficient that way */
-	for (i = 0; i < nr_objects; i++) {
-		entries[i].oid.hash = nth_packed_object_sha1(p, i);
-		if (!entries[i].oid.hash)
-			die("internal error pack-check nth-packed-object");
-		entries[i].offset = nth_packed_object_offset(p, i);
-		entries[i].nr = i;
-	}
-	QSORT(entries, nr_objects, compare_entries);
-
-	for (i = 0; i < nr_objects; i++) {
-		void *data;
-		enum object_type type;
-		unsigned long size;
-		off_t curpos;
-		int data_valid;
-
-		if (p->index_version > 1) {
-			off_t offset = entries[i].offset;
-			off_t len = entries[i+1].offset - offset;
-			unsigned int nr = entries[i].nr;
-			if (check_pack_crc(p, w_curs, offset, len, nr))
-				err = error("index CRC mismatch for object %s "
-					    "from %s at offset %"PRIuMAX"",
-					    oid_to_hex(entries[i].oid.oid),
-					    p->pack_name, (uintmax_t)offset);
-		}
-
-		curpos = entries[i].offset;
-		type = unpack_object_header(p, w_curs, &curpos, &size);
-		unuse_pack(w_curs);
-
-		if (type == OBJ_BLOB && big_file_threshold <= size) {
-			/*
-			 * Let check_object_signature() check it with
-			 * the streaming interface; no point slurping
-			 * the data in-core only to discard.
-			 */
-			data = NULL;
-			data_valid = 0;
-		} else {
-			data = unpack_entry(the_repository, p, entries[i].offset, &type, &size);
-			data_valid = 1;
-		}
-
-		if (data_valid && !data)
-			err = error("cannot unpack %s from %s at offset %"PRIuMAX"",
-				    oid_to_hex(entries[i].oid.oid), p->pack_name,
-				    (uintmax_t)entries[i].offset);
-		else if (check_object_signature(entries[i].oid.oid, data, size, type_name(type)))
-			err = error("packed %s from %s is corrupt",
-				    oid_to_hex(entries[i].oid.oid), p->pack_name);
-		else if (fn) {
-			int eaten = 0;
-			err |= fn(entries[i].oid.oid, type, size, data, &eaten);
-			if (eaten)
-				data = NULL;
-		}
-		if (((base_count + i) & 1023) == 0)
-			display_progress(progress, base_count + i);
-		free(data);
-
-	}
-	display_progress(progress, base_count + i);
-	free(entries);
-
-	return err;
-}
-
 int verify_pack_index(struct packed_git *p)
 {
 	off_t index_size;
@@ -185,19 +59,3 @@ int verify_pack_index(struct packed_git *p)
 			    p->pack_name);
 	return err;
 }
-
-int verify_pack(struct packed_git *p, verify_fn fn,
-		struct progress *progress, uint32_t base_count)
-{
-	int err = 0;
-	struct pack_window *w_curs = NULL;
-
-	err |= verify_pack_index(p);
-	if (!p->index_data)
-		return -1;
-
-	err |= verify_packfile(p, &w_curs, fn, progress, base_count);
-	unuse_pack(&w_curs);
-
-	return err;
-}
diff --git a/pack.h b/pack.h
index 34a9d458b4..0d346c5e31 100644
--- a/pack.h
+++ b/pack.h
@@ -80,7 +80,6 @@ typedef int (*verify_fn)(const struct object_id *, enum object_type, unsigned lo
 extern const char *write_idx_file(const char *index_name, struct pack_idx_entry **objects, int nr_objects, const struct pack_idx_option *, const unsigned char *sha1);
 extern int check_pack_crc(struct packed_git *p, struct pack_window **w_curs, off_t offset, off_t len, unsigned int nr);
 extern int verify_pack_index(struct packed_git *);
-extern int verify_pack(struct packed_git *, verify_fn fn, struct progress *, uint32_t);
 extern off_t write_pack_header(struct hashfile *f, uint32_t);
 extern void fixup_pack_header_footer(int, unsigned char *, const char *, uint32_t, unsigned char *, off_t);
 extern char *index_pack_lockfile(int fd);

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: non-smooth progress  indication for git fsck and git gc
  2018-09-02  7:55         ` Jeff King
@ 2018-09-02  8:55           ` Jeff King
  2018-09-03 16:48             ` Ævar Arnfjörð Bjarmason
  2018-09-04 15:53           ` Junio C Hamano
  1 sibling, 1 reply; 25+ messages in thread
From: Jeff King @ 2018-09-02  8:55 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Ulrich Windl, git

On Sun, Sep 02, 2018 at 03:55:28AM -0400, Jeff King wrote:

> I still think the more interesting long-term thing here is to reuse the
> pack verification from index-pack, which actually hashes as it does the
> per-object countup.
> 
> That code isn't lib-ified enough to be run in process, but I think the
> patch below should give similar behavior to what fsck currently does.
> We'd need to tell index-pack to use our fsck.* config for its checks, I
> imagine. The progress here is still per-pack, but I think we could pass
> in sufficient information to have it do one continuous meter across all
> of the packs (see the in-code comment).
> 
> And it makes the result multi-threaded, and lets us drop a bunch of
> duplicate code.

Here's a little polish on that to pass enough progress data to
index-pack to let it have one nice continuous meter. I'm not sure if
it's worth all the hackery or not. The dual-meter that index-pack
generally uses is actually more informative (since it meters the initial
pass through the pack and the delta reconstruction separately).

And there are definitely a few nasty bits (like the way the progress is
ended). I'm not planning on taking this further for now, but maybe
you or somebody can find it interesting or useful.

---
 builtin/fsck.c       |  59 +++++++++------
 builtin/index-pack.c |  43 ++++++++++-
 pack-check.c         | 142 -----------------------------------
 pack.h               |   1 -
 4 files changed, 75 insertions(+), 170 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 250f5af118..2d774ea2e5 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -386,25 +386,6 @@ static int fsck_obj(struct object *obj, void *buffer, unsigned long size)
 	return err;
 }
 
-static int fsck_obj_buffer(const struct object_id *oid, enum object_type type,
-			   unsigned long size, void *buffer, int *eaten)
-{
-	/*
-	 * Note, buffer may be NULL if type is OBJ_BLOB. See
-	 * verify_packfile(), data_valid variable for details.
-	 */
-	struct object *obj;
-	obj = parse_object_buffer(the_repository, oid, type, size, buffer,
-				  eaten);
-	if (!obj) {
-		errors_found |= ERROR_OBJECT;
-		return error("%s: object corrupt or missing", oid_to_hex(oid));
-	}
-	obj->flags &= ~(REACHABLE | SEEN);
-	obj->flags |= HAS_OBJ;
-	return fsck_obj(obj, buffer, size);
-}
-
 static int default_refs;
 
 static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
@@ -662,6 +643,32 @@ static int mark_packed_for_connectivity(const struct object_id *oid,
 	return 0;
 }
 
+static int verify_pack(struct packed_git *p,
+		       unsigned long count, unsigned long total)
+{
+	struct child_process index_pack = CHILD_PROCESS_INIT;
+
+	if (!is_pack_valid(p))
+		return -1;
+	for_each_object_in_pack(p, mark_packed_for_connectivity, NULL, 0);
+
+	index_pack.git_cmd = 1;
+	argv_array_pushl(&index_pack.args,
+			 "index-pack",
+			 "--verify-fsck",
+			 NULL);
+	if (show_progress)
+		argv_array_pushf(&index_pack.args,
+				 "--fsck-progress=%lu,%lu,Checking pack %s",
+				 count, total, sha1_to_hex(p->sha1));
+	argv_array_push(&index_pack.args, p->pack_name);
+
+	if (run_command(&index_pack))
+		return -1;
+
+	return 0;
+}
+
 static char const * const fsck_usage[] = {
 	N_("git fsck [<options>] [<object>...]"),
 	NULL
@@ -737,7 +744,6 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 		if (check_full) {
 			struct packed_git *p;
 			uint32_t total = 0, count = 0;
-			struct progress *progress = NULL;
 
 			if (show_progress) {
 				for (p = get_packed_git(the_repository); p;
@@ -746,18 +752,21 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 						continue;
 					total += p->num_objects;
 				}
-
-				progress = start_progress(_("Checking objects"), total);
 			}
 			for (p = get_packed_git(the_repository); p;
 			     p = p->next) {
 				/* verify gives error messages itself */
-				if (verify_pack(p, fsck_obj_buffer,
-						progress, count))
+				if (verify_pack(p, count, total))
 					errors_found |= ERROR_PACK;
 				count += p->num_objects;
 			}
-			stop_progress(&progress);
+			/*
+			 * Our child index-pack never calls stop_progress(),
+			 * which lets each child appear on the same line. Now
+			 * that we've finished all of them, we have to tie that
+			 * off.
+			 */
+			fprintf(stderr, "\n");
 		}
 
 		if (fsck_finish(&fsck_obj_options))
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 9582ead950..d03ec7bb89 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -85,6 +85,13 @@ static int show_resolving_progress;
 static int show_stat;
 static int check_self_contained_and_connected;
 
+static int verify_fsck_mode;
+/* unlike our normal 2-part progress, this counts up only total objects */
+struct progress *fsck_progress;
+static uint32_t fsck_progress_cur;
+static uint32_t fsck_progress_total;
+static const char *fsck_progress_title;
+
 static struct progress *progress;
 
 /* We always read in 4kB chunks. */
@@ -1100,6 +1107,8 @@ static void *threaded_second_pass(void *data)
 		int i;
 		counter_lock();
 		display_progress(progress, nr_resolved_deltas);
+		display_progress(fsck_progress,
+				 fsck_progress_cur + nr_resolved_deltas);
 		counter_unlock();
 		work_lock();
 		while (nr_dispatched < nr_objects &&
@@ -1145,18 +1154,23 @@ static void parse_pack_objects(unsigned char *hash)
 			nr_ofs_deltas++;
 			ofs_delta->obj_no = i;
 			ofs_delta++;
+			/* no fsck_progress; we only count it when we've resolved the delta */
 		} else if (obj->type == OBJ_REF_DELTA) {
 			ALLOC_GROW(ref_deltas, nr_ref_deltas + 1, ref_deltas_alloc);
 			oidcpy(&ref_deltas[nr_ref_deltas].oid, &ref_delta_oid);
 			ref_deltas[nr_ref_deltas].obj_no = i;
 			nr_ref_deltas++;
+			/* no fsck_progress; we only count it when we've resolved the delta */
 		} else if (!data) {
 			/* large blobs, check later */
 			obj->real_type = OBJ_BAD;
 			nr_delays++;
-		} else
+			display_progress(fsck_progress, ++fsck_progress_cur);
+		} else {
 			sha1_object(data, NULL, obj->size, obj->type,
 				    &obj->idx.oid);
+			display_progress(fsck_progress, ++fsck_progress_cur);
+		}
 		free(data);
 		display_progress(progress, i+1);
 	}
@@ -1238,6 +1252,8 @@ static void resolve_deltas(void)
 			continue;
 		resolve_base(obj);
 		display_progress(progress, nr_resolved_deltas);
+		display_progress(fsck_progress,
+				 fsck_progress_cur + nr_resolved_deltas);
 	}
 }
 
@@ -1391,6 +1407,8 @@ static void fix_unresolved_deltas(struct hashfile *f)
 					base_obj->data, base_obj->size, type);
 		find_unresolved_deltas(base_obj);
 		display_progress(progress, nr_resolved_deltas);
+		display_progress(fsck_progress,
+				 fsck_progress_cur + nr_resolved_deltas);
 	}
 	free(sorted_by_pos);
 }
@@ -1714,6 +1732,11 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 				verify = 1;
 				show_stat = 1;
 				stat_only = 1;
+			} else if (!strcmp(arg, "--verify-fsck")) {
+				verify_fsck_mode = 1;
+				verify = 1;
+				strict = 1;
+				do_fsck_object = 1;
 			} else if (skip_to_optional_arg(arg, "--keep", &keep_msg)) {
 				; /* nothing to do */
 			} else if (skip_to_optional_arg(arg, "--promisor", &promisor_msg)) {
@@ -1746,6 +1769,15 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 				verbose = 1;
 			} else if (!strcmp(arg, "--show-resolving-progress")) {
 				show_resolving_progress = 1;
+			} else if (skip_prefix(arg, "--fsck-progress=", &arg)) {
+				char *end;
+				fsck_progress_cur = strtoul(arg, &end, 10);
+				if (*end++ != ',')
+					die("bad fsck-progress arg: %s", arg);
+				fsck_progress_total = strtoul(end, &end, 10);
+				if (*end++ != ',')
+					die("bad fsck-progress arg: %s", arg);
+				fsck_progress_title = end;
 			} else if (!strcmp(arg, "--report-end-of-input")) {
 				report_end_of_input = 1;
 			} else if (!strcmp(arg, "-o")) {
@@ -1800,6 +1832,10 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	}
 #endif
 
+	if (fsck_progress_total)
+		fsck_progress = start_progress(fsck_progress_title,
+					       fsck_progress_total);
+
 	curr_pack = open_pack_file(pack_name);
 	parse_pack_header();
 	objects = xcalloc(st_add(nr_objects, 1), sizeof(struct object_entry));
@@ -1833,8 +1869,11 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	else
 		close(input_fd);
 
-	if (do_fsck_object && fsck_finish(&fsck_options))
+	if (do_fsck_object && fsck_finish(&fsck_options)) {
+		if (verify_fsck_mode)
+			return 1; /* quietly return error */
 		die(_("fsck error in pack objects"));
+	}
 
 	free(objects);
 	strbuf_release(&index_name_buf);
diff --git a/pack-check.c b/pack-check.c
index d3a57df34f..ea1457ce53 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -15,17 +15,6 @@ struct idx_entry {
 	unsigned int nr;
 };
 
-static int compare_entries(const void *e1, const void *e2)
-{
-	const struct idx_entry *entry1 = e1;
-	const struct idx_entry *entry2 = e2;
-	if (entry1->offset < entry2->offset)
-		return -1;
-	if (entry1->offset > entry2->offset)
-		return 1;
-	return 0;
-}
-
 int check_pack_crc(struct packed_git *p, struct pack_window **w_curs,
 		   off_t offset, off_t len, unsigned int nr)
 {
@@ -48,121 +37,6 @@ int check_pack_crc(struct packed_git *p, struct pack_window **w_curs,
 	return data_crc != ntohl(*index_crc);
 }
 
-static int verify_packfile(struct packed_git *p,
-			   struct pack_window **w_curs,
-			   verify_fn fn,
-			   struct progress *progress, uint32_t base_count)
-
-{
-	off_t index_size = p->index_size;
-	const unsigned char *index_base = p->index_data;
-	git_hash_ctx ctx;
-	unsigned char hash[GIT_MAX_RAWSZ], *pack_sig;
-	off_t offset = 0, pack_sig_ofs = 0;
-	uint32_t nr_objects, i;
-	int err = 0;
-	struct idx_entry *entries;
-
-	if (!is_pack_valid(p))
-		return error("packfile %s cannot be accessed", p->pack_name);
-
-	the_hash_algo->init_fn(&ctx);
-	do {
-		unsigned long remaining;
-		unsigned char *in = use_pack(p, w_curs, offset, &remaining);
-		offset += remaining;
-		if (!pack_sig_ofs)
-			pack_sig_ofs = p->pack_size - the_hash_algo->rawsz;
-		if (offset > pack_sig_ofs)
-			remaining -= (unsigned int)(offset - pack_sig_ofs);
-		the_hash_algo->update_fn(&ctx, in, remaining);
-	} while (offset < pack_sig_ofs);
-	the_hash_algo->final_fn(hash, &ctx);
-	pack_sig = use_pack(p, w_curs, pack_sig_ofs, NULL);
-	if (hashcmp(hash, pack_sig))
-		err = error("%s pack checksum mismatch",
-			    p->pack_name);
-	if (hashcmp(index_base + index_size - the_hash_algo->hexsz, pack_sig))
-		err = error("%s pack checksum does not match its index",
-			    p->pack_name);
-	unuse_pack(w_curs);
-
-	/* Make sure everything reachable from idx is valid.  Since we
-	 * have verified that nr_objects matches between idx and pack,
-	 * we do not do scan-streaming check on the pack file.
-	 */
-	nr_objects = p->num_objects;
-	ALLOC_ARRAY(entries, nr_objects + 1);
-	entries[nr_objects].offset = pack_sig_ofs;
-	/* first sort entries by pack offset, since unpacking them is more efficient that way */
-	for (i = 0; i < nr_objects; i++) {
-		entries[i].oid.hash = nth_packed_object_sha1(p, i);
-		if (!entries[i].oid.hash)
-			die("internal error pack-check nth-packed-object");
-		entries[i].offset = nth_packed_object_offset(p, i);
-		entries[i].nr = i;
-	}
-	QSORT(entries, nr_objects, compare_entries);
-
-	for (i = 0; i < nr_objects; i++) {
-		void *data;
-		enum object_type type;
-		unsigned long size;
-		off_t curpos;
-		int data_valid;
-
-		if (p->index_version > 1) {
-			off_t offset = entries[i].offset;
-			off_t len = entries[i+1].offset - offset;
-			unsigned int nr = entries[i].nr;
-			if (check_pack_crc(p, w_curs, offset, len, nr))
-				err = error("index CRC mismatch for object %s "
-					    "from %s at offset %"PRIuMAX"",
-					    oid_to_hex(entries[i].oid.oid),
-					    p->pack_name, (uintmax_t)offset);
-		}
-
-		curpos = entries[i].offset;
-		type = unpack_object_header(p, w_curs, &curpos, &size);
-		unuse_pack(w_curs);
-
-		if (type == OBJ_BLOB && big_file_threshold <= size) {
-			/*
-			 * Let check_object_signature() check it with
-			 * the streaming interface; no point slurping
-			 * the data in-core only to discard.
-			 */
-			data = NULL;
-			data_valid = 0;
-		} else {
-			data = unpack_entry(the_repository, p, entries[i].offset, &type, &size);
-			data_valid = 1;
-		}
-
-		if (data_valid && !data)
-			err = error("cannot unpack %s from %s at offset %"PRIuMAX"",
-				    oid_to_hex(entries[i].oid.oid), p->pack_name,
-				    (uintmax_t)entries[i].offset);
-		else if (check_object_signature(entries[i].oid.oid, data, size, type_name(type)))
-			err = error("packed %s from %s is corrupt",
-				    oid_to_hex(entries[i].oid.oid), p->pack_name);
-		else if (fn) {
-			int eaten = 0;
-			err |= fn(entries[i].oid.oid, type, size, data, &eaten);
-			if (eaten)
-				data = NULL;
-		}
-		if (((base_count + i) & 1023) == 0)
-			display_progress(progress, base_count + i);
-		free(data);
-
-	}
-	display_progress(progress, base_count + i);
-	free(entries);
-
-	return err;
-}
-
 int verify_pack_index(struct packed_git *p)
 {
 	off_t index_size;
@@ -185,19 +59,3 @@ int verify_pack_index(struct packed_git *p)
 			    p->pack_name);
 	return err;
 }
-
-int verify_pack(struct packed_git *p, verify_fn fn,
-		struct progress *progress, uint32_t base_count)
-{
-	int err = 0;
-	struct pack_window *w_curs = NULL;
-
-	err |= verify_pack_index(p);
-	if (!p->index_data)
-		return -1;
-
-	err |= verify_packfile(p, &w_curs, fn, progress, base_count);
-	unuse_pack(&w_curs);
-
-	return err;
-}
diff --git a/pack.h b/pack.h
index 34a9d458b4..0d346c5e31 100644
--- a/pack.h
+++ b/pack.h
@@ -80,7 +80,6 @@ typedef int (*verify_fn)(const struct object_id *, enum object_type, unsigned lo
 extern const char *write_idx_file(const char *index_name, struct pack_idx_entry **objects, int nr_objects, const struct pack_idx_option *, const unsigned char *sha1);
 extern int check_pack_crc(struct packed_git *p, struct pack_window **w_curs, off_t offset, off_t len, unsigned int nr);
 extern int verify_pack_index(struct packed_git *);
-extern int verify_pack(struct packed_git *, verify_fn fn, struct progress *, uint32_t);
 extern off_t write_pack_header(struct hashfile *f, uint32_t);
 extern void fixup_pack_header_footer(int, unsigned char *, const char *, uint32_t, unsigned char *, off_t);
 extern char *index_pack_lockfile(int fd);

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: non-smooth progress  indication for git fsck and git gc
  2018-09-02  8:55           ` Jeff King
@ 2018-09-03 16:48             ` Ævar Arnfjörð Bjarmason
  2018-09-07  3:30               ` Jeff King
  0 siblings, 1 reply; 25+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-09-03 16:48 UTC (permalink / raw)
  To: Jeff King; +Cc: Ulrich Windl, git


On Sun, Sep 02 2018, Jeff King wrote:

> On Sun, Sep 02, 2018 at 03:55:28AM -0400, Jeff King wrote:
>
>> I still think the more interesting long-term thing here is to reuse the
>> pack verification from index-pack, which actually hashes as it does the
>> per-object countup.
>>
>> That code isn't lib-ified enough to be run in process, but I think the
>> patch below should give similar behavior to what fsck currently does.
>> We'd need to tell index-pack to use our fsck.* config for its checks, I
>> imagine. The progress here is still per-pack, but I think we could pass
>> in sufficient information to have it do one continuous meter across all
>> of the packs (see the in-code comment).
>>
>> And it makes the result multi-threaded, and lets us drop a bunch of
>> duplicate code.
>
> Here's a little polish on that to pass enough progress data to
> index-pack to let it have one nice continuous meter. I'm not sure if
> it's worth all the hackery or not. The dual-meter that index-pack
> generally uses is actually more informative (since it meters the initial
> pass through the pack and the delta reconstruction separately).

Thanks!

> And there are definitely a few nasty bits (like the way the progress is
> ended). I'm not planning on taking this further for now, but maybe
> you or somebody can find it interesting or useful.

I think it would be really nice if this were taken further. Using my
perf test in
https://public-inbox.org/git/20180903144928.30691-7-avarab@gmail.com/T/#u
I get these results:

    $ GIT_PERF_LARGE_REPO=/home/aearnfjord/g/linux GIT_PERF_REPEAT_COUNT=5 GIT_PERF_MAKE_OPTS='-j56 CFLAGS="-O3"' ./run HEAD~ HEAD p1450-fsck.sh
    [...]
    Test           HEAD~                 HEAD
    ----------------------------------------------------------------
    1450.1: fsck   384.18(381.63+2.53)   301.52(508.28+38.34) -21.5%


I.e. this gives a 20% speedup. Of course some of that might be because
this skips too much work somewhere, but it looks really promising.

I don't know if I'll have time for it, and besides, you wrote the code
so you already understand it *hint* *hint* :)

> ---
>  builtin/fsck.c       |  59 +++++++++------
>  builtin/index-pack.c |  43 ++++++++++-
>  pack-check.c         | 142 -----------------------------------
>  pack.h               |   1 -
>  4 files changed, 75 insertions(+), 170 deletions(-)
>
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index 250f5af118..2d774ea2e5 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -386,25 +386,6 @@ static int fsck_obj(struct object *obj, void *buffer, unsigned long size)
>  	return err;
>  }
>
> -static int fsck_obj_buffer(const struct object_id *oid, enum object_type type,
> -			   unsigned long size, void *buffer, int *eaten)
> -{
> -	/*
> -	 * Note, buffer may be NULL if type is OBJ_BLOB. See
> -	 * verify_packfile(), data_valid variable for details.
> -	 */
> -	struct object *obj;
> -	obj = parse_object_buffer(the_repository, oid, type, size, buffer,
> -				  eaten);
> -	if (!obj) {
> -		errors_found |= ERROR_OBJECT;
> -		return error("%s: object corrupt or missing", oid_to_hex(oid));
> -	}
> -	obj->flags &= ~(REACHABLE | SEEN);
> -	obj->flags |= HAS_OBJ;
> -	return fsck_obj(obj, buffer, size);
> -}
> -
>  static int default_refs;
>
>  static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
> @@ -662,6 +643,32 @@ static int mark_packed_for_connectivity(const struct object_id *oid,
>  	return 0;
>  }
>
> +static int verify_pack(struct packed_git *p,
> +		       unsigned long count, unsigned long total)
> +{
> +	struct child_process index_pack = CHILD_PROCESS_INIT;
> +
> +	if (!is_pack_valid(p))
> +		return -1;
> +	for_each_object_in_pack(p, mark_packed_for_connectivity, NULL, 0);
> +
> +	index_pack.git_cmd = 1;
> +	argv_array_pushl(&index_pack.args,
> +			 "index-pack",
> +			 "--verify-fsck",
> +			 NULL);
> +	if (show_progress)
> +		argv_array_pushf(&index_pack.args,
> +				 "--fsck-progress=%lu,%lu,Checking pack %s",
> +				 count, total, sha1_to_hex(p->sha1));
> +	argv_array_push(&index_pack.args, p->pack_name);
> +
> +	if (run_command(&index_pack))
> +		return -1;
> +
> +	return 0;
> +}
> +
>  static char const * const fsck_usage[] = {
>  	N_("git fsck [<options>] [<object>...]"),
>  	NULL
> @@ -737,7 +744,6 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>  		if (check_full) {
>  			struct packed_git *p;
>  			uint32_t total = 0, count = 0;
> -			struct progress *progress = NULL;
>
>  			if (show_progress) {
>  				for (p = get_packed_git(the_repository); p;
> @@ -746,18 +752,21 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>  						continue;
>  					total += p->num_objects;
>  				}
> -
> -				progress = start_progress(_("Checking objects"), total);
>  			}
>  			for (p = get_packed_git(the_repository); p;
>  			     p = p->next) {
>  				/* verify gives error messages itself */
> -				if (verify_pack(p, fsck_obj_buffer,
> -						progress, count))
> +				if (verify_pack(p, count, total))
>  					errors_found |= ERROR_PACK;
>  				count += p->num_objects;
>  			}
> -			stop_progress(&progress);
> +			/*
> +			 * Our child index-pack never calls stop_progress(),
> +			 * which lets each child appear on the same line. Now
> +			 * that we've finished all of them, we have to tie that
> +			 * off.
> +			 */
> +			fprintf(stderr, "\n");
>  		}
>
>  		if (fsck_finish(&fsck_obj_options))
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index 9582ead950..d03ec7bb89 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -85,6 +85,13 @@ static int show_resolving_progress;
>  static int show_stat;
>  static int check_self_contained_and_connected;
>
> +static int verify_fsck_mode;
> +/* unlike our normal 2-part progress, this counts up only total objects */
> +struct progress *fsck_progress;
> +static uint32_t fsck_progress_cur;
> +static uint32_t fsck_progress_total;
> +static const char *fsck_progress_title;
> +
>  static struct progress *progress;
>
>  /* We always read in 4kB chunks. */
> @@ -1100,6 +1107,8 @@ static void *threaded_second_pass(void *data)
>  		int i;
>  		counter_lock();
>  		display_progress(progress, nr_resolved_deltas);
> +		display_progress(fsck_progress,
> +				 fsck_progress_cur + nr_resolved_deltas);
>  		counter_unlock();
>  		work_lock();
>  		while (nr_dispatched < nr_objects &&
> @@ -1145,18 +1154,23 @@ static void parse_pack_objects(unsigned char *hash)
>  			nr_ofs_deltas++;
>  			ofs_delta->obj_no = i;
>  			ofs_delta++;
> +			/* no fsck_progress; we only count it when we've resolved the delta */
>  		} else if (obj->type == OBJ_REF_DELTA) {
>  			ALLOC_GROW(ref_deltas, nr_ref_deltas + 1, ref_deltas_alloc);
>  			oidcpy(&ref_deltas[nr_ref_deltas].oid, &ref_delta_oid);
>  			ref_deltas[nr_ref_deltas].obj_no = i;
>  			nr_ref_deltas++;
> +			/* no fsck_progress; we only count it when we've resolved the delta */
>  		} else if (!data) {
>  			/* large blobs, check later */
>  			obj->real_type = OBJ_BAD;
>  			nr_delays++;
> -		} else
> +			display_progress(fsck_progress, ++fsck_progress_cur);
> +		} else {
>  			sha1_object(data, NULL, obj->size, obj->type,
>  				    &obj->idx.oid);
> +			display_progress(fsck_progress, ++fsck_progress_cur);
> +		}
>  		free(data);
>  		display_progress(progress, i+1);
>  	}
> @@ -1238,6 +1252,8 @@ static void resolve_deltas(void)
>  			continue;
>  		resolve_base(obj);
>  		display_progress(progress, nr_resolved_deltas);
> +		display_progress(fsck_progress,
> +				 fsck_progress_cur + nr_resolved_deltas);
>  	}
>  }
>
> @@ -1391,6 +1407,8 @@ static void fix_unresolved_deltas(struct hashfile *f)
>  					base_obj->data, base_obj->size, type);
>  		find_unresolved_deltas(base_obj);
>  		display_progress(progress, nr_resolved_deltas);
> +		display_progress(fsck_progress,
> +				 fsck_progress_cur + nr_resolved_deltas);
>  	}
>  	free(sorted_by_pos);
>  }
> @@ -1714,6 +1732,11 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
>  				verify = 1;
>  				show_stat = 1;
>  				stat_only = 1;
> +			} else if (!strcmp(arg, "--verify-fsck")) {
> +				verify_fsck_mode = 1;
> +				verify = 1;
> +				strict = 1;
> +				do_fsck_object = 1;
>  			} else if (skip_to_optional_arg(arg, "--keep", &keep_msg)) {
>  				; /* nothing to do */
>  			} else if (skip_to_optional_arg(arg, "--promisor", &promisor_msg)) {
> @@ -1746,6 +1769,15 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
>  				verbose = 1;
>  			} else if (!strcmp(arg, "--show-resolving-progress")) {
>  				show_resolving_progress = 1;
> +			} else if (skip_prefix(arg, "--fsck-progress=", &arg)) {
> +				char *end;
> +				fsck_progress_cur = strtoul(arg, &end, 10);
> +				if (*end++ != ',')
> +					die("bad fsck-progress arg: %s", arg);
> +				fsck_progress_total = strtoul(end, &end, 10);
> +				if (*end++ != ',')
> +					die("bad fsck-progress arg: %s", arg);
> +				fsck_progress_title = end;
>  			} else if (!strcmp(arg, "--report-end-of-input")) {
>  				report_end_of_input = 1;
>  			} else if (!strcmp(arg, "-o")) {
> @@ -1800,6 +1832,10 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
>  	}
>  #endif
>
> +	if (fsck_progress_total)
> +		fsck_progress = start_progress(fsck_progress_title,
> +					       fsck_progress_total);
> +
>  	curr_pack = open_pack_file(pack_name);
>  	parse_pack_header();
>  	objects = xcalloc(st_add(nr_objects, 1), sizeof(struct object_entry));
> @@ -1833,8 +1869,11 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
>  	else
>  		close(input_fd);
>
> -	if (do_fsck_object && fsck_finish(&fsck_options))
> +	if (do_fsck_object && fsck_finish(&fsck_options)) {
> +		if (verify_fsck_mode)
> +			return 1; /* quietly return error */
>  		die(_("fsck error in pack objects"));
> +	}
>
>  	free(objects);
>  	strbuf_release(&index_name_buf);
> diff --git a/pack-check.c b/pack-check.c
> index d3a57df34f..ea1457ce53 100644
> --- a/pack-check.c
> +++ b/pack-check.c
> @@ -15,17 +15,6 @@ struct idx_entry {
>  	unsigned int nr;
>  };
>
> -static int compare_entries(const void *e1, const void *e2)
> -{
> -	const struct idx_entry *entry1 = e1;
> -	const struct idx_entry *entry2 = e2;
> -	if (entry1->offset < entry2->offset)
> -		return -1;
> -	if (entry1->offset > entry2->offset)
> -		return 1;
> -	return 0;
> -}
> -
>  int check_pack_crc(struct packed_git *p, struct pack_window **w_curs,
>  		   off_t offset, off_t len, unsigned int nr)
>  {
> @@ -48,121 +37,6 @@ int check_pack_crc(struct packed_git *p, struct pack_window **w_curs,
>  	return data_crc != ntohl(*index_crc);
>  }
>
> -static int verify_packfile(struct packed_git *p,
> -			   struct pack_window **w_curs,
> -			   verify_fn fn,
> -			   struct progress *progress, uint32_t base_count)
> -
> -{
> -	off_t index_size = p->index_size;
> -	const unsigned char *index_base = p->index_data;
> -	git_hash_ctx ctx;
> -	unsigned char hash[GIT_MAX_RAWSZ], *pack_sig;
> -	off_t offset = 0, pack_sig_ofs = 0;
> -	uint32_t nr_objects, i;
> -	int err = 0;
> -	struct idx_entry *entries;
> -
> -	if (!is_pack_valid(p))
> -		return error("packfile %s cannot be accessed", p->pack_name);
> -
> -	the_hash_algo->init_fn(&ctx);
> -	do {
> -		unsigned long remaining;
> -		unsigned char *in = use_pack(p, w_curs, offset, &remaining);
> -		offset += remaining;
> -		if (!pack_sig_ofs)
> -			pack_sig_ofs = p->pack_size - the_hash_algo->rawsz;
> -		if (offset > pack_sig_ofs)
> -			remaining -= (unsigned int)(offset - pack_sig_ofs);
> -		the_hash_algo->update_fn(&ctx, in, remaining);
> -	} while (offset < pack_sig_ofs);
> -	the_hash_algo->final_fn(hash, &ctx);
> -	pack_sig = use_pack(p, w_curs, pack_sig_ofs, NULL);
> -	if (hashcmp(hash, pack_sig))
> -		err = error("%s pack checksum mismatch",
> -			    p->pack_name);
> -	if (hashcmp(index_base + index_size - the_hash_algo->hexsz, pack_sig))
> -		err = error("%s pack checksum does not match its index",
> -			    p->pack_name);
> -	unuse_pack(w_curs);
> -
> -	/* Make sure everything reachable from idx is valid.  Since we
> -	 * have verified that nr_objects matches between idx and pack,
> -	 * we do not do scan-streaming check on the pack file.
> -	 */
> -	nr_objects = p->num_objects;
> -	ALLOC_ARRAY(entries, nr_objects + 1);
> -	entries[nr_objects].offset = pack_sig_ofs;
> -	/* first sort entries by pack offset, since unpacking them is more efficient that way */
> -	for (i = 0; i < nr_objects; i++) {
> -		entries[i].oid.hash = nth_packed_object_sha1(p, i);
> -		if (!entries[i].oid.hash)
> -			die("internal error pack-check nth-packed-object");
> -		entries[i].offset = nth_packed_object_offset(p, i);
> -		entries[i].nr = i;
> -	}
> -	QSORT(entries, nr_objects, compare_entries);
> -
> -	for (i = 0; i < nr_objects; i++) {
> -		void *data;
> -		enum object_type type;
> -		unsigned long size;
> -		off_t curpos;
> -		int data_valid;
> -
> -		if (p->index_version > 1) {
> -			off_t offset = entries[i].offset;
> -			off_t len = entries[i+1].offset - offset;
> -			unsigned int nr = entries[i].nr;
> -			if (check_pack_crc(p, w_curs, offset, len, nr))
> -				err = error("index CRC mismatch for object %s "
> -					    "from %s at offset %"PRIuMAX"",
> -					    oid_to_hex(entries[i].oid.oid),
> -					    p->pack_name, (uintmax_t)offset);
> -		}
> -
> -		curpos = entries[i].offset;
> -		type = unpack_object_header(p, w_curs, &curpos, &size);
> -		unuse_pack(w_curs);
> -
> -		if (type == OBJ_BLOB && big_file_threshold <= size) {
> -			/*
> -			 * Let check_object_signature() check it with
> -			 * the streaming interface; no point slurping
> -			 * the data in-core only to discard.
> -			 */
> -			data = NULL;
> -			data_valid = 0;
> -		} else {
> -			data = unpack_entry(the_repository, p, entries[i].offset, &type, &size);
> -			data_valid = 1;
> -		}
> -
> -		if (data_valid && !data)
> -			err = error("cannot unpack %s from %s at offset %"PRIuMAX"",
> -				    oid_to_hex(entries[i].oid.oid), p->pack_name,
> -				    (uintmax_t)entries[i].offset);
> -		else if (check_object_signature(entries[i].oid.oid, data, size, type_name(type)))
> -			err = error("packed %s from %s is corrupt",
> -				    oid_to_hex(entries[i].oid.oid), p->pack_name);
> -		else if (fn) {
> -			int eaten = 0;
> -			err |= fn(entries[i].oid.oid, type, size, data, &eaten);
> -			if (eaten)
> -				data = NULL;
> -		}
> -		if (((base_count + i) & 1023) == 0)
> -			display_progress(progress, base_count + i);
> -		free(data);
> -
> -	}
> -	display_progress(progress, base_count + i);
> -	free(entries);
> -
> -	return err;
> -}
> -
>  int verify_pack_index(struct packed_git *p)
>  {
>  	off_t index_size;
> @@ -185,19 +59,3 @@ int verify_pack_index(struct packed_git *p)
>  			    p->pack_name);
>  	return err;
>  }
> -
> -int verify_pack(struct packed_git *p, verify_fn fn,
> -		struct progress *progress, uint32_t base_count)
> -{
> -	int err = 0;
> -	struct pack_window *w_curs = NULL;
> -
> -	err |= verify_pack_index(p);
> -	if (!p->index_data)
> -		return -1;
> -
> -	err |= verify_packfile(p, &w_curs, fn, progress, base_count);
> -	unuse_pack(&w_curs);
> -
> -	return err;
> -}
> diff --git a/pack.h b/pack.h
> index 34a9d458b4..0d346c5e31 100644
> --- a/pack.h
> +++ b/pack.h
> @@ -80,7 +80,6 @@ typedef int (*verify_fn)(const struct object_id *, enum object_type, unsigned lo
>  extern const char *write_idx_file(const char *index_name, struct pack_idx_entry **objects, int nr_objects, const struct pack_idx_option *, const unsigned char *sha1);
>  extern int check_pack_crc(struct packed_git *p, struct pack_window **w_curs, off_t offset, off_t len, unsigned int nr);
>  extern int verify_pack_index(struct packed_git *);
> -extern int verify_pack(struct packed_git *, verify_fn fn, struct progress *, uint32_t);
>  extern off_t write_pack_header(struct hashfile *f, uint32_t);
>  extern void fixup_pack_header_footer(int, unsigned char *, const char *, uint32_t, unsigned char *, off_t);
>  extern char *index_pack_lockfile(int fd);

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: non-smooth progress  indication for git fsck and git gc
  2018-09-02  7:55         ` Jeff King
  2018-09-02  8:55           ` Jeff King
@ 2018-09-04 15:53           ` Junio C Hamano
  1 sibling, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2018-09-04 15:53 UTC (permalink / raw)
  To: Jeff King; +Cc: Ævar Arnfjörð Bjarmason, Ulrich Windl, git

Jeff King <peff@peff.net> writes:

> That code isn't lib-ified enough to be run in process, but I think the
> patch below should give similar behavior to what fsck currently does.
> We'd need to tell index-pack to use our fsck.* config for its checks, I
> imagine. The progress here is still per-pack, but I think we could pass
> in sufficient information to have it do one continuous meter across all
> of the packs (see the in-code comment).
>
> And it makes the result multi-threaded, and lets us drop a bunch of
> duplicate code.
>
> ---
>  builtin/fsck.c |  53 +++++++------
>  pack-check.c   | 142 -----------------------------------
>  pack.h         |   1 -
>  3 files changed, 32 insertions(+), 164 deletions(-)

The numbers here are nice, even to readers who do not necessarily
care about the progress meter output ;-)


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: non-smooth progress  indication for git fsck and git gc
  2018-09-03 16:48             ` Ævar Arnfjörð Bjarmason
@ 2018-09-07  3:30               ` Jeff King
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff King @ 2018-09-07  3:30 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Ulrich Windl, git

On Mon, Sep 03, 2018 at 06:48:54PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > And there are definitely a few nasty bits (like the way the progress is
> > ended). I'm not planning on taking this further for now, but maybe
> > you or somebody can find it interesting or useful.
> 
> I think it would be really nice if this were taken further. Using my
> perf test in
> https://public-inbox.org/git/20180903144928.30691-7-avarab@gmail.com/T/#u
> I get these results:
> 
>     $ GIT_PERF_LARGE_REPO=/home/aearnfjord/g/linux GIT_PERF_REPEAT_COUNT=5 GIT_PERF_MAKE_OPTS='-j56 CFLAGS="-O3"' ./run HEAD~ HEAD p1450-fsck.sh
>     [...]
>     Test           HEAD~                 HEAD
>     ----------------------------------------------------------------
>     1450.1: fsck   384.18(381.63+2.53)   301.52(508.28+38.34) -21.5%
> 
> 
> I.e. this gives a ~20% speedup, although of course some of that might
> be because it is skipping work it should not be skipping; still, it
> looks really promising.

I'm pretty sure it's doing the correct thing, in terms of doing all the
right checks. But look at your CPU time. You're getting a 20% wall-clock
speedup, but spending a lot more CPU. So the main difference is really
the multi-threading in index-pack. It should be strictly worse in terms
of total CPU on a single-processor system because we're doing work in
the sub-process (so we pay for the process invocation, but also we
probably are unable to share things like in-memory commit structs,
wasting a little extra time).
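
(Concretely, from the table upthread: HEAD~ spends ~384s of CPU for
~384s of wall-clock, i.e. one busy core, while HEAD spends ~547s of CPU
for ~302s of wall-clock, so roughly 1.8 cores busy on average. One way
to see how much of the win is just the threading would be to re-run the
perf test pinned to a single CPU, e.g. something like

    taskset -c 0 ./run HEAD~ HEAD p1450-fsck.sh

though I haven't tried that.)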

So I'm on the fence about whether it is worth it. I like getting rid of the
duplicated code. But on the other hand it is not all that complex, and
maybe when it comes to things like fsck it is good to have a different
implementation than the one that writes the .idx out in the first place.

-Peff

^ permalink raw reply	[flat|nested] 25+ messages in thread

Thread overview: 25+ messages
2018-08-16  6:54 non-smooth progress indication for git fsck and git gc Ulrich Windl
2018-08-16 15:18 ` Duy Nguyen
2018-08-16 16:05   ` Jeff King
2018-08-20  8:27   ` Antw: " Ulrich Windl
2018-08-16 15:57 ` Jeff King
2018-08-16 20:02   ` Jeff King
2018-08-16 22:10     ` Junio C Hamano
2018-08-16 20:35   ` Ævar Arnfjörð Bjarmason
2018-08-16 20:55     ` Jeff King
2018-08-16 21:06       ` Jeff King
2018-08-17 14:39         ` Duy Nguyen
2018-08-20  8:33       ` Antw: " Ulrich Windl
2018-08-20  8:57         ` Ævar Arnfjörð Bjarmason
2018-08-20  9:37           ` Ulrich Windl
2018-08-21  1:07           ` Jeff King
2018-08-21  6:20             ` Ulrich Windl
2018-08-21 15:21             ` Duy Nguyen
2018-09-01 12:53     ` Ævar Arnfjörð Bjarmason
2018-09-01 13:52       ` Ævar Arnfjörð Bjarmason
2018-09-02  7:46       ` Jeff King
2018-09-02  7:55         ` Jeff King
2018-09-02  8:55           ` Jeff King
2018-09-03 16:48             ` Ævar Arnfjörð Bjarmason
2018-09-07  3:30               ` Jeff King
2018-09-04 15:53           ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git
