git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* approximate_object_count_valid never set?
@ 2020-09-17  8:20 Rasmus Villemoes
  2020-09-17 11:58 ` Ævar Arnfjörð Bjarmason
  2020-09-17 12:53 ` Jeff King
  0 siblings, 2 replies; 6+ messages in thread
From: Rasmus Villemoes @ 2020-09-17  8:20 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Jeff King, Stefan Beller, Nguyễn Thái Ngọc Duy

Hi,

While poking around the code, I noticed that it seems
->approximate_object_count_valid is never set to 1, and it never has
been, not even back when it was a global variable. So perhaps it can
just be removed and the logic depending on it simplified? Or am I
missing some preprocessor trickery.

Nobody seems to have noticed the lack of caching - and actually setting
it to 1 after the count has been computed might be a little dangerous
unless one takes care to invalidate the cache anywhere that might be
relevant.

Rasmus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: approximate_object_count_valid never set?
  2020-09-17  8:20 approximate_object_count_valid never set? Rasmus Villemoes
@ 2020-09-17 11:58 ` Ævar Arnfjörð Bjarmason
  2020-09-17 12:53 ` Jeff King
  1 sibling, 0 replies; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2020-09-17 11:58 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Git Mailing List, Jeff King, Stefan Beller,
	Nguyễn Thái Ngọc Duy, Jeff King


On Thu, Sep 17 2020, Rasmus Villemoes wrote:

> Hi,
>
> While poking around the code, I noticed that it seems
> ->approximate_object_count_valid is never set to 1, and it never has
> been, not even back when it was a global variable. So perhaps it can
> just be removed and the logic depending on it simplified? Or am I
> missing some preprocessor trickery.
>
> Nobody seems to have noticed the lack of caching - and actually setting
> it to 1 after the count has been computed might be a little dangerous
> unless one takes care to invalidate the cache anywhere that might be
> relevant.

There's some previous discussion about this in
https://public-inbox.org/git/20180226085508.GA30343@sigill.intra.peff.net/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: approximate_object_count_valid never set?
  2020-09-17  8:20 approximate_object_count_valid never set? Rasmus Villemoes
  2020-09-17 11:58 ` Ævar Arnfjörð Bjarmason
@ 2020-09-17 12:53 ` Jeff King
  2020-09-17 16:47   ` [PATCH] packfile: actually set approximate_object_count_valid Jeff King
  1 sibling, 1 reply; 6+ messages in thread
From: Jeff King @ 2020-09-17 12:53 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Git Mailing List, Stefan Beller,
	Nguyễn Thái Ngọc Duy

On Thu, Sep 17, 2020 at 10:20:03AM +0200, Rasmus Villemoes wrote:

> Hi,
> 
> While poking around the code, I noticed that it seems
> ->approximate_object_count_valid is never set to 1, and it never has
> been, not even back when it was a global variable. So perhaps it can
> just be removed and the logic depending on it simplified? Or am I
> missing some preprocessor trickery.
> 
> Nobody seems to have noticed the lack of caching - and actually setting
> it to 1 after the count has been computed might be a little dangerous
> unless one takes care to invalidate the cache anywhere that might be
> relevant.

We should be able to construct a test where it matters. The main cost
that the flag is overcoming is the iteration through the packs. So we'd
want a lot of packs. And the primary place the function would get called
a lot is when abbreviating commits. So doing:

  for i in $(seq 1000); do
    echo blob
    echo 'data <<EOF'
    echo $i
    echo EOF
    echo checkpoint
  done | git -c transfer.unpacklimit=0 fast-import

will get us a lot of packs. I tried that in linux.git. And then this
should get us a baseline for how much it costs to traverse and print out
object names:

  $ time git rev-list --format=%H HEAD >/dev/null
  real	0m6.636s
  user	0m6.492s
  sys	0m0.144s

And now let's see how long it takes with abbreviation:

  $ time git rev-list --format=%h HEAD  >/dev/null
  real	0m34.518s
  user	0m34.253s
  sys	0m0.264s

Yow. That's a lot. But part of the cost is that we have to look up each
abbreviated hash in each pack to see if it's present there, so we'd
expect it to be a lot more expensive. But let's try it with the caching
flag:

diff --git a/packfile.c b/packfile.c
index 9ef27508f2..e69012e7f2 100644
--- a/packfile.c
+++ b/packfile.c
@@ -923,6 +923,7 @@ unsigned long repo_approximate_object_count(struct repository *r)
 			count += p->num_objects;
 		}
 		r->objects->approximate_object_count = count;
+		r->objects->approximate_object_count_valid = 1;
 	}
 	return r->objects->approximate_object_count;
 }

  $ time git rev-list --format=%h HEAD  >/dev/null
  real	0m29.411s
  user	0m29.150s
  sys	0m0.260s

Still not great, but caching the count did save us 15%. That seems worth
it to me (1000 packs is more than we'd hope for, but not uncommon in a
poorly maintained repo). The failure to set the flag is just a bug;
looks like mine from 8e3f52d778 (find_unique_abbrev: move logic out of
get_short_sha1(), 2016-10-03).

You're right that caching runs the risk of the cache being invalidated.
But in this case I think we're covered. We'd generally modify packed_git
only via reprepare_packed_git(), prepare_packed_git(), and we do reset
the flag there. Plus count is only meant to be approximate, so even if
it ended up stale within a single process, I don't think it would be
that big a deal.

-Peff

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH] packfile: actually set approximate_object_count_valid
  2020-09-17 12:53 ` Jeff King
@ 2020-09-17 16:47   ` Jeff King
  2020-09-17 16:53     ` Taylor Blau
  2020-09-17 18:26     ` Junio C Hamano
  0 siblings, 2 replies; 6+ messages in thread
From: Jeff King @ 2020-09-17 16:47 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Git Mailing List, Stefan Beller,
	Nguyễn Thái Ngọc Duy

On Thu, Sep 17, 2020 at 08:53:33AM -0400, Jeff King wrote:

> Still not great, but caching the count did save us 15%. That seems worth
> it to me (1000 packs is more than we'd hope for, but not uncommon in a
> poorly maintained repo). The failure to set the flag is just a bug;
> looks like mine from 8e3f52d778 (find_unique_abbrev: move logic out of
> get_short_sha1(), 2016-10-03).
> 
> You're right that caching runs the risk of the cache being invalidated.
> But in this case I think we're covered. We'd generally modify packed_git
> only via reprepare_packed_git(), prepare_packed_git(), and we do reset
> the flag there. Plus count is only meant to be approximate, so even if
> it ended up stale within a single process, I don't think it would be
> that big a deal.

So here it is wrapped up as a patch. I think it's worth fixing (as
opposed to dropping the unused flag code). Thanks for finding it.

-- >8 --
Subject: [PATCH] packfile: actually set approximate_object_count_valid

The approximate_object_count() function tries to compute the count only
once per process. But ever since it was introduced in 8e3f52d778
(find_unique_abbrev: move logic out of get_short_sha1(), 2016-10-03), we
failed to actually set the "valid" flag, meaning we'd compute it fresh
on every call.

This turns out not to be _too_ bad, because we're only iterating through
the packed_git list, and not making any system calls. But since it may
get called for every abbreviated hash we output, even this can add up if
you have many packs.

Here are before-and-after timings for a new perf test which just asks
rev-list to abbreviate each commit hash (the test repo is linux.git,
with commit-graphs):

  Test                            origin              HEAD
  ----------------------------------------------------------------------------
  5303.3: rev-list (1)            28.91(28.46+0.44)   29.03(28.65+0.38) +0.4%
  5303.4: abbrev-commit (1)       1.18(1.06+0.11)     1.17(1.02+0.14) -0.8%
  5303.7: rev-list (50)           28.95(28.56+0.38)   29.50(29.17+0.32) +1.9%
  5303.8: abbrev-commit (50)      3.67(3.56+0.10)     3.57(3.42+0.15) -2.7%
  5303.11: rev-list (1000)        30.34(29.89+0.43)   30.82(30.35+0.46) +1.6%
  5303.12: abbrev-commit (1000)   86.82(86.52+0.29)   77.82(77.59+0.22) -10.4%
  5303.15: load 10,000 packs      0.08(0.02+0.05)     0.08(0.02+0.06) +0.0%

It doesn't help at all when we have 1 pack (5303.4), but we get a 10%
speedup when there are 1000 packs (5303.12). That's a modest speedup for
a case that's already slow and we'd hope to avoid in general (note how
slow it is even after, because we have to look in each of those packs
for abbreviations). But it's a one-line change that clearly matches the
original intent, so it seems worth doing.

The included perf test may also be useful for keeping an eye on any
regressions in the overall abbreviation code.

Reported-by: Rasmus Villemoes <rv@rasmusvillemoes.dk>
Signed-off-by: Jeff King <peff@peff.net>
---
 packfile.c                 | 1 +
 t/perf/p5303-many-packs.sh | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/packfile.c b/packfile.c
index 9ef27508f2..e69012e7f2 100644
--- a/packfile.c
+++ b/packfile.c
@@ -923,6 +923,7 @@ unsigned long repo_approximate_object_count(struct repository *r)
 			count += p->num_objects;
 		}
 		r->objects->approximate_object_count = count;
+		r->objects->approximate_object_count_valid = 1;
 	}
 	return r->objects->approximate_object_count;
 }
diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index 7ee791669a..f4c2ab0584 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -73,6 +73,10 @@ do
 		git rev-list --objects --all >/dev/null
 	'
 
+	test_perf "abbrev-commit ($nr_packs)" '
+		git rev-list --abbrev-commit HEAD >/dev/null
+	'
+
 	# This simulates the interesting part of the repack, which is the
 	# actual pack generation, without smudging the on-disk setup
 	# between trials.
-- 
2.28.0.982.gdd163d6eb1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] packfile: actually set approximate_object_count_valid
  2020-09-17 16:47   ` [PATCH] packfile: actually set approximate_object_count_valid Jeff King
@ 2020-09-17 16:53     ` Taylor Blau
  2020-09-17 18:26     ` Junio C Hamano
  1 sibling, 0 replies; 6+ messages in thread
From: Taylor Blau @ 2020-09-17 16:53 UTC (permalink / raw)
  To: Jeff King
  Cc: Rasmus Villemoes, Git Mailing List, Stefan Beller,
	Nguyễn Thái Ngọc Duy

On Thu, Sep 17, 2020 at 12:47:43PM -0400, Jeff King wrote:
> So here it is wrapped up as a patch. I think it's worth fixing (as
> opposed to dropping the unused flag code). Thanks for finding it.

Yup, after reading the patch and performance timings below, I agree that
this is worth fixing and keeping instead of dropping.

> It doesn't help at all when we have 1 pack (5303.4), but we get a 10%
> speedup when there are 1000 packs (5303.12). That's a modest speedup for
> a case that's already slow and we'd hope to avoid in general (note how
> slow it is even after, because we have to look in each of those packs
> for abbreviations). But it's a one-line change that clearly matches the
> original intent, so it seems worth doing.

Excellent.

> The included perf test may also be useful for keeping an eye on any
> regressions in the overall abbreviation code.
>
> Reported-by: Rasmus Villemoes <rv@rasmusvillemoes.dk>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  packfile.c                 | 1 +
>  t/perf/p5303-many-packs.sh | 4 ++++
>  2 files changed, 5 insertions(+)
>
> diff --git a/packfile.c b/packfile.c
> index 9ef27508f2..e69012e7f2 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -923,6 +923,7 @@ unsigned long repo_approximate_object_count(struct repository *r)
>  			count += p->num_objects;
>  		}
>  		r->objects->approximate_object_count = count;
> +		r->objects->approximate_object_count_valid = 1;
>  	}
>  	return r->objects->approximate_object_count;
>  }
> diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
> index 7ee791669a..f4c2ab0584 100755
> --- a/t/perf/p5303-many-packs.sh
> +++ b/t/perf/p5303-many-packs.sh
> @@ -73,6 +73,10 @@ do
>  		git rev-list --objects --all >/dev/null
>  	'
>
> +	test_perf "abbrev-commit ($nr_packs)" '
> +		git rev-list --abbrev-commit HEAD >/dev/null
> +	'
> +
>  	# This simulates the interesting part of the repack, which is the
>  	# actual pack generation, without smudging the on-disk setup
>  	# between trials.
> --
> 2.28.0.982.gdd163d6eb1

Looks all very good to me. Thanks.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] packfile: actually set approximate_object_count_valid
  2020-09-17 16:47   ` [PATCH] packfile: actually set approximate_object_count_valid Jeff King
  2020-09-17 16:53     ` Taylor Blau
@ 2020-09-17 18:26     ` Junio C Hamano
  1 sibling, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2020-09-17 18:26 UTC (permalink / raw)
  To: Jeff King
  Cc: Rasmus Villemoes, Git Mailing List, Stefan Beller,
	Nguyễn Thái Ngọc Duy

Jeff King <peff@peff.net> writes:

> Subject: [PATCH] packfile: actually set approximate_object_count_valid
>
> The approximate_object_count() function tries to compute the count only
> once per process. But ever since it was introduced in 8e3f52d778
> (find_unique_abbrev: move logic out of get_short_sha1(), 2016-10-03), we
> failed to actually set the "valid" flag, meaning we'd compute it fresh
> on every call.
> ...
>   Test                            origin              HEAD
>   ----------------------------------------------------------------------------
>   5303.3: rev-list (1)            28.91(28.46+0.44)   29.03(28.65+0.38) +0.4%
>   5303.4: abbrev-commit (1)       1.18(1.06+0.11)     1.17(1.02+0.14) -0.8%
>   5303.7: rev-list (50)           28.95(28.56+0.38)   29.50(29.17+0.32) +1.9%
>   5303.8: abbrev-commit (50)      3.67(3.56+0.10)     3.57(3.42+0.15) -2.7%
>   5303.11: rev-list (1000)        30.34(29.89+0.43)   30.82(30.35+0.46) +1.6%
>   5303.12: abbrev-commit (1000)   86.82(86.52+0.29)   77.82(77.59+0.22) -10.4%

Yuk, this is quite extreme.

>   5303.15: load 10,000 packs      0.08(0.02+0.05)     0.08(0.02+0.06) +0.0%
>
> It doesn't help at all when we have 1 pack (5303.4), but we get a 10%
> speedup when there are 1000 packs (5303.12). That's a modest speedup for
> a case that's already slow and we'd hope to avoid in general (note how
> slow it is even after, because we have to look in each of those packs
> for abbreviations). But it's a one-line change that clearly matches the
> original intent, so it seems worth doing.
>
> The included perf test may also be useful for keeping an eye on any
> regressions in the overall abbreviation code.
>
> Reported-by: Rasmus Villemoes <rv@rasmusvillemoes.dk>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  packfile.c                 | 1 +
>  t/perf/p5303-many-packs.sh | 4 ++++
>  2 files changed, 5 insertions(+)

Thanks for finding and fixing.  I agree that this is worth doing.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-09-17 18:29 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-17  8:20 approximate_object_count_valid never set? Rasmus Villemoes
2020-09-17 11:58 ` Ævar Arnfjörð Bjarmason
2020-09-17 12:53 ` Jeff King
2020-09-17 16:47   ` [PATCH] packfile: actually set approximate_object_count_valid Jeff King
2020-09-17 16:53     ` Taylor Blau
2020-09-17 18:26     ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).