git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v2 0/1] builtin/pack-objects.c: avoid iterating all refs
@ 2021-01-20 12:45 Jacob Vosmaer
  2021-01-20 12:45 ` [PATCH v2 1/1] " Jacob Vosmaer
  0 siblings, 1 reply; 11+ messages in thread
From: Jacob Vosmaer @ 2021-01-20 12:45 UTC (permalink / raw)
  To: git; +Cc: avarab, peff, me, Jacob Vosmaer

I have updated the patch with more background information in the
commit message, and I have cleaned up the add_ref_tag iterator
callback a bit. I also switched to for_each_tag_ref.

On the previous version of the patch there was some discussion about
whether we can use for_each_tag_ref in combination with peel_ref. It
turns out we can, and we were already doing it in the same file:
search for mark_tagged to see what I mean.

Jacob Vosmaer (1):
  builtin/pack-objects.c: avoid iterating all refs

 builtin/pack-objects.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

-- 
2.30.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-20 12:45 [PATCH v2 0/1] builtin/pack-objects.c: avoid iterating all refs Jacob Vosmaer
@ 2021-01-20 12:45 ` Jacob Vosmaer
  2021-01-20 14:49   ` Taylor Blau
  0 siblings, 1 reply; 11+ messages in thread
From: Jacob Vosmaer @ 2021-01-20 12:45 UTC (permalink / raw)
  To: git; +Cc: avarab, peff, me, Jacob Vosmaer

In git-pack-objects, we iterate over all the tags if the --include-tag
option is passed on the command line. For some reason this uses
for_each_ref which is expensive if the repo has many refs. We should
use for_each_tag_ref instead.

Because the add_ref_tag callback will now only visit tags we
simplified it a bit.

The motivation for this change is that we observed performance issues
with a repository on gitlab.com that has 500,000 refs but only 2,000
tags. The fetch traffic on that repo is dominated by CI, and when we
changed CI to fetch with 'git fetch --no-tags' we saw a dramatic
change in the CPU profile of git-pack-objects. This lead us to this
particular ref walk. More details in:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/746#note_483546598

Signed-off-by: Jacob Vosmaer <jacob@gitlab.com>
---
 builtin/pack-objects.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2a00358f34..ad52c91bdb 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2803,13 +2803,11 @@ static void add_tag_chain(const struct object_id *oid)
 	}
 }
 
-static int add_ref_tag(const char *path, const struct object_id *oid, int flag, void *cb_data)
+static int add_ref_tag(const char *tag, const struct object_id *oid, int flag, void *cb_data)
 {
 	struct object_id peeled;
 
-	if (starts_with(path, "refs/tags/") && /* is a tag? */
-	    !peel_ref(path, &peeled)    && /* peelable? */
-	    obj_is_packed(&peeled)) /* object packed? */
+	if (!peel_ref(tag, &peeled) && obj_is_packed(&peeled))
 		add_tag_chain(oid);
 	return 0;
 }
@@ -3740,7 +3738,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	}
 	cleanup_preferred_base();
 	if (include_tag && nr_result)
-		for_each_ref(add_ref_tag, NULL);
+		for_each_tag_ref(add_ref_tag, NULL);
 	stop_progress(&progress_state);
 	trace2_region_leave("pack-objects", "enumerate-objects",
 			    the_repository);
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-20 12:45 ` [PATCH v2 1/1] " Jacob Vosmaer
@ 2021-01-20 14:49   ` Taylor Blau
  2021-01-20 16:18     ` Jeff King
  0 siblings, 1 reply; 11+ messages in thread
From: Taylor Blau @ 2021-01-20 14:49 UTC (permalink / raw)
  To: Jacob Vosmaer; +Cc: git, avarab, peff, me

Hi Jacob,

On Wed, Jan 20, 2021 at 01:45:14PM +0100, Jacob Vosmaer wrote:
> In git-pack-objects, we iterate over all the tags if the --include-tag
> option is passed on the command line. For some reason this uses
> for_each_ref which is expensive if the repo has many refs. We should
> use for_each_tag_ref instead.

I don't think it's worth sending another version, but I would have liked
to see: "... because we can save time by only iterating over some of the
refs" at the end of this paragraph.

> Because the add_ref_tag callback will now only visit tags we
> simplified it a bit.
>
> The motivation for this change is that we observed performance issues
> with a repository on gitlab.com that has 500,000 refs but only 2,000
> tags. The fetch traffic on that repo is dominated by CI, and when we
> changed CI to fetch with 'git fetch --no-tags' we saw a dramatic
> change in the CPU profile of git-pack-objects. This lead us to this
> particular ref walk. More details in:
> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/746#note_483546598
>
> Signed-off-by: Jacob Vosmaer <jacob@gitlab.com>
> ---
>  builtin/pack-objects.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 2a00358f34..ad52c91bdb 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -2803,13 +2803,11 @@ static void add_tag_chain(const struct object_id *oid)
>  	}
>  }
>
> -static int add_ref_tag(const char *path, const struct object_id *oid, int flag, void *cb_data)
> +static int add_ref_tag(const char *tag, const struct object_id *oid, int flag, void *cb_data)
>  {
>  	struct object_id peeled;
>
> -	if (starts_with(path, "refs/tags/") && /* is a tag? */
> -	    !peel_ref(path, &peeled)    && /* peelable? */
> -	    obj_is_packed(&peeled)) /* object packed? */
> +	if (!peel_ref(tag, &peeled) && obj_is_packed(&peeled))
>  		add_tag_chain(oid);
>  	return 0;
>  }
> @@ -3740,7 +3738,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  	}
>  	cleanup_preferred_base();
>  	if (include_tag && nr_result)
> -		for_each_ref(add_ref_tag, NULL);
> +		for_each_tag_ref(add_ref_tag, NULL);

OK. Seeing another caller (builtin/pack-objects.c:compute_write_order())
that passes a callback to for_each_tag_ref() makes me feel more
comfortable about using it here.

Thanks for investigating and resolving this in a way which cleans up the
surrounding code.

>  	stop_progress(&progress_state);
>  	trace2_region_leave("pack-objects", "enumerate-objects",
>  			    the_repository);
> --
> 2.30.0

This version looks good to me, thanks for digging!

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-20 14:49   ` Taylor Blau
@ 2021-01-20 16:18     ` Jeff King
  2021-01-20 16:19       ` Taylor Blau
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2021-01-20 16:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jacob Vosmaer, git, avarab

On Wed, Jan 20, 2021 at 09:49:20AM -0500, Taylor Blau wrote:

> > @@ -3740,7 +3738,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
> >  	}
> >  	cleanup_preferred_base();
> >  	if (include_tag && nr_result)
> > -		for_each_ref(add_ref_tag, NULL);
> > +		for_each_tag_ref(add_ref_tag, NULL);
> 
> OK. Seeing another caller (builtin/pack-objects.c:compute_write_order())
> that passes a callback to for_each_tag_ref() makes me feel more
> comfortable about using it here.

It actually made me wonder if that call might potentially be
problematic, too. ;)

It is not clear to me from the commit message if anybody actually looked
at how peel_ref() works, and whether there are any gotchas. I don't
think "there is another call like this" is compelling because we might
not notice any problem:

  - checking the ref_iterator's peeled value is an optimization; we'd
    expect to receive the correct answer but slower if it doesn't kick
    in

  - if the fallback code is looking up by refname, would it use the dwim
    rules and find an unqualified "v1.2.3". That would work most of the
    time, but fail if there was an ambiguous ref.

But looking at the implementation, I think the answer is "no". When
iterating over unqualified refs, the iterator stores a pointer to the
unqualified name, and so the optimization does kick in. And even if we
are not looking at a packed-refs value that can return the answer
quickly, we eventually end up in cache_ref_iterator_peel(), which
ignores the refname and directly peels the oid.

If we do not match the ref iterator, then we use read_ref_full(), which
doesn't do any dwim magic. And hence our unqualified refname would fail.
So it's a bit weird to pass such a refname into the function, but it
works because of the optimization magic. And if that ever changed, I
think the test suite would notice, since many peels would fail in that
case.

So I think both the existing and the new calls using for_each_tag_ref()
are OK here.

  As an aside, this peel_ref() interface is just awful, exactly because
  of this confusion. The nicest thing would be for the each_ref_fn
  callback to actually just get the peel information if we have it. But
  that would require changing tons of callbacks all over the code base,
  which is why we've never done it.

  But probably we would just peel by oid, and have the same "is it the
  current iterated ref" magic kick in. It doesn't matter if it's
  _actually_ the same ref or not. If two refs point to the same oid,
  then their peeled values are the same anyway.

  So that might be a nice cleanup, but definitely out of scope for this
  patch.

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-20 16:18     ` Jeff King
@ 2021-01-20 16:19       ` Taylor Blau
  2021-01-20 18:49         ` Jacob Vosmaer
  2021-01-20 19:45         ` Jeff King
  0 siblings, 2 replies; 11+ messages in thread
From: Taylor Blau @ 2021-01-20 16:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, Jacob Vosmaer, git, avarab

On Wed, Jan 20, 2021 at 11:18:11AM -0500, Jeff King wrote:
> So I think both the existing and the new calls using for_each_tag_ref()
> are OK here.

Indeed, I followed the same trail of calls as you did and reached the
same conclusion, but didn't write any of it down here since I thought it
wasn't worthwhile.

But, yes, I agree that both are safe.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-20 16:19       ` Taylor Blau
@ 2021-01-20 18:49         ` Jacob Vosmaer
  2021-01-20 19:45         ` Jeff King
  1 sibling, 0 replies; 11+ messages in thread
From: Jacob Vosmaer @ 2021-01-20 18:49 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jeff King, Git Mailing List, avarab

I also spent time being confused about what is going on, and wondering
if that other ref iteration ever worked. I went as far as inspecting
git-verify-pack -v output to look at object order. :)

On the other hand, through working on this I learned that include-tag
has pretty effective test coverage so if peel_ref didn't work or
stopped working, we'd find out.

If there is a better way to write "for each tag + peel ref" I am happy
to change the patch, just let me know what it should look like.

Best regards,

Jacob Vosmaer
GitLab, Inc.

On Wed, Jan 20, 2021 at 5:19 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Wed, Jan 20, 2021 at 11:18:11AM -0500, Jeff King wrote:
> > So I think both the existing and the new calls using for_each_tag_ref()
> > are OK here.
>
> Indeed, I followed the same trail of calls as you did and reached the
> same conclusion, but didn't write any of it down here since I thought it
> wasn't worthwhile.
>
> But, yes, I agree that both are safe.
>
> Thanks,
> Taylor

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-20 16:19       ` Taylor Blau
  2021-01-20 18:49         ` Jacob Vosmaer
@ 2021-01-20 19:45         ` Jeff King
  2021-01-20 21:46           ` Jacob Vosmaer
  1 sibling, 1 reply; 11+ messages in thread
From: Jeff King @ 2021-01-20 19:45 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jacob Vosmaer, git, avarab

On Wed, Jan 20, 2021 at 11:19:41AM -0500, Taylor Blau wrote:

> On Wed, Jan 20, 2021 at 11:18:11AM -0500, Jeff King wrote:
> > So I think both the existing and the new calls using for_each_tag_ref()
> > are OK here.
> 
> Indeed, I followed the same trail of calls as you did and reached the
> same conclusion, but didn't write any of it down here since I thought it
> wasn't worthwhile.
> 
> But, yes, I agree that both are safe.

OK. I think it's worth saying such things (in case that was not obvious.
;) ).

I also rage-replaced peel_ref() with a function taking an oid so we
never have to do that digging again. Posted separately in its own
thread.

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-20 19:45         ` Jeff King
@ 2021-01-20 21:46           ` Jacob Vosmaer
  2021-01-20 21:52             ` Taylor Blau
  2021-01-21  2:54             ` Jeff King
  0 siblings, 2 replies; 11+ messages in thread
From: Jacob Vosmaer @ 2021-01-20 21:46 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, Git Mailing List, avarab

On Wed, Jan 20, 2021 at 8:45 PM Jeff King <peff@peff.net> wrote:

> I also rage-replaced peel_ref() with a function taking an oid so we
> never have to do that digging again. Posted separately in its own
> thread.

That sounds like a good solution. Where does that leave my patch? Do I
need to wait for that new commit somehow? I don't know how that works,
"contributing to Git" wise.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-20 21:46           ` Jacob Vosmaer
@ 2021-01-20 21:52             ` Taylor Blau
  2021-01-21  2:54             ` Jeff King
  1 sibling, 0 replies; 11+ messages in thread
From: Taylor Blau @ 2021-01-20 21:52 UTC (permalink / raw)
  To: Jacob Vosmaer; +Cc: Jeff King, Taylor Blau, Git Mailing List, avarab

On Wed, Jan 20, 2021 at 10:46:29PM +0100, Jacob Vosmaer wrote:
> On Wed, Jan 20, 2021 at 8:45 PM Jeff King <peff@peff.net> wrote:
>
> > I also rage-replaced peel_ref() with a function taking an oid so we
> > never have to do that digging again. Posted separately in its own
> > thread.
>
> That sounds like a good solution. Where does that leave my patch? Do I
> need to wait for that new commit somehow? I don't know how that works,
> "contributing to Git" wise.

Peff noted in the patch to nuke "peel_ref()" that merging his and then
yours will cause a merge conflict, but it is trivial to resolve. So, I'd
expect that Junio will pick up both and resolve the merge conflict when
queuing.

Or, in other words, I don't think there's anything left on your part
;-). Thanks for providing your patches; there was a nice digression and
I'm quite happy with where we ended up.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-20 21:46           ` Jacob Vosmaer
  2021-01-20 21:52             ` Taylor Blau
@ 2021-01-21  2:54             ` Jeff King
  2021-01-22 16:46               ` Jacob Vosmaer
  1 sibling, 1 reply; 11+ messages in thread
From: Jeff King @ 2021-01-21  2:54 UTC (permalink / raw)
  To: Jacob Vosmaer; +Cc: Taylor Blau, Git Mailing List, avarab

On Wed, Jan 20, 2021 at 10:46:29PM +0100, Jacob Vosmaer wrote:

> On Wed, Jan 20, 2021 at 8:45 PM Jeff King <peff@peff.net> wrote:
> 
> > I also rage-replaced peel_ref() with a function taking an oid so we
> > never have to do that digging again. Posted separately in its own
> > thread.
> 
> That sounds like a good solution. Where does that leave my patch? Do I
> need to wait for that new commit somehow? I don't know how that works,
> "contributing to Git" wise.

Nope, you shouldn't need to do anything. The conflict is minimal
enough that the two can proceed independently, and the maintainer can
resolve it.

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-21  2:54             ` Jeff King
@ 2021-01-22 16:46               ` Jacob Vosmaer
  0 siblings, 0 replies; 11+ messages in thread
From: Jacob Vosmaer @ 2021-01-22 16:46 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, Git Mailing List, avarab

Great, thanks!

Best regards,

Jacob Vosmaer
GitLab, Inc.

On Thu, Jan 21, 2021 at 3:54 AM Jeff King <peff@peff.net> wrote:
>
> On Wed, Jan 20, 2021 at 10:46:29PM +0100, Jacob Vosmaer wrote:
>
> > On Wed, Jan 20, 2021 at 8:45 PM Jeff King <peff@peff.net> wrote:
> >
> > > I also rage-replaced peel_ref() with a function taking an oid so we
> > > never have to do that digging again. Posted separately in its own
> > > thread.
> >
> > That sounds like a good solution. Where does that leave my patch? Do I
> > need to wait for that new commit somehow? I don't know how that works,
> > "contributing to Git" wise.
>
> Nope, you shouldn't need to do anything. The conflict is minimal
> enough that the two can proceed independently, and the maintainer can
> resolve it.
>
> -Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-01-22 17:36 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-20 12:45 [PATCH v2 0/1] builtin/pack-objects.c: avoid iterating all refs Jacob Vosmaer
2021-01-20 12:45 ` [PATCH v2 1/1] " Jacob Vosmaer
2021-01-20 14:49   ` Taylor Blau
2021-01-20 16:18     ` Jeff King
2021-01-20 16:19       ` Taylor Blau
2021-01-20 18:49         ` Jacob Vosmaer
2021-01-20 19:45         ` Jeff King
2021-01-20 21:46           ` Jacob Vosmaer
2021-01-20 21:52             ` Taylor Blau
2021-01-21  2:54             ` Jeff King
2021-01-22 16:46               ` Jacob Vosmaer

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).