git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/1] builtin/pack-objects.c: avoid iterating all refs
@ 2021-01-19 14:33 Jacob Vosmaer
  2021-01-19 14:33 ` [PATCH 1/1] " Jacob Vosmaer
  2021-01-19 23:26 ` [PATCH 0/1] " Jeff King
  0 siblings, 2 replies; 13+ messages in thread
From: Jacob Vosmaer @ 2021-01-19 14:33 UTC (permalink / raw)
  To: git; +Cc: Jacob Vosmaer

This is a small patch for git-pack-objects which will help server side
performance on repositories with lots of refs. I will post a related
but slightly larger patch for ls-refs.c in a separate thread.

The back story is in
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/746 but I
will try to summarize it here.

We have a particular Gitaly (Git RPC) server at GitLab that has a very
homogenous workload, dominated by CI. While trying to reduce CPU
utilization on the server we configured CI to fetch with the
'--no-tags' option. This had an unexpectedly large impact so I started
looking closer at why that may be.

What I learned is that by default, a fetch ends up using the
'--include-tag' command line option of git-pack-objects. This causes
git-pack-objects to iterate through all the tags of the repository to
see if any should be included in the pack because they point to packed
objects. The problem is that this "iterate through all the tags" uses
for_each_ref which iterates through all references in the repository,
and in doing so loads each associated object into memory to check if
the ref is broken. But all we need for '--include-tag' is to iterate
through refs/tags/.

On the repo we were testing this on, there are about
500,000 refs but only 2,000 tags. So we had to load a lot of objects
just for the sake of '--include-tag'. It was common to see more than
half the CPU time in git-pack-objects being spent in do_for_each_ref,
and that in turn was dominated by ref_resolves_to_object.

So, I think it would be nice to just iterate over those 2,000 tags and
not load 500,000 objects outside refs/tags we already know we don't
care about.

Jacob Vosmaer (1):
  builtin/pack-objects.c: avoid iterating all refs

 builtin/pack-objects.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.30.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-19 14:33 [PATCH 0/1] builtin/pack-objects.c: avoid iterating all refs Jacob Vosmaer
@ 2021-01-19 14:33 ` Jacob Vosmaer
  2021-01-19 23:08   ` Taylor Blau
                     ` (2 more replies)
  2021-01-19 23:26 ` [PATCH 0/1] " Jeff King
  1 sibling, 3 replies; 13+ messages in thread
From: Jacob Vosmaer @ 2021-01-19 14:33 UTC (permalink / raw)
  To: git; +Cc: Jacob Vosmaer

In git-pack-objects, we iterate over all the tags if the --include-tag
option is passed on the command line. For some reason this uses
for_each_ref which is expensive if the repo has many refs. We should
use a prefix iterator instead.

Signed-off-by: Jacob Vosmaer <jacob@gitlab.com>
---
 builtin/pack-objects.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2a00358f34..2b32bc93bd 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3740,7 +3740,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	}
 	cleanup_preferred_base();
 	if (include_tag && nr_result)
-		for_each_ref(add_ref_tag, NULL);
+		for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0);
 	stop_progress(&progress_state);
 	trace2_region_leave("pack-objects", "enumerate-objects",
 			    the_repository);
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-19 14:33 ` [PATCH 1/1] " Jacob Vosmaer
@ 2021-01-19 23:08   ` Taylor Blau
  2021-01-19 23:33     ` Jeff King
  2021-01-19 23:15   ` Jeff King
  2021-01-20  8:50   ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 13+ messages in thread
From: Taylor Blau @ 2021-01-19 23:08 UTC (permalink / raw)
  To: Jacob Vosmaer; +Cc: git

On Tue, Jan 19, 2021 at 03:33:48PM +0100, Jacob Vosmaer wrote:
> In git-pack-objects, we iterate over all the tags if the --include-tag
> option is passed on the command line. For some reason this uses
> for_each_ref which is expensive if the repo has many refs. We should
> use a prefix iterator instead.

OK, makes sense to me.

> Signed-off-by: Jacob Vosmaer <jacob@gitlab.com>
> ---
>  builtin/pack-objects.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 2a00358f34..2b32bc93bd 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3740,7 +3740,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  	}
>  	cleanup_preferred_base();
>  	if (include_tag && nr_result)
> -		for_each_ref(add_ref_tag, NULL);
> +		for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0);

Makes sense. I was curious why it wasn't this way from the beginning in
f0a24aa56e (git-pack-objects: Automatically pack annotated tags if
object was packed, 2008-03-03).

The patch doesn't say, but presumably it was because there wasn't any
speed-up to be had iterating only a subset of references back then if
they were packed (did packed refs even exist in 2008? unsure). In any
case, builtin/pack-objects.c:add_ref_tag() discards anything that
doesn't start with "refs/tags/", so I think your change is doing the
right thing here.

That said, you could shorten this to use 'for_each_tag_ref()' instead
(which compiles to the exact same thing).

It probably wouldn't be a bad time to drop the extra prefix check in
add_ref_tag().

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-19 14:33 ` [PATCH 1/1] " Jacob Vosmaer
  2021-01-19 23:08   ` Taylor Blau
@ 2021-01-19 23:15   ` Jeff King
  2021-01-20  8:50   ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2021-01-19 23:15 UTC (permalink / raw)
  To: Jacob Vosmaer; +Cc: git

On Tue, Jan 19, 2021 at 03:33:48PM +0100, Jacob Vosmaer wrote:

> In git-pack-objects, we iterate over all the tags if the --include-tag
> option is passed on the command line. For some reason this uses
> for_each_ref which is expensive if the repo has many refs. We should
> use a prefix iterator instead.
> [...]
> -		for_each_ref(add_ref_tag, NULL);
> +		for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0);

Yeah, this is trivially correct, since the first thing we check in
add_ref_tag() is whether it starts with refs/tags/. It would be safe to
remove that check as part of this patch (and unlike the multi-prefix
one, I am comfortable assuming that for_each_fullref_in() will not
return anything outside of the prefix).

I also wondered if we could just use for_each_ref_in() here, but I think
we are better having the full refname to feed to peel_ref().

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-19 14:33 [PATCH 0/1] builtin/pack-objects.c: avoid iterating all refs Jacob Vosmaer
  2021-01-19 14:33 ` [PATCH 1/1] " Jacob Vosmaer
@ 2021-01-19 23:26 ` Jeff King
  1 sibling, 0 replies; 13+ messages in thread
From: Jeff King @ 2021-01-19 23:26 UTC (permalink / raw)
  To: Jacob Vosmaer; +Cc: git

On Tue, Jan 19, 2021 at 03:33:47PM +0100, Jacob Vosmaer wrote:

> What I learned is that by default, a fetch ends up using the
> '--include-tag' command line option of git-pack-objects. This causes
> git-pack-objects to iterate through all the tags of the repository to
> see if any should be included in the pack because they point to packed
> objects. The problem is that this "iterate through all the tags" uses
> for_each_ref which iterates through all references in the repository,
> and in doing so loads each associated object into memory to check if
> the ref is broken. But all we need for '--include-tag' is to iterate
> through refs/tags/.
> 
> On the repo we were testing this on, there are about
> 500,000 refs but only 2,000 tags. So we had to load a lot of objects
> just for the sake of '--include-tag'. It was common to see more than
> half the CPU time in git-pack-objects being spent in do_for_each_ref,
> and that in turn was dominated by ref_resolves_to_object.

Some of these details may be useful in the commit message, too. :)

Your "load a lot of objects" had me worried for a moment. We try hard
not to load objects during such an iteration, even when peeling them
(because the packed-refs format has a magic shortcut there). But I think
that is all working as intended. What you were seeing was just tons of
has_object_file() to make sure the ref was not corrupt (so finding the
entry in a packfile, but not actually inflating the object contents).

Arguably both upload-pack and pack-objects could use the INCLUDE_BROKEN
flag to avoid even checking this. We'd notice the problem when somebody
actually tried to fetch the object in question. That would speed things
up further on top of your patch, because we wouldn't need to check the
existence of even the tags. But it's definitely orthogonal, and should
be considered separately.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-19 23:08   ` Taylor Blau
@ 2021-01-19 23:33     ` Jeff King
  2021-01-19 23:54       ` Taylor Blau
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2021-01-19 23:33 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jacob Vosmaer, git

On Tue, Jan 19, 2021 at 06:08:31PM -0500, Taylor Blau wrote:

> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index 2a00358f34..2b32bc93bd 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -3740,7 +3740,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
> >  	}
> >  	cleanup_preferred_base();
> >  	if (include_tag && nr_result)
> > -		for_each_ref(add_ref_tag, NULL);
> > +		for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0);
> 
> Makes sense. I was curious why it wasn't this way from the beginning in
> f0a24aa56e (git-pack-objects: Automatically pack annotated tags if
> object was packed, 2008-03-03).
> 
> The patch doesn't say, but presumably it was because there wasn't any
> speed-up to be had iterating only a subset of references back then if
> they were packed (did packed refs even exist in 2008? unsure). In any
> case, builtin/pack-objects.c:add_ref_tag() discards anything that
> doesn't start with "refs/tags/", so I think your change is doing the
> right thing here.
> 
> That said, you could shorten this to use 'for_each_tag_ref()' instead
> (which compiles to the exact same thing).

You'd end up with "v1.2.3" in the refname field of the callback then,
rather than "refs/tags/v1.2.3". So we'd definitely need to drop the
prefix check in add_ref_tag() then (though I think it is a good idea
anyway). But I'm also not sure if it would interfere with peel_ref(),
and its magic for using packed-refs peeled information.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-19 23:33     ` Jeff King
@ 2021-01-19 23:54       ` Taylor Blau
  0 siblings, 0 replies; 13+ messages in thread
From: Taylor Blau @ 2021-01-19 23:54 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, Jacob Vosmaer, git

On Tue, Jan 19, 2021 at 06:33:34PM -0500, Jeff King wrote:
> On Tue, Jan 19, 2021 at 06:08:31PM -0500, Taylor Blau wrote:
> > That said, you could shorten this to use 'for_each_tag_ref()' instead
> > (which compiles to the exact same thing).
>
> You'd end up with "v1.2.3" in the refname field of the callback then,
> rather than "refs/tags/v1.2.3". So we'd definitely need to drop the
> prefix check in add_ref_tag() then (though I think it is a good idea
> anyway). But I'm also not sure if it would interfere with peel_ref(),
> and its magic for using packed-refs peeled information.

Ack, thanks for correcting me. Looking at ref_peel_ref(), I am almost
positive that calling peel_ref() on "v1.2.3" (referring to
"refs/tags/v1.2.3") wouldn't work. So I think the current implementation
is good.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-19 14:33 ` [PATCH 1/1] " Jacob Vosmaer
  2021-01-19 23:08   ` Taylor Blau
  2021-01-19 23:15   ` Jeff King
@ 2021-01-20  8:50   ` Ævar Arnfjörð Bjarmason
  2021-01-20 15:02     ` Taylor Blau
  2021-01-20 19:53     ` Jeff King
  2 siblings, 2 replies; 13+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-01-20  8:50 UTC (permalink / raw)
  To: Jacob Vosmaer; +Cc: git, Taylor Blau, Jeff King


On Tue, Jan 19 2021, Jacob Vosmaer wrote:

> In git-pack-objects, we iterate over all the tags if the --include-tag
> option is passed on the command line. For some reason this uses
> for_each_ref which is expensive if the repo has many refs. We should
> use a prefix iterator instead.
>
> Signed-off-by: Jacob Vosmaer <jacob@gitlab.com>
> ---
>  builtin/pack-objects.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 2a00358f34..2b32bc93bd 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3740,7 +3740,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  	}
>  	cleanup_preferred_base();
>  	if (include_tag && nr_result)
> -		for_each_ref(add_ref_tag, NULL);
> +		for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0);
>  	stop_progress(&progress_state);
>  	trace2_region_leave("pack-objects", "enumerate-objects",
>  			    the_repository);

Not really a new problem, but it would be nice to also have a test in
t5305-include-tag.sh to check what happens with "tags" outside the
refs/tags/ namespace. I.e. if this & the existing prefix in
add_ref_tag() were dropped.

To elaborate on other things that aren't really your problem & Taylor's
E-Mail downthread we originally added this because:

    If new annotated tags have been introduced then we can also include
    them in the packfile, saving the client from needing to request them
    through a second connection.

I've barked up this whole tag fetch tree before 97716d217c (fetch: add a
--prune-tags option and fetch.pruneTags config, 2018-02-09) but really
not this specific area.

I wonder if longer term simply moving this to be work the client does
wouldn't make more sense. I.e. if we just delete this for_each_ref()
loop.
    
We're just saving a client from saying they "want" a tag. I.e. with the
whole thing removed we need this to make t5503-tagfollow.sh pass (see
[1] at the end for the dialog, the tag is 3253df4d...):

    diff --git a/t/t5503-tagfollow.sh b/t/t5503-tagfollow.sh
    index 6041a4dd32..1ddc430aef 100755
    --- a/t/t5503-tagfollow.sh
    +++ b/t/t5503-tagfollow.sh
    @@ -134,6 +134,7 @@ test_expect_success 'fetch B, S (commit and tag : 1 connection)' '
     test_expect_success 'setup expect' '
     cat - <<EOF >expect
     want $B
    +want $T
     want $S
     EOF
     '
    @@ -146,6 +147,7 @@ test_expect_success 'new clone fetch master and tags' '
                    cd clone2 &&
                    git init &&
                    git remote add origin .. &&
    +               git config --add remote.origin.fetch "+refs/tags/*:refs/tags/*" &&
                    GIT_TRACE_PACKET=$UPATH git fetch &&
                    test $B = $(git rev-parse --verify origin/master) &&
                    test $S = $(git rev-parse --verify tag2) &&

We're also saving the client the work of having to go through
refs/tags/* and figure out whether there are tags there that aren't on
our main history.

I think that since f0a24aa56e (git-pack-objects: Automatically pack
annotated tags if object was packed, 2008-03-03) was written it's become
clear that in the wild that's almost nobody's use-case. I.e. people with
a set of refs/heads/* branches and a refs/tags/* set that doesn't refer
to those branches.

Or if they do, I think it's such an obscure use-case that if we were
designing this today we could pass that work of figuring out if there's
such tags in refs/tags/* off to the client.

It seems we might just be able to delete this code on the server-side,
per protocol-capabilities.txt:

    Clients MUST be prepared for the case where a server has ignored
    include-tag and has not actually sent tags in the pack.  In such
    cases the client SHOULD issue a subsequent fetch to acquire the tags
    that include-tag would have otherwise given the client.

I.e. in the case where the server isn't playing along and I haven't set
"+refs/tags/*:refs/tags/*". But as the test shows we don't do that
following ourselves unless refs/tags/* is in the refspec (and then it's
not really "following", we're just getting all the tags).

1. From t5503-tagfollow.sh:
    $ grep -C5 3253df4d1cf6fb138b52b1938473bcfec1483223 UPLOAD_LOG 
    packet:  upload-pack< ref-prefix refs/tags/
    packet:  upload-pack< ref-prefix refs/tags/
    packet:  upload-pack< 0000
    packet:  upload-pack> 8e10cf4e007ad7e003463c30c34b1050b039db78 refs/heads/master
    packet:        fetch< 8e10cf4e007ad7e003463c30c34b1050b039db78 refs/heads/master
    packet:  upload-pack> 3253df4d1cf6fb138b52b1938473bcfec1483223 refs/tags/tag1 peeled:c06aaaf9c64f9ef1f209bcb6d7bdba68a5e27fab
    packet:        fetch< 3253df4d1cf6fb138b52b1938473bcfec1483223 refs/tags/tag1 peeled:c06aaaf9c64f9ef1f209bcb6d7bdba68a5e27fab
    packet:  upload-pack> ddfa4a33562179aca1ace2bcc662244a17d0b503 refs/tags/tag2 peeled:8e10cf4e007ad7e003463c30c34b1050b039db78
    packet:  upload-pack> 0000
    packet:        fetch< ddfa4a33562179aca1ace2bcc662244a17d0b503 refs/tags/tag2 peeled:8e10cf4e007ad7e003463c30c34b1050b039db78
    packet:        fetch< 0000
    packet:        fetch> command=fetch
    --
    packet:        fetch> 0001
    packet:        fetch> thin-pack
    packet:        fetch> include-tag
    packet:        fetch> ofs-delta
    packet:        fetch> want 8e10cf4e007ad7e003463c30c34b1050b039db78
    packet:        fetch> want 3253df4d1cf6fb138b52b1938473bcfec1483223
    packet:        fetch> want ddfa4a33562179aca1ace2bcc662244a17d0b503
    packet:        fetch> done
    packet:        fetch> 0000
    packet:  upload-pack< object-format=sha1
    packet:  upload-pack< command=fetch
    --
    packet:  upload-pack< 0001
    packet:  upload-pack< thin-pack
    packet:  upload-pack< include-tag
    packet:  upload-pack< ofs-delta
    packet:  upload-pack< want 8e10cf4e007ad7e003463c30c34b1050b039db78
    packet:  upload-pack< want 3253df4d1cf6fb138b52b1938473bcfec1483223
    packet:  upload-pack< want ddfa4a33562179aca1ace2bcc662244a17d0b503
    packet:  upload-pack< done
    packet:  upload-pack< 0000
    packet:  upload-pack> packfile
    packet:        fetch< packfile
    

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-20  8:50   ` Ævar Arnfjörð Bjarmason
@ 2021-01-20 15:02     ` Taylor Blau
  2021-01-20 16:32       ` Ævar Arnfjörð Bjarmason
  2021-01-20 19:53     ` Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: Taylor Blau @ 2021-01-20 15:02 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jacob Vosmaer, git, Taylor Blau, Jeff King

On Wed, Jan 20, 2021 at 09:50:19AM +0100, Ævar Arnfjörð Bjarmason wrote:
> It seems we might just be able to delete this code on the server-side,
> per protocol-capabilities.txt:
>
>     Clients MUST be prepared for the case where a server has ignored
>     include-tag and has not actually sent tags in the pack.  In such
>     cases the client SHOULD issue a subsequent fetch to acquire the tags
>     that include-tag would have otherwise given the client.
>
> I.e. in the case where the server isn't playing along and I haven't set
> "+refs/tags/*:refs/tags/*". But as the test shows we don't do that
> following ourselves unless refs/tags/* is in the refspec (and then it's
> not really "following", we're just getting all the tags).

Reading your email, I see no reason not to do it, and that snippet from
protocol-capabilities.txt makes me feel even better about doing so.

I'd be happy to have Jacob's patch picked up in the meantime, but I
think that this is a good direction to pursue.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-20 15:02     ` Taylor Blau
@ 2021-01-20 16:32       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 13+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-01-20 16:32 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jacob Vosmaer, git, Jeff King


On Wed, Jan 20 2021, Taylor Blau wrote:

> On Wed, Jan 20, 2021 at 09:50:19AM +0100, Ævar Arnfjörð Bjarmason wrote:
>> It seems we might just be able to delete this code on the server-side,
>> per protocol-capabilities.txt:
>>
>>     Clients MUST be prepared for the case where a server has ignored
>>     include-tag and has not actually sent tags in the pack.  In such
>>     cases the client SHOULD issue a subsequent fetch to acquire the tags
>>     that include-tag would have otherwise given the client.
>>
>> I.e. in the case where the server isn't playing along and I haven't set
>> "+refs/tags/*:refs/tags/*". But as the test shows we don't do that
>> following ourselves unless refs/tags/* is in the refspec (and then it's
>> not really "following", we're just getting all the tags).
>
> Reading your email, I see no reason not to do it, and that snippet from
> protocol-capabilities.txt makes me feel even better about doing so.
>
> I'd be happy to have Jacob's patch picked up in the meantime, but I
> think that this is a good direction to pursue.

Oh yes, none of this is commentary on not accepting the much smaller and
obviously correct change in the meantime.  Just musings this general
area of fetching & ideas for further optimization.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-20  8:50   ` Ævar Arnfjörð Bjarmason
  2021-01-20 15:02     ` Taylor Blau
@ 2021-01-20 19:53     ` Jeff King
  2021-01-21 10:26       ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 13+ messages in thread
From: Jeff King @ 2021-01-20 19:53 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Jacob Vosmaer, git, Taylor Blau

On Wed, Jan 20, 2021 at 09:50:19AM +0100, Ævar Arnfjörð Bjarmason wrote:

> To elaborate on other things that aren't really your problem & Taylor's
> E-Mail downthread we originally added this because:
> 
>     If new annotated tags have been introduced then we can also include
>     them in the packfile, saving the client from needing to request them
>     through a second connection.
> 
> I've barked up this whole tag fetch tree before 97716d217c (fetch: add a
> --prune-tags option and fetch.pruneTags config, 2018-02-09) but really
> not this specific area.
> 
> I wonder if longer term simply moving this to be work the client does
> wouldn't make more sense. I.e. if we just delete this for_each_ref()
> loop.
>     
> We're just saving a client from saying they "want" a tag. I.e. with the
> whole thing removed we need this to make t5503-tagfollow.sh pass (see
> [1] at the end for the dialog, the tag is 3253df4d...):

Isn't this an ordering problem, though? The client cannot mention
auto-followed tags in a "want", because they first need to "want" the
main history, receive the pack, and then realize they want the others.

So we are not just saving the client from sending a "want", but making a
second full connection to do so. That seems to be an optimization worth
continuing to have.

But here...

>     diff --git a/t/t5503-tagfollow.sh b/t/t5503-tagfollow.sh
>     index 6041a4dd32..1ddc430aef 100755
>     --- a/t/t5503-tagfollow.sh
>     +++ b/t/t5503-tagfollow.sh
>     @@ -134,6 +134,7 @@ test_expect_success 'fetch B, S (commit and tag : 1 connection)' '
>      test_expect_success 'setup expect' '
>      cat - <<EOF >expect
>      want $B
>     +want $T
>      want $S
>      EOF
>      '
>     @@ -146,6 +147,7 @@ test_expect_success 'new clone fetch master and tags' '
>                     cd clone2 &&
>                     git init &&
>                     git remote add origin .. &&
>     +               git config --add remote.origin.fetch "+refs/tags/*:refs/tags/*" &&
>                     GIT_TRACE_PACKET=$UPATH git fetch &&
>                     test $B = $(git rev-parse --verify origin/master) &&
>                     test $S = $(git rev-parse --verify tag2) &&
> 
> We're also saving the client the work of having to go through
> refs/tags/* and figure out whether there are tags there that aren't on
> our main history.

You seem to be against auto-following at all. And certainly I can see an
argument that it is not worth the trouble it causes. But it is the
default behavior, and I suspect many people are relying on it. Fetching
every tag indiscriminately is going to grab a bunch of extra unwanted
objects in some repos. An obvious case is any time "clone
--single-branch --depth" is used.

Maybe I'm not quite understanding what you're proposing.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-20 19:53     ` Jeff King
@ 2021-01-21 10:26       ` Ævar Arnfjörð Bjarmason
  2021-01-21 15:34         ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-01-21 10:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Jacob Vosmaer, git, Taylor Blau


On Wed, Jan 20 2021, Jeff King wrote:

> On Wed, Jan 20, 2021 at 09:50:19AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> To elaborate on other things that aren't really your problem & Taylor's
>> E-Mail downthread we originally added this because:
>> 
>>     If new annotated tags have been introduced then we can also include
>>     them in the packfile, saving the client from needing to request them
>>     through a second connection.
>> 
>> I've barked up this whole tag fetch tree before 97716d217c (fetch: add a
>> --prune-tags option and fetch.pruneTags config, 2018-02-09) but really
>> not this specific area.
>> 
>> I wonder if longer term simply moving this to be work the client does
>> wouldn't make more sense. I.e. if we just delete this for_each_ref()
>> loop.
>>     
>> We're just saving a client from saying they "want" a tag. I.e. with the
>> whole thing removed we need this to make t5503-tagfollow.sh pass (see
>> [1] at the end for the dialog, the tag is 3253df4d...):
>
> Isn't this an ordering problem, though? The client cannot mention
> auto-followed tags in a "want", because they first need to "want" the
> main history, receive the pack, and then realize they want the others.
>
> So we are not just saving the client from sending a "want", but making a
> second full connection to do so. That seems to be an optimization worth
> continuing to have.

Correct. Suppose a history COMMIT(TAG) history like:

    mainline: A(X) -> B(Y) -> C
    topic:    A(X) -> B(Y) -> D(Z)

You only want the "mainline" history and its tags in your fetch. Now a
server will give you tags X & Y for free, ignoring Z.

The note in protocol-capabilities.txt supposes that you'll need to only
get A/B/C in one fetch, then do another one where you see from the OIDs
you have and advertised (peeled) OIDs for the tags and know you just
need X/Y, not Z.

So we could also just fetch Z, then do our own walk on the client to see
if it's reachable, and throw it away if not. Although I suppose that'll
need a list of "bad" tags on the client so we don't repeat the process
the next time around, hrm...

>>     diff --git a/t/t5503-tagfollow.sh b/t/t5503-tagfollow.sh
>>     index 6041a4dd32..1ddc430aef 100755
>>     --- a/t/t5503-tagfollow.sh
>>     +++ b/t/t5503-tagfollow.sh
>>     @@ -134,6 +134,7 @@ test_expect_success 'fetch B, S (commit and tag : 1 connection)' '
>>      test_expect_success 'setup expect' '
>>      cat - <<EOF >expect
>>      want $B
>>     +want $T
>>      want $S
>>      EOF
>>      '
>>     @@ -146,6 +147,7 @@ test_expect_success 'new clone fetch master and tags' '
>>                     cd clone2 &&
>>                     git init &&
>>                     git remote add origin .. &&
>>     +               git config --add remote.origin.fetch "+refs/tags/*:refs/tags/*" &&
>>                     GIT_TRACE_PACKET=$UPATH git fetch &&
>>                     test $B = $(git rev-parse --verify origin/master) &&
>>                     test $S = $(git rev-parse --verify tag2) &&
>> 
>> We're also saving the client the work of having to go through
>> refs/tags/* and figure out whether there are tags there that aren't on
>> our main history.
>
> You seem to be against auto-following at all. And certainly I can see an
> argument that it is not worth the trouble it causes. But it is the
> default behavior, and I suspect many people are relying on it. Fetching
> every tag indiscriminately is going to grab a bunch of extra unwanted
> objects in some repos. An obvious case is any time "clone
> --single-branch --depth" is used.

I wonder if the use-cases for --depth in the wild aren't 99.9%
--depth=1, in which case the peeled commit on the main branch being
advertised by a tag would alredy give you this information.

I.e. in the common case you don't get a tag, but if you happen to land
on a release you can see that the commit at the tip is tagged. I wonder
if doing that shortcut already in the client (so not send include-tag)
is a useful micro-optimization.

> Maybe I'm not quite understanding what you're proposing.

Not much really, just that looking deeper into this area might be a
useful exercise. I.e. it's a case of server<->client cooperation where
we offlod the work to one or the other based on an old design decision,
maybe that trade-off is not optimal in most cases anymore.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs
  2021-01-21 10:26       ` Ævar Arnfjörð Bjarmason
@ 2021-01-21 15:34         ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2021-01-21 15:34 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Jacob Vosmaer, git, Taylor Blau

On Thu, Jan 21, 2021 at 11:26:13AM +0100, Ævar Arnfjörð Bjarmason wrote:

> > So we are not just saving the client from sending a "want", but making a
> > second full connection to do so. That seems to be an optimization worth
> > continuing to have.
> 
> Correct. Suppose a history COMMIT(TAG) history like:
> 
>     mainline: A(X) -> B(Y) -> C
>     topic:    A(X) -> B(Y) -> D(Z)
> 
> You only want the "mainline" history and its tags in your fetch. Now a
> server will give you tags X & Y for free, ignoring Z.
> 
> The note in protocol-capabilities.txt supposes that you'll need to only
> get A/B/C in one fetch, then do another one where you see from the OIDs
> you have and advertised (peeled) OIDs for the tags and know you just
> need X/Y, not Z.

Right. That note basically indicates that what the server is doing is
"best effort". If it sends you B, it will also send you Y. But it is
ultimately up to the client to decide if they want or need a tag the
server didn't send and do the follow-up.

So include-tag is an optimization, but it works often enough that it
frequently saves the second fetch from happening at all.

> So we could also just fetch Z, then do our own walk on the client to see
> if it's reachable, and throw it away if not. Although I suppose that'll
> need a list of "bad" tags on the client so we don't repeat the process
> the next time around, hrm...

Not only that, but now you've fetched "D" that you never wanted. If it's
one commit, that's not so bad. But it could be pulling in arbitrary
amounts of reachable history that you don't want.

> > You seem to be against auto-following at all. And certainly I can see an
> > argument that it is not worth the trouble it causes. But it is the
> > default behavior, and I suspect many people are relying on it. Fetching
> > every tag indiscriminately is going to grab a bunch of extra unwanted
> > objects in some repos. An obvious case is any time "clone
> > --single-branch --depth" is used.
> 
> I wonder if the use-cases for --depth in the wild aren't 99.9%
> --depth=1, in which case the peeled commit on the main branch being
> advertised by a tag would alredy give you this information.

I'd have thought there was basically no use for anything but --depth=1,
but people seem really enamored with the notion of --depth=50. Many CI
systems use that for some reason.

--depth is just one case where you might not have all of history,
though. Even without it, --single-branch means you wouldn't want to get
the history of all of the other branches. They may only be a few commits
ahead of the main history, in which case fetching extra tags that point
to them might not be a big deal. But it really depends on how your repo
is shaped, what tags point to, etc.

Even without any options, your repo may have a disjoint chunk of history
mentioned by a tag (e.g., a tag pointing to a segment of history that is
meant to be grafted on, or even an archived version of history from
before filter-branch rewrite).

Perhaps if we were designing from scratch we might write those off as
unusual cases we don't need to care about. But I'd be very hesitant to
change the way tag following works now, after so many years. If we were
to make any change, I think one plausible one would be to have clone set
up refspecs to fetch all tags (and then avoid sending include-tag, since
we know we're not auto-following). That would let people still use
auto-follow when they wanted to, but make things simpler to reason
about. And after all, clone already fetches all of the tags.

But I haven't thought too hard about it.

> > Maybe I'm not quite understanding what you're proposing.
> 
> Not much really, just that looking deeper into this area might be a
> useful exercise. I.e. it's a case of server<->client cooperation where
> we offlod the work to one or the other based on an old design decision,
> maybe that trade-off is not optimal in most cases anymore.

I don't think it's very much work on the server, though. Sure, iterating
the tags is O(nr_tags). But we do that iteration lots of other places,
too (the advertisement unless the client asks for a very narrow prefix,
or upload-pack's ref-marking for reachability).

And saving us from sending unwanted objects is a potential win on the
server (even a single unlucky delta where we have to reconstruct and
re-deflate an object will likely blow a ref iteration out of the water).
Likewise, preventing the client from connecting a second time is a win
for the server, which doesn't have to spin up a new upload-pack
(likewise for the client of course; it might even have to prompt the
user for auth again!).

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-01-21 15:39 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-19 14:33 [PATCH 0/1] builtin/pack-objects.c: avoid iterating all refs Jacob Vosmaer
2021-01-19 14:33 ` [PATCH 1/1] " Jacob Vosmaer
2021-01-19 23:08   ` Taylor Blau
2021-01-19 23:33     ` Jeff King
2021-01-19 23:54       ` Taylor Blau
2021-01-19 23:15   ` Jeff King
2021-01-20  8:50   ` Ævar Arnfjörð Bjarmason
2021-01-20 15:02     ` Taylor Blau
2021-01-20 16:32       ` Ævar Arnfjörð Bjarmason
2021-01-20 19:53     ` Jeff King
2021-01-21 10:26       ` Ævar Arnfjörð Bjarmason
2021-01-21 15:34         ` Jeff King
2021-01-19 23:26 ` [PATCH 0/1] " Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).