* [PATCH 0/1] builtin/pack-objects.c: avoid iterating all refs @ 2021-01-19 14:33 Jacob Vosmaer 2021-01-19 14:33 ` [PATCH 1/1] " Jacob Vosmaer 2021-01-19 23:26 ` [PATCH 0/1] " Jeff King 0 siblings, 2 replies; 13+ messages in thread From: Jacob Vosmaer @ 2021-01-19 14:33 UTC (permalink / raw) To: git; +Cc: Jacob Vosmaer This is a small patch for git-pack-objects which will help server side performance on repositories with lots of refs. I will post a related but slightly larger patch for ls-refs.c in a separate thread. The back story is in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/746 but I will try to summarize it here. We have a particular Gitaly (Git RPC) server at GitLab that has a very homogenous workload, dominated by CI. While trying to reduce CPU utilization on the server we configured CI to fetch with the '--no-tags' option. This had an unexpectedly large impact so I started looking closer at why that may be. What I learned is that by default, a fetch ends up using the '--include-tag' command line option of git-pack-objects. This causes git-pack-objects to iterate through all the tags of the repository to see if any should be included in the pack because they point to packed objects. The problem is that this "iterate through all the tags" uses for_each_ref which iterates through all references in the repository, and in doing so loads each associated object into memory to check if the ref is broken. But all we need for '--include-tag' is to iterate through refs/tags/. On the repo we were testing this on, there are about 500,000 refs but only 2,000 tags. So we had to load a lot of objects just for the sake of '--include-tag'. It was common to see more than half the CPU time in git-pack-objects being spent in do_for_each_ref, and that in turn was dominated by ref_resolves_to_object. So, I think it would be nice to just iterate over those 2,000 tags and not load 500,000 objects outside refs/tags we already know we don't care about. Jacob Vosmaer (1): builtin/pack-objects.c: avoid iterating all refs builtin/pack-objects.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- 2.30.0 ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs 2021-01-19 14:33 [PATCH 0/1] builtin/pack-objects.c: avoid iterating all refs Jacob Vosmaer @ 2021-01-19 14:33 ` Jacob Vosmaer 2021-01-19 23:08 ` Taylor Blau ` (2 more replies) 2021-01-19 23:26 ` [PATCH 0/1] " Jeff King 1 sibling, 3 replies; 13+ messages in thread From: Jacob Vosmaer @ 2021-01-19 14:33 UTC (permalink / raw) To: git; +Cc: Jacob Vosmaer In git-pack-objects, we iterate over all the tags if the --include-tag option is passed on the command line. For some reason this uses for_each_ref which is expensive if the repo has many refs. We should use a prefix iterator instead. Signed-off-by: Jacob Vosmaer <jacob@gitlab.com> --- builtin/pack-objects.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 2a00358f34..2b32bc93bd 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3740,7 +3740,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) } cleanup_preferred_base(); if (include_tag && nr_result) - for_each_ref(add_ref_tag, NULL); + for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0); stop_progress(&progress_state); trace2_region_leave("pack-objects", "enumerate-objects", the_repository); -- 2.30.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs 2021-01-19 14:33 ` [PATCH 1/1] " Jacob Vosmaer @ 2021-01-19 23:08 ` Taylor Blau 2021-01-19 23:33 ` Jeff King 2021-01-19 23:15 ` Jeff King 2021-01-20 8:50 ` Ævar Arnfjörð Bjarmason 2 siblings, 1 reply; 13+ messages in thread From: Taylor Blau @ 2021-01-19 23:08 UTC (permalink / raw) To: Jacob Vosmaer; +Cc: git On Tue, Jan 19, 2021 at 03:33:48PM +0100, Jacob Vosmaer wrote: > In git-pack-objects, we iterate over all the tags if the --include-tag > option is passed on the command line. For some reason this uses > for_each_ref which is expensive if the repo has many refs. We should > use a prefix iterator instead. OK, makes sense to me. > Signed-off-by: Jacob Vosmaer <jacob@gitlab.com> > --- > builtin/pack-objects.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index 2a00358f34..2b32bc93bd 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -3740,7 +3740,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) > } > cleanup_preferred_base(); > if (include_tag && nr_result) > - for_each_ref(add_ref_tag, NULL); > + for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0); Makes sense. I was curious why it wasn't this way from the beginning in f0a24aa56e (git-pack-objects: Automatically pack annotated tags if object was packed, 2008-03-03). The patch doesn't say, but presumably it was because there wasn't any speed-up to be had iterating only a subset of references back then if they were packed (did packed refs even exist in 2008? unsure). In any case, builtin/pack-objects.c:add_ref_tag() discards anything that doesn't start with "refs/tags/", so I think your change is doing the right thing here. That said, you could shorten this to use 'for_each_tag_ref()' instead (which compiles to the exact same thing). It probably wouldn't be a bad time to drop the extra prefix check in add_ref_tag(). Thanks, Taylor ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs 2021-01-19 23:08 ` Taylor Blau @ 2021-01-19 23:33 ` Jeff King 2021-01-19 23:54 ` Taylor Blau 0 siblings, 1 reply; 13+ messages in thread From: Jeff King @ 2021-01-19 23:33 UTC (permalink / raw) To: Taylor Blau; +Cc: Jacob Vosmaer, git On Tue, Jan 19, 2021 at 06:08:31PM -0500, Taylor Blau wrote: > > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > > index 2a00358f34..2b32bc93bd 100644 > > --- a/builtin/pack-objects.c > > +++ b/builtin/pack-objects.c > > @@ -3740,7 +3740,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) > > } > > cleanup_preferred_base(); > > if (include_tag && nr_result) > > - for_each_ref(add_ref_tag, NULL); > > + for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0); > > Makes sense. I was curious why it wasn't this way from the beginning in > f0a24aa56e (git-pack-objects: Automatically pack annotated tags if > object was packed, 2008-03-03). > > The patch doesn't say, but presumably it was because there wasn't any > speed-up to be had iterating only a subset of references back then if > they were packed (did packed refs even exist in 2008? unsure). In any > case, builtin/pack-objects.c:add_ref_tag() discards anything that > doesn't start with "refs/tags/", so I think your change is doing the > right thing here. > > That said, you could shorten this to use 'for_each_tag_ref()' instead > (which compiles to the exact same thing). You'd end up with "v1.2.3" in the refname field of the callback then, rather than "refs/tags/v1.2.3". So we'd definitely need to drop the prefix check in add_ref_tag() then (though I think it is a good idea anyway). But I'm also not sure if it would interfere with peel_ref(), and its magic for using packed-refs peeled information. -Peff ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs 2021-01-19 23:33 ` Jeff King @ 2021-01-19 23:54 ` Taylor Blau 0 siblings, 0 replies; 13+ messages in thread From: Taylor Blau @ 2021-01-19 23:54 UTC (permalink / raw) To: Jeff King; +Cc: Taylor Blau, Jacob Vosmaer, git On Tue, Jan 19, 2021 at 06:33:34PM -0500, Jeff King wrote: > On Tue, Jan 19, 2021 at 06:08:31PM -0500, Taylor Blau wrote: > > That said, you could shorten this to use 'for_each_tag_ref()' instead > > (which compiles to the exact same thing). > > You'd end up with "v1.2.3" in the refname field of the callback then, > rather than "refs/tags/v1.2.3". So we'd definitely need to drop the > prefix check in add_ref_tag() then (though I think it is a good idea > anyway). But I'm also not sure if it would interfere with peel_ref(), > and its magic for using packed-refs peeled information. Ack, thanks for correcting me. Looking at ref_peel_ref(), I am almost positive that calling peel_ref() on "v1.2.3" (referring to "refs/tags/v1.2.3") wouldn't work. So I think the current implementation is good. Thanks, Taylor ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs 2021-01-19 14:33 ` [PATCH 1/1] " Jacob Vosmaer 2021-01-19 23:08 ` Taylor Blau @ 2021-01-19 23:15 ` Jeff King 2021-01-20 8:50 ` Ævar Arnfjörð Bjarmason 2 siblings, 0 replies; 13+ messages in thread From: Jeff King @ 2021-01-19 23:15 UTC (permalink / raw) To: Jacob Vosmaer; +Cc: git On Tue, Jan 19, 2021 at 03:33:48PM +0100, Jacob Vosmaer wrote: > In git-pack-objects, we iterate over all the tags if the --include-tag > option is passed on the command line. For some reason this uses > for_each_ref which is expensive if the repo has many refs. We should > use a prefix iterator instead. > [...] > - for_each_ref(add_ref_tag, NULL); > + for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0); Yeah, this is trivially correct, since the first thing we check in add_ref_tag() is whether it starts with refs/tags/. It would be safe to remove that check as part of this patch (and unlike the multi-prefix one, I am comfortable assuming that for_each_fullref_in() will not return anything outside of the prefix). I also wondered if we could just use for_each_ref_in() here, but I think we are better having the full refname to feed to peel_ref(). -Peff ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs 2021-01-19 14:33 ` [PATCH 1/1] " Jacob Vosmaer 2021-01-19 23:08 ` Taylor Blau 2021-01-19 23:15 ` Jeff King @ 2021-01-20 8:50 ` Ævar Arnfjörð Bjarmason 2021-01-20 15:02 ` Taylor Blau 2021-01-20 19:53 ` Jeff King 2 siblings, 2 replies; 13+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-01-20 8:50 UTC (permalink / raw) To: Jacob Vosmaer; +Cc: git, Taylor Blau, Jeff King On Tue, Jan 19 2021, Jacob Vosmaer wrote: > In git-pack-objects, we iterate over all the tags if the --include-tag > option is passed on the command line. For some reason this uses > for_each_ref which is expensive if the repo has many refs. We should > use a prefix iterator instead. > > Signed-off-by: Jacob Vosmaer <jacob@gitlab.com> > --- > builtin/pack-objects.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index 2a00358f34..2b32bc93bd 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -3740,7 +3740,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) > } > cleanup_preferred_base(); > if (include_tag && nr_result) > - for_each_ref(add_ref_tag, NULL); > + for_each_fullref_in("refs/tags/", add_ref_tag, NULL, 0); > stop_progress(&progress_state); > trace2_region_leave("pack-objects", "enumerate-objects", > the_repository); Not really a new problem, but it would be nice to also have a test in t5305-include-tag.sh to check what happens with "tags" outside the refs/tags/ namespace. I.e. if this & the existing prefix in add_ref_tag() were dropped. To elaborate on other things that aren't really your problem & Taylor's E-Mail downthread we originally added this because: If new annotated tags have been introduced then we can also include them in the packfile, saving the client from needing to request them through a second connection. I've barked up this whole tag fetch tree before 97716d217c (fetch: add a --prune-tags option and fetch.pruneTags config, 2018-02-09) but really not this specific area. I wonder if longer term simply moving this to be work the client does wouldn't make more sense. I.e. if we just delete this for_each_ref() loop. We're just saving a client from saying they "want" a tag. I.e. with the whole thing removed we need this to make t5503-tagfollow.sh pass (see [1] at the end for the dialog, the tag is 3253df4d...): diff --git a/t/t5503-tagfollow.sh b/t/t5503-tagfollow.sh index 6041a4dd32..1ddc430aef 100755 --- a/t/t5503-tagfollow.sh +++ b/t/t5503-tagfollow.sh @@ -134,6 +134,7 @@ test_expect_success 'fetch B, S (commit and tag : 1 connection)' ' test_expect_success 'setup expect' ' cat - <<EOF >expect want $B +want $T want $S EOF ' @@ -146,6 +147,7 @@ test_expect_success 'new clone fetch master and tags' ' cd clone2 && git init && git remote add origin .. && + git config --add remote.origin.fetch "+refs/tags/*:refs/tags/*" && GIT_TRACE_PACKET=$UPATH git fetch && test $B = $(git rev-parse --verify origin/master) && test $S = $(git rev-parse --verify tag2) && We're also saving the client the work of having to go through refs/tags/* and figure out whether there are tags there that aren't on our main history. I think that since f0a24aa56e (git-pack-objects: Automatically pack annotated tags if object was packed, 2008-03-03) was written it's become clear that in the wild that's almost nobody's use-case. I.e. people with a set of refs/heads/* branches and a refs/tags/* set that doesn't refer to those branches. Or if they do, I think it's such an obscure use-case that if we were designing this today we could pass that work of figuring out if there's such tags in refs/tags/* off to the client. It seems we might just be able to delete this code on the server-side, per protocol-capabilities.txt: Clients MUST be prepared for the case where a server has ignored include-tag and has not actually sent tags in the pack. In such cases the client SHOULD issue a subsequent fetch to acquire the tags that include-tag would have otherwise given the client. I.e. in the case where the server isn't playing along and I haven't set "+refs/tags/*:refs/tags/*". But as the test shows we don't do that following ourselves unless refs/tags/* is in the refspec (and then it's not really "following", we're just getting all the tags). 1. From t5503-tagfollow.sh: $ grep -C5 3253df4d1cf6fb138b52b1938473bcfec1483223 UPLOAD_LOG packet: upload-pack< ref-prefix refs/tags/ packet: upload-pack< ref-prefix refs/tags/ packet: upload-pack< 0000 packet: upload-pack> 8e10cf4e007ad7e003463c30c34b1050b039db78 refs/heads/master packet: fetch< 8e10cf4e007ad7e003463c30c34b1050b039db78 refs/heads/master packet: upload-pack> 3253df4d1cf6fb138b52b1938473bcfec1483223 refs/tags/tag1 peeled:c06aaaf9c64f9ef1f209bcb6d7bdba68a5e27fab packet: fetch< 3253df4d1cf6fb138b52b1938473bcfec1483223 refs/tags/tag1 peeled:c06aaaf9c64f9ef1f209bcb6d7bdba68a5e27fab packet: upload-pack> ddfa4a33562179aca1ace2bcc662244a17d0b503 refs/tags/tag2 peeled:8e10cf4e007ad7e003463c30c34b1050b039db78 packet: upload-pack> 0000 packet: fetch< ddfa4a33562179aca1ace2bcc662244a17d0b503 refs/tags/tag2 peeled:8e10cf4e007ad7e003463c30c34b1050b039db78 packet: fetch< 0000 packet: fetch> command=fetch -- packet: fetch> 0001 packet: fetch> thin-pack packet: fetch> include-tag packet: fetch> ofs-delta packet: fetch> want 8e10cf4e007ad7e003463c30c34b1050b039db78 packet: fetch> want 3253df4d1cf6fb138b52b1938473bcfec1483223 packet: fetch> want ddfa4a33562179aca1ace2bcc662244a17d0b503 packet: fetch> done packet: fetch> 0000 packet: upload-pack< object-format=sha1 packet: upload-pack< command=fetch -- packet: upload-pack< 0001 packet: upload-pack< thin-pack packet: upload-pack< include-tag packet: upload-pack< ofs-delta packet: upload-pack< want 8e10cf4e007ad7e003463c30c34b1050b039db78 packet: upload-pack< want 3253df4d1cf6fb138b52b1938473bcfec1483223 packet: upload-pack< want ddfa4a33562179aca1ace2bcc662244a17d0b503 packet: upload-pack< done packet: upload-pack< 0000 packet: upload-pack> packfile packet: fetch< packfile ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs 2021-01-20 8:50 ` Ævar Arnfjörð Bjarmason @ 2021-01-20 15:02 ` Taylor Blau 2021-01-20 16:32 ` Ævar Arnfjörð Bjarmason 2021-01-20 19:53 ` Jeff King 1 sibling, 1 reply; 13+ messages in thread From: Taylor Blau @ 2021-01-20 15:02 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Jacob Vosmaer, git, Taylor Blau, Jeff King On Wed, Jan 20, 2021 at 09:50:19AM +0100, Ævar Arnfjörð Bjarmason wrote: > It seems we might just be able to delete this code on the server-side, > per protocol-capabilities.txt: > > Clients MUST be prepared for the case where a server has ignored > include-tag and has not actually sent tags in the pack. In such > cases the client SHOULD issue a subsequent fetch to acquire the tags > that include-tag would have otherwise given the client. > > I.e. in the case where the server isn't playing along and I haven't set > "+refs/tags/*:refs/tags/*". But as the test shows we don't do that > following ourselves unless refs/tags/* is in the refspec (and then it's > not really "following", we're just getting all the tags). Reading your email, I see no reason not to do it, and that snippet from protocol-capabilities.txt makes me feel even better about doing so. I'd be happy to have Jacob's patch picked up in the meantime, but I think that this is a good direction to pursue. Thanks, Taylor ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs 2021-01-20 15:02 ` Taylor Blau @ 2021-01-20 16:32 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 13+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-01-20 16:32 UTC (permalink / raw) To: Taylor Blau; +Cc: Jacob Vosmaer, git, Jeff King On Wed, Jan 20 2021, Taylor Blau wrote: > On Wed, Jan 20, 2021 at 09:50:19AM +0100, Ævar Arnfjörð Bjarmason wrote: >> It seems we might just be able to delete this code on the server-side, >> per protocol-capabilities.txt: >> >> Clients MUST be prepared for the case where a server has ignored >> include-tag and has not actually sent tags in the pack. In such >> cases the client SHOULD issue a subsequent fetch to acquire the tags >> that include-tag would have otherwise given the client. >> >> I.e. in the case where the server isn't playing along and I haven't set >> "+refs/tags/*:refs/tags/*". But as the test shows we don't do that >> following ourselves unless refs/tags/* is in the refspec (and then it's >> not really "following", we're just getting all the tags). > > Reading your email, I see no reason not to do it, and that snippet from > protocol-capabilities.txt makes me feel even better about doing so. > > I'd be happy to have Jacob's patch picked up in the meantime, but I > think that this is a good direction to pursue. Oh yes, none of this is commentary on not accepting the much smaller and obviously correct change in the meantime. Just musings this general area of fetching & ideas for further optimization. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs 2021-01-20 8:50 ` Ævar Arnfjörð Bjarmason 2021-01-20 15:02 ` Taylor Blau @ 2021-01-20 19:53 ` Jeff King 2021-01-21 10:26 ` Ævar Arnfjörð Bjarmason 1 sibling, 1 reply; 13+ messages in thread From: Jeff King @ 2021-01-20 19:53 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Jacob Vosmaer, git, Taylor Blau On Wed, Jan 20, 2021 at 09:50:19AM +0100, Ævar Arnfjörð Bjarmason wrote: > To elaborate on other things that aren't really your problem & Taylor's > E-Mail downthread we originally added this because: > > If new annotated tags have been introduced then we can also include > them in the packfile, saving the client from needing to request them > through a second connection. > > I've barked up this whole tag fetch tree before 97716d217c (fetch: add a > --prune-tags option and fetch.pruneTags config, 2018-02-09) but really > not this specific area. > > I wonder if longer term simply moving this to be work the client does > wouldn't make more sense. I.e. if we just delete this for_each_ref() > loop. > > We're just saving a client from saying they "want" a tag. I.e. with the > whole thing removed we need this to make t5503-tagfollow.sh pass (see > [1] at the end for the dialog, the tag is 3253df4d...): Isn't this an ordering problem, though? The client cannot mention auto-followed tags in a "want", because they first need to "want" the main history, receive the pack, and then realize they want the others. So we are not just saving the client from sending a "want", but making a second full connection to do so. That seems to be an optimization worth continuing to have. But here... > diff --git a/t/t5503-tagfollow.sh b/t/t5503-tagfollow.sh > index 6041a4dd32..1ddc430aef 100755 > --- a/t/t5503-tagfollow.sh > +++ b/t/t5503-tagfollow.sh > @@ -134,6 +134,7 @@ test_expect_success 'fetch B, S (commit and tag : 1 connection)' ' > test_expect_success 'setup expect' ' > cat - <<EOF >expect > want $B > +want $T > want $S > EOF > ' > @@ -146,6 +147,7 @@ test_expect_success 'new clone fetch master and tags' ' > cd clone2 && > git init && > git remote add origin .. && > + git config --add remote.origin.fetch "+refs/tags/*:refs/tags/*" && > GIT_TRACE_PACKET=$UPATH git fetch && > test $B = $(git rev-parse --verify origin/master) && > test $S = $(git rev-parse --verify tag2) && > > We're also saving the client the work of having to go through > refs/tags/* and figure out whether there are tags there that aren't on > our main history. You seem to be against auto-following at all. And certainly I can see an argument that it is not worth the trouble it causes. But it is the default behavior, and I suspect many people are relying on it. Fetching every tag indiscriminately is going to grab a bunch of extra unwanted objects in some repos. An obvious case is any time "clone --single-branch --depth" is used. Maybe I'm not quite understanding what you're proposing. -Peff ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs 2021-01-20 19:53 ` Jeff King @ 2021-01-21 10:26 ` Ævar Arnfjörð Bjarmason 2021-01-21 15:34 ` Jeff King 0 siblings, 1 reply; 13+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-01-21 10:26 UTC (permalink / raw) To: Jeff King; +Cc: Jacob Vosmaer, git, Taylor Blau On Wed, Jan 20 2021, Jeff King wrote: > On Wed, Jan 20, 2021 at 09:50:19AM +0100, Ævar Arnfjörð Bjarmason wrote: > >> To elaborate on other things that aren't really your problem & Taylor's >> E-Mail downthread we originally added this because: >> >> If new annotated tags have been introduced then we can also include >> them in the packfile, saving the client from needing to request them >> through a second connection. >> >> I've barked up this whole tag fetch tree before 97716d217c (fetch: add a >> --prune-tags option and fetch.pruneTags config, 2018-02-09) but really >> not this specific area. >> >> I wonder if longer term simply moving this to be work the client does >> wouldn't make more sense. I.e. if we just delete this for_each_ref() >> loop. >> >> We're just saving a client from saying they "want" a tag. I.e. with the >> whole thing removed we need this to make t5503-tagfollow.sh pass (see >> [1] at the end for the dialog, the tag is 3253df4d...): > > Isn't this an ordering problem, though? The client cannot mention > auto-followed tags in a "want", because they first need to "want" the > main history, receive the pack, and then realize they want the others. > > So we are not just saving the client from sending a "want", but making a > second full connection to do so. That seems to be an optimization worth > continuing to have. Correct. Suppose a history COMMIT(TAG) history like: mainline: A(X) -> B(Y) -> C topic: A(X) -> B(Y) -> D(Z) You only want the "mainline" history and its tags in your fetch. Now a server will give you tags X & Y for free, ignoring Z. The note in protocol-capabilities.txt supposes that you'll need to only get A/B/C in one fetch, then do another one where you see from the OIDs you have and advertised (peeled) OIDs for the tags and know you just need X/Y, not Z. So we could also just fetch Z, then do our own walk on the client to see if it's reachable, and throw it away if not. Although I suppose that'll need a list of "bad" tags on the client so we don't repeat the process the next time around, hrm... >> diff --git a/t/t5503-tagfollow.sh b/t/t5503-tagfollow.sh >> index 6041a4dd32..1ddc430aef 100755 >> --- a/t/t5503-tagfollow.sh >> +++ b/t/t5503-tagfollow.sh >> @@ -134,6 +134,7 @@ test_expect_success 'fetch B, S (commit and tag : 1 connection)' ' >> test_expect_success 'setup expect' ' >> cat - <<EOF >expect >> want $B >> +want $T >> want $S >> EOF >> ' >> @@ -146,6 +147,7 @@ test_expect_success 'new clone fetch master and tags' ' >> cd clone2 && >> git init && >> git remote add origin .. && >> + git config --add remote.origin.fetch "+refs/tags/*:refs/tags/*" && >> GIT_TRACE_PACKET=$UPATH git fetch && >> test $B = $(git rev-parse --verify origin/master) && >> test $S = $(git rev-parse --verify tag2) && >> >> We're also saving the client the work of having to go through >> refs/tags/* and figure out whether there are tags there that aren't on >> our main history. > > You seem to be against auto-following at all. And certainly I can see an > argument that it is not worth the trouble it causes. But it is the > default behavior, and I suspect many people are relying on it. Fetching > every tag indiscriminately is going to grab a bunch of extra unwanted > objects in some repos. An obvious case is any time "clone > --single-branch --depth" is used. I wonder if the use-cases for --depth in the wild aren't 99.9% --depth=1, in which case the peeled commit on the main branch being advertised by a tag would alredy give you this information. I.e. in the common case you don't get a tag, but if you happen to land on a release you can see that the commit at the tip is tagged. I wonder if doing that shortcut already in the client (so not send include-tag) is a useful micro-optimization. > Maybe I'm not quite understanding what you're proposing. Not much really, just that looking deeper into this area might be a useful exercise. I.e. it's a case of server<->client cooperation where we offlod the work to one or the other based on an old design decision, maybe that trade-off is not optimal in most cases anymore. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/1] builtin/pack-objects.c: avoid iterating all refs 2021-01-21 10:26 ` Ævar Arnfjörð Bjarmason @ 2021-01-21 15:34 ` Jeff King 0 siblings, 0 replies; 13+ messages in thread From: Jeff King @ 2021-01-21 15:34 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Jacob Vosmaer, git, Taylor Blau On Thu, Jan 21, 2021 at 11:26:13AM +0100, Ævar Arnfjörð Bjarmason wrote: > > So we are not just saving the client from sending a "want", but making a > > second full connection to do so. That seems to be an optimization worth > > continuing to have. > > Correct. Suppose a history COMMIT(TAG) history like: > > mainline: A(X) -> B(Y) -> C > topic: A(X) -> B(Y) -> D(Z) > > You only want the "mainline" history and its tags in your fetch. Now a > server will give you tags X & Y for free, ignoring Z. > > The note in protocol-capabilities.txt supposes that you'll need to only > get A/B/C in one fetch, then do another one where you see from the OIDs > you have and advertised (peeled) OIDs for the tags and know you just > need X/Y, not Z. Right. That note basically indicates that what the server is doing is "best effort". If it sends you B, it will also send you Y. But it is ultimately up to the client to decide if they want or need a tag the server didn't send and do the follow-up. So include-tag is an optimization, but it works often enough that it frequently saves the second fetch from happening at all. > So we could also just fetch Z, then do our own walk on the client to see > if it's reachable, and throw it away if not. Although I suppose that'll > need a list of "bad" tags on the client so we don't repeat the process > the next time around, hrm... Not only that, but now you've fetched "D" that you never wanted. If it's one commit, that's not so bad. But it could be pulling in arbitrary amounts of reachable history that you don't want. > > You seem to be against auto-following at all. And certainly I can see an > > argument that it is not worth the trouble it causes. But it is the > > default behavior, and I suspect many people are relying on it. Fetching > > every tag indiscriminately is going to grab a bunch of extra unwanted > > objects in some repos. An obvious case is any time "clone > > --single-branch --depth" is used. > > I wonder if the use-cases for --depth in the wild aren't 99.9% > --depth=1, in which case the peeled commit on the main branch being > advertised by a tag would alredy give you this information. I'd have thought there was basically no use for anything but --depth=1, but people seem really enamored with the notion of --depth=50. Many CI systems use that for some reason. --depth is just one case where you might not have all of history, though. Even without it, --single-branch means you wouldn't want to get the history of all of the other branches. They may only be a few commits ahead of the main history, in which case fetching extra tags that point to them might not be a big deal. But it really depends on how your repo is shaped, what tags point to, etc. Even without any options, your repo may have a disjoint chunk of history mentioned by a tag (e.g., a tag pointing to a segment of history that is meant to be grafted on, or even an archived version of history from before filter-branch rewrite). Perhaps if we were designing from scratch we might write those off as unusual cases we don't need to care about. But I'd be very hesitant to change the way tag following works now, after so many years. If we were to make any change, I think one plausible one would be to have clone set up refspecs to fetch all tags (and then avoid sending include-tag, since we know we're not auto-following). That would let people still use auto-follow when they wanted to, but make things simpler to reason about. And after all, clone already fetches all of the tags. But I haven't thought too hard about it. > > Maybe I'm not quite understanding what you're proposing. > > Not much really, just that looking deeper into this area might be a > useful exercise. I.e. it's a case of server<->client cooperation where > we offlod the work to one or the other based on an old design decision, > maybe that trade-off is not optimal in most cases anymore. I don't think it's very much work on the server, though. Sure, iterating the tags is O(nr_tags). But we do that iteration lots of other places, too (the advertisement unless the client asks for a very narrow prefix, or upload-pack's ref-marking for reachability). And saving us from sending unwanted objects is a potential win on the server (even a single unlucky delta where we have to reconstruct and re-deflate an object will likely blow a ref iteration out of the water). Likewise, preventing the client from connecting a second time is a win for the server, which doesn't have to spin up a new upload-pack (likewise for the client of course; it might even have to prompt the user for auth again!). -Peff ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/1] builtin/pack-objects.c: avoid iterating all refs 2021-01-19 14:33 [PATCH 0/1] builtin/pack-objects.c: avoid iterating all refs Jacob Vosmaer 2021-01-19 14:33 ` [PATCH 1/1] " Jacob Vosmaer @ 2021-01-19 23:26 ` Jeff King 1 sibling, 0 replies; 13+ messages in thread From: Jeff King @ 2021-01-19 23:26 UTC (permalink / raw) To: Jacob Vosmaer; +Cc: git On Tue, Jan 19, 2021 at 03:33:47PM +0100, Jacob Vosmaer wrote: > What I learned is that by default, a fetch ends up using the > '--include-tag' command line option of git-pack-objects. This causes > git-pack-objects to iterate through all the tags of the repository to > see if any should be included in the pack because they point to packed > objects. The problem is that this "iterate through all the tags" uses > for_each_ref which iterates through all references in the repository, > and in doing so loads each associated object into memory to check if > the ref is broken. But all we need for '--include-tag' is to iterate > through refs/tags/. > > On the repo we were testing this on, there are about > 500,000 refs but only 2,000 tags. So we had to load a lot of objects > just for the sake of '--include-tag'. It was common to see more than > half the CPU time in git-pack-objects being spent in do_for_each_ref, > and that in turn was dominated by ref_resolves_to_object. Some of these details may be useful in the commit message, too. :) Your "load a lot of objects" had me worried for a moment. We try hard not to load objects during such an iteration, even when peeling them (because the packed-refs format has a magic shortcut there). But I think that is all working as intended. What you were seeing was just tons of has_object_file() to make sure the ref was not corrupt (so finding the entry in a packfile, but not actually inflating the object contents). Arguably both upload-pack and pack-objects could use the INCLUDE_BROKEN flag to avoid even checking this. We'd notice the problem when somebody actually tried to fetch the object in question. That would speed things up further on top of your patch, because we wouldn't need to check the existence of even the tags. But it's definitely orthogonal, and should be considered separately. -Peff ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2021-01-21 15:39 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-01-19 14:33 [PATCH 0/1] builtin/pack-objects.c: avoid iterating all refs Jacob Vosmaer 2021-01-19 14:33 ` [PATCH 1/1] " Jacob Vosmaer 2021-01-19 23:08 ` Taylor Blau 2021-01-19 23:33 ` Jeff King 2021-01-19 23:54 ` Taylor Blau 2021-01-19 23:15 ` Jeff King 2021-01-20 8:50 ` Ævar Arnfjörð Bjarmason 2021-01-20 15:02 ` Taylor Blau 2021-01-20 16:32 ` Ævar Arnfjörð Bjarmason 2021-01-20 19:53 ` Jeff King 2021-01-21 10:26 ` Ævar Arnfjörð Bjarmason 2021-01-21 15:34 ` Jeff King 2021-01-19 23:26 ` [PATCH 0/1] " Jeff King
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).