git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] name-rev: prefer shorter names over following merges
@ 2021-11-13  0:14 Elijah Newren via GitGitGadget
  2021-11-13  9:42 ` Ævar Arnfjörð Bjarmason
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-11-13  0:14 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

name-rev has a MERGE_TRAVERSAL_WEIGHT to say that traversing a second or
later parent of a merge should be 65535 times more expensive than a
first-parent traversal, as per ac076c29ae8d (name-rev: Fix non-shortest
description, 2007-08-27).  The point of this weight is to prefer names
like

    v2.32.0~1471^2

over names like

    v2.32.0~43^2~15^2~11^2~20^2~31^2

which are two equally valid names in git.git for the same commit.  Note
that the first follows 1472 parent traversals compared to a mere 125 for
the second.  Weighting all traversals equally would clearly prefer the
second name since it has fewer parent traversals, but humans aren't
going to be traversing commits and they tend to have an easier time
digesting names with fewer segments.  The fact that the former only has
two segments (~1471, ^2) makes it much simpler than the latter which has
six segments (~43, ^2, ~15, etc.).  Since name-rev is meant to "find
symbolic names suitable for human digestion", we prefer fewer segments.

However, the particular rule implemented in name-rev would actually
prefer

    v2.33.0-rc0~11^2~1

over

    v2.33.0-rc0~20^2

because both have precisely one second parent traversal, and it gives
the tie breaker to shortest number of total parent traversals.  Fewer
segments is more important for human consumption than number of hops, so
we'd rather see the latter which has one fewer segment.

Include the generation in is_better_name() and use a new
effective_distance() calculation so that we prefer fewer segments in
the printed name over fewer total parent traversals performed to get the
answer.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    name-rev: prefer shorter names over following merges

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1119%2Fnewren%2Fprefer-shorter-names-in-name-rev-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1119/newren/prefer-shorter-names-in-name-rev-v1
Pull-Request: https://github.com/git/git/pull/1119

 builtin/name-rev.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/builtin/name-rev.c b/builtin/name-rev.c
index b221d300147..27f60153a6c 100644
--- a/builtin/name-rev.c
+++ b/builtin/name-rev.c
@@ -44,11 +44,20 @@ static struct rev_name *get_commit_rev_name(const struct commit *commit)
 	return is_valid_rev_name(name) ? name : NULL;
 }
 
+static int effective_distance(int distance, int generation)
+{
+	return distance + (generation > 0 ? MERGE_TRAVERSAL_WEIGHT : 0);
+}
+
 static int is_better_name(struct rev_name *name,
 			  timestamp_t taggerdate,
+			  int generation,
 			  int distance,
 			  int from_tag)
 {
+	int name_distance = effective_distance(name->distance, name->generation);
+	int new_distance = effective_distance(distance, generation);
+
 	/*
 	 * When comparing names based on tags, prefer names
 	 * based on the older tag, even if it is farther away.
@@ -56,7 +65,7 @@ static int is_better_name(struct rev_name *name,
 	if (from_tag && name->from_tag)
 		return (name->taggerdate > taggerdate ||
 			(name->taggerdate == taggerdate &&
-			 name->distance > distance));
+			 name_distance > new_distance));
 
 	/*
 	 * We know that at least one of them is a non-tag at this point.
@@ -69,8 +78,8 @@ static int is_better_name(struct rev_name *name,
 	 * We are now looking at two non-tags.  Tiebreak to favor
 	 * shorter hops.
 	 */
-	if (name->distance != distance)
-		return name->distance > distance;
+	if (name_distance != new_distance)
+		return name_distance > new_distance;
 
 	/* ... or tiebreak to favor older date */
 	if (name->taggerdate != taggerdate)
@@ -88,7 +97,7 @@ static struct rev_name *create_or_update_name(struct commit *commit,
 	struct rev_name *name = commit_rev_name_at(&rev_names, commit);
 
 	if (is_valid_rev_name(name)) {
-		if (!is_better_name(name, taggerdate, distance, from_tag))
+		if (!is_better_name(name, taggerdate, generation, distance, from_tag))
 			return NULL;
 
 		/*

base-commit: 9d530dc0024503ab4218fe6c4395b8a0aa245478
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] name-rev: prefer shorter names over following merges
  2021-11-13  0:14 [PATCH] name-rev: prefer shorter names over following merges Elijah Newren via GitGitGadget
@ 2021-11-13  9:42 ` Ævar Arnfjörð Bjarmason
  2021-11-13 19:35   ` Elijah Newren
  2021-11-16 12:13 ` Johannes Schindelin
  2021-12-04  5:35 ` [PATCH v2] " Elijah Newren via GitGitGadget
  2 siblings, 1 reply; 7+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-13  9:42 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren


On Sat, Nov 13 2021, Elijah Newren via GitGitGadget wrote:

> From: Elijah Newren <newren@gmail.com>
>
> name-rev has a MERGE_TRAVERSAL_WEIGHT to say that traversing a second or
> later parent of a merge should be 65535 times more expensive than a
> first-parent traversal, as per ac076c29ae8d (name-rev: Fix non-shortest
> description, 2007-08-27).  The point of this weight is to prefer names
> like
>
>     v2.32.0~1471^2
>
> over names like
>
>     v2.32.0~43^2~15^2~11^2~20^2~31^2
>
> which are two equally valid names in git.git for the same commit.  Note
> that the first follows 1472 parent traversals compared to a mere 125 for
> the second.  Weighting all traversals equally would clearly prefer the
> second name since it has fewer parent traversals, but humans aren't
> going to be traversing commits and they tend to have an easier time
> digesting names with fewer segments.  The fact that the former only has
> two segments (~1471, ^2) makes it much simpler than the latter which has
> six segments (~43, ^2, ~15, etc.).  Since name-rev is meant to "find
> symbolic names suitable for human digestion", we prefer fewer segments.
>
> However, the particular rule implemented in name-rev would actually
> prefer
>
>     v2.33.0-rc0~11^2~1
>
> over
>
>     v2.33.0-rc0~20^2
>
> because both have precisely one second parent traversal, and it gives
> the tie breaker to shortest number of total parent traversals.  Fewer
> segments is more important for human consumption than number of hops, so
> we'd rather see the latter which has one fewer segment.
>
> Include the generation in is_better_name() and use a new
> effective_distance() calculation so that we prefer fewer segments in
> the printed name over fewer total parent traversals performed to get the
> answer.

So it's the case that if you were to print out the output of "git log
--graph --oneline" for these ranges and draw the path we'd take for
either variant of these examples we'd take different path, i.e. the
first version takes the ^1 parent every time, and the second traverses
various ^2 parents.

I think this change looks good in the context of name-rev, but I wonder
if longer names wouldn't be easier to understand in some cases,
i.e. you'd more clearly get an idea of how this change was tangled up
with other topics.

Does one or the other of these versions provide a better hint for the
"this branch is used by" relationships in What's Cooking?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] name-rev: prefer shorter names over following merges
  2021-11-13  9:42 ` Ævar Arnfjörð Bjarmason
@ 2021-11-13 19:35   ` Elijah Newren
  0 siblings, 0 replies; 7+ messages in thread
From: Elijah Newren @ 2021-11-13 19:35 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Elijah Newren via GitGitGadget, Git Mailing List

Hi Ævar,

On Sat, Nov 13, 2021 at 1:54 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Sat, Nov 13 2021, Elijah Newren via GitGitGadget wrote:
>
> > From: Elijah Newren <newren@gmail.com>
> >
> > name-rev has a MERGE_TRAVERSAL_WEIGHT to say that traversing a second or
> > later parent of a merge should be 65535 times more expensive than a
> > first-parent traversal, as per ac076c29ae8d (name-rev: Fix non-shortest
> > description, 2007-08-27).  The point of this weight is to prefer names
> > like
> >
> >     v2.32.0~1471^2
> >
> > over names like
> >
> >     v2.32.0~43^2~15^2~11^2~20^2~31^2
> >
> > which are two equally valid names in git.git for the same commit.  Note
> > that the first follows 1472 parent traversals compared to a mere 125 for
> > the second.  Weighting all traversals equally would clearly prefer the
> > second name since it has fewer parent traversals, but humans aren't
> > going to be traversing commits and they tend to have an easier time
> > digesting names with fewer segments.  The fact that the former only has
> > two segments (~1471, ^2) makes it much simpler than the latter which has
> > six segments (~43, ^2, ~15, etc.).  Since name-rev is meant to "find
> > symbolic names suitable for human digestion", we prefer fewer segments.
> >
> > However, the particular rule implemented in name-rev would actually
> > prefer
> >
> >     v2.33.0-rc0~11^2~1
> >
> > over
> >
> >     v2.33.0-rc0~20^2
> >
> > because both have precisely one second parent traversal, and it gives
> > the tie breaker to shortest number of total parent traversals.  Fewer
> > segments is more important for human consumption than number of hops, so
> > we'd rather see the latter which has one fewer segment.

Since you bring up semantic meaning of the names with more second
parents (which I'll address below), let me add an aside here on
semantic meaning of the names that prefer first-parent traversals:

In addition to shorter names just being easier for humans to consume,
for many repositories the first-parent-traversal-preferred names also
answer the question: "How or when did this commit get merged?"

For example, let's look at commit cbdca289fb -- the second to last
commit of en/ort-perf-batch-11.

Prior to my changes:

* `git name-rev cbdca289fb` used to report tags/v2.33.0-rc0~57^2~5
* v2.33.0-rc0~57^2 is en/ort-perf-batch-12
* v2.33.0-rc0~57 is "Merge branch 'en/ort-perf-batch-12'"
* None of this tells us when cbdca289fb was merged into master.

After my changes:

* `git name-rev cbdca289fb` reports tags/v2.33.0-rc0~112^2~1 [*]
* v2.33.0-rc0~112^2 is the tip of en/ort-perf-batch-11
* v2.33.0-rc0~112 is "Merge branch 'en/ort-perf-batch-11'"
* The commit above identifies exactly when cbdca289fb was merged into master.

Granted, this semantic meaning isn't guaranteed since it is possible
for multiple folks to merge each other's work.  So I didn't mention
this auxiliary advantage in my commit message, instead just focusing
on the primary advantage of "easier for human consumption".  However,
for many common repositories (I can only think of a couple
counter-examples out of the hundreds of repositories I have worked
with), the shorter names do have this extra semantic meaning.


[*] If you're curious why my changes provide this alternate name
despite having the same number of segments as the old name: (a)
cbdca289fb is the parent of 25e65b6dd5 -- thus the name for cbdca289fb
is the first parent of the preferred name for 25e65b6dd5, (b)
25e65b6dd5 could be named either v2.33.0-rc0~112^2 or
v2.33.0-rc0~57^2~4, but the former is preferred over the latter due to
fewer segments, (c) combine the two previous facts, and we get the
name "v2.33.0-rc0~112^2~1" rather than "v2.33.0-rc0~57^2~5".

> >
> > Include the generation in is_better_name() and use a new
> > effective_distance() calculation so that we prefer fewer segments in
> > the printed name over fewer total parent traversals performed to get the
> > answer.
>
> So it's the case that if you were to print out the output of "git log
> --graph --oneline" for these ranges and draw the path we'd take for
> either variant of these examples we'd take different path, i.e. the
> first version takes the ^1 parent every time, and the second traverses
> various ^2 parents.

Not quite, no.  The shorter names do not exclude second parent
traversals (both of the shorter names had one second parent traversal
at the end, because the commits in question are unreachable solely by
first parent traversals).  But I think your intended point was that
the shorter names _prefer_ first parent traversals when possible.
That is certainly true.

> I think this change looks good in the context of name-rev

Awesome, thanks for reviewing.  :-)

>, but I wonder
> if longer names wouldn't be easier to understand in some cases,
> i.e. you'd more clearly get an idea of how this change was tangled up
> with other topics.

Interesting thought, but no.  You'd have to perform a much different
kind of search to find that info.

I think these longer names following more second parents only give you
a quasi-random selection of topics and what commit was "recently
stable as a base" at the time they were applied.  We can see this by
using the example from the commit message; the name
"v2.32.0~43^2~15^2~11^2~20^2~31^2" gives us five chances to look for
tangling or dependencies.  Let's see how many of them show some:

== 1st example ==

$ git log -1 --format='%h,%an,%s' v2.32.0~43^2~15^2~11^2~20^2~31^2
062a309d36,René Scharfe,remote-curl: use argv_array in parse_push()

Now let's trim off the "~31^2" portion of the above and we see:

$ git log -1 --format='%h,%an,%s' v2.32.0~43^2~15^2~11^2~20^2
63020f175f,Derrick Stolee,commit-graph: prefer default size_mult when given zero

Was Stolee's patch based on Rene's?  No, just following its first parent we see:

$ git log -1 --format='%h,%an,%s' v2.32.0~43^2~15^2~11^2~20^2~1
53a06cf39b,Johannes Schindelin,Git 2.24.1


== 2nd example ==

Trimming off the trailing "~20^2" portion from the name for Stolee's
commit, we see:

$ git log -1 --format='%h,%an,%s' v2.32.0~43^2~15^2~11^2
70e24186c0,Elijah Newren,t6022, t6046: fix flaky files-are-updated checks

That was not tangled or dependent upon Stolee's commit, rather it was a
6-patch series that just happened to be rooted at:

$ git log -1 --format='%h,%an,%s' v2.32.0~43^2~15^2~11^2~6
d0654dc308,Junio C Hamano,Git 2.25


== 3rd example ==

Trimming off the trailing "~11^2" portion from the name for my commit, we see:

$ git log -1 --format='%h,%an,%s' v2.32.0~43^2~15^2
173cb08d5b,Carlo Marcelo Arenas Belón,bisect: avoid tailing CR
characters from revision in replay

But that commit wasn't tangled with or dependent on mine.  It was a
2-patch series based on:

$ git log -1 --format='%h,%an,%s' v2.32.0~43^2~15^2~2
af6b65d45e,Jonathan Nieder,Git 2.26.2


== 4th example ==

Trimming off the trailing "~15^2" portion from the name for Carlo's
commit, we see:

$ git log -1 --format='%h,%an,%s' v2.32.0~43^2
3a7f0908b6,Matheus Tavares,clean: remove unnecessary variable

That series didn't depend on Carlo's commit; it was a single
standalone patch that depended on

$ git log -1 --format='%h,%an,%s' v2.32.0~43^2~1
6ff7f46039,Johannes Schindelin,Git 2.27.1


== 5th example ==

Trimming off the trailing "~43^2" portion from the name for Matheus's
commit, we see:

$ git log -1 --format='%h,%an,%s' v2.32.0
ebf3c04b26,Junio C Hamano,Git 2.32

Pretty obvious.  Should be clear this didn't depend on Matheus's
commit; in fact, it was just a commit on top of:

$ git log -1 --format='%h,%an,%s' v2.32.0~1
15664a5f35,Junio C Hamano,Merge tag 'l10n-2.32.0-rnd1.1' of
git://github.com/git-l10n/git-po


So 0 of the 5 showed any kind of tangling or dependencies.  This
commit dates back nearly to the 2.24.1 era and we've had lots of
topics that depended on other topics in the meantime, but it clearly
missed all of them.  I see it as five random topics between v2.32.0
and v2.24.1.  The odds of those hitting a tangled topic are slim since
most of our topics are standalone; but even if it does hit one, it's
almost certainly going to miss all the other dependent topics between
the commit and relevant tag used as the basis of the name.

> Does one or the other of these versions provide a better hint for the
> "this branch is used by" relationships in What's Cooking?

It can, but it's rather unlikely to be very helpful.  Most of the time
it won't provide any hint, and even if it does provide an interesting
name that suggests there might be a relationship, you'd have to double
check it to make sure it's not a false positive (as per the five
examples above).  But, let's put it to the test.  I had to do two
things to try this out:

* name-rev normally fails on topics in "What's Cooking" becasue none
of them are tagged yet, but we can force the issue by telling name-rev
to use "seen" for naming things.
* I modified builtin/name-rev.c to change the MERGE_TRAVERSAL_WEIGHT
from 65535 to 1 (so that shortest distance is preferred rather than
fewest name segments, causing it to be much more likely to follow
second parents).

With those changes, let's look at all three examples from the most
recent "What's cooking":

1. ld/sparse-diff-blame uses vd/sparse-reset

$ git name-rev --refs=refs/remotes/origin/seen gitster/ld/sparse-diff-blame
gitster/ld/sparse-diff-blame remotes/origin/seen~6^2
$ git name-rev --refs=refs/remotes/origin/seen gitster/vd/sparse-reset
gitster/vd/sparse-reset remotes/origin/seen~7^2

Doesn't hint at the relationship at all.

2. ns/batched-fsync uses ns/tmp-objdir

$ git name-rev --refs=refs/remotes/origin/seen gitster/ns/batched-fsync
gitster/ns/batched-fsync remotes/origin/seen~50^2
$ git name-rev --refs=refs/remotes/origin/seen gitster/ns/tmp-objdir
gitster/ns/tmp-objdir remotes/origin/seen~8^2~7^2

Doesn't hint at the relationship between the two at all.

3. ns/remerge-diff uses ns/tmp-objdir

$ git name-rev --refs=refs/remotes/origin/seen gitster/ns/remerge-diff
gitster/ns/remerge-diff remotes/origin/seen~8^2
$ git name-rev --refs=refs/remotes/origin/seen gitster/ns/tmp-objdir
gitster/ns/tmp-objdir remotes/origin/seen~8^2~7^2

Ahah, there we actually see what might be a relationship.  As noted
above, though, one would have to first check whether ns/remerge-diff
was actually a 7 patch series (turns out that it is) in order to
verify that one is based directly on the other; otherwise it's just a
false positive like the five we looked at above.


I also repeated this with the Sept. 20 "What's cooking email".  In
that case, only 1 out of 6 of the generated names hinted at a
relationship.  That's few enough to not be useful, especially when
there's also risk of false positives that would need to be double
checked by a different means anyway.  So, in short, the longer names
that follow second parents generally don't seem to provide any useful
semantic meaning I can think of beyond "what's the fewest parent
traversals I can use from my base tag to get to the commit in
question".  If you can think of one, perhaps it would be useful to add
a flag to request the alternate name.  I just can't think of any.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] name-rev: prefer shorter names over following merges
  2021-11-13  0:14 [PATCH] name-rev: prefer shorter names over following merges Elijah Newren via GitGitGadget
  2021-11-13  9:42 ` Ævar Arnfjörð Bjarmason
@ 2021-11-16 12:13 ` Johannes Schindelin
  2021-11-17  8:49   ` Junio C Hamano
  2021-12-04  5:35 ` [PATCH v2] " Elijah Newren via GitGitGadget
  2 siblings, 1 reply; 7+ messages in thread
From: Johannes Schindelin @ 2021-11-16 12:13 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren, Elijah Newren

Hi Elijah,

On Sat, 13 Nov 2021, Elijah Newren via GitGitGadget wrote:

> From: Elijah Newren <newren@gmail.com>
>
> name-rev has a MERGE_TRAVERSAL_WEIGHT to say that traversing a second or
> later parent of a merge should be 65535 times more expensive than a
> first-parent traversal, as per ac076c29ae8d (name-rev: Fix non-shortest
> description, 2007-08-27).  The point of this weight is to prefer names
> like
>
>     v2.32.0~1471^2
>
> over names like
>
>     v2.32.0~43^2~15^2~11^2~20^2~31^2
>
> which are two equally valid names in git.git for the same commit.  Note
> that the first follows 1472 parent traversals compared to a mere 125 for
> the second.  Weighting all traversals equally would clearly prefer the
> second name since it has fewer parent traversals, but humans aren't
> going to be traversing commits and they tend to have an easier time
> digesting names with fewer segments.  The fact that the former only has
> two segments (~1471, ^2) makes it much simpler than the latter which has
> six segments (~43, ^2, ~15, etc.).  Since name-rev is meant to "find
> symbolic names suitable for human digestion", we prefer fewer segments.
>
> However, the particular rule implemented in name-rev would actually
> prefer
>
>     v2.33.0-rc0~11^2~1
>
> over
>
>     v2.33.0-rc0~20^2
>
> because both have precisely one second parent traversal, and it gives
> the tie breaker to shortest number of total parent traversals.  Fewer
> segments is more important for human consumption than number of hops, so
> we'd rather see the latter which has one fewer segment.
>
> Include the generation in is_better_name() and use a new
> effective_distance() calculation so that we prefer fewer segments in
> the printed name over fewer total parent traversals performed to get the
> answer.

Thank you. As you most likely figured out, that magic weight was
introduced by me, in ac076c29ae8 (name-rev: Fix non-shortest description,
2007-08-27). And indeed the motivation was to keep the name as short as
possible.

Technically, your solution does not fix the problem fully, as we still do
not determine the _shortest possible_ name. Having said that, I think your
patch improves the situation dramatically, so: ACK!

Thanks,
Dscho

>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>     name-rev: prefer shorter names over following merges
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1119%2Fnewren%2Fprefer-shorter-names-in-name-rev-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1119/newren/prefer-shorter-names-in-name-rev-v1
> Pull-Request: https://github.com/git/git/pull/1119
>
>  builtin/name-rev.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/builtin/name-rev.c b/builtin/name-rev.c
> index b221d300147..27f60153a6c 100644
> --- a/builtin/name-rev.c
> +++ b/builtin/name-rev.c
> @@ -44,11 +44,20 @@ static struct rev_name *get_commit_rev_name(const struct commit *commit)
>  	return is_valid_rev_name(name) ? name : NULL;
>  }
>
> +static int effective_distance(int distance, int generation)
> +{
> +	return distance + (generation > 0 ? MERGE_TRAVERSAL_WEIGHT : 0);
> +}
> +
>  static int is_better_name(struct rev_name *name,
>  			  timestamp_t taggerdate,
> +			  int generation,
>  			  int distance,
>  			  int from_tag)
>  {
> +	int name_distance = effective_distance(name->distance, name->generation);
> +	int new_distance = effective_distance(distance, generation);
> +
>  	/*
>  	 * When comparing names based on tags, prefer names
>  	 * based on the older tag, even if it is farther away.
> @@ -56,7 +65,7 @@ static int is_better_name(struct rev_name *name,
>  	if (from_tag && name->from_tag)
>  		return (name->taggerdate > taggerdate ||
>  			(name->taggerdate == taggerdate &&
> -			 name->distance > distance));
> +			 name_distance > new_distance));
>
>  	/*
>  	 * We know that at least one of them is a non-tag at this point.
> @@ -69,8 +78,8 @@ static int is_better_name(struct rev_name *name,
>  	 * We are now looking at two non-tags.  Tiebreak to favor
>  	 * shorter hops.
>  	 */
> -	if (name->distance != distance)
> -		return name->distance > distance;
> +	if (name_distance != new_distance)
> +		return name_distance > new_distance;
>
>  	/* ... or tiebreak to favor older date */
>  	if (name->taggerdate != taggerdate)
> @@ -88,7 +97,7 @@ static struct rev_name *create_or_update_name(struct commit *commit,
>  	struct rev_name *name = commit_rev_name_at(&rev_names, commit);
>
>  	if (is_valid_rev_name(name)) {
> -		if (!is_better_name(name, taggerdate, distance, from_tag))
> +		if (!is_better_name(name, taggerdate, generation, distance, from_tag))
>  			return NULL;
>
>  		/*
>
> base-commit: 9d530dc0024503ab4218fe6c4395b8a0aa245478
> --
> gitgitgadget
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] name-rev: prefer shorter names over following merges
  2021-11-16 12:13 ` Johannes Schindelin
@ 2021-11-17  8:49   ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2021-11-17  8:49 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Elijah Newren via GitGitGadget, git, Elijah Newren

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Thank you. As you most likely figured out, that magic weight was
> introduced by me, in ac076c29ae8 (name-rev: Fix non-shortest description,
> 2007-08-27). And indeed the motivation was to keep the name as short as
> possible.
>
> Technically, your solution does not fix the problem fully, as we still do
> not determine the _shortest possible_ name. Having said that, I think your
> patch improves the situation dramatically, so: ACK!

It really depends on how you define "short".  Is v1.0~11 and v1.0~99
the same length and v1.0~100 a bit longer than these two?

I wonder what happens if we counted what the proposed commit log
calls "segments" and nothing else, e.g.

    v2.32.0~1471^2			has 2 segments ("~1471", "^2")
    v2.32.0~43^2~15^2~11^2~20^2~31^2    has 10 segments

and use number of hops only for breaking ties, instead of giving a
magic weight and trying to count both hops and segments.

In any case, this seems to give us a much better results than the
current code, so let's take it and leave further futzing outside the
scope.

Thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v2] name-rev: prefer shorter names over following merges
  2021-11-13  0:14 [PATCH] name-rev: prefer shorter names over following merges Elijah Newren via GitGitGadget
  2021-11-13  9:42 ` Ævar Arnfjörð Bjarmason
  2021-11-16 12:13 ` Johannes Schindelin
@ 2021-12-04  5:35 ` Elijah Newren via GitGitGadget
  2021-12-08 11:38   ` Johannes Schindelin
  2 siblings, 1 reply; 7+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-12-04  5:35 UTC (permalink / raw)
  To: git
  Cc: Ævar Arnfjörð Bjarmason, Elijah Newren,
	Johannes Schindelin, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

name-rev has a MERGE_TRAVERSAL_WEIGHT to say that traversing a second or
later parent of a merge should be 65535 times more expensive than a
first-parent traversal, as per ac076c29ae8d (name-rev: Fix non-shortest
description, 2007-08-27).  The point of this weight is to prefer names
like

    v2.32.0~1471^2

over names like

    v2.32.0~43^2~15^2~11^2~20^2~31^2

which are two equally valid names in git.git for the same commit.  Note
that the first follows 1472 parent traversals compared to a mere 125 for
the second.  Weighting all traversals equally would clearly prefer the
second name since it has fewer parent traversals, but humans aren't
going to be traversing commits and they tend to have an easier time
digesting names with fewer segments.  The fact that the former only has
two segments (~1471, ^2) makes it much simpler than the latter which has
six segments (~43, ^2, ~15, etc.).  Since name-rev is meant to "find
symbolic names suitable for human digestion", we prefer fewer segments.

However, the particular rule implemented in name-rev would actually
prefer

    v2.33.0-rc0~11^2~1

over

    v2.33.0-rc0~20^2

because both have precisely one second parent traversal, and it gives
the tie breaker to shortest number of total parent traversals.  Fewer
segments is more important for human consumption than number of hops, so
we'd rather see the latter which has one fewer segment.

Include the generation in is_better_name() and use a new
effective_distance() calculation so that we prefer fewer segments in
the printed name over fewer total parent traversals performed to get the
answer.

== Side-note on tie-breakers ==

When there are the same number of segments for two different names, we
actually use the name of an ancestor commit as a tie-breaker as well.
For example, for the commit cbdca289fb in the git.git repository, we
prefer the name v2.33.0-rc0~112^2~1 over v2.33.0-rc0~57^2~5.  This is
because:

  * cbdca289fb is the parent of 25e65b6dd5, which implies the name for
    cbdca289fb should be the first parent of the preferred name for
    25e65b6dd5
  * 25e65b6dd5 could be named either v2.33.0-rc0~112^2 or
    v2.33.0-rc0~57^2~4, but the former is preferred over the latter due
    to fewer segments
  * combine the two previous facts, and the name we get for cbdca289fb
    is "v2.33.0-rc0~112^2~1" rather than "v2.33.0-rc0~57^2~5".

Technically, we get this for free out of the implementation since we
only keep track of one name for each commit as we walk history (and
re-add parents to the queue if we find a better name for those parents),
but the first bullet point above ensures users get results that feel
more consistent.

== Alternative Ideas and Meanings Discussed ==

One suggestion that came up during review was that shortest
string-length might be easiest for users to consume.  However, such a
scheme would be rather computationally expensive (we'd have to track all
names for each commit as we traversed the graph) and would additionally
come with the possibly perplexing result that on a linear segment of
history we could rapidly swap back and forth on names:
   MYTAG~3^2     would     be preferred over   MYTAG~9998
   MYTAG~3^2~1   would NOT be preferred over   MYTAG~9999
   MYTAG~3^2~2   might     be preferred over   MYTAG~10000

Another item that came up was possible auxiliary semantic meanings for
name-rev results either before or after this patch.  The basic answer
was that the previous implementation had no known useful auxiliary
semantics, but that for many repositories (most in my experience), the
new scheme does.  In particular, the new name-rev output can often be
used to answer the question, "How or when did this commit get merged?"
Since that usefulness depends on how merges happen within the repository
and thus isn't universally applicable, details are omitted here but you
can see them at [1].

[1] https://lore.kernel.org/git/CABPp-BEeUM+3NLKDVdak90_UUeNghYCx=Dgir6=8ixvYmvyq3Q@mail.gmail.com/

Finally, it was noted that the algorithm could be improved by just
explicitly tracking the number of segments and using both it and
distance in the comparison, instead of giving a magic number that tries
to blend the two (and which therefore might give suboptimal results in
repositories with really huge numbers of commits that periodically merge
older code).  However, "[this patch] seems to give us a much better
results than the current code, so let's take it and leave further
futzing outside the scope."

Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Acked-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
---
    name-rev: prefer shorter names over following merges
    
    Changes since v1:
    
     * Include acks from Ævar and Dscho
     * To aid future readers, added a few comments to the commit message
       from the mailing list discussion about tie-breakers and other
       possible improvements
     * The 2.35 cycle has started and it's been weeks since I sent v1, so
       it's time to resend. :-)

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1119%2Fnewren%2Fprefer-shorter-names-in-name-rev-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1119/newren/prefer-shorter-names-in-name-rev-v2
Pull-Request: https://github.com/git/git/pull/1119

Range-diff vs v1:

 1:  50812ed6fdf ! 1:  5d068486f9f name-rev: prefer shorter names over following merges
     @@ Commit message
          the printed name over fewer total parent traversals performed to get the
          answer.
      
     +    == Side-note on tie-breakers ==
     +
     +    When there are the same number of segments for two different names, we
     +    actually use the name of an ancestor commit as a tie-breaker as well.
     +    For example, for the commit cbdca289fb in the git.git repository, we
     +    prefer the name v2.33.0-rc0~112^2~1 over v2.33.0-rc0~57^2~5.  This is
     +    because:
     +
     +      * cbdca289fb is the parent of 25e65b6dd5, which implies the name for
     +        cbdca289fb should be the first parent of the preferred name for
     +        25e65b6dd5
     +      * 25e65b6dd5 could be named either v2.33.0-rc0~112^2 or
     +        v2.33.0-rc0~57^2~4, but the former is preferred over the latter due
     +        to fewer segments
     +      * combine the two previous facts, and the name we get for cbdca289fb
     +        is "v2.33.0-rc0~112^2~1" rather than "v2.33.0-rc0~57^2~5".
     +
     +    Technically, we get this for free out of the implementation since we
     +    only keep track of one name for each commit as we walk history (and
     +    re-add parents to the queue if we find a better name for those parents),
     +    but the first bullet point above ensures users get results that feel
     +    more consistent.
     +
     +    == Alternative Ideas and Meanings Discussed ==
     +
     +    One suggestion that came up during review was that shortest
     +    string-length might be easiest for users to consume.  However, such a
     +    scheme would be rather computationally expensive (we'd have to track all
     +    names for each commit as we traversed the graph) and would additionally
     +    come with the possibly perplexing result that on a linear segment of
     +    history we could rapidly swap back and forth on names:
     +       MYTAG~3^2     would     be preferred over   MYTAG~9998
     +       MYTAG~3^2~1   would NOT be preferred over   MYTAG~9999
     +       MYTAG~3^2~2   might     be preferred over   MYTAG~10000
     +
     +    Another item that came up was possible auxiliary semantic meanings for
     +    name-rev results either before or after this patch.  The basic answer
     +    was that the previous implementation had no known useful auxiliary
     +    semantics, but that for many repositories (most in my experience), the
     +    new scheme does.  In particular, the new name-rev output can often be
     +    used to answer the question, "How or when did this commit get merged?"
     +    Since that usefulness depends on how merges happen within the repository
     +    and thus isn't universally applicable, details are omitted here but you
     +    can see them at [1].
     +
     +    [1] https://lore.kernel.org/git/CABPp-BEeUM+3NLKDVdak90_UUeNghYCx=Dgir6=8ixvYmvyq3Q@mail.gmail.com/
     +
     +    Finally, it was noted that the algorithm could be improved by just
     +    explicitly tracking the number of segments and using both it and
     +    distance in the comparison, instead of giving a magic number that tries
     +    to blend the two (and which therefore might give suboptimal results in
     +    repositories with really huge numbers of commits that periodically merge
     +    older code).  However, "[this patch] seems to give us a much better
     +    results than the current code, so let's take it and leave further
     +    futzing outside the scope."
     +
          Signed-off-by: Elijah Newren <newren@gmail.com>
     +    Acked-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     +    Acked-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
      
       ## builtin/name-rev.c ##
      @@ builtin/name-rev.c: static struct rev_name *get_commit_rev_name(const struct commit *commit)


 builtin/name-rev.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/builtin/name-rev.c b/builtin/name-rev.c
index b221d300147..27f60153a6c 100644
--- a/builtin/name-rev.c
+++ b/builtin/name-rev.c
@@ -44,11 +44,20 @@ static struct rev_name *get_commit_rev_name(const struct commit *commit)
 	return is_valid_rev_name(name) ? name : NULL;
 }
 
+static int effective_distance(int distance, int generation)
+{
+	return distance + (generation > 0 ? MERGE_TRAVERSAL_WEIGHT : 0);
+}
+
 static int is_better_name(struct rev_name *name,
 			  timestamp_t taggerdate,
+			  int generation,
 			  int distance,
 			  int from_tag)
 {
+	int name_distance = effective_distance(name->distance, name->generation);
+	int new_distance = effective_distance(distance, generation);
+
 	/*
 	 * When comparing names based on tags, prefer names
 	 * based on the older tag, even if it is farther away.
@@ -56,7 +65,7 @@ static int is_better_name(struct rev_name *name,
 	if (from_tag && name->from_tag)
 		return (name->taggerdate > taggerdate ||
 			(name->taggerdate == taggerdate &&
-			 name->distance > distance));
+			 name_distance > new_distance));
 
 	/*
 	 * We know that at least one of them is a non-tag at this point.
@@ -69,8 +78,8 @@ static int is_better_name(struct rev_name *name,
 	 * We are now looking at two non-tags.  Tiebreak to favor
 	 * shorter hops.
 	 */
-	if (name->distance != distance)
-		return name->distance > distance;
+	if (name_distance != new_distance)
+		return name_distance > new_distance;
 
 	/* ... or tiebreak to favor older date */
 	if (name->taggerdate != taggerdate)
@@ -88,7 +97,7 @@ static struct rev_name *create_or_update_name(struct commit *commit,
 	struct rev_name *name = commit_rev_name_at(&rev_names, commit);
 
 	if (is_valid_rev_name(name)) {
-		if (!is_better_name(name, taggerdate, distance, from_tag))
+		if (!is_better_name(name, taggerdate, generation, distance, from_tag))
 			return NULL;
 
 		/*

base-commit: abe6bb3905392d5eb6b01fa6e54d7e784e0522aa
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] name-rev: prefer shorter names over following merges
  2021-12-04  5:35 ` [PATCH v2] " Elijah Newren via GitGitGadget
@ 2021-12-08 11:38   ` Johannes Schindelin
  0 siblings, 0 replies; 7+ messages in thread
From: Johannes Schindelin @ 2021-12-08 11:38 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren, Elijah Newren

Hi Elijah,

On Sat, 4 Dec 2021, Elijah Newren via GitGitGadget wrote:

> Range-diff vs v1:
>
>  1:  50812ed6fdf ! 1:  5d068486f9f name-rev: prefer shorter names over following merges
>      @@ Commit message
>           the printed name over fewer total parent traversals performed to get the
>           answer.
>
>      +    == Side-note on tie-breakers ==
>      +
>      +    When there are the same number of segments for two different names, we
>      +    actually use the name of an ancestor commit as a tie-breaker as well.
>      +    For example, for the commit cbdca289fb in the git.git repository, we
>      +    prefer the name v2.33.0-rc0~112^2~1 over v2.33.0-rc0~57^2~5.  This is
>      +    because:
>      +
>      +      * cbdca289fb is the parent of 25e65b6dd5, which implies the name for
>      +        cbdca289fb should be the first parent of the preferred name for
>      +        25e65b6dd5
>      +      * 25e65b6dd5 could be named either v2.33.0-rc0~112^2 or
>      +        v2.33.0-rc0~57^2~4, but the former is preferred over the latter due
>      +        to fewer segments
>      +      * combine the two previous facts, and the name we get for cbdca289fb
>      +        is "v2.33.0-rc0~112^2~1" rather than "v2.33.0-rc0~57^2~5".
>      +
>      +    Technically, we get this for free out of the implementation since we
>      +    only keep track of one name for each commit as we walk history (and
>      +    re-add parents to the queue if we find a better name for those parents),
>      +    but the first bullet point above ensures users get results that feel
>      +    more consistent.
>      +
>      +    == Alternative Ideas and Meanings Discussed ==
>      +
>      +    One suggestion that came up during review was that shortest
>      +    string-length might be easiest for users to consume.  However, such a
>      +    scheme would be rather computationally expensive (we'd have to track all
>      +    names for each commit as we traversed the graph) and would additionally
>      +    come with the possibly perplexing result that on a linear segment of
>      +    history we could rapidly swap back and forth on names:
>      +       MYTAG~3^2     would     be preferred over   MYTAG~9998
>      +       MYTAG~3^2~1   would NOT be preferred over   MYTAG~9999
>      +       MYTAG~3^2~2   might     be preferred over   MYTAG~10000
>      +
>      +    Another item that came up was possible auxiliary semantic meanings for
>      +    name-rev results either before or after this patch.  The basic answer
>      +    was that the previous implementation had no known useful auxiliary
>      +    semantics, but that for many repositories (most in my experience), the
>      +    new scheme does.  In particular, the new name-rev output can often be
>      +    used to answer the question, "How or when did this commit get merged?"
>      +    Since that usefulness depends on how merges happen within the repository
>      +    and thus isn't universally applicable, details are omitted here but you
>      +    can see them at [1].
>      +
>      +    [1] https://lore.kernel.org/git/CABPp-BEeUM+3NLKDVdak90_UUeNghYCx=Dgir6=8ixvYmvyq3Q@mail.gmail.com/
>      +
>      +    Finally, it was noted that the algorithm could be improved by just
>      +    explicitly tracking the number of segments and using both it and
>      +    distance in the comparison, instead of giving a magic number that tries
>      +    to blend the two (and which therefore might give suboptimal results in
>      +    repositories with really huge numbers of commits that periodically merge
>      +    older code).  However, "[this patch] seems to give us a much better
>      +    results than the current code, so let's take it and leave further
>      +    futzing outside the scope."
>      +
>           Signed-off-by: Elijah Newren <newren@gmail.com>

Very nice addition!

And obviously, I am still in favor of moving this patch along in the
direction of `main`.

Thank you,
Dscho

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-12-08 11:38 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-13  0:14 [PATCH] name-rev: prefer shorter names over following merges Elijah Newren via GitGitGadget
2021-11-13  9:42 ` Ævar Arnfjörð Bjarmason
2021-11-13 19:35   ` Elijah Newren
2021-11-16 12:13 ` Johannes Schindelin
2021-11-17  8:49   ` Junio C Hamano
2021-12-04  5:35 ` [PATCH v2] " Elijah Newren via GitGitGadget
2021-12-08 11:38   ` Johannes Schindelin

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).