rev-list with multiple commits sharing same patch-id

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* rev-list with multiple commits sharing same patch-id
@ 2021-01-09 16:24 Arnaud Morin
  2021-01-11  7:59 ` Arnaud Morin
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Arnaud Morin @ 2021-01-09 16:24 UTC (permalink / raw)
  To: git

Hey all,

I am struggling with a rev-list command, hope someone can give me a
clue on what is going on.

I have 2 branches:
$ git branch
  B
* master

Currently on master, there is not diff against B:
$ git diff
(empty)

The commits in B are cherry-picked from master.
Here is the graph:
$ git log --graph --oneline --all
* ae2e3c4 (origin/B, B) remove line2 and add line4 (bis)
* a7a0339 remove line4
* caa4aad restore line2
* d7dc596 remove line2 add line4
* 44bcfd4 add line3
* e372641 b
| * dbf86d8 (HEAD -> master, origin/master) remove line2 and add line4 (bis)
| * 4017282 remove line4
| * 0f2a449 restore line2
| * 8969d3f remove line2 add line4
| * e73b420 add line3
| * fe5a75a b
|/  
* 6192505 a
* b4089e1 init


However, when using git rev-list to perform a symmetric difference, git
is giving me a commit ID:
$ git rev-list --left-right --cherry-pick B...master
>dbf86d8aafc897a25a3093139b4237a62395041e

Note that this commit is not empty
$ git show dbf86d8aafc897a25a3093139b4237a62395041e --stat
commit dbf86d8aafc897a25a3093139b4237a62395041e (origin/master, master)
Author: Arnaud Morin <hidden@mail>
Date:   Sat Jan 9 10:30:10 2021 +0100

    remove line2 and add line4 (bis)

 a | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



So, from git rev-list perspective, there is a diff.

After digging a little bit, the thing is that this commit is having the
following patch-id:
$ git show dbf86d8aafc897a25a3093139b4237a62395041e | git patch-id
20f4ace68e80a751b07d78a27c94e83d6c5314bc dbf86d8aafc897a25a3093139b4237a62395041e

Which is also already existing in an other commit:
$ for c in $(git rev-list HEAD) ; do git show $c |git patch-id |grep 20f4ace68e80a751b07d78a27c94e83d6c5314bc; done
20f4ace68e80a751b07d78a27c94e83d6c5314bc dbf86d8aafc897a25a3093139b4237a62395041e
20f4ace68e80a751b07d78a27c94e83d6c5314bc 8969d3fa9159730fd3b23199873bfb26e3d20027

So, is it normal that rev-list is not able to figure out that a commit
is existing in both branch when 2 commits share the same patch-id?

Is there any way to prevent rev-list from showing this commit?

Thanks for your help.


PS. I have uploaded my test repo here:
https://gitlab.com/arnaudmorin/git-rev-list


-- 
Arnaud Morin


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rev-list with multiple commits sharing same patch-id
  2021-01-09 16:24 rev-list with multiple commits sharing same patch-id Arnaud Morin
@ 2021-01-11  7:59 ` Arnaud Morin
  2021-01-11  9:54 ` Christian Couder
  2021-01-12 14:17 ` Jeff King
  2 siblings, 0 replies; 15+ messages in thread
From: Arnaud Morin @ 2021-01-11  7:59 UTC (permalink / raw)
  To: git

Sorry, forgot to mention 2 things:
- I am using latest stable git version (2.30)
- I my previous mail, my git diff was against branch B:
$ git diff B
(empty)

Thanks

-- 
Arnaud Morin

On 09.01.21 - 16:24, Arnaud Morin wrote:
> Hey all,
> 
> I am struggling with a rev-list command, hope someone can give me a
> clue on what is going on.
> 
> I have 2 branches:
> $ git branch
>   B
> * master
> 
> Currently on master, there is not diff against B:
> $ git diff
> (empty)
> 
> The commits in B are cherry-picked from master.
> Here is the graph:
> $ git log --graph --oneline --all
> * ae2e3c4 (origin/B, B) remove line2 and add line4 (bis)
> * a7a0339 remove line4
> * caa4aad restore line2
> * d7dc596 remove line2 add line4
> * 44bcfd4 add line3
> * e372641 b
> | * dbf86d8 (HEAD -> master, origin/master) remove line2 and add line4 (bis)
> | * 4017282 remove line4
> | * 0f2a449 restore line2
> | * 8969d3f remove line2 add line4
> | * e73b420 add line3
> | * fe5a75a b
> |/  
> * 6192505 a
> * b4089e1 init
> 
> 
> However, when using git rev-list to perform a symmetric difference, git
> is giving me a commit ID:
> $ git rev-list --left-right --cherry-pick B...master
> >dbf86d8aafc897a25a3093139b4237a62395041e
> 
> Note that this commit is not empty
> $ git show dbf86d8aafc897a25a3093139b4237a62395041e --stat
> commit dbf86d8aafc897a25a3093139b4237a62395041e (origin/master, master)
> Author: Arnaud Morin <hidden@mail>
> Date:   Sat Jan 9 10:30:10 2021 +0100
> 
>     remove line2 and add line4 (bis)
> 
>  a | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> 
> 
> So, from git rev-list perspective, there is a diff.
> 
> After digging a little bit, the thing is that this commit is having the
> following patch-id:
> $ git show dbf86d8aafc897a25a3093139b4237a62395041e | git patch-id
> 20f4ace68e80a751b07d78a27c94e83d6c5314bc dbf86d8aafc897a25a3093139b4237a62395041e
> 
> Which is also already existing in an other commit:
> $ for c in $(git rev-list HEAD) ; do git show $c |git patch-id |grep 20f4ace68e80a751b07d78a27c94e83d6c5314bc; done
> 20f4ace68e80a751b07d78a27c94e83d6c5314bc dbf86d8aafc897a25a3093139b4237a62395041e
> 20f4ace68e80a751b07d78a27c94e83d6c5314bc 8969d3fa9159730fd3b23199873bfb26e3d20027
> 
> So, is it normal that rev-list is not able to figure out that a commit
> is existing in both branch when 2 commits share the same patch-id?
> 
> Is there any way to prevent rev-list from showing this commit?
> 
> Thanks for your help.
> 
> 
> PS. I have uploaded my test repo here:
> https://gitlab.com/arnaudmorin/git-rev-list
> 
> 
> -- 
> Arnaud Morin
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rev-list with multiple commits sharing same patch-id
  2021-01-09 16:24 rev-list with multiple commits sharing same patch-id Arnaud Morin
  2021-01-11  7:59 ` Arnaud Morin
@ 2021-01-11  9:54 ` Christian Couder
  2021-01-11 18:25   ` Arnaud Morin
  2021-01-12 14:17 ` Jeff King
  2 siblings, 1 reply; 15+ messages in thread
From: Christian Couder @ 2021-01-11  9:54 UTC (permalink / raw)
  To: Arnaud Morin; +Cc: git

Hi,

On Sat, Jan 9, 2021 at 5:29 PM Arnaud Morin <arnaud.morin@gmail.com> wrote:

> I am struggling with a rev-list command, hope someone can give me a
> clue on what is going on.

[...]

> However, when using git rev-list to perform a symmetric difference, git
> is giving me a commit ID:
> $ git rev-list --left-right --cherry-pick B...master
> >dbf86d8aafc897a25a3093139b4237a62395041e

It looks like a bug. You might want to check the following test file
and let us know if a similar case is tested in it, or not:

t6007-rev-list-cherry-pick-file.sh

If not, it might be a good idea to add one using 'test_expect_failure'
instead of 'test_expect_success' to document the failure.

Thanks,
Christian.


> Note that this commit is not empty
> $ git show dbf86d8aafc897a25a3093139b4237a62395041e --stat
> commit dbf86d8aafc897a25a3093139b4237a62395041e (origin/master, master)
> Author: Arnaud Morin <hidden@mail>
> Date:   Sat Jan 9 10:30:10 2021 +0100
>
>     remove line2 and add line4 (bis)
>
>  a | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
>
>
> So, from git rev-list perspective, there is a diff.
>
> After digging a little bit, the thing is that this commit is having the
> following patch-id:
> $ git show dbf86d8aafc897a25a3093139b4237a62395041e | git patch-id
> 20f4ace68e80a751b07d78a27c94e83d6c5314bc dbf86d8aafc897a25a3093139b4237a62395041e
>
> Which is also already existing in an other commit:
> $ for c in $(git rev-list HEAD) ; do git show $c |git patch-id |grep 20f4ace68e80a751b07d78a27c94e83d6c5314bc; done
> 20f4ace68e80a751b07d78a27c94e83d6c5314bc dbf86d8aafc897a25a3093139b4237a62395041e
> 20f4ace68e80a751b07d78a27c94e83d6c5314bc 8969d3fa9159730fd3b23199873bfb26e3d20027
>
> So, is it normal that rev-list is not able to figure out that a commit
> is existing in both branch when 2 commits share the same patch-id?
>
> Is there any way to prevent rev-list from showing this commit?
>
> Thanks for your help.
>
>
> PS. I have uploaded my test repo here:
> https://gitlab.com/arnaudmorin/git-rev-list
>
>
> --
> Arnaud Morin
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rev-list with multiple commits sharing same patch-id
  2021-01-11  9:54 ` Christian Couder
@ 2021-01-11 18:25   ` Arnaud Morin
  0 siblings, 0 replies; 15+ messages in thread
From: Arnaud Morin @ 2021-01-11 18:25 UTC (permalink / raw)
  To: Christian Couder; +Cc: git

Hi,

Thanks for the link to the test file.

I patched the test file to add this new case (where the same patch-id is
duplicated in the tree).
This patch can be find in [1]
I had to patch two existing tests to make them work again.
I also added a new test with this new test-case (two identical patch-id
in the same tree.)
Note that I still have one test (outside my new one) which is failing:

not ok 10 - name-rev --exclude excludes matched patterns

I dont know why yet.

Moreover, I am willing to help fixing the code.
I have the feeling that a patch needs to be done in revision.c, maybe
in static void cherry_pick_list?

If you have any clue to help me?

Thanks.


[1] https://github.com/arnaudmorin/git/pull/1


-- 
Arnaud Morin

On 11.01.21 - 10:54, Christian Couder wrote:
> Hi,
> 
> On Sat, Jan 9, 2021 at 5:29 PM Arnaud Morin <arnaud.morin@gmail.com> wrote:
> 
> > I am struggling with a rev-list command, hope someone can give me a
> > clue on what is going on.
> 
> [...]
> 
> > However, when using git rev-list to perform a symmetric difference, git
> > is giving me a commit ID:
> > $ git rev-list --left-right --cherry-pick B...master
> > >dbf86d8aafc897a25a3093139b4237a62395041e
> 
> It looks like a bug. You might want to check the following test file
> and let us know if a similar case is tested in it, or not:
> 
> t6007-rev-list-cherry-pick-file.sh
> 
> If not, it might be a good idea to add one using 'test_expect_failure'
> instead of 'test_expect_success' to document the failure.
> 
> Thanks,
> Christian.
> 
> 
> > Note that this commit is not empty
> > $ git show dbf86d8aafc897a25a3093139b4237a62395041e --stat
> > commit dbf86d8aafc897a25a3093139b4237a62395041e (origin/master, master)
> > Author: Arnaud Morin <hidden@mail>
> > Date:   Sat Jan 9 10:30:10 2021 +0100
> >
> >     remove line2 and add line4 (bis)
> >
> >  a | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >
> >
> > So, from git rev-list perspective, there is a diff.
> >
> > After digging a little bit, the thing is that this commit is having the
> > following patch-id:
> > $ git show dbf86d8aafc897a25a3093139b4237a62395041e | git patch-id
> > 20f4ace68e80a751b07d78a27c94e83d6c5314bc dbf86d8aafc897a25a3093139b4237a62395041e
> >
> > Which is also already existing in an other commit:
> > $ for c in $(git rev-list HEAD) ; do git show $c |git patch-id |grep 20f4ace68e80a751b07d78a27c94e83d6c5314bc; done
> > 20f4ace68e80a751b07d78a27c94e83d6c5314bc dbf86d8aafc897a25a3093139b4237a62395041e
> > 20f4ace68e80a751b07d78a27c94e83d6c5314bc 8969d3fa9159730fd3b23199873bfb26e3d20027
> >
> > So, is it normal that rev-list is not able to figure out that a commit
> > is existing in both branch when 2 commits share the same patch-id?
> >
> > Is there any way to prevent rev-list from showing this commit?
> >
> > Thanks for your help.
> >
> >
> > PS. I have uploaded my test repo here:
> > https://gitlab.com/arnaudmorin/git-rev-list
> >
> >
> > --
> > Arnaud Morin
> >

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rev-list with multiple commits sharing same patch-id
  2021-01-09 16:24 rev-list with multiple commits sharing same patch-id Arnaud Morin
  2021-01-11  7:59 ` Arnaud Morin
  2021-01-11  9:54 ` Christian Couder
@ 2021-01-12 14:17 ` Jeff King
  2021-01-12 15:11   ` Jeff King
  2 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2021-01-12 14:17 UTC (permalink / raw)
  To: Arnaud Morin; +Cc: git, Kevin Willford

On Sat, Jan 09, 2021 at 04:24:40PM +0000, Arnaud Morin wrote:

> After digging a little bit, the thing is that this commit is having the
> following patch-id:
> $ git show dbf86d8aafc897a25a3093139b4237a62395041e | git patch-id
> 20f4ace68e80a751b07d78a27c94e83d6c5314bc dbf86d8aafc897a25a3093139b4237a62395041e
> 
> Which is also already existing in an other commit:
> $ for c in $(git rev-list HEAD) ; do git show $c |git patch-id |grep 20f4ace68e80a751b07d78a27c94e83d6c5314bc; done
> 20f4ace68e80a751b07d78a27c94e83d6c5314bc dbf86d8aafc897a25a3093139b4237a62395041e
> 20f4ace68e80a751b07d78a27c94e83d6c5314bc 8969d3fa9159730fd3b23199873bfb26e3d20027

A slight nit first. Those two commits have the same patch-id, but in
your graph they are on the same branch ("master", not "B"):

> The commits in B are cherry-picked from master.
> Here is the graph:
> $ git log --graph --oneline --all
> * ae2e3c4 (origin/B, B) remove line2 and add line4 (bis)
> * a7a0339 remove line4
> * caa4aad restore line2
> * d7dc596 remove line2 add line4
> * 44bcfd4 add line3
> * e372641 b
> | * dbf86d8 (HEAD -> master, origin/master) remove line2 and add line4 (bis)
> | * 4017282 remove line4
> | * 0f2a449 restore line2
> | * 8969d3f remove line2 add line4
> | * e73b420 add line3
> | * fe5a75a b
> |/  
> * 6192505 a
> * b4089e1 init

And --cherry-pick is only looking for matches between the two sides of
the symmetric difference.

However, I think that patch-id does exist on the other side (it appears
again twice in fact, which is not surprising):

  $ git rev-list --all | git diff-tree --stdin -p | git patch-id | grep 20f4ace68e80
  20f4ace68e80a751b07d78a27c94e83d6c5314bc ae2e3c4754f53440cc4612d35f80d795a695c862
  20f4ace68e80a751b07d78a27c94e83d6c5314bc dbf86d8aafc897a25a3093139b4237a62395041e
  20f4ace68e80a751b07d78a27c94e83d6c5314bc d7dc596fcc34662cba35363febc846bfcab1e4be
  20f4ace68e80a751b07d78a27c94e83d6c5314bc 8969d3fa9159730fd3b23199873bfb26e3d20027

So the entries should be suppressed from both sides.

It looks like this was broken in v2.10.0, via dfb7a1b4d0 (patch-ids:
stop using a hand-rolled hashmap implementation, 2016-07-29).

I think the issue is that it is allowing duplicate entries in the
hashmap. The algorithm is something like:

  - iterate over left-hand commits, inserting patch-id for each into
    hashmap

  - iterate over right-hand commits, seeing if any are present in
    hashmap. If so, we exclude the commit _and_ mark the patch-id as
    "seen"

  - iterate again over left-hand commits, omitting any whose patch-ids
    are marked as "seen"

So if two commits on the left-hand side have the same patch-id, if we
insert two entries into the hashmap, then only one of them is going to
get its "seen" flag marked in the middle step.

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rev-list with multiple commits sharing same patch-id
  2021-01-12 14:17 ` Jeff King
@ 2021-01-12 15:11   ` Jeff King
  2021-01-12 15:34     ` Arnaud Morin
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2021-01-12 15:11 UTC (permalink / raw)
  To: Arnaud Morin; +Cc: git, Kevin Willford

On Tue, Jan 12, 2021 at 09:17:37AM -0500, Jeff King wrote:

> It looks like this was broken in v2.10.0, via dfb7a1b4d0 (patch-ids:
> stop using a hand-rolled hashmap implementation, 2016-07-29).
> 
> I think the issue is that it is allowing duplicate entries in the
> hashmap. The algorithm is something like:
> 
>   - iterate over left-hand commits, inserting patch-id for each into
>     hashmap
> 
>   - iterate over right-hand commits, seeing if any are present in
>     hashmap. If so, we exclude the commit _and_ mark the patch-id as
>     "seen"
> 
>   - iterate again over left-hand commits, omitting any whose patch-ids
>     are marked as "seen"
> 
> So if two commits on the left-hand side have the same patch-id, if we
> insert two entries into the hashmap, then only one of them is going to
> get its "seen" flag marked in the middle step.

Yeah, that's definitely it. Here's what the fix would look like directly
on top of dfb7a1b4d0:

diff --git a/patch-ids.c b/patch-ids.c
index db31fa647a..a8ed4f0872 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -66,12 +66,24 @@ struct patch_id *add_commit_patch_id(struct commit *commit,
 				     struct patch_ids *ids)
 {
 	struct patch_id *key = xcalloc(1, sizeof(*key));
+	struct patch_id *old;
 
 	if (init_patch_id_entry(key, commit, ids)) {
 		free(key);
 		return NULL;
 	}
 
+	/*
+	 * Don't allow duplicates. Note that we can't use hashmap_put here; we
+	 * must retain the _old_ struct, because some commit->util is pointing
+	 * to it.
+	 */
+	old = hashmap_get(&ids->patches, key, ids);
+	if (old) {
+		free(key);
+		return old;
+	}
+
 	hashmap_add(&ids->patches, key);
 	return key;
 }
diff --git a/t/t6007-rev-list-cherry-pick-file.sh b/t/t6007-rev-list-cherry-pick-file.sh
index 28d4f6b259..0c5f1dcc54 100755
--- a/t/t6007-rev-list-cherry-pick-file.sh
+++ b/t/t6007-rev-list-cherry-pick-file.sh
@@ -207,4 +207,16 @@ test_expect_success '--count --left-right' '
 	test_cmp expect actual
 '
 
+test_expect_success '--cherry-pick with duplicates on each side' '
+	git checkout -b dup-orig &&
+	test_commit dup-base &&
+	git revert dup-base &&
+	git cherry-pick dup-base &&
+	git checkout -b dup-side HEAD~3 &&
+	test_tick &&
+	git cherry-pick -3 dup-orig &&
+	git rev-list --cherry-pick dup-orig...dup-side >actual &&
+	test_must_be_empty actual
+'
+
 test_done


Unfortunately we can't do the same thing now. The "seen" flag went away
in 683f17ec44 (patch-ids: replace the seen indicator with a commit
pointer, 2016-07-29), in favor of having each patch_id struct point to a
"struct commit". We could revert that, or turn that pointer into a list,
but it gets worse. Later, b3dfeebb92 (rebase: avoid computing
unnecessary patch IDs, 2016-07-29) built on that by lazily computing the
patch-ids. So each patch_id struct must point to exactly one commit!

We'll have to either:

  - split them apart into two structs at the time of the lazy computation

or

  - accept having duplicate patch_ids, but on lookup make sure we find
    _all_ matches, not just the first one.

Hmm, that last one might not be too bad, since we have
hashmap_get_next() (it does mean iterating over a chained bucket
linearly, but that's already the worst-case for a hash lookup).

-Peff

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: rev-list with multiple commits sharing same patch-id
  2021-01-12 15:11   ` Jeff King
@ 2021-01-12 15:34     ` Arnaud Morin
  2021-01-12 15:52       ` [PATCH] patch-ids: handle duplicate hashmap entries Jeff King
  0 siblings, 1 reply; 15+ messages in thread
From: Arnaud Morin @ 2021-01-12 15:34 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Kevin Willford

So instead of:
id = has_commit_patch_id(commit, &ids);

We should use some sort of iterator, in order to call:
id->commit->object.flags |= cherry_flag;

for all commits which are related to this patch?



-- 
Arnaud Morin

On 12.01.21 - 10:11, Jeff King wrote:
> On Tue, Jan 12, 2021 at 09:17:37AM -0500, Jeff King wrote:
> 
> > It looks like this was broken in v2.10.0, via dfb7a1b4d0 (patch-ids:
> > stop using a hand-rolled hashmap implementation, 2016-07-29).
> > 
> > I think the issue is that it is allowing duplicate entries in the
> > hashmap. The algorithm is something like:
> > 
> >   - iterate over left-hand commits, inserting patch-id for each into
> >     hashmap
> > 
> >   - iterate over right-hand commits, seeing if any are present in
> >     hashmap. If so, we exclude the commit _and_ mark the patch-id as
> >     "seen"
> > 
> >   - iterate again over left-hand commits, omitting any whose patch-ids
> >     are marked as "seen"
> > 
> > So if two commits on the left-hand side have the same patch-id, if we
> > insert two entries into the hashmap, then only one of them is going to
> > get its "seen" flag marked in the middle step.
> 
> Yeah, that's definitely it. Here's what the fix would look like directly
> on top of dfb7a1b4d0:
> 
> diff --git a/patch-ids.c b/patch-ids.c
> index db31fa647a..a8ed4f0872 100644
> --- a/patch-ids.c
> +++ b/patch-ids.c
> @@ -66,12 +66,24 @@ struct patch_id *add_commit_patch_id(struct commit *commit,
>  				     struct patch_ids *ids)
>  {
>  	struct patch_id *key = xcalloc(1, sizeof(*key));
> +	struct patch_id *old;
>  
>  	if (init_patch_id_entry(key, commit, ids)) {
>  		free(key);
>  		return NULL;
>  	}
>  
> +	/*
> +	 * Don't allow duplicates. Note that we can't use hashmap_put here; we
> +	 * must retain the _old_ struct, because some commit->util is pointing
> +	 * to it.
> +	 */
> +	old = hashmap_get(&ids->patches, key, ids);
> +	if (old) {
> +		free(key);
> +		return old;
> +	}
> +
>  	hashmap_add(&ids->patches, key);
>  	return key;
>  }
> diff --git a/t/t6007-rev-list-cherry-pick-file.sh b/t/t6007-rev-list-cherry-pick-file.sh
> index 28d4f6b259..0c5f1dcc54 100755
> --- a/t/t6007-rev-list-cherry-pick-file.sh
> +++ b/t/t6007-rev-list-cherry-pick-file.sh
> @@ -207,4 +207,16 @@ test_expect_success '--count --left-right' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success '--cherry-pick with duplicates on each side' '
> +	git checkout -b dup-orig &&
> +	test_commit dup-base &&
> +	git revert dup-base &&
> +	git cherry-pick dup-base &&
> +	git checkout -b dup-side HEAD~3 &&
> +	test_tick &&
> +	git cherry-pick -3 dup-orig &&
> +	git rev-list --cherry-pick dup-orig...dup-side >actual &&
> +	test_must_be_empty actual
> +'
> +
>  test_done
> 
> 
> Unfortunately we can't do the same thing now. The "seen" flag went away
> in 683f17ec44 (patch-ids: replace the seen indicator with a commit
> pointer, 2016-07-29), in favor of having each patch_id struct point to a
> "struct commit". We could revert that, or turn that pointer into a list,
> but it gets worse. Later, b3dfeebb92 (rebase: avoid computing
> unnecessary patch IDs, 2016-07-29) built on that by lazily computing the
> patch-ids. So each patch_id struct must point to exactly one commit!
> 
> We'll have to either:
> 
>   - split them apart into two structs at the time of the lazy computation
> 
> or
> 
>   - accept having duplicate patch_ids, but on lookup make sure we find
>     _all_ matches, not just the first one.
> 
> Hmm, that last one might not be too bad, since we have
> hashmap_get_next() (it does mean iterating over a chained bucket
> linearly, but that's already the worst-case for a hash lookup).
> 
> -Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] patch-ids: handle duplicate hashmap entries
  2021-01-12 15:34     ` Arnaud Morin
@ 2021-01-12 15:52       ` Jeff King
  2021-01-12 16:24         ` Arnaud Morin
  2021-01-12 19:13         ` Junio C Hamano
  0 siblings, 2 replies; 15+ messages in thread
From: Jeff King @ 2021-01-12 15:52 UTC (permalink / raw)
  To: Arnaud Morin; +Cc: git, Kevin Willford

On Tue, Jan 12, 2021 at 03:34:38PM +0000, Arnaud Morin wrote:

> So instead of:
> id = has_commit_patch_id(commit, &ids);
> 
> We should use some sort of iterator, in order to call:
> id->commit->object.flags |= cherry_flag;
> 
> for all commits which are related to this patch?

Yep, exactly.

Here it is as a patch. :)

-- >8 --
Subject: [PATCH] patch-ids: handle duplicate hashmap entries

This fixes a bug introduced in dfb7a1b4d0 (patch-ids: stop using a
hand-rolled hashmap implementation, 2016-07-29) in which

  git rev-list --cherry-pick A...B

will fail to suppress commits reachable from A even if a commit with
matching patch-id appears in B.

Around the time of that commit, the algorithm for "--cherry-pick" looked
something like this:

  0. Traverse all of the commits, marking them as being on the left or
     right side of the symmetric difference.

  1. Iterate over the left-hand commits, inserting a patch-id struct for
     each into a hashmap, and pointing commit->util to the patch-id
     struct.

  2. Iterate over the right-hand commits, checking which are present in
     the hashmap. If so, we exclude the commit from the output _and_ we
     mark the patch-id as "seen".

  3. Iterate again over the left-hand commits, checking whether
     commit->util->seen is set; if so, exclude them from the output.

At the end, we'll have eliminated commits from both sides that have a
matching patch-id on the other side. But there's a subtle assumption
here: for any given patch-id, we must have exactly one struct
representing it. If two commits from A both have the same patch-id and
we allow duplicates in the hashmap, then we run into a problem:

  a. In step 1, we insert two patch-id structs into the hashmap.

  b. In step 2, our lookups will find only one of these structs, so only
     one "seen" flag is marked.

  c. In step 3, one of the commits in A will have its commit->util->seen
     set, but the other will not. We'll erroneously output the latter.

Prior to dfb7a1b4d0, our hashmap did not allow duplicates. Afterwards,
it used hashmap_add(), which explicitly does allow duplicates.

At that point, the solution would have been easy: when we are about to
add a duplicate, skip doing so and return the existing entry which
matches. But it gets more complicated.

In 683f17ec44 (patch-ids: replace the seen indicator with a commit
pointer, 2016-07-29), our step 3 goes away entirely. Instead, in step 2,
when the right-hand side finds a matching patch_id from the left-hand
side, we can directly mark the left-hand patch_id->commit to be omitted.
Solving that would be easy, too; there's a one-to-many relationship of
patch-ids to commits, so we just need to keep a list.

But there's more. Commit b3dfeebb92 (rebase: avoid computing unnecessary
patch IDs, 2016-07-29) built on that by lazily computing the full
patch-ids. So we don't even know when adding to the hashmap whether two
commits truly have the same id. We'd have to tentatively assign them a
list, and then possibly split them apart (possibly into N new structs)
at the moment we compute the real patch-ids. This could work, but it's
complicated and error-prone.

Instead, let's accept that we may store duplicates, and teach the lookup
side to be more clever. Rather than asking for a single matching
patch-id, it will need to iterate over all matching patch-ids. This does
mean examining every entry in a single hash bucket, but the worst-case
for a hash lookup was already doing that.

We'll keep the hashmap details out of the caller by providing a simple
iteration interface. We can retain the simple has_commit_patch_id()
interface for the other callers, but we'll simplify its return value
into an integer, rather than returning the patch_id struct. That way
they won't be tempted to look at the "commit" field of the return value
without iterating.

Reported-by: Arnaud Morin <arnaud.morin@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 patch-ids.c                          | 14 +++++++++++++-
 patch-ids.h                          | 20 +++++++++++++++++++-
 revision.c                           |  6 ++++--
 t/t6007-rev-list-cherry-pick-file.sh | 12 ++++++++++++
 4 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/patch-ids.c b/patch-ids.c
index 21973e4933..f51021a0cb 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -89,7 +89,7 @@ static int init_patch_id_entry(struct patch_id *patch,
 	return 0;
 }

-struct patch_id *has_commit_patch_id(struct commit *commit,
+struct patch_id *patch_id_iter_first(struct commit *commit,
 				     struct patch_ids *ids)
 {
 	struct patch_id patch;
@@ -104,6 +104,18 @@ struct patch_id *has_commit_patch_id(struct commit *commit,
 	return hashmap_get_entry(&ids->patches, &patch, ent, NULL);
 }

+struct patch_id *patch_id_iter_next(struct patch_id *cur,
+				    struct patch_ids *ids)
+{
+	return hashmap_get_next_entry(&ids->patches, cur, ent);
+}
+
+int has_commit_patch_id(struct commit *commit,
+			struct patch_ids *ids)
+{
+	return !!patch_id_iter_first(commit, ids);
+}
+
 struct patch_id *add_commit_patch_id(struct commit *commit,
 				     struct patch_ids *ids)
 {
diff --git a/patch-ids.h b/patch-ids.h
index 03bb04e707..ab6c6a6804 100644
--- a/patch-ids.h
+++ b/patch-ids.h
@@ -23,7 +23,25 @@ int commit_patch_id(struct commit *commit, struct diff_options *options,
 		    struct object_id *oid, int, int);
 int init_patch_ids(struct repository *, struct patch_ids *);
 int free_patch_ids(struct patch_ids *);
+
+/* Add a patch_id for a single commit to the set. */
 struct patch_id *add_commit_patch_id(struct commit *, struct patch_ids *);
-struct patch_id *has_commit_patch_id(struct commit *, struct patch_ids *);
+
+/* Returns true if the patch-id of "commit" is present in the set. */
+int has_commit_patch_id(struct commit *commit, struct patch_ids *);
+
+/*
+ * Iterate over all commits in the set whose patch id matches that of
+ * "commit", like:
+ *
+ *   struct patch_id *cur;
+ *   for (cur = patch_id_iter_first(commit, ids);
+ *        cur;
+ *        cur = patch_id_iter_next(cur, ids) {
+ *           ... look at cur->commit
+ *   }
+ */
+struct patch_id *patch_id_iter_first(struct commit *commit, struct patch_ids *);
+struct patch_id *patch_id_iter_next(struct patch_id *cur, struct patch_ids *);

 #endif /* PATCH_IDS_H */
diff --git a/revision.c b/revision.c
index 9dff845bed..7ec9b634e3 100644
--- a/revision.c
+++ b/revision.c
@@ -1241,12 +1241,14 @@ static void cherry_pick_list(struct commit_list *list, struct rev_info *revs)
 		/*
 		 * Have we seen the same patch id?
 		 */
-		id = has_commit_patch_id(commit, &ids);
+		id = patch_id_iter_first(commit, &ids);
 		if (!id)
 			continue;

 		commit->object.flags |= cherry_flag;
-		id->commit->object.flags |= cherry_flag;
+		do {
+			id->commit->object.flags |= cherry_flag;
+		} while ((id = patch_id_iter_next(id, &ids)));
 	}

 	free_patch_ids(&ids);
diff --git a/t/t6007-rev-list-cherry-pick-file.sh b/t/t6007-rev-list-cherry-pick-file.sh
index f0268372d2..8bf5ae23c2 100755
--- a/t/t6007-rev-list-cherry-pick-file.sh
+++ b/t/t6007-rev-list-cherry-pick-file.sh
@@ -245,6 +245,18 @@ test_expect_success '--count --left-right' '
 	test_cmp expect actual
 '

+test_expect_success '--cherry-pick with duplicates on each side' '
+	git checkout -b dup-orig &&
+	test_commit dup-base &&
+	git revert dup-base &&
+	git cherry-pick dup-base &&
+	git checkout -b dup-side HEAD~3 &&
+	test_tick &&
+	git cherry-pick -3 dup-orig &&
+	git rev-list --cherry-pick dup-orig...dup-side >actual &&
+	test_must_be_empty actual
+'
+
 # Corrupt the object store deliberately to make sure
 # the object is not even checked for its existence.
 remove_loose_object () {
-- 
2.30.0.455.gdab6b73766

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] patch-ids: handle duplicate hashmap entries
  2021-01-12 15:52       ` [PATCH] patch-ids: handle duplicate hashmap entries Jeff King
@ 2021-01-12 16:24         ` Arnaud Morin
  2021-01-12 19:13         ` Junio C Hamano
  1 sibling, 0 replies; 15+ messages in thread
From: Arnaud Morin @ 2021-01-12 16:24 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Kevin Willford

YAY!

Thanks a lot :)

-- 
Arnaud Morin

On 12.01.21 - 10:52, Jeff King wrote:
> On Tue, Jan 12, 2021 at 03:34:38PM +0000, Arnaud Morin wrote:
> 
> > So instead of:
> > id = has_commit_patch_id(commit, &ids);
> > 
> > We should use some sort of iterator, in order to call:
> > id->commit->object.flags |= cherry_flag;
> > 
> > for all commits which are related to this patch?
> 
> Yep, exactly.
> 
> Here it is as a patch. :)
> 
> -- >8 --
> Subject: [PATCH] patch-ids: handle duplicate hashmap entries
> 
> This fixes a bug introduced in dfb7a1b4d0 (patch-ids: stop using a
> hand-rolled hashmap implementation, 2016-07-29) in which
> 
>   git rev-list --cherry-pick A...B
> 
> will fail to suppress commits reachable from A even if a commit with
> matching patch-id appears in B.
> 
> Around the time of that commit, the algorithm for "--cherry-pick" looked
> something like this:
> 
>   0. Traverse all of the commits, marking them as being on the left or
>      right side of the symmetric difference.
> 
>   1. Iterate over the left-hand commits, inserting a patch-id struct for
>      each into a hashmap, and pointing commit->util to the patch-id
>      struct.
> 
>   2. Iterate over the right-hand commits, checking which are present in
>      the hashmap. If so, we exclude the commit from the output _and_ we
>      mark the patch-id as "seen".
> 
>   3. Iterate again over the left-hand commits, checking whether
>      commit->util->seen is set; if so, exclude them from the output.
> 
> At the end, we'll have eliminated commits from both sides that have a
> matching patch-id on the other side. But there's a subtle assumption
> here: for any given patch-id, we must have exactly one struct
> representing it. If two commits from A both have the same patch-id and
> we allow duplicates in the hashmap, then we run into a problem:
> 
>   a. In step 1, we insert two patch-id structs into the hashmap.
> 
>   b. In step 2, our lookups will find only one of these structs, so only
>      one "seen" flag is marked.
> 
>   c. In step 3, one of the commits in A will have its commit->util->seen
>      set, but the other will not. We'll erroneously output the latter.
> 
> Prior to dfb7a1b4d0, our hashmap did not allow duplicates. Afterwards,
> it used hashmap_add(), which explicitly does allow duplicates.
> 
> At that point, the solution would have been easy: when we are about to
> add a duplicate, skip doing so and return the existing entry which
> matches. But it gets more complicated.
> 
> In 683f17ec44 (patch-ids: replace the seen indicator with a commit
> pointer, 2016-07-29), our step 3 goes away entirely. Instead, in step 2,
> when the right-hand side finds a matching patch_id from the left-hand
> side, we can directly mark the left-hand patch_id->commit to be omitted.
> Solving that would be easy, too; there's a one-to-many relationship of
> patch-ids to commits, so we just need to keep a list.
> 
> But there's more. Commit b3dfeebb92 (rebase: avoid computing unnecessary
> patch IDs, 2016-07-29) built on that by lazily computing the full
> patch-ids. So we don't even know when adding to the hashmap whether two
> commits truly have the same id. We'd have to tentatively assign them a
> list, and then possibly split them apart (possibly into N new structs)
> at the moment we compute the real patch-ids. This could work, but it's
> complicated and error-prone.
> 
> Instead, let's accept that we may store duplicates, and teach the lookup
> side to be more clever. Rather than asking for a single matching
> patch-id, it will need to iterate over all matching patch-ids. This does
> mean examining every entry in a single hash bucket, but the worst-case
> for a hash lookup was already doing that.
> 
> We'll keep the hashmap details out of the caller by providing a simple
> iteration interface. We can retain the simple has_commit_patch_id()
> interface for the other callers, but we'll simplify its return value
> into an integer, rather than returning the patch_id struct. That way
> they won't be tempted to look at the "commit" field of the return value
> without iterating.
> 
> Reported-by: Arnaud Morin <arnaud.morin@gmail.com>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  patch-ids.c                          | 14 +++++++++++++-
>  patch-ids.h                          | 20 +++++++++++++++++++-
>  revision.c                           |  6 ++++--
>  t/t6007-rev-list-cherry-pick-file.sh | 12 ++++++++++++
>  4 files changed, 48 insertions(+), 4 deletions(-)
> 
> diff --git a/patch-ids.c b/patch-ids.c
> index 21973e4933..f51021a0cb 100644
> --- a/patch-ids.c
> +++ b/patch-ids.c
> @@ -89,7 +89,7 @@ static int init_patch_id_entry(struct patch_id *patch,
>  	return 0;
>  }
>  
> -struct patch_id *has_commit_patch_id(struct commit *commit,
> +struct patch_id *patch_id_iter_first(struct commit *commit,
>  				     struct patch_ids *ids)
>  {
>  	struct patch_id patch;
> @@ -104,6 +104,18 @@ struct patch_id *has_commit_patch_id(struct commit *commit,
>  	return hashmap_get_entry(&ids->patches, &patch, ent, NULL);
>  }
>  
> +struct patch_id *patch_id_iter_next(struct patch_id *cur,
> +				    struct patch_ids *ids)
> +{
> +	return hashmap_get_next_entry(&ids->patches, cur, ent);
> +}
> +
> +int has_commit_patch_id(struct commit *commit,
> +			struct patch_ids *ids)
> +{
> +	return !!patch_id_iter_first(commit, ids);
> +}
> +
>  struct patch_id *add_commit_patch_id(struct commit *commit,
>  				     struct patch_ids *ids)
>  {
> diff --git a/patch-ids.h b/patch-ids.h
> index 03bb04e707..ab6c6a6804 100644
> --- a/patch-ids.h
> +++ b/patch-ids.h
> @@ -23,7 +23,25 @@ int commit_patch_id(struct commit *commit, struct diff_options *options,
>  		    struct object_id *oid, int, int);
>  int init_patch_ids(struct repository *, struct patch_ids *);
>  int free_patch_ids(struct patch_ids *);
> +
> +/* Add a patch_id for a single commit to the set. */
>  struct patch_id *add_commit_patch_id(struct commit *, struct patch_ids *);
> -struct patch_id *has_commit_patch_id(struct commit *, struct patch_ids *);
> +
> +/* Returns true if the patch-id of "commit" is present in the set. */
> +int has_commit_patch_id(struct commit *commit, struct patch_ids *);
> +
> +/*
> + * Iterate over all commits in the set whose patch id matches that of
> + * "commit", like:
> + *
> + *   struct patch_id *cur;
> + *   for (cur = patch_id_iter_first(commit, ids);
> + *        cur;
> + *        cur = patch_id_iter_next(cur, ids) {
> + *           ... look at cur->commit
> + *   }
> + */
> +struct patch_id *patch_id_iter_first(struct commit *commit, struct patch_ids *);
> +struct patch_id *patch_id_iter_next(struct patch_id *cur, struct patch_ids *);
>  
>  #endif /* PATCH_IDS_H */
> diff --git a/revision.c b/revision.c
> index 9dff845bed..7ec9b634e3 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -1241,12 +1241,14 @@ static void cherry_pick_list(struct commit_list *list, struct rev_info *revs)
>  		/*
>  		 * Have we seen the same patch id?
>  		 */
> -		id = has_commit_patch_id(commit, &ids);
> +		id = patch_id_iter_first(commit, &ids);
>  		if (!id)
>  			continue;
>  
>  		commit->object.flags |= cherry_flag;
> -		id->commit->object.flags |= cherry_flag;
> +		do {
> +			id->commit->object.flags |= cherry_flag;
> +		} while ((id = patch_id_iter_next(id, &ids)));
>  	}
>  
>  	free_patch_ids(&ids);
> diff --git a/t/t6007-rev-list-cherry-pick-file.sh b/t/t6007-rev-list-cherry-pick-file.sh
> index f0268372d2..8bf5ae23c2 100755
> --- a/t/t6007-rev-list-cherry-pick-file.sh
> +++ b/t/t6007-rev-list-cherry-pick-file.sh
> @@ -245,6 +245,18 @@ test_expect_success '--count --left-right' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success '--cherry-pick with duplicates on each side' '
> +	git checkout -b dup-orig &&
> +	test_commit dup-base &&
> +	git revert dup-base &&
> +	git cherry-pick dup-base &&
> +	git checkout -b dup-side HEAD~3 &&
> +	test_tick &&
> +	git cherry-pick -3 dup-orig &&
> +	git rev-list --cherry-pick dup-orig...dup-side >actual &&
> +	test_must_be_empty actual
> +'
> +
>  # Corrupt the object store deliberately to make sure
>  # the object is not even checked for its existence.
>  remove_loose_object () {
> -- 
> 2.30.0.455.gdab6b73766
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] patch-ids: handle duplicate hashmap entries
  2021-01-12 15:52       ` [PATCH] patch-ids: handle duplicate hashmap entries Jeff King
  2021-01-12 16:24         ` Arnaud Morin
@ 2021-01-12 19:13         ` Junio C Hamano
  2021-01-13  9:24           ` Arnaud Morin
  1 sibling, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2021-01-12 19:13 UTC (permalink / raw)
  To: Jeff King; +Cc: Arnaud Morin, git, Kevin Willford

Jeff King <peff@peff.net> writes:

> At the end, we'll have eliminated commits from both sides that have a
> matching patch-id on the other side. But there's a subtle assumption
> here: for any given patch-id, we must have exactly one struct
> representing it. If two commits from A both have the same patch-id and
> we allow duplicates in the hashmap, then we run into a problem:

In practical terms, for one side of the history to have two
identical patches, the most likely scenerio for that to happen is to
have a patch, its reversion, and its reapplication, intermixed in
its history with other commits, e.g.

    ---o---o---M---o---o---W---o---o---M---o--- ...
        \
         o---o---o---M---o--- ...

where "M" are commits that does an identical change, and "W"
(upside-down "M") is its reversion.  On the top history, "M" was
introduced, the bottom history cherry-picked, the top history found
problems in it and reverted with "W", and later (perhaps with the
help of preparatory patches on top of "W"), the top history now
considers that "M" is safe, so reapplies it.

And a "--cherry-pick" output that excludes "identical patches" that
appear on both sides on such a history would hide all "M"'s, leaving
a history like

    ---o---o-------o---o---W---o---o-------o--- ...
        \
         o---o---o-------o--- ...

But is this result what the user really wanted to see, I have to
wonder.

I do not see any problem in the patch itself.  We used to hide only
one "M" from the history on the top half in the picture, leaving one
"M" and "W", while hiding the sole "M" from the bottom half.  Now if
we want to no longer show any "M", the updated code would correctly
hide all of them.

It just feels to me that the resulting view of the history look
weird, leaving only the reversion of a patch that has never been
applied in the first place.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] patch-ids: handle duplicate hashmap entries
  2021-01-12 19:13         ` Junio C Hamano
@ 2021-01-13  9:24           ` Arnaud Morin
  2021-01-13 12:59             ` Jeff King
  2021-01-13 19:28             ` Junio C Hamano
  0 siblings, 2 replies; 15+ messages in thread
From: Arnaud Morin @ 2021-01-13  9:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Kevin Willford

Without this patch, that's even worst, consistency is broken.
Let me explain.

With your history example:

     ---o---o---M---o---o---W---o---o---M---o--- branch
         \
          o---o---o---M---o--- master

# WITHOUT PATCH
If we imagine that master is having more commits count than branch.
The result of rev-list will be like you described:
$ git rev-list --left-right --cherry-pick branch...master
<M
<W

In other words, it's showing both W and M.

BUT, if we imagine now that master is having less commits count than branch.
$ git rev-list --left-right --cherry-pick branch...master
<W

It's only showing W!

So the result is not consistent, depending on which branch is having
more commits.

# WITH PATCH
With the patch, everything is consistent, and only W is kept in ouptut,
no matter the size of history:
$ git.p rev-list --left-right --cherry-pick branch...master
<W

Cheers,

-- 
Arnaud Morin

On 12.01.21 - 11:13, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
> > At the end, we'll have eliminated commits from both sides that have a
> > matching patch-id on the other side. But there's a subtle assumption
> > here: for any given patch-id, we must have exactly one struct
> > representing it. If two commits from A both have the same patch-id and
> > we allow duplicates in the hashmap, then we run into a problem:
> 
> In practical terms, for one side of the history to have two
> identical patches, the most likely scenerio for that to happen is to
> have a patch, its reversion, and its reapplication, intermixed in
> its history with other commits, e.g.
> 
>     ---o---o---M---o---o---W---o---o---M---o--- ...
>         \
>          o---o---o---M---o--- ...
> 
> where "M" are commits that does an identical change, and "W"
> (upside-down "M") is its reversion.  On the top history, "M" was
> introduced, the bottom history cherry-picked, the top history found
> problems in it and reverted with "W", and later (perhaps with the
> help of preparatory patches on top of "W"), the top history now
> considers that "M" is safe, so reapplies it.
> 
> And a "--cherry-pick" output that excludes "identical patches" that
> appear on both sides on such a history would hide all "M"'s, leaving
> a history like
> 
>     ---o---o-------o---o---W---o---o-------o--- ...
>         \
>          o---o---o-------o--- ...
> 
> But is this result what the user really wanted to see, I have to
> wonder.
> 
> I do not see any problem in the patch itself.  We used to hide only
> one "M" from the history on the top half in the picture, leaving one
> "M" and "W", while hiding the sole "M" from the bottom half.  Now if
> we want to no longer show any "M", the updated code would correctly
> hide all of them.
> 
> It just feels to me that the resulting view of the history look
> weird, leaving only the reversion of a patch that has never been
> applied in the first place.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] patch-ids: handle duplicate hashmap entries
  2021-01-13  9:24           ` Arnaud Morin
@ 2021-01-13 12:59             ` Jeff King
  2021-01-13 20:21               ` Junio C Hamano
  2021-01-13 19:28             ` Junio C Hamano
  1 sibling, 1 reply; 15+ messages in thread
From: Jeff King @ 2021-01-13 12:59 UTC (permalink / raw)
  To: Arnaud Morin; +Cc: Junio C Hamano, git, Kevin Willford

On Wed, Jan 13, 2021 at 09:24:48AM +0000, Arnaud Morin wrote:

> Without this patch, that's even worst, consistency is broken.
> Let me explain.
> 
> With your history example:
> 
>      ---o---o---M---o---o---W---o---o---M---o--- branch
>          \
>           o---o---o---M---o--- master
> 
> # WITHOUT PATCH
> If we imagine that master is having more commits count than branch.
> The result of rev-list will be like you described:
> $ git rev-list --left-right --cherry-pick branch...master
> <M
> <W
> 
> In other words, it's showing both W and M.
> 
> BUT, if we imagine now that master is having less commits count than branch.
> $ git rev-list --left-right --cherry-pick branch...master
> <W
> 
> It's only showing W!
> 
> So the result is not consistent, depending on which branch is having
> more commits.

Right. There's definitely a question of whether --cherry-pick is the
most useful thing in such a situation, but the current behavior was
simply broken. It did not behave as advertised, and it treated one side
of the history different from the other. So whether we want to do
anything else to help that case, I think this at least makes
--cherry-pick sane. :)

Here are two related histories to think about.

This is the same history, but the revert and re-do happens on both
sides (and this is actually the setup in the new test):

      ---o---o---M---o---o---W---o---o---M---o--- branch
          \
           o---o---M---o---o---W---o---o---M---o--- master

There we really _do_ want to clear out both M's (really, all four),
because we're doing so _from both sides_. And that is ultimately the
point of the cherry-pick option: show commits that were picked from one
side to the other.

Another interesting case is when the first "M" is actually bundled into
another commit, like:

      ---o---o---M+X---o---o---W---o---o---M---o--- branch
          \
           o---o---o--M---o--- master

where "M+X" has the changes from "M" plus some other changes. It will
get a different patch-id, and --cherry-pick would show both M+X and W,
but not M for either side. This is a limitation of the patch-id concept:
we are looking at whole commits, not overall changes.

I suspect for most operations we care less about "remove all
cherry-picks from both sides", but rather "give me one side, omitting
commits that are already on the other side" (because you want to rebase
or cherry-pick some subset).

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] patch-ids: handle duplicate hashmap entries
  2021-01-13  9:24           ` Arnaud Morin
  2021-01-13 12:59             ` Jeff King
@ 2021-01-13 19:28             ` Junio C Hamano
  1 sibling, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2021-01-13 19:28 UTC (permalink / raw)
  To: Arnaud Morin; +Cc: Jeff King, git, Kevin Willford

Arnaud Morin <arnaud.morin@gmail.com> writes:

> Without this patch, that's even worst, consistency is broken.
> Let me explain.
>
> With your history example:
>
>      ---o---o---M---o---o---W---o---o---M---o--- branch
>          \
>           o---o---o---M---o--- master
>
> # WITHOUT PATCH
> If we imagine that master is having more commits count than branch.
> The result of rev-list will be like you described:
> $ git rev-list --left-right --cherry-pick branch...master
> <M
> <W
>
> In other words, it's showing both W and M.

So, at least they cancel out and the reader can tell that the net
effect was none --- that is "sort of understandable" result.

> BUT, if we imagine now that master is having less commits count than branch.
> $ git rev-list --left-right --cherry-pick branch...master
> <W
>
> It's only showing W!

Which is what I felt misleading.

> # WITH PATCH
> With the patch, everything is consistent, and only W is kept in ouptut,
> no matter the size of history:
> $ git.p rev-list --left-right --cherry-pick branch...master
> <W

So with the patch, the former case is made the same as the latter
misleading result, which would mean that they are consistently
misleading?

I guess that it is better to be misleading all the time than being
"sort of understandable" half the time and misleading the rest of
time.  At least that is consistent.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] patch-ids: handle duplicate hashmap entries
  2021-01-13 12:59             ` Jeff King
@ 2021-01-13 20:21               ` Junio C Hamano
  2021-01-13 20:33                 ` Jeff King
  0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2021-01-13 20:21 UTC (permalink / raw)
  To: Jeff King; +Cc: Arnaud Morin, git, Kevin Willford

Jeff King <peff@peff.net> writes:

> Right. There's definitely a question of whether --cherry-pick is the
> most useful thing in such a situation, but the current behavior was
> simply broken. It did not behave as advertised, and it treated one side
> of the history different from the other. So whether we want to do
> anything else to help that case, I think this at least makes
> --cherry-pick sane. :)

Yes; I think "sane" does not always equal to "useful", though ;-)

> Here are two related histories to think about.
> ...
> I suspect for most operations we care less about "remove all
> cherry-picks from both sides", but rather "give me one side, omitting
> commits that are already on the other side" (because you want to rebase
> or cherry-pick some subset).

Yes again.  It means "--cherry-pick" inherently is not symmetric.
One side is the reference side (i.e. mainline), and the goal is to
remove from the other side what is in the reference side.

What we are seeing in this discussion is that "--cherry-pick" and
symmetric difference do not conceptually play well together.

But the behaviour with the patch makes the most sense when
cherry-pick is applied to a symmetric difference (provided that
doing so makes sense in the first place, that is).

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] patch-ids: handle duplicate hashmap entries
  2021-01-13 20:21               ` Junio C Hamano
@ 2021-01-13 20:33                 ` Jeff King
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2021-01-13 20:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Arnaud Morin, git, Kevin Willford

On Wed, Jan 13, 2021 at 12:21:30PM -0800, Junio C Hamano wrote:

> > I suspect for most operations we care less about "remove all
> > cherry-picks from both sides", but rather "give me one side, omitting
> > commits that are already on the other side" (because you want to rebase
> > or cherry-pick some subset).
> 
> Yes again.  It means "--cherry-pick" inherently is not symmetric.
> One side is the reference side (i.e. mainline), and the goal is to
> remove from the other side what is in the reference side.
> 
> What we are seeing in this discussion is that "--cherry-pick" and
> symmetric difference do not conceptually play well together.
> 
> But the behaviour with the patch makes the most sense when
> cherry-pick is applied to a symmetric difference (provided that
> doing so makes sense in the first place, that is).

I didn't realize --cherry-pick would work without a symmetric
difference. The documentation says:

  Omit any commit that introduces the same change as another commit on
  the “other side” when the set of commits are limited with symmetric
  difference.

And indeed, I think it silently does nothing with:

  git rev-list --cherry-pick A..B

(because there is nothing on the "other" side to match).

So you do need some symmetric difference in order to define the two
sets, but you might only want to see one of the sides.  And that is
basically what --cherry does. But having looked at the implementation of
cherry_pick_list(), it is quite happy to swap the sides internally. I
guess if we were going to make the output unsymmetric, the first thing
would be to stop doing that swap. :)

For the specifics of reverts and replays, though, I think the weirdness
has nothing to do with left/right, or showing one side. It's the
mismatched count. So if we were to make any change, it would be to keep
a count of how many times each commit appears on the other side, and
cancel them one for one.

I.e., this:

  o--M--W--M--o--
   \
    o--M--o---

might plausibly want to show "M" once on the upper branch. But:

  o--M--W--M--o--
   \
    o--M--W--M--o--

should never show M, I wouldn't think.

I'm not sure if that would be violating what existing callers expect
from "--cherry-pick", though (i.e., if it would need another name, or an
extra option to trigger).

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-01-13 20:35 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-09 16:24 rev-list with multiple commits sharing same patch-id Arnaud Morin
2021-01-11  7:59 ` Arnaud Morin
2021-01-11  9:54 ` Christian Couder
2021-01-11 18:25   ` Arnaud Morin
2021-01-12 14:17 ` Jeff King
2021-01-12 15:11   ` Jeff King
2021-01-12 15:34     ` Arnaud Morin
2021-01-12 15:52       ` [PATCH] patch-ids: handle duplicate hashmap entries Jeff King
2021-01-12 16:24         ` Arnaud Morin
2021-01-12 19:13         ` Junio C Hamano
2021-01-13  9:24           ` Arnaud Morin
2021-01-13 12:59             ` Jeff King
2021-01-13 20:21               ` Junio C Hamano
2021-01-13 20:33                 ` Jeff King
2021-01-13 19:28             ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).