git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Arnaud Morin <arnaud.morin@gmail.com>
Cc: git@vger.kernel.org, Kevin Willford <kewillf@microsoft.com>
Subject: [PATCH] patch-ids: handle duplicate hashmap entries
Date: Tue, 12 Jan 2021 10:52:32 -0500	[thread overview]
Message-ID: <X/3FwNPHqZqY+hv0@coredump.intra.peff.net> (raw)
In-Reply-To: <20210112153438.GC32482@sync>

On Tue, Jan 12, 2021 at 03:34:38PM +0000, Arnaud Morin wrote:

> So instead of:
> id = has_commit_patch_id(commit, &ids);
> 
> We should use some sort of iterator, in order to call:
> id->commit->object.flags |= cherry_flag;
> 
> for all commits which are related to this patch?

Yep, exactly.

Here it is as a patch. :)

-- >8 --
Subject: [PATCH] patch-ids: handle duplicate hashmap entries

This fixes a bug introduced in dfb7a1b4d0 (patch-ids: stop using a
hand-rolled hashmap implementation, 2016-07-29) in which

  git rev-list --cherry-pick A...B

will fail to suppress commits reachable from A even if a commit with
matching patch-id appears in B.

Around the time of that commit, the algorithm for "--cherry-pick" looked
something like this:

  0. Traverse all of the commits, marking them as being on the left or
     right side of the symmetric difference.

  1. Iterate over the left-hand commits, inserting a patch-id struct for
     each into a hashmap, and pointing commit->util to the patch-id
     struct.

  2. Iterate over the right-hand commits, checking which are present in
     the hashmap. If so, we exclude the commit from the output _and_ we
     mark the patch-id as "seen".

  3. Iterate again over the left-hand commits, checking whether
     commit->util->seen is set; if so, exclude them from the output.

At the end, we'll have eliminated commits from both sides that have a
matching patch-id on the other side. But there's a subtle assumption
here: for any given patch-id, we must have exactly one struct
representing it. If two commits from A both have the same patch-id and
we allow duplicates in the hashmap, then we run into a problem:

  a. In step 1, we insert two patch-id structs into the hashmap.

  b. In step 2, our lookups will find only one of these structs, so only
     one "seen" flag is marked.

  c. In step 3, one of the commits in A will have its commit->util->seen
     set, but the other will not. We'll erroneously output the latter.

Prior to dfb7a1b4d0, our hashmap did not allow duplicates. Afterwards,
it used hashmap_add(), which explicitly does allow duplicates.

At that point, the solution would have been easy: when we are about to
add a duplicate, skip doing so and return the existing entry which
matches. But it gets more complicated.

In 683f17ec44 (patch-ids: replace the seen indicator with a commit
pointer, 2016-07-29), our step 3 goes away entirely. Instead, in step 2,
when the right-hand side finds a matching patch_id from the left-hand
side, we can directly mark the left-hand patch_id->commit to be omitted.
Solving that would be easy, too; there's a one-to-many relationship of
patch-ids to commits, so we just need to keep a list.

But there's more. Commit b3dfeebb92 (rebase: avoid computing unnecessary
patch IDs, 2016-07-29) built on that by lazily computing the full
patch-ids. So we don't even know when adding to the hashmap whether two
commits truly have the same id. We'd have to tentatively assign them a
list, and then possibly split them apart (possibly into N new structs)
at the moment we compute the real patch-ids. This could work, but it's
complicated and error-prone.

Instead, let's accept that we may store duplicates, and teach the lookup
side to be more clever. Rather than asking for a single matching
patch-id, it will need to iterate over all matching patch-ids. This does
mean examining every entry in a single hash bucket, but the worst-case
for a hash lookup was already doing that.

We'll keep the hashmap details out of the caller by providing a simple
iteration interface. We can retain the simple has_commit_patch_id()
interface for the other callers, but we'll simplify its return value
into an integer, rather than returning the patch_id struct. That way
they won't be tempted to look at the "commit" field of the return value
without iterating.

Reported-by: Arnaud Morin <arnaud.morin@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 patch-ids.c                          | 14 +++++++++++++-
 patch-ids.h                          | 20 +++++++++++++++++++-
 revision.c                           |  6 ++++--
 t/t6007-rev-list-cherry-pick-file.sh | 12 ++++++++++++
 4 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/patch-ids.c b/patch-ids.c
index 21973e4933..f51021a0cb 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -89,7 +89,7 @@ static int init_patch_id_entry(struct patch_id *patch,
 	return 0;
 }
 
-struct patch_id *has_commit_patch_id(struct commit *commit,
+struct patch_id *patch_id_iter_first(struct commit *commit,
 				     struct patch_ids *ids)
 {
 	struct patch_id patch;
@@ -104,6 +104,18 @@ struct patch_id *has_commit_patch_id(struct commit *commit,
 	return hashmap_get_entry(&ids->patches, &patch, ent, NULL);
 }
 
+struct patch_id *patch_id_iter_next(struct patch_id *cur,
+				    struct patch_ids *ids)
+{
+	return hashmap_get_next_entry(&ids->patches, cur, ent);
+}
+
+int has_commit_patch_id(struct commit *commit,
+			struct patch_ids *ids)
+{
+	return !!patch_id_iter_first(commit, ids);
+}
+
 struct patch_id *add_commit_patch_id(struct commit *commit,
 				     struct patch_ids *ids)
 {
diff --git a/patch-ids.h b/patch-ids.h
index 03bb04e707..ab6c6a6804 100644
--- a/patch-ids.h
+++ b/patch-ids.h
@@ -23,7 +23,25 @@ int commit_patch_id(struct commit *commit, struct diff_options *options,
 		    struct object_id *oid, int, int);
 int init_patch_ids(struct repository *, struct patch_ids *);
 int free_patch_ids(struct patch_ids *);
+
+/* Add a patch_id for a single commit to the set. */
 struct patch_id *add_commit_patch_id(struct commit *, struct patch_ids *);
-struct patch_id *has_commit_patch_id(struct commit *, struct patch_ids *);
+
+/* Returns true if the patch-id of "commit" is present in the set. */
+int has_commit_patch_id(struct commit *commit, struct patch_ids *);
+
+/*
+ * Iterate over all commits in the set whose patch id matches that of
+ * "commit", like:
+ *
+ *   struct patch_id *cur;
+ *   for (cur = patch_id_iter_first(commit, ids);
+ *        cur;
+ *        cur = patch_id_iter_next(cur, ids) {
+ *           ... look at cur->commit
+ *   }
+ */
+struct patch_id *patch_id_iter_first(struct commit *commit, struct patch_ids *);
+struct patch_id *patch_id_iter_next(struct patch_id *cur, struct patch_ids *);
 
 #endif /* PATCH_IDS_H */
diff --git a/revision.c b/revision.c
index 9dff845bed..7ec9b634e3 100644
--- a/revision.c
+++ b/revision.c
@@ -1241,12 +1241,14 @@ static void cherry_pick_list(struct commit_list *list, struct rev_info *revs)
 		/*
 		 * Have we seen the same patch id?
 		 */
-		id = has_commit_patch_id(commit, &ids);
+		id = patch_id_iter_first(commit, &ids);
 		if (!id)
 			continue;
 
 		commit->object.flags |= cherry_flag;
-		id->commit->object.flags |= cherry_flag;
+		do {
+			id->commit->object.flags |= cherry_flag;
+		} while ((id = patch_id_iter_next(id, &ids)));
 	}
 
 	free_patch_ids(&ids);
diff --git a/t/t6007-rev-list-cherry-pick-file.sh b/t/t6007-rev-list-cherry-pick-file.sh
index f0268372d2..8bf5ae23c2 100755
--- a/t/t6007-rev-list-cherry-pick-file.sh
+++ b/t/t6007-rev-list-cherry-pick-file.sh
@@ -245,6 +245,18 @@ test_expect_success '--count --left-right' '
 	test_cmp expect actual
 '
 
+test_expect_success '--cherry-pick with duplicates on each side' '
+	git checkout -b dup-orig &&
+	test_commit dup-base &&
+	git revert dup-base &&
+	git cherry-pick dup-base &&
+	git checkout -b dup-side HEAD~3 &&
+	test_tick &&
+	git cherry-pick -3 dup-orig &&
+	git rev-list --cherry-pick dup-orig...dup-side >actual &&
+	test_must_be_empty actual
+'
+
 # Corrupt the object store deliberately to make sure
 # the object is not even checked for its existence.
 remove_loose_object () {
-- 
2.30.0.455.gdab6b73766


  reply	other threads:[~2021-01-12 15:56 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-09 16:24 rev-list with multiple commits sharing same patch-id Arnaud Morin
2021-01-11  7:59 ` Arnaud Morin
2021-01-11  9:54 ` Christian Couder
2021-01-11 18:25   ` Arnaud Morin
2021-01-12 14:17 ` Jeff King
2021-01-12 15:11   ` Jeff King
2021-01-12 15:34     ` Arnaud Morin
2021-01-12 15:52       ` Jeff King [this message]
2021-01-12 16:24         ` [PATCH] patch-ids: handle duplicate hashmap entries Arnaud Morin
2021-01-12 19:13         ` Junio C Hamano
2021-01-13  9:24           ` Arnaud Morin
2021-01-13 12:59             ` Jeff King
2021-01-13 20:21               ` Junio C Hamano
2021-01-13 20:33                 ` Jeff King
2021-01-13 19:28             ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=X/3FwNPHqZqY+hv0@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=arnaud.morin@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=kewillf@microsoft.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).