git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Arnaud Morin <arnaud.morin@gmail.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org, Kevin Willford <kewillf@microsoft.com>
Subject: Re: [PATCH] patch-ids: handle duplicate hashmap entries
Date: Tue, 12 Jan 2021 16:24:25 +0000	[thread overview]
Message-ID: <20210112162425.GD32482@sync> (raw)
In-Reply-To: <X/3FwNPHqZqY+hv0@coredump.intra.peff.net>

YAY!

Thanks a lot :)

-- 
Arnaud Morin

On 12.01.21 - 10:52, Jeff King wrote:
> On Tue, Jan 12, 2021 at 03:34:38PM +0000, Arnaud Morin wrote:
> 
> > So instead of:
> > id = has_commit_patch_id(commit, &ids);
> > 
> > We should use some sort of iterator, in order to call:
> > id->commit->object.flags |= cherry_flag;
> > 
> > for all commits which are related to this patch?
> 
> Yep, exactly.
> 
> Here it is as a patch. :)
> 
> -- >8 --
> Subject: [PATCH] patch-ids: handle duplicate hashmap entries
> 
> This fixes a bug introduced in dfb7a1b4d0 (patch-ids: stop using a
> hand-rolled hashmap implementation, 2016-07-29) in which
> 
>   git rev-list --cherry-pick A...B
> 
> will fail to suppress commits reachable from A even if a commit with
> matching patch-id appears in B.
> 
> Around the time of that commit, the algorithm for "--cherry-pick" looked
> something like this:
> 
>   0. Traverse all of the commits, marking them as being on the left or
>      right side of the symmetric difference.
> 
>   1. Iterate over the left-hand commits, inserting a patch-id struct for
>      each into a hashmap, and pointing commit->util to the patch-id
>      struct.
> 
>   2. Iterate over the right-hand commits, checking which are present in
>      the hashmap. If so, we exclude the commit from the output _and_ we
>      mark the patch-id as "seen".
> 
>   3. Iterate again over the left-hand commits, checking whether
>      commit->util->seen is set; if so, exclude them from the output.
> 
> At the end, we'll have eliminated commits from both sides that have a
> matching patch-id on the other side. But there's a subtle assumption
> here: for any given patch-id, we must have exactly one struct
> representing it. If two commits from A both have the same patch-id and
> we allow duplicates in the hashmap, then we run into a problem:
> 
>   a. In step 1, we insert two patch-id structs into the hashmap.
> 
>   b. In step 2, our lookups will find only one of these structs, so only
>      one "seen" flag is marked.
> 
>   c. In step 3, one of the commits in A will have its commit->util->seen
>      set, but the other will not. We'll erroneously output the latter.
> 
> Prior to dfb7a1b4d0, our hashmap did not allow duplicates. Afterwards,
> it used hashmap_add(), which explicitly does allow duplicates.
> 
> At that point, the solution would have been easy: when we are about to
> add a duplicate, skip doing so and return the existing entry which
> matches. But it gets more complicated.
> 
> In 683f17ec44 (patch-ids: replace the seen indicator with a commit
> pointer, 2016-07-29), our step 3 goes away entirely. Instead, in step 2,
> when the right-hand side finds a matching patch_id from the left-hand
> side, we can directly mark the left-hand patch_id->commit to be omitted.
> Solving that would be easy, too; there's a one-to-many relationship of
> patch-ids to commits, so we just need to keep a list.
> 
> But there's more. Commit b3dfeebb92 (rebase: avoid computing unnecessary
> patch IDs, 2016-07-29) built on that by lazily computing the full
> patch-ids. So we don't even know when adding to the hashmap whether two
> commits truly have the same id. We'd have to tentatively assign them a
> list, and then possibly split them apart (possibly into N new structs)
> at the moment we compute the real patch-ids. This could work, but it's
> complicated and error-prone.
> 
> Instead, let's accept that we may store duplicates, and teach the lookup
> side to be more clever. Rather than asking for a single matching
> patch-id, it will need to iterate over all matching patch-ids. This does
> mean examining every entry in a single hash bucket, but the worst-case
> for a hash lookup was already doing that.
> 
> We'll keep the hashmap details out of the caller by providing a simple
> iteration interface. We can retain the simple has_commit_patch_id()
> interface for the other callers, but we'll simplify its return value
> into an integer, rather than returning the patch_id struct. That way
> they won't be tempted to look at the "commit" field of the return value
> without iterating.
> 
> Reported-by: Arnaud Morin <arnaud.morin@gmail.com>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  patch-ids.c                          | 14 +++++++++++++-
>  patch-ids.h                          | 20 +++++++++++++++++++-
>  revision.c                           |  6 ++++--
>  t/t6007-rev-list-cherry-pick-file.sh | 12 ++++++++++++
>  4 files changed, 48 insertions(+), 4 deletions(-)
> 
> diff --git a/patch-ids.c b/patch-ids.c
> index 21973e4933..f51021a0cb 100644
> --- a/patch-ids.c
> +++ b/patch-ids.c
> @@ -89,7 +89,7 @@ static int init_patch_id_entry(struct patch_id *patch,
>  	return 0;
>  }
>  
> -struct patch_id *has_commit_patch_id(struct commit *commit,
> +struct patch_id *patch_id_iter_first(struct commit *commit,
>  				     struct patch_ids *ids)
>  {
>  	struct patch_id patch;
> @@ -104,6 +104,18 @@ struct patch_id *has_commit_patch_id(struct commit *commit,
>  	return hashmap_get_entry(&ids->patches, &patch, ent, NULL);
>  }
>  
> +struct patch_id *patch_id_iter_next(struct patch_id *cur,
> +				    struct patch_ids *ids)
> +{
> +	return hashmap_get_next_entry(&ids->patches, cur, ent);
> +}
> +
> +int has_commit_patch_id(struct commit *commit,
> +			struct patch_ids *ids)
> +{
> +	return !!patch_id_iter_first(commit, ids);
> +}
> +
>  struct patch_id *add_commit_patch_id(struct commit *commit,
>  				     struct patch_ids *ids)
>  {
> diff --git a/patch-ids.h b/patch-ids.h
> index 03bb04e707..ab6c6a6804 100644
> --- a/patch-ids.h
> +++ b/patch-ids.h
> @@ -23,7 +23,25 @@ int commit_patch_id(struct commit *commit, struct diff_options *options,
>  		    struct object_id *oid, int, int);
>  int init_patch_ids(struct repository *, struct patch_ids *);
>  int free_patch_ids(struct patch_ids *);
> +
> +/* Add a patch_id for a single commit to the set. */
>  struct patch_id *add_commit_patch_id(struct commit *, struct patch_ids *);
> -struct patch_id *has_commit_patch_id(struct commit *, struct patch_ids *);
> +
> +/* Returns true if the patch-id of "commit" is present in the set. */
> +int has_commit_patch_id(struct commit *commit, struct patch_ids *);
> +
> +/*
> + * Iterate over all commits in the set whose patch id matches that of
> + * "commit", like:
> + *
> + *   struct patch_id *cur;
> + *   for (cur = patch_id_iter_first(commit, ids);
> + *        cur;
> + *        cur = patch_id_iter_next(cur, ids) {
> + *           ... look at cur->commit
> + *   }
> + */
> +struct patch_id *patch_id_iter_first(struct commit *commit, struct patch_ids *);
> +struct patch_id *patch_id_iter_next(struct patch_id *cur, struct patch_ids *);
>  
>  #endif /* PATCH_IDS_H */
> diff --git a/revision.c b/revision.c
> index 9dff845bed..7ec9b634e3 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -1241,12 +1241,14 @@ static void cherry_pick_list(struct commit_list *list, struct rev_info *revs)
>  		/*
>  		 * Have we seen the same patch id?
>  		 */
> -		id = has_commit_patch_id(commit, &ids);
> +		id = patch_id_iter_first(commit, &ids);
>  		if (!id)
>  			continue;
>  
>  		commit->object.flags |= cherry_flag;
> -		id->commit->object.flags |= cherry_flag;
> +		do {
> +			id->commit->object.flags |= cherry_flag;
> +		} while ((id = patch_id_iter_next(id, &ids)));
>  	}
>  
>  	free_patch_ids(&ids);
> diff --git a/t/t6007-rev-list-cherry-pick-file.sh b/t/t6007-rev-list-cherry-pick-file.sh
> index f0268372d2..8bf5ae23c2 100755
> --- a/t/t6007-rev-list-cherry-pick-file.sh
> +++ b/t/t6007-rev-list-cherry-pick-file.sh
> @@ -245,6 +245,18 @@ test_expect_success '--count --left-right' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success '--cherry-pick with duplicates on each side' '
> +	git checkout -b dup-orig &&
> +	test_commit dup-base &&
> +	git revert dup-base &&
> +	git cherry-pick dup-base &&
> +	git checkout -b dup-side HEAD~3 &&
> +	test_tick &&
> +	git cherry-pick -3 dup-orig &&
> +	git rev-list --cherry-pick dup-orig...dup-side >actual &&
> +	test_must_be_empty actual
> +'
> +
>  # Corrupt the object store deliberately to make sure
>  # the object is not even checked for its existence.
>  remove_loose_object () {
> -- 
> 2.30.0.455.gdab6b73766
> 

  reply	other threads:[~2021-01-12 16:26 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-09 16:24 rev-list with multiple commits sharing same patch-id Arnaud Morin
2021-01-11  7:59 ` Arnaud Morin
2021-01-11  9:54 ` Christian Couder
2021-01-11 18:25   ` Arnaud Morin
2021-01-12 14:17 ` Jeff King
2021-01-12 15:11   ` Jeff King
2021-01-12 15:34     ` Arnaud Morin
2021-01-12 15:52       ` [PATCH] patch-ids: handle duplicate hashmap entries Jeff King
2021-01-12 16:24         ` Arnaud Morin [this message]
2021-01-12 19:13         ` Junio C Hamano
2021-01-13  9:24           ` Arnaud Morin
2021-01-13 12:59             ` Jeff King
2021-01-13 20:21               ` Junio C Hamano
2021-01-13 20:33                 ` Jeff King
2021-01-13 19:28             ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210112162425.GD32482@sync \
    --to=arnaud.morin@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=kewillf@microsoft.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).