[RFC/PATCH 0/2] more patch-id speedups

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [RFC/PATCH 0/2] more patch-id speedups
@ 2016-09-07  7:53 Jeff King
  2016-09-07  7:54 ` [PATCH 1/2] patch-ids: turn off rename detection Jeff King
                   ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: Jeff King @ 2016-09-07  7:53 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Kevin Willford

Michael and I found a case where the "format-patch --cherry-pick A...B"
command for a rebase took over 7 minutes to run with git v2.9.3. Yikes.

Switching to v2.10 dropped that to a bit over 3 minutes (due to the
kw/patch-ids-optim topic). Better, but not great.

The culprit turned out to be merge commits; the patch-id code will
happily diff a merge against its first parent, and ignore the rest. This
_seems_ like a bad idea, but maybe there is something clever going on
that I don't know about. I couldn't find anything useful in the history,
and given that this code was adapted from rebase, my guess is that it
was never really intended to handle merge commits in the first place (of
course we weren't trying to rebase merge commits; but it has to generate
patch-ids for everything that happened on "A" to compare against).

Dropping the computation of the merge commits got it down to about 4
seconds. I also noticed that it was doing rename detection (which also
seems like a bad idea). Disabling renames dropped another half second or
so.

This is marked as "RFC" because I don't feel entirely confident that I'm
not missing some clever need for these options. But in both cases my gut
feeling is that they are simply unintended effects that nobody ever
noticed, because it would be very rare that they would affect the
output. And that if they _did_ affect the output, they would probably be
doing the wrong thing.

-peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/2] patch-ids: turn off rename detection
  2016-09-07  7:53 [RFC/PATCH 0/2] more patch-id speedups Jeff King
@ 2016-09-07  7:54 ` Jeff King
  2016-09-07 12:53   ` Johannes Schindelin
  2016-09-07  7:54 ` [PATCH 2/2] patch-ids: skip merge commits Jeff King
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2016-09-07  7:54 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Kevin Willford

The patch-id code may be running inside another porcelain
like "git log" or "git format-patch", and therefore may have
set diff_detect_rename_default, either via the diff-ui
config, or by default since 5404c11 (diff: activate
diff.renames by default, 2016-02-25). This is the case even
if a command is run with `--no-renames`, as that is applied
only to the diff-options used by the command itself.

Rename detection doesn't help the patch-id results. It
_may_ actually hurt, as minor differences in the files that
would be overlooked by patch-id's canonicalization might
result in different renames (though I'd doubt that it ever
comes up in practice).

But mostly it is just a waste of CPU to compute these
renames.

Note that we don't have to worry about compatibility here.
This patch disables renames just for the internal patch-id
comparison run by "log --cherry-pick", etc. The user-visible
"git patch-id" output depends on the patch that it is fed
(so it is up to the diff generator to use --no-renames if
they wish).

Signed-off-by: Jeff King <peff@peff.net>
---
 patch-ids.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/patch-ids.c b/patch-ids.c
index 082412a..77e4663 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -45,6 +45,7 @@ int init_patch_ids(struct patch_ids *ids)
 {
 	memset(ids, 0, sizeof(*ids));
 	diff_setup(&ids->diffopts);
+	ids->diffopts.detect_rename = 0;
 	DIFF_OPT_SET(&ids->diffopts, RECURSIVE);
 	diff_setup_done(&ids->diffopts);
 	hashmap_init(&ids->patches, (hashmap_cmp_fn)patch_id_cmp, 256);
-- 
2.10.0.rc2.154.gb4a4b8b

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/2] patch-ids: skip merge commits
  2016-09-07  7:53 [RFC/PATCH 0/2] more patch-id speedups Jeff King
  2016-09-07  7:54 ` [PATCH 1/2] patch-ids: turn off rename detection Jeff King
@ 2016-09-07  7:54 ` Jeff King
  2016-09-07 12:52   ` Johannes Schindelin
  2016-09-07 13:06 ` [RFC/PATCH 0/2] more patch-id speedups Johannes Schindelin
  2016-09-07 22:01 ` [RFC/PATCH v2 0/3] patch-id for merges Jeff King
  3 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2016-09-07  7:54 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Kevin Willford

The patch-ids code which powers "log --cherry-pick" doesn't
look at whether each commit is a merge or not. It just feeds
the commit's first parent to the diff, and ignores any
additional parents.

In theory, this might be useful if you wanted to find
equivalence between, say, a merge commit and a squash-merge
that does the same thing.  But that may also be the wrong
thing; the diffs may be the same, but the meaning of the two
commits is definitely not identical. We should err on the
side of _not_ matching such commits.

Moreover, we may spend a lot of extra time computing these
merge diffs. In the case that inspired this patch, a "git
format-patch --cherry-pick" dropped from over 3 minutes to
less than 4 seconds. This seems pretty drastic, but is
easily explained. The command was invoked by a "git rebase"
of an older topic branch; there had been tens of thousands
of commits on the upstream branch in the meantime. In
addition, this project used a topic-branch workflow with
occasional "back-merges" from "master" to each topic (to
resolve conflicts on the topics rather than in the merge
commits). So there were not only extra merges, but the diffs
for these back-merges were generally quite large (because
they represented _everything_ that had been merged to master
since the topic branched).

This patch just ignores merge commits entirely when
generating patch-ids, meaning they will never be matched
(from either side of a symmetric-diff traversal).

Signed-off-by: Jeff King <peff@peff.net>
---
 patch-ids.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/patch-ids.c b/patch-ids.c
index 77e4663..b1f8514 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -7,10 +7,12 @@
 int commit_patch_id(struct commit *commit, struct diff_options *options,
 		    unsigned char *sha1, int diff_header_only)
 {
-	if (commit->parents)
+	if (commit->parents) {
+		if (commit->parents->next)
+			return 0;
 		diff_tree_sha1(commit->parents->item->object.oid.hash,
 			       commit->object.oid.hash, "", options);
-	else
+	} else
 		diff_root_tree_sha1(commit->object.oid.hash, "", options);
 	diffcore_std(options);
 	return diff_flush_patch_id(options, sha1, diff_header_only);
-- 
2.10.0.rc2.154.gb4a4b8b

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] patch-ids: skip merge commits
  2016-09-07  7:54 ` [PATCH 2/2] patch-ids: skip merge commits Jeff King
@ 2016-09-07 12:52   ` Johannes Schindelin
  2016-09-07 18:46     ` Jeff King
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Schindelin @ 2016-09-07 12:52 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Michael Haggerty, Kevin Willford

Hi Peff,

On Wed, 7 Sep 2016, Jeff King wrote:

> [...]
> This patch just ignores merge commits entirely when
> generating patch-ids, meaning they will never be matched
> (from either side of a symmetric-diff traversal).

... except it does not ignore merge commits:

> diff --git a/patch-ids.c b/patch-ids.c
> index 77e4663..b1f8514 100644
> --- a/patch-ids.c
> +++ b/patch-ids.c
> @@ -7,10 +7,12 @@
>  int commit_patch_id(struct commit *commit, struct diff_options *options,
>  		    unsigned char *sha1, int diff_header_only)
>  {
> -	if (commit->parents)
> +	if (commit->parents) {
> +		if (commit->parents->next)
> +			return 0;
>  		diff_tree_sha1(commit->parents->item->object.oid.hash,
>  			       commit->object.oid.hash, "", options);
> -	else
> +	} else

With this change, commit_patch_id() will return 0 for merge commits
(indicating success) but it will not have touched the sha1! Which means it
may very well have all kinds of crap in the sha1 that may, or may not,
match another, real patch ID randomly.

See e.g. the call site in builtin/log.c's prepare_bases():

	[...]
	if (commit_patch_id(commit, &diffopt, sha1, 0))
		die(_("cannot get patch id"));
	ALLOC_GROW(bases->patch_id, bases->nr_patch_id + 1, bases->alloc_patch_id);
	patch_id = bases->patch_id + bases->nr_patch_id;
	hashcpy(patch_id->hash, sha1);
	bases->nr_patch_id++;

So the sha1 is actually used, even if it was not initialized.

I would suggest to simply copy the merge commit's SHA-1. It is no patch
ID, of course, but collisions are as unlikely as commit name collisions,
and it would make the "patch ID" of a merge commit deterministic again.

I.e. something like

 	if (commit->parents) {
-		if (commit->parents->next)
+		if (commit->parents->next) {
+			hashcpy(sha1, commit->object.oid.hash);
 			return 0;
+		}
   		diff_tree_sha1(commit->parents->item->object.oid.hash,

on top.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] patch-ids: turn off rename detection
  2016-09-07  7:54 ` [PATCH 1/2] patch-ids: turn off rename detection Jeff King
@ 2016-09-07 12:53   ` Johannes Schindelin
  0 siblings, 0 replies; 35+ messages in thread
From: Johannes Schindelin @ 2016-09-07 12:53 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Michael Haggerty, Kevin Willford

Hi Peff,

On Wed, 7 Sep 2016, Jeff King wrote:

> The patch-id code may be running inside another porcelain
> like "git log" or "git format-patch", and therefore may have
> set diff_detect_rename_default, either via the diff-ui
> config, or by default since 5404c11 (diff: activate
> diff.renames by default, 2016-02-25). This is the case even
> if a command is run with `--no-renames`, as that is applied
> only to the diff-options used by the command itself.
> 
> Rename detection doesn't help the patch-id results. It
> _may_ actually hurt, as minor differences in the files that
> would be overlooked by patch-id's canonicalization might
> result in different renames (though I'd doubt that it ever
> comes up in practice).
> 
> But mostly it is just a waste of CPU to compute these
> renames.
> 
> Note that we don't have to worry about compatibility here.
> This patch disables renames just for the internal patch-id
> comparison run by "log --cherry-pick", etc. The user-visible
> "git patch-id" output depends on the patch that it is fed
> (so it is up to the diff generator to use --no-renames if
> they wish).

Sounds obviously good to me.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC/PATCH 0/2] more patch-id speedups
  2016-09-07  7:53 [RFC/PATCH 0/2] more patch-id speedups Jeff King
  2016-09-07  7:54 ` [PATCH 1/2] patch-ids: turn off rename detection Jeff King
  2016-09-07  7:54 ` [PATCH 2/2] patch-ids: skip merge commits Jeff King
@ 2016-09-07 13:06 ` Johannes Schindelin
  2016-09-07 18:49   ` Jeff King
  2016-09-07 22:01 ` [RFC/PATCH v2 0/3] patch-id for merges Jeff King
  3 siblings, 1 reply; 35+ messages in thread
From: Johannes Schindelin @ 2016-09-07 13:06 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Michael Haggerty, Kevin Willford

Hi Peff,

On Wed, 7 Sep 2016, Jeff King wrote:

> Michael and I found a case where the "format-patch --cherry-pick A...B"
> command for a rebase took over 7 minutes to run with git v2.9.3. Yikes.
> 
> Switching to v2.10 dropped that to a bit over 3 minutes (due to the
> kw/patch-ids-optim topic). Better, but not great.

I agree: not great, but better...

> The culprit turned out to be merge commits; the patch-id code will
> happily diff a merge against its first parent, and ignore the rest. This
> _seems_ like a bad idea, but maybe there is something clever going on
> that I don't know about. I couldn't find anything useful in the history,
> and given that this code was adapted from rebase, my guess is that it
> was never really intended to handle merge commits in the first place (of
> course we weren't trying to rebase merge commits; but it has to generate
> patch-ids for everything that happened on "A" to compare against).
> 
> Dropping the computation of the merge commits got it down to about 4
> seconds. I also noticed that it was doing rename detection (which also
> seems like a bad idea). Disabling renames dropped another half second or
> so.

That makes for a really nice improvement!

> This is marked as "RFC" because I don't feel entirely confident that I'm
> not missing some clever need for these options. But in both cases my gut
> feeling is that they are simply unintended effects that nobody ever
> noticed, because it would be very rare that they would affect the
> output. And that if they _did_ affect the output, they would probably be
> doing the wrong thing.

Given that the patch ID is *wrong* for merge commits (it only looks at the
first parent, so each "-s ours" merge will have the same patch ID!), I
would say that we can get away with re-defining the patch ID of merge
commits.

The only case where it might change things that I can think of would be a
`git rebase --preserve-merges`: it would probably have worked *by chance*
before (or not, in case of "-s ours" merges), and now it would try to pick
the merge commits even if rebased versions were already merged upstream.

If I read the --preserve-merges code correctly, that would result in the
merge commit's parents to be 'rewritten' to HEAD. And as both parents
would be rewritten to HEAD, they would be condensed into a single new
parent, resulting in a cherry-pick that fails (because it tries to
cherry-pick a merge commit without any -n option).

Of course, what we could do is to introduce a modifier, e.g.
--cherry-pick=first-parent, that would trigger the old behavior and would
be asked-for in the --preserve-merges mode.

But quite frankly, personally I would not worry about it *that* much. As
you pointed out, the patch ID for merge commits is incorrect to begin
with, and we may just redeclare all merge commits to be incomparable to
one another when it comes to patch IDs.

In short: I would be fine with the change of behavior.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] patch-ids: skip merge commits
  2016-09-07 12:52   ` Johannes Schindelin
@ 2016-09-07 18:46     ` Jeff King
  2016-09-07 22:08       ` Jeff King
  0 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2016-09-07 18:46 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Michael Haggerty, Kevin Willford

On Wed, Sep 07, 2016 at 02:52:04PM +0200, Johannes Schindelin wrote:

> > diff --git a/patch-ids.c b/patch-ids.c
> > index 77e4663..b1f8514 100644
> > --- a/patch-ids.c
> > +++ b/patch-ids.c
> > @@ -7,10 +7,12 @@
> >  int commit_patch_id(struct commit *commit, struct diff_options *options,
> >  		    unsigned char *sha1, int diff_header_only)
> >  {
> > -	if (commit->parents)
> > +	if (commit->parents) {
> > +		if (commit->parents->next)
> > +			return 0;
> >  		diff_tree_sha1(commit->parents->item->object.oid.hash,
> >  			       commit->object.oid.hash, "", options);
> > -	else
> > +	} else
> 
> With this change, commit_patch_id() will return 0 for merge commits
> (indicating success) but it will not have touched the sha1! Which means it
> may very well have all kinds of crap in the sha1 that may, or may not,
> match another, real patch ID randomly.

Eek, thanks. Somehow I got it into my head that diff_flush_patch_id()
below was what added it to the list, but clearly that is not the case.
Looking at it again, I can't imagine how that is the case.

> I would suggest to simply copy the merge commit's SHA-1. It is no patch
> ID, of course, but collisions are as unlikely as commit name collisions,
> and it would make the "patch ID" of a merge commit deterministic again.

I agree that would work, though it does mean carrying extra useless
entries in the patch_id hash. I'll see how bad it would be to simply
omit them entirely, but this seems like a good fallback plan.

Thanks, and sorry for the obviously braindead patch.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC/PATCH 0/2] more patch-id speedups
  2016-09-07 13:06 ` [RFC/PATCH 0/2] more patch-id speedups Johannes Schindelin
@ 2016-09-07 18:49   ` Jeff King
  0 siblings, 0 replies; 35+ messages in thread
From: Jeff King @ 2016-09-07 18:49 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Michael Haggerty, Kevin Willford

On Wed, Sep 07, 2016 at 03:06:36PM +0200, Johannes Schindelin wrote:

> > This is marked as "RFC" because I don't feel entirely confident that I'm
> > not missing some clever need for these options. But in both cases my gut
> > feeling is that they are simply unintended effects that nobody ever
> > noticed, because it would be very rare that they would affect the
> > output. And that if they _did_ affect the output, they would probably be
> > doing the wrong thing.
> 
> Given that the patch ID is *wrong* for merge commits (it only looks at the
> first parent, so each "-s ours" merge will have the same patch ID!), I
> would say that we can get away with re-defining the patch ID of merge
> commits.
> 
> The only case where it might change things that I can think of would be a
> `git rebase --preserve-merges`: it would probably have worked *by chance*
> before (or not, in case of "-s ours" merges), and now it would try to pick
> the merge commits even if rebased versions were already merged upstream.
> 
> If I read the --preserve-merges code correctly, that would result in the
> merge commit's parents to be 'rewritten' to HEAD. And as both parents
> would be rewritten to HEAD, they would be condensed into a single new
> parent, resulting in a cherry-pick that fails (because it tries to
> cherry-pick a merge commit without any -n option).
> 
> Of course, what we could do is to introduce a modifier, e.g.
> --cherry-pick=first-parent, that would trigger the old behavior and would
> be asked-for in the --preserve-merges mode.
> 
> But quite frankly, personally I would not worry about it *that* much. As
> you pointed out, the patch ID for merge commits is incorrect to begin
> with, and we may just redeclare all merge commits to be incomparable to
> one another when it comes to patch IDs.
> 
> In short: I would be fine with the change of behavior.

Thanks for this explanation; it matches what I was thinking, but you
went through it in a lot more detail.

So it sounds like this is the right thing, but as you pointed out, the
implementation is just silly. I'll see if I can come up with a working
v2.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC/PATCH v2 0/3] patch-id for merges
  2016-09-07  7:53 [RFC/PATCH 0/2] more patch-id speedups Jeff King
                   ` (2 preceding siblings ...)
  2016-09-07 13:06 ` [RFC/PATCH 0/2] more patch-id speedups Johannes Schindelin
@ 2016-09-07 22:01 ` Jeff King
  2016-09-07 22:02   ` [PATCH 1/3] patch-ids: turn off rename detection Jeff King
                     ` (4 more replies)
  3 siblings, 5 replies; 35+ messages in thread
From: Jeff King @ 2016-09-07 22:01 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

Here's a re-roll of the series I posted at:

  http://public-inbox.org/git/20160907075346.z6wtmqnfc6bsunjb@sigill.intra.peff.net/

Basically, it drops the time for "format-patch --cherry-pick" on a
particular case from 3 minutes down to 3 seconds, by avoiding diffs
on merge commits. Compared to v1, it fixes the totally-broken handling
of commit_patch_id() pointed out by Johannes.

We can drop the diffs on the merge commits because they're quite broken,
as discussed in the commit message of patch 3 (they don't take into
account any parent except the first). So what do we do when somebody
asks for the patch-id of a merge commit?

This is still marked RFC, because there are really two approaches here,
and I'm not sure which one is better for "format-patch --base". I'd like
to get input from Xiaolong Ye (who worked on --base), and Josh Triplett
(who has proposed some patches in that area, and is presumably using
them).

Option one is that merges are defined as having no patch-id at all. They
are skipped for "--cherry-pick" comparison, and "format-patch --base"
will not mention them at all as prerequisites. That's what I've
implemented here.

Option two is to use the commit sha1 as the patch-id for a merge, making
it (essentially) unique. That gives us a defined value, but it's one
that "--cherry-pick" will not match between two segments of history. I
don't know if having _some_ defined value is useful for "format-patch
--base" or not.

And obviously there's an option 3: define some more complicated patch-id
for merges that takes into account all of the parents. I didn't think
too much on that because I don't really see value in it over using the
commit sha1, and it would be computationally expensive.

  [1/3]: patch-ids: turn off rename detection
  [2/3]: diff_flush_patch_id: stop returning error result
  [3/3]: patch-ids: use commit sha1 as patch-id for merge commits

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/3] patch-ids: turn off rename detection
  2016-09-07 22:01 ` [RFC/PATCH v2 0/3] patch-id for merges Jeff King
@ 2016-09-07 22:02   ` Jeff King
  2016-09-07 22:12     ` Jacob Keller
  2016-09-07 22:04   ` [PATCH 2/3] diff_flush_patch_id: stop returning error result Jeff King
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2016-09-07 22:02 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

The patch-id code may be running inside another porcelain
like "git log" or "git format-patch", and therefore may have
set diff_detect_rename_default, either via the diff-ui
config, or by default since 5404c11 (diff: activate
diff.renames by default, 2016-02-25). This is the case even
if a command is run with `--no-renames`, as that is applied
only to the diff-options used by the command itself.

Rename detection doesn't help the patch-id results. It
_may_ actually hurt, as minor differences in the files that
would be overlooked by patch-id's canonicalization might
result in different renames (though I'd doubt that it ever
comes up in practice).

But mostly it is just a waste of CPU to compute these
renames.

Note that this does have one user-visible impact: the
prerequisite patches listed by "format-patch --base". There
may be some confusion between different versions of git as
older ones will enable renames, but newer ones will not.
However, this was already a problem, as people with
different settings for the "diff.renames" config would get
different results. After this patch, everyone should get the
same results, regardless of their config.

Signed-off-by: Jeff King <peff@peff.net>
---
The patch is the same as v1, but the commit message is modified, as I
realized that "--base" does expose this value publicly (but as I argue
above, this is probably an improvement).

 patch-ids.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/patch-ids.c b/patch-ids.c
index 082412a..77e4663 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -45,6 +45,7 @@ int init_patch_ids(struct patch_ids *ids)
 {
 	memset(ids, 0, sizeof(*ids));
 	diff_setup(&ids->diffopts);
+	ids->diffopts.detect_rename = 0;
 	DIFF_OPT_SET(&ids->diffopts, RECURSIVE);
 	diff_setup_done(&ids->diffopts);
 	hashmap_init(&ids->patches, (hashmap_cmp_fn)patch_id_cmp, 256);
-- 
2.10.0.rc2.154.gb4a4b8b

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/3] diff_flush_patch_id: stop returning error result
  2016-09-07 22:01 ` [RFC/PATCH v2 0/3] patch-id for merges Jeff King
  2016-09-07 22:02   ` [PATCH 1/3] patch-ids: turn off rename detection Jeff King
@ 2016-09-07 22:04   ` Jeff King
  2016-09-08  0:51     ` Ramsay Jones
  2016-09-09 10:28     ` Johannes Schindelin
  2016-09-07 22:04   ` [PATCH 3/3] patch-ids: use commit sha1 as patch-id for merge commits Jeff King
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 35+ messages in thread
From: Jeff King @ 2016-09-07 22:04 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

All of our errors come from diff_get_patch_id(), which has
exactly three error conditions. The first is an internal
assertion, which should be a die("BUG") in the first place.

The other two are caused by an inability to two diff blobs,
which is an indication of a serious problem (probably
repository corruption). All the rest of the diff subsystem
dies immediately on these conditions. By passing up the
error, in theory we can keep going even if patch-id is
unable to function. But in practice this means we may
generate subtly wrong results (e.g., by failing to correlate
two commits). Let's just die(), as we're better off making
it clear to the user that their repository is not
functional.

As a result, we can simplify the calling code.

Signed-off-by: Jeff King <peff@peff.net>
---
This is a prerequisite for patch 3, since it means that
commit_patch_id() stops returning "real" errors. But obviously if this
is distasteful (and it does feel a little weird to convert error() to
die(), even though the rest of the diff code-base behaves this way), we
can teach commit_patch_id() to distinguish between "this has no
patch-id" and "a real error occured" in its return value.

 diff.c      | 18 ++++++++----------
 diff.h      |  2 +-
 patch-ids.c |  3 ++-
 3 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/diff.c b/diff.c
index 534c12e..d0594f6 100644
--- a/diff.c
+++ b/diff.c
@@ -4462,7 +4462,7 @@ static void patch_id_consume(void *priv, char *line, unsigned long len)
 }
 
 /* returns 0 upon success, and writes result into sha1 */
-static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1, int diff_header_only)
+static void diff_get_patch_id(struct diff_options *options, unsigned char *sha1, int diff_header_only)
 {
 	struct diff_queue_struct *q = &diff_queued_diff;
 	int i;
@@ -4484,7 +4484,7 @@ static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1,
 		memset(&xpp, 0, sizeof(xpp));
 		memset(&xecfg, 0, sizeof(xecfg));
 		if (p->status == 0)
-			return error("internal diff status error");
+			die("BUG: diff status unset while computing patch_id");
 		if (p->status == DIFF_STATUS_UNKNOWN)
 			continue;
 		if (diff_unmodified_pair(p))
@@ -4536,7 +4536,7 @@ static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1,
 
 		if (fill_mmfile(&mf1, p->one) < 0 ||
 		    fill_mmfile(&mf2, p->two) < 0)
-			return error("unable to read files to diff");
+			die("unable to read files to diff");
 
 		if (diff_filespec_is_binary(p->one) ||
 		    diff_filespec_is_binary(p->two)) {
@@ -4552,27 +4552,25 @@ static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1,
 		xecfg.flags = 0;
 		if (xdi_diff_outf(&mf1, &mf2, patch_id_consume, &data,
 				  &xpp, &xecfg))
-			return error("unable to generate patch-id diff for %s",
-				     p->one->path);
+			die("unable to generate patch-id diff for %s",
+			    p->one->path);
 	}
 
 	git_SHA1_Final(sha1, &ctx);
-	return 0;
 }
 
-int diff_flush_patch_id(struct diff_options *options, unsigned char *sha1, int diff_header_only)
+void diff_flush_patch_id(struct diff_options *options, unsigned char *sha1, int diff_header_only)
 {
 	struct diff_queue_struct *q = &diff_queued_diff;
 	int i;
-	int result = diff_get_patch_id(options, sha1, diff_header_only);
+
+	diff_get_patch_id(options, sha1, diff_header_only);
 
 	for (i = 0; i < q->nr; i++)
 		diff_free_filepair(q->queue[i]);
 
 	free(q->queue);
 	DIFF_QUEUE_CLEAR(q);
-
-	return result;
 }
 
 static int is_summary_empty(const struct diff_queue_struct *q)
diff --git a/diff.h b/diff.h
index 7883729..f4dcfe1 100644
--- a/diff.h
+++ b/diff.h
@@ -342,7 +342,7 @@ extern int run_diff_files(struct rev_info *revs, unsigned int option);
 extern int run_diff_index(struct rev_info *revs, int cached);
 
 extern int do_diff_cache(const unsigned char *, struct diff_options *);
-extern int diff_flush_patch_id(struct diff_options *, unsigned char *, int);
+extern void diff_flush_patch_id(struct diff_options *, unsigned char *, int);
 
 extern int diff_result_code(struct diff_options *, int);
 
diff --git a/patch-ids.c b/patch-ids.c
index 77e4663..0e95220 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -13,7 +13,8 @@ int commit_patch_id(struct commit *commit, struct diff_options *options,
 	else
 		diff_root_tree_sha1(commit->object.oid.hash, "", options);
 	diffcore_std(options);
-	return diff_flush_patch_id(options, sha1, diff_header_only);
+	diff_flush_patch_id(options, sha1, diff_header_only);
+	return 0;
 }
 
 /*
-- 
2.10.0.rc2.154.gb4a4b8b


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 3/3] patch-ids: use commit sha1 as patch-id for merge commits
  2016-09-07 22:01 ` [RFC/PATCH v2 0/3] patch-id for merges Jeff King
  2016-09-07 22:02   ` [PATCH 1/3] patch-ids: turn off rename detection Jeff King
  2016-09-07 22:04   ` [PATCH 2/3] diff_flush_patch_id: stop returning error result Jeff King
@ 2016-09-07 22:04   ` Jeff King
  2016-09-07 22:28     ` Jacob Keller
  2016-09-07 22:51   ` [RFC/PATCH v2 0/3] patch-id for merges Josh Triplett
  2016-09-09 20:34   ` [PATCH v3 0/2] " Jeff King
  4 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2016-09-07 22:04 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

The patch-ids code which powers "log --cherry-pick" doesn't
look at whether each commit is a merge or not. It just feeds
the commit's first parent to the diff, and ignores any
additional parents.

In theory, this might be useful if you wanted to find
equivalence between, say, a merge commit and a squash-merge
that does the same thing.  But it also promotes a false
equivalence between distinct merges; for example, every
"merge -s ours" would look like an empty commit (which is
true in a sense, but presumably there was a value in merging
in the discarded history). Since patch-ids are meant for
throwing away duplicates, we should err on the side of _not_
matching such merges.

Moreover, we may spend a lot of extra time computing these
merge diffs. In the case that inspired this patch, a "git
format-patch --cherry-pick" dropped from over 3 minutes to
less than 4 seconds.

This seems pretty drastic, but is easily explained. The
command was invoked by a "git rebase" of an older topic
branch; there had been tens of thousands of commits on the
upstream branch in the meantime. In addition, this project
used a topic-branch workflow with occasional "back-merges"
from "master" to each topic (to resolve conflicts on the
topics rather than in the merge commits). So there were not
only extra merges, but the diffs for these back-merges were
generally quite large (because they represented _everything_
that had been merged to master since the topic branched).

This patch defines the patch-id of a merge commit as
essentially "null"; it has no patch-id. As a result,
merges cannot match patch-ids via "--cherry-pick", and
"format-patch --base" will not list merges in its list of
prerequisite patch ids.

We can signal the null patch-id by returning "-1" from
commit_patch_id(), as it no longer returns any meaningful
errors (and as a result, its callers are updated to handle
"-1" differently).

Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/log.c |  2 +-
 patch-ids.c   | 14 +++++++-------
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/builtin/log.c b/builtin/log.c
index 92dc34d..e660034 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -1344,7 +1344,7 @@ static void prepare_bases(struct base_tree_info *bases,
 		if (commit->util)
 			continue;
 		if (commit_patch_id(commit, &diffopt, sha1, 0))
-			die(_("cannot get patch id"));
+			continue;
 		ALLOC_GROW(bases->patch_id, bases->nr_patch_id + 1, bases->alloc_patch_id);
 		patch_id = bases->patch_id + bases->nr_patch_id;
 		hashcpy(patch_id->hash, sha1);
diff --git a/patch-ids.c b/patch-ids.c
index 0e95220..197c70b 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -7,10 +7,12 @@
 int commit_patch_id(struct commit *commit, struct diff_options *options,
 		    unsigned char *sha1, int diff_header_only)
 {
-	if (commit->parents)
+	if (commit->parents) {
+		if (commit->parents->next)
+			return -1;
 		diff_tree_sha1(commit->parents->item->object.oid.hash,
 			       commit->object.oid.hash, "", options);
-	else
+	} else
 		diff_root_tree_sha1(commit->object.oid.hash, "", options);
 	diffcore_std(options);
 	diff_flush_patch_id(options, sha1, diff_header_only);
@@ -19,7 +21,7 @@ int commit_patch_id(struct commit *commit, struct diff_options *options,
 
 /*
  * When we cannot load the full patch-id for both commits for whatever
- * reason, the function returns -1 (i.e. return error(...)). Despite
+ * reason, the function returns -1. Despite
  * the "cmp" in the name of this function, the caller only cares about
  * the return value being zero (a and b are equivalent) or non-zero (a
  * and b are different), and returning non-zero would keep both in the
@@ -33,12 +35,10 @@ static int patch_id_cmp(struct patch_id *a,
 {
 	if (is_null_sha1(a->patch_id) &&
 	    commit_patch_id(a->commit, opt, a->patch_id, 0))
-		return error("Could not get patch ID for %s",
-			oid_to_hex(&a->commit->object.oid));
+		return -1;
 	if (is_null_sha1(b->patch_id) &&
 	    commit_patch_id(b->commit, opt, b->patch_id, 0))
-		return error("Could not get patch ID for %s",
-			oid_to_hex(&b->commit->object.oid));
+		return -1;
 	return hashcmp(a->patch_id, b->patch_id);
 }
 
-- 
2.10.0.rc2.154.gb4a4b8b

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/2] patch-ids: skip merge commits
  2016-09-07 18:46     ` Jeff King
@ 2016-09-07 22:08       ` Jeff King
  0 siblings, 0 replies; 35+ messages in thread
From: Jeff King @ 2016-09-07 22:08 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Michael Haggerty, Kevin Willford

On Wed, Sep 07, 2016 at 02:46:53PM -0400, Jeff King wrote:

> > With this change, commit_patch_id() will return 0 for merge commits
> > (indicating success) but it will not have touched the sha1! Which means it
> > may very well have all kinds of crap in the sha1 that may, or may not,
> > match another, real patch ID randomly.
> 
> Eek, thanks. Somehow I got it into my head that diff_flush_patch_id()
> below was what added it to the list, but clearly that is not the case.
> Looking at it again, I can't imagine how that is the case.

Ah, I see. I initially was looking at an older git (v2.6.x), in which
commit_patch_id() is a static function inside patch-ids.c, and we do not
do any lazy-load trickery. But note that the patch is still wrong even
there; it should return "-1" from commit_patch_id() to instruct the
caller not to add it to the hash.

Anyway...

> > I would suggest to simply copy the merge commit's SHA-1. It is no patch
> > ID, of course, but collisions are as unlikely as commit name collisions,
> > and it would make the "patch ID" of a merge commit deterministic again.
> 
> I agree that would work, though it does mean carrying extra useless
> entries in the patch_id hash. I'll see how bad it would be to simply
> omit them entirely, but this seems like a good fallback plan.

It's not too hard to do so, but it raises a question of what
"format-patch --base" would want. I've just sent another RFC cc-ing
folks interested in that area.

Thanks again for the review.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] patch-ids: turn off rename detection
  2016-09-07 22:02   ` [PATCH 1/3] patch-ids: turn off rename detection Jeff King
@ 2016-09-07 22:12     ` Jacob Keller
  0 siblings, 0 replies; 35+ messages in thread
From: Jacob Keller @ 2016-09-07 22:12 UTC (permalink / raw)
  To: Jeff King
  Cc: Git mailing list, Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

On Wed, Sep 7, 2016 at 3:02 PM, Jeff King <peff@peff.net> wrote:
> The patch-id code may be running inside another porcelain
> like "git log" or "git format-patch", and therefore may have
> set diff_detect_rename_default, either via the diff-ui
> config, or by default since 5404c11 (diff: activate
> diff.renames by default, 2016-02-25). This is the case even
> if a command is run with `--no-renames`, as that is applied
> only to the diff-options used by the command itself.
>
> Rename detection doesn't help the patch-id results. It
> _may_ actually hurt, as minor differences in the files that
> would be overlooked by patch-id's canonicalization might
> result in different renames (though I'd doubt that it ever
> comes up in practice).
>
> But mostly it is just a waste of CPU to compute these
> renames.

Yes this seems reasonable.

>
> Note that this does have one user-visible impact: the
> prerequisite patches listed by "format-patch --base". There
> may be some confusion between different versions of git as
> older ones will enable renames, but newer ones will not.
> However, this was already a problem, as people with
> different settings for the "diff.renames" config would get
> different results. After this patch, everyone should get the
> same results, regardless of their config.

Makes sense.

>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> The patch is the same as v1, but the commit message is modified, as I
> realized that "--base" does expose this value publicly (but as I argue
> above, this is probably an improvement).

I agree this is an improvement.

Thanks,
Jake

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] patch-ids: use commit sha1 as patch-id for merge commits
  2016-09-07 22:04   ` [PATCH 3/3] patch-ids: use commit sha1 as patch-id for merge commits Jeff King
@ 2016-09-07 22:28     ` Jacob Keller
  2016-09-07 22:38       ` Jeff King
  0 siblings, 1 reply; 35+ messages in thread
From: Jacob Keller @ 2016-09-07 22:28 UTC (permalink / raw)
  To: Jeff King
  Cc: Git mailing list, Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

On Wed, Sep 7, 2016 at 3:04 PM, Jeff King <peff@peff.net> wrote:
> The patch-ids code which powers "log --cherry-pick" doesn't
> look at whether each commit is a merge or not. It just feeds
> the commit's first parent to the diff, and ignores any
> additional parents.
>

The subject here is misleading since it says you will use sha1 of the
merge commit ,but instead just ignore merge commits and indicate they
have no patch id. I suspect this is because you switched
implementations part way through developing this.

Thanks,
Jake

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] patch-ids: use commit sha1 as patch-id for merge commits
  2016-09-07 22:28     ` Jacob Keller
@ 2016-09-07 22:38       ` Jeff King
  0 siblings, 0 replies; 35+ messages in thread
From: Jeff King @ 2016-09-07 22:38 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Git mailing list, Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

On Wed, Sep 07, 2016 at 03:28:10PM -0700, Jacob Keller wrote:

> On Wed, Sep 7, 2016 at 3:04 PM, Jeff King <peff@peff.net> wrote:
> > The patch-ids code which powers "log --cherry-pick" doesn't
> > look at whether each commit is a merge or not. It just feeds
> > the commit's first parent to the diff, and ignores any
> > additional parents.
> >
> 
> The subject here is misleading since it says you will use sha1 of the
> merge commit ,but instead just ignore merge commits and indicate they
> have no patch id. I suspect this is because you switched
> implementations part way through developing this.

Oops, yes, that's exactly what happened. I'll wait for discussion to
settle on the approach and fix it up in the re-roll.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC/PATCH v2 0/3] patch-id for merges
  2016-09-07 22:01 ` [RFC/PATCH v2 0/3] patch-id for merges Jeff King
                     ` (2 preceding siblings ...)
  2016-09-07 22:04   ` [PATCH 3/3] patch-ids: use commit sha1 as patch-id for merge commits Jeff King
@ 2016-09-07 22:51   ` Josh Triplett
  2016-09-08  7:30     ` Jeff King
  2016-09-09 20:34   ` [PATCH v3 0/2] " Jeff King
  4 siblings, 1 reply; 35+ messages in thread
From: Josh Triplett @ 2016-09-07 22:51 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin

On Wed, Sep 07, 2016 at 06:01:01PM -0400, Jeff King wrote:
> Here's a re-roll of the series I posted at:
> 
>   http://public-inbox.org/git/20160907075346.z6wtmqnfc6bsunjb@sigill.intra.peff.net/
> 
> Basically, it drops the time for "format-patch --cherry-pick" on a
> particular case from 3 minutes down to 3 seconds, by avoiding diffs
> on merge commits. Compared to v1, it fixes the totally-broken handling
> of commit_patch_id() pointed out by Johannes.
> 
> We can drop the diffs on the merge commits because they're quite broken,
> as discussed in the commit message of patch 3 (they don't take into
> account any parent except the first). So what do we do when somebody
> asks for the patch-id of a merge commit?
> 
> This is still marked RFC, because there are really two approaches here,
> and I'm not sure which one is better for "format-patch --base". I'd like
> to get input from Xiaolong Ye (who worked on --base), and Josh Triplett
> (who has proposed some patches in that area, and is presumably using
> them).

Thanks.

I'd love to see a more resilient patch-id mechanism, to make it easier
to match up patches between branches.  I don't think it makes sense to
talk about the patch-id of a merge commit (though it might make sense
for a merge which makes additional changes not present in any of the
parents).  Even if someone wants to match up merge commits with merge
commits, I don't think that should happen via patch-id; I think that
should happen in terms of "what patches does this merge introduce",
without constructing a merge-patch-id via a Merkle tree of commit
patch-ids.

So, I think this patch series makes sense (modulo the comments about the
commit message in patch 3).  We already don't respect merge commits when
doing format-patch; this seems consistent with that.  If we ever make it
possible for format-patch to handle merge commits, then we should also
allow it to have merge commits as prerequisites.

- Josh Triplett

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/3] diff_flush_patch_id: stop returning error result
  2016-09-07 22:04   ` [PATCH 2/3] diff_flush_patch_id: stop returning error result Jeff King
@ 2016-09-08  0:51     ` Ramsay Jones
  2016-09-08  7:20       ` Jeff King
  2016-09-09 10:28     ` Johannes Schindelin
  1 sibling, 1 reply; 35+ messages in thread
From: Ramsay Jones @ 2016-09-08  0:51 UTC (permalink / raw)
  To: Jeff King, git
  Cc: Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett



On 07/09/16 23:04, Jeff King wrote:
> All of our errors come from diff_get_patch_id(), which has
> exactly three error conditions. The first is an internal
> assertion, which should be a die("BUG") in the first place.
> 
> The other two are caused by an inability to two diff blobs,
                                           ^^^^^^^^^^^^^^^^^
Huh? ... to diff two blobs?

ATB,
Ramsay Jones

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/3] diff_flush_patch_id: stop returning error result
  2016-09-08  0:51     ` Ramsay Jones
@ 2016-09-08  7:20       ` Jeff King
  0 siblings, 0 replies; 35+ messages in thread
From: Jeff King @ 2016-09-08  7:20 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: git, Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

On Thu, Sep 08, 2016 at 01:51:05AM +0100, Ramsay Jones wrote:

> 
> 
> On 07/09/16 23:04, Jeff King wrote:
> > All of our errors come from diff_get_patch_id(), which has
> > exactly three error conditions. The first is an internal
> > assertion, which should be a die("BUG") in the first place.
> > 
> > The other two are caused by an inability to two diff blobs,
>                                            ^^^^^^^^^^^^^^^^^
> Huh? ... to diff two blobs?

Sorry. English my getting worse be to seems.

Will fix in a re-roll.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC/PATCH v2 0/3] patch-id for merges
  2016-09-07 22:51   ` [RFC/PATCH v2 0/3] patch-id for merges Josh Triplett
@ 2016-09-08  7:30     ` Jeff King
  0 siblings, 0 replies; 35+ messages in thread
From: Jeff King @ 2016-09-08  7:30 UTC (permalink / raw)
  To: Josh Triplett
  Cc: git, Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin

On Wed, Sep 07, 2016 at 03:51:04PM -0700, Josh Triplett wrote:

> > This is still marked RFC, because there are really two approaches here,
> > and I'm not sure which one is better for "format-patch --base". I'd like
> > to get input from Xiaolong Ye (who worked on --base), and Josh Triplett
> > (who has proposed some patches in that area, and is presumably using
> > them).
> 
> Thanks.
> 
> I'd love to see a more resilient patch-id mechanism, to make it easier
> to match up patches between branches.  I don't think it makes sense to
> talk about the patch-id of a merge commit (though it might make sense
> for a merge which makes additional changes not present in any of the
> parents).  Even if someone wants to match up merge commits with merge
> commits, I don't think that should happen via patch-id; I think that
> should happen in terms of "what patches does this merge introduce",
> without constructing a merge-patch-id via a Merkle tree of commit
> patch-ids.
> 
> So, I think this patch series makes sense (modulo the comments about the
> commit message in patch 3).  We already don't respect merge commits when
> doing format-patch; this seems consistent with that.  If we ever make it
> possible for format-patch to handle merge commits, then we should also
> allow it to have merge commits as prerequisites.

Thanks for the input. I knew that format-patch doesn't show merge
commits, but I didn't realize that merges were skipped entirely for the
base preparation (but I see it now; there is a "rev.max_parents = 1"
setting in prepare_bases).

So this really doesn't change the output there at all. And in fact, the
switch to:

   if (commit_patch_id(commit, &diffopt, sha1, 0))
  -         die(_("cannot get patch id"))
  +         continue;

should never hit that continue. It could be:

    die("BUG: somehow a merge got fed to commit_patch_id?");

but the "continue" somehow seems like the right thing to me.

I'll wait another day or so for comments and then send the re-roll using
this approach.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/3] diff_flush_patch_id: stop returning error result
  2016-09-07 22:04   ` [PATCH 2/3] diff_flush_patch_id: stop returning error result Jeff King
  2016-09-08  0:51     ` Ramsay Jones
@ 2016-09-09 10:28     ` Johannes Schindelin
  2016-09-09 10:40       ` Jeff King
  1 sibling, 1 reply; 35+ messages in thread
From: Johannes Schindelin @ 2016-09-09 10:28 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Michael Haggerty, Kevin Willford, Xiaolong Ye, Josh Triplett

Hi Peff,

On Wed, 7 Sep 2016, Jeff King wrote:

> All of our errors come from diff_get_patch_id(), which has
> exactly three error conditions. The first is an internal
> assertion, which should be a die("BUG") in the first place.
> 
> The other two are caused by an inability to two diff blobs,
> which is an indication of a serious problem (probably
> repository corruption). All the rest of the diff subsystem
> dies immediately on these conditions. By passing up the
> error, in theory we can keep going even if patch-id is
> unable to function. But in practice this means we may
> generate subtly wrong results (e.g., by failing to correlate
> two commits). Let's just die(), as we're better off making
> it clear to the user that their repository is not
> functional.
> 
> As a result, we can simplify the calling code.

I like the simplification, but I *hate* the fact that the calling code has
*no way* to inform the user about the proper next steps.

You are touching code that is really quite at the bottom of a lot of call
chains. For example in the one of `git pull --rebase`. I just spent an
insane amount of time trying to make sure that this command will not
simply die() somewhere deep in the code, leaving the user puzzled.

Please see 3be18b4 (t5520: verify that `pull --rebase` shows the helpful
advice when failing, 2016-07-26) for more details.

A much better way, in my opinion, would be to introduce a new flag, say,
skip_merges, and pass that to the diff_flush_patch_id() function. You
could also consider consolidating that flag with the diff_header_only flag
into a "flags" argument via something like

	enum diff_flush_patch_id {
		DIFF_HEADER_ONLY = 1,
		SKIP_MERGES = 2
	}

But it is definitely not a good idea to reintroduce the bad practice of
die()ing deep down in library code. I know, you want proper exception
handling. We cannot have that. We use C. But die() is not a solution: it
introduces new problems.

Mind you: I agree that there are serious problems in the cases you
illustrated. But none of those problems give us license to leave the user
utterly puzzled by not even telling them what is going on: spouting
internals such as "unable to read files to diff" is *most definitely* not
helping users who simply want to run a `git pull --rebase`.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/3] diff_flush_patch_id: stop returning error result
  2016-09-09 10:28     ` Johannes Schindelin
@ 2016-09-09 10:40       ` Jeff King
  2016-09-09 12:58         ` Johannes Schindelin
  0 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2016-09-09 10:40 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: git, Michael Haggerty, Kevin Willford, Xiaolong Ye, Josh Triplett

On Fri, Sep 09, 2016 at 12:28:38PM +0200, Johannes Schindelin wrote:

> I like the simplification, but I *hate* the fact that the calling code has
> *no way* to inform the user about the proper next steps.
> 
> You are touching code that is really quite at the bottom of a lot of call
> chains. For example in the one of `git pull --rebase`. I just spent an
> insane amount of time trying to make sure that this command will not
> simply die() somewhere deep in the code, leaving the user puzzled.
> 
> Please see 3be18b4 (t5520: verify that `pull --rebase` shows the helpful
> advice when failing, 2016-07-26) for more details.

Yes, I agree that this is the opposite direction of libification. And I
agree that the current message is not very helpful.

But I am not sure that returning the error up the stack will actually
help somebody move forward. The reason these are all die() calls in the
rest of the diff code is that they are generally indicative of
unrecoverable repository corruption. So any advice does not really
depend on what operation you are performing; it is always "stop what you
are doing immediately, run fsck, and try to get the broken objects from
somebody else".

So IMHO, on balance this is not hurting anything.

> A much better way, in my opinion, would be to introduce a new flag, say,
> skip_merges, and pass that to the diff_flush_patch_id() function. You
> could also consider consolidating that flag with the diff_header_only flag
> into a "flags" argument via something like

diff_flush_patch_id() doesn't care about merges; that's too late. The
change has to happen in commit_patch_id(). And the problem is not one of
passing in "skip merges" (we _always_ want to skip merges). It is rather
distinguishing the reason that commit_patch_id() told us it did not fill
in the sha1: because it was an error, or because the patch id is
undefined (one triggers a die(), the other a silent continue).

I think I laid out that path already in the cover letter of the
original. If the consensus is that this is too ugly, I can implement
that approach.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/3] diff_flush_patch_id: stop returning error result
  2016-09-09 10:40       ` Jeff King
@ 2016-09-09 12:58         ` Johannes Schindelin
  2016-09-09 19:37           ` Jeff King
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Schindelin @ 2016-09-09 12:58 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Michael Haggerty, Kevin Willford, Xiaolong Ye, Josh Triplett

Hi Peff,

On Fri, 9 Sep 2016, Jeff King wrote:

> On Fri, Sep 09, 2016 at 12:28:38PM +0200, Johannes Schindelin wrote:
> 
> > I like the simplification, but I *hate* the fact that the calling code has
> > *no way* to inform the user about the proper next steps.
> > 
> > You are touching code that is really quite at the bottom of a lot of call
> > chains. For example in the one of `git pull --rebase`. I just spent an
> > insane amount of time trying to make sure that this command will not
> > simply die() somewhere deep in the code, leaving the user puzzled.
> > 
> > Please see 3be18b4 (t5520: verify that `pull --rebase` shows the helpful
> > advice when failing, 2016-07-26) for more details.
> 
> Yes, I agree that this is the opposite direction of libification. And I
> agree that the current message is not very helpful.
> 
> But I am not sure that returning the error up the stack will actually
> help somebody move forward. The reason these are all die() calls in the
> rest of the diff code is that they are generally indicative of
> unrecoverable repository corruption. So any advice does not really
> depend on what operation you are performing; it is always "stop what you
> are doing immediately, run fsck, and try to get the broken objects from
> somebody else".
> 
> So IMHO, on balance this is not hurting anything.

Well, you make such a situation even worse than it already is.

It would be one thing to change the code to actually say "stop what you
are doing immediately, run `git fsck` and try to get the broken objects
from somewhere else", *before* saying how to proceed after that.

But that is not what your patch does.

What your patch does is to remove *even the possibility* of saying how to
proceed after getting the repository corruption fixed. And instead of
saying how the corruption could be fixed, it outputs a terse "cannot read
files to diff".

I do not think that is a wise direction.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/3] diff_flush_patch_id: stop returning error result
  2016-09-09 12:58         ` Johannes Schindelin
@ 2016-09-09 19:37           ` Jeff King
  0 siblings, 0 replies; 35+ messages in thread
From: Jeff King @ 2016-09-09 19:37 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: git, Michael Haggerty, Kevin Willford, Xiaolong Ye, Josh Triplett

On Fri, Sep 09, 2016 at 02:58:25PM +0200, Johannes Schindelin wrote:

> > Yes, I agree that this is the opposite direction of libification. And I
> > agree that the current message is not very helpful.
> > 
> > But I am not sure that returning the error up the stack will actually
> > help somebody move forward. The reason these are all die() calls in the
> > rest of the diff code is that they are generally indicative of
> > unrecoverable repository corruption. So any advice does not really
> > depend on what operation you are performing; it is always "stop what you
> > are doing immediately, run fsck, and try to get the broken objects from
> > somebody else".
> > 
> > So IMHO, on balance this is not hurting anything.
> 
> Well, you make such a situation even worse than it already is.
> 
> It would be one thing to change the code to actually say "stop what you
> are doing immediately, run `git fsck` and try to get the broken objects
> from somewhere else", *before* saying how to proceed after that.
> 
> But that is not what your patch does.
> 
> What your patch does is to remove *even the possibility* of saying how to
> proceed after getting the repository corruption fixed. And instead of
> saying how the corruption could be fixed, it outputs a terse "cannot read
> files to diff".
> 
> I do not think that is a wise direction.

First, do not blame me for the terse "cannot read files to diff". That
is the current message. And my patch does not make changing that message
any more difficult. You are welcome to change it in its error() form.
You are welcome to change it in the resulting die().

The quality of that message is totally orthogonal to what the patch is
doing.

The _only_ thing it is losing is the ability to for the caller to then
additionally say "once you have finished uncorrupting the repository,
you can resume your operation with ...".

My point is that this is not useful advice. No callers give it, and I
don't foresee other callers giving it. My argument above was basically
that it is such an exceptional condition it is not worth worrying about.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v3 0/2] patch-id for merges
  2016-09-07 22:01 ` [RFC/PATCH v2 0/3] patch-id for merges Jeff King
                     ` (3 preceding siblings ...)
  2016-09-07 22:51   ` [RFC/PATCH v2 0/3] patch-id for merges Josh Triplett
@ 2016-09-09 20:34   ` Jeff King
  2016-09-09 20:34     ` [PATCH v3 1/2] patch-ids: turn off rename detection Jeff King
                       ` (3 more replies)
  4 siblings, 4 replies; 35+ messages in thread
From: Jeff King @ 2016-09-09 20:34 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

On Wed, Sep 07, 2016 at 06:01:01PM -0400, Jeff King wrote:

> Here's a re-roll of the series I posted at:
> 
>   http://public-inbox.org/git/20160907075346.z6wtmqnfc6bsunjb@sigill.intra.peff.net/
> 
> Basically, it drops the time for "format-patch --cherry-pick" on a
> particular case from 3 minutes down to 3 seconds, by avoiding diffs
> on merge commits. Compared to v1, it fixes the totally-broken handling
> of commit_patch_id() pointed out by Johannes.
>
>   [1/3]: patch-ids: turn off rename detection
>   [2/3]: diff_flush_patch_id: stop returning error result
>   [3/3]: patch-ids: use commit sha1 as patch-id for merge commits

And here is v3. Besides commit-message fixups, it drops patch 2, and
instead the third patch teaches commit_patch_id() to distinguish between
errors and "no patch id".

Frankly, I still like v2 better, but I do not feel like arguing with
Johannes about it anymore.

  [1/2]: patch-ids: turn off rename detection
  [2/2]: patch-ids: define patch-id of merge commits as "null"

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v3 1/2] patch-ids: turn off rename detection
  2016-09-09 20:34   ` [PATCH v3 0/2] " Jeff King
@ 2016-09-09 20:34     ` Jeff King
  2016-09-09 20:34     ` [PATCH v3 2/2] patch-ids: define patch-id of merge commits as "null" Jeff King
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 35+ messages in thread
From: Jeff King @ 2016-09-09 20:34 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

The patch-id code may be running inside another porcelain
like "git log" or "git format-patch", and therefore may have
set diff_detect_rename_default, either via the diff-ui
config, or by default since 5404c11 (diff: activate
diff.renames by default, 2016-02-25). This is the case even
if a command is run with `--no-renames`, as that is applied
only to the diff-options used by the command itself.

Rename detection doesn't help the patch-id results. It
_may_ actually hurt, as minor differences in the files that
would be overlooked by patch-id's canonicalization might
result in different renames (though I'd doubt that it ever
comes up in practice).

But mostly it is just a waste of CPU to compute these
renames.

Note that this does have one user-visible impact: the
prerequisite patches listed by "format-patch --base". There
may be some confusion between different versions of git as
older ones will enable renames, but newer ones will not.
However, this was already a problem, as people with
different settings for the "diff.renames" config would get
different results. After this patch, everyone should get the
same results, regardless of their config.

Signed-off-by: Jeff King <peff@peff.net>
---
 patch-ids.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/patch-ids.c b/patch-ids.c
index 082412a..77e4663 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -45,6 +45,7 @@ int init_patch_ids(struct patch_ids *ids)
 {
 	memset(ids, 0, sizeof(*ids));
 	diff_setup(&ids->diffopts);
+	ids->diffopts.detect_rename = 0;
 	DIFF_OPT_SET(&ids->diffopts, RECURSIVE);
 	diff_setup_done(&ids->diffopts);
 	hashmap_init(&ids->patches, (hashmap_cmp_fn)patch_id_cmp, 256);
-- 
2.10.0.161.gdb62534

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 2/2] patch-ids: define patch-id of merge commits as "null"
  2016-09-09 20:34   ` [PATCH v3 0/2] " Jeff King
  2016-09-09 20:34     ` [PATCH v3 1/2] patch-ids: turn off rename detection Jeff King
@ 2016-09-09 20:34     ` Jeff King
  2016-09-09 20:37       ` Jeff King
  2016-09-09 21:13       ` Junio C Hamano
  2016-09-09 21:01     ` [PATCH v3 0/2] patch-id for merges Junio C Hamano
  2016-09-25 18:25     ` Johannes Schindelin
  3 siblings, 2 replies; 35+ messages in thread
From: Jeff King @ 2016-09-09 20:34 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

The patch-id code which powers "log --cherry-pick" doesn't
look at whether each commit is a merge or not. It just feeds
the commit's first parent to the diff, and ignores any
additional parents.

In theory, this might be useful if you wanted to find
equivalence between, say, a merge commit and a squash-merge
that does the same thing.  But it also promotes a false
equivalence between distinct merges. For example, every
"merge -s ours" would look like identical to an empty commit
(which is true in a sense, but presumably there was a value
in merging in the discarded history). Since patch-ids are
meant for throwing away duplicates, we should err on the
side of _not_ matching such merges.

Moreover, we may spend a lot of extra time computing these
merge diffs. In the case that inspired this patch, a "git
format-patch --cherry-pick" dropped from over 3 minutes to
less than 4 seconds.

This seems pretty drastic, but is easily explained. The
command was invoked by a "git rebase" of an older topic
branch; there had been tens of thousands of commits on the
upstream branch in the meantime. In addition, this project
used a topic-branch workflow with occasional "back-merges"
from "master" to each topic (to resolve conflicts on the
topics rather than in the merge commits). So there were not
only extra merges, but the diffs for these back-merges were
generally quite large (because they represented _everything_
that had been merged to master since the topic branched).

This patch defines the patch-id of a merge commit as
essentially "null"; it has no patch-id. As a result,
merges cannot match patch-ids via "--cherry-pick", and
"format-patch --base" will not list merges in its list of
prerequisite patch ids.

To distinguish between real errors and "null", we have to
expand the semantics of commit_patch_id()'s return value,
and callers need to distinguish these cases.

Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/log.c | 10 +++++++++-
 patch-ids.c   | 40 ++++++++++++++++++++++++++++------------
 patch-ids.h   | 11 +++++++++--
 3 files changed, 46 insertions(+), 15 deletions(-)

diff --git a/builtin/log.c b/builtin/log.c
index 92dc34d..ced1ea7 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -1343,8 +1343,16 @@ static void prepare_bases(struct base_tree_info *bases,
 		struct object_id *patch_id;
 		if (commit->util)
 			continue;
-		if (commit_patch_id(commit, &diffopt, sha1, 0))
+
+		switch (commit_patch_id(commit, &diffopt, sha1, 0)) {
+		case PATCH_ID_OK:
+			break;
+		case PATCH_ID_NONE:
+			continue;
+		case PATCH_ID_ERROR:
 			die(_("cannot get patch id"));
+		}
+
 		ALLOC_GROW(bases->patch_id, bases->nr_patch_id + 1, bases->alloc_patch_id);
 		patch_id = bases->patch_id + bases->nr_patch_id;
 		hashcpy(patch_id->hash, sha1);
diff --git a/patch-ids.c b/patch-ids.c
index 77e4663..8d06099 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -7,18 +7,40 @@
 int commit_patch_id(struct commit *commit, struct diff_options *options,
 		    unsigned char *sha1, int diff_header_only)
 {
-	if (commit->parents)
+	if (commit->parents) {
+		if (commit->parents->next)
+			return PATCH_ID_NONE;
 		diff_tree_sha1(commit->parents->item->object.oid.hash,
 			       commit->object.oid.hash, "", options);
-	else
+	} else
 		diff_root_tree_sha1(commit->object.oid.hash, "", options);
 	diffcore_std(options);
-	return diff_flush_patch_id(options, sha1, diff_header_only);
+	if (diff_flush_patch_id(options, sha1, diff_header_only))
+		return PATCH_ID_ERROR;
+	return PATCH_ID_OK;
+}
+
+/* avoid repeating ourselves in patch_id_cmp */
+static int cmp_setup(struct patch_id *p, struct diff_options *opt)
+{
+	if (!is_null_sha1(p->patch_id))
+		return 0; /* OK, already computed id */
+
+	switch (commit_patch_id(p->commit, opt, p->patch_id, 0)) {
+	case PATCH_ID_OK:
+		return 0;
+	case PATCH_ID_ERROR:
+		return error("Could not get patch ID for %s",
+			     oid_to_hex(&p->commit->object.oid));
+	case PATCH_ID_NONE:
+		return -1; /* not an error, but nothing to compare */
+	}
+	die("BUG: unhandled patch_result");
 }
 
 /*
  * When we cannot load the full patch-id for both commits for whatever
- * reason, the function returns -1 (i.e. return error(...)). Despite
+ * reason, the function returns -1. Despite
  * the "cmp" in the name of this function, the caller only cares about
  * the return value being zero (a and b are equivalent) or non-zero (a
  * and b are different), and returning non-zero would keep both in the
@@ -30,14 +52,8 @@ static int patch_id_cmp(struct patch_id *a,
 			struct patch_id *b,
 			struct diff_options *opt)
 {
-	if (is_null_sha1(a->patch_id) &&
-	    commit_patch_id(a->commit, opt, a->patch_id, 0))
-		return error("Could not get patch ID for %s",
-			oid_to_hex(&a->commit->object.oid));
-	if (is_null_sha1(b->patch_id) &&
-	    commit_patch_id(b->commit, opt, b->patch_id, 0))
-		return error("Could not get patch ID for %s",
-			oid_to_hex(&b->commit->object.oid));
+	if (cmp_setup(a, opt) || cmp_setup(b, opt))
+		return -1;
 	return hashcmp(a->patch_id, b->patch_id);
 }
 
diff --git a/patch-ids.h b/patch-ids.h
index 0f34ea1..96fd2b9 100644
--- a/patch-ids.h
+++ b/patch-ids.h
@@ -12,8 +12,15 @@ struct patch_ids {
 	struct diff_options diffopts;
 };
 
-int commit_patch_id(struct commit *commit, struct diff_options *options,
-		    unsigned char *sha1, int);
+enum patch_id_result {
+	PATCH_ID_ERROR = -1,
+	PATCH_ID_OK = 0,
+	PATCH_ID_NONE
+};
+
+enum patch_id_result commit_patch_id(struct commit *commit,
+				     struct diff_options *options,
+				     unsigned char *sha1, int);
 int init_patch_ids(struct patch_ids *);
 int free_patch_ids(struct patch_ids *);
 struct patch_id *add_commit_patch_id(struct commit *, struct patch_ids *);
-- 
2.10.0.161.gdb62534

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 2/2] patch-ids: define patch-id of merge commits as "null"
  2016-09-09 20:34     ` [PATCH v3 2/2] patch-ids: define patch-id of merge commits as "null" Jeff King
@ 2016-09-09 20:37       ` Jeff King
  2016-09-09 21:13       ` Junio C Hamano
  1 sibling, 0 replies; 35+ messages in thread
From: Jeff King @ 2016-09-09 20:37 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

On Fri, Sep 09, 2016 at 04:34:47PM -0400, Jeff King wrote:

> This patch defines the patch-id of a merge commit as
> essentially "null"; it has no patch-id. As a result,
> merges cannot match patch-ids via "--cherry-pick", and
> "format-patch --base" will not list merges in its list of
> prerequisite patch ids.
> 
> To distinguish between real errors and "null", we have to
> expand the semantics of commit_patch_id()'s return value,
> and callers need to distinguish these cases.

One alternative would be to add an out-parameter that is set in the
success case saying "yes, we have a real patch-id". And then the callers
could look like:

  if (commit_patch_id(commit, &diffopt, sha1, 0, &got_one))
	die("error!");
  if (!got_one)
	continue; /* silently skip */

We could even use the null sha1 to signal that rather than an extra
parameter, I suppose.

I dunno. It would make the callers less clunky, I think, but it does
feel a bit magical.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/2] patch-id for merges
  2016-09-09 20:34   ` [PATCH v3 0/2] " Jeff King
  2016-09-09 20:34     ` [PATCH v3 1/2] patch-ids: turn off rename detection Jeff King
  2016-09-09 20:34     ` [PATCH v3 2/2] patch-ids: define patch-id of merge commits as "null" Jeff King
@ 2016-09-09 21:01     ` Junio C Hamano
  2016-09-12 15:59       ` Jeff King
  2016-09-25 18:25     ` Johannes Schindelin
  3 siblings, 1 reply; 35+ messages in thread
From: Junio C Hamano @ 2016-09-09 21:01 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

Jeff King <peff@peff.net> writes:

> And here is v3. Besides commit-message fixups, it drops patch 2, and
> instead the third patch teaches commit_patch_id() to distinguish between
> errors and "no patch id".
>
> Frankly, I still like v2 better, but I do not feel like arguing with
> Johannes about it anymore.

FWIW, I too like the simplicity of v2, as all the error-to-die
conversion is for cases in which there is no sane recovery path.

I'll have to take a bit deeper look at [v3 2/2] that had to become
more involved to decide if the additional flexibility is really
worth it.

Thanks.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 2/2] patch-ids: define patch-id of merge commits as "null"
  2016-09-09 20:34     ` [PATCH v3 2/2] patch-ids: define patch-id of merge commits as "null" Jeff King
  2016-09-09 20:37       ` Jeff King
@ 2016-09-09 21:13       ` Junio C Hamano
  1 sibling, 0 replies; 35+ messages in thread
From: Junio C Hamano @ 2016-09-09 21:13 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

Jeff King <peff@peff.net> writes:

> This patch defines the patch-id of a merge commit as
> essentially "null"; it has no patch-id. As a result,
> merges cannot match patch-ids via "--cherry-pick", and
> "format-patch --base" will not list merges in its list of
> prerequisite patch ids.

At first I wondered if such a change would make all merges look the
same, but the patch-ids.c comparison is not for ordering/sorting but
only for equality, so as long as the comparison function knows that
a comparison of anything with "null" yields "They are different", we
are OK.

> diff --git a/patch-ids.c b/patch-ids.c
> index 77e4663..8d06099 100644
> --- a/patch-ids.c
> +++ b/patch-ids.c
> @@ -7,18 +7,40 @@
>  int commit_patch_id(struct commit *commit, struct diff_options *options,
>  		    unsigned char *sha1, int diff_header_only)
>  {
> -	if (commit->parents)
> +	if (commit->parents) {
> +		if (commit->parents->next)
> +			return PATCH_ID_NONE;
>  		diff_tree_sha1(commit->parents->item->object.oid.hash,
>  			       commit->object.oid.hash, "", options);
> -	else
> +	} else
>  		diff_root_tree_sha1(commit->object.oid.hash, "", options);
>  	diffcore_std(options);
> -	return diff_flush_patch_id(options, sha1, diff_header_only);
> +	if (diff_flush_patch_id(options, sha1, diff_header_only))
> +		return PATCH_ID_ERROR;
> +	return PATCH_ID_OK;
> +}

Looks sensible.  Thanks.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/2] patch-id for merges
  2016-09-09 21:01     ` [PATCH v3 0/2] patch-id for merges Junio C Hamano
@ 2016-09-12 15:59       ` Jeff King
  2016-09-12 17:18         ` Junio C Hamano
  0 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2016-09-12 15:59 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

On Fri, Sep 09, 2016 at 02:01:14PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > And here is v3. Besides commit-message fixups, it drops patch 2, and
> > instead the third patch teaches commit_patch_id() to distinguish between
> > errors and "no patch id".
> >
> > Frankly, I still like v2 better, but I do not feel like arguing with
> > Johannes about it anymore.
> 
> FWIW, I too like the simplicity of v2, as all the error-to-die
> conversion is for cases in which there is no sane recovery path.
> 
> I'll have to take a bit deeper look at [v3 2/2] that had to become
> more involved to decide if the additional flexibility is really
> worth it.

One other option I didn't really look at: commit_patch_id() could
consider feeding it a merge as an error, and it would be come the
caller's responsibility to avoid doing so. That should already be the
case for "format-patch --base".

We'd probably have to change add_commit_patch_id() and
has_commit_patch_id() to return NULL early when fed a merge, but that is
not too bad.

The reason I didn't pursue this is that I didn't want the definition of
"what constitutes something with no patch-id" to cross too many
abstraction layers. But it's not like we expect a multitude of
conditions; it will probably remain just "we don't handle merges" for
the foreseeable future.

That looks like the patch below (as a replacement for patch 2), which is
even less invasive. It also performs a little better on my example case,
because we avoid adding merges to the hashmap entirely.

diff --git a/patch-ids.c b/patch-ids.c
index 77e4663..5d2d96a 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -7,10 +7,12 @@
 int commit_patch_id(struct commit *commit, struct diff_options *options,
 		    unsigned char *sha1, int diff_header_only)
 {
-	if (commit->parents)
+	if (commit->parents) {
+		if (commit->parents->next)
+			return -1;
 		diff_tree_sha1(commit->parents->item->object.oid.hash,
 			       commit->object.oid.hash, "", options);
-	else
+	} else
 		diff_root_tree_sha1(commit->object.oid.hash, "", options);
 	diffcore_std(options);
 	return diff_flush_patch_id(options, sha1, diff_header_only);
@@ -72,11 +74,20 @@ static int init_patch_id_entry(struct patch_id *patch,
 	return 0;
 }
 
+static int patch_id_defined(struct commit *commit)
+{
+	/* must be 0 or 1 parents */
+	return !commit->parents || !commit->parents->next;
+}
+
 struct patch_id *has_commit_patch_id(struct commit *commit,
 				     struct patch_ids *ids)
 {
 	struct patch_id patch;
 
+	if (!patch_id_defined(commit))
+		return NULL;
+
 	memset(&patch, 0, sizeof(patch));
 	if (init_patch_id_entry(&patch, commit, ids))
 		return NULL;
@@ -89,6 +100,9 @@ struct patch_id *add_commit_patch_id(struct commit *commit,
 {
 	struct patch_id *key = xcalloc(1, sizeof(*key));
 
+	if (!patch_id_defined(commit))
+		return NULL;
+
 	if (init_patch_id_entry(key, commit, ids)) {
 		free(key);
 		return NULL;

I'd probably do a preparatory patch to drop the return value from
add_commit_patch_id(). No callers actually look at it.

-Peff

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/2] patch-id for merges
  2016-09-12 15:59       ` Jeff King
@ 2016-09-12 17:18         ` Junio C Hamano
  2016-09-12 17:56           ` Jeff King
  0 siblings, 1 reply; 35+ messages in thread
From: Junio C Hamano @ 2016-09-12 17:18 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

Jeff King <peff@peff.net> writes:

> That looks like the patch below (as a replacement for patch 2), which is
> even less invasive. It also performs a little better on my example case,
> because we avoid adding merges to the hashmap entirely.

After reading it, I did not find [v3 2/2] is too bad either, but
this is even better ;-)

> diff --git a/patch-ids.c b/patch-ids.c
> index 77e4663..5d2d96a 100644
> --- a/patch-ids.c
> +++ b/patch-ids.c
> @@ -7,10 +7,12 @@
>  int commit_patch_id(struct commit *commit, struct diff_options *options,
>  		    unsigned char *sha1, int diff_header_only)
>  {
> -	if (commit->parents)
> +	if (commit->parents) {
> +		if (commit->parents->next)
> +			return -1;
>  		diff_tree_sha1(commit->parents->item->object.oid.hash,
>  			       commit->object.oid.hash, "", options);
> -	else
> +	} else
>  		diff_root_tree_sha1(commit->object.oid.hash, "", options);

This looks familiar ;-)

> @@ -72,11 +74,20 @@ static int init_patch_id_entry(struct patch_id *patch,
>  	return 0;
>  }
>  
> +static int patch_id_defined(struct commit *commit)
> +{
> +	/* must be 0 or 1 parents */
> +	return !commit->parents || !commit->parents->next;
> +}

If we make the first hunk begin like so:

> +	if (commit->parents) {
> +		if (!patch_id_defined(commit))
> +			return -1;

I wonder if the compiler gives us the same code.

>  struct patch_id *has_commit_patch_id(struct commit *commit,
>  				     struct patch_ids *ids)
>  {
>  	struct patch_id patch;
>  
> +	if (!patch_id_defined(commit))
> +		return NULL;
> +
>  	memset(&patch, 0, sizeof(patch));
>  	if (init_patch_id_entry(&patch, commit, ids))
>  		return NULL;
> @@ -89,6 +100,9 @@ struct patch_id *add_commit_patch_id(struct commit *commit,
>  {
>  	struct patch_id *key = xcalloc(1, sizeof(*key));
>  
> +	if (!patch_id_defined(commit))
> +		return NULL;
> +
>  	if (init_patch_id_entry(key, commit, ids)) {
>  		free(key);
>  		return NULL;

Yup, these two hunks look a lot nicer.

> I'd probably do a preparatory patch to drop the return value from
> add_commit_patch_id(). No callers actually look at it.

Thanks.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/2] patch-id for merges
  2016-09-12 17:18         ` Junio C Hamano
@ 2016-09-12 17:56           ` Jeff King
  2016-09-12 20:44             ` Junio C Hamano
  0 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2016-09-12 17:56 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

On Mon, Sep 12, 2016 at 10:18:33AM -0700, Junio C Hamano wrote:

> > +static int patch_id_defined(struct commit *commit)
> > +{
> > +	/* must be 0 or 1 parents */
> > +	return !commit->parents || !commit->parents->next;
> > +}
> 
> If we make the first hunk begin like so:
> 
> > +	if (commit->parents) {
> > +		if (!patch_id_defined(commit))
> > +			return -1;
> 
> I wonder if the compiler gives us the same code.

Good idea. I actually put the "patch_id_defined" check outside the "if"
block you've quoted (otherwise we're making assumptions about the
contents of patch_id_defined).

I didn't check that the compiler generates the same code, but I'm
willing to blindly put faith in it. Either it will inline
patch_id_defined and optimize out the double-conditional, or it probably
doesn't matter in practice. Either way, the compiler is probably smarter
than me, and we should shoot for readability and not repeating
ourselves.

> > I'd probably do a preparatory patch to drop the return value from
> > add_commit_patch_id(). No callers actually look at it.

I decided against this. Technically add_commit_patch_id() can return an
error via the header-only diff_flush_patch_id(), and we'd be shutting
that down. Of course no callers actually _care_ about that right now, so
it doesn't matter at this point. But I'd prefer to punt it down the line
for when somebody does (and the solution may be to distinguish between
those two return codes, or it may be for the caller to have access to
patch_id_defined(); we won't know until we see the code).

So here's the replacement for 2/2:

-- >8 --
Subject: [PATCH] patch-ids: refuse to compute patch-id for merge commit

The patch-id code which powers "log --cherry-pick" doesn't
look at whether each commit is a merge or not. It just feeds
the commit's first parent to the diff, and ignores any
additional parents.

In theory, this might be useful if you wanted to find
equivalence between, say, a merge commit and a squash-merge
that does the same thing.  But it also promotes a false
equivalence between distinct merges. For example, every
"merge -s ours" would look identical to an empty commit
(which is true in a sense, but presumably there was a value
in merging in the discarded history). Since patch-ids are
meant for throwing away duplicates, we should err on the
side of _not_ matching such merges.

Moreover, we may spend a lot of extra time computing these
merge diffs. In the case that inspired this patch, a "git
format-patch --cherry-pick" dropped from over 3 minutes to
less than 3 seconds.

This seems pretty drastic, but is easily explained. The
command was invoked by a "git rebase" of an older topic
branch; there had been tens of thousands of commits on the
upstream branch in the meantime. In addition, this project
used a topic-branch workflow with occasional "back-merges"
from "master" to each topic (to resolve conflicts on the
topics rather than in the merge commits). So there were not
only extra merges, but the diffs for these back-merges were
generally quite large (because they represented _everything_
that had been merged to master since the topic branched).

This patch treats a merge fed to commit_patch_id() or
add_commit_patch_id() as an error, and a lookup for such a
merge via has_commit_patch_id() will always return NULL.
An earlier version of the patch tried to distinguish between
"error" and "patch id for merges not defined", but that
becomes unnecessarily complicated. The only callers are:

  1. revision traversals which want to do --cherry-pick;
     they call add_commit_patch_id(), but do not care if it
     fails. They only want to add what we can, look it up
     later with has_commit_patch_id(), and err on the side
     of not-matching.

  2. format-patch --base, which calls commit_patch_id().
     This _does_ notice errors, but should never feed a
     merge in the first place (and if it were to do so
     accidentally, then this patch is a strict improvement;
     we notice the bug rather than generating a bogus
     patch-id).

So in both cases, this does the right thing.

Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Jeff King <peff@peff.net>
---
 patch-ids.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/patch-ids.c b/patch-ids.c
index 77e4663..ce285c2 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -4,9 +4,18 @@
 #include "sha1-lookup.h"
 #include "patch-ids.h"

+static int patch_id_defined(struct commit *commit)
+{
+	/* must be 0 or 1 parents */
+	return !commit->parents || !commit->parents->next;
+}
+
 int commit_patch_id(struct commit *commit, struct diff_options *options,
 		    unsigned char *sha1, int diff_header_only)
 {
+	if (!patch_id_defined(commit))
+		return -1;
+
 	if (commit->parents)
 		diff_tree_sha1(commit->parents->item->object.oid.hash,
 			       commit->object.oid.hash, "", options);
@@ -77,6 +86,9 @@ struct patch_id *has_commit_patch_id(struct commit *commit,
 {
 	struct patch_id patch;

+	if (!patch_id_defined(commit))
+		return NULL;
+
 	memset(&patch, 0, sizeof(patch));
 	if (init_patch_id_entry(&patch, commit, ids))
 		return NULL;
@@ -89,6 +101,9 @@ struct patch_id *add_commit_patch_id(struct commit *commit,
 {
 	struct patch_id *key = xcalloc(1, sizeof(*key));

+	if (!patch_id_defined(commit))
+		return NULL;
+
 	if (init_patch_id_entry(key, commit, ids)) {
 		free(key);
 		return NULL;
-- 
2.10.0.230.g6f8d04b

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/2] patch-id for merges
  2016-09-12 17:56           ` Jeff King
@ 2016-09-12 20:44             ` Junio C Hamano
  0 siblings, 0 replies; 35+ messages in thread
From: Junio C Hamano @ 2016-09-12 20:44 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Michael Haggerty, Kevin Willford, Xiaolong Ye,
	Johannes Schindelin, Josh Triplett

Jeff King <peff@peff.net> writes:

> On Mon, Sep 12, 2016 at 10:18:33AM -0700, Junio C Hamano wrote:
>
>> > +static int patch_id_defined(struct commit *commit)
>> > +{
>> > +	/* must be 0 or 1 parents */
>> > +	return !commit->parents || !commit->parents->next;
>> > +}
>> 
>> If we make the first hunk begin like so:
>> 
>> > +	if (commit->parents) {
>> > +		if (!patch_id_defined(commit))
>> > +			return -1;
>> 
>> I wonder if the compiler gives us the same code.
>
> Good idea. I actually put the "patch_id_defined" check outside the "if"
> block you've quoted (otherwise we're making assumptions about the
> contents of patch_id_defined).

Facepalm.  I was mis-reading the condition in the helper function.
Of course, guarding up-front makes more sense.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/2] patch-id for merges
  2016-09-09 20:34   ` [PATCH v3 0/2] " Jeff King
                       ` (2 preceding siblings ...)
  2016-09-09 21:01     ` [PATCH v3 0/2] patch-id for merges Junio C Hamano
@ 2016-09-25 18:25     ` Johannes Schindelin
  3 siblings, 0 replies; 35+ messages in thread
From: Johannes Schindelin @ 2016-09-25 18:25 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Michael Haggerty, Kevin Willford, Xiaolong Ye, Josh Triplett

Hi Peff,

On Fri, 9 Sep 2016, Jeff King wrote:

> Frankly, I still like v2 better, but I do not feel like arguing with
> Johannes about it anymore.

Sorry that I was a pest. I do care deeply about making our code cleaner,
though, which in my mind means less of this "let exit() clean everything
up and give the callers no chance to do anything anymore" approach, which
to me always felt like we're basically teaching Git to shrug instead of
being useful when something bad happens.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2016-09-25 18:25 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-07  7:53 [RFC/PATCH 0/2] more patch-id speedups Jeff King
2016-09-07  7:54 ` [PATCH 1/2] patch-ids: turn off rename detection Jeff King
2016-09-07 12:53   ` Johannes Schindelin
2016-09-07  7:54 ` [PATCH 2/2] patch-ids: skip merge commits Jeff King
2016-09-07 12:52   ` Johannes Schindelin
2016-09-07 18:46     ` Jeff King
2016-09-07 22:08       ` Jeff King
2016-09-07 13:06 ` [RFC/PATCH 0/2] more patch-id speedups Johannes Schindelin
2016-09-07 18:49   ` Jeff King
2016-09-07 22:01 ` [RFC/PATCH v2 0/3] patch-id for merges Jeff King
2016-09-07 22:02   ` [PATCH 1/3] patch-ids: turn off rename detection Jeff King
2016-09-07 22:12     ` Jacob Keller
2016-09-07 22:04   ` [PATCH 2/3] diff_flush_patch_id: stop returning error result Jeff King
2016-09-08  0:51     ` Ramsay Jones
2016-09-08  7:20       ` Jeff King
2016-09-09 10:28     ` Johannes Schindelin
2016-09-09 10:40       ` Jeff King
2016-09-09 12:58         ` Johannes Schindelin
2016-09-09 19:37           ` Jeff King
2016-09-07 22:04   ` [PATCH 3/3] patch-ids: use commit sha1 as patch-id for merge commits Jeff King
2016-09-07 22:28     ` Jacob Keller
2016-09-07 22:38       ` Jeff King
2016-09-07 22:51   ` [RFC/PATCH v2 0/3] patch-id for merges Josh Triplett
2016-09-08  7:30     ` Jeff King
2016-09-09 20:34   ` [PATCH v3 0/2] " Jeff King
2016-09-09 20:34     ` [PATCH v3 1/2] patch-ids: turn off rename detection Jeff King
2016-09-09 20:34     ` [PATCH v3 2/2] patch-ids: define patch-id of merge commits as "null" Jeff King
2016-09-09 20:37       ` Jeff King
2016-09-09 21:13       ` Junio C Hamano
2016-09-09 21:01     ` [PATCH v3 0/2] patch-id for merges Junio C Hamano
2016-09-12 15:59       ` Jeff King
2016-09-12 17:18         ` Junio C Hamano
2016-09-12 17:56           ` Jeff King
2016-09-12 20:44             ` Junio C Hamano
2016-09-25 18:25     ` Johannes Schindelin

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).