git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 1/1] Bypass expensive content comparsion during rename detection.
@ 2006-12-14 10:07 Shawn O. Pearce
  2006-12-14 10:53 ` Johannes Schindelin
  0 siblings, 1 reply; 8+ messages in thread
From: Shawn O. Pearce @ 2006-12-14 10:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

When comparing file contents during the second loop through a rename
detection attempt we can skip the expensive byte-by-byte comparsion
if both source and destination files have valid SHA1 values.  This
improves performance by avoiding either an expensive open/mmap to
read the working tree copy, or an expensive inflate of a blob object.

Unfortunately we still have to at least initialize the sizes of the
source and destination files even if the SHA1 values don't match.
Failing to initialize the sizes causes a number of test cases to fail
and start reporting different copy/rename behavior than was expected.

Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
---
 diffcore-rename.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 57a74b6..91fa2be 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -109,6 +109,8 @@ static int is_exact_match(struct diff_filespec *src,
 		return 0;
 	if (src->size != dst->size)
 		return 0;
+	if (src->sha1_valid && dst->sha1_valid)
+	    return !hashcmp(src->sha1, dst->sha1);
 	if (diff_populate_filespec(src, 0) || diff_populate_filespec(dst, 0))
 		return 0;
 	if (src->size == dst->size &&
-- 

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/1] Bypass expensive content comparsion during rename detection.
  2006-12-14 10:07 [PATCH 1/1] Bypass expensive content comparsion during rename detection Shawn O. Pearce
@ 2006-12-14 10:53 ` Johannes Schindelin
  2006-12-14 11:08   ` Shawn Pearce
  0 siblings, 1 reply; 8+ messages in thread
From: Johannes Schindelin @ 2006-12-14 10:53 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Junio C Hamano, git

Hi,

On Thu, 14 Dec 2006, Shawn O. Pearce wrote:

> When comparing file contents during the second loop through a rename 
> detection attempt we can skip the expensive byte-by-byte comparsion if 
> both source and destination files have valid SHA1 values.  This improves 
> performance by avoiding either an expensive open/mmap to read the 
> working tree copy, or an expensive inflate of a blob object.
> 
> Unfortunately we still have to at least initialize the sizes of the 
> source and destination files even if the SHA1 values don't match. 
> Failing to initialize the sizes causes a number of test cases to fail 
> and start reporting different copy/rename behavior than was expected.

It has a wrong feel to it when you say we have to check the sizes first. 
After all, we are heavily relying on the hashes being different, including 
the case when the sizes are different. So, the order of the checks should 
_not_ matter. I suspect that somewhere sha1_valid should be set to 0, but 
isn't.

Ciao,

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/1] Bypass expensive content comparsion during rename detection.
  2006-12-14 10:53 ` Johannes Schindelin
@ 2006-12-14 11:08   ` Shawn Pearce
  2006-12-14 11:13     ` Shawn Pearce
  2006-12-14 11:18     ` Johannes Schindelin
  0 siblings, 2 replies; 8+ messages in thread
From: Shawn Pearce @ 2006-12-14 11:08 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Thu, 14 Dec 2006, Shawn O. Pearce wrote:
> > Unfortunately we still have to at least initialize the sizes of the 
> > source and destination files even if the SHA1 values don't match. 
> > Failing to initialize the sizes causes a number of test cases to fail 
> > and start reporting different copy/rename behavior than was expected.
> 
> It has a wrong feel to it when you say we have to check the sizes first. 
> After all, we are heavily relying on the hashes being different, including 
> the case when the sizes are different. So, the order of the checks should 
> _not_ matter. I suspect that somewhere sha1_valid should be set to 0, but 
> isn't.

Yes.  :-)

I'll admit, I don't understand the diffcore rename code very well
so I'm treading around in code that I'm not used to.  I'm not sure
why the size member of diff_filespec needs to be initialized to get
rename and copy detection to work properly, but it apparently does.


My first version of the patch had the hash comparsion right after
we called diff_populate_filespec to get the size data.  But then
I realized that very often the sizes will be different and the
src->size != dst->size comparsion will tend to be true most of
the time, thus saving us a (relatively) expensive hash comparsion,
which we know must fail anyway if the sizes differed.

-- 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/1] Bypass expensive content comparsion during rename detection.
  2006-12-14 11:08   ` Shawn Pearce
@ 2006-12-14 11:13     ` Shawn Pearce
  2006-12-14 11:18     ` Johannes Schindelin
  1 sibling, 0 replies; 8+ messages in thread
From: Shawn Pearce @ 2006-12-14 11:13 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, git

Shawn Pearce <spearce@spearce.org> wrote:
> I'll admit, I don't understand the diffcore rename code very well
> so I'm treading around in code that I'm not used to.  I'm not sure
> why the size member of diff_filespec needs to be initialized to get
> rename and copy detection to work properly, but it apparently does.

This chunk of code is probably a perfect example of why side-effects
can be so bad.  Its fast because the size information is loaded
once and reused later on; its horrible to maintain because you don't
realize that this simple predicate is actually doing something that
matters downstream even though it returned false!

-- 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/1] Bypass expensive content comparsion during rename detection.
  2006-12-14 11:08   ` Shawn Pearce
  2006-12-14 11:13     ` Shawn Pearce
@ 2006-12-14 11:18     ` Johannes Schindelin
  2006-12-14 11:50       ` Shawn Pearce
  1 sibling, 1 reply; 8+ messages in thread
From: Johannes Schindelin @ 2006-12-14 11:18 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Junio C Hamano, git

Hi,

On Thu, 14 Dec 2006, Shawn Pearce wrote:

> My first version of the patch had the hash comparsion right after we 
> called diff_populate_filespec to get the size data.  But then I realized 
> that very often the sizes will be different and the src->size != 
> dst->size comparsion will tend to be true most of the time, thus saving 
> us a (relatively) expensive hash comparsion, which we know must fail 
> anyway if the sizes differed.

Ah! I misunderstood. Since the call to diff_populate_filespec was not 
visible in the hunk, I erroneously assumed that you meant to _check_ the 
sizes before checking the hashes.

But your explanation makes lots of sense to me. May I request a short 
comment above the new code, like "let diff_populate_filespec() do its 
thing since we need the filesize later on anyway, and having that, do the 
cheaper filesize check before the more expensive hashcmp()"?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/1] Bypass expensive content comparsion during rename detection
@ 2006-12-14 11:26 Junio C Hamano
  2006-12-14 12:13 ` Shawn Pearce
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2006-12-14 11:26 UTC (permalink / raw)
  To: Johannes.Schindelin, Johannes, Schindelin; +Cc: git

estimate_similarity() is called after it has determined that one
and two do not match exactly, and it relies on populate_filespec
with at least size-only has been called on them so that it can
reject filepair with vastly different sizes, still without
looking at (thus without loading) the contents.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/1] Bypass expensive content comparsion during rename detection.
  2006-12-14 11:18     ` Johannes Schindelin
@ 2006-12-14 11:50       ` Shawn Pearce
  0 siblings, 0 replies; 8+ messages in thread
From: Shawn Pearce @ 2006-12-14 11:50 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
> 
> On Thu, 14 Dec 2006, Shawn Pearce wrote:
> 
> > My first version of the patch had the hash comparsion right after we 
> > called diff_populate_filespec to get the size data.  But then I realized 
> > that very often the sizes will be different and the src->size != 
> > dst->size comparsion will tend to be true most of the time, thus saving 
> > us a (relatively) expensive hash comparsion, which we know must fail 
> > anyway if the sizes differed.
> 
> Ah! I misunderstood. Since the call to diff_populate_filespec was not 
> visible in the hunk, I erroneously assumed that you meant to _check_ the 
> sizes before checking the hashes.
> 
> But your explanation makes lots of sense to me. May I request a short 
> comment above the new code, like "let diff_populate_filespec() do its 
> thing since we need the filesize later on anyway, and having that, do the 
> cheaper filesize check before the more expensive hashcmp()"?

-- 8> --
Bypass expensive content comparsion during rename detection.

When comparing file contents during the second loop through a rename
detection attempt we can skip the expensive byte-by-byte comparsion
if both source and destination files have valid SHA1 values.  This
improves performance by avoiding either an expensive open/mmap to
read the working tree copy, or an expensive inflate of a blob object.

Unfortunately we still have to at least initialize the sizes of the
source and destination files even if the SHA1 values don't match.
Failing to initialize the sizes causes a number of test cases to fail
and start reporting different copy/rename behavior than was expected.

Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
---
 diffcore-rename.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 57a74b6..f7748ce 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -109,6 +109,15 @@ static int is_exact_match(struct diff_filespec *src,
 		return 0;
 	if (src->size != dst->size)
 		return 0;
+	/* Although we can avoid a byte-by-byte comparsion by checking
+	 * hashes we needed to allow diff_populate_filespec to fill in
+	 * the size members, as we need that later on to correctly do
+	 * rename and copy detection.  Not filling in size before we
+	 * return back when contents_too is true causes all sorts of
+	 * havoc (been there, done that, lets not try it again).
+	 */
+	if (src->sha1_valid && dst->sha1_valid)
+	    return !hashcmp(src->sha1, dst->sha1);
 	if (diff_populate_filespec(src, 0) || diff_populate_filespec(dst, 0))
 		return 0;
 	if (src->size == dst->size &&
-- 
1.4.4.2.g72f5

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/1] Bypass expensive content comparsion during rename detection
  2006-12-14 11:26 Junio C Hamano
@ 2006-12-14 12:13 ` Shawn Pearce
  0 siblings, 0 replies; 8+ messages in thread
From: Shawn Pearce @ 2006-12-14 12:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes.Schindelin, git

Junio C Hamano <junio@hera.kernel.org> wrote:
> estimate_similarity() is called after it has determined that one
> and two do not match exactly, and it relies on populate_filespec
> with at least size-only has been called on them so that it can
> reject filepair with vastly different sizes, still without
> looking at (thus without loading) the contents.

*ouch* That hurts.  OK, it makes a lot of sense, but its crazy.
About as crazy as the comment I put in my revised patch...  ;-)

-- 

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2006-12-14 12:13 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-12-14 10:07 [PATCH 1/1] Bypass expensive content comparsion during rename detection Shawn O. Pearce
2006-12-14 10:53 ` Johannes Schindelin
2006-12-14 11:08   ` Shawn Pearce
2006-12-14 11:13     ` Shawn Pearce
2006-12-14 11:18     ` Johannes Schindelin
2006-12-14 11:50       ` Shawn Pearce
  -- strict thread matches above, loose matches on Subject: below --
2006-12-14 11:26 Junio C Hamano
2006-12-14 12:13 ` Shawn Pearce

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).