git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] diffcore-rename: cache file deltas
@ 2007-09-25 19:29 Jeff King
  2007-09-25 21:29 ` Junio C Hamano
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff King @ 2007-09-25 19:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

We find rename candidates by computing a fingerprint hash of
each file, and then comparing those fingerprints. There are
inherently O(n^2) comparisons, so it pays in CPU time to
hoist the (rather expensive) computation of the fingerprint
out of that loop (or to cache it once we have computed it once).

Previously, we didn't keep the filespec information around
because then we had the potential to consume a great deal of
memory. However, instead of keeping all of the filespec
data, we can instead just keep the fingerprint.

This patch implements and uses diff_free_filespec_data_large
to accomplish that goal. We also have to change
estimate_similarity not to needlessly repopulate the
filespec data when we already have the hash.

Practical tests showed 4.5x speedup for a 10% memory usage
increase.

Signed-off-by: Jeff King <peff@peff.net>
---
The implementation is a little less nice than I would like, but I was
trying to be non-invasive. Specifically:

 - the name diff_free_filespec_data_large is horrible, but this is based
   on the fact that diff_free_filespec_data actually does too much (it
   frees the data _and_ some other auxiliary data). And renaming that
   would entail changing many callsites.

 - It seems that a better place to call diffcore_populate_filespec
   (rather than in estimate_similarity) would actually be in
   diffcore_count_changes (when we _know_ that we need to populate the
   contents data).

 - The hash_chars() should arguably be tied into
   diffcore_populate_filespec, which should have more of a "what
   information do you want" interface. I.e., the "size_only" parameter
   could be extended to a bitfield to say "populate this, and I need the
   delta fingerprint, size, actual contents, etc". Then callers could
   just use "populate" before looking at the filespec and it would
   lazily load whatever they needed.

This patch cuts my pathological case from 20 minutes to 2 minutes, which
is a great improvement, but still unusable. However, now I should be
able to get more useful numbers on what else can be sped up.

 diff.c            |    7 ++++++-
 diffcore-rename.c |    7 ++++---
 diffcore.h        |    1 +
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/diff.c b/diff.c
index 2216d75..647bcf0 100644
--- a/diff.c
+++ b/diff.c
@@ -1654,7 +1654,7 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
 	return 0;
 }
 
-void diff_free_filespec_data(struct diff_filespec *s)
+void diff_free_filespec_data_large(struct diff_filespec *s)
 {
 	if (s->should_free)
 		free(s->data);
@@ -1665,6 +1665,11 @@ void diff_free_filespec_data(struct diff_filespec *s)
 		s->should_free = s->should_munmap = 0;
 		s->data = NULL;
 	}
+}
+
+void diff_free_filespec_data(struct diff_filespec *s)
+{
+	diff_free_filespec_data_large(s);
 	free(s->cnt_data);
 	s->cnt_data = NULL;
 }
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 41b35c3..4fc2000 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -184,7 +184,8 @@ static int estimate_similarity(struct diff_filespec *src,
 	if (base_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
 		return 0;
 
-	if (diff_populate_filespec(src, 0) || diff_populate_filespec(dst, 0))
+	if ((!src->cnt_data && diff_populate_filespec(src, 0))
+		|| (!dst->cnt_data && diff_populate_filespec(dst, 0)))
 		return 0; /* error but caught downstream */
 
 
@@ -377,10 +378,10 @@ void diffcore_rename(struct diff_options *options)
 			m->score = estimate_similarity(one, two,
 						       minimum_score);
 			m->name_score = basename_same(one, two);
-			diff_free_filespec_data(one);
+			diff_free_filespec_data_large(one);
 		}
 		/* We do not need the text anymore */
-		diff_free_filespec_data(two);
+		diff_free_filespec_data_large(two);
 		dst_cnt++;
 	}
 	/* cost matrix sorted by most to least similar pair */
diff --git a/diffcore.h b/diffcore.h
index eef17c4..4bf175b 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -48,6 +48,7 @@ extern void fill_filespec(struct diff_filespec *, const unsigned char *,
 
 extern int diff_populate_filespec(struct diff_filespec *, int);
 extern void diff_free_filespec_data(struct diff_filespec *);
+extern void diff_free_filespec_data_large(struct diff_filespec *);
 extern int diff_filespec_is_binary(struct diff_filespec *);
 
 struct diff_filepair {
-- 
1.5.3.2.1061.gc056e-dirty

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] diffcore-rename: cache file deltas
  2007-09-25 19:29 [PATCH] diffcore-rename: cache file deltas Jeff King
@ 2007-09-25 21:29 ` Junio C Hamano
  2007-09-25 21:42   ` Jeff King
  2007-10-03  1:36   ` Linus Torvalds
  0 siblings, 2 replies; 5+ messages in thread
From: Junio C Hamano @ 2007-09-25 21:29 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, git

Jeff King <peff@peff.net> writes:

> We find rename candidates by computing a fingerprint hash of
> each file, and then comparing those fingerprints. There are
> inherently O(n^2) comparisons, so it pays in CPU time to
> hoist the (rather expensive) computation of the fingerprint
> out of that loop (or to cache it once we have computed it once).
>
> Previously, we didn't keep the filespec information around
> because then we had the potential to consume a great deal of
> memory. However, instead of keeping all of the filespec
> data, we can instead just keep the fingerprint.
>
> This patch implements and uses diff_free_filespec_data_large
> to accomplish that goal. We also have to change
> estimate_similarity not to needlessly repopulate the
> filespec data when we already have the hash.
>
> Practical tests showed 4.5x speedup for a 10% memory usage
> increase.
>
> Signed-off-by: Jeff King <peff@peff.net>

Very nice.

> The implementation is a little less nice than I would like, but I was
> trying to be non-invasive. Specifically:
>
>  - the name diff_free_filespec_data_large is horrible, but this is based
>    on the fact that diff_free_filespec_data actually does too much (it
>    frees the data _and_ some other auxiliary data). And renaming that
>    would entail changing many callsites.

True.  But we can rename it to diff_file_filespec_blob() and
that would perfectly well describe what it does.  Will do so
when applying if it is Ok to you.

>  - It seems that a better place to call diffcore_populate_filespec
>    (rather than in estimate_similarity) would actually be in
>    diffcore_count_changes (when we _know_ that we need to populate the
>    contents data).
>
>  - The hash_chars() should arguably be tied into
>    diffcore_populate_filespec, which should have more of a "what
>    information do you want" interface. I.e., the "size_only" parameter
>    could be extended to a bitfield to say "populate this, and I need the
>    delta fingerprint, size, actual contents, etc". Then callers could
>    just use "populate" before looking at the filespec and it would
>    lazily load whatever they needed.

Both good points, but I agree with you that it is wise to do
that with a follow-up patch.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] diffcore-rename: cache file deltas
  2007-09-25 21:29 ` Junio C Hamano
@ 2007-09-25 21:42   ` Jeff King
  2007-10-03  1:36   ` Linus Torvalds
  1 sibling, 0 replies; 5+ messages in thread
From: Jeff King @ 2007-09-25 21:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

On Tue, Sep 25, 2007 at 02:29:07PM -0700, Junio C Hamano wrote:

> Very nice.

Thank you.

> True.  But we can rename it to diff_file_filespec_blob() and
> that would perfectly well describe what it does.  Will do so
> when applying if it is Ok to you.

Yes, that is a much better name (assuming s/file/free). Please do change
it.

> >  - The hash_chars() should arguably be tied into
> >    diffcore_populate_filespec, which should have more of a "what
> 
> Both good points, but I agree with you that it is wise to do
> that with a follow-up patch.

I think I will leave it, unless you are excited about the change. I feel
like there is a high chance of introducing some bug, and the benefit is
purely stylistic at this point (i.e., the diff code isn't really
changing much at this point, so I think it should just be left alone).

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] diffcore-rename: cache file deltas
  2007-09-25 21:29 ` Junio C Hamano
  2007-09-25 21:42   ` Jeff King
@ 2007-10-03  1:36   ` Linus Torvalds
  2007-10-03  4:10     ` Junio C Hamano
  1 sibling, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2007-10-03  1:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git



On Tue, 25 Sep 2007, Junio C Hamano wrote:
> >
> >  - the name diff_free_filespec_data_large is horrible, but this is based
> >    on the fact that diff_free_filespec_data actually does too much (it
> >    frees the data _and_ some other auxiliary data). And renaming that
> >    would entail changing many callsites.
> 
> True.  But we can rename it to diff_file_filespec_blob() and
> that would perfectly well describe what it does.  Will do so
> when applying if it is Ok to you.

Well, that renaming apparently never happened, and it's still called 
diff_free_filespec_data_large() now that it's in master.

That said, I think this patch should make it into the maintenance branch 
too, renamed or not, since it's such a huge performance issue.

			Linus

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] diffcore-rename: cache file deltas
  2007-10-03  1:36   ` Linus Torvalds
@ 2007-10-03  4:10     ` Junio C Hamano
  0 siblings, 0 replies; 5+ messages in thread
From: Junio C Hamano @ 2007-10-03  4:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Well, that renaming apparently never happened, and it's still called 
> diff_free_filespec_data_large() now that it's in master.
>
> That said, I think this patch should make it into the maintenance branch 
> too, renamed or not, since it's such a huge performance issue.

Thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2007-10-03  4:10 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-09-25 19:29 [PATCH] diffcore-rename: cache file deltas Jeff King
2007-09-25 21:29 ` Junio C Hamano
2007-09-25 21:42   ` Jeff King
2007-10-03  1:36   ` Linus Torvalds
2007-10-03  4:10     ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).