git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>, git@vger.kernel.org
Subject: [PATCH] diffcore-rename: cache file deltas
Date: Tue, 25 Sep 2007 15:29:42 -0400	[thread overview]
Message-ID: <20070925192941.GA8564@coredump.intra.peff.net> (raw)

We find rename candidates by computing a fingerprint hash of
each file, and then comparing those fingerprints. There are
inherently O(n^2) comparisons, so it pays in CPU time to
hoist the (rather expensive) computation of the fingerprint
out of that loop (or to cache it once we have computed it once).

Previously, we didn't keep the filespec information around
because then we had the potential to consume a great deal of
memory. However, instead of keeping all of the filespec
data, we can instead just keep the fingerprint.

This patch implements and uses diff_free_filespec_data_large
to accomplish that goal. We also have to change
estimate_similarity not to needlessly repopulate the
filespec data when we already have the hash.

Practical tests showed 4.5x speedup for a 10% memory usage
increase.

Signed-off-by: Jeff King <peff@peff.net>
---
The implementation is a little less nice than I would like, but I was
trying to be non-invasive. Specifically:

 - the name diff_free_filespec_data_large is horrible, but this is based
   on the fact that diff_free_filespec_data actually does too much (it
   frees the data _and_ some other auxiliary data). And renaming that
   would entail changing many callsites.

 - It seems that a better place to call diffcore_populate_filespec
   (rather than in estimate_similarity) would actually be in
   diffcore_count_changes (when we _know_ that we need to populate the
   contents data).

 - The hash_chars() should arguably be tied into
   diffcore_populate_filespec, which should have more of a "what
   information do you want" interface. I.e., the "size_only" parameter
   could be extended to a bitfield to say "populate this, and I need the
   delta fingerprint, size, actual contents, etc". Then callers could
   just use "populate" before looking at the filespec and it would
   lazily load whatever they needed.

This patch cuts my pathological case from 20 minutes to 2 minutes, which
is a great improvement, but still unusable. However, now I should be
able to get more useful numbers on what else can be sped up.

 diff.c            |    7 ++++++-
 diffcore-rename.c |    7 ++++---
 diffcore.h        |    1 +
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/diff.c b/diff.c
index 2216d75..647bcf0 100644
--- a/diff.c
+++ b/diff.c
@@ -1654,7 +1654,7 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
 	return 0;
 }
 
-void diff_free_filespec_data(struct diff_filespec *s)
+void diff_free_filespec_data_large(struct diff_filespec *s)
 {
 	if (s->should_free)
 		free(s->data);
@@ -1665,6 +1665,11 @@ void diff_free_filespec_data(struct diff_filespec *s)
 		s->should_free = s->should_munmap = 0;
 		s->data = NULL;
 	}
+}
+
+void diff_free_filespec_data(struct diff_filespec *s)
+{
+	diff_free_filespec_data_large(s);
 	free(s->cnt_data);
 	s->cnt_data = NULL;
 }
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 41b35c3..4fc2000 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -184,7 +184,8 @@ static int estimate_similarity(struct diff_filespec *src,
 	if (base_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
 		return 0;
 
-	if (diff_populate_filespec(src, 0) || diff_populate_filespec(dst, 0))
+	if ((!src->cnt_data && diff_populate_filespec(src, 0))
+		|| (!dst->cnt_data && diff_populate_filespec(dst, 0)))
 		return 0; /* error but caught downstream */
 
 
@@ -377,10 +378,10 @@ void diffcore_rename(struct diff_options *options)
 			m->score = estimate_similarity(one, two,
 						       minimum_score);
 			m->name_score = basename_same(one, two);
-			diff_free_filespec_data(one);
+			diff_free_filespec_data_large(one);
 		}
 		/* We do not need the text anymore */
-		diff_free_filespec_data(two);
+		diff_free_filespec_data_large(two);
 		dst_cnt++;
 	}
 	/* cost matrix sorted by most to least similar pair */
diff --git a/diffcore.h b/diffcore.h
index eef17c4..4bf175b 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -48,6 +48,7 @@ extern void fill_filespec(struct diff_filespec *, const unsigned char *,
 
 extern int diff_populate_filespec(struct diff_filespec *, int);
 extern void diff_free_filespec_data(struct diff_filespec *);
+extern void diff_free_filespec_data_large(struct diff_filespec *);
 extern int diff_filespec_is_binary(struct diff_filespec *);
 
 struct diff_filepair {
-- 
1.5.3.2.1061.gc056e-dirty

             reply	other threads:[~2007-09-25 19:30 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-09-25 19:29 Jeff King [this message]
2007-09-25 21:29 ` [PATCH] diffcore-rename: cache file deltas Junio C Hamano
2007-09-25 21:42   ` Jeff King
2007-10-03  1:36   ` Linus Torvalds
2007-10-03  4:10     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070925192941.GA8564@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).