git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: diffcore-rename performance mode
Date: Tue, 18 Sep 2007 04:23:22 -0400	[thread overview]
Message-ID: <20070918082321.GA9883@coredump.intra.peff.net> (raw)

[this is a resend of some comments I made deep in another thread on
rename limiting; I wanted to get your comments, Junio, but I was afraid
you didn't see it buried in the mess. If you have already read it and
have nothing to say, just tell me to shut up.

I was able to get 1000% speedup on a (perhaps pathological) diff, and I
suspect there may be more speedups possible in the spanhash lookups. So
I think it's worth pursuing.]


Hmm. Actually, doing some profiling, it looks like about 75% of the time
is spent not in the O(n^2) comparison, but in generating the hash
fingerprints of each file.

There seems to be a serious performance problem in diffcore-rename.
There is infrastructure to cache the "cnt_data" member of each filespec,
but it never gets used because we immediately free the filespec data
after use. Oops.

With this patch:

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 6bde439..531a844 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -362,10 +362,7 @@ void diffcore_rename(struct diff_options *options)
 			m->score = estimate_similarity(one, two,
 						       minimum_score);
 			m->name_score = basename_same(one, two);
-			diff_free_filespec_data(one);
 		}
-		/* We do not need the text anymore */
-		diff_free_filespec_data(two);
 		dst_cnt++;
 	}
 	/* cost matrix sorted by most to least similar pair */

My 20-minute diff becomes a 2-minute diff. The downside is that the
memory usage is much increased (for obvious reasons, it should increase
by the dataset size, since we are keeping pointers to the data around --
in my case, around 1G extra).  However, keeping around _just_ the
cnt_data caused only about 100M of extra memory consumption (and gave
the same performance boost).

Of course, 2 minutes for a git-status is still way too slow, so there we
might still need a limiter. It also looks like 57% of my time is spent
in spanhash_find, and another 29% in diffcore_count_changes.

The spanhash data structure is a bit confusing. At first, it looked like
we were doing a linear search for a matching hash, but it's not quite,
since we seem to start at some magic spot based on the hashval we're
looking up.

But it seems to be an array of (hash, count) pairs for each file. Is
there a reason not to use a hash table mapping hash -> count? That would
make insertion and lookup O(1), presumably at the cost of a bit more
memory (since each filespec would have the full table).

Junio, can you shed some light on that decision?

-Peff

             reply	other threads:[~2007-09-18  8:24 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-09-18  8:23 Jeff King [this message]
2007-09-18  8:49 ` diffcore-rename performance mode Junio C Hamano
2007-09-18  8:54   ` Jeff King
2007-09-18  8:58     ` Junio C Hamano
2007-09-18  9:01       ` Jeff King
2007-09-18  9:17         ` Junio C Hamano
2007-09-18 11:20     ` Johannes Schindelin
2007-09-25 16:38     ` Jeff King
2007-09-25 19:06       ` Jeff King
2007-09-25 19:10         ` Andreas Ericsson
2007-09-25 19:32         ` David Kastrup
2007-09-25 19:52           ` Jeff King
2007-09-18 22:12 ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070918082321.GA9883@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).