git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <junkio@cox.net>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: [PATCH] Fix -B "very-different" logic.
Date: Thu, 02 Jun 2005 16:54:36 -0700	[thread overview]
Message-ID: <7vll5sz54z.fsf_-_@assigned-by-dhcp.cox.net> (raw)
In-Reply-To: <7vmzqau3es.fsf@assigned-by-dhcp.cox.net> (Junio C. Hamano's message of "Tue, 31 May 2005 21:06:19 -0700")

>>>>> "JCH" == Junio C Hamano <junkio@cox.net> writes:

>     (Btw, current versions of git will consider the change
>     in question to be so big that it's considered a whole
>     new file, since the diff is actually bigger than the
>     file. 

JCH> Do you want me to do something about this with -B (and possibly
JCH> -C/-M), like skipping the comparison altogether if the file size
JCH> is smaller than, say, 1k bytes or something silly like that?  Or
JCH> not having special case for this kind of "contrived example"
JCH> preferrable?

I was looking at the -B code.  The reason it thinks change is
too big is because xdelta tells us to reconstruct the
destination by all new literal bytes in this small string case.
There is not much I can do about it.

However I think the diffcore-break algorithm itself was basing
its "very_different" computation on numbers somewhat bogus.  It
was counting newly inserted bytes into account, but amount of
those bytes should not make any difference when determining if
the change is a complete rewrite.

I suspect that -M/-C heuristics has similar (if not the same)
issues, but I would like to address that separately.

Here is a proposed fix for -B.  It also tells diffcore-break not
to break a file smaller than 400 bytes.  I did not make this
number configurable, since that would be too many knobs to
tweak.  If somebody feels strong enough about it, it can be made
into an option later, but for now that size "feels" reasonable.

  -- >8 -- cut here -- >8 --

------------
What we are interested in here is how much the original source
material remains in the final result, and it does not really
matter how much new contents are added as part of the edit.  If
you remove 97 lines from an original 100-line document, it does
not matter if you add 47 lines of your own to make a 50-line
document, or if you add 997 lines to make a 1000-line document.
Either way, you did a complete rewrite.

Earlier code counted both new material and deletions to detect
complete rewrites.  This patch fixes it.  With its default
setting, it detects three such complete rewrites in the core-GIT
repository.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 count-delta.h    |    1 +
 diffcore.h       |    4 ++-
 count-delta.c    |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 diffcore-break.c |   55 +++++++++++++++---------------------------
 4 files changed, 92 insertions(+), 38 deletions(-)

diff --git a/count-delta.h b/count-delta.h
--- a/count-delta.h
+++ b/count-delta.h
@@ -5,5 +5,6 @@
 #define COUNT_DELTA_H
 
 unsigned long count_delta(void *, unsigned long);
+unsigned long count_excluded_source_material(void *, unsigned long);
 
 #endif
diff --git a/diffcore.h b/diffcore.h
--- a/diffcore.h
+++ b/diffcore.h
@@ -10,9 +10,9 @@
  */
 #define MAX_SCORE 60000
 #define DEFAULT_RENAME_SCORE 30000 /* rename/copy similarity minimum (50%) */
-#define DEFAULT_BREAK_SCORE  59400 /* minimum for break to happen (99%)*/
+#define DEFAULT_BREAK_SCORE  48000 /* minimum for break to happen    (80%) */
 
-#define RENAME_DST_MATCHED 01
+#define DIFF_MINIMUM_BREAK   400 /* minimum size of source that -B breaks */
 
 struct diff_filespec {
 	unsigned char sha1[20];
diff --git a/count-delta.c b/count-delta.c
--- a/count-delta.c
+++ b/count-delta.c
@@ -93,3 +93,73 @@ unsigned long count_delta(void *delta_bu
 		return 0;
 	return (src_size - copied_from_source) + added_literal;
 }
+
+
+/*
+ * What we are interested in here is how much the original source
+ * material remains in the final result, and it does not really matter
+ * how much new contents are added as part of the edit.  If you remove
+ * 97 lines from an original 100-line document, it does not matter if
+ * you add 47 lines of your own to make a 50-line document, or if you
+ * add 997 lines to make a 1000-line document.  Either way, you did a
+ * complete rewrite.
+ *
+ * Note.  We do not interprete delta fully.  Instead, we look at xdelta
+ * instructions that copy bytes from the source, and count those copied
+ * bytes.  Subtracting this number from the original source size yields
+ * the number of bytes not used from the source material.  In the above
+ * example, this number corresponds to 97-line (but we count in bytes).
+ */
+unsigned long count_excluded_source_material(void *delta_buf,
+					     unsigned long delta_size)
+{
+	unsigned long copied_from_source;
+	const unsigned char *data, *top;
+	unsigned char cmd;
+	unsigned long src_size, dst_size, out;
+
+	/* the smallest delta size possible is 6 bytes */
+	if (delta_size < 6)
+		return UINT_MAX;
+
+	data = delta_buf;
+	top = delta_buf + delta_size;
+
+	src_size = get_hdr_size(&data);
+	dst_size = get_hdr_size(&data);
+
+	copied_from_source = out = 0;
+	while (data < top) {
+		cmd = *data++;
+		if (cmd & 0x80) {
+			unsigned long cp_off = 0, cp_size = 0;
+			if (cmd & 0x01) cp_off = *data++;
+			if (cmd & 0x02) cp_off |= (*data++ << 8);
+			if (cmd & 0x04) cp_off |= (*data++ << 16);
+			if (cmd & 0x08) cp_off |= (*data++ << 24);
+			if (cmd & 0x10) cp_size = *data++;
+			if (cmd & 0x20) cp_size |= (*data++ << 8);
+			if (cp_size == 0) cp_size = 0x10000;
+
+			if (cmd & 0x40)
+				/* copy from dst */
+				;
+			else
+				copied_from_source += cp_size;
+			out += cp_size;
+		} else {
+			/* write literal into dst */
+			out += cmd;
+			data += cmd;
+		}
+	}
+
+	/* sanity check */
+	if (data != top || out != dst_size)
+		return UINT_MAX;
+
+	if (src_size < copied_from_source)
+		/* we ended up overcounting and underflowed; I dunno why */
+		return 0;
+	return src_size - copied_from_source;
+}
diff --git a/diffcore-break.c b/diffcore-break.c
--- a/diffcore-break.c
+++ b/diffcore-break.c
@@ -13,63 +13,46 @@ static int very_different(struct diff_fi
 {
 	/* dst is recorded as a modification of src.  Are they so
 	 * different that we are better off recording this as a pair
-	 * of delete and create?  min_score is the minimum amount of
-	 * new material that must exist in the dst and not in src for
-	 * the pair to be considered a complete rewrite, and recommended
-	 * to be set to a very high value, 99% or so.
+	 * of delete and create?
 	 *
-	 * The value we return represents the amount of new material
-	 * that is in dst and not in src.  We return 0 when we do not
-	 * want to get the filepair broken.
+	 * We base the score on the amount of material originally from
+	 * src that still remains in the dst.  If src was 100-line
+	 * file among which only 3-line remains in the dst, then it is
+	 * a complete rewrite with 97% "change", and it does not
+	 * matter if the resulting file is a 15-line file or a
+	 * 2000-line file.  On the other hand, if 40-line remains
+	 * among those 100-lines, even if the resulting file is a
+	 * 2000-lines file, it still is an edit with 60% "change",
+	 * which may sound counter-intuitive at first but that is the
+	 * right number to use.
 	 */
+
 	void *delta;
-	unsigned long delta_size, base_size;
+	unsigned long delta_size;
 
 	if (!S_ISREG(src->mode) || !S_ISREG(dst->mode))
 		return 0; /* leave symlink rename alone */
 
-	if (diff_populate_filespec(src, 1) || diff_populate_filespec(dst, 1))
-		return 0; /* error but caught downstream */
-
-	delta_size = ((src->size < dst->size) ?
-		      (dst->size - src->size) : (src->size - dst->size));
-
-	/* Notice that we use max of src and dst as the base size,
-	 * unlike rename similarity detection.  This is so that we do
-	 * not mistake a large addition as a complete rewrite.
-	 */
-	base_size = ((src->size < dst->size) ? dst->size : src->size);
-
-	/*
-	 * If file size difference is too big compared to the
-	 * base_size, we declare this a complete rewrite.
-	 */
-	if (base_size * min_score < delta_size * MAX_SCORE)
-		return MAX_SCORE;
-
 	if (diff_populate_filespec(src, 0) || diff_populate_filespec(dst, 0))
 		return 0; /* error but caught downstream */
 
+	if (src->size < DIFF_MINIMUM_BREAK)
+		return 0; /* Too small to consider breaking */
+
 	delta = diff_delta(src->data, src->size,
 			   dst->data, dst->size,
 			   &delta_size);
 
-	/* A delta that has a lot of literal additions would have
-	 * big delta_size no matter what else it does.
-	 */
-	if (base_size * min_score < delta_size * MAX_SCORE)
-		return MAX_SCORE;
-
 	/* Estimate the edit size by interpreting delta. */
-	delta_size = count_delta(delta, delta_size);
+	delta_size = count_excluded_source_material(delta, delta_size);
 	free(delta);
 	if (delta_size == UINT_MAX)
 		return 0; /* error in delta computation */
 
-	if (base_size < delta_size)
+	if (src->size < delta_size)
 		return MAX_SCORE;
 
-	return delta_size * MAX_SCORE / base_size; 
+	return delta_size * MAX_SCORE / src->size;
 }
 
 void diffcore_break(int min_score)
------------


  reply	other threads:[~2005-06-02 23:51 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-05-30 20:00 I want to release a "git-1.0" Linus Torvalds
2005-05-30 20:33 ` jeff millar
2005-05-30 20:49 ` Nicolas Pitre
2005-06-01  6:52   ` Junio C Hamano
2005-06-01  8:24     ` [PATCH] Add -d flag to git-pull-* family Junio C Hamano
2005-06-01 14:39       ` Nicolas Pitre
2005-06-01 16:00         ` Junio C Hamano
     [not found]           ` <7v1x7lk8fl.fsf_-_@assigned-by-dhcp.cox.net>
2005-06-02  0:47             ` [PATCH] Handle deltified object correctly in git-*-pull family Nicolas Pitre
     [not found]             ` <7vpsv5hbm5.fsf@assigned-by-dhcp.cox.net>
2005-06-02  0:51               ` [PATCH] Stop inflating the whole SHA1 file only to check size Nicolas Pitre
2005-06-02  1:32                 ` Junio C Hamano
2005-06-02  0:58             ` [PATCH] Handle deltified object correctly in git-*-pull family Linus Torvalds
2005-06-02  1:43               ` Junio C Hamano
2005-05-30 20:59 ` I want to release a "git-1.0" Junio C Hamano
2005-05-30 21:07 ` Junio C Hamano
2005-05-30 22:11 ` David Greaves
2005-05-30 22:12 ` Dave Jones
2005-05-30 22:55   ` Dmitry Torokhov
2005-05-30 23:15     ` Junio C Hamano
2005-05-30 23:23     ` Dmitry Torokhov
2005-05-31  0:52   ` Linus Torvalds
2005-05-30 22:19 ` Ryan Anderson
2005-05-31  0:58   ` Linus Torvalds
2005-05-30 22:32 ` Chris Wedgwood
2005-05-30 23:56   ` Chris Wedgwood
2005-05-31  1:06   ` Linus Torvalds
2005-06-01  2:11     ` Junio C Hamano
2005-06-01  2:25       ` David Lang
2005-06-01  4:53         ` Junio C Hamano
2005-06-01 20:06           ` David Lang
2005-06-01 20:16             ` C. Scott Ananian
2005-06-02  0:43               ` Nicolas Pitre
2005-06-02  1:14                 ` Brian O'Mahoney
2005-06-01 23:03             ` Junio C Hamano
2005-05-31  0:19 ` Petr Baudis
2005-05-31 13:45 ` Eric W. Biederman
2005-06-01  3:04   ` Linus Torvalds
2005-06-01  4:06     ` Junio C Hamano
2005-06-02 23:54       ` Junio C Hamano [this message]
2005-06-03  0:21         ` [PATCH] Fix -B "very-different" logic Linus Torvalds
2005-06-03  1:33           ` Junio C Hamano
2005-06-03  8:32             ` [PATCH 0/4] " Junio C Hamano
2005-06-03  8:36               ` [PATCH 1/4] Tweak count-delta interface Junio C Hamano
2005-06-03  8:36               ` [PATCH 2/4] diff: Fix docs and add -O to diff-helper Junio C Hamano
2005-06-03  8:37               ` [PATCH 3/4] diff: Clean up diff_scoreopt_parse() Junio C Hamano
2005-06-03  8:40               ` [PATCH 4/4] diff: Update -B heuristics Junio C Hamano
2005-06-01  6:28     ` I want to release a "git-1.0" Junio C Hamano
2005-06-01 22:00     ` Daniel Barkalow
2005-06-01 23:05       ` Junio C Hamano
2005-06-03  9:47       ` Petr Baudis
2005-06-03 15:09         ` Daniel Barkalow
2005-06-02  7:15     ` Eric W. Biederman
2005-06-02  8:32       ` Kay Sievers
2005-06-02 14:52       ` Linus Torvalds
2005-06-02 12:02     ` [PATCH] several typos in tutorial Alexey Nezhdanov
2005-06-02 12:41       ` Vincent Hanquez
2005-06-02 12:45         ` Alexey Nezhdanov
2005-06-02 12:51           ` Vincent Hanquez
2005-06-02 12:56             ` Alexey Nezhdanov
2005-06-02 13:00             ` Alexey Nezhdanov
2005-06-02 23:40     ` I want to release a "git-1.0" Adam Kropelin
2005-06-03  0:06       ` Linus Torvalds
2005-06-03  0:47         ` Linus Torvalds
2005-06-03  1:34           ` Adam Kropelin
2005-06-02 19:43 ` CVS migration section to the tutorial Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7vll5sz54z.fsf_-_@assigned-by-dhcp.cox.net \
    --to=junkio@cox.net \
    --cc=git@vger.kernel.org \
    --cc=torvalds@osdl.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).