git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Barret Rhoden <brho@google.com>
To: git@vger.kernel.org
Cc: "Michael Platings" <michael@platin.gs>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"David Kastrup" <dak@gnu.org>, "Jeff King" <peff@peff.net>,
	"Jeff Smith" <whydoubt@gmail.com>,
	"Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
	"Junio C Hamano" <gitster@pobox.com>,
	"René Scharfe" <l.s.r@web.de>,
	"Stefan Beller" <stefanbeller@gmail.com>
Subject: [PATCH v5 6/6] RFC blame: use a fingerprint heuristic to match ignored lines
Date: Wed,  3 Apr 2019 12:02:07 -0400	[thread overview]
Message-ID: <20190403160207.149174-7-brho@google.com> (raw)
In-Reply-To: <20190403160207.149174-1-brho@google.com>

This replaces the heuristic used to identify lines from ignored commits
with one that finds likely candidate lines in the parent's version of
the file.

The old heuristic simply assigned lines in the target to the same line
number (plus offset) in the parent.  The new function uses
fingerprinting based on the sum of the bitwise-ORed bits in a string.

The fingerprint code and the idea to use them for blame
came from Michael Platings <michael@platin.gs>.

For each line in the target, guess_line_blames() finds the best line in
the parent, above a magic threshold (1 byte pair, currently).  Ties are
broken by proximity of the parent line number to the target's line.

Here's an example of the difference between algorithms.  Consider a file
with four commits:

	commit-a 11) void new_func_1(void *x, void *y);
	commit-b 12) void new_func_2(void *x, void *y);
	commit-c 13) some_line_c
	commit-d 14) some_line_d

After a commit 'X', we have:

	commit-X 11) void new_func_1(void *x,
	commit-X 12)                 void *y);
	commit-X 13) void new_func_2(void *x,
	commit-X 14)                 void *y);
	commit-c 15) some_line_c
	commit-d 16) some_line_d

When we blame-ignored with the old algorithm, we get:

	commit-a 11) void new_func_1(void *x,
	commit-b 12)                 void *y);
	00000000 13) void new_func_2(void *x,
	00000000 14)                 void *y);
	commit-c 15) some_line_c
	commit-d 16) some_line_d

Where commit-b is blamed for 12 instead of 13.  With the fingerprint
algorithm, we get:

	commit-a 11) void new_func_1(void *x,
	commit-b 12)                 void *y);
	commit-b 13) void new_func_2(void *x,
	commit-b 14)                 void *y);
	commit-c 15) some_line_c
	commit-d 16) some_line_d

Note both lines 12 and 14 are given to commit b.  Their match is above
the FINGERPRINT_THRESHOLD, and they tied.  Specifically, parent lines 11
and 12 both match these lines.  The algorithm chose parent line 12,
since that was closest to the target line numbers of 12 and 14.

If we increase the threshold, say to 10, those two lines won't match,
and will be treated as 'unblamable.'

TODO:
- Is '1' a decent threshold?  Add another user option?

- Can we decrease that FINGERPRINT_LENGTH?  Or do something about the
TODO about using a set data structure?

- How about matching *outside* the parent's diff hunk?  Right now, this
just looks in the parent's hunk for a match.  But a better match may be
elsewhere in the file.  This might happen when the diff has too small of
a -M.  If we do this, then consider caching the fingerprints for a
parent's entire file, since multiple target blame_entries may check the
entire parent's space.  Also, if we do this, we probably need to sort the
parent's blame list (again), since the spliced entries point outside of
the diff hunk's range in the parent's address space.

- If we never intend to match outside the parent's diff hunk, then we
might be able to short-circuit guess_line_blames() when the parent's
chunk length is 0.  Doing this somewhat limits the algorithms, and would
be a performance optimization, which I didn't want to do without numbers
saying we needed it.

- Fix up this commit + message.  I'd be up for splitting it more,
particularly if Michael wants his contributions/fingerprinting in his
own commit.

Suggested-by: Michael Platings <michael@platin.gs>
Signed-off-by: Barret Rhoden <brho@google.com>
---
 blame.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 87 insertions(+), 5 deletions(-)

diff --git a/blame.c b/blame.c
index c06cbd906658..50511a300059 100644
--- a/blame.c
+++ b/blame.c
@@ -915,27 +915,109 @@ static int are_lines_adjacent(struct blame_line_tracker *first,
 	       first->s_lno + 1 == second->s_lno;
 }
 
-/*
- * This cheap heuristic assigns lines in the chunk to their relative location in
- * the parent's chunk.  Any additional lines are left with the target.
+/* https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel */
+static int bitcount(uint32_t v)
+{
+	v = v - ((v >> 1) & 0x55555555u);
+	v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
+	return (((v + (v >> 4)) & 0xf0f0f0fu) * 0x1010101u) >> 24;
+}
+
+#define FINGERPRINT_LENGTH (8 * 256)
+#define FINGERPRINT_THRESHOLD 1
+/* This is just a bitset indicating which byte pairs are present.
+ * e.g. the string "good goo" has pairs "go", "oo", "od", "d ", " g"
+ * String similarity is calculated as a bitwise or and counting the set bits.
+ *
+ * TODO for the string lengths we typically deal with, this would probably be
+ * implemented more efficiently with a set data structure.
  */
+struct fingerprint {
+	uint32_t bits[FINGERPRINT_LENGTH];
+};
+
+static void get_fingerprint(struct fingerprint *result, const char *line_begin,
+			    const char *line_end)
+{
+	for (const char *p = line_begin; p + 1 < line_end; ++p) {
+		unsigned int c = tolower(*p) | (tolower(*(p + 1)) << 8);
+
+		result->bits[c >> 5] |= 1u << (c & 0x1f);
+	}
+}
+
+static int fingerprint_similarity(const struct fingerprint *a,
+				  const struct fingerprint *b)
+{
+	int intersection = 0;
+
+	for (int i = 0; i < FINGERPRINT_LENGTH; ++i)
+		intersection += bitcount(a->bits[i] & b->bits[i]);
+	return intersection;
+}
+
+static void get_chunk_fingerprints(struct fingerprint *fingerprints,
+				   const char *content,
+				   const int *line_starts,
+				   long chunk_start,
+				   long chunk_length)
+{
+	line_starts += chunk_start;
+	for (int i = 0; i != chunk_length; ++i) {
+		const char *linestart = content + line_starts[i];
+		const char *lineend = content + line_starts[i + 1];
+
+		get_fingerprint(fingerprints + i, linestart, lineend);
+	}
+}
+
 static void guess_line_blames(struct blame_entry *e,
 			      struct blame_origin *parent,
 			      struct blame_origin *target,
 			      int offset, int delta,
 			      struct blame_line_tracker *line_blames)
 {
+	struct fingerprint *fp_parent, *fp_target;
 	int nr_parent_lines = e->num_lines - delta;
 
+	fp_parent = xcalloc(sizeof(struct fingerprint), nr_parent_lines);
+	fp_target = xcalloc(sizeof(struct fingerprint), e->num_lines);
+
+	get_chunk_fingerprints(fp_parent, parent->file.ptr,
+			       parent->line_starts,
+			       e->s_lno + offset, nr_parent_lines);
+	get_chunk_fingerprints(fp_target, target->file.ptr,
+			       target->line_starts,
+			       e->s_lno, e->num_lines);
+
 	for (int i = 0; i < e->num_lines; i++) {
-		if (i < nr_parent_lines) {
+		int best_sim_val = FINGERPRINT_THRESHOLD;
+		int best_sim_idx = -1;
+		int sim;
+
+		for (int j = 0; j < nr_parent_lines; j++) {
+			sim = fingerprint_similarity(&fp_target[i],
+						     &fp_parent[j]);
+			if (sim < best_sim_val)
+				continue;
+			/* Break ties with the closest-to-target line number */
+			if (sim == best_sim_val && best_sim_idx != -1 &&
+			    abs(best_sim_idx - i) < abs(j - i))
+				continue;
+			best_sim_val = sim;
+			best_sim_idx = j;
+		}
+		if (best_sim_idx >= 0) {
 			line_blames[i].is_parent = 1;
-			line_blames[i].s_lno = e->s_lno + i + offset;
+			line_blames[i].s_lno = e->s_lno + offset + best_sim_idx;
 		} else {
 			line_blames[i].is_parent = 0;
 			line_blames[i].s_lno = e->s_lno + i;
 		}
 	}
+
+	free(fp_parent);
+	free(fp_target);
 }
 
 /*
-- 
2.21.0.392.gf8f6787159e-goog


  parent reply	other threads:[~2019-04-03 16:02 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-03 16:02 [PATCH v5 0/6] blame: add the ability to ignore commits Barret Rhoden
2019-04-03 16:02 ` [PATCH v5 1/6] Move init_skiplist() outside of fsck Barret Rhoden
2019-04-03 16:02 ` [PATCH v5 2/6] blame: use a helper function in blame_chunk() Barret Rhoden
2019-04-03 16:02 ` [PATCH v5 3/6] blame: optionally track the line starts during fill_blame_origin() Barret Rhoden
2019-04-03 16:02 ` [PATCH v5 4/6] blame: add the ability to ignore commits and their changes Barret Rhoden
2019-04-03 16:02 ` [PATCH v5 5/6] blame: add a config option to mark ignored lines Barret Rhoden
2019-04-03 16:02 ` Barret Rhoden [this message]
2019-04-04 16:37   ` [PATCH v5 6/6] RFC blame: use a fingerprint heuristic to match " SZEDER Gábor
     [not found] <[PATCH v5 6/6] RFC blame: use a fingerprint heuristic to match ignored lines>
2019-04-07 21:46 ` michael
2019-04-07 21:52   ` David Kastrup
2019-04-08  9:48     ` Michael Platings
2019-04-08 16:03       ` Barret Rhoden
2019-04-09 15:38         ` Junio C Hamano
2019-04-09 15:56   ` Barret Rhoden
2019-04-09 19:10     ` Barret Rhoden

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190403160207.149174-7-brho@google.com \
    --to=brho@google.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=avarab@gmail.com \
    --cc=dak@gnu.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=michael@platin.gs \
    --cc=peff@peff.net \
    --cc=stefanbeller@gmail.com \
    --cc=whydoubt@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).