git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [RFC PATCH 0/1] Fuzzy blame
@ 2019-03-24 23:50 michael
  2019-03-24 23:50 ` [RFC PATCH 1/1] " michael
  2019-03-25  2:39 ` [RFC PATCH 0/1] " Junio C Hamano
  0 siblings, 2 replies; 14+ messages in thread
From: michael @ 2019-03-24 23:50 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Stefan Beller, Jeff Smith, Junio C Hamano,
	René Scharfe, Michael Platings

From: Michael Platings <michael@platin.gs>

Hi Git devs,

Some of you may be familiar with the git-hyper-blame tool [1]. It's "useful if
you have a commit that makes sweeping changes that are unlikely to be what you
are looking for in a blame, such as mass reformatting or renaming."

git-hyper-blame is useful but (a) it's not convenient to install; (b) it's
missing functionality available in regular git blame; (c) it's method of
matching lines between chunks is too simplistic for many use cases; and
(d) it's not Git so it doesn't integrate well with tools that expect Git
e.g. vim plugins. Therefore I'm hoping to add similar and hopefully superior
functionality to Git itself. I have a very rough patch so I'd like to get your
thoughts on the general approach, particularly in terms of its user-visible
behaviour.

My initial idea was to lift the design directly from git-hyper-blame. However
the approach of picking single revisions to somehow ignore doesn't sit well
with the -w, -M & -C options, which have a similar intent but apply to all
revisions.

I'd like to get your thoughts on whether we could allow applying the -M or -w
options to specific revisions. For example, imagine it was agreed that all
the #includes in a project should be reordered. In that case, it would be useful
to be able to specify that the -M option should be used for blames on that
revision specifically, so that in future when someone wants to know why
a #include was added they don't have to run git blame twice to find out.

Options that are specific to a particular revision could be stored in a
".gitrevisions" file or similar.

If the principle of allowing blame options to be applied per-revision is
agreeable then I'd like to add a -F/--fuzzy option, to sit alongside -w, -M & -C.

I've implemented a prototype "fuzzy" option, patch attached.
The option operates at the level of diff chunks. For each line in the "after"
half of the chunk it uses a heuristic to choose which line in the "before" half
of the chunk matches best. The heuristic I'm using at the moment is of matching
"bigrams" as described in [2]. The initial pass typically gives reasonable
results, but can jumble up the lines. As in the reformatting/renaming use case
the content should stay in the same order, it's worth going to extra effort to
avoid jumbling lines. Therefore, after the initial pass, the line that can be
matched with the most confidence is used to partition the chunk into halves
before and after it. The process is then repeated recursively on the halves
above and below the partition line.
I feel like a similar algorithm has probably already been invented in a better
form - if anyone knows of such a thing then please let me know!

I look forward to hearing your thoughts.
Thanks,
-Michael


[1] https://commondatastorage.googleapis.com/chrome-infra-docs/flat/depot_tools/docs/html/git-hyper-blame.html
[2] https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient

Michael Platings (1):
  Add git blame --fuzzy option.

 blame.c                | 352 +++++++++++++++++++++++++++++++++++++++++++++++--
 blame.h                |   1 +
 builtin/blame.c        |   3 +
 t/t8020-blame-fuzzy.sh | 264 +++++++++++++++++++++++++++++++++++++
 4 files changed, 609 insertions(+), 11 deletions(-)
 create mode 100755 t/t8020-blame-fuzzy.sh

-- 
2.14.3 (Apple Git-98)


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCH 1/1] Fuzzy blame
  2019-03-24 23:50 [RFC PATCH 0/1] Fuzzy blame michael
@ 2019-03-24 23:50 ` michael
  2019-03-25  2:39 ` [RFC PATCH 0/1] " Junio C Hamano
  1 sibling, 0 replies; 14+ messages in thread
From: michael @ 2019-03-24 23:50 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Stefan Beller, Jeff Smith, Junio C Hamano,
	René Scharfe, Michael Platings

From: Michael Platings <michael@platin.gs>

---
 blame.c                | 352 +++++++++++++++++++++++++++++++++++++++++++++++--
 blame.h                |   1 +
 builtin/blame.c        |   3 +
 t/t8020-blame-fuzzy.sh | 264 +++++++++++++++++++++++++++++++++++++
 4 files changed, 609 insertions(+), 11 deletions(-)
 create mode 100755 t/t8020-blame-fuzzy.sh

diff --git a/blame.c b/blame.c
index 5c07dec190..b5a40c8e9f 100644
--- a/blame.c
+++ b/blame.c
@@ -997,6 +997,326 @@ static void pass_blame_to_parent(struct blame_scoreboard *sb,
 	return;
 }
 
+/* https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel */
+static int bitcount(uint32_t v) {
+	v = v - ((v >> 1) & 0x55555555u);
+	v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
+	return ((v + (v >> 4) & 0xf0f0f0fu) * 0x1010101u) >> 24;
+}
+
+#define FINGERPRINT_LENGTH (8*256)
+/* This is just a bitset indicating which byte pairs are present.
+ e.g. the string "good goo" has pairs "go", "oo", "od", "d ", " g"
+ String similarity is calculated as a bitwise or and counting the set bits.
+ TODO for the string lengths we typically deal with, this would probably be
+ implemented more efficiently with a set data structure.
+ */
+struct fingerprint {
+	uint32_t bits[FINGERPRINT_LENGTH];
+};
+
+static void get_fingerprint(struct fingerprint *result,
+			    const char *line_begin, const char *line_end) {
+	memset(result, 0, sizeof(struct fingerprint));
+	for (const char *p = line_begin; p + 1 < line_end; ++p) {
+		unsigned c = tolower(*p) | (tolower(*(p + 1)) << 8);
+		result->bits[c >> 5] |= 1u << (c & 0x1f);
+	}
+}
+
+static int fingerprint_similarity(const struct fingerprint *a,
+				  const struct fingerprint *b) {
+	int intersection = 0;
+	for (int i = 0; i < FINGERPRINT_LENGTH; ++i) {
+		intersection += bitcount(a->bits[i] & b->bits[i]);
+	}
+	return intersection;
+}
+
+struct fuzzy_blame_parent_data {
+	struct blame_origin *parent;
+	long offset;
+	struct blame_entry **processed_entries;
+	struct blame_entry **target_entries;
+	const char *parent_content;
+	const char *target_content;
+	int *parent_line_starts;
+	int *target_line_starts;
+	int parent_line_count;
+	int target_line_count;
+};
+
+static void get_chunk_fingerprints(struct fingerprint *fingerprints,
+				   const char *content,
+				   const int *line_starts,
+				   long chunk_start,
+				   long chunk_length) {
+	line_starts += chunk_start;
+	for (int i = 0; i != chunk_length; ++i) {
+		const char* linestart = content + line_starts[i];
+		const char* lineend = content + line_starts[i + 1];
+		get_fingerprint(fingerprints + i, linestart, lineend);
+	}
+}
+
+/* This finds the line that we can match with the most confidence, and
+ uses it as a partition. It then calls itself on the lines on either side of
+ that partition. In this way we avoid lines appearing out of order, and retain
+ a sensible line ordering.
+ TODO: so much optimisation. Currently this does the same work repeatedly.
+ */
+static void fuzzy_find_matching_lines_recurse(
+		const char *content_a, const char *content_b,
+		const int *line_starts_a, const int *line_starts_b,
+		int start_a, int start_b,
+		int length_a, int length_b,
+		int *result,
+		struct fingerprint *fingerprints_a,
+		struct fingerprint *fingerprints_b) {
+
+	int most_certain_line = -1;
+	int most_certain_line_certainty = -1;
+
+	for (int i = 0; i < length_b; ++i) {
+		const struct fingerprint *fingerprint_b = fingerprints_b + i;
+
+		int closest_line_a = (i * 2 + 1) * length_a /
+		(length_b * 2);
+
+		/* Limit range of search to a reasonable number of lines.
+		 TODO consider scaling this up if length_a is greater than
+		 length_b. */
+		const int MAX_SEARCH_DISTANCE = 5;
+		int search_start = closest_line_a - (MAX_SEARCH_DISTANCE - 1);
+		int search_end = closest_line_a + MAX_SEARCH_DISTANCE;
+		if (search_start < 0) search_start = 0;
+		if (search_end > length_a) search_end = length_a;
+
+		/* Find both the best and 2nd best matches. The match certainty
+		 is the difference between these values. */
+		int best_similarity = 0, second_best_similarity = 0;
+		int best_similarity_index = 0;
+
+		for (int j = search_start; j < search_end; ++j) {
+			int similarity = fingerprint_similarity(
+								fingerprint_b,
+								fingerprints_a + j) *
+				(1000 - abs(j - closest_line_a));
+
+			if (similarity > best_similarity) {
+				second_best_similarity = best_similarity;
+				best_similarity = similarity;
+				best_similarity_index = j;
+			}
+			else if (similarity > second_best_similarity) {
+				second_best_similarity = similarity;
+			}
+		}
+
+		if (best_similarity == 0) {
+			result[i] = -1;
+			continue;
+		}
+
+		result[i] = start_a + best_similarity_index;
+
+		int certainty = best_similarity - second_best_similarity;
+		if (certainty > most_certain_line_certainty) {
+			most_certain_line_certainty = certainty;
+			most_certain_line = i;
+		}
+	}
+
+	if (most_certain_line == -1) {
+		return;
+	}
+
+	if (most_certain_line > 0) {
+		fuzzy_find_matching_lines_recurse(content_a, content_b, line_starts_a, line_starts_b, start_a, start_b, result[most_certain_line] + 1 - start_a, most_certain_line, result, fingerprints_a, fingerprints_b);
+	}
+	if (most_certain_line + 1 < length_b) {
+		int second_half_start_a = result[most_certain_line];
+		int second_half_start_b = start_b + most_certain_line + 1;
+		int second_half_length_a = length_a + start_a - second_half_start_a;
+		int second_half_length_b = length_b + start_b - second_half_start_b;
+		fuzzy_find_matching_lines_recurse(content_a, content_b, line_starts_a, line_starts_b, second_half_start_a, second_half_start_b, second_half_length_a, second_half_length_b, result + most_certain_line + 1, fingerprints_a + second_half_start_a - start_a, fingerprints_b + most_certain_line + 1);
+	}
+}
+
+/* Find line numbers in "a" that match with lines in "b"
+ Returns an array of either line indices or -1 where no match is found.
+ The returned array must be free()d after use.
+ */
+static int *fuzzy_find_matching_lines(
+	const char *content_a, const char *content_b,
+	const int *line_starts_a, const int *line_starts_b,
+	int start_a, int start_b,
+	int length_a, int length_b) {
+
+	int *result = malloc(sizeof(int) * length_b);
+
+	struct fingerprint *fingerprints_a =
+		malloc(sizeof(struct fingerprint) * length_a);
+	struct fingerprint *fingerprints_b =
+		malloc(sizeof(struct fingerprint) * length_b);
+
+	get_chunk_fingerprints(fingerprints_a, content_a,
+			       line_starts_a,
+			       start_a, length_a);
+	get_chunk_fingerprints(fingerprints_b, content_b,
+			       line_starts_b,
+			       start_b, length_b);
+
+	fuzzy_find_matching_lines_recurse(content_a, content_b,
+					    line_starts_a, line_starts_b,
+					    start_a, start_b,
+					    length_a, length_b,
+					    result,
+					    fingerprints_a,
+					    fingerprints_b);
+
+	free(fingerprints_a);
+	free(fingerprints_b);
+
+	return result;
+}
+
+static int blame_chunk_fuzzy(long parent_chunk_start,
+				  long parent_chunk_length,
+				  long target_chunk_start,
+				  long target_chunk_length,
+				  void *data)
+{
+	struct fuzzy_blame_parent_data *d = data;
+
+	if (parent_chunk_start - target_chunk_start != d->offset)
+		die("internal error in blame::blame_chunk_fuzzy");
+
+	int target_chunk_end = target_chunk_start + target_chunk_length;
+
+	struct blame_origin *parent = d->parent;
+	struct blame_entry *e = *d->target_entries;
+	struct blame_entry *parent_tail = NULL;
+	struct blame_entry *target_tail = NULL;
+
+	if (parent_chunk_length == 0) {
+		/* Don't try to blame parent for newly added lines */
+		while (e && e->s_lno < target_chunk_end) {
+			target_tail = e;
+			e = e->next;
+		}
+		d->target_entries = &target_tail->next;
+		goto finish;
+	}
+
+	int *matched_lines = fuzzy_find_matching_lines(
+		d->parent_content, d->target_content,
+		d->parent_line_starts, d->target_line_starts,
+		parent_chunk_start, target_chunk_start,
+		parent_chunk_length, target_chunk_length);
+
+	while (e && e->s_lno < target_chunk_end) {
+		struct blame_entry *next = e->next;
+
+		for (int i = 0; i < e->num_lines; ++i) {
+			struct blame_entry *n =
+				xcalloc(1, sizeof (struct blame_entry));
+			n->lno = e->lno + i;
+			n->num_lines = 1;
+			n->score = 0;
+
+			int matched_line = matched_lines[i + e->s_lno -
+				target_chunk_start];
+
+			if (matched_line != -1) {
+				n->suspect = blame_origin_incref(parent);
+				n->s_lno = matched_line;
+				n->next = parent_tail;
+				parent_tail = n;
+			} else {
+				n->suspect = blame_origin_incref(e->suspect);
+				n->s_lno = e->s_lno + i;
+				n->next = target_tail;
+				target_tail = n;
+			}
+		}
+
+		blame_origin_decref(e->suspect);
+		free(e);
+
+		e = next;
+	}
+
+	if (parent_tail) {
+		parent_tail = llist_mergesort(parent_tail, get_next_blame,
+					      set_next_blame,
+					      compare_blame_suspect);
+		*d->processed_entries = parent_tail;
+		while (parent_tail->next) parent_tail = parent_tail->next;
+		d->processed_entries = &parent_tail->next;
+	}
+
+	if (target_tail) {
+		target_tail = llist_mergesort(target_tail, get_next_blame,
+					      set_next_blame,
+					      compare_blame_suspect);
+		*d->target_entries = target_tail;
+		while (target_tail->next) target_tail = target_tail->next;
+		d->target_entries = &target_tail->next;
+	}
+
+	*d->target_entries = e;
+
+	free(matched_lines);
+
+finish:
+	d->offset = parent_chunk_start + parent_chunk_length -
+		(target_chunk_start + target_chunk_length);
+
+	return 0;
+}
+
+static int find_line_starts(int **line_starts, const char *buf, unsigned long len);
+
+static void pass_blame_to_parent_fuzzy(struct blame_scoreboard *sb,
+				       struct blame_origin *target,
+				       struct blame_origin *parent)
+{
+	mmfile_t file_p, file_o;
+	struct fuzzy_blame_parent_data d;
+	struct blame_entry *newdest = NULL;
+
+	if (!target->suspects)
+		return; /* nothing remains for this target */
+
+	d.parent = parent;
+	d.offset = 0;
+	d.processed_entries = &newdest;
+	d.target_entries = &target->suspects;
+
+	fill_origin_blob(&sb->revs->diffopt, parent, &file_p, &sb->num_read_blob);
+	fill_origin_blob(&sb->revs->diffopt, target, &file_o, &sb->num_read_blob);
+	sb->num_get_patch++;
+
+	d.parent_content = file_p.ptr;
+	d.target_content = file_o.ptr;
+	d.parent_line_count = find_line_starts(&d.parent_line_starts, file_p.ptr, file_p.size);
+	d.target_line_count = find_line_starts(&d.target_line_starts, file_o.ptr, file_o.size);
+
+	if (diff_hunks(&file_p, &file_o, blame_chunk_fuzzy, &d, sb->xdl_opts))
+		die("unable to generate diff (%s -> %s)",
+			oid_to_hex(&parent->commit->object.oid),
+			oid_to_hex(&target->commit->object.oid));
+
+	*d.processed_entries = NULL;
+	queue_blames(sb, parent, newdest);
+
+	free(d.target_line_starts);
+	free(d.parent_line_starts);
+
+	return;
+}
+
 /*
  * The lines in blame_entry after splitting blames many times can become
  * very small and trivial, and at some point it becomes pointless to
@@ -1433,7 +1753,7 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
 	struct commit *commit = origin->commit;
 	struct commit_list *sg;
 	struct blame_origin *sg_buf[MAXSG];
-	struct blame_origin *porigin, **sg_origin = sg_buf;
+	struct blame_origin *porigin = NULL, **sg_origin = sg_buf;
 	struct blame_entry *toosmall = NULL;
 	struct blame_entry *blames, **blametail = &blames;
 
@@ -1560,6 +1880,11 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
 		*tail = origin->suspects;
 		origin->suspects = toosmall;
 	}
+
+	if (sb->fuzzy && porigin) {
+		pass_blame_to_parent_fuzzy(sb, origin, porigin);
+	}
+
 	for (i = 0; i < num_sg; i++) {
 		if (sg_origin[i]) {
 			drop_origin_blob(sg_origin[i]);
@@ -1645,14 +1970,8 @@ static const char *get_next_line(const char *start, const char *end)
 	return nl ? nl + 1 : end;
 }
 
-/*
- * To allow quick access to the contents of nth line in the
- * final image, prepare an index in the scoreboard.
- */
-static int prepare_lines(struct blame_scoreboard *sb)
+static int find_line_starts(int **line_starts, const char *buf, unsigned long len)
 {
-	const char *buf = sb->final_buf;
-	unsigned long len = sb->final_buf_size;
 	const char *end = buf + len;
 	const char *p;
 	int *lineno;
@@ -1661,15 +1980,26 @@ static int prepare_lines(struct blame_scoreboard *sb)
 	for (p = buf; p < end; p = get_next_line(p, end))
 		num++;
 
-	ALLOC_ARRAY(sb->lineno, num + 1);
-	lineno = sb->lineno;
+	ALLOC_ARRAY(*line_starts, num + 1);
+	lineno = *line_starts;
 
 	for (p = buf; p < end; p = get_next_line(p, end))
 		*lineno++ = p - buf;
 
 	*lineno = len;
 
-	sb->num_lines = num;
+	return num;
+}
+
+/*
+ * To allow quick access to the contents of nth line in the
+ * final image, prepare an index in the scoreboard.
+ */
+static int prepare_lines(struct blame_scoreboard *sb)
+{
+	sb->num_lines = find_line_starts(&sb->lineno,
+									 sb->final_buf,
+									 sb->final_buf_size);
 	return sb->num_lines;
 }
 
diff --git a/blame.h b/blame.h
index be3a895043..eb528f9f80 100644
--- a/blame.h
+++ b/blame.h
@@ -142,6 +142,7 @@ struct blame_scoreboard {
 	int xdl_opts;
 	int no_whole_file_rename;
 	int debug;
+	int fuzzy;
 
 	/* callbacks */
 	void(*on_sanity_fail)(struct blame_scoreboard *, int);
diff --git a/builtin/blame.c b/builtin/blame.c
index 177c1022a0..d0a0dfff79 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -57,6 +57,7 @@ static struct date_mode blame_date_mode = { DATE_ISO8601 };
 static size_t blame_date_width;
 
 static struct string_list mailmap = STRING_LIST_INIT_NODUP;
+static int fuzzy;
 
 #ifndef DEBUG
 #define DEBUG 0
@@ -808,6 +809,7 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
 		OPT_BIT('w', NULL, &xdl_opts, N_("Ignore whitespace differences"), XDF_IGNORE_WHITESPACE),
 		OPT_BIT(0, "color-lines", &output_option, N_("color redundant metadata from previous line differently"), OUTPUT_COLOR_LINE),
 		OPT_BIT(0, "color-by-age", &output_option, N_("color lines by age"), OUTPUT_SHOW_AGE_WITH_COLOR),
+		OPT_BOOL('F', "fuzzy", &fuzzy, N_("Try to assign blame to similar lines in the parent")),
 
 		/*
 		 * The following two options are parsed by parse_revision_opt()
@@ -996,6 +998,7 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
 
 	init_scoreboard(&sb);
 	sb.revs = &revs;
+	sb.fuzzy = fuzzy;
 	sb.contents_from = contents_from;
 	sb.reverse = reverse;
 	sb.repo = the_repository;
diff --git a/t/t8020-blame-fuzzy.sh b/t/t8020-blame-fuzzy.sh
new file mode 100755
index 0000000000..d26c945722
--- /dev/null
+++ b/t/t8020-blame-fuzzy.sh
@@ -0,0 +1,264 @@
+#!/bin/sh
+
+test_description='git blame ignore a specific revision'
+. ./test-lib.sh
+
+pick_author='s/^[0-9a-f^]* *(\([^ ]*\) .*/\1/'
+
+file_count=8
+
+# Each test is composed of 4 variables:
+# titleN - the test name
+# aN - the initial content
+# bN - the final content
+# expectedN - the line numbers from aN that we expect git blame
+#             on bN to identify, or "Final" if bN itself should
+#             be identified as the origin of that line.
+
+title1="Expand lines"
+cat <<EOF >a1
+aaa
+bbb
+ccc
+ddd
+eee
+EOF
+cat <<EOF >b1
+aaa
+bbbx
+bbbx
+ccc
+dddx
+dddx
+eee
+EOF
+cat <<EOF >expected1
+1
+2
+2
+3
+4
+4
+5
+EOF
+
+title2="Combine 3 lines into 2"
+cat <<EOF >a2
+if ((maxgrow==0) ||
+    ( single_line_field && (field->dcols < maxgrow)) ||
+    (!single_line_field && (field->drows < maxgrow)))
+EOF
+cat <<EOF >b2
+if ((maxgrow == 0) || (single_line_field && (field->dcols < maxgrow)) ||
+    (!single_line_field && (field->drows < maxgrow))) {
+EOF
+cat <<EOF >expected2
+2
+3
+EOF
+
+title3="Add curly brackets"
+cat <<EOF >a3
+    if (rows) *rows = field->rows;
+    if (cols) *cols = field->cols;
+    if (frow) *frow = field->frow;
+    if (fcol) *fcol = field->fcol;
+EOF
+cat <<EOF >b3
+    if (rows) {
+      *rows = field->rows;
+    }
+    if (cols) {
+      *cols = field->cols;
+    }
+    if (frow) {
+      *frow = field->frow;
+    }
+    if (fcol) {
+      *fcol = field->fcol;
+    }
+EOF
+cat <<EOF >expected3
+1
+1
+Final
+2
+2
+Final
+3
+3
+Final
+4
+4
+Final
+EOF
+
+
+title4="Combine many lines and change case"
+cat <<EOF >a4
+for(row=0,pBuffer=field->buf;
+    row<height;
+    row++,pBuffer+=width )
+  {
+    if ((len = (int)( After_End_Of_Data( pBuffer, width ) - pBuffer )) > 0)
+      {
+        wmove( win, row, 0 );
+        waddnstr( win, pBuffer, len );
+EOF
+cat <<EOF >b4
+for (Row = 0, PBuffer = field->buf; Row < Height; Row++, PBuffer += Width) {
+  if ((Len = (int)(afterEndOfData(PBuffer, Width) - PBuffer)) > 0) {
+    wmove(win, Row, 0);
+    waddnstr(win, PBuffer, Len);
+EOF
+cat <<EOF >expected4
+1
+5
+7
+8
+EOF
+
+title5="Rename and combine lines"
+cat <<EOF >a5
+bool need_visual_update = ((form != (FORM *)0)      &&
+                           (form->status & _POSTED) &&
+                           (form->current==field));
+
+if (need_visual_update)
+  Synchronize_Buffer(form);
+
+if (single_line_field)
+  {
+    growth = field->cols * amount;
+    if (field->maxgrow)
+      growth = Minimum(field->maxgrow - field->dcols,growth);
+    field->dcols += growth;
+    if (field->dcols == field->maxgrow)
+EOF
+cat <<EOF >b5
+bool NeedVisualUpdate = ((Form != (FORM *)0) && (Form->status & _POSTED) &&
+                         (Form->current == field));
+
+if (NeedVisualUpdate) {
+  synchronizeBuffer(Form);
+}
+
+if (SingleLineField) {
+  Growth = field->cols * amount;
+  if (field->maxgrow) {
+    Growth = Minimum(field->maxgrow - field->dcols, Growth);
+  }
+  field->dcols += Growth;
+  if (field->dcols == field->maxgrow) {
+EOF
+cat <<EOF >expected5
+1
+3
+4
+5
+6
+Final
+7
+8
+10
+11
+12
+Final
+13
+14
+EOF
+
+# Both lines match identically so position must be used to tie-break.
+title6="Same line twice"
+cat <<EOF >a6
+abc
+abc
+EOF
+cat <<EOF >b6
+abcd
+abcd
+EOF
+cat <<EOF >expected6
+1
+2
+EOF
+
+title7="Enforce line order"
+cat <<EOF >a7
+abcdef
+ghijkl
+ab
+EOF
+cat <<EOF >b7
+ghijk
+abcd
+EOF
+cat <<EOF >expected7
+2
+3
+EOF
+
+title8="Expand lines and rename variables"
+cat <<EOF >a8
+int myFunction(int ArgumentOne, Thing *ArgTwo, Blah XuglyBug) {
+  Squiggle FabulousResult = squargle(ArgumentOne, *ArgTwo,
+    XuglyBug) + EwwwGlobalWithAReallyLongNameYepTooLong;
+  return FabulousResult * 42;
+}
+EOF
+cat <<EOF >b8
+int myFunction(int argument_one, Thing *arg_asdfgh,
+    Blah xugly_bug) {
+  Squiggle fabulous_result = squargle(argument_one,
+    *arg_asdfgh, xugly_bug)
+    + g_ewww_global_with_a_really_long_name_yep_too_long;
+  return fabulous_result * 42;
+}
+EOF
+cat <<EOF >expected8
+1
+1
+2
+3
+3
+4
+5
+EOF
+
+test_expect_success setup '
+	{ for ((i=1;i<=$file_count;i++))
+	do
+		# Append each line in a separate commit to make it easy to
+		# check which original line the blame output relates to.
+
+		line_count=0 &&
+		{ while read line
+		do
+			line_count=$((line_count+1)) &&
+			echo $line >>"$i" &&
+			git add "$i" &&
+			test_tick &&
+			GIT_AUTHOR_NAME="$line_count" git commit -m "$line_count"
+		done } <"a$i"
+	done } &&
+
+	{ for ((i=1;i<=$file_count;i++))
+	do
+		# Overwrite the files with the final content.
+		cp b$i $i &&
+		git add $i &&
+		test_tick
+	done } &&
+
+	# Commit the final content all at once so it can all be
+	# referred to with the same commit ID.
+	GIT_AUTHOR_NAME=Final git commit -m Final
+'
+
+for ((i=1;i<=$file_count;i++)); do
+	title="title$i"
+	test_expect_success "${!title}" \
+	"git blame --fuzzy -- $i | sed -e \"$pick_author\" >actual && test_cmp expected$i actual"
+done
+
+test_done
-- 
2.14.3 (Apple Git-98)


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/1] Fuzzy blame
  2019-03-24 23:50 [RFC PATCH 0/1] Fuzzy blame michael
  2019-03-24 23:50 ` [RFC PATCH 1/1] " michael
@ 2019-03-25  2:39 ` Junio C Hamano
  2019-03-25  9:32   ` Michael Platings
  1 sibling, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2019-03-25  2:39 UTC (permalink / raw)
  To: michael; +Cc: git, Jeff King, Stefan Beller, Jeff Smith, René Scharfe,
	brho

michael@platin.gs writes:

> From: Michael Platings <michael@platin.gs>
>
> Hi Git devs,
>
> Some of you may be familiar with the git-hyper-blame tool [1]. It's "useful if
> you have a commit that makes sweeping changes that are unlikely to be what you
> are looking for in a blame, such as mass reformatting or renaming."

I recall a month or so ago brho@google (CC'ed) sent a "let's allow
blame to ignore some uninteresting commit" topic, which was
unfortunately not well reviewed (what I mean is *not* that it was
reviewed thoroughly and found to be bad---not many reviewers found
time or inclination to review it well).  The topic is queued as
br/blame-ignore and its tip is at 43a290e3 ("SQUASH???", 2019-02-13)
as of this writing.

Perhaps you two can join forces?

P.S. I expect to be offline for most of the week (packing, moving
and unpacking.  Even though the places packing and unpacking happens
are within 1 kilometer radius, that does not make it less hassle
X-<).  See you guys next month.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/1] Fuzzy blame
  2019-03-25  2:39 ` [RFC PATCH 0/1] " Junio C Hamano
@ 2019-03-25  9:32   ` Michael Platings
  2019-03-25 16:04     ` Barret Rhoden
  0 siblings, 1 reply; 14+ messages in thread
From: Michael Platings @ 2019-03-25  9:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Jeff King, Stefan Beller, Jeff Smith, René Scharfe,
	brho

(resending in plain text mode, sorry for the noise)

Thanks Junio, that's super helpful!

A month or two ago I contacted the author of git-hyper-blame, Matt
Giuca, asking whether anyone had looked into adding the feature to git
blame. I didn't receive a response but maybe that prompted Barret
Rhoden's patch? Or maybe just a weird coincidence!

@Barret I see your patches are a nice translation of git-hyper-blame.
However could you give me your thoughts on the approach in my patch? A
comment in the git-hyper-blame source [1] says:
# This doesn't work that well if there are a lot of line changes within the
# hunk (demonstrated by GitHyperBlameLineMotionTest.testIntraHunkLineMotion).
# A fuzzy heuristic that takes the text of the new line and tries to find a
# deleted line within the hunk that mostly matches the new line could help.

My patch aims to implement this "fuzzy heuristic" so I'd love to get
your take on it.

Many thanks,
-Michael

On Mon, 25 Mar 2019 at 02:39, Junio C Hamano <gitster@pobox.com> wrote:
>
> michael@platin.gs writes:
>
> > From: Michael Platings <michael@platin.gs>
> >
> > Hi Git devs,
> >
> > Some of you may be familiar with the git-hyper-blame tool [1]. It's "useful if
> > you have a commit that makes sweeping changes that are unlikely to be what you
> > are looking for in a blame, such as mass reformatting or renaming."
>
> I recall a month or so ago brho@google (CC'ed) sent a "let's allow
> blame to ignore some uninteresting commit" topic, which was
> unfortunately not well reviewed (what I mean is *not* that it was
> reviewed thoroughly and found to be bad---not many reviewers found
> time or inclination to review it well).  The topic is queued as
> br/blame-ignore and its tip is at 43a290e3 ("SQUASH???", 2019-02-13)
> as of this writing.
>
> Perhaps you two can join forces?
>
> P.S. I expect to be offline for most of the week (packing, moving
> and unpacking.  Even though the places packing and unpacking happens
> are within 1 kilometer radius, that does not make it less hassle
> X-<).  See you guys next month.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/1] Fuzzy blame
  2019-03-25  9:32   ` Michael Platings
@ 2019-03-25 16:04     ` Barret Rhoden
  2019-03-25 23:21       ` Michael Platings
  0 siblings, 1 reply; 14+ messages in thread
From: Barret Rhoden @ 2019-03-25 16:04 UTC (permalink / raw)
  To: Michael Platings, Junio C Hamano
  Cc: git, Jeff King, Stefan Beller, Jeff Smith, René Scharfe

Hi -

On 3/25/19 5:32 AM, Michael Platings wrote:
> (resending in plain text mode, sorry for the noise)
> 
> Thanks Junio, that's super helpful!
> 
> A month or two ago I contacted the author of git-hyper-blame, Matt
> Giuca, asking whether anyone had looked into adding the feature to git
> blame. I didn't receive a response but maybe that prompted Barret
> Rhoden's patch? Or maybe just a weird coincidence!

Weird coincidence.  It's a big company.  =)

I work on a project that needs a major reformatting, and one thing 
delaying me was the lack of an ability to ignore commits during blame. 
hyper-blame does that, but I never liked the fact that it wasn't 
directly in git.

> 
> @Barret I see your patches are a nice translation of git-hyper-blame.

Not sure if you've seen my latest version, updated to v4 (2019-02-26) here:

https://public-inbox.org/git/20190226170648.211847-1-brho@google.com/

The one Junio has (br/blame-ignore) hasn't been updated - not sure if 
that's automated, or if it just fell through the cracks.

> However could you give me your thoughts on the approach in my patch? A
> comment in the git-hyper-blame source [1] says:
> # This doesn't work that well if there are a lot of line changes within the
> # hunk (demonstrated by GitHyperBlameLineMotionTest.testIntraHunkLineMotion).
> # A fuzzy heuristic that takes the text of the new line and tries to find a
> # deleted line within the hunk that mostly matches the new line could help.
> 
> My patch aims to implement this "fuzzy heuristic" so I'd love to get
> your take on it.

This is an interesting idea, and it sounds like it might be 
complimentary to the blame-ignore work.  Both have the flavor of "user 
knows this commit is special and wants special processing."

In my patch, I didn't try to find the likely original version of a line 
in a diff hunk.  What I did amounted to finding blame based on the 
parent's image of the file.  Example in this message:

https://public-inbox.org/git/20190226170648.211847-4-brho@google.com/

Of note, line 12 was blamed on commit b, when it really came from commit a.

For any lines added beyond the size of the parent's image (e.g. the 
ignored commit added more lines than it removed), those lines were 
removed from blame contention - marked with all 0s.

In essence, my method did the following:

	for all suspect/child lines 'i' in a diff hunk
		if i <= parent's hunk size
			assign to parent line i (+ offset)
		else
			mark umblamable

Due to the two cases being contiguous, each hunk would be broken up into 
at most two blame entries.  (I actually short-circuit that for loop and 
just split at i == parent_size immediately).

Having a smart/fuzzy matcher sounds nicer.  My patch passes blame to the 
parent.  Yours finds the right *part* of the parent to blame, which 
means we have a better chance of finding the right original commit to blame.

I think you're doing this:

	for all suspect/child lines 'i' in a diff hunk
		if i matches parent line (say, x)
			assign to parent line x
		else
			assign to child line i


 From glancing at your code, it looks like you break up every blame 
entry into N entries, one for each line, which you need since each 
parent line X might not be adjacent to the matching lines in the child.

One question I have is under which circumstances do you find that you 
cannot match a suspect/child line to a parent?  One obvious case is a 
commit that only adds lines - the parent's line set is the empty set.  I 
think you catch that earlier in your code (parent_chunk_length == 0), 
though I guess there are other cases where the parent has lines, but 
they are below a certain match threshold?

For those cases where you can't find a match, I could imagine marking 
them as unblamable.  The purpose of 'unblamable' in my patchset was to 
signal to not bother looking up further commit info for a final blame 
entry.  It was largely so that the user (me!) wouldn't see a commit 
blamed when I explicitly told git to not tell me about that commit. 
Both approaches sound fine though.

Another question was whether or not you wanted this to apply per commit 
or for an entire blame session.  It looks like your current code applies 
it overall, and not for a specific commit.  I'd much prefer it to be 
per-commit, though maybe the -F is just for showing it working in the RFC.

The first thing that comes to mind for me is to plug your fuzzy logic 
into my patch set.  Basically, in my commit near "These go to the 
parent", we do your line-by-line matching.  We might not need my 'delta' 
check anymore, which was basically the 'if' part of my for loop (in text 
above).  But maybe we need more info for your function.

That way, we'd get the per-commit control of when we ignore a commit 
from my patch, and we'd get the fuzzy-blaming brains of your patch, such 
that we try to find the right *part* of the parent to blame.

Anyway, let me know what your thoughts are.  We could try to change that 
part of my code so that it just calls some function that tells it where 
in the parent to blame, and otherwise marks unblamable.  Then your patch 
can replace the logic?

Barret


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/1] Fuzzy blame
  2019-03-25 16:04     ` Barret Rhoden
@ 2019-03-25 23:21       ` Michael Platings
  2019-03-25 23:35         ` Jeff King
  2019-04-03 15:25         ` Barret Rhoden
  0 siblings, 2 replies; 14+ messages in thread
From: Michael Platings @ 2019-03-25 23:21 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: Junio C Hamano, git, Jeff King, Stefan Beller, Jeff Smith,
	René Scharfe

Hi Barret,

> I work on a project that needs a major reformatting, and one thing
> delaying me was the lack of an ability to ignore commits during blame.

I think we understand each other well then - I'm working on a plan to
change the variable naming rule in LLVM, and naturally other
developers aren't keen on making git blame less useful.

> One question I have is under which circumstances do you find that you
> cannot match a suspect/child line to a parent?  One obvious case is a
> commit that only adds lines - the parent's line set is the empty set.  I
> think you catch that earlier in your code (parent_chunk_length == 0),
> though I guess there are other cases where the parent has lines, but
> they are below a certain match threshold?

Yes, exactly. The threshold is currently 0 i.e. a single matching
bigram is all that's required for two lines to be considered matching,
but in future the threshold could be configurable in the same manner
as -M & -C options.
In the t8020-blame-fuzzy.sh test script in my patch, where it's
expected that a line will be attributed to the "ignored" commit you'll
see "Final". So far this is just "}" lines.

> Another question was whether or not you wanted this to apply per commit
> or for an entire blame session.  It looks like your current code applies
> it overall, and not for a specific commit.

This is a really interesting question for this feature. Initially I
just wanted to be able to say "Hey, Git, ignore this revision please."
But then Git says "OK, but how exactly? I can ignore whitespace and I
can detect moves & copies so do you want me to do those?" And then I'm
thinking, actually yes -M10 would be great because I know that this
revision also reordered a bunch of #includes and I still want people
to be able to see where they came from. However other sets of options
might work better for other changes.

On looking at the problem this way it seems that fuzzy matching
belongs in the same class as -w, -M & -C. As these options apply for
an entire blame session, it would be consistent to allow applying the
fuzzy matching likewise. As a bonus, having the ability to apply the
-F option for the entire blame session seems quite nice for other use
cases.

> I'd much prefer it to be per-commit

Yes, we definitely need a way to say "fuzzy match this commit only"
otherwise you lose the ability to detect small but significant changes
in other commits.
I haven't explored this fully, but I'm thinking that the revision
options file might look something like this:

# Just use the defaults, whatever they may be
6e8063eee1d30bc80c7802e94ed0caa8949c6323
# This commit changed loads of tabs to spaces
35ee755a8c43bcb3c2786522d423f006c23d32df -w
# This commit reordered lots of short lines
c5b679e14b761a7bfd6ae93cfffbf66e3c4e25a5 -M5
# This commit copied some stuff and changed CamelCase to snake_case
58b9cd43da695ee339b7679cf0c9f31e1f8ef67f -w -C15 -F

For the command-line, potentially we could make -w/-M/-C/-F specified
after --ignore-rev apply to only that revision e.g.:
git blame myfile --ignore-rev 35ee755a8c43bcb3c2786522d423f006c23d32df -M -F

But as I say, I haven't explored this fully.

> For those cases where you can't find a match, I could imagine marking
> them as unblamable.  The purpose of 'unblamable' in my patchset was to
> signal to not bother looking up further commit info for a final blame
> entry.  It was largely so that the user (me!) wouldn't see a commit
> blamed when I explicitly told git to not tell me about that commit.

I can see how the unblameable concept makes sense for the heuristic
you have right now. However, once you have a heuristic that can tell
you with a high degree of certainty that a line really did come from a
commit that you're merely not interested in, then I suggest that it's
better to just point at that commit.

> Both approaches sound fine though.

:)

> The first thing that comes to mind for me is to plug your fuzzy logic
> into my patch set.

Please do! It should be easy to pluck fuzzy_find_matching_lines() and
its dependencies out. Just to set your expectations, I have not yet
optimised it and it is highly wasteful right now both in terms of time
and memory.

> But maybe we need more info for your function.

The extra info needed is the parent & child file content, plus the
indices in the strings of where new lines start. This information is
already calculated in the course of doing the diff but it looks like a
fair amount of plumbing work will be needed to make the information
available to the chunk callback. That was my reason for initially
plonking the fuzzy matching in a separate pass.

Thanks,
-Michael

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/1] Fuzzy blame
  2019-03-25 23:21       ` Michael Platings
@ 2019-03-25 23:35         ` Jeff King
  2019-03-26  3:07           ` Jacob Keller
  2019-04-03 15:25         ` Barret Rhoden
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff King @ 2019-03-25 23:35 UTC (permalink / raw)
  To: Michael Platings
  Cc: Barret Rhoden, Junio C Hamano, git, Stefan Beller, Jeff Smith,
	René Scharfe

On Mon, Mar 25, 2019 at 11:21:19PM +0000, Michael Platings wrote:

> > I work on a project that needs a major reformatting, and one thing
> > delaying me was the lack of an ability to ignore commits during blame.
> 
> I think we understand each other well then - I'm working on a plan to
> change the variable naming rule in LLVM, and naturally other
> developers aren't keen on making git blame less useful.

This is sort of a tangent to the thread, but have you looked into tools
that provide an interactive "re-blame from the parent" operation? I use
tig for this.  Quite often my blame turns up on some boring line
(whitespace fixing, minor tweaking of a function interface, etc), and
then I want to keep digging on the "same" line, as counted by line count
(but it's OK if it's off by one or two lines, since I'm looking at a
blame of the whole file).

Obviously this isn't as automated as saying "ignore commit X, it's just
variable renaming". But it also eliminates the need to a priori figure
out all such X that affect the lines you care about. You get an answer,
your human mind says "nope, that's not interesting", and you press a
button to dig further.

I think there's room for both solutions to co-exist, but just suggesting
you to try out the one that's already been implemented if you haven't. ;)

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/1] Fuzzy blame
  2019-03-25 23:35         ` Jeff King
@ 2019-03-26  3:07           ` Jacob Keller
  2019-03-26 20:26             ` Michael Platings
  0 siblings, 1 reply; 14+ messages in thread
From: Jacob Keller @ 2019-03-26  3:07 UTC (permalink / raw)
  To: Jeff King
  Cc: Michael Platings, Barret Rhoden, Junio C Hamano, Git mailing list,
	Stefan Beller, Jeff Smith, René Scharfe

On Mon, Mar 25, 2019 at 4:37 PM Jeff King <peff@peff.net> wrote:
>
> On Mon, Mar 25, 2019 at 11:21:19PM +0000, Michael Platings wrote:
>
> > > I work on a project that needs a major reformatting, and one thing
> > > delaying me was the lack of an ability to ignore commits during blame.
> >
> > I think we understand each other well then - I'm working on a plan to
> > change the variable naming rule in LLVM, and naturally other
> > developers aren't keen on making git blame less useful.
>
> This is sort of a tangent to the thread, but have you looked into tools
> that provide an interactive "re-blame from the parent" operation? I use
> tig for this.  Quite often my blame turns up on some boring line
> (whitespace fixing, minor tweaking of a function interface, etc), and
> then I want to keep digging on the "same" line, as counted by line count
> (but it's OK if it's off by one or two lines, since I'm looking at a
> blame of the whole file).
>

+1 for the usefulness of this approach. It really helps figure things
out in a way that doesn't require me to track all "uninteresting"
commits, and also works when I *am* trying to find that uninteresting
commit too.

> Obviously this isn't as automated as saying "ignore commit X, it's just
> variable renaming". But it also eliminates the need to a priori figure
> out all such X that affect the lines you care about. You get an answer,
> your human mind says "nope, that's not interesting", and you press a
> button to dig further.
>
> I think there's room for both solutions to co-exist, but just suggesting
> you to try out the one that's already been implemented if you haven't. ;)
>
> -Peff

That's also my sentiment.

Thanks,
Jake

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/1] Fuzzy blame
  2019-03-26  3:07           ` Jacob Keller
@ 2019-03-26 20:26             ` Michael Platings
  2019-03-27  6:36               ` Duy Nguyen
  0 siblings, 1 reply; 14+ messages in thread
From: Michael Platings @ 2019-03-26 20:26 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Jeff King, Barret Rhoden, Junio C Hamano, Git mailing list,
	Stefan Beller, Jeff Smith, René Scharfe

> Obviously this isn't as automated as saying "ignore commit X, it's just
> variable renaming". But it also eliminates the need to a priori figure
> out all such X that affect the lines you care about. You get an answer,
> your human mind says "nope, that's not interesting", and you press a
> button to dig further.

Hi Peff, for the use case you describe of someone stumbling across a
renaming commit, your approach is clearly better. However the use case
Barret & I are facing is of deliberately choosing to make a large
refactoring/renaming commit, and not wanting everyone else working on
the project to have to press that extra button every time they run git
blame.

I think it's really important that we make this dead easy for everyone
to use. The ultimate in ease of use would be for git blame to
automatically pick up ignore settings without the user having to even
know that it's happening. But that breaks the principle of least
astonishment. The next simplest thing I can think of is to add a
configuration option blame.ignoreRevs which would have the same
effect, except the user has to opt in.
Barret has implemented blame.ignoreRevsFile, but I think the world
will be a more consistent and happier place if we dictate the location
that the revisions are loaded from, in the same way as .gitignore.
Deciding what that location should be is one of those bikeshed
arguments which is perhaps why Barret dodged it :)

-Michael

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/1] Fuzzy blame
  2019-03-26 20:26             ` Michael Platings
@ 2019-03-27  6:36               ` Duy Nguyen
  2019-03-27  8:26                 ` Michael Platings
  0 siblings, 1 reply; 14+ messages in thread
From: Duy Nguyen @ 2019-03-27  6:36 UTC (permalink / raw)
  To: Michael Platings
  Cc: Jacob Keller, Jeff King, Barret Rhoden, Junio C Hamano,
	Git mailing list, Stefan Beller, Jeff Smith, René Scharfe

On Wed, Mar 27, 2019 at 3:27 AM Michael Platings <michael@platin.gs> wrote:
> I think it's really important that we make this dead easy for everyone
> to use. The ultimate in ease of use would be for git blame to
> automatically pick up ignore settings without the user having to even
> know that it's happening. But that breaks the principle of least
> astonishment. The next simplest thing I can think of is to add a
> configuration option blame.ignoreRevs which would have the same
> effect, except the user has to opt in.
> Barret has implemented blame.ignoreRevsFile, but I think the world
> will be a more consistent and happier place if we dictate the location
> that the revisions are loaded from, in the same way as .gitignore.
> Deciding what that location should be is one of those bikeshed
> arguments which is perhaps why Barret dodged it :)

And bikeshedding. Another good place to keep these revs is git-notes,
which probably could result in faster lookups too and can be made
visible in git-log. But that's in addition to --ignoreRevsFile, not
replacing it.
-- 
Duy

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/1] Fuzzy blame
  2019-03-27  6:36               ` Duy Nguyen
@ 2019-03-27  8:26                 ` Michael Platings
  2019-03-27  9:02                   ` Duy Nguyen
  0 siblings, 1 reply; 14+ messages in thread
From: Michael Platings @ 2019-03-27  8:26 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Jacob Keller, Jeff King, Barret Rhoden, Junio C Hamano,
	Git mailing list, Stefan Beller, Jeff Smith, René Scharfe

> Another good place to keep these revs is git-notes,
> which probably could result in faster lookups too and can be made
> visible in git-log.

Oh wow, I really like this. A major concern I had about the revisions
file was that you don't know what a revision ID will be until it's
upstream. If you can specify *in the commit message itself* what
options should apply to git blame for that revision then that problem
is solved. And if you change your mind later, or want to ignore a
pre-existing revision then git-notes solves that problem.

So I'm thinking you just have a commit message like this:
"
Make all function names snake_case
git-blame-ignore: fuzzy
"
And users who have blame.ignoreRevs set will have the -F/--fuzzy
option applied to that commit.

> But that's in addition to --ignoreRevsFile, not replacing it.

I disagree. ignoreRevsFile has the major problem that the file will
need updating every time you rebase a commit to be ignored, and you'll
need to remember to edit it for cherry picks. Let's not have that
option as I think it will add unhelpful complexity.

-Michael

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/1] Fuzzy blame
  2019-03-27  8:26                 ` Michael Platings
@ 2019-03-27  9:02                   ` Duy Nguyen
  0 siblings, 0 replies; 14+ messages in thread
From: Duy Nguyen @ 2019-03-27  9:02 UTC (permalink / raw)
  To: Michael Platings
  Cc: Jacob Keller, Jeff King, Barret Rhoden, Junio C Hamano,
	Git mailing list, Stefan Beller, Jeff Smith, René Scharfe

On Wed, Mar 27, 2019 at 3:26 PM Michael Platings <michael@platin.gs> wrote:
>
> > Another good place to keep these revs is git-notes,
> > which probably could result in faster lookups too and can be made
> > visible in git-log.
>
> Oh wow, I really like this. A major concern I had about the revisions
> file was that you don't know what a revision ID will be until it's
> upstream. If you can specify *in the commit message itself* what
> options should apply to git blame for that revision then that problem
> is solved. And if you change your mind later, or want to ignore a
> pre-existing revision then git-notes solves that problem.
>
> So I'm thinking you just have a commit message like this:
> "
> Make all function names snake_case
> git-blame-ignore: fuzzy
> "
> And users who have blame.ignoreRevs set will have the -F/--fuzzy
> option applied to that commit.

Yeah some trailer in the commit itself is also good if you know in
advance it should be treated differently. I think we have
git-interpret-trailers to help extract these info.

> > But that's in addition to --ignoreRevsFile, not replacing it.
>
> I disagree. ignoreRevsFile has the major problem that the file will
> need updating every time you rebase a commit to be ignored, and you'll
> need to remember to edit it for cherry picks. Let's not have that
> option as I think it will add unhelpful complexity.

OK I was just trying to say I did not object any current suggestions
(because I didn't know much in the first place). I'll just leave this
for other people to discuss :)
-- 
Duy

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/1] Fuzzy blame
  2019-03-25 23:21       ` Michael Platings
  2019-03-25 23:35         ` Jeff King
@ 2019-04-03 15:25         ` Barret Rhoden
  2019-04-03 21:49           ` Michael Platings
  1 sibling, 1 reply; 14+ messages in thread
From: Barret Rhoden @ 2019-04-03 15:25 UTC (permalink / raw)
  To: Michael Platings
  Cc: Junio C Hamano, git, Jeff King, Stefan Beller, Jeff Smith,
	René Scharfe

Hi -

On 3/25/19 7:21 PM, Michael Platings wrote:
>> The first thing that comes to mind for me is to plug your fuzzy logic
>> into my patch set.
> Please do! It should be easy to pluck fuzzy_find_matching_lines() and
> its dependencies out. Just to set your expectations, I have not yet
> optimised it and it is highly wasteful right now both in terms of time
> and memory.

I edited my patch set to allow changing the heuristic.  I also made a 
commit that uses your fingerprinting code to match target lines to 
parent lines.  I'll send it all out in another email and CC you.  The 
last commit is still a work in progress.

Regarding stuff like the name of the ignore file, in my first version, I 
went with whatever git hyper-blame does.  That was shot down, rightly 
so, I think.  With a git-config setting, you can name the file whatever 
you want, or add multiple files.  With my current patchset, you can 
disable the file too with --ignore-revs-file="".

As far as using notes or per-commit info, that might be nice, though 
it's not a huge burden to have a separate commit - you can wait til 
after things get merged (so we have the final object name (hash)) and 
it's not hugely burdensome.  But I get that's just my opinion.  =)

Barret




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/1] Fuzzy blame
  2019-04-03 15:25         ` Barret Rhoden
@ 2019-04-03 21:49           ` Michael Platings
  0 siblings, 0 replies; 14+ messages in thread
From: Michael Platings @ 2019-04-03 21:49 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: Junio C Hamano, Git mailing list, Jeff King, Stefan Beller,
	Jeff Smith, René Scharfe

Thanks Barret.
I've cooled off on the git-notes idea since learning that notes
branches have to be pulled explicitly. And different people may have
different ideas about which types of commits they want to ignore, so
not predefining the name of the ignore file(s) does seem like the best
option, even if it's not perfect. And maybe having --fuzzy as a
separate option would be nice, maybe not - either way it can wait. In
short, I've come full circle and wish to do what I can to assist your
proposal (+ fuzzy matching).
I've rewritten the fuzzy matching function so it's now much faster and
more modest in its memory use. I hope to share the patch tomorrow.
Following that, I'll do what I can to assist with reviewing your
patches.
Cheers,
-Michael

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-04-03 21:49 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-24 23:50 [RFC PATCH 0/1] Fuzzy blame michael
2019-03-24 23:50 ` [RFC PATCH 1/1] " michael
2019-03-25  2:39 ` [RFC PATCH 0/1] " Junio C Hamano
2019-03-25  9:32   ` Michael Platings
2019-03-25 16:04     ` Barret Rhoden
2019-03-25 23:21       ` Michael Platings
2019-03-25 23:35         ` Jeff King
2019-03-26  3:07           ` Jacob Keller
2019-03-26 20:26             ` Michael Platings
2019-03-27  6:36               ` Duy Nguyen
2019-03-27  8:26                 ` Michael Platings
2019-03-27  9:02                   ` Duy Nguyen
2019-04-03 15:25         ` Barret Rhoden
2019-04-03 21:49           ` Michael Platings

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).