git@vger.kernel.org mailing list mirror (one of many)
From: michael@platin.gs
To: git@vger.kernel.org
Cc: "Jeff King" <peff@peff.net>,
	"Stefan Beller" <stefanbeller@gmail.com>,
	"Jeff Smith" <whydoubt@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>,
	"René Scharfe" <l.s.r@web.de>,
	"Michael Platings" <michael@platin.gs>
Subject: [RFC PATCH 0/1] Fuzzy blame
Date: Sun, 24 Mar 2019 23:50:19 +0000
Message-ID: <20190324235020.49706-1-michael@platin.gs>

From: Michael Platings <michael@platin.gs>

Hi Git devs,

Some of you may be familiar with the git-hyper-blame tool [1]. It's "useful if
you have a commit that makes sweeping changes that are unlikely to be what you
are looking for in a blame, such as mass reformatting or renaming."

git-hyper-blame is useful, but (a) it's not convenient to install; (b) it's
missing functionality available in regular git blame; (c) its method of
matching lines between chunks is too simplistic for many use cases; and
(d) it's not Git, so it doesn't integrate well with tools that expect Git,
e.g. vim plugins. Therefore I'm hoping to add similar, and hopefully superior,
functionality to Git itself. I have a very rough patch, so I'd like to get your
thoughts on the general approach, particularly in terms of its user-visible
behaviour.

My initial idea was to lift the design directly from git-hyper-blame. However,
the approach of picking single revisions to ignore doesn't sit well with the
-w, -M & -C options, which have a similar intent but apply to all revisions.

I'd like to get your thoughts on whether we could allow applying the -M or -w
options to specific revisions. For example, imagine it was agreed that all
the #includes in a project should be reordered. In that case, it would be useful
to be able to specify that the -M option should be used for blames on that
revision specifically, so that in future when someone wants to know why
a #include was added they don't have to run git blame twice to find out.

Options that are specific to a particular revision could be stored in a
".gitrevisions" file or similar.

If the principle of allowing blame options to be applied per-revision is
agreeable then I'd like to add a -F/--fuzzy option, to sit alongside -w, -M & -C.

I've implemented a prototype "fuzzy" option, patch attached.
The option operates at the level of diff chunks. For each line in the "after"
half of the chunk, it uses a heuristic to choose which line in the "before" half
of the chunk matches best. The heuristic I'm using at the moment is matching
"bigrams", as described in [2]. The initial pass typically gives reasonable
results, but can jumble up the lines. Because the content should stay in the
same order in the reformatting/renaming use case, it's worth going to extra
effort to avoid jumbling lines. Therefore, after the initial pass, the line that can be
matched with the most confidence is used to partition the chunk into halves
before and after it. The process is then repeated recursively on the halves
above and below the partition line.
I feel like a similar algorithm has probably already been invented in a better
form - if anyone knows of such a thing then please let me know!
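
To make the matching described above concrete, here is a minimal sketch in
Python. It is not the code from the attached patch (which is C, in blame.c);
the names bigrams(), dice() and fuzzy_match() are made up for illustration.
It assumes that "confidence" is simply the highest bigram similarity in the
current range, and that the partition line in the "before" half stays
available to both sub-ranges, so that one old line split across several new
lines can still be matched. Both of those are assumptions, not something the
description above pins down.

```python
from collections import Counter


def bigrams(line):
    """Multiset of adjacent character pairs, with whitespace removed."""
    s = "".join(line.split())
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a, b):
    """Sorensen-Dice coefficient of the two lines' bigram multisets (0..1)."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0
    # Counter & Counter keeps the minimum count of each shared bigram.
    return 2.0 * sum((x & y).values()) / total


def fuzzy_match(before, after):
    """For each line of `after`, return the index of the matching line in
    `before`, or None. Matches never go backwards, so lines are not jumbled."""
    result = [None] * len(after)

    def recurse(b_lo, b_hi, a_lo, a_hi):
        if b_lo >= b_hi or a_lo >= a_hi:
            return
        # Find the most confident pair in this range (assumption: the pair
        # with the highest similarity score).
        best_score, best_a, best_b = 0.0, None, None
        for i in range(a_lo, a_hi):
            for j in range(b_lo, b_hi):
                score = dice(after[i], before[j])
                if score > best_score:
                    best_score, best_a, best_b = score, i, j
        if best_a is None:
            return  # nothing in this range resembles anything else
        result[best_a] = best_b
        # Partition around the matched line and repeat on each side. The
        # "before" pivot is kept in both sides (assumption, see above).
        recurse(b_lo, best_b + 1, a_lo, best_a)
        recurse(best_b, b_hi, best_a + 1, a_hi)

    recurse(0, len(before), 0, len(after))
    return result
```

For example, fuzzy_match(["foo(a, b, c);"], ["foo(a,", "    b, c);"]) maps
both new lines to old line 0. The sketch rescans every pair at each level of
recursion, which is fine for illustration but would want caching of the
scores in real use.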

I look forward to hearing your thoughts.
Thanks,
-Michael


[1] https://commondatastorage.googleapis.com/chrome-infra-docs/flat/depot_tools/docs/html/git-hyper-blame.html
[2] https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient

Michael Platings (1):
  Add git blame --fuzzy option.

 blame.c                | 352 +++++++++++++++++++++++++++++++++++++++++++++++--
 blame.h                |   1 +
 builtin/blame.c        |   3 +
 t/t8020-blame-fuzzy.sh | 264 +++++++++++++++++++++++++++++++++++++
 4 files changed, 609 insertions(+), 11 deletions(-)
 create mode 100755 t/t8020-blame-fuzzy.sh

-- 
2.14.3 (Apple Git-98)

