git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Barret Rhoden <brho@google.com>
To: michael@platin.gs, git@vger.kernel.org
Cc: "Jeff King" <peff@peff.net>,
	"Stefan Beller" <stefanbeller@gmail.com>,
	"Jeff Smith" <whydoubt@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>,
	"René Scharfe" <l.s.r@web.de>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"David Kastrup" <dak@gnu.org>,
	"Johannes Schindelin" <Johannes.Schindelin@gmx.de>
Subject: Re: [PATCH v5 6/6] RFC blame: use a fingerprint heuristic to match ignored lines
Date: Tue, 9 Apr 2019 15:10:46 -0400	[thread overview]
Message-ID: <0e53bcba-e27c-228d-a36d-6d3575605d7a@google.com> (raw)
In-Reply-To: <2747b3b2-0447-0d03-dc7e-c7fa460a303b@google.com>

On 4/9/19 11:56 AM, Barret Rhoden wrote:
> Anyway, being able to look outside the current blame_chunk would help in 
> those scenarios.  Specifically, I'm talking about letting blame_chunk() 
> point anywhere in the parent.  Right now, it can only look in the 
> parent's part of the chunk passed to blame_chunk, which can be 
> relatively small.

I hacked up the ability to look outside of a diff chunk.  The change to 
the heuristic-independent part of the code was very minor, both in code 
and in performance.

The change to make the fingerprinting algorithm from my RFC patch look 
at the entire parent was pretty minor too - I can also cache the 
fingerprints.  The main drawback is performance, but Michael's new 
fingerprinting code alleviates this.

Here's a quick analysis.  When run on a 1000 line C file, with large 
changes from an ignored commit, after the file has been paged in (so, 
run twice):

not ignoring at all:
-------------
real	0m0.062s
user	0m0.042s
sys	0m0.021s

scan only in the parent chunk:
----------------------------
real	0m0.097s
user	0m0.085s
sys	0m0.012s

scan parent chunk, scan entire parent on failure:
-------------------------
real	0m1.773s
user	0m1.752s
sys	0m0.021s

scan the entire parent:
-----------------------
real	0m3.049s
user	0m3.024s
sys	0m0.024s

Scanning the parent chunk first helped a lot.  Scanning the entire 
parent is O(nr_parent_lines * nr_lines_changed).  In my test file, 
that's about 1000 * 600.

It still takes a little while even when checking the parent chunk first. 
  Let's call that one the 'smaller scan.'

Caching the fingerprints (meaning, calculate once and attach to the 
blame_origin) didn't help much here.  It's actually worse, possibly due 
to fingerprinting more than I needed to.

smaller scan, without caching:
----------------
real	0m1.651s
user	0m1.626s
sys	0m0.025s

smaller scan, with caching:
----------------
real	0m1.774s
user	0m1.753s
sys	0m0.021s


Let's try Michael's new fingerprinting code (get_fingerprint() and 
fingerprint_similarity())

smaller scan, caching:
-------------------
real	0m0.240s
user	0m0.215s
sys	0m0.025s

smaller scan, no caching:
----------------------
real	0m0.295s
user	0m0.266s
sys	0m0.029s

full parent scan, caching:
--------------------------
real	0m0.377s
user	0m0.356s
sys	0m0.021s

full parent scan, no caching:
-----------------------------
real	0m0.458s
user	0m0.430s
sys	0m0.028s

And for completenesss,

scan only in the parent chunk:
------------------------------
real	0m0.072s
user	0m0.048s
sys	0m0.024s


So first off, Michael's new fingerprinting is much better.  Caching the 
fingerprints also helps.  Overall, it's fast enough now that I don't 
notice the delay.

I'll reroll the patchset with the ability to find lines in the entire 
parent, and see what you all think.

Barret




  reply	other threads:[~2019-04-09 19:10 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <[PATCH v5 6/6] RFC blame: use a fingerprint heuristic to match ignored lines>
2019-04-07 21:46 ` [PATCH v5 6/6] RFC blame: use a fingerprint heuristic to match ignored lines michael
2019-04-07 21:52   ` David Kastrup
2019-04-08  9:48     ` Michael Platings
2019-04-08 16:03       ` Barret Rhoden
2019-04-09 15:38         ` Junio C Hamano
2019-04-09 15:56   ` Barret Rhoden
2019-04-09 19:10     ` Barret Rhoden [this message]
2019-04-03 16:02 [PATCH v5 0/6] blame: add the ability to ignore commits Barret Rhoden
2019-04-03 16:02 ` [PATCH v5 6/6] RFC blame: use a fingerprint heuristic to match ignored lines Barret Rhoden
2019-04-04 16:37   ` SZEDER Gábor

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0e53bcba-e27c-228d-a36d-6d3575605d7a@google.com \
    --to=brho@google.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=avarab@gmail.com \
    --cc=dak@gnu.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=michael@platin.gs \
    --cc=peff@peff.net \
    --cc=stefanbeller@gmail.com \
    --cc=whydoubt@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).