git@vger.kernel.org mailing list mirror (one of many)
* Fwd: git diff with “--word-diff-regex” extremely slow compared to “--word-diff”?
From: Matthieu S @ 2016-11-18 23:40 UTC (permalink / raw)
  To: git

Hi

When I give git diff a custom regex via --word-diff-regex= instead of
using the default --word-diff (which splits words on whitespace), git
slows down considerably, and I don't understand why the difference is
so large.

(This question was asked on Stack Overflow, but after two months
without an answer, I'm asking it here instead. Post:
http://stackoverflow.com/questions/39027864/git-diff-with-word-diff-regex-extremely-slow-compared-to-word-diff).

Example (sorry, Unix-specific code): create two one-line files and
two 200000-line files:

echo aaa,bbb ,12,12,15 >file1.txt
echo aaa,bbb ,12,12,16 >file2.txt

awk '{for(i=0;i<200000;i++)print}' file1.txt > file1BIG.txt
awk '{for(i=0;i<200000;i++)print}' file2.txt > file2BIG.txt

The default --word-diff has no trouble with the BIG files (I cannot see
any time difference):

git diff --word-diff file1.txt file2.txt
git diff --word-diff file1BIG.txt file2BIG.txt

Now use the --word-diff-regex= argument instead (with the regex from this post:
http://stackoverflow.com/questions/10482773/also-use-comma-as-a-word-separator-in-diff):

git diff --word-diff-regex='[^[:space:],]' file1.txt file2.txt
git diff --word-diff-regex='[^[:space:],]' file1BIG.txt file2BIG.txt
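
To make the comparison repeatable, the commands above can be wrapped in a small script. This is only a sketch: the N variable, the run helper, and the use of --no-index (so that it works outside a repository) are assumptions added for illustration, not part of the original report. N defaults to a small value so the script finishes quickly; the report used 200000.

```shell
#!/bin/sh
# Sketch: coarse wall-clock comparison of --word-diff and --word-diff-regex.
# N is a hypothetical knob; the report used 200000 lines.
N=${N:-2000}

dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

echo 'aaa,bbb ,12,12,15' >file1.txt
echo 'aaa,bbb ,12,12,16' >file2.txt
awk -v n="$N" '{for(i=0;i<n;i++)print}' file1.txt >file1BIG.txt
awk -v n="$N" '{for(i=0;i<n;i++)print}' file2.txt >file2BIG.txt

# run LABEL [git diff options...]: print elapsed whole seconds for one diff.
run() {
    label=$1
    shift
    start=$(date +%s)
    # --no-index compares two plain files; git exits 1 when they differ.
    git --no-pager diff --no-index "$@" file1BIG.txt file2BIG.txt >/dev/null
    end=$(date +%s)
    echo "$label: $((end - start))s"
}

run 'default --word-diff' --word-diff
run 'regex, whitespace only' --word-diff-regex='[^[:space:]]'
run 'regex, whitespace and comma' --word-diff-regex='[^[:space:],]'
```

Running it as `N=200000 sh script.sh` (script name hypothetical) should reproduce the sizes used in this report; the whole-second resolution of date is coarse, but the gap described here is large enough to show.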

Why is the speed so different if one uses --word-diff instead of
--word-diff-regex= ? Is it just because my expression is (slightly)
more complex than the default one (split on commas as well as
whitespace) ? Or is it that the default word-diff is implemented
differently/more efficiently? How can I avoid this slowdown?

Thanks!!

Matthieu


PS: using git 2.7.4 on Ubuntu 16.04


* Re: Fwd: git diff with “--word-diff-regex” extremely slow compared to “--word-diff”?
From: Jeff King @ 2016-11-20 20:17 UTC (permalink / raw)
  To: Matthieu S; +Cc: git

On Fri, Nov 18, 2016 at 03:40:22PM -0800, Matthieu S wrote:

> Why is the speed so different if one uses --word-diff instead of
> --word-diff-regex= ? Is it just because my expression is (slightly)
> more complex than the default one (split on commas as well as
> whitespace) ? Or is it that the default word-diff is implemented
> differently/more efficiently? How can I avoid this slowdown?

I think it's probably both.

See diff.c:find_word_boundaries(). If there's no regex, we use a simple
loop over isspace() to find the boundaries. I don't recall anybody
measuring the performance before, but I'm not surprised to hear that
matching a regex is slower.

If I look at the output of "perf", though, it looks like we also spend a
lot more time in xdl_clean_mmatch(). Which isn't surprising. Your regex
treats commas as boundaries, which is going to generate a lot more
matches for this particular data set (though the output is the same, I
think, because of the nature of the change).

I would have expected "--word-diff-regex=[^[:space:]]" to be faster than
your regex, though, and it does not seem to be.
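
One way to look at this is to count how many words each pattern yields on a sample line. The sketch below uses grep -o as a stand-in for git's matcher, which is an assumption: it relies on the documented rule that every non-overlapping match of the regex is one word, and the count helper is made up for illustration. Note that a bracket expression without a trailing "+" matches one character at a time, so "[^[:space:]]" is not equivalent to splitting on whitespace; "[^[:space:]]+" is closer to the default.

```shell
#!/bin/sh
# Sketch: count the "words" each pattern yields on one sample line.
# grep -o prints every non-overlapping match on its own line (GNU/BSD grep).
line='aaa,bbb ,12,12,15'

# count PATTERN: number of matches of PATTERN in $line.
count() {
    printf '%s\n' "$line" | grep -oE "$1" | wc -l | tr -d ' '
}

ws_run=$(count '[^[:space:]]+')      # runs of non-whitespace
ws_char=$(count '[^[:space:]]')      # no '+': one match per character
comma_run=$(count '[^[:space:],]+')  # runs between whitespace and commas
comma_char=$(count '[^[:space:],]')  # the pattern used in this thread

echo "[^[:space:]]+   -> $ws_run words per line"
echo "[^[:space:]]    -> $ws_char words per line"
echo "[^[:space:],]+  -> $comma_run words per line"
echo "[^[:space:],]   -> $comma_char words per line"
```

On this line the counts are 2, 16, 5 and 12. If git tokenizes the same way, the "+"-less whitespace pattern feeds more words to the diff than the comma pattern does, which might be part of why it is not faster.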

-Peff


* Re: Fwd: git diff with “--word-diff-regex” extremely slow compared to “--word-diff”?
From: Matthieu S @ 2016-11-22 18:08 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Thanks Jeff for the answer!

You are right, I should have compared with the same regex, and indeed,
--word-diff-regex=[^[:space:]] is also much slower than just
--word-diff, although they do the same job. Maybe this is a hint that
the --word-diff-regex code could be made faster?

I have only a limited understanding of git, but is git diff computing the
diff for the whole file, and then showing only the first 10 values in the
terminal? In some cases, that seems to be a lot of unnecessary
computation! Is there any way to ask git-diff to compare only,
say, the first 100 lines? Or to compute only when necessary, i.e.
when "enter" is pressed in the console?

Thanks!

Matthieu



* Re: Fwd: git diff with “--word-diff-regex” extremely slow compared to “--word-diff”?
From: Jeff King @ 2016-11-22 19:26 UTC (permalink / raw)
  To: Matthieu S; +Cc: git

On Tue, Nov 22, 2016 at 10:08:33AM -0800, Matthieu S wrote:

> You are right, I should have compared with the same regex, and indeed,
> --word-diff-regex=[^[:space:]] is also much slower than just
> --word-diff, although they do the same job. Maybe this is a hint that
> the --word-diff-regex code could be made faster?

Maybe. If most of the time is spent in the regex engine, there may not
be much we can do. But perhaps there is something in the surrounding
code that can be improved. Looking at find_word_boundaries() (and this
is the first time I've done so), it does look like we regex-match the
whole buffer, and only then find the end-of-line. Now that we have
regexec_buf(), it might be possible to constrain the regex buffer more.

> I have only a limited understanding of git, but is git diff computing the
> diff for the whole file, and then showing only the first 10 values in the
> terminal? In some cases, that seems to be a lot of unnecessary
> computation! Is there any way to ask git-diff to compare only,
> say, the first 100 lines? Or to compute only when necessary, i.e.
> when "enter" is pressed in the console?

Git always computes the diff for the whole file. The paging is done by
an external program. So no, there's no easy way to do it incrementally
as the user interacts with the pager, as the pager does not communicate
back to git in any way. However, git should generally be streaming out
results (and the pager showing them) as they're computed, so in an ideal
world you get output immediately, and then the pager buffers the rest of
it while you're reading the first page.

Git does have to look at the whole file in order to do the initial
line-by-line diff, so it would be hard to make that incremental. It
could do the word-coloring for each hunk incrementally, though. I would
have assumed that is already how it is done, though I didn't dig into
it.
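
Since git always diffs whole files, one possible workaround for "only the first 100 lines" is to truncate both inputs before handing them to git. The following is a rough sketch under stated assumptions: the limit variable, the temporary directory, and the generated stand-in files are all made up for illustration, and --no-index is used so it runs outside a repository. The result can differ from the matching part of a full diff when changes further down would have shifted the alignment.

```shell
#!/bin/sh
# Sketch: word-diff only the first $limit lines of each file.
limit=100
tmp=$(mktemp -d) || exit 1

# Stand-ins for file1BIG.txt and file2BIG.txt from the thread.
awk 'BEGIN{for(i=0;i<200000;i++)print "aaa,bbb ,12,12,15"}' >"$tmp/file1BIG.txt"
awk 'BEGIN{for(i=0;i<200000;i++)print "aaa,bbb ,12,12,16"}' >"$tmp/file2BIG.txt"

# Truncate first, so git never sees more than $limit lines per side.
head -n "$limit" "$tmp/file1BIG.txt" >"$tmp/head1.txt"
head -n "$limit" "$tmp/file2BIG.txt" >"$tmp/head2.txt"

# Show the start of the word diff of the truncated copies.
git --no-pager diff --no-index --word-diff-regex='[^[:space:],]' \
    "$tmp/head1.txt" "$tmp/head2.txt" | head -n 20
```

The cost of the regex matching then scales with the truncated size rather than with the full file.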

-Peff

