* Fwd: git diff with “--word-diff-regex” extremely slow compared to “--word-diff”?
From: Matthieu S @ 2016-11-18 23:40 UTC
To: git

Hi

When giving a custom regex to git diff --word-diff-regex= instead of
using the default --word-diff (which splits words on whitespace), git
slows down very considerably... I don't understand why there is such a
speed difference.

(This question was asked on Stack Overflow, but after two months without
an answer, I'm asking it here instead. Post:
http://stackoverflow.com/questions/39027864/git-diff-with-word-diff-regex-extremely-slow-compared-to-word-diff)

Example (sorry, UNIX-specific code):

Create two one-line files, and two 200000-line files:

    echo aaa,bbb ,12,12,15 >file1.txt
    echo aaa,bbb ,12,12,16 >file2.txt
    awk '{for(i=0;i<200000;i++)print}' file1.txt > file1BIG.txt
    awk '{for(i=0;i<200000;i++)print}' file2.txt > file2BIG.txt

Default --word-diff has no issues with the BIG files (I cannot see any
time difference):

    git diff --word-diff file1.txt file2.txt
    git diff --word-diff file1BIG.txt file2BIG.txt

Now use the --word-diff-regex= argument instead (with the regex from
this post:
http://stackoverflow.com/questions/10482773/also-use-comma-as-a-word-separator-in-diff):

    git diff --word-diff-regex=[^[:space:],] file1.txt file2.txt
    git diff --word-diff-regex=[^[:space:],] file1BIG.txt file2BIG.txt

Why is the speed so different if one uses --word-diff instead of
--word-diff-regex= ? Is it just because my expression is (slightly)
more complex than the default one (split on commas as well as
whitespace)? Or is it that the default word-diff is implemented
differently/more efficiently? How can I overcome this slowdown?

Thanks!!
Matthieu

PS: using git 2.7.4 on Ubuntu 16.04

* Re: Fwd: git diff with “--word-diff-regex” extremely slow compared to “--word-diff”?
From: Jeff King @ 2016-11-20 20:17 UTC
To: Matthieu S
Cc: git

On Fri, Nov 18, 2016 at 03:40:22PM -0800, Matthieu S wrote:

> Why is the speed so different if one uses --word-diff instead of
> --word-diff-regex= ? Is it just because my expression is (slightly)
> more complex than the default one (split on commas as well as
> whitespace)? Or is it that the default word-diff is implemented
> differently/more efficiently? How can I overcome this slowdown?

I think it's probably both.

See diff.c:find_word_boundaries(). If there's no regex, we use a simple
loop over isspace() to find the boundaries. I don't recall anybody
measuring the performance before, but I'm not surprised to hear that
matching a regex is slower.

If I look at the output of "perf", though, it looks like we also spend a
lot more time in xdl_clean_mmatch(). Which isn't surprising. Your regex
treats commas as boundaries, which is going to generate a lot more
matches for this particular data set (though the output is the same, I
think, because of the nature of the change).

I would have expected "--word-diff-regex=[^[:space:]]" to be faster than
your regex, though, and it does not seem to be.

-Peff
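
One way to see why the regex run hands the diff machinery more work is to count how many words each mode produces for a single input line. The sketch below uses grep and wc as stand-ins and is not git's own code; it relies on the documented rule that every non-overlapping match of the --word-diff-regex pattern counts as one word. The variant with a trailing "+" is an assumption added for comparison and was not timed in this thread.

```shell
# One line of the test data from the original report.
line='aaa,bbb ,12,12,15'

# Default --word-diff: a word is a maximal run of non-whitespace.
default_words=$(printf '%s\n' "$line" | wc -w | tr -d '[:space:]')

# --word-diff-regex=[^[:space:],] : each match is one word, and without
# a "+" quantifier each match is a single character.
regex_words=$(printf '%s\n' "$line" | grep -o '[^[:space:],]' | wc -l | tr -d '[:space:]')

# Assumed variant with "+": each run of non-space, non-comma characters
# becomes one word.
regex_plus_words=$(printf '%s\n' "$line" | grep -oE '[^[:space:],]+' | wc -l | tr -d '[:space:]')

# prints: default=2 regex=12 regex_plus=5
echo "default=$default_words regex=$regex_words regex_plus=$regex_plus_words"
```

Scaled to the 200000-line files, that is roughly 2.4 million words for the regex run against 400000 for the default, which may account for part of the extra time seen in xdl_clean_mmatch(). Whether the "+" variant is noticeably faster in git itself would need to be measured.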

* Re: Fwd: git diff with “--word-diff-regex” extremely slow compared to “--word-diff”?
From: Matthieu S @ 2016-11-22 18:08 UTC
To: Jeff King
Cc: git

Thanks Jeff for the answer!

You are right, I should have compared with the same regex, and indeed,
--word-diff-regex=[^[:space:]] is also much slower than just
--word-diff, although they do the same job. Maybe this is a hint that
the --word-diff-regex code could be made faster?

I have only a small understanding of git, but is git diff computing the
diff for the whole file, and then showing only the first 10 values in
the terminal? In some cases, that seems to be a lot of unnecessary
computation! Is there any possibility to ask git-diff to only compare,
say, the first 100 lines? Or to compute only when necessary, i.e. when
"enter" is pressed in the console?

Thanks!
Matthieu

2016-11-20 12:17 GMT-08:00 Jeff King <peff@peff.net>:
> On Fri, Nov 18, 2016 at 03:40:22PM -0800, Matthieu S wrote:
>
>> Why is the speed so different if one uses --word-diff instead of
>> --word-diff-regex= ? Is it just because my expression is (slightly)
>> more complex than the default one (split on commas as well as
>> whitespace)? Or is it that the default word-diff is implemented
>> differently/more efficiently? How can I overcome this slowdown?
>
> I think it's probably both.
>
> See diff.c:find_word_boundaries(). If there's no regex, we use a simple
> loop over isspace() to find the boundaries. I don't recall anybody
> measuring the performance before, but I'm not surprised to hear that
> matching a regex is slower.
>
> If I look at the output of "perf", though, it looks like we also spend a
> lot more time in xdl_clean_mmatch(). Which isn't surprising. Your regex
> treats commas as boundaries, which is going to generate a lot more
> matches for this particular data set (though the output is the same, I
> think, because of the nature of the change).
>
> I would have expected "--word-diff-regex=[^[:space:]]" to be faster than
> your regex, though, and it does not seem to be.
>
> -Peff

* Re: Fwd: git diff with “--word-diff-regex” extremely slow compared to “--word-diff”?
From: Jeff King @ 2016-11-22 19:26 UTC
To: Matthieu S
Cc: git

On Tue, Nov 22, 2016 at 10:08:33AM -0800, Matthieu S wrote:

> You are right, I should have compared with the same regex, and indeed,
> --word-diff-regex=[^[:space:]] is also much slower than just
> --word-diff, although they do the same job. Maybe this is a hint that
> the --word-diff-regex code could be made faster?

Maybe. If most of the time is spent in the regex engine, there may not
be much we can do. But perhaps there is something in the surrounding
code that can be improved.

Looking at find_word_boundaries() (and this is the first time I've done
so), it does look like we regex-match the whole buffer, and only then
find the end-of-line. Now that we have regexec_buf(), it might be
possible to constrain the regex buffer more.

> I have only a small understanding of git, but is git diff computing the
> diff for the whole file, and then showing only the first 10 values in
> the terminal? In some cases, that seems to be a lot of unnecessary
> computation! Is there any possibility to ask git-diff to only compare,
> say, the first 100 lines? Or to compute only when necessary, i.e. when
> "enter" is pressed in the console?

Git always computes the diff for the whole file. The paging is done by
an external program. So no, there's no easy way to do it incrementally
as the user interacts with the pager, as the pager does not communicate
back to git in any way.

However, git should generally be streaming out results (and the pager
showing them) as they're computed, so in an ideal world you get output
immediately, and then the pager buffers the rest of it while you're
reading the first page.

Git does have to look at the whole file in order to do the initial
line-by-line diff, so it would be hard to make that incremental. It
could do the word-coloring for each hunk incrementally, though. I would
have assumed that is already how it is done, though I didn't dig into
it.

-Peff
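
Since git diff has no option for comparing only the first N lines, one possible workaround is to truncate both inputs before handing them to git. The sketch below is not something proposed in the thread: first_lines and word_diff_head are hypothetical helper names, and the approach assumes two plain files compared with --no-index rather than tracked content. The diff of two truncated files is also not guaranteed to match the start of the full diff when a change straddles the cut.

```shell
# first_lines N OLD NEW (hypothetical helper): copy the first N lines of
# each file into a fresh temporary directory and print that directory.
first_lines() {
    dir=$(mktemp -d) || return 1
    head -n "$1" "$2" > "$dir/old"
    head -n "$1" "$3" > "$dir/new"
    printf '%s\n' "$dir"
}

# word_diff_head N OLD NEW (hypothetical helper): word-diff only the
# first N lines of two files, using the regex from the original report.
word_diff_head() {
    dir=$(first_lines "$1" "$2" "$3") || return 1
    # --no-index compares two paths on disk without needing a repository.
    # It implies --exit-code, so status 1 just means "the files differ".
    git diff --no-index --word-diff-regex='[^[:space:],]' "$dir/old" "$dir/new"
    status=$?
    rm -rf "$dir"
    return "$status"
}
```

With the files from the original report, this would be invoked as:

    word_diff_head 100 file1BIG.txt file2BIG.txt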