* grep vs git grep performance?
@ 2017-10-26 15:02 Joe Perches
  2017-10-26 15:11 ` Han-Wen Nienhuys
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Joe Perches @ 2017-10-26 15:02 UTC
  To: git

Comparing a cache warm git grep vs command line grep
shows significant differences in cpu & wall clock.

Any ideas how to improve this?

$ time git grep "\bseq_.*%p\W" | wc -l
112

real    0m4.271s
user    0m15.520s
sys     0m0.395s

$ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
112

real    0m1.164s
user    0m0.847s
sys     0m0.314s

* Re: grep vs git grep performance?
  2017-10-26 15:02 grep vs git grep performance? Joe Perches
@ 2017-10-26 15:11 ` Han-Wen Nienhuys
  2017-10-26 15:55   ` Joe Perches
  2017-10-26 16:13 ` SZEDER Gábor
  2017-10-26 16:58 ` Stefan Beller
  2 siblings, 1 reply; 12+ messages in thread
From: Han-Wen Nienhuys @ 2017-10-26 15:11 UTC
  To: Joe Perches; +Cc: git

On Thu, Oct 26, 2017 at 5:02 PM, Joe Perches <joe@perches.com> wrote:
> Comparing a cache warm git grep vs command line grep
> shows significant differences in cpu & wall clock.
>
> Any ideas how to improve this?

Is git-grep multithreaded? IIRC, grep -r uses multiple threads. (Do
you have a 4-core machine?)

-- 
Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich
Registration court and number: Hamburg, HRB 86891
Registered office: Hamburg
Managing directors: Paul Manicle, Halimah DeLaine Prado

* Re: grep vs git grep performance?
  2017-10-26 15:11 ` Han-Wen Nienhuys
@ 2017-10-26 15:55   ` Joe Perches
  0 siblings, 0 replies; 12+ messages in thread
From: Joe Perches @ 2017-10-26 15:55 UTC
  To: Han-Wen Nienhuys; +Cc: git

On Thu, 2017-10-26 at 17:11 +0200, Han-Wen Nienhuys wrote:
> On Thu, Oct 26, 2017 at 5:02 PM, Joe Perches <joe@perches.com> wrote:
> > Comparing a cache warm git grep vs command line grep
> > shows significant differences in cpu & wall clock.
> >
> > Any ideas how to improve this?
>
> Is git-grep multithreaded?

Yes, at least according to the documentation

$ git grep --help
[]
       grep.threads
           Number of grep worker threads to use. If unset (or set to 0), 8
           threads are used by default (for now).

> IIRC, grep -r uses multiple threads. (Do
> you have a 4-core machine?)

I have a 2 core machine with hyperthreading

$ cat /proc/cpuinfo
[]
model name      : Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
stepping        : 3
microcode       : 0xba
cpu MHz         : 2400.000
cache size      : 3072 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2

* Re: grep vs git grep performance?
  2017-10-26 15:02 grep vs git grep performance? Joe Perches
  2017-10-26 15:11 ` Han-Wen Nienhuys
@ 2017-10-26 16:13 ` SZEDER Gábor
  2017-10-26 16:20   ` Joe Perches
  2017-10-26 16:58 ` Stefan Beller
  2 siblings, 1 reply; 12+ messages in thread
From: SZEDER Gábor @ 2017-10-26 16:13 UTC
  To: Joe Perches; +Cc: SZEDER Gábor, git

> Comparing a cache warm git grep vs command line grep
> shows significant differences in cpu & wall clock.
>
> Any ideas how to improve this?
>
> $ time git grep "\bseq_.*%p\W" | wc -l
> 112
>
> real    0m4.271s
> user    0m15.520s
> sys     0m0.395s
>
> $ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
> 112
>
> real    0m1.164s
> user    0m0.847s
> sys     0m0.314s

Note that this "regular" grep is limited to *.c and *.h files, while
the above git grep invocation isn't and has to look at all tracked
files. How does

  git grep "\bseq_.*%p\W" "*.[ch]"

fare?

* Re: grep vs git grep performance?
  2017-10-26 16:13 ` SZEDER Gábor
@ 2017-10-26 16:20   ` Joe Perches
  0 siblings, 0 replies; 12+ messages in thread
From: Joe Perches @ 2017-10-26 16:20 UTC
  To: SZEDER Gábor; +Cc: git

On Thu, 2017-10-26 at 18:13 +0200, SZEDER Gábor wrote:
> > Comparing a cache warm git grep vs command line grep
> > shows significant differences in cpu & wall clock.
> >
> > Any ideas how to improve this?
> >
> > $ time git grep "\bseq_.*%p\W" | wc -l
> > 112
> >
> > real    0m4.271s
> > user    0m15.520s
> > sys     0m0.395s
> >
> > $ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
> > 112
> >
> > real    0m1.164s
> > user    0m0.847s
> > sys     0m0.314s
>
> Note that this "regular" grep is limited to *.c and *.h files, while
> the above git grep invocation isn't and has to look at all tracked
> files. How does
>
>   git grep "\bseq_.*%p\W" "*.[ch]"
>
> fare?

Same-ish

$ time git grep "\bseq_.*%p\W" -- "*.[ch]" | wc -l
112

real    0m4.225s
user    0m14.485s
sys     0m0.413s

* Re: grep vs git grep performance?
  2017-10-26 15:02 grep vs git grep performance? Joe Perches
  2017-10-26 15:11 ` Han-Wen Nienhuys
  2017-10-26 16:13 ` SZEDER Gábor
@ 2017-10-26 16:58 ` Stefan Beller
  2017-10-26 17:41   ` Joe Perches
  2 siblings, 1 reply; 12+ messages in thread
From: Stefan Beller @ 2017-10-26 16:58 UTC
  To: Joe Perches, Ævar Arnfjörð Bjarmason; +Cc: git

+ Avar who knows a thing about pcre (I assume the regex compilation
has impact on grep speed)

On Thu, Oct 26, 2017 at 8:02 AM, Joe Perches <joe@perches.com> wrote:
> Comparing a cache warm git grep vs command line grep
> shows significant differences in cpu & wall clock.
>
> Any ideas how to improve this?
>
> $ time git grep "\bseq_.*%p\W" | wc -l
> 112
>
> real    0m4.271s
> user    0m15.520s
> sys     0m0.395s
>
> $ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
> 112
>
> real    0m1.164s
> user    0m0.847s
> sys     0m0.314s
>

I wonder how much is algorithmic advantage vs coding/micro
optimization that we can do.

* Re: grep vs git grep performance?
  2017-10-26 16:58 ` Stefan Beller
@ 2017-10-26 17:41   ` Joe Perches
  2017-10-26 17:45     ` Stefan Beller
  0 siblings, 1 reply; 12+ messages in thread
From: Joe Perches @ 2017-10-26 17:41 UTC
  To: Stefan Beller, Ævar Arnfjörð Bjarmason; +Cc: git

On Thu, 2017-10-26 at 09:58 -0700, Stefan Beller wrote:
> + Avar who knows a thing about pcre (I assume the regex compilation
> has impact on grep speed)
>
> On Thu, Oct 26, 2017 at 8:02 AM, Joe Perches <joe@perches.com> wrote:
> > Comparing a cache warm git grep vs command line grep
> > shows significant differences in cpu & wall clock.
> >
> > Any ideas how to improve this?
> >
> > $ time git grep "\bseq_.*%p\W" | wc -l
> > 112
> >
> > real    0m4.271s
> > user    0m15.520s
> > sys     0m0.395s
> >
> > $ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
> > 112
> >
> > real    0m1.164s
> > user    0m0.847s
> > sys     0m0.314s
> >
>
> I wonder how much is algorithmic advantage vs coding/micro
> optimization that we can do.

As do I. I presume this is libpcre related.

For instance, git grep performance is better than grep for:

$ time git grep -w "seq_printf" -- "*.[ch]" | wc -l
8609

real    0m0.301s
user    0m0.548s
sys     0m0.372s

$ time grep -w -r --include=*.[ch] "seq_printf" * | wc -l
8609

real    0m0.706s
user    0m0.396s
sys     0m0.309s

* Re: grep vs git grep performance?
  2017-10-26 17:41   ` Joe Perches
@ 2017-10-26 17:45     ` Stefan Beller
  2017-10-27 17:22       ` Joe Perches
  0 siblings, 1 reply; 12+ messages in thread
From: Stefan Beller @ 2017-10-26 17:45 UTC
  To: Joe Perches; +Cc: Ævar Arnfjörð Bjarmason, git

On Thu, Oct 26, 2017 at 10:41 AM, Joe Perches <joe@perches.com> wrote:
> On Thu, 2017-10-26 at 09:58 -0700, Stefan Beller wrote:
>> + Avar who knows a thing about pcre (I assume the regex compilation
>> has impact on grep speed)
>>
>> On Thu, Oct 26, 2017 at 8:02 AM, Joe Perches <joe@perches.com> wrote:
>> > Comparing a cache warm git grep vs command line grep
>> > shows significant differences in cpu & wall clock.
>> >
>> > Any ideas how to improve this?
>> >
>> > $ time git grep "\bseq_.*%p\W" | wc -l
>> > 112
>> >
>> > real    0m4.271s
>> > user    0m15.520s
>> > sys     0m0.395s
>> >
>> > $ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
>> > 112
>> >
>> > real    0m1.164s
>> > user    0m0.847s
>> > sys     0m0.314s
>> >
>>
>> I wonder how much is algorithmic advantage vs coding/micro
>> optimization that we can do.
>
> As do I. I presume this is libpcre related.
>
> For instance, git grep performance is better than grep for:
>
> $ time git grep -w "seq_printf" -- "*.[ch]" | wc -l
> 8609
>
> real    0m0.301s
> user    0m0.548s
> sys     0m0.372s
>
> $ time grep -w -r --include=*.[ch] "seq_printf" * | wc -l
> 8609
>
> real    0m0.706s
> user    0m0.396s
> sys     0m0.309s
>

One important piece of information is what version of Git you are running,

$ git tag --contains origin/ab/pcre-v2
v2.14.0
...

(and the version of pcre, see the numbers)
https://git.kernel.org/pub/scm/git/git.git/commit/?id=94da9193a6eb8f1085d611c04ff8bbb4f5ae1e0a

* Re: grep vs git grep performance?
  2017-10-26 17:45     ` Stefan Beller
@ 2017-10-27 17:22       ` Joe Perches
  2017-10-27 22:11         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 12+ messages in thread
From: Joe Perches @ 2017-10-27 17:22 UTC
  To: Stefan Beller; +Cc: Ævar Arnfjörð Bjarmason, git

On Thu, 2017-10-26 at 10:45 -0700, Stefan Beller wrote:
> On Thu, Oct 26, 2017 at 10:41 AM, Joe Perches <joe@perches.com> wrote:
> > On Thu, 2017-10-26 at 09:58 -0700, Stefan Beller wrote:
> > > + Avar who knows a thing about pcre (I assume the regex compilation
> > > has impact on grep speed)
> > >
> > > On Thu, Oct 26, 2017 at 8:02 AM, Joe Perches <joe@perches.com> wrote:
> > > > Comparing a cache warm git grep vs command line grep
> > > > shows significant differences in cpu & wall clock.
> > > >
> > > > Any ideas how to improve this?
> > > >
> > > > $ time git grep "\bseq_.*%p\W" | wc -l
> > > > 112
> > > >
> > > > real    0m4.271s
> > > > user    0m15.520s
> > > > sys     0m0.395s
> > > >
> > > > $ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
> > > > 112
> > > >
> > > > real    0m1.164s
> > > > user    0m0.847s
> > > > sys     0m0.314s
> > > >
> > >
> > > I wonder how much is algorithmic advantage vs coding/micro
> > > optimization that we can do.
> >
> > As do I. I presume this is libpcre related.
> >
> > For instance, git grep performance is better than grep for:
> >
> > $ time git grep -w "seq_printf" -- "*.[ch]" | wc -l
> > 8609
> >
> > real    0m0.301s
> > user    0m0.548s
> > sys     0m0.372s
> >
> > $ time grep -w -r --include=*.[ch] "seq_printf" * | wc -l
> > 8609
> >
> > real    0m0.706s
> > user    0m0.396s
> > sys     0m0.309s
> >
>
> One important piece of information is what version of Git you are running,
>
>
> $ git tag --contains origin/ab/pcre-v2
> v2.14.0

v2.10

> ...
>
> (and the version of pcre, see the numbers)
> https://git.kernel.org/pub/scm/git/git.git/commit/?id=94da9193a6eb8f1085d611c04ff8bbb4f5ae1e0a

I definitely didn't have that one.

I recompiled git latest (with USE_LIBPCRE2) and reran.

Here are the results

$ git --version
git version 2.15.0.rc2.48.g4e40fb3

$ time git grep -P "\bseq_.*%p\W" -- "*.[ch]" | wc -l
112

real    0m0.437s
user    0m1.008s
sys     0m0.381s

So, git grep performance has already been
quite successfully improved.

Thanks.

* Re: grep vs git grep performance?
  2017-10-27 17:22       ` Joe Perches
@ 2017-10-27 22:11         ` Ævar Arnfjörð Bjarmason
  2017-10-27 23:22           ` Joe Perches
  0 siblings, 1 reply; 12+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-10-27 22:11 UTC
  To: Joe Perches; +Cc: Stefan Beller, git

On Fri, Oct 27 2017, Joe Perches jotted:

> On Thu, 2017-10-26 at 10:45 -0700, Stefan Beller wrote:
>> On Thu, Oct 26, 2017 at 10:41 AM, Joe Perches <joe@perches.com> wrote:
>> > On Thu, 2017-10-26 at 09:58 -0700, Stefan Beller wrote:
>> > > + Avar who knows a thing about pcre (I assume the regex compilation
>> > > has impact on grep speed)
>> > >
>> > > On Thu, Oct 26, 2017 at 8:02 AM, Joe Perches <joe@perches.com> wrote:
>> > > > Comparing a cache warm git grep vs command line grep
>> > > > shows significant differences in cpu & wall clock.
>> > > >
>> > > > Any ideas how to improve this?
>> > > >
>> > > > $ time git grep "\bseq_.*%p\W" | wc -l
>> > > > 112
>> > > >
>> > > > real    0m4.271s
>> > > > user    0m15.520s
>> > > > sys     0m0.395s
>> > > >
>> > > > $ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
>> > > > 112
>> > > >
>> > > > real    0m1.164s
>> > > > user    0m0.847s
>> > > > sys     0m0.314s
>> > > >
>> > >
>> > > I wonder how much is algorithmic advantage vs coding/micro
>> > > optimization that we can do.
>> >
>> > As do I. I presume this is libpcre related.
>> >
>> > For instance, git grep performance is better than grep for:
>> >
>> > $ time git grep -w "seq_printf" -- "*.[ch]" | wc -l
>> > 8609
>> >
>> > real    0m0.301s
>> > user    0m0.548s
>> > sys     0m0.372s
>> >
>> > $ time grep -w -r --include=*.[ch] "seq_printf" * | wc -l
>> > 8609
>> >
>> > real    0m0.706s
>> > user    0m0.396s
>> > sys     0m0.309s
>> >
>>
>> One important piece of information is what version of Git you are running,
>>
>>
>> $ git tag --contains origin/ab/pcre-v2
>> v2.14.0
>
> v2.10
>
>> ...
>>
>> (and the version of pcre, see the numbers)
>> https://git.kernel.org/pub/scm/git/git.git/commit/?id=94da9193a6eb8f1085d611c04ff8bbb4f5ae1e0a
>
> I definitely didn't have that one.
>
> I recompiled git latest (with USE_LIBPCRE2) and reran.
>
> Here are the results
>
> $ git --version
> git version 2.15.0.rc2.48.g4e40fb3
>
> $ time git grep -P "\bseq_.*%p\W" -- "*.[ch]" | wc -l
> 112
>
> real    0m0.437s
> user    0m1.008s
> sys     0m0.381s
>
> So, git grep performance has already been
> quite successfully improved.

...and I have WIP patches to use the PCRE engine for patterns without -P
which I intend to start sending soon after the next release.

* Re: grep vs git grep performance?
  2017-10-27 22:11         ` Ævar Arnfjörð Bjarmason
@ 2017-10-27 23:22           ` Joe Perches
  2017-10-28  7:45             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 12+ messages in thread
From: Joe Perches @ 2017-10-27 23:22 UTC
  To: Ævar Arnfjörð Bjarmason; +Cc: Stefan Beller, git

On Sat, 2017-10-28 at 00:11 +0200, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Oct 27 2017, Joe Perches jotted:
[]
> > git grep performance has already been
> > quite successfully improved.
>
> ...and I have WIP patches to use the PCRE engine for patterns without -P
> which I intend to start sending soon after the next release.

One addition that would be quite nice would be
an option to have regex matches span input lines.

grep v2.5.4 was the last grep version that allowed
this and I keep it around just for that.

ie:

$ cat hello.txt
Hello
World
$ grep -P "Hello\s*World" hello.txt
$ grep-2.5.4 -P "Hello\s*World" hello.txt
Hello
World

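The difference Joe describes, between matching one line at a time and letting a match span lines, can be illustrated with a small Python sketch. This is not code from git or GNU grep; it only mimics the two behaviours using Python's `re` module on the hello.txt contents shown above.

```python
import re

# Contents of the hello.txt example from the message above.
text = "Hello\nWorld\n"
pattern = re.compile(r"Hello\s*World")

# Line-oriented matching: every line is searched on its own, so a match
# that has to cross a newline can never be found.
per_line_hits = [line for line in text.splitlines() if pattern.search(line)]

# Whole-buffer matching: the file is searched as a single string, so \s*
# is free to consume the newline between the two words.
whole_match = pattern.search(text)

print(per_line_hits)
print(whole_match.group(0) if whole_match else None)
```

The first approach only ever needs one line in memory at a time; the second needs the whole buffer, which is the memory trade-off raised in the reply that follows.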
* Re: grep vs git grep performance?
  2017-10-27 23:22           ` Joe Perches
@ 2017-10-28  7:45             ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 12+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-10-28 7:45 UTC
  To: Joe Perches; +Cc: Stefan Beller, git

On Fri, Oct 27 2017, Joe Perches jotted:

> On Sat, 2017-10-28 at 00:11 +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Oct 27 2017, Joe Perches jotted:
> []
>> > git grep performance has already been
>> > quite successfully improved.
>>
>> ...and I have WIP patches to use the PCRE engine for patterns without -P
>> which I intend to start sending soon after the next release.
>
> One addition that would be quite nice would be
> an option to have regex matches span input lines.
>
> grep v2.5.4 was the last grep version that allowed
> this and I keep it around just for that.
>
> ie:
>
> $ cat hello.txt
> Hello
> World
> $ grep -P "Hello\s*World" hello.txt
> $ grep-2.5.4 -P "Hello\s*World" hello.txt
> Hello
> World

I'm unable to build 2.5.4 and can't find anything relevant in the
release notes at a quick glance around that time saying that this would
be removed, if you can still build it I'd be interested to see what this
bisects down to in grep.git.

But aside from that, a feature like this constrains the regex
implementation a lot since it's going to need to either match the entire
file as we'd need to do with PCRE, or we'd need to really deeply embed
the core logic of the regex matcher into our grep implementation.

I.e. in this case a more optimal implementation would start by parsing
this regex down:

    ((EXACT "Hello")
     (STAR (POSIXU "\s"))
     (EXACT "World"))

Then when you open the file you can start searching for the fixed-string
"Hello", if you don't find that you're done, if you do you can forward
look-ahead for the fixed "World", and only if you find that do you need
to match the more complex part in the middle.

Whereas our API for the internal regex matchers now is that we find the
boundaries of newlines and batch-match a bunch of lines with a match()
function that takes a string, and if that matches we drill down to what
specific line matches.

Which is not to say that this can't be done without a potentially
unacceptable memory trade-off (i.e. matching the entire file in all
cases), the PCRE2 engine in particular includes some I/O abstractions
that we're not using but could (but I haven't looked into it).

But right now the entire internal API we have is constrained by catering
to the lowest common denominator (a regexec that takes a char*), so
supporting more fancy multi-line matching features can be a PITA since
we'd need to maintain both codepaths. Or we could make PCRE a hard
dependency, which given the performance advantages I'm increasingly
willing to make the case for.

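The fixed-string prefilter outlined in the message above (find "Hello" first, look ahead for "World", and only then run the full regex) might look roughly like the following Python sketch. It is an illustration under stated assumptions, not git's or PCRE's implementation: the helper name `multiline_search` is hypothetical, and the caller is assumed to already know the literal prefix and suffix that every match must contain, which a real engine would derive from the parsed pattern.

```python
import re

def multiline_search(data, first, last, full_pattern):
    """Return the first match of full_pattern in data, or None.

    Hypothetical helper: `first` and `last` are fixed strings that every
    match is assumed to start and end with.
    """
    regex = re.compile(full_pattern)
    start = data.find(first)
    while start != -1:
        # Cheap look-ahead: without the trailing literal somewhere after
        # this candidate, neither it nor any later candidate can match.
        if data.find(last, start + len(first)) == -1:
            return None
        # Only now pay for the full regex, anchored at the candidate.
        m = regex.match(data, start)
        if m:
            return m
        start = data.find(first, start + 1)
    return None

hit = multiline_search("say Hello\n  World!", "Hello", "World", r"Hello\s*World")
print(hit.span() if hit else None)
```

Note that this sketch still holds the whole buffer in memory, so it shows only the search strategy; it does not address the memory trade-off or the line-batching API discussed in the message.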
end of thread, other threads:[~2017-10-28  7:46 UTC | newest]

Thread overview: 12+ messages
2017-10-26 15:02 grep vs git grep performance? Joe Perches
2017-10-26 15:11 ` Han-Wen Nienhuys
2017-10-26 15:55   ` Joe Perches
2017-10-26 16:13 ` SZEDER Gábor
2017-10-26 16:20   ` Joe Perches
2017-10-26 16:58 ` Stefan Beller
2017-10-26 17:41   ` Joe Perches
2017-10-26 17:45     ` Stefan Beller
2017-10-27 17:22       ` Joe Perches
2017-10-27 22:11         ` Ævar Arnfjörð Bjarmason
2017-10-27 23:22           ` Joe Perches
2017-10-28  7:45             ` Ævar Arnfjörð Bjarmason