git@vger.kernel.org mailing list mirror (one of many)
* grep vs git grep performance?
@ 2017-10-26 15:02 Joe Perches
  2017-10-26 15:11 ` Han-Wen Nienhuys
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Joe Perches @ 2017-10-26 15:02 UTC (permalink / raw)
  To: git

Comparing a cache warm git grep vs command line grep
shows significant differences in cpu & wall clock.

Any ideas how to improve this?

$ time git grep "\bseq_.*%p\W" | wc -l
112

real	0m4.271s
user	0m15.520s
sys	0m0.395s

$ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
112

real	0m1.164s
user	0m0.847s
sys	0m0.314s




* Re: grep vs git grep performance?
  2017-10-26 15:02 grep vs git grep performance? Joe Perches
@ 2017-10-26 15:11 ` Han-Wen Nienhuys
  2017-10-26 15:55   ` Joe Perches
  2017-10-26 16:13 ` SZEDER Gábor
  2017-10-26 16:58 ` Stefan Beller
  2 siblings, 1 reply; 12+ messages in thread
From: Han-Wen Nienhuys @ 2017-10-26 15:11 UTC (permalink / raw)
  To: Joe Perches; +Cc: git

On Thu, Oct 26, 2017 at 5:02 PM, Joe Perches <joe@perches.com> wrote:
> Comparing a cache warm git grep vs command line grep
> shows significant differences in cpu & wall clock.
>
> Any ideas how to improve this?

Is git-grep multithreaded? IIRC, grep -r uses multiple threads. (Do
you have a 4-core machine?)



* Re: grep vs git grep performance?
  2017-10-26 15:11 ` Han-Wen Nienhuys
@ 2017-10-26 15:55   ` Joe Perches
  0 siblings, 0 replies; 12+ messages in thread
From: Joe Perches @ 2017-10-26 15:55 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: git

On Thu, 2017-10-26 at 17:11 +0200, Han-Wen Nienhuys wrote:
> On Thu, Oct 26, 2017 at 5:02 PM, Joe Perches <joe@perches.com> wrote:
> > Comparing a cache warm git grep vs command line grep
> > shows significant differences in cpu & wall clock.
> > 
> > Any ideas how to improve this?
> 
> Is git-grep multithreaded?

Yes, at least according to the documentation

$ git grep --help
[]
       grep.threads
           Number of grep worker threads to use. If unset (or set to 0), 8
           threads are used by default (for now).

> IIRC, grep -r uses multiple threads. (Do
> you have a 4-core machine?)

I have a 2 core machine with hyperthreading

$ cat /proc/cpuinfo
[]
model name	: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
stepping	: 3
microcode	: 0xba
cpu MHz		: 2400.000
cache size	: 3072 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2



* Re: grep vs git grep performance?
  2017-10-26 15:02 grep vs git grep performance? Joe Perches
  2017-10-26 15:11 ` Han-Wen Nienhuys
@ 2017-10-26 16:13 ` SZEDER Gábor
  2017-10-26 16:20   ` Joe Perches
  2017-10-26 16:58 ` Stefan Beller
  2 siblings, 1 reply; 12+ messages in thread
From: SZEDER Gábor @ 2017-10-26 16:13 UTC (permalink / raw)
  To: Joe Perches; +Cc: SZEDER Gábor, git

> Comparing a cache warm git grep vs command line grep
> shows significant differences in cpu & wall clock.
> 
> Any ideas how to improve this?
> 
> $ time git grep "\bseq_.*%p\W" | wc -l
> 112
> 
> real	0m4.271s
> user	0m15.520s
> sys	0m0.395s
> 
> $ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
> 112
> 
> real	0m1.164s
> user	0m0.847s
> sys	0m0.314s

Note that this "regular" grep is limited to *.c and *.h files, while
the above git grep invocation isn't and has to look at all tracked
files.  How does

  git grep "\bseq_.*%p\W" "*.[ch]"

fare?



* Re: grep vs git grep performance?
  2017-10-26 16:13 ` SZEDER Gábor
@ 2017-10-26 16:20   ` Joe Perches
  0 siblings, 0 replies; 12+ messages in thread
From: Joe Perches @ 2017-10-26 16:20 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git

On Thu, 2017-10-26 at 18:13 +0200, SZEDER Gábor wrote:
> > Comparing a cache warm git grep vs command line grep
> > shows significant differences in cpu & wall clock.
> > 
> > Any ideas how to improve this?
> > 
> > $ time git grep "\bseq_.*%p\W" | wc -l
> > 112
> > 
> > real	0m4.271s
> > user	0m15.520s
> > sys	0m0.395s
> > 
> > $ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
> > 112
> > 
> > real	0m1.164s
> > user	0m0.847s
> > sys	0m0.314s
> 
> Note that this "regular" grep is limited to *.c and *.h files, while
> the above git grep invocation isn't and has to look at all tracked
> files.  How does
> 
>   git grep "\bseq_.*%p\W" "*.[ch]"
> 
> fare?

Same-ish

$ time git grep "\bseq_.*%p\W" -- "*.[ch]" | wc -l
112

real	0m4.225s
user	0m14.485s
sys	0m0.413s


* Re: grep vs git grep performance?
  2017-10-26 15:02 grep vs git grep performance? Joe Perches
  2017-10-26 15:11 ` Han-Wen Nienhuys
  2017-10-26 16:13 ` SZEDER Gábor
@ 2017-10-26 16:58 ` Stefan Beller
  2017-10-26 17:41   ` Joe Perches
  2 siblings, 1 reply; 12+ messages in thread
From: Stefan Beller @ 2017-10-26 16:58 UTC (permalink / raw)
  To: Joe Perches, Ævar Arnfjörð Bjarmason; +Cc: git

+ Avar who knows a thing about pcre (I assume the regex compilation
has impact on grep speed)

On Thu, Oct 26, 2017 at 8:02 AM, Joe Perches <joe@perches.com> wrote:
> Comparing a cache warm git grep vs command line grep
> shows significant differences in cpu & wall clock.
>
> Any ideas how to improve this?
>
> $ time git grep "\bseq_.*%p\W" | wc -l
> 112
>
> real    0m4.271s
> user    0m15.520s
> sys     0m0.395s
>
> $ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
> 112
>
> real    0m1.164s
> user    0m0.847s
> sys     0m0.314s
>

I wonder how much is algorithmic advantage vs coding/micro
optimization that we can do.


* Re: grep vs git grep performance?
  2017-10-26 16:58 ` Stefan Beller
@ 2017-10-26 17:41   ` Joe Perches
  2017-10-26 17:45     ` Stefan Beller
  0 siblings, 1 reply; 12+ messages in thread
From: Joe Perches @ 2017-10-26 17:41 UTC (permalink / raw)
  To: Stefan Beller, Ævar Arnfjörð Bjarmason; +Cc: git

On Thu, 2017-10-26 at 09:58 -0700, Stefan Beller wrote:
> + Avar who knows a thing about pcre (I assume the regex compilation
> has impact on grep speed)
> 
> On Thu, Oct 26, 2017 at 8:02 AM, Joe Perches <joe@perches.com> wrote:
> > Comparing a cache warm git grep vs command line grep
> > shows significant differences in cpu & wall clock.
> > 
> > Any ideas how to improve this?
> > 
> > $ time git grep "\bseq_.*%p\W" | wc -l
> > 112
> > 
> > real    0m4.271s
> > user    0m15.520s
> > sys     0m0.395s
> > 
> > $ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
> > 112
> > 
> > real    0m1.164s
> > user    0m0.847s
> > sys     0m0.314s
> > 
> 
> I wonder how much is algorithmic advantage vs coding/micro
> optimization that we can do.

As do I.  I presume this is libpcre related.

For instance, git grep performance is better than grep for:

$ time git grep -w "seq_printf" -- "*.[ch]" | wc -l
8609

real	0m0.301s
user	0m0.548s
sys	0m0.372s

$ time grep -w -r --include=*.[ch] "seq_printf" * | wc -l
8609

real	0m0.706s
user	0m0.396s
sys	0m0.309s




* Re: grep vs git grep performance?
  2017-10-26 17:41   ` Joe Perches
@ 2017-10-26 17:45     ` Stefan Beller
  2017-10-27 17:22       ` Joe Perches
  0 siblings, 1 reply; 12+ messages in thread
From: Stefan Beller @ 2017-10-26 17:45 UTC (permalink / raw)
  To: Joe Perches; +Cc: Ævar Arnfjörð Bjarmason, git

On Thu, Oct 26, 2017 at 10:41 AM, Joe Perches <joe@perches.com> wrote:
> On Thu, 2017-10-26 at 09:58 -0700, Stefan Beller wrote:
>> + Avar who knows a thing about pcre (I assume the regex compilation
>> has impact on grep speed)
>>
>> On Thu, Oct 26, 2017 at 8:02 AM, Joe Perches <joe@perches.com> wrote:
>> > Comparing a cache warm git grep vs command line grep
>> > shows significant differences in cpu & wall clock.
>> >
>> > Any ideas how to improve this?
>> >
>> > $ time git grep "\bseq_.*%p\W" | wc -l
>> > 112
>> >
>> > real    0m4.271s
>> > user    0m15.520s
>> > sys     0m0.395s
>> >
>> > $ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
>> > 112
>> >
>> > real    0m1.164s
>> > user    0m0.847s
>> > sys     0m0.314s
>> >
>>
>> I wonder how much is algorithmic advantage vs coding/micro
>> optimization that we can do.
>
> As do I.  I presume this is libpcre related.
>
> For instance, git grep performance is better than grep for:
>
> $ time git grep -w "seq_printf" -- "*.[ch]" | wc -l
> 8609
>
> real    0m0.301s
> user    0m0.548s
> sys     0m0.372s
>
> $ time grep -w -r --include=*.[ch] "seq_printf" * | wc -l
> 8609
>
> real    0m0.706s
> user    0m0.396s
> sys     0m0.309s
>

One important piece of information is what version of Git you are running,


$ git tag --contains origin/ab/pcre-v2
v2.14.0
...

(and the version of pcre, see the numbers)
https://git.kernel.org/pub/scm/git/git.git/commit/?id=94da9193a6eb8f1085d611c04ff8bbb4f5ae1e0a


* Re: grep vs git grep performance?
  2017-10-26 17:45     ` Stefan Beller
@ 2017-10-27 17:22       ` Joe Perches
  2017-10-27 22:11         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 12+ messages in thread
From: Joe Perches @ 2017-10-27 17:22 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Ævar Arnfjörð Bjarmason, git

On Thu, 2017-10-26 at 10:45 -0700, Stefan Beller wrote:
> On Thu, Oct 26, 2017 at 10:41 AM, Joe Perches <joe@perches.com> wrote:
> > On Thu, 2017-10-26 at 09:58 -0700, Stefan Beller wrote:
> > > + Avar who knows a thing about pcre (I assume the regex compilation
> > > has impact on grep speed)
> > > 
> > > On Thu, Oct 26, 2017 at 8:02 AM, Joe Perches <joe@perches.com> wrote:
> > > > Comparing a cache warm git grep vs command line grep
> > > > shows significant differences in cpu & wall clock.
> > > > 
> > > > Any ideas how to improve this?
> > > > 
> > > > $ time git grep "\bseq_.*%p\W" | wc -l
> > > > 112
> > > > 
> > > > real    0m4.271s
> > > > user    0m15.520s
> > > > sys     0m0.395s
> > > > 
> > > > $ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
> > > > 112
> > > > 
> > > > real    0m1.164s
> > > > user    0m0.847s
> > > > sys     0m0.314s
> > > > 
> > > 
> > > I wonder how much is algorithmic advantage vs coding/micro
> > > optimization that we can do.
> > 
> > As do I.  I presume this is libpcre related.
> > 
> > For instance, git grep performance is better than grep for:
> > 
> > $ time git grep -w "seq_printf" -- "*.[ch]" | wc -l
> > 8609
> > 
> > real    0m0.301s
> > user    0m0.548s
> > sys     0m0.372s
> > 
> > $ time grep -w -r --include=*.[ch] "seq_printf" * | wc -l
> > 8609
> > 
> > real    0m0.706s
> > user    0m0.396s
> > sys     0m0.309s
> > 
> 
> One important piece of information is what version of Git you are running,
> 
> 
> $ git tag --contains origin/ab/pcre-v2
> v2.14.0

v2.10

> ...
> 
> (and the version of pcre, see the numbers)
> https://git.kernel.org/pub/scm/git/git.git/commit/?id=94da9193a6eb8f1085d611c04ff8bbb4f5ae1e0a

I definitely didn't have that one.

I recompiled git latest (with USE_LIBPCRE2) and reran.

Here are the results

$ git --version
git version 2.15.0.rc2.48.g4e40fb3

$ time git grep -P "\bseq_.*%p\W" -- "*.[ch]" | wc -l
112

real	0m0.437s
user	0m1.008s
sys	0m0.381s

So, git grep performance has already been
quite successfully improved.

Thanks.



* Re: grep vs git grep performance?
  2017-10-27 17:22       ` Joe Perches
@ 2017-10-27 22:11         ` Ævar Arnfjörð Bjarmason
  2017-10-27 23:22           ` Joe Perches
  0 siblings, 1 reply; 12+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-10-27 22:11 UTC (permalink / raw)
  To: Joe Perches; +Cc: Stefan Beller, git


On Fri, Oct 27 2017, Joe Perches jotted:

> On Thu, 2017-10-26 at 10:45 -0700, Stefan Beller wrote:
>> On Thu, Oct 26, 2017 at 10:41 AM, Joe Perches <joe@perches.com> wrote:
>> > On Thu, 2017-10-26 at 09:58 -0700, Stefan Beller wrote:
>> > > + Avar who knows a thing about pcre (I assume the regex compilation
>> > > has impact on grep speed)
>> > >
>> > > On Thu, Oct 26, 2017 at 8:02 AM, Joe Perches <joe@perches.com> wrote:
>> > > > Comparing a cache warm git grep vs command line grep
>> > > > shows significant differences in cpu & wall clock.
>> > > >
>> > > > Any ideas how to improve this?
>> > > >
>> > > > $ time git grep "\bseq_.*%p\W" | wc -l
>> > > > 112
>> > > >
>> > > > real    0m4.271s
>> > > > user    0m15.520s
>> > > > sys     0m0.395s
>> > > >
>> > > > $ time grep -r --include=*.[ch] "\bseq_.*%p\W" * | wc -l
>> > > > 112
>> > > >
>> > > > real    0m1.164s
>> > > > user    0m0.847s
>> > > > sys     0m0.314s
>> > > >
>> > >
>> > > I wonder how much is algorithmic advantage vs coding/micro
>> > > optimization that we can do.
>> >
>> > As do I.  I presume this is libpcre related.
>> >
>> > For instance, git grep performance is better than grep for:
>> >
>> > $ time git grep -w "seq_printf" -- "*.[ch]" | wc -l
>> > 8609
>> >
>> > real    0m0.301s
>> > user    0m0.548s
>> > sys     0m0.372s
>> >
>> > $ time grep -w -r --include=*.[ch] "seq_printf" * | wc -l
>> > 8609
>> >
>> > real    0m0.706s
>> > user    0m0.396s
>> > sys     0m0.309s
>> >
>>
>> One important piece of information is what version of Git you are running,
>>
>>
>> $ git tag --contains origin/ab/pcre-v2
>> v2.14.0
>
> v2.10
>
>> ...
>>
>> (and the version of pcre, see the numbers)
>> https://git.kernel.org/pub/scm/git/git.git/commit/?id=94da9193a6eb8f1085d611c04ff8bbb4f5ae1e0a
>
> I definitely didn't have that one.
>
> I recompiled git latest (with USE_LIBPCRE2) and reran.
>
> Here are the results
>
> $ git --version
> git version 2.15.0.rc2.48.g4e40fb3
>
> $ time git grep -P "\bseq_.*%p\W" -- "*.[ch]" | wc -l
> 112
>
> real	0m0.437s
> user	0m1.008s
> sys	0m0.381s
>
> So, git grep performance has already been
> quite successfully improved.

...and I have WIP patches to use the PCRE engine for patterns without -P
which I intend to start sending soon after the next release.


* Re: grep vs git grep performance?
  2017-10-27 22:11         ` Ævar Arnfjörð Bjarmason
@ 2017-10-27 23:22           ` Joe Perches
  2017-10-28  7:45             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 12+ messages in thread
From: Joe Perches @ 2017-10-27 23:22 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Stefan Beller, git

On Sat, 2017-10-28 at 00:11 +0200, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Oct 27 2017, Joe Perches jotted:
[]
> > git grep performance has already been
> > quite successfully improved.
> 
> ...and I have WIP patches to use the PCRE engine for patterns without -P
> which I intend to start sending soon after the next release.

One addition that would be quite nice would be
an option to have regex matches span input lines.

grep v2.5.4 was the last grep version that allowed
this and I keep it around just for that.

ie:

$ cat hello.txt 
Hello
World
$ grep -P "Hello\s*World" hello.txt 
$ grep-2.5.4 -P "Hello\s*World" hello.txt 
Hello
World
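
The same behaviour can be reproduced outside grep. Below is a minimal
sketch, assuming Python's re module as a stand-in for the regex engine;
it is neither grep's nor git's code, and the function names are made up
for illustration. It shows why a matcher that is fed one line at a time
prints nothing for this pattern, while one that sees the whole buffer
finds the match.

```python
import re

PATTERN = re.compile(r"Hello\s*World")
TEXT = "Hello\nWorld\n"  # contents of hello.txt from the example above


def match_per_line(pattern, text):
    """Return the lines that match when each line is searched on its own."""
    return [line for line in text.splitlines() if pattern.search(line)]


def match_whole_buffer(pattern, text):
    """Return every match found when the whole buffer is searched at once."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(match_per_line(PATTERN, TEXT))      # no single line holds both words
    print(match_whole_buffer(PATTERN, TEXT))  # \s* is allowed to eat the newline
```

The per-line variant is the current behaviour; the whole-buffer variant
is roughly what the requested option would have to do.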



* Re: grep vs git grep performance?
  2017-10-27 23:22           ` Joe Perches
@ 2017-10-28  7:45             ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 12+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-10-28  7:45 UTC (permalink / raw)
  To: Joe Perches; +Cc: Stefan Beller, git


On Fri, Oct 27 2017, Joe Perches jotted:

> On Sat, 2017-10-28 at 00:11 +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Oct 27 2017, Joe Perches jotted:
> []
>> > git grep performance has already been
>> > quite successfully improved.
>>
>> ...and I have WIP patches to use the PCRE engine for patterns without -P
>> which I intend to start sending soon after the next release.
>
> One addition that would be quite nice would be
> an option to have regex matches span input lines.
>
> grep v2.5.4 was the last grep version that allowed
> this and I keep it around just for that.
>
> ie:
>
> $ cat hello.txt
> Hello
> World
> $ grep -P "Hello\s*World" hello.txt
> $ grep-2.5.4 -P "Hello\s*World" hello.txt
> Hello
> World

I'm unable to build 2.5.4, and at a quick glance I can't find anything
relevant in the release notes from around that time saying that this
would be removed. If you can still build it, I'd be interested to see
what this bisects down to in grep.git.

But aside from that, a feature like this constrains the regex
implementation a lot: we'd either need to match against the entire
file, as we'd have to with PCRE, or embed the core logic of the regex
matcher really deeply into our grep implementation.

I.e. in this case a more optimal implementation would start by parsing
this regex down:

    ((EXACT "Hello")
     (STAR (POSIXU "\s"))
     (EXACT "World"))

Then when you open the file you can start by searching for the fixed
string "Hello". If you don't find it, you're done. If you do, you can
look ahead for the fixed string "World", and only if you find that do
you need to match the more complex part in the middle.
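
A rough sketch of that strategy, assuming Python's re module in place of
a real engine and with the split into fixed prefix, middle and fixed
suffix done by hand (a real implementation would derive it from the
parsed pattern). The helper name and its parameters are hypothetical:

```python
import re


def find_spanning_match(data, prefix="Hello", suffix="World", middle=r"\s*"):
    """Return (start, end) of the first prefix+middle+suffix match, or None.

    Hypothetical helper: prefix and suffix are fixed strings, middle is the
    only part that needs the regex engine.
    """
    # Regex for "middle followed by the literal suffix", so backtracking
    # inside the middle part still works.
    tail_re = re.compile("(?:%s)%s" % (middle, re.escape(suffix)))
    pos = 0
    while True:
        start = data.find(prefix, pos)           # cheap fixed-string scan
        if start == -1:
            return None                          # no "Hello" left: done
        mid_start = start + len(prefix)
        if data.find(suffix, mid_start) == -1:   # cheap look-ahead
            return None                          # no later "World": done
        m = tail_re.match(data, mid_start)       # only now run the regex
        if m:
            return start, m.end()
        pos = start + 1                          # try the next "Hello"


if __name__ == "__main__":
    print(find_spanning_match("Hello\nWorld\n"))
```

Note that this operates on the whole buffer, which is exactly the
memory trade-off in question.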

Whereas our API for the internal regex matchers now is that we find the
boundaries of newlines and batch-match a bunch of lines with a match()
function that takes a string, and if that matches we drill down to what
specific line matches.
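
In rough terms, and only as a sketch under stated assumptions (Python's
re module instead of regexec() or PCRE, a made-up helper name, an
arbitrary batch size, and a pattern that does not use \A or \Z), that
batch-then-drill-down shape looks like this:

```python
import re


def grep_batched(pattern, data, batch_lines=64):
    """Yield (line_number, line) for every line that matches pattern.

    Hypothetical helper: the regex is first run once per batch of lines,
    and individual lines are only checked when the batch produced a hit.
    """
    # MULTILINE keeps ^ and $ meaningful at line boundaries in a batch.
    regex = re.compile(pattern, re.MULTILINE)
    lines = data.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a trailing newline does not start another line
    for first in range(0, len(lines), batch_lines):
        batch = lines[first:first + batch_lines]
        if not regex.search("\n".join(batch)):
            continue  # no hit anywhere in this batch, skip all its lines
        # Drill down: the per-line check decides what is reported.
        for offset, line in enumerate(batch):
            if regex.search(line):
                yield first + offset + 1, line


if __name__ == "__main__":
    for number, line in grep_batched(r"bar", "foo\nbar baz\nqux\nbar\n", 2):
        print(number, line)
```

Because the per-line check has the final say, a pattern such as
"Hello\s*World" can hit at the batch level and still report nothing,
which is the single-line constraint described above.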

That's not to say this can't be done without a potentially unacceptable
memory trade-off (i.e. matching the entire file in all cases). The
PCRE2 engine in particular includes some I/O abstractions that we're
not using but could, though I haven't looked into it.

But right now the entire internal API we have is constrained by catering
to the lowest common denominator (a regexec that takes a char*), so
supporting more fancy multi-line matching features can be a PITA since
we'd need to maintain both codepaths.

Or we could make PCRE a hard dependency, which given the performance
advantages I'm increasingly willing to make the case for.

