git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* git grep with leading inverted bracket expression
@ 2018-06-07 15:27 Matthew Wilcox
  2018-06-07 19:09 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 4+ messages in thread
From: Matthew Wilcox @ 2018-06-07 15:27 UTC (permalink / raw)
  To: git


If the first atom of a regex is a bracket expression with an inverted range,
git grep is very slow.

$ time git grep 'struct_size' >/dev/null

real	0m0.368s
user	0m0.563s
sys	0m0.453s

$ time git grep '[^t]truct_size' >/dev/null

real	0m31.529s
user	1m54.909s
sys	0m0.805s

If the bracket expression is moved to even the second position in the string,
it runs much faster:

$ time git grep 's[^p]ruct_size' >/dev/null

real	0m3.989s
user	0m13.939s
sys	0m0.403s

It's pretty bad with even a '.' as the first character:

$ time git grep '.truct_size' >/dev/null

real	0m14.514s
user	0m52.624s
sys	0m0.598s

$ git --version
git version 2.17.1

Setting LANG=C improves matters by a factor of 3-4 (depending if you
count real or user time):

$ time git grep '[^t]truct_size' >/dev/null
real	0m10.035s
user	0m28.795s
sys	0m0.537s

(this is using something pretty close to Linus' current HEAD of the
linux repository, an i7-7500, 16GB memory).

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: git grep with leading inverted bracket expression
  2018-06-07 15:27 git grep with leading inverted bracket expression Matthew Wilcox
@ 2018-06-07 19:09 ` Ævar Arnfjörð Bjarmason
  2018-06-07 19:22   ` Matthew Wilcox
  0 siblings, 1 reply; 4+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-07 19:09 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: git


On Thu, Jun 07 2018, Matthew Wilcox wrote:

> If the first atom of a regex is a bracket expression with an inverted range,
> git grep is very slow.
>
> $ time git grep 'struct_size' >/dev/null
>
> real	0m0.368s
> user	0m0.563s
> sys	0m0.453s
>
> $ time git grep '[^t]truct_size' >/dev/null
>
> real	0m31.529s
> user	1m54.909s
> sys	0m0.805s
>
> If the bracket expression is moved to even the second position in the string,
> it runs much faster:
>
> $ time git grep 's[^p]ruct_size' >/dev/null
>
> real	0m3.989s
> user	0m13.939s
> sys	0m0.403s
>
> It's pretty bad with even a '.' as the first character:
>
> $ time git grep '.truct_size' >/dev/null
>
> real	0m14.514s
> user	0m52.624s
> sys	0m0.598s
>
> $ git --version
> git version 2.17.1
>
> Setting LANG=C improves matters by a factor of 3-4 (depending if you
> count real or user time):
>
> $ time git grep '[^t]truct_size' >/dev/null
> real	0m10.035s
> user	0m28.795s
> sys	0m0.537s
>
> (this is using something pretty close to Linus' current HEAD of the
> linux repository, an i7-7500, 16GB memory).

I have some WIP patches to fix all of this, which I'll hopefully submit
before 2.19 is out the door.

What you've discovered here is how shitty your libc regex engine is,
because unless you provide -P and compile with a reasonably up-to-date
libpcre (preferably v2) with JIT that's what you'll get.

The reason stuff like 'struct_size' is so much faster is because there
we don't use any regex engine at all, but rather an optimized
fixed-string searcher.

With our own benchmarks modified per your E-Mail:
    
    diff --git a/t/perf/p7820-grep-engines.sh b/t/perf/p7820-grep-engines.sh
    index 8b09c5bf32..fe4c5681da 100755
    --- a/t/perf/p7820-grep-engines.sh
    +++ b/t/perf/p7820-grep-engines.sh
    @@ -28,11 +28,10 @@ then
     fi
    
     for pattern in \
    -       'how.to' \
    -       '^how to' \
    -       '[how] to' \
    -       '\(e.t[^ ]*\|v.ry\) rare' \
    -       'm\(ú\|u\)lt.b\(æ\|y\)te'
    +       'struct size' \
    +       '[^t]truct_size' \
    +       's[^p]ruct_size' \
    +       '.truct_size'
     do
            for engine in basic extended perl
            do

I get these results against linux.git:

    $ GIT_PERF_LARGE_REPO=~/g/linux ./run p7820-grep-engines.sh
    [...]
    Test                                      this tree
    ----------------------------------------------------------
    7820.1: basic grep 'struct size'          0.23(0.52+0.76)
    7820.2: extended grep 'struct size'       0.22(0.60+0.61)
    7820.3: perl grep 'struct size'           0.22(0.56+0.65)
    7820.5: basic grep '[^t]truct_size'       4.29(29.43+0.51)
    7820.6: extended grep '[^t]truct_size'    4.27(29.59+0.36)
    7820.7: perl grep '[^t]truct_size'        0.21(0.40+0.69)
    7820.9: basic grep 's[^p]ruct_size'       0.49(2.22+0.49)
    7820.10: extended grep 's[^p]ruct_size'   0.43(2.24+0.48)
    7820.11: perl grep 's[^p]ruct_size'       0.21(0.38+0.71)
    7820.13: basic grep '.truct_size'         4.42(31.29+0.44)
    7820.14: extended grep '.truct_size'      4.50(31.18+0.46)
    7820.15: perl grep '.truct_size'          0.21(0.35+0.75)

So you need to just use an up-to-date libpcre2 & -P and performance
won't suck.

My WIP patches will make us use PCRE for all grep modes, using an API it
has to convert basic & extended regexp syntax to its own syntax, so
we'll be able to do that transparently.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: git grep with leading inverted bracket expression
  2018-06-07 19:09 ` Ævar Arnfjörð Bjarmason
@ 2018-06-07 19:22   ` Matthew Wilcox
  2018-06-07 19:29     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 4+ messages in thread
From: Matthew Wilcox @ 2018-06-07 19:22 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git

On Thu, Jun 07, 2018 at 09:09:25PM +0200, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Jun 07 2018, Matthew Wilcox wrote:
> > If the first atom of a regex is a bracket expression with an inverted range,
> > git grep is very slow.
> 
> I have some WIP patches to fix all of this, which I'll hopefully submit
> before 2.19 is out the door.
> 
> What you've discovered here is how shitty your libc regex engine is,
> because unless you provide -P and compile with a reasonably up-to-date
> libpcre (preferably v2) with JIT that's what you'll get.

I'm using Debian's build, and it is linked against a recent libpcre2:
$ ldd /usr/lib/git-core/git
	libpcre2-8.so.0 => /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007f59ad5f2000)
$ dpkg --status libpcre2-8-0
Version: 10.31-3

But I wasn't using -P.  If I do, then I see the performance numbers you do:

$ time git grep -P '[^t]truct_size' >/dev/null
real	0m0.354s
user	0m0.340s
sys	0m0.639s
$ time git grep -P 'struct_size' >/dev/null
real	0m0.336s
user	0m0.552s
sys	0m0.457s
$ time git grep 'struct_size' >/dev/null
real	0m0.335s
user	0m0.535s
sys	0m0.474s

> So you need to just use an up-to-date libpcre2 & -P and performance
> won't suck.

I don't tend to use terribly advanced regexps, so I'll just set
grep.patternType to 'perl' and then it'll automatically be fast for me
without your patches ;-)

> My WIP patches will make us use PCRE for all grep modes, using an API it
> has to convert basic & extended regexp syntax to its own syntax, so
> we'll be able to do that transparently.

That's clearly the right answer.  Thanks!

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: git grep with leading inverted bracket expression
  2018-06-07 19:22   ` Matthew Wilcox
@ 2018-06-07 19:29     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 4+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-07 19:29 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: git, Junio C Hamano


On Thu, Jun 07 2018, Matthew Wilcox wrote:

> On Thu, Jun 07, 2018 at 09:09:25PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Thu, Jun 07 2018, Matthew Wilcox wrote:
>> > If the first atom of a regex is a bracket expression with an inverted range,
>> > git grep is very slow.
>>
>> I have some WIP patches to fix all of this, which I'll hopefully submit
>> before 2.19 is out the door.
>>
>> What you've discovered here is how shitty your libc regex engine is,
>> because unless you provide -P and compile with a reasonably up-to-date
>> libpcre (preferably v2) with JIT that's what you'll get.
>
> I'm using Debian's build, and it is linked against a recent libpcre2:
> $ ldd /usr/lib/git-core/git
> 	libpcre2-8.so.0 => /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007f59ad5f2000)
> $ dpkg --status libpcre2-8-0
> Version: 10.31-3
>
> But I wasn't using -P.  If I do, then I see the performance numbers you do:
>
> $ time git grep -P '[^t]truct_size' >/dev/null
> real	0m0.354s
> user	0m0.340s
> sys	0m0.639s
> $ time git grep -P 'struct_size' >/dev/null
> real	0m0.336s
> user	0m0.552s
> sys	0m0.457s
> $ time git grep 'struct_size' >/dev/null
> real	0m0.335s
> user	0m0.535s
> sys	0m0.474s
>
>> So you need to just use an up-to-date libpcre2 & -P and performance
>> won't suck.

Yeah that's recent enough & will get you all the benefits.

> I don't tend to use terribly advanced regexps, so I'll just set
> grep.patternType to 'perl' and then it'll automatically be fast for me
> without your patches ;-)

Indeed, if you're happy with that that'll do it.

>> My WIP patches will make us use PCRE for all grep modes, using an API it
>> has to convert basic & extended regexp syntax to its own syntax, so
>> we'll be able to do that transparently.
>
> That's clearly the right answer.  Thanks!

Yeah, unfortunately git-grep's default is "basic" regexp which has a
really atrocious syntax that's different enough from extended & Perl's
that we probably couldn't just switch it over.

That won't be needed with my patches, but maybe I'll follow-up with
something to s/basic/extended/g by default, because on side effect of
having the pattern converter is that we could have a warning whenever
the user has a pattern that would be different under extended/perl, so
we can see how common that is.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-06-07 19:29 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-07 15:27 git grep with leading inverted bracket expression Matthew Wilcox
2018-06-07 19:09 ` Ævar Arnfjörð Bjarmason
2018-06-07 19:22   ` Matthew Wilcox
2018-06-07 19:29     ` Ævar Arnfjörð Bjarmason

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).