git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Joe Perches <joe@perches.com>
Cc: Stefan Beller <sbeller@google.com>, git <git@vger.kernel.org>
Subject: Re: grep vs git grep performance?
Date: Sat, 28 Oct 2017 09:45:55 +0200	[thread overview]
Message-ID: <8760azyato.fsf@evledraar.booking.com> (raw)
In-Reply-To: <1509146542.1914.19.camel@perches.com>


On Fri, Oct 27 2017, Joe Perches jotted:

> On Sat, 2017-10-28 at 00:11 +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Oct 27 2017, Joe Perches jotted:
> []
>> > git grep performance has already been
>> > quite successfully improved.
>>
>> ...and I have WIP patches to use the PCRE engine for patterns without -P
>> which I intend to start sending soon after the next release.
>
> One addition that would be quite nice would be
> an option to have regex matches span input lines.
>
> grep v2.54 was the last grep version that allowed
> this and I keep it around just for that.
>
> ie:
>
> $ cat hello.txt
> Hello
> World
> $ grep -P "Hello\s*World" hello.txt
> $ grep-2.5.4 -P "Hello\s*World" hello.txt
> Hello
> World

I'm unable to build 2.5.4 and can't find anything relevant in the
release notes at a quick glance around that time saying that this would
be removed, if you can still build it I'd be interested to see what this
bisects down to in grep.git.

But aside from that, a feature like this constrains the regex
implementation a lot since it's going to need to either match the entire
file as we'd need to do with PCRE, or we'd need to really deeply embed
the core logic of the regex matcher into our grep implementation.

I.e. in this case a more optimal implementation would start by parsing
this regex down:

    ((EXACT "Hello")
     (STAR (POSIXU "\s"))
     (EXACT "World"))

Then when you open the file you can start searching for the fixed-string
"Hello", if you don't find that you're done, if you do you can forward
look-ahead for the fixed "World", and only if you find that do you need
to match the more complex part in the middle.

Whereas our API for the internal regex matchers now is that we find the
boundaries of newlines and batch-match a bunch of lines with a match()
function that takes a string, and if that matches we drill down to what
specific line matches.

Which is not to say that this can't be done without a potentially
unacceptable memory trade-off (i.e. matching the entire file in all
cases), the PCRE2 engine in particular includes some I/O abstractions
that we're not using but could (but I haven't looked into it).

But right now the entire internal API we have is constrained by catering
to the lowest common denominator (a regexec that takes a char*), so
supporting more fancy multi-line matching features can be a PITA since
we'd need to maintain both codepaths.

Or we could make PCRE a hard dependency, which given the performance
advantages I'm increasingly willing to make the case for.

      reply	other threads:[~2017-10-28  7:46 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-26 15:02 grep vs git grep performance? Joe Perches
2017-10-26 15:11 ` Han-Wen Nienhuys
2017-10-26 15:55   ` Joe Perches
2017-10-26 16:13 ` SZEDER Gábor
2017-10-26 16:20   ` Joe Perches
2017-10-26 16:58 ` Stefan Beller
2017-10-26 17:41   ` Joe Perches
2017-10-26 17:45     ` Stefan Beller
2017-10-27 17:22       ` Joe Perches
2017-10-27 22:11         ` Ævar Arnfjörð Bjarmason
2017-10-27 23:22           ` Joe Perches
2017-10-28  7:45             ` Ævar Arnfjörð Bjarmason [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8760azyato.fsf@evledraar.booking.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=joe@perches.com \
    --cc=sbeller@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).