git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: git@vger.kernel.org
Cc: git-packagers@googlegroups.com, gitgitgadget@gmail.com,
	gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net,
	sandals@crustytoothpaste.net, szeder.dev@gmail.com,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: [PATCH v3 10/10] grep: use PCRE v2 for optimized fixed-string search
Date: Mon,  1 Jul 2019 23:21:00 +0200	[thread overview]
Message-ID: <20190701212100.27850-11-avarab@gmail.com> (raw)
In-Reply-To: <20190627233912.7117-1-avarab@gmail.com>

Bring back optimized fixed-string search for "grep", this time with
PCRE v2 as an optional backend. As noted in [1] with kwset we were
slower than PCRE v1 and v2 JIT with the kwset backend, so that
optimization was counterproductive.

This brings back the optimization for "--fixed-strings", without
changing the semantics of having a NUL-byte in patterns. As seen in
previous commits in this series we could support it now, but I'd
rather just leave that edge-case aside so we don't have one behavior
or the other depending what "--fixed-strings" backend we're using. It
makes the behavior harder to understand and document, and makes tests
for the different backends more painful.

This does change the behavior under non-C locales when "log"'s
"--encoding" option is used and the heystack/needle in the
content/command-line doesn't have a matching encoding. See the recent
change in "t4210: skip more command-line encoding tests on MinGW" in
this series. I think that's OK. We did nothing sensible before
then (just compared raw bytes that had no hope of matching). At least
now the user will get some idea why their grep/log never matches in
that edge case.

I could also support the PCRE v1 backend here, but that would make the
code more complex. I'd rather aim for simplicity here and in future
changes to the diffcore. We're not going to have someone who
absolutely must have faster search, but for whom building PCRE v2
isn't acceptable.

The difference between this series of commits and the current "master"
is, using the same t/perf commands shown in the last commit:

plain grep:

    Test                             origin/master     HEAD
    -------------------------------------------------------------------------
    7821.1: fixed grep int           0.55(1.67+0.56)   0.41(0.98+0.60) -25.5%
    7821.2: basic grep int           0.58(1.65+0.52)   0.41(0.96+0.57) -29.3%
    7821.3: extended grep int        0.57(1.66+0.49)   0.42(0.93+0.60) -26.3%
    7821.4: perl grep int            0.54(1.67+0.50)   0.43(0.88+0.65) -20.4%
    7821.6: fixed grep uncommon      0.21(0.52+0.42)   0.16(0.24+0.51) -23.8%
    7821.7: basic grep uncommon      0.20(0.49+0.45)   0.17(0.28+0.47) -15.0%
    7821.8: extended grep uncommon   0.20(0.54+0.39)   0.16(0.25+0.50) -20.0%
    7821.9: perl grep uncommon       0.20(0.58+0.36)   0.16(0.23+0.50) -20.0%
    7821.11: fixed grep æ            0.35(1.24+0.43)   0.16(0.23+0.50) -54.3%
    7821.12: basic grep æ            0.36(1.29+0.38)   0.16(0.20+0.54) -55.6%
    7821.13: extended grep æ         0.35(1.23+0.44)   0.16(0.24+0.50) -54.3%
    7821.14: perl grep æ             0.35(1.33+0.34)   0.16(0.28+0.46) -54.3%

grep with -i:

    Test                                origin/master     HEAD
    ----------------------------------------------------------------------------
    7821.1: fixed grep -i int           0.62(1.81+0.70)   0.47(1.11+0.64) -24.2%
    7821.2: basic grep -i int           0.67(1.90+0.53)   0.46(1.07+0.62) -31.3%
    7821.3: extended grep -i int        0.62(1.92+0.53)   0.53(1.12+0.58) -14.5%
    7821.4: perl grep -i int            0.66(1.85+0.58)   0.45(1.10+0.59) -31.8%
    7821.6: fixed grep -i uncommon      0.21(0.54+0.43)   0.17(0.20+0.55) -19.0%
    7821.7: basic grep -i uncommon      0.20(0.52+0.45)   0.17(0.29+0.48) -15.0%
    7821.8: extended grep -i uncommon   0.21(0.52+0.44)   0.17(0.26+0.50) -19.0%
    7821.9: perl grep -i uncommon       0.21(0.53+0.44)   0.17(0.20+0.56) -19.0%
    7821.11: fixed grep -i æ            0.26(0.79+0.44)   0.16(0.29+0.46) -38.5%
    7821.12: basic grep -i æ            0.26(0.79+0.42)   0.16(0.20+0.54) -38.5%
    7821.13: extended grep -i æ         0.26(0.84+0.39)   0.16(0.24+0.50) -38.5%
    7821.14: perl grep -i æ             0.16(0.24+0.49)   0.17(0.25+0.51) +6.3%

plain log:

    Test                                     origin/master     HEAD
    --------------------------------------------------------------------------------
    4221.1: fixed log --grep='int'           7.24(6.95+0.28)   7.20(6.95+0.18) -0.6%
    4221.2: basic log --grep='int'           7.31(6.97+0.22)   7.20(6.93+0.21) -1.5%
    4221.3: extended log --grep='int'        7.37(7.04+0.24)   7.22(6.91+0.25) -2.0%
    4221.4: perl log --grep='int'            7.31(7.04+0.21)   7.19(6.89+0.21) -1.6%
    4221.6: fixed log --grep='uncommon'      6.93(6.59+0.32)   7.04(6.66+0.37) +1.6%
    4221.7: basic log --grep='uncommon'      6.92(6.58+0.29)   7.08(6.75+0.29) +2.3%
    4221.8: extended log --grep='uncommon'   6.92(6.55+0.31)   7.00(6.68+0.31) +1.2%
    4221.9: perl log --grep='uncommon'       7.03(6.59+0.33)   7.12(6.73+0.34) +1.3%
    4221.11: fixed log --grep='æ'            7.41(7.08+0.28)   7.05(6.76+0.29) -4.9%
    4221.12: basic log --grep='æ'            7.39(6.99+0.33)   7.00(6.68+0.25) -5.3%
    4221.13: extended log --grep='æ'         7.34(7.00+0.25)   7.15(6.81+0.31) -2.6%
    4221.14: perl log --grep='æ'             7.43(7.13+0.26)   7.01(6.60+0.36) -5.7%

log with -i:

    Test                                        origin/master     HEAD
    ------------------------------------------------------------------------------------
    4221.1: fixed log -i --grep='int'           7.31(7.07+0.24)   7.23(7.00+0.22) -1.1%
    4221.2: basic log -i --grep='int'           7.40(7.08+0.28)   7.19(6.92+0.20) -2.8%
    4221.3: extended log -i --grep='int'        7.43(7.13+0.25)   7.27(6.99+0.21) -2.2%
    4221.4: perl log -i --grep='int'            7.34(7.10+0.24)   7.10(6.90+0.19) -3.3%
    4221.6: fixed log -i --grep='uncommon'      7.07(6.71+0.32)   7.11(6.77+0.28) +0.6%
    4221.7: basic log -i --grep='uncommon'      6.99(6.64+0.28)   7.12(6.69+0.38) +1.9%
    4221.8: extended log -i --grep='uncommon'   7.11(6.74+0.32)   7.10(6.77+0.27) -0.1%
    4221.9: perl log -i --grep='uncommon'       6.98(6.60+0.29)   7.05(6.64+0.34) +1.0%
    4221.11: fixed log -i --grep='æ'            7.85(7.45+0.34)   7.03(6.68+0.32) -10.4%
    4221.12: basic log -i --grep='æ'            7.87(7.49+0.29)   7.06(6.69+0.31) -10.3%
    4221.13: extended log -i --grep='æ'         7.87(7.54+0.31)   7.09(6.69+0.31) -9.9%
    4221.14: perl log -i --grep='æ'             7.06(6.77+0.28)   6.91(6.57+0.31) -2.1%

So as with e05b027627 ("grep: use PCRE v2 for optimized fixed-string
search", 2019-06-26) there's a huge improvement in performance for
"grep", but in "log" most of our time is spent elsewhere, so we don't
notice it that much.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index 4468519d5c..fc0ed73ef3 100644
--- a/grep.c
+++ b/grep.c
@@ -356,6 +356,18 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p,
 	die("%s'%s': %s", where, p->pattern, error);
 }
 
+static int is_fixed(const char *s, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++) {
+		if (is_regex_special(s[i]))
+			return 0;
+	}
+
+	return 1;
+}
+
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 {
@@ -602,7 +614,6 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
 static void free_pcre2_pattern(struct grep_pat *p)
 {
 }
-#endif /* !USE_LIBPCRE2 */
 
 static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
@@ -623,11 +634,13 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 		compile_regexp_failed(p, errbuf);
 	}
 }
+#endif /* !USE_LIBPCRE2 */
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
 	int err;
 	int regflags = REG_NEWLINE;
+	int pat_is_fixed;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
@@ -636,8 +649,42 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
-	if (opt->fixed) {
+	pat_is_fixed = is_fixed(p->pattern, p->patternlen);
+	if (opt->fixed || pat_is_fixed) {
+#ifdef USE_LIBPCRE2
+		opt->pcre2 = 1;
+		if (pat_is_fixed) {
+			compile_pcre2_pattern(p, opt);
+		} else {
+			/*
+			 * E.g. t7811-grep-open.sh relies on the
+			 * pattern being restored.
+			 */
+			char *old_pattern = p->pattern;
+			size_t old_patternlen = p->patternlen;
+			struct strbuf sb = STRBUF_INIT;
+
+			/*
+			 * There is the PCRE2_LITERAL flag, but it's
+			 * only in PCRE v2 10.30 and later. Needing to
+			 * ifdef our way around that and dealing with
+			 * it + PCRE2_MULTILINE being an error is more
+			 * complex than just quoting this ourselves.
+			*/
+			strbuf_add(&sb, "\\Q", 2);
+			strbuf_add(&sb, p->pattern, p->patternlen);
+			strbuf_add(&sb, "\\E", 2);
+
+			p->pattern = sb.buf;
+			p->patternlen = sb.len;
+			compile_pcre2_pattern(p, opt);
+			p->pattern = old_pattern;
+			p->patternlen = old_patternlen;
+			strbuf_release(&sb);
+		}
+#else /* !USE_LIBPCRE2 */
 		compile_fixed_regexp(p, opt);
+#endif /* !USE_LIBPCRE2 */
 		return;
 	}
 
-- 
2.22.0.455.g172b71a6c5


  parent reply	other threads:[~2019-07-01 21:21 UTC|newest]

Thread overview: 90+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-13 11:49 [PATCH 0/4] Support building with GCC v8.x/v9.x Johannes Schindelin via GitGitGadget
2019-06-13 11:49 ` [PATCH 1/4] poll (mingw): allow compiling with GCC 8 and DEVELOPER=1 Johannes Schindelin via GitGitGadget
2019-06-13 11:49 ` [PATCH 2/4] kwset: allow building with GCC 8 Johannes Schindelin via GitGitGadget
2019-06-13 16:11   ` Junio C Hamano
2019-06-14  9:53   ` SZEDER Gábor
2019-06-14 10:00     ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream SZEDER Gábor
2019-06-14 10:00       ` [PATCH v1 1/4] " SZEDER Gábor
2019-06-14 10:00       ` [PATCH v1 2/4] SQUASH??? compat/obstack: fix portability issues SZEDER Gábor
2019-06-14 10:00       ` [PATCH v1 3/4] SQUASH??? compat/obstack: fix build errors with Clang SZEDER Gábor
2019-06-14 10:00       ` [PATCH v1 4/4] compat/obstack: fix some sparse warnings SZEDER Gábor
2019-06-14 17:57       ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream Jeff King
2019-06-14 18:19       ` Junio C Hamano
2019-06-14 20:30       ` Ramsay Jones
2019-06-14 21:24         ` Ramsay Jones
2019-06-17 18:36         ` SZEDER Gábor
2019-06-14 16:12     ` [PATCH 2/4] kwset: allow building with GCC 8 Junio C Hamano
2019-06-17 18:26       ` SZEDER Gábor
2019-06-14 22:09   ` Ævar Arnfjörð Bjarmason
2019-06-14 22:55   ` Can we just get rid of kwset & obstack in favor of optimistically using PCRE v2 JIT? Ævar Arnfjörð Bjarmason
2019-06-14 23:19     ` Ævar Arnfjörð Bjarmason
2019-06-20 10:35       ` Jeff King
2019-06-15  9:01     ` Carlo Arenas
2019-06-15 19:15     ` brian m. carlson
2019-06-15 22:14       ` Ævar Arnfjörð Bjarmason
2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
2019-06-26 14:02           ` Johannes Schindelin
2019-06-27  9:16             ` Johannes Schindelin
2019-06-27 16:27               ` Ævar Arnfjörð Bjarmason
2019-06-27 18:21                 ` Johannes Schindelin
2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
2019-06-28  7:23             ` Ævar Arnfjörð Bjarmason
2019-06-28 16:10               ` Junio C Hamano
2019-07-01 21:20             ` [PATCH v3 00/10] " Ævar Arnfjörð Bjarmason
2019-07-01 21:31               ` Junio C Hamano
2019-07-02 11:10                 ` Ævar Arnfjörð Bjarmason
2019-07-02 12:32               ` Johannes Schindelin
2019-07-02 19:57                 ` Junio C Hamano
2019-07-03 10:08                   ` Johannes Schindelin
2019-07-03 10:25                 ` Johannes Schindelin
2019-07-03 11:27                   ` Johannes Schindelin
2019-07-01 21:20             ` [PATCH v3 01/10] log tests: test regex backends in "--encode=<enc>" tests Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 02/10] grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>" Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 03/10] t4210: skip more command-line encoding tests on MinGW Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 04/10] grep: inline the return value of a function call used only once Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 05/10] grep tests: move "grep binary" alongside the rest Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 06/10] grep tests: move binary pattern tests into their own file Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 07/10] grep: make the behavior for NUL-byte in patterns sane Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 08/10] grep: drop support for \0 in --fixed-strings <pattern> Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 09/10] grep: remove the kwset optimization Ævar Arnfjörð Bjarmason
2019-07-01 21:21             ` Ævar Arnfjörð Bjarmason [this message]
2019-06-27 23:39           ` [PATCH v2 1/9] log tests: test regex backends in "--encode=<enc>" tests Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 2/9] grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>" Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 3/9] grep: inline the return value of a function call used only once Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 4/9] grep tests: move "grep binary" alongside the rest Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 5/9] grep tests: move binary pattern tests into their own file Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 6/9] grep: make the behavior for NUL-byte in patterns sane Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 7/9] grep: drop support for \0 in --fixed-strings <pattern> Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 8/9] grep: remove the kwset optimization Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 9/9] grep: use PCRE v2 for optimized fixed-string search Ævar Arnfjörð Bjarmason
2019-06-26  0:03         ` [RFC/PATCH 1/7] grep: inline the return value of a function call used only once Ævar Arnfjörð Bjarmason
2019-06-26  0:03         ` [RFC/PATCH 2/7] grep tests: move "grep binary" alongside the rest Ævar Arnfjörð Bjarmason
2019-06-26 14:05           ` Johannes Schindelin
2019-06-26 18:13           ` Junio C Hamano
2019-06-26  0:03         ` [RFC/PATCH 3/7] grep tests: move binary pattern tests into their own file Ævar Arnfjörð Bjarmason
2019-06-26  0:03         ` [RFC/PATCH 4/7] grep: make the behavior for \0 in patterns sane Ævar Arnfjörð Bjarmason
2019-06-27  2:03           ` brian m. carlson
2019-06-26  0:03         ` [RFC/PATCH 5/7] grep: drop support for \0 in --fixed-strings <pattern> Ævar Arnfjörð Bjarmason
2019-06-26 16:14           ` Junio C Hamano
2019-06-26  0:03         ` [RFC/PATCH 6/7] grep: remove the kwset optimization Ævar Arnfjörð Bjarmason
2019-06-26  0:03         ` [RFC/PATCH 7/7] grep: use PCRE v2 for optimized fixed-string search Ævar Arnfjörð Bjarmason
2019-06-26 14:13           ` Johannes Schindelin
2019-06-26 18:45             ` Junio C Hamano
2019-06-27  9:31               ` Johannes Schindelin
2019-06-27 18:45                 ` Johannes Schindelin
2019-06-27 19:06                   ` Junio C Hamano
2019-06-28 10:56                     ` Johannes Schindelin
2019-06-13 11:49 ` [PATCH 3/4] winansi: simplify loading the GetCurrentConsoleFontEx() function Johannes Schindelin via GitGitGadget
2019-06-13 11:49 ` [PATCH 4/4] config: avoid calling `labs()` on too-large data type Johannes Schindelin via GitGitGadget
2019-06-13 16:13   ` Junio C Hamano
2019-06-16  6:48   ` René Scharfe
2019-06-16  8:24     ` René Scharfe
2019-06-16 14:01       ` René Scharfe
2019-06-16 22:26         ` Junio C Hamano
2019-06-20 19:58           ` René Scharfe
2019-06-20 21:07             ` Junio C Hamano
2019-06-21 18:35             ` Johannes Schindelin
2019-06-22 10:03               ` René Scharfe
2019-06-22 10:03           ` [PATCH v2 1/3] config: use unsigned_mult_overflows to check for overflows René Scharfe
2019-06-22 10:03           ` [PATCH v2 2/3] config: don't multiply in parse_unit_factor() René Scharfe
2019-06-22 10:03           ` [PATCH v2 3/3] config: simplify parsing of unit factors René Scharfe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190701212100.27850-11-avarab@gmail.com \
    --to=avarab@gmail.com \
    --cc=git-packagers@googlegroups.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=peff@peff.net \
    --cc=sandals@crustytoothpaste.net \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).