git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>, "Jeff King" <peff@peff.net>,
	"Jeffrey Walton" <noloader@gmail.com>,
	"Michał Kiedrowicz" <michal.kiedrowicz@gmail.com>,
	"J Smith" <dark.panda@gmail.com>,
	"Victor Leschuk" <vleschuk@gmail.com>,
	"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>,
	"Fredrik Kuivinen" <frekui@gmail.com>,
	"Brandon Williams" <bmwill@google.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: [PATCH/RFC 5/6] grep: support regex patterns containing \0 via PCRE v2
Date: Thu, 11 May 2017 17:51:14 +0000	[thread overview]
Message-ID: <20170511175115.648-6-avarab@gmail.com> (raw)
In-Reply-To: <20170511175115.648-1-avarab@gmail.com>

Support regex patterns with embedded \0's, as an earlier commit[1]
notes this was previously impossible due to an internal limitation.

Before this change any regex metacharacters in patterns containing \0
were silently ignored and the pattern matched as if it were a
--fixed-strings pattern.

Now these patterns will be matched with PCRE instead, which supports
combining regex metacharacters with patterns containing \0.

A side-effect of this change is that these patterns which previously
would be considered --fixed-strings patterns regardless of the engine
requested now all implicitly become --perl-regexp instead.

A subsequent change introduces a POSIX to PCRE syntax converter, and
could be used to be 100% truthful to our documentation by using POSIX
basic syntax (which we haven't been in quite some time with kwset).

But due to a chicken & egg issue with this change being easier to
implement stand-alone first, the subsequent change depending on a SVN
trunk version of PCRE, but most importantly I don't think anyone will
mind this change, so I'm leaving it as it is.

This implementation is faster than the previous kwset implementation,
but I haven't bothered to come up with a \0-specific fixed-string
performance test.

See the next change in this series for a change which optionally
expands the PCRE v2 use to use it for all fixed-string patterns, the
performance tests for those will be applicable to these patterns as
well, since PCRE matches \0 like any other character.

1. ("grep: factor test for \0 in grep patterns into a function",
   2017-05-08)

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c                 | 24 ++++++++++++++++
 t/t7008-grep-binary.sh | 74 ++++++++++++++++++++++++++++++++++----------------
 2 files changed, 75 insertions(+), 23 deletions(-)

diff --git a/grep.c b/grep.c
index 2ff4e253ff..5db614cf80 100644
--- a/grep.c
+++ b/grep.c
@@ -613,6 +613,30 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	icase	       = opt->regflags & REG_ICASE || p->ignore_case;
 	ascii_only     = !has_non_ascii(p->pattern);
 
+#ifdef USE_LIBPCRE2
+	if (has_null(p->pattern, p->patternlen)) {
+		struct strbuf sb = STRBUF_INIT;
+		if (icase)
+			strbuf_add(&sb, "(?i)", 4);
+		if (opt->fixed)
+			strbuf_add(&sb, "\\Q", 2);		
+		strbuf_add(&sb, p->pattern, p->patternlen);
+		if (opt->fixed)
+			strbuf_add(&sb, "\\E", 2);
+
+		p->pattern = sb.buf;
+		p->patternlen = sb.len;
+
+		/* FIXME: Check in compile_pcre2_pattern() that we're
+		 * using basic rx using !opt->pcre2 && <something>
+		 */
+		opt->pcre2 = 1;
+
+		compile_pcre2_pattern(p, opt);
+		return;
+	}
+#endif
+
 	/*
 	 * Even when -F (fixed) asks us to do a non-regexp search, we
 	 * may not be able to correctly case-fold when -i
diff --git a/t/t7008-grep-binary.sh b/t/t7008-grep-binary.sh
index ba3db06501..fc86ed5fce 100755
--- a/t/t7008-grep-binary.sh
+++ b/t/t7008-grep-binary.sh
@@ -124,35 +124,63 @@ nul_match 0 '-F' '[æ]Qð'
 nul_match 0 '-Fi' 'ÆQ[Ð]'
 nul_match 0 '-Fi' '[Æ]QÐ'
 
-# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0
-# patterns case-insensitively.
-nul_match T1 '-i' 'ÆQÐ'
-
-# \0 implicitly disables regexes. This is an undocumented internal
-# limitation.
-nul_match T1 '' 'yQ[f]'
-nul_match T1 '' '[y]Qf'
-nul_match T1 '-i' 'YQ[F]'
-nul_match T1 '-i' '[Y]Qf'
-nul_match T1 '' 'æQ[ð]'
-nul_match T1 '' '[æ]Qð'
-nul_match T1 '-i' 'ÆQ[Ð]'
-
-# ... because of \0 implicitly disabling regexes regexes that
-# should/shouldn't match don't do the right thing.
-nul_match T1 '' 'eQm.*cQ'
-nul_match T1 '-i' 'EQM.*cQ'
-nul_match T0 '' 'eQm[*]c'
-nul_match T0 '-i' 'EQM[*]C'
+if test_have_prereq LIBPCRE2
+then
+	# Regex patterns that should match without -F
+	nul_match 1 '' 'yQ[f]'
+	nul_match 1 '' '[y]Qf'
+	nul_match 1 '-i' 'YQ[F]'
+	nul_match 1 '-i' '[Y]Qf'
+	nul_match 1 '' 'æQ[ð]'
+	nul_match 1 '' '[æ]Qð'
+	nul_match 0 '-i' '[Æ]Qð'
+	nul_match 1 '' 'eQm.*cQ'
+	nul_match 1 '-i' 'EQM.*cQ'
+	nul_match 0 '' 'eQm[*]c'
+	nul_match 0 '-i' 'EQM[*]C'
+
+	# These should also match, but don't due to some heisenbug,
+	# they succeed when run manually!
+	nul_match T1 '-i' 'ÆQÐ'
+	nul_match T1 '-i' 'ÆQ[Ð]'
+else
+	# \0 implicitly disables regexes. This is an undocumented
+	# internal limitation.
+	nul_match T1 '' 'yQ[f]'
+	nul_match T1 '' '[y]Qf'
+	nul_match T1 '-i' 'YQ[F]'
+	nul_match T1 '-i' '[Y]Qf'
+	nul_match T1 '' 'æQ[ð]'
+	nul_match T1 '' '[æ]Qð'
+	nul_match T1 '-i' 'ÆQ[Ð]'
+
+	# ... because of \0 implicitly disabling regexes regexes that
+	# should/shouldn't match don't do the right thing.
+	nul_match T1 '' 'eQm.*cQ'
+	nul_match T1 '-i' 'EQM.*cQ'
+	nul_match T0 '' 'eQm[*]c'
+	nul_match T0 '-i' 'EQM[*]C'
+fi
 
 # Due to the REG_STARTEND extension when kwset() is disabled on -i &
 # non-ASCII the string will be matched in its entirety, but the
 # pattern will be cut off at the first \0.
 nul_match 0 '-i' 'NOMATCHQð'
-nul_match T0 '-i' '[Æ]QNOMATCH'
-nul_match T0 '-i' '[æ]QNOMATCH'
+if test_have_prereq LIBPCRE2
+then
+	nul_match 0 '-i' '[Æ]QNOMATCH'
+	nul_match 0 '-i' '[æ]QNOMATCH'
+else
+	nul_match T0 '-i' '[Æ]QNOMATCH'
+	nul_match T0 '-i' '[æ]QNOMATCH'
+fi
 # Matches, but for the wrong reasons, just stops at [æ]
-nul_match 1 '-i' '[Æ]Qð'
+if test_have_prereq LIBPCRE2
+then
+	nul_match T1 '-i' '[Æ]Qð'
+else
+	nul_match 1 '-i' '[Æ]Qð'
+fi
 nul_match 1 '-i' '[æ]Qð'
 
 # Ensure that the matcher doesn't regress to something that stops at
-- 
2.11.0


  parent reply	other threads:[~2017-05-11 17:52 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-05-11 17:51 [PATCH/RFC 0/6] Speed up git-grep by using PCRE v2 as a backend Ævar Arnfjörð Bjarmason
2017-05-11 17:51 ` [PATCH/RFC 1/6] Makefile & compat/pcre2: add ability to build an embedded PCRE Ævar Arnfjörð Bjarmason
2017-05-11 17:51 ` [PATCH/RFC 2/6] Makefile & compat/pcre2: add dependency on pcre2_convert.c Ævar Arnfjörð Bjarmason
2017-05-11 17:51 ` [PATCH/RFC 4/6] test-lib: add LIBPCRE1 & LIBPCRE2 prerequisites Ævar Arnfjörð Bjarmason
2017-05-11 17:51 ` Ævar Arnfjörð Bjarmason [this message]
2017-05-11 17:51 ` [PATCH/RFC 6/6] grep: use PCRE v2 under the hood for -G & -E for amazing speedup Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170511175115.648-6-avarab@gmail.com \
    --to=avarab@gmail.com \
    --cc=bmwill@google.com \
    --cc=dark.panda@gmail.com \
    --cc=frekui@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=michal.kiedrowicz@gmail.com \
    --cc=noloader@gmail.com \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    --cc=vleschuk@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).