git@vger.kernel.org mailing list mirror (one of many)
* [PATCH/RFC 0/6] Speed up git-grep by using PCRE v2 as a backend
@ 2017-05-11 17:51 Ævar Arnfjörð Bjarmason
  2017-05-11 17:51 ` [PATCH/RFC 1/6] Makefile & compat/pcre2: add ability to build an embedded PCRE Ævar Arnfjörð Bjarmason
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-05-11 17:51 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Jeffrey Walton, Michał Kiedrowicz,
	J Smith, Victor Leschuk, Nguyễn Thái Ngọc Duy,
	Fredrik Kuivinen, Brandon Williams,
	Ævar Arnfjörð Bjarmason

Thought I'd send this to the list too. This is the first of my WIP
series that build on the PCRE v2 inclusion.

In addition to the huge speed improvements for grep -P noted in the
culmination of that series[1], this speeds up all other types of grep
invocations (fixed string & POSIX basic/extended) by using an
experimental PCRE API to translate those patterns to PCRE syntax.

Fixed string grep is sped up by ~15-50%, and any greps containing
regexes by 40-70%, with 50% seeming to be the average for most normal
patterns.

It isn't ready for the reasons noted in the last patch in the series,
and currently brings most of PCRE into git in compat/pcre2 since it
uses an experimental API.

The 5/6 patch is pretty much ready though, and works on stock PCRE. It
fixes the TODO tests for patterns that contain a \0 and enables regex
metacharacters in such patterns (right now they're all implicitly
fixed strings); see the discussion in that patch for some of the
caveats.

Patch 3/6, which imports PCRE, will most likely be dropped by the
list. It can be retrieved from the avar/grep-and-pcre-and-more branch
of https://github.com/avar/git, or the whole series viewed at
https://github.com/git/git/compare/master...avar:avar/grep-and-pcre-and-more.

1. <20170511170142.15934-8-avarab@gmail.com>
   (https://public-inbox.org/git/20170511170142.15934-8-avarab@gmail.com/)

Ævar Arnfjörð Bjarmason (6):
  Makefile & compat/pcre2: add ability to build an embedded PCRE
  Makefile & compat/pcre2: add dependency on pcre2_convert.c
  compat/pcre2: import pcre2 from svn trunk
  test-lib: add LIBPCRE1 & LIBPCRE2 prerequisites
  grep: support regex patterns containing \0 via PCRE v2
  grep: use PCRE v2 under the hood for -G & -E for amazing speedup

 Makefile                                           |    53 +
 compat/pcre2/get-pcre2.sh                          |    68 +
 compat/pcre2/src/pcre2.h                           |   832 ++
 compat/pcre2/src/pcre2_auto_possess.c              |  1291 ++
 compat/pcre2/src/pcre2_chartables.c                |     1 +
 compat/pcre2/src/pcre2_chartables.c.dist           |   198 +
 compat/pcre2/src/pcre2_compile.c                   |  9626 +++++++++++++++
 compat/pcre2/src/pcre2_config.c                    |   222 +
 compat/pcre2/src/pcre2_context.c                   |   450 +
 compat/pcre2/src/pcre2_convert.c                   |   724 ++
 compat/pcre2/src/pcre2_error.c                     |   327 +
 compat/pcre2/src/pcre2_find_bracket.c              |   218 +
 compat/pcre2/src/pcre2_internal.h                  |  1967 +++
 compat/pcre2/src/pcre2_intmodedep.h                |   884 ++
 compat/pcre2/src/pcre2_jit_compile.c               | 12307 +++++++++++++++++++
 compat/pcre2/src/pcre2_jit_match.c                 |   189 +
 compat/pcre2/src/pcre2_jit_misc.c                  |   227 +
 compat/pcre2/src/pcre2_maketables.c                |   157 +
 compat/pcre2/src/pcre2_match.c                     |  6826 ++++++++++
 compat/pcre2/src/pcre2_match_data.c                |   147 +
 compat/pcre2/src/pcre2_newline.c                   |   243 +
 compat/pcre2/src/pcre2_ord2utf.c                   |   120 +
 compat/pcre2/src/pcre2_string_utils.c              |   201 +
 compat/pcre2/src/pcre2_study.c                     |  1624 +++
 compat/pcre2/src/pcre2_tables.c                    |   765 ++
 compat/pcre2/src/pcre2_ucd.c                       |  3761 ++++++
 compat/pcre2/src/pcre2_ucp.h                       |   268 +
 compat/pcre2/src/pcre2_valid_utf.c                 |   398 +
 compat/pcre2/src/pcre2_xclass.c                    |   271 +
 compat/pcre2/src/sljit/sljitConfig.h               |   145 +
 compat/pcre2/src/sljit/sljitConfigInternal.h       |   725 ++
 compat/pcre2/src/sljit/sljitExecAllocator.c        |   312 +
 compat/pcre2/src/sljit/sljitLir.c                  |  2224 ++++
 compat/pcre2/src/sljit/sljitLir.h                  |  1392 +++
 compat/pcre2/src/sljit/sljitNativeARM_32.c         |  2326 ++++
 compat/pcre2/src/sljit/sljitNativeARM_64.c         |  2104 ++++
 compat/pcre2/src/sljit/sljitNativeARM_T2_32.c      |  1987 +++
 compat/pcre2/src/sljit/sljitNativeMIPS_32.c        |   437 +
 compat/pcre2/src/sljit/sljitNativeMIPS_64.c        |   539 +
 compat/pcre2/src/sljit/sljitNativeMIPS_common.c    |  2110 ++++
 compat/pcre2/src/sljit/sljitNativePPC_32.c         |   276 +
 compat/pcre2/src/sljit/sljitNativePPC_64.c         |   447 +
 compat/pcre2/src/sljit/sljitNativePPC_common.c     |  2421 ++++
 compat/pcre2/src/sljit/sljitNativeSPARC_32.c       |   165 +
 compat/pcre2/src/sljit/sljitNativeSPARC_common.c   |  1471 +++
 compat/pcre2/src/sljit/sljitNativeTILEGX-encoder.c | 10159 +++++++++++++++
 compat/pcre2/src/sljit/sljitNativeTILEGX_64.c      |  2555 ++++
 compat/pcre2/src/sljit/sljitNativeX86_32.c         |   602 +
 compat/pcre2/src/sljit/sljitNativeX86_64.c         |   742 ++
 compat/pcre2/src/sljit/sljitNativeX86_common.c     |  2921 +++++
 compat/pcre2/src/sljit/sljitProtExecAllocator.c    |   421 +
 compat/pcre2/src/sljit/sljitUtils.c                |   334 +
 grep.c                                             |    73 +-
 grep.h                                             |     5 +
 t/README                                           |    18 +
 t/t7008-grep-binary.sh                             |    87 +-
 t/test-lib.sh                                      |     3 +
 57 files changed, 81335 insertions(+), 31 deletions(-)
 create mode 100755 compat/pcre2/get-pcre2.sh
 create mode 100644 compat/pcre2/src/pcre2.h
 create mode 100644 compat/pcre2/src/pcre2_auto_possess.c
 create mode 120000 compat/pcre2/src/pcre2_chartables.c
 create mode 100644 compat/pcre2/src/pcre2_chartables.c.dist
 create mode 100644 compat/pcre2/src/pcre2_compile.c
 create mode 100644 compat/pcre2/src/pcre2_config.c
 create mode 100644 compat/pcre2/src/pcre2_context.c
 create mode 100644 compat/pcre2/src/pcre2_convert.c
 create mode 100644 compat/pcre2/src/pcre2_error.c
 create mode 100644 compat/pcre2/src/pcre2_find_bracket.c
 create mode 100644 compat/pcre2/src/pcre2_internal.h
 create mode 100644 compat/pcre2/src/pcre2_intmodedep.h
 create mode 100644 compat/pcre2/src/pcre2_jit_compile.c
 create mode 100644 compat/pcre2/src/pcre2_jit_match.c
 create mode 100644 compat/pcre2/src/pcre2_jit_misc.c
 create mode 100644 compat/pcre2/src/pcre2_maketables.c
 create mode 100644 compat/pcre2/src/pcre2_match.c
 create mode 100644 compat/pcre2/src/pcre2_match_data.c
 create mode 100644 compat/pcre2/src/pcre2_newline.c
 create mode 100644 compat/pcre2/src/pcre2_ord2utf.c
 create mode 100644 compat/pcre2/src/pcre2_string_utils.c
 create mode 100644 compat/pcre2/src/pcre2_study.c
 create mode 100644 compat/pcre2/src/pcre2_tables.c
 create mode 100644 compat/pcre2/src/pcre2_ucd.c
 create mode 100644 compat/pcre2/src/pcre2_ucp.h
 create mode 100644 compat/pcre2/src/pcre2_valid_utf.c
 create mode 100644 compat/pcre2/src/pcre2_xclass.c
 create mode 100644 compat/pcre2/src/sljit/sljitConfig.h
 create mode 100644 compat/pcre2/src/sljit/sljitConfigInternal.h
 create mode 100644 compat/pcre2/src/sljit/sljitExecAllocator.c
 create mode 100644 compat/pcre2/src/sljit/sljitLir.c
 create mode 100644 compat/pcre2/src/sljit/sljitLir.h
 create mode 100644 compat/pcre2/src/sljit/sljitNativeARM_32.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativeARM_64.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativeARM_T2_32.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativeMIPS_32.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativeMIPS_64.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativeMIPS_common.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativePPC_32.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativePPC_64.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativePPC_common.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativeSPARC_32.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativeSPARC_common.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativeTILEGX-encoder.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativeTILEGX_64.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativeX86_32.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativeX86_64.c
 create mode 100644 compat/pcre2/src/sljit/sljitNativeX86_common.c
 create mode 100644 compat/pcre2/src/sljit/sljitProtExecAllocator.c
 create mode 100644 compat/pcre2/src/sljit/sljitUtils.c

-- 
2.11.0



* [PATCH/RFC 1/6] Makefile & compat/pcre2: add ability to build an embedded PCRE
  2017-05-11 17:51 [PATCH/RFC 0/6] Speed up git-grep by using PCRE v2 as a backend Ævar Arnfjörð Bjarmason
@ 2017-05-11 17:51 ` Ævar Arnfjörð Bjarmason
  2017-05-11 17:51 ` [PATCH/RFC 2/6] Makefile & compat/pcre2: add dependency on pcre2_convert.c Ævar Arnfjörð Bjarmason
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-05-11 17:51 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Jeffrey Walton, Michał Kiedrowicz,
	J Smith, Victor Leschuk, Nguyễn Thái Ngọc Duy,
	Fredrik Kuivinen, Brandon Williams,
	Ævar Arnfjörð Bjarmason

Add a USE_LIBPCRE2_BUNDLED=YesIHaveNoPackagedVersion flag to the
Makefile which'll use the PCRE v2 shipped in compat/pcre2 instead of
trying to find it via -lpcre2-8 on the installed system.

As covered in a previous commit ("grep: add support for PCRE v2",
2017-04-08) there are major benefits to using a bleeding edge PCRE v2,
but more importantly I'd like to experiment with making PCRE a
mandatory dependency to power various internal features of grep/log
without the user being aware that they're using the library under the
hood, similar to how we use kwset now for fixed-string searches.

Imposing that hard dependency on everyone using git would bother a lot
of people, whereas if git itself ships PCRE it's no more bothersome
than the code using kwset, i.e. it can be invisible to the builder &
user, and allow git to target newer PCRE APIs without worrying about
versioning.

See [1] for a mostly one-sided pcre-dev mailing list thread discussing
how to embed the library.

Implementation details:

 * I configured PCRE v2 with ./configure --enable-jit --enable-utf

 * It sets a lot of -DHAVE_* flags, but only these are used by the
   subset of the files I copied; the rest are either unused or only
   used by pcre2test.c, which isn't brought in by the script.

 * These -DHAVE_* flags are something we have already by default &
   assume in other git.git code, so it should be fine to define them.

 * -DPCRE2_CODE_UNIT_WIDTH=8 only compiles the functions that linking
   to -lpcre2-8 would have gotten us.

 * All the limits / sizes are the PCRE defaults. The
   MATCH_LIMIT_RECURSION define is a synonym for MATCH_LIMIT_DEPTH in
   older versions; it allows building against older (currently
   released) versions of the library.

 * -DNEWLINE_DEFAULT=2 means only \n is recognized as a newline. This
    corresponds to the --enable-newline-is-lf option. Other settings
    recognize CR, CRLF, any of CR, LF or CRLF, or any Unicode newline
    sequence as a newline instead.

    This *might* have to be customized on Windows, but I think the
    grep machinery always splits on newlines for us already, so this
    probably works on Windows as-is, but needs testing.

1. https://lists.exim.org/lurker/thread/20170507.223619.fbee8f00.en.html

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Makefile                  | 52 ++++++++++++++++++++++++++++++++++++
 compat/pcre2/get-pcre2.sh | 67 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+)
 create mode 100755 compat/pcre2/get-pcre2.sh

diff --git a/Makefile b/Makefile
index d77ca4c1a5..b18867196e 100644
--- a/Makefile
+++ b/Makefile
@@ -34,6 +34,11 @@ all::
 # library. The USE_LIBPCRE flag will likely be changed to mean v2 by
 # default in future releases.
 #
+# Define USE_LIBPCRE2_BUNDLED=YesIHaveNoPackagedVersion in addition to
+# USE_LIBPCRE2=YesPlease if you'd like to use a copy of PCRE version 2
+# bundled with Git. This is for setups where getting a hold of a
+# packaged PCRE is inconvenient.
+#
 # Define LIBPCREDIR=/foo/bar if your PCRE header and library files are in
 # /foo/bar/include and /foo/bar/lib directories.
 #
@@ -1105,8 +1110,10 @@ endif
 
 ifdef USE_LIBPCRE2
 	BASIC_CFLAGS += -DUSE_LIBPCRE2
+ifndef USE_LIBPCRE2_BUNDLED
 	EXTLIBS += -lpcre2-8
 endif
+endif
 
 ifdef LIBPCREDIR
 	BASIC_CFLAGS += -I$(LIBPCREDIR)/include
@@ -1505,6 +1512,50 @@ ifdef NO_REGEX
 	COMPAT_CFLAGS += -Icompat/regex
 	COMPAT_OBJS += compat/regex/regex.o
 endif
+ifdef USE_LIBPCRE2_BUNDLED
+ifndef USE_LIBPCRE2
+$(error please set USE_LIBPCRE2=YesPlease when setting \
+USE_LIBPCRE2_BUNDLED=$(USE_LIBPCRE2_BUNDLED))
+endif
+	COMPAT_CFLAGS += \
+		-Icompat/pcre2/src \
+		-DHAVE_BCOPY=1 \
+		-DHAVE_INTTYPES_H=1 \
+		-DHAVE_MEMMOVE=1 \
+		-DHAVE_STDINT_H=1 \
+		-DPCRE2_CODE_UNIT_WIDTH=8 \
+		-DLINK_SIZE=2 \
+		-DHEAP_LIMIT=20000000 \
+		-DMATCH_LIMIT=10000000 \
+		-DMATCH_LIMIT_DEPTH=10000000 \
+		-DMATCH_LIMIT_RECURSION=10000000 \
+		-DMAX_NAME_COUNT=10000 \
+		-DMAX_NAME_SIZE=32 \
+		-DPARENS_NEST_LIMIT=250 \
+		-DNEWLINE_DEFAULT=2 \
+		-DSUPPORT_JIT \
+		-DSUPPORT_UNICODE
+	COMPAT_OBJS += \
+		compat/pcre2/src/pcre2_auto_possess.o \
+		compat/pcre2/src/pcre2_chartables.o \
+		compat/pcre2/src/pcre2_compile.o \
+		compat/pcre2/src/pcre2_config.o \
+		compat/pcre2/src/pcre2_context.o \
+		compat/pcre2/src/pcre2_error.o \
+		compat/pcre2/src/pcre2_find_bracket.o \
+		compat/pcre2/src/pcre2_jit_compile.o \
+		compat/pcre2/src/pcre2_maketables.o \
+		compat/pcre2/src/pcre2_match.o \
+		compat/pcre2/src/pcre2_match_data.o \
+		compat/pcre2/src/pcre2_newline.o \
+		compat/pcre2/src/pcre2_ord2utf.o \
+		compat/pcre2/src/pcre2_string_utils.o \
+		compat/pcre2/src/pcre2_study.o \
+		compat/pcre2/src/pcre2_tables.o \
+		compat/pcre2/src/pcre2_ucd.o \
+		compat/pcre2/src/pcre2_valid_utf.o \
+		compat/pcre2/src/pcre2_xclass.o
+endif
 ifdef NATIVE_CRLF
 	BASIC_CFLAGS += -DNATIVE_CRLF
 endif
@@ -2259,6 +2310,7 @@ GIT-BUILD-OPTIONS: FORCE
 	@echo NO_EXPAT=\''$(subst ','\'',$(subst ','\'',$(NO_EXPAT)))'\' >>$@+
 	@echo USE_LIBPCRE1=\''$(subst ','\'',$(subst ','\'',$(USE_LIBPCRE)))'\' >>$@+
 	@echo USE_LIBPCRE2=\''$(subst ','\'',$(subst ','\'',$(USE_LIBPCRE2)))'\' >>$@+
+	@echo USE_LIBPCRE2_BUNDLED=\''$(subst ','\'',$(subst ','\'',$(USE_LIBPCRE2_BUNDLED)))'\' >>$@+
 	@echo NO_PERL=\''$(subst ','\'',$(subst ','\'',$(NO_PERL)))'\' >>$@+
 	@echo NO_PTHREADS=\''$(subst ','\'',$(subst ','\'',$(NO_PTHREADS)))'\' >>$@+
 	@echo NO_PYTHON=\''$(subst ','\'',$(subst ','\'',$(NO_PYTHON)))'\' >>$@+
diff --git a/compat/pcre2/get-pcre2.sh b/compat/pcre2/get-pcre2.sh
new file mode 100755
index 0000000000..f1796cb518
--- /dev/null
+++ b/compat/pcre2/get-pcre2.sh
@@ -0,0 +1,67 @@
+#!/bin/sh -e
+
+# Usage:
+# ./get-pcre2.sh '' 'trunk'
+# ./get-pcre2.sh '' 'tags/pcre2-10.23'
+# ./get-pcre2.sh ~/g/pcre2 ''
+
+srcdir=$1
+version=$2
+if test -z "$version"
+then
+	version="tags/pcre2-10.23"
+fi
+
+echo Getting PCRE v2 version $version
+rm -rfv src
+mkdir src src/sljit
+
+for srcfile in \
+	pcre2.h \
+	pcre2_internal.h \
+	pcre2_intmodedep.h \
+	pcre2_ucp.h \
+	pcre2_auto_possess.c \
+	pcre2_chartables.c.dist \
+	pcre2_compile.c \
+	pcre2_config.c \
+	pcre2_context.c \
+	pcre2_error.c \
+	pcre2_find_bracket.c \
+	pcre2_jit_compile.c \
+	pcre2_jit_match.c \
+	pcre2_jit_misc.c \
+	pcre2_maketables.c \
+	pcre2_match.c \
+	pcre2_match_data.c \
+	pcre2_newline.c \
+	pcre2_ord2utf.c \
+	pcre2_string_utils.c \
+	pcre2_study.c \
+	pcre2_tables.c \
+	pcre2_ucd.c \
+	pcre2_valid_utf.c \
+	pcre2_xclass.c
+do
+	if test -z "$srcdir"
+	then
+		svn cat svn://vcs.exim.org/pcre2/code/$version/src/$srcfile >src/$srcfile
+	else
+		cp "$srcdir/src/$srcfile" src/$srcfile
+	fi
+	wc -l src/$srcfile
+done
+
+(cd src && ln -sf pcre2_chartables.c.dist pcre2_chartables.c)
+
+if test -z "$srcdir"
+then
+	for srcfile in $(svn ls svn://vcs.exim.org/pcre2/code/$version/src/sljit)
+	do
+		svn cat svn://vcs.exim.org/pcre2/code/$version/src/sljit/$srcfile >src/sljit/$srcfile
+		wc -l src/sljit/$srcfile
+	done
+else
+	cp -R "$srcdir/src/sljit" src/
+	wc -l src/sljit/*
+fi
-- 
2.11.0



* [PATCH/RFC 2/6] Makefile & compat/pcre2: add dependency on pcre2_convert.c
  2017-05-11 17:51 [PATCH/RFC 0/6] Speed up git-grep by using PCRE v2 as a backend Ævar Arnfjörð Bjarmason
  2017-05-11 17:51 ` [PATCH/RFC 1/6] Makefile & compat/pcre2: add ability to build an embedded PCRE Ævar Arnfjörð Bjarmason
@ 2017-05-11 17:51 ` Ævar Arnfjörð Bjarmason
  2017-05-11 17:51 ` [PATCH/RFC 4/6] test-lib: add LIBPCRE1 & LIBPCRE2 prerequisites Ævar Arnfjörð Bjarmason
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-05-11 17:51 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Jeffrey Walton, Michał Kiedrowicz,
	J Smith, Victor Leschuk, Nguyễn Thái Ngọc Duy,
	Fredrik Kuivinen, Brandon Williams,
	Ævar Arnfjörð Bjarmason

Add a dependency on the experimental pcre2_convert.c. This only exists
in svn trunk of pcre2 currently, and allows for converting POSIX
basic/extended & glob patterns to patterns accepted by PCRE[1][2].

1. https://bugs.exim.org/show_bug.cgi?id=2106
2. https://bugs.exim.org/show_bug.cgi?id=2107

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Makefile                  | 1 +
 compat/pcre2/get-pcre2.sh | 1 +
 2 files changed, 2 insertions(+)

diff --git a/Makefile b/Makefile
index b18867196e..e437fa011c 100644
--- a/Makefile
+++ b/Makefile
@@ -1541,6 +1541,7 @@ endif
 		compat/pcre2/src/pcre2_compile.o \
 		compat/pcre2/src/pcre2_config.o \
 		compat/pcre2/src/pcre2_context.o \
+		compat/pcre2/src/pcre2_convert.o \
 		compat/pcre2/src/pcre2_error.o \
 		compat/pcre2/src/pcre2_find_bracket.o \
 		compat/pcre2/src/pcre2_jit_compile.o \
diff --git a/compat/pcre2/get-pcre2.sh b/compat/pcre2/get-pcre2.sh
index f1796cb518..7679fba8e4 100755
--- a/compat/pcre2/get-pcre2.sh
+++ b/compat/pcre2/get-pcre2.sh
@@ -26,6 +26,7 @@ for srcfile in \
 	pcre2_compile.c \
 	pcre2_config.c \
 	pcre2_context.c \
+	pcre2_convert.c \
 	pcre2_error.c \
 	pcre2_find_bracket.c \
 	pcre2_jit_compile.c \
-- 
2.11.0



* [PATCH/RFC 4/6] test-lib: add LIBPCRE1 & LIBPCRE2 prerequisites
  2017-05-11 17:51 [PATCH/RFC 0/6] Speed up git-grep by using PCRE v2 as a backend Ævar Arnfjörð Bjarmason
  2017-05-11 17:51 ` [PATCH/RFC 1/6] Makefile & compat/pcre2: add ability to build an embedded PCRE Ævar Arnfjörð Bjarmason
  2017-05-11 17:51 ` [PATCH/RFC 2/6] Makefile & compat/pcre2: add dependency on pcre2_convert.c Ævar Arnfjörð Bjarmason
@ 2017-05-11 17:51 ` Ævar Arnfjörð Bjarmason
  2017-05-11 17:51 ` [PATCH/RFC 5/6] grep: support regex patterns containing \0 via PCRE v2 Ævar Arnfjörð Bjarmason
  2017-05-11 17:51 ` [PATCH/RFC 6/6] grep: use PCRE v2 under the hood for -G & -E for amazing speedup Ævar Arnfjörð Bjarmason
  4 siblings, 0 replies; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-05-11 17:51 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Jeffrey Walton, Michał Kiedrowicz,
	J Smith, Victor Leschuk, Nguyễn Thái Ngọc Duy,
	Fredrik Kuivinen, Brandon Williams,
	Ævar Arnfjörð Bjarmason

Add LIBPCRE1 and LIBPCRE2 prerequisites which are true when git is
compiled with USE_LIBPCRE1=YesPlease or USE_LIBPCRE2=YesPlease,
respectively.

There are various edge cases or version-specific features that need to
be tested for.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/README      | 12 ++++++++++++
 t/test-lib.sh |  2 ++
 2 files changed, 14 insertions(+)

diff --git a/t/README b/t/README
index 2f95860369..1ff612ca65 100644
--- a/t/README
+++ b/t/README
@@ -808,6 +808,18 @@ use these, and "test_set_prereq" for how to define your own.
    Git was compiled with support for PCRE. Wrap any tests
    that use git-grep --perl-regexp or git-grep -P in these.
 
+ - LIBPCRE1
+
+   Git was compiled with PCRE v1 support via
+   USE_LIBPCRE1=YesPlease. Wrap any PCRE using tests that for some
+   reason need v1 of the PCRE library instead of v2 in these.
+
+ - LIBPCRE2
+
+   Git was compiled with PCRE v2 support via
+   USE_LIBPCRE2=YesPlease. Wrap any PCRE using tests that for some
+   reason need v2 of the PCRE library instead of v1 in these.
+
  - CASE_INSENSITIVE_FS
 
    Test is run on a case insensitive file system.
diff --git a/t/test-lib.sh b/t/test-lib.sh
index 44d4679384..13ed81dc16 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1012,6 +1012,8 @@ test -z "$NO_PERL" && test_set_prereq PERL
 test -z "$NO_PTHREADS" && test_set_prereq PTHREADS
 test -z "$NO_PYTHON" && test_set_prereq PYTHON
 test -n "$USE_LIBPCRE1$USE_LIBPCRE2" && test_set_prereq PCRE
+test -n "$USE_LIBPCRE1" && test_set_prereq LIBPCRE1
+test -n "$USE_LIBPCRE2" && test_set_prereq LIBPCRE2
 test -z "$NO_GETTEXT" && test_set_prereq GETTEXT
 
 # Can we rely on git's output in the C locale?
-- 
2.11.0



* [PATCH/RFC 5/6] grep: support regex patterns containing \0 via PCRE v2
  2017-05-11 17:51 [PATCH/RFC 0/6] Speed up git-grep by using PCRE v2 as a backend Ævar Arnfjörð Bjarmason
                   ` (2 preceding siblings ...)
  2017-05-11 17:51 ` [PATCH/RFC 4/6] test-lib: add LIBPCRE1 & LIBPCRE2 prerequisites Ævar Arnfjörð Bjarmason
@ 2017-05-11 17:51 ` Ævar Arnfjörð Bjarmason
  2017-05-11 17:51 ` [PATCH/RFC 6/6] grep: use PCRE v2 under the hood for -G & -E for amazing speedup Ævar Arnfjörð Bjarmason
  4 siblings, 0 replies; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-05-11 17:51 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Jeffrey Walton, Michał Kiedrowicz,
	J Smith, Victor Leschuk, Nguyễn Thái Ngọc Duy,
	Fredrik Kuivinen, Brandon Williams,
	Ævar Arnfjörð Bjarmason

Support regex patterns with embedded \0's. As an earlier commit[1]
notes, this was previously impossible due to an internal limitation.

Before this change any regex metacharacters in patterns containing \0
were silently ignored and the pattern matched as if it were a
--fixed-strings pattern.

Now these patterns will be matched with PCRE instead, which supports
combining regex metacharacters with patterns containing \0.

A side-effect of this change is that these patterns which previously
would be considered --fixed-strings patterns regardless of the engine
requested now all implicitly become --perl-regexp instead.
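
The wrapping done in the grep.c hunk of this patch can be sketched
stand-alone as follows. This is only an illustration under stated
assumptions: it uses plain caller-supplied buffers instead of git's
strbuf, and contains_nul() and wrap_for_pcre() are hypothetical names,
not functions in grep.c. The point it shows is that the pattern length
has to be carried explicitly, since strlen() stops at the first \0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the first len bytes of s contain a NUL. */
static int contains_nul(const char *s, size_t len)
{
	return memchr(s, '\0', len) != NULL;
}

/*
 * Hypothetical helper: write "(?i)" (if icase), "\Q" (if fixed), the
 * raw pattern bytes, and "\E" (if fixed) into out. The output is not
 * NUL-terminated; its length is the return value.
 *
 * Returns the number of bytes written, or (size_t)-1 if out is too
 * small to hold the result.
 */
static size_t wrap_for_pcre(char *out, size_t cap,
			    const char *pat, size_t patlen,
			    int icase, int fixed)
{
	size_t n = 0;
	size_t need = patlen + (icase ? 4 : 0) + (fixed ? 4 : 0);

	assert(out != NULL && pat != NULL);
	if (need > cap)
		return (size_t)-1;

	if (icase) {
		memcpy(out + n, "(?i)", 4);
		n += 4;
	}
	if (fixed) {
		memcpy(out + n, "\\Q", 2);
		n += 2;
	}
	/* memcpy() rather than strcpy(): the pattern may contain \0. */
	memcpy(out + n, pat, patlen);
	n += patlen;
	if (fixed) {
		memcpy(out + n, "\\E", 2);
		n += 2;
	}
	return n;
}
```

For the 5-byte pattern y\0[f] with icase set, this produces the 9
bytes (?i)y\0[f]. One caveat the sketch does not handle: a literal \E
inside a fixed pattern would end the \Q quoting early.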

A subsequent change introduces a POSIX to PCRE syntax converter, which
could be used to stay 100% truthful to our documentation by using
POSIX basic syntax (which we haven't been for quite some time with
kwset).

But there's a chicken & egg issue: this change is easier to implement
stand-alone first, and the subsequent change depends on an SVN trunk
version of PCRE. Most importantly, I don't think anyone will mind this
change, so I'm leaving it as it is.

This implementation is faster than the previous kwset implementation,
but I haven't bothered to come up with a \0-specific fixed-string
performance test.

See the next change in this series, which optionally expands the use
of PCRE v2 to all fixed-string patterns. The performance tests for
those will be applicable to these patterns as well, since PCRE matches
\0 like any other character.

1. ("grep: factor test for \0 in grep patterns into a function",
   2017-05-08)

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c                 | 24 ++++++++++++++++
 t/t7008-grep-binary.sh | 74 ++++++++++++++++++++++++++++++++++----------------
 2 files changed, 75 insertions(+), 23 deletions(-)

diff --git a/grep.c b/grep.c
index 2ff4e253ff..5db614cf80 100644
--- a/grep.c
+++ b/grep.c
@@ -613,6 +613,30 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	icase	       = opt->regflags & REG_ICASE || p->ignore_case;
 	ascii_only     = !has_non_ascii(p->pattern);
 
+#ifdef USE_LIBPCRE2
+	if (has_null(p->pattern, p->patternlen)) {
+		struct strbuf sb = STRBUF_INIT;
+		if (icase)
+			strbuf_add(&sb, "(?i)", 4);
+		if (opt->fixed)
+			strbuf_add(&sb, "\\Q", 2);		
+		strbuf_add(&sb, p->pattern, p->patternlen);
+		if (opt->fixed)
+			strbuf_add(&sb, "\\E", 2);
+
+		p->pattern = sb.buf;
+		p->patternlen = sb.len;
+
+		/* FIXME: Check in compile_pcre2_pattern() that we're
+		 * using basic rx using !opt->pcre2 && <something>
+		 */
+		opt->pcre2 = 1;
+
+		compile_pcre2_pattern(p, opt);
+		return;
+	}
+#endif
+
 	/*
 	 * Even when -F (fixed) asks us to do a non-regexp search, we
 	 * may not be able to correctly case-fold when -i
diff --git a/t/t7008-grep-binary.sh b/t/t7008-grep-binary.sh
index ba3db06501..fc86ed5fce 100755
--- a/t/t7008-grep-binary.sh
+++ b/t/t7008-grep-binary.sh
@@ -124,35 +124,63 @@ nul_match 0 '-F' '[æ]Qð'
 nul_match 0 '-Fi' 'ÆQ[Ð]'
 nul_match 0 '-Fi' '[Æ]QÐ'
 
-# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0
-# patterns case-insensitively.
-nul_match T1 '-i' 'ÆQÐ'
-
-# \0 implicitly disables regexes. This is an undocumented internal
-# limitation.
-nul_match T1 '' 'yQ[f]'
-nul_match T1 '' '[y]Qf'
-nul_match T1 '-i' 'YQ[F]'
-nul_match T1 '-i' '[Y]Qf'
-nul_match T1 '' 'æQ[ð]'
-nul_match T1 '' '[æ]Qð'
-nul_match T1 '-i' 'ÆQ[Ð]'
-
-# ... because of \0 implicitly disabling regexes regexes that
-# should/shouldn't match don't do the right thing.
-nul_match T1 '' 'eQm.*cQ'
-nul_match T1 '-i' 'EQM.*cQ'
-nul_match T0 '' 'eQm[*]c'
-nul_match T0 '-i' 'EQM[*]C'
+if test_have_prereq LIBPCRE2
+then
+	# Regex patterns that should match without -F
+	nul_match 1 '' 'yQ[f]'
+	nul_match 1 '' '[y]Qf'
+	nul_match 1 '-i' 'YQ[F]'
+	nul_match 1 '-i' '[Y]Qf'
+	nul_match 1 '' 'æQ[ð]'
+	nul_match 1 '' '[æ]Qð'
+	nul_match 0 '-i' '[Æ]Qð'
+	nul_match 1 '' 'eQm.*cQ'
+	nul_match 1 '-i' 'EQM.*cQ'
+	nul_match 0 '' 'eQm[*]c'
+	nul_match 0 '-i' 'EQM[*]C'
+
+	# These should also match, but don't due to some heisenbug,
+	# they succeed when run manually!
+	nul_match T1 '-i' 'ÆQÐ'
+	nul_match T1 '-i' 'ÆQ[Ð]'
+else
+	# \0 implicitly disables regexes. This is an undocumented
+	# internal limitation.
+	nul_match T1 '' 'yQ[f]'
+	nul_match T1 '' '[y]Qf'
+	nul_match T1 '-i' 'YQ[F]'
+	nul_match T1 '-i' '[Y]Qf'
+	nul_match T1 '' 'æQ[ð]'
+	nul_match T1 '' '[æ]Qð'
+	nul_match T1 '-i' 'ÆQ[Ð]'
+
+	# ... because of \0 implicitly disabling regexes regexes that
+	# should/shouldn't match don't do the right thing.
+	nul_match T1 '' 'eQm.*cQ'
+	nul_match T1 '-i' 'EQM.*cQ'
+	nul_match T0 '' 'eQm[*]c'
+	nul_match T0 '-i' 'EQM[*]C'
+fi
 
 # Due to the REG_STARTEND extension when kwset() is disabled on -i &
 # non-ASCII the string will be matched in its entirety, but the
 # pattern will be cut off at the first \0.
 nul_match 0 '-i' 'NOMATCHQð'
-nul_match T0 '-i' '[Æ]QNOMATCH'
-nul_match T0 '-i' '[æ]QNOMATCH'
+if test_have_prereq LIBPCRE2
+then
+	nul_match 0 '-i' '[Æ]QNOMATCH'
+	nul_match 0 '-i' '[æ]QNOMATCH'
+else
+	nul_match T0 '-i' '[Æ]QNOMATCH'
+	nul_match T0 '-i' '[æ]QNOMATCH'
+fi
 # Matches, but for the wrong reasons, just stops at [æ]
-nul_match 1 '-i' '[Æ]Qð'
+if test_have_prereq LIBPCRE2
+then
+	nul_match T1 '-i' '[Æ]Qð'
+else
+	nul_match 1 '-i' '[Æ]Qð'
+fi
 nul_match 1 '-i' '[æ]Qð'
 
 # Ensure that the matcher doesn't regress to something that stops at
-- 
2.11.0



* [PATCH/RFC 6/6] grep: use PCRE v2 under the hood for -G & -E for amazing speedup
  2017-05-11 17:51 [PATCH/RFC 0/6] Speed up git-grep by using PCRE v2 as a backend Ævar Arnfjörð Bjarmason
                   ` (3 preceding siblings ...)
  2017-05-11 17:51 ` [PATCH/RFC 5/6] grep: support regex patterns containing \0 via PCRE v2 Ævar Arnfjörð Bjarmason
@ 2017-05-11 17:51 ` Ævar Arnfjörð Bjarmason
  4 siblings, 0 replies; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-05-11 17:51 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Jeffrey Walton, Michał Kiedrowicz,
	J Smith, Victor Leschuk, Nguyễn Thái Ngọc Duy,
	Fredrik Kuivinen, Brandon Williams,
	Ævar Arnfjörð Bjarmason

Change the underlying engine powering POSIX basic & extended patterns
to be PCRE v2 under the hood.

This relies on an experimental SVN-trunk only PCRE v2 API which Philip
Hazel (the PCRE maintainer) wrote up in response to a feature request
I filed[1].

This allows us to use pcre2_pattern_convert() to power all grep regex
matches by converting the POSIX patterns into PCRE syntax before
compiling them.
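
To give a feel for what such a conversion involves, here is a toy
converter in stand-alone C. It is emphatically not
pcre2_pattern_convert() and not how PCRE implements it; bre_to_pcre()
and is_swapped() are hypothetical names, and the sketch only handles
the one difference that the basic and extended forms of the same
pattern in the benchmarks below illustrate: in (GNU-style) basic
syntax \( \) \{ \} \| \+ \? are operators and the bare characters are
literals, while in PCRE it is the other way around. Bracket
expressions are copied verbatim, and anchors, back-references and
backslashes inside brackets are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Characters whose escaped/unescaped meaning is swapped between BRE and PCRE. */
static int is_swapped(char c)
{
	return memchr("(){}|+?", c, 7) != NULL;
}

/*
 * Toy conversion of a basic regex of inlen bytes into PCRE syntax.
 * Requires cap >= 2 * inlen, since every byte may gain a backslash.
 * Returns the number of bytes written, or (size_t)-1 if cap is too
 * small. The output is not NUL-terminated.
 */
static size_t bre_to_pcre(const char *in, size_t inlen, char *out, size_t cap)
{
	size_t i = 0, n = 0;

	assert(in != NULL && out != NULL);
	if (cap < 2 * inlen)
		return (size_t)-1;

	while (i < inlen) {
		char c = in[i++];

		if (c == '[') {
			/* Copy a bracket expression up to its closing ']'. */
			out[n++] = c;
			if (i < inlen && in[i] == '^')
				out[n++] = in[i++];
			if (i < inlen && in[i] == ']')
				out[n++] = in[i++];	/* leading ']' is literal */
			while (i < inlen && in[i] != ']')
				out[n++] = in[i++];
			if (i < inlen)
				out[n++] = in[i++];	/* the closing ']' */
		} else if (c == '\\' && i < inlen) {
			char next = in[i++];

			if (!is_swapped(next))
				out[n++] = '\\';	/* other escapes stay as-is */
			out[n++] = next;		/* "\(" becomes "(" etc. */
		} else {
			if (is_swapped(c))
				out[n++] = '\\';	/* literal in BRE, operator in PCRE */
			out[n++] = c;
		}
	}
	return n;
}
```

Fed the basic pattern \(e.t[^ ]*\|v.ry\), this emits the extended
spelling (e.t[^ ]*|v.ry). Lengths are passed explicitly so that
patterns containing \0 pass through untouched.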

Due to PCRE generally being faster than POSIX, but most importantly
due to its JIT feature (where available) this speeds up grep by
a *lot*.

The improvements to the "perl" tests are already a part of this
series, but all the other benchmarks show improvements made by this
change alone:

    $ GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux [see note #2 for the GIT_PERF_MAKE_COMMAND ...]
    [...]
    Test                                           v2.13.0             HEAD
    -----------------------------------------------------------------------------------------
    7810.1: grep worktree, cheap regex             0.19(0.39+0.52)     0.19(0.26+0.56) +0.0%
    7810.2: grep worktree, expensive regex         5.11(29.75+0.33)    1.07(5.36+0.34) -79.1%
    7810.3: grep --cached, cheap regex             2.91(2.77+0.12)     2.85(2.78+0.06) -2.1%
    7810.4: grep --cached, expensive regex         21.00(20.89+0.08)   6.25(6.18+0.06) -70.2%
    7820.1: basic grep how.to                      0.32(1.20+0.43)     0.19(0.26+0.56) -40.6%
    7820.2: extended grep how.to                   0.32(1.12+0.51)     0.19(0.22+0.60) -40.6%
    7820.3: perl grep how.to                       0.31(1.11+0.45)     0.19(0.30+0.54) -38.7%
    7820.5: basic grep ^how to                     0.31(1.09+0.51)     0.19(0.24+0.57) -38.7%
    7820.6: extended grep ^how to                  0.31(1.14+0.46)     0.19(0.29+0.52) -38.7%
    7820.7: perl grep ^how to                      0.57(2.63+0.38)     0.19(0.25+0.56) -66.7%
    7820.9: basic grep [how] to                    0.49(2.19+0.36)     0.22(0.36+0.54) -55.1%
    7820.10: extended grep [how] to                0.49(2.16+0.41)     0.22(0.41+0.50) -55.1%
    7820.11: perl grep [how] to                    0.57(2.55+0.40)     0.22(0.35+0.55) -61.4%
    7820.13: basic grep \(e.t[^ ]*\|v.ry\) rare    0.65(3.18+0.38)     0.22(0.44+0.52) -66.2%
    7820.14: extended grep (e.t[^ ]*|v.ry) rare    0.65(3.17+0.40)     0.21(0.47+0.52) -67.7%
    7820.15: perl grep (e.t[^ ]*|v.ry) rare        1.05(5.64+0.34)     0.22(0.46+0.53) -79.0%
    7820.17: basic grep m\(ú\|u\)ult.b\(æ\|y\)te   0.33(1.33+0.38)     0.19(0.31+0.51) -42.4%
    7820.18: extended grep m(ú|u)ult.b(æ|y)te      0.33(1.27+0.44)     0.19(0.32+0.50) -42.4%
    7820.19: perl grep m(ú|u)ult.b(æ|y)te          0.37(1.58+0.40)     0.19(0.30+0.53) -48.6%
    7821.1: fixed grep int                         0.53(1.70+0.60)     0.43(1.13+0.66) -18.9%
    7821.2: basic grep int                         0.55(1.62+0.59)     0.46(1.08+0.64) -16.4%
    7821.3: extended grep int                      0.54(1.65+0.59)     0.45(1.17+0.56) -16.7%
    7821.4: perl grep int                          0.54(1.63+0.62)     0.46(1.12+0.60) -14.8%
    7821.6: fixed grep -i int                      0.58(1.93+0.54)     0.48(1.40+0.52) -17.2%
    7821.7: basic grep -i int                      0.83(1.91+0.60)     0.57(1.23+0.67) -31.3%
    7821.8: extended grep -i int                   0.59(1.80+0.66)     0.48(1.33+0.59) -18.6%
    7821.9: perl grep -i int                       0.61(1.91+0.56)     0.52(1.28+0.63) -14.8%
    7821.11: fixed grep æ                          0.34(1.25+0.45)     0.19(0.29+0.51) -44.1%
    7821.12: basic grep æ                          0.34(1.26+0.43)     0.19(0.28+0.53) -44.1%
    7821.13: extended grep æ                       0.34(1.22+0.48)     0.19(0.29+0.53) -44.1%
    7821.14: perl grep æ                           0.34(1.30+0.41)     0.19(0.26+0.57) -44.1%
    7821.16: fixed grep -i æ                       0.27(0.88+0.46)     0.19(0.30+0.51) -29.6%
    7821.17: basic grep -i æ                       0.27(0.88+0.44)     0.19(0.27+0.54) -29.6%
    7821.18: extended grep -i æ                    0.27(0.90+0.42)     0.19(0.22+0.59) -29.6%
    7821.19: perl grep -i æ                        0.25(0.74+0.51)     0.18(0.27+0.58) -28.0%
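
For anyone reading the table cold: each timing cell looks like
elapsed(user+sys) in seconds, and the last column appears to be the
relative change in elapsed time between the two revisions. A minimal
sketch of that arithmetic follows; percent_change() is a made-up name
for illustration and is not part of the t/perf framework.

```c
#include <assert.h>

/*
 * Relative change in percent between two timings, e.g. the elapsed
 * (wall-clock) seconds of the old and the new revision. A negative
 * result means the new revision is faster.
 *
 * Hypothetical helper for illustration only, not part of t/perf.
 */
double percent_change(double old_secs, double new_secs)
{
	assert(old_secs > 0.0);
	return (new_secs - old_secs) / old_secs * 100.0;
}
```

With the 7820.1 row, percent_change(0.32, 0.19) comes out at roughly
-40.6, which agrees with the figure printed in the table.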

Caveats & other things to mention:

 * This will expose PCRE v2 (as opposed to C library reg(comp|exec)) to
   the network via gitweb in its default configuration. See
   <CACBZZX6V8qbnrZAdhRvPthy5Z91iEG8rrJ=Sf9tdkOt52M9j1Q@mail.gmail.com>
   for a discussion of security & other caveats related to that.

 * I'm checking for PCRE2_CONVERT_POSIX_BASIC to enable this, but the
   experimental API of pcre2_pattern_convert() may change before it
   makes it into a release.

   If we think this patch is awesome enough to get into a git release
   regardless, it should be guarded by some other method so we don't
   rudely tie upstream PCRE to this API lest they break git versions
   in the wild.

 * One way to do that would be to guard this via the
   USE_LIBPCRE2_BUNDLED flag, but see the above E-Mail thread for
   concerns about shipping an embedded PCRE, and for ways that could
   be made OK.

 * We could ship some copy of just the logic in
   pcre2_pattern_convert() & use the system PCRE instead. I haven't
   tried splitting it off from the PCRE codebase, and don't know how
   hard that would be.

 * There are outstanding bugs in the pcre2_pattern_convert()
   function. Grepping with -G and -E for each ASCII character from
   1..127, both as "$char" and as "\\$char", produces numerous
   differences. These are mostly obscure cases; I'm working out fixes
   for them with Philip.
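
The pattern set from the last point can be enumerated roughly as
below. This is only a sketch using POSIX regcomp(): make_pattern() and
posix_accepts() are made-up names, and the PCRE half of the comparison
is left out because pcre2_pattern_convert() is still experimental.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Write pattern number `idx` (0..253) into `buf` as a NUL-terminated
 * string. Even indexes give the bare character, odd indexes the same
 * character with a backslash in front, covering ASCII 1..127.
 * Returns the pattern length, or 0 if `idx` is out of range.
 *
 * Hypothetical helper, not part of git.
 */
size_t make_pattern(int idx, char buf[3])
{
	int ch = idx / 2 + 1;
	size_t len = 0;

	if (idx < 0 || ch > 127)
		return 0;
	if (idx % 2)
		buf[len++] = '\\';
	buf[len++] = (char)ch;
	buf[len] = '\0';
	return len;
}

/*
 * Does the C library accept `pattern` as a POSIX regex? Pass
 * REG_EXTENDED in `cflags` for -E semantics, or 0 for -G.
 *
 * Hypothetical helper, not part of git.
 */
int posix_accepts(const char *pattern, int cflags)
{
	regex_t re;

	if (regcomp(&re, pattern, cflags | REG_NOSUB))
		return 0;
	regfree(&re);
	return 1;
}
```

Running all 254 patterns through posix_accepts() in both modes, and
comparing the outcome with what compiling the converted pattern gives,
would be one way to surface the differences. Which patterns regcomp()
rejects depends on the C library, so no expected results are listed.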

1. https://bugs.exim.org/show_bug.cgi?id=2106

2. GIT_PERF_MAKE_COMMAND='grep -q LIBPCRE2 Makefile && make -j8 USE_LIBPCRE2=YesPlease USE_LIBPCRE2_BUNDLED=Y CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 || make -j8 USE_LIBPCRE=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre/inst/lib' ./run v2.13.0 HEAD -- p*grep*

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c                 | 97 +++++++++++++++++++++++++++++++++++---------------
 grep.h                 |  5 +++
 t/README               |  6 ++++
 t/t7008-grep-binary.sh | 13 +++++--
 t/test-lib.sh          |  1 +
 5 files changed, 90 insertions(+), 32 deletions(-)

diff --git a/grep.c b/grep.c
index 5db614cf80..0f6ee709c5 100644
--- a/grep.c
+++ b/grep.c
@@ -472,8 +472,48 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	const uint8_t *character_tables = NULL;
 	uint32_t canjit;
 	int jitret;
+	int icase = opt->regflags & REG_ICASE || p->ignore_case;
+	PCRE2_SPTR pattern = (PCRE2_SPTR)p->pattern;
+	PCRE2_SIZE length = p->patternlen;
+	int copied_pattern = 0;
+	struct strbuf pattern_sb = STRBUF_INIT;
+#ifdef PCRE2_CONVERT_POSIX_BASIC
+	int convret;
+	PCRE2_UCHAR *convpatbuf = NULL;
+	PCRE2_SIZE convpatlen;
+	int converted_pattern = 0;
+#endif
 
-	assert(opt->pcre2);
+	if (opt->fixed || has_null(p->pattern, p->patternlen) || is_fixed(p->pattern, p->patternlen)) {
+		if (icase)
+			strbuf_add(&pattern_sb, "(?i)", 4);
+		if (opt->fixed)
+			strbuf_add(&pattern_sb, "\\Q", 2);
+		strbuf_add(&pattern_sb, p->pattern, p->patternlen);
+		if (opt->fixed)
+			strbuf_add(&pattern_sb, "\\E", 2);
+
+		pattern = (PCRE2_SPTR)pattern_sb.buf;
+		length = pattern_sb.len;
+		copied_pattern = 1;
+	} else if (opt->pcre2_posix_emulation) {
+#ifdef PCRE2_CONVERT_POSIX_BASIC
+		convret = pcre2_pattern_convert(pattern, length,
+					       (opt->regflags & REG_EXTENDED
+						? PCRE2_CONVERT_POSIX_EXTENDED
+						: PCRE2_CONVERT_POSIX_BASIC),
+					       &convpatbuf, &convpatlen, NULL);
+		if (convret != 0) {
+			fprintf(stderr, "oh noes\n");
+			pcre2_get_error_message(convret, errbuf, sizeof(errbuf));
+			compile_regexp_failed(p, (const char *)&errbuf);
+		}
+		pattern = convpatbuf;
+		length = convpatlen;
+		converted_pattern = 1;
+#endif
+	} else
+		assert(opt->pcre2);
 
 	p->pcre2_compile_context = NULL;
 
@@ -488,11 +528,16 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	if (is_utf8_locale() && has_non_ascii(p->pattern))
 		options |= PCRE2_UTF;
 
-	p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
-					 p->patternlen, options, &error, &erroffset,
-					 p->pcre2_compile_context);
+	p->pcre2_pattern = pcre2_compile(pattern, length, options, &error,
+					 &erroffset, p->pcre2_compile_context);
 
 	if (p->pcre2_pattern) {
+		if (copied_pattern)
+			strbuf_release(&pattern_sb);
+#ifdef PCRE2_CONVERT_POSIX_BASIC
+		if (converted_pattern)
+			pcre2_converted_pattern_free(convpatbuf);
+#endif
 		p->pcre2_match_data = pcre2_match_data_create_from_pattern(p->pcre2_pattern, NULL);
 		if (!p->pcre2_match_data)
 			die("BUG: Couldn't allocate PCRE2 match data");
@@ -580,7 +625,6 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
 static void free_pcre2_pattern(struct grep_pat *p)
 {
 }
-#endif /* !USE_LIBPCRE2 */
 
 static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
@@ -602,41 +646,21 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 		compile_regexp_failed(p, errbuf);
 	}
 }
+#endif /* !USE_LIBPCRE2 */
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
+#ifndef USE_LIBPCRE2
 	int icase, ascii_only;
+#endif
 	int err;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
+#ifndef USE_LIBPCRE2
 	icase	       = opt->regflags & REG_ICASE || p->ignore_case;
 	ascii_only     = !has_non_ascii(p->pattern);
 
-#ifdef USE_LIBPCRE2
-	if (has_null(p->pattern, p->patternlen)) {
-		struct strbuf sb = STRBUF_INIT;
-		if (icase)
-			strbuf_add(&sb, "(?i)", 4);
-		if (opt->fixed)
-			strbuf_add(&sb, "\\Q", 2);		
-		strbuf_add(&sb, p->pattern, p->patternlen);
-		if (opt->fixed)
-			strbuf_add(&sb, "\\E", 2);
-
-		p->pattern = sb.buf;
-		p->patternlen = sb.len;
-
-		/* FIXME: Check in compile_pcre2_pattern() that we're
-		 * using basic rx using !opt->pcre2 && <something>
-		 */
-		opt->pcre2 = 1;
-
-		compile_pcre2_pattern(p, opt);
-		return;
-	}
-#endif
-
 	/*
 	 * Even when -F (fixed) asks us to do a non-regexp search, we
 	 * may not be able to correctly case-fold when -i
@@ -668,12 +692,26 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		compile_fixed_regexp(p, opt);
 		return;
 	}
+#endif
 
 	if (opt->pcre2) {
 		compile_pcre2_pattern(p, opt);
 		return;
 	}
 
+#ifdef USE_LIBPCRE2
+	if (opt->fixed || has_null(p->pattern, p->patternlen) || is_fixed(p->pattern, p->patternlen)) {
+		compile_pcre2_pattern(p, opt);
+		return;
+	}
+
+#ifdef PCRE2_CONVERT_POSIX_BASIC
+	opt->pcre2_posix_emulation = 1;
+	compile_pcre2_pattern(p, opt);
+	return;
+#endif
+#endif
+
 	if (opt->pcre1) {
 		compile_pcre1_regexp(p, opt);
 		return;
@@ -686,6 +724,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		regfree(&p->regexp);
 		compile_regexp_failed(p, errbuf);
 	}
+	return;
 }
 
 static struct grep_expr *compile_pattern_or(struct grep_pat **);
diff --git a/grep.h b/grep.h
index b40afc2e2f..39897489e4 100644
--- a/grep.h
+++ b/grep.h
@@ -29,6 +29,9 @@ typedef int pcre2_compile_context;
 typedef int pcre2_match_context;
 typedef int pcre2_jit_stack;
 #endif
+#ifndef PCRE2_CONVERT_POSIX_EXTENDED
+typedef int pcre2_convert_context;
+#endif
 #include "kwset.h"
 #include "thread-utils.h"
 #include "userdiff.h"
@@ -73,6 +76,7 @@ struct grep_pat {
 	pcre_jit_stack *pcre1_jit_stack;
 	const unsigned char *pcre1_tables;
 	int pcre1_jit_on;
+	pcre2_convert_context *pcre2_convert_context;
 	pcre2_code *pcre2_pattern;
 	pcre2_match_data *pcre2_match_data;
 	pcre2_compile_context *pcre2_compile_context;
@@ -143,6 +147,7 @@ struct grep_opt {
 	int use_reflog_filter;
 	int pcre1;
 	int pcre2;
+	int pcre2_posix_emulation;
 	int relative;
 	int pathname;
 	int null_following_name;
diff --git a/t/README b/t/README
index 1ff612ca65..0dbf5373a2 100644
--- a/t/README
+++ b/t/README
@@ -820,6 +820,12 @@ use these, and "test_set_prereq" for how to define your own.
    USE_LIBPCRE2=YesPlease. Wrap any PCRE using tests that for some
    reason need v2 of the PCRE library instead of v1 in these.
 
+ - LIBPCRE2_BUNDLED
+
+   Git was compiled with the bundled PCRE v2 support via
+   USE_LIBPCRE2=YesPlease &
+   USE_LIBPCRE2_BUNDLED=IWantPatternConvertAwesomeSauce.
+
  - CASE_INSENSITIVE_FS
 
    Test is run on a case insensitive file system.
diff --git a/t/t7008-grep-binary.sh b/t/t7008-grep-binary.sh
index fc86ed5fce..d9de5c986c 100755
--- a/t/t7008-grep-binary.sh
+++ b/t/t7008-grep-binary.sh
@@ -100,9 +100,16 @@ test_expect_success 'git grep ile a' '
 	git grep ile a
 '
 
-test_expect_failure 'git grep .fi a' '
-	git grep .fi a
-'
+if test_have_prereq LIBPCRE2_BUNDLED
+then
+	test_expect_success 'git grep .fi a' '
+		git grep .fi a
+	'
+else
+	test_expect_failure 'git grep .fi a' '
+		git grep .fi a
+	'
+fi
 
 nul_match 1 '-F' 'yQf'
 nul_match 0 '-F' 'yQx'
diff --git a/t/test-lib.sh b/t/test-lib.sh
index 13ed81dc16..71760e01a0 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1014,6 +1014,7 @@ test -z "$NO_PYTHON" && test_set_prereq PYTHON
 test -n "$USE_LIBPCRE1$USE_LIBPCRE2" && test_set_prereq PCRE
 test -n "$USE_LIBPCRE1" && test_set_prereq LIBPCRE1
 test -n "$USE_LIBPCRE2" && test_set_prereq LIBPCRE2
+test -n "$USE_LIBPCRE2_BUNDLED" && test_set_prereq LIBPCRE2_BUNDLED
 test -z "$NO_GETTEXT" && test_set_prereq GETTEXT
 
 # Can we rely on git's output in the C locale?
-- 
2.11.0

