git@vger.kernel.org list mirror (unofficial, one of many)
* [PATCH 0/4] Support building with GCC v8.x/v9.x
@ 2019-06-13 11:49 Johannes Schindelin via GitGitGadget
  2019-06-13 11:49 ` [PATCH 1/4] poll (mingw): allow compiling with GCC 8 and DEVELOPER=1 Johannes Schindelin via GitGitGadget
                   ` (3 more replies)
  0 siblings, 4 replies; 90+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2019-06-13 11:49 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

I noticed a while ago that I could not build Git's master in Git for
Windows' SDK when using GCC v8.x. This became a much less pressing problem
when I discovered a serious bug that would not let us compile with ASLR/DEP
enabled (the resulting executables would just throw segmentation faults left
and right), so I put these patches on the backburner for a while.

But now GCC v8.x is fixed, and MSYS2 even switched to GCC v9.x (which found
yet another problem that I fix in patch 4/4 of this patch series). So it is
time to get Git's master ready.

Johannes Schindelin (4):
  poll (mingw): allow compiling with GCC 8 and DEVELOPER=1
  kwset: allow building with GCC 8
  winansi: simplify loading the GetCurrentConsoleFontEx() function
  config: avoid calling `labs()` on too-large data type

 compat/poll/poll.c |  2 +-
 compat/winansi.c   | 14 +++++---------
 config.c           |  4 ++--
 kwset.c            |  8 +++++++-
 4 files changed, 15 insertions(+), 13 deletions(-)


base-commit: b697d92f56511e804b8ba20ccbe7bdc85dc66810
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-265%2Fdscho%2Fgcc-8-and-9-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-265/dscho/gcc-8-and-9-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/265
-- 
gitgitgadget


* [PATCH 1/4] poll (mingw): allow compiling with GCC 8 and DEVELOPER=1
  2019-06-13 11:49 [PATCH 0/4] Support building with GCC v8.x/v9.x Johannes Schindelin via GitGitGadget
@ 2019-06-13 11:49 ` Johannes Schindelin via GitGitGadget
  2019-06-13 11:49 ` [PATCH 2/4] kwset: allow building with GCC 8 Johannes Schindelin via GitGitGadget
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 90+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2019-06-13 11:49 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

The return type of the `GetProcAddress()` function is `FARPROC`, which
evaluates to `long long int (*)()`. GCC 8 complains when that is cast
directly to the correct function signature.

To work around that, we first cast to `void (*)(void)` and go on with
our merry lives.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 compat/poll/poll.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/compat/poll/poll.c b/compat/poll/poll.c
index 4459408c7d..8b07edb0fe 100644
--- a/compat/poll/poll.c
+++ b/compat/poll/poll.c
@@ -149,7 +149,7 @@ win32_compute_revents (HANDLE h, int *p_sought)
     case FILE_TYPE_PIPE:
       if (!once_only)
 	{
-	  NtQueryInformationFile = (PNtQueryInformationFile)
+	  NtQueryInformationFile = (PNtQueryInformationFile)(void (*)(void))
 	    GetProcAddress (GetModuleHandle ("ntdll.dll"),
 			    "NtQueryInformationFile");
 	  once_only = TRUE;
-- 
gitgitgadget
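
The cast chain in this patch can be illustrated without the Windows API.
The following is a minimal sketch, not code from the series: `generic_proc`
stands in for `FARPROC`, `lookup_proc()` stands in for `GetProcAddress()`,
and `add_fn`/`add()` are hypothetical names made up for the illustration.
It relies on GCC treating `void (*)(void)` as a function type that matches
everything for the purposes of `-Wcast-function-type`.

```c
/* Stand-in for FARPROC: a generic function-pointer type (hypothetical). */
typedef long long (*generic_proc)(void);

/* The signature we actually want to call (hypothetical). */
typedef int (*add_fn)(int, int);

static int add(int a, int b)
{
	return a + b;
}

/* Hypothetical stand-in for GetProcAddress(): returns a generic pointer. */
static generic_proc lookup_proc(void)
{
	return (generic_proc)(void (*)(void))add;
}

/*
 * Convert through void (*)(void) first, then to the real signature.
 * The pointer ends up with its original type again before it is called.
 */
static add_fn resolve_add(void)
{
	return (add_fn)(void (*)(void))lookup_proc();
}
```

Calling `resolve_add()(2, 3)` invokes `add()` through the converted
pointer; the intermediate cast only changes what the compiler checks, not
which function is called.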



* [PATCH 2/4] kwset: allow building with GCC 8
  2019-06-13 11:49 [PATCH 0/4] Support building with GCC v8.x/v9.x Johannes Schindelin via GitGitGadget
  2019-06-13 11:49 ` [PATCH 1/4] poll (mingw): allow compiling with GCC 8 and DEVELOPER=1 Johannes Schindelin via GitGitGadget
@ 2019-06-13 11:49 ` Johannes Schindelin via GitGitGadget
  2019-06-13 16:11   ` Junio C Hamano
                     ` (3 more replies)
  2019-06-13 11:49 ` [PATCH 3/4] winansi: simplify loading the GetCurrentConsoleFontEx() function Johannes Schindelin via GitGitGadget
  2019-06-13 11:49 ` [PATCH 4/4] config: avoid calling `labs()` on too-large data type Johannes Schindelin via GitGitGadget
  3 siblings, 4 replies; 90+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2019-06-13 11:49 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

The kwset functionality makes use of the obstack code, which expects to
be handed a function that can allocate large chunks of data. It expects
that function to accept a `size` parameter of type `long`.

This upsets GCC 8 on Windows, because `long` does not have the same
bit size as `size_t` there.

Now, the proper thing to do would be to switch to `size_t`. But this
would make us deviate from the "upstream" code even further, making it
hard to synchronize with newer versions, and also it would be quite
involved because that `long` type is so invasive in that code.

Let's punt, and instead provide a super small wrapper around
`xmalloc()`.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 kwset.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kwset.c b/kwset.c
index 4fb6455aca..efc2ff41bc 100644
--- a/kwset.c
+++ b/kwset.c
@@ -38,7 +38,13 @@
 #include "compat/obstack.h"
 
 #define NCHAR (UCHAR_MAX + 1)
-#define obstack_chunk_alloc xmalloc
+/* adapter for `xmalloc()`, which takes `size_t`, not `long` */
+static void *obstack_chunk_alloc(long size)
+{
+	if (size < 0)
+		BUG("Cannot allocate a negative amount: %ld", size);
+	return xmalloc(size);
+}
 #define obstack_chunk_free free
 
 #define U(c) ((unsigned char) (c))
-- 
gitgitgadget
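
For readers unfamiliar with the mismatch: on 64-bit Windows `long` is 32
bits wide while `size_t` is 64 bits, so a function taking `size_t` does not
have the type `void *(*)(long)`. The sketch below shows the adapter idea in
isolation. It is an illustration under assumptions, not the patch itself:
`chunk_alloc_fn` and `allocate_chunk()` are hypothetical, plain `malloc()`
replaces `xmalloc()`, and a `NULL` return replaces the `BUG()` call so the
example stands alone.

```c
#include <stdlib.h>

/*
 * Hypothetical callback type modelled on the old obstack interface:
 * the chunk allocator receives its size as a `long`.
 */
typedef void *(*chunk_alloc_fn)(long);

/* Adapter: validate the signed size, then widen it to size_t for malloc(). */
static void *chunk_alloc_adapter(long size)
{
	if (size < 0)
		return NULL; /* the patch calls BUG() here instead */
	return malloc((size_t)size);
}

/* Hypothetical caller that only knows about the `long`-based callback. */
static void *allocate_chunk(chunk_alloc_fn alloc, long size)
{
	return alloc(size);
}
```

Because the adapter has exactly the signature the caller expects, no
function-pointer cast is needed at all, and the negative-size check is made
explicit rather than left to an implicit conversion.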



* [PATCH 3/4] winansi: simplify loading the GetCurrentConsoleFontEx() function
  2019-06-13 11:49 [PATCH 0/4] Support building with GCC v8.x/v9.x Johannes Schindelin via GitGitGadget
  2019-06-13 11:49 ` [PATCH 1/4] poll (mingw): allow compiling with GCC 8 and DEVELOPER=1 Johannes Schindelin via GitGitGadget
  2019-06-13 11:49 ` [PATCH 2/4] kwset: allow building with GCC 8 Johannes Schindelin via GitGitGadget
@ 2019-06-13 11:49 ` Johannes Schindelin via GitGitGadget
  2019-06-13 11:49 ` [PATCH 4/4] config: avoid calling `labs()` on too-large data type Johannes Schindelin via GitGitGadget
  3 siblings, 0 replies; 90+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2019-06-13 11:49 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

We introduced helper macros to simplify loading functions dynamically.
Might just as well use them.

This also side-steps a compiler warning when building with GCC v8.x: it
would complain about casting between incompatible function pointers.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 compat/winansi.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/compat/winansi.c b/compat/winansi.c
index f4f08237f9..a29d34ef44 100644
--- a/compat/winansi.c
+++ b/compat/winansi.c
@@ -7,6 +7,7 @@
 #include <wingdi.h>
 #include <winreg.h>
 #include "win32.h"
+#include "win32/lazyload.h"
 
 static int fd_is_interactive[3] = { 0, 0, 0 };
 #define FD_CONSOLE 0x1
@@ -41,26 +42,21 @@ typedef struct _CONSOLE_FONT_INFOEX {
 #endif
 #endif
 
-typedef BOOL (WINAPI *PGETCURRENTCONSOLEFONTEX)(HANDLE, BOOL,
-		PCONSOLE_FONT_INFOEX);
-
 static void warn_if_raster_font(void)
 {
 	DWORD fontFamily = 0;
-	PGETCURRENTCONSOLEFONTEX pGetCurrentConsoleFontEx;
+	DECLARE_PROC_ADDR(kernel32.dll, BOOL, GetCurrentConsoleFontEx,
+			HANDLE, BOOL, PCONSOLE_FONT_INFOEX);
 
 	/* don't bother if output was ascii only */
 	if (!non_ascii_used)
 		return;
 
 	/* GetCurrentConsoleFontEx is available since Vista */
-	pGetCurrentConsoleFontEx = (PGETCURRENTCONSOLEFONTEX) GetProcAddress(
-			GetModuleHandle("kernel32.dll"),
-			"GetCurrentConsoleFontEx");
-	if (pGetCurrentConsoleFontEx) {
+	if (INIT_PROC_ADDR(GetCurrentConsoleFontEx)) {
 		CONSOLE_FONT_INFOEX cfi;
 		cfi.cbSize = sizeof(cfi);
-		if (pGetCurrentConsoleFontEx(console, 0, &cfi))
+		if (GetCurrentConsoleFontEx(console, 0, &cfi))
 			fontFamily = cfi.FontFamily;
 	} else {
 		/* pre-Vista: check default console font in registry */
-- 
gitgitgadget
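
The general shape of a lazy-loading helper can be sketched portably. This
is not the content of `compat/win32/lazyload.h`; every name below
(`lazy_proc`, `resolve_symbol()`, `lazy_get()`, `call_add()`, `demo_add()`)
is hypothetical, and `resolve_symbol()` merely stands in for a real lookup
such as `GetProcAddress()`. The point is that the lookup runs at most once
and that the function pointer is stored in a generic type and converted
back only at the call site.

```c
#include <stddef.h>
#include <string.h>

/* Generic function-pointer type used for storage (hypothetical). */
typedef void (*generic_fn)(void);

struct lazy_proc {
	const char *name; /* symbol to look up */
	generic_fn fn;    /* cached result, NULL if unavailable */
	int initialized;  /* nonzero once the lookup has been attempted */
};

static int demo_add(int a, int b)
{
	return a + b;
}

/* Hypothetical resolver standing in for a dynamic symbol lookup. */
static generic_fn resolve_symbol(const char *name)
{
	if (!strcmp(name, "demo_add"))
		return (generic_fn)demo_add;
	return NULL;
}

/* Look the symbol up at most once and cache the outcome. */
static generic_fn lazy_get(struct lazy_proc *proc)
{
	if (!proc->initialized) {
		proc->fn = resolve_symbol(proc->name);
		proc->initialized = 1;
	}
	return proc->fn;
}

/* Call through the lazily resolved pointer; -1 if the symbol is missing. */
static int call_add(struct lazy_proc *proc, int a, int b)
{
	generic_fn fn = lazy_get(proc);

	if (!fn)
		return -1;
	return ((int (*)(int, int))fn)(a, b);
}
```

A caller would declare `struct lazy_proc p = { "demo_add", NULL, 0 };` and
then use `call_add(&p, 2, 3)`; a missing symbol takes the fallback path,
much like the pre-Vista branch in the patch.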



* [PATCH 4/4] config: avoid calling `labs()` on too-large data type
  2019-06-13 11:49 [PATCH 0/4] Support building with GCC v8.x/v9.x Johannes Schindelin via GitGitGadget
                   ` (2 preceding siblings ...)
  2019-06-13 11:49 ` [PATCH 3/4] winansi: simplify loading the GetCurrentConsoleFontEx() function Johannes Schindelin via GitGitGadget
@ 2019-06-13 11:49 ` Johannes Schindelin via GitGitGadget
  2019-06-13 16:13   ` Junio C Hamano
  2019-06-16  6:48   ` René Scharfe
  3 siblings, 2 replies; 90+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2019-06-13 11:49 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

The `labs()` function operates, as the initial `l` suggests, on `long`
parameters. However, in `config.c` we tried to use it on values of type
`intmax_t`.

This problem was found by GCC v9.x.

To fix it, let's just "unroll" the function (i.e. negate the value if it
is negative).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 config.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/config.c b/config.c
index 296a6d9cc4..01c6e9df23 100644
--- a/config.c
+++ b/config.c
@@ -869,9 +869,9 @@ static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
 			errno = EINVAL;
 			return 0;
 		}
-		uval = labs(val);
+		uval = val < 0 ? -val : val;
 		uval *= factor;
-		if (uval > max || labs(val) > uval) {
+		if (uval > max || (val < 0 ? -val : val) > uval) {
 			errno = ERANGE;
 			return 0;
 		}
-- 
gitgitgadget
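
To see why `labs()` is the wrong tool here: it takes a `long`, and where
`long` is narrower than `intmax_t` (for example 32 versus 64 bits on 64-bit
Windows) the argument is converted to `long` before the absolute value is
taken. The sketch below shows one way to compute the magnitude at full
width. It is a variation on the patch, not the patch itself: the helper
name `magnitude()` is hypothetical, and it negates in the unsigned domain
so that `INTMAX_MIN` is handled as well, whereas the patch negates the
signed value directly.

```c
#include <stdint.h>

/*
 * Magnitude of an intmax_t, returned as a uintmax_t. Converting to
 * unsigned before negating avoids signed overflow for INTMAX_MIN.
 */
static uintmax_t magnitude(intmax_t val)
{
	return val < 0 ? -(uintmax_t)val : (uintmax_t)val;
}
```

C99 also provides `imaxabs()` in `<inttypes.h>`, which operates on
`intmax_t` directly, though like the open-coded signed negation it cannot
represent the magnitude of `INTMAX_MIN`.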


* Re: [PATCH 2/4] kwset: allow building with GCC 8
  2019-06-13 11:49 ` [PATCH 2/4] kwset: allow building with GCC 8 Johannes Schindelin via GitGitGadget
@ 2019-06-13 16:11   ` Junio C Hamano
  2019-06-14  9:53   ` SZEDER Gábor
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 90+ messages in thread
From: Junio C Hamano @ 2019-06-13 16:11 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget; +Cc: git, Johannes Schindelin

"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> The kwset functionality makes use of the obstack code, which expects to
> be handed a function that can allocate large chunks of data. It expects
> that function to accept a `size` parameter of type `long`.
>
> This upsets GCC 8 on Windows, because `long` does not have the same
> bit size as `size_t` there.
>
> Now, the proper thing to do would be to switch to `size_t`. But this
> would make us deviate from the "upstream" code even further, making it
> hard to synchronize with newer versions, and also it would be quite
> involved because that `long` type is so invasive in that code.
>
> Let's punt, and instead provide a super small wrapper around
> `xmalloc()`.

Yay.

The above description makes it sound as if this patch is an ugly
workaround, but I think this is "the proper thing" to do, as long as
the use of obstack stuff in this context is meant to allocate less
than LONG_MAX bytes at a time, even if long is sometimes smaller than
size_t.

Thanks.

>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  kwset.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kwset.c b/kwset.c
> index 4fb6455aca..efc2ff41bc 100644
> --- a/kwset.c
> +++ b/kwset.c
> @@ -38,7 +38,13 @@
>  #include "compat/obstack.h"
>  
>  #define NCHAR (UCHAR_MAX + 1)
> -#define obstack_chunk_alloc xmalloc
> +/* adapter for `xmalloc()`, which takes `size_t`, not `long` */
> +static void *obstack_chunk_alloc(long size)
> +{
> +	if (size < 0)
> +		BUG("Cannot allocate a negative amount: %ld", size);
> +	return xmalloc(size);
> +}
>  #define obstack_chunk_free free
>  
>  #define U(c) ((unsigned char) (c))


* Re: [PATCH 4/4] config: avoid calling `labs()` on too-large data type
  2019-06-13 11:49 ` [PATCH 4/4] config: avoid calling `labs()` on too-large data type Johannes Schindelin via GitGitGadget
@ 2019-06-13 16:13   ` Junio C Hamano
  2019-06-16  6:48   ` René Scharfe
  1 sibling, 0 replies; 90+ messages in thread
From: Junio C Hamano @ 2019-06-13 16:13 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget; +Cc: git, Johannes Schindelin

"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> The `labs()` function operates, as the initial `l` suggests, on `long`
> parameters. However, in `config.c` we tried to use it on values of type
> `intmax_t`.
>
> This problem was found by GCC v9.x.
>
> To fix it, let's just "unroll" the function (i.e. negate the value if it
> is negative).

Thanks.  Obviously correct.

>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  config.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/config.c b/config.c
> index 296a6d9cc4..01c6e9df23 100644
> --- a/config.c
> +++ b/config.c
> @@ -869,9 +869,9 @@ static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
>  			errno = EINVAL;
>  			return 0;
>  		}
> -		uval = labs(val);
> +		uval = val < 0 ? -val : val;
>  		uval *= factor;
> -		if (uval > max || labs(val) > uval) {
> +		if (uval > max || (val < 0 ? -val : val) > uval) {
>  			errno = ERANGE;
>  			return 0;
>  		}


* Re: [PATCH 2/4] kwset: allow building with GCC 8
  2019-06-13 11:49 ` [PATCH 2/4] kwset: allow building with GCC 8 Johannes Schindelin via GitGitGadget
  2019-06-13 16:11   ` Junio C Hamano
@ 2019-06-14  9:53   ` SZEDER Gábor
  2019-06-14 10:00     ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream SZEDER Gábor
  2019-06-14 16:12     ` [PATCH 2/4] kwset: allow building with GCC 8 Junio C Hamano
  2019-06-14 22:09   ` Ævar Arnfjörð Bjarmason
  2019-06-14 22:55   ` Can we just get rid of kwset & obstack in favor of optimistically using PCRE v2 JIT? Ævar Arnfjörð Bjarmason
  3 siblings, 2 replies; 90+ messages in thread
From: SZEDER Gábor @ 2019-06-14  9:53 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Junio C Hamano, Johannes Schindelin

> Subject: Re: [PATCH 2/4] kwset: allow building with GCC 8

The subject could benefit from an "on Windows" at the end; 'kwset' and
compat/obstack can be built with GCC 8 and 9 just fine on some other
platforms.

On Thu, Jun 13, 2019 at 04:49:45AM -0700, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
> 
> The kwset functionality makes use of the obstack code, which expects to
> be handed a function that can allocate large chunks of data. It expects
> that function to accept a `size` parameter of type `long`.
> 
> This upsets GCC 8 on Windows, because `long` does not have the same
> bit size as `size_t` there.
> 
> Now, the proper thing to do would be to switch to `size_t`. But this
> would make us deviate from the "upstream" code even further,

This is not entirely true: upstream already uses 'size_t', so the
switch would actually bring our copy closer to upstream.

But look out for the patches that I'll send out in a minute...

> making it
> hard to synchronize with newer versions, and also it would be quite
> involved because that `long` type is so invasive in that code.
> 
> Let's punt, and instead provide a super small wrapper around
> `xmalloc()`.
> 
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  kwset.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/kwset.c b/kwset.c
> index 4fb6455aca..efc2ff41bc 100644
> --- a/kwset.c
> +++ b/kwset.c
> @@ -38,7 +38,13 @@
>  #include "compat/obstack.h"
>  
>  #define NCHAR (UCHAR_MAX + 1)
> -#define obstack_chunk_alloc xmalloc
> +/* adapter for `xmalloc()`, which takes `size_t`, not `long` */
> +static void *obstack_chunk_alloc(long size)
> +{
> +	if (size < 0)
> +		BUG("Cannot allocate a negative amount: %ld", size);
> +	return xmalloc(size);
> +}
>  #define obstack_chunk_free free
>  
>  #define U(c) ((unsigned char) (c))
> -- 
> gitgitgadget
> 


* [RFC/PATCH v1 0/4] compat/obstack: update from upstream
  2019-06-14  9:53   ` SZEDER Gábor
@ 2019-06-14 10:00     ` SZEDER Gábor
  2019-06-14 10:00       ` [PATCH v1 1/4] " SZEDER Gábor
                         ` (6 more replies)
  2019-06-14 16:12     ` [PATCH 2/4] kwset: allow building with GCC 8 Junio C Hamano
  1 sibling, 7 replies; 90+ messages in thread
From: SZEDER Gábor @ 2019-06-14 10:00 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Johannes Schindelin, Ramsay Jones, SZEDER Gábor

Update 'compat/obstack.{c,h}' from upstream, because they already use
'size_t' instead of 'long' in places that might eventually end up as
an argument to malloc(), which might solve build errors with GCC 8 on
Windows.

The first patch just imports from upstream and doesn't modify anything
at all, and, consequently, it can't be compiled because of a screenful
or two of errors.  This is bad for future bisects, of course.

OTOH, adding all the necessary build fixes right away makes review
harder...

I'm not sure how to deal with this situation, so here is a series with
the fixes in separate patches for review, for now.  If there's an
agreement that this is the direction to take, then I'll squash in the
fixes in the first patch and touch up the resulting commit message.


Ramsay, could you please run sparse on top of this patch series to
make sure that I caught and converted all "0 instead of NULL" usages
in the last patch?  Thanks.


And here is an all-green build of these patches on Travis CI:

  https://travis-ci.org/szeder/git/builds/545645247

(and one bonus patch on top to deal with some Homebrew nonsense)

SZEDER Gábor (4):
  compat/obstack: update from upstream
  SQUASH??? compat/obstack: fix portability issues
  SQUASH??? compat/obstack: fix build errors with Clang
  compat/obstack: fix some sparse warnings

 compat/obstack.c | 356 ++++++++------------
 compat/obstack.h | 832 ++++++++++++++++++++++++-----------------------
 2 files changed, 572 insertions(+), 616 deletions(-)

-- 
2.22.0.589.g5bd7971b91



* [PATCH v1 1/4] compat/obstack: update from upstream
  2019-06-14 10:00     ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream SZEDER Gábor
@ 2019-06-14 10:00       ` SZEDER Gábor
  2019-06-14 10:00       ` [PATCH v1 2/4] SQUASH??? compat/obstack: fix portability issues SZEDER Gábor
                         ` (5 subsequent siblings)
  6 siblings, 0 replies; 90+ messages in thread
From: SZEDER Gábor @ 2019-06-14 10:00 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Johannes Schindelin, Ramsay Jones, SZEDER Gábor

!! This does not compile !!

Update 'compat/obstack.{c,h}' from commit
5905d8ca9945f0d60ff40eb6cfa42afc0199ab8f in
https://git.savannah.gnu.org/git/gnulib.git

We have made a couple of changes to our copy of 'compat/obstack.{c,h}'
since it was introduced in e831171d67 (Add obstack.[ch] from EGLIBC
2.10, 2011-08-21), and in the meantime some of those issues have been
addressed in upstream as well [1].  Furthermore, upstream fixed one
big issue that we still suffer from, namely our copy still uses type
'long' to specify the size of the chunk of memory to allocate while
xmalloc() expects type 'size_t', which triggers compiler errors with
GCC 8 on Windows, where these two data types are of different size;
upstream has been using 'size_t' for quite some time now.  Making the
conversion from 'long' to 'size_t' in our copy just isn't worth it,
hence the update from upstream.

[1] In particular the following changes have been addressed in
    upstream:
    764473d257 (compat/obstack: fix -Wcast-function-type warnings,
    2019-01-17)
    484257925f (Replace Free Software Foundation address in license
    notices, 2017-11-07)
    7323513d28 (obstack: fix spelling of similar, 2013-04-12)

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 compat/obstack.c | 399 ++++++++++------------
 compat/obstack.h | 835 +++++++++++++++++++++++++----------------------
 2 files changed, 606 insertions(+), 628 deletions(-)

diff --git a/compat/obstack.c b/compat/obstack.c
index 27cd5c1ea1..6949111e4d 100644
--- a/compat/obstack.c
+++ b/compat/obstack.c
@@ -1,6 +1,5 @@
 /* obstack.c - subroutines used implicitly by object stack macros
-   Copyright (C) 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1996, 1997, 1998,
-   1999, 2000, 2001, 2002, 2003, 2004, 2005 Free Software Foundation, Inc.
+   Copyright (C) 1988-2019 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -15,16 +14,19 @@
 
    You should have received a copy of the GNU Lesser General Public
    License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
+   <https://www.gnu.org/licenses/>.  */
 
-#include "git-compat-util.h"
-#include <gettext.h>
-#include "obstack.h"
 
-/* NOTE BEFORE MODIFYING THIS FILE: This version number must be
-   incremented whenever callers compiled using an old obstack.h can no
-   longer properly call the functions in this obstack.c.  */
-#define OBSTACK_INTERFACE_VERSION 1
+#ifdef _LIBC
+# include <obstack.h>
+#else
+# include <config.h>
+# include "obstack.h"
+#endif
+
+/* NOTE BEFORE MODIFYING THIS FILE: _OBSTACK_INTERFACE_VERSION in
+   obstack.h must be incremented whenever callers compiled using an old
+   obstack.h can no longer properly call the functions in this file.  */
 
 /* Comment out all this code if we are using the GNU C Library, and are not
    actually compiling the library itself, and the installed library
@@ -32,113 +34,82 @@
    C Library, but also included in many other GNU distributions.  Compiling
    and linking in this code is a waste when using the GNU C library
    (especially if it is a shared library).  Rather than having every GNU
-   program understand `configure --with-gnu-libc' and omit the object
+   program understand 'configure --with-gnu-libc' and omit the object
    files, it is simpler to just do this in the source for each such file.  */
-
-#include <stdio.h>		/* Random thing to get __GNU_LIBRARY__.  */
 #if !defined _LIBC && defined __GNU_LIBRARY__ && __GNU_LIBRARY__ > 1
 # include <gnu-versions.h>
-# if _GNU_OBSTACK_INTERFACE_VERSION == OBSTACK_INTERFACE_VERSION
-#  define ELIDE_CODE
+# if (_GNU_OBSTACK_INTERFACE_VERSION == _OBSTACK_INTERFACE_VERSION	      \
+      || (_GNU_OBSTACK_INTERFACE_VERSION == 1				      \
+          && _OBSTACK_INTERFACE_VERSION == 2				      \
+          && defined SIZEOF_INT && defined SIZEOF_SIZE_T		      \
+          && SIZEOF_INT == SIZEOF_SIZE_T))
+#  define _OBSTACK_ELIDE_CODE
 # endif
 #endif
 
-#include <stddef.h>
-
-#ifndef ELIDE_CODE
-
-
-# if HAVE_INTTYPES_H
-#  include <inttypes.h>
+#ifndef _OBSTACK_ELIDE_CODE
+/* If GCC, or if an oddball (testing?) host that #defines __alignof__,
+   use the already-supplied __alignof__.  Otherwise, this must be Gnulib
+   (as glibc assumes GCC); defer to Gnulib's alignof_type.  */
+# if !defined __GNUC__ && !defined __alignof__
+#  include <alignof.h>
+#  define __alignof__(type) alignof_type (type)
 # endif
-# if HAVE_STDINT_H || defined _LIBC
-#  include <stdint.h>
+# include <stdlib.h>
+# include <stdint.h>
+
+# ifndef MAX
+#  define MAX(a,b) ((a) > (b) ? (a) : (b))
 # endif
 
 /* Determine default alignment.  */
-union fooround
-{
-  uintmax_t i;
-  long double d;
-  void *p;
-};
-struct fooalign
-{
-  char c;
-  union fooround u;
-};
+
 /* If malloc were really smart, it would round addresses to DEFAULT_ALIGNMENT.
    But in fact it might be less smart and round addresses to as much as
-   DEFAULT_ROUNDING.  So we prepare for it to do that.  */
-enum
-  {
-    DEFAULT_ALIGNMENT = offsetof (struct fooalign, u),
-    DEFAULT_ROUNDING = sizeof (union fooround)
-  };
-
-/* When we copy a long block of data, this is the unit to do it with.
-   On some machines, copying successive ints does not work;
-   in such a case, redefine COPYING_UNIT to `long' (if that works)
-   or `char' as a last resort.  */
-# ifndef COPYING_UNIT
-#  define COPYING_UNIT int
-# endif
+   DEFAULT_ROUNDING.  So we prepare for it to do that.
+
+   DEFAULT_ALIGNMENT cannot be an enum constant; see gnulib's alignof.h.  */
+#define DEFAULT_ALIGNMENT MAX (__alignof__ (long double),		      \
+                               MAX (__alignof__ (uintmax_t),		      \
+                                    __alignof__ (void *)))
+#define DEFAULT_ROUNDING MAX (sizeof (long double),			      \
+                               MAX (sizeof (uintmax_t),			      \
+                                    sizeof (void *)))
+
+/* Call functions with either the traditional malloc/free calling
+   interface, or the mmalloc/mfree interface (that adds an extra first
+   argument), based on the value of use_extra_arg.  */
+
+static void *
+call_chunkfun (struct obstack *h, size_t size)
+{
+  if (h->use_extra_arg)
+    return h->chunkfun.extra (h->extra_arg, size);
+  else
+    return h->chunkfun.plain (size);
+}
 
+static void
+call_freefun (struct obstack *h, void *old_chunk)
+{
+  if (h->use_extra_arg)
+    h->freefun.extra (h->extra_arg, old_chunk);
+  else
+    h->freefun.plain (old_chunk);
+}
 
-/* The functions allocating more room by calling `obstack_chunk_alloc'
-   jump to the handler pointed to by `obstack_alloc_failed_handler'.
-   This can be set to a user defined function which should either
-   abort gracefully or use longjump - but shouldn't return.  This
-   variable by default points to the internal function
-   `print_and_abort'.  */
-static void print_and_abort (void);
-void (*obstack_alloc_failed_handler) (void) = print_and_abort;
-
-# ifdef _LIBC
-#  if SHLIB_COMPAT (libc, GLIBC_2_0, GLIBC_2_3_4)
-/* A looong time ago (before 1994, anyway; we're not sure) this global variable
-   was used by non-GNU-C macros to avoid multiple evaluation.  The GNU C
-   library still exports it because somebody might use it.  */
-struct obstack *_obstack_compat;
-compat_symbol (libc, _obstack_compat, _obstack, GLIBC_2_0);
-#  endif
-# endif
 
-/* Define a macro that either calls functions with the traditional malloc/free
-   calling interface, or calls functions with the mmalloc/mfree interface
-   (that adds an extra first argument), based on the state of use_extra_arg.
-   For free, do not use ?:, since some compilers, like the MIPS compilers,
-   do not allow (expr) ? void : void.  */
-
-# define CALL_CHUNKFUN(h, size) \
-  (((h) -> use_extra_arg) \
-   ? (*(h)->chunkfun.extra) ((h)->extra_arg, (size)) \
-   : (*(h)->chunkfun.plain) ((size)))
-
-# define CALL_FREEFUN(h, old_chunk) \
-  do { \
-    if ((h) -> use_extra_arg) \
-      (*(h)->freefun.extra) ((h)->extra_arg, (old_chunk)); \
-    else \
-      (*(h)->freefun.plain) ((old_chunk)); \
-  } while (0)
-
-\f
 /* Initialize an obstack H for use.  Specify chunk size SIZE (0 means default).
    Objects start on multiples of ALIGNMENT (0 means use default).
-   CHUNKFUN is the function to use to allocate chunks,
-   and FREEFUN the function to free them.
 
    Return nonzero if successful, calls obstack_alloc_failed_handler if
    allocation fails.  */
 
-int
-_obstack_begin (struct obstack *h,
-		int size, int alignment,
-		void *(*chunkfun) (long),
-		void (*freefun) (void *))
+static int
+_obstack_begin_worker (struct obstack *h,
+                       _OBSTACK_SIZE_T size, _OBSTACK_SIZE_T alignment)
 {
-  register struct _obstack_chunk *chunk; /* points to new chunk */
+  struct _obstack_chunk *chunk; /* points to new chunk */
 
   if (alignment == 0)
     alignment = DEFAULT_ALIGNMENT;
@@ -146,33 +117,29 @@ _obstack_begin (struct obstack *h,
     /* Default size is what GNU malloc can fit in a 4096-byte block.  */
     {
       /* 12 is sizeof (mhead) and 4 is EXTRA from GNU malloc.
-	 Use the values for range checking, because if range checking is off,
-	 the extra bytes won't be missed terribly, but if range checking is on
-	 and we used a larger request, a whole extra 4096 bytes would be
-	 allocated.
+         Use the values for range checking, because if range checking is off,
+         the extra bytes won't be missed terribly, but if range checking is on
+         and we used a larger request, a whole extra 4096 bytes would be
+         allocated.
 
-	 These number are irrelevant to the new GNU malloc.  I suspect it is
-	 less sensitive to the size of the request.  */
+         These number are irrelevant to the new GNU malloc.  I suspect it is
+         less sensitive to the size of the request.  */
       int extra = ((((12 + DEFAULT_ROUNDING - 1) & ~(DEFAULT_ROUNDING - 1))
-		    + 4 + DEFAULT_ROUNDING - 1)
-		   & ~(DEFAULT_ROUNDING - 1));
+                    + 4 + DEFAULT_ROUNDING - 1)
+                   & ~(DEFAULT_ROUNDING - 1));
       size = 4096 - extra;
     }
 
-  h->chunkfun.plain = chunkfun;
-  h->freefun.plain = freefun;
   h->chunk_size = size;
   h->alignment_mask = alignment - 1;
-  h->use_extra_arg = 0;
 
-  chunk = h->chunk = CALL_CHUNKFUN (h, h -> chunk_size);
+  chunk = h->chunk = call_chunkfun (h, h->chunk_size);
   if (!chunk)
     (*obstack_alloc_failed_handler) ();
   h->next_free = h->object_base = __PTR_ALIGN ((char *) chunk, chunk->contents,
-					       alignment - 1);
-  h->chunk_limit = chunk->limit
-    = (char *) chunk + h->chunk_size;
-  chunk->prev = NULL;
+                                               alignment - 1);
+  h->chunk_limit = chunk->limit = (char *) chunk + h->chunk_size;
+  chunk->prev = 0;
   /* The initial chunk now contains no empty object.  */
   h->maybe_empty_object = 0;
   h->alloc_failed = 0;
@@ -180,52 +147,29 @@ _obstack_begin (struct obstack *h,
 }
 
 int
-_obstack_begin_1 (struct obstack *h, int size, int alignment,
-		  void *(*chunkfun) (void *, long),
-		  void (*freefun) (void *, void *),
-		  void *arg)
+_obstack_begin (struct obstack *h,
+                _OBSTACK_SIZE_T size, _OBSTACK_SIZE_T alignment,
+                void *(*chunkfun) (size_t),
+                void (*freefun) (void *))
 {
-  register struct _obstack_chunk *chunk; /* points to new chunk */
-
-  if (alignment == 0)
-    alignment = DEFAULT_ALIGNMENT;
-  if (size == 0)
-    /* Default size is what GNU malloc can fit in a 4096-byte block.  */
-    {
-      /* 12 is sizeof (mhead) and 4 is EXTRA from GNU malloc.
-	 Use the values for range checking, because if range checking is off,
-	 the extra bytes won't be missed terribly, but if range checking is on
-	 and we used a larger request, a whole extra 4096 bytes would be
-	 allocated.
-
-	 These number are irrelevant to the new GNU malloc.  I suspect it is
-	 less sensitive to the size of the request.  */
-      int extra = ((((12 + DEFAULT_ROUNDING - 1) & ~(DEFAULT_ROUNDING - 1))
-		    + 4 + DEFAULT_ROUNDING - 1)
-		   & ~(DEFAULT_ROUNDING - 1));
-      size = 4096 - extra;
-    }
-
-  h->chunkfun.extra = (struct _obstack_chunk * (*)(void *,long)) chunkfun;
-  h->freefun.extra = (void (*) (void *, struct _obstack_chunk *)) freefun;
+  h->chunkfun.plain = chunkfun;
+  h->freefun.plain = freefun;
+  h->use_extra_arg = 0;
+  return _obstack_begin_worker (h, size, alignment);
+}
 
-  h->chunk_size = size;
-  h->alignment_mask = alignment - 1;
+int
+_obstack_begin_1 (struct obstack *h,
+                  _OBSTACK_SIZE_T size, _OBSTACK_SIZE_T alignment,
+                  void *(*chunkfun) (void *, size_t),
+                  void (*freefun) (void *, void *),
+                  void *arg)
+{
+  h->chunkfun.extra = chunkfun;
+  h->freefun.extra = freefun;
   h->extra_arg = arg;
   h->use_extra_arg = 1;
-
-  chunk = h->chunk = CALL_CHUNKFUN (h, h -> chunk_size);
-  if (!chunk)
-    (*obstack_alloc_failed_handler) ();
-  h->next_free = h->object_base = __PTR_ALIGN ((char *) chunk, chunk->contents,
-					       alignment - 1);
-  h->chunk_limit = chunk->limit
-    = (char *) chunk + h->chunk_size;
-  chunk->prev = NULL;
-  /* The initial chunk now contains no empty object.  */
-  h->maybe_empty_object = 0;
-  h->alloc_failed = 0;
-  return 1;
+  return _obstack_begin_worker (h, size, alignment);
 }
 
 /* Allocate a new current chunk for the obstack *H
@@ -235,25 +179,27 @@ _obstack_begin_1 (struct obstack *h, int size, int alignment,
    to the beginning of the new one.  */
 
 void
-_obstack_newchunk (struct obstack *h, int length)
+_obstack_newchunk (struct obstack *h, _OBSTACK_SIZE_T length)
 {
-  register struct _obstack_chunk *old_chunk = h->chunk;
-  register struct _obstack_chunk *new_chunk;
-  register long	new_size;
-  register long obj_size = h->next_free - h->object_base;
-  register long i;
-  long already;
+  struct _obstack_chunk *old_chunk = h->chunk;
+  struct _obstack_chunk *new_chunk = 0;
+  size_t obj_size = h->next_free - h->object_base;
   char *object_base;
 
   /* Compute size for new chunk.  */
-  new_size = (obj_size + length) + (obj_size >> 3) + h->alignment_mask + 100;
+  size_t sum1 = obj_size + length;
+  size_t sum2 = sum1 + h->alignment_mask;
+  size_t new_size = sum2 + (obj_size >> 3) + 100;
+  if (new_size < sum2)
+    new_size = sum2;
   if (new_size < h->chunk_size)
     new_size = h->chunk_size;
 
   /* Allocate and initialize the new chunk.  */
-  new_chunk = CALL_CHUNKFUN (h, new_size);
+  if (obj_size <= sum1 && sum1 <= sum2)
+    new_chunk = call_chunkfun (h, new_size);
   if (!new_chunk)
-    (*obstack_alloc_failed_handler) ();
+    (*obstack_alloc_failed_handler)();
   h->chunk = new_chunk;
   new_chunk->prev = old_chunk;
   new_chunk->limit = h->chunk_limit = (char *) new_chunk + new_size;
@@ -262,36 +208,19 @@ _obstack_newchunk (struct obstack *h, int length)
   object_base =
     __PTR_ALIGN ((char *) new_chunk, new_chunk->contents, h->alignment_mask);
 
-  /* Move the existing object to the new chunk.
-     Word at a time is fast and is safe if the object
-     is sufficiently aligned.  */
-  if (h->alignment_mask + 1 >= DEFAULT_ALIGNMENT)
-    {
-      for (i = obj_size / sizeof (COPYING_UNIT) - 1;
-	   i >= 0; i--)
-	((COPYING_UNIT *)object_base)[i]
-	  = ((COPYING_UNIT *)h->object_base)[i];
-      /* We used to copy the odd few remaining bytes as one extra COPYING_UNIT,
-	 but that can cross a page boundary on a machine
-	 which does not do strict alignment for COPYING_UNITS.  */
-      already = obj_size / sizeof (COPYING_UNIT) * sizeof (COPYING_UNIT);
-    }
-  else
-    already = 0;
-  /* Copy remaining bytes one by one.  */
-  for (i = already; i < obj_size; i++)
-    object_base[i] = h->object_base[i];
+  /* Move the existing object to the new chunk.  */
+  memcpy (object_base, h->object_base, obj_size);
 
   /* If the object just copied was the only data in OLD_CHUNK,
      free that chunk and remove it from the chain.
      But not if that chunk might contain an empty object.  */
-  if (! h->maybe_empty_object
+  if (!h->maybe_empty_object
       && (h->object_base
-	  == __PTR_ALIGN ((char *) old_chunk, old_chunk->contents,
-			  h->alignment_mask)))
+          == __PTR_ALIGN ((char *) old_chunk, old_chunk->contents,
+                          h->alignment_mask)))
     {
       new_chunk->prev = old_chunk->prev;
-      CALL_FREEFUN (h, old_chunk);
+      call_freefun (h, old_chunk);
     }
 
   h->object_base = object_base;
@@ -299,9 +228,6 @@ _obstack_newchunk (struct obstack *h, int length)
   /* The new chunk certainly contains no empty object yet.  */
   h->maybe_empty_object = 0;
 }
-# ifdef _LIBC
-libc_hidden_def (_obstack_newchunk)
-# endif
 
 /* Return nonzero if object OBJ has been allocated from obstack H.
    This is here for debugging.
@@ -309,48 +235,46 @@ libc_hidden_def (_obstack_newchunk)
 
 /* Suppress -Wmissing-prototypes warning.  We don't want to declare this in
    obstack.h because it is just for debugging.  */
-int _obstack_allocated_p (struct obstack *h, void *obj);
+int _obstack_allocated_p (struct obstack *h, void *obj) __attribute_pure__;
 
 int
 _obstack_allocated_p (struct obstack *h, void *obj)
 {
-  register struct _obstack_chunk *lp;	/* below addr of any objects in this chunk */
-  register struct _obstack_chunk *plp;	/* point to previous chunk if any */
+  struct _obstack_chunk *lp;    /* below addr of any objects in this chunk */
+  struct _obstack_chunk *plp;   /* point to previous chunk if any */
 
   lp = (h)->chunk;
   /* We use >= rather than > since the object cannot be exactly at
      the beginning of the chunk but might be an empty object exactly
      at the end of an adjacent chunk.  */
-  while (lp != NULL && ((void *) lp >= obj || (void *) (lp)->limit < obj))
+  while (lp != 0 && ((void *) lp >= obj || (void *) (lp)->limit < obj))
     {
       plp = lp->prev;
       lp = plp;
     }
-  return lp != NULL;
+  return lp != 0;
 }
-\f
+
 /* Free objects in obstack H, including OBJ and everything allocate
    more recently than OBJ.  If OBJ is zero, free everything in H.  */
 
-# undef obstack_free
-
 void
-obstack_free (struct obstack *h, void *obj)
+_obstack_free (struct obstack *h, void *obj)
 {
-  register struct _obstack_chunk *lp;	/* below addr of any objects in this chunk */
-  register struct _obstack_chunk *plp;	/* point to previous chunk if any */
+  struct _obstack_chunk *lp;    /* below addr of any objects in this chunk */
+  struct _obstack_chunk *plp;   /* point to previous chunk if any */
 
   lp = h->chunk;
   /* We use >= because there cannot be an object at the beginning of a chunk.
      But there can be an empty object at that address
      at the end of another chunk.  */
-  while (lp != NULL && ((void *) lp >= obj || (void *) (lp)->limit < obj))
+  while (lp != 0 && ((void *) lp >= obj || (void *) (lp)->limit < obj))
     {
       plp = lp->prev;
-      CALL_FREEFUN (h, lp);
+      call_freefun (h, lp);
       lp = plp;
       /* If we switch chunks, we can't tell whether the new current
-	 chunk contains an empty object, so assume that it may.  */
+         chunk contains an empty object, so assume that it may.  */
       h->maybe_empty_object = 1;
     }
   if (lp)
@@ -359,42 +283,50 @@ obstack_free (struct obstack *h, void *obj)
       h->chunk_limit = lp->limit;
       h->chunk = lp;
     }
-  else if (obj != NULL)
+  else if (obj != 0)
     /* obj is not in any of the chunks! */
     abort ();
 }
 
-# ifdef _LIBC
-/* Older versions of libc used a function _obstack_free intended to be
-   called by non-GCC compilers.  */
-strong_alias (obstack_free, _obstack_free)
-# endif
-\f
-int
+_OBSTACK_SIZE_T
 _obstack_memory_used (struct obstack *h)
 {
-  register struct _obstack_chunk* lp;
-  register int nbytes = 0;
+  struct _obstack_chunk *lp;
+  _OBSTACK_SIZE_T nbytes = 0;
 
-  for (lp = h->chunk; lp != NULL; lp = lp->prev)
+  for (lp = h->chunk; lp != 0; lp = lp->prev)
     {
       nbytes += lp->limit - (char *) lp;
     }
   return nbytes;
 }
-\f
-# ifdef _LIBC
-#  include <libio/iolibio.h>
-# endif
 
-# ifndef __attribute__
-/* This feature is available in gcc versions 2.5 and later.  */
-#  if __GNUC__ < 2 || (__GNUC__ == 2 && __GNUC_MINOR__ < 5)
-#   define __attribute__(Spec) /* empty */
+# ifndef _OBSTACK_NO_ERROR_HANDLER
+/* Define the error handler.  */
+#  include <stdio.h>
+
+/* Exit value used when 'print_and_abort' is used.  */
+#  ifdef _LIBC
+int obstack_exit_failure = EXIT_FAILURE;
+#  else
+#   include "exitfail.h"
+#   define obstack_exit_failure exit_failure
 #  endif
-# endif
 
-static void
+#  ifdef _LIBC
+#   include <libintl.h>
+#  else
+#   include "gettext.h"
+#  endif
+#  ifndef _
+#   define _(msgid) gettext (msgid)
+#  endif
+
+#  ifdef _LIBC
+#   include <libio/iolibio.h>
+#  endif
+
+static _Noreturn void
 print_and_abort (void)
 {
   /* Don't change any of these strings.  Yes, it would be possible to add
@@ -402,12 +334,21 @@ print_and_abort (void)
      happen because the "memory exhausted" message appears in other places
      like this and the translation should be reused instead of creating
      a very similar string which requires a separate translation.  */
-# ifdef _LIBC
+#  ifdef _LIBC
   (void) __fxprintf (NULL, "%s\n", _("memory exhausted"));
-# else
+#  else
   fprintf (stderr, "%s\n", _("memory exhausted"));
-# endif
-  exit (1);
+#  endif
+  exit (obstack_exit_failure);
 }
 
-#endif	/* !ELIDE_CODE */
+/* The functions allocating more room by calling 'obstack_chunk_alloc'
+   jump to the handler pointed to by 'obstack_alloc_failed_handler'.
+   This can be set to a user defined function which should either
+   abort gracefully or use longjump - but shouldn't return.  This
+   variable by default points to the internal function
+   'print_and_abort'.  */
+__attribute_noreturn__ void (*obstack_alloc_failed_handler) (void)
+  = print_and_abort;
+# endif /* !_OBSTACK_NO_ERROR_HANDLER */
+#endif /* !_OBSTACK_ELIDE_CODE */
diff --git a/compat/obstack.h b/compat/obstack.h
index ced94d0118..811de588a4 100644
--- a/compat/obstack.h
+++ b/compat/obstack.h
@@ -1,6 +1,5 @@
 /* obstack.h - object stack macros
-   Copyright (C) 1988-1994,1996-1999,2003,2004,2005,2009
-	Free Software Foundation, Inc.
+   Copyright (C) 1988-2019 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -15,89 +14,89 @@
 
    You should have received a copy of the GNU Lesser General Public
    License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
+   <https://www.gnu.org/licenses/>.  */
 
 /* Summary:
 
-All the apparent functions defined here are macros. The idea
-is that you would use these pre-tested macros to solve a
-very specific set of problems, and they would run fast.
-Caution: no side-effects in arguments please!! They may be
-evaluated MANY times!!
-
-These macros operate a stack of objects.  Each object starts life
-small, and may grow to maturity.  (Consider building a word syllable
-by syllable.)  An object can move while it is growing.  Once it has
-been "finished" it never changes address again.  So the "top of the
-stack" is typically an immature growing object, while the rest of the
-stack is of mature, fixed size and fixed address objects.
-
-These routines grab large chunks of memory, using a function you
-supply, called `obstack_chunk_alloc'.  On occasion, they free chunks,
-by calling `obstack_chunk_free'.  You must define them and declare
-them before using any obstack macros.
-
-Each independent stack is represented by a `struct obstack'.
-Each of the obstack macros expects a pointer to such a structure
-as the first argument.
-
-One motivation for this package is the problem of growing char strings
-in symbol tables.  Unless you are "fascist pig with a read-only mind"
---Gosper's immortal quote from HAKMEM item 154, out of context--you
-would not like to put any arbitrary upper limit on the length of your
-symbols.
-
-In practice this often means you will build many short symbols and a
-few long symbols.  At the time you are reading a symbol you don't know
-how long it is.  One traditional method is to read a symbol into a
-buffer, realloc()ating the buffer every time you try to read a symbol
-that is longer than the buffer.  This is beaut, but you still will
-want to copy the symbol from the buffer to a more permanent
-symbol-table entry say about half the time.
-
-With obstacks, you can work differently.  Use one obstack for all symbol
-names.  As you read a symbol, grow the name in the obstack gradually.
-When the name is complete, finalize it.  Then, if the symbol exists already,
-free the newly read name.
-
-The way we do this is to take a large chunk, allocating memory from
-low addresses.  When you want to build a symbol in the chunk you just
-add chars above the current "high water mark" in the chunk.  When you
-have finished adding chars, because you got to the end of the symbol,
-you know how long the chars are, and you can create a new object.
-Mostly the chars will not burst over the highest address of the chunk,
-because you would typically expect a chunk to be (say) 100 times as
-long as an average object.
-
-In case that isn't clear, when we have enough chars to make up
-the object, THEY ARE ALREADY CONTIGUOUS IN THE CHUNK (guaranteed)
-so we just point to it where it lies.  No moving of chars is
-needed and this is the second win: potentially long strings need
-never be explicitly shuffled. Once an object is formed, it does not
-change its address during its lifetime.
-
-When the chars burst over a chunk boundary, we allocate a larger
-chunk, and then copy the partly formed object from the end of the old
-chunk to the beginning of the new larger chunk.  We then carry on
-accreting characters to the end of the object as we normally would.
-
-A special macro is provided to add a single char at a time to a
-growing object.  This allows the use of register variables, which
-break the ordinary 'growth' macro.
-
-Summary:
-	We allocate large chunks.
-	We carve out one object at a time from the current chunk.
-	Once carved, an object never moves.
-	We are free to append data of any size to the currently
-	  growing object.
-	Exactly one object is growing in an obstack at any one time.
-	You can run one obstack per control block.
-	You may have as many control blocks as you dare.
-	Because of the way we do it, you can `unwind' an obstack
-	  back to a previous state. (You may remove objects much
-	  as you would with a stack.)
-*/
+   All the apparent functions defined here are macros. The idea
+   is that you would use these pre-tested macros to solve a
+   very specific set of problems, and they would run fast.
+   Caution: no side-effects in arguments please!! They may be
+   evaluated MANY times!!
+
+   These macros operate a stack of objects.  Each object starts life
+   small, and may grow to maturity.  (Consider building a word syllable
+   by syllable.)  An object can move while it is growing.  Once it has
+   been "finished" it never changes address again.  So the "top of the
+   stack" is typically an immature growing object, while the rest of the
+   stack is of mature, fixed size and fixed address objects.
+
+   These routines grab large chunks of memory, using a function you
+   supply, called 'obstack_chunk_alloc'.  On occasion, they free chunks,
+   by calling 'obstack_chunk_free'.  You must define them and declare
+   them before using any obstack macros.
+
+   Each independent stack is represented by a 'struct obstack'.
+   Each of the obstack macros expects a pointer to such a structure
+   as the first argument.
+
+   One motivation for this package is the problem of growing char strings
+   in symbol tables.  Unless you are "fascist pig with a read-only mind"
+   --Gosper's immortal quote from HAKMEM item 154, out of context--you
+   would not like to put any arbitrary upper limit on the length of your
+   symbols.
+
+   In practice this often means you will build many short symbols and a
+   few long symbols.  At the time you are reading a symbol you don't know
+   how long it is.  One traditional method is to read a symbol into a
+   buffer, realloc()ating the buffer every time you try to read a symbol
+   that is longer than the buffer.  This is beaut, but you still will
+   want to copy the symbol from the buffer to a more permanent
+   symbol-table entry say about half the time.
+
+   With obstacks, you can work differently.  Use one obstack for all symbol
+   names.  As you read a symbol, grow the name in the obstack gradually.
+   When the name is complete, finalize it.  Then, if the symbol exists already,
+   free the newly read name.
+
+   The way we do this is to take a large chunk, allocating memory from
+   low addresses.  When you want to build a symbol in the chunk you just
+   add chars above the current "high water mark" in the chunk.  When you
+   have finished adding chars, because you got to the end of the symbol,
+   you know how long the chars are, and you can create a new object.
+   Mostly the chars will not burst over the highest address of the chunk,
+   because you would typically expect a chunk to be (say) 100 times as
+   long as an average object.
+
+   In case that isn't clear, when we have enough chars to make up
+   the object, THEY ARE ALREADY CONTIGUOUS IN THE CHUNK (guaranteed)
+   so we just point to it where it lies.  No moving of chars is
+   needed and this is the second win: potentially long strings need
+   never be explicitly shuffled. Once an object is formed, it does not
+   change its address during its lifetime.
+
+   When the chars burst over a chunk boundary, we allocate a larger
+   chunk, and then copy the partly formed object from the end of the old
+   chunk to the beginning of the new larger chunk.  We then carry on
+   accreting characters to the end of the object as we normally would.
+
+   A special macro is provided to add a single char at a time to a
+   growing object.  This allows the use of register variables, which
+   break the ordinary 'growth' macro.
+
+   Summary:
+        We allocate large chunks.
+        We carve out one object at a time from the current chunk.
+        Once carved, an object never moves.
+        We are free to append data of any size to the currently
+          growing object.
+        Exactly one object is growing in an obstack at any one time.
+        You can run one obstack per control block.
+        You may have as many control blocks as you dare.
+        Because of the way we do it, you can "unwind" an obstack
+          back to a previous state. (You may remove objects much
+          as you would with a stack.)
+ */
 
 
 /* Don't do the contents of this file more than once.  */
@@ -105,20 +104,30 @@ break the ordinary 'growth' macro.
 #ifndef _OBSTACK_H
 #define _OBSTACK_H 1
 
-#ifdef __cplusplus
-extern "C" {
+#ifndef _OBSTACK_INTERFACE_VERSION
+# define _OBSTACK_INTERFACE_VERSION 2
+#endif
+
+#include <stddef.h>             /* For size_t and ptrdiff_t.  */
+#include <string.h>             /* For __GNU_LIBRARY__, and memcpy.  */
+
+#if __STDC_VERSION__ < 199901L || defined __HP_cc
+# define __FLEXIBLE_ARRAY_MEMBER 1
+#else
+# define __FLEXIBLE_ARRAY_MEMBER
 #endif
-\f
-/* We need the type of a pointer subtraction.  If __PTRDIFF_TYPE__ is
-   defined, as with GNU C, use that; that way we don't pollute the
-   namespace with <stddef.h>'s symbols.  Otherwise, include <stddef.h>
-   and use ptrdiff_t.  */
-
-#ifdef __PTRDIFF_TYPE__
-# define PTR_INT_TYPE __PTRDIFF_TYPE__
+
+#if _OBSTACK_INTERFACE_VERSION == 1
+/* For binary compatibility with obstack version 1, which used "int"
+   and "long" for these two types.  */
+# define _OBSTACK_SIZE_T unsigned int
+# define _CHUNK_SIZE_T unsigned long
+# define _OBSTACK_CAST(type, expr) ((type) (expr))
 #else
-# include <stddef.h>
-# define PTR_INT_TYPE ptrdiff_t
+/* Version 2 with sane types, especially for 64-bit hosts.  */
+# define _OBSTACK_SIZE_T size_t
+# define _CHUNK_SIZE_T size_t
+# define _OBSTACK_CAST(type, expr) (expr)
 #endif
 
 /* If B is the base of an object addressed by P, return the result of
@@ -127,78 +136,102 @@ extern "C" {
 
 #define __BPTR_ALIGN(B, P, A) ((B) + (((P) - (B) + (A)) & ~(A)))
 
-/* Similar to _BPTR_ALIGN (B, P, A), except optimize the common case
+/* Similar to __BPTR_ALIGN (B, P, A), except optimize the common case
    where pointers can be converted to integers, aligned as integers,
-   and converted back again.  If PTR_INT_TYPE is narrower than a
+   and converted back again.  If ptrdiff_t is narrower than a
    pointer (e.g., the AS/400), play it safe and compute the alignment
    relative to B.  Otherwise, use the faster strategy of computing the
    alignment relative to 0.  */
 
-#define __PTR_ALIGN(B, P, A)						    \
-  __BPTR_ALIGN (sizeof (PTR_INT_TYPE) < sizeof (void *) ? (B) : (char *) 0, \
-		P, A)
+#define __PTR_ALIGN(B, P, A)						      \
+  __BPTR_ALIGN (sizeof (ptrdiff_t) < sizeof (void *) ? (B) : (char *) 0,      \
+                P, A)
+
+#ifndef __attribute_pure__
+# define __attribute_pure__ _GL_ATTRIBUTE_PURE
+#endif
+
+/* Not the same as _Noreturn, since it also works with function pointers.  */
+#ifndef __attribute_noreturn__
+# if 2 < __GNUC__ + (8 <= __GNUC_MINOR__) || 0x5110 <= __SUNPRO_C
+#  define __attribute_noreturn__ __attribute__ ((__noreturn__))
+# else
+#  define __attribute_noreturn__
+# endif
+#endif
 
-#include <string.h>
+#ifdef __cplusplus
+extern "C" {
+#endif
 
-struct _obstack_chunk		/* Lives at front of each chunk. */
+struct _obstack_chunk           /* Lives at front of each chunk. */
 {
-  char  *limit;			/* 1 past end of this chunk */
-  struct _obstack_chunk *prev;	/* address of prior chunk or NULL */
-  char	contents[4];		/* objects begin here */
+  char *limit;                  /* 1 past end of this chunk */
+  struct _obstack_chunk *prev;  /* address of prior chunk or NULL */
+  char contents[__FLEXIBLE_ARRAY_MEMBER]; /* objects begin here */
 };
 
-struct obstack		/* control current object in current chunk */
+struct obstack          /* control current object in current chunk */
 {
-  long	chunk_size;		/* preferred size to allocate chunks in */
-  struct _obstack_chunk *chunk;	/* address of current struct obstack_chunk */
-  char	*object_base;		/* address of object we are building */
-  char	*next_free;		/* where to add next char to current object */
-  char	*chunk_limit;		/* address of char after current chunk */
+  _CHUNK_SIZE_T chunk_size;     /* preferred size to allocate chunks in */
+  struct _obstack_chunk *chunk; /* address of current struct obstack_chunk */
+  char *object_base;            /* address of object we are building */
+  char *next_free;              /* where to add next char to current object */
+  char *chunk_limit;            /* address of char after current chunk */
+  union
+  {
+    _OBSTACK_SIZE_T i;
+    void *p;
+  } temp;                       /* Temporary for some macros.  */
+  _OBSTACK_SIZE_T alignment_mask;  /* Mask of alignment for each object. */
+
+  /* These prototypes vary based on 'use_extra_arg'.  */
   union
   {
-    PTR_INT_TYPE tempint;
-    void *tempptr;
-  } temp;			/* Temporary for some macros.  */
-  int   alignment_mask;		/* Mask of alignment for each object. */
-  /* These prototypes vary based on `use_extra_arg'. */
-  union {
-    void *(*plain) (long);
-    struct _obstack_chunk *(*extra) (void *, long);
+    void *(*plain) (size_t);
+    void *(*extra) (void *, size_t);
   } chunkfun;
-  union {
+  union
+  {
     void (*plain) (void *);
-    void (*extra) (void *, struct _obstack_chunk *);
+    void (*extra) (void *, void *);
   } freefun;
-  void *extra_arg;		/* first arg for chunk alloc/dealloc funcs */
-  unsigned use_extra_arg:1;	/* chunk alloc/dealloc funcs take extra arg */
-  unsigned maybe_empty_object:1;/* There is a possibility that the current
-				   chunk contains a zero-length object.  This
-				   prevents freeing the chunk if we allocate
-				   a bigger chunk to replace it. */
-  unsigned alloc_failed:1;	/* No longer used, as we now call the failed
-				   handler on error, but retained for binary
-				   compatibility.  */
+
+  void *extra_arg;              /* first arg for chunk alloc/dealloc funcs */
+  unsigned use_extra_arg : 1;     /* chunk alloc/dealloc funcs take extra arg */
+  unsigned maybe_empty_object : 1; /* There is a possibility that the current
+                                      chunk contains a zero-length object.  This
+                                      prevents freeing the chunk if we allocate
+                                      a bigger chunk to replace it. */
+  unsigned alloc_failed : 1;      /* No longer used, as we now call the failed
+                                     handler on error, but retained for binary
+                                     compatibility.  */
 };
 
 /* Declare the external functions we use; they are in obstack.c.  */
 
-extern void _obstack_newchunk (struct obstack *, int);
-extern int _obstack_begin (struct obstack *, int, int,
-			    void *(*) (long), void (*) (void *));
-extern int _obstack_begin_1 (struct obstack *, int, int,
-			     void *(*) (void *, long),
-			     void (*) (void *, void *), void *);
-extern int _obstack_memory_used (struct obstack *);
+extern void _obstack_newchunk (struct obstack *, _OBSTACK_SIZE_T);
+extern void _obstack_free (struct obstack *, void *);
+extern int _obstack_begin (struct obstack *,
+                           _OBSTACK_SIZE_T, _OBSTACK_SIZE_T,
+                           void *(*) (size_t), void (*) (void *));
+extern int _obstack_begin_1 (struct obstack *,
+                             _OBSTACK_SIZE_T, _OBSTACK_SIZE_T,
+                             void *(*) (void *, size_t),
+                             void (*) (void *, void *), void *);
+extern _OBSTACK_SIZE_T _obstack_memory_used (struct obstack *)
+  __attribute_pure__;
 
-void obstack_free (struct obstack *, void *);
 
-\f
-/* Error handler called when `obstack_chunk_alloc' failed to allocate
+/* Error handler called when 'obstack_chunk_alloc' failed to allocate
    more memory.  This can be set to a user defined function which
    should either abort gracefully or use longjump - but shouldn't
    return.  The default action is to print a message and abort.  */
-extern void (*obstack_alloc_failed_handler) (void);
-\f
+extern __attribute_noreturn__ void (*obstack_alloc_failed_handler) (void);
+
+/* Exit value used when 'print_and_abort' is used.  */
+extern int obstack_exit_failure;
+
 /* Pointer to beginning of object being allocated or to be allocated next.
    Note that this might not be the final address of the object
    because a new chunk might be needed to hold the final size.  */
@@ -211,210 +244,210 @@ extern void (*obstack_alloc_failed_handler) (void);
 
 /* Pointer to next byte not yet allocated in current chunk.  */
 
-#define obstack_next_free(h)	((h)->next_free)
+#define obstack_next_free(h) ((void *) (h)->next_free)
 
 /* Mask specifying low bits that should be clear in address of an object.  */
 
 #define obstack_alignment_mask(h) ((h)->alignment_mask)
 
 /* To prevent prototype warnings provide complete argument list.  */
-#define obstack_init(h)						\
-  _obstack_begin ((h), 0, 0,					\
-		  (void *(*) (long)) obstack_chunk_alloc,	\
-		  (void (*) (void *)) obstack_chunk_free)
+#define obstack_init(h)							      \
+  _obstack_begin ((h), 0, 0,						      \
+                  _OBSTACK_CAST (void *(*) (size_t), obstack_chunk_alloc),    \
+                  _OBSTACK_CAST (void (*) (void *), obstack_chunk_free))
 
-#define obstack_begin(h, size)					\
-  _obstack_begin ((h), (size), 0,				\
-		  (void *(*) (long)) obstack_chunk_alloc,	\
-		  (void (*) (void *)) obstack_chunk_free)
+#define obstack_begin(h, size)						      \
+  _obstack_begin ((h), (size), 0,					      \
+                  _OBSTACK_CAST (void *(*) (size_t), obstack_chunk_alloc), \
+                  _OBSTACK_CAST (void (*) (void *), obstack_chunk_free))
 
-#define obstack_specify_allocation(h, size, alignment, chunkfun, freefun)  \
-  _obstack_begin ((h), (size), (alignment),				   \
-		  (void *(*) (long)) (chunkfun),			   \
-		  (void (*) (void *)) (freefun))
+#define obstack_specify_allocation(h, size, alignment, chunkfun, freefun)     \
+  _obstack_begin ((h), (size), (alignment),				      \
+                  _OBSTACK_CAST (void *(*) (size_t), chunkfun),		      \
+                  _OBSTACK_CAST (void (*) (void *), freefun))
 
 #define obstack_specify_allocation_with_arg(h, size, alignment, chunkfun, freefun, arg) \
-  _obstack_begin_1 ((h), (size), (alignment),				\
-		    (void *(*) (void *, long)) (chunkfun),		\
-		    (void (*) (void *, void *)) (freefun), (arg))
+  _obstack_begin_1 ((h), (size), (alignment),				      \
+                    _OBSTACK_CAST (void *(*) (void *, size_t), chunkfun),     \
+                    _OBSTACK_CAST (void (*) (void *, void *), freefun), arg)
 
-#define obstack_chunkfun(h, newchunkfun) \
-  ((h)->chunkfun.extra = (struct _obstack_chunk *(*)(void *, long)) (newchunkfun))
+#define obstack_chunkfun(h, newchunkfun)				      \
+  ((void) ((h)->chunkfun.extra = (void *(*) (void *, size_t)) (newchunkfun)))
 
-#define obstack_freefun(h, newfreefun) \
-  ((h)->freefun.extra = (void (*)(void *, struct _obstack_chunk *)) (newfreefun))
+#define obstack_freefun(h, newfreefun)					      \
+  ((void) ((h)->freefun.extra = (void (*) (void *, void *)) (newfreefun)))
 
-#define obstack_1grow_fast(h,achar) (*((h)->next_free)++ = (achar))
+#define obstack_1grow_fast(h, achar) ((void) (*((h)->next_free)++ = (achar)))
 
-#define obstack_blank_fast(h,n) ((h)->next_free += (n))
+#define obstack_blank_fast(h, n) ((void) ((h)->next_free += (n)))
 
 #define obstack_memory_used(h) _obstack_memory_used (h)
-\f
-#if defined __GNUC__ && defined __STDC__ && __STDC__
-/* NextStep 2.0 cc is really gcc 1.93 but it defines __GNUC__ = 2 and
-   does not implement __extension__.  But that compiler doesn't define
-   __GNUC_MINOR__.  */
-# if __GNUC__ < 2 || (__NeXT__ && !__GNUC_MINOR__)
+
+#if defined __GNUC__
+# if !defined __GNUC_MINOR__ || __GNUC__ * 1000 + __GNUC_MINOR__ < 2008
 #  define __extension__
 # endif
 
 /* For GNU C, if not -traditional,
    we can define these macros to compute all args only once
    without using a global variable.
-   Also, we can avoid using the `temp' slot, to make faster code.  */
-
-# define obstack_object_size(OBSTACK)					\
-  __extension__								\
-  ({ struct obstack const *__o = (OBSTACK);				\
-     (unsigned) (__o->next_free - __o->object_base); })
-
-# define obstack_room(OBSTACK)						\
-  __extension__								\
-  ({ struct obstack const *__o = (OBSTACK);				\
-     (unsigned) (__o->chunk_limit - __o->next_free); })
-
-# define obstack_make_room(OBSTACK,length)				\
-__extension__								\
-({ struct obstack *__o = (OBSTACK);					\
-   int __len = (length);						\
-   if (__o->chunk_limit - __o->next_free < __len)			\
-     _obstack_newchunk (__o, __len);					\
-   (void) 0; })
-
-# define obstack_empty_p(OBSTACK)					\
-  __extension__								\
-  ({ struct obstack const *__o = (OBSTACK);				\
-     (__o->chunk->prev == 0						\
-      && __o->next_free == __PTR_ALIGN ((char *) __o->chunk,		\
-					__o->chunk->contents,		\
-					__o->alignment_mask)); })
-
-# define obstack_grow(OBSTACK,where,length)				\
-__extension__								\
-({ struct obstack *__o = (OBSTACK);					\
-   int __len = (length);						\
-   if (__o->next_free + __len > __o->chunk_limit)			\
-     _obstack_newchunk (__o, __len);					\
-   memcpy (__o->next_free, where, __len);				\
-   __o->next_free += __len;						\
-   (void) 0; })
-
-# define obstack_grow0(OBSTACK,where,length)				\
-__extension__								\
-({ struct obstack *__o = (OBSTACK);					\
-   int __len = (length);						\
-   if (__o->next_free + __len + 1 > __o->chunk_limit)			\
-     _obstack_newchunk (__o, __len + 1);				\
-   memcpy (__o->next_free, where, __len);				\
-   __o->next_free += __len;						\
-   *(__o->next_free)++ = 0;						\
-   (void) 0; })
-
-# define obstack_1grow(OBSTACK,datum)					\
-__extension__								\
-({ struct obstack *__o = (OBSTACK);					\
-   if (__o->next_free + 1 > __o->chunk_limit)				\
-     _obstack_newchunk (__o, 1);					\
-   obstack_1grow_fast (__o, datum);					\
-   (void) 0; })
+   Also, we can avoid using the 'temp' slot, to make faster code.  */
+
+# define obstack_object_size(OBSTACK)					      \
+  __extension__								      \
+    ({ struct obstack const *__o = (OBSTACK);				      \
+       (_OBSTACK_SIZE_T) (__o->next_free - __o->object_base); })
+
+/* The local variable is named __o1 to avoid a shadowed variable
+   warning when invoked from other obstack macros.  */
+# define obstack_room(OBSTACK)						      \
+  __extension__								      \
+    ({ struct obstack const *__o1 = (OBSTACK);				      \
+       (_OBSTACK_SIZE_T) (__o1->chunk_limit - __o1->next_free); })
+
+# define obstack_make_room(OBSTACK, length)				      \
+  __extension__								      \
+    ({ struct obstack *__o = (OBSTACK);					      \
+       _OBSTACK_SIZE_T __len = (length);				      \
+       if (obstack_room (__o) < __len)					      \
+         _obstack_newchunk (__o, __len);				      \
+       (void) 0; })
+
+# define obstack_empty_p(OBSTACK)					      \
+  __extension__								      \
+    ({ struct obstack const *__o = (OBSTACK);				      \
+       (__o->chunk->prev == 0						      \
+        && __o->next_free == __PTR_ALIGN ((char *) __o->chunk,		      \
+                                          __o->chunk->contents,		      \
+                                          __o->alignment_mask)); })
+
+# define obstack_grow(OBSTACK, where, length)				      \
+  __extension__								      \
+    ({ struct obstack *__o = (OBSTACK);					      \
+       _OBSTACK_SIZE_T __len = (length);				      \
+       if (obstack_room (__o) < __len)					      \
+         _obstack_newchunk (__o, __len);				      \
+       memcpy (__o->next_free, where, __len);				      \
+       __o->next_free += __len;						      \
+       (void) 0; })
+
+# define obstack_grow0(OBSTACK, where, length)				      \
+  __extension__								      \
+    ({ struct obstack *__o = (OBSTACK);					      \
+       _OBSTACK_SIZE_T __len = (length);				      \
+       if (obstack_room (__o) < __len + 1)				      \
+         _obstack_newchunk (__o, __len + 1);				      \
+       memcpy (__o->next_free, where, __len);				      \
+       __o->next_free += __len;						      \
+       *(__o->next_free)++ = 0;						      \
+       (void) 0; })
+
+# define obstack_1grow(OBSTACK, datum)					      \
+  __extension__								      \
+    ({ struct obstack *__o = (OBSTACK);					      \
+       if (obstack_room (__o) < 1)					      \
+         _obstack_newchunk (__o, 1);					      \
+       obstack_1grow_fast (__o, datum); })
 
 /* These assume that the obstack alignment is good enough for pointers
    or ints, and that the data added so far to the current object
    shares that much alignment.  */
 
-# define obstack_ptr_grow(OBSTACK,datum)				\
-__extension__								\
-({ struct obstack *__o = (OBSTACK);					\
-   if (__o->next_free + sizeof (void *) > __o->chunk_limit)		\
-     _obstack_newchunk (__o, sizeof (void *));				\
-   obstack_ptr_grow_fast (__o, datum); })				\
-
-# define obstack_int_grow(OBSTACK,datum)				\
-__extension__								\
-({ struct obstack *__o = (OBSTACK);					\
-   if (__o->next_free + sizeof (int) > __o->chunk_limit)		\
-     _obstack_newchunk (__o, sizeof (int));				\
-   obstack_int_grow_fast (__o, datum); })
-
-# define obstack_ptr_grow_fast(OBSTACK,aptr)				\
-__extension__								\
-({ struct obstack *__o1 = (OBSTACK);					\
-   *(const void **) __o1->next_free = (aptr);				\
-   __o1->next_free += sizeof (const void *);				\
-   (void) 0; })
-
-# define obstack_int_grow_fast(OBSTACK,aint)				\
-__extension__								\
-({ struct obstack *__o1 = (OBSTACK);					\
-   *(int *) __o1->next_free = (aint);					\
-   __o1->next_free += sizeof (int);					\
-   (void) 0; })
-
-# define obstack_blank(OBSTACK,length)					\
-__extension__								\
-({ struct obstack *__o = (OBSTACK);					\
-   int __len = (length);						\
-   if (__o->chunk_limit - __o->next_free < __len)			\
-     _obstack_newchunk (__o, __len);					\
-   obstack_blank_fast (__o, __len);					\
-   (void) 0; })
-
-# define obstack_alloc(OBSTACK,length)					\
-__extension__								\
-({ struct obstack *__h = (OBSTACK);					\
-   obstack_blank (__h, (length));					\
-   obstack_finish (__h); })
-
-# define obstack_copy(OBSTACK,where,length)				\
-__extension__								\
-({ struct obstack *__h = (OBSTACK);					\
-   obstack_grow (__h, (where), (length));				\
-   obstack_finish (__h); })
-
-# define obstack_copy0(OBSTACK,where,length)				\
-__extension__								\
-({ struct obstack *__h = (OBSTACK);					\
-   obstack_grow0 (__h, (where), (length));				\
-   obstack_finish (__h); })
-
-/* The local variable is named __o1 to avoid a name conflict
-   when obstack_blank is called.  */
-# define obstack_finish(OBSTACK)					\
-__extension__								\
-({ struct obstack *__o1 = (OBSTACK);					\
-   void *__value = (void *) __o1->object_base;				\
-   if (__o1->next_free == __value)					\
-     __o1->maybe_empty_object = 1;					\
-   __o1->next_free							\
-     = __PTR_ALIGN (__o1->object_base, __o1->next_free,			\
-		    __o1->alignment_mask);				\
-   if (__o1->next_free - (char *)__o1->chunk				\
-       > __o1->chunk_limit - (char *)__o1->chunk)			\
-     __o1->next_free = __o1->chunk_limit;				\
-   __o1->object_base = __o1->next_free;					\
-   __value; })
-
-# define obstack_free(OBSTACK, OBJ)					\
-__extension__								\
-({ struct obstack *__o = (OBSTACK);					\
-   void *__obj = (OBJ);							\
-   if (__obj > (void *)__o->chunk && __obj < (void *)__o->chunk_limit)  \
-     __o->next_free = __o->object_base = (char *)__obj;			\
-   else (obstack_free) (__o, __obj); })
-\f
-#else /* not __GNUC__ or not __STDC__ */
-
-# define obstack_object_size(h) \
- (unsigned) ((h)->next_free - (h)->object_base)
-
-# define obstack_room(h)		\
- (unsigned) ((h)->chunk_limit - (h)->next_free)
-
-# define obstack_empty_p(h) \
- ((h)->chunk->prev == 0							\
-  && (h)->next_free == __PTR_ALIGN ((char *) (h)->chunk,		\
-				    (h)->chunk->contents,		\
-				    (h)->alignment_mask))
+# define obstack_ptr_grow(OBSTACK, datum)				      \
+  __extension__								      \
+    ({ struct obstack *__o = (OBSTACK);					      \
+       if (obstack_room (__o) < sizeof (void *))			      \
+         _obstack_newchunk (__o, sizeof (void *));			      \
+       obstack_ptr_grow_fast (__o, datum); })
+
+# define obstack_int_grow(OBSTACK, datum)				      \
+  __extension__								      \
+    ({ struct obstack *__o = (OBSTACK);					      \
+       if (obstack_room (__o) < sizeof (int))				      \
+         _obstack_newchunk (__o, sizeof (int));				      \
+       obstack_int_grow_fast (__o, datum); })
+
+# define obstack_ptr_grow_fast(OBSTACK, aptr)				      \
+  __extension__								      \
+    ({ struct obstack *__o1 = (OBSTACK);				      \
+       void *__p1 = __o1->next_free;					      \
+       *(const void **) __p1 = (aptr);					      \
+       __o1->next_free += sizeof (const void *);			      \
+       (void) 0; })
+
+# define obstack_int_grow_fast(OBSTACK, aint)				      \
+  __extension__								      \
+    ({ struct obstack *__o1 = (OBSTACK);				      \
+       void *__p1 = __o1->next_free;					      \
+       *(int *) __p1 = (aint);						      \
+       __o1->next_free += sizeof (int);					      \
+       (void) 0; })
+
+# define obstack_blank(OBSTACK, length)					      \
+  __extension__								      \
+    ({ struct obstack *__o = (OBSTACK);					      \
+       _OBSTACK_SIZE_T __len = (length);				      \
+       if (obstack_room (__o) < __len)					      \
+         _obstack_newchunk (__o, __len);				      \
+       obstack_blank_fast (__o, __len); })
+
+# define obstack_alloc(OBSTACK, length)					      \
+  __extension__								      \
+    ({ struct obstack *__h = (OBSTACK);					      \
+       obstack_blank (__h, (length));					      \
+       obstack_finish (__h); })
+
+# define obstack_copy(OBSTACK, where, length)				      \
+  __extension__								      \
+    ({ struct obstack *__h = (OBSTACK);					      \
+       obstack_grow (__h, (where), (length));				      \
+       obstack_finish (__h); })
+
+# define obstack_copy0(OBSTACK, where, length)				      \
+  __extension__								      \
+    ({ struct obstack *__h = (OBSTACK);					      \
+       obstack_grow0 (__h, (where), (length));				      \
+       obstack_finish (__h); })
+
+/* The local variable is named __o1 to avoid a shadowed variable
+   warning when invoked from other obstack macros, typically obstack_free.  */
+# define obstack_finish(OBSTACK)					      \
+  __extension__								      \
+    ({ struct obstack *__o1 = (OBSTACK);				      \
+       void *__value = (void *) __o1->object_base;			      \
+       if (__o1->next_free == __value)					      \
+         __o1->maybe_empty_object = 1;					      \
+       __o1->next_free							      \
+         = __PTR_ALIGN (__o1->object_base, __o1->next_free,		      \
+                        __o1->alignment_mask);				      \
+       if ((size_t) (__o1->next_free - (char *) __o1->chunk)		      \
+           > (size_t) (__o1->chunk_limit - (char *) __o1->chunk))	      \
+         __o1->next_free = __o1->chunk_limit;				      \
+       __o1->object_base = __o1->next_free;				      \
+       __value; })
+
+# define obstack_free(OBSTACK, OBJ)					      \
+  __extension__								      \
+    ({ struct obstack *__o = (OBSTACK);					      \
+       void *__obj = (void *) (OBJ);					      \
+       if (__obj > (void *) __o->chunk && __obj < (void *) __o->chunk_limit)  \
+         __o->next_free = __o->object_base = (char *) __obj;		      \
+       else								      \
+         _obstack_free (__o, __obj); })
+
+#else /* not __GNUC__ */
+
+# define obstack_object_size(h)						      \
+  ((_OBSTACK_SIZE_T) ((h)->next_free - (h)->object_base))
+
+# define obstack_room(h)						      \
+  ((_OBSTACK_SIZE_T) ((h)->chunk_limit - (h)->next_free))
+
+# define obstack_empty_p(h)						      \
+  ((h)->chunk->prev == 0						      \
+   && (h)->next_free == __PTR_ALIGN ((char *) (h)->chunk,		      \
+                                     (h)->chunk->contents,		      \
+                                     (h)->alignment_mask))
 
 /* Note that the call to _obstack_newchunk is enclosed in (..., 0)
    so that we can avoid having void expressions
@@ -422,88 +455,92 @@ __extension__								\
    Casting the third operand to void was tried before,
    but some compilers won't accept it.  */
 
-# define obstack_make_room(h,length)					\
-( (h)->temp.tempint = (length),						\
-  (((h)->next_free + (h)->temp.tempint > (h)->chunk_limit)		\
-   ? (_obstack_newchunk ((h), (h)->temp.tempint), 0) : 0))
-
-# define obstack_grow(h,where,length)					\
-( (h)->temp.tempint = (length),						\
-  (((h)->next_free + (h)->temp.tempint > (h)->chunk_limit)		\
-   ? (_obstack_newchunk ((h), (h)->temp.tempint), 0) : 0),		\
-  memcpy ((h)->next_free, where, (h)->temp.tempint),			\
-  (h)->next_free += (h)->temp.tempint)
-
-# define obstack_grow0(h,where,length)					\
-( (h)->temp.tempint = (length),						\
-  (((h)->next_free + (h)->temp.tempint + 1 > (h)->chunk_limit)		\
-   ? (_obstack_newchunk ((h), (h)->temp.tempint + 1), 0) : 0),		\
-  memcpy ((h)->next_free, where, (h)->temp.tempint),			\
-  (h)->next_free += (h)->temp.tempint,					\
-  *((h)->next_free)++ = 0)
-
-# define obstack_1grow(h,datum)						\
-( (((h)->next_free + 1 > (h)->chunk_limit)				\
-   ? (_obstack_newchunk ((h), 1), 0) : 0),				\
-  obstack_1grow_fast (h, datum))
-
-# define obstack_ptr_grow(h,datum)					\
-( (((h)->next_free + sizeof (char *) > (h)->chunk_limit)		\
-   ? (_obstack_newchunk ((h), sizeof (char *)), 0) : 0),		\
-  obstack_ptr_grow_fast (h, datum))
-
-# define obstack_int_grow(h,datum)					\
-( (((h)->next_free + sizeof (int) > (h)->chunk_limit)			\
-   ? (_obstack_newchunk ((h), sizeof (int)), 0) : 0),			\
-  obstack_int_grow_fast (h, datum))
-
-# define obstack_ptr_grow_fast(h,aptr)					\
-  (((const void **) ((h)->next_free += sizeof (void *)))[-1] = (aptr))
-
-# define obstack_int_grow_fast(h,aint)					\
-  (((int *) ((h)->next_free += sizeof (int)))[-1] = (aint))
-
-# define obstack_blank(h,length)					\
-( (h)->temp.tempint = (length),						\
-  (((h)->chunk_limit - (h)->next_free < (h)->temp.tempint)		\
-   ? (_obstack_newchunk ((h), (h)->temp.tempint), 0) : 0),		\
-  obstack_blank_fast (h, (h)->temp.tempint))
-
-# define obstack_alloc(h,length)					\
- (obstack_blank ((h), (length)), obstack_finish ((h)))
-
-# define obstack_copy(h,where,length)					\
- (obstack_grow ((h), (where), (length)), obstack_finish ((h)))
-
-# define obstack_copy0(h,where,length)					\
- (obstack_grow0 ((h), (where), (length)), obstack_finish ((h)))
-
-# define obstack_finish(h)						\
-( ((h)->next_free == (h)->object_base					\
-   ? (((h)->maybe_empty_object = 1), 0)					\
-   : 0),								\
-  (h)->temp.tempptr = (h)->object_base,					\
-  (h)->next_free							\
-    = __PTR_ALIGN ((h)->object_base, (h)->next_free,			\
-		   (h)->alignment_mask),				\
-  (((h)->next_free - (char *) (h)->chunk				\
-    > (h)->chunk_limit - (char *) (h)->chunk)				\
-   ? ((h)->next_free = (h)->chunk_limit) : 0),				\
-  (h)->object_base = (h)->next_free,					\
-  (h)->temp.tempptr)
-
-# define obstack_free(h,obj)						\
-( (h)->temp.tempint = (char *) (obj) - (char *) (h)->chunk,		\
-  ((((h)->temp.tempint > 0						\
-    && (h)->temp.tempint < (h)->chunk_limit - (char *) (h)->chunk))	\
-   ? (int) ((h)->next_free = (h)->object_base				\
-	    = (h)->temp.tempint + (char *) (h)->chunk)			\
-   : (((obstack_free) ((h), (h)->temp.tempint + (char *) (h)->chunk), 0), 0)))
-
-#endif /* not __GNUC__ or not __STDC__ */
+# define obstack_make_room(h, length)					      \
+  ((h)->temp.i = (length),						      \
+   ((obstack_room (h) < (h)->temp.i)					      \
+    ? (_obstack_newchunk (h, (h)->temp.i), 0) : 0),			      \
+   (void) 0)
+
+# define obstack_grow(h, where, length)					      \
+  ((h)->temp.i = (length),						      \
+   ((obstack_room (h) < (h)->temp.i)					      \
+   ? (_obstack_newchunk ((h), (h)->temp.i), 0) : 0),			      \
+   memcpy ((h)->next_free, where, (h)->temp.i),				      \
+   (h)->next_free += (h)->temp.i,					      \
+   (void) 0)
+
+# define obstack_grow0(h, where, length)				      \
+  ((h)->temp.i = (length),						      \
+   ((obstack_room (h) < (h)->temp.i + 1)				      \
+   ? (_obstack_newchunk ((h), (h)->temp.i + 1), 0) : 0),		      \
+   memcpy ((h)->next_free, where, (h)->temp.i),				      \
+   (h)->next_free += (h)->temp.i,					      \
+   *((h)->next_free)++ = 0,						      \
+   (void) 0)
+
+# define obstack_1grow(h, datum)					      \
+  (((obstack_room (h) < 1)						      \
+    ? (_obstack_newchunk ((h), 1), 0) : 0),				      \
+   obstack_1grow_fast (h, datum))
+
+# define obstack_ptr_grow(h, datum)					      \
+  (((obstack_room (h) < sizeof (char *))				      \
+    ? (_obstack_newchunk ((h), sizeof (char *)), 0) : 0),		      \
+   obstack_ptr_grow_fast (h, datum))
+
+# define obstack_int_grow(h, datum)					      \
+  (((obstack_room (h) < sizeof (int))					      \
+    ? (_obstack_newchunk ((h), sizeof (int)), 0) : 0),			      \
+   obstack_int_grow_fast (h, datum))
+
+# define obstack_ptr_grow_fast(h, aptr)					      \
+  (((const void **) ((h)->next_free += sizeof (void *)))[-1] = (aptr),	      \
+   (void) 0)
+
+# define obstack_int_grow_fast(h, aint)					      \
+  (((int *) ((h)->next_free += sizeof (int)))[-1] = (aint),		      \
+   (void) 0)
+
+# define obstack_blank(h, length)					      \
+  ((h)->temp.i = (length),						      \
+   ((obstack_room (h) < (h)->temp.i)					      \
+   ? (_obstack_newchunk ((h), (h)->temp.i), 0) : 0),			      \
+   obstack_blank_fast (h, (h)->temp.i))
+
+# define obstack_alloc(h, length)					      \
+  (obstack_blank ((h), (length)), obstack_finish ((h)))
+
+# define obstack_copy(h, where, length)					      \
+  (obstack_grow ((h), (where), (length)), obstack_finish ((h)))
+
+# define obstack_copy0(h, where, length)				      \
+  (obstack_grow0 ((h), (where), (length)), obstack_finish ((h)))
+
+# define obstack_finish(h)						      \
+  (((h)->next_free == (h)->object_base					      \
+    ? (((h)->maybe_empty_object = 1), 0)				      \
+    : 0),								      \
+   (h)->temp.p = (h)->object_base,					      \
+   (h)->next_free							      \
+     = __PTR_ALIGN ((h)->object_base, (h)->next_free,			      \
+                    (h)->alignment_mask),				      \
+   (((size_t) ((h)->next_free - (char *) (h)->chunk)			      \
+     > (size_t) ((h)->chunk_limit - (char *) (h)->chunk))		      \
+   ? ((h)->next_free = (h)->chunk_limit) : 0),				      \
+   (h)->object_base = (h)->next_free,					      \
+   (h)->temp.p)
+
+# define obstack_free(h, obj)						      \
+  ((h)->temp.p = (void *) (obj),					      \
+   (((h)->temp.p > (void *) (h)->chunk					      \
+     && (h)->temp.p < (void *) (h)->chunk_limit)			      \
+    ? (void) ((h)->next_free = (h)->object_base = (char *) (h)->temp.p)       \
+    : _obstack_free ((h), (h)->temp.p)))
+
+#endif /* not __GNUC__ */
 
 #ifdef __cplusplus
-}	/* C++ */
+}       /* C++ */
 #endif
 
-#endif /* obstack.h */
+#endif /* _OBSTACK_H */
-- 
2.22.0.589.g5bd7971b91


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v1 2/4] SQUASH??? compat/obstack: fix portability issues
  2019-06-14 10:00     ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream SZEDER Gábor
  2019-06-14 10:00       ` [PATCH v1 1/4] " SZEDER Gábor
@ 2019-06-14 10:00       ` SZEDER Gábor
  2019-06-14 10:00       ` [PATCH v1 3/4] SQUASH??? compat/obstack: fix build errors with Clang SZEDER Gábor
                         ` (4 subsequent siblings)
  6 siblings, 0 replies; 90+ messages in thread
From: SZEDER Gábor @ 2019-06-14 10:00 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Johannes Schindelin, Ramsay Jones, SZEDER Gábor

This is sort-of a cherry-pick of d190a0875f (obstack: Fix portability
issues, 2011-08-28), which is necessary to make 'compat/obstack.c'
compile again.  Only "sort-of a cherry-pick", because the divergence
between upstream and our copy was just too big and I gave up on the
conflict resolution, and instead made the still necessary/applicable
edits in the spirit of d190a0875f by hand.

With this patch 'compat/obstack.c' can be compiled with GCC both on
Linux and on macOS (well, at least in our 'osx-gcc' build job on
Travis CI), but, alas, not with Clang.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 compat/obstack.c | 29 ++++-------------------------
 1 file changed, 4 insertions(+), 25 deletions(-)

diff --git a/compat/obstack.c b/compat/obstack.c
index 6949111e4d..17fa95d46c 100644
--- a/compat/obstack.c
+++ b/compat/obstack.c
@@ -16,13 +16,9 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
-
-#ifdef _LIBC
-# include <obstack.h>
-#else
-# include <config.h>
-# include "obstack.h"
-#endif
+#include "git-compat-util.h"
+#include <gettext.h>
+#include "obstack.h"
 
 /* NOTE BEFORE MODIFYING THIS FILE: _OBSTACK_INTERFACE_VERSION in
    obstack.h must be incremented whenever callers compiled using an old
@@ -305,23 +301,6 @@ _obstack_memory_used (struct obstack *h)
 /* Define the error handler.  */
 #  include <stdio.h>
 
-/* Exit value used when 'print_and_abort' is used.  */
-#  ifdef _LIBC
-int obstack_exit_failure = EXIT_FAILURE;
-#  else
-#   include "exitfail.h"
-#   define obstack_exit_failure exit_failure
-#  endif
-
-#  ifdef _LIBC
-#   include <libintl.h>
-#  else
-#   include "gettext.h"
-#  endif
-#  ifndef _
-#   define _(msgid) gettext (msgid)
-#  endif
-
 #  ifdef _LIBC
 #   include <libio/iolibio.h>
 #  endif
@@ -339,7 +318,7 @@ print_and_abort (void)
 #  else
   fprintf (stderr, "%s\n", _("memory exhausted"));
 #  endif
-  exit (obstack_exit_failure);
+  exit (1);
 }
 
 /* The functions allocating more room by calling 'obstack_chunk_alloc'
-- 
2.22.0.589.g5bd7971b91


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v1 3/4] SQUASH??? compat/obstack: fix build errors with Clang
  2019-06-14 10:00     ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream SZEDER Gábor
  2019-06-14 10:00       ` [PATCH v1 1/4] " SZEDER Gábor
  2019-06-14 10:00       ` [PATCH v1 2/4] SQUASH??? compat/obstack: fix portability issues SZEDER Gábor
@ 2019-06-14 10:00       ` SZEDER Gábor
  2019-06-14 10:00       ` [PATCH v1 4/4] compat/obstack: fix some sparse warnings SZEDER Gábor
                         ` (3 subsequent siblings)
  6 siblings, 0 replies; 90+ messages in thread
From: SZEDER Gábor @ 2019-06-14 10:00 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Johannes Schindelin, Ramsay Jones, SZEDER Gábor

Compiling 'compat/obstack.c' with Clang on Linux and macOS fails with
different errors:

On Linux:

      CC compat/obstack.o
  compat/obstack.c:330:31: error: incompatible pointer types initializing
        'void (*)(void) __attribute__((noreturn))' with an expression of type
        'void (void)' [-Werror,-Wincompatible-pointer-types]
  __attribute_noreturn__ void (*obstack_alloc_failed_handler) (void)
                                ^

Remove '__attribute_noreturn__' from the function's declaration and
definition to resolve this build error.

On macOS:

  compat/obstack.h:223:3: error: expected function body after function declarator
    __attribute_pure__;
    ^
  compat/obstack.h:151:29: note: expanded from macro '__attribute_pure__'
  # define __attribute_pure__ _GL_ATTRIBUTE_PURE

Remove '__attribute_pure__' to resolve this build error.

With this patch it's now possible to compile 'compat/obstack.c' with
both GCC and Clang, on both Linux and macOS.
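
For illustration only, here is a minimal sketch of the shape the
handler declaration takes once the attribute is dropped.  All names
below (alloc_failed_handler, jump_out, try_alloc) are hypothetical
stand-ins and not part of obstack; the point is merely that a plain
'void (*) (void)' pointer can be initialized with any 'void (void)'
function without a pointer type mismatch, while the handler can still
avoid returning, here by using longjmp():

```c
#include <setjmp.h>

static jmp_buf recover;
static int handler_calls;

/* Hypothetical handler: it must not return, so it jumps back out. */
static void jump_out(void)
{
	handler_calls++;
	longjmp(recover, 1);
}

/* Plain function pointer; no noreturn attribute on the pointer type. */
static void (*alloc_failed_handler)(void) = jump_out;

/* Returns 0 on success, -1 if the failure handler was invoked. */
static int try_alloc(int simulate_failure)
{
	if (setjmp(recover))
		return -1;
	if (simulate_failure)
		alloc_failed_handler();
	return 0;
}
```

Calling try_alloc(1) takes the handler path and comes back through
longjmp(), so the caller sees -1 instead of the handler returning.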

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 compat/obstack.c | 4 ++--
 compat/obstack.h | 5 ++---
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/compat/obstack.c b/compat/obstack.c
index 17fa95d46c..6ef8cecb8a 100644
--- a/compat/obstack.c
+++ b/compat/obstack.c
@@ -231,7 +231,7 @@ _obstack_newchunk (struct obstack *h, _OBSTACK_SIZE_T length)
 
 /* Suppress -Wmissing-prototypes warning.  We don't want to declare this in
    obstack.h because it is just for debugging.  */
-int _obstack_allocated_p (struct obstack *h, void *obj) __attribute_pure__;
+int _obstack_allocated_p (struct obstack *h, void *obj);
 
 int
 _obstack_allocated_p (struct obstack *h, void *obj)
@@ -327,7 +327,7 @@ print_and_abort (void)
    abort gracefully or use longjump - but shouldn't return.  This
    variable by default points to the internal function
    'print_and_abort'.  */
-__attribute_noreturn__ void (*obstack_alloc_failed_handler) (void)
+void (*obstack_alloc_failed_handler) (void)
   = print_and_abort;
 # endif /* !_OBSTACK_NO_ERROR_HANDLER */
 #endif /* !_OBSTACK_ELIDE_CODE */
diff --git a/compat/obstack.h b/compat/obstack.h
index 811de588a4..f8f9625121 100644
--- a/compat/obstack.h
+++ b/compat/obstack.h
@@ -219,15 +219,14 @@ extern int _obstack_begin_1 (struct obstack *,
                              _OBSTACK_SIZE_T, _OBSTACK_SIZE_T,
                              void *(*) (void *, size_t),
                              void (*) (void *, void *), void *);
-extern _OBSTACK_SIZE_T _obstack_memory_used (struct obstack *)
-  __attribute_pure__;
+extern _OBSTACK_SIZE_T _obstack_memory_used (struct obstack *);
 
 
 /* Error handler called when 'obstack_chunk_alloc' failed to allocate
    more memory.  This can be set to a user defined function which
    should either abort gracefully or use longjump - but shouldn't
    return.  The default action is to print a message and abort.  */
-extern __attribute_noreturn__ void (*obstack_alloc_failed_handler) (void);
+extern void (*obstack_alloc_failed_handler) (void);
 
 /* Exit value used when 'print_and_abort' is used.  */
 extern int obstack_exit_failure;
-- 
2.22.0.589.g5bd7971b91


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v1 4/4] compat/obstack: fix some sparse warnings
  2019-06-14 10:00     ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream SZEDER Gábor
                         ` (2 preceding siblings ...)
  2019-06-14 10:00       ` [PATCH v1 3/4] SQUASH??? compat/obstack: fix build errors with Clang SZEDER Gábor
@ 2019-06-14 10:00       ` SZEDER Gábor
  2019-06-14 17:57       ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream Jeff King
                         ` (2 subsequent siblings)
  6 siblings, 0 replies; 90+ messages in thread
From: SZEDER Gábor @ 2019-06-14 10:00 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Johannes Schindelin, Ramsay Jones, SZEDER Gábor

'compat/obstack.c' occasionally assigns/compares a plain 0 to a
pointer, which triggers sparse warnings.  Use NULL instead.
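
As a minimal sketch of the pattern (the struct and function here are
hypothetical and not taken from obstack.c), walking a chunk list with
NULL spelled out looks like this; comparing against or assigning a
plain 0 in the same places is what sparse complains about:

```c
#include <stddef.h>

/* Hypothetical stand-in for a chunk list node. */
struct chunk {
	struct chunk *prev;
};

/* Count the chunks reachable from 'lp' by following 'prev' links. */
static size_t count_chunks(const struct chunk *lp)
{
	size_t n = 0;

	for (; lp != NULL; lp = lp->prev)	/* NULL, not a plain 0 */
		n++;
	return n;
}
```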

This is basically a cherry-pick of 3254310863 (obstack.c: Fix some
sparse warnings, 2011-09-11) on top of the just updated code from
upstream.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 compat/obstack.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/compat/obstack.c b/compat/obstack.c
index 6ef8cecb8a..5fff087cd3 100644
--- a/compat/obstack.c
+++ b/compat/obstack.c
@@ -135,7 +135,7 @@ _obstack_begin_worker (struct obstack *h,
   h->next_free = h->object_base = __PTR_ALIGN ((char *) chunk, chunk->contents,
                                                alignment - 1);
   h->chunk_limit = chunk->limit = (char *) chunk + h->chunk_size;
-  chunk->prev = 0;
+  chunk->prev = NULL;
   /* The initial chunk now contains no empty object.  */
   h->maybe_empty_object = 0;
   h->alloc_failed = 0;
@@ -178,7 +178,7 @@ void
 _obstack_newchunk (struct obstack *h, _OBSTACK_SIZE_T length)
 {
   struct _obstack_chunk *old_chunk = h->chunk;
-  struct _obstack_chunk *new_chunk = 0;
+  struct _obstack_chunk *new_chunk = NULL;
   size_t obj_size = h->next_free - h->object_base;
   char *object_base;
 
@@ -243,12 +243,12 @@ _obstack_allocated_p (struct obstack *h, void *obj)
   /* We use >= rather than > since the object cannot be exactly at
      the beginning of the chunk but might be an empty object exactly
      at the end of an adjacent chunk.  */
-  while (lp != 0 && ((void *) lp >= obj || (void *) (lp)->limit < obj))
+  while (lp != NULL && ((void *) lp >= obj || (void *) (lp)->limit < obj))
     {
       plp = lp->prev;
       lp = plp;
     }
-  return lp != 0;
+  return lp != NULL;
 }
 
 /* Free objects in obstack H, including OBJ and everything allocate
@@ -264,7 +264,7 @@ _obstack_free (struct obstack *h, void *obj)
   /* We use >= because there cannot be an object at the beginning of a chunk.
      But there can be an empty object at that address
      at the end of another chunk.  */
-  while (lp != 0 && ((void *) lp >= obj || (void *) (lp)->limit < obj))
+  while (lp != NULL && ((void *) lp >= obj || (void *) (lp)->limit < obj))
     {
       plp = lp->prev;
       call_freefun (h, lp);
@@ -279,7 +279,7 @@ _obstack_free (struct obstack *h, void *obj)
       h->chunk_limit = lp->limit;
       h->chunk = lp;
     }
-  else if (obj != 0)
+  else if (obj != NULL)
     /* obj is not in any of the chunks! */
     abort ();
 }
@@ -290,7 +290,7 @@ _obstack_memory_used (struct obstack *h)
   struct _obstack_chunk *lp;
   _OBSTACK_SIZE_T nbytes = 0;
 
-  for (lp = h->chunk; lp != 0; lp = lp->prev)
+  for (lp = h->chunk; lp != NULL; lp = lp->prev)
     {
       nbytes += lp->limit - (char *) lp;
     }
-- 
2.22.0.589.g5bd7971b91


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 2/4] kwset: allow building with GCC 8
  2019-06-14  9:53   ` SZEDER Gábor
  2019-06-14 10:00     ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream SZEDER Gábor
@ 2019-06-14 16:12     ` Junio C Hamano
  2019-06-17 18:26       ` SZEDER Gábor
  1 sibling, 1 reply; 90+ messages in thread
From: Junio C Hamano @ 2019-06-14 16:12 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Johannes Schindelin via GitGitGadget, git, Johannes Schindelin

SZEDER Gábor <szeder.dev@gmail.com> writes:

>> Now, the proper thing to do would be to switch to `size_t`. But this
>> would make us deviate from the "upstream" code even further,
>
> This is not entirely true: upstream already uses 'size_t', so the
> switch would actually bring our copy closer to upstream.

Ah, earlier I said that within the context of how kwset uses obstack,
it is perfectly proper to fix it like the patch in question did, but
the upstream already using size_t changes the picture quite a bit.
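
To make the difference concrete, a size_t-based room check in the
style of upstream's 'obstack_room (__o) < __len' might look roughly
like the sketch below.  'struct arena' and both helpers are made-up
names for illustration, not the obstack API, and the sketch assumes
the size type is size_t:

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct obstack. */
struct arena {
	char *next_free;	/* first free byte in the current chunk */
	char *chunk_limit;	/* one past the end of the current chunk */
};

/* Bytes still available in the current chunk, as an unsigned size. */
static size_t arena_room(const struct arena *a)
{
	return (size_t) (a->chunk_limit - a->next_free);
}

/* Nonzero if 'len' more bytes would not fit in the current chunk. */
static int arena_needs_new_chunk(const struct arena *a, size_t len)
{
	return arena_room(a) < len;
}
```

Both sides of the comparison are unsigned and of the same width, so
there is no signed length that a compiler could assume to be negative
or overly large.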

I do not mind updating our copy of obstack, but make sure you pick
the version with license compatible with ours.

Thanks.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH v1 0/4] compat/obstack: update from upstream
  2019-06-14 10:00     ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream SZEDER Gábor
                         ` (3 preceding siblings ...)
  2019-06-14 10:00       ` [PATCH v1 4/4] compat/obstack: fix some sparse warnings SZEDER Gábor
@ 2019-06-14 17:57       ` Jeff King
  2019-06-14 18:19       ` Junio C Hamano
  2019-06-14 20:30       ` Ramsay Jones
  6 siblings, 0 replies; 90+ messages in thread
From: Jeff King @ 2019-06-14 17:57 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, Junio C Hamano, Johannes Schindelin, Ramsay Jones

On Fri, Jun 14, 2019 at 12:00:55PM +0200, SZEDER Gábor wrote:

> Update 'compat/obstack.{c,h}' from upstream, because they already use
> 'size_t' instead of 'long' in places that might eventually end up as
> an argument to malloc(), which might solve build errors with GCC 8 on
> Windows.
> 
> The first patch just imports from upstream and doesn't modify anything
> at all, and, consequently, it can't be compiled because of a screenful
> or two of errors.  This is bad for future bisects, of course.
> 
> OTOH, adding all the necessary build fixes right away makes review
> harder...

One thing about your approach that makes it hard to review is that the
first commit obliterates all of our local changes, and then you have to
re-apply them individually. Looking at "git log" there aren't very many
in this case, so it's pretty easy to be sure you got them all (in some
cases this can be particularly nasty if the changes were themselves part
of conflict resolution, and so you have to pick them out of a merge).

I think a flow that better matches "what really happened" is to do more
of a vendor-branch approach: have a line of history that represents the
upstream changes (each one obliterating the last), and then periodically
merge that into our fork.

That can even retain bisectability as long as the commits along the
vendor branch don't actually try to build the code. Unfortunately our
initial import does try to build, so I think it already wrecks
bisectability, but we can undo that now. So, e.g.:

  # start at e831171d67 (Add obstack.[ch] from EGLIBC 2.10, 2011-08-21)
  git checkout -b upstream-obstack e831171d67

  # undo build changes to restore bisection; ideally this would have
  # been done back then, but it's too late now
  sed -i /obstack/d Makefile
  git commit -am 'strip out obstack building'

  # but of course in our merged version we want that back, so let's
  # do a noop merge to represent that.
  git checkout master ;# or whatever feature branch you're working on
  git merge -s ours upstream-obstack

  # and now with a sane vendor branch established, we can proceed to do
  # a real update there
  git checkout upstream-obstack
  cp /path/to/obstack.[ch] compat/
  git commit -am 'update obstack'

  # and now we are free to merge that in, getting a real 3-way merge
  # between our changes and what happened upstream.
  git checkout master
  git merge upstream-obstack

Now, if you try this you may find that the conflicts are pretty horrid.
And the result may end up way less readable than your cherry-picks (and
harder to resolve in the first place). I claim only that:

  1. This represents in the history graph more clearly the actual
     sequence of events.

  2. It saves you from digging up the set of changes that have been
     applied since our last upstream import.

So in this case the way you did it may well be the best way. But I offer
it as an alternative. :)

-Peff

* Re: [RFC/PATCH v1 0/4] compat/obstack: update from upstream
  2019-06-14 10:00     ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream SZEDER Gábor
                         ` (4 preceding siblings ...)
  2019-06-14 17:57       ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream Jeff King
@ 2019-06-14 18:19       ` Junio C Hamano
  2019-06-14 20:30       ` Ramsay Jones
  6 siblings, 0 replies; 90+ messages in thread
From: Junio C Hamano @ 2019-06-14 18:19 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, Johannes Schindelin, Ramsay Jones

SZEDER Gábor <szeder.dev@gmail.com> writes:

> And here is an all-green build of these patches on Travis CI:
>
>   https://travis-ci.org/szeder/git/builds/545645247
>
> (and one bonus patch on top to deal with some Homebrew nonsense)

Is this the one that is making all of the jobs pass in the above
output, including the mac gcc one?  It would be wonderful to have it
separately and fast-tracked ;-)

-- >8 --
From: SZEDER Gábor <szeder.dev@gmail.com>
Date: Wed, 3 Apr 2019 02:49:47 +0200
Subject: [PATCH] ci: make Homebrew's operations faster

---
 ci/install-dependencies.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ci/install-dependencies.sh b/ci/install-dependencies.sh
index 7f6acdd803..f804b40ddd 100755
--- a/ci/install-dependencies.sh
+++ b/ci/install-dependencies.sh
@@ -34,7 +34,7 @@ linux-clang|linux-gcc)
 	popd
 	;;
 osx-clang|osx-gcc)
-	brew update >/dev/null
+	export HOMEBREW_NO_INSTALL_CLEANUP=1 HOMEBREW_NO_AUTO_UPDATE=1
 	# Uncomment this if you want to run perf tests:
 	# brew install gnu-time
 	test -z "$BREW_INSTALL_PACKAGES" ||
-- 
2.22.0-68-g0aae918dd9


* Re: [RFC/PATCH v1 0/4] compat/obstack: update from upstream
  2019-06-14 10:00     ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream SZEDER Gábor
                         ` (5 preceding siblings ...)
  2019-06-14 18:19       ` Junio C Hamano
@ 2019-06-14 20:30       ` Ramsay Jones
  2019-06-14 21:24         ` Ramsay Jones
  2019-06-17 18:36         ` SZEDER Gábor
  6 siblings, 2 replies; 90+ messages in thread
From: Ramsay Jones @ 2019-06-14 20:30 UTC (permalink / raw)
  To: SZEDER Gábor, git; +Cc: Junio C Hamano, Johannes Schindelin



On 14/06/2019 11:00, SZEDER Gábor wrote:
> Update 'compat/obstack.{c,h}' from upstream, because they already use
> 'size_t' instead of 'long' in places that might eventually end up as
> an argument to malloc(), which might solve build errors with GCC 8 on
> Windows.
> 
> The first patch just imports from upstream and doesn't modify anything
> at all, and, consequently, it can't be compiled because of a screenful
> or two of errors.  This is bad for future bisects, of course.
> 
> OTOH, adding all the necessary build fixes right away makes review
> harder...
> 
> I'm not sure how to deal with this situation, so here is a series with
> the fixes in separate patches for review, for now.  If there's an
> agreement that this is the direction to take, then I'll squash in the
> fixes in the first patch and touch up the resulting commit message.
> 
> 
> Ramsay, could you please run sparse on top of this patch series to
> make sure that I caught and converted all "0 instead of NULL" usages
> in the last patch?  Thanks.

I applied your patches to current master (@0aae918dd9) and, since
you dropped the final hunk of commit 3254310863 ("obstack.c: Fix
some sparse warnings", 2011-09-11), sparse complains, thus:

  $ diff sp-out sp-out1
  27a28,30
  > compat/obstack.c:331:5: warning: incorrect type in initializer (different modifiers)
  > compat/obstack.c:331:5:    expected void ( *[addressable] [toplevel] obstack_alloc_failed_handler )( ... )
  > compat/obstack.c:331:5:    got void ( [noreturn] * )( ... )
  $ 

So, yes you did catch all "using plain integer as NULL pointer"
warnings! :-D

Thanks.

ATB,
Ramsay Jones

* Re: [RFC/PATCH v1 0/4] compat/obstack: update from upstream
  2019-06-14 20:30       ` Ramsay Jones
@ 2019-06-14 21:24         ` Ramsay Jones
  2019-06-17 18:36         ` SZEDER Gábor
  1 sibling, 0 replies; 90+ messages in thread
From: Ramsay Jones @ 2019-06-14 21:24 UTC (permalink / raw)
  To: SZEDER Gábor, git; +Cc: Junio C Hamano, Johannes Schindelin



On 14/06/2019 21:30, Ramsay Jones wrote:
> 
> 
> On 14/06/2019 11:00, SZEDER Gábor wrote:
>> Update 'compat/obstack.{c,h}' from upstream, because they already use
>> 'size_t' instead of 'long' in places that might eventually end up as
>> an argument to malloc(), which might solve build errors with GCC 8 on
>> Windows.
>>
>> The first patch just imports from upstream and doesn't modify anything
>> at all, and, consequently, it can't be compiled because of a screenful
>> or two of errors.  This is bad for future bisects, of course.
>>
>> OTOH, adding all the necessary build fixes right away makes review
>> harder...
>>
>> I'm not sure how to deal with this situation, so here is a series with
>> the fixes in separate patches for review, for now.  If there's an
>> agreement that this is the direction to take, then I'll squash in the
>> fixes in the first patch and touch up the resulting commit message.
>>
>>
>> Ramsay, could you please run sparse on top of this patch series to
>> make sure that I caught and converted all "0 instead of NULL" usages
>> in the last patch?  Thanks.
> 
> I applied your patches to current master (@0aae918dd9) and, since
> you dropped the final hunk of commit 3254310863 ("obstack.c: Fix
> some sparse warnings", 2011-09-11), sparse complains, thus:
> 
>   $ diff sp-out sp-out1
>   27a28,30
>   > compat/obstack.c:331:5: warning: incorrect type in initializer (different modifiers)
>   > compat/obstack.c:331:5:    expected void ( *[addressable] [toplevel] obstack_alloc_failed_handler )( ... )
>   > compat/obstack.c:331:5:    got void ( [noreturn] * )( ... )
>   $ 
> 
> So, yes you did catch all "using plain integer as NULL pointer"
> warnings! :-D

Sorry for being a bit slow here, but I just realized that
I should not have seen that on Linux (and should have tested
on cygwin), because the obstack code gets elided on Linux ...

Oh, wait:

  $ diff sc sc1
  3a4,7
  > compat/obstack.o	- _obstack_allocated_p
  > compat/obstack.o	- obstack_alloc_failed_handler
  > compat/obstack.o	- _obstack_begin_1
  > compat/obstack.o	- _obstack_memory_used
  $ 

Hmm, so on master, this code is totally elided on Linux, but
that is no longer the case with your patches applied. I guess
you need to look at the definition of the {_OBSTACK_}ELIDE_CODE
preprocessor variable(s).

HTH.

ATB,
Ramsay Jones


* Re: [PATCH 2/4] kwset: allow building with GCC 8
  2019-06-13 11:49 ` [PATCH 2/4] kwset: allow building with GCC 8 Johannes Schindelin via GitGitGadget
  2019-06-13 16:11   ` Junio C Hamano
  2019-06-14  9:53   ` SZEDER Gábor
@ 2019-06-14 22:09   ` Ævar Arnfjörð Bjarmason
  2019-06-14 22:55   ` Can we just get rid of kwset & obstack in favor of optimistically using PCRE v2 JIT? Ævar Arnfjörð Bjarmason
  3 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-14 22:09 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Junio C Hamano, Johannes Schindelin


On Thu, Jun 13 2019, Johannes Schindelin via GitGitGadget wrote:

> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> The kwset functionality makes use of the obstack code, which expects to
> be handed a function that can allocate large chunks of data. It expects
> that function to accept a `size` parameter of type `long`.
>
> This upsets GCC 8 on Windows, because `long` does not have the same
> bit size as `size_t` there.
>
> Now, the proper thing to do would be to switch to `size_t`. But this
> would make us deviate from the "upstream" code even further, making it
> hard to synchronize with newer versions, and also it would be quite
> involved because that `long` type is so invasive in that code.

Also because we'd need to switch git.git to GPLv3, as noted at the top
of the file we've grabbed the last GPLv2 version of this code.

That or convince these authors of GNU grep to dual-license their
contributions, assuming their code isn't derived from something else (I
didn't check):

 grep.git $ git shortlog -sn e7ac713d.. -- '*kwset*'
    25  Jim Meyering
    25  Paul Eggert
     8  Norihiro Tanaka
     2  Paolo Bonzini
     1  Karl Berry
     1  Tony Abou-Assaleh
     1  Yuliy Pisetsky

* Can we just get rid of kwset & obstack in favor of optimistically using PCRE v2 JIT?
  2019-06-13 11:49 ` [PATCH 2/4] kwset: allow building with GCC 8 Johannes Schindelin via GitGitGadget
                     ` (2 preceding siblings ...)
  2019-06-14 22:09   ` Ævar Arnfjörð Bjarmason
@ 2019-06-14 22:55   ` Ævar Arnfjörð Bjarmason
  2019-06-14 23:19     ` Ævar Arnfjörð Bjarmason
                       ` (2 more replies)
  3 siblings, 3 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-14 22:55 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Junio C Hamano, Johannes Schindelin, SZEDER Gábor,
	Jeff King, git-packagers


On Thu, Jun 13 2019, Johannes Schindelin via GitGitGadget wrote:

> The kwset functionality makes use of the obstack code, which expects to
> be handed a function that can allocate large chunks of data. It expects
> that function to accept a `size` parameter of type `long`.
>
> This upsets GCC 8 on Windows, because `long` does not have the same
> bit size as `size_t` there.
>
> Now, the proper thing to do would be to switch to `size_t`. But this
> would make us deviate from the "upstream" code even further, making it
> hard to synchronize with newer versions, and also it would be quite
> involved because that `long` type is so invasive in that code.
>
> Let's punt, and instead provide a super small wrapper around
> `xmalloc()`.
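
For readers who have not looked at the patch itself, the wrapper idea
quoted above can be sketched roughly as follows. This is only an
illustration: alloc_or_die() is a hypothetical stand-in for xmalloc(),
and chunk_alloc() is a made-up name, not the function from the patch.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for Git's xmalloc(): never returns NULL. */
static void *alloc_or_die(size_t size)
{
	void *p = malloc(size ? size : 1);

	if (!p) {
		fprintf(stderr, "out of memory\n");
		abort();
	}
	return p;
}

/*
 * Hypothetical wrapper with the void *(*)(long) shape that the old
 * obstack code expects. It rejects negative sizes and only then
 * converts to size_t, so the long/size_t mismatch is handled in one
 * place instead of via a function pointer cast.
 */
static void *chunk_alloc(long size)
{
	if (size < 0) {
		fprintf(stderr, "negative chunk size: %ld\n", size);
		abort();
	}
	return alloc_or_die((size_t)size);
}
```

The point of the design is that the function pointer handed to obstack
has exactly the type it asks for, which is presumably what keeps the
compiler quiet.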

I have WIP patches from 2017 that do $subject that I can dig up, but
thought I'd gauge interest first.

Right now the grep code & pickaxe machinery will detect fixed strings
and use kwset() as an optimization.

Back when kwset was added in 9eceddeec6 ("Use kwset in grep",
2011-08-21) this helped, but now doing this for grep with a PCRE pattern
is actually counterproductive for performance. On top of current
`master`:

    @@ -368 +368 @@ static int is_fixed(const char *s, size_t len)
    -       return 1;
    +       return 0;

And running p7821-grep-engines-fixed.sh[1] (which is in git.git, and is
as far as I got with this) we get:

    Test                             HEAD~             HEAD
    -------------------------------------------------------------------------
    7821.1: fixed grep int           0.48(1.59+0.63)   0.48(1.53+0.68) +0.0%
    7821.2: basic grep int           0.55(1.64+0.51)   0.72(2.97+0.54) +30.9%
    7821.3: extended grep int        0.65(1.63+0.54)   0.77(2.92+0.60) +18.5%
    7821.4: perl grep int            1.01(1.62+0.55)   0.36(0.97+0.58) -64.4%
    7821.6: fixed grep uncommon      0.18(0.51+0.45)   0.18(0.51+0.46) +0.0%
    7821.7: basic grep uncommon      0.18(0.50+0.46)   0.30(1.36+0.33) +66.7%
    7821.8: extended grep uncommon   0.18(0.45+0.52)   0.28(1.37+0.37) +55.6%
    7821.9: perl grep uncommon       0.18(0.52+0.45)   0.16(0.28+0.54) -11.1%
    7821.11: fixed grep æ            0.31(1.28+0.39)   0.31(1.24+0.43) +0.0%
    7821.12: basic grep æ            0.30(1.29+0.38)   0.22(0.85+0.36) -26.7%
    7821.13: extended grep æ         0.30(1.26+0.40)   0.22(0.78+0.45) -26.7%
    7821.14: perl grep æ             0.30(1.33+0.34)   0.16(0.25+0.56) -46.7%

So what this means on my Debian box is that when we use PCRE with JIT
and just get out of its way and let it do its own fixed string matching
it's up to ~65% faster than the kwset() path.

The usual case of just feeding the fixed pattern to glibc's regex
function is slower, although as seen there when you grep for a
rarely-occurring non-ASCII string glibc does better now (the perils of
using ancient last-version-to-use-GPLv2 snapshots...).

So my plan for this (which I partially implemented) was to have a series
where if we have a fixed string and have PCRE v2 we'd use it instead of
kwset() for fixed strings.

It seems most packagers build with PCRE v2 now (CC:
git-packagers@). I.e. we'd keep something like compile_fixed_regexp()
(and as an aside just use PCRE's \Q...\E instead of our own escaping).
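
As a rough sketch of what the \Q...\E approach could look like (not
taken from any existing patch; quote_literal() is a hypothetical
helper): the literal is wrapped in \Q and \E, and the only sequence
needing special care is a "\E" inside the literal itself, which would
otherwise end the quoting early. This assumes a NUL-terminated
literal, so patterns with embedded \0 are out of scope here.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: wrap a NUL-terminated literal in \Q...\E.
 * Each "\E" inside the literal is emitted as
 *   \E    (close the quoted run)
 *   \\E   (an escaped backslash followed by a plain E)
 *   \Q    (reopen the quoted run)
 * Returns a malloc()ed string, or NULL on allocation failure.
 */
static char *quote_literal(const char *s)
{
	size_t len = strlen(s), extra = 0, i;
	char *out, *p;

	for (i = 0; i + 1 < len; i++)
		if (s[i] == '\\' && s[i + 1] == 'E')
			extra += 5;	/* 2 bytes grow to 7 */

	/* room for the leading "\Q", the trailing "\E" and the NUL */
	out = p = malloc(len + extra + 5);
	if (!out)
		return NULL;
	memcpy(p, "\\Q", 2);
	p += 2;
	for (i = 0; i < len; i++) {
		if (s[i] == '\\' && i + 1 < len && s[i + 1] == 'E') {
			memcpy(p, "\\E\\\\E\\Q", 7);
			p += 7;
			i++;	/* skip the 'E' as well */
		} else {
			*p++ = s[i];
		}
	}
	memcpy(p, "\\E", 3);	/* also copies the terminating NUL */
	return out;
}
```

So "a.b" would become \Qa.b\E, and nothing else in the literal needs
escaping.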

We'd have a performance regression for platforms that use kwset() now but
don't build pcre2, or where pcre2 jit doesn't work. Does anyone care?

This would allow us to just "git rm" kwset.[ch] compat/obstack.[ch],
which is ~2k lines of tricky code, 1/2 of which we're currently doomed
to maintain a bitrotting version of due to license incompatibilities
with upstream[2].

As an aside there's other code in grep.c that we could similarly remove
in favor of optimistic PCRE v2 use, e.g. the -w case can be replaced by
\b<word>\b, but I found that less promising[3]. We can also get a huge
performance win for BRE and ERE patterns by using PCRE v2 with a
translation layer for those under the hood[4], but various solvable
backwards-compatibility headaches[5] related to that are why I got lost in
the weeds back in 2017 and didn't finish this.

But just the s/kwset/pcre2/ case is easy enough.

1. Via:

    GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_7821_GREP_OPTS='' GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3 USE_LIBPCRE2=YesPlease' ./run HEAD~ HEAD -- p7821-grep-engines-fixed.sh
2. https://public-inbox.org/git/87wohn95vb.fsf@evledraar.gmail.com/
3. https://github.com/avar/git/commit/49ca92e799
4. https://github.com/avar/git/commit/a3cc090344
5. https://github.com/avar/git/commit/7dd367eb37

* Re: Can we just get rid of kwset & obstack in favor of optimistically using PCRE v2 JIT?
  2019-06-14 22:55   ` Can we just get rid of kwset & obstack in favor of optimistically using PCRE v2 JIT? Ævar Arnfjörð Bjarmason
@ 2019-06-14 23:19     ` Ævar Arnfjörð Bjarmason
  2019-06-20 10:35       ` Jeff King
  2019-06-15  9:01     ` Carlo Arenas
  2019-06-15 19:15     ` brian m. carlson
  2 siblings, 1 reply; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-14 23:19 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Junio C Hamano, Johannes Schindelin, SZEDER Gábor,
	Jeff King, git-packagers


On Sat, Jun 15 2019, Ævar Arnfjörð Bjarmason wrote:

> On Thu, Jun 13 2019, Johannes Schindelin via GitGitGadget wrote:
>
>> The kwset functionality makes use of the obstack code, which expects to
>> be handed a function that can allocate large chunks of data. It expects
>> that function to accept a `size` parameter of type `long`.
>>
>> This upsets GCC 8 on Windows, because `long` does not have the same
>> bit size as `size_t` there.
>>
>> Now, the proper thing to do would be to switch to `size_t`. But this
>> would make us deviate from the "upstream" code even further, making it
>> hard to synchronize with newer versions, and also it would be quite
>> involved because that `long` type is so invasive in that code.
>>
>> Let's punt, and instead provide a super small wrapper around
>> `xmalloc()`.
>
> I have WIP patches from 2017 that do $subject that I can dig up, but
> thought I'd gauge interest first.
>
> Right now the grep code & pickaxe machinery will detect fixed strings
> and use kwset() as an optimization.
>
> Back when kwset was added in 9eceddeec6 ("Use kwset in grep",
> 2011-08-21) this helped, but now doing this for grep with a PCRE pattern
> is actually counterproductive for performance. On top of current
> `master`:
>
>     @@ -368 +368 @@ static int is_fixed(const char *s, size_t len)
>     -       return 1;
>     +       return 0;
>
> And running p7821-grep-engines-fixed.sh[1] (which is in git.git, and is
> as far as I got with this) we get:
>
>     Test                             HEAD~             HEAD
>     -------------------------------------------------------------------------
>     7821.1: fixed grep int           0.48(1.59+0.63)   0.48(1.53+0.68) +0.0%
>     7821.2: basic grep int           0.55(1.64+0.51)   0.72(2.97+0.54) +30.9%
>     7821.3: extended grep int        0.65(1.63+0.54)   0.77(2.92+0.60) +18.5%
>     7821.4: perl grep int            1.01(1.62+0.55)   0.36(0.97+0.58) -64.4%
>     7821.6: fixed grep uncommon      0.18(0.51+0.45)   0.18(0.51+0.46) +0.0%
>     7821.7: basic grep uncommon      0.18(0.50+0.46)   0.30(1.36+0.33) +66.7%
>     7821.8: extended grep uncommon   0.18(0.45+0.52)   0.28(1.37+0.37) +55.6%
>     7821.9: perl grep uncommon       0.18(0.52+0.45)   0.16(0.28+0.54) -11.1%
>     7821.11: fixed grep æ            0.31(1.28+0.39)   0.31(1.24+0.43) +0.0%
>     7821.12: basic grep æ            0.30(1.29+0.38)   0.22(0.85+0.36) -26.7%
>     7821.13: extended grep æ         0.30(1.26+0.40)   0.22(0.78+0.45) -26.7%
>     7821.14: perl grep æ             0.30(1.33+0.34)   0.16(0.25+0.56) -46.7%
>
> So what this means on my Debian box is that when we use PCRE with JIT
> and just get out of its way and let it do its own fixed string matching
> it's up to ~65% faster than the kwset() path.
>
> The usual case of just feeding the fixed pattern to glibc's regex
> function is slower, although as seen there when you grep for a
> rarely-occurring non-ASCII string glibc does better now (the perils of
> using ancient last-version-to-use-GPLv2 snapshots...).
>
> So my plan for this (which I partially implemented) was to have a series
> where if we have a fixed string and have PCRE v2 we'd use it instead of
> kwset() for fixed strings.
>
> It seems most packagers build with PCRE v2 now (CC:
> git-packagers@). I.e. we'd keep something like compile_fixed_regexp()
> (and as an aside just use PCRE's \Q...\E instead of our own escaping).
>
> We'd have a performance regression for platforms that use kwset() now but
> don't build pcre2, or where pcre2 jit doesn't work. Does anyone care?
>
> This would allow us to just "git rm" kwset.[ch] compat/obstack.[ch],
> which is ~2k lines of tricky code, 1/2 of which we're currently doomed
> to maintain a bitrotting version of due to license incompatibilities
> with upstream[2].
>
> As an aside there's other code in grep.c that we could similarly remove
> in favor of optimistic PCRE v2 use, e.g. the -w case can be replaced by
> \b<word>\b, but I found that less promising[3]. We can also get a huge
> performance win for BRE and ERE patterns by using PCRE v2 with a
> translation layer for those under the hood[4], but various solvable
> backwards-compatibility headaches[5] related to that are why I got lost in
> the weeds back in 2017 and didn't finish this.
>
> But just the s/kwset/pcre2/ case is easy enough.

...small correction: we currently hard-rely on kwset() for any pattern
containing a \0 for "git-grep" (these can only be supplied via the -f
<pattern-from-file> option). This means that any pattern containing a \0
is implicitly fixed, unless kwset() doesn't like it (-i and non-ASCII).
What a mess.

As we have hard depended on REG_STARTEND since 2f8952250a ("regex: add
regexec_buf() that can work on a non NUL-terminated string", 2016-09-21)
we should just fix that while we're at it. It's a backwards-incompatible
change, but I doubt anyone is relying on our undocumented behavior of
implicitly considering grep patterns with \0 in them always fixed.
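
For reference, the REG_STARTEND mechanism can be sketched roughly like
this. It is a minimal illustration in the spirit of regexec_buf(), not
a copy of it; search_buf() is a hypothetical name. REG_STARTEND is an
extension (available in glibc and the BSDs) rather than POSIX, and it
only bounds the haystack; it does not by itself let regcomp() take a
pattern containing \0.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: search only the first `size` bytes of `buf`,
 * which need not be NUL-terminated. With REG_STARTEND, regexec()
 * reads the range to search from m->rm_so and m->rm_eo instead of
 * stopping at the first NUL. Returns 0 on a match, like regexec(),
 * with the match offsets stored in *m.
 */
static int search_buf(const regex_t *re, const char *buf, size_t size,
		      regmatch_t *m)
{
	m->rm_so = 0;
	m->rm_eo = (regoff_t)size;
	return regexec(re, buf, 1, m, REG_STARTEND);
}
```

The offsets reported back are relative to the start of `buf`, so
callers can use them directly.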


> 1. Via:
>
>     GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_7821_GREP_OPTS='' GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3 USE_LIBPCRE2=YesPlease' ./run HEAD~ HEAD -- p7821-grep-engines-fixed.sh
> 2. https://public-inbox.org/git/87wohn95vb.fsf@evledraar.gmail.com/
> 3. https://github.com/avar/git/commit/49ca92e799
> 4. https://github.com/avar/git/commit/a3cc090344
> 5. https://github.com/avar/git/commit/7dd367eb37

* Re: Can we just get rid of kwset & obstack in favor of optimistically using PCRE v2 JIT?
  2019-06-14 22:55   ` Can we just get rid of kwset & obstack in favor of optimistically using PCRE v2 JIT? Ævar Arnfjörð Bjarmason
  2019-06-14 23:19     ` Ævar Arnfjörð Bjarmason
@ 2019-06-15  9:01     ` Carlo Arenas
  2019-06-15 19:15     ` brian m. carlson
  2 siblings, 0 replies; 90+ messages in thread
From: Carlo Arenas @ 2019-06-15  9:01 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Johannes Schindelin via GitGitGadget, git, Junio C Hamano,
	Johannes Schindelin, SZEDER Gábor, Jeff King, git-packagers

> It seems most packagers build with PCRE v2 now (CC:
> git-packagers@). I.e. we'd keep something like compile_fixed_regexp()
> (and as an aside just use PCRE's \Q...\E instead of our own escaping).

OpenBSD builds PCRE v1 without JIT, but HardenedBSD builds it with JIT
and therefore segfaults when calling `git grep -P`.

A fix, probably based on the old proposed patchset[1], will be needed
with more urgency in this case.

Carlo

[1] https://public-inbox.org/git/20181209230024.43444-1-carenas@gmail.com/

* Re: Can we just get rid of kwset & obstack in favor of optimistically using PCRE v2 JIT?
  2019-06-14 22:55   ` Can we just get rid of kwset & obstack in favor of optimistically using PCRE v2 JIT? Ævar Arnfjörð Bjarmason
  2019-06-14 23:19     ` Ævar Arnfjörð Bjarmason
  2019-06-15  9:01     ` Carlo Arenas
@ 2019-06-15 19:15     ` brian m. carlson
  2019-06-15 22:14       ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 90+ messages in thread
From: brian m. carlson @ 2019-06-15 19:15 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Johannes Schindelin via GitGitGadget, git, Junio C Hamano,
	Johannes Schindelin, SZEDER Gábor, Jeff King, git-packagers


On 2019-06-14 at 22:55:17, Ævar Arnfjörð Bjarmason wrote:
> It seems most packagers build with PCRE v2 now (CC:
> git-packagers@). I.e. we'd keep something like compile_fixed_regexp()
> (and as an aside just use PCRE's \Q...\E instead of our own escaping).
> 
> We'd have a performance regression for platforms that use kwset() now but
> don't build pcre2, or where pcre2 jit doesn't work. Does anyone care?

I know that there are people shipping newer versions of Git using CentOS
6, which IIRC doesn't ship PCRE 2[0]. Since having to ship your own PCRE
is a security maintenance nightmare, it's probably best to leave this at
least compatible with non-PCRE 2 systems until November 2020. At that
point, I'm happy to drop support for it.

If it would work but just be slower with PCRE 1, I'm not too terribly
concerned. Let that be an incentive to users to upgrade.

Also, as Carlo pointed out, not all platforms will have the JIT support
functional, such as OpenBSD, NetBSD, and PaX Linux systems. That may be
more of a blocker than the CentOS issue, especially since people run PaX
kernels with standard distros.

[0] I'm not certain because CentOS 6 Docker images segfault on newer
kernels and I'm too lazy to download a live CD image for testing.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

* Re: Can we just get rid of kwset & obstack in favor of optimistically using PCRE v2 JIT?
  2019-06-15 19:15     ` brian m. carlson
@ 2019-06-15 22:14       ` Ævar Arnfjörð Bjarmason
  2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
                           ` (7 more replies)
  0 siblings, 8 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-15 22:14 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Johannes Schindelin via GitGitGadget, git, Junio C Hamano,
	Johannes Schindelin, SZEDER Gábor, Jeff King, git-packagers


On Sat, Jun 15 2019, brian m. carlson wrote:

> On 2019-06-14 at 22:55:17, Ævar Arnfjörð Bjarmason wrote:
>> It seems most packagers build with PCRE v2 now (CC:
>> git-packagers@). I.e. we'd keep something like compile_fixed_regexp()
>> (and as an aside just use PCRE's \Q...\E instead of our own escaping).
>>
>> We'd have a performance regression for platforms that use kwset() now but
>> don't build pcre2, or where pcre2 jit doesn't work. Does anyone care?
>
> I know that there are people shipping newer versions of Git using CentOS
> 6, which IIRC doesn't ship PCRE 2[0]. Since having to ship your own PCRE
> is a security maintenance nightmare, it's probably best to leave this at
> least compatible with non-PCRE 2 systems until November 2020. At that
> point, I'm happy to drop support for it.
>
> If it would work but just be slower with PCRE 1, I'm not too terribly
> concerned. Let that be an incentive to users to upgrade.

Not just with PCRE v1: even if you don't have PCRE at all, things would
still work perfectly fine.

I.e. all we're talking about is how to treat this internal
optimization. If we'd never imported kwset we'd still be perfectly
capable of searching for fixed strings with grep/pickaxe, it would have
just been slower.

So platforms that don't have PCRE at all would be slowed down by
something like what the benchmark in my
https://public-inbox.org/git/87v9x793qi.fsf@evledraar.gmail.com/
upthread shows.

Or not, maybe their C library POSIX regcomp()/regexec() is faster.

Platforms that do have PCRE would be faster than they are now, and we
could stop shipping this kwset code.

The *only* case where what I've outlined above isn't true is when the
pattern being matched has a \0. See my 966be95549 ("grep: add tests to
fix blind spots with \0 patterns", 2017-05-20) for how that behaves.

There our current behavior is IMNSHO insane, and is certainly
undocumented and unreliable (i.e. it behaves differently if you
e.g. have non-ASCII along with \0 in the pattern, none of this is
documented).

Having poked a bit at that I think the only sane thing there is to just
outright die unless you have PCRE v2, which is the only backend we have
that has any hope of handling that sanely.

> Also, as Carlo pointed out, not all platforms will have the JIT support
> functional, such as OpenBSD, NetBSD, and PaX Linux systems. That may be
> more of a blocker than the CentOS issue, especially since people run PaX
> kernels with standard distros.
>
> [0] I'm not certain because CentOS 6 Docker images segfault on newer
> kernels and I'm too lazy to download a live CD image for testing.

* Re: [PATCH 4/4] config: avoid calling `labs()` on too-large data type
  2019-06-13 11:49 ` [PATCH 4/4] config: avoid calling `labs()` on too-large data type Johannes Schindelin via GitGitGadget
  2019-06-13 16:13   ` Junio C Hamano
@ 2019-06-16  6:48   ` René Scharfe
  2019-06-16  8:24     ` René Scharfe
  1 sibling, 1 reply; 90+ messages in thread
From: René Scharfe @ 2019-06-16  6:48 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget, git
  Cc: Junio C Hamano, Johannes Schindelin

Am 13.06.19 um 13:49 schrieb Johannes Schindelin via GitGitGadget:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> The `labs()` function operates, as the initial `l` suggests, on `long`
> parameters. However, in `config.c` we tried to use it on values of type
> `intmax_t`.
>
> This problem was found by GCC v9.x.
>
> To fix it, let's just "unroll" the function (i.e. negate the value if it
> is negative).

There's also imaxabs(3).
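
A minimal sketch of what using it could look like, assuming the goal
is just to get the magnitude as a uintmax_t (magnitude() is a
hypothetical helper, not something in config.c):

```c
#include <inttypes.h>
#include <stdint.h>

/*
 * Hypothetical helper: magnitude of an intmax_t as a uintmax_t.
 * imaxabs() takes and returns intmax_t, so nothing is narrowed to
 * long on the way. INTMAX_MIN is handled separately because its
 * absolute value is not representable as an intmax_t.
 */
static uintmax_t magnitude(intmax_t val)
{
	if (val == INTMAX_MIN)
		return (uintmax_t)INTMAX_MAX + 1;
	return (uintmax_t)imaxabs(val);
}
```

Unlike labs(), this keeps working for values that do not fit in a
32-bit long, which is the case GCC v9.x complained about.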

>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  config.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/config.c b/config.c
> index 296a6d9cc4..01c6e9df23 100644
> --- a/config.c
> +++ b/config.c
> @@ -869,9 +869,9 @@ static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
>  			errno = EINVAL;
>  			return 0;
>  		}
> -		uval = labs(val);
> +		uval = val < 0 ? -val : val;
>  		uval *= factor;
> -		if (uval > max || labs(val) > uval) {
> +		if (uval > max || (val < 0 ? -val : val) > uval) {
>  			errno = ERANGE;
>  			return 0;
>  		}

So this check uses unsigned arithmetic to find out if the multiplication
overflows, right?  Let's say value is "4G", then val will be 4 and
factor will be 2^30.  Multiplying the two yields 2^32.  On a 32-bit
system this will wrap around to 0, so that's what we get for uval there.
The range check will then pass unless max is negative, which is wrong.

This behavior was not introduced by your patch, of course.

We could fix it by using the macro unsigned_mult_overflows():

		uval = imaxabs(val);
		if (unsigned_mult_overflows(uval, factor) ||
		    uval * factor > max) {
			errno = ERANGE;
			return 0;
		}

René

* Re: [PATCH 4/4] config: avoid calling `labs()` on too-large data type
  2019-06-16  6:48   ` René Scharfe
@ 2019-06-16  8:24     ` René Scharfe
  2019-06-16 14:01       ` René Scharfe
  0 siblings, 1 reply; 90+ messages in thread
From: René Scharfe @ 2019-06-16  8:24 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget, git
  Cc: Junio C Hamano, Johannes Schindelin

Am 16.06.19 um 08:48 schrieb René Scharfe:
> Am 13.06.19 um 13:49 schrieb Johannes Schindelin via GitGitGadget:
>> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>>
>> The `labs()` function operates, as the initial `l` suggests, on `long`
>> parameters. However, in `config.c` we tried to use it on values of type
>> `intmax_t`.
>>
>> This problem was found by GCC v9.x.
>>
>> To fix it, let's just "unroll" the function (i.e. negate the value if it
>> is negative).
>
> There's also imaxabs(3).
>
>>
>> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
>> ---
>>  config.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/config.c b/config.c
>> index 296a6d9cc4..01c6e9df23 100644
>> --- a/config.c
>> +++ b/config.c
>> @@ -869,9 +869,9 @@ static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
>>  			errno = EINVAL;
>>  			return 0;
>>  		}
>> -		uval = labs(val);
>> +		uval = val < 0 ? -val : val;
>>  		uval *= factor;
>> -		if (uval > max || labs(val) > uval) {
>> +		if (uval > max || (val < 0 ? -val : val) > uval) {
>>  			errno = ERANGE;
>>  			return 0;
>>  		}
>
> So this check uses unsigned arithmetic to find out if the multiplication
> overflows, right?  Let's say value is "4G", then val will be 4 and
> factor will be 2^30.  Multiplying the two yields 2^32.  On a 32-bit
> system this will wrap around to 0, so that's what we get for uval there.
> The range check will then pass unless max is negative, which is wrong.

No, this example is wrong.  (I need to remember to take baby steps while
carrying numbers. o_O)

So value = "5G", then val = 5 and factor = 2^30.  After multiplication
uval = 2^32 + 2^30, on 32-bit machines this is 2^30 due to wrap-around.
Correct so far?

If uval is 2^30, then it's smaller than 2^31, so will pass a check
against INT_MAX.  val is 5, which is smaller than 2^30, so will pass the
second check as well.  Makes sense?

That would mean "5G" will overflow on a 32-bit machine, but we won't
detect it.
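
To make the arithmetic concrete, here is a small sketch that imitates
the hypothetical 32-bit case by narrowing the unsigned type to
uint32_t. The function name is made up, and the check is a
simplified copy of the one in the patch, taking the absolute value
as input.

```c
#include <stdint.h>

/*
 * Hypothetical model of the range check with uintmax_t replaced by
 * uint32_t. Returns 1 if the check rejects the input, 0 if it lets
 * the input through.
 */
static int old_check_rejects32(uint32_t absval, uint32_t factor,
			       uint32_t max)
{
	uint32_t uval = absval;

	uval *= factor;		/* wraps modulo 2^32 */
	return uval > max || absval > uval;
}
```

With factor = 2^30 and max = INT32_MAX, "4G" wraps to 0 and is caught
by the second comparison, whereas "5G" wraps to 2^30 and passes both
comparisons.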

René

* Re: [PATCH 4/4] config: avoid calling `labs()` on too-large data type
  2019-06-16  8:24     ` René Scharfe
@ 2019-06-16 14:01       ` René Scharfe
  2019-06-16 22:26         ` Junio C Hamano
  0 siblings, 1 reply; 90+ messages in thread
From: René Scharfe @ 2019-06-16 14:01 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget, git
  Cc: Junio C Hamano, Johannes Schindelin

Am 16.06.19 um 10:24 schrieb René Scharfe:
> Am 16.06.19 um 08:48 schrieb René Scharfe:
>> Am 13.06.19 um 13:49 schrieb Johannes Schindelin via GitGitGadget:
>>> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>>>
>>> The `labs()` function operates, as the initial `l` suggests, on `long`
>>> parameters. However, in `config.c` we tried to use it on values of type
>>> `intmax_t`.
>>>
>>> This problem was found by GCC v9.x.
>>>
>>> To fix it, let's just "unroll" the function (i.e. negate the value if it
>>> is negative).
>>
>> There's also imaxabs(3).
>>
>>>
>>> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
>>> ---
>>>  config.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/config.c b/config.c
>>> index 296a6d9cc4..01c6e9df23 100644
>>> --- a/config.c
>>> +++ b/config.c
>>> @@ -869,9 +869,9 @@ static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
>>>  			errno = EINVAL;
>>>  			return 0;
>>>  		}
>>> -		uval = labs(val);
>>> +		uval = val < 0 ? -val : val;
>>>  		uval *= factor;
>>> -		if (uval > max || labs(val) > uval) {
>>> +		if (uval > max || (val < 0 ? -val : val) > uval) {
>>>  			errno = ERANGE;
>>>  			return 0;
>>>  		}
>>
>> So this check uses unsigned arithmetic to find out if the multiplication
>> overflows, right?  Let's say value is "4G", then val will be 4 and
>> factor will be 2^30.  Multiplying the two yields 2^32.  On a 32-bit
>> system this will wrap around to 0, so that's what we get for uval there.
>> The range check will then pass unless max is negative, which is wrong.
>
> No, this example is wrong.  (I need to remember to take baby steps while
> carrying numbers. o_O)
>
> So value = "5G", then val = 5 and factor = 2^30.  After multiplication
> uval = 2^32 + 2^30, on 32-bit machines this is 2^30 due to wrap-around.

Yeah, except that in the real world uintmax_t is 8 bytes wide
everywhere, even on x86 and ARM.  So the code should be fine as-is.  It
would be in trouble if we introduced bigger units, like T for 2^40 etc.,
though.
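
A sketch of the "bigger units" concern, assuming a 64-bit uintmax_t and
a hypothetical T suffix that the code does not have.  The names
FACTOR_T and post_multiply_check_passes are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical "T" unit; the code under discussion only knows K, M and G. */
#define FACTOR_T (UINTMAX_C(1) << 40)

/*
 * Replay the multiply-then-compare check with uintmax_t.
 * Returns 1 if the check accepts the value, 0 if it reports ERANGE.
 */
static int post_multiply_check_passes(uintmax_t val, uintmax_t factor,
				      uintmax_t max)
{
	/* wraps modulo 2^64, assuming a 64-bit uintmax_t */
	uintmax_t uval = val * factor;

	if (uval > max || val > uval)
		return 0;
	return 1;
}
```

For val = 16777217 (2^24 + 1) and FACTOR_T the true product is
2^64 + 2^40.  It wraps to 2^40, which is below INTMAX_MAX and above
val, so the function returns 1 although the result does not fit.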

Anyway, the code would be easier to read and ready for any units if it
used unsigned_mult_overflows; would have saved me time spent painfully
wading through the math at least.  (Hopefully that's just my problem,
though.)
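
For readers who have not met this kind of helper: the idea is to
compare against a quotient before multiplying, so that nothing wraps.
The function below is a hypothetical stand-in written for this note,
not the definition used in Git.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for an "unsigned multiplication overflows"
 * helper: report whether a * b would exceed UINTMAX_MAX, without
 * ever computing the product.
 */
static int umax_mult_overflows(uintmax_t a, uintmax_t b)
{
	/* a == 0 can never overflow; otherwise compare b to the quotient */
	return a && b > UINTMAX_MAX / a;
}
```

Because the test happens before the multiplication, it does not depend
on how large the unit factor is compared to the raw number.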

René

* Re: [PATCH 4/4] config: avoid calling `labs()` on too-large data type
  2019-06-16 14:01       ` René Scharfe
@ 2019-06-16 22:26         ` Junio C Hamano
  2019-06-20 19:58           ` René Scharfe
                             ` (3 more replies)
  0 siblings, 4 replies; 90+ messages in thread
From: Junio C Hamano @ 2019-06-16 22:26 UTC (permalink / raw)
  To: René Scharfe
  Cc: Johannes Schindelin via GitGitGadget, git, Johannes Schindelin

René Scharfe <l.s.r@web.de> writes:

>>>> To fix it, let's just "unroll" the function (i.e. negate the value if it
>>>> is negative).
>>>
>>> There's also imaxabs(3).

That may be true, but seeing that some platforms want to see
intmax_t defined in the compat/ layer, I suspect we cannot avoid
having a copy of unrolled implementation somewhere in our code.
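
A minimal sketch of what such an unrolled helper might look like.  This
is a variant, not the expression from the patch: it negates after
converting to uintmax_t so that INTMAX_MIN is handled without signed
overflow.  The name imax_magnitude is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: magnitude of a signed value as an unsigned one.
 * Negating after the conversion to uintmax_t keeps INTMAX_MIN well
 * defined, which a plain "-val" on intmax_t would not be.
 */
static uintmax_t imax_magnitude(intmax_t val)
{
	return val < 0 ? -(uintmax_t)val : (uintmax_t)val;
}
```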

>>>> +		uval = val < 0 ? -val : val;
>>>>  		uval *= factor;
>>>> -		if (uval > max || labs(val) > uval) {
>>>> +		if (uval > max || (val < 0 ? -val : val) > uval) {
>>>>  			errno = ERANGE;
>>>>  			return 0;
>>>>  		}
>>>
>>> So this check uses unsigned arithmetic to find out if the multiplication
>>> overflows, right?...
>> No, this example is wrong...
> ...
> Anyway, the code would be easier to read and ready for any units if it
> used unsigned_mult_overflows; would have saved me time spent painfully
> wading through the math at least.

A patch to use unsigned_mult_overflows() here, on top of the
"unrolled imaxabs" patch we reviewed, would be good to tie a loose
end.

Thanks.


* Re: [PATCH 2/4] kwset: allow building with GCC 8
  2019-06-14 16:12     ` [PATCH 2/4] kwset: allow building with GCC 8 Junio C Hamano
@ 2019-06-17 18:26       ` SZEDER Gábor
  0 siblings, 0 replies; 90+ messages in thread
From: SZEDER Gábor @ 2019-06-17 18:26 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin via GitGitGadget, git, Johannes Schindelin

On Fri, Jun 14, 2019 at 09:12:50AM -0700, Junio C Hamano wrote:
> SZEDER Gábor <szeder.dev@gmail.com> writes:
> 
> >> Now, the proper thing to do would be to switch to `size_t`. But this
> >> would make us deviate from the "upstream" code even further,
> >
> > This is not entirely true: upstream already uses 'size_t', so the
> > switch would actually bring our copy closer to upstream.
> 
> Ah, earlier I said that within the context of how kwset uses obstack,
> it is perfectly proper to fix it like the patch in question did, but
> the upstream already using size_t changes the picture quite a bit.
> 
> I do not mind updating our copy of obstack, but make sure you pick
> the version with license compatible with ours.

The licensing of obstack.{c,h} didn't change, it's still "GNU Lesser
General Public License as published by the Free Software Foundation;
either version 2.1 of the License, or (at your option) any later
version"

Note how the first patch updating these files makes only superficial
changes to their license notices:

  diff --git a/compat/obstack.h b/compat/obstack.h
  index ced94d0118..811de588a4 100644
  --- a/compat/obstack.h
  +++ b/compat/obstack.h
  @@ -1,6 +1,5 @@
   /* obstack.h - object stack macros
  -   Copyright (C) 1988-1994,1996-1999,2003,2004,2005,2009
  -       Free Software Foundation, Inc.
  +   Copyright (C) 1988-2019 Free Software Foundation, Inc.
      This file is part of the GNU C Library.
   
      The GNU C Library is free software; you can redistribute it and/or
  @@ -15,89 +14,89 @@
   
      You should have received a copy of the GNU Lesser General Public
      License along with the GNU C Library; if not, see
  -   <http://www.gnu.org/licenses/>.  */
  +   <https://www.gnu.org/licenses/>.  */
 

But I rather like Ævar's idea of simply getting rid of them :)


* Re: [RFC/PATCH v1 0/4] compat/obstack: update from upstream
  2019-06-14 20:30       ` Ramsay Jones
  2019-06-14 21:24         ` Ramsay Jones
@ 2019-06-17 18:36         ` SZEDER Gábor
  1 sibling, 0 replies; 90+ messages in thread
From: SZEDER Gábor @ 2019-06-17 18:36 UTC (permalink / raw)
  To: Ramsay Jones; +Cc: git, Junio C Hamano, Johannes Schindelin

On Fri, Jun 14, 2019 at 09:30:20PM +0100, Ramsay Jones wrote:
> 
> 
> On 14/06/2019 11:00, SZEDER Gábor wrote:
> > Update 'compat/obstack.{c,h}' from upstream, because they already use
> > 'size_t' instead of 'long' in places that might eventually end up as
> > an argument to malloc(), which might solve build errors with GCC 8 on
> > Windows.
> > 
> > The first patch just imports from upstream and doesn't modify anything
> > at all, and, consequently, it can't be compiled because of a screenful
> > or two of errors.  This is bad for future bisects, of course.
> > 
> > OTOH, adding all the necessary build fixes right away makes review
> > harder...
> > 
> > I'm not sure how to deal with this situation, so here is a series with
> > the fixes in separate patches for review, for now.  If there's an
> > agreement that this is the direction to take, then I'll squash in the
> > fixes in the first patch and touch up the resulting commit message.
> > 
> > 
> > Ramsay, could you please run sparse on top of this patch series to
> > make sure that I caught and converted all "0 instead of NULL" usages
> > in the last patch?  Thanks.
> 
> I applied your patches to current master (@0aae918dd9) and, since
> you dropped the final hunk of commit 3254310863 ("obstack.c: Fix
> some sparse warnings", 2011-09-11), sparse complains, thus:

Oh, indeed.  3254310863 removed that "__attribute__ ((noreturn))" from
the function's definition, but nowadays upstream writes that as
"static _Noreturn void print_and_abort (void)", and I didn't realize
that this _Noreturn is the same thing.

>   $ diff sp-out sp-out1
>   27a28,30
>   > compat/obstack.c:331:5: warning: incorrect type in initializer (different modifiers)
>   > compat/obstack.c:331:5:    expected void ( *[addressable] [toplevel] obstack_alloc_failed_handler )( ... )
>   > compat/obstack.c:331:5:    got void ( [noreturn] * )( ... )
>   $ 
> 
> So, yes you did catch all "using plain integer as NULL pointer"
> warnings! :-D

Heh :)

Anyway, I won't do anything for the time being, in the hope that we
can get on board with removing kwset/obstack...


* Re: Can we just get rid of kwset & obstack in favor of optimistically using PCRE v2 JIT?
  2019-06-14 23:19     ` Ævar Arnfjörð Bjarmason
@ 2019-06-20 10:35       ` Jeff King
  0 siblings, 0 replies; 90+ messages in thread
From: Jeff King @ 2019-06-20 10:35 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Johannes Schindelin via GitGitGadget, git, Junio C Hamano,
	Johannes Schindelin, SZEDER Gábor, git-packagers

On Sat, Jun 15, 2019 at 01:19:33AM +0200, Ævar Arnfjörð Bjarmason wrote:

> ...small correction, we currently hard-rely on kwset() for any pattern
> containing a \0 for "git-grep" (these can only be supplied via the -f
> <pattern-from-file> option), this means that any pattern containing a \0
> is implicitly fixed, unless kwset() doesn't like it (-i and non-ASCII),
> what a mess.
> 
> Since we hard depend on REG_STARTEND since 2f8952250a ("regex: add
> regexec_buf() that can work on a non NUL-terminated string", 2016-09-21)
> we should just fix that while we're at it. It's a backwards-incompatible
> change, but I doubt anyone is relying on our undocumented behavior of
> implicitly considering grep patterns with \0 in them always fixed.

That's only for NULs in the haystack, though. I don't think there's a
way to have a NUL in the pattern with regcomp(), since it takes a
NUL-terminated string.

I do agree with you that treating it like a fixed string is somewhat
insane. We're probably better off to die.

In general, your plan to get rid of kwset sounds like a good path. It
would be a slight regression for somebody who is truly feeding a
fixed-string pattern with a NUL in it, on a system without pcre. Right
now that works (via kwset), and if we would start feeding fixed strings
to regcomp() then obviously that won't work. I guess we could go back to
using memmem as a fallback, which is what it looks like we used before
9eceddeec6 (Use kwset in grep, 2011-08-21).
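
To illustrate why a length-based search copes with NULs where a
NUL-terminated interface cannot, here is a naive sketch.  It is a
hypothetical stand-in for memmem (find_bytes is a made-up name), not
what grep.c does.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical, naive stand-in for memmem(): find the first occurrence
 * of needle (nlen bytes) in hay (hlen bytes).  Lengths are explicit, so
 * NUL bytes are ordinary data.  Returns NULL if there is no match.
 */
static const char *find_bytes(const char *hay, size_t hlen,
			      const char *needle, size_t nlen)
{
	size_t i;

	if (!nlen)
		return hay;
	if (nlen > hlen)
		return NULL;
	for (i = 0; i <= hlen - nlen; i++)
		if (!memcmp(hay + i, needle, nlen))
			return hay + i;
	return NULL;
}
```

A caller would pass the pattern together with its length as read from
the file, so a pattern such as "d", NUL, "e" is matched byte for byte.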

Seems like a code path that would get exercised approximately never,
though.

-Peff

* Re: [PATCH 4/4] config: avoid calling `labs()` on too-large data type
  2019-06-16 22:26         ` Junio C Hamano
@ 2019-06-20 19:58           ` René Scharfe
  2019-06-20 21:07             ` Junio C Hamano
  2019-06-21 18:35             ` Johannes Schindelin
  2019-06-22 10:03           ` [PATCH v2 1/3] config: use unsigned_mult_overflows to check for overflows René Scharfe
                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 90+ messages in thread
From: René Scharfe @ 2019-06-20 19:58 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin via GitGitGadget, git, Johannes Schindelin

Am 17.06.19 um 00:26 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
>
>>>>> To fix it, let's just "unroll" the function (i.e. negate the value if it
>>>>> is negative).
>>>>
>>>> There's also imaxabs(3).
>
> That may be true, but seeing that some platforms want to see
> intmax_t defined in the compat/ layer, I suspect we cannot avoid
> having a copy of unrolled implementation somewhere in our code.

Right.  And if we later decide to give it a name then we could ask
Coccinelle to find the places that need converting with a semantic
patch like this:

	@@
	intmax_t i;
	@@
	- i < 0 ? -i : i
	+ imaxabs(i)

Side note: I was surprised to find that I added those labs(3) calls, in
83915ba521 ("use labs() for variables of type long instead of abs()",
2014-11-15).  Sloppy. :-/

> A patch to use unsigned_mult_overflows() here, on top of the
> "unrolled imaxabs" patch we reviewed, would be good to tie a loose
> end.

How about this?

-- >8 --
Subject: [PATCH] config: simplify unit suffix handling

parse_unit_factor() checks if a K, M or G is present after a number and
multiplies it by 2^10, 2^20 or 2^30, respectively.  One of its callers
checks if the result is smaller than the number alone to detect
overflows.  The other one passes 1 as the number and does multiplication
and overflow check itself in a similar manner.

This works, but is inconsistent, and it would break if we added support
for a bigger unit factor.  E.g. 16777217T expands to 2^64 + 2^40, which
is too big for a 64-bit number.  Modulo 2^64 we get 2^40 == 1TB, which
is bigger than the raw number 16777217 == 2^24 + 1, so the overflow
would go undetected by that method.

Move the multiplication out of parse_unit_factor() and rename it to
get_unit_factor() to signify its reduced functionality.  This partially
reverts c8deb5a146 ("Improve error messages when int/long cannot be
parsed from config", 2007-12-25), but keeps the improved error messages.
Use a return value of 0 to signal an invalid suffix.

And use unsigned_mult_overflows to check for an overflow *before* doing
the actual multiplication, which is simpler and can deal with larger
unit factors.

Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
Patch generated with --function-context for easier reviewing.

 config.c | 39 ++++++++++++++++++---------------------
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/config.c b/config.c
index 01c6e9df23..61a8bbb5cd 100644
--- a/config.c
+++ b/config.c
@@ -834,51 +834,46 @@ static int git_parse_source(config_fn_t fn, void *data,
 	return error_return;
 }

-static int parse_unit_factor(const char *end, uintmax_t *val)
+static uintmax_t get_unit_factor(const char *end)
 {
 	if (!*end)
 		return 1;
-	else if (!strcasecmp(end, "k")) {
-		*val *= 1024;
-		return 1;
-	}
-	else if (!strcasecmp(end, "m")) {
-		*val *= 1024 * 1024;
-		return 1;
-	}
-	else if (!strcasecmp(end, "g")) {
-		*val *= 1024 * 1024 * 1024;
-		return 1;
-	}
+	if (!strcasecmp(end, "k"))
+		return 1024;
+	if (!strcasecmp(end, "m"))
+		return 1024 * 1024;
+	if (!strcasecmp(end, "g"))
+		return 1024 * 1024 * 1024;
 	return 0;
 }

 static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
 {
 	if (value && *value) {
 		char *end;
 		intmax_t val;
 		uintmax_t uval;
-		uintmax_t factor = 1;
+		uintmax_t factor;

 		errno = 0;
 		val = strtoimax(value, &end, 0);
 		if (errno == ERANGE)
 			return 0;
-		if (!parse_unit_factor(end, &factor)) {
+		factor = get_unit_factor(end);
+		if (!factor) {
 			errno = EINVAL;
 			return 0;
 		}
 		uval = val < 0 ? -val : val;
-		uval *= factor;
-		if (uval > max || (val < 0 ? -val : val) > uval) {
+		if (unsigned_mult_overflows(factor, uval) ||
+		    factor * uval > max) {
 			errno = ERANGE;
 			return 0;
 		}
 		val *= factor;
 		*ret = val;
 		return 1;
 	}
 	errno = EINVAL;
 	return 0;
 }
@@ -886,26 +881,28 @@ static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
 static int git_parse_unsigned(const char *value, uintmax_t *ret, uintmax_t max)
 {
 	if (value && *value) {
 		char *end;
 		uintmax_t val;
-		uintmax_t oldval;
+		uintmax_t factor;

 		errno = 0;
 		val = strtoumax(value, &end, 0);
 		if (errno == ERANGE)
 			return 0;
-		oldval = val;
-		if (!parse_unit_factor(end, &val)) {
+		factor = get_unit_factor(end);
+		if (!factor) {
 			errno = EINVAL;
 			return 0;
 		}
-		if (val > max || oldval > val) {
+		if (unsigned_mult_overflows(factor, val) ||
+		    factor * val > max) {
 			errno = ERANGE;
 			return 0;
 		}
+		val *= factor;
 		*ret = val;
 		return 1;
 	}
 	errno = EINVAL;
 	return 0;
 }
--
2.22.0

* Re: [PATCH 4/4] config: avoid calling `labs()` on too-large data type
  2019-06-20 19:58           ` René Scharfe
@ 2019-06-20 21:07             ` Junio C Hamano
  2019-06-21 18:35             ` Johannes Schindelin
  1 sibling, 0 replies; 90+ messages in thread
From: Junio C Hamano @ 2019-06-20 21:07 UTC (permalink / raw)
  To: René Scharfe
  Cc: Johannes Schindelin via GitGitGadget, git, Johannes Schindelin

René Scharfe <l.s.r@web.de> writes:

> How about this?

Sounds sensible.

> -- >8 --
> Subject: [PATCH] config: simplify unit suffix handling
>
> parse_unit_factor() checks if a K, M or G is present after a number and
> multiplies it by 2^10, 2^20 or 2^30, respectively.  One of its callers
> checks if the result is smaller than the number alone to detect
> overflows.  The other one passes 1 as the number and does multiplication
> and overflow check itself in a similar manner.
>
> This works, but is inconsistent, and it would break if we added support
> for a bigger unit factor.  E.g. 16777217T expands to 2^64 + 2^40, which
> is too big for a 64-bit number.  Modulo 2^64 we get 2^40 == 1TB, which
> is bigger than the raw number 16777217 == 2^24 + 1, so the overflow
> would go undetected by that method.
>
> Move the multiplication out of parse_unit_factor() and rename it to
> get_unit_factor() to signify its reduced functionality.  This partially
> reverts c8deb5a146 ("Improve error messages when int/long cannot be
> parsed from config", 2007-12-25), but keeps the improved error messages.
> Use a return value of 0 to signal an invalid suffix.
>
> And use unsigned_mult_overflows to check for an overflow *before* doing
> the actual multiplication, which is simpler and can deal with larger
> unit factors.
>
> Signed-off-by: Rene Scharfe <l.s.r@web.de>
> ---
> Patch generated with --function-context for easier reviewing.
>
>  config.c | 39 ++++++++++++++++++---------------------
>  1 file changed, 18 insertions(+), 21 deletions(-)
>
> diff --git a/config.c b/config.c
> index 01c6e9df23..61a8bbb5cd 100644
> --- a/config.c
> +++ b/config.c
> @@ -834,51 +834,46 @@ static int git_parse_source(config_fn_t fn, void *data,
>  	return error_return;
>  }
>
> -static int parse_unit_factor(const char *end, uintmax_t *val)
> +static uintmax_t get_unit_factor(const char *end)
>  {
>  	if (!*end)
>  		return 1;
> -	else if (!strcasecmp(end, "k")) {
> -		*val *= 1024;
> -		return 1;
> -	}
> -	else if (!strcasecmp(end, "m")) {
> -		*val *= 1024 * 1024;
> -		return 1;
> -	}
> -	else if (!strcasecmp(end, "g")) {
> -		*val *= 1024 * 1024 * 1024;
> -		return 1;
> -	}
> +	if (!strcasecmp(end, "k"))
> +		return 1024;
> +	if (!strcasecmp(end, "m"))
> +		return 1024 * 1024;
> +	if (!strcasecmp(end, "g"))
> +		return 1024 * 1024 * 1024;
>  	return 0;
>  }
>
>  static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
>  {
>  	if (value && *value) {
>  		char *end;
>  		intmax_t val;
>  		uintmax_t uval;
> -		uintmax_t factor = 1;
> +		uintmax_t factor;
>
>  		errno = 0;
>  		val = strtoimax(value, &end, 0);
>  		if (errno == ERANGE)
>  			return 0;
> -		if (!parse_unit_factor(end, &factor)) {
> +		factor = get_unit_factor(end);
> +		if (!factor) {
>  			errno = EINVAL;
>  			return 0;
>  		}
>  		uval = val < 0 ? -val : val;
> -		uval *= factor;
> -		if (uval > max || (val < 0 ? -val : val) > uval) {
> +		if (unsigned_mult_overflows(factor, uval) ||
> +		    factor * uval > max) {
>  			errno = ERANGE;
>  			return 0;
>  		}
>  		val *= factor;
>  		*ret = val;
>  		return 1;
>  	}
>  	errno = EINVAL;
>  	return 0;
>  }
> @@ -886,26 +881,28 @@ static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
>  static int git_parse_unsigned(const char *value, uintmax_t *ret, uintmax_t max)
>  {
>  	if (value && *value) {
>  		char *end;
>  		uintmax_t val;
> -		uintmax_t oldval;
> +		uintmax_t factor;
>
>  		errno = 0;
>  		val = strtoumax(value, &end, 0);
>  		if (errno == ERANGE)
>  			return 0;
> -		oldval = val;
> -		if (!parse_unit_factor(end, &val)) {
> +		factor = get_unit_factor(end);
> +		if (!factor) {
>  			errno = EINVAL;
>  			return 0;
>  		}
> -		if (val > max || oldval > val) {
> +		if (unsigned_mult_overflows(factor, val) ||
> +		    factor * val > max) {
>  			errno = ERANGE;
>  			return 0;
>  		}
> +		val *= factor;
>  		*ret = val;
>  		return 1;
>  	}
>  	errno = EINVAL;
>  	return 0;
>  }
> --
> 2.22.0

* Re: [PATCH 4/4] config: avoid calling `labs()` on too-large data type
  2019-06-20 19:58           ` René Scharfe
  2019-06-20 21:07             ` Junio C Hamano
@ 2019-06-21 18:35             ` Johannes Schindelin
  2019-06-22 10:03               ` René Scharfe
  1 sibling, 1 reply; 90+ messages in thread
From: Johannes Schindelin @ 2019-06-21 18:35 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Johannes Schindelin via GitGitGadget, git

Hi René,

On Thu, 20 Jun 2019, René Scharfe wrote:

> Subject: [PATCH] config: simplify unit suffix handling
>
> parse_unit_factor() checks if a K, M or G is present after a number and
> multiplies it by 2^10, 2^20 or 2^30, respectively.  One of its callers
> checks if the result is smaller than the number alone to detect
> overflows.  The other one passes 1 as the number and does multiplication
> and overflow check itself in a similar manner.
>
> This works, but is inconsistent, and it would break if we added support
> for a bigger unit factor.  E.g. 16777217T expands to 2^64 + 2^40, which
> is too big for a 64-bit number.  Modulo 2^64 we get 2^40 == 1TB, which
> is bigger than the raw number 16777217 == 2^24 + 1, so the overflow
> would go undetected by that method.
>
> Move the multiplication out of parse_unit_factor() and rename it to
> get_unit_factor() to signify its reduced functionality.  This partially

I do not necessarily think that the name `get_unit_factor()` is better,
given that we still parse the unit factor. I'd vote for keeping the
original name.

However, what _does_ make sense is to change that function to _really_
only parse the unit factor. That is, I would keep the exact signature, I
just would not multiply `*val` by the unit factor, I would overwrite it by
the unit factor instead.

At least that is what I would have expected, reading the name
`parse_unit_factor()`.
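
A rough sketch of that suggestion, assuming the signature stays as it
is and only the body changes, so that the second argument receives the
factor instead of being multiplied by it:

```c
#include <stdint.h>
#include <strings.h>	/* strcasecmp() */

/*
 * Sketch only: same signature as before, but the function stores the
 * unit factor in *val instead of multiplying *val by it.
 * Returns 1 on success and 0 for an unrecognized suffix.
 */
static int parse_unit_factor(const char *end, uintmax_t *val)
{
	if (!*end) {
		*val = 1;
		return 1;
	}
	if (!strcasecmp(end, "k")) {
		*val = 1024;
		return 1;
	}
	if (!strcasecmp(end, "m")) {
		*val = 1024 * 1024;
		return 1;
	}
	if (!strcasecmp(end, "g")) {
		*val = 1024 * 1024 * 1024;
		return 1;
	}
	return 0;
}
```

Callers would then no longer need to initialize the factor to 1 before
the call, and the multiplication and overflow check stay with them.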

> reverts c8deb5a146 ("Improve error messages when int/long cannot be
> parsed from config", 2007-12-25), but keeps the improved error messages.
> Use a return value of 0 to signal an invalid suffix.

This comment should probably become a code comment above the function.

> And use unsigned_mult_overflows to check for an overflow *before* doing
> the actual multiplication, which is simpler and can deal with larger
> unit factors.

Makes sense.

> Signed-off-by: Rene Scharfe <l.s.r@web.de>

What, no accent?

> ---
> Patch generated with --function-context for easier reviewing.

Ooh, ooh, I did not know that flag. Neat!

> diff --git a/config.c b/config.c
> index 01c6e9df23..61a8bbb5cd 100644
> --- a/config.c
> +++ b/config.c
> @@ -834,51 +834,46 @@ static int git_parse_source(config_fn_t fn, void *data,
>  	return error_return;
>  }
>
> -static int parse_unit_factor(const char *end, uintmax_t *val)
> +static uintmax_t get_unit_factor(const char *end)

It has been a historical wart that the parameter was called `end`. Maybe
that could be fixed, "while at it"?

And as I said earlier, I do not see the need to change the signature
(including the function name) at all.

>  {
>  	if (!*end)
>  		return 1;
> -	else if (!strcasecmp(end, "k")) {
> -		*val *= 1024;
> -		return 1;
> -	}
> -	else if (!strcasecmp(end, "m")) {
> -		*val *= 1024 * 1024;
> -		return 1;
> -	}
> -	else if (!strcasecmp(end, "g")) {
> -		*val *= 1024 * 1024 * 1024;
> -		return 1;
> -	}
> +	if (!strcasecmp(end, "k"))
> +		return 1024;
> +	if (!strcasecmp(end, "m"))
> +		return 1024 * 1024;
> +	if (!strcasecmp(end, "g"))
> +		return 1024 * 1024 * 1024;
>  	return 0;
>  }
>
>  static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
>  {
>  	if (value && *value) {
>  		char *end;
>  		intmax_t val;
>  		uintmax_t uval;
> -		uintmax_t factor = 1;
> +		uintmax_t factor;

I'd keep this change, but...

>
>  		errno = 0;
>  		val = strtoimax(value, &end, 0);
>  		if (errno == ERANGE)
>  			return 0;
> -		if (!parse_unit_factor(end, &factor)) {
> +		factor = get_unit_factor(end);
> +		if (!factor) {

... drop this change, and...

>  			errno = EINVAL;
>  			return 0;
>  		}
>  		uval = val < 0 ? -val : val;
> -		uval *= factor;
> -		if (uval > max || (val < 0 ? -val : val) > uval) {
> +		if (unsigned_mult_overflows(factor, uval) ||
> +		    factor * uval > max) {

... again keep this change.

>  			errno = ERANGE;
>  			return 0;
>  		}
>  		val *= factor;
>  		*ret = val;
>  		return 1;
>  	}
>  	errno = EINVAL;
>  	return 0;
>  }
> @@ -886,26 +881,28 @@ static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
>  static int git_parse_unsigned(const char *value, uintmax_t *ret, uintmax_t max)
>  {
>  	if (value && *value) {
>  		char *end;
>  		uintmax_t val;
> -		uintmax_t oldval;
> +		uintmax_t factor;

Good.

>
>  		errno = 0;
>  		val = strtoumax(value, &end, 0);
>  		if (errno == ERANGE)
>  			return 0;
> -		oldval = val;
> -		if (!parse_unit_factor(end, &val)) {
> +		factor = get_unit_factor(end);
> +		if (!factor) {

Again, here I would strongly suggest the less intrusive change (with a
more intuitive outcome):

-		oldval = val;
-		if (!parse_unit_factor(end, &val)) {
+		if (!parse_unit_factor(end, &factor)) {

>  			errno = EINVAL;
>  			return 0;
>  		}
> -		if (val > max || oldval > val) {
> +		if (unsigned_mult_overflows(factor, val) ||
> +		    factor * val > max) {

And this is obviously a good change again.

>  			errno = ERANGE;
>  			return 0;
>  		}
> +		val *= factor;

As is this.

Thanks for working on this!
Dscho

>  		*ret = val;
>  		return 1;
>  	}
>  	errno = EINVAL;
>  	return 0;
>  }
> --
> 2.22.0
>

* Re: [PATCH 4/4] config: avoid calling `labs()` on too-large data type
  2019-06-21 18:35             ` Johannes Schindelin
@ 2019-06-22 10:03               ` René Scharfe
  0 siblings, 0 replies; 90+ messages in thread
From: René Scharfe @ 2019-06-22 10:03 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Johannes Schindelin via GitGitGadget, git

Am 21.06.19 um 20:35 schrieb Johannes Schindelin:
> Hi René,
>
> On Thu, 20 Jun 2019, René Scharfe wrote:
>
>> Subject: [PATCH] config: simplify unit suffix handling
>>
>> parse_unit_factor() checks if a K, M or G is present after a number and
>> multiplies it by 2^10, 2^20 or 2^30, respectively.  One of its callers
>> checks if the result is smaller than the number alone to detect
>> overflows.  The other one passes 1 as the number and does multiplication
>> and overflow check itself in a similar manner.
>>
>> This works, but is inconsistent, and it would break if we added support
>> for a bigger unit factor.  E.g. 16777217T expands to 2^64 + 2^40, which
>> is too big for a 64-bit number.  Modulo 2^64 we get 2^40 == 1TB, which
>> is bigger than the raw number 16777217 == 2^24 + 1, so the overflow
>> would go undetected by that method.
>>
>> Move the multiplication out of parse_unit_factor() and rename it to
>> get_unit_factor() to signify its reduced functionality.  This partially
>
> I do not necessarily think that the name `get_unit_factor()` is better,
> given that we still parse the unit factor. I'd vote for keeping the
> original name.

get_unit_factor() is the original name from before c8deb5a146.

> However, what _does_ make sense is to change that function to _really_
> only parse the unit factor. That is, I would keep the exact signature, I
> just would not multiply `*val` by the unit factor, I would overwrite it by
> the unit factor instead.

So the patch is too big.  Its narrative of "let's restore the original
code, but keep the good features added since" is not carrying the
weight of its many changes.

> At least that is what I would have expected, reading the name
> `parse_unit_factor()`.

Hence the renaming. :)

When I read parse_unit_factor() without any context then I expect it to
work in the middle of a string, telling the caller how many characters
were recognized.  It would then be usable with different units, e.g.
for "17KB" just as well as for "100Mbps".

We don't need such a generic function here, of course.

>> reverts c8deb5a146 ("Improve error messages when int/long cannot be
>> parsed from config", 2007-12-25), but keeps the improved error messages.
>> Use a return value of 0 to signal an invalid suffix.
>
> This comment should probably become a code comment above the function.

You mean just the last sentence, right?  For an exported function I'd
agree, but for this short helper I'm not so sure and would rather not
bother the reader with easily inferable facts.

>> Signed-off-by: Rene Scharfe <l.s.r@web.de>
>
> What, no accent?

I prefer a recognizable simplified version to a butchered one.  Perhaps
the world is ready for Unicode now?  I still get weirdly transformed
characters on letters and parcels, so I'm cautious.  Testing the waters
with the sender name setting in my MUA for some time now.

>> diff --git a/config.c b/config.c
>> index 01c6e9df23..61a8bbb5cd 100644
>> --- a/config.c
>> +++ b/config.c
>> @@ -834,51 +834,46 @@ static int git_parse_source(config_fn_t fn, void *data,
>>  	return error_return;
>>  }
>>
>> -static int parse_unit_factor(const char *end, uintmax_t *val)
>> +static uintmax_t get_unit_factor(const char *end)
>
> It has been a historical wart that the parameter was called `end`. Maybe
> that could be fixed, "while at it"?

I was tempted to do that, and am a bit proud of having resisted that
one.  I try to avoid "while at it" these days -- if it's worth doing
that other thing then it can live in its own patch.

But the name "end" is arguably good, as it signifies that the function
only works with unit factors at the end of strings.

>>
>>  		errno = 0;
>>  		val = strtoumax(value, &end, 0);
>>  		if (errno == ERANGE)
>>  			return 0;
>> -		oldval = val;
>> -		if (!parse_unit_factor(end, &val)) {
>> +		factor = get_unit_factor(end);
>> +		if (!factor) {
>
> Again, here I would strongly suggest the less intrusive change (with a
> more intuitive outcome):
>
> -		oldval = val;
> -		if (!parse_unit_factor(end, &val)) {
> +		if (!parse_unit_factor(end, &factor)) {
>
>>  			errno = EINVAL;
>>  			return 0;
>>  		}
>> -		if (val > max || oldval > val) {
>> +		if (unsigned_mult_overflows(factor, val) ||
>> +		    factor * val > max) {

I'll split that out, then we can discuss it separately.

René

* [PATCH v2 1/3] config: use unsigned_mult_overflows to check for overflows
  2019-06-16 22:26         ` Junio C Hamano
  2019-06-20 19:58           ` René Scharfe
@ 2019-06-22 10:03           ` René Scharfe
  2019-06-22 10:03           ` [PATCH v2 2/3] config: don't multiply in parse_unit_factor() René Scharfe
  2019-06-22 10:03           ` [PATCH v2 3/3] config: simplify parsing of unit factors René Scharfe
  3 siblings, 0 replies; 90+ messages in thread
From: René Scharfe @ 2019-06-22 10:03 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin via GitGitGadget, git, Johannes Schindelin

parse_unit_factor() checks if a K, M or G is present after a number and
multiplies it by 2^10, 2^20 or 2^30, respectively.  One of its callers
checks if the result is smaller than the number alone to detect
overflows.  The other one passes 1 as the number and does multiplication
and overflow check itself in a similar manner.

This works, but is inconsistent, and it would break if we added support
for a bigger unit factor.  E.g. 16777217T is 2^64 + 2^40, i.e. too big
for a 64-bit number.  Modulo 2^64 we get 2^40 == 1TB, which is bigger
than the raw number 16777217 == 2^24 + 1, so the overflow would go
undetected by that method.

Let both callers pass 1 and handle overflow check and multiplication
themselves.  Do the check before the multiplication, using
unsigned_mult_overflows, which is simpler and can deal with larger unit
factors.
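
To illustrate the failure mode described above, here is a minimal
sketch; it is not code from this patch.  It assumes a 64-bit uintmax_t
and a hypothetical "T" factor of 2^40.  mult_overflows() is only an
illustrative stand-in for unsigned_mult_overflows, and
wraparound_check_detects() models the old multiply-then-compare check.

```c
#include <stdint.h>

/*
 * Illustrative stand-in for unsigned_mult_overflows (assumption):
 * true if a * b would not fit into uintmax_t.  The check is done by
 * division, so no multiplication (and no wraparound) happens.
 */
static int mult_overflows(uintmax_t a, uintmax_t b)
{
	return a && b > UINTMAX_MAX / a;
}

/*
 * Models the old approach: multiply first, then call it an overflow
 * only if the (possibly wrapped) product is smaller than the raw
 * number.
 */
static int wraparound_check_detects(uintmax_t val, uintmax_t factor)
{
	uintmax_t product = val * factor; /* wraps modulo 2^64 */

	return product < val;
}
```

With val = 16777217 and factor = 2^40 the wrapped product is 2^40,
which is larger than val, so wraparound_check_detects() returns 0 even
though the real product does not fit.  mult_overflows() reports the
overflow before anything is multiplied.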

Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
Patch generated with --function-context for easier review (e.g. to see
why we can stop updating uval in place).

 config.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/config.c b/config.c
index 01c6e9df23..3c00369ba8 100644
--- a/config.c
+++ b/config.c
@@ -856,29 +856,29 @@ static int parse_unit_factor(const char *end, uintmax_t *val)
 static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
 {
 	if (value && *value) {
 		char *end;
 		intmax_t val;
 		uintmax_t uval;
 		uintmax_t factor = 1;

 		errno = 0;
 		val = strtoimax(value, &end, 0);
 		if (errno == ERANGE)
 			return 0;
 		if (!parse_unit_factor(end, &factor)) {
 			errno = EINVAL;
 			return 0;
 		}
 		uval = val < 0 ? -val : val;
-		uval *= factor;
-		if (uval > max || (val < 0 ? -val : val) > uval) {
+		if (unsigned_mult_overflows(factor, uval) ||
+		    factor * uval > max) {
 			errno = ERANGE;
 			return 0;
 		}
 		val *= factor;
 		*ret = val;
 		return 1;
 	}
 	errno = EINVAL;
 	return 0;
 }
@@ -886,26 +886,27 @@ static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
 static int git_parse_unsigned(const char *value, uintmax_t *ret, uintmax_t max)
 {
 	if (value && *value) {
 		char *end;
 		uintmax_t val;
-		uintmax_t oldval;
+		uintmax_t factor = 1;

 		errno = 0;
 		val = strtoumax(value, &end, 0);
 		if (errno == ERANGE)
 			return 0;
-		oldval = val;
-		if (!parse_unit_factor(end, &val)) {
+		if (!parse_unit_factor(end, &factor)) {
 			errno = EINVAL;
 			return 0;
 		}
-		if (val > max || oldval > val) {
+		if (unsigned_mult_overflows(factor, val) ||
+		    factor * val > max) {
 			errno = ERANGE;
 			return 0;
 		}
+		val *= factor;
 		*ret = val;
 		return 1;
 	}
 	errno = EINVAL;
 	return 0;
 }
--
2.22.0


* [PATCH v2 2/3] config: don't multiply in parse_unit_factor()
  2019-06-16 22:26         ` Junio C Hamano
  2019-06-20 19:58           ` René Scharfe
  2019-06-22 10:03           ` [PATCH v2 1/3] config: use unsigned_mult_overflows to check for overflows René Scharfe
@ 2019-06-22 10:03           ` René Scharfe
  2019-06-22 10:03           ` [PATCH v2 3/3] config: simplify parsing of unit factors René Scharfe
  3 siblings, 0 replies; 90+ messages in thread
From: René Scharfe @ 2019-06-22 10:03 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin via GitGitGadget, git, Johannes Schindelin

parse_unit_factor() multiplies the number that is passed to it with the
value of a recognized unit factor (K, M or G for 2^10, 2^20 and 2^30,
respectively).  All callers pass in 1 as a number, though, which allows
them to check the actual multiplication for overflow before doing it
themselves.

Ignore the passed in number and don't multiply, as this feature of
parse_unit_factor() is not used anymore.  Rename the output parameter to
reflect that it's not about the end result anymore, but just about the
unit factor.

Suggested-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
 config.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/config.c b/config.c
index 3c00369ba8..a8bd1d821e 100644
--- a/config.c
+++ b/config.c
@@ -834,20 +834,22 @@ static int git_parse_source(config_fn_t fn, void *data,
 	return error_return;
 }

-static int parse_unit_factor(const char *end, uintmax_t *val)
+static int parse_unit_factor(const char *end, uintmax_t *factor)
 {
-	if (!*end)
+	if (!*end) {
+		*factor = 1;
 		return 1;
+	}
 	else if (!strcasecmp(end, "k")) {
-		*val *= 1024;
+		*factor = 1024;
 		return 1;
 	}
 	else if (!strcasecmp(end, "m")) {
-		*val *= 1024 * 1024;
+		*factor = 1024 * 1024;
 		return 1;
 	}
 	else if (!strcasecmp(end, "g")) {
-		*val *= 1024 * 1024 * 1024;
+		*factor = 1024 * 1024 * 1024;
 		return 1;
 	}
 	return 0;
@@ -859,7 +861,7 @@ static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
 		char *end;
 		intmax_t val;
 		uintmax_t uval;
-		uintmax_t factor = 1;
+		uintmax_t factor;

 		errno = 0;
 		val = strtoimax(value, &end, 0);
@@ -888,7 +890,7 @@ static int git_parse_unsigned(const char *value, uintmax_t *ret, uintmax_t max)
 	if (value && *value) {
 		char *end;
 		uintmax_t val;
-		uintmax_t factor = 1;
+		uintmax_t factor;

 		errno = 0;
 		val = strtoumax(value, &end, 0);
--
2.22.0


* [PATCH v2 3/3] config: simplify parsing of unit factors
  2019-06-16 22:26         ` Junio C Hamano
                             ` (2 preceding siblings ...)
  2019-06-22 10:03           ` [PATCH v2 2/3] config: don't multiply in parse_unit_factor() René Scharfe
@ 2019-06-22 10:03           ` René Scharfe
  3 siblings, 0 replies; 90+ messages in thread
From: René Scharfe @ 2019-06-22 10:03 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin via GitGitGadget, git, Johannes Schindelin

Just return the value of the factor or zero for unrecognized strings
instead of using an output reference and a separate return value to
indicate success.  This is shorter and simpler.

It basically reverts that function to before c8deb5a146 ("Improve error
messages when int/long cannot be parsed from config", 2007-12-25), while
keeping the better messages, so restore its old name, get_unit_factor(),
as well.
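
As a rough, standalone sketch of how a caller can use a zero return as
the "unrecognized unit" signal: unit_factor() and parse_unsigned() below
are hypothetical names that only mirror the shape of the code in this
series, and the overflow test is written out by hand instead of using
unsigned_mult_overflows.

```c
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <strings.h>

/* Returns the multiplier for a unit suffix, or 0 if it is unknown. */
static uintmax_t unit_factor(const char *end)
{
	if (!*end)
		return 1;
	else if (!strcasecmp(end, "k"))
		return 1024;
	else if (!strcasecmp(end, "m"))
		return 1024 * 1024;
	else if (!strcasecmp(end, "g"))
		return 1024 * 1024 * 1024;
	return 0;
}

/* Returns 1 and stores the scaled number in *ret; 0 with errno set otherwise. */
static int parse_unsigned(const char *value, uintmax_t *ret, uintmax_t max)
{
	char *end;
	uintmax_t val, factor;

	if (!value || !*value) {
		errno = EINVAL;
		return 0;
	}
	errno = 0;
	val = strtoumax(value, &end, 0);
	if (errno == ERANGE)
		return 0;
	factor = unit_factor(end);
	if (!factor) {
		/* unknown suffix; a valid factor is never zero */
		errno = EINVAL;
		return 0;
	}
	/* check for overflow before multiplying, then for the caller's limit */
	if ((val && factor > UINTMAX_MAX / val) || factor * val > max) {
		errno = ERANGE;
		return 0;
	}
	*ret = val * factor;
	return 1;
}
```

Because no valid unit factor is zero, the return value can carry both
the factor and the error indication, which is what makes the separate
output parameter unnecessary.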

Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
Change from v1: The "else" is kept in each branch, even though it's not
needed, to match the original code from before c8deb5a146.  Other than
that this series arrives at the same end result.  Patch 3 can be
dropped easily if it's not convincing.

 config.c | 30 ++++++++++++------------------
 1 file changed, 12 insertions(+), 18 deletions(-)

diff --git a/config.c b/config.c
index a8bd1d821e..26196bdccf 100644
--- a/config.c
+++ b/config.c
@@ -834,24 +834,16 @@ static int git_parse_source(config_fn_t fn, void *data,
 	return error_return;
 }

-static int parse_unit_factor(const char *end, uintmax_t *factor)
+static uintmax_t get_unit_factor(const char *end)
 {
-	if (!*end) {
-		*factor = 1;
+	if (!*end)
 		return 1;
-	}
-	else if (!strcasecmp(end, "k")) {
-		*factor = 1024;
-		return 1;
-	}
-	else if (!strcasecmp(end, "m")) {
-		*factor = 1024 * 1024;
-		return 1;
-	}
-	else if (!strcasecmp(end, "g")) {
-		*factor = 1024 * 1024 * 1024;
-		return 1;
-	}
+	else if (!strcasecmp(end, "k"))
+		return 1024;
+	else if (!strcasecmp(end, "m"))
+		return 1024 * 1024;
+	else if (!strcasecmp(end, "g"))
+		return 1024 * 1024 * 1024;
 	return 0;
 }

@@ -867,7 +859,8 @@ static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
 		val = strtoimax(value, &end, 0);
 		if (errno == ERANGE)
 			return 0;
-		if (!parse_unit_factor(end, &factor)) {
+		factor = get_unit_factor(end);
+		if (!factor) {
 			errno = EINVAL;
 			return 0;
 		}
@@ -896,7 +889,8 @@ static int git_parse_unsigned(const char *value, uintmax_t *ret, uintmax_t max)
 		val = strtoumax(value, &end, 0);
 		if (errno == ERANGE)
 			return 0;
-		if (!parse_unit_factor(end, &factor)) {
+		factor = get_unit_factor(end);
+		if (!factor) {
 			errno = EINVAL;
 			return 0;
 		}
--
2.22.0


* [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2
  2019-06-15 22:14       ` Ævar Arnfjörð Bjarmason
@ 2019-06-26  0:03         ` Ævar Arnfjörð Bjarmason
  2019-06-26 14:02           ` Johannes Schindelin
                             ` (10 more replies)
  2019-06-26  0:03         ` [RFC/PATCH 1/7] grep: inline the return value of a function call used only once Ævar Arnfjörð Bjarmason
                           ` (6 subsequent siblings)
  7 siblings, 11 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-26  0:03 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

This speeds things up a lot, but as shown in the patches and the changed
tests, it modifies the behavior where we have \0 in *patterns* (only
possible with 'grep -f <file>').

I'd like to go down this route because it makes dropping kwset a lot
easier, and I don't think bending over backwards to support these \0
patterns is worth it.

But maybe others disagree, so I wanted to send what I had before I
tried tackling the pickaxe code. There I figured I'd just make -G's
ERE be a PCRE if we had the PCRE v2 backend, since unlike "grep"'s
default BRE the ERE syntax is mostly a subset of PCRE, but again
others might thing that's too aggressive and would prefer to keep the
distinction, only using PCRE there in place of our current use of
kwset.

Ævar Arnfjörð Bjarmason (7):
  grep: inline the return value of a function call used only once
  grep tests: move "grep binary" alongside the rest
  grep tests: move binary pattern tests into their own file
  grep: make the behavior for \0 in patterns sane
  grep: drop support for \0 in --fixed-strings <pattern>
  grep: remove the kwset optimization
  grep: use PCRE v2 for optimized fixed-string search

 Documentation/git-grep.txt                    |  17 +++
 grep.c                                        | 103 ++++++--------
 grep.h                                        |   2 -
 ...a1.sh => t7008-filter-branch-null-sha1.sh} |   0
 ...08-grep-binary.sh => t7815-grep-binary.sh} | 101 --------------
 t/t7816-grep-binary-pattern.sh                | 127 ++++++++++++++++++
 6 files changed, 183 insertions(+), 167 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (55%)
 create mode 100755 t/t7816-grep-binary-pattern.sh

-- 
2.22.0.455.g172b71a6c5



* [RFC/PATCH 1/7] grep: inline the return value of a function call used only once
  2019-06-15 22:14       ` Ævar Arnfjörð Bjarmason
  2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
@ 2019-06-26  0:03         ` Ævar Arnfjörð Bjarmason
  2019-06-26  0:03         ` [RFC/PATCH 2/7] grep tests: move "grep binary" alongside the rest Ævar Arnfjörð Bjarmason
                           ` (5 subsequent siblings)
  7 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-26  0:03 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Since e944d9d932 ("grep: rewrite an if/else condition to avoid
duplicate expression", 2016-06-25) the "ascii_only" variable has only
been used once in compile_regexp(), let's just inline it there.

This makes the code easier to read, and might make it marginally
faster depending on compiler optimizations.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/grep.c b/grep.c
index f7c3a5803e..d3e6111c46 100644
--- a/grep.c
+++ b/grep.c
@@ -650,13 +650,11 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
-	int ascii_only;
 	int err;
 	int regflags = REG_NEWLINE;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
-	ascii_only     = !has_non_ascii(p->pattern);
 
 	/*
 	 * Even when -F (fixed) asks us to do a non-regexp search, we
@@ -673,7 +671,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (opt->fixed ||
 	    has_null(p->pattern, p->patternlen) ||
 	    is_fixed(p->pattern, p->patternlen))
-		p->fixed = !p->ignore_case || ascii_only;
+		p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
 
 	if (p->fixed) {
 		p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL);
-- 
2.22.0.455.g172b71a6c5



* [RFC/PATCH 2/7] grep tests: move "grep binary" alongside the rest
  2019-06-15 22:14       ` Ævar Arnfjörð Bjarmason
  2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
  2019-06-26  0:03         ` [RFC/PATCH 1/7] grep: inline the return value of a function call used only once Ævar Arnfjörð Bjarmason
@ 2019-06-26  0:03         ` Ævar Arnfjörð Bjarmason
  2019-06-26 14:05           ` Johannes Schindelin
  2019-06-26 18:13           ` Junio C Hamano
  2019-06-26  0:03         ` [RFC/PATCH 3/7] grep tests: move binary pattern tests into their own file Ævar Arnfjörð Bjarmason
                           ` (4 subsequent siblings)
  7 siblings, 2 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-26  0:03 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Move the "grep binary" test case added in aca20dd558 ("grep: add test
script for binary file handling", 2010-05-22) so that it lives
alongside the rest of the "grep" tests in t781*. This would have left
a gap in the t/700* namespace, so move a "filter-branch" test down,
leaving the "t7010-setup.sh" test as the next one after that.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 ...ilter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} | 0
 t/{t7008-grep-binary.sh => t7815-grep-binary.sh}                  | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (100%)

diff --git a/t/t7009-filter-branch-null-sha1.sh b/t/t7008-filter-branch-null-sha1.sh
similarity index 100%
rename from t/t7009-filter-branch-null-sha1.sh
rename to t/t7008-filter-branch-null-sha1.sh
diff --git a/t/t7008-grep-binary.sh b/t/t7815-grep-binary.sh
similarity index 100%
rename from t/t7008-grep-binary.sh
rename to t/t7815-grep-binary.sh
-- 
2.22.0.455.g172b71a6c5



* [RFC/PATCH 3/7] grep tests: move binary pattern tests into their own file
  2019-06-15 22:14       ` Ævar Arnfjörð Bjarmason
                           ` (2 preceding siblings ...)
  2019-06-26  0:03         ` [RFC/PATCH 2/7] grep tests: move "grep binary" alongside the rest Ævar Arnfjörð Bjarmason
@ 2019-06-26  0:03         ` Ævar Arnfjörð Bjarmason
  2019-06-26  0:03         ` [RFC/PATCH 4/7] grep: make the behavior for \0 in patterns sane Ævar Arnfjörð Bjarmason
                           ` (3 subsequent siblings)
  7 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-26  0:03 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Move the tests for "-f <file>" where "<file>" contains a "\0" pattern
into their own file. I added most of these tests in 966be95549 ("grep:
add tests to fix blind spots with \0 patterns", 2017-05-20).

Whether a regex engine supports matching binary content is very
different from whether it matches binary patterns. Since
2f8952250a ("regex: add regexec_buf() that can work on a non
NUL-terminated string", 2016-09-21) we've required REG_STARTEND of our
regex engines so we can match binary content, but only the PCRE v2
engine can sensibly match binary patterns.

Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting
patterns containing "\0" and considering them fixed, except in cases
where "--ignore-case" is provided and they're non-ASCII, see
5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings",
2016-06-25). Subsequent commits will change this behavior.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/t7815-grep-binary.sh         | 101 -----------------------------
 t/t7816-grep-binary-pattern.sh | 114 +++++++++++++++++++++++++++++++++
 2 files changed, 114 insertions(+), 101 deletions(-)
 create mode 100755 t/t7816-grep-binary-pattern.sh

diff --git a/t/t7815-grep-binary.sh b/t/t7815-grep-binary.sh
index 2d87c49b75..90ebb64f46 100755
--- a/t/t7815-grep-binary.sh
+++ b/t/t7815-grep-binary.sh
@@ -4,41 +4,6 @@ test_description='git grep in binary files'
 
 . ./test-lib.sh
 
-nul_match () {
-	matches=$1
-	flags=$2
-	pattern=$3
-	pattern_human=$(echo "$pattern" | sed 's/Q/<NUL>/g')
-
-	if test "$matches" = 1
-	then
-		test_expect_success "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			git grep -f f $flags a
-		"
-	elif test "$matches" = 0
-	then
-		test_expect_success "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			test_must_fail git grep -f f $flags a
-		"
-	elif test "$matches" = T1
-	then
-		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			git grep -f f $flags a
-		"
-	elif test "$matches" = T0
-	then
-		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			test_must_fail git grep -f f $flags a
-		"
-	else
-		test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false'
-	fi
-}
-
 test_expect_success 'setup' "
 	echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a &&
 	git add a &&
@@ -102,72 +67,6 @@ test_expect_failure 'git grep .fi a' '
 	git grep .fi a
 '
 
-nul_match 1 '-F' 'yQf'
-nul_match 0 '-F' 'yQx'
-nul_match 1 '-Fi' 'YQf'
-nul_match 0 '-Fi' 'YQx'
-nul_match 1 '' 'yQf'
-nul_match 0 '' 'yQx'
-nul_match 1 '' 'æQð'
-nul_match 1 '-F' 'eQm[*]c'
-nul_match 1 '-Fi' 'EQM[*]C'
-
-# Regex patterns that would match but shouldn't with -F
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-F' '[y]Qf'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '-Fi' '[Y]QF'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-F' '[æ]Qð'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '-Fi' '[Æ]QÐ'
-
-# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0
-# patterns case-insensitively.
-nul_match T1 '-i' 'ÆQÐ'
-
-# \0 implicitly disables regexes. This is an undocumented internal
-# limitation.
-nul_match T1 '' 'yQ[f]'
-nul_match T1 '' '[y]Qf'
-nul_match T1 '-i' 'YQ[F]'
-nul_match T1 '-i' '[Y]Qf'
-nul_match T1 '' 'æQ[ð]'
-nul_match T1 '' '[æ]Qð'
-nul_match T1 '-i' 'ÆQ[Ð]'
-
-# ... because of \0 implicitly disabling regexes regexes that
-# should/shouldn't match don't do the right thing.
-nul_match T1 '' 'eQm.*cQ'
-nul_match T1 '-i' 'EQM.*cQ'
-nul_match T0 '' 'eQm[*]c'
-nul_match T0 '-i' 'EQM[*]C'
-
-# Due to the REG_STARTEND extension when kwset() is disabled on -i &
-# non-ASCII the string will be matched in its entirety, but the
-# pattern will be cut off at the first \0.
-nul_match 0 '-i' 'NOMATCHQð'
-nul_match T0 '-i' '[Æ]QNOMATCH'
-nul_match T0 '-i' '[æ]QNOMATCH'
-# Matches, but for the wrong reasons, just stops at [æ]
-nul_match 1 '-i' '[Æ]Qð'
-nul_match 1 '-i' '[æ]Qð'
-
-# Ensure that the matcher doesn't regress to something that stops at
-# \0
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '' 'yQNOMATCH'
-nul_match 0 '' 'QNOMATCH'
-nul_match 0 '-i' 'YQNOMATCH'
-nul_match 0 '-i' 'QNOMATCH'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '' 'yQNÓMATCH'
-nul_match 0 '' 'QNÓMATCH'
-nul_match 0 '-i' 'YQNÓMATCH'
-nul_match 0 '-i' 'QNÓMATCH'
-
 test_expect_success 'grep respects binary diff attribute' '
 	echo text >t &&
 	git add t &&
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
new file mode 100755
index 0000000000..4060dbd679
--- /dev/null
+++ b/t/t7816-grep-binary-pattern.sh
@@ -0,0 +1,114 @@
+#!/bin/sh
+
+test_description='git grep with a binary pattern files'
+
+. ./test-lib.sh
+
+nul_match () {
+	matches=$1
+	flags=$2
+	pattern=$3
+	pattern_human=$(echo "$pattern" | sed 's/Q/<NUL>/g')
+
+	if test "$matches" = 1
+	then
+		test_expect_success "git grep -f f $flags '$pattern_human' a" "
+			printf '$pattern' | q_to_nul >f &&
+			git grep -f f $flags a
+		"
+	elif test "$matches" = 0
+	then
+		test_expect_success "git grep -f f $flags '$pattern_human' a" "
+			printf '$pattern' | q_to_nul >f &&
+			test_must_fail git grep -f f $flags a
+		"
+	elif test "$matches" = T1
+	then
+		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
+			printf '$pattern' | q_to_nul >f &&
+			git grep -f f $flags a
+		"
+	elif test "$matches" = T0
+	then
+		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
+			printf '$pattern' | q_to_nul >f &&
+			test_must_fail git grep -f f $flags a
+		"
+	else
+		test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false'
+	fi
+}
+
+test_expect_success 'setup' "
+	echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a &&
+	git add a &&
+	git commit -m.
+"
+
+nul_match 1 '-F' 'yQf'
+nul_match 0 '-F' 'yQx'
+nul_match 1 '-Fi' 'YQf'
+nul_match 0 '-Fi' 'YQx'
+nul_match 1 '' 'yQf'
+nul_match 0 '' 'yQx'
+nul_match 1 '' 'æQð'
+nul_match 1 '-F' 'eQm[*]c'
+nul_match 1 '-Fi' 'EQM[*]C'
+
+# Regex patterns that would match but shouldn't with -F
+nul_match 0 '-F' 'yQ[f]'
+nul_match 0 '-F' '[y]Qf'
+nul_match 0 '-Fi' 'YQ[F]'
+nul_match 0 '-Fi' '[Y]QF'
+nul_match 0 '-F' 'æQ[ð]'
+nul_match 0 '-F' '[æ]Qð'
+nul_match 0 '-Fi' 'ÆQ[Ð]'
+nul_match 0 '-Fi' '[Æ]QÐ'
+
+# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0
+# patterns case-insensitively.
+nul_match T1 '-i' 'ÆQÐ'
+
+# \0 implicitly disables regexes. This is an undocumented internal
+# limitation.
+nul_match T1 '' 'yQ[f]'
+nul_match T1 '' '[y]Qf'
+nul_match T1 '-i' 'YQ[F]'
+nul_match T1 '-i' '[Y]Qf'
+nul_match T1 '' 'æQ[ð]'
+nul_match T1 '' '[æ]Qð'
+nul_match T1 '-i' 'ÆQ[Ð]'
+
+# ... because of \0 implicitly disabling regexes regexes that
+# should/shouldn't match don't do the right thing.
+nul_match T1 '' 'eQm.*cQ'
+nul_match T1 '-i' 'EQM.*cQ'
+nul_match T0 '' 'eQm[*]c'
+nul_match T0 '-i' 'EQM[*]C'
+
+# Due to the REG_STARTEND extension when kwset() is disabled on -i &
+# non-ASCII the string will be matched in its entirety, but the
+# pattern will be cut off at the first \0.
+nul_match 0 '-i' 'NOMATCHQð'
+nul_match T0 '-i' '[Æ]QNOMATCH'
+nul_match T0 '-i' '[æ]QNOMATCH'
+# Matches, but for the wrong reasons, just stops at [æ]
+nul_match 1 '-i' '[Æ]Qð'
+nul_match 1 '-i' '[æ]Qð'
+
+# Ensure that the matcher doesn't regress to something that stops at
+# \0
+nul_match 0 '-F' 'yQ[f]'
+nul_match 0 '-Fi' 'YQ[F]'
+nul_match 0 '' 'yQNOMATCH'
+nul_match 0 '' 'QNOMATCH'
+nul_match 0 '-i' 'YQNOMATCH'
+nul_match 0 '-i' 'QNOMATCH'
+nul_match 0 '-F' 'æQ[ð]'
+nul_match 0 '-Fi' 'ÆQ[Ð]'
+nul_match 0 '' 'yQNÓMATCH'
+nul_match 0 '' 'QNÓMATCH'
+nul_match 0 '-i' 'YQNÓMATCH'
+nul_match 0 '-i' 'QNÓMATCH'
+
+test_done
-- 
2.22.0.455.g172b71a6c5



* [RFC/PATCH 4/7] grep: make the behavior for \0 in patterns sane
  2019-06-15 22:14       ` Ævar Arnfjörð Bjarmason
                           ` (3 preceding siblings ...)
  2019-06-26  0:03         ` [RFC/PATCH 3/7] grep tests: move binary pattern tests into their own file Ævar Arnfjörð Bjarmason
@ 2019-06-26  0:03         ` Ævar Arnfjörð Bjarmason
  2019-06-27  2:03           ` brian m. carlson
  2019-06-26  0:03         ` [RFC/PATCH 5/7] grep: drop support for \0 in --fixed-strings <pattern> Ævar Arnfjörð Bjarmason
                           ` (2 subsequent siblings)
  7 siblings, 1 reply; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-26  0:03 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

The behavior of "grep" when patterns contained "\0" has always been
haphazard, and has served the vagaries of the implementation more than
anything else. A "\0" in a pattern can only be provided via "-f
<file>", and since pickaxe (log search) has no such flag "\0" in
patterns has only ever been supported by "grep".

Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) patterns containing
"\0" were considered fixed. In 966be95549 ("grep: add tests to fix
blind spots with \0 patterns", 2017-05-20) I added tests for this
behavior.

Change the behavior to do the obvious thing, i.e. don't silently
discard a regex pattern and make it implicitly fixed just because it
contains a \0. Instead die if e.g. --basic-regexp is combined with
such a pattern.

This is desired because from a user's point of view it's the obvious
thing to do. Whether we support BRE/ERE/Perl syntax is different from
whether our implementation is limited by C-strings. These patterns are
obscure enough that I think this behavior change is OK, especially
since we never documented the old behavior.

Doing this also makes it easier to replace the kwset backend with
something else, since we'll no longer strictly need it for anything we
can't easily use another fixed-string backend for.
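
To make the C-string limitation concrete, here is a small sketch; the
helper names are hypothetical and not part of this patch.  A pattern
read via "-f <file>" comes with an explicit length, so an embedded \0
can be found with memchr(), whereas an API that takes a plain C string
only sees the bytes before the first \0.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: does the length-delimited pattern contain a \0? */
static int pattern_has_nul(const char *pattern, size_t len)
{
	return memchr(pattern, 0, len) != NULL;
}

/*
 * Hypothetical helper: number of pattern bytes that an API taking a
 * NUL-terminated string would get to see.
 */
static size_t visible_to_cstring_api(const char *pattern)
{
	return strlen(pattern);
}
```

For the five-byte pattern "[y]", \0, "f", pattern_has_nul() is true
while visible_to_cstring_api() yields 3, i.e. such an API would silently
compile just "[y]".  Dying in that case is the alternative to matching
something the user did not ask for.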

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/git-grep.txt     |  17 ++++
 grep.c                         |  23 ++---
 t/t7816-grep-binary-pattern.sh | 159 ++++++++++++++++++---------------
 3 files changed, 110 insertions(+), 89 deletions(-)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 2d27969057..c89fb569e3 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -271,6 +271,23 @@ providing this option will cause it to die.
 
 -f <file>::
 	Read patterns from <file>, one per line.
++
+Passing the pattern via <file> allows for providing a search pattern
+containing a \0.
++
+Not all pattern types support patterns containing \0. Git will error
+out if a given pattern type can't support such a pattern. The
+`--perl-regexp` pattern type when compiled against the PCRE v2 backend
+has the widest support for these types of patterns.
++
+In versions of Git before 2.23.0 patterns containing \0 would be
+silently considered fixed. This was never documented, there were also
+odd and undocumented interactions between e.g. non-ASCII patterns
+containing \0 and `--ignore-case`.
++
+In future versions we may learn to support patterns containing \0 for
+more search backends, until then we'll die when the pattern type in
+question doesn't support them.
 
 -e::
 	The next parameter is the pattern. This option has to be
diff --git a/grep.c b/grep.c
index d3e6111c46..261bd3a342 100644
--- a/grep.c
+++ b/grep.c
@@ -368,18 +368,6 @@ static int is_fixed(const char *s, size_t len)
 	return 1;
 }
 
-static int has_null(const char *s, size_t len)
-{
-	/*
-	 * regcomp cannot accept patterns with NULs so when using it
-	 * we consider any pattern containing a NUL fixed.
-	 */
-	if (memchr(s, 0, len))
-		return 1;
-
-	return 0;
-}
-
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 {
@@ -668,9 +656,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	 * simple string match using kws.  p->fixed tells us if we
 	 * want to use kws.
 	 */
-	if (opt->fixed ||
-	    has_null(p->pattern, p->patternlen) ||
-	    is_fixed(p->pattern, p->patternlen))
+	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
 		p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
 
 	if (p->fixed) {
@@ -678,7 +664,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		kwsincr(p->kws, p->pattern, p->patternlen);
 		kwsprep(p->kws);
 		return;
-	} else if (opt->fixed) {
+	}
+
+	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
+		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
+
+	if (opt->fixed) {
 		/*
 		 * We come here when the pattern has the non-ascii
 		 * characters we cannot case-fold, and asked to
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
index 4060dbd679..9e09bd5d6a 100755
--- a/t/t7816-grep-binary-pattern.sh
+++ b/t/t7816-grep-binary-pattern.sh
@@ -2,113 +2,126 @@
 
 test_description='git grep with a binary pattern files'
 
-. ./test-lib.sh
+. ./lib-gettext.sh
 
-nul_match () {
+nul_match_internal () {
 	matches=$1
-	flags=$2
-	pattern=$3
+	prereqs=$2
+	lc_all=$3
+	extra_flags=$4
+	flags=$5
+	pattern=$6
 	pattern_human=$(echo "$pattern" | sed 's/Q/<NUL>/g')
 
 	if test "$matches" = 1
 	then
-		test_expect_success "git grep -f f $flags '$pattern_human' a" "
+		test_expect_success $prereqs "LC_ALL='$lc_all' git grep $extra_flags -f f $flags '$pattern_human' a" "
 			printf '$pattern' | q_to_nul >f &&
-			git grep -f f $flags a
+			LC_ALL='$lc_all' git grep $extra_flags -f f $flags a
 		"
 	elif test "$matches" = 0
 	then
-		test_expect_success "git grep -f f $flags '$pattern_human' a" "
+		test_expect_success $prereqs "LC_ALL='$lc_all' git grep $extra_flags -f f $flags '$pattern_human' a" "
+			>stderr &&
 			printf '$pattern' | q_to_nul >f &&
-			test_must_fail git grep -f f $flags a
+			test_must_fail env LC_ALL=\"$lc_all\" git grep $extra_flags -f f $flags a 2>stderr &&
+			test_i18ngrep ! 'This is only supported with -P under PCRE v2' stderr
 		"
-	elif test "$matches" = T1
+	elif test "$matches" = P
 	then
-		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
+		test_expect_success $prereqs "error, PCRE v2 only: LC_ALL='$lc_all' git grep -f f $flags '$pattern_human' a" "
+			>stderr &&
 			printf '$pattern' | q_to_nul >f &&
-			git grep -f f $flags a
-		"
-	elif test "$matches" = T0
-	then
-		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			test_must_fail git grep -f f $flags a
+			test_must_fail env LC_ALL=\"$lc_all\" git grep -f f $flags a 2>stderr &&
+			test_i18ngrep 'This is only supported with -P under PCRE v2' stderr
 		"
 	else
 		test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false'
 	fi
 }
 
+nul_match () {
+	matches=$1
+	matches_pcre2=$2
+	matches_pcre2_locale=$3
+	flags=$4
+	pattern=$5
+	pattern_human=$(echo "$pattern" | sed 's/Q/<NUL>/g')
+
+	nul_match_internal "$matches" "" "C" "" "$flags" "$pattern"
+	nul_match_internal "$matches_pcre2" "LIBPCRE2" "C" "-P" "$flags" "$pattern"
+	nul_match_internal "$matches_pcre2_locale" "LIBPCRE2,GETTEXT_LOCALE" "$is_IS_locale" "-P" "$flags" "$pattern"
+}
+
 test_expect_success 'setup' "
 	echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a &&
 	git add a &&
 	git commit -m.
 "
 
-nul_match 1 '-F' 'yQf'
-nul_match 0 '-F' 'yQx'
-nul_match 1 '-Fi' 'YQf'
-nul_match 0 '-Fi' 'YQx'
-nul_match 1 '' 'yQf'
-nul_match 0 '' 'yQx'
-nul_match 1 '' 'æQð'
-nul_match 1 '-F' 'eQm[*]c'
-nul_match 1 '-Fi' 'EQM[*]C'
+# Simple fixed-string matching that can use kwset (no -i && non-ASCII)
+nul_match 1 1 1 '-F' 'yQf'
+nul_match 0 0 0 '-F' 'yQx'
+nul_match 1 1 1 '-Fi' 'YQf'
+nul_match 0 0 0 '-Fi' 'YQx'
+nul_match 1 1 1 '' 'yQf'
+nul_match 0 0 0 '' 'yQx'
+nul_match 1 1 1 '' 'æQð'
+nul_match 1 1 1 '-F' 'eQm[*]c'
+nul_match 1 1 1 '-Fi' 'EQM[*]C'
 
 # Regex patterns that would match but shouldn't with -F
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-F' '[y]Qf'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '-Fi' '[Y]QF'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-F' '[æ]Qð'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '-Fi' '[Æ]QÐ'
+nul_match 0 0 0 '-F' 'yQ[f]'
+nul_match 0 0 0 '-F' '[y]Qf'
+nul_match 0 0 0 '-Fi' 'YQ[F]'
+nul_match 0 0 0 '-Fi' '[Y]QF'
+nul_match 0 0 0 '-F' 'æQ[ð]'
+nul_match 0 0 0 '-F' '[æ]Qð'
 
-# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0
-# patterns case-insensitively.
-nul_match T1 '-i' 'ÆQÐ'
+# The -F kwset codepath can't handle -i && non-ASCII...
+nul_match P 1 1 '-i' '[æ]Qð'
 
-# \0 implicitly disables regexes. This is an undocumented internal
-# limitation.
-nul_match T1 '' 'yQ[f]'
-nul_match T1 '' '[y]Qf'
-nul_match T1 '-i' 'YQ[F]'
-nul_match T1 '-i' '[Y]Qf'
-nul_match T1 '' 'æQ[ð]'
-nul_match T1 '' '[æ]Qð'
-nul_match T1 '-i' 'ÆQ[Ð]'
+# ...PCRE v2 only matches non-ASCII with -i casefolding under UTF-8
+# semantics
+nul_match P P P '-Fi' 'ÆQ[Ð]'
+nul_match P 0 1 '-i'  'ÆQ[Ð]'
+nul_match P 0 1 '-i'  '[Æ]QÐ'
+nul_match P 0 1 '-i' '[Æ]Qð'
+nul_match P 0 1 '-i' 'ÆQÐ'
 
-# ... because of \0 implicitly disabling regexes regexes that
-# should/shouldn't match don't do the right thing.
-nul_match T1 '' 'eQm.*cQ'
-nul_match T1 '-i' 'EQM.*cQ'
-nul_match T0 '' 'eQm[*]c'
-nul_match T0 '-i' 'EQM[*]C'
+# \0 in regexes can only work with -P & PCRE v2
+nul_match P 1 1 '' 'yQ[f]'
+nul_match P 1 1 '' '[y]Qf'
+nul_match P 1 1 '-i' 'YQ[F]'
+nul_match P 1 1 '-i' '[Y]Qf'
+nul_match P 1 1 '' 'æQ[ð]'
+nul_match P 1 1 '' '[æ]Qð'
+nul_match P 0 1 '-i' 'ÆQ[Ð]'
+nul_match P 1 1 '' 'eQm.*cQ'
+nul_match P 1 1 '-i' 'EQM.*cQ'
+nul_match P 0 0 '' 'eQm[*]c'
+nul_match P 0 0 '-i' 'EQM[*]C'
 
-# Due to the REG_STARTEND extension when kwset() is disabled on -i &
-# non-ASCII the string will be matched in its entirety, but the
-# pattern will be cut off at the first \0.
-nul_match 0 '-i' 'NOMATCHQð'
-nul_match T0 '-i' '[Æ]QNOMATCH'
-nul_match T0 '-i' '[æ]QNOMATCH'
-# Matches, but for the wrong reasons, just stops at [æ]
-nul_match 1 '-i' '[Æ]Qð'
-nul_match 1 '-i' '[æ]Qð'
+# Assert that we're using REG_STARTEND and the pattern doesn't match
+# just because it's cut off at the first \0.
+nul_match 0 0 0 '-i' 'NOMATCHQð'
+nul_match P 0 0 '-i' '[Æ]QNOMATCH'
+nul_match P 0 0 '-i' '[æ]QNOMATCH'
 
 # Ensure that the matcher doesn't regress to something that stops at
 # \0
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '' 'yQNOMATCH'
-nul_match 0 '' 'QNOMATCH'
-nul_match 0 '-i' 'YQNOMATCH'
-nul_match 0 '-i' 'QNOMATCH'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '' 'yQNÓMATCH'
-nul_match 0 '' 'QNÓMATCH'
-nul_match 0 '-i' 'YQNÓMATCH'
-nul_match 0 '-i' 'QNÓMATCH'
+nul_match 0 0 0 '-F' 'yQ[f]'
+nul_match 0 0 0 '-Fi' 'YQ[F]'
+nul_match 0 0 0 '' 'yQNOMATCH'
+nul_match 0 0 0 '' 'QNOMATCH'
+nul_match 0 0 0 '-i' 'YQNOMATCH'
+nul_match 0 0 0 '-i' 'QNOMATCH'
+nul_match 0 0 0 '-F' 'æQ[ð]'
+nul_match P P P '-Fi' 'ÆQ[Ð]'
+nul_match P 0 1 '-i' 'ÆQ[Ð]'
+nul_match 0 0 0 '' 'yQNÓMATCH'
+nul_match 0 0 0 '' 'QNÓMATCH'
+nul_match 0 0 0 '-i' 'YQNÓMATCH'
+nul_match 0 0 0 '-i' 'QNÓMATCH'
 
 test_done
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [RFC/PATCH 5/7] grep: drop support for \0 in --fixed-strings <pattern>
  2019-06-15 22:14       ` Ævar Arnfjörð Bjarmason
                           ` (4 preceding siblings ...)
  2019-06-26  0:03         ` [RFC/PATCH 4/7] grep: make the behavior for \0 in patterns sane Ævar Arnfjörð Bjarmason
@ 2019-06-26  0:03         ` Ævar Arnfjörð Bjarmason
  2019-06-26 16:14           ` Junio C Hamano
  2019-06-26  0:03         ` [RFC/PATCH 6/7] grep: remove the kwset optimization Ævar Arnfjörð Bjarmason
  2019-06-26  0:03         ` [RFC/PATCH 7/7] grep: use PCRE v2 for optimized fixed-string search Ævar Arnfjörð Bjarmason
  7 siblings, 1 reply; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-26  0:03 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Change "-f <file>" to no longer support patterns with "\0" in them
under --fixed-strings. We'll now only support these under
--perl-regexp with PCRE v2.

A previous change to Documentation/git-grep.txt changed the
description of "-f <file>" to be vague enough as to not promise that
this would work, and by dropping support for this we make it a whole
lot easier to move away from the kwset backend, which a subsequent
change will try to do.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
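For illustration only, and not part of the patch: a minimal sketch of
why the check relies on memchr() with an explicit length. The helper
name pattern_has_nul() is hypothetical; grep.c inlines the memchr()
call directly.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: report whether a length-delimited pattern
 * contains a NUL byte. strlen() and strchr() are unsuitable here
 * because they stop scanning at the first NUL.
 */
static int pattern_has_nul(const char *pattern, size_t len)
{
	return memchr(pattern, 0, len) != NULL;
}
```

Because the pattern length is tracked separately (p->patternlen in the
patch), the scan covers the bytes after an embedded NUL as well.
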
 grep.c                         |  6 +--
 t/t7816-grep-binary-pattern.sh | 82 +++++++++++++++++-----------------
 2 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/grep.c b/grep.c
index 261bd3a342..14570c7ac1 100644
--- a/grep.c
+++ b/grep.c
@@ -644,6 +644,9 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
 
+	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
+		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
+
 	/*
 	 * Even when -F (fixed) asks us to do a non-regexp search, we
 	 * may not be able to correctly case-fold when -i
@@ -666,9 +669,6 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		return;
 	}
 
-	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
-		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
-
 	if (opt->fixed) {
 		/*
 		 * We come here when the pattern has the non-ascii
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
index 9e09bd5d6a..60bab291e4 100755
--- a/t/t7816-grep-binary-pattern.sh
+++ b/t/t7816-grep-binary-pattern.sh
@@ -60,23 +60,23 @@ test_expect_success 'setup' "
 "
 
 # Simple fixed-string matching that can use kwset (no -i && non-ASCII)
-nul_match 1 1 1 '-F' 'yQf'
-nul_match 0 0 0 '-F' 'yQx'
-nul_match 1 1 1 '-Fi' 'YQf'
-nul_match 0 0 0 '-Fi' 'YQx'
-nul_match 1 1 1 '' 'yQf'
-nul_match 0 0 0 '' 'yQx'
-nul_match 1 1 1 '' 'æQð'
-nul_match 1 1 1 '-F' 'eQm[*]c'
-nul_match 1 1 1 '-Fi' 'EQM[*]C'
+nul_match P P P '-F' 'yQf'
+nul_match P P P '-F' 'yQx'
+nul_match P P P '-Fi' 'YQf'
+nul_match P P P '-Fi' 'YQx'
+nul_match P P 1 '' 'yQf'
+nul_match P P 0 '' 'yQx'
+nul_match P P 1 '' 'æQð'
+nul_match P P P '-F' 'eQm[*]c'
+nul_match P P P '-Fi' 'EQM[*]C'
 
 # Regex patterns that would match but shouldn't with -F
-nul_match 0 0 0 '-F' 'yQ[f]'
-nul_match 0 0 0 '-F' '[y]Qf'
-nul_match 0 0 0 '-Fi' 'YQ[F]'
-nul_match 0 0 0 '-Fi' '[Y]QF'
-nul_match 0 0 0 '-F' 'æQ[ð]'
-nul_match 0 0 0 '-F' '[æ]Qð'
+nul_match P P P '-F' 'yQ[f]'
+nul_match P P P '-F' '[y]Qf'
+nul_match P P P '-Fi' 'YQ[F]'
+nul_match P P P '-Fi' '[Y]QF'
+nul_match P P P '-F' 'æQ[ð]'
+nul_match P P P '-F' '[æ]Qð'
 
 # The -F kwset codepath can't handle -i && non-ASCII...
 nul_match P 1 1 '-i' '[æ]Qð'
@@ -90,38 +90,38 @@ nul_match P 0 1 '-i' '[Æ]Qð'
 nul_match P 0 1 '-i' 'ÆQÐ'
 
 # \0 in regexes can only work with -P & PCRE v2
-nul_match P 1 1 '' 'yQ[f]'
-nul_match P 1 1 '' '[y]Qf'
-nul_match P 1 1 '-i' 'YQ[F]'
-nul_match P 1 1 '-i' '[Y]Qf'
-nul_match P 1 1 '' 'æQ[ð]'
-nul_match P 1 1 '' '[æ]Qð'
-nul_match P 0 1 '-i' 'ÆQ[Ð]'
-nul_match P 1 1 '' 'eQm.*cQ'
-nul_match P 1 1 '-i' 'EQM.*cQ'
-nul_match P 0 0 '' 'eQm[*]c'
-nul_match P 0 0 '-i' 'EQM[*]C'
+nul_match P P 1 '' 'yQ[f]'
+nul_match P P 1 '' '[y]Qf'
+nul_match P P 1 '-i' 'YQ[F]'
+nul_match P P 1 '-i' '[Y]Qf'
+nul_match P P 1 '' 'æQ[ð]'
+nul_match P P 1 '' '[æ]Qð'
+nul_match P P 1 '-i' 'ÆQ[Ð]'
+nul_match P P 1 '' 'eQm.*cQ'
+nul_match P P 1 '-i' 'EQM.*cQ'
+nul_match P P 0 '' 'eQm[*]c'
+nul_match P P 0 '-i' 'EQM[*]C'
 
 # Assert that we're using REG_STARTEND and the pattern doesn't match
 # just because it's cut off at the first \0.
-nul_match 0 0 0 '-i' 'NOMATCHQð'
-nul_match P 0 0 '-i' '[Æ]QNOMATCH'
-nul_match P 0 0 '-i' '[æ]QNOMATCH'
+nul_match P P 0 '-i' 'NOMATCHQð'
+nul_match P P 0 '-i' '[Æ]QNOMATCH'
+nul_match P P 0 '-i' '[æ]QNOMATCH'
 
 # Ensure that the matcher doesn't regress to something that stops at
 # \0
-nul_match 0 0 0 '-F' 'yQ[f]'
-nul_match 0 0 0 '-Fi' 'YQ[F]'
-nul_match 0 0 0 '' 'yQNOMATCH'
-nul_match 0 0 0 '' 'QNOMATCH'
-nul_match 0 0 0 '-i' 'YQNOMATCH'
-nul_match 0 0 0 '-i' 'QNOMATCH'
-nul_match 0 0 0 '-F' 'æQ[ð]'
+nul_match P P P '-F' 'yQ[f]'
+nul_match P P P '-Fi' 'YQ[F]'
+nul_match P P 0 '' 'yQNOMATCH'
+nul_match P P 0 '' 'QNOMATCH'
+nul_match P P 0 '-i' 'YQNOMATCH'
+nul_match P P 0 '-i' 'QNOMATCH'
+nul_match P P P '-F' 'æQ[ð]'
 nul_match P P P '-Fi' 'ÆQ[Ð]'
-nul_match P 0 1 '-i' 'ÆQ[Ð]'
-nul_match 0 0 0 '' 'yQNÓMATCH'
-nul_match 0 0 0 '' 'QNÓMATCH'
-nul_match 0 0 0 '-i' 'YQNÓMATCH'
-nul_match 0 0 0 '-i' 'QNÓMATCH'
+nul_match P P 1 '-i' 'ÆQ[Ð]'
+nul_match P P 0 '' 'yQNÓMATCH'
+nul_match P P 0 '' 'QNÓMATCH'
+nul_match P P 0 '-i' 'YQNÓMATCH'
+nul_match P P 0 '-i' 'QNÓMATCH'
 
 test_done
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [RFC/PATCH 6/7] grep: remove the kwset optimization
  2019-06-15 22:14       ` Ævar Arnfjörð Bjarmason
                           ` (5 preceding siblings ...)
  2019-06-26  0:03         ` [RFC/PATCH 5/7] grep: drop support for \0 in --fixed-strings <pattern> Ævar Arnfjörð Bjarmason
@ 2019-06-26  0:03         ` Ævar Arnfjörð Bjarmason
  2019-06-26  0:03         ` [RFC/PATCH 7/7] grep: use PCRE v2 for optimized fixed-string search Ævar Arnfjörð Bjarmason
  7 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-26  0:03 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

A later change will replace this optimization with a different one,
but, as removing it and running the tests demonstrates, no grep
semantics depend on this backend anymore.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
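For illustration only, and not part of the patch: a rough sketch of
the decision the removed block made, i.e. when a pattern was routed
to kwset. Everything here is a simplification under stated
assumptions: is_special() uses an assumed metacharacter set instead
of git's table-driven is_regex_special(), has_non_ascii() takes a
length here, and would_use_kwset() is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Assumed set of regex metacharacters; not git's actual table. */
static int is_special(char c)
{
	return c != '\0' && strchr("$()*+.?[\\^{|", c) != NULL;
}

/* 1 if no byte of the pattern is a regex metacharacter. */
static int is_fixed(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (is_special(s[i]))
			return 0;
	}
	return 1;
}

/* 1 if any byte has the high bit set. */
static int has_non_ascii(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if ((unsigned char)s[i] & 0x80)
			return 1;
	}
	return 0;
}

/*
 * Hypothetical summary of the removed condition: the kwset path was
 * taken for -F or metacharacter-free patterns, unless -i was combined
 * with non-ASCII bytes (which kwset could not case-fold).
 */
static int would_use_kwset(const char *pat, size_t len,
			   int opt_fixed, int ignore_case)
{
	if (!(opt_fixed || is_fixed(pat, len)))
		return 0;
	return !ignore_case || !has_non_ascii(pat, len);
}
```
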
 grep.c | 63 +++-------------------------------------------------------
 grep.h |  2 --
 2 files changed, 3 insertions(+), 62 deletions(-)

diff --git a/grep.c b/grep.c
index 14570c7ac1..4716217837 100644
--- a/grep.c
+++ b/grep.c
@@ -356,18 +356,6 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p,
 	die("%s'%s': %s", where, p->pattern, error);
 }
 
-static int is_fixed(const char *s, size_t len)
-{
-	size_t i;
-
-	for (i = 0; i < len; i++) {
-		if (is_regex_special(s[i]))
-			return 0;
-	}
-
-	return 1;
-}
-
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 {
@@ -643,38 +631,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
+	p->fixed = opt->fixed;
 
 	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
-	/*
-	 * Even when -F (fixed) asks us to do a non-regexp search, we
-	 * may not be able to correctly case-fold when -i
-	 * (ignore-case) is asked (in which case, we'll synthesize a
-	 * regexp to match the pattern that matches regexp special
-	 * characters literally, while ignoring case differences).  On
-	 * the other hand, even without -F, if the pattern does not
-	 * have any regexp special characters and there is no need for
-	 * case-folding search, we can internally turn it into a
-	 * simple string match using kws.  p->fixed tells us if we
-	 * want to use kws.
-	 */
-	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
-		p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
-
-	if (p->fixed) {
-		p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL);
-		kwsincr(p->kws, p->pattern, p->patternlen);
-		kwsprep(p->kws);
-		return;
-	}
-
 	if (opt->fixed) {
-		/*
-		 * We come here when the pattern has the non-ascii
-		 * characters we cannot case-fold, and asked to
-		 * ignore-case.
-		 */
 		compile_fixed_regexp(p, opt);
 		return;
 	}
@@ -1042,9 +1004,7 @@ void free_grep_patterns(struct grep_opt *opt)
 		case GREP_PATTERN: /* atom */
 		case GREP_PATTERN_HEAD:
 		case GREP_PATTERN_BODY:
-			if (p->kws)
-				kwsfree(p->kws);
-			else if (p->pcre1_regexp)
+			if (p->pcre1_regexp)
 				free_pcre1_regexp(p);
 			else if (p->pcre2_pattern)
 				free_pcre2_pattern(p);
@@ -1104,29 +1064,12 @@ static void show_name(struct grep_opt *opt, const char *name)
 	opt->output(opt, opt->null_following_name ? "\0" : "\n", 1);
 }
 
-static int fixmatch(struct grep_pat *p, char *line, char *eol,
-		    regmatch_t *match)
-{
-	struct kwsmatch kwsm;
-	size_t offset = kwsexec(p->kws, line, eol - line, &kwsm);
-	if (offset == -1) {
-		match->rm_so = match->rm_eo = -1;
-		return REG_NOMATCH;
-	} else {
-		match->rm_so = offset;
-		match->rm_eo = match->rm_so + kwsm.size[0];
-		return 0;
-	}
-}
-
 static int patmatch(struct grep_pat *p, char *line, char *eol,
 		    regmatch_t *match, int eflags)
 {
 	int hit;
 
-	if (p->fixed)
-		hit = !fixmatch(p, line, eol, match);
-	else if (p->pcre1_regexp)
+	if (p->pcre1_regexp)
 		hit = !pcre1match(p, line, eol, match, eflags);
 	else if (p->pcre2_pattern)
 		hit = !pcre2match(p, line, eol, match, eflags);
diff --git a/grep.h b/grep.h
index 1875880f37..90ca435aad 100644
--- a/grep.h
+++ b/grep.h
@@ -32,7 +32,6 @@ typedef int pcre2_compile_context;
 typedef int pcre2_match_context;
 typedef int pcre2_jit_stack;
 #endif
-#include "kwset.h"
 #include "thread-utils.h"
 #include "userdiff.h"
 
@@ -97,7 +96,6 @@ struct grep_pat {
 	pcre2_match_context *pcre2_match_context;
 	pcre2_jit_stack *pcre2_jit_stack;
 	uint32_t pcre2_jit_on;
-	kwset_t kws;
 	unsigned fixed:1;
 	unsigned ignore_case:1;
 	unsigned word_regexp:1;
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [RFC/PATCH 7/7] grep: use PCRE v2 for optimized fixed-string search
  2019-06-15 22:14       ` Ævar Arnfjörð Bjarmason
                           ` (6 preceding siblings ...)
  2019-06-26  0:03         ` [RFC/PATCH 6/7] grep: remove the kwset optimization Ævar Arnfjörð Bjarmason
@ 2019-06-26  0:03         ` Ævar Arnfjörð Bjarmason
  2019-06-26 14:13           ` Johannes Schindelin
  7 siblings, 1 reply; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-26  0:03 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Bring back optimized fixed-string search for "grep", this time with
PCRE v2 as an optional backend. As noted in [1], with the kwset
backend we were slower than PCRE v1 and v2 JIT, so that optimization
was counterproductive.

This brings back the optimization for "-F", without changing the
semantics of "\0" in patterns. As seen in previous commits in this
series we could support it now, but I'd rather just leave that
edge-case aside so the tests don't need to do one thing or the other
depending on what --fixed-strings backend we're using.

I could also support the v1 backend here, but that would make the code
more complex, and I'd rather aim for simplicity here and in future
changes to the diffcore. We're not going to have someone who
absolutely must have faster search, but for whom building PCRE v2
isn't acceptable.

1. https://public-inbox.org/git/87v9x793qi.fsf@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
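For illustration only, and not part of the patch: a minimal
standalone sketch of the "\Q<pat>\E" munging, using malloc() in place
of git's strbuf API so it compiles on its own. The name
quote_literal() is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Wrap pat[0..len) in \Q...\E so a PCRE-style engine treats it as a
 * literal. Returns a malloc()ed, NUL-terminated buffer (the caller
 * frees it) or NULL on allocation failure; *out_len receives the
 * length without the terminator.
 *
 * Caveat: \Q...\E quoting ends at the first "\E", so a pattern that
 * itself contains "\E" is not fully protected by this approach.
 */
static char *quote_literal(const char *pat, size_t len, size_t *out_len)
{
	char *buf = malloc(len + 5);

	if (!buf)
		return NULL;
	memcpy(buf, "\\Q", 2);
	memcpy(buf + 2, pat, len);
	memcpy(buf + 2 + len, "\\E", 2);
	buf[len + 4] = '\0';
	*out_len = len + 4;
	return buf;
}
```

As in the patch, the original pattern is left untouched, so callers
that later print or compare it still see what the user typed.
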
 grep.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index 4716217837..6b75d5be68 100644
--- a/grep.c
+++ b/grep.c
@@ -356,6 +356,18 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p,
 	die("%s'%s': %s", where, p->pattern, error);
 }
 
+static int is_fixed(const char *s, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++) {
+		if (is_regex_special(s[i]))
+			return 0;
+	}
+
+	return 1;
+}
+
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 {
@@ -602,7 +614,6 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
 static void free_pcre2_pattern(struct grep_pat *p)
 {
 }
-#endif /* !USE_LIBPCRE2 */
 
 static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
@@ -623,11 +634,13 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 		compile_regexp_failed(p, errbuf);
 	}
 }
+#endif /* !USE_LIBPCRE2 */
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
 	int err;
 	int regflags = REG_NEWLINE;
+	int pat_is_fixed;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
@@ -636,8 +649,38 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
-	if (opt->fixed) {
+	pat_is_fixed = is_fixed(p->pattern, p->patternlen);
+	if (opt->fixed || pat_is_fixed) {
+#ifdef USE_LIBPCRE2
+		opt->pcre2 = 1;
+		if (pat_is_fixed) {
+			compile_pcre2_pattern(p, opt);
+		} else {
+			/*
+			 * E.g. t7811-grep-open.sh relies on the
+			 * pattern being restored, and unfortunately
+			 * there's no PCRE compile flag for "this is
+			 * fixed", so we need to munge it to
+			 * "\Q<pat>\E".
+			 */
+			char *old_pattern = p->pattern;
+			size_t old_patternlen = p->patternlen;
+			struct strbuf sb = STRBUF_INIT;
+
+			strbuf_add(&sb, "\\Q", 2);
+			strbuf_add(&sb, p->pattern, p->patternlen);
+			strbuf_add(&sb, "\\E", 2);
+
+			p->pattern = sb.buf;
+			p->patternlen = sb.len;
+			compile_pcre2_pattern(p, opt);
+			p->pattern = old_pattern;
+			p->patternlen = old_patternlen;
+			strbuf_release(&sb);
+		}
+#else
 		compile_fixed_regexp(p, opt);
+#endif
 		return;
 	}
 
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2
  2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
@ 2019-06-26 14:02           ` Johannes Schindelin
  2019-06-27  9:16             ` Johannes Schindelin
  2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
                             ` (9 subsequent siblings)
  10 siblings, 1 reply; 90+ messages in thread
From: Johannes Schindelin @ 2019-06-26 14:02 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, git-packagers, gitgitgadget, gitster, peff, sandals, szeder.dev

[-- Attachment #1: Type: text/plain, Size: 392 bytes --]

Hi Ævar,

On Wed, 26 Jun 2019, Ævar Arnfjörð Bjarmason wrote:

> This speeds things up a lot, but as shown in the patches & tests
> changed modifies the behavior where we have \0 in *patterns* (only
> possible with 'grep -f <file>').

I agree that it is not worth a lot to care about NULs in search patterns.

So I am in favor of the goal of this patch series.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH 2/7] grep tests: move "grep binary" alongside the rest
  2019-06-26  0:03         ` [RFC/PATCH 2/7] grep tests: move "grep binary" alongside the rest Ævar Arnfjörð Bjarmason
@ 2019-06-26 14:05           ` Johannes Schindelin
  2019-06-26 18:13           ` Junio C Hamano
  1 sibling, 0 replies; 90+ messages in thread
From: Johannes Schindelin @ 2019-06-26 14:05 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, git-packagers, gitgitgadget, gitster, peff, sandals, szeder.dev

[-- Attachment #1: Type: text/plain, Size: 480 bytes --]

Hi Ævar,

On Wed, 26 Jun 2019, Ævar Arnfjörð Bjarmason wrote:

> Move the "grep binary" test case added in aca20dd558 ("grep: add test
> script for binary file handling", 2010-05-22) so that it lives
> alongside the rest of the "grep" tests in t781*. This would have left
> a gap in the t/700* namespace, so move a "filter-branch" test down,
> leaving the "t7010-setup.sh" test as the next one after that.

I would be totally fine with having gaps.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH 7/7] grep: use PCRE v2 for optimized fixed-string search
  2019-06-26  0:03         ` [RFC/PATCH 7/7] grep: use PCRE v2 for optimized fixed-string search Ævar Arnfjörð Bjarmason
@ 2019-06-26 14:13           ` Johannes Schindelin
  2019-06-26 18:45             ` Junio C Hamano
  0 siblings, 1 reply; 90+ messages in thread
From: Johannes Schindelin @ 2019-06-26 14:13 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, git-packagers, gitgitgadget, gitster, peff, sandals, szeder.dev

[-- Attachment #1: Type: text/plain, Size: 4070 bytes --]

Hi Ævar,

On Wed, 26 Jun 2019, Ævar Arnfjörð Bjarmason wrote:

> Bring back optimized fixed-string search for "grep", this time with
> PCRE v2 as an optional backend. As noted in [1], with the kwset
> backend we were slower than PCRE v1 and v2 JIT, so that optimization
> was counterproductive.
>
> This brings back the optimization for "-F", without changing the
> semantics of "\0" in patterns. As seen in previous commits in this
> series we could support it now, but I'd rather just leave that
> edge-case aside so the tests don't need to do one thing or the other
> depending on what --fixed-strings backend we're using.

Nice. Very, very nice.

> I could also support the v1 backend here, but that would make the code
> more complex, and I'd rather aim for simplicity here and in future
> changes to the diffcore. We're not going to have someone who
> absolutely must have faster search, but for whom building PCRE v2
> isn't acceptable.

I could not agree more.

> diff --git a/grep.c b/grep.c
> index 4716217837..6b75d5be68 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -356,6 +356,18 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p,
>  	die("%s'%s': %s", where, p->pattern, error);
>  }
>
> +static int is_fixed(const char *s, size_t len)
> +{
> +	size_t i;
> +
> +	for (i = 0; i < len; i++) {
> +		if (is_regex_special(s[i]))
> +			return 0;
> +	}
> +
> +	return 1;
> +}
> +
>  #ifdef USE_LIBPCRE1
>  static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
>  {
> @@ -602,7 +614,6 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
>  static void free_pcre2_pattern(struct grep_pat *p)
>  {
>  }
> -#endif /* !USE_LIBPCRE2 */

Huh? Removing an `#endif` without removing the corresponding `#if`?

... but...

>  static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
>  {
> @@ -623,11 +634,13 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
>  		compile_regexp_failed(p, errbuf);
>  	}
>  }
> +#endif /* !USE_LIBPCRE2 */

Ah hah!

If we did not have plenty of exercise for the PCRE2 build options, I
would be worried. But AFAICT the CI build includes this all the time, so
we're fine.

>  static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>  {
>  	int err;
>  	int regflags = REG_NEWLINE;
> +	int pat_is_fixed;
>
>  	p->word_regexp = opt->word_regexp;
>  	p->ignore_case = opt->ignore_case;
> @@ -636,8 +649,38 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>  	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
>  		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
>
> -	if (opt->fixed) {
> +	pat_is_fixed = is_fixed(p->pattern, p->patternlen);
> +	if (opt->fixed || pat_is_fixed) {
> +#ifdef USE_LIBPCRE2
> +		opt->pcre2 = 1;
> +		if (pat_is_fixed) {
> +			compile_pcre2_pattern(p, opt);
> +		} else {
> +			/*
> +			 * E.g. t7811-grep-open.sh relies on the
> +			 * pattern being restored, and unfortunately
> +			 * there's no PCRE compile flag for "this is
> +			 * fixed", so we need to munge it to
> +			 * "\Q<pat>\E".
> +			 */
> +			char *old_pattern = p->pattern;
> +			size_t old_patternlen = p->patternlen;
> +			struct strbuf sb = STRBUF_INIT;
> +
> +			strbuf_add(&sb, "\\Q", 2);
> +			strbuf_add(&sb, p->pattern, p->patternlen);
> +			strbuf_add(&sb, "\\E", 2);
> +
> +			p->pattern = sb.buf;
> +			p->patternlen = sb.len;
> +			compile_pcre2_pattern(p, opt);
> +			p->pattern = old_pattern;
> +			p->patternlen = old_patternlen;
> +			strbuf_release(&sb);
> +		}
> +#else
>  		compile_fixed_regexp(p, opt);
> +#endif

It might be a bit easier to read if the shorter clause came first.

Other than that: what a nice read. I should save reviewing all your patch
series for just-before-bed time.

Thanks,
Dscho

>  		return;
>  	}
>
> --
> 2.22.0.455.g172b71a6c5
>
>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH 5/7] grep: drop support for \0 in --fixed-strings <pattern>
  2019-06-26  0:03         ` [RFC/PATCH 5/7] grep: drop support for \0 in --fixed-strings <pattern> Ævar Arnfjörð Bjarmason
@ 2019-06-26 16:14           ` Junio C Hamano
  0 siblings, 0 replies; 90+ messages in thread
From: Junio C Hamano @ 2019-06-26 16:14 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, git-packagers, gitgitgadget, johannes.schindelin, peff,
	sandals, szeder.dev

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Change "-f <file>" to no longer support patterns with "\0" in them
> under --fixed-strings. We'll now only support these under
> --perl-regexp with PCRE v2.
>
> A previous change to Documentation/git-grep.txt changed the
> description of "-f <file>" to be vague enough as to not promise that
> this would work, and by dropping support for this we make it a whole
> lot easier to move away from the kwset backend, which a subsequent
> change will try to do.
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---

This step, together with all others, looks sensibly justified to me
only when I wear tinted glasses that make me pretend that the final
goal is to promote pcre backend, which is much more important than
serving the current users.  When I remove the glasses, it smells
more like making excuses.

But as we saw discussed in the previous thread, I too think it is OK
to make 'Nobody would notice the updated behaviour of NUL in the
patterns' our working assumption and see if anybody screams---after
all we have to start somewhere to make progress.

A very good thing about this series is that it does *not* add a new
feature that people would miss, even if it went straight to 'master'
and to the next release.  All it does is optimize differently and
change the behaviour we assume nobody depends on.

It is easy enough to revert the whole thing if it turns out to be
problematic even in the worst case, and nobody would notice ;-)

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH 2/7] grep tests: move "grep binary" alongside the rest
  2019-06-26  0:03         ` [RFC/PATCH 2/7] grep tests: move "grep binary" alongside the rest Ævar Arnfjörð Bjarmason
  2019-06-26 14:05           ` Johannes Schindelin
@ 2019-06-26 18:13           ` Junio C Hamano
  1 sibling, 0 replies; 90+ messages in thread
From: Junio C Hamano @ 2019-06-26 18:13 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, git-packagers, gitgitgadget, johannes.schindelin, peff,
	sandals, szeder.dev

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Move the "grep binary" test case added in aca20dd558 ("grep: add test
> script for binary file handling", 2010-05-22) so that it lives
> alongside the rest of the "grep" tests in t781*. This would have left
> a gap in the t/700* namespace, so move a "filter-branch" test down,
> leaving the "t7010-setup.sh" test as the next one after that.

A gap here and there is fine.

>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  ...ilter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} | 0
>  t/{t7008-grep-binary.sh => t7815-grep-binary.sh}                  | 0
>  2 files changed, 0 insertions(+), 0 deletions(-)
>  rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
>  rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (100%)
>
> diff --git a/t/t7009-filter-branch-null-sha1.sh b/t/t7008-filter-branch-null-sha1.sh
> similarity index 100%
> rename from t/t7009-filter-branch-null-sha1.sh
> rename to t/t7008-filter-branch-null-sha1.sh
> diff --git a/t/t7008-grep-binary.sh b/t/t7815-grep-binary.sh
> similarity index 100%
> rename from t/t7008-grep-binary.sh
> rename to t/t7815-grep-binary.sh

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH 7/7] grep: use PCRE v2 for optimized fixed-string search
  2019-06-26 14:13           ` Johannes Schindelin
@ 2019-06-26 18:45             ` Junio C Hamano
  2019-06-27  9:31               ` Johannes Schindelin
  0 siblings, 1 reply; 90+ messages in thread
From: Junio C Hamano @ 2019-06-26 18:45 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, git, git-packagers,
	gitgitgadget, peff, sandals, szeder.dev

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> ...
> Ah hah!
>
> If we did not have plenty of exercise for the PCRE2 build options, I
> would be worried. But AFAICT the CI build includes this all the time, so
> we're fine.

Well, I'd feel safer if it were not "all the time", i.e. we know we
are testing both sides of the coin.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH 4/7] grep: make the behavior for \0 in patterns sane
  2019-06-26  0:03         ` [RFC/PATCH 4/7] grep: make the behavior for \0 in patterns sane Ævar Arnfjörð Bjarmason
@ 2019-06-27  2:03           ` brian m. carlson
  0 siblings, 0 replies; 90+ messages in thread
From: brian m. carlson @ 2019-06-27  2:03 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, git-packagers, gitgitgadget, gitster, johannes.schindelin,
	peff, szeder.dev

[-- Attachment #1: Type: text/plain, Size: 2437 bytes --]

On 2019-06-26 at 00:03:26, Ævar Arnfjörð Bjarmason wrote:
> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index 2d27969057..c89fb569e3 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -271,6 +271,23 @@ providing this option will cause it to die.
>  
>  -f <file>::
>  	Read patterns from <file>, one per line.
> ++
> +Passing the pattern via <file> allows for providing a search pattern
> +containing a \0.

In this case, I think it's easier if we write this as "NUL" or "NUL
byte", since I think you mean a literal byte with value 0 and not the
literal string "\0". I certainly find myself a bit confused, at least,
and I expect others will as well.

> diff --git a/grep.c b/grep.c
> index d3e6111c46..261bd3a342 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -368,18 +368,6 @@ static int is_fixed(const char *s, size_t len)
>  	return 1;
>  }
>  
> -static int has_null(const char *s, size_t len)
> -{
> -	/*
> -	 * regcomp cannot accept patterns with NULs so when using it
> -	 * we consider any pattern containing a NUL fixed.
> -	 */
> -	if (memchr(s, 0, len))
> -		return 1;
> -
> -	return 0;
> -}
> -
>  #ifdef USE_LIBPCRE1
>  static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
>  {
> @@ -668,9 +656,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>  	 * simple string match using kws.  p->fixed tells us if we
>  	 * want to use kws.
>  	 */
> -	if (opt->fixed ||
> -	    has_null(p->pattern, p->patternlen) ||
> -	    is_fixed(p->pattern, p->patternlen))
> +	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
>  		p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
>  
>  	if (p->fixed) {
> @@ -678,7 +664,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>  		kwsincr(p->kws, p->pattern, p->patternlen);
>  		kwsprep(p->kws);
>  		return;
> -	} else if (opt->fixed) {
> +	}
> +
> +	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
> +		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));

We probably want to write this as "NUL" as well.

Otherwise, I'm okay with this change. I didn't expect Git to handle
literal NULs in patterns and I'm surprised that it ever worked.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2
  2019-06-26 14:02           ` Johannes Schindelin
@ 2019-06-27  9:16             ` Johannes Schindelin
  2019-06-27 16:27               ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 90+ messages in thread
From: Johannes Schindelin @ 2019-06-27  9:16 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, git-packagers, gitgitgadget, gitster, peff, sandals, szeder.dev

Hi Ævar,

On Wed, 26 Jun 2019, Johannes Schindelin wrote:

> On Wed, 26 Jun 2019, Ævar Arnfjörð Bjarmason wrote:
>
> > This speeds things up a lot, but as shown in the patches & tests
> > changed modifies the behavior where we have \0 in *patterns* (only
> > possible with 'grep -f <file>').
>
> I agree that it is not worth a lot to care about NULs in search patterns.
>
> So I am in favor of the goal of this patch series.

There seems to be a Windows-specific test failure:
https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11535&view=ms.vss-test-web.build-test-results-tab&runId=28232&resultId=101315&paneView=debug

The output is this:

-- snip --
not ok 5 - log --grep does not find non-reencoded values (latin1)

expecting success:
	git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual &&
	test_must_be_empty actual

++ git log --encoding=ISO-8859-1 --format=%s --grep=é
fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top bits not 0x80
-- snap --

Any quick ideas? (I _could_ imagine that it is yet another case of passing
non-UTF-8-encoded stuff via command-line vs via file, which does not work
on Windows.)

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH 7/7] grep: use PCRE v2 for optimized fixed-string search
  2019-06-26 18:45             ` Junio C Hamano
@ 2019-06-27  9:31               ` Johannes Schindelin
  2019-06-27 18:45                 ` Johannes Schindelin
  0 siblings, 1 reply; 90+ messages in thread
From: Johannes Schindelin @ 2019-06-27  9:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, git, git-packagers,
	gitgitgadget, peff, sandals, szeder.dev

Hi Junio,

On Wed, 26 Jun 2019, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > ...
> > Ah hah!
> >
> > If we would not have plenty of exercise for the PCRE2 build options, I
> > would be worried. But AFAICT the CI build includes this all the time, so
> > we're fine.
>
> Well, I'd feel safer if it were not "all the time", i.e. we know we
> are testing both sides of the coin.

AFAIR at least the Linux32 job is built without PCRE2 by default. I might
be wrong on that, though...

In any case, the upcoming MSVC support for our Azure Pipeline _will_ build
without PCRE2, so then we will have that axis covered.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2
  2019-06-27  9:16             ` Johannes Schindelin
@ 2019-06-27 16:27               ` Ævar Arnfjörð Bjarmason
  2019-06-27 18:21                 ` Johannes Schindelin
  0 siblings, 1 reply; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-27 16:27 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: git, git-packagers, gitgitgadget, gitster, peff, sandals, szeder.dev


On Thu, Jun 27 2019, Johannes Schindelin wrote:

> Hi Ævar,
>
> On Wed, 26 Jun 2019, Johannes Schindelin wrote:
>
>> On Wed, 26 Jun 2019, Ævar Arnfjörð Bjarmason wrote:
>>
>> > This speeds things up a lot, but as shown in the patches & tests
>> > changed modifies the behavior where we have \0 in *patterns* (only
>> > possible with 'grep -f <file>').
>>
>> I agree that it is not worth a lot to care about NULs in search patterns.
>>
>> So I am in favor of the goal of this patch series.
>
> There seems to be a Windows-specific test failure:
> https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11535&view=ms.vss-test-web.build-test-results-tab&runId=28232&resultId=101315&paneView=debug
>
> The output is this:
>
> -- snip --
> not ok 5 - log --grep does not find non-reencoded values (latin1)
>
> expecting success:
> 	git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual
> &&
> 	test_must_be_empty actual
>
> ++ git log --encoding=ISO-8859-1 --format=%s --grep=é
> fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top bits
> not 0x80
> -- snap --
>
> Any quick ideas? (I _could_ imagine that it is yet another case of passing
> non-UTF-8-encoded stuff via command-line vs via file, which does not work
> on Windows.)

This is an existing issue that my patches just happen to uncover. I'm
working on a v2 which'll fix it.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2
  2019-06-27 16:27               ` Ævar Arnfjörð Bjarmason
@ 2019-06-27 18:21                 ` Johannes Schindelin
  0 siblings, 0 replies; 90+ messages in thread
From: Johannes Schindelin @ 2019-06-27 18:21 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, git-packagers, gitgitgadget, gitster, peff, sandals, szeder.dev

Hi Ævar,

On Thu, 27 Jun 2019, Ævar Arnfjörð Bjarmason wrote:

>
> On Thu, Jun 27 2019, Johannes Schindelin wrote:
>
> > On Wed, 26 Jun 2019, Johannes Schindelin wrote:
> >
> >> On Wed, 26 Jun 2019, Ævar Arnfjörð Bjarmason wrote:
> >>
> >> > This speeds things up a lot, but as shown in the patches & tests
> >> > changed modifies the behavior where we have \0 in *patterns* (only
> >> > possible with 'grep -f <file>').
> >>
> >> I agree that it is not worth a lot to care about NULs in search patterns.
> >>
> >> So I am in favor of the goal of this patch series.
> >
> > There seems to be a Windows-specific test failure:
> > https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11535&view=ms.vss-test-web.build-test-results-tab&runId=28232&resultId=101315&paneView=debug
> >
> > The output is this:
> >
> > -- snip --
> > not ok 5 - log --grep does not find non-reencoded values (latin1)
> >
> > expecting success:
> > 	git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual
> > &&
> > 	test_must_be_empty actual
> >
> > ++ git log --encoding=ISO-8859-1 --format=%s --grep=é
> > fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top bits
> > not 0x80
> > -- snap --
> >
> > Any quick ideas? (I _could_ imagine that it is yet another case of passing
> > non-UTF-8-encoded stuff via command-line vs via file, which does not work
> > on Windows.)
>
> This is an existing issue that my patches just happen to uncover. I'm
> working on a v2 which'll fix it.

I just found yet another problem. When using a libpcre2 _without_ JIT
support, I get this:

-- snip --
$ sh t4210-log-i18n.sh -i -V -x || tail -20 test-results/t4210-log-i18n.out
ok 1 - create commits in different encodings
ok 2 - log --grep searches in log output encoding (utf8)
ok 3 # skip log --grep searches in log output encoding (latin1) (missing !MINGW)
ok 4 # skip log --grep does not find non-reencoded values (utf8) (missing !MINGW)
not ok 5 - log --grep does not find non-reencoded values (latin1)
#
#               git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e
#               >actual &&
#               test_must_be_empty actual
#
ok 3 # skip log --grep searches in log output encoding (latin1) (missing !MINGW)

skipping test: log --grep does not find non-reencoded values (utf8)
        git log --encoding=utf8 --format=%s --grep=$latin1_e >actual &&
        test_must_be_empty actual

ok 4 # skip log --grep does not find non-reencoded values (utf8) (missing !MINGW)

expecting success:
        git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual &&
        test_must_be_empty actual

++ git log --encoding=ISO-8859-1 --format=%s --grep=é
fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top bits not 0x80
error: last command exited with $?=128
not ok 5 - log --grep does not find non-reencoded values (latin1)
#
#               git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e
#               >actual &&
#               test_must_be_empty actual
#
-- snap --

This is actually a correct error, as we specifically feed non-UTF-8 text
to PCRE2, but we do turn on the PCRE2_UTF flag.

Funnily enough, this error only occurs when `pcre2_jit_on == 0`, i.e. when
we hit the code path that calls `pcre2_match()`. When the alternative code
path is used, `pcre2_jit_match()` is called, and it does _not_ print that
error.

Whatever the bug in libpcre2 that causes the JIT code path to skip the
Unicode validation, it points to the problem in this code in
`compile_pcre2_pattern()`:

-- snip --
        if (is_utf8_locale() && has_non_ascii(p->pattern))
                options |= PCRE2_UTF;
-- snap --

It only asks whether there is a non-ASCII character in the pattern, but we
never bother to check whether the haystack is also encoded in UTF-8. In
this case, it is not...
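
As a rough illustration of that error, here is a simplified C sketch of
a structural UTF-8 check. It is not PCRE2's validator (it ignores
overlong forms and surrogates), and `utf8_first_error()` and
`want_utf_mode()` are hypothetical names. In ISO-8859-1, "é" is the
single byte 0xE9, which looks like the lead byte of a three-byte UTF-8
sequence, so the byte after it is expected to be a continuation byte:

```c
#include <stddef.h>

/*
 * Simplified structural UTF-8 check. Returns -1 if every sequence is
 * well formed, else the offset of the byte starting the bad sequence.
 */
long utf8_first_error(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = s[i];
		size_t need, k;

		if (c < 0x80)
			need = 0;		/* ASCII */
		else if ((c & 0xE0) == 0xC0)
			need = 1;		/* 110xxxxx: two-byte sequence */
		else if ((c & 0xF0) == 0xE0)
			need = 2;		/* 1110xxxx: three-byte sequence */
		else if ((c & 0xF8) == 0xF0)
			need = 3;		/* 11110xxx: four-byte sequence */
		else
			return (long)i;		/* stray continuation or bad lead */

		if (need > len - i - 1)
			return (long)i;		/* sequence is truncated */
		for (k = 1; k <= need; k++)
			if ((s[i + k] & 0xC0) != 0x80)
				return (long)i;	/* top bits are not 10xxxxxx */
		i += need + 1;
	}
	return -1;
}

/*
 * Hypothetical guard (an assumption, not necessarily the eventual fix):
 * request UTF mode only when the haystack is known to be UTF-8 as well.
 */
int want_utf_mode(int utf8_locale, int pattern_non_ascii, int haystack_utf8)
{
	return utf8_locale && pattern_non_ascii && haystack_utf8;
}
```

Fed the latin1 bytes "caf\xE9 x", the check trips at offset 3 because
the space after 0xE9 is not a continuation byte; the UTF-8 bytes
"caf\xC3\xA9" pass.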

Ciao,
Dscho

P.S.: Yes, yes, I know, we should run PCRE2 with JIT, and I am trying to
figure out why it is not enabled on Windows when I specifically asked
`./configure` to enable it... Investigating now.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH 7/7] grep: use PCRE v2 for optimized fixed-string search
  2019-06-27  9:31               ` Johannes Schindelin
@ 2019-06-27 18:45                 ` Johannes Schindelin
  2019-06-27 19:06                   ` Junio C Hamano
  0 siblings, 1 reply; 90+ messages in thread
From: Johannes Schindelin @ 2019-06-27 18:45 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, git, git-packagers,
	gitgitgadget, peff, sandals, szeder.dev

Hi,

On Thu, 27 Jun 2019, Johannes Schindelin wrote:

> On Wed, 26 Jun 2019, Junio C Hamano wrote:
>
> > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> >
> > > ...
> > > Ah hah!
> > >
> > > If we would not have plenty of exercise for the PCRE2 build options, I
> > > would be worried. But AFAICT the CI build includes this all the time, so
> > > we're fine.
> >
> > Well, I'd feel safer if it were not "all the time", i.e. we know we
> > are testing both sides of the coin.
>
> AFAIR at least the Linux32 job is built without PCRE2 by default. I might
> be wrong on that, though...

Actually, it seems that _all_ of the Linux builds in our Azure Pipeline
compile without pcre2. It seems you have to pass `USE_LIBPCRE2=1` to
`make`, and we do not do that in `ci/run-build-and-tests.sh` nor in
`azure-pipelines.yml`. I do not even see that for the macOS builds.

So we got PCRE2 covered only in the Windows build, it seems.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH 7/7] grep: use PCRE v2 for optimized fixed-string search
  2019-06-27 18:45                 ` Johannes Schindelin
@ 2019-06-27 19:06                   ` Junio C Hamano
  2019-06-28 10:56                     ` Johannes Schindelin
  0 siblings, 1 reply; 90+ messages in thread
From: Junio C Hamano @ 2019-06-27 19:06 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, git, git-packagers,
	gitgitgadget, peff, sandals, szeder.dev

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> > > If we would not have plenty of exercise for the PCRE2 build options, I
>> > > would be worried. But AFAICT the CI build includes this all the time, so
>> > > we're fine.
>> >
>> > Well, I'd feel safer if it were not "all the time", i.e. we know we
>> > are testing both sides of the coin.
>>
>> AFAIR at least the Linux32 job is built without PCRE2 by default. I might
>> be wrong on that, though...
>
> Actually, it seems that _all_ of the Linux builds in our Azure Pipeline
> compile without pcre2. It seems you have to pass `USE_LIBPCRE2=1` to
> `make`, and we do not do that in `ci/run-build-and-tests.sh` nor in
> `azure-pipelines.yml`. I do not even see that for the macOS builds.
>
> So we got PCRE2 covered only in the Windows build, it seems.

OK, it sounds like we have sufficient coverage on both fronts.
Good.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v2 0/9] grep: move from kwset to optional PCRE v2
  2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
  2019-06-26 14:02           ` Johannes Schindelin
@ 2019-06-27 23:39           ` Ævar Arnfjörð Bjarmason
  2019-06-28  7:23             ` Ævar Arnfjörð Bjarmason
                               ` (11 more replies)
  2019-06-27 23:39           ` [PATCH v2 1/9] log tests: test regex backends in "--encode=<enc>" tests Ævar Arnfjörð Bjarmason
                             ` (8 subsequent siblings)
  10 siblings, 12 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-27 23:39 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

A non-RFC since it seems people like this approach.

This should fix the test failure noted by Johannes; there are two new
patches at the start of this series. They address a bug that was there
for a long time, but that I happened to trip over since PCRE is more
strict about UTF-8 validation than kwset (which doesn't care at all).
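
To illustrate the "doesn't care at all" part, here is a minimal C sketch
of a byte-wise fixed-string search. It is not kwset's actual algorithm,
and `find_bytes()` is a hypothetical name. Because it only compares raw
bytes, it never has a reason to reject a haystack for being invalid
UTF-8; a mismatched encoding simply yields no match:

```c
#include <stddef.h>
#include <string.h>

/*
 * Naive byte-wise search; returns the offset of the first match or -1.
 * It compares raw bytes only, so it never validates any encoding.
 */
long find_bytes(const char *hay, size_t hlen, const char *needle, size_t nlen)
{
	size_t i;

	if (nlen > hlen)
		return -1;
	for (i = 0; i + nlen <= hlen; i++)
		if (!memcmp(hay + i, needle, nlen))
			return (long)i;
	return -1;
}
```

Searching the latin1 bytes "caf\xE9 x" for the UTF-8 needle "\xC3\xA9"
quietly finds nothing, whereas a validating engine in UTF mode would
stop with an error on the same haystack.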

I also added performance numbers to the relevant commit messages, took
brian's suggestion of saying "NUL-byte" instead of "\0", and did some
other copyediting of my own.

The rest of the code changes are all just comments & rewording of
previously added comments.

Ævar Arnfjörð Bjarmason (9):
  log tests: test regex backends in "--encode=<enc>" tests
  grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>"
  grep: inline the return value of a function call used only once
  grep tests: move "grep binary" alongside the rest
  grep tests: move binary pattern tests into their own file
  grep: make the behavior for NUL-byte in patterns sane
  grep: drop support for \0 in --fixed-strings <pattern>
  grep: remove the kwset optimization
  grep: use PCRE v2 for optimized fixed-string search

 Documentation/git-grep.txt                    |  17 +++
 grep.c                                        | 115 +++++++---------
 grep.h                                        |   3 +-
 revision.c                                    |   3 +
 t/t4210-log-i18n.sh                           |  39 +++++-
 ...a1.sh => t7008-filter-branch-null-sha1.sh} |   0
 ...08-grep-binary.sh => t7815-grep-binary.sh} | 101 --------------
 t/t7816-grep-binary-pattern.sh                | 127 ++++++++++++++++++
 8 files changed, 233 insertions(+), 172 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (55%)
 create mode 100755 t/t7816-grep-binary-pattern.sh

Range-diff:
 -:  ---------- >  1:  cfc01f49d3 log tests: test regex backends in "--encode=<enc>" tests
 -:  ---------- >  2:  4b59eb32f0 grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>"
 1:  ad55d3be7e =  3:  cc4d3b50d5 grep: inline the return value of a function call used only once
 2:  650bcc8582 =  4:  d9b29bdd89 grep tests: move "grep binary" alongside the rest
 3:  ef10a8820d !  5:  f85614f435 grep tests: move binary pattern tests into their own file
    @@ -2,9 +2,10 @@
     
         grep tests: move binary pattern tests into their own file
     
    -    Move the tests for "-f <file>" where "<file>" contains a "\0" pattern
    -    into their own file. I added most of these tests in 966be95549 ("grep:
    -    add tests to fix blind spots with \0 patterns", 2017-05-20).
    +    Move the tests for "-f <file>" where "<file>" contains a NUL byte
    +    pattern into their own file. I added most of these tests in
    +    966be95549 ("grep: add tests to fix blind spots with \0 patterns",
    +    2017-05-20).
     
         Whether a regex engine supports matching binary content is very
         different from whether it matches binary patterns. Since
    @@ -14,8 +15,8 @@
         engine can sensibly match binary patterns.
     
         Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting
    -    patterns containing "\0" and considering them fixed, except in cases
    -    where "--ignore-case" is provided and they're non-ASCII, see
    +    patterns containing NUL-byte and considering them fixed, except in
    +    cases where "--ignore-case" is provided and they're non-ASCII, see
         5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings",
         2016-06-25). Subsequent commits will change this behavior.
     
 4:  03e5637efc !  6:  90afca8707 grep: make the behavior for \0 in patterns sane
    @@ -1,12 +1,13 @@
     Author: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
    -    grep: make the behavior for \0 in patterns sane
    +    grep: make the behavior for NUL-byte in patterns sane
     
    -    The behavior of "grep" when patterns contained "\0" has always been
    -    haphazard, and has served the vagaries of the implementation more than
    -    anything else. A "\0" in a pattern can only be provided via "-f
    -    <file>", and since pickaxe (log search) has no such flag "\0" in
    -    patterns has only ever been supported by "grep".
    +    The behavior of "grep" when patterns contained a NUL-byte has always
    +    been haphazard, and has served the vagaries of the implementation more
    +    than anything else. A pattern containing a NUL-byte can only be
    +    provided via "-f <file>". Since pickaxe (log search) has no such flag
    +    the NUL-byte in patterns has only ever been supported by "grep" (and
    +    not "log --grep").
     
         Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) patterns containing
         "\0" were considered fixed. In 966be95549 ("grep: add tests to fix
    @@ -14,9 +15,9 @@
         behavior.
     
         Change the behavior to do the obvious thing, i.e. don't silently
    -    discard a regex pattern and make it implicitly fixed just because it
    -    contains a \0. Instead die if e.g. --basic-regexp is combined with
    -    such a pattern.
    +    discard a regex pattern and make it implicitly fixed just because they
    +    contain a NUL-byte. Instead die if the backend in question can't
    +    handle them, e.g. --basic-regexp is combined with such a pattern.
     
         This is desired because from a user's point of view it's the obvious
         thing to do. Whether we support BRE/ERE/Perl syntax is different from
 5:  b9aad3ec1c !  7:  526b925fdc grep: drop support for \0 in --fixed-strings <pattern>
    @@ -2,15 +2,14 @@
     
         grep: drop support for \0 in --fixed-strings <pattern>
     
    -    Change "-f <file>" to not support patterns with "\0" in them under
    -    --fixed-strings, we'll now only support these under --perl-regexp with
    -    PCRE v2.
    +    Change "-f <file>" to not support patterns with a NUL-byte in them
    +    under --fixed-strings. We'll now only support these under
    +    "--perl-regexp" with PCRE v2.
     
    -    A previous change to Documentation/git-grep.txt changed the
    -    description of "-f <file>" to be vague enough as to not promise that
    -    this would work, and by dropping support for this we make it a whole
    -    lot easier to move away from the kwset backend, which a subsequent
    -    change will try to do.
    +    A previous change to grep's documentation changed the description of
    +    "-f <file>" to be vague enough as to not promise that this would work.
    +    By dropping support for this we make it a whole lot easier to move
    +    away from the kwset backend, which we'll do in a subsequent change.
     
         Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
 6:  3587be009a !  8:  14269bb295 grep: remove the kwset optimization
    @@ -2,9 +2,99 @@
     
         grep: remove the kwset optimization
     
    -    A later change will replace this optimization with a different one,
    -    but as removing it and running the tests demonstrates no grep
    -    semantics depend on this backend anymore.
    +    A later change will replace this optimization with optimistic use of
    +    PCRE v2. I'm completely removing it as an intermediate step, as
    +    opposed to replacing it with PCRE v2, to demonstrate that no grep
    +    semantics depend on this (or any other) optimization for the fixed
    +    backend anymore.
    +
    +    For now this is mostly (but not entirely) a performance regression, as
    +    shown by this hacky one-liner:
    +
    +        for opt in '' ' -i'
    +            do
    +            GIT_PERF_7821_GREP_OPTS=$opt GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3 USE_LIBPCRE=YesPlease' ./run origin/master HEAD -- p7821-grep-engines-fixed.sh
    +        done &&
    +        for opt in '' ' -i'
    +            do GIT_PERF_4221_LOG_OPTS=$opt GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3 USE_LIBPCRE=YesPlease' ./run origin/master HEAD -- p4221-log-grep-engines-fixed.sh
    +        done
    +
    +    Which produces:
    +
    +    plain grep:
    +
    +        Test                             origin/master     HEAD
    +        -------------------------------------------------------------------------
    +        7821.1: fixed grep int           0.55(1.60+0.63)   0.82(3.11+0.51) +49.1%
    +        7821.2: basic grep int           0.62(1.68+0.49)   0.85(3.02+0.52) +37.1%
    +        7821.3: extended grep int        0.61(1.63+0.53)   0.91(3.09+0.44) +49.2%
    +        7821.4: perl grep int            0.55(1.60+0.57)   0.41(0.93+0.57) -25.5%
    +        7821.6: fixed grep uncommon      0.20(0.50+0.44)   0.35(1.27+0.42) +75.0%
    +        7821.7: basic grep uncommon      0.20(0.49+0.45)   0.35(1.29+0.41) +75.0%
    +        7821.8: extended grep uncommon   0.20(0.45+0.48)   0.35(1.25+0.44) +75.0%
    +        7821.9: perl grep uncommon       0.20(0.53+0.41)   0.16(0.24+0.49) -20.0%
    +        7821.11: fixed grep æ            0.35(1.27+0.40)   0.25(0.82+0.39) -28.6%
    +        7821.12: basic grep æ            0.35(1.28+0.38)   0.25(0.75+0.44) -28.6%
    +        7821.13: extended grep æ         0.36(1.21+0.46)   0.25(0.86+0.35) -30.6%
    +        7821.14: perl grep æ             0.35(1.33+0.34)   0.16(0.26+0.47) -54.3%
    +
    +    grep with -i:
    +
    +        Test                                origin/master     HEAD
    +        -----------------------------------------------------------------------------
    +        7821.1: fixed grep -i int           0.61(1.84+0.64)   1.11(4.12+0.64) +82.0%
    +        7821.2: basic grep -i int           0.72(1.86+0.57)   1.15(4.48+0.49) +59.7%
    +        7821.3: extended grep -i int        0.94(1.83+0.60)   1.53(4.12+0.58) +62.8%
    +        7821.4: perl grep -i int            0.66(1.82+0.59)   0.55(1.08+0.58) -16.7%
    +        7821.6: fixed grep -i uncommon      0.21(0.51+0.44)   0.44(1.74+0.34) +109.5%
    +        7821.7: basic grep -i uncommon      0.21(0.55+0.41)   0.44(1.72+0.40) +109.5%
    +        7821.8: extended grep -i uncommon   0.21(0.57+0.39)   0.42(1.64+0.45) +100.0%
    +        7821.9: perl grep -i uncommon       0.21(0.48+0.48)   0.17(0.30+0.45) -19.0%
    +        7821.11: fixed grep -i æ            0.25(0.73+0.45)   0.25(0.75+0.45) +0.0%
    +        7821.12: basic grep -i æ            0.25(0.71+0.49)   0.26(0.77+0.44) +4.0%
    +        7821.13: extended grep -i æ         0.25(0.75+0.44)   0.25(0.74+0.46) +0.0%
    +        7821.14: perl grep -i æ             0.17(0.26+0.48)   0.16(0.20+0.52) -5.9%
    +
    +    plain log:
    +
    +        Test                                     origin/master     HEAD
    +        ---------------------------------------------------------------------------------
    +        4221.1: fixed log --grep='int'           7.31(7.06+0.21)   8.11(7.85+0.20) +10.9%
    +        4221.2: basic log --grep='int'           7.30(6.94+0.27)   8.16(7.89+0.19) +11.8%
    +        4221.3: extended log --grep='int'        7.34(7.05+0.21)   8.08(7.76+0.25) +10.1%
    +        4221.4: perl log --grep='int'            7.27(6.94+0.24)   7.05(6.76+0.25) -3.0%
    +        4221.6: fixed log --grep='uncommon'      6.97(6.62+0.32)   7.86(7.51+0.30) +12.8%
    +        4221.7: basic log --grep='uncommon'      7.05(6.69+0.29)   7.89(7.60+0.28) +11.9%
    +        4221.8: extended log --grep='uncommon'   6.89(6.56+0.32)   7.99(7.66+0.24) +16.0%
    +        4221.9: perl log --grep='uncommon'       7.02(6.66+0.33)   6.97(6.54+0.36) -0.7%
    +        4221.11: fixed log --grep='æ'            7.37(7.03+0.33)   7.67(7.30+0.31) +4.1%
    +        4221.12: basic log --grep='æ'            7.41(7.00+0.31)   7.60(7.28+0.26) +2.6%
    +        4221.13: extended log --grep='æ'         7.35(6.96+0.38)   7.73(7.31+0.34) +5.2%
    +        4221.14: perl log --grep='æ'             7.43(7.10+0.32)   6.95(6.61+0.27) -6.5%
    +
    +    log with -i:
    +
    +        Test                                        origin/master     HEAD
    +        ------------------------------------------------------------------------------------
    +        4221.1: fixed log -i --grep='int'           7.40(7.05+0.23)   8.66(8.38+0.20) +17.0%
    +        4221.2: basic log -i --grep='int'           7.39(7.09+0.23)   8.67(8.39+0.20) +17.3%
    +        4221.3: extended log -i --grep='int'        7.29(6.99+0.26)   8.69(8.31+0.26) +19.2%
    +        4221.4: perl log -i --grep='int'            7.42(7.16+0.21)   7.14(6.80+0.24) -3.8%
    +        4221.6: fixed log -i --grep='uncommon'      6.94(6.58+0.35)   8.43(8.04+0.30) +21.5%
    +        4221.7: basic log -i --grep='uncommon'      6.95(6.62+0.31)   8.34(7.93+0.32) +20.0%
    +        4221.8: extended log -i --grep='uncommon'   7.06(6.75+0.25)   8.32(7.98+0.31) +17.8%
    +        4221.9: perl log -i --grep='uncommon'       6.96(6.69+0.26)   7.04(6.64+0.32) +1.1%
    +        4221.11: fixed log -i --grep='æ'            7.92(7.55+0.33)   7.86(7.44+0.34) -0.8%
    +        4221.12: basic log -i --grep='æ'            7.88(7.49+0.32)   7.84(7.46+0.34) -0.5%
    +        4221.13: extended log -i --grep='æ'         7.91(7.51+0.32)   7.87(7.48+0.32) -0.5%
    +        4221.14: perl log -i --grep='æ'             7.01(6.59+0.35)   6.99(6.64+0.28) -0.3%
    +
    +    Some of those, as noted in [1] are because PCRE is faster at finding
    +    fixed strings. This looks bad for some engines, but in the next change
    +    we'll optimistically use PCRE v2 for all of these, so it'll look
    +    better.
    +
    +    1. https://public-inbox.org/git/87v9x793qi.fsf@evledraar.gmail.com/
     
         Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
 7:  5bc25c03b8 !  9:  c0fd75d102 grep: use PCRE v2 for optimized fixed-string search
    @@ -7,19 +7,95 @@
         slower than PCRE v1 and v2 JIT with the kwset backend, so that
         optimization was counterproductive.
     
    -    This brings back the optimization for "-F", without changing the
    -    semantics of "\0" in patterns. As seen in previous commits in this
    -    series we could support it now, but I'd rather just leave that
    -    edge-case aside so the tests don't need to do one thing or the other
    -    depending on what --fixed-strings backend we're using.
    -
    -    I could also support the v1 backend here, but that would make the code
    -    more complex, and I'd rather aim for simplicity here and in future
    +    This brings back the optimization for "--fixed-strings", without
    +    changing the semantics of having a NUL-byte in patterns. As seen in
    +    previous commits in this series we could support it now, but I'd
    +    rather just leave that edge-case aside so we don't have one behavior
    +    or the other depending what "--fixed-strings" backend we're using. It
    +    makes the behavior harder to understand and document, and makes tests
    +    for the different backends more painful.
    +
    +    I could also support the PCRE v1 backend here, but that would make the
    +    code more complex. I'd rather aim for simplicity here and in future
         changes to the diffcore. We're not going to have someone who
         absolutely must have faster search, but for whom building PCRE v2
         isn't acceptable.
     
    -    1. https://public-inbox.org/git/87v9x793qi.fsf@evledraar.gmail.com/
    +    The difference between this series of commits and the current "master"
    +    is, using the same t/perf commands shown in the last commit:
    +
    +    plain grep:
    +
    +        Test                             origin/master     HEAD
    +        -------------------------------------------------------------------------
    +        7821.1: fixed grep int           0.55(1.67+0.56)   0.41(0.98+0.60) -25.5%
    +        7821.2: basic grep int           0.58(1.65+0.52)   0.41(0.96+0.57) -29.3%
    +        7821.3: extended grep int        0.57(1.66+0.49)   0.42(0.93+0.60) -26.3%
    +        7821.4: perl grep int            0.54(1.67+0.50)   0.43(0.88+0.65) -20.4%
    +        7821.6: fixed grep uncommon      0.21(0.52+0.42)   0.16(0.24+0.51) -23.8%
    +        7821.7: basic grep uncommon      0.20(0.49+0.45)   0.17(0.28+0.47) -15.0%
    +        7821.8: extended grep uncommon   0.20(0.54+0.39)   0.16(0.25+0.50) -20.0%
    +        7821.9: perl grep uncommon       0.20(0.58+0.36)   0.16(0.23+0.50) -20.0%
    +        7821.11: fixed grep æ            0.35(1.24+0.43)   0.16(0.23+0.50) -54.3%
    +        7821.12: basic grep æ            0.36(1.29+0.38)   0.16(0.20+0.54) -55.6%
    +        7821.13: extended grep æ         0.35(1.23+0.44)   0.16(0.24+0.50) -54.3%
    +        7821.14: perl grep æ             0.35(1.33+0.34)   0.16(0.28+0.46) -54.3%
    +
    +    grep with -i:
    +
    +        Test                                origin/master     HEAD
    +        ----------------------------------------------------------------------------
    +        7821.1: fixed grep -i int           0.62(1.81+0.70)   0.47(1.11+0.64) -24.2%
    +        7821.2: basic grep -i int           0.67(1.90+0.53)   0.46(1.07+0.62) -31.3%
    +        7821.3: extended grep -i int        0.62(1.92+0.53)   0.53(1.12+0.58) -14.5%
    +        7821.4: perl grep -i int            0.66(1.85+0.58)   0.45(1.10+0.59) -31.8%
    +        7821.6: fixed grep -i uncommon      0.21(0.54+0.43)   0.17(0.20+0.55) -19.0%
    +        7821.7: basic grep -i uncommon      0.20(0.52+0.45)   0.17(0.29+0.48) -15.0%
    +        7821.8: extended grep -i uncommon   0.21(0.52+0.44)   0.17(0.26+0.50) -19.0%
    +        7821.9: perl grep -i uncommon       0.21(0.53+0.44)   0.17(0.20+0.56) -19.0%
    +        7821.11: fixed grep -i æ            0.26(0.79+0.44)   0.16(0.29+0.46) -38.5%
    +        7821.12: basic grep -i æ            0.26(0.79+0.42)   0.16(0.20+0.54) -38.5%
    +        7821.13: extended grep -i æ         0.26(0.84+0.39)   0.16(0.24+0.50) -38.5%
    +        7821.14: perl grep -i æ             0.16(0.24+0.49)   0.17(0.25+0.51) +6.3%
    +
    +    plain log:
    +
    +        Test                                     origin/master     HEAD
    +        --------------------------------------------------------------------------------
    +        4221.1: fixed log --grep='int'           7.24(6.95+0.28)   7.20(6.95+0.18) -0.6%
    +        4221.2: basic log --grep='int'           7.31(6.97+0.22)   7.20(6.93+0.21) -1.5%
    +        4221.3: extended log --grep='int'        7.37(7.04+0.24)   7.22(6.91+0.25) -2.0%
    +        4221.4: perl log --grep='int'            7.31(7.04+0.21)   7.19(6.89+0.21) -1.6%
    +        4221.6: fixed log --grep='uncommon'      6.93(6.59+0.32)   7.04(6.66+0.37) +1.6%
    +        4221.7: basic log --grep='uncommon'      6.92(6.58+0.29)   7.08(6.75+0.29) +2.3%
    +        4221.8: extended log --grep='uncommon'   6.92(6.55+0.31)   7.00(6.68+0.31) +1.2%
    +        4221.9: perl log --grep='uncommon'       7.03(6.59+0.33)   7.12(6.73+0.34) +1.3%
    +        4221.11: fixed log --grep='æ'            7.41(7.08+0.28)   7.05(6.76+0.29) -4.9%
    +        4221.12: basic log --grep='æ'            7.39(6.99+0.33)   7.00(6.68+0.25) -5.3%
    +        4221.13: extended log --grep='æ'         7.34(7.00+0.25)   7.15(6.81+0.31) -2.6%
    +        4221.14: perl log --grep='æ'             7.43(7.13+0.26)   7.01(6.60+0.36) -5.7%
    +
    +    log with -i:
    +
    +        Test                                        origin/master     HEAD
    +        ------------------------------------------------------------------------------------
    +        4221.1: fixed log -i --grep='int'           7.31(7.07+0.24)   7.23(7.00+0.22) -1.1%
    +        4221.2: basic log -i --grep='int'           7.40(7.08+0.28)   7.19(6.92+0.20) -2.8%
    +        4221.3: extended log -i --grep='int'        7.43(7.13+0.25)   7.27(6.99+0.21) -2.2%
    +        4221.4: perl log -i --grep='int'            7.34(7.10+0.24)   7.10(6.90+0.19) -3.3%
    +        4221.6: fixed log -i --grep='uncommon'      7.07(6.71+0.32)   7.11(6.77+0.28) +0.6%
    +        4221.7: basic log -i --grep='uncommon'      6.99(6.64+0.28)   7.12(6.69+0.38) +1.9%
    +        4221.8: extended log -i --grep='uncommon'   7.11(6.74+0.32)   7.10(6.77+0.27) -0.1%
    +        4221.9: perl log -i --grep='uncommon'       6.98(6.60+0.29)   7.05(6.64+0.34) +1.0%
    +        4221.11: fixed log -i --grep='æ'            7.85(7.45+0.34)   7.03(6.68+0.32) -10.4%
    +        4221.12: basic log -i --grep='æ'            7.87(7.49+0.29)   7.06(6.69+0.31) -10.3%
    +        4221.13: extended log -i --grep='æ'         7.87(7.54+0.31)   7.09(6.69+0.31) -9.9%
    +        4221.14: perl log -i --grep='æ'             7.06(6.77+0.28)   6.91(6.57+0.31) -2.1%
    +
    +    So as with e05b027627 ("grep: use PCRE v2 for optimized fixed-string
    +    search", 2019-06-26) there's a huge improvement in performance for
    +    "grep", but in "log" most of our time is spent elsewhere, so we don't
    +    notice it that much.
     
         Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
    @@ -81,15 +157,19 @@
     +		} else {
     +			/*
     +			 * E.g. t7811-grep-open.sh relies on the
    -+			 * pattern being restored, and unfortunately
    -+			 * there's no PCRE compile flag for "this is
    -+			 * fixed", so we need to munge it to
    -+			 * "\Q<pat>\E".
    ++			 * pattern being restored.
     +			 */
     +			char *old_pattern = p->pattern;
     +			size_t old_patternlen = p->patternlen;
     +			struct strbuf sb = STRBUF_INIT;
     +
    ++			/*
    ++			 * There is the PCRE2_LITERAL flag, but it's
    ++			 * only in PCRE v2 10.30 and later. Needing to
    ++			 * ifdef our way around that and dealing with
    ++			 * it + PCRE2_MULTILINE being an error is more
    ++			 * complex than just quoting this ourselves.
    ++			*/
     +			strbuf_add(&sb, "\\Q", 2);
     +			strbuf_add(&sb, p->pattern, p->patternlen);
     +			strbuf_add(&sb, "\\E", 2);
    @@ -101,9 +181,9 @@
     +			p->patternlen = old_patternlen;
     +			strbuf_release(&sb);
     +		}
    -+#else
    ++#else /* !USE_LIBPCRE2 */
      		compile_fixed_regexp(p, opt);
    -+#endif
    ++#endif /* !USE_LIBPCRE2 */
      		return;
      	}
      
-- 
2.22.0.455.g172b71a6c5
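
The range-diff above explains that the fixed pattern is wrapped as "\Q<pat>\E" rather than compiled with PCRE2_LITERAL. A minimal sketch of that wrapping in plain C follows. The helper name quote_literal() and its malloc-based buffer are assumptions made for illustration only; the patch itself uses git's strbuf API, as shown in the diff.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of git): return a freshly allocated,
 * NUL-terminated copy of "\Q" + pat + "\E" and store its length in
 * *out_len. Returns NULL if allocation fails. The caller frees the result.
 */
char *quote_literal(const char *pat, size_t pat_len, size_t *out_len)
{
	size_t len = pat_len + 4;	/* two bytes each for "\Q" and "\E" */
	char *buf = malloc(len + 1);

	if (!buf)
		return NULL;
	memcpy(buf, "\\Q", 2);
	memcpy(buf + 2, pat, pat_len);
	memcpy(buf + 2 + pat_len, "\\E", 2);
	buf[len] = '\0';
	*out_len = len;
	return buf;
}
```

Note that this sketch does nothing special for a pattern that itself contains "\E", which would end the quoted region early in PCRE syntax.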


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v2 1/9] log tests: test regex backends in "--encode=<enc>" tests
  2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
  2019-06-26 14:02           ` Johannes Schindelin
  2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
@ 2019-06-27 23:39           ` Ævar Arnfjörð Bjarmason
  2019-06-27 23:39           ` [PATCH v2 2/9] grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>" Ævar Arnfjörð Bjarmason
                             ` (7 subsequent siblings)
  10 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-27 23:39 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Improve the tests added in 04deccda11 ("log: re-encode commit messages
before grepping", 2013-02-11) to test the regex backends. Those tests
never worked as advertised, due to the is_fixed() optimization in
grep.c (which was in place at the time), and the needle in the tests
being a fixed string.

We'd thus always use the "fixed" backend during the tests, which would
use the kwset() backend. This backend liberally accepts any garbage
input, so invalid encodings would be silently accepted.

A follow-up commit will fix this bug; the test added here just
demonstrates the existing issue.

In practice this issue happened on Windows, see [1], but due to the
structure of the existing tests & how liberal the kwset code is about
garbage we missed this.

Cover this blind spot by testing all our regex engines. The PCRE
backend will spot these invalid encodings. It's possible that this
test breaks the "basic" and "extended" backends on some systems whose
POSIX functions are stricter than glibc's about encoding and locale
issues, though I can't remember any that are; PCRE is simply more
careful about the validation.

1. https://public-inbox.org/git/nycvar.QRO.7.76.6.1906271113090.44@tvgsbejvaqbjf.bet/
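
To make the three needles used by the test concrete, here is a rough sketch of a structural UTF-8 check in C. It is only an illustration of why "\303\251" is accepted while "\351" and "\303\50)" are not. The function name looks_like_utf8() is an assumption, and the check is deliberately simplified (no overlong or surrogate detection); it is not what PCRE or git actually runs.

```c
#include <stddef.h>

/*
 * Simplified, hypothetical validator: returns 1 if every lead byte in
 * buf[0..len) is followed by the right number of continuation bytes,
 * 0 otherwise. Overlong forms and surrogates are NOT rejected.
 */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = buf[i];
		size_t need;	/* continuation bytes expected after c */
		size_t k;

		if (c < 0x80)
			need = 0;
		else if ((c & 0xE0) == 0xC0)
			need = 1;
		else if ((c & 0xF0) == 0xE0)
			need = 2;
		else if ((c & 0xF8) == 0xF0)
			need = 3;
		else
			return 0;	/* stray continuation or invalid lead byte */

		if (need > len - i - 1)
			return 0;	/* sequence truncated at end of buffer */
		for (k = 1; k <= need; k++)
			if ((buf[i + k] & 0xC0) != 0x80)
				return 0;
		i += need + 1;
	}
	return 1;
}
```

With this check, the bytes 0xC3 0xA9 (utf8_e) pass, the lone byte 0xE9 (latin1_e) fails because its continuation bytes are missing, and 0xC3 0x28 0x29 (invalid_e) fails because 0x28 is not a continuation byte.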

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/t4210-log-i18n.sh | 41 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 7c519436ef..86d22c1d4c 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -1,12 +1,15 @@
 #!/bin/sh
 
 test_description='test log with i18n features'
-. ./test-lib.sh
+. ./lib-gettext.sh
 
 # two forms of é
 utf8_e=$(printf '\303\251')
 latin1_e=$(printf '\351')
 
+# invalid UTF-8
+invalid_e=$(printf '\303\50)') # ")" at end to close opening "("
+
 test_expect_success 'create commits in different encodings' '
 	test_tick &&
 	cat >msg <<-EOF &&
@@ -53,4 +56,40 @@ test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
 	test_must_be_empty actual
 '
 
+for engine in fixed basic extended perl
+do
+	prereq=
+	result=success
+	if test $engine = "perl"
+	then
+		result=failure
+		prereq="PCRE"
+	else
+		prereq=""
+	fi
+	force_regex=
+	if test $engine != "fixed"
+	then
+	    force_regex=.*
+	fi
+	test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
+		cat >expect <<-\EOF &&
+		latin1
+		utf8
+		EOF
+		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$latin1_e\" >actual &&
+		test_cmp expect actual
+	"
+
+	test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
+		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$utf8_e\" >actual &&
+		test_must_be_empty actual
+	"
+
+	test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" "
+		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual &&
+		test_must_be_empty actual
+	"
+done
+
 test_done
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v2 2/9] grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>"
  2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
                             ` (2 preceding siblings ...)
  2019-06-27 23:39           ` [PATCH v2 1/9] log tests: test regex backends in "--encode=<enc>" tests Ævar Arnfjörð Bjarmason
@ 2019-06-27 23:39           ` Ævar Arnfjörð Bjarmason
  2019-06-27 23:39           ` [PATCH v2 3/9] grep: inline the return value of a function call used only once Ævar Arnfjörð Bjarmason
                             ` (6 subsequent siblings)
  10 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-27 23:39 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Fix a bug introduced in 18547aacf5 ("grep/pcre: support utf-8",
2016-06-25) that was missed due to a blind spot in our tests, as
discussed in the previous commit. I then blindly copied the same bug
in 94da9193a6 ("grep: add support for PCRE v2", 2017-06-01) when
adding the PCRE v2 code.

We should not tell PCRE that we're processing UTF-8 just because we're
dealing with non-ASCII. In the case of e.g. "log --encoding=<...>"
under is_utf8_locale() the haystack might be in ISO-8859-1, and the
needle might be in a non-UTF-8 encoding.

Maybe we should be more strict here and die earlier? Should we also be
converting the needle to the encoding in question, and failing if it's
not a string that's valid in that encoding? Maybe.

But for now matching this as non-UTF8 at least has some hope of
producing sensible results, since we know that our default heuristic
of assuming the text to be matched is in the user locale encoding
isn't true when we've explicitly encoded it to be in a different
encoding.
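
The resulting condition can be sketched as a small standalone predicate. This is an illustration only: want_utf_mode() and has_non_ascii_sketch() are hypothetical names, and the locale check is passed in as a plain flag instead of calling git's is_utf8_locale().

```c
#include <stddef.h>

/* Hypothetical stand-in for git's has_non_ascii(): any byte >= 0x80? */
int has_non_ascii_sketch(const char *s)
{
	for (; *s; s++)
		if ((unsigned char)*s >= 0x80)
			return 1;
	return 0;
}

/*
 * Hypothetical predicate mirroring the condition in the diff below:
 * ask the regex library for UTF-8 mode only if the locale was not
 * overridden (e.g. by "log --encoding=<non-utf8>"), the locale is
 * UTF-8, and the pattern contains non-ASCII bytes.
 */
int want_utf_mode(int ignore_locale, int locale_is_utf8, const char *pattern)
{
	return !ignore_locale && locale_is_utf8 &&
	       has_non_ascii_sketch(pattern);
}
```

Under these assumptions a non-ASCII needle in a UTF-8 locale still selects UTF-8 mode by default, but no longer does so once ignore_locale is set.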

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c              | 8 ++++----
 grep.h              | 1 +
 revision.c          | 3 +++
 t/t4210-log-i18n.sh | 6 ++----
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/grep.c b/grep.c
index f7c3a5803e..1de4ab49c0 100644
--- a/grep.c
+++ b/grep.c
@@ -388,11 +388,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 	int options = PCRE_MULTILINE;
 
 	if (opt->ignore_case) {
-		if (has_non_ascii(p->pattern))
+		if (!opt->ignore_locale && has_non_ascii(p->pattern))
 			p->pcre1_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
 	}
-	if (is_utf8_locale() && has_non_ascii(p->pattern))
+	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern))
 		options |= PCRE_UTF8;
 
 	p->pcre1_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
@@ -498,14 +498,14 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	p->pcre2_compile_context = NULL;
 
 	if (opt->ignore_case) {
-		if (has_non_ascii(p->pattern)) {
+		if (!opt->ignore_locale && has_non_ascii(p->pattern)) {
 			character_tables = pcre2_maketables(NULL);
 			p->pcre2_compile_context = pcre2_compile_context_create(NULL);
 			pcre2_set_character_tables(p->pcre2_compile_context, character_tables);
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (is_utf8_locale() && has_non_ascii(p->pattern))
+	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern))
 		options |= PCRE2_UTF;
 
 	p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
diff --git a/grep.h b/grep.h
index 1875880f37..4bb8a79d93 100644
--- a/grep.h
+++ b/grep.h
@@ -173,6 +173,7 @@ struct grep_opt {
 	int funcbody;
 	int extended_regexp_option;
 	int pattern_type_option;
+	int ignore_locale;
 	char colors[NR_GREP_COLORS][COLOR_MAXLEN];
 	unsigned pre_context;
 	unsigned post_context;
diff --git a/revision.c b/revision.c
index 621feb9df7..a842fb158a 100644
--- a/revision.c
+++ b/revision.c
@@ -28,6 +28,7 @@
 #include "commit-graph.h"
 #include "prio-queue.h"
 #include "hashmap.h"
+#include "utf8.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2655,6 +2656,8 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 
 	grep_commit_pattern_type(GREP_PATTERN_TYPE_UNSPECIFIED,
 				 &revs->grep_filter);
+	if (!is_encoding_utf8(get_log_output_encoding()))
+		revs->grep_filter.ignore_locale = 1;
 	compile_grep_patterns(&revs->grep_filter);
 
 	if (revs->reverse && revs->reflog_info)
diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 86d22c1d4c..515bcb7ce1 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -59,10 +59,8 @@ test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
 for engine in fixed basic extended perl
 do
 	prereq=
-	result=success
 	if test $engine = "perl"
 	then
-		result=failure
 		prereq="PCRE"
 	else
 		prereq=""
@@ -72,7 +70,7 @@ do
 	then
 	    force_regex=.*
 	fi
-	test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
+	test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
 		cat >expect <<-\EOF &&
 		latin1
 		utf8
@@ -86,7 +84,7 @@ do
 		test_must_be_empty actual
 	"
 
-	test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" "
+	test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" "
 		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual &&
 		test_must_be_empty actual
 	"
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v2 3/9] grep: inline the return value of a function call used only once
  2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
                             ` (3 preceding siblings ...)
  2019-06-27 23:39           ` [PATCH v2 2/9] grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>" Ævar Arnfjörð Bjarmason
@ 2019-06-27 23:39           ` Ævar Arnfjörð Bjarmason
  2019-06-27 23:39           ` [PATCH v2 4/9] grep tests: move "grep binary" alongside the rest Ævar Arnfjörð Bjarmason
                             ` (5 subsequent siblings)
  10 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-27 23:39 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Since e944d9d932 ("grep: rewrite an if/else condition to avoid
duplicate expression", 2016-06-25) the "ascii_only" variable has only
been used once in compile_regexp(), so just inline it there.

This makes the code easier to read, and might make it marginally
faster depending on compiler optimizations.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/grep.c b/grep.c
index 1de4ab49c0..4e8d0645a8 100644
--- a/grep.c
+++ b/grep.c
@@ -650,13 +650,11 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
-	int ascii_only;
 	int err;
 	int regflags = REG_NEWLINE;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
-	ascii_only     = !has_non_ascii(p->pattern);
 
 	/*
 	 * Even when -F (fixed) asks us to do a non-regexp search, we
@@ -673,7 +671,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (opt->fixed ||
 	    has_null(p->pattern, p->patternlen) ||
 	    is_fixed(p->pattern, p->patternlen))
-		p->fixed = !p->ignore_case || ascii_only;
+		p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
 
 	if (p->fixed) {
 		p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL);
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v2 4/9] grep tests: move "grep binary" alongside the rest
  2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
                             ` (4 preceding siblings ...)
  2019-06-27 23:39           ` [PATCH v2 3/9] grep: inline the return value of a function call used only once Ævar Arnfjörð Bjarmason
@ 2019-06-27 23:39           ` Ævar Arnfjörð Bjarmason
  2019-06-27 23:39           ` [PATCH v2 5/9] grep tests: move binary pattern tests into their own file Ævar Arnfjörð Bjarmason
                             ` (4 subsequent siblings)
  10 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-27 23:39 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Move the "grep binary" test case added in aca20dd558 ("grep: add test
script for binary file handling", 2010-05-22) so that it lives
alongside the rest of the "grep" tests in t781*. This would have left
a gap in the t/700* namespace, so move a "filter-branch" test down,
leaving the "t7010-setup.sh" test as the next one after that.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 ...ilter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} | 0
 t/{t7008-grep-binary.sh => t7815-grep-binary.sh}                  | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (100%)

diff --git a/t/t7009-filter-branch-null-sha1.sh b/t/t7008-filter-branch-null-sha1.sh
similarity index 100%
rename from t/t7009-filter-branch-null-sha1.sh
rename to t/t7008-filter-branch-null-sha1.sh
diff --git a/t/t7008-grep-binary.sh b/t/t7815-grep-binary.sh
similarity index 100%
rename from t/t7008-grep-binary.sh
rename to t/t7815-grep-binary.sh
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v2 5/9] grep tests: move binary pattern tests into their own file
  2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
                             ` (5 preceding siblings ...)
  2019-06-27 23:39           ` [PATCH v2 4/9] grep tests: move "grep binary" alongside the rest Ævar Arnfjörð Bjarmason
@ 2019-06-27 23:39           ` Ævar Arnfjörð Bjarmason
  2019-06-27 23:39           ` [PATCH v2 6/9] grep: make the behavior for NUL-byte in patterns sane Ævar Arnfjörð Bjarmason
                             ` (3 subsequent siblings)
  10 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-27 23:39 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Move the tests for "-f <file>" where "<file>" contains a NUL byte
pattern into their own file. I added most of these tests in
966be95549 ("grep: add tests to fix blind spots with \0 patterns",
2017-05-20).

Whether a regex engine supports matching binary content is very
different from whether it matches binary patterns. Since
2f8952250a ("regex: add regexec_buf() that can work on a non
NUL-terminated string", 2016-09-21) we've required REG_STARTEND of our
regex engines so we can match binary content, but only the PCRE v2
engine can sensibly match binary patterns.

Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting
patterns containing NUL-byte and considering them fixed, except in
cases where "--ignore-case" is provided and they're non-ASCII, see
5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings",
2016-06-25). Subsequent commits will change this behavior.
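
The difference between binary content and binary patterns can be shown with a rough C sketch. The helper names find_bytes() and find_cstring() are made up for this illustration and are not git functions. The first takes an explicit pattern length, the second treats the pattern as a C string and therefore stops at its first NUL.

```c
#include <stddef.h>
#include <string.h>

/* Length-aware search: returns 1 if pat[0..pat_len) occurs in hay. */
int find_bytes(const char *hay, size_t hay_len,
	       const char *pat, size_t pat_len)
{
	size_t i;

	if (pat_len > hay_len)
		return 0;
	for (i = 0; i + pat_len <= hay_len; i++)
		if (!memcmp(hay + i, pat, pat_len))
			return 1;
	return 0;
}

/*
 * C-string search: the haystack may contain NULs, but strlen() cuts
 * the pattern off at its first NUL byte.
 */
int find_cstring(const char *hay, size_t hay_len, const char *pat)
{
	return find_bytes(hay, hay_len, pat, strlen(pat));
}
```

Against the haystack "binary<NUL>file", find_bytes() finds "y<NUL>f" but not "y<NUL>x", whereas find_cstring() reports a match for both because it only ever sees "y".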

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/t7815-grep-binary.sh         | 101 -----------------------------
 t/t7816-grep-binary-pattern.sh | 114 +++++++++++++++++++++++++++++++++
 2 files changed, 114 insertions(+), 101 deletions(-)
 create mode 100755 t/t7816-grep-binary-pattern.sh

diff --git a/t/t7815-grep-binary.sh b/t/t7815-grep-binary.sh
index 2d87c49b75..90ebb64f46 100755
--- a/t/t7815-grep-binary.sh
+++ b/t/t7815-grep-binary.sh
@@ -4,41 +4,6 @@ test_description='git grep in binary files'
 
 . ./test-lib.sh
 
-nul_match () {
-	matches=$1
-	flags=$2
-	pattern=$3
-	pattern_human=$(echo "$pattern" | sed 's/Q/<NUL>/g')
-
-	if test "$matches" = 1
-	then
-		test_expect_success "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			git grep -f f $flags a
-		"
-	elif test "$matches" = 0
-	then
-		test_expect_success "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			test_must_fail git grep -f f $flags a
-		"
-	elif test "$matches" = T1
-	then
-		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			git grep -f f $flags a
-		"
-	elif test "$matches" = T0
-	then
-		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			test_must_fail git grep -f f $flags a
-		"
-	else
-		test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false'
-	fi
-}
-
 test_expect_success 'setup' "
 	echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a &&
 	git add a &&
@@ -102,72 +67,6 @@ test_expect_failure 'git grep .fi a' '
 	git grep .fi a
 '
 
-nul_match 1 '-F' 'yQf'
-nul_match 0 '-F' 'yQx'
-nul_match 1 '-Fi' 'YQf'
-nul_match 0 '-Fi' 'YQx'
-nul_match 1 '' 'yQf'
-nul_match 0 '' 'yQx'
-nul_match 1 '' 'æQð'
-nul_match 1 '-F' 'eQm[*]c'
-nul_match 1 '-Fi' 'EQM[*]C'
-
-# Regex patterns that would match but shouldn't with -F
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-F' '[y]Qf'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '-Fi' '[Y]QF'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-F' '[æ]Qð'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '-Fi' '[Æ]QÐ'
-
-# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0
-# patterns case-insensitively.
-nul_match T1 '-i' 'ÆQÐ'
-
-# \0 implicitly disables regexes. This is an undocumented internal
-# limitation.
-nul_match T1 '' 'yQ[f]'
-nul_match T1 '' '[y]Qf'
-nul_match T1 '-i' 'YQ[F]'
-nul_match T1 '-i' '[Y]Qf'
-nul_match T1 '' 'æQ[ð]'
-nul_match T1 '' '[æ]Qð'
-nul_match T1 '-i' 'ÆQ[Ð]'
-
-# ... because of \0 implicitly disabling regexes regexes that
-# should/shouldn't match don't do the right thing.
-nul_match T1 '' 'eQm.*cQ'
-nul_match T1 '-i' 'EQM.*cQ'
-nul_match T0 '' 'eQm[*]c'
-nul_match T0 '-i' 'EQM[*]C'
-
-# Due to the REG_STARTEND extension when kwset() is disabled on -i &
-# non-ASCII the string will be matched in its entirety, but the
-# pattern will be cut off at the first \0.
-nul_match 0 '-i' 'NOMATCHQð'
-nul_match T0 '-i' '[Æ]QNOMATCH'
-nul_match T0 '-i' '[æ]QNOMATCH'
-# Matches, but for the wrong reasons, just stops at [æ]
-nul_match 1 '-i' '[Æ]Qð'
-nul_match 1 '-i' '[æ]Qð'
-
-# Ensure that the matcher doesn't regress to something that stops at
-# \0
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '' 'yQNOMATCH'
-nul_match 0 '' 'QNOMATCH'
-nul_match 0 '-i' 'YQNOMATCH'
-nul_match 0 '-i' 'QNOMATCH'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '' 'yQNÓMATCH'
-nul_match 0 '' 'QNÓMATCH'
-nul_match 0 '-i' 'YQNÓMATCH'
-nul_match 0 '-i' 'QNÓMATCH'
-
 test_expect_success 'grep respects binary diff attribute' '
 	echo text >t &&
 	git add t &&
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
new file mode 100755
index 0000000000..4060dbd679
--- /dev/null
+++ b/t/t7816-grep-binary-pattern.sh
@@ -0,0 +1,114 @@
+#!/bin/sh
+
+test_description='git grep with a binary pattern files'
+
+. ./test-lib.sh
+
+nul_match () {
+	matches=$1
+	flags=$2
+	pattern=$3
+	pattern_human=$(echo "$pattern" | sed 's/Q/<NUL>/g')
+
+	if test "$matches" = 1
+	then
+		test_expect_success "git grep -f f $flags '$pattern_human' a" "
+			printf '$pattern' | q_to_nul >f &&
+			git grep -f f $flags a
+		"
+	elif test "$matches" = 0
+	then
+		test_expect_success "git grep -f f $flags '$pattern_human' a" "
+			printf '$pattern' | q_to_nul >f &&
+			test_must_fail git grep -f f $flags a
+		"
+	elif test "$matches" = T1
+	then
+		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
+			printf '$pattern' | q_to_nul >f &&
+			git grep -f f $flags a
+		"
+	elif test "$matches" = T0
+	then
+		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
+			printf '$pattern' | q_to_nul >f &&
+			test_must_fail git grep -f f $flags a
+		"
+	else
+		test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false'
+	fi
+}
+
+test_expect_success 'setup' "
+	echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a &&
+	git add a &&
+	git commit -m.
+"
+
+nul_match 1 '-F' 'yQf'
+nul_match 0 '-F' 'yQx'
+nul_match 1 '-Fi' 'YQf'
+nul_match 0 '-Fi' 'YQx'
+nul_match 1 '' 'yQf'
+nul_match 0 '' 'yQx'
+nul_match 1 '' 'æQð'
+nul_match 1 '-F' 'eQm[*]c'
+nul_match 1 '-Fi' 'EQM[*]C'
+
+# Regex patterns that would match but shouldn't with -F
+nul_match 0 '-F' 'yQ[f]'
+nul_match 0 '-F' '[y]Qf'
+nul_match 0 '-Fi' 'YQ[F]'
+nul_match 0 '-Fi' '[Y]QF'
+nul_match 0 '-F' 'æQ[ð]'
+nul_match 0 '-F' '[æ]Qð'
+nul_match 0 '-Fi' 'ÆQ[Ð]'
+nul_match 0 '-Fi' '[Æ]QÐ'
+
+# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0
+# patterns case-insensitively.
+nul_match T1 '-i' 'ÆQÐ'
+
+# \0 implicitly disables regexes. This is an undocumented internal
+# limitation.
+nul_match T1 '' 'yQ[f]'
+nul_match T1 '' '[y]Qf'
+nul_match T1 '-i' 'YQ[F]'
+nul_match T1 '-i' '[Y]Qf'
+nul_match T1 '' 'æQ[ð]'
+nul_match T1 '' '[æ]Qð'
+nul_match T1 '-i' 'ÆQ[Ð]'
+
+# ... because of \0 implicitly disabling regexes regexes that
+# should/shouldn't match don't do the right thing.
+nul_match T1 '' 'eQm.*cQ'
+nul_match T1 '-i' 'EQM.*cQ'
+nul_match T0 '' 'eQm[*]c'
+nul_match T0 '-i' 'EQM[*]C'
+
+# Due to the REG_STARTEND extension when kwset() is disabled on -i &
+# non-ASCII the string will be matched in its entirety, but the
+# pattern will be cut off at the first \0.
+nul_match 0 '-i' 'NOMATCHQð'
+nul_match T0 '-i' '[Æ]QNOMATCH'
+nul_match T0 '-i' '[æ]QNOMATCH'
+# Matches, but for the wrong reasons, just stops at [æ]
+nul_match 1 '-i' '[Æ]Qð'
+nul_match 1 '-i' '[æ]Qð'
+
+# Ensure that the matcher doesn't regress to something that stops at
+# \0
+nul_match 0 '-F' 'yQ[f]'
+nul_match 0 '-Fi' 'YQ[F]'
+nul_match 0 '' 'yQNOMATCH'
+nul_match 0 '' 'QNOMATCH'
+nul_match 0 '-i' 'YQNOMATCH'
+nul_match 0 '-i' 'QNOMATCH'
+nul_match 0 '-F' 'æQ[ð]'
+nul_match 0 '-Fi' 'ÆQ[Ð]'
+nul_match 0 '' 'yQNÓMATCH'
+nul_match 0 '' 'QNÓMATCH'
+nul_match 0 '-i' 'YQNÓMATCH'
+nul_match 0 '-i' 'QNÓMATCH'
+
+test_done
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v2 6/9] grep: make the behavior for NUL-byte in patterns sane
  2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
                             ` (6 preceding siblings ...)
  2019-06-27 23:39           ` [PATCH v2 5/9] grep tests: move binary pattern tests into their own file Ævar Arnfjörð Bjarmason
@ 2019-06-27 23:39           ` Ævar Arnfjörð Bjarmason
  2019-06-27 23:39           ` [PATCH v2 7/9] grep: drop support for \0 in --fixed-strings <pattern> Ævar Arnfjörð Bjarmason
                             ` (2 subsequent siblings)
  10 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-27 23:39 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

The behavior of "grep" when patterns contained a NUL-byte has always
been haphazard, and has served the vagaries of the implementation more
than anything else. A pattern containing a NUL-byte can only be
provided via "-f <file>". Since pickaxe (log search) has no such flag
the NUL-byte in patterns has only ever been supported by "grep" (and
not "log --grep").

Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) patterns containing
"\0" were considered fixed. In 966be95549 ("grep: add tests to fix
blind spots with \0 patterns", 2017-05-20) I added tests for this
behavior.

Change the behavior to do the obvious thing, i.e. don't silently
discard a regex pattern and make it implicitly fixed just because it
contains a NUL-byte. Instead die if the backend in question can't
handle them, e.g. --basic-regexp is combined with such a pattern.

This is desired because from a user's point of view it's the obvious
thing to do. Whether we support BRE/ERE/Perl syntax is different from
whether our implementation is limited by C-strings. These patterns are
obscure enough that I think this behavior change is OK, especially
since we never documented the old behavior.

Doing this also makes it easier to replace the kwset backend with
something else, since we'll no longer strictly need it for anything we
can't easily use another fixed-string backend for.
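
As a simplified sketch of the new control flow, assuming the "fixed" decision has already been made as in the diff below: the enum and the function choose_backend() are hypothetical and only model the order of the checks; they are not the actual compile_regexp() code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome labels, for illustration only. */
enum backend_choice { CHOOSE_KWSET, CHOOSE_REGEX, CHOOSE_DIE };

/*
 * "fixed" stands for the already-computed p->fixed, "uses_pcre2" for
 * opt->pcre2. Fixed patterns go to kwset first; only afterwards is a
 * NUL byte in the pattern treated as an error for non-PCRE v2 backends.
 */
enum backend_choice choose_backend(const char *pat, size_t len,
				   int fixed, int uses_pcre2)
{
	if (fixed)
		return CHOOSE_KWSET;
	if (memchr(pat, 0, len) && !uses_pcre2)
		return CHOOSE_DIE;
	return CHOOSE_REGEX;
}
```

So a fixed pattern such as "y<NUL>f" is still handled by kwset, while a regex such as "y<NUL>[f]" is rejected unless PCRE v2 is in use, instead of being silently treated as a fixed string.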

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/git-grep.txt     |  17 ++++
 grep.c                         |  23 ++---
 t/t7816-grep-binary-pattern.sh | 159 ++++++++++++++++++---------------
 3 files changed, 110 insertions(+), 89 deletions(-)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 2d27969057..c89fb569e3 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -271,6 +271,23 @@ providing this option will cause it to die.
 
 -f <file>::
 	Read patterns from <file>, one per line.
++
+Passing the pattern via <file> allows for providing a search pattern
+containing a \0.
++
+Not all pattern types support patterns containing \0. Git will error
+out if a given pattern type can't support such a pattern. The
+`--perl-regexp` pattern type when compiled against the PCRE v2 backend
+has the widest support for these types of patterns.
++
+In versions of Git before 2.23.0 patterns containing \0 would be
+silently considered fixed. This was never documented, there were also
+odd and undocumented interactions between e.g. non-ASCII patterns
+containing \0 and `--ignore-case`.
++
+In future versions we may learn to support patterns containing \0 for
+more search backends, until then we'll die when the pattern type in
+question doesn't support them.
 
 -e::
 	The next parameter is the pattern. This option has to be
diff --git a/grep.c b/grep.c
index 4e8d0645a8..d6603bc950 100644
--- a/grep.c
+++ b/grep.c
@@ -368,18 +368,6 @@ static int is_fixed(const char *s, size_t len)
 	return 1;
 }
 
-static int has_null(const char *s, size_t len)
-{
-	/*
-	 * regcomp cannot accept patterns with NULs so when using it
-	 * we consider any pattern containing a NUL fixed.
-	 */
-	if (memchr(s, 0, len))
-		return 1;
-
-	return 0;
-}
-
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 {
@@ -668,9 +656,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	 * simple string match using kws.  p->fixed tells us if we
 	 * want to use kws.
 	 */
-	if (opt->fixed ||
-	    has_null(p->pattern, p->patternlen) ||
-	    is_fixed(p->pattern, p->patternlen))
+	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
 		p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
 
 	if (p->fixed) {
@@ -678,7 +664,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		kwsincr(p->kws, p->pattern, p->patternlen);
 		kwsprep(p->kws);
 		return;
-	} else if (opt->fixed) {
+	}
+
+	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
+		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
+
+	if (opt->fixed) {
 		/*
 		 * We come here when the pattern has the non-ascii
 		 * characters we cannot case-fold, and asked to
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
index 4060dbd679..9e09bd5d6a 100755
--- a/t/t7816-grep-binary-pattern.sh
+++ b/t/t7816-grep-binary-pattern.sh
@@ -2,113 +2,126 @@
 
 test_description='git grep with a binary pattern files'
 
-. ./test-lib.sh
+. ./lib-gettext.sh
 
-nul_match () {
+nul_match_internal () {
 	matches=$1
-	flags=$2
-	pattern=$3
+	prereqs=$2
+	lc_all=$3
+	extra_flags=$4
+	flags=$5
+	pattern=$6
 	pattern_human=$(echo "$pattern" | sed 's/Q/<NUL>/g')
 
 	if test "$matches" = 1
 	then
-		test_expect_success "git grep -f f $flags '$pattern_human' a" "
+		test_expect_success $prereqs "LC_ALL='$lc_all' git grep $extra_flags -f f $flags '$pattern_human' a" "
 			printf '$pattern' | q_to_nul >f &&
-			git grep -f f $flags a
+			LC_ALL='$lc_all' git grep $extra_flags -f f $flags a
 		"
 	elif test "$matches" = 0
 	then
-		test_expect_success "git grep -f f $flags '$pattern_human' a" "
+		test_expect_success $prereqs "LC_ALL='$lc_all' git grep $extra_flags -f f $flags '$pattern_human' a" "
+			>stderr &&
 			printf '$pattern' | q_to_nul >f &&
-			test_must_fail git grep -f f $flags a
+			test_must_fail env LC_ALL=\"$lc_all\" git grep $extra_flags -f f $flags a 2>stderr &&
+			test_i18ngrep ! 'This is only supported with -P under PCRE v2' stderr
 		"
-	elif test "$matches" = T1
+	elif test "$matches" = P
 	then
-		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
+		test_expect_success $prereqs "error, PCRE v2 only: LC_ALL='$lc_all' git grep -f f $flags '$pattern_human' a" "
+			>stderr &&
 			printf '$pattern' | q_to_nul >f &&
-			git grep -f f $flags a
-		"
-	elif test "$matches" = T0
-	then
-		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			test_must_fail git grep -f f $flags a
+			test_must_fail env LC_ALL=\"$lc_all\" git grep -f f $flags a 2>stderr &&
+			test_i18ngrep 'This is only supported with -P under PCRE v2' stderr
 		"
 	else
 		test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false'
 	fi
 }
 
+nul_match () {
+	matches=$1
+	matches_pcre2=$2
+	matches_pcre2_locale=$3
+	flags=$4
+	pattern=$5
+	pattern_human=$(echo "$pattern" | sed 's/Q/<NUL>/g')
+
+	nul_match_internal "$matches" "" "C" "" "$flags" "$pattern"
+	nul_match_internal "$matches_pcre2" "LIBPCRE2" "C" "-P" "$flags" "$pattern"
+	nul_match_internal "$matches_pcre2_locale" "LIBPCRE2,GETTEXT_LOCALE" "$is_IS_locale" "-P" "$flags" "$pattern"
+}
+
 test_expect_success 'setup' "
 	echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a &&
 	git add a &&
 	git commit -m.
 "
 
-nul_match 1 '-F' 'yQf'
-nul_match 0 '-F' 'yQx'
-nul_match 1 '-Fi' 'YQf'
-nul_match 0 '-Fi' 'YQx'
-nul_match 1 '' 'yQf'
-nul_match 0 '' 'yQx'
-nul_match 1 '' 'æQð'
-nul_match 1 '-F' 'eQm[*]c'
-nul_match 1 '-Fi' 'EQM[*]C'
+# Simple fixed-string matching that can use kwset (no -i && non-ASCII)
+nul_match 1 1 1 '-F' 'yQf'
+nul_match 0 0 0 '-F' 'yQx'
+nul_match 1 1 1 '-Fi' 'YQf'
+nul_match 0 0 0 '-Fi' 'YQx'
+nul_match 1 1 1 '' 'yQf'
+nul_match 0 0 0 '' 'yQx'
+nul_match 1 1 1 '' 'æQð'
+nul_match 1 1 1 '-F' 'eQm[*]c'
+nul_match 1 1 1 '-Fi' 'EQM[*]C'
 
 # Regex patterns that would match but shouldn't with -F
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-F' '[y]Qf'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '-Fi' '[Y]QF'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-F' '[æ]Qð'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '-Fi' '[Æ]QÐ'
+nul_match 0 0 0 '-F' 'yQ[f]'
+nul_match 0 0 0 '-F' '[y]Qf'
+nul_match 0 0 0 '-Fi' 'YQ[F]'
+nul_match 0 0 0 '-Fi' '[Y]QF'
+nul_match 0 0 0 '-F' 'æQ[ð]'
+nul_match 0 0 0 '-F' '[æ]Qð'
 
-# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0
-# patterns case-insensitively.
-nul_match T1 '-i' 'ÆQÐ'
+# The -F kwset codepath can't handle -i && non-ASCII...
+nul_match P 1 1 '-i' '[æ]Qð'
 
-# \0 implicitly disables regexes. This is an undocumented internal
-# limitation.
-nul_match T1 '' 'yQ[f]'
-nul_match T1 '' '[y]Qf'
-nul_match T1 '-i' 'YQ[F]'
-nul_match T1 '-i' '[Y]Qf'
-nul_match T1 '' 'æQ[ð]'
-nul_match T1 '' '[æ]Qð'
-nul_match T1 '-i' 'ÆQ[Ð]'
+# ...PCRE v2 only matches non-ASCII with -i casefolding under UTF-8
+# semantics
+nul_match P P P '-Fi' 'ÆQ[Ð]'
+nul_match P 0 1 '-i'  'ÆQ[Ð]'
+nul_match P 0 1 '-i'  '[Æ]QÐ'
+nul_match P 0 1 '-i' '[Æ]Qð'
+nul_match P 0 1 '-i' 'ÆQÐ'
 
-# ... because of \0 implicitly disabling regexes regexes that
-# should/shouldn't match don't do the right thing.
-nul_match T1 '' 'eQm.*cQ'
-nul_match T1 '-i' 'EQM.*cQ'
-nul_match T0 '' 'eQm[*]c'
-nul_match T0 '-i' 'EQM[*]C'
+# \0 in regexes can only work with -P & PCRE v2
+nul_match P 1 1 '' 'yQ[f]'
+nul_match P 1 1 '' '[y]Qf'
+nul_match P 1 1 '-i' 'YQ[F]'
+nul_match P 1 1 '-i' '[Y]Qf'
+nul_match P 1 1 '' 'æQ[ð]'
+nul_match P 1 1 '' '[æ]Qð'
+nul_match P 0 1 '-i' 'ÆQ[Ð]'
+nul_match P 1 1 '' 'eQm.*cQ'
+nul_match P 1 1 '-i' 'EQM.*cQ'
+nul_match P 0 0 '' 'eQm[*]c'
+nul_match P 0 0 '-i' 'EQM[*]C'
 
-# Due to the REG_STARTEND extension when kwset() is disabled on -i &
-# non-ASCII the string will be matched in its entirety, but the
-# pattern will be cut off at the first \0.
-nul_match 0 '-i' 'NOMATCHQð'
-nul_match T0 '-i' '[Æ]QNOMATCH'
-nul_match T0 '-i' '[æ]QNOMATCH'
-# Matches, but for the wrong reasons, just stops at [æ]
-nul_match 1 '-i' '[Æ]Qð'
-nul_match 1 '-i' '[æ]Qð'
+# Assert that we're using REG_STARTEND and the pattern doesn't match
+# just because it's cut off at the first \0.
+nul_match 0 0 0 '-i' 'NOMATCHQð'
+nul_match P 0 0 '-i' '[Æ]QNOMATCH'
+nul_match P 0 0 '-i' '[æ]QNOMATCH'
 
 # Ensure that the matcher doesn't regress to something that stops at
 # \0
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '' 'yQNOMATCH'
-nul_match 0 '' 'QNOMATCH'
-nul_match 0 '-i' 'YQNOMATCH'
-nul_match 0 '-i' 'QNOMATCH'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '' 'yQNÓMATCH'
-nul_match 0 '' 'QNÓMATCH'
-nul_match 0 '-i' 'YQNÓMATCH'
-nul_match 0 '-i' 'QNÓMATCH'
+nul_match 0 0 0 '-F' 'yQ[f]'
+nul_match 0 0 0 '-Fi' 'YQ[F]'
+nul_match 0 0 0 '' 'yQNOMATCH'
+nul_match 0 0 0 '' 'QNOMATCH'
+nul_match 0 0 0 '-i' 'YQNOMATCH'
+nul_match 0 0 0 '-i' 'QNOMATCH'
+nul_match 0 0 0 '-F' 'æQ[ð]'
+nul_match P P P '-Fi' 'ÆQ[Ð]'
+nul_match P 0 1 '-i' 'ÆQ[Ð]'
+nul_match 0 0 0 '' 'yQNÓMATCH'
+nul_match 0 0 0 '' 'QNÓMATCH'
+nul_match 0 0 0 '-i' 'YQNÓMATCH'
+nul_match 0 0 0 '-i' 'QNÓMATCH'
 
 test_done
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v2 7/9] grep: drop support for \0 in --fixed-strings <pattern>
  2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
                             ` (7 preceding siblings ...)
  2019-06-27 23:39           ` [PATCH v2 6/9] grep: make the behavior for NUL-byte in patterns sane Ævar Arnfjörð Bjarmason
@ 2019-06-27 23:39           ` Ævar Arnfjörð Bjarmason
  2019-06-27 23:39           ` [PATCH v2 8/9] grep: remove the kwset optimization Ævar Arnfjörð Bjarmason
  2019-06-27 23:39           ` [PATCH v2 9/9] grep: use PCRE v2 for optimized fixed-string search Ævar Arnfjörð Bjarmason
  10 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-27 23:39 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Change "-f <file>" to not support patterns with a NUL-byte in them
under --fixed-strings. We'll now only support these under
"--perl-regexp" with PCRE v2.

A previous change to grep's documentation changed the description of
"-f <file>" to be vague enough as to not promise that this would work.
By dropping support for this we make it a whole lot easier to move
away from the kwset backend, which we'll do in a subsequent change.
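
As an illustration of the rule this change enforces, here is a minimal
standalone sketch. The helper names pattern_has_nul() and
pattern_is_supported() are hypothetical and do not exist in grep.c;
only the memchr() check on a length-delimited pattern mirrors the diff
below.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: report whether a length-delimited pattern
 * contains a NUL byte. strlen() cannot be used here, since it would
 * stop at the first NUL.
 */
static int pattern_has_nul(const char *pattern, size_t len)
{
	assert(pattern != NULL || len == 0);
	return len > 0 && memchr(pattern, 0, len) != NULL;
}

/*
 * Hypothetical helper: a pattern with a NUL byte is accepted only
 * when the PCRE v2 backend is in use; any other pattern is accepted
 * regardless of backend.
 */
static int pattern_is_supported(const char *pattern, size_t len, int use_pcre2)
{
	return use_pcre2 || !pattern_has_nul(pattern, len);
}
```

With this sketch, the three bytes 'y', NUL, 'f' are rejected unless
use_pcre2 is set, whether or not --fixed-strings was requested.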

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c                         |  6 +--
 t/t7816-grep-binary-pattern.sh | 82 +++++++++++++++++-----------------
 2 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/grep.c b/grep.c
index d6603bc950..8d0fff316c 100644
--- a/grep.c
+++ b/grep.c
@@ -644,6 +644,9 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
 
+	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
+		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
+
 	/*
 	 * Even when -F (fixed) asks us to do a non-regexp search, we
 	 * may not be able to correctly case-fold when -i
@@ -666,9 +669,6 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		return;
 	}
 
-	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
-		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
-
 	if (opt->fixed) {
 		/*
 		 * We come here when the pattern has the non-ascii
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
index 9e09bd5d6a..60bab291e4 100755
--- a/t/t7816-grep-binary-pattern.sh
+++ b/t/t7816-grep-binary-pattern.sh
@@ -60,23 +60,23 @@ test_expect_success 'setup' "
 "
 
 # Simple fixed-string matching that can use kwset (no -i && non-ASCII)
-nul_match 1 1 1 '-F' 'yQf'
-nul_match 0 0 0 '-F' 'yQx'
-nul_match 1 1 1 '-Fi' 'YQf'
-nul_match 0 0 0 '-Fi' 'YQx'
-nul_match 1 1 1 '' 'yQf'
-nul_match 0 0 0 '' 'yQx'
-nul_match 1 1 1 '' 'æQð'
-nul_match 1 1 1 '-F' 'eQm[*]c'
-nul_match 1 1 1 '-Fi' 'EQM[*]C'
+nul_match P P P '-F' 'yQf'
+nul_match P P P '-F' 'yQx'
+nul_match P P P '-Fi' 'YQf'
+nul_match P P P '-Fi' 'YQx'
+nul_match P P 1 '' 'yQf'
+nul_match P P 0 '' 'yQx'
+nul_match P P 1 '' 'æQð'
+nul_match P P P '-F' 'eQm[*]c'
+nul_match P P P '-Fi' 'EQM[*]C'
 
 # Regex patterns that would match but shouldn't with -F
-nul_match 0 0 0 '-F' 'yQ[f]'
-nul_match 0 0 0 '-F' '[y]Qf'
-nul_match 0 0 0 '-Fi' 'YQ[F]'
-nul_match 0 0 0 '-Fi' '[Y]QF'
-nul_match 0 0 0 '-F' 'æQ[ð]'
-nul_match 0 0 0 '-F' '[æ]Qð'
+nul_match P P P '-F' 'yQ[f]'
+nul_match P P P '-F' '[y]Qf'
+nul_match P P P '-Fi' 'YQ[F]'
+nul_match P P P '-Fi' '[Y]QF'
+nul_match P P P '-F' 'æQ[ð]'
+nul_match P P P '-F' '[æ]Qð'
 
 # The -F kwset codepath can't handle -i && non-ASCII...
 nul_match P 1 1 '-i' '[æ]Qð'
@@ -90,38 +90,38 @@ nul_match P 0 1 '-i' '[Æ]Qð'
 nul_match P 0 1 '-i' 'ÆQÐ'
 
 # \0 in regexes can only work with -P & PCRE v2
-nul_match P 1 1 '' 'yQ[f]'
-nul_match P 1 1 '' '[y]Qf'
-nul_match P 1 1 '-i' 'YQ[F]'
-nul_match P 1 1 '-i' '[Y]Qf'
-nul_match P 1 1 '' 'æQ[ð]'
-nul_match P 1 1 '' '[æ]Qð'
-nul_match P 0 1 '-i' 'ÆQ[Ð]'
-nul_match P 1 1 '' 'eQm.*cQ'
-nul_match P 1 1 '-i' 'EQM.*cQ'
-nul_match P 0 0 '' 'eQm[*]c'
-nul_match P 0 0 '-i' 'EQM[*]C'
+nul_match P P 1 '' 'yQ[f]'
+nul_match P P 1 '' '[y]Qf'
+nul_match P P 1 '-i' 'YQ[F]'
+nul_match P P 1 '-i' '[Y]Qf'
+nul_match P P 1 '' 'æQ[ð]'
+nul_match P P 1 '' '[æ]Qð'
+nul_match P P 1 '-i' 'ÆQ[Ð]'
+nul_match P P 1 '' 'eQm.*cQ'
+nul_match P P 1 '-i' 'EQM.*cQ'
+nul_match P P 0 '' 'eQm[*]c'
+nul_match P P 0 '-i' 'EQM[*]C'
 
 # Assert that we're using REG_STARTEND and the pattern doesn't match
 # just because it's cut off at the first \0.
-nul_match 0 0 0 '-i' 'NOMATCHQð'
-nul_match P 0 0 '-i' '[Æ]QNOMATCH'
-nul_match P 0 0 '-i' '[æ]QNOMATCH'
+nul_match P P 0 '-i' 'NOMATCHQð'
+nul_match P P 0 '-i' '[Æ]QNOMATCH'
+nul_match P P 0 '-i' '[æ]QNOMATCH'
 
 # Ensure that the matcher doesn't regress to something that stops at
 # \0
-nul_match 0 0 0 '-F' 'yQ[f]'
-nul_match 0 0 0 '-Fi' 'YQ[F]'
-nul_match 0 0 0 '' 'yQNOMATCH'
-nul_match 0 0 0 '' 'QNOMATCH'
-nul_match 0 0 0 '-i' 'YQNOMATCH'
-nul_match 0 0 0 '-i' 'QNOMATCH'
-nul_match 0 0 0 '-F' 'æQ[ð]'
+nul_match P P P '-F' 'yQ[f]'
+nul_match P P P '-Fi' 'YQ[F]'
+nul_match P P 0 '' 'yQNOMATCH'
+nul_match P P 0 '' 'QNOMATCH'
+nul_match P P 0 '-i' 'YQNOMATCH'
+nul_match P P 0 '-i' 'QNOMATCH'
+nul_match P P P '-F' 'æQ[ð]'
 nul_match P P P '-Fi' 'ÆQ[Ð]'
-nul_match P 0 1 '-i' 'ÆQ[Ð]'
-nul_match 0 0 0 '' 'yQNÓMATCH'
-nul_match 0 0 0 '' 'QNÓMATCH'
-nul_match 0 0 0 '-i' 'YQNÓMATCH'
-nul_match 0 0 0 '-i' 'QNÓMATCH'
+nul_match P P 1 '-i' 'ÆQ[Ð]'
+nul_match P P 0 '' 'yQNÓMATCH'
+nul_match P P 0 '' 'QNÓMATCH'
+nul_match P P 0 '-i' 'YQNÓMATCH'
+nul_match P P 0 '-i' 'QNÓMATCH'
 
 test_done
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v2 8/9] grep: remove the kwset optimization
  2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
                             ` (8 preceding siblings ...)
  2019-06-27 23:39           ` [PATCH v2 7/9] grep: drop support for \0 in --fixed-strings <pattern> Ævar Arnfjörð Bjarmason
@ 2019-06-27 23:39           ` Ævar Arnfjörð Bjarmason
  2019-06-27 23:39           ` [PATCH v2 9/9] grep: use PCRE v2 for optimized fixed-string search Ævar Arnfjörð Bjarmason
  10 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-27 23:39 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

A later change will replace this optimization with optimistic use of
PCRE v2. I'm completely removing it as an intermediate step, as
opposed to replacing it with PCRE v2, to demonstrate that no grep
semantics depend on this (or any other) optimization for the fixed
backend anymore.

For now this is mostly (but not entirely) a performance regression, as
shown by this hacky one-liner:

    for opt in '' ' -i'
        do
        GIT_PERF_7821_GREP_OPTS=$opt GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3 USE_LIBPCRE=YesPlease' ./run origin/master HEAD -- p7821-grep-engines-fixed.sh
    done &&
    for opt in '' ' -i'
        do
        GIT_PERF_4221_LOG_OPTS=$opt GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3 USE_LIBPCRE=YesPlease' ./run origin/master HEAD -- p4221-log-grep-engines-fixed.sh
    done

Which produces:

plain grep:

    Test                             origin/master     HEAD
    -------------------------------------------------------------------------
    7821.1: fixed grep int           0.55(1.60+0.63)   0.82(3.11+0.51) +49.1%
    7821.2: basic grep int           0.62(1.68+0.49)   0.85(3.02+0.52) +37.1%
    7821.3: extended grep int        0.61(1.63+0.53)   0.91(3.09+0.44) +49.2%
    7821.4: perl grep int            0.55(1.60+0.57)   0.41(0.93+0.57) -25.5%
    7821.6: fixed grep uncommon      0.20(0.50+0.44)   0.35(1.27+0.42) +75.0%
    7821.7: basic grep uncommon      0.20(0.49+0.45)   0.35(1.29+0.41) +75.0%
    7821.8: extended grep uncommon   0.20(0.45+0.48)   0.35(1.25+0.44) +75.0%
    7821.9: perl grep uncommon       0.20(0.53+0.41)   0.16(0.24+0.49) -20.0%
    7821.11: fixed grep æ            0.35(1.27+0.40)   0.25(0.82+0.39) -28.6%
    7821.12: basic grep æ            0.35(1.28+0.38)   0.25(0.75+0.44) -28.6%
    7821.13: extended grep æ         0.36(1.21+0.46)   0.25(0.86+0.35) -30.6%
    7821.14: perl grep æ             0.35(1.33+0.34)   0.16(0.26+0.47) -54.3%

grep with -i:

    Test                                origin/master     HEAD
    -----------------------------------------------------------------------------
    7821.1: fixed grep -i int           0.61(1.84+0.64)   1.11(4.12+0.64) +82.0%
    7821.2: basic grep -i int           0.72(1.86+0.57)   1.15(4.48+0.49) +59.7%
    7821.3: extended grep -i int        0.94(1.83+0.60)   1.53(4.12+0.58) +62.8%
    7821.4: perl grep -i int            0.66(1.82+0.59)   0.55(1.08+0.58) -16.7%
    7821.6: fixed grep -i uncommon      0.21(0.51+0.44)   0.44(1.74+0.34) +109.5%
    7821.7: basic grep -i uncommon      0.21(0.55+0.41)   0.44(1.72+0.40) +109.5%
    7821.8: extended grep -i uncommon   0.21(0.57+0.39)   0.42(1.64+0.45) +100.0%
    7821.9: perl grep -i uncommon       0.21(0.48+0.48)   0.17(0.30+0.45) -19.0%
    7821.11: fixed grep -i æ            0.25(0.73+0.45)   0.25(0.75+0.45) +0.0%
    7821.12: basic grep -i æ            0.25(0.71+0.49)   0.26(0.77+0.44) +4.0%
    7821.13: extended grep -i æ         0.25(0.75+0.44)   0.25(0.74+0.46) +0.0%
    7821.14: perl grep -i æ             0.17(0.26+0.48)   0.16(0.20+0.52) -5.9%

plain log:

    Test                                     origin/master     HEAD
    ---------------------------------------------------------------------------------
    4221.1: fixed log --grep='int'           7.31(7.06+0.21)   8.11(7.85+0.20) +10.9%
    4221.2: basic log --grep='int'           7.30(6.94+0.27)   8.16(7.89+0.19) +11.8%
    4221.3: extended log --grep='int'        7.34(7.05+0.21)   8.08(7.76+0.25) +10.1%
    4221.4: perl log --grep='int'            7.27(6.94+0.24)   7.05(6.76+0.25) -3.0%
    4221.6: fixed log --grep='uncommon'      6.97(6.62+0.32)   7.86(7.51+0.30) +12.8%
    4221.7: basic log --grep='uncommon'      7.05(6.69+0.29)   7.89(7.60+0.28) +11.9%
    4221.8: extended log --grep='uncommon'   6.89(6.56+0.32)   7.99(7.66+0.24) +16.0%
    4221.9: perl log --grep='uncommon'       7.02(6.66+0.33)   6.97(6.54+0.36) -0.7%
    4221.11: fixed log --grep='æ'            7.37(7.03+0.33)   7.67(7.30+0.31) +4.1%
    4221.12: basic log --grep='æ'            7.41(7.00+0.31)   7.60(7.28+0.26) +2.6%
    4221.13: extended log --grep='æ'         7.35(6.96+0.38)   7.73(7.31+0.34) +5.2%
    4221.14: perl log --grep='æ'             7.43(7.10+0.32)   6.95(6.61+0.27) -6.5%

log with -i:

    Test                                        origin/master     HEAD
    ------------------------------------------------------------------------------------
    4221.1: fixed log -i --grep='int'           7.40(7.05+0.23)   8.66(8.38+0.20) +17.0%
    4221.2: basic log -i --grep='int'           7.39(7.09+0.23)   8.67(8.39+0.20) +17.3%
    4221.3: extended log -i --grep='int'        7.29(6.99+0.26)   8.69(8.31+0.26) +19.2%
    4221.4: perl log -i --grep='int'            7.42(7.16+0.21)   7.14(6.80+0.24) -3.8%
    4221.6: fixed log -i --grep='uncommon'      6.94(6.58+0.35)   8.43(8.04+0.30) +21.5%
    4221.7: basic log -i --grep='uncommon'      6.95(6.62+0.31)   8.34(7.93+0.32) +20.0%
    4221.8: extended log -i --grep='uncommon'   7.06(6.75+0.25)   8.32(7.98+0.31) +17.8%
    4221.9: perl log -i --grep='uncommon'       6.96(6.69+0.26)   7.04(6.64+0.32) +1.1%
    4221.11: fixed log -i --grep='æ'            7.92(7.55+0.33)   7.86(7.44+0.34) -0.8%
    4221.12: basic log -i --grep='æ'            7.88(7.49+0.32)   7.84(7.46+0.34) -0.5%
    4221.13: extended log -i --grep='æ'         7.91(7.51+0.32)   7.87(7.48+0.32) -0.5%
    4221.14: perl log -i --grep='æ'             7.01(6.59+0.35)   6.99(6.64+0.28) -0.3%

Some of those, as noted in [1], are because PCRE is faster at finding
fixed strings. This looks bad for some engines, but in the next change
we'll optimistically use PCRE v2 for all of these, so it'll look
better.
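
For readers of the tables above: the last column appears to be the
relative change in the first (elapsed) number between the two
revisions. The sketch below is my own reconstruction of that
arithmetic, not code from t/perf, and relative_change_percent() is a
hypothetical name.

```c
#include <assert.h>

/*
 * Hypothetical helper: relative change of "after" versus "before",
 * in percent. Positive values mean HEAD is slower, negative values
 * mean it is faster.
 */
static double relative_change_percent(double before, double after)
{
	assert(before > 0.0);
	return 100.0 * (after - before) / before;
}
```

For the "fixed grep int" row, going from 0.55 to 0.82 gives roughly
+49.1, and for "perl grep int", going from 0.55 to 0.41 gives roughly
-25.5, which agrees with the table.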

1. https://public-inbox.org/git/87v9x793qi.fsf@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 63 +++-------------------------------------------------------
 grep.h |  2 --
 2 files changed, 3 insertions(+), 62 deletions(-)

diff --git a/grep.c b/grep.c
index 8d0fff316c..4468519d5c 100644
--- a/grep.c
+++ b/grep.c
@@ -356,18 +356,6 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p,
 	die("%s'%s': %s", where, p->pattern, error);
 }
 
-static int is_fixed(const char *s, size_t len)
-{
-	size_t i;
-
-	for (i = 0; i < len; i++) {
-		if (is_regex_special(s[i]))
-			return 0;
-	}
-
-	return 1;
-}
-
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 {
@@ -643,38 +631,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
+	p->fixed = opt->fixed;
 
 	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
-	/*
-	 * Even when -F (fixed) asks us to do a non-regexp search, we
-	 * may not be able to correctly case-fold when -i
-	 * (ignore-case) is asked (in which case, we'll synthesize a
-	 * regexp to match the pattern that matches regexp special
-	 * characters literally, while ignoring case differences).  On
-	 * the other hand, even without -F, if the pattern does not
-	 * have any regexp special characters and there is no need for
-	 * case-folding search, we can internally turn it into a
-	 * simple string match using kws.  p->fixed tells us if we
-	 * want to use kws.
-	 */
-	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
-		p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
-
-	if (p->fixed) {
-		p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL);
-		kwsincr(p->kws, p->pattern, p->patternlen);
-		kwsprep(p->kws);
-		return;
-	}
-
 	if (opt->fixed) {
-		/*
-		 * We come here when the pattern has the non-ascii
-		 * characters we cannot case-fold, and asked to
-		 * ignore-case.
-		 */
 		compile_fixed_regexp(p, opt);
 		return;
 	}
@@ -1042,9 +1004,7 @@ void free_grep_patterns(struct grep_opt *opt)
 		case GREP_PATTERN: /* atom */
 		case GREP_PATTERN_HEAD:
 		case GREP_PATTERN_BODY:
-			if (p->kws)
-				kwsfree(p->kws);
-			else if (p->pcre1_regexp)
+			if (p->pcre1_regexp)
 				free_pcre1_regexp(p);
 			else if (p->pcre2_pattern)
 				free_pcre2_pattern(p);
@@ -1104,29 +1064,12 @@ static void show_name(struct grep_opt *opt, const char *name)
 	opt->output(opt, opt->null_following_name ? "\0" : "\n", 1);
 }
 
-static int fixmatch(struct grep_pat *p, char *line, char *eol,
-		    regmatch_t *match)
-{
-	struct kwsmatch kwsm;
-	size_t offset = kwsexec(p->kws, line, eol - line, &kwsm);
-	if (offset == -1) {
-		match->rm_so = match->rm_eo = -1;
-		return REG_NOMATCH;
-	} else {
-		match->rm_so = offset;
-		match->rm_eo = match->rm_so + kwsm.size[0];
-		return 0;
-	}
-}
-
 static int patmatch(struct grep_pat *p, char *line, char *eol,
 		    regmatch_t *match, int eflags)
 {
 	int hit;
 
-	if (p->fixed)
-		hit = !fixmatch(p, line, eol, match);
-	else if (p->pcre1_regexp)
+	if (p->pcre1_regexp)
 		hit = !pcre1match(p, line, eol, match, eflags);
 	else if (p->pcre2_pattern)
 		hit = !pcre2match(p, line, eol, match, eflags);
diff --git a/grep.h b/grep.h
index 4bb8a79d93..d35a137fcb 100644
--- a/grep.h
+++ b/grep.h
@@ -32,7 +32,6 @@ typedef int pcre2_compile_context;
 typedef int pcre2_match_context;
 typedef int pcre2_jit_stack;
 #endif
-#include "kwset.h"
 #include "thread-utils.h"
 #include "userdiff.h"
 
@@ -97,7 +96,6 @@ struct grep_pat {
 	pcre2_match_context *pcre2_match_context;
 	pcre2_jit_stack *pcre2_jit_stack;
 	uint32_t pcre2_jit_on;
-	kwset_t kws;
 	unsigned fixed:1;
 	unsigned ignore_case:1;
 	unsigned word_regexp:1;
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v2 9/9] grep: use PCRE v2 for optimized fixed-string search
  2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
                             ` (9 preceding siblings ...)
  2019-06-27 23:39           ` [PATCH v2 8/9] grep: remove the kwset optimization Ævar Arnfjörð Bjarmason
@ 2019-06-27 23:39           ` Ævar Arnfjörð Bjarmason
  10 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-27 23:39 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Bring back optimized fixed-string search for "grep", this time with
PCRE v2 as an optional backend. As noted in [1], the kwset backend was
slower than PCRE v1 and v2 with JIT, so that optimization was
counterproductive.
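
The optimization only applies when a pattern can be treated as a
literal string. A rough standalone sketch of that test follows. The
real is_fixed() in the diff below relies on git's is_regex_special();
the metacharacter set used here is an assumption for illustration, and
looks_fixed() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed set of regex metacharacters, for illustration only. */
static int is_special(char c)
{
	/* strchr() would also match the terminating NUL, so exclude it. */
	return c != '\0' && strchr("$()*+.?[\\^{|", c) != NULL;
}

/*
 * Hypothetical helper: return 1 if no byte of the length-delimited
 * pattern is a regex metacharacter, i.e. the pattern means the same
 * thing as a fixed string and as a regex.
 */
static int looks_fixed(const char *s, size_t len)
{
	size_t i;

	assert(s != NULL || len == 0);
	for (i = 0; i < len; i++) {
		if (is_special(s[i]))
			return 0;
	}
	return 1;
}
```

Under these assumptions "int" counts as fixed, while "m[*]c" does not
and would need quoting before being handed to a regex engine.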

This brings back the optimization for "--fixed-strings", without
changing the semantics of having a NUL-byte in patterns. As seen in
previous commits in this series we could support it now, but I'd
rather just leave that edge-case aside so we don't have one behavior
or the other depending on what "--fixed-strings" backend we're using. It
makes the behavior harder to understand and document, and makes tests
for the different backends more painful.

I could also support the PCRE v1 backend here, but that would make the
code more complex. I'd rather aim for simplicity here and in future
changes to the diffcore. We're not going to have someone who
absolutely must have faster search, but for whom building PCRE v2
isn't acceptable.
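
For a "--fixed-strings" pattern that does contain metacharacters, the
diff below wraps the pattern in \Q...\E before compiling it with PCRE
v2. A minimal standalone sketch of that quoting step, using malloc()
instead of git's strbuf; quote_literal() is a hypothetical name and
not part of grep.c:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a newly allocated copy of the
 * length-delimited pattern wrapped in \Q...\E, storing the new length
 * in *out_len. Returns NULL if allocation fails. The caller frees the
 * result.
 */
static char *quote_literal(const char *pattern, size_t len, size_t *out_len)
{
	size_t total = len + 4; /* two bytes each for \Q and \E */
	char *buf;

	assert(pattern != NULL && out_len != NULL);
	buf = malloc(total + 1);
	if (!buf)
		return NULL;

	memcpy(buf, "\\Q", 2);
	memcpy(buf + 2, pattern, len);
	memcpy(buf + 2 + len, "\\E", 2);
	buf[total] = '\0';

	*out_len = total;
	return buf;
}
```

One caveat with this kind of quoting in general: a pattern that itself
contains the two bytes \E ends the quoted region early. The sketch
does not try to handle that case.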

The difference between this series of commits and the current "master"
is, using the same t/perf commands shown in the last commit:

plain grep:

    Test                             origin/master     HEAD
    -------------------------------------------------------------------------
    7821.1: fixed grep int           0.55(1.67+0.56)   0.41(0.98+0.60) -25.5%
    7821.2: basic grep int           0.58(1.65+0.52)   0.41(0.96+0.57) -29.3%
    7821.3: extended grep int        0.57(1.66+0.49)   0.42(0.93+0.60) -26.3%
    7821.4: perl grep int            0.54(1.67+0.50)   0.43(0.88+0.65) -20.4%
    7821.6: fixed grep uncommon      0.21(0.52+0.42)   0.16(0.24+0.51) -23.8%
    7821.7: basic grep uncommon      0.20(0.49+0.45)   0.17(0.28+0.47) -15.0%
    7821.8: extended grep uncommon   0.20(0.54+0.39)   0.16(0.25+0.50) -20.0%
    7821.9: perl grep uncommon       0.20(0.58+0.36)   0.16(0.23+0.50) -20.0%
    7821.11: fixed grep æ            0.35(1.24+0.43)   0.16(0.23+0.50) -54.3%
    7821.12: basic grep æ            0.36(1.29+0.38)   0.16(0.20+0.54) -55.6%
    7821.13: extended grep æ         0.35(1.23+0.44)   0.16(0.24+0.50) -54.3%
    7821.14: perl grep æ             0.35(1.33+0.34)   0.16(0.28+0.46) -54.3%

grep with -i:

    Test                                origin/master     HEAD
    ----------------------------------------------------------------------------
    7821.1: fixed grep -i int           0.62(1.81+0.70)   0.47(1.11+0.64) -24.2%
    7821.2: basic grep -i int           0.67(1.90+0.53)   0.46(1.07+0.62) -31.3%
    7821.3: extended grep -i int        0.62(1.92+0.53)   0.53(1.12+0.58) -14.5%
    7821.4: perl grep -i int            0.66(1.85+0.58)   0.45(1.10+0.59) -31.8%
    7821.6: fixed grep -i uncommon      0.21(0.54+0.43)   0.17(0.20+0.55) -19.0%
    7821.7: basic grep -i uncommon      0.20(0.52+0.45)   0.17(0.29+0.48) -15.0%
    7821.8: extended grep -i uncommon   0.21(0.52+0.44)   0.17(0.26+0.50) -19.0%
    7821.9: perl grep -i uncommon       0.21(0.53+0.44)   0.17(0.20+0.56) -19.0%
    7821.11: fixed grep -i æ            0.26(0.79+0.44)   0.16(0.29+0.46) -38.5%
    7821.12: basic grep -i æ            0.26(0.79+0.42)   0.16(0.20+0.54) -38.5%
    7821.13: extended grep -i æ         0.26(0.84+0.39)   0.16(0.24+0.50) -38.5%
    7821.14: perl grep -i æ             0.16(0.24+0.49)   0.17(0.25+0.51) +6.3%

plain log:

    Test                                     origin/master     HEAD
    --------------------------------------------------------------------------------
    4221.1: fixed log --grep='int'           7.24(6.95+0.28)   7.20(6.95+0.18) -0.6%
    4221.2: basic log --grep='int'           7.31(6.97+0.22)   7.20(6.93+0.21) -1.5%
    4221.3: extended log --grep='int'        7.37(7.04+0.24)   7.22(6.91+0.25) -2.0%
    4221.4: perl log --grep='int'            7.31(7.04+0.21)   7.19(6.89+0.21) -1.6%
    4221.6: fixed log --grep='uncommon'      6.93(6.59+0.32)   7.04(6.66+0.37) +1.6%
    4221.7: basic log --grep='uncommon'      6.92(6.58+0.29)   7.08(6.75+0.29) +2.3%
    4221.8: extended log --grep='uncommon'   6.92(6.55+0.31)   7.00(6.68+0.31) +1.2%
    4221.9: perl log --grep='uncommon'       7.03(6.59+0.33)   7.12(6.73+0.34) +1.3%
    4221.11: fixed log --grep='æ'            7.41(7.08+0.28)   7.05(6.76+0.29) -4.9%
    4221.12: basic log --grep='æ'            7.39(6.99+0.33)   7.00(6.68+0.25) -5.3%
    4221.13: extended log --grep='æ'         7.34(7.00+0.25)   7.15(6.81+0.31) -2.6%
    4221.14: perl log --grep='æ'             7.43(7.13+0.26)   7.01(6.60+0.36) -5.7%

log with -i:

    Test                                        origin/master     HEAD
    ------------------------------------------------------------------------------------
    4221.1: fixed log -i --grep='int'           7.31(7.07+0.24)   7.23(7.00+0.22) -1.1%
    4221.2: basic log -i --grep='int'           7.40(7.08+0.28)   7.19(6.92+0.20) -2.8%
    4221.3: extended log -i --grep='int'        7.43(7.13+0.25)   7.27(6.99+0.21) -2.2%
    4221.4: perl log -i --grep='int'            7.34(7.10+0.24)   7.10(6.90+0.19) -3.3%
    4221.6: fixed log -i --grep='uncommon'      7.07(6.71+0.32)   7.11(6.77+0.28) +0.6%
    4221.7: basic log -i --grep='uncommon'      6.99(6.64+0.28)   7.12(6.69+0.38) +1.9%
    4221.8: extended log -i --grep='uncommon'   7.11(6.74+0.32)   7.10(6.77+0.27) -0.1%
    4221.9: perl log -i --grep='uncommon'       6.98(6.60+0.29)   7.05(6.64+0.34) +1.0%
    4221.11: fixed log -i --grep='æ'            7.85(7.45+0.34)   7.03(6.68+0.32) -10.4%
    4221.12: basic log -i --grep='æ'            7.87(7.49+0.29)   7.06(6.69+0.31) -10.3%
    4221.13: extended log -i --grep='æ'         7.87(7.54+0.31)   7.09(6.69+0.31) -9.9%
    4221.14: perl log -i --grep='æ'             7.06(6.77+0.28)   6.91(6.57+0.31) -2.1%

So as with e05b027627 ("grep: use PCRE v2 for optimized fixed-string
search", 2019-06-26) there's a huge improvement in performance for
"grep", but in "log" most of our time is spent elsewhere, so we don't
notice it that much.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index 4468519d5c..fc0ed73ef3 100644
--- a/grep.c
+++ b/grep.c
@@ -356,6 +356,18 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p,
 	die("%s'%s': %s", where, p->pattern, error);
 }
 
+static int is_fixed(const char *s, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++) {
+		if (is_regex_special(s[i]))
+			return 0;
+	}
+
+	return 1;
+}
+
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 {
@@ -602,7 +614,6 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
 static void free_pcre2_pattern(struct grep_pat *p)
 {
 }
-#endif /* !USE_LIBPCRE2 */
 
 static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
@@ -623,11 +634,13 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 		compile_regexp_failed(p, errbuf);
 	}
 }
+#endif /* !USE_LIBPCRE2 */
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
 	int err;
 	int regflags = REG_NEWLINE;
+	int pat_is_fixed;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
@@ -636,8 +649,42 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
-	if (opt->fixed) {
+	pat_is_fixed = is_fixed(p->pattern, p->patternlen);
+	if (opt->fixed || pat_is_fixed) {
+#ifdef USE_LIBPCRE2
+		opt->pcre2 = 1;
+		if (pat_is_fixed) {
+			compile_pcre2_pattern(p, opt);
+		} else {
+			/*
+			 * E.g. t7811-grep-open.sh relies on the
+			 * pattern being restored.
+			 */
+			char *old_pattern = p->pattern;
+			size_t old_patternlen = p->patternlen;
+			struct strbuf sb = STRBUF_INIT;
+
+			/*
+			 * There is the PCRE2_LITERAL flag, but it's
+			 * only in PCRE v2 10.30 and later. Needing to
+			 * ifdef our way around that and dealing with
+			 * it + PCRE2_MULTILINE being an error is more
+			 * complex than just quoting this ourselves.
+			*/
+			strbuf_add(&sb, "\\Q", 2);
+			strbuf_add(&sb, p->pattern, p->patternlen);
+			strbuf_add(&sb, "\\E", 2);
+
+			p->pattern = sb.buf;
+			p->patternlen = sb.len;
+			compile_pcre2_pattern(p, opt);
+			p->pattern = old_pattern;
+			p->patternlen = old_patternlen;
+			strbuf_release(&sb);
+		}
+#else /* !USE_LIBPCRE2 */
 		compile_fixed_regexp(p, opt);
+#endif /* !USE_LIBPCRE2 */
 		return;
 	}
 
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v2 0/9] grep: move from kwset to optional PCRE v2
  2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
@ 2019-06-28  7:23             ` Ævar Arnfjörð Bjarmason
  2019-06-28 16:10               ` Junio C Hamano
  2019-07-01 21:20             ` [PATCH v3 00/10] " Ævar Arnfjörð Bjarmason
                               ` (10 subsequent siblings)
  11 siblings, 1 reply; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-06-28  7:23 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev


On Fri, Jun 28 2019, Ævar Arnfjörð Bjarmason wrote:

> A non-RFC since it seems people like this approach.
>
> This should fix the test failure noted by Johannes, there's two new
> patches at the start of this series. They address a bug that was there
> for a long time, but I happened to trip over since PCRE is more strict
> about UTF-8 validation than kwset (which doesn't care at all).
>
> I also added performance numbers to the relevant commit messages, took
> brian's suggestion of saying "NUL-byte" instead of "\0", and did some
> other copyediting of my own.
>
> The rest of the code changes are all just comments & rewording of
> previously added comments.

Junio. I thought I'd submit this in before your merge to "next", but I
see that happened. Are you OK with rewinding it for this (& maybe
something else) or should I submit a v3 rebased on "next"?

I'd really prefer the improved commit messages with performance numbers,
and thought I'd have time to work on those details since it was an
RFC/PATCH :)

> Ævar Arnfjörð Bjarmason (9):
>   log tests: test regex backends in "--encode=<enc>" tests
>   grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>"
>   grep: inline the return value of a function call used only once
>   grep tests: move "grep binary" alongside the rest
>   grep tests: move binary pattern tests into their own file
>   grep: make the behavior for NUL-byte in patterns sane
>   grep: drop support for \0 in --fixed-strings <pattern>
>   grep: remove the kwset optimization
>   grep: use PCRE v2 for optimized fixed-string search
>
>  Documentation/git-grep.txt                    |  17 +++
>  grep.c                                        | 115 +++++++---------
>  grep.h                                        |   3 +-
>  revision.c                                    |   3 +
>  t/t4210-log-i18n.sh                           |  39 +++++-
>  ...a1.sh => t7008-filter-branch-null-sha1.sh} |   0
>  ...08-grep-binary.sh => t7815-grep-binary.sh} | 101 --------------
>  t/t7816-grep-binary-pattern.sh                | 127 ++++++++++++++++++
>  8 files changed, 233 insertions(+), 172 deletions(-)
>  rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
>  rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (55%)
>  create mode 100755 t/t7816-grep-binary-pattern.sh
>
> Range-diff:
>  -:  ---------- >  1:  cfc01f49d3 log tests: test regex backends in "--encode=<enc>" tests
>  -:  ---------- >  2:  4b59eb32f0 grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>"
>  1:  ad55d3be7e =  3:  cc4d3b50d5 grep: inline the return value of a function call used only once
>  2:  650bcc8582 =  4:  d9b29bdd89 grep tests: move "grep binary" alongside the rest
>  3:  ef10a8820d !  5:  f85614f435 grep tests: move binary pattern tests into their own file
>     @@ -2,9 +2,10 @@
>
>          grep tests: move binary pattern tests into their own file
>
>     -    Move the tests for "-f <file>" where "<file>" contains a "\0" pattern
>     -    into their own file. I added most of these tests in 966be95549 ("grep:
>     -    add tests to fix blind spots with \0 patterns", 2017-05-20).
>     +    Move the tests for "-f <file>" where "<file>" contains a NUL byte
>     +    pattern into their own file. I added most of these tests in
>     +    966be95549 ("grep: add tests to fix blind spots with \0 patterns",
>     +    2017-05-20).
>
>          Whether a regex engine supports matching binary content is very
>          different from whether it matches binary patterns. Since
>     @@ -14,8 +15,8 @@
>          engine can sensibly match binary patterns.
>
>          Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting
>     -    patterns containing "\0" and considering them fixed, except in cases
>     -    where "--ignore-case" is provided and they're non-ASCII, see
>     +    patterns containing NUL-byte and considering them fixed, except in
>     +    cases where "--ignore-case" is provided and they're non-ASCII, see
>          5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings",
>          2016-06-25). Subsequent commits will change this behavior.
>
>  4:  03e5637efc !  6:  90afca8707 grep: make the behavior for \0 in patterns sane
>     @@ -1,12 +1,13 @@
>      Author: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>
>     -    grep: make the behavior for \0 in patterns sane
>     +    grep: make the behavior for NUL-byte in patterns sane
>
>     -    The behavior of "grep" when patterns contained "\0" has always been
>     -    haphazard, and has served the vagaries of the implementation more than
>     -    anything else. A "\0" in a pattern can only be provided via "-f
>     -    <file>", and since pickaxe (log search) has no such flag "\0" in
>     -    patterns has only ever been supported by "grep".
>     +    The behavior of "grep" when patterns contained a NUL-byte has always
>     +    been haphazard, and has served the vagaries of the implementation more
>     +    than anything else. A pattern containing a NUL-byte can only be
>     +    provided via "-f <file>". Since pickaxe (log search) has no such flag
>     +    the NUL-byte in patterns has only ever been supported by "grep" (and
>     +    not "log --grep").
>
>          Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) patterns containing
>          "\0" were considered fixed. In 966be95549 ("grep: add tests to fix
>     @@ -14,9 +15,9 @@
>          behavior.
>
>          Change the behavior to do the obvious thing, i.e. don't silently
>     -    discard a regex pattern and make it implicitly fixed just because it
>     -    contains a \0. Instead die if e.g. --basic-regexp is combined with
>     -    such a pattern.
>     +    discard a regex pattern and make it implicitly fixed just because it
>     +    contains a NUL-byte. Instead die if the backend in question can't
>     +    handle it, e.g. --basic-regexp is combined with such a pattern.
>
>          This is desired because from a user's point of view it's the obvious
>          thing to do. Whether we support BRE/ERE/Perl syntax is different from
>  5:  b9aad3ec1c !  7:  526b925fdc grep: drop support for \0 in --fixed-strings <pattern>
>     @@ -2,15 +2,14 @@
>
>          grep: drop support for \0 in --fixed-strings <pattern>
>
>     -    Change "-f <file>" to not support patterns with "\0" in them under
>     -    --fixed-strings, we'll now only support these under --perl-regexp with
>     -    PCRE v2.
>     +    Change "-f <file>" to not support patterns with a NUL-byte in them
>     +    under --fixed-strings. We'll now only support these under
>     +    "--perl-regexp" with PCRE v2.
>
>     -    A previous change to Documentation/git-grep.txt changed the
>     -    description of "-f <file>" to be vague enough as to not promise that
>     -    this would work, and by dropping support for this we make it a whole
>     -    lot easier to move away from the kwset backend, which a subsequent
>     -    change will try to do.
>     +    A previous change to grep's documentation changed the description of
>     +    "-f <file>" to be vague enough as to not promise that this would work.
>     +    By dropping support for this we make it a whole lot easier to move
>     +    away from the kwset backend, which we'll do in a subsequent change.
>
>          Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>
>  6:  3587be009a !  8:  14269bb295 grep: remove the kwset optimization
>     @@ -2,9 +2,99 @@
>
>          grep: remove the kwset optimization
>
>     -    A later change will replace this optimization with a different one,
>     -    but as removing it and running the tests demonstrates no grep
>     -    semantics depend on this backend anymore.
>     +    A later change will replace this optimization with optimistic use of
>     +    PCRE v2. I'm completely removing it as an intermediate step, as
>     +    opposed to replacing it with PCRE v2, to demonstrate that no grep
>     +    semantics depend on this (or any other) optimization for the fixed
>     +    backend anymore.
>     +
>     +    For now this is mostly (but not entirely) a performance regression, as
>     +    shown by this hacky one-liner:
>     +
>     +        for opt in '' ' -i'
>     +            do
>     +            GIT_PERF_7821_GREP_OPTS=$opt GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3 USE_LIBPCRE=YesPlease' ./run origin/master HEAD -- p7821-grep-engines-fixed.sh
>     +        done &&
>     +        for opt in '' ' -i'
>     +            do GIT_PERF_4221_LOG_OPTS=$opt GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3 USE_LIBPCRE=YesPlease' ./run origin/master HEAD -- p4221-log-grep-engines-fixed.sh
>     +        done
>     +
>     +    Which produces:
>     +
>     +    plain grep:
>     +
>     +        Test                             origin/master     HEAD
>     +        -------------------------------------------------------------------------
>     +        7821.1: fixed grep int           0.55(1.60+0.63)   0.82(3.11+0.51) +49.1%
>     +        7821.2: basic grep int           0.62(1.68+0.49)   0.85(3.02+0.52) +37.1%
>     +        7821.3: extended grep int        0.61(1.63+0.53)   0.91(3.09+0.44) +49.2%
>     +        7821.4: perl grep int            0.55(1.60+0.57)   0.41(0.93+0.57) -25.5%
>     +        7821.6: fixed grep uncommon      0.20(0.50+0.44)   0.35(1.27+0.42) +75.0%
>     +        7821.7: basic grep uncommon      0.20(0.49+0.45)   0.35(1.29+0.41) +75.0%
>     +        7821.8: extended grep uncommon   0.20(0.45+0.48)   0.35(1.25+0.44) +75.0%
>     +        7821.9: perl grep uncommon       0.20(0.53+0.41)   0.16(0.24+0.49) -20.0%
>     +        7821.11: fixed grep æ            0.35(1.27+0.40)   0.25(0.82+0.39) -28.6%
>     +        7821.12: basic grep æ            0.35(1.28+0.38)   0.25(0.75+0.44) -28.6%
>     +        7821.13: extended grep æ         0.36(1.21+0.46)   0.25(0.86+0.35) -30.6%
>     +        7821.14: perl grep æ             0.35(1.33+0.34)   0.16(0.26+0.47) -54.3%
>     +
>     +    grep with -i:
>     +
>     +        Test                                origin/master     HEAD
>     +        -----------------------------------------------------------------------------
>     +        7821.1: fixed grep -i int           0.61(1.84+0.64)   1.11(4.12+0.64) +82.0%
>     +        7821.2: basic grep -i int           0.72(1.86+0.57)   1.15(4.48+0.49) +59.7%
>     +        7821.3: extended grep -i int        0.94(1.83+0.60)   1.53(4.12+0.58) +62.8%
>     +        7821.4: perl grep -i int            0.66(1.82+0.59)   0.55(1.08+0.58) -16.7%
>     +        7821.6: fixed grep -i uncommon      0.21(0.51+0.44)   0.44(1.74+0.34) +109.5%
>     +        7821.7: basic grep -i uncommon      0.21(0.55+0.41)   0.44(1.72+0.40) +109.5%
>     +        7821.8: extended grep -i uncommon   0.21(0.57+0.39)   0.42(1.64+0.45) +100.0%
>     +        7821.9: perl grep -i uncommon       0.21(0.48+0.48)   0.17(0.30+0.45) -19.0%
>     +        7821.11: fixed grep -i æ            0.25(0.73+0.45)   0.25(0.75+0.45) +0.0%
>     +        7821.12: basic grep -i æ            0.25(0.71+0.49)   0.26(0.77+0.44) +4.0%
>     +        7821.13: extended grep -i æ         0.25(0.75+0.44)   0.25(0.74+0.46) +0.0%
>     +        7821.14: perl grep -i æ             0.17(0.26+0.48)   0.16(0.20+0.52) -5.9%
>     +
>     +    plain log:
>     +
>     +        Test                                     origin/master     HEAD
>     +        ---------------------------------------------------------------------------------
>     +        4221.1: fixed log --grep='int'           7.31(7.06+0.21)   8.11(7.85+0.20) +10.9%
>     +        4221.2: basic log --grep='int'           7.30(6.94+0.27)   8.16(7.89+0.19) +11.8%
>     +        4221.3: extended log --grep='int'        7.34(7.05+0.21)   8.08(7.76+0.25) +10.1%
>     +        4221.4: perl log --grep='int'            7.27(6.94+0.24)   7.05(6.76+0.25) -3.0%
>     +        4221.6: fixed log --grep='uncommon'      6.97(6.62+0.32)   7.86(7.51+0.30) +12.8%
>     +        4221.7: basic log --grep='uncommon'      7.05(6.69+0.29)   7.89(7.60+0.28) +11.9%
>     +        4221.8: extended log --grep='uncommon'   6.89(6.56+0.32)   7.99(7.66+0.24) +16.0%
>     +        4221.9: perl log --grep='uncommon'       7.02(6.66+0.33)   6.97(6.54+0.36) -0.7%
>     +        4221.11: fixed log --grep='æ'            7.37(7.03+0.33)   7.67(7.30+0.31) +4.1%
>     +        4221.12: basic log --grep='æ'            7.41(7.00+0.31)   7.60(7.28+0.26) +2.6%
>     +        4221.13: extended log --grep='æ'         7.35(6.96+0.38)   7.73(7.31+0.34) +5.2%
>     +        4221.14: perl log --grep='æ'             7.43(7.10+0.32)   6.95(6.61+0.27) -6.5%
>     +
>     +    log with -i:
>     +
>     +        Test                                        origin/master     HEAD
>     +        ------------------------------------------------------------------------------------
>     +        4221.1: fixed log -i --grep='int'           7.40(7.05+0.23)   8.66(8.38+0.20) +17.0%
>     +        4221.2: basic log -i --grep='int'           7.39(7.09+0.23)   8.67(8.39+0.20) +17.3%
>     +        4221.3: extended log -i --grep='int'        7.29(6.99+0.26)   8.69(8.31+0.26) +19.2%
>     +        4221.4: perl log -i --grep='int'            7.42(7.16+0.21)   7.14(6.80+0.24) -3.8%
>     +        4221.6: fixed log -i --grep='uncommon'      6.94(6.58+0.35)   8.43(8.04+0.30) +21.5%
>     +        4221.7: basic log -i --grep='uncommon'      6.95(6.62+0.31)   8.34(7.93+0.32) +20.0%
>     +        4221.8: extended log -i --grep='uncommon'   7.06(6.75+0.25)   8.32(7.98+0.31) +17.8%
>     +        4221.9: perl log -i --grep='uncommon'       6.96(6.69+0.26)   7.04(6.64+0.32) +1.1%
>     +        4221.11: fixed log -i --grep='æ'            7.92(7.55+0.33)   7.86(7.44+0.34) -0.8%
>     +        4221.12: basic log -i --grep='æ'            7.88(7.49+0.32)   7.84(7.46+0.34) -0.5%
>     +        4221.13: extended log -i --grep='æ'         7.91(7.51+0.32)   7.87(7.48+0.32) -0.5%
>     +        4221.14: perl log -i --grep='æ'             7.01(6.59+0.35)   6.99(6.64+0.28) -0.3%
>     +
>     +    Some of those, as noted in [1] are because PCRE is faster at finding
>     +    fixed strings. This looks bad for some engines, but in the next change
>     +    we'll optimistically use PCRE v2 for all of these, so it'll look
>     +    better.
>     +
>     +    1. https://public-inbox.org/git/87v9x793qi.fsf@evledraar.gmail.com/
>
>          Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>
>  7:  5bc25c03b8 !  9:  c0fd75d102 grep: use PCRE v2 for optimized fixed-string search
>     @@ -7,19 +7,95 @@
>          slower than PCRE v1 and v2 JIT with the kwset backend, so that
>          optimization was counterproductive.
>
>     -    This brings back the optimization for "-F", without changing the
>     -    semantics of "\0" in patterns. As seen in previous commits in this
>     -    series we could support it now, but I'd rather just leave that
>     -    edge-case aside so the tests don't need to do one thing or the other
>     -    depending on what --fixed-strings backend we're using.
>     -
>     -    I could also support the v1 backend here, but that would make the code
>     -    more complex, and I'd rather aim for simplicity here and in future
>     +    This brings back the optimization for "--fixed-strings", without
>     +    changing the semantics of having a NUL-byte in patterns. As seen in
>     +    previous commits in this series we could support it now, but I'd
>     +    rather just leave that edge-case aside so we don't have one behavior
>     +    or the other depending what "--fixed-strings" backend we're using. It
>     +    makes the behavior harder to understand and document, and makes tests
>     +    for the different backends more painful.
>     +
>     +    I could also support the PCRE v1 backend here, but that would make the
>     +    code more complex. I'd rather aim for simplicity here and in future
>          changes to the diffcore. We're not going to have someone who
>          absolutely must have faster search, but for whom building PCRE v2
>          isn't acceptable.
>
>     -    1. https://public-inbox.org/git/87v9x793qi.fsf@evledraar.gmail.com/
>     +    The difference between this series of commits and the current "master"
>     +    is, using the same t/perf commands shown in the last commit:
>     +
>     +    plain grep:
>     +
>     +        Test                             origin/master     HEAD
>     +        -------------------------------------------------------------------------
>     +        7821.1: fixed grep int           0.55(1.67+0.56)   0.41(0.98+0.60) -25.5%
>     +        7821.2: basic grep int           0.58(1.65+0.52)   0.41(0.96+0.57) -29.3%
>     +        7821.3: extended grep int        0.57(1.66+0.49)   0.42(0.93+0.60) -26.3%
>     +        7821.4: perl grep int            0.54(1.67+0.50)   0.43(0.88+0.65) -20.4%
>     +        7821.6: fixed grep uncommon      0.21(0.52+0.42)   0.16(0.24+0.51) -23.8%
>     +        7821.7: basic grep uncommon      0.20(0.49+0.45)   0.17(0.28+0.47) -15.0%
>     +        7821.8: extended grep uncommon   0.20(0.54+0.39)   0.16(0.25+0.50) -20.0%
>     +        7821.9: perl grep uncommon       0.20(0.58+0.36)   0.16(0.23+0.50) -20.0%
>     +        7821.11: fixed grep æ            0.35(1.24+0.43)   0.16(0.23+0.50) -54.3%
>     +        7821.12: basic grep æ            0.36(1.29+0.38)   0.16(0.20+0.54) -55.6%
>     +        7821.13: extended grep æ         0.35(1.23+0.44)   0.16(0.24+0.50) -54.3%
>     +        7821.14: perl grep æ             0.35(1.33+0.34)   0.16(0.28+0.46) -54.3%
>     +
>     +    grep with -i:
>     +
>     +        Test                                origin/master     HEAD
>     +        ----------------------------------------------------------------------------
>     +        7821.1: fixed grep -i int           0.62(1.81+0.70)   0.47(1.11+0.64) -24.2%
>     +        7821.2: basic grep -i int           0.67(1.90+0.53)   0.46(1.07+0.62) -31.3%
>     +        7821.3: extended grep -i int        0.62(1.92+0.53)   0.53(1.12+0.58) -14.5%
>     +        7821.4: perl grep -i int            0.66(1.85+0.58)   0.45(1.10+0.59) -31.8%
>     +        7821.6: fixed grep -i uncommon      0.21(0.54+0.43)   0.17(0.20+0.55) -19.0%
>     +        7821.7: basic grep -i uncommon      0.20(0.52+0.45)   0.17(0.29+0.48) -15.0%
>     +        7821.8: extended grep -i uncommon   0.21(0.52+0.44)   0.17(0.26+0.50) -19.0%
>     +        7821.9: perl grep -i uncommon       0.21(0.53+0.44)   0.17(0.20+0.56) -19.0%
>     +        7821.11: fixed grep -i æ            0.26(0.79+0.44)   0.16(0.29+0.46) -38.5%
>     +        7821.12: basic grep -i æ            0.26(0.79+0.42)   0.16(0.20+0.54) -38.5%
>     +        7821.13: extended grep -i æ         0.26(0.84+0.39)   0.16(0.24+0.50) -38.5%
>     +        7821.14: perl grep -i æ             0.16(0.24+0.49)   0.17(0.25+0.51) +6.3%
>     +
>     +    plain log:
>     +
>     +        Test                                     origin/master     HEAD
>     +        --------------------------------------------------------------------------------
>     +        4221.1: fixed log --grep='int'           7.24(6.95+0.28)   7.20(6.95+0.18) -0.6%
>     +        4221.2: basic log --grep='int'           7.31(6.97+0.22)   7.20(6.93+0.21) -1.5%
>     +        4221.3: extended log --grep='int'        7.37(7.04+0.24)   7.22(6.91+0.25) -2.0%
>     +        4221.4: perl log --grep='int'            7.31(7.04+0.21)   7.19(6.89+0.21) -1.6%
>     +        4221.6: fixed log --grep='uncommon'      6.93(6.59+0.32)   7.04(6.66+0.37) +1.6%
>     +        4221.7: basic log --grep='uncommon'      6.92(6.58+0.29)   7.08(6.75+0.29) +2.3%
>     +        4221.8: extended log --grep='uncommon'   6.92(6.55+0.31)   7.00(6.68+0.31) +1.2%
>     +        4221.9: perl log --grep='uncommon'       7.03(6.59+0.33)   7.12(6.73+0.34) +1.3%
>     +        4221.11: fixed log --grep='æ'            7.41(7.08+0.28)   7.05(6.76+0.29) -4.9%
>     +        4221.12: basic log --grep='æ'            7.39(6.99+0.33)   7.00(6.68+0.25) -5.3%
>     +        4221.13: extended log --grep='æ'         7.34(7.00+0.25)   7.15(6.81+0.31) -2.6%
>     +        4221.14: perl log --grep='æ'             7.43(7.13+0.26)   7.01(6.60+0.36) -5.7%
>     +
>     +    log with -i:
>     +
>     +        Test                                        origin/master     HEAD
>     +        ------------------------------------------------------------------------------------
>     +        4221.1: fixed log -i --grep='int'           7.31(7.07+0.24)   7.23(7.00+0.22) -1.1%
>     +        4221.2: basic log -i --grep='int'           7.40(7.08+0.28)   7.19(6.92+0.20) -2.8%
>     +        4221.3: extended log -i --grep='int'        7.43(7.13+0.25)   7.27(6.99+0.21) -2.2%
>     +        4221.4: perl log -i --grep='int'            7.34(7.10+0.24)   7.10(6.90+0.19) -3.3%
>     +        4221.6: fixed log -i --grep='uncommon'      7.07(6.71+0.32)   7.11(6.77+0.28) +0.6%
>     +        4221.7: basic log -i --grep='uncommon'      6.99(6.64+0.28)   7.12(6.69+0.38) +1.9%
>     +        4221.8: extended log -i --grep='uncommon'   7.11(6.74+0.32)   7.10(6.77+0.27) -0.1%
>     +        4221.9: perl log -i --grep='uncommon'       6.98(6.60+0.29)   7.05(6.64+0.34) +1.0%
>     +        4221.11: fixed log -i --grep='æ'            7.85(7.45+0.34)   7.03(6.68+0.32) -10.4%
>     +        4221.12: basic log -i --grep='æ'            7.87(7.49+0.29)   7.06(6.69+0.31) -10.3%
>     +        4221.13: extended log -i --grep='æ'         7.87(7.54+0.31)   7.09(6.69+0.31) -9.9%
>     +        4221.14: perl log -i --grep='æ'             7.06(6.77+0.28)   6.91(6.57+0.31) -2.1%
>     +
>     +    So as with e05b027627 ("grep: use PCRE v2 for optimized fixed-string
>     +    search", 2019-06-26) there's a huge improvement in performance for
>     +    "grep", but in "log" most of our time is spent elsewhere, so we don't
>     +    notice it that much.
>
>          Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>
>     @@ -81,15 +157,19 @@
>      +		} else {
>      +			/*
>      +			 * E.g. t7811-grep-open.sh relies on the
>     -+			 * pattern being restored, and unfortunately
>     -+			 * there's no PCRE compile flag for "this is
>     -+			 * fixed", so we need to munge it to
>     -+			 * "\Q<pat>\E".
>     ++			 * pattern being restored.
>      +			 */
>      +			char *old_pattern = p->pattern;
>      +			size_t old_patternlen = p->patternlen;
>      +			struct strbuf sb = STRBUF_INIT;
>      +
>     ++			/*
>     ++			 * There is the PCRE2_LITERAL flag, but it's
>     ++			 * only in PCRE v2 10.30 and later. Needing to
>     ++			 * ifdef our way around that and dealing with
>     ++			 * it + PCRE2_MULTILINE being an error is more
>     ++			 * complex than just quoting this ourselves.
>     ++			*/
>      +			strbuf_add(&sb, "\\Q", 2);
>      +			strbuf_add(&sb, p->pattern, p->patternlen);
>      +			strbuf_add(&sb, "\\E", 2);
>     @@ -101,9 +181,9 @@
>      +			p->patternlen = old_patternlen;
>      +			strbuf_release(&sb);
>      +		}
>     -+#else
>     ++#else /* !USE_LIBPCRE2 */
>       		compile_fixed_regexp(p, opt);
>     -+#endif
>     ++#endif /* !USE_LIBPCRE2 */
>       		return;
>       	}

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC/PATCH 7/7] grep: use PCRE v2 for optimized fixed-string search
  2019-06-27 19:06                   ` Junio C Hamano
@ 2019-06-28 10:56                     ` Johannes Schindelin
  0 siblings, 0 replies; 90+ messages in thread
From: Johannes Schindelin @ 2019-06-28 10:56 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, git, git-packagers,
	gitgitgadget, peff, sandals, szeder.dev

Hi Junio,

On Thu, 27 Jun 2019, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> >> > > If we would not have plenty of exercise for the PCRE2 build
> >> > > options, I would be worried. But AFAICT the CI build includes
> >> > > this all the time, so we're fine.
> >> >
> >> > Well, I'd feel safer if it were not "all the time", i.e. we know we
> >> > are testing both sides of the coin.
> >>
> >> AFAIR at least the Linux32 job is built without PCRE2 by default. I
> >> might be wrong on that, though...
> >
> > Actually, it seems that _all_ of the Linux builds in our Azure Pipeline
> > compile without pcre2. It seems you have to pass `USE_LIBPCRE2=1` to
> > `make`, and we do not do that in `ci/run-build-and-tests.sh` nor in
> > `azure-pipelines.yml`. I do not even see that for the macOS builds.
> >
> > So we got PCRE2 covered only in the Windows build, it seems.
>
> OK, it sounds like we have sufficient coverage on both fronts.

Maybe not. With the bug I uncovered that is _only_ triggering an error
message if the PCRE2 in question does not support JIT'ed operations, I am
a bit wary now.

But I cannot really see a reasonable way to add those axes to the CI
builds.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v2 0/9] grep: move from kwset to optional PCRE v2
  2019-06-28  7:23             ` Ævar Arnfjörð Bjarmason
@ 2019-06-28 16:10               ` Junio C Hamano
  0 siblings, 0 replies; 90+ messages in thread
From: Junio C Hamano @ 2019-06-28 16:10 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, git-packagers, gitgitgadget, johannes.schindelin, peff,
	sandals, szeder.dev

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> Junio. I thought I'd submit this in before your merge to "next", but I
> see that happened. Are you OK with rewinding it for this (& maybe
> something else) or should I submit a v3 rebased on "next"?

Sensible choices are between

 (1) reverting 984c7ccbbf (Merge branch 'ab/no-kwset' into next) and
     then replacing the whole topic with this round, or

 (2) queuing an incremental series based on 984c7ccbbf^2 (i.e. the
     tip of ab/no-kwset as of today)

I think the former is probably good enough.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v3 00/10] grep: move from kwset to optional PCRE v2
  2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
  2019-06-28  7:23             ` Ævar Arnfjörð Bjarmason
@ 2019-07-01 21:20             ` Ævar Arnfjörð Bjarmason
  2019-07-01 21:31               ` Junio C Hamano
  2019-07-02 12:32               ` Johannes Schindelin
  2019-07-01 21:20             ` [PATCH v3 01/10] log tests: test regex backends in "--encode=<enc>" tests Ævar Arnfjörð Bjarmason
                               ` (9 subsequent siblings)
  11 siblings, 2 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-01 21:20 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

This v3 has a new patch (3/10) that I believe fixes the regression on
MinGW Johannes noted in
https://public-inbox.org/git/nycvar.QRO.7.76.6.1907011515150.44@tvgsbejvaqbjf.bet/

As noted in the updated commit message in 10/10 I believe just
skipping this test & documenting this in a commit message is the least
amount of suck for now. It's really an existing issue with us doing
nothing sensible when the log/grep haystack encoding doesn't match the
needle encoding supplied via the command line.

We swept that under the carpet with the kwset backend, but PCRE v2
exposes it.
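
To illustrate the difference, here is a minimal sketch of a strict
UTF-8 well-formedness check. It is not code from git, kwset or PCRE;
is_valid_utf8() is a hypothetical name used only for this example. A
byte-wise matcher never asks this question at all, whereas an engine
that validates its input in UTF mode has to answer it before matching.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of git or PCRE): return 1 if
 * buf[0..len) is well-formed UTF-8, 0 otherwise. Rejects truncated
 * sequences, stray continuation bytes, overlong forms, UTF-16
 * surrogates and code points above U+10FFFF.
 */
int is_valid_utf8(const char *buf, size_t len)
{
	const unsigned char *s = (const unsigned char *)buf;
	size_t i = 0;

	while (i < len) {
		unsigned char c = s[i];
		size_t need, k;
		unsigned long cp, min;

		if (c < 0x80) {		/* plain ASCII */
			i++;
			continue;
		} else if ((c & 0xe0) == 0xc0) {
			need = 1; cp = c & 0x1f; min = 0x80;
		} else if ((c & 0xf0) == 0xe0) {
			need = 2; cp = c & 0x0f; min = 0x800;
		} else if ((c & 0xf8) == 0xf0) {
			need = 3; cp = c & 0x07; min = 0x10000;
		} else {
			return 0;	/* stray continuation or bad lead byte */
		}
		if (len - i <= need)
			return 0;	/* sequence is truncated */
		for (k = 1; k <= need; k++) {
			if ((s[i + k] & 0xc0) != 0x80)
				return 0;	/* not a continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3f);
		}
		if (cp < min || cp > 0x10ffff ||
		    (cp >= 0xd800 && cp <= 0xdfff))
			return 0;	/* overlong, out of range or surrogate */
		i += need + 1;
	}
	return 1;
}
```

With the byte sequences used in t4210, the UTF-8 "é" ('\303\251')
passes this check, while the Latin-1 "é" ('\351') and the deliberately
broken '\303\50)' do not.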

Ævar Arnfjörð Bjarmason (10):
  log tests: test regex backends in "--encode=<enc>" tests
  grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>"
  t4210: skip more command-line encoding tests on MinGW
  grep: inline the return value of a function call used only once
  grep tests: move "grep binary" alongside the rest
  grep tests: move binary pattern tests into their own file
  grep: make the behavior for NUL-byte in patterns sane
  grep: drop support for \0 in --fixed-strings <pattern>
  grep: remove the kwset optimization
  grep: use PCRE v2 for optimized fixed-string search

 Documentation/git-grep.txt                    |  17 +++
 grep.c                                        | 115 +++++++---------
 grep.h                                        |   3 +-
 revision.c                                    |   3 +
 t/t4210-log-i18n.sh                           |  41 +++++-
 ...a1.sh => t7008-filter-branch-null-sha1.sh} |   0
 ...08-grep-binary.sh => t7815-grep-binary.sh} | 101 --------------
 t/t7816-grep-binary-pattern.sh                | 127 ++++++++++++++++++
 8 files changed, 234 insertions(+), 173 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (55%)
 create mode 100755 t/t7816-grep-binary-pattern.sh

Range-diff:
 1:  cfc01f49d3 =  1:  cfc01f49d3 log tests: test regex backends in "--encode=<enc>" tests
 2:  4b59eb32f0 =  2:  4b59eb32f0 grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>"
 -:  ---------- >  3:  676c76afe4 t4210: skip more command-line encoding tests on MinGW
 3:  cc4d3b50d5 =  4:  da9b491f70 grep: inline the return value of a function call used only once
 4:  d9b29bdd89 =  5:  c42d3268fa grep tests: move "grep binary" alongside the rest
 5:  f85614f435 =  6:  36b9c1c541 grep tests: move binary pattern tests into their own file
 6:  90afca8707 =  7:  3c54e782e6 grep: make the behavior for NUL-byte in patterns sane
 7:  526b925fdc =  8:  8e5f418189 grep: drop support for \0 in --fixed-strings <pattern>
 8:  14269bb295 =  9:  d1cb8319d5 grep: remove the kwset optimization
 9:  c0fd75d102 ! 10:  4de0c82314 grep: use PCRE v2 for optimized fixed-string search
    @@ -15,6 +15,15 @@
         makes the behavior harder to understand and document, and makes tests
         for the different backends more painful.
     
    +    This does change the behavior under non-C locales when "log"'s
    +    "--encoding" option is used and the haystack/needle in the
    +    content/command-line doesn't have a matching encoding. See the recent
    +    change in "t4210: skip more command-line encoding tests on MinGW" in
    +    this series. I think that's OK. We did nothing sensible before
    +    then (just compared raw bytes that had no hope of matching). At least
    +    now the user will get some idea why their grep/log never matches in
    +    that edge case.
    +
         I could also support the PCRE v1 backend here, but that would make the
         code more complex. I'd rather aim for simplicity here and in future
         changes to the diffcore. We're not going to have someone who
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v3 01/10] log tests: test regex backends in "--encode=<enc>" tests
  2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
  2019-06-28  7:23             ` Ævar Arnfjörð Bjarmason
  2019-07-01 21:20             ` [PATCH v3 00/10] " Ævar Arnfjörð Bjarmason
@ 2019-07-01 21:20             ` Ævar Arnfjörð Bjarmason
  2019-07-01 21:20             ` [PATCH v3 02/10] grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>" Ævar Arnfjörð Bjarmason
                               ` (8 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-01 21:20 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Improve the tests added in 04deccda11 ("log: re-encode commit messages
before grepping", 2013-02-11) to test the regex backends. Those tests
never worked as advertised, due to the is_fixed() optimization in
grep.c (which was in place at the time), and the needle in the tests
being a fixed string.

We'd thus always use the "fixed" backend during the tests, which would
use the kwset() backend. This backend liberally accepts any garbage
input, so invalid encodings would be silently accepted.

A follow-up commit will fix this bug; this test just demonstrates
the existing issue.

In practice this issue happened on Windows, see [1], but due to the
structure of the existing tests & how liberal the kwset code is about
garbage we missed this.

Cover this blind spot by testing all our regex engines. The PCRE
backend will spot these invalid encodings. This test may break the
"basic" and "extended" backends on systems whose POSIX regex
functions are stricter than glibc's about encoding and locale issues
(none that I can remember). PCRE, in any case, is more careful about
the validation.

1. https://public-inbox.org/git/nycvar.QRO.7.76.6.1906271113090.44@tvgsbejvaqbjf.bet/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
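A minimal sketch, not part of the patch, of the three needles the
test defines. It assumes only POSIX printf, od and tr; hex_of and
starts_with_valid_2byte_utf8 are hypothetical helpers made up for
illustration. It shows why utf8_e is well-formed UTF-8 while latin1_e
and invalid_e are not:

```shell
#!/bin/sh

# Hypothetical helper: print the bytes of $1 as lowercase hex.
hex_of () {
	printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# Hypothetical helper: succeed if $1 begins with a well-formed 2-byte
# UTF-8 sequence, i.e. a lead byte 0xC2..0xDF followed by a
# continuation byte 0x80..0xBF.
starts_with_valid_2byte_utf8 () {
	case "$(hex_of "$1")" in
	c[2-9a-f][89ab][0-9a-f]*|d[0-9a-f][89ab][0-9a-f]*)
		return 0 ;;
	*)
		return 1 ;;
	esac
}

# Same definitions as in t4210-log-i18n.sh after this patch.
utf8_e=$(printf '\303\251')
latin1_e=$(printf '\351')
invalid_e=$(printf '\303\50)')

for name in utf8_e latin1_e invalid_e
do
	eval "value=\$$name"
	if starts_with_valid_2byte_utf8 "$value"
	then
		verdict=valid
	else
		verdict=invalid
	fi
	printf '%s: %s (%s)\n' "$name" "$(hex_of "$value")" "$verdict"
done
```

The invalid needle is the lead byte 0xC3 followed by "(" (0x28), which
is not a continuation byte. A byte-wise matcher such as kwset has no
reason to object to it, whereas an engine running in UTF mode can.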
 t/t4210-log-i18n.sh | 41 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 7c519436ef..86d22c1d4c 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -1,12 +1,15 @@
 #!/bin/sh
 
 test_description='test log with i18n features'
-. ./test-lib.sh
+. ./lib-gettext.sh
 
 # two forms of é
 utf8_e=$(printf '\303\251')
 latin1_e=$(printf '\351')
 
+# invalid UTF-8
+invalid_e=$(printf '\303\50)') # ")" at end to close opening "("
+
 test_expect_success 'create commits in different encodings' '
 	test_tick &&
 	cat >msg <<-EOF &&
@@ -53,4 +56,40 @@ test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
 	test_must_be_empty actual
 '
 
+for engine in fixed basic extended perl
+do
+	prereq=
+	result=success
+	if test $engine = "perl"
+	then
+		result=failure
+		prereq="PCRE"
+	else
+		prereq=""
+	fi
+	force_regex=
+	if test $engine != "fixed"
+	then
+	    force_regex=.*
+	fi
+	test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
+		cat >expect <<-\EOF &&
+		latin1
+		utf8
+		EOF
+		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$latin1_e\" >actual &&
+		test_cmp expect actual
+	"
+
+	test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
+		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$utf8_e\" >actual &&
+		test_must_be_empty actual
+	"
+
+	test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" "
+		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual &&
+		test_must_be_empty actual
+	"
+done
+
 test_done
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v3 02/10] grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>"
  2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
                               ` (2 preceding siblings ...)
  2019-07-01 21:20             ` [PATCH v3 01/10] log tests: test regex backends in "--encode=<enc>" tests Ævar Arnfjörð Bjarmason
@ 2019-07-01 21:20             ` Ævar Arnfjörð Bjarmason
  2019-07-01 21:20             ` [PATCH v3 03/10] t4210: skip more command-line encoding tests on MinGW Ævar Arnfjörð Bjarmason
                               ` (7 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-01 21:20 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Fix a bug introduced in 18547aacf5 ("grep/pcre: support utf-8",
2016-06-25) that was missed due to a blind spot in our tests, as
discussed in the previous commit. I then blindly copied the same bug
in 94da9193a6 ("grep: add support for PCRE v2", 2017-06-01) when
adding the PCRE v2 code.

We should not tell PCRE that we're processing UTF-8 just because we're
dealing with non-ASCII. In the case of e.g. "log --encoding=<...>"
under is_utf8_locale() the haystack might be in ISO-8859-1, and the
needle might be in a non-UTF-8 encoding.

Maybe we should be more strict here and die earlier? Should we also be
converting the needle to the encoding in question, and failing if it's
not a string that's valid in that encoding? Maybe.

But for now matching this as non-UTF8 at least has some hope of
producing sensible results, since we know that our default heuristic
of assuming the text to be matched is in the user locale encoding
isn't true when we've explicitly encoded it to be in a different
encoding.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
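A rough shell model, not part of the patch, of the condition this
change introduces. The function names mirror the C helpers in the diff
below, but these are simplified stand-ins written for illustration;
pcre_utf_flag_wanted is a hypothetical name. It assumes that
is_encoding_utf8() treats an unset encoding and the spellings "UTF-8"
and "utf8" as UTF-8:

```shell
#!/bin/sh

# Stand-in for is_encoding_utf8(): an unset encoding counts as UTF-8.
is_encoding_utf8 () {
	case "$1" in
	''|[Uu][Tt][Ff]-8|[Uu][Tt][Ff]8) return 0 ;;
	*) return 1 ;;
	esac
}

# Stand-in for has_non_ascii(): succeed if any byte of $1 is >= 0x80.
# od prints every byte as " xx", so a token starting with 8..f is
# enough to decide.
has_non_ascii () {
	case "$(printf '%s' "$1" | od -An -tx1)" in
	*" "[89a-f]*) return 0 ;;
	*) return 1 ;;
	esac
}

# Succeed if the PCRE UTF flag would be requested after this patch.
# $1: log output encoding, $2: 1 if the locale is UTF-8, $3: pattern
pcre_utf_flag_wanted () {
	ignore_locale=0
	is_encoding_utf8 "$1" || ignore_locale=1
	test "$ignore_locale" = 0 && test "$2" = 1 && has_non_ascii "$3"
}

needle=$(printf '\303\251')
for encoding in UTF-8 ISO-8859-1
do
	if pcre_utf_flag_wanted "$encoding" 1 "$needle"
	then
		echo "--encoding=$encoding: UTF mode"
	else
		echo "--encoding=$encoding: byte mode"
	fi
done
```

Before this patch the first argument played no role, so a non-ASCII
needle under a UTF-8 locale always selected UTF mode, even when the
haystack had been re-encoded to something else.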
 grep.c              | 8 ++++----
 grep.h              | 1 +
 revision.c          | 3 +++
 t/t4210-log-i18n.sh | 6 ++----
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/grep.c b/grep.c
index f7c3a5803e..1de4ab49c0 100644
--- a/grep.c
+++ b/grep.c
@@ -388,11 +388,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 	int options = PCRE_MULTILINE;
 
 	if (opt->ignore_case) {
-		if (has_non_ascii(p->pattern))
+		if (!opt->ignore_locale && has_non_ascii(p->pattern))
 			p->pcre1_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
 	}
-	if (is_utf8_locale() && has_non_ascii(p->pattern))
+	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern))
 		options |= PCRE_UTF8;
 
 	p->pcre1_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
@@ -498,14 +498,14 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	p->pcre2_compile_context = NULL;
 
 	if (opt->ignore_case) {
-		if (has_non_ascii(p->pattern)) {
+		if (!opt->ignore_locale && has_non_ascii(p->pattern)) {
 			character_tables = pcre2_maketables(NULL);
 			p->pcre2_compile_context = pcre2_compile_context_create(NULL);
 			pcre2_set_character_tables(p->pcre2_compile_context, character_tables);
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (is_utf8_locale() && has_non_ascii(p->pattern))
+	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern))
 		options |= PCRE2_UTF;
 
 	p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
diff --git a/grep.h b/grep.h
index 1875880f37..4bb8a79d93 100644
--- a/grep.h
+++ b/grep.h
@@ -173,6 +173,7 @@ struct grep_opt {
 	int funcbody;
 	int extended_regexp_option;
 	int pattern_type_option;
+	int ignore_locale;
 	char colors[NR_GREP_COLORS][COLOR_MAXLEN];
 	unsigned pre_context;
 	unsigned post_context;
diff --git a/revision.c b/revision.c
index 621feb9df7..a842fb158a 100644
--- a/revision.c
+++ b/revision.c
@@ -28,6 +28,7 @@
 #include "commit-graph.h"
 #include "prio-queue.h"
 #include "hashmap.h"
+#include "utf8.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2655,6 +2656,8 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 
 	grep_commit_pattern_type(GREP_PATTERN_TYPE_UNSPECIFIED,
 				 &revs->grep_filter);
+	if (!is_encoding_utf8(get_log_output_encoding()))
+		revs->grep_filter.ignore_locale = 1;
 	compile_grep_patterns(&revs->grep_filter);
 
 	if (revs->reverse && revs->reflog_info)
diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 86d22c1d4c..515bcb7ce1 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -59,10 +59,8 @@ test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
 for engine in fixed basic extended perl
 do
 	prereq=
-	result=success
 	if test $engine = "perl"
 	then
-		result=failure
 		prereq="PCRE"
 	else
 		prereq=""
@@ -72,7 +70,7 @@ do
 	then
 	    force_regex=.*
 	fi
-	test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
+	test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
 		cat >expect <<-\EOF &&
 		latin1
 		utf8
@@ -86,7 +84,7 @@ do
 		test_must_be_empty actual
 	"
 
-	test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" "
+	test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" "
 		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual &&
 		test_must_be_empty actual
 	"
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v3 03/10] t4210: skip more command-line encoding tests on MinGW
  2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
                               ` (3 preceding siblings ...)
  2019-07-01 21:20             ` [PATCH v3 02/10] grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>" Ævar Arnfjörð Bjarmason
@ 2019-07-01 21:20             ` Ævar Arnfjörð Bjarmason
  2019-07-01 21:20             ` [PATCH v3 04/10] grep: inline the return value of a function call used only once Ævar Arnfjörð Bjarmason
                               ` (6 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-01 21:20 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

In 5212f91deb ("t4210: skip command-line encoding tests on mingw",
2014-07-17) the positive tests in this file were skipped. That left
the negative tests that don't produce a match.

An upcoming change to migrate the "fixed" backend of grep to PCRE v2
will cause these "log" commands to produce an error instead on
MinGW. This is because the command-line on that platform implicitly
has its encoding changed before being passed to git. See [1].

1. https://public-inbox.org/git/nycvar.QRO.7.76.6.1907011515150.44@tvgsbejvaqbjf.bet/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/t4210-log-i18n.sh | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 515bcb7ce1..6e61f57f09 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -51,7 +51,7 @@ test_expect_success !MINGW 'log --grep does not find non-reencoded values (utf8)
 	test_must_be_empty actual
 '
 
-test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
+test_expect_success !MINGW 'log --grep does not find non-reencoded values (latin1)' '
 	git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual &&
 	test_must_be_empty actual
 '
@@ -70,7 +70,7 @@ do
 	then
 	    force_regex=.*
 	fi
-	test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
+	test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
 		cat >expect <<-\EOF &&
 		latin1
 		utf8
@@ -79,12 +79,12 @@ do
 		test_cmp expect actual
 	"
 
-	test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
+	test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
 		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$utf8_e\" >actual &&
 		test_must_be_empty actual
 	"
 
-	test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" "
+	test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" "
 		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual &&
 		test_must_be_empty actual
 	"
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v3 04/10] grep: inline the return value of a function call used only once
  2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
                               ` (4 preceding siblings ...)
  2019-07-01 21:20             ` [PATCH v3 03/10] t4210: skip more command-line encoding tests on MinGW Ævar Arnfjörð Bjarmason
@ 2019-07-01 21:20             ` Ævar Arnfjörð Bjarmason
  2019-07-01 21:20             ` [PATCH v3 05/10] grep tests: move "grep binary" alongside the rest Ævar Arnfjörð Bjarmason
                               ` (5 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-01 21:20 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Since e944d9d932 ("grep: rewrite an if/else condition to avoid
duplicate expression", 2016-06-25) the "ascii_only" variable has only
been used once in compile_regexp(). Let's just inline it there.

This makes the code easier to read, and might make it marginally
faster depending on compiler optimizations.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/grep.c b/grep.c
index 1de4ab49c0..4e8d0645a8 100644
--- a/grep.c
+++ b/grep.c
@@ -650,13 +650,11 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
-	int ascii_only;
 	int err;
 	int regflags = REG_NEWLINE;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
-	ascii_only     = !has_non_ascii(p->pattern);
 
 	/*
 	 * Even when -F (fixed) asks us to do a non-regexp search, we
@@ -673,7 +671,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (opt->fixed ||
 	    has_null(p->pattern, p->patternlen) ||
 	    is_fixed(p->pattern, p->patternlen))
-		p->fixed = !p->ignore_case || ascii_only;
+		p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
 
 	if (p->fixed) {
 		p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL);
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v3 05/10] grep tests: move "grep binary" alongside the rest
  2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
                               ` (5 preceding siblings ...)
  2019-07-01 21:20             ` [PATCH v3 04/10] grep: inline the return value of a function call used only once Ævar Arnfjörð Bjarmason
@ 2019-07-01 21:20             ` Ævar Arnfjörð Bjarmason
  2019-07-01 21:20             ` [PATCH v3 06/10] grep tests: move binary pattern tests into their own file Ævar Arnfjörð Bjarmason
                               ` (4 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-01 21:20 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Move the "grep binary" test case added in aca20dd558 ("grep: add test
script for binary file handling", 2010-05-22) so that it lives
alongside the rest of the "grep" tests in t781*. This would have left
a gap in the t/700* namespace, so move a "filter-branch" test down,
leaving the "t7010-setup.sh" test as the next one after that.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 ...ilter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} | 0
 t/{t7008-grep-binary.sh => t7815-grep-binary.sh}                  | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (100%)

diff --git a/t/t7009-filter-branch-null-sha1.sh b/t/t7008-filter-branch-null-sha1.sh
similarity index 100%
rename from t/t7009-filter-branch-null-sha1.sh
rename to t/t7008-filter-branch-null-sha1.sh
diff --git a/t/t7008-grep-binary.sh b/t/t7815-grep-binary.sh
similarity index 100%
rename from t/t7008-grep-binary.sh
rename to t/t7815-grep-binary.sh
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v3 06/10] grep tests: move binary pattern tests into their own file
  2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
                               ` (6 preceding siblings ...)
  2019-07-01 21:20             ` [PATCH v3 05/10] grep tests: move "grep binary" alongside the rest Ævar Arnfjörð Bjarmason
@ 2019-07-01 21:20             ` Ævar Arnfjörð Bjarmason
  2019-07-01 21:20             ` [PATCH v3 07/10] grep: make the behavior for NUL-byte in patterns sane Ævar Arnfjörð Bjarmason
                               ` (3 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-01 21:20 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Move the tests for "-f <file>" where "<file>" contains a NUL byte
pattern into their own file. I added most of these tests in
966be95549 ("grep: add tests to fix blind spots with \0 patterns",
2017-05-20).

Whether a regex engine supports matching binary content is very
different from whether it matches binary patterns. Since
2f8952250a ("regex: add regexec_buf() that can work on a non
NUL-terminated string", 2016-09-21) we've required REG_STARTEND of our
regex engines so we can match binary content, but only the PCRE v2
engine can sensibly match binary patterns.

Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting
on patterns containing a NUL-byte and considering them fixed, except in
cases where "--ignore-case" is provided and they're non-ASCII, see
5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings",
2016-06-25). Subsequent commits will change this behavior.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
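For readers unfamiliar with the "Q" convention used by nul_match, a
small sketch, not part of the patch. It assumes the test suite's
q_to_nul helper behaves like tr 'Q' '\000'; the q_to_nul defined here
is a local stand-in and pattern_file_hex is a hypothetical helper:

```shell
#!/bin/sh

# Stand-in for the test suite's helper (assumed equivalent): turn
# every "Q" on stdin into a NUL byte.
q_to_nul () {
	tr 'Q' '\000'
}

# Hypothetical helper: hex dump of $1 after the Q-to-NUL translation,
# i.e. the bytes that end up in the pattern file.
pattern_file_hex () {
	printf '%s' "$1" | q_to_nul | od -An -tx1 | tr -d ' \n'
}

pattern_file_hex 'yQf'		# prints 790066: "y", NUL, "f"
echo
pattern_file_hex 'eQm[*]c'	# prints 65006d5b2a5d63
echo
```

The pattern goes through a file because that is how these tests hand
a NUL byte to "git grep -f".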
 t/t7815-grep-binary.sh         | 101 -----------------------------
 t/t7816-grep-binary-pattern.sh | 114 +++++++++++++++++++++++++++++++++
 2 files changed, 114 insertions(+), 101 deletions(-)
 create mode 100755 t/t7816-grep-binary-pattern.sh

diff --git a/t/t7815-grep-binary.sh b/t/t7815-grep-binary.sh
index 2d87c49b75..90ebb64f46 100755
--- a/t/t7815-grep-binary.sh
+++ b/t/t7815-grep-binary.sh
@@ -4,41 +4,6 @@ test_description='git grep in binary files'
 
 . ./test-lib.sh
 
-nul_match () {
-	matches=$1
-	flags=$2
-	pattern=$3
-	pattern_human=$(echo "$pattern" | sed 's/Q/<NUL>/g')
-
-	if test "$matches" = 1
-	then
-		test_expect_success "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			git grep -f f $flags a
-		"
-	elif test "$matches" = 0
-	then
-		test_expect_success "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			test_must_fail git grep -f f $flags a
-		"
-	elif test "$matches" = T1
-	then
-		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			git grep -f f $flags a
-		"
-	elif test "$matches" = T0
-	then
-		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			test_must_fail git grep -f f $flags a
-		"
-	else
-		test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false'
-	fi
-}
-
 test_expect_success 'setup' "
 	echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a &&
 	git add a &&
@@ -102,72 +67,6 @@ test_expect_failure 'git grep .fi a' '
 	git grep .fi a
 '
 
-nul_match 1 '-F' 'yQf'
-nul_match 0 '-F' 'yQx'
-nul_match 1 '-Fi' 'YQf'
-nul_match 0 '-Fi' 'YQx'
-nul_match 1 '' 'yQf'
-nul_match 0 '' 'yQx'
-nul_match 1 '' 'æQð'
-nul_match 1 '-F' 'eQm[*]c'
-nul_match 1 '-Fi' 'EQM[*]C'
-
-# Regex patterns that would match but shouldn't with -F
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-F' '[y]Qf'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '-Fi' '[Y]QF'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-F' '[æ]Qð'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '-Fi' '[Æ]QÐ'
-
-# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0
-# patterns case-insensitively.
-nul_match T1 '-i' 'ÆQÐ'
-
-# \0 implicitly disables regexes. This is an undocumented internal
-# limitation.
-nul_match T1 '' 'yQ[f]'
-nul_match T1 '' '[y]Qf'
-nul_match T1 '-i' 'YQ[F]'
-nul_match T1 '-i' '[Y]Qf'
-nul_match T1 '' 'æQ[ð]'
-nul_match T1 '' '[æ]Qð'
-nul_match T1 '-i' 'ÆQ[Ð]'
-
-# ... because of \0 implicitly disabling regexes regexes that
-# should/shouldn't match don't do the right thing.
-nul_match T1 '' 'eQm.*cQ'
-nul_match T1 '-i' 'EQM.*cQ'
-nul_match T0 '' 'eQm[*]c'
-nul_match T0 '-i' 'EQM[*]C'
-
-# Due to the REG_STARTEND extension when kwset() is disabled on -i &
-# non-ASCII the string will be matched in its entirety, but the
-# pattern will be cut off at the first \0.
-nul_match 0 '-i' 'NOMATCHQð'
-nul_match T0 '-i' '[Æ]QNOMATCH'
-nul_match T0 '-i' '[æ]QNOMATCH'
-# Matches, but for the wrong reasons, just stops at [æ]
-nul_match 1 '-i' '[Æ]Qð'
-nul_match 1 '-i' '[æ]Qð'
-
-# Ensure that the matcher doesn't regress to something that stops at
-# \0
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '' 'yQNOMATCH'
-nul_match 0 '' 'QNOMATCH'
-nul_match 0 '-i' 'YQNOMATCH'
-nul_match 0 '-i' 'QNOMATCH'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '' 'yQNÓMATCH'
-nul_match 0 '' 'QNÓMATCH'
-nul_match 0 '-i' 'YQNÓMATCH'
-nul_match 0 '-i' 'QNÓMATCH'
-
 test_expect_success 'grep respects binary diff attribute' '
 	echo text >t &&
 	git add t &&
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
new file mode 100755
index 0000000000..4060dbd679
--- /dev/null
+++ b/t/t7816-grep-binary-pattern.sh
@@ -0,0 +1,114 @@
+#!/bin/sh
+
+test_description='git grep with a binary pattern files'
+
+. ./test-lib.sh
+
+nul_match () {
+	matches=$1
+	flags=$2
+	pattern=$3
+	pattern_human=$(echo "$pattern" | sed 's/Q/<NUL>/g')
+
+	if test "$matches" = 1
+	then
+		test_expect_success "git grep -f f $flags '$pattern_human' a" "
+			printf '$pattern' | q_to_nul >f &&
+			git grep -f f $flags a
+		"
+	elif test "$matches" = 0
+	then
+		test_expect_success "git grep -f f $flags '$pattern_human' a" "
+			printf '$pattern' | q_to_nul >f &&
+			test_must_fail git grep -f f $flags a
+		"
+	elif test "$matches" = T1
+	then
+		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
+			printf '$pattern' | q_to_nul >f &&
+			git grep -f f $flags a
+		"
+	elif test "$matches" = T0
+	then
+		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
+			printf '$pattern' | q_to_nul >f &&
+			test_must_fail git grep -f f $flags a
+		"
+	else
+		test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false'
+	fi
+}
+
+test_expect_success 'setup' "
+	echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a &&
+	git add a &&
+	git commit -m.
+"
+
+nul_match 1 '-F' 'yQf'
+nul_match 0 '-F' 'yQx'
+nul_match 1 '-Fi' 'YQf'
+nul_match 0 '-Fi' 'YQx'
+nul_match 1 '' 'yQf'
+nul_match 0 '' 'yQx'
+nul_match 1 '' 'æQð'
+nul_match 1 '-F' 'eQm[*]c'
+nul_match 1 '-Fi' 'EQM[*]C'
+
+# Regex patterns that would match but shouldn't with -F
+nul_match 0 '-F' 'yQ[f]'
+nul_match 0 '-F' '[y]Qf'
+nul_match 0 '-Fi' 'YQ[F]'
+nul_match 0 '-Fi' '[Y]QF'
+nul_match 0 '-F' 'æQ[ð]'
+nul_match 0 '-F' '[æ]Qð'
+nul_match 0 '-Fi' 'ÆQ[Ð]'
+nul_match 0 '-Fi' '[Æ]QÐ'
+
+# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0
+# patterns case-insensitively.
+nul_match T1 '-i' 'ÆQÐ'
+
+# \0 implicitly disables regexes. This is an undocumented internal
+# limitation.
+nul_match T1 '' 'yQ[f]'
+nul_match T1 '' '[y]Qf'
+nul_match T1 '-i' 'YQ[F]'
+nul_match T1 '-i' '[Y]Qf'
+nul_match T1 '' 'æQ[ð]'
+nul_match T1 '' '[æ]Qð'
+nul_match T1 '-i' 'ÆQ[Ð]'
+
+# ... because of \0 implicitly disabling regexes regexes that
+# should/shouldn't match don't do the right thing.
+nul_match T1 '' 'eQm.*cQ'
+nul_match T1 '-i' 'EQM.*cQ'
+nul_match T0 '' 'eQm[*]c'
+nul_match T0 '-i' 'EQM[*]C'
+
+# Due to the REG_STARTEND extension when kwset() is disabled on -i &
+# non-ASCII the string will be matched in its entirety, but the
+# pattern will be cut off at the first \0.
+nul_match 0 '-i' 'NOMATCHQð'
+nul_match T0 '-i' '[Æ]QNOMATCH'
+nul_match T0 '-i' '[æ]QNOMATCH'
+# Matches, but for the wrong reasons, just stops at [æ]
+nul_match 1 '-i' '[Æ]Qð'
+nul_match 1 '-i' '[æ]Qð'
+
+# Ensure that the matcher doesn't regress to something that stops at
+# \0
+nul_match 0 '-F' 'yQ[f]'
+nul_match 0 '-Fi' 'YQ[F]'
+nul_match 0 '' 'yQNOMATCH'
+nul_match 0 '' 'QNOMATCH'
+nul_match 0 '-i' 'YQNOMATCH'
+nul_match 0 '-i' 'QNOMATCH'
+nul_match 0 '-F' 'æQ[ð]'
+nul_match 0 '-Fi' 'ÆQ[Ð]'
+nul_match 0 '' 'yQNÓMATCH'
+nul_match 0 '' 'QNÓMATCH'
+nul_match 0 '-i' 'YQNÓMATCH'
+nul_match 0 '-i' 'QNÓMATCH'
+
+test_done
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v3 07/10] grep: make the behavior for NUL-byte in patterns sane
  2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
                               ` (7 preceding siblings ...)
  2019-07-01 21:20             ` [PATCH v3 06/10] grep tests: move binary pattern tests into their own file Ævar Arnfjörð Bjarmason
@ 2019-07-01 21:20             ` Ævar Arnfjörð Bjarmason
  2019-07-01 21:20             ` [PATCH v3 08/10] grep: drop support for \0 in --fixed-strings <pattern> Ævar Arnfjörð Bjarmason
                               ` (2 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-01 21:20 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

The behavior of "grep" when patterns contained a NUL-byte has always
been haphazard, and has served the vagaries of the implementation more
than anything else. A pattern containing a NUL-byte can only be
provided via "-f <file>". Since pickaxe (log search) has no such flag
the NUL-byte in patterns has only ever been supported by "grep" (and
not "log --grep").

Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) patterns containing
"\0" were considered fixed. In 966be95549 ("grep: add tests to fix
blind spots with \0 patterns", 2017-05-20) I added tests for this
behavior.

Change the behavior to do the obvious thing, i.e. don't silently
discard a regex pattern and make it implicitly fixed just because it
contains a NUL-byte. Instead die if the backend in question can't
handle it, e.g. --basic-regexp is combined with such a pattern.

This is desired because from a user's point of view it's the obvious
thing to do. Whether we support BRE/ERE/Perl syntax is different from
whether our implementation is limited by C-strings. These patterns are
obscure enough that I think this behavior change is OK, especially
since we never documented the old behavior.

Doing this also makes it easier to replace the kwset backend with
something else, since we'll no longer strictly need it for anything we
can't easily use another fixed-string backend for.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
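A simplified shell model, not part of the patch, of the decision
compile_regexp() makes after this change, as read from the grep.c hunk
below. nul_pattern_path and its 0/1 flag arguments are hypothetical
names chosen for illustration:

```shell
#!/bin/sh

# Arguments are 0/1 flags:
#   $1 fixed      -F was given, or the pattern has no regex specials
#   $2 icase      --ignore-case
#   $3 non_ascii  the pattern contains non-ASCII bytes
#   $4 has_nul    the pattern contains a NUL byte
#   $5 pcre2      -P with a PCRE v2 build
# Prints "kwset", "die" or "regex" ("regex" meaning: carry on to the
# non-kwset compile path).
nul_pattern_path () {
	use_kwset=0
	if test "$1" = 1 && { test "$2" = 0 || test "$3" = 0; }
	then
		use_kwset=1
	fi

	if test "$use_kwset" = 1
	then
		echo kwset
	elif test "$4" = 1 && test "$5" = 0
	then
		echo die
	else
		echo regex
	fi
}

nul_pattern_path 1 0 0 1 0	# -F 'yQf': kwset
nul_pattern_path 0 0 0 1 0	# 'yQ[f]', default pattern type: die
nul_pattern_path 0 0 0 1 1	# -P 'yQ[f]' under PCRE v2: regex
```

In this model a NUL byte no longer turns a regex into a fixed string.
It only matters once kwset has been ruled out, and then only PCRE v2
is allowed to proceed.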
 Documentation/git-grep.txt     |  17 ++++
 grep.c                         |  23 ++---
 t/t7816-grep-binary-pattern.sh | 159 ++++++++++++++++++---------------
 3 files changed, 110 insertions(+), 89 deletions(-)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 2d27969057..c89fb569e3 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -271,6 +271,23 @@ providing this option will cause it to die.
 
 -f <file>::
 	Read patterns from <file>, one per line.
++
+Passing the pattern via <file> allows for providing a search pattern
+containing a \0.
++
+Not all pattern types support patterns containing \0. Git will error
+out if a given pattern type can't support such a pattern. The
+`--perl-regexp` pattern type when compiled against the PCRE v2 backend
+has the widest support for these types of patterns.
++
+In versions of Git before 2.23.0 patterns containing \0 would be
+silently considered fixed. This was never documented, there were also
+odd and undocumented interactions between e.g. non-ASCII patterns
+containing \0 and `--ignore-case`.
++
+In future versions we may learn to support patterns containing \0 for
+more search backends, until then we'll die when the pattern type in
+question doesn't support them.
 
 -e::
 	The next parameter is the pattern. This option has to be
diff --git a/grep.c b/grep.c
index 4e8d0645a8..d6603bc950 100644
--- a/grep.c
+++ b/grep.c
@@ -368,18 +368,6 @@ static int is_fixed(const char *s, size_t len)
 	return 1;
 }
 
-static int has_null(const char *s, size_t len)
-{
-	/*
-	 * regcomp cannot accept patterns with NULs so when using it
-	 * we consider any pattern containing a NUL fixed.
-	 */
-	if (memchr(s, 0, len))
-		return 1;
-
-	return 0;
-}
-
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 {
@@ -668,9 +656,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	 * simple string match using kws.  p->fixed tells us if we
 	 * want to use kws.
 	 */
-	if (opt->fixed ||
-	    has_null(p->pattern, p->patternlen) ||
-	    is_fixed(p->pattern, p->patternlen))
+	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
 		p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
 
 	if (p->fixed) {
@@ -678,7 +664,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		kwsincr(p->kws, p->pattern, p->patternlen);
 		kwsprep(p->kws);
 		return;
-	} else if (opt->fixed) {
+	}
+
+	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
+		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
+
+	if (opt->fixed) {
 		/*
 		 * We come here when the pattern has the non-ascii
 		 * characters we cannot case-fold, and asked to
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
index 4060dbd679..9e09bd5d6a 100755
--- a/t/t7816-grep-binary-pattern.sh
+++ b/t/t7816-grep-binary-pattern.sh
@@ -2,113 +2,126 @@
 
 test_description='git grep with a binary pattern files'
 
-. ./test-lib.sh
+. ./lib-gettext.sh
 
-nul_match () {
+nul_match_internal () {
 	matches=$1
-	flags=$2
-	pattern=$3
+	prereqs=$2
+	lc_all=$3
+	extra_flags=$4
+	flags=$5
+	pattern=$6
 	pattern_human=$(echo "$pattern" | sed 's/Q/<NUL>/g')
 
 	if test "$matches" = 1
 	then
-		test_expect_success "git grep -f f $flags '$pattern_human' a" "
+		test_expect_success $prereqs "LC_ALL='$lc_all' git grep $extra_flags -f f $flags '$pattern_human' a" "
 			printf '$pattern' | q_to_nul >f &&
-			git grep -f f $flags a
+			LC_ALL='$lc_all' git grep $extra_flags -f f $flags a
 		"
 	elif test "$matches" = 0
 	then
-		test_expect_success "git grep -f f $flags '$pattern_human' a" "
+		test_expect_success $prereqs "LC_ALL='$lc_all' git grep $extra_flags -f f $flags '$pattern_human' a" "
+			>stderr &&
 			printf '$pattern' | q_to_nul >f &&
-			test_must_fail git grep -f f $flags a
+			test_must_fail env LC_ALL=\"$lc_all\" git grep $extra_flags -f f $flags a 2>stderr &&
+			test_i18ngrep ! 'This is only supported with -P under PCRE v2' stderr
 		"
-	elif test "$matches" = T1
+	elif test "$matches" = P
 	then
-		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
+		test_expect_success $prereqs "error, PCRE v2 only: LC_ALL='$lc_all' git grep -f f $flags '$pattern_human' a" "
+			>stderr &&
 			printf '$pattern' | q_to_nul >f &&
-			git grep -f f $flags a
-		"
-	elif test "$matches" = T0
-	then
-		test_expect_failure "git grep -f f $flags '$pattern_human' a" "
-			printf '$pattern' | q_to_nul >f &&
-			test_must_fail git grep -f f $flags a
+			test_must_fail env LC_ALL=\"$lc_all\" git grep -f f $flags a 2>stderr &&
+			test_i18ngrep 'This is only supported with -P under PCRE v2' stderr
 		"
 	else
 		test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false'
 	fi
 }
 
+nul_match () {
+	matches=$1
+	matches_pcre2=$2
+	matches_pcre2_locale=$3
+	flags=$4
+	pattern=$5
+	pattern_human=$(echo "$pattern" | sed 's/Q/<NUL>/g')
+
+	nul_match_internal "$matches" "" "C" "" "$flags" "$pattern"
+	nul_match_internal "$matches_pcre2" "LIBPCRE2" "C" "-P" "$flags" "$pattern"
+	nul_match_internal "$matches_pcre2_locale" "LIBPCRE2,GETTEXT_LOCALE" "$is_IS_locale" "-P" "$flags" "$pattern"
+}
+
 test_expect_success 'setup' "
 	echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a &&
 	git add a &&
 	git commit -m.
 "
 
-nul_match 1 '-F' 'yQf'
-nul_match 0 '-F' 'yQx'
-nul_match 1 '-Fi' 'YQf'
-nul_match 0 '-Fi' 'YQx'
-nul_match 1 '' 'yQf'
-nul_match 0 '' 'yQx'
-nul_match 1 '' 'æQð'
-nul_match 1 '-F' 'eQm[*]c'
-nul_match 1 '-Fi' 'EQM[*]C'
+# Simple fixed-string matching that can use kwset (no -i && non-ASCII)
+nul_match 1 1 1 '-F' 'yQf'
+nul_match 0 0 0 '-F' 'yQx'
+nul_match 1 1 1 '-Fi' 'YQf'
+nul_match 0 0 0 '-Fi' 'YQx'
+nul_match 1 1 1 '' 'yQf'
+nul_match 0 0 0 '' 'yQx'
+nul_match 1 1 1 '' 'æQð'
+nul_match 1 1 1 '-F' 'eQm[*]c'
+nul_match 1 1 1 '-Fi' 'EQM[*]C'
 
 # Regex patterns that would match but shouldn't with -F
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-F' '[y]Qf'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '-Fi' '[Y]QF'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-F' '[æ]Qð'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '-Fi' '[Æ]QÐ'
+nul_match 0 0 0 '-F' 'yQ[f]'
+nul_match 0 0 0 '-F' '[y]Qf'
+nul_match 0 0 0 '-Fi' 'YQ[F]'
+nul_match 0 0 0 '-Fi' '[Y]QF'
+nul_match 0 0 0 '-F' 'æQ[ð]'
+nul_match 0 0 0 '-F' '[æ]Qð'
 
-# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0
-# patterns case-insensitively.
-nul_match T1 '-i' 'ÆQÐ'
+# The -F kwset codepath can't handle -i && non-ASCII...
+nul_match P 1 1 '-i' '[æ]Qð'
 
-# \0 implicitly disables regexes. This is an undocumented internal
-# limitation.
-nul_match T1 '' 'yQ[f]'
-nul_match T1 '' '[y]Qf'
-nul_match T1 '-i' 'YQ[F]'
-nul_match T1 '-i' '[Y]Qf'
-nul_match T1 '' 'æQ[ð]'
-nul_match T1 '' '[æ]Qð'
-nul_match T1 '-i' 'ÆQ[Ð]'
+# ...PCRE v2 only matches non-ASCII with -i casefolding under UTF-8
+# semantics
+nul_match P P P '-Fi' 'ÆQ[Ð]'
+nul_match P 0 1 '-i'  'ÆQ[Ð]'
+nul_match P 0 1 '-i'  '[Æ]QÐ'
+nul_match P 0 1 '-i' '[Æ]Qð'
+nul_match P 0 1 '-i' 'ÆQÐ'
 
-# ... because of \0 implicitly disabling regexes regexes that
-# should/shouldn't match don't do the right thing.
-nul_match T1 '' 'eQm.*cQ'
-nul_match T1 '-i' 'EQM.*cQ'
-nul_match T0 '' 'eQm[*]c'
-nul_match T0 '-i' 'EQM[*]C'
+# \0 in regexes can only work with -P & PCRE v2
+nul_match P 1 1 '' 'yQ[f]'
+nul_match P 1 1 '' '[y]Qf'
+nul_match P 1 1 '-i' 'YQ[F]'
+nul_match P 1 1 '-i' '[Y]Qf'
+nul_match P 1 1 '' 'æQ[ð]'
+nul_match P 1 1 '' '[æ]Qð'
+nul_match P 0 1 '-i' 'ÆQ[Ð]'
+nul_match P 1 1 '' 'eQm.*cQ'
+nul_match P 1 1 '-i' 'EQM.*cQ'
+nul_match P 0 0 '' 'eQm[*]c'
+nul_match P 0 0 '-i' 'EQM[*]C'
 
-# Due to the REG_STARTEND extension when kwset() is disabled on -i &
-# non-ASCII the string will be matched in its entirety, but the
-# pattern will be cut off at the first \0.
-nul_match 0 '-i' 'NOMATCHQð'
-nul_match T0 '-i' '[Æ]QNOMATCH'
-nul_match T0 '-i' '[æ]QNOMATCH'
-# Matches, but for the wrong reasons, just stops at [æ]
-nul_match 1 '-i' '[Æ]Qð'
-nul_match 1 '-i' '[æ]Qð'
+# Assert that we're using REG_STARTEND and the pattern doesn't match
+# just because it's cut off at the first \0.
+nul_match 0 0 0 '-i' 'NOMATCHQð'
+nul_match P 0 0 '-i' '[Æ]QNOMATCH'
+nul_match P 0 0 '-i' '[æ]QNOMATCH'
 
 # Ensure that the matcher doesn't regress to something that stops at
 # \0
-nul_match 0 '-F' 'yQ[f]'
-nul_match 0 '-Fi' 'YQ[F]'
-nul_match 0 '' 'yQNOMATCH'
-nul_match 0 '' 'QNOMATCH'
-nul_match 0 '-i' 'YQNOMATCH'
-nul_match 0 '-i' 'QNOMATCH'
-nul_match 0 '-F' 'æQ[ð]'
-nul_match 0 '-Fi' 'ÆQ[Ð]'
-nul_match 0 '' 'yQNÓMATCH'
-nul_match 0 '' 'QNÓMATCH'
-nul_match 0 '-i' 'YQNÓMATCH'
-nul_match 0 '-i' 'QNÓMATCH'
+nul_match 0 0 0 '-F' 'yQ[f]'
+nul_match 0 0 0 '-Fi' 'YQ[F]'
+nul_match 0 0 0 '' 'yQNOMATCH'
+nul_match 0 0 0 '' 'QNOMATCH'
+nul_match 0 0 0 '-i' 'YQNOMATCH'
+nul_match 0 0 0 '-i' 'QNOMATCH'
+nul_match 0 0 0 '-F' 'æQ[ð]'
+nul_match P P P '-Fi' 'ÆQ[Ð]'
+nul_match P 0 1 '-i' 'ÆQ[Ð]'
+nul_match 0 0 0 '' 'yQNÓMATCH'
+nul_match 0 0 0 '' 'QNÓMATCH'
+nul_match 0 0 0 '-i' 'YQNÓMATCH'
+nul_match 0 0 0 '-i' 'QNÓMATCH'
 
 test_done
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v3 08/10] grep: drop support for \0 in --fixed-strings <pattern>
  2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
                               ` (8 preceding siblings ...)
  2019-07-01 21:20             ` [PATCH v3 07/10] grep: make the behavior for NUL-byte in patterns sane Ævar Arnfjörð Bjarmason
@ 2019-07-01 21:20             ` Ævar Arnfjörð Bjarmason
  2019-07-01 21:20             ` [PATCH v3 09/10] grep: remove the kwset optimization Ævar Arnfjörð Bjarmason
  2019-07-01 21:21             ` [PATCH v3 10/10] grep: use PCRE v2 for optimized fixed-string search Ævar Arnfjörð Bjarmason
  11 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-01 21:20 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Change "-f <file>" to not support patterns with a NUL-byte in them
under --fixed-strings. We'll now only support these under
"--perl-regexp" with PCRE v2.

A previous change to grep's documentation changed the description of
"-f <file>" to be vague enough as to not promise that this would work.
By dropping support for this we make it a whole lot easier to move
away from the kwset backend, which we'll do in a subsequent change.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c                         |  6 +--
 t/t7816-grep-binary-pattern.sh | 82 +++++++++++++++++-----------------
 2 files changed, 44 insertions(+), 44 deletions(-)
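
As a minimal illustration of the check this patch moves to the top of
compile_regexp(), here is a standalone sketch. The helper name
pattern_needs_pcre2() and its use_pcre2 parameter are hypothetical; only
the memchr() test mirrors the diff below.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of grep.c: returns 1 if a pattern of
 * the given length would be rejected because it contains a NUL byte
 * while the PCRE v2 backend is not in use, and 0 otherwise.
 */
int pattern_needs_pcre2(const char *pattern, size_t len, int use_pcre2)
{
	/*
	 * memchr() honors the explicit length, so it also inspects the
	 * bytes that follow an embedded NUL (strchr() would stop there).
	 */
	return memchr(pattern, 0, len) != NULL && !use_pcre2;
}
```

Because the check now runs before any backend is chosen, "-F" patterns
read via "-f <file>" are rejected in the same way as regex patterns.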

diff --git a/grep.c b/grep.c
index d6603bc950..8d0fff316c 100644
--- a/grep.c
+++ b/grep.c
@@ -644,6 +644,9 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
 
+	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
+		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
+
 	/*
 	 * Even when -F (fixed) asks us to do a non-regexp search, we
 	 * may not be able to correctly case-fold when -i
@@ -666,9 +669,6 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		return;
 	}
 
-	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
-		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
-
 	if (opt->fixed) {
 		/*
 		 * We come here when the pattern has the non-ascii
diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh
index 9e09bd5d6a..60bab291e4 100755
--- a/t/t7816-grep-binary-pattern.sh
+++ b/t/t7816-grep-binary-pattern.sh
@@ -60,23 +60,23 @@ test_expect_success 'setup' "
 "
 
 # Simple fixed-string matching that can use kwset (no -i && non-ASCII)
-nul_match 1 1 1 '-F' 'yQf'
-nul_match 0 0 0 '-F' 'yQx'
-nul_match 1 1 1 '-Fi' 'YQf'
-nul_match 0 0 0 '-Fi' 'YQx'
-nul_match 1 1 1 '' 'yQf'
-nul_match 0 0 0 '' 'yQx'
-nul_match 1 1 1 '' 'æQð'
-nul_match 1 1 1 '-F' 'eQm[*]c'
-nul_match 1 1 1 '-Fi' 'EQM[*]C'
+nul_match P P P '-F' 'yQf'
+nul_match P P P '-F' 'yQx'
+nul_match P P P '-Fi' 'YQf'
+nul_match P P P '-Fi' 'YQx'
+nul_match P P 1 '' 'yQf'
+nul_match P P 0 '' 'yQx'
+nul_match P P 1 '' 'æQð'
+nul_match P P P '-F' 'eQm[*]c'
+nul_match P P P '-Fi' 'EQM[*]C'
 
 # Regex patterns that would match but shouldn't with -F
-nul_match 0 0 0 '-F' 'yQ[f]'
-nul_match 0 0 0 '-F' '[y]Qf'
-nul_match 0 0 0 '-Fi' 'YQ[F]'
-nul_match 0 0 0 '-Fi' '[Y]QF'
-nul_match 0 0 0 '-F' 'æQ[ð]'
-nul_match 0 0 0 '-F' '[æ]Qð'
+nul_match P P P '-F' 'yQ[f]'
+nul_match P P P '-F' '[y]Qf'
+nul_match P P P '-Fi' 'YQ[F]'
+nul_match P P P '-Fi' '[Y]QF'
+nul_match P P P '-F' 'æQ[ð]'
+nul_match P P P '-F' '[æ]Qð'
 
 # The -F kwset codepath can't handle -i && non-ASCII...
 nul_match P 1 1 '-i' '[æ]Qð'
@@ -90,38 +90,38 @@ nul_match P 0 1 '-i' '[Æ]Qð'
 nul_match P 0 1 '-i' 'ÆQÐ'
 
 # \0 in regexes can only work with -P & PCRE v2
-nul_match P 1 1 '' 'yQ[f]'
-nul_match P 1 1 '' '[y]Qf'
-nul_match P 1 1 '-i' 'YQ[F]'
-nul_match P 1 1 '-i' '[Y]Qf'
-nul_match P 1 1 '' 'æQ[ð]'
-nul_match P 1 1 '' '[æ]Qð'
-nul_match P 0 1 '-i' 'ÆQ[Ð]'
-nul_match P 1 1 '' 'eQm.*cQ'
-nul_match P 1 1 '-i' 'EQM.*cQ'
-nul_match P 0 0 '' 'eQm[*]c'
-nul_match P 0 0 '-i' 'EQM[*]C'
+nul_match P P 1 '' 'yQ[f]'
+nul_match P P 1 '' '[y]Qf'
+nul_match P P 1 '-i' 'YQ[F]'
+nul_match P P 1 '-i' '[Y]Qf'
+nul_match P P 1 '' 'æQ[ð]'
+nul_match P P 1 '' '[æ]Qð'
+nul_match P P 1 '-i' 'ÆQ[Ð]'
+nul_match P P 1 '' 'eQm.*cQ'
+nul_match P P 1 '-i' 'EQM.*cQ'
+nul_match P P 0 '' 'eQm[*]c'
+nul_match P P 0 '-i' 'EQM[*]C'
 
 # Assert that we're using REG_STARTEND and the pattern doesn't match
 # just because it's cut off at the first \0.
-nul_match 0 0 0 '-i' 'NOMATCHQð'
-nul_match P 0 0 '-i' '[Æ]QNOMATCH'
-nul_match P 0 0 '-i' '[æ]QNOMATCH'
+nul_match P P 0 '-i' 'NOMATCHQð'
+nul_match P P 0 '-i' '[Æ]QNOMATCH'
+nul_match P P 0 '-i' '[æ]QNOMATCH'
 
 # Ensure that the matcher doesn't regress to something that stops at
 # \0
-nul_match 0 0 0 '-F' 'yQ[f]'
-nul_match 0 0 0 '-Fi' 'YQ[F]'
-nul_match 0 0 0 '' 'yQNOMATCH'
-nul_match 0 0 0 '' 'QNOMATCH'
-nul_match 0 0 0 '-i' 'YQNOMATCH'
-nul_match 0 0 0 '-i' 'QNOMATCH'
-nul_match 0 0 0 '-F' 'æQ[ð]'
+nul_match P P P '-F' 'yQ[f]'
+nul_match P P P '-Fi' 'YQ[F]'
+nul_match P P 0 '' 'yQNOMATCH'
+nul_match P P 0 '' 'QNOMATCH'
+nul_match P P 0 '-i' 'YQNOMATCH'
+nul_match P P 0 '-i' 'QNOMATCH'
+nul_match P P P '-F' 'æQ[ð]'
 nul_match P P P '-Fi' 'ÆQ[Ð]'
-nul_match P 0 1 '-i' 'ÆQ[Ð]'
-nul_match 0 0 0 '' 'yQNÓMATCH'
-nul_match 0 0 0 '' 'QNÓMATCH'
-nul_match 0 0 0 '-i' 'YQNÓMATCH'
-nul_match 0 0 0 '-i' 'QNÓMATCH'
+nul_match P P 1 '-i' 'ÆQ[Ð]'
+nul_match P P 0 '' 'yQNÓMATCH'
+nul_match P P 0 '' 'QNÓMATCH'
+nul_match P P 0 '-i' 'YQNÓMATCH'
+nul_match P P 0 '-i' 'QNÓMATCH'
 
 test_done
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v3 09/10] grep: remove the kwset optimization
  2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
                               ` (9 preceding siblings ...)
  2019-07-01 21:20             ` [PATCH v3 08/10] grep: drop support for \0 in --fixed-strings <pattern> Ævar Arnfjörð Bjarmason
@ 2019-07-01 21:20             ` Ævar Arnfjörð Bjarmason
  2019-07-01 21:21             ` [PATCH v3 10/10] grep: use PCRE v2 for optimized fixed-string search Ævar Arnfjörð Bjarmason
  11 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-01 21:20 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

A later change will replace this optimization with optimistic use of
PCRE v2. I'm completely removing it as an intermediate step, as
opposed to replacing it with PCRE v2, to demonstrate that no grep
semantics depend on this (or any other) optimization for the fixed
backend anymore.

For now this is mostly (but not entirely) a performance regression, as
shown by this hacky one-liner:

    for opt in '' ' -i'
        do
        GIT_PERF_7821_GREP_OPTS=$opt GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3 USE_LIBPCRE=YesPlease' ./run origin/master HEAD -- p7821-grep-engines-fixed.sh
    done &&
    for opt in '' ' -i'
        do GIT_PERF_4221_LOG_OPTS=$opt GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3 USE_LIBPCRE=YesPlease' ./run origin/master HEAD -- p4221-log-grep-engines-fixed.sh
    done

Which produces:

plain grep:

    Test                             origin/master     HEAD
    -------------------------------------------------------------------------
    7821.1: fixed grep int           0.55(1.60+0.63)   0.82(3.11+0.51) +49.1%
    7821.2: basic grep int           0.62(1.68+0.49)   0.85(3.02+0.52) +37.1%
    7821.3: extended grep int        0.61(1.63+0.53)   0.91(3.09+0.44) +49.2%
    7821.4: perl grep int            0.55(1.60+0.57)   0.41(0.93+0.57) -25.5%
    7821.6: fixed grep uncommon      0.20(0.50+0.44)   0.35(1.27+0.42) +75.0%
    7821.7: basic grep uncommon      0.20(0.49+0.45)   0.35(1.29+0.41) +75.0%
    7821.8: extended grep uncommon   0.20(0.45+0.48)   0.35(1.25+0.44) +75.0%
    7821.9: perl grep uncommon       0.20(0.53+0.41)   0.16(0.24+0.49) -20.0%
    7821.11: fixed grep æ            0.35(1.27+0.40)   0.25(0.82+0.39) -28.6%
    7821.12: basic grep æ            0.35(1.28+0.38)   0.25(0.75+0.44) -28.6%
    7821.13: extended grep æ         0.36(1.21+0.46)   0.25(0.86+0.35) -30.6%
    7821.14: perl grep æ             0.35(1.33+0.34)   0.16(0.26+0.47) -54.3%

grep with -i:

    Test                                origin/master     HEAD
    -----------------------------------------------------------------------------
    7821.1: fixed grep -i int           0.61(1.84+0.64)   1.11(4.12+0.64) +82.0%
    7821.2: basic grep -i int           0.72(1.86+0.57)   1.15(4.48+0.49) +59.7%
    7821.3: extended grep -i int        0.94(1.83+0.60)   1.53(4.12+0.58) +62.8%
    7821.4: perl grep -i int            0.66(1.82+0.59)   0.55(1.08+0.58) -16.7%
    7821.6: fixed grep -i uncommon      0.21(0.51+0.44)   0.44(1.74+0.34) +109.5%
    7821.7: basic grep -i uncommon      0.21(0.55+0.41)   0.44(1.72+0.40) +109.5%
    7821.8: extended grep -i uncommon   0.21(0.57+0.39)   0.42(1.64+0.45) +100.0%
    7821.9: perl grep -i uncommon       0.21(0.48+0.48)   0.17(0.30+0.45) -19.0%
    7821.11: fixed grep -i æ            0.25(0.73+0.45)   0.25(0.75+0.45) +0.0%
    7821.12: basic grep -i æ            0.25(0.71+0.49)   0.26(0.77+0.44) +4.0%
    7821.13: extended grep -i æ         0.25(0.75+0.44)   0.25(0.74+0.46) +0.0%
    7821.14: perl grep -i æ             0.17(0.26+0.48)   0.16(0.20+0.52) -5.9%

plain log:

    Test                                     origin/master     HEAD
    ---------------------------------------------------------------------------------
    4221.1: fixed log --grep='int'           7.31(7.06+0.21)   8.11(7.85+0.20) +10.9%
    4221.2: basic log --grep='int'           7.30(6.94+0.27)   8.16(7.89+0.19) +11.8%
    4221.3: extended log --grep='int'        7.34(7.05+0.21)   8.08(7.76+0.25) +10.1%
    4221.4: perl log --grep='int'            7.27(6.94+0.24)   7.05(6.76+0.25) -3.0%
    4221.6: fixed log --grep='uncommon'      6.97(6.62+0.32)   7.86(7.51+0.30) +12.8%
    4221.7: basic log --grep='uncommon'      7.05(6.69+0.29)   7.89(7.60+0.28) +11.9%
    4221.8: extended log --grep='uncommon'   6.89(6.56+0.32)   7.99(7.66+0.24) +16.0%
    4221.9: perl log --grep='uncommon'       7.02(6.66+0.33)   6.97(6.54+0.36) -0.7%
    4221.11: fixed log --grep='æ'            7.37(7.03+0.33)   7.67(7.30+0.31) +4.1%
    4221.12: basic log --grep='æ'            7.41(7.00+0.31)   7.60(7.28+0.26) +2.6%
    4221.13: extended log --grep='æ'         7.35(6.96+0.38)   7.73(7.31+0.34) +5.2%
    4221.14: perl log --grep='æ'             7.43(7.10+0.32)   6.95(6.61+0.27) -6.5%

log with -i:

    Test                                        origin/master     HEAD
    ------------------------------------------------------------------------------------
    4221.1: fixed log -i --grep='int'           7.40(7.05+0.23)   8.66(8.38+0.20) +17.0%
    4221.2: basic log -i --grep='int'           7.39(7.09+0.23)   8.67(8.39+0.20) +17.3%
    4221.3: extended log -i --grep='int'        7.29(6.99+0.26)   8.69(8.31+0.26) +19.2%
    4221.4: perl log -i --grep='int'            7.42(7.16+0.21)   7.14(6.80+0.24) -3.8%
    4221.6: fixed log -i --grep='uncommon'      6.94(6.58+0.35)   8.43(8.04+0.30) +21.5%
    4221.7: basic log -i --grep='uncommon'      6.95(6.62+0.31)   8.34(7.93+0.32) +20.0%
    4221.8: extended log -i --grep='uncommon'   7.06(6.75+0.25)   8.32(7.98+0.31) +17.8%
    4221.9: perl log -i --grep='uncommon'       6.96(6.69+0.26)   7.04(6.64+0.32) +1.1%
    4221.11: fixed log -i --grep='æ'            7.92(7.55+0.33)   7.86(7.44+0.34) -0.8%
    4221.12: basic log -i --grep='æ'            7.88(7.49+0.32)   7.84(7.46+0.34) -0.5%
    4221.13: extended log -i --grep='æ'         7.91(7.51+0.32)   7.87(7.48+0.32) -0.5%
    4221.14: perl log -i --grep='æ'             7.01(6.59+0.35)   6.99(6.64+0.28) -0.3%

Some of those, as noted in [1], are because PCRE is faster at finding
fixed strings. This looks bad for some engines, but in the next change
we'll optimistically use PCRE v2 for all of these, so it'll look
better.

1. https://public-inbox.org/git/87v9x793qi.fsf@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 63 +++-------------------------------------------------------
 grep.h |  2 --
 2 files changed, 3 insertions(+), 62 deletions(-)
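
For reading the tables above: each cell appears to be elapsed seconds
followed by (user+system) time, and the last column the relative change
in elapsed time from origin/master to HEAD. That reading is an
assumption based on the numbers themselves; pct_change() below is a
hypothetical helper, not something from t/perf.

```c
/*
 * Relative change from "before" to "after", in percent. A positive
 * result means HEAD took longer than origin/master.
 */
double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

With the "fixed grep int" row, pct_change(0.55, 0.82) is about 49.1,
and with the "perl grep int" row, pct_change(0.55, 0.41) is about
-25.5, matching the figures shown.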

diff --git a/grep.c b/grep.c
index 8d0fff316c..4468519d5c 100644
--- a/grep.c
+++ b/grep.c
@@ -356,18 +356,6 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p,
 	die("%s'%s': %s", where, p->pattern, error);
 }
 
-static int is_fixed(const char *s, size_t len)
-{
-	size_t i;
-
-	for (i = 0; i < len; i++) {
-		if (is_regex_special(s[i]))
-			return 0;
-	}
-
-	return 1;
-}
-
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 {
@@ -643,38 +631,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
+	p->fixed = opt->fixed;
 
 	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
-	/*
-	 * Even when -F (fixed) asks us to do a non-regexp search, we
-	 * may not be able to correctly case-fold when -i
-	 * (ignore-case) is asked (in which case, we'll synthesize a
-	 * regexp to match the pattern that matches regexp special
-	 * characters literally, while ignoring case differences).  On
-	 * the other hand, even without -F, if the pattern does not
-	 * have any regexp special characters and there is no need for
-	 * case-folding search, we can internally turn it into a
-	 * simple string match using kws.  p->fixed tells us if we
-	 * want to use kws.
-	 */
-	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
-		p->fixed = !p->ignore_case || !has_non_ascii(p->pattern);
-
-	if (p->fixed) {
-		p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL);
-		kwsincr(p->kws, p->pattern, p->patternlen);
-		kwsprep(p->kws);
-		return;
-	}
-
 	if (opt->fixed) {
-		/*
-		 * We come here when the pattern has the non-ascii
-		 * characters we cannot case-fold, and asked to
-		 * ignore-case.
-		 */
 		compile_fixed_regexp(p, opt);
 		return;
 	}
@@ -1042,9 +1004,7 @@ void free_grep_patterns(struct grep_opt *opt)
 		case GREP_PATTERN: /* atom */
 		case GREP_PATTERN_HEAD:
 		case GREP_PATTERN_BODY:
-			if (p->kws)
-				kwsfree(p->kws);
-			else if (p->pcre1_regexp)
+			if (p->pcre1_regexp)
 				free_pcre1_regexp(p);
 			else if (p->pcre2_pattern)
 				free_pcre2_pattern(p);
@@ -1104,29 +1064,12 @@ static void show_name(struct grep_opt *opt, const char *name)
 	opt->output(opt, opt->null_following_name ? "\0" : "\n", 1);
 }
 
-static int fixmatch(struct grep_pat *p, char *line, char *eol,
-		    regmatch_t *match)
-{
-	struct kwsmatch kwsm;
-	size_t offset = kwsexec(p->kws, line, eol - line, &kwsm);
-	if (offset == -1) {
-		match->rm_so = match->rm_eo = -1;
-		return REG_NOMATCH;
-	} else {
-		match->rm_so = offset;
-		match->rm_eo = match->rm_so + kwsm.size[0];
-		return 0;
-	}
-}
-
 static int patmatch(struct grep_pat *p, char *line, char *eol,
 		    regmatch_t *match, int eflags)
 {
 	int hit;
 
-	if (p->fixed)
-		hit = !fixmatch(p, line, eol, match);
-	else if (p->pcre1_regexp)
+	if (p->pcre1_regexp)
 		hit = !pcre1match(p, line, eol, match, eflags);
 	else if (p->pcre2_pattern)
 		hit = !pcre2match(p, line, eol, match, eflags);
diff --git a/grep.h b/grep.h
index 4bb8a79d93..d35a137fcb 100644
--- a/grep.h
+++ b/grep.h
@@ -32,7 +32,6 @@ typedef int pcre2_compile_context;
 typedef int pcre2_match_context;
 typedef int pcre2_jit_stack;
 #endif
-#include "kwset.h"
 #include "thread-utils.h"
 #include "userdiff.h"
 
@@ -97,7 +96,6 @@ struct grep_pat {
 	pcre2_match_context *pcre2_match_context;
 	pcre2_jit_stack *pcre2_jit_stack;
 	uint32_t pcre2_jit_on;
-	kwset_t kws;
 	unsigned fixed:1;
 	unsigned ignore_case:1;
 	unsigned word_regexp:1;
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v3 10/10] grep: use PCRE v2 for optimized fixed-string search
  2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
                               ` (10 preceding siblings ...)
  2019-07-01 21:20             ` [PATCH v3 09/10] grep: remove the kwset optimization Ævar Arnfjörð Bjarmason
@ 2019-07-01 21:21             ` Ævar Arnfjörð Bjarmason
  11 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-01 21:21 UTC (permalink / raw)
  To: git
  Cc: git-packagers, gitgitgadget, gitster, johannes.schindelin, peff,
	sandals, szeder.dev, Ævar Arnfjörð Bjarmason

Bring back optimized fixed-string search for "grep", this time with
PCRE v2 as an optional backend. As noted in [1], with the kwset backend
we were slower than PCRE v1 and v2 with JIT, so that optimization was
counterproductive.

This brings back the optimization for "--fixed-strings", without
changing the semantics of having a NUL-byte in patterns. As seen in
previous commits in this series we could support it now, but I'd
rather just leave that edge-case aside so we don't have one behavior
or the other depending what "--fixed-strings" backend we're using. It
makes the behavior harder to understand and document, and makes tests
for the different backends more painful.

This does change the behavior under non-C locales when "log"'s
"--encoding" option is used and the haystack/needle in the
content/command-line doesn't have a matching encoding. See the recent
change in "t4210: skip more command-line encoding tests on MinGW" in
this series. I think that's OK. We did nothing sensible before
then (just compared raw bytes that had no hope of matching). At least
now the user will get some idea why their grep/log never matches in
that edge case.

I could also support the PCRE v1 backend here, but that would make the
code more complex. I'd rather aim for simplicity here and in future
changes to the diffcore. We're not going to have someone who
absolutely must have faster search, but for whom building PCRE v2
isn't acceptable.

The difference between this series of commits and the current "master"
is, using the same t/perf commands shown in the last commit:

plain grep:

    Test                             origin/master     HEAD
    -------------------------------------------------------------------------
    7821.1: fixed grep int           0.55(1.67+0.56)   0.41(0.98+0.60) -25.5%
    7821.2: basic grep int           0.58(1.65+0.52)   0.41(0.96+0.57) -29.3%
    7821.3: extended grep int        0.57(1.66+0.49)   0.42(0.93+0.60) -26.3%
    7821.4: perl grep int            0.54(1.67+0.50)   0.43(0.88+0.65) -20.4%
    7821.6: fixed grep uncommon      0.21(0.52+0.42)   0.16(0.24+0.51) -23.8%
    7821.7: basic grep uncommon      0.20(0.49+0.45)   0.17(0.28+0.47) -15.0%
    7821.8: extended grep uncommon   0.20(0.54+0.39)   0.16(0.25+0.50) -20.0%
    7821.9: perl grep uncommon       0.20(0.58+0.36)   0.16(0.23+0.50) -20.0%
    7821.11: fixed grep æ            0.35(1.24+0.43)   0.16(0.23+0.50) -54.3%
    7821.12: basic grep æ            0.36(1.29+0.38)   0.16(0.20+0.54) -55.6%
    7821.13: extended grep æ         0.35(1.23+0.44)   0.16(0.24+0.50) -54.3%
    7821.14: perl grep æ             0.35(1.33+0.34)   0.16(0.28+0.46) -54.3%

grep with -i:

    Test                                origin/master     HEAD
    ----------------------------------------------------------------------------
    7821.1: fixed grep -i int           0.62(1.81+0.70)   0.47(1.11+0.64) -24.2%
    7821.2: basic grep -i int           0.67(1.90+0.53)   0.46(1.07+0.62) -31.3%
    7821.3: extended grep -i int        0.62(1.92+0.53)   0.53(1.12+0.58) -14.5%
    7821.4: perl grep -i int            0.66(1.85+0.58)   0.45(1.10+0.59) -31.8%
    7821.6: fixed grep -i uncommon      0.21(0.54+0.43)   0.17(0.20+0.55) -19.0%
    7821.7: basic grep -i uncommon      0.20(0.52+0.45)   0.17(0.29+0.48) -15.0%
    7821.8: extended grep -i uncommon   0.21(0.52+0.44)   0.17(0.26+0.50) -19.0%
    7821.9: perl grep -i uncommon       0.21(0.53+0.44)   0.17(0.20+0.56) -19.0%
    7821.11: fixed grep -i æ            0.26(0.79+0.44)   0.16(0.29+0.46) -38.5%
    7821.12: basic grep -i æ            0.26(0.79+0.42)   0.16(0.20+0.54) -38.5%
    7821.13: extended grep -i æ         0.26(0.84+0.39)   0.16(0.24+0.50) -38.5%
    7821.14: perl grep -i æ             0.16(0.24+0.49)   0.17(0.25+0.51) +6.3%

plain log:

    Test                                     origin/master     HEAD
    --------------------------------------------------------------------------------
    4221.1: fixed log --grep='int'           7.24(6.95+0.28)   7.20(6.95+0.18) -0.6%
    4221.2: basic log --grep='int'           7.31(6.97+0.22)   7.20(6.93+0.21) -1.5%
    4221.3: extended log --grep='int'        7.37(7.04+0.24)   7.22(6.91+0.25) -2.0%
    4221.4: perl log --grep='int'            7.31(7.04+0.21)   7.19(6.89+0.21) -1.6%
    4221.6: fixed log --grep='uncommon'      6.93(6.59+0.32)   7.04(6.66+0.37) +1.6%
    4221.7: basic log --grep='uncommon'      6.92(6.58+0.29)   7.08(6.75+0.29) +2.3%
    4221.8: extended log --grep='uncommon'   6.92(6.55+0.31)   7.00(6.68+0.31) +1.2%
    4221.9: perl log --grep='uncommon'       7.03(6.59+0.33)   7.12(6.73+0.34) +1.3%
    4221.11: fixed log --grep='æ'            7.41(7.08+0.28)   7.05(6.76+0.29) -4.9%
    4221.12: basic log --grep='æ'            7.39(6.99+0.33)   7.00(6.68+0.25) -5.3%
    4221.13: extended log --grep='æ'         7.34(7.00+0.25)   7.15(6.81+0.31) -2.6%
    4221.14: perl log --grep='æ'             7.43(7.13+0.26)   7.01(6.60+0.36) -5.7%

log with -i:

    Test                                        origin/master     HEAD
    ------------------------------------------------------------------------------------
    4221.1: fixed log -i --grep='int'           7.31(7.07+0.24)   7.23(7.00+0.22) -1.1%
    4221.2: basic log -i --grep='int'           7.40(7.08+0.28)   7.19(6.92+0.20) -2.8%
    4221.3: extended log -i --grep='int'        7.43(7.13+0.25)   7.27(6.99+0.21) -2.2%
    4221.4: perl log -i --grep='int'            7.34(7.10+0.24)   7.10(6.90+0.19) -3.3%
    4221.6: fixed log -i --grep='uncommon'      7.07(6.71+0.32)   7.11(6.77+0.28) +0.6%
    4221.7: basic log -i --grep='uncommon'      6.99(6.64+0.28)   7.12(6.69+0.38) +1.9%
    4221.8: extended log -i --grep='uncommon'   7.11(6.74+0.32)   7.10(6.77+0.27) -0.1%
    4221.9: perl log -i --grep='uncommon'       6.98(6.60+0.29)   7.05(6.64+0.34) +1.0%
    4221.11: fixed log -i --grep='æ'            7.85(7.45+0.34)   7.03(6.68+0.32) -10.4%
    4221.12: basic log -i --grep='æ'            7.87(7.49+0.29)   7.06(6.69+0.31) -10.3%
    4221.13: extended log -i --grep='æ'         7.87(7.54+0.31)   7.09(6.69+0.31) -9.9%
    4221.14: perl log -i --grep='æ'             7.06(6.77+0.28)   6.91(6.57+0.31) -2.1%

So as with e05b027627 ("grep: use PCRE v2 for optimized fixed-string
search", 2019-06-26) there's a huge improvement in performance for
"grep", but in "log" most of our time is spent elsewhere, so we don't
notice it that much.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)
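
A minimal standalone sketch of the decision made in the hunk below: a
pattern with no regex metacharacters can go to PCRE v2 as-is, and
anything else requested with "-F" is wrapped in \Q...\E first. The names
is_special() and quote_literal() are hypothetical stand-ins (the patch
uses is_regex_special() and a strbuf), and the set of special characters
listed here is an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for is_regex_special(); the exact character set is assumed. */
int is_special(char c)
{
	/* strchr() would also match the terminator, so exclude NUL first. */
	return c != '\0' && strchr("$()*+.?[\\^{|", c) != NULL;
}

/* Returns 1 if no byte of the pattern is a regex metacharacter. */
int is_fixed(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (is_special(s[i]))
			return 0;
	}
	return 1;
}

/*
 * Returns a newly allocated, NUL-terminated "\Q<pattern>\E" and stores
 * its length in *out_len; the caller frees it. Returns NULL if the
 * allocation fails. A pattern that itself contains "\E" would end the
 * quoted region early; this sketch does not handle that.
 */
char *quote_literal(const char *pattern, size_t len, size_t *out_len)
{
	char *buf = malloc(len + 5); /* "\Q" + pattern + "\E" + NUL */

	if (!buf)
		return NULL;
	memcpy(buf, "\\Q", 2);
	memcpy(buf + 2, pattern, len);
	memcpy(buf + 2 + len, "\\E", 2);
	buf[len + 4] = '\0';
	*out_len = len + 4;
	return buf;
}
```

For example, is_fixed("abc", 3) is 1, so "abc" would be compiled
unchanged, while "a[b]" is not fixed and would be compiled as
"\Qa[b]\E" when "-F" was given.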

diff --git a/grep.c b/grep.c
index 4468519d5c..fc0ed73ef3 100644
--- a/grep.c
+++ b/grep.c
@@ -356,6 +356,18 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p,
 	die("%s'%s': %s", where, p->pattern, error);
 }
 
+static int is_fixed(const char *s, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++) {
+		if (is_regex_special(s[i]))
+			return 0;
+	}
+
+	return 1;
+}
+
 #ifdef USE_LIBPCRE1
 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 {
@@ -602,7 +614,6 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
 static void free_pcre2_pattern(struct grep_pat *p)
 {
 }
-#endif /* !USE_LIBPCRE2 */
 
 static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
@@ -623,11 +634,13 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 		compile_regexp_failed(p, errbuf);
 	}
 }
+#endif /* !USE_LIBPCRE2 */
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
 	int err;
 	int regflags = REG_NEWLINE;
+	int pat_is_fixed;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
@@ -636,8 +649,42 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
-	if (opt->fixed) {
+	pat_is_fixed = is_fixed(p->pattern, p->patternlen);
+	if (opt->fixed || pat_is_fixed) {
+#ifdef USE_LIBPCRE2
+		opt->pcre2 = 1;
+		if (pat_is_fixed) {
+			compile_pcre2_pattern(p, opt);
+		} else {
+			/*
+			 * E.g. t7811-grep-open.sh relies on the
+			 * pattern being restored.
+			 */
+			char *old_pattern = p->pattern;
+			size_t old_patternlen = p->patternlen;
+			struct strbuf sb = STRBUF_INIT;
+
+			/*
+			 * There is the PCRE2_LITERAL flag, but it's
+			 * only in PCRE v2 10.30 and later. Needing to
+			 * ifdef our way around that and dealing with
+			 * it + PCRE2_MULTILINE being an error is more
+			 * complex than just quoting this ourselves.
+			*/
+			strbuf_add(&sb, "\\Q", 2);
+			strbuf_add(&sb, p->pattern, p->patternlen);
+			strbuf_add(&sb, "\\E", 2);
+
+			p->pattern = sb.buf;
+			p->patternlen = sb.len;
+			compile_pcre2_pattern(p, opt);
+			p->pattern = old_pattern;
+			p->patternlen = old_patternlen;
+			strbuf_release(&sb);
+		}
+#else /* !USE_LIBPCRE2 */
 		compile_fixed_regexp(p, opt);
+#endif /* !USE_LIBPCRE2 */
 		return;
 	}
 
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v3 00/10] grep: move from kwset to optional PCRE v2
  2019-07-01 21:20             ` [PATCH v3 00/10] " Ævar Arnfjörð Bjarmason
@ 2019-07-01 21:31               ` Junio C Hamano
  2019-07-02 11:10                 ` Ævar Arnfjörð Bjarmason
  2019-07-02 12:32               ` Johannes Schindelin
  1 sibling, 1 reply; 90+ messages in thread
From: Junio C Hamano @ 2019-07-01 21:31 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, git-packagers, gitgitgadget, johannes.schindelin, peff,
	sandals, szeder.dev

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> This v3 has a new patch (3/10) that I believe fixes the regression on
> MinGW Johannes noted in
> https://public-inbox.org/git/nycvar.QRO.7.76.6.1907011515150.44@tvgsbejvaqbjf.bet/
>
> As noted in the updated commit message in 10/10 I believe just
> skipping this test & documenting this in a commit message is the least
> amount of suck for now. It's really an existing issue with us doing
> nothing sensible when the log/grep haystack encoding doesn't match the
> needle encoding supplied via the command line.

Is that quite the case?  If they do not match, not finding the match
is the right answer, because we are byte-for-byte matching/searching
IIUC.

> We swept that under the carpet with the kwset backend, but PCRE v2
> exposes it.

Is it exposing, or just showing the limitation of the rewritten
implementation where it cannot do byte-for-byte matching/searching
as we used to be able to?

Without having a way to know what encoding is used on the command
line, there is no sensible way to reencode them to match the
haystack encoding (even when it is known), so "you got to feed the
strings in the same encoding, as we are going to match/search
byte-for-byte" is the only sensible way to work, given the design
space, I would think.

Not that it is all that useful to be able to match/search
byte-for-byte, of course, so I am OK if we punt with these tests,
but I'd prefer to see us admit we are punting when we do ;-).

* Re: [PATCH v3 00/10] grep: move from kwset to optional PCRE v2
  2019-07-01 21:31               ` Junio C Hamano
@ 2019-07-02 11:10                 ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-02 11:10 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, git-packagers, gitgitgadget, johannes.schindelin, peff,
	sandals, szeder.dev


On Mon, Jul 01 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> This v3 has a new patch (3/10) that I believe fixes the regression on
>> MinGW Johannes noted in
>> https://public-inbox.org/git/nycvar.QRO.7.76.6.1907011515150.44@tvgsbejvaqbjf.bet/
>>
>> As noted in the updated commit message in 10/10 I believe just
>> skipping this test & documenting this in a commit message is the least
>> amount of suck for now. It's really an existing issue with us doing
>> nothing sensible when the log/grep haystack encoding doesn't match the
>> needle encoding supplied via the command line.
>
> Is that quite the case?  If they do not match, not finding the match
> is the right answer, because we are byte-for-byte matching/searching
> IIUC.
>
>> We swept that under the carpet with the kwset backend, but PCRE v2
>> exposes it.
>
> Is it exposing, or just showing the limitation of the rewritten
> implementation where it cannot do byte-for-byte matching/searching
> as we used to be able to?
>
> Without having a way to know what encoding is used on the command
> line, there is no sensible way to reencode them to match the
> haystack encoding (even when it is known), so "you got to feed the
> strings in the same encoding, as we are going to match/search
> byte-for-byte" is the only sensible way to work, given the design
> space, I would think.
>
> Not that it is all that useful to be able to match/search
> byte-for-byte, of course, so I am OK if we punt with these tests,
> but I'd prefer to see us admit we are punting when we do ;-).

I'm guilty as charged of punting on this larger encoding issue. As far as
this patch series is concerned, it unearths an obscure case that I think
nobody cares about in practice, and I'd like to move on with the "remove
kwset" optimization.

But I strongly believe that the new behavior with the PCRE v2
optimization is the only sane thing to do, and to the extent we have
anything left to do (#leftoverbits) it's that we should modify git more
generally (aside from string searching) to do the same thing where
appropriate.

Remember, this only happens if the user has set a UTF-8 locale and thus
promised that they're going to give us UTF-8. We then take that promise
and make e.g. "æ" match "Æ" under --ignore-case.

Just falling back on raw byte matching isn't going to cut it, because
then "æ<invalid utf8>" won't match "Æ<same invalid utf8>" under
--ignore-case, and there are other cases like that with matching word
boundaries & other Unicode gotchas.

The best that can be hoped for at that point is some "loose UTF-8"
mode. I see both perl & GNU grep seem to support that (although I'm sure
it falls apart at some point). GNU grep will also die in the same way
that we now die with --perl-regexp (since it also uses PCRE).

I think that's saner: if the user thinks they're feeding us UTF-8 but
they're not, I think they'd like to know rather than have the string
matching library fall back.
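
To make the "æ" vs. "Æ" point concrete, here is a minimal sketch of why
raw byte matching cannot provide --ignore-case for non-ASCII text. It is
not Git's or PCRE's code; `bytes_equal()` and `ascii_icase_equal()` are
hypothetical helpers, and the byte arrays are the UTF-8 encodings of
U+00E6 and U+00C6.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* UTF-8 encodings: "æ" (U+00E6) is C3 A6, "Æ" (U+00C6) is C3 86. */
static const char ae_lower[] = "\xc3\xa6";
static const char ae_upper[] = "\xc3\x86";

/* Byte-for-byte comparison, as a raw matcher would do it. */
static int bytes_equal(const char *a, const char *b, size_t len)
{
	assert(a != NULL && b != NULL);
	return memcmp(a, b, len) == 0;
}

/* Fold only the ASCII letters A-Z; every other byte is left alone. */
static unsigned char ascii_fold(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Case-insensitive comparison that knows nothing about Unicode. */
static int ascii_icase_equal(const char *a, const char *b, size_t len)
{
	size_t i;

	assert(a != NULL && b != NULL);
	for (i = 0; i < len; i++) {
		if (ascii_fold((unsigned char)a[i]) !=
		    ascii_fold((unsigned char)b[i]))
			return 0;
	}
	return 1;
}
```

Both helpers report a mismatch for `ae_lower` vs. `ae_upper`, because the
second bytes (A6 and 86) differ and neither is an ASCII letter. Matching
them requires decoding the bytes as UTF-8 first, which in turn requires
the input to be valid UTF-8.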

* Re: [PATCH v3 00/10] grep: move from kwset to optional PCRE v2
  2019-07-01 21:20             ` [PATCH v3 00/10] " Ævar Arnfjörð Bjarmason
  2019-07-01 21:31               ` Junio C Hamano
@ 2019-07-02 12:32               ` Johannes Schindelin
  2019-07-02 19:57                 ` Junio C Hamano
  2019-07-03 10:25                 ` Johannes Schindelin
  1 sibling, 2 replies; 90+ messages in thread
From: Johannes Schindelin @ 2019-07-02 12:32 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, git-packagers, gitgitgadget, gitster, peff, sandals, szeder.dev

Hi Ævar,

On Mon, 1 Jul 2019, Ævar Arnfjörð Bjarmason wrote:

> This v3 has a new patch (3/10) that I believe fixes the regression on
> MinGW Johannes noted in
> https://public-inbox.org/git/nycvar.QRO.7.76.6.1907011515150.44@tvgsbejvaqbjf.bet/

Yes.

However, I probably failed to mention another breakage:

not ok 54 - LC_ALL='C' git grep -P -f f -i 'Æ<NUL>[Ð]' a

 expecting success:
			>stderr &&
			printf 'ÆQ[Ð]' | q_to_nul >f &&
			test_must_fail env LC_ALL="C" git grep -P -f f -i a 2>stderr &&
			test_i18ngrep ! 'This is only supported with -P under PCRE v2' stderr

++ printf 'ÆQ[Ð]'
++ q_to_nul
++ perl -pe 'y/Q/\000/'
++ command /usr/bin/perl -pe 'y/Q/\000/'
++ /usr/bin/perl -pe 'y/Q/\000/'
++ test_must_fail env LC_ALL=C git grep -P -f f -i a
++ case "$1" in
++ _test_ok=
++ env LC_ALL=C git grep -P -f f -i a
Binary file a matches
++ exit_code=0
++ test 0 -eq 0
++ list_contains '' success
++ case ",$1," in
++ return 1
++ echo 'test_must_fail: command succeeded: env LC_ALL=C git grep -P -f f -i a'
test_must_fail: command succeeded: env LC_ALL=C git grep -P -f f -i a
++ return 1
error: last command exited with $?=1

There are three more test cases in that test script that fail similarly. See
https://dev.azure.com/Git-for-Windows/git/_build/results?buildId=38852&view=ms.vss-test-web.build-test-results-tab&runId=1019770&resultId=101368&paneView=debug

I ran out of time to look into this in more detail :-(

> As noted in the updated commit message in 10/10 I believe just
> skipping this test & documenting this in a commit message is the least
> amount of suck for now. It's really an existing issue with us doing
> nothing sensible when the log/grep haystack encoding doesn't match the
> needle encoding supplied via the command line.
>
> We swept that under the carpet with the kwset backend, but PCRE v2
> exposes it.

Please note that the problem is _not_ MinGW! The problem is that the
non-JIT'ted code path is a lot more stringent than the JIT'ted one. So
what you'd need is a prerequisite that tests whether the PCREv2 in use
supports JIT'ted code or not, and skip the test case in the latter one.

Or you fix the code by re-encoding the plain text in UTF-8 if we know that
it is not UTF-8-encoded but the needle is.

Ciao,
Dscho

* Re: [PATCH v3 00/10] grep: move from kwset to optional PCRE v2
  2019-07-02 12:32               ` Johannes Schindelin
@ 2019-07-02 19:57                 ` Junio C Hamano
  2019-07-03 10:08                   ` Johannes Schindelin
  2019-07-03 10:25                 ` Johannes Schindelin
  1 sibling, 1 reply; 90+ messages in thread
From: Junio C Hamano @ 2019-07-02 19:57 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, git, git-packagers,
	gitgitgadget, peff, sandals, szeder.dev

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Please note that the problem is _not_ MinGW! The problem is that the
> non-JIT'ted code path is a lot more stringent than the JIT'ted one. So
> what you'd need is a prerequisite that tests whether the PCREv2 in use
> supports JIT'ted code or not, and skip the test case in the latter one.

Hmph, so additional prereq !MINGW may happen to match "do we use
pcre sans jit?" but not a right thing to use here?


* Re: [PATCH v3 00/10] grep: move from kwset to optional PCRE v2
  2019-07-02 19:57                 ` Junio C Hamano
@ 2019-07-03 10:08                   ` Johannes Schindelin
  0 siblings, 0 replies; 90+ messages in thread
From: Johannes Schindelin @ 2019-07-03 10:08 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, git, git-packagers,
	gitgitgadget, peff, sandals, szeder.dev

Hi Junio,

On Tue, 2 Jul 2019, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > Please note that the problem is _not_ MinGW! The problem is that the
> > non-JIT'ted code path is a lot more stringent than the JIT'ted one. So
> > what you'd need is a prerequisite that tests whether the PCREv2 in use
> > supports JIT'ted code or not, and skip the test case in the latter one.
>
> Hmph, so additional prereq !MINGW may happen to match "do we use
> pcre sans jit?" but not a right thing to use here?

That's right, the `!MINGW` prereq works by happenstance only, and as soon
as I find some time to rebuild PCREv2 with JIT support, it will stop doing
the right thing.

Which might happen very soon.

Quite honestly, I'd rather introduce a prerequisite here that specifically
tests whether the output of a `git grep -P` suggests that it has been fed
incorrect UTF-8, and skip the test case under that circumstance.

Ciao,
Dscho
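
For readers wondering what "fed incorrect UTF-8" means at the byte level,
a strict validity check might look like the sketch below. This is an
illustration only, not PCRE v2's implementation and not the proposed
prerequisite; `is_valid_utf8()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if buf[0..len) is well-formed UTF-8,
 * 0 otherwise. Rejects truncated sequences, stray continuation bytes,
 * overlong encodings, surrogates and code points above U+10FFFF.
 */
static int is_valid_utf8(const char *buf, size_t len)
{
	const unsigned char *s = (const unsigned char *)buf;
	size_t i = 0;

	assert(buf != NULL || len == 0);

	while (i < len) {
		unsigned char c = s[i];
		unsigned long cp;
		size_t need, k;	/* need = number of continuation bytes */

		if (c < 0x80) {
			i++;
			continue;
		} else if ((c & 0xe0) == 0xc0) {
			need = 1; cp = c & 0x1f;
		} else if ((c & 0xf0) == 0xe0) {
			need = 2; cp = c & 0x0f;
		} else if ((c & 0xf8) == 0xf0) {
			need = 3; cp = c & 0x07;
		} else {
			return 0;	/* stray continuation or invalid lead byte */
		}

		if (len - i - 1 < need)
			return 0;	/* sequence is cut off */
		for (k = 1; k <= need; k++) {
			if ((s[i + k] & 0xc0) != 0x80)
				return 0;
			cp = (cp << 6) | (s[i + k] & 0x3f);
		}

		if ((need == 1 && cp < 0x80) ||
		    (need == 2 && cp < 0x800) ||
		    (need == 3 && cp < 0x10000))
			return 0;	/* overlong encoding */
		if ((cp >= 0xd800 && cp <= 0xdfff) || cp > 0x10ffff)
			return 0;	/* surrogate or out of range */

		i += need + 1;
	}
	return 1;
}
```

With this check, the two-byte sequence C3 86 ("Æ" in UTF-8) passes, while
the single byte C6 ("Æ" in ISO-8859-1) is rejected as a truncated
sequence.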

* Re: [PATCH v3 00/10] grep: move from kwset to optional PCRE v2
  2019-07-02 12:32               ` Johannes Schindelin
  2019-07-02 19:57                 ` Junio C Hamano
@ 2019-07-03 10:25                 ` Johannes Schindelin
  2019-07-03 11:27                   ` Johannes Schindelin
  1 sibling, 1 reply; 90+ messages in thread
From: Johannes Schindelin @ 2019-07-03 10:25 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, git-packagers, gitgitgadget, gitster, peff, sandals, szeder.dev

Hi,

On Tue, 2 Jul 2019, Johannes Schindelin wrote:

> [...] I probably failed to mention another breakage, though...:
>
> not ok 54 - LC_ALL='C' git grep -P -f f -i 'Æ<NUL>[Ð]' a
>
>  expecting success:
> 			>stderr &&
> 			printf 'ÆQ[Ð]' | q_to_nul >f &&
> 			test_must_fail env LC_ALL="C" git grep -P -f f -i a 2>stderr &&
> 			test_i18ngrep ! 'This is only supported with -P under PCRE v2' stderr
>
> ++ printf 'ÆQ[Ð]'
> ++ q_to_nul
> ++ perl -pe 'y/Q/\000/'
> ++ command /usr/bin/perl -pe 'y/Q/\000/'
> ++ /usr/bin/perl -pe 'y/Q/\000/'
> ++ test_must_fail env LC_ALL=C git grep -P -f f -i a
> ++ case "$1" in
> ++ _test_ok=
> ++ env LC_ALL=C git grep -P -f f -i a
> Binary file a matches
> ++ exit_code=0
> ++ test 0 -eq 0
> ++ list_contains '' success
> ++ case ",$1," in
> ++ return 1
> ++ echo 'test_must_fail: command succeeded: env LC_ALL=C git grep -P -f f -i a'
> test_must_fail: command succeeded: env LC_ALL=C git grep -P -f f -i a
> ++ return 1
> error: last command exited with $?=1
>
> There are three more test cases in that test script that fail similarly. See
> https://dev.azure.com/Git-for-Windows/git/_build/results?buildId=38852&view=ms.vss-test-web.build-test-results-tab&runId=1019770&resultId=101368&paneView=debug
>
> I ran out of time to look into this in more detail :-(

I figured it out. It does not happen with your `ab/no-kwset` patch series
in isolation; it happens only when it is merged into `pu`, and the culprit
is a bad interaction with the `js/mingw-use-utf8` branch.

To fix it, I have a tentative patch:
https://github.com/git-for-windows/git/commit/e561446d

So I'll head over to that patch series and add more information there.

Ciao,
Dscho

* Re: [PATCH v3 00/10] grep: move from kwset to optional PCRE v2
  2019-07-03 10:25                 ` Johannes Schindelin
@ 2019-07-03 11:27                   ` Johannes Schindelin
  0 siblings, 0 replies; 90+ messages in thread
From: Johannes Schindelin @ 2019-07-03 11:27 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, git-packagers, gitgitgadget, gitster, peff, sandals, szeder.dev

Hi,

On Wed, 3 Jul 2019, Johannes Schindelin wrote:

> Hi,
>
> On Tue, 2 Jul 2019, Johannes Schindelin wrote:
>
> > [...] I probably failed to mention another breakage, though...:
> >
> > not ok 54 - LC_ALL='C' git grep -P -f f -i 'Æ<NUL>[Ð]' a
> >
> >  expecting success:
> > 			>stderr &&
> > 			printf 'ÆQ[Ð]' | q_to_nul >f &&
> > 			test_must_fail env LC_ALL="C" git grep -P -f f -i a 2>stderr &&
> > 			test_i18ngrep ! 'This is only supported with -P under PCRE v2' stderr
> >
> > ++ printf 'ÆQ[Ð]'
> > ++ q_to_nul
> > ++ perl -pe 'y/Q/\000/'
> > ++ command /usr/bin/perl -pe 'y/Q/\000/'
> > ++ /usr/bin/perl -pe 'y/Q/\000/'
> > ++ test_must_fail env LC_ALL=C git grep -P -f f -i a
> > ++ case "$1" in
> > ++ _test_ok=
> > ++ env LC_ALL=C git grep -P -f f -i a
> > Binary file a matches
> > ++ exit_code=0
> > ++ test 0 -eq 0
> > ++ list_contains '' success
> > ++ case ",$1," in
> > ++ return 1
> > ++ echo 'test_must_fail: command succeeded: env LC_ALL=C git grep -P -f f -i a'
> > test_must_fail: command succeeded: env LC_ALL=C git grep -P -f f -i a
> > ++ return 1
> > error: last command exited with $?=1
> >
> > There are three more test cases in that test script that fail similarly. See
> > https://dev.azure.com/Git-for-Windows/git/_build/results?buildId=38852&view=ms.vss-test-web.build-test-results-tab&runId=1019770&resultId=101368&paneView=debug
> >
> > I ran out of time to look into this in more detail :-(
>
> I figured it out. It does not happen with your `ab/no-kwset` patch series
> in isolation, it's only when it is merged into `pu`, and the culprit is
> the bad interaction with the `js/mingw-use-utf8` branch.

Whoops, it is the `kb/windows-force-utf8` branch instead.

Ciao,
Dscho

Thread overview: 90+ messages
2019-06-13 11:49 [PATCH 0/4] Support building with GCC v8.x/v9.x Johannes Schindelin via GitGitGadget
2019-06-13 11:49 ` [PATCH 1/4] poll (mingw): allow compiling with GCC 8 and DEVELOPER=1 Johannes Schindelin via GitGitGadget
2019-06-13 11:49 ` [PATCH 2/4] kwset: allow building with GCC 8 Johannes Schindelin via GitGitGadget
2019-06-13 16:11   ` Junio C Hamano
2019-06-14  9:53   ` SZEDER Gábor
2019-06-14 10:00     ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream SZEDER Gábor
2019-06-14 10:00       ` [PATCH v1 1/4] " SZEDER Gábor
2019-06-14 10:00       ` [PATCH v1 2/4] SQUASH??? compat/obstack: fix portability issues SZEDER Gábor
2019-06-14 10:00       ` [PATCH v1 3/4] SQUASH??? compat/obstack: fix build errors with Clang SZEDER Gábor
2019-06-14 10:00       ` [PATCH v1 4/4] compat/obstack: fix some sparse warnings SZEDER Gábor
2019-06-14 17:57       ` [RFC/PATCH v1 0/4] compat/obstack: update from upstream Jeff King
2019-06-14 18:19       ` Junio C Hamano
2019-06-14 20:30       ` Ramsay Jones
2019-06-14 21:24         ` Ramsay Jones
2019-06-17 18:36         ` SZEDER Gábor
2019-06-14 16:12     ` [PATCH 2/4] kwset: allow building with GCC 8 Junio C Hamano
2019-06-17 18:26       ` SZEDER Gábor
2019-06-14 22:09   ` Ævar Arnfjörð Bjarmason
2019-06-14 22:55   ` Can we just get rid of kwset & obstack in favor of optimistically using PCRE v2 JIT? Ævar Arnfjörð Bjarmason
2019-06-14 23:19     ` Ævar Arnfjörð Bjarmason
2019-06-20 10:35       ` Jeff King
2019-06-15  9:01     ` Carlo Arenas
2019-06-15 19:15     ` brian m. carlson
2019-06-15 22:14       ` Ævar Arnfjörð Bjarmason
2019-06-26  0:03         ` [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2 Ævar Arnfjörð Bjarmason
2019-06-26 14:02           ` Johannes Schindelin
2019-06-27  9:16             ` Johannes Schindelin
2019-06-27 16:27               ` Ævar Arnfjörð Bjarmason
2019-06-27 18:21                 ` Johannes Schindelin
2019-06-27 23:39           ` [PATCH v2 0/9] " Ævar Arnfjörð Bjarmason
2019-06-28  7:23             ` Ævar Arnfjörð Bjarmason
2019-06-28 16:10               ` Junio C Hamano
2019-07-01 21:20             ` [PATCH v3 00/10] " Ævar Arnfjörð Bjarmason
2019-07-01 21:31               ` Junio C Hamano
2019-07-02 11:10                 ` Ævar Arnfjörð Bjarmason
2019-07-02 12:32               ` Johannes Schindelin
2019-07-02 19:57                 ` Junio C Hamano
2019-07-03 10:08                   ` Johannes Schindelin
2019-07-03 10:25                 ` Johannes Schindelin
2019-07-03 11:27                   ` Johannes Schindelin
2019-07-01 21:20             ` [PATCH v3 01/10] log tests: test regex backends in "--encode=<enc>" tests Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 02/10] grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>" Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 03/10] t4210: skip more command-line encoding tests on MinGW Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 04/10] grep: inline the return value of a function call used only once Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 05/10] grep tests: move "grep binary" alongside the rest Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 06/10] grep tests: move binary pattern tests into their own file Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 07/10] grep: make the behavior for NUL-byte in patterns sane Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 08/10] grep: drop support for \0 in --fixed-strings <pattern> Ævar Arnfjörð Bjarmason
2019-07-01 21:20             ` [PATCH v3 09/10] grep: remove the kwset optimization Ævar Arnfjörð Bjarmason
2019-07-01 21:21             ` [PATCH v3 10/10] grep: use PCRE v2 for optimized fixed-string search Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 1/9] log tests: test regex backends in "--encode=<enc>" tests Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 2/9] grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>" Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 3/9] grep: inline the return value of a function call used only once Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 4/9] grep tests: move "grep binary" alongside the rest Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 5/9] grep tests: move binary pattern tests into their own file Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 6/9] grep: make the behavior for NUL-byte in patterns sane Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 7/9] grep: drop support for \0 in --fixed-strings <pattern> Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 8/9] grep: remove the kwset optimization Ævar Arnfjörð Bjarmason
2019-06-27 23:39           ` [PATCH v2 9/9] grep: use PCRE v2 for optimized fixed-string search Ævar Arnfjörð Bjarmason
2019-06-26  0:03         ` [RFC/PATCH 1/7] grep: inline the return value of a function call used only once Ævar Arnfjörð Bjarmason
2019-06-26  0:03         ` [RFC/PATCH 2/7] grep tests: move "grep binary" alongside the rest Ævar Arnfjörð Bjarmason
2019-06-26 14:05           ` Johannes Schindelin
2019-06-26 18:13           ` Junio C Hamano
2019-06-26  0:03         ` [RFC/PATCH 3/7] grep tests: move binary pattern tests into their own file Ævar Arnfjörð Bjarmason
2019-06-26  0:03         ` [RFC/PATCH 4/7] grep: make the behavior for \0 in patterns sane Ævar Arnfjörð Bjarmason
2019-06-27  2:03           ` brian m. carlson
2019-06-26  0:03         ` [RFC/PATCH 5/7] grep: drop support for \0 in --fixed-strings <pattern> Ævar Arnfjörð Bjarmason
2019-06-26 16:14           ` Junio C Hamano
2019-06-26  0:03         ` [RFC/PATCH 6/7] grep: remove the kwset optimization Ævar Arnfjörð Bjarmason
2019-06-26  0:03         ` [RFC/PATCH 7/7] grep: use PCRE v2 for optimized fixed-string search Ævar Arnfjörð Bjarmason
2019-06-26 14:13           ` Johannes Schindelin
2019-06-26 18:45             ` Junio C Hamano
2019-06-27  9:31               ` Johannes Schindelin
2019-06-27 18:45                 ` Johannes Schindelin
2019-06-27 19:06                   ` Junio C Hamano
2019-06-28 10:56                     ` Johannes Schindelin
2019-06-13 11:49 ` [PATCH 3/4] winansi: simplify loading the GetCurrentConsoleFontEx() function Johannes Schindelin via GitGitGadget
2019-06-13 11:49 ` [PATCH 4/4] config: avoid calling `labs()` on too-large data type Johannes Schindelin via GitGitGadget
2019-06-13 16:13   ` Junio C Hamano
2019-06-16  6:48   ` René Scharfe
2019-06-16  8:24     ` René Scharfe
2019-06-16 14:01       ` René Scharfe
2019-06-16 22:26         ` Junio C Hamano
2019-06-20 19:58           ` René Scharfe
2019-06-20 21:07             ` Junio C Hamano
2019-06-21 18:35             ` Johannes Schindelin
2019-06-22 10:03               ` René Scharfe
2019-06-22 10:03           ` [PATCH v2 1/3] config: use unsigned_mult_overflows to check for overflows René Scharfe
2019-06-22 10:03           ` [PATCH v2 2/3] config: don't multiply in parse_unit_factor() René Scharfe
2019-06-22 10:03           ` [PATCH v2 3/3] config: simplify parsing of unit factors René Scharfe

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git
