[PATCH] t4062: stop using repetition in regex

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH] t4062: stop using repetition in regex
@ 2017-08-08  6:53 René Scharfe
  2017-08-08 14:49 ` Johannes Schindelin
  0 siblings, 1 reply; 17+ messages in thread
From: René Scharfe @ 2017-08-08  6:53 UTC (permalink / raw
  To: Git List
  Cc: David Coppa, Johannes Schindelin, Junio C Hamano,
	SZEDER Gábor

OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255.
That's the minimum acceptable value according to POSIX.  In t4062 we use
4096 repetitions in the test "-G matches", though, causing it to fail.

Do the same as the test "-S --pickaxe-regex" in the same file and search
for a single zero instead.  That still suffices to trigger the buffer
overrun in older versions (checked with b7d36ffca02^ and --valgrind on
Linux), simplifies the test a bit, and avoids exceeding OpenBSD's limit.

Original-patch-by: David Coppa <dcoppa@openbsd.org>
Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
 t/t4062-diff-pickaxe.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t4062-diff-pickaxe.sh b/t/t4062-diff-pickaxe.sh
index 7c4903f497..c16e5af6fa 100755
--- a/t/t4062-diff-pickaxe.sh
+++ b/t/t4062-diff-pickaxe.sh
@@ -15,7 +15,7 @@ test_expect_success setup '
 	git commit -m "A 4k file"
 '
 test_expect_success '-G matches' '
-	git diff --name-only -G "^0{4096}$" HEAD^ >out &&
+	git diff --name-only -G0 HEAD^ >out &&
 	test 4096-zeroes.txt = "$(cat out)"
 '

-- 
2.14.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-08  6:53 [PATCH] t4062: stop using repetition in regex René Scharfe
@ 2017-08-08 14:49 ` Johannes Schindelin
  2017-08-08 15:18   ` René Scharfe
  0 siblings, 1 reply; 17+ messages in thread
From: Johannes Schindelin @ 2017-08-08 14:49 UTC (permalink / raw
  To: René Scharfe
  Cc: Git List, David Coppa, Junio C Hamano, SZEDER Gábor

[-- Attachment #1: Type: text/plain, Size: 845 bytes --]

Hi René,

On Tue, 8 Aug 2017, René Scharfe wrote:

> OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255.
> That's the minimum acceptable value according to POSIX.  In t4062 we use
> 4096 repetitions in the test "-G matches", though, causing it to fail.
> 
> Do the same as the test "-S --pickaxe-regex" in the same file and search
> for a single zero instead.  That still suffices to trigger the buffer
> overrun in older versions (checked with b7d36ffca02^ and --valgrind on
> Linux), simplifies the test a bit, and avoids exceeding OpenBSD's limit.

I am afraid not. The 4096 is precisely the page size required to trigger
the bug on Windows against which this regression test tries to safeguard.

Maybe simply disable the test on OpenBSD instead? Or guard the {4096}
behind the MINGW prereq.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-08 14:49 ` Johannes Schindelin
@ 2017-08-08 15:18   ` René Scharfe
  2017-08-08 22:09     ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: René Scharfe @ 2017-08-08 15:18 UTC (permalink / raw
  To: Johannes Schindelin
  Cc: Git List, David Coppa, Junio C Hamano, SZEDER Gábor

Am 08.08.2017 um 16:49 schrieb Johannes Schindelin:
> Hi René,
> 
> On Tue, 8 Aug 2017, René Scharfe wrote:
> 
>> OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255.
>> That's the minimum acceptable value according to POSIX.  In t4062 we use
>> 4096 repetitions in the test "-G matches", though, causing it to fail.
>>
>> Do the same as the test "-S --pickaxe-regex" in the same file and search
>> for a single zero instead.  That still suffices to trigger the buffer
>> overrun in older versions (checked with b7d36ffca02^ and --valgrind on
>> Linux), simplifies the test a bit, and avoids exceeding OpenBSD's limit.
> 
> I am afraid not. The 4096 is precisely the page size required to trigger
> the bug on Windows against which this regression test tries to safeguard.

Checked with b7d36ffca02^ on MinGW now as well and found that it
segfaults with the proposed change ten out of ten times.

You get different results?  How is that possible?  The search string is
NUL-terminated in each case, while the point of the test is that the
file contents isn't, right?

> Maybe simply disable the test on OpenBSD instead? Or guard the {4096}
> behind the MINGW prereq.

It's easy to build a long search string with two repetitions or by using
a longer string as the base, if necessary.  But first we need to find out
why regexec() doesn't overflow in your case.  My build uses the version
from compat/.  Why would it stop before reaching a NUL?  That sounds like
a different and serious bug.

Thanks,
René

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-08 15:18   ` René Scharfe
@ 2017-08-08 22:09     ` Junio C Hamano
  2017-08-08 22:26       ` Junio C Hamano
  2017-08-08 22:27       ` René Scharfe
  0 siblings, 2 replies; 17+ messages in thread
From: Junio C Hamano @ 2017-08-08 22:09 UTC (permalink / raw
  To: René Scharfe
  Cc: Johannes Schindelin, Git List, David Coppa, SZEDER Gábor

René Scharfe <l.s.r@web.de> writes:

> Am 08.08.2017 um 16:49 schrieb Johannes Schindelin:
>> Hi René,
>> 
>> On Tue, 8 Aug 2017, René Scharfe wrote:
>> 
>>> OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255.
>>> That's the minimum acceptable value according to POSIX.  In t4062 we use
>>> 4096 repetitions in the test "-G matches", though, causing it to fail.
>>>
>>> Do the same as the test "-S --pickaxe-regex" in the same file and search
>>> for a single zero instead.  That still suffices to trigger the buffer
>>> overrun in older versions (checked with b7d36ffca02^ and --valgrind on
>>> Linux), simplifies the test a bit, and avoids exceeding OpenBSD's limit.
>> 
>> I am afraid not. The 4096 is precisely the page size required to trigger
>> the bug on Windows against which this regression test tries to safeguard.
>
> Checked with b7d36ffca02^ on MinGW now as well and found that it
> segfaults with the proposed change ten out of ten times.

That is a strange result but I can believe it.

The reason why I find it strange is that the test wants to run
diff_grep() in diffcore-pickaxe.c with one == NULL (because we are
looking at difference between an initial empty commit and the
current commit that adds 4096-zeroes.txt file), which makes the
current blob (i.e. a page of '0' that may be mmap(2)ed without
trailing NUL to terminate it) scanned via regexec() to look for the
search string.

I can understand why Dscho originally did "^0{4096}$"; it is to
ensure that the whole page is scanned for 4096 zeroes and then the
code would somehow make sure that there is no more byte until the
end of line, which will force regexec (but not regexec_buf that knows
where the buffer ends) to look at the 4097th byte that does not exist.

If you change the pattern to just "0" that is not anchored, I'd expect
regexec() that does not know how big the haystack is to just find "0"
at the first byte and happily return without segfaulting (because it
does not even have to scan the remainder of the buffer).

So I find Dscho's concern quite valid, even though I do believe you
when you say the code somehow segfaults.  I just can not tell
how/why it would segfault, though---it is possible that regexec()
implementation is stupid and does not realize that it can return early
reporting success without looking at the rest of the buffer, but
somehow I find it unlikely.

Puzzled.

> You get different results?  How is that possible?  The search string is
> NUL-terminated in each case, while the point of the test is that the
> file contents isn't, right?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-08 22:09     ` Junio C Hamano
@ 2017-08-08 22:26       ` Junio C Hamano
  2017-08-08 22:34         ` René Scharfe
  2017-08-08 22:27       ` René Scharfe
  1 sibling, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2017-08-08 22:26 UTC (permalink / raw
  To: René Scharfe
  Cc: Johannes Schindelin, Git List, David Coppa, SZEDER Gábor

Junio C Hamano <gitster@pobox.com> writes:

> So I find Dscho's concern quite valid, even though I do believe you
> when you say the code somehow segfaults.  I just can not tell
> how/why it would segfault, though---it is possible that regexec()
> implementation is stupid and does not realize that it can return early
> reporting success without looking at the rest of the buffer, but
> somehow I find it unlikely.
>
> Puzzled.
>
>> You get different results?  How is that possible?  The search string is
>> NUL-terminated in each case, while the point of the test is that the
>> file contents isn't, right?

Intellectual curiosity tells me we may want to find out why it
fails, but in the meantime, I think replacing the test with "0$" to
force the scanner to find either the end of line or the end of the
buffer may be a good workaround.  We do not have to care how many of
random bytes are in front of the last "0" in order to ensure that
the regexec_buf() does not overstep to 4097th byte, while seeing
that regexec() that does not know how long the haystack is has to do
so, no?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-08 22:09     ` Junio C Hamano
  2017-08-08 22:26       ` Junio C Hamano
@ 2017-08-08 22:27       ` René Scharfe
  1 sibling, 0 replies; 17+ messages in thread
From: René Scharfe @ 2017-08-08 22:27 UTC (permalink / raw
  To: Junio C Hamano
  Cc: Johannes Schindelin, Git List, David Coppa, SZEDER Gábor

Am 09.08.2017 um 00:09 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
> 
>> Am 08.08.2017 um 16:49 schrieb Johannes Schindelin:
>>> Hi René,
>>>
>>> On Tue, 8 Aug 2017, René Scharfe wrote:
>>>
>>>> OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255.
>>>> That's the minimum acceptable value according to POSIX.  In t4062 we use
>>>> 4096 repetitions in the test "-G matches", though, causing it to fail.
>>>>
>>>> Do the same as the test "-S --pickaxe-regex" in the same file and search
>>>> for a single zero instead.  That still suffices to trigger the buffer
>>>> overrun in older versions (checked with b7d36ffca02^ and --valgrind on
>>>> Linux), simplifies the test a bit, and avoids exceeding OpenBSD's limit.
>>>
>>> I am afraid not. The 4096 is precisely the page size required to trigger
>>> the bug on Windows against which this regression test tries to safeguard.
>>
>> Checked with b7d36ffca02^ on MinGW now as well and found that it
>> segfaults with the proposed change ten out of ten times.
> 
> That is a strange result but I can believe it.
> 
> The reason why I find it strange is that the test wants to run
> diff_grep() in diffcore-pickaxe.c with one == NULL (because we are
> looking at difference between an initial empty commit and the
> current commit that adds 4096-zeroes.txt file), which makes the
> current blob (i.e. a page of '0' that may be mmap(2)ed without
> trailing NUL to terminate it) scanned via regexec() to look for the
> search string.
> 
> I can understand why Dscho originally did "^0{4096}$"; it is to
> ensure that the whole page is scanned for 4096 zeroes and then the
> code would somehow make sure that there is no more byte until the
> end of line, which will force regexec (but not regexec_buf that knows
> where the buffer ends) to look at the 4097th byte that does not exist.
> 
> If you change the pattern to just "0" that is not anchored, I'd expect
> regexec() that does not know how big the haystack is to just find "0"
> at the first byte and happily return without segfaulting (because it
> does not even have to scan the remainder of the buffer).
> 
> So I find Dscho's concern quite valid, even though I do believe you
> when you say the code somehow segfaults.  I just can not tell
> how/why it would segfault, though---it is possible that regexec()
> implementation is stupid and does not realize that it can return early
> reporting success without looking at the rest of the buffer, but
> somehow I find it unlikely.
> 
> Puzzled.

Good point.  Valgrind reports:

==57466== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==57466==  Access not within mapped region at address 0x4027000
==57466==    at 0x4C2EDF4: strlen (vg_replace_strmem.c:458)
==57466==    by 0x59D9F76: regexec@@GLIBC_2.3.4 (regexec.c:240)
==57466==    by 0x54D96E: diff_grep (diffcore-pickaxe.c:0)
==57466==    by 0x54DAC3: pickaxe_match (diffcore-pickaxe.c:149)

And you can see in our version in compat/regex/regexec.c:241 that the
first thing regexec() does is calling strlen().

So to avoid depending on that implementation detail we'd need to use
a search string that won't be found (e.g. "1") or with unlimited
repetition (e.g. "0*"), right?

René

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-08 22:26       ` Junio C Hamano
@ 2017-08-08 22:34         ` René Scharfe
  2017-08-09  5:29           ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: René Scharfe @ 2017-08-08 22:34 UTC (permalink / raw
  To: Junio C Hamano
  Cc: Johannes Schindelin, Git List, David Coppa, SZEDER Gábor

Am 09.08.2017 um 00:26 schrieb Junio C Hamano:
> Junio C Hamano <gitster@pobox.com> writes:
> 
>> So I find Dscho's concern quite valid, even though I do believe you
>> when you say the code somehow segfaults.  I just can not tell
>> how/why it would segfault, though---it is possible that regexec()
>> implementation is stupid and does not realize that it can return early
>> reporting success without looking at the rest of the buffer, but
>> somehow I find it unlikely.
>>
>> Puzzled.
>>
>>> You get different results?  How is that possible?  The search string is
>>> NUL-terminated in each case, while the point of the test is that the
>>> file contents isn't, right?
> 
> Intellectual curiosity tells me we may want to find out why it
> fails, but in the meantime, I think replacing the test with "0$" to
> force the scanner to find either the end of line or the end of the
> buffer may be a good workaround.  We do not have to care how many of
> random bytes are in front of the last "0" in order to ensure that
> the regexec_buf() does not overstep to 4097th byte, while seeing
> that regexec() that does not know how long the haystack is has to do
> so, no?

Our regexec() calls strlen() (see my other reply).

Using "0$" looks like the best option to me.

René

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-08 22:34         ` René Scharfe
@ 2017-08-09  5:29           ` Junio C Hamano
  2017-08-09  6:15             ` René Scharfe
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2017-08-09  5:29 UTC (permalink / raw
  To: René Scharfe
  Cc: Johannes Schindelin, Git List, David Coppa, SZEDER Gábor

René Scharfe <l.s.r@web.de> writes:

> Am 09.08.2017 um 00:26 schrieb Junio C Hamano:
>> ... but in the meantime, I think replacing the test with "0$" to
>> force the scanner to find either the end of line or the end of the
>> buffer may be a good workaround.  We do not have to care how many of
>> random bytes are in front of the last "0" in order to ensure that
>> the regexec_buf() does not overstep to 4097th byte, while seeing
>> that regexec() that does not know how long the haystack is has to do
>> so, no?
>
> Our regexec() calls strlen() (see my other reply).
>
> Using "0$" looks like the best option to me.

Yeah, it seems that way.  If we want to be close/faithful to the
original, we could do "^0*$", but the part that is essential to
trigger the old bug is not the "we have many zeroes" (or "we have
4096 zeroes") part, but "zero is at the end of the string" part, so
"0$" would be the minimal pattern that also would work for OBSD.

Dscho?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-09  5:29           ` Junio C Hamano
@ 2017-08-09  6:15             ` René Scharfe
  2017-08-09 14:15               ` René Scharfe
  0 siblings, 1 reply; 17+ messages in thread
From: René Scharfe @ 2017-08-09  6:15 UTC (permalink / raw
  To: Junio C Hamano
  Cc: Johannes Schindelin, Git List, David Coppa, SZEDER Gábor

Am 09.08.2017 um 07:29 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
> 
>> Am 09.08.2017 um 00:26 schrieb Junio C Hamano:
>>> ... but in the meantime, I think replacing the test with "0$" to
>>> force the scanner to find either the end of line or the end of the
>>> buffer may be a good workaround.  We do not have to care how many of
>>> random bytes are in front of the last "0" in order to ensure that
>>> the regexec_buf() does not overstep to 4097th byte, while seeing
>>> that regexec() that does not know how long the haystack is has to do
>>> so, no?
>>
>> Our regexec() calls strlen() (see my other reply).
>>
>> Using "0$" looks like the best option to me.
> 
> Yeah, it seems that way.  If we want to be close/faithful to the
> original, we could do "^0*$", but the part that is essential to
> trigger the old bug is not the "we have many zeroes" (or "we have
> 4096 zeroes") part, but "zero is at the end of the string" part, so
> "0$" would be the minimal pattern that also would work for OBSD.

Thought about it a bit more.

"^0{4096}$" checks if the byte after the buffer is \n or \0 in the
hope of triggering a segfault.  On Linux I can access that byte just
fine; perhaps there is no guard page.  Also there is a 2 in 256
chance of the byte being \n or \0 (provided its value is random),
which would cause the test to falsely report success.

"0$" effectively looks for "0\n" or "0\0", which can only occur
after the buffer.  If that string is found close enough then we
may not trigger a segfault and report a false positive.

In the face of unreliable segfaults we need to reverse our strategy,
I think.  Searching for something not in the buffer (e.g. "1") and
considering matches and segfaults as confirmation that the bug is
still present should avoid any false positives.  Right?

Thanks,
René

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-09  6:15             ` René Scharfe
@ 2017-08-09 14:15               ` René Scharfe
  2017-08-09 14:25                 ` David Coppa
  2017-08-09 16:07                 ` Junio C Hamano
  0 siblings, 2 replies; 17+ messages in thread
From: René Scharfe @ 2017-08-09 14:15 UTC (permalink / raw
  To: Junio C Hamano
  Cc: Johannes Schindelin, Git List, David Coppa, SZEDER Gábor

Am 09.08.2017 um 08:15 schrieb René Scharfe:
> Am 09.08.2017 um 07:29 schrieb Junio C Hamano:
>> René Scharfe <l.s.r@web.de> writes:
>>
>>> Am 09.08.2017 um 00:26 schrieb Junio C Hamano:
>>>> ... but in the meantime, I think replacing the test with "0$" to
>>>> force the scanner to find either the end of line or the end of the
>>>> buffer may be a good workaround.  We do not have to care how many of
>>>> random bytes are in front of the last "0" in order to ensure that
>>>> the regexec_buf() does not overstep to 4097th byte, while seeing
>>>> that regexec() that does not know how long the haystack is has to do
>>>> so, no?
>>>
>>> Our regexec() calls strlen() (see my other reply).
>>>
>>> Using "0$" looks like the best option to me.
>>
>> Yeah, it seems that way.  If we want to be close/faithful to the
>> original, we could do "^0*$", but the part that is essential to
>> trigger the old bug is not the "we have many zeroes" (or "we have
>> 4096 zeroes") part, but "zero is at the end of the string" part, so
>> "0$" would be the minimal pattern that also would work for OBSD.
> 
> Thought about it a bit more.
> 
> "^0{4096}$" checks if the byte after the buffer is \n or \0 in the
> hope of triggering a segfault.  On Linux I can access that byte just
> fine; perhaps there is no guard page.  Also there is a 2 in 256
> chance of the byte being \n or \0 (provided its value is random),
> which would cause the test to falsely report success.
> 
> "0$" effectively looks for "0\n" or "0\0", which can only occur
> after the buffer.  If that string is found close enough then we
> may not trigger a segfault and report a false positive.
> 
> In the face of unreliable segfaults we need to reverse our strategy,
> I think.  Searching for something not in the buffer (e.g. "1") and
> considering matches and segfaults as confirmation that the bug is
> still present should avoid any false positives.  Right?

And that's not it either. *sigh*

If the 4097th byte is NUL or LF then we can only hope its access
triggers a segfault -- there is no other way to distinguish the
result from a legitimate match when limiting with REG_STARTEND.  So
we have to accept this flakiness.

We can check the value of that byte with [^0] and interpret a
match as failure, but that adds negation and makes the test more
complex.

^0*$ would falsely match if the 4097th byte (and possibly later
ones) is 0.  We need to make sure we check for end-of-line after
the 4096th byte, not later.

Sorry, Dscho, I thought we could take a shortcut here, but -- as
you wrote all along -- we can't.

So how about this?

-- >8 --
Subject: [PATCH] t4062: use less than 256 repetitions in regex

OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255.
That's the minimum acceptable value according to POSIX.  In t4062 we use
4096 repetitions in the test "-G matches", though, causing it to fail.
Combine two repetition operators, both less than 256, to arrive at 4096
zeros instead of using a single one, to fix the test on OpenBSD.

Original-patch-by: David Coppa <dcoppa@openbsd.org>
Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
 t/t4062-diff-pickaxe.sh | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/t/t4062-diff-pickaxe.sh b/t/t4062-diff-pickaxe.sh
index 7c4903f497..1130c8019b 100755
--- a/t/t4062-diff-pickaxe.sh
+++ b/t/t4062-diff-pickaxe.sh
@@ -14,8 +14,10 @@ test_expect_success setup '
 	test_tick &&
 	git commit -m "A 4k file"
 '
+
+# OpenBSD only supports up to 255 repetitions, so repeat twice for 64*64=4096.
 test_expect_success '-G matches' '
-	git diff --name-only -G "^0{4096}$" HEAD^ >out &&
+	git diff --name-only -G "^(0{64}){64}$" HEAD^ >out &&
 	test 4096-zeroes.txt = "$(cat out)"
 '
 
-- 
2.14.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-09 14:15               ` René Scharfe
@ 2017-08-09 14:25                 ` David Coppa
  2017-08-09 21:49                   ` Johannes Schindelin
  2017-08-09 16:07                 ` Junio C Hamano
  1 sibling, 1 reply; 17+ messages in thread
From: David Coppa @ 2017-08-09 14:25 UTC (permalink / raw
  To: René Scharfe
  Cc: Junio C Hamano, Johannes Schindelin, Git List, David Coppa,
	SZEDER Gábor

On Wed, Aug 9, 2017 at 4:15 PM, René Scharfe <l.s.r@web.de> wrote:
> Am 09.08.2017 um 08:15 schrieb René Scharfe:
>> Am 09.08.2017 um 07:29 schrieb Junio C Hamano:
>>> René Scharfe <l.s.r@web.de> writes:
>>>
>>>> Am 09.08.2017 um 00:26 schrieb Junio C Hamano:
>>>>> ... but in the meantime, I think replacing the test with "0$" to
>>>>> force the scanner to find either the end of line or the end of the
>>>>> buffer may be a good workaround.  We do not have to care how many of
>>>>> random bytes are in front of the last "0" in order to ensure that
>>>>> the regexec_buf() does not overstep to 4097th byte, while seeing
>>>>> that regexec() that does not know how long the haystack is has to do
>>>>> so, no?
>>>>
>>>> Our regexec() calls strlen() (see my other reply).
>>>>
>>>> Using "0$" looks like the best option to me.
>>>
>>> Yeah, it seems that way.  If we want to be close/faithful to the
>>> original, we could do "^0*$", but the part that is essential to
>>> trigger the old bug is not the "we have many zeroes" (or "we have
>>> 4096 zeroes") part, but "zero is at the end of the string" part, so
>>> "0$" would be the minimal pattern that also would work for OBSD.
>>
>> Thought about it a bit more.
>>
>> "^0{4096}$" checks if the byte after the buffer is \n or \0 in the
>> hope of triggering a segfault.  On Linux I can access that byte just
>> fine; perhaps there is no guard page.  Also there is a 2 in 256
>> chance of the byte being \n or \0 (provided its value is random),
>> which would cause the test to falsely report success.
>>
>> "0$" effectively looks for "0\n" or "0\0", which can only occur
>> after the buffer.  If that string is found close enough then we
>> may not trigger a segfault and report a false positive.
>>
>> In the face of unreliable segfaults we need to reverse our strategy,
>> I think.  Searching for something not in the buffer (e.g. "1") and
>> considering matches and segfaults as confirmation that the bug is
>> still present should avoid any false positives.  Right?
>
> And that's not it either. *sigh*
>
> If the 4097th byte is NUL or LF then we can only hope its access
> triggers a segfault -- there is no other way to distinguish the
> result from a legitimate match when limiting with REG_STARTEND.  So
> we have to accept this flakiness.
>
> We can check the value of that byte with [^0] and interpret a
> match as failure, but that adds negation and makes the test more
> complex.
>
> ^0*$ would falsely match if the 4097th byte (and possibly later
> ones) is 0.  We need to make sure we check for end-of-line after
> the 4096th byte, not later.
>
> Sorry, Dscho, I thought we could take a shortcut here, but -- as
> you wrote all along -- we can't.
>
> So how about this?
>
> -- >8 --
> Subject: [PATCH] t4062: use less than 256 repetitions in regex
>
> OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255.
> That's the minimum acceptable value according to POSIX.  In t4062 we use
> 4096 repetitions in the test "-G matches", though, causing it to fail.
> Combine two repetition operators, both less than 256, to arrive at 4096
> zeros instead of using a single one, to fix the test on OpenBSD.
>
> Original-patch-by: David Coppa <dcoppa@openbsd.org>
> Signed-off-by: Rene Scharfe <l.s.r@web.de>
> ---
>  t/t4062-diff-pickaxe.sh | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/t/t4062-diff-pickaxe.sh b/t/t4062-diff-pickaxe.sh
> index 7c4903f497..1130c8019b 100755
> --- a/t/t4062-diff-pickaxe.sh
> +++ b/t/t4062-diff-pickaxe.sh
> @@ -14,8 +14,10 @@ test_expect_success setup '
>         test_tick &&
>         git commit -m "A 4k file"
>  '
> +
> +# OpenBSD only supports up to 255 repetitions, so repeat twice for 64*64=4096.
>  test_expect_success '-G matches' '
> -       git diff --name-only -G "^0{4096}$" HEAD^ >out &&
> +       git diff --name-only -G "^(0{64}){64}$" HEAD^ >out &&
>         test 4096-zeroes.txt = "$(cat out)"
>  '
>
> --
> 2.14.0

I think this should work w/o problems on OpenBSD:

$ uname -a
OpenBSD open.bsdgeek.it 6.1 GENERIC#54 amd64
$ git diff --name-only -G "^0{4096}$" HEAD^ 1>/dev/null
fatal: invalid regex: invalid repetition count(s)
$ git diff --name-only -G "^(0{64}){64}$" HEAD^ 1>/dev/null
$

Ciao!
David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-09 14:15               ` René Scharfe
  2017-08-09 14:25                 ` David Coppa
@ 2017-08-09 16:07                 ` Junio C Hamano
  2017-08-09 17:20                   ` René Scharfe
  1 sibling, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2017-08-09 16:07 UTC (permalink / raw
  To: René Scharfe
  Cc: Johannes Schindelin, Git List, David Coppa, SZEDER Gábor

René Scharfe <l.s.r@web.de> writes:

>> In the face of unreliable segfaults we need to reverse our strategy,
>> I think.  Searching for something not in the buffer (e.g. "1") and
>> considering matches and segfaults as confirmation that the bug is
>> still present should avoid any false positives.  Right?
>
> And that's not it either. *sigh*
>
> If the 4097th byte is NUL or LF then we can only hope its access
> triggers a segfault -- there is no other way to distinguish the
> result from a legitimate match when limiting with REG_STARTEND.  So
> we have to accept this flakiness.

> We can check the value of that byte with [^0] and interpret a
> match as failure, but that adds negation and makes the test more
> complex.
>
> ^0*$ would falsely match if the 4097th byte (and possibly later
> ones) is 0.  We need to make sure we check for end-of-line after
> the 4096th byte, not later.

I do not have a strong objection against "^0{64}{64}$", but let me
make sure that I understand your rationale correctly.

We assume that we do not have an unmapped guard page to reliably
trap us.  So "^0{64}{64}$" will report a false success when that
4097th byte is either NUL or LF.  There is 2/256 probability of that
happening but we accept it.

A pattern "0$" will give us a false success in the same case, but
the reason why it is worse is because in addition, we get a false
success if that second page begins with "0\0" or "0\n".  The chance
of that happening is additional 2/256 * 1/256.  Worse yet, the page
could start with "00\0" or "00\n", which is additional 2/256 *
1/65536.  We could have yet another "0" at the beginning of that
second page, which only increases the chance of us getting a false
sucess.

We consider that 2/256 * (1/256 + 1/256^2 + 1/256^3 + ...) is too
big an additional chance of getting false success to be acceptable,
even though 2/256 which is the chance of false success that we have
to and are willing to accept.

Now I am not good at doing math on the keyboard and text terminal,
but

    X    = (1/256 + 1/256^2 + ...)
    256X = 1 + X
    X    = 1/255

so that additional rate of false success is 2/256 * 1/255.

So we are saying that we accept ~1/100 false success rate, but
additional ~1/30000 is unacceptable.

I do not know if I buy that argument, but I do think that additional
false success rate, even if it is miniscule, is unnecessary.  So as
long as everybody's regexp library is happy with "^0{64}{64}$",
let's use that.

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-09 16:07                 ` Junio C Hamano
@ 2017-08-09 17:20                   ` René Scharfe
  2017-08-09 17:47                     ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: René Scharfe @ 2017-08-09 17:20 UTC (permalink / raw
  To: Junio C Hamano
  Cc: Johannes Schindelin, Git List, David Coppa, SZEDER Gábor

Am 09.08.2017 um 18:07 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
> 
>>> In the face of unreliable segfaults we need to reverse our strategy,
>>> I think.  Searching for something not in the buffer (e.g. "1") and
>>> considering matches and segfaults as confirmation that the bug is
>>> still present should avoid any false positives.  Right?
>>
>> And that's not it either. *sigh*
>>
>> If the 4097th byte is NUL or LF then we can only hope its access
>> triggers a segfault -- there is no other way to distinguish the
>> result from a legitimate match when limiting with REG_STARTEND.  So
>> we have to accept this flakiness.
> 
>> We can check the value of that byte with [^0] and interpret a
>> match as failure, but that adds negation and makes the test more
>> complex.
>>
>> ^0*$ would falsely match if the 4097th byte (and possibly later
>> ones) is 0.  We need to make sure we check for end-of-line after
>> the 4096th byte, not later.
> 
> I do not have a strong objection against "^0{64}{64}$", but let me
> make sure that I understand your rationale correctly.
> 
> We assume that we do not have an unmapped guard page to reliably
> trap us.  So "^0{64}{64}$" will report a false success when that
> 4097th byte is either NUL or LF.  There is 2/256 probability of that
> happening but we accept it.
> 
> A pattern "0$" will give us a false success in the same case, but
> the reason why it is worse is because in addition, we get a false
> success if that second page begins with "0\0" or "0\n".  The chance
> of that happening is additional 2/256 * 1/256.  Worse yet, the page
> could start with "00\0" or "00\n", which is additional 2/256 *
> 1/65536.  We could have yet another "0" at the beginning of that
> second page, which only increases the chance of us getting a false
> sucess.

I demonstrated a lack of logical thinking and now you bring on
probabilities!? ;-)

There could be any characters except NUL and LF between the 4096 zeros
and "0$" for the latter to match wrongly, no?  So there are 4095
opportunities for the misleading pattern in a page, with probabilities
like this:

  0$                          1/256 * 2/256
  .0$         254/256       * 1/256 * 2/256
  ..0$       (254/256)^2    * 1/256 * 2/256
  .{3}0$     (254/256)^3    * 1/256 * 2/256

  .{4094}0$  (254/256)^4094 * 1/256 * 2/256

That sums up to ca. 1/256 (did that numerically).  Does that make
sense?

> So we are saying that we accept ~1/100 false success rate, but
> additional ~1/30000 is unacceptable.
> 
> I do not know if I buy that argument, but I do think that additional
> false success rate, even if it is miniscule, is unnecessary.  So as
> long as everybody's regexp library is happy with "^0{64}{64}$",
> let's use that.

The parentheses are necessary ("^(0{64}){64}$"), at least on OpenBSD.
It doesn't accept consecutive repetition operators without them.  We
could use "^(00000000000000000000000000000000){128}$" instead if
there's a problem with that.

René

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-09 17:20                   ` René Scharfe
@ 2017-08-09 17:47                     ` Junio C Hamano
  2017-08-10  6:08                       ` René Scharfe
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2017-08-09 17:47 UTC (permalink / raw
  To: René Scharfe
  Cc: Johannes Schindelin, Git List, David Coppa, SZEDER Gábor

René Scharfe <l.s.r@web.de> writes:

> There could be any characters except NUL and LF between the 4096 zeros
> and "0$" for the latter to match wrongly, no?  So there are 4095
> opportunities for the misleading pattern in a page, with probabilities
> like this:
>
>   0$                          1/256 * 2/256
>   .0$         254/256       * 1/256 * 2/256
>   ..0$       (254/256)^2    * 1/256 * 2/256
>   .{3}0$     (254/256)^3    * 1/256 * 2/256
>
>   .{4094}0$  (254/256)^4094 * 1/256 * 2/256
>
> That sums up to ca. 1/256 (did that numerically).  Does that make
> sense?

Yes, thanks.  I think the number would be different for "^0*$" (the
above is for "0$") and moves it down to ~1/30000, but as I said,
allowing additional false success rate is unnecessary (even if it is
miniscule enough to be acceptable), so let's take the 64*64 patch.

>> So we are saying that we accept ~1/100 false success rate, but
>> additional ~1/30000 is unacceptable.
>> 
>> I do not know if I buy that argument, but I do think that additional
>> false success rate, even if it is miniscule, is unnecessary.  So as
>> long as everybody's regexp library is happy with "^0{64}{64}$",
>> let's use that.
>
> The parentheses are necessary ("^(0{64}){64}$"), at least on OpenBSD.

Sorry, what I wrote was merely a typo; the one from you I applied
did have the parens so we are good.

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-09 14:25                 ` David Coppa
@ 2017-08-09 21:49                   ` Johannes Schindelin
  0 siblings, 0 replies; 17+ messages in thread
From: Johannes Schindelin @ 2017-08-09 21:49 UTC (permalink / raw
  To: David Coppa
  Cc: René Scharfe, Junio C Hamano, Git List, David Coppa,
	SZEDER Gábor

[-- Attachment #1: Type: text/plain, Size: 4660 bytes --]

Hi all,

On Wed, 9 Aug 2017, David Coppa wrote:

> On Wed, Aug 9, 2017 at 4:15 PM, René Scharfe <l.s.r@web.de> wrote:
> > Am 09.08.2017 um 08:15 schrieb René Scharfe:
> >> Am 09.08.2017 um 07:29 schrieb Junio C Hamano:
> >>> René Scharfe <l.s.r@web.de> writes:
> >>>
> >>>> Am 09.08.2017 um 00:26 schrieb Junio C Hamano:
> >>>>> ... but in the meantime, I think replacing the test with "0$" to
> >>>>> force the scanner to find either the end of line or the end of the
> >>>>> buffer may be a good workaround.  We do not have to care how many of
> >>>>> random bytes are in front of the last "0" in order to ensure that
> >>>>> the regexec_buf() does not overstep to 4097th byte, while seeing
> >>>>> that regexec() that does not know how long the haystack is has to do
> >>>>> so, no?
> >>>>
> >>>> Our regexec() calls strlen() (see my other reply).
> >>>>
> >>>> Using "0$" looks like the best option to me.
> >>>
> >>> Yeah, it seems that way.  If we want to be close/faithful to the
> >>> original, we could do "^0*$", but the part that is essential to
> >>> trigger the old bug is not the "we have many zeroes" (or "we have
> >>> 4096 zeroes") part, but "zero is at the end of the string" part, so
> >>> "0$" would be the minimal pattern that also would work for OBSD.
> >>
> >> Thought about it a bit more.
> >>
> >> "^0{4096}$" checks if the byte after the buffer is \n or \0 in the
> >> hope of triggering a segfault.  On Linux I can access that byte just
> >> fine; perhaps there is no guard page.  Also there is a 2 in 256
> >> chance of the byte being \n or \0 (provided its value is random),
> >> which would cause the test to falsely report success.
> >>
> >> "0$" effectively looks for "0\n" or "0\0", which can only occur
> >> after the buffer.  If that string is found close enough then we
> >> may not trigger a segfault and report a false positive.
> >>
> >> In the face of unreliable segfaults we need to reverse our strategy,
> >> I think.  Searching for something not in the buffer (e.g. "1") and
> >> considering matches and segfaults as confirmation that the bug is
> >> still present should avoid any false positives.  Right?
> >
> > And that's not it either. *sigh*
> >
> > If the 4097th byte is NUL or LF then we can only hope its access
> > triggers a segfault -- there is no other way to distinguish the
> > result from a legitimate match when limiting with REG_STARTEND.  So
> > we have to accept this flakiness.
> >
> > We can check the value of that byte with [^0] and interpret a
> > match as failure, but that adds negation and makes the test more
> > complex.
> >
> > ^0*$ would falsely match if the 4097th byte (and possibly later
> > ones) is 0.  We need to make sure we check for end-of-line after
> > the 4096th byte, not later.
> >
> > Sorry, Dscho, I thought we could take a shortcut here, but -- as
> > you wrote all along -- we can't.
> >
> > So how about this?
> >
> > -- >8 --
> > Subject: [PATCH] t4062: use less than 256 repetitions in regex
> >
> > OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255.
> > That's the minimum acceptable value according to POSIX.  In t4062 we use
> > 4096 repetitions in the test "-G matches", though, causing it to fail.
> > Combine two repetition operators, both less than 256, to arrive at 4096
> > zeros instead of using a single one, to fix the test on OpenBSD.
> >
> > Original-patch-by: David Coppa <dcoppa@openbsd.org>
> > Signed-off-by: Rene Scharfe <l.s.r@web.de>
> > ---
> >  t/t4062-diff-pickaxe.sh | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/t/t4062-diff-pickaxe.sh b/t/t4062-diff-pickaxe.sh
> > index 7c4903f497..1130c8019b 100755
> > --- a/t/t4062-diff-pickaxe.sh
> > +++ b/t/t4062-diff-pickaxe.sh
> > @@ -14,8 +14,10 @@ test_expect_success setup '
> >         test_tick &&
> >         git commit -m "A 4k file"
> >  '
> > +
> > +# OpenBSD only supports up to 255 repetitions, so repeat twice for 64*64=4096.
> >  test_expect_success '-G matches' '
> > -       git diff --name-only -G "^0{4096}$" HEAD^ >out &&
> > +       git diff --name-only -G "^(0{64}){64}$" HEAD^ >out &&
> >         test 4096-zeroes.txt = "$(cat out)"
> >  '
> >
> > --
> > 2.14.0
> 
> I think this should work w/o problems on OpenBSD:
> 
> $ uname -a
> OpenBSD open.bsdgeek.it 6.1 GENERIC#54 amd64
> $ git diff --name-only -G "^0{4096}$" HEAD^ 1>/dev/null
> fatal: invalid regex: invalid repetition count(s)
> $ git diff --name-only -G "^(0{64}){64}$" HEAD^ 1>/dev/null
> $

Perfect!

Thank y'all for being so thorough,
Dscho

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-09 17:47                     ` Junio C Hamano
@ 2017-08-10  6:08                       ` René Scharfe
  2017-08-11 18:20                         ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: René Scharfe @ 2017-08-10  6:08 UTC (permalink / raw
  To: Junio C Hamano
  Cc: Johannes Schindelin, Git List, David Coppa, SZEDER Gábor

Am 09.08.2017 um 19:47 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
> 
>> There could be any characters except NUL and LF between the 4096 zeros
>> and "0$" for the latter to match wrongly, no?  So there are 4095
>> opportunities for the misleading pattern in a page, with probabilities
>> like this:
>>
>>    0$                          1/256 * 2/256
>>    .0$         254/256       * 1/256 * 2/256
>>    ..0$       (254/256)^2    * 1/256 * 2/256
>>    .{3}0$     (254/256)^3    * 1/256 * 2/256
>>
>>    .{4094}0$  (254/256)^4094 * 1/256 * 2/256
>>
>> That sums up to ca. 1/256 (did that numerically).  Does that make
>> sense?
> 
> Yes, thanks.  I think the number would be different for "^0*$" (the
> above is for "0$") and moves it down to ~1/30000, but as I said,
> allowing additional false success rate is unnecessary (even if it is
> miniscule enough to be acceptable), so let's take the 64*64 patch.

Ah, right, now I get your calculation in the email I replied to above.
"^0*$" has a probability of 2/255 to produce false positives.

Thanks,
René

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] t4062: stop using repetition in regex
  2017-08-10  6:08                       ` René Scharfe
@ 2017-08-11 18:20                         ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2017-08-11 18:20 UTC (permalink / raw
  To: René Scharfe
  Cc: Johannes Schindelin, Git List, David Coppa, SZEDER Gábor

René Scharfe <l.s.r@web.de> writes:

> Am 09.08.2017 um 19:47 schrieb Junio C Hamano:
>> René Scharfe <l.s.r@web.de> writes:
>> 
>>> There could be any characters except NUL and LF between the 4096 zeros
>>> and "0$" for the latter to match wrongly, no?  So there are 4095
>>> opportunities for the misleading pattern in a page, with probabilities
>>> like this:
>>>
>>>    0$                          1/256 * 2/256
>>>    .0$         254/256       * 1/256 * 2/256
>>>    ..0$       (254/256)^2    * 1/256 * 2/256
>>>    .{3}0$     (254/256)^3    * 1/256 * 2/256
>>>
>>>    .{4094}0$  (254/256)^4094 * 1/256 * 2/256
>>>
>>> That sums up to ca. 1/256 (did that numerically).  Does that make
>>> sense?
>> 
>> Yes, thanks.  I think the number would be different for "^0*$" (the
>> above is for "0$") and moves it down to ~1/30000, but as I said,
>> allowing additional false success rate is unnecessary (even if it is
>> miniscule enough to be acceptable), so let's take the 64*64 patch.
>
> Ah, right, now I get your calculation in the email I replied to above.
> "^0*$" has a probability of 2/255 to produce false positives.

Yes, and that is larger than 2/256 we would have to accept with the
original "^0{4096}$" or the updated "^(0{64}){64}$" by ~1/30000,
which is unnecessary additional false rate of success.

Thanks.





^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-08-11 18:21 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-08-08  6:53 [PATCH] t4062: stop using repetition in regex René Scharfe
2017-08-08 14:49 ` Johannes Schindelin
2017-08-08 15:18   ` René Scharfe
2017-08-08 22:09     ` Junio C Hamano
2017-08-08 22:26       ` Junio C Hamano
2017-08-08 22:34         ` René Scharfe
2017-08-09  5:29           ` Junio C Hamano
2017-08-09  6:15             ` René Scharfe
2017-08-09 14:15               ` René Scharfe
2017-08-09 14:25                 ` David Coppa
2017-08-09 21:49                   ` Johannes Schindelin
2017-08-09 16:07                 ` Junio C Hamano
2017-08-09 17:20                   ` René Scharfe
2017-08-09 17:47                     ` Junio C Hamano
2017-08-10  6:08                       ` René Scharfe
2017-08-11 18:20                         ` Junio C Hamano
2017-08-08 22:27       ` René Scharfe

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).