possible bug in regex and dfa

bug-gnulib@gnu.org mirror (unofficial)
 help / color / mirror / Atom feed

* possible bug in regex and dfa
@ 2021-07-15 18:48 Arnold Robbins
  2021-07-17  2:58 ` Paul Eggert
  0 siblings, 1 reply; 8+ messages in thread
From: Arnold Robbins @ 2021-07-15 18:48 UTC (permalink / raw)
  To: bug-gnulib

Hi.

Please see the thread starting at

	https://lists.gnu.org/archive/html/bug-gawk/2021-07/msg00026.html

The regexp used there, ".^", to my mind should be treated as invalid.
Mawk does so, reading the entire file as one record.  Gawk matches a
newline for it:

$ cat data
a.^b
a.^b

$ cat x.awk
BEGIN { RS = ".^" }

{
	gsub(/.^/, ">&<")
	print NR, $0
	print "RT=<" RT ">"
}

$ mawk -f x.awk data
1 a.^b
a.^b

RT=<>

$ ./gawk -f x.awk data
1 a.^b
RT=<
>
2 a.^b
RT=<
>

To make debugging easier, there is a test program in the gawk
git repo that just does regexp matching the way gawk does, called
testdfa.  To use it,

	git clone git://git.savannah.gnu.org/gawk.git
	cd gawk
	./bootstrap && ./configure
	## edit Makefile and support/Makefile to remove -O, add -g
	make -j
	cd helpers
	gcc -g -I.. -I../support testdfa.c ../support/libsupport.a -o testdfa

When run:

$ cd helpers
$ ./testdfa -b '.^' < ../data
Ignorecase: false
Syntax: RE_BACKSLASH_ESCAPE_IN_LISTS|RE_CHAR_CLASSES|RE_CONTEXT_INDEP_ANCHORS|RE_DOT_NEWLINE|RE_INTERVALS|RE_NO_BK_BRACES|RE_NO_BK_PARENS|RE_NO_BK_VBAR|RE_NO_EMPTY_RANGES|RE_UNMATCHED_RIGHT_PAREN_ORD|RE_INVALID_INTERVAL_ORD
Pattern: /.^/, len = 2
After setup_pattern(), len = 2
MB_CUR_MAX = 6
Calling dfacomp(.^, 2, 0x55e9d56a5600, true)
re_search returned position 4 (true)
dfaexec returned 5 (a.^)

If this is supposed to match a newline, I'd like to understand why.
If it's not, I'd like to get a fix for regexp and dfa.  Or if
RE_SYNTAX_GNU_AWK needs more or fewer syntax bits[1], I'd like to
know which, and why.

Please cc me on any and all replies, as I'm not subscribed to
this list.

Thanks,

Arnold

[1] I hate the syntax bits. I have hated them for decades. Sigh.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: possible bug in regex and dfa
  2021-07-15 18:48 possible bug in regex and dfa Arnold Robbins
@ 2021-07-17  2:58 ` Paul Eggert
  2021-07-18  9:01   ` Bruno Haible
  2021-07-18 12:56   ` arnold
  0 siblings, 2 replies; 8+ messages in thread
From: Paul Eggert @ 2021-07-17  2:58 UTC (permalink / raw)
  To: Arnold Robbins, bug-gnulib

On 7/15/21 1:48 PM, Arnold Robbins wrote:
> The regexp used there, ".^", to my mind should be treated as invalid.

No, that regular expression is valid because "." matches newline in 
POSIX EREs. So the "." matches a newline, and the following "^" matches 
the start of the next line.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: possible bug in regex and dfa
  2021-07-17  2:58 ` Paul Eggert
@ 2021-07-18  9:01   ` Bruno Haible
  2021-07-18 12:56   ` arnold
  1 sibling, 0 replies; 8+ messages in thread
From: Bruno Haible @ 2021-07-18  9:01 UTC (permalink / raw)
  To: Arnold Robbins; +Cc: Paul Eggert, bug-gnulib

> No, that regular expression is valid because "." matches newline in 
> POSIX EREs.

And if you don't like this, you need to remove the RE_DOT_NEWLINE flag from
the value that you pass to re_set_syntax.

Bruno



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: possible bug in regex and dfa
  2021-07-17  2:58 ` Paul Eggert
  2021-07-18  9:01   ` Bruno Haible
@ 2021-07-18 12:56   ` arnold
  2021-07-18 16:09     ` Bruno Haible
  1 sibling, 1 reply; 8+ messages in thread
From: arnold @ 2021-07-18 12:56 UTC (permalink / raw)
  To: eggert, bug-gnulib, arnold

Hi.

Paul Eggert <eggert@cs.ucla.edu> wrote:

> On 7/15/21 1:48 PM, Arnold Robbins wrote:
> > The regexp used there, ".^", to my mind should be treated as invalid.
>
> No, that regular expression is valid because "." matches newline in 
> POSIX EREs. So the "." matches a newline, and the following "^" matches 
> the start of the next line.

Bruno Haible <bruno@clisp.org> wrote:

> > No, that regular expression is valid because "." matches newline in 
> > POSIX EREs.
>
> And if you don't like this, you need to remove the RE_DOT_NEWLINE flag from
> the value that you pass to re_set_syntax.

Dot matching newline isn't the issue here.

It's ^ matching in the middle of a string.  For my purposes, ^ should
only match at the beginning of a *string* (as $ should only match at
the end of a string).  I haven't rechecked POSIX, but this is how awk
has behaved since forever. (And how I've documented things in the manual,
also since forever.)

For RS, gawk treats the concatenation of the input files as one long
string, so ^ should only match at the very beginning, and $ at the
very end.

But even for strings the GNU regex routines seem to get it wrong:

$ cat y.awk
BEGIN {
	data = "a.^b\na.^b\n"
	gsub(/.^/, ">&<", data)
	print data
}

$ mawk -f y.awk		# gets it right IMHO
a.^b
a.^b

$ nawk -f y.awk
nawk: syntax error in regular expression .^ at 
 source line number 3 source file y.awk
 context is
		gsub(/.^/, ">&<", >>>  data) <<< 

$ ./gawk -f y.awk
a.^b>
<a.^b>
<

Is there some way I can get the regex routines (and dfa) to relate
to ^ and $ as relative to the *string* and not the *line*?

Thanks,

Arnold

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: possible bug in regex and dfa
  2021-07-18 12:56   ` arnold
@ 2021-07-18 16:09     ` Bruno Haible
  2021-07-18 18:59       ` arnold
  2021-07-18 19:30       ` possible bug in regex and dfa arnold
  0 siblings, 2 replies; 8+ messages in thread
From: Bruno Haible @ 2021-07-18 16:09 UTC (permalink / raw)
  To: bug-gnulib; +Cc: arnold, eggert

Hi Arnold,

> Dot matching newline isn't the issue here.
> 
> It's ^ matching in the middle of a string.  For my purposes, ^ should
> only match at the beginning of a *string* (as $ should only match at
> the end of a string).  I haven't rechecked POSIX, but this is how awk
> has behaved since forever.

Hmm. Regarding POSIX: I've read section 9.3.8 and 9.4.9 of [1],
the description of REG_NOTBOL, REG_NOTEOL in [2], and the description
of REG_NEWLINE in [3]. If I understand it correctly, within POSIX,
".^" should not match a newline because
  - if REG_NEWLINE is set, '^' matches after the newline but '.' does not
    match the newline,
  - if REG_NEWLINE is not set, '.' matches newline but '^' does not match
    after the newline.

However, GNU regex.h also has a flag RE_CONTEXT_INDEP_ANCHORS; I don't know
what effect it has.

> (And how I've documented things in the manual, also since forever.)

If you want the behaviour of the GNU regex to be stable over time, you
should contribute unit tests to tests/test-regex.c. So far, I see unit
tests for the flags
  REG_EXTENDED
  REG_NOSUB
  RE_SYNTAX_POSIX_BASIC
  RE_SYNTAX_GREP
  RE_SYNTAX_EGREP
  RE_SYNTAX_POSIX_EGREP
  RE_SYNTAX_EMACS
  RE_HAT_LISTS_NOT_NEWLINE
  RE_ICASE
  RE_CONTEXT_INVALID_DUP
  RE_NO_EMPTY_RANGES
but no tests at all for
  RE_SYNTAX_AWK
  RE_SYNTAX_GNU_AWK
  RE_SYNTAX_POSIX_AWK
  RE_SYNTAX_POSIX_EXTENDED
  REG_NEWLINE
  REG_NOTBOL
  REG_NOTEOL
  REG_STARTEND

Usually it takes about as many code lines to reasonably unit test some code
as the code itself has. The regex module is over 300 KB large, and its unit
test less than 20 KB large. Already from these figures you can tell that
the regex module is *NEARLY UNTESTED*.

You may say, well, some other unit tests exist in glibc, in sed, in grep,
in awk, in coreutils, etc. But it doesn't help maintenance if the unit tests
are not part of what gets tested by
  ./gnulib-tool --test --single-configure regex

Bruno

[1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
[2] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/regex.h.html
[3] https://pubs.opengroup.org/onlinepubs/9699919799/functions/regexec.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: possible bug in regex and dfa
  2021-07-18 16:09     ` Bruno Haible
@ 2021-07-18 18:59       ` arnold
  2021-07-18 21:45         ` regex unit tests Bruno Haible
  2021-07-18 19:30       ` possible bug in regex and dfa arnold
  1 sibling, 1 reply; 8+ messages in thread
From: arnold @ 2021-07-18 18:59 UTC (permalink / raw)
  To: bug-gnulib, bruno; +Cc: eggert, arnold

Bruno Haible <bruno@clisp.org> wrote:

> Hi Arnold,
>
> > Dot matching newline isn't the issue here.
> > 
> > It's ^ matching in the middle of a string.  For my purposes, ^ should
> > only match at the beginning of a *string* (as $ should only match at
> > the end of a string).  I haven't rechecked POSIX, but this is how awk
> > has behaved since forever.
>
> Hmm. Regarding POSIX: I've read section 9.3.8 and 9.4.9 of [1],
> the description of REG_NOTBOL, REG_NOTEOL in [2], and the description
> of REG_NEWLINE in [3]. If I understand it correctly, within POSIX,
> ".^" should not match a newline because
>   - if REG_NEWLINE is set, '^' matches after the newline but '.' does not
>     match the newline,
>   - if REG_NEWLINE is not set, '.' matches newline but '^' does not match
>     after the newline.

That makes sense.  This is why I felt that, for gawk, ".^" is an invalid
regexp. (Indeed, the original Unix awk rejects it as such.)

REG_NEWLINE is not included in any of the RE_*_AWK definitions since I
want exactly the behavior you describe: dot matches newline but ^ does
not match after the newline.

To me this feels very much like a bug.

> However, GNU regex.h also has a flag RE_CONTEXT_INDEP_ANCHORS; I don't know
> what effect it has.

In this case it makes things worse, causing gawk to match ".^" literally.

> > (And how I've documented things in the manual, also since forever.)
>
> If you want the behaviour of the GNU regex to be stable over time, you
> should contribute unit tests to tests/test-regex.c.

This is a separate issue. It almost sounds like you're saying "it's your
fault there's a bug here, you didn't contribute unit tests".  I hope
that's not your intent; if it is then sorry, I don't buy it.

In any case, I've supplied a regexp, input data, and in the gawk dist,
a test harness, so that debugging can be done if one of the Gnulib
maintainers will look into this particular issue.

Thanks,

Arnold

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: possible bug in regex and dfa
  2021-07-18 16:09     ` Bruno Haible
  2021-07-18 18:59       ` arnold
@ 2021-07-18 19:30       ` arnold
  1 sibling, 0 replies; 8+ messages in thread
From: arnold @ 2021-07-18 19:30 UTC (permalink / raw)
  To: bug-gnulib, bruno; +Cc: eggert, arnold

Hi.

Bruno Haible <bruno@clisp.org> wrote:

>   - if REG_NEWLINE is not set, '.' matches newline but '^' does not match
>     after the newline.

This is indeed the desired behavior, but regex isn't following it.

REG_NEWLINE being set gets translated into preg->newline_anchor. 

Starting at line 620, regexec.c relates to it:

|   /* If initial states with non-begbuf contexts have no elements,
|      the regex must be anchored.  If preg->newline_anchor is set,
|      we'll never use init_state_nl, so do not check it.  */
|   if (dfa->init_state->nodes.nelem == 0
|       && dfa->init_state_word->nodes.nelem == 0
|       && (dfa->init_state_nl->nodes.nelem == 0
| 	  || !preg->newline_anchor))
|     {
|       if (start != 0 && last_start != 0)
|         return REG_NOMATCH;
|       start = last_start = 0;
|     }

(As a side note, I don't think the comment matches the code.)

In my case, preg->newline_anchor is zero (correctly), but
dfa->init_state->nodes.nelem is not, so this body isn't executed.
Making the test for preg->newline_anchor the first thing causes my test
case to work correctly but breaks the gawk test suite.

In other words, I think the bug is somewhere in this area, but I
don't understand the regex internals enough to fix it.  dfa will also
need looking at.

Thanks,

Arnold

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regex unit tests
  2021-07-18 18:59       ` arnold
@ 2021-07-18 21:45         ` Bruno Haible
  0 siblings, 0 replies; 8+ messages in thread
From: Bruno Haible @ 2021-07-18 21:45 UTC (permalink / raw)
  To: arnold; +Cc: eggert, bug-gnulib

Hi Arnold,

> > > (And how I've documented things in the manual, also since forever.)
> >
> > If you want the behaviour of the GNU regex to be stable over time, you
> > should contribute unit tests to tests/test-regex.c.
> 
> This is a separate issue. It almost sounds like you're saying "it's your
> fault there's a bug here, you didn't contribute unit tests".

I'm not talking about past incidents and "fault", because that is generally
useless. I'm talking about the future and what we can do to avoid that
packages that depend on the 'regex' module see regressions.

If a Gnulib module does not have a decent test coverage in Gnulib, then its
bugs and regressions become apparent only after a while and only through
these other packages. A good example of this sequence of events was
<https://lists.gnu.org/archive/html/bug-gnulib/2020-07/msg00036.html>,
but I'm sure you can find many others of the same kind in the mailing
list archive. If, on the other hand, there is a unit test and it runs on
glibc platforms, a regression is likely to be visible in the weekly continuous
integration build <https://gitlab.com/gnulib/gnulib-ci/-/pipelines>.

For the regex module, with 20 KB of tests for 300 KB of code full of
complex algorithms, the test coverage is very thin, and it is *to be expected*
that regressions are only visible once the code is integrated into gawk,
grep, sed, etc. Similarly for the 'dfa' module with 5 KB of tests for 140 KB
of code.

The regex and dfa modules are being maintained here (by Paul, with
contributions from various people), and we have seen that it is not
obvious whether a patch is good or not: sometimes Paul has rejected
patches, sometimes he had to revert patches.

I think it would be good if these two modules had a larger test coverage,
and I'm inviting everyone who can to contribute to these unit tests.

> I hope that's not your intent; if it is then sorry, I don't buy it.

The module doesn't have tests for the

  RE_SYNTAX_AWK
  RE_SYNTAX_GNU_AWK
  RE_SYNTAX_POSIX_AWK

syntaxes. It's gawk which depends on the correct functioning of these
syntaxes, not glibc, not grep, not sed, not emacs. Therefore IMO if
the gawk developers don't contribute some test cases for these syntaxes,
no one will. (I certainly won't, because I find writing tests a bit
boring, and I don't see why I should have the "boring" part whereas
others have the "fun" part :-) )

Bruno

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-07-18 21:45 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-15 18:48 possible bug in regex and dfa Arnold Robbins
2021-07-17  2:58 ` Paul Eggert
2021-07-18  9:01   ` Bruno Haible
2021-07-18 12:56   ` arnold
2021-07-18 16:09     ` Bruno Haible
2021-07-18 18:59       ` arnold
2021-07-18 21:45         ` regex unit tests Bruno Haible
2021-07-18 19:30       ` possible bug in regex and dfa arnold

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).