bug-gnulib@gnu.org mirror (unofficial)
* possible bug in regex and dfa
@ 2021-07-15 18:48 Arnold Robbins
From: Arnold Robbins @ 2021-07-15 18:48 UTC
  To: bug-gnulib

Hi.

Please see the thread starting at

	https://lists.gnu.org/archive/html/bug-gawk/2021-07/msg00026.html

The regexp used there, ".^", to my mind should be treated as invalid.
Mawk behaves that way in effect: the pattern never matches, so it reads
the entire file as one record.  Gawk instead matches a newline:

$ cat data
a.^b
a.^b

$ cat x.awk
BEGIN { RS = ".^" }

{
	gsub(/.^/, ">&<")
	print NR, $0
	print "RT=<" RT ">"
}

$ mawk -f x.awk data
1 a.^b
a.^b

RT=<>

$ ./gawk -f x.awk data
1 a.^b
RT=<
>
2 a.^b
RT=<
>
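
The comparison above can be rerun in one step with a small script. This is
only a sketch: the temporary directory and the `workdir` and `awk_impl`
variable names are illustrative, and it assumes the awks under test are
installed as `mawk` and `gawk` on PATH (the transcript above used a locally
built ./gawk). The output depends on the awk version installed, so none is
shown here.

```shell
#!/bin/sh
# Recreate the two files from the report in a scratch directory
# (workdir is a hypothetical name, not part of the original report).
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

# Two identical input lines, each ending in a newline.
printf 'a.^b\na.^b\n' > data

# Quoted heredoc delimiter so the shell leaves $0 and the quotes alone.
cat > x.awk <<'EOF'
BEGIN { RS = ".^" }

{
	gsub(/.^/, ">&<")
	print NR, $0
	print "RT=<" RT ">"
}
EOF

# Run every implementation that is available; skip the ones that are not.
for awk_impl in mawk gawk; do
	if command -v "$awk_impl" >/dev/null 2>&1; then
		echo "== $awk_impl =="
		"$awk_impl" -f x.awk data || echo "$awk_impl exited with status $?"
	else
		echo "== $awk_impl not installed, skipped =="
	fi
done
```

Note that RT is a gawk extension, so other awks print an empty RT
regardless of what RS matched.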

To make debugging easier, the gawk git repo contains a test program,
testdfa, that just does regexp matching the way gawk does.  To build
it:

	git clone git://git.savannah.gnu.org/gawk.git
	cd gawk
	./bootstrap && ./configure
	## edit Makefile and support/Makefile to remove -O, add -g
	make -j
	cd helpers
	gcc -g -I.. -I../support testdfa.c ../support/libsupport.a -o testdfa

When run:

$ cd helpers
$ ./testdfa -b '.^' < ../data
Ignorecase: false
Syntax: RE_BACKSLASH_ESCAPE_IN_LISTS|RE_CHAR_CLASSES|RE_CONTEXT_INDEP_ANCHORS|RE_DOT_NEWLINE|RE_INTERVALS|RE_NO_BK_BRACES|RE_NO_BK_PARENS|RE_NO_BK_VBAR|RE_NO_EMPTY_RANGES|RE_UNMATCHED_RIGHT_PAREN_ORD|RE_INVALID_INTERVAL_ORD
Pattern: /.^/, len = 2
After setup_pattern(), len = 2
MB_CUR_MAX = 6
Calling dfacomp(.^, 2, 0x55e9d56a5600, true)
re_search returned position 4 (true)
dfaexec returned 5 (a.^)

If this is supposed to match a newline, I'd like to understand why.
If it's not, I'd like to get a fix for regexp and dfa.  Or if
RE_SYNTAX_GNU_AWK needs more or fewer syntax bits[1], I'd like to
know which, and why.

Please cc me on any and all replies, as I'm not subscribed to
this list.

Thanks,

Arnold

[1] I hate the syntax bits. I have hated them for decades. Sigh.

