From: Arnold Robbins <arnold@skeeve.com>
To: bug-gnulib@gnu.org
Subject: possible bug in regex and dfa
Date: Thu, 15 Jul 2021 21:48:12 +0300
Message-ID: <E1m46P6-0003HR-E3@tanda>
Hi.
Please see the thread starting at
https://lists.gnu.org/archive/html/bug-gawk/2021-07/msg00026.html
The regexp used there, ".^", to my mind should be treated as invalid.
Mawk does so, reading the entire file as one record. Gawk instead
matches a newline with it:
$ cat data
a.^b
a.^b
$ cat x.awk
BEGIN { RS = ".^" }
{
gsub(/.^/, ">&<")
print NR, $0
print "RT=<" RT ">"
}
$ mawk -f x.awk data
1 a.^b
a.^b
RT=<>
$ ./gawk -f x.awk data
1 a.^b
RT=<
>
2 a.^b
RT=<
>
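For anyone who wants to reproduce this, the two files shown above can be recreated with a short shell snippet. This is only a convenience sketch: the loop assumes that `mawk` and `gawk` are the command names on your PATH (the transcript above used a locally built `./gawk`), and it makes no claim about what other versions will print.

```shell
# Recreate the input data and the awk script from the transcript.
printf 'a.^b\na.^b\n' > data

cat > x.awk <<'EOF'
BEGIN { RS = ".^" }
{
	gsub(/.^/, ">&<")
	print NR, $0
	print "RT=<" RT ">"
}
EOF

# Run each awk that happens to be installed (names are an assumption).
# A nonzero exit status is reported rather than treated as fatal.
for awk in mawk gawk; do
	if command -v "$awk" >/dev/null 2>&1; then
		echo "== $awk =="
		"$awk" -f x.awk data || echo "$awk exited with status $?"
	fi
done
```

Comparing the record count (NR) and the RT value printed by each implementation shows whether it found a record separator at all.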
To make debugging easier, there is a test program in the gawk
git repo, called testdfa, that just does regexp matching the way
gawk does. To build it:
git clone git://git.savannah.gnu.org/gawk.git
cd gawk
./bootstrap && ./configure
## edit Makefile and support/Makefile to remove -O, add -g
make -j
cd helpers
gcc -g -I.. -I../support testdfa.c ../support/libsupport.a -o testdfa
When run:
$ cd helpers
$ ./testdfa -b '.^' < ../data
Ignorecase: false
Syntax: RE_BACKSLASH_ESCAPE_IN_LISTS|RE_CHAR_CLASSES|RE_CONTEXT_INDEP_ANCHORS|RE_DOT_NEWLINE|RE_INTERVALS|RE_NO_BK_BRACES|RE_NO_BK_PARENS|RE_NO_BK_VBAR|RE_NO_EMPTY_RANGES|RE_UNMATCHED_RIGHT_PAREN_ORD|RE_INVALID_INTERVAL_ORD
Pattern: /.^/, len = 2
After setup_pattern(), len = 2
MB_CUR_MAX = 6
Calling dfacomp(.^, 2, 0x55e9d56a5600, true)
re_search returned position 4 (true)
dfaexec returned 5 (a.^)
If this is supposed to match a newline, I'd like to understand why.
If it's not, I'd like to get a fix for regexp and dfa. Or if
RE_SYNTAX_GNU_AWK needs more or fewer syntax bits[1], I'd like to
know which, and why.
Please cc me on any and all replies, as I'm not subscribed to
this list.
Thanks,
Arnold
[1] I hate the syntax bits. I have hated them for decades. Sigh.