From: arnold@skeeve.com
To: sam@gentoo.org, arnold@skeeve.com
Cc: bug-gnulib@gnu.org, concord@gentoo.org, bug-gawk@gnu.org
Subject: Re: Clang-built Gawk 5.2.1 regex oddity
Date: Sun, 01 Jan 2023 12:06:27 -0700
Message-ID: <202301011906.301J6ROQ018104@freefriends.org>
In-Reply-To: <6DADB0FC-87EE-4028-91DF-C93A968A8982@gentoo.org>

Hi Sam,

Thanks for the further info.

Looking at both bits of dfa.c code, I don't see how either can be
undefined behavior.

In any case, dfa.c is copied directly from GNULIB, so I am cc-ing
bug-gnulib.

Paul & Jim, for background, please see the thread at
https://lists.gnu.org/archive/html/bug-gawk/2022-12/msg00010.html.

This still smells like "compiler bug" to me, but even if not,
the GNULIB folks need to look at it.

I will take a look at testdfa; it's been a while since I've had to
use it, so maybe something has gotten out of sync.

Thanks,

Arnold

Sam James <sam@gentoo.org> wrote:
> > On 30 Dec 2022, at 09:13, arnold@skeeve.com wrote:
> >
> > Hi.
> >
> > Thanks for the report.
> >
> > Although the dfa and regex code changed some between releases,
> > this smells strongly like a compiler issue and not a gawk issue.
> >
> > I suggest first that you try compiling with clang but without
> > optimization. After running configure, edit the top level Makefile *and*
> > support/Makefile and remove any -O flags. Then build.
>
> Kenton mentioned to me that with no optimisation, it works okay.
>
> > If the bug goes away, it's definitely a clang issue.
>
> It _probably_ is, but it's also possible it's UB. I tried building with UBSAN
> (as did Kenton) and we both got this when running the command he posted
> when built with Clang:
> ```
> $ ./configure CC=clang CFLAGS="-O2 -fsanitize=undefined -ggdb3" LDFLAGS="-fsanitize=undefined -ggdb3"
> $ make
> $ export UBSAN_OPTIONS=print_stacktrace=1
> $ ./gawk 'BEGIN { RS="[[][:blank:]]" }'
> dfa.c:1141:6: runtime error: execution reached an unreachable program point
> #0 0x5db652 in parse_bracket_exp /tmp/gawk/support/dfa.c:1141:6
> #1 0x5c241a in lex /tmp/gawk/support/dfa.c:1543:37
> #2 0x5dc8f1 in atom /tmp/gawk/support/dfa.c:1888:24
> #3 0x5dc8f1 in closure /tmp/gawk/support/dfa.c:1961:3
> #4 0x5dc022 in branch /tmp/gawk/support/dfa.c:2002:3
> #5 0x5c7082 in regexp /tmp/gawk/support/dfa.c:2014:3
> #6 0x5c0e32 in dfaparse /tmp/gawk/support/dfa.c:2042:3
> #7 0x5c76c2 in dfacomp /tmp/gawk/support/dfa.c:3812:5
> #8 0x5abb33 in make_regexp /tmp/gawk/re.c:272:3
> #9 0x56dffd in set_RS /tmp/gawk/io.c:4092:14
> #10 0x50510b in r_interpret /tmp/gawk/./interpret.h
> #11 0x5754d7 in main /tmp/gawk/main.c:538:3
> #12 0x7f7bb5df464f in __libc_start_call_main /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
> #13 0x7f7bb5df4708 in __libc_start_main@GLIBC_2.2.5 /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../csu/libc-start.c:381:3
> #14 0x4092a4 in _start /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/x86_64/start.S:115
>
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior dfa.c:1141:6 in # (yes, this is cut off, I don't know why!)
> ```
>
> If I build with ASAN instead with Clang:
> ```
> $ ./configure CC=clang CFLAGS="-O2 -fsanitize=address -ggdb3" LDFLAGS="-fsanitize=address -ggdb3"
> $ make
> $ ./gawk 'BEGIN { RS="[[][:blank:]]" }'
> =================================================================
> ==1517313==ERROR: AddressSanitizer: unknown-crash on address 0x7fa647137000 at pc 0x000000658214 bp 0x7ffe59482ad0 sp 0x7ffe59482ac8
> READ of size 8 at 0x7fa647137000 thread T0
> #0 0x658213 in setbit /tmp/gawk/support/dfa.c:746:33
> #1 0x658213 in setbit_case_fold_c /tmp/gawk/support/dfa.c:868:7
> #2 0x658213 in parse_bracket_exp /tmp/gawk/support/dfa.c:1095:27
> #3 0x64b6d0 in lex /tmp/gawk/support/dfa.c:1543:37
> #4 0x6588dd in atom /tmp/gawk/support/dfa.c:1888:24
> #5 0x6588dd in closure /tmp/gawk/support/dfa.c:1961:3
> #6 0x64d84c in branch /tmp/gawk/support/dfa.c:2002:3
> #7 0x64d84c in regexp /tmp/gawk/support/dfa.c:2014:3
> #8 0x64aad6 in dfaparse /tmp/gawk/support/dfa.c:2042:3
> #9 0x64dbb7 in dfacomp /tmp/gawk/support/dfa.c:3812:5
> #10 0x6404df in make_regexp /tmp/gawk/re.c:272:3
> #11 0x611b66 in set_RS /tmp/gawk/io.c:4092:14
> #12 0x5c693b in r_interpret /tmp/gawk/./interpret.h
> #13 0x616e6b in main /tmp/gawk/main.c:538:3
> #14 0x7fa646ccc64f in __libc_start_call_main /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
> #15 0x7fa646ccc708 in __libc_start_main@GLIBC_2.2.5 /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../csu/libc-start.c:381:3
> #16 0x420df4 in _start /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/x86_64/start.S:115
>
> Address 0x7fa647137000 is a wild pointer inside of access range of size 0x000000000008.
> SUMMARY: AddressSanitizer: unknown-crash /tmp/gawk/support/dfa.c:746:33 in setbit
> Shadow bytes around the buggy address:
> 0x7fa647136d80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0x7fa647136e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0x7fa647136e80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0x7fa647136f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0x7fa647136f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> =>0x7fa647137000:[00]00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0x7fa647137080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0x7fa647137100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0x7fa647137180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0x7fa647137200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0x7fa647137280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Shadow byte legend (one shadow byte represents 8 application bytes):
> Addressable: 00
> Partially addressable: 01 02 03 04 05 06 07
> Heap left redzone: fa
> Freed heap region: fd
> Stack left redzone: f1
> Stack mid redzone: f2
> Stack right redzone: f3
> Stack after return: f5
> Stack use after scope: f8
> Global redzone: f9
> Global init order: f6
> Poisoned by user: f7
> Container overflow: fc
> Array cookie: ac
> Intra object redzone: bb
> ASan internal: fe
> Left alloca redzone: ca
> Right alloca redzone: cb
> ==1517313==ABORTING
> ```
>
> I'm testing with Clang from git (LLVM 16, dfc20708bcdf7b4c4bea8595fc4ac8674634d5e6)
> but when I tried Clang 15, I got the same. I'm pretty sure Kenton is using Clang 15 as well.
>
> Of course, this might still be a Clang bug though. I don't see this with
> GCC but that's not proof either way. So if this all looks impossible, one
> of us can forward it up to Clang and see what they say.
>
> > In any case, in the gawk repo in helpers/testdfa.c is a program that
> > may be useful for further isolating the problem, since it extracts
> > the regex building and matching from the rest of gawk's code. If
> > the problem persists with that program, it will be of more use
> > in making a bug report to the clang team.
> >
>
> Unfortunately, no matter what input I give to testdfa,
> it seems to say "malloc failed", e.g.
> ```
> $ ./testdfa 'a'
> Ignorecase: false
> Syntax: RE_BACKSLASH_ESCAPE_IN_LISTS|RE_CHAR_CLASSES|RE_CONTEXT_INDEP_ANCHORS|RE_DOT_NEWLINE|RE_INTERVALS|RE_NO_BK_BRACES|RE_NO_BK_PARENS|RE_NO_BK_VBAR|RE_NO_EMPTY_RANGES|RE_UNMATCHED_RIGHT_PAREN_ORD|RE_INVALID_INTERVAL_ORD
> Pattern: /a/, len = 1
> setup_pattern: malloc failed
> ```
>
> This happens even if testdfa is built with GCC (12.2.1_20221224).
>
> Best,
> sam
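
The ASan trace quoted above faults on a read inside setbit (dfa.c:746), reached from parse_bracket_exp through setbit_case_fold_c. As a way to picture what a "wild" index into a character-class bitset means, here is a minimal C sketch of a bit setter that rejects out-of-range indexes. It assumes a 256-bit class; the names charclass_sketch, setbit_checked and tstbit_checked are hypothetical and are not the dfa.c code. It does not settle whether the real fault is undefined behavior in dfa.c or a Clang miscompile.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a character-class bitset; the names and
   sizes are assumptions for illustration, not gawk's or Gnulib's. */
enum { NCHARS = 256 };
typedef unsigned long cc_word;
enum { WORD_BITS = sizeof (cc_word) * CHAR_BIT };
enum { NWORDS = (NCHARS + WORD_BITS - 1) / WORD_BITS };

typedef struct { cc_word w[NWORDS]; } charclass_sketch;

/* Set bit B in *C.  Return false, leaving *C untouched, when B is
   outside the class instead of indexing past the end of the array.  */
static bool
setbit_checked (unsigned int b, charclass_sketch *c)
{
  if (b >= NCHARS)
    return false;
  c->w[b / WORD_BITS] |= (cc_word) 1 << (b % WORD_BITS);
  return true;
}

/* Return true if bit B is set in *C; out-of-range B is never set.  */
static bool
tstbit_checked (unsigned int b, const charclass_sketch *c)
{
  return b < NCHARS && ((c->w[b / WORD_BITS] >> (b % WORD_BITS)) & 1);
}
```

With such a check, an index like (unsigned int) -1, which is what a negative int turns into when passed as unsigned, is refused rather than turned into a read far outside the array.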