From: arnold@skeeve.com
Date: Sun, 01 Jan 2023 12:06:27 -0700
To: sam@gentoo.org, arnold@skeeve.com
Cc: bug-gnulib@gnu.org, concord@gentoo.org, bug-gawk@gnu.org
Subject: Re: Clang-built Gawk 5.2.1 regex oddity
Message-Id: <202301011906.301J6ROQ018104@freefriends.org>
References: <20221230000119.hyui6umnspuyzqum@bubbles>
 <202212300913.2BU9DV6V030160@freefriends.org>
 <6DADB0FC-87EE-4028-91DF-C93A968A8982@gentoo.org>
In-Reply-To: <6DADB0FC-87EE-4028-91DF-C93A968A8982@gentoo.org>
List-Id: Gnulib discussion list

Hi Sam,

Thanks for the further info. Looking at both bits of dfa.c code,
I don't see how either can be undefined behavior.

In any case, dfa.c is copied directly from GNULIB, so I am cc-ing
bug-gnulib.

Paul & Jim, for background, please see the thread at
https://lists.gnu.org/archive/html/bug-gawk/2022-12/msg00010.html.

This still smells like "compiler bug" to me, but even if not,
the GNULIB folks need to look at it.

I will take a look at testdfa; it's been a while since I've had to
use it, so maybe something has gotten out of sync.

Thanks,

Arnold

Sam James wrote:

> > On 30 Dec 2022, at 09:13, arnold@skeeve.com wrote:
> >
> > Hi.
> >
> > Thanks for the report.
> >
> > Although the dfa and regex code changed some between releases,
> > this smells strongly like a compiler issue and not a gawk issue.
> >
> > I suggest first that you try compiling with clang but without
> > optimization. After running configure, edit the top level Makefile *and*
> > support/Makefile and remove any -O flags. Then build.
>
> Kenton mentioned to me that with no optimisation, it works okay.
>
> > If the bug goes away, it's definitely a clang issue.
>
> It _probably_ is, but it's also possible it's UB. I tried building with UBSAN
> (as did Kenton) and we both got this when running the command he posted
> when built with Clang:
> ```
> $ ./configure CC=clang CFLAGS="-O2 -fsanitize=undefined -ggdb3" LDFLAGS="-fsanitize=undefined -ggdb3"
> $ make
> $ export UBSAN_OPTIONS=print_stacktrace=1
> $ ./gawk 'BEGIN { RS="[[][:blank:]]" }'
> dfa.c:1141:6: runtime error: execution reached an unreachable program point
>     #0 0x5db652 in parse_bracket_exp /tmp/gawk/support/dfa.c:1141:6
>     #1 0x5c241a in lex /tmp/gawk/support/dfa.c:1543:37
>     #2 0x5dc8f1 in atom /tmp/gawk/support/dfa.c:1888:24
>     #3 0x5dc8f1 in closure /tmp/gawk/support/dfa.c:1961:3
>     #4 0x5dc022 in branch /tmp/gawk/support/dfa.c:2002:3
>     #5 0x5c7082 in regexp /tmp/gawk/support/dfa.c:2014:3
>     #6 0x5c0e32 in dfaparse /tmp/gawk/support/dfa.c:2042:3
>     #7 0x5c76c2 in dfacomp /tmp/gawk/support/dfa.c:3812:5
>     #8 0x5abb33 in make_regexp /tmp/gawk/re.c:272:3
>     #9 0x56dffd in set_RS /tmp/gawk/io.c:4092:14
>     #10 0x50510b in r_interpret /tmp/gawk/./interpret.h
>     #11 0x5754d7 in main /tmp/gawk/main.c:538:3
>     #12 0x7f7bb5df464f in __libc_start_call_main /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
>     #13 0x7f7bb5df4708 in __libc_start_main@GLIBC_2.2.5 /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../csu/libc-start.c:381:3
>     #14 0x4092a4 in _start /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/x86_64/start.S:115
>
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior dfa.c:1141:6 in
> # (yes, this is cut off, I don't know why!)
> ```
>
> If I build with ASAN instead with Clang:
> ```
> $ ./configure CC=clang CFLAGS="-O2 -fsanitize=address -ggdb3" LDFLAGS="-fsanitize=address -ggdb3"
> $ make
> $ ./gawk 'BEGIN { RS="[[][:blank:]]" }'
> =================================================================
> ==1517313==ERROR: AddressSanitizer: unknown-crash on address 0x7fa647137000 at pc 0x000000658214 bp 0x7ffe59482ad0 sp 0x7ffe59482ac8
> READ of size 8 at 0x7fa647137000 thread T0
>     #0 0x658213 in setbit /tmp/gawk/support/dfa.c:746:33
>     #1 0x658213 in setbit_case_fold_c /tmp/gawk/support/dfa.c:868:7
>     #2 0x658213 in parse_bracket_exp /tmp/gawk/support/dfa.c:1095:27
>     #3 0x64b6d0 in lex /tmp/gawk/support/dfa.c:1543:37
>     #4 0x6588dd in atom /tmp/gawk/support/dfa.c:1888:24
>     #5 0x6588dd in closure /tmp/gawk/support/dfa.c:1961:3
>     #6 0x64d84c in branch /tmp/gawk/support/dfa.c:2002:3
>     #7 0x64d84c in regexp /tmp/gawk/support/dfa.c:2014:3
>     #8 0x64aad6 in dfaparse /tmp/gawk/support/dfa.c:2042:3
>     #9 0x64dbb7 in dfacomp /tmp/gawk/support/dfa.c:3812:5
>     #10 0x6404df in make_regexp /tmp/gawk/re.c:272:3
>     #11 0x611b66 in set_RS /tmp/gawk/io.c:4092:14
>     #12 0x5c693b in r_interpret /tmp/gawk/./interpret.h
>     #13 0x616e6b in main /tmp/gawk/main.c:538:3
>     #14 0x7fa646ccc64f in __libc_start_call_main /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
>     #15 0x7fa646ccc708 in __libc_start_main@GLIBC_2.2.5 /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../csu/libc-start.c:381:3
>     #16 0x420df4 in _start /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/x86_64/start.S:115
>
> Address 0x7fa647137000 is a wild pointer inside of access range of size 0x000000000008.
> SUMMARY: AddressSanitizer: unknown-crash /tmp/gawk/support/dfa.c:746:33 in setbit
> Shadow bytes around the buggy address:
>   0x7fa647136d80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   0x7fa647136e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   0x7fa647136e80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   0x7fa647136f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   0x7fa647136f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> =>0x7fa647137000:[00]00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   0x7fa647137080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   0x7fa647137100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   0x7fa647137180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   0x7fa647137200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   0x7fa647137280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Shadow byte legend (one shadow byte represents 8 application bytes):
>   Addressable: 00
>   Partially addressable: 01 02 03 04 05 06 07
>   Heap left redzone: fa
>   Freed heap region: fd
>   Stack left redzone: f1
>   Stack mid redzone: f2
>   Stack right redzone: f3
>   Stack after return: f5
>   Stack use after scope: f8
>   Global redzone: f9
>   Global init order: f6
>   Poisoned by user: f7
>   Container overflow: fc
>   Array cookie: ac
>   Intra object redzone: bb
>   ASan internal: fe
>   Left alloca redzone: ca
>   Right alloca redzone: cb
> ==1517313==ABORTING
> ```
>
> I'm testing with Clang from git (LLVM 16, dfc20708bcdf7b4c4bea8595fc4ac8674634d5e6)
> but when I tried Clang 15, I got the same. I'm pretty sure Kenton is using Clang 15 as well.
>
> Of course, this might still be a Clang bug though. I don't see this with
> GCC but that's not proof either way. So if this all looks impossible, one
> of us can forward it up to Clang and see what they say.
>
> > In any case, in the gawk repo in helpers/testdfa.c is a program that
> > may be useful for further isolating the problem, since it extracts
> > the regex building and matching from the rest of gawk's code. If
> > the problem persists with that program, it will be of more use
> > in making a bug report to the clang team.
> >
>
> Unfortunately, no matter what input I give to testdfa,
> it seems to say "malloc failed", e.g.
> ```
> $ ./testdfa 'a'
> Ignorecase: false
> Syntax: RE_BACKSLASH_ESCAPE_IN_LISTS|RE_CHAR_CLASSES|RE_CONTEXT_INDEP_ANCHORS|RE_DOT_NEWLINE|RE_INTERVALS|RE_NO_BK_BRACES|RE_NO_BK_PARENS|RE_NO_BK_VBAR|RE_NO_EMPTY_RANGES|RE_UNMATCHED_RIGHT_PAREN_ORD|RE_INVALID_INTERVAL_ORD
> Pattern: /a/, len = 1
> setup_pattern: malloc failed
> ```
>
> This happens even if testdfa is built with GCC (12.2.1_20221224).
>
> Best,
> sam
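
For reference, the UBSan diagnostic quoted in the thread, "execution
reached an unreachable program point", is the message the sanitizer
prints when control reaches __builtin_unreachable(). The following is
only a rough sketch of how such a point can arise. It assumes a
hypothetical SKETCH_ASSUME macro and a made-up sketch_weight function;
neither is taken from dfa.c or gnulib, and nothing here is meant to
say what dfa.c:1141 actually does.

```c
#include <assert.h>

/* Hypothetical macro for illustration only (not gnulib's definition).
   In a checked build it asserts the condition.  Otherwise it promises
   the optimizer that the condition holds; if the promise is broken,
   behavior is undefined.  __builtin_unreachable is a GCC/Clang builtin. */
#ifdef SKETCH_CHECKED
# define SKETCH_ASSUME(cond) assert (cond)
#else
# define SKETCH_ASSUME(cond) ((cond) ? (void) 0 : __builtin_unreachable ())
#endif

/* Made-up example function: map a state code in 0..2 to a weight.
   Callers must pass state < 3.  */
int
sketch_weight (unsigned int state)
{
  static const int weights[3] = { 10, 20, 40 };

  /* With optimization, the compiler may drop any check of the index
     because it has been told the out-of-range case cannot happen.  */
  SKETCH_ASSUME (state < 3);
  return weights[state];
}
```

Calling sketch_weight (1) is fine and returns 20. Calling it with 3 or
more is undefined behavior in the unchecked build; building with
-fsanitize=undefined (or -DSKETCH_CHECKED) would be one way to make
such a broken promise visible instead of silently miscompiled.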