From: Arnold Robbins
Date: Thu, 15 Jul 2021 21:48:12 +0300
To: bug-gnulib@gnu.org
Subject: possible bug in regex and dfa

Hi.

Please see the thread starting at
https://lists.gnu.org/archive/html/bug-gawk/2021-07/msg00026.html

The regexp used there, ".^", to my mind should be treated as invalid.
Mawk does so, reading the entire file as one record. Gawk matches a
newline for it:

	$ cat data
	a.^b
	a.^b
	$ cat x.awk
	BEGIN {
		RS = ".^"
	}
	{
		gsub(/.^/, ">&<")
		print NR, $0
		print "RT=<" RT ">"
	}
	$ mawk -f x.awk data
	1 a.^b
	a.^b
	RT=<>
	$ ./gawk -f x.awk data
	1 a.^b
	RT=<
	>
	2 a.^b
	RT=<
	>

To make debugging easier, there is a test program in the gawk git repo
that just does regexp matching the way gawk does, called testdfa. To
use it,

	git clone git://git.savannah.gnu.org/gawk.git
	cd gawk
	./bootstrap && ./configure
	## edit Makefile and support/Makefile to remove -O, add -g
	make -j
	cd helpers
	gcc -g -I.. -I../support testdfa.c ../support/libsupport.a -o testdfa

When run:

	$ cd helpers
	$ ./testdfa -b '.^' < ../data
	Ignorecase: false
	Syntax: RE_BACKSLASH_ESCAPE_IN_LISTS|RE_CHAR_CLASSES|RE_CONTEXT_INDEP_ANCHORS|RE_DOT_NEWLINE|RE_INTERVALS|RE_NO_BK_BRACES|RE_NO_BK_PARENS|RE_NO_BK_VBAR|RE_NO_EMPTY_RANGES|RE_UNMATCHED_RIGHT_PAREN_ORD|RE_INVALID_INTERVAL_ORD
	Pattern: /.^/, len = 2
	After setup_pattern(), len = 2
	MB_CUR_MAX = 6
	Calling dfacomp(.^, 2, 0x55e9d56a5600, true)
	re_search returned position 4 (true)
	dfaexec returned 5 (a.^)

If this is supposed to match a newline, I'd like to understand why.
If it's not, I'd like to get a fix for regexp and dfa. Or if
RE_SYNTAX_GNU_AWK needs more or fewer syntax bits[1], I'd like to know
which, and why.

Please cc me on any and all replies, as I'm not subscribed to this list.

Thanks,

Arnold

[1] I hate the syntax bits. I have hated them for decades. Sigh.