From: arnold@skeeve.com
Date: Sun, 18 Jul 2021 12:59:24 -0600
To: bug-gnulib@gnu.org, bruno@clisp.org
Cc: eggert@cs.ucla.edu, arnold@skeeve.com
Subject: Re: possible bug in regex and dfa
Message-Id: <202107181859.16IIxOCA007113@freefriends.org>
References: <85ef7fe3-c793-f082-3df1-3011fd8d0966@cs.ucla.edu> <202107181256.16ICuEjF027369@freefriends.org> <3323531.JAME3IizvO@omega>
In-Reply-To: <3323531.JAME3IizvO@omega>
List-Id: Gnulib discussion list

Bruno Haible wrote:

> Hi Arnold,
>
> > Dot matching newline isn't the issue here.
> >
> > It's ^ matching in the middle of a string. For my purposes, ^ should
> > only match at the beginning of a *string* (as $ should only match at
> > the end of a string). I haven't rechecked POSIX, but this is how awk
> > has behaved since forever.
>
> Hmm. Regarding POSIX: I've read section 9.3.8 and 9.4.9 of [1],
> the description of REG_NOTBOL, REG_NOTEOL in [2], and the description
> of REG_NEWLINE in [3]. If I understand it correctly, within POSIX,
> ".^" should not match a newline because
>   - if REG_NEWLINE is set, '^' matches after the newline but '.' does not
>     match the newline,
>   - if REG_NEWLINE is not set, '.' matches newline but '^' does not match
>     after the newline.

That makes sense. This is why I felt that, for gawk, ".^" is an invalid
regexp. (Indeed, the original Unix awk rejects it as such.)
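
As a rough illustration of the two cases quoted above (an analogy only,
not a test of GNU regex or of gawk): Python's re module is a different
engine, but its re.DOTALL and re.MULTILINE flags switch on separately the
two behaviours that REG_NEWLINE trades against each other. The helper name
dot_caret_matches and the sample text are made up for this sketch.

```python
import re


def dot_caret_matches(text, flags=0):
    """Return True if the pattern '.^' matches anywhere in text."""
    return re.search(r".^", text, flags) is not None


sample = "a\nb"

# Rough analogue of REG_NEWLINE unset:
# '.' matches the newline, but '^' matches only at the start of the string.
no_newline_mode = dot_caret_matches(sample, re.DOTALL)

# Rough analogue of REG_NEWLINE set:
# '^' matches after the newline, but '.' does not match the newline.
newline_mode = dot_caret_matches(sample, re.MULTILINE)

# Both behaviours at once, a combination that (per the reading above)
# neither REG_NEWLINE setting gives: '.' consumes the newline and '^'
# then matches right after it.
both = dot_caret_matches(sample, re.DOTALL | re.MULTILINE)

print(no_newline_mode, newline_mode, both)
```

This prints "False False True": ".^" can only match when dot matches
newline and ^ matches after newline at the same time, which is the
combination the POSIX reading rules out.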

REG_NEWLINE is not included in any of the RE_*_AWK definitions since
I want exactly the behavior you describe: dot matches newline but ^
does not match after the newline.

To me this feels very much like a bug.

> However, GNU regex.h also has a flag RE_CONTEXT_INDEP_ANCHORS; I don't know
> what effect it has.

In this case it makes things worse, causing gawk to match ".^" literally.

> > (And how I've documented things in the manual, also since forever.)
>
> If you want the behaviour of the GNU regex to be stable over time, you
> should contribute unit tests to tests/test-regex.c.

This is a separate issue. It almost sounds like you're saying "it's your
fault there's a bug here, you didn't contribute unit tests". I hope
that's not your intent; if it is then sorry, I don't buy it.

In any case, I've supplied a regexp, input data, and in the gawk dist,
a test harness, so that debugging can be done if one of the Gnulib
maintainers will look into this particular issue.

Thanks,

Arnold