From: Bruno Haible
To: bug-gnulib@gnu.org
Cc: arnold@skeeve.com, eggert@cs.ucla.edu
Subject: Re: possible bug in regex and dfa
Date: Sun, 18 Jul 2021 18:09:23 +0200
Message-ID: <3323531.JAME3IizvO@omega>
In-Reply-To: <202107181256.16ICuEjF027369@freefriends.org>
References: <85ef7fe3-c793-f082-3df1-3011fd8d0966@cs.ucla.edu> <202107181256.16ICuEjF027369@freefriends.org>

Hi Arnold,

> Dot matching newline isn't the issue here.
>
> It's ^ matching in the middle of a string. For my purposes, ^ should
> only match at the beginning of a *string* (as $ should only match at
> the end of a string). I haven't rechecked POSIX, but this is how awk
> has behaved since forever.

Hmm. Regarding POSIX: I've read section 9.3.8 and 9.4.9 of [1], the
description of REG_NOTBOL, REG_NOTEOL in [2], and the description of
REG_NEWLINE in [3]. If I understand it correctly, within POSIX, ".^"
should not match a newline because
  - if REG_NEWLINE is set, '^' matches after the newline but '.' does not
    match the newline,
  - if REG_NEWLINE is not set, '.' matches newline but '^' does not match
    after the newline.

However, GNU regex.h also has a flag RE_CONTEXT_INDEP_ANCHORS; I don't know
what effect it has.

> (And how I've documented things in the manual, also since forever.)

If you want the behaviour of the GNU regex to be stable over time, you
should contribute unit tests to tests/test-regex.c.

So far, I see unit tests for the flags
  REG_EXTENDED
  REG_NOSUB
  RE_SYNTAX_POSIX_BASIC
  RE_SYNTAX_GREP
  RE_SYNTAX_EGREP
  RE_SYNTAX_POSIX_EGREP
  RE_SYNTAX_EMACS
  RE_HAT_LISTS_NOT_NEWLINE
  RE_ICASE
  RE_CONTEXT_INVALID_DUP
  RE_NO_EMPTY_RANGES
but no tests at all for
  RE_SYNTAX_AWK
  RE_SYNTAX_GNU_AWK
  RE_SYNTAX_POSIX_AWK
  RE_SYNTAX_POSIX_EXTENDED
  REG_NEWLINE
  REG_NOTBOL
  REG_NOTEOL
  REG_STARTEND

Usually it takes about as many code lines to reasonably unit test some code
as the code itself has. The regex module is over 300 KB large, and its unit
test less than 20 KB large. Already from these figures you can tell that the
regex module is *NEARLY UNTESTED*.

You may say, well, some other unit tests exist in glibc, in sed, in grep,
in awk, in coreutils, etc. But it doesn't help maintenance if the unit tests
are not part of what gets tested by
  ./gnulib-tool --test --single-configure regex

Bruno

[1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
[2] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/regex.h.html
[3] https://pubs.opengroup.org/onlinepubs/9699919799/functions/regexec.html