Date: Mon, 5 Jul 2021 15:12:02 +0300
From: "Dmitry V. Levin"
To: Egor Ignatov
Cc: Paul Eggert, bug-gnulib@gnu.org
Subject: Re: [PATCH] regex: fix backreference matching
Message-ID: <20210705121201.GA20072@altlinux.org>
In-Reply-To: <85975173-4e58-1402-00c8-8d065b967f99@altlinux.org>
References: <20210607011027.GA18724@altlinux.org>
 <20210616094615.186681-1-egori@altlinux.org>
 <20210616101339.GA8379@altlinux.org>
 <85975173-4e58-1402-00c8-8d065b967f99@altlinux.org>

On Tue, Jun 29, 2021 at 11:51:13AM +0300, Egor Ignatov wrote:
> Well, then I have a few questions about matching and capturing
> groups.
>
> 1. "ab" -> "^(a*)*(.)"
> So, from your test case I can assume that:
>     regs[0] = (0, 2]
>     regs[1] = (0, 1]
>     regs[2] = (1, 2]
>
> But if we add backref at the end:
> 2. "ab" -> "^(a*)*(.)\1"
> check_matching matches the whole string "ab",
> this means that the first group accepted 'a' but in fact is empty,
> otherwise it could not match backref later on.
> What is the correct match here? Is check_matching wrong and
> should match only "a" in the 2nd group (as it would be with
> "^(a*)(.)\1")? or should set_regs check for this and shrink the
> match?

My test-regex.c entry for a similar but slightly simplified case was:

  /* Test for ** match with backreferences.  */
  { "^(a*)*\\1", "a", REG_EXTENDED, 2, { { 0, 0 }, { 0, 0 } } }

I suppose the corresponding entry for your example would be

  { "^(a*)*(.)\\1", "ab", REG_EXTENDED, 3, { { 0, 1 }, { 0, 0 }, { 0, 1 } } }

> Next,
> 3. "aaba" -> "^(a*)*(.)\1"
> Again check_matching matches "aaba", then the first group
> is "a", and where does the 2nd 'a' go?

I suppose the corresponding test-regex.c entry for this case would be

  { "^(a*)*(.)\\1", "aaba", REG_EXTENDED, 3, { { 0, 4 }, { 0, 1 }, { 2, 3 } } }

-- 
ldv
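
To see what a given libc actually reports for the cases above, a small harness along the following lines may help. It is only a sketch: run_case and print_regs are hypothetical names, not part of test-regex.c, and backreferences in REG_EXTENDED patterns are an extension (glibc accepts them) rather than something POSIX guarantees for EREs. No expected output is shown, since the results are exactly what is under discussion.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper, not part of gnulib's test-regex.c: compile PATTERN
   as an ERE, match it against STRING and store up to NMATCH registers in
   PMATCH.  Returns 0 on a match, otherwise the nonzero code reported by
   regcomp or regexec.  */
int
run_case (const char *pattern, const char *string,
          size_t nmatch, regmatch_t *pmatch)
{
  regex_t re;
  int rc;

  assert (pattern != NULL && string != NULL && pmatch != NULL);
  rc = regcomp (&re, pattern, REG_EXTENDED);
  if (rc != 0)
    return rc;
  rc = regexec (&re, string, nmatch, pmatch, 0);
  regfree (&re);
  return rc;
}

/* Print the registers in the { rm_so, rm_eo } notation used in the
   test-regex.c entries quoted in this message.  */
void
print_regs (const char *pattern, const char *string,
            size_t nmatch, const regmatch_t *pmatch)
{
  size_t i;

  printf ("\"%s\" on \"%s\":", pattern, string);
  for (i = 0; i < nmatch; i++)
    printf (" { %d, %d }", (int) pmatch[i].rm_so, (int) pmatch[i].rm_eo);
  putchar ('\n');
}
```

A caller would declare `regmatch_t regs[3];`, invoke `run_case ("^(a*)*(.)\\1", "aaba", 3, regs)`, and on a zero return pass the same arguments to print_regs; the printed pairs can then be compared by hand with the entries proposed above.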