Subject: Re: [PATCH] regex: fix backreference matching
To: "Dmitry V. Levin"
Cc: Paul Eggert, bug-gnulib@gnu.org
From: Egor Ignatov
Date: Tue, 29 Jun 2021 11:51:13 +0300
Message-ID: <85975173-4e58-1402-00c8-8d065b967f99@altlinux.org>
In-Reply-To: <20210616101339.GA8379@altlinux.org>
References: <20210607011027.GA18724@altlinux.org> <20210616094615.186681-1-egori@altlinux.org> <20210616101339.GA8379@altlinux.org>

Well, then I have a few questions about matching and capturing groups.

1. "ab" -> "^(a*)*(.)"

So, from your test case I can assume that:

    regs[0] = (0, 2]
    regs[1] = (0, 1]
    regs[2] = (1, 2]

But if we add a backref at the end:

2. "ab" -> "^(a*)*(.)\1"

check_matching matches the whole string "ab". This means that the first group accepted 'a' but must in fact be empty, otherwise the backref could not match later on. What is the correct match here? Is check_matching wrong, and should it match only "a" in the 2nd group (as it would with "^(a*)(.)\1")? Or should set_regs check for this and shrink the match?

Next,

3. "aaba" -> "^(a*)*(.)\1"

Again check_matching matches "aaba"; then the first group is "a", and where does the 2nd 'a' go?
In PCRE2 they save an empty string for optional groups like "(a*)*", and I assume this is because a capturing group saves its last match, and the empty string matches. So in this case they would match only "aab".

So please tell me how all 3 cases should match; this will help me fix the initial issue with backrefs and implement the correct matching.

Thanks.

--
Egor
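
For comparison, the three cases above can be run through a backtracking engine to see which spans it reports. This is only a minimal sketch, assuming Python's built-in re module as a stand-in: it is neither the gnulib/glibc matcher nor PCRE2, so its answers may differ from both. The helper name spans is made up for illustration.

```python
import re

# Hypothetical helper, not part of any regex library: returns the
# (start, end) span of the whole match followed by the span of each
# capturing group, or None when the pattern does not match.
def spans(pattern, text):
    m = re.match(pattern, text)
    if m is None:
        return None
    # m.span(i) is (-1, -1) for a group that did not participate.
    return [m.span(i) for i in range(m.re.groups + 1)]

# The three cases from the message, as (pattern, subject) pairs.
CASES = [
    (r"^(a*)*(.)", "ab"),
    (r"^(a*)*(.)\1", "ab"),
    (r"^(a*)*(.)\1", "aaba"),
]

for pattern, text in CASES:
    print(f"{text!r} -> {pattern!r}: {spans(pattern, text)}")
```

If the span printed for the first group in cases 2 and 3 is empty, the engine is following the "last iteration wins, even when it is empty" rule described for PCRE2 above. Whatever it prints reflects Python's matcher only and does not settle what regexec should return.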