From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-2.7 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_HI,RCVD_IN_SORBS_SPAM, RP_MATCHES_RCVD shortcircuit=no autolearn=no autolearn_force=no version=3.4.0 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id 2836D2018D for ; Thu, 11 May 2017 17:52:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933394AbdEKRwQ (ORCPT ); Thu, 11 May 2017 13:52:16 -0400 Received: from mail-qt0-f170.google.com ([209.85.216.170]:36598 "EHLO mail-qt0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933309AbdEKRv7 (ORCPT ); Thu, 11 May 2017 13:51:59 -0400 Received: by mail-qt0-f170.google.com with SMTP id m91so23526104qte.3 for ; Thu, 11 May 2017 10:51:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=iqATgKwEwJAFqUkfz3nzxjfgels64NJXWBUSeW2tX8I=; b=oep05kUIxQLpW103q6OKSppJGVqLLeN5FwR5+UVc4QwIHNaljbgjkmWx65dWeDoByk HWNCx/8IOuRsLnSdd2+jzuFdOtylOXoFKZN96XxprExrEbGXSSwxr2TeJQ6kUwvg4ukG UqikevKOrQw9PQvsuXhB6czJLewbe3w2nfePq2/Tc6kCyWlHXb3Jw5HqoPpLCQvDfIMO iTf+MixQ3fy2lCepCRZzvCbimegA7abJoEn4WwHsL5/2yO40xcja5/geVkx0/H5JQUz4 LkURRwlLYsVgjI8AFD5BvB1HcdJcC9+zvYkPwl9Zhp+pAk+C5bhk6R++HPJcQhhPLlj+ 2hIw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=iqATgKwEwJAFqUkfz3nzxjfgels64NJXWBUSeW2tX8I=; b=VXzBe19Wgq+NSATWwE1hciFwOmxwyrwBUZ+t19pgdIgRc6kgvm3H5DXrxuw9uyFFkw 6rK04D4o9zK3DVQUwVTJEBg3trO6zL9mA+Xb1jUakNTgOqiKSdu2c1hR08NZcDHiYvDQ scvc4vI7//p4Fv4MnWAUe+64+vpnxwrwtPNcnRwwzrklRLAKg5oRhSRna/Oy+2X1ZNXh laHDt2ZalYtJtex4lH6wSl+9h+0xX4sIKAYhP5YMDt1+Atus9V/tSQoBBXD5ChKTfi4r QyUAK72ekwMTLfDCYjUTdWeJJrlIXAGC6K6btSc7vgMPjBfMHOY/jlnmMKCfk5Fj3Ssj SjZA== X-Gm-Message-State: AODbwcDYRnF7LRNBLv4iHcXKQi1Arjro8sIIoA5lUSJKWe5ChqGpKXtj Jd4anLXgR6C4McmsajVw2A== X-Received: by 10.200.45.17 with SMTP id n17mr42358qta.250.1494525117835; Thu, 11 May 2017 10:51:57 -0700 (PDT) Received: from u.nix.is ([2a01:4f8:190:5095::2]) by smtp.gmail.com with ESMTPSA id g3sm561483qte.66.2017.05.11.10.51.55 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 11 May 2017 10:51:56 -0700 (PDT) From: =?UTF-8?q?=C3=86var=20Arnfj=C3=B6r=C3=B0=20Bjarmason?= To: git@vger.kernel.org Cc: Junio C Hamano , Jeff King , Jeffrey Walton , =?UTF-8?q?Micha=C5=82=20Kiedrowicz?= , J Smith , Victor Leschuk , =?UTF-8?q?Nguy=E1=BB=85n=20Th=C3=A1i=20Ng=E1=BB=8Dc=20Duy?= , Fredrik Kuivinen , Brandon Williams , =?UTF-8?q?=C3=86var=20Arnfj=C3=B6r=C3=B0=20Bjarmason?= Subject: [PATCH/RFC 5/6] grep: support regex patterns containing \0 via PCRE v2 Date: Thu, 11 May 2017 17:51:14 +0000 Message-Id: <20170511175115.648-6-avarab@gmail.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20170511175115.648-1-avarab@gmail.com> References: <20170511175115.648-1-avarab@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Support regex patterns with embedded \0's, as an earlier commit[1] notes this was previously impossible due to an internal limitation. Before this change any regex metacharacters in patterns containing \0 were silently ignored and the pattern matched as if it were a --fixed-strings pattern. Now these patterns will be matched with PCRE instead, which supports combining regex metacharacters with patterns containing \0. A side-effect of this change is that these patterns which previously would be considered --fixed-strings patterns regardless of the engine requested now all implicitly become --perl-regexp instead. A subsequent change introduces a POSIX to PCRE syntax converter, and could be used to be 100% truthful to our documentation by using POSIX basic syntax (which we haven't been in quite some time with kwset). But due to a chicken & egg issue with this change being easier to implement stand-alone first, the subsequent change depending on a SVN trunk version of PCRE, but most importantly I don't think anyone will mind this change, so I'm leaving it as it is. This implementation is faster than the previous kwset implementation, but I haven't bothered to come up with a \0-specific fixed-string performance test. See the next change in this series for a change which optionally expands the PCRE v2 use to use it for all fixed-string patterns, the performance tests for those will be applicable to these patterns as well, since PCRE matches \0 like any other character. 1. ("grep: factor test for \0 in grep patterns into a function", 2017-05-08) Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 24 ++++++++++++++++ t/t7008-grep-binary.sh | 74 ++++++++++++++++++++++++++++++++++---------------- 2 files changed, 75 insertions(+), 23 deletions(-) diff --git a/grep.c b/grep.c index 2ff4e253ff..5db614cf80 100644 --- a/grep.c +++ b/grep.c @@ -613,6 +613,30 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) icase = opt->regflags & REG_ICASE || p->ignore_case; ascii_only = !has_non_ascii(p->pattern); +#ifdef USE_LIBPCRE2 + if (has_null(p->pattern, p->patternlen)) { + struct strbuf sb = STRBUF_INIT; + if (icase) + strbuf_add(&sb, "(?i)", 4); + if (opt->fixed) + strbuf_add(&sb, "\\Q", 2); + strbuf_add(&sb, p->pattern, p->patternlen); + if (opt->fixed) + strbuf_add(&sb, "\\E", 2); + + p->pattern = sb.buf; + p->patternlen = sb.len; + + /* FIXME: Check in compile_pcre2_pattern() that we're + * using basic rx using !opt->pcre2 && + */ + opt->pcre2 = 1; + + compile_pcre2_pattern(p, opt); + return; + } +#endif + /* * Even when -F (fixed) asks us to do a non-regexp search, we * may not be able to correctly case-fold when -i diff --git a/t/t7008-grep-binary.sh b/t/t7008-grep-binary.sh index ba3db06501..fc86ed5fce 100755 --- a/t/t7008-grep-binary.sh +++ b/t/t7008-grep-binary.sh @@ -124,35 +124,63 @@ nul_match 0 '-F' '[æ]Qð' nul_match 0 '-Fi' 'ÆQ[Ð]' nul_match 0 '-Fi' '[Æ]QÐ' -# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0 -# patterns case-insensitively. -nul_match T1 '-i' 'ÆQÐ' - -# \0 implicitly disables regexes. This is an undocumented internal -# limitation. -nul_match T1 '' 'yQ[f]' -nul_match T1 '' '[y]Qf' -nul_match T1 '-i' 'YQ[F]' -nul_match T1 '-i' '[Y]Qf' -nul_match T1 '' 'æQ[ð]' -nul_match T1 '' '[æ]Qð' -nul_match T1 '-i' 'ÆQ[Ð]' - -# ... because of \0 implicitly disabling regexes regexes that -# should/shouldn't match don't do the right thing. -nul_match T1 '' 'eQm.*cQ' -nul_match T1 '-i' 'EQM.*cQ' -nul_match T0 '' 'eQm[*]c' -nul_match T0 '-i' 'EQM[*]C' +if test_have_prereq LIBPCRE2 +then + # Regex patterns that should match without -F + nul_match 1 '' 'yQ[f]' + nul_match 1 '' '[y]Qf' + nul_match 1 '-i' 'YQ[F]' + nul_match 1 '-i' '[Y]Qf' + nul_match 1 '' 'æQ[ð]' + nul_match 1 '' '[æ]Qð' + nul_match 0 '-i' '[Æ]Qð' + nul_match 1 '' 'eQm.*cQ' + nul_match 1 '-i' 'EQM.*cQ' + nul_match 0 '' 'eQm[*]c' + nul_match 0 '-i' 'EQM[*]C' + + # These should also match, but don't due to some heisenbug, + # they succeed when run manually! + nul_match T1 '-i' 'ÆQÐ' + nul_match T1 '-i' 'ÆQ[Ð]' +else + # \0 implicitly disables regexes. This is an undocumented + # internal limitation. + nul_match T1 '' 'yQ[f]' + nul_match T1 '' '[y]Qf' + nul_match T1 '-i' 'YQ[F]' + nul_match T1 '-i' '[Y]Qf' + nul_match T1 '' 'æQ[ð]' + nul_match T1 '' '[æ]Qð' + nul_match T1 '-i' 'ÆQ[Ð]' + + # ... because of \0 implicitly disabling regexes regexes that + # should/shouldn't match don't do the right thing. + nul_match T1 '' 'eQm.*cQ' + nul_match T1 '-i' 'EQM.*cQ' + nul_match T0 '' 'eQm[*]c' + nul_match T0 '-i' 'EQM[*]C' +fi # Due to the REG_STARTEND extension when kwset() is disabled on -i & # non-ASCII the string will be matched in its entirety, but the # pattern will be cut off at the first \0. nul_match 0 '-i' 'NOMATCHQð' -nul_match T0 '-i' '[Æ]QNOMATCH' -nul_match T0 '-i' '[æ]QNOMATCH' +if test_have_prereq LIBPCRE2 +then + nul_match 0 '-i' '[Æ]QNOMATCH' + nul_match 0 '-i' '[æ]QNOMATCH' +else + nul_match T0 '-i' '[Æ]QNOMATCH' + nul_match T0 '-i' '[æ]QNOMATCH' +fi # Matches, but for the wrong reasons, just stops at [æ] -nul_match 1 '-i' '[Æ]Qð' +if test_have_prereq LIBPCRE2 +then + nul_match T1 '-i' '[Æ]Qð' +else + nul_match 1 '-i' '[Æ]Qð' +fi nul_match 1 '-i' '[æ]Qð' # Ensure that the matcher doesn't regress to something that stops at -- 2.11.0