From: demerphq
Date: Thu, 6 Apr 2023 15:39:31 +0200
Subject: Re: bug#60690: -P '\d' in GNU and git grep
To: Junio C Hamano
Cc: Paul Eggert, Carlo Arenas, 60690@debbugs.gnu.org, mega lith01,
    Ævar Arnfjörð Bjarmason, git@vger.kernel.org, Tukusej’s Sirs,
    pcre-dev@exim.org
References: <2554712d-e386-3bab-bc6c-1f0e85d999db@cs.ucla.edu>
    <96358c4e-7200-e5a5-869e-5da9d0de3503@cs.ucla.edu>
X-Mailing-List: git@vger.kernel.org

On Tue, 4 Apr 2023 at 21:31, Junio C Hamano wrote:
>
> Paul Eggert writes:
>
> > This is an evolving area. Git master is fiddling with flags and
> > options, and so is GNU grep master, and so is PCRE2, and there are
> > bugs.
> > If you're running bleeding-edge versions of this code you'll get
> > different behavior than if you're running grep 3.8, pcregrep 8.45,
> > Perl 5.36, and git 2.39.2 (which is what Fedora 37 has).
> >
> > What I'm fearing is that we may evolve into mutually incompatible
> > interpretations of how Perl regular expressions deal with UTF-8
> > text. That'd be a recipe for confusion down the road.
>
> Nicely said. My personal inclination is to let Perl folks decide
> and follow them (even though I am skeptical about the wisdom of
> letting '\d' match anything other than [0-9]), but even in Git
> circle there would be different opinions, so I am glad that the
> discussion is visible on the list to those who are interested.

Perl matches Unicode text according to the rules specified by the
Unicode consortium. It is the reference implementation for Unicode
regular expression matching. Unicode specifies that \d match any digit
in any script that it supports. Thus \d matches far more codepoints
than \p{PosixDigit} or [0-9] would. Be aware that Unicode contains both
numbers and digits and distinguishes between them, e.g., \x{1EC9E}
represents a Lakh, which is used in many Indian languages for 100,000,
but which is not considered a *digit* for obvious reasons.

FWIW, someone mentioned [[:digit:]], which matches the same as \d does
on Unicode strings and under the /u matching flag for regexes in Perl.
Arguably this was a mistake: [[:digit:]] is a POSIX character class,
and POSIX doesn't support Unicode, so it should have matched [0-9] or
\p{PosixDigit}. But historically \d and [[:digit:]] in Perl were the
same, and when \d was extended to meet the Unicode specification,
[[:digit:]] came along for the ride, likely inadvertently. Thus
\p{PosixDigit} is equivalent to [0-9], but \p{XPosixDigit} is
equivalent to \d and [[:digit:]].

I notice that other posts in this thread have moved the conversation
on, and covered most of the points I wanted to make here.
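
To make the \d versus [0-9] distinction above concrete, here is a rough
sketch. It uses Python's `re` module rather than Perl, purely as an
assumption for illustration: Python is not the reference implementation
discussed above, and its \d covers Unicode category Nd, which may not
line up exactly with what Perl, PCRE2, or grep do. The sample characters
are also my own picks; U+00BD (one half) stands in for a "number that is
not a digit".

```python
import re
import unicodedata

# Python's `re` stands in for Perl here (an assumption for illustration only).
UNICODE_DIGIT = re.compile(r"\d")          # Unicode-aware on str patterns
ASCII_DIGIT = re.compile(r"\d", re.ASCII)  # restricted to [0-9]

# Hypothetical sample characters chosen for this sketch.
samples = {
    "ascii five": "5",
    "arabic-indic three": "\u0663",
    "devanagari three": "\u0969",
    "vulgar fraction one half": "\u00bd",  # a number, but not a decimal digit
}


def classify(ch):
    """Return (general category, matches Unicode \\d, matches ASCII \\d)."""
    return (
        unicodedata.category(ch),
        bool(UNICODE_DIGIT.fullmatch(ch)),
        bool(ASCII_DIGIT.fullmatch(ch)),
    )


results = {name: classify(ch) for name, ch in samples.items()}

for name, (cat, uni, asc) in results.items():
    print(f"{name:26} category={cat} unicode \\d={uni} ascii \\d={asc}")
```

The non-ASCII digits match only under Unicode semantics, and the
fraction matches under neither, which is the number/digit split
described above.
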
However, I wanted to say that there seem to be two different issues
here. The first is "what semantics do I expect from my regular
expressions", Unicode or legacy ASCII. Mostly this relates to
case-insensitive matching, but things like \d also surface
discrepancies. The second is "what encodings does the regular
expression engine understand". Unfortunately, on *nix there is no
tradition of using BOMs to distinguish the 6 different possible
encodings of Unicode (UTF-8, UTF-EBCDIC, UTF-16LE, UTF-16BE, UTF-32LE,
UTF-32BE), and there seems to be some level of desire for matching with
Unicode semantics against files that are not uniformly encoded in one
of these formats.

So the question comes up: A) how do you tell the regular expression
engine what semantics you want, and B) how does the regular expression
library identify the encoding of the file, and how does it handle
malformed content in that file? For instance, if I have a file which
contains snippets of UTF-8 encoded data *and* snippets of data that is
illegal in UTF-8, what should the regular expression engine do if it is
asked to do a case-insensitive match against that file?

cheers,
yves