From: Paul Eggert
To: bug-gnulib@gnu.org
Cc: Paul Eggert
Subject: [PATCH 2/2] regex: fix ignore-case Turkish bug
Date: Wed, 23 Sep 2020 17:05:03 -0700
Message-Id: <20200924000503.311248-2-eggert@cs.ucla.edu>
In-Reply-To: <20200924000503.311248-1-eggert@cs.ucla.edu>
References: <20200924000503.311248-1-eggert@cs.ucla.edu>

* lib/regex_internal.c (build_wcs_upper_buffer):
Do not assume that converting single-byte character to upper
yields a single-byte character.  This is not true for Turkish,
where towupper (L'i') yields L'İ', which is not single-byte.
* tests/test-regex.c (main): Test for this bug.
---
 ChangeLog            |  7 +++++++
 lib/regex_internal.c | 19 ++++++++++---------
 tests/test-regex.c   | 41 ++++++++++++++++++++++++++++++++++++-----
 3 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index d15f158ab..5c4d8f849 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,12 @@
 2020-09-23  Paul Eggert
 
+	regex: fix ignore-case Turkish bug
+	* lib/regex_internal.c (build_wcs_upper_buffer):
+	Do not assume that converting single-byte character to upper
+	yields a single-byte character.  This is not true for Turkish,
+	where towupper (L'i') yields L'İ', which is not single-byte.
+	* tests/test-regex.c (main): Test for this bug.
+
 	regex: port to weird isascii platforms
 	* lib/regex_internal.h (isascii) [!_LIBC]: Supply glibc version.
 
diff --git a/lib/regex_internal.c b/lib/regex_internal.c
index e1b6b4d5a..ed0a13461 100644
--- a/lib/regex_internal.c
+++ b/lib/regex_internal.c
@@ -300,18 +300,20 @@ build_wcs_upper_buffer (re_string_t *pstr)
       while (byte_idx < end_idx)
         {
           wchar_t wc;
+          unsigned char ch = pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx];
 
-          if (isascii (pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx])
-              && mbsinit (&pstr->cur_state))
+          if (isascii (ch) && mbsinit (&pstr->cur_state))
             {
-              /* In case of a singlebyte character.  */
-              pstr->mbs[byte_idx]
-                = toupper (pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx]);
               /* The next step uses the assumption that wchar_t is encoded
                  ASCII-safe: all ASCII values can be converted like this.  */
-              pstr->wcs[byte_idx] = (wchar_t) pstr->mbs[byte_idx];
-              ++byte_idx;
-              continue;
+              wchar_t wcu = __towupper (ch);
+              if (isascii (wcu))
+                {
+                  pstr->mbs[byte_idx] = wcu;
+                  pstr->wcs[byte_idx] = wcu;
+                  byte_idx++;
+                  continue;
+                }
             }
 
           remain_len = end_idx - byte_idx;
@@ -348,7 +350,6 @@ build_wcs_upper_buffer (re_string_t *pstr)
             {
               /* It is an invalid character, an incomplete character
                  at the end of the string, or '\0'.  Just use the byte.  */
-              int ch = pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx];
               pstr->mbs[byte_idx] = ch;
               /* And also cast it to wide char.  */
               pstr->wcs[byte_idx++] = (wchar_t) ch;
diff --git a/tests/test-regex.c b/tests/test-regex.c
index d3f429aeb..b4e23c8c8 100644
--- a/tests/test-regex.c
+++ b/tests/test-regex.c
@@ -29,6 +29,15 @@
 
 #include "localcharset.h"
 
+/* Check whether it's really a UTF-8 locale.
+   On mingw, setlocale (LC_ALL, "en_US.UTF-8") succeeds but returns
+   "English_United States.1252", with locale_charset () returning "CP1252".  */
+static int
+really_utf8 (void)
+{
+  return strcmp (locale_charset (), "UTF-8") == 0;
+}
+
 int
 main (void)
 {
@@ -75,11 +84,7 @@ main (void)
         }
     }
 
-  /* Check whether it's really a UTF-8 locale.
-     On mingw, the setlocale call succeeds but returns
-     "English_United States.1252", with locale_charset() returning
-     "CP1252".  */
-  if (strcmp (locale_charset (), "UTF-8") == 0)
+  if (really_utf8 ())
     {
       /* This test is from glibc bug 15078.
          The test case is from Andreas Schwab in
@@ -119,6 +124,32 @@ main (void)
         return 1;
     }
 
+  if (setlocale (LC_ALL, "tr_TR.UTF-8") && really_utf8 ())
+    {
+      re_set_syntax (RE_SYNTAX_GREP | RE_ICASE);
+      if (re_compile_pattern ("i", 1, &regex))
+        result |= 1;
+      else
+        {
+          /* UTF-8 encoding of U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.
+             In Turkish, this is the upper-case equivalent of ASCII "i".
+             Older versions of Gnulib failed to match "i" to U+0130 when
+             ignoring case in Turkish .  */
+          static char const data[] = "\xc4\xb0";
+
+          memset (&regs, 0, sizeof regs);
+          if (re_search (&regex, data, sizeof data - 1, 0, sizeof data - 1,
+                         &regs))
+            result |= 1;
+          regfree (&regex);
+          free (regs.start);
+          free (regs.end);
+
+          if (! setlocale (LC_ALL, "C"))
+            return 1;
+        }
+    }
+
   /* This test is from glibc bug 3957, reported by Andrew Mackey.  */
   re_set_syntax (RE_SYNTAX_EGREP | RE_HAT_LISTS_NOT_NEWLINE);
   memset (&regex, 0, sizeof regex);
-- 
2.25.4
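
For readers following the lib/regex_internal.c change, the new fast-path
rule can be sketched in isolation.  This is a minimal illustration, not
Gnulib code: upcase_single_byte, upper_fn and turkish_like_upper are
hypothetical names, and turkish_like_upper only imitates what the commit
message says towupper does for 'i' in a Turkish locale, so the sketch
does not need tr_TR.UTF-8 to be installed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical: a pluggable upper-casing function, so that a
   Turkish-style mapping can be simulated without a tr_TR locale.  */
typedef wint_t (*upper_fn) (wint_t);

/* Hypothetical stand-in for towupper in a Turkish locale: maps 'i'
   to U+0130 and defers to towupper for everything else.  */
static wint_t
turkish_like_upper (wint_t wc)
{
  return wc == L'i' ? (wint_t) 0x0130 : towupper (wc);
}

/* Mirror of the patched fast path: succeed only if both the input
   byte and its upper-case form are ASCII.  On success store the
   upper-cased byte in *OUT and return true.  Return false when the
   caller would have to fall back to the multibyte conversion path.  */
static bool
upcase_single_byte (unsigned char ch, upper_fn upper, unsigned char *out)
{
  assert (out != NULL);
  if (ch >= 0x80)
    return false;               /* Not ASCII: no fast path.  */
  wint_t wcu = upper (ch);
  if (wcu >= 0x80)
    return false;               /* Upper-case form left ASCII.  */
  *out = (unsigned char) wcu;
  return true;
}
```

With towupper in the default "C" locale, 'i' takes the fast path and
becomes 'I'.  With the Turkish-like mapping the helper returns false,
which corresponds to falling through to the multibyte conversion code
in build_wcs_upper_buffer instead of storing a truncated byte.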