From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <bug-gnulib-bounces+normalperson=yhbt.net@gnu.org>
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level: 
X-Spam-Status: No, score=-3.4 required=3.0 tests=AWL,BAYES_00,
	HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_MED,
	RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from lists.gnu.org (lists.gnu.org [209.51.188.17])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by dcvr.yhbt.net (Postfix) with ESMTPS id 5BB031F4B4
	for <normalperson@yhbt.net>; Thu, 24 Sep 2020 00:05:31 +0000 (UTC)
Received: from localhost ([::1]:52806 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <bug-gnulib-bounces+normalperson=yhbt.net@gnu.org>)
	id 1kLElO-0003wn-GT
	for normalperson@yhbt.net; Wed, 23 Sep 2020 20:05:30 -0400
Received: from eggs.gnu.org ([2001:470:142:3::10]:60898)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <eggert@cs.ucla.edu>)
 id 1kLElJ-0003vC-5o
 for bug-gnulib@gnu.org; Wed, 23 Sep 2020 20:05:25 -0400
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:56148)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <eggert@cs.ucla.edu>)
 id 1kLElG-0005if-61
 for bug-gnulib@gnu.org; Wed, 23 Sep 2020 20:05:24 -0400
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 8FE7F160105
 for <bug-gnulib@gnu.org>; Wed, 23 Sep 2020 17:05:20 -0700 (PDT)
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id 6Mbt6_pszKxL; Wed, 23 Sep 2020 17:05:19 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 6BB231600A2;
 Wed, 23 Sep 2020 17:05:19 -0700 (PDT)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id Tu--yb7xCLG4; Wed, 23 Sep 2020 17:05:19 -0700 (PDT)
Received: from Penguin.CS.UCLA.EDU (Penguin.CS.UCLA.EDU [131.179.64.200])
 by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 45E0616008D;
 Wed, 23 Sep 2020 17:05:19 -0700 (PDT)
From: Paul Eggert <eggert@cs.ucla.edu>
To: bug-gnulib@gnu.org
Subject: [PATCH 2/2] regex: fix ignore-case Turkish bug
Date: Wed, 23 Sep 2020 17:05:03 -0700
Message-Id: <20200924000503.311248-2-eggert@cs.ucla.edu>
X-Mailer: git-send-email 2.25.4
In-Reply-To: <20200924000503.311248-1-eggert@cs.ucla.edu>
References: <20200924000503.311248-1-eggert@cs.ucla.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Received-SPF: pass client-ip=131.179.128.68; envelope-from=eggert@cs.ucla.edu;
 helo=zimbra.cs.ucla.edu
X-detected-operating-system: by eggs.gnu.org: First seen = 2020/09/23 20:05:17
X-ACL-Warn: Detected OS   = Linux 3.1-3.10 [fuzzy]
X-Spam_score_int: -41
X-Spam_score: -4.2
X-Spam_bar: ----
X-Spam_report: (-4.2 / 5.0 requ) BAYES_00=-1.9, RCVD_IN_DNSWL_MED=-2.3,
 SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-BeenThere: bug-gnulib@gnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Gnulib discussion list <bug-gnulib.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/bug-gnulib>,
 <mailto:bug-gnulib-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/bug-gnulib>
List-Post: <mailto:bug-gnulib@gnu.org>
List-Help: <mailto:bug-gnulib-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/bug-gnulib>,
 <mailto:bug-gnulib-request@gnu.org?subject=subscribe>
Cc: Paul Eggert <eggert@cs.ucla.edu>
Errors-To: bug-gnulib-bounces+normalperson=yhbt.net@gnu.org
Sender: "bug-gnulib" <bug-gnulib-bounces+normalperson=yhbt.net@gnu.org>

* lib/regex_internal.c (build_wcs_upper_buffer):
Do not assume that converting a single-byte character to upper case
yields a single-byte character.  This is not true for Turkish,
where towupper (L'i') yields L'=C4=B0', which is not single-byte.
* tests/test-regex.c (main): Test for this bug.
---
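For reviewers, the condition that the new fast path tests can be
sketched in isolation.  This is only an illustration, not code from
the patch: upper_fits_in_one_byte is a hypothetical name, and like
the patch it assumes that wchar_t is ASCII-safe.

```c
#include <wctype.h>

/* Hypothetical helper, not part of Gnulib.  Return nonzero if
   upper-casing the ASCII byte CH yields an ASCII character, so the
   result can be stored back as a single byte.  Assumes CH < 0x80 and
   an ASCII-safe wchar_t, as the patch does.  */
static int
upper_fits_in_one_byte (unsigned char ch)
{
  return towupper (ch) < 0x80;
}
```

In the C locale this returns 1 for 'i'.  In a locale where
towupper (L'i') yields U+0130, as described above for Turkish, it
returns 0, and the byte has to go through the general multibyte path
instead of the single-byte shortcut.
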
 ChangeLog            |  7 +++++++
 lib/regex_internal.c | 19 ++++++++++---------
 tests/test-regex.c   | 41 ++++++++++++++++++++++++++++++++++++-----
 3 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index d15f158ab..5c4d8f849 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,12 @@
 2020-09-23  Paul Eggert  <eggert@cs.ucla.edu>
=20
+	regex: fix ignore-case Turkish bug
+	* lib/regex_internal.c (build_wcs_upper_buffer):
+	Do not assume that converting a single-byte character to upper case
+	yields a single-byte character.  This is not true for Turkish,
+	where towupper (L'i') yields L'=C4=B0', which is not single-byte.
+	* tests/test-regex.c (main): Test for this bug.
+
 	regex: port to weird isascii platforms
 	* lib/regex_internal.h (isascii) [!_LIBC]: Supply glibc version.
=20
diff --git a/lib/regex_internal.c b/lib/regex_internal.c
index e1b6b4d5a..ed0a13461 100644
--- a/lib/regex_internal.c
+++ b/lib/regex_internal.c
@@ -300,18 +300,20 @@ build_wcs_upper_buffer (re_string_t *pstr)
       while (byte_idx < end_idx)
 	{
 	  wchar_t wc;
+	  unsigned char ch =3D pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx];
=20
-	  if (isascii (pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx])
-	      && mbsinit (&pstr->cur_state))
+	  if (isascii (ch) && mbsinit (&pstr->cur_state))
 	    {
-	      /* In case of a singlebyte character.  */
-	      pstr->mbs[byte_idx]
-		=3D toupper (pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx]);
 	      /* The next step uses the assumption that wchar_t is encoded
 		 ASCII-safe: all ASCII values can be converted like this.  */
-	      pstr->wcs[byte_idx] =3D (wchar_t) pstr->mbs[byte_idx];
-	      ++byte_idx;
-	      continue;
+	      wchar_t wcu =3D __towupper (ch);
+	      if (isascii (wcu))
+		{
+		  pstr->mbs[byte_idx] =3D wcu;
+		  pstr->wcs[byte_idx] =3D wcu;
+		  byte_idx++;
+		  continue;
+		}
 	    }
=20
 	  remain_len =3D end_idx - byte_idx;
@@ -348,7 +350,6 @@ build_wcs_upper_buffer (re_string_t *pstr)
 	    {
 	      /* It is an invalid character, an incomplete character
 		 at the end of the string, or '\0'.  Just use the byte.  */
-	      int ch =3D pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx];
 	      pstr->mbs[byte_idx] =3D ch;
 	      /* And also cast it to wide char.  */
 	      pstr->wcs[byte_idx++] =3D (wchar_t) ch;
diff --git a/tests/test-regex.c b/tests/test-regex.c
index d3f429aeb..b4e23c8c8 100644
--- a/tests/test-regex.c
+++ b/tests/test-regex.c
@@ -29,6 +29,15 @@
=20
 #include "localcharset.h"
=20
+/* Check whether it's really a UTF-8 locale.
+   On mingw, setlocale (LC_ALL, "en_US.UTF-8") succeeds but returns
+   "English_United States.1252", with locale_charset () returning "CP125=
2".  */
+static int
+really_utf8 (void)
+{
+  return strcmp (locale_charset (), "UTF-8") =3D=3D 0;
+}
+
 int
 main (void)
 {
@@ -75,11 +84,7 @@ main (void)
           }
       }
=20
-      /* Check whether it's really a UTF-8 locale.
-         On mingw, the setlocale call succeeds but returns
-         "English_United States.1252", with locale_charset() returning
-         "CP1252".  */
-      if (strcmp (locale_charset (), "UTF-8") =3D=3D 0)
+      if (really_utf8 ())
         {
           /* This test is from glibc bug 15078.
              The test case is from Andreas Schwab in
@@ -119,6 +124,32 @@ main (void)
         return 1;
     }
=20
+  if (setlocale (LC_ALL, "tr_TR.UTF-8") && really_utf8 ())
+    {
+      re_set_syntax (RE_SYNTAX_GREP | RE_ICASE);
+      if (re_compile_pattern ("i", 1, &regex))
+        result |=3D 1;
+      else
+        {
+          /* UTF-8 encoding of U+0130 LATIN CAPITAL LETTER I WITH DOT AB=
OVE.
+             In Turkish, this is the upper-case equivalent of ASCII "i".
+             Older versions of Gnulib failed to match "i" to U+0130 when
+             ignoring case in Turkish <https://bugs.gnu.org/43577>.  */
+          static char const data[] =3D "\xc4\xb0";
+
+          memset (&regs, 0, sizeof regs);
+          if (re_search (&regex, data, sizeof data - 1, 0, sizeof data -=
 1,
+                         &regs))
+            result |=3D 1;
+          regfree (&regex);
+          free (regs.start);
+          free (regs.end);
+
+          if (! setlocale (LC_ALL, "C"))
+            return 1;
+        }
+    }
+
   /* This test is from glibc bug 3957, reported by Andrew Mackey.  */
   re_set_syntax (RE_SYNTAX_EGREP | RE_HAT_LISTS_NOT_NEWLINE);
   memset (&regex, 0, sizeof regex);
--=20
2.25.4