From: Bruno Haible
To: bug-gnulib@gnu.org
Cc: Kiyoshi KANAZAWA, Akim Demaille
Subject: Re: mbswidth "failure" on Solaris
Date: Sun, 05 May 2019 13:35:56 +0200
Message-ID: <1717997.9M1LzPS7EO@omega>
Content-Type: text/plain; charset="UTF-8"

Hi,

> >     15 | e: {∇⃗×𝐸⃗ = -∂𝐵⃗/∂t}
> > -      |    ^~~~~~~~~~~~~~
> > +      |    ^~~~~~~~~~~~~~~~~

Indeed, mbswidth seems to have returned 3 more columns.

> The error (three more columns than expected) seems to indicate something
> related to the combining arrow.

No. The issue comes from the math symbols. The following test program
shows it:

#include <config.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include "mbswidth.h"
int
main ()
{
  /* Each comment gives the expected value, then the value seen on Solaris.  */
  setlocale (LC_ALL, "en_US.UTF-8");
  printf ("%d\n", (int) mbswidth ("{∇⃗×𝐸⃗ = -∂𝐵⃗/∂t}", 0)); // 14 vs 17
  printf ("%d\n", wcwidth (0x2207)); // 1 vs. 2
  printf ("%d\n", wcwidth (0x20D7)); // 0
  printf ("%d\n", wcwidth (0x00D7)); // 1
  printf ("%d\n", wcwidth (0x1D438)); // 1
  printf ("%d\n", wcwidth (0x2202)); // 1 vs. 2
  printf ("%d\n", wcwidth (0x1D435)); // 1
}

The following patch should fix it.

The patch changes the behaviour of wcwidth(0x2202) for UTF-8 locales. It
would be possible to limit the change to the non-East-Asian UTF-8 locales
(by using the function uc_locale_language() and testing whether its result
is not one of "zh", "ja", "ko"), but glibc does not do this (it uses the
same width across all UTF-8 locales), therefore I'm not doing it here
either.


2019-05-05  Bruno Haible

	wcwidth: Ensure width 1, not 2, for ambiguous characters.
	Reported by Kiyoshi KANAZAWA via Akim Demaille.
	* m4/wcwidth.m4 (gl_FUNC_WCWIDTH): Check the width of U+2202. Use an
	en_US.UTF-8 locale, since that is more likely to be present than an
	fr_FR.UTF-8 locale.
	* tests/test-wcwidth.c (main): Check the width of U+2202.
	* doc/posix-functions/wcwidth.texi: Mention the issue.

diff --git a/m4/wcwidth.m4 b/m4/wcwidth.m4
index 3952fd2..e9b5bf4 100644
--- a/m4/wcwidth.m4
+++ b/m4/wcwidth.m4
@@ -1,4 +1,4 @@
-# wcwidth.m4 serial 28
+# wcwidth.m4 serial 29
 dnl Copyright (C) 2006-2019 Free Software Foundation, Inc.
 dnl This file is free software; the Free Software Foundation
 dnl gives unlimited permission to copy and/or distribute it,
@@ -54,6 +54,8 @@ AC_DEFUN([gl_FUNC_WCWIDTH],
     dnl On OSF/1 5.1, wcwidth(0x200B) (ZERO WIDTH SPACE) returns 1.
     dnl On OpenBSD 5.8, wcwidth(0xFF1A) (FULLWIDTH COLON) returns 0.
     dnl This leads to bugs in 'ls' (coreutils).
+    dnl On Solaris 11.4, wcwidth(0x2202) (PARTIAL DIFFERENTIAL) returns 2,
+    dnl even in Western locales.
     AC_CACHE_CHECK([whether wcwidth works reasonably in UTF-8 locales],
       [gl_cv_func_wcwidth_works],
       [
@@ -80,7 +82,7 @@ int wcwidth (int);
 int main ()
 {
   int result = 0;
-  if (setlocale (LC_ALL, "fr_FR.UTF-8") != NULL)
+  if (setlocale (LC_ALL, "en_US.UTF-8") != NULL)
     {
       if (wcwidth (0x0301) > 0)
         result |= 1;
@@ -90,6 +92,8 @@ int main ()
         result |= 4;
       if (wcwidth (0xFF1A) == 0)
         result |= 8;
+      if (wcwidth (0x2202) > 1)
+        result |= 16;
     }
   return result;
 }]])],
diff --git a/tests/test-wcwidth.c b/tests/test-wcwidth.c
index eb7bdd2..8e9cea3 100644
--- a/tests/test-wcwidth.c
+++ b/tests/test-wcwidth.c
@@ -72,6 +72,22 @@ main ()
       ASSERT (wcwidth (0x200B) == 0);
       ASSERT (wcwidth (0xFEFF) <= 0);
 
+      /* Test width of some math symbols.
+         U+2202 is marked as having ambiguous width (A) in EastAsianWidth.txt
+         (see ).
+         The Unicode Standard Annex 11
+
+         says
+           "Ambiguous characters behave like wide or narrow characters
+            depending on the context (language tag, script identification,
+            associated font, source of data, or explicit markup; all can
+            provide the context). If the context cannot be established
+            reliably, they should be treated as narrow characters by default."
+         For wcwidth(), the only available context information is the locale.
+         "fr_FR.UTF-8" is a Western locale, not an East Asian locale, therefore
+         U+2202 should be treated like a narrow character.  */
+      ASSERT (wcwidth (0x2202) == 1);
+
       /* Test width of some CJK characters.  */
       ASSERT (wcwidth (0x3000) == 2);
       ASSERT (wcwidth (0xB250) == 2);
diff --git a/doc/posix-functions/wcwidth.texi b/doc/posix-functions/wcwidth.texi
index 741be8e..ecdf758 100644
--- a/doc/posix-functions/wcwidth.texi
+++ b/doc/posix-functions/wcwidth.texi
@@ -18,6 +18,10 @@ glibc 2.8.
 This function handles combining characters in UTF-8 locales incorrectly on some
 platforms:
 Mac OS X 10.3, OpenBSD 5.8.
+@item
+This function returns 2 for characters with ambiguous east asian width, even in
+Western locales, on some platforms:
+Solaris 11.4.
 @end itemize
 
 Portability problems not fixed by Gnulib: