From: Bruno Haible <bruno@clisp.org>
To: bug-gnulib@gnu.org
Cc: Kiyoshi KANAZAWA <yoi_no_myoujou@yahoo.co.jp>,
	Akim Demaille <akim.demaille@gmail.com>
Subject: Re: mbswidth "failure" on Solaris
Date: Sun, 05 May 2019 13:35:56 +0200
Message-ID: <1717997.9M1LzPS7EO@omega>
In-Reply-To: <DF518912-F53F-4D0F-99C9-FB237BB43DDD@gmail.com>

Hi,

> >     15 | e: {∇⃗×𝐸⃗ = -∂𝐵⃗/∂t}
> > -      |    ^~~~~~~~~~~~~~
> > +      |    ^~~~~~~~~~~~~~~~~

Indeed, mbswidth seems to have returned 3 more columns.

> The error (three more columns than expected) seems to indicate something
> related to the combining arrow.

No. The issue comes from the math symbols. The following test program shows
it:

#include <config.h>
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
#include "mbswidth.h"
int main ()
{
  setlocale (LC_ALL, "en_US.UTF-8");
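  /* Each comment gives the expected value, then the Solaris value if it
     differs.  */
  printf ("%d\n", (int) mbswidth ("{∇⃗×𝐸⃗ = -∂𝐵⃗/∂t}", 0)); // 14 vs. 17
  printf ("%d\n", wcwidth (0x2207)); // NABLA: 1 vs. 2
  printf ("%d\n", wcwidth (0x20D7)); // COMBINING RIGHT ARROW ABOVE: 0
  printf ("%d\n", wcwidth (0x00D7)); // MULTIPLICATION SIGN: 1
  printf ("%d\n", wcwidth (0x1D438)); // MATHEMATICAL ITALIC CAPITAL E: 1
  printf ("%d\n", wcwidth (0x2202)); // PARTIAL DIFFERENTIAL: 1 vs. 2
  printf ("%d\n", wcwidth (0x1D435)); // MATHEMATICAL ITALIC CAPITAL B: 1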
}

The following patch should fix it.

The patch changes the behaviour of wcwidth(0x2202) for UTF-8 locales.
It would be possible to limit the change to the non-East-Asian UTF-8
locales (by using the function uc_locale_language() and testing
whether its result is not one of "zh", "ja", "ko"), but glibc does not
do this (it uses the same width across all UTF-8 locales), therefore
I'm not doing it here either.


2019-05-05  Bruno Haible  <bruno@clisp.org>

	wcwidth: Ensure width 1, not 2, for ambiguous characters.
	Reported by Kiyoshi KANAZAWA <yoi_no_myoujou@yahoo.co.jp>
	via Akim Demaille <akim.demaille@gmail.com>.
	* m4/wcwidth.m4 (gl_FUNC_WCWIDTH): Check the width of U+2202. Use an
	en_US.UTF-8 locale, since that is more likely to be present than an
	fr_FR.UTF-8 locale.
	* tests/test-wcwidth.c (main): Check the width of U+2202.
	* doc/posix-functions/wcwidth.texi: Mention the issue.

diff --git a/m4/wcwidth.m4 b/m4/wcwidth.m4
index 3952fd2..e9b5bf4 100644
--- a/m4/wcwidth.m4
+++ b/m4/wcwidth.m4
@@ -1,4 +1,4 @@
-# wcwidth.m4 serial 28
+# wcwidth.m4 serial 29
 dnl Copyright (C) 2006-2019 Free Software Foundation, Inc.
 dnl This file is free software; the Free Software Foundation
 dnl gives unlimited permission to copy and/or distribute it,
@@ -54,6 +54,8 @@ AC_DEFUN([gl_FUNC_WCWIDTH],
     dnl On OSF/1 5.1, wcwidth(0x200B) (ZERO WIDTH SPACE) returns 1.
     dnl On OpenBSD 5.8, wcwidth(0xFF1A) (FULLWIDTH COLON) returns 0.
     dnl This leads to bugs in 'ls' (coreutils).
+    dnl On Solaris 11.4, wcwidth(0x2202) (PARTIAL DIFFERENTIAL) returns 2,
+    dnl even in Western locales.
     AC_CACHE_CHECK([whether wcwidth works reasonably in UTF-8 locales],
       [gl_cv_func_wcwidth_works],
       [
@@ -80,7 +82,7 @@ int wcwidth (int);
 int main ()
 {
   int result = 0;
-  if (setlocale (LC_ALL, "fr_FR.UTF-8") != NULL)
+  if (setlocale (LC_ALL, "en_US.UTF-8") != NULL)
     {
       if (wcwidth (0x0301) > 0)
         result |= 1;
@@ -90,6 +92,8 @@ int main ()
         result |= 4;
       if (wcwidth (0xFF1A) == 0)
         result |= 8;
+      if (wcwidth (0x2202) > 1)
+        result |= 16;
     }
   return result;
 }]])],
diff --git a/tests/test-wcwidth.c b/tests/test-wcwidth.c
index eb7bdd2..8e9cea3 100644
--- a/tests/test-wcwidth.c
+++ b/tests/test-wcwidth.c
@@ -72,6 +72,22 @@ main ()
       ASSERT (wcwidth (0x200B) == 0);
       ASSERT (wcwidth (0xFEFF) <= 0);
 
+      /* Test width of some math symbols.
+         U+2202 is marked as having ambiguous width (A) in EastAsianWidth.txt
+         (see <https://www.unicode.org/Public/12.0.0/ucd/EastAsianWidth.txt>).
+         The Unicode Standard Annex 11
+         <https://www.unicode.org/reports/tr11/tr11-36.html>
+         says
+           "Ambiguous characters behave like wide or narrow characters
+            depending on the context (language tag, script identification,
+            associated font, source of data, or explicit markup; all can
+            provide the context). If the context cannot be established
+            reliably, they should be treated as narrow characters by default."
+         For wcwidth(), the only available context information is the locale.
+         "fr_FR.UTF-8" is a Western locale, not an East Asian locale, therefore
+         U+2202 should be treated like a narrow character.  */
+      ASSERT (wcwidth (0x2202) == 1);
+
       /* Test width of some CJK characters.  */
       ASSERT (wcwidth (0x3000) == 2);
       ASSERT (wcwidth (0xB250) == 2);
diff --git a/doc/posix-functions/wcwidth.texi b/doc/posix-functions/wcwidth.texi
index 741be8e..ecdf758 100644
--- a/doc/posix-functions/wcwidth.texi
+++ b/doc/posix-functions/wcwidth.texi
@@ -18,6 +18,10 @@ glibc 2.8.
 This function handles combining characters in UTF-8 locales incorrectly on some
 platforms:
 Mac OS X 10.3, OpenBSD 5.8.
+@item
+This function returns 2 for characters with ambiguous East Asian width, even in
+Western locales, on some platforms:
+Solaris 11.4.
 @end itemize
 
 Portability problems not fixed by Gnulib:
