* Android and the C locale
@ 2023-01-16 14:57 Bruno Haible
2023-01-17 13:10 ` hard_locale on Android Bruno Haible
0 siblings, 1 reply; 2+ messages in thread
From: Bruno Haible @ 2023-01-16 14:57 UTC (permalink / raw)
To: bug-gnulib
[-- Attachment #1: Type: text/plain, Size: 4090 bytes --]
Android < 5.0 had only dummy locales.
Starting with Android 5.0 (according to the Android libc's git history),
Android has real locales. But there are two problems:
1) The default locale (i.e. the locale in use when setlocale was not called)
is the "C.UTF-8" locale, not the "C" locale.
Test case:
================================================================================
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
int main ()
{
printf ("Locale=|%s| LC_CTYPE=|%s| MB_CUR_MAX=%d\n",
setlocale (LC_ALL, NULL), setlocale (LC_CTYPE, NULL), (int) MB_CUR_MAX);
}
================================================================================
prints
Locale=|C.UTF-8| LC_CTYPE=|C.UTF-8| MB_CUR_MAX=4
rather than the expected
Locale=|C| LC_CTYPE=|C| MB_CUR_MAX=1
POSIX <https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap07.html#tag_07_02>
says that the default locale should be the "C"/"POSIX" locale.
2) A setlocale call that is meant to set the "C" or "POSIX" locale actually
sets a locale with UTF-8 encoding.
Test case 1:
================================================================================
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
int main ()
{
mbstate_t state;
if (setlocale (LC_ALL, "") == NULL)
return 1;
memset (&state, '\0', sizeof (state));
printf ("Locale=|%s| LC_CTYPE=|%s| MB_CUR_MAX=%d mbrtowc(0xC0)=%d\n",
setlocale (LC_ALL, NULL), setlocale (LC_CTYPE, NULL), (int) MB_CUR_MAX,
(int) mbrtowc (NULL, "\xC0", 1, &state));
}
================================================================================
$ LC_ALL=C ./a.out
and
$ LC_ALL=POSIX ./a.out
print
Locale=|C.UTF-8| LC_CTYPE=|C.UTF-8| MB_CUR_MAX=4 mbrtowc(0xC0)=-2
rather than the expected
Locale=|C| LC_CTYPE=|C| MB_CUR_MAX=1 mbrtowc(0xC0)=-1
Test case 2:
================================================================================
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
int main ()
{
mbstate_t state;
if (setlocale (LC_ALL, "C") == NULL)
return 1;
memset (&state, '\0', sizeof (state));
printf ("Locale=|%s| LC_CTYPE=|%s| MB_CUR_MAX=%d mbrtowc(0xC0)=%d\n",
setlocale (LC_ALL, NULL), setlocale (LC_CTYPE, NULL), (int) MB_CUR_MAX,
(int) mbrtowc (NULL, "\xC0", 1, &state));
if (setlocale (LC_ALL, "POSIX") == NULL)
return 1;
memset (&state, '\0', sizeof (state));
printf ("Locale=|%s| LC_CTYPE=|%s| MB_CUR_MAX=%d mbrtowc(0xC0)=%d\n",
setlocale (LC_ALL, NULL), setlocale (LC_CTYPE, NULL), (int) MB_CUR_MAX,
(int) mbrtowc (NULL, "\xC0", 1, &state));
}
================================================================================
prints
Locale=|C| LC_CTYPE=|C| MB_CUR_MAX=4 mbrtowc(0xC0)=-2
Locale=|C| LC_CTYPE=|C| MB_CUR_MAX=4 mbrtowc(0xC0)=-2
rather than the expected
Locale=|C| LC_CTYPE=|C| MB_CUR_MAX=1 mbrtowc(0xC0)=-1
Locale=|C| LC_CTYPE=|C| MB_CUR_MAX=1 mbrtowc(0xC0)=-1
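For code that needs to notice this situation at run time, a minimal sketch
could compare the locale's name with MB_CUR_MAX. The two helper names below
are hypothetical (they are not gnulib functions), and the sketch assumes that
a "C"/"POSIX" locale reporting MB_CUR_MAX > 1 is the UTF-8 variant described
above:

```c
#include <locale.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of gnulib: returns true if the current
   LC_CTYPE locale is named "C" or "POSIX".  */
bool
ctype_locale_is_named_c (void)
{
  const char *name = setlocale (LC_CTYPE, NULL);
  return name != NULL
         && (strcmp (name, "C") == 0 || strcmp (name, "POSIX") == 0);
}

/* Hypothetical helper: returns true if the LC_CTYPE locale claims to be
   "C" or "POSIX" but nevertheless uses a multibyte encoding, as seen in
   test case 2 on Android.  */
bool
c_locale_is_multibyte (void)
{
  return ctype_locale_is_named_c () && MB_CUR_MAX > 1;
}
```

On a platform where the "C" locale is a single-byte locale, the second
function should return false after setlocale (LC_ALL, "C").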
One consequence is the following pair of test failures:
FAIL: test-mbrtoc32-5.sh
========================
../../gltests/test-mbrtoc32.c:105: assertion 'ret == 1' failed
Aborted
FAIL test-mbrtoc32-5.sh (exit status: 134)
FAIL: test-mbrtowc5.sh
======================
../../gltests/test-mbrtowc.c:105: assertion 'ret == 1' failed
Aborted
FAIL test-mbrtowc5.sh (exit status: 134)
As a workaround, I'm applying these two patches.
2023-01-16 Bruno Haible <bruno@clisp.org>
mbrtowc, mbrtoc32 tests: Avoid test failure on Android ≥ 5.0.
* tests/test-mbrtowc.c (main): On Android 5.0 or newer, when testing
the "C" locale, verify that the encoding is UTF-8.
* tests/test-mbrtoc32.c (main): Likewise.
* doc/posix-functions/setlocale.texi: Mention the Android problems.
mbrtowc, mbrtoc32 tests: Refactor.
* tests/test-mbrtowc.c (main): Straighten convoluted code.
* tests/test-mbrtoc32.c (main): Likewise.
[-- Attachment #2: 0001-mbrtowc-mbrtoc32-tests-Refactor.patch --]
[-- Type: text/x-patch, Size: 7140 bytes --]
From 1ca5866371acd6b4bdcb1913d18cc14b7a8528c1 Mon Sep 17 00:00:00 2001
From: Bruno Haible <bruno@clisp.org>
Date: Mon, 16 Jan 2023 14:30:06 +0100
Subject: [PATCH 1/2] mbrtowc, mbrtoc32 tests: Refactor.
* tests/test-mbrtowc.c (main): Straighten convoluted code.
* tests/test-mbrtoc32.c (main): Likewise.
---
ChangeLog | 6 +++++
tests/test-mbrtoc32.c | 54 ++++++++++++++++++++++++++++++-------------
tests/test-mbrtowc.c | 54 ++++++++++++++++++++++++++++++-------------
3 files changed, 82 insertions(+), 32 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index 9bc953423f..045e1c6247 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,9 @@
+2023-01-16 Bruno Haible <bruno@clisp.org>
+
+ mbrtowc, mbrtoc32 tests: Refactor.
+ * tests/test-mbrtowc.c (main): Straighten convoluted code.
+ * tests/test-mbrtoc32.c (main): Likewise.
+
2023-01-16 Paul Eggert <eggert@cs.ucla.edu>
sigpipe tests: Modernize use of 'head'.
diff --git a/tests/test-mbrtoc32.c b/tests/test-mbrtoc32.c
index c8f735d520..36b520f7b8 100644
--- a/tests/test-mbrtoc32.c
+++ b/tests/test-mbrtoc32.c
@@ -72,10 +72,6 @@ main (int argc, char *argv[])
for (c = 0; c < 0x100; c++)
switch (c)
{
- default:
- if (! (c && 1 < argc && argv[1][0] == '5'))
- break;
- FALLTHROUGH;
case '\t': case '\v': case '\f':
case ' ': case '!': case '"': case '#': case '%':
case '&': case '\'': case '(': case ')': case '*':
@@ -97,25 +93,23 @@ main (int argc, char *argv[])
case 'p': case 'q': case 'r': case 's': case 't':
case 'u': case 'v': case 'w': case 'x': case 'y':
case 'z': case '{': case '|': case '}': case '~':
- /* c is in the ISO C "basic character set", or argv[1] starts
- with '5' so we are testing all nonnull bytes. */
+ /* c is in the ISO C "basic character set". */
+ ASSERT (c < 0x80);
+ /* c is an ASCII character. */
buf[0] = c;
+
wc = (char32_t) 0xBADFACE;
ret = mbrtoc32 (&wc, buf, 1, &state);
ASSERT (ret == 1);
- if (c < 0x80)
- /* c is an ASCII character. */
- ASSERT (wc == c);
- else
- /* argv[1] starts with '5', that is, we are testing the C or POSIX
- locale.
- On most platforms, the bytes 0x80..0xFF map to U+0080..U+00FF.
- But on musl libc, the bytes 0x80..0xFF map to U+DF80..U+DFFF. */
- ASSERT (wc == (btowc (c) == 0xDF00 + c ? btowc (c) : c));
+ ASSERT (wc == c);
ASSERT (mbsinit (&state));
+
ret = mbrtoc32 (NULL, buf, 1, &state);
ASSERT (ret == 1);
ASSERT (mbsinit (&state));
+
+ break;
+ default:
break;
}
}
@@ -368,7 +362,35 @@ main (int argc, char *argv[])
return 0;
case '5':
- /* C locale; tested above. */
+ /* C or POSIX locale. */
+ {
+ int c;
+ char buf[1];
+
+ memset (&state, '\0', sizeof (mbstate_t));
+ for (c = 0; c < 0x100; c++)
+ if (c != 0)
+ {
+ /* We are testing all nonnull bytes. */
+ buf[0] = c;
+
+ wc = (char32_t) 0xBADFACE;
+ ret = mbrtoc32 (&wc, buf, 1, &state);
+ ASSERT (ret == 1);
+ if (c < 0x80)
+ /* c is an ASCII character. */
+ ASSERT (wc == c);
+ else
+ /* On most platforms, the bytes 0x80..0xFF map to U+0080..U+00FF.
+ But on musl libc, the bytes 0x80..0xFF map to U+DF80..U+DFFF. */
+ ASSERT (wc == (btowc (c) == 0xDF00 + c ? btowc (c) : c));
+ ASSERT (mbsinit (&state));
+
+ ret = mbrtoc32 (NULL, buf, 1, &state);
+ ASSERT (ret == 1);
+ ASSERT (mbsinit (&state));
+ }
+ }
return 0;
}
diff --git a/tests/test-mbrtowc.c b/tests/test-mbrtowc.c
index 9019ea0e71..b358d8d583 100644
--- a/tests/test-mbrtowc.c
+++ b/tests/test-mbrtowc.c
@@ -72,10 +72,6 @@ main (int argc, char *argv[])
for (c = 0; c < 0x100; c++)
switch (c)
{
- default:
- if (! (c && 1 < argc && argv[1][0] == '5'))
- break;
- FALLTHROUGH;
case '\t': case '\v': case '\f':
case ' ': case '!': case '"': case '#': case '%':
case '&': case '\'': case '(': case ')': case '*':
@@ -97,25 +93,23 @@ main (int argc, char *argv[])
case 'p': case 'q': case 'r': case 's': case 't':
case 'u': case 'v': case 'w': case 'x': case 'y':
case 'z': case '{': case '|': case '}': case '~':
- /* c is in the ISO C "basic character set", or argv[1] starts
- with '5' so we are testing all nonnull bytes. */
+ /* c is in the ISO C "basic character set". */
+ ASSERT (c < 0x80);
+ /* c is an ASCII character. */
buf[0] = c;
+
wc = (wchar_t) 0xBADFACE;
ret = mbrtowc (&wc, buf, 1, &state);
ASSERT (ret == 1);
- if (c < 0x80)
- /* c is an ASCII character. */
- ASSERT (wc == c);
- else
- /* argv[1] starts with '5', that is, we are testing the C or POSIX
- locale.
- On most platforms, the bytes 0x80..0xFF map to U+0080..U+00FF.
- But on musl libc, the bytes 0x80..0xFF map to U+DF80..U+DFFF. */
- ASSERT (wc == (btowc (c) == 0xDF00 + c ? btowc (c) : c));
+ ASSERT (wc == c);
ASSERT (mbsinit (&state));
+
ret = mbrtowc (NULL, buf, 1, &state);
ASSERT (ret == 1);
ASSERT (mbsinit (&state));
+
+ break;
+ default:
break;
}
}
@@ -349,7 +343,35 @@ main (int argc, char *argv[])
return 0;
case '5':
- /* C locale; tested above. */
+ /* C or POSIX locale. */
+ {
+ int c;
+ char buf[1];
+
+ memset (&state, '\0', sizeof (mbstate_t));
+ for (c = 0; c < 0x100; c++)
+ if (c != 0)
+ {
+ /* We are testing all nonnull bytes. */
+ buf[0] = c;
+
+ wc = (wchar_t) 0xBADFACE;
+ ret = mbrtowc (&wc, buf, 1, &state);
+ ASSERT (ret == 1);
+ if (c < 0x80)
+ /* c is an ASCII character. */
+ ASSERT (wc == c);
+ else
+ /* On most platforms, the bytes 0x80..0xFF map to U+0080..U+00FF.
+ But on musl libc, the bytes 0x80..0xFF map to U+DF80..U+DFFF. */
+ ASSERT (wc == (btowc (c) == 0xDF00 + c ? btowc (c) : c));
+ ASSERT (mbsinit (&state));
+
+ ret = mbrtowc (NULL, buf, 1, &state);
+ ASSERT (ret == 1);
+ ASSERT (mbsinit (&state));
+ }
+ }
return 0;
}
--
2.34.1
[-- Attachment #3: 0002-mbrtowc-mbrtoc32-tests-Avoid-test-failure-on-Android.patch --]
[-- Type: text/x-patch, Size: 4622 bytes --]
From 653bc7d23e08ab61ee2382f8773f0a95d93ab871 Mon Sep 17 00:00:00 2001
From: Bruno Haible <bruno@clisp.org>
Date: Mon, 16 Jan 2023 14:34:56 +0100
Subject: [PATCH 2/2] mbrtowc, mbrtoc32 tests: Avoid test failure on Android ≥ 5.0.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
* tests/test-mbrtowc.c (main): On Android 5.0 or newer, when testing
the "C" locale, verify that the encoding is UTF-8.
* tests/test-mbrtoc32.c (main): Likewise.
* doc/posix-functions/setlocale.texi: Mention the Android problems.
---
ChangeLog | 6 ++++++
doc/posix-functions/setlocale.texi | 8 +++++++-
tests/test-mbrtoc32.c | 10 ++++++++++
tests/test-mbrtowc.c | 10 ++++++++++
4 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/ChangeLog b/ChangeLog
index 045e1c6247..0051e3237f 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,11 @@
2023-01-16 Bruno Haible <bruno@clisp.org>
+ mbrtowc, mbrtoc32 tests: Avoid test failure on Android ≥ 5.0.
+ * tests/test-mbrtowc.c (main): On Android 5.0 or newer, when testing
+ the "C" locale, verify that the encoding is UTF-8.
+ * tests/test-mbrtoc32.c (main): Likewise.
+ * doc/posix-functions/setlocale.texi: Mention the Android problems.
+
mbrtowc, mbrtoc32 tests: Refactor.
* tests/test-mbrtowc.c (main): Straighten convoluted code.
* tests/test-mbrtoc32.c (main): Likewise.
diff --git a/doc/posix-functions/setlocale.texi b/doc/posix-functions/setlocale.texi
index 11364d3901..6e232200f8 100644
--- a/doc/posix-functions/setlocale.texi
+++ b/doc/posix-functions/setlocale.texi
@@ -21,7 +21,7 @@ On Windows platforms (excluding Cygwin), @code{setlocale} understands different
locale names, that are not based on ISO 639 language names and ISO 3166 country
names.
@item
-On Android 4.3, which which doesn't have locales, the @code{setlocale} function
+On Android < 5.0, which doesn't have locales, the @code{setlocale} function
always fails. The replacement, however, supports only the locale names
@code{"C"} and @code{"POSIX"}.
@end itemize
@@ -52,4 +52,10 @@ In addition any value is accepted for @code{LC_CTYPE}, and so NULL
is never returned to indicate a failure to set locale.
To verify category values, each category must be set individually
with @code{setlocale(LC_COLLATE,"")} etc.
+@item
+On Android 5.0 and newer, the default locale (i.e.@: the locale in use when
+@code{setlocale} was not called) is the @code{"C.UTF-8"} locale, not the
+@code{"C"} locale. Additionally, a @code{setlocale} call that is meant to set
+the @code{"C"} or @code{"POSIX"} locale actually sets an equivalent of the
+@code{"C.UTF-8"} locale.
@end itemize
diff --git a/tests/test-mbrtoc32.c b/tests/test-mbrtoc32.c
index 36b520f7b8..0d75c3db14 100644
--- a/tests/test-mbrtoc32.c
+++ b/tests/test-mbrtoc32.c
@@ -26,6 +26,7 @@ SIGNATURE_CHECK (mbrtoc32, size_t,
#include <locale.h>
#include <stdio.h>
+#include <stdlib.h>
#include <string.h>
#include "macros.h"
@@ -124,6 +125,15 @@ main (int argc, char *argv[])
ASSERT (mbsinit (&state));
}
+#ifdef __ANDROID__
+ /* On Android ≥ 5.0, the default locale is the "C.UTF-8" locale, not the
+ "C" locale. Furthermore, when you attempt to set the "C" or "POSIX"
+ locale via setlocale(), what you get is a "C" locale with UTF-8 encoding,
+ that is, effectively the "C.UTF-8" locale. */
+ if (argc > 1 && strcmp (argv[1], "5") == 0 && MB_CUR_MAX > 1)
+ argv[1] = "2";
+#endif
+
if (argc > 1)
switch (argv[1][0])
{
diff --git a/tests/test-mbrtowc.c b/tests/test-mbrtowc.c
index b358d8d583..1fdf039c42 100644
--- a/tests/test-mbrtowc.c
+++ b/tests/test-mbrtowc.c
@@ -26,6 +26,7 @@ SIGNATURE_CHECK (mbrtowc, size_t, (wchar_t *, char const *, size_t,
#include <locale.h>
#include <stdio.h>
+#include <stdlib.h>
#include <string.h>
#include "macros.h"
@@ -124,6 +125,15 @@ main (int argc, char *argv[])
ASSERT (mbsinit (&state));
}
+#ifdef __ANDROID__
+ /* On Android ≥ 5.0, the default locale is the "C.UTF-8" locale, not the
+ "C" locale. Furthermore, when you attempt to set the "C" or "POSIX"
+ locale via setlocale(), what you get is a "C" locale with UTF-8 encoding,
+ that is, effectively the "C.UTF-8" locale. */
+ if (argc > 1 && strcmp (argv[1], "5") == 0 && MB_CUR_MAX > 1)
+ argv[1] = "2";
+#endif
+
if (argc > 1)
switch (argv[1][0])
{
--
2.34.1
* hard_locale on Android
From: Bruno Haible @ 2023-01-17 13:10 UTC (permalink / raw)
To: bug-gnulib
Given the strange behaviour of the "C" locale on Android, it is no wonder
that I see a test failure:
FAIL: test-hard-locale
======================
The initial locale should not be hard!
FAIL test-hard-locale (exit status: 1)
This patch adjusts the function hard_locale in a conservative way: on
Android, hard_locale (category) now returns true also when the locale is
named "C" or "POSIX" but has a multibyte encoding such as UTF-8, since such
a locale may behave in an unknown way in the future. The unit test is
updated accordingly.
I don't expect coreutils malfunctions on Android due to this, but we'll see.
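To illustrate why the result matters to callers, here is a rough sketch of
how a program might pick between a byte-wise fast path and a multibyte path.
Both function names are hypothetical. ctype_is_hard is only a stand-in for
hard_locale (LC_CTYPE): it uses plain setlocale instead of gnulib's
setlocale_null_r (so it is not multithread-safe) and applies the MB_CUR_MAX
check on every platform, not only on Android:

```c
#include <locale.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Stand-in for hard_locale (LC_CTYPE); see the assumptions above.  */
bool
ctype_is_hard (void)
{
  const char *name = setlocale (LC_CTYPE, NULL);
  if (name == NULL)
    return false;
  if (!(strcmp (name, "C") == 0 || strcmp (name, "POSIX") == 0))
    return true;
  /* A locale named "C" may still have a multibyte encoding.  */
  return MB_CUR_MAX > 1;
}

/* Hypothetical caller: counts the characters in S.  */
size_t
count_chars (const char *s)
{
  if (!ctype_is_hard ())
    /* Fast path: one byte per character.  */
    return strlen (s);

  mbstate_t state;
  size_t n = strlen (s);
  size_t count = 0;

  memset (&state, '\0', sizeof (state));
  while (n > 0)
    {
      size_t ret = mbrtowc (NULL, s, n, &state);
      if (ret == (size_t) -1 || ret == (size_t) -2 || ret == 0)
        {
          /* Invalid or incomplete sequence: count a single byte and
             start over with a fresh conversion state.  */
          memset (&state, '\0', sizeof (state));
          ret = 1;
        }
      s += ret;
      n -= ret;
      count++;
    }
  return count;
}
```

With the stricter check, such a caller would take the multibyte path on
Android even though the locale's name is "C".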
2023-01-17 Bruno Haible <bruno@clisp.org>
hard-locale: Port to Android ≥ 5.0.
* lib/hard-locale.c: Include <stdlib.h>.
(hard_locale): On Android, consider also MB_CUR_MAX, even if the
locale's name is "C".
* tests/test-hard-locale.c (test_one, main): Assume that on Android,
even the "C" locale is hard.
diff --git a/lib/hard-locale.c b/lib/hard-locale.c
index 0a28552e75..c01fce5344 100644
--- a/lib/hard-locale.c
+++ b/lib/hard-locale.c
@@ -21,6 +21,7 @@
#include "hard-locale.h"
#include <locale.h>
+#include <stdlib.h>
#include <string.h>
bool
@@ -31,5 +32,16 @@ hard_locale (int category)
if (setlocale_null_r (category, locale, sizeof (locale)))
return false;
- return !(strcmp (locale, "C") == 0 || strcmp (locale, "POSIX") == 0);
+ if (!(strcmp (locale, "C") == 0 || strcmp (locale, "POSIX") == 0))
+ return true;
+
+#if defined __ANDROID__
+ /* On Android 5.0 or newer, it is possible to set a locale that has the same
+ name as the "C" locale but in fact uses UTF-8 encoding. Cf. test case 2 in
+ <https://lists.gnu.org/archive/html/bug-gnulib/2023-01/msg00141.html>. */
+ if (MB_CUR_MAX > 1)
+ return true;
+#endif
+
+ return false;
}
diff --git a/tests/test-hard-locale.c b/tests/test-hard-locale.c
index eb02f4f6e6..6f94e6c3ac 100644
--- a/tests/test-hard-locale.c
+++ b/tests/test-hard-locale.c
@@ -38,8 +38,10 @@ test_one (const char *name, int failure_bitmask)
/* musl libc has special code for the C.UTF-8 locale; other than that,
all locale names are accepted and all locales are trivial.
OpenBSD returns the locale name that was set, but we don't know how it
- behaves under the hood. Likewise for Haiku. */
-#if defined MUSL_LIBC || defined __OpenBSD__ || defined __HAIKU__
+ behaves under the hood. Likewise for Haiku.
+ On Android >= 5.0, the "C" locale may have UTF-8 encoding, and we don't
+ know how it will behave in the future. */
+#if defined MUSL_LIBC || defined __OpenBSD__ || defined __HAIKU__ || defined __ANDROID__
expected = true;
#else
expected = !all_trivial;
@@ -57,12 +59,14 @@ test_one (const char *name, int failure_bitmask)
/* On NetBSD 7.0, some locales such as de_DE.ISO8859-1 and de_DE.UTF-8
have the LC_COLLATE category set to "C".
- Similarly, on musl libc, with the C.UTF-8 locale. */
+ Similarly, on musl libc, with the C.UTF-8 locale.
+ On Android >= 5.0, the "C" locale may have UTF-8 encoding, and we don't
+ know how it will behave in the future. */
#if defined __NetBSD__
expected = false;
#elif defined MUSL_LIBC
expected = strcmp (name, "C.UTF-8") != 0;
-#elif (defined __OpenBSD__ && HAVE_DUPLOCALE) || defined __HAIKU__ /* OpenBSD >= 6.2, Haiku */
+#elif (defined __OpenBSD__ && HAVE_DUPLOCALE) || defined __HAIKU__ || defined __ANDROID__ /* OpenBSD >= 6.2, Haiku, Android */
expected = true;
#else
expected = !all_trivial;
@@ -86,12 +90,16 @@ main ()
{
int fail = 0;
- /* The initial locale is the "C" or "POSIX" locale. */
+ /* The initial locale is the "C" or "POSIX" locale.
+ On Android >= 5.0, it is equivalent to the "C.UTF-8" locale, cf.
+ <https://lists.gnu.org/archive/html/bug-gnulib/2023-01/msg00141.html>. */
+#if ! defined __ANDROID__
if (hard_locale (LC_CTYPE) || hard_locale (LC_COLLATE))
{
fprintf (stderr, "The initial locale should not be hard!\n");
fail |= 1;
}
+#endif
all_trivial = (setlocale (LC_ALL, "foobar") != NULL);