unofficial mirror of libc-alpha@sourceware.org
* [PATCH v12 0/2] C.UTF-8
@ 2021-09-06 15:43 Carlos O'Donell via Libc-alpha
  2021-09-06 15:43 ` [PATCH v12 1/2] Add 'codepoint_collation' support for LC_COLLATE Carlos O'Donell via Libc-alpha
  2021-09-06 15:43 ` [PATCH v12 2/2] Add generic C.UTF-8 locale (Bug 17318) Carlos O'Donell via Libc-alpha
  0 siblings, 2 replies; 12+ messages in thread
From: Carlos O'Donell via Libc-alpha @ 2021-09-06 15:43 UTC (permalink / raw)
  To: libc-alpha

The following changes implement a minimally sized C.UTF-8.
First we implement the 'codepoint_collation' directive.
Then we implement C.UTF-8 with an LC_COLLATE that uses the
'codepoint_collation' directive to support using strcmp
or wcscmp for collation, i.e. code point sorting. The final
C.UTF-8 is only ~396KiB, of which the largest part (~346KiB)
is LC_CTYPE covering all of Unicode.

v12 fixes the commit message to match NEWS.

v11 fixes a defect in the tst-regex.c test. All modified
tests were reviewed for similar defects and none were
found. v11 also removes an obsolete Contributed-by line
in C-collate-seq.c.

v10 fixes a defect in the transbug.c test.

v9 is rebased against the changes to remove ISO-8859-1
characters from the bug-regex1.c test
(69623c0db0a540f26ee537bae09446d3dcdf1f80).

v8 includes a NEWS entry for the updated C.UTF-8.

v7 fixed the regressions detected in Fedora Rawhide
(https://bugzilla.redhat.com/show_bug.cgi?id=1986421)
by generating identity tables for _NL_COLLATE_COLLSEQMB
and _NL_COLLATE_COLLSEQWC to provide mappings for ASCII
characters. This ensures that static applications using
the new C.UTF-8 have a functioning fnmatch, regcomp, and
regexec for ASCII ranges. This raises the size of
LC_COLLATE from 92 to 1406 bytes. Valgrind reports no
errors using the tables with C.UTF-8 under tst-fnmatch.
v7 also corrected collation sequence byte ordering on BE
targets, and I verified this by cross-building locales
with localedef --big-endian and confirming that a C.UTF-8
built natively on s390x is the same as a C.UTF-8 built on
x86_64 with --big-endian.

The fixes that were in v4 for nrules == 0 will be included
in the next release of glibc, and when those are proven
correct they can be backported to provide dynamic
or newly compiled static applications with the ability
to use all code points in ranges.

Carlos O'Donell (2):
  Add 'codepoint_collation' support for LC_COLLATE.
  Add generic C.UTF-8 locale (Bug 17318)

 NEWS                             |  10 +-
 iconv/Makefile                   |  22 +-
 iconv/tst-iconv9.c               |  87 +++++
 locale/C-collate-seq.c           | 100 ++++++
 locale/C-collate.c               |  78 +----
 locale/programs/ld-collate.c     |  36 +-
 locale/programs/locfile-kw.gperf |   1 +
 locale/programs/locfile-kw.h     | 299 ++++++++---------
 locale/programs/locfile-token.h  |   1 +
 localedata/C.UTF-8.in            | 157 +++++++++
 localedata/Makefile              |   2 +
 localedata/SUPPORTED             |   1 +
 localedata/locales/C             | 194 +++++++++++
 posix/Makefile                   |  16 +-
 posix/bug-regex1.c               |  20 ++
 posix/bug-regex19.c              |  22 +-
 posix/bug-regex4.c               |  25 ++
 posix/bug-regex6.c               |   2 +-
 posix/transbug.c                 |  24 +-
 posix/tst-fnmatch.input          | 549 ++++++++++++++++++++++++++++++-
 posix/tst-regcomp-truncated.c    |   1 +
 posix/tst-regex.c                |  33 +-
 22 files changed, 1417 insertions(+), 263 deletions(-)
 create mode 100644 iconv/tst-iconv9.c
 create mode 100644 locale/C-collate-seq.c
 create mode 100644 localedata/C.UTF-8.in
 create mode 100644 localedata/locales/C

-- 
2.31.1



* [PATCH v12 1/2] Add 'codepoint_collation' support for LC_COLLATE.
  2021-09-06 15:43 [PATCH v12 0/2] C.UTF-8 Carlos O'Donell via Libc-alpha
@ 2021-09-06 15:43 ` Carlos O'Donell via Libc-alpha
  2021-09-06 17:20   ` Matheus Castanho via Libc-alpha
  2021-09-06 15:43 ` [PATCH v12 2/2] Add generic C.UTF-8 locale (Bug 17318) Carlos O'Donell via Libc-alpha
  1 sibling, 1 reply; 12+ messages in thread
From: Carlos O'Donell via Libc-alpha @ 2021-09-06 15:43 UTC (permalink / raw)
  To: libc-alpha; +Cc: Florian Weimer

Support a new directive 'codepoint_collation' in the LC_COLLATE
section of a locale source file. This new directive causes all
collation rules to be dropped and instead STRCMP (strcmp or
wcscmp) is used for collation of the input character set. This
is required to allow for a C.UTF-8 that contains zero collation
rules (minimal size) and sorts using code point sorting.

To date the only implementation of a locale with zero collation
rules is the C/POSIX locale. The C/POSIX locale provides
identity tables for _NL_COLLATE_COLLSEQMB and
_NL_COLLATE_COLLSEQWC that map to ASCII even though it has zero
rules. This has led to existing fnmatch, regexec, and regcomp
implementations that require these tables. It is not correct
to use these tables when nrules == 0, but the conservative fix
is to provide these tables when nrules == 0. This ensures that
existing static applications using a new C.UTF-8 locale with
'codepoint_collation' at least have functional range expressions
with ASCII e.g. [0-9] or [a-z]. Such static applications would
not have the fixes to fnmatch, regexec and regcomp that avoid
the use of the tables when nrules == 0. Future fixes to fnmatch,
regexec, and regcomp would allow range expressions to use the
full set of code points for such ranges.

Tested on x86_64 and i686 without regression.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
---
 locale/C-collate-seq.c           | 100 +++++++++++
 locale/C-collate.c               |  78 +-------
 locale/programs/ld-collate.c     |  36 +++-
 locale/programs/locfile-kw.gperf |   1 +
 locale/programs/locfile-kw.h     | 299 ++++++++++++++++---------------
 locale/programs/locfile-token.h  |   1 +
 6 files changed, 286 insertions(+), 229 deletions(-)
 create mode 100644 locale/C-collate-seq.c

diff --git a/locale/C-collate-seq.c b/locale/C-collate-seq.c
new file mode 100644
index 0000000000..4fb82cb835
--- /dev/null
+++ b/locale/C-collate-seq.c
@@ -0,0 +1,100 @@
+/* Copyright (C) 1995-2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <stdint.h>
+
+static const char collseqmb[] =
+{
+  '\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07',
+  '\x08', '\x09', '\x0a', '\x0b', '\x0c', '\x0d', '\x0e', '\x0f',
+  '\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17',
+  '\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f',
+  '\x20', '\x21', '\x22', '\x23', '\x24', '\x25', '\x26', '\x27',
+  '\x28', '\x29', '\x2a', '\x2b', '\x2c', '\x2d', '\x2e', '\x2f',
+  '\x30', '\x31', '\x32', '\x33', '\x34', '\x35', '\x36', '\x37',
+  '\x38', '\x39', '\x3a', '\x3b', '\x3c', '\x3d', '\x3e', '\x3f',
+  '\x40', '\x41', '\x42', '\x43', '\x44', '\x45', '\x46', '\x47',
+  '\x48', '\x49', '\x4a', '\x4b', '\x4c', '\x4d', '\x4e', '\x4f',
+  '\x50', '\x51', '\x52', '\x53', '\x54', '\x55', '\x56', '\x57',
+  '\x58', '\x59', '\x5a', '\x5b', '\x5c', '\x5d', '\x5e', '\x5f',
+  '\x60', '\x61', '\x62', '\x63', '\x64', '\x65', '\x66', '\x67',
+  '\x68', '\x69', '\x6a', '\x6b', '\x6c', '\x6d', '\x6e', '\x6f',
+  '\x70', '\x71', '\x72', '\x73', '\x74', '\x75', '\x76', '\x77',
+  '\x78', '\x79', '\x7a', '\x7b', '\x7c', '\x7d', '\x7e', '\x7f',
+  '\x80', '\x81', '\x82', '\x83', '\x84', '\x85', '\x86', '\x87',
+  '\x88', '\x89', '\x8a', '\x8b', '\x8c', '\x8d', '\x8e', '\x8f',
+  '\x90', '\x91', '\x92', '\x93', '\x94', '\x95', '\x96', '\x97',
+  '\x98', '\x99', '\x9a', '\x9b', '\x9c', '\x9d', '\x9e', '\x9f',
+  '\xa0', '\xa1', '\xa2', '\xa3', '\xa4', '\xa5', '\xa6', '\xa7',
+  '\xa8', '\xa9', '\xaa', '\xab', '\xac', '\xad', '\xae', '\xaf',
+  '\xb0', '\xb1', '\xb2', '\xb3', '\xb4', '\xb5', '\xb6', '\xb7',
+  '\xb8', '\xb9', '\xba', '\xbb', '\xbc', '\xbd', '\xbe', '\xbf',
+  '\xc0', '\xc1', '\xc2', '\xc3', '\xc4', '\xc5', '\xc6', '\xc7',
+  '\xc8', '\xc9', '\xca', '\xcb', '\xcc', '\xcd', '\xce', '\xcf',
+  '\xd0', '\xd1', '\xd2', '\xd3', '\xd4', '\xd5', '\xd6', '\xd7',
+  '\xd8', '\xd9', '\xda', '\xdb', '\xdc', '\xdd', '\xde', '\xdf',
+  '\xe0', '\xe1', '\xe2', '\xe3', '\xe4', '\xe5', '\xe6', '\xe7',
+  '\xe8', '\xe9', '\xea', '\xeb', '\xec', '\xed', '\xee', '\xef',
+  '\xf0', '\xf1', '\xf2', '\xf3', '\xf4', '\xf5', '\xf6', '\xf7',
+  '\xf8', '\xf9', '\xfa', '\xfb', '\xfc', '\xfd', '\xfe', '\xff'
+};
+
+/* This table must be 256 bytes in size. We index bytes into the
+   table to find the collation sequence.  */
+_Static_assert (sizeof (collseqmb) == 256);
+
+static const uint32_t collseqwc[] =
+{
+  8, 1, 8, 0x0, 0xff,
+  /* 1st-level table */
+  6 * sizeof (uint32_t),
+  /* 2nd-level table */
+  7 * sizeof (uint32_t),
+  /* 3rd-level table */
+  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
+  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',
+  L'\x10', L'\x11', L'\x12', L'\x13', L'\x14', L'\x15', L'\x16', L'\x17',
+  L'\x18', L'\x19', L'\x1a', L'\x1b', L'\x1c', L'\x1d', L'\x1e', L'\x1f',
+  L'\x20', L'\x21', L'\x22', L'\x23', L'\x24', L'\x25', L'\x26', L'\x27',
+  L'\x28', L'\x29', L'\x2a', L'\x2b', L'\x2c', L'\x2d', L'\x2e', L'\x2f',
+  L'\x30', L'\x31', L'\x32', L'\x33', L'\x34', L'\x35', L'\x36', L'\x37',
+  L'\x38', L'\x39', L'\x3a', L'\x3b', L'\x3c', L'\x3d', L'\x3e', L'\x3f',
+  L'\x40', L'\x41', L'\x42', L'\x43', L'\x44', L'\x45', L'\x46', L'\x47',
+  L'\x48', L'\x49', L'\x4a', L'\x4b', L'\x4c', L'\x4d', L'\x4e', L'\x4f',
+  L'\x50', L'\x51', L'\x52', L'\x53', L'\x54', L'\x55', L'\x56', L'\x57',
+  L'\x58', L'\x59', L'\x5a', L'\x5b', L'\x5c', L'\x5d', L'\x5e', L'\x5f',
+  L'\x60', L'\x61', L'\x62', L'\x63', L'\x64', L'\x65', L'\x66', L'\x67',
+  L'\x68', L'\x69', L'\x6a', L'\x6b', L'\x6c', L'\x6d', L'\x6e', L'\x6f',
+  L'\x70', L'\x71', L'\x72', L'\x73', L'\x74', L'\x75', L'\x76', L'\x77',
+  L'\x78', L'\x79', L'\x7a', L'\x7b', L'\x7c', L'\x7d', L'\x7e', L'\x7f',
+  L'\x80', L'\x81', L'\x82', L'\x83', L'\x84', L'\x85', L'\x86', L'\x87',
+  L'\x88', L'\x89', L'\x8a', L'\x8b', L'\x8c', L'\x8d', L'\x8e', L'\x8f',
+  L'\x90', L'\x91', L'\x92', L'\x93', L'\x94', L'\x95', L'\x96', L'\x97',
+  L'\x98', L'\x99', L'\x9a', L'\x9b', L'\x9c', L'\x9d', L'\x9e', L'\x9f',
+  L'\xa0', L'\xa1', L'\xa2', L'\xa3', L'\xa4', L'\xa5', L'\xa6', L'\xa7',
+  L'\xa8', L'\xa9', L'\xaa', L'\xab', L'\xac', L'\xad', L'\xae', L'\xaf',
+  L'\xb0', L'\xb1', L'\xb2', L'\xb3', L'\xb4', L'\xb5', L'\xb6', L'\xb7',
+  L'\xb8', L'\xb9', L'\xba', L'\xbb', L'\xbc', L'\xbd', L'\xbe', L'\xbf',
+  L'\xc0', L'\xc1', L'\xc2', L'\xc3', L'\xc4', L'\xc5', L'\xc6', L'\xc7',
+  L'\xc8', L'\xc9', L'\xca', L'\xcb', L'\xcc', L'\xcd', L'\xce', L'\xcf',
+  L'\xd0', L'\xd1', L'\xd2', L'\xd3', L'\xd4', L'\xd5', L'\xd6', L'\xd7',
+  L'\xd8', L'\xd9', L'\xda', L'\xdb', L'\xdc', L'\xdd', L'\xde', L'\xdf',
+  L'\xe0', L'\xe1', L'\xe2', L'\xe3', L'\xe4', L'\xe5', L'\xe6', L'\xe7',
+  L'\xe8', L'\xe9', L'\xea', L'\xeb', L'\xec', L'\xed', L'\xee', L'\xef',
+  L'\xf0', L'\xf1', L'\xf2', L'\xf3', L'\xf4', L'\xf5', L'\xf6', L'\xf7',
+  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
+};
diff --git a/locale/C-collate.c b/locale/C-collate.c
index 02b70570a4..bc93819f32 100644
--- a/locale/C-collate.c
+++ b/locale/C-collate.c
@@ -19,83 +19,7 @@
 #include <stdint.h>
 #include "localeinfo.h"
 
-static const char collseqmb[] =
-{
-  '\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07',
-  '\x08', '\x09', '\x0a', '\x0b', '\x0c', '\x0d', '\x0e', '\x0f',
-  '\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17',
-  '\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f',
-  '\x20', '\x21', '\x22', '\x23', '\x24', '\x25', '\x26', '\x27',
-  '\x28', '\x29', '\x2a', '\x2b', '\x2c', '\x2d', '\x2e', '\x2f',
-  '\x30', '\x31', '\x32', '\x33', '\x34', '\x35', '\x36', '\x37',
-  '\x38', '\x39', '\x3a', '\x3b', '\x3c', '\x3d', '\x3e', '\x3f',
-  '\x40', '\x41', '\x42', '\x43', '\x44', '\x45', '\x46', '\x47',
-  '\x48', '\x49', '\x4a', '\x4b', '\x4c', '\x4d', '\x4e', '\x4f',
-  '\x50', '\x51', '\x52', '\x53', '\x54', '\x55', '\x56', '\x57',
-  '\x58', '\x59', '\x5a', '\x5b', '\x5c', '\x5d', '\x5e', '\x5f',
-  '\x60', '\x61', '\x62', '\x63', '\x64', '\x65', '\x66', '\x67',
-  '\x68', '\x69', '\x6a', '\x6b', '\x6c', '\x6d', '\x6e', '\x6f',
-  '\x70', '\x71', '\x72', '\x73', '\x74', '\x75', '\x76', '\x77',
-  '\x78', '\x79', '\x7a', '\x7b', '\x7c', '\x7d', '\x7e', '\x7f',
-  '\x80', '\x81', '\x82', '\x83', '\x84', '\x85', '\x86', '\x87',
-  '\x88', '\x89', '\x8a', '\x8b', '\x8c', '\x8d', '\x8e', '\x8f',
-  '\x90', '\x91', '\x92', '\x93', '\x94', '\x95', '\x96', '\x97',
-  '\x98', '\x99', '\x9a', '\x9b', '\x9c', '\x9d', '\x9e', '\x9f',
-  '\xa0', '\xa1', '\xa2', '\xa3', '\xa4', '\xa5', '\xa6', '\xa7',
-  '\xa8', '\xa9', '\xaa', '\xab', '\xac', '\xad', '\xae', '\xaf',
-  '\xb0', '\xb1', '\xb2', '\xb3', '\xb4', '\xb5', '\xb6', '\xb7',
-  '\xb8', '\xb9', '\xba', '\xbb', '\xbc', '\xbd', '\xbe', '\xbf',
-  '\xc0', '\xc1', '\xc2', '\xc3', '\xc4', '\xc5', '\xc6', '\xc7',
-  '\xc8', '\xc9', '\xca', '\xcb', '\xcc', '\xcd', '\xce', '\xcf',
-  '\xd0', '\xd1', '\xd2', '\xd3', '\xd4', '\xd5', '\xd6', '\xd7',
-  '\xd8', '\xd9', '\xda', '\xdb', '\xdc', '\xdd', '\xde', '\xdf',
-  '\xe0', '\xe1', '\xe2', '\xe3', '\xe4', '\xe5', '\xe6', '\xe7',
-  '\xe8', '\xe9', '\xea', '\xeb', '\xec', '\xed', '\xee', '\xef',
-  '\xf0', '\xf1', '\xf2', '\xf3', '\xf4', '\xf5', '\xf6', '\xf7',
-  '\xf8', '\xf9', '\xfa', '\xfb', '\xfc', '\xfd', '\xfe', '\xff'
-};
-
-static const uint32_t collseqwc[] =
-{
-  8, 1, 8, 0x0, 0xff,
-  /* 1st-level table */
-  6 * sizeof (uint32_t),
-  /* 2nd-level table */
-  7 * sizeof (uint32_t),
-  /* 3rd-level table */
-  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
-  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',
-  L'\x10', L'\x11', L'\x12', L'\x13', L'\x14', L'\x15', L'\x16', L'\x17',
-  L'\x18', L'\x19', L'\x1a', L'\x1b', L'\x1c', L'\x1d', L'\x1e', L'\x1f',
-  L'\x20', L'\x21', L'\x22', L'\x23', L'\x24', L'\x25', L'\x26', L'\x27',
-  L'\x28', L'\x29', L'\x2a', L'\x2b', L'\x2c', L'\x2d', L'\x2e', L'\x2f',
-  L'\x30', L'\x31', L'\x32', L'\x33', L'\x34', L'\x35', L'\x36', L'\x37',
-  L'\x38', L'\x39', L'\x3a', L'\x3b', L'\x3c', L'\x3d', L'\x3e', L'\x3f',
-  L'\x40', L'\x41', L'\x42', L'\x43', L'\x44', L'\x45', L'\x46', L'\x47',
-  L'\x48', L'\x49', L'\x4a', L'\x4b', L'\x4c', L'\x4d', L'\x4e', L'\x4f',
-  L'\x50', L'\x51', L'\x52', L'\x53', L'\x54', L'\x55', L'\x56', L'\x57',
-  L'\x58', L'\x59', L'\x5a', L'\x5b', L'\x5c', L'\x5d', L'\x5e', L'\x5f',
-  L'\x60', L'\x61', L'\x62', L'\x63', L'\x64', L'\x65', L'\x66', L'\x67',
-  L'\x68', L'\x69', L'\x6a', L'\x6b', L'\x6c', L'\x6d', L'\x6e', L'\x6f',
-  L'\x70', L'\x71', L'\x72', L'\x73', L'\x74', L'\x75', L'\x76', L'\x77',
-  L'\x78', L'\x79', L'\x7a', L'\x7b', L'\x7c', L'\x7d', L'\x7e', L'\x7f',
-  L'\x80', L'\x81', L'\x82', L'\x83', L'\x84', L'\x85', L'\x86', L'\x87',
-  L'\x88', L'\x89', L'\x8a', L'\x8b', L'\x8c', L'\x8d', L'\x8e', L'\x8f',
-  L'\x90', L'\x91', L'\x92', L'\x93', L'\x94', L'\x95', L'\x96', L'\x97',
-  L'\x98', L'\x99', L'\x9a', L'\x9b', L'\x9c', L'\x9d', L'\x9e', L'\x9f',
-  L'\xa0', L'\xa1', L'\xa2', L'\xa3', L'\xa4', L'\xa5', L'\xa6', L'\xa7',
-  L'\xa8', L'\xa9', L'\xaa', L'\xab', L'\xac', L'\xad', L'\xae', L'\xaf',
-  L'\xb0', L'\xb1', L'\xb2', L'\xb3', L'\xb4', L'\xb5', L'\xb6', L'\xb7',
-  L'\xb8', L'\xb9', L'\xba', L'\xbb', L'\xbc', L'\xbd', L'\xbe', L'\xbf',
-  L'\xc0', L'\xc1', L'\xc2', L'\xc3', L'\xc4', L'\xc5', L'\xc6', L'\xc7',
-  L'\xc8', L'\xc9', L'\xca', L'\xcb', L'\xcc', L'\xcd', L'\xce', L'\xcf',
-  L'\xd0', L'\xd1', L'\xd2', L'\xd3', L'\xd4', L'\xd5', L'\xd6', L'\xd7',
-  L'\xd8', L'\xd9', L'\xda', L'\xdb', L'\xdc', L'\xdd', L'\xde', L'\xdf',
-  L'\xe0', L'\xe1', L'\xe2', L'\xe3', L'\xe4', L'\xe5', L'\xe6', L'\xe7',
-  L'\xe8', L'\xe9', L'\xea', L'\xeb', L'\xec', L'\xed', L'\xee', L'\xef',
-  L'\xf0', L'\xf1', L'\xf2', L'\xf3', L'\xf4', L'\xf5', L'\xf6', L'\xf7',
-  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
-};
+#include "C-collate-seq.c"
 
 const struct __locale_data _nl_C_LC_COLLATE attribute_hidden =
 {
diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c
index f4a8f34e46..06a5203334 100644
--- a/locale/programs/ld-collate.c
+++ b/locale/programs/ld-collate.c
@@ -23,6 +23,7 @@
 #include <wchar.h>
 #include <stdint.h>
 #include <sys/param.h>
+#include <array_length.h>
 
 #include "localedef.h"
 #include "charmap.h"
@@ -194,6 +195,9 @@ struct name_list
 /* The real definition of the struct for the LC_COLLATE locale.  */
 struct locale_collate_t
 {
+  /* Does the locale use code points to compare the encoding?  */
+  bool codepoint_collation;
+
   int col_weight_max;
   int cur_weight_max;
 
@@ -1509,6 +1513,7 @@ collate_startup (struct linereader *ldfile, struct localedef_t *locale,
 	  obstack_init (&collate->mempool);
 
 	  collate->col_weight_max = -1;
+	  collate->codepoint_collation = false;
 	}
       else
 	/* Reuse the copy_locale's data structures.  */
@@ -1567,6 +1572,10 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
       return;
     }
 
+  /* No data required.  */
+  if (collate->codepoint_collation)
+    return;
+
   /* If this assertion is hit change the type in `element_t'.  */
   assert (nrules <= sizeof (runp->used_in_level) * 8);
 
@@ -2091,6 +2100,10 @@ add_to_tablewc (uint32_t ch, struct element_t *runp)
     }
 }
 
+/* Include the C locale identity tables for _NL_COLLATE_COLLSEQMB and
+   _NL_COLLATE_COLLSEQWC.  */
+#include "C-collate-seq.c"
+
 void
 collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 		const char *output_path)
@@ -2114,7 +2127,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
   add_locale_uint32 (&file, nrules);
 
   /* If we have no LC_COLLATE data emit only the number of rules as zero.  */
-  if (collate == NULL)
+  if (collate == NULL || collate->codepoint_collation)
     {
       size_t idx;
       for (idx = 1; idx < nelems; idx++)
@@ -2122,6 +2135,17 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 	  /* The words have to be handled specially.  */
 	  if (idx == _NL_ITEM_INDEX (_NL_COLLATE_SYMB_HASH_SIZEMB))
 	    add_locale_uint32 (&file, 0);
+	  else if (idx == _NL_ITEM_INDEX (_NL_COLLATE_CODESET)
+		   && collate != NULL)
+	    /* A valid LC_COLLATE must have a code set name.  */
+	    add_locale_string (&file, charmap->code_set_name);
+	  else if (idx == _NL_ITEM_INDEX (_NL_COLLATE_COLLSEQMB)
+		   && collate != NULL)
+	    add_locale_raw_data (&file, collseqmb, sizeof (collseqmb));
+	  else if (idx == _NL_ITEM_INDEX (_NL_COLLATE_COLLSEQWC)
+		   && collate != NULL)
+	    add_locale_uint32_array (&file, collseqwc,
+				     array_length (collseqwc));
 	  else
 	    add_locale_empty (&file);
 	}
@@ -2671,6 +2695,10 @@ collate_read (struct linereader *ldfile, struct localedef_t *result,
 
       switch (nowtok)
 	{
+	case tok_codepoint_collation:
+	  collate->codepoint_collation = true;
+	  break;
+
 	case tok_copy:
 	  /* Allow copying other locales.  */
 	  now = lr_token (ldfile, charmap, result, NULL, verbose);
@@ -3741,9 +3769,11 @@ error while adding equivalent collating symbol"));
 	  /* Next we assume `LC_COLLATE'.  */
 	  if (!ignore_content)
 	    {
-	      if (state == 0 && copy_locale == NULL)
+	      if (state == 0
+		  && copy_locale == NULL
+		  && !collate->codepoint_collation)
 		/* We must either see a copy statement or have
-		   ordering values.  */
+		   ordering values, or codepoint_collation.  */
 		lr_error (ldfile,
 			  _("%s: empty category description not allowed"),
 			  "LC_COLLATE");
diff --git a/locale/programs/locfile-kw.gperf b/locale/programs/locfile-kw.gperf
index 0d3b95d77b..5ca9b47085 100644
--- a/locale/programs/locfile-kw.gperf
+++ b/locale/programs/locfile-kw.gperf
@@ -53,6 +53,7 @@ translit_end,           tok_translit_end,           0
 translit_ignore,        tok_translit_ignore,        0
 default_missing,        tok_default_missing,        0
 LC_COLLATE,             tok_lc_collate,             0
+codepoint_collation,    tok_codepoint_collation,    0
 coll_weight_max,        tok_coll_weight_max,        0
 section-symbol,         tok_section_symbol,         0
 collating-element,      tok_collating_element,      0
diff --git a/locale/programs/locfile-kw.h b/locale/programs/locfile-kw.h
index dc150bb8f8..c57d74f5f3 100644
--- a/locale/programs/locfile-kw.h
+++ b/locale/programs/locfile-kw.h
@@ -53,7 +53,7 @@
 #line 24 "locfile-kw.gperf"
 struct keyword_t ;
 
-#define TOTAL_KEYWORDS 178
+#define TOTAL_KEYWORDS 179
 #define MIN_WORD_LENGTH 3
 #define MAX_WORD_LENGTH 22
 #define MIN_HASH_VALUE 3
@@ -133,92 +133,92 @@ locfile_hash (register const char *str, register size_t len)
 #line 31 "locfile-kw.gperf"
       {"END",                    tok_end,                    0},
       {""}, {""},
-#line 70 "locfile-kw.gperf"
+#line 71 "locfile-kw.gperf"
       {"IGNORE",                 tok_ignore,                 0},
-#line 129 "locfile-kw.gperf"
+#line 130 "locfile-kw.gperf"
       {"LC_TIME",                tok_lc_time,                0},
 #line 30 "locfile-kw.gperf"
       {"LC_CTYPE",               tok_lc_ctype,               0},
       {""},
-#line 168 "locfile-kw.gperf"
+#line 169 "locfile-kw.gperf"
       {"LC_ADDRESS",             tok_lc_address,             0},
-#line 153 "locfile-kw.gperf"
+#line 154 "locfile-kw.gperf"
       {"LC_MESSAGES",            tok_lc_messages,            0},
-#line 161 "locfile-kw.gperf"
+#line 162 "locfile-kw.gperf"
       {"LC_NAME",                tok_lc_name,                0},
-#line 158 "locfile-kw.gperf"
+#line 159 "locfile-kw.gperf"
       {"LC_PAPER",               tok_lc_paper,               0},
-#line 186 "locfile-kw.gperf"
+#line 187 "locfile-kw.gperf"
       {"LC_MEASUREMENT",         tok_lc_measurement,         0},
 #line 56 "locfile-kw.gperf"
       {"LC_COLLATE",             tok_lc_collate,             0},
       {""},
-#line 188 "locfile-kw.gperf"
+#line 189 "locfile-kw.gperf"
       {"LC_IDENTIFICATION",      tok_lc_identification,      0},
-#line 201 "locfile-kw.gperf"
+#line 202 "locfile-kw.gperf"
       {"revision",               tok_revision,               0},
-#line 69 "locfile-kw.gperf"
+#line 70 "locfile-kw.gperf"
       {"UNDEFINED",              tok_undefined,              0},
-#line 125 "locfile-kw.gperf"
+#line 126 "locfile-kw.gperf"
       {"LC_NUMERIC",             tok_lc_numeric,             0},
-#line 82 "locfile-kw.gperf"
+#line 83 "locfile-kw.gperf"
       {"LC_MONETARY",            tok_lc_monetary,            0},
-#line 181 "locfile-kw.gperf"
+#line 182 "locfile-kw.gperf"
       {"LC_TELEPHONE",           tok_lc_telephone,           0},
       {""}, {""}, {""},
-#line 75 "locfile-kw.gperf"
+#line 76 "locfile-kw.gperf"
       {"define",                 tok_define,                 0},
-#line 154 "locfile-kw.gperf"
+#line 155 "locfile-kw.gperf"
       {"yesexpr",                tok_yesexpr,                0},
-#line 141 "locfile-kw.gperf"
+#line 142 "locfile-kw.gperf"
       {"era_year",               tok_era_year,               0},
       {""},
 #line 54 "locfile-kw.gperf"
       {"translit_ignore",        tok_translit_ignore,        0},
-#line 156 "locfile-kw.gperf"
+#line 157 "locfile-kw.gperf"
       {"yesstr",                 tok_yesstr,                 0},
       {""},
-#line 89 "locfile-kw.gperf"
+#line 90 "locfile-kw.gperf"
       {"negative_sign",          tok_negative_sign,          0},
       {""},
-#line 137 "locfile-kw.gperf"
+#line 138 "locfile-kw.gperf"
       {"t_fmt",                  tok_t_fmt,                  0},
-#line 159 "locfile-kw.gperf"
+#line 160 "locfile-kw.gperf"
       {"height",                 tok_height,                 0},
       {""}, {""},
 #line 52 "locfile-kw.gperf"
       {"translit_start",         tok_translit_start,         0},
-#line 136 "locfile-kw.gperf"
+#line 137 "locfile-kw.gperf"
       {"d_fmt",                  tok_d_fmt,                  0},
       {""},
 #line 53 "locfile-kw.gperf"
       {"translit_end",           tok_translit_end,           0},
-#line 94 "locfile-kw.gperf"
+#line 95 "locfile-kw.gperf"
       {"n_cs_precedes",          tok_n_cs_precedes,          0},
-#line 144 "locfile-kw.gperf"
+#line 145 "locfile-kw.gperf"
       {"era_t_fmt",              tok_era_t_fmt,              0},
 #line 39 "locfile-kw.gperf"
       {"space",                  tok_space,                  0},
-#line 72 "locfile-kw.gperf"
-      {"reorder-end",            tok_reorder_end,            0},
 #line 73 "locfile-kw.gperf"
+      {"reorder-end",            tok_reorder_end,            0},
+#line 74 "locfile-kw.gperf"
       {"reorder-sections-after", tok_reorder_sections_after, 0},
       {""},
-#line 142 "locfile-kw.gperf"
+#line 143 "locfile-kw.gperf"
       {"era_d_fmt",              tok_era_d_fmt,              0},
-#line 189 "locfile-kw.gperf"
+#line 190 "locfile-kw.gperf"
       {"title",                  tok_title,                  0},
       {""}, {""},
-#line 149 "locfile-kw.gperf"
+#line 150 "locfile-kw.gperf"
       {"timezone",               tok_timezone,               0},
       {""},
-#line 74 "locfile-kw.gperf"
+#line 75 "locfile-kw.gperf"
       {"reorder-sections-end",   tok_reorder_sections_end,   0},
       {""}, {""}, {""},
-#line 95 "locfile-kw.gperf"
+#line 96 "locfile-kw.gperf"
       {"n_sep_by_space",         tok_n_sep_by_space,         0},
       {""}, {""},
-#line 100 "locfile-kw.gperf"
+#line 101 "locfile-kw.gperf"
       {"int_n_cs_precedes",      tok_int_n_cs_precedes,      0},
       {""}, {""}, {""},
 #line 26 "locfile-kw.gperf"
@@ -232,147 +232,147 @@ locfile_hash (register const char *str, register size_t len)
       {"print",                  tok_print,                  0},
 #line 44 "locfile-kw.gperf"
       {"xdigit",                 tok_xdigit,                 0},
-#line 110 "locfile-kw.gperf"
+#line 111 "locfile-kw.gperf"
       {"duo_n_cs_precedes",      tok_duo_n_cs_precedes,      0},
-#line 127 "locfile-kw.gperf"
+#line 128 "locfile-kw.gperf"
       {"thousands_sep",          tok_thousands_sep,          0},
-#line 197 "locfile-kw.gperf"
+#line 198 "locfile-kw.gperf"
       {"territory",              tok_territory,              0},
 #line 36 "locfile-kw.gperf"
       {"digit",                  tok_digit,                  0},
       {""}, {""},
-#line 92 "locfile-kw.gperf"
+#line 93 "locfile-kw.gperf"
       {"p_cs_precedes",          tok_p_cs_precedes,          0},
       {""}, {""},
-#line 62 "locfile-kw.gperf"
+#line 63 "locfile-kw.gperf"
       {"script",                 tok_script,                 0},
 #line 29 "locfile-kw.gperf"
       {"include",                tok_include,                0},
       {""},
-#line 78 "locfile-kw.gperf"
+#line 79 "locfile-kw.gperf"
       {"else",                   tok_else,                   0},
-#line 184 "locfile-kw.gperf"
+#line 185 "locfile-kw.gperf"
       {"int_select",             tok_int_select,             0},
       {""}, {""}, {""},
-#line 132 "locfile-kw.gperf"
+#line 133 "locfile-kw.gperf"
       {"week",                   tok_week,                   0},
 #line 33 "locfile-kw.gperf"
       {"upper",                  tok_upper,                  0},
       {""}, {""},
-#line 194 "locfile-kw.gperf"
+#line 195 "locfile-kw.gperf"
       {"tel",                    tok_tel,                    0},
-#line 93 "locfile-kw.gperf"
+#line 94 "locfile-kw.gperf"
       {"p_sep_by_space",         tok_p_sep_by_space,         0},
-#line 160 "locfile-kw.gperf"
+#line 161 "locfile-kw.gperf"
       {"width",                  tok_width,                  0},
       {""},
-#line 98 "locfile-kw.gperf"
+#line 99 "locfile-kw.gperf"
       {"int_p_cs_precedes",      tok_int_p_cs_precedes,      0},
       {""}, {""},
 #line 41 "locfile-kw.gperf"
       {"punct",                  tok_punct,                  0},
       {""}, {""},
-#line 101 "locfile-kw.gperf"
+#line 102 "locfile-kw.gperf"
       {"int_n_sep_by_space",     tok_int_n_sep_by_space,     0},
       {""}, {""}, {""},
-#line 108 "locfile-kw.gperf"
+#line 109 "locfile-kw.gperf"
       {"duo_p_cs_precedes",      tok_duo_p_cs_precedes,      0},
 #line 48 "locfile-kw.gperf"
       {"charconv",               tok_charconv,               0},
       {""},
 #line 47 "locfile-kw.gperf"
       {"class",                  tok_class,                  0},
-#line 114 "locfile-kw.gperf"
-      {"duo_int_n_cs_precedes",  tok_duo_int_n_cs_precedes,  0},
 #line 115 "locfile-kw.gperf"
+      {"duo_int_n_cs_precedes",  tok_duo_int_n_cs_precedes,  0},
+#line 116 "locfile-kw.gperf"
       {"duo_int_n_sep_by_space", tok_duo_int_n_sep_by_space, 0},
-#line 111 "locfile-kw.gperf"
+#line 112 "locfile-kw.gperf"
       {"duo_n_sep_by_space",     tok_duo_n_sep_by_space,     0},
-#line 119 "locfile-kw.gperf"
+#line 120 "locfile-kw.gperf"
       {"duo_int_n_sign_posn",    tok_duo_int_n_sign_posn,    0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""},
-#line 58 "locfile-kw.gperf"
+#line 59 "locfile-kw.gperf"
       {"section-symbol",         tok_section_symbol,         0},
-#line 185 "locfile-kw.gperf"
+#line 186 "locfile-kw.gperf"
       {"int_prefix",             tok_int_prefix,             0},
       {""}, {""}, {""}, {""},
 #line 42 "locfile-kw.gperf"
       {"graph",                  tok_graph,                  0},
       {""}, {""},
-#line 99 "locfile-kw.gperf"
+#line 100 "locfile-kw.gperf"
       {"int_p_sep_by_space",     tok_int_p_sep_by_space,     0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 112 "locfile-kw.gperf"
-      {"duo_int_p_cs_precedes",  tok_duo_int_p_cs_precedes,  0},
 #line 113 "locfile-kw.gperf"
+      {"duo_int_p_cs_precedes",  tok_duo_int_p_cs_precedes,  0},
+#line 114 "locfile-kw.gperf"
       {"duo_int_p_sep_by_space", tok_duo_int_p_sep_by_space, 0},
-#line 109 "locfile-kw.gperf"
+#line 110 "locfile-kw.gperf"
       {"duo_p_sep_by_space",     tok_duo_p_sep_by_space,     0},
-#line 118 "locfile-kw.gperf"
+#line 119 "locfile-kw.gperf"
       {"duo_int_p_sign_posn",    tok_duo_int_p_sign_posn,    0},
-#line 157 "locfile-kw.gperf"
+#line 158 "locfile-kw.gperf"
       {"nostr",                  tok_nostr,                  0},
       {""}, {""},
-#line 140 "locfile-kw.gperf"
+#line 141 "locfile-kw.gperf"
       {"era",                    tok_era,                    0},
       {""},
-#line 84 "locfile-kw.gperf"
+#line 85 "locfile-kw.gperf"
       {"currency_symbol",        tok_currency_symbol,        0},
       {""},
-#line 167 "locfile-kw.gperf"
+#line 168 "locfile-kw.gperf"
       {"name_ms",                tok_name_ms,                0},
-#line 165 "locfile-kw.gperf"
-      {"name_mrs",               tok_name_mrs,               0},
 #line 166 "locfile-kw.gperf"
+      {"name_mrs",               tok_name_mrs,               0},
+#line 167 "locfile-kw.gperf"
       {"name_miss",              tok_name_miss,              0},
-#line 83 "locfile-kw.gperf"
+#line 84 "locfile-kw.gperf"
       {"int_curr_symbol",        tok_int_curr_symbol,        0},
-#line 190 "locfile-kw.gperf"
+#line 191 "locfile-kw.gperf"
       {"source",                 tok_source,                 0},
-#line 164 "locfile-kw.gperf"
+#line 165 "locfile-kw.gperf"
       {"name_mr",                tok_name_mr,                0},
-#line 163 "locfile-kw.gperf"
+#line 164 "locfile-kw.gperf"
       {"name_gen",               tok_name_gen,               0},
-#line 202 "locfile-kw.gperf"
+#line 203 "locfile-kw.gperf"
       {"date",                   tok_date,                   0},
       {""}, {""},
-#line 191 "locfile-kw.gperf"
+#line 192 "locfile-kw.gperf"
       {"address",                tok_address,                0},
-#line 162 "locfile-kw.gperf"
+#line 163 "locfile-kw.gperf"
       {"name_fmt",               tok_name_fmt,               0},
 #line 32 "locfile-kw.gperf"
       {"copy",                   tok_copy,                   0},
-#line 103 "locfile-kw.gperf"
+#line 104 "locfile-kw.gperf"
       {"int_n_sign_posn",        tok_int_n_sign_posn,        0},
       {""}, {""},
-#line 131 "locfile-kw.gperf"
+#line 132 "locfile-kw.gperf"
       {"day",                    tok_day,                    0},
-#line 105 "locfile-kw.gperf"
+#line 106 "locfile-kw.gperf"
       {"duo_currency_symbol",    tok_duo_currency_symbol,    0},
       {""}, {""}, {""},
-#line 150 "locfile-kw.gperf"
+#line 151 "locfile-kw.gperf"
       {"date_fmt",               tok_date_fmt,               0},
-#line 64 "locfile-kw.gperf"
+#line 65 "locfile-kw.gperf"
       {"order_end",              tok_order_end,              0},
-#line 117 "locfile-kw.gperf"
+#line 118 "locfile-kw.gperf"
       {"duo_n_sign_posn",        tok_duo_n_sign_posn,        0},
       {""},
-#line 170 "locfile-kw.gperf"
+#line 171 "locfile-kw.gperf"
       {"country_name",           tok_country_name,           0},
-#line 71 "locfile-kw.gperf"
+#line 72 "locfile-kw.gperf"
       {"reorder-after",          tok_reorder_after,          0},
       {""}, {""},
-#line 155 "locfile-kw.gperf"
+#line 156 "locfile-kw.gperf"
       {"noexpr",                 tok_noexpr,                 0},
 #line 50 "locfile-kw.gperf"
       {"tolower",                tok_tolower,                0},
-#line 198 "locfile-kw.gperf"
+#line 199 "locfile-kw.gperf"
       {"audience",               tok_audience,               0},
       {""}, {""}, {""},
 #line 49 "locfile-kw.gperf"
       {"toupper",                tok_toupper,                0},
-#line 68 "locfile-kw.gperf"
+#line 69 "locfile-kw.gperf"
       {"position",               tok_position,               0},
       {""},
 #line 40 "locfile-kw.gperf"
@@ -380,196 +380,197 @@ locfile_hash (register const char *str, register size_t len)
       {""},
 #line 27 "locfile-kw.gperf"
       {"comment_char",           tok_comment_char,           0},
-#line 88 "locfile-kw.gperf"
+#line 89 "locfile-kw.gperf"
       {"positive_sign",          tok_positive_sign,          0},
       {""}, {""}, {""}, {""},
-#line 61 "locfile-kw.gperf"
+#line 62 "locfile-kw.gperf"
       {"symbol-equivalence",     tok_symbol_equivalence,     0},
       {""},
-#line 102 "locfile-kw.gperf"
+#line 103 "locfile-kw.gperf"
       {"int_p_sign_posn",        tok_int_p_sign_posn,        0},
-#line 175 "locfile-kw.gperf"
+#line 176 "locfile-kw.gperf"
       {"country_car",            tok_country_car,            0},
       {""}, {""},
-#line 104 "locfile-kw.gperf"
+#line 105 "locfile-kw.gperf"
       {"duo_int_curr_symbol",    tok_duo_int_curr_symbol,    0},
       {""}, {""},
-#line 135 "locfile-kw.gperf"
+#line 136 "locfile-kw.gperf"
       {"d_t_fmt",                tok_d_t_fmt,                0},
       {""}, {""},
-#line 116 "locfile-kw.gperf"
+#line 117 "locfile-kw.gperf"
       {"duo_p_sign_posn",        tok_duo_p_sign_posn,        0},
-#line 187 "locfile-kw.gperf"
+#line 188 "locfile-kw.gperf"
       {"measurement",            tok_measurement,            0},
-#line 176 "locfile-kw.gperf"
+#line 177 "locfile-kw.gperf"
       {"country_isbn",           tok_country_isbn,           0},
 #line 37 "locfile-kw.gperf"
       {"outdigit",               tok_outdigit,               0},
       {""}, {""},
-#line 143 "locfile-kw.gperf"
+#line 144 "locfile-kw.gperf"
       {"era_d_t_fmt",            tok_era_d_t_fmt,            0},
       {""}, {""}, {""},
 #line 34 "locfile-kw.gperf"
       {"lower",                  tok_lower,                  0},
-#line 183 "locfile-kw.gperf"
+#line 184 "locfile-kw.gperf"
       {"tel_dom_fmt",            tok_tel_dom_fmt,            0},
-#line 171 "locfile-kw.gperf"
+#line 172 "locfile-kw.gperf"
       {"country_post",           tok_country_post,           0},
-#line 148 "locfile-kw.gperf"
+#line 149 "locfile-kw.gperf"
       {"cal_direction",          tok_cal_direction,          0},
-      {""},
-#line 139 "locfile-kw.gperf"
+#line 57 "locfile-kw.gperf"
+      {"codepoint_collation",    tok_codepoint_collation,    0},
+#line 140 "locfile-kw.gperf"
       {"t_fmt_ampm",             tok_t_fmt_ampm,             0},
-#line 91 "locfile-kw.gperf"
+#line 92 "locfile-kw.gperf"
       {"frac_digits",            tok_frac_digits,            0},
       {""}, {""},
-#line 177 "locfile-kw.gperf"
+#line 178 "locfile-kw.gperf"
       {"lang_name",              tok_lang_name,              0},
-#line 90 "locfile-kw.gperf"
+#line 91 "locfile-kw.gperf"
       {"int_frac_digits",        tok_int_frac_digits,        0},
       {""},
-#line 121 "locfile-kw.gperf"
+#line 122 "locfile-kw.gperf"
       {"uno_valid_to",           tok_uno_valid_to,           0},
-#line 126 "locfile-kw.gperf"
+#line 127 "locfile-kw.gperf"
       {"decimal_point",          tok_decimal_point,          0},
       {""},
-#line 133 "locfile-kw.gperf"
+#line 134 "locfile-kw.gperf"
       {"abmon",                  tok_abmon,                  0},
       {""}, {""}, {""}, {""},
-#line 107 "locfile-kw.gperf"
+#line 108 "locfile-kw.gperf"
       {"duo_frac_digits",        tok_duo_frac_digits,        0},
-#line 182 "locfile-kw.gperf"
+#line 183 "locfile-kw.gperf"
       {"tel_int_fmt",            tok_tel_int_fmt,            0},
-#line 123 "locfile-kw.gperf"
+#line 124 "locfile-kw.gperf"
       {"duo_valid_to",           tok_duo_valid_to,           0},
-#line 146 "locfile-kw.gperf"
+#line 147 "locfile-kw.gperf"
       {"first_weekday",          tok_first_weekday,          0},
       {""},
-#line 130 "locfile-kw.gperf"
+#line 131 "locfile-kw.gperf"
       {"abday",                  tok_abday,                  0},
       {""},
-#line 200 "locfile-kw.gperf"
+#line 201 "locfile-kw.gperf"
       {"abbreviation",           tok_abbreviation,           0},
-#line 147 "locfile-kw.gperf"
+#line 148 "locfile-kw.gperf"
       {"first_workday",          tok_first_workday,          0},
       {""}, {""},
-#line 97 "locfile-kw.gperf"
+#line 98 "locfile-kw.gperf"
       {"n_sign_posn",            tok_n_sign_posn,            0},
       {""}, {""}, {""},
-#line 145 "locfile-kw.gperf"
+#line 146 "locfile-kw.gperf"
       {"alt_digits",             tok_alt_digits,             0},
       {""}, {""},
-#line 128 "locfile-kw.gperf"
+#line 129 "locfile-kw.gperf"
       {"grouping",               tok_grouping,               0},
       {""},
 #line 45 "locfile-kw.gperf"
       {"blank",                  tok_blank,                  0},
       {""}, {""},
-#line 196 "locfile-kw.gperf"
+#line 197 "locfile-kw.gperf"
       {"language",               tok_language,               0},
-#line 120 "locfile-kw.gperf"
+#line 121 "locfile-kw.gperf"
       {"uno_valid_from",         tok_uno_valid_from,         0},
       {""},
-#line 199 "locfile-kw.gperf"
+#line 200 "locfile-kw.gperf"
       {"application",            tok_application,            0},
       {""},
-#line 80 "locfile-kw.gperf"
+#line 81 "locfile-kw.gperf"
       {"elifndef",               tok_elifndef,               0},
       {""}, {""}, {""}, {""}, {""},
-#line 122 "locfile-kw.gperf"
+#line 123 "locfile-kw.gperf"
       {"duo_valid_from",         tok_duo_valid_from,         0},
-#line 57 "locfile-kw.gperf"
+#line 58 "locfile-kw.gperf"
       {"coll_weight_max",        tok_coll_weight_max,        0},
       {""},
-#line 79 "locfile-kw.gperf"
+#line 80 "locfile-kw.gperf"
       {"elifdef",                tok_elifdef,                0},
-#line 67 "locfile-kw.gperf"
+#line 68 "locfile-kw.gperf"
       {"backward",               tok_backward,               0},
-#line 106 "locfile-kw.gperf"
+#line 107 "locfile-kw.gperf"
       {"duo_int_frac_digits",    tok_duo_int_frac_digits,    0},
       {""}, {""}, {""}, {""}, {""}, {""},
-#line 96 "locfile-kw.gperf"
+#line 97 "locfile-kw.gperf"
       {"p_sign_posn",            tok_p_sign_posn,            0},
       {""},
-#line 203 "locfile-kw.gperf"
+#line 204 "locfile-kw.gperf"
       {"category",               tok_category,               0},
       {""}, {""}, {""}, {""},
-#line 134 "locfile-kw.gperf"
+#line 135 "locfile-kw.gperf"
       {"mon",                    tok_mon,                    0},
       {""},
-#line 124 "locfile-kw.gperf"
+#line 125 "locfile-kw.gperf"
       {"conversion_rate",        tok_conversion_rate,        0},
       {""}, {""}, {""}, {""}, {""},
-#line 63 "locfile-kw.gperf"
+#line 64 "locfile-kw.gperf"
       {"order_start",            tok_order_start,            0},
       {""}, {""}, {""}, {""}, {""},
-#line 178 "locfile-kw.gperf"
+#line 179 "locfile-kw.gperf"
       {"lang_ab",                tok_lang_ab,                0},
-#line 180 "locfile-kw.gperf"
+#line 181 "locfile-kw.gperf"
       {"lang_lib",               tok_lang_lib,               0},
       {""}, {""}, {""},
-#line 192 "locfile-kw.gperf"
+#line 193 "locfile-kw.gperf"
       {"contact",                tok_contact,                0},
       {""}, {""}, {""},
-#line 173 "locfile-kw.gperf"
+#line 174 "locfile-kw.gperf"
       {"country_ab3",            tok_country_ab3,            0},
       {""}, {""}, {""},
-#line 193 "locfile-kw.gperf"
+#line 194 "locfile-kw.gperf"
       {"email",                  tok_email,                  0},
-#line 172 "locfile-kw.gperf"
+#line 173 "locfile-kw.gperf"
       {"country_ab2",            tok_country_ab2,            0},
       {""}, {""}, {""},
 #line 55 "locfile-kw.gperf"
       {"default_missing",        tok_default_missing,        0},
       {""}, {""},
-#line 195 "locfile-kw.gperf"
+#line 196 "locfile-kw.gperf"
       {"fax",                    tok_fax,                    0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 174 "locfile-kw.gperf"
+#line 175 "locfile-kw.gperf"
       {"country_num",            tok_country_num,            0},
       {""}, {""}, {""}, {""}, {""}, {""},
 #line 51 "locfile-kw.gperf"
       {"map",                    tok_map,                    0},
-#line 65 "locfile-kw.gperf"
+#line 66 "locfile-kw.gperf"
       {"from",                   tok_from,                   0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 86 "locfile-kw.gperf"
+#line 87 "locfile-kw.gperf"
       {"mon_thousands_sep",      tok_mon_thousands_sep,      0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""},
-#line 81 "locfile-kw.gperf"
+#line 82 "locfile-kw.gperf"
       {"endif",                  tok_endif,                  0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 151 "locfile-kw.gperf"
+#line 152 "locfile-kw.gperf"
       {"alt_mon",                tok_alt_mon,                0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 76 "locfile-kw.gperf"
+#line 77 "locfile-kw.gperf"
       {"undef",                  tok_undef,                  0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 59 "locfile-kw.gperf"
+#line 60 "locfile-kw.gperf"
       {"collating-element",      tok_collating_element,      0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 152 "locfile-kw.gperf"
+#line 153 "locfile-kw.gperf"
       {"ab_alt_mon",             tok_ab_alt_mon,             0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 66 "locfile-kw.gperf"
+#line 67 "locfile-kw.gperf"
       {"forward",                tok_forward,                0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""},
-#line 85 "locfile-kw.gperf"
+#line 86 "locfile-kw.gperf"
       {"mon_decimal_point",      tok_mon_decimal_point,      0},
       {""}, {""},
-#line 169 "locfile-kw.gperf"
+#line 170 "locfile-kw.gperf"
       {"postal_fmt",             tok_postal_fmt,             0},
       {""}, {""}, {""}, {""}, {""},
-#line 60 "locfile-kw.gperf"
+#line 61 "locfile-kw.gperf"
       {"collating-symbol",       tok_collating_symbol,       0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
@@ -582,15 +583,15 @@ locfile_hash (register const char *str, register size_t len)
 #line 38 "locfile-kw.gperf"
       {"alnum",                  tok_alnum,                  0},
       {""},
-#line 87 "locfile-kw.gperf"
+#line 88 "locfile-kw.gperf"
       {"mon_grouping",           tok_mon_grouping,           0},
       {""},
-#line 179 "locfile-kw.gperf"
+#line 180 "locfile-kw.gperf"
       {"lang_term",              tok_lang_term,              0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 77 "locfile-kw.gperf"
+#line 78 "locfile-kw.gperf"
       {"ifdef",                  tok_ifdef,                  0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
@@ -598,7 +599,7 @@ locfile_hash (register const char *str, register size_t len)
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""},
-#line 138 "locfile-kw.gperf"
+#line 139 "locfile-kw.gperf"
       {"am_pm",                  tok_am_pm,                  0}
     };
 
diff --git a/locale/programs/locfile-token.h b/locale/programs/locfile-token.h
index abeff8a09e..0bf771c752 100644
--- a/locale/programs/locfile-token.h
+++ b/locale/programs/locfile-token.h
@@ -90,6 +90,7 @@ enum token_t
   tok_translit_ignore,
   tok_default_missing,
   tok_lc_collate,
+  tok_codepoint_collation,
   tok_coll_weight_max,
   tok_section_symbol,
   tok_collating_element,
-- 
2.31.1



* [PATCH v12 2/2] Add generic C.UTF-8 locale (Bug 17318)
  2021-09-06 15:43 [PATCH v12 0/2] C.UTF-8 Carlos O'Donell via Libc-alpha
  2021-09-06 15:43 ` [PATCH v12 1/2] Add 'codepoint_collation' support for LC_COLLATE Carlos O'Donell via Libc-alpha
@ 2021-09-06 15:43 ` Carlos O'Donell via Libc-alpha
  2022-01-26  2:44   ` Michael Hudson-Doyle via Libc-alpha
  1 sibling, 1 reply; 12+ messages in thread
From: Carlos O'Donell via Libc-alpha @ 2021-09-06 15:43 UTC (permalink / raw)
  To: libc-alpha; +Cc: Florian Weimer

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 49274 bytes --]

We add a new C.UTF-8 locale. This locale is not built into glibc, but
is provided as a distinct locale. The locale provides full support for
UTF-8 and this includes full code point sorting via STRCMP-based
collation (strcmp or wcscmp).

The collation uses a new keyword 'codepoint_collation' which drops all
collation rules and generates an empty, zero-rule collation to enable
STRCMP usage in collation. This ensures that we get full code point
sorting for C.UTF-8 with a minimal 1406 bytes of overhead (LC_COLLATE
structure information and ASCII collating tables).
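
As a rough illustration of what STRCMP-based collation means for a
caller (this sketch is not part of the patch; the helper names sign_of,
codepoint_order, and collation_order are invented for the example), a
locale built with 'codepoint_collation' should order two strings the
same way strcmp does. The sketch assumes the inputs are valid UTF-8.

```c
#include <locale.h>
#include <string.h>

/* Reduce a comparison result to -1, 0, or 1 so that results from
   different comparison functions can be compared directly.  */
static int
sign_of (int v)
{
  return (v > 0) - (v < 0);
}

/* Byte-wise comparison.  strcmp compares bytes as unsigned char, and
   valid UTF-8 keeps code point order under such a comparison.  */
static int
codepoint_order (const char *a, const char *b)
{
  return sign_of (strcmp (a, b));
}

/* Locale-aware comparison.  Assumption: if C.UTF-8 is not installed,
   setlocale returns NULL and the locale already in effect (the "C"
   locale at program start) is used instead.  */
static int
collation_order (const char *a, const char *b)
{
  setlocale (LC_COLLATE, "C.UTF-8");
  return sign_of (strcoll (a, b));
}
```

With C.UTF-8 installed, collation_order and codepoint_order are expected
to agree for any pair of valid UTF-8 strings; this is the property the
collation tests rely on.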

The new locale is added to SUPPORTED. Minimal test data for specific
code points (minus those not supported by collate-test) is provided in
C.UTF-8.in, and this verifies code point sorting is working reasonably
across the range. The locale was tested manually with the full set of
code points without failure.
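
The property that C.UTF-8.in exercises can be sketched as follows. This
is only an illustration, not the collate-test implementation: the
helpers decode_utf8 and is_codepoint_sorted are hypothetical, and the
decoder is deliberately minimal (it does not reject overlong encodings
or surrogates).

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the UTF-8 sequence starting at S into *CP.  Return the number
   of bytes consumed, or 0 if the lead or continuation bytes are
   malformed.  Overlong forms and surrogates are not rejected.  */
static size_t
decode_utf8 (const unsigned char *s, uint32_t *cp)
{
  size_t len;

  if (s[0] < 0x80)
    {
      *cp = s[0];
      return 1;
    }
  else if ((s[0] & 0xe0) == 0xc0)
    {
      *cp = s[0] & 0x1f;
      len = 2;
    }
  else if ((s[0] & 0xf0) == 0xe0)
    {
      *cp = s[0] & 0x0f;
      len = 3;
    }
  else if ((s[0] & 0xf8) == 0xf0)
    {
      *cp = s[0] & 0x07;
      len = 4;
    }
  else
    return 0;

  for (size_t i = 1; i < len; i++)
    {
      /* A NUL terminator fails this check, so we never read past it.  */
      if ((s[i] & 0xc0) != 0x80)
        return 0;
      *cp = (*cp << 6) | (s[i] & 0x3f);
    }
  return len;
}

/* Return 1 if the first code point of each of the N strings in ITEMS
   is strictly greater than that of the previous string, 0 otherwise.  */
static int
is_codepoint_sorted (const char *const *items, size_t n)
{
  uint32_t prev = 0;
  uint32_t cur;

  for (size_t i = 0; i < n; i++)
    {
      if (decode_utf8 ((const unsigned char *) items[i], &cur) == 0)
        return 0;
      if (i > 0 && cur <= prev)
        return 0;
      prev = cur;
    }
  return 1;
}
```

Each line of C.UTF-8.in carries one character, so a check of this shape
over the file's entries would confirm that the expected order is the
code point order.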

The locale is harmonized with locales already shipping in various
downstream distributions. A new tst-iconv9 test is added which verifies
the C.UTF-8 locale is generally usable.

Testing for fnmatch, regexec, and regcomp is provided by extending
bug-regex1, bug-regex19, bug-regex4, bug-regex6, transbug, tst-fnmatch,
tst-regcomp-truncated, and tst-regex to use C.UTF-8.
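
For readers who want to try an ASCII range expression by hand, a minimal
sketch follows. It is not taken from the test suite; the helper name
starts_with_ascii_lower is made up, and the code assumes that a failed
setlocale call simply leaves the current locale in place.

```c
#include <fnmatch.h>
#include <locale.h>

/* Return 1 if S begins with a lowercase ASCII letter, 0 otherwise.
   The pattern uses an ASCII range expression, which is the kind of
   range the NEWS entry says is supported for C.UTF-8.  */
static int
starts_with_ascii_lower (const char *s)
{
  /* Assumption: if C.UTF-8 is not installed this call fails and the
     previous locale stays active.  */
  setlocale (LC_ALL, "C.UTF-8");
  return fnmatch ("[a-z]*", s, 0) == 0;
}
```

Ranges that extend beyond ASCII are outside what this sketch covers; see
bug 28255 for that limitation.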

Tested on x86_64 and i686 without regression.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
---
 NEWS                          |  10 +-
 iconv/Makefile                |  22 +-
 iconv/tst-iconv9.c            |  87 ++++++
 localedata/C.UTF-8.in         | 157 ++++++++++
 localedata/Makefile           |   2 +
 localedata/SUPPORTED          |   1 +
 localedata/locales/C          | 194 ++++++++++++
 posix/Makefile                |  16 +-
 posix/bug-regex1.c            |  20 ++
 posix/bug-regex19.c           |  22 +-
 posix/bug-regex4.c            |  25 ++
 posix/bug-regex6.c            |   2 +-
 posix/transbug.c              |  24 +-
 posix/tst-fnmatch.input       | 549 +++++++++++++++++++++++++++++++++-
 posix/tst-regcomp-truncated.c |   1 +
 posix/tst-regex.c             |  33 +-
 16 files changed, 1131 insertions(+), 34 deletions(-)
 create mode 100644 iconv/tst-iconv9.c
 create mode 100644 localedata/C.UTF-8.in
 create mode 100644 localedata/locales/C

diff --git a/NEWS b/NEWS
index 79c895e382..5b014fabbf 100644
--- a/NEWS
+++ b/NEWS
@@ -9,7 +9,15 @@ Version 2.35
 
 Major new features:
 
-  [Add new features here]
+* Support for the C.UTF-8 locale has been added to glibc.  The locale
+  supports full code-point sorting for all valid Unicode code points.  A
+  limitation in the framework for fnmatch, regexec, and regcomp requires
+  a compromise to save space and only ASCII-based range expressions are
+  supported for now (see bug 28255).  The full size of the locale is
+  only ~400KiB, with 346KiB coming from LC_CTYPE information for
+  Unicode.  This locale harmonizes with the C.UTF-8 locales already
+  shipping in various downstream distributions.  The locale is not built
+  into glibc, and must be installed.
 
 Deprecated and removed features, and other changes affecting compatibility:
 
diff --git a/iconv/Makefile b/iconv/Makefile
index 07d77c9eca..9993f2d3f3 100644
--- a/iconv/Makefile
+++ b/iconv/Makefile
@@ -43,8 +43,19 @@ CFLAGS-charmap.c += -DCHARMAP_PATH='"$(i18ndir)/charmaps"' \
 CFLAGS-linereader.c += -DNO_TRANSLITERATION
 CFLAGS-simple-hash.c += -I../locale
 
-tests	= tst-iconv1 tst-iconv2 tst-iconv3 tst-iconv4 tst-iconv5 tst-iconv6 \
-	  tst-iconv7 tst-iconv8 tst-iconv-mt tst-iconv-opt
+tests = \
+	tst-iconv1 \
+	tst-iconv2 \
+	tst-iconv3 \
+	tst-iconv4 \
+	tst-iconv5 \
+	tst-iconv6 \
+	tst-iconv7 \
+	tst-iconv8 \
+	tst-iconv9 \
+	tst-iconv-mt \
+	tst-iconv-opt \
+	# tests
 
 others		= iconv_prog iconvconfig
 install-others-programs	= $(inst_bindir)/iconv
@@ -83,10 +94,15 @@ endif
 include ../Rules
 
 ifeq ($(run-built-tests),yes)
-LOCALES := en_US.UTF-8
+# We have to generate locales (list sorted alphabetically)
+LOCALES := \
+	C.UTF-8 \
+	en_US.UTF-8 \
+	# LOCALES
 include ../gen-locales.mk
 
 $(objpfx)tst-iconv-opt.out: $(gen-locales)
+$(objpfx)tst-iconv9.out: $(gen-locales)
 endif
 
 $(inst_bindir)/iconv: $(objpfx)iconv_prog $(+force)
diff --git a/iconv/tst-iconv9.c b/iconv/tst-iconv9.c
new file mode 100644
index 0000000000..c46b1833d8
--- /dev/null
+++ b/iconv/tst-iconv9.c
@@ -0,0 +1,87 @@
+/* Verify that using C.UTF-8 works.
+
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <iconv.h>
+#include <stddef.h>
+#include <stdio.h>
+#include <string.h>
+#include <support/support.h>
+#include <support/check.h>
+
+/* This test does two things:
+   (1) Verify that we have likely included translit_combining in C.UTF-8.
+   (2) Verify default_missing is '?' as expected.  */
+
+/* ISO-8859-1 encoding of "für".  */
+char iso88591_in[] = { 0x66, 0xfc, 0x72, 0x0 };
+/* ASCII transliteration is "fur" with C.UTF-8 translit_combining.  */
+char ascii_exp[] = { 0x66, 0x75, 0x72, 0x0 };
+
+/* First 3-byte UTF-8 code point.  */
+char utf8_in[] = { 0xe0, 0xa0, 0x80, 0x0 };
+/* There is no ASCII transliteration for SAMARITAN LETTER ALAF
+   so we get default_missing used which is '?'.  */
+char default_missing_exp[] = { 0x3f, 0x0 };
+
+static int
+do_test (void)
+{
+  char ascii_out[5];
+  iconv_t cd;
+  char *inbuf;
+  char *outbuf;
+  size_t inbytes;
+  size_t outbytes;
+  size_t n;
+
+  /* The C.UTF-8 locale should include translit_combining, which provides
+     the transliteration for "LATIN SMALL LETTER U WITH DIAERESIS" which
+     is not provided by locale/C-translit.h.in.  */
+  xsetlocale (LC_ALL, "C.UTF-8");
+
+  /* From ISO-8859-1 to ASCII.  */
+  cd = iconv_open ("ASCII//TRANSLIT,IGNORE", "ISO-8859-1");
+  TEST_VERIFY (cd != (iconv_t) -1);
+  inbuf = iso88591_in;
+  inbytes = 3;
+  outbuf = ascii_out;
+  outbytes = 3;
+  n = iconv (cd, &inbuf, &inbytes, &outbuf, &outbytes);
+  TEST_VERIFY (n != -1);
+  *outbuf = '\0';
+  TEST_COMPARE_BLOB (ascii_out, 3, ascii_exp, 3);
+  TEST_VERIFY (iconv_close (cd) == 0);
+
+  /* From UTF-8 to ASCII.  */
+  cd = iconv_open ("ASCII//TRANSLIT,IGNORE", "UTF-8");
+  TEST_VERIFY (cd != (iconv_t) -1);
+  inbuf = utf8_in;
+  inbytes = 3;
+  outbuf = ascii_out;
+  outbytes = 3;
+  n = iconv (cd, &inbuf, &inbytes, &outbuf, &outbytes);
+  TEST_VERIFY (n != -1);
+  *outbuf = '\0';
+  TEST_COMPARE_BLOB (ascii_out, 1, default_missing_exp, 1);
+  TEST_VERIFY (iconv_close (cd) == 0);
+
+  return 0;
+}
+
+#include <support/test-driver.c>
diff --git a/localedata/C.UTF-8.in b/localedata/C.UTF-8.in
new file mode 100644
index 0000000000..c31dcc2aa0
--- /dev/null
+++ b/localedata/C.UTF-8.in
@@ -0,0 +1,157 @@
+\x01 ; <U1>
+\x02 ; <U2>
+\x03 ; <U3>
+\x04 ; <U4>
+\x05 ; <U5>
+\x06 ; <U6>
+\a ; <U7>
+\b ; <U8>
+\x0e ; <UE>
+\x0f ; <UF>
+\x10 ; <U10>
+\x11 ; <U11>
+\x12 ; <U12>
+\x13 ; <U13>
+\x14 ; <U14>
+\x15 ; <U15>
+\x16 ; <U16>
+\x17 ; <U17>
+\x18 ; <U18>
+\x19 ; <U19>
+\x1a ; <U1A>
+^[ ; <U1B>
+\x1c ; <U1C>
+\x1d ; <U1D>
+\x1e ; <U1E>
+\x1f ; <U1F>
+! ; <U21>
+" ; <U22>
+# ; <U23>
+$ ; <U24>
+% ; <U25>
+& ; <U26>
+' ; <U27>
+) ; <U29>
+* ; <U2A>
++ ; <U2B>
+, ; <U2C>
+- ; <U2D>
+. ; <U2E>
+/ ; <U2F>
+0 ; <U30>
+1 ; <U31>
+2 ; <U32>
+3 ; <U33>
+4 ; <U34>
+5 ; <U35>
+6 ; <U36>
+7 ; <U37>
+8 ; <U38>
+9 ; <U39>
+< ; <U3C>
+= ; <U3D>
+> ; <U3E>
+? ; <U3F>
+@ ; <U40>
+A ; <U41>
+B ; <U42>
+C ; <U43>
+D ; <U44>
+E ; <U45>
+F ; <U46>
+G ; <U47>
+H ; <U48>
+I ; <U49>
+J ; <U4A>
+K ; <U4B>
+L ; <U4C>
+M ; <U4D>
+N ; <U4E>
+O ; <U4F>
+P ; <U50>
+Q ; <U51>
+R ; <U52>
+S ; <U53>
+T ; <U54>
+U ; <U55>
+V ; <U56>
+W ; <U57>
+X ; <U58>
+Y ; <U59>
+Z ; <U5A>
+[ ; <U5B>
+\ ; <U5C>
+] ; <U5D>
+^ ; <U5E>
+_ ; <U5F>
+` ; <U60>
+a ; <U61>
+b ; <U62>
+c ; <U63>
+d ; <U64>
+e ; <U65>
+f ; <U66>
+g ; <U67>
+h ; <U68>
+i ; <U69>
+j ; <U6A>
+k ; <U6B>
+l ; <U6C>
+m ; <U6D>
+n ; <U6E>
+o ; <U6F>
+p ; <U70>
+q ; <U71>
+r ; <U72>
+s ; <U73>
+t ; <U74>
+u ; <U75>
+v ; <U76>
+w ; <U77>
+x ; <U78>
+y ; <U79>
+z ; <U7A>
+{ ; <U7B>
+| ; <U7C>
+} ; <U7D>
+~ ; <U7E>
+\x7f ; <U7F>
+€ ; <U80>
+ÿ ; <UFF>
+Ā ; <U100>
+à¿¿ ; <UFFF>
+က ; <U1000>
+� ; <UFFFD>
+ï¿¿ ; <UFFFF>
+𐀀 ; <U10000>
+🿿 ; <U1FFFF>
+𠀀 ; <U20000>
+𯿿 ; <U2FFFF>
+𰀀 ; <U30000>
+ð¿¿¾ ; <U3FFFE>
+񀀀 ; <U40000>
+񏿿 ; <U4FFFF>
+񐀀 ; <U50000>
+񟿿 ; <U5FFFF>
+ñ €€ ; <U60000>
+񯿿 ; <U6FFFF>
+ñ°€€ ; <U70000>
+ñ¿¿¿ ; <U7FFFF>
+ò€€€ ; <U80000>
+򏿿 ; <U8FFFF>
+򐀀 ; <U90000>
+òŸ¿¿ ; <U9FFFF>
+ò €€ ; <UA0000>
+򯿿 ; <UAFFFF>
+ò°€€ ; <UB0000>
+ò¿¿¿ ; <UBFFFF>
+󀀁 ; <UC0001>
+󏿌 ; <UCFFCC>
+󐀎 ; <UD000E>
+óŸ¿¿ ; <UDFFFF>
+󠀁 ; <UE0001>
+󯿿 ; <UEFFFF>
+󰀁 ; <UF0001>
+ó¿¿¿ ; <UFFFFF>
+􀀁 ; <U100001>
+􏿿 ; <U10FFFF>
diff --git a/localedata/Makefile b/localedata/Makefile
index f585e0dd41..66a269641b 100644
--- a/localedata/Makefile
+++ b/localedata/Makefile
@@ -47,6 +47,7 @@ test-input := \
 	bg_BG.UTF-8 \
 	br_FR.UTF-8 \
 	bs_BA.UTF-8 \
+	C.UTF-8 \
 	ckb_IQ.UTF-8 \
 	cmn_TW.UTF-8 \
 	crh_UA.UTF-8 \
@@ -206,6 +207,7 @@ LOCALES := \
 	bg_BG.UTF-8 \
 	br_FR.UTF-8 \
 	bs_BA.UTF-8 \
+	C.UTF-8 \
 	ckb_IQ.UTF-8 \
 	cmn_TW.UTF-8 \
 	crh_UA.UTF-8 \
diff --git a/localedata/SUPPORTED b/localedata/SUPPORTED
index 1ee5b5e8c8..d768aa4795 100644
--- a/localedata/SUPPORTED
+++ b/localedata/SUPPORTED
@@ -79,6 +79,7 @@ brx_IN/UTF-8 \
 bs_BA.UTF-8/UTF-8 \
 bs_BA/ISO-8859-2 \
 byn_ER/UTF-8 \
+C.UTF-8/UTF-8 \
 ca_AD.UTF-8/UTF-8 \
 ca_AD/ISO-8859-15 \
 ca_ES.UTF-8/UTF-8 \
diff --git a/localedata/locales/C b/localedata/locales/C
new file mode 100644
index 0000000000..ca801c79cf
--- /dev/null
+++ b/localedata/locales/C
@@ -0,0 +1,194 @@
+escape_char /
+comment_char %
+% Locale for C locale in UTF-8
+
+LC_IDENTIFICATION
+title      "C locale"
+source     ""
+address    ""
+contact    ""
+email      "bug-glibc-locales@gnu.org"
+tel        ""
+fax        ""
+language   ""
+territory  ""
+revision   "2.0"
+date       "2020-06-28"
+category  "i18n:2012";LC_IDENTIFICATION
+category  "i18n:2012";LC_CTYPE
+category  "i18n:2012";LC_COLLATE
+category  "i18n:2012";LC_TIME
+category  "i18n:2012";LC_NUMERIC
+category  "i18n:2012";LC_MONETARY
+category  "i18n:2012";LC_MESSAGES
+category  "i18n:2012";LC_PAPER
+category  "i18n:2012";LC_NAME
+category  "i18n:2012";LC_ADDRESS
+category  "i18n:2012";LC_TELEPHONE
+category  "i18n:2012";LC_MEASUREMENT
+END LC_IDENTIFICATION
+
+LC_CTYPE
+% Include only the i18n character type classes without any of the
+% transliteration that i18n uses by default.
+copy "i18n_ctype"
+
+% Include the neutral transliterations.  The builtin C and
+% POSIX locales have +1600 transliterations that are built into
+% the locales, and these are a superset of those.
+translit_start
+include "translit_neutral";""
+% We must use '?' for default_missing because the transliteration
+% framework includes it directly into the output and so it must
+% be compatible with ASCII if that is the target character set.
+default_missing <U003F>
+translit_end
+
+% Include the transliterations that can convert combined characters.
+% These are generally expected by users.
+translit_start
+include "translit_combining";""
+translit_end
+
+END LC_CTYPE
+
+LC_COLLATE
+% The keyword 'codepoint_collation' in any part of any LC_COLLATE
+% immediately discards all collation information and causes the
+% locale to use strcmp/wcscmp for collation comparison.  This is
+% exactly what is needed for C (ASCII) or C.UTF-8.
+codepoint_collation
+END LC_COLLATE
+
+LC_MONETARY
+
+% This is the 14652 i18n fdcc-set definition for the LC_MONETARY
+% category (except for the int_curr_symbol and currency_symbol, they are
+% empty in the 14652 i18n fdcc-set definition and also empty in
+% glibc/locale/C-monetary.c.).
+int_curr_symbol     ""
+currency_symbol     ""
+mon_decimal_point   "."
+mon_thousands_sep   ""
+mon_grouping        -1
+positive_sign       ""
+negative_sign       "-"
+int_frac_digits     -1
+frac_digits         -1
+p_cs_precedes       -1
+int_p_sep_by_space  -1
+p_sep_by_space      -1
+n_cs_precedes       -1
+int_n_sep_by_space  -1
+n_sep_by_space      -1
+p_sign_posn         -1
+n_sign_posn         -1
+%
+END LC_MONETARY
+
+LC_NUMERIC
+% This is the POSIX Locale definition for
+% the LC_NUMERIC category.
+%
+decimal_point   "."
+thousands_sep   ""
+grouping        -1
+END LC_NUMERIC
+
+LC_TIME
+% This is the POSIX Locale definition for the LC_TIME category with the
+% exception that time is per ISO 8601 and 24-hour.
+%
+% Abbreviated weekday names (%a)
+abday       "Sun";"Mon";"Tue";"Wed";"Thu";"Fri";"Sat"
+
+% Full weekday names (%A)
+day         "Sunday";"Monday";"Tuesday";"Wednesday";"Thursday";/
+            "Friday";"Saturday"
+
+% Abbreviated month names (%b)
+abmon       "Jan";"Feb";"Mar";"Apr";"May";"Jun";"Jul";"Aug";"Sep";/
+            "Oct";"Nov";"Dec"
+
+% Full month names (%B)
+mon         "January";"February";"March";"April";"May";"June";"July";/
+            "August";"September";"October";"November";"December"
+
+% Week description, consists of three fields:
+% 1. Number of days in a week.
+% 2. Gregorian date that is a first weekday (19971130 for Sunday, 19971201 for Monday).
+% 3. The weekday number to be contained in the first week of the year.
+%
+% ISO 8601 conforming applications should use the values 7, 19971201 (a
+% Monday), and 4 (Thursday), respectively.
+week    7;19971201;4
+first_weekday	1
+first_workday	2
+
+% Appropriate date and time representation (%c)
+d_t_fmt "%a %b %e %H:%M:%S %Y"
+
+% Appropriate date representation (%x)
+d_fmt   "%m/%d/%y"
+
+% Appropriate time representation (%X)
+t_fmt   "%H:%M:%S"
+
+% Appropriate AM/PM time representation (%r)
+t_fmt_ampm "%I:%M:%S %p"
+
+% Equivalent of AM/PM (%p)
+am_pm	"AM";"PM"
+
+% Appropriate date representation (date(1))
+date_fmt	"%a %b %e %H:%M:%S %Z %Y"
+END LC_TIME
+
+LC_MESSAGES
+% This is the POSIX Locale definition for
+% the LC_MESSAGES category.
+%
+yesexpr "^[yY]"
+noexpr  "^[nN]"
+yesstr  "Yes"
+nostr   "No"
+END LC_MESSAGES
+
+LC_PAPER
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_PAPER category.
+% (A4 paper, this is also used in the built in C/POSIX
+% locale in glibc/locale/C-paper.c)
+height   297
+width    210
+END LC_PAPER
+
+LC_NAME
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_NAME category.
+% (also used in the built in C/POSIX locale in glibc/locale/C-name.c)
+name_fmt    "%p%t%g%t%m%t%f"
+END LC_NAME
+
+LC_ADDRESS
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_ADDRESS category.
+% (also used in the built in C/POSIX locale in glibc/locale/C-address.c)
+postal_fmt    "%a%N%f%N%d%N%b%N%s %h %e %r%N%C-%z %T%N%c%N"
+END LC_ADDRESS
+
+LC_TELEPHONE
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_TELEPHONE category.
+% "+%c %a %l"
+tel_int_fmt    "+%c %a %l"
+% (also used in the built in C/POSIX locale in glibc/locale/C-telephone.c)
+END LC_TELEPHONE
+
+LC_MEASUREMENT
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_MEASUREMENT category.
+% (same as in the built in C/POSIX locale in glibc/locale/C-measurement.c)
+%metric
+measurement    1
+END LC_MEASUREMENT
diff --git a/posix/Makefile b/posix/Makefile
index 059efb3cd2..a5229777ee 100644
--- a/posix/Makefile
+++ b/posix/Makefile
@@ -190,9 +190,19 @@ $(objpfx)wordexp-tst.out: wordexp-tst.sh $(objpfx)wordexp-test
 	$(evaluate-test)
 endif
 
-LOCALES := cs_CZ.UTF-8 da_DK.ISO-8859-1 de_DE.ISO-8859-1 de_DE.UTF-8 \
-	   en_US.UTF-8 es_US.ISO-8859-1 es_US.UTF-8 ja_JP.EUC-JP tr_TR.UTF-8 \
-	   cs_CZ.ISO-8859-2
+LOCALES := \
+	cs_CZ.ISO-8859-2 \
+	cs_CZ.UTF-8 \
+	C.UTF-8 \
+	da_DK.ISO-8859-1 \
+	de_DE.ISO-8859-1 \
+	de_DE.UTF-8 \
+	en_US.UTF-8 \
+	es_US.ISO-8859-1 \
+	es_US.UTF-8 \
+	ja_JP.EUC-JP \
+	tr_TR.UTF-8 \
+	# LOCALES
 include ../gen-locales.mk
 
 $(objpfx)bug-regex1.out: $(gen-locales)
diff --git a/posix/bug-regex1.c b/posix/bug-regex1.c
index b8cf97c8ce..99357e359e 100644
--- a/posix/bug-regex1.c
+++ b/posix/bug-regex1.c
@@ -40,6 +40,26 @@ main (void)
 	puts (" -> OK");
     }
 
+  puts ("in C.UTF-8 locale");
+  setlocale (LC_ALL, "C.UTF-8");
+  s = re_compile_pattern ("[an\371]*n", 7, &regex);
+  if (s != NULL)
+    {
+      puts ("re_compile_pattern returned non-NULL value");
+      result = 1;
+    }
+  else
+    {
+      match = re_match (&regex, "an", 2, 0, &regs);
+      if (match != 2)
+	{
+	  printf ("re_match returned %d, expected 2\n", match);
+	  result = 1;
+	}
+      else
+	puts (" -> OK");
+    }
+
   puts ("in de_DE.ISO-8859-1 locale");
   setlocale (LC_ALL, "de_DE.ISO-8859-1");
   s = re_compile_pattern ("[an\371]*n", 7, &regex);
diff --git a/posix/bug-regex19.c b/posix/bug-regex19.c
index 001827c3a8..44f6ab606f 100644
--- a/posix/bug-regex19.c
+++ b/posix/bug-regex19.c
@@ -24,6 +24,7 @@
 #include <string.h>
 #include <locale.h>
 #include <libc-diag.h>
+#include <support/support.h>
 
 #define BRE RE_SYNTAX_POSIX_BASIC
 #define ERE RE_SYNTAX_POSIX_EXTENDED
@@ -406,8 +407,8 @@ do_mb_tests (const struct test_s *test)
   return 0;
 }
 
-int
-main (void)
+static int
+do_test (void)
 {
   size_t i;
   int ret = 0;
@@ -416,20 +417,17 @@ main (void)
 
   for (i = 0; i < sizeof (tests) / sizeof (tests[0]); ++i)
     {
-      if (setlocale (LC_ALL, "de_DE.ISO-8859-1") == NULL)
-	{
-	  puts ("setlocale de_DE.ISO-8859-1 failed");
-	  ret = 1;
-	}
+      xsetlocale (LC_ALL, "de_DE.ISO-8859-1");
       ret |= do_one_test (&tests[i], "");
-      if (setlocale (LC_ALL, "de_DE.UTF-8") == NULL)
-	{
-	  puts ("setlocale de_DE.UTF-8 failed");
-	  ret = 1;
-	}
+      xsetlocale (LC_ALL, "de_DE.UTF-8");
+      ret |= do_one_test (&tests[i], "UTF-8 ");
+      ret |= do_mb_tests (&tests[i]);
+      xsetlocale (LC_ALL, "C.UTF-8");
       ret |= do_one_test (&tests[i], "UTF-8 ");
       ret |= do_mb_tests (&tests[i]);
     }
 
   return ret;
 }
+
+#include <support/test-driver.c>
diff --git a/posix/bug-regex4.c b/posix/bug-regex4.c
index 86901ecaa7..3b63d7d1b7 100644
--- a/posix/bug-regex4.c
+++ b/posix/bug-regex4.c
@@ -31,8 +31,33 @@ main (void)
 
   memset (&regex, '\0', sizeof (regex));
 
+  printf ("INFO: Checking C.\n");
   setlocale (LC_ALL, "C");
 
+  s = re_compile_pattern ("ab[cde]", 7, &regex);
+  if (s != NULL)
+    {
+      puts ("re_compile_pattern returned non-NULL value");
+      result = 1;
+    }
+  else
+    {
+      match[0] = re_search_2 (&regex, "xyabez", 6, "", 0, 1, 5, NULL, 6);
+      match[1] = re_search_2 (&regex, NULL, 0, "abc", 3, 0, 3, NULL, 3);
+      match[2] = re_search_2 (&regex, "xya", 3, "bd", 2, 2, 3, NULL, 5);
+      if (match[0] != 2 || match[1] != 0 || match[2] != 2)
+	{
+	  printf ("re_search_2 returned %d,%d,%d, expected 2,0,2\n",
+		  match[0], match[1], match[2]);
+	  result = 1;
+	}
+      else
+	puts (" -> OK");
+    }
+
+  printf ("INFO: Checking C.UTF-8.\n");
+  setlocale (LC_ALL, "C.UTF-8");
+
   s = re_compile_pattern ("ab[cde]", 7, &regex);
   if (s != NULL)
     {
diff --git a/posix/bug-regex6.c b/posix/bug-regex6.c
index 324bd5199d..145f007c3c 100644
--- a/posix/bug-regex6.c
+++ b/posix/bug-regex6.c
@@ -29,7 +29,7 @@ main (int argc, char *argv[])
   regex_t re;
   regmatch_t mat[10];
   int i, j, ret = 0;
-  const char *locales[] = { "C", "de_DE.UTF-8" };
+  const char *locales[] = { "C", "C.UTF-8", "de_DE.UTF-8" };
   const char *string = "http://www.regex.com/pattern/matching.html#intro";
   regmatch_t expect[10] = {
     { 0, 48 }, { 0, 5 }, { 0, 4 }, { 5, 20 }, { 7, 20 }, { 20, 42 },
diff --git a/posix/transbug.c b/posix/transbug.c
index d0983b4d44..b240177cf7 100644
--- a/posix/transbug.c
+++ b/posix/transbug.c
@@ -116,16 +116,32 @@ do_test (void)
   static const char lower[] = "[[:lower:]]+";
   static const char upper[] = "[[:upper:]]+";
   struct re_registers regs[4];
+  int result = 0;
 
+#define CHECK(exp) \
+  if (exp) { puts (#exp); result = 1; }
+
+  printf ("INFO: Checking C.\n");
   setlocale (LC_ALL, "C");
 
   (void) re_set_syntax (RE_SYNTAX_GNU_AWK);
 
-  int result;
-#define CHECK(exp) \
-  if (exp) { puts (#exp); result = 1; }
+  result |= run_test (lower, regs);
+  result |= run_test (upper, &regs[2]);
+  if (! result)
+    {
+      CHECK (regs[0].start[0] != regs[2].start[0]);
+      CHECK (regs[0].end[0] != regs[2].end[0]);
+      CHECK (regs[1].start[0] != regs[3].start[0]);
+      CHECK (regs[1].end[0] != regs[3].end[0]);
+    }
+
+  printf ("INFO: Checking C.UTF-8.\n");
+  setlocale (LC_ALL, "C.UTF-8");
+
+  (void) re_set_syntax (RE_SYNTAX_GNU_AWK);
 
-  result = run_test (lower, regs);
+  result |= run_test (lower, regs);
   result |= run_test (upper, &regs[2]);
   if (! result)
     {
diff --git a/posix/tst-fnmatch.input b/posix/tst-fnmatch.input
index 9d071683dd..837fa2ccaf 100644
--- a/posix/tst-fnmatch.input
+++ b/posix/tst-fnmatch.input
@@ -472,6 +472,397 @@ C		"\\"			"[Z-\\]]"	       0
 C		"]"			"[Z-\\]]"	       0
 C		"-"			"[Z-\\]]"	       NOMATCH
 
+# B.6 004(C)
+C.UTF-8		 "!#%+,-./01234567889"	"!#%+,-./01234567889"  0
+C.UTF-8		 ":;=@ABCDEFGHIJKLMNO"	":;=@ABCDEFGHIJKLMNO"  0
+C.UTF-8		 "PQRSTUVWXYZ]abcdefg"	"PQRSTUVWXYZ]abcdefg"  0
+C.UTF-8		 "hijklmnopqrstuvwxyz"	"hijklmnopqrstuvwxyz"  0
+C.UTF-8		 "^_{}~"		"^_{}~"		       0
+
+# B.6 005(C)
+C.UTF-8		 "\"$&'()"		"\\\"\\$\\&\\'\\(\\)"  0
+C.UTF-8		 "*?[\\`|"		"\\*\\?\\[\\\\\\`\\|"  0
+C.UTF-8		 "<>"			"\\<\\>"	       0
+
+# B.6 006(C)
+C.UTF-8		 "?*["			"[?*[][?*[][?*[]"      0
+C.UTF-8		 "a/b"			"?/b"		       0
+
+# B.6 007(C)
+C.UTF-8		 "a/b"			"a?b"		       0
+C.UTF-8		 "a/b"			"a/?"		       0
+C.UTF-8		 "aa/b"			"?/b"		       NOMATCH
+C.UTF-8		 "aa/b"			"a?b"		       NOMATCH
+C.UTF-8		 "a/bb"			"a/?"		       NOMATCH
+
+# B.6 009(C)
+C.UTF-8		 "abc"			"[abc]"		       NOMATCH
+C.UTF-8		 "x"			"[abc]"		       NOMATCH
+C.UTF-8		 "a"			"[abc]"		       0
+C.UTF-8		 "["			"[[abc]"	       0
+C.UTF-8		 "a"			"[][abc]"	       0
+C.UTF-8		 "a]"			"[]a]]"		       0
+
+# B.6 010(C)
+C.UTF-8		 "xyz"			"[!abc]"	       NOMATCH
+C.UTF-8		 "x"			"[!abc]"	       0
+C.UTF-8		 "a"			"[!abc]"	       NOMATCH
+
+# B.6 011(C)
+C.UTF-8		 "]"			"[][abc]"	       0
+C.UTF-8		 "abc]"			"[][abc]"	       NOMATCH
+C.UTF-8		 "[]abc"		"[][]abc"	       NOMATCH
+C.UTF-8		 "]"			"[!]]"		       NOMATCH
+C.UTF-8		 "aa]"			"[!]a]"		       NOMATCH
+C.UTF-8		 "]"			"[!a]"		       0
+C.UTF-8		 "]]"			"[!a]]"		       0
+
+# B.6 012(C)
+C.UTF-8		 "a"			"[[.a.]]"	       0
+C.UTF-8		 "-"			"[[.-.]]"	       0
+C.UTF-8		 "-"			"[[.-.][.].]]"	       0
+C.UTF-8		 "-"			"[[.].][.-.]]"	       0
+C.UTF-8		 "-"			"[[.-.][=u=]]"	       0
+C.UTF-8		 "-"			"[[.-.][:alpha:]]"     0
+C.UTF-8		 "a"			"[![.a.]]"	       NOMATCH
+
+# B.6 013(C)
+C.UTF-8		 "a"			"[[.b.]]"	       NOMATCH
+C.UTF-8		 "a"			"[[.b.][.c.]]"	       NOMATCH
+C.UTF-8		 "a"			"[[.b.][=b=]]"	       NOMATCH
+
+
+# B.6 015(C)
+C.UTF-8		 "a"			"[[=a=]]"	       0
+C.UTF-8		 "b"			"[[=a=]b]"	       0
+C.UTF-8		 "b"			"[[=a=][=b=]]"	       0
+C.UTF-8		 "a"			"[[=a=][=b=]]"	       0
+C.UTF-8		 "a"			"[[=a=][.b.]]"	       0
+C.UTF-8		 "a"			"[[=a=][:digit:]]"     0
+
+# B.6 016(C)
+C.UTF-8		 "="			"[[=a=]b]"	       NOMATCH
+C.UTF-8		 "]"			"[[=a=]b]"	       NOMATCH
+C.UTF-8		 "a"			"[[=b=][=c=]]"	       NOMATCH
+C.UTF-8		 "a"			"[[=b=][.].]]"	       NOMATCH
+C.UTF-8		 "a"			"[[=b=][:digit:]]"     NOMATCH
+
+# B.6 017(C)
+C.UTF-8		 "a"			"[[:alnum:]]"	       0
+C.UTF-8		 "a"			"[![:alnum:]]"	       NOMATCH
+C.UTF-8		 "-"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "a]a"			"[[:alnum:]]a"	       NOMATCH
+C.UTF-8		 "-"			"[[:alnum:]-]"	       0
+C.UTF-8		 "aa"			"[[:alnum:]]a"	       0
+C.UTF-8		 "-"			"[![:alnum:]]"	       0
+C.UTF-8		 "]"			"[!][:alnum:]]"	       NOMATCH
+C.UTF-8		 "["			"[![:alnum:][]"	       NOMATCH
+C.UTF-8		 "a"			"[[:alnum:]]"	       0
+C.UTF-8		 "b"			"[[:alnum:]]"	       0
+C.UTF-8		 "c"			"[[:alnum:]]"	       0
+C.UTF-8		 "d"			"[[:alnum:]]"	       0
+C.UTF-8		 "e"			"[[:alnum:]]"	       0
+C.UTF-8		 "f"			"[[:alnum:]]"	       0
+C.UTF-8		 "g"			"[[:alnum:]]"	       0
+C.UTF-8		 "h"			"[[:alnum:]]"	       0
+C.UTF-8		 "i"			"[[:alnum:]]"	       0
+C.UTF-8		 "j"			"[[:alnum:]]"	       0
+C.UTF-8		 "k"			"[[:alnum:]]"	       0
+C.UTF-8		 "l"			"[[:alnum:]]"	       0
+C.UTF-8		 "m"			"[[:alnum:]]"	       0
+C.UTF-8		 "n"			"[[:alnum:]]"	       0
+C.UTF-8		 "o"			"[[:alnum:]]"	       0
+C.UTF-8		 "p"			"[[:alnum:]]"	       0
+C.UTF-8		 "q"			"[[:alnum:]]"	       0
+C.UTF-8		 "r"			"[[:alnum:]]"	       0
+C.UTF-8		 "s"			"[[:alnum:]]"	       0
+C.UTF-8		 "t"			"[[:alnum:]]"	       0
+C.UTF-8		 "u"			"[[:alnum:]]"	       0
+C.UTF-8		 "v"			"[[:alnum:]]"	       0
+C.UTF-8		 "w"			"[[:alnum:]]"	       0
+C.UTF-8		 "x"			"[[:alnum:]]"	       0
+C.UTF-8		 "y"			"[[:alnum:]]"	       0
+C.UTF-8		 "z"			"[[:alnum:]]"	       0
+C.UTF-8		 "A"			"[[:alnum:]]"	       0
+C.UTF-8		 "B"			"[[:alnum:]]"	       0
+C.UTF-8		 "C"			"[[:alnum:]]"	       0
+C.UTF-8		 "D"			"[[:alnum:]]"	       0
+C.UTF-8		 "E"			"[[:alnum:]]"	       0
+C.UTF-8		 "F"			"[[:alnum:]]"	       0
+C.UTF-8		 "G"			"[[:alnum:]]"	       0
+C.UTF-8		 "H"			"[[:alnum:]]"	       0
+C.UTF-8		 "I"			"[[:alnum:]]"	       0
+C.UTF-8		 "J"			"[[:alnum:]]"	       0
+C.UTF-8		 "K"			"[[:alnum:]]"	       0
+C.UTF-8		 "L"			"[[:alnum:]]"	       0
+C.UTF-8		 "M"			"[[:alnum:]]"	       0
+C.UTF-8		 "N"			"[[:alnum:]]"	       0
+C.UTF-8		 "O"			"[[:alnum:]]"	       0
+C.UTF-8		 "P"			"[[:alnum:]]"	       0
+C.UTF-8		 "Q"			"[[:alnum:]]"	       0
+C.UTF-8		 "R"			"[[:alnum:]]"	       0
+C.UTF-8		 "S"			"[[:alnum:]]"	       0
+C.UTF-8		 "T"			"[[:alnum:]]"	       0
+C.UTF-8		 "U"			"[[:alnum:]]"	       0
+C.UTF-8		 "V"			"[[:alnum:]]"	       0
+C.UTF-8		 "W"			"[[:alnum:]]"	       0
+C.UTF-8		 "X"			"[[:alnum:]]"	       0
+C.UTF-8		 "Y"			"[[:alnum:]]"	       0
+C.UTF-8		 "Z"			"[[:alnum:]]"	       0
+C.UTF-8		 "0"			"[[:alnum:]]"	       0
+C.UTF-8		 "1"			"[[:alnum:]]"	       0
+C.UTF-8		 "2"			"[[:alnum:]]"	       0
+C.UTF-8		 "3"			"[[:alnum:]]"	       0
+C.UTF-8		 "4"			"[[:alnum:]]"	       0
+C.UTF-8		 "5"			"[[:alnum:]]"	       0
+C.UTF-8		 "6"			"[[:alnum:]]"	       0
+C.UTF-8		 "7"			"[[:alnum:]]"	       0
+C.UTF-8		 "8"			"[[:alnum:]]"	       0
+C.UTF-8		 "9"			"[[:alnum:]]"	       0
+C.UTF-8		 "!"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "#"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "%"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "+"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 ","			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "-"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "."			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "/"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 ":"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 ";"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "="			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "@"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "["			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "\\"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "]"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "^"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "_"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "{"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "}"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "~"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "\""			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "$"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "&"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "'"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "("			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 ")"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "*"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "?"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "`"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "|"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "<"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 ">"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "\t"			"[[:cntrl:]]"	       0
+C.UTF-8		 "t"			"[[:cntrl:]]"	       NOMATCH
+C.UTF-8		 "t"			"[[:lower:]]"	       0
+C.UTF-8		 "\t"			"[[:lower:]]"	       NOMATCH
+C.UTF-8		 "T"			"[[:lower:]]"	       NOMATCH
+C.UTF-8		 "\t"			"[[:space:]]"	       0
+C.UTF-8		 "t"			"[[:space:]]"	       NOMATCH
+C.UTF-8		 "t"			"[[:alpha:]]"	       0
+C.UTF-8		 "\t"			"[[:alpha:]]"	       NOMATCH
+C.UTF-8		 "0"			"[[:digit:]]"	       0
+C.UTF-8		 "\t"			"[[:digit:]]"	       NOMATCH
+C.UTF-8		 "t"			"[[:digit:]]"	       NOMATCH
+C.UTF-8		 "\t"			"[[:print:]]"	       NOMATCH
+C.UTF-8		 "t"			"[[:print:]]"	       0
+C.UTF-8		 "T"			"[[:upper:]]"	       0
+C.UTF-8		 "\t"			"[[:upper:]]"	       NOMATCH
+C.UTF-8		 "t"			"[[:upper:]]"	       NOMATCH
+C.UTF-8		 "\t"			"[[:blank:]]"	       0
+C.UTF-8		 "t"			"[[:blank:]]"	       NOMATCH
+C.UTF-8		 "\t"			"[[:graph:]]"	       NOMATCH
+C.UTF-8		 "t"			"[[:graph:]]"	       0
+C.UTF-8		 "."			"[[:punct:]]"	       0
+C.UTF-8		 "t"			"[[:punct:]]"	       NOMATCH
+C.UTF-8		 "\t"			"[[:punct:]]"	       NOMATCH
+C.UTF-8		 "0"			"[[:xdigit:]]"	       0
+C.UTF-8		 "\t"			"[[:xdigit:]]"	       NOMATCH
+C.UTF-8		 "a"			"[[:xdigit:]]"	       0
+C.UTF-8		 "A"			"[[:xdigit:]]"	       0
+C.UTF-8		 "t"			"[[:xdigit:]]"	       NOMATCH
+C.UTF-8		 "a"			"[[alpha]]"	       NOMATCH
+C.UTF-8		 "a"			"[[alpha:]]"	       NOMATCH
+C.UTF-8		 "a]"			"[[alpha]]"	       0
+C.UTF-8		 "a]"			"[[alpha:]]"	       0
+C.UTF-8		 "a"			"[[:alpha:][.b.]]"     0
+C.UTF-8		 "a"			"[[:alpha:][=b=]]"     0
+C.UTF-8		 "a"			"[[:alpha:][:digit:]]" 0
+C.UTF-8		 "a"			"[[:digit:][:alpha:]]" 0
+
+# B.6 018(C)
+C.UTF-8		 "a"			"[a-c]"		       0
+C.UTF-8		 "b"			"[a-c]"		       0
+C.UTF-8		 "c"			"[a-c]"		       0
+C.UTF-8		 "a"			"[b-c]"		       NOMATCH
+C.UTF-8		 "d"			"[b-c]"		       NOMATCH
+C.UTF-8		 "B"			"[a-c]"		       NOMATCH
+C.UTF-8		 "b"			"[A-C]"		       NOMATCH
+C.UTF-8		 ""			"[a-c]"		       NOMATCH
+C.UTF-8		 "as"			"[a-ca-z]"	       NOMATCH
+C.UTF-8		 "a"			"[[.a.]-c]"	       0
+C.UTF-8		 "a"			"[a-[.c.]]"	       0
+C.UTF-8		 "a"			"[[.a.]-[.c.]]"	       0
+C.UTF-8		 "b"			"[[.a.]-c]"	       0
+C.UTF-8		 "b"			"[a-[.c.]]"	       0
+C.UTF-8		 "b"			"[[.a.]-[.c.]]"	       0
+C.UTF-8		 "c"			"[[.a.]-c]"	       0
+C.UTF-8		 "c"			"[a-[.c.]]"	       0
+C.UTF-8		 "c"			"[[.a.]-[.c.]]"	       0
+C.UTF-8		 "d"			"[[.a.]-c]"	       NOMATCH
+C.UTF-8		 "d"			"[a-[.c.]]"	       NOMATCH
+C.UTF-8		 "d"			"[[.a.]-[.c.]]"	       NOMATCH
+
+# B.6 019(C)
+C.UTF-8		 "a"			"[c-a]"		       NOMATCH
+C.UTF-8		 "a"			"[[.c.]-a]"	       NOMATCH
+C.UTF-8		 "a"			"[c-[.a.]]"	       NOMATCH
+C.UTF-8		 "a"			"[[.c.]-[.a.]]"	       NOMATCH
+C.UTF-8		 "c"			"[c-a]"		       NOMATCH
+C.UTF-8		 "c"			"[[.c.]-a]"	       NOMATCH
+C.UTF-8		 "c"			"[c-[.a.]]"	       NOMATCH
+C.UTF-8		 "c"			"[[.c.]-[.a.]]"	       NOMATCH
+
+# B.6 020(C)
+C.UTF-8		 "a"			"[a-c0-9]"	       0
+C.UTF-8		 "d"			"[a-c0-9]"	       NOMATCH
+C.UTF-8		 "B"			"[a-c0-9]"	       NOMATCH
+
+# B.6 021(C)
+C.UTF-8		 "-"			"[-a]"		       0
+C.UTF-8		 "a"			"[-b]"		       NOMATCH
+C.UTF-8		 "-"			"[!-a]"		       NOMATCH
+C.UTF-8		 "a"			"[!-b]"		       0
+C.UTF-8		 "-"			"[a-c-0-9]"	       0
+C.UTF-8		 "b"			"[a-c-0-9]"	       0
+C.UTF-8		 "a:"			"a[0-9-a]"	       NOMATCH
+C.UTF-8		 "a:"			"a[09-a]"	       0
+
+# B.6 024(C)
+C.UTF-8		 ""			"*"		       0
+C.UTF-8		 "asd/sdf"		"*"		       0
+
+# B.6 025(C)
+C.UTF-8		 "as"			"[a-c][a-z]"	       0
+C.UTF-8		 "as"			"??"		       0
+
+# B.6 026(C)
+C.UTF-8		 "asd/sdf"		"as*df"		       0
+C.UTF-8		 "asd/sdf"		"as*"		       0
+C.UTF-8		 "asd/sdf"		"*df"		       0
+C.UTF-8		 "asd/sdf"		"as*dg"		       NOMATCH
+C.UTF-8		 "asdf"			"as*df"		       0
+C.UTF-8		 "asdf"			"as*df?"	       NOMATCH
+C.UTF-8		 "asdf"			"as*??"		       0
+C.UTF-8		 "asdf"			"a*???"		       0
+C.UTF-8		 "asdf"			"*????"		       0
+C.UTF-8		 "asdf"			"????*"		       0
+C.UTF-8		 "asdf"			"??*?"		       0
+
+# B.6 027(C)
+C.UTF-8		 "/"			"/"		       0
+C.UTF-8		 "/"			"/*"		       0
+C.UTF-8		 "/"			"*/"		       0
+C.UTF-8		 "/"			"/?"		       NOMATCH
+C.UTF-8		 "/"			"?/"		       NOMATCH
+C.UTF-8		 "/"			"?"		       0
+C.UTF-8		 "."			"?"		       0
+C.UTF-8		 "/."			"??"		       0
+C.UTF-8		 "/"			"[!a-c]"	       0
+C.UTF-8		 "."			"[!a-c]"	       0
+
+# B.6 029(C)
+C.UTF-8		 "/"			"/"		       0       PATHNAME
+C.UTF-8		 "//"			"//"		       0       PATHNAME
+C.UTF-8		 "/.a"			"/*"		       0       PATHNAME
+C.UTF-8		 "/.a"			"/?a"		       0       PATHNAME
+C.UTF-8		 "/.a"			"/[!a-z]a"	       0       PATHNAME
+C.UTF-8		 "/.a/.b"		"/*/?b"		       0       PATHNAME
+
+# B.6 030(C)
+C.UTF-8		 "/"			"?"		       NOMATCH PATHNAME
+C.UTF-8		 "/"			"*"		       NOMATCH PATHNAME
+C.UTF-8		 "a/b"			"a?b"		       NOMATCH PATHNAME
+C.UTF-8		 "/.a/.b"		"/*b"		       NOMATCH PATHNAME
+
+# B.6 031(C)
+C.UTF-8		 "/$"			"\\/\\$"	       0
+C.UTF-8		 "/["			"\\/\\["	       0
+C.UTF-8		 "/["			"\\/["		       0
+C.UTF-8		 "/[]"			"\\/\\[]"	       0
+
+# B.6 032(C)
+C.UTF-8		 "/$"			"\\/\\$"	       NOMATCH NOESCAPE
+C.UTF-8		 "/\\$"			"\\/\\$"	       NOMATCH NOESCAPE
+C.UTF-8		 "\\/\\$"		"\\/\\$"	       0       NOESCAPE
+
+# B.6 033(C)
+C.UTF-8		 ".asd"			".*"		       0       PERIOD
+C.UTF-8		 "/.asd"		"*"		       0       PERIOD
+C.UTF-8		 "/as/.df"		"*/?*f"		       0       PERIOD
+C.UTF-8		 "..asd"		".[!a-z]*"	       0       PERIOD
+
+# B.6 034(C)
+C.UTF-8		 ".asd"			"*"		       NOMATCH PERIOD
+C.UTF-8		 ".asd"			"?asd"		       NOMATCH PERIOD
+C.UTF-8		 ".asd"			"[!a-z]*"	       NOMATCH PERIOD
+
+# B.6 035(C)
+C.UTF-8		 "/."			"/."		       0       PATHNAME|PERIOD
+C.UTF-8		 "/.a./.b."		"/.*/.*"	       0       PATHNAME|PERIOD
+C.UTF-8		 "/.a./.b."		"/.??/.??"	       0       PATHNAME|PERIOD
+
+# B.6 036(C)
+C.UTF-8		 "/."			"*"		       NOMATCH PATHNAME|PERIOD
+C.UTF-8		 "/."			"/*"		       NOMATCH PATHNAME|PERIOD
+C.UTF-8		 "/."			"/?"		       NOMATCH PATHNAME|PERIOD
+C.UTF-8		 "/."			"/[!a-z]"	       NOMATCH PATHNAME|PERIOD
+C.UTF-8		 "/a./.b."		"/*/*"		       NOMATCH PATHNAME|PERIOD
+C.UTF-8		 "/a./.b."		"/??/???"	       NOMATCH PATHNAME|PERIOD
+
+# Some home-grown tests.
+C.UTF-8		"foobar"		"foo*[abc]z"	       NOMATCH
+C.UTF-8		"foobaz"		"foo*[abc][xyz]"       0
+C.UTF-8		"foobaz"		"foo?*[abc][xyz]"      0
+C.UTF-8		"foobaz"		"foo?*[abc][x/yz]"     0
+C.UTF-8		"foobaz"		"foo?*[abc]/[xyz]"     NOMATCH PATHNAME
+C.UTF-8		"a"			"a/"                   NOMATCH PATHNAME
+C.UTF-8		"a/"			"a"		       NOMATCH PATHNAME
+C.UTF-8		"//a"			"/a"		       NOMATCH PATHNAME
+C.UTF-8		"/a"			"//a"		       NOMATCH PATHNAME
+C.UTF-8		"az"			"[a-]z"		       0
+C.UTF-8		"bz"			"[ab-]z"	       0
+C.UTF-8		"cz"			"[ab-]z"	       NOMATCH
+C.UTF-8		"-z"			"[ab-]z"	       0
+C.UTF-8		"az"			"[-a]z"		       0
+C.UTF-8		"bz"			"[-ab]z"	       0
+C.UTF-8		"cz"			"[-ab]z"	       NOMATCH
+C.UTF-8		"-z"			"[-ab]z"	       0
+C.UTF-8		"\\"			"[\\\\-a]"	       0
+C.UTF-8		"_"			"[\\\\-a]"	       0
+C.UTF-8		"a"			"[\\\\-a]"	       0
+C.UTF-8		"-"			"[\\\\-a]"	       NOMATCH
+C.UTF-8		"\\"			"[\\]-a]"	       NOMATCH
+C.UTF-8		"_"			"[\\]-a]"	       0
+C.UTF-8		"a"			"[\\]-a]"	       0
+C.UTF-8		"]"			"[\\]-a]"	       0
+C.UTF-8		"-"			"[\\]-a]"	       NOMATCH
+C.UTF-8		"\\"			"[!\\\\-a]"	       NOMATCH
+C.UTF-8		"_"			"[!\\\\-a]"	       NOMATCH
+C.UTF-8		"a"			"[!\\\\-a]"	       NOMATCH
+C.UTF-8		"-"			"[!\\\\-a]"	       0
+C.UTF-8		"!"			"[\\!-]"	       0
+C.UTF-8		"-"			"[\\!-]"	       0
+C.UTF-8		"\\"			"[\\!-]"	       NOMATCH
+C.UTF-8		"Z"			"[Z-\\\\]"	       0
+C.UTF-8		"["			"[Z-\\\\]"	       0
+C.UTF-8		"\\"			"[Z-\\\\]"	       0
+C.UTF-8		"-"			"[Z-\\\\]"	       NOMATCH
+C.UTF-8		"Z"			"[Z-\\]]"	       0
+C.UTF-8		"["			"[Z-\\]]"	       0
+C.UTF-8		"\\"			"[Z-\\]]"	       0
+C.UTF-8		"]"			"[Z-\\]]"	       0
+C.UTF-8		"-"			"[Z-\\]]"	       NOMATCH
+
 # Following are tests outside the scope of IEEE 2003.2 since they are using
 # locales other than the C locale.  The main focus of the tests is on the
 # handling of ranges and the recognition of character (vs bytes).
@@ -677,7 +1068,6 @@ C		 "x/y"			"*"		       0       PATHNAME|LEADING_DIR
 C		 "x/y/z"		"*"		       0       PATHNAME|LEADING_DIR
 C		 "x"			"*x"		       0       PATHNAME|LEADING_DIR
 
-en_US.UTF-8	 "\366.csv"		"*.csv"                0
 C		 "x/y"			"*x"		       0       PATHNAME|LEADING_DIR
 C		 "x/y/z"		"*x"		       0       PATHNAME|LEADING_DIR
 C		 "x"			"x*"		       0       PATHNAME|LEADING_DIR
@@ -693,6 +1083,33 @@ C		 "x"			"x?y"		       NOMATCH PATHNAME|LEADING_DIR
 C		 "x/y"			"x?y"		       NOMATCH PATHNAME|LEADING_DIR
 C		 "x/y/z"		"x?y"		       NOMATCH PATHNAME|LEADING_DIR
 
+# Duplicate the "Test of GNU extensions." tests but for C.UTF-8.
+C.UTF-8		 "x"			"x"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y"			"x"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y/z"		"x"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x"			"*"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y"			"*"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y/z"		"*"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x"			"*x"		       0       PATHNAME|LEADING_DIR
+
+C.UTF-8		 "x/y"			"*x"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y/z"		"*x"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x"			"x*"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y"			"x*"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y/z"		"x*"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x"			"a"		       NOMATCH PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y"			"a"		       NOMATCH PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y/z"		"a"		       NOMATCH PATHNAME|LEADING_DIR
+C.UTF-8		 "x"			"x/y"		       NOMATCH PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y"			"x/y"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y/z"		"x/y"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x"			"x?y"		       NOMATCH PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y"			"x?y"		       NOMATCH PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y/z"		"x?y"		       NOMATCH PATHNAME|LEADING_DIR
+
+# Bug 14185
+en_US.UTF-8	 "\366.csv"		"*.csv"                0
+
 # ksh style matching.
 C		"abcd"			"?@(a|b)*@(c)d"	       0       EXTMATCH
 C		"/dev/udp/129.22.8.102/45" "/dev/@(tcp|udp)/*/*" 0     PATHNAME|EXTMATCH
@@ -822,3 +1239,133 @@ C		""			""		       0
 C		""			""		       0       EXTMATCH
 C		""			"*([abc])"	       0       EXTMATCH
 C		""			"?([abc])"	       0       EXTMATCH
+
+# Duplicate the "ksh style matching." for C.UTF-8.
+C.UTF-8		"abcd"			"?@(a|b)*@(c)d"	       0       EXTMATCH
+C.UTF-8		"/dev/udp/129.22.8.102/45" "/dev/@(tcp|udp)/*/*" 0     PATHNAME|EXTMATCH
+C.UTF-8		"12"			"[1-9]*([0-9])"        0       EXTMATCH
+C.UTF-8		"12abc"			"[1-9]*([0-9])"        NOMATCH EXTMATCH
+C.UTF-8		"1"			"[1-9]*([0-9])"	       0       EXTMATCH
+C.UTF-8		"07"			"+([0-7])"	       0       EXTMATCH
+C.UTF-8		"0377"			"+([0-7])"	       0       EXTMATCH
+C.UTF-8		"09"			"+([0-7])"	       NOMATCH EXTMATCH
+C.UTF-8		"paragraph"		"para@(chute|graph)"   0       EXTMATCH
+C.UTF-8		"paramour"		"para@(chute|graph)"   NOMATCH EXTMATCH
+C.UTF-8		"para991"		"para?([345]|99)1"     0       EXTMATCH
+C.UTF-8		"para381"		"para?([345]|99)1"     NOMATCH EXTMATCH
+C.UTF-8		"paragraph"		"para*([0-9])"	       NOMATCH EXTMATCH
+C.UTF-8		"para"			"para*([0-9])"	       0       EXTMATCH
+C.UTF-8		"para13829383746592"	"para*([0-9])"	       0       EXTMATCH
+C.UTF-8		"paragraph"		"para+([0-9])"	       NOMATCH EXTMATCH
+C.UTF-8		"para"			"para+([0-9])"	       NOMATCH EXTMATCH
+C.UTF-8		"para987346523"		"para+([0-9])"	       0       EXTMATCH
+C.UTF-8		"paragraph"		"para!(*.[0-9])"       0       EXTMATCH
+C.UTF-8		"para.38"		"para!(*.[0-9])"       0       EXTMATCH
+C.UTF-8		"para.graph"		"para!(*.[0-9])"       0       EXTMATCH
+C.UTF-8		"para39"		"para!(*.[0-9])"       0       EXTMATCH
+C.UTF-8		""			"*(0|1|3|5|7|9)"       0       EXTMATCH
+C.UTF-8		"137577991"		"*(0|1|3|5|7|9)"       0       EXTMATCH
+C.UTF-8		"2468"			"*(0|1|3|5|7|9)"       NOMATCH EXTMATCH
+C.UTF-8		"1358"			"*(0|1|3|5|7|9)"       NOMATCH EXTMATCH
+C.UTF-8		"file.c"		"*.c?(c)"	       0       EXTMATCH
+C.UTF-8		"file.C"		"*.c?(c)"	       NOMATCH EXTMATCH
+C.UTF-8		"file.cc"		"*.c?(c)"	       0       EXTMATCH
+C.UTF-8		"file.ccc"		"*.c?(c)"	       NOMATCH EXTMATCH
+C.UTF-8		"parse.y"		"!(*.c|*.h|Makefile.in|config*|README)" 0 EXTMATCH
+C.UTF-8		"shell.c"		"!(*.c|*.h|Makefile.in|config*|README)" NOMATCH EXTMATCH
+C.UTF-8		"Makefile"		"!(*.c|*.h|Makefile.in|config*|README)" 0 EXTMATCH
+C.UTF-8		"VMS.FILE;1"		"*\;[1-9]*([0-9])"     0       EXTMATCH
+C.UTF-8		"VMS.FILE;0"		"*\;[1-9]*([0-9])"     NOMATCH EXTMATCH
+C.UTF-8		"VMS.FILE;"		"*\;[1-9]*([0-9])"     NOMATCH EXTMATCH
+C.UTF-8		"VMS.FILE;139"		"*\;[1-9]*([0-9])"     0       EXTMATCH
+C.UTF-8		"VMS.FILE;1N"		"*\;[1-9]*([0-9])"     NOMATCH EXTMATCH
+C.UTF-8		"abcfefg"		"ab**(e|f)"	       0       EXTMATCH
+C.UTF-8		"abcfefg"		"ab**(e|f)g"	       0       EXTMATCH
+C.UTF-8		"ab"			"ab*+(e|f)"	       NOMATCH EXTMATCH
+C.UTF-8		"abef"			"ab***ef"	       0       EXTMATCH
+C.UTF-8		"abef"			"ab**"		       0       EXTMATCH
+C.UTF-8		"fofo"			"*(f*(o))"	       0       EXTMATCH
+C.UTF-8		"ffo"			"*(f*(o))"	       0       EXTMATCH
+C.UTF-8		"foooofo"		"*(f*(o))"	       0       EXTMATCH
+C.UTF-8		"foooofof"		"*(f*(o))"	       0       EXTMATCH
+C.UTF-8		"fooofoofofooo"		"*(f*(o))"	       0       EXTMATCH
+C.UTF-8		"foooofof"		"*(f+(o))"	       NOMATCH EXTMATCH
+C.UTF-8		"xfoooofof"		"*(f*(o))"	       NOMATCH EXTMATCH
+C.UTF-8		"foooofofx"		"*(f*(o))"	       NOMATCH EXTMATCH
+C.UTF-8		"ofxoofxo"		"*(*(of*(o)x)o)"       0       EXTMATCH
+C.UTF-8		"ofooofoofofooo"	"*(f*(o))"	       NOMATCH EXTMATCH
+C.UTF-8		"foooxfooxfoxfooox"	"*(f*(o)x)"	       0       EXTMATCH
+C.UTF-8		"foooxfooxofoxfooox"	"*(f*(o)x)"	       NOMATCH EXTMATCH
+C.UTF-8		"foooxfooxfxfooox"	"*(f*(o)x)"	       0       EXTMATCH
+C.UTF-8		"ofxoofxo"		"*(*(of*(o)x)o)"       0       EXTMATCH
+C.UTF-8		"ofoooxoofxo"		"*(*(of*(o)x)o)"       0       EXTMATCH
+C.UTF-8		"ofoooxoofxoofoooxoofxo" "*(*(of*(o)x)o)"      0       EXTMATCH
+C.UTF-8		"ofoooxoofxoofoooxoofxoo" "*(*(of*(o)x)o)"     0       EXTMATCH
+C.UTF-8		"ofoooxoofxoofoooxoofxofo" "*(*(of*(o)x)o)"    NOMATCH EXTMATCH
+C.UTF-8		"ofoooxoofxoofoooxoofxooofxofxo" "*(*(of*(o)x)o)" 0    EXTMATCH
+C.UTF-8		"aac"			"*(@(a))a@(c)"	       0       EXTMATCH
+C.UTF-8		"ac"			"*(@(a))a@(c)"	       0       EXTMATCH
+C.UTF-8		"c"			"*(@(a))a@(c)"	       NOMATCH EXTMATCH
+C.UTF-8		"aaac"			"*(@(a))a@(c)"	       0       EXTMATCH
+C.UTF-8		"baaac"			"*(@(a))a@(c)"	       NOMATCH EXTMATCH
+C.UTF-8		"abcd"			"?@(a|b)*@(c)d"	       0       EXTMATCH
+C.UTF-8		"abcd"			"@(ab|a*@(b))*(c)d"    0       EXTMATCH
+C.UTF-8		"acd"			"@(ab|a*(b))*(c)d"     0       EXTMATCH
+C.UTF-8		"abbcd"			"@(ab|a*(b))*(c)d"     0       EXTMATCH
+C.UTF-8		"effgz"			"@(b+(c)d|e*(f)g?|?(h)i@(j|k))" 0 EXTMATCH
+C.UTF-8		"efgz"			"@(b+(c)d|e*(f)g?|?(h)i@(j|k))" 0 EXTMATCH
+C.UTF-8		"egz"			"@(b+(c)d|e*(f)g?|?(h)i@(j|k))" 0 EXTMATCH
+C.UTF-8		"egzefffgzbcdij"	"*(b+(c)d|e*(f)g?|?(h)i@(j|k))" 0 EXTMATCH
+C.UTF-8		"egz"			"@(b+(c)d|e+(f)g?|?(h)i@(j|k))" NOMATCH EXTMATCH
+C.UTF-8		"ofoofo"		"*(of+(o))"	       0       EXTMATCH
+C.UTF-8		"oxfoxoxfox"		"*(oxf+(ox))"	       0       EXTMATCH
+C.UTF-8		"oxfoxfox"		"*(oxf+(ox))"	       NOMATCH EXTMATCH
+C.UTF-8		"ofoofo"		"*(of+(o)|f)"	       0       EXTMATCH
+C.UTF-8		"foofoofo"		"@(foo|f|fo)*(f|of+(o))" 0     EXTMATCH
+C.UTF-8		"oofooofo"		"*(of|oof+(o))"	       0       EXTMATCH
+C.UTF-8		"fffooofoooooffoofffooofff" "*(*(f)*(o))"      0       EXTMATCH
+C.UTF-8		"fofoofoofofoo"		"*(fo|foo)"	       0       EXTMATCH
+C.UTF-8		"foo"			"!(x)"		       0       EXTMATCH
+C.UTF-8		"foo"			"!(x)*"		       0       EXTMATCH
+C.UTF-8		"foo"			"!(foo)"	       NOMATCH EXTMATCH
+C.UTF-8		"foo"			"!(foo)*"	       0       EXTMATCH
+C.UTF-8		"foobar"		"!(foo)"	       0       EXTMATCH
+C.UTF-8		"foobar"		"!(foo)*"	       0       EXTMATCH
+C.UTF-8		"moo.cow"		"!(*.*).!(*.*)"	       0       EXTMATCH
+C.UTF-8		"mad.moo.cow"		"!(*.*).!(*.*)"	       NOMATCH EXTMATCH
+C.UTF-8		"mucca.pazza"		"mu!(*(c))?.pa!(*(z))?" NOMATCH EXTMATCH
+C.UTF-8		"fff"			"!(f)"		       0       EXTMATCH
+C.UTF-8		"fff"			"*(!(f))"	       0       EXTMATCH
+C.UTF-8		"fff"			"+(!(f))"	       0       EXTMATCH
+C.UTF-8		"ooo"			"!(f)"		       0       EXTMATCH
+C.UTF-8		"ooo"			"*(!(f))"	       0       EXTMATCH
+C.UTF-8		"ooo"			"+(!(f))"	       0       EXTMATCH
+C.UTF-8		"foo"			"!(f)"		       0       EXTMATCH
+C.UTF-8		"foo"			"*(!(f))"	       0       EXTMATCH
+C.UTF-8		"foo"			"+(!(f))"	       0       EXTMATCH
+C.UTF-8		"f"			"!(f)"		       NOMATCH EXTMATCH
+C.UTF-8		"f"			"*(!(f))"	       NOMATCH EXTMATCH
+C.UTF-8		"f"			"+(!(f))"	       NOMATCH EXTMATCH
+C.UTF-8		"foot"			"@(!(z*)|*x)"	       0       EXTMATCH
+C.UTF-8		"zoot"			"@(!(z*)|*x)"	       NOMATCH EXTMATCH
+C.UTF-8		"foox"			"@(!(z*)|*x)"	       0       EXTMATCH
+C.UTF-8		"zoox"			"@(!(z*)|*x)"	       0       EXTMATCH
+C.UTF-8		"foo"			"*(!(foo))"	       0       EXTMATCH
+C.UTF-8		"foob"			"!(foo)b*"	       NOMATCH EXTMATCH
+C.UTF-8		"foobb"			"!(foo)b*"	       0       EXTMATCH
+C.UTF-8		"["			"*([a[])"	       0       EXTMATCH
+C.UTF-8		"]"			"*([]a[])"	       0       EXTMATCH
+C.UTF-8		"a"			"*([]a[])"	       0       EXTMATCH
+C.UTF-8		"b"			"*([!]a[])"	       0       EXTMATCH
+C.UTF-8		"["			"*([!]a[]|[[])"	       0       EXTMATCH
+C.UTF-8		"]"			"*([!]a[]|[]])"	       0       EXTMATCH
+C.UTF-8		"["			"!([!]a[])"	       0       EXTMATCH
+C.UTF-8		"]"			"!([!]a[])"	       0       EXTMATCH
+C.UTF-8		")"			"*([)])"	       0       EXTMATCH
+C.UTF-8		"*"			"*([*(])"	       0       EXTMATCH
+C.UTF-8		"abcd"			"*!(|a)cd"	       0       EXTMATCH
+C.UTF-8		"ab/.a"			"+([abc])/*"	       NOMATCH EXTMATCH|PATHNAME|PERIOD
+C.UTF-8		""			""		       0
+C.UTF-8		""			""		       0       EXTMATCH
+C.UTF-8		""			"*([abc])"	       0       EXTMATCH
+C.UTF-8		""			"?([abc])"	       0       EXTMATCH
diff --git a/posix/tst-regcomp-truncated.c b/posix/tst-regcomp-truncated.c
index 84195fcd2e..da3f97799e 100644
--- a/posix/tst-regcomp-truncated.c
+++ b/posix/tst-regcomp-truncated.c
@@ -37,6 +37,7 @@
 static const char locales[][17] =
   {
     "C",
+    "C.UTF-8",
     "en_US.UTF-8",
     "de_DE.ISO-8859-1",
   };
diff --git a/posix/tst-regex.c b/posix/tst-regex.c
index e7c2b05e86..531128de2a 100644
--- a/posix/tst-regex.c
+++ b/posix/tst-regex.c
@@ -32,6 +32,7 @@
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <regex.h>
+#include <support/support.h>
 
 
 #if defined _POSIX_CPUTIME && _POSIX_CPUTIME >= 0
@@ -58,7 +59,7 @@ do_test (void)
   const char *file;
   int fd;
   struct stat st;
-  int result;
+  int result = 0;
   char *inmem;
   char *outmem;
   size_t inlen;
@@ -123,7 +124,7 @@ do_test (void)
 
   /* Run the actual tests.  All tests are run in a single-byte and a
      multi-byte locale.  */
-  result = test_expr ("[äáàâéèêíìîñöóòôüúùû]", 4, 4);
+  result |= test_expr ("[äáàâéèêíìîñöóòôüúùû]", 4, 4);
   result |= test_expr ("G.ran", 2, 3);
   result |= test_expr ("G.\\{1\\}ran", 2, 3);
   result |= test_expr ("G.*ran", 3, 44);
@@ -143,19 +144,33 @@ do_test (void)
 static int
 test_expr (const char *expr, int expected, int expectedicase)
 {
-  int result;
+  int result = 0;
   char *inmem;
   char *outmem;
   size_t inlen;
   size_t outlen;
   char *uexpr;
 
-  /* First test: search with an UTF-8 locale.  */
-  if (setlocale (LC_ALL, "de_DE.UTF-8") == NULL)
-    error (EXIT_FAILURE, 0, "cannot set locale de_DE.UTF-8");
+  /* First test: search with the basic C.UTF-8 locale.  */
+  printf ("INFO: Testing C.UTF-8.\n");
+  xsetlocale (LC_ALL, "C.UTF-8");
 
   printf ("\nTest \"%s\" with multi-byte locale\n", expr);
-  result = run_test (expr, mem, memlen, 0, expected);
+  result |= run_test (expr, mem, memlen, 0, expected);
+  printf ("\nTest \"%s\" with multi-byte locale, case insensitive\n", expr);
+  result |= run_test (expr, mem, memlen, 1, expectedicase);
+  printf ("\nTest \"%s\" backwards with multi-byte locale\n", expr);
+  result |= run_test_backwards (expr, mem, memlen, 0, expected);
+  printf ("\nTest \"%s\" backwards with multi-byte locale, case insensitive\n",
+	  expr);
+  result |= run_test_backwards (expr, mem, memlen, 1, expectedicase);
+
+  /* Second test: search with the de_DE.UTF-8 locale.  */
+  printf ("INFO: Testing de_DE.UTF-8.\n");
+  xsetlocale (LC_ALL, "de_DE.UTF-8");
+
+  printf ("\nTest \"%s\" with multi-byte locale\n", expr);
+  result |= run_test (expr, mem, memlen, 0, expected);
   printf ("\nTest \"%s\" with multi-byte locale, case insensitive\n", expr);
   result |= run_test (expr, mem, memlen, 1, expectedicase);
   printf ("\nTest \"%s\" backwards with multi-byte locale\n", expr);
@@ -165,8 +180,8 @@ test_expr (const char *expr, int expected, int expectedicase)
   result |= run_test_backwards (expr, mem, memlen, 1, expectedicase);
 
   /* Second test: search with an ISO-8859-1 locale.  */
-  if (setlocale (LC_ALL, "de_DE.ISO-8859-1") == NULL)
-    error (EXIT_FAILURE, 0, "cannot set locale de_DE.ISO-8859-1");
+  printf ("INFO: Testing de_DE.ISO-8859-1.\n");
+  xsetlocale (LC_ALL, "de_DE.ISO-8859-1");
 
   inmem = (char *) expr;
   inlen = strlen (expr);
-- 
2.31.1



* Re: [PATCH v12 1/2] Add 'codepoint_collation' support for LC_COLLATE.
  2021-09-06 15:43 ` [PATCH v12 1/2] Add 'codepoint_collation' support for LC_COLLATE Carlos O'Donell via Libc-alpha
@ 2021-09-06 17:20   ` Matheus Castanho via Libc-alpha
  2021-09-06 17:28     ` Florian Weimer via Libc-alpha
  2021-09-07  1:57     ` Carlos O'Donell via Libc-alpha
  0 siblings, 2 replies; 12+ messages in thread
From: Matheus Castanho via Libc-alpha @ 2021-09-06 17:20 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Florian Weimer, libc-alpha


Carlos O'Donell via Libc-alpha <libc-alpha@sourceware.org> writes:

> Support a new directive 'codepoint_collation' in the LC_COLLATE
> section of a locale source file. This new directive causes all
> collation rules to be dropped and instead STRCMP (strcmp or
> wcscmp) is used for collation of the input character set. This
> is required to allow for a C.UTF-8 that contains zero collation
> rules (minimal size) and sorts using code point sorting.
>
[...]
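
As an aside on what "code point sorting" buys us, here is a minimal
sketch. It assumes nothing from the patch itself: sorts_before and
check_sorts_before are hypothetical helpers that exist only for
illustration. The point is that strcmp compares bytes as unsigned char,
and UTF-8 is designed so that byte order matches code point order, so no
collation table is needed.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: returns nonzero when
   UTF-8 string A sorts before UTF-8 string B under byte-wise
   comparison.  strcmp compares bytes as unsigned char, and UTF-8
   encodes larger code points as byte sequences that compare larger,
   so this is also code point order.  */
static int
sorts_before (const char *a, const char *b)
{
  return strcmp (a, b) < 0;
}

/* Usage sketch: a few orderings that follow from code point values.  */
static void
check_sorts_before (void)
{
  /* U+005A 'Z' < U+0061 'a': uppercase sorts before lowercase.  */
  assert (sorts_before ("Z", "a"));
  /* U+007A 'z' < U+00E9 (e with acute, bytes C3 A9).  */
  assert (sorts_before ("z", "\xc3\xa9"));
  /* U+00E9 < U+20AC (euro sign, bytes E2 82 AC).  */
  assert (sorts_before ("\xc3\xa9", "\xe2\x82\xac"));
}
```

Note that this ordering is not what a language-aware locale would give:
every accented letter sorts after plain 'z'.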

> diff --git a/locale/C-collate-seq.c b/locale/C-collate-seq.c
> new file mode 100644
> index 0000000000..4fb82cb835
> --- /dev/null
> +++ b/locale/C-collate-seq.c
> @@ -0,0 +1,100 @@
> +/* Copyright (C) 1995-2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <stdint.h>
> +
> +static const char collseqmb[] =
> +{
> +  '\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07',
> +  '\x08', '\x09', '\x0a', '\x0b', '\x0c', '\x0d', '\x0e', '\x0f',
> +  '\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17',
> +  '\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f',
> +  '\x20', '\x21', '\x22', '\x23', '\x24', '\x25', '\x26', '\x27',
> +  '\x28', '\x29', '\x2a', '\x2b', '\x2c', '\x2d', '\x2e', '\x2f',
> +  '\x30', '\x31', '\x32', '\x33', '\x34', '\x35', '\x36', '\x37',
> +  '\x38', '\x39', '\x3a', '\x3b', '\x3c', '\x3d', '\x3e', '\x3f',
> +  '\x40', '\x41', '\x42', '\x43', '\x44', '\x45', '\x46', '\x47',
> +  '\x48', '\x49', '\x4a', '\x4b', '\x4c', '\x4d', '\x4e', '\x4f',
> +  '\x50', '\x51', '\x52', '\x53', '\x54', '\x55', '\x56', '\x57',
> +  '\x58', '\x59', '\x5a', '\x5b', '\x5c', '\x5d', '\x5e', '\x5f',
> +  '\x60', '\x61', '\x62', '\x63', '\x64', '\x65', '\x66', '\x67',
> +  '\x68', '\x69', '\x6a', '\x6b', '\x6c', '\x6d', '\x6e', '\x6f',
> +  '\x70', '\x71', '\x72', '\x73', '\x74', '\x75', '\x76', '\x77',
> +  '\x78', '\x79', '\x7a', '\x7b', '\x7c', '\x7d', '\x7e', '\x7f',
> +  '\x80', '\x81', '\x82', '\x83', '\x84', '\x85', '\x86', '\x87',
> +  '\x88', '\x89', '\x8a', '\x8b', '\x8c', '\x8d', '\x8e', '\x8f',
> +  '\x90', '\x91', '\x92', '\x93', '\x94', '\x95', '\x96', '\x97',
> +  '\x98', '\x99', '\x9a', '\x9b', '\x9c', '\x9d', '\x9e', '\x9f',
> +  '\xa0', '\xa1', '\xa2', '\xa3', '\xa4', '\xa5', '\xa6', '\xa7',
> +  '\xa8', '\xa9', '\xaa', '\xab', '\xac', '\xad', '\xae', '\xaf',
> +  '\xb0', '\xb1', '\xb2', '\xb3', '\xb4', '\xb5', '\xb6', '\xb7',
> +  '\xb8', '\xb9', '\xba', '\xbb', '\xbc', '\xbd', '\xbe', '\xbf',
> +  '\xc0', '\xc1', '\xc2', '\xc3', '\xc4', '\xc5', '\xc6', '\xc7',
> +  '\xc8', '\xc9', '\xca', '\xcb', '\xcc', '\xcd', '\xce', '\xcf',
> +  '\xd0', '\xd1', '\xd2', '\xd3', '\xd4', '\xd5', '\xd6', '\xd7',
> +  '\xd8', '\xd9', '\xda', '\xdb', '\xdc', '\xdd', '\xde', '\xdf',
> +  '\xe0', '\xe1', '\xe2', '\xe3', '\xe4', '\xe5', '\xe6', '\xe7',
> +  '\xe8', '\xe9', '\xea', '\xeb', '\xec', '\xed', '\xee', '\xef',
> +  '\xf0', '\xf1', '\xf2', '\xf3', '\xf4', '\xf5', '\xf6', '\xf7',
> +  '\xf8', '\xf9', '\xfa', '\xfb', '\xfc', '\xfd', '\xfe', '\xff'
> +};
> +
> +/* This table must be 256 bytes in size. We index bytes into the
> +   table to find the collation sequence.  */
> +_Static_assert (sizeof (collseqmb) == 256);

Hi Carlos,

glibc doesn't build after this patch went in, looks like the assert
message is missing:

In file included from C-collate.c:22:
C-collate-seq.c:58:42: error: expected ‘,’ before ‘)’ token
 _Static_assert (sizeof (collseqmb) == 256);
                                          ^

--
Matheus Castanho


* Re: [PATCH v12 1/2] Add 'codepoint_collation' support for LC_COLLATE.
  2021-09-06 17:20   ` Matheus Castanho via Libc-alpha
@ 2021-09-06 17:28     ` Florian Weimer via Libc-alpha
  2021-09-07  1:28       ` Carlos O'Donell via Libc-alpha
  2021-09-07  1:57     ` Carlos O'Donell via Libc-alpha
  1 sibling, 1 reply; 12+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-09-06 17:28 UTC (permalink / raw)
  To: Matheus Castanho; +Cc: libc-alpha

* Matheus Castanho:

> glibc doesn't build after this patch went in, looks like the assert
> message is missing:
>
> In file included from C-collate.c:22:
> C-collate-seq.c:58:42: error: expected ‘,’ before ‘)’ token
>  _Static_assert (sizeof (collseqmb) == 256);

Sorry, I missed that.  I'm testing a patch.

Florian



* Re: [PATCH v12 1/2] Add 'codepoint_collation' support for LC_COLLATE.
  2021-09-06 17:28     ` Florian Weimer via Libc-alpha
@ 2021-09-07  1:28       ` Carlos O'Donell via Libc-alpha
  0 siblings, 0 replies; 12+ messages in thread
From: Carlos O'Donell via Libc-alpha @ 2021-09-07  1:28 UTC (permalink / raw)
  To: Florian Weimer, Matheus Castanho; +Cc: libc-alpha

On 9/6/21 1:28 PM, Florian Weimer wrote:
> * Matheus Castanho:
> 
>> glibc doesn't build after this patch went in, looks like the assert
>> message is missing:
>>
>> In file included from C-collate.c:22:
>> C-collate-seq.c:58:42: error: expected ‘,’ before ‘)’ token
>>  _Static_assert (sizeof (collseqmb) == 256);
> 
> Sorry, I missed that.  I'm testing a patch.

Thanks for committing a fix for this. I'm using too new compilers
and thinking in too new C :-)

I missed it too. C++ allows the assertion with an expression only,
but current C11 does not. That's an annoying wrinkle fixed in C2x,
which allows a one-argument version of _Static_assert.

This behaviour changed between gcc 8.5 and 9.1: starting with 9.1, gcc
accepts the one-argument form of _Static_assert even under -std=c11.

All of our CI/CD is using gcc 11, and I'm using gcc 10, so we would
not have seen this. Seems like we need another CI/CD trybot with
gcc 6.2!
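
For reference, a minimal sketch of the portable two-argument form. This
is not the committed fix; example_table and example_table_size are
hypothetical names standing in for collseqmb, and the message text is
my own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the 256-entry collseqmb table.  */
static const char example_table[256] = { 0 };

/* C11 requires a message string as the second argument.  The
   one-argument form is only valid in C2x or as a compiler
   extension, which is why older compilers reject it.  */
_Static_assert (sizeof (example_table) == 256,
                "example_table must have 256 entries");

/* Hypothetical accessor so the size can also be checked at run time.  */
static size_t
example_table_size (void)
{
  return sizeof (example_table);
}
```

Always passing the message keeps the file building with compilers that
only implement C11.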

-- 
Cheers,
Carlos.



* Re: [PATCH v12 1/2] Add 'codepoint_collation' support for LC_COLLATE.
  2021-09-06 17:20   ` Matheus Castanho via Libc-alpha
  2021-09-06 17:28     ` Florian Weimer via Libc-alpha
@ 2021-09-07  1:57     ` Carlos O'Donell via Libc-alpha
  2021-09-20 12:49       ` Matheus Castanho via Libc-alpha
  1 sibling, 1 reply; 12+ messages in thread
From: Carlos O'Donell via Libc-alpha @ 2021-09-07  1:57 UTC (permalink / raw)
  To: Matheus Castanho; +Cc: Florian Weimer, libc-alpha

On 9/6/21 1:20 PM, Matheus Castanho wrote:
> In file included from C-collate.c:22:
> C-collate-seq.c:58:42: error: expected ‘,’ before ‘)’ token
>  _Static_assert (sizeof (collseqmb) == 256);
>                                           ^

Which version of gcc? I assume gcc 8.x? Which distro? I know Florian
fixed this, but I'm curious what you're using.

So DJ and I just talked about how we would catch this in CI/CD, but
we'd have to run fairly old unsupported Fedora to test this, so we'd
have to do something different.

-- 
Cheers,
Carlos.



* Re: [PATCH v12 1/2] Add 'codepoint_collation' support for LC_COLLATE.
  2021-09-07  1:57     ` Carlos O'Donell via Libc-alpha
@ 2021-09-20 12:49       ` Matheus Castanho via Libc-alpha
  2021-09-20 12:54         ` Carlos O'Donell via Libc-alpha
  0 siblings, 1 reply; 12+ messages in thread
From: Matheus Castanho via Libc-alpha @ 2021-09-20 12:49 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Florian Weimer, libc-alpha


Carlos O'Donell <carlos@redhat.com> writes:

> On 9/6/21 1:20 PM, Matheus Castanho wrote:
>> In file included from C-collate.c:22:
>> C-collate-seq.c:58:42: error: expected ‘,’ before ‘)’ token
>>  _Static_assert (sizeof (collseqmb) == 256);
>>                                           ^
>
> Which version of gcc? I assume gcc 8.x? Which distro? I know Florian
> fixed this, but I'm curious what you're using.
>
> So DJ and I just talked about how we would catch this in CI/CD, but
> we'd have to run fairly old unsupported Fedora to test this, so we'd
> have to do something different.

Sorry for the delay. One case where I saw it was using GCC from Debian
10:

$ gcc --version
gcc (Debian 8.3.0-6) 8.3.0
[...]

--
Matheus Castanho


* Re: [PATCH v12 1/2] Add 'codepoint_collation' support for LC_COLLATE.
  2021-09-20 12:49       ` Matheus Castanho via Libc-alpha
@ 2021-09-20 12:54         ` Carlos O'Donell via Libc-alpha
  0 siblings, 0 replies; 12+ messages in thread
From: Carlos O'Donell via Libc-alpha @ 2021-09-20 12:54 UTC (permalink / raw)
  To: Matheus Castanho; +Cc: Florian Weimer, libc-alpha

On 9/20/21 08:49, Matheus Castanho wrote:
> 
> Carlos O'Donell <carlos@redhat.com> writes:
> 
>> On 9/6/21 1:20 PM, Matheus Castanho wrote:
>>> In file included from C-collate.c:22:
>>> C-collate-seq.c:58:42: error: expected ‘,’ before ‘)’ token
>>>  _Static_assert (sizeof (collseqmb) == 256);
>>>                                           ^
>>
>> Which version of gcc? I assume gcc 8.x? Which distro? I know Florian
>> fixed this, but I'm curious what you're using.
>>
>> So DJ and I just talked about how we would catch this in CI/CD, but
>> we'd have to run fairly old unsupported Fedora to test this, so we'd
>> have to do something different.
> 
> Sorry for the delay. One case where I saw it was using GCC from Debian
> 10:
> 
> $ gcc --version
> gcc (Debian 8.3.0-6) 8.3.0
> [...]

Yea, so that's old enough that it would trigger the failure. Thanks, Florian fixed this 
with commit c9fef4b7d1d0f2dad192c74f06102752247677a9, and we shouldn't regress again
thanks to the requirement for 2 arguments.

-- 
Cheers,
Carlos.



* Re: [PATCH v12 2/2] Add generic C.UTF-8 locale (Bug 17318)
  2021-09-06 15:43 ` [PATCH v12 2/2] Add generic C.UTF-8 locale (Bug 17318) Carlos O'Donell via Libc-alpha
@ 2022-01-26  2:44   ` Michael Hudson-Doyle via Libc-alpha
  2022-01-28 16:42     ` Carlos O'Donell via Libc-alpha
  0 siblings, 1 reply; 12+ messages in thread
From: Michael Hudson-Doyle via Libc-alpha @ 2022-01-26  2:44 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Florian Weimer, libc-alpha

On Tue, 7 Sept 2021 at 03:45, Carlos O'Donell via Libc-alpha <
libc-alpha@sourceware.org> wrote:

> diff --git a/localedata/locales/C b/localedata/locales/C
> new file mode 100644
> index 0000000000..ca801c79cf
> --- /dev/null
> +++ b/localedata/locales/C


[...]


>
>
> +LC_TIME
> +% This is the POSIX Locale definition for the LC_TIME category with the
> +% exception that time is per ISO 8601 and 24-hour.
> +%
> +% Abbreviated weekday names (%a)
> +abday       "Sun";"Mon";"Tue";"Wed";"Thu";"Fri";"Sat"
> +
> +% Full weekday names (%A)
> +day         "Sunday";"Monday";"Tuesday";"Wednesday";"Thursday";/
> +            "Friday";"Saturday"
> +
> +% Abbreviated month names (%b)
> +abmon       "Jan";"Feb";"Mar";"Apr";"May";"Jun";"Jul";"Aug";"Sep";/
> +            "Oct";"Nov";"Dec"
> +
> +% Full month names (%B)
> +mon         "January";"February";"March";"April";"May";"June";"July";/
> +            "August";"September";"October";"November";"December"
> +
> +% Week description, consists of three fields:
> +% 1. Number of days in a week.
> +% 2. Gregorian date that is a first weekday (19971130 for Sunday, 19971201 for Monday).
> +% 3. The weekday number to be contained in the first week of the year.
> +%
> +% ISO 8601 conforming applications should use the values 7, 19971201 (a
> +% Monday), and 4 (Thursday), respectively.
> +week    7;19971201;4
>

It's obviously a bit late, but this is a difference from the Debian/Ubuntu
C.UTF-8 locale, which has:

week    7;19971130;4

(confusingly, this is preceded by this comment:

% ISO 8601 conforming applications should use the values 7, 19971130 (a
% Monday), and 4 (Thursday), respectively.

but 19971130 is a Sunday).

The locale(5) page from the man-pages project also says:

"For compatibility reasons, all glibc locales should set the value of the
second week list item to 19971130 (Sunday) and base the abday and day lists
appropriately,".

I found this because it breaks a test of rrdtool (which is probably buggy!
It sets LC_TIME but needs to clear LC_ALL for that to take any effect) and
I just wanted to check that this was truly the intended value before (even
if only just) the release.
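
To spell out the LC_ALL point, here is a rough sketch of the lookup
order that setlocale (category, "") follows according to POSIX. The
helpers is_set and effective_locale are hypothetical, written only for
illustration, and the final fallback to "C" is an assumption (POSIX
leaves that default implementation-defined).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: an environment variable counts only when it
   is set and non-empty.  */
static int
is_set (const char *value)
{
  return value != NULL && value[0] != '\0';
}

/* Hypothetical helper mirroring the POSIX precedence: LC_ALL first,
   then the category variable (e.g. LC_TIME), then LANG.  The "C"
   fallback is an assumption; POSIX calls it implementation-defined.  */
static const char *
effective_locale (const char *lc_all, const char *lc_category,
                  const char *lang)
{
  if (is_set (lc_all))
    return lc_all;
  if (is_set (lc_category))
    return lc_category;
  if (is_set (lang))
    return lang;
  return "C";
}

/* Usage sketch: LC_TIME is ignored for as long as LC_ALL is set.  */
static void
check_effective_locale (void)
{
  assert (strcmp (effective_locale ("C.UTF-8", "de_DE.UTF-8", NULL),
                  "C.UTF-8") == 0);
  assert (strcmp (effective_locale (NULL, "de_DE.UTF-8", "en_US.UTF-8"),
                  "de_DE.UTF-8") == 0);
}
```

So a test that exports LC_TIME while LC_ALL is still set in the
environment ends up with the LC_ALL locale for LC_TIME.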

Cheers,
mwh


> +first_weekday  1
> +first_workday  2
> +
> +% Appropriate date and time representation (%c)
> +d_t_fmt "%a %b %e %H:%M:%S %Y"
> +
> +% Appropriate date representation (%x)
> +d_fmt   "%m/%d/%y"
> +
> +% Appropriate time representation (%X)
> +t_fmt   "%H:%M:%S"
> +
> +% Appropriate AM/PM time representation (%r)
> +t_fmt_ampm "%I:%M:%S %p"
> +
> +% Equivalent of AM/PM (%p)
> +am_pm  "AM";"PM"
> +
> +% Appropriate date representation (date(1))
> +date_fmt       "%a %b %e %H:%M:%S %Z %Y"
> +END LC_TIME
>


* Re: [PATCH v12 2/2] Add generic C.UTF-8 locale (Bug 17318)
  2022-01-26  2:44   ` Michael Hudson-Doyle via Libc-alpha
@ 2022-01-28 16:42     ` Carlos O'Donell via Libc-alpha
  2022-01-30 23:58       ` Michael Hudson-Doyle via Libc-alpha
  0 siblings, 1 reply; 12+ messages in thread
From: Carlos O'Donell via Libc-alpha @ 2022-01-28 16:42 UTC (permalink / raw)
  To: Michael Hudson-Doyle; +Cc: Florian Weimer, libc-alpha

On 1/25/22 21:44, Michael Hudson-Doyle wrote:
> On Tue, 7 Sept 2021 at 03:45, Carlos O'Donell via Libc-alpha <
> libc-alpha@sourceware.org> wrote:
> 
>> diff --git a/localedata/locales/C b/localedata/locales/C
>> new file mode 100644
>> index 0000000000..ca801c79cf
>> --- /dev/null
>> +++ b/localedata/locales/C
> 
> 
> [...]
> 
> 
>>
>>
>> +LC_TIME
>> +% This is the POSIX Locale definition for the LC_TIME category with the
>> +% exception that time is per ISO 8601 and 24-hour.
>> +%
>> +% Abbreviated weekday names (%a)
>> +abday       "Sun";"Mon";"Tue";"Wed";"Thu";"Fri";"Sat"
>> +
>> +% Full weekday names (%A)
>> +day         "Sunday";"Monday";"Tuesday";"Wednesday";"Thursday";/
>> +            "Friday";"Saturday"
>> +
>> +% Abbreviated month names (%b)
>> +abmon       "Jan";"Feb";"Mar";"Apr";"May";"Jun";"Jul";"Aug";"Sep";/
>> +            "Oct";"Nov";"Dec"
>> +
>> +% Full month names (%B)
>> +mon         "January";"February";"March";"April";"May";"June";"July";/
>> +            "August";"September";"October";"November";"December"
>> +
>> +% Week description, consists of three fields:
>> +% 1. Number of days in a week.
>> +% 2. Gregorian date that is a first weekday (19971130 for Sunday, 19971201 for Monday).
>> +% 3. The weekday number to be contained in the first week of the year.
>> +%
>> +% ISO 8601 conforming applications should use the values 7, 19971201 (a
>> +% Monday), and 4 (Thursday), respectively.
>> +week    7;19971201;4
>>
> 
> It's obviously a bit late, but this is a difference from the Debian/Ubuntu

It is never too late! Thank you for raising this.

Given that you've had problems with one application, other applications will have problems too.

I think we should probably keep C == C.UTF-8 and not change any of the existing LC_TIME properties.

> C.UTF-8 locale, which has:
> 
> week    7;19971130;4

This is the default value from ISO 30112.

This data matches the internal C/POSIX locale.

e.g.

    { .string = "\7" },

7 days in the week.

    { .word = 19971130 },

Week start Sunday. This matches ISO 30112 definition if week is not specified.

    { .string = "\4" },

And Thursday needs to be included in the week for it to be considered a "first week."

    { .string = "\1" },
    { .string = "\2" },

And ld-time.c follows defaults from ISO 30112 also.

  /* Set up defaults based on ISO 30112 WD10 [2014].  */
  if (time->week_ndays == 0)
    time->week_ndays = 7;

  if (time->week_1stday == 0)
    time->week_1stday = 19971130;

  if (time->week_1stweek == 0)
    time->week_1stweek = 7;

> (confusingly, this is preceded by this comment:
> 
> % ISO 8601 conforming applications should use the values 7, 19971130 (a
> % Monday), and 4 (Thursday), respectively.
> 
> but 19971130 is a Sunday).

The above comment is wrong as you note, it is a Sunday.

The verbatim comment from ISO 30112 standard is:
~~~
ISO 8601 conforming applications should use the values 7, 19971201 (a
Monday), and 4 (Thursday), respectively.
~~~

Note the correction in the YYYYMMDD e.g. 19971201.
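
The two dates are easy to check. A minimal sketch, assuming only
standard mktime behaviour; weekday_of and check_weekday_of are
hypothetical helpers, not glibc code:

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: return the weekday (0 = Sunday ... 6 =
   Saturday) for a date written as YYYYMMDD, or -1 if mktime cannot
   represent it.  */
static int
weekday_of (int yyyymmdd)
{
  struct tm tm = { 0 };

  tm.tm_year = yyyymmdd / 10000 - 1900;
  tm.tm_mon = yyyymmdd / 100 % 100 - 1;
  tm.tm_mday = yyyymmdd % 100;
  tm.tm_hour = 12;	/* Noon, so a DST shift cannot change the date.  */
  tm.tm_isdst = -1;	/* Let mktime decide whether DST applies.  */

  /* mktime normalizes the structure and fills in tm_wday.  */
  if (mktime (&tm) == (time_t) -1)
    return -1;
  return tm.tm_wday;
}

/* Usage sketch for the two reference dates used in the week keyword.  */
static void
check_weekday_of (void)
{
  assert (weekday_of (19971130) == 0);	/* Sunday.  */
  assert (weekday_of (19971201) == 1);	/* Monday.  */
}
```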

In our upstream C.UTF-8 locale we are consciously aligning with ISO 8601 in more cases.

    % Week description, consists of three fields:
    % 1. Number of days in a week.
    % 2. Gregorian date that is a first weekday (19971130 for Sunday, 19971201 for Monday).
    % 3. The weekday number to be contained in the first week of the year.
    %
    % ISO 8601 conforming applications should use the values 7, 19971201 (a
    % Monday), and 4 (Thursday), respectively.
    week    7;19971201;4
    first_weekday   1
    first_workday   2

So there is a difference between C and C.UTF-8 in that they have a different first weekday.
 
> The locale(5) page from the man-pages project also says:
> 
> "For compatibility reasons, all glibc locales should set the value of the
> second week list item to 19971130 (Sunday) and base the abday and day lists
> appropriately,".

This is to align with ISO 30112, which is an older standard.

> I found this because it breaks a test of rrdtool (which is probably buggy!
> It sets LC_TIME but needs to clear LC_ALL for that to take any effect) and
> I just wanted to check that this was truly the intended value before (even
> if only just) the release.

In this case for C.UTF-8 we have aligned week with ISO 8601.

There are other parts of C.UTF-8's LC_TIME which are not aligned with ISO 8601.

However, this choice is perhaps inconsistent with the intent of C.UTF-8, so I think this is
actually a bug, and Florian found a real bug in d_fmt (it needs double slashes).

I'm going to post a patch to fix this and make it consistent with C.

-- 
Cheers,
Carlos.



* Re: [PATCH v12 2/2] Add generic C.UTF-8 locale (Bug 17318)
  2022-01-28 16:42     ` Carlos O'Donell via Libc-alpha
@ 2022-01-30 23:58       ` Michael Hudson-Doyle via Libc-alpha
  0 siblings, 0 replies; 12+ messages in thread
From: Michael Hudson-Doyle via Libc-alpha @ 2022-01-30 23:58 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Florian Weimer, libc-alpha

On Sat, 29 Jan 2022 at 05:42, Carlos O'Donell <carlos@redhat.com> wrote:

> On 1/25/22 21:44, Michael Hudson-Doyle wrote:
> > On Tue, 7 Sept 2021 at 03:45, Carlos O'Donell via Libc-alpha <
> > libc-alpha@sourceware.org> wrote:
> >
> >> diff --git a/localedata/locales/C b/localedata/locales/C
> >> new file mode 100644
> >> index 0000000000..ca801c79cf
> >> --- /dev/null
> >> +++ b/localedata/locales/C
> >
> >
> > [...]
> >
> >
> >>
> >>
> >> +LC_TIME
> >> +% This is the POSIX Locale definition for the LC_TIME category with the
> >> +% exception that time is per ISO 8601 and 24-hour.
> >> +%
> >> +% Abbreviated weekday names (%a)
> >> +abday       "Sun";"Mon";"Tue";"Wed";"Thu";"Fri";"Sat"
> >> +
> >> +% Full weekday names (%A)
> >> +day         "Sunday";"Monday";"Tuesday";"Wednesday";"Thursday";/
> >> +            "Friday";"Saturday"
> >> +
> >> +% Abbreviated month names (%b)
> >> +abmon       "Jan";"Feb";"Mar";"Apr";"May";"Jun";"Jul";"Aug";"Sep";/
> >> +            "Oct";"Nov";"Dec"
> >> +
> >> +% Full month names (%B)
> >> +mon         "January";"February";"March";"April";"May";"June";"July";/
> >> +            "August";"September";"October";"November";"December"
> >> +
> >> +% Week description, consists of three fields:
> >> +% 1. Number of days in a week.
> >> +% 2. Gregorian date that is a first weekday (19971130 for Sunday, 19971201 for Monday).
> >> +% 3. The weekday number to be contained in the first week of the year.
> >> +%
> >> +% ISO 8601 conforming applications should use the values 7, 19971201 (a
> >> +% Monday), and 4 (Thursday), respectively.
> >> +week    7;19971201;4
> >>
> >
> > It's obviously a bit late, but this is a difference from the Debian/Ubuntu
>
> It is never too late! Thank you for raising this.
>

No worries. Should have got to it sooner but well.


> Given that you've had problems with one application, other applications
> will have problems too.
>
> I think we should probably keep C == C.UTF-8 and not change any of the
> existing LC_TIME properties.
>

As far as I understand these issues (which is not very far) I think this
makes sense.

Cheers,
mwh


