[PATCH v4 2/4] Update UTF-8 charmap processing. - Carlos O'Donell via Libc-alpha

From: Carlos O'Donell via Libc-alpha <libc-alpha@sourceware.org>
To: libc-alpha@sourceware.org, fweimer@redhat.com
Subject: [PATCH v4 2/4] Update UTF-8 charmap processing.
Date: Wed, 28 Apr 2021 09:00:31 -0400
Message-ID: <20210428130033.3196848-3-carlos@redhat.com>
In-Reply-To: <20210428130033.3196848-1-carlos@redhat.com>

The UTF-8 character map processing is updated to use the new wider
ellipsis support. In addition, compliance with Unicode Noncharacters
is improved by adding them to the UTF-8 character map, so that they
can be processed and transformed correctly when only the character
map is considered. All gaps in the UTF-8 character map, excluding
surrogates, are filled with unassigned blocks of characters. The
UTF-8 character map now includes all Unicode scalar values.

Tested by regenerating the locale data from the Unicode data and
running the testsuite.

Tested on x86_64 and i686 without regression.
---
 localedata/unicode-gen/utf8_gen.py | 133 +++++++++++++++++++----------
 1 file changed, 86 insertions(+), 47 deletions(-)
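
For reviewers, the gap filling described in the commit message can be
sketched on its own, outside the generator. This is only an
illustration and not the code in the patch: fill_gap and utf8_hex are
hypothetical names, ucs_symbol is a local stand-in that assumes the
<UXXXX> / <UXXXXXXXX> symbol format seen in the charmap, and the
function returns lines instead of writing to a file.

```python
def ucs_symbol(code_point):
    """Stand-in for unicode_utils.ucs_symbol (format is an assumption)."""
    if code_point < 0x10000:
        return '<U{:04X}>'.format(code_point)
    return '<U{:08X}>'.format(code_point)


def utf8_hex(code_point):
    """Charmap-style UTF-8 bytes, e.g. /xc2/xaf for U+00AF."""
    data = chr(code_point).encode('UTF-8', 'surrogatepass')
    return ''.join('/x{:02x}'.format(b) for b in data)


def fill_gap(cp_prev, cp_next):
    """Return charmap lines for the code points strictly between
    cp_prev (last value output) and cp_next (next value to output)."""
    first, last = cp_prev + 1, cp_next - 1
    if first > last:
        # Adjacent entries: there is nothing to fill.
        return []
    if first == last:
        # Exactly one missing code point: emit a single symbol.
        return ['{:<11s} {:<12s} {:s}'.format(
            ucs_symbol(first), utf8_hex(first), '<Unassigned>')]
    # Several missing code points: emit one ellipsis range.
    return ['{:s}..{:s} {:<12s} {:s}'.format(
        ucs_symbol(first), ucs_symbol(last),
        utf8_hex(first), '<Unassigned>')]


# The trailing gap, using U+110000 as the hypothetical next entry.
for line in fill_gap(0x10FFFD, 0x110000):
    print(line)
```

With U+10FFFD as the last entry, the trailing call yields a single
range covering U+10FFFE and U+10FFFF, which matches the final
process_gap call in the patch.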

diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index 899840923a..56a680bc06 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -81,25 +81,46 @@ def process_range(start, end, outfile, name):
     # 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
     # 4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
     #
-    # The glibc UTF-8 file splits ranges like these into shorter
+    # The old glibc UTF-8 file splits ranges like these into shorter
     # ranges of 64 code points each:
     #
     # <U3400>..<U343F>     /xe3/x90/x80         <CJK Ideograph Extension A>
     # …
     # <U4D80>..<U4DB5>     /xe4/xb6/x80         <CJK Ideograph Extension A>
-    for i in range(int(start, 16), int(end, 16), 64 ):
-        if i > (int(end, 16)-64):
-            outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
-                    unicode_utils.ucs_symbol(i),
-                    unicode_utils.ucs_symbol(int(end,16)),
-                    convert_to_hex(i),
-                    name))
-            break
-        outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
-                unicode_utils.ucs_symbol(i),
-                unicode_utils.ucs_symbol(i+63),
-                convert_to_hex(i),
-                name))
+    #
+    # We do not split the ranges like this. It is not required. The
+    # ellipsis processing in ld-collate.c can handle any sized ranges.
+    outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
+                  unicode_utils.ucs_symbol(int (start, 16)),
+                  unicode_utils.ucs_symbol(int (end, 16)),
+                  convert_to_hex (int (start, 16)),
+                  name))
+
+def process_gap (start, end, outfile):
+    '''This function processes a gap and fills it if needed.  The value
+       of start is the last value output, and the value of end is the
+       next value which may be output.  Therefore if there is a gap
+       between the two then it is filled with an ellipsis or a single
+       symbol.
+
+    '''
+    # If start and end are more than 1 away then we have a gap, and
+    # that needs filling to provide proper code-point collation
+    # support.
+    cp_prev = int(start, 16)
+    cp_next = int(end, 16)
+
+    # Special case of just one symbol missing?
+    if cp_next - 1 == cp_prev + 1:
+        outfile.write('{:<11s} {:<12s} {:s}\n'.format(
+                      unicode_utils.ucs_symbol(cp_prev + 1),
+                      convert_to_hex(cp_prev + 1),
+                      '<Unassigned>'))
+    elif cp_next > cp_prev + 1:
+        # More than one symbol, so use an ellipsis.
+        process_range ('{:x}'.format(cp_prev + 1),
+                       '{:x}'.format(cp_next - 1),
+                       outfile, '<Unassigned>')
 
 def process_charmap(flines, outfile):
     '''This function takes an array which contains *all* lines of
@@ -129,63 +150,81 @@ def process_charmap(flines, outfile):
     %<UDB7F>     /xed/xad/xbf <Non Private Use High Surrogate, Last>
     <U0010FFC0>..<U0010FFFD>     /xf4/x8f/xbf/x80 <Plane 16 Private Use>
 
+    The old glibc UTF-8 charmap left the surrogates commented out.
+    Surrogates are not Unicode scalar values, and are ill-formed code
+    sequences. We continue to comment them out in the character map to
+    ensure no locale accidentally uses these values. The use of
+    surrogate symbols will be treated as if they were UNDEFINED. The
+    converters will handle them as ill-formed code sequences and either
+    raise an error or transform them to REPLACEMENT CHARACTER.
     '''
     fields_start = []
+    fields_end = []
     for line in flines:
         fields = line.split(";")
-         # Some characters have “<control>” as their name. We try to
-         # use the “Unicode 1.0 Name” (10th field in
-         # UnicodeData.txt) for them.
-         #
-         # The Characters U+0080, U+0081, U+0084 and U+0099 have
-         # “<control>” as their name but do not even have aa
-         # ”Unicode 1.0 Name”. We could write code to take their
-         # alternate names from NameAliases.txt.
+        # Some characters have "<control>" as their name. We try to
+        # use the "Unicode 1.0 Name" (10th field in
+        # UnicodeData.txt) for them.
+        #
+        # The Characters U+0080, U+0081, U+0084 and U+0099 have
+        # "<control>" as their name but do not even have a
+        # "Unicode 1.0 Name". We could write code to take their
+        # alternate names from NameAliases.txt.
         if fields[1] == "<control>" and fields[10]:
             fields[1] = fields[10]
         # Handling code point ranges like:
         #
         # 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
         # 4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
-        if fields[1].endswith(', First>') and not 'Surrogate,' in fields[1]:
+        if fields[1].endswith(', First>'):
             fields_start = fields
             continue
-        if fields[1].endswith(', Last>') and not 'Surrogate,' in fields[1]:
+        if fields[1].endswith(', Last>'):
+            # 1. Process the gap.
+            # First process the gap between the last entry and the
+            # newly started range.
+            process_gap (fields_end[0], fields_start[0], outfile)
+            # 2. Exclude surrogate ranges.
+            # Comment out the surrogates in the UTF-8 file.
+            # One could of course skip them completely but
+            # the original UTF-8 file in glibc had them as
+            # comments, so we keep these comment lines.
+            if 'Surrogate,' in fields[1]:
+                outfile.write('%')
+            # 3. Process the range.
             process_range(fields_start[0], fields[0],
                           outfile, fields[1][:-7]+'>')
             fields_start = []
+            fields_end = fields
             continue
         fields_start = []
-        if 'Surrogate,' in fields[1]:
-            # Comment out the surrogates in the UTF-8 file.
-            # One could of course skip them completely but
-            # the original UTF-8 file in glibc had them as
-            # comments, so we keep these comment lines.
-            outfile.write('%')
+
+        if len (fields_end) > 0:
+            process_gap (fields_end[0], fields[0], outfile)
+
         outfile.write('{:<11s} {:<12s} {:s}\n'.format(
                 unicode_utils.ucs_symbol(int(fields[0], 16)),
                 convert_to_hex(int(fields[0], 16)),
                 fields[1]))
 
+        fields_end = fields
+    # We may need to output a final set of symbols if we are not yet at
+    # U+10FFFF, so check that last gap.  We use U+110000 as the
+    # hypothetical next entry.  In practice UTF-8 ends at U+10FFFD and
+    # so indeed we have 2 missing symbols at the end.
+    process_gap (fields_end[0], '110000', outfile)
+
 def convert_to_hex(code_point):
     '''Converts a code point to a hexadecimal UTF-8 representation
-    like /x**/x**/x**.'''
-    # Getting UTF8 of Unicode characters.
-    # In Python3, .encode('UTF-8') does not work for
-    # surrogates. Therefore, we use this conversion table
-    surrogates = {
-        0xD800: '/xed/xa0/x80',
-        0xDB7F: '/xed/xad/xbf',
-        0xDB80: '/xed/xae/x80',
-        0xDBFF: '/xed/xaf/xbf',
-        0xDC00: '/xed/xb0/x80',
-        0xDFFF: '/xed/xbf/xbf',
-    }
-    if code_point in surrogates:
-        return surrogates[code_point]
-    return ''.join([
-        '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
-    ])
+    ready for use in a locale character map specification e.g.
+    /xc2/xaf for MACRON.
+
+    '''
+    cp_locale = ''
+    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
+    for byte in cp_bytes:
+        cp_locale += '/x{:02x}'.format(byte)
+    return cp_locale
 
 def write_header_charmap(outfile):
     '''Write the header on top of the CHARMAP section to the output file'''
-- 
2.26.3
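
As a sanity check on the convert_to_hex change, the following sketch
compares the 'surrogatepass' error handler against the lookup table
that the patch removes. It is a standalone illustration, not part of
the patch: to_charmap_hex and strict_encode_fails are hypothetical
names, and the table is copied from the removed lines above.

```python
# Entries from the lookup table removed by the patch.
OLD_SURROGATE_TABLE = {
    0xD800: '/xed/xa0/x80',
    0xDB7F: '/xed/xad/xbf',
    0xDB80: '/xed/xae/x80',
    0xDBFF: '/xed/xaf/xbf',
    0xDC00: '/xed/xb0/x80',
    0xDFFF: '/xed/xbf/xbf',
}


def to_charmap_hex(code_point):
    """Encode a code point as /xNN bytes, letting surrogates through."""
    data = chr(code_point).encode('UTF-8', 'surrogatepass')
    return ''.join('/x{:02x}'.format(b) for b in data)


def strict_encode_fails(code_point):
    """True if the default (strict) UTF-8 encoder rejects the code point."""
    try:
        chr(code_point).encode('UTF-8')
    except UnicodeEncodeError:
        return True
    return False


# Every table entry should be reproduced by the error handler.
mismatches = [cp for cp, expected in OLD_SURROGATE_TABLE.items()
              if to_charmap_hex(cp) != expected]
print('mismatches:', mismatches)
```

The strict encoder rejects surrogates, which is why the old code
needed the table; with 'surrogatepass' the same byte sequences fall
out of the encoder for any surrogate, not only the six range
endpoints.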

