From: Carlos O'Donell via Libc-alpha <libc-alpha@sourceware.org>
To: libc-alpha@sourceware.org, fweimer@redhat.com
Subject: [PATCH v4 2/4] Update UTF-8 charmap processing.
Date: Wed, 28 Apr 2021 09:00:31 -0400
Message-ID: <20210428130033.3196848-3-carlos@redhat.com>
In-Reply-To: <20210428130033.3196848-1-carlos@redhat.com>

The UTF-8 character map processing is updated to use the new wider
ellipsis support. In addition, compliance with the Unicode rules for
Noncharacters is improved: Noncharacters are added to the UTF-8
character map so that they can be processed and transformed correctly
when only the character map is considered. All gaps in the UTF-8
character map, excluding surrogates, are filled with blocks of
unassigned characters, so the UTF-8 character map now includes all
Unicode scalar values.

Tested by regenerating the locale data from the Unicode data and
running the testsuite.

Tested on x86_64 and i686 without regression.
---
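For readers who want to see the gap filling in isolation, here is a
minimal sketch of the idea behind process_gap. It is not the code from
the patch: gap_lines, utf8_hex and the local ucs_symbol stand-in are
hypothetical names, the symbol format is assumed to match
unicode_utils.ucs_symbol, and the column spacing is only approximate.

```python
# Illustrative sketch only; not part of the patch.

def ucs_symbol(cp):
    """Stand-in for unicode_utils.ucs_symbol (assumed output format)."""
    if cp < 0x10000:
        return '<U{:04X}>'.format(cp)
    return '<U{:08X}>'.format(cp)


def utf8_hex(cp):
    """Charmap-style UTF-8 bytes, e.g. /xc2/xaf for U+00AF."""
    data = chr(cp).encode('UTF-8', 'surrogatepass')
    return ''.join('/x{:02x}'.format(b) for b in data)


def gap_lines(cp_prev, cp_next):
    """Return charmap lines covering the code points strictly between
    cp_prev (last one written) and cp_next (next one to be written)."""
    first, last = cp_prev + 1, cp_next - 1
    if first > last:
        # Adjacent code points: nothing to fill.
        return []
    if first == last:
        # Exactly one missing symbol: emit a single entry.
        return ['{:<11s} {:<12s} {:s}'.format(
            ucs_symbol(first), utf8_hex(first), '<Unassigned>')]
    # More than one missing symbol: emit one ellipsis range.
    return ['{:s}..{:s} {:<12s} {:s}'.format(
        ucs_symbol(first), ucs_symbol(last), utf8_hex(first),
        '<Unassigned>')]
```

With U+10FFFD as the last entry and U+110000 as the hypothetical next
one, gap_lines(0x10FFFD, 0x110000) yields a single range line covering
<U0010FFFE>..<U0010FFFF>, which is the final gap the patch fills.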
localedata/unicode-gen/utf8_gen.py | 133 +++++++++++++++++++----------
1 file changed, 86 insertions(+), 47 deletions(-)
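As a sanity check on the convert_to_hex change, the sketch below
compares Python's 'surrogatepass' error handler against the
hand-written surrogate table that the patch removes. The names
to_charmap_hex, strict_encode_fails and OLD_SURROGATE_TABLE are
hypothetical and exist only for this illustration.

```python
# Illustrative sketch only; not part of the patch.

# The six entries of the lookup table removed from convert_to_hex.
OLD_SURROGATE_TABLE = {
    0xD800: '/xed/xa0/x80',
    0xDB7F: '/xed/xad/xbf',
    0xDB80: '/xed/xae/x80',
    0xDBFF: '/xed/xaf/xbf',
    0xDC00: '/xed/xb0/x80',
    0xDFFF: '/xed/xbf/xbf',
}


def to_charmap_hex(code_point):
    """Encode one code point as /xNN/xNN... using 'surrogatepass' so
    that surrogate code points do not raise UnicodeEncodeError."""
    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
    return ''.join('/x{:02x}'.format(b) for b in cp_bytes)


def strict_encode_fails(code_point):
    """Return True if the default (strict) UTF-8 encoder rejects the
    code point, which is the case for surrogates."""
    try:
        chr(code_point).encode('UTF-8')
    except UnicodeEncodeError:
        return True
    return False


def table_matches():
    """Check that 'surrogatepass' reproduces every old table entry."""
    return all(to_charmap_hex(cp) == expected
               for cp, expected in OLD_SURROGATE_TABLE.items())
```

If table_matches() holds, the table is redundant: the error handler
produces the same byte sequences for the range endpoints and also
covers every other surrogate code point.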
diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index 899840923a..56a680bc06 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -81,25 +81,46 @@ def process_range(start, end, outfile, name):
# 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
# 4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
#
- # The glibc UTF-8 file splits ranges like these into shorter
+ # The old glibc UTF-8 file splits ranges like these into shorter
# ranges of 64 code points each:
#
# <U3400>..<U343F> /xe3/x90/x80 <CJK Ideograph Extension A>
# …
# <U4D80>..<U4DB5> /xe4/xb6/x80 <CJK Ideograph Extension A>
- for i in range(int(start, 16), int(end, 16), 64 ):
- if i > (int(end, 16)-64):
- outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
- unicode_utils.ucs_symbol(i),
- unicode_utils.ucs_symbol(int(end,16)),
- convert_to_hex(i),
- name))
- break
- outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
- unicode_utils.ucs_symbol(i),
- unicode_utils.ucs_symbol(i+63),
- convert_to_hex(i),
- name))
+ #
+ # We do not split the ranges like this. It is not required. The
+ # ellipsis processing in ld-collate.c can handle any sized ranges.
+ outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
+ unicode_utils.ucs_symbol(int (start, 16)),
+ unicode_utils.ucs_symbol(int (end, 16)),
+ convert_to_hex (int (start, 16)),
+ name))
+
+def process_gap (start, end, outfile):
+ '''This function processes a gap and fills it if needed. The value
+ of start is the last value output, and the value of end is the
+ next value which may be output. Therefore if there is a gap
+ between the two then it is filled with an ellipsis or a single
+ symbol.
+
+ '''
+ # If start and end are more than 1 away then we have a gap, and
+ # that needs filling to provide proper code-point collation
+ # support.
+ cp_prev = int(start, 16)
+ cp_next = int(end, 16)
+
+ # Special case of just one symbol missing?
+ if cp_next - 1 == cp_prev + 1:
+ outfile.write('{:<11s} {:<12s} {:s}\n'.format(
+ unicode_utils.ucs_symbol(cp_prev + 1),
+ convert_to_hex(cp_prev + 1),
+ '<Unassigned>'))
+ elif cp_next > cp_prev + 1:
+ # More than one symbol, so use an ellipsis.
+ process_range ('{:x}'.format(cp_prev + 1),
+ '{:x}'.format(cp_next - 1),
+ outfile, '<Unassigned>')
def process_charmap(flines, outfile):
'''This function takes an array which contains *all* lines of
@@ -129,63 +150,81 @@ def process_charmap(flines, outfile):
%<UDB7F> /xed/xad/xbf <Non Private Use High Surrogate, Last>
<U0010FFC0>..<U0010FFFD> /xf4/x8f/xbf/x80 <Plane 16 Private Use>
+ The old glibc UTF-8 charmap left the surrogates commented out.
+ Surrogates are not Unicode scalar values, and are ill-formed code
+ sequences. We continue to comment them out in the character map to
+ ensure no locale accidentally uses these values. The use of
+ surrogate symbols will be treated as if they were UNDEFINED. The
+ converters will handle them as ill-formed code sequences and either
+ raise an error or transform them to REPLACEMENT CHARACTER.
'''
fields_start = []
+ fields_end = []
for line in flines:
fields = line.split(";")
- # Some characters have “<control>” as their name. We try to
- # use the “Unicode 1.0 Name” (10th field in
- # UnicodeData.txt) for them.
- #
- # The Characters U+0080, U+0081, U+0084 and U+0099 have
- # “<control>” as their name but do not even have aa
- # ”Unicode 1.0 Name”. We could write code to take their
- # alternate names from NameAliases.txt.
+ # Some characters have "<control>" as their name. We try to
+ # use the "Unicode 1.0 Name" (10th field in
+ # UnicodeData.txt) for them.
+ #
+ # The Characters U+0080, U+0081, U+0084 and U+0099 have
+ # "<control>" as their name but do not even have a
+ # "Unicode 1.0 Name". We could write code to take their
+ # alternate names from NameAliases.txt.
if fields[1] == "<control>" and fields[10]:
fields[1] = fields[10]
# Handling code point ranges like:
#
# 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
# 4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
- if fields[1].endswith(', First>') and not 'Surrogate,' in fields[1]:
+ if fields[1].endswith(', First>'):
fields_start = fields
continue
- if fields[1].endswith(', Last>') and not 'Surrogate,' in fields[1]:
+ if fields[1].endswith(', Last>'):
+ # 1. Process the gap.
+ # First process the gap between the last entry and the
+ # newly started range.
+ process_gap (fields_end[0], fields_start[0], outfile)
+ # 2. Exclude surrogate ranges.
+ # Comment out the surrogates in the UTF-8 file.
+ # One could of course skip them completely but
+ # the original UTF-8 file in glibc had them as
+ # comments, so we keep these comment lines.
+ if 'Surrogate,' in fields[1]:
+ outfile.write('%')
+ # 3. Process the range.
process_range(fields_start[0], fields[0],
outfile, fields[1][:-7]+'>')
fields_start = []
+ fields_end = fields
continue
fields_start = []
- if 'Surrogate,' in fields[1]:
- # Comment out the surrogates in the UTF-8 file.
- # One could of course skip them completely but
- # the original UTF-8 file in glibc had them as
- # comments, so we keep these comment lines.
- outfile.write('%')
+
+ if len (fields_end) > 0:
+ process_gap (fields_end[0], fields[0], outfile)
+
outfile.write('{:<11s} {:<12s} {:s}\n'.format(
unicode_utils.ucs_symbol(int(fields[0], 16)),
convert_to_hex(int(fields[0], 16)),
fields[1]))
+ fields_end = fields
+ # We may need to output a final set of symbols if we are not yet at
+ # U+10FFFF, so check that last gap. We use U+110000 as the
+ # hypothetical next entry. In practice UnicodeData.txt ends at
+ # U+10FFFD and so indeed we have 2 missing symbols at the end.
+ process_gap (fields_end[0], '110000', outfile)
+
def convert_to_hex(code_point):
'''Converts a code point to a hexadecimal UTF-8 representation
- like /x**/x**/x**.'''
- # Getting UTF8 of Unicode characters.
- # In Python3, .encode('UTF-8') does not work for
- # surrogates. Therefore, we use this conversion table
- surrogates = {
- 0xD800: '/xed/xa0/x80',
- 0xDB7F: '/xed/xad/xbf',
- 0xDB80: '/xed/xae/x80',
- 0xDBFF: '/xed/xaf/xbf',
- 0xDC00: '/xed/xb0/x80',
- 0xDFFF: '/xed/xbf/xbf',
- }
- if code_point in surrogates:
- return surrogates[code_point]
- return ''.join([
- '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
- ])
+ ready for use in a locale character map specification e.g.
+ /xc2/xaf for MACRON.
+
+ '''
+ cp_locale = ''
+ cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
+ for byte in cp_bytes:
+ cp_locale += ''.join('/x{:02x}'.format(byte))
+ return cp_locale
def write_header_charmap(outfile):
'''Write the header on top of the CHARMAP section to the output file'''
--
2.26.3