unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed
From: Carlos O'Donell via Libc-alpha <libc-alpha@sourceware.org>
To: Florian Weimer <fweimer@redhat.com>,
	Carlos O'Donell via Libc-alpha <libc-alpha@sourceware.org>
Subject: Re: [PATCH v3] Add new C.UTF-8 locale (Bug 17318)
Date: Wed, 28 Apr 2021 07:54:39 -0400	[thread overview]
Message-ID: <ae1c23f9-55a2-4c87-3c8e-ac455b6ef0a3@redhat.com> (raw)
In-Reply-To: <87v99xcxzm.fsf@oldenburg.str.redhat.com>

On 3/11/21 2:05 PM, Florian Weimer wrote:
> * Carlos O'Donell via Libc-alpha:
> 
>> diff --git a/locale/programs/charmap.c b/locale/programs/charmap.c
>> index 3d51e702dc..77085cff72 100644
>> --- a/locale/programs/charmap.c
>> +++ b/locale/programs/charmap.c
>> @@ -49,7 +49,7 @@ static void new_width (struct linereader *cmfile, struct charmap_t *result,
> 
>> +  /* POSIX explicitly requires that ellipsis processing do the
>> +     following: "Bytes shall be treated as unsigned octets, and carry
>> +     shall be propagated between the bytes as necessary to represent the
>> +     range."  It then goes on to say that such a declaration should
>> +     never be specified because it creates NULL bytes.  Therefore we
> 
> NUL or null, I think.

Fixed.
 
>> +     error on this condition (see charmap_new_char).  However this still
>> +     leaves a problem for encodings which use less than the full 8-bits,
>> +     like UTF-8, and in such encodings you can use an ellipsis to
>> +     silently and accidentally create invalid ranges.  In UTF-8 you have
>> +     only the first 6-bits of the first byte and if your ellipsis covers
> 
> UTF-8 is variable length even in the leader byte, so “only the first
> 6-bits of the first byte” seems wrong.

Fixed.

>> +/* This function takes the Unicode code point CP and encodes it into
>> +   a UTF-8 byte stream that must be NBYTES long and is stored into
>> +   the unsigned character array at BYTES.
>> +
>> +   If CP requires more than NBYTES to be encoded then we return an
>> +   error of -1.
>> +
>> +   If CP is not within any of the valid Unicode code point ranges
>> +   then we return an error of -2.
>> +
>> +   Otherwise we return the number of bytes encoded.  */
>> +static int
>> +output_utf8_bytes (unsigned int cp, size_t nbytes, unsigned char *bytes)
>> +{
>> +  /* We need at least 1 byte.  */
>> +  if (nbytes < 1)
>> +    return -1;
>> +
>> +  /* One byte range.  */
>> +  if (cp >= 0x0 && cp <= 0x7f)
>> +    {
>> +      bytes[0] = cp & 0x7f;
>> +      return 1;
>> +    }
> 
> 0x7f is superfluous and confusing here, as discussed before.

Fixed.

>> diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
>> index 8cce47cd97..c70d359744 100644
>> --- a/localedata/charmaps/UTF-8
>> +++ b/localedata/charmaps/UTF-8
>> @@ -895,12 +895,14 @@ CHARMAP
> 
>> +<UD800>..<UDB7F> /xed/xa0/x80 <Non Private Use High Surrogate>
>> +<UDB80>..<UDBFF> /xed/xae/x80 <Private Use High Surrogate>
>> +<UDC00>..<UDFFF> /xed/xb0/x80 <Low Surrogate>
>> +<UE000>..<UF8FF> /xee/x80/x80 <Private Use>
> 
> Technically this isn't right.  We don't want mappings for those
> characters because it might introduce in other locale files that use
> those characters.  But may be just need to be careful.

Correct, per the standard:
http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G7404

The Unicode scalar values exclude the surrogates.

Such values mean the code unit sequence is ill-formed.

I'm going to remove them.

There is no benefit in having them, and since we're going to switch
to strcmp eventually, and a synthetic weights table via nl_langinfo,
I don't think we need them. It conflates two things: valid character
maps, and the data used for collation.

> I'm surprised that this doesn't lead to testsuite failures because it's
> inconsistent with the gconv converters.  Maybe we don't use this
> anywhere?

I don't think we use it anywhere except when parsing the locales
themselves and generating other data.

The gconv converters ignore the character map data. We can have out
of sync character maps and converters (which was the case with TSCII
which I'll fix later).

> The other invalid-ish Unicode codepoints (U+FFFE, U+FFFF) are actually
> valid UTF-8 and handled by gconv, so including them seems okay.

They are simiar to U+FDD0..U+FDEF. There are 66 noncharacters.

They are not invalid. They are noncharacters and the standard had to
clarify that they must be encodable, they must be transported, and we
must not modify them in any way. The noncharacters are for private use
for the standard itself. The standard says: "C2 A process shall interpret
a noncharacter code point as an abstract character."

>> diff --git a/localedata/locales/C b/localedata/locales/C
>> new file mode 100644
>> index 0000000000..418e7c90a5
>> --- /dev/null
>> +++ b/localedata/locales/C
>> @@ -0,0 +1,192 @@
> 
>> +% One rule, sort forward, for all code points to give code point
>> +% order sorting for Unicode.
>> +LC_COLLATE
>> +order_start forward
>> +<U00000000>
>> +..
>> +<U0000007F>
>> +<U00000080>
>> +..
>> +<U000007FF>
>> +<U00000800>
>> +..
>> +<U0000FFFF>
>> +<U00010000>
>> +..
>> +<U0010FFFF>
>> +UNDEFINED
>> +order_end
>> +END LC_COLLATE
> 
> Why are multiple ranges required here?

No reason. I'll split into two ranges that exclude surrogates.

>> diff --git a/localedata/locales/i18n_ctype b/localedata/locales/i18n_ctype
>> index c63e0790fc..c92bb95148 100644
>> --- a/localedata/locales/i18n_ctype
>> +++ b/localedata/locales/i18n_ctype
>> @@ -26,7 +26,7 @@ fax       ""
>>  language  ""
>>  territory "Earth"
>>  revision  "13.0.0"
>> -date      "2020-06-25"
>> +date      "2021-02-17"
>>  category  "i18n:2012";LC_CTYPE
>>  END LC_IDENTIFICATION
> 
> Those date changes seem spurious.  Is this no-op file regeneration
> really needed?

The protocol is:

cd localedata/unicode-gen
make install

The spurious regeneration is not needed, but it's easier to run the
above commands. It gives a date for the last generation for all files
consistently.

>> diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
>> index 899840923a..42fc5efcb9 100755
>> --- a/localedata/unicode-gen/utf8_gen.py
>> +++ b/localedata/unicode-gen/utf8_gen.py
> 
>>  def convert_to_hex(code_point):
>>      '''Converts a code point to a hexadecimal UTF-8 representation
>> +    like /x**/x**/x** without using any python library functions.
>> +    This avoids problems with the encode function, including an
>> +    inability to output the surrogate code points.
> 
> You can use chr(code_point).encode('UTF-8', 'surrogatepass') and the
> Python encoder.
> 
> I reviewed the other changes and spot-checked the generated charmap.
> Those parts look okay.

Fixed.
 
> The question is whether we actually need a UTF-8 charmap.  If not, we
> can teach charmap.c to generate the UTF-8 data on the fly in
> charmap_find_value and charmap_find_symbol.  But I consider this part of
> the data representation changes we discussed earlier.

Agreed.

-- 
Cheers,
Carlos.


  parent reply	other threads:[~2021-04-28 11:54 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-19  3:27 [PATCH v3] Add new C.UTF-8 locale (Bug 17318) Carlos O'Donell via Libc-alpha
2021-02-19  3:34 ` Carlos O'Donell via Libc-alpha
2021-02-19  8:47 ` Andreas Schwab
2021-03-11 19:05 ` Florian Weimer via Libc-alpha
2021-03-11 19:19   ` Florian Weimer via Libc-alpha
2021-03-11 22:55     ` Carlos O'Donell via Libc-alpha
2021-04-28 11:54   ` Carlos O'Donell via Libc-alpha [this message]
2021-04-28 17:36     ` Joseph Myers
2021-04-29 11:43       ` Carlos O'Donell via Libc-alpha
2021-04-29 11:57       ` Carlos O'Donell via Libc-alpha
  -- strict thread matches above, loose matches on Subject: below --
2021-02-19  3:26 Carlos O'Donell via Libc-alpha
2021-02-19  3:22 Carlos O'Donell via Libc-alpha

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/libc/involved.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ae1c23f9-55a2-4c87-3c8e-ac455b6ef0a3@redhat.com \
    --to=libc-alpha@sourceware.org \
    --cc=carlos@redhat.com \
    --cc=fweimer@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).