From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: [PATCH v3] Add new C.UTF-8 locale (Bug 17318)
To: Florian Weimer <fweimer@redhat.com>,
 Carlos O'Donell via Libc-alpha <libc-alpha@sourceware.org>
References: <20210219032748.564216-1-carlos@redhat.com>
 <87v99xcxzm.fsf@oldenburg.str.redhat.com>
Organization: Red Hat
Message-ID: <ae1c23f9-55a2-4c87-3c8e-ac455b6ef0a3@redhat.com>
Date: Wed, 28 Apr 2021 07:54:39 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.8.1
MIME-Version: 1.0
In-Reply-To: <87v99xcxzm.fsf@oldenburg.str.redhat.com>
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 8bit
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
From: Carlos O'Donell via Libc-alpha <libc-alpha@sourceware.org>
Reply-To: Carlos O'Donell <carlos@redhat.com>
Errors-To: libc-alpha-bounces@sourceware.org
Sender: "Libc-alpha" <libc-alpha-bounces@sourceware.org>

On 3/11/21 2:05 PM, Florian Weimer wrote:
> * Carlos O'Donell via Libc-alpha:
> 
>> diff --git a/locale/programs/charmap.c b/locale/programs/charmap.c
>> index 3d51e702dc..77085cff72 100644
>> --- a/locale/programs/charmap.c
>> +++ b/locale/programs/charmap.c
>> @@ -49,7 +49,7 @@ static void new_width (struct linereader *cmfile, struct charmap_t *result,
> 
>> +  /* POSIX explicitly requires that ellipsis processing do the
>> +     following: "Bytes shall be treated as unsigned octets, and carry
>> +     shall be propagated between the bytes as necessary to represent the
>> +     range."  It then goes on to say that such a declaration should
>> +     never be specified because it creates NULL bytes.  Therefore we
> 
> NUL or null, I think.

Fixed.
 
>> +     error on this condition (see charmap_new_char).  However this still
>> +     leaves a problem for encodings which use less than the full 8-bits,
>> +     like UTF-8, and in such encodings you can use an ellipsis to
>> +     silently and accidentally create invalid ranges.  In UTF-8 you have
>> +     only the first 6-bits of the first byte and if your ellipsis covers
> 
> UTF-8 is variable length even in the leader byte, so “only the first
> 6-bits of the first byte” seems wrong.

Fixed.
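
To illustrate the point about the leader byte (a quick Python sketch, not
part of the patch; the helper name leader_payload_bits is made up here),
the number of payload bits in the first byte depends on the sequence
length:

```python
def leader_payload_bits(cp):
    """Return (sequence length, payload bits in the leader byte) for a
    Unicode scalar value, using Python's own UTF-8 encoder."""
    encoded = chr(cp).encode("utf-8")
    lead = encoded[0]
    if lead < 0x80:        # 0xxxxxxx: one-byte sequence
        bits = 7
    elif lead < 0xE0:      # 110xxxxx: two-byte sequence
        bits = 5
    elif lead < 0xF0:      # 1110xxxx: three-byte sequence
        bits = 4
    else:                  # 11110xxx: four-byte sequence
        bits = 3
    return len(encoded), bits


# Largest code point of each sequence length (0x41 stands in for ASCII).
for cp in (0x41, 0x7FF, 0xFFFF, 0x10FFFF):
    print(hex(cp), leader_payload_bits(cp))
```

So the leader byte carries 7, 5, 4 or 3 payload bits, never a fixed 6.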

>> +/* This function takes the Unicode code point CP and encodes it into
>> +   a UTF-8 byte stream that must be NBYTES long and is stored into
>> +   the unsigned character array at BYTES.
>> +
>> +   If CP requires more than NBYTES to be encoded then we return an
>> +   error of -1.
>> +
>> +   If CP is not within any of the valid Unicode code point ranges
>> +   then we return an error of -2.
>> +
>> +   Otherwise we return the number of bytes encoded.  */
>> +static int
>> +output_utf8_bytes (unsigned int cp, size_t nbytes, unsigned char *bytes)
>> +{
>> +  /* We need at least 1 byte.  */
>> +  if (nbytes < 1)
>> +    return -1;
>> +
>> +  /* One byte range.  */
>> +  if (cp >= 0x0 && cp <= 0x7f)
>> +    {
>> +      bytes[0] = cp & 0x7f;
>> +      return 1;
>> +    }
> 
> 0x7f is superfluous and confusing here, as discussed before.

Fixed.
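
A trivial Python check (illustrative only, not the C code from the patch)
of why the mask adds nothing once the range test has passed:

```python
# Every code point in the one-byte range already fits in 7 bits, so
# masking with 0x7f cannot change its value.
one_byte_range = range(0x00, 0x80)
mask_is_noop = all((cp & 0x7F) == cp for cp in one_byte_range)
print(mask_is_noop)
```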

>> diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
>> index 8cce47cd97..c70d359744 100644
>> --- a/localedata/charmaps/UTF-8
>> +++ b/localedata/charmaps/UTF-8
>> @@ -895,12 +895,14 @@ CHARMAP
> 
>> +<UD800>..<UDB7F> /xed/xa0/x80 <Non Private Use High Surrogate>
>> +<UDB80>..<UDBFF> /xed/xae/x80 <Private Use High Surrogate>
>> +<UDC00>..<UDFFF> /xed/xb0/x80 <Low Surrogate>
>> +<UE000>..<UF8FF> /xee/x80/x80 <Private Use>
> 
> Technically this isn't right.  We don't want mappings for those
> characters because it might introduce in other locale files that use
> those characters.  But may be just need to be careful.

Correct, per the standard:
http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G7404

The Unicode scalar values exclude the surrogates.

Such values mean the code unit sequence is ill-formed.

I'm going to remove them.

There is no benefit in having them, and since we're eventually going to
switch to strcmp and a synthetic weights table via nl_langinfo, I don't
think we need them. Including them conflates two things: valid character
maps, and the data used for collation.

> I'm surprised that this doesn't lead to testsuite failures because it's
> inconsistent with the gconv converters.  Maybe we don't use this
> anywhere?

I don't think we use it anywhere except when parsing the locales
themselves and generating other data.

The gconv converters ignore the character map data. We can have
out-of-sync character maps and converters (which was the case with
TSCII, which I'll fix later).

> The other invalid-ish Unicode codepoints (U+FFFE, U+FFFF) are actually
> valid UTF-8 and handled by gconv, so including them seems okay.

They are similar to U+FDD0..U+FDEF. There are 66 noncharacters in total.

They are not invalid. They are noncharacters and the standard had to
clarify that they must be encodable, they must be transported, and we
must not modify them in any way. The noncharacters are for private use
for the standard itself. The standard says: "C2 A process shall not
interpret a noncharacter code point as an abstract character."
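
For the record, the count of 66 can be checked with a few lines of Python
(a sketch; is_noncharacter is a name made up for this example). The set is
U+FDD0..U+FDEF plus the last two code points of each of the 17 planes:

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for U+xxFFFE / U+xxFFFF in any plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))

# Unlike surrogates, every noncharacter round-trips through strict UTF-8.
round_trips = all(
    chr(cp).encode("utf-8").decode("utf-8") == chr(cp) for cp in noncharacters
)
print(round_trips)
```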

>> diff --git a/localedata/locales/C b/localedata/locales/C
>> new file mode 100644
>> index 0000000000..418e7c90a5
>> --- /dev/null
>> +++ b/localedata/locales/C
>> @@ -0,0 +1,192 @@
> 
>> +% One rule, sort forward, for all code points to give code point
>> +% order sorting for Unicode.
>> +LC_COLLATE
>> +order_start forward
>> +<U00000000>
>> +..
>> +<U0000007F>
>> +<U00000080>
>> +..
>> +<U000007FF>
>> +<U00000800>
>> +..
>> +<U0000FFFF>
>> +<U00010000>
>> +..
>> +<U0010FFFF>
>> +UNDEFINED
>> +order_end
>> +END LC_COLLATE
> 
> Why are multiple ranges required here?

No reason. I'll split into two ranges that exclude surrogates.
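
A rough Python sketch of what the two ranges cover, assuming they end up
as U+0000..U+D7FF and U+E000..U+10FFFF (the exact boundaries in the next
revision may differ). It also spot-checks, on a random sample, that code
point order matches byte-wise order of the UTF-8 encoding:

```python
import random

# Assumed ranges: all Unicode scalar values, i.e. everything except the
# surrogate block U+D800..U+DFFF.
SCALAR_RANGES = [(0x0000, 0xD7FF), (0xE000, 0x10FFFF)]

total = sum(hi - lo + 1 for lo, hi in SCALAR_RANGES)
print(total)

# Sample both ranges with a fixed seed so the check is repeatable.
rng = random.Random(0)
sample = [rng.randint(lo, hi) for lo, hi in SCALAR_RANGES for _ in range(1000)]

by_code_point = sorted(sample)
by_utf8_bytes = sorted(sample, key=lambda cp: chr(cp).encode("utf-8"))
print(by_code_point == by_utf8_bytes)
```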

>> diff --git a/localedata/locales/i18n_ctype b/localedata/locales/i18n_ctype
>> index c63e0790fc..c92bb95148 100644
>> --- a/localedata/locales/i18n_ctype
>> +++ b/localedata/locales/i18n_ctype
>> @@ -26,7 +26,7 @@ fax       ""
>>  language  ""
>>  territory "Earth"
>>  revision  "13.0.0"
>> -date      "2020-06-25"
>> +date      "2021-02-17"
>>  category  "i18n:2012";LC_CTYPE
>>  END LC_IDENTIFICATION
> 
> Those date changes seem spurious.  Is this no-op file regeneration
> really needed?

The protocol is:

cd localedata/unicode-gen
make install

The regeneration is not strictly needed, but it's easier to just run the
above commands, and doing so consistently stamps all files with the date
of the last generation.

>> diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
>> index 899840923a..42fc5efcb9 100755
>> --- a/localedata/unicode-gen/utf8_gen.py
>> +++ b/localedata/unicode-gen/utf8_gen.py
> 
>>  def convert_to_hex(code_point):
>>      '''Converts a code point to a hexadecimal UTF-8 representation
>> +    like /x**/x**/x** without using any python library functions.
>> +    This avoids problems with the encode function, including an
>> +    inability to output the surrogate code points.
> 
> You can use chr(code_point).encode('UTF-8', 'surrogatepass') and the
> Python encoder.
> 
> I reviewed the other changes and spot-checked the generated charmap.
> Those parts look okay.

Fixed.
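
A minimal sketch of what that suggestion amounts to (the body below is
illustrative and may not match what ends up in utf8_gen.py; only the
function name and the /x** output format come from the patch):

```python
def convert_to_hex(code_point):
    """Return the UTF-8 encoding of code_point as /x**/x**... text.

    'surrogatepass' lets the Python encoder emit surrogate code points
    instead of raising UnicodeEncodeError.
    """
    encoded = chr(code_point).encode("UTF-8", "surrogatepass")
    return "".join("/x{:02x}".format(byte) for byte in encoded)


print(convert_to_hex(0x41))
print(convert_to_hex(0xE000))
print(convert_to_hex(0xD800))
```

With the surrogate ranges dropped from the charmap, the 'surrogatepass'
handler only matters if the script is ever asked to print them again.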
 
> The question is whether we actually need a UTF-8 charmap.  If not, we
> can teach charmap.c to generate the UTF-8 data on the fly in
> charmap_find_value and charmap_find_symbol.  But I consider this part of
> the data representation changes we discussed earlier.

Agreed.

-- 
Cheers,
Carlos.