To: Carlos O'Donell via Libc-alpha <libc-alpha@sourceware.org>
Subject: Re: [PATCH v3] Add new C.UTF-8 locale (Bug 17318)
References: <20210219032748.564216-1-carlos@redhat.com>
Date: Thu, 11 Mar 2021 20:05:17 +0100
In-Reply-To: <20210219032748.564216-1-carlos@redhat.com> (Carlos O'Donell via
 Libc-alpha's message of "Thu, 18 Feb 2021 22:27:48 -0500")
Message-ID: <87v99xcxzm.fsf@oldenburg.str.redhat.com>
From: Florian Weimer via Libc-alpha <libc-alpha@sourceware.org>
Reply-To: Florian Weimer <fweimer@redhat.com>

* Carlos O'Donell via Libc-alpha:

> diff --git a/locale/programs/charmap.c b/locale/programs/charmap.c
> index 3d51e702dc..77085cff72 100644
> --- a/locale/programs/charmap.c
> +++ b/locale/programs/charmap.c
> @@ -49,7 +49,7 @@ static void new_width (struct linereader *cmfile, struct charmap_t *result,

> +  /* POSIX explicitly requires that ellipsis processing do the
> +     following: "Bytes shall be treated as unsigned octets, and carry
> +     shall be propagated between the bytes as necessary to represent the
> +     range."  It then goes on to say that such a declaration should
> +     never be specified because it creates NULL bytes.  Therefore we

NUL or null, I think.

> +     error on this condition (see charmap_new_char).  However this still
> +     leaves a problem for encodings which use less than the full 8-bits,
> +     like UTF-8, and in such encodings you can use an ellipsis to
> +     silently and accidentally create invalid ranges.  In UTF-8 you have
> +     only the first 6-bits of the first byte and if your ellipsis covers

UTF-8 is variable length even in the leader byte, so “only the first
6-bits of the first byte” seems wrong.
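
To illustrate the point about the leader byte, here is a minimal Python
sketch (not part of the patch; the helper name leader_payload_bits is
made up for this example).  It counts how many code point bits the first
byte of a UTF-8 sequence carries, which depends on the sequence length:

```python
def leader_payload_bits(cp):
    """Return how many code point bits the UTF-8 leader byte of CP carries.

    Hypothetical helper for illustration only.
    """
    leader = chr(cp).encode("utf-8")[0]
    # Count the leading 1 bits of the leader byte: 0 for a 1-byte
    # sequence, 2, 3 or 4 for longer sequences.
    ones = 0
    mask = 0x80
    while leader & mask:
        ones += 1
        mask >>= 1
    # The prefix is the run of 1 bits plus a terminating 0 bit; the
    # rest of the byte is payload.
    return 7 - ones


for cp in (0x41, 0x7FF, 0xFFFF, 0x10FFFF):
    print("U+%04X: %d payload bits" % (cp, leader_payload_bits(cp)))
```

The leader byte carries 7, 5, 4 or 3 payload bits for 1-, 2-, 3- and
4-byte sequences respectively, so no single bit count applies.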

> +/* This function takes the Unicode code point CP and encodes it into
> +   a UTF-8 byte stream that must be NBYTES long and is stored into
> +   the unsigned character array at BYTES.
> +
> +   If CP requires more than NBYTES to be encoded then we return an
> +   error of -1.
> +
> +   If CP is not within any of the valid Unicode code point ranges
> +   then we return an error of -2.
> +
> +   Otherwise we return the number of bytes encoded.  */
> +static int
> +output_utf8_bytes (unsigned int cp, size_t nbytes, unsigned char *bytes)
> +{
> +  /* We need at least 1 byte.  */
> +  if (nbytes < 1)
> +    return -1;
> +
> +  /* One byte range.  */
> +  if (cp >= 0x0 && cp <= 0x7f)
> +    {
> +      bytes[0] = cp & 0x7f;
> +      return 1;
> +    }

The & 0x7f mask is superfluous and confusing here, as discussed before.

> diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
> index 8cce47cd97..c70d359744 100644
> --- a/localedata/charmaps/UTF-8
> +++ b/localedata/charmaps/UTF-8
> @@ -895,12 +895,14 @@ CHARMAP

> +<UD800>..<UDB7F> /xed/xa0/x80 <Non Private Use High Surrogate>
> +<UDB80>..<UDBFF> /xed/xae/x80 <Private Use High Surrogate>
> +<UDC00>..<UDFFF> /xed/xb0/x80 <Low Surrogate>
> +<UE000>..<UF8FF> /xee/x80/x80 <Private Use>

Technically this isn't right.  We don't want mappings for those
characters because they might introduce problems in other locale files
that use those characters.  But maybe we just need to be careful.

I'm surprised that this doesn't lead to testsuite failures because it's
inconsistent with the gconv converters.  Maybe we don't use this
anywhere?

The other invalid-ish Unicode codepoints (U+FFFE, U+FFFF) are actually
valid UTF-8 and handled by gconv, so including them seems okay.
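
The distinction can be illustrated with Python's strict UTF-8 codec,
used here only as a stand-in; this sketch says nothing about how gconv
itself behaves.  The helper name strict_utf8_encodable is made up for
the example:

```python
def strict_utf8_encodable(cp):
    """Return True if Python's strict UTF-8 codec accepts code point CP.

    Hypothetical helper; Python's codec is an assumption standing in
    for a strict converter.
    """
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Noncharacters such as U+FFFE and U+FFFF encode fine; the surrogate
# range U+D800..U+DFFF is rejected.
for cp in (0xFFFE, 0xFFFF, 0xD800, 0xDFFF, 0xE000):
    print("U+%04X: %s" % (cp, strict_utf8_encodable(cp)))
```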

> diff --git a/localedata/locales/C b/localedata/locales/C
> new file mode 100644
> index 0000000000..418e7c90a5
> --- /dev/null
> +++ b/localedata/locales/C
> @@ -0,0 +1,192 @@

> +% One rule, sort forward, for all code points to give code point
> +% order sorting for Unicode.
> +LC_COLLATE
> +order_start forward
> +<U00000000>
> +..
> +<U0000007F>
> +<U00000080>
> +..
> +<U000007FF>
> +<U00000800>
> +..
> +<U0000FFFF>
> +<U00010000>
> +..
> +<U0010FFFF>
> +UNDEFINED
> +order_end
> +END LC_COLLATE

Why are multiple ranges required here?

> diff --git a/localedata/locales/i18n_ctype b/localedata/locales/i18n_ctype
> index c63e0790fc..c92bb95148 100644
> --- a/localedata/locales/i18n_ctype
> +++ b/localedata/locales/i18n_ctype
> @@ -26,7 +26,7 @@ fax       ""
>  language  ""
>  territory "Earth"
>  revision  "13.0.0"
> -date      "2020-06-25"
> +date      "2021-02-17"
>  category  "i18n:2012";LC_CTYPE
>  END LC_IDENTIFICATION

Those date changes seem spurious.  Is this no-op file regeneration
really needed?

> diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
> index 899840923a..42fc5efcb9 100755
> --- a/localedata/unicode-gen/utf8_gen.py
> +++ b/localedata/unicode-gen/utf8_gen.py

>  def convert_to_hex(code_point):
>      '''Converts a code point to a hexadecimal UTF-8 representation
> +    like /x**/x**/x** without using any python library functions.
> +    This avoids problems with the encode function, including an
> +    inability to output the surrogate code points.

You can use chr(code_point).encode('UTF-8', 'surrogatepass') and the
Python encoder.
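
A minimal sketch of that suggestion, assuming the function only needs
to return the /x**/x** string for one code point (the real
convert_to_hex in utf8_gen.py may differ in its details):

```python
def convert_to_hex(code_point):
    """Return the UTF-8 encoding of CODE_POINT as /x**/x** text.

    Sketch only.  The 'surrogatepass' error handler lets the encoder
    emit the surrogate code points U+D800..U+DFFF instead of raising
    UnicodeEncodeError.
    """
    encoded = chr(code_point).encode("UTF-8", "surrogatepass")
    return "".join("/x{:02x}".format(byte) for byte in encoded)


print(convert_to_hex(0xD800))
print(convert_to_hex(0x10FFFF))
```

For U+D800 this yields /xed/xa0/x80, matching the start of the first
surrogate range in the charmap hunk quoted above.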

I reviewed the other changes and spot-checked the generated charmap.
Those parts look okay.

The question is whether we actually need a UTF-8 charmap.  If not, we
can teach charmap.c to generate the UTF-8 data on the fly in
charmap_find_value and charmap_find_symbol.  But I consider this part of
the data representation changes we discussed earlier.
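
As a rough model of what computing the UTF-8 data on the fly could look
like, here is a Python sketch of the two lookups.  It is not the C code
in charmap.c: the function names find_value and find_symbol, the
<UXXXX>/<UXXXXXXXX> naming rule, and the choice to reject surrogates
are all assumptions made for this illustration.

```python
import re

# Assumed symbol naming: U + 4 hex digits, or U + 8 hex digits.
_SYMBOL = re.compile(r"U([0-9A-Fa-f]{4}|[0-9A-Fa-f]{8})")


def find_value(name):
    """Map a symbol name such as 'U20AC' to its UTF-8 bytes, or None."""
    match = _SYMBOL.fullmatch(name)
    if match is None:
        return None
    cp = int(match.group(1), 16)
    # Assumption: out-of-range values and surrogates have no mapping.
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return None
    return chr(cp).encode("utf-8")


def find_symbol(data):
    """Map a UTF-8 byte sequence for one character back to its name."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    if len(text) != 1:
        return None
    cp = ord(text)
    if cp <= 0xFFFF:
        return "U{:04X}".format(cp)
    return "U{:08X}".format(cp)


print(find_value("U20AC"))
print(find_symbol(b"\xe2\x82\xac"))
```

Both directions are pure computation, so no table of roughly a million
entries has to be read or stored for them.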

Thanks,
Florian