To: Carlos O'Donell via Libc-alpha
Subject: Re: [PATCH v3] Add new C.UTF-8 locale (Bug 17318)
References: <20210219032748.564216-1-carlos@redhat.com>
Date: Thu, 11 Mar 2021 20:05:17 +0100
In-Reply-To: <20210219032748.564216-1-carlos@redhat.com> (Carlos O'Donell via Libc-alpha's message of "Thu, 18 Feb 2021 22:27:48 -0500")
Message-ID: <87v99xcxzm.fsf@oldenburg.str.redhat.com>
List-Id: Libc-alpha mailing list
From: Florian Weimer via Libc-alpha
Reply-To: Florian Weimer

* Carlos O'Donell via Libc-alpha:

> diff --git a/locale/programs/charmap.c b/locale/programs/charmap.c
> index 3d51e702dc..77085cff72 100644
> --- a/locale/programs/charmap.c
> +++ b/locale/programs/charmap.c
> @@ -49,7 +49,7 @@ static void new_width (struct linereader *cmfile, struct charmap_t *result,
> +  /* POSIX explicitly requires that ellipsis processing do the
> +     following: "Bytes shall be treated as unsigned octets, and carry
> +     shall be propagated between the bytes as necessary to represent the
> +     range." It then goes on to say that such a declaration should
> +     never be specified because it creates NULL bytes. Therefore we

NUL or null, I think.

> +     error on this condition (see charmap_new_char). However this still
> +     leaves a problem for encodings which use less than the full 8-bits,
> +     like UTF-8, and in such encodings you can use an ellipsis to
> +     silently and accidentally create invalid ranges. In UTF-8 you have
> +     only the first 6-bits of the first byte and if your ellipsis covers

UTF-8 is variable length even in the leader byte, so “only the first
6-bits of the first byte” seems wrong.

> +/* This function takes the Unicode code point CP and encodes it into
> +   a UTF-8 byte stream that must be NBYTES long and is stored into
> +   the unsigned character array at BYTES.
> +
> +   If CP requires more than NBYTES to be encoded then we return an
> +   error of -1.
> +
> +   If CP is not within any of the valid Unicode code point ranges
> +   then we return an error of -2.
> +
> +   Otherwise we return the number of bytes encoded. */
> +static int
> +output_utf8_bytes (unsigned int cp, size_t nbytes, unsigned char *bytes)
> +{
> +  /* We need at least 1 byte. */
> +  if (nbytes < 1)
> +    return -1;
> +
> +  /* One byte range. */
> +  if (cp >= 0x0 && cp <= 0x7f)
> +    {
> +      bytes[0] = cp & 0x7f;
> +      return 1;
> +    }

The & 0x7f mask is superfluous and confusing here (the range check
already guarantees cp <= 0x7f), as discussed before.

> diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
> index 8cce47cd97..c70d359744 100644
> --- a/localedata/charmaps/UTF-8
> +++ b/localedata/charmaps/UTF-8
> @@ -895,12 +895,14 @@ CHARMAP
> +.. /xed/xa0/x80
> +.. /xed/xae/x80
> +.. /xed/xb0/x80
> +.. /xee/x80/x80

Technically this isn't right. We don't want mappings for those
characters because they might introduce problems in other locale files
that use those characters. But maybe we just need to be careful. I'm
surprised that this doesn't lead to testsuite failures, because it's
inconsistent with the gconv converters. Maybe we don't use this
anywhere?

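To make the consistency concern concrete, here is a minimal sketch that
uses Python's built-in strict UTF-8 codec as a stand-in. That choice is
an assumption for illustration only: it does not exercise the gconv
converters, and the helper name strict_decode_ok is hypothetical. The
code points in the comments are derived from the byte sequences
themselves.

```python
# Sketch: how a strict UTF-8 codec treats the byte sequences that start
# the surrogate ranges in the charmap hunk above.  Python's codec is a
# stand-in here; glibc's gconv converters are NOT being tested.

SURROGATE_SEQUENCES = [
    b"\xed\xa0\x80",  # U+D800, first high surrogate
    b"\xed\xae\x80",  # U+DB80
    b"\xed\xb0\x80",  # U+DC00, first low surrogate
]


def strict_decode_ok(data: bytes) -> bool:
    """Return True if DATA decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Map each sequence (as hex) to whether the strict codec accepts it.
results = {seq.hex(): strict_decode_ok(seq) for seq in SURROGATE_SEQUENCES}

# /xee/x80/x80 is U+E000 (private use), which is not a surrogate.
private_use_ok = strict_decode_ok(b"\xee\x80\x80")

if __name__ == "__main__":
    for seq_hex, ok in results.items():
        print(seq_hex, "accepted" if ok else "rejected")
    print("ee8080", "accepted" if private_use_ok else "rejected")
```

A strict decoder rejects the three surrogate sequences and accepts the
private-use one, which is the kind of mismatch a charmap that maps
surrogates could run into.
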
The other invalid-ish Unicode code points (U+FFFE, U+FFFF) are actually
valid UTF-8 and handled by gconv, so including them seems okay.

> diff --git a/localedata/locales/C b/localedata/locales/C
> new file mode 100644
> index 0000000000..418e7c90a5
> --- /dev/null
> +++ b/localedata/locales/C
> @@ -0,0 +1,192 @@
> +% One rule, sort forward, for all code points to give code point
> +% order sorting for Unicode.
> +LC_COLLATE
> +order_start forward
> +
> +..
> +
> +
> +..
> +
> +
> +..
> +
> +
> +..
> +
> +UNDEFINED
> +order_end
> +END LC_COLLATE

Why are multiple ranges required here?

> diff --git a/localedata/locales/i18n_ctype b/localedata/locales/i18n_ctype
> index c63e0790fc..c92bb95148 100644
> --- a/localedata/locales/i18n_ctype
> +++ b/localedata/locales/i18n_ctype
> @@ -26,7 +26,7 @@ fax "" 
>  language "" 
>  territory "Earth" 
>  revision "13.0.0" 
> -date "2020-06-25" 
> +date "2021-02-17" 
>  category "i18n:2012";LC_CTYPE 
>  END LC_IDENTIFICATION

Those date changes seem spurious. Is this no-op file regeneration
really needed?

> diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
> index 899840923a..42fc5efcb9 100755
> --- a/localedata/unicode-gen/utf8_gen.py
> +++ b/localedata/unicode-gen/utf8_gen.py
>  def convert_to_hex(code_point):
>      '''Converts a code point to a hexadecimal UTF-8 representation
> +    like /x**/x**/x** without using any python library functions.
> +    This avoids problems with the encode function, including an
> +    inability to output the surrogate code points.

You can still use the Python encoder here:
chr(code_point).encode('UTF-8', 'surrogatepass') encodes the surrogate
code points as well.

I reviewed the other changes and spot-checked the generated charmap.
Those parts look okay.

The question is whether we actually need a UTF-8 charmap. If not, we
can teach charmap.c to generate the UTF-8 data on the fly in
charmap_find_value and charmap_find_symbol. But I consider this part
of the data representation changes we discussed earlier.

Thanks,
Florian
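
The surrogatepass suggestion above can be sketched as follows. This is
an illustration under assumptions, not the code in utf8_gen.py: only
the function name convert_to_hex and the /x**/x**/x** output shape come
from the quoted patch, and the lowercase hex formatting is assumed from
the charmap lines quoted earlier.

```python
def convert_to_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of CODE_POINT in charmap notation,
    for example /xed/xa0/x80, using Python's own encoder.

    The 'surrogatepass' error handler lets U+D800..U+DFFF be encoded;
    the default 'strict' handler raises UnicodeEncodeError for them.
    """
    encoded = chr(code_point).encode("UTF-8", "surrogatepass")
    # Format every encoded byte as /xNN with two lowercase hex digits.
    return "".join("/x{:02x}".format(byte) for byte in encoded)


if __name__ == "__main__":
    # A few boundary code points, including the first surrogate.
    for cp in (0x41, 0x7FF, 0xD800, 0xE000, 0xFFFF, 0x10FFFF):
        print("U+{:04X} {}".format(cp, convert_to_hex(cp)))
```

With this approach the byte-level arithmetic stays in the Python
library, and the surrogate ranges need no special case in the
generator.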