Date: Tue, 6 Sep 2022 20:06:57 +0200
From: наб via Libc-alpha
Reply-To: наб
To: Florian Weimer
Cc: libc-alpha@sourceware.org
Subject: Re: [PATCH] POSIX locale covers every byte [BZ# 29511]
Message-ID: <20220906180657.eo53xia2yqqiv4ri@tarta.nabijaczleweli.xyz>
References: <20220830181932.oggrz6f6itrpyi6g@tarta.nabijaczleweli.xyz>
 <87a67c60x6.fsf@oldenburg.str.redhat.com>
In-Reply-To: <87a67c60x6.fsf@oldenburg.str.redhat.com>
User-Agent: NeoMutt/20220429
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
 protocol="application/pgp-signature"; boundary="ujbyoq4cyzan5j56"
Content-Disposition: inline

--ujbyoq4cyzan5j56
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

Hi!

On Tue, Sep 06, 2022 at 04:19:01PM +0200, Florian Weimer wrote:
> * наб via Libc-alpha:
>
> > This is a trivial patch, largely duplicating the extant ASCII code
> >
> > There are two user-facing changes:
> >   * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
> >   * mbrtowc() and friends return b if b <= 0x7F else +b
> >
> > Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively:
> >   (a) is 1-byte, stateless, and contains 256 characters
> >   (b) which collate in byte order
> >   (c) the first 128 characters are equivalent to ASCII (like previous)
> > cf.
> > https://www.austingroupbugs.net/view.php?id=663 for a summary of
> > changes to the standard;
> > in short, this means that mbrtowc() must never fail and must return
> >   b if b <= 0x7F else ab+c for all bytes b
> >   where c is some constant >=0x80
> >     and a is a positive integer constant
> >
> > By strategically picking c= we land at the tail-end of the
> > Unicode Low Surrogate Area at DC00-DFFF, described as
> > > Isolated surrogate code points have no interpretation;
> > > consequently, no character code charts or names lists
> > > are provided for this range.
> > and match musl
>
> We don't match Python and its surrogateescape encoding (PEP 838).
404?

> It
> maps invalid bytes in the 0x80…0xff range to U+DC80…U+DCFF.
(The same as musl.)

> It may make
> more sense to align with that.
With a=1 and c=, assuming it's as you say, we very much do?
$ printf '\x80\xff' | output/elf/ld.so --library-path output/ output/iconv/iconv_prog -fPOSIX -tUCS4 | hd
00000000  00 00 df 80 00 00 df ff                           |........|
00000008

> Anyway, regarding mechanics, we'll need a new localedata/charmaps/POSIX
> charmap, I think.  This charmap then can be tested against the gconv
> converter.
Hm, the problem with that is tst-tables -> tst-table -> tst-table-from
(and -to) convert by constructing a UTF-8 sequence. The problem with
this approach is that glibc rejects unpaired surrogates.

The output for tst-table-from UTF-8 is:
  ...
  0xED9FBE        0xD7FE
  0xED9FBF        0xD7FF
  0xEE8080        0xE000
  0xEE8081        0xE001
  ...
i.e. there's a gap for the surrogates; and, indeed, the charmap reads
  /xed/x9f/xbb HANGUL JONGSEONG PHIEUPH-THIEUTH
  % /xed/xa0/x80
  % /xed/xad/xbf
  % /xed/xae/x80
  % /xed/xaf/xbf
  % /xed/xb0/x80
  % /xed/xbf/xbf
  .. /xee/x80/x80
with the surrogate range commented-out; this dates back to the inclusion
of UTF-8 generator scripts in 2015
(4a4839c94a4c93ffc0d5b95c69a08b02a57007f2), these exclusions are
deliberate (grep for surrog in localedata/unicode-gen/utf8_gen.py).
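
To make the gap concrete, here is a minimal sketch (not glibc code; both
function names are made up for illustration) that applies the plain
3-byte UTF-8 bit layout to a code point and checks whether the result
falls in the ED A0 80 .. ED BF BF block that the charmap comments out.
It deliberately skips the surrogate rejection a conforming encoder
performs, which is the whole point: U+DF80 and U+DFFF from the hd output
above come out as ED BE 80 and ED BF BF, both inside the excluded block,
while U+D7FF and U+E000 land on either side of it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode cp (U+0800..U+FFFF) using the plain
   3-byte UTF-8 bit layout.  Unlike a conforming encoder it does NOT
   reject surrogates, so we can see where they would land.  */
static void
sketch_utf8_encode3 (uint32_t cp, unsigned char out[3])
{
  assert (cp >= 0x800 && cp <= 0xFFFF);
  out[0] = (unsigned char) (0xE0 | (cp >> 12));
  out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[2] = (unsigned char) (0x80 | (cp & 0x3F));
}

/* Hypothetical helper: nonzero if the 3-byte sequence lies in
   ED A0 80 .. ED BF BF, i.e. would decode to U+D800..U+DFFF.  */
static int
sketch_in_surrogate_gap (const unsigned char s[3])
{
  return s[0] == 0xED
	 && s[1] >= 0xA0 && s[1] <= 0xBF
	 && s[2] >= 0x80 && s[2] <= 0xBF;
}
```

Calling sketch_utf8_encode3 (0xDF80, buf) and then
sketch_in_surrogate_gap (buf) returns nonzero, so a table test that goes
through UTF-8 has no way to express these code points.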

Given this limitation, expanding the charmap to ANSI_X3.4-1968 + ..
doesn't actually test much: having them as separate codepoints
will always fail tests, and dot-notation lines are ignored when
generating the comparison tables, so this particular type of test
just proves that POSIX is the same as ANSI_X3.4-1968 for the first 128
characters.

There's already an exhaustive iconv_prog-based testsuite
(cf. additions to iconv/tst-iconv_prog.sh), though.

> You should put the new converters into a separate file (not
> iconv/gconv_simple.c), then the s390x version will use that
> automatically.
Oh, of course! Moved to iconv/gconv_posix.c.

> > diff --git a/localedata/locales/POSIX b/localedata/locales/POSIX
> > index 7ec7f1c577..fc34a6abc1 100644
> > --- a/localedata/locales/POSIX
> > +++ b/localedata/locales/POSIX
> > @@ -97,6 +97,20 @@ END LC_CTYPE
> >  LC_COLLATE
> >  % This is the POSIX Locale definition for the LC_COLLATE category.
>
> Isn't this just the C locale?
Yes, C is defined to be POSIX.

> We don't have a separate file for that.
Yes, we very obviously do, seeing as this patch edits it?
Nothing consumes it AFAICT, but.

> > diff --git a/wcsmbs/wcsmbsload.c b/wcsmbs/wcsmbsload.c
> > index 0f0f55f9ed..f87099bcf5 100644
> > --- a/wcsmbs/wcsmbsload.c
> > +++ b/wcsmbs/wcsmbsload.c
> > @@ -33,10 +33,10 @@ static const struct __gconv_step to_wc =
> >    .__shlib_handle = NULL,
> >    .__modname = NULL,
> >    .__counter = INT_MAX,
> > -  .__from_name = (char *) "ANSI_X3.4-1968//TRANSLIT",
> > +  .__from_name = (char *) "POSIX",
> >    .__to_name = (char *) "INTERNAL",
> > -  .__fct = __gconv_transform_ascii_internal,
> > -  .__btowc_fct = __gconv_btwoc_ascii,
> > +  .__fct = __gconv_transform_posix_internal,
> > +  .__btowc_fct = __gconv_btwoc_posix,
> >    .__init_fct = NULL,
> >    .__end_fct = NULL,
> >    .__min_needed_from = 1,
> > @@ -53,8 +53,8 @@ static const struct __gconv_step to_mb =
> >    .__modname = NULL,
> >    .__counter = INT_MAX,
> >    .__from_name = (char *) "INTERNAL",
> > -  .__to_name = (char *) "ANSI_X3.4-1968//TRANSLIT",
> > -  .__fct = __gconv_transform_internal_ascii,
> > +  .__to_name = (char *) "POSIX",
> > +  .__fct = __gconv_transform_internal_posix,
> >    .__btowc_fct = NULL,
> >    .__init_fct = NULL,
> >    .__end_fct = NULL,
>
> This makes the comment on __wcsmbs_gconv_fcts_c in the same file
> obsolete.
Comment fixed.

> Thanks,
> Florian
New patchset in followup.
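
For reference, per byte the two swapped-in steps boil down to roughly
the following. This is a sketch under assumptions, not the code from the
patch: the 0xDF00 offset is inferred from the iconv_prog output earlier
in this mail (0x80 -> 0000df80, 0xff -> 0000dfff), the sketch_* names
are made up, and the real converters work on buffers through the gconv
skeleton rather than one value at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: offset read off the hd output (a=1), not copied from
   the patch itself.  */
#define SKETCH_HIGH_OFFSET 0xDF00u

/* to_wc direction: every byte maps to something, so this cannot fail.  */
static uint32_t
sketch_byte_to_wc (unsigned char b)
{
  return b <= 0x7F ? b : SKETCH_HIGH_OFFSET + b;
}

/* to_mb direction: returns 1 and stores the byte if wc is one of the
   256 mapped values, 0 otherwise.  */
static int
sketch_wc_to_byte (uint32_t wc, unsigned char *out)
{
  assert (out != NULL);
  if (wc <= 0x7F)
    {
      *out = (unsigned char) wc;
      return 1;
    }
  if (wc >= SKETCH_HIGH_OFFSET + 0x80 && wc <= SKETCH_HIGH_OFFSET + 0xFF)
    {
      *out = (unsigned char) (wc - SKETCH_HIGH_OFFSET);
      return 1;
    }
  return 0;
}

/* Counts how many of the 256 byte values survive a round trip.  */
static int
sketch_count_round_trips (void)
{
  int ok = 0;
  for (int i = 0; i < 256; i++)
    {
      unsigned char back;
      uint32_t wc = sketch_byte_to_wc ((unsigned char) i);
      if (sketch_wc_to_byte (wc, &back) && back == i)
	ok++;
    }
  return ok;
}
```

sketch_count_round_trips () covering all 256 values is the property that
matters here: no byte is rejected on the way in, and each comes back
unchanged on the way out.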

Best,
наб

--ujbyoq4cyzan5j56
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEfWlHToQCjFzAxEFjvP0LAY0mWPEFAmMXjEAACgkQvP0LAY0m
WPGkCg/9GX6ehwPT1ZXVbX6U3kGKBUc8rEEAj+R7fVnJJnCfPcWWARCyb565Ht1j
F66+3XXnuG6Ye40hjU6QYFjqqShW5BpVGQTibPNUW+UaB6YK/sHVIB6Agnc+f5BS
mvU95vddiv6A+h9RK6s2PvG6XWxzSmUbsN/Kgsv9JezrCB62jicP8sL/twVG1wJV
VWn5xJmdraDWG6hxqygjVjS+aVPyG+nACf/bKGLyWNtKnqmYtVeLkHf/98/MoTDn
1you7bc8PqjW7LuP0SufHD0VIAH5OI6XqCtDn3YGbI4k3wQd6fiVYFCe0Tnh/oYH
c/CtxTIaRQU0P8Cz0YZ26MG4EjXc+81zqtpfJG9vsAa1ZZWBCAr5vSzHrx0jZb3Y
1ylf0YVHTyG2+BuXvFomdQqmoS3olNQMrQs6om3QsdGq1runBLH8jL2b/HrNq2Ae
D9/U3Bgf9S8GiePbBtre3ypR69MkQUlvW1vKsEoiY23QN62Cld+uyp4VgW7GQp+f
YfM+4BupQ8Lx84nhJCeoeGZM8E2DyQ+gK+ESBHjiZM0yaSFlVhSIAGbSBx6tbG1L
GDt9ibhNRbsOfscxznBVSamrhla2Syna3oWlTmccjyDvLDjPwQWghPemOpTwXxNf
/CkBKjHjwH+TSMIcVNIJeosuh+iXfn9JOYtOXInLYn4ff9cjxvA=
=+Rni
-----END PGP SIGNATURE-----

--ujbyoq4cyzan5j56--