From: наб via Libc-alpha
Reply-To: наб
Date: Fri, 2 Dec 2022 19:42:00 +0100
To: Florian Weimer
Cc: libc-alpha@sourceware.org, Victor Stinner
Subject: Re: [PATCH v6 2/2] POSIX locale covers every byte [BZ# 29511]
Message-ID: <20221202184200.gfcnjwcfnc75wqqi@tarta.nabijaczleweli.xyz>
In-Reply-To: <87mt85pv1h.fsf@oldenburg.str.redhat.com>

Hi!

On Fri, Dec 02, 2022 at 06:36:26PM +0100, Florian Weimer wrote:
> * наб:
> > On Thu, Nov 10, 2022 at 09:10:57AM +0100, Florian Weimer wrote:
> >> Raised on the musl list here:
> >> Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
> >>
> >
> > That thread seems to've been exhausted (at least I don't see anything
> > fresh in the archive) ‒ should I just resend with the comments for v7
> > applied, or do you have a mapping range you'd rather see given those
> > givens?
>
> I still can't make up my mind. I think the options are:
>
> * Some sort of custom encoding (like you posted).
For which there's Prior Art in both other libcs and implementations of
similar mechanisms in unrelated software, making it just about what
users expect, and the lowest-energy conversion to what POSIX has
mandated for close to 8 years now.

> * Latin-1
Sorry, what? The Latin-1 that's so poorly defined that W3C requires,
per spec, that Latin-1 charset specs be ignored? The Latin-1 they got
wrong so badly they made two subsequent standards, only one of which is
compatible? The Latin-1 that has some random subset of Germanic and
maybe like French if you squint, and that's apparently fine? The Iron
Curtain has fallen, for better or for worse, since the 80s.

If that's the "solution", then leave "C" 7-bit. I'm gonna assume that's
a joke.

(Also, people would then try to "use" it, and then (a) you've lost, but
(b) the collation sequence is always gonna be wrong because it no
longer represents any given language (though apparently it's "fine" if
they collate in any random order, so it's legal per spec to make it
just Spanish, I think; this is somehow even worse, and I may be
misunderstanding POSIX 7.3.2.6 because it'd mean that other parts of
the standard that use and recommend "LC_ALL=C utility ..." to process
bytes, like for sort, are also wrong?).)

> * UTF-8 with surrogate escape encoding (and encouraging POSIX to change again)
Well it's not gonna ‒ at least I don't think it is ‒ given that I don't
think it /changed/ anything actually? Issue 7-2008 7.2 says
> The tables in Locale Definition describe the characteristics and
> behavior of the POSIX locale for data consisting entirely of
> characters from the portable character set and the control character
> set. For other characters, the behavior is unspecified.
And TC2 just specified what had been unspecified behaviour, I think?
Implementations had freedom to do whatever, including UTF-8, until
2016. Naturally, as we're seeing now, not one has exercised that
freedom.
If glibc /had/ done POSIX=C=C.UTF-8 before then, maybe we'd have seen a
different result, but it hadn't, so we didn't.

> What argues in favor of the last point is that many, many people are
> using C.UTF-8 nowadays.
Great! They can continue to use C.UTF-8. They have had to opt in to
their preferred encoding like everyone else, and they will continue to.
0 changes observed here.

> And effectively disabling wide/multibyte
> conversion until you call setlocale does not seem particularly useful.
"Mangling input data until explicitly disabled" is worse than
"input data is data, and you can make it characters".

Don't take me for not-a-UTF-8-maximalist, but, y'know, it will never,
unfortunately, be all I see, and being able to completely opt out of
additional input processing would be nice; we're kinda close now with
the current hard-7-bit ASCII, and making it, essentially,
I Can't Believe It's Not Just Bytes!, per pt. 1, would eliminate even
more headaches IME.

Putting proverbial KOI-8 or [your grandma's favourite encoding] through
the UTF-8 grinder is much worse than just degrading to strcmp(), but at
this point I think I'm rambling, and have spilled enough ink; your call
to make, at the end of the day.
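
A minimal sketch of the contrast drawn above, using Python's built-in
codecs as stand-ins for the candidate mappings. None of this is glibc
or musl code, the helper names are made up for illustration, and the
code point range used by Python's surrogateescape handler is not
necessarily the range any libc patch would pick. It runs the KOI8-R
bytes for "наб" through plain UTF-8 with replacement (the "grinder"),
through UTF-8 with surrogate escapes, and through Latin-1, and checks
which of them give the original bytes back.

```python
# Illustration only: Python codecs standing in for the mapping options
# discussed in the thread. Helper names are hypothetical.

# "наб" encoded as KOI8-R: three bytes, all >= 0x80, not valid UTF-8.
KOI8_BYTES = "наб".encode("koi8-r")


def decode_utf8_lossy(data: bytes) -> str:
    """Plain UTF-8; every undecodable byte collapses to U+FFFD."""
    return data.decode("utf-8", errors="replace")


def decode_utf8_escaped(data: bytes) -> str:
    """UTF-8 with surrogate escapes (PEP 383 error handler):
    an undecodable byte 0xNN becomes the lone surrogate U+DCNN."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_utf8_escaped(text: str) -> bytes:
    """Inverse of decode_utf8_escaped: lone surrogates turn back into bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def decode_latin1(data: bytes) -> str:
    """Latin-1: byte 0xNN becomes code point U+00NN, whatever it meant."""
    return data.decode("latin-1")


if __name__ == "__main__":
    # ascii() keeps lone surrogates printable on any terminal encoding.
    print("input bytes:     ", KOI8_BYTES.hex(" "))
    print("utf-8, replace:  ", ascii(decode_utf8_lossy(KOI8_BYTES)))
    print("utf-8, escaped:  ", ascii(decode_utf8_escaped(KOI8_BYTES)))
    print("latin-1:         ", ascii(decode_latin1(KOI8_BYTES)))

    # Only the byte-transparent mappings survive a round trip.
    lossy = decode_utf8_lossy(KOI8_BYTES).encode("utf-8")
    escaped = encode_utf8_escaped(decode_utf8_escaped(KOI8_BYTES))
    latin1 = decode_latin1(KOI8_BYTES).encode("latin-1")
    print("lossy round trip:  ", lossy == KOI8_BYTES)
    print("escaped round trip:", escaped == KOI8_BYTES)
    print("latin-1 round trip:", latin1 == KOI8_BYTES)
```

The replacement path loses the input for good, while both the escaped
and the Latin-1 paths hand the same bytes back. The two differ only in
which code points the high bytes are parked on in between: Latin-1
labels them as specific Western European letters, the escape range
labels them as "some byte".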
наб