From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-bounces+e=80x24.org@sourceware.org>
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level: 
X-Spam-ASN: AS3215 2.6.0.0/16
X-Spam-Status: No, score=-3.6 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_MED,SPF_HELO_PASS,SPF_PASS shortcircuit=no autolearn=ham
	autolearn_force=no version=3.4.2
Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
	(No client certificate requested)
	by dcvr.yhbt.net (Postfix) with ESMTPS id 87B1B1F601
	for <e@80x24.org>; Fri,  2 Dec 2022 18:42:26 +0000 (UTC)
Authentication-Results: dcvr.yhbt.net;
	dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.b="BrtsCIxA";
	dkim-atps=neutral
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id BFB7A3858C39
	for <e@80x24.org>; Fri,  2 Dec 2022 18:42:24 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org BFB7A3858C39
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1670006544;
	bh=qoucLWE/O3lkRaq5TAx87kxt+JH2hU05WLWNgu+I8YA=;
	h=Date:To:Cc:Subject:References:In-Reply-To:List-Id:
	 List-Unsubscribe:List-Archive:List-Post:List-Help:List-Subscribe:
	 From:Reply-To:From;
	b=BrtsCIxAZVSHSTqBzHxkbcj6/MU2k5Z3qh/tvhrTdWbZCgxqvakYX6o7EGoUNiDdy
	 PH3SBOhGarcLnGQFrxK81V0rQkA8vxKvF7+8N72aWdRSwjfNXsvxzHkvbji6GYLFEb
	 vsjGpk5ibyZAoZQtwC2OQoFhpU1Qjud+Px8SQJwg=
Received: from tarta.nabijaczleweli.xyz (unknown [139.28.40.42])
 by sourceware.org (Postfix) with ESMTP id 3268E3858D20
 for <libc-alpha@sourceware.org>; Fri,  2 Dec 2022 18:42:05 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 3268E3858D20
Received: from tarta.nabijaczleweli.xyz (unknown [192.168.1.250])
 by tarta.nabijaczleweli.xyz (Postfix) with ESMTPSA id 74809A2C;
 Fri,  2 Dec 2022 19:42:01 +0100 (CET)
Date: Fri, 2 Dec 2022 19:42:00 +0100
To: Florian Weimer <fweimer@redhat.com>
Cc: libc-alpha@sourceware.org, Victor Stinner <vstinner@redhat.com>
Subject: Re: [PATCH v6 2/2] POSIX locale covers every byte [BZ# 29511]
Message-ID: <20221202184200.gfcnjwcfnc75wqqi@tarta.nabijaczleweli.xyz>
References: <ad6720f44981d53ce50804d4ea3696ca1b7cd0b7.1663768863.git.nabijaczleweli@nabijaczleweli.xyz>
 <969aa82c8d5904c1d2040bba87abe2f17a0dc647.1667409408.git.nabijaczleweli@nabijaczleweli.xyz>
 <874jv8dxat.fsf@oldenburg.str.redhat.com>
 <87wn8344by.fsf@oldenburg.str.redhat.com>
 <20221128162453.supbaps5ftl2mg3s@tarta.nabijaczleweli.xyz>
 <87mt85pv1h.fsf@oldenburg.str.redhat.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
 protocol="application/pgp-signature"; boundary="d3knoeizpge5g76w"
Content-Disposition: inline
In-Reply-To: <87mt85pv1h.fsf@oldenburg.str.redhat.com>
User-Agent: NeoMutt/20220429
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
From: =?utf-8?b?0L3QsNCxIHZpYSBMaWJjLWFscGhh?= <libc-alpha@sourceware.org>
Reply-To: =?utf-8?B?0L3QsNCx?= <nabijaczleweli@nabijaczleweli.xyz>
Errors-To: libc-alpha-bounces+e=80x24.org@sourceware.org
Sender: "Libc-alpha" <libc-alpha-bounces+e=80x24.org@sourceware.org>


--d3knoeizpge5g76w
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Hi!

On Fri, Dec 02, 2022 at 06:36:26PM +0100, Florian Weimer wrote:
> * наб:
> > On Thu, Nov 10, 2022 at 09:10:57AM +0100, Florian Weimer wrote:
> >> Raised on the musl list here:
> >>   Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
> >>   <https://www.openwall.com/lists/musl/2022/11/10/1>
> >
> > That thread seems to've been exhausted (at least I don't see anything
> > fresh in the archive) ‒ should I just resend with the comments for v7
> > applied, or do you have a mapping range you'd rather see given those
> > givens?
>
> I still can't make up my mind.  I think the options are:
>
> * Some sort of custom encoding (like you posted).
For which there's Prior Art in both other libcs and implementations
of similar mechanisms in unrelated software, making it just about what
users expect, and the lowest-energy conversion to what POSIX
has mandated for close to 8 years now.
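
A rough sketch of what such a custom encoding amounts to, in C: every byte
gets exactly one wide value and converts back losslessly. The names
byte_to_wide, wide_to_byte, check_round_trip and HIGH_BASE are made up for
illustration, and the 0xDF80 base is a placeholder assumption, not necessarily
the range the patch uses or ends up with:

```c
#include <assert.h>
#include <wchar.h>

/* PLACEHOLDER: first wide value used for byte 0x80. An assumption for
   illustration only, not necessarily the range the patch ends up with. */
#define HIGH_BASE 0xDF80UL

/* Hypothetical helper: every byte maps to exactly one wide value. */
wchar_t byte_to_wide(unsigned char b)
{
    if (b < 0x80)
        return (wchar_t)b;                      /* ASCII maps to itself */
    return (wchar_t)(HIGH_BASE + (b - 0x80UL)); /* high half gets its own range */
}

/* Hypothetical inverse: returns the byte, or -1 if wc is outside the mapping. */
int wide_to_byte(wchar_t wc)
{
    unsigned long u = (unsigned long)wc;
    if (u < 0x80)
        return (int)u;
    if (u >= HIGH_BASE && u < HIGH_BASE + 0x80)
        return (int)(u - HIGH_BASE + 0x80);
    return -1;
}

/* Asserts that all 256 byte values survive byte -> wide -> byte unchanged. */
void check_round_trip(void)
{
    for (int b = 0; b < 256; b++)
        assert(wide_to_byte(byte_to_wide((unsigned char)b)) == b);
}
```

The only property that matters here is the round trip; which range the high
half lands in is exactly the open question in this thread.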

> * Latin-1
Sorry, what? The Latin-1 that's so poorly defined that W3C requires,
per spec, that Latin-1 charset specs be ignored? The Latin-1 they got
wrong so badly they made two subsequent standards, only one of which
is compatible? The Latin-1 that has some random subset of Germanic and
maybe like French if you squint and that's apparently fine? The iron
curtain has fallen, for better or for worse, since the 80s. If that's
the "solution", then leave "C" 7-bit. I'm gonna assume that's a joke.
(Also, people would then try to "use" it, and then (a) you've lost,
 but (b) the collation sequence is gonna be wrong always because
 it no longer represents any given language
 (though apparently it's "fine" if they collate in any random order,
  so it's legal per spec to make it just Spanish, I think;
  this is somehow even worse, and I may be misunderstanding
  POSIX 7.3.2.6 because it'd mean that other parts of the standard
  that use and recommend "LC_ALL=C utility ..." to process bytes,
  like for sort, are also wrong?).)

> * UTF-8 with surrogate escape encoding (and encouraging POSIX to change again)
Well it's not gonna ‒ at least I don't think it is ‒ given that I don't
think it /changed/ anything actually? Issue 7-2008 7.2 says
> The tables in Locale Definition describe the characteristics and
> behavior of the POSIX locale for data consisting entirely of
> characters from the portable character set and the control character
> set. For other characters, the behavior is unspecified.

And TC2 just specified what had been unspecified behaviour, I think?
Implementations had freedom to do whatever, including UTF-8, until 2016.
Naturally, as we're seeing now, not one has exercised that freedom.
If glibc /had/ done POSIX=C=C.UTF-8 before then,
then maybe we'd see a different result, but it hadn't, so we don't.

> What argues in favor of the last point is that many, many people are
> using C.UTF-8 nowadays.
Great! They can continue to use C.UTF-8. They have had to opt in to
their preferred encoding like everyone else, and they will continue.
0 changes observed here.

> And effectively disabling wide/multibyte
> conversion until you call setlocale does not seem particularly useful.
"Mangling input data until explicitly disabled" is worse than
"input data is data, and you can make it characters".
Don't take me for not-a-UTF-8-maximalist, but, y'know,
it will never, unfortunately, be all I see,
and being able to completely opt out of additional input processing
would be nice; we're kinda close now with the current hard-7-bit ASCII,
and making it, essentially, I Can't Believe It's Not Just Bytes!,
per pt. 1, would eliminate even more headaches IME.
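
To make "mangling" a bit more concrete, here is a simplified sketch in C of
what a UTF-8-assuming layer sees when handed legacy single-byte text.
looks_like_utf8 and check_koi8_sample are hypothetical names, and the check is
deliberately structural only (it does not reject overlong forms or encoded
surrogates), so treat it as an illustration rather than a real validator:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified structural check: each lead byte must be followed by the
   right number of 10xxxxxx continuation bytes. It does NOT reject
   overlong forms or encoded surrogates. */
int looks_like_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t extra;
        if (c < 0x80)                extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return 0;               /* stray continuation or invalid lead */
        if (extra > n - i - 1)
            return 0;                /* sequence truncated */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
        i += extra + 1;
    }
    return 1;
}

/* The same three Cyrillic letters in KOI8-R and in UTF-8. */
void check_koi8_sample(void)
{
    const unsigned char koi8[] = { 0xCE, 0xC1, 0xC2 };
    const unsigned char utf8[] = { 0xD0, 0xBD, 0xD0, 0xB0, 0xD0, 0xB1 };
    assert(!looks_like_utf8(koi8, sizeof koi8)); /* decoder must reject or replace */
    assert(looks_like_utf8(utf8, sizeof utf8));
}
```

A byte-wise comparison such as strcmp() has no decode step that can fail on
the KOI8-R bytes; it just orders them by unsigned byte value.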

Putting proverbial KOI-8 or [your grandma's favourite encoding] through
the UTF-8 grinder is much worse than just degrading to strcmp(),
but at this point I think I'm rambling, and spilled enough ink;
your call to make, at the end of the day.

наб

--d3knoeizpge5g76w
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEfWlHToQCjFzAxEFjvP0LAY0mWPEFAmOKRvUACgkQvP0LAY0m
WPGQKQ/8DREy0fjHETj/p8dWZ1mBHdr1+fkOYnvXcSGdrq1Hq5Qy3s1wdxBlB0ZG
3eNzLlCQevTTJy60/Wr31Rnwwq602n63yPDJpNmEvcvVnahckEQHiu+r6McLg697
LYfrY4SWRzby6YuF8+K2BnLwoHQjUS44BQ1hHBbyTIbnacmebbH7VlQzbKvEznqG
o4dOq5uJw7qttnwI4yllpKqUuDwrW3sKVGX46HUnAeL86o1NVa9OiyvYvK+prmC1
wImpY7UH8egc0U/1bg02MKwVqEeXqUngjLKZAxD10B2seZw/kErW1cFd/8RX+ZjT
SHyWAeZ1NOQSlPDZEOsDgYUZDUPuzsJ/kecjS07JBUQXMuyOdW+61ZqGsczz3yQw
SV62kmcBBt4ulIEPdCgn9b1c9tNGDBdpxi3eZPPIc9kvb7r/c0ZweqwXr5VdaZo3
LIzcOZsEJk8lVXNSeU97+3j9l1gfYFxAD22VIVLOC60xF31RhRVQH60OfNXZqf/M
Zyu41MpxOmU6H5gawXEuZDu3x6JWHl10nLa1It9zZDOsbG1TGzpXEgPs3F8yfHvc
CK0pVuqbbFmXZLRXEHNHd2mZ2FX/qj1hb6T/t6lSLENj6ni+6dY3LVT31Br+cpsS
/En2ShHR0RDu/NZAWf93N7GfrLcKq7u2N0SSk9JxR0G+EwlcTfQ=
=Z//v
-----END PGP SIGNATURE-----

--d3knoeizpge5g76w--