From: Florian Weimer via Libc-alpha
Reply-To: Florian Weimer
To: наб
Cc: libc-alpha@sourceware.org, Victor Stinner
Subject: Re: [PATCH v9] POSIX locale covers every byte [BZ# 29511]
Date: Mon, 13 Feb 2023 15:52:06 +0100
Message-ID: <87lel1d3e1.fsf@oldenburg.str.redhat.com>
In-Reply-To: <20230207141645.fox6f5w6fn524bch@tarta.nabijaczleweli.xyz>
	("наб"'s message of "Tue, 7 Feb 2023 15:16:45 +0100")
References: <20230109151747.j3b7ls2kumcxa4px@tarta.nabijaczleweli.xyz>
	<20230207141645.fox6f5w6fn524bch@tarta.nabijaczleweli.xyz>
List-Id: Libc-alpha mailing list

* наб:

> This largely duplicates the ASCII code with the error path changed
>
> There are two user-facing changes:
>   * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
>   * mbrtowc() and friends return b
>     if b <= 0x7F else +b
>
> Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
>   (a) is 1-byte, stateless, and contains 256 characters
>   (b) they collate in byte order
>   (c) the first 128 characters are equivalent to ASCII (like previous)
> cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
> changes to the standard;
> in short, this means that mbrtowc() must never fail and must return
>   b if b <= 0x7F else ab+c for all bytes b
>   where c is some constant >=0x80
>     and a is a positive integer constant
>
> By strategically picking c= we land at the tail-end of the
> Unicode Low Surrogate Area at DC00-DFFF, described as
> > Isolated surrogate code points have no interpretation;
> > consequently, no character code charts or names lists
> > are provided for this range.
> and match musl

I've thought about this some more, and I don't think this is the
direction we should be going in.

* Add a UTF-8SE charset to glibc: it's UTF-8 with surrogate encoding
  (in the Python style).  It should have the property that it can
  encode every byte string as a string of wchar_t characters, and
  convert the result back.  It's not entirely trivial because we need
  to handle partial UTF-8 sequences at the end of the buffer carefully.
  There might be some warts regarding EILSEQ handling lurking there.
  Like the Python approach, it is somewhat imperfect because it's not
  preserving identity under string concatenation, i.e. f(x) || f(y) is
  not always equal to f(x || y), but that's just unavoidable.

* Switch the charset for the default C locale to UTF-8SE.  This matches
  the POSIX requirement that every byte can be encoded.

* Work with POSIX to drop the requirement that the C locale needs to be
  a single-byte locale.

* (Optional, somewhat unrelated.)  Add a generic mechanism so that UTF-8
  locales can be used as UTF-8SE without recompilation.

Thanks,
Florian
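
The round-trip property and the concatenation caveat described for
UTF-8SE can be sketched with Python's existing "surrogateescape" error
handler, used here only as a stand-in: glibc has no UTF-8SE charset
today, and the helper names decode_se and encode_se are hypothetical.
Python maps each undecodable byte b to the lone surrogate U+DC00+b; the
mapping a glibc charset would use is not specified in this message.

```python
import codecs


def decode_se(data: bytes) -> str:
    """Decode as UTF-8; each undecodable byte b becomes U+DC00+b."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_se(text: str) -> bytes:
    """Inverse of decode_se: lone surrogates turn back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Every single byte survives a round trip, so no input is rejected.
for b in range(256):
    raw = bytes([b])
    assert encode_se(decode_se(raw)) == raw

# Concatenation caveat: split the two-byte sequence C3 A9 (U+00E9).
x, y = b"\xc3", b"\xa9"
whole = decode_se(x + y)               # one character, U+00E9
pieces = decode_se(x) + decode_se(y)   # two escaped bytes
assert whole == "\u00e9"
assert pieces == "\udcc3\udca9"
assert whole != pieces                 # f(x || y) != f(x) || f(y)
assert encode_se(whole) == encode_se(pieces) == x + y

# Partial sequence at the end of a buffer: an incremental decoder has to
# hold the byte back until it knows whether more input follows.
dec = codecs.getincrementaldecoder("utf-8")(errors="surrogateescape")
first = dec.decode(b"\xc3")            # nothing emitted yet
second = dec.decode(b"\xa9")           # sequence completed
assert (first, second) == ("", "\u00e9")

# At true end of input the pending byte is escaped instead.
dec_final = codecs.getincrementaldecoder("utf-8")(errors="surrogateescape")
assert dec_final.decode(b"\xc3", final=True) == "\udcc3"
```

The last two cases show why buffer boundaries need care: the same
trailing byte yields either part of a valid character or an escaped
surrogate, depending on whether more input arrives.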