From: Florian Weimer via Libc-alpha <libc-alpha@sourceware.org>
To: наб <nabijaczleweli@nabijaczleweli.xyz>
Cc: libc-alpha@sourceware.org, Victor Stinner <vstinner@redhat.com>
Subject: Re: [PATCH v6 2/2] POSIX locale covers every byte [BZ# 29511]
Date: Thu, 10 Nov 2022 09:10:57 +0100
Message-ID: <87wn8344by.fsf@oldenburg.str.redhat.com>
In-Reply-To: <874jv8dxat.fsf@oldenburg.str.redhat.com> (Florian Weimer's message of "Wed, 09 Nov 2022 15:20:26 +0100")

* Florian Weimer:

> * наб:
>
>> This is a logistically trivial patch,
>> largely duplicating the extant ASCII code with the error path changed
>
> I wouldn't say it's trivial in the commit message. 8-)
>
>> There are two user-facing changes:
>> * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
>> * mbrtowc() and friends return b if b <= 0x7F else <UDF00>+b
>>
>> Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
>> (a) is 1-byte, stateless, and contains 256 characters
>> (b) they collate in byte order
>> (c) the first 128 characters are equivalent to ASCII (like previous)
>> cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
>> changes to the standard;
>> in short, this means that mbrtowc() must never fail and must return
>> b if b <= 0x7F else ab+c for all bytes b
>> where c is some constant >=0x80
>> and a is a positive integer constant
>>
>> By strategically picking c=<UDF00> we land at the tail-end of the
>> Unicode Low Surrogate Area at DC00-DFFF, described as
>> > Isolated surrogate code points have no interpretation;
>> > consequently, no character code charts or names lists
>> > are provided for this range.
>> and match musl
>
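For readers following along, the mapping quoted above can be sketched in a few lines of Python. This is only an illustration of the description in the commit message, not the glibc code: the names posix_byte_to_wchar and POSIX_HIGH_BASE are made up here, and a is assumed to be 1 with c = 0xDF00.

```python
# Illustrative sketch of the byte -> wide character mapping described in
# the quoted commit message. Names are hypothetical; a = 1 and c = 0xDF00
# are assumptions taken from the "b if b <= 0x7F else <UDF00>+b" wording.
POSIX_HIGH_BASE = 0xDF00


def posix_byte_to_wchar(b: int) -> int:
    """Map one byte to a code point: ASCII unchanged, high bytes offset."""
    if not 0 <= b <= 0xFF:
        raise ValueError("expected a single byte value in 0..255")
    return b if b <= 0x7F else POSIX_HIGH_BASE + b


codes = [posix_byte_to_wchar(b) for b in range(256)]

# (a) 256 distinct characters, one per byte, so conversion never fails.
assert len(set(codes)) == 256
# (b) code points increase with the byte value, so they sort in byte order.
assert codes == sorted(codes)
# (c) the first 128 characters are the ASCII code points themselves.
assert codes[:128] == list(range(128))
# High bytes land in the low surrogate area DC00-DFFF.
assert all(0xDC00 <= c <= 0xDFFF for c in codes[128:])
```

With these assumptions the high bytes occupy DF80-DFFF, the tail end of the low surrogate area.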
> Sadly this doesn't match Python and PEP 540:
>
> >>> b'\x80'.decode('UTF-8', errors='surrogateescape')
> '\udc80'
>
> I believe the implementation translates this to 0xDF80 instead.
>
> Not sure what is more important here, musl compatibility or Python
> compatibility.  Cc:ing Victor in case he has comments.  I should probably
> ask on the musl list as well about how this divergence came to pass.

Raised on the musl list here:

Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
<https://www.openwall.com/lists/musl/2022/11/10/1>
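To make the divergence concrete, here is a small Python comparison. It is a sketch under stated assumptions: patch_mapping is a hypothetical helper that follows the commit message wording (b if b <= 0x7F else 0xDF00 + b) and is not taken from the glibc sources, while python_surrogateescape just calls the standard bytes.decode with errors='surrogateescape'.

```python
# Compare the mapping described in the patch with Python's surrogateescape.
# patch_mapping is a hypothetical helper based on the commit message text;
# it is an assumption, not the actual glibc implementation.

def patch_mapping(b: int) -> int:
    """Byte -> code point as described in the quoted commit message."""
    return b if b <= 0x7F else 0xDF00 + b


def python_surrogateescape(b: int) -> int:
    """Byte -> code point as produced by Python's surrogateescape handler."""
    return ord(bytes([b]).decode("utf-8", errors="surrogateescape"))


# ASCII bytes agree under both schemes.
assert all(patch_mapping(b) == python_surrogateescape(b) for b in range(0x80))

# High bytes differ by a constant offset: DF00+b versus DC00+b.
offsets = {patch_mapping(b) - python_surrogateescape(b) for b in range(0x80, 0x100)}
assert offsets == {0x300}

print(hex(patch_mapping(0x80)), hex(python_surrogateescape(0x80)))
```

Under these assumptions the two schemes agree on ASCII and differ by 0x300 for every byte from 0x80 to 0xFF, so wide strings produced by one are not interchangeable with the other for non-ASCII bytes.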

> This change definitely needs a NEWS entry.

(With this I meant the change overall, not the encoding.)

Thanks,
Florian