unofficial mirror of libc-alpha@sourceware.org
From: Florian Weimer via Libc-alpha <libc-alpha@sourceware.org>
To: наб <nabijaczleweli@nabijaczleweli.xyz>
Cc: libc-alpha@sourceware.org,  Victor Stinner <vstinner@redhat.com>
Subject: Re: [PATCH v9] POSIX locale covers every byte [BZ# 29511]
Date: Mon, 13 Feb 2023 15:52:06 +0100
Message-ID: <87lel1d3e1.fsf@oldenburg.str.redhat.com>
In-Reply-To: <20230207141645.fox6f5w6fn524bch@tarta.nabijaczleweli.xyz> ("наб"'s message of "Tue, 7 Feb 2023 15:16:45 +0100")

* наб:

> This largely duplicates the ASCII code with the error path changed
>
> There are two user-facing changes:
>   * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
>   * mbrtowc() and friends return b if b <= 0x7F else <UDF00>+b
>
> Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
>   (a) is 1-byte, stateless, and contains 256 characters
>   (b) they collate in byte order
>   (c) the first 128 characters are equivalent to ASCII (like previous)
> cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
> changes to the standard;
> in short, this means that mbrtowc() must never fail and must return
>   b if b <= 0x7F else ab+c for all bytes b
>   where c is some constant >=0x80
>     and a is a positive integer constant
>
> By strategically picking c=<UDF00> we land at the tail-end of the
> Unicode Low Surrogate Area at DC00-DFFF, described as
>   > Isolated surrogate code points have no interpretation;
>   > consequently, no character code charts or names lists
>   > are provided for this range.
> and match musl
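
As a rough model of the mapping quoted above, here is a minimal Python
sketch of what mbrtowc() would return per byte under the patch.  The
function name posix_byte_to_wchar is hypothetical and only illustrates
the arithmetic (b if b <= 0x7F else <UDF00>+b); it is not the glibc
implementation.

```python
LOW_SURROGATE_FIRST = 0xDC00  # start of the Unicode Low Surrogate Area
LOW_SURROGATE_LAST = 0xDFFF   # end of the Unicode Low Surrogate Area


def posix_byte_to_wchar(b: int) -> int:
    """Hypothetical model: map one byte to a wide character value.

    ASCII bytes map to themselves; every other byte maps to 0xDF00 + b,
    which lands in U+DF80..U+DFFF at the tail end of the low surrogates.
    """
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    return b if b <= 0x7F else 0xDF00 + b


# The mapping is defined for all 256 bytes and strictly increasing,
# so comparing the wide values preserves byte order.
mapped = [posix_byte_to_wchar(b) for b in range(256)]
```

Under these assumptions no byte can fail to convert, and the high half
stays inside the surrogate range, away from any assigned character.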

I've thought about this some more, and I don't think this is the
direction we should be going in.  Instead, I would suggest the
following:

* Add a UTF-8SE charset to glibc: UTF-8 with surrogate escapes (in the
  style of Python's surrogateescape error handler).  It should be able
  to encode every byte string as a string of wchar_t characters and
  convert the result back to the original bytes.  This is not entirely
  trivial, because partial UTF-8 sequences at the end of the buffer
  need careful handling, and there might be some warts regarding EILSEQ
  handling lurking there.  Like the Python approach, it is somewhat
  imperfect because the conversion does not distribute over string
  concatenation: f(x) || f(y) is not always equal to f(x || y), for
  example when x ends in the middle of a UTF-8 sequence that y
  completes.  That is unavoidable.

* Switch the charset for the default C locale to UTF-8SE.  This matches
  the POSIX requirement that every byte can be encoded.

* Work with POSIX to drop the requirement that the C locale needs to be
  a single-byte locale.

* (Optional, somewhat unrelated.) Add a generic mechanism so that UTF-8
  locales can be used as UTF-8SE without recompilation.
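
The round-trip property and the concatenation caveat can be seen with
Python's existing surrogateescape error handler.  This is only an
illustration of the Python behaviour that the proposed UTF-8SE charset
would resemble; UTF-8SE does not exist in glibc, and the helper names f
and f_inv are made up for this sketch.  Note that Python escapes an
undecodable byte b to U+DC00 + b, not to the <UDF00>+b used by the
patch.

```python
def f(data: bytes) -> str:
    """Decode bytes as UTF-8, escaping undecodable bytes as lone surrogates."""
    return data.decode("utf-8", errors="surrogateescape")


def f_inv(text: str) -> bytes:
    """Encode back to bytes, turning escaped surrogates into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Every byte string survives a decode/encode round trip.
samples = [b"plain ascii", b"\xc3\xa9", b"\xff\xfe", bytes(range(256))]
round_trips = [f_inv(f(s)) == s for s in samples]

# Splitting a valid two-byte sequence (U+00E9) across two buffers:
x = b"\xc3"  # lead byte only, a partial sequence at the end of the buffer
y = b"\xa9"  # continuation byte only
separate = f(x) + f(y)  # two escaped bytes: "\udcc3\udca9"
joined = f(x + y)       # one character: "\u00e9"
```

Here separate and joined differ, which is the f(x) || f(y) versus
f(x || y) mismatch described above, even though each of them converts
back to the same two bytes.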

Thanks,
Florian