From: Florian Weimer via Libc-alpha <libc-alpha@sourceware.org>
To: наб <nabijaczleweli@nabijaczleweli.xyz>
Cc: libc-alpha@sourceware.org, Victor Stinner <vstinner@redhat.com>
Subject: Re: [PATCH v6 2/2] POSIX locale covers every byte [BZ# 29511]
Date: Wed, 09 Nov 2022 15:20:26 +0100 [thread overview]
Message-ID: <874jv8dxat.fsf@oldenburg.str.redhat.com> (raw)
In-Reply-To: <969aa82c8d5904c1d2040bba87abe2f17a0dc647.1667409408.git.nabijaczleweli@nabijaczleweli.xyz> ("наб"'s message of "Wed, 2 Nov 2022 18:17:17 +0100")
* наб:
> This is a logistically trivial patch,
> largely duplicating the extant ASCII code with the error path changed
I wouldn't say it's trivial in the commit message. 8-)
> There are two user-facing changes:
> * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
> * mbrtowc() and friends return b if b <= 0x7F else <UDF00>+b
>
> Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
> (a) is 1-byte, stateless, and contains 256 characters
> (b) they collate in byte order
> (c) the first 128 characters are equivalent to ASCII (like previous)
> cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
> changes to the standard;
> in short, this means that mbrtowc() must never fail and must return
> b if b <= 0x7F else ab+c for all bytes b
> where c is some constant >=0x80
> and a is a positive integer constant
>
> By strategically picking c=<UDF00> we land at the tail-end of the
> Unicode Low Surrogate Area at DC00-DFFF, described as
> > Isolated surrogate code points have no interpretation;
> > consequently, no character code charts or names lists
> > are provided for this range.
> and match musl
Sadly this doesn't match Python and PEP 540:
>>> b'\x80'.decode('UTF-8', errors='surrogateescape')
'\udc80'
I believe the implementation translates this to 0xDF80 instead.
Not sure what is more important here, musl compatibility or Python
compatibility. Cc:ing Victor in case he as comments. I should probably
ask on the musl list as well as how this divergence came to pass.
This change definitely needs a NEWS entry.
The mechanics of the patch look okay to me, just a few nits below.
> diff --git a/iconv/gconv_int.h b/iconv/gconv_int.h
> index 1c6745043e..45ab1edfad 100644
> --- a/iconv/gconv_int.h
> +++ b/iconv/gconv_int.h
> @@ -281,6 +281,8 @@ extern int __gconv_compare_alias (const char *name1, const char *name2)
>
> __BUILTIN_TRANSFORM (__gconv_transform_ascii_internal);
> __BUILTIN_TRANSFORM (__gconv_transform_internal_ascii);
> +__BUILTIN_TRANSFORM (__gconv_transform_posix_internal);
> +__BUILTIN_TRANSFORM (__gconv_transform_internal_posix);
> __BUILTIN_TRANSFORM (__gconv_transform_utf8_internal);
> __BUILTIN_TRANSFORM (__gconv_transform_internal_utf8);
> __BUILTIN_TRANSFORM (__gconv_transform_ucs2_internal);
> @@ -299,6 +301,11 @@ __BUILTIN_TRANSFORM (__gconv_transform_utf16_internal);
> only ASCII characters. */
> extern wint_t __gconv_btwoc_ascii (struct __gconv_step *step, unsigned char c);
>
> +/* Specialized conversion function for a single byte to INTERNAL,
> + identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the end
> + of the Low Surrogate Area at [U+DF80, U+DFFF]. */
> +extern wint_t __gconv_btwoc_posix (struct __gconv_step *step, unsigned char c);
> +
> #endif
Missing attribute_hidden. Yes, it's also missing from
__gconv_btwoc_ascii. The linker probably papers over it.
>
> __END_DECLS
> diff --git a/iconv/gconv_posix.c b/iconv/gconv_posix.c
> new file mode 100644
> index 0000000000..dcb13fbb43
> --- /dev/null
> +++ b/iconv/gconv_posix.c
> @@ -0,0 +1,96 @@
> +/* Simple transformations functions.
I think this line should say something about surrogate-escape encoding
for the POSIX locale.
> +#define MIN_NEEDED_INPUT MIN_NEEDED_FROM
> +#define MIN_NEEDED_OUTPUT MIN_NEEDED_TO
> +#define LOOPFCT FROM_LOOP
> +#define BODY \
> + { \
> + uint32_t val = *((const uint32_t *) inptr); \
> + if (__glibc_unlikely ((val > 0x7f && val < 0xdf80) || val > 0xdfff)) \
> + { \
> + UNICODE_TAG_HANDLER (val, 4); \
> + STANDARD_TO_LOOP_ERR_HANDLER (4); \
> + } \
> + else \
> + { \
> + if (__glibc_unlikely (val > 0x7f)) \
> + val -= 0xdf00; \
> + *outptr++ = val; \
> + inptr += sizeof (uint32_t); \
> + } \
> + }
I suggest to drop the last __glibc_unlikely here because it's
input-dependent.
> +#define LOOP_NEED_FLAGS
> +#include <iconv/loop.c>
> +#include <iconv/skeleton.c>
> diff --git a/iconv/tst-iconv_prog.sh b/iconv/tst-iconv_prog.sh
> index b3d8bf5110..a24d8d2207 100644
> --- a/iconv/tst-iconv_prog.sh
> +++ b/iconv/tst-iconv_prog.sh
> @@ -285,3 +285,46 @@ for errorcommand in "${errorarray[@]}"; do
> execute_test
> check_errtest_result
> done
> +
> +allbytes ()
> +{
> + for (( i = 0; i <= 255; i++ )); do
> + printf '\'"$(printf "%o" "$i")"
> + done
> +}
> +
> +allucs4be ()
> +{
> + for (( i = 0; i <= 127; i++ )); do
> + printf '\0\0\0\'"$(printf "%o" "$i")"
> + done
> + for (( i = 128; i <= 255; i++ )); do
> + printf '\0\0\xdf\'"$(printf "%o" "$i")"
> + done
> +}
> +
> +check_posix_result ()
> +{
> + if [ $? -eq 0 ]; then
> + result=PASS
> + else
> + result=FAIL
> + fi
> +
> + echo "$result: from \"$1\", to: \"$2\""
> +
> + if [ "$result" != "PASS" ]; then
> + exit 1
> + fi
> +}
> +
> +check_posix_encoding ()
> +{
> + eval PROG=\"$ICONV\"
> + allbytes | $PROG -f POSIX -t UCS-4BE | cmp -s - <(allucs4be)
> + check_posix_result POSIX UCS-4BE
> + allucs4be | $PROG -f UCS-4BE -t POSIX | cmp -s - <(allbytes)
> + check_posix_result UCS-4BE POSIX
> +}
> +
> +check_posix_encoding
> diff --git a/iconvdata/tst-tables.sh b/iconvdata/tst-tables.sh
> index 4207b44175..33a02158ac 100755
> --- a/iconvdata/tst-tables.sh
> +++ b/iconvdata/tst-tables.sh
> @@ -31,6 +31,7 @@ cat <<EOF |
> # Keep this list in the same order as gconv-modules.
> #
> # charset name table name comment
> + POSIX
> ASCII ANSI_X3.4-1968
> ISO646-GB BS_4730
> ISO646-CA CSA_Z243.4-1985-1
> diff --git a/inet/tst-idna_name_classify.c b/inet/tst-idna_name_classify.c
> index bfd34eee31..b379481844 100644
> --- a/inet/tst-idna_name_classify.c
> +++ b/inet/tst-idna_name_classify.c
> @@ -37,11 +37,11 @@ do_test (void)
> puts ("info: C locale tests");
> locale_insensitive_tests ();
> TEST_COMPARE (__idna_name_classify ("abc\200def"),
> - idna_name_encoding_error);
> + idna_name_nonascii);
> TEST_COMPARE (__idna_name_classify ("abc\200\\def"),
> - idna_name_encoding_error);
> + idna_name_nonascii_backslash);
> TEST_COMPARE (__idna_name_classify ("abc\377def"),
> - idna_name_encoding_error);
> + idna_name_nonascii);
>
> puts ("info: en_US.ISO-8859-1 locale tests");
> if (setlocale (LC_CTYPE, "en_US.ISO-8859-1") == 0)
This seems to be okay, there is further test coverage for
idna_name_encoding_error.
> diff --git a/locale/tst-C-locale.c b/locale/tst-C-locale.c
> index 6bd0367069..f30396ae12 100644
> --- a/locale/tst-C-locale.c
> +++ b/locale/tst-C-locale.c
> @@ -229,6 +229,75 @@ run_test (const char *locname)
> STRTEST (YESSTR, "");
> STRTEST (NOSTR, "");
>
> +#define CONVTEST(b, v) \
> + { \
> + unsigned char bs[] = {b, 0}; \
> + mbstate_t ctx = {}; \
> + wchar_t wc = -1; \
> + size_t sz = mbrtowc(&wc, (char *) bs, 1, &ctx); \
Missing space before '(' (also in other cases below).
Not sure if the macros are needed, maybe write one loop for each
direction with a condition in it?
> diff --git a/stdio-common/tst-printf-bz25691.c b/stdio-common/tst-printf-bz25691.c
> index 44844e71c3..e66242b58f 100644
> --- a/stdio-common/tst-printf-bz25691.c
> +++ b/stdio-common/tst-printf-bz25691.c
> @@ -30,6 +30,8 @@
> static int
> do_test (void)
> {
> + setlocale(LC_CTYPE, "C.UTF-8");
> +
> mtrace ();
>
> /* For 's' conversion specifier with 'l' modifier the array must be
What's the rationale for this change? If it is really required, you
must also update stdio-common/Makefile with a new dependency on
$(gen-locales).
Thanks,
Florian
next prev parent reply other threads:[~2022-11-09 14:20 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-08-30 18:19 [PATCH] POSIX locale covers every byte [BZ# 29511] наб via Libc-alpha
2022-09-06 14:06 ` [PATCH v2] " наб via Libc-alpha
2022-09-06 14:19 ` [PATCH] " Florian Weimer via Libc-alpha
2022-09-06 18:06 ` наб via Libc-alpha
2022-09-06 18:10 ` [PATCH v3 1/2] iconvdata/tst-table-charmap.sh: remove handling of old, borrowed format наб via Libc-alpha
2022-09-14 2:39 ` [PATCH v4 " наб via Libc-alpha
2022-09-21 14:01 ` [PATCH v5 " наб via Libc-alpha
2022-11-02 17:17 ` [PATCH v6 " наб via Libc-alpha
2022-11-09 12:49 ` Florian Weimer via Libc-alpha
2022-11-02 17:17 ` [PATCH v6 2/2] POSIX locale covers every byte [BZ# 29511] наб via Libc-alpha
2022-11-09 14:20 ` Florian Weimer via Libc-alpha [this message]
2022-11-09 16:14 ` [PATCH v7] " наб via Libc-alpha
2022-11-10 9:52 ` Florian Weimer via Libc-alpha
2023-01-09 15:17 ` [PATCH v8] " наб via Libc-alpha
2023-02-07 14:16 ` [PATCH v9] " наб via Libc-alpha
2023-02-13 14:52 ` Florian Weimer via Libc-alpha
2023-04-26 18:54 ` наб via Libc-alpha
2023-04-26 21:27 ` Florian Weimer via Libc-alpha
2023-04-27 0:17 ` [PATCH v10] " наб via Libc-alpha
2023-04-28 15:43 ` [PATCH v11] " наб via Libc-alpha
2023-05-07 22:53 ` [PATCH v12] " наб via Libc-alpha
2023-05-29 13:54 ` [PATCH v13] " наб via Libc-alpha
2022-11-10 8:10 ` [PATCH v6 2/2] " Florian Weimer via Libc-alpha
2022-11-28 16:24 ` наб via Libc-alpha
2022-12-02 17:36 ` Florian Weimer via Libc-alpha
2022-12-02 18:42 ` наб via Libc-alpha
2022-09-21 14:01 ` [PATCH v5 " наб via Libc-alpha
2022-09-14 2:39 ` [PATCH v4 " наб via Libc-alpha
2022-09-06 18:11 ` [PATCH v3 " наб via Libc-alpha
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://www.gnu.org/software/libc/involved.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=874jv8dxat.fsf@oldenburg.str.redhat.com \
--to=libc-alpha@sourceware.org \
--cc=fweimer@redhat.com \
--cc=nabijaczleweli@nabijaczleweli.xyz \
--cc=vstinner@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).