Re: [PATCH v6 2/2] POSIX locale covers every byte [BZ# 29511]

unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed

From: Florian Weimer via Libc-alpha <libc-alpha@sourceware.org>
To: наб <nabijaczleweli@nabijaczleweli.xyz>
Cc: libc-alpha@sourceware.org, Victor Stinner <vstinner@redhat.com>
Subject: Re: [PATCH v6 2/2] POSIX locale covers every byte [BZ# 29511]
Date: Wed, 09 Nov 2022 15:20:26 +0100	[thread overview]
Message-ID: <874jv8dxat.fsf@oldenburg.str.redhat.com> (raw)
In-Reply-To: <969aa82c8d5904c1d2040bba87abe2f17a0dc647.1667409408.git.nabijaczleweli@nabijaczleweli.xyz> ("наб"'s message of "Wed, 2 Nov 2022 18:17:17 +0100")

* наб:

> This is a logistically trivial patch,
> largely duplicating the extant ASCII code with the error path changed

I wouldn't say it's trivial in the commit message. 8-)

> There are two user-facing changes:
>   * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
>   * mbrtowc() and friends return b if b <= 0x7F else <UDF00>+b
>
> Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
>   (a) is 1-byte, stateless, and contains 256 characters
>   (b) they collate in byte order
>   (c) the first 128 characters are equivalent to ASCII (like previous)
> cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
> changes to the standard;
> in short, this means that mbrtowc() must never fail and must return
>   b if b <= 0x7F else ab+c for all bytes b
>   where c is some constant >=0x80
>     and a is a positive integer constant
>
> By strategically picking c=<UDF00> we land at the tail-end of the
> Unicode Low Surrogate Area at DC00-DFFF, described as
>   > Isolated surrogate code points have no interpretation;
>   > consequently, no character code charts or names lists
>   > are provided for this range.
> and match musl

Sadly this doesn't match Python and PEP 540:

>>> b'\x80'.decode('UTF-8', errors='surrogateescape')
'\udc80'

I believe the implementation translates this to 0xDF80 instead.

Not sure what is more important here, musl compatibility or Python
compatibility.  Cc:ing Victor in case he as comments.  I should probably
ask on the musl list as well as how this divergence came to pass.

This change definitely needs a NEWS entry.

The mechanics of the patch look okay to me, just a few nits below.

> diff --git a/iconv/gconv_int.h b/iconv/gconv_int.h
> index 1c6745043e..45ab1edfad 100644
> --- a/iconv/gconv_int.h
> +++ b/iconv/gconv_int.h
> @@ -281,6 +281,8 @@ extern int __gconv_compare_alias (const char *name1, const char *name2)
>  
>  __BUILTIN_TRANSFORM (__gconv_transform_ascii_internal);
>  __BUILTIN_TRANSFORM (__gconv_transform_internal_ascii);
> +__BUILTIN_TRANSFORM (__gconv_transform_posix_internal);
> +__BUILTIN_TRANSFORM (__gconv_transform_internal_posix);
>  __BUILTIN_TRANSFORM (__gconv_transform_utf8_internal);
>  __BUILTIN_TRANSFORM (__gconv_transform_internal_utf8);
>  __BUILTIN_TRANSFORM (__gconv_transform_ucs2_internal);
> @@ -299,6 +301,11 @@ __BUILTIN_TRANSFORM (__gconv_transform_utf16_internal);
>     only ASCII characters.  */
>  extern wint_t __gconv_btwoc_ascii (struct __gconv_step *step, unsigned char c);
>  
> +/* Specialized conversion function for a single byte to INTERNAL,
> +   identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the end
> +   of the Low Surrogate Area at [U+DF80, U+DFFF].  */
> +extern wint_t __gconv_btwoc_posix (struct __gconv_step *step, unsigned char c);
> +
>  #endif

Missing attribute_hidden.  Yes, it's also missing from
__gconv_btwoc_ascii.  The linker probably papers over it.

>  
>  __END_DECLS
> diff --git a/iconv/gconv_posix.c b/iconv/gconv_posix.c
> new file mode 100644
> index 0000000000..dcb13fbb43
> --- /dev/null
> +++ b/iconv/gconv_posix.c
> @@ -0,0 +1,96 @@
> +/* Simple transformations functions.

I think this line should say something about surrogate-escape encoding
for the POSIX locale.

> +#define MIN_NEEDED_INPUT	MIN_NEEDED_FROM
> +#define MIN_NEEDED_OUTPUT	MIN_NEEDED_TO
> +#define LOOPFCT			FROM_LOOP
> +#define BODY \
> +  {									      \
> +    uint32_t val = *((const uint32_t *) inptr);				      \
> +    if (__glibc_unlikely ((val > 0x7f && val < 0xdf80) || val > 0xdfff))      \
> +      {									      \
> +	UNICODE_TAG_HANDLER (val, 4);					      \
> +	STANDARD_TO_LOOP_ERR_HANDLER (4);				      \
> +      }									      \
> +    else								      \
> +      {									      \
> +	if (__glibc_unlikely (val > 0x7f))				      \
> +	  val -= 0xdf00;						      \
> +	*outptr++ = val;						      \
> +	inptr += sizeof (uint32_t);					      \
> +      }									      \
> +  }

I suggest to drop the last __glibc_unlikely here because it's
input-dependent.

> +#define LOOP_NEED_FLAGS
> +#include <iconv/loop.c>
> +#include <iconv/skeleton.c>
> diff --git a/iconv/tst-iconv_prog.sh b/iconv/tst-iconv_prog.sh
> index b3d8bf5110..a24d8d2207 100644
> --- a/iconv/tst-iconv_prog.sh
> +++ b/iconv/tst-iconv_prog.sh
> @@ -285,3 +285,46 @@ for errorcommand in "${errorarray[@]}"; do
>    execute_test
>    check_errtest_result
>  done
> +
> +allbytes ()
> +{
> +  for (( i = 0; i <= 255; i++ )); do
> +    printf '\'"$(printf "%o" "$i")"
> +  done
> +}
> +
> +allucs4be ()
> +{
> +  for (( i = 0; i <= 127; i++ )); do
> +    printf '\0\0\0\'"$(printf "%o" "$i")"
> +  done
> +  for (( i = 128; i <= 255; i++ )); do
> +    printf '\0\0\xdf\'"$(printf "%o" "$i")"
> +  done
> +}
> +
> +check_posix_result ()
> +{
> +  if [ $? -eq 0 ]; then
> +    result=PASS
> +  else
> +    result=FAIL
> +  fi
> +
> +  echo "$result: from \"$1\", to: \"$2\""
> +
> +  if [ "$result" != "PASS" ]; then
> +    exit 1
> +  fi
> +}
> +
> +check_posix_encoding ()
> +{
> +  eval PROG=\"$ICONV\"
> +  allbytes  | $PROG -f POSIX -t UCS-4BE | cmp -s - <(allucs4be)
> +  check_posix_result POSIX UCS-4BE
> +  allucs4be | $PROG -f UCS-4BE -t POSIX | cmp -s - <(allbytes)
> +  check_posix_result UCS-4BE POSIX
> +}
> +
> +check_posix_encoding
> diff --git a/iconvdata/tst-tables.sh b/iconvdata/tst-tables.sh
> index 4207b44175..33a02158ac 100755
> --- a/iconvdata/tst-tables.sh
> +++ b/iconvdata/tst-tables.sh
> @@ -31,6 +31,7 @@ cat <<EOF |
>    # Keep this list in the same order as gconv-modules.
>    #
>    # charset name    table name          comment
> +  POSIX
>    ASCII             ANSI_X3.4-1968
>    ISO646-GB         BS_4730
>    ISO646-CA         CSA_Z243.4-1985-1
> diff --git a/inet/tst-idna_name_classify.c b/inet/tst-idna_name_classify.c
> index bfd34eee31..b379481844 100644
> --- a/inet/tst-idna_name_classify.c
> +++ b/inet/tst-idna_name_classify.c
> @@ -37,11 +37,11 @@ do_test (void)
>    puts ("info: C locale tests");
>    locale_insensitive_tests ();
>    TEST_COMPARE (__idna_name_classify ("abc\200def"),
> -                idna_name_encoding_error);
> +                idna_name_nonascii);
>    TEST_COMPARE (__idna_name_classify ("abc\200\\def"),
> -                idna_name_encoding_error);
> +                idna_name_nonascii_backslash);
>    TEST_COMPARE (__idna_name_classify ("abc\377def"),
> -                idna_name_encoding_error);
> +                idna_name_nonascii);
>  
>    puts ("info: en_US.ISO-8859-1 locale tests");
>    if (setlocale (LC_CTYPE, "en_US.ISO-8859-1") == 0)

This seems to be okay, there is further test coverage for
idna_name_encoding_error.

> diff --git a/locale/tst-C-locale.c b/locale/tst-C-locale.c
> index 6bd0367069..f30396ae12 100644
> --- a/locale/tst-C-locale.c
> +++ b/locale/tst-C-locale.c
> @@ -229,6 +229,75 @@ run_test (const char *locname)
>    STRTEST (YESSTR, "");
>    STRTEST (NOSTR, "");
>  
> +#define CONVTEST(b, v) \
> +  {									      \
> +    unsigned char bs[] = {b, 0};					      \
> +    mbstate_t ctx = {};							      \
> +    wchar_t wc = -1;							      \
> +    size_t sz = mbrtowc(&wc, (char *) bs, 1, &ctx);			      \

Missing space before '(' (also in other cases below).

Not sure if the macros are needed, maybe write one loop for each
direction with a condition in it?

> diff --git a/stdio-common/tst-printf-bz25691.c b/stdio-common/tst-printf-bz25691.c
> index 44844e71c3..e66242b58f 100644
> --- a/stdio-common/tst-printf-bz25691.c
> +++ b/stdio-common/tst-printf-bz25691.c
> @@ -30,6 +30,8 @@
>  static int
>  do_test (void)
>  {
> +  setlocale(LC_CTYPE, "C.UTF-8");
> +
>    mtrace ();
>  
>    /* For 's' conversion specifier with 'l' modifier the array must be

What's the rationale for this change?  If it is really required, you
must also update stdio-common/Makefile with a new dependency on
$(gen-locales).

Thanks,
Florian

next prev parent reply	other threads:[~2022-11-09 14:20 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-30 18:19 [PATCH] POSIX locale covers every byte [BZ# 29511] наб via Libc-alpha
2022-09-06 14:06 ` [PATCH v2] " наб via Libc-alpha
2022-09-06 14:19 ` [PATCH] " Florian Weimer via Libc-alpha
2022-09-06 18:06   ` наб via Libc-alpha
2022-09-06 18:10     ` [PATCH v3 1/2] iconvdata/tst-table-charmap.sh: remove handling of old, borrowed format наб via Libc-alpha
2022-09-14  2:39       ` [PATCH v4 " наб via Libc-alpha
2022-09-21 14:01         ` [PATCH v5 " наб via Libc-alpha
2022-11-02 17:17           ` [PATCH v6 " наб via Libc-alpha
2022-11-09 12:49             ` Florian Weimer via Libc-alpha
2022-11-02 17:17           ` [PATCH v6 2/2] POSIX locale covers every byte [BZ# 29511] наб via Libc-alpha
2022-11-09 14:20             ` Florian Weimer via Libc-alpha [this message]
2022-11-09 16:14               ` [PATCH v7] " наб via Libc-alpha
2022-11-10  9:52                 ` Florian Weimer via Libc-alpha
2023-01-09 15:17                   ` [PATCH v8] " наб via Libc-alpha
2023-02-07 14:16                     ` [PATCH v9] " наб via Libc-alpha
2023-02-13 14:52                       ` Florian Weimer via Libc-alpha
2023-04-26 18:54                         ` наб via Libc-alpha
2023-04-26 21:27                           ` Florian Weimer via Libc-alpha
2023-04-27  0:17                             ` [PATCH v10] " наб via Libc-alpha
2023-04-28 15:43                               ` [PATCH v11] " наб via Libc-alpha
2023-05-07 22:53                                 ` [PATCH v12] " наб via Libc-alpha
2023-05-29 13:54                                   ` [PATCH v13] " наб via Libc-alpha
2022-11-10  8:10               ` [PATCH v6 2/2] " Florian Weimer via Libc-alpha
2022-11-28 16:24                 ` наб via Libc-alpha
2022-12-02 17:36                   ` Florian Weimer via Libc-alpha
2022-12-02 18:42                     ` наб via Libc-alpha
2022-09-21 14:01         ` [PATCH v5 " наб via Libc-alpha
2022-09-14  2:39       ` [PATCH v4 " наб via Libc-alpha
2022-09-06 18:11     ` [PATCH v3 " наб via Libc-alpha

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/libc/involved.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874jv8dxat.fsf@oldenburg.str.redhat.com \
    --to=libc-alpha@sourceware.org \
    --cc=fweimer@redhat.com \
    --cc=nabijaczleweli@nabijaczleweli.xyz \
    --cc=vstinner@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).