u32_normalize UNINORM_NFKC on 0xD800

bug-gnulib@gnu.org mirror (unofficial)

* u32_normalize UNINORM_NFKC on 0xD800
@ 2011-05-26 21:35 Simon Josefsson
  2011-05-26 23:49 ` Bruno Haible
From: Simon Josefsson @ 2011-05-26 21:35 UTC
  To: bug-gnulib

I'm doing some Unicode NFKC operations and noticing that u32_normalize
fails for U+D800.  Is this behaviour permitted by TR15?  I thought
toNFKC should succeed for all code points.

/Simon


* Re: u32_normalize UNINORM_NFKC on 0xD800
  2011-05-26 21:35 u32_normalize UNINORM_NFKC on 0xD800 Simon Josefsson
@ 2011-05-26 23:49 ` Bruno Haible
  2011-05-27  9:23   ` Simon Josefsson
From: Bruno Haible @ 2011-05-26 23:49 UTC
  To: bug-gnulib; +Cc: Simon Josefsson

Simon Josefsson wrote:
> I'm doing some Unicode NFKC operations and noticing that u32_normalize
> fails for U+D800.

This is valid behaviour, because U+D800 is a "surrogate" code point
and therefore not a valid character code point.

See the Unicode standard, chapter 2 [1], pages 23..24:
Surrogate code points and other non-character code points "should never be
interchanged". This means, for libunistring, that they are invalid input
and invalid output in all functions taking or returning UTF-32 strings or
UTF-8 strings.

Character code points and code points that are in regions that may be assigned
in future Unicode versions must not be rejected; these are valid input.
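
As a rough illustration of the rule above, the following sketch scans a
UTF-32 array and reports the first element that would be rejected. The
helper name first_invalid_u32 is hypothetical and not part of libunistring;
the range test is an assumption modelled on the check visible in the
u32-uctomb.c patch later in this thread. It rejects only surrogates and
values above U+10FFFF, and deliberately accepts unassigned code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a libunistring function: return a pointer to
   the first element of s[0..n-1] that is not a Unicode scalar value, or
   NULL if every element is acceptable.  */
static const uint32_t *
first_invalid_u32 (const uint32_t *s, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      uint32_t uc = s[i];
      /* Surrogates (U+D800..U+DFFF) and values above U+10FFFF are
         rejected.  Code points that are merely unassigned pass.  */
      if ((uc >= 0xD800 && uc <= 0xDFFF) || uc > 0x10FFFF)
        return &s[i];
    }
  return NULL;
}
```

A caller could run such a check before handing a buffer to a normalization
function, so that a failure can be attributed to a specific element.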

Bruno

[1] http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf
-- 
In memoriam Jeane Gardiner <http://en.wikipedia.org/wiki/Jeane_Gardiner>


* Re: u32_normalize UNINORM_NFKC on 0xD800
  2011-05-26 23:49 ` Bruno Haible
@ 2011-05-27  9:23   ` Simon Josefsson
  2011-05-27 11:36     ` Simon Josefsson
  2011-05-27 17:42     ` Bruno Haible
From: Simon Josefsson @ 2011-05-27  9:23 UTC
  To: Bruno Haible; +Cc: bug-gnulib

Bruno Haible <bruno@clisp.org> writes:

> Simon Josefsson wrote:
>> I'm doing some Unicode NFKC operations and noticing that u32_normalize
>> fails for U+D800.
>
> This is valid behaviour, because U+D800 is a "surrogate" code point
> and therefore not a valid character code point.
>
> See the Unicode standard, chapter 2 [1], pages 23..24:
> Surrogate code points and other non-character code points "should never be
> interchanged". This means, for libunistring, that they are invalid input
> and invalid output in all functions taking or returning UTF-32 strings or
> UTF-8 strings.
>
> Character code points and code points that are in regions that may be assigned
> in future Unicode versions must not be rejected; these are valid input.

I'm not interchanging the code points, I'm calculating this IDNA2008
property

   toNFKC(toCaseFold(toNFKC(cp))) != cp

for all code points.  Is this impossible to do with the u32_normalize
interface?
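
One possible way to evaluate a per-code-point property without ever handing
a surrogate to the normalizer is to skip U+D800..U+DFFF in the driving loop.
The sketch below is only an assumption about how such a loop could look; it
is not the approach taken in this thread. The names cp_property_fn,
count_scalars_with_property and ascii_upper_stub are hypothetical. The stub
stands in for the real property, which would have to be built from the
library's normalization and case-folding calls.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: returns nonzero when the property holds
   for the code point cp.  */
typedef int (*cp_property_fn) (uint32_t cp);

/* Count the Unicode scalar values for which the property holds.
   Surrogates are skipped, so the callback never sees one.  */
static size_t
count_scalars_with_property (cp_property_fn property)
{
  size_t count = 0;
  for (uint32_t cp = 0; cp <= 0x10FFFF; cp++)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        continue;               /* not valid input for normalization */
      if (property (cp))
        count++;
    }
  return count;
}

/* Stand-in for the real property, for demonstration only: it is true
   exactly for the ASCII uppercase letters, which case folding changes.  */
static int
ascii_upper_stub (uint32_t cp)
{
  return cp >= 'A' && cp <= 'Z';
}
```

With this structure, how surrogates are classified becomes an explicit
decision in the caller rather than a side effect of a library error.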

I notice that ICU also gives an error in this situation:

http://demo.icu-project.org/icu-bin/nbrowser?t=&s=D800&uv=0

I wonder what the above expression means when toNFKC fails.

I managed to work around this using a local patch to make u32_uctomb
mimic u32_mbtouc_unsafe's behaviour.  But I'm not sure if I'm going to
use it.

--- lib/unistr/u32-uctomb.c.orig	2011-05-27 11:16:00.112466242 +0200
+++ lib/unistr/u32-uctomb.c	2011-05-27 11:16:01.696467065 +0200
@@ -30,8 +30,10 @@
 int
 u32_uctomb (uint32_t *s, ucs4_t uc, int n)
 {
+#if CONFIG_UNICODE_SAFETY
   if (uc < 0xd800 || (uc >= 0xe000 && uc < 0x110000))
     {
+#endif
       if (n > 0)
         {
           *s = uc;
@@ -39,9 +41,11 @@
         }
       else
         return -2;
+#if CONFIG_UNICODE_SAFETY
     }
   else
     return -1;
+#endif
 }
 
 #endif
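
For readers who want to see the effect of the patch without building
gnulib, here is a standalone sketch that replaces the compile-time
CONFIG_UNICODE_SAFETY switch with a runtime flag. The function name
store_code_point and the strict parameter are hypothetical. The return
value 1 on success is an assumption, because the patch context elides
that line; the values -1 and -2 are taken from the patch.

```c
#include <stdint.h>

/* Hypothetical runtime-switchable variant of the patched function.
   strict != 0: reject surrogates and values >= 0x110000 with -1.
   strict == 0: store any 32-bit value, as the patch does when
   CONFIG_UNICODE_SAFETY is not defined.  */
static int
store_code_point (uint32_t *s, uint32_t uc, int n, int strict)
{
  if (strict && !(uc < 0xD800 || (uc >= 0xE000 && uc < 0x110000)))
    return -1;                  /* invalid code point */
  if (n <= 0)
    return -2;                  /* no room in the output buffer */
  *s = uc;
  return 1;                     /* one UTF-32 unit written (assumed) */
}
```

In strict mode U+D800 yields -1, while in lenient mode it is stored
unchanged, which is the behaviour the patch was after.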

/Simon



* Re: u32_normalize UNINORM_NFKC on 0xD800
  2011-05-27  9:23   ` Simon Josefsson
@ 2011-05-27 11:36     ` Simon Josefsson
  2011-05-27 17:42     ` Bruno Haible
From: Simon Josefsson @ 2011-05-27 11:36 UTC
  To: bug-gnulib

FWIW, I came up with a better approach to handle this, and have asked
for confirmation of the interpretation on the IDNABIS list.  So I think
u32_normalize is fine, as you explained.

http://www.alvestrand.no/pipermail/idna-update/2011-May/007099.html

/Simon


* Re: u32_normalize UNINORM_NFKC on 0xD800
  2011-05-27  9:23   ` Simon Josefsson
  2011-05-27 11:36     ` Simon Josefsson
@ 2011-05-27 17:42     ` Bruno Haible
  2011-05-27 18:13       ` Simon Josefsson
From: Bruno Haible @ 2011-05-27 17:42 UTC
  To: Simon Josefsson; +Cc: bug-gnulib

Simon Josefsson wrote:
> I'm calculating this IDNA2008 property
> 
>    toNFKC(toCaseFold(toNFKC(cp))) != cp
> 
> for all code points.

It makes no sense to consider non-character code points here. Citing again
the Unicode standard, chapter 3 [1], section 3.8:

  "High-surrogate and low-surrogate code units are used only in the context
   of the UTF-16 character encoding form."

> Is this impossible to do with the u32_normalize interface?

Yes, for surrogates it is: you are passing invalid input to the
u32_normalize function.
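
To make the quoted sentence from section 3.8 concrete, the sketch below
shows the one place where values in the range 0xD800..0xDFFF do carry
meaning: as a pair of UTF-16 code units that together encode a single
supplementary code point. The helper name to_surrogate_pair is
hypothetical and is not a libunistring function.

```c
#include <stdint.h>

/* Hypothetical helper: encode a supplementary code point
   (U+10000..U+10FFFF) as a UTF-16 surrogate pair.  Returns 1 on
   success, 0 if cp is outside that range.  */
static int
to_surrogate_pair (uint32_t cp, uint16_t *high, uint16_t *low)
{
  if (cp < 0x10000 || cp > 0x10FFFF)
    return 0;
  cp -= 0x10000;                               /* 20 bits remain */
  *high = (uint16_t) (0xD800 + (cp >> 10));    /* top 10 bits */
  *low = (uint16_t) (0xDC00 + (cp & 0x3FF));   /* bottom 10 bits */
  return 1;
}
```

For example, U+10000 is encoded as the two units 0xD800 0xDC00. A lone
0xD800 in a UTF-32 string therefore does not stand for any character.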

Bruno

[1] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf



* Re: u32_normalize UNINORM_NFKC on 0xD800
  2011-05-27 17:42     ` Bruno Haible
@ 2011-05-27 18:13       ` Simon Josefsson
From: Simon Josefsson @ 2011-05-27 18:13 UTC
  To: Bruno Haible; +Cc: bug-gnulib

Bruno Haible <bruno@clisp.org> writes:

> Simon Josefsson wrote:
>> I'm calculating this IDNA2008 property
>> 
>>    toNFKC(toCaseFold(toNFKC(cp))) != cp
>> 
>> for all code points.
>
> It makes no sense to consider non-character code points here. Citing again
> the Unicode standard, chapter 3 [1], section 3.8:
>
>   "High-surrogate and low-surrogate code units are used only in the context
>    of the UTF-16 character encoding form."

It seems Mark Davis believes toNFKC should be defined for all code points:

http://www.alvestrand.no/pipermail/idna-update/2011-May/007106.html

The issue turned out to be irrelevant for me, so I don't care strongly
either way.

/Simon

