* u32_normalize UNINORM_NFKC on 0xD800
@ 2011-05-26 21:35 Simon Josefsson
2011-05-26 23:49 ` Bruno Haible
0 siblings, 1 reply; 6+ messages in thread
From: Simon Josefsson @ 2011-05-26 21:35 UTC
To: bug-gnulib
I'm doing some Unicode NFKC operations and noticing that u32_normalize
fails for U+D800. Is this behaviour permitted by TR15? I thought
toNFKC should succeed for all code points.
/Simon
* Re: u32_normalize UNINORM_NFKC on 0xD800
2011-05-26 21:35 u32_normalize UNINORM_NFKC on 0xD800 Simon Josefsson
@ 2011-05-26 23:49 ` Bruno Haible
2011-05-27 9:23 ` Simon Josefsson
0 siblings, 1 reply; 6+ messages in thread
From: Bruno Haible @ 2011-05-26 23:49 UTC
To: bug-gnulib; +Cc: Simon Josefsson
Simon Josefsson wrote:
> I'm doing some Unicode NFKC operations and noticing that u32_normalize
> fails for U+D800.
This is valid behaviour, because U+D800 is a "surrogate" code point
and therefore not a valid character code point.
See the Unicode standard, chapter 2 [1], pages 23..24:
Surrogate code points and other non-character code points "should never be
interchanged". This means, for libunistring, that they are invalid input
and invalid output in all functions taking or returning UTF-32 strings or
UTF-8 strings.
Character code points and code points that are in regions that may be assigned
in future Unicode versions must not be rejected; these are valid input.
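For illustration, the rule above can be sketched as a small predicate. The helper name is_scalar_value is hypothetical and not part of libunistring; the range test is the same one that appears in the u32_uctomb source quoted in the patch later in this thread. It rejects surrogates and values above U+10FFFF, and accepts everything else, including unassigned code points.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a libunistring function.
   Returns true if uc lies in 0..0x10FFFF and outside the surrogate
   range 0xD800..0xDFFF.  Unassigned code points are accepted, since
   they may be assigned in a future Unicode version.  */
static bool
is_scalar_value (uint32_t uc)
{
  return uc < 0xD800 || (uc >= 0xE000 && uc < 0x110000);
}
```

Under this test U+D7FF and U+E000 are accepted, while U+D800 through U+DFFF are not.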
Bruno
[1] http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf
--
In memoriam Jeane Gardiner <http://en.wikipedia.org/wiki/Jeane_Gardiner>
* Re: u32_normalize UNINORM_NFKC on 0xD800
2011-05-26 23:49 ` Bruno Haible
@ 2011-05-27 9:23 ` Simon Josefsson
2011-05-27 11:36 ` Simon Josefsson
2011-05-27 17:42 ` Bruno Haible
0 siblings, 2 replies; 6+ messages in thread
From: Simon Josefsson @ 2011-05-27 9:23 UTC
To: Bruno Haible; +Cc: bug-gnulib
Bruno Haible <bruno@clisp.org> writes:
> Simon Josefsson wrote:
>> I'm doing some Unicode NFKC operations and noticing that u32_normalize
>> fails for U+D800.
>
> This is valid behaviour, because U+D800 is a "surrogate" code point
> and therefore not a valid character code point.
>
> See the Unicode standard, chapter 2 [1], pages 23..24:
> Surrogate code points and other non-character code points "should never be
> interchanged". This means, for libunistring, that they are invalid input
> and invalid output in all functions taking or returning UTF-32 strings or
> UTF-8 strings.
>
> Character code points and code points that are in regions that may be assigned
> in future Unicode versions must not be rejected; these are valid input.
I'm not interchanging the code points, I'm calculating this IDNA2008
property
toNFKC(toCaseFold(toNFKC(cp))) != cp
for all code points. Is this impossible to do with the u32_normalize
interface?
I notice that ICU also gives an error in this situation:
http://demo.icu-project.org/icu-bin/nbrowser?t=&s=D800&uv=0
I wonder what the above expression means when toNFKC fails.
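One way to make the open question concrete is to give the property three outcomes instead of two, so that a failing transform is reported separately rather than forced into "equal" or "not equal". The sketch below is only an illustration under stated assumptions: the names transform_fn, classify and toy_transform are hypothetical, and toy_transform is a stand-in that folds ASCII upper case only and rejects surrogates the way u32_normalize does. It is not a real toNFKC(toCaseFold(toNFKC(cp))).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: writes the transformed form of cp to out and
   its length to *out_len.  Returns 0 on success, -1 on failure.  */
typedef int (*transform_fn) (uint32_t cp, uint32_t *out, size_t out_cap,
                             size_t *out_len);

enum stability { CP_STABLE, CP_UNSTABLE, CP_UNDEFINED };

/* Evaluates "transform(cp) != cp", reporting a failed transform as
   CP_UNDEFINED instead of guessing an answer.  */
static enum stability
classify (uint32_t cp, transform_fn transform)
{
  uint32_t out[32];
  size_t len;

  if (transform (cp, out, sizeof out / sizeof out[0], &len) != 0)
    return CP_UNDEFINED;
  return (len == 1 && out[0] == cp) ? CP_STABLE : CP_UNSTABLE;
}

/* Stand-in transform (an assumption, NOT real NFKC or case folding):
   lower-cases ASCII letters, fails on surrogates and out-of-range
   values.  */
static int
toy_transform (uint32_t cp, uint32_t *out, size_t out_cap, size_t *out_len)
{
  if (out_cap < 1 || (cp >= 0xD800 && cp < 0xE000) || cp >= 0x110000)
    return -1;
  out[0] = (cp >= 'A' && cp <= 'Z') ? cp + ('a' - 'A') : cp;
  *out_len = 1;
  return 0;
}
```

With this stand-in, classify ('A', toy_transform) reports CP_UNSTABLE, 'a' is CP_STABLE, and 0xD800 is CP_UNDEFINED. How the undefined case should be treated for IDNA2008 is exactly what the thread leaves open.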
I managed to work around this using a local patch to make u32_uctomb
mimic u32_mbtouc_unsafe's behaviour. But I'm not sure if I'm going to
use it.
--- lib/unistr/u32-uctomb.c.orig 2011-05-27 11:16:00.112466242 +0200
+++ lib/unistr/u32-uctomb.c 2011-05-27 11:16:01.696467065 +0200
@@ -30,8 +30,10 @@
 int
 u32_uctomb (uint32_t *s, ucs4_t uc, int n)
 {
+#if CONFIG_UNICODE_SAFETY
   if (uc < 0xd800 || (uc >= 0xe000 && uc < 0x110000))
     {
+#endif
       if (n > 0)
         {
           *s = uc;
@@ -39,9 +41,11 @@
         }
       else
         return -2;
+#if CONFIG_UNICODE_SAFETY
     }
   else
     return -1;
+#endif
 }
 
 #endif
/Simon
* Re: u32_normalize UNINORM_NFKC on 0xD800
2011-05-27 9:23 ` Simon Josefsson
@ 2011-05-27 11:36 ` Simon Josefsson
2011-05-27 17:42 ` Bruno Haible
1 sibling, 0 replies; 6+ messages in thread
From: Simon Josefsson @ 2011-05-27 11:36 UTC
To: bug-gnulib
FWIW, I came up with a better approach to handle this, and have asked
for confirmation of the interpretation on the IDNABIS list. So I think
u32_normalize is fine, as you explained.
http://www.alvestrand.no/pipermail/idna-update/2011-May/007099.html
/Simon
* Re: u32_normalize UNINORM_NFKC on 0xD800
2011-05-27 9:23 ` Simon Josefsson
2011-05-27 11:36 ` Simon Josefsson
@ 2011-05-27 17:42 ` Bruno Haible
2011-05-27 18:13 ` Simon Josefsson
1 sibling, 1 reply; 6+ messages in thread
From: Bruno Haible @ 2011-05-27 17:42 UTC
To: Simon Josefsson; +Cc: bug-gnulib
Simon Josefsson wrote:
> I'm calculating this IDNA2008 property
>
> toNFKC(toCaseFold(toNFKC(cp))) != cp
>
> for all code points.
It makes no sense to consider non-character code points here. Citing again
the Unicode standard, chapter 3 [1], section 3.8:
"High-surrogate and low-surrogate code units are used only in the context
of the UTF-16 character encoding form."
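To illustrate what that context is, here is a minimal sketch of how UTF-16 uses the surrogate range: a code point above U+FFFF is split into a high and a low surrogate code unit, so the values 0xD800..0xDFFF never stand for characters on their own. The function name utf16_encode is hypothetical and not a libunistring API.

```c
#include <stdint.h>

/* Hypothetical helper: encodes cp as UTF-16 into out.
   Returns the number of code units written (1 or 2), or -1 if cp is
   a surrogate code point or lies above 0x10FFFF.  */
static int
utf16_encode (uint32_t cp, uint16_t out[2])
{
  if (cp >= 0x110000 || (cp >= 0xD800 && cp < 0xE000))
    return -1;
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;
      return 1;
    }
  cp -= 0x10000;                               /* 20 bits remain */
  out[0] = (uint16_t) (0xD800 + (cp >> 10));   /* high surrogate */
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF)); /* low surrogate */
  return 2;
}
```

For example, U+10000 becomes the pair 0xD800 0xDC00, whereas passing 0xD800 itself as a code point is rejected.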
> Is this impossible to do with the u32_normalize interface?
Yes, you are passing invalid input to the u32_normalize function.
Bruno
[1] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
* Re: u32_normalize UNINORM_NFKC on 0xD800
2011-05-27 17:42 ` Bruno Haible
@ 2011-05-27 18:13 ` Simon Josefsson
0 siblings, 0 replies; 6+ messages in thread
From: Simon Josefsson @ 2011-05-27 18:13 UTC
To: Bruno Haible; +Cc: bug-gnulib
Bruno Haible <bruno@clisp.org> writes:
> Simon Josefsson wrote:
>> I'm calculating this IDNA2008 property
>>
>> toNFKC(toCaseFold(toNFKC(cp))) != cp
>>
>> for all code points.
>
> It makes no sense to consider non-character code points here. Citing again
> the Unicode standard, chapter 3 [1], section 3.8:
>
> "High-surrogate and low-surrogate code units are used only in the context
> of the UTF-16 character encoding form."
It seems Mark Davis believes toNFKC should be defined for all code points:
http://www.alvestrand.no/pipermail/idna-update/2011-May/007106.html
The issue turned out to be irrelevant for me, so I don't care strongly
either way.
/Simon