[ruby-core:25540] [Bug #2095] Oniguruma No Longer Understands Unihan Characters

ruby-core@ruby-lang.org archive (unofficial mirror)

* [ruby-core:25540] [Bug #2095] Oniguruma No Longer Understands Unihan Characters
@ 2009-09-13  0:21 Run Paint Run Run
  2009-09-14  0:31 ` [ruby-core:25559] " Run Paint Run Run
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Run Paint Run Run @ 2009-09-13  0:21 UTC
  To: ruby-core

Bug #2095: Oniguruma No Longer Understands Unihan Characters
http://redmine.ruby-lang.org/issues/show/2095

Author: Run Paint Run Run
Status: Open, Priority: High
ruby -v: ruby 1.9.2dev (2009-09-11) [i686-linux]

As Oniguruma was undocumented, the recent update was based mainly on guesswork. While working on a Unicode library to create an exhaustive test suite, I noticed that the update introduced a serious regression. We based the update on UnicodeData.txt and Scripts.txt, but as the former omits Unihan characters, their properties are no longer recognized. To fix this we can have tool/enc-unicode.rb parse Unihan.txt (or, rather, the files into which it is divided as of Unicode 5.2). However, I'd prefer instead to update the script to use the new XML dump Unicode has made available. This is comprehensive, and the simpler, standardized file format means parsing bugs are far less likely. In addition, it makes it easier to expand our Unicode support in the future simply by selecting additional attributes. Unfortunately, both approaches preclude storing the data file(s) in SVN (as we currently do with UnicodeData.txt and Scripts.txt) because the Unihan.txt file alone is 28MB uncompressed. (The XML dump is, of course, even bigger).

In the next 24 hours I will update the script to download the latest XML dump and parse it.
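
A quick way to check whether the properties are recognized is to match a few ideographs against \p{Han} and \p{Lo}. The following is only a minimal sketch, not part of the reported script: the sample code points are the range endpoints that UnicodeData.txt lists for the CJK blocks, and the names SAMPLES and unrecognized_han are made up for illustration.

```ruby
# Range endpoints that UnicodeData.txt lists only as First/Last pairs.
SAMPLES = {
  0x3400 => "CJK Ideograph Extension A, First",
  0x4DB5 => "CJK Ideograph Extension A, Last",
  0x4E00 => "CJK Ideograph, First",
  0x9FC3 => "CJK Ideograph, Last",
}

# Returns the samples that fail to match the Han script property or the
# Lo general category (hypothetical helper, for illustration only).
def unrecognized_han(samples)
  samples.reject do |cp, _label|
    ch = [cp].pack("U") # UTF-8 string holding the single code point
    ch =~ /\p{Han}/ && ch =~ /\p{Lo}/
  end
end

failures = unrecognized_han(SAMPLES)
if failures.empty?
  puts "all sample ideographs matched \\p{Han} and \\p{Lo}"
else
  failures.each { |cp, label| puts format("U+%04X (%s) not recognized", cp, label) }
end
```

An empty result would suggest that the property tables cover these ranges; a non-empty one would point at the kind of regression described above.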

----------------------------------------
http://redmine.ruby-lang.org


* [ruby-core:25559] [Bug #2095] Oniguruma No Longer Understands Unihan Characters
  2009-09-13  0:21 [ruby-core:25540] [Bug #2095] Oniguruma No Longer Understands Unihan Characters Run Paint Run Run
@ 2009-09-14  0:31 ` Run Paint Run Run
  2009-09-14  1:43 ` [ruby-core:25562] [Bug #2095](Closed) " Yui NARUSE
  2009-09-14  6:40 ` [ruby-core:25566] " "Martin J. Dürst"
  2 siblings, 0 replies; 6+ messages in thread
From: Run Paint Run Run @ 2009-09-14  0:31 UTC
  To: ruby-core

Issue #2095 has been updated by Run Paint Run Run.


Having re-written said script I discovered that my initial analysis was wrong; there is no bug. This ticket can be closed. I apologize. :-/
----------------------------------------
http://redmine.ruby-lang.org/issues/show/2095

* [ruby-core:25562] [Bug #2095](Closed) Oniguruma No Longer Understands Unihan Characters
  2009-09-13  0:21 [ruby-core:25540] [Bug #2095] Oniguruma No Longer Understands Unihan Characters Run Paint Run Run
  2009-09-14  0:31 ` [ruby-core:25559] " Run Paint Run Run
@ 2009-09-14  1:43 ` Yui NARUSE
  2009-09-14  6:52   ` [ruby-core:25567] " "Martin J. Dürst"
  2009-09-14  6:40 ` [ruby-core:25566] " "Martin J. Dürst"
  2 siblings, 1 reply; 6+ messages in thread
From: Yui NARUSE @ 2009-09-14  1:43 UTC
  To: ruby-core

Issue #2095 has been updated by Yui NARUSE.

Status changed from Open to Closed

OK, I'll close this.

Anyway, I thought UnicodeData.txt and Scripts.txt are also large.
So that source data shouldn't be bundled with Ruby; it should be downloaded by enc-unicode.rb when it runs and used from there.

So you can use the XML dump :-)
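
Downloading the data on demand could look roughly like the sketch below. It is not what enc-unicode.rb does; the helper names (ucd_url, fetch_ucd), the cache layout, and the URL pattern under www.unicode.org/Public are assumptions made for illustration. The version is passed explicitly so that a given Ruby branch stays on the Unicode version it chose.

```ruby
require "fileutils"
require "open-uri"

# Assumed location of the versioned Unicode Character Database files.
UCD_BASE = "https://www.unicode.org/Public"

# Builds the URL for one data file of one pinned Unicode version.
def ucd_url(file, version)
  "#{UCD_BASE}/#{version}/ucd/#{file}"
end

# Returns the contents of a UCD file, downloading it only when it is not
# already cached under cache_dir/version/. The fetcher is injectable so
# the function can be exercised without network access.
def fetch_ucd(file, version, cache_dir, fetcher = ->(url) { URI.open(url, &:read) })
  path = File.join(cache_dir, version, file)
  return File.read(path) if File.exist?(path)

  data = fetcher.call(ucd_url(file, version))
  FileUtils.mkdir_p(File.dirname(path))
  File.write(path, data)
  data
end

# Example (performs a real download on first use):
#   data = fetch_ucd("UnicodeData.txt", "5.2.0", "tmp/ucd")
```

Keeping the version in the path means a cached 5.2.0 file is never silently replaced by a newer release.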
----------------------------------------
http://redmine.ruby-lang.org/issues/show/2095

* [ruby-core:25566] Re: [Bug #2095] Oniguruma No Longer Understands Unihan Characters
  2009-09-13  0:21 [ruby-core:25540] [Bug #2095] Oniguruma No Longer Understands Unihan Characters Run Paint Run Run
  2009-09-14  0:31 ` [ruby-core:25559] " Run Paint Run Run
  2009-09-14  1:43 ` [ruby-core:25562] [Bug #2095](Closed) " Yui NARUSE
@ 2009-09-14  6:40 ` "Martin J. Dürst"
  2 siblings, 0 replies; 6+ messages in thread
From: "Martin J. Dürst" @ 2009-09-14  6:40 UTC
  To: ruby-core



On 2009/09/13 9:21, Run Paint Run Run wrote:
> Bug #2095: Oniguruma No Longer Understands Unihan Characters
> http://redmine.ruby-lang.org/issues/show/2095
>
> Author: Run Paint Run Run
> Status: Open, Priority: High
> ruby -v: ruby 1.9.2dev (2009-09-11) [i686-linux]
>
> As Oniguruma was undocumented, the recent update was based mainly on guesswork.

> We based the update on UnicodeData.txt and Scripts.txt,

UnicodeData.txt has long contained two-line entries such as

3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

or

4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FC3;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;

or

AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
DB80;<Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DBFF;<Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
DC00;<Low Surrogate, First>;Cs;0;L;;;;;N;;;;;
DFFF;<Low Surrogate, Last>;Cs;0;L;;;;;N;;;;;
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;

These are indications of any of the following:
1) All the characters in the respective range have the same property 
(e.g. 'Lo' for CJK Ideographs)
2) Certain properties essentially don't apply (e.g. Surrogates are 'L', 
but for Ruby, they should not exist, and certainly not match in Regexps)
3) Properties or other relevant data should be generated algorithmically 
(e.g. Character Names for Ideographs and Hangul, normalization 
(de)compositions for Hangul,...)

In my experience, it is best to handle each of these specific ranges 
explicitly in a script such as yours, and to throw an error (and use a 
patch to fix it) when a new range is encountered, because a) new such 
ranges are added rarely (currently, there are only 10), and b) it is 
impossible to predict which of the above three cases applies.
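
As a rough illustration of that approach (a sketch only, not the code in tool/enc-unicode.rb): the parser below pairs up First/Last lines, looks each range name up in an explicit table, and raises on any range it has not been told about. The table covers only the seven ranges quoted above, and the :expand / :skip policies are assumptions about what a general-category table would want.

```ruby
# Explicit policy per range name (hypothetical): :expand records the whole
# range with the listed general category, :skip leaves it out.
RANGE_POLICY = {
  "CJK Ideograph Extension A"      => :expand,
  "CJK Ideograph"                  => :expand,
  "Hangul Syllable"                => :expand,
  "Non Private Use High Surrogate" => :skip,
  "Private Use High Surrogate"     => :skip,
  "Low Surrogate"                  => :skip,
  "Private Use"                    => :expand,
}.freeze

# Parses UnicodeData.txt-style lines into [range, general_category] pairs.
# Single code points become one-element ranges.
def parse_unicode_data(lines)
  entries = []
  pending = nil # [range_name, first_code_point] while awaiting the Last line
  lines.each do |line|
    fields = line.chomp.split(";", -1)
    next if fields.size < 3
    cp, name, gc = fields[0].hex, fields[1], fields[2]
    if (m = /\A<(.+), First>\z/.match(name))
      raise ArgumentError, "nested range at #{fields[0]}" if pending
      pending = [m[1], cp]
    elsif (m = /\A<(.+), Last>\z/.match(name))
      unless pending && pending[0] == m[1]
        raise ArgumentError, "unmatched Last at #{fields[0]}"
      end
      policy = RANGE_POLICY.fetch(m[1]) do
        raise ArgumentError, "unknown range: #{m[1]}"
      end
      entries << [(pending[1]..cp), gc] if policy == :expand
      pending = nil
    else
      raise ArgumentError, "missing Last for #{pending[0]}" if pending
      entries << [(cp..cp), gc]
    end
  end
  raise ArgumentError, "missing Last for #{pending[0]}" if pending
  entries
end
```

A range that is missing from the table (any of the remaining ranges in the real file, or one added in a later Unicode version) stops the script with an error, so someone has to decide which of the three cases applies before the tables are regenerated.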

Regards,    Martin.

> but as the former omits Unihan characters, their properties are no longer recognized. To fix this we can have tool/enc-unicode.rb parse Unihan.txt (or, rather, the files into which it is divided as of Unicode 5.2). However, I'd prefer instead to update the script to use the new XML dump Unicode has made available. This is comprehensive, and the simpler, standardized file format means parsing bugs are far less likely. In addition, it makes it easier to expand our Unicode support in the future simply by selecting additional attributes. Unfortunately, both approaches preclude storing the data file(s) in SVN (as we currently do with UnicodeData.txt and Scripts.txt) because the Unihan.txt file alone is 28MB uncompressed. (The XML dump is, of course, even bigger).
>
> In the next 24 hours I will update the script to download the latest XML dump and parse it.
>
>
> ----------------------------------------
> http://redmine.ruby-lang.org
>
>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp


* [ruby-core:25567] Re: [Bug #2095](Closed) Oniguruma No Longer Understands Unihan Characters
  2009-09-14  1:43 ` [ruby-core:25562] [Bug #2095](Closed) " Yui NARUSE
@ 2009-09-14  6:52   ` "Martin J. Dürst"
  2009-09-14  7:09     ` [ruby-core:25568] [Bug #2095] " Run Paint Run Run
  0 siblings, 1 reply; 6+ messages in thread
From: "Martin J. Dürst" @ 2009-09-14  6:52 UTC
  To: ruby-core

On 2009/09/14 10:43, Yui NARUSE wrote:
> Issue #2095 has been updated by Yui NARUSE.
>
> Status changed from Open to Closed
>
> ok I close this.
>
> Anyway, I thought UnicodeData.txt and Scripts.txt are also large.

Please note that this means that implementations will take the newest 
Unicode version when compiled; this may not work if older Ruby versions 
(such as 1.9.1) do not want to follow Unicode versions automatically.

This is fine with me as I support following the newest final Unicode 
versions, but you argued the other way a few weeks ago, and we haven't 
heard back yet on this issue from Yugui.

Also, it makes it more difficult to check a beta version of Unicode on a 
'beta' (or trunk) version of Ruby unless this test is limited to 
individual (human) compilers.

Regards,   Martin.

> So that source data shouldn't be bundled with Ruby; it should be downloaded by enc-unicode.rb when it runs and used from there.
>
> So you can use the XML dump :-)
> ----------------------------------------
> http://redmine.ruby-lang.org/issues/show/2095
>
> ----------------------------------------
> http://redmine.ruby-lang.org
>
>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp


* [ruby-core:25568] [Bug #2095] Oniguruma No Longer Understands Unihan Characters
  2009-09-14  6:52   ` [ruby-core:25567] " "Martin J. Dürst"
@ 2009-09-14  7:09     ` Run Paint Run Run
  0 siblings, 0 replies; 6+ messages in thread
From: Run Paint Run Run @ 2009-09-14  7:09 UTC
  To: ruby-core

Issue #2095 has been updated by Run Paint Run Run.


> Anyway, I thought UnicodeData.txt and Scripts.txt are also large.

They're nothing compared to the full XML dump (~130MB). ;-)

> So you can use the XML dump :-)

Well, given that I've written the script now, I guess it does no harm to keep it. Maybe we can look at changing over once we have tests.

> In my experience, it is best to handle each of these specific ranges 
> explicitly in a script such as yours, and to throw an error (and use a 
> patch to fix it) when a new range is encountered, because a) new such 
> ranges are added rarely (currently, there are only 10), and b) it is 
> impossible to predict which of the above three cases applies.

Thanks. :-) This was part of the reason I wanted to use the XML dump, because I suspected it would make this kind of thing easier. (I'm learning Unicode as I go ;-)).

> Please note that this means that implementations will take the newest 
> Unicode version when compiled; this may not work if older Ruby versions 
> (such as 1.9.1) do not want to follow Unicode versions automatically.

To clarify, the property table is not regenerated on compilation; we manually update it when we want to synchronize with a new Unicode version.  :-)
----------------------------------------
http://redmine.ruby-lang.org/issues/show/2095