[ruby-dev:49454] [Ruby trunk - Bug #11859] [Open] Regexp matching with \p{Upper} and \p{Lower} for EUC-JP doesn’t work.

ruby-dev (Japanese) list archive (unofficial mirror)

* [ruby-dev:49454] [Ruby trunk - Bug #11859] [Open] Regexp matching with \p{Upper} and \p{Lower} for EUC-JP doesn’t work.
       [not found] <redmine.issue-11859.20151222081637@ruby-lang.org>
@ 2015-12-22  8:16 ` champion.is.acmilan
  2015-12-22  8:21 ` [ruby-dev:49455] [Ruby trunk - Bug #11859] " champion.is.acmilan
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 5+ messages in thread
From: champion.is.acmilan @ 2015-12-22  8:16 UTC
  To: ruby-dev

Issue #11859 has been reported by Kimihito Matsui.

----------------------------------------
Bug #11859: Regexp matching with \p{Upper} and \p{Lower} for EUC-JP doesn’t work.
https://bugs.ruby-lang.org/issues/11859

* Author: Kimihito Matsui
* Status: Open
* Priority: Normal
* Assignee: 
* ruby -v: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-darwin14]
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN
----------------------------------------
U+FF21 (Ａ, FULLWIDTH LATIN CAPITAL LETTER A) and U+00C0 (À, LATIN CAPITAL LETTER A WITH GRAVE) are both @Uppercase_Letter@, so each of the following should match at index 0, but they return 1 instead.

<pre>
ruby -e 'puts "\uFF21A".encode("EUC-JP") =~ Regexp.compile("\\\p{Upper}".encode("EUC-JP"))' # => 1
ruby -e 'puts "\u00C0A".encode("EUC-JP") =~ Regexp.compile("\\\p{Upper}".encode("EUC-JP"))' # => 1
</pre>

This also happens in lower case matching.
<pre>
ruby -e 'puts "\uFF41a".encode("EUC-JP") =~ Regexp.compile("\\\p{Lower}".encode("EUC-JP"))' # => 1
</pre>

In Unicode encoding it works as follows.
<pre>
ruby -e 'puts "\uFF21A" =~ Regexp.compile("\\\p{Upper}")'  # => 0
</pre>
It looks like @\p{Upper}@ and @\p{Lower}@ in EUC-JP regexes only match ASCII characters.



-- 
https://bugs.ruby-lang.org/

* [ruby-dev:49455] [Ruby trunk - Bug #11859] Regexp matching with \p{Upper} and \p{Lower} for EUC-JP doesn’t work.
From: champion.is.acmilan @ 2015-12-22  8:21 UTC
  To: ruby-dev

Issue #11859 has been updated by Kimihito Matsui.

Description updated

----------------------------------------
Bug #11859: Regexp matching with \p{Upper} and \p{Lower} for EUC-JP doesn’t work.
https://bugs.ruby-lang.org/issues/11859#change-55721

* Author: Kimihito Matsui
* Status: Open
* Priority: Normal
* Assignee: 
* ruby -v: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-darwin14]
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN
----------------------------------------
U+FF21 (Ａ, FULLWIDTH LATIN CAPITAL LETTER A) and U+00C0 (À, LATIN CAPITAL LETTER A WITH GRAVE) are both `Uppercase_Letter`, so each of the following should match at index 0, but they return 1 instead.

~~~
ruby -e 'puts "\uFF21A".encode("EUC-JP") =~ Regexp.compile("\\\p{Upper}".encode("EUC-JP"))' # => 1
ruby -e 'puts "\u00C0A".encode("EUC-JP") =~ Regexp.compile("\\\p{Upper}".encode("EUC-JP"))' # => 1
~~~
This also happens in lower case matching.

~~~
ruby -e 'puts "\uFF41a".encode("EUC-JP") =~ Regexp.compile("\\\p{Lower}".encode("EUC-JP"))' # => 1
~~~

In Unicode encoding it works as follows.

~~~
ruby -e 'puts "\uFF21A" =~ Regexp.compile("\\\p{Upper}")'  # => 0
~~~
It looks like `\p{Upper}` and `\p{Lower}` in EUC-JP regexes only match ASCII characters.



-- 
https://bugs.ruby-lang.org/

* [ruby-dev:49456] [Ruby trunk - Bug #11859] Regexp matching with \p{Upper} and \p{Lower} for EUC-JP doesn’t work.
From: champion.is.acmilan @ 2015-12-22  8:45 UTC
  To: ruby-dev

Issue #11859 has been updated by Kimihito Matsui.

Description updated

----------------------------------------
Bug #11859: Regexp matching with \p{Upper} and \p{Lower} for EUC-JP doesn’t work.
https://bugs.ruby-lang.org/issues/11859#change-55724

* Author: Kimihito Matsui
* Status: Open
* Priority: Normal
* Assignee: 
* ruby -v: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-darwin14]
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN
----------------------------------------
U+FF21 (Ａ, FULLWIDTH LATIN CAPITAL LETTER A) and U+00C0 (À, LATIN CAPITAL LETTER A WITH GRAVE) are both `Uppercase_Letter`, so each of the following should match at index 0, but they return 1 instead.

~~~
ruby -e 'puts "\uFF21A".encode("EUC-JP") =~ Regexp.compile("\\\p{Upper}".encode("EUC-JP"))' # => 1
ruby -e 'puts "\u00C0A".encode("EUC-JP") =~ Regexp.compile("\\\p{Upper}".encode("EUC-JP"))' # => 1
~~~
This also happens in lower case matching.

~~~
ruby -e 'puts "\uFF41a".encode("EUC-JP") =~ Regexp.compile("\\\p{Lower}".encode("EUC-JP"))' # => 1
~~~

In Unicode encoding it works as follows.

~~~
ruby -e 'puts "\uFF21A" =~ Regexp.compile("\\\p{Upper}")'  # => 0
~~~
It looks like `\p{Upper}` and `\p{Lower}` in EUC-JP regexes only match ASCII characters.
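
Since the report shows the Unicode case matching at index 0, one possible workaround is to transcode the EUC-JP string to UTF-8 before matching. The following is only a minimal sketch, not something proposed in the ticket; it assumes transcoding is acceptable for the data at hand, and the variable names are illustrative.

```ruby
# Build the EUC-JP test string from the report:
# FULLWIDTH LATIN CAPITAL LETTER A followed by ASCII "A".
euc = "\uFF21A".encode("EUC-JP")

# Transcode to UTF-8 first so that the Unicode property tables apply,
# then match with an ordinary regex literal.
index = euc.encode("UTF-8") =~ /\p{Upper}/

puts index
```

The returned index counts characters, so it is the same position in the original EUC-JP string.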



-- 
https://bugs.ruby-lang.org/

* [ruby-dev:49663] [Ruby trunk Bug#11859][Rejected] Regexp matching with \p{Upper} and \p{Lower} for EUC-JP doesn’t work.
From: naruse @ 2016-06-13  9:37 UTC
  To: ruby-dev

Issue #11859 has been updated by Yui NARUSE.

Status changed from Open to Rejected

Ruby doesn't have case tables for non-Unicode encodings.

Also, EUC-JP is a legacy encoding, and I don't think such an encoding should be extended.

----------------------------------------
Bug #11859: Regexp matching with \p{Upper} and \p{Lower} for EUC-JP doesn’t work.
https://bugs.ruby-lang.org/issues/11859#change-59185

* Author: Kimihito Matsui
* Status: Rejected
* Priority: Normal
* Assignee: 
* ruby -v: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-darwin14]
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN
----------------------------------------
-- 
https://bugs.ruby-lang.org/

* [ruby-dev:49664] [Ruby trunk Bug#11859] Regexp matching with \p{Upper} and \p{Lower} for EUC-JP doesn’t work.
From: duerst @ 2016-06-14  9:32 UTC
  To: ruby-dev

Issue #11859 has been updated by Martin Dürst.


Some additional comments following up on the committers' meeting yesterday:

There are many single-byte non-Unicode encodings that have case tables.

Checking the paper versions of the standards in question, À (LATIN CAPITAL LETTER A WITH GRAVE) exists in JIS X 0212-1990 at position (区点) 10-2, and in JIS X 0213-2004 at position 9-23 on the first plane (面). JIS X 0213-2004 is the version I have at hand, but that character didn't change from the -2000 version.

Checking the actual encoding of À in EUC-JP in Ruby shows the following:
```
$ ruby -e 'puts "\u00C0".encode("EUC-JP").b.inspect'
"\x8F\xAA\xA2"
```

This is clearly the JIS X 0212-1990 version, using SS3 (0x8F) to switch to the JIS X 0212 plane at G3. The 1990 version of JIS X 0212 is the first one, so the À character didn't exist in EUC-JP before.
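
As a rough sketch of the check described above (not part of the original comment), the row and cell (区点) can be recovered from the EUC-JP bytes by subtracting 0xA0 from each of the two bytes that follow SS3. The constant and variable names here are illustrative.

```ruby
SS3 = 0x8F # single shift 3: the next two bytes come from the JIS X 0212 set

# Encode À (U+00C0) to EUC-JP and look at the raw bytes.
bytes = "\u00C0".encode("EUC-JP").bytes

# A three-byte sequence starting with SS3 means a JIS X 0212 character.
uses_jis_x_0212 = bytes.length == 3 && bytes.first == SS3

# EUC-JP stores row and cell with 0xA0 added to each.
row  = bytes[1] - 0xA0
cell = bytes[2] - 0xA0

hex = bytes.map { |b| format("%02X", b) }.join(" ")
puts "bytes=#{hex} jis_x_0212=#{uses_jis_x_0212} row=#{row} cell=#{cell}"
```

With the bytes shown above ("\x8F\xAA\xA2"), this gives row 10 and cell 2, which agrees with position 10-2 in JIS X 0212-1990.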

----------------------------------------
Bug #11859: Regexp matching with \p{Upper} and \p{Lower} for EUC-JP doesn’t work.
https://bugs.ruby-lang.org/issues/11859#change-59218

* Author: Kimihito Matsui
* Status: Rejected
* Priority: Normal
* Assignee: 
* ruby -v: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-darwin14]
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN
----------------------------------------
-- 
https://bugs.ruby-lang.org/
