[ruby-core:104422] [Ruby master Bug#18009] Regexps \w and \W with /i option and /u option produce inconsistent results under nested negation and intersection

From: jiri.marsik @ 2021-06-28  9:09 UTC
  To: ruby-core

Issue #18009 has been reported by jirkamarsik (Jirka Marsik).

----------------------------------------
Bug #18009: Regexps \w and \W with /i option and /u option produce inconsistent results under nested negation and intersection
https://bugs.ruby-lang.org/issues/18009

* Author: jirkamarsik (Jirka Marsik)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN
----------------------------------------
This is a follow-up to [issue 4044](https://bugs.ruby-lang.org/issues/4044). Its fix (https://github.com/k-takata/Onigmo/issues/4) handled the cases reported in the original issue, but other cases were missed and still produce inconsistent results.

If the `\w` character set is used inside a nested negated character class, it will not be picked up by the part of the character class analyzer that's responsible for limiting the case-folding of certain character sets (like `\w` and `\W`) across the ASCII boundary. We then end up with the situation where `/[^\w]/iu` and `/[[^\w]]/iu` match different sets of characters.

```
irb(main):001:0> ("a".."z").to_a.join.scan(/\W/iu)
=> []
irb(main):002:0> ("a".."z").to_a.join.scan(/[^\w]/iu)
=> []
irb(main):003:0> ("a".."z").to_a.join.scan(/[[^\w]]/iu)
=> ["k", "s"]
```

This can also be demonstrated using the inverted matcher, where the doubly negated class should be equivalent to `\w` but misses two of the 26 letters:

```
irb(main):004:0> ("a".."z").to_a.join.scan(/\w/iu).length
=> 26
irb(main):005:0> ("a".."z").to_a.join.scan(/[^[^\w]]/iu).length
=> 24
```
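
The two transcripts above can be folded into one script that flags any group of supposedly equivalent patterns whose matches differ. This is only a minimal sketch: `scan_all`, `consistent?`, `LETTERS` and `GROUPS` are hypothetical names introduced here, not part of Ruby or Onigmo. No expected output is shown, because the result depends on the Ruby build being tested.

```ruby
# Hypothetical helpers (not part of Ruby or Onigmo) for comparing patterns
# that are expected to match the same characters.

# Returns a hash mapping each label to the substrings its pattern matched.
def scan_all(patterns, subject)
  patterns.transform_values { |re| subject.scan(re) }
end

# True when every pattern produced the same list of matches.
def consistent?(results)
  results.values.uniq.size <= 1
end

LETTERS = ("a".."z").to_a.join

# Each group lists patterns from the report that should behave identically.
GROUPS = {
  "non-word" => { '\W' => /\W/iu, '[^\w]' => /[^\w]/iu, '[[^\w]]' => /[[^\w]]/iu },
  "word"     => { '\w' => /\w/iu, '[^[^\w]]' => /[^[^\w]]/iu }
}.freeze

GROUPS.each do |name, patterns|
  results = scan_all(patterns, LETTERS)
  results.each { |label, matches| puts format("%-10s %p", label, matches) }
  puts "#{name}: #{consistent?(results) ? 'consistent' : 'INCONSISTENT'}"
end
```

A group reported as `INCONSISTENT` corresponds to the discrepancy described above.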

A similar issue arises with character class intersection. The idea behind the pattern compiler's analysis is that a character may case-fold across the ASCII boundary only if it is included in the character class by some means other than membership in `\w` (or in one of several other character sets that receive special treatment). Therefore, in the example below, `/[\w]/iu` does not match the Kelvin sign `\u212a`, because that would mean crossing the ASCII boundary from `k` to `\u212a`. However, `/[kx]/iu` does match the Kelvin sign, because the `k` was not contributed by `\w` and is therefore not subject to the ASCII boundary restriction. (We have to use `/[kx]/iu` instead of `/[k]/iu` in our examples, or else the pattern analyzer would replace `[k]` with `k` and follow a different code path.)

```
irb(main):006:0> /[\w]/iu.match("\u212a")
=> nil
irb(main):007:0> /[kx]/iu.match("\u212a")
=> #<MatchData "K">
```
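
To make "crossing the ASCII boundary" concrete, the sketch below checks which characters are non-ASCII yet case-fold to an ASCII letter, using `String#downcase(:fold)` (available since Ruby 2.4). `CANDIDATES`, `fold` and `crosses_ascii_boundary?` are hypothetical names used only for illustration. The long s (U+017F) is not named in the report; it is included here on the assumption that it accounts for the `"s"` in the earlier output, since it folds to `s` the same way the Kelvin sign folds to `k`.

```ruby
# Characters of interest, keyed by their Unicode names.
# The long s is an assumption added for illustration (see lead-in).
CANDIDATES = {
  "KELVIN SIGN"               => "\u212A",
  "LATIN SMALL LETTER LONG S" => "\u017F"
}.freeze

# Returns the Unicode case-folded form of +char+.
def fold(char)
  char.downcase(:fold)
end

# True when +char+ is outside ASCII but its case-folded form is pure ASCII.
def crosses_ascii_boundary?(char)
  !char.ascii_only? && fold(char).ascii_only?
end

CANDIDATES.each do |name, char|
  puts format("U+%04X %-26s folds to %p (crosses: %s)",
              char.ord, name, fold(char), crosses_ascii_boundary?(char))
end
```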

The problem appears when we intersect `\w` with `[kx]`. Since `[kx]` is a subset of `\w`, we would expect `[\w&&kx]` to behave the same as `[kx]`, but that is not the case.

```
irb(main):008:0> /[\w&&kx]/i.match("\u212a")
=> nil
```

The underlying issue in these cases is how the `ascCc` character set is computed while character classes are parsed. `ascCc` should contain all characters of the character class except those contributed by `\w` and similar character sets. The current computation essentially ignores these character sets, which works for set union and for top-level negation (which is handled explicitly), but not for nested set negation or set intersection.
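
The description above can be illustrated with a toy model. This is not Onigmo's code: it works over a tiny hand-picked universe of characters, and `UNIVERSE`, `WORD` and `evaluate` are hypothetical names. It evaluates a class expression twice, once normally and once with `\w` contributing nothing (the "ignore" strategy), so the two results can be compared for union, nested negation and intersection.

```ruby
require "set"

# A tiny universe: a few ASCII characters plus the Kelvin sign and long s.
UNIVERSE = Set["k", "x", "s", "1", "_", "-", "\u212A", "\u017F"].freeze
# The members of the universe that the ASCII-only \w covers.
WORD = UNIVERSE.select { |c| c.match?(/\A[a-zA-Z0-9_]\z/) }.to_set.freeze

# Evaluates a class expression given as nested arrays, e.g.
# [:inter, [:word], [:lit, "k", "x"]] for [\w&&kx].
# With ignore_word: true, \w contributes the empty set, which models
# "ignoring" such sets while the auxiliary set is built.
def evaluate(node, ignore_word: false)
  op, *args = node
  case op
  when :word  then ignore_word ? Set.new : WORD
  when :lit   then args.to_set
  when :union then args.map { |a| evaluate(a, ignore_word: ignore_word) }.reduce(:|)
  when :inter then args.map { |a| evaluate(a, ignore_word: ignore_word) }.reduce(:&)
  when :not   then UNIVERSE - evaluate(args.first, ignore_word: ignore_word)
  else raise ArgumentError, "unknown node #{op.inspect}"
  end
end

{
  '[\w-]'    => [:union, [:word], [:lit, "-"]],
  '[[^\w]]'  => [:not, [:word]],
  '[\w&&kx]' => [:inter, [:word], [:lit, "k", "x"]]
}.each do |label, expr|
  full  = evaluate(expr)
  naive = evaluate(expr, ignore_word: true)
  puts format("%-9s class=%p ignoring \\w=%p", label, full.to_a, naive.to_a)
end
```

In this model the union case yields just the literal `-`, as intended. Nested negation yields the whole universe, including characters such as `k` that are not in the class at all, and the intersection yields the empty set even though `k` and `x` were written literally.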

-- 
https://bugs.ruby-lang.org/

From: mjrzasa via ruby-core @ 2024-02-06 20:33 UTC
  To: ruby-core; +Cc: mjrzasa

Issue #18009 has been updated by mjrzasa (Maciek Rząsa).

One more case:
```
[26] pry(main)> ("a".."z").to_a.join.scan(/[\W]/iu)  
=> ["st"]
```

----------------------------------------
Bug #18009: Regexps \w and \W with /i option and /u option produce inconsistent results under nested negation and intersection
https://bugs.ruby-lang.org/issues/18009#change-106621

-- 
https://bugs.ruby-lang.org/