ruby-core@ruby-lang.org archive (unofficial mirror)
* [ruby-core:91377] [Ruby trunk Bug#15583] Regex: ? on quantified group {n} is interpreted as optional, should be lazy
       [not found] <redmine.issue-15583.20190201152800@ruby-lang.org>
@ 2019-02-01 15:28 ` davisjam
  2019-02-02  6:45 ` [ruby-core:91383] " naruse
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 4+ messages in thread
From: davisjam @ 2019-02-01 15:28 UTC (permalink / raw)
  To: ruby-core

Issue #15583 has been reported by davisjam (James Davis).

----------------------------------------
Bug #15583: Regex: ? on quantified group {n} is interpreted as optional, should be lazy
https://bugs.ruby-lang.org/issues/15583

* Author: davisjam (James Davis)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: 
* Backport: 2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN
----------------------------------------
The Ruby regex docs have this to say about repetition ([specific link](https://ruby-doc.org/core-2.6.1/Regexp.html#class-Regexp-label-Repetition)):

> The constructs described so far match a single character. They can be followed by a repetition metacharacter to specify how many times they need to occur. Such metacharacters are called quantifiers.
> 
> - * - Zero or more times
> - ...
> - {n} - Exactly n times
> - ...

From this I conclude that the {n} construct is considered a quantifier metacharacter.

The docs go on to say

> Repetition is greedy by default: as many occurrences as possible are matched while still allowing the overall match to succeed. By contrast, lazy matching makes the minimal amount of matches necessary for overall success. A greedy metacharacter can be made lazy by following it with ?.

Since `{n}` is a greedy metacharacter, it seems like `{n}?` should make it lazy. In the particular case of `{n}?`, laziness is meaningless -- the regex engine must match n of whatever is being quantified, lazily or not. But I think other behavior is needlessly confusing. To make `{n}` optional, I think I should have to wrap it in parentheses: `(a{n})?`.
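
For a ranged quantifier, the greedy/lazy difference described in the docs is directly observable. A minimal sketch (not part of the attached t.rb; the variable names are only illustrative):

```ruby
# Greedy vs. lazy on a ranged quantifier, where the difference is visible.
greedy = "aaa"[/a{1,3}/]   # takes as many a's as it can
lazy   = "aaa"[/a{1,3}?/]  # stops after the minimum of one a

puts "greedy: #{greedy.inspect}"
puts "lazy:   #{lazy.inspect}"
```

With a fixed count such as `{3}` there is no range to choose from, which is why laziness has nothing to change there.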

The docs make it sound like `?` as "lazy" has stronger precedence than `?` as "optional".
This makes sense to me -- the "optional" meaning can be communicated using parentheses while the lazy meaning cannot.

Here is a test program to explore this behavior:

```ruby
if /a{1,}?/.match("")
	puts "a{1,}? matched the empty string"
else
	puts "a{1,}? did not match"
end

if /a{1,3}?/.match("")
	puts "a{1,3}? matched the empty string"
else
	puts "a{1,3}? did not match"
end

if /a{,1}?/.match("")
	puts "a{,1}? matched the empty string"
else
	puts "a{,1}? did not match"
end

if /a{1}?/.match("")
	puts "a{1}? matched the empty string"
else
	puts "a{1}? did not match"
end
```

If `?` attaches more strongly to quantifiers (to mean non-greedy) than to arbitrary patterns (to mean optional), then I expect it to mean "non-greedy" in each of these cases. So the expected behavior is:

1. `/a{1,}?/` *should not* match the empty string, since even non-greedily it must match at least 1 a.
2. `/a{1,3}?/` *should not* match the empty string, since even non-greedily it must match at least 1 a.
3. `/a{,1}?/` *should* match the empty string, since non-greedily it can match 0 a's.
4. `/a{1}?/` *should not* match the empty string, since even non-greedily it must match at least 1 a.

Let's see how it behaves in Ruby 2.6.1:

```shell
(09:43:09) jamie@woody /tmp $ ruby -v
ruby 2.6.1p33 (2019-01-30 revision 66950) [x86_64-linux]
(09:43:12) jamie@woody /tmp $ ruby /tmp/t.rb
a{1,}? did not match
a{1,3}? did not match
a{,1}? matched the empty string
a{1}? matched the empty string
```

Cases 1-3 all behave as expected. However, case 4 matches the empty string, implying that in `/a{1}?/` the `?` is interpreted to mean optional rather than non-greedy.
I find this inconsistency a bit confusing.
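
One way to probe this reading further is to anchor the pattern and compare it against an explicitly grouped optional. This is only a sketch, not part of t.rb: the helper `same_matches?` and the sample strings are hypothetical, chosen for illustration. If the `?` is being read as "optional", the construct should accept the same inputs as `(?:a{2})?` and differ from plain `a{2}`.

```ruby
# Hypothetical helper: do two patterns accept exactly the same sample strings?
def same_matches?(re1, re2, samples)
  samples.all? { |s| re1.match?(s) == re2.match?(s) }
end

samples = ["", "a", "aa", "aaa"]

suspect  = /\Aa{2}?\z/      # the construct in question
optional = /\A(?:a{2})?\z/  # explicitly optional group
exact    = /\Aa{2}\z/       # plain "exactly two"

puts "agrees with optional group: #{same_matches?(suspect, optional, samples)}"
puts "agrees with exact count:    #{same_matches?(suspect, exact, samples)}"
```

Anchoring with `\A` and `\z` matters here: an unanchored pattern that can match the empty string succeeds on every input, which would hide the difference.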

I tested this behavior in 7 other languages: Go, Java, JavaScript, Perl, PHP, Python, and Rust. In those languages, `/a{1}?/` does not match the empty string (so the `{n}?` notation is interpreted as non-greedy rather than optional).

Perhaps this should be addressed via a docs change to avoid possible breakage. Here is some possible wording:

Repetition is greedy by default: as many occurrences as possible are matched while still allowing the overall match to succeed. By contrast, lazy matching makes the minimal amount of matches necessary for overall success. Most greedy metacharacters can be made lazy by following them with ?. For the {n} metacharacter, greedy and non-greedy behavior is identical and the ? instead makes the repeated pattern optional.

---Files--------------------------------
t.rb (397 Bytes)


-- 
https://bugs.ruby-lang.org/


* [ruby-core:91383] [Ruby trunk Bug#15583] Regex: ? on quantified group {n} is interpreted as optional, should be lazy
       [not found] <redmine.issue-15583.20190201152800@ruby-lang.org>
  2019-02-01 15:28 ` [ruby-core:91377] [Ruby trunk Bug#15583] Regex: ? on quantified group {n} is interpreted as optional, should be lazy davisjam
@ 2019-02-02  6:45 ` naruse
  2019-02-16  2:21 ` [ruby-core:91572] " davisjam
  2019-02-16  5:13 ` [ruby-core:91573] " duerst
  3 siblings, 0 replies; 4+ messages in thread
From: naruse @ 2019-02-02  6:45 UTC (permalink / raw)
  To: ruby-core

Issue #15583 has been updated by naruse (Yui NARUSE).

Backport changed from 2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN to 2.4: REQUIRED, 2.5: UNKNOWN, 2.6: UNKNOWN

Unfortunately, this is expected behavior, and the bug is in the documentation.

Onigmo, the upstream of Ruby's regexp engine, supports the non-greedy interpretation, but Ruby disables it.
https://github.com/k-takata/Onigmo/blob/master/doc/RE#L155-L156

This is for compatibility with Ruby 1.8 and earlier. Ruby allowed /a{n}?/ before it introduced Oniguruma/Onigmo, so we cannot change the behavior.
Introducing it would require a migration path to the new behavior.
If you need non-greedy repetition with this quantifier in a real-world application, we would consider reasonable migration paths more seriously, though it would take some years...
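
Until any such migration happens, one option is to spell the intent so that it cannot be misread. A minimal sketch, assuming the behavior described in this thread; the variable names are illustrative only:

```ruby
# Exactly two a's: no trailing ? is needed, since lazy and greedy coincide for {n}.
exactly_two = /\Aa{2}\z/

# Optionally two a's: group explicitly so the ? clearly applies to the whole repetition.
optional_two = /\A(?:a{2})?\z/

# Lazy repetition on a ranged quantifier still behaves as documented.
lazy_range = /a{2,}?/

puts exactly_two.match?("aa")
puts optional_two.match?("")
puts "aaaa"[lazy_range].inspect
```

Written this way, the patterns should mean the same thing in Ruby and in engines that treat `{n}?` as lazy.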

----------------------------------------
Bug #15583: Regex: ? on quantified group {n} is interpreted as optional, should be lazy
https://bugs.ruby-lang.org/issues/15583#change-76638

* Author: davisjam (James Davis)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: 2.6.1
* Backport: 2.4: REQUIRED, 2.5: UNKNOWN, 2.6: UNKNOWN
----------------------------------------

* [ruby-core:91572] [Ruby trunk Bug#15583] Regex: ? on quantified group {n} is interpreted as optional, should be lazy
       [not found] <redmine.issue-15583.20190201152800@ruby-lang.org>
  2019-02-01 15:28 ` [ruby-core:91377] [Ruby trunk Bug#15583] Regex: ? on quantified group {n} is interpreted as optional, should be lazy davisjam
  2019-02-02  6:45 ` [ruby-core:91383] " naruse
@ 2019-02-16  2:21 ` davisjam
  2019-02-16  5:13 ` [ruby-core:91573] " duerst
  3 siblings, 0 replies; 4+ messages in thread
From: davisjam @ 2019-02-16  2:21 UTC (permalink / raw)
  To: ruby-core

Issue #15583 has been updated by davisjam (James Davis).


Can we change the documentation? I am happy to propose additional text.

----------------------------------------
Bug #15583: Regex: ? on quantified group {n} is interpreted as optional, should be lazy
https://bugs.ruby-lang.org/issues/15583#change-76836

* Author: davisjam (James Davis)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: 2.6.1
* Backport: 2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN
----------------------------------------

* [ruby-core:91573] [Ruby trunk Bug#15583] Regex: ? on quantified group {n} is interpreted as optional, should be lazy
       [not found] <redmine.issue-15583.20190201152800@ruby-lang.org>
                   ` (2 preceding siblings ...)
  2019-02-16  2:21 ` [ruby-core:91572] " davisjam
@ 2019-02-16  5:13 ` duerst
  3 siblings, 0 replies; 4+ messages in thread
From: duerst @ 2019-02-16  5:13 UTC (permalink / raw)
  To: ruby-core

Issue #15583 has been updated by duerst (Martin Dürst).


davisjam (James Davis) wrote:

> Perhaps this should be addressed via a docs change to avoid possible breakage.

Agreed.

> Here is some possible wording:
> 
> Repetition is greedy by default: as many occurrences as possible are matched while still allowing the overall match to succeed. By contrast, lazy matching makes the minimal amount of matches necessary for overall success. Most greedy metacharacters can be made lazy by following them with ?. For the {n} metacharacter, greedy and non-greedy behavior is identical and the ? instead makes the repeated pattern optional.

The term "metacharacter" is usually used for single characters. So "the {n} metacharacter" sounds really strange. '{' and '}' are metacharacters, but "{n}" is not a metacharacter (because it's not a character in the first place).

This should be taken into account when rewriting the documentation.

----------------------------------------
Bug #15583: Regex: ? on quantified group {n} is interpreted as optional, should be lazy
https://bugs.ruby-lang.org/issues/15583#change-76837

* Author: davisjam (James Davis)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: 2.6.1
* Backport: 2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN
----------------------------------------
