ruby-core@ruby-lang.org archive (unofficial mirror)
* [ruby-core:117568] [Ruby master Misc#20434] Deprecate regular expression modifiers
@ 2024-04-17 15:58 kddnewton (Kevin Newton) via ruby-core
  2024-04-18  0:23 ` [ruby-core:117581] [Ruby master Misc#20434] Deprecate encoding-releated " shyouhei (Shyouhei Urabe) via ruby-core
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: kddnewton (Kevin Newton) via ruby-core @ 2024-04-17 15:58 UTC (permalink / raw)
  To: ruby-core; +Cc: kddnewton (Kevin Newton)

Issue #20434 has been reported by kddnewton (Kevin Newton).

----------------------------------------
Misc #20434: Deprecate regular expression modifiers
https://bugs.ruby-lang.org/issues/20434

* Author: kddnewton (Kevin Newton)
* Status: Open
----------------------------------------
This is a follow-up to @duerst's comment here: https://bugs.ruby-lang.org/issues/20406#note-6.

As noted in the other issue, many encodings factor into how a regular expression operates. These include:

* The encoding of the file
* The encoding of the string parts within the regular expression
* The regular expression encoding modifiers
* The encoding of the string being matched

At the time the modifiers were introduced, I believe they may have been the only encoding that factored in here. At this point, however, they can cause quite a bit of confusion, as noted in the other ticket.
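
As a rough illustration of the encodings listed above, the following sketch reports each one for a small example. The method name `regexp_encoding_report` and the sample patterns are assumptions made for illustration only; they do not come from the issue.

```ruby
# Sketch: the encodings that can influence a single regexp match.
# The method name and sample patterns are illustrative assumptions.
def regexp_encoding_report
  {
    file:     __ENCODING__,          # encoding of the source file (typically UTF-8)
    plain:    /abc/.encoding,        # ASCII-only literal with no modifier
    escaped:  /caf\u00E9/.encoding,  # a \u escape fixes the literal to UTF-8
    modifier: /abc/s.encoding,       # the s modifier fixes it to Windows-31J
    # encoding of the string being matched
    subject:  String.new("\x81\x40", encoding: "Windows-31J").encoding
  }
end

regexp_encoding_report.each { |name, encoding| puts "#{name}: #{encoding}" }
```

Each entry corresponds to one item in the list, with the second item shown twice (with and without a non-ASCII escape) because the content of the literal changes the result.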

I would like to propose deprecating the regular expression encoding modifiers. In their place, a warning could suggest creating the regular expression from an explicitly encoded string. For example, when we find:

```ruby
/\x81\x40/s
```

we would instead suggest:

```ruby
::Regexp.new(::String.new("\x81\x40", encoding: "Windows-31J"))
```

or equivalent. As a migration path, we could do the following:

1. Emit a warning to change to the suggested expression
2. Change the compiler to compile to the suggested expression when those flags are found
3. Remove support for the flags

Step 2 may be unnecessary depending on how long of a timeline we would like to provide. To be clear, I'm not advocating for any particular timeline, and would be fine with this being multiple years/versions to give plenty of time for people to migrate. But I do think this would be a good change to eliminate confusion about the interaction between the four different encodings at play.
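
The equivalence between the modifier form and the suggested replacement can be checked with a sketch like the one below. The method name `sjis_regexps` and the sample subject string are assumptions for illustration; only the two expressions being compared come from the proposal.

```ruby
# Sketch: the modifier form and the suggested replacement side by side.
# `sjis_regexps` is a hypothetical helper name, not part of the proposal.
def sjis_regexps
  literal     = /\x81\x40/s
  replacement = Regexp.new(String.new("\x81\x40", encoding: "Windows-31J"))
  [literal, replacement]
end

# Three ASCII characters followed by the two-byte character 0x81 0x40.
subject = String.new("abc\x81\x40", encoding: "Windows-31J")

sjis_regexps.each do |regexp|
  # Both regexps report Windows-31J and match at character index 3.
  puts "#{regexp.encoding} fixed=#{regexp.fixed_encoding?} index=#{regexp =~ subject}"
end
```

The two objects are not necessarily `==`, since `Regexp#source` keeps the escaped text for the literal and the raw bytes for the replacement, so the comparison here is limited to encoding and matching behavior.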



-- 
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/


* [ruby-core:117581] [Ruby master Misc#20434] Deprecate encoding-releated regular expression modifiers
  2024-04-17 15:58 [ruby-core:117568] [Ruby master Misc#20434] Deprecate regular expression modifiers kddnewton (Kevin Newton) via ruby-core
@ 2024-04-18  0:23 ` shyouhei (Shyouhei Urabe) via ruby-core
  2024-04-18  6:22 ` [ruby-core:117591] " duerst via ruby-core
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: shyouhei (Shyouhei Urabe) via ruby-core @ 2024-04-18  0:23 UTC (permalink / raw)
  To: ruby-core; +Cc: shyouhei (Shyouhei Urabe)

Issue #20434 has been updated by shyouhei (Shyouhei Urabe).

Subject changed from Deprecate regular expression modifiers to Deprecate encoding-releated regular expression modifiers

+1 for deprecating encoding modifiers, but they are not the only modifiers a regexp can take.  For instance, `/foo/i` is a valid regular expression literal in Ruby, Perl, PHP (preg), and JavaScript.

I'm sure Kevin didn't intend to kill everything.  Let me narrow the scope of this request; subject updated.

----------------------------------------
Misc #20434: Deprecate encoding-releated regular expression modifiers
https://bugs.ruby-lang.org/issues/20434#change-107981


* [ruby-core:117591] [Ruby master Misc#20434] Deprecate encoding-releated regular expression modifiers
  2024-04-17 15:58 [ruby-core:117568] [Ruby master Misc#20434] Deprecate regular expression modifiers kddnewton (Kevin Newton) via ruby-core
  2024-04-18  0:23 ` [ruby-core:117581] [Ruby master Misc#20434] Deprecate encoding-releated " shyouhei (Shyouhei Urabe) via ruby-core
@ 2024-04-18  6:22 ` duerst via ruby-core
  2024-04-18  9:23 ` [ruby-core:117595] " byroot (Jean Boussier) via ruby-core
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: duerst via ruby-core @ 2024-04-18  6:22 UTC (permalink / raw)
  To: ruby-core; +Cc: duerst

Issue #20434 has been updated by duerst (Martin Dürst).


I guess there might still be some use for the encoding-related modifiers in single-line scripts and the like. But I don't have an actual use case; I hope whoever has such a use case comes forward.

The replacement code (`::Regexp.new(::String.new("\x81\x40", encoding: "Windows-31J"))`) is quite lengthy. This makes it clear that while each regular expression has an encoding in the same way as each String has an encoding, regular expressions don't really allow manipulating that encoding. Strings have `#force_encoding` and `#encode`, so maybe adding one or both methods to Regexp would help. The example could then be written as `/\x81\x40/.force_encoding("Windows-31J")` or `/\u3000/.encode("Windows-31J")`.
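
For readers less familiar with the two String methods mentioned here, this sketch shows how they differ on a String: `force_encoding` relabels existing bytes, while `encode` transcodes characters. The helper names are assumptions for illustration, and the character used is U+3000 (ideographic space), which is the bytes 0x81 0x40 in Windows-31J.

```ruby
# Sketch: relabeling versus transcoding on String.
# Helper names are hypothetical; only the String methods are real APIs.
def sjis_by_relabeling
  # Same bytes, new label: 0x81 0x40 are now read as Windows-31J.
  "\x81\x40".dup.force_encoding("Windows-31J")
end

def sjis_by_transcoding
  # New bytes: U+3000 is converted from UTF-8 to Windows-31J.
  "\u3000".encode("Windows-31J")
end

p sjis_by_relabeling.bytes
p sjis_by_transcoding.bytes
p sjis_by_relabeling == sjis_by_transcoding

# Whether Regexp responds to either method on the running Ruby.
p Regexp.method_defined?(:force_encoding)
p Regexp.method_defined?(:encode)
```

Both helpers end up with the same two bytes and the same encoding, which is why either method name could plausibly describe the Regexp case.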

----------------------------------------
Misc #20434: Deprecate encoding-releated regular expression modifiers
https://bugs.ruby-lang.org/issues/20434#change-107992


* [ruby-core:117595] [Ruby master Misc#20434] Deprecate encoding-releated regular expression modifiers
  2024-04-17 15:58 [ruby-core:117568] [Ruby master Misc#20434] Deprecate regular expression modifiers kddnewton (Kevin Newton) via ruby-core
  2024-04-18  0:23 ` [ruby-core:117581] [Ruby master Misc#20434] Deprecate encoding-releated " shyouhei (Shyouhei Urabe) via ruby-core
  2024-04-18  6:22 ` [ruby-core:117591] " duerst via ruby-core
@ 2024-04-18  9:23 ` byroot (Jean Boussier) via ruby-core
  2024-04-18 11:41 ` [ruby-core:117596] [Ruby master Misc#20434] Deprecate encoding-related " Eregon (Benoit Daloze) via ruby-core
  2024-04-18 18:50 ` [ruby-core:117600] " kddnewton (Kevin Newton) via ruby-core
  4 siblings, 0 replies; 6+ messages in thread
From: byroot (Jean Boussier) via ruby-core @ 2024-04-18  9:23 UTC (permalink / raw)
  To: ruby-core; +Cc: byroot (Jean Boussier)

Issue #20434 has been updated by byroot (Jean Boussier).


`/\x81\x40/.force_encoding("Windows-31J")` wouldn't work because `String#force_encoding` mutates the string, and Regexp literals are immutable.
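
A small sketch of the two facts behind this objection, assuming Ruby 3.0 or later (where Regexp literals are frozen). The method names are made up for illustration.

```ruby
# Sketch: why an in-place force_encoding does not fit Regexp literals.
# Method names are hypothetical.
def force_encoding_returns_receiver?
  string = String.new("\x81\x40")  # a fresh, mutable copy of the literal
  # force_encoding changes the receiver and returns that same object.
  string.force_encoding("Windows-31J").equal?(string)
end

def regexp_literal_frozen?
  /abc/.frozen?  # assumption: Ruby 3.0+, where Regexp literals are frozen
end

p force_encoding_returns_receiver?
p regexp_literal_frozen?
```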

Similarly, `String#encode` doesn't just change the string's encoding attribute; it converts the bytes to the new encoding. So I'd expect `/\u3000/.encode("Windows-31J")` to fail with:

```ruby
"\x81" on UTF-8 (Encoding::InvalidByteSequenceError)
```
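
The kind of failure described above can be reproduced on a plain String. The sketch below assumes the bytes 0x81 0x40 are labeled UTF-8, which is what makes transcoding reject them; the helper name `transcode_error_for` is hypothetical.

```ruby
# Sketch: transcoding bytes that are not valid in their current encoding.
# `transcode_error_for` is a hypothetical helper name.
def transcode_error_for(string)
  string.encode("Windows-31J")
  nil
rescue EncodingError => e
  e
end

# 0x81 is not a valid leading byte in UTF-8, so String#encode raises.
invalid_utf8 = String.new("\x81\x40", encoding: "UTF-8")
error = transcode_error_for(invalid_utf8)
puts error.class
```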

So I think the String API to mirror would be `String.new(encoding:)`

   - `Regexp.new(/\x81\x40/, encoding: Encoding::WINDOWS_31J)`
   - `Regexp.new("\x81\x40", encoding: Encoding::WINDOWS_31J)`
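
`Regexp.new` does not take an `encoding:` keyword at the time of this thread, so the closest thing that runs today is a small wrapper. The sketch below is only an approximation of the String-based form suggested above; `regexp_with_encoding` is a hypothetical name.

```ruby
# Hypothetical helper (not a Ruby API): build a Regexp from a pattern string
# that is first relabeled with the requested encoding.
def regexp_with_encoding(pattern, encoding, options = 0)
  Regexp.new(String.new(pattern, encoding: encoding), options)
end

sjis_space = regexp_with_encoding("\x81\x40", Encoding::WINDOWS_31J)
puts sjis_space.encoding
puts sjis_space.match?(String.new("a\x81\x40", encoding: "Windows-31J"))
```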

But if we want an instance method, I'd suggest something like `/\x81\x40/.encoded(Encoding::WINDOWS_31J)`. By the way, that would also be useful on `String`, where this pattern is common:


```ruby
# frozen_string_literal: true
THING = "fée".dup.force_encoding(Encoding::ISO8859_1)
```

So it could become:

```ruby
# frozen_string_literal: true
THING = "fée".encoded(Encoding::ISO8859_1)
```
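
Since `String#encoded` is only a suggestion here, a stand-in can be sketched as a plain method. `encoded_copy` is a hypothetical name, and the body is just the `dup.force_encoding` idiom from the first snippet.

```ruby
# Hypothetical stand-in for the suggested String#encoded; not an existing method.
def encoded_copy(string, encoding)
  # dup yields an unfrozen copy, so the frozen original is left untouched.
  string.dup.force_encoding(encoding)
end

original = "f\u00E9e".freeze
latin1   = encoded_copy(original, Encoding::ISO_8859_1)

puts original.encoding             # the original keeps its encoding
puts latin1.encoding               # the copy carries the new label
puts latin1.bytes == original.bytes
```

The bytes are unchanged and only the label differs, which is the behavior the suggested method would need to mirror.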


----------------------------------------
Misc #20434: Deprecate encoding-releated regular expression modifiers
https://bugs.ruby-lang.org/issues/20434#change-107997


* [ruby-core:117596] [Ruby master Misc#20434] Deprecate encoding-related regular expression modifiers
  2024-04-17 15:58 [ruby-core:117568] [Ruby master Misc#20434] Deprecate regular expression modifiers kddnewton (Kevin Newton) via ruby-core
                   ` (2 preceding siblings ...)
  2024-04-18  9:23 ` [ruby-core:117595] " byroot (Jean Boussier) via ruby-core
@ 2024-04-18 11:41 ` Eregon (Benoit Daloze) via ruby-core
  2024-04-18 18:50 ` [ruby-core:117600] " kddnewton (Kevin Newton) via ruby-core
  4 siblings, 0 replies; 6+ messages in thread
From: Eregon (Benoit Daloze) via ruby-core @ 2024-04-18 11:41 UTC (permalink / raw)
  To: ruby-core; +Cc: Eregon (Benoit Daloze)

Issue #20434 has been updated by Eregon (Benoit Daloze).


This seems like a good simplification to me. I think the semantics of these encoding modifiers are confusing to most Rubyists.

I wouldn't worry too much about the length of the replacement, because `/.../s`/`/.../e` are likely very rare (using the file encoding seems a good replacement for those).
`/.../u` seems redundant with the default source encoding, so the `u` can likely just be removed in most cases.
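
A sketch of where `u` is and is not redundant. It uses a `\u` escape in place of a literal non-ASCII character so that it does not depend on the file encoding; that substitution and the helper names are assumptions for illustration.

```ruby
# Sketch: comparing literals with and without the u modifier.
# Helper names are hypothetical.
def u_modifier_comparison
  {
    non_ascii_plain: /caf\u00E9/.encoding,   # already UTF-8 without the modifier
    non_ascii_u:     /caf\u00E9/u.encoding,
    ascii_plain:     /cafe/.encoding,        # US-ASCII, not fixed
    ascii_u:         /cafe/u.encoding        # fixed to UTF-8 by the modifier
  }
end

# Match against a Windows-31J string that contains a non-ASCII character.
def match_sjis_subject(regexp)
  regexp.match?(String.new("cafe\x81\x40", encoding: "Windows-31J"))
rescue Encoding::CompatibilityError
  :incompatible
end

u_modifier_comparison.each { |name, encoding| puts "#{name}: #{encoding}" }
p match_sjis_subject(/cafe/)
p match_sjis_subject(/cafe/u)
```

For a pattern with non-ASCII content the modifier changes nothing, while for an ASCII-only pattern it fixes the encoding and so narrows which strings can be matched. That second case is the one where removing `u` could change behavior.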

I'm not so sure about `/.../n`; that may be more frequent.
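
For reference, a sketch of what `n` does today and one possible modifier-free spelling for binary patterns. The helper names are hypothetical, and `binary_regexp` is only an approximation: it does not set the `NOENCODING` option that `n` sets.

```ruby
# Sketch: observable effects of the n modifier. Helper names are hypothetical.
def n_modifier_report
  {
    ascii_only: /abc/n.encoding,   # ASCII-only pattern
    high_byte:  /\xFF/n.encoding,  # pattern containing a byte >= 0x80
    flag_set:   (/abc/n.options & Regexp::NOENCODING) != 0
  }
end

# One possible replacement for binary patterns: build from a binary string.
def binary_regexp(bytes)
  Regexp.new(bytes.b)
end

n_modifier_report.each { |name, value| puts "#{name}: #{value}" }
puts binary_regexp("\xFF").match?("a\xFFb".b)
```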

Methods to convert an existing Regexp from one encoding to another feel suboptimal, because that will cause an extra Regexp instance, and creating a Regexp is not cheap due to many checks, allocations, and even compilation (AFAIK eager in CRuby at least).

So I think the existing `Regexp.new("a".dup.force_encoding(Encoding::WINDOWS_31J))` is good enough.
And since this would address a deprecation, it seems very important that the code also works on older Ruby versions.

(I'm all for `String#{encoded,with_encoding}` but it seems best to propose that as a separate ticket)

I would be interested in a good textual description in #20406 of how the encoding of a Regexp is currently computed. It seems quite complex, but having it in text would make it easier to reason about.
Maybe we could simplify it while remaining compatible (i.e. the specific value of `Regexp#encoding` doesn't matter much; what matters is that a Regexp can still be matched against Strings of various encodings as it could before).

----------------------------------------
Misc #20434: Deprecate encoding-related regular expression modifiers
https://bugs.ruby-lang.org/issues/20434#change-108001


* [ruby-core:117600] [Ruby master Misc#20434] Deprecate encoding-related regular expression modifiers
  2024-04-17 15:58 [ruby-core:117568] [Ruby master Misc#20434] Deprecate regular expression modifiers kddnewton (Kevin Newton) via ruby-core
                   ` (3 preceding siblings ...)
  2024-04-18 11:41 ` [ruby-core:117596] [Ruby master Misc#20434] Deprecate encoding-related " Eregon (Benoit Daloze) via ruby-core
@ 2024-04-18 18:50 ` kddnewton (Kevin Newton) via ruby-core
  4 siblings, 0 replies; 6+ messages in thread
From: kddnewton (Kevin Newton) via ruby-core @ 2024-04-18 18:50 UTC (permalink / raw)
  To: ruby-core; +Cc: kddnewton (Kevin Newton)

Issue #20434 has been updated by kddnewton (Kevin Newton).


Thanks @shyouhei, I definitely only meant encoding-related modifiers. I really like the other ones!

----------------------------------------
Misc #20434: Deprecate encoding-related regular expression modifiers
https://bugs.ruby-lang.org/issues/20434#change-108009

