[ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML

ruby-core@ruby-lang.org archive (unofficial mirror)
 help / color / mirror / Atom feed

* [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
@ 2022-10-29  7:40 k0kubun (Takashi Kokubun)
  2022-10-29 13:38 ` [ruby-core:110543] " Eregon (Benoit Daloze)
                   ` (13 more replies)
  0 siblings, 14 replies; 15+ messages in thread
From: k0kubun (Takashi Kokubun) @ 2022-10-29  7:40 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been reported by k0kubun (Takashi Kokubun).

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090

* Author: k0kubun (Takashi Kokubun)
* Status: Open
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but a gound-less impression "I think this is backward incompatibility". It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not add an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it spoils the readability for maintaining an invalid use case.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [ruby-core:110543] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
  2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
@ 2022-10-29 13:38 ` Eregon (Benoit Daloze)
  2022-11-01 12:55 ` [ruby-core:110566] " Dan0042 (Daniel DeLorme)
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Eregon (Benoit Daloze) @ 2022-10-29 13:38 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been updated by Eregon (Benoit Daloze).

Isn't `rb_str_dup` copy-on-write and so should be fairly cheap?

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-99872

* Author: k0kubun (Takashi Kokubun)
* Status: Open
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [ruby-core:110566] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
  2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
  2022-10-29 13:38 ` [ruby-core:110543] " Eregon (Benoit Daloze)
@ 2022-11-01 12:55 ` Dan0042 (Daniel DeLorme)
  2022-11-02  4:45 ` [ruby-core:110576] " k0kubun (Takashi Kokubun)
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Dan0042 (Daniel DeLorme) @ 2022-11-01 12:55 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been updated by Dan0042 (Daniel DeLorme).

I agree the dup is unnecessary since 99.99% of the time you're just doing `buf << CGI.escapeHTML(str)`. Not sure the performance gain would be measurable. A new method like `CGI.escapeHTML!` would make sense to me, as it indicates the danger of mutating the return value. 

I did a search for `CGI.escapeHTML` in gems: https://pastebin.com/7HYUTASZ
There's a lot of safe usage where the result is directly used in interpolation or concatenation.
There's also some "potentially unsafe" usage like
```ruby
      def escape_html(string)
        CGI.escapeHTML(string)
      end
```
which depends on how the `escape_html` method is called, but in general the only valid use case for these methods is when appending to a buffer. So I'd say it's a low-risk change.

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-99897

* Author: k0kubun (Takashi Kokubun)
* Status: Open
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [ruby-core:110576] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
  2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
  2022-10-29 13:38 ` [ruby-core:110543] " Eregon (Benoit Daloze)
  2022-11-01 12:55 ` [ruby-core:110566] " Dan0042 (Daniel DeLorme)
@ 2022-11-02  4:45 ` k0kubun (Takashi Kokubun)
  2022-11-02 13:04 ` [ruby-core:110581] " Dan0042 (Daniel DeLorme)
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: k0kubun (Takashi Kokubun) @ 2022-11-02  4:45 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been updated by k0kubun (Takashi Kokubun).

> Isn't rb_str_dup copy-on-write and so should be fairly cheap?

As I wrote in the description, calling `rb_str_dup` is making the whole method 1.34x slower no matter how efficient CoW is. Whether the 34% slowdown is cheap or not depends on how often the method is used and hits the non-escaped case. What I'm saying is that this method is used for almost all embedded expressions in templates and it hits the non-escaped case most of the time. Let's say you have a page that lists 100 resources with 5 fields, it would call `CGI.escapeHTML` at least 500 times and many of them could have no `'"&<>` characters.

Generally, escaping an HTML is known to be one of the largest bottlenecks in template engine benchmarks that enable HTML escaping, which is why I've literally spent years optimizing this method. I would never say a 34% slowdown in `CGI.escapeHTML` is cheap. If you modify the benchmark created by the Slim team to escape embedded expressions, the benchmark becomes 1.1x faster by just removing this `rb_str_dup` call. So the 34% slowdown in the microbenchmark of this method translates to a 10% slowdown in the template rendering.

> A new method like CGI.escapeHTML! would make sense to me, as it indicates the danger of mutating the return value.

It's not doing any mutation by itself, and I don't know of any existing method named like that. Thus Matz would not like the name, and I'm not sure if the library maintainer (@nobu) likes it either. But thanks for proposing an alternative name.

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-99910

* Author: k0kubun (Takashi Kokubun)
* Status: Open
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [ruby-core:110581] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
  2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
                   ` (2 preceding siblings ...)
  2022-11-02  4:45 ` [ruby-core:110576] " k0kubun (Takashi Kokubun)
@ 2022-11-02 13:04 ` Dan0042 (Daniel DeLorme)
  2022-11-02 15:02 ` [ruby-core:110583] " Dan0042 (Daniel DeLorme)
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Dan0042 (Daniel DeLorme) @ 2022-11-02 13:04 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been updated by Dan0042 (Daniel DeLorme).

You know, if speed is *that* much a concern, I think `optimized_escape_html2` should seek the first escapable character and only if found then do the ALLOCV_N (and memcpy from cstr to buf).

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-99914

* Author: k0kubun (Takashi Kokubun)
* Status: Open
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [ruby-core:110583] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
  2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
                   ` (3 preceding siblings ...)
  2022-11-02 13:04 ` [ruby-core:110581] " Dan0042 (Daniel DeLorme)
@ 2022-11-02 15:02 ` Dan0042 (Daniel DeLorme)
  2022-11-02 20:12 ` [ruby-core:110585] " k0kubun (Takashi Kokubun)
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Dan0042 (Daniel DeLorme) @ 2022-11-02 15:02 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been updated by Dan0042 (Daniel DeLorme).

Actually an even better approach may be to append the escaped bytes directly to the final output buffer, instead of generating any intermediary string at all. As such I would like to propose:
`CGI.escapeHTML(str, "")` append to new string (current behavior)
`CGI.escapeHTML(str, nil)` append to new string or return unmodified `str` (new default?)
`CGI.escapeHTML(str, buffer)` append to buffer (most efficient when rendering mutliple string in template)

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-99916

* Author: k0kubun (Takashi Kokubun)
* Status: Open
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [ruby-core:110585] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
  2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
                   ` (4 preceding siblings ...)
  2022-11-02 15:02 ` [ruby-core:110583] " Dan0042 (Daniel DeLorme)
@ 2022-11-02 20:12 ` k0kubun (Takashi Kokubun)
  2022-11-02 20:18 ` [ruby-core:110586] " k0kubun (Takashi Kokubun)
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: k0kubun (Takashi Kokubun) @ 2022-11-02 20:12 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been updated by k0kubun (Takashi Kokubun).

> You know, if speed is that much a concern, I think optimized_escape_html2 should seek the first escapable character and only if found then do the ALLOCV_N (and memcpy from cstr to buf).

`ALLOCV_N` is essentially `alloca`. I'd say this one is indeed "fairly cheap". Feel free to prove it otherwise by submitting a patch and benchmarking it though.

As to `memcpy`, while technically `memcpy` isn't called for the case we're discussing, `*dest++ = c;` feels like an unneeded overhead and I have an idea to improve it. I'd discuss it in a different ticket though.

> Actually an even better approach may be to append the escaped bytes directly to the final output buffer, instead of generating any intermediary string at all.

You seem to assume that the buffer is bare String, but it's not necessarily the case for Rails. I believe it could/should be fixed, but it's a different discussion.

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-99917

* Author: k0kubun (Takashi Kokubun)
* Status: Open
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [ruby-core:110586] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
  2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
                   ` (5 preceding siblings ...)
  2022-11-02 20:12 ` [ruby-core:110585] " k0kubun (Takashi Kokubun)
@ 2022-11-02 20:18 ` k0kubun (Takashi Kokubun)
  2022-11-03  5:16 ` [ruby-core:110587] " k0kubun (Takashi Kokubun)
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: k0kubun (Takashi Kokubun) @ 2022-11-02 20:18 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been updated by k0kubun (Takashi Kokubun).

We might be able to make it efficient for a non-String buffer too, so I'll give it a try first. Thanks for the idea.

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-99918

* Author: k0kubun (Takashi Kokubun)
* Status: Open
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [ruby-core:110587] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
  2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
                   ` (6 preceding siblings ...)
  2022-11-02 20:18 ` [ruby-core:110586] " k0kubun (Takashi Kokubun)
@ 2022-11-03  5:16 ` k0kubun (Takashi Kokubun)
  2022-11-03 14:14 ` [ruby-core:110589] " Dan0042 (Daniel DeLorme)
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: k0kubun (Takashi Kokubun) @ 2022-11-03  5:16 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been updated by k0kubun (Takashi Kokubun).

Status changed from Open to Closed

On second thought, it might not be necessarily more efficient even with a bare String buffer because the point of the current `CGI.escapeHTML` implementation is to pre-allocate a 5x-size buffer on the stack to avoid multiple reallocations for buffer extension. That optimization is hard if you need to directly write to the argument String from the beginning; the best thing you could do would be to `realloc` it to a 5x size first and then `realloc` it again to shrink it. Doing object reallocation twice seems expensive compared to just manipulating a stack pointer, even combined with extra `memcpy`. 

Plus, as a maintainer of many template engines, this interface seems hard to use in some cases. It seems to help only limited places with a lot of complexity.

---

I came up with a completely different solution. I'll file it as a different ticket after verifying the idea. So closing this. Thank you for the discussion.

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-99919

* Author: k0kubun (Takashi Kokubun)
* Status: Closed
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [ruby-core:110589] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
  2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
                   ` (7 preceding siblings ...)
  2022-11-03  5:16 ` [ruby-core:110587] " k0kubun (Takashi Kokubun)
@ 2022-11-03 14:14 ` Dan0042 (Daniel DeLorme)
  2022-11-26 13:06 ` [ruby-core:111021] " Eregon (Benoit Daloze)
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Dan0042 (Daniel DeLorme) @ 2022-11-03 14:14 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been updated by Dan0042 (Daniel DeLorme).

It's true ALLOCV_N is almost free but I was thinking more about the cost of copying between buffers
a) when nothing needs escaping: copy bytes to tmp buffer then discard
b) when something needs escaping: copy bytes to tmp buffer then copy again to final string

> You seem to assume that the buffer is bare String, but it's not necessarily the case for Rails.

If buffer is bare String we can use `CGI.escapeHTML(str, buffer)`, and if something else (Array?) we can use `CGI.escapeHTML(str)`. I think it's nice to support both cases.

6x allocation is optimal for worst cast like `'"' * 1000` but personally I would optimize for common case. So grow the string by an amount sufficient for 99.9% of cases (maybe something like (N+30)*1.2) and accept a realloc for degenerate cases. It's a different kind of choice/trade-off.

Thank you for the discussion, it's a fun topic.

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-99921

* Author: k0kubun (Takashi Kokubun)
* Status: Closed
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [ruby-core:111021] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
  2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
                   ` (8 preceding siblings ...)
  2022-11-03 14:14 ` [ruby-core:110589] " Dan0042 (Daniel DeLorme)
@ 2022-11-26 13:06 ` Eregon (Benoit Daloze)
  2022-11-26 21:58 ` [ruby-core:111025] " k0kubun (Takashi Kokubun)
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Eregon (Benoit Daloze) @ 2022-11-26 13:06 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been updated by Eregon (Benoit Daloze).

I did not expect `rb_str_dup()` is so costly on CRuby, I guess the allocation is slow and of course CRuby can't escape-analyze it.
I think it is important to keep the optimized HTML escape in core/stdlib (e.g., in CGI), as the fastest way is implementation-dependent.
See https://bugs.ruby-lang.org/issues/19102#note-4

I think we should add an optional argument to avoid copying the string when unchanged, that's easy and avoids yet another place/method with that escaping logic.

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-100276

* Author: k0kubun (Takashi Kokubun)
* Status: Closed
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.

-- 
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [ruby-core:111025] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
  2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
                   ` (9 preceding siblings ...)
  2022-11-26 13:06 ` [ruby-core:111021] " Eregon (Benoit Daloze)
@ 2022-11-26 21:58 ` k0kubun (Takashi Kokubun)
  2022-11-27  4:57 ` [ruby-core:111029] " Eregon (Benoit Daloze)
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: k0kubun (Takashi Kokubun) @ 2022-11-26 21:58 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been updated by k0kubun (Takashi Kokubun).

I filed a PR for truffleruby https://github.com/ruby/erb/pull/39.

> I think it is important to keep the optimized HTML escape in core/stdlib (e.g., in CGI)

I didn't get what you mean in this part. CGI and ERB are both default gems. There's no distinction in its "coreness" between them.

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-100279

* Author: k0kubun (Takashi Kokubun)
* Status: Closed
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.

-- 
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [ruby-core:111029] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
  2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
                   ` (10 preceding siblings ...)
  2022-11-26 21:58 ` [ruby-core:111025] " k0kubun (Takashi Kokubun)
@ 2022-11-27  4:57 ` Eregon (Benoit Daloze)
  2022-11-27  5:49 ` [ruby-core:111030] " k0kubun (Takashi Kokubun)
  2022-11-30  3:08 ` [ruby-core:111079] " Eregon (Benoit Daloze)
  13 siblings, 0 replies; 15+ messages in thread
From: Eregon (Benoit Daloze) @ 2022-11-27  4:57 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been updated by Eregon (Benoit Daloze).

Right, I forgot CGI is a default gem too.
I think it seems cleaner for other template engines (e.g. haml, slim, etc) to depend (as in require "cgi") on CGI vs depending on ERB, i.e. CGI feels smaller/like a subset of ERB. In other words, it doesn't seem ideal for other template engines to depend on the ERB template engine just for escaping (although practically speaking there is no issue, just from a design perspective).

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-100282

* Author: k0kubun (Takashi Kokubun)
* Status: Closed
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.

-- 
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [ruby-core:111030] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
  2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
                   ` (11 preceding siblings ...)
  2022-11-27  4:57 ` [ruby-core:111029] " Eregon (Benoit Daloze)
@ 2022-11-27  5:49 ` k0kubun (Takashi Kokubun)
  2022-11-30  3:08 ` [ruby-core:111079] " Eregon (Benoit Daloze)
  13 siblings, 0 replies; 15+ messages in thread
From: k0kubun (Takashi Kokubun) @ 2022-11-27  5:49 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been updated by k0kubun (Takashi Kokubun).

I get what you're saying. My position on this issue is:

* `CGI` is not a good place either unless you're writing a CGI application. ERB also has `ERB::Escape` now, and I'd say "embedded Ruby escape" is a better module name than "CGI" for the purpose.
* ERB 4.0.0 was shipped with a new file `erb/util.rb`, which allows you to load only a couple of escape methods, not loading an extra template engine.
* The way we defined the method was designed by @jeremyevans0 for Erubi. Loading the ERB template engine would be the last thing the Erubi maintainer would like to do. So, thanks to his idea, it's possible to load only escape methods.
* `ERB::Util.html_escape` is monkey-patched by Rails and it's been the only official `html_safe`-compatible HTML escape method for years (while it's been using Erubis or Erubi). It's the only Rails-official way to do it, and moving it to somewhere else would be unnecessarily disruptive.

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-100283

* Author: k0kubun (Takashi Kokubun)
* Status: Closed
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.

-- 
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [ruby-core:111079] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
  2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
                   ` (12 preceding siblings ...)
  2022-11-27  5:49 ` [ruby-core:111030] " k0kubun (Takashi Kokubun)
@ 2022-11-30  3:08 ` Eregon (Benoit Daloze)
  13 siblings, 0 replies; 15+ messages in thread
From: Eregon (Benoit Daloze) @ 2022-11-30  3:08 UTC (permalink / raw)
  To: ruby-core

Issue #19090 has been updated by Eregon (Benoit Daloze).

Thank you for the explanation, that's very good arguments and they address my concerns.

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-100339

* Author: k0kubun (Takashi Kokubun)
* Status: Closed
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.

-- 
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-11-30  3:08 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
2022-10-29 13:38 ` [ruby-core:110543] " Eregon (Benoit Daloze)
2022-11-01 12:55 ` [ruby-core:110566] " Dan0042 (Daniel DeLorme)
2022-11-02  4:45 ` [ruby-core:110576] " k0kubun (Takashi Kokubun)
2022-11-02 13:04 ` [ruby-core:110581] " Dan0042 (Daniel DeLorme)
2022-11-02 15:02 ` [ruby-core:110583] " Dan0042 (Daniel DeLorme)
2022-11-02 20:12 ` [ruby-core:110585] " k0kubun (Takashi Kokubun)
2022-11-02 20:18 ` [ruby-core:110586] " k0kubun (Takashi Kokubun)
2022-11-03  5:16 ` [ruby-core:110587] " k0kubun (Takashi Kokubun)
2022-11-03 14:14 ` [ruby-core:110589] " Dan0042 (Daniel DeLorme)
2022-11-26 13:06 ` [ruby-core:111021] " Eregon (Benoit Daloze)
2022-11-26 21:58 ` [ruby-core:111025] " k0kubun (Takashi Kokubun)
2022-11-27  4:57 ` [ruby-core:111029] " Eregon (Benoit Daloze)
2022-11-27  5:49 ` [ruby-core:111030] " k0kubun (Takashi Kokubun)
2022-11-30  3:08 ` [ruby-core:111079] " Eregon (Benoit Daloze)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).