ruby-core@ruby-lang.org archive (unofficial mirror)
 help / color / mirror / Atom feed
From: "k0kubun (Takashi Kokubun)" <noreply@ruby-lang.org>
To: ruby-core@neon.ruby-lang.org
Subject: [ruby-core:110587] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML
Date: Thu, 03 Nov 2022 05:16:48 +0000 (UTC)	[thread overview]
Message-ID: <redmine.journal-99919.20221103051647.10073@ruby-lang.org> (raw)
In-Reply-To: redmine.issue-19090.20221029074014.10073@ruby-lang.org

Issue #19090 has been updated by k0kubun (Takashi Kokubun).

Status changed from Open to Closed

On second thought, it might not be necessarily more efficient even with a bare String buffer because the point of the current `CGI.escapeHTML` implementation is to pre-allocate a 5x-size buffer on the stack to avoid multiple reallocations for buffer extension. That optimization is hard if you need to directly write to the argument String from the beginning; the best thing you could do would be to `realloc` it to a 5x size first and then `realloc` it again to shrink it. Doing object reallocation twice seems expensive compared to just manipulating a stack pointer, even combined with extra `memcpy`. 

Plus, as a maintainer of many template engines, this interface seems hard to use in some cases. It seems to help only limited places with a lot of complexity.

---

I came up with a completely different solution. I'll file it as a different ticket after verifying the idea. So closing this. Thank you for the discussion.

----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-99919

* Author: k0kubun (Takashi Kokubun)
* Status: Closed
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.

More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.

The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:

* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
  * We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.

## Why not a new method?

It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.



-- 
https://bugs.ruby-lang.org/

  parent reply	other threads:[~2022-11-03  5:16 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-29  7:40 [ruby-core:110542] [Ruby master Feature#19090] Do not duplicate an unescaped string in CGI.escapeHTML k0kubun (Takashi Kokubun)
2022-10-29 13:38 ` [ruby-core:110543] " Eregon (Benoit Daloze)
2022-11-01 12:55 ` [ruby-core:110566] " Dan0042 (Daniel DeLorme)
2022-11-02  4:45 ` [ruby-core:110576] " k0kubun (Takashi Kokubun)
2022-11-02 13:04 ` [ruby-core:110581] " Dan0042 (Daniel DeLorme)
2022-11-02 15:02 ` [ruby-core:110583] " Dan0042 (Daniel DeLorme)
2022-11-02 20:12 ` [ruby-core:110585] " k0kubun (Takashi Kokubun)
2022-11-02 20:18 ` [ruby-core:110586] " k0kubun (Takashi Kokubun)
2022-11-03  5:16 ` k0kubun (Takashi Kokubun) [this message]
2022-11-03 14:14 ` [ruby-core:110589] " Dan0042 (Daniel DeLorme)
2022-11-26 13:06 ` [ruby-core:111021] " Eregon (Benoit Daloze)
2022-11-26 21:58 ` [ruby-core:111025] " k0kubun (Takashi Kokubun)
2022-11-27  4:57 ` [ruby-core:111029] " Eregon (Benoit Daloze)
2022-11-27  5:49 ` [ruby-core:111030] " k0kubun (Takashi Kokubun)
2022-11-30  3:08 ` [ruby-core:111079] " Eregon (Benoit Daloze)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-list from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.ruby-lang.org/en/community/mailing-lists/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=redmine.journal-99919.20221103051647.10073@ruby-lang.org \
    --to=ruby-core@ruby-lang.org \
    --cc=ruby-core@neon.ruby-lang.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).