rack-devel archive mirror (unofficial) https://groups.google.com/group/rack-devel
From: John Firebaugh <john.firebaugh@gmail.com>
To: rack-devel@googlegroups.com
Subject: Re: How to handle unicode escaped params?
Date: Mon, 25 Jul 2011 11:48:03 -0700	[thread overview]
Message-ID: <CAL3hnJZrhPog6A6bOmNi6Jtx9i=h8Z_K660AKs8U5TmyxqsEtQ@mail.gmail.com> (raw)
In-Reply-To: <871fffb6-96c1-425e-a99e-735ec3aaef80@p20g2000yqp.googlegroups.com>


Hi Tobi,

Both %c3%a9 and %e9 could be valid -- it depends on the content of the page
and what the server is prepared to accept. In short, a standards-conforming
browser will choose which encoding to use based (mainly) on the form's
accept-charset attribute. You can read the details here:

http://www.w3.org/TR/html5/association-of-controls-and-forms.html#url-encoded-form-data

Also, note that while UTF-8 is indeed a character encoding, "Unicode" is a
standard, not a character encoding itself. As such, it doesn't make sense to
talk about é being "encoded in Unicode". It is true that é is the Unicode
code point U+00E9, but code points are independent of encodings (they are
just abstract numeric identifiers for particular characters). You *can*
say that é is encoded in ISO-8859-1 (and other common encodings) as hex E9,
however. (See here for more:
http://www.joelonsoftware.com/articles/Unicode.html)

So, with a form that specifies accept-charset="utf-8", é would be escaped
as %c3%a9. With a form that specifies accept-charset="iso-8859-1", it would
be escaped as %e9. Leaving it unescaped is technically invalid.
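
To make the difference concrete, here is a minimal Ruby sketch that
transcodes é to each charset and percent-escapes the resulting bytes. The
helper name escape_all_bytes is hypothetical (it is not part of Rack or the
standard library), and it escapes every byte, which is only appropriate for
a non-ASCII character like this one:

```ruby
# Hypothetical helper (not part of Rack or the standard library):
# transcode +str+ to +charset+, then percent-escape every resulting byte.
def escape_all_bytes(str, charset)
  str.encode(charset).bytes.map { |byte| format("%%%02x", byte) }.join
end

e_acute = "\u00E9" # the code point U+00E9, independent of any encoding

utf8_escape   = escape_all_bytes(e_acute, "UTF-8")      # bytes C3 A9
latin1_escape = escape_all_bytes(e_acute, "ISO-8859-1") # byte E9

puts utf8_escape   # %c3%a9
puts latin1_escape # %e9
```

The same code point yields two different escapes because the escaping
operates on bytes, and the bytes depend on the chosen encoding.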

In order to process the data correctly, the server must know what encoding
the incoming URL parameters are encoded in, typically either by convention
or via HTTP headers. Rack has a deficiency here, in that
Rack::Utils.unescape does not allow you to specify the encoding -- only
UTF-8 is supported. The underlying API used by Rack,
URI.decode_www_form_component, does allow you to specify the encoding, so if
that matters to you, you might have to use that directly. (Though as was
discovered recently, it has a regexp that is vulnerable to catastrophic
backtracking.)
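
As a rough sketch of what calling the underlying API directly might look
like: the helper decode_param below is hypothetical, and the ISO-8859-1
fallback is an assumption about what a given site's clients send, not
something Rack does. It tries UTF-8 first and, only if the decoded bytes are
not valid UTF-8, reinterprets them in the fallback encoding:

```ruby
require "uri"

# Hypothetical helper: decode as UTF-8 first; if the result is not valid
# UTF-8, decode again as +fallback+ (assumed ISO-8859-1) and transcode.
def decode_param(escaped, fallback = Encoding::ISO_8859_1)
  decoded = URI.decode_www_form_component(escaped, Encoding::UTF_8)
  return decoded if decoded.valid_encoding?

  URI.decode_www_form_component(escaped, fallback).encode(Encoding::UTF_8)
end

from_utf8   = decode_param("caf%c3%a9") # already valid UTF-8
from_latin1 = decode_param("caf%E9")    # falls back to ISO-8859-1

puts from_utf8 == from_latin1 # both are "café" as UTF-8 strings
```

Note that this is a heuristic: every byte sequence is valid ISO-8859-1, so
the fallback never fails, but it will silently produce the wrong text if the
client actually used some other encoding. Knowing the encoding up front, by
convention or from headers, remains the reliable approach.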

John

On Mon, Jul 25, 2011 at 7:22 AM, Tobias Bielohlawek <tobi@soundcloud.com> wrote:

> I have a question on correct behavior of handling unicode encoded
> query params.
>
> Let's take the character 'é': in UTF-8 that's encoded as '%c3%a9', in
> unicode as '%e9'. As a real-life example, we're seeing these URLs:
>
> 1. not escaped: -> http://soundcloud.com/search?q%5Bfulltext%5D=café
> 2. UTF8 escaped -> http://soundcloud.com/search?q%5Bfulltext%5D=caf%c3%a9
> 3. or Unicode escaped ->
> http://soundcloud.com/search?q%5Bfulltext%5D=caf%E9
>
> Using rack 1.3.1, the first two cases are processed correctly, but the
> latter fails with error 'incorrect UTF-8 byte sequence'. Which is
> correct behavior in the first place, as it's
> not UTF-8 but unicode.
>
> I'm now wondering what's the best solution here? Why not be smart and give
> it another run to unescape the URL assuming it's unicode encoded?
>
> I checked other players, e.g. Twitter has same error (but fails
> silently)
> http://twitter.com/search/é
> http://twitter.com/search/%c3%a9
> http://twitter.com/search/%E9
>
> but not Google, it does it this way:
> http://www.google.de/search?q=é
> http://www.google.de/search?q=%c3%a9
> http://www.google.de/search?q=%E9
>
>
> Any advice is very much appreciated. Patch is coming..
>
> Thx - Tobi


Thread overview: 4+ messages
2011-07-25 14:22 How to handle unicode escaped params? Tobias Bielohlawek
2011-07-25 18:48 ` John Firebaugh [this message]
2011-07-26  9:07   ` Tobias Bielohlawek | SoundCloud
2011-07-26  9:23     ` Evgeni Dzhelyov
