rack-devel archive mirror (unofficial) https://groups.google.com/group/rack-devel
From: Evgeni Dzhelyov <evgeni.dzhelyov@gmail.com>
To: rack-devel@googlegroups.com
Subject: Re: How to handle unicode escaped params?
Date: Tue, 26 Jul 2011 12:23:04 +0300
Message-ID: <CA+AT3SxDfTpbCaUgoZzNAW3j2JNaGbnBM0qThad7zBBk+PXVww@mail.gmail.com>
In-Reply-To: <F5176A30-B94B-4CF4-AFD3-3933790F0A52@soundcloud.com>

I think that is why Rails 3 adds the check mark parameter (utf8=✓) to its
forms: to force the browser to send the form data as UTF-8.
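
To illustrate the idea, here is a minimal sketch of how a server could
check that marker on the raw query string. The helper name
submitted_as_utf8? and the constants are hypothetical, not part of Rails
or Rack; the only library call used is URI.encode_www_form_component.

```ruby
require "uri"

# U+2713 CHECK MARK, the character Rails 3 puts in its utf8 parameter.
CHECK_MARK = "\u2713"

# Percent-encoded form of the check mark when the form is submitted as UTF-8.
UTF8_CHECK_MARK = URI.encode_www_form_component(CHECK_MARK) # "%E2%9C%93"

# Hypothetical helper: true if the raw (still escaped) query string carries
# the utf8 marker exactly as a browser submitting in UTF-8 would send it.
def submitted_as_utf8?(raw_query)
  raw_query.split("&").any? do |pair|
    key, value = pair.split("=", 2)
    # Hex digits in percent-escapes are case-insensitive, so normalize them.
    key == "utf8" && value.to_s.upcase == UTF8_CHECK_MARK
  end
end

puts submitted_as_utf8?("utf8=%E2%9C%93&q=caf%C3%A9") # true
puts submitted_as_utf8?("q=caf%E9")                   # false
```

This only tells you what the browser did for that one request; it is a
hint, not a guarantee about every parameter.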

On Tue, Jul 26, 2011 at 12:07 PM, Tobias Bielohlawek | SoundCloud
<tobi@soundcloud.com> wrote:
> Hey John
> thank you very much for your comprehensive answer. Good that you pointed me
> to ISO-8859-1; stupid that I didn't think of that - it confused me most. So
> the request params we are seeing are indeed ISO-8859-1 encoded but without
> any "accept-charset" param given. For those cases we don't want to fail,
> but be smart and try to decode as ISO-8859-1:
> So on error I try to decode again, but with Encoding::ISO8859_1, followed by
> iconv to get it to UTF-8
>
> str = URI.decode_www_form_component("%E9", Encoding::ISO8859_1)
> Iconv.conv("utf-8", "ISO-8859-1", str)
> That seems to do the trick. Thanks!
> Tobi
>
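
The fallback Tobi describes can be sketched as follows. This is a sketch
under assumptions, not Rack's behavior: decode_param is a hypothetical
helper name, and String#encode stands in for Iconv, which was removed
from Ruby's standard library in 2.0.

```ruby
require "uri"

# Hypothetical helper: decode a percent-escaped parameter as UTF-8 first and
# fall back to ISO-8859-1 when the resulting bytes are not valid UTF-8.
def decode_param(raw)
  utf8 = URI.decode_www_form_component(raw, Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Reinterpret the same bytes as ISO-8859-1, then transcode to UTF-8.
  latin1 = URI.decode_www_form_component(raw, Encoding::ISO_8859_1)
  latin1.encode(Encoding::UTF_8)
end

puts decode_param("caf%C3%A9") # café
puts decode_param("caf%E9")    # café
```

Every byte sequence is valid ISO-8859-1, so the fallback never raises;
it can still guess wrong if the client actually used some other
single-byte encoding.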
>
> On 25.07.2011, at 20:48, John Firebaugh wrote:
>
> Hi Tobi,
> Both %c3%a9 and %e9 could be valid -- it depends on the content of the page
> and what the server is prepared to accept. In short, a standards-conforming
> browser will choose which encoding to use based (mainly) on the form's
> accept-charset attribute. You can read the details here:
> http://www.w3.org/TR/html5/association-of-controls-and-forms.html#url-encoded-form-data
> Also, note that while UTF-8 is indeed a character encoding, "Unicode" is a
> standard, not a character encoding itself. As such, it doesn't make sense to
> talk about é being "encoded in Unicode". It is true that é is the Unicode
> code point U+00E9, but code points are independent of encodings (they are
> just abstract numeric identifiers for particular characters). You can
> say that é is encoded in ISO-8859-1 (and other common encodings) as hex E9,
> however. (See here for
> more: http://www.joelonsoftware.com/articles/Unicode.html)
> So, with a form that specifies accept-charset="utf-8", é would be escaped
> as %c3%a9. With a form that specifies accept-charset="iso-8859-1", it would
> be escaped as %e9. Leaving it unescaped is technically invalid.
> In order to process the data correctly, the server must know what encoding
> the incoming URL parameters are encoded in, typically either by convention
> or via HTTP headers. Rack has a deficiency here, in that
> Rack::Utils.unescape does not allow you to specify the encoding -- only
> UTF-8 is supported. The underlying API used by Rack,
> URI.decode_www_form_component, does allow you to specify the encoding, so if
> that matters to you, you might have to use that directly. (Though as was
> discovered recently, it has a regexp that is vulnerable to catastrophic
> backtracking.)
>
> John
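
John's distinction between the code point and its encodings can be
checked in a few lines of Ruby. This is only an illustration; the
variable names are made up for this example, and Ruby emits uppercase
hex digits where the message above writes them in lowercase (the two
are equivalent).

```ruby
require "uri"

# é written as an escape so the source file encoding does not matter.
e_acute = "\u00E9"

code_point   = e_acute.ord                                # 0xE9
utf8_bytes   = e_acute.encode(Encoding::UTF_8).bytes      # [0xC3, 0xA9]
latin1_bytes = e_acute.encode(Encoding::ISO_8859_1).bytes # [0xE9]

# Percent-escape the bytes of each encoding, as a form submission would.
utf8_escaped   = URI.encode_www_form_component(e_acute.encode(Encoding::UTF_8))
latin1_escaped = URI.encode_www_form_component(e_acute.encode(Encoding::ISO_8859_1))

puts format("U+%04X", code_point) # U+00E9
puts utf8_escaped                 # %C3%A9
puts latin1_escaped               # %E9
```

The code point is the same in both cases; only the bytes, and therefore
the escapes, differ.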
> On Mon, Jul 25, 2011 at 7:22 AM, Tobias Bielohlawek <tobi@soundcloud.com>
> wrote:
>>
>> I have a question on the correct behavior for handling unicode encoded
>> query params.
>>
>> Let's take the character 'é': in UTF-8 that's encoded as '%c3%a9', in
>> unicode as '%e9'. As a real life example, we're seeing these URLs:
>>
>> 1. not escaped: -> http://soundcloud.com/search?q%5Bfulltext%5D=café
>> 2. UTF8 escaped -> http://soundcloud.com/search?q%5Bfulltext%5D=caf%c3%a9
>> 3. or Unicode escaped ->
>> http://soundcloud.com/search?q%5Bfulltext%5D=caf%E9
>>
>> Using rack 1.3.1, the first two cases are processed correctly, but the
>> last one fails with the error 'incorrect UTF-8 byte sequence'. Which is
>> correct behavior in the first place, as it's not UTF-8 but unicode.
>>
>> I'm now wondering what the best solution is here. Why not be smart and
>> give it another run to unescape the URL, assuming it's unicode encoded?
>>
>> I checked other players, e.g. Twitter has the same error (but fails
>> silently)
>> http://twitter.com/search/é
>> http://twitter.com/search/%c3%a9
>> http://twitter.com/search/%E9
>>
>> but not Google, it does it this way:
>> http://www.google.de/search?q=é
>> http://www.google.de/search?q=%c3%a9
>> http://www.google.de/search?q=%E9
>>
>>
>> Any advice is very much appreciated. Patch is coming..
>>
>> Thx - Tobi
>>
>>
>
>
>
> Cheers,
>   Tobi
> –––––––––––––––––
> Tobias Bielohlawek
> Developer, SoundCloud
>
> Mail & gtalk: tobi@soundcloud.com
> Rosenthalerstraße 13, 10119 Berlin, Germany
>
> What is SoundCloud?
> http://soundcloud.com/tour
>
> Send me a track?
> http://soundcloud.com/hopit/dropbox
>
> Limited registered at Company House, Cardiff, UK. Registered Office: London,
> UK. Company Number 6343600
> Managing Director: Alexander Ljung
> Local Branch Office Berlin, Germany: AG Charlottenburg, HRB 110657B
>
>
>

Thread overview: 4+ messages
2011-07-25 14:22 How to handle unicode escaped params? Tobias Bielohlawek
2011-07-25 18:48 ` John Firebaugh
2011-07-26  9:07   ` Tobias Bielohlawek | SoundCloud
2011-07-26  9:23     ` Evgeni Dzhelyov [this message]
