How to handle unicode escaped params?

rack-devel archive mirror (unofficial) https://groups.google.com/group/rack-devel

* How to handle unicode escaped params?
@ 2011-07-25 14:22 Tobias Bielohlawek
  2011-07-25 18:48 ` John Firebaugh
  0 siblings, 1 reply; 4+ messages in thread
From: Tobias Bielohlawek @ 2011-07-25 14:22 UTC (permalink / raw)
  To: Rack Development

I have a question about the correct way to handle unicode encoded
query params.

Let's take the character 'é': in UTF-8 it's encoded as '%c3%a9', in unicode
as '%e9'. As a real-life example, we're seeing these URLs:

1. not escaped: -> http://soundcloud.com/search?q%5Bfulltext%5D=café
2. UTF8 escaped -> http://soundcloud.com/search?q%5Bfulltext%5D=caf%c3%a9
3. or Unicode escaped -> http://soundcloud.com/search?q%5Bfulltext%5D=caf%E9

Using rack 1.3.1, the first two cases are processed correctly, but the
last one fails with the error 'incorrect UTF-8 byte sequence'. That is
correct behavior in the first place, as it's not UTF-8 but unicode.
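
A minimal Ruby sketch of the two escaped cases, assuming a current Ruby where
URI.decode_www_form_component tags its result as UTF-8 by default. The
variable names are made up for illustration, and this only shows why the
'%E9' form cannot be read as UTF-8, not how Rack itself raises the error:

```ruby
require 'uri'

# Case 2: the two bytes C3 A9 form a valid UTF-8 sequence for 'é'.
utf8_escaped = URI.decode_www_form_component("caf%c3%a9")

# Case 3: the lone byte E9 is not a valid UTF-8 sequence.
single_byte = URI.decode_www_form_component("caf%E9")

puts utf8_escaped.valid_encoding?   # true
puts single_byte.valid_encoding?    # false
```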

I'm now wondering what the best solution is here. Why not be smart and give
it another run, unescaping the URL on the assumption that it's unicode encoded?

I checked other players, e.g. Twitter has the same error (but fails
silently):
http://twitter.com/search/é
http://twitter.com/search/%c3%a9
http://twitter.com/search/%E9

Google does not have this error; it accepts all of these:
http://www.google.de/search?q=é
http://www.google.de/search?q=%c3%a9
http://www.google.de/search?q=%E9

Any advice is very much appreciated. A patch is coming.

Thx - Tobi


* Re: How to handle unicode escaped params?
  2011-07-25 14:22 How to handle unicode escaped params? Tobias Bielohlawek
@ 2011-07-25 18:48 ` John Firebaugh
  2011-07-26  9:07   ` Tobias Bielohlawek | SoundCloud
  0 siblings, 1 reply; 4+ messages in thread
From: John Firebaugh @ 2011-07-25 18:48 UTC (permalink / raw)
  To: rack-devel


Hi Tobi,

Both %c3%a9 and %e9 could be valid -- it depends on the content of the page
and what the server is prepared to accept. In short, a standards-conforming
browser will choose which encoding to use based (mainly) on the form's
accept-charset attribute. You can read the details here:

http://www.w3.org/TR/html5/association-of-controls-and-forms.html#url-encoded-form-data

Also, note that while UTF-8 is indeed a character encoding, "Unicode" is a
standard, not a character encoding itself. As such, it doesn't make sense to
talk about é being "encoded in Unicode". It is true that é is the Unicode
code point U+00E9, but code points are independent of encodings (they are
just abstract numeric identifiers for particular characters). You
*can* say that é is encoded in ISO-8859-1 (and other common encodings)
as hex E9, however. (See here for more:
http://www.joelonsoftware.com/articles/Unicode.html)
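
To make the code point versus encoding distinction concrete, here is a small
Ruby sketch. The variable name is only illustrative, and the character is
written as an escape so the source file's own encoding does not matter:

```ruby
e_acute = "\u00E9"  # 'é' as a UTF-8 string

# The code point is an abstract number, U+00E9.
puts e_acute.ord.to_s(16)               # e9

# The bytes depend on the encoding chosen.
p e_acute.bytes                         # [195, 169], i.e. C3 A9
p e_acute.encode("ISO-8859-1").bytes    # [233], i.e. E9
```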

So, with a form that specifies accept-charset="utf-8", é would be escaped
as %c3%a9. With a form that specifies accept-charset="iso-8859-1", it would
be escaped as %e9. Leaving it unescaped is technically invalid.
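
The two escapings can be reproduced in Ruby. This is only a sketch of what a
browser would send for each accept-charset, assuming a current Ruby; the
variable name is illustrative. Ruby emits uppercase hex digits, which are
equivalent to the lowercase ones above:

```ruby
require 'uri'

e_acute = "\u00E9"  # 'é' as a UTF-8 string

# What a form with accept-charset="utf-8" would send.
utf8_form = URI.encode_www_form_component(e_acute)
puts utf8_form      # %C3%A9

# What a form with accept-charset="iso-8859-1" would send.
latin1_form = URI.encode_www_form_component(e_acute.encode("ISO-8859-1"))
puts latin1_form    # %E9
```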

In order to process the data correctly, the server must know what encoding
the incoming URL parameters are encoded in, typically either by convention
or via HTTP headers. Rack has a deficiency here, in that
Rack::Utils.unescape does not allow you to specify the encoding -- only
UTF-8 is supported. The underlying API used by Rack,
URI.decode_www_form_component, does allow you to specify the encoding, so if
that matters to you, you might have to use that directly. (Though as was
discovered recently, it has a regexp that is vulnerable to catastrophic
backtracking.)
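
As a rough illustration of calling it directly with an explicit encoding
(a sketch assuming a current Ruby; the variable name is made up):

```ruby
require 'uri'

# Tell the decoder that the escaped bytes are ISO-8859-1.
latin1 = URI.decode_www_form_component("caf%E9", Encoding::ISO_8859_1)

puts latin1.encoding          # ISO-8859-1
puts latin1.valid_encoding?   # true
```

The result is tagged as ISO-8859-1, so it still has to be transcoded if the
rest of the application expects UTF-8.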

John



* Re: How to handle unicode escaped params?
  2011-07-25 18:48 ` John Firebaugh
@ 2011-07-26  9:07   ` Tobias Bielohlawek | SoundCloud
  2011-07-26  9:23     ` Evgeni Dzhelyov
  0 siblings, 1 reply; 4+ messages in thread
From: Tobias Bielohlawek | SoundCloud @ 2011-07-26  9:07 UTC (permalink / raw)
  To: rack-devel


Hey John

thank you very much for your comprehensive answer. Good that you pointed me to ISO-8859-1; stupid that I didn't think of that, it's what confused me most. So the request params we are seeing are indeed ISO-8859-1 encoded, but without any "accept-charset" param given. For those cases we don't want to fail, but be smart and try to decode as ISO-8859-1:

So on error I try to decode again, but with Encoding::ISO8859_1, followed by iconv to get it to UTF-8:
 
require 'uri'
require 'iconv'
str = URI.decode_www_form_component("%E9", Encoding::ISO8859_1)
Iconv.conv("utf-8", "ISO-8859-1", str)

That seems to do the trick. Thanks!
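
For reference, a sketch of the same fallback on a current Ruby, where Iconv is
no longer part of the standard library and String#encode does the
transcoding instead. The helper name unescape_with_fallback is hypothetical,
not something Rack provides:

```ruby
require 'uri'

# Hypothetical helper: decode as UTF-8 first, fall back to ISO-8859-1.
def unescape_with_fallback(escaped)
  str = URI.decode_www_form_component(escaped, Encoding::UTF_8)
  return str if str.valid_encoding?

  # Reinterpret the same bytes as ISO-8859-1, then transcode to UTF-8.
  URI.decode_www_form_component(escaped, Encoding::ISO_8859_1)
     .encode(Encoding::UTF_8)
end

puts unescape_with_fallback("caf%c3%a9") == "caf\u00E9"   # true
puts unescape_with_fallback("caf%E9") == "caf\u00E9"      # true
```

Every byte sequence is valid ISO-8859-1, so the fallback never fails, but it
is still a guess: input sent in some other single-byte encoding will decode
to the wrong characters without any error.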

Tobi




Cheers,
  Tobi

–––––––––––––––––
Tobias Bielohlawek
Developer, SoundCloud

Mail & gtalk: tobi@soundcloud.com
Rosenthalerstraße 13, 10119 Berlin, Germany


* Re: How to handle unicode escaped params?
  2011-07-26  9:07   ` Tobias Bielohlawek | SoundCloud
@ 2011-07-26  9:23     ` Evgeni Dzhelyov
  0 siblings, 0 replies; 4+ messages in thread
From: Evgeni Dzhelyov @ 2011-07-26  9:23 UTC (permalink / raw)
  To: rack-devel

I think that's why Rails 3 adds the tick character (utf8=✓) to
forms: to force the browser to send the form data in UTF-8.
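
As far as I understand the idea, the tick works because it has no
representation in ISO-8859-1, so a browser cannot fall back to that
encoding for it. A small Ruby sketch of that property (variable names are
only illustrative):

```ruby
require 'uri'

check_mark = "\u2713"  # the tick character

# In UTF-8 the tick takes three bytes.
escaped = URI.encode_www_form_component(check_mark)
puts escaped   # %E2%9C%93

# ISO-8859-1 has no byte for it, so transcoding raises.
representable = true
begin
  check_mark.encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError
  representable = false
end
puts representable   # false
```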

