Hey John,

thank you very much for your comprehensive answer. Good that you pointed
me to ISO-8859-1; silly of me not to think of that, as it was what
confused me most.

So the request params we are seeing are indeed ISO-8859-1 encoded, but
without any "accept-charset" param given. For those cases we don't want
to fail, but be smart and try ISO-8859-1 instead.

So on error I decode again, but with Encoding::ISO8859_1, followed by
Iconv to get the result to UTF-8:

  require 'uri'
  require 'iconv'
  str = URI.decode_www_form_component("%E9", Encoding::ISO8859_1)
  Iconv.conv("utf-8", "ISO-8859-1", str)

That seems to do the trick.

Thanks!
Tobi

On 25.07.2011, at 20:48, John Firebaugh wrote:

> Hi Tobi,
>
> Both %c3%a9 and %e9 could be valid -- it depends on the content of the
> page and what the server is prepared to accept. In short, a
> standards-conforming browser will choose which encoding to use based
> (mainly) on the form's accept-charset attribute. You can read the
> details here:
>
> http://www.w3.org/TR/html5/association-of-controls-and-forms.html#url-encoded-form-data
>
> Also, note that while UTF-8 is indeed a character encoding, "Unicode"
> is a standard, not a character encoding itself. As such, it doesn't
> make sense to talk about é being "encoded in Unicode". It is true that
> é is the Unicode code point U+00E9, but code points are independent of
> encodings (they are just abstract numeric identifiers for particular
> characters). You can say that é is encoded in ISO-8859-1 (and other
> common encodings) as hex E9, however. (See here for more:
> http://www.joelonsoftware.com/articles/Unicode.html)
>
> So, with a form that specifies accept-charset="utf-8", é would be
> escaped as %c3%a9. With a form that specifies
> accept-charset="iso-8859-1", it would be escaped as %e9. Leaving it
> unescaped is technically invalid.
>
> In order to process the data correctly, the server must know what
> encoding the incoming URL parameters are encoded in, typically either
> by convention or via HTTP headers.
> Rack has a deficiency here, in that Rack::Utils.unescape does not
> allow you to specify the encoding -- only UTF-8 is supported. The
> underlying API used by Rack, URI.decode_www_form_component, does allow
> you to specify the encoding, so if that matters to you, you might have
> to use that directly. (Though as was discovered recently, it has a
> regexp that is vulnerable to catastrophic backtracking.)
>
> John
>
> On Mon, Jul 25, 2011 at 7:22 AM, Tobias Bielohlawek wrote:
> I have a question on the correct behavior for handling unicode
> encoded query params.
>
> Let's take the character 'é': in UTF-8 that's encoded as '%c3%a9', in
> unicode as '%e9'. As a real-life example, we're seeing these URLs:
>
> 1. not escaped: -> http://soundcloud.com/search?q%5Bfulltext%5D=café
> 2. UTF8 escaped -> http://soundcloud.com/search?q%5Bfulltext%5D=caf%c3%a9
> 3. or Unicode escaped -> http://soundcloud.com/search?q%5Bfulltext%5D=caf%E9
>
> Using rack 1.3.1, the first two cases are processed correctly, but the
> latter fails with the error 'incorrect UTF-8 byte sequence'. Which is
> correct behavior in the first place, as it's not UTF-8 but unicode.
>
> I'm now wondering what the best solution is here. Why not be smart and
> give it another run to unescape the URL, assuming it's unicode
> encoded?
>
> I checked other players, e.g. Twitter has the same error (but fails
> silently)
> http://twitter.com/search/é
> http://twitter.com/search/%c3%a9
> http://twitter.com/search/%E9
>
> but not Google, it does it this way:
> http://www.google.de/search?q=é
> http://www.google.de/search?q=%c3%a9
> http://www.google.de/search?q=%E9
>
>
> Any advice is very much appreciated. Patch is coming..
>
> Thx - Tobi
>
>
>

Cheers,
Tobi

–––––––––––––––––
Tobias Bielohlawek
Developer, SoundCloud
Mail & gtalk: tobi@soundcloud.com
Rosenthalerstraße 13, 10119 Berlin, Germany

What is SoundCloud? http://soundcloud.com/tour
Send me a track? http://soundcloud.com/hopit/dropbox
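
For reference, the retry described at the top of this message could be wrapped up roughly as follows. This is only a sketch under a few assumptions: it uses String#encode in place of Iconv (Iconv was removed from the standard library in Ruby 2.0), it detects the failure with String#valid_encoding? rather than by rescuing an error, and the helper name decode_with_fallback is made up for illustration; it is not part of Rack or URI.

```ruby
require "uri"

# Hypothetical helper (not a Rack or URI API): decode a form component as
# UTF-8 first, and fall back to ISO-8859-1 when the bytes are not valid UTF-8.
def decode_with_fallback(raw)
  decoded = URI.decode_www_form_component(raw, Encoding::UTF_8)
  return decoded if decoded.valid_encoding?

  # Reinterpret the same bytes as ISO-8859-1, then transcode to UTF-8.
  URI.decode_www_form_component(raw, Encoding::ISO_8859_1)
     .encode(Encoding::UTF_8)
end

p decode_with_fallback("caf%c3%a9")  # UTF-8 escaped input, decoded directly
p decode_with_fallback("caf%E9")     # ISO-8859-1 escaped input, via fallback
```

Keep in mind that every byte sequence is valid ISO-8859-1, so the fallback never fails. It is a guess about what the client meant, not a detection of the real encoding, and input in some other legacy encoding would be decoded to the wrong characters without any error.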