From: Evgeni Dzhelyov
To: rack-devel@googlegroups.com
Date: Tue, 26 Jul 2011 12:23:04 +0300
Subject: Re: How to handle unicode escaped params?
References: <871fffb6-96c1-425e-a99e-735ec3aaef80@p20g2000yqp.googlegroups.com>

I think that's why Rails 3 adds the check mark parameter (utf8=✓) to its
forms: to force the browser to send the form data in UTF-8.

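To make the difference concrete, here is a minimal sketch (not code from Rack or Rails) of how the same character is escaped under each charset, together with a UTF-8-first decode that falls back to ISO-8859-1, along the lines of what Tobi describes in the quoted message. It assumes a current Ruby, where Iconv is no longer part of the standard library, so String#encode is used instead. decode_param is a hypothetical helper name chosen for illustration.

```ruby
require "uri"

word = "caf\u00E9" # "café" as a UTF-8 string

# What a browser would send for a UTF-8 form versus an ISO-8859-1 form.
utf8_escaped   = URI.encode_www_form_component(word)
latin1_escaped = URI.encode_www_form_component(word.encode(Encoding::ISO_8859_1))

# Hypothetical helper: try UTF-8 first, fall back to ISO-8859-1.
def decode_param(raw)
  decoded = URI.decode_www_form_component(raw, Encoding::UTF_8)
  return decoded if decoded.valid_encoding?

  # Not valid UTF-8, so reinterpret the bytes as ISO-8859-1 and transcode.
  URI.decode_www_form_component(raw, Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

puts utf8_escaped                 # caf%C3%A9
puts latin1_escaped               # caf%E9
puts decode_param(utf8_escaped)   # café
puts decode_param(latin1_escaped) # café
```

Note that the fallback is only a guess: every byte sequence is valid ISO-8859-1, so it never fails, but it will silently misread data that was actually sent in some other legacy encoding.
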
On Tue, Jul 26, 2011 at 12:07 PM, Tobias Bielohlawek | SoundCloud wrote:
> Hey John,
> thank you very much for your comprehensive answer. Good that you pointed
> me to ISO-8859-1; stupid that I didn't think of that, it is what confused
> me most. So the request params we are seeing are indeed ISO-8859-1 encoded,
> but without any "accept-charset" param given. For those cases we don't want
> to fail, but be smart and try to decode as ISO-8859-1.
> So on error I try to decode again, but with Encoding::ISO8859_1, followed
> by iconv to get it to UTF-8:
>
> str = URI.decode_www_form_component("%E9", Encoding::ISO8859_1)
> Iconv.conv("utf-8", "ISO-8859-1", str)
>
> That seems to do the trick. Thanks!
> Tobi
>
>
> On 25.07.2011, at 20:48, John Firebaugh wrote:
>
> Hi Tobi,
> Both %c3%a9 and %e9 could be valid -- it depends on the content of the page
> and what the server is prepared to accept. In short, a standards-conforming
> browser will choose which encoding to use based (mainly) on the form's
> accept-charset attribute. You can read the details here:
> http://www.w3.org/TR/html5/association-of-controls-and-forms.html#url-encoded-form-data
> Also, note that while UTF-8 is indeed a character encoding, "Unicode" is a
> standard, not a character encoding itself. As such, it doesn't make sense to
> talk about é being "encoded in Unicode". It is true that é is the Unicode
> code point U+00E9, but code points are independent of encodings (they are
> just abstract numeric identifiers for particular characters). You can
> say that é is encoded in ISO-8859-1 (and other common encodings) as hex E9,
> however. (See here for
> more: http://www.joelonsoftware.com/articles/Unicode.html)
> So, with a form that specifies accept-charset="utf-8", é would be escaped
> as %c3%a9. With a form that specifies accept-charset="iso-8859-1", it would
> be escaped as %e9.
> Leaving it unescaped is technically invalid.
> In order to process the data correctly, the server must know what encoding
> the incoming URL parameters are encoded in, typically either by convention
> or via HTTP headers. Rack has a deficiency here, in that
> Rack::Utils.unescape does not allow you to specify the encoding -- only
> UTF-8 is supported. The underlying API used by Rack,
> URI.decode_www_form_component, does allow you to specify the encoding, so if
> that matters to you, you might have to use that directly. (Though as was
> discovered recently, it has a regexp that is vulnerable to catastrophic
> backtracking.)
>
> John
> On Mon, Jul 25, 2011 at 7:22 AM, Tobias Bielohlawek wrote:
>>
>> I have a question on the correct behavior for handling unicode encoded
>> query params.
>>
>> Let's take the character 'é': in UTF-8 that's encoded as '%c3%a9', in
>> unicode as '%e9'. As a real-life example, we're seeing these URLs:
>>
>> 1. not escaped -> http://soundcloud.com/search?q%5Bfulltext%5D=café
>> 2. UTF8 escaped -> http://soundcloud.com/search?q%5Bfulltext%5D=caf%c3%a9
>> 3. or Unicode escaped ->
>> http://soundcloud.com/search?q%5Bfulltext%5D=caf%E9
>>
>> Using rack 1.3.1, the first two cases are processed correctly, but the
>> last one fails with the error 'incorrect UTF-8 byte sequence'. Which is
>> correct behavior in the first place, as it's not UTF-8 but unicode.
>>
>> I'm now wondering what the best solution is here. Why not be smart and
>> give it another run to unescape the URL, assuming it's unicode encoded?
>>
>> I checked other players, e.g. Twitter has the same error (but fails
>> silently):
>> http://twitter.com/search/é
>> http://twitter.com/search/%c3%a9
>> http://twitter.com/search/%E9
>>
>> but not Google, which handles all of these:
>> http://www.google.de/search?q=é
>> http://www.google.de/search?q=%c3%a9
>> http://www.google.de/search?q=%E9
>>
>>
>> Any advice is very much appreciated. Patch is coming..
>>
>> Thx - Tobi
>>
>
> Cheers,
>   Tobi
> –––––––––––––––––
> Tobias Bielohlawek
> Developer, SoundCloud
>
> Mail & gtalk: tobi@soundcloud.com
> Rosenthalerstraße 13, 10119 Berlin, Germany
>
> What is SoundCloud?
> http://soundcloud.com/tour
>
> Send me a track?
> http://soundcloud.com/hopit/dropbox
>
> Limited registered at Company House, Cardiff, UK. Registered Office: London,
> UK. Company Number 6343600
> Managing Director: Alexander Ljung
> Local Branch Office Berlin, Germany: AG Charlottenburg, HRB 110657B