From mboxrd@z Thu Jan  1 00:00:00 1970
Delivered-To: chneukirchen@gmail.com
Received: by 10.229.70.138 with SMTP id d10cs15968qcj; Mon, 25 Jul 2011
 11:48:26 -0700 (PDT)
Return-Path: <rack-devel+bncCIDu1sPnGBD49LbxBBoEPf80kQ@googlegroups.com>
Received-SPF: pass (google.com: domain of
 rack-devel+bncCIDu1sPnGBD49LbxBBoEPf80kQ@googlegroups.com designates
 10.220.203.12 as permitted sender) client-ip=10.220.203.12;
Authentication-Results: mr.google.com; spf=pass (google.com: domain of
 rack-devel+bncCIDu1sPnGBD49LbxBBoEPf80kQ@googlegroups.com designates
 10.220.203.12 as permitted sender)
 smtp.mail=rack-devel+bncCIDu1sPnGBD49LbxBBoEPf80kQ@googlegroups.com;
 dkim=pass header.i=rack-devel+bncCIDu1sPnGBD49LbxBBoEPf80kQ@googlegroups.com
Received: from mr.google.com ([10.220.203.12]) by 10.220.203.12 with SMTP id
 fg12mr1759250vcb.45.1311619706612 (num_hops = 1); Mon, 25 Jul 2011 11:48:26
 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlegroups.com;
 s=beta; h=x-beenthere:received-spf:mime-version:in-reply-to:references:from
 :date:message-id:subject:to:x-original-sender
 :x-original-authentication-results:reply-to:precedence:mailing-list
 :list-id:x-google-group-id:list-post:list-help:list-archive:sender
 :list-subscribe:list-unsubscribe:content-type;
 bh=8kTQe6B/BoBBKfrp9Ax19gN4KV02gxT5ghFGe1vNiYs=;
 b=SMYiYR2z49hsIrS7cgEHGTGm9bBmCsfpLIYsIx1UCht/sKaKwzsqwJOicpoU6Zeh4c
 ecp+hDz9upj81OTja4JDG2WAa9mDFNJ3CqPVhcxcx4YnnQ+TCh7jFNMtmHZJ6jG/xVDm
 JqaIq+76kvc/XNt5eQziXgDjlptFaSvbdX2yk=
Received: by 10.220.203.12 with SMTP id fg12mr549244vcb.45.1311619704207;
 Mon, 25 Jul 2011 11:48:24 -0700 (PDT)
X-BeenThere: rack-devel@googlegroups.com
Received: by 10.52.179.138 with SMTP id dg10ls1704513vdc.2.gmail; Mon, 25 Jul
 2011 11:48:23 -0700 (PDT)
Received: by 10.52.91.134 with SMTP id ce6mr1428293vdb.13.1311619703245; Mon,
 25 Jul 2011 11:48:23 -0700 (PDT)
Received: by 10.52.91.134 with SMTP id ce6mr1428292vdb.13.1311619703232; Mon,
 25 Jul 2011 11:48:23 -0700 (PDT)
Received: from mail-vx0-f175.google.com ([209.85.220.175]) by
 gmr-mx.google.com with ESMTPS id v20si4599587vdu.2.2011.07.25.11.48.23
 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 25 Jul 2011 11:48:23 -0700 (PDT)
Received-SPF: pass (google.com: domain of john.firebaugh@gmail.com designates
 209.85.220.175 as permitted sender) client-ip=209.85.220.175;
Received: by mail-vx0-f175.google.com with SMTP id 2so3419363vxh.6 for
 <rack-devel@googlegroups.com>; Mon, 25 Jul 2011 11:48:23 -0700 (PDT)
Received: by 10.52.72.230 with SMTP id g6mr4599128vdv.163.1311619703136; Mon,
 25 Jul 2011 11:48:23 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.52.156.226 with HTTP; Mon, 25 Jul 2011 11:48:03 -0700 (PDT)
In-Reply-To:
 <871fffb6-96c1-425e-a99e-735ec3aaef80@p20g2000yqp.googlegroups.com>
References:
 <871fffb6-96c1-425e-a99e-735ec3aaef80@p20g2000yqp.googlegroups.com>
From: John Firebaugh <john.firebaugh@gmail.com>
Date: Mon, 25 Jul 2011 11:48:03 -0700
Message-ID:
 <CAL3hnJZrhPog6A6bOmNi6Jtx9i=h8Z_K660AKs8U5TmyxqsEtQ@mail.gmail.com>
Subject: Re: How to handle unicode escaped params?
To: rack-devel@googlegroups.com
X-Original-Sender: john.firebaugh@gmail.com
X-Original-Authentication-Results: gmr-mx.google.com; spf=pass (google.com:
 domain of john.firebaugh@gmail.com designates 209.85.220.175 as permitted
 sender) smtp.mail=john.firebaugh@gmail.com; dkim=pass (test mode)
 header.i=@gmail.com
Reply-To: rack-devel@googlegroups.com
Precedence: list
Mailing-list: list rack-devel@googlegroups.com; contact
 rack-devel+owners@googlegroups.com
List-ID: <rack-devel.googlegroups.com>
X-Google-Group-Id: 486215384060
List-Post: <http://groups.google.com/group/rack-devel/post?hl=en_US>,
 <mailto:rack-devel@googlegroups.com>
List-Help: <http://groups.google.com/support/?hl=en_US>,
 <mailto:rack-devel+help@googlegroups.com>
List-Archive: <http://groups.google.com/group/rack-devel?hl=en_US>
Sender: rack-devel@googlegroups.com
List-Subscribe:
 <http://groups.google.com/group/rack-devel/subscribe?hl=en_US>,
 <mailto:rack-devel+subscribe@googlegroups.com>
List-Unsubscribe:
 <http://groups.google.com/group/rack-devel/subscribe?hl=en_US>,
 <mailto:rack-devel+unsubscribe@googlegroups.com>
Content-Type: multipart/alternative; boundary=bcaec50160117c629604a8e94268

--bcaec50160117c629604a8e94268
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hi Tobi,

Both %c3%a9 and %e9 could be valid -- it depends on the content of the page
and what the server is prepared to accept. In short, a standards-conforming
browser will choose which encoding to use based (mainly) on the form's
accept-charset attribute. You can read the details here:

http://www.w3.org/TR/html5/association-of-controls-and-forms.html#url-encod=
ed-form-data

Also, note that while UTF-8 is indeed a character encoding, "Unicode" is a
standard, not a character encoding itself. As such, it doesn't make sense t=
o
talk about =E9 being "encoded in Unicode". It is true that =E9 is the Unico=
de
code point U+00E9, but code points are independent of encodings (they are
just abstract numeric identifiers for particular characters). You *can*
say that =E9 is encoded in ISO-8859-1 (and other common encodings) as hex
E9, however. (See here for more:
http://www.joelonsoftware.com/articles/Unicode.html)

So, with a form that specifies accept-charset=3D"utf-8", =E9 would be escap=
ed
as %c3%a9. With a form that specifies accept-charset=3D"iso-8859-1", it wou=
ld
be escaped as %e9. Leaving it unescaped is technically invalid.
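
To make the two escapings concrete, here is a minimal Ruby sketch (assuming
Ruby 1.9 or later, where strings carry an encoding). The method names are
made up for illustration; only URI.encode_www_form_component and
String#encode are real library calls:

```ruby
require "uri"

# U+00E9, written as an escape so this source stays pure ASCII.
def e_acute
  "\u00E9"
end

# Percent-escape the character's UTF-8 bytes (C3 A9).
def escaped_as_utf8
  URI.encode_www_form_component(e_acute)
end

# Transcode to ISO-8859-1 first, then percent-escape the single byte E9.
def escaped_as_latin1
  URI.encode_www_form_component(e_acute.encode(Encoding::ISO_8859_1))
end

puts escaped_as_utf8    # %C3%A9
puts escaped_as_latin1  # %E9
```

The escapes come out in upper case, which is equivalent to the lower-case
forms above.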

In order to process the data correctly, the server must know what encoding
the incoming URL parameters are encoded in, typically either by convention
or via HTTP headers. Rack has a deficiency here, in that
Rack::Utils.unescape does not allow you to specify the encoding -- only
UTF-8 is supported. The underlying API used by Rack,
URI.decode_www_form_component, does allow you to specify the encoding, so i=
f
that matters to you, you might have to use that directly. (Though as was
discovered recently, it has a regexp that is vulnerable to catastrophic
backtracking.)
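
For example, a rough sketch of calling it directly, again assuming Ruby 1.9
or later. unescape_as is a hypothetical helper of mine, not something Rack
provides:

```ruby
require "uri"

# Hypothetical helper: unescape one component, treating its bytes as being
# in the given encoding, then transcode to UTF-8 for the rest of the app.
def unescape_as(raw, encoding)
  URI.decode_www_form_component(raw, encoding).encode(Encoding::UTF_8)
end

puts unescape_as("caf%E9", Encoding::ISO_8859_1)
puts unescape_as("caf%c3%a9", Encoding::UTF_8)

# Decoding the ISO-8859-1 form with the default (UTF-8) does not raise
# here, but the resulting string is not valid UTF-8.
puts URI.decode_www_form_component("caf%E9").valid_encoding?  # false
```

Both unescape_as calls should yield the same UTF-8 string. The caller still
has to know which encoding the client used; this sketch does not guess.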

John

On Mon, Jul 25, 2011 at 7:22 AM, Tobias Bielohlawek <tobi@soundcloud.com>
wrote:

> I have a question on correct behavior of handling unicode encoded
> query params.
>
> Let's take character '=E9', in UTF-8 that's encoded '%c3%a9 ' in unicode
> '%e9'. As an real life example, we're seeing those URLs:
>
> 1. not escaped: -> http://soundcloud.com/search?q%5Bfulltext%5D=3Dcaf=E9<=
http://soundcloud.com/search?q%5Bfulltext%5D=3Dcaf%C3%A9>
> 2. UTF8 escaped -> http://soundcloud.com/search?q%5Bfulltext%5D=3Dcaf%c3%=
a9
> 3. or Unicode escaped ->
> http://soundcloud.com/search?q%5Bfulltext%5D=3Dcaf%E9
>
> Using rack 1.3.1, the first two cases are processed correct, but the
> latter fails with error 'incorrect UTF-8 byte sequence'. Which is
> correct behavior at first place as it's
> not UTF-8 but the unicode.
>
> I'm no wondering what's best solution here? Why not be smart and give
> it another run to unescape the URL assuming it's unicode encoded?
>
> I checked other players, e.g. Twitter has same error (but fails
> silently)
> http://twitter.com/search/=E9
> http://twitter.com/search/%c3%a9
> http://twitter.com/search/%E9
>
> but not Google, it does it this way:
> http://www.google.de/search?q=3D=E9
> http://www.google.de/search?q=3D%c3%a9
> http://www.google.de/search?q=3D%E9
>
>
> Any advice is very much appreciated. Patch is comming..
>
> Thx - Tobi
>
>
>

--bcaec50160117c629604a8e94268
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hi Tobi,<div><br></div><div>Both=A0%c3%a9 and=A0%e9 could be valid -- it de=
pends on the content of the page and what the server is prepared to accept.=
 In short, a standards-conforming browser will choose which encoding to use=
 based (mainly) on the form&#39;s accept-charset attribute. You can read th=
e details here:</div>

<meta charset=3D"utf-8"><meta charset=3D"utf-8"><div><br></div><div><a href=
=3D"http://www.w3.org/TR/html5/association-of-controls-and-forms.html#url-e=
ncoded-form-data">http://www.w3.org/TR/html5/association-of-controls-and-fo=
rms.html#url-encoded-form-data</a></div>

<div><br></div><div>Also, note that while UTF-8 is indeed a character encod=
ing, &quot;Unicode&quot; is a standard, not a character encoding itself. As=
 such, it doesn&#39;t make sense to talk about =E9 being &quot;encoded in U=
nicode&quot;. It is true that =E9 is the Unicode code point U+00E9, but cod=
e points are independent of encodings (they are just abstract numeric id=
entifiers for particular characters). You <i>can</i> say that =E9 is enco=
ded in=A0ISO-8859-1 (and other common encodings) as hex E9, however. (See h=
ere for more:=A0<a href=3D"http://www.joelonsoftware.com/articles/Unicode.h=
tml">http://www.joelonsoftware.com/articles/Unicode.html</a>)</div>

<meta charset=3D"utf-8"><div><br></div><div>So, with a form that specifies =
accept-charset=3D&quot;utf-8&quot;, =E9 would be escaped as=A0%c3%a9. With =
a form that specifies accept-charset=3D&quot;<span class=3D"Apple-style-spa=
n" style=3D"font-family: &#39;Lucida Grande&#39;, Verdana, Helvetica, sans-=
serif; font-size: 12px; ">iso-8859-1&quot;, it would be=A0<meta charset=3D"=
utf-8"><span class=3D"Apple-style-span" style=3D"font-family: arial; font-s=
ize: small; ">escaped</span>=A0as=A0</span>%e9. Leaving it un<meta charset=
=3D"utf-8">escaped=A0is technically invalid.</div>

<div><br></div><div>In order to process the data correctly, the server must=
 know what encoding the incoming URL parameters are encoded in, typically e=
ither by convention or via HTTP headers. Rack has a deficiency here, in tha=
t Rack::Utils.unescape does not allow you to specify the encoding -- only U=
TF-8 is supported. The underlying API used by Rack, URI.decode_www_form_com=
ponent, does allow you to specify the encoding, so if that matters to you, =
you might have to use that directly. (Though as was discovered recently, =
it has a regexp that is vulnerable to catastrophic backtracking.)</div>

<div><br></div><div>John</div><meta charset=3D"utf-8"><meta charset=3D"utf-=
8"><meta charset=3D"utf-8"><div><br><div class=3D"gmail_quote">On Mon, Jul =
25, 2011 at 7:22 AM, Tobias Bielohlawek <span dir=3D"ltr">&lt;<a href=3D"ma=
ilto:tobi@soundcloud.com">tobi@soundcloud.com</a>&gt;</span> wrote:<br>

<blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p=
x #ccc solid;padding-left:1ex;">I have a question on correct behavior of ha=
ndling unicode encoded<br>
query params.<br>
<br>
Let&#39;s take character &#39;=E9&#39;, in UTF-8 that&#39;s encoded &#39;%c=
3%a9 &#39; in unicode<br>
&#39;%e9&#39;. As an real life example, we&#39;re seeing those URLs:<br>
<br>
1. not escaped: -&gt; <a href=3D"http://soundcloud.com/search?q%5Bfulltext%=
5D=3Dcaf%C3%A9" target=3D"_blank">http://soundcloud.com/search?q%5Bfulltext=
%5D=3Dcaf=E9</a><br>
2. UTF8 escaped -&gt; <a href=3D"http://soundcloud.com/search?q%5Bfulltext%=
5D=3Dcaf%c3%a9" target=3D"_blank">http://soundcloud.com/search?q%5Bfulltext=
%5D=3Dcaf%c3%a9</a><br>
3. or Unicode escaped -&gt; <a href=3D"http://soundcloud.com/search?q%5Bful=
ltext%5D=3Dcaf%E9" target=3D"_blank">http://soundcloud.com/search?q%5Bfullt=
ext%5D=3Dcaf%E9</a><br>
<br>
Using rack 1.3.1, the first two cases are processed correct, but the<br>
latter fails with error &#39;incorrect UTF-8 byte sequence&#39;. Which is<b=
r>
correct behavior at first place as it&#39;s<br>
not UTF-8 but the unicode.<br>
<br>
I&#39;m no wondering what&#39;s best solution here? Why not be smart and gi=
ve<br>
it another run to unescape the URL assuming it&#39;s unicode encoded?<br>
<br>
I checked other players, e.g. Twitter has same error (but fails<br>
silently)<br>
<a href=3D"http://twitter.com/search/%C3%A9" target=3D"_blank">http://twitt=
er.com/search/=E9</a><br>
<a href=3D"http://twitter.com/search/%c3%a9" target=3D"_blank">http://twitt=
er.com/search/%c3%a9</a><br>
<a href=3D"http://twitter.com/search/%E9" target=3D"_blank">http://twitter.=
com/search/%E9</a><br>
<br>
but not Google, it does it this way:<br>
<a href=3D"http://www.google.de/search?q=3D%C3%A9" target=3D"_blank">http:/=
/www.google.de/search?q=3D=E9</a><br>
<a href=3D"http://www.google.de/search?q=3D%c3%a9" target=3D"_blank">http:/=
/www.google.de/search?q=3D%c3%a9</a><br>
<a href=3D"http://www.google.de/search?q=3D%E9" target=3D"_blank">http://ww=
w.google.de/search?q=3D%E9</a><br>
<br>
<br>
Any advice is very much appreciated. Patch is comming..<br>
<br>
Thx - Tobi<br>
<br>
<br>
</blockquote></div><br></div>

--bcaec50160117c629604a8e94268--