From: Yehuda Katz
To: rack-devel
Reply-To: rack-devel@googlegroups.com
Subject: Re: Rack environment encoding
Date: Sun, 12 Sep 2010 21:21:20 -0700
In-Reply-To: <86810130-684d-413f-aa69-a56f170459e6@m1g2000vbh.googlegroups.com>

In general, when dealing with Strings from external sources (as Rack
is), you have two options:

1) The String comes with some out-of-band information about its
   encoding (or you happen to know the encoding for sure), and you
   should tag it with that encoding.
2) The String does not come with out-of-band information, and you
   don't know the encoding, and you should leave it as BINARY.

In the case of (1), after marking the encoding with force_encoding, you
should call encode! (with no arguments). The encode! method works
conceptually like this:

  class String
    # Conceptual only: the no-argument form transcodes to
    # Encoding.default_internal when one is set, and does nothing otherwise.
    def encode!
      encode!(Encoding.default_internal) if Encoding.default_internal
    end
  end

Ruby itself never sets Encoding.default_internal. It is used by
end-users to specify that they would like libraries operating at the
boundary to convert known encodings to the one specified. Rails sets
this to the value of config.encoding, which defaults to UTF-8.

As a result, if a boundary library knows the encoding of a String (and
it honors default_internal, like most well-behaved libraries), Rails
will get it as UTF-8. Libraries don't need to hardcode UTF-8, and they
can allow people who need to work with encodings on a more fine-grained
level to do as they will.

I should point out that Encoding.default_external is an entirely
different setting, which tells Ruby which encoding to assume for files
read from the file system. This defaults to the encoding of the
operating system. In general, libraries that read files from disk
should either let the operating system's default encoding
($LC_CTYPE/$LANG) win, or they should read files in the "rb" mode,
which will tag the String as BINARY.

Yehuda Katz
Architect | Engine Yard
(ph) 718.877.1325

On Sun, Sep 12, 2010 at 9:48 AM, Hongli Lai wrote:
>
> The current Rack specification doesn't say anything about the encoding
> of the value strings in the Rack environment.
> However from various bug
> reports it has become clear that Rails and possibly many other apps
> expect some value strings, such as REQUEST_URI, to be UTF-8. See #16
> at http://code.google.com/p/phusion-passenger/issues/detail?id=404.
>
> I believe the encoding should be standardized. Here are some ideas
> that might serve as a starting point for discussion.
>
> PATH_INFO and QUERY_STRING are usually extracted from REQUEST_URI,
> however REQUEST_URI is not standardized even though lots of people use
> it. Furthermore REQUEST_URI tends to be a URI. I therefore propose the
> following requirements:

I'm actually opposed to standardizing REQUEST_URI. It's always possible
to reconstruct REQUEST_URI from SCRIPT_NAME and PATH_INFO, and endpoints
that rely on REQUEST_URI cannot be mounted. This was a serious problem
for both Rails and Merb (before we both switched to using PATH_INFO).

> - REQUEST_URI, if exists, MUST be a valid URI. This implies that
> REQUEST_URI must contain the unescaped form of the URI
> (e.g. "/clubs/%C3%BC", not "/clubs/ü").
> - All required Rack variables that are strings (PATH_INFO,
> REQUEST_URI, etc) except for HTTP_ variables MUST be encoded as UTF-8.
> That is, the #encoding method must return #<Encoding:UTF-8>.

The main Rack variables (REQUEST_METHOD, SCRIPT_NAME, PATH_INFO,
QUERY_STRING, SERVER_NAME, and SERVER_PORT) should always be ASCII.
These should be tagged as ASCII, and then the server should call
encode!. This will have the effect of either giving the end application
the encoding that it expects (represented by Encoding.default_internal)
or leaving the String in ASCII, which is the correct encoding.

> - HTTP_ variables MUST be encoded as binary.

This seems correct, because headers can be encoded as Latin-1. Since we
can't know for sure which encoding the client used, we should leave
them as BINARY and let the application (which might know better) decode
them.
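
The tagging rules discussed so far can be sketched roughly as follows.
This is only an illustration under stated assumptions: tag_env_encodings
and ASCII_KEYS are hypothetical names, not part of Rack or of any
server, and the sketch assumes the server hands over the core variables
still percent-escaped (and therefore ASCII-only). Anything that turns
out not to be ASCII-only is left as BINARY rather than guessed at.

```ruby
# Hypothetical helper, not part of Rack: returns a copy of env whose
# String values are tagged along the lines suggested in this thread.
ASCII_KEYS = %w[
  REQUEST_METHOD SCRIPT_NAME PATH_INFO
  QUERY_STRING SERVER_NAME SERVER_PORT
].freeze

def tag_env_encodings(env)
  env.each_with_object({}) do |(key, value), tagged|
    # Non-String values (IO objects, nil, arrays) pass through untouched.
    unless value.is_a?(String)
      tagged[key] = value
      next
    end

    copy = value.dup
    if ASCII_KEYS.include?(key) && copy.ascii_only?
      # Known to be ASCII: tag it, then let the no-argument encode!
      # transcode to Encoding.default_internal if the application set one.
      copy.force_encoding(Encoding::US_ASCII)
      copy.encode!
    elsif ASCII_KEYS.include?(key) || key.start_with?("HTTP_")
      # Encoding unknown (headers, or unexpected raw bytes): stay BINARY.
      copy.force_encoding(Encoding::BINARY)
    end
    tagged[key] = copy
  end
end
```

With Encoding.default_internal set to UTF-8 (as Rails does), PATH_INFO
comes back tagged as UTF-8; with it unset, PATH_INFO stays US-ASCII. In
both cases the HTTP_ values remain BINARY.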

> The valid URI requirement for REQUEST_URI guarantees that encoding it
> in UTF-8 is possible because URIs are valid ASCII.

Again, I don't think we should standardize REQUEST_URI.

> Because PATH_INFO and QUERY_STRING tend to be extracted from
> REQUEST_URI and are therefore substrings of an URI, it is also
> possible for them to be UTF-8.

Again, I think the best way to achieve this would be to mark these as
ASCII (which they actually are), and then let the application specify
what transcoding it wants using the standard Ruby mechanism
(Encoding.default_internal).

> The binary requirement for HTTP_ is necessary because HTTP allows
> header values to contain characters that are not valid UTF-8 nor valid
> US-ASCII (see the HTTP grammar's TEXT rule).

Yep.

> Non-HTTP_ required Rack values must not be ASCII-encoded because Rails
> and many apps work primarily with UTF-8 strings. If the app does
> something like
>
>   some_utf8_string + env['PATH_INFO']
>
> then Ruby 1.9 will complain with an incompatible encoding error.

Actually, ASCII and UTF-8 should always concatenate with no error.
Maybe you're thinking about putting BINARY Strings through a Unicode
regular expression?

> On the other hand, if the app does something like
>
>   some_utf8_string + env['HTTP_FOO_BAR']
>
> then things will still blow up so I'm not sure whether my requirement
> makes sense. Does Rails mandate an encoding for its request.env?

Concatenating UTF-8 and BINARY should blow up. As you pointed out, we
can't be sure that HTTP_FOO_BAR *is* UTF-8. For all we know, it's
Latin-1. In Ruby 1.9, if the BINARY String contains bytes that are not
ASCII and the UTF-8 String contains non-ASCII characters too, the
concatenation raises an exception. This is correct. The only real
solution is to somehow know for sure what the encoding of the header
is.

One solution could be a middleware that marks a String as
Encoding::ASCII if the String is #ascii_only?, and otherwise uses
rchardet to guess the encoding. Of course, it'd call encode!
afterward, which would mean that Rails apps would see the String as
UTF-8 no matter what.

> I'm unsure what to do with all other variables. Should there be
> requirements about their encodings?

As far as I can tell, when unescaped, SCRIPT_NAME and PATH_INFO will
always be UTF-8 in the wild (I've tried with quite a number of
browsers). The Utils method that unescapes percent-encoded Strings
should first mark the String as UTF-8, and then call encode! (which
should almost always be a no-op, unless someone set their
default_internal to UTF-16 or Latin-1 for some odd reason).
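
To make the last two ideas concrete, here is a rough sketch of both the
header-tagging middleware and the unescape step. The names
HeaderEncodingTagger and unescape_utf8 are made up for illustration and
are not Rack APIs. Instead of calling rchardet directly, the middleware
takes a hypothetical guesser callable that receives the raw bytes and
returns an Encoding or nil; by default it guesses nothing, so non-ASCII
headers simply stay BINARY.

```ruby
# Hypothetical middleware, not part of Rack: tags HTTP_ header values.
class HeaderEncodingTagger
  # guesser is an assumed hook: bytes -> Encoding or nil. A charset
  # detection library could be plugged in here.
  def initialize(app, guesser = ->(_bytes) { nil })
    @app = app
    @guesser = guesser
  end

  def call(env)
    env.each do |key, value|
      next unless key.start_with?("HTTP_") && value.is_a?(String)
      env[key] = tag(value)
    end
    @app.call(env)
  end

  private

  def tag(value)
    copy = value.dup
    if copy.ascii_only?
      copy.force_encoding(Encoding::US_ASCII)
    else
      guess = @guesser.call(copy)
      # No guess, or a guess the bytes don't fit: leave the value BINARY.
      return copy.force_encoding(Encoding::BINARY) unless guess
      copy.force_encoding(guess)
      return copy.force_encoding(Encoding::BINARY) unless copy.valid_encoding?
    end
    # Transcode to Encoding.default_internal if one is set.
    copy.encode!
    copy
  end
end

# Hypothetical unescape helper: decode %XX on raw bytes, mark the result
# as UTF-8, then honor default_internal when the bytes are valid UTF-8.
def unescape_utf8(escaped)
  bytes = escaped.b.gsub(/%([0-9a-fA-F]{2})/) { [$1].pack("H2") }
  bytes.force_encoding(Encoding::UTF_8)
  bytes.encode! if bytes.valid_encoding?
  bytes
end
```

Taking the guesser as a parameter keeps the detection strategy out of
the middleware itself, so an application that knows its clients send
Latin-1 can pass a callable that always returns Encoding::ISO_8859_1.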