Subject: Re: Rack environment encoding
From: James Tucker
In-Reply-To: <86810130-684d-413f-aa69-a56f170459e6@m1g2000vbh.googlegroups.com>
Date: Sun, 12 Sep 2010 15:00:19 -0300
Message-Id: <729D0748-DCFF-4643-AB90-3250D40D2370@gmail.com>
References: <86810130-684d-413f-aa69-a56f170459e6@m1g2000vbh.googlegroups.com>
To: rack-devel@googlegroups.com

Eeek, encodings.. here we go!

On 12 Sep 2010, at 13:48, Hongli Lai wrote:

> The current Rack specification doesn't say anything about the encoding
> of the value strings in the Rack environment. However from various bug
> reports it has become clear that Rails and possibly many other apps
> expect some value strings, such as REQUEST_URI, to be UTF-8. See #16
> at http://code.google.com/p/phusion-passenger/issues/detail?id=404.

I'm not sure, but I think they don't expect them to be UTF-8; they
actually expect them to be compatible with literals.

> I believe the encoding should be standardized. Here are some ideas
> that might serve as a starting point for discussion.

Maybe. From what I am told, we will need users of CP932 and other really
annoying stuff to come forward and test, such that we don't either break
their stuff or have to revert these specs in future. Binary is the only
lossless manner we have right now. I personally can't (fully) audit this
yet, although I'm trying to learn.

> PATH_INFO and QUERY_STRING are usually extracted from REQUEST_URI,
> however REQUEST_URI is not standardized even though lots of people use
> it. Furthermore REQUEST_URI tends to be a URI. I therefore propose the
> following requirements:
>
> - REQUEST_URI, if exists, MUST be a valid URI. This implies that
> REQUEST_URI must contain the unescaped form of the URI (e.g. "/clubs/
> %C3%BC", not "/clubs/ü").

Percent-encoded is definitely the way to go. Percent-encoded data must
fit within the ASCII character set (no multibyte).
It is commonly accepted that most percent-encoded URIs actually expand
out to UTF-8; however, as far as I know this is not exclusively the
case. IIRC, there is some mention of this in the newer IRI spec, which,
tbh, is quite horrible.

> - All required Rack variables that are strings (PATH_INFO,
> REQUEST_URI, etc) except for HTTP_ variables MUST be encoded as UTF-8.
> That is, the #encoding method must return #<Encoding:UTF-8>.

Although this might be literal-compatible, I'm not sure it's strictly
correct. That may not be a technical problem, because it's compatible
with the most likely actual encoding of the /data itself/ (n.b. not the
non-percent-encoded data as noted above). However, it may lead to a
social problem, whereby it suggests that the data really is UTF-8. Maybe
I'm being pedantic here, I'm not sure, but this stuff is hard enough
without misleading defaults.

> - HTTP_ variables MUST be encoded as binary.

Seems sensible.

> The valid URI requirement for REQUEST_URI guarantees that encoding it
> in UTF-8 is possible because URIs are valid ASCII.
> Because PATH_INFO and QUERY_STRING tend to be extracted from
> REQUEST_URI and are therefore substrings of an URI, it is also
> possible for them to be UTF-8.

Yes, they are compatible, that much I agree with, but with the caveats
above.

> The binary requirement for HTTP_ is necessary because HTTP allows
> header values to contain characters that are not valid UTF-8 nor valid
> US-ASCII (see the HTTP grammar's TEXT rule).

It's hard to determine what they should be. Indeed the snowmen help, but
I think there's yet more research needed to figure out what the real
world use cases are when browsers are set to non-UTF-8 encodings and
setting headers from JS, etc, etc. I would really appreciate someone or
some company in the community sponsoring real research and documentation
in this area; that is, extensively.
(Extensively includes headers, form data, multipart data, etc etc across
all major browsers in all major encoding settings and with all major
encodings as inputs (files, pasted data, etc).) A common and major
example where noticeable issues occur is pasting rich text from programs
like Word into forms in Windows browsers set to non-automatic encoding
modes. This is of course the very essence of the snowman hack from
Rails 3, but to apply any of this kind of stuff to Rack requires more
research and at least some helpers and documentation. (The latter I have
started whilst trying to get as much of this out of Yehuda's brain as
possible, but have not had time to finish yet.)

> Non-HTTP_ required Rack values must not be ASCII-encoded because Rails
> and many apps work primarily with UTF-8 strings.

Literal strings. I should also note at this point that -e, irb, and
TextMate create lies during testing; please don't rely on "rules of the
road" you derive via these tools, which enforce UTF-8 literals by
default, whereas normal source files start ASCII-encoded. This also
depends on arguments to ruby, and $LC_CTYPE.

> If the app does
> something like
>
>     some_utf8_string + env['PATH_INFO']
>
> then Ruby 1.9 will complain with an incompatible encoding error.

On your system.

> On the other hand, if the app does something like
>
>     some_utf8_string + env['HTTP_FOO_BAR']
>
> then things will still blow up so I'm not sure whether my requirement
> makes sense. Does Rails mandate an encoding for its request.env?

Rails does a lot of work on the /client side/ to try and ensure it
receives UTF-8, and tries to enforce UTF-8 elsewhere. Rack can't enforce
this as it doesn't operate client side (build forms). It's also worth
noting that Rails accepts losing a percentage of use cases here: it
makes no attempt at full support for encodings that can't round-trip
through Unicode.
For them this is sensible, and maybe it might be for us, but this is why
I particularly need CP932 users to actually pay attention here. Until I
hear from someone who deals with these issues in the real world, I
cannot defer to the advice "just use Unicode". Alas, one of the larger
issues here is that I don't speak the languages required to actually
track down most of these users, so I need help from people who do. I
hope there's someone on this list who is proactive enough to do this, or
knows someone to call on.

> I'm unsure what to do with all other variables. Should there be
> requirements about their encodings?

I think we do need to either:

1. Set some specific requirements based on complete research (and
   document potential loss/error cases)
2. Not set requirements, based on complete research (and document as
   such)

It should be noted that 1 may result in the software being simply
incompatible with certain requirements, whereas 2 may require the common
user to do more work themselves. At present there may be no workaround
for when 1 is a problem, due to the pervasiveness of Rack in Ruby's
modern libraries and frameworks.

In any case, the minimum output of these discussions should be
documentation on the topic, so that once we're done, we can stop having
them with everyone who hits another issue. As 1.9 becomes more common,
this is going to come up more and more.
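
For anyone who wants to poke at the compatibility rule discussed above,
here is a minimal sketch. It is only an illustration: the helper names
(as_binary, concat_encoding) and the sample strings are made up for this
mail, and nothing here comes from Rack itself. It assumes Ruby 1.9 or
later and uses \u and \x escapes so the result should not depend on the
source file's encoding.

```ruby
# Hypothetical helper: returns a copy of +str+ tagged as binary (ASCII-8BIT).
def as_binary(str)
  str.dup.force_encoding(Encoding::BINARY)
end

# Hypothetical helper: returns the encoding of a + b, or nil when Ruby
# refuses to concatenate them (Encoding::CompatibilityError).
def concat_encoding(a, b)
  (a + b).encoding
rescue Encoding::CompatibilityError
  nil
end

# A literal containing a \u escape is a UTF-8 string.
some_utf8_string = "caf\u00E9 "

escaped = as_binary("/clubs/%C3%BC")    # percent-encoded: ASCII bytes only
raw     = as_binary("/clubs/\xC3\xBC")  # raw UTF-8 bytes, tagged as binary

# ASCII-only data is compatible with any ASCII-compatible encoding.
p concat_encoding(some_utf8_string, escaped)  # => #<Encoding:UTF-8>

# Non-ASCII bytes on both sides with different encodings are not.
p concat_encoding(some_utf8_string, raw)      # => nil

# Retagging validates nothing: Latin-1 bytes tagged as UTF-8 are broken.
latin1 = "/clubs/\xFC".dup.force_encoding(Encoding::UTF_8)
p latin1.valid_encoding?                      # => false
```

The first two results are why percent-encoded values are compatible
whatever tag they carry, and the last one is the "social problem": a
UTF-8 tag by itself says nothing about what the bytes really are.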