rack-devel archive mirror (unofficial) https://groups.google.com/group/rack-devel
From: James Tucker <jftucker@gmail.com>
To: rack-devel@googlegroups.com
Subject: Re: Rack environment encoding
Date: Sun, 12 Sep 2010 15:00:19 -0300
Message-ID: <729D0748-DCFF-4643-AB90-3250D40D2370@gmail.com>
In-Reply-To: <86810130-684d-413f-aa69-a56f170459e6@m1g2000vbh.googlegroups.com>

Eeek, encodings.. here we go!

On 12 Sep 2010, at 13:48, Hongli Lai wrote:

> The current Rack specification doesn't say anything about the encoding
> of the value strings in the Rack environment. However from various bug
> reports it has become clear that Rails and possibly many other apps
> expect some value strings, such as REQUEST_URI, to be UTF-8. See #16
> at http://code.google.com/p/phusion-passenger/issues/detail?id=404.

I'm not sure, but I don't think they expect them to be UTF-8 as such; they actually expect them to be compatible with string literals.

> I believe the encoding should be standardized. Here are some ideas
> that might serve as a starting point for discussion.

Maybe. From what I am told, we will need users of CP932 and other really annoying encodings to come forward and test, so that we neither break their stuff nor have to revert these specs in future. Binary is the only lossless representation we have right now. I personally can't (fully) audit this yet, although I'm trying to learn.

> PATH_INFO and QUERY_STRING are usually extracted from REQUEST_URI,
> however REQUEST_URI is not standardized even though lots of people use
> it. Furthermore REQUEST_URI tends to be a URI. I therefore propose the
> following requirements:
> 
> - REQUEST_URI, if it exists, MUST be a valid URI. This implies that
> REQUEST_URI must contain the escaped form of the URI (e.g.
> "/clubs/%C3%BC", not "/clubs/ü").

Percent-encoded is definitely the way to go. Percent-encoded data must fit within the ASCII character set (no multibyte). It is commonly accepted that most percent-encoded URIs actually expand out to UTF-8; however, as far as I know, this is not exclusively the case. IIRC there is some mention of this in the newer IRI spec, which, tbh, is quite horrible.
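A minimal sketch of that distinction, assuming Ruby 2.0 or later for `String#b`. The helper name `percent_decode_bytes` is made up for illustration and is not part of Rack. Both escaped strings are plain ASCII, but only one of them expands to bytes that are valid UTF-8:

```ruby
# Hypothetical helper: decode %XX escapes by hand, yielding raw bytes
# (binary-tagged), with no guess about what encoding those bytes are in.
def percent_decode_bytes(str)
  str.b.gsub(/%(\h\h)/) { [$1].pack('H2') }
end

escaped_utf8   = "/clubs/%C3%BC"  # "ü" as UTF-8 bytes, then percent-encoded
escaped_latin1 = "/clubs/%FC"     # "ü" as ISO-8859-1, then percent-encoded

# Both escaped forms fit in ASCII.
puts escaped_utf8.ascii_only?    # true
puts escaped_latin1.ascii_only?  # true

# Only one of them expands to bytes that form valid UTF-8.
utf8_bytes   = percent_decode_bytes(escaped_utf8)
latin1_bytes = percent_decode_bytes(escaped_latin1)
puts utf8_bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?    # true
puts latin1_bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?  # false
```

So the escaped form can be handled safely as ASCII, while the decoded form needs knowledge the URI itself does not carry.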

> - All required Rack variables that are strings (PATH_INFO,
> REQUEST_URI, etc) except for HTTP_ variables MUST be encoded as UTF-8.
> That is, the #encoding method must return #<Encoding:UTF-8>.

Although this might be literal-compatible, I'm not sure it's strictly correct. That may not be a technical problem, because UTF-8 is compatible with the most likely actual encoding of the /data itself/ (n.b. the percent-encoded data, not the decoded data, as noted above). However, it may lead to a social problem, in that it suggests that the data really is UTF-8. Maybe I'm being pedantic here, I'm not sure, but this stuff is hard enough without misleading defaults.

> - HTTP_ variables MUST be encoded as binary.

Seems sensible.
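For the sake of discussion, a rough sketch of what the proposed labelling would mean in practice. Everything here is an assumption for illustration: `tag_env_encodings` and `UTF8_KEYS` are hypothetical names, the key list is just the spec's required CGI-style variables plus REQUEST_URI, and `String#b` needs Ruby 2.0 or later. Note that `force_encoding` only changes the label, never the bytes, which is exactly why the label can mislead:

```ruby
# Hypothetical key list: the required CGI-style variables plus REQUEST_URI.
UTF8_KEYS = %w[REQUEST_METHOD SCRIPT_NAME PATH_INFO QUERY_STRING
               SERVER_NAME SERVER_PORT REQUEST_URI].freeze

# Hypothetical helper applying the labels proposed in this thread.
# force_encoding relabels the string in place; the bytes are untouched.
def tag_env_encodings(env)
  env.each do |key, value|
    next unless value.is_a?(String)
    if key.start_with?("HTTP_")
      value.force_encoding(Encoding::BINARY)
    elsif UTF8_KEYS.include?(key)
      value.force_encoding(Encoding::UTF_8)
    end
    # All other keys are left alone; the thread has not settled on them.
  end
  env
end

env = {
  "PATH_INFO"    => "/clubs/%C3%BC".b,
  "HTTP_X_THING" => "caf\xE9".b,  # a Latin-1 byte in a header value
  "rack.input"   => nil           # non-string values are skipped
}
tag_env_encodings(env)

puts env["PATH_INFO"].encoding     # UTF-8
puts env["HTTP_X_THING"].encoding  # ASCII-8BIT (binary)
```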

> The valid URI requirement for REQUEST_URI guarantees that encoding it
> in UTF-8 is possible because URIs are valid ASCII.
> Because PATH_INFO and QUERY_STRING tend to be extracted from
> REQUEST_URI and are therefore substrings of an URI, it is also
> possible for them to be UTF-8.

Yes, they are compatible, that much I agree with, but with the caveats above.

> The binary requirement for HTTP_ is necessary because HTTP allows
> header values to contain characters that are not valid UTF-8 nor valid
> US-ASCII (see the HTTP grammar's TEXT rule).

It's hard to determine what they should be. The snowmen do help, but I think more research is needed to figure out what the real-world use cases are when browsers are set to non-UTF-8 encodings, when headers are set from JS, and so on. I would really appreciate someone or some company in the community sponsoring real, extensive research and documentation in this area. Extensive here means headers, form data, multipart data, etc., across all major browsers, in all major encoding settings, and with all major encodings as inputs (files, pasted data, etc.).

A common and major example of where noticeable issues occur is pasting rich text from programs like Word into forms in Windows browsers that are in non-automatic encoding modes. This is of course the very essence of the snowman hack from Rails 3, but applying any of this kind of stuff to Rack requires more research and at least some helpers and documentation. (I have started on the latter whilst trying to get as much of this out of Yehuda's brain as possible, but have not had time to finish yet.)

> Non-HTTP_ required Rack values must not be ASCII-encoded because Rails
> and many apps work primarily with UTF-8 strings.

Literal strings. I should also note at this point that -e, irb, and TextMate lie to you during testing. Please don't rely on "rules of the road" you derive via these tools, which give you UTF-8 literals by default, whereas literals in normal source files start out US-ASCII encoded. This also depends on the arguments passed to ruby, and on $LC_CTYPE.
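A small sketch for checking this on a given setup; `literal_encoding_report` is a made-up name. Save it to a file with no magic comment and compare the output with what the same code gives under `ruby -e` or irb. One caveat: from Ruby 2.0 onwards the default source encoding is UTF-8, so the US-ASCII default described above applies to 1.9.

```ruby
# Hypothetical helper: report the encodings that influence how a bare
# string literal ends up tagged in the current context.
def literal_encoding_report
  {
    literal:          "abc".encoding,            # follows the source encoding
    source:           __ENCODING__,              # encoding of this file, -e, or irb input
    default_external: Encoding.default_external  # from the locale (e.g. $LC_CTYPE) or -E
  }
end

literal_encoding_report.each { |name, enc| puts "#{name}: #{enc}" }
```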

> If the app does
> something like
> 
>  some_utf8_string + env['PATH_INFO']
> 
> then Ruby 1.9 will complain with an incompatible encoding error.

On your system.
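Whether that concatenation raises depends on the bytes involved as well as the labels. A minimal sketch, assuming Ruby 2.0 or later for `String#b`; `try_concat` is a made-up helper. Ruby only refuses when both operands contain non-ASCII bytes and their encodings differ:

```ruby
# Hypothetical helper: return the encoding of a + b, or the error class
# if Ruby refuses to combine the two strings.
def try_concat(a, b)
  (a + b).encoding
rescue Encoding::CompatibilityError => e
  e.class
end

utf8_ascii_only = "clubs: ".encode(Encoding::UTF_8)
utf8_non_ascii  = "club \u00FC: "       # UTF-8, contains "ü"
env_ascii       = "/clubs/%C3%BC".b     # binary-tagged, ASCII bytes only
env_high_bytes  = "/clubs/\xC3\xBC".b   # binary-tagged, raw high bytes

p try_concat(utf8_non_ascii, env_ascii)        # Encoding::UTF_8
p try_concat(utf8_ascii_only, env_high_bytes)  # Encoding::BINARY (ASCII-8BIT)
p try_concat(utf8_non_ascii, env_high_bytes)   # Encoding::CompatibilityError
```

So a percent-encoded PATH_INFO concatenates fine with a UTF-8 string whatever its label, and the error only shows up once raw high bytes meet a non-ASCII literal.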

> On the other hand, if the app does something like
> 
>  some_utf8_string + env['HTTP_FOO_BAR']
> 
> then things will still blow up so I'm not sure whether my requirement
> makes sense. Does Rails mandate an encoding for its request.env?

Rails does a lot of work on the /client side/ to try to ensure it receives UTF-8, and tries to enforce UTF-8 elsewhere. Rack can't enforce this, as it doesn't operate client side (it doesn't build forms). It's also worth noting that Rails accepts losing a small percentage of use cases here: it makes no attempt at full support for encodings that can't round-trip through Unicode. For them this is sensible, and it might be for us too, but this is why I particularly need CP932 users to pay attention here. Until I hear from someone who deals with these issues in the real world, I cannot defer to the advice "just use Unicode". Alas, one of the larger issues here is that I don't speak the languages required to track down most of these users, so I need help from people who do. I hope there's someone on this list proactive enough to do this, or who knows someone to call on.

> I'm unsure what to do with all other variables. Should there be
> requirements about their encodings?

I think we do need to either:

1. Set some specific requirements based on complete research (and document potential loss/error cases)
2. Set no requirements, again based on complete research (and document that decision)

It should be noted that 1 may result in the software being simply incompatible with certain requirements, whereas 2 may require the common user to do more work themselves. At present there may be no workaround for when 1 is a problem, due to the pervasiveness of Rack in Ruby's modern libraries and frameworks.

In any case, the minimum output of these discussions should be documentation on the topic, so that once we're done, we can stop having them with everyone who hits another issue. As 1.9 becomes more common, this is going to come up more and more.

Thread overview: 8+ messages
2010-09-12 16:48 Rack environment encoding Hongli Lai
2010-09-12 18:00 ` James Tucker [this message]
2010-09-12 18:23   ` Steve Klabnik
2010-09-13 13:56   ` Hongli Lai
2010-09-13  4:21 ` Yehuda Katz
2010-09-13  9:05   ` naruse
2010-09-13 14:08   ` Hongli Lai
2010-09-15  1:23     ` naruse
