rack-devel archive mirror (unofficial) https://groups.google.com/group/rack-devel
From: Yehuda Katz <wycats@gmail.com>
To: rack-devel <rack-devel@googlegroups.com>
Subject: Re: Rack environment encoding
Date: Sun, 12 Sep 2010 21:21:20 -0700
Message-ID: <AANLkTi=piuWq+Ldadz+bMSjM=fitqxxFN+99TgAVRwiM@mail.gmail.com>
In-Reply-To: <86810130-684d-413f-aa69-a56f170459e6@m1g2000vbh.googlegroups.com>

In general, when dealing with Strings from external sources (as Rack
does), you have two options:
1) The String comes with some out-of-band information about its
encoding (or you happen to know the encoding for sure), and you should
tag it with that encoding.
2) The String does not come with out-of-band information, and you
don't know the encoding, so you should leave it as BINARY.
In the case of (1), after marking the encoding with force_encoding,
you should call encode! (with no arguments). The encode! method works
conceptually like this:

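class String
  # Conceptual only: with no arguments, encode! transcodes to
  # Encoding.default_internal when it is set, and is otherwise a no-op.
  def encode!
    encode!(Encoding.default_internal) if Encoding.default_internal
  end
end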

Ruby itself never sets Encoding.default_internal. It is used by
end-users to specify that they would like libraries operating at the
boundary to convert known encodings to the one specified. Rails sets
this to the value of config.encoding, which defaults to UTF-8.

As a result, if a boundary library knows the encoding of a String (and
it honors default_internal, like most well-behaved libraries), Rails
will get it as UTF-8, but libraries don't need to hardcode UTF-8, and
they can allow people who need to work with encoding on a more
fine-grained level to do as they will.
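
A minimal sketch of the two options above. The helper name
tag_external_string is hypothetical (it is not part of Ruby or Rack);
only force_encoding and the zero-argument encode! are real String
methods.

```ruby
# Hypothetical boundary helper; the name is illustrative only.
# known_encoding is the out-of-band encoding, or nil when unknown.
def tag_external_string(raw, known_encoding = nil)
  str = raw.dup
  if known_encoding
    # Option 1: tag with the known encoding, then let
    # Encoding.default_internal (if set) drive any transcoding.
    str.force_encoding(known_encoding)
    str.encode!
  else
    # Option 2: nothing is known, so leave the bytes tagged as BINARY.
    str.force_encoding(Encoding::BINARY)
  end
  str
end
```

With Encoding.default_internal set to UTF-8, a Latin-1 byte string
passed with its known encoding comes back as UTF-8; with
default_internal unset, it comes back still tagged as Latin-1.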

I should point out that Encoding.default_external is an entirely
different setting: it tells Ruby which encoding to assume by default
for files read from the file system. It defaults to the encoding of
the operating system. In general, libraries that read files from disk
should either let the operating system's default encoding
($LC_CTYPE/$LANG) win, or they should read files in the "rb" mode,
which will tag the String as BINARY.
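
A small sketch of the two ways of reading a file described above. The
helper names are hypothetical; File.open with "rb" and File.read are
standard Ruby. Note that a text-mode read is also transcoded if
Encoding.default_internal happens to be set.

```ruby
require "tempfile"

# Hypothetical helpers; the names are illustrative only.

# Reads the file in "rb" mode, so the result is tagged BINARY.
def read_as_binary(path)
  File.open(path, "rb") { |f| f.read }
end

# Reads the file in text mode, letting Ruby's defaults pick the tag.
def read_with_default_encoding(path)
  File.read(path)
end

Tempfile.create("encoding-demo") do |file|
  file.binmode
  file.write("\xC3\xBC".b) # two raw bytes
  file.flush

  p read_as_binary(file.path).encoding
  p read_with_default_encoding(file.path).encoding
end
```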

Yehuda Katz
Architect | Engine Yard
(ph) 718.877.1325


On Sun, Sep 12, 2010 at 9:48 AM, Hongli Lai <hongli@phusion.nl> wrote:
>
> The current Rack specification doesn't say anything about the encoding
> of the value strings in the Rack environment. However from various bug
> reports it has become clear that Rails and possibly many other apps
> expect some value strings, such as REQUEST_URI, to be UTF-8. See #16
> at http://code.google.com/p/phusion-passenger/issues/detail?id=404.
>
> I believe the encoding should be standardized. Here are some ideas
> that might serve as a starting point for discussion.
>
> PATH_INFO and QUERY_STRING are usually extracted from REQUEST_URI,
> however REQUEST_URI is not standardized even though lots of people use
> it. Furthermore REQUEST_URI tends to be a URI. I therefore propose the
> following requirements:

I'm actually opposed to standardizing REQUEST_URI. It's always
possible to reconstruct REQUEST_URI from SCRIPT_NAME and PATH_INFO, and
endpoints that rely on REQUEST_URI cannot be mounted. This was a
serious problem for both Rails and Merb (before we both switched to
using PATH_INFO).
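
As an illustration, rebuilding the request path from the variables Rack
does define might look like the sketch below. The helper name is
hypothetical, appending QUERY_STRING is an addition of this sketch, and
the result is not guaranteed to be byte-identical to whatever a given
server puts in REQUEST_URI.

```ruby
# Hypothetical helper: rebuilds the request path from SCRIPT_NAME and
# PATH_INFO (plus QUERY_STRING, if any) instead of reading REQUEST_URI.
def request_path_from(env)
  path = env.fetch("SCRIPT_NAME", "") + env.fetch("PATH_INFO", "")
  query = env["QUERY_STRING"].to_s
  query.empty? ? path : "#{path}?#{query}"
end
```

Because SCRIPT_NAME carries the mount point, an endpoint written this
way keeps working when it is mounted somewhere other than the root.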

> - REQUEST_URI, if exists, MUST be a valid URI. This implies that
> REQUEST_URI must contain the unescaped form of the URI (e.g. "/clubs/
> %C3%BC", not "/clubs/ü").
> - All required Rack variables that are strings (PATH_INFO,
> REQUEST_URI, etc) except for HTTP_ variables MUST be encoded as UTF-8.
> That is, the #encoding method must return #<Encoding:UTF-8>.

The main Rack variables (REQUEST_METHOD, SCRIPT_NAME, PATH_INFO,
QUERY_STRING, SERVER_NAME, and SERVER_PORT) should always be ASCII.
The server should tag them as ASCII and then call encode!. This has
the effect of either giving the end application the encoding that it
expects (represented by Encoding.default_internal) or leaving the
String in ASCII, which is the correct encoding.
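
A sketch of what a server could do for these variables. The method name
is hypothetical, and the sketch assumes the values really are
ASCII-only (still percent-escaped); a value containing raw non-ASCII
bytes would make encode! raise once default_internal is set.

```ruby
# Hypothetical server-side step; not taken from any existing server.
def tag_ascii_rack_variables(env)
  keys = %w[
    REQUEST_METHOD SCRIPT_NAME PATH_INFO
    QUERY_STRING SERVER_NAME SERVER_PORT
  ]
  keys.each do |key|
    value = env[key]
    next unless value.is_a?(String)

    tagged = value.dup.force_encoding(Encoding::US_ASCII)
    tagged.encode! # no-op unless Encoding.default_internal is set
    env[key] = tagged
  end
  env
end
```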

> - HTTP_ variables MUST be encoded as binary.

This seems correct, because headers can be encoded as Latin-1. Since
we can't know for sure which encoding the client used, we should leave
it as BINARY and let the application (which might know better) decode
it.

> The valid URI requirement for REQUEST_URI guarantees that encoding it
> in UTF-8 is possible because URIs are valid ASCII.

Again, I don't think we should standardize REQUEST_URI.

> Because PATH_INFO and QUERY_STRING tend to be extracted from
> REQUEST_URI and are therefore substrings of an URI, it is also
> possible for them to be UTF-8.

Again, I think the best way to achieve this would be to mark these as
ASCII (which they actually are), and then let the application specify
what transcoding it wants using the standard Ruby mechanism.

> The binary requirement for HTTP_ is necessary because HTTP allows
> header values to contain characters that are not valid UTF-8 nor valid
> US-ASCII (see the HTTP grammar's TEXT rule).

Yep.

> Non-HTTP_ required Rack values must not be ASCII-encoded because Rails
> and many apps work primarily with UTF-8 strings. If the app does
> something like
>
>  some_utf8_string + env['PATH_INFO']
>
> then Ruby 1.9 will complain with an incompatible encoding error.

Actually, ASCII and UTF-8 should always concatenate with no error.
Maybe you're thinking about putting BINARY Strings through a Unicode
regular expression?
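
A quick way to check this. The helper below is hypothetical; it simply
attempts the concatenation and reports the resulting encoding, or nil
if Ruby refuses.

```ruby
# Hypothetical helper: returns the encoding of left + right, or nil
# when Ruby raises Encoding::CompatibilityError.
def concat_result_encoding(left, right)
  (left + right).encoding
rescue Encoding::CompatibilityError
  nil
end

utf8  = "caf\u00E9"                                   # non-ASCII UTF-8
ascii = "/clubs".dup.force_encoding(Encoding::US_ASCII)

p concat_result_encoding(utf8, ascii)
p concat_result_encoding(ascii, utf8)
```

Both orders succeed because the ASCII-tagged String contains only
7-bit characters, so the result simply takes the UTF-8 tag.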

> On the other hand, if the app does something like
>
>  some_utf8_string + env['HTTP_FOO_BAR']
>
> then things will still blow up so I'm not sure whether my requirement
> makes sense. Does Rails mandate an encoding for its request.env?

Concatenating UTF-8 and BINARY should blow up. As you pointed out, we
can't be sure that HTTP_FOO_BAR *is* UTF-8. For all we know, it's
Latin-1. In Ruby 1.9, if a BINARY String contains bytes that are not
ASCII, concatenating it with a non-ASCII UTF-8 String raises an
exception. This is correct. The only real solution is to somehow know
for sure what the encoding of the header is.

One solution could be a middleware that marks the Strings as
Encoding::ASCII if the String is #ascii_only?, and otherwise uses
rchardet to guess the encoding. Of course, it'd call encode!
afterward, which would mean that Rails apps would see the String as
UTF-8 no matter what.
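
A rough sketch of such a middleware. The class name is hypothetical,
and the guesser is an injected callable standing in for a detection
library such as rchardet (its actual API is not assumed here). A wrong
guess can still produce garbage or make encode! raise.

```ruby
# Hypothetical Rack middleware; not an existing library class.
class HeaderEncodingGuesser
  # guesser: callable taking a BINARY String and returning an Encoding
  # (or encoding name), or nil when it cannot tell.
  def initialize(app, guesser)
    @app = app
    @guesser = guesser
  end

  def call(env)
    env.each do |key, value|
      next unless key.is_a?(String) && key.start_with?("HTTP_")
      next unless value.is_a?(String)

      env[key] = retag(value)
    end
    @app.call(env)
  end

  private

  def retag(value)
    str = value.dup
    if str.ascii_only?
      str.force_encoding(Encoding::US_ASCII)
    else
      guessed = @guesser.call(str)
      return value unless guessed # still unknown: leave it as BINARY

      str.force_encoding(guessed)
    end
    str.encode! # honors Encoding.default_internal, if set
    str
  end
end
```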

> I'm unsure what to do with all other variables. Should there be
> requirements about their encodings?

As far as I can tell, when unescaped, SCRIPT_NAME and PATH_INFO will
always be UTF-8 in the wild (I've tried with quite a number of
browsers). The Utils method that unescapes percent-encoded Strings
should first mark the String as UTF-8, and then call encode! (which
should almost always be a no-op, unless someone made their
default_internal UTF-16 or Latin-1 for some odd reason).
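
A sketch of that unescape step. The helper is hypothetical and is not
the actual Rack::Utils code; it decodes %XX sequences on the raw bytes,
tags the result as UTF-8, and then calls encode!.

```ruby
# Hypothetical helper; not the real Rack::Utils implementation.
def unescape_path(escaped)
  # Work on a BINARY copy so decoded bytes can be spliced in freely.
  bytes = escaped.b.gsub(/%(\h\h)/) { [Regexp.last_match(1).hex].pack("C") }
  bytes.force_encoding(Encoding::UTF_8) # assume browsers sent UTF-8
  bytes.encode!                         # honors default_internal, if set
  bytes
end
```

This does not validate the result; a caller that cares can check
valid_encoding? on the returned String.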

Thread overview: 8+ messages
2010-09-12 16:48 Rack environment encoding Hongli Lai
2010-09-12 18:00 ` James Tucker
2010-09-12 18:23   ` Steve Klabnik
2010-09-13 13:56   ` Hongli Lai
2010-09-13  4:21 ` Yehuda Katz [this message]
2010-09-13  9:05   ` naruse
2010-09-13 14:08   ` Hongli Lai
2010-09-15  1:23     ` naruse
