Patch: better handling of large bodies.

rack-devel archive mirror (unofficial) https://groups.google.com/group/rack-devel
 help / color / mirror / Atom feed

* Patch: better handling of large bodies.
@ 2010-12-03  6:34 nick
  2010-12-03  8:50 ` Graham
  0 siblings, 1 reply; 4+ messages in thread
From: nick @ 2010-12-03  6:34 UTC (permalink / raw)
  To: Rack Development

I was playing around with building a REST API on sinatra and found
some issues with large file transfers (e.g. 10-50MB).  There are a
couple of places where some regular expressions have scaling problems
with large inputs.  I hit these problems when processing binary files
(pcaps) from the body of post requests. The posts themselves (I used
curl --data-binary and curl --data-urlencode) appear well formed and
worked as expected at smaller sizes. I debugged this on ruby 1.8.6 but
checked ruby 1.9.2 and it is also affected.

1.) The regular expression in Utils.unescape throws "RegexpError :
Stack overflow in regexp matcher" when run against large url encoded
strings. This patch fixes that by bounding the regex to only work on
chunks of 1024 uri encoded characters at a time.

2.) Utils.escape has a problem with memory exhaustion, and the fix
here was the same (do things in chunks). The ruby interpreter's memory
usage baloons up to about 2G on my system when I run the unit test in
this patch (escaping a 50mb string of nul characters) w/out the fix.
At this point my system starts becoming unresponsive and I have to
kill the ruby process. I don't think this is likely to be a problem
with typical binary data because random characters that don't need
escaping should naturally break the work in smaller chunks. But I
think it's better to do this explicitly than rely on chance. I only
found this problem by code inspection (unlike the other two problems
which found me).

3.) The last problem I found is in Utils#normalize_params. It looks
like if you have a POST that rack assumes there may be some query
parameters and it tries to give you a nice hash to access all of the
parameters and even handles unnesting of complex parameter names like
foo[lkj][34]=bar. The problem is the regexes used in the unnesting
logic will result in a regexp stack overflow when you have a large
body that is not a query. The fix here is to just impose a size limit
beyond which unnesting of paramater names is not attempted. So
"foo[lkj][34]=bar will be unnested but "#{'a'*4096}[lkd][34]=bar" will
not be unnested.

Note: I'm not sure the tests in this patch are in the right place.
They're somewhat slow tests and should maybe be someplace where
they'll be run less often.

-Nick

diff --git a/lib/rack/utils.rb b/lib/rack/utils.rb
index bf04420..ba1f74f 100644
--- a/lib/rack/utils.rb
+++ b/lib/rack/utils.rb
@@ -13,16 +13,16 @@ module Rack
     # query strings faster.  Use this rather than the cgi.rb
     # version since it's faster.  (Stolen from Camping).
     def escape(s)
-      s.to_s.gsub(/([^ a-zA-Z0-9_.-]+)/u) {
-        '%'+$1.unpack('H2'*bytesize($1)).join('%').upcase
+      s.to_s.gsub(/[^ a-zA-Z0-9_.-]{1,1024}/u) {
+        '%'+$&.unpack('H2'*bytesize($&)).join('%').upcase
       }.tr(' ', '+')
     end
     module_function :escape

     # Unescapes a URI escaped string. (Stolen from Camping).
     def unescape(s)
-      s.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/n){
-        [$1.delete('%')].pack('H*')
+      s.tr('+', ' ').gsub(/(?:%[0-9a-fA-F]{2}){1,1024}/n){
+        [$&.delete('%')].pack('H*')
       }
     end
     module_function :unescape
@@ -66,7 +66,14 @@ module Rack
     end
     module_function :parse_nested_query

+    MAX_NESTED_NAME_SIZE = 4096
+
     def normalize_params(params, name, v = nil)
+      if name and name.size > MAX_NESTED_NAME_SIZE
+          params[name] = v
+          return params
+      end
+
       name =~ %r(\A[\[\]]*([^\[\]]+)\]*)
       k = $1 || ''
       after = $' || ''
diff --git a/test/spec_utils.rb b/test/spec_utils.rb
index 79d0f1c..ab2b089 100644
--- a/test/spec_utils.rb
+++ b/test/spec_utils.rb
@@ -18,6 +18,11 @@ describe Rack::Utils do
     Rack::Utils.escape(matz_name_sep).should.equal '%E3%81%BE
%E3%81%A4+%E3%82%82%E3%81%A8'
   end

+  should "escape big strings and actually return" do
+    big = 50*1024*1024
+    Rack::Utils.escape("\0"*big).should.equal "%00"*big
+  end
+
   should "unescape correctly" do
     Rack::Utils.unescape("fo%3Co%3Ebar").should.equal "fo<o>bar"
     Rack::Utils.unescape("a+space").should.equal "a space"
@@ -26,6 +31,11 @@ describe Rack::Utils do
       should.equal "q1!2\"'w$5&7/z8)?\\"
   end

+  should "unescape big strings without overflow" do
+    big = 10*1024*1024
+    Rack::Utils.unescape("%00"*big).should.equal "\0"*big
+  end
+
   should "parse query strings correctly" do
     Rack::Utils.parse_query("foo=bar").
       should.equal "foo" => "bar"
@@ -118,6 +128,12 @@ describe Rack::Utils do
       message.should.equal "expected Array (got String) for param
`y'"
   end

+  should "handle large message bodies" do
+    big = 50*1024*1024
+    Rack::Utils.parse_nested_query("x"*big).
+      should.equal "x"*big => nil
+  end
+
   should "build query strings correctly" do
     Rack::Utils.build_query("foo" => "bar").should.equal "foo=bar"
     Rack::Utils.build_query("foo" => ["bar", "quux"]).

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: Patch: better handling of large bodies.
  2010-12-03  6:34 Patch: better handling of large bodies nick
@ 2010-12-03  8:50 ` Graham
  2010-12-03 15:48   ` nick
  0 siblings, 1 reply; 4+ messages in thread
From: Graham @ 2010-12-03  8:50 UTC (permalink / raw)
  To: Rack Development

In the unescape case won't it give incorrect results if the % and
hexcode don't appear in the same chunk? Ie. s[1023] == '%' and s[1024]
== '0' and s[1025] == '0' it'll just pass through %00?

On Dec 2, 11:34 pm, nick <nbagg...@gmail.com> wrote:
> I was playing around with building a REST API on sinatra and found
> some issues with large file transfers (e.g. 10-50MB).  There are a
> couple of places where some regular expressions have scaling problems
> with large inputs.  I hit these problems when processing binary files
> (pcaps) from the body of post requests. The posts themselves (I used
> curl --data-binary and curl --data-urlencode) appear well formed and
> worked as expected at smaller sizes. I debugged this on ruby 1.8.6 but
> checked ruby 1.9.2 and it is also affected.
>
> 1.) The regular expression in Utils.unescape throws "RegexpError :
> Stack overflow in regexp matcher" when run against large url encoded
> strings. This patch fixes that by bounding the regex to only work on
> chunks of 1024 uri encoded characters at a time.
>
> 2.) Utils.escape has a problem with memory exhaustion, and the fix
> here was the same (do things in chunks). The ruby interpreter's memory
> usage baloons up to about 2G on my system when I run the unit test in
> this patch (escaping a 50mb string of nul characters) w/out the fix.
> At this point my system starts becoming unresponsive and I have to
> kill the ruby process. I don't think this is likely to be a problem
> with typical binary data because random characters that don't need
> escaping should naturally break the work in smaller chunks. But I
> think it's better to do this explicitly than rely on chance. I only
> found this problem by code inspection (unlike the other two problems
> which found me).
>
> 3.) The last problem I found is in Utils#normalize_params. It looks
> like if you have a POST that rack assumes there may be some query
> parameters and it tries to give you a nice hash to access all of the
> parameters and even handles unnesting of complex parameter names like
> foo[lkj][34]=bar. The problem is the regexes used in the unnesting
> logic will result in a regexp stack overflow when you have a large
> body that is not a query. The fix here is to just impose a size limit
> beyond which unnesting of paramater names is not attempted. So
> "foo[lkj][34]=bar will be unnested but "#{'a'*4096}[lkd][34]=bar" will
> not be unnested.
>
> Note: I'm not sure the tests in this patch are in the right place.
> They're somewhat slow tests and should maybe be someplace where
> they'll be run less often.
>
> -Nick
>
> diff --git a/lib/rack/utils.rb b/lib/rack/utils.rb
> index bf04420..ba1f74f 100644
> --- a/lib/rack/utils.rb
> +++ b/lib/rack/utils.rb
> @@ -13,16 +13,16 @@ module Rack
>      # query strings faster.  Use this rather than the cgi.rb
>      # version since it's faster.  (Stolen from Camping).
>      def escape(s)
> -      s.to_s.gsub(/([^ a-zA-Z0-9_.-]+)/u) {
> -        '%'+$1.unpack('H2'*bytesize($1)).join('%').upcase
> +      s.to_s.gsub(/[^ a-zA-Z0-9_.-]{1,1024}/u) {
> +        '%'+$&.unpack('H2'*bytesize($&)).join('%').upcase
>        }.tr(' ', '+')
>      end
>      module_function :escape
>
>      # Unescapes a URI escaped string. (Stolen from Camping).
>      def unescape(s)
> -      s.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/n){
> -        [$1.delete('%')].pack('H*')
> +      s.tr('+', ' ').gsub(/(?:%[0-9a-fA-F]{2}){1,1024}/n){
> +        [$&.delete('%')].pack('H*')
>        }
>      end
>      module_function :unescape
> @@ -66,7 +66,14 @@ module Rack
>      end
>      module_function :parse_nested_query
>
> +    MAX_NESTED_NAME_SIZE = 4096
> +
>      def normalize_params(params, name, v = nil)
> +      if name and name.size > MAX_NESTED_NAME_SIZE
> +          params[name] = v
> +          return params
> +      end
> +
>        name =~ %r(\A[\[\]]*([^\[\]]+)\]*)
>        k = $1 || ''
>        after = $' || ''
> diff --git a/test/spec_utils.rb b/test/spec_utils.rb
> index 79d0f1c..ab2b089 100644
> --- a/test/spec_utils.rb
> +++ b/test/spec_utils.rb
> @@ -18,6 +18,11 @@ describe Rack::Utils do
>      Rack::Utils.escape(matz_name_sep).should.equal '%E3%81%BE
> %E3%81%A4+%E3%82%82%E3%81%A8'
>    end
>
> +  should "escape big strings and actually return" do
> +    big = 50*1024*1024
> +    Rack::Utils.escape("\0"*big).should.equal "%00"*big
> +  end
> +
>    should "unescape correctly" do
>      Rack::Utils.unescape("fo%3Co%3Ebar").should.equal "fo<o>bar"
>      Rack::Utils.unescape("a+space").should.equal "a space"
> @@ -26,6 +31,11 @@ describe Rack::Utils do
>        should.equal "q1!2\"'w$5&7/z8)?\\"
>    end
>
> +  should "unescape big strings without overflow" do
> +    big = 10*1024*1024
> +    Rack::Utils.unescape("%00"*big).should.equal "\0"*big
> +  end
> +
>    should "parse query strings correctly" do
>      Rack::Utils.parse_query("foo=bar").
>        should.equal "foo" => "bar"
> @@ -118,6 +128,12 @@ describe Rack::Utils do
>        message.should.equal "expected Array (got String) for param
> `y'"
>    end
>
> +  should "handle large message bodies" do
> +    big = 50*1024*1024
> +    Rack::Utils.parse_nested_query("x"*big).
> +      should.equal "x"*big => nil
> +  end
> +
>    should "build query strings correctly" do
>      Rack::Utils.build_query("foo" => "bar").should.equal "foo=bar"
>      Rack::Utils.build_query("foo" => ["bar", "quux"]).

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Patch: better handling of large bodies.
  2010-12-03  8:50 ` Graham
@ 2010-12-03 15:48   ` nick
  2010-12-06  3:36     ` Graham
  0 siblings, 1 reply; 4+ messages in thread
From: nick @ 2010-12-03 15:48 UTC (permalink / raw)
  To: Rack Development

The non-capturing group in the regex ensures that characters will only
be consumed in groups of 3 (a "%" followed by two hex characters):
(?:%[0-9a-fA-F]{2})

The patch protects against overflow by changing the qualifier from the
unbounded "+" to "{1,1024}". This caps things at 1024 repeats (3072
characters). If there are more then 1024 encoded characters in a row,
gsub will process them as a subsequent match.

-Nick

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Patch: better handling of large bodies.
  2010-12-03 15:48   ` nick
@ 2010-12-06  3:36     ` Graham
  0 siblings, 0 replies; 4+ messages in thread
From: Graham @ 2010-12-06  3:36 UTC (permalink / raw)
  To: Rack Development

I stand corrected. And support anything that makes Rack work better
with large data sizes...

On Dec 3, 8:48 am, nick <nbagg...@gmail.com> wrote:
> The non-capturing group in the regex ensures that characters will only
> be consumed in groups of 3 (a "%" followed by two hex characters):
> (?:%[0-9a-fA-F]{2})
>
> The patch protects against overflow by changing the qualifier from the
> unbounded "+" to "{1,1024}". This caps things at 1024 repeats (3072
> characters). If there are more then 1024 encoded characters in a row,
> gsub will process them as a subsequent match.
>
> -Nick

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2010-12-06  3:36 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-03  6:34 Patch: better handling of large bodies nick
2010-12-03  8:50 ` Graham
2010-12-03 15:48   ` nick
2010-12-06  3:36     ` Graham

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).