From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: X-Spam-Status: No, score=-4.2 required=3.0 tests=ALL_TRUSTED,AWL,BAYES_00, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, T_SCC_BODY_TEXT_LINE shortcircuit=no autolearn=ham autolearn_force=no version=3.4.6 Received: from localhost (dcvr.yhbt.net [127.0.0.1]) by dcvr.yhbt.net (Postfix) with ESMTP id 237DF1F406; Mon, 27 Nov 2023 10:23:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=80x24.org; s=selector1; t=1701080629; bh=SWVdYS0PuKQBE2BSv0GsaIrTOdAsw/Mr6ZrDbgD4/B8=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=UMTaRIuBTRZNM7qniMr/4emZlMCbvkNIJJEz6faL9ZsShRByfzQ4iEAa+fmK+OsqH w3c1/cX+o8hodK36KN08jw4gU8MPBKQD06qpm0x+0hpgO+c3IMcDeMIQZDiOnaNEPW kAQMDhvdfDDt8cNWr8gCMVcDcD7gdGKqjezta5O8= Date: Mon, 27 Nov 2023 10:23:48 +0000 From: Eric Wong To: Ricardo =?utf-8?Q?Ca=C3=B1uelo?= Cc: meta@public-inbox.org Subject: Re: [BUG] Unescaped '&' ampersands in atom header links Message-ID: <20231127102349.M32456@dcvr> References: <87o7ff4nlk.fsf@collabora.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <87o7ff4nlk.fsf@collabora.com> List-Id: Ricardo Cañuelo wrote: > where the '&' character is escaped in the text of the tag but > not in the href attributes. Shouldn't these be escaped as well? If so, > the fix should be most likely located in WwwAtomStream.pm:atom_header(). Thanks for the bug report. Yes, '&' should be escaped, that's the job of the ->qs_html call from atom_header(). The below fixes it and also brings a minor improvement in allowing '/' to be unescaped for dfn: queries. More could be allowed, actually, but '/' is actually common. But I'm running on fumes right now, so an extra set of eyes would be helpful. deployed on https://yhbt.net/lore/ -----8<----- Subject: [PATCH] www: qs_html: fix escaping of `q' param Our use of MID_ESC characters was only intended for the pathname component of URIs and not appropriate for the query string component. So use a different $unsafe parameter list for uri_escape to make the result appropriate for query strings by disallowing [\&\'\+=] characters. Most notably, this change also allows us to accept `/' (slash) unescaped to make dfn: queries nicer to look at. Finally, we'll also add a ascii_html call on the URI-escaped result as an extra safety measure even though it's not really needed. As far as I can tell, the code without this fix didn't result in in an HTML injection since all our uses of uri_escape did escape angle brackets. Reported-by: Ricardo Cañuelo <ricardo.canuelo@collabora.com> Link: https://public-inbox.org/meta/87o7ff4nlk.fsf@collabora.com/ --- lib/PublicInbox/MID.pm | 2 +- lib/PublicInbox/SearchQuery.pm | 10 +++++++--- t/psgi_search.t | 27 ++++++++++++++++++++++++++- 3 files changed, 34 insertions(+), 5 deletions(-) diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm index b1ae9939..97cf3a54 100644 --- a/lib/PublicInbox/MID.pm +++ b/lib/PublicInbox/MID.pm @@ -125,7 +125,7 @@ sub uniq_mids ($;$) { \@ret; } -# RFC3986, section 3.3: +# RFC3986, section 3.3 (pathnames only): sub MID_ESC () { '^A-Za-z0-9\-\._~!\$\&\'\(\)\*\+,;=:@' } sub mid_escape ($) { uri_escape_utf8($_[0], MID_ESC) } diff --git a/lib/PublicInbox/SearchQuery.pm b/lib/PublicInbox/SearchQuery.pm index 96246c53..747e3249 100644 --- a/lib/PublicInbox/SearchQuery.pm +++ b/lib/PublicInbox/SearchQuery.pm @@ -6,7 +6,7 @@ package PublicInbox::SearchQuery; use strict; use v5.10.1; use URI::Escape qw(uri_escape); -use PublicInbox::MID qw(MID_ESC); +use PublicInbox::Hval qw(ascii_html); our $LIM = 200; sub new { @@ -35,9 +35,13 @@ sub qs_html { } my $qs = ''; if (defined(my $q = $self->{'q'})) { - $q = uri_escape($q, MID_ESC); + # not using MID_ESC since that's for the path component and + # this is for the query component. Unlike MID_ESC, + # this disallows [\&\'\+=] and allows slash [/] for + # nicer looking dfn: queries + $q = uri_escape($q, '^A-Za-z0-9\-\._~!\$\(\)\*,;:@/'); $q =~ s/%20/+/g; # improve URL readability - $qs .= "q=$q"; + $qs .= 'q='.ascii_html($q); } if (my $o = $self->{o}) { # ignore o == 0 $qs .= "&o=$o"; diff --git a/t/psgi_search.t b/t/psgi_search.t index 289c34e7..8c981c6c 100644 --- a/t/psgi_search.t +++ b/t/psgi_search.t @@ -18,7 +18,8 @@ local $ENV{TZ} = 'UTC'; my $digits = '10010260936330'; my $ua = 'Pine.LNX.4.10'; my $mid = "$ua.$digits.2460-100000\@penguin.transmeta.com"; -my $ibx = create_inbox 'git', indexlevel => 'full', tmpdir => "$tmpdir/1", sub { +my $ibx = create_inbox '26-git', indexlevel => 'full', tmpdir => "$tmpdir/1", +sub { my ($im) = @_; # n.b. these headers are not properly RFC2047-encoded $im->add(PublicInbox::Eml->new(<<EOF)) or BAIL_OUT; @@ -48,6 +49,17 @@ Message-ID: <no-subject-at-all@example.com> From: no subject at all <no-subject-at-all@example.com> To: git@vger.kernel.org +EOF + $im->add(PublicInbox::Eml->new(<<'EOF')) or BAIL_OUT; +Message-ID: <ampersand@example.com> +From: <e@example.com> +To: git@vger.kernel.org +Subject: git & ampersand + +hi +++ b/foo +x=y +s'more + EOF }; @@ -155,6 +167,19 @@ test_psgi(sub { $www->call(@_) }, sub { is($res->code, 200, 'successful mbox download w/ threads'); gunzip(\($res->content) => \(my $after)); isnt($before, $after); + + $res = $cb->(GET('/test/?q=git+%26+ampersand&x=A')); + is $res->code, 200, 'Atom hit with ampersand'; + unlike $res->content, qr/git\+&\+ampersand/, '& is HTML-escaped'; + + $res = $cb->(GET('/test/?q=%22hi+%2b%2b%2b+b/foo%22&x=A')); + is $res->code, 200, 'slashes and plusses search hit'; + like $res->content, qr!q=%22hi\+(?:%2[bB]){3}\+b/foo%22!, + '+ and " escaped, but slash not escaped in query'; + + $res = $cb->(GET(q{/test/?q=%22s'more%22&x=A})); + is $res->code, 200, 'single quote inside phrase'; + # TODO: more tests and odd cases }); done_testing();