From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 7E53C1F4B4;
	Fri, 9 Apr 2021 10:21:29 +0000 (UTC)
Date: Fri, 9 Apr 2021 10:21:29 +0000
From: Eric Wong
To: Kyle Meyer
Cc: meta@public-inbox.org
Subject: Re: archive links broken with obfuscate=true
Message-ID: <20210409102129.GA16787@dcvr>
References: <87a6q8p5qa.fsf@kyleam.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <87a6q8p5qa.fsf@kyleam.com>
List-Id:

Kyle Meyer wrote:
> I've been testing out obfuscate=true a bit (which won't be a surprise to
> Eric, given a private email that was sent to both of us).  One issue I
> noticed is that it breaks archive links.  I've posted an example at
> :
>
>   Reported-by: Kyle Meyer
>   Link: https://public-inbox.org/meta/87360nlc44.fsf@kyleam•com/

Oops, I think the following fixes it, but not sure if there's
a better way to accomplish the same thing....

I worry the regexp change is susceptible to performance problems
from malicious inputs.  I can't remember if something like this
triggers a pathological case or not, or if I'm confusing this with
another quirk that does (or quirks of another RE engine)

------------8<--------
Subject: [WIP] www: do not perform address obfuscation on URLs

---
 lib/PublicInbox/Hval.pm | 10 ++++++----
 t/hval.t                |  4 ++++
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/Hval.pm b/lib/PublicInbox/Hval.pm
index d20f70ae..6f1a046c 100644
--- a/lib/PublicInbox/Hval.pm
+++ b/lib/PublicInbox/Hval.pm
@@ -82,15 +82,17 @@ sub obfuscate_addrs ($$;$) {
 	my $repl = $_[2] // '•';
 	my $re = $ibx->{-no_obfuscate_re}; # regex of domains
 	my $addrs = $ibx->{-no_obfuscate}; # { $address => 1 }
-	$_[1] =~ s/(([\w\.\+=\-]+)\@([\w\-]+\.[\w\.\-]+))/
-		my ($addr, $user, $domain) = ($1, $2, $3);
-		if ($addrs->{$addr} || ((defined $re && $domain =~ $re))) {
+	$_[1] =~ s#(\S*?)(([\w\.\+=\-]+)\@([\w\-]+\.[\w\.\-]+))#
+		my ($beg, $addr, $user, $domain) = ($1, $2, $3, $4);
+		if (index($beg, '://') > 0) {
+			$beg.$addr;
+		} elsif ($addrs->{$addr} || ((defined $re && $domain =~ $re))) {
 			$addr;
 		} else {
 			$domain =~ s!([^\.]+)\.!$1$repl!;
 			$user . '@' . $domain
 		}
-		/sge;
+		#sge;
 }
 
 # like format_sanitized_subject in git.git pretty.c with '%f' format string
diff --git a/t/hval.t b/t/hval.t
index 9d0dab7a..5afc2052 100644
--- a/t/hval.t
+++ b/t/hval.t
@@ -47,6 +47,10 @@ EOF
 
 is($html, $exp, 'only obfuscated relevant addresses');
 
+$exp = 'https://example.net/foo@example.net';
+PublicInbox::Hval::obfuscate_addrs($ibx, my $res = $exp);
+is($res, $exp, 'does not obfuscate URL with Message-ID');
+
 is(PublicInbox::Hval::to_filename('foo bar '), 'foo-bar',
 	'to_filename has no trailing -');
 
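
For poking at the substitution outside of public-inbox, here is a
standalone sketch of the patched regexp.  It is not part of the patch:
obfuscate_sketch() and probe() are hypothetical names, and the allow-list
and domain regexp are passed as plain arguments instead of being read
from $ibx.  It also makes one assumption beyond the diff: the non-URL
branches prepend $beg, since (\S*?) is part of the matched text and
anything it captures (e.g. a leading "<") would otherwise be dropped from
the output.  probe() only times one no-match input of the kind the
performance worry is about; in a backtracking engine the user part can be
retried for every prefix length that \S*? tries, so growth may be
superlinear, but perl's own optimizations might short-circuit it.  The
sketch prints timings and does not settle the question.

```perl
#!/usr/bin/perl
# Standalone sketch, not the actual PublicInbox::Hval code.
use strict;
use warnings;
use Time::HiRes ();

# $addrs: { $address => 1 } allow-list, $re: regexp of allowed domains
# (both optional here; the real sub reads them from $ibx)
sub obfuscate_sketch {
	my ($str, $addrs, $re, $repl) = @_;
	$addrs //= {};
	$repl //= "\x{2022}"; # U+2022 BULLET, the default in the patch
	# no "#" comments inside the substitution: "#" is the delimiter
	$str =~ s#(\S*?)(([\w\.\+=\-]+)\@([\w\-]+\.[\w\.\-]+))#
		my ($beg, $addr, $user, $domain) = ($1, $2, $3, $4);
		if (index($beg, '://') > 0) {
			$beg.$addr;
		} elsif ($addrs->{$addr} || (defined $re && $domain =~ $re)) {
			$beg.$addr;
		} else {
			$domain =~ s!([^\.]+)\.!$1$repl!;
			$beg . $user . '@' . $domain;
		}
		#sge;
	$str;
}

# time one run over $n word characters followed by an incomplete domain,
# which can never match and so exercises whatever backtracking happens
sub probe {
	my ($n) = @_;
	my $in = ('a' x $n) . '@a.';
	my $t0 = Time::HiRes::time();
	obfuscate_sketch($in);
	Time::HiRes::time() - $t0;
}

binmode(STDOUT, ':encoding(UTF-8)');
print obfuscate_sketch($_), "\n" for (
	'https://example.net/foo@example.net',
	'mail <user@example.com> today',
);
printf("n=%d: %.4fs\n", $_, probe($_)) for (100, 200, 400);
```

If the timings grow much faster than n does, that would suggest the
worry is justified for the perl in use; flat timings would suggest the
engine avoids the retries for this particular input, not for all inputs.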