From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 1FC051F4B4;
	Fri, 9 Apr 2021 23:37:00 +0000 (UTC)
Date: Fri, 9 Apr 2021 23:37:00 +0000
From: Eric Wong
To: Kyle Meyer
Cc: meta@public-inbox.org
Subject: Re: archive links broken with obfuscate=true
Message-ID: <20210409233700.GA11190@dcvr>
References: <87a6q8p5qa.fsf@kyleam.com>
 <20210409102129.GA16787@dcvr>
 <87zgy7rs9q.fsf@kyleam.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <87zgy7rs9q.fsf@kyleam.com>
List-Id:

Kyle Meyer wrote:
> Eric Wong writes:
>
> > Oops, I think the following fixes it, but not sure if there's a
> > better way to accomplish the same thing....
>
> Thanks.  Jumping around a bit with that installed, I haven't spotted any
> remaining issues.

Thanks for the report.  Have you run any performance tests?

> > I worry the regexp change is susceptible to performance problems
> > from malicious inputs.  I can't remember if something like this
> > triggers a pathological case or not, or if I'm confusing this
> > with another quirk that does (or quirks of another RE engine)
>
> Hmm...
>
> > diff --git a/lib/PublicInbox/Hval.pm b/lib/PublicInbox/Hval.pm
> > index d20f70ae..6f1a046c 100644
> > --- a/lib/PublicInbox/Hval.pm
> > +++ b/lib/PublicInbox/Hval.pm
> > @@ -82,15 +82,17 @@ sub obfuscate_addrs ($$;$) {
> >  	my $repl = $_[2] // '•';
> >  	my $re = $ibx->{-no_obfuscate_re}; # regex of domains
> >  	my $addrs = $ibx->{-no_obfuscate}; # { $address => 1 }
> > -	$_[1] =~ s/(([\w\.\+=\-]+)\@([\w\-]+\.[\w\.\-]+))/
> > -		my ($addr, $user, $domain) = ($1, $2, $3);
> > -		if ($addrs->{$addr} || ((defined $re && $domain =~ $re))) {
> > +	$_[1] =~ s#(\S*?)(([\w\.\+=\-]+)\@([\w\-]+\.[\w\.\-]+))#
> > +		my ($beg, $addr, $user, $domain) = ($1, $2, $3, $4);
>
> ... what about allowing the first match to be {0,N}, where N is some not
> so huge value?  It'd risk incorrectly obfuscating some really long
> links, but given that it's just the HTML presentation, that seems
> acceptable.

I'm actually more worried about the '0' (of '{0,}') or '*' being
combined with '?'.  I can't remember if there's a pathological
case in that...

The upper bound of N is a smaller concern, especially for
non-spam messages which only have non-space tokens of reasonable
length.

Maybe changing the three existing '+' to {1,M} would be a way to
ameliorate the problem (though I'm not sure what a good value of
M would be, 255?).  OTOH, it may not be an actual problem at all
and I'm just confusing this with something else.
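
One way to measure this rather than guess: below is a rough,
standalone sketch (not the Hval.pm code) that builds the pattern
from the quoted patch next to a variant with every quantifier
bounded, and times both against a long token that almost looks
like an address.  The helper names (unbounded_re, bounded_re,
obfuscate, time_match), the N=1024 and M=255 values, the '://'
check for URLs and the probe string are all assumptions made for
illustration; none of them come from the patch.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Time::HiRes qw(time);

# Pattern as it appears in the quoted patch (unbounded quantifiers).
sub unbounded_re {
	qr/(\S*?)(([\w\.\+=\-]+)\@([\w\-]+\.[\w\.\-]+))/;
}

# Same shape, with '*' replaced by {0,N} and each '+' by {1,M}.
# $n and $m are hypothetical limits chosen by the caller.
sub bounded_re {
	my ($n, $m) = @_;
	qr/(\S{0,$n}?)(([\w\.\+=\-]{1,$m})\@([\w\-]{1,$m}\.[\w\.\-]{1,$m}))/;
}

# Simplified stand-in for the substitution: leave the match alone when
# the leading non-space text contains '://' (an assumption made for
# this sketch), otherwise replace the first dot of the domain.
sub obfuscate {
	my ($re, $str, $repl) = @_;
	$repl //= '•';
	$str =~ s{$re}{
		my ($beg, $addr, $user, $domain) = ($1, $2, $3, $4);
		if (index($beg, '://') >= 0) {
			$beg . $addr;
		} else {
			$domain =~ s/\./$repl/;
			$beg . $user . '@' . $domain;
		}
	}sge;
	$str;
}

# Time one match attempt against "aaa...a@.", which contains both
# literals the pattern needs but can never match, so the engine has
# to exhaust its alternatives.  Returns (matched, seconds).
sub time_match {
	my ($re, $n) = @_;
	my $probe = ('a' x $n) . '@.';
	my $t0 = time;
	my $matched = ($probe =~ $re) ? 1 : 0;
	return ($matched, time - $t0);
}

binmode(STDOUT, ':encoding(UTF-8)');
print obfuscate(bounded_re(1024, 255), 'mail user@example.com now'), "\n";
print obfuscate(bounded_re(1024, 255),
	'see https://example.org/t/user@example.com/ ok'), "\n";

# Probe sizes are kept small on purpose so the script finishes
# quickly whichever way the engine behaves.
for my $n (50, 100, 200) {
	my (undef, $tu) = time_match(unbounded_re(), $n);
	my (undef, $tb) = time_match(bounded_re(1024, 255), $n);
	printf("%5d  unbounded %.4fs  bounded %.4fs\n", $n, $tu, $tb);
}
```

If the unbounded timings grow much faster than the probe length,
that would point at the backtracking concern being real.  Note
the bounds only start to matter once a token is longer than N or
M, so probes shorter than that should time about the same for
both patterns.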