From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
 shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
 by dcvr.yhbt.net (Postfix) with ESMTP id 1E4991F8C1;
 Thu, 7 May 2020 03:00:10 +0000 (UTC)
Date: Thu, 7 May 2020 03:00:09 +0000
From: Eric Wong
To: meta@public-inbox.org
Subject: [PATCH] search: support searching on List-Id
Message-ID: <20200507030009.GA21973@dcvr>
References: <20200420015317.GA9660@dcvr>
 <20200422221724.hwx5pde2ege4fekz@chatter.i7.local>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20200422221724.hwx5pde2ege4fekz@chatter.i7.local>
List-Id:

Konstantin Ryabitsev wrote:
> On Mon, Apr 20, 2020 at 01:53:17AM +0000, Eric Wong wrote:
> > I'm probably going to start indexing List-Id: headers by
> > default, and have `lid:' be the search prefix for inboxes
> > which combine multiple lists and may have unstable email
> > addresses.
>
> This would be handy indeed!

Not sure if both `lid:' and `l:' are necessary, but it's
consistent with `mid:' and `m:' as far as exact (boolean)
vs. probabilistic search goes.  I figure `l:' is probably
useful for lists or projects which change domains/hosts.

Patch below.

> > Anything else that should be indexed by default?
> >
> > There'll also be an option to define indexing for other headers,
> > such as bug-tracker-specific IDs and such.
>
> I think if this is configurable, then it's really the only thing that's
> needed. Everyone's needs are going to be different, so indexing headers
> that aren't interesting to many people is just going to lead to storage
> bloat.

The other thing is whether or not decoding RFC 2047 is necessary
or even correct for a particular header.
Email::MIME->{header_str,header} blindly decodes some things
which probably shouldn't be...  I'm probably splitting hairs
here, though.

-------8<------
Subject: [PATCH] search: support searching on List-Id

We'll support both probabilistic matches via `l:' and exact
(boolean) matches via `lid:', similar to how both `m:' and
`mid:' are supported.

Only text inside angle brackets (`<' and `>') is supported, since
I'm not sure if there's value in searching on the optional
phrases (which would require decoding with ->header_str instead
of ->header_raw).
---
 lib/PublicInbox/Search.pm    |  9 +++++++++
 lib/PublicInbox/SearchIdx.pm |  6 ++++++
 t/search.t                   | 31 +++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 86a6ad674b3..b7db2b9f7fc 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -77,11 +77,17 @@ use constant {
 	# 15 - see public-inbox-v2-format(5)
 	#      further bumps likely unnecessary, we'll suggest in-place
 	#      "--reindex" use for further fixes and tweaks
+	#
+	# public-inbox v1.5.0 adds (still SCHEMA_VERSION=15):
+	# * "lid:" and "l:" for List-Id searches
 	SCHEMA_VERSION => 15,
 };
 
+# note: the non-X term prefix allocations are shared with
+# Xapian omega, see xapian-applications/omega/docs/termprefixes.rst
 my %bool_pfx_external = (
 	mid => 'Q', # Message-ID (full/exact), this is mostly uniQue
+	lid => 'G', # newsGroup (or similar entity), just inside <>
 	dfpre => 'XDFPRE',
 	dfpost => 'XDFPOST',
 	dfblob => 'XDFPRE XDFPOST',
@@ -92,6 +98,7 @@ my %prob_prefix = (
 	# for mairix compatibility
 	s => 'S',
 	m => 'XM', # 'mid:' (bool) is exact, 'm:' (prob) can do partial
+	l => 'XL', # 'lid:' (bool) is exact, 'l:' (prob) can do partial
 	f => 'A',
 	t => 'XTO',
 	tc => 'XTO XCC',
@@ -134,6 +141,8 @@ EOF
 	'f:' => 'match within the From header',
 	'a:' => 'match within the To, Cc, and From headers',
 	'tc:' => 'match within the To and Cc headers',
+	'lid:' => 'exact contents of the List-Id',
+	'l:' => 'partial match contents of the List-Id header',
 	'bs:' => 'match within the Subject and body',
 	'dfn:' => 'match filename from diff',
 	'dfa:' => 'match diff removed (-) lines',
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 25118f43613..998341a7d4d 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -352,6 +352,12 @@ sub add_xapian ($$$$) {
 		}
 	}
 	$doc->add_boolean_term('Q' . $_) foreach @$mids;
+	for my $l ($hdr->header_raw('List-Id')) {
+		$l =~ /<([^>]+)>/ or next;
+		my $lid = $1;
+		$doc->add_boolean_term('G' . $lid);
+		index_text($self, $lid, 1, 'XL'); # probabilistic
+	}
 	$self->{xdb}->replace_document($smsg->{num}, $doc);
 }
 
diff --git a/t/search.t b/t/search.t
index 83986837eaf..92f3305d556 100644
--- a/t/search.t
+++ b/t/search.t
@@ -66,6 +66,7 @@ Subject: Hello world
 Message-ID:
 From: John Smith
 To: list@example.com
+List-Id: I'm not mad <i.m.just.bored>
 
 \m/
 EOF
@@ -77,6 +78,7 @@ Message-ID:
 From: John Smith
 To: list@example.com
 Cc: foo@example.com
+List-Id: there's nothing
 
 goodbye forever :<
 EOF
@@ -448,6 +450,35 @@ EOF
 	is($ro->query("m:Pine m:LNX m:10010260936330", {mset=>1})->size, 1);
 });
 
+{ # List-Id searching
+	my $found = $ro->query('lid:i.m.just.bored');
+	is_deeply([ filter_mids($found) ], [ 'root@s' ],
+		'got expected mid on exact lid: search');
+
+	$found = $ro->query('lid:just.bored');
+	is_deeply($found, [], 'got nothing on partial lid: search');
+
+	$found = $ro->query('lid:*.just.bored');
+	is_deeply($found, [], 'got nothing on wildcard lid: search');
+
+	$found = $ro->query('l:i.m.just.bored');
+	is_deeply([ filter_mids($found) ], [ 'root@s' ],
+		'probabilistic search works on full List-Id contents');
+
+	$found = $ro->query('l:just.bored');
+	is_deeply([ filter_mids($found) ], [ 'root@s' ],
+		'probabilistic search works on partial List-Id contents');
+
+	$found = $ro->query('lid:mad');
+	is_deeply($found, [], 'no match on phrase with lid:');
+
+	$found = $ro->query('lid:bored');
+	is_deeply($found, [], 'no match on partial List-Id with lid:');
+
+	$found = $ro->query('l:nothing');
+	is_deeply($found, [], 'no match on phrase with l:');
+}
+
 done_testing();
 
 1;