user/dev discussion of public-inbox itself
From: Eric Wong <e@yhbt.net>
To: meta@public-inbox.org
Subject: [PATCH] search: support searching on List-Id
Date: Thu, 7 May 2020 03:00:09 +0000
Message-ID: <20200507030009.GA21973@dcvr>
In-Reply-To: <20200422221724.hwx5pde2ege4fekz@chatter.i7.local>

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Mon, Apr 20, 2020 at 01:53:17AM +0000, Eric Wong wrote:
> > I'm probably going to start indexing List-Id: headers by
> > default, and have `lid:' be the search prefix for inboxes
> > which combine multiple lists and may have unstable email
> > addresses.
> 
> This would be handy indeed!

Not sure if both `lid:' and `l:' are necessary, but it's
consistent with `mid:' and `m:' as far as exact (boolean) vs.
probabilistic search goes.

I figure `l:' is probably useful for lists or projects which change
domains/hosts.  Patch below.

> > Anything else that should be indexed by default?
> > 
> > There'll also be an option to define indexing for other headers,
> > such as bug-tracker-specific IDs and such.
> 
> I think if this is configurable, then it's really the only thing that's 
> needed. Everyone's needs are going to be different, so indexing headers 
> that aren't interesting to many people is just going to lead to storage 
> bloat.

The other thing is whether or not decoding RFC 2047 is necessary
or even correct for a particular header.

Email::MIME->{header_str,header} blindly decodes some things
which probably shouldn't be decoded... I'm probably splitting
hairs here, though.

-------8<------
Subject: [PATCH] search: support searching on List-Id

We'll support both probabilistic matches via `l:' and exact
(boolean) matches via `lid:', similar to how both `m:' and
`mid:' are supported.  Only text inside angle brackets (`<'
and `>') is indexed, since I'm not sure if there's value in
searching on the optional phrases (which would require decoding
with ->header_str instead of ->header_raw).
---
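A minimal standalone sketch of the extraction described above, for
illustration only.  `list_ids_from_raw' is a hypothetical helper name
and is not part of this patch; the regexp is the one used in the
SearchIdx.pm hunk below, and the sample values come from t/search.t,
except the last one, which is made up to show a value without angle
brackets being skipped.  No Xapian is involved here; the `G' prefix is
only printed to show what the boolean term would look like.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the patch): given raw List-Id header
# values, return the text inside the first <...> of each value.
# Values without angle brackets are skipped, and the optional phrase
# before the '<' is ignored.
sub list_ids_from_raw {
	my (@raw) = @_;
	my @lids;
	for my $l (@raw) {
		$l =~ /<([^>]+)>/ or next;
		push @lids, $1;
	}
	@lids;
}

my @lids = list_ids_from_raw(
	"I'm not mad <i.m.just.bored>",
	"there's nothing <left.for.me.to.do>",
	'no.angle.brackets.example', # assumed input, skipped
);

# Show the boolean terms ('G' prefix) these values would map to
print "G$_\n" for @lids;
```

With this approach, `lid:i.m.just.bored' can only match the whole
bracketed value, while words from the phrase (e.g. "mad") are never
indexed under either prefix.
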
 lib/PublicInbox/Search.pm    |  9 +++++++++
 lib/PublicInbox/SearchIdx.pm |  6 ++++++
 t/search.t                   | 31 +++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 86a6ad674b3..b7db2b9f7fc 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -77,11 +77,17 @@ use constant {
 	# 15 - see public-inbox-v2-format(5)
 	#      further bumps likely unnecessary, we'll suggest in-place
 	#      "--reindex" use for further fixes and tweaks
+	#
+	#      public-inbox v1.5.0 adds (still SCHEMA_VERSION=15):
+	#      * "lid:" and "l:" for List-Id searches
 	SCHEMA_VERSION => 15,
 };
 
+# note: the non-X term prefix allocations are shared with
+# Xapian omega, see xapian-applications/omega/docs/termprefixes.rst
 my %bool_pfx_external = (
 	mid => 'Q', # Message-ID (full/exact), this is mostly uniQue
+	lid => 'G', # newsGroup (or similar entity), just inside <>
 	dfpre => 'XDFPRE',
 	dfpost => 'XDFPOST',
 	dfblob => 'XDFPRE XDFPOST',
@@ -92,6 +98,7 @@ my %prob_prefix = (
 	# for mairix compatibility
 	s => 'S',
 	m => 'XM', # 'mid:' (bool) is exact, 'm:' (prob) can do partial
+	l => 'XL', # 'lid:' (bool) is exact, 'l:' (prob) can do partial
 	f => 'A',
 	t => 'XTO',
 	tc => 'XTO XCC',
@@ -134,6 +141,8 @@ EOF
 	'f:' => 'match within the From header',
 	'a:' => 'match within the To, Cc, and From headers',
 	'tc:' => 'match within the To and Cc headers',
+	'lid:' => 'exact contents of the List-Id',
+	'l:' => 'partial match contents of the List-Id header',
 	'bs:' => 'match within the Subject and body',
 	'dfn:' => 'match filename from diff',
 	'dfa:' => 'match diff removed (-) lines',
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 25118f43613..998341a7d4d 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -352,6 +352,12 @@ sub add_xapian ($$$$) {
 		}
 	}
 	$doc->add_boolean_term('Q' . $_) foreach @$mids;
+	for my $l ($hdr->header_raw('List-Id')) {
+		$l =~ /<([^>]+)>/ or next;
+		my $lid = $1;
+		$doc->add_boolean_term('G' . $lid);
+		index_text($self, $lid, 1, 'XL'); # probabilistic
+	}
 	$self->{xdb}->replace_document($smsg->{num}, $doc);
 }
 
diff --git a/t/search.t b/t/search.t
index 83986837eaf..92f3305d556 100644
--- a/t/search.t
+++ b/t/search.t
@@ -66,6 +66,7 @@ Subject: Hello world
 Message-ID: <root@s>
 From: John Smith <js@example.com>
 To: list@example.com
+List-Id: I'm not mad <i.m.just.bored>
 
 \m/
 EOF
@@ -77,6 +78,7 @@ Message-ID: <last@s>
 From: John Smith <js@example.com>
 To: list@example.com
 Cc: foo@example.com
+List-Id: there's nothing <left.for.me.to.do>
 
 goodbye forever :<
 EOF
@@ -448,6 +450,35 @@ EOF
 	is($ro->query("m:Pine m:LNX m:10010260936330", {mset=>1})->size, 1);
 });
 
+{ # List-Id searching
+	my $found = $ro->query('lid:i.m.just.bored');
+	is_deeply([ filter_mids($found) ], [ 'root@s' ],
+		'got expected mid on exact lid: search');
+
+	$found = $ro->query('lid:just.bored');
+	is_deeply($found, [], 'got nothing on lid: search');
+
+	$found = $ro->query('lid:*.just.bored');
+	is_deeply($found, [], 'got nothing on lid: search');
+
+	$found = $ro->query('l:i.m.just.bored');
+	is_deeply([ filter_mids($found) ], [ 'root@s' ],
+		'probabilistic search works on full List-Id contents');
+
+	$found = $ro->query('l:just.bored');
+	is_deeply([ filter_mids($found) ], [ 'root@s' ],
+		'probabilistic search works on partial List-Id contents');
+
+	$found = $ro->query('lid:mad');
+	is_deeply($found, [], 'no match on phrase with lid:');
+
+	$found = $ro->query('lid:bored');
+	is_deeply($found, [], 'no match on partial List-Id with lid:');
+
+	$found = $ro->query('l:nothing');
+	is_deeply($found, [], 'no match on phrase with l:');
+}
+
 done_testing();
 
 1;

Thread overview: 3+ messages
2020-04-20  1:53 mail header indexing additions Eric Wong
2020-04-22 22:17 ` Konstantin Ryabitsev
2020-05-07  3:00   ` Eric Wong [this message]

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git