user/dev discussion of public-inbox itself
* mail header indexing additions
@ 2020-04-20  1:53 Eric Wong
  2020-04-22 22:17 ` Konstantin Ryabitsev
  0 siblings, 1 reply; 3+ messages in thread
From: Eric Wong @ 2020-04-20  1:53 UTC
  To: meta

I'm probably going to start indexing List-Id: headers by
default, and have `lid:' be the search prefix for inboxes
which combine multiple lists and may have unstable email
addresses.

Anything else that should be indexed by default?

There'll also be an option to define indexing for other headers,
such as bug-tracker-specific IDs and such.


* Re: mail header indexing additions
  2020-04-20  1:53 mail header indexing additions Eric Wong
@ 2020-04-22 22:17 ` Konstantin Ryabitsev
  2020-05-07  3:00   ` [PATCH] search: support searching on List-Id Eric Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Konstantin Ryabitsev @ 2020-04-22 22:17 UTC
  To: Eric Wong; +Cc: meta

On Mon, Apr 20, 2020 at 01:53:17AM +0000, Eric Wong wrote:
> I'm probably going to start indexing List-Id: headers by
> default, and have `lid:' be the search prefix for inboxes
> which combine multiple lists and may have unstable email
> addresses.

This would be handy indeed!

> Anything else that should be indexed by default?
> 
> There'll also be an option to define indexing for other headers,
> such as bug-tracker-specific IDs and such.

I think if this is configurable, then it's really the only thing that's 
needed. Everyone's needs are going to be different, so indexing headers 
that aren't interesting to many people is just going to lead to storage 
bloat.

-K


* [PATCH] search: support searching on List-Id
  2020-04-22 22:17 ` Konstantin Ryabitsev
@ 2020-05-07  3:00   ` Eric Wong
  0 siblings, 0 replies; 3+ messages in thread
From: Eric Wong @ 2020-05-07  3:00 UTC
  To: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Mon, Apr 20, 2020 at 01:53:17AM +0000, Eric Wong wrote:
> > I'm probably going to start indexing List-Id: headers by
> > default, and have `lid:' be the search prefix for inboxes
> > which combine multiple lists and may have unstable email
> > addresses.
> 
> This would be handy indeed!

Not sure if both `lid:' and `l:' are necessary, but it's
consistent with `mid:' and `m:' as far as exact (boolean) vs.
probabilistic search goes.

I figure `l:' is probably useful for lists or projects which
change domains/hosts.  Patch below.
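
To illustrate the difference between the two prefixes, here is a toy
model in plain Perl.  It does not use Xapian at all: the helper names
(list_id_inner, match_exact, match_partial) are made up for this
sketch, and the word splitting is only a rough stand-in for what a
probabilistic prefix does with its input.

```perl
use strict;
use warnings;

# Toy model only; nothing here talks to Xapian.

# Return the text inside <...> of a List-Id value, or undef if absent.
sub list_id_inner {
	my ($raw) = @_;
	return $raw =~ /<([^>]+)>/ ? $1 : undef;
}

# "lid:"-like: the whole identifier is one term, so only equality matches.
sub match_exact {
	my ($lid, $query) = @_;
	return $lid eq $query ? 1 : 0;
}

# "l:"-like: split both sides into words and look for the query words
# as a contiguous run inside the identifier (roughly a phrase match).
sub match_partial {
	my ($lid, $query) = @_;
	my @have = grep { length } split(/[^A-Za-z0-9]+/, lc $lid);
	my @want = grep { length } split(/[^A-Za-z0-9]+/, lc $query);
	return 0 unless @want;
	for my $i (0 .. $#have - $#want) {
		my $ok = 1;
		for my $j (0 .. $#want) {
			if ($have[$i + $j] ne $want[$j]) { $ok = 0; last }
		}
		return 1 if $ok;
	}
	return 0;
}

my $lid = list_id_inner("I'm not mad <i.m.just.bored>");
print match_exact($lid, 'i.m.just.bored'), "\n";  # whole identifier
print match_exact($lid, 'just.bored'), "\n";      # partial, exact mode
print match_partial($lid, 'just.bored'), "\n";    # partial, word mode
print match_partial($lid, 'mad'), "\n";           # phrase outside <>
```

In this model only the first and third calls report a match, which
mirrors the intent of the tests in the patch: exact mode rejects
anything but the full identifier, and the phrase outside the angle
brackets is never considered.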

> > Anything else that should be indexed by default?
> > 
> > There'll also be an option to define indexing for other headers,
> > such as bug-tracker-specific IDs and such.
> 
> I think if this is configurable, then it's really the only thing that's 
> needed. Everyone's needs are going to be different, so indexing headers 
> that aren't interesting to many people is just going to lead to storage 
> bloat.

The other thing is whether or not decoding RFC 2047 is necessary
or even correct for a particular header.

Email::MIME->{header_str,header} blindly decodes some things
which probably shouldn't be decoded... I'm probably splitting
hairs here, though.
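
As a small illustration of why decoding matters less for the part
inside the angle brackets, the sketch below uses the core Encode
module on a made-up List-Id value.  The header value and variable
names are assumptions for this example and are not taken from
public-inbox.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical header value: an RFC 2047 encoded phrase followed by the
# list identifier in angle brackets.
my $raw = '=?UTF-8?Q?Caf=C3=A9_list?= <cafe.example.org>';

# Decoding rewrites the encoded-word in the phrase ...
my $decoded = decode('MIME-Header', $raw);

# ... while the identifier inside <...> is plain ASCII in both forms.
my ($id_raw) = $raw =~ /<([^>]+)>/;
my ($id_dec) = $decoded =~ /<([^>]+)>/;

print "phrase changed by decoding\n" if $raw ne $decoded;
print "identifier unchanged: $id_raw\n" if $id_raw eq $id_dec;
```

So when only the identifier is indexed, working from the raw header
sidesteps the question of whether the phrase should be decoded.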

-------8<------
Subject: [PATCH] search: support searching on List-Id

We'll support both probabilistic matches via `l:' and exact
(boolean) matches via `lid:', similar to how both `m:' and
`mid:' are supported.  Only text inside angle brackets (`<' and
`>') is supported, since I'm not sure if there's value in
searching on the optional phrases (which would require decoding
with ->header_str instead of ->header_raw).
---
 lib/PublicInbox/Search.pm    |  9 +++++++++
 lib/PublicInbox/SearchIdx.pm |  6 ++++++
 t/search.t                   | 31 +++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 86a6ad674b3..b7db2b9f7fc 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -77,11 +77,17 @@ use constant {
 	# 15 - see public-inbox-v2-format(5)
 	#      further bumps likely unnecessary, we'll suggest in-place
 	#      "--reindex" use for further fixes and tweaks
+	#
+	#      public-inbox v1.5.0 adds (still SCHEMA_VERSION=15):
+	#      * "lid:" and "l:" for List-Id searches
 	SCHEMA_VERSION => 15,
 };
 
+# note: the non-X term prefix allocations are shared with
+# Xapian omega, see xapian-applications/omega/docs/termprefixes.rst
 my %bool_pfx_external = (
 	mid => 'Q', # Message-ID (full/exact), this is mostly uniQue
+	lid => 'G', # newsGroup (or similar entity), just inside <>
 	dfpre => 'XDFPRE',
 	dfpost => 'XDFPOST',
 	dfblob => 'XDFPRE XDFPOST',
@@ -92,6 +98,7 @@ my %prob_prefix = (
 	# for mairix compatibility
 	s => 'S',
 	m => 'XM', # 'mid:' (bool) is exact, 'm:' (prob) can do partial
+	l => 'XL', # 'lid:' (bool) is exact, 'l:' (prob) can do partial
 	f => 'A',
 	t => 'XTO',
 	tc => 'XTO XCC',
@@ -134,6 +141,8 @@ EOF
 	'f:' => 'match within the From header',
 	'a:' => 'match within the To, Cc, and From headers',
 	'tc:' => 'match within the To and Cc headers',
+	'lid:' => 'exact contents of the List-Id',
+	'l:' => 'partial match contents of the List-Id header',
 	'bs:' => 'match within the Subject and body',
 	'dfn:' => 'match filename from diff',
 	'dfa:' => 'match diff removed (-) lines',
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 25118f43613..998341a7d4d 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -352,6 +352,12 @@ sub add_xapian ($$$$) {
 		}
 	}
 	$doc->add_boolean_term('Q' . $_) foreach @$mids;
+	for my $l ($hdr->header_raw('List-Id')) {
+		$l =~ /<([^>]+)>/ or next;
+		my $lid = $1;
+		$doc->add_boolean_term('G' . $lid);
+		index_text($self, $lid, 1, 'XL'); # probabilistic
+	}
 	$self->{xdb}->replace_document($smsg->{num}, $doc);
 }
 
diff --git a/t/search.t b/t/search.t
index 83986837eaf..92f3305d556 100644
--- a/t/search.t
+++ b/t/search.t
@@ -66,6 +66,7 @@ Subject: Hello world
 Message-ID: <root@s>
 From: John Smith <js@example.com>
 To: list@example.com
+List-Id: I'm not mad <i.m.just.bored>
 
 \m/
 EOF
@@ -77,6 +78,7 @@ Message-ID: <last@s>
 From: John Smith <js@example.com>
 To: list@example.com
 Cc: foo@example.com
+List-Id: there's nothing <left.for.me.to.do>
 
 goodbye forever :<
 EOF
@@ -448,6 +450,35 @@ EOF
 	is($ro->query("m:Pine m:LNX m:10010260936330", {mset=>1})->size, 1);
 });
 
+{ # List-Id searching
+	my $found = $ro->query('lid:i.m.just.bored');
+	is_deeply([ filter_mids($found) ], [ 'root@s' ],
+		'got expected mid on exact lid: search');
+
+	$found = $ro->query('lid:just.bored');
+	is_deeply($found, [], 'got nothing on partial lid: search');
+
+	$found = $ro->query('lid:*.just.bored');
+	is_deeply($found, [], 'got nothing on wildcard lid: search');
+
+	$found = $ro->query('l:i.m.just.bored');
+	is_deeply([ filter_mids($found) ], [ 'root@s' ],
+		'probabilistic search works on full List-Id contents');
+
+	$found = $ro->query('l:just.bored');
+	is_deeply([ filter_mids($found) ], [ 'root@s' ],
+		'probabilistic search works on partial List-Id contents');
+
+	$found = $ro->query('lid:mad');
+	is_deeply($found, [], 'no match on phrase with lid:');
+
+	$found = $ro->query('lid:bored');
+	is_deeply($found, [], 'no match on partial List-Id with lid:');
+
+	$found = $ro->query('l:nothing');
+	is_deeply($found, [], 'no match on phrase with l:');
+}
+
 done_testing();
 
 1;



Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
