user/dev discussion of public-inbox itself
 help / color / mirror / code / Atom feed
Search results ordered by [date|relevance]  view[summary|nested|Atom feed]
thread overview below | download mbox.gz: |
* [PATCH v2 3/3] SearchIdx: Allow the amount of indexing be configured
  2018-07-18 16:52  4%         ` [PATCH v2 1/3] Making the search indexes optional Eric W. Biederman
@ 2018-07-18 16:53  4%           ` Eric W. Biederman
  0 siblings, 0 replies; 7+ results
From: Eric W. Biederman @ 2018-07-18 16:53 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta, Eric W. Biederman

This adds a new inbox configuration option 'indexlevel' that can take
the values 'full', 'medium', and 'basic'.

When set to 'full' everything is indexed including the positions
of all terms.

When set to 'medium' everything except the positions of terms is
indexed.

When set to 'basic' terms and positions are not indexed.  Just the
Overview database for NNTP is created.  Which is still quite good and
allows searching for messages by Message-ID.  But there are no indexes to support
searching inside the email messages themselves.

Update the reindex tests to exercise the full medium and basic code paths

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 lib/PublicInbox/Config.pm    |  2 +-
 lib/PublicInbox/SearchIdx.pm |  8 +++++++
 t/v1reindex.t                | 43 +++++++++++++++++++++++++++++++++++-
 t/v2reindex.t                | 40 +++++++++++++++++++++++++++++++++
 4 files changed, 91 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index 289c36a6ccf1..78586560cf0f 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -158,7 +158,7 @@ sub _fill {
 
 	foreach my $k (qw(mainrepo filter url newsgroup
 			infourl watch watchheader httpbackendmax
-			replyto feedmax nntpserver)) {
+			replyto feedmax nntpserver indexlevel)) {
 		my $v = $self->{"$pfx.$k"};
 		$rv->{$k} = $v if defined $v;
 	}
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 8978914ab087..04e853069f06 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -47,6 +47,7 @@ sub git_unquote ($) {
 
 sub new {
 	my ($class, $ibx, $creat, $part) = @_;
+	my $levels = qr/\A(?:full|medium|basic)\z/;
 	my $mainrepo = $ibx; # for "public-inbox-index" w/o entry in config
 	my $git_dir = $mainrepo;
 	my ($altid, $git);
@@ -62,6 +63,13 @@ sub new {
 				PublicInbox::AltId->new($ibx, $_);
 			} @$altid ];
 		}
+		if ($ibx->{indexlevel}) {
+			if ($ibx->{indexlevel} =~ $levels) {
+				$indexlevel = $ibx->{indexlevel};
+			} else {
+				die("Invalid indexlevel $ibx->{indexlevel}\n");
+			}
+		}
 	} else { # v1
 		$ibx = { mainrepo => $git_dir, version => 1 };
 	}
diff --git a/t/v1reindex.t b/t/v1reindex.t
index 0df36d3fee89..ff32750fc3c3 100644
--- a/t/v1reindex.t
+++ b/t/v1reindex.t
@@ -20,6 +20,7 @@ my $ibx_config = {
 	mainrepo => $mainrepo,
 	name => 'test-v1reindex',
 	-primary_address => 'test@example.com',
+	indexlevel => 'full',
 };
 my $ibx = PublicInbox::Inbox->new($ibx_config);
 my $mime = PublicInbox::MIME->create(
@@ -74,15 +75,32 @@ is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
 
 ok(unlink "$mainrepo/public-inbox/msgmap.sqlite3", 'remove msgmap');
 remove_tree($xap);
+ok(!-d $xap, 'Xapian directories removed again');
+
 $rw = PublicInbox::SearchIdx->new($ibx, 1);
+{
+	my @warn;
+	local $SIG{__WARN__} = sub { push @warn, @_ };
+	eval { $rw->index_sync({reindex => 1}) };
+	is($@, '', 'no error from reindexing without msgmap');
+	is(scalar(@warn), 0, 'no warnings from reindexing');
+	$im->done;
+	ok(-d $xap, 'Xapian directories recreated');
+	delete $ibx->{mm};
+	is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
+}
 
+ok(unlink "$mainrepo/public-inbox/msgmap.sqlite3", 'remove msgmap');
+remove_tree($xap);
 ok(!-d $xap, 'Xapian directories removed again');
+
+$rw = PublicInbox::SearchIdx->new($ibx, 1);
 {
 	my @warn;
 	local $SIG{__WARN__} = sub { push @warn, @_ };
 	eval { $rw->index_sync({reindex => 1}) };
 	is($@, '', 'no error from reindexing without msgmap');
-	is(scalar(@warn), 0, 'no warnings from reindexing');
+	is_deeply(\@warn, [], 'no warnings');
 	$im->done;
 	ok(-d $xap, 'Xapian directories recreated');
 	delete $ibx->{mm};
@@ -91,9 +109,31 @@ ok(!-d $xap, 'Xapian directories removed again');
 
 ok(unlink "$mainrepo/public-inbox/msgmap.sqlite3", 'remove msgmap');
 remove_tree($xap);
+ok(!-d $xap, 'Xapian directories removed again');
+
+$ibx_config->{index_level} = 'medium';
+$ibx = PublicInbox::Inbox->new($ibx_config);
 $rw = PublicInbox::SearchIdx->new($ibx, 1);
+{
+	my @warn;
+	local $SIG{__WARN__} = sub { push @warn, @_ };
+	eval { $rw->index_sync({reindex => 1}) };
+	is($@, '', 'no error from reindexing without msgmap');
+	is_deeply(\@warn, [], 'no warnings');
+	$im->done;
+	ok(-d $xap, 'Xapian directories recreated');
+	delete $ibx->{mm};
+	is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
+}
 
+
+ok(unlink "$mainrepo/public-inbox/msgmap.sqlite3", 'remove msgmap');
+remove_tree($xap);
 ok(!-d $xap, 'Xapian directories removed again');
+
+$ibx_config->{index_level} = 'basic';
+$ibx = PublicInbox::Inbox->new($ibx_config);
+$rw = PublicInbox::SearchIdx->new($ibx, 1);
 {
 	my @warn;
 	local $SIG{__WARN__} = sub { push @warn, @_ };
@@ -106,4 +146,5 @@ ok(!-d $xap, 'Xapian directories removed again');
 	is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
 }
 
+
 done_testing();
diff --git a/t/v2reindex.t b/t/v2reindex.t
index 8af30991f858..20903967bf25 100644
--- a/t/v2reindex.t
+++ b/t/v2reindex.t
@@ -19,6 +19,7 @@ my $ibx_config = {
 	name => 'test-v2writable',
 	version => 2,
 	-primary_address => 'test@example.com',
+	indexlevel => 'full',
 };
 my $ibx = PublicInbox::Inbox->new($ibx_config);
 my $mime = PublicInbox::MIME->create(
@@ -95,4 +96,43 @@ ok(!-d $xap, 'Xapian directories removed again');
 	is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
 }
 
+ok(unlink "$mainrepo/msgmap.sqlite3", 'remove msgmap');
+remove_tree($xap);
+ok(!-d $xap, 'Xapian directories removed again');
+
+$ibx_config->{index_level} = 'medium';
+$ibx = PublicInbox::Inbox->new($ibx_config);
+$im = PublicInbox::V2Writable->new($ibx);
+{
+	my @warn;
+	local $SIG{__WARN__} = sub { push @warn, @_ };
+	eval { $im->index_sync({reindex => 1}) };
+	is($@, '', 'no error from reindexing without msgmap');
+	is_deeply(\@warn, [], 'no warnings');
+	$im->done;
+	ok(-d $xap, 'Xapian directories recreated');
+	delete $ibx->{mm};
+	is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
+}
+
+
+ok(unlink "$mainrepo/msgmap.sqlite3", 'remove msgmap');
+remove_tree($xap);
+ok(!-d $xap, 'Xapian directories removed again');
+
+$ibx_config->{index_level} = 'basic';
+$ibx = PublicInbox::Inbox->new($ibx_config);
+$im = PublicInbox::V2Writable->new($ibx);
+{
+	my @warn;
+	local $SIG{__WARN__} = sub { push @warn, @_ };
+	eval { $im->index_sync({reindex => 1}) };
+	is($@, '', 'no error from reindexing without msgmap');
+	is_deeply(\@warn, [], 'no warnings');
+	$im->done;
+	ok(-d $xap, 'Xapian directories recreated');
+	delete $ibx->{mm};
+	is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
+}
+
 done_testing();
-- 
2.17.1


^ permalink raw reply related	[relevance 4%]

* [PATCH v2 1/3] Making the search indexes optional
  2018-07-18 16:31  7%       ` Eric Wong
@ 2018-07-18 16:52  4%         ` Eric W. Biederman
  2018-07-18 16:53  4%           ` [PATCH v2 3/3] SearchIdx: Allow the amount of indexing be configured Eric W. Biederman
  0 siblings, 1 reply; 7+ results
From: Eric W. Biederman @ 2018-07-18 16:52 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta


This is my respin of these patches.  I have used the levels:
full, medium, basic.

I think basic conveys the message that it is ok to run with and you can
expect most things to work, better than minimal where it feels like
you don't know what will fail.

I have tweaked the reindex tests to run with all 3 different levels
so at least these code paths get exercised.

Eric W. Biederman (3):
      SearchIdx.pm: Make indexing search positions optional
      SearchIdx: Add the mechanism for making all Xapian indexing optional
      SearchIdx: Allow the amount of indexing be configured

 lib/PublicInbox/Config.pm    |   2 +-
 lib/PublicInbox/SearchIdx.pm | 256 +++++++++++++++++++++++--------------------
 t/v1reindex.t                |  43 +++++++-
 t/v2reindex.t                |  40 +++++++
 4 files changed, 220 insertions(+), 121 deletions(-)


^ permalink raw reply	[relevance 4%]

* Re: [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured
  2018-07-18 16:00  7%     ` Eric W. Biederman
@ 2018-07-18 16:31  7%       ` Eric Wong
  2018-07-18 16:52  4%         ` [PATCH v2 1/3] Making the search indexes optional Eric W. Biederman
  0 siblings, 1 reply; 7+ results
From: Eric Wong @ 2018-07-18 16:31 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: meta

"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> Eric Wong <e@80x24.org> writes:
> 
> > "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> >> This adds a new inbox configuration option 'indexlevel' that can take
> >> the values 'positions', 'terms', and 'over'.
> >
> > The names of these user-facing configuration variables aren't
> > obviously "levels" at all; especially to people not familiar
> > with Xapian or public-inbox internals.
> >
> > As for "over", at least it should be spelled out "overview";
> > but really, I would much prefer something which wouldn't require
> > consulting the manual for explanations, such as:
> >
> > 	'full', 'medium', 'minimal'
> 
> Do you mind the config option name indexlevel?

'indexlevel' is fine.  I originally had something along the
lines of 'type' in mind (e.g. 'indextype'); but maybe 'level'
is more obvious and requires less documentation.

> I don't mind changing the names I just needed some name and
> those names were present.
> 
> > Where it's obvious which one sits relative to the others.
> >
> > That wouldn't tie our user-facing configuration to our internal
> > choices or terminology used by Xapian, either.  I'm pretty happy
> > with Xapian; but it may be worth exploring other search engines
> > at some point...
> >
> >> --- a/lib/PublicInbox/SearchIdx.pm
> >> +++ b/lib/PublicInbox/SearchIdx.pm
> >> @@ -47,6 +47,7 @@ sub git_unquote ($) {
> >>  
> >>  sub new {
> >>  	my ($class, $ibx, $creat, $part) = @_;
> >> +	my $levels = qr/(positions|terms|over)/;
> >
> > Please anchor matches so they match expected strings exactly.
> > It lets typos be caught and makes life easier for 3rd-party
> > tools and implementations if we're stricter in what we accept.
> > Captures aren't necessary, so '?:' can be used:
> >
> > 	qr/\A(?:full|medium|minimal)\z/
> >
> > Same comment applies to patch 2/3
> 
> Good point.  I wish I knew a way so I didn't have to repeat the test
> so often.  But getting the user space interface correct is the first
> step then we can optimize if need be.

As in repeating the string comparison?  Perhaps it could be
mapped to different subroutine calls:

sub do_index_text { ... }
sub do_index_text_without_positions { ... }
sub do_overview { ... }

my %INDEX_LEVEL = (
	full => *do_index_text,
	medium => *do_index_text_without_positions,
	minimal => *do_overview,
);

	$self->{index_cb} = $INDEX_LEVEL{$ibx->{indexlevel}};
	defined $self->{index_cb} or die "invalid indexlevel\n";

^ permalink raw reply	[relevance 7%]

* Re: [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured
  2018-07-18 10:22  7%   ` Eric Wong
@ 2018-07-18 16:00  7%     ` Eric W. Biederman
  2018-07-18 16:31  7%       ` Eric Wong
  0 siblings, 1 reply; 7+ results
From: Eric W. Biederman @ 2018-07-18 16:00 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong <e@80x24.org> writes:

> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> This adds a new inbox configuration option 'indexlevel' that can take
>> the values 'positions', 'terms', and 'over'.
>
> The names of these user-facing configuration variables aren't
> obviously "levels" at all; especially to people not familiar
> with Xapian or public-inbox internals.
>
> As for "over", at least it should be spelled out "overview";
> but really, I would much prefer something which wouldn't require
> consulting the manual for explanations, such as:
>
> 	'full', 'medium', 'minimal'

Do you mind the config option name indexlevel?

I don't mind changing the names I just needed some name and
those names were present.

> Where it's obvious which one sits relative to the others.
>
> That wouldn't tie our user-facing configuration to our internal
> choices or terminology used by Xapian, either.  I'm pretty happy
> with Xapian; but it may be worth exploring other search engines
> at some point...
>
>> --- a/lib/PublicInbox/SearchIdx.pm
>> +++ b/lib/PublicInbox/SearchIdx.pm
>> @@ -47,6 +47,7 @@ sub git_unquote ($) {
>>  
>>  sub new {
>>  	my ($class, $ibx, $creat, $part) = @_;
>> +	my $levels = qr/(positions|terms|over)/;
>
> Please anchor matches so they match expected strings exactly.
> It lets typos be caught and makes life easier for 3rd-party
> tools and implementations if we're stricter in what we accept.
> Captures aren't necessary, so '?:' can be used:
>
> 	qr/\A(?:full|medium|minimal)\z/
>
> Same comment applies to patch 2/3

Good point.  I wish I knew a way so I didn't have to repeat the test
so often.  But getting the user space interface correct is the first
step then we can optimize if need be.

Eric


^ permalink raw reply	[relevance 7%]

* Re: [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured
  2018-07-17 23:30  6% ` [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured Eric W. Biederman
@ 2018-07-18 10:22  7%   ` Eric Wong
  2018-07-18 16:00  7%     ` Eric W. Biederman
  0 siblings, 1 reply; 7+ results
From: Eric Wong @ 2018-07-18 10:22 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: meta

"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> This adds a new inbox configuration option 'indexlevel' that can take
> the values 'positions', 'terms', and 'over'.

The names of these user-facing configuration variables aren't
obviously "levels" at all; especially to people not familiar
with Xapian or public-inbox internals.

As for "over", at least it should be spelled out "overview";
but really, I would much prefer something which wouldn't require
consulting the manual for explanations, such as:

	'full', 'medium', 'minimal'

Where it's obvious which one sits relative to the others.

That wouldn't tie our user-facing configuration to our internal
choices or terminology used by Xapian, either.  I'm pretty happy
with Xapian; but it may be worth exploring other search engines
at some point...

> --- a/lib/PublicInbox/SearchIdx.pm
> +++ b/lib/PublicInbox/SearchIdx.pm
> @@ -47,6 +47,7 @@ sub git_unquote ($) {
>  
>  sub new {
>  	my ($class, $ibx, $creat, $part) = @_;
> +	my $levels = qr/(positions|terms|over)/;

Please anchor matches so they match expected strings exactly.
It lets typos be caught and makes life easier for 3rd-party
tools and implementations if we're stricter in what we accept.
Captures aren't necessary, so '?:' can be used:

	qr/\A(?:full|medium|minimal)\z/

Same comment applies to patch 2/3

^ permalink raw reply	[relevance 7%]

* [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured
  2018-07-17 23:27  4% [PATCH 0/3] Making the search indexes optional Eric W. Biederman
@ 2018-07-17 23:30  6% ` Eric W. Biederman
  2018-07-18 10:22  7%   ` Eric Wong
  0 siblings, 1 reply; 7+ results
From: Eric W. Biederman @ 2018-07-17 23:30 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta, Eric W. Biederman

This adds a new inbox configuration option 'indexlevel' that can take
the values 'positions', 'terms', and 'over'.

When set to 'positions' everything is indexed including the positions
of all terms.

When set to 'terms' everything except the positions of terms is
indexed.

When set to 'over' terms and positions are not indexed.  Just the
Overview database for NNTP is created.  Which is still quite good and
allows searching for messages by Message-ID.  But there are no indexes to support
searching inside the email messages themselves.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 lib/PublicInbox/Config.pm    | 2 +-
 lib/PublicInbox/SearchIdx.pm | 8 ++++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index 289c36a6ccf1..78586560cf0f 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -158,7 +158,7 @@ sub _fill {
 
 	foreach my $k (qw(mainrepo filter url newsgroup
 			infourl watch watchheader httpbackendmax
-			replyto feedmax nntpserver)) {
+			replyto feedmax nntpserver indexlevel)) {
 		my $v = $self->{"$pfx.$k"};
 		$rv->{$k} = $v if defined $v;
 	}
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index deb87db3f88a..c42821f813a7 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -47,6 +47,7 @@ sub git_unquote ($) {
 
 sub new {
 	my ($class, $ibx, $creat, $part) = @_;
+	my $levels = qr/(positions|terms|over)/;
 	my $mainrepo = $ibx; # for "public-inbox-index" w/o entry in config
 	my $git_dir = $mainrepo;
 	my ($altid, $git);
@@ -62,6 +63,13 @@ sub new {
 				PublicInbox::AltId->new($ibx, $_);
 			} @$altid ];
 		}
+		if ($ibx->{indexlevel}) {
+			if ($ibx->{indexlevel} =~ $levels) {
+				$indexlevel = $ibx->{indexlevel};
+			} else {
+				die("Invalid indexlevel $ibx->{indexlevel}\n");
+			}
+		}
 	} else { # v1
 		$ibx = { mainrepo => $git_dir, version => 1 };
 	}
-- 
2.17.1


^ permalink raw reply related	[relevance 6%]

* [PATCH 0/3] Making the search indexes optional
@ 2018-07-17 23:27  4% Eric W. Biederman
  2018-07-17 23:30  6% ` [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured Eric W. Biederman
  0 siblings, 1 reply; 7+ results
From: Eric W. Biederman @ 2018-07-17 23:27 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta


Here is the code to make the Xapian search indexes optional.

The first patch makes the term position database optional.
The second patch makes anything in Xapian optional.
Finally the last patch adds a config option.

At the end of the day it all looks simple and straight forward so I feel
good about the code.  At the very least it looks like a good starting
point.

What this code does not do is make the Xapian code modules optional.  As
that is more involved, and there is not much reward for that.  With a
little cleverness in moving around code that is probably possible in a
follow change.

Eric W. Biederman (3):
      SearchIdx.pm: Make indexing search positions optional
      SearchIdx: Add the mechanism for making all Xapian indexing optional
      SearchIdx: Allow the amount of indexing be configured

 lib/PublicInbox/Config.pm    |   2 +-
 lib/PublicInbox/SearchIdx.pm | 255 +++++++++++++++++++++++--------------------
 2 files changed, 137 insertions(+), 120 deletions(-)



^ permalink raw reply	[relevance 4%]

Results 1-7 of 7 | reverse | options above
-- pct% links below jump to the message on this page, permalinks otherwise --
2018-07-17 23:27  4% [PATCH 0/3] Making the search indexes optional Eric W. Biederman
2018-07-17 23:30  6% ` [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured Eric W. Biederman
2018-07-18 10:22  7%   ` Eric Wong
2018-07-18 16:00  7%     ` Eric W. Biederman
2018-07-18 16:31  7%       ` Eric Wong
2018-07-18 16:52  4%         ` [PATCH v2 1/3] Making the search indexes optional Eric W. Biederman
2018-07-18 16:53  4%           ` [PATCH v2 3/3] SearchIdx: Allow the amount of indexing be configured Eric W. Biederman

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).