user/dev discussion of public-inbox itself
* [PATCH 0/3] mda: v2: ensure message bodies are indexed
From: Eric Wong @ 2018-07-29  9:34 UTC
  To: meta

I found a bug for v2 users getting mail through -mda, causing
message bodies to not show up in the search results.  It was a
stupid one-line bug made in an effort to save memory :x

Anyways, to properly index message bodies on affected mda-using
v2 inboxes, a reindex is required:

	public-inbox-index --reindex

This can take a long while and requires roughly double the
current Xapian storage.  However, it's designed to run online,
so users will gradually find search more useful as indexing
completes (it runs in reverse-chronological order).

FWIW, I always run indexing with "eatmydata" to disable fsync
and speed up the process, since Xapian data isn't critical.

I suppose another idea is to allow passing a limit to reindex,
as this bug didn't affect initial imports... (But I'm tired
and I fixed this bug while getting sidetracked from another
bugfix on another project)

Eric Wong (3):
  mda: use InboxWritable
  t/v2mda: make it easy to test v1 repos here, too
  mda: v2: ensure message bodies are indexed

 MANIFEST                         |  1 +
 lib/PublicInbox/InboxWritable.pm |  1 +
 script/public-inbox-mda          | 38 +++++++-------------------
 t/data/0001.patch                | 46 ++++++++++++++++++++++++++++++++
 t/v2mda.t                        | 19 ++++++++++++-
 t/watch_maildir_v2.t             | 15 +++++++++++
 6 files changed, 91 insertions(+), 29 deletions(-)
 create mode 100644 t/data/0001.patch

-- 
EW


* [PATCH 3/3] mda: v2: ensure message bodies are indexed
From: Eric Wong @ 2018-07-29  9:34 UTC
  To: meta

We must not clobber the original message string, as Email::MIME(*)
still needs it for iterating through parts in SearchIdx (but not
when handing it as a raw string to git-fast-import).

I've noticed message bodies (especially dfpre/dfpost) were not
getting indexed when going through -mda (no problems with
-watch).  This also did not affect v1 repos, since indexing is a
separate process for v1 and requires re-reading the data from
git.

(*) tested Email::MIME 1.937 on Debian stretch
---
 MANIFEST                |  1 +
 script/public-inbox-mda |  1 -
 t/data/0001.patch       | 46 +++++++++++++++++++++++++++++++++++++++++
 t/v2mda.t               | 10 +++++++++
 t/watch_maildir_v2.t    | 15 ++++++++++++++
 5 files changed, 72 insertions(+), 1 deletion(-)
 create mode 100644 t/data/0001.patch
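
For readers unfamiliar with the failure mode, here is a minimal sketch of
why clearing $str after handing a reference to the parser loses the body.
LazyMessage is a hypothetical stand-in invented for this illustration; it
is not Email::MIME's actual implementation and only assumes that the
parser keeps a reference to the caller's buffer and reads it on demand.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a MIME parser that keeps a reference to the
# caller's buffer and reads the body only on demand.  This is NOT
# Email::MIME's real implementation, only an illustration of the aliasing.
package LazyMessage;

sub new {
	my ($class, $str_ref) = @_;
	return bless { raw => $str_ref }, $class;
}

sub body {
	my ($self) = @_;
	# split headers from body at the first blank line
	my (undef, $body) = split(/\n\n/, ${$self->{raw}}, 2);
	return defined $body ? $body : '';
}

package main;

my $str = "Subject: test\n\nindex 090d998b6c2c..6e006fd73b1d 100644\n";

# Buffer left alone: the body is still there when it is finally read.
my $kept = LazyMessage->new(\$str);
print 'buffer kept:    ', length($kept->body), " body bytes\n";

# Buffer cleared after construction, as the removed line did: the object
# still points at $str, so a later read sees an empty body.
my $clobbered = LazyMessage->new(\$str);
$str = '';
print 'buffer cleared: ', length($clobbered->body), " body bytes\n";
```

In this sketch nothing is copied at construction time, so the memory
saved by emptying $str comes at the cost of everything read later
through the reference.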

diff --git a/MANIFEST b/MANIFEST
index fd74a43..003c3c5 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -146,6 +146,7 @@ t/config.t
 t/config_limiter.t
 t/content_id.t
 t/convert-compact.t
+t/data/0001.patch
 t/emergency.t
 t/fail-bin/spamc
 t/feed.t
diff --git a/script/public-inbox-mda b/script/public-inbox-mda
index 2a31537..2b7f298 100755
--- a/script/public-inbox-mda
+++ b/script/public-inbox-mda
@@ -51,7 +51,6 @@ $emm = PublicInbox::Emergency->new($emergency);
 $emm->prepare(\$str);
 $ems = $ems->abort;
 my $mime = PublicInbox::MIME->new(\$str);
-$str = '';
 do_exit(0) unless $spam_ok;
 
 my $fcfg = $dst->{filter} || '';
diff --git a/t/data/0001.patch b/t/data/0001.patch
new file mode 100644
index 0000000..b7964a2
--- /dev/null
+++ b/t/data/0001.patch
@@ -0,0 +1,46 @@
+From: Eric Wong <e@80x24.org>
+Date: Fri, 20 Jul 2018 07:21:41 +0000
+To: test@example.com
+Subject: [PATCH] search: use boolean prefix for filenames in diffs, too
+Message-ID: <20180720072141.GA15957@example>
+
+Filenames within a project tend to be reasonably stable within a
+project and I plan on having automated searches hit these.
+
+Also, using no term prefix at all (the default for searching)
+still allows probabilistic searches on everything that's in a
+"git diff", including the blob names which were just made
+boolean.
+
+Note, attachment filenames ("n:" prefix) will stil use
+probabilistic search, as they're hardly standardized.
+---
+ lib/PublicInbox/Search.pm | 6 +++---
+ 1 file changed, 3 insertions(+), 3 deletions(-)
+
+diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
+index 090d998b6c2c..6e006fd73b1d 100644
+--- a/lib/PublicInbox/Search.pm
++++ b/lib/PublicInbox/Search.pm
+@@ -53,6 +53,9 @@ my %bool_pfx_external = (
+ 	dfpre => 'XDFPRE',
+ 	dfpost => 'XDFPOST',
+ 	dfblob => 'XDFPRE XDFPOST',
++	dfn => 'XDFN',
++	dfa => 'XDFA',
++	dfb => 'XDFB',
+ );
+ 
+ my $non_quoted_body = 'XNQ XDFN XDFA XDFB XDFHH XDFCTX XDFPRE XDFPOST';
+@@ -72,9 +75,6 @@ my %prob_prefix = (
+ 
+ 	q => 'XQUOT',
+ 	nq => $non_quoted_body,
+-	dfn => 'XDFN',
+-	dfa => 'XDFA',
+-	dfb => 'XDFB',
+ 	dfhh => 'XDFHH',
+ 	dfctx => 'XDFCTX',
+ 
+-- 
+^_^
diff --git a/t/v2mda.t b/t/v2mda.t
index 7df3a43..6145720 100644
--- a/t/v2mda.t
+++ b/t/v2mda.t
@@ -65,4 +65,14 @@ my $msgs = $ibx->search->query('');
 my $saved = $ibx->smsg_mime($msgs->[0]);
 is($saved->{mime}->as_string, $mime->as_string, 'injected message');
 
+my $patch = 't/data/0001.patch';
+open my $fh, '<', $patch or die "failed to open $patch: $!\n";
+$rdr = { 0 => fileno($fh) };
+ok(PublicInbox::Import::run_die(['public-inbox-mda'], undef, $rdr),
+	'mda delivered a patch');
+my $post = $ibx->search->reopen->query('dfpost:6e006fd7');
+is(scalar(@$post), 1, 'got one result for dfpost');
+my $pre = $ibx->search->query('dfpre:090d998');
+is(scalar(@$pre), 1, 'got one result for dfpre');
+is($post->[0]->{blob}, $pre->[0]->{blob}, 'same message in both cases');
 done_testing();
diff --git a/t/watch_maildir_v2.t b/t/watch_maildir_v2.t
index a76e413..fc002dc 100644
--- a/t/watch_maildir_v2.t
+++ b/t/watch_maildir_v2.t
@@ -120,6 +120,21 @@ More majordomo info at  http://vger.kernel.org/majordomo-info.html\n);
 	is($nr, 1, 'inbox has one mail after spamc OK-ed a message');
 	my $mref = $ibx->msg_by_smsg($msgs->[0]);
 	like($$mref, qr/something\n\z/s, 'message scrubbed on import');
+	delete $config->{'publicinboxwatch.spamcheck'};
+}
+
+{
+	my $patch = 't/data/0001.patch';
+	open my $fh, '<', $patch or die "failed to open $patch: $!\n";
+	$msg = eval { local $/; <$fh> };
+	PublicInbox::Emergency->new($maildir)->prepare(\$msg);
+	PublicInbox::WatchMaildir->new($config)->scan('full');
+	($nr, $msgs) = $srch->reopen->query('dfpost:6e006fd7');
+	is($nr, 1, 'diff postimage found');
+	my $post = $msgs->[0];
+	($nr, $msgs) = $srch->query('dfpre:090d998b6c2c');
+	is($nr, 1, 'diff preimage found');
+	is($post->{blob}, $msgs->[0]->{blob}, 'same message');
 }
 
 done_testing;
-- 
EW
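
The dfpre:/dfpost: queries in the tests above use abbreviations of the
blob IDs on the "index" line of t/data/0001.patch.  A small sketch of that
relationship follows; blob_ids_from_index_line is a hypothetical helper
written for this note, not a public-inbox function, and it says nothing
about how SearchIdx actually stores the terms.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of public-inbox): pull the preimage and
# postimage blob IDs out of a "git diff" index line.
sub blob_ids_from_index_line {
	my ($line) = @_;
	return unless $line =~
		/\Aindex ([0-9a-f]{7,40})\.\.([0-9a-f]{7,40})(?: [0-7]+)?\s*\z/;
	return ($1, $2);
}

my ($pre, $post) = blob_ids_from_index_line(
	'index 090d998b6c2c..6e006fd73b1d 100644');

# The abbreviated IDs queried in the tests are prefixes of these values.
for my $q (['dfpre', '090d998', $pre], ['dfpost', '6e006fd7', $post]) {
	my ($field, $abbrev, $full) = @$q;
	my $verdict = index($full, $abbrev) == 0
		? 'is a prefix of' : 'is NOT a prefix of';
	print "$field:$abbrev $verdict $full\n";
}
```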




Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
