author Eric Wong (Contractor, The Linux Foundation) <e@80x24.org> 2018-04-22 08:01:48 +0000
committer Eric Wong <e@80x24.org> 2018-04-22 08:02:13 +0000
commit a46893a2b5dabfdbcf7b593ac19967daecfb1772
tree 4b49778a165ec769a6412b07f965413567954c95 /t/psgi_search.t
parent 866837def71b9d70198f51e634e6141f75f0df3e
"LIKE" in SQLite (and other SQL implementations I've seen) is
expensive with nearly 3 million messages in the archives.

This caused some partial Message-ID lookups to take over 600ms
on my workstation (~300ms on a faster Xeon).  Cut that to under
30ms on average on my workstation by relying exclusively on
Xapian for partial Message-ID lookups, as we have in the past.

Unlike in the past when we tried using Xapian to match partial
Message-IDs, we now optimize our indexing of Message-IDs to
break apart "words" in Message-IDs for searching, yielding
(hopefully) "good enough" accuracy for folks who get long URLs
broken across lines when copy+pasting.
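
As a rough illustration of what breaking a Message-ID into "words"
could look like, here is a minimal standalone sketch.  It is not
the indexing code from this commit: the helper name mid_words and
the rule of splitting on every non-alphanumeric character are
assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT public-inbox's actual indexing code):
# split a Message-ID into alphanumeric "words" by treating every
# run of non-alphanumeric characters as a separator.
sub mid_words {
    my ($mid) = @_;
    # grep drops the empty leading field produced when the
    # Message-ID starts with a separator
    return grep { length } split(/[^A-Za-z0-9]+/, $mid);
}

my $mid = 'Pine.LNX.4.10.10010260936330.2460-100000@penguin.transmeta.com';
my @words = mid_words($mid);
print join(' ', @words), "\n";
```

With terms like these indexed, a truncated Message-ID only needs a
prefix match on its final, incomplete word instead of a "LIKE"
scan over every stored Message-ID.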

We'll also drop the (in retrospect) pointless stripping of
"/[tTf]" suffixes for the partial match, since anybody who
hits that codepath would be hitting an invalid message ID.

Finally, limit wildcard expansion to prevent easy DoS vectors
on short terms.

And blame Pine and alpine for generating Message-IDs with
low-entropy prefixes :P
Diffstat (limited to 't/psgi_search.t')
-rw-r--r-- t/psgi_search.t 23
1 files changed, 17 insertions, 6 deletions
diff --git a/t/psgi_search.t b/t/psgi_search.t
index 2f033016..a057a994 100644
--- a/t/psgi_search.t
+++ b/t/psgi_search.t
@@ -20,11 +20,14 @@ my $git_dir = "$tmpdir/a.git";
 is(0, system(qw(git init -q --bare), $git_dir), "git init (main)");
 my $rw = PublicInbox::SearchIdx->new($git_dir, 1);
 ok($rw, "search indexer created");
-my $data = <<'EOF';
+my $digits = '10010260936330';
+my $ua = 'Pine.LNX.4.10';
+my $mid = "$ua.$digits.2460-100000\@penguin.transmeta.com";
+my $data = <<"EOF";
 Subject: test
-Message-Id: <utf8@example>
-From: Ævar Arnfjörð Bjarmason <avarab@example>
-To: git@vger.kernel.org
+Message-ID: <$mid>
+From: Ævar Arnfjörð Bjarmason <avarab\@example>
+To: git\@vger.kernel.org
 
 EOF
 
@@ -37,8 +40,7 @@ foreach (reverse split(/\n\n/, $data)) {
         my $mime = Email::MIME->new(\$_);
         my $bytes = bytes::length($mime->as_string);
         my $doc_id = $rw->add_message($mime, $bytes, ++$num, 'ignored');
-        my $mid = $mime->header('Message-Id');
-        ok($doc_id, 'message added: '. $mid);
+        ok($doc_id, 'message added');
 }
 
 $rw->commit_txn_lazy;
@@ -72,6 +74,15 @@ test_psgi(sub { $www->call(@_) }, sub {
         $res = $cb->(POST('/test/?q=s:bogus&x=m'));
         is($res->code, 404, 'failed search result gives 404');
         is_deeply([], $warn, 'no warnings');
+
+        my $mid_re = qr/\Q$mid\E/o;
+        while (length($digits) > 8) {
+                $res = $cb->(GET("/test/$ua.$digits/"));
+                is($res->code, 300, 'partial match found while truncated');
+                like($res->content, qr/\b1 partial match found\b/);
+                like($res->content, $mid_re, 'found mid in response');
+                chop($digits);
+        }
 });
 
 done_testing();
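
To make the truncation loop in the last hunk concrete, the
following standalone sketch lists the request paths it generates.
It reuses the $ua and $digits values from the test but performs no
HTTP requests; the @paths array is introduced only for this
illustration.

```perl
use strict;
use warnings;

# Same values as in t/psgi_search.t
my $digits = '10010260936330';
my $ua = 'Pine.LNX.4.10';

# Collect the paths instead of requesting them (illustration only)
my @paths;
while (length($digits) > 8) {
    push @paths, "/test/$ua.$digits/";
    chop($digits); # drop the last digit, as the test does
}
print "$_\n" for @paths;
```

The loop runs six times, from the full 14-digit component down to
a 9-digit prefix, so the test expects every one of those truncated
URLs to yield a 300 response with exactly one partial match.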