Date | Commit message |
|
We can use id_batch in the common case to speed up full mbox
retrievals. Gigantic msets are still a problem, but will
be fixed in future commits.
|
|
While SQLite is faster than Xapian for some queries we
use, it sucks at handling OFFSET. Fortunately, we do
not need offsets when retrieving sorted results and
can bake the starting position into the query.
For inbox.comp.version-control.git (v1 Xapian),
XOVER and XHDR are over 20x faster.
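A minimal sketch of the idea, assuming a hypothetical "over" table keyed by
article number (the table name, columns, and batch size are illustrative and
not the actual schema). Instead of LIMIT/OFFSET, each batch resumes from the
last article number seen, so SQLite never has to walk over skipped rows:

```python
import sqlite3

def make_overview(n):
    """Build an in-memory table standing in for an overview DB (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE over (num INTEGER PRIMARY KEY, subject TEXT)")
    db.executemany(
        "INSERT INTO over (num, subject) VALUES (?, ?)",
        [(i, f"msg {i}") for i in range(1, n + 1)],
    )
    return db

def fetch_range(db, beg, end, batch=1000):
    """Yield rows with beg <= num <= end in ascending order, without OFFSET."""
    cur = beg
    while cur <= end:
        rows = db.execute(
            "SELECT num, subject FROM over "
            "WHERE num >= ? AND num <= ? ORDER BY num ASC LIMIT ?",
            (cur, end, batch),
        ).fetchall()
        if not rows:
            break
        yield from rows
        # Resume just past the last article number we saw.
        cur = rows[-1][0] + 1
```

Because "num" is the primary key, every batch is an index range scan whose
cost depends on the batch size rather than on how far into the range we are.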
|
|
In many cases, we do not care about the total number of
messages. It's a rather expensive operation in SQLite
(Xapian only provides an estimate).
For LKML, this brings top-level /$INBOX/ loading time from
~375ms to around 60ms on my system. Days ago, this operation
was taking 800-900ms(!) for me before introducing the SQLite
overview DB.
|
|
This ought to provide better performance and scalability
which is less dependent on inbox size. Xapian does not
seem optimized for some queries used by the WWW homepage,
Atom feeds, XOVER and NEWNEWS NNTP commands.
This can actually make Xapian optional for NNTP usage,
and allow more functionality to work without Xapian
installed.
Indexing performance was extremely bad at first, but
DBI::Profile helped me optimize away problematic queries.
|
|
We can store :bytes and :lines in doc_data since we never
sort or search by them. We don't have much use for the Date:
stamp at the moment, either.
|
|
-watch on a busy/giant Maildir caused too many Xapian
errors while attempting to browse.
|
|
This was causing errors while attempting to load messages via
the WWW interface while mass-importing LKML. While we're at it,
remove unnecessary eval from lookup_article.
|
|
We do not need this subroutine for read-only use in Search.pm
|
|
Too many similar functions doing the same basic thing was
redundant and misleading, especially since Message-ID is
no longer treated as a truly unique identifier.
For displaying threads in the HTML, this makes it clear
that we favor the primary Message-ID mapped to an NNTP
article number if a message cannot be found.
|
|
The only Xapian term which should be unique is the NNTP article
number; so we no longer need find_unique_doc_id.
|
|
This gives more up-to-date data and allows us
to avoid reopening in more places ourselves.
|
|
Since v2 supports duplicate messages, we need to support
looking up different messages with the same Message-Id.
Fortunately, our "raw" endpoint has always been mboxrd,
so users won't need to change their parsing tools.
|
|
We want to rely on Date: to sort messages within individual
threads since it keeps messages from git-send-email(1) sorted.
However, since developers occasionally have the clock set
wrong on their machines, sort overall messages by the newest
date in a Received: header so the landing page isn't forever
polluted by messages from the future.
This also gives us determinism for commit times in most cases,
as we'll use the Received: timestamp there, as well.
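A rough sketch of the two sort keys, using Python's standard email package
for illustration. The helper names date_ts, received_ts and overall_key are
hypothetical, and falling back to Date: when no Received: header parses is an
assumption of this sketch:

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

def _parse(stamp):
    """Return an epoch timestamp for an RFC 5322 date string, or None."""
    try:
        return parsedate_to_datetime(stamp.strip()).timestamp()
    except (TypeError, ValueError):
        return None

def date_ts(msg):
    """Sender-supplied Date: header, used for ordering within a thread."""
    hdr = msg.get("Date")
    return _parse(hdr) if hdr else None

def received_ts(msg):
    """Newest timestamp found after the last ';' of any Received: header."""
    newest = None
    for hdr in msg.get_all("Received", []):
        _, sep, stamp = hdr.rpartition(";")
        ts = _parse(stamp) if sep else None
        if ts is not None and (newest is None or ts > newest):
            newest = ts
    return newest

def overall_key(msg):
    """Key for the landing page: Received: if available, else Date:."""
    ts = received_ts(msg)
    return ts if ts is not None else date_ts(msg)
```

A message whose sender clock is years ahead still sorts by when a server
actually received it, while date_ts remains available for per-thread order.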
|
|
Makes life a little easier for V2Writable...
|
|
We do not need the large DBs for MID scans.
|
|
The skeleton DB is smaller and hit more frequently given the
homepage and per-message/thread views; so it will be hotter in
the page cache.
|
|
I guess nobody uses this command (slrnpull does not), and
the breakage was not noticed until I started writing new
tests for multi-MID handling.
Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")
|
|
Since Message-IDs are no longer unique within Xapian
(but are within the SQLite Msgmap); favor NNTP article
numbers for internal lookups. This will prevent us
from finding the "wrong" internal Message-ID.
|
|
When indexing diffs, we can avoid indexing the diff parts under
XNQ and instead combine the parts in the read-only search
interface. This results in better indexing performance and
10-15% smaller Xapian indices.
|
|
It's possible to have a message handle multiple terms;
so use this feature to ensure messages with multiple MIDs
can be found by either one.
|
|
'Q' is merely a convention in the Xapian world, and is close
enough to unique for practical purposes, so stop using XMID
and gain a little more term length as a result.
|
|
This is a bit expensive in a multi-process situation because
we need to make our indices and packs visible to the read-only
pieces.
|
|
It was making imports too noisy.
|
|
The skeleton DB is where we store all the information needed
for NNTP overviews via XOVER. This seems to be the only change
necessary (besides eventually handling duplicates)
to support our nntpd interface for v2 repositories.
|
|
Interchangeably using "all", "skel", "threader", etc. was
confusing. Standardize on the "skeleton" term to describe
this class since it's also used for retrieval of basic headers.
|
|
A different Xapian DB requires the use of a different Enquire
object. This is necessary for get_thread and thread skeleton
to work in the PSGI UI.
|
|
Any Xapian DB is subject to the same errors and retries.
Perhaps in the future this can be made more granular to avoid
unnecessary reopens.
|
|
Relying more on Xapian requires retrying reopens in more
places to ensure it does not fall down and show errors to
the user.
|
|
Fortunately, Xapian multiple database support makes things
easier but we still need to handle the skeleton DB separately.
|
|
The parallelization requires splitting Msgmap, text+term
indexing, and thread-linking out into separate processes.
git-fast-import is fast, so we don't bother parallelizing it.
Msgmap (SQLite) and thread-linking (Xapian) must be serialized
because they rely on monotonically increasing numbers (NNTP
article number and internal thread_id, respectively).
We handle msgmap in the main process which drives fast-import.
When the article number is retrieved/generated, we write the
entire message to per-partition subprocesses via pipes for
expensive text+term indexing.
When these per-partition subprocesses are done with the
expensive text+term indexing, they write SearchMsg (small data)
to a shared pipe (inherited from the main V2Writable process)
back to the threader, which runs its own subprocess.
The number of text+term Xapian partitions is chosen at import
and can be made equal to the number of cores in a machine.
V2Writable --> Import -> git-fast-import
           \-> SearchIdxThread -> Msgmap (synchronous)
           \-> SearchIdxPart[n] -> SearchIdx[*]
           \-> SearchIdxThread -> SearchIdx ("threader", a subprocess)
[*] each subprocess writes to threader
|
|
This is too slow, currently. Working with only 2017 LKML
archives:
git-only: ~1 minute
git + SQLite: ~12 minutes
git + Xapian + SQLite: ~45 minutes
So yes, it looks like we'll need to parallelize Xapian indexing,
at least.
|
|
In general, they are, but there's no way for a general-purpose
mail server to enforce that. This is a step in allowing us
to handle more corner cases which existing lists throw at us.
|
|
This will allow easier compatibility with v2 code which will
introduce content_id as the unique identifier.
The old "XMID" becomes "XM" as a free text searchable term.
"Q" becomes "XMID" as a boolean prefix.
There are no user-visible changes in this, but there needs to
be a schema version bump later on...
(more changes planned which can affect v1)
|
|
Using update-copyrights from gnulib
While we're at it, use the SPDX identifier for AGPL-3.0+ to
ease mechanical processing.
|
|
Since we attempt to fill in threads by Subject, our thread
skeletons can cross actual thread IDs, leading to the
possibility of false ghosts showing up in the skeleton.
Try to fill in the ghosts as well as possible by performing
a message lookup.
|
|
This can be tied into a repository browser to browse
in-flight topics on a mailing list.
|
|
This simplifies the code a bit and reduces the translation
overhead for looking directly at data from tools shipped
with Xapian.
While we're at it, fix thread-all.t :)
|
|
Due to the asynchronous nature of SMTP, it is possible for the
root message of a thread (with no References/In-Reply-To)
to arrive last in a series. We must preserve the thread_id
of the ghost message in this case, as we do when vivifying
non-root ghosts.
Otherwise, this causes threads to be broken when the root
arrives last.
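A toy model of the behavior described above, not the actual indexing code.
ThreadIndex and its fields are hypothetical, and merging of threads whose
IDs conflict is deliberately left out:

```python
class ThreadIndex:
    """Toy in-memory stand-in for thread_id assignment with ghosts."""

    def __init__(self):
        self.docs = {}     # Message-ID -> {"thread_id": int, "ghost": bool}
        self.next_tid = 1  # monotonically increasing thread_id

    def add(self, mid, refs=()):
        """Index a real message; refs are Message-IDs it references."""
        existing = self.docs.get(mid)
        if existing is not None and existing["ghost"]:
            # Vivify the ghost: keep its thread_id even if this message
            # is a root with no References/In-Reply-To of its own.
            tid = existing["thread_id"]
        else:
            known = (self.docs[r]["thread_id"] for r in refs if r in self.docs)
            tid = next(known, None)
            if tid is None:
                tid = self.next_tid
                self.next_tid += 1
        self.docs[mid] = {"thread_id": tid, "ghost": False}
        for r in refs:
            # Unknown references become ghosts in the same thread.
            self.docs.setdefault(r, {"thread_id": tid, "ghost": True})
        return tid
```

If the root allocated a fresh thread_id instead of reusing the ghost's, the
replies indexed earlier would be left behind in a different thread.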
|
|
I'm not sure if people use either and it's not in mairix
(where we base our abbreviations off of). Let's go
with the shorter prefix since it's easier to type.
|
|
We cannot distinguish between legitimate ghosts and mis-threaded
messages before commit 83425ef12e4b65cdcecd11ddcb38175d4a91d5a0
("searchidx: deal with empty In-Reply-To and References headers")
so we must rebuild the index in parallel to fix it.
|
|
This should fix problems with multipart messages where
text/plain parts lack a header.
cf. git clone --mirror https://github.com/rjbs/Email-MIME.git
refs/pull/28/head
In the future, we may still introduce a streaming
interface to reduce memory usage on large emails.
|
|
Apparently it never actually got used, and the world seems
fine without it, so we can drop it.
While we're at it, consider removing our subject_path
usage from existence, too. We are not using fancy subject-line
based URLs, here.
|
|
This is faster, smaller, and more straightforward to me with
fewer layers of indirection.
|
|
We call lookup_mail all over the place; be sure we can handle
database modifications in those cases.
|
|
Instead, only preload the ->mid field for threading,
as we only need ->thread and ->path once in Search->get_thread
(but we will need the ->mid field repeatedly).
This more than doubles View->load_results performance
according to thread-all on an inbox with over 300K messages.
|
|
In addition to needing to retry enquire queries, we also need
to protect document loading from the Xapian DB and retry on
modification, as it seems to throw the same errors.
Checking the $@ ref for Search::Xapian::DatabaseModifiedError
is actually in the test suite for both the XS and SWIG Xapian
bindings, so we should be good as far as forward/backwards
compatibility.
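The retry pattern can be sketched as below. This is illustrative Python, not
the Perl used here: DatabaseModifiedError is a local stand-in for the Xapian
exception, and retry_reopen, FlakyDB and max_tries are hypothetical names:

```python
class DatabaseModifiedError(Exception):
    """Stand-in for the error raised when a writer changed the DB under us."""

def retry_reopen(db, op, max_tries=10):
    """Run op(db); on DatabaseModifiedError, reopen the DB and try again."""
    for attempt in range(max_tries):
        try:
            return op(db)
        except DatabaseModifiedError:
            if attempt == max_tries - 1:
                raise  # give up instead of looping forever
            db.reopen()

class FlakyDB:
    """Fake DB whose document loads fail until it has been reopened enough."""

    def __init__(self, failures):
        self.failures = failures
        self.reopens = 0

    def reopen(self):
        self.reopens += 1

    def get_document(self, docid):
        if self.reopens < self.failures:
            raise DatabaseModifiedError("stale revision")
        return {"docid": docid}
```

The same wrapper covers both queries and document loads, since either can
hit a stale revision: retry_reopen(db, lambda d: d.get_document(42)).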
|
|
This makes life easier for the threading algorithm, as we can
use the implied ordering of timestamps to avoid temporary ghosts
and resulting container vivification.
This would've also allowed us to hide the bug (in most cases)
fixed by the patch titled "thread: last Reference always wins",
in case that needs to be reverted due to infinite looping.
|
|
Support (and document) 'a:' after all, as "mairix -h" uses it,
so this should reduce the learning curve for mairix users.
|
|
And while we're at it, ensure searching inside displayable
attachment bodies works.
|
|
Specifying the "d:" field only worked for
NumberValueRangeProcessor in older versions of Xapian, such
as the one in Debian wheezy (libsearch-xapian-perl=1.2.10.0-1).
This slipped through since I rarely use wheezy, anymore, and
perhaps nobody else does, either. Perhaps wheezy support may be
dropped, soon.
Unfortunately, this requires a schema version bump.
|