Date | Commit message (Collapse) |
|
This reduces code duplication needed for locking and
and hopefully makes things easier to understand.
|
|
We need to hide removals from anybody hitting the search engine.
|
|
Followup-to: ebb59815035b42c2
("searchidx: do not modify Xapian DB while iterating")
|
|
We'll let the config of all.git dictate every other subrepo to
ease maintenance and configuration. The "include" directive has
been supported since git 1.7.10, so it's safe to depend on as v2
requires git 2.6.0+ anyways for "get-mark" in fast-import.
|
|
We can't rely on header order for Message-ID after all
since we fall back to existing MIDs if they exist and
are unseen. This lets us use SearchMsg->mid to get the
MID we associated with the NNTP article number to ensure
all NNTP article lookups roundtrip correctly.
|
|
Since we support duplicate MIDs in v2, we can safely truncate
long MID terms in the database and let other normal duplicate
resolution sort it out. It seems only spammers use excessively
long MIDs, and there'll always be abuse/misuse vectors for causing
mis-threaded messages, so it's not worth worrying about
excessively long MIDs.
|
|
Since we support duplicate MIDs in v2, the NNTP article number
becomes the true unique identifier and we want a way to do fast
lookups on it.
While we're at it, stop putting XPATH in the term partitions
since we only need it in the skeleton DB.
|
|
Aside from the Message-Id ('Q'), these terms do not appear in
content and thus have no business contributing to the Xapian
document length.
Thanks-to Olly Betts for the tip on xapian-discuss
<20180228004400.GU12724@survex.com>
|
|
When indexing diffs, we can avoid indexing the diff parts under
XNQ and instead combine the parts in the read-only search
interface. This results in better indexing performance and
10-15% smaller Xapian indices.
|
|
It's possible to have a message handle multiple terms;
so use this feature to ensure messages with multiple MIDs
can be found by either one.
|
|
'Q' is merely a convention in the Xapian world, and is close
enough to unique for practical purposes, so stop using XMID
and gain a little more term length as a result.
|
|
It's shorter and more convenient, here.
|
|
This is a bit expensive in a multi-process situation because
we need to make our indices and packs visible to the read-only
pieces.
|
|
|
|
Iterating through a list of documents while modifying them does
not seem to be supported in Xapian and it can trigger
DatabaseCorruptError exceptions. This only worked with past
datasets out of dumb luck. With the work-in-progress "v2"
public-inbox layout, this problem might become more visible
as the "thread skeleton" is partitioned out to a separate,
smaller Xapian database.
I've reproduced the problem on both Debian 8.x and 9.x with
Xapian 1.2.19 (chert backend) and 1.4.3 (glass backend)
respectively.
|
|
Interchangably using "all", "skel", "threader", etc. were
confusing. Standardize on the "skeleton" term to describe
this class since it's also used for retrieval of basic headers.
|
|
We will need timestamp, YYYYMMDD, article number, and line count
for querying thread information (including XOVER for NNTP).
|
|
This used to lookup the message in git, but no longer, so
remove a needless indirection layer and call add_message
directly.
|
|
It works around some bugs in older Email::MIME which we'll
find useful.
|
|
This should give us an idea of how much a problem deduplication
will be.
|
|
The parallelization requires splitting Msgmap, text+term
indexing, and thread-linking out into separate processes.
git-fast-import is fast, so we don't bother parallelizing it.
Msgmap (SQLite) and thread-linking (Xapian) must be serialized
because they rely on monotonically increasing numbers (NNTP
article number and internal thread_id, respectively).
We handle msgmap in the main process which drives fast-import.
When the article number is retrieved/generated, we write the
entire message to per-partition subprocesses via pipes for
expensive text+term indexing.
When these per-partition subprocesses are done with the
expensive text+term indexing, they write SearchMsg (small data)
to a shared pipe (inherited from the main V2Writable process)
back to the threader, which runs its own subprocess.
The number of text+term Xapian partitions is chosen at import
and can be made equal to the number of cores in a machine.
V2Writable --> Import -> git-fast-import
\-> SearchIdxThread -> Msgmap (synchronous)
\-> SearchIdxPart[n] -> SearchIdx[*]
\-> SearchIdxThread -> SearchIdx ("threader", a subprocess)
[* ] each subprocess writes to threader
|
|
This is too slow, currently. Working with only 2017 LKML
archives:
git-only: ~1 minute
git + SQLite: ~12 minutes
git+Xapian+SQlite: ~45 minutes
So yes, it looks like we'll need to parallelize Xapian indexing,
at least.
|
|
In general, they are, but there's no way for or general purpose
mail server to enforce that. This is a step in allowing us
to handle more corner cases which existing lists throw at us.
|
|
I decided not to copy the notmuch implementation regarding
serialization of integers to Xapian metadata.
|
|
This will allow easier-compatibility with v2 code which will
introduce content_id as the unique identifier.
The old "XMID" becomes "XM" as a free text searchable term.
"Q" becomes "XMID" as a boolean prefix.
There's no user-visible changes in this, but there needs to
be a schema version bump later on...
(more changes planned which can affect v1)
|
|
Using update-copyrights from gnulib
While we're at it, use the SPDX identifier for AGPL-3.0+ to
ease mechanical processing.
|
|
We should not blindly join References and In-Reply-To headers
as a single string, because some messages can have an open
angle brace '<' in References: without a corresponding '>'.
|
|
Yet another hiccup from reusing pre-set article numbers on
various ruby-lang.org mailing lists. This was causing messages
to not appear to NNTP readers which use XOVER.
|
|
This fixes a bug introduced in
commit 7eeadcb62729b0efbcb53cd9b7b181897c92cf9a
("search: remove unnecessary abstractions and functionality")
|
|
This can be tied into a repository browser to browse
in-flight topics on a mailing list.
|
|
Xapian memory usage is tied to the size of the indexed
text, so take the raw message size into account when
deciding when to flush Xapian data.
More importantly, we now flush Xapian before we have it
buffer beyond our maximum; and we do it unconditionally
to prevent even high priority processes from OOM-ing.
|
|
This simplifies the code a bit and reduces the translation
overhead for looking directly at data from tools shipped
with Xapian.
While we're at it, fix thread-all.t :)
|
|
umask should never fail and set $@, but use the cached local
to be more explicit just in case.
|
|
Due to the asynchronous nature of SMTP, it is possible for the
root message of a thread (with no References/In-Reply-To)
to arrive last in a series. We must preserve the thread_id
of the ghost message in this case, as we do when vivifiying
non-root ghosts.
Otherwise, this causes threads to be broken when the root
arrives last.
|
|
It seems possible for git-send-email(1) to generate repeated
repeated instances of References and In-Reply-To headers,
as evidenced in:
https://public-inbox.org/git/20161111124541.8216-17-vascomalmeida@sapo.pt/raw
This causes a mismatch between how our search indexer threads
and how our HTML view handles threading. In the future, View.pm
will use the smsg-parsed {references} field and avoid redoing
Email::MIME header parsing.
We will still need to figure out a way to deal with messages
with repeated Message-IDs, at some point, too.
|
|
Oops, that's broken, too. I guess the only way to reindex
after fixing the thread detection is to start from scratch.
This reverts commit 5d91adedf5f33ef1cb87df2a86306ddf370b4f8d.
|
|
We cannot always reuse thread IDs since our threading
logic may change as bugs are fixed.
|
|
In some messages, these headers exist, but have empty values.
Do not let empty values throw off our search indexer to tie
threads together, as it can make non-sensical threads grouped
to a Message-Id of "" (empty string).
See
<https://public-inbox.org/git/11340844841342-git-send-email-mailing-lists.git@rawuncut.elitemail.org/raw>
for an example of such a message.
Thanks-to: Johannes Schindelin <Johannes.Schindelin@gmx.de>
<https://public-inbox.org/git/alpine.DEB.2.20.1702041206130.3496@virtualbox/>
|
|
This should fix problems with multipart messages where
text/plain parts lack a header.
cf. git clone --mirror https://github.com/rjbs/Email-MIME.git
refs/pull/28/head
In the future, we may still introduce as streaming
interface to reduce memory usage on large emails.
|
|
This is faster, smaller, and more straighforward to me with
fewer layers of indirection.
|
|
Some email clients set the References headers backwards, so
trust the In-Reply-To header if (and only if) it exists and
is parseable as direct parent of the current message.
For affected repos, this will require reindexing (via
"public-inbox-index --reindex"), but there will be no
version bump for this bugfix.
|
|
Introduce our own SearchThread class for threading messages.
This should allow us to specialize and optimize away objects
in future commits.
|
|
And while we're at it, ensure searching inside displayable
attachment bodies works.
|
|
The basic rule is that if it is displayable via our WWW
interface, it should be indexable text for Xapian search.
|
|
It's not worth entering a complex codepath in Email::MIME to
save some (probably immeasurable amount of) memory, here. We've
already stopped doing this in our WWW code a while back, too.
If we really cared enough about it, we'd prioritize work on a
streaming replacement for Email::MIME.
|
|
Specifying the "d:" field only worked for
NumberValueRangeProcessor in older versions of Xapian, such
as the one in Debian wheezy (libsearch-xapian-perl=1.2.10.0-1)
This slipped through since I rarely use wheezy, anymore, and
perhaps nobody else does, either. Perhaps wheezy support may be
dropped, soon.
Unfortunately, this requires a schema version bump.
|
|
We pay a storage cost for storing positional information
in Xapian, make good use of it by attempting to preserve
it for (hopefully) better search results.
|
|
This is stricter than the mutt quote_regexp default
("^([ \t]*[|>:}#])+" on Debian jessie),
but matches what we have in View.pm.
I prefer the stricter quote detection since it is less ambiguous
and less likely to hide/obscure important details.
|
|
As of Xapian 1.0.4 (from 2007) is possible to use
Search::Xapian::QueryParser::add_prefix multiple times with the
same user field name but different term prefixes.
This brings my current git@vger mirror from 6.5GB to 2.1GB
(both sizes are after xapian-compact).
|
|
"bs:" and "b:" are adapted from mairix(1)
We will also support searching explicitly for quoted vs
non-quoted text via "q:" and "nq:" prefixes since sometimes
readers will not care for quoted text.
In the future, we will support parsing diffs (perhaps when
repobrowse integration is complete).
Note: this roughly doubles the size of the Xapian database due
to the additional information; so this change may not be worth
it.
|