Date | Commit message |
|
|
|
Another small step towards terminology consistency with Xapian.
|
|
Another step towards keeping our internal data structures
consistent with Xapian naming.
|
|
No sense in supporting multiple methods of initialization
for an internal class.
|
|
In case some BOFH decides to randomly create directories
using non-ASCII digits all over the place.
|
|
Some users (or bots :P) can trigger horrible queries, which
the caller can choose to either log or ignore. This prevents
horrible queries from ExtMsg from logging confusing "ref: "
messages when $@ is not a Perl reference.
|
|
This assumes nobody uses flint or earlier backends anymore,
as flint predates the existence of this project.
|
|
* origin/xap-optional:
admin: improve warnings and errors for missing modules
searchidx: do not create empty Xapian partitions for basic
lazy load Xapian and make it optional for v2
www: use Inbox->over where appropriate
nntp: use Inbox->over directly
inbox: add ->over method to ease access
|
|
There probably needs to be an option to enable this
independently of indexlevel; but for now this is
the safest option.
And, as I discovered during the development of the
indexlevel option, Xapian does a pretty good job of
finding phrases without position data, anyways.
|
|
More tests work without Search::Xapian now.
Usability issues still need to be fixed.
|
|
We don't need to rely on Xapian search functionality for the
majority of the WWW code, even. subject_normalized is moved to
SearchMsg, where it (probably) makes more sense, anyways.
|
|
None of the NNTP code actually relies on Xapian, anymore.
|
|
I've hit some cases where probabilistic searches don't work
when using dfpre:/dfpost:/dfblob: search prefixes, because
stemming in the query parser interferes.
In any case, our indexing code indexes longer/unabbreviated blob
names down to their 7-character abbreviations, so there should be
no need to do wildcard searches on git blob names.
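The prefix indexing described here can be sketched roughly as follows (`blob_prefixes` is a hypothetical name for illustration, not the actual indexing code):

```python
def blob_prefixes(oid, min_len=7):
    """Generate every leading prefix of a git blob OID, from the full
    40-character name down to the 7-character abbreviation, so a query
    for any common abbreviation matches exactly, without wildcards."""
    return [oid[:n] for n in range(len(oid), min_len - 1, -1)]

prefixes = blob_prefixes("05b9d3ab6ae52e2e9e9b53a2d9289b9a50c6f5f0")
```

Indexing each abbreviation as its own term trades a little index space for exact-match lookups that never depend on wildcard expansion.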
|
|
Previous search queries already set the sort order on the Enquire
object, altering the ordering of results and causing
messages to be redundantly downloaded via POST /$INBOX/?q=$QUERY&x=m
So stop caching the Search::Xapian::Enquire object, since it
wasn't providing any measurable performance improvement.
|
|
"LIKE" in SQLite (and other SQL implementations I've seen) is
expensive with nearly 3 million messages in the archives.
This caused some partial Message-ID lookups to take over 600ms
on my workstation (~300ms on a faster Xeon). Cut that to
under 30ms on average on my workstation by relying exclusively
on Xapian for partial Message-ID lookups as we have in the past.
Unlike in the past when we tried using Xapian to match partial
Message-IDs, we now optimize our indexing of Message-IDs to
break apart "words" in Message-IDs for searching, yielding
(hopefully) "good enough" accuracy for folks who get long URLs
broken across lines when copy+pasting.
We'll also drop the (in retrospect) pointless stripping of
"/[tTf]" suffixes for the partial match, since anybody who
hits that codepath would be hitting an invalid message ID.
Finally, limit wildcard expansion to prevent easy DoS vectors
on short terms.
And blame Pine and alpine for generating Message-IDs with
low-entropy prefixes :P
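The word-splitting idea can be sketched like this (`msgid_words` and the 2-character minimum are illustrative assumptions, not the real indexing code):

```python
import re

def msgid_words(mid, min_len=2):
    """Break a Message-ID apart on non-alphanumeric characters so a
    partial ID (e.g. one truncated by a copy+pasted line wrap) still
    matches some of the indexed words."""
    return [w for w in re.split(r'[^A-Za-z0-9]+', mid) if len(w) >= min_len]

words = msgid_words('Pine.LNX.4.64.0804111223320.12345@example.org')
```

A partial query can then match on the surviving words; capping wildcard expansion separately bounds the cost of very short terms.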
|
|
Favor simpler internal APIs this time around, this cuts
a fair amount of code out and takes another step towards
removing Xapian as a dependency for v2 repos.
|
|
Dscho found this useful for finding matching git commits based
on AuthorDate in git. Add it to the overview DB format, too;
so in the future we can support v2 repos without Xapian.
https://public-inbox.org/git/nycvar.QRO.7.76.6.1804041821420.55@ZVAVAG-6OXH6DA.rhebcr.pbec.zvpebfbsg.pbz
https://public-inbox.org/git/alpine.DEB.2.20.1702041206130.3496@virtualbox/
|
|
Sorting large msets is a waste when it comes to mboxes
since MUAs should thread and sort them as the user desires.
This forces us to rework each of the mbox download mechanisms
to be more independent of each other, but might make things
easier to reason about.
|
|
This was vestigial code from the switch to the overview DB.
|
|
We can use id_batch in the common case to speed up full mbox
retrievals. Gigantic msets are still a problem, but will
be fixed in future commits.
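The batching idea generalizes easily; a minimal illustration (the real id_batch lives in the Perl code):

```python
def id_batch(ids, batch_size=1000):
    """Yield chunks of article numbers so a full mbox retrieval issues
    a few large queries instead of one query per message."""
    for i in range(0, len(ids), batch_size):
        yield ids[i:i + batch_size]

batches = list(id_batch(list(range(2500))))
```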
|
|
While SQLite is faster than Xapian for some queries we
use, it sucks at handling OFFSET. Fortunately, we do
not need offsets when retrieving sorted results and
can bake it into the query.
For inbox.comp.version-control.git (v1 Xapian),
XOVER and XHDR are over 20x faster.
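The technique is essentially keyset pagination; a minimal SQLite sketch (table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE over (num INTEGER PRIMARY KEY, subject TEXT)')
db.executemany('INSERT INTO over VALUES (?, ?)',
               [(n, 'msg') for n in range(1, 10001)])

# Slow: OFFSET makes SQLite walk and discard all the skipped rows.
slow = db.execute(
    'SELECT num FROM over ORDER BY num LIMIT 50 OFFSET 9000').fetchall()

# Fast: bake the previous page's last article number into the WHERE
# clause so the primary-key index seeks straight to the page start.
fast = db.execute(
    'SELECT num FROM over WHERE num > ? ORDER BY num LIMIT 50',
    (9000,)).fetchall()
```

Both queries return the same rows, but the second one does work proportional to the page size rather than to the offset.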
|
|
In many cases, we do not care about the total number of
messages. It's a rather expensive operation in SQLite
(Xapian only provides an estimate).
For LKML, this brings top-level /$INBOX/ loading time from
~375ms to around 60ms on my system. Days ago, this operation
was taking 800-900ms(!) for me before introducing the SQLite
overview DB.
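One common way to avoid COUNT(*) is to fetch one row beyond the page size and use its presence as the "more results" flag; a sketch under assumed table names:

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE over (num INTEGER PRIMARY KEY)')
db.executemany('INSERT INTO over VALUES (?)', [(n,) for n in range(1, 5001)])

def page(db, prev_num, lim=200):
    """Fetch lim+1 rows: the extra row only signals that a 'next' link
    is needed, so we never run a full-table COUNT(*)."""
    rows = db.execute('SELECT num FROM over WHERE num > ? '
                      'ORDER BY num LIMIT ?', (prev_num, lim + 1)).fetchall()
    return [r[0] for r in rows[:lim]], len(rows) > lim

nums, more = page(db, 0)
```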
|
|
This ought to provide better performance and scalability
which is less dependent on inbox size. Xapian does not
seem optimized for some queries used by the WWW homepage,
Atom feeds, XOVER and NEWNEWS NNTP commands.
This can actually make Xapian optional for NNTP usage,
and allow more functionality to work without Xapian
installed.
Indexing performance was extremely bad at first, but
DBI::Profile helped me optimize away problematic queries.
|
|
We can store :bytes and :lines in doc_data since we never
sort or search by them. We don't have much use for the Date:
stamp at the moment, either.
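The idea can be sketched as packing non-searchable fields into a single per-document blob (JSON here purely for illustration; the real doc_data layout is an internal detail):

```python
import json

def pack_doc_data(subject, nbytes, nlines):
    """Fields we never sort or search by can live in one opaque blob
    stored with the document, instead of dedicated sortable slots."""
    return json.dumps([subject, nbytes, nlines])

def unpack_doc_data(blob):
    subject, nbytes, nlines = json.loads(blob)
    return subject, nbytes, nlines

blob = pack_doc_data('[PATCH] hello', 1234, 42)
```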
|
|
-watch on a busy/giant Maildir caused too many Xapian
errors while attempting to browse.
|
|
This was causing errors while attempting to load messages via
the WWW interface while mass-importing LKML. While we're at it,
remove unnecessary eval from lookup_article.
|
|
We do not need this subroutine for read-only use in Search.pm
|
|
Too many similar functions doing the same basic thing was
redundant and misleading, especially since Message-ID is
no longer treated as a truly unique identifier.
For displaying threads in the HTML, this makes it clear
that we favor the primary Message-ID mapped to an NNTP
article number if a message cannot be found.
|
|
The only Xapian term which should be unique is the NNTP article
number; so we no longer need find_unique_doc_id.
|
|
This gives more up-to-date data and allows us
to avoid reopening in more places ourselves.
|
|
Since v2 supports duplicate messages, we need to support
looking up different messages with the same Message-Id.
Fortunately, our "raw" endpoint has always been mboxrd,
so users won't need to change their parsing tools.
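For reference, the defining property of mboxrd is its reversible "From " quoting; a small sketch:

```python
import re

def mboxrd_quote(body):
    """mboxrd prefixes '>' to any line matching /^>*From /, including
    already-quoted lines, so unquoting is lossless."""
    return re.sub(r'^(>*From )', r'>\1', body, flags=re.M)

def mboxrd_unquote(body):
    return re.sub(r'^>(>*From )', r'\1', body, flags=re.M)

quoted = mboxrd_quote('From the start\n>From quoted\nbody\n')
```

Because the transform round-trips exactly, duplicate messages sharing a Message-Id can be concatenated into one mboxrd stream without ambiguity.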
|
|
We want to rely on Date: to sort messages within individual
threads since it keeps messages from git-send-email(1) sorted.
However, since developers occasionally have the clock set
wrong on their machines, sort overall messages by the newest
date in a Received: header so the landing page isn't forever
polluted by messages from the future.
This also gives us determinism for commit times in most cases,
as we'll use the Received: timestamp there, as well.
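The split in sort keys can be illustrated with Python's email utilities (the header values below are made up):

```python
from email.utils import parsedate_to_datetime

# Made-up headers: Date: comes from the sender's (possibly wrong)
# clock, while Received: is stamped by the receiving server.
date_hdr = 'Sat, 01 Jan 2050 00:00:00 +0000'   # clock set in the future
recv_hdr = 'Mon, 02 Apr 2018 18:21:42 +0000'

thread_key = parsedate_to_datetime(date_hdr)   # ordering within a thread
landing_key = parsedate_to_datetime(recv_hdr)  # ordering on the landing page
```

Using the Received: time for the landing page keeps the future-dated message from pinning itself to the top forever, while the Date: value still orders it sensibly inside its own thread.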
|
|
Makes life a little easier for V2Writable...
|
|
We do not need the large DBs for MID scans.
|
|
The skeleton DB is smaller and hit more frequently given the
homepage and per-message/thread views; so it will be hotter in
the page cache.
|
|
I guess nobody uses this command (slrnpull does not), and
the breakage was not noticed until I started writing new
tests for multi-MID handling.
Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")
|
|
Since Message-IDs are no longer unique within Xapian
(but are within the SQLite Msgmap); favor NNTP article
numbers for internal lookups. This will prevent us
from finding the "wrong" internal Message-ID.
|
|
When indexing diffs, we can avoid indexing the diff parts under
XNQ and instead combine the parts in the read-only search
interface. This results in better indexing performance and
10-15% smaller Xapian indices.
|
|
It's possible for a message to have multiple terms;
so use this feature to ensure messages with multiple MIDs
can be found by any of them.
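A toy model of the boolean-term lookup (Python dicts standing in for Xapian's term index):

```python
# Each document may carry several boolean terms with the same prefix;
# here, one message with two Message-IDs gets two 'Q'-prefixed terms.
index = {}  # term -> set of document ids

def add_document(docid, mids):
    for mid in mids:
        index.setdefault('Q' + mid, set()).add(docid)

def find_by_mid(mid):
    return index.get('Q' + mid, set())

add_document(1, ['a@example.com', 'b@example.com'])
```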
|
|
'Q' is merely a convention in the Xapian world, and is close
enough to unique for practical purposes, so stop using XMID
and gain a little more term length as a result.
|
|
This is a bit expensive in a multi-process situation because
we need to make our indices and packs visible to the read-only
pieces.
|
|
It was making imports too noisy.
|
|
The skeleton DB is where we store all the information needed
for NNTP overviews via XOVER. This seems to be the only change
(besides eventually handling duplicates) necessary
to support our nntpd interface for v2 repositories.
|
|
Interchangeably using "all", "skel", "threader", etc. was
confusing. Standardize on the "skeleton" term to describe
this class since it's also used for retrieval of basic headers.
|
|
A different Xapian DB requires the use of a different Enquire
object. This is necessary for get_thread and thread skeleton
to work in the PSGI UI.
|
|
Any Xapian DB is subject to the same errors and retries.
Perhaps in the future this can be made more granular to avoid
unnecessary reopens.
|
|
Relying more on Xapian requires retrying reopens in more
places to ensure it does not fall down and show errors to
the user.
|
|
Fortunately, Xapian multiple database support makes things
easier but we still need to handle the skeleton DB separately.
|
|
The parallelization requires splitting Msgmap, text+term
indexing, and thread-linking out into separate processes.
git-fast-import is fast, so we don't bother parallelizing it.
Msgmap (SQLite) and thread-linking (Xapian) must be serialized
because they rely on monotonically increasing numbers (NNTP
article number and internal thread_id, respectively).
We handle msgmap in the main process which drives fast-import.
When the article number is retrieved/generated, we write the
entire message to per-partition subprocesses via pipes for
expensive text+term indexing.
When these per-partition subprocesses are done with the
expensive text+term indexing, they write SearchMsg (small data)
to a shared pipe (inherited from the main V2Writable process)
back to the threader, which runs its own subprocess.
The number of text+term Xapian partitions is chosen at import
and can be made equal to the number of cores in a machine.
V2Writable --> Import -> git-fast-import
           \-> Msgmap (synchronous)
           \-> SearchIdxPart[n] -> SearchIdx [*]
           \-> SearchIdxThread -> SearchIdx ("threader", a subprocess)
[*] each SearchIdxPart subprocess writes to the threader
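The fan-out can be modeled minimally as routing each serially-assigned article number to a partition (modulo routing is an assumption here; the real partition choice is an implementation detail):

```python
NPART = 4  # number of text+term indexing partitions, e.g. CPU cores

def partition_for(artnum):
    """Pick which SearchIdxPart-style subprocess indexes a message;
    article numbers themselves are assigned serially (as with Msgmap)."""
    return artnum % NPART

routed = {}  # partition -> article numbers it will index
for num in range(1, 9):
    routed.setdefault(partition_for(num), []).append(num)
```

Only the number assignment and thread-linking stay serial; the expensive text+term work spreads evenly across the partitions.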
|
|
This is too slow, currently. Working with only 2017 LKML
archives:
git-only: ~1 minute
git + SQLite: ~12 minutes
git+Xapian+SQLite: ~45 minutes
So yes, it looks like we'll need to parallelize Xapian indexing,
at least.
|