Date | Commit message (Collapse) |
|
Email::Simple is slightly faster this way, and Email::MIME
and PublicInbox::MIME both wrap that.
|
|
Stopping and starting a bunch of processes to look up duplicates
or removals is inefficient. Take advantage of checkpointing
in "git fast-import" and transactions in Xapian and SQLite.
|
|
We will be using Sender: in more places if the From: header
is not available; this is one of them.
Followup-to: ("import: fall back to Sender for extracting name and email")
|
|
The current inbox is more important for partial Message-ID
matching, so we try harder on that to fix common errors before
moving onto other inboxes. Then, prevent expensive scanning of
other inboxes by requiring a Message-ID length of at least 16
bytes.
Finally, we limit the overall partial responses to 200 when
scanning other inboxes to avoid excessive memory usage.
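
The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `partial_matches`, the two constants, and the plain-list inboxes are hypothetical names, and substring matching stands in for whatever partial matching the real lookup does.

```python
MIN_PARTIAL_LEN = 16       # bytes required before other inboxes are scanned
MAX_PARTIAL_RESULTS = 200  # cap on the overall number of partial responses


def partial_matches(mid, current, others):
    """Return known Message-IDs containing `mid`.

    `current` is an iterable of Message-IDs in the current inbox and
    `others` is an iterable of such iterables (both are stand-ins).
    """
    # The current inbox is always searched, however short the query is.
    found = [m for m in current if mid in m]
    # Short queries never trigger the expensive scan of other inboxes.
    if len(mid.encode("utf-8")) < MIN_PARTIAL_LEN:
        return found
    for inbox in others:
        for m in inbox:
            if mid in m:
                found.append(m)
                if len(found) >= MAX_PARTIAL_RESULTS:
                    return found  # bound memory usage
    return found
```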
|
|
We need to detect the number of partitions the repository was
created with to ensure Xapian DBs can work across different
machines (or even CPU affinity changes) without leaving messages
unaffected by search.
|
|
It appears most of the mboxes in the archive I've been given are
mboxrd (despite having Content-Length:) and need the escaping.
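
For reference, mboxrd quoting prefixes any body line matching "From " (optionally preceded by ">" characters) with one extra ">". A minimal sketch of reversing that on read, with the hypothetical helper name `mboxrd_unescape`:

```python
import re

# A line of zero or more '>' followed by "From " was escaped by adding
# one more '>' in front; strip exactly that one leading '>'.
_MBOXRD_FROM = re.compile(rb"^>(>*From )", re.MULTILINE)


def mboxrd_unescape(body: bytes) -> bytes:
    """Undo mboxrd ">From " quoting in a message body."""
    return _MBOXRD_FROM.sub(rb"\1", body)
```

Ordinary quoted text such as "> quoted" is left alone because it does not match the ">From " pattern.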
|
|
This seems like a reasonable course of action for old messages.
Cc: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
|
|
The first Received: header is believable since it typically
hits the user's mail server and can be treated as relatively
trustworthy. We still show the Date: in per-message (permalink)
views, which may expose users who have incorrect Date:
headers, but all the ISO YYYY-MM-DD dates we display will
match what we see.
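
A rough sketch of deriving a YYYY-MM-DD date from a Received: header. It assumes "first" means the topmost Received: header in the message and that the timestamp follows the last ";" in the header value; `received_date` is a hypothetical helper, not the project's function.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def received_date(raw: str):
    """Return the ISO date from the topmost parseable Received: header."""
    msg = message_from_string(raw)
    for value in msg.get_all("Received", []):
        # The timestamp conventionally follows the final ';'.
        _, sep, stamp = value.rpartition(";")
        if not sep:
            continue
        try:
            return parsedate_to_datetime(stamp.strip()).date().isoformat()
        except (TypeError, ValueError):
            continue  # unparseable timestamp, try the next header
    return None
```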
|
|
Not a big deal since we still commit to the skeleton for every
single partition (barrier work abandoned).
|
|
We do not need the large DBs for MID scans.
|
|
The skeleton DB is smaller and hit more frequently given the
homepage and per-message/thread views; so it will be hotter in
the page cache.
|
|
I've missed a few things over time :x
|
|
We'll let the config of all.git dictate every other subrepo to
ease maintenance and configuration. The "include" directive has
been supported since git 1.7.10, so it's safe to depend on as v2
requires git 2.6.0+ anyways for "get-mark" in fast-import.
|
|
It's easier to store everything in one array ref similar
to what our Git->check routine returns.
|
|
We can't rely on header order for Message-ID after all
since we fall back to existing MIDs if they exist and
are unseen. This lets us use SearchMsg->mid to get the
MID we associated with the NNTP article number to ensure
all NNTP article lookups roundtrip correctly.
|
|
I guess nobody uses this command (slrnpull does not), and
the breakage was not noticed until I started writing new
tests for multi-MID handling.
Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")
|
|
Since Message-IDs are no longer unique within Xapian
(but are within the SQLite Msgmap); favor NNTP article
numbers for internal lookups. This will prevent us
from finding the "wrong" internal Message-ID.
|
|
Since we support duplicate MIDs in v2, we can safely truncate
long MID terms in the database and let other normal duplicate
resolution sort it out. It seems only spammers use excessively
long MIDs, and there'll always be abuse/misuse vectors for causing
mis-threaded messages, so it's not worth worrying about
excessively long MIDs.
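
A small sketch of the truncation idea. The 245-byte limit is an assumption about the backend's maximum term length, and `mid_term` is a hypothetical helper; two overlong MIDs sharing the same leading bytes simply become duplicates of each other.

```python
MAX_TERM_BYTES = 245  # assumed maximum term length for the backend


def mid_term(mid: str, prefix: str = "Q") -> bytes:
    """Build a prefixed Message-ID term, truncated to the term limit."""
    raw = (prefix + mid).encode("utf-8")
    # Truncation can map distinct long MIDs to one term; duplicate
    # resolution is expected to sort that out elsewhere.
    return raw[:MAX_TERM_BYTES]
```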
|
|
Since we support duplicate MIDs in v2, the NNTP article number
becomes the true unique identifier and we want a way to do fast
lookups on it.
While we're at it, stop putting XPATH in the term partitions
since we only need it in the skeleton DB.
|
|
Aside from the Message-Id ('Q'), these terms do not appear in
content and thus have no business contributing to the Xapian
document length.
Thanks-to Olly Betts for the tip on xapian-discuss
<20180228004400.GU12724@survex.com>
|
|
This is to make SearchMsg behave more sanely under NNTP.
|
|
It's tempting to rely on the atomicity of smaller-than-PIPE_BUF
writes, but it doesn't work if mixed with larger ones.
|
|
When indexing diffs, we can avoid indexing the diff parts under
XNQ and instead combine the parts in the read-only search
interface. This results in better indexing performance and
10-15% smaller Xapian indices.
|
|
Traditionally we've been more lax on parsing Message-Id
and allow it without the angle brackets. We've always been
strict on References and can't have it be pointlessly
large when some MUA decides to use HTML-escaped angle
brackets ("&lt;", "&gt;").
|
|
It's possible to have a message handle multiple terms;
so use this feature to ensure messages with multiple MIDs
can be found by either one.
|
|
'Q' is merely a convention in the Xapian world, and is close
enough to unique for practical purposes, so stop using XMID
and gain a little more term length as a result.
|
|
Since we'll need to support multiple Message-IDs anyways,
inject a new one if we hit a duplicate (or don't get one at
all).
Try to use a deterministic Message-Id for consistency, but give
up determinism and use a random Message-Id if an "attacker"
wants to prevent their message from being archived.
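
One way to picture the fallback, as a sketch only: derive a Message-ID from a digest of the message, and switch to a random one if that derived ID is already taken. The name `alternate_mid`, the `seen` set, the SHA-1 choice, and the `localhost` domain are all illustrative assumptions.

```python
import hashlib
import secrets


def alternate_mid(raw_message: bytes, seen: set, host: str = "localhost") -> str:
    """Pick a replacement Message-ID for a duplicate or missing one."""
    # Deterministic first choice: the same message always yields the same ID.
    mid = f"{hashlib.sha1(raw_message).hexdigest()}@{host}"
    if mid in seen:
        # Someone pre-occupied the predictable ID; give up determinism.
        mid = f"{secrets.token_hex(20)}@{host}"
    return mid
```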
|
|
We merely use this for internal comparisons and do not store
this in Xapian. So using a shorter, non-human readable digest
is enough. Furthermore, introduce "content_digest" which
returns the Digest::SHA object for extra changes.
|
|
It's shorter and more convenient, here.
|
|
These already take care of deduping internally, so we'll save
ourselves at least some of the trouble while using a more
consistent API. While we're at it, hash the header name as
well, since we need to distinguish which header a certain value
came from.
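
To illustrate why the header name goes into the hash: without it, the same value under two different headers would produce identical digests. This sketch uses hypothetical names (`content_digest` taking a list of name/value pairs) and SHA-256 as an arbitrary choice; it returns the digest object so callers can feed in more data.

```python
import hashlib


def content_digest(headers, body: bytes):
    """Hash (name, value) header pairs plus the body; return the hash object."""
    dig = hashlib.sha256()
    for name, value in headers:
        # NUL separators keep "ab"+"c" distinct from "a"+"bc".
        dig.update(name.lower().encode("utf-8") + b"\0")
        dig.update(value.encode("utf-8") + b"\0")
    dig.update(body)
    return dig
```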
|
|
We'll be using a more consistent API for extracting Message-IDs
from various headers.
|
|
This was creating an unnecessary epoll descriptor via
Danga::Socket when using V2Writable to import a mbox. That
said, there should probably be a better way of detecting whether
or not we're inside a Danga::Socket event loop.
Fixes: 427245acacaf04a8
("evcleanup: ensure deferred close from timers are handled ASAP")
|
|
This is a bit expensive in a multi-process situation because
we need to make our indices and packs visible to the read-only
pieces.
|
|
We'll be using these in a more OO manner for V2Writable
(which doesn't use Danga::Socket), so let's not unnecessarily
register cleanup handlers intended for network daemons.
|
|
Some emails in LKML archives are identical with the only
difference being s/References:/In-Reply-To:/ in the headers.
Since this difference doesn't affect how we handle message
threading, we will treat them the same way for the purposes
of deduplication.
There may be more changes to how we do content_id along these
lines (e.g. using msg_iter to walk the message).
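
A sketch of the deduplication rule: References: and In-Reply-To: are hashed under one shared label, so two otherwise identical messages that differ only in which of the two headers they use get the same ID. The function name, the shared label `ref`, and SHA-256 are assumptions for illustration.

```python
import hashlib

_THREAD_HEADERS = {"references", "in-reply-to"}


def content_id(headers, body: bytes) -> str:
    """Digest of headers and body, treating both threading headers alike."""
    dig = hashlib.sha256()
    for name, value in headers:
        key = name.lower()
        if key in _THREAD_HEADERS:
            key = "ref"  # same label for either threading header
        dig.update(key.encode("utf-8") + b"\0" + value.encode("utf-8") + b"\0")
    dig.update(body)
    return dig.hexdigest()
```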
|
|
|
|
It was making imports too noisy.
|
|
As with the ::Import class this wraps, we want this to be
usable as a checkpoint and be able to call ->add afterwards.
We'll be relying on ->done to flush changes through all
partition and skeleton DBs for deduplication checks.
|
|
A work-in-progress, but it appears the v2 UI pieces
will not require a lot of work.
|
|
The skeleton DB is where we store all the information needed
for NNTP overviews via XOVER. This seems to be the only change
(besides eventually handling duplicates) necessary
to support our nntpd interface for v2 repositories.
|
|
Iterating through a list of documents while modifying them does
not seem to be supported in Xapian and it can trigger
DatabaseCorruptError exceptions. This only worked with past
datasets out of dumb luck. With the work-in-progress "v2"
public-inbox layout, this problem might become more visible
as the "thread skeleton" is partitioned out to a separate,
smaller Xapian database.
I've reproduced the problem on both Debian 8.x and 9.x with
Xapian 1.2.19 (chert backend) and 1.4.3 (glass backend)
respectively.
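
The general remedy is to finish reading before starting to write. The sketch below shows that pattern with a plain dict standing in for the database; it does not use the Xapian API, and `relink_thread` and the `thread_id` field are hypothetical.

```python
def relink_thread(db: dict, old_tid: int, new_tid: int) -> int:
    """Move every document in thread `old_tid` to `new_tid`.

    `db` maps document IDs to dicts and stands in for a real database.
    """
    # Pass 1: snapshot the matching IDs; nothing is modified yet.
    doc_ids = [doc_id for doc_id, doc in db.items()
               if doc["thread_id"] == old_tid]
    # Pass 2: modify documents only after iteration has finished.
    for doc_id in doc_ids:
        db[doc_id] = {**db[doc_id], "thread_id": new_tid}
    return len(doc_ids)
```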
|
|
I added these while chasing down the DatabaseCorruptError
exceptions which turned out to be caused by Xapian DB
modifications during iteration.
|
|
We need to ensure Xapian transaction commits are made to remote
partitions before associated commits hit the skeleton DB.
This causes unnecessary commits to be made to the skeleton DB;
but they're mostly harmless. Further work will be necessary
to ensure proper ordering and avoidance of unnecessary commits.
|
|
Interchangeably using "all", "skel", "threader", etc. was
confusing. Standardize on the "skeleton" term to describe
this class since it's also used for retrieval of basic headers.
|
|
A different Xapian DB requires the use of a different Enquire
object. This is necessary for get_thread and thread skeleton
to work in the PSGI UI.
|
|
We will need timestamp, YYYYMMDD, article number, and line count
for querying thread information (including XOVER for NNTP).
|
|
Any Xapian DB is subject to the same errors and retries.
Perhaps in the future this can be made more granular to avoid
unnecessary reopens.
|
|
Make data passed via Storable to the skeleton worker
a little neater.
|
|
Otherwise, references and thread linking don't happen
across subject mismatches. Oops, this is important.
|
|
We haven't needed this since we integrated threading
and dropped Email::Abstract and Mail::Thread usage.
|