Date | Commit message |
|
No functional changes yet, but this makes future changes
easier to read.
|
|
Instead of using ssoma-based locking, enable locking via Import
for now.
|
|
This will make reindexing easier.
|
|
Hexdigests are too long and shorter Message-IDs are easier
to deal with.
|
|
This allows us to share code for generating Message-IDs
between v1 and v2 repos.
For v1, this introduces a slight incompatibility in message
removal iff the original message lacked a Message-ID AND
the training request came from a message which did not
pass through the public-inbox:
The workaround for this would be to reuse the bad message from
the archive itself.
|
|
This can probably be moved to Import for code reuse.
|
|
This allows us to be more consistent in dealing with completely
empty Message-Ids.
|
|
This will allow WatchMaildir to use ->barrier operations instead
of reaching inside for nchg. This also ensures dumb HTTP
clients can see changes to V2 repos immediately.
|
|
In the future, we may store "purged" content IDs or other
uncommon stuff under "_/" of the git tree. This keeps the
top-level tree small and more amenable to deltafication.
This helps the common case where "m" is the most commonly
changed file at the top level.
Also, use 'D' instead of 'd' since it matches git's '--raw'
output format.
|
|
This makes it easier to audit deletes with "git log -p" and
prevents an unstable specification of "content_id" from being
stored in history.
This should be cost-free if done in the same partition (and even
cheaper than before as it introduces no new blobs). It does
have a higher cost across partitions, but is probably irrelevant
given the typical ham:spam ratio.
|
|
Writing to the main skeleton pipe requires a lock since it's
shared with partition processes.
|
|
We need to hide removals from anybody hitting the search engine.
|
|
Makes life a little easier for V2Writable...
|
|
Followup-to: ebb59815035b42c2
("searchidx: do not modify Xapian DB while iterating")
|
|
We no longer need it with ->barrier working
|
|
Email::Simple is slightly faster this way, and Email::MIME
and PublicInbox::MIME both wrap that.
|
|
Stopping and starting a bunch of processes to look up duplicates
or removals is inefficient. Take advantage of checkpointing
in "git fast-import" and transactions in Xapian and SQLite.
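
A minimal sketch of the transaction half of this idea, using Python's
sqlite3 module rather than the project's own code.  The "msgmap" table
layout and the function names are hypothetical; the point is only that
duplicate lookups and inserts for a whole batch can share one open
transaction instead of one process (or commit) per message.

```python
import sqlite3


def open_msgmap(path=":memory:"):
    """Open a database with a hypothetical Message-ID table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS msgmap ("
        "num INTEGER PRIMARY KEY AUTOINCREMENT, "
        "mid TEXT UNIQUE NOT NULL)"
    )
    conn.commit()
    return conn


def import_batch(conn, mids):
    """Insert unseen Message-IDs; return the ones actually added."""
    added = []
    # "with conn" commits once at the end (or rolls back on error), so
    # lookups inside the loop already see rows inserted earlier in the
    # same batch.
    with conn:
        for mid in mids:
            dup = conn.execute(
                "SELECT 1 FROM msgmap WHERE mid = ?", (mid,)
            ).fetchone()
            if dup:
                continue
            conn.execute("INSERT INTO msgmap (mid) VALUES (?)", (mid,))
            added.append(mid)
    return added
```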
|
|
We will be using Sender: in more places if the From: header
is not available; this is one of them.
Followup-to: ("import: fall back to Sender for extracting name and email")
|
|
The current inbox is more important for partial Message-ID
matching, so we try harder on that to fix common errors before
moving onto other inboxes. Then, prevent expensive scanning of
other inboxes by requiring a Message-ID length of at least 16
bytes.
Finally, we limit the overall partial responses to 200 when
scanning other inboxes to avoid excessive memory usage.
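
The two limits for other inboxes could be sketched roughly as below.
The 16-byte minimum and the cap of 200 come from the text above;
everything else (the function name, inboxes modelled as plain lists of
Message-ID strings, substring matching) is an assumption made for
illustration only.

```python
MIN_PARTIAL_LEN = 16          # bytes, threshold named in the text
MAX_PARTIAL_RESPONSES = 200   # overall cap named in the text


def scan_other_inboxes(partial_mid, inboxes):
    """Collect matches for partial_mid from other inboxes.

    inboxes is an iterable of iterables of Message-ID strings, a
    stand-in for whatever lookup the real inboxes provide.
    """
    # Refuse the expensive scan for short (too ambiguous) inputs.
    if len(partial_mid.encode("utf-8")) < MIN_PARTIAL_LEN:
        return []
    found = []
    for mids in inboxes:
        for mid in mids:
            if partial_mid in mid:
                found.append(mid)
                # Stop early to bound memory usage.
                if len(found) >= MAX_PARTIAL_RESPONSES:
                    return found
    return found
```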
|
|
We need to detect the number of partitions the repository was
created with to ensure Xapian DBs can work across different
machines (or even CPU affinity changes) without leaving messages
unaffected by search.
|
|
It appears most of the mboxes in the archive I've been given are
mboxrd (despite having Content-Length:) and need the escaping.
|
|
This seems like a reasonable course of action for old messages.
Cc: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
|
|
The first Received: header is believable since it typically
hits the user's mail server and can be treated as relatively
trustworthy. We still show the Date: in per-message (permalink)
views, which may expose users who have incorrect Date:
headers, but all the ISO YYYY-MM-DD dates we display will
match what we see.
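
A rough sketch of preferring Received: over Date: for the displayed
date.  It assumes "first" means the topmost Received: header in the
message and that the timestamp follows the last ";" in that header;
the function names are hypothetical and this is not the project's
actual parser.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def _parse(value):
    """Parse an RFC 2822-style date; return a UTC datetime or None."""
    try:
        dt = parsedate_to_datetime(value)
    except (TypeError, ValueError, IndexError):
        return None
    if dt.tzinfo is None:
        # No usable zone information: treat the time as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def display_date(raw):
    """Return YYYY-MM-DD from the topmost Received:, else from Date:."""
    msg = message_from_string(raw)
    candidates = []
    received = msg.get_all("Received") or []
    if received:
        # The timestamp is the part after the final semicolon.
        candidates.append(received[0].rsplit(";", 1)[-1])
    if msg["Date"]:
        candidates.append(msg["Date"])
    for value in candidates:
        dt = _parse(value)
        if dt is not None:
            return dt.date().isoformat()
    return None
```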
|
|
Not a big deal since we still commit to the skeleton for every
single partition (barrier work abandoned).
|
|
We do not need the large DBs for MID scans.
|
|
The skeleton DB is smaller and hit more frequently given the
homepage and per-message/thread views; so it will be hotter in
the page cache.
|
|
I've missed a few things over time :x
|
|
We'll let the config of all.git dictate every other subrepo to
ease maintenance and configuration. The "include" directive has
been supported since git 1.7.10, so it's safe to depend on as v2
requires git 2.6.0+ anyways for "get-mark" in fast-import.
|
|
It's easier to store everything in one array ref similar
to what our Git->check routine returns
|
|
We can't rely on header order for Message-ID after all
since we fall back to existing MIDs if they exist and
are unseen. This lets us use SearchMsg->mid to get the
MID we associated with the NNTP article number to ensure
all NNTP article lookups roundtrip correctly.
|
|
I guess nobody uses this command (slrnpull does not), and
the breakage was not noticed until I started writing new
tests for multi-MID handling.
Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")
|
|
Since Message-IDs are no longer unique within Xapian
(but are within the SQLite Msgmap); favor NNTP article
numbers for internal lookups. This will prevent us
from finding the "wrong" internal Message-ID.
|
|
Since we support duplicate MIDs in v2, we can safely truncate
long MID terms in the database and let other normal duplicate
resolution sort it out. It seems only spammers use excessively
long MIDs, and there'll always be abuse/misuse vectors for causing
mis-threaded messages, so it's not worth worrying about
excessively long MIDs.
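
As an illustration only: truncating a Message-ID term could look like
the sketch below.  The 245-byte limit is an assumption about the
backend's maximum term length, and the 'Q' prefix and function name
are likewise illustrative.  Two long MIDs sharing a prefix collapse to
the same term, which is then left to normal duplicate resolution.

```python
MAX_TERM_BYTES = 245  # assumed backend term length limit


def mid_term(mid, prefix="Q"):
    """Build a (possibly truncated) term for a Message-ID, as bytes."""
    raw = (prefix + mid).encode("utf-8")
    # Truncate on the byte level since the limit is in bytes.
    return raw[:MAX_TERM_BYTES]
```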
|
|
Since we support duplicate MIDs in v2, the NNTP article number
becomes the true unique identifier and we want a way to do fast
lookups on it.
While we're at it, stop putting XPATH in the term partitions
since we only need it in the skeleton DB.
|
|
Aside from the Message-Id ('Q'), these terms do not appear in
content and thus have no business contributing to the Xapian
document length.
Thanks-to Olly Betts for the tip on xapian-discuss
<20180228004400.GU12724@survex.com>
|
|
This is to make SearchMsg behave more sanely under NNTP.
|
|
It's tempting to rely on the atomicity of smaller-than-PIPE_BUF
writes, but it doesn't work if mixed with larger ones.
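
POSIX only guarantees that pipe writes of at most PIPE_BUF bytes are
not interleaved; a larger write may be split, and a small write from
another process can then land in the middle of it.  A minimal sketch
of serializing all writers with an advisory lock follows (the function
name and the use of flock on a separate lock file are assumptions, not
the project's code):

```python
import fcntl
import os


def locked_write(fd, lock_fd, data):
    """Write all of data to fd while holding an exclusive flock."""
    fcntl.flock(lock_fd, fcntl.LOCK_EX)
    try:
        view = memoryview(data)
        # os.write may write less than requested; loop until done.
        while len(view) > 0:
            n = os.write(fd, view)
            view = view[n:]
    finally:
        fcntl.flock(lock_fd, fcntl.LOCK_UN)
```

Every writer, large or small, has to take the same lock for this to
help.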
|
|
When indexing diffs, we can avoid indexing the diff parts under
XNQ and instead combine the parts in the read-only search
interface. This results in better indexing performance and
10-15% smaller Xapian indices.
|
|
Traditionally we've been more lax on parsing Message-Id
and allow it without the angle brackets. We've always been
strict on References and can't have it be pointlessly
large when some MUA decides to use HTML-escaped angle
brackets ("&lt;", "&gt;").
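
The difference between the two parsers might be sketched like this.
The regular expression and function names are assumptions for
illustration; the real parsing rules may differ.

```python
import re

# One Message-ID enclosed in literal angle brackets.
_BRACKETED = re.compile(r"<([^<>\s]+)>")


def mid_from_message_id(value):
    """Lax: accept a Message-Id with or without angle brackets."""
    m = _BRACKETED.search(value)
    if m:
        return m.group(1)
    return value.strip()


def mids_from_references(value):
    """Strict: only bracketed IDs count; duplicates are dropped."""
    seen = []
    for mid in _BRACKETED.findall(value):
        if mid not in seen:
            seen.append(mid)
    return seen
```

With the strict variant, a References value made of HTML-escaped
brackets such as `&lt;a@b&gt;` yields no IDs at all instead of one
oversized bogus one.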
|
|
It's possible to have a message handle multiple terms;
so use this feature to ensure messages with multiple MIDs
can be found by either one.
|
|
'Q' is merely a convention in the Xapian world, and is close
enough to unique for practical purposes, so stop using XMID
and gain a little more term length as a result.
|
|
Since we'll need to support multiple Message-IDs anyways,
inject a new one if we hit a duplicate (or don't get one at
all).
Try to use a deterministic Message-Id for consistency, but give
up determinism and use a random Message-Id if an "attacker"
wants to prevent their message from being archived.
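
A sketch of that fallback order, under stated assumptions: "seen"
stands in for whatever lookup tells us a Message-Id is already taken,
SHA-1 over the raw message is just one possible deterministic input,
and the "@localhost" domain is made up for the example.

```python
import hashlib
import secrets


def pick_mid(raw_message, original_mid, seen, domain="localhost"):
    """Choose a Message-Id that does not collide with anything in seen."""
    # 1. Keep the original if it exists and is unused.
    if original_mid and original_mid not in seen:
        return original_mid
    # 2. Deterministic: derived from the message content itself.
    digest = hashlib.sha1(raw_message).hexdigest()
    candidate = f"{digest}@{domain}"
    if candidate not in seen:
        return candidate
    # 3. Somebody already occupies the deterministic ID (possibly on
    #    purpose); give up determinism and go random.
    while True:
        candidate = f"{secrets.token_hex(16)}@{domain}"
        if candidate not in seen:
            return candidate
```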
|
|
We merely use this for internal comparisons and do not store
this in Xapian. So using a shorter, non-human readable digest
is enough. Furthermore, introduce "content_digest" which
returns the Digest::SHA object for extra changes.
|
|
It's shorter and more convenient, here.
|
|
These already take care of deduping internally, so we'll save
ourselves at least some of the trouble while using a more
consistent API. While we're at it, hash the header name as
well, since we need to distinguish which header a certain value
came from.
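
To illustrate why the header name goes into the hash: without it, the
same address in To: and in From: would produce identical digests.  The
sketch below is hypothetical (SHA-256, the header list, and the NUL
separators are all choices made for the example).

```python
import hashlib


def digest_headers(headers, names=("Subject", "From", "To")):
    """Hash selected headers, tagging each value with its header name.

    headers is a list of (name, value) tuples.
    """
    h = hashlib.sha256()
    for name in names:
        seen = set()
        for key, value in headers:
            if key.lower() != name.lower() or value in seen:
                continue
            seen.add(value)  # repeated identical values count once
            h.update(name.encode("ascii") + b"\0")
            h.update(value.encode("utf-8") + b"\0")
    return h.hexdigest()
```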
|
|
We'll be using a more consistent API for extracting Message-IDs
from various headers.
|
|
This was creating an unnecessary epoll descriptor via
Danga::Socket when using V2Writable to import a mbox. That
said, there should probably be a better way of detecting whether
or not we're inside a Danga::Socket event loop.
Fixes: 427245acacaf04a8
("evcleanup: ensure deferred close from timers are handled ASAP")
|
|
This is a bit expensive in a multi-process situation because
we need to make our indices and packs visible to the read-only
pieces.
|
|
We'll be using these in a more OO manner for V2Writable
(which doesn't use Danga::Socket), so let's not unnecessarily
register cleanup handlers intended for network daemons.
|
|
Some emails in LKML archives are identical with the only
difference being s/References:/In-Reply-To:/ in the headers.
Since this difference doesn't affect how we handle message
threading, we will treat them the same way for the purposes
of deduplication.
There may be more changes to how we do content_id along these
lines (e.g. using msg_iter to walk the message).
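
One way to picture the equivalence: pool the IDs from References: and
In-Reply-To: before hashing, so it no longer matters which of the two
headers carried them.  This is only a sketch with hypothetical names
and covers just the threading headers, not a full content_id.

```python
import hashlib
import re

_MID = re.compile(r"<([^<>\s]+)>")


def threading_digest(headers):
    """Digest of referenced Message-IDs, ignoring which header held them.

    headers is a list of (name, value) tuples.
    """
    refs = []
    for name, value in headers:
        if name.lower() in ("references", "in-reply-to"):
            for mid in _MID.findall(value):
                if mid not in refs:
                    refs.append(mid)
    h = hashlib.sha256()
    for mid in refs:
        # Same tag for both headers, so the digests match.
        h.update(b"ref\0" + mid.encode("utf-8") + b"\0")
    return h.hexdigest()
```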
|