Date | Commit message (Collapse) |
|
We want to make it clear to the code and DEBUG_DIFF users
that we do not introduce messages with unsuitable headers
into public archives.
|
|
Allow best-effort regeneration of NNTP article numbers from
cloned git repositories in addition to indexing Xapian Article
numbers will not remain consistent when we add purge support,
though.
|
|
I'll be relying on some of this behavior for regenerating NNTP
article numbers off fresh clones.
|
|
This still requires a msgmap.sqlite3 file to exist, but
it allows us to tweak Xapian indexing rules and reindex
the Xapian database online while -watch is running.
|
|
I keep forgetting to run "make syntax"
|
|
This will be used to keep track of Message-ID <-> NNTP Article
numbers to prevent article number reuse when reindexing.
|
|
We want to rely on Date: to sort messages within individual
threads since it keeps messages from git-send-email(1) sorted.
However, since developers occasionally have the clock set
wrong on their machines, sort overall messages by the newest
date in a Received: header so the landing page isn't forever
polluted by messages from the future.
This also gives us determinism for commit times in most cases,
as we'll used the Received: timestamp there, as well.
|
|
This will make it easier to as well as supporting future
Filter API users. It allows simplifying our ad-hoc
import_vger_from_mbox script.
|
|
Reduce the places where we have duplicate logic for discarding
unwanted headers.
|
|
This code will be shared with future mass-import tools.
|
|
If we need to use content_id, we've already lost hope
in relying on Message-Id as a differentiator. This
prevents duplicates from showing up repeatedly with
-watch when Message-Ids are reused and we generate
new Message-Ids to disambiguate.
|
|
public-inbox-watch gets restarted on reboots and whatnot, so
it could get pointlessly noisy. This message was only useful
during initial development and imports.
|
|
This can help us track down some differences during import,
if needed.
|
|
Perhaps we should filter these headers out in Import
|
|
While parallel processes improves import speed for initial
imports; they are probably not necessary for daily mail imports
via WatchMaildir and certainly not for public-inbox-init. Save
some memory for daily use and even helps improve readability of
some subroutines by showing which methods they call remotely.
|
|
Be consistent with our "remote_" prefix for other IPC subs
|
|
Unfortunately this gives up some minor performance tweaks we
made to avoid reforking import processes.
|
|
This matches Import::done behavior
|
|
I had to dig through commit history for this and we should
better document our tests (along with everything else).
|
|
This reduces code duplication needed for locking and
and hopefully makes things easier to understand.
|
|
No functional changes, yet, but this makes future changes
easier-to-read.
|
|
Instead of using ssoma-based locking, enable locking via Import
for now.
|
|
This will make reindexing easier.
|
|
Hexdigests are too long and shorter Message-IDs are easier
to deal with.
|
|
This allows us to share code for generating Message-IDs
between v1 and v2 repos.
For v1, this introduces a slight incompatibility in message
removal iff the original message lacked a Message-ID AND
the training request came from a message which did not
pass through the public-inbox:
The workaround for this would be to reuse the bad message from
the archive itself.
|
|
This can probably be moved to Import for code reuse.
|
|
This allows us to be more consistent in dealing with completely
empty Message-Ids.
|
|
This will allow WatchMaildir to use ->barrier operations instead
of reaching inside for nchg. This also ensures dumb HTTP
clients can see changes to V2 repos immediately.
|
|
In the future, we may store "purged" content IDs or other
uncommon stuff under "_/" of the git tree. This keeps the
top-level tree small and more amenable to deltafication.
This helps the the common case where "m" is most commonly
changed file at the top level.
Also, use 'D' instead of 'd' since it matches git's '--raw'
output format.
|
|
This makes it easier to audit deletes with "git log -p" and
prevents an unstable specification of "content_id" from being
stored in history.
This should be cost-free if done in the same partition (and even
cheaper than before as it introduces no new blobs). It does
have a higher cost across partitions, but is probably irrelevant
given the typical ham:spam ratio.
|
|
Writing to the main skeleton pipe requires a lock since it's
shared with partition processes.
|
|
We need to hide removals from anybody hitting the search engine.
|
|
Makes life a little easier for V2Writable...
|
|
Followup-to: ebb59815035b42c2
("searchidx: do not modify Xapian DB while iterating")
|
|
We no longer need it with ->barrier working
|
|
Email::Simple is slightly faster this way, and Email::MIME
and PublicInbox::MIME both wrap that.
|
|
Stopping and starting a bunch of processes to look up duplicates
or removals is inefficient. Take advantage of checkpointing
in "git fast-import" and transactions in Xapian and SQLite.
|
|
We will be using Sender: in more places if the From: header
is not available, this is one of them.
Followup-to: ("import: fall back to Sender for extracting name and email")
|
|
The current inbox is more important for partial Message-ID
matching, so we try harder on that to fix common errors before
moving onto other inboxes. Then, prevent expensive scanning of
other inboxes by requiring a Message-ID length of at least 16
bytes.
Finally, we limit the overall partial responses to 200 when
scanning other inboxes to avoid excessive memory usage.
|
|
We need to detect the number of partitions the repository was
created with to ensure Xapian DBs can work across different
machines (or even CPU affinity changes) without leaving messages
unaffected by search.
|
|
It appears most of the mboxes in the archive I've been given are
mboxrd (despite having Content-Length:) and needs the escaping.
|
|
This seems like a reasonable course of action for old messages.
Cc: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
|
|
The first Received: header is believable since it typically
hits the user's mail server and can be treated as relatively
trustworthy. We still show the Date: in per-message (permalink)
views, which may expose users for having incorrect Date:
headers, but all the ISO YYYY-MM-DD dates we display will
match what we see.
|
|
Not a big deal since we still commit to the skeleton for every
single partition (barrier work abandoned).
|
|
We do not need the large DBs for MID scans.
|
|
The skeleton DB is smaller and hit more frequently given the
homepage and per-message/thread views; so it will be hotter in
the page cache.
|
|
I've missed a few things over time :x
|
|
We'll let the config of all.git dictate every other subrepo to
ease maintenance and configuration. The "include" directive has
been supported since git 1.7.10, so it's safe to depend on as v2
requires git 2.6.0+ anyways for "get-mark" in fast-import.
|
|
It's easier to store everything in one array ref similar
to what our Git->check routine returns
|
|
We can't rely on header order for Message-ID after all
since we fall back to existing MIDs if they exist and
are unseen. This lets us use SearchMsg->mid to get the
MID we associated with the NNTP article number to ensure
all NNTP article lookups roundtrip correctly.
|