Date | Commit message |
|
Since the overview stuff is a synchronization point anyways,
move it into the main V2Writable process and allow us to
drop a bunch of code. This is another step towards making
Xapian optional for v2.
In other words, the fan-out point is moved and the Xapian
partitions no longer need to synchronize against each other:
Before:
                     /-------->\
                    /---------->\
     v2writable -->+----parts----> over
                    \---------->/
                     \-------->/
After:
                          /---------->
                         /----------->
  v2writable --> over-->+----parts--->
                         \----------->
                          \---------->
Since the overview/threading logic needs to run on the same core
that feeds git-fast-import, it's slower for small repos but is
not noticeable in large imports where I/O wait in the partitions
dominates.
|
|
Favor simpler internal APIs this time around; this cuts
a fair amount of code out and takes another step towards
removing Xapian as a dependency for v2 repos.
|
|
Some of this jankiness came from early performance problems,
and the workarounds turned out to be unnecessary.
|
|
This hopefully helps for people who try to understand
this design.
|
|
For upgrades, this will let users keep an old version
running while performing "public-inbox-index" on the
newest version.
|
|
The Xapian partitions will trigger the removal anyways.
Test this and fix some description/spelling errors
while we're at it.
|
|
The partition count can change if public-inbox-compact runs
while public-inbox-watch or public-inbox-index is running.
|
|
Xapian may become unhappy if a DB is modified during iteration:
nntp://news.gmane.org/20180228004400.GU12724@survex.com
|
|
This is important for people running mirrors via "git fetch",
as they need to be kept up-to-date. Purging is also now
supported in mirrors.
The short-lived "--regenerate" option is gone and is now
implicitly enabled as a result. It's still cheap when
article number regeneration is unnecessary, as we track
the range for each git repository.
|
|
We do not need to rewrite old commits unaffected by the object_id
purge, only newer commits. This was a state management bug :x
We will also return the new commit ID of rewritten history to
aid in incremental indexing of mirrors for the next change.
|
|
searchidx_checkpoint was too convoluted and confusing.
Since barrier is mostly the same thing, use that instead
and add an fsync option for the overview DB.
|
|
This ought to provide better performance and scalability
which is less dependent on inbox size. Xapian does not
seem optimized for some queries used by the WWW homepage,
Atom feeds, XOVER and NEWNEWS NNTP commands.
This can actually make Xapian optional for NNTP usage,
and allow more functionality to work without Xapian
installed.
Indexing performance was extremely bad at first, but
DBI::Profile helped me optimize away problematic queries.
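
As a rough illustration of why a relational overview table suits these range-style queries, here is a minimal sketch using Python's sqlite3 module. The table name `over`, its columns, and the helpers `open_over`, `add_over`, `xover` and `newnews` are all hypothetical; this is not public-inbox's actual schema or code.

```python
import sqlite3

def open_over(path=":memory:"):
    """Create a hypothetical overview table keyed by NNTP article number."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS over ("
        " num INTEGER PRIMARY KEY,"  # NNTP article number
        " ts INTEGER NOT NULL,"      # arrival time, for NEWNEWS-style lookups
        " mid TEXT NOT NULL,"        # Message-ID
        " subject TEXT NOT NULL)"
    )
    db.execute("CREATE INDEX IF NOT EXISTS over_ts ON over (ts)")
    return db

def add_over(db, num, ts, mid, subject):
    """Record one message in the overview table."""
    db.execute(
        "INSERT INTO over (num, ts, mid, subject) VALUES (?, ?, ?, ?)",
        (num, ts, mid, subject),
    )

def xover(db, first, last):
    """XOVER-like query: a contiguous range of article numbers."""
    return db.execute(
        "SELECT num, subject, mid FROM over"
        " WHERE num BETWEEN ? AND ? ORDER BY num",
        (first, last),
    ).fetchall()

def newnews(db, since):
    """NEWNEWS-like query: Message-IDs that arrived at or after `since`."""
    rows = db.execute(
        "SELECT mid FROM over WHERE ts >= ? ORDER BY ts, num", (since,)
    )
    return [row[0] for row in rows]
```

Both queries are served by an index (the primary key for `num`, `over_ts` for `ts`), so their cost depends on the size of the requested range rather than on the size of the inbox.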
|
|
I was too aggressively disabling parallelization to speed up
the test suite and broke this :x Re-enable parallelization
for the v2reindex test so we can catch it later.
|
|
We need to ensure there is only one file in the top-level tree
at any commit so the "add; remove; add;" sequence on the same
message is detected properly.
Otherwise, git will not detect the second "add" unless
a second message is added to history.
Deletes are now stored in "d" (and not "D" or "_/D") at the
top-level. There's no need to have a "_" to reduce churn
as "m" and "d" should never co-exist. It's now lowercased to
make it easier-to-distinguish from "D" in git-log output.
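
To make the "one file per tree" rule concrete, the sketch below builds a git fast-import commit whose tree contains only "m" (an add) or only "d" (a delete). Using `deleteall` to clear the tree first is an assumption made for illustration; this log does not say how public-inbox itself empties the tree, and `commit_stream` is a hypothetical helper.

```python
def commit_stream(ref, committer, when, message, path, blob):
    """Build one fast-import commit that leaves only `path` in the tree.

    `path` must be "m" (message added) or "d" (message deleted).
    """
    if path not in ("m", "d"):
        raise ValueError("path must be 'm' or 'd'")
    msg = message.encode()
    parts = [
        b"commit " + ref.encode() + b"\n",
        b"committer " + committer.encode() + b" %d +0000\n" % when,
        b"data %d\n" % len(msg), msg, b"\n",
        # Drop whatever the parent tree held, so "m" and "d" never co-exist
        b"deleteall\n",
        b"M 100644 inline " + path.encode() + b"\n",
        b"data %d\n" % len(blob), blob, b"\n",
    ]
    return b"".join(parts)
```

Because every commit replaces the whole tree, an "add; remove; add" sequence on the same message shows up as "m", then "d", then "m" again, and each step is a visible change in `git log -p`.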
|
|
Ensure -convert and -compact do not make repositories
unreadable on live servers.
|
|
This is consistent with how we internally generate new
Message-IDs to break conflicts and allows ->reindex to
succeed while walking backwards through history.
|
|
By supporting purge and allowing users to delete git partitions,
we can open up ourselves to gaps and un-reindexible data. Let
that be.
|
|
Somebody may only care about the most recent history,
so allow -init and -index to operate quietly on missing
partitions.
|
|
And we do not want to start making confused repos if somebody
leaves out "-V2" the second time around.
|
|
The layout of this structure ended up being a bit different
and the read-only access is handled through the ::Inbox class,
instead.
|
|
Purging existing messages is fairly straightforward since we can
take advantage of Xapian and look up the git object_id with it.
Unfortunately, purging an already "removed" message (which is
no longer in Xapian) is not as easy and we'll need to expose
->purge_oids to purge by the git object_id (currently SHA-1).
Furthermore, we expire reflogs and prune in hopes a dumb HTTP
client won't get the object.
|
|
The original Message-ID is still the most important when
discussing with other recipients who do not rely on a message
flowing through public-inbox. So whatever Message-ID we use
to deduplicate internally will be secondary and less important.
All of our front-end v2 code is order-independent, so we won't
let the message count against us, that way.
|
|
It would be a bug to have deleted files marked but not
seen in our histories.
|
|
This also quiets down warnings from -watch when spam training
happens on messages without Message-Id.
|
|
The File::Temp API is a bit tricky and needs TMPDIR explicitly
enabled if a template is given.
|
|
We want to make it clear to the code and DEBUG_DIFF users
that we do not introduce messages with unsuitable headers
into public archives.
|
|
Allow best-effort regeneration of NNTP article numbers from
cloned git repositories in addition to indexing Xapian.
Article numbers will not remain consistent when we add purge
support, though.
|
|
This still requires a msgmap.sqlite3 file to exist, but
it allows us to tweak Xapian indexing rules and reindex
the Xapian database online while -watch is running.
|
|
This will make it easier to as well as supporting future
Filter API users. It allows simplifying our ad-hoc
import_vger_from_mbox script.
|
|
public-inbox-watch gets restarted on reboots and whatnot, so
it could get pointlessly noisy. This message was only useful
during initial development and imports.
|
|
This can help us track down some differences during import,
if needed.
|
|
While parallel processes improve import speed for initial
imports, they are probably not necessary for daily mail imports
via WatchMaildir and certainly not for public-inbox-init. This
saves some memory for daily use and even helps improve readability
of some subroutines by showing which methods they call remotely.
|
|
Be consistent with our "remote_" prefix for other IPC subs
|
|
This matches Import::done behavior
|
|
This reduces code duplication needed for locking and
hopefully makes things easier to understand.
|
|
Instead of using ssoma-based locking, enable locking via Import
for now.
|
|
This allows us to share code for generating Message-IDs
between v1 and v2 repos.
For v1, this introduces a slight incompatibility in message
removal iff the original message lacked a Message-ID AND
the training request came from a message which did not
pass through the public-inbox:
The workaround for this would be to reuse the bad message from
the archive itself.
|
|
This will allow WatchMaildir to use ->barrier operations instead
of reaching inside for nchg. This also ensures dumb HTTP
clients can see changes to V2 repos immediately.
|
|
This makes it easier to audit deletes with "git log -p" and
prevents an unstable specification of "content_id" from being
stored in history.
This should be cost-free if done in the same partition (and even
cheaper than before as it introduces no new blobs). It does
have a higher cost across partitions, but is probably irrelevant
given the typical ham:spam ratio.
|
|
We need to hide removals from anybody hitting the search engine.
|
|
Makes life a little easier for V2Writable...
|
|
We no longer need it with ->barrier working
|
|
Stopping and starting a bunch of processes to look up duplicates
or removals is inefficient. Take advantage of checkpointing
in "git fast-import" and transactions in Xapian and SQLite.
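
The idea of keeping one long-lived writer and publishing its work at a barrier can be sketched with SQLite alone. The `Writer` class, the `seen` table and the method names below are hypothetical; the git fast-import checkpoint and the Xapian commit that would accompany the barrier are deliberately left out.

```python
import sqlite3

class Writer:
    """Hypothetical long-lived writer: batch changes, publish at a barrier."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (mid TEXT PRIMARY KEY)")
        self.db.commit()

    def add(self, mid):
        # sqlite3 implicitly opens a transaction before the first INSERT,
        # so repeated adds accumulate in one transaction.
        self.db.execute("INSERT OR IGNORE INTO seen (mid) VALUES (?)", (mid,))

    def is_dup(self, mid):
        # The writer sees its own uncommitted rows, so duplicate lookups
        # do not require stopping and restarting anything.
        row = self.db.execute(
            "SELECT 1 FROM seen WHERE mid = ?", (mid,)
        ).fetchone()
        return row is not None

    def barrier(self):
        # Make everything written so far visible to other connections.
        self.db.commit()
```

Other connections only observe the rows once `barrier()` has committed, which is the same contract a checkpoint gives readers of the git repository.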
|
|
We need to detect the number of partitions the repository was
created with to ensure Xapian DBs can work across different
machines (or even CPU affinity changes) without leaving messages
unaffected by search.
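
A minimal sketch of such detection, assuming a hypothetical layout in which each partition lives in a numbered subdirectory (0, 1, 2, ...) of a single Xapian directory. The layout and the helpers `count_partitions` and `pick_partitions` are illustrative assumptions, not the project's actual code.

```python
import os
import re

def count_partitions(xap_dir):
    """Count numbered partition directories; None if none exist yet."""
    try:
        names = os.listdir(xap_dir)
    except FileNotFoundError:
        return None
    count = sum(
        1 for name in names
        if re.fullmatch(r"[0-9]+", name)
        and os.path.isdir(os.path.join(xap_dir, name))
    )
    return count or None

def pick_partitions(xap_dir, default):
    """Prefer the count the repository was created with over the default."""
    existing = count_partitions(xap_dir)
    return existing if existing is not None else default
```

Here `default` stands in for whatever the local machine would otherwise choose (for example, a value derived from its CPU count), so an existing repository keeps its original partition count when moved to different hardware.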
|
|
Not a big deal since we still commit to the skeleton for every
single partition (barrier work abandoned).
|
|
We'll let the config of all.git dictate every other subrepo to
ease maintenance and configuration. The "include" directive has
been supported since git 1.7.10, so it's safe to depend on as v2
requires git 2.6.0+ anyways for "get-mark" in fast-import.
|
|
It's easier to store everything in one array ref similar
to what our Git->check routine returns
|
|
We can't rely on header order for Message-ID after all
since we fall back to existing MIDs if they exist and
are unseen. This lets us use SearchMsg->mid to get the
MID we associated with the NNTP article number to ensure
all NNTP article lookups roundtrip correctly.
|
|
This is to make SearchMsg behave more sanely under NNTP.
|
|
Since we'll need to support multiple Message-IDs anyways,
inject a new one if we hit a duplicate (or don't get one at
all).
Try to use a deterministic Message-Id for consistency, but give
up determinism and use a random Message-Id if an "attacker"
wants to prevent their message from being archived.
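
The fallback order described above can be sketched as follows. The hash choice (SHA-1 over the raw message), the `example.invalid` host and the helper names are assumptions for illustration; a real implementation would also need to compare message content before treating a matching Message-ID as a conflict.

```python
import hashlib
import secrets

def deterministic_mid(raw, host="example.invalid"):
    """Derive a reproducible Message-ID from the raw message bytes."""
    return "<%s@%s>" % (hashlib.sha1(raw).hexdigest(), host)

def choose_mid(raw, original_mid, seen, host="example.invalid"):
    """Pick a Message-ID that is not already in `seen`.

    Order of preference: the original header, a deterministic ID,
    and finally a random ID.
    """
    if original_mid and original_mid not in seen:
        return original_mid
    candidate = deterministic_mid(raw, host)
    if candidate not in seen:
        return candidate
    # The deterministic ID is taken as well (e.g. somebody pre-registered
    # it on purpose), so give up determinism to get the message archived.
    while True:
        candidate = "<%s@%s>" % (secrets.token_hex(16), host)
        if candidate not in seen:
            return candidate
```

The deterministic step means two independent imports of the same message tend to agree on the injected Message-ID; the random step only runs when that ID is already occupied.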
|