Date | Commit message |
|
Xapian docids have been tied to the over {num} column for
nearly 3 years, now; and OIDs are no longer stored in Xapian
document data. There's no need to increase code and IPC
complexity by passing the OID around.
|
|
Otherwise, any explicitly set shard counts were ignored and
we'd be counting CPUs every single time.
|
|
v1 and v2 inbox indexing now support graceful shutdown checks
just like ExtSearchIdx. Additionally, we'll consistently
perform quit checks at the top of loops.
Interaction with the --xapian-only and --sequential-shard
options is a bit lacking, and will warn the user to use
"--reindex --xapian-only" to fix.
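The "quit check at the top of the loop" pattern described above can be sketched in plain sh. This is only an illustration, not the actual Perl indexing code: the `quit' variable, the `index_all' function and the choice of SIGTERM are all assumptions made for the sketch.

```shell
#!/bin/sh
# Sketch only: `quit' and `index_all' are hypothetical names,
# and using SIGTERM as the shutdown signal is an assumption.
quit=
trap 'quit=1' TERM

index_all() {
	for msg in "$@"; do
		# quit check at the top of the loop, before starting new work
		if [ -n "$quit" ]; then
			echo "stopping before $msg"
			return 0
		fi
		echo "indexed $msg"
	done
}
```

Checking only at the top of the loop means an item is either fully processed or not started, so a later run can resume without repeating work.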
|
|
This simplifies callers and allows empty newsgroups to be
represented (the WWW UI may be insufficient there, too).
|
|
This will be used to index and search Inbox objects and perhaps
individual git repositories/epochs for grokmirror manifest.js.gz
generation. There is no sharding planned for this at the moment
since inbox count should remain low (~100K to 1M) compared to
message count.
Folding this into the existing sharded DBs could be possible;
but would likely increase query and maintenance costs, as well
as development complexity. So we'll use a few more inodes and
FDs at runtime, instead.
|
|
We can also avoid a needless progress message on log2stack
interruptions.
|
|
Just like the daemon processes, -extindex now supports graceful
shutdown via the same signals. This lets users avoid having to
repeat indexing messages when a power outage strikes during a
long (multi-hour/day) indexing run.
Per-inbox (v1/v2) -index graceful shutdowns are not supported,
yet, but are planned for later.
|
|
There's no need to continuously append to {todo} when indexing
multiple inboxes. They're not redundantly indexed (because the
IdxStack is discarded, making it a noop), but it's still a waste
of memory keeping the $unit hashrefs around.
|
|
Since all.git (v2) and ALL.git (extindex) encompass every single
epoch or indexed inbox; and is_ancestor() only uses hexadecimal
OIDs; there is no good reason to use $unit->{git} for an
epoch-local $git->check.
This prevents dozens/hundreds of --batch-check processes from
being left running after indexing and can improve locality
if size checks are being done (since that uses --batch-check,
too).
Theoretically several epochs may have conflicting OIDs, but
we're screwed in those cases, anyways, so we might as well
detect it earlier (though I'm not sure what the behavior would
be :x).
|
|
This will set us up for supporting graceful shutdown
on -index without repeating any work.
|
|
With async git blob retrievals, the OID being enqueued and the
OID being processed can be totally unrelated and misleading.
We'll also prefix $INBOX_DIR for v2, and not just the epoch
since we could be indexing multiple inboxes via both -index
and -extindex.
|
|
Since extindex holds no locks on parallel inbox writers,
we can simply use "barrier" IPC shard commands to checkpoint
and avoid respawning shard or git processes.
|
|
We can now handle cases where messages are edited in one inbox
but not another, bifurcating the message.
V2Writable::log_range handles some edge-cases which could happen
in v2-only code paths, as well, but weren't usually triggered
due to default git-gc knobs not pruning immediately.
|
|
We'll be validating against this in the future to stop
bugs from creeping in.
|
|
And clearly label it. We may try to reuse some of this for v1
indexing code paths.
|
|
This will let us use it from ExtSearchIdx.
|
|
This will allow ExtSearchIdx to override or reuse them more
easily. Unfortunately we lose prototype validation, but that
seems to be discouraged anyways given the 'signatures' feature
in Perl 5.20+.
|
|
This will make it easier to reuse some indexing code for ExtSearchIdx.
|
|
Using `->can(method)' allows subclasses to override `index_oid'
and `unindex_oid' methods.
|
|
We want to reuse this code for ExtSearchIdx, eventually.
|
|
Since we store {ibx} in $sync state, we no longer have to
pass it as an argument to log2stack.
|
|
ExtSearchIdx will not have Msgmap, since it may index
non-email blobs in the future (it'll still be usable
with IMAP, but not NNTP).
|
|
"remote" used to imply "child process on the same machine" which
was somewhat nonsensical, anyways. And OverIdx has been in the
same process since v2 was finalized. So use the suffix "aux"
for "auxiliary" since it can be safely jettisoned without
breaking URLs.
|
|
We'll be using per-sync-state {ibx} refs instead, so make parts
of the v2 indexing code less-dependent on $self->{ibx} where
$self is a V2Writable object.
|
|
This will be needed for ExtSearchIdx which doesn't have a
persistent PublicInbox::Inbox object.
|
|
This will make it easier to use in ExtSearchIdx.
|
|
We'll be reusing this for external indices and possibly
other places.
|
|
External indices won't have $self->{ibx} since they need to
deal with multiple inboxes. We can also hoist out
->parallel_init to make it easier to distinguish the
non-parallel control flow.
|
|
We'll try to reuse as much V2Writable code as possible for
external indices, but the way "last_commit" info is stored
must be different as external indices will deal with last_commit
info for multiple inboxes.
|
|
This will make it easier to share code with ExtSearchIdx.
|
|
->cat_async and ->check_async may trigger each other (in future
callers) while waiting, so we need a unified method to ensure
both complete. This doesn't affect current code, but allows us
to slightly simplify existing callers.
|
|
Users may want to change the default branch used for git epochs
in v2 (v1 SearchIdx always used whatever "HEAD" pointed to).
|
|
{unindex_range} only exists in the $sync state, nowadays, not the
V2Writable ($self) object. $sync->{unindex_range} won't be
populated if $regen_max is zero, either, unless somebody is
injecting importable commits into an epoch history, in which
case this change will result in no-op indexing doing no work.
|
|
Perl implements tail recursion via `goto', which allows
avoiding warnings on deep recursion. However, it doesn't (as of
5.28) optimize the speed of such dispatches, though it may reduce
ephemeral memory usage.
Make the code less alien to hackers coming from other languages
by using normal subroutine dispatch. It's actually slightly
faster in micro benchmarks due to the complexity of `goto &NAME'.
|
|
We'll also fix the read-only code to ensure we notice missing
Xapian shards, since gaps would throw off our expectation that
Xapian document IDs and NNTP article numbers are interchangeable.
|
|
We'll use {oidx} as the common field name for the read-write
OverIdx, here, to disambiguate it from the read-only {over}
field. This hopefully makes it clearer which code paths are
read-only and which are read-write.
|
|
This should further mitigate lock contention problems
when -watch is configured to watch a Maildir for spam
while performing a large NNTP import.
There is now a small risk a message won't get removed if
it's in the current (uncommitted) fast-import batch, but that's
unlikely given the batch size is now only 10 messages.
If that small window is hit, flipping the \Seen flag
(e.g. marking it unread, and then read again) will trigger
another removal attempt via IMAP or Maildir.
|
|
Since we got rid of over->connect, `disconnect' no longer pairs
with it. So name it after the `close(2)' syscall it ultimately
issues.
|
|
The SWIG binding won't auto-convert IV/UV to PV like the XS
Search::Xapian binding would, so work around that shortcoming
for now.
Fixes: a367ec1b15a2458 ("mbox: disable "&t" on existing Xapian until full reindex")
|
|
There's no reason we'd want Xapian to defer flushing once we've
indexed everything belonging to a particular shard.
|
|
Expanding threads via over.sqlite3 for mbox.gz downloads without
Xapian effectively collapsing on the THREADID column leads to
repeated messages getting downloaded.
To avoid that situation, use a "has_threadid" Xapian metadata
flag that's only set on --reindex (and brand new Xapian DBs).
This allows admins to upgrade WWW or do --reindex in any order;
without worrying about users eating up bandwidth and CPU cycles.
|
|
We'll also rename the /^remote_/ prefix to "shard_", since
remote implies the process is on a different host. These
methods only pass messages to a child process on the same host
OR perform operations within the same process.
|
|
Otherwise things get very confusing when verbosity is enabled :x
|
|
We use IdxStack via log2stack() from SearchIdx, now.
|
|
Move away from hard-to-read alllowercase naming and favor
snake_case or separated-by-dashes.
We'll keep `--indexlevel' as-is for now, since it's been around
for several releases; but we'll support `--index-level' in the
CLI and update our documentation in a few months.
We'll also clarify that publicInbox.indexMaxSize is only
intended for -index, and not -watch or -mda.
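Accepting both spellings of the option could look roughly like the sh sketch below. It is not how the real CLI parses options; `parse_index_level' is a hypothetical helper, and the `--option=VALUE' form is assumed for brevity.

```shell
#!/bin/sh
# Sketch only: parse_index_level is a hypothetical helper which
# treats the old and new spellings as aliases of each other.
parse_index_level() {
	level=
	for arg in "$@"; do
		case $arg in
		--indexlevel=*|--index-level=*) level=${arg#*=} ;;
		esac
	done
	echo "$level"
}
```

Both `parse_index_level --indexlevel=basic' and `parse_index_level --index-level=basic' yield the same value, so existing invocations keep working while new documentation can prefer the dashed form.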
|
|
We can use open(..., undef) natively in Perl in t/import.t.
In places where we need a pathname, the File::Temp OO API
gives us auto-unlinking for free.
|
|
We should never reindex all data in Xapian unless --reindex is
specified on the command-line. This means users who put
publicInbox.indexSequentialShard in their config file won't have
to put up with a full reindex at every invocation, only when
they specify --reindex.
We'll also clean up the progress output to not emit nonsensical
ranges where the starting number is higher than the end.
|
|
getconf(1) itself is POSIX, while `_NPROCESSORS_ONLN' is not.
However, FreeBSD (tested 11.4 and 12.1) and glibc (tested CentOS
7.x and Debian 10.x) both support `getconf _NPROCESSORS_ONLN'.
GNU coreutils (and thus `nproc' or `gnproc') are not installed
by default on the *BSDs, so we'll try the option most likely
to exist on both glibc and *BSDs out-of-the-box.
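A minimal sh sketch of this kind of detection follows. `detect_nproc' is a hypothetical name, and falling back to 1 when getconf is missing or prints something non-numeric is an assumption of the sketch, not necessarily what the Perl code does.

```shell
#!/bin/sh
# Sketch only: detect_nproc is a hypothetical helper.
# `_NPROCESSORS_ONLN' is not in POSIX, so anything that isn't a
# plain number (empty output, "undefined", ...) falls back to 1.
detect_nproc() {
	n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=
	case $n in
	''|*[!0-9]*) n=1 ;;
	esac
	echo "$n"
}
```

Callers could then use `$(detect_nproc)' as a default while still honoring an explicitly configured count.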
|
|
We need to account for whether shard parallelization is
enabled or not, since users of parallelization are expected
to have more RAM.
|
|
We'll continue supporting `--no-sync' even though it has yet to
make it into a release, but the term `sync' is overloaded in our
codebase which may be confusing to new hackers and users.
Besides, neither our code nor our dependencies issue the sync(2)
syscall, only fsync(2) and fdatasync(2).
|