Date | Commit message |
|
There is no need to verify checksums of data already stored in
git. Doing this ourselves also limits flexibility in moving to
other hashes.
|
|
Inboxes may be removed or newsgroups renamed over time.
Introduce a switch to do garbage collection and eliminate stale
search and xref3 results based on inboxes which remain in the
config file.
This may also fix up stale results left over from any bugs.
This is also useful in case a clumsy BOFH (me :P) is swapping
between several PI_CONFIGs and accidentally indexes a bunch of
inboxes they didn't intend to.
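
As a rough illustration of the idea (not the project's Perl code), the
stale set is simply the indexed inbox keys that no longer appear in the
config file. The names `stale_inbox_keys', `indexed_keys' and
`configured_keys' are hypothetical.

```python
def stale_inbox_keys(indexed_keys, configured_keys):
    """Return indexed inbox keys which are no longer in the config file."""
    configured = set(configured_keys)
    # Keep the order of indexed_keys so the result is deterministic.
    return [key for key in indexed_keys if key not in configured]
```

Anything returned here would be a candidate for having its search and
xref3 results dropped.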
|
|
We need to completely remove a message from over.sqlite3 and
Xapian when no references remain, otherwise users will still see
the removed messages in NNTP overviews and WWW search
results/summaries.
References to messages are now solely handled by the `xref3'
table of over.sqlite3. We can also trust `xref3' when deciding
whether to remove only the "O$eidx_key" and "G$lid" terms from a
document in Xapian or to remove the entire Xapian document.
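
A minimal sketch of that decision, assuming a simplified `xref3' table
with only hypothetical `docid' and `eidx_key' columns (the real schema
is not shown here and likely differs):

```python
import sqlite3

def remove_xref(db, docid, eidx_key):
    """Drop one inbox's reference and say what to do on the Xapian side."""
    db.execute(
        "DELETE FROM xref3 WHERE docid = ? AND eidx_key = ?",
        (docid, eidx_key),
    )
    (remaining,) = db.execute(
        "SELECT COUNT(*) FROM xref3 WHERE docid = ?", (docid,)
    ).fetchone()
    # Other inboxes still reference the message: only strip this inbox's
    # terms.  No references remain: the whole document can go.
    return "remove_terms" if remaining else "remove_document"
```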
|
|
v1 and v2 inbox indexing now supports graceful shutdown checks
just like ExtSearchIdx. Additionally, we'll consistently
perform quit checks at the top of loops.
Interaction with the --xapian-only and --sequential-shard
options is a bit lacking; the user will be warned to use
"--reindex --xapian-only" to fix.
|
|
This allows us to filter out duplicate alternates entries in case
there are symlinks or bind mounts in play, as I (and perhaps some
other users) tend to use symlinks and/or bind mounts heavily.
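
One way to detect such duplicates, sketched here as an assumption
rather than the actual implementation, is to compare the device and
inode numbers of each directory. stat(2) follows symlinks, and on
typical Linux setups a bind-mounted view of a directory reports the
same device/inode pair as the original. `unique_alternates' is a
hypothetical name.

```python
import os

def unique_alternates(paths):
    """Keep the first path seen for each underlying directory."""
    seen = set()
    unique = []
    for path in paths:
        st = os.stat(path)  # follows symlinks
        ident = (st.st_dev, st.st_ino)
        if ident not in seen:
            seen.add(ident)
            unique.append(path)
    return unique
```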
|
|
This was intended to make development easier, but also allows
description, URL, and address changes to be picked up
independently of message history.
|
|
We shouldn't leave "cat-file --batch" processes around when
we're done with an epoch or inbox, since there could be
many thousands.
|
|
This will be used to index and search Inbox objects and perhaps
individual git repositories/epochs for grokmirror manifest.js.gz
generation. There is no sharding planned for this at the moment
since inbox count should remain low (~100K to 1M) compared to
message count.
Folding this into the existing sharded DBs could be possible;
but would likely increase query and maintenance costs, as well
as development complexity. So we'll use a few more inodes and
FDs at runtime, instead.
|
|
Just like the daemon processes, -extindex now supports graceful
shutdown via the same signals. This lets users avoid having to
repeat indexing messages when a power outage strikes during a
long (multi-hour/day) indexing run.
Per-inbox (v1/v2) -index graceful shutdowns are not supported
yet, but are planned for later.
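
The general pattern, sketched under assumptions: a signal handler only
sets a flag, and the indexing loop checks that flag at the top of each
iteration so it stops between messages rather than mid-message. The
class and method names are hypothetical, and which signal gets wired
to `request_quit' is left to the caller.

```python
import signal

class Indexer:
    def __init__(self):
        self.quit_requested = False
        self.done = []

    def request_quit(self, signum, frame):
        # Only set a flag; the loop decides when it is safe to stop.
        self.quit_requested = True

    def run(self, messages):
        for msg in messages:
            if self.quit_requested:  # quit check at the top of the loop
                break
            self.done.append(msg)    # stand-in for indexing one message
        return self.done
```

A caller might install it with something like
`signal.signal(signal.SIGTERM, indexer.request_quit)' before calling
`run'.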
|
|
There's no need to continuously append to {todo} when indexing
multiple inboxes. They're not redundantly indexed (because the
IdxStack is discarded, making it a noop), but it's still a waste
of memory keeping the $unit hashrefs around.
|
|
Since all.git (v2) and ALL.git (extindex) encompass every single
epoch or indexed inbox; and is_ancestor() only uses hexadecimal
OIDs; there is no good reason to use $unit->{git} for an
epoch-local $git->check.
This prevents dozens/hundreds of --batch-check processes from
being left running after indexing and can improve locality
if size checks are being done (since that uses --batch-check,
too).
Theoretically several epochs may have conflicting OIDs, but
we're screwed in those cases, anyways, so we might as well
detect it earlier (though I'm not sure what the behavior would
be :x).
|
|
This will set us up for supporting graceful shutdown
on -index without repeating any work.
|
|
Matching the behavior of git-fast-import(1), we'll allow a user
to send SIGUSR1 to checkpoint over.sqlite3 and Xapian.
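
A minimal sketch of signal-triggered checkpointing, not the project's
code: the handler records the request, and the main loop performs the
actual commit at a safe point. `Checkpointer' and its `commit'
callback are hypothetical names.

```python
import signal

class Checkpointer:
    def __init__(self, commit):
        self.commit = commit      # callable which flushes pending writes
        self.requested = False

    def install(self):
        # Unix-only: SIGUSR1 does not exist on every platform.
        signal.signal(signal.SIGUSR1, self._on_usr1)

    def _on_usr1(self, signum, frame):
        self.requested = True     # defer the real work to the main loop

    def maybe_checkpoint(self):
        """Call between messages; commits only if a signal arrived."""
        if not self.requested:
            return False
        self.requested = False
        self.commit()
        return True
```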
|
|
With async git blob retrievals, the OID being enqueued and the
OID being processed can be totally unrelated and misleading.
We'll also prefix $INBOX_DIR for v2, and not just the epoch
since we could be indexing multiple inboxes via both -index
and -extindex.
|
|
This makes `ps' output look a bit nicer if there's trailing
slashes involved from the command-line.
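
For illustration only, trimming trailing slashes before a path is shown
in a process title could look like this; `tidy_path' is a hypothetical
helper.

```python
def tidy_path(path):
    """Strip trailing slashes, leaving "/" and empty strings alone."""
    if not path:
        return path
    stripped = path.rstrip("/")
    return stripped or "/"  # a path of only slashes is the root
```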
|
|
"deleted" messages (via -learn <spam|rm>) in the source inboxes
are likely to already be unindexed, so avoid triggering needless
warnings about the spam message being missing.
|
|
As with fill_alternates in V2Writable, we do not need to update
$GIT_DIR/objects/info/alternates if nothing is changed.
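
A sketch of the "only write when something changed" idea, assuming the
alternates file is a plain list of one path per line.
`update_alternates' and the temporary file name are hypothetical.

```python
import os

def update_alternates(path, wanted_lines):
    """Rewrite the alternates file only if its contents would change."""
    new = "".join(line + "\n" for line in wanted_lines)
    try:
        with open(path) as fh:
            if fh.read() == new:
                return False  # nothing changed, leave the file alone
    except FileNotFoundError:
        pass
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        fh.write(new)
    os.replace(tmp, path)  # atomic rename on POSIX
    return True
```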
|
|
This is needed to limit the RSS of processes and ensure the
stored data in over.sqlite3 and Xapian DBs are consistent if
interrupted. Without checkpoints, indexing lore causes shard
workers to take several GB of memory and thrash/OOM smaller
systems.
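
The memory-limiting effect comes from committing after a bounded amount
of work instead of once at the end. A rough sketch, with a hypothetical
byte threshold and a `commit' callback standing in for flushing
over.sqlite3 and Xapian:

```python
class BatchIndexer:
    def __init__(self, commit, batch_bytes):
        self.commit = commit            # flushes everything pending
        self.batch_bytes = batch_bytes  # hypothetical threshold
        self.pending = 0
        self.checkpoints = 0

    def add(self, raw_message):
        self.pending += len(raw_message)
        if self.pending >= self.batch_bytes:
            self.checkpoint()

    def checkpoint(self):
        # After this point an interruption loses at most one batch.
        self.commit()
        self.pending = 0
        self.checkpoints += 1
```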
|
|
This bit is duplicated with per-Inbox indexing in Admin;
I'm undecided if it's the right place for it.
|
|
We can now handle cases where messages are edited in one inbox
but not another, bifurcating the message.
V2Writable::log_range handles some edge cases which could happen
in v2-only code paths, as well, but weren't usually triggered
due to default git-gc knobs not pruning immediately.
|
|
It doesn't seem worth storing xref3 data in Xapian now that
the same info is in over.sqlite3.
|
|
A couple more things to prepare us to run syncs on
both v1 and v2 inboxes.
|
|
Now that the V2Writable code is more generic, we can
sync with it to use `units' which represent either
a v2 epoch or an entire v1 inbox.
|
|
Moved to per-epoch "units".
|
|
We'll use `index_oid' and `unindex_oid' as our method names
so V2Writable methods may use `$self->can' to access them.
|
|
It compiles...
|