Date | Commit message |
|
We no longer read Xapian docdata and favor hitting over.sqlite3
instead, as Xapian is less likely to be available than SQLite.
|
|
Having parse_references in OverIdx was awkward and Smsg is
a better place for it.
|
|
Xapian v1.2.21..v1.2.24 failed to set the close-on-exec flag
on the flintlock FD, causing "git cat-file" processes to
hold onto the lock and prevent subsequent Xapian::WritableDatabase
from locking the DB. So clean up git processes after committing
the miscidx transaction.
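
To illustrate the failure mode, here is a minimal Python sketch (not Xapian
or public-inbox code) of how a descriptor without the close-on-exec flag
leaks into a child process, and how setting the flag prevents it. The file
name `flintlock` in a temporary directory and the helper names `is_cloexec`,
`child_sees_fd` and `demo` are stand-ins chosen for this example; it assumes
a POSIX system.

```python
import fcntl
import os
import subprocess
import sys
import tempfile


def is_cloexec(fd):
    """Return True if the close-on-exec flag is set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def child_sees_fd(fd):
    """Spawn a child and report whether fd refers to the same file there."""
    code = ("import os, sys; s = os.fstat(int(sys.argv[1])); "
            "print(s.st_dev, s.st_ino)")
    # close_fds=False mimics a plain fork+exec which closes nothing itself,
    # so only the close-on-exec flag decides what the child inherits
    proc = subprocess.run([sys.executable, "-c", code, str(fd)],
                          close_fds=False, capture_output=True, text=True)
    mine = os.fstat(fd)
    return proc.stdout.split() == [str(mine.st_dev), str(mine.st_ino)]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # hypothetical stand-in for the lock file
        fd = os.open(os.path.join(tmp, "flintlock"), os.O_CREAT | os.O_RDWR)
        try:
            os.set_inheritable(fd, True)    # clears FD_CLOEXEC: buggy state
            leaked = (not is_cloexec(fd)) and child_sees_fd(fd)
            os.set_inheritable(fd, False)   # sets FD_CLOEXEC: fixed state
            private = is_cloexec(fd) and not child_sees_fd(fd)
        finally:
            os.close(fd)
    return leaked, private
```

A long-lived child in the "buggy state" keeps the lock file open for as
long as it runs, which is why reaping such children after the commit works
around the problem.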
|
|
We can more clearly distinguish between v1 and v2-only code
paths this way, and may be able to save a few cycles.
|
|
We don't need to be keeping the raw message around after it hits
git. Shard work now relies on Storable (or Sereal) and all of
the indexing code relies on the Email::MIME-like API of Eml to
access interesting parts of the message.
Similarly, smsg->{raw_bytes} is no longer carried around and we
do the CRLF adjustment when setting smsg->{bytes}.
There's also a small simplification to t/import.t while
we're in the area to use xqx instead of spawn/popen_rd.
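
As a rough illustration of the CRLF adjustment, the following Python sketch
assumes the adjustment means reporting the size a message would have once
every bare LF is sent as CRLF. The function names `crlf_adjust` and
`wire_bytes` are used here for illustration only and are not taken from the
patch.

```python
def crlf_adjust(raw: bytes) -> int:
    """Count bare LFs, i.e. the extra bytes needed once each becomes CRLF."""
    return raw.count(b"\n") - raw.count(b"\r\n")


def wire_bytes(raw: bytes) -> int:
    """Size of the message after converting bare LF line endings to CRLF."""
    return len(raw) + crlf_adjust(raw)
```

Computing the adjusted size once, at the point where the size is recorded,
means a separate raw size no longer has to travel alongside it.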
|
|
We can remove some now-pointless wrapper functions by using
->ipc_do in even more places.
|
|
It's nice to prove the new code works by swapping it into
the current V2Writable / SearchIdxShard packages. This is
only the first step for the core bits, and we'll be able
to delete more code in a subsequent patch.
|
|
Using "make update-copyrights" after setting GNULIB_PATH in my
config.mak
|
|
* origin/master: (58 commits)
ds: flatten + reuse @events, epoll_wait style fixes
ds: simplify EventLoop implementation
check defined return value for localized slurp errors
import: check for git->qx errors, clearer return values
git: qx: avoid extra "local" for scalar context case
search: remove {mset} option for ->mset method
search: remove pointless {relevance} setting
miscsearch: take reopen from Search and use it
extsearch: unconditionally reopen on access
extindex: allow using --all without EXTINDEX_DIR
extindex: add undocumented --no-scan switch
extindex: enable autoflush on STDOUT/STDERR
extindex: various --watch signal handling fixes
extindex: --watch for inotify-based updates
eml: fix undefined vars on <Perl 5.28
t/config: test --get-urlmatch for git <2.26
default to CORE::warn in $SIG{__WARN__} handlers
inbox: name variable for values loop iterator
inboxidle: avoid needless syscalls on refresh
inboxidle: clue users into resolving ENOSPC from inotify
...
|
|
We'll count the number of log changes (regardless of index or
unindex) and only attach inboxes to ExtSearchIdx objects when
they get new work. We'll also reduce lock bouncing and only
update external indices after all per-inbox indexing is done.
This also updates existing v2 indexing/unindexing callers
to be more consistent and ensures unindex log entries update
per-inbox last commit information.
|
|
This simplifies all ->with_umask callers and opens the
door for further optimizations to delay/elide process spawning.
|
|
This brings -nntpd startup time down from ~35s to ~5s with 50K
inboxes.
Further improvements ought to be possible with deeper changes to
MiscIdx, since -mda having to load every inbox seems unreasonable;
but this general change is fairly unintrusive.
|
|
Values can be strings in Xapian, although we currently use
integer values exclusively. Give the wrapper a more appropriate
name in case we start using string columns.
For future-proofing, we'll now return `undef' on missing columns
and coerce the return value to an IV (integer value) to save
memory, as sortable_unserialise returns a PV (pointer value)
scalar even though it exists to support numeric values.
|
|
This reduces differences between v1 and v2 code, and
introduces ->xdb_shards_flat to provide read-only access
to shards without using Xapian::MultiDatabase. This
will allow us to combine shards of several inboxes
AND extindexes for lei.
|
|
Still unstable, this builds off the equally unstable extindex :P
This will be used for caching/memoization of traditional mail
stores (IMAP, Maildir, etc) while providing indexing via Xapian,
along with compression and checksumming from git.
Most notably, this adds the ability to add/remove per-message
keywords (draft, seen, flagged, answered) as described in the
JMAP specification (RFC 8621 section 4.1.1).
We'll use `.' (a single period) as an $eidx_key since it's an
invalid {inboxdir} or {newsgroup} name.
|
|
-index runs on data that's already frozen in git, so there's
no point in warning users about it.
While we're at it, set the {current_info} prefix for v1 as
we do in v2 inboxes in case new problems show up.
|
|
Since we're inside a Xapian transaction, calling ->index_raw
followed by ->shard_add_eidx_info calls on the same docid
doesn't seem to hurt indexing performance. It definitely
reduces FS read traffic and IPC from git at the cost of some
more IPC between the parent and workers. Nevertheless, the code
and FD reductions seem worth it.
|
|
--reindex allows us to catch missed and stale messages due to
-extindex vs -index races prior to commit 02b2fcc46f364b51
("extsearchidx: enforce -index before -extindex").
We'll also rely on reindex to internally deal with v1/v2 inbox
removals and partial-unindexing of messages which are only
removed from one inbox out of many.
This reindex design is completely different from how normal
v1/v2 inbox reindex operates due to extindex having multiple
histories to work with. Instead of scanning git history, this
relies exclusively on comparing over.sqlite3 contents between
the v1/v2 inboxes and the extindex.
Changes to Xapian behavior also get picked up, now. Xapian indexing
is handled by workers with minimal IPC to the parent process.
This results in more read I/O but fewer writes when dealing
with cross-posted messages.
Changes to $smsg->populate and --rethread still need further
work.
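
The comparison-based approach can be sketched in Python with in-memory
SQLite databases. This is only an illustration under stated assumptions:
the single-column `msgs` table and the helpers `make_db`, `oids` and
`plan_reindex` are hypothetical and do not reflect the real over.sqlite3
schema or the actual reindex code.

```python
import sqlite3


def make_db(rows):
    """Build a throwaway DB with a hypothetical one-column schema."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE msgs (oid TEXT PRIMARY KEY)")
    db.executemany("INSERT INTO msgs (oid) VALUES (?)", [(r,) for r in rows])
    return db


def oids(db):
    """Return the set of object IDs recorded in one DB."""
    return {row[0] for row in db.execute("SELECT oid FROM msgs")}


def plan_reindex(inbox_dbs, ext_db):
    """Compare per-inbox contents against the external index.

    Returns (missed, stale): IDs present in some inbox but absent from
    the external index, and IDs the external index still has although
    no inbox contains them anymore.
    """
    wanted = set()
    for db in inbox_dbs:
        wanted |= oids(db)
    have = oids(ext_db)
    return sorted(wanted - have), sorted(have - wanted)
```

A message cross-posted to two inboxes stays in `wanted` as long as either
inbox still has it, which mirrors the partial-unindexing case described
above.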
|
|
This should help us detect bugs in our code or storage
synchronization problems more easily. This probably won't
detect corrupted git storage, but can detect corrupted SQLite
files.
"Bad blobs, bad blobs, whatcha gonna do when they come for you?"
|
|
Xapian docids have been tied to the over {num} column for
nearly 3 years, now; and OIDs are no longer stored in Xapian
document data. There's no need to increase code and IPC
complexity by passing the OID around.
|
|
Inboxes may be removed or newsgroups renamed over time.
Introduce a switch to do garbage collection and eliminate stale
search and xref3 results based on inboxes which remain in the
config file.
This may also fix up stale results left behind by bugs.
This is also useful in case a clumsy BOFH (me :P) is swapping
between several PI_CONFIGs and accidentally indexed a bunch of
inboxes they didn't intend to.
|
|
v1 and v2 inbox indexing now supports graceful shutdown checks
just like ExtSearchIdx. Additionally, we'll perform quit checks
at the top of loops for consistency.
Interaction with the --xapian-only and --sequential-shard
options is a bit lacking, and we'll warn the user to use
"--reindex --xapian-only" to fix it.
|
|
This will be used to index and search Inbox objects and perhaps
individual git repositories/epochs for grokmirror manifest.js.gz
generation. There is no sharding planned for this at the moment
since inbox count should remain low (~100K to 1M) compared to
message count.
Folding this into the existing sharded DBs could be possible;
but would likely increase query and maintenance costs, as well
as development complexity. So we'll use a few more inodes and
FDs at runtime, instead.
|
|
The initial "git log" invocation for a git epoch can be time
consuming, so check for graceful shutdown at each line to ensure
timely shutdowns and avoid SSD/HDD wear.
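
The per-line check can be sketched in Python as follows. This is a generic
pattern, not the code from this patch: `QuitFlag`, `install_handlers` and
`consume_log` are names invented for this example, and any iterable of
lines stands in for the output of "git log".

```python
import signal


class QuitFlag:
    """Records that a graceful shutdown was requested."""

    def __init__(self):
        self.requested = False

    def request(self, signum=None, frame=None):
        # usable both as a signal handler and as a plain method
        self.requested = True


def install_handlers(flag):
    """Route SIGTERM and SIGINT to the flag instead of dying mid-write."""
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, flag.request)


def consume_log(lines, flag, handle):
    """Process lines one by one, checking for shutdown before each one."""
    done = 0
    for line in lines:
        if flag.requested:      # quit check at the top of the loop
            break
        handle(line)
        done += 1
    return done
```

Because the check happens before each line is handled, a shutdown request
is honoured after at most one more line of work.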
|
|
This will set us up for supporting graceful shutdown
on -index without repeating any work.
|
|
In case of other bugs or intentional corruption of over.sqlite3,
we don't want to attempt dereferencing a non-ref scalar when
calling ->mid_delete in the fallback code path.
Noticed while chasing another bug in extindex development...
|
|
This seems necessary for some cross-posted messages (and we did
it historically before we used over.sqlite3).
|
|
It doesn't seem worth storing xref3 data in Xapian now that
the same info is in over.sqlite3.
|
|
In case we want to reuse code with ExtSearchIdx or V2Writable.
|
|
This will let us work consistently with both existing inboxes
and external indices.
|
|
We'll be needing it in ExtSearchIdx for the next commit.
|
|
Since we store {ibx} in $sync state, we no longer have to
pass it as an argument to log2stack.
|
|
This will allow reuse with ExtSearchIdx.
|
|
Having a special init path for external indices is probably
easier than further overloading SearchIdx->new initialization
to work without an Inbox object.
|
|
Not yet tested, but Perl compiles it!
|
|
Using `O' (owner) here (according to Xapian omega's
termprefixes.rst) since we could say the newsgroup or inbox is
the owner of the given message.
|
|
This is preferable to open-coding "newsgroup // inboxdir" everywhere.
|
|
This will be used to track cross-posted messages in the
external/detached index.
|
|
This will be used by external/detached indices, too.
|
|
->cat_async and ->check_async may trigger each other (in future
callers) while waiting, so we need a unified method to ensure
both complete. This doesn't affect current code, but allows us
to slightly simplify existing callers.
|
|
We don't want a List-Id value being confused with a Xapian
term prefix, here.
Followup-to: 8b06cda3a3af3f0e ("mda: match List-Id insensitively")
|
|
This switch is still undocumented, but we can reduce the scope
of our Xapian docdata dependency by moving its only caller to
SearchIdx. This reduces the amount of code loaded by read-only
code paths.
|
|
We'll use {oidx} as the common field name for the read-write
OverIdx, here, to disambiguate it from the read-only {over}
field. This hopefully makes it clearer which code paths are
read-only and which are read-write.
|
|
croak() can give more context on the failure, and setting
`PERL5OPT=-MCarp=verbose' can force a stacktrace.
|
|
When expanding threads via over.sqlite3 for mbox.gz downloads,
repeated messages get downloaded unless Xapian effectively
collapses results on the THREADID column.
To avoid that situation, use a "has_threadid" Xapian metadata
flag that's only set on --reindex (and brand new Xapian DBs).
This allows admins to upgrade WWW or do --reindex in any order;
without worrying about users eating up bandwidth and CPU cycles.
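
The duplication problem can be shown with a small, self-contained Python
sketch. It does not use Xapian; the dictionaries `thread_of` and `members`
and the function `expand_threads` are hypothetical stand-ins for the search
hits and the thread lookup.

```python
def expand_threads(hits, thread_of, members, collapse=True):
    """Expand each search hit to its whole thread.

    With collapse=True, only the first hit per thread ID is expanded,
    so every message is emitted once. With collapse=False, every hit
    expands its thread again, repeating messages.
    """
    out = []
    seen_tids = set()
    for num in hits:
        tid = thread_of[num]
        if collapse:
            if tid in seen_tids:
                continue
            seen_tids.add(tid)
        out.extend(members[tid])
    return out
```

With two hits in the same thread, the uncollapsed expansion emits that
thread twice, which is the wasted bandwidth and CPU the metadata flag is
meant to guard against.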
|
|
This is the `tid' column from over.sqlite3; and will be used for
IMAP and JMAP search (among other things).
|
|
We'll also rename the /^remote_/ prefix to "shard_", since
remote implies the process is on a different host. These
methods only pass messages to a child process on the same host
OR perform operations within the same process.
|
|
Since we no longer read document data from Xapian, allow users
to opt-out of storing it.
This breaks compatibility with previous releases of
public-inbox, but gives us a ~1.5% space savings on Xapian
storage (and associated I/O and page cache pressure reduction).
|
|
The rest of our indexing code uses `$opt' instead of `$opts'.
|
|
Move away from hard-to-read alllowercase naming and favor
snake_case or separated-by-dashes.
We'll keep `--indexlevel' as-is for now, since it's been around
for several releases; but we'll support `--index-level' in the
CLI and update our documentation in a few months.
We'll also clarify that publicInbox.indexMaxSize is only
intended for -index, and not -watch or -mda.
|