Date | Commit message |
|
barrier (synchronous checkpoint) is better than ->done with
parallel lei commands being issued (via '&' or different
terminals), since repeatedly stopping and restarting processes
doesn't play nicely with expensive tasks like `lei reindex'.
This introduces a slight regression in maintaining more
processes (and thus resource use) when lei is idle, but that'll
be fixed in the next commit.
|
|
BSD::Resource isn't packaged for Alpine (as of 3.19), but we
also have optional Inline::C support and already rely on calling
setrlimit(2) directly from the Inline::C version of pi_fork_exec.
|
|
The musl strftime(3) implementation on AlpineLinux 3.19.0
doesn't support `%k' and `%k' isn't in POSIX, either. So we
fall back to using the `sprintf' perlop in the user-facing UI,
since leading zeroes add needless overhead for my eyes and
brain when parsing the time.
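
A minimal sketch of the sprintf fallback, assuming the goal is the
space-padded hour that `%k' would give. The helper name fmt_hour_min
is hypothetical and not taken from the actual change:

```perl
use strict;
use warnings;

# Hypothetical helper: format hour and minute the way `%k:%M' would,
# i.e. with a space-padded (not zero-padded) hour, without relying on
# strftime(3) supporting `%k'.
sub fmt_hour_min {
	my ($hour, $min) = @_;
	sprintf('%2d:%02d', $hour, $min);
}

# localtime returns (sec, min, hour, ...) in list context
my ($min, $hour) = (localtime)[1, 2];
print fmt_hour_min($hour, $min), "\n";
```

Only the hour needs sprintf here; the remaining fields could still go
through strftime with POSIX conversions.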
|
|
Stale entries from newsgroup name changes (including adding
a `publicinbox.<name>.newsgroup' entry when none existed
before) can wreak havoc during a --reindex. So give users a
hint about running -extindex with --gc to clean
up stale entries.
|
|
This lets us get rid of some awkwardness around the old API
and single-use subroutines while saving us some LoC.
|
|
Since CodeSearchIdx doesn't deal with inboxes, it makes sense
to split it out from inbox-specific code and start moving
towards using OnDestroy to restore the umask at the end of
scope and reducing extra functions.
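
The end-of-scope umask restore can be sketched roughly as below.
ScopeGuard is a hypothetical stand-in written for this example (it is
not the project's OnDestroy class), and with_umask and restore_umask
are likewise made-up names:

```perl
use strict;
use warnings;

# Hypothetical stand-in for an OnDestroy-style guard: runs a callback
# with its arguments when the object goes out of scope.
package ScopeGuard;
sub new {
	my ($class, $cb, @args) = @_;
	bless [ $cb, @args ], $class;
}
sub DESTROY {
	my ($cb, @args) = @{$_[0]};
	$cb->(@args);
}

package main;

sub restore_umask { umask($_[0]) }

# Run $cb under a temporary umask; the guard restores the old value
# when it goes out of scope, even if $cb dies.
sub with_umask {
	my ($mask, $cb) = @_;
	my $old = umask($mask);
	my $guard = ScopeGuard->new(\&restore_umask, $old);
	$cb->();
}
```

Tying the restore to object lifetime means no explicit cleanup call is
needed on each return path.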
|
|
This allows us to avoid repeatedly using memory-intensive
anonymous subs in CodeSearchIdx where the callback is assigned
frequently. Anonymous subs are known to leak memory in old
Perls (e.g. 5.16.3 in enterprise distros) and are still
expensive in newer Perls. So favor the (\&subroutine, @args)
form, which allows us to eliminate anonymous subs going forward.
Only CodeSearchIdx takes advantage of the new API at the moment,
since it's the biggest repeat user of post-loop callback
changes.
Getting rid of the subroutine and relying on a global `our'
variable also has two advantages:
1) Perl warnings can detect typos at compile-time, whereas the
(now gone) method could only detect errors at run-time.
2) `our' variable assignment can be `local'-ized to a scope
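
A rough illustration of the (\&subroutine, @args) form combined with a
`local'-ized `our' variable. All names here (@post_loop_do, report,
run_post_loop, index_something) are hypothetical and only meant to show
the pattern:

```perl
use strict;
use warnings;

# Hypothetical global holding a (\&subroutine, @args) callback.
# With `use strict', misspelling this name is a compile-time error.
our @post_loop_do;

sub report { my ($label, $n) = @_; "$label: $n" }

# Invoke the currently assigned callback, if any.
sub run_post_loop {
	my ($cb, @args) = @post_loop_do;
	$cb ? $cb->(@args) : undef;
}

sub index_something {
	# `local' limits the assignment to this dynamic scope; the
	# previous value is restored automatically on return.
	local @post_loop_do = (\&report, 'indexed', 3);
	run_post_loop();
}
```

No closure is allocated per assignment: the callback is a reference to
a named sub plus a plain argument list.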
|
|
This is likely more familiar to readers of TAP (Test Anything
Protocol) output, as well as shell and Perl scripters, who also
use `#' for comments.
AFAIK, nobody is parsing our stderr, and I'm not sure how
standardized the `I:' prefix is (nor `W:' and `E:'). It's
already the prevailing style in Lei* code, too, so things have
been moving in that direction for a bit.
|
|
This enables Xapian::DB_DANGEROUS to support in-place updates.
This can speed up the initial index and reduce I/O at the cost
of preventing concurrent readers and being unsafe in the face of
any abnormal terminations. This is more dangerous than
--no-fsync. --no-fsync is only unsafe in the event of a power
loss or kernel crash; --dangerous is unsafe even on SIGKILL.
|
|
There seems to be a bug in v2 inbox reindexing somewhere...
|
|
Check for graceful termination at every message since it's
a fairly inexpensive check.
|
|
I'm not sure if this is a bug or not (or it could be
an old bug in the v2 indexing code).
|
|
Ensure the num highwater mark of the target inbox is stable
before using it. Otherwise we may end up repeating work
done to index a message.
|
|
Since this is intended for use on the command-line,
include TZ offset in time and try to shorten the
message a bit so it wraps less on a terminal.
|
|
We can't attempt to unref messages beyond the highwater mark of
an inbox. This bugfix was found by commit c485036d0b1ce7ed
(extindex: guard against buggy unrefs, 2021-10-14), which
actually did its intended job and guarded against a buggy unref.
|
|
Seeing the same warning over and over again gets annoying.
|
|
This makes some of our code less noisy by reducing the
amount of pack('H*', ...) use.
|
|
I noticed some unref messages which shouldn't have been
happening, but they were. Which is troubling. So add
a guard around an unref path until we can get to the bottom
of this.
|
|
This gives context as to where warnings are coming from.
|
|
AFAIK I've never hit these messages, but I might be glad
if I ever do.
|
|
This prevents unnecessary message renumbering and I/O.
Without this change, there is a small window for long-running
WWW streaming requests to miss a message that was unref-ed
before reindexing. If we expose an "All Mail" mailbox via
IMAP/JMAP, this will save client traffic.
|
|
When unref-ing a blob from xref3, make sure the "preferred"
smsg->{blob} doesn't point to the blob we just unrefed. This
is necessary because we periodically checkpoint our extindex
process to allow -watch and -mda processes to run.
This also gets rid of a lot of redundant code for ->remove_xref3,
since it's all handled in ExtSearchIdx, now.
|
|
We need to ensure a message is consistently removed from eidxq,
over and Xapian in all cases. Removing from eidxq saves users
from some noisy error messages.
|
|
We can use the same logic for --gc, --reindex, and
'd' log entries.
They're similar enough and the actual need to unref should
be fairly rare. We could go a lot faster if we didn't show
progress for --gc and --reindex, actually.
|
|
We also have the idea of active inboxes, too, so "active shards"
ought to make the purpose of the data structure more obvious.
|
|
This required some tweaking of xref3 indices in over.sqlite3,
but the end result is it brings no-op "--reindex --fast --all"
checks down to roughly 20 minutes (from 30-40 minutes) on
lore/all.
This is faster because a bunch of small SQLite queries are still
slower en masse than a bunch of perlops. Despite the lack of IPC
overhead, crossing .so boundaries and repeating lookups over
btrees is still slower than doing the same with Perl hash tables.
|
|
Otherwise, it gets too noisy and we repeat some work
when we do an actual sync, since the last_commit info
will be out-of-date.
|
|
We were deleting ghost entries. This was usually harmless since
other messages could fill-in-the-blanks, but could cause
misthreading in odd cases where a big chunk of a thread is
missing and the latest messages only referenced ghosts.
We'll also save some cycles when scanning Xapian shards since
docids won't be <= 0.
|
|
Don't bother decoding the 20-byte SHA-1 to a 40-byte hex value
since we don't read it, anyways. We can also use the on-stack
ibx->eidx_key value instead of dispatching the method again.
|
|
Avoiding repeated SQL statements brings --gc down to 2-3 minutes
from around 10. We'll also add some checkpoints around over and
xref3 cleanups.
|
|
This mode only checks history for missed/stale messages
and doesn't attempt to reindex messages which are already
indexed.
|
|
We need to ensure -extindex --gc runs don't prevent other
work from happening in the meantime. I actually caused
my -extindex to OOM due to the lack of checkpoints :x
We'll also hoist out the shard scanning into its own sub
in preparation for lei/store usage.
|
|
As with most of our internal-only code, favor smaller
comparisons to reduce memory traffic.
|
|
I'm not sure why they weren't emitted, earlier.
|
|
It doesn't seem to matter, actually, but this matches the
behavior of attach_inbox and the comment in ->new.
|
|
When indexing a single inbox, do not attempt reindexing code
paths without a full config, otherwise ordering comparisons
won't work.
|
|
Since signalfd is often combined with our event loop, give it a
convenient API and reduce the code duplication required to use it.
EventLoop is replaced with ::event_loop to allow consistent
parameter passing and avoid needlessly passing the package name
on stack.
We also avoid exporting SFD_NONBLOCK since it's the only flag we
support. There's no sense in having the memory overhead of a
constant function when it's in cold code.
|
|
This fixes the occasional t/lei-sigpipe.t infinite loop
under "make check-run".
Link: http://nntp.perl.org/group/perl.perl5.porters/258784
<CAHhgV8hPbcmkzWizp6Vijw921M5BOXixj4+zTh3nRS9vRBYk8w@mail.gmail.com>
Followup-to: b552bb9150775fe4 ("daemon+watch: fix localization of %SIG for non-signalfd users")
|
|
IMHO, this greatly improves code sharing and organization
between v2, extindex, and lei/store. Common git-related
logic for these is lightly-refactored and easier to reason
about.
The impetus for this big change was to ensure inboxes
created+managed by public-inbox-{clone,fetch} could have
alternates and configs set up properly without depending on
SQLite (via V2Writable). This change does that while
making old code shorter and better factored.
|
|
While messages from removed inboxes were removed from Xapian
search, --gc failed to remove messages from over.sqlite3
entirely. They no longer show up in the topic summary view.
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/20210830201723.dehoul4y6gpqf2cp@nitro.local/
|
|
Boost relies on knowledge of all inboxes in a given config file
to work properly. So while we support indexing a subset of
inboxes, we must still account for boost in inboxes we're not
indexing. So split internal inbox groups into "known" and
"active", where previously we only cared for inboxes which were
being actively indexed.
Furthermore, boost checks need to be applied when a
message arrives in different inboxes across multiple
invocations.
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/20210802204058.vscbxs5q7xyolyu2@nitro.local/
|
|
Cross-posted messages don't result in massive writes to the
Xapian DBs like a completely unseen message would, so stop
accounting for their size. This ought to improve performance
for heavily cross-posted setups, but --commit-interval still
has effect.
|
|
This wasn't wired up properly, but Xapian appears to suffer from
I/O amplification problems as DB shards get larger:
https://lists.xapian.org/pipermail/xapian-discuss/2019-February/009727.html
<23640.32170.703368.841021@y.dockes.com>
Of course, we shouldn't have too many shards, either, because
performance problems with too many shards were the entire reason
extindex was created:
https://lists.xapian.org/pipermail/xapian-discuss/2020-August/009823.html
<20200826064728.GA32239@dcvr>
|
|
We'll use 20-byte SHA-1 comparisons instead of 40-byte
hex representations for a minor reduction in memory
traffic.
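
For illustration, the difference between the two representations,
using git's well-known empty blob ID. The helper same_oid is
hypothetical:

```perl
use strict;
use warnings;

# git's well-known empty blob ID as 40 hex characters
my $hex = 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391';

# pack('H*', ...) converts it to the raw 20-byte form
my $bin = pack('H*', $hex);

# Hypothetical helper: compare a raw object ID against a hex one by
# packing the hex side, so the comparison itself is over 20 bytes.
sub same_oid {
	my ($oidbin, $oidhex) = @_;
	$oidbin eq pack('H*', $oidhex);
}

printf("%d hex chars, %d raw bytes\n", length($hex), length($bin));
```

Values that are already stored in raw form can be compared directly
with `eq' and need no unpack('H*', ...) step at all.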
|
|
The over.msgid table may contain ghost Message-IDs and also
Message-IDs of deleted spam messages, so over->max isn't a
good approximation of dedupe progress.
|
|
I found myself tempted to remove this, but it appears impossible
due to odd messages which have multiple Message-IDs.
|
|
Sometimes I just want to dedupe a single Message-ID to test
something, and this lets me do it.
This patch appears to do what it's supposed to. But it also
appears to be finding duplicates that were previously missed.
That's a good thing, but I wish I understood what seems to be
fixed :x
I'm not sure why the previous ExtSearchIdx.pm (blob 357312b8)
was causing messages to be missed, even, and why this patch
seems to fix it... And it's not infinite looping, either.
Anyways, before this patch, "-extindex --dedupe" was taking ~5
min to no-op every message (after the initial full --dedupe run
which took over a day to run). No-op --dedupes now take just
under 2 hours to scan every single cross-posted message for a
no-op dedupe. The initial dedupe took nearly 44 hours on my
system for <https://yhbt.net/lore/all/> due to SATA-2 TLC SSD
latency on 3 gigantic Xapian shards.
Running --dedupe with this change seems to prevent
/BUG\?.*?not deduplicated properly/ stderr messages from being
triggered by View.pm. Current versions of -extindex do not
seem susceptible to introducing duplicates.
|
|
This behaves identically to the lei external "boost" parameter in
prioritizing raw messages for extindex.
Relying exclusively on the config file order doesn't work well
for mirrors since it's impossible to guarantee config file
ordering via grokmirror hooks.
Config file ordering remains the default if boost is
unconfigured, or in case of ties.
Note: I chose the name "boost" rather than "priority" or "rank"
since I always get confused by whether higher or lower numbers
take precedence when it comes to kernel scheduling. "weight" is
also a part of Xapian API terminology, which we currently do not
expose to configuration (but may in the future).
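
The ordering rule can be sketched as a two-key sort. This assumes a
higher boost is preferred and that an unconfigured boost counts as 0;
the record layout and the name by_preference are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical inbox records: `order' is the position in the config
# file; `boost' may be missing when unconfigured.
my @inboxes = (
	{ name => 'a', order => 0, boost => 0 },
	{ name => 'b', order => 1, boost => 10 },
	{ name => 'c', order => 2, boost => 10 },
	{ name => 'd', order => 3 },
);

# Assumption: a higher boost is preferred; ties (and unconfigured
# boosts) fall back to config file order.
sub by_preference {
	my @sorted = sort {
		($b->{boost} // 0) <=> ($a->{boost} // 0) ||
			$a->{order} <=> $b->{order}
	} @_;
	@sorted;
}

print join(' ', map { $_->{name} } by_preference(@inboxes)), "\n";
```

With the sample data this prints "b c a d": the two boosted inboxes
come first in config order, followed by the rest in config order.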
|
|
Complex queries cause SQLite to block readers for longer than
their retry period. For dedupe, it was also preventing us from
making good use of checkpoints due to the query time.
With many deduplications, checkpoints are necessary to maintain
system health due to having too much data piled up.
|
|
There's nothing we can do about misformatted emails and headers
we get from untrusted sources. They're too noisy and those
messages already exist in public-inboxes, anyways, so just
keep things quiet so we can spot real problems more easily.
|