Date | Commit message (Collapse) |
|
Since extindex uses Xapian shards in a similar way to
v2 inboxes, we'll support -xcpdb (reshard+upgrade) and
-compact all the same to give admins tuning+upgrade
options.
|
|
No callers pass an unblessed pathname to index_inbox,
only Inbox object refs.
|
|
This helps avoid errors from script/lei dying on ECONNRESET
when a single lei-daemon is serving all tests when run via
"make check-run".
Instead of using some arbitrary limit, use INT_MAX and let
the kernel clamp it (both Linux and FreeBSD do).
There's no need to call listen() in LEI.pm, either, since
Listener->new takes care of it.
|
|
The underscore variant was never documented and maintaining
the difference between the command-line and internal hash
is not worth it.
|
|
This wasn't wired up properly, but Xapian appears to suffer from
I/O amplification problems as DB shards get larger:
https://lists.xapian.org/pipermail/xapian-discuss/2019-February/009727.html
<23640.32170.703368.841021@y.dockes.com>
Of course, we shouldn't have too many shards, either; because
performance problems with too many shards was the entire reason
extindex was created:
https://lists.xapian.org/pipermail/xapian-discuss/2020-August/009823.html
<20200826064728.GA32239@dcvr>
|
|
Favor oidbin use internally to reduce internal memory traffic.
|
|
Not sure how this wasn't caught, earlier...
|
|
Reduce memory traffic and code, too.
|
|
We'll use 20-byte SHA-1 comparisons instead of 40-byte
hex representations for a minor reduction in memory
traffic.
|
|
The over.msgid table may contain ghost Message-IDs and also
Message-IDs of deleted spam messages, so over->max isn't a
good aproproximation of dedupe progress.
|
|
I found myself tempted to remove this, but it appears impossible
due to odd messages which have multiple Message-IDs.
|
|
Sometimes I just want to dedupe a single Message-ID to test
something, and this lets me do it.
This patch appears to do what its supposed to. But it also
appears to be finding duplicates that were previously missed.
That's a good thing, but I wish I understood what seems to be
fixed :x
I'm not sure why the previous ExtSearchIdx.pm (blob 357312b8)
was causing messages to be missed, even, and why this patch
seems to fix it... And it's not infinite looping, either.
Anyways, before this patch, "-extindex --dedupe" was taking ~5
min to no-op every message (after the initial full --dedupe run
which took over a day to run). No-op --dedupes now take just
under 2 hours to scan every single cross-posted message for a
no-op dedupe. The initial dedupe took nearly 44 hours on my
system for <https://yhbt.net/lore/all/> due to SATA-2 TLC SSD
latency on 3 gigantic Xapian shards.
Running --dedupe with this change seems to prevent
/BUG\?.*?not deduplicated properly/ stderr messages from being
triggered by View.pm. Current versions of -extindex do not
seem susceptible to introducing duplicates.
|
|
Pretty trivial since it just invokes "git-config". It's mainly
intended to make shell completion easier.
|
|
SQLite COUNT() is a slow operation that does a full table scan
with no conditions. There's no need for it, since lei dedupe
only needs to know if it's empty or not to decide between
new/ and cur/ for Maildir outputs.
|
|
This makes behavior less surprising on restarts as we no longer
lose state on restarts, so there's no need to manually run "lei
add-watch" to re-enable watches. This also allows us to
transparently handle changes if somebody edits the lei config
file directly or via git-config(1).
|
|
This allows lei to automatically note keyword (message flag)
changes made to a Maildir and propagate it into lei/store:
lei add-watch --state=tag-ro /path/to/Maildir
This doesn't persist across restarts, yet. In the future,
it will be applied automatically to "lei q" output Maildirs
by default (with an option to disable it).
State values of tag-rw, index-<ro|rw>, import-<ro|rw> will all
be supported for Maildir.
This represents a fairly major internal change that's fairly
intrusive, but the whole daemon-oriented design was to
facilitate being able to automatically monitor (and propagate)
Maildir/IMAP flag changes.
|
|
This behaves identically the lei external "boost" parameter in
prioritizing raw messages for extindex.
Relying exclusively on the config file order doesn't work well
for mirrors since it's impossible to guarantee config file
ordering via grokmirror hooks.
Config file ordering remains the default if boost is
unconfigured, or in case of ties.
Note: I chose the name "boost" rather than "priority" or "rank"
since I always get confused by whether higher or lower numbers
take precedence when it comes to kernel scheduling. "weight" is
also a part of Xapian API terminology, which we currently do not
expose to configuration (but may in the future).
|
|
We'll be using this in lei for watch configs.
|
|
Complex queries causes SQLite to block readers for longer than
their retry period. For dedupe, it was also preventing us from
making good use of checkpoints due to the query time.
With many deduplications, checkpoints are necessary to maintain
system health due to having too much data piled up.
|
|
There's nothing we can do about misformatted emails and headers
we get from untrusted sources. They're too noisy and those
messages already exist in public-inboxes, anyways, so just
keep things quiet so we can spot real problems more easily.
|
|
Xapian shard cleanup only requires read-only access to
over.sqlite3, so avoid opening it with read-write access since
create_tables will hit lock conflicts on "INSERT OR IGNORE"
statements.
|
|
This is intended to fix older indices that had deduplication
bugs for matching content. It'll also make dealing with
future changes to ContentHash easier since that's never
guaranteed stable.
It also supports --dry-run to print changes only without
making them.
|
|
These seem needed with the data I'm currently working on, but I
haven't changed my version of Email::Address::XS since my last
Debian stable upgrade (to buster).
|
|
Sometimes a user will be bored waiting for a command to finish,
so ensure we drop disconnect workers in this case.
|
|
IMAP flag-only synchronization doesn't fetch entire messages,
so we can safely bump the batch size iff a user specified one
for full messages to 10000 times that.
Since I sometimes wonder why nothing happens for several seconds
after starting "lei import $URL", we'll also show some progress
during the flag synchronization phase.
|
|
It's the most generic name I could find for it since it can
mean so many things...
|
|
I haven't found any bugs from this (still looking for missed
deduplication bugs), and it's a bit shorter and more likely to
catch future bugs. Clean up an unnecessary ->{mid} array copy
while we're at it, too.
|
|
Using this to track down deduplication failures in -extindex...
|
|
All commands which output non-trivial amounts of data to
the terminal should support this.
|
|
This avoids errors from git in case -extindex gets invoked in
parallel.
|
|
It's possible for these to exist and git can (or may eventually)
take advantage of them to speed up functionality which affects
us.
|
|
This default seems closer to reasonable on 64-bit systems which
are the norm these days. 32-bit systems gain 48K so it's an
even 1 MB, but we need to keep 32-bit systems from using too
much since there's still some ancient systems out there with
small inboxes.
|
|
ManifestJsGz->response was not invoking the new "url_filter"
method properly. Furthermore, fix url_filter for returning 404
responses.
Reported-by: Kyle Meyer <kyle@kyleam.com>
Link: https://public-inbox.org/meta/87fsx3128a.fsf@kyleam.com/
Fixes: 520be116e8a686cb ("www_listing: start updating for pagination + search")
|
|
This is a fair amount of complexity, but it speeds up
"git cat-file --batch" startup by 3-4% with 50K packfiles
with a hot kernel cache.
This appears extremely sensitive to RAM available to
the kernel page cache with my SATA 2 SSD. Faster storage
and more RAM can bring loading pack.
2.60s vs 2.69s were the best cases on my workstation with and
without the multi-pack-index, however times could be all over
the place (even in the minutes) with more activity on my
workstation.
Getting sub-minute times requires a git patch to speed up
alt_odb_usable():
<https://lore.kernel.org/20210624005806.12079-1-e@80x24.org/>
Otherwise, prepare to wait several minutes.
|
|
WwwListing and ManifestJsGz may be too different nowadays to
be worth the code sharing between them.
Update some comments and note we still needs better tests :x
Fixes: 520be116e8a686cb ("www_listing: start updating for pagination + search")
|
|
We have git_sha() nowadays that's used everywhere, so avoid
process spawning overhead for "git hash-object".
|
|
While both git and libgit2 take around 16 minutes to load 100K
alternates there's already a proposed patch to make git faster:
<https://lore.kernel.org/git/20210624005806.12079-1-e@80x24.org/>
It's also easier to patch and install git locally since the
git.git build system defaults to prefix=$HOME and dealing with
dynamic linking with libgit2 is more difficult for end users
relying on Inline::C.
libgit2 remains in use for the non-ALL.git case, but maybe it's
not necessary (libgit2 is significantly slower than git in
Debian 10 due to SHA-1 collision checking).
|
|
Sometimes users (or bots) may lead queries with '&' and
trigger uninitialized variable warnings, just ignore them
and give consumers a $ctx->{qp}->{''} entry.
While we're in the area, pass a regexp rather than scalar string
to the `split' perlop to prevent Perl from recompiling the
regexp on every call.
|
|
When dealing with thousands of inboxes, displaying all of
them on a single page isn't going to work. So steal some
pagination and search results code from the message search
to generate some basic HTML output that looks good in w3m.
|
|
This allows us to simplify callers throughout, and exceptions are
can no longer be silently hidden. MiscSearch now uses xap_terms
for looking up eidx_key terms for a code reduction.
We also simplify LeiStore->_msg_kw for runtime use by moving the
MsetIterator handling into t/lei_store.t test case.
|
|
This is for consistency with the open() at initial accept, in
case we hit a code path which expects Perl directory handles
rather than "file handles". Both work with the chdir() perlop
(fchdir(2), in our case).
|
|
I've found it's very helpful for large IMAP folders.
|
|
%INC can hold undef. This can be hit on a Linux machine missing
Linux::Inotify2. Loading PublicInbox::KQNotify is attempted and
PublicInbox/KQNotify.pm always exists, causing the `undef' entry
in %INC when it fails to load IO::KQueue.
$ perl -MData::Dumper -I lib \
-E 'eval { require PublicInbox::KQNotify }; say Dumper(\%INC)'
|
|
There appears to be some cases of duplicates appearing due to
-extindex. I haven't nailed down the cause of it, yet, but
this should make things easier for readers using the PSGI
HTML interface in the meantime.
The raw mboxrd remains undeduplicated for now, and the
correct fix/workaround would be some fsck-like mode for
public-inbox-extindex.
|
|
We'll be supporting inotify directly as we do with epoll so so
Linux users won't have to deal with XS, extra DSOs or install
Linux::Inotify2 (and common::sense) modules.
|
|
Simplify oid2docid and filter out undefined docids in ->add_eml,
instead. This avoids SQLite "datatype mismatch" errors in
OverIdx->add_over
Fixes: d1052f03ea85d4af ("lei/store: cull redundant docids based on blob OID")
|
|
I'm not sure how this happened (only once for me in March), but
it should not happen... In any case, we'll operate on the
lowest numbered docid and cull redundant index entries when
lei/store is open for read-write.
This also fixes the normal lei/store removal path to clean up
the xref3 table (since it's not done automatically for
public-facing -eidx due to the multi-list nature of it).
|
|
This will simplify upcoming code for watches.
|
|
"num:" is useful for inspecting Inbox-ish directories, while
"docid:" can be used for any Xapian DB (not just stuff managed
by our code).
|
|
Since users can't set IMAP flags in read-only IMAP folders,
we won't clobber local flags when importing from IMAP. This
also enables the local_blob fallback used for lei-index to
be used for index deduplication.
|